Joaquim P. Marques de Sá

CHANCE

The Life of Games & the Game of Life

With 105 Figures and 10 Tables


Prof. Dr. Joaquim P. Marques de Sá

Original Portuguese edition © Gradiva 2006

ISBN 978-3-540-74416-0 e-ISBN 978-3-540-74417-7

DOI 10.1007/978-3-540-74417-7

Library of Congress Control Number: 2007941508

© Springer-Verlag Berlin Heidelberg 2008

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig, Germany

Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig, Germany

Cover design: eStudio Calamar S.L., F. Steinen-Broo, Girona, Spain

Cartoons by: Luís Gonçalves and Chris Madden

Printed on acid-free paper

9 8 7 6 5 4 3 2 1

springer.com

Preface

Our lives are immersed in a sea of chance. Everyone’s existence is

a meeting point of a multitude of accidents. The origin of the word

‘chance’ is usually traced back to the vulgar Latin word ‘cadentia’,

meaning a befalling by fortuitous circumstances, with no knowable or

determinable causes. The Roman philosopher Cicero clearly expressed

the idea of ‘chance’ in his work De Divinatione:

For we do not apply the words ‘chance’, ‘luck’, ‘accident’ or

‘casualty’ except to an event which has so occurred or happened

that it either might not have occurred at all, or might have

occurred in any other way. 2.VI.15.

For if a thing that is going to happen, may happen in one way

or another, indiﬀerently, chance is predominant; but things that

happen by chance cannot be certain. 2.IX.24.

In a certain sense chance is the spice of life. If there were no phenomena

with unforeseeable outcomes, phenomena with an element of chance,

all temporal cause–eﬀect sequences would be completely deterministic.

In this way, with suﬃcient information, the events of our daily lives

would be totally predictable, whether it be the time of arrival of the

train tomorrow or the precise nature of what one will be doing at 5.30

pm on 1 April three years from now. All games of chance, such as dice-

throwing games for example, would no longer be worth playing! Every

player would be able to control beforehand, deterministically, how many

points to obtain at each throw. Playing the stock markets would no

longer be hazardous. Instead, it would be a mere contractual operation

between the players. Every experimental measurement would produce

an exact result and as a consequence we would be far more advanced in our knowledge of the laws of the universe. To put it briefly, we would

be engaged in a long and tedious walk of eternal predictability. Of

course, there would also be some pleasant consequences. Car accidents

would no longer take place, many deaths would be avoided by early

therapeutic action, and economics would become an exact science.

In its role as the spice of life, chance is sweetness and bitterness,

fortune and ruin. This double nature drove our ancestors to take it as

a divine property, as the visible and whimsical emanation of the will

of the gods. This imaginary will, supposedly conditioning every event,

was called sors or sortis by the Romans. This word for fatalistic luck is

present in the Latin languages (‘sorte’ in Portuguese, ‘suerte’ in Spanish, and ‘sort’ in French). But it was not only the Romans and peoples

of other ancient civilizations who attributed this divine, supernatural

feeling to this notion of chance. Who has never had the impression that

they are going through a wave of misfortune, with the feeling that the

‘wave’ has something supernatural about it (while in contrast a wave

of good fortune is often overlooked)? Being of divine origin, chance was

not considered a valid subject of study by the sages of ancient civiliza-

tions. After all, what would be the sense in studying divine designs?

Phenomena with an element of chance were simply used as a means of

probing the will of the gods. Indeed, dice were thrown and entrails were

inspected with this very aim of discerning the will of the gods. These

were then the only reasonable experiments with chance. Even today

fortune-telling by palm reading and tarot are a manifestation of this

long-lived belief that chance oﬀers a supernatural way of discerning the

divine will.

Besides being immersed in a sea of chances, we are also immersed

in a sea of regularity. But much of this regularity is intimately related

to chance. In fact, there are laws of order in the laws of chance, whose

discovery was initiated when several illustrious minds tried to clarify

certain features that had always appeared obscure in popular games.

Human intellect then initiated a brilliant journey that has led from

probability theory to many important and modern ﬁelds, like statistical

learning theory, for example. Along the way, a vast body of knowledge

and a battery of techniques have been developed, including statistics,

which has become an indispensable tool both in scientiﬁc research and

in the planning of social activities. In this area of application, the laws

of chance phenomena that were discovered have also provided us with

an adequate way to deal with the risks involved in human activities

and decisions, making our lives safer, even though the number of risk factors and accidents in our lives is constantly on the increase. One

need only think of the role played by insurance companies.

This book provides a quick and almost chronological tour along the

long road to understanding chance phenomena, picking out the most

relevant milestones. One might think that in seven thousand years of

history mankind would already have solved practically all the problems

arising from the study, analysis and interpretation of chance phenomena. However, many milestones have only recently been passed and

a good number of other issues remain open. Among the many roads of

scientific discovery, perhaps none is so abundant in amazing and counterintuitive results. We shall see a wide range of examples to illustrate

this throughout the book, many of which relate to important practical

issues.

In writing the book I have tried to reduce to a minimum the mathematical prerequisites assumed of the reader. However, a few notes are included at the end of the book to provide a quick reference concerning certain topics for the sake of a better understanding. I also include

some bibliographic references to popular science papers or papers that

are not too technically demanding.

The idea of writing this book came from my work in areas where the

inﬂuence of chance phenomena is important. Many topics and examples

treated in the book arose from discussions with members of my research

team. Putting this reﬂection into writing has been an exhilarating task,

supported by the understanding of my wife and son.

Porto, J.P. Marques de Sá

October 2007

Contents

1 Probabilities and Games of Chance . . . . . . . . . . . . . . . . . . 1

1.1 Illustrious Minds Play Dice . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 The Classic Notion of Probability . . . . . . . . . . . . . . . . . . . . 2

1.3 A Few Basic Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Testing the Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.5 The Frequency Notion of Probability . . . . . . . . . . . . . . . . . 11

1.6 Counting Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.7 Games of Chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.7.1 Toto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.7.2 Lotto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.7.3 Poker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.7.4 Slot Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

1.7.5 Roulette . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 Amazing Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.1 Conditional Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Experimental Conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Independent Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.4 A Very Special Reverend . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.5 Probabilities in Daily Life . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.6 Challenging the Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.6.1 The Problem of the Sons . . . . . . . . . . . . . . . . . . . . . . 36

2.6.2 The Problem of the Aces . . . . . . . . . . . . . . . . . . . . . . 37


2.6.3 The Problem of the Birthdays . . . . . . . . . . . . . . . . . . 39

2.6.4 The Problem of the Right Door . . . . . . . . . . . . . . . . 40

2.6.5 The Problem of Encounters . . . . . . . . . . . . . . . . . . . . 41

3 Expecting to Win . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.1 Mathematical Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2 The Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 48

3.3 Bad Luck Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.4 The Two-Envelope Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.5 The Saint Petersburg Paradox . . . . . . . . . . . . . . . . . . . . . . . 54

3.6 When the Improbable Happens . . . . . . . . . . . . . . . . . . . . . . 57

3.7 Paranormal Coincidences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.8 The Chevalier de Méré Problem . . . . . . . . . . . . . . . . . . . . . . 60

3.9 Martingales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

3.10 How Not to Lose Much in Games of Chance . . . . . . . . . . . 63

4 The Wonderful Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.1 Approximating the Binomial Law . . . . . . . . . . . . . . . . . . . . 67

4.2 Errors and the Bell Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.3 A Continuum of Chances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.4 Buﬀon’s Needle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.6 Normal and Binomial Distributions . . . . . . . . . . . . . . . . . . . 78

4.7 Distribution of the Arithmetic Mean . . . . . . . . . . . . . . . . . . 79

4.8 The Law of Large Numbers Revisited . . . . . . . . . . . . . . . . . 82

4.9 Surveys and Polls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.10 The Ubiquity of Normality . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.11 A False Recipe for Winning the Lotto. . . . . . . . . . . . . . . . . 89

5 Probable Inferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.1 Observations and Experiments . . . . . . . . . . . . . . . . . . . . . . . 93

5.2 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.3 Chance Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5.4 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5 Signiﬁcant Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.6 Correlation in Everyday Life . . . . . . . . . . . . . . . . . . . . . . . . . 102


6 Fortune and Ruin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.1 The Random Walk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.2 Wild Rides . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.3 The Strange Arcsine Law . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.4 The Chancier the Better . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.5 Averages Without Expectation. . . . . . . . . . . . . . . . . . . . . . . 114

6.6 Borel’s Normal Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

6.7 The Strong Law of Large Numbers . . . . . . . . . . . . . . . . . . . 120

6.8 The Law of Iterated Logarithms . . . . . . . . . . . . . . . . . . . . . 121

7 The Nature of Chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.1 Chance and Determinism. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.2 Quantum Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

7.3 A Planetary Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

7.4 Almost Sure Interest Rates . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7.5 Populations in Crisis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7.6 An Innocent Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

7.7 Chance in Natural Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 146

7.8 Chance in Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

8 Noisy Irregularities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

8.1 Generators and Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

8.2 Judging Randomness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

8.3 Random-Looking Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 157

8.4 Quantum Generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

8.5 When Nothing Else Looks the Same . . . . . . . . . . . . . . . . . . 160

8.6 White Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

8.7 Randomness with Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8.8 Noises and More Than That . . . . . . . . . . . . . . . . . . . . . . . . . 169

8.9 The Fractality of Randomness . . . . . . . . . . . . . . . . . . . . . . . 171

9 Chance and Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

9.1 The Guessing Game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

9.2 Information and Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

9.3 The Ehrenfest Dogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181


9.4 Random Texts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

9.5 Sequence Entropies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

9.6 Algorithmic Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

9.7 The Undecidability of Randomness . . . . . . . . . . . . . . . . . . . 194

9.8 A Ghost Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

9.9 Occam’s Razor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

10 Living with Chance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

10.1 Learning, in Spite of Chance . . . . . . . . . . . . . . . . . . . . . . . . . 199

10.2 Learning Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

10.3 Impossible Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

10.4 To Learn Is to Generalize . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

10.5 The Learning of Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

10.6 Chance and Determinism: A Never-Ending Story . . . . . . . 213

Some Mathematical Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

A.1 Powers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

A.2 Exponential Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

A.3 Logarithm Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

A.4 Factorial Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218

A.5 Sinusoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

A.6 Binary Number System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

1 Probabilities and Games of Chance

1.1 Illustrious Minds Play Dice

A well-known Latin saying is the famous alea jacta est (the die has been

cast), attributed to Julius Caesar when he took the decision to cross the small river Rubicon with his legions, a frontier landmark between

the Italic Republic and Cisalpine Gaul. The decision taken by Julius

Caesar amounted to invading the territory (the Italic Republic) assigned to another triumvir, with whom he shared Roman power. Civil war was

an almost certain consequence of such an act, and this was in fact what

happened. Faced with the serious consequences of his decision, Caesar

found it wise before opting for this line of action to allow the divine

will to manifest itself by throwing dice. Certain objects appropriately

selected to generate unpredictable outcomes, one might say ‘chance’

generators, such as astragals (specific animal bones) and dice with various shapes, have been used since time immemorial either to implore

the gods in a neutral manner, with no human bias, to manifest their

will, or purely for the purposes of entertainment.

The Middle Ages brought about an increasing popularization of dice

games to the point that they became a common entertainment in both

taverns and royal courts. In 1560, the Italian philosopher, physician,

astrologist and mathematician Gerolamo Cardano (1501–1576), author

of the monumental work Artis Magnae Sive de Regulis Algebraicis (The

Great Art or the Algebraic Rules) and co-inventor of complex numbers,

wrote a treatise on dice games, Liber de Ludo Aleae (Book of Dice

Games), published posthumously in 1663, in which he introduced the

concept of probability of obtaining a speciﬁc die face, although he did

not explicitly use the word ‘probability’. He also wove together several

ideas on how to obtain speciﬁc face combinations when throwing dice.


Nowadays, Gerolamo Cardano (known as Jérôme Cardan in French) is better remembered for one of his inventions: the cardan joint (originally used to keep ship compasses horizontal).

Dice provided a readily accessible and easy means to study the ‘rules

of chance’, since the number of possible outcomes (chances) in a game

with one or two dice is small and perfectly known. On the other hand,

the incentive for ﬁguring out the best strategy for a player to adopt

when betting was high. It should be no surprise then that several mathematicians were questioned by players interested in the topic. Around

1600, Galileo Galilei (1564–1642) wrote a book entitled Sopra le Scoperte dei Dadi (Analysis of Dice Games), where among other topics

he analyses the problem of how to obtain a certain number of points

when throwing three dice. The works of Cardano and Galileo, besides

presenting some counting techniques, also reveal concepts such as event

equiprobability (equal likelihood) and expected gain in a game.

Sometime around 1654, Blaise Pascal (1623–1662) and Pierre de Fermat (1601–1665) exchanged letters discussing the mathematical

analysis of problems related to dice games and in particular a series of

tough problems raised by the Chevalier de Méré, a writer and nobleman from the court of the French king Louis XIV, who was interested in the

mathematical analysis of games. As this correspondence unfolds, Pascal introduces the classic concept of probability as the ratio of the number of favorable outcomes to the number of possible outcomes.

A little later, in 1657, the Dutch physicist Christiaan Huygens (1629–1695) published his book De Ratiociniis in Ludo Aleae (The Theory of

Dice Games), in which he provides a systematic discussion of all the

results concerning dice games as they were understood in his day. In

this way, a humble object that had been used for at least two thousand

years for religious or entertainment purposes inspired human thought

to understand chance (or random) phenomena. It also led to the name

given to random phenomena in the Latin languages: ‘aleatório’ in Portuguese and ‘aléatoire’ in French, from the Latin word ‘alea’ for die.

1.2 The Classic Notion of Probability

Although the notion of probability – as a degree-of-certainty measure

associated with a random phenomenon – reaches back to the works of

Cardano, Pascal and Huygens, the word ‘probability’ itself was only

used for the first time by Jacob (James) Bernoulli (1654–1705), professor of mathematics in Basel and one of the great scientific personalities of his day, in his fundamental work Ars Conjectandi (The Art of Conjecture), published posthumously in 1713, in which a mathematical

theory of probability is presented. In 1812, the French mathematician Pierre-Simon de Laplace (1749–1827) published his Théorie Analytique des Probabilités, in which the classic definition of probability is stated explicitly:

Pour étudier un phénomène, il faut réduire tous les événements du même type à un certain nombre de cas également possibles, et alors la probabilité d’un événement donné est une fraction, dont le numérateur représente le nombre de cas favorables à l’événement et dont le dénominateur représente par contre le nombre des cas possibles.

To study a phenomenon, one must reduce all events of the same

type to a certain number of equally possible cases, and then

the probability is a fraction whose numerator represents the

number of cases favorable to the event and whose denominator

represents the number of possible cases.

Therefore, in order to determine the probability of a given outcome

(also called ‘case’ or ‘event’) of a random phenomenon, one must ﬁrst

reduce the phenomenon to a set of elementary and equally probable

outcomes or ‘cases’. One then proceeds to determine the ratio between

the number of cases favorable to the event and the total number of possible cases. Denoting the probability of an event by P(event), this gives

P(event) = (number of favorable cases) / (number of possible cases) .

Let us assume that in the throw of a die we wish to determine the

probability of obtaining a certain face (showing upwards). There are

6 possible outcomes or elementary events (the six faces), which we represent by the set {1, 2, 3, 4, 5, 6}. These are referred to as elementary

since they cannot be decomposed into other events. Moreover, assuming

that we are dealing with a fair die, i.e., one that has not been tampered

with, and also that the throws are performed in a fair manner, any of

the six faces is equally probable. If the event that interests us is ‘face 5

turned up’, since there is only one face with this value (one favorable

event), we have P(face 5 turned up) = P(5) = 1/6, and likewise for the

other faces. Interpreting probability values as percentages, we may say

that there is 16.7% certainty that any previously speciﬁed die face will

turn up.

Fig. 1.1. Probability value of a given face turning up when throwing a fair die

Figure 1.1 displays the various probability values P for a fair die and throws, the so-called probability function of a fair die. The sum of all probabilities is 6 × (1/6) = 1, in accord with the idea that there

is 100% certainty that one of the faces will turn up (a sure event).

The fairness of chance-generating devices, such as dice, coins, cards,

roulettes, and so on, and the fairness of their use, will be discussed

later on. For the time being we shall just accept that the device configuration and its use will not favor any particular elementary event.

This is henceforth our basic assumption unless otherwise stated. When

tossing a coin up in the air (a chance decision-maker much used in

football matches), the set of possible results is {heads, tails}. For a fair

coin, tossed in a fair way, we have P(heads) = P(tails) = 1/2. When

extracting a playing card at random from a 52-card deck, the probability of a card with a specific value and suit turning up is 1/52. If,

on the other hand, we are not interested in the suit, so that the set of

elementary events is

{ace, king, queen, jack, 10, 9, 8, 7, 6, 5, 4, 3, 2} ,

the probability of a card of a given value turning up is 1/13.

Let us now consider a die with two faces marked with 5, perhaps

as a result of a manufacturing error. In this case the event ‘face 5

turned up’ can occur in two distinct ways. Let us assume that we have

written the letter a on one of the faces with a 5, and b on the other.

The set of equiprobable elementary events is now {1, 2, 3, 4, 5a, 5b} (and

not {1, 2, 3, 4, 5}). Hence, P(5) = 2/6 = 1/3, although 1/6 is still the

probability for each of the remaining faces and the sum of all elementary

probabilities is still of course equal to 1.
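Laplace's ratio is easy to check by direct enumeration. The sketch below (in Python; the helper name and the face labels are mine, not the book's) lists the equally probable elementary events of a fair die and of the misprinted die with two faces marked 5:

```python
from fractions import Fraction

def classical_probability(favorable, possible):
    """Classic definition: number of favorable cases over number of possible cases."""
    return Fraction(len(favorable), len(possible))

# A fair die: six equally probable elementary events.
fair_die = [1, 2, 3, 4, 5, 6]
print(classical_probability([5], fair_die))          # 1/6

# The misprinted die: both faces marked 5 must be listed separately,
# because the elementary events have to be equally probable.
misprinted_die = [1, 2, 3, 4, "5a", "5b"]
fives = ["5a", "5b"]
print(classical_probability(fives, misprinted_die))  # 1/3
```

Using exact fractions rather than floating-point numbers keeps results such as 1/6 and 1/3 exact, just as in the text.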

1.3 A Few Basic Rules

The great merit of the probability measure lies in the fact that, with

only a few basic rules directly deducible from the classic deﬁnition, one


can compute the probability of any event composed of ﬁnitely many

elementary events. The quantiﬁcation of chance then becomes a feasible

task. For instance, in the case of a die, the event ‘face value higher than

4’ corresponds to putting together (ﬁnding the union of) the elementary

events ‘face 5’ and ‘face 6’, that is:

face value higher than 4 = face 5 or face 6 = {5, 6} .

Note the word ‘or’ indicating the union of the two events. What is the

value of P({5, 6})? Since the number of favorable cases is 2, we have

P(face value higher than 4) = P({5, 6}) = 2/6 = 1/3 .

It is also obvious that the degree of certainty associated with {5, 6}

must be the sum of the individual degrees of certainty. Hence,

P({5, 6}) = P(5) + P(6) = 1/6 + 1/6 = 1/3 .

The probability of the union of elementary events computed as the sum

of the respective probabilities is then conﬁrmed. In the same way,

P(even face) = P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2 ,

P(any face) = 6 × 1/6 = 1 .

The event ‘any face’ is the sure event in the die-throwing experiment,

that is, the event that always happens. The probability reaches its

highest value for the sure event, namely 100% certainty.

Let us now consider the event ‘face value lower than or equal to 4’.

We have

P(face value lower than or equal to 4) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 4/6 = 2/3 .

But one readily notices that ‘face value lower than or equal to 4’ is the

opposite (or negation) of ‘face value higher than 4’. The event ‘face

value lower than or equal to 4’ is called the complement of ‘face value

higher than 4’ relative to the sure event. Given two complementary

events denoted by A and Ā, where Ā means the complement of A, if we know the probability of one of them, say P(A), the probability of the other is obtained by subtracting it from 1 (the probability of the certain event): P(Ā) = 1 − P(A). This rule is a direct consequence of the fact that the sum of cases favorable to A and Ā is the total number of possible cases. Therefore,

P(face value less than or equal to 4) = 1 − P(face value higher than 4) = 1 − 1/3 = 2/3 ,

which is the result found above.

From the rule about the complement, one concludes that the probability of the complement of the sure event is 1 − P(sure event) = 1 − 1 = 0, the minimum probability value. Such an event is called the impossible event; in the case of the die, this is the event ‘no face turns up’.

Let us now consider the event ‘even face and less than or equal to

4’. The probability of this event, composed of events ‘even face’ and

‘face smaller than or equal to 4’, is obtained by the product of the

probabilities of the individual events. Thus,

P(even face and less than or equal to 4) = P(even face) × P(face less than or equal to 4) = 1/2 × 2/3 = 1/3 .

As a matter of fact there are two possibilities out of six of obtaining

a face that is both even and less than or equal to 4 in value, the two

possibilities forming the set {2, 4}. Note the word ‘and’ indicating event

composition (or intersection).
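The word ‘and’ as set intersection can also be checked by counting rather than multiplying. A short sketch (Python; the variable names are mine) shows that for these two die events the direct count {2, 4} and the product of probabilities agree, something that, as the text notes, will not hold for every pair of events:

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
at_most_4 = {1, 2, 3, 4}

def p(event):
    # Classic probability: favorable faces over the six possible faces.
    return Fraction(len(event), len(die))

both = even & at_most_4                   # 'and' corresponds to set intersection
print(sorted(both))                       # [2, 4]
print(p(both))                            # 1/3
print(p(both) == p(even) * p(at_most_4))  # True for this particular pair
```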

These rules are easily visualized by the simple ploy of drawing a rectangle corresponding to the set of all elementary events, and representing the latter by specific domains within it. Such drawings are called Venn diagrams (after the nineteenth century English mathematician John Venn). Figure 1.2 shows diagrams corresponding to the previous examples, clearly illustrating the consistency of the classic definition.

For instance, the intersection in Fig. 1.2c can be seen as taking away

half of certainty (hatched area in the ﬁgure) from something that had

only two thirds of certainty (gray area in the same ﬁgure), thus yielding

one third of certainty.

Fig. 1.2. Venn diagrams. (a) Union (gray area) of face 5 with face 6. (b) Complement (gray area) of ‘face 5 or face 6’. (c) Intersection (gray hatched area) of ‘even face’ (hatched area) with ‘face value less than or equal to 4’ (gray area)

Finally, let us consider one more problem. Suppose we wish to determine the probability of ‘even face or less than or equal to 4’ turning up. We are thus dealing with the event corresponding to the union of

the following two events: ‘even face’, ‘face smaller than or equal to 4’.

(What we discussed previously was the intersection of these events,

whereas we are now considering their union.) Applying the addition

rule, we have

P(even face or less than or equal to 4) = P(even face) + P(face less than or equal to 4) = 1/2 + 2/3 = 7/6 > 1 .

We have a surprising and clearly incorrect result, greater than 1! What

is going on? Let us take a look at the Venn diagram in Fig. 1.2c. The

union of the two events corresponds to all squares that are hatched,

gray or simultaneously hatched and gray. There are ﬁve. We now see

the cause of the problem: we have counted some of the squares twice,

namely those that are simultaneously hatched and gray, i.e., the two

squares in the intersection. We thus obtained the incorrect result 7/6

instead of 5/6. We must, therefore, modify the addition rule for two

events A and B in the following way:

P(A or B) = P(A) + P(B) − P(A and B) .

Let us recalculate in the above example:

P(even face or less than or equal to 4) = P(even face) + P(face less than or equal to 4) − P(even face and less than or equal to 4) = 1/2 + 2/3 − 2/6 = 5/6 .

In the ﬁrst example of a union of events, when we added the probabili-

ties of elementary events, we did not have to subtract the probability of


the intersection because there was no intersection, i.e., the intersection

was empty, and thus had null probability, being the impossible event.

Whenever two events have an empty intersection they are called dis-

joint and the probability of their union is the sum of the probabilities.

For instance,

P(even or odd face) = P(even face) +P(odd face) = 1 .

In summary, the classic deﬁnition of probability aﬀords the means to

measure the ‘degree of certainty’ of a random phenomenon on a con-

venient scale from 0 to 1 (or from 0 to 100% certainty), obeying the

following rules in accordance with common sense:

• P(sure event) = 1 ,

• P(impossible event) = 0 ,

• P(complement event of A) = 1 −P(A) ,

• P(A and B) = P(A) ×P(B) (rule to be revised later),

• P(A or B) = P(A) +P(B) −P(A and B) .
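All five rules can be checked mechanically by counting favorable cases over the six equally likely faces of a die. The following is an illustrative Python sketch (the helper `p` and the event names are ours, not from the text):

```python
from fractions import Fraction

faces = range(1, 7)  # the six equally likely elementary events of a fair die

def p(event):
    # classic definition: favorable cases over possible cases
    return Fraction(sum(1 for f in faces if event(f)), len(faces))

even = lambda f: f % 2 == 0
le4 = lambda f: f <= 4

assert p(lambda f: True) == 1                   # sure event
assert p(lambda f: False) == 0                  # impossible event
assert p(lambda f: not even(f)) == 1 - p(even)  # complement rule
# addition rule, with the intersection subtracted:
assert p(lambda f: even(f) or le4(f)) == p(even) + p(le4) - p(lambda f: even(f) and le4(f))
print(p(lambda f: even(f) or le4(f)))  # 5/6
```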

1.4 Testing the Rules

There is nothing special about dice games, except that they may origi-

nally have inspired the study of random phenomena. So let us test our

ability to apply the rules of probability to other random phenomena.

We will often talk about tossing coins in this book. As a matter of

fact, everything we know about chance can be explained by reference to

coin tossing! The ﬁrst question is: what is the probability of heads (or

tails) turning up when (fairly) tossing a (fair) coin in the air? Since both

events are equally probable, we have P(heads) = P(tails) = 1/2 (50%

probability). It is diﬃcult to believe that there should be anything more

to say about coin tossing. However, as we shall see later, the hardest

concepts of randomness can be studied using coin-tossing experiments.

Let us now consider a more mundane situation: the subject of fam-

ilies with two children. Assuming that the probability of a boy being

born is the same as that of a girl being born, what is the probability of

a family with two children having at least one boy? Denoting the birth

of a boy by M and the birth of a girl by F, there are 4 elementary

events:

{MM, MF, FM, FF} ,

where MF means that ﬁrst a boy is born, followed later by a girl,


and similarly for the other cases. If the probability of a boy being

born is the same as that for a girl, each of these elementary events

has probability 1/4. In fact, we can look at these events as if they

were event intersections. For instance, the event MF can be seen as the

intersection of the events ‘ﬁrst a boy is born’ and ‘second a girl is born’.

MF is called a compound event . Applying the intersection rule, we get

P(MF) = P(M) × P(F) = 1/2 × 1/2 = 1/4. The probability we wish

to determine is then

P(at least one boy) = P(MM) +P(MF) +P(FM) = 3/4 .

A common mistake in probability computation consists in not correctly

representing the set of elementary events. For example, take the follow-

ing argument. For a family with two children there are three possibil-

ities: both boys; both girls; one boy and one girl. Only two of these

three possibilities correspond to having at least one boy. Therefore, the

probability is 2/3. The mistake here consists precisely in not taking

into account the fact that ‘one boy and one girl’ corresponds to two

equiprobable elementary events: MF and FM.
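The same bookkeeping can be done by listing the four elementary events explicitly. This illustrative Python fragment also reproduces the faulty 'three possibilities' argument:

```python
from itertools import product

# the four equiprobable elementary events for a two-child family
families = list(product("MF", repeat=2))
at_least_one_boy = [fam for fam in families if "M" in fam]
print(len(at_least_one_boy), "/", len(families))  # 3 / 4

# the faulty argument collapses MF and FM into a single 'one of each' case:
unordered = {tuple(sorted(fam)) for fam in families}
print(len(unordered))  # 3 outcomes -- but they are not equiprobable
```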

Let us return to the dice. Suppose we throw two dice. What is

the probability that the sum of the faces is 6? Once again, we are

dealing with compound experiments. Let us name the dice as die 1

and die 2. For any face of die 1 that turns up, any of the six faces

of die 2 may turn up. There are therefore 6 × 6 = 36 possible and

equally probable outcomes (assuming fair dice and throwing). Only

the following amongst them favor the event that interests us here, viz.,

sum equal to 6: {15, 24, 33, 42, 51} (where 15 means die 1 gives 1 and

die 2 gives 5 and likewise for the other cases). Therefore,

P(sum of the faces is 6) = 5/36 .
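Enumerating the 36 outcomes of the compound experiment confirms the count; a possible sketch:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # 36 equally probable pairs
favourable = [(a, b) for a, b in outcomes if a + b == 6]
print(favourable)                          # [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]
print(len(favourable), "/", len(outcomes))  # 5 / 36
```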

The number of throws needed to obtain a six with one die or two sixes with two dice was a subject that interested the previously cited Chevalier de Méré. He bet that a six would turn up with one die in four throws and that two sixes would turn up with two dice in 24 throws. He noted, however, that he constantly lost when applying this strategy with two dice and asked Pascal to help him explain why. Let us then see the explanation of the De Méré paradox, which constitutes a good illustration of the complement rule. The probability of not obtaining a 6 (or any other previously chosen face, for that matter) in a one-die throw is

P(no 6) = 1 − 1/6 = 5/6 .


Therefore, the probability of not obtaining at least one 6 in four throws

is given by the intersection rule:

P(no 6 in 4 throws)
= P(no 6 in first throw) × P(no 6 in second throw)
× P(no 6 in third throw) × P(no 6 in fourth throw)
= 5/6 × 5/6 × 5/6 × 5/6 = (5/6)^4 .

Hence, by the complement rule,

P(at least one 6 in 4 throws) = 1 − P(no 6 in 4 throws) = 1 − (5/6)^4 = 0.51775 .

We thus conclude that the probability of obtaining at least one 6 in

four throws is higher than 50%, justifying a bet on this event as a good

tactic. Let us list the set of elementary events corresponding to the four

throws:

{1111, 1112, . . . , 1116, 1121, . . . , 1126, . . . , 1161, . . . , 1166, . . . , 6666} .

It is a set with a considerable number of elements, in fact, 1296. To

count the events where a 6 appears at least once is not an easy task.

We should have to count the events where only one six turns up (e.g.,

event 1126), as well as the events where two, three or four sixes appear.

Fig. 1.3. Putting the laws of chance to the test

We saw above how, by thinking in terms of the complement of the event we are interested in, we were able to avoid the difficulties inherent in such counting problems. Let us apply this same technique to the two dice:

P(no double 6 with two dice) = 1 − 1/36 = 35/36 ,

P(no double 6 in 24 throws) = (35/36)^24 ,

P(at least one double 6 in 24 throws) = 1 − (35/36)^24 = 0.49141 .

We obtain a probability lower than 50%: betting on a double 6 in 24

two-die throws is not a good policy. One can thus understand the losses

incurred by Chevalier de Méré!
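Both of de Méré's bets can be recomputed in a couple of lines via the complement rule; a quick check of the figures above:

```python
# de Mere's two bets, via the complement rule
p_six_in_4 = 1 - (5 / 6) ** 4              # at least one 6 in 4 throws of one die
p_double_six_in_24 = 1 - (35 / 36) ** 24   # at least one double 6 in 24 throws of two dice

print(f"{p_six_in_4:.5f}")  # 0.51775
print(f"{p_double_six_in_24:.5f}")
assert p_six_in_4 > 0.5 > p_double_six_in_24  # the first bet wins, the second loses
```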

1.5 The Frequency Notion of Probability

The notions and rules introduced in the previous sections assumed fair

dice, fair coins, equal probability of a boy or girl being born, and so on.

That is, they assumed speciﬁc probability values (in fact, equiproba-

bility) for the elementary events. One may wonder whether such values

can be reasonably assumed in normal practice. For instance, a fair coin

assumes a symmetrical mass distribution around its center of gravity.

In normal practice it is impossible to guarantee such a distribution.

As to the fair throw, even without assuming a professional gambler's deliberately introduced flick of the wrist, it is known that

some faulty handling can occur in practice, e.g., the coin sticking to

a hand that is not perfectly clean, a coin rolling only along one direc-

tion, and so on. In reality it seems that the tossing of a coin in the

air is usually biased, even for a fair coin that is perfectly symmetri-

cal and balanced. As a matter of fact, the American mathematician

Persi Diaconis and his collaborators have shown, in a paper published

in 2004, that a fair tossing of a coin can only take place when the im-

pulse (given with the ﬁnger) is exactly applied on a diameter line of

the coin. If there is a deviation from this position there is also a prob-

ability higher than 50% that the face turned up by the coin is the

same as (or the opposite of) the one shown initially. For these rea-

sons, instead of assuming more or less idealized conditions, one would


like, sometimes, to deal with true values of probabilities of elementary

events.

The French mathematician Abraham De Moivre (1667–1754), in his

work The Doctrine of Chance: A Method of Calculating the Probabil-

ities of Events in Play, published in London in 1718, introduced the

frequency or frequentist theory of probability. Suppose we wish to de-

termine the probability of a 5 turning up when we throw a die in the

air. Then we can do the following: repeat the die-throwing experiment

a certain number of times denoted by n, let k be the number of times

that face 5 turns up, and compute the ratio

f = k/n .

The value of f is known as the frequency of occurrence of the relevant

event in n repetitions of the experiment, in this case, the die-throwing

experiment.

Let us return to the birth of a boy or a girl. There are natality statis-

tics in every country, from which one can obtain the birth frequencies

of boys and girls. For instance, in Portugal, out of 113 384 births occur-

ring in 1998, 58 506 were of male sex. Therefore, the male frequency of

birth was f = 58 506/113 384 = 0.516. The female frequency of birth

was the complement, viz., 1 −0.516 = 0.484.

One can use the frequency of an event as an empirical probability

measure, a counterpart of the ideal situation of the classic deﬁnition.

If the event always occurs (sure event), then k is equal to n and f = 1.

In the opposite situation, if the event never occurs (impossible event),

k is equal to 0 and f = 0. Thus, the frequency varies over the same

range as the probability.

Let us now take an event such as ‘face 5 turns up’ when throwing

a die. As we increase n, we verify that the frequency f tends to wander

with decreasing amplitude around a certain value. Figure 1.4a shows

how the frequencies of face 5 and face 6 might evolve with n in two

distinct series of 2 000 throws of a fair die. Some wanderings are clearly

visible, each time with smaller amplitude as n increases, reaching a rea-

sonable level of stability near the theoretical value of 1/6. Figure 1.4b

shows the evolution of the frequency for a new series of 2 000 throws,

now relative to faces 5 or 6 turning up. Note how the stabilization

takes place near the theoretical value of 1/3. Frequency does therefore

exhibit the additivity required for a probability measure, for suﬃciently

large n.
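A simulated series of 2 000 throws (with Python's pseudo-random generator, seeded here for reproducibility) shows the same stabilization and the additivity of frequencies for disjoint events:

```python
import random

random.seed(7)  # fixed seed so the run is reproducible
n = 2000
throws = [random.randint(1, 6) for _ in range(n)]

f5 = sum(1 for t in throws if t == 5) / n
f6 = sum(1 for t in throws if t == 6) / n
f5_or_6 = sum(1 for t in throws if t in (5, 6)) / n

print(round(f5, 3))       # wanders near the theoretical 1/6 = 0.167
print(round(f5_or_6, 3))  # wanders near 1/3 = 0.333
assert abs(f5_or_6 - (f5 + f6)) < 1e-12  # additivity for disjoint events
```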


Fig. 1.4. Possible frequency curves in die throwing. (a) Face 5 and face 6. (b) Face 5 or 6

Let us assume that for a suﬃciently high value of n, say 2 000 throws,

we obtain f = 0.16 for face 5. We then assign this value to the probabil-

ity of face 5 turning up. This is the value that we evaluate empirically

(or estimate) for P(face 5) based on the 2 000 throws. Computing the

probabilities for the other faces in the same way, let us assume that we

obtain:

P(face 1) = 0.17 , P(face 2) = 0.175 ,

P(face 3) = 0.17 , P(face 4) = 0.165 ,

P(face 5) = 0.16 , P(face 6) = 0.16 .

The elementary events are no longer equiprobable. The die is not fair

(biased die) and has the tendency to turn up low values more frequently

than high values. We also note that, since n is the sum of the counts k for all the faces, the sum of the empirical probabilities is 1.

Figure 1.5 shows the probability function of this biased die. Using

this function we can apply the same basic rules as before. For example,

P(face higher than 4) = P(5) +P(6) = 0.16 + 0.16 = 0.32 ,

P(face lower than or equal to 4) = 1 − P(face higher than 4) = 1 − 0.32 = 0.68 ,

P(even face) = P(2) + P(4) + P(6) = 0.175 + 0.165 + 0.16 = 0.5 ,

P(even face and lower than or equal to 4) = 0.5 × 0.68 = 0.34 .


Fig. 1.5. Probability function of a biased die

In the child birth problem, frequencies of occurrence of the two sexes ob-

tained from suﬃciently large birth statistics can also be used as a prob-

ability measure. Assuming that the probability of boys and girls born

in Portugal is estimated by the previously mentioned frequencies –

P(boy) = 0.516, P(girl) = 0.484 – then, for the problem of families

with two children presented before, we have for Portuguese families

P(at least one boy) = P(MM) +P(MF) +P(FM)

= 0.516 ×0.516 + 0.516 ×0.484 + 0.484 ×0.516

= 0.766 .

Some questions concerning the frequency interpretation of probabilities

will be dealt with later. For instance, when do we consider n suﬃciently

large? For the time being, it is certainly gratifying to know that we can

estimate degrees of certainty of many random phenomena. We make an

exception for those phenomena that do not refer to constant conditions

(e.g., the probability that Chelsea beats Manchester United) or those

that are of a subjective nature (e.g., the probability that I will feel

happy tomorrow).

1.6 Counting Techniques

Up to now we have computed probabilities for relatively simple events

and experiments. We now proceed to a more complicated problem:

computing the probability of winning the ﬁrst prize of the football

pools known as toto in many European countries. A toto entry consists

of a list of 13 football matches for the coming week. Pool entrants have

to select the result of each match, whether it will be a home win, an

away win, or neither of these, typically by marking a cross over each

of the three possibilities. In this case there is a single favorable event

(the right template of crosses) in an event set that we can imagine

1.6 Counting Techniques 15

to have quite a large number of possible events (templates). But how

large is this set? To be able to answer such questions we need to know

some counting techniques. Abraham de Moivre was the ﬁrst to present

a consistent description of such techniques in his book mentioned above.

Let us consider how this works for toto. Each template can be seen as

a list (or sequence) of 13 symbols, and each symbol can have one of

three values: home win, home loss (away win) and draw, that we will

denote, respectively, by W (win), L (loss), and D (draw). A possible

template is thus

WWLDDDWDDDLLW .

How many such templates are there? First, assume that the toto has

only one match. One would then have only three distinct templates,

i.e., as many as there are symbols. Now let us increase the toto to two

matches. Then, for each of the symbols of the ﬁrst match, there are

three diﬀerent templates corresponding to the symbols of the second

match. There are therefore 3 × 3 = 9 templates. Adding one more

match to the toto, the next three symbols can be used with any of

the previous templates, whence we obtain 3 × 3 × 3 templates. This

line of thought can be repeated until one reaches the 13 toto matches,

which then correspond to 3^13 = 1 594 323 templates. This is a rather large number. The probability of winning the toto by chance alone is 1/1 594 323 = 0.000 000 627. In general, the number of lists of n symbols from a set of k symbols is given by k^n.
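The template count is a one-liner; for a toy toto of only 3 matches one can even list all k^n templates explicitly (an illustrative sketch, not from the book):

```python
from itertools import product

symbols = "WLD"  # win, loss, draw

# toy toto with only 3 matches: 3**3 = 27 distinct templates
templates = list(product(symbols, repeat=3))
print(len(templates))  # 27

# the real toto: 13 matches, hence k**n = 3**13 templates
n_templates = 3 ** 13
print(n_templates)     # 1594323
print(1 / n_templates) # first-prize probability, about 6.27e-07
```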

We now consider the problem of randomly inserting 5 diﬀerent let-

ters into 5 envelopes. Each envelope is assumed to correspond to one

and only one of the letters. What is the probability that all envelopes

receive the right letter? Let us name the letters by numbers 1, 2, 3, 4,

5, and let us imagine that the envelopes are placed in a row, with no

visible clues (sender and addressee), as shown in Fig. 1.6.

Figure 1.6 shows one possible letter sequence, namely, 12345. An-

other is 12354. Only one of the possible sequences is the right one. The

question is: how many distinct sequences are there? Take the leftmost

position. There are 5 distinct possibilities for the ﬁrst letter. Consider

one of them, say 3. Now, for the four rightmost positions, letter 3 is

no longer available. (Note the diﬀerence in relation to the toto case:

Fig. 1.6. Letters and envelopes


Table 1.1. Probability P of matching n letters

n    1    2     3      4       5        10             15
P    1    0.5   0.17   0.042   0.0083   0.000 000 28   0.000 000 000 000 78

the fact that a certain symbol was chosen as the ﬁrst element of the

list did not imply, as here, that it could not be used again in other list

positions.) Let us then choose one of the four remaining letters for the

second position, say, 5. Only three letters now remain for the last three

positions. So for the ﬁrst position we had 5 alternatives, and for the

second position we had 4. This line of thought is easily extended until

we reach a total number of alternatives given by

5 × 4 × 3 × 2 × 1 = 120 .

Each of the possible alternatives is called a permutation of the 5 sym-

bols. Here are all the permutations of three symbols:

123 132 213 231 312 321 .

We have 3 × 2 × 1 = 6 permutations. Note how the total number

of permutations grows dramatically when one increases from 3 to 5

symbols. Whereas with 3 letters the probability of a correct match is

around 17%, for 5 letters it is only 8 in one thousand.

In general, the number of ordered sequences (permutations) of n

symbols is given by

n × (n − 1) × (n − 2) × ... × 1 .

This function of n is called the factorial function and written n! (read n

factorial). Table 1.1 shows the fast decrease of probability P of match-

ing n letters, in correspondence with the fast increase of n!. For exam-

ple, for n = 10 there are 10! = 3 628 800 diﬀerent permutations of the

letters. If we tried to match the envelopes, aligning each permutation

at a constant rate of one per second, we should have to wait 42 days

in the worst case (trying all 10! permutations). For 5 letters, 2 minutes

would be enough, but for 15 letters, 41 466 years would be needed!
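Python's standard library can both list permutations and evaluate n!; here is a small sketch of the quantities behind Table 1.1:

```python
import math
from itertools import permutations

# the six permutations of three symbols, in the order listed above
perms = ["".join(p) for p in permutations("123")]
print(perms)  # ['123', '132', '213', '231', '312', '321']

print(math.factorial(5))   # 120 orderings of 5 letters
print(math.factorial(10))  # 3628800 -- the 42-day worst case at one try per second

# probability of matching all n letters in one random insertion: 1/n!
for n in (3, 5, 10):
    print(n, 1 / math.factorial(n))
```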

Finally, let us consider tossing a coin 5 times in a row. The question

now is: what is the probability of heads turning up 3 times? Denote

heads and tails by 1 and 0, respectively. Assume that for each heads–

tails conﬁguration we take note of the position in which heads occurs,

as in the following examples:

Sequence   Heads positions
10011      1, 4, 5
01101      2, 3, 5
10110      1, 3, 4

Taking into account the set of all 5 possible positions, {1, 2, 3, 4, 5}, we thus intend to count the number of distinct subsets that can be formed using only 3 of those 5 symbols. For this purpose, we start by assuming that the list of all 120 permutations of 5 symbols is available and select as subsets the first 3 permutation symbols, as illustrated in the following diagram:

123 45
123 54
132 45
132 54
...
234 15
234 51
...

When we consider a certain subset, say {1, 2, 3}, we are not concerned

with the order in which the symbols appear in the permutations. It is all

the same whether it is 123, 132, 231, or any other of the 3! = 6 permu-

tations. On the other hand, for each sequence of the ﬁrst three symbols,

there are repetitions corresponding to the permutations of the remain-

ing symbols; for instance, 123 is repeated twice, in correspondence with

the 2! = 2 permutations 45 and 54. In summary, the number of subsets

that we can form using 3 of 5 symbols (without paying attention to the

order) is obtained by dividing the 120 permutations of 5 symbols by

the number of repetitions corresponding to the permutations of 3 and

of 5 −3 = 2 symbols. The conclusion is that

number of subsets of 3 out of 5 symbols = 5!/(3! × (5 − 3)!) = 120/(6 × 2) = 10 .

In general, the number of distinct subsets (the order does not matter) of k symbols out of a set of n symbols, denoted C(n, k) and called the number of combinations of n symbols taken k at a time, is given by

C(n, k) = n!/(k! × (n − k)!) .

Let us go back to the problem of the coin tossed 5 times in a row, using

a fair coin with P(1) = P(0) = 1/2. The probability that heads turns up

three times is (1/2)^3. Likewise, the probability that tails turns up twice is (1/2)^2. Therefore, the probability of a given sequence of three heads and two tails is computed as (1/2)^3 × (1/2)^2 = (1/2)^5. Since there are C(5, 3) distinct sequences under those conditions, we have (disjoint events)

P(3 heads in 5 throws) = C(5, 3) × (1/2)^5 = 10/32 = 0.31 .

Fig. 1.7. Probability of obtaining k heads in 5 throws of a fair coin (a) and a biased coin (b)

Figure 1.7a shows the probability function for several k values. Note

that for k = 0 (no heads, i.e., the sequence 00000) and for k = 5 (all

heads, i.e., the sequence 11111) the probability is far lower (0.031) than

when a number of heads near half of the throws turns up (2 or 3).

Let us suppose that the coin was biased, with a probability of 0.65

of turning up heads (and, therefore, 0.35 of turning up tails). We have

P(3 heads in 5 throws) = C(5, 3) × (0.65)^3 × (0.35)^2 = 0.336 .

For this case, as expected, sequences with more heads are favored when

compared with those with more tails, as shown in Fig. 1.7b. For in-

stance, the sequence 00000 has probability 0.005, whereas the sequence

11111 has probability 0.116.
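Both results are instances of the binomial law; a small Python sketch (the function name is ours):

```python
from math import comb

def p_heads(k, n=5, p=0.5):
    # binomial law: C(n, k) placements of the k heads,
    # each single sequence having probability p**k * (1-p)**(n-k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(round(p_heads(3), 4))          # 0.3125 = 10/32, the fair coin
print(round(p_heads(3, p=0.65), 3))  # 0.336, the biased coin of Fig. 1.7b
print(round(sum(p_heads(k) for k in range(6)), 3))  # 1.0: the probabilities sum to one
```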

1.7 Games of Chance

Games of the kind played in the casino – roulette, craps, several card

games and, in more recent times, slot machines and the like – together

with any kind of national lottery are commonly called games of chance.


All these games involve a chance phenomenon that is only very rarely

favorable to the player. Maybe this is the reason why in some European

languages, for example, Portuguese and Spanish, these games are called

games of bad luck, using for ‘bad luck’ a word that translates literally

to the English word ‘hazard’. The French also use ‘jeux de hasard’ to

refer to games of chance. This word, meaning risk or danger, comes

from the Arabic az-zahar, which in its turn comes from the Persian az

zar, meaning dice game.

Now that we have learned the basic rules for operating with proba-

bilities, as well as some counting techniques, we will be able to under-

stand the bad luck involved in games of chance in an objective manner.

1.7.1 Toto

We have already seen that the probability of a full template hit in

the football toto is 0.000 000 627. In order to determine the probability

of other prizes corresponding to 11 or 12 correct hits one only has to

determine the number of diﬀerent ways in which one can obtain the 11

or 12 correct hits out of the 13. This amounts to counting subsets of 11

or 12 symbols (the positions of the hits) of the 13 available positions. We

have seen that such counts are given by C(13, 12) and C(13, 11), respectively.

Hence,

probability of 13 hits = (1/3)^13 = 0.000 000 627 ,
probability of 12 hits = C(13, 12) × (1/3)^13 = 0.000 008 15 ,
probability of 11 hits = C(13, 11) × (1/3)^13 = 0.000 048 92 .

In this example and the following, the reader must take into account the

fact that the computed probabilities correspond to idealized models of

reality. For instance, in the case of toto, we are assuming that the choice

of template is made entirely at random with all 1 594 323 templates

being equally likely. This is not usually entirely true.

As a comparison, the probability of a first toto prize is about twice

as big as the probability of a correct random insertion of 10 letters in

their respective envelopes.


1.7.2 Lotto

The lotto or lottery is a popular form of gambling that often corre-

sponds to betting on a subset of numbers. Let us consider the lotto

that runs in several European countries consisting of betting on a sub-

set of six numbers out of 49 (from 1 through 49). We already know how

to compute the number of distinct subsets of 6 out of 49 numbers:

C(49, 6) = 49!/(6! × 43!) = (49 × 48 × ... × 1)/((6 × 5 × ...) × (43 × 42 × ...)) = 13 983 816 .

Therefore, the probability of winning the ﬁrst prize is

P(first prize: 6 hits) = 1/13 983 816 = 0.000 000 071 5 ,

smaller than the probability of a ﬁrst prize in the toto.

Let us now compute the probability of the third prize corresponding

to hitting 5 numbers. There are C(6, 5) = 6 different ways of selecting
**

ﬁve numbers out of six, and the sixth number of the bet can be any

of the remaining 43 non-winning numbers. Therefore, the probability

of winning the third prize is 6 × 43 = 258 times greater than the

probability of winning the ﬁrst prize, that is,

P(third prize: 5 hits) = 258 ×0.000 000 071 5 = 0.000 018 4 .

The second prize corresponds to ﬁve hits plus a hit on a supplementary

number, which can be any of the remaining 43 numbers. Thus, the

probability of winning the second prize is 43 times smaller than that

of the third prize:

P(second prize: 5 hits + 1 supplementary hit) = 6 ×0.000 000 071 5

= 0.000 000 429 .

In the same way one can compute

P(fourth prize: 4 hits) = C(6, 4) × C(43, 2) × 0.000 000 071 5
= 15 × 903 × 0.000 000 071 5 = 0.000 969 ,

P(fifth prize: 3 hits) = C(6, 3) × C(43, 3) × 0.000 000 071 5
= 20 × 12 341 × 0.000 000 071 5 = 0.017 65 .


The probability of winning the ﬁfth prize is quite a lot higher than the

probability of winning the other prizes and nearly twice as high as the

probability of matching 5 letters to their respective envelopes.
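All the lotto figures follow from `math.comb`; an illustrative recomputation:

```python
from math import comb

total = comb(49, 6)  # number of possible bets
print(total)         # 13983816
p1 = 1 / total       # first prize: 6 hits

third = comb(6, 5) * 43 * p1            # 5 hits: 258 times the first prize
second = comb(6, 5) * p1                # 5 hits + the supplementary number
fourth = comb(6, 4) * comb(43, 2) * p1  # 4 hits
fifth = comb(6, 3) * comb(43, 3) * p1   # 3 hits
print(round(fifth, 5))  # 0.01765
```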

1.7.3 Poker

Poker is played with a 52-card deck (13 cards in each of 4 suits). A player's

hand is composed of 5 cards. The game involves a betting system that

guides the action of the players (card substitution, staying in the game

or passing, and so on) according to a certain strategy depending on the

hand of cards and the behavior of the other players. It is not, therefore,

a game that depends only on chance. Here, we will simply analyze the

chance element, computing the probabilities of obtaining speciﬁc hands

when the cards are dealt at the beginning of the game.

First of all, note that there are

C(52, 5) = 52!/(5! × 47!) = 2 598 960 different 5-card hands .

Let us now consider the hand that is the most diﬃcult to obtain, the

royal ﬂush, consisting of an ace, king, queen, jack and ten, all of the

same suit. Since there are four suits, the probability of randomly draw-

ing a royal ﬂush (or any previously speciﬁed 5-card sequence for that

matter), is given by

P(royal flush) = 4/2 598 960 = 0.000 001 5 ,

almost six times smaller than the probability of winning a second prize in the toto.

Let us now look at the probability of obtaining the hand of lowest

value, containing a simple pair of cards of the same value. We denote

the hand by AABCD, where AA corresponds to the equal-value pair of

cards. Think, for instance, of a pair of aces, which can be chosen out of the 4 aces in C(4, 2) = 6 different ways. As there are 13 distinct values, the pair of cards can be chosen in 6 × 13 = 78 different ways. The remaining three cards of the hand will be drawn out of the remaining 12 values, with each card being of any of the 4 suits. There are C(12, 3) combinations


of the values of the three cards. Since each of them can come from any

of the 4 suits, we have

C(12, 3) × 4 × 4 × 4 = 14 080

diﬀerent possibilities. Thus,

P(pair) = 78 × 14 080/2 598 960 = 0.423 .

That is, a pair is almost as probable as getting heads (or tails) when

tossing a coin. Using the same counting techniques, one can work out

the list of probabilities for all interesting poker hands as follows:

Pair             AABCD                                          0.423
Two pairs        AABBC                                          0.048
Three-of-a-kind  AAABC                                          0.021
Straight         5 successive values                            0.0039
Flush            5 cards of the same suit                       0.0020
Full house       AAABB                                          0.0014
Four-of-a-kind   AAAAB                                          0.000 24
Straight flush   Straight + flush                               0.000 015
Royal flush      Ace, king, queen, jack, ten of the same suit   0.000 001 5
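The pair probability can be recomputed from the counting argument above; a sketch using `math.comb`:

```python
from math import comb

hands = comb(52, 5)  # 2 598 960 possible 5-card hands
# a pair: 13 values for the pair, C(4, 2) suit choices for it,
# C(12, 3) value choices for the other three cards, 4 suits each
pairs = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3
print(hands)                    # 2598960
print(pairs)                    # 1098240
print(round(pairs / hands, 3))  # 0.423
```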

1.7.4 Slot Machines

The ﬁrst slot machines (invented in the USA in 1897) had 3 wheels,

each with 10 symbols. The wheels were set turning until they stopped

in a certain sequence of symbols. The number of possible sequences was

therefore 10×10×10 = 1000. Thus, the probability of obtaining a spe-

ciﬁc winning sequence (the jackpot) was 0.001. Later on, the number of

symbols was raised to 20, and afterwards to 22 (in 1970), corresponding,

in this last case, to a 0.000 093 914 probability of winning the jackpot.

The number of wheels was also increased and slot machines with re-

peated symbols made their appearance. Everything was done to make

it more diﬃcult (or even impossible!) to obtain certain prizes.

Let us look at the slot machine shown in Fig. 1.8, which as a matter

of fact was actually built and came on the market with the following

list of prizes (in dollars):


Fig. 1.8. Wheels of a particularly infamous slot machine

Pair (of jacks or better) $ 0.05

Two pairs $ 0.10

Three-of-a-kind $ 0.15

Straight $ 0.25

Flush $ 0.30

Full house $ 0.50

Four-of-a-kind $ 1.00

Straight ﬂush $ 2.50

Royal ﬂush $ 5.00

The conﬁguration of the ﬁve wheels is such that the royal ﬂush can

never show up, although the player may have the illusion when looking

at the spinning wheels that it is actually possible. Even the straight

ﬂush only occurs for one single combination: 7, 8, 9, 10 and jack of

diamonds. The number of possible combinations is 15^5 = 759 375. The


number of favorable combinations for each winning sequence and the

respective probabilities are computed as:

Pair (of jacks or better) 68 612 0.090 353

Two pairs 36 978 0.048 695

Three-of-a-kind 16 804 0.022 129

Straight 2 396 0.003 155

Flush 1 715 0.002 258

Full house 506 0.000 666

Four-of-a-kind 60 0.000 079

Straight ﬂush 1 0.000 001

Royal ﬂush 0 0
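Dividing the favorable counts in the table by 15^5 reproduces the listed probabilities; a quick check (counts copied from the table above):

```python
# favorable counts for each winning sequence, copied from the table above
counts = {
    "pair": 68612, "two pairs": 36978, "three-of-a-kind": 16804,
    "straight": 2396, "flush": 1715, "full house": 506,
    "four-of-a-kind": 60, "straight flush": 1, "royal flush": 0,
}
total = 15 ** 5  # five wheels of 15 symbols each
print(total)     # 759375
for hand, k in counts.items():
    print(hand, round(k / total, 6))
```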

We observe that the probabilities of these ‘poker’ sequences for the pair

and high-valued sequences are far lower than their true poker counter-

parts.

Present-day slot machines use electronic circuitry, with the final position of each wheel given by a randomly generated number. The probabilities of 'bad luck' are similar.

1.7.5 Roulette

The roulette consists of a sort of dish having on its rim 37 or 38 hemi-

spherical pockets. A ball is thrown into the spinning dish and moves

around until it ﬁnally settles in one of the pockets and the dish (the

roulette) stops. The construction and the way of operating the roulette

are presumed to be such that each throw can be considered as a ran-

dom phenomenon with equiprobable outcomes for all pockets. These

are numbered from 0. In European roulette, the numbering goes from

0 to 36. American roulette has an extra pocket marked 00.

In the simplest bet the player wins if the bet is on the number that

turns up by spinning the roulette, with the exception of 0 or 00 which

are numbers reserved for the house. Thus, in European roulette the

probability of hitting a certain number is 1/37 = 0.027 (better than

getting three-of-a-kind at poker), which does not look so bad, especially

when one realises that the player can place bets on several numbers (as

well as on even or odd numbers, numbers marked red or black, and

so on). However, we shall see later that even in ‘games of bad luck’

whose elementary probabilities are not that ‘bad’, prizes and fees are

arranged in such a way that in a long sequence of bets the player will

lose, while maintaining the illusion that good luck really does lie just

around the corner!

2 Amazing Conditions

2.1 Conditional Events

The results obtained in the last chapter, using the basic rules for op-

erating with probabilities, were in perfect agreement with common

sense. From the present chapter onwards, many surprising results will

emerge, starting with the inclusion of such a humble ingredient as the

assignment of conditions to events. Consider once again throwing a die

and the events ‘face less than or equal to 4’ and ‘even face’. In the

last chapter we computed P(face less than or equal to 4) = 2/3 and

P(even face) = 1/2. One might ask: what is the probability that an

even face turns up if we know that a face less than or equal to 4 has

turned up? Previously, when referring to the even-face event, there was

no prior condition, or information conditioning that event. Now, there

is one condition: we know that ‘face less than or equal to 4’ has turned

up. Thus, out of the four possible elementary events corresponding to

‘face less than or equal to 4’, we enumerate those that are even. There

are two: face 2 and face 4. Therefore,

P(even face if face less than or equal to 4) = 2/4 = 1/2 .

Looking at Fig. 1.2c of the last chapter, we see that the desired prob-

ability is computed by dividing the cases corresponding to the event

intersection (hatched gray squares) by the cases corresponding to the

condition (gray squares). Brieﬂy, from the point of view of the classic

deﬁnition, it is as though we have moved to a new set of elementary

events corresponding to compliance with the conditioning event, viz.,

{1, 2, 3, 4}.


At the same time notice that, for this example the condition did not

inﬂuence the probability of ‘even face’:

P(even face if face less than or equal to 4) = P(even face) .

This equality does not always hold. For instance, it is an easy task to

check that

P(even face if face less than or equal to 5) = 2/5 ≠ P(even face) .

From the frequentist point of view, and denoting the events of the

random experiment simply by A and B, we may write

P(A if B) ≈ k_{A and B} / k_B ,

where k_{A and B} and k_B are, respectively, the number of occurrences of

‘A and B’ and ‘B’ when the random experiment is repeated n times

with n suﬃciently large. The symbol ≈ means ‘approximately equal

to’. It is a well known fact that one may divide both terms of a fraction

by the same number without changing its value. Therefore, we rewrite

P(A if B) ≈ k_{A and B} / k_B = (k_{A and B}/n) / (k_B/n) ,

justifying, from the frequentist point of view, the so-called conditional

probability formula:

P(A if B) = P(A and B) / P(B) .

For the above example where the random experiment consists of throw-

ing a die, the formula corresponds to dividing 2/6 by 4/6, whence the

above result of 1/2. Suppose we throw a die a large number of times,

say 10 000 times. In this case, about two thirds of the times we expect

‘face less than or equal to 4’ to turn up. Imagine that it turned up 6 505

times. From this number of times, let us say that ‘even face’ turned up

3 260 times. Then the frequency of ‘even face if face less than or equal

to 4’ would be computed as 3 260/6 505, close to 1/2. In other trials

the numbers of occurrences may be diﬀerent, but the ﬁnal result, for

suﬃciently large n, will always be close to 1/2.
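The frequentist reading of this formula is easy to check numerically. The following sketch (plain Python, not part of the original text) simulates a large number of die throws and estimates the conditional relative frequency, just as in the 10 000-throw thought experiment above:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

n = 100_000
k_b = 0          # occurrences of 'face less than or equal to 4'
k_a_and_b = 0    # occurrences of 'even face AND face less than or equal to 4'
for _ in range(n):
    face = random.randint(1, 6)
    if face <= 4:
        k_b += 1
        if face % 2 == 0:
            k_a_and_b += 1

estimate = k_a_and_b / k_b   # frequentist P(even face if face <= 4)
print(round(estimate, 2))    # close to 1/2
```

For sufficiently large n the ratio settles near 1/2, while k_b/n itself settles near 2/3, in agreement with the text.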

We now look at the problem of determining the probability of ‘face

less than or equal to 4 if even face’. Applying the above formula we get

P(face less than or equal to 4 if even face) = P(face less than or equal to 4 and even face) / P(even face) = (2/6) / (3/6) = 2/3 .

Thus, changing the order of the conditioning event will in general pro-

duce diﬀerent probabilities. Note, however, that the intersection is (al-

ways) commutative: P(A and B) = P(B and A). In our example:

P(A and B) = P(A if B)P(B) = 1/2 × 2/3 = 1/3 ,

P(B and A) = P(B if A)P(A) = 2/3 × 1/2 = 1/3 .

From now on we shall simplify notation by omitting the multiplication

symbol when no possibility of confusion arises.

2.2 Experimental Conditioning

The literature on probability theory is full of examples in which balls

are randomly extracted from urns. Consider an urn containing 3 white

balls, 2 red balls and 1 black ball. The probabilities of randomly ex-

tracting one speciﬁc-color ball are

P(white) = 1/2 , P(red) = 1/3 , P(black) = 1/6 .

If, whenever we extract a ball, we put it back in the urn – extraction

with replacement – these probabilities will remain unchanged in the fol-

lowing extractions. If, on the other hand, we do not put the ball back

in the urn – extraction without replacement – the experimental con-

ditions for ball extraction are changed and we obtain new probability

values. For instance:

• If a white ball comes out in the ﬁrst extraction, then in the following

extraction

P(white) = 2/5 , P(red) = 2/5 , P(black) = 1/5 .

• If a red ball comes out in the ﬁrst extraction, then in the following

extraction:

P(white) = 3/5 , P(red) = 1/5 , P(black) = 1/5 .


• If a black ball comes out in the ﬁrst extraction, then in the following

extraction:

P(white) = 3/5 , P(red) = 2/5 , P(black) = 0 .

As we can see, the conditions under which a random experiment takes

place may drastically change the probabilities of the various events

involved, as a consequence of event conditioning.
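The updated urn probabilities can be reproduced exactly with a few lines of code. This is a sketch of our own (the helper name urn_probs is not from the book), using exact fractions:

```python
from fractions import Fraction

def urn_probs(counts):
    """Exact probability of each color on the next draw, given current counts."""
    total = sum(counts.values())
    return {color: Fraction(k, total) for color, k in counts.items()}

urn = {'white': 3, 'red': 2, 'black': 1}
print(urn_probs(urn))   # white 1/2, red 1/3, black 1/6

urn['white'] -= 1       # a white ball was drawn and not replaced
print(urn_probs(urn))   # white 2/5, red 2/5, black 1/5
```

Decrementing a different color reproduces the other two cases listed above.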

2.3 Independent Events

Let us now consider throwing two dice, one white and the other green,

and suppose someone asks the following question: what is the proba-

bility that an even face turns up on the white die if an even face turns

up on the green die? Obviously, if the dice-throwing is fair, the face

that turns up on the green die has no inﬂuence whatsoever on the face

that turns up on the white die, and vice versa. The events ‘even face

on the white die’ and ‘even face on the green die’ do not interfere. The

same can be said about any other pair of events, one of which refers to

the white die while the other refers to the green die. We say that the

events are independent. Applying the preceding rule, we may work this

out in detail as

P(even face of white die if even face of green die) = P(even face of white die and even face of green die) / P(even face of green die) = [(1/2) × (1/2)] / (1/2) = 1/2 ,

which is precisely the probability of ‘even face of the white die’. In

conclusion, the condition ‘even face of the green die’ has no inﬂuence

on the probability value for the other die. Consider now the question:

what is the probability of an even face of the white die turning up if

at least one even face turns up (on either of the dice or on both)? We

ﬁrst notice that ‘at least one even face turns up’ is the complement of

‘no even face turns up’, or in other words, both faces are odd, which

may occur 3 ×3 = 9 times. Therefore,

P(even face of the white die if at least one even face) = P(even face of the white die and at least one even face) / P(at least one even face) = (3 × 6/36) / ((36 − 9)/36) = 18/27 = 2/3 .


We now come to the conclusion that the events ‘even face of the white

die’ and ‘at least one even face’ are not independent. Consequently,

they are said to be dependent. In the previous example of ball extraction

from an urn, the successive extractions are independent if performed

with replacement and dependent if performed without replacement.

For independent events, P(A if B) = P(A) and P(B if A) = P(B).

Let us go back to the conditional probability formula:

P(A if B) = P(A and B) / P(B) , or P(A and B) = P(A if B)P(B) .

If the events are independent one may replace P(A if B) by P(A)

and obtain P(A and B) = P(A)P(B). Thus, the probability formula

presented in Chap. 1 for event intersection is only applicable when the

events are independent. In general, if A, B, . . . , Z are independent,

P(A and B and . . . and Z) = P(A) × P(B) × . . . × P(Z) .
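Both two-dice conditionals above can be verified by enumerating the 36 equally likely outcomes. A brief sketch (the helper p is our own naming, not from the book):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # (white, green): 36 equally likely pairs

def p(event, given=lambda o: True):
    """Classic-definition probability of `event` among the outcomes satisfying `given`."""
    cond = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in cond if event(o)), len(cond))

even_white = lambda o: o[0] % 2 == 0
even_green = lambda o: o[1] % 2 == 0
any_even = lambda o: even_white(o) or even_green(o)

print(p(even_white, even_green))   # 1/2: the green die carries no information
print(p(even_white, any_even))     # 2/3: the events are dependent
```

The first conditional equals the unconditional P(even face of white die) = 1/2, confirming independence; the second does not.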

2.4 A Very Special Reverend

Sometimes an event can be obtained in distinct ways. For instance, in

the die-throwing experiment let us stipulate that event A means ‘face

less than or equal to 5’. This event may occur when either the event

B = ‘even face’ or its complement B̄ = ‘odd face’ has occurred. We

readily compute

P(A and B) = 2/6 , P(A and B̄) = 3/6 .

But the union of the disjoint events ‘A and B’ and ‘A and B̄’ is obviously

just A. Applying the addition rule for disjoint events we obtain, without

much surprise,

P(face less than or equal to 5) = P(A and B) + P(A and B̄) = 2/6 + 3/6 = 5/6 .

In certain circumstances, we have to evaluate probabilities of inter-

sections, P(A and B) and P(A and B̄), on the basis of conditional

probabilities. Consider two urns, named X and Y , containing white

and black balls. Imagine that X has 4 white and 5 black balls and Y

has 3 white and 6 black balls. One of the urns is randomly chosen and


afterwards a ball is randomly drawn out of that urn. What is the prob-

ability that a white ball is drawn? Using the preceding line of thought,

we compute:

P(white ball) = P(white ball and urn X) +P(white ball and urn Y )

= P(white ball if urn X) ×P(urn X)

+P(white ball if urn Y ) ×P(urn Y )

= 4/9 × 1/2 + 3/9 × 1/2 = 7/18 ≈ 0.39 .

In the book entitled The Doctrine of Chances by Abraham de Moivre,

mentioned in the last chapter, reference was made to the urn inverse

problem, that is: What is the probability that the ball came from urn

X if it turned out to be white? The solution to the inverse problem was

ﬁrst discussed in a short document written by the Reverend Thomas

Bayes (1702–1761), an English priest interested in philosophical and

mathematical problems. In Bayes’ work, entitled Essay Towards Solv-

ing a Problem in the Doctrine of Chances, and published posthumously

in 1763, there appears the famous Bayes’ theorem that would so sig-

niﬁcantly inﬂuence the future development of probability theory. Let

us see what Bayes’ theorem tells us in the context of the urn problem.

We have seen that

P(white ball) = P(white ball if urn X) ×P(urn X)

+P(white ball if urn Y ) ×P(urn Y ) .

Now Bayes’ theorem allows us to write

P(urn X if white ball) = P(urn X) × P(white ball if urn X) / P(white ball) .

Note how the inverse conditioning probability P(urn X if white ball)

depends on the direct conditioning probability P(white ball if urn X).

In fact, it corresponds simply to dividing the ‘urn X’ term of its

decomposition, expressed in the formula that we knew already, by

P(white ball). Applying the concrete values of the example, we obtain

P(urn X if white ball) = (4/9) × (1/2) / (7/18) = 4/7 .

In the same way,

P(urn Y if white ball) = (3/9) × (1/2) / (7/18) = 3/7 .


Of course, the two probabilities add up to 1, as they should (the ball

comes either from urn X or from urn Y ). But care should be taken

here: the two conditional probabilities for the white ball do not have

to add up to 1. In fact, in this example, we have

P(white ball if urn X) + P(white ball if urn Y) = 7/9 .

In the event ‘urn X if white ball’, we may consider that the drawing

out of the white ball is the eﬀect corresponding to the ‘urn X’ cause.

Bayes’ theorem is often referred to as a theorem about the probabilities

of causes (once the eﬀects are known), and written as follows:

P(cause if effect) = P(cause) × P(effect if cause) / P(effect) .
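In code, the two-urn computation and its Bayesian inversion take only a few lines. A sketch with exact fractions (variable names are ours):

```python
from fractions import Fraction

prior = {'X': Fraction(1, 2), 'Y': Fraction(1, 2)}        # random urn choice
likelihood = {'X': Fraction(4, 9), 'Y': Fraction(3, 9)}   # P(white ball if urn)

# total probability of the effect
p_white = sum(prior[u] * likelihood[u] for u in prior)
print(p_white)   # 7/18

# Bayes' theorem: probability of each cause given the effect
posterior = {u: prior[u] * likelihood[u] / p_white for u in prior}
print(posterior['X'], posterior['Y'])   # 4/7 3/7
```

As the text notes, the two posteriors add up to 1, while the two likelihoods need not.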

Let us examine a practical example. Consider the detection of breast

cancer by means of a mammography. The mammogram (like many

other clinical analyses) is not infallible. In reality, long studies have

shown that (frequentist estimates)

P(positive mammogram if breast cancer) = 0.9 ,

P(positive mammogram if no breast cancer) = 0.1 .

Let us assume the random selection of a woman in a given population

of a country where it is known that (again frequentist estimates)

P(breast cancer) = 0.01 .

Using Bayes’ theorem we then compute:

P(breast cancer if positive mammogram) = 0.01 × 0.9 / (0.01 × 0.9 + 0.99 × 0.1) ≈ 0.08 .

This probability can be considered low in terms of clinical diagnosis,

raising reasonable doubt about the isolated use of the mammogram

(as for many other clinical analyses for that matter). Let us now assume

that the mammogram is prescribed following a medical consultation.

The situation is then quite diﬀerent: the candidate is no longer a ran-

domly selected woman from the population of a country, but a woman

from the female population that seeks a speciﬁc medical consultation

(undoubtedly because she has found that something is not as it should

be). Now the probability of the cause (the so-called prevalence) is dif-

ferent. Suppose we have determined (frequentist estimate)


P(breast cancer in women seeking specific consultation) = 0.1 .

Using Bayes’ theorem, we now obtain

P(breast cancer if positive mammogram) = 0.1 × 0.9 / (0.1 × 0.9 + 0.9 × 0.1) = 0.5 .

Observe how the prevalence of the cause has dramatic consequences

when we infer the probability of the cause assuming that a certain

eﬀect is veriﬁed.
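A small helper makes the role of the prevalence explicit. This is a sketch of our own (the function name bayes is not from the book), assuming a binary cause and a binary test:

```python
def bayes(prior, p_pos_if_cause, p_pos_if_no_cause):
    """P(cause if positive test), from the prevalence and the test characteristics."""
    evidence = prior * p_pos_if_cause + (1 - prior) * p_pos_if_no_cause
    return prior * p_pos_if_cause / evidence

# randomly selected woman: prevalence 0.01
print(round(bayes(0.01, 0.9, 0.1), 2))   # 0.08
# woman seeking the specific consultation: prevalence 0.1
print(round(bayes(0.10, 0.9, 0.1), 2))   # 0.5
```

Only the prior changes between the two calls, yet the posterior jumps from 8% to 50%.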

Let us look at another example, namely the use of DNA tests in

assessing the innocence or guilt of someone accused of a crime. Imagine

that the DNA of the suspect matches the DNA found at the crime scene,

and furthermore that the probability of this happening by pure chance

is extremely low, say, one in a million, whence

P(DNA match if innocent) = 0.000 001 .

But,

P(DNA match if guilty) = 1 ,

another situation where the sum of the conditional probabilities is not

1. Suppose that in principle there is no evidence favoring either the

innocence or the guilt (no one is innocent or guilty until there is proof),

that is, P(innocent) = P(guilty) = 0.5. Then,

P(innocent if DNA match) = 0.000 001 × 0.5 / (0.000 001 × 0.5 + 1 × 0.5) ≈ 0.000 001 .

In this case the DNA evidence is decisive. In another scenario let us

suppose that, given the circumstances of the crime and the suspect,

there are good reasons for presuming innocence. Let us say that, for

the suspect to be guilty, one would require a whole set of circumstances

that we estimate to occur only once in a hundred thousand times, that

is, in principle, P(innocent) = 0.999 99. Then,

P(innocent if DNA match) = 0.000 001 × 0.999 99 / (0.000 001 × 0.999 99 + 1 × 0.000 01) ≈ 0.09 .

In this case, in spite of the DNA evidence, the presumption of innocence

is still considerable (about one case in eleven).
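The same computation for the DNA scenario, again as a sketch with a function name of our own, taking P(DNA match if guilty) = 1 as in the text:

```python
def p_innocent_if_match(prior_innocent, p_match_if_innocent=1e-6):
    # P(DNA match if guilty) is taken to be 1, as in the text
    num = p_match_if_innocent * prior_innocent
    return num / (num + 1.0 * (1 - prior_innocent))

print(p_innocent_if_match(0.5))                 # about 0.000 001: evidence is decisive
print(round(p_innocent_if_match(0.99999), 2))   # 0.09: presumption of innocence survives
```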


Fig. 2.1. The risks of entering the P(D if C) clinic

2.5 Probabilities in Daily Life

The normal citizen often has an inaccurate perception of probabilities

in everyday life. In fact, several studies have shown that the probability

notion is one of the hardest to grasp. For instance, if after ﬂipping a coin

that turned up heads, we ask a child what will happen in the following

ﬂip, the most probable answer is that the child will opt for tails. The

adult will have no diﬃculty in giving the correct answer to this simple

question (that is, either heads or tails will turn up, meaning that both

events are equally likely), but if the question gets more complicated an

adult will easily fall into the so-called gambler’s fallacy. An example of

this is imagining that one should place a high bet on a certain roulette

number that has not come up for a long time.

Let us illustrate the fallacy with the situation where a coin has

been ﬂipped several times and tails has been turning up each time.

Most gamblers (and not only gamblers) then tend to consider that the

probability of heads turning up in the following ﬂips will be higher

than before. The same applies to other games. Who has never said

something like: 30 is bound to come now; it’s time for black to show

up; the 6 is hot; the odds are now on my side, and so on? It looks as

if one would deem the information from previous outcomes as useful

in determining the following outcome; in other words, it is as if there

were some kind of memory in a sequence of trials. In fact, rigorously


conducted experiments have even shown that the normal citizen has

the tendency to believe that there is a sort of ‘memory’ phenomenon

in chance devices. In this spirit, if a coin has been showing tails, why

not let it ‘rest’ for a while before playing again? Or change the coin?

Suppose a coin has produced tails 30 times in a row. We will ﬁnd it

strange and cast some doubts about the coin or the fairness of the

player. On the other hand, if instead of being consecutive, the throws

are spaced over one month (say, one throw per day), maybe we would

not ﬁnd the event so strange after all, and maybe we would not attribute

the ‘cause’ of the event to the coin’s ‘memory’, but instead conjecture

that we are having an ‘unlucky month’.

Consider a raﬄe with 1 000 tickets and 1 000 players, but only one

prize. Once all the tickets have been sold – one ticket for each of the

1 000 players – it comes out that John Doe has won the prize. In a new

repetition of the raﬄe John Doe wins again. It looks as if ‘luck’ is with

John Doe. Since the two raﬄes are independent, the probability of

John Doe winning twice consecutively is (1/1000)² = 0.000 001. This is

exactly the same probability that corresponds to John Doe winning the

ﬁrst time and Richard Roe the second time. Note that this is the same

probability as if John Doe had instead won a raﬄe in which a million

tickets had been sold, which would certainly seem less strange than the

double victory in the two 1000-ticket raﬄes.

Let us slightly modify our problem, considering an only-one-prize

raﬄe with 10 000 tickets and the participation of 1 000 players, each

one buying the same number of tickets. The probability that a given

player (say, John Doe) wins is 0.001 (the thousand players are equally

likely to win). Meanwhile, the probability that someone wins the raﬄe

is obviously 1 (someone has to win). Now suppose the raﬄe is repeated

three times. The probability that the same player (John Doe) wins all

3 times is (independent events) (0.001)³ = 0.000 000 001. However, the

probability that someone (whoever he or she may be) wins the raﬄe

three times is not 1, but at most 1000 × (0.001)³ = 0.000 001. In fact,

any combination of three winning players has a probability (0.001)³;

if the same 1 000 players play the three raﬄes, there are 1 000 distinct

possibilities of someone winning the three raﬄes, resulting in the above

probability. If the players are not always the same, the probability

will be smaller. Thus, what will amaze the common citizen is not so

much that John Doe has a probability of one in a thousand million of

winning, but the fact that the probability of someone winning in three

repetitions is one in a million.
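The raffle arithmetic can be checked with exact fractions (a sketch, not from the book):

```python
from fractions import Fraction

p_john = Fraction(1, 1000)    # each of the 1000 players is equally likely to win
print(p_john ** 3)            # 1/1000000000: John Doe wins all three raffles
print(1000 * p_john ** 3)     # 1/1000000: someone (anyone) wins all three
```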


Let us now take a look at the following questions that appeared in

tests assessing people’s ability to understand probabilities:

1. A hat contains 10 red and 10 blue smarties. I pull out 10 and 8 are

red. Which am I more likely to get next?

2. A box contains green and yellow buttons in unknown proportions.

I take out 10 and 8 are yellow. Which am I more likely to get next?

3. In the last four matches between Mytholmroyd Athletics and Gig-

gleswick United, Mytholmroyd have kicked oﬀ ﬁrst every time, on

the toss of a coin. Which team is more likely to kick oﬀ next time?

People frequently ﬁnd it diﬃcult to give the correct answer¹ to such

questions. When dealing with conditional events and the application

of Bayes’ theorem, the diﬃculty in arriving at the right answer is even

greater. Let us start with an apparently easy problem. We randomly

draw a card out of a hat containing three cards: one with both faces

red, another with both faces blue, and the remaining one with one face

red and the other blue. We pull a card out at random and observe that

the face showing is red. The question is: what is the probability that

the other face will also be red? In general, the answer one obtains is

based on the following line of thought: if the face shown is red, it can

only be one of two cards, red–red or red–blue; therefore, the probability

that it is red–red is 1/2. This result is wrong and the mistake in the

line of thought has to do with the fact that the set of elementary events

has been incorrectly enumerated. Let us denote the faces by R (red)

and B (blue) and let us put a mark, say an asterisk, to indicate the

face showing. The set of elementary events is then:

{R*R, RR*, R*B, RB*, B*B, BB*} .

In this enumeration, R*B means that for the red–blue card the R face

is showing, and likewise for the other cards. As for the RR card, any

of the faces can be turned up: R*R or RR*. In this way,

P(RR if R*) = P(RR and R*) / P(R*) = 2/3 .

In truth, for the three events containing R*, namely R*R, RR* and

R*B, only two correspond to the RR card.
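A quick simulation confirms the 2/3 answer (a sketch of our own, not from the book):

```python
import random

random.seed(1)
cards = [('R', 'R'), ('B', 'B'), ('R', 'B')]

shown_red = other_red = 0
for _ in range(100_000):
    card = random.choice(cards)
    side = random.randrange(2)          # which face happens to show
    if card[side] == 'R':
        shown_red += 1
        if card[1 - side] == 'R':
            other_red += 1

print(round(other_red / shown_red, 2))  # close to 2/3, not 1/2
```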

The following problem was presented by two American psycholo-

gists. A taxi was involved in a hit-and-run accident during the night.

¹ Correct answers are: blue, yellow, either.


Two taxi ﬁrms operate in the city: green and blue. The following facts

are known:

• 85% of the taxis are green and 15% are blue.

• One witness identiﬁed the taxi as blue. The court tested the witness’

reliability in the same conditions as on the night of the accident and

concluded that the witness correctly identiﬁed the colors 80% of the

time and failed in the other 20%.

What is the probability that the hit-and-run taxi was blue?

The psychologists found that the typical answer was 80%. Let us

see what the correct answer is. We begin by noting that

P(witness is right if taxi is blue) = 0.8 , P(witness is wrong if taxi is blue) = 0.2 .

Hence, applying Bayes’ theorem:

P(taxi is blue if witness is right) = 0.15 × 0.8 / (0.15 × 0.8 + 0.85 × 0.2) ≈ 0.41 .

We see that the desired probability is below 50% and nearly half of the

typical answer. In this and similar problems the normal citizen tends

to neglect the role of prevalences. Now the fact that blue taxis are less

prevalent than green taxis leads to a drastic decrease in the probability

of the witness being right.

2.6 Challenging the Intuition

There are problems on probabilities that lead to apparently counter-

intuitive results and are even a source of embarrassment for experts

and renowned mathematicians.

2.6.1 The Problem of the Sons

Let us go back to the problem presented in the last chapter of ﬁnding

the probability that a family with two children has at least one boy.

The probability is 3/4, assuming the equiprobability of boy and girl.

Let us now suppose that we have got the information that the randomly

selected family had a boy. Our question is: what is the probability that


the other child is also a boy? Using the same notation as in Chap. 1,

we have

P({MM} if {MF, FM, MM}) = (1/4) / (3/4) = 1/3 .

This result may look strange, since apparently the fact that we know

the sex of one child should not condition the probability relative to the

other one. However, let us not forget that, as in the previous three-card

problem, there are two distinct situations for boy–girl, namely MF and

FM.

We now modify the problem involving the visit to a random family

with two children, by assuming that we knock at the door and a boy

shows up. What is the probability that the other child is a boy? It

seems that nothing has changed relative to the above problem, but as

a matter of fact this is not true. The event space is now

{M*M, MM*, M*F, MF*, F*M, FM*, F*F, FF*} ,

where the asterisk denotes the child that opens the door. We are inter-

ested in the conditional probability of {M*M, MM*} if {M*M, MM*,

M*F, FM*} takes place. The probability is 1/2. We conclude that the

simple fact that a boy came to the door has changed the probabilities!
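Both versions of the problem can be settled by enumeration. The sketch below (our own code) treats each family as equally likely and, in the door variant, each child as equally likely to open the door:

```python
from fractions import Fraction
from itertools import product

families = list(product('MF', repeat=2))    # MM, MF, FM, FF, equally likely

# Version 1: the family is known to have at least one boy
with_boy = [f for f in families if 'M' in f]
p_other_boy = Fraction(with_boy.count(('M', 'M')), len(with_boy))
print(p_other_boy)   # 1/3

# Version 2: a random child opens the door and happens to be a boy
doors = [(f, i) for f in families for i in (0, 1)]        # 8 equally likely cases
boy_opens = [(f, i) for f, i in doors if f[i] == 'M']
p_other_boy_2 = Fraction(sum(1 for f, i in boy_opens if f[1 - i] == 'M'),
                         len(boy_opens))
print(p_other_boy_2)   # 1/2
```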

2.6.2 The Problem of the Aces

Consider a 5-card hand, drawn randomly from a 52-card deck, and

suppose that an ace is present in the hand. Let us further specify two

categories of hands: in one category, the ace is of clubs; in the other,

the ace is any of the 4 aces. Suppose that in either category of hands we

are interested in knowing the probability that at least one more ace is

present in the hand. We are then interested in the following conditional

probabilities:

P(two or more aces if an ace of clubs was drawn) , P(two or more aces if an ace was drawn) .

Are these probabilities equal? The intuitive answer is aﬃrmative. After

all an ace of clubs is an ace and it would seem to be all the same whether

it be of clubs or of something else. With a little bit more thought, we

try to assess the fact that in the second case we are dealing with any

of 4 aces, suggesting a larger number of possible combinations and,

as a consequence, a larger probability. But is that really so? Let us


actually compute the probabilities. It turns out that for this problem

the conditional probability rule is not of much help. It is preferable to

resort to the classic notion and evaluate the ratio of favorable cases over

possible cases. In reality, since the evaluation of favorable cases would

amount to counting hand conﬁgurations with two, three and four aces,

it is worth trying rather to evaluate unfavorable cases, i.e., hands with

only one ace.

Let us examine the ﬁrst category of hand with an ace of clubs. There

remain 51 cards and 4 positions to ﬁll in. There are therefore C(51, 4)

possible cases (writing C(n, r) for the number of combinations of n

objects taken r at a time), of which C(48, 4) are unfavorable, i.e., there

is no other ace in the remaining 4 positions, whence

P(two or more aces if an ace of clubs was drawn) = [C(51, 4) − C(48, 4)] / C(51, 4) ≈ 0.22 .

In the second category – a hand with at least one ace – the number of

possible cases is obtained by subtracting from all possible hands those

with no ace, which gives

possible cases: C(52, 5) − C(48, 5) .

Concerning the unfavorable cases, once we ﬁx an ace, we get the same

number of unfavorable cases as before. Since there are four distinct

possibilities for ﬁxing an ace, we have

unfavorable cases: 4 × C(48, 4) .

Hence,

P(two or more aces if an ace was drawn) = [C(52, 5) − C(48, 5) − 4 × C(48, 4)] / [C(52, 5) − C(48, 5)] ≈ 0.12 .

We thus obtain a smaller (!) probability than in the preceding situation.

This apparently strange result becomes more plausible if we inspect

what happens in a very simple situation of two-card hands from a deck


Fig. 2.2. The problem of the aces for a very simple deck

with only 4 cards: ace and queen of hearts and clubs (see Fig. 2.2). We

see that, when we ﬁx the ace of clubs, the probability of a second ace

is 1/3 (Fig. 2.2a). If on the other hand we are allowed any of the two

aces or both, the probability is 1/5 (Fig. 2.2b).
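The binomial-coefficient arithmetic above is easy to verify; a short sketch (our own code, not from the book):

```python
from math import comb   # comb(n, r) is the binomial coefficient

# category 1: the hand contains the ace of clubs
p_clubs = (comb(51, 4) - comb(48, 4)) / comb(51, 4)

# category 2: the hand contains at least one ace
possible = comb(52, 5) - comb(48, 5)
one_ace_only = 4 * comb(48, 4)          # fix one of the 4 aces, no further ace
p_any = (possible - one_ace_only) / possible

print(round(p_clubs, 2), round(p_any, 2))   # 0.22 0.12
```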

2.6.3 The Problem of the Birthdays

Imagine a party with a certain number of people, say n. The question

is: how large must n be so that the probability of ﬁnding two people

with the same birthday date (month and day) is higher than 50%?

At ﬁrst sight it may seem that n will have to be quite large, perhaps

as large as there are days in a year. We shall ﬁnd that this is not so.

The probability of a person’s birthday falling on a certain date is 1/365,

ignoring leap years. Since the birthdays of the n people are independent

of each other, any list of n people’s birthday dates has a probability

given by the product of the n individual probabilities (refer to what

was said before about the probability of independent events). Thus,

P((list of) n birthdays) = (1/365)ⁿ .

What is the probability that two birthdays do not coincide? Imagine

the people labeled from 1 to n. We may arbitrarily assign a value to the

ﬁrst person’s birthday, with 365 diﬀerent ways of doing it. Once this

birthday has been assigned there remain 364 possibilities for selecting

a date for the second person with no coincidence; afterwards, 363 pos-

sibilities are available for the third person’s birthday, and so on, up to


Fig. 2.3. Probability of two out of n people having the same birthday

the last remaining person with 365 −n + 1 possibilities. Therefore,

P(n birthdays without two coincidences) = [365 × 364 × . . . × (365 − n + 1)] / 365ⁿ .

It is now clear that P(n birthdays with two coincidences) is the com-

plement of the above probability with respect to 1. It can be checked

that for n = 23 the probability of ﬁnding two coincident birthdays is

already higher than 0.5. Indeed, P(23 birthdays with two coincidences)

= 0.5073. Figure 2.3 shows how this probability varies with n. Note

that, whereas for 10 people the probability is small (0.1169), for 50

people it is quite considerable (0.9704).
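The curve in Fig. 2.3 is straightforward to recompute; a sketch (our own code):

```python
def p_shared_birthday(n, days=365):
    """Complement of the probability that all n birthdays are distinct."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

for n in (10, 23, 50):
    print(n, round(p_shared_birthday(n), 4))   # 0.1169, 0.5073, 0.9704
```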

2.6.4 The Problem of the Right Door

The problem of the right door comes from the USA (where it is known

as the Monty Hall paradox) and is described as follows. Consider a TV

quiz where 3 doors are presented to a contestant and behind one of

them is a valuable prize (say, a car). Behind the other two doors,

however, the prizes are of little value. The contestant presents his/her

choice of door to the quiz host. The host, knowing what is behind each

door, opens one of the other two doors (one not hiding the big prize)

and reveals the worthless prize. The contestant is then asked if he/she

wants to change the choice of door. The question is: should the con-

testant change his/her initial choice? When a celebrated columnist of

an American newspaper recommended changing the initial choice, the

newspaper received several letters from mathematicians indignant over

such a recommendation, stating that the probability of winning with

such a strategy was 0.5. Let us see if that is true. We start by number-

ing the doors from 1 to 3 and assume that door 1 is the contestant’s


Table 2.1. Finding the right door

Prize door Initial choice Host opens New choice

Door 1 Door 1 Door 2 Door 3

Door 2 Door 1 Door 3 Door 2

Door 3 Door 1 Door 2 Door 3

initial choice. If the contestant does not change his/her initial choice,

there is a 1/3 probability of choosing the right door. If, on the other

hand, he/she decides to change his/her choice, we have the situations

shown in Table 2.1.

The contestant wins in the two new choices written in bold in the

table; therefore, the probability of winning when changing the choice

doubles to 2/3! This result is so counter-intuitive that even the Hun-

garian mathematician Paul Erdős (1913–1996), famous for his contri-

butions to number theory, had diﬃculty accepting the solution. The

problem for most people in accepting the right solution is connected

with the strongly intuitive and deeply rooted belief that probabilities

are something immutable. Note, however, how the information obtained

by the host opening one of the doors changes the probability setting

and allows the contestant to increase his/her probability of winning.
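A simulation settles the dispute quickly. The sketch below (our own code) fixes the contestant's initial choice at door 0, which costs no generality by symmetry:

```python
import random

random.seed(2)

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = 0   # by symmetry, the contestant may always start with door 0
        # the host opens a door that is neither the choice nor the prize
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

print(round(play(switch=False), 2))   # about 1/3
print(round(play(switch=True), 2))    # about 2/3
```

Note that switching wins exactly when the initial choice was wrong, which happens 2/3 of the time, whatever rule the host uses to break ties.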

2.6.5 The Problem of Encounters

We saw in the last chapter that there is a 1/n! probability of randomly

inserting n letters in the right envelopes. We also noticed how this

probability decreases very quickly with n. Let us now take a look at

the event ‘no match with any envelope’. At ﬁrst sight, one could imag-

ine that the probability of this event would also vary radically with

n. Let us compute the probability. We begin by noticing that the one-

match event has probability 1/n, whence the no-match event for one

letter has the probability 1 −1/n. If the events ‘no match for letter 1’,

‘no match for letter 2’, . . . , ‘no match for letter n’ were independent,

the probability we wish to compute would be (1 − 1/n)ⁿ, which tends

to 1/e, as n grows. The number denoted e, whose value up to the sixth

digit is 2.718 28, is called the Napier constant, in honor of the Scot-

tish mathematician John Napier, who introduced logarithmic calculus

(1618).

However, as we know, these events are not independent since their

complements are not. For instance, if n − 1 letters are in the right


envelope, the nth letter has only one position (the right one) to be

inserted in.

It so happens that the problem we are discussing appears under sev-

eral guises in practical issues, generally referred to as the problem of

encounters, which for letters and envelopes is formulated as: supposing

that the letters and the envelopes are numbered according to the natu-

ral order 1, 2, . . . , n, what is the probability that, when inserting letter

i, it will sit in its natural position? (We then say that an encounter has

taken place.)

Consider all the possibilities of encounters of letters 1 and 3 in a set

of 5 letters:

12345 , 12354 , 14325 , 14352 , 15324 , 15342 .

We see that for r encounters (r = 2 in the example), there are only n−r

letters that may permute (in the example, 3! = 6 permutations). Any of

these permutations has probability (n−r)!/n!. Finally, we can compute

the probability of no encounter as the complement of the probability

of at least one encounter. This one is computed using a generalization

of the event union formula for more than two events. Assuming n = 3

and abbreviating ‘encounter’ to ‘enc’, the probability is computed as

follows:

P(at least one encounter in 3 letters)
    = P(enc. letter 1) + P(enc. letter 2) + P(enc. letter 3)
      − P(enc. letters 1 and 2) − P(enc. letters 1 and 3)
      − P(enc. letters 2 and 3) + P(enc. letters 1, 2 and 3)
    = C(3, 1) × (3 − 1)!/3! − C(3, 2) × (3 − 2)!/3! + C(3, 3) × (3 − 3)!/3!
    = 3 × 1/3 − 3 × 1/6 + 1 × 1/6 = 2/3 ,

where C(n, r) denotes the number of ways of choosing r letters out of n.

Figure 2.4 shows the Venn diagram for this situation.

Hence,

P(no encounter in 3 letters) = 1 − 2/3 = 1/3 .

It can be shown that the above calculations can be carried out more

easily as follows for n letters:


Fig. 2.4. The gray area in (a) is equal to the sum of the areas of the rectangles

corresponding to the events ‘enc. letter 1’, ‘enc. letter 2’, and ‘enc. letter

3’, after subtracting the hatched areas in (b) and adding the double-hatched

central area

Table 2.2. Relevant functions for the problem of encounters

n            2      3        4        5        6        7        8        9
d(n)         1      2        9        44       265      1854     14833    133496
P(n)         0.5    0.3333   0.3750   0.3667   0.3681   0.3679   0.3679   0.3679
(1 − 1/n)^n  0.25   0.2963   0.3164   0.3277   0.3349   0.3399   0.3436   0.3464

P(no encounter in n letters) = d(n)/n! ,

where d(n), known as the number of derangements of n elements (per-

mutations without encounters) is given by

d(n) = (n − 1) × [d(n − 1) + d(n − 2)] ,

with d(0) = 1 and d(1) = 0. Table 2.2 lists the values of d(n), the
probability of no encounter in n letters, denoted P(n), and the value of
the function (1 − 1/n)^n.

The probability of no encounter in n letters approaches 1/e =

0.3679, as in the incorrect computation where we assumed the inde-

pendence of the encounters, but it does so in a much faster way than

with the wrong computation. Above 6 letters, the probability of no

match in an envelope is practically constant!
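The recurrence for d(n) is straightforward to check numerically. The following is a short sketch (assuming Python; the function name is our own) that reproduces the values of Table 2.2:

```python
from math import factorial

def derangements(n):
    """Number of permutations of n elements with no encounter, via the
    recurrence d(n) = (n - 1) * (d(n - 1) + d(n - 2)), d(0) = 1, d(1) = 0."""
    d = [1, 0]                      # d(0) and d(1)
    for k in range(2, n + 1):
        d.append((k - 1) * (d[k - 1] + d[k - 2]))
    return d[n]

# P(no encounter in n letters) = d(n)/n!  ->  approaches 1/e = 0.3679...
for n in range(2, 10):
    print(n, derangements(n), round(derangements(n) / factorial(n), 4))
```

Already at n = 9 the ratio d(n)/n! agrees with 1/e to four decimal places, confirming how quickly the probability of no match stabilizes.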

3 Expecting to Win

3.1 Mathematical Expectation

We have made use in the preceding chapters of the frequentist concept

of probability. Let us assume that, in a sequence of 8 tosses of a coin,

we have obtained:

heads, tails, tails, heads, heads, heads, tails, heads .

Assigning the value 1 to heads and 0 to tails, we may then add up the

values of the sequence 10011101. We obtain 5, that is, an average value

of 5/8 = 0.625 for heads. This average, computed on the 0-1 sequence,

corresponds to the frequency of occurrence f = 0.625 heads per toss.

The frequentist interpretation tells us that, if we toss a coin a very

large number of times, about half of the times heads will turn up. In

other words, on average heads will turn up once every two tosses.

As an alternative to the tossing sequence in time, we may imagine

a large number of people tossing coins in the air. On average heads

would turn up once for every two people. Of course this theoretical value

of 0.5 for the average (or frequency) presupposes that the coin-toss

experiment either in time or for a set of people is performed a very large

number of times. When the number of tosses is small one expects large

deviations from an average value around 0.5. (The same phenomenon

was already illustrated in Fig. 1.4 when throwing a die.)

We now ask the reader to imagine playing a card game with someone

and using a pack with only the picture cards (king, queen, jack), the

ace and the joker of the usual four suits. Let us denote the cards by K,

Q, J, A and F, respectively. The 4×5 = 20 cards are fairly shuﬄed and

an attendant extracts cards one by one. After extracting and showing


Fig. 3.1. Evolution of the average gain in a card game

the card it is put back in the deck, which is then reshuﬄed. For each

drawn picture card the reader wins 1 euro from the opponent and pays

1 euro to the opponent if an ace is drawn and 2 euros if a joker is drawn.

Suppose the following sequence turned up:

AKAKAFJQJFAQFAAAJAAQAJAQKKFQKAJKJQJAQKKKF.

For this sequence the reader’s accumulated gain varies as: −1, 0, −1, 0,

−1, −3, −2, and so on. By the end of the ﬁrst ﬁve rounds the average

gain per round would be −0.2 euros (therefore, a loss). Figure 3.1 shows

how the average gain varies over the above 40 rounds.

One may raise the question: what would the reader’s average gain

be in a long sequence of rounds? Each card has an equal probability of

being drawn, which is 4/20 = 1/5. Therefore, the 1 euro gain occurs

three ﬁfths of the time. As for the 1 or 2 euro losses, they only occur

one ﬁfth of the time. Thus, one expects the average gain in a long run

to be as follows:

expected average gain = 1 × 3/5 + (−1) × 1/5 + (−2) × 1/5 = 0 euro .

It thus turns out that the expected average gain (theoretical average

gain) in a long run is zero euros. In this case the game is considered

fair, that is, either the reader or the opponent has the same a priori

chances of obtaining the same average gain. If, for instance, the king’s

value was 2 euros, the reader would have an a priori advantage over

the opponent, expecting to win 0.2 euros per round after many rounds.

The game would be biased towards the reader. If, on the other hand,

the ace’s value was −2 euros, we would have the opposite situation: the

game would be biased towards the opponent, favoring him/her by 0.2

euros per round after many rounds.


Fig. 3.2. The probability function of the gain for

a card game

Let us take a closer look at how the average gain is computed. One

can look at it as if one had to deal with a variable called gain, taking

values in the set {−2, −1, 1}, and such that the appropriate probability

is assigned to each value, as shown in Fig. 3.2. One then simply adds

up the products of the values with the probabilities.

We say that the gain variable is a random variable. The expected

average gain, expected after a large number of rounds, which, as we

saw, is computed using the event probabilities, is called the mathemat-

ical expectation of the gain variable. Note the word ‘mathematical’.

We may have to play an incredibly large number of times before the

average gain per round is suﬃciently near the mathematical, theoreti-

cal, value of the expected average gain. At this moment the reader will

probably suspect, and rightly so, that in the coin-tossing experiment

the probability that heads turns up is the mathematical expectation of

the random variable ‘frequency of occurrence’, denoted by f.

There is in fact a physical interpretation of the mathematical ex-

pectation that we shall now explain with the help of Fig. 3.2. Imagine

that the ‘gain’ axis is a wire along which point masses have been placed

with the same value and in the same positions as in Fig. 3.2. The equi-

librium point (center of gravity) of the wire would then correspond to

the mathematical expectation, in this case, the point 0.

Fig. 3.3. Evolution of the average gain in 1 000 rounds


Fig. 3.4. The euphemistic average

Figure 3.3 shows how the average gain of the card game varies over

a sequence of 1 000 rounds. Note the large initial wanderings followed

by a relative stabilization. Another sequence of rounds would most

certainly lead to a diﬀerent average gain sequence. However, it would

also stabilize close to

mathematical expectation = Σ (value of random variable × respective probability) ,

where Σ denotes the total sum.
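The sum of values times probabilities is a one-liner in code. Here is a hedged sketch (Python, with an illustrative function name) applied to the card game of this chapter:

```python
def expectation(distribution):
    """Mathematical expectation: the sum of value x probability."""
    return sum(value * prob for value, prob in distribution)

# The card game of the text: +1 euro for K, Q or J (3/5 of the draws),
# -1 euro for an ace, -2 euros for a joker (1/5 each).
fair_game = [(1, 3/5), (-1, 1/5), (-2, 1/5)]
print(expectation(fair_game))            # approximately 0: a fair game

# With the king worth 2 euros the game is biased towards the reader:
biased_game = [(2, 1/5), (1, 2/5), (-1, 1/5), (-2, 1/5)]
print(expectation(biased_game))          # approximately 0.2 euros per round
```

The second distribution reproduces the 0.2 euros per round advantage mentioned in the text for a 2-euro king.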

3.2 The Law of Large Numbers

The time has come to clarify what is meant by a ‘suﬃciently long run’ or

‘a large number of rounds’. For that purpose, we shall use the preceding

card game example with null mathematical expectation. Denoting the

expectation by E, we have

E(AKQJF game) = 0 .

On the other hand, the average gain after n rounds is computed by

adding the individual gains of the n rounds – that is, the values of the

corresponding random variable – and dividing the resulting sum by n.

In Fig. 3.1, for instance, the values of the random variable for the ﬁrst

10 rounds were

−1 , 1 , −1 , 1 , −1 , −2 , 1 , 1 , 1 , −2 .


We thus obtain the following average gain in 10 rounds:

average gain in 10 rounds = (−1 + 1 − 1 + 1 − 1 − 2 + 1 + 1 + 1 − 2)/10 = −0.2 .

Figure 3.3 shows a sequence of 1 000 rounds. This is only one sequence

out of a large number of possible sequences. With regard to the number

of possible sequences, we apply the same counting technique as for the

toto: the number of possible sequences is 5^1000, which is closely
represented by 9 followed by 698 zeros! (A number far larger than the
number of atoms in the Milky Way, estimated as 1 followed by 68 zeros.)

Figure 3.5 shows 10 curves of the average gain in the AKQJF game,

corresponding to 10 possible sequences of 1 000 rounds. We observe

a tendency for all curves to stabilize around the mathematical expec-

tation 0. However, we also observe that some average gain curves that

looked as though they were approaching zero departed from it later on.

Let us suppose that we establish a certain deviation around zero, for

instance 0.08 (the broken lines in Fig. 3.5). Let us ask whether it is

possible to guarantee that, after a certain number n of rounds, all the

curves will deviate from zero by less than 0.08?

Let us see what happens in the situation shown in Fig. 3.5. When

n = 200, we have 4 curves (out of 10) that exceed the speciﬁed de-

viation. Near n = 350, we have 3 curves. After around 620, we only

have one curve and ﬁnally, after approximately 850, no curve exceeds

the speciﬁed deviation. This experimental evidence suggests, therefore,

that the probability of ﬁnding a curve that deviates from the mathe-

matical expectation by more than a pre-speciﬁed tolerance will become

smaller and smaller as n increases.

Now, it is precisely such a law that the mathematician Jacob

Bernoulli, already mentioned in Chap. 1, was able to demonstrate and

Fig. 3.5. Ten possible evolutions of the average gain in the card game


which has come to be known as the law of large numbers or Bernoulli’s

law: for a sequence of random variables with the same probability law

and independent from each other (in our example, no round depends on

the others), the probability that the respective average deviates from

the mathematical expectation by more than a certain pre-established

value tends to zero. Turning back to our previous question, we cannot

guarantee that after a certain number of rounds all average gain curves

will deviate from the mathematical expectation by less than a certain

value, but it is nevertheless possible to determine a degree of certainty

(compute the probability) of such an event, and that degree of certainty

increases with the number of rounds.

This way, the probability measure (for instance, of heads turning up

in coin tossing) does not predict the outcome of a particular sequence,

but, in a certain sense, ‘predicts’ the average outcome of a large number

of sequences. We cannot predict the detail of all possible evolutions, but

we can determine a certain global degree of certainty.
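Bernoulli's law can be watched at work in simulation. The sketch below (Python; the 0.08 tolerance mirrors Fig. 3.5, while the curve count and seed are our own assumptions) estimates the fraction of average-gain curves still outside the tolerance after n rounds:

```python
import random

GAINS = [1, 1, 1, -1, -2]      # K/Q/J win 1 euro, A loses 1, F loses 2

def fraction_outside(n_rounds, tolerance=0.08, n_curves=100, seed=1):
    """Fraction of simulated average-gain curves whose average after
    n_rounds deviates from the expectation E = 0 by more than tolerance."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(n_curves):
        average = sum(rng.choice(GAINS) for _ in range(n_rounds)) / n_rounds
        outside += abs(average) > tolerance
    return outside / n_curves

for n in (100, 1000, 10000):
    print(n, fraction_outside(n))   # the fraction shrinks as n grows
```

As the law predicts, the probability of exceeding the pre-specified deviation does not reach zero at any fixed n, but it becomes ever smaller as n increases.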

In a game between John and Mary, if John bets a million with

1/1000 probability of losing and Mary bets 1000 with 999/1000 prob-

ability of losing, the gain expectation is:

1 000 × 999/1 000 − 1 000 000 × 1/1 000 = −1 for John ,

1 000 000 × 1/1 000 − 1 000 × 999/1 000 = +1 for Mary .

Therefore, for suﬃciently large n, the game is favorable to Mary (see

Fig. 3.6). However, if the number of rounds is small, John has the

advantage.

Fig. 3.6. Evolution of John’s average gain with the number of rounds, in six

possible sequences. Only in one sequence is chance unfavorable to John during

the ﬁrst 37 rounds


When we say ‘tends to zero’, in the law of large numbers, the mean-

ing is that it is possible to ﬁnd a suﬃciently high value of n after which

the probability of a curve exceeding a pre-speciﬁed deviation is below

any arbitrarily low value we may wish to stipulate. We shall see later

what value of n that is, for this convergence in probability.

3.3 Bad Luck Games

The notion of mathematical expectation will allow us to make a better

assessment of the degree of ‘bad luck’ in so-called games of chance. Let

us consider the European roulette game discussed in Chap. 1 and let

us assume that a gambler only bets on a speciﬁc roulette number and

for that purpose (s)he pays 1 euro to the casino. Let us further assume

that if the gambler wins the casino will pay him/her 2 euros.

For a suﬃciently large number of rounds the player’s average gain

will approach the mathematical expectation

E(roulette) = 2 × 1/37 + 0 × 36/37 = 0.054 euros .

Since the player pays 1 euro for each bet, the roulette game is unfavor-

able to him/her, implying an average loss of 1−0.054 = 0.946 euros per

bet. The game would only be equitable if the casino paid 37 euros (!)

for each winning number. On the other hand, for a two euro payment

by the casino, the game would only be equitable for the player if (s)he

paid the mathematical expectation for betting, namely 0.054 euros.

Let us now analyze the slot machine described in Chap. 1 (see

Fig. 1.8). The question is: how much would a gambler have to pay for

the right to one equitable round at the slot machine? To answer this

question we compute the mathematical expectation using the values

presented in Chap. 1:

E(slot machine) = 0.05 ×0.090 353 + 0.10 ×0.048 695

+0.15 ×0.022 129 + 0.25 ×0.003 155

+0.30 ×0.002 258 + 0.50 ×0.000 666

+1.00 ×0.000 079 + 2.50 ×0.000 001 = $ 0.01 .

Therefore the gambler should only pay 1 cent of a dollar per go.

The winning expectation in state games (lottery, lotto, toto, and the

like) is also, as can be expected, unfavorable to the gambler. Consider


the toto and let 0.35 euros be the payment for one bet. In state games,

from the total amount of money amassed from the bets, only half is

spent on prizes, and approximately in proportion to the probabilities

of each type of hit. Now, as we saw in Chap. 1, for each ﬁrst prize,

one expects 43 second prizes, 258 third prizes, 13 545 fourth prizes and

246 820 ﬁfth prizes. Consequently, 50% of the amassed money will be

used to pay prizes with a probability of:

P(prizes) = 0.000 000 071 5 ×(1 + 43 + 258 + 13 545 + 246 820)

= 0.018 6 .

Therefore,

E(toto) = 0.018 6 ×0.175 −0.9814 ×0.175 = −0.096 euros .

That is, the toto gambler loses on average (in the sense of the law of

large numbers), 0.096 euros per bet.

3.4 The Two-Envelope Paradox

The two-envelope paradox, also known as the swap paradox, seems to

have been created by the Belgian mathematician Maurice Kraitchik

(1882–1957) in 1953. There is probably no other paradox (or better,

pseudo-paradox) of probability theory that has stimulated so much dis-

cussion in books and papers. It is interesting to note that its enunciation

is apparently simple and innocuous. Let us consider a ﬁrst version of

the paradox. Suppose I show the reader two identical and closed en-

velopes with unknown contents. The only thing that is disclosed is that

they contain certain amounts of money, with the amount in one of the

envelopes being double the amount in the other one. The reader may

choose one envelope telling me his/her choice. Having told me his/her

choice, the reader is given the possibility of swapping his/her choice.

What should the reader do? Swap or not swap? It seems obvious that

any swap is superﬂuous. It is the same thing as if I oﬀered the possi-

bility of choosing heads or tails in the toss of a coin. The probability is

always 1/2. But let us contemplate the following line of thought:

1. The envelope I chose contains x euros.

2. The envelope I chose contains either the smaller or the bigger

amount, as equally likely alternatives.


3. If my envelope contains the smaller amount I will get 2x euros if I

swap and I would therefore win x euros.

4. On the other hand, if my envelope contains the bigger amount, by

swapping it I will get x/2 euros and I would thus lose x/2 euros.

5. Therefore, the expected gain if I make the swap is

E(swap gain) = 0.5 ×x −0.5 ×(x/2) = 0.25x euros .

That is, I should swap since the expected gain is positive. (The wisest

decision is always guided by the mathematical expectation, even if it is

not always the winning decision it is deﬁnitely the winning decision with

arbitrarily high probability in a suﬃciently large number of trials.) But,

once the swap takes place the same argument still holds and therefore

I should keep swapping an inﬁnite number of times!

The crux of the paradox lies in the meaning attached to x in the

formula E(swap gain). Let us take a closer look by attributing a value

to x. Let us say that x is 10 euros and I am told that the amount in

the other envelope is either twice this amount (20 euros) or half of it

(5 euros), with equal probability. In this case I should swap, winning

on average 2.5 euros with the swap. There is no paradox in this case.

Or is there? If I swap, does the same argument not hold? For suppose

that the other envelope contains y euros. Therefore, the amount of

the initially chosen envelope is either 2y euros or y/2 euros and the

average swap gain is 0.25y euros. Then the advantage in continuing

to swap ad inﬁnitum would remain, and so would the paradox. The

problem with this argument is that, once we have ﬁxed x, the value of

y is not ﬁxed: it is either 20 euros or 5 euros. Therefore we cannot use

y in the computation of E(swap gain), as we did above. This aspect

becomes clear if we consider the standard version of the paradox that

we analyze in the following.

In the standard version of the paradox what is considered ﬁxed is the

total amount in the two envelopes (the total amount does not depend on

my original choice). Let 3x euros be the total amount. Then, my initial

choice contains either x euros or 2x euros, and the calculation in step 5

of the above argument is not legitimate, since it assumes a ﬁxed value

for x. If, for instance, the initial envelope contains the lower amount,

x euros, I win x euros with the swap; but if it is the higher amount,

2x euros, and I do not lose x/2 euros as in the formula of step 5, but

in fact x euros. Therefore, E(swap gain) = 0.5x − 0.5x = 0 euros and

I do not win or lose with the swap.
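A simulation of the standard version supports this resolution. In the sketch below (Python; variable names and trial count are our own choices) the total 3x is fixed and the first pick is random:

```python
import random

def average_swap_gain(x=10.0, trials=100_000, seed=0):
    """Standard version: one envelope holds x euros, the other 2x euros
    (the total 3x is fixed). Return the average gain from always swapping."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        envelopes = [x, 2 * x]
        rng.shuffle(envelopes)
        chosen, other = envelopes
        total += other - chosen          # gain (or loss) from swapping
    return total / trials

print(average_swap_gain())               # close to 0: swapping gains nothing
```

The swap gain is +x or −x with equal probability, so the average hovers around zero, as the corrected argument of step 5 says.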


Other solutions to the paradox have been presented, based on the

fact that the use of the formula in step 5 would lead to impossible

probabilistic functions for the amount in one of the envelopes (see,

e.g., Keith Devlin, 2005).

Do the two-envelope paradox and its analysis have any practical

interest or is it simply an ‘entertaining problem’ ? In fact, this and

other paradoxes do sometimes have interesting practical implications.

Consider the following problem: let x be a certain amount in euros that

we may exchange today in dollars at a rate of 1 dollar per euro, and later

convert back to euros, at a rate of 2 dollars or 1/2 dollar per euro, with

equal probability. Applying the envelope paradox reasoning we should

perform these conversions hoping to win 0.25x euros on average. But

then the dollar holder would make the opposite conversions and we

arrive at a paradoxical situation known as the Siegel paradox, whose

connection with the two-envelope paradox is obvious (see Friedel Bolle,

2003).

3.5 The Saint Petersburg Paradox

We now discuss a coin-tossing game where a hypothetical casino (ex-

tremely conﬁdent in the gambler’s fallacy) accepts to pay 2 euros if

heads turns up at the ﬁrst throw, 4 euros if heads only turns up at

the second throw, 8 euros if it only turns up at the third throw, 16

euros if it only turns up at the fourth throw, and so on and so forth, al-

ways doubling the previous amount. The question is: how much money

should the player give to the casino as initial down-payment in order

to guarantee an equitable game?

The discussion and solution of this problem were ﬁrst presented

by Daniel Bernoulli (nephew of Jacob Bernoulli) in his Commentarii

Academiae Scientiarum Imperialis Petropolitanae (Comments of the

Imperial Academy of Science of Saint Petersburg), published in 1738.

The problem (of which there are several versions) came to be known as

the Saint Petersburg paradox and was the subject of abundant bibliog-

raphy and discussions. Let us see why. The probability of heads turning

up in the ﬁrst throw is 1/2. For the second throw, taking into account

the compound event probability rule for independent events, we have

P(first time tails) × P(second time heads) = (1/2)^2 ,


and likewise for the following throws, always multiplying the previous

value by 1/2. In this way, as shown in Fig. 3.7b, the probability rapidly

decreases with the throw number (the random variable) when heads

turns up for the ﬁrst time. It then comes as no surprise that the casino

proposes to pay the player an amount that rapidly grows with the throw

number. Applying the rule that the player must pay the mathematical

expectation to make the game equitable, this would correspond to the

player having to pay

E(Saint Petersburg game) = 2 × 1/2 + 4 × 1/4 + 8 × 1/8 + 16 × 1/16 + ···
                         = 1 + 1 + 1 + 1 + ··· = ∞ .

The player would thus have to pay an inﬁnite (!) down-payment to the

casino in order to enter the game.

On the other hand, assuming that for entering the game the player

would have to pay the prize value whenever no heads turned up, when

heads did ﬁnally turn up at the nth throw (s)he would have disbursed

2 + 4 + 8 + ··· + 2^(n−1) = 2^n − 2 euros .

At the nth throw (s)he would finally get the 2^n euro prize. Thus, only

a net proﬁt of 2 euros is obtained. One is unlikely to ﬁnd a gambler

who would pay out such large amounts for such a meager proﬁt.

The diﬃculty with this paradox lies in the fact that the mathe-

matical expectation is inﬁnite. Strictly, this game has no mathematical

expectation and the law of large numbers is not applicable. The paradox

is solvable if we modify the notion of equitable game. Let us consider

Fig. 3.7. (a) The exponential growth of the gain in the paradoxical Saint

Petersburg game. (b) The decreasing exponential probability in the same

game


Table 3.1. Relevant functions for the Saint Petersburg paradox

n      P(n)      Gain f [euro]      Utility log f      Expected utility

1 1/2 2 0.693 147 0.346 574

2 1/4 4 1.386 294 0.693 147

3 1/8 8 2.079 442 0.953 077

4 1/16 16 2.772 589 1.126 364

5 1/32 32 3.465 736 1.234 668

6 1/64 64 4.158 883 1.299 651

7 1/128 128 4.852 030 1.337 557

8 1/256 256 5.545 177 1.359 218

9 1/512 512 6.238 325 1.371 403

10 1/1024 1024 6.931 472 1.378 172

the sum of the gains after n throws, which we denote S_n. Moreover, let
us consider that, instead of a fixed amount per throw, the gambler's
payments are variable, with an accumulated payment denoted P_n after
n throws. We now modify the notion of an equitable game by imposing
the convergence in probability towards 1 of the ratio S_n/P_n. In other

words, we would like any imbalance between what the casino and the

gambler have already paid to become more and more improbable as

the number of throws increases. In this sense it is possible to guaran-

tee the equitability of the Saint Petersburg game if the player pays in

accordance with the following law:

P_n = n × log_2(n) + 1 ,

corresponding to paying the following amounts:

1, 2, 2.76, 3.25, 3.61, 3.9, 4.14, 4.35, 4.53, 4.69, 4.83, 4.97, 5.09, 5.2, . . . .
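These round-by-round amounts are the increments of the accumulated payment P_n = n log_2(n) + 1. A small sketch (Python; we assume this reading of the formula, and our rounding differs from the book's third entry by 0.01) regenerates the list:

```python
from math import log2

def accumulated_payment(n):
    """Accumulated fair payment after n throws: P_n = n * log2(n) + 1."""
    return n * log2(n) + 1

# The amount due at round n is the increment P_n - P_{n-1}
# (the accumulated payment before the first round being taken as 0).
amounts = [accumulated_payment(1)]
amounts += [accumulated_payment(n) - accumulated_payment(n - 1)
            for n in range(2, 15)]
print([round(a, 2) for a in amounts])    # 1.0, 2.0, 2.75, 3.25, 3.61, ...
```

The increments grow without bound, but only logarithmically, which is what keeps the modified game equitable in the sense just described.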

When Daniel Bernoulli presented the Saint Petersburg paradox in 1738,

he also presented a solution based on what he called the utility function.

The idea is as follows. Any variation of an individual’s fortune, say f,

has diﬀerent meanings depending on the value of f.

Let us denote the variation of f by ∆f. If, for instance, an individ-

ual’s fortune varies by ∆f = 10 000 euros, the meaning of this variation

for an individual whose fortune is f = 10 000 euros (100% variation)

is quite diﬀerent from the meaning it has for another individual whose

fortune is f = 1 000 000 euros (1% variation). Let us then consider that

any transaction between individuals is deemed equivalent only when

the values of ∆f/f are the same. This amounts to saying that there is


a utility function of the fortune f expressed by log(f). (Eﬀectively, the

base e logarithm function – see the appendix at the end of the book

– enjoys the property that its variation is expressed in that way, viz.,

∆f/f.) Table 3.1 shows the expected utilities for the Saint Petersburg

game, where the gain represents the fortune f. The expected utilities

converge to a limit (1.386 294) corresponding to 4 euros. The rational

player from the money-utility point of view would then pay an amount

lower than 4 euros to enter the game (but we do not know whether the

casino would accept it).
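The convergence of the expected utility in Table 3.1 to 1.386 294 = log 4 is quick to verify. A sketch under the chapter's model (base-e logarithm of the gain, gain 2^n received with probability (1/2)^n):

```python
from math import log

# Expected utility: the sum over n of P(n) * log(gain) = (1/2)^n * log(2^n).
# Each term equals n * log(2) / 2^n, and the series sums to 2 log 2 = log 4.
expected_utility = sum(0.5 ** n * log(2 ** n) for n in range(1, 60))
print(expected_utility)                  # approaches log(4), about 1.386294
```

Since exp(log 4) = 4, this is the sense in which the rational money-utility player values the game at just under 4 euros.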

3.6 When the Improbable Happens

In Chap. 1 reference was made to the problem of how to obtain heads

in 5 throws, either with a fair coin – P(heads) = P(tails) = 0.5 – or

with a biased one – P(heads) = 0.65, P(tails) = 0.35. We are now able

to compute the mathematical expectation in both situations. Taking

into account the probability function for 0 through 5 heads turning up

for a fair coin (see Fig. 1.7a), we have

E(k heads in 5 throws) = 0 ×0.031 25 + 1 ×0.156 25 + 2 ×0.312 5

+3 ×0.312 5 + 4 ×0.156 25 + 5 ×0.031 25

= 2.5 .

We have obtained the entirely expected result that, for a fair coin,

heads turns up on average in half of the throws (for 5-throw sequences

or any other number of throws).

There are a vast number of practical situations requiring the deter-

mination of the probability of an event occurring k times in n indepen-

dent repetitions of a certain random experiment. Assuming that the

event that interests us (success) has a probability p of occurring in any

of the repetitions, the probability function to use is the same as the

one found in Chap. 1 for the coin example, that is,

P(k successes in n) = C(n, k) × p^k × (1 − p)^(n−k) .

The distribution of probability values according to this formula is called

the binomial distribution. The name comes from the fact that the
probabilities correspond to the terms in the binomial expansion of (p + q)^n,

with q = 1 −p. It can be shown that the mathematical expectation of

the binomial distribution (the distribution mean) is given by


E(binomial distribution) = n ×p .

We can check this result for the above example: E = 5 ×0.5 = 2.5. For

the biased coin we have E = 5×0.65 = 3.25. Note how the expectation

has been displaced towards a higher number of heads. Note also (see

Fig. 1.7) how the expectation corresponds to a central position (center

of gravity) of the probability distribution. Incidentally, from f = k/n

and E(k) = np, we arrive at the result mentioned in the ﬁrst section of

this chapter that the mathematical expectation of an event frequency

is its probability, i.e., E(f) = p.
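The binomial formula and its mean translate directly into code. The sketch below (Python; the function names are illustrative) checks both coin examples:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes in n repetitions) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binomial_mean(n, p):
    """Mean computed from the distribution itself; it equals n * p."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(binomial_mean(5, 0.5))     # 2.5 heads on average for the fair coin
print(binomial_mean(5, 0.65))    # about 3.25 for the biased coin
```

Summing k times the probability of k reproduces n × p exactly, which is the displacement of the center of gravity noted in the text.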

Consider a sequence of 20 throws of a fair coin in which 20 heads

turned up. Looking at this sequence

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ,

one gets the feeling that something very peculiar has happened. As in

the gambler’s fallacy, there is a great temptation to say that, at the

twenty-ﬁrst throw, the probability of turning up tails will be higher. In

other situations one might be tempted to attribute a magical or divine

cause to such a sequence. The same applies to other highly organized

sequences, such as

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ,

or

1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0 .

Nevertheless all these sequences have the same probability of occurring

as any other sequence of 20 throws, namely, (1/2)^20 = 0.000 000 953 67.

If we repeat the 20-throw experiment a million times, we expect to

obtain such sequences (or any other sequence for that matter) a number

of times given by n × p = 1 000 000 × 0.000 000 95 ≈ 1. That is, there

is a high certainty that once in a million such a sequence may occur.

In the case of the toto game, where the probability of the ﬁrst prize is

0.000 000 627, we may consider what happens if there are 1.6 million

bets. The expectation is 1, i.e., on average, we will ﬁnd (with high

certainty) a ﬁrst prize among 1.6 million bets. The conclusion to be

drawn is that, given a suﬃciently large number of experiments, the

‘improbable’ will happen: highly organized sequences, emergence of life

on a planet, or our own existence.


3.7 Paranormal Coincidences

Imagine yourself dreaming about the death of someone you knew. When

reading the newspaper the next day you come across the column re-

porting the death of your acquaintance. Do you have a premonitory

gift? People ‘studying’ so-called paranormal phenomena often present

various experiences of this sort as evidence that ‘something’ such as

telepathy or premonition is occurring.

Many reports on ‘weird’ cases, supposedly indicating supernatural

powers, contain, in fact, an incorrect appreciation of the probabilities

of coincident events. We saw in the birthday problem that it suﬃces to

have 23 people at a party in order to obtain a probability over 50% that

two of them share the same birthdays. If we consider birthday dates

separated by at most one day, the number 23 can be dropped to 14!

Let us go back to the problem of dreaming about the death of some-

one we knew and subsequently receiving the information that (s)he has

died. Let us assume that in 30 years of adult life a person dreams,

on average, only once about someone he/she has known during that

30 year span. (This is probably a low estimate.) The probability of

dreaming about the death of an acquaintance on the day preceding

his/her death (one hit in 30×365.25 days) is then 1/(30×365.25). Let

us further assume that in the 30 year span there are only 15 acquain-

tances (friends and relatives) whose death is reported in the right time

span, so that the coincidence can be detected. With this assumption,

the probability of a coincidence taking into account the 15 deaths is

15/(30 ×365.25) = 1/730.5 in 30 years of one’s life. The probability of

one coincidence per year is estimated as 1/(30 × 730.5) = 0.000 045 6.

Finally, let us take, for instance, the Portuguese adult population esti-

mated at 7 million people. We may then use the binomial distribution

formula for the mean in order to estimate how many coincidences occur

per year, on average, in Portugal:

average number of coincidences per year = n × p = 7 000 000 × 0.000 045 6 = 319 .

Therefore, a ‘paranormal’ coincidence will occur, on average, about

once per day (and of course at a higher rate in countries with a larger

population). We may modify some of the previous assumptions without

signiﬁcantly changing the main conclusion: ‘paranormal’ coincidences


are in fact entirely normal when we consider a large population. Once

again, given a suﬃciently large number, the ‘improbable’ will happen.
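The arithmetic of the estimate can be laid out as follows (a sketch of the text's figures; variable names ours):

```python
# A sketch of the text's back-of-the-envelope estimate (all figures are
# the book's assumptions).
adult_life_days = 30 * 365.25        # days in 30 years of adult life
p_dream_hit = 1 / adult_life_days    # one death-dream hit in that span
deaths_detectable = 15               # reported deaths whose coincidence could be noticed
p_person_30yr = deaths_detectable * p_dream_hit  # = 1/730.5 over 30 years
p_person_year = p_person_30yr / 30               # ≈ 0.0000456 per person per year

population = 7_000_000               # Portuguese adult population
coincidences_per_year = population * p_person_year  # binomial mean n × p
print(round(coincidences_per_year))  # 319: nearly one 'premonition' per day
```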

3.8 The Chevalier de Méré Problem

On 29 July 1654, Pascal wrote a long letter to Fermat presenting the so-

lution of a game problem raised by the Chevalier de Méré, a member of

the court of Louis XIV already mentioned in Chap. 1. The problem was

as follows. Two players (say Primus and Secundus) bet 32 pistoles (the

French name for a gold coin) in a heads or tails game. The 64 pistoles

would stay with the player that ﬁrst obtained 3 successes (a success

might be, for instance, heads for Primus and tails for Secundus), either

consecutive or not. In the ﬁrst round Primus wins; at that moment the

two players are obliged to leave without ever ﬁnishing the match. How

should the bet be divided in a fair way between them?

Fermat, in a previous letter to Pascal, had proposed a solution based

on the enumeration of all possible cases. Pascal presented Fermat with

another, more elegant solution, based on a hierarchical description of

the problem and introducing the mathematical concept of expectation.

Let us consider Pascal’s solution, based on the enumeration of all the

various paths along which the game might have evolved. It is useful to

represent the enumeration in a tree diagram, as shown in Fig. 3.8.

The circle marked 44 corresponds to the initial situation, where

we assume that Primus won the ﬁrst round. For the moment we do

not know the justiﬁcation for this and other numberings of the circles.

Suppose that Primus wins again; we then move to the circle marked

56, following the branch marked P (Primus wins). If, on the other

hand, Secundus wins the second round, we move downwards to the

circle marked 32, in the tree branch marked S. Thus, upward branches

correspond to Primus wins (P) and downward branches to Secundus

wins (S). Note that, since a Primus loss is a Secundus win and vice
versa, the match can last at most 5 rounds in total. Let us now analyze the tree

from right to left (from the end to the beginning), starting with the gray

circles. The circle marked 64 corresponds to a ﬁnal P win, with three

heads for P according to the sequence PPSSP. P gets the 64 pistoles.

The circle marked 0 corresponds to a ﬁnal win for S with three tails for

S according to the sequence PPSSS. Had the match been interrupted

immediately before this ﬁnal round, what would P’s expectation be?

We have already learned how to do this calculation:


Fig. 3.8. Tree diagram for Pascal’s solution

E(P) = 64 × 1/2 + 0 × 1/2 = 32 .

We thus obtain the situation represented by the hatched circle. For the

paths that start with PPS, we obtain in the same way, at the third

round,

E(P) = 64 × 1/2 + 32 × 1/2 = 48 .

The process for computing the expectation repeats in the same way,

progressing leftward from the end circles, until we arrive at the initial

circle to which the value 44 has been assigned. In this way, when he

leaves, Primus must keep 44 pistoles and Secundus 20. A problem that

seemed diﬃcult was easily solved using the concept of mathematical

expectation. As a ﬁnal observation, let us note that there are 10 end

circles: 10 possible sequences of rounds. Among these, Primus wins 6

and Secundus 4. In spite of this, the money at stake must not be shared

in the ratio 6/10 (38.4 pistoles) for Primus and 4/10 (25.6 pistoles) for

Secundus!
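Pascal's right-to-left evaluation of the tree is what is nowadays called backward induction, and it is easy to mechanize. A minimal sketch (function name ours):

```python
# Backward induction over Pascal's tree (sketch; function name ours).
from functools import lru_cache

@lru_cache(maxsize=None)
def expectation(p_needs, s_needs, stake=64):
    """Primus's expected share of the stake when Primus still needs
    p_needs successes and Secundus needs s_needs, rounds being fair."""
    if p_needs == 0:
        return stake   # Primus has won: he takes the 64 pistoles
    if s_needs == 0:
        return 0       # Secundus has won
    return (expectation(p_needs - 1, s_needs, stake)
            + expectation(p_needs, s_needs - 1, stake)) / 2

# After the first round Primus needs 2 more successes, Secundus 3.
print(expectation(2, 3))   # 44.0 -> Primus keeps 44 pistoles, Secundus 20
```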

Once more a simple and entertaining problem has important prac-

tical consequences. In fact, Pascal's reasoning, based on an end-to-start
analysis, is similar to the reasoning that leads to the price calculation
of a stock market option, particularly in the context of a famous formula
established by Fischer Black and Myron Scholes in 1973 (work for which
Scholes shared the Nobel Prize for Economics in 1997; Black had died
two years earlier).


3.9 Martingales

The quest for strategies that allow one to win games is probably as

old an endeavor as mankind itself. Let us imagine the following game

between two opponents A and B. Both have before them a stack of

coins, all of the same type. Each player stacks his coins in advance and,
at each round, both draw the top coin from their stacks simultaneously.
If the drawn coins show the same

face, A wins from B in the following way: 9 euros if it is heads–heads and

1 euro if it is tails–tails. If the upper faces of the coins are diﬀerent, B

wins 5 euros from A no matter whether it is heads–tails or tails–heads.

Player A will tend to pile the coins up in such a way as to exhibit

heads more often, expecting to ﬁnd many heads–heads situations. On

the other hand, B, knowing that A will try to show up heads, will try

to show tails.

One might think that, since the average gain when A wins, viz.,
(9 + 1)/2, and the average gain when B wins, viz., (5 + 5)/2,
are the same, it will not really matter how the coins are piled up.

However, this is not true. First, consider A piling up the coins so that

they always show up heads. Let t represent the fraction of the stack

in which B puts heads facing up and let us see how the gain of B

varies with t. Let us denote that gain by G. B must then pay 9 euros to

A during a fraction t of the game, but will win 5 euros during a fraction

1 −t of the game. That is, the gain of B is

G(stack A: heads) = −9t + 5(1 −t) = −14t + 5 .

If the coins of stack A show tails, one likewise obtains

G(stack A: tails) = +5t −1(1 −t) = 6t −1 .

As can be seen from Fig. 3.9, the B gain lines for the two situations

intersect at the point t = 0.3, with a positive gain for B of 0.8 euros.

What has been said for a stack segment where A placed the same

coin face showing up applies to the whole stack. This means that if

B stacks his coins in such a way that three tenths of them, randomly

distributed in the stack, show up heads, he will expect to win, during

a long sequence of rounds, an average of 0.8 euros per round.
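The two gain lines and their crossing point can be checked directly (a sketch of the text's calculation; function names ours):

```python
# B's expected gain per round, as a function of the fraction t of rounds
# in which B shows heads, for each of A's two pure choices (sketch of
# the text's calculation; function names ours).
def gain_vs_heads(t):
    """A always shows heads: B pays 9 on heads-heads, wins 5 otherwise."""
    return -9 * t + 5 * (1 - t)          # = -14t + 5

def gain_vs_tails(t):
    """A always shows tails: B wins 5 on heads-tails, pays 1 on tails-tails."""
    return 5 * t - 1 * (1 - t)           # = 6t - 1

t = 0.3  # crossing point of the two gain lines
print(round(gain_vs_heads(t), 2), round(gain_vs_tails(t), 2))  # 0.8 0.8
```

At t = 0.3 B's gain is 0.8 euro per round whatever A does, which is why the mixture is a winning strategy for B.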

The word ‘martingale’ is used in games of chance to refer to a win-

ning strategy, like the one just described. The scientiﬁc notion of mar-

tingale is associated with the notion of an equitable game. This no-

tion appeared for the first time in the book The Doctrine of Chances
by Abraham de Moivre, already mentioned earlier, and plays an im-

portant role in probability theory and other branches of mathematics.


Fig. 3.9. How the gains of B vary with a fraction t of the stack where he

plays heads facing up

The word comes from the Provençal word ‘martegalo’ (originating from

the town of Martigues), and described a leather strap that prevented

a horse from lifting up its head. A martingale is similarly intended to

hold sway over chance.

One must distinguish true and false martingales. Publications some-

times appear proposing winning systems for the lotto, toto, and so on.

One such system for winning the lotto is based on the analysis of graphs

representing the frequency of occurrence of each number as well as com-

binations of numbers. Of course such information does not allow one to

forecast the winning bet; this is just one more version of the gambler’s

fallacy. Other false martingales are more subtle, like the one that we

shall describe at the end of Chap. 4.

3.10 How Not to Lose Much in Games of Chance

Let us suppose that the reader, possessing an initial capital of I euros,

has decided to play roulette with even–odd- or red–black-type bets, set

at a constant value of a euros. The gain–loss sequence constitutes, in

this case, a coin-tossing type of sequence, with probability p = 18/37 =

0.486 of winning and q = 1−p of losing. Let us further suppose that the

reader decides at the outset to keep playing either until his/her capital
reaches F euros or until ruined by losing the initial capital. Jacob Bernoulli

demonstrated in 1680 that such a game ends in a ﬁnite number of

steps with a winning probability given by


Table 3.2. Roulette for different size bets. F represents the capital at which
the player stops and P(F) is the probability of reaching that value

50 euro bets        100 euro bets       200 euro bets
F      P(F)         F      P(F)         F      P(F)
1050   0.9225       1100   0.8826       1200   0.8100
1100   0.8527       1200   0.7853       1400   0.6747
1150   0.7896       1300   0.7034       1600   0.5736
1200   0.7324       1400   0.6337       1800   0.4952
1250   0.6804       1500   0.5736       2000   0.4328
1300   0.6330       1600   0.5215       2200   0.3820
1350   0.5896       1700   0.4758       2400   0.3399
1400   0.5498       1800   0.4356       2600   0.3045

P(winning F) = P(F) = [1 − (q/p)^(I/a)] / [1 − (q/p)^(F/a)] .

In the case of roulette, q/p = 1.0556. If the reader starts with an

initial capital of 1 000 euros and makes 50-euro bets, the probability of

reaching 1 200 euros is given by

P(F) = (1 − 1.0556^20) / (1 − 1.0556^24) = 0.7324 .

Table 3.2 shows what happens when we make constant bets of 50, 100

and 200 euros.
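The table entries follow directly from the ruin formula; a sketch that reproduces the F = 1200 column (function name ours):

```python
# Reproducing Table 3.2 from Bernoulli's ruin formula (sketch; names
# ours; p = 18/37 for even-money roulette bets).
def p_reach(initial, bet, target, p=18/37):
    """Probability of reaching `target` euros before ruin, betting a
    constant `bet` with win probability p each round."""
    r = (1 - p) / p                  # q/p = 19/18 for roulette
    return (1 - r ** (initial / bet)) / (1 - r ** (target / bet))

for bet in (50, 100, 200):
    # matches the table's F = 1200 entries: 0.7324, 0.7853, 0.8100
    print(bet, round(p_reach(1000, bet, 1200), 4))
```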

Looking at the values for situations with F = 1200 euros, printed

in bold in the table, we see that the gambler would be advised to stay

in the game the least time possible, whence he must risk the highest
possible amount in each move. In the above example, playing 200-euro
bets gives an 81% probability of reaching the goal set by the
gambler. This probability value is near the theoretical maximum value

P(F) = 1000/1200 = 83.33%. If the bet amounts can vary, there are

then betting strategies to obtain a probability close to that maximum

value. One such strategy, proposed by Lester Dubins and Leonard Sav-

age in 1956, is inspired by the idea of always betting the highest possible

amount that allows one to reach the proposed goal. In the example we

have been discussing, we should then set the sequence of bets as shown

in Fig. 3.10.

Fig. 3.10. The daring bet strategy proposed by Dubins and Savage

The following probabilities can be computed from the diagram:

P(going from 1000 to 1200) = p + q × P(going from 800 to 1200) ,
P(going from 800 to 1200) = p + q × p × P(going from 800 to 1200) .

From this one concludes that

P(going from 1000 to 1200) = p + qp/(1 − qp) = 0.8195 ,

closer to the theoretical maximum.
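The little system of equations solves in two lines (a sketch; variable names ours):

```python
# Solving the two bold-play equations of the text directly
# (sketch; names ours).
p = 18 / 37
q = 1 - p

# P(800 -> 1200) satisfies x = p + q*p*x, hence:
x_800 = p / (1 - q * p)
# P(1000 -> 1200) = p + q * P(800 -> 1200):
p_1000 = p + q * x_800
print(round(p_1000, 4))   # 0.8195
```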

As a ﬁnal remark let us point out that the martingale of Dubins

and Savage looks quite close to a martingale proposed by the French

mathematician Jean Le Rond d’Alembert (1717–1783), according to

which the gambler should keep doubling his/her bet until he/she wins

or until he/she no longer has enough money to double the bet. For the above
example, this would correspond to stopping when the capital drops to 400 euros,
with a probability of winning of 0.7363.
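The quoted value corresponds to losing both doubled bets in a row, i.e., P(win) = 1 − q²; as a quick check (sketch):

```python
# Doubling from an initial bet of 200 euros towards 1200: the gambler
# loses only if both bets (200, then 400) are lost, so
# P(win) = 1 - q**2 (sketch; p = 18/37 for even-money roulette bets).
p = 18 / 37
q = 1 - p
p_win = 1 - q ** 2
print(round(p_win, 4))   # 0.7363
```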

Needless to say, these martingales should not be viewed as an in-

citement to gambling – quite the opposite. As a matter of fact, there

is even a result about when one should stop a constant-bet game of

chance in order to maximize the winning expectation. The result is:

don’t play at all! In the meantime we would like to draw the reader’s

attention to the fact that all these game-derived results have plenty of

practical applications in diﬀerent areas of human knowledge.

4 The Wonderful Curve

4.1 Approximating the Binomial Law

The binomial distribution was introduced in Chap. 1 and its mathe-

matical expectation in the last chapter. We saw how the binomial law

allows us to determine the probability that an event, of which we know

the individual probability, occurs a certain number of times when the

corresponding random experiment is repeated.

Take for instance the following problem. The probability that a pa-

tient gets infected with a certain bacterial strain in a given hospital

environment is p = 0.0055. What is the probability of ﬁnding more

than 10 infected people in such a hospital with 1000 patients? Apply-

ing the binomial law, we have

P(more than 10 events in 1000) = P(11 events) + P(12 events) + · · · + P(1000 events)
= C(1000, 11) × 0.0055^11 × 0.9945^989 + C(1000, 12) × 0.0055^12 × 0.9945^988
+ · · · + C(1000, 1000) × 0.0055^1000 × 0.9945^0 ,

where C(n, k) denotes the binomial coefficient ‘n choose k’.

The diﬃculty in carrying out these calculations can be easily imagined.

For instance, calculating C(1000, 500) involves performing the multiplication
1000 × 999 × · · · × 501 and then dividing by 500 × 499 × · · · × 2.


Fig. 4.1. De Moivre’s approximation (open circles) of the binomial distribu-

tion (black bars). (a) n = 5. (b) n = 30

Even if one applies the complement rule – subtracting from 1 the sum

of the probabilities of 0 through 10 events – one still gets entangled in

painful calculations. When using a computer, it is not uncommon for

this type of calculation to exceed the number representation capacity.
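With arbitrary-precision integers the complement-rule computation is in fact feasible today; a sketch (names ours):

```python
# Exact computation via the complement rule; Python's integers have
# arbitrary precision, so math.comb(1000, k) cannot overflow
# (sketch; names ours).
import math

n, p = 1000, 0.0055

def binom_pmf(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(more than 10) = 1 - P(0) - P(1) - ... - P(10)
p_more_than_10 = 1 - sum(binom_pmf(k) for k in range(11))
print(p_more_than_10)
```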

Abraham de Moivre, mentioned several times in previous chapters,

was the first to feel the need for an easier way of computing factorials.
(In his day, performing the above calculations was an extremely

tedious task.) That easier way (known as Stirling’s formula) led him

to an approximation of the binomial law for high values of n. Denot-

ing the mathematical expectation of the binomial distribution by m

(m = np) and the quantity np(1−p) by v, the approximation obtained

by Abraham de Moivre, assuming that m is not too small, was the

following:

C(n, k) × p^k × (1 − p)^(n−k) ≈ [1/√(2πv)] × e^(−(k−m)²/2v) .

We have already found Napier’s number, universally denoted by the

letter e, with the approximate value 2.7, when discussing the problem

of encounters (Chap. 2). Let us see if De Moivre’s approximation works

for the example in Chap. 1 of obtaining 3 heads in 5 throws of a fair

coin:

m = np = 5/2 = 2.5 , v = np(1 − p) = 2.5/2 = 1.25 ,
k − m = 3 − 2.5 = 0.5 ,

whence

P(3 heads in 5 throws) ≈ (1/√7.85) × 2.7^(−0.25/2.5) = (1/2.8) × 2.7^(−0.1) = 0.32 ,


which is really quite close to the value of 0.31 we found in Chap. 1.

Figure 4.1 shows the probability function of the binomial distribution

(bars) for n = 5 and n = 30 and Abraham de Moivre’s approximation

(open circles). For n = 5, the maximum deviation is 0.011; for n = 30,

the maximum deviation comes down to a mere 0.0012.
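The n = 5, k = 3 check of the approximation can be reproduced as follows (sketch; names ours):

```python
# Checking de Moivre's approximation for 3 heads in 5 throws of a fair
# coin (sketch; names ours).
import math

n, p, k = 5, 0.5, 3
m = n * p               # mathematical expectation, 2.5
v = n * p * (1 - p)     # np(1-p), 1.25

exact = math.comb(n, k) * p ** k * (1 - p) ** (n - k)
approx = math.exp(-(k - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
print(exact, round(approx, 2))   # 0.3125 0.32
```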

4.2 Errors and the Bell Curve

Abraham de Moivre’s approximation (described in a paper published in

1738, with the second edition of the book The Doctrine of Chances) was

afterwards worked out in detail by the French mathematician Pierre-

Simon Laplace (1749–1827) in his book Théorie Analytique des Probabilités
(1812) and used to analyze the errors in astronomical measure-

ments. More or less at the same time (1809), the German mathemati-

cian Johann Carl Friedrich Gauss (1777–1855) rigorously justiﬁed the

distribution of certain sequences of measurement errors (particularly

geodesic measurements) in agreement with that formula.

It is important at this point to draw the reader’s attention to the

fact that, in science, the word ‘error’ does not have its everyday mean-

ing of ‘fault’ or ‘mistake’. This semantic diﬀerence is often misunder-

stood. When measuring experimental outcomes, the word ‘error’ refers

to the unavoidable uncertainty associated with any measurement, how-

ever aptly and carefully the measurement procedure has been applied.

The error (uncertainty) associated with a measurement is, therefore,

the eﬀect of chance phenomena; in other words, it is a random vari-

able.

Consider weighing a necklace using the spring balance with 2 g

minimum scale resolution shown in Fig. 4.2. Suppose you measure 45 g

for the weight of the necklace. There are several uncertainty factors

inﬂuencing this measurement:

• The measurement system resolution: the balance hand and the scale

lines have a certain width. Since the hand of Fig. 4.2 has a width

almost corresponding to 1 g, it is diﬃcult to discriminate weights

diﬀering by less than 1 g.

• The interpolation error that leads one to measure 45 g because the

scale hand was approximately half-way between 44 g and 46 g.

• The parallax error, related to the fact that it is diﬃcult to guarantee

a perfect vertical alignment of our line of sight and the scale plane;

for instance, in Fig. 4.2, if we look at the hand with our line of sight


to its right, we read a value close to 250 g; if we look to its left, we

read a value close to 2 g.

• The background noise, corresponding in this case to spring vibra-

tions that are transmitted to the hand.

If we use a digital balance, we eliminate the parallax and interpolation

errors, but the eﬀects of limited resolution (the electronic sensor does

not measure pressures below a certain value) and background noise

(such as the thermal noise in the electronic circuitry) are still present.

Conscious of these uncertainty factors, suppose we weighed the neck-

lace ﬁfty times and obtained the following sequence of ﬁfty measure-

ments:

44, 45, 46, 43, 47, 47, 45, 48, 48, 44, 43, 45, 46, 44, 43, 44, 45, 49, 46, 48,

45, 43, 46, 45, 44, 44, 45, 45, 46, 45, 46, 48, 47, 44, 49, 44, 47, 45, 46, 48,

46, 46, 47, 49, 45, 45, 47, 46, 46, 45 .

The average weight is 45.68 g. It seems reasonable to consider that

since the various error sources act independently of one another, with

contributions on either side of the true value, the errors should have

a tendency to cancel out when the weight is determined in a long se-

ries of measurements. Things happen as though the values of a random

variable (the ‘error’) were added to the true necklace weight, this ran-

dom variable having zero mathematical expectation. In a long sequence


of measurements we expect, from the law of large numbers, the average

of the errors to be close to zero, and we thus obtain the true weight

value.

Let us therefore take 45.68 g as the true necklace weight and consider

now the percentage of times that a certain error value will occur (by

ﬁrst subtracting the average from each value of the above sequence).

Figure 4.3 shows how these percentages (vertical bars) vary with the

error value. This graphical representation of frequencies of occurrence

of a given random variable (the error, in our case) is called a histogram.

For instance, an error of −0.68 g occurred 13 times, which is 26% of the time.

Laplace and Gauss discovered independently that, for many error

sequences of the most varied nature (relating to astronomical, geodesic,

geomagnetic, electromagnetic measurements, and so on), the frequency

distributions of the errors were well approximated by the values of the

function

f(x) = [1/√(2πν)] e^(−x²/2ν) ,

where x is the error variable and ν represents the so-called mean square

error. Let us try to understand this. If we subtract the average weight

from the above measurement sequence, we obtain the following error se-

quence: −1.68, −0.68, 0.32, −2.68, 1.32, 1.32, −0.68, 2.32, 2.32, −1.68,

−2.68, etc. We see that there are positive and negative errors. If we get

many errors of high magnitude, no matter whether they are positive or

negative, the graph in Fig. 4.3 will become signiﬁcantly stretched. In

the opposite direction, if all the errors are of very small magnitude the

whole error distribution will be concentrated around zero. Now, one

way to focus on the contribution of each error to the larger or smaller


stretching of the graph, regardless of its sign (positive or negative),
consists in taking its square. So when we add the squares of all errors

and divide by their number, we obtain a measure of the spread of the

error distribution. The mean square error is precisely such a measure

of the spread:

ν = (1.68² + 0.68² + 0.32² + 2.68² + 1.32² + · · ·) / 50 = 2.58 g² .

The mean square error, ν, is also called the variance. The square root of
the variance, viz., s = √ν, is called the standard deviation. In Fig. 4.3,
the value of the standard deviation (√2.58 = 1.61 g) is indicated by
the double-ended arrow.
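The statistics quoted for the fifty weighings can be recomputed directly (a sketch; variable names ours):

```python
# Recomputing the mean, variance and standard deviation of the fifty
# necklace weighings quoted above (sketch; names ours).
import math

weights = [44,45,46,43,47,47,45,48,48,44,43,45,46,44,43,44,45,49,46,48,
           45,43,46,45,44,44,45,45,46,45,46,48,47,44,49,44,47,45,46,48,
           46,46,47,49,45,45,47,46,46,45]

mean = sum(weights) / len(weights)                   # 45.68 g
errors = [w - mean for w in weights]
variance = sum(e ** 2 for e in errors) / len(errors) # mean square error
std_dev = math.sqrt(variance)

print(mean, round(variance, 2), round(std_dev, 2))   # 45.68 2.58 1.61
```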

The dotted line in Fig. 4.3 also shows the graph of the function

f(x) found by Laplace and Gauss, and computed with ν = 2.58 g².

We see that the function seems to provide a reasonable approximation

to the histogram values. As we increase the number of measurement

repetitions, the function gradually fits better, until the bar values are
practically indistinguishable from the corresponding values on the dotted line.

This bell-shaped curve often appears when we study chance phenom-

ena. Gauss called it the normal distribution curve, because error dis-

tributions normally follow such a law.

4.3 A Continuum of Chances

Let us picture a riﬂe shooting at a target like the one shown in Fig. 4.4.

The riﬂe is assumed to be at a perfectly ﬁxed position, aiming at the

target center. Despite the ﬁxed riﬂe position, the position of the bullet

impact is random, depending on chance phenomena such as the wind,

riﬂe vibrations, barrel heating, barrel grooves wearing out, and so on.

If we use the frequentist interpretation of probability, we may easily

obtain probability estimates of hitting a certain target area. For in-

stance, if out of 5000 shots hitting the target 2000 fall in the innermost

circle, an estimate of P(shot in innermost circle) is 2/5. We may apply

the same procedure to the outermost circle; suppose we obtained 3/5.

The estimated probability of a hit in the circular ring is then 1/5 and

of a hit outside the outermost circle is 1 −3/5 = 2/5. We are assuming

that by construction there can be no shots falling outside the target.

We now proceed to consider a small circular area of the target (in

any arbitrarily chosen position) for which we determine by the above

procedure the probability of a hit. Suppose we go on decreasing the


circle radius, therefore shrinking the circular region. The probability

of a hit will also decrease in a way approximately proportional to the

area of the circle when the area is suﬃciently small and the shots are

uniformly spread out over the circle. In other words, there comes a time

when P(shot in the circle) is approximately given by k ×a, where a is

the area of the circle and k is a proportionality factor. In order to obtain

arbitrarily good estimates of P(shot in the circle) for a certain value of

a, we would of course need a large number of shots on the target, and

all the more as a gets smaller.

We thus draw two conclusions from our imaginary experiment of

shooting at a target: the hit probability in a certain region continu-

ally decreases as the region shrinks; for suﬃciently small regions, the

probability is practically proportional to the area of the region. It is as

if the total probability, 1, of hitting the target was ﬁnely smeared out

over the whole target. In a very small region we can speak of probability

density per unit area, and our previous k factor is

k = P(shot in region) / a .

When the region shrinks right down to a point, the probability of a hit

is zero because the area is zero:

P(hit a point) = 0 (always, in continuous domains) .

We are now going to imagine that the whole riﬂe contraption was

mounted in such a way that the shots could only hit certain pre-

established points in the target and nothing else. If we randomly select

a small region and proceed to shrink it, we will almost always obtain

discontinuous variations of P(shot in region), depending on whether

the region includes the pre-established points or not. In this case the


probability of hitting a point is not zero for the pre-established points.

The total probability is not continuously smeared out on the target; it

is discretely divided among the pre-established points. The domain of

possible impact points is no longer continuous but discrete. We have in

fact the situation of point-like probabilities of discrete events (the pre-

established points) that we have already considered in the preceding

chapters.

Let us now go back to the issue of P(hit a point) equalling zero

for continuous domains. Does this mean that the point-bullet never

hits a speciﬁed point? Surely not! But it does it a ﬁnite number of

times when the number of shots is inﬁnite, so that, according to the

frequentist approach, the value of P(hit a point) is 0. For discrete
domains (shots at pre-established points), by contrast, hitting
a specific point happens an infinite number of times.

Operating with probabilities in a continuum of chances demands

some extension of the notions we have already learned (not to men-

tion some technical diﬃculties that we will not discuss). In order to

do that, let us consider a similar situation to shooting at a target, but

rather simpler, with the ‘shot’ on a straight-line segment. Once again

the probability of hitting a given point is zero. However, the probability

of the shot falling into a given straight line interval is not zero. We may

divide the interval into smaller intervals and still obtain non-zero prob-

abilities. It is as if our degree of certainty that the shot falls in a given

interval were ﬁnely smeared out along the line segment, resulting in

a probability density per unit length. Thinking of the probability mea-

sure as a mass, there would exist a mass density along the line segment

(and along any segment interval). Figure 4.5a illustrates this situation:

a line segment, between 0 and 1, has a certain probability density of

a hit. Many of the bullets shot by the riﬂe fall near the center (dark

gray); as we move towards the segment ends, the hit probability den-

sity decreases. The ﬁgure also shows the probability density function,

abbreviated to p.d.f. We see how the probability density is large at the

center and drops towards the segment ends. If we wish to determine

the probability of a shot hitting the interval from 0.6 to 0.8, we have

only to determine the hatched area of Fig. 4.5a. (In the mass analogy,

the hatched area is the mass of the interval.) In Fig. 4.5b, the probabil-

ity density is constant. No interval is favored with more hits than any

other equal-length interval. This is the so-called uniform distribution.

Uniformity is the equivalent for continuous domains of equiprobability

(equal likelihood) for discrete domains.

4.4 Buﬀon’s Needle 75

Fig. 4.5. Two cases of the probability density of shots on [0, 1]. (a) The

density is larger at the center. (b) The density is uniform

In the initial situation of shooting at a target, the p.d.f. is a certain

surface, high at the center and decaying towards the periphery. The

probability in a certain region corresponds to the volume enclosed by

the surface on that region. In the two situations, either the total area

(Fig. 4.5) or the total volume (Fig. 4.4) represents the probability of

the sure event, that is, they are equal to 1. Previously, in order to de-

termine the probability of an event supported by a discrete domain, we

added probabilities of elementary events. Now, in continuous domains,

we substitute sums by areas and volumes. With the exception of this

modiﬁcation, we can continue to use the basic rules of Chap. 1.

The normal probability function presented in the last section is in

fact a probability density function. In the measurement error example,

the respective values do not constitute a continuum; it is easy to see that

the values are discrete (for instance, there are no error values between

0.32 g and 0.68 g). Moreover, the error set is not inﬁnite. Here as in

other situations one must bear in mind that a continuous distribution

is merely a mathematical model of reality.

4.4 Buﬀon’s Needle

Georges-Louis Leclerc, Count of Buﬀon (1707–1788), is known for his

remarkable contributions to zoology, botany, geology and in particular

for a range of bold ideas, many ahead of his time. For instance, he

argued that natural causes formed the planets and that life on Earth

was the consequence of the formation of organic essences resulting from

the action of heat on bituminous substances. Less well known is his

important contribution to mathematics, and particularly to probability

theory.

Buﬀon can be considered a pioneer in the experimental study and

application of probability theory, a speciﬁc example of which is the fa-

mous Buﬀon’s needle. It can be described as follows. Consider randomly

throwing a needle of length l on a ﬂoor with parallel lines separated by


d ≥ l (think, for instance, of the plank separations of a wooden ﬂoor).

What is the probability that the needle falls over a line?

We are dealing here with a problem of continuous probabilities sug-

gested to Buﬀon by a fashionable game of his time known as franc-

carreau. The idea was to bet that a coin tossed over a tile ﬂoor would

fall completely inside a tile. For the needle problem, Buﬀon demon-

strated that the required probability is given by

P = 2l/(πd) .

For d = l, one then obtains 2/π, i.e., Buﬀon’s needle experiment oﬀers

a way of computing π ! The interested reader may ﬁnd several Internet

sites with simulations of Buﬀon’s needle experiment and estimates of

π using this method.
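Such a simulation is easy to write; this sketch (names ours) draws the needle's centre distance and angle and counts crossings:

```python
# Monte Carlo simulation of Buffon's needle with l = d = 1 (sketch;
# names ours).  Drop the needle by drawing the distance x from its
# centre to the nearest line and its angle theta against the lines;
# the needle crosses a line when x <= (l/2)*sin(theta).
import math
import random

def estimate_pi(throws, l=1.0, d=1.0, seed=0):
    rng = random.Random(seed)
    crossings = 0
    for _ in range(throws):
        x = rng.uniform(0, d / 2)
        theta = rng.uniform(0, math.pi / 2)
        if x <= (l / 2) * math.sin(theta):
            crossings += 1
    p_hat = crossings / throws       # estimates P = 2l/(pi*d)
    return 2 * l / (d * p_hat)

print(estimate_pi(200_000))          # close to 3.14
```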

4.5 Monte Carlo Methods

It seems that Buﬀon did indeed perform long series of needle throws on

a floor in order to verify experimentally the result that he had derived
mathematically. Nowadays, we would hardly engage in such labor, since
we have at our disposal a superb tool for carrying out any experiments
relating to chance phenomena: the computer.

The Monte Carlo method – due to the mathematician Stanislaw

Ulam (1909–1984) – refers to any problem-solving method based on

the use of random numbers, usually with computers.

Consider the problem of computing the shaded area of Fig. 4.6 cir-

cumscribed by a unit square. In this example, the formula of the curve


delimiting the area of interest is assumed to be known, in this case

y = e

−x

. This allows us, by mathematical operations, to obtain the

value of the area as 1 − 1/e, and hence the value of e. However, it of-

ten happens that the curve formula is such that one cannot directly

determine the value of the area, or the formula may even be unknown.

In such cases one has to resort to numerical methods. A well-known

method for computing area is to divide the horizontal axis into suﬃ-

ciently small intervals and add up the vertical ‘slices’ of the relevant

region, approximating each slice by a trapezium. Such computation

methods are deterministic: they use a well deﬁned rule.

In certain situations it may be more attractive (in particular, with

regard to computation time) to use another approach that is said to be

stochastic (from the Greek ‘stokhastikós’ for a phenomenon or method

whose individual cases depend on chance) because it uses probabilities.

Let us investigate the stochastic or Monte Carlo method for computing

area. For this purpose we imagine that we throw points in a haphazard
way onto the domain containing the area of interest (the
square domain of Fig. 4.6). We also assume that the randomly thrown

points are uniformly distributed in the respective domain; that is, for

a suﬃciently large number of points any equal area subdomain has

an equal probability of being hit by a point. Figure 4.7 shows sets of

points obtained with a uniform probability density function, thrown

onto a unit square. Let d be the number of points falling inside the

relevant region and n the total number of points thrown. Then, for

suﬃciently large n, the ratio d/n is an estimate of the relevant area

with given accuracy (according to the law of large numbers).

Fig. 4.7. Examples of uniformly distributed points. (a) 100 points. (b) 4000 points

Fig. 4.8. Six Monte Carlo trials for the estimation of e

This method is easy to implement using a computer. We only need a random number generator, that is, a function that supplies numbers as if they were generated by a random process. (One could also use roulette-generated numbers, whence the name Monte Carlo.) Many

programming languages provide random (in fact, pseudo-random) num-

ber generating functions. The randomness of the generated numbers

(the quality of the generator) has an inﬂuence on the results of Monte

Carlo methods.

Figure 4.8 shows the dependence e(n) of the estimate e on n in six

Monte Carlo trials corresponding to Fig. 4.6. By inspection of Fig. 4.8,

one sees that the law of large numbers is also at work here. The estimate

tends to stabilize around the value of e (dotted line) for suﬃciently

large n.
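The experiment of Fig. 4.8 can be reproduced in a few lines of Python (an illustrative sketch, not code from the book; the function name estimate_e and the fixed seed are our own choices):

```python
import math
import random

def estimate_e(n, seed=0):
    """Monte Carlo estimate of e from the area under y = exp(-x) on [0, 1].

    The shaded area of Fig. 4.6 is 1 - 1/e, so e = 1 / (1 - area)."""
    rng = random.Random(seed)
    d = 0                                  # points falling under the curve
    for _ in range(n):
        x, y = rng.random(), rng.random()  # a uniform point in the unit square
        if y < math.exp(-x):
            d += 1
    area = d / n                           # d/n -> 1 - 1/e by the law of large numbers
    return 1.0 / (1.0 - area)

print(estimate_e(100_000))  # close to e = 2.71828...
```

Each different seed gives one of the fluctuating curves of Fig. 4.8; increasing n stabilizes the estimate, exactly as the law of large numbers predicts.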

Monte Carlo methods as described here are currently used in many

scientiﬁc and technical areas: aerodynamic studies, forecasts involving

a large number of variables (meteorology, economics, etc.), optimization

problems, special eﬀects in movies, video games, to name but a few.

4.6 Normal and Binomial Distributions

If the reader compares the formula approximating the binomial dis-

tribution, found by Abraham de Moivre, and the probability density

function of the normal distribution, found by Laplace and Gauss, (s)he

will immediately notice their similarity. As a matter of fact, for large

n, the binomial distribution is well approximated by the normal one, if

one uses

x = k −m , m = np , ν = np(1 −p) .

4.7 Distribution of the Arithmetic Mean 79

Fig. 4.9. The value of P(more than 10 events in 1000) (shaded area) using the normal approximation

The mathematical expectation (mean) of the normal distribution is

the same as that of the binomial distribution: m = np. The variance is

np(1 − p), whence the standard deviation is s = √(np(1 − p)).

We can now go back to the initial problem of this chapter. For

n = 1000 the approximation of the normal law to the binomial law is

excellent (it is usually considered good for most applications when n is

larger than 20 and the product np is not too low, conditions satisﬁed

in our example). Let us then consider the normal law using

m = np = 1000 ×0.0055 = 5.5 ,

ν = np(1 −p) = 5.5 ×0.9945 = 5.47 ,

whence s = 2.34.

We then have the value of P(more than 10 events in 1000) corre-

sponding to the shaded area of Fig. 4.9. There are tables and computer

programs that supply values of the normal p.d.f. areas.¹ In this way,

one can obtain the value of the area as 0.01. The probability of ﬁnd-

ing more than 10 infected people among the 1000 hospital patients is

therefore 1%.
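The 1% figure can be checked with the complementary error function from Python's standard library (an illustrative sketch, not code from the book). Since the number of infected people is a count, ‘more than 10’ means 11 or more, and the normal area beyond 11 reproduces the quoted value:

```python
import math

def normal_tail(x, m, s):
    """P(X > x) for a normal variable with mean m and standard deviation s."""
    z = (x - m) / s
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z), via the error function

m = 1000 * 0.0055                # mean 5.5
s = math.sqrt(m * (1 - 0.0055))  # standard deviation 2.34
# 'More than 10' events means 11 or more for a count variable:
print(normal_tail(11, m, s))     # about 0.01
```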

4.7 Distribution of the Arithmetic Mean

When talking about the measurement error problem we mentioned the

arithmetic mean as a better estimate of the desired measurement. We

are now in a position to analyze this matter in greater detail.

Consider a sequence of n values, obtained independently of each

other and originating from identical random phenomena. Such is the

¹ Assuming zero mean and unit variance. Simple manipulations are required to find

the area for other values.


usual measurement scenario: no measurement depends on the others

and all of them are under the inﬂuence of the same random error

sources. An important result, independently discovered by Laplace and

Gauss, is the following: if each measurement follows the normal dis-

tribution with mean (mathematical expectation) m and standard de-

viation s, then, under the above conditions, the arithmetic mean (the

computed average) random variable also follows the normal distribution

with the same mean (mathematical expectation) and with standard de-

viation given by s/√n.

Let us assume that in the necklace weighing example some divin-

ity had informed us that the true necklace weight was 45 g and the

true standard deviation of the measurement process was 1.5 g. In

Fig. 4.10, the solid line shows the normal p.d.f. relative to one mea-

surement while the dotted line shows the p.d.f. of the arithmetic mean

of 10 measurements. Both have the same mathematical expectation

of 45 g. However, the standard deviation of the arithmetic mean is

s/√n = 1.5/√10 = 0.47 g, about three times smaller; the p.d.f. of the

arithmetic mean is therefore about three times more concentrated, as

illustrated in Fig. 4.10. As we increase the number of measurements the

distribution of the arithmetic mean gets gradually more concentrated,

forcing it to converge rapidly (in probability) to the mathematical ex-

pectation. We may then interpret the mathematical expectation as the

‘true value of the arithmetic mean’ (‘the true weight value for the neck-

lace’) that would be obtained had we performed an inﬁnite number of

measurements.

Suppose we wish to determine an interval of weights around the

true weight value into which one measurement would fall with 95%

probability. It so happens that, for a normal p.d.f., such an interval

starts (to a good approximation) at the distribution mean minus two

Fig. 4.10. Weighing a necklace: p.d.f. of the error in a single measurement

(solid curve) and for the average of 10 measurements (dotted curve)


standard deviations and ends at the mean plus two standard deviations.

Thus, for the necklace example, the interval is 45±3 g (3 = 2×1.5). If

we take the average of 10 measurements, the interval is then 45±0.94 g

(0.94 = 2 ×0.47).

In conclusion, whereas for one measurement one has a 95% degree

of certainty that it will fall in the interval 45 ± 3 g, for the average

of 10 measurements and the same degree of certainty, the interval is

reduced to 45±0.94 g. Therefore, the weight estimate is clearly better:

in the ﬁrst case 95 out of 100 measurements will produce results that

may be 3 g away from the true value; in the second case 95 out of

100 averages of 10 measurements will produce results that are at most

about 1 gram away. Computing averages of repeated measurements is

clearly an advantageous method in order to obtain better estimates,

and for this reason it is widely used in practice.
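The s/√n law is easy to verify by simulation. The sketch below (illustrative Python, not from the book, using the divinity's values of 45 g and 1.5 g as assumptions) generates many measurement sets and computes the spread of their averages:

```python
import random
import statistics

def spread_of_means(n_measurements, n_sets=20_000, m=45.0, s=1.5, seed=0):
    """Standard deviation of the arithmetic mean of n_measurements values."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(m, s) for _ in range(n_measurements))
             for _ in range(n_sets)]
    return statistics.stdev(means)

print(spread_of_means(1))    # close to s = 1.5 (single measurements)
print(spread_of_means(10))   # close to s / sqrt(10) = 0.47
```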

The diﬃculty with the above calculations is that no divinity tells

us the true values of the mean and standard deviation. (If it told us,

we would not have to worry about this problem at all.) Supposing we

computed the arithmetic mean and the standard deviation of 10 mea-

surements many times, we would get a situation like the one illustrated

in Fig. 4.11, which shows, for every 10-measurement set, the arithmetic

mean (solid circle) plus or minus two standard deviations (double-ended

segment) computed from the respective mean square errors. The ﬁg-

ure shows 15 sets of 10 measurements, with trials numbered 1 through

15. The arithmetic means and standard deviations vary for each 10-

measurement set; it can be shown, however, that in a large number of

trials nearly 95% of the intervals ‘arithmetic mean plus or minus two

standard deviations’ contain the true mean value (the value that only

the gods know). In Fig. 4.11, 14 out of the 15 segments, i.e., 93% of

the segments, intersect the line corresponding to the true mean value.

What can then be said when computing the arithmetic mean and

standard deviation of a set of measurements? For the necklace example

one computes for the ﬁfty measurements:

m = 45.68 g , s = 1.61 g .

The standard deviation of the arithmetic mean is therefore 1.61/√50 = 0.23 g. The interval corresponding to two standard deviations is thus

45.68±0.46 g. This interval is usually called the 95% conﬁdence interval

of the arithmetic mean, and indeed 95% certainty is a popular value

for conﬁdence intervals.

What does it mean to say that the 95% conﬁdence interval of the

necklace’s measured weight is 45.68±0.46 g? Does it mean that we are


Fig. 4.11. Computed mean ±2 standard deviations in several trials (numbered 1 to 15) of 10 measurements

95% sure that the true value is inside that interval? This is an oft-stated

but totally incorrect interpretation. In fact, if the true weight were 45 g

and the true standard deviation were 1.5 g, we have already seen that, with 95% certainty, the average of 10 measurements would fall in

the interval 45 ±0.94 g, and not in 45.68 ± 0.46 g. Whatever the true

values of the mean and standard deviation may be, there will always

exist 50-measurement sets with conﬁdence intervals distinct from the

conﬁdence interval of the true mean value. The interpretation is quite

diﬀerent and much more subtle.

Figure 4.11 can help us to work out the correct interpretation. When

computing the 95% conﬁdence interval, we are computing an interval

that will not contain the true mean value only 5% of the time. We may

have had such bad luck that the 50-measurement set obtained is like

set 8 in Fig. 4.11, a set that does not contain the true mean value.

For this set there is 100% certainty that it does not contain the true

mean (the true weight). We may say that it is an atypical measurement

set. The interpretation to be given to the conﬁdence interval is thus as

follows: when proposing the 95% conﬁdence interval for the true mean,

we are only running a 5% risk of using an atypical interval that does

not contain the true mean.
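The coverage property behind Fig. 4.11 can also be checked by simulation. In the hypothetical sketch below (illustrative Python, with sets of 50 measurements as in the necklace example), the true values 45 g and 1.5 g are used only to generate the data; each interval is built, as in the text, from that set's own mean and standard deviation:

```python
import math
import random
import statistics

def coverage(n=50, trials=10_000, m=45.0, s=1.5, seed=1):
    """Fraction of 'mean +/- 2 std. dev. of the mean' intervals containing m."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        data = [rng.gauss(m, s) for _ in range(n)]
        mean = statistics.fmean(data)
        half = 2 * statistics.stdev(data) / math.sqrt(n)  # built from the data alone
        if mean - half <= m <= mean + half:
            hits += 1
    return hits / trials

print(coverage())  # close to 0.95
```

About 5% of the simulated intervals are atypical, like set 8 in Fig. 4.11: they miss the true mean entirely.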

4.8 The Law of Large Numbers Revisited

There is one issue concerning the convergence of means to mathematical

expectations that has still not been made clear enough: when do we

consider a sequence of experiments to be suﬃciently large to ensure an

acceptable estimate of a mathematical expectation?

The notion of conﬁdence interval is going to supply us with an an-

swer. Consider the necklace-weighing example. What is an ‘acceptable

estimate’ of the necklace weight? It all depends on the usefulness of our


estimate. If we want to select pearl necklaces, maybe we can tolerate

a deviation of 2 g between our estimate and the true weight and maybe

one single weighing is enough. If the necklaces are made of diamonds we

will certainly require a smaller tolerance of deviations, say 0.05 g (1/4 of a carat). We may then ask: how many measurements of a diamond

necklace must we perform to ensure that the 95% conﬁdence interval

of the measurements does not deviate by more than 0.05 g? The answer

is now easy to obtain. Assuming a 0.1 g standard deviation, we must

have 2 × 0.1/√n = 0.05, corresponding to n = 16 measurements.

Let us now consider the problem of estimating the probability of an

event. Consider, for instance, a die we suspect has been tampered with,

because face 6 seems to turn up more often than would be expected.

To clear this matter up we wish to determine how many times we

must fairly throw the die in order to obtain an acceptable estimate of

face 6 turning up with 95% conﬁdence. We assume that an acceptable

estimate is one that diﬀers from the true probability value by no more

than 1%. Suppose we throw the die n times and face 6 turns up k times.

We know that k follows the binomial distribution, which for large n

is well approximated by the normal distribution with mean np and

variance np(1−p), where p is the probability we wish to estimate. The

variable we use to estimate p is the frequency of occurrence of face 6:

f = k/n. Now, it is possible to show that, when k is well approximated

by the normal distribution, then so is f, and with mean p and standard

deviation √(p(1 − p))/√n. (Note once again the division by √n.)

We thus arrive at a vicious circle: in order to estimate p, we need to

know the standard deviation which depends on p. The easiest way out

of this vicious circle is to place ourselves in a worst case situation of

the standard deviation. Given its formula, it is easy to check that the

worst case (largest spread) occurs for p = 0.5. Since we saw that a 95%

conﬁdence level corresponds to two standard deviations we have

2 × √0.25/√n = 1/√n .

We just have to equate this simple 1/√n formula to the required 1% tolerance: 1/√n = 0.01, implying n = 10 000. Hence, only after 10 000

throws do we attain a risk of no more than 5% that the conﬁdence

interval f ±0.01 is atypical, i.e., does not contain the true p value.
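A simulation shows the 1/√n bound at work. In the sketch below (illustrative Python, not from the book; the value p_true = 0.2 for the tampered die is an arbitrary assumption), we repeat the 10 000-throw experiment many times and check how often f = k/n lands within 1% of the truth:

```python
import random

def estimate_p(n=10_000, p_true=0.2, trials=300, seed=2):
    """Fraction of repeated experiments where f = k/n falls within 0.01 of p_true."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        k = sum(rng.random() < p_true for _ in range(n))  # number of 6s in n throws
        if abs(k / n - p_true) <= 0.01:
            inside += 1
    return inside / trials

print(estimate_p())  # at least 0.95: the 1/sqrt(n) bound is the worst case (p = 0.5)
```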


4.9 Surveys and Polls

Every day we see reports of surveys and polls in the newspapers and

media. Consider, for instance, the following report of an investigation

carried out amongst the population of Portugal:

Having interviewed 880 people by phone, 125 answered ‘yes’,

while the rest answered ‘no’. The ‘yes’ percentage is, therefore,

14.2% with an error of 3.4% for a conﬁdence degree of 95%.

The vast majority of people have trouble understanding what this

means. Does it mean that the ‘yes’ percentage is 14.2%, but 3.4% of

the people interviewed ﬁlled in the survey forms in the wrong way or

gave ‘incorrect’ answers to the interviewers? Is it the case that only

14.2−3.4%, that is, 10.8% of those interviewed deserve 95% conﬁdence

(i.e., will not change their opinion)? Maybe it means that only 3.4%

of the Portuguese population will change the 14.2% percentage? Or

does it perhaps mean that we are 95% sure (however this may be inter-

preted) that the ‘yes’ vote of the Portuguese population will fall in the

interval from 10.8% to 17.6%? Indeed, many and varied interpretations

seem to be to hand, so let us analyze the meaning of the survey report.

First of all, everything hinges on estimating a certain variable: the

proportion of Portuguese that vote ‘yes’ (with regard to some question).

That estimate is based on an analysis of the answers from a certain set

of 880 people, selected for that purpose. One usually refers to a set of

cases from a population, studied in order to infer conclusions applica-

ble to the whole population, as a sample. In the previous section we

estimated the probability of turning up a 6 with a die by analyzing the

results of 10 000 throws. Instead of throwing the same die 10 000 times,

we could consider 10 000 people each throwing a die, as long as we can guarantee an identical throwing method and equivalent dice. In other words,

it makes no diﬀerence whether we consider performing experiments in

time or across a set of people.

If the throws are made under the same conditions, we may of course,

imagine robots performing the throws instead of people. But one diﬃ-

culty with surveys of this kind is that people are not robots. They are

unequal with regard to knowledge, standard of living, education, geo-

graphical position, and many other factors. We must therefore assume

that these are sources of uncertainty and that the results we hope to

obtain reﬂect all these sources of uncertainty. For instance, if the survey

only covers illiterate people, its result will not reﬂect the source of un-

certainty we refer to as ‘level of education’. Any conclusion drawn from


the survey would only apply to the illiterate part of the Portuguese

population. This is the sampling problem which arises when one must

select a sample.

In order to reﬂect what is going on in the Portuguese population, the

people in our sample must reﬂect all possible nuances in the sources of

uncertainty. One way to achieve this is to select the people randomly.

Suppose we possess information concerning the 7 million identity cards

of the Portuguese adult population. We may number the identity cards

from 1 to 7 000 000. We now imagine seven urns, 6 with 10 balls num-

bered 0 through 9 and the seventh with 8 balls numbered 0 through 7,

corresponding to the millions digit. Random extraction with replace-

ment from the 7 urns would provide a number selecting an identity

card and a person. For obvious practical reasons, people are not se-

lected in this way when carrying out such a survey. It is more usual

to use other ‘almost random’ selection rules; for instance, by random

choice of phone numbers. The penalty is that in this case the survey

results do not apply to the whole Portuguese population but only to

the population that has a telephone (the source of uncertainty referred

to as ‘standard of living’ has thus been reduced).

In the second place, the so-called ‘error’ is not an error. It is half of the width of the estimated confidence interval: 1/√n = 1/√880 = 0.0337.

We are not therefore 95% sure that the probability of a ‘yes’ falls into

the interval from 10.8% to 17.6%, in the sense that if we repeated the

survey many times, 95% of the time the percentage estimate of ‘yes’

answers would fall in that interval. The interpretation is completely

diﬀerent, as we have already seen in the last section. We are only 95%

sure that our conﬁdence interval is a typical interval (one of the many,

95%, of the intervals that in many random repetitions of the survey

would contain the true value of the ‘yes’ proportion of voters). We are

not rid of the fact that, by bad luck (occurring 5% of the time in many

repetitions), an atypical set of 880 people may have been surveyed, for

which the conﬁdence interval does not contain the true value of the

proportion.

Suppose now that in an electoral opinion poll, performed under

the same conditions as in the preceding survey, the following voting

percentages are obtained for political parties A and B, respectively:

40.5% and 33.8%. Since the diﬀerence between these two values is well

above the 3.4% ‘error’, one often hears people drawing the conclusion

that it is almost certain (95% certain) that party A will win over party

B. But is this necessarily so? Assuming typical conﬁdence intervals,

which therefore intersect the true values of the respective percentages,


it may happen that the true percentage voting A is 40.5−3.4 = 37.1%,

whereas the true percentage voting B is 33.8 + 3.4 = 37.2%. In other

words, it may happen that votes for B will outnumber votes for A! In

conclusion, we will only be able to discriminate the voting superiority

of A over B if the respective estimates are set apart by more than

the width of the conﬁdence interval (6.8% in this case). If we want

a tighter discrimination, say 2%, the ‘error’ will then have to be 1%,

that is, 1/√n = 0.01, implying a much bigger sample: n = 10 000!
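The survey arithmetic of this section can be retraced in a few lines (an illustrative sketch, not from the book; margin and sample_size are our own names):

```python
import math

def margin(n):
    """Worst-case 95% half-width: 2 * sqrt(0.25 / n) = 1 / sqrt(n)."""
    return 2 * math.sqrt(0.25 / n)

reported_yes = 125 / 880                    # 14.2% 'yes' answers
err = margin(880)
print(round(100 * err, 1))                  # 3.4, the reported 'error' in percent
print(round(100 * (reported_yes - err), 1),
      round(100 * (reported_yes + err), 1)) # the interval: 10.8 to 17.6

def sample_size(required_margin):
    """Smallest n whose worst-case margin does not exceed required_margin."""
    return math.ceil(1 / required_margin ** 2)

print(sample_size(0.01))                    # 10000 people for a 1% 'error'
```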

4.10 The Ubiquity of Normality

Many random phenomena follow the normal distribution law, namely

those that can be considered as emerging from the addition of many

causes acting independently from one another; even when such causes

have non-normal distributions. Why does this happen? The answer to

this question has to do with a famous result, known as the central limit

theorem, whose first version was presented by de Moivre and Laplace. Afterwards the

theorem was generalized in several ways.

There is a geometrical construction that can quickly convince us of

this result. First, we have to know how to add two random variables.

Take a very simple variable, with only two values, 0 and 1, exactly as

if we tossed a coin in the air. Let us denote this variable by x₁ and let us assume that the probabilities are

P(x₁ = 0) = 0.6 , P(x₁ = 1) = 0.4 .

Supposing we add to x₁ an identical and independent variable, denoted x₂, what would be the probability function of x₁ + x₂? The possible values of the sum are 0, 1 and 2. The value 0 only occurs when the value of both variables is 0. Thus, as the variables are independent, the corresponding probability of x₁ + x₂ = 0 is 0.6 × 0.6 = 0.36. The value 1 occurs when x₁ = 0 and x₂ = 1 or, in the opposite way, x₁ = 1 and x₂ = 0; thus, P(x₁ + x₂ = 1) = 0.6 × 0.4 + 0.4 × 0.6 = 0.48. Finally, x₁ + x₂ = 2 only occurs when the value of both variables is 1; thus, P(x₁ + x₂ = 2) = 0.16. One thus obtains the probability function of

Fig. 4.12b.

Let us now see how we can obtain P(x₁ + x₂) with a geometric construction. For this purpose we imagine that the P(x₁) graph is fixed, while the P(x₂) graph, reflected as in a mirror, ‘walks’ in front of the P(x₁) graph in the direction of increasing x₁ values. Along the ‘walk’


Fig. 4.12. Probability function of (a) a random variable with only two values

and (b) the sum of two random variables with the same probability function

we add the products of the overlapping values of P(x₁) and the reflected P(x₂). Figure 4.13 shows, from left to right, what happens. When the P(x₂) reflected graph, shown with white bars in Fig. 4.13, lags behind the P(x₁) graph, there is no overlap and the product is zero. In the leftmost graph of Fig. 4.13 we reached a situation where, along the ‘walk’, the P(x₂) reflected graph overlaps the P(x₁) graph at the point 0; one thus obtains the product 0.6 × 0.6, which is the value of P(x₁ + x₂) at point 0. The central graph of Fig. 4.13 corresponds to a more advanced phase of the ‘walk’. There are now two overlapping values whose sum of products is P(x₁ + x₂ = 1) = 0.4 × 0.6 + 0.6 × 0.4. In the rightmost graph of Fig. 4.13, the P(x₂) reflected graph only overlaps at point 2, and one obtains, by the same process, the value of P(x₁ + x₂ = 2) = 0.4 × 0.4. From this point on there is no overlap and P(x₁ + x₂) is zero. The operation we have just described is called convolution. If the random variables are continuous, the convolution is performed with the probability density functions (one of them is reflected), computing the area below the graph of the product in the overlap region.

Fig. 4.13. The geometric construction leading to Fig. 4.12b
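The convolution just described takes only a few lines of code. The sketch below (illustrative Python, not from the book) reproduces the probability function of Fig. 4.12b and then, as in Fig. 4.14, forms the sum of 100 such variables:

```python
def convolve(p, q):
    """Probability function of the sum of two independent discrete variables.

    p[i] = P(X = i) and q[j] = P(Y = j); the result r[k] sums p[i] * q[k - i]."""
    r = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            r[i + j] += pi * qj  # the 'walking' overlap products, added up
    return r

pf = [0.6, 0.4]  # P(x = 0) = 0.6, P(x = 1) = 0.4, as in Fig. 4.12a
print([round(v, 2) for v in convolve(pf, pf)])  # [0.36, 0.48, 0.16], Fig. 4.12b

total = pf
for _ in range(99):              # p.f. of the sum of 100 such variables
    total = convolve(total, pf)
print(total.index(max(total)))   # peak at 40, the mean 100 x 0.4
```

Plotting total for growing numbers of variables reproduces the drift towards the bell shape seen in Fig. 4.14.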


But what has this to do with the central limit theorem? As unbeliev-

able as it may look, if we repeat the convolution with any probability

functions or any probability density functions (with rare exceptions),

we will get closer and closer to the Gauss function!

Figure 4.14 shows the result of adding variables as in the above

example, for several values of the number of variables. The scale on the

horizontal axis runs from 0 to 1, with 1 corresponding to the maximum

value of the sum. Thus, for instance in the case n = 3, the value 1

of the horizontal axis corresponds to the situation where the value

of the three variables is 1 and their sum is 3. The circles signaling the

probability values are therefore placed on 0×(1/3), 1×(1/3), 2×(1/3),

and 3 ×(1/3). The construction is the same for other values of n. Note

that in this case (variables with the same distribution), the mean of the

sum is the mean of one variable multiplied by the number of added variables: 3 × 0.4 = 1.2, 5 × 0.4 = 2, 10 × 0.4 = 4, 100 × 0.4 = 40. In the graphs of Fig. 4.14, these means are placed at 0.4, given the division by n (standardization to the interval 0 through 1).

Fig. 4.14. Probability functions (p.f.) of the sum of random variables (all

with the p.f. of Fig. 4.12a). Note how the p.f. of the sum gets closer to the

Gauss curve for increasing n


Fig. 4.15. The normal distribution on parade

In conclusion, it is the convolution law that makes the normal dis-

tribution ubiquitous in nature. In particular, when measurement errors

depend on several sources of uncertainty (as we saw in the section on

measurement errors), even if each one does not follow the normal law,

the sum of all of them is in general well approximated by the normal

distribution. Besides measurement errors, here are some other exam-

ples of data whose distribution is well approximated by the normal law:

heights of people of a certain age group; diameters of sand grains on the

beach; fruit sizes on identical trees; hair diameters for a certain person;

defect areas on surfaces; chemical concentrations of certain wine com-

ponents; prices of certain shares on the stock market; and heart beat

variations.

4.11 A False Recipe for Winning the Lotto

In Chap. 1 we computed the probability of winning the ﬁrst prize in

the lotto as 1/13 983 816. Is there any possibility of devising a scheme

to change this value? Intuition tells us that there cannot be, and rightly

so, but over and over again there emerge ‘recipes’ with diﬀering degrees

of sophistication promising to raise the probability of winning. Let us

describe one of them. Consider the sum of the 6 randomly selected

numbers (out of 49). For reasons already discussed, the distribution of

this sum will closely follow the normal distribution around the mean

value 6 × (1+49)/2 = 150 (see Fig. 4.16). It then looks like a good idea to

choose the six numbers in such a way that their sum falls into a certain

interval around the mean or equal to the mean itself, i.e., 150. Does

this work?


Fig. 4.16. Histogram of the sum of the lotto numbers in 20 000 extractions

First of all, let us note that it is precisely around the mean that the

number of combinations of numbers with the same sum is huge. For

instance, suppose we chose a combination of lotto numbers with sum

precisely 150. There are 165 772 possible combinations of six numbers

between 1 and 49 whose sum is 150. But this is a purely technical

detail. Let us proceed to an analysis of the probabilities by considering

the following events:

A The sum of the numbers of the winning set is 150.

B The sum of the numbers of the reader’s bet is 150.

C The reader wins the ﬁrst prize (no matter whether or not the sum

of the numbers is 150).

Let us suppose the reader chooses numbers whose sum is 150. The

probability of winning is then

P(C and A) = P(C if A) × P(A) = (1/165 772) × (165 772/13 983 816) = 1/13 983 816 .

Therefore, the fact that the reader has chosen particular numbers did

not change the probability of the reader winning the ﬁrst prize. On the

other hand, we can also see that

P(C if B) = P(B and C)/P(B) = P(B if C) × P(C)/P(B) = [(165 772/13 983 816) × (1/13 983 816)]/(165 772/13 983 816) = 1/13 983 816 .


Thus P(C if B) = P(C), that is, the events ‘the sum of the numbers

of the reader’s bet is 150’ and ‘the reader wins the ﬁrst prize’ are

independent. The probability of winning is the same whether or not

the sum of the numbers is 150! As a matter of fact, this recipe is

nothing else than a sophisticated version of the gambler’s fallacy, using

the celebrated normal curve in an attempt to make us believe that we

may forecast the outcomes of chance.
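The two counts used above, 13 983 816 and 165 772, can be reproduced without enumerating every combination, using a short dynamic-programming sketch (illustrative code, not from the book; ways is our own name): ways[k][t] counts the combinations of k distinct numbers from 1 to 49 with sum t.

```python
from fractions import Fraction

MAX_SUM = 44 + 45 + 46 + 47 + 48 + 49  # 279, the largest possible sum of six numbers
# ways[k][t]: number of ways to pick k distinct numbers from 1..49 summing to t
ways = [[0] * (MAX_SUM + 1) for _ in range(7)]
ways[0][0] = 1
for number in range(1, 50):
    for k in range(6, 0, -1):          # descending, so each number is used at most once
        for t in range(MAX_SUM, number - 1, -1):
            ways[k][t] += ways[k - 1][t - number]

total = sum(ways[6])
print(total)         # 13983816, all six-number combinations
print(ways[6][150])  # 165772 combinations whose sum is 150

# The 'recipe' changes nothing: P(C and A) = P(C if A) * P(A) = 1/total
p = Fraction(1, ways[6][150]) * Fraction(ways[6][150], total)
print(p == Fraction(1, total))  # True
```

The count 165 772 is indeed the largest of all the sums, confirming that the mean 150 is where combinations pile up, and yet betting there does not move the odds at all.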

5 Probable Inferences

5.1 Observations and Experiments

We are bombarded daily with news and advertisements involving infer-

ences based on sets of observations: healthy children drink yoghurt A;

regular use of toothpaste B reduces dental caries; detergent X washes

whiter; recent studies leave no doubt that tobacco smoke may cause

cancer; a US investigation revealed that exposure to mobile phone ra-

diation causes an increase in cancer victims; pharmaceutical product C

promotes a 30% decrease in infected cells. All these statements aim to

generalize a conclusion drawn from the analysis of a certain, more or

less restricted data set. For instance, when one hears ‘healthy children

drink yoghurt A’, one may presuppose that the conclusion is based on

a survey involving a limited number of children. The underlying idea of

all these statements is as follows: there are two variables, say X and Y,

and one hopes to elucidate a measure of the relationship between them.

Measurements of both variables are inﬂuenced by chance phenomena.

Will it be possible, notwithstanding the inﬂuence of chance, to draw

some general conclusion about a possible relation between them?

Consider the statement ‘healthy children drink yoghurt A’. Variable

X is ‘healthy children’ and variable Y is ‘drink yoghurt A’. The val-

ues of X can be established by means of clinical examinations of the

children and consequent assignment of a rank to their state of health.

However, such ranks are not devoid of errors because any clinical anal-

ysis always has a non-zero probability of producing an incorrect value.

In the same way, variable Y has an obvious random component with

regard to the ingested quantities of yoghurt. In summary, X and Y

are random variables and we wish to elucidate whether or not there is

a relation between them. If the relation exists, it can either be causal,


i.e., X causes Y or Y causes X, or otherwise. In the latter case, several

situations may occur, which we exemplify as follows for the statement

‘healthy children drink yoghurt A’:

1. A third variable is involved, e.g., Z = standard of living of the child’s

parents. Z causes X and Y.

2. There are multiple causes, e.g., Z = standard of living of the child’s

parents, U = child’s dietary regime. Y, Z and U together cause X.

3. There are remote causes in causal chains, e.g., G = happy parents

causes ‘happy children’, which causes ‘good immune system in the

children’; G = happy parents causes ‘joyful children’, which causes

‘children love reading comics’, which causes ‘children love comics

containing the advertisement for yoghurt A’.

In all the above advertisement examples, with the exception of the

last one, the data are usually observational , that is, resulting from

observations gathered in a certain population. In fact, many statements

of the kind found in the media are based on observational data coming

from surveys. The inﬂuences assignable to situations 1, 2 and 3 cannot

be avoided, often for ethical reasons (we cannot compel non-smokers

to smoke or expose people with or without cancer to mobile phone

radiation).

Regarding the last statement, viz., ‘pharmaceutical product C pro-

motes a 30% decrease in infected cells’, there is a diﬀerent situation:

it is based on experimental data, in which the method for assessing

the random variables is subject to a control. In the case of the phar-

maceutical industry, it is absolutely essential to establish experimental

conditions that allow one either to accept or to reject a causal link. In

this way, one takes measures to avoid as far as possible the inﬂuence

of further variables and multiple causes. It is current practice to com-

pare the results obtained in a group of patients to whom the drug is

administered with those of another group of patients identical in all

relevant aspects (age, sex, clinical status, etc.) to whom a placebo is

administered (control group).

The data gathered in physics experiments are also frequently experi-

mental data, where the conditions underlying the experiment are rigor-

ously controlled in order to avoid the inﬂuence of any factors foreign to

those one hopes to relate. In practice, it is always diﬃcult to guarantee

the absence of perturbing factors, namely those of a remote causality.

The great intellectual challenge in scientiﬁc work consists precisely in

elucidating the interaction between experimental conditions and alter-


native hypotheses explaining the facts. Statistics makes an important

contribution to this issue.

5.2 Statistics

The works of Charles Darwin (1809–1882) on the evolution of species

naturally aroused the interest of many zoologists of the day. One of

them, the English scientist Walter Weldon (1860–1906), began the

painstaking task of measuring the morphological features of animals,

seeking to establish conclusions based on those measurements, and thus

pioneering the ﬁeld of biometrics. He once said: “. . . the questions

raised by the Darwinian hypothesis are purely statistical”. In his day,

‘statistics’ meant a set of techniques for representing, describing and

summarizing large sets of data. The word seems to have come from

the German ‘Statistik’ (originally from the Latin ‘statisticum’) or ‘Staats-

kunde’, meaning the gathering and description of state data (censuses,

taxes, administrative management, etc.). Coming back to Walter Wel-

don, the desire to analyze biometrical data led him to use some statis-

tical methods developed by Francis Galton (1822–1911), particularly

those based on the concept of correlation, or deviations from the mean.

(Galton is also known for his work on linear regression, a method for ﬁt-

ting straight lines to data sets. The idea here is to minimize the squares

of the deviations from the straight line. This is the mean square error

method, already applied by Abraham de Moivre, Laplace and Gauss.)
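As a side note, this mean square error fit is easy to reproduce numerically. The sketch below (assuming NumPy is available; the height–weight figures are invented for illustration, not data from this book) fits a straight line by minimizing the squared deviations:

```python
import numpy as np

# Made-up illustrative figures: heights in meters, weights in kilos.
heights = np.array([1.60, 1.65, 1.70, 1.75, 1.80, 1.85])
weights = np.array([55.0, 62.0, 66.0, 74.0, 80.0, 84.0])

# Least-squares fit of the line w = a*h + b: a and b are chosen to
# minimize the sum of the squared deviations from the straight line.
a, b = np.polyfit(heights, weights, deg=1)

# With an intercept in the model, the residuals average out to zero.
residuals = weights - (a * heights + b)
print(a > 0, abs(residuals.mean()) < 1e-9)  # -> True True
```

That the residuals average exactly to zero is one signature of the mean square error method: the fitted line passes through the point of the means.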

The studies of Weldon and Galton were continued by Karl Pearson

(1857–1936), professor of applied mathematics in London, generally

considered to be the founder of modern statistics. Statistics is based on

probability theory and, besides representational and descriptive tech-

niques, its main aim is to establish general conclusions (inferences)

from data sets. In this role it has become an inescapable tool in ev-

ery area of science, as well as in other areas of study: social, politi-

cal, entrepreneurial, administrative, etc. The celebrated English writer

H.G. Wells (1866–1946) once said: “Statistical thinking will one day be

as necessary for eﬃcient citizenship as the ability to read and write.”

Statistics can be descriptive (e.g., the average salary increased by

2.5% in . . . , the main household commodities suﬀered a 3.1% inﬂation

in . . . , child mortality in third world countries has lately decreased

although it is still on average above 5%, a typical family produces ﬁve

kilos of domestic waste daily) or inferential.


We saw in Chap. 4 how an estimate of the mean allows one to

establish conclusions (to infer) about the true value of the mean. We

then drew the reader’s attention to two essential aspects: the sampling

process (e.g., was the enquiry on domestic waste suﬃciently random

to allow conclusions about the typical family?) and the sample size

(e.g., was the number of people surveyed suﬃciently large in order to

extract a conclusion with a high degree of certainty, i.e., 95% or more?).

Many misuses of statistics are related to these two essential features.

There are, however, many other types of misuse, in particular when

what is at stake is to establish a relation involving two or more chance

phenomena.

5.3 Chance Relations

Let us consider the height and weight of adults. Figure 5.2 shows the

values of these two random variables (in meters and kilos, respectively)

measured for a hundred people. We all know that there is some relation

between height and weight. In general, taller people are heavier than

shorter ones; but there are exceptions, of course. In Fig. 5.2, the average

weight and height are marked by broken lines. Let us now observe the

two cases lying on the dotted line: both have approximately the same

above-average height; however, one of the cases has a weight above the

average and the other a weight below the average.

If there were no relation between height and weight, we would get

many pairs of cases like the ones lying on the dotted line.

Fig. 5.1. Statistical uncertainty

Fig. 5.2. Height and weight of one hundred adults

That is, the simple fact of knowing that someone has an above-average height

would not tell us, on average, anything about the person’s weight. The

situation would then be as pictured in Fig. 5.3: a cloud of points of

almost circular shape. But that is not what happens. Figure 5.2 deﬁ-

nitely shows us that there is a relationship of some kind between the

two variables. Indeed, the points tend to concentrate around an up-

ward direction. If for the hundred people we were to measure the waist

size, we would also ﬁnd a certain relation between this variable and

the other two. Relations between two or more chance phenomena are

encountered when examining natural phenomena or everyday events.

There is then a strong temptation to say that one phenomenon is ‘the

cause’ and the other ‘the eﬀect’. Before discussing this issue, let us ﬁrst

get acquainted with the usual way of measuring relationships between

two chance phenomena.

Fig. 5.3. Fictitious height and weight


5.4 Correlation

One of the most popular statistical methods for analyzing the re-

lation between variables is the correlation-based method. We shall

now explain this by considering simple experiments in which balls are

drawn out of one urn containing three balls numbered 1, 2 and 3. Fig-

ure 5.4a shows a two-dimensional histogram for extraction of two balls,

with replacement, repeated 15 times. The vertical bars in the histogram

show the number of occurrences of each two-ball combination. Let us

denote the balls by B1 and B2, for the ﬁrst and second extraction, re-

spectively. We may assume that this was a game where B1 represented

the amount in euros that a casino would have to pay the reader and

B2 the amount in euros the reader would have to pay the casino.

In the 15 rounds, represented in Fig. 5.4a, the total number of times

that each value from 1 to 3 for B1 occurs, is 6, 5, 4. For B2, it is 6, 4,

5. We may then compute the reader’s average gain:

average gain (B1) = 1 × 6/15 + 2 × 5/15 + 3 × 4/15 = 28/15 = 1.87 euros .

In a similar way one would compute the average loss (B2) to be 1.93 .

Let us make use of the idea, alluded to in the last section, of measur-

ing the relation between B1 and B2 through their deviations from the

mean. For that purpose, let us multiply together those deviations. For

instance, when B1 has value 1 and B2 has value 3, the product of the

deviations is

(1 − 1.87) × (3 − 1.93) = −0.93 (square euros) .

In this case the deviations of B1 and B2 were discordant (one above and

the other below the mean); the product is negative. When the values

of B1 and B2 are simultaneously above or below the mean, the product

is positive; the deviations are concordant. For Fig. 5.4a, we obtain the

following sum of the products of the deviations:

2 × (1 − 1.87) × (1 − 1.93) + · · · + 2 × (3 − 1.87) × (3 − 1.93) = 1.87 .

The average product of the deviations is 1.87/15 = 0.124. If gains and

losses were two times greater (2, 4 and 6 euros instead of 1, 2 and 3

euros), the average product of the deviations would be nearly 0.5. This

means that, although the histogram in Fig. 5.4a would be similar in

both cases, evidencing an equal relationship, we would find a different value for the average product of the deviations. This is clearly not what we were hoping for.

Fig. 5.4. (a) Histogram of 15 extractions, with replacement, of two balls from one urn. (b) The same, for 2000 extractions

We solve this difficulty (making our relation

measure independent of the measurement scale used) by dividing the

average product of the deviations by the product of the standard devi-

ations. Carrying out the computations of the standard deviations of B1

and B2, we get s(B1) = 0.81 euros, s(B2) = 0.85 euros. Denoting our

measure by C, which is now a dimensionless measure (no more square

euros), we get

C(B1, B2) = (average product of deviations of B1 and B2) / (product of standard deviations) = 0.124 / (0.81 × 0.85) = 0.18 .

The histogram of Fig. 5.4b corresponds to n = 2000, with the vertical

bars representing frequencies of occurrence f = k/n, where k denotes

the number of occurrences in the n = 2000 extractions. We see that

the values of f are practically the same. In fact, as we increase the

number of extractions n, the number of occurrences of each pair ap-

proaches np = n/9 (see Sect. 3.6). The measure C, which we have

computed/estimated above for a 15-sample case, then approaches the

so-called correlation coeﬃcient between the two variables. We may ob-

tain (the exact value of) this coeﬃcient, of which the above value is

a mere estimate, by calculating the mathematical expectation of the

product of the deviations from the true means:

E[(B1 − 2) × (B2 − 2)] = (1/9)[(1 − 2)(1 − 2) + (3 − 2)(3 − 2) + (1 − 2)(3 − 2) + (3 − 2)(1 − 2)] = (1/9) × (1 + 1 − 1 − 1) = 0 .

Thus, for extraction with replacement of the two balls, the correlation is

zero and we say that variables B1 and B2 are uncorrelated. This is also

the case in Fig. 5.3 (there is no ‘privileged direction’ in Figs. 5.3 and

5.4b). Incidentally, note that the variables B1 and B2 are independent,

and it so happens that, whenever two variables are independent, they

are also uncorrelated (the opposite is not always true).
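The computation of C can be rehearsed in a few lines of code. The sketch below (assuming NumPy; seed and sample size are arbitrary choices of ours) simulates extractions with replacement and evaluates the same ratio of average deviation product to product of standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# n extractions, with replacement, of two balls from an urn {1, 2, 3}:
# the two draws are independent random variables.
b1 = rng.integers(1, 4, size=n)
b2 = rng.integers(1, 4, size=n)

# C = (average product of the deviations) / (product of standard deviations)
c = np.mean((b1 - b1.mean()) * (b2 - b2.mean())) / (b1.std() * b2.std())

# This is exactly the Pearson correlation estimate; for independent
# draws it fluctuates around the true value 0, shrinking as n grows.
print(abs(c) < 0.1)  # -> True
```

The quantity c coincides with what NumPy's own `corrcoef` would report, which is a convenient cross-check.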

Let us now consider ball extraction without replacement. For n =

15, we obtained in a computer simulation the histogram of occurrences

shown in Fig. 5.5a. Repeating the previous computations, the estimate

of the correlation between the two variables is −0.59. The exact value

is (see Fig. 5.5b)

C(B1, B2) (without replacement) = E[(B1 − 2) × (B2 − 2)] / [√(2/3) × √(2/3)] = (−1/3) / (2/3) = −0.5 .

In this situation the values of the balls are inversely related: if the value

3 (high) comes ﬁrst, the probability that 1 (low) comes out the following

time increases and vice versa. B1 and B2 are negatively correlated.
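A similar simulation, again a NumPy sketch with an arbitrary seed of ours, confirms the exact value for extraction without replacement:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# n extractions, without replacement, of two balls from an urn {1, 2, 3}:
# each round takes the first two balls of a random ordering of the urn.
draws = np.array([rng.permutation([1, 2, 3])[:2] for _ in range(n)])
b1, b2 = draws[:, 0], draws[:, 1]

# The estimated correlation should land near the exact value -0.5.
c = np.corrcoef(b1, b2)[0, 1]
print(-0.6 < c < -0.4)  # -> True
```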

Fig. 5.5. (a) Histogram of 15 extractions, without replacement, of two balls from one urn. (b) The corresponding probability function

(Note the diagonal with zeros in the histograms of Fig. 5.5, representing

impossible combinations.)

It is easily shown that the correlation measure can only take values

between −1 and +1. We have already seen what a zero correlation

corresponds to. The −1 and +1 values correspond to a total relation,

where the data graph is a straight line. In the case of weight and height,

when the correlation is +1, it would be as though one could tell the

exact weight of a person simply by knowing his/her height, and vice

versa. A −1 correlation reﬂects a perfect inverse relation; this would be

the case if the second ball extraction from the urn only depended on

the ﬁrst one according to the formula B2 = 4 −B1.

5.5 Signiﬁcant Correlation

An important contribution of Karl Pearson to statistics was precisely

the study of correlation and the establishment of criteria supporting

the signiﬁcance of the correlation coeﬃcient (also known as Pearson’s

correlation coeﬃcient). Let us go back to the correlation between adult

weight and height. The computed value of the correlation coeﬃcient is

0.83. Can this value be attributed any kind of signiﬁcance? For instance,

in the case of the urn represented in Fig. 5.4a, we got a correlation esti-

mate of 0.18, when in fact the true correlation value was 0 (uncorrelated

B1 and B2). The 0.18 value suggesting some degree of relation between

the variables could be called an artifact, due only to the small number

of experiments performed.

Karl Pearson approached the problem of the ‘signiﬁcance’ of the

correlation value in the following way: assuming the variables to be

normally distributed and uncorrelated, how many cases n would be

needed to ensure that, with a given degree of certainty (for instance, in

95% of the experiments with n cases), one would obtain a correlation

value at least as high as the one computed in the sample? In other

words, how large must the sample be so that with high certainty (95%)

the measured correlation is not due to chance alone (assuming normally

governed chance)?

The values of a signiﬁcant estimate of the correlation for a 95%

degree of certainty are listed in Table 5.1 for several values of n. For

instance, for n = 100, we have 95% certainty that a correlation at least

as high as 0.2 is not attributable to chance. In the height–weight data

set, n is precisely 100. The value C(height, weight) = 0.83 is therefore

significant. (In fact, the normality assumption is not important for n above 50.)

Table 5.1. Significant estimate of the correlation for a 95% degree of certainty

n                                           5     10    15    20    50    100   200
Significant correlation with 95% certainty  0.88  0.63  0.51  0.44  0.27  0.20  0.14
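Such thresholds can also be approximated without any tables, by simulating the chance-only situation directly. The following Monte Carlo sketch (NumPy; seed and number of trials are arbitrary choices of ours) estimates the 95% threshold for n = 100 uncorrelated normal cases:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 10000

# Correlation estimates for many samples of n uncorrelated normal pairs.
rs = [np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1]
      for _ in range(trials)]

# In 95% of such chance-only experiments the estimated correlation stays
# below this threshold; for n = 100 it should come out near 0.20.
threshold = np.quantile(np.abs(rs), 0.95)
print(0.17 < threshold < 0.23)  # -> True
```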

Note once again the expression of a degree of certainty assigned to

conclusions of random data analysis. This is an important aspect of

statistical proof . If, for instance, we wish to prove the relation between

tobacco smoke and lung cancer using the correlation, we cannot expect

to observe all cases of smokers that have developed lung cancer. Basing

ourselves on the correlation alone, there will always exist a non-null

probability that such a relation does not hold. Of course, in general

there is other evidence besides correlation. For instance, in physics the

statistical proof applied to experimental results is often only one among

many proofs. As an example, we may think of (re)determining Boyle–

Mariotte’s law for ideal gases through the statistical analysis of a set of

experiments producing results on the pressure–volume variation, but

the law itself – inverse relation between pressure and volume – is sup-

ported by other evidence beyond the statistical evidence. However, it

should not be forgotten that the statistical proof so widely used in

scientiﬁc research is very diﬀerent from standard mathematical proof.

5.6 Correlation in Everyday Life

Correlation between random variables is one of the most abused topics

in the media and in careless research and technical reports. Such abuse

already shows up with the idea that, if the correlation between two

variables is zero, then there is no relation between them. This is false.

As mentioned before, correlation is a linear measure of association,

so there may be some relation between the variables even when their

Pearson correlation is zero.

The way that results of analyses involving correlations are usually

presented leads the normal citizen to think that a high correlation

implies a cause–eﬀect association. The word ‘correlation’ itself hints at

some ‘intimate’ or deep relation between two variables, whereas it is


nothing other than a simple measure of linear association, which may

be completely fortuitous.

In truth, an intrinsic cause–eﬀect type relation may sometimes exist.

For instance, the graph of Fig. 5.6 shows the area (in ha) of forest burnt

annually in Portugal during the period 1943 through 1978, as a function

of the total number of annual forest ﬁres. There is here an obvious

cause–eﬀect relationship which translates into a high correlation of 0.9.

Let us now consider the tobacco smoke issue. Figure 5.7 is based

on the results of a study carried out in the USA in 1958, which in-

volved over a hundred thousand observations of adult men aged be-

tween 50 and 69 years. The horizontal axis shows the average number

of cigarettes smoked per day, from 0 up to 2.5 packets. The vertical

axis represents the frequency of occurrence of lung cancer relative to

the total number of detected cancer cases. The correlation is high: 0.95.

The probability of obtaining such a correlation due to chance alone (in

a situation of non-correlation and normal distributions) is only 1%.

Does the signiﬁcantly high value of the correlation have anything to

do with a cause–eﬀect relation? In fact, several complementary studies

concerning certain substances (carcinogenic substances) present in the

tobacco smoke point in that direction, although lung cancer may occur

in smokers in a way that is not uniquely caused by tobacco smoke,

but instead by the interaction of smoke with the ingestion of alcoholic drinks, or other causes.

Fig. 5.6. Area (in ha) of annual burnt forest plotted against the number of forest fires in the period 1943 through 1978 (Portugal)

Fig. 5.7. Frequency of lung cancer (f.l.c.) and average number of smoked cigarettes in packets (USA study, 1958)

Therefore, in this case, the causal link rests upon many other complementary studies. The mere observation

of a high correlation only hints that such a link could be at work.

In many other situations, however, there is no direct ‘relation’ be-

tween two random variables, despite the fact that a signiﬁcant correla-

tion is observed. This is said to be a spurious correlation. Among the

classic examples of spurious correlations are the correlation between

the number of ice-creams sold and the number of deaths by drowning,

or the correlation between the number of means used to ﬁght city ﬁres

and the total amount of damage caused by those ﬁres. It is obvious

that ice-creams do not cause death by drowning or vice-versa; likewise,

it is not the means used to ﬁght ﬁres that cause the material damage

or vice versa. In both cases there is an obvious third variable, viz., the

weather conditions and the extent of the ﬁres, respectively.

Finally, let us present the analysis of a data set that gives a good

illustration of the kind of abuse that arises from a superﬁcial appreci-

ation of the notion of correlation. The data set, available through the

Internet, concerns housing conditions in Boston, USA, and comprises

506 observations of several variables, evaluated in diﬀerent residential

areas.

We may apply the preceding formula to compute the correlations

between variables. For instance, the correlation between the percentage

of poor people (lower status of the population) and the median value

of the owner-occupied homes is negative: −0.74. This is a highly sig-

niﬁcant value for the 506 cases (the probability of such a value due to

chance alone is less than 1/1 000 000). Figure 5.8a shows the graph for

these two variables. It makes sense for the median value of the houses to

decrease with an increase in the percentage of poor people. The causal

link is obvious: those with fewer resources will not buy expensive houses.

Let us now consider the correlation between the percentage of area

used for industry and the index of accessibility to highways (measured

on a scale from 1 to 24). The correlation is high, viz., 0.6. However,

looking at Fig. 5.8b, one sees that there are a few atypical cases with

accessibility 24 (the highest level corresponding to being close to the

highways). Removing these atypical cases one obtains an almost zero

correlation. We see how a few atypical cases may create a false idea of

correlation. An examination of graphs relating the relevant variables is

therefore helpful in order to assess this sort of situation. (Incidentally,


even if we remove atypical cases from Fig. 5.6, the correlation will

remain practically the same.)

Imagine now that we stumble on a newspaper column like the fol-

lowing: Enquiry results about the quality of life in Boston revealed that

it is precisely the lower status of the population that most contributes

to town pollution, measured by the concentration of nitric oxides in

the atmosphere. To add credibility to the column we may even imagine

that the columnist went to the lengths of including the graph of the two

variables (something we very rarely come across in any newspaper!), as

shown in Fig. 5.8c, in which one clearly sees that the concentration of

nitric oxides (in parts per 10 million) increases with the percentage of

poor people in the studied residential area.

In the face of this supposed causal relation – poor people cause pol-

lution! – we may even imagine politicians proposing measures for solv-

ing the problem, e.g., compelling poor people to pay an extra pollution

tax. Meanwhile, the wise researcher will try to understand whether that

correlation is real or spurious.

Fig. 5.8. Graphs and correlations for pairs of variables in the Boston housing data set: (a) −0.74, (b) 0.6, (c) 0.59, (d) 0.76

He may, for instance, study the relation between percentage of area used by industry and percentage of poor

people in the same residential area and verify that a high correlation

exists (0.6). Finally, he may study the relation between percentage of

area used by industry and concentration of nitric oxides (see Fig. 5.8d)

and verify the existence of yet another high correlation, viz., 0.76. Com-

plementary studies would then suggest the following causality diagram:

                   Industry
                 ↙          ↘
Nitric oxide pollution    Poor residential zone implantation

In summary, the correlation triggering the comments in the newspa-

per column is spurious and politicians should think instead of suitably

taxing the polluting industries.

There are many other examples of correlation abuse. Maybe the

most infamous are those pertaining to studies that aim to derive racist

or eugenic conclusions, interpreting in a causal way skull measurements

and intelligence coeﬃcients. Even when using experimental data as in

physics research, correlation alone does not imply causality; it only

hints at, suggests, or opens up possibilities. In order to support a causal-

ity relation, the data will have to be supplemented by other kinds of

evidence, e.g. the right sequencing in time.

Besides the correlation measure, statistics also provides other meth-

ods for assessing relations between variables. For instance, in the Boston

pollution case, one might establish average income ranks in the various

residential areas and obtain the corresponding averages of nitric oxides.

One could then use statistical methods for comparing these averages.

The question is whether this or other statistical methods can be used to

establish a causality inference. Contrary to a widely held opinion, the

answer is negative. The crux of the problem lies not in the methods,

but in the data.

6 Fortune and Ruin

6.1 The Random Walk

The coin-tossing game, modest as it may look, is nevertheless rather

instructive when we wish to consider what may be expected in general

from games, and in particular the ups and downs of a player’s fortunes.

But let us be quite clear: when we speak of games, we are referring

to games that depend on chance alone. We exclude games in which

a player can apply certain logical strategies to counter the previous

moves of his or her opponent (as happens in many card games, checkers,

chess, etc.). On the other hand, we nevertheless understand ‘games’ in

a wide sense: results derived from the analysis of coin-tossing sequences

are applicable to other random sequences occurring in daily life.

The analysis of the coin-tossing game can be carried out in an advan-

tageous manner by resorting to a speciﬁc interpretation of the meaning

of each toss. For this purpose we consider arbitrarily long sequences of

throws of a fair coin and instead of coding heads and tails respectively

by 1 and 0, as we did in Chap. 4, we code them by +1 and −1. We may

consider these values as winning or losing 1 euro. Let us compute the

average gain along the sequence as we did in Chap. 4. For every possi-

ble tossing sequence the average gain will evolve in a distinct way but

always according to the law of large numbers: for a suﬃciently large

number of throws n, the probability that the average gain deviates

from the mathematical expectation (0) by more than a preset value

is as small as one may wish. Figure 6.1 shows six possible sequences

of throws, exhibiting the stabilization around zero, with growing n,

according to the law of large numbers.

Instead of average gains we may simply plot the gains, as in Fig. 6.2.

Fig. 6.1. Six sequences of average gain in 4000 throws of a fair coin

Fig. 6.2. Gain curves for the six sequences of Fig. 6.1

In the new interpretation of the game, instead of thinking in terms of gain we think in terms of upward and downward displacements of a particle, along the vertical axis. Let us substantiate this idea by considering

the top curve of Fig. 6.1, which corresponds to the following sequence

in the ﬁrst 10 throws:

tails, heads, heads, tails, heads, heads, heads, heads, tails, heads .

We may then imagine, in the new interpretation, that the throws are

performed at equal time intervals, say 1 second apart, and that

the outcome of each throw will determine how a point particle moves

from its previous position: +1 (in a certain space dimension) if the

outcome is heads, and −1 if it is tails. At the beginning the particle

is assumed to be at the origin. The particle trajectory for the 10 men-

tioned throws is displayed in Fig. 6.3a. We say that the particle has

performed a random walk.

We may consider the six curves of Fig. 6.2 as corresponding to six

random walks in one dimension: the particle either moves upwards or

downwards. We may also group the six curves of Fig. 6.2 in three pairs

of curves representing random walks of three particles in a two-axis

system: one axis represents up and down movements and the other one

right and left movements. Figure 6.3b shows the result for 1000 throws:

the equivalent of three particles moving on a plane. We may also group

Fig. 6.3. (a) Random walk for ten throws. (b) Random walk of three particles in two dimensions

the curves in threes, obtaining three-dimensional random walks, as in

Brownian motion (ﬁrst described by the botanist Robert Brown for

pollen particles suspended in a liquid).

The random walk, in one or more dimensions, is a good mathemat-

ical model of many natural and social phenomena involving random

sequences, allowing an enhanced insight into their properties. Random

walk analysis thus has important consequences in a broad range of

disciplines, such as physics and economics.
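The behaviour shown in Figs. 6.1 and 6.2 is easy to reproduce. This NumPy sketch (arbitrary seed of ours) simulates one sequence of 4000 throws and computes both the gain curve and the running average gain:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000

# 4000 throws of a fair coin, coded +1 (heads) and -1 (tails).
steps = rng.choice([-1, 1], size=n)

# The gain curve is the random walk; dividing by the number of throws
# so far gives the running average gain.
gain = np.cumsum(steps)
average_gain = gain / np.arange(1, n + 1)

# Law of large numbers: the average gain ends up close to the
# expectation 0, even though the gain itself wanders on a sqrt(n) scale.
print(abs(average_gain[-1]) < 0.1)  # -> True
```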

6.2 Wild Rides

The random walk, despite its apparent simplicity, possesses a great

many amazing and counterintuitive properties, some of which we shall

now present.

Let us start with the probability of obtaining a given walk for n time units (walk steps). As there are 2^n distinct and equiprobable walks, the probability is 1/2^n. This probability quickly decreases with n. For only 100 steps, the probability of obtaining a given walk is

0.000 000 000 000 000 000 000 000 000 000 8 .

For 160 steps it is roughly the probability of a random choice of one atom from all the matter making up the Earth, viz., roughly 1/(8 × 10^49). For 260 steps (far fewer than in Fig. 6.2), it is roughly the probability of a random choice of one atom from all known galaxies, viz., roughly 1/(4 × 10^79)!


Let us go back to interpreting the random walk as a heads–tails

game, say between John and Mary. If heads turns up, John wins 1 euro

from Mary; if tails turns up, he pays 1 euro to Mary. The game is thus

equitable. We may assume that the curves of Fig. 6.2 represent John’s

accumulated gains. However, we observe something odd. Contrary to

common intuition, there are several gain curves (5 out of 6) that stay

above or below zero for long periods of time. Let us take a closer look

at this. We note that the number of steps for reaching zero gain will

have to be even (the same number of upward and downward steps), say

2n. Now, it can be shown that the probability of reaching zero gain for

the ﬁrst time in 2n steps is given by

P(zero gain for the first time in 2n steps) = (2n choose n) / [2^(2n) × (2n − 1)] .

Using this formula one observes that the probability of zero gain for

the ﬁrst time in 2 steps is 0.5, and for the ﬁrst time in four steps is

0.125. Therefore, the probability that a zero gain has occurred in four

steps, whether or not it is the ﬁrst time, is 0.625.
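The formula can be checked directly, for instance with Python's exact integer binomials (the helper name is ours); the values 0.5 and 0.125 quoted above come out immediately:

```python
from math import comb

def first_zero_gain(n):
    """Probability that the two players' gains first equalize at step 2n."""
    return comb(2 * n, n) / (2 ** (2 * n) * (2 * n - 1))

print(first_zero_gain(1))  # -> 0.5
print(first_zero_gain(2))  # -> 0.125
```

Summing these terms over many n brings the total ever closer to 1, as stated next.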

Let us denote by P(2n) the sum

P(zero gain for the ﬁrst time in 2 steps)

+P(zero gain for the ﬁrst time in 4 steps)

+· · · +P(zero gain for the ﬁrst time in 2n steps) .

This sum represents the probability of John’s and Mary’s gains having

equalized at least once in 2n steps. It can be shown that P(2n) con-

verges to 1, which means that at least one equalization is sure to occur

for suﬃciently high n, and this certainly agrees with common intuition.

However, looking at the P(2n) curve in Fig. 6.4a we notice that the

convergence is rather slow. For instance, after 64 steps the probability

that no gain equalization has yet occurred is 1 −0.9 = 0.1. This means

that, if ten games are being played simultaneously, the most proba-

ble situation (on average terms) is that in one of the games the same

player will always be winning. After 100 steps the situation has not

improved much: the probability that no equalization has yet occurred

is 1 −0.92 = 0.08, which is still quite high!
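Summing the first-return formula term by term reproduces the probabilities quoted above; a short sketch (the helper name is ours):

```python
from math import comb

def p_equalized(n):
    """P(2n): probability of at least one gain equalization in 2n steps."""
    return sum(comb(2 * k, k) / (2 ** (2 * k) * (2 * k - 1))
               for k in range(1, n + 1))

# Probability that no equalization has yet occurred after 64 and 100 steps:
print(round(1 - p_equalized(32), 2))  # -> 0.1
print(round(1 - p_equalized(50), 2))  # -> 0.08
```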

The random walk really does look like a wild ride. The word ‘random’, coming from the old English ‘randon’, has its origin in an old Frankish word, ‘rant’, meaning a frantic and violent ride. Game rides really are frantic and violent.

Fig. 6.4. (a) Probability of at least a zero gain in 2n steps. (b) Probability of the last zero gain occurring after x

6.3 The Strange Arcsine Law

We have seen above how the probability of not obtaining gain equal-

ization in 2n steps decreases rather slowly. However, random walk sur-

prises become even more dramatic when one studies the probability of

obtaining zero gain for the last time after a certain number of rounds,

say 2k in a sequence of 2n rounds. Letting x = k/n represent the pro-

portion of the total rounds corresponding to the 2k rounds (x varies

between 0 and 1), it can be shown that this probability is given by the

following formula:

P(last zero gain at time x in 2n rounds) ≈ 1 / [nπ √(x(1 − x))] .

For instance, the probability that the last zero gain has occurred in the

middle of a sequence of 20 rounds (n = 10) is computed as follows:

P(last zero gain at the tenth of 20 rounds) ≈ 1 / (10 × 3.14 × √(0.5 × 0.5)) = 0.064 .

The curves of Fig. 6.4b show this probability for several values of n

(the probabilities for x = 0 and x = 1 are given by another formula).

Note that, for each n, only certain values of x are admissible (vertical

bars shown in the ﬁgure for n = 10 and n = 20). The bigger n is, the

more values there are, with spacing 1/n.

Common intuition would tell us that in a long series of throws John

and Mary would each be in a winning position about half of the time

and that the name of the winner should swap frequently. Let us now


see what the curves in Fig. 6.4b tell us. Consider the events x < 0.5 and

x > 0.5, which is the same as saying 2k < n and 2k > n, respectively,

or in more detail, ‘the last zero gain occurred somewhere in the ﬁrst

half of 2n rounds’ and ‘the last zero gain occurred somewhere in the

second half of 2n rounds’. Since the curves are symmetric, P(x < 0.5)

and P(x > 0.5) are equal [they correspond to adding up the vertical

bars from 0 to 0.5 (exclusive) and from 0.5 (exclusive) to 1]. That is,

the events ‘the last zero gain occurred somewhere in the ﬁrst half of 2n

rounds’ and ‘the last zero gain occurred somewhere in the second half

of 2n rounds’ are equiprobable; in other words, with probability close

to 1/2, no zero gain occurs during half of the game, independently of

the total number of rounds! Thus, the most probable situation is that

one of the two, John or Mary, will spend half of the game permanently

in the winning position. Moreover, the curves in Fig. 6.4b show that

the last time that a zero gain occurred is more likely to happen either

at the beginning or at the end of the game!

In order to obtain the probability that the last zero gain occurred

before and including a certain time x, which we will denote by P(x), one

only has to add up the preceding probabilities represented in Fig. 6.4b,

for values smaller than or equal to x. For suﬃciently large n (say, at

least 100), it can be shown that P(x) is well approximated by the

inverse of the sine function, called the arcsine function and denoted

arcsin (see Appendix A):

P(last zero gain occurred at a time ≤ x) = P(x) ≈ (2/π) arcsin(√x) .

The probability of the last zero gain occurring before a certain time x

is said to follow the arcsine law, displayed in Fig. 6.5. Table 6.1 shows

some values of this law.

Fig. 6.5. Probability of the last zero gain occurring before or at x, according to the arcsine law

Table 6.1. Values of the arcsine law

x      0.005  0.01   0.015  0.02   0.025  0.03   0.095  0.145  0.205  0.345
P(x)   0.045  0.064  0.078  0.090  0.101  0.111  0.199  0.249  0.299  0.400

Imagine that on a certain day several pairs of players passed 8 hours playing heads–tails, tossing a coin every 10 seconds. The number of tosses (8 × 60 × 6 = 2880) is sufficiently high for a legitimate application of the arcsine law. Consulting the above table, we observe that on

average one out of 10 pairs of players would obtain the last gain equalization during the first 12 minutes (0.025 × 2880 = 72 tosses), passing the remaining 7 hours and 48 minutes without any swap of the winner position! Moreover, on average, one out of every 3 pairs would pass the last 6 hours and 22 minutes without a change of winner!
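These figures can be checked by simulation. The following Python sketch (my own illustration, not from the book; the function name is arbitrary) plays many 2880-round games, records the last time the accumulated gain was zero, and compares the empirical distribution with the arcsine law:

```python
import math
import random

def last_zero_fraction(rounds, rng):
    """Play one heads-tails game; return the last time (as a fraction
    of the game) at which the accumulated gain was exactly zero."""
    gain, last_zero = 0, 0
    for i in range(1, rounds + 1):
        gain += 1 if rng.random() < 0.5 else -1
        if gain == 0:
            last_zero = i
    return last_zero / rounds

rng = random.Random(42)
rounds, games = 2880, 1000          # the 8-hour scenario of the text
fractions = [last_zero_fraction(rounds, rng) for _ in range(games)]

# empirical P(last zero gain at a time <= x) vs (2/pi) arcsin(sqrt(x))
for x in (0.025, 0.205, 0.5):
    empirical = sum(f <= x for f in fractions) / games
    arcsine = (2 / math.pi) * math.asin(math.sqrt(x))
    print(f"x = {x:5.3f}: empirical {empirical:.3f}, arcsine law {arcsine:.3f}")
```

The empirical values land close to the Table 6.1 entries, confirming that about one game in ten sees its last tie in the first 2.5% of the rounds.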

6.4 The Chancier the Better

An amazing aspect of the convergence of averages to expectations is

their behavior when the event probabilities change over a sequence of

random experiments. Let us suppose for concreteness that the proba-

bilities of a heads–tails type of game were changed during the game.

For instance, John and Mary did a ﬁrst round with a fair coin, a sec-

ond round with a tampered coin, producing on average two times more

heads than tails, and so on. Denote the probability of heads turning

up in round k by p_k. For example, we would have p_1 = 0.5, p_2 = 0.67,

and so on. The question is, will the arithmetic mean still follow the law

of large numbers? The French mathematician Siméon Poisson (1781–

1840) proved in 1837 that it does indeed, and that the arithmetic mean

of the gains converges to the arithmetic mean of the p values, viz.,

(p_1 + p_2 + · · · + p_n)/n .

(Poisson is better known for a probability law that carries his name and

corresponds to an approximation of the binomial distribution for a very

low probability of the relevant event.) Now, since the probability of turning up heads changes in each round, common intuition would tell us that it might be more difficult to obtain convergence. Yet the statistician Wassily Hoeffding (1914–1991) demonstrated more

than a century after Poisson’s result that, contrary to intuition, the

probability of a given deviation of the average of the x values from the

average of the p values reaches a maximum when the p_k are constant.

Fig. 6.6. Average gains in 20 sequences of 1000 rounds of the heads–tails game. (a) Constant probability p_k = 0.5. (b) Variable probability p_k uniformly random between 0.3 and 0.7

In other words, the larger the variability of the p values, the better the convergence is! It is as if one said that by increasing the 'disorder'

during the game more quickly, one achieves order in the limit!

This counterintuitive result is illustrated in Fig. 6.6, where it can

be seen that the convergence of the averages is faster with variable

probabilities randomly selected between 0.3 and 0.7.
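Poisson's result is easy to probe numerically. The sketch below (my own illustration, not the book's code) plays 200 games of 1000 rounds each, once with constant p_k = 0.5 and once with p_k drawn uniformly from [0.3, 0.7], and reports the average absolute deviation between the mean gain and the mean of the p values; both deviations are equally small, as the convergence result predicts.

```python
import random

def mean_deviation(n_rounds, n_games, make_p, rng):
    """Average |mean outcome - mean p_k| over many games (Poisson's law)."""
    total = 0.0
    for _ in range(n_games):
        ps = [make_p(rng) for _ in range(n_rounds)]      # one p_k per round
        wins = sum(rng.random() < p for p in ps)
        total += abs(wins / n_rounds - sum(ps) / n_rounds)
    return total / n_games

rng = random.Random(7)
const = mean_deviation(1000, 200, lambda r: 0.5, rng)
varied = mean_deviation(1000, 200, lambda r: r.uniform(0.3, 0.7), rng)
print(f"constant p_k: {const:.4f}, variable p_k: {varied:.4f}")
```

(The sample sizes here are too small to exhibit Hoeffding's fine distinction reliably; the point is only that variable probabilities do not spoil the convergence at all.)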

6.5 Averages Without Expectation

We saw how the law of large numbers assured a convergence of the aver-

age to the mathematical expectation. There is an aspect of this law that

is easily overlooked. In order to talk of convergence, the mathematical

expectation must exist. This fact looks so trivial at ﬁrst sight that it

does not seem to deserve much attention. In fact, at ﬁrst sight, it does

not seem possible that there could be random distributions without

a mathematical expectation. But is it really so?

Consider two random variables with normal distribution and with

zero expectation and unit variance. To ﬁx ideas, suppose the variables

are the horizontal position x and the vertical position y of riﬂe shots

at a target, as in Chap. 4. Since x and y have zero expectation, the

rifle hits the bull's-eye on average. We assume that the

bull’s-eye is a very small region around the point (x = 0, y = 0). It may

happen that a systematic deviation along y is present when a systematic

deviation along x occurs. In order to assess whether this is the case,

let us suppose that we computed the ratio z = x/y for 500 shots and

plotted the corresponding histogram. If the deviation along the vertical


direction is strongly correlated with the deviation along the horizontal

direction, we will get a histogram that is practically one single bar

placed at the correlation value. For instance, if y is approximately equal

to x, the histogram is practically a bar at 1. Now, consider the situation

where x and y are uncorrelated. We will get a broad histogram such as the

one in Fig. 6.7a.

Looking at Fig. 6.7a, we may suspect that the probability density

function of z is normal and with unit variance, like the broken curve also

shown in Fig. 6.7a. However, our suspicion is wrong and the mistake

becomes clearly visible when we look at the histogram over a suﬃciently

large interval, as in Fig. 6.7b. It now becomes clear that the normal

curve is unable to approximate the considerably extended histogram

tails. For instance, for z exceeding 5 in absolute value, we would expect

to obtain, according to the normal law, about 6 shots in 10 million. That

is, for the 500 shots we do not expect to obtain any with z exceeding 5.

Now, in reality that is not what happens, and we obtain a considerable

number of shots with z in excess of 5. In the histogram of Fig. 6.7b, we

obtained 74 such shots! In fact, z does not follow the normal law, but

another law discovered by the French mathematician Augustin Louis

Cauchy (1789–1857) and called the Cauchy distribution in his honor.

The Cauchy probability density function

f(z) = 1 / [π(1 + z^2)]

is represented by the solid line in Fig. 6.7b.

Fig. 6.7. Histogram of the ratio of normal deviations in 500 shots: (a) with

the normal probability density function (broken curve), (b) with the Cauchy

probability density function (continuous curve)
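The heavy tails are easy to reproduce. This sketch (an illustration of mine, not the book's code) draws 500 ratios of independent standard normals and counts the extreme values; the Cauchy tail probability P(|z| > 5) = 1 − (2/π) arctan 5 ≈ 0.126 predicts roughly 63 of them, whereas the normal law predicts essentially none.

```python
import math
import random

rng = random.Random(1)
n = 500
# ratio of two independent standard normal deviations, as for the rifle shots
z = [rng.gauss(0, 1) / rng.gauss(0, 1) for _ in range(n)]

extreme = sum(abs(v) > 5 for v in z)
# Cauchy tail: P(|z| > 5) = 1 - (2/pi)*arctan(5), about 0.126
expected = n * (1 - (2 / math.pi) * math.atan(5))
print(f"observed |z| > 5: {extreme}, Cauchy prediction: {expected:.1f}")
```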


Visually, Cauchy’s curve represented in Fig. 6.7b does not look much

of a monster; it even looks like the bell-shaped curve and there is a ten-

dency to think that its expectation is the central 0 point (as in the

analogy with the center of gravity). But let us consider, in Fig. 6.8a,

what happens with a sequence of values generated by such a distri-

bution. We see that the sequence exhibits many occurrences of large

values (sometimes, very large), clearly standing out from most of the

values in the sequence. In contrast, a normally distributed sequence is

characterized by the fact that it very rarely exhibits values that stand

out from the ‘background noise’. The impulsive behavior of the Cauchy

distribution is related to the very slow decay of its tails for very small

or very large values of the random variable; this slow decay is reﬂected

in a considerable probability for the occurrence of extreme values.

Suppose now that we take the random values with Cauchy distri-

bution as wins and losses of a game and compute the average gain.

Figure 6.8b shows 6 average gain sequences for 1000 rounds. It seems

that there is some stabilization around 0 for some sequences, but this

impression is deceptive. For example, one of the sequences exhibits, close to round 400, a large jump away from zero.

Had we carried out a large number of experiments with a large number

of rounds, we would then conﬁrm that the averages do not stabilize

around zero. The reason is that the Cauchy law does not really have

a mathematical expectation. In some everyday phenomena, such as risk

assessment in ﬁnancial markets, sequences obeying such strange laws as

the Cauchy law do emerge. Since such laws do not have a mathematical

expectation, it is meaningless to speak of the law of large numbers for those sequences, or of convergence and long-term behavior; fortune and ruin may occur abruptly.

Fig. 6.8. (a) A sequence of rounds with Cauchy distribution. (b) Six average

gain sequences for 1000 rounds
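A few lines of Python make the non-convergence visible (a sketch of mine, using the standard inverse-transform method: if u is uniform on [0, 1], then tan(π(u − 1/2)) follows the Cauchy density above). The running average keeps jumping no matter how long the game lasts:

```python
import math
import random

def cauchy(rng):
    # inverse-transform sampling: tan(pi*(u - 1/2)) has density 1/(pi*(1 + z^2))
    return math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(3)
checkpoints = {}
total = 0.0
for n in range(1, 100_001):
    total += cauchy(rng)
    if n in (100, 1000, 10_000, 100_000):
        checkpoints[n] = total / n
print(checkpoints)   # the averages do not settle toward any fixed value
```

In fact, the average of n standard Cauchy variables is again standard Cauchy for every n, so taking more rounds brings no stabilization whatsoever.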


6.6 Borel’s Normal Numbers

We spoke of conﬁdence intervals in Chap. 4 and, on the basis of the

normal law, we were able to determine how many throws of a die were

needed in order to be 95% sure of the typicality of a ±0.01 interval

around the probability p for the frequency of occurrence f in the worst

case (worst standard deviation). This result (inverse of √n) constituted

an illustration of the law of large numbers. Denoting the number of

throws by n, we could write

P(deviation of f from p is below 0.01) = 0.95, for n > 10 000.

In fact, the law of large numbers stipulates that whatever the tolerance

of the deviation might be (0.01 in the above example), it is always

possible to ﬁnd a minimum value of n above which one obtains a degree

of certainty as large as one may wish (0.95 in the above example). In the

limit, i.e., for arbitrarily large n, the degree of certainty is 1. Therefore,

denoting any preselected tolerance by t, we have

limit of P(deviation of f from p is below t) = 1 .

In general, using the word ‘deviation’ to denote the absolute value of

the deviation of an average (f, in the example) from the mathematical

expectation (p, in the example), we may write

limit of P(deviation is below t) = 1 .

Consequently, when applying the law of large numbers, we ﬁrst compute

a probability and thereafter its limit. Now, it does frequently happen

that we do not know how many times the random experiment will be

repeated, and we are more interested in ﬁrst computing the limit of

the deviations between average and expectation and only afterward

determining the probability of the limit. That is, right from the start

we must consider inﬁnite sequences of repetitions.

Setting aside the artiﬁciality of the notion of an ‘inﬁnite sequence’,

which has no practical correspondence, the study of inﬁnite sequences

is in fact rather insightful. Let us take a sequence of n tosses of a coin.

Any of the 2^n possible sequences is equiprobable, with probability 1/2^n.

Imagine then that the sequence continued without limit. Since 1/2^n decreases with n, we would thus be led to the paradoxical conclusion

that no sequence is feasible; all of them would have zero probability.

This diﬃculty can be overcome if we only assign zero probability to

certain sequences.


Table 6.2. Assigning numbers to sequences

Sequence                             Binary    Decimal value
Tails, tails, tails, tails, heads    0.00001   0.03125
Heads, heads, heads, heads, heads    0.11111   0.96875
Heads, tails, heads, tails, heads    0.10101   0.65625

Let us see how this is possible, omitting some technical details. For

this purpose, consider representing heads and tails respectively by 1

and 0, as we have already done on several occasions. For instance, the

sequence heads, tails, heads, heads is represented by 1011. We may

assume that 1011 represents a number in a binary numbering system.

Instead of the decimal system that we are accustomed to from an early

age, and where the position of each digit represents a power of 10 (e.g.,

375 = 3 × 10^2 + 7 × 10^1 + 5 × 10^0), in the binary numbering system

the position of each digit (there are only two, 0 and 1), called a bit,

represents a power of 2. So 1011 in the binary system is evaluated as

1 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0, that is, 11 (eleven) in the decimal system.

Let us now take the 1011 sequence as the representation of a number

in the interval [0, 1]. It is thus 0.1011; we just omit the 0 followed by

the decimal point. To compute its value in the decimal system we shall

have to multiply each bit by a power of 1/2 (by analogy, in the decimal

system each digit is multiplied by a power of 1/10). That is, 1011 is

then evaluated as

1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16 = 0.6875 .

Table 6.2 shows some more examples.
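The conversion is a one-liner in Python (a sketch of mine; the function name is arbitrary): each bit is weighted by the corresponding power of 1/2.

```python
def sequence_value(seq):
    """Value in [0, 1] of a heads-tails sequence, heads = 1, tails = 0."""
    return sum(bit / 2 ** (i + 1) for i, bit in enumerate(seq))

# heads, tails, heads, heads -> binary 0.1011
print(sequence_value([1, 0, 1, 1]))        # 0.6875
# the rows of Table 6.2
print(sequence_value([0, 0, 0, 0, 1]))     # 0.03125
print(sequence_value([1, 1, 1, 1, 1]))     # 0.96875
print(sequence_value([1, 0, 1, 0, 1]))     # 0.65625
```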

It is easy to understand that all inﬁnite heads–tails sequences can

thus be put in correspondence with numbers in the interval [0, 1], with

1 represented by 0.1111111111 . . . . In the case of 1/2, which may be

represented by both 0.1000000 . . . and 0.0111111111 . . . , we adopt the

convention of taking the latter (likewise for 1/4, 1/8, etc.).

We may visualize this binary representation by imagining that each

digit corresponds to dividing the interval [0, 1] into two halves. The

ﬁrst digit represents the division of [0, 1] into two intervals: from 0

to 0.5 (excluding 0.5) and from 0.5 to 1; the second digit represents

the division of each of these two intervals into halves; and so on, in

succession, as shown in Fig. 6.9, assuming a sequence of only n = 3

digits.


Fig. 6.9. The first 3 digits (d_1, d_2, d_3) and the intervals they determine in [0, 1]. The value 0.625 corresponds to the dotted intervals

Figure 6.9 shows the intervals that 0.625 (1×1/2+0×1/4+1×1/8)

‘occupies’. We now establish a mapping between probabilities of bi-

nary sequences and their respective intervals in [0, 1]. Take the sequence

heads, tails, heads represented by 0.101. We may measure the probabil-

ity that this sequence occurs by taking the width of the corresponding

interval in [0, 1]: the small interval between 0.625 and 0.75 with width

1/8 shown in Fig. 6.9. That is, from the point of view of assigning prob-

abilities to the sequence heads, tails, heads, it comes to the same thing

as randomly throwing a point onto the straight line segment between 0

and 1 (excluding 0), with the segment representing a kind of target cut

into slices 0.125 units wide. If the point-throwing process has a uniform

distribution, each slice has probability 1/8, as required.

Now, consider inﬁnite sequences. To each sequence corresponds

a number in [0, 1]. Let us look at those sequences where 0 and 1 show

up with the same limiting frequency (thus, the frequency of finding 0 or 1 is 1/2), as well as any of the four two-bit subsequences 00, 01, 10 and 11 (each showing up 1/4 of the time), any of the three-bit subsequences 000, 001, 010, 011, 100, 101, 110 and 111 (each showing up 1/8 of the

time), and so forth. The real numbers in [0, 1] corresponding to such

sequences are called Borel’s normal numbers.

The French mathematician Émile Borel (1871–1956) proved in 1909

that the set of non-normal numbers occupies such a negligible 'space'

in [0, 1] that we may consider their probability to be zero. Take, for in-

stance, the sequence 0.101100000 . . . (always 0 from the ﬁfth digit on-

ward). It represents the rational number 11/16. Now take the sequence

0.1010101010 . . . (repeating forever). It represents the rational number

2/3. Obviously, no rational number (represented by a sequence that ei-

ther ends in an inﬁnite tail of zeros or ones, or by a periodic sequence)

is a Borel normal number. The set of all sequences corresponding to

rational numbers thus has zero probability; in other words, the prob-

ability that we get an inﬁnite sequence corresponding to a rational


number when playing heads–tails is zero. On the other hand, since the

set of all normal numbers has probability 1, we may then say that prac-

tically all real numbers in [0, 1] are normal (once again, a result that

may seem contrary to common intuition). Yet, it has been found to be

a diﬃcult task to prove that a given number is normal. Are the deci-

mal expansions of

√

2, e or π normal? No one knows (although there

is some evidence that they are). A few numbers are known to be nor-

mal. One of them, somewhat obvious, was presented by the economist

David Champernowne (1912–2000) and corresponds to concatenating

the distinct binary blocks of length 1, 2, 3, etc.:

0.0100011011000001010011100101110111 . . . .

The so-called Copeland–Erdős constant

0.235 711 131 719 232 931 373 941 . . . ,

obtained by concatenating the prime numbers, is also normal in the

base 10 system.
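Champernowne's construction is simple enough to verify on a prefix. The sketch below (mine, not the book's) concatenates all binary blocks of lengths 1 through 12 and checks the digit frequencies: the frequency of 1s is exactly 1/2 on this prefix, and the frequency of the two-bit block 01 is very close to 1/4, as normality requires.

```python
from itertools import product

def champernowne_bits(max_len):
    """Concatenate all binary blocks of length 1, 2, ..., max_len."""
    bits = []
    for length in range(1, max_len + 1):
        for block in product((0, 1), repeat=length):
            bits.extend(block)
    return bits

bits = champernowne_bits(12)
n = len(bits)                        # 90114 digits
print(n, sum(bits) / n)              # frequency of 1s: exactly 0.5

# frequency of the two-bit block 01, which should approach 1/4
pairs01 = sum(1 for i in range(n - 1) if bits[i] == 0 and bits[i + 1] == 1)
print(pairs01 / (n - 1))
```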

We end with another amazing result. It is well known that the set

of all inﬁnite 0–1 sequences is uncountable; that is, it is not possible

to establish a one-to-one mapping between these sequences and the

natural numbers. Well, although the set of non-normal numbers has

zero probability, it is also uncountable!

6.7 The Strong Law of Large Numbers

Let us look at sequences of n zeros and ones as coin tosses, denoting

by f the frequency of occurrence of 1. The normal numbers satisfy, by

deﬁnition, the following property:

P(limit of the deviation of f from 1/2 is 0) = 1 .

This result is a particular case of a new version of the law of large

numbers. In Chap. 3, the Bernoulli law of large numbers stated that

limit of P(deviation is below t) = 1 ,

where the term ‘deviation’ refers to the absolute value of the diﬀerence

between the average and the expectation (1/2, for coin throwing), as

already pointed out in the last section. The new law, demonstrated by

the Russian mathematician Andrei Kolmogorov (1903–1987) in 1930

(more than two centuries after the Bernoulli law), is stated as follows:

P(limit of the deviation is 0) = 1 .

Fig. 6.10. Convergence conditions for the weak law (a) and the strong law (b) of large numbers

Let us examine the diﬀerence between the two formulations using

Fig. 6.10, which displays several deviation curves as a function of n,

with t the tolerance of the deviations. In the ﬁrst version, illustrated in

Fig. 6.10a, the probability of a deviation larger than t must converge

to 0; that is, once we have ﬁxed the tolerance t (the gray band), we ﬁnd

fewer curves above t as we proceed toward larger values of n. In the case

of Fig. 6.10a, there are three curves above t at n_0, two at n_1 and one at n_2. The law says nothing about the deviation values. In Fig. 6.10a, the deviation at n_2 is even more prominent than the deviation at n_1.

In the second version of the law, in order for the limit of the average to

equal the limit of the expectation with high probability, large wanderings above any preselected t must become ever less frequent, as

Fig. 6.10b is intended to illustrate. An equivalent formulation consists

in saying that, in the worst case – the dotted line in Fig. 6.10b – the

probability of exceeding t converges to zero. Thus, the second version of

the law of large numbers imposes a stronger restriction on the behavior

of the averages than the ﬁrst version. It is thus appropriate to call it

the strong law of large numbers. The Bernoulli law is then the weak

law. Understanding the laws of large numbers is an essential prerequi-

site to understanding the convergence properties of averages of chance

sequences.
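The difference between the two laws can be probed by simulation (a rough sketch of mine; the cut-off N is needed because a computer cannot follow a sequence forever). For the coin game it estimates, at two values of n, both the weak-law event (a deviation above t at exactly round n) and the strong-law event (a deviation above t somewhere between round n and round N):

```python
import random

rng = random.Random(11)
t, trials, N = 0.02, 300, 8000
checkpoints = (1000, 4000)

exceed_at = {n: 0 for n in checkpoints}     # weak-law event
exceed_after = {n: 0 for n in checkpoints}  # strong-law event (truncated at N)

for _ in range(trials):
    heads, devs = 0, []
    for n in range(1, N + 1):
        heads += rng.random() < 0.5
        devs.append(abs(heads / n - 0.5))
    for n in checkpoints:
        exceed_at[n] += devs[n - 1] > t
        exceed_after[n] += max(devs[n - 1:]) > t

for n in checkpoints:
    print(n, exceed_at[n] / trials, exceed_after[n] / trials)
```

Both frequencies shrink as n grows, and the strong-law event is always at least as frequent as the weak-law one, which is exactly why the second formulation is the stronger restriction.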

6.8 The Law of Iterated Logarithms

We spoke of the central limit theorem in Chap. 4, which allowed us to

compute the value of n above which there is a high certainty of obtain-

ing a frequency close to the corresponding probability value. We were

thus able to determine that, when p = 1/2 (the coin-tossing example),


only after 10 000 throws was there a risk of at most 5% that the f ±0.01

conﬁdence interval was not typical, i.e., did not contain the true value

of p. On the other hand, the strong law of large numbers shows us that

there is an almost sure convergence of f toward p. How fast is that

convergence? Take a sequence of experiments with only two possible

outcomes, as in the coin-tossing example, and denote the probability

of one of the outcomes by p. Let S(n) be the accumulated gain after n

outcomes and consider the absolute value of the discrepancy between

S(n) and its expectation, which we know to be np (binomial law).

Now, the Russian mathematician Alexander Khintchine demonstrated

the following result in 1924:

P(limit, in the worst case, of a √(2npq log log n) deviation) = 1 .

Note that the statement is formulated in terms of a worst case deviation

curve as in Fig. 6.10b. This result, known as the theorem of iterated

logarithms (the logarithm is repeated), allows one to conclude that f =

S(n)/n will almost certainly be between p plus or minus a deviation of

√(2pq log log n / n), where 'almost certainly' means that, if we add an

inﬁnitesimal increase to that quantity, only a ﬁnite subset of all possible

inﬁnite sequences will deviate by more than that quantity. For instance,

for 10 000 throws of a coin we are almost sure that all sequences will

deviate by less than

√(0.5 × log log 10 000 / 10 000) = 0.0105.

Fig. 6.11. Frequencies of heads or tails in 50 sequences of 10 000 throws of a coin, with the deviation of the strong law

Fig. 6.12. Pushing the poor souls to the limit

Figure 6.11 shows 50 sequences of 10 000 throws of a coin with the deviation of the iterated logarithm (black curves). Note how, after 10 000 throws, practically all curves are inside 0.5 plus or minus the deviation of the iterated logarithm. Note also the slow decay of the

iterated logarithm. The strong law of large numbers and the law of iter-

ated logarithms show us that, even when we consider inﬁnite sequences

of random variables, there is order in the limit.
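The numbers quoted above can be recomputed directly (a sketch of mine using natural logarithms, which reproduces the text's value 0.0105):

```python
import math
import random

n = 10_000
# the iterated-logarithm deviation sqrt(2pq log log n / n) with p = q = 0.5
bound = math.sqrt(0.5 * math.log(math.log(n)) / n)
print(f"deviation bound for n = {n}: {bound:.4f}")   # 0.0105

rng = random.Random(5)
inside = 0
for _ in range(50):                  # 50 sequences, as in Fig. 6.11
    heads = sum(rng.random() < 0.5 for _ in range(n))
    inside += abs(heads / n - 0.5) <= bound
print(f"sequences ending inside 0.5 +/- bound: {inside} of 50")
```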

7 The Nature of Chance

7.1 Chance and Determinism

The essential property characterizing chance phenomena is the impos-

sibility of predicting any individual outcome. When we toss a fair die

or a fair coin in a fair way, we assume that this condition is fulfilled: no individual outcome can be predicted. The

only thing we may perhaps be able to tell with a given degree of cer-

tainty, based on the laws of large numbers, is the average outcome after

a large number of experiments. Yet one might think that it would be

possible in principle to predict any individual outcome. If one knew

all the initial conditions of motion, either of the die or of the coin, to-

gether with the laws of motion, one might think it possible to forecast

the outcome. Let us look at the example of the coin. The initial con-

ditions that would have to be determined are the force vector applied

to the coin, the position of its center of gravity relative to the ground,

the shape of the coin and its orientation at the instant of throwing, the

viscosity of the air, the value of the acceleration due to gravity at the

place of the experiment, and the friction and elasticity factors of the

ground. Once all these conditions are known, writing and solving the

respective motion equations would certainly be a very complex task,

although not impossible. The problem is thus deterministic. Every as-

pect of the problem (coin fall, coin rebound, motion on the ground with

friction, etc.) obeys perfectly deﬁned and known laws.

At the beginning of the twentieth century there was a general convic-

tion that the laws of the universe were totally deterministic. Probability

theory was considered to be a purely mathematical trick allowing us

to deal with our ignorance of the initial conditions of many phenom-


ena. According to Pierre-Simon Laplace in his treatise on probability

theory:

If an intelligence knew at a given instant all the forces animat-

ing matter, as well as the position and velocity of any of its

molecules, and if moreover it was suﬃciently vast to submit

these data to analysis, it would comprehend in a single formula

the movements of the largest bodies of the Universe and those

of the tiniest atom. For such an intelligence nothing would be

irregular, and the curve described by an air or vapor molecule

would appear to be governed in as precise a way as is for us

the course of the Sun. But due to our ignorance of all the data

needed for the solution of this grand problem, and the impos-

sibility, due to our limited abilities, of subjecting most of the

data in our possession to calculation, even when the amount of

data is very limited, we attribute the phenomena that seem to

us to occur and succeed one another without any particular order to changeable and hidden causes, whose action is designated by the word

chance, a word that after all is only the expression of our igno-

rance. Probability relates partly to this ignorance and partly to our knowledge.

The development of quantum mechanics, starting with the pioneering

work of the physicists Max Planck (research on black body radiation,

which established the concept of the emission of discrete energy quan-

tities, or energy quanta), Albert Einstein (research on the photoelectric

eﬀect, establishing the concept of photons as light quanta) and Louis

de Broglie (dual wave–particle nature of the electron), at the beginning

of the twentieth century, came to change the Laplacian conception of

the laws of nature, revealing the existence of many so-called quantum

phenomena whose probabilistic description is not simply the result of

resorting to a comfortable way of dealing with our ignorance about the

conditions that produced them. For quantum phenomena the proba-

bilistic description is not just a handy trick, but an absolute necessity

imposed by their intrinsically random nature. Quantum phenomena are

‘pure chance’ phenomena.

Quantum mechanics thus made the ﬁrst radical shift away from

the classical concept that everything in the universe is perfectly de-

termined. However, since quantum phenomena are mainly related to

elementary particles, i.e., microscopic particles, the conviction that, in

the case of macroscopic phenomena (such as tossing a coin or a die), the

probabilistic description was merely a handy way of looking at things


and that it would in principle be possible to forecast any individual

outcome, persisted for quite some time. In the middle of the twentieth

century several researchers showed that this concept would also have to

be revised. It was observed that in many phenomena governed by per-

fectly deterministic laws, there was in many circumstances such a high

sensitivity to initial conditions that one would never be able to guar-

antee the same outcome by properly adjusting those initial conditions.

However small a deviation from the initial conditions might be, it could

sometimes lead to an unpredictably large deviation in the experimental

outcomes. The only way of avoiding such disorderly behavior would be

to deﬁne the initial conditions with inﬁnite accuracy, which amounts

to an impossibility. Moreover, it was observed that much of this be-

havior, highly sensitive to initial conditions, gave birth to sequences of

values – called chaotic orbits – with no detectable periodic pattern. It

is precisely this chaotic behavior that characterizes a vast number of

chance events in nature. We can thus say that chance has a threefold

nature:

• A consequence of our inability to take into account

all factors that inﬂuence the outcome of a given phenomenon, even

though the applicable laws are purely deterministic. This is the coin-

tossing scenario.

• A consequence of the extreme sensitivity of many phenomena to

certain initial conditions, rendering it impossible to forecast future

outcomes, although we know that the applicable laws are purely

deterministic. This is the nature of deterministic chaos which is

manifest in weather changes, for instance.

• The intrinsic random nature of phenomena involving microscopic

particles, which are described by quantum mechanics. Unlike the

preceding cases, since the random nature of quantum phenomena

is an intrinsic feature, the best that one can do is to determine the

probability of a certain outcome.

In what follows we ﬁrst focus our attention on some amazing aspects

of probabilistic descriptions of quantum mechanics. We then proceed

to a more detailed analysis of deterministic chaos, which plays a role

in a vast number of macroscopic phenomena in everyday life.


7.2 Quantum Probabilities

The description of nature at extremely small scales, as is the case of

photons, electrons, protons, neutrons and other elementary particles,

is ruled by the laws of quantum mechanics. Such particles are generi-

cally named quanta. There is a vast popular science literature on this

topic (a few indications can be found in the references), describing in

particular the radical separation between classical physics, where laws

of nature are purely deterministic, and quantum mechanics where laws

of nature at a suﬃciently small, microscopic, scale are probabilistic. In

this section we restrict ourselves to outlining some of the more remark-

able aspects of the probabilistic descriptions of quantum mechanics

(now considered to be one of the best validated theories of physics).

A ﬁrst remarkable feature consists in the fact that, for a given mi-

croscopic system, there are variables whose values are objectively unde-

ﬁned. When the system undergoes a certain evolution, with or without

human intervention, it is possible to obtain values of these variables

which only depend on purely random processes: objective, pure, ran-

domness. More clearly, the chance nature of the obtained outcome does

not depend on our ignorance or on incomplete speciﬁcations as is the

case for chaotic phenomena. In quantum mechanics, event probabilities

are objective.

An experiment illustrating the objective nature of chance in quan-

tum mechanics consists in letting the light of an extremely weak light

source fall upon a photographic plate, for a very short time. Suppose

that some object is placed between the light source and the plate. Since

the light intensity is extremely weak, only a small number of photons

falls on the plate, which will then reveal some spots distributed in

a haphazard way around the photographed object. If we repeat the

experiment several times the plates will reveal diﬀerent spot patterns.

Moreover, it is observed that there is no regularity in the process, as

would happen for instance if the ﬁrst half of the photons impacting

the plate were to do so in the middle region and the other half at the

edges. The process is genuinely random. On the other hand, if we com-

bine the images of a large number of plates, the law of large numbers

comes into play and the image (silhouette) of the photographed object

then emerges from a ‘sequence of chance events’. It is as though we

had used a Monte Carlo method as a way of deﬁning the contour of

the object, by throwing tiny points onto the plate.
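The idea can be mimicked in a few lines (a toy sketch of mine: the 'object' is a hypothetical opaque disk, and each random point stands for one photon reaching the plate). After a few thousand points the silhouette emerges as the blank region:

```python
import random

def blocked(x, y):
    # hypothetical object: an opaque disk of radius 0.3 in the unit square
    return (x - 0.5) ** 2 + (y - 0.5) ** 2 < 0.3 ** 2

rng = random.Random(9)
grid = [[" "] * 40 for _ in range(20)]
for _ in range(3000):                # each point is one photon
    x, y = rng.random(), rng.random()
    if not blocked(x, y):
        grid[int(y * 20)][int(x * 40)] = "."

for row in grid:
    print("".join(row))              # the disk appears as a blank silhouette
```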

One strange aspect of quantum mechanics lies in the fact that cer-

tain properties of a microscopic particle, such as its position, are char-


acterized by a probability wave (described by the famous wave equation

discovered by the Austrian physicist Erwin Schrödinger in 1926). We

cannot, for instance, say at which point a given particle is at a given

instant. In the absence of a measurement there is a non-null probability

that the particle is ‘positioned’ at several points at the same time. We

write ‘positioned’ between quotes because in fact it does not have the

usual meaning; it really is as though the particle were ‘smeared out’

in space, being present partly ‘here’, with a certain probability, and

partly ‘there’, with another probability. The particle is said to be in

a superposition state. This superposition aspect, together with the wave, and therefore vector, nature of the quantum description, gives rise

to counterintuitive results which are impossible to obtain when dealing

with uncertainties in everyday life, such as the outcomes of games, fore-

casts in the stock market, weather forecasts, and so on, characterized

by a scalar probability measure.

There is a much discussed experiment, the two-slit experiment, that

illustrates the strange behavior of the probabilistic vector nature of

quantum phenomena. A narrow beam of light shines on a target, sepa-

rated from the light source by a screen with two small holes (or slits),

as shown in Fig. 7.1.

Let us interpret the experiment as in Chap. 4 for the example of

shooting at a target, treating the photons like tiny bullets. With this

assumption the probability of a certain target region being hit by a bul-

let is the sum of the probabilities for each of the two possible bullet

trajectories; either through A or through B. We may even assume a nor-

mal probability density for each of the two possible trajectories, as in

Fig. 7.1a. In any small region of the target the bullet-hit probability is the sum of the probabilities corresponding to each of the normal distributions.

Fig. 7.1. Probability density functions in the famous two-slit experiment. (a) According to classical mechanics. (b) What really happens

Thus, the probability density of the impact due to the

two possible trajectories is obtained as the sum of the two densities,

shown with a dotted line in Fig. 7.1a. This would certainly be the

result with macroscopic bullets. Now, that is not what happens with

photons, where a so-called interference phenomenon like the one shown

in Fig. 7.1b is observed, with concentric rings of smaller or larger light

intensity than the one corresponding to a single slit. This result can

be explained in the framework of classical wave theory. Quantum the-

ory also explains this phenomenon by associating a probability wave to

each photon.

In order to get an idea as to how the calculations are performed, we

suppose that an arrow (vector) is associated with each possible trajec-

tory of a photon, with the base of the arrow moving at the speed of light

and its tip rotating like a clock hand. At a given trajectory point the

probability of ﬁnding the photon at that point is given by the square

of the arrow length. Consider the point P. Figure 7.1b shows two ar-

rows corresponding to photon trajectories passing simultaneously (!)

through A and B; this is possible due to the state superposition. When

arriving at P, the clock hands exhibit diﬀerent orientations, asymmetric

in relation to the horizontal bearing, since the lengths of the trajecto-

ries FAP and FBP are diﬀerent. The total probability is obtained as

follows: the two arrows are added according to the parallelogram rule

(the construction illustrated in Fig. 7.1b), and the arrow ending at S is

obtained as a result. The square of the length of the latter arrow corre-

sponds to the probability of an impact at P. At the central point Q, the

arrows corresponding to the two trajectories come up with symmetric

angles around the horizontal bearing, since the lengths of FAQ and

FBQ are now equal. Depending on the choice of the distance between

the slits, one may obtain arrows with the same bearing but opposite di-

rections, which cancel each other when summing. There are also points,

such as R, where the arrows arrive in a concordant way and the length

of the sum arrow is practically double the length of each vector. It is

then possible to observe null probabilities and probabilities that are

four times larger than those corresponding to a single slit. We thus

arrive at the surprising conclusion that the quantum description does

not always comply with

P(passing through A or passing through B)

= P(passing through A) + P(passing through B) ,

as would be observed in a classical description!
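The arrow arithmetic can be mimicked with complex numbers, one unit ‘arrow’ per path whose angle grows with the path length. The wavelength and path lengths below are arbitrary illustrative values, not taken from the book:

```python
import cmath
import math

WAVELENGTH = 1.0  # arbitrary unit of length

def arrow(path_length):
    """Unit 'arrow' for one path: its tip turns once per wavelength travelled."""
    return cmath.exp(2j * math.pi * path_length / WAVELENGTH)

def two_slit_probability(length_A, length_B):
    """Quantum rule: add the two arrows, then square the resulting length."""
    return abs(arrow(length_A) + arrow(length_B)) ** 2

single_slit = abs(arrow(10.0)) ** 2          # one slit alone: 1.0
agree = two_slit_probability(10.0, 11.0)     # paths differ by a full turn
cancel = two_slit_probability(10.0, 10.5)    # paths differ by half a turn
classical_sum = 1.0 + 1.0                    # adding probabilities instead

# The quantum result ranges from 0 up to 4 times the single-slit value,
# while the classical sum of probabilities is always exactly 2.
print(single_slit, agree, cancel, classical_sum)
```

The points R and Q of the figure correspond to the `agree` and `cancel` cases: probabilities four times the single-slit value, or zero.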


What happens when a single photon passes through a single slit?

Does it smear out in the target as a blot with a bell-shaped intensity?

No. The photon ends at an exact (but unknown) point. The bell-shaped

blot is a result of the law of large numbers. There is no intrinsic feature

deciding which photons hit the center and which ones hit the periphery.

The same applies to the interference phenomenon. Only the probabilis-

tic laws that govern the phenomena are responsible for the ﬁnal result.

Incidentally, if we place photon detectors just after the two slits A and

B, the interference phenomenon no longer takes place. Now, by acting

with a detector, the state superposition is broken and we can say which

slit a given photon has passed through.

The interference phenomenon has also been veriﬁed experimentally

with other particles of reduced dimensions such as electrons, atoms, and

even with larger molecules, as in recent experiments with the buckmin-

sterfullerene molecule, composed of 60 carbon atoms structured like

a football.

Another class of experiments illustrating the strangeness of quan-

tum probabilities relates to properties of microscopic particles exhibit-

ing well deﬁned values. It is a known fact that light has a wave nature as

well as a particle nature, represented by the oscillation of an electromag-

netic ﬁeld propagating as a wave, with the electric ﬁeld vector vibrating

perpendicularly to the propagation direction. Common light includes

many wave components with various vibration directions, called polar-

izations. However, with the help of a polarizing device (for instance,

a sheet of a special substance called polaroid or a crystal of the min-

eral calcite), one may obtain a single light component with a speciﬁc

direction of polarization.

In Fig. 7.2b, horizontally polarized light (the electric ﬁeld vibrates

perpendicularly to the page), denoted by h, shines on a polarizer whose

direction of polarization makes a 45-degree angle with the horizontal direction (see Fig. 7.2a).

Fig. 7.2. (a) Polarization directions in a plane perpendicular to the propagation direction. (b) Horizontally polarized light passing through two polaroids. (c) As in (b) but in the opposite order

Let E_h be the electric field intensity of h. The intensity coming out of the 45° polarizer corresponds to the field projection along this direction, that is,

E_1 = E_h cos 45° = E_h/√2 .

The light passes next through a vertical polarizer (the electric ﬁeld

vibrates in the page perpendicularly to the direction of propagation),

and likewise

E_2 = E_1 sin 45° = E_1/√2 = E_h/2 .

Let us now suppose that we change the order of the polarizers as in

Fig. 7.2c. We obtain

E_1 = E_h cos 90° = 0 ,

hence, E_2 = 0. These results, involving electric field projections along

polarization directions, are obvious in the framework of the classical

wave theory. Let us now interpret them according to quantum mechan-

ics. In fact, light intensity, which is proportional to the square of the

electric ﬁeld amplitude, measures the probability of a photon passing

through the two-polarizer sequence. But we thus arrive at the conclu-

sion, contrary to common sense, that in the quantum description the

probabilities of a chain of events are not always commutative, inde-

pendent of the order of events, since the probability associated with

the experimental result of Fig. 7.2b is diﬀerent from the one associated

with Fig. 7.2c. Suppose that instead of light we had a ﬂow of sand

grains passing through a sequence of two sieves: sieve 1 and sieve 2. In

the classical description, the commutative property holds:

P(passing through sieve 1) × P(passing through sieve 2 if it has passed through sieve 1)
= P(passing through sieve 2) × P(passing through sieve 1 if it has passed through sieve 2) .

In the quantum description the probability of the intersection depends

on the order of the events! This non-commutative feature of quantum

probabilities has also been observed in many other well-deﬁned prop-

erties of microscopic particles, such as the electron spin.
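The two polarizer orders of Fig. 7.2 can be checked by chaining field projections (Malus's law); the sketch below uses angles in degrees as our only convention:

```python
import math

def project(amplitude, angle_deg):
    """Field amplitude surviving a polarizer whose axis makes the given
    angle (in degrees) with the incoming polarization direction."""
    return amplitude * math.cos(math.radians(angle_deg))

E_h = 1.0  # incoming horizontally polarized field

# Order of Fig. 7.2b: horizontal light -> 45-degree polaroid -> vertical one.
E1 = project(E_h, 45)            # E_h / sqrt(2)
E2 = project(E1, 45)             # E_h / 2
intensity_b = E2 ** 2            # transmitted intensity: 1/4

# Order of Fig. 7.2c: horizontal light -> vertical polaroid -> 45-degree one.
E1c = project(E_h, 90)           # 0: the vertical polaroid blocks everything
intensity_c = project(E1c, 45) ** 2

print(intensity_b, intensity_c)  # roughly 0.25 versus 0: order matters
```

Read as photon-passage probabilities, the two intensities are the non-commuting chain probabilities discussed above.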

The laws of large numbers also come into play in all quantum phe-

nomena, as we saw earlier in the case where an object was photographed


with very weak light. In the interference example, if the source light

only emits a small number of photons per second, one initially observes

a poorly deﬁned interference pattern, but it will begin to build up as

time goes by. Another particularly interesting illustration of the law of

large numbers acting on quantum phenomena is obtained in an experi-

ment in which the intensity of a light beam with a certain polarization,

say h, is measured after it has passed through a polarizer with polar-

ization axis at an angle θ to h. We have already seen that the light

intensity is then proportional to the square of cos θ. In fact, it is pro-

portional to the number of photons passing through the polarizer, viz.,

N_p = N cos² θ, where N is the number of photons hitting the polarizer. If we carry out the experiment with a light beam of very weak

intensity (in such a way that only a small number of photons hits the

polarizer) and, for each value of θ, we count how many photons pass

through (this can be done with a device called a photomultiplier), we

obtain a graph like the one shown in Fig. 7.3. This graph clearly shows

random deviations in the average number of photons N_p/N passing through the polarizer from the theoretical value of cos² θ. An experiment with a normal light beam, with very high N, can be interpreted as

carrying out multiple experiments such as the one portrayed in Fig. 7.3.

But in this case the law of large numbers comes into play, and in such

a way that the average number of photons passing through the polarizer exhibits a standard deviation proportional to 1/√N (as already

seen in Chap. 4). In summary, for an everyday type of light beam, the

law of large numbers explains the cos² θ proportionality, as deduced in

the framework of classical physics for the macroscopic object we call

a light beam.
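A simulation in the spirit of Fig. 7.3, with parameters of our own choosing: each photon passes the polarizer independently with probability cos²θ, so for small N the counts scatter around N cos²θ, while for large N the law of large numbers pins them down:

```python
import math
import random

random.seed(7)

def photons_through(N, theta_deg):
    """Each photon independently passes the polarizer with probability
    cos^2(theta); count how many of N photons get through."""
    p = math.cos(math.radians(theta_deg)) ** 2
    return sum(random.random() < p for _ in range(N))

THETA = 30.0  # cos^2(30 degrees) = 0.75

small_N = 50
large_N = 500_000
small = photons_through(small_N, THETA) / small_N  # noisy estimate of 0.75
large = photons_through(large_N, THETA) / large_N  # pinned down near 0.75

print(small, large)
```

The large-N fraction deviates from 0.75 by an amount on the order of 1/√N, as stated above.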

Fig. 7.3. Average number of photons passing through a polarizer for several polarizer angles (simulation)


Another strange aspect of quantum mechanics, which still raises

much debate, concerns the transition between a quantum description

and a classical description. In other words, when is a particle large

enough to ensure that quantum behavior is no longer observed? The

modern decoherence theory presents an explanation consistent with

experimental results, according to which, for sufficiently large particles (where ‘sufficiently large’ means that an interaction with other particles in the surrounding space is inevitable), the state superposition is no longer preserved.

The intrinsically random nature of quantum phenomena was a source

of embarrassment for many famous physicists. It was even suggested

that instead of probability waves, there existed in the microscopic par-

ticles some internal variables, called hidden variables, that were de-

terministically responsible for quantum phenomena. However, a result

discovered in 1964 by the physicist John Bell (known as Bell’s theo-

rem) proved that no theory of local hidden variables would be able to

reproduce in a consistent way the probabilistic outcomes of quantum

mechanics. The random nature of quanta is therefore the stronghold of

true randomness, of true chance. Quanta are the quintessential chance

generators, putting dice, coins and the like to shame.

7.3 A Planetary Game

Playing dice in space may not be a very entertaining activity. In weight-

less conditions, astronauts would see the dice ﬂoating in the air, without

much hope of deciding what face to read oﬀ. Imagine a space game for

astronauts, using balls instead of dice, and making use of the weightless

conditions in order to obtain orbital ball trajectories. One of the balls,

of largest mass, plays the role of the Sun in a mini-planetary system.

In this planetary game two balls are used, playing the role of planets

with co-planar orbits.

Let us first take a single planet P_1 and to fix ideas let us suppose

that the Sun and planet have masses 10 kg and 1 kg, respectively. The

mass of the planet is reasonably smaller than the mass of the Sun so

that its inﬂuence on the Sun’s position can be neglected. We may then

consider that the Sun is ﬁxed at the point (0, 0) of a coordinate sys-

tem with orthogonal axes labelled x and y in the orbital plane. If an

astronaut places P_1 1.5 m away from the Sun with an initial velocity

of 7 cm/hr along one of the orthogonal directions, (s)he will be able to

confirm the universal law of gravitation, observing P_1 moving round in

a slightly elliptic orbit with an orbital period of about 124 hr (assuming


that the astronaut is patient enough). We now suppose that a second

planet P_2, also with mass 1 kg, is placed farther away from the Sun than P_1 and with an initial velocity of 6 cm/hr along one of the orthog-

onal directions. The game consists in predicting where (to within some

tolerance) one of the planets can be found after a certain number of

hours. If P

2

is placed at an initial distance of r = 3 m from the Sun, we

get the stable orbit situation shown in Fig. 7.4a and the position of the

planets is predictable. In this situation a slight deviation in the position

of P_2 does not change the orbits that much, as shown in Fig. 7.4b. On

the other hand, if P_2 is placed 2 m from the Sun, we get the chaotic

orbits shown in Fig. 7.5a and the positions of the planets can no longer

be predicted. Even as tiny a change in the position as an increment of

0.00000001 m (a microbial increment!) causes a large perturbation, as

shown in Fig. 7.5b. One cannot even predict the answer to such a simple

question as: Will P_1 have made an excursion beyond 3 m after 120 hr?

Faced with such a question, the astronauts are in the same situation

as if they were predicting the results of a coin-tossing experiment.
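The non-chaotic figures quoted at the start of this game can at least be checked: for a near-circular orbit, Newton's law gives v = √(GM/r) and T = 2πr/v. A quick sketch with the standard gravitational constant:

```python
import math

G = 6.674e-11   # gravitational constant, m^3 kg^-1 s^-2
M = 10.0        # mass of the 'Sun' ball, kg
r = 1.5         # orbital radius, m

v = math.sqrt(G * M / r)    # circular-orbit speed, m/s
T = 2 * math.pi * r / v     # orbital period, s

v_cm_per_hr = v * 100 * 3600
T_hr = T / 3600
print(round(v_cm_per_hr, 1), round(T_hr, 1))  # about 7.6 cm/hr and 124 hr
```

This reproduces both the ‘7 cm/hr’ initial velocity and the roughly 124-hour period mentioned above.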

The chaotic behavior of this so-called three-body problem was ﬁrst

studied by the French mathematician Jules Henri Poincaré (1854–1912), a pioneer in the study of chaotic phenomena. Poincaré and other

researchers have shown that, by randomly specifying the initial con-

ditions of three or more bodies, the resulting motion is almost always

chaotic, in contrast to what happens with two bodies, where the motion

is never chaotic. For three or more bodies no one can tell, in general,

which trajectories they will follow. This is the situation with certain asteroids.

Fig. 7.4. Stable orbits of the planets, with the ‘Sun’ at (0, 0). (a) r = 3 m. (b) r = 3.01 m

Fig. 7.5. Chaotic orbits of the planets, with the ‘Sun’ at (0, 0). (a) r = 2 m. (b) r = 2.000 000 01 m

The problem can only be treated for certain specific sets of

conditions. In the following sections, we shall try to cast some light on

the nature of these phenomena.

7.4 Almost Sure Interest Rates

Consider the calculation of accumulated capital after n years, in a cer-

tain bank with an annual constant interest rate of j. Assume an initial

capital, denoted by C_0, of 1 000 euro, and an interest rate of 3% per

year. After the first year the accumulated capital C_1 is

C_1 = C_0 + C_0 × j = 1 000 + 1 000 × 0.03 = C_0 × 1.03 .

After the second year we get in the same way, C_2 = C_1 × 1.03 = C_0 × 1.03². Thus, after n years, iteration of the preceding rule shows that the capital is C_n = C_0 × 1.03^n.

After carrying out the computations with a certain initial capital C_0, suppose we came to the conclusion that its value was wrong and we should have used C_0 + E_0. For instance, instead of 1 000 euro we should have used 1 001 euro, corresponding to a relative error of E_0/C_0 = 1/1 000. Do we have to redo all calculations or is there a simple rule for determining how the initial E_0 = 1 euro error propagates after n years? Let us see. After the first year the capital becomes

C_0 × 1.03 + E_0 × 1.03 .

That is, we get a difference relative to C_1 of E_0 × 1.03. The relative error (relative to C_1) is again E_0/C_0 (in detail, (E_0 × 1.03)/(C_0 × 1.03)). Repeating the same reasoning, it is easy to conclude that the initial error E_0 propagates in a fairly simple way over the years: the propagation is such

that the relative error remains constant. If we suppose a scenario with

n = 10 years and imagine that we had computed C_10 = 1 343.92 euro, then given the preceding rule, the relative error 1/1 000 will remain constant and we do not therefore need to redo the whole calculation; we just have to add to C_10 one thousandth of its value in order to obtain the correct value. Note how this simple rule is the consequence of

a linear relation between the capital C_n after a given number of years and the capital C_{n+1} after one further year: C_{n+1} = C_n × 1.03. (In a graph of C_{n+1} against C_n, this relation corresponds to a straight line passing through the origin and with slope 1.03, which is what justifies use of the word ‘linear’.)
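The whole argument fits in a few lines of code: an error of 1 euro in 1 000 keeps exactly its relative size no matter how many years we compound.

```python
C0, E0, j, n = 1000.0, 1.0, 0.03, 10

C_n = C0 * (1 + j) ** n                    # capital after n years
C_n_corrected = (C0 + E0) * (1 + j) ** n   # recomputed with the right C0

relative_error = (C_n_corrected - C_n) / C_n
print(round(C_n, 2), relative_error)  # 1343.92; the relative error is still 1/1000
```

This is the linear propagation of errors described above: the absolute difference grows with n, but the relative difference stays at E_0/C_0.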

Let us now look at another aspect of capital growth. Suppose that

another bank oﬀers an interest rate of 3.05% per year. Is this small

diﬀerence very important? Let us see by what factor the capital grows

in the two cases after 10 years:

3% interest rate after 10 years: 1.343 916 ,

3.05% interest rate after 10 years: 1.350 455 .

If the initial capital is 100 euro, we get a diﬀerence that many of us

would consider negligible, namely 65 cents. But, if the initial capital

is one million euros, the diﬀerence will no doubt look quite important:

6 538 euro. This is of course a trivial observation and it seems that not

much remains to be said about it. But let us take a closer look at how

the diﬀerence of capital growth evolves when the interest rates are close

to each other. As n increases, that diﬀerence also increases. Figure 7.6

shows what happens for two truly usurious interest rates: 50% and

51%. The vertical axis shows the capital C_{n+1} after one further year, corresponding to the present capital C_n, represented on the horizontal axis: C_{n+1} = C_n(1 + j). The graphical construction is straightforward.

Starting with C_0 = 1, we extend this value vertically until we meet the straight line 1 + j, in this case the straight line with slope 1.5 or 1.51. We thus obtain the value C_1, which we extend horizontally until we reach the line at 45°, so that we use the same C_1 value on the horizontal axis when doing the next iteration. The process is repeated

over and over again. The solid staircase shows how the capital grows for

the 51% interest rate. One sees that as the number of years increases (see Fig. 7.6) the difference between the two capital growth factors increases in proportion to the number of years.

Fig. 7.6. Capital growth for (usurious!) interest rates: 50% and 51%

In fact, Fig. 7.6 clearly

illustrates that the diﬀerence in values of the capital is proportional to

n, i.e., changes linearly with n. For the 3% and 3.05% interest rates,

the diﬀerence in values of the capital varies with n as a straight line

with slope (0.030 5 −0.03)/1.03 = 0.000 485 per year.

This linearly growing diﬀerence in capital for distinct (and constant)

interest rates is also well known. However, what most people do not re-

alize is that numerical computations are always performed with ﬁnite

accuracy. That is, it does not matter how the calculations are carried

out, e.g., by hand or by computer, the number of digits used in the

numerical representation is necessarily ﬁnite. Therefore, the results are

generally approximate. For instance, we never get exact calculations

for an interest rate of 1/30 per year, since 1/30 = 0.033 333 3 . . . (con-

tinuing inﬁnitely).

Banks compute accumulated capitals with computers. These may

use diﬀerent numbers of bits in their calculations, corresponding to dif-

ferent numbers of decimal digits. Personal computers usually perform

the calculations using 32 bits (the so-called single precision representation, equivalent to about 7 significant decimal digits) or in 64 bits

(the so-called double precision representation, equivalent to about 15

signiﬁcant decimal digits). Imagine a ﬁrm deciding to invest one million

euros in a portfolio of shares with an 8% annual interest rate. Apply-

ing this constant 8% interest rate the ﬁrm’s accountant calculates the

accumulated capital on the ﬁrm’s computer with 32-bit precision. The

bank performs the same calculation but with 64-bit precision. They

compare their calculations and find that after 25 years there is a 41 cent difference in an accumulated capital of about six million eight hundred thousand euros.

Fig. 7.7. Annual evolution of the difference of accumulated capital computed by two different computers

In order to clear up this issue they compare

the respective calculations for a 50-year period and draw the graph of

the diﬀerences between the two calculations, as in Fig. 7.7. After 50

years the diﬀerence is already 6.51 euro. Moreover, they verify that

the diﬀerence does not progress in a systematic, linear way; it shows

chaotic ﬂuctuations that are diﬃcult to forecast. If there is a notice-

able diﬀerence when using a 64-bit computer, why not use a 128-bit

or a 256-bit computer instead? In fact, for problems where numerical

precision is crucial one may use supercomputers that allow a very large

number of bits for their computations. But the bottom line is this: no

matter how super the computer is, we will always have to deal with

uncertainty factors in our computations due to the ﬁnite precision of

number representation and error propagation. In some cases the men-

tioned uncertainty is insigniﬁcant (the ﬁrm would probably not pay

much attention to a 41 cent diﬀerence in six million eight hundred

thousand euros). In other cases error propagation can have dramatic

consequences.
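The firm's 32-bit computer can be mimicked with the standard library by rounding every intermediate result to single precision (`struct` packs a Python float into an IEEE 32-bit value). The gap our sketch produces will not be exactly the 41 cents quoted above, since it depends on the order of the roundings, but it is of the same small-but-nonzero flavour:

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest IEEE 32-bit value."""
    return struct.unpack('f', struct.pack('f', x))[0]

capital_64 = 1_000_000.0              # the bank: double precision
capital_32 = to_float32(1_000_000.0)  # the firm: single precision throughout

for year in range(25):
    capital_64 = capital_64 * 1.08
    capital_32 = to_float32(capital_32 * to_float32(1.08))

gap = abs(capital_64 - capital_32)
print(round(capital_64, 2), round(gap, 2))
```

After 25 years at 8% the capital is indeed about 6.85 million euro, and the two precisions already disagree by a visible amount.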

7.5 Populations in Crisis

In the above interest rate problem, the formula C_{n+1} = C_n(1 + j) for capital growth is a linear formula, i.e., C_{n+1} is directly proportional to C_n. This linear dependency is reflected in the well-behaved evolution

of the capital growth for a small ‘perturbation’ of the initial capital.

Unfortunately, linear behavior is more the exception than the rule, whether in natural phenomena or in everyday life.

We now consider a non-linear phenomenon, relating to the evolution

of the number of individuals in a population after n time units. Let us

denote the number of individuals by N_n and let N be the maximum


reference value allowed by the available food resources. We may, for

instance, imagine a population of rabbits living inside a fenced ﬁeld.

If our time unit is the month, we may further assume that after three

months we have N_3 = 60 rabbits living in a field that can only support up to N = 300 rabbits. Let P_n = N_n/N denote the fraction of the maximum number of individuals that can be supported by the available resources. For the above-mentioned values, we have P_3 = 0.2. If the

population grew at a constant rate r, we would get a situation similar

to the one for the interest rate: P_{n+1} = P_n(1 + r). However, with an

increasing population, the necessary resources become scarcer and one

may reach a point where mortality overtakes natality owing to lack of

resources; P_n will then decrease. A formula reflecting the need to impose

a growth rate that decreases as one approaches a maximum number of

individuals, becoming a dwindle rate if the maximum is exceeded, is

P_{n+1} = P_n [1 + r(1 − P_n)] .

According to this rule the population grows or dwindles at the rate of

r(1 − P_n). If P_n is below 1 (the P_n value for the population maximum), the population grows, and all the more as the deviation from the maximum is greater. As soon as P_n exceeds 1, a dwindling phase sets in.

Note that the above formula can be written as P_{n+1} = P_n(1 + r) − rP_n², clearly showing a non-linear, quadratic, dependency of P_{n+1} on P_n.

Suppose we start with P_0 = 0.01 (one per cent of the reference

maximum) and study the population evolution for r = 1.9. We ﬁnd

that the population keeps on growing until it stabilizes at the maximum

(Fig. 7.8a). If r = 2.5, we get an oscillating pattern (Fig. 7.8b) with

period 4 (months).

Sometimes the periodic behavior is more complex than the one in

Fig. 7.8b. For instance, in the case of Fig. 7.9, the period is 32 and it

takes about 150 time units until a stable oscillation is reached. If we

set a 0.75 threshold, as in Fig. 7.9, interpreting as 1 the values above

the threshold and as 0 the values below, we get the following periodic

sequence:

111010101110101011101010111011101110101011101010111 . . .

But strange things start to happen when r approaches 3. For r = 3

and P_0 = 0.1, we get the behavior illustrated in Fig. 7.10a. A very tiny error in the initial condition P_0, for instance P_0 = 0.100 001, is enough to trigger a totally different evolution, as shown in Fig. 7.10b.
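The population rule and its sensitivity can be reproduced directly; the 0.01 divergence threshold in the test of the sketch below is just our illustrative choice:

```python
def evolve(P0, r, steps):
    """Iterate P_{n+1} = P_n * (1 + r * (1 - P_n)) and return the whole orbit."""
    orbit = [P0]
    for _ in range(steps):
        P = orbit[-1]
        orbit.append(P * (1 + r * (1 - P)))
    return orbit

a = evolve(0.1, 3.0, 100)
b = evolve(0.100001, 3.0, 100)  # a 'microbial' change in the initial value

max_gap = max(abs(x - y) for x, y in zip(a, b))
print(max_gap)  # the two evolutions become completely different
```

Within a few dozen generations the initial difference of 0.000 001 has been amplified to order one, which is precisely the behavior of Fig. 7.10.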


Fig. 7.8. Population evolution starting from P_0 = 0.01 and with two different growth rates. (a) r = 1.9. (b) r = 2.5

Fig. 7.9. Population evolution starting from 0.5 and with r = 2.569

This is a completely diﬀerent scenario from the error propagation

issue in the interest rate problem. There, a small deviation in the initial

conditions was propagated in a controlled, linear, way. In the present

case, as in the planetary game, error propagation is completely chaotic.

What we have here is not only a problem of high sensitivity to

changes in initial conditions. The crux of the problem is that the law

governing such sensitivity is unknown! To be precise, the graphs of

Figs. 7.5 and 7.10 were those obtained on my computer. On this basis,

the reader might say, for instance, that the population for r = 3 and

P_0 = 0.1 reaches a local maximum at the 60th generation. However,

such a statement does not make sense. Given the high sensitivity to

initial conditions, only by using an inﬁnitely large number of digits

would one be able to reach a conclusion as to whether or not the value

attained at the 60th generation is a local maximum.

Fig. 7.10. Chaotic population evolution for r = 3. (a) P_0 = 0.1. (b) P_0 = 0.100 001

Fig. 7.11. Time evolution of the difference in the Fig. 7.10 populations

Figure 7.11 shows the difference of P_n values between the evolutions shown in Fig. 7.10. It is clear that after a certain time no rule allows us to say how the initial perturbation evolves with n. This is the nature of deterministic chaos.

In the case of the interest rate, we had no chaos and a very simple rule

was applied to a ‘perturbation’ of the initial capital: the perturbation

was propagated in such a way that the relative diﬀerences were kept

constant. In deterministic chaos there are no rules. The best one can

do is to study statistically how the perturbations are propagated, on

average.

7.6 An Innocent Formula

There is a very simple formula that reﬂects a quadratic dependency of

the next value of a given variable x relative to its present value. It is


called a quadratic iterator and is written as follows:

x_{n+1} = ax_n(1 − x_n) .

The name ‘iterator’ comes from the fact that the same formula is ap-

plied repeatedly to any present value x_n (as in the previous formulas). Since the formula can be rewritten x_{n+1} = ax_n − ax_n², the quadratic

dependency becomes evident. (We could say that the formula for the

interest rate is the formula for a linear iterator.)

The formula is so simple that one might expect exemplary deter-

ministic behavior. Nevertheless, and contrary to one’s expectations, the

quadratic iterator allows a simple and convincing illustration of the de-

terministic chaos that characterizes many non-linear phenomena. For

instance, expressing a = r + 1 and x_n = rP_n/(r + 1), one obtains the

preceding formula for the population growth.

Let us analyze the behavior of the quadratic iterator as in the case

of the interest rate and population growth. For this purpose we start

by noticing that the function y = ax(1 − x) describes a parabola passing through the points (0, 0) and (1, 0), as shown in Fig. 7.12a. The

parameter a inﬂuences the height of the parabola vertex occurring at

x = 0.5 with value a/4. If a does not exceed 4, any value of x in the

interval [0, 1] produces a value of y in the same interval.

Figure 7.12a shows the graph of the iterations, in the same way

as for the interest rate problem, for a = 2.5 and the initial value

x_0 = 0.15. Successive iterations generate the following sequence of values: 0.15, 0.3187, 0.5429, 0.6204, 0.5888, 0.6053, 0.5973, 0.6013, 0.5993, 0.6003, 0.5998, and so on. This sequence, converging to 0.6, is shown in Fig. 7.12b.

Fig. 7.12. Quadratic iterator for a = 2.5 and x_0 = 0.15. (a) Graph of the iterations. (b) Orbit

As in the previous examples we may assume that the

iterations occur in time. The graph of Fig. 7.12b is said to represent

one orbit of the iterator. If we introduce a small perturbation of the

initial value (for instance, x_0 = 0.14 or x_0 = 0.16), the sequence still converges to 0.6.
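This convergent regime is easy to verify. For a = 2.5 the non-zero fixed point of x = ax(1 − x) is x = 1 − 1/a = 0.6, and nearby starting values are all attracted to it:

```python
def iterate(a, x0, steps):
    """Apply x -> a*x*(1 - x) repeatedly and return the final value."""
    x = x0
    for _ in range(steps):
        x = a * x * (1 - x)
    return x

# Perturbing the initial value barely matters: every orbit is pulled to 0.6.
for x0 in (0.15, 0.14, 0.16):
    print(x0, iterate(2.5, x0, 100))
```

The fixed point is stable here because the slope of the parabola at x = 0.6, namely a(1 − 2x) = −0.5, has magnitude less than 1.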

Figure 7.13 shows a non-convergent chaotic situation for a = 4.

A slight perturbation of the initial value produces very diﬀerent orbits,

as shown in Fig. 7.14. Between the convergent case and the chaotic

case the quadratic iterator also exhibits periodic behavior, as we had

already detected in the population growth model.

Fig. 7.13. Fifteen iterations of the quadratic iterator for a = 4. (a) x_0 = 0.2. (b) x_0 = 0.199

Fig. 7.14. Quadratic iterator orbits for a = 4. (a) x_0 = 0.2. (b) x_0 = 0.199999


Now suppose we have divided the [0, 1] interval into small subinter-

vals of width 0.01 and that we try to ﬁnd the frequency with which any

of these subintervals is visited by the quadratic iterator orbit, for a = 4

and x_0 = 0.2. Figure 7.15 shows what happens after 500 000 iterations.

As the number of iterations grows the graph obtained does not change

when we specify different initial values x_0. Interpreting frequencies as

probabilities, we see that the probability of a visit to a subinterval near 0 or 1 is rather larger than for one near 0.5. In fact, it

is possible to prove that for inﬁnitesimal subintervals one would obtain

the previously mentioned arcsine law (see Chap. 6) for the frequency

of visits of an arbitrary point. That is, the probability of our chaotic

orbit visiting a point x is equal to the probability of the last zero gain

having occurred before time x in a coin-tossing game!
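The count of visits can be reproduced with a short loop; with 0.01-wide bins, the arcsine density 1/(π√(x(1 − x))) predicts roughly ten times more visits next to 0 or 1 than around 0.5:

```python
def visit_counts(a, x0, iterations, n_bins=100):
    """Count how often the orbit of x_{n+1} = a*x_n*(1 - x_n) visits each
    width-0.01 subinterval of [0, 1]."""
    counts = [0] * n_bins
    x = x0
    for _ in range(iterations):
        x = a * x * (1 - x)
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return counts

counts = visit_counts(4.0, 0.2, 500_000)

# Bins hugging 0 or 1 are visited far more often than the central bin,
# as the arcsine density predicts.
print(counts[0], counts[50])
```

This is the numerical counterpart of Fig. 7.15: the histogram of visits reproduces the U-shaped arcsine law of Chap. 6.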

Consider the following variable transformation for the quadratic it-

erator:

x_n = sin²(πy_n) .

For a = 4, it can be shown that this transformation allows one to
rewrite the iterator formula as

y_{n+1} = 2 y_n (mod 1) ,

where the mod 1 (read modulo 1) operator means removing the integer
part of the expression to which it applies (2 y_n in this case). The chaotic
orbits of the quadratic iterator are therefore mapped onto chaotic orbits
of y_n. We may also express the y_n value relative to an initial value y_0.

As in the case of the interest rate, we get

y_n = 2^n y_0 (mod 1) .
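The conjugacy between the two maps can be checked numerically for a few steps (only a few, since floating-point arithmetic loses about one bit of agreement per doubling):

```python
import math

# If x_n = sin^2(pi * y_n) and y_{n+1} = 2*y_n (mod 1), then
# x_{n+1} = 4*x_n*(1 - x_n): both orbits stay locked together.
y = 0.1234567
x = math.sin(math.pi * y) ** 2
for _ in range(10):
    x = 4.0 * x * (1.0 - x)       # quadratic iterator, a = 4
    y = (2.0 * y) % 1.0           # shift map
    print(round(x - math.sin(math.pi * y) ** 2, 12))   # stays near 0
```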

Fig. 7.15. Probability of the chaotic quadratic iterator (a = 4) visiting a subinterval of [0, 1] of width 0.01, estimated in 500 000 iterations

146 7 The Nature of Chance

Let us see what this equation means, by considering y_0 expressed
in binary numbering, for instance y_0 = 0.101 100 100 010 111 1. In
the first iteration we have to multiply by 2 and remove the integer
part. In binary numbering, multiplying by 2 means a one-bit leftward
shift of the whole number, with a zero carried in at the right (sim-
ilarly to what happens when we multiply by 10 in decimal number-
ing). We get 1.011 001 000 101 111 0. Removing the integer part we get
0.011 001 000 101 111 0. The same operations are performed in the
following iterations. Suppose we only look at the successive bits of the

removed integer parts. What the iterations of the above formula then

produce are the consecutive bits of the initial value, which can be in-

terpreted as an inﬁnite sequence of coin throws. Now, we already know

that, with probability 1, a Borel normal number can be obtained by

random choice in [0, 1]. As a matter of fact, we shall even see in Chap. 9
that, with probability 1, an initial value y_0 randomly chosen in [0, 1]
turns out to be a number satisfying a strict randomness criterion, whose
bits cannot be predicted either by experiment or by theory. Chaos
already exists in the initial y_0 value! The quadratic iterator does not
compute anything; it only reproduces the bits of y_0. The main issue here is

not an inﬁnite precision issue. The main issue is that we do not have

a means of producing the bits of the sequence other than having a pre-

stored list of them (we shall return to this point in Chap. 9). However,

not all equations of the above type produce chaotic sequences, in the

sense that they do not propagate chaos to the following iterations. For

instance, the equation x_{n+1} = x_n + b (mod 1) does not propagate chaos
into the following iteration. We see this because, if x_0 and x'_0 are very
close initial conditions, after n iterations we will get x'_n − x_n = x'_0 − x_0.
That is, the initial difference persists unchanged indefinitely: it is neither
amplified nor damped.
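The contrast between the two maps is easy to see numerically; the offset b = 0.371 and the initial separation of 10^−9 below are illustrative assumptions:

```python
# Evolution of a small initial gap d0 under the chaotic shift map
# y -> 2y (mod 1) versus the non-chaotic map x -> (x + b) (mod 1).
def final_gap(step, u0, d0, n):
    u, v = u0, u0 + d0
    for _ in range(n):
        u, v = step(u), step(v)
    return abs(u - v)

d0 = 1e-9
gap_shift = final_gap(lambda y: (2.0 * y) % 1.0, 0.3, d0, 20)
gap_add = final_gap(lambda x: (x + 0.371) % 1.0, 0.3, d0, 20)
print(gap_shift)   # roughly 2**20 * d0: about a millionfold growth
print(gap_add)     # still about 1e-9: the gap neither grows nor shrinks
```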

7.7 Chance in Natural Systems

Many chance phenomena in natural systems composed of macroscopic

objects are of a chaotic nature. These systems generally involve several

variables to characterize the state of the system as time goes by, rather

than one as in the example of population evolution:

• for the 3-sphere system described above, the characterizing variables

are the instantaneous values of two or three components of position

and velocity,

• for a container where certain reagents are poured, the characterizing

variables are the concentrations of reagents and reaction products,


• for weather forecasting, the characterizing variables are the pressure

and temperature gradients,

• for cell culture evolution the relevant variables may be nutrient and

cell concentrations,

• for ﬁnancial investments, they may be rental rates and other market

variables,

and so on and so forth.

Let x denote a variable characterizing the system and let x_n and
x_{n+1} represent the values at consecutive instants of time, assumed
close enough to each other to ensure that the difference x_{n+1} − x_n
reflects the instantaneous time evolution in a sufficiently accurate way.
If these differences vary linearly with x_n, no chaotic phenomena take

place. Chaos in the initial conditions does not propagate. Chaos re-

quires non-linearities in order to propagate, as was the case for the

quadratic dependence in the population growth example. Now it just

happens that the time evolution in a huge range of natural systems

depends in a non-linear way on the variables characterizing the sys-

tem. Consider the example of a sphere orbiting in the xy plane around

another sphere of much larger mass placed at (0, 0). Due to the law

of gravitation, the time evolution of the sphere’s coordinates depends

on the reciprocal of its distance √(x^2 + y^2) to the center. We thus get

a non-linear dependency. In the case of a single orbiting sphere no

chaotic behavior will arise. However, by adding a new sphere to the

system we are adding an extra non-linear term and, as we have seen,

chaotic orbits may arise.

The French mathematician Henri Poincaré carried out pioneering work on

the time evolution of non-linear systems, a scientiﬁc discipline which

has attracted great interest not only in physics and chemistry, but also

in the ﬁelds of biology, physiology, economics, meteorology and oth-

ers. Many non-linear systems exhibit chaotic evolution under certain

conditions. There is an extensive popular science literature describing

the topic of chaos, where many bizarre kinds of behavior are presented

and characterized. What interests us here is the fact that a system

with chaotic behavior is a chance generator (like tossing coins or dice),

in the sense that it is not possible to predict the values of the vari-

ables. The only thing one can do is to characterize the system variables

probabilistically.

An example of a chaotic chance generator is given by the thermal

convection of the so-called Bénard cells. The generator consists of a thin

layer of liquid between two glass plates. The system is placed over


a heat source. When the temperature of the plate directly in contact

with the heat source reaches a certain critical value, small domains

are formed in which the liquid layer rotates by thermal convection.

These are the so-called Bénard cells. The direction of rotation of the

cells (clockwise or counterclockwise) is unpredictable and depends on

microscopic perturbations. From this point of view, one may consider

that the rotational state of a Bénard cell is a possible substitute for

coin tossing. If after obtaining the convection cells we keep on raising

the temperature, a new critical value is eventually reached where the

ﬂuid motion becomes chaotic (turbulent), corresponding to an even

more complex chance generator. Many natural phenomena exhibit some

similarity with the Bénard cells, e.g., meteorological phenomena.

Meteorologists are perfectly familiar with turbulent atmospheric

phenomena and with their hypersensitivity to small changes in cer-

tain conditions. In fact, it was a meteorologist (and mathematician),
Edward Lorenz, who discovered in 1956 the lack of predictability of

certain deterministic systems, when he noticed that weather forecasts

based on certain computer-implemented models were highly sensitive

to tiny perturbations (initially Lorenz mistrusted the computer and re-

peated the calculations with several machines!). As Lorenz says in one

of his writings: “One meteorologist remarked that if the theory were

correct, one ﬂap of a seagull’s wings could change the course of weather

forever.” This high sensitivity to initial conditions afterwards came to

be called the butterﬂy eﬀect.

The Lorenz equations, which are also applied to the study of convection
in Bénard cells, are

x_{n+1} − x_n = σ(y_n − x_n) ,   y_{n+1} − y_n = x_n(τ − z_n) − y_n ,

z_{n+1} − z_n = x_n y_n − β z_n .

In these equations, x represents the intensity of the convective motion,

y the temperature diﬀerence between ascending and descending ﬂuid,

and z the deviation from linearity of the vertical temperature proﬁle.

Note the presence of the non-linear terms x_n z_n and x_n y_n, responsible

for chaotic behavior for certain values of the control parameters, as

illustrated in Fig. 7.16. This three-dimensional orbit is known as the

Lorenz attractor.
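The attractor can be reproduced by iterating the equations above scaled by a small time step; the step h = 0.01 and the initial point (1, 1, 1) are assumptions made for illustration:

```python
# Euler-type iteration of the Lorenz equations with sigma = 10,
# tau = 28, beta = 8/3; h is an assumed small time step.
def lorenz_orbit(n, h=0.01, sigma=10.0, tau=28.0, beta=8.0 / 3.0):
    x, y, z = 1.0, 1.0, 1.0
    points = []
    for _ in range(n):
        x, y, z = (x + h * sigma * (y - x),
                   y + h * (x * (tau - z) - y),
                   z + h * (x * y - beta * z))
        points.append((x, y, z))
    return points

points = lorenz_orbit(5000)
# The orbit never settles down, yet remains on the bounded attractor.
print(max(abs(x) for x, _, _ in points))
```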

Progress in other studies of chaotic phenomena has allowed the iden-

tiﬁcation of many other systems of inanimate nature with chaotic be-

havior, e.g., variations in the volume of ice covering the Earth, vari-


Fig. 7.16. Chaotic orbit of the Lorenz equations for σ = 10, τ = 28, β = 8/3

ations in the Earth’s magnetic ﬁeld, river ﬂows, turbulent ﬂuid mo-

tion, water dripping from a tap, spring vibrations subject to non-linear

forces, and asteroid orbits.

7.8 Chance in Life

While the existence of chaos in inanimate nature may not come as
a surprise, the same cannot be said when we observe chaos in living nature.

In fact, life is ruled by laws that tend to maintain equilibrium of the

biological variables involved, in order to guarantee the development of

metabolic processes compatible with life itself. For instance, if we have

been running and we are then allowed to rest, the heart rate which

increased during the run goes back to its normal value after some time;

concomitantly, the energy production mode that was adjusted to suit

the greater muscular requirements falls back into the normal production

mode which prevails in the absence of muscular stress. One then ob-

serves that the physiological variables (as well as the biological variables

at cellular level) fall back to the equilibrium situation corresponding to

an absence of eﬀort. If on the other hand we were not allowed to rest

and were instead obliged to run indeﬁnitely, there would come a point

of death by exhaustion, when it would no longer be possible to sustain

the metabolic processes in a way compatible with life.

Life thus demands the maintenance of certain basic balances and

one might think that such balances demand a perfectly controlled, non-

chaotic, evolution of the physiological and biological variables. Surpris-

ingly enough this is not the case. Consider the heart rate of a rest-

ing adult human. One might think that the heart rate should stay

at a practically constant value or at most exhibit simple, non-chaotic


ﬂuctuations. However, it is observed that the heart rate of a healthy

adult (in fact, of any healthy animal and in any stage of life) exhibits

a chaotic behavior, even at rest, as shown in Fig. 7.17. Heart rate vari-

ability is due to two factors: the pacemaker and the autonomic nervous

system. The pacemaker consists of a set of self-exciting cells in the right

atrium of the heart. It is therefore an oscillator that maintains a practi-

cally constant frequency in the absence of stimuli from the autonomic

nervous system. The latter is responsible for maintaining certain in-

voluntary internal functions approximately constant. Besides the heart
beat, these include digestion, breathing, and metabolism. Although

these functions are usually outside voluntary control they can be inﬂu-

enced by the mental state. The autonomic nervous system is composed

of two subsystems: the sympathetic and the parasympathetic. The ﬁrst

is responsible for triggering responses appropriate to stress situations,

such as fear or extremes of physical activity. Increase in heart rate is

one of the elicited responses. The parasympathetic subsystem is re-

sponsible for triggering energy saving and recovery actions after stress.

A decrease in heart rate is one of the elicited actions.

The variability one observes in the heart rate is therefore inﬂuenced

by the randomness of external phenomena causing stress situations.

However, it has also been observed that there is a chaotic behavior of

the heart rate associated with the pacemaker itself, whose causes are

not yet fully understood. (However, some researchers have been able

to explain the mechanism leading to the development of chaotic phenomena
in periodically stimulated chicken heart cells.) Whatever its

causes, this chaotic heart rate behavior attenuates with age, reﬂecting

the heart’s loss of ﬂexibility in its adaptation to internal and external

stimuli.

Chaotic phenomena can also be observed in electrical impulses

transmitted by nervous cells, the so-called action potentials. An action

potential is nothing other than an electrical impulse sent by a neuron to

Fig. 7.17. Heart rate of an adult man. The frequency is expressed in beats per minute


Fig. 7.18. Evolution of successive action potentials in nervous cells (t_0 = 0.1 s, R = 3 s^−1)

other neurons. These impulses have an amplitude of about 0.1 volt and

last about 2 milliseconds. In 1988, two researchers, Glass and Mackey,
were able to present a deterministic model for the time intervals t_n
between successive peaks of action potentials. The model is non-linear
and stipulates that the next inter-peak interval is proportional to
the logarithm of the absolute value of 1 − 2e^(−R t_n), where R is a control

parameter of the model. This model exhibits chaotic behavior of the

kind shown in Fig. 7.18 for certain values of the control parameter, and

this behavior has in fact been observed in practice.
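The text gives the model only up to a proportionality constant; the sketch below assumes the constant is 1/R, an arbitrary choice made to keep the intervals on the scale of Fig. 7.18:

```python
import math

# Sketch of the inter-peak interval model described above. The text
# gives only proportionality, so the factor 1/R is an assumption.
def next_interval(t, R=3.0):
    return abs(math.log(abs(1.0 - 2.0 * math.exp(-R * t)))) / R

t = 0.1                      # t0 = 0.1 s, R = 3 s^-1, as in Fig. 7.18
intervals = []
for _ in range(50):
    t = next_interval(t)
    intervals.append(t)
print([round(v, 3) for v in intervals[:6]])   # irregular, never settling
```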

Many other physiological phenomena exhibit chaotic behavior, e.g.,

electrical potentials in cell membranes, breathing tidal volumes, vascu-

lar movements, metabolic phenomena like glucose conversion into
adenosine triphosphate, and electroencephalographic signals.

Fig. 7.19. Will it work?

8 Noisy Irregularities

8.1 Generators and Sequences

Imagine that you were presented with two closed boxes named C_1 and
C_2, both producing a tape with a sequence of sixteen zeros and ones
whenever a button is pressed. The reader is informed that one of the
boxes uses a random generator to produce the tape, while the other
box uses a periodic generator. The reader's task is to determine, using
only the tapes produced by the boxes, which one, C_1 or C_2, contains
the random generator. The reader presses the button and obtains:

C_1 sequence : 0011001100110011 ,

C_2 sequence : 1101010111010101 .

The reader then decides that box C_2 contains the random generator.
The boxes are opened and to the reader's amazement the random gen-
erator is in box C_1 (where, for instance, a mechanical device tosses
a coin) whereas box C_2 contains the periodic generator correspond-
ing to Fig. 7.9. This example illustrates the impossibility of deciding
the nature (deterministic or random) of a sequence generator on the
basis of observation, and in fact on the basis of any randomness mea-
surement, of one of its sequences (necessarily finite). For the moment
the 'measure' is only subjective. The reader decides upon C_2 because

the corresponding sequence looks irregular. Later on, we present a few

objective methods for measuring sequence irregularity. Note that seg-

ments of periodic sequences characterized by long and complex periods

tend to be considered as random, independently of whatever method

we choose to measure irregularity. For instance, the following sequence:


8, 43, 26, 5, 28, 15, 14, 9, 48, 51, 2, 13, 4, 23, 54, 17, 24, 59, 42,

21, 44, 31, 30, 25, 0, 3, 18, 29, 20, 39, 6, 33, 40, 11, 58, 37, 60,

47, 46, 41, 16, 19, 34, 45, 36, 55, 22, 49, 56, 27, 10, 53, 12, 63,

62, 57, 32, 35, 50, 61, 52, 7, 38, 1, 8, 43, 26, 5, 28, 15, 14, 9, 48,

51, 2, 13, 4, 23, 54, 17, 24, 59, 42, 21, 44, 31, 30, 25, 0, 3, 18,

29, 20, 39, 6, 33, 40, 11, 58, 37

is generated by a periodic generator of period 64 of the population

evolution type described in Chap. 7. It is diﬃcult, however, to perceive

the existence of periodicity. If we convert the sequence into binary

numbers, viz.,

001000101011011010000101011100001111001110001001110000

110011000010001101000100010111110110010001011000111011

101010010101101100011111011110011001000000000011010010

011101010100100111000110100001101000001011111010100101

111100101111101110101001010000010011100010101101100100

110111010110110001111000011011001010110101001100111111

111110111001100000100011110010111101110100000111100110

000001001000101011011010000101011100001111001110001001

110000110011000010001101000100010111110110010001011000

111011101010010101101100011111011110011001000000000011

010010011101010100100111000110100001101000001011111010

100101

it becomes even more diﬃcult.

Now, consider sequences generated by random generators such as

coin-tossing. We may, of course, obtain sequences such as the one

generated by C_1 or other, even more anomalous sequences, such as

0000000000000000 or 0000000011111111. In fact, as we know already,

any sequence of n bits has the same probability of occurring, namely
1/2^n. We also know that only after generating a large number of
sequences of n bits can we expect to elucidate the nature of the
generator. If the sequences are generated by a coin-tossing process, we expect
about 95% of the generated sequences to exhibit average values S/n
within 0.5 ± 1/√n. For n = 16, this interval is 0.5 ± 0.25. For instance,

the sequences 0000000000000000 (S/n = 0) and 0111111111111110
(S/n = 0.875) are outside that interval. If more than 5% of the sequences

are outside that interval, we then have strong doubts (based on the

usual 95% conﬁdence level) that our generator is random. Moreover,


if the generator produces independent outcomes – a usual assumption

in fair coin-tossing – we then expect to find an equal rate of occurrence

of 00, 01, 10 and 11 in a large number of sequences. The same ap-

plies to all distinct, equal-length subsequences. It is also possible in

this case to establish conﬁdence intervals that would exclude sequences

such as 0111111111111110, where 01 and 10 only occur once and 00

never occurs. There are also statistical tests that allow one to reject

the randomness assumption, with a certain degree of certainty, when

the number of equal-value subsequences (called runs) is either too large

– as in 0100110011001100, exhibiting 9 equal-value subsequences –, or

too small, as in 0000000011111111.
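Both checks are one-liners; the sketch below applies them to the example sequences just discussed:

```python
# S/n (proportion of ones) and the number of runs (maximal blocks of
# equal symbols) for the example sequences discussed above.
def proportion(seq):
    return seq.count('1') / len(seq)

def runs(seq):
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

print(proportion('0111111111111110'))   # 0.875, outside 0.5 +/- 0.25
print(runs('0100110011001100'))         # 9 runs: too many
print(runs('0000000011111111'))         # 2 runs: too few
```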

Brieﬂy, given a large number of sequences there are statistical tests

allowing one to infer something about the randomness of the gener-

ator, with a certain degree of certainty. The important point is this:

the probability measure associated with the generator outcomes is only

manifest in sets of a large number of experiments (laws of large num-

bers). Probability (generator randomness) is an ensemble property.

In daily life, however, we often cannot aﬀord the luxury of repeat-

ing experiments. We cannot repeat the evolution of share values in

the stock market; we cannot repeat the historical evolution of eco-

nomic development; we cannot repeat the evolution of the universe in

its ﬁrst few seconds; we cannot repeat the evolution of life on Earth

and assess the impact of gene mutations. We are then facing unique se-

quences of events on the basis of which we hope to judge determinism

or randomness. From all that has been said, an important observation

emerges: generator randomness and sequence randomness are distinct

issues. From this chapter onwards we embark on a quite diﬀerent topic

from those discussed previously: is it possible to speak of sequence ran-

domness? What does it mean?

8.2 Judging Randomness

How can one judge the randomness of a sequence of symbols? The

Russian mathematicians Kolmogorov and Uspenskii established the

following criteria of sequence randomness, based on a certain historical

consensus and a certain plausibility as described in probability theory:

• Typicality: the sequence must belong to a majority of sequences

that do not exhibit obvious patterns.


• Chaoticity: there is no simple law governing the evolution of the

sequence terms.

• Frequency stability: equal-length subsequences must exhibit the

same frequency of occurrence of the diﬀerent symbols.

In any case, the practical assessment of sequence randomness is based

on results of statistical tests and on values of certain measures, some

of which will be discussed later on. No set of these tests or measures is

infallible; that is, it is always possible to come up with sequences that

pass the tests or present ‘good’ randomness measures and yet do not

satisfy one or another of the Kolmogorov–Uspenskii criteria. Moreover,

as we shall see in the next chapter, although there is a good deﬁnition

of sequence randomness, there is no deﬁnite proof of randomness. In

this sense, the best that can be done in practice using the available

tests and measures is to judge not randomness itself, in the strict and

well-deﬁned sense presented in the next chapter, but rather to judge

a certain degree of irregularity or chaoticity in the sequences.

It is an interesting fact that the human speciﬁcation of random

sequences is a diﬃcult task. Psychologists have shown that the notion

of randomness emerges at a late phase of human development. If after

tossing a coin which turned up heads we ask a child what will come up

next, the most probable answer the child will give is tails. Psychologists

Maya Bar-Hillel and Willem Wagenaar performed an extensive set of

experiments with volunteers who were required to write down random

number sequences. They came to the conclusion that the sequences

produced by humans exhibited:

• too few symmetries and too few long runs,

• too many alternating values,

• too much symbol frequency balancing in short segments,

• a tendency to think that deviations from equal likelihood confer

more randomness,

• a tendency to use local representations of randomness, i.e., people

believe in a sort of ‘law of small numbers’, according to which even

short sequences must be representative of randomness.

It was also observed that these shortcomings are manifest not only in
lay people but also in probability experts. It seems as though

we each have an internal prototype of randomness; if a sequence is

presented which does not ﬁt the prototype, the most common tendency

is to reject it as non-random.


8.3 Random-Looking Numbers

So-called ‘random number’ sequences are more and more often used in

science and technology, where many analyses and studies of phenomena

and properties of objects are based on stochastic simulations (using the

Monte Carlo method, mentioned in Chap. 4). The simulation of random

number sequences in computers is often based on a simple formula:

x_{n+1} = a x_n mod m .

The mod symbol (read modulo) represents the remainder after integer
division (in this case of a x_n by m; see also what was said about the
quadratic iterator in the last chapter). Suppose we take a = 2, m = 10
and that the initial value x_0 of the sequence is 6. The following value
is then 2 × 6 mod 10, i.e., the remainder of the integer division of 12 by
10. We thus obtain x_1 = 2. Successive iterations produce the periodic

sequence 48624862, shown in Fig. 8.1a. Since the sequence values are

given by the remainder after integer division, they necessarily belong to

the interval from 0 to m−1 and the sequence is necessarily periodic:
as soon as it returns to a previously visited value, in at most m iterations,
the sequence repeats itself. One can obtain the maximum period by using
a new version of the above formula, x_n = (a x_{n−1} + c) mod m, setting

appropriate values for a and c. Figure 8.1b shows 15 iterations using

the same m and initial value, with a = 11 and c = 3. The generated

sequence is the following, with period 10:

25814703692581470369 . . . .
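The generators used in Figs. 8.1 and 8.2 can be reproduced in a few lines:

```python
# Linear congruential generator: x_k = (a * x_{k-1} + c) mod m.
def lcg(a, c, m, x0, n):
    seq = [x0]
    for _ in range(n - 1):
        seq.append((a * seq[-1] + c) % m)
    return seq

print(''.join(map(str, lcg(11, 3, 10, 2, 20))))   # 25814703692581470369
print(lcg(5, 3, 64, 8, 9))   # start of the period-64 sequence of Sect. 8.1
```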

Fig. 8.1. Pseudo-random number generator starting at 2 and using m = 10.

(a) Period 4, with a = 2, c = 0. (b) Period 10, with a = 11 and c = 3


Note how the latter sequence looks random when we observe the ﬁrst

10 values, although it is deterministic and periodic. For this reason it

is customary to call the sequence pseudo-random. This type of chaotic

behavior induced by the mod operator is essentially of the same nature

as the chaoticity observed in Fig. 7.7 for the diﬀerence in interest rates.

For suﬃciently large m, the periodic nature of the sequence is of little

importance for a very wide range of practical applications. Figure 8.2

illustrates one such sequence obtained by the above method and with

period 64. In computer applications the period is often at least as large

as 2^16 = 65 536 numbers.

There are other generators that use sophisticated versions of the

above formula. For instance, the next iterated sequence number instead

of depending only on the preceding number may depend on other past

values, as in the formula

x_n = (a x_{n−1} + b x_{n−2} + c) mod m .

A generator of this type that has been and continues to be much used
for the generation of pseudo-random binary sequences is

x_n = (x_{n−p} + x_{n−q}) mod 2 , with integer p and q .

In 2004, on the basis of a study that involved the analysis of sequences

of 26 000 bits, the researchers Heiko Bauke, from the University of

Magdeburg, Germany, and Stephan Mertens, from the Centre for The-

oretical Physics, Trieste, Italy, showed (more than twenty years after

the invention of this generator and its worldwide adoption!), that this

Fig. 8.2. Pseudo-random sequence of period 64 (a = 5, c = 3, m = 64)


Fig. 8.3. Drunk and disorderly, on

a regular basis

generator was strongly biased, producing certain subsequences of zeros

and ones exhibiting strong deviations from what would be expected

from a true random sequence. And so a reputable random generator

that had been in use for over 20 years without much suspicion was

suddenly discredited.

8.4 Quantum Generators

There are some applications of random number sequences that are quite

demanding as far as their randomness is concerned. One example is

the use of random numbers in cryptography. Let us take the following

message:

ANNARRIVESTODAY .

This message can be coded using a random sequence of letters, known

as a key, such as KDPWWYAKWDGBBEFAHJOJJJXMLI. The pro-

cedure is as follows. We consider the letters represented by their number

when in alphabetic order. For the English language the alphabet has

26 letters, which we number from 0 through 25. We then take the let-

ters corresponding to the message and the key and add the respective


order numbers. For the above message and key the ﬁrst letters are re-

spectively, A and K, with order numbers 0 and 10. The sum is 10. We

next take the remainder of the integer division by 26 (the number of

letters in the alphabet), which is 10. The letter of order 10 is K and it

will be the ﬁrst letter of the coded message. Repeating the process one

obtains:

KQCWNPIFAVZPEED .
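The coding procedure just described can be sketched as:

```python
# Add the alphabetic order numbers of message and key letters modulo 26.
def encode(message, key):
    A = ord('A')
    return ''.join(chr((ord(m) + ord(k) - 2 * A) % 26 + A)
                   for m, k in zip(message, key))

coded = encode('ANNARRIVESTODAY', 'KDPWWYAKWDGBBEFAHJOJJJXMLI')
print(coded)   # KQCWNPIFAVZPEED
```

Only the first fifteen key letters are consumed; the receiver subtracts the key letters modulo 26 to recover the message.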

If the key is completely random each letter of the key is unpredictable

and the ‘opponent’ does not possess any information which can be used

to decipher it. If it is pseudo-random there are regularities in the key

that a cryptanalyst may use in order to decipher it.

For this sort of application, requiring high-quality sequence randomness,
there are commercially available random number generators based
on quantum phenomena. Here are some examples of quantum phenomena
presently used by random number generators:

• Radioactive emission.

• Atmospheric noise.

• Thermal noise generated by passing an electric current through a resistor.

• Noise detected in the computer supply line (with thermal noise com-

ponent).

• Noise detected at the input of a computer's sound card (essentially thermal noise).

• Arrival event detection of photons in optical ﬁbers.

8.5 When Nothing Else Looks the Same

We mentioned the existence of statistical methods for randomness as-

sessment, or more rigorously for assessing sequence irregularity. Meth-

ods such as the conﬁdence interval of S/n (arithmetic average) or the

runs test allow one to assess criterion 3 of Kolmogorov and Uspenskii.

However, these tests have severe limitations allowing only the rejection

of the irregularity hypothesis in obvious cases. Consider the sequence

00110011001100110011. It can be checked that it passes the single pro-

portion test and the runs test; that is, on the basis of those tests, one

cannot reject the irregularity hypothesis with 95% conﬁdence. However,


the sequence is clearly periodic and therefore does not satisfy criterion

2 of Kolmogorov and Uspenskii.

There is an interesting method for assessing irregularity which is

based on the following idea: if there is no rule that will generate the

following sequence elements starting with a given element, then a cer-

tain sequence segment will not resemble any other equal-length seg-

ment. On the other hand, if there are similar segments, an element-

generating rule may then exist. We now present a ‘similarity’ measure

consisting of adding up the product of corresponding elements between

the original sequence and a shifted version of it. Take the very sim-

ple sequence 20102. Figure 8.4a shows this sequence (open bars) and

another version of it (gray bars) perfectly aligned (for the purposes of

visualization, bars in the same position are represented side by side; the

reader must imagine them superimposed). In the case of Fig. 8.4a, the

similarity between the two sequences is perfect. The sum of products

is

2 ×2 + 1 ×1 + 2 ×2 = 9 .

Figures 8.4b and c show what happens when the replica of the original

sequence is shifted left (say, corresponding to positive shift values). In

the case of Fig. 8.4b, there is a one-position shift and the sum of prod-

ucts is 0 (when shifting we assume that zeros are carried in). In the

case of Fig. 8.4c, there is a two-position shift and the sum of products

is 4. The sum of products thus works as a similarity measure. When

the two sequences are perfectly aligned the similarity reaches its max-

imum. When the shift is only by one position, the similarity reaches

its minimum (in fact zero values of one sequence are in this case in

correspondence with non-zero values of the other).
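The sum-of-products measure for the 20102 example can be sketched as:

```python
# Sum of products between a sequence and its replica shifted left by
# `shift` positions (zeros carried in at the end), as in Fig. 8.4.
def similarity(seq, shift):
    return sum(seq[i] * seq[i + shift] for i in range(len(seq) - shift))

seq = [2, 0, 1, 0, 2]
print([similarity(seq, s) for s in range(3)])   # [9, 0, 4]
```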

Figure 8.5a shows, for the 20102 sequence, the curve of the succes-

sive values of this similarity measure known as autocorrelation. The

Fig. 8.4. Correlation of 20102 with itself for 0 (a), 1 (b) and 2 shifts (c)


Fig. 8.5. Autocorrelation curves of (a) 20102, (b) 01220

name is due to the fact that we are (co)relating a sequence with itself.

The sequence exhibits a certain zero–non-zero periodicity. The auto-

correlation also exhibits that periodicity. Now suppose we shuﬄed the

sequence values so that it looked like a random sequence of 0, 1 and 2.

For instance, 01220. Now, the autocorrelation curve, shown in Fig. 8.5b,

does not reveal any periodicity.

For a random sequence we expect the autocorrelation to be diﬀerent

from zero only when the sequence and its replica are perfectly aligned.

Any shift must produce zero similarity (total dissimilarity): nothing

looks the same. As a matter of fact, we are interested in a canceling

eﬀect between concordant and discordant values, through the use of

the products of the deviations from the mean value, as we did with the

correlation measure described in Chap. 5. Figure 8.6 shows the autocor-

Fig. 8.6. Autocorrelation curves of (a) 00110011001100110011 and (b)

00100010101101101000


Fig. 8.7. Autocorrelation of a 20 000-bit sequence generated by a radioactive emission phenomenon

relations for the previous periodic binary sequence (at the beginning

of this section) and for the ﬁrst twenty values of the long-period se-

quence presented at the beginning of this chapter, computed using the

deviations from the mean. The irregularity of the latter is evident.

Finally, Fig. 8.7 shows the autocorrelation of a 20 000-bit sequence

generated by a quantum phenomenon (radioactive emission). Figure 8.7

is a good illustration that, oﬀ perfect alignment, nothing looks the

same.
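The similarity measure described above is easy to compute directly. The following sketch (in Python, not from the book) uses the deviations from the mean, as in the text, and normalizes by the zero-shift value so that perfect alignment gives 1; the test sequence is illustrative.

```python
def autocorrelation(seq):
    """Autocorrelation for all non-negative shifts, computed from the
    products of deviations from the mean, normalized so that the
    zero-shift (perfect alignment) value is 1."""
    n = len(seq)
    mean = sum(seq) / n
    dev = [x - mean for x in seq]
    c0 = sum(d * d for d in dev)  # sum of products at zero shift
    return [sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
            for k in range(n)]

# A periodic sequence (20102 repeated) shows a peak at shifts that are
# multiples of the period; a shuffled version would not.
periodic = [2, 0, 1, 0, 2] * 4
ac = autocorrelation(periodic)
print(round(ac[0], 2), round(ac[5], 2), round(ac[1], 2))  # 1.0 0.75 -0.31
```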

8.6 White Noise

Light shining from the Sun or from the lamps in our homes, besides

its corpuscular nature (photons), also has a wave nature (electromag-

netic ﬁelds) decomposable into sinusoidal vibrations of the respective

ﬁelds. Newton’s decomposition of white light reproducing the rainbow

phenomenon is a well-known experiment. The French mathematician

Joseph Fourier (1768–1830) introduced functional analysis based on

sums of sinusoids (the celebrated Fourier transform) which became

a widely used tool for analysis in many areas of science. We may also

apply it to the study of random sequences.

Suppose we add two sinusoids (over time t) as shown in Fig. 8.8a.

We obtain a periodic curve. We may vary the amplitudes, the phases

(deviation from the origin of the sinusoid starting point), and the fre-

quencies (related to the inter-peak interval) and the result is always

periodic. In Fourier analysis the inverse operation is performed: given

the signal (solid curve of Fig. 8.8a), one obtains the sinusoids that

compose it (the rainbow). Figure 8.8b shows the Fourier analysis result (only the amplitudes and the frequencies) of the signal in Fig. 8.8a. This is the so-called Fourier spectrum.

Fig. 8.8. (a) Sum of two sinusoids (dotted curves) resulting in a periodic curve (solid curve). One of the sinusoids has amplitude A = 2, frequency 0.1 (reciprocal of the period T = 10) and phase delay ∆ ≈ 2. (b) Spectrum of the periodic signal

What happens if we add up an arbitrarily large number of sinusoids

of the same frequency but with random amplitudes and phases? Well,

strange as it may seem at ﬁrst, we obtain a sinusoid with the same

frequency (this result is easily deducible from trigonometric formulas).

Let us now suppose that we keep the amplitudes and phases of the

sinusoids constant, but randomly and uniformly select their frequencies.

That is, if the highest frequency value is 1, then according to the uni-

form law mentioned in Chap. 4, no subinterval of [0, 1] is favored relative

to others (of equal length) in the random selection of a frequency value.

The result obtained is now completely diﬀerent. We obtain a random

signal! Figure 8.9 shows a possible example for the sum of 500 sinusoids

with random frequencies.

Fig. 8.9. An approximation of white noise (1 024 values) obtained by adding up 500 sinusoids with uniformly distributed frequencies

As we increase the number of sinusoids, the autocorrelation of the sum approaches a single impulse and the signal spectrum reflects an even sinusoidal composition for all frequency

intervals; that is, the spectrum tends to a constant value (1, for unit

amplitudes of the sinusoids). This result is better observed in the au-

tocorrelation spectrum. Figure 8.10 shows the autocorrelation and its

spectrum for the Fig. 8.9 sequence. The same result (approximately) is

obtained for any other similar experiment of sinusoidal addition.
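This experiment is easy to reproduce numerically. The sketch below (not from the book; the sample sizes are arbitrary choices) keeps unit amplitudes and zero phases, draws the frequencies uniformly, and checks that the sample autocorrelation nearly vanishes away from perfect alignment.

```python
import math
import random

random.seed(0)

def sinusoid_sum(n_samples=1024, n_sines=500):
    """Sum of unit-amplitude, zero-phase sinusoids whose frequencies
    are drawn uniformly from [0, 1]: an approximation of white noise."""
    freqs = [random.random() for _ in range(n_sines)]
    return [sum(math.sin(2 * math.pi * f * t) for f in freqs)
            for t in range(n_samples)]

signal = sinusoid_sum()
mean = sum(signal) / len(signal)
dev = [x - mean for x in signal]
c0 = sum(d * d for d in dev)                # similarity at zero shift
c10 = sum(dev[i] * dev[i + 10] for i in range(len(dev) - 10))
# Off perfect alignment, the similarity is small compared with c0.
print(round(c10 / c0, 3))
```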

A signal of the type shown in Fig. 8.9 is called white noise (by

analogy with white light). If we hear a signal of this type, it resembles

the sound of streaming water. An interesting aspect is that, although

the frequencies are uniformly distributed, the amplitudes of the white

noise are not uniformly distributed. Their distribution is governed by

the wonderful Gauss curve (see Fig. 8.11). In fact, it does not matter

what the frequency distribution is, uniform or otherwise! Once again,

the central limit theorem is at work here, requiring in the ﬁnal result

that the amplitudes be distributed according to the normal law.

Fig. 8.10. Autocorrelation and its spectrum for the sequence of Fig. 8.9

Fig. 8.11. Histogram of a white noise sequence (1 024 values) obtained by adding up 50 000 sinusoids with random amplitudes

White noise is, therefore, nothing other than a random sequence in

which the diﬀerent values are independent of each other (zero auto-

correlation for any non-zero shift) produced by a Gaussian random

generator.

We thus see that the autocorrelation and spectrum analysis methods

tell us something about the randomness of sequences. However, these

are not infallible either. For instance, the spectrum of the sequence

0000000010000000, clearly not random, is also 1. On the other hand, the

spectrum of Champernowne’s normal number (mentioned in Chap. 6)

suggests the existence of ﬁve periodic regimes which are not visually

apparent.

8.7 Randomness with Memory

White noise behaves like a sequence produced by coin-tossing. In fact, if

we compute the autocorrelation and the spectrum of a sequence of coin

throws (0 and 1), we obtain results that are similar to those obtained for

white noise (Fig. 8.10). Let us now calculate the accumulated gain by

adding the successive values of white noise as we did in the coin-tossing

game. Figure 8.12a shows the accumulated gain of a possible instance

of white noise. We obtain a possible instance of the one-dimensional

Brownian motion already known to us. On the basis of this result,

we may simulate Brownian motions in two (or more) dimensions (see

Fig. 8.13), in a more realistic way than using the coin-tossing generator,

which imposed a constant step in every direction (see Chap. 6).
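A minimal simulation along these lines (a sketch, not the author's code; Python's `random.gauss` plays the role of the Gaussian generator):

```python
import random

random.seed(1)

def brownian(n_steps):
    """1-D Brownian motion: the accumulated gain of Gaussian
    white-noise increments, as in the coin-tossing game but with
    variable step sizes."""
    path, position = [], 0.0
    for _ in range(n_steps):
        position += random.gauss(0.0, 1.0)
        path.append(position)
    return path

def brownian_2d(n_steps):
    """2-D Brownian motion: independent Gaussian increments for each
    coordinate, so the step is no longer constant in every direction
    as with the coin-tossing generator."""
    path, x, y = [(0.0, 0.0)], 0.0, 0.0
    for _ in range(n_steps):
        x += random.gauss(0.0, 1.0)
        y += random.gauss(0.0, 1.0)
        path.append((x, y))
    return path

walk = brownian(1000)
print(len(walk))  # 1000
```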

In 1951, the hydraulic engineer Harold Hurst (1880–1970) presented

the results of his study on the river Nile ﬂoods (in the context of the

Aswan dam project, in Egypt). At first sight one might think that the average yearly flows of any river constitute a random sequence such as the white noise sequence. However, in that study, and on the basis of flow records spanning 800 years, Hurst discovered that there was a correlation among the flows of successive years. There was a tendency for a year of good (high) flow rate to be followed by another year of good flow rate, and for a year of bad (low) flow rate to be followed by another year of bad flow rate. It is almost as if there were a ‘memory’ in the flow rate evolution.

Fig. 8.12. Brownian motion sequence (a) and its autocorrelation spectrum (b). The dotted line with slope −2 corresponds to the 1/f² evolution

Fig. 8.13. Simulation of Brownian motion using white noise

The study also revealed that the evolution of river ﬂows was well ap-

proximated by a mathematical model, where instead of simply adding

up white noise increments, one would add up the increments multiplied

by a certain weighting factor inﬂuencing the contribution of past in-

crements. Let t be the current time (year) and s some previous time

(year). The factor in the mathematical model is given by (t − s)^(H−0.5),


Fig. 8.14. Examples of fractional Brownian motion and their respective autocorrelation spectra (right). (a) H = 0.8. (c) H = 0.2. The spectral evolutions shown in (b) and (d) reveal the 1/f^(2H+1) decay, 2.6 and 1.4 for (a) and (c), respectively

where H (known as Hurst exponent) is a value greater than zero and

less than one.

Consider the case with H = 0.8. The weight (t − s)^0.3 will then

inﬂate the contribution of past values of s and one will obtain smooth

curves with a ‘long memory of the past’, as in Fig. 8.14a. The current

ﬂow value tends to stray away from the mean in the same way as the

previous year’s value. Besides river ﬂows, there are other sequences

that behave in this persistent way, such as currency exchange rates

and certain share values on the stock market.

On the other hand, if H = 0.2, the (t − s)^(−0.3) term will quickly

decrease for past increments and we obtain curves with high irregular-

ity, with only ‘short-term memory’, such as the one in Fig. 8.14c. We

observe this anti-persistent behavior, where each new value tends to

revert to the mean, in several phenomena, such as heart rate tracings.


Between the two extremes, for H = 0.5, we have a constant value of (t − s)^(H−0.5), equal to 1. There is no memory in this case, corresponding to Brownian motion. The cases with H different from 0.5 are called fractional Brownian motion.
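A rough numerical sketch of this weighting scheme (illustrative only; it follows the (t − s)^(H−0.5) factor from the text, omits any normalization constant, and uses an O(n²) direct sum):

```python
import random

def fractional_brownian(n, hurst, seed=2):
    """Approximate fractional Brownian motion: the value at time t is
    a sum of past white-noise increments, each weighted by
    (t - s) ** (hurst - 0.5). For hurst = 0.5 every weight equals 1
    and we recover ordinary Brownian motion (the plain accumulated
    sum of the increments)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    path = []
    for t in range(1, n + 1):
        value = sum((t - s) ** (hurst - 0.5) * noise[s] for s in range(t))
        path.append(value)
    return path

smooth = fractional_brownian(200, 0.8)   # persistent: long memory
rough = fractional_brownian(200, 0.2)    # anti-persistent: irregular
```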

8.8 Noises and More Than That

Besides river ﬂows there are many other natural phenomena, involv-

ing chance factors, well described by fractional Brownian motion. An

interesting example is the heart rate, including the fetal heart rate,

a tracing of which is shown in Fig. 8.15. In this, as in other ﬁelds, the

application of mathematical models reveals its usefulness.

Let us consider how chance phenomena are combined in nature and

daily life. We have already seen that phenomena resulting from the

addition of several random factors, acting independently of one another,

are governed by the normal distribution law. There are also speciﬁc

laws for phenomena resulting from multiplication (instead of addition)

of random factors, as we shall see in a moment.

If we record a phrase of violin music, similar to a sum of sinusoids as

in Fig. 8.8a, and we play it at another speed, the sound is distorted and

the distinctive quality of the violin sound is lost. There are, however,

classes of sounds whose quality does not change in any appreciable way

when heard at other speeds. Such sounds are called scalable noises. The

increase or decrease of the playing speed used to hear the sound is an

observation scale. If we record white noise, such as in Fig. 8.9, and play it at another speed, it will still sound like white noise (the sound of streaming water or ‘static’ radio noise). The spectrum is still a straight line and the autocorrelation is zero except at the origin. White noise remains white at any observation scale.

Fig. 8.15. Sequence of (human) fetal heart rate in calm sleep (H = 0.35, D = 1.65)

Let us now consider Brownian motion sequences. In this case the

standard deviation of the noise amplitude can be shown to vary with

the observation scale. Related to this fact, it can also be shown that

the ﬂuctuations of Brownian motion are correlated in a given time scale

and the corresponding (autocorrelation) spectrum varies as 1/f² (see Fig. 8.12b). Brownian ‘music’ sounds less chaotic than white noise and,

when played at diﬀerent speeds, sounds diﬀerent while nevertheless re-

taining its Brownian characteristic. We have seen before that sequences

of fractional Brownian motion also exhibit correlations. As a matter of

fact, their spectra vary as 1/f^a with a = 1 + 2H (see Figs. 8.14b and d), with a taking values between 1 and 3; that is, fractional Brownian noises have a spectrum above the 1/f spectrum. What kind of noise, then, has a 1/f spectrum? Well, it has been shown that music by Bach and

other composers also has a 1/f spectrum. If we play Brownian music

with a 1/f² spectrum, we get an impression of monotony, whereas if we

play ‘noise’ with a 1/f spectrum, we get a certain amount of aesthetic

feeling.

Do chance phenomena generating 1/f spectrum sequences occur

that often? In Chap. 3 we mentioned the Bernoulli utility function

U(f) = log f. Utility increments also obey the 1/f rule, which is noth-

ing other than instantaneous variation of the logarithm. Imagine that

I pick up the temporal evolution of the fortune of some individual and

I multiply all the values by a scale factor a. What happens to the

evolution of the respective utilities? Since log(af) = log a + log f, the

evolution of the scaled utility shows the same pattern with the minor

diﬀerence that there is a shift of the original utility by the constant

value log a. Furthermore, the relative variation, i.e., (variation of U)/(variation of f), is maintained. Briefly, the utility function exhibits scale invariance.

There are many natural phenomena exhibiting scale invariance, known

as 1/f phenomena. For instance, it has long been known that animals

do not respond to the absolute value of some stimuli (such as smell,

heat, electric shock, skin pressure, etc.), but rather to the logarithm

of the stimuli. There are also many socio-economic 1/f phenomena, such as the wealth distribution in many societies, the frequency with which words are used, and the production of scientific papers.


The existence of 1/f random phenomena has been attributed to the

so-called multiplicative eﬀect of their causes. For instance, in the case

of the production of scientiﬁc papers, it is well known that a senior

researcher tends to co-author many papers corresponding to works of

supervised junior researchers, more papers than he would individually

have produced. In the same way, people that are already rich tend

to accumulate extra wealth by a multiplicative process, mediated by

several forms of investment and/or ﬁnancial speculation. Consider the

product f of several independent random variables acting in a multi-

plicative way. The probability law of the logarithm of the product is, in

general, well approximated by the normal law, given the central limit

theorem and the fact that the logarithm of a product converts into

a sum of logarithms. Now, if log f is well approximated by the normal

law, f is well approximated by another law (called the lognormal law)

whose shape tends to 1/f as we increase the number of multiplicative

causes. Hence, 1/f phenomena also tend to be ubiquitous in nature

and everyday life.
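A quick numerical illustration of this multiplicative effect (a sketch; the factor distribution and sample sizes are arbitrary choices, not from the book):

```python
import math
import random

random.seed(3)

def random_product(n_factors):
    """Product of independent positive random factors, mimicking
    multiplicative growth (e.g., compounded investment returns)."""
    product = 1.0
    for _ in range(n_factors):
        product *= random.uniform(0.5, 2.0)
    return product

# The logarithm of the product is a sum of independent terms, so by
# the central limit theorem it is approximately normally distributed;
# the product itself is then approximately lognormal.
logs = [math.log(random_product(100)) for _ in range(2000)]
mean = sum(logs) / len(logs)
below = sum(1 for v in logs if v < mean) / len(logs)
print(round(below, 2))  # close to 0.5: the log is roughly symmetric
```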

8.9 The Fractality of Randomness

There is another interesting aspect of the irregularity of Brownian se-

quences. Suppose that I am asked to measure the total length of the

curve in Fig. 8.12a and for that purpose I am given a very small non-

graduated ruler, of length 1 in certain measurement units, say 1 mm.

All I can do is to see how many times the ruler (the whole ruler) ‘ﬁts’

into the curve end-to-end. Suppose that I obtain a length which I call

L(1), e.g., L(1) = 91 mm. Next, I am given a ruler of length 2 mm

and I am once again asked to measure the curve length. I ﬁnd that

the ruler ‘fits’ 32 times, and therefore L(2) = 64 mm. Now, 64 mm is approximately L(1)/√2. In fact, by going on with this process I would come to the conclusion that the following relation holds between the measurement scale r (ruler length) and the total curve length L(r): L(r) = L(1)/√r, which we may rewrite as L(r) = L(1)r^(−0.5). We thus

have an object (a Brownian motion sequence) with a property (length)

satisfying a power law (r^α) relative to the scale (r) used to measure the property. In the present case, α = −0.5. Whenever such a power

law is satisﬁed, we are in the presence of a fractal property. Thus, the

length of a sequence of Brownian motion is a fractal.

Much has been written about fractals and there is a vast popular

literature on this topic. The snowﬂake curve (Fig. 8.16) is a well-known


Fig. 8.16. The ﬁrst four iterations of the snowﬂake curve. As one increases the

number of iterations, the curve length tends to inﬁnity, while staying conﬁned

inside a bounded plane region

fractal object characterized by an inﬁnite length, although contained

in a finite area. In fact, the number of snowflake segments N(r) measured with resolution r is given by N(r) = (1/r)^1.26 = r^(−1.26). Were it a ‘common’, non-fractal curve, we would obtain N(r) = (1/r)^1 = r^(−1) (if, for instance, a 1-meter straight line is measured in centimeters, we find 1/0.01 = 100 one-centimeter segments). In the same way, for a ‘common’, non-fractal area, we would obtain N(r) = (1/r)^2 = r^(−2) [if, for instance, I have to fill up a square with side 1 meter with tiles of side 20 cm, I need (1/0.2)^2 = 25 tiles]. The snowflake curve is placed

somewhere between a ﬂat ‘common’ curve of dimension 1 and a plane

region of dimension 2. For ‘common’ objects, the exponent of r is al-

ways an integer. However, for fractal objects such as the snowﬂake, the

exponent of r often contains a fractional part leading to the ‘fractal’

designation. (As a matter of fact there are also fractal objects with in-

teger exponents.) We say that 1.26 is the fractal dimension of the snowflake. Suppose that, instead of counting segments, we measure the curve length. We then obtain the relation L(r) = L(1)(1/r)^(D−1), where D is the fractal dimension. For the snowflake, we get L(r) = L(1)(1/r)^0.26. The length grows as 1/r grows, but not as quickly as 1/r itself.

We thus arrive at the conclusion that the length of the Brownian

motion sequence behaves like the length of the snowﬂake curve, the

only difference being that the growth is faster: L(r) = L(1)(1/r)^0.5. The Brownian motion sequence has fractal dimension D = 1.5 (since D − 1 = 0.5).

When talking about fractal objects it is customary to draw atten-

tion to the self-similarity property of these objects. For instance, when

observing a ﬂake bud at the fourth iteration (enclosed by a dotted ball

in Fig. 8.16), one sees an image corresponding to several buds of the

previous iteration. Does something similar happen with the Brownian

motion sequence? Let us take the sequence in Fig. 8.12a and zoom in

on one of its segments. Will the zoomed segment resemble the origi-

nal sequence? If it did, something repetitive would be present, as in


the snowﬂake, and the sequence would not have a random behavior.

Does self-similarity reside in the type of sequence variability? We know

that the standard deviation of a sequence of numbers is a measure of

its degree of variability. Let us then proceed as follows: we divide the

sequence into segments of length (number of values) r and determine

the average standard deviation for all segments of the same length.

Figure 8.17 shows how the average standard deviation s(r) varies with

r. It varies approximately as s(r) = √r. The conclusion is therefore

aﬃrmative, that there does in fact exist a ‘similarity’ of the standard

deviations – and hence of the variabilities – among segments of Brow-

nian motion observed at diﬀerent scales. It does come as a surprise

that a phenomenon entirely dependent on chance should manifest this

statistical self-similarity.
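This check is easy to carry out numerically. The sketch below (not the book's code; the segment lengths are arbitrary choices) computes the average standard deviation over segments of length r for a simulated Brownian sequence and confirms that quadrupling r roughly doubles s(r), consistent with the s(r) ≈ √r scaling.

```python
import math
import random

random.seed(4)

def average_std(seq, r):
    """Average standard deviation over consecutive segments of length r."""
    stds = []
    for start in range(0, len(seq) - r + 1, r):
        segment = seq[start:start + r]
        m = sum(segment) / r
        stds.append(math.sqrt(sum((x - m) ** 2 for x in segment) / r))
    return sum(stds) / len(stds)

# A Brownian sequence: accumulated Gaussian increments.
path, pos = [], 0.0
for _ in range(8192):
    pos += random.gauss(0.0, 1.0)
    path.append(pos)

ratio = average_std(path, 64) / average_std(path, 16)
print(round(ratio, 1))  # roughly 2, i.e., sqrt(64 / 16)
```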

It can be shown that the curves of fractional Brownian motion have

fractal dimension D = 2−H. Thus, whereas the curve of Fig. 8.14a has

dimension 1.2 and is close to a ‘common’ curve, without irregularities,

the one in Fig. 8.14c has fractal dimension 1.8, and is therefore close to

ﬁlling up a certain plane region. In the latter case, there is a reduced

self-similarity and instead of Fig. 8.17 one would obtain a curve whose

standard deviation changes little with r (varying as r^0.2). In the first case the self-similarity varies as r^0.8. For values of H very close to 1,

the variation with r is almost linear and the self-similarity is perfect,

corresponding to an absence of irregularity.

Fig. 8.17. Evolution of the

average standard deviation

for segments of length r of

the curve in Fig. 8.12a

9

Chance and Order

9.1 The Guessing Game

The reader is probably acquainted with the guess-the-noun game, where

someone thinks of a noun, say an animal, and the reader has to guess

which animal their interlocutor thought of by asking him/her closed

questions, which can be answered either “Yes” or “No”. A possible

game sequence could be:

– Is it a mammal?

– Yes.

– Is it a herbivore?

– No.

– Is it a feline?

– Yes.

– Is it the lion?

– No.

And so on and so forth.

The game will surely end, because the number of set elements of the

game is ﬁnite. It may, however, take too much time for the patience and

knowledge of the player who will end up just asking the interlocutor

which animal (s)he had thought of. Just imagine that your interlocutor

thought of the Atlas beetle and the reader had to ask, in the worst case, about a million questions, as many as the number of classified insects!

Let us move on to a diﬀerent guessing game, independent of the

player’s knowledge, and where one knows all the set elements of the

game. Suppose the game set is the set of positions on a chess board


where a queen is placed, but the reader does not know where. The

reader is invited to ask closed questions aiming at determining the

queen’s position. The most eﬃcient way of completing the game, the

one corresponding to the smallest number of questions, consists in per-

forming successive dichotomies of the board. A dichotomous question

divides the current search domain into halves. Let us assume the situa-

tion pictured in Fig. 9.1. A possible sequence of dichotomous questions

and their respective answers could be:

– Is it in the left half?

– Yes.

– Is it in the upper half? (In this and following questions it is always

understood half of the previously determined domain; the left half in

this case.)

– No.

– Is it in the left half?

– No.

– Is it in the upper half?

– Yes.

– Is it in the left half?

– Yes.

– Is it in the upper half?

– No. (And the game ends because the queen’s position has now

been found.)

Whatever the queen’s position may be, it will always be found in at

most the above six questions. Let us analyze this guessing game in more

detail. We may code each “Yes” and “No” answer by 1 and 0, respec-

tively, which we will consider as in Chap. 6 (when talking about Borel’s

normal numbers) as representing binary digits (bits). The sequence of

Fig. 9.1. Chessboard with a queen


answers is therefore a sequence of bits. The diﬀerence between this sec-

ond game and the ﬁrst lies in the fact that the correspondence between

the position of each bit in the sequence and a given search domain is

known and utilized.

Let us now consider the following set of letters which we call E:

E = {a, b, c, d, e, f, g, h} .

We may code the set elements by {000, 001, 010, 011, 100, 101, 110, 111}.

Finding a letter in E corresponds to asking 3 questions, as many as the

number of bits needed to code the elements of E. For instance, ﬁnding

the letter c corresponds to the following sequence of answers regarding

whether or not the letter is in the left half of the previous domain: Yes

(c is in the left half of E), No (c is in the right half of {a, b, c, d}), Yes

(c is in the left half of {c, d}). Let N be the number of elements in

the set. We notice that in the ﬁrst example of the chessboard we had

N = 2⁶ = 64 distinct elements and needed (at most) 6 questions; in this second example we have N = 2³ = 8 distinct elements and need 3 questions. Briefly, it is not hard to see that the number of questions needed is in general given by log₂N (the base 2 logarithm of N, which

is the exponent to which one must raise 2 in order to obtain N; a brief

explanation of logarithms can be found in Appendix A at the end of

the book).
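The dichotomy strategy can be written down directly. In this sketch (not from the book), each answer is recorded as a bit, 1 standing for ‘yes, it is in the left half of the current domain’:

```python
import math

def questions_needed(n_elements):
    """Number of yes/no questions required in the worst case:
    the base-2 logarithm, rounded up when n is not a power of 2."""
    return math.ceil(math.log2(n_elements))

def locate(elements, target):
    """Find target by successive dichotomies, recording each answer
    as a bit (1 = 'in the left half of the current domain')."""
    answers, domain = [], list(elements)
    while len(domain) > 1:
        left = domain[:len(domain) // 2]
        if target in left:
            answers.append(1)
            domain = left
        else:
            answers.append(0)
            domain = domain[len(domain) // 2:]
    return answers

print(questions_needed(64), questions_needed(8))  # 6 3
print(locate("abcdefgh", "c"))  # [1, 0, 1]: Yes, No, Yes, as in the text
```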

In 1920, the electronics engineer Ralph Hartley (1888–1970), working at the Bell Laboratories in the United States, was interested in the problem of defining a measure quantifying the amount of information in a message. The word ‘information’ is used here in an abstract sense.

Concretely, it does not reﬂect either the form or the contents of the

message; it only depends on the symbols used to compose the message.

Let us start with a very simple situation where the message sender can

only send the symbol “Yes”. Coding “Yes” as 1, the sender can only

send sequences of ones. There is then no possible information in the

sense that, from the mere fact that we received “Yes”, there has been

no decrease in our initial uncertainty (already zero) as to the symbol

being sent. The simplest case from the point of view of getting non-zero

information is the one where the emitter uses with equal probability

the symbols from the set {Yes, No} and sends one of them. We then

say that the information or degree of uncertainty is of 1 bit (a binary

digit codes {Yes, No}, e.g., Yes = 1, No = 0).

Now, suppose that the sender sends each of the symbols of the above

set E = {a, b, c, d, e, f, g, h} with equal probability. The information

obtained from one symbol, reﬂecting the degree of uncertainty relative


to the symbol value, is equivalent to the number of Yes/No questions

one would have to make in order to select it. Hartley’s deﬁnition of

the information measure when the sender uses an equally probable N-

symbol set is precisely log₂N. It can be shown that even when N is not

a power of 2, as in the previous examples, this measure still makes sense.

The information unit is the bit and the measure satisﬁes the additivity

property, as it should in order to be useful. We would like to say that

we need 1 bit of information to describe the experiment in which one

coin is tossed and 2 bits of information to describe the experiment in

which two coins are tossed. Suppose that in the example of selecting

an element of E, we started with the selection of an element from

E₁ = {left half of E, right half of E} and, according to the outcome, we then proceeded to choose an element from the left or right half of E. The choice in E₁ implies 1 bit of information. The choice in either the

left or right half of E implies 2 bits. According to the additivity rule

the choice of an element of E then involves 3 bits, exactly as expected.

Now, consider the following telegram: ANN SICK. This message is

a sequence of capital letters and spaces. In the English alphabet there

are 26 letters. When including other symbols such as a space, hyphen,

comma, and so forth, we arrive at a 32-symbol set. If the occurrence

of each symbol were equiprobable, we would then have an information

measure of 5 bits per symbol. Since the message has 8 symbols it will

have 5 × 8 = 40 bits of information in total. What we really mean by

this is that ANN SICK has the same information as ANN DEAD or

as any other of the N = 2⁴⁰ possible messages using 8 symbols from

the set. We clearly see that this information measure only has to do

with the number of possible choices and not with the usual meaning we

attach to the word ‘information’. As a matter of fact, it is preferable to

call this information measure an uncertainty measure, i.e., uncertainty

as to the possible choices by sheer chance.

9.2 Information and Entropy

In the last section we assumed that the sending of each symbol was

equally likely. This is not often the case. For instance, the letter e occurs

about 10% of the time in English texts, whereas the letter z only occurs

about 0.04% of the time. Suppose that we are only using two symbols,

0 and 1, with probabilities P₀ = 3/4 and P₁ = 1/4, and that we want to

determine the amount of information when the sender sends one of the

two symbols. In 1948 the American mathematician Claude Shannon


(1916–2001), considered the father of information theory, presented

the solution to this problem in a celebrated paper. In fact, he gave the

solution not only for the particular two-symbol case, but for any other

set of symbols. We present here a simple explanation of what happens

in the above two-symbol example. We first note that if P₀ = 3/4, this

means that, in a long sequence of symbols, 0 occurs three times more

often than 1. We may look at the experiment as if the sender, instead

of randomly selecting a symbol in {0, 1}, picked it from a set whose

composition reﬂects those probabilities, say, E = {0, 0, 0, 0, 0, 0, 1, 1},

as shown in Fig. 9.2.

We thus have a set E with 8 elements which is the union of two sets,

E₁ and E₂, containing 6 zeros and two ones, respectively. The information corresponding to making a random selection of a symbol from E, which we know to be log₂8, decomposes into two terms (additivity rule):

• The information corresponding to selecting E₁ or E₂, which is the term we are searching for and which we denote by H.

• The average information corresponding to selecting an element in E₁ or E₂.

We already know how to compute the average (mathematical expectation) of any random variable, in this case the quantity of information for E₁ and E₂ (log₂6 and log₂2, respectively):

Average information in the selection of an element in E₁ or E₂
= P₀ × log₂6 + P₁ × log₂2 = P₀ log₂(8P₀) + P₁ log₂(8P₁) .

Hence,

log₂8 = H + P₀ log₂(8P₀) + P₁ log₂(8P₁) .

Taking into account the fact that we may multiply log₂8 by P₀ + P₁,

since it is equal to 1, one obtains, after some simple algebraic manipu-

lation,

H = −P₀ log₂P₀ − P₁ log₂P₁ .

Fig. 9.2. A set with three times more 0s than 1s


For the above-mentioned probabilities, one obtains H = 0.81. Shannon

called this information measure entropy. It is always non-negative and

constitutes a generalization of Hartley’s formula. If P₀ = P₁ = 1/2 (total disorder), one obtains H = 1, the maximum entropy and information in the selection of an element from E. On the other hand, if P₀ or P₁ is equal to 1 (total order), one obtains H = 0, the previously mentioned case of zero information, corresponding to the entropy minimum (by convention 0 log₂0 = 0). Between the two extremes, we get an entropy value between 0 and 1. The general definition of entropy, for any

set E of elementary events with probabilities P, is a straightforward

generalization of the above result:

H = −sum of P log₂P for all elementary events of E .

If E has k elements, the entropy maximum is log₂k and occurs when

all elements are equally probable (uniform distribution). As a matter

of fact, the entropy maximum of an unrestricted discrete distribution

occurs when the distribution is uniform.
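The formula is a single line of code. A sketch (not from the book) that reproduces the values mentioned in the text:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2 p), with the
    convention 0 * log2 0 = 0 (zero-probability terms are skipped).
    The trailing + 0.0 only avoids returning -0.0."""
    return -sum(p * math.log2(p) for p in probs if p > 0) + 0.0

print(round(entropy([3 / 4, 1 / 4]), 2))  # 0.81
print(entropy([1 / 2, 1 / 2]))            # 1.0 (total disorder)
print(entropy([1.0]))                     # 0.0 (total order)
print(entropy([1 / 8] * 8))               # 3.0 = log2(8), the uniform maximum
```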

Shannon proved that the entropy H is the minimum number of

bits needed to code the complete statistical description of a symbol

set. Let us once again take the set E = {a, b, c, d, e, f, g, h} and as-

sume that each symbol is equiprobable. Then H = log₂8 = 3 and

we need 3 bits in order to code each symbol. Now imagine that the

symbols are not equiprobable and that the respective probabilities are

{0.5, 0.15, 0.12, 0.10, 0.04, 0.04, 0.03, 0.02}. Applying the above formula

one obtains H = 2.246, less than 3. Is there any way to code the sym-

bols so that an average information of 2.246 bits per symbol is attained?

There is in fact such a possibility and we shall give an idea how such

a code may be built. Take the following coding scheme which assigns

more bits to the symbols occurring less often: a = 1, b = 001, c = 011,

d = 010, e = 00011, f = 00010, g = 00001, h = 00000. Note that, in

a sequence built with these codes, whenever a single 0 occurs (and not

three consecutive 0s), we know that the following two bits discriminate

the symbol; whenever three consecutive 0s occur we know that they

are followed by two discriminating bits. There is then no possible con-

fusion. Consider the following message obtained with a generator that

used those probabilities: bahaebaaaababaaacge. This message would be

coded as

001100000100011001111100110011110110000100011

and would use 45 bits instead of the 57 bits needed if we had used

three bits per symbol. Applying the formula for the average, one easily


verifies that an average of 2.26 bits per symbol is used with the above coding scheme, close to the entropy value.
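The code table above can be checked directly. A sketch (not from the book) that encodes the sample message and computes the average number of bits per symbol under the given probabilities:

```python
# The prefix-free code from the text: rarer symbols get longer codewords.
code = {"a": "1", "b": "001", "c": "011", "d": "010",
        "e": "00011", "f": "00010", "g": "00001", "h": "00000"}
probs = {"a": 0.50, "b": 0.15, "c": 0.12, "d": 0.10,
         "e": 0.04, "f": 0.04, "g": 0.03, "h": 0.02}

def encode(message):
    """Concatenate the codewords of the message symbols."""
    return "".join(code[symbol] for symbol in message)

def average_bits_per_symbol(code, probs):
    """Expected codeword length: sum of probability times length."""
    return sum(probs[s] * len(code[s]) for s in code)

bits = encode("bahaebaaaababaaacge")
print(len(bits), "bits, versus", 3 * 19, "with 3 bits per symbol")
print(round(average_bits_per_symbol(code, probs), 2))  # 2.26
```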

9.3 The Ehrenfest Dogs

The word ‘entropy’ was coined in 1865 by the Ger-

man physicist Rudolf Clausius (1822–1888) in relation to the concept of

transformation/degradation of energy, postulated by the second prin-

ciple of thermodynamics. In fact ‘entropy’ comes from the Greek word

‘entropˆe’ meaning ‘transformation’. Later on, in 1877, the Austrian

physicist Ludwig Boltzmann (1844–1906) derived a formula for the en-

tropy of an isolated system which reﬂects the number of possible states

at molecular level (microstates), establishing a statistical description

of an isolated system consistent with its thermodynamic properties.

Boltzmann’s formula for the entropy, designated by S, is S = k log Ω,

where Ω is the number of equiprobable microstates needed to describe

a macroscopic system. In Boltzmann’s formula the logarithm is taken

in the natural base (number e) and k is the so-called Boltzmann con-

stant: k ≈ 1.38 × 10⁻²³ joule/kelvin. Note the similarity between the

physical and mathematical entropy formulas. The only noticeable dif-

ference is that the physical entropy has the dimensions of an energy di-

vided by temperature, whereas mathematical entropy is dimensionless.

However, if temperature is measured in energy units (say, measured

in boltzmanns instead of K, °C or °F), the similarity is complete (the

factor k can be absorbed by taking the logarithm in another base).

In 1907, the Dutch physicists Tatiana and Paul Ehrenfest proposed

an interesting example which helps to understand how entropy evolves

in a closed system. Suppose we have two dogs, Rusty and Toby, side-

by-side, exchanging ﬂeas between them, and no one else is present.

Initially, Rusty has 50 ﬂeas and Toby has none. Suppose that after this

perfectly ordered initial state, a ﬂea is randomly chosen at each instant

of time to move from one dog to the other. What happens as time goes

by? Figure 9.3 shows a possible result of the numerical simulation of

the (isolated) dog–ﬂea system. As time goes by the distribution of the

ﬂeas between the two dogs tends to an equilibrium situation with small

ﬂuctuations around an equal distribution of the ﬂeas between the two

dogs: this is the situation of total disorder.
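A minimal simulation of the dog–flea system is easy to write; the sketch below (plain Python, with an arbitrary random seed) mimics the rule just described: at each instant one flea, chosen at random, changes dog.

```python
import random

# Ehrenfest dog-flea model: 50 fleas start on Rusty; at each instant
# one flea, chosen uniformly at random, jumps to the other dog.
random.seed(1)  # arbitrary seed, for reproducibility only
N = 50
on_toby = [False] * N  # flea i is on Toby?  Initially all on Rusty.

history = []
for t in range(5000):
    i = random.randrange(N)       # choose a flea at random
    on_toby[i] = not on_toby[i]   # it jumps to the other dog
    history.append(sum(on_toby))  # number of fleas on Toby

# The count drifts toward about 25 and then fluctuates around it
print(history[:5], history[-1])
```

Plotting `history` against time reproduces the qualitative behaviour of Fig. 9.3: a rapid rise from the ordered initial state followed by small fluctuations around 25.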

At each instant of time, one can determine the probability distribu-

tion of the ﬂeas by performing a large number of simulations. The result

is shown in Fig. 9.4. The distribution converges rapidly to a binomial
distribution very close to the normal one.

Fig. 9.3. Time dependence of the number of fleas on Toby

Fig. 9.4. The flea distribution as it evolves in time. The vertical axis represents the probability of the number of fleas at each time instant (estimated from 10 000 simulations)

Effectively, the normal distribution is, among all possible continuous distributions with constrained

mean value (25 in this case), the one exhibiting maximum entropy. It

is the ‘most random’ of the continuous distributions (as the uniform

distribution is the most random of the discrete distributions).

Figure 9.5 shows how the entropy of the dog–ﬂea system grows,

starting from the ordered state (all ﬂeas on Rusty) and converging to

the disordered state.


Fig. 9.5. Entropy evolution in the isolated dog–flea system

The entropy evolution in the Ehrenfest dogs is approximately what

one can observe in a large number of physical systems. Is it possible for

a situation where the ﬂeas are all on Rusty to occur? Yes, theoretically,

it may indeed occur. If the distribution were uniform, each possible

flea partition would have a 1/2⁵⁰ = 0.000 000 000 000 000 89 probability

of occurring. For the binomial distribution, this is also the probability

of getting an ordered situation (either all ﬂeas on Toby or all ﬂeas on

Rusty). If the time unit for a change in the ﬂea distribution is the

second, one would have to wait about 2⁵⁰ seconds, which amounts to

more than 35 million years (!), in order to have a reasonable probability

of observing such a situation.
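The waiting-time figure is easy to verify:

```python
# Quick check of the waiting-time claim: 2**50 seconds expressed in years
p = 1 / 2 ** 50                          # probability of the ordered state
seconds_per_year = 365.25 * 24 * 3600
years = 2 ** 50 / seconds_per_year

print(f"{p:.2e}")                        # ≈ 8.9e-16
print(f"{years / 1e6:.1f} million years")  # ≈ 35.7 million years
```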

Similarly to the dog–flea system, given a container initially holding
lukewarm water, we do not expect to observe water molecules with
higher kinetic energy accumulating at one side of the container, which

would result in an ordered partition with the emergence of hot water

at one side and cold water at the other. If instead of water we take

the simpler example of one liter of dilute gas at normal pressure and

temperature, we are dealing with something like 2.7 × 10²² molecules,

a huge number compared to the 50-ﬂea system! Not even the whole

lifetime of the Universe would suﬃce to reach a reasonable probability

of observing a situation where hot gas (molecules with high kinetic en-

ergy) would miraculously be separated from cold gas (molecules with

low kinetic energy)!

It is this tendency-to-disorder phenomenon and its implied entropy

rise in an isolated system that is postulated by the famous second

principle of thermodynamics. The entropy rise in many isolated physi-

cal systems is the trademark of time’s ‘arrow’. (As a matter of fact, no


system is completely isolated and the whole analysis is more complex

than the basic ideas we have presented; in a strict sense, it would have

to embrace the whole Universe.)

9.4 Random Texts

We now take a look at ‘texts’ written by randomly selecting lower-case

letters from the English alphabet plus the space character. We are deal-

ing with a 27-symbol set. If the character selection is governed by the

uniform distribution, we obtain the maximum possible entropy for our

random-text system, i.e., log₂ 27 = 4.75 bits. A possible 85-character

text produced by such a random-text system with the maximum of

disorder, usually called a zero-order model, is:

rlvezsfipc icrkbxknwk awlirdjokmgnmxdlfaskjn jmxckjvar

jkzwhkuulsk odrzguqtjf hrlywwn

No similarity with a real text written in English is apparent. Now,

suppose that instead of using the uniform distribution, we used the

probability distribution of the characters in (true) English texts. For

instance, in English texts the character that most often occurs is the

space, occurring about 18.2% of the time, while the letter occurring

most often is e, about 10.2% of the time, and z is the least frequent

character, with a mere 0.04% of occurrences. A possible random text

for this ﬁrst-order model is:

vsn wehtmvbols p si e moacgotlions hmelirtrl ro esnelh

s t emtamiodys ee oatc taa p invoj oetoi dngasndeyse ie

mrtrtlowh t eehdsafbtre rrau drnwsr

In order to generate this text we used the probabilities listed in
Table 9.1, which were estimated from a variety of magazine articles
made available by the British Council web site, totalling more than
523 thousand characters.
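A first-order generator of this kind can be sketched as follows; the probabilities are those of Table 9.1 (with the space written as '_' for readability), and the sampling itself is just a weighted random choice:

```python
import random
from math import log2

# First-order model: sample characters independently with the
# Table 9.1 frequencies; '_' stands for the space character.
freqs = {'_': 0.1818, 'a': 0.0685, 'b': 0.0128, 'c': 0.0239, 'd': 0.0291,
         'e': 0.1020, 'f': 0.0180, 'g': 0.0161, 'h': 0.0425, 'i': 0.0575,
         'j': 0.0012, 'k': 0.0061, 'l': 0.0342, 'm': 0.0207, 'n': 0.0567,
         'o': 0.0635, 'p': 0.0171, 'q': 0.0005, 'r': 0.0501, 's': 0.0549,
         't': 0.0750, 'u': 0.0231, 'v': 0.0091, 'w': 0.0166, 'x': 0.0016,
         'y': 0.0169, 'z': 0.0004}

# Entropy of the first-order model
H = -sum(p * log2(p) for p in freqs.values())
print(round(H, 1))  # ≈ 4.1 bits, down from log2(27) ≈ 4.75

# An 85-character sample, as in the text (the seed is arbitrary)
random.seed(7)
text = ''.join(random.choices(list(freqs), weights=freqs.values(), k=85))
print(text)
```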

Using this ﬁrst-order model, the entropy decreases to 4.1 bits, re-

ﬂecting a lesser degree of disorder. We may go on imposing restrictions

on the character distribution in accordance with what happens in En-

glish texts. For instance, in English, the letter j never occurs before

a space, i.e., it never ends a word (with the exception of foreign words).

In addition, in English q is always followed by u. In fact, given a char-

acter c, we can take into account the probabilities of c being followed by


Table 9.1. Frequency of occurrence of the letters of the alphabet and the

space in a selection of magazine articles written in English

space 0.1818 g 0.0161 n 0.0567 u 0.0231

a 0.0685 h 0.0425 o 0.0635 v 0.0091

b 0.0128 i 0.0575 p 0.0171 w 0.0166

c 0.0239 j 0.0012 q 0.0005 x 0.0016

d 0.0291 k 0.0061 r 0.0501 y 0.0169

e 0.1020 l 0.0342 s 0.0549 z 0.0004

f 0.0180 m 0.0207 t 0.0750

another character, say c₁, written P(c₁ if c), when generating the text.
For instance, P(space if j) = 0 and P(u if q) = 1. Any texts randomly
generated by this probability model, the so-called second-order model,
have a slight resemblance to a text in English. Here is an example:

litestheeneltr worither chethantrs wall o p the junsatinlere moritr

istingaleorouson atout oven olle trinmobbulicsopexagender letur-

orode meacs iesind cl whall meallfofedind f cincexilif ony o agre

fameis othn d ailalertiomoutre pinlas thorie the st m

Note that, in this ‘phrase’, the last letter of the ‘words’ is a valid end

letter in English. Moreover, some legitimate English words do appear,

such as ‘wall’, ‘oven’ and ‘the’. The entropy for this model of conditional

probability is known as conditional entropy and is computed in the

following way:

H(letter 1 if letter 2) = H(letter 1 and letter 2) − H(letter 2).

Performing the calculations, one may verify that the entropy of this

second-order model is 3.38 bits. Moving to third- and fourth-order mod-

els, we use the conditional probabilities estimated in English texts of

a given character being followed by another two or three characters,

respectively. Here is a phrase produced by the third-order model :

l min trown redle behin of thrion sele ster doludicully to in the

an agapt weace din ar be a ciente gair as thly eantryoung othe

of reark brathey re the strigull sucts and cre ske din tweakes hat

he eve id iten hickischrit thin ﬁcated on bel pacts of shoot ity

numany prownis a cord we danceady suctures acts i he sup ﬁr

is pene


Fig. 9.6. Automating the art of public speaking

And a phrase produced by the fourth-order model :

om some people one mes a love by today feethe del purestiling

touch one from hat giving up imalso the dren you migrant for

mists of name dairst on drammy arem nights abouthey is the the

dea of envire name a collowly was copicawber beliver aken the

days dioxide of quick lottish on diﬃca big tem revolvetechnology

as in a country laugh had

The entropies are 2.71 and 2.06, respectively. Note that in all these

examples we are only restricting probability distributions, with the re-

sult that the texts are only vaguely reminiscent of English texts. It

would be possible to add syntactic and semantic rules imposing further

restrictions on the generated phrases, thus enhancing the resemblance

to legitimate English phrases. These additional rules would of course

contribute to further lowering the entropy. As a matter of fact, each

added restriction means added information, therefore a decrease in the

uncertainty of the text.
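The conditional entropy formula can be tried out on any sample of text; the sketch below estimates both entropies from simple letter and letter-pair counts. The tiny 'corpus' used here is invented purely for illustration, so the numbers it prints are not the 3.38 bits quoted above, which were estimated from real English texts.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

# Toy corpus, for illustration only
corpus = "the quick brown fox jumps over the lazy dog " * 20

singles = Counter(corpus)                                     # letter counts
pairs = Counter(corpus[i:i+2] for i in range(len(corpus)-1))  # pair counts

# Conditional entropy of the next letter given the current one:
# H(next if current) = H(current and next) - H(current)
h_cond = entropy(pairs) - entropy(singles)
print(round(entropy(singles), 2), round(h_cond, 2))
```

As expected, the conditional entropy comes out smaller than the plain letter entropy: knowing the current letter reduces the uncertainty about the next one.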

As we go on imposing restrictions and as a consequence increasing

the order of the model, the entropy steadily decreases. The question is:

how far will it go? It is diﬃcult to arrive at an accurate answer, since

for high orders one would need extraordinarily large sequences in order

to estimate the respective probabilities with some conﬁdence. For in-

stance, for a fifth-order model there are 27⁵ possible 5-letter sequences,

that is, more than 14 million combinations (although a large number

of them never occur in English texts)! There is, however, a possibility


Table 9.2. Relative frequencies of certain words in English as estimated from

a large selection of magazine articles

the 0.06271 it 0.01015 have 0.00631 from 0.00482

and 0.03060 for 0.00923 people 0.00586 their 0.00465

of 0.03043 are 0.00860 I 0.00584 can 0.00443

to 0.02731 was 0.00819 with 0.00564 at 0.00442

a 0.02702 you 0.00712 or 0.00559 there 0.00407

in 0.02321 they 0.00705 be 0.00541 one 0.00400

is 0.01427 on 0.00632 but 0.00521 were 0.00394

that 0.01157 as 0.00631 this 0.00513 about 0.00390

of estimating that limit using the so-called Zipf law. The American lin-

guist George Zipf (1902–1950) studied the statistics of word occurrence

in several languages, having detected an inverse relation between the

frequency of occurrence of any word and its rank in the corresponding

frequency table. If for instance the word ‘the’ is the one that most of-

ten occurs in English texts and the word ‘with’ is ranked in twentieth

position, then according to Zipf’s law (presented in 1930), the word

‘with’ occurs about 1/20^a of the times that ‘the’ does. The same hap-

pens to other words. The parameter a has a value close to 1 and is

characteristic of the language.

Let n denote the rank value of the words and P(n) the correspond-

ing occurrence probability. Then Zipf’s law prescribes the following

dependency of P(n) on n:

P(n) = b/n^a .

This relation is better visualized in a graph if one performs a loga-

rithmic transformation: log P(n) = log b − a log n. Hence, the graph of

log P(n) against log n is a straight line with slope −a.
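The straight-line fit can be sketched in a few lines; here we generate synthetic data from the fitted values b = 0.08 and a = 0.96 quoted later in this section, so an ordinary least-squares fit recovers the slope −a exactly:

```python
from math import log10

# Zipf's law P(n) = b / n**a is a straight line in log-log coordinates:
# log P(n) = log b - a * log n
b, a = 0.08, 0.96
xs = [log10(n) for n in range(1, 33)]
ys = [log10(b / n ** a) for n in range(1, 33)]

# Ordinary least-squares fit of slope and intercept
n_pts = len(xs)
mx, my = sum(xs) / n_pts, sum(ys) / n_pts
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

print(round(slope, 2), round(10 ** intercept, 2))  # -0.96 and 0.08
```

With real word frequencies, such as those of Table 9.2, the points scatter around the line and the fit yields the estimates a ≈ 0.96 and b ≈ 0.08 used in the text.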

Table 9.2 shows the relative frequencies of some words estimated

from the previously mentioned English texts. Figure 9.7 shows the ap-

plication of Zipf’s law using these estimates and the above-mentioned

logarithmic transformation. Figure 9.7 also shows a possible straight

line ﬁtting the observed data.

According to the straight line in Fig. 9.7, we obtain

P(n) ≈ 0.08/n^0.96 .


Fig. 9.7. Zipf’s law applied to English texts

One may then apply the entropy formula to the P(n) values for a suf-

ﬁciently large n. An entropy H ≈ 9.2 bits per word is then obtained.

Given the average size of 7.3 letters per word in English (again es-

timated from the same texts), one ﬁnally determines a letter entropy

estimate for English texts as 1.26 bits per letter. Better estimates based

on larger and more representative texts yield smaller values; about 1.2

bits per letter. Other languages manifest other entropy values. For in-

stance, entropy is higher in Portuguese texts; about 1.9 bits per letter.

However, in Portuguese the number of characters to consider is not 27

as in English but 37 (this is due to the use of accents and of ç).
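The per-letter figure is just the quotient of the two estimates quoted above:

```python
# From entropy per word to entropy per letter, with the estimates
# quoted in the text
bits_per_word = 9.2      # entropy estimated via the Zipf fit
letters_per_word = 7.3   # average word size estimated from the same texts
bits_per_letter = bits_per_word / letters_per_word
print(round(bits_per_letter, 2))  # 1.26
```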

Zipf’s law is empirical. Although its applicability to other types of

data has been demonstrated (for instance, city populations, business

volumes and genetic descriptions), a theoretical explanation of the law

remains to be found.

9.5 Sequence Entropies

We now return to the sequence randomness issue. In a random sequence

we expect to ﬁnd k-symbol blocks appearing with equal probability for

all possible symbol combinations. This is, in fact, the property exhibited

by Borel’s normal numbers presented in Chap. 6.

Let us assume that we measure the entropy for the combinations of

k symbols using the respective probabilities of occurrence P(k), as we


did before in the case of texts:

H(k) = −sum of P(k) × log₂ P(k).

Furthermore, let us take the limit of the H(k)−H(k−1) diﬀerences for

arbitrarily large k. One then obtains a number known as the sequence

(Shannon) entropy, which measures the average uncertainty content

per symbol when all symbol dependencies are taken into account. For

random sequences there are no dependencies and H(k) remains con-

stant. In the case of an inﬁnite binary sequence of the fair coin-tossing

type, H(k) is always equal to 1; if it is a tampered coin with P(1) = 0.2,

a value of H(k) equal to

−0.2 × log₂ 0.2 − 0.8 × log₂ 0.8 = 0.72

is always obtained.
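The tampered-coin value can be checked directly:

```python
from math import log2

# Entropy per symbol of a tampered coin with P(1) = 0.2
p1 = 0.2
H = -p1 * log2(p1) - (1 - p1) * log2(1 - p1)
print(round(H, 2))  # 0.72
```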

In the last section, we obtained an entropy estimate for an arbitrarily

long text sequence in English. If there are many symbols, we will also

need extremely long sequences in order to obtain reliable probability

estimates. For instance, in order to estimate the entropy of fourth-

order models of English texts, some researchers have used texts with

70 million letters!

The sequence length required for entropy estimation reduces dras-

tically when dealing with binary sequences. For instance, the sequence

consisting of indeﬁnitely repeating 01 has entropy 1 for one-symbol

blocks and zero entropy for more than one symbol. For two-symbol

blocks, only 01 and 10 occur with equal probability. There is not, there-

fore, an increase in uncertainty when progressing from 1 to 2 symbols.

The sequence consisting of indeﬁnitely repeating 00011011 has entropy

1 for blocks of 1 and 2 symbols. However, for 3 symbols, the entropy falls

to 0.5, signaling an uncertainty drop for 3-symbol blocks, and it falls

again to 0.25 for 4-symbol blocks. As a matter of fact, the reader may

conﬁrm that all 2-symbol combinations are present in the sequence,

whereas there are 2 combinations of 3 symbols and 9 of 4 symbols that

never occur (and there is a deﬁcit for other combinations).
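The block entropies of the repeating 00011011 sequence can be estimated by counting overlapping blocks; a minimal sketch:

```python
from collections import Counter
from math import log2

def block_entropy(s, k):
    """Entropy of the overlapping k-symbol blocks of s."""
    counts = Counter(s[i:i+k] for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

s = "00011011" * 1000  # a long repetition of the 8-bit pattern

# H(1) and the incremental entropies H(k) - H(k-1)
h = {k: block_entropy(s, k) for k in range(1, 5)}
increments = [round(h[1], 2)] + [round(h[k] - h[k-1], 2) for k in (2, 3, 4)]
print(increments)  # [1.0, 1.0, 0.5, 0.25]
```

The printed increments reproduce the values quoted above: entropy 1 for 1- and 2-symbol blocks, a drop to 0.5 at 3 symbols, and to 0.25 at 4 symbols.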

It is interesting to determine and compare the entropy values of

several binary sequences. Table 9.3 shows estimated values for several

sequences (long enough to guarantee acceptable estimates) from H(1)

up to H(3). The sequences named e, π and √2 are the sequences of
the fractional parts of these constants, expressed in binary. Here are
the first 50 bits of these sequences:


Table 9.3. Estimated entropy values for several sequences. See text for explanation

Sequence            H(1)    H(2)    H(3)
e                   1       1       0.9999
π                   0.9999  0.9999  0.9999
√2                  1       1       1
Normal number       1       1       1
Quantum number      1       0.9998  0.9997
Quadratic iterator  0.9186  0.6664  0.6663
Normal FHR          0.9071  0.8753  0.8742
Abnormal FHR        0.4586  0.4531  0.4483

e  01000101100101101000100000100011011101000100101011
π  01010000011011001011110111011010011100111100001110
√2 00100100001101000011000010100011111111101100011000

The other sequences are as follows:

• normal number refers to Champernowne’s binary number mentioned

in Chap. 6,

• quantum number is a binary sequence obtained with the quantum

generator mentioned in Chap. 8 (atmospheric noise),

• quadratic iterator is the chaotic sequence shown in Fig. 7.14a, ex-

pressed in binary,

• normal FHR is a sequence of heart rate values (expressed in binary)

of a normal human fetus sleeping at rest (FHR stands for fetal heart

rate),

• abnormal FHR is the same for a fetus at severe risk.

Note the irregularity of e, π, √2, normal number and quantum number
according to the entropy measure: constant H(k) equal to 1. The

FHR sequences, referring to fetal heart rate, both look like coin-tossing

sequences with tampered coins, where the abnormal FHR is far more

tampered with than the normal FHR. Finally, note that the quadratic

iterator sequence is less irregular than the normal fetus sequence!

The American researcher Steve Pincus proposed, in 1991, a new ir-

regularity/chaoticity measure combining the ideas of sequence entropy

and autocorrelation. In this new measure, called approximate entropy,


the probabilities used to calculate entropy, instead of referring to the

possible k-symbol combinations, refer to occurrences of autocorrela-

tions of k-symbol segments, which are below a speciﬁed threshold. This

new measure has the important advantage of not demanding very long

sequences in order to obtain an adequate estimate. For this reason, it

has been successfully applied to the study of many practical problems,

e.g., variability of muscular tremor, variability of breathing ﬂow, anes-

thetic eﬀects in electroencephalograms, assessment of the complexity

of information transmission in the human brain, regularity of hormonal

secretions, regularity of genetic sequences, dynamic heart rate assess-

ment, early detection of the sudden infant death syndrome, the eﬀects

of aging on the heart rate, and characterization of fetal heart rate vari-

ability.

9.6 Algorithmic Complexity

We saw that the e, π, and √2 sequences, as well as the sequence for

Champernowne’s normal numbers, have maximum entropy. Should we,

on the basis of this fact, consider them to be random? Note that all

these sequences obey a perfectly defined rule. For instance, Champernowne's
number is constructed by simply generating the successive blocks of
k bits, which can easily be done by (binary) addition of 1 to the

preceding block. The digit sequences of the other numbers can also be

obtained with any required accuracy using certain formulas, such as:

e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + 1/5! + 1/6! + 1/7! + 1/8! + ··· ,

π/2 = 1 + (1/3)(1 + (2/5)(1 + (3/7)(1 + (4/9)(1 + ···)))) ,

√2 = 1 + 1/(2 + 1/(2 + 1/(2 + ···))) .

That is, in spite of the ‘disorder’ one may detect using the entropy

measures of the digit sequences of these numbers (the empirical evi-

dence that they are Borel normal numbers, as mentioned in Chap. 6),

each digit is perfectly predictable. This being so it seems unreasonable

to qualify such numbers as random. In fact, for a random sequence


one would not expect any of its digits to be predictable. In the
mid-1960s, the mathematicians Andrei Kolmogorov, Gregory Chaitin and
Ray Solomonoff developed, independently of each other, a new theory
for classifying a number as random, based on the idea of its computability.
Suppose we wished to obtain the binary sequence consisting

of repeating 01 eight times: 0101010101010101. We may generate this

sequence in a computer by means of a program with the following two

instructions:

REPEAT 8 TIMES: PRINT 01 .

Imagine that the REPEAT and PRINT instructions are coded with 8

bits and that the ‘8 TIMES’ and ‘01’ arguments are coded with 16 bits.

We then have a program composed of 48 bits for the task of printing

a 16-bit sequence. It would have been more economical if the program

were simply: PRINT 0101010101010101, in which case it would have

only 24 bits. However, if the sequence is composed of 1000 repetitions of

01, we may also use the above program replacing 8 by 1000 (also codable

with 16 bits). In this case, the 48-bit program allows one to compact the

information of a 2000-bit sequence. Elaborating this idea, Kolmogorov,
Chaitin and Solomonoff defined the algorithmic complexity¹ of a binary
sequence s as the length of the smallest program s*, also a binary
sequence, that outputs s (see Fig. 9.8).
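Although true algorithmic complexity is not computable, a general-purpose compressor gives a crude, computable stand-in for it, and illustrates the contrast between a patterned sequence and a coin-tossing one:

```python
import random
import zlib

# Compressed length as a rough proxy for algorithmic complexity:
# a highly patterned sequence compresses far better than a
# coin-tossing sequence of the same length.
periodic = "01" * 1000  # a thousand repetitions of 01

random.seed(42)  # arbitrary seed, for reproducibility only
tossed = "".join(random.choice("01") for _ in range(2000))

c_periodic = len(zlib.compress(periodic.encode()))
c_tossed = len(zlib.compress(tossed.encode()))
print(c_periodic, c_tossed)  # the periodic sequence compresses far more
```

The compressor plays the role of a short program reproducing the sequence: the periodic string shrinks to a handful of bytes, while the coin-tossing string remains essentially incompressible at one bit per symbol.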

In this deﬁnition of algorithmic complexity, it does not matter in

which language the program is written (we may always translate from

one language to another), nor which type of computer is used. (In fact,

the deﬁnition can be referred to a certain type of minimalist computer,

but this is by and large an irrelevant aspect.) In the case where s is

a thousand repeats of 01, the algorithmic complexity is clearly low:

the rather simple program shown above reproduces the 2000 bits of s

using only 48 bits. Even for eight repeats of 01, the algorithmic com-

plexity may be low in a computer where the instructions are coded

with a smaller number of bits. Instruction codes represent a constant

surplus, dependent on the computer being used, which we may neglect

Fig. 9.8. Computer generation of a sequence

¹ Not to be confused with computational complexity, which refers to an algorithm's computational efficiency in run time.


when applying the deﬁnition; one has only to take a suﬃciently long

length for s to justify neglecting the surplus.

We may apply the deﬁnition of algorithmic complexity to the gener-

ation of any number (independently of whether it is expressed in binary

or in any other numbering system). Consider the generation of e. Its

digits seem to occur in a haphazard way, but as a matter of fact by

using the above formula one may generate an arbitrarily large number

of digits of e using a much shorter program than would correspond to

printing out the whole digit sequence. In brief, the algorithmic com-

plexity of e, and likewise of π and

√

2, is rather low.

From the point of view of algorithmic complexity, a random sequence

is one which cannot be described in a more compact way than by

simply instructing the computer to write down its successive digits.

The instruction that generates the sequence is then the sequence itself

(neglecting the PRINT surplus). Thus, algorithmically random means

irreducible, or incompressible.

Algorithmic complexity has a deep connection with information the-

ory. In fact, consider a binary sequence s with n bits and a certain

proportion p of 1s. Then, as we saw previously when talking about

entropy-based coding, there exists a sequence s_c of length nH which

codes s using Shannon’s coding by exploring the redundancies of s.

If p is diﬀerent from 1/2, the entropy is then smaller than 1; hence,

nH < n. But if s can be compressed we may build the s* program
simply as a decoding of s_c. Neglecting the decoding surplus, the algorithmic

complexity is nH < n. Therefore, in order for s to be algorithmically

random, the proportions of 0s and 1s must be equal. Moreover, every

possible equal-length block of 0s and 1s must occur in equal propor-

tions; otherwise, once again by exploring the conditional entropies, one

would be able to shorten s* relative to s. In conclusion, infinite and
algorithmically random sequences must have maximum entropy (log₂ k,

for sequences using a k-symbol set) and must be normal numbers! How-

ever, not all normal numbers are algorithmically random. For instance,

Champernowne’s number is not algorithmically random.

As mentioned above, the maximum complexity of an n-bit sequence,

and therefore the length of its minimal program (the smallest length

program outputting s), is approximately equal to n (equal to n, remov-

ing instruction surplus). At the other extreme, a sequence composed

of either 0s or 1s has minimum complexity, approximately equal to

log₂ n, and its pattern is obvious. If P(0) = 3/4 and P(1) = 1/4, we


have a middle situation: the complexity is approximately nH = 0.8n

(see Sect. 9.1).

An interesting fact is that it can be demonstrated that fewer than
a fraction 1/2^c of all n-bit sequences have complexity smaller than
n − c. For instance, for c = 10, fewer than 1 in 1024 sequences have
complexity less than n − 10. We may conclude that, if one considers
the interval [0, 1]

representing inﬁnite sequences of 0s and 1s, as we did in Chap. 6, there

is an inﬁnite subset of Borel normal numbers that is algorithmically

random. It is then not at all surprising that the quadratic iterator has

inﬁnitely many chaotic orbits; a chaotic orbit is its shortest description

and its fastest computer. Unfortunately, although there are inﬁnitely

many algorithmically random sequences, it is practically impossible to

prove whether or not a given sequence is algorithmically random. Incidentally,
note that a minimal program s* must itself be algorithmically
random. If it were not, there would then exist another program s**
outputting s*, and we would be able to generate s with a program
instructing the generation of s* with s**, followed by the generation of s
with s*. The latter program would only have a few more bits than s**
(those needed for storing s* and running it); therefore, s* would not be
a minimal program, which is a contradiction.

9.7 The Undecidability of Randomness

The English mathematician Alan Turing (1912–1954), pioneer of com-

puter theory, demonstrated in 1936 a famous theorem known as the

halting problem theorem. In simple terms it states the following: a gen-

eral program deciding in ﬁnite time whether a program for a ﬁnite input

ﬁnishes running or will run forever (solving the halting problem) can-

not exist for all possible program–input pairs. Note that the statement

applies to all possible program–input (ﬁnite input) pairs. One may

build programs which decide whether or not a given program comes

to a halt. The impossibility refers to building a program which would

decide the halting condition for every possible program–input pair. The

halting problem theorem is related to the famous theorem about the

incompleteness of formal systems, demonstrated by the German
mathematician Kurt Gödel (1906–1978): there is no consistent and
complete system of axioms, i.e., one that would allow us to decide
whether any statement built with the axioms of the system is true or false.

From these results it is possible to demonstrate the impossibility of

proving whether or not a sequence is algorithmically random. To see


this, suppose that for every positive integer n, we wish to determine

whether there is a sequence whose complexity exceeds n. For that pur-

pose we suppose that we have somehow got a program that can assess

the algorithmic complexity of sequences. This program works in such

a way that, as soon as it detects a sequence with complexity larger

than n, it prints it and stops. The program would then have to include

instructions for the evaluation of algorithmic complexity, occupying k

bits, plus the specification of n, occupying log₂ n bits. For sufficiently
large n the value of k + log₂ n is smaller than n (see Appendix A) and

we arrive at a contradiction: a program whose length is smaller than n

computes the ﬁrst sequence whose complexity exceeds n. The contra-

diction can only be avoided if the program never halts! That is, we will

never be able to decide whether or not a sequence is algorithmically

random.

9.8 A Ghost Number

In 1976, the American mathematician Gregory Chaitin deﬁned a truly

amazing number (presently known as Chaitin’s constant). No one

knows its value (or rather, one of its possible values), but we know

what it represents. It is a Borel normal number that can be described

but not computed; in other words, there is no set of formulas (as for in-

stance the above formulas for the calculation of e, π, or √2) and/or rules

(as, for instance, the one used to generate Champernowne’s number)

which, implemented by a program, would allow one to compute it. This

number, denoted by Ω, corresponds to the probability of a randomly

chosen program coming to a halt in a given computer. The random

choice of the program means that the computer is fed with binary se-

quences interpreted as programs, and with those sequences generated

by a random process of the coin-tossing type. Assuming that the pro-

grams are delimited (with begin–end or a preﬁx indicating the length),

the calculation of Ω is made by adding up, for each program of k bits

coming to a halt, a contribution of 1/2^k to Ω. With this construction

one may prove that Ω is between 0 and 1, excluding these values. In

fact, 0 would correspond to no program coming to a halt and 1 to all

programs halting, and both situations are manifestly impossible.

Imagine that some goddess supplied us with the sequence of the first n

bits of Ω, which we denote by Ωₙ. We would then be able to decide

about the halting or no halting of all programs with length shorter than

or equal to n, as follows: we would check, for each k between 1 and n,


whether any of the possible 2^k programs of length k stopped or not. We

would then feed the computer with all sequences of 1, 2, 3, . . . , n bits.

In order to avoid getting stuck with the ﬁrst non-halting program, we

would have to run one step of all the programs in shared time. Meanwhile,
each halting program of length k contributes 1/2^k. When the
accumulated sum of all these contributions reaches the number Ωₙ,
which was miraculously given to us, we have solved the halting problem
for all programs

of length up to n bits (the computation time for not too small values

of n would be enormous!). In fact, another interpretation of Ω is the

following: if we repeatedly toss a coin, noting down 1 if it turns up

heads and 0 if it turns up tails, the probability of eventually reaching

a sequence representing a halting program is Ω.
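The shared-time execution just described, nowadays called dovetailing, can be sketched with toy 'programs'; the three programs below, represented as Python generators with invented codes and running times, are purely illustrative:

```python
from fractions import Fraction
from itertools import count

# Toy 'programs', keyed by their binary code: each yields once per
# execution step; the finite ones eventually halt.  The codes and
# behaviours are invented purely for illustration.
programs = {
    '1':  iter(range(2)),  # halts after 2 steps
    '00': iter(range(5)),  # halts after 5 steps
    '01': count(),         # never halts
}

halted = set()
omega_lower = Fraction(0)

# Dovetailing: one step of every still-live program per round, so a
# non-halting program cannot block the others.
for _ in range(100):  # cap the number of rounds
    for code, prog in list(programs.items()):
        try:
            next(prog)
        except StopIteration:
            halted.add(code)
            omega_lower += Fraction(1, 2 ** len(code))  # contribution 1/2^k
            del programs[code]

print(sorted(halted), omega_lower)  # a lower bound on the toy Ω
```

As the rounds proceed, the accumulated sum of contributions grows toward the (toy) Ω from below, exactly as in the argument above.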

Let us now attempt to determine whether or not Ωₙ is algorithmically
random. Knowing Ωₙ, we also know all programs up to n bits that
come to a halt; among them, the programs producing random sequences
of length up to and including n bits. If the algorithmic complexity of
Ωₙ were smaller than n, there would exist a program of length shorter
than n outputting Ωₙ. As a consequence, using the process described

above, the program would produce all random sequences of length n,

which is a contradiction. Eﬀectively, Ω is algorithmically random; it is

a sort of quintessence of randomness.

Recently, Cristian Calude, Michael Dinneen, and Chi-Kou Shu have

computed the ﬁrst 64 bits of Ω in a special machine (they actually

computed 84 bits, but only the ﬁrst 64 are trustworthy). Here they are:

0.000 000 100 000 010 000 011 000 100 001 101 000 111 111 001 011 101 110 100 001 000 0

The corresponding decimal number is 0.078 749 969 978 123 844. Besides the purely theoretical interest, Chaitin's constant finds interesting applications in the solution of some mathematical problems (in the area of Diophantine equations).

9.9 Occam’s Razor

William of Occam (or Ockham) (1285–1349?) was an English Francis-

can monk who taught philosophy and theology at the University of

Oxford. He became well known and even controversial (he ended up

being excommunicated by Pope John XXII) because he had ideas well

ahead of his time, that ran against the traditionalist trend of medieval


religious thought. Occam contributed to separating logic from meta-

physics, paving the way toward an attitude of scientiﬁc explanation.

A famous principle used by Occam in his works, known as the princi-

ple of parsimony, principle of simplicity, or principle of economy, has

been raised to the level of an assessment/comparison criterion for the

scientiﬁc reasonableness of contending theories. The principle is popu-

larly known as Occam’s razor. It states the following: one should not

increase the number of explanatory assumptions for a hypothesis or

theory beyond what is absolutely necessary (unnecessary assumptions

should be shaved oﬀ), or in a more compact form, the simpler the ex-

planation, the better it is. When several competing theories are equal in

other respects, the principle recommends selecting the theory that in-

troduces the fewest assumptions and postulates the fewest hypothetical

entities.

The algorithmic complexity theory brings interesting support for

Occam’s razor. In fact, a scientiﬁc theory capable of explaining a set

of observations can be viewed as a minimal program reproducing the

observations and allowing the prediction of future observations. Let

us then consider the probability P(s) of a randomly selected program

producing the sequence s. For that purpose, and in analogy with the computation of Ω, each program outputting s contributes 1/2^r to P(s), where r is the program length. It can be shown that P(s) = 1/2^C(s), where C(s) represents the algorithmic complexity of s. Thus, sequences with low complexity occur more often than those with high complexity. If the assumptions required by a theory are coded as sequences of symbols (for instance, as binary sequences), the most probable ones are the shortest.

Another argument stands out when considering the production of

binary sequences of length n. If they are produced by a chance mechanism, they have probability 1/2^n of showing up. Meanwhile, a possible cause for the production of s could be the existence of another sequence s* of length C(s) used as input in a certain machine. The probability of obtaining s* by random choice among all programs of length C(s) is 1/2^C(s). The ratio 2^(-C(s))/2^(-n) = 2^(n-C(s)) thus measures the odds that s occurred due to a definite cause rather than pure luck. Now, if C(s) is much smaller than n, the probability that s occurred by pure luck is very small. One more reason to prefer simple theories!
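One can get a rough feel for the ratio 2^(n-C(s)) with off-the-shelf compression, using the compressed length of a sequence as a crude stand-in for its algorithmic complexity (true C(s) is uncomputable; compressed length is only an upper bound, and the sequences below are invented):

```python
import random
import zlib

def crude_complexity_bits(s: str) -> int:
    """Upper bound on C(s) in bits: length of the zlib-compressed string.
    (A stand-in only; real algorithmic complexity is uncomputable.)"""
    return 8 * len(zlib.compress(s.encode()))

n = 1024
patterned = "01" * (n // 2)                              # low-complexity sequence
random.seed(0)
chancy = "".join(random.choice("01") for _ in range(n))  # chance-produced sequence

# log2 of the ratio 2^(n - C(s)): large and positive for the patterned
# sequence (a definite cause is far more plausible than luck), near or
# below zero for the chancy one.
log2_ratio_patterned = n - crude_complexity_bits(patterned)
log2_ratio_chancy = n - crude_complexity_bits(chancy)
```

The patterned sequence compresses to far fewer than its 1024 bits, so the exponent n − C(s) is large; the chance-produced sequence barely compresses at all.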

10

Living with Chance

10.1 Learning, in Spite of Chance

Although immersed in a vast sea of chance, we have so far demon-

strated that we are endowed with the necessary learning skills to solve

the problems arising in our daily life, including those whose solution

is a prerequisite to our survival as a species. Up to now, we have also

shown that we are able to progress along the path of identifying the

laws (regularities) of nature. But it is not only the human species that

possesses learning skills, i.e., the capability of detecting regularities in

phenomena where chance is present. A large set of living beings also

manifests the learning skills needed for survival. Ants, for instance,

are able to determine the shortest path leading to some food source.

The process unfolds as follows. Several ants leave their nest to forage

for food, following diﬀerent paths; those that, by chance, reach a food

source along the shortest path are sooner to reinforce that path with

pheromones, because they are sooner to come back to the nest with the

food; those that subsequently go out foraging ﬁnd a higher concentra-

tion of pheromones on the shortest path, and therefore have a greater

tendency (higher probability) to follow it, even if they did not follow

that path before. There is therefore a random variation of the individual

itineraries, but a learning mechanism is superimposed on that random

variation – an increased probability of following the path with a higher

level of pheromone –, which basically capitalizes on the law of large

numbers. This type of social learning, which we shall call sociogenetic,

is common among insects.
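The reinforcement mechanism can be caricatured in a few lines. The sketch below is deterministic, tracking expected pheromone rather than individual ants, and its numbers are invented: the short path earns twice the deposit per round because its ants complete round trips twice as fast.

```python
def pheromone_shares(rounds=500):
    """Deterministic expected-value sketch of ant path reinforcement:
    each round, ants split between two paths in proportion to pheromone,
    and the short path (twice as many round trips per unit time) receives
    twice the deposit. Returns the short path's final pheromone share."""
    short, long_ = 1.0, 1.0              # initial pheromone, unbiased
    for _ in range(rounds):
        p_short = short / (short + long_)
        short += 2.0 * p_short           # short-path ants return twice as often
        long_ += 1.0 * (1.0 - p_short)
    return short / (short + long_)

share = pheromone_shares()
```

Round after round the short path's share of pheromone, and hence the probability of choosing it, creeps toward 1: the colony 'learns' the shortest path.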

Other classes of animals provide, not only examples of social learn-

ing, but also individual learning, frequently depending on a scheme

of punishment and reward. For instance, many animals learn to distinguish certain food sources (leaves, fruits, nuts, etc.) according to

whether they have a good taste (reward) or a bad taste (punishment).

Human beings begin a complex learning process at a tender age, where

social and individual mechanisms complement and intertwine with each

other. Let us restrict the notion of learning here to the ability to clas-

sify events/objects: short or long path, tasty or bitter fruit, etc. There

are other manifestations of learning, such as the ability to assign values

to events/objects or the ability to plan a future action on the basis of

previous values of events/objects. However, in the last instance, these

manifestations can be reduced to a classiﬁcation: a ﬁne-grained classi-

ﬁcation (involving a large number of classes) in the case of assigning

values to events/objects, or a ﬁne-grained and extrapolated classiﬁca-

tion in the case of future planning.

Many processes of individual learning, which we shall call ontoge-

netic, are similar to the guessing game mentioned in Chap. 9. Imagine

a patient with suspected cardiovascular disease going to a cardiologist. The event/object the cardiologist has to learn is the classification

of the patient in one of several diagnostic classes. This classiﬁcation

depends on the answer to several questions:

• Does the patient have a high cholesterol level?

• Is the patient overweight?

• Is the electrocardiogram normal?

• Does s(he) have cardiac arrhythmia?

We may call this type of learning, in which the learner is free to put

whatever questions s(he) wants, active learning. As seen in Chap. 9,

active learning guarantees that the target object, in a possible domain

of n objects (in the case of the cardiologist, n is the number of possible

diagnostic classes), is learned after presenting at most log_2 n questions.

(Of course things are not usually so easy in practice. At the start of

the diagnostic process one does not have the answer to all possible

questions and some of the available answers may be ambiguous.)

We may formalize the guessing game by assuming that one intends

to guess a number x belonging to the interval [0, 1], that some person

X thought of. To guess the number, the reader L may randomly choose

numbers in [0, 1], following the guessing game strategy. Imagine that

the number that X thought of is 0.3 and that the following sequence

of questions and answers takes place:


L: Is it greater than 0.5?

X : No.

L: Is it greater than 0.25?

X : Yes.

And so on and so forth. After n questions, L comes to know x with

a deviation that is guaranteed not to exceed 1/2^n. Learning here means classifying x in one of the 2^n intervals of [0, 1] with width 1/2^n.
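The dialogue above is just bisection, and can be written out as a short sketch (the function name is invented):

```python
def active_learn(x, n):
    """Binary-search guessing: after n yes/no questions, the guess is
    guaranteed to lie within 1/2**n of the hidden number x in [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if x > mid:           # "Is it greater than mid?" -- X answers truthfully
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2      # midpoint of the final interval

estimate = active_learn(0.3, 10)
```

Each question halves the interval containing x, which is exactly the log_2 n guarantee of active learning.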

Things get more complicated when we are not free to present the

questions we want. Coming back to the cardiologist example, suppose

that the consultation is done in a remote way and the cardiologist has

only an abridged list of the ‘answers’. In this case the cardiologist is

constrained to learn the diagnosis using only the available data. We

are now dealing with a passive learning process. Is it also possible to

guarantee learning in this case? The answer is negative. (As a matter of

fact, were the answer positive, we would probably already have learned

everything there is to learn about the universe!)

Passive learning can be exempliﬁed by learning the probability of

turning up heads in the coin-tossing game, when the results of n throws

are known. L is given a list of the results of n throws and L tries to

determine, on the basis of this list, the probability P of heads, a value

in [0, 1]. For this purpose L determines the frequency f of heads. The

learning of P is obviously not guaranteed, in the sense that the best

one can obtain is a degree of certainty, for some tolerance of the de-

viation between f and P. In passive learning there is therefore a new

element: there is no longer an absolute certainty of reaching a given

deviation between what we have learned and the ideal object/value to

be learned – as when choosing a number x in [0, 1]. Rather, we are

dealing with degrees of certainty (conﬁdence intervals) associated with

the deviations. When talking about Bernoulli’s law of large numbers in

Chap. 4, we saw how to establish a minimum value for n for probability

estimation. The dotted line in Fig. 10.1 shows the minimum values of n

when learning a probability, for several values of the deviation tolerance

between f and P, and using a 95% certainty level.

Let us once again consider learning a number x in the interval [0, 1],

but in a passive way this time. Let us assume that X has created

a sequence of random numbers, all with the same distribution in [0, 1]

and independent of each other, assigning a label or code according to

the numbers belonging (say, label 1) or not (say, label 0) to [0, x]. The

distribution law for the random numbers is assumed unknown to L. To

Fig. 10.1. Learning curves showing the minimum number of cases (logarithmic scale) needed to learn a probability (dotted curve) and a number (solid curve) with a given tolerance and 95% certainty

ﬁx ideas, suppose that X thought of x = 0.3 and L was given the list:

(0.37, 0) (0.29, 1) (0.63, 0) (0.13, 1) .

A list like this one, consisting of a set of labeled points (the objects in

this case), is known as a training set. On the basis of this training set,

the best that L can do is to choose the value 0.29, with a deviation of

0.01. Now suppose that the random numbers were such that L received

the list:

(0.78, 0) (0.92, 0) (0.83, 0) (0.87, 0) .

The best that L can do now is to choose 0.78, with a deviation of 0.48.

In general, assuming that the distribution of the random numbers generated by X is uniform, there is a 1/2^n probability that all numbers fall in [0.5, 1]. If x < 0.5, the supplied list does not help L to learn a number. In the worst case we may get a deviation of 0.5 with probability at least 1/2^n.

It can be shown that in this learning of number x, in order to get

a tolerance interval with 95% certainty, we need more cases than when

learning the probability value in the coin-tossing experiment. The min-

imum values of n in the passive learning of a number, for several val-

ues of the deviation tolerance and using 95% certainty, are shown by

the solid curve in Fig. 10.1. Here is a concrete example: in order to

learn a probability with 0.1 tolerance and 95% certainty, 185 cases are

enough. To learn a number under the same conditions requires at least

1 696 trials. Learning a number when the probability law of the training

set is unknown is more diﬃcult than learning a probability. It requires

larger training sets in order to be protected against atypical training

sets (for instance, all points in [0.5, 1]).
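The 185-case figure can be reproduced with Hoeffding's inequality, one standard way of bounding the deviation between a frequency f and a probability P; whether the book's own curve in Fig. 10.1 was computed with exactly this bound is an assumption here.

```python
import math

def min_trials(tolerance, certainty):
    """Smallest n with Hoeffding's guarantee
    P(|f - P| >= tolerance) <= 2 * exp(-2 * n * tolerance**2) <= 1 - certainty."""
    delta = 1.0 - certainty
    return math.ceil(math.log(2.0 / delta) / (2.0 * tolerance ** 2))

n = min_trials(0.1, 0.95)   # minimum throws for 0.1 tolerance, 95% certainty
```

Note how the required n grows with the square of 1/tolerance: halving the tolerance quadruples the number of throws.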

Finally, there is a third type of learning which we shall call phylo-

genetic. It corresponds to the mechanisms of natural selection of the


species, which grants a higher success to certain genomes to the detri-

ment of others. Learning works as follows here: several gene combina-

tions are generated inside the same species or phylum, via mutations

and hereditary laws. As a consequence, a non-homogeneous population

of individuals becomes available where a genomic variability compo-

nent is present, as well as the individual traits that reﬂect it. Some

individuals are better off than others in the pursuit of certain vital goals, and therefore tend to greater success in transmitting their genomes to the following generation. Thus, learning corresponds here to better

performance in the face of the external environment.

In all these learning models, chance phenomena are present and, in

the last instance, it is the laws of large numbers that make learning

possible. In the simplest sociogenetic learning model, we have a homo-

geneous set of individuals, each of them performing a distinct random

experiment. Everything happens as if several individuals were throw-

ing coins, with the throws performed under the same conditions. Here

learning resembles the convergence of an ‘ensemble average’. The more

individuals there are, the better the learning will be.

In ontogenetic learning, we have a single individual carrying out or

being confronted with random experiments as time goes by. This is the

case of ‘temporal average’ convergence that we have spoken about so

often. The more experiments there are, the better the learning will be.

In phylogenetic learning, the individuals are heterogeneous. We may

establish an analogy with the coin-tossing case by assuming a living be-

ing, called a ‘coin’, whose best performance given the external environ-

ment corresponds to fairness: P(heads) = 1/2. At some instant there is

a ‘coin’ population with diﬀerent genes to which correspond diﬀerent

values of P(heads). In the next generation, the ‘coins’ departing from

fairness get a non-null probability of being eliminated from the popula-

tion, and with higher probability the more they deviate from fairness.

Thus, in the following generation the distribution of P(heads) is dif-

ferent from the one existing in the previous generation. We thus have

a temporal convergence of the average resembling the one in Sect. 6.4.

Meanwhile, the decision about which ‘coins’ to eliminate from each

generation depends on the departure of P(heads) from the better per-

formance objective (fairness); that is, it depends on ‘ensemble average’

convergence.
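The 'coin' population can be simulated directly. Everything in the sketch below is invented for illustration: the elimination rule (the further P(heads) is from 1/2, the likelier elimination) and the mutation scale are arbitrary choices, not parameters from the text.

```python
import random

def evolve_coins(pop_size=500, generations=60, seed=5):
    """Toy phylogenetic learning: each 'coin' has a genome P(heads); coins far
    from fairness are more likely to be replaced by mutated copies of survivors.
    Returns (mean P(heads), mean departure from fairness) of the final population."""
    rng = random.Random(seed)
    pop = [rng.random() for _ in range(pop_size)]        # initial genomes
    for _ in range(generations):
        # elimination probability grows with the departure from fairness 0.5
        survivors = [p for p in pop if rng.random() > 2 * abs(p - 0.5)]
        while len(survivors) < pop_size:                 # refill with mutated offspring
            parent = rng.choice(survivors)
            survivors.append(min(1.0, max(0.0, parent + rng.gauss(0, 0.02))))
        pop = survivors
    return sum(pop) / len(pop), sum(abs(p - 0.5) for p in pop) / len(pop)

mean_p, mean_dev = evolve_coins()
```

Over the generations the population's spread around fairness shrinks to roughly the mutation scale: selection has 'learned' the environment's preferred genome.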

These diﬀerent learning schemes are currently being used to build

so-called intelligent systems. Examples are ant colony optimization al-

gorithms for sociogenetic learning, genetic algorithms for phylogenetic

learning, and artiﬁcial neural networks for ontogenetic learning.


10.2 Learning Rules

In the preceding examples, learning consisted in getting close to a spec-

iﬁed goal (class label), close enough with a given degree of certainty,

after n experiments, that is, for a training set having n cases. This

notion of learning is intimately related to the Bernoulli law of large

numbers and entirely in accordance with common sense. When a physi-

cian needs to establish a diagnosis on the basis of a list of observations,

s(he) uses a speciﬁc rule (or set of rules) with which s(he) expects to

get close to the exact diagnosis with a given degree of certainty. This

rule tries to value or weight the observations in such a way as to get

close to an exact diagnosis – which may be, for instance, having or

not having some disease – with a degree of certainty that increases

with the number of diagnoses performed (the physician’s training set).

The observations gathered in n clinical consultations and the evidence

obtained afterwards about whether the patient really had some dis-

ease constitute the physician’s training set. As another example, when

learning how to pronounce a word, say in French, by pronouncing it n

times and receiving the comments of a French teacher, we expect to get

close to the correct pronunciation, with a given degree of certainty, as

n increases. For that purpose, we use some rule (or set of rules) for ad-

justing the pronunciation to ensure the above-mentioned convergence,

in probability, to the correct pronunciation.

In all examples of task learning, one may apply to the same observations a rule (supposedly a learning rule) chosen from several possible rules. Let us go back to the problem of determining a number x in [0, 1], in order to illustrate this aspect. We may establish a practical correspondence for this problem: for instance, the task of determining the point of contact of two underground structures (or rocks) on the basis of drilling results along some straight

line. The drillings are assumed to be made at random points and with

uniform distribution in some interval, as illustrated in Fig. 10.2. They

reveal whether, at the drilling point, we are above structure 1 or above

structure 2. If we are able to devise a rule behaving like the Bernoulli

law of large numbers in the determination of the contact point, we say

that the rule is a learning rule for the task at hand.

There are several feasible rules that can serve as learning rules for this task. Let d be the estimate of the unknown contact point ∆ supplied by the rule. If the rule is d = x_M, that is, select the maximum point (revealed as being) from structure 1 (rule 1), or d = x_m, select the minimum point from structure 2 (rule 2), or again d = (x_M + x_m)/2 (rule 3), the rule is a learning rule, as illustrated in Fig. 10.3. The three rules produce a value d converging in probability (Bernoulli law of large numbers) to ∆.

Fig. 10.2. Drillings in some interval, designated by [0, 1], aiming to determine the contact point ∆ between structure 1 (open circles) and structure 2 (solid circles)
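A quick simulation shows all three rules homing in on ∆. The setup is invented for illustration: n drillings at uniform random points in [0, 1], with the true contact point at ∆ = 0.5.

```python
import random

def simulate_rules(delta=0.5, n=200, seed=1):
    """Toy drilling experiment: n uniform points in [0, 1]; points at or below
    delta hit structure 1, the rest hit structure 2. Estimate delta three ways."""
    rng = random.Random(seed)
    points = [rng.random() for _ in range(n)]
    s1 = [p for p in points if p <= delta]     # drillings that hit structure 1
    s2 = [p for p in points if p > delta]      # drillings that hit structure 2
    x_max, x_min = max(s1), min(s2)
    return x_max, x_min, (x_max + x_min) / 2   # rules 1, 2 and 3

r1, r2, r3 = simulate_rules()
```

With 200 drillings, all three estimates already sit within a few hundredths of ∆, and the gap shrinks as n grows, in the manner of the Bernoulli law.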

We now consider another version of the problem of ﬁnding a point

on a line segment. In contrast to what we had in the above problem,

where there was a separation between the two categories of points, we

now suppose that this separation does not exist. Let us concretize this

scenario by considering a clinical analysis for detecting the presence

or absence of some disease. Our intention is to determine which value,

produced by the analysis, separates the two classes of cases: the nega-

tive ones (without disease) and the positive ones (with disease), which

we denote N and P, respectively. Usually, clinical analyses do not al-

low a total class separation, and instead one observes an overlap of the

respective probabilistic distributions. Figure 10.4 shows an example of

this situation.

In Fig. 10.4 we assume equal-variance normal distributions for the

two classes of diagnosis. Whatever the threshold value d we may choose

on the straight line, we will always have to cope with a non-null clas-

Fig. 10.3. Average deviation d − ∆ curves (solid lines) with ±1 standard deviation bounds (dotted lines), depending on the training set size n, computed in 25 simulations for ∆ = 0.5 and two learning rules: (a) d = x_M, (b) d = (x_M + x_m)/2

Fig. 10.4. Upper: Probability densities of a clinical analysis value for two

classes, N and P. Lower: Possible training set

siﬁcation error. In the case of Fig. 10.4, the best value for d – the one

yielding a minimum error – is the one indicated in the ﬁgure (as sug-

gested by the symmetry of the problem): d = ∆. The wrongly classiﬁed

cases are the positive ones producing analysis values below ∆ and the

negative ones producing values above ∆. Unfortunately, in many prac-

tical problems where one is required to learn the task ‘what is the best d

separating positive from negative cases?’, the probability distributions

are unknown. We are only given a training set like the one represented

at the bottom of Fig. 10.4. What is then the behavior of the previous

rules 1, 2 and 3? Rules 1 and 2 for inﬁnite tailed distributions as in

Fig. 10.4 converge to values that are clearly far away from ∆, and hence

they do not behave as learning rules in this case. In fact, in this case

we will get a non-null probability of obtaining d values arbitrarily far

from ∆. As for rule 3, it is still a learning rule for symmetrical distribu-

tions, since it will lead on average and for the wrongly classiﬁed cases

to a balance between positive and negative deviations from ∆. Had we written rule 3 as d = (x_M + 2x_m)/3, that is, changing the weights of the rule parameters (x_M, x_m) from (1/2, 1/2) to (1/3, 2/3), the rule would no longer be a learning rule. In fact, even with weights (1/2, 1/2), rule 3 is not, in general, a learning rule.

There are, however, more sophisticated rules that behave as learning

rules for large classes of problems and distributions. One such rule is

based on a technique that reaches back to Abraham de Moivre. It

consists in adjusting d so as to minimize the sum of squares of the

deviations for the wrongly classiﬁed cases. Another rule is based on the

notion of entropy presented in the last chapter. It consists in adjusting d

so as to minimize the entropy of the classiﬁcation errors and, therefore,

their degree of disorder around the null error. (For certain classes of

problems, my research team has shown that, surprisingly enough, it

may be preferable to maximize instead of minimizing the error entropy.)
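A minimal sketch of the least-squares idea, using a grid search over candidate thresholds (the data points below are invented; a real implementation would use a proper optimizer):

```python
def squared_error(d, negatives, positives):
    """Sum of squared deviations from d over wrongly classified cases only:
    negatives falling above d and positives falling below d."""
    err = sum((x - d) ** 2 for x in negatives if x > d)
    err += sum((x - d) ** 2 for x in positives if x < d)
    return err

def best_threshold(negatives, positives, steps=1000):
    """Scan candidate thresholds in [0, 1]; keep the minimum squared error."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda d: squared_error(d, negatives, positives))

neg = [0.1, 0.2, 0.35, 0.45]   # class N analysis values (invented)
pos = [0.4, 0.55, 0.7, 0.9]    # class P analysis values (invented)
d = best_threshold(neg, pos)
```

For these overlapping classes the minimum lands between the stray negative at 0.45 and the stray positive at 0.4, balancing the two squared deviations.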


10.3 Impossible Learning

As strange as it may look at ﬁrst sight, there exist scenarios where

learning is impossible. Some have to do with the distribution of the

data; others correspond to stranger scenarios. Let us ﬁrst take a look

at an example of impossible learning for certain data distributions. Let

us again consider the previous example of ﬁnding out a threshold value

∆ separating two classes, N and P. The only diﬀerence is that we now

use the following rule 4: we determine the averages of the values for

classes N and P, say m

N

and m

P

, and thereafter their average, that is,

d =

m

N

+m

P

2

.

This is a learning rule for a large class of distributions, given the con-

vergence in probability of averages. But we now consider the situa-

tion where the distributions in both classes are Cauchy. As we saw in

Chap. 6, the law of large numbers is not applicable to Cauchy distri-

butions, and the convergence in probability of averages does not hold

in this case. The situation we will encounter is the one illustrated in

Fig. 10.5. Learning is impossible.
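Rule 4's failure can be seen numerically: the sample mean of Cauchy values does not settle down the way it does for, say, Gaussian values. Below, the standard Cauchy is sampled by inverse transform (the tangent of a uniform angle); the experiment sizes are invented.

```python
import math
import random

def mean_spread(sample_one, n_experiments=200, n=1000, seed=7):
    """Spread (max - min) of the sample mean across repeated experiments."""
    rng = random.Random(seed)
    means = [sum(sample_one(rng) for _ in range(n)) / n
             for _ in range(n_experiments)]
    return max(means) - min(means)

gauss_spread = mean_spread(lambda r: r.gauss(0.0, 1.0))
# standard Cauchy via inverse transform: tan(pi * (U - 1/2))
cauchy_spread = mean_spread(lambda r: math.tan(math.pi * (r.random() - 0.5)))
```

The Gaussian means crowd together ever more tightly as n grows; the Cauchy means remain as wildly dispersed as a single Cauchy draw, so rule 4 never converges.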

There are other and stranger scenarios where learning is impossible.

Let us examine an example which has to do with expressing numbers of

the interval [0, 1] in binary, as we did in the section on Borel’s normal

numbers in Chap. 6 and illustrated in Fig. 6.9. Let us assume that

X chooses an integer n that L has to guess using active learning. L

does not know any rule that might lead X to favor some integers over

others. For L everything happens as if X’s choice were made completely

at random. Since the aim is to learn/guess n, the game stipulates that

Fig. 10.5. The application of rule 4, under the same conditions as in Fig. 10.3, for two classes with Cauchy distributed values

L should ask X if a certain number in the interval [0, 1] contains the

binary digit d_n. For instance, the number 0.3, represented in binary by 0.01001100110011... (repetition of 0011), does not contain d_3, 0.001, which corresponds in decimal to 0.125. Here is an example of questions and answers when X chooses n = 3 (d_3):

L: 0.1?

X : 0 [(the binary representation of) 0.1 does not contain (the

binary representation of) 0.125].

L: 0.2?

X : 1 (0.2 contains 0.125).

L: 0.3?

X : 0 (0.3 does not contain 0.125).

L: 0.22?

X : 1 (0.22 contains 0.125).

The question now is: will L be able to guess n after a ﬁnite number of

questions?

We first note that whatever d_i the learner L decides to present as a guess, s(he) always has a 50% probability of hitting upon the value chosen by X. This is due to the fact that, for a uniformly distributed x in [0, 1], the average of the d_i values for any selected value of x converges to 0.5, independently of the chosen d_i. In other words, any random choice made by L always has a 50% probability of a hit. Is it possible that, by

using a suﬃciently ‘clever’ rule, we will be able to substantially improve

over this 50% probability? The surprising aspect comes now. Consider

the average of the absolute diﬀerences between what is produced by the

rule and what is produced by the right choice. In the above example,

the values produced by the right choice 0.3 correspond to the 0101

sequence of answers. Supposing that L, using some rule, decided 0.25,

that is d_2, we would get the 0010 sequence. The average of the absolute differences between what is produced by the rule and what is produced by the right choice is (0 + 1 + 1 + 1)/4 = 0.75. Now, this value always converges to 0.5 with an increasing number of questions, because the difference between any pair (d_n, d_i) encloses a 0.5 area (as the reader

may check geometrically). We may conclude that either L makes a hit,

and the error is then zero, or s(he) does not make a hit, and the error

is then still 50%. There is no middle term in the learning process, as

happens in a vast number of scenarios, including the previous examples.

Is it at least possible to arrive at a sophisticated rule allowing us to

reach the right solution in a finite number of steps? The answer is negative. It is impossible to learn n from questions about the d_i set.

10.4 To Learn Is to Generalize

Let us look back at the example in Fig. 10.4. In a previous section

we were interested in obtaining a learning rule that would determine

a separation threshold with minimum classiﬁcation error. That is, we

are making the assumption that the best type of classiﬁer for the two

classes (N and P) is a point. If we use rule 3, we may summarize the

learning task in the following way:

• Task to be learned: classify the value of a single variable (e.g., clinical

analysis) in one of two classes.

• Classiﬁer type: a number (threshold) x.

• Learning rule: rule 3.

This is a deductive approach to the classiﬁcation problem. Once the

type of classiﬁer has been decided, as well as the weighting of its pa-

rameters, we apply the classiﬁer in order to classify/explain new facts.

We progress from the general (classiﬁer type) to the particular (ex-

plaining the observations). What can provide us with the guarantee

that the surmised classiﬁer type is acceptable? It could well happen

that the diagnostic classes were ‘nested’, as shown in Fig. 10.6. In the

case of Fig. 10.6a, if we knew the class conﬁguration, instead of consid-

ering one threshold we would consider three and the classiﬁer would be:

If x is below d_1 or between d_2 and d_3, it belongs to class N; otherwise, it belongs to class P. In the case of Fig. 10.6b, we would consider five thresholds and the following classifier: If x is below d_1, or between d_2 and d_3, or between d_4 and d_5, then x belongs to class N; otherwise, it belongs to class P. Classifiers involving one, three or five thresholds are linear in the respective parameters. In other scenarios it could well happen that the most appropriate classifier is quadratic, as in Fig. 10.7, where we separate the two classes with: If x^2 is greater than d, then x belongs to class N; otherwise, it belongs to class P. We may go on imagining the need to use ever more complex types of classifier, in particular when a larger number of parameters is involved.
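The three classifier shapes can be written as one-line predicates; this is a toy sketch, and the threshold values in the example calls are invented:

```python
def one_threshold(x, d1):
    """Class N if x is below the single threshold."""
    return "N" if x < d1 else "P"

def three_thresholds(x, d1, d2, d3):
    """Class N below d1 or between d2 and d3 (nested configuration, Fig. 10.6a)."""
    return "N" if x < d1 or d2 < x < d3 else "P"

def quadratic(x, d):
    """Class N where x squared exceeds d (the quadratic classifier, Fig. 10.7)."""
    return "N" if x ** 2 > d else "P"

label = three_thresholds(0.5, 0.3, 0.4, 0.6)
```

All three are fixed shapes with free parameters; learning consists in adjusting the d values, not the shape.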

Meanwhile, we never know, in general, if we are confronted with

the conﬁguration of Figs. 10.6a, 10.6b, 10.7, or some other situation.

All we usually have available are simply the observations constituting

Fig. 10.6. Situation of classes N (white) and P (grey) separated by three (a) and five (b) thresholds. The training set is the same as in Fig. 10.4

Fig. 10.7. Quadratic classiﬁer

our training set. How can we then choose which type of classiﬁer to

use? In this situation we intend to determine the classiﬁer type and

its parameters simply on the basis of the observations, without any

assumptions, or with only a minimum of assumptions. Learning is now

inductive. We progress from the particular (the observations) to the

general (classiﬁer type). The answer to the question of what type of

classiﬁer to use is related to the diﬃcult generalization capability issue

of inductive learning.

Suppose from now on that in the above example the distributions of

the analysis value x for the two classes of diagnosis did correspond, in

fact, to the four-structure situation of Fig. 10.6a. However, the person

who designed the classiﬁer does not know this fact; s(he) only knows

the training set. The classiﬁcation rule based on three thresholds is then

adequate. The only thing we have to do is to use a learning rule for

this type of classiﬁer, which adjusts the three thresholds to the three

unknown values ∆_1, ∆_2, and ∆_3, in order to minimize the number of errors. Let us further suppose that in the case of Fig. 10.6a we were fortunate with the training set supplied to us and we succeeded in determining d_1, d_2, and d_3 quite close to the true thresholds ∆_1, ∆_2, and ∆_3. Our solution would then behave equally well, on average terms, for other (new) sets of values, affording the smallest probability of error. The designed classifier would allow generalization and would generalize well (with the smallest probability of error, on average). We could have been unfortunate, having to deal with an atypical training set, leading to the determination of d_1, d_2, and d_3 far away from the true values. In this case, the classifier designed on the basis of the training set would not generalize to our satisfaction.

In summary, we can never be certain that a given type of classiﬁer

allows generalization. The best we can have is some degree of certainty.

But what does this degree of certainty depend on? In fact, it depends


on two things: the number of elements of the training set and the com-

plexity of the classiﬁer (number of parameters and type of classiﬁer).

As far as the number of training set elements is concerned, it is easily

understandable that, as we increase that number, the probability of the

training set being typical also increases, in the sense of the strong law

of large numbers mentioned in Chap. 6.

Let us now focus on the complexity issue. In the present example

the complexity has to do with the number of thresholds used by the

classiﬁer. The simpler the classiﬁer, the easier the generalization. For

the above example, the classiﬁer with a single threshold does not need

such large training sets as the three-threshold classiﬁer, in order to

generalize; however, its generalization is always accompanied by a non-

null classiﬁcation error. It generalizes, but does so rather poorly. We

are thus in a sub-learning situation. At the other extreme, the ﬁve-

threshold classiﬁer does not generalize in any reasonable way, since its

performance behaves erratically; in training sets like those of Fig. 10.6,

it attains zero errors, whereas in new data sets it may produce very

high error rates. We are thus in an over-learning situation, where the

classiﬁer, given its higher complexity, gets so stuck to the training set

details that it will not behave in a similar way when applied to other

(new) data sets. We may say that the one-threshold classiﬁer only de-

scribes the coarse-grained structure of the training set, while the ﬁve-

threshold classiﬁer describes a structure that is too ﬁnely grained, very

much inﬂuenced by the random idiosyncrasies of the training set. The

three-threshold classiﬁer is the best compromise solution for the learn-

ing task. Finding such a best compromise is not always an easy task in

practice.
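The sub-learning/over-learning trade-off described above can be sketched numerically. The following Python experiment is our own illustration, not an example from the book: data is generated from a hypothetical three-threshold rule with 10% label noise, and threshold classifiers of growing complexity are fitted by brute force. The training error can only fall as thresholds are added, while the test error is typically best near the true complexity.

```python
import random
from itertools import combinations

def region_of(x, thresholds):
    """Index (0..k) of the region of the line in which x falls."""
    return sum(t < x for t in thresholds)

def fit_thresholds(xs, ys, k):
    """Brute-force search for the k thresholds minimising training errors,
    labelling each of the k+1 regions by its training-set majority class."""
    pts = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best_err, best = len(xs) + 1, None
    for combo in combinations(candidates, k):
        votes = [[0, 0] for _ in range(k + 1)]
        for x, y in zip(xs, ys):
            votes[region_of(x, combo)][y] += 1
        err = sum(min(v) for v in votes)          # majority label per region
        if err < best_err:
            labels = [0 if v[0] >= v[1] else 1 for v in votes]
            best_err, best = err, (combo, labels)
    return best_err, best

def true_label(x):
    """The (hypothetical) true rule: classes alternate at thresholds 1, 2, 3."""
    return region_of(x, (1.0, 2.0, 3.0)) % 2

random.seed(1)
train_x = [random.uniform(0.0, 4.0) for _ in range(20)]
train_y = [true_label(x) if random.random() > 0.1 else 1 - true_label(x)
           for x in train_x]                      # 10% label noise
test_x = [i * 4.0 / 2000 for i in range(2000)]    # clean test grid

for k in (1, 3, 5):
    train_err, (ths, labels) = fit_thresholds(train_x, train_y, k)
    test_err = sum(labels[region_of(x, ths)] != true_label(x)
                   for x in test_x) / len(test_x)
    print(f"k={k}: train errors {train_err}/20, test error {test_err:.1%}")
```

Because a finer set of thresholds can always reproduce (and usually improve on) a coarser one, the training error is guaranteed non-increasing in k; the test error is not, which is exactly the over-learning phenomenon of the text.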

10.5 The Learning of Science

Occam’s razor told us that, in the presence of uncertainty, the simplest

explanations are the best. Consider the intelligibility of the universe,

or in other words the scientiﬁc understanding of material phenomena.

If, in order to explain a given set of observations of these phenomena,

say n observations, we needed to resort to a rule using n parameters,

we could say that no intelligibility was present. There is no diﬀerence

between knowing the rule and knowing the observations. In fact, to

understand is to compress. The intelligibility of the universe exists only

if inﬁnitely many observations can be explained by laws with a reduced

number of parameters. It amounts to having a small program that
is able to produce infinitely many outcomes. This is the aim of the

‘theory of everything’ that physicists are searching for, encouraged by

the appreciation that, up to now, simple, low complexity (in the above

sense) laws have been found that are capable of explaining a multitude

of observations. We will never be certain (fortunately!) of having found

the theory of everything. On the other hand, inductively learning the

laws of the universe – that is, obtaining laws based on observations

alone –, presents the problem of their generalization: can we be certain,

with a given level of certainty, that the laws we have found behave

equally well for as yet unobserved cases?
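The claim that to understand is to compress can be made concrete with a toy experiment of our own; a general-purpose compressor such as zlib stands in here for the algorithmic notion of a shortest program. A sequence produced by a tiny rule compresses enormously, while a sequence of random digits barely compresses at all.

```python
import random
import zlib

random.seed(0)
n = 100_000

# "Lawful" observations: produced by a tiny rule (a very short program).
lawful = ''.join(str((i * i) % 10) for i in range(n))

# "Lawless" observations: each digit drawn at random.
lawless = ''.join(random.choice('0123456789') for _ in range(n))

for name, seq in (("lawful", lawful), ("lawless", lawless)):
    ratio = len(zlib.compress(seq.encode(), 9)) / len(seq)
    print(f"{name}: compressed to {ratio:.1%} of its original size")
```

The lawful sequence shrinks to a tiny fraction of its size — the compressed file is, in effect, the "law" plus little else — whereas the random one stays close to its information-theoretic limit.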

The Russian mathematician Vladimir Vapnik developed pioneering

work in statistical learning theory, establishing (in 1991) a set of theo-

retical results which allow one in principle to determine a degree of cer-

tainty with which a given classiﬁer (or predictive system) generalizes.

Statistical learning theory clariﬁes what is involved in the inductive

learning model, supplying an objective foundation for inductive valid-

ity. According to this theory we may say that the objective foundation

of inductive validity can be stated in a compact way as follows: in order

to have an acceptable degree of certainty about the generalization of

our inductive model (e.g., classiﬁer type), one must never use a more

complex type of model than is allowed by the available data. Now,

one of the interesting aspects of the theory developed by Vapnik is

the characterization of the generalizing capability by means of a single

number, known as the Vapnik–Chervonenkis dimension (often denoted

by D_VC). A learning model can only be generalized if its D_VC is finite,
and all the more easily as its value is smaller. There are certain models
with infinite D_VC. The parameters of such models can be learned so
that a complete separation of the values of any training set into the
respective classes can be achieved, no matter how complex those values
are and even if we deliberately add incorrect class-labeled values to the
training set.

The fact that an inductive model must have finite D_VC in order
to be valid and therefore be generalizable is related to the falsifiability
criterion proposed by the Austrian philosopher Karl Popper for
demarcating a theory as scientific. As a matter of fact, in the same way as
classifiers with infinite D_VC will never produce incorrect classifications
on any training set (even if deliberately incorrectly labeled observations
are added), it can be said that non-scientific theories will never
err; they always explain the observations, no matter how strange they
are. On the other hand, if scientific laws are used – thus, capable of
generalization –, we are in a situation corresponding to the use of finite
D_VC classifiers; that is, it is possible to present observational data that
is not explainable by the theory (falsifying the theory).

Occam’s razor taught us to be parsimonious in the choice of the

number of parameters of the classiﬁer (more generally, of the learning

system). Vapnik’s theory goes a step further. It tells us the level of

complexity of the classiﬁer (or learning system) to which we may go

for a certain ﬁnite number of observations, in such a way that learning

with generalization is achievable.
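The notion behind D_VC can be illustrated with the simplest family of all: threshold classifiers on a line (our own example, not the book's). A set of points is "shattered" if every possible 0/1 labelling of it is achievable by some member of the family, and D_VC is the size of the largest shatterable set. For thresholds with either orientation, D_VC = 2.

```python
def predict(points, t, sign):
    """Labels assigned by the threshold classifier sign·(x − t) > 0."""
    return tuple(1 if sign * (x - t) > 0 else 0 for x in points)

def shatters(points):
    """Can threshold classifiers realise every labelling of these points?"""
    lo, hi = min(points) - 1, max(points) + 1
    mids = [(a + b) / 2 for a, b in zip(sorted(points), sorted(points)[1:])]
    candidates = [lo, hi] + mids
    achievable = {predict(points, t, s) for t in candidates for s in (1, -1)}
    return len(achievable) == 2 ** len(points)

print(shatters([0.0]))            # True:  one point can always be shattered
print(shatters([0.0, 1.0]))       # True:  all four labellings are reachable
print(shatters([0.0, 1.0, 2.0]))  # False: the labelling 0,1,0 is impossible
```

No three points on a line can be shattered by a single threshold, so the family's D_VC is finite (equal to 2) and, in Vapnik's sense, the family can generalize.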

10.6 Chance and Determinism: A Never-Ending Story

It seems that God does play dice after all, contradicting Albert Ein-

stein’s famous statement. It seems that the decision of where a photon

goes or when a beta emission takes place is of divine random origin, as is

the knowledge of the inﬁnitely many algorithmically complex numbers,

readily available to create a chaotic phenomenon.

The vast majority of the scientiﬁc community presently supports the

vision of a universe where chance events of objective nature do exist. We

have seen that it was not always so. Suﬃce it to remember the famous

statement by Laplace, quoted in Chap. 7. There is, however, a minority

of scientists who defend the idea of a computable universe. This idea

– perhaps better referred to as a belief, because those that advocate it

recognize that it is largely speculative, and not supported by demon-

strable facts – is mainly based on some results of computer science: the

computing capacity of the so-called cellular automata and the algorith-

mic compression of information. Cellular automata are somewhat like

two- and three-dimensional versions of the quadratic iterator presented

in Chap. 7, in the sense that, by using a set of simple rules, they can

give rise to very complex structures, which may be used as models of

such complex phenomena as biological growth. Algorithmic compres-

sion of information deals with the possibility of a simple program being

able to generate a sequence that passes all irregularity tests, like those

mentioned in Chap. 9 concerning the digit sequences of π, e and √2.

On the basis of these ideas, it has been proposed that everything in the

universe is computable, with these proposals ranging from an extreme

vision of a virtual reality universe – a universe without matter, some

sort of hologram –, to a less extreme vision of an inanimate universe

behaving as if it were a huge computer.
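A minimal sketch of such a cellular automaton, in the spirit of the text rather than an example taken from it: in an elementary, one-dimensional automaton each cell looks at itself and its two neighbours, and an 8-bit rule number dictates the next state. Rule 30, despite this simplicity, produces a famously disordered pattern from a single live cell.

```python
def step(cells, rule):
    """One synchronous update of an elementary cellular automaton (wrap-around)."""
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

width, generations = 64, 16
cells = [0] * width
cells[width // 2] = 1                 # start from a single live cell
for _ in range(generations):
    print(''.join('#' if c else '.' for c in cells))
    cells = step(cells, 30)           # rule 30: simple rule, complex behaviour
```

The rule fits in a single byte, yet the pattern it generates passes many tests of irregularity — which is precisely the observation on which the computable-universe speculation leans.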

There are of course many relevant and even obvious objections

to such visions of a computable universe [42].

Fig. 10.8. When further observations are required

The strongest objection to this neo-deterministic resurrection has to do
with the non-computability of pure chance, i.e., of quantum randomness. The most

profound question regarding the intelligibility of the universe thus asks

whether it is completely ordered or chaotic; in other words, whether

chance is a deterministic product of nature or one of its fundamental

features. If the universe has inﬁnite complexity, as believed by the ma-

jority of scientists, it will never be completely understood. Further, we

do not know whether the complexity of a living being can be explained

by a simple program. We do not know, for instance, which rules in-

ﬂuence biological evolution, in such a way that we may forecast which

path it will follow. Are mutations a phenomenon of pure chance? We

do not know. Life seems to be characterized by an inﬁnite complexity

which we will never be able to learn in its entirety.

As a matter of fact, as we said already in the preface, if there were

no chance events, the temporal cause–eﬀect sequences would be de-

terministic and, therefore, computable, although we might need huge

amounts of information for that purpose. It remains to be seen whether

one would ever be able to compute those vast quantities of information

in a useful time. Notwithstanding, the profound conviction of the scien-

tiﬁc community at present is that objective chance is here to stay. We

may nevertheless be sure of one thing – really sure with a high degree of

certainty – that chance and determinism will forever remain a subject

of debate, since objective chance will always lie beyond the reach of
human understanding.

A Some Mathematical Notes

A.1 Powers

For integer n (i.e., a whole number, without fractional part), the
expression x^n, read x to the power n, designates the result of multiplying
x by itself n times. x is called the base and n the exponent. If n is
positive, the calculation of x^n is straightforward, e.g.,

2.5^3 = 2.5 × 2.5 × 2.5 = 15.625 .

From the definition, one has

x^n × x^m = x^(n+m) ,   (x^n)^m = x^(nm) ,

where nm means n × m, e.g.,

2.5^5 = 2.5^3 × 2.5^2 = 2.5 × 2.5 × 2.5 × 2.5 × 2.5 = 97.65625 ,

(0.5^2)^3 = 0.5^6 = 0.015 625 .

In the book, expressions such as (2/3)^6 sometimes occur. We may apply
the definition directly, as before, or perform the calculation as

(2/3)^6 = 2^6/3^6 = 64/729 .

Note also that 1^n is 1.

From x^n × x^m = x^(n+m) comes the convention x^0 = 1 (for any
nonzero x). Moreover, for a negative exponent, −n, we have

x^(−n) = 1/x^n ,

since

x^n × x^(−n) = x^0 = 1 .

For example,

2^(−3) = 1/2^3 = 1/8 .

Powers with fractions as exponents and non-negative bases are
calculated as a generalization of (x^n)^m = x^(nm). For instance, from
(x^(1/2))^2 = x, one has x^(1/2) = √x.
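The rules above can be checked directly in Python with the built-in ** operator; the comparisons are done with small tolerances because the computations use floating point.

```python
x, n, m = 2.5, 3, 2
assert abs(x**n * x**m - x**(n + m)) < 1e-9    # x^n × x^m = x^(n+m)
assert abs((x**n)**m - x**(n * m)) < 1e-9      # (x^n)^m = x^(nm)

print(2.5**3)       # 15.625
print(2**-3)        # 0.125, i.e., 1/8
print(0.25**0.5)    # 0.5: a 1/2 exponent is the square root
print(abs((2/3)**6 - 64/729) < 1e-12)          # True: (2/3)^6 = 2^6/3^6
```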

A.2 Exponential Function

The figure below shows the graphs of the functions y = x^4 and y = 2^x.

Both are calculated with the above power rules. In the ﬁrst function,

the exponent is constant; the function is a power function. In the second

function the base is constant; it is an exponential function. Figure A.1

shows the fast growth of the exponential function in comparison with

the power function. Given any power and exponential functions, it is

always possible to ﬁnd a value x beyond which the exponential is above

the power.

Given an exponential function a^x, its growth rate (measured by the
slope at each point of the curve) is given by Ca^x, where C is a constant.
C can be shown to be 1 when a is the value to which (1 + 1/h)^h tends
when the integer h increases arbitrarily. Let us see some values of this
quantity:

(1 + 1/2)^2 = 1.5^2 = 2.25 ,

(1 + 1/10)^10 = 1.1^10 = 2.593 742 ,

(1 + 1/1000)^1000 = 1.001^1000 = 2.716 924 2 .

Fig. A.1. Comparing the growth of the exponential function and the power function

These quantities converge to the natural or Napier number, denoted

by e. The value of e to 6 signiﬁcant ﬁgures is 2.718 28. The natural

exponential e^x thus enjoys the property that its growth rate is equal to
itself.
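The convergence of (1 + 1/h)^h can be watched directly in Python; math.e holds the limiting value.

```python
import math

# (1 + 1/h)^h approaches e = 2.71828... as h grows
for h in (2, 10, 1000, 1_000_000):
    print(h, (1 + 1 / h) ** h)
print("e =", math.e)
```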

A.3 Logarithm Function

The logarithm function is the inverse of the exponential function; that
is, given y = a^x, the logarithm of y in base a is x. The word 'algorithm'
comes from Khwarizm, presently Khiva in Uzbekistan, birth town of
Abu Ja'far Muhammad ibn Musa al-Khwarizmi (780–850), author of
the first treatise on algebra; the word 'logarithm' itself was coined by
Napier from the Greek logos and arithmos.

This function is usually denoted by log_a. Here are some examples:

• log_1.5(2.25) = 2, because 1.5^2 = 2.25.

• log_1.1(2.593 742) = 10, because 1.1^10 = 2.593 742.

• log_a(1) = 0, because, as we have seen, a^0 = 1 for any a.

The logarithm function in the natural base, viz., log_e(x), or simply

log(x), is frequently used. An important property of log(x) is that its

growth rate is 1/x. The graphs of these functions are shown in Fig. A.2.

Fig. A.2. The logarithm function

Fig. A.3. The behavior of the functions log(x^4) and log(2^x)

The following results are a consequence of the properties of the

power function:

log(xy) = log(x) + log(y) ,   log(x^y) = y log(x) .

The logarithm function provides a useful transformation in the analysis

of fast growth phenomena. For instance, in Fig. A.1 it is not possible

to compare the growth of the two functions for very small or very high

values of x. If we apply the log transformation, we get

log(x^4) = 4 log(x) ,   log(2^x) = x log(2) ,

whose graphs are shown in Fig. A.3. The behavior of the two functions

is now graphically comparable. It becomes clearly understandable that,

in the long run, any exponential function overtakes any power function.
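The same transformation is easy to reproduce numerically; note the crossover at x = 16, where x^4 = 2^x = 65 536, after which the exponential pulls ahead for good.

```python
import math

# The identities behind the transformation:
assert abs(math.log(6.0) - (math.log(2.0) + math.log(3.0))) < 1e-12  # log(xy)
assert abs(math.log(2.0**10) - 10 * math.log(2.0)) < 1e-12           # log(x^y)

# 4·log(x) versus x·log(2); the two lines cross at x = 16, where x^4 = 2^x
for x in (2, 16, 64, 256):
    print(x, round(4 * math.log(x), 3), round(x * math.log(2), 3))
```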

A.4 Factorial Function

The factorial function of a positive integer number n, denoted n!, is

deﬁned as the product of all integers from n down to 1:

n! = n × (n−1) × (n−2) × ... × 2 × 1 .

The factorial function is a very fast-growing function, as can be checked

in the graphs of Fig. A.4. In the left-hand graph, n! is represented
together with x^4 and 2^x, while the respective logarithms are shown on the

right. In the long run, the factorial function overtakes any exponential

function.
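A few values side by side make the comparison vivid; math.factorial computes n! exactly.

```python
import math

# n^4, 2^n and n! side by side: the factorial soon dwarfs the other two
for n in (5, 10, 20):
    print(n, n**4, 2**n, math.factorial(n))
```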

Fig. A.4. Left: The functions n!, x^4 and 2^x. Right: Logarithms of the three
functions on the left

A.5 Sinusoids

The sine function sin(x) (sine of x) is deﬁned by considering the radius

OP of a unit-radius circle enclosing the angle x, measured in the an-

ticlockwise direction in Fig. A.5. The function represents the length of

the segment AP drawn perpendicularly from P to the horizontal axis.

The length can be either positive or negative, depending on whether

the segment is above or below the horizontal axis, respectively. With

the angle measured in radians, that is, as an arc length represented by

a certain fraction of the unitary radius, and since 2π is the circumfer-

ence, one has

sin(0) = sin(π) = sin(2π) = 0 ,   sin(π/2) = 1 ,
sin(3π/2) = −1 ,   sin(π/4) = √2/2 .

The inverse function of sin(x) is arcsin(x) (arcsine of x). As an example,

arcsin(√2/2) = π/4 (for an arc between 0 and 2π).

Considering point P rotating uniformly around O, so that it de-

scribes a whole circle every T seconds, the segment AP will vary

as sin(2πt/T), where t is the time, or equivalently sin(2πft), where

f = 1/T is the frequency (cycles per second). If the movement of P

starts at the position corresponding to the angle φ, called the phase,

the segment will vary as sin(2πft +φ) (see Fig. 8.8).

The cosine function cos(x) (cosine of x) corresponds to the segment

OA in Fig. A.5. Thus,

cos(0) = cos(2π) = 1 ,   cos(π) = −1 ,
cos(π/2) = cos(3π/2) = 0 ,   cos(π/4) = √2/2 .


Fig. A.5. Unit circle and deﬁnition of the sine and

cosine functions

From Pythagoras’ theorem, one always has

sin^2(x) + cos^2(x) = 1 ,

for any x.
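These values and the Pythagorean identity are easy to verify with Python's math module (the comparisons use a small tolerance, since the trigonometric functions are computed in floating point).

```python
import math

# Values read off the unit circle
assert abs(math.sin(math.pi / 2) - 1) < 1e-12
assert abs(math.sin(math.pi / 4) - math.sqrt(2) / 2) < 1e-12
assert abs(math.cos(math.pi) + 1) < 1e-12
assert abs(math.asin(math.sqrt(2) / 2) - math.pi / 4) < 1e-12  # arcsine inverts sine

# Pythagoras: sin^2(x) + cos^2(x) = 1 for any x
for x in (0.0, 0.3, 1.0, 2.5, 4.0):
    assert abs(math.sin(x) ** 2 + math.cos(x) ** 2 - 1) < 1e-12
print("all identities check out")
```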

A.6 Binary Number System

In the decimal system which we are all accustomed to, the value of

any number is obtained from the given representation by multiplying

each digit by the corresponding power of ten. For instance, the digit

sequence 375 represents

3 × 10^2 + 7 × 10^1 + 5 × 10^0 = 300 + 70 + 5 .

The rule is the same when the number is represented in the binary

system (or any other system base for that matter). In the binary system

there are only two digits (commonly called bits): 0 and 1. Here are some

examples: 101 has value

1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 4 + 1 = 5 ,

while 0.101 has value

1 × 2^(−1) + 0 × 2^(−2) + 1 × 2^(−3) = 1/2 + 1/8 = 0.5 + 0.125 = 0.625 .
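In Python the integer conversion is built in, and the fractional rule is a one-liner; binary_fraction below is our own helper, written purely for illustration.

```python
# Integer part: Python's int() applies exactly the place-value rule
assert int('101', 2) == 5

# Fractional part: the same rule with negative powers of 2
def binary_fraction(bits):
    """Value of '0.b1 b2 ...' given the bits after the binary point."""
    return sum(int(b) * 2 ** -(i + 1) for i, b in enumerate(bits))

print(binary_fraction('101'))   # 0.625
```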

References

The following list of references includes works of popular science and other

sources that are not too technically demanding.

1. Alvarez, L.: A pseudo-experience in parapsychology (letter), Science 148,

1541 (1965)

2. Amaral, L., Goldberger, A., Ivanov, P., Stanley, E.: Modeling heart rate

variability by stochastic feedback, Comp. Phys. Communications 121–2,

126–128 (1999)

3. Ambegaokar, V.: Reasoning about Luck. Probability and Its Uses in

Physics, Cambridge University Press (1996)

4. Bar-Hillel, M., Wagenaar, W.A.: The perception of randomness, Advances

in Applied Mathematics 12, 428–454 (1991)

5. Bassingthwaighte, J.B., Liebovitch, L., West, B.: Fractal Physiology, Ox-

ford University Press (1994)

6. Beltrami, E.: What is Random?, Springer-Verlag (1999)

7. Bennet, D.J.: Randomness, Harvard University Press (1998)

8. Bernstein, P.: Against the Gods. The Remarkable Story of Risk, John

Wiley & Sons (1996)

9. Blackmore, S.: The lure of the paranormal, New Scientist 22, 62–65 (1990)

10. Bolle, F.: The envelope paradox, the Siegel paradox, and the impossi-

bility of random walks in equity and ﬁnancial markets, www.econ.euv

-frankfurto.de/veroeffentlichungen/bolle.html (2003)

11. Bouchaud, J.-P.: Les lois des grands nombres, La Recherche 26, 784–788

(1995)

12. Brown, J.: Is the universe a computer? New Scientist 14, 37–39 (1990)


13. Casti, J.: Truly, madly, randomly, New Scientist 155, 32–35 (1997)

14. Chaitin, G.: Randomness and mathematical proof, Scientiﬁc American

232, No. 5, 47–52 (1975)

15. Chaitin, G.: Randomness in arithmetic, Scientiﬁc American 259, No. 7,

52–57 (1988)

16. Chaitin, G.: Le hasard des nombres, La Recherche 22, 610–615 (1991)

17. Chaitin, G.: L’Univers est-il intelligible? La Recherche 370, 34–41 (2003)

18. Crutchﬁeld, J., Farmer, J.D., Packard, N., Shaw, R.: Chaos, Scientiﬁc

American 255, No. 6, 38–49 (1986)

19. Deheuvels, P.: La probabilité, le hasard et la certitude, collection Que Sais-

Je?, Presses Universitaires de France (1982)

20. Devlin, K.: The two envelopes paradox. Devlin’s angle, MAA On-

line, The Mathematical Association of America, www.maa.org/devlin/

devlin 0708 04.html (2005)

21. Diaconis, P., Mosteller, F.: Methods for studying coincidences, Journal of

the American Statistical Association 84, 853–861 (1989)

22. Droesbeke, J.-J., Tassi, P.: Histoire de la statistique, collection Que Sais-

Je?, Presses Universitaires de France (1990)

23. Everitt, B.S.: Chance Rules. An Informal Guide to Probability, Risk and

Statistics, Springer-Verlag (1999)

24. Feynman, R.: QED. The Strange Theory of Light and Matter, Penguin,

London (1985)

25. Ford, J.: What is chaos that we should be mindful of? In: The New

Physics, ed. by P. Davies, Cambridge University Press (1989)

26. Gamow, G., Stern, M.: Jeux Mathématiques. Quelques casse-tête, Dunod

(1961)

27. Gell-Mann, M.: The Quark and the Jaguar, W.H. Freeman (1994)

28. Ghirardi, G.: Sneaking a Look at God’s Cards. Unraveling the Mysteries

of Quantum Mechanics, Princeton University Press (2005)

29. Gindikin, S.: Tales of Physicists and Mathematicians, Birkhäuser, Boston

(1988)

30. Goldberger, A., Rigney, D., West, B.: Chaos and fractals in human phys-

iology, Scientiﬁc American 262, No. 2, 35–41 (1990)

31. Haken, H., Wunderlin, A.: Le chaos déterministe, La Recherche 21, 1248–1255 (1990)

preposterous, The American Statistician 46, 197–202 (1992)

33. Haroche, S., Raimond, J.-M., Brune, M.: Le chat de Schrödinger se prête
à l'expérience, La Recherche 301, 50–55 (1997)

34. Hénon, M.: La diffusion chaotique, La Recherche 209, 490–498 (1989)

35. Idrac, J.: Mesure et instrument de mesure, Dunod (1960)

36. Jacquard, A.: Les Probabilités, collection Que Sais-Je?, Presses Universi-

taires de France (1974)

37. Kac, M.: What is Random?, American Scientist 71, 405–406 (1983)

38. Layzer, D.: The arrow of time, Scientiﬁc American 233, No. 6, 56–69

(1975)

39. Mandelbrot, B.: A multifractal walk down Wall Street, Scientiﬁc American

280, No. 2, 50–53 (1999)

40. Martinoli, A., Theraulaz, G., Deneubourg, J.-L.: Quand les robots imitent

la nature, La Recherche 358, 56–62 (2002)

41. McGrew, T.J., Shier, D., Silverstein, H.S.: The two-envelope paradox re-

solved, Analysis 57.1, 28–33 (1997)

42. Mitchell, M.: L’Univers est-il un calculateur? Quelques raisons pour

douter, La Recherche 360, 38–43 (2003)

43. Nicolis, G.: Physics of far-from-equilibrium systems and self-organisation.

In: The New Physics, ed. by P. Davies, Cambridge University Press (1989)

44. Ollivier, H., Pajot, P.: La décohérence, espoir du calcul quantique, La

Recherche 378, 34–37 (2004)

45. Orléan, A.: Les désordres boursiers, La Recherche 22, 668–672 (1991)

46. Paulson, R.: Using lottery games to illustrate statistical concepts and

abuses, The American Statistician 46, 202–204 (1992)

47. Peitgen, H.-O., Jürgens, H., Saupe, D.: Chaos and Fractals. New Frontiers

of Science, Springer-Verlag (2004)

48. Perakh, M.: Improbable probabilities, www.nctimes.net/~mark/bibl

science/probabilities.htm (1999)

49. Pierce, J.: An Introduction to Information Theory. Symbols, Signals and

Noise, Dover Publications (1980)

50. Pincus, S.M.: Assessing serial irregularity and its implications for health,

Proc. Natl. Acad. Sci. USA 954, 245–267 (2001)

51. Polkinghorne, J.C.: The Quantum World, Princeton University Press

(1984)


52. Postel-Vinay, O.: L'Univers est-il un calculateur? Les nouveaux démiurges,
La Recherche 360, 33–37 (2003)

53. Rae, A.: Quantum Physics: Illusion or Reality?, 2nd edn., Cambridge

University Press (2004)

54. Rittaud, B.: L'Ordinateur à rude épreuve, La Recherche 381, 28–35 (2004)

55. Ruelle, D.: Hasard et chaos, collection Points, Editions Odile Jacob (1991)

56. Salsburg, D.: The Lady Tasting Tea: How Statistics Revolutionized Science

in the Twentieth Century, W.H. Freeman (2001)

57. Schmidhuber, J.: A computer scientist’s view of life, the universe, and

everything, http://www.idsia.ch/~juergen/everything (1997)

58. Shimony, A.: Conceptual foundations of quantum mechanics. In: The New

Physics, ed. by P. Davies, Cambridge University Press (1989)

59. Steyer, D.: The Strange World of Quantum Mechanics, Cambridge Uni-

versity Press (2000)

60. Taylor, J.: An Introduction to Error Analysis, 2nd edn., University Science

Books (1997)

61. Tegmark, M., Wheeler, J.A.: 100 years of quantum mysteries, Scientiﬁc

American 284, 54–61 (2001)

62. UCI Machine Learning Repository (The Boston Housing Data): Website

www.ics.uci.edu/~mlearn/MLSummary.html

63. West, B., Goldberger, A.: Physiology in fractal dimensions, American Sci-

entist 75, 354–364 (1987)

64. West, B., Shlesinger, M.: The noise in natural phenomena, American Sci-

entist 78, 40–45 (1990)

65. Wikipedia: Ordre de grandeur (nombre), fr.wikipedia.org/wiki/

Ordre de grandeur (nombre) (2005)


may happen in one way or another. The origin of the word ‘chance’ is usually traced back to the vulgar Latin word ‘cadentia’. 2.Preface Our lives are immersed in a sea of chance. the events of our daily lives would be totally predictable. Playing the stock markets would no longer be hazardous. with suﬃcient information. If there were no phenomena with unforeseeable outcomes. Everyone’s existence is a meeting point of a multitude of accidents.15. meaning a befalling by fortuitous circumstances. it would be a mere contractual operation between the players. Every experimental measurement would produce an exact result and as a consequence we would be far more advanced in . all temporal cause–eﬀect sequences would be completely deterministic. 2. but things that happen by chance cannot be certain. In a certain sense chance is the spice of life. phenomena with an element of chance.IX. In this way.VI. ‘accident’ or ‘casualty’ except to an event which has so occurred or happened that it either might not have occurred at all. deterministically.24. how many points to obtain at each throw. For if a thing that is going to happen. or might have occurred in any other way. chance is predominant. The Roman philosopher Cicero clearly expressed the idea of ‘chance’ in his work De Divinatione: For we do not apply the words ‘chance’. whether it be the time of arrival of the train tomorrow or the precise nature of what one will be doing at 5. such as dicethrowing games for example. All games of chance. with no knowable or determinable causes. ‘luck’. Instead.30 pm on 1 April three years from now. indiﬀerently. would no longer be worth playing! Every player would be able to control beforehand.

These were then the only reasonable experiments with chance. supposedly conditioning every event. In fact. there would also be some pleasant consequences. for example. including statistics. chance is sweetness and bitterness. we would be engaged in a long and tedious walk of eternal predictability. This double nature drove our ancestors to take it as a divine property. with the feeling that the ‘wave’ has something supernatural about it (while in contrast a wave of good fortune is often overlooked)? Being of divine origin. what would be the sense in studying divine designs? Phenomena with an element of chance were simply used as a means of probing the will of the gods. In its role as the spice of life. This imaginary will. After all.VI Preface our knowledge of the laws of the universe. dice were thrown and entrails were inspected with this very aim of discerning the will of the gods. which has become an indispensable tool both in scientiﬁc research and in the planning of social activities. as the visible and whimsical emanation of the will of the gods. In this area of application. Along the way. Car accidents would no longer take place. was called sors or sortis by the Romans. To put it brieﬂy. even though the number of risk . the laws of chance phenomena that were discovered have also provided us with an adequate way to deal with the risks involved in human activities and decisions. and ‘sort’ in French). Who has never had the impression that they are going through a wave of misfortune. ‘suerte’ in Spanish. and economics would become an exact science. Of course. But it was not only the Romans and peoples of other ancient civilizations who attributed this divine. fortune and ruin. making our lives safer. chance was not considered a valid subject of study by the sages of ancient civilizations. Human intellect then initiated a brilliant journey that has led from probability theory to many important and modern ﬁelds. 
Even today fortune-telling by palm reading and tarot are a manifestation of this long-lived belief that chance oﬀers a supernatural way of discerning the divine will. Besides being immersed in a sea of chances. a vast body of knowledge and a battery of techniques have been developed. But much of this regularity is intimately related to chance. there are laws of order in the laws of chance. supernatural feeling to this notion of chance. many deaths would be avoided by early therapeutic action. This word for fatalistic luck is present in the Latin languages (‘sorte’ in Portuguese. like statistical learning theory. whose discovery was initiated when several illustrious minds tried to clarify certain features that had always appeared obscure in popular games. we are also immersed in a sea of regularity. Indeed.

However. many milestones have only recently been passed and a good number of other issues remain open. analysis and interpretation of chance phenomena. Among the many roads of scientiﬁc discovery. picking out the most relevant milestones. I also include some bibliographic references to popular science papers or papers that are not too technically demanding.P. One need only think of the role played by insurance companies. many of which relate to important practical issues. The idea of writing this book came from my work in areas where the inﬂuence of chance phenomena is important. a few notes are included at the end of the book to provide a quick reference concerning certain topics for the sake of a better understanding. In writing the book I have tried to reduce to a minimum the mathematical prerequisites assumed of the reader. Many topics and examples treated in the book arose from discussions with members of my research team. October 2007 J. One might think that in seven thousand years of history mankind would already have solved practically all the problems arising from the study. We shall see a wide range of examples to illustrate this throughout the book. However. Porto. This book provides a quick and almost chronological tour along the long road to understanding chance phenomena. perhaps none is so abundant in amazing and counterintuitive results. Putting this reﬂection into writing has been an exhilarating task. supported by the understanding of my wife and son. Marques de S´ a .Preface VII factors and accidents in our lives is constantly on the increase.

Contents

1 Probabilities and Games of Chance
   1.1 Illustrious Minds Play Dice
   1.2 The Classic Notion of Probability
   1.3 A Few Basic Rules
   1.4 Testing the Rules
   1.5 The Frequency Notion of Probability
   1.6 Counting Techniques
   1.7 Games of Chance
      1.7.1 Toto
      1.7.2 Lotto
      1.7.3 Poker
      1.7.4 Slot Machines
      1.7.5 Roulette
2 Amazing Conditions
   2.1 Conditional Events
   2.2 Experimental Conditioning
   2.3 Independent Events
   2.4 A Very Special Reverend
   2.5 Probabilities in Daily Life
   2.6 Challenging the Intuition
      2.6.1 The Problem of the Sons
      2.6.2 The Problem of the Aces
      2.6.3 The Problem of the Birthdays
      2.6.4 The Problem of the Right Door
      2.6.5 The Problem of Encounters
3 Expecting to Win
   3.1 Mathematical Expectation
   3.2 The Law of Large Numbers
   3.3 Bad Luck Games
   3.4 The Two-Envelope Paradox
   3.5 The Saint Petersburg Paradox
   3.6 When the Improbable Happens
   3.7 The Chevalier de Méré Problem
   3.8 Martingales
   3.9 How Not to Lose Much in Games of Chance
   3.10 A False Recipe for Winning the Lotto
4 The Wonderful Curve
   4.1 Errors and the Bell Curve
   4.2 A Continuum of Chances
   4.3 Buffon's Needle
   4.4 Monte Carlo Methods
   4.5 Normal and Binomial Distributions
      4.5.1 Approximating the Binomial Law
   4.6 Distribution of the Arithmetic Mean
   4.7 The Law of Large Numbers Revisited
   4.8 Surveys and Polls
   4.9 The Ubiquity of Normality
   4.10 Paranormal Coincidences
5 Probable Inferences
   5.1 Observations and Experiments
   5.2 Statistics
   5.3 Chance Relations
   5.4 Correlation
   5.5 Significant Correlation
   5.6 Correlation in Everyday Life
6 Fortune and Ruin
   6.1 The Random Walk
   6.2 Wild Rides
   6.3 The Strange Arcsine Law
   6.4 The Ehrenfest Dogs
   6.5 Almost Sure Interest Rates
   6.6 Populations in Crisis
   6.7 An Innocent Formula
   6.8 Chance in Life
7 The Nature of Chance
   7.1 Chance and Determinism
   7.2 Quantum Probabilities
   7.3 A Planetary Game
   7.4 The Chancier the Better
   7.5 When Nothing Else Looks the Same
   7.6 White Noise
   7.7 Chance in Natural Systems
   7.8 Noises and More Than That
8 Noisy Irregularities
   8.1 Generators and Sequences
   8.2 Judging Randomness
   8.3 Random-Looking Numbers
   8.4 Quantum Generators
   8.5 Averages Without Expectation
   8.6 Borel's Normal Numbers
   8.7 The Strong Law of Large Numbers
   8.8 The Law of Iterated Logarithms
   8.9 The Fractality of Randomness
9 Chance and Order
   9.1 The Guessing Game
   9.2 Information and Entropy
   9.3 Randomness with Memory
   9.4 Sequence Entropies
   9.5 Algorithmic Complexity
   9.6 The Undecidability of Randomness
   9.7 A Ghost Number
   9.8 Occam's Razor
   9.9 Random Texts
10 Living with Chance
   10.1 Learning, in Spite of Chance
   10.2 Learning Rules
   10.3 Impossible Learning
   10.4 To Learn Is to Generalize
   10.5 The Learning of Science
   10.6 Chance and Determinism: A Never-Ending Story
A Some Mathematical Notes
   A.1 Powers
   A.2 Exponential Function
   A.3 Logarithm Function
   A.4 Factorial Function
   A.5 Sinusoids
   A.6 Binary Number System
References

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199 10. . . . . . . . . . 217 A. . . . . . . . . 213 Some Mathematical Notes . . . . 204 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196 10 Living with Chance . . . . .6 Binary Number System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . in Spite of Chance . . . . . . . . . . . . . . . . . . . . . . . .7 9. . . . . . . . . . . . . . . . . . . . . . . . 191 The Undecidability of Randomness . . . .3 Logarithm Function . . . . . . . . . . . . 211 10. . . . . . . . . . . . .2 Learning Rules . . . . . . . .4 Factorial Function . . . . . . 199 10. . . . . . . . . . . . . . . . . . . . . . . . . . . 215 A. . .3 Impossible Learning . . .5 9. . . . . . . . . . 207 10. . . . . . . . . . . 221 . . . . . . . . . . . . . . . . . . . . . . . . . . .6 Chance and Determinism: A Never-Ending Story . .5 The Learning of Science . . . .4 To Learn Is to Generalize . . 195 Occam’s Razor . . . . . . . . . . . . . .2 Exponential Function . . . . . . . . . . .8 9. . . . . . . . . . . . . . . . . .1 Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 Algorithmic Complexity . . 215 A. . . 219 A. . .4 9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Sinusoids . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9 Random Texts . . . . . . . . . . . . . . . . . 209 10. 184 Sequence Entropies . . . . . . . . . . . . . . . . . . 194 A Ghost Number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Powers . . . . . . . . . . . 218 A. 220 References . . . . . . . . . . . . . . . . . . . .6 9. . . .XII Contents 9. . . . . . . . . . . . . . . . . . . . . 216 A. . . . . . . . . . . . . . . . . . .

1 Probabilities and Games of Chance

1.1 Illustrious Minds Play Dice

A well-known Latin saying is the famous alea jacta est (the die has been cast), attributed to Julius Caesar when he took the decision to cross with his legions the small river Rubicon, a frontier landmark between the Italic Republic and Cisalpine Gaul. The decision taken by Julius Caesar amounted to invading the territory (Italic Republic) assigned to another triumvir, sharing with him the Roman power. Civil war was an almost certain consequence of such an act, and this was in fact what happened. Faced with the serious consequences of his decision, Caesar found it wise before opting for this line of action to allow the divine will to manifest itself by throwing dice.

Certain objects appropriately selected to generate unpredictable outcomes, such as astragals (specific animal bones) and dice with various shapes (one might say 'chance' generators), have been used since time immemorial either to implore the gods in a neutral manner, with no human bias, to manifest their will, or purely for the purposes of entertainment. The Middle Ages brought about an increasing popularization of dice games to the point that they became a common entertainment in both taverns and royal courts.

In 1560, the Italian philosopher, physician, astrologist and mathematician Gerolamo Cardano (1501–1576), author of the monumental work Artis Magnae Sive de Regulis Algebraicis (The Great Art or the Algebraic Rules) and co-inventor of complex numbers, wrote a treatise on dice games, Liber de Ludo Aleae (Book of Dice Games), published posthumously in 1663, in which he introduced the concept of probability of obtaining a specific die face, although he did not explicitly use the word 'probability'. He also wove together several ideas on how to obtain specific face combinations when throwing dice. Gerolamo Cardano (known as Jérôme Cardan, in French) is nowadays better remembered for one of his inventions: the cardan joint (originally used to keep ship compasses horizontal).

Around 1600, Galileo Galilei (1564–1642) wrote a book entitled Sopra le Scoperte dei Dadi (Analysis of Dice Games), where among other topics he analyses the problem of how to obtain a certain number of points when throwing three dice.

Sometime around 1654, Blaise Pascal (1623–1662) and Pierre de Fermat (1601–1665) exchanged letters discussing the mathematical analysis of problems related to dice games and in particular a series of tough problems raised by the Chevalier de Méré, a writer and nobleman from the court of French king Louis XIV, who was interested in the mathematical analysis of games. As this correspondence unfolds, Pascal introduces the classic concept of probability as the ratio between the number of favorable outcomes over the number of possible outcomes. A little later, in 1657, the Dutch physicist Christiaan Huygens (1629–1695) published his book De Ratiociniis in Ludo Aleae (The Theory of Dice Games), in which, besides presenting some counting techniques, he provides a systematic discussion of all the results concerning dice games as they were understood in his day. The works of Cardano and Galileo also reveal concepts such as event equiprobability (equal likelihood) and expected gain in a game.

Dice provided a readily accessible and easy means to study the 'rules of chance', since the number of possible outcomes (chances) in a game with one or two dice is small and perfectly known. On the other hand, the incentive for figuring out the best strategy for a player to adopt when betting was high. It should be no surprise then that several mathematicians were questioned by players interested in the topic. In this way, a humble object that had been used for at least two thousand years for religious or entertainment purposes inspired human thought to understand chance (or random) phenomena. It also led to the name given to random phenomena in the Latin languages, from the Latin word 'alea' for die: 'aleatório' in Portuguese and 'aléatoire' in French.

1.2 The Classic Notion of Probability

Although the notion of probability – as a degree-of-certainty measure associated with a random phenomenon – reaches back to the works of Cardano, Pascal and Huygens, the word 'probability' itself was only used for the first time by Jacob (James) Bernoulli (1654–1705), professor of mathematics in Basel and one of the great scientific personalities of his day, in his fundamental work Ars Conjectandi (The Art of Conjecture), published posthumously in 1713, in which a mathematical theory of probability is presented. In 1812, the French mathematician Pierre-Simon de Laplace (1749–1827) published his Théorie Analytique des Probabilités, in which the classic definition of probability is stated explicitly:

   Pour étudier un phénomène, il faut réduire tous les événements du même type à un certain nombre de cas également possibles, et alors la probabilité d'un événement donné est une fraction, dont le numérateur représente le nombre de cas favorables à l'événement et dont le dénominateur représente par contre le nombre des cas possibles.

To study a phenomenon, one must reduce all events of the same type to a certain number of equally possible cases, and then the probability is a fraction whose numerator represents the number of cases favorable to the event and whose denominator represents the number of possible cases.

Therefore, in order to determine the probability of a given outcome (also called 'case' or 'event') of a random phenomenon, one must first reduce the phenomenon to a set of elementary and equally probable outcomes or 'cases'. These are referred to as elementary since they cannot be decomposed into other events. One then proceeds to determine the ratio between the number of cases favorable to the event and the total number of possible cases. Denoting the probability of an event by P(event), this gives

   P(event) = number of favorable cases / number of possible cases .

Let us assume that in the throw of a die we wish to determine the probability of obtaining a certain face (showing upwards), assuming that we are dealing with a fair die, i.e., one that has not been tampered with, and also that the throws are performed in a fair manner. There are 6 possible outcomes or elementary events (the six faces), which we represent by the set {1, 2, 3, 4, 5, 6}. Moreover, any of the six faces is equally probable. If the event that interests us is 'face 5 turned up', since there is only one face with this value (one favorable event), we have P(face 5 turned up) = P(5) = 1/6, and likewise for the other faces. Interpreting probability values as percentages, we may say that there is 16.7% certainty that any previously specified die face will turn up. Figure 1.1 displays the various probability values P for a fair die and throws, the so-called probability function of a fair die. The sum of all probabilities is 6 × (1/6) = 1, in accord with the idea that there is 100% certainty that one of the faces will turn up (a sure event).

Fig. 1.1. Probability value of a given face turning up when throwing a fair die

When tossing a coin up in the air (a chance decision-maker much used in football matches), the set of possible results is {heads, tails}. For a fair coin, tossed in a fair way, we have P(heads) = P(tails) = 1/2. When extracting a playing card at random from a 52 card deck, the probability of a card with a specific value and suit turning up is 1/52. If, on the other hand, we are not interested in the suit, so that the set of elementary events is {ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, jack, queen, king}, the probability of a card of a given value turning up is 1/13.

Let us now consider a die with two faces marked with 5, perhaps as a result of a manufacturing error. Let us assume that we have written the letter a on one of the faces with a 5, and b on the other. The set of equiprobable elementary events is now {1, 2, 3, 4, 5a, 5b} (and not {1, 2, 3, 4, 5}). In this case the event 'face 5 turned up' can occur in two distinct ways. Hence, P(5) = 2/6 = 1/3, although 1/6 is still the probability for each of the remaining faces and the sum of all elementary probabilities is still of course equal to 1.

The fairness of chance-generating devices, such as dice, coins, cards, roulettes, and the fairness of their use, will be discussed later on. For the time being we shall just accept that the device configuration and its use will not favor any particular elementary event. This is henceforth our basic assumption unless otherwise stated.

1.3 A Few Basic Rules

The great merit of the probability measure lies in the fact that, with only a few basic rules directly deducible from the classic definition, one can compute the probability of any event composed of finitely many elementary events. The quantification of chance then becomes a feasible task.

For instance, in the case of a die, the event 'face value higher than 4' corresponds to putting together (finding the union of) the elementary events 'face 5' and 'face 6', that is:

   face value higher than 4 = face 5 or face 6 = {5, 6} .

Note the word 'or' indicating the union of the two events. What is the value of P({5, 6})? Since the number of favorable cases is 2, we have

   P(face value higher than 4) = P({5, 6}) = 2/6 = 1/3 .

In the same way,

   P({5, 6}) = P(5) + P(6) = 1/6 + 1/6 = 1/3 .

The probability of the union of elementary events computed as the sum of the respective probabilities is then confirmed. It is also obvious that the degree of certainty associated with {5, 6} must be the sum of the individual degrees of certainty.

Let us now consider the event 'face value lower than or equal to 4'. We have

   P(face value lower than or equal to 4) = P(1) + P(2) + P(3) + P(4) = 1/6 + 1/6 + 1/6 + 1/6 = 4/6 = 2/3 .

But one readily notices that 'face value lower than or equal to 4' is the opposite (or negation) of 'face value higher than 4'. The event 'face value lower than or equal to 4' is called the complement of 'face value higher than 4' relative to the sure event, that is, the event that always happens. Hence,

   P(face value lower than or equal to 4) = 1 − P(face value higher than 4) = 1 − 1/3 = 2/3 .

Given two complementary events denoted by A and Ā, where Ā means the complement of A, if we know the probability of one of them, say P(A), the probability of the other is obtained by subtracting it from 1 (the probability of the certain event): P(Ā) = 1 − P(A). This rule is a direct consequence of the fact that the sum of cases favorable to A and Ā is the total number of possible cases.

The event 'any face' is the sure event in the die-throwing experiment:

   P(any face) = 6 × 1/6 = 1 .

The probability reaches its highest value for the sure event, namely 100% certainty. From the rule about the complement, one concludes that the probability of the complement of the sure event is 1 − P(sure event) = 1 − 1 = 0, the minimum probability value. Such an event is called the impossible event: in the case of the die, the event 'no face turns up'.

In the same way,

   P(even face) = P(2) + P(4) + P(6) = 1/6 + 1/6 + 1/6 = 1/2 .

Let us now consider the event 'even face and less than or equal to 4', composed of events 'even face' and 'face smaller than or equal to 4'. Note the word 'and' indicating event composition (or intersection). The probability of this event is obtained by the product of the probabilities of the individual events (rule to be revised later):

   P(even face and less than or equal to 4) = P(even face) × P(face less than or equal to 4) = 1/2 × 2/3 = 1/3 .

As a matter of fact there are two possibilities out of six of obtaining a face that is both even and less than or equal to 4 in value, the two possibilities forming the set {2, 4}.

These rules are easily visualized by the simple ploy of drawing a rectangle corresponding to the set of all elementary events, and representing the latter by specific domains within it. Such drawings are called Venn diagrams (after the nineteenth century English mathematician John Venn). Figure 1.2 shows diagrams corresponding to the previous examples, clearly illustrating the consistency of the classic definition. For instance, the intersection in Fig. 1.2c can be seen as taking away half of certainty (hatched area in the figure) from something that had only two thirds of certainty (gray area in the same figure), thus yielding one third of certainty.

Fig. 1.2. Venn diagrams. (a) Union (gray area) of face 5 with face 6. (b) Complement (gray area) of 'face 5 or face 6'. (c) Intersection (gray hatched area) of 'even face' (hatched area) with 'face value less than or equal to 4' (gray area)

Finally, let us consider one more problem. Suppose we wish to determine the probability of 'even face or less than or equal to 4' turning up. We are thus dealing with the event corresponding to the union of the following two events: 'even face' and 'face smaller than or equal to 4'. (What we discussed previously was the intersection of these events, whereas we are now considering their union.) Applying the addition rule, we have

   P(even face or less than or equal to 4) = P(even face) + P(face less than or equal to 4) = 1/2 + 2/3 = 7/6 > 1 .

We have a surprising and clearly incorrect result, greater than 1! What is going on? Let us take a look at the Venn diagram in Fig. 1.2c. The union of the two events corresponds to all squares that are hatched, gray or simultaneously hatched and gray. There are five. We thus obtained the incorrect result 7/6 instead of 5/6. We now see the cause of the problem: we have counted some of the squares twice, namely those that are simultaneously hatched and gray, i.e., the two squares in the intersection. We must, therefore, modify the addition rule for two events A and B in the following way:

   P(A or B) = P(A) + P(B) − P(A and B) .

Let us recalculate in the above example:

   P(even face or less than or equal to 4) = P(even face) + P(face less than or equal to 4) − P(even face and less than or equal to 4) = 1/2 + 2/3 − 1/3 = 5/6 .

In the first example of a union of events, when we added the probabilities of elementary events, we did not have to subtract the probability of the intersection because there was no intersection: the intersection was empty, being the impossible event, and thus had null probability. Whenever two events have an empty intersection they are called disjoint and the probability of their union is the sum of the probabilities. For instance,

   P(even or odd face) = P(even face) + P(odd face) = 1 .

In summary, the classic definition of probability affords the means to measure the 'degree of certainty' of a random phenomenon on a convenient scale from 0 to 1 (or from 0 to 100% certainty), obeying the following rules in accordance with common sense:

   • P(sure event) = 1 ;
   • P(impossible event) = 0 ;
   • P(complement event of A) = 1 − P(A) ;
   • P(A or B) = P(A) + P(B) − P(A and B) ;
   • P(A and B) = P(A) × P(B) (rule to be revised later).

1.4 Testing the Rules

There is nothing special about dice games, except that they may originally have inspired the study of random phenomena. In fact, the hardest concepts of randomness can be studied using coin-tossing experiments. As we shall see later, everything we know about chance can be explained by reference to coin tossing! We will often talk about tossing coins in this book. It is difficult to believe that there should be anything more to say about coin tossing. So let us test our ability to apply the rules of probability to other random phenomena.

The first question is: what is the probability of heads (or tails) turning up when (fairly) tossing a (fair) coin in the air? Since both events are equally probable, we have P(heads) = P(tails) = 1/2 (50% probability).

Let us now consider a more mundane situation: the subject of families with two children. Assuming that the probability of a boy being born is the same as that of a girl being born, what is the probability of a family with two children having at least one boy? Denoting the birth of a boy by M and the birth of a girl by F, there are 4 elementary events:

   {MM, MF, FM, FF} ,

where MF means that first a boy is born, followed later by a girl, and similarly for the other cases. Only the following amongst them favor the event that interests us here: MM, MF and FM. In fact, we can look at these events as if they were event intersections, since we are dealing with compound experiments. For instance, the event MF can be seen as the intersection of the events 'first a boy is born' and 'second a girl is born'; MF is called a compound event. Applying the intersection rule, we get P(MF) = P(M) × P(F) = 1/2 × 1/2 = 1/4, and each of these elementary events has probability 1/4. The probability we wish to determine is then

   P(at least one boy) = P(MM) + P(MF) + P(FM) = 3/4 .

A common mistake in probability computation consists in not correctly representing the set of elementary events. For instance, take the following argument. For a family with two children there are three possibilities: both boys, both girls, one boy and one girl. Only two of these three possibilities correspond to having at least one boy. Therefore, if the probability of a boy being born is the same as that for a girl, the probability is 2/3. The mistake here consists precisely in not taking into account the fact that 'one boy and one girl' corresponds to two equiprobable elementary events: MF and FM.

Let us return to the dice. Suppose we throw two dice. What is the probability that the sum of the faces is 6? Once again, we are dealing with compound experiments. Let us name the dice as die 1 and die 2. For any face of die 1 that turns up, any of the six faces of die 2 may turn up. There are therefore 6 × 6 = 36 possible and equally probable outcomes (assuming fair dice and throwing). Only the following favor the event that interests us here, viz., sum equal to 6: {15, 24, 33, 42, 51} (where 15 means die 1 gives 1 and die 2 gives 5 and likewise for the other cases). Therefore,

   P(sum of the faces is 6) = 5/36 .

The number of throws needed to obtain a six with one die or two sixes with two dice was a subject that interested the previously cited Chevalier de Méré. The latter betted that a six would turn up with one die in four throws and that two sixes would turn up with two dice in 24 throws. He noted, however, that he constantly lost when applying this strategy with two dice and asked Pascal to help him to explain why. Let us then see the explanation of the De Méré paradox, which constitutes a good illustration of the complement rule.

The probability of not obtaining a 6 (or any other previously chosen face for that matter) in a one-die throw is

   P(no 6) = 1 − 1/6 = 5/6 .

Let us list the set of elementary events corresponding to the four throws:

   {1111, 1112, . . . , 1116, 1121, . . . , 1126, . . . , 1161, . . . , 1166, . . . , 6666} .

It is a set with a considerable number of elements (1296, since 6 × 6 × 6 × 6 = 1296). To count the events where a 6 appears at least once is not an easy task. We should have to count the events where only one six turns up (e.g., event 1126), as well as the events where two, three or four sixes appear. However, thinking in terms of the complement, the probability of not obtaining at least one 6 in four throws is given by the intersection rule:

   P(no 6 in 4 throws) = P(no 6 in first throw) × P(no 6 in second throw) × P(no 6 in third throw) × P(no 6 in fourth throw) = 5/6 × 5/6 × 5/6 × 5/6 = (5/6)^4 .

Hence, by the complement rule,

   P(at least one 6 in 4 throws) = 1 − (5/6)^4 = 0.51775 .

We thus conclude that the probability of obtaining at least one 6 in four throws is higher than 50%, justifying a bet on this event as a good tactic.

Fig. 1.3. Putting the laws of chance to the test

We saw above how, by thinking in terms of the complement of the event we are interested in, we were able to avoid the difficulties inherent in



such counting problems. Let us apply this same technique to the two dice:

   P(no double 6 with two dice) = 1 − 1/36 = 35/36 ,

   P(no double 6 in 24 throws) = (35/36)^24 ,

   P(at least one double 6 in 24 throws) = 1 − (35/36)^24 = 0.49141 .

We obtain a probability lower than 50%: betting on a double 6 in 24 two-die throws is not a good policy. One can thus understand the losses incurred by the Chevalier de Méré!
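Both of de Méré's bets can be checked numerically. The sketch below is our illustration, not from the book: it computes the two exact probabilities with the complement rule, then confirms the first one by simulation (the seed and the number of simulated games are arbitrary choices):

```python
import random

# Exact values via the complement rule.
p_bet1 = 1 - (5 / 6) ** 4      # at least one 6 in 4 throws of one die, ≈ 0.5177
p_bet2 = 1 - (35 / 36) ** 24   # at least one double 6 in 24 throws, ≈ 0.4914

print(round(p_bet1, 4), round(p_bet2, 4))   # 0.5177 0.4914

# Quick Monte Carlo check of the first bet (fixed seed for repeatability).
rng = random.Random(1)
n = 100_000
wins = sum(
    any(rng.randint(1, 6) == 6 for _ in range(4))   # did a 6 turn up in 4 throws?
    for _ in range(n)
)
print(round(wins / n, 3))   # hovers near the exact value 0.5177
```

The simulated frequency wanders around the exact probability, anticipating the frequency notion of probability discussed next.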

1.5 The Frequency Notion of Probability

The notions and rules introduced in the previous sections assumed fair dice, fair coins, equal probability of a boy or girl being born, and so on. That is, they assumed speciﬁc probability values (in fact, equiprobability) for the elementary events. One may wonder whether such values can be reasonably assumed in normal practice. For instance, a fair coin assumes a symmetrical mass distribution around its center of gravity. In normal practice it is impossible to guarantee such a distribution. As to the fair throw, even without assuming a professional gambler gifted with a deliberately introduced ﬂick of the wrist, it is known that some faulty handling can occur in practice, e.g., the coin sticking to a hand that is not perfectly clean, a coin rolling only along one direction, and so on. In reality it seems that the tossing of a coin in the air is usually biased, even for a fair coin that is perfectly symmetrical and balanced. As a matter of fact, the American mathematician Persi Diaconis and his collaborators have shown, in a paper published in 2004, that a fair tossing of a coin can only take place when the impulse (given with the ﬁnger) is exactly applied on a diameter line of the coin. If there is a deviation from this position there is also a probability higher than 50% that the face turned up by the coin is the same as (or the opposite of) the one shown initially. For these reasons, instead of assuming more or less idealized conditions, one would



like, sometimes, to deal with true values of probabilities of elementary events. The French mathematician Abraham De Moivre (1667–1754), in his work The Doctrine of Chances: A Method of Calculating the Probabilities of Events in Play, published in London in 1718, introduced the frequency or frequentist theory of probability. Suppose we wish to determine the probability of a 5 turning up when we throw a die in the air. Then we can do the following: repeat the die-throwing experiment a certain number of times denoted by n, let k be the number of times that face 5 turns up, and compute the ratio

   f = k/n .

The value of f is known as the frequency of occurrence of the relevant event in n repetitions of the experiment, in this case, the die-throwing experiment. Let us return to the birth of a boy or a girl. There are natality statistics in every country, from which one can obtain the birth frequencies of boys and girls. For instance, in Portugal, out of 113 384 births occurring in 1998, 58 506 were of male sex. Therefore, the male frequency of birth was f = 58 506/113 384 = 0.516. The female frequency of birth was the complement, viz., 1 − 0.516 = 0.484. One can use the frequency of an event as an empirical probability measure, a counterpart of the ideal situation of the classic deﬁnition. If the event always occurs (sure event), then k is equal to n and f = 1. In the opposite situation, if the event never occurs (impossible event), k is equal to 0 and f = 0. Thus, the frequency varies over the same range as the probability. Let us now take an event such as ‘face 5 turns up’ when throwing a die. As we increase n, we verify that the frequency f tends to wander with decreasing amplitude around a certain value. Figure 1.4a shows how the frequencies of face 5 and face 6 might evolve with n in two distinct series of 2 000 throws of a fair die. Some wanderings are clearly visible, each time with smaller amplitude as n increases, reaching a reasonable level of stability near the theoretical value of 1/6. Figure 1.4b shows the evolution of the frequency for a new series of 2 000 throws, now relative to faces 5 or 6 turning up. Note how the stabilization takes place near the theoretical value of 1/3. Frequency does therefore exhibit the additivity required for a probability measure, for suﬃciently large n.
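The wandering and stabilization of f just described is easy to reproduce with a simulated die. The sketch below is ours (the seed is an arbitrary choice, fixed only to make the run repeatable):

```python
import random

rng = random.Random(2024)   # fixed seed: an assumption of ours, for repeatability

def frequency_of_face5(n):
    """Compute f = k/n, where k counts how often face 5 turns up in n throws."""
    k = sum(1 for _ in range(n) if rng.randint(1, 6) == 5)
    return k / n

for n in (100, 2_000, 100_000):
    print(n, round(frequency_of_face5(n), 3))
# The frequency wanders around the theoretical value 1/6 ≈ 0.167,
# with smaller and smaller amplitude as n grows.
```

Plotting f against n for two independent runs would reproduce curves like those of Fig. 1.4.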

1.5 The Frequency Notion of Probability


Fig. 1.4. Possible frequency curves in die throwing. (a) Face 5 and face 6. (b) Face 5 or 6

Let us assume that for a sufficiently high value of n, say 2 000 throws, we obtain f = 0.16 for face 5. We then assign this value to the probability of face 5 turning up. This is the value that we evaluate empirically (or estimate) for P(face 5) based on the 2 000 throws. Computing the probabilities for the other faces in the same way, let us assume that we obtain:

P(face 1) = 0.17 ,   P(face 2) = 0.175 ,   P(face 3) = 0.17 ,
P(face 4) = 0.165 ,   P(face 5) = 0.16 ,   P(face 6) = 0.16 .

The elementary events are no longer equiprobable. The die is not fair (it is a biased die) and has the tendency to turn up low values more frequently than high values. We also note that, since n is the sum of the counts k corresponding to the frequencies of all the faces, the sum of the empirical probabilities is 1. Figure 1.5 shows the probability function of this biased die. Using this function we can apply the same basic rules as before. For example,

P(face higher than 4) = P(5) + P(6) = 0.16 + 0.16 = 0.32 ,
P(face lower than or equal to 4) = 1 − P(face higher than 4) = 1 − 0.32 = 0.68 ,

P(even face) = P(2) + P(4) + P(6) = 0.175 + 0.165 + 0.16 = 0.5 ,
P(even face and lower than or equal to 4) = 0.5 × 0.68 = 0.34 .
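The biased-die rules can be replayed directly from the empirical probability function. A minimal sketch (the dictionary encoding is mine; the values are those quoted in the text):

```python
# Empirical probability function of the biased die (values from the text).
p = {1: 0.17, 2: 0.175, 3: 0.17, 4: 0.165, 5: 0.16, 6: 0.16}

p_higher_than_4 = p[5] + p[6]
p_at_most_4 = 1 - p_higher_than_4
p_even = p[2] + p[4] + p[6]

print(round(sum(p.values()), 3))   # the empirical probabilities sum to 1
print(round(p_higher_than_4, 3), round(p_at_most_4, 3), round(p_even, 3))
```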

Fig. 1.5. Probability function of a biased die

In the childbirth problem, frequencies of occurrence of the two sexes obtained from sufficiently large birth statistics can also be used as a probability measure. Assuming that the probabilities of boys and girls born in Portugal are estimated by the previously mentioned frequencies – P(boy) = 0.516, P(girl) = 0.484 – then, for the problem of families with two children presented before, we have for Portuguese families

P(at least one boy) = P(MM) + P(MF) + P(FM)
= 0.516 × 0.516 + 0.516 × 0.484 + 0.484 × 0.516 = 0.766 .

Some questions concerning the frequency interpretation of probabilities will be dealt with later. For instance, when do we consider n sufficiently large? For the time being, it is certainly gratifying to know that we can estimate degrees of certainty of many random phenomena. We make an exception for those phenomena that do not refer to constant conditions (e.g., the probability that Chelsea beats Manchester United) or those that are of a subjective nature (e.g., the probability that I will feel happy tomorrow).
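The two-children computation can be checked numerically; summing the MM, MF and FM cases agrees with the shortcut 1 − P(two girls). A quick sketch using the frequencies quoted in the text:

```python
p_boy, p_girl = 0.516, 0.484   # Portuguese birth frequencies from the text

p_at_least_one_boy = p_boy * p_boy + p_boy * p_girl + p_girl * p_boy
print(round(p_at_least_one_boy, 3))    # 0.766
print(round(1 - p_girl * p_girl, 3))   # same result via the complement
```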

1.6 Counting Techniques

Up to now we have computed probabilities for relatively simple events and experiments. We now proceed to a more complicated problem: computing the probability of winning the first prize of the football pools known as toto in many European countries. A toto entry consists of a list of 13 football matches for the coming week. Pool entrants have to select the result of each match, whether it will be a home win, an away win, or neither of these, typically by marking a cross over one of the three possibilities. In this case there is a single favorable event (the right template of crosses) in an event set that we can imagine

to have quite a large number of possible events (templates). But how large is this set?

To be able to answer such questions we need to know some counting techniques. Abraham de Moivre was the first to present a consistent description of such techniques in his book mentioned above. Let us consider how this works for toto. Each template can be seen as a list (or sequence) of 13 symbols, as many as there are matches, and each symbol can have one of three values: home win, home loss (away win) and draw, which we will denote, respectively, by W (win), L (loss) and D (draw). A possible template is thus WWLDDDWDDDLLW. How many such templates are there? First, assume that the toto has only one match. One would then have only three distinct templates. Now let us increase the toto to two matches. Then, for each of the symbols of the first match, there are three different templates corresponding to the symbols of the second match. There are therefore 3 × 3 = 9 templates. Adding one more match to the toto, the next symbol can be combined with any of the previous templates, whence we obtain 3 × 3 × 3 templates. This line of thought can be repeated until one reaches the 13 toto matches, which then correspond to 3¹³ = 1 594 323 templates. This is a rather large number. In general, the number of lists of n symbols from a set of k symbols is given by kⁿ. Only one of the possible sequences is the right one. The probability of winning the toto by chance alone is 1/1 594 323 = 0.000 000 627.

We now consider the problem of randomly inserting 5 different letters into 5 envelopes, with no visible clues (sender and addressee). Each envelope is assumed to correspond to one and only one of the letters. What is the probability that all envelopes receive the right letter? Let us name the letters by numbers 1, 2, 3, 4, 5, and let us imagine that the envelopes are placed in a row, as shown in Fig. 1.6. Figure 1.6 shows one possible letter sequence, 12345. Another is 12354. The question is: how many distinct sequences are there? Take the leftmost position. There are 5 distinct possibilities for the first letter. Consider one of them, say 3. Now, letter 3 is no longer available for the four rightmost positions.

Fig. 1.6. Letters and envelopes
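Both counting results discussed in this section, the kⁿ template lists and the n! letter arrangements, are one-liners to verify (a sketch):

```python
import math

# 13 matches, each with 3 possible symbols (W, L, D): k**n lists.
n_templates = 3 ** 13
print(n_templates)        # 1594323
print(1 / n_templates)    # ≈ 6.27e-07

# Letters in envelopes: one favorable arrangement out of n! permutations.
for n in (3, 5, 10, 15):
    print(n, math.factorial(n), 1 / math.factorial(n))
```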

(Note the difference in relation to the toto case: there, the fact that a certain symbol was chosen as the first element of the list did not imply, as here, that it could not be used again in other list positions.) Let us then choose one of the four remaining letters for the second position. So for the first position we had 5 alternatives, and for the second position we had 4. Only three letters now remain for the last three positions. This line of thought is easily extended until we reach a total number of alternatives given by

5 × 4 × 3 × 2 × 1 = 120 .

Each of the possible alternatives is called a permutation of the 5 symbols. Here are all the permutations of three symbols:

123   132   213   231   312   321 .

We have 3 × 2 × 1 = 6 permutations. Note how the total number of permutations grows dramatically when one increases from 3 to 5 symbols. In general, the number of ordered sequences (permutations) of n symbols is given by n × (n − 1) × (n − 2) × … × 1. This function of n is called the factorial function and written n! (read n factorial). For example, for n = 10 there are 10! = 3 628 800 different permutations of the letters.

Table 1.1 shows the fast decrease of the probability P of matching n letters, in correspondence with the fast increase of n!. Whereas with 3 letters the probability of a correct match is around 17%, for 5 letters it is only 8 in one thousand. If we tried to match the envelopes, aligning each permutation at a constant rate of one per second, we should have to wait 42 days in the worst case (trying all 10! permutations). For 5 letters, say, 2 minutes would be enough, but for 15 letters, 41 466 years would be needed!

Table 1.1. Probability P of matching n letters

n    1   2     3      4       5        10             15
P    1   0.5   0.17   0.042   0.0083   0.000 000 28   0.000 000 000 000 78

Finally, let us consider tossing a coin 5 times in a row. The question now is: what is the probability of heads turning up 3 times? Denote heads and tails by 1 and 0, respectively. Assume that for each heads–tails configuration we take note of the positions in which heads occurs, as in the following examples:

Sequence   Heads positions
10011      1, 4, 5
01101      2, 3, 5
10110      1, 3, 4

Taking into account the set of all 5 possible positions, {1, 2, 3, 4, 5}, we thus intend to count the number of distinct subsets that can be formed using only 3 of those 5 symbols. For this purpose, we start by assuming that the list of all 120 permutations of 5 symbols is available and select as subsets the first 3 permutation symbols, as illustrated in the diagram on the right. When we consider a certain subset, say {1, 2, 3}, we are not concerned with the order in which the symbols appear in the permutations. It is all the same whether it is 123, 132, 213, 231, or any other of the 3! = 6 permutations. On the other hand, for each subset of the first three symbols, there are repetitions corresponding to the permutations of the remaining symbols. For instance, 123 is repeated twice, in correspondence with the 2! = 2 permutations 45 and 54 of the remaining symbols. The conclusion is that

number of subsets of 3 out of 5 symbols = 5!/(3! × (5 − 3)!) = 120/(6 × 2) = 10 .

In general, the number of distinct subsets of k symbols out of a set of n symbols (the order does not matter), denoted C(n, k) and called the number of combinations of n, taking k at a time, is given by

C(n, k) = n!/(k! × (n − k)!) .

Let us go back to the problem of the coin tossed 5 times in a row, using a fair coin with P(1) = P(0) = 1/2. The probability that heads turns up three times is (1/2)³. Likewise, the probability that tails turns up twice

is (1/2)². Therefore, the probability of a given sequence of three heads and two tails is computed as (1/2)³ × (1/2)² = (1/2)⁵. Since there are C(5, 3) = 10 distinct sequences in those conditions, we have (disjoint events)

P(3 heads in 5 throws) = C(5, 3) × (1/2)⁵ = 10/32 = 0.31 .

Figure 1.7a shows the probability function for the several k values. Note that for k = 0 (no heads, i.e., the sequence 00000) and for k = 5 (all heads, i.e., the sequence 11111) the probability is far lower (0.031) than when a number of heads near half of the throws turns up (2 or 3). Let us suppose that the coin was biased, with a probability of 0.65 of turning up heads (and, therefore, 0.35 of turning up tails). We have

P(3 heads in 5 throws) = C(5, 3) × (0.65)³ × (0.35)² = 0.336 .

For this case, as expected, sequences with more heads are favored when compared with those with more tails, as shown in Fig. 1.7b. For instance, the sequence 00000 has probability 0.005, whereas the sequence 11111 has probability 0.116.

Fig. 1.7. Probability of obtaining k heads in 5 throws of a fair coin (a) and a biased coin (b)

1.7 Games of Chance

Games of the kind played in the casino – roulette, craps, several card games and, in more recent times, slot machines and the like – together with any kind of national lottery are commonly called games of chance.
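Returning to the coin computations above, both the fair and the biased case follow the same binomial pattern; a sketch using Python's math.comb (the helper name is mine):

```python
from math import comb

def p_k_heads(k, n, p):
    """Probability of exactly k heads in n independent coin throws."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

print(p_k_heads(3, 5, 0.5))    # fair coin: 10/32 = 0.3125
print(p_k_heads(3, 5, 0.65))   # biased coin: ≈ 0.336
print(p_k_heads(0, 5, 0.65))   # sequence 00000: ≈ 0.005
```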

All these games involve a chance phenomenon that is only very rarely favorable to the player. Maybe this is the reason why in some European languages, Portuguese and Spanish, for example, these games are called games of bad luck, using for 'bad luck' a word that translates literally to the English word 'hazard'. This word comes from the Arabic az-zahar, meaning dice game, which in its turn comes from the Persian az zar, meaning risk or danger. The French also use 'jeux de hasard' to refer to games of chance. Now that we have learned the basic rules for operating with probabilities, as well as some counting techniques, we will be able to understand the bad luck involved in games of chance in an objective manner.

1.7.1 Toto

We have already seen that the probability of a full template hit in the football toto is 0.000 000 627. In this example and the following, the reader must take into account the fact that the computed probabilities correspond to idealized models of reality. This is not usually entirely true: in the case of toto, we are assuming that the choice of template is made entirely at random, with all 1 594 323 templates being equally likely. In order to determine the probability of other prizes corresponding to 11 or 12 correct hits, one only has to determine the number of different ways in which one can obtain the 11 or 12 correct hits out of the 13. This amounts to counting subsets of 11 or 12 symbols (the positions of the hits) of the 13 available positions. We have seen that such counts are given by C(13, 12) and C(13, 11), respectively. Hence,

probability of 13 hits = (1/3)¹³ = 0.000 000 627 ,
probability of 12 hits = C(13, 12) × (1/3)¹³ = 0.000 008 15 ,
probability of 11 hits = C(13, 11) × (1/3)¹³ = 0.000 048 92 .

As a comparison, the probability of a first toto prize is about twice as big as the probability of a correct random insertion of 10 letters in their respective envelopes.
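These toto prize probabilities can be reproduced directly, following the same counting as the text (a sketch):

```python
from math import comb

p13 = (1 / 3) ** 13          # full template: ≈ 6.27e-07

print(p13)
print(comb(13, 12) * p13)    # 12 hits: ≈ 8.15e-06
print(comb(13, 11) * p13)    # 11 hits: ≈ 4.89e-05
```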

1.7.2 Lotto

The lotto or lottery is a popular form of gambling that often corresponds to betting on a subset of numbers. Let us consider the lotto that runs in several European countries, consisting of betting on a subset of six numbers out of 49 (from 1 through 49). We already know how to compute the number of distinct subsets of 6 out of 49 numbers:

C(49, 6) = 49!/(6! × 43!) = 13 983 816 .

Therefore, the probability of winning the first prize is

P(first prize: 6 hits) = 1/13 983 816 = 0.000 000 071 5 ,

that is, smaller than the probability of a first prize in the toto. Let us now compute the probability of the third prize, corresponding to a hit in 5 numbers. There are C(6, 5) = 6 different ways of selecting five numbers out of six, and the sixth number of the bet can be any of the remaining 43 non-winning numbers. Thus, the probability of winning the third prize is 6 × 43 = 258 times greater than the probability of winning the first prize:

P(third prize: 5 hits) = 258 × 0.000 000 071 5 = 0.000 018 4 .

The second prize corresponds to five hits plus a hit on a supplementary number, which can be any of the remaining 43 numbers. Therefore, the probability of winning the second prize is 43 times smaller than that of the third prize:

P(second prize: 5 hits + 1 supplementary hit) = 6 × 0.000 000 071 5 = 0.000 000 429 .

In the same way one can compute

P(fourth prize: 4 hits) = C(6, 4) × C(43, 2) × 0.000 000 071 5 = 15 × 903 × 0.000 000 071 5 = 0.000 969 ,
P(fifth prize: 3 hits) = C(6, 3) × C(43, 3) × 0.000 000 071 5 = 20 × 12 341 × 0.000 000 071 5 = 0.017 65 .
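The lotto prize probabilities can likewise be checked with math.comb (a sketch; the variable names are mine):

```python
from math import comb

p1 = 1 / comb(49, 6)                 # first prize (6 hits)
p2 = comb(6, 5) * p1                 # second prize (5 hits + supplementary)
p3 = comb(6, 5) * 43 * p1            # third prize (5 hits)
p4 = comb(6, 4) * comb(43, 2) * p1   # fourth prize (4 hits)
p5 = comb(6, 3) * comb(43, 3) * p1   # fifth prize (3 hits)

print(comb(49, 6))                   # 13983816
for p in (p1, p2, p3, p4, p5):
    print(p)
```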

1.7.3 Poker

Poker is played with a 52 card deck (13 cards of 4 suits). The game involves a betting system that guides the action of the players (card substitution, staying in the game or passing, and so on) according to a certain strategy depending on the hand of cards and the behavior of the other players. It is not, therefore, a game that depends only on chance. Here, we will simply analyze the chance element, computing the probabilities of obtaining specific hands when the cards are dealt at the beginning of the game. A player's hand is composed of 5 cards. First of all, note that there are

C(52, 5) = 52!/(5! × 47!) = 2 598 960 different 5-card hands.

Let us now consider the hand that is the most difficult to obtain, the royal flush, consisting of an ace, king, queen, jack and ten, all of the same suit. Since there are four suits, the probability of randomly drawing a royal flush (or any previously specified 5-card sequence, for that matter) is given by

P(royal flush) = 4/2 598 960 = 0.000 001 5 ,

almost six times smaller than the probability of winning a second prize in the toto. Let us now look at the probability of obtaining the hand of lowest value, containing a simple pair of cards of the same value. We denote the hand by AABCD, where AA corresponds to the equal-value pair of cards. Think, for instance, of a pair of aces that can be chosen out of the 4 aces in C(4, 2) = 6 different ways. As there are 13 distinct values, the pair of cards can be chosen in 6 × 13 = 78 different ways. The remaining three cards of the hand will be drawn out of the remaining 12 values, with each card being of any of the 4 suits. There are C(12, 3) combinations

of the values of the three cards. Since each of them can come from any of the 4 suits, we have C(12, 3) × 4 × 4 × 4 = 14 080 different possibilities. Thus,

P(pair) = 78 × 14 080/2 598 960 = 0.423 .

That is, a pair is almost as probable as getting heads (or tails) when tossing a coin. Using the same counting techniques, one can work out the list of probabilities for all interesting poker hands as follows:

Pair              AABCD                                          0.423
Two pairs         AABBC                                          0.048
Three-of-a-kind   AAABC                                          0.021
Straight          5 successive values                            0.0039
Flush             5 cards of the same suit                       0.0020
Full house        AAABB                                          0.0014
Four-of-a-kind    AAAAB                                          0.000 24
Straight flush    straight + flush                               0.000 015
Royal flush       ace, king, queen, jack, ten of the same suit   0.000 001 5

1.7.4 Slot Machines

The first slot machines (invented in the USA in 1897) had 3 wheels, each with 10 symbols. The wheels were set turning until they stopped in a certain sequence of symbols. The number of possible sequences was therefore 10 × 10 × 10 = 1000. Thus, the probability of obtaining a specific winning sequence (the jackpot) was 0.001. Later on, the number of symbols was raised to 20, and afterwards to 22 (in 1970), corresponding, in this last case, to a 0.000 093 914 probability of winning the jackpot. The number of wheels was also increased and slot machines with repeated symbols made their appearance. Everything was done to make it more difficult (or even impossible!) to obtain certain prizes. Let us look at the slot machine shown in Fig. 1.8, which as a matter of fact was actually built and came on the market with the following list of prizes (in dollars):

Pair (of jacks or better)   $ 0.05
Two pairs                   $ 0.10
Three-of-a-kind             $ 0.15
Straight                    $ 0.25
Flush                       $ 0.30
Full house                  $ 0.50
Four-of-a-kind              $ 1.00
Straight flush              $ 2.00
Royal flush                 $ 50

Fig. 1.8. Wheels of a particularly infamous slot machine

The configuration of the five wheels is such that the royal flush can never show up, although the player may have the illusion when looking at the spinning wheels that it is actually possible. Even the straight flush only occurs for one single combination: 7, 8, 9, 10 and jack of diamonds. The number of possible combinations is 15⁵ = 759 375. The number of favorable combinations for each winning sequence and the respective probabilities are computed as:

Pair (of jacks or better)   68 612   0.090 353
Two pairs                   36 978   0.048 695
Three-of-a-kind             16 804   0.022 129
Straight                     2 396   0.003 155
Flush                        1 715   0.002 258
Full house                     506   0.000 666
Four-of-a-kind                  60   0.000 079
Straight flush                   1   0.000 001
Royal flush                      0   0

We observe that the probabilities of these 'poker' sequences for the pair and the high-valued sequences are far lower than their true poker counterparts. The present slot machines use electronic circuitry, with the final position of each wheel given by a randomly generated number. Prizes and fees are arranged in such a way that in a long sequence of bets the player will lose, while maintaining the illusion that good luck really does lie just around the corner!

1.7.5 Roulette

The roulette consists of a sort of dish having on its rim 37 or 38 hemispherical pockets. A ball is thrown into the spinning dish and moves around until it finally settles in one of the pockets and the dish (the roulette) stops. The construction and the way of operating the roulette are presumed to be such that each throw can be considered as a random phenomenon with equiprobable outcomes for all pockets. These are numbered from 0: in European roulette, the numbering goes from 0 to 36, while American roulette has an extra pocket marked 00. The probabilities of 'bad luck' are similar.

In the simplest bet the player wins if the bet is on the number that turns up by spinning the roulette. Thus, in European roulette the probability of hitting a certain number is 1/37 = 0.027 (better than getting three-of-a-kind at poker), which does not look so bad, especially when one realises that the player can place bets on several numbers (as well as on even or odd numbers, numbers marked red or black, and so on), with the exception of 0 or 00, which are numbers reserved for the house. However, we shall see later that even in 'games of bad luck' whose elementary probabilities are not that 'bad', prizes and fees are arranged in such a way that in a long sequence of bets the player will lose.
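Both the true poker probabilities and the slot machine's 'poker' probabilities can be reproduced by direct counting. A sketch; the favorable slot counts are those quoted in the text, and the dictionary labels are mine:

```python
from math import comb

# True poker: 5-card hands from a 52-card deck.
total_hands = comb(52, 5)                             # 2 598 960
p_royal_flush = 4 / total_hands                       # one royal flush per suit
pair_hands = 13 * comb(4, 2) * comb(12, 3) * 4 ** 3   # AABCD hands
print(total_hands, p_royal_flush, pair_hands / total_hands)

# Slot machine: five wheels with 15 symbols each.
total_combos = 15 ** 5                                # 759 375
favorable = {
    "pair (jacks or better)": 68_612,
    "two pairs": 36_978,
    "three-of-a-kind": 16_804,
    "straight": 2_396,
    "flush": 1_715,
    "full house": 506,
    "four-of-a-kind": 60,
    "straight flush": 1,
    "royal flush": 0,
}
for hand, k in favorable.items():
    print(f"{hand}: {k / total_combos:.6f}")
```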

2 Amazing Conditions

2.1 Conditional Events

The results obtained in the last chapter, using the basic rules for operating with probabilities, were in perfect agreement with common sense. From the present chapter onwards, many surprising results will emerge, starting with the inclusion of such a humble ingredient as the assignment of conditions to events.

Consider once again throwing a die and the events 'face less than or equal to 4' and 'even face'. In the last chapter we computed P(face less than or equal to 4) = 2/3 and P(even face) = 1/2. One might ask: what is the probability that an even face turns up if we know that a face less than or equal to 4 has turned up? Previously, when referring to the even-face event, there was no prior condition, or information conditioning that event. Now, there is one condition: we know that 'face less than or equal to 4' has turned up. Thus, from the point of view of the classic definition, out of the four possible elementary events corresponding to 'face less than or equal to 4', viz. {1, 2, 3, 4}, we enumerate those that are even. There are two: face 2 and face 4. Therefore,

P(even face if face less than or equal to 4) = 2/4 = 1/2 .

Looking at Fig. 1.2c of the last chapter, we see that the desired probability is computed by dividing the cases corresponding to the event intersection (hatched gray squares) by the cases corresponding to the condition (gray squares). It is as though we have moved to a new set of elementary events corresponding to compliance with the conditioning event.

From the frequentist point of view, suppose we throw a die a large number of times, say 10 000 times. About two thirds of the times we expect 'face less than or equal to 4' to turn up. Imagine that it turned up 6 505 times. From this number of times, let us say that 'even face' turned up 3 260 times. Then the frequency of 'even face if face less than or equal to 4' would be computed as 3 260/6 505, close to 1/2. In other trials the numbers of occurrences may be different, but the final result, for sufficiently large n, will always be close to 1/2.

Denoting the events of the random experiment simply by A and B, we may write

P(A if B) ≈ k_AB/k_B ,

where k_AB and k_B are, respectively, the number of occurrences of 'A and B' and of 'B' when the random experiment is repeated n times, with n sufficiently large. The symbol ≈ means 'approximately equal to'. It is a well known fact that one may divide both terms of a fraction by the same number without changing its value. Dividing both terms by n, we rewrite

P(A if B) ≈ (k_AB/n)/(k_B/n) ,

justifying, from the frequentist point of view, the so-called conditional probability formula:

P(A if B) = P(A and B)/P(B) .

For the above example, where the random experiment consists of throwing a die, the formula corresponds to dividing 2/6 by 4/6, whence the above result of 1/2. At the same time notice that, for this example, the condition did not influence the probability of 'even face':

P(even face if face less than or equal to 4) = P(even face) .

This equality does not always hold. For instance, it is an easy task to check that P(even face if face less than or equal to 5) = 2/5, which differs from P(even face).

We now look at the problem of determining the probability of 'face less than or equal to 4 if even face'. Applying the above formula we get

P(face less than or equal to 4 if even face)
= P(face less than or equal to 4 and even face)/P(even face)
= (2/6)/(3/6) = 2/3 .

Thus, changing the order of the conditioning event will in general produce different probabilities. Note, however, that the intersection is (always) commutative: P(A and B) = P(B and A). In our example,

P(A and B) = P(A if B)P(B) = (1/2) × (2/3) = 1/3 ,
P(B and A) = P(B if A)P(A) = (2/3) × (1/2) = 1/3 .

From now on we shall simplify notation by omitting the multiplication symbol when no possibility of confusion arises.

2.2 Experimental Conditioning

The literature on probability theory is full of examples in which balls are randomly extracted from urns. Consider an urn containing 3 white balls, 2 red balls and 1 black ball. The probabilities of randomly extracting one specific-color ball are

P(white) = 1/2 ,   P(red) = 1/3 ,   P(black) = 1/6 .

If, whenever we extract a ball, we put it back in the urn – extraction with replacement – these probabilities will remain unchanged in the following extractions. If, on the other hand, we do not put the ball back in the urn – extraction without replacement – the experimental conditions for ball extraction are changed and we obtain new probability values. For instance:

• If a white ball comes out in the first extraction, then in the following extraction P(white) = 2/5, P(red) = 2/5, P(black) = 1/5.
• If a red ball comes out in the first extraction, then in the following extraction P(white) = 3/5, P(red) = 1/5, P(black) = 1/5.

• If a black ball comes out in the first extraction, then in the following extraction P(white) = 3/5, P(red) = 2/5, P(black) = 0.

In conclusion, as a consequence of event conditioning, the conditions under which a random experiment takes place may drastically change the probabilities of the various events involved.

2.3 Independent Events

Let us now consider throwing two dice, one white and the other green, and suppose someone asks the following question: what is the probability that an even face turns up on the white die if an even face turns up on the green die? Obviously, if the dice-throwing is fair, the face that turns up on the green die has no influence whatsoever on the face that turns up on the white die, and vice versa. Applying the preceding rule, we may work this out in detail as

P(even face of white die if even face of green die)
= P(even face of white die and even face of green die)/P(even face of green die)
= (1/2) × (1/2)/(1/2) = 1/2 ,

which is precisely the probability of 'even face of the white die'. As we can see, the condition 'even face of the green die' has no influence on the probability value for the other die. The events 'even face on the white die' and 'even face on the green die' do not interfere; we say that the events are independent. The same can be said about any other pair of events, one of which refers to the white die while the other refers to the green die.

Consider now the question: what is the probability of an even face of the white die turning up if at least one even face turns up (on either of the dice or on both)? We first notice that 'at least one even face turns up' is the complement of 'no even face turns up', or in other words, both faces are odd, which may occur 3 × 3 = 9 times. Therefore,

P(even face of the white die if at least one even face)
= P(even face of the white die and at least one even face)/P(at least one even face)
= (3 × 6/36)/((36 − 9)/36) = 2/3 .
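Both styles of reasoning in this chapter, frequencies and direct enumeration, can be checked in a few lines. A sketch (the seed and variable names are illustrative):

```python
import random
from itertools import product

# (1) Frequentist check of P(even face | face <= 4) on a simulated die.
rng = random.Random(1)
throws = [rng.randint(1, 6) for _ in range(10_000)]
k_b = sum(1 for x in throws if x <= 4)                        # B occurs
k_a_and_b = sum(1 for x in throws if x <= 4 and x % 2 == 0)   # A and B occur
print(k_a_and_b / k_b)                             # close to 1/2

# (2) Exact check of the two-dice results over all 36 outcomes.
outcomes = list(product(range(1, 7), repeat=2))    # (white, green)
even_white = [o for o in outcomes if o[0] % 2 == 0]
at_least_one_even = [o for o in outcomes if o[0] % 2 == 0 or o[1] % 2 == 0]
print(len(at_least_one_even))                      # 36 - 9 = 27
print(len(even_white) / len(at_least_one_even))    # 18/27 = 2/3
```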

We now come to the conclusion that the events 'even face of the white die' and 'at least one even face' are not independent; they are said to be dependent. In the previous example of ball extraction from an urn, the successive extractions are independent if performed with replacement and dependent if performed without replacement.

Let us go back to the conditional probability formula:

P(A if B) = P(A and B)/P(B) ,   or   P(A and B) = P(A if B)P(B) .

If the events are independent, one may replace P(A if B) by P(A) and obtain P(A and B) = P(A)P(B). Consequently, the probability formula presented in Chap. 1 for event intersection is only applicable when the events are independent. For independent events, P(A if B) = P(A) and P(B if A) = P(B). In general, if A, B, . . . , Z are independent,

P(A and B and . . . and Z) = P(A) × P(B) × . . . × P(Z) .

Sometimes an event can be obtained in distinct ways. For instance, in the die-throwing experiment let us stipulate that event A means 'face less than or equal to 5'. This event may occur when either the event B = 'even face' or its complement B̄ = 'odd face' has occurred. In certain circumstances, we have to evaluate, on the basis of conditional probabilities, the probabilities of the intersections P(A and B) and P(A and B̄). We readily compute

P(A and B) = 2/6 ,   P(A and B̄) = 3/6 .

But the union of the disjoint events 'A and B' and 'A and B̄' is obviously just A. Applying the addition rule for disjoint events we obtain, without much surprise,

P(face less than or equal to 5) = P(A and B) + P(A and B̄) = 2/6 + 3/6 = 5/6 .

2.4 A Very Special Reverend

Consider two urns, named X and Y, containing white and black balls. Imagine that X has 4 white and 5 black balls and Y has 3 white and 6 black balls. One of the urns is randomly chosen and

afterwards a ball is randomly drawn out of that urn. What is the probability that a white ball is drawn? Using the preceding line of thought, we compute:

P(white ball) = P(white ball and urn X) + P(white ball and urn Y)
= P(white ball if urn X) × P(urn X) + P(white ball if urn Y) × P(urn Y)
= (4/9) × (1/2) + (3/9) × (1/2) = 7/18 = 0.39 .

In the book entitled The Doctrine of Chances by Abraham de Moivre, mentioned in the last chapter, reference was made to the urn inverse problem, that is: What is the probability that the ball came from urn X if it turned out to be white? The solution to the inverse problem was first discussed in a short document written by the Reverend Thomas Bayes (1702–1761), an English priest interested in philosophical and mathematical problems, entitled Essay Towards Solving a Problem in the Doctrine of Chances, and published posthumously in 1763. In Bayes' work there appears the famous Bayes' theorem, which would so significantly influence the future development of probability theory.

Let us see what Bayes' theorem tells us in the context of the urn problem. Bayes' theorem allows us to write

P(urn X if white ball) = P(urn X) × P(white ball if urn X)/P(white ball) .

Note how the inverse conditioning probability P(urn X if white ball) depends on the direct conditioning probability P(white ball if urn X). In fact, it corresponds simply to dividing the 'urn X' term of the decomposition of P(white ball), expressed in the formula that we knew already, by P(white ball). Applying the concrete values of the example, we obtain

P(urn X if white ball) = (4/9) × (1/2)/(7/18) = 4/7 .

In the same way,

P(urn Y if white ball) = (3/9) × (1/2)/(7/18) = 3/7 .
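Exact fractions make the urn computation transparent; a sketch using Python's fractions module:

```python
from fractions import Fraction

p_x = p_y = Fraction(1, 2)       # either urn is chosen at random
p_white_if_x = Fraction(4, 9)
p_white_if_y = Fraction(3, 9)

p_white = p_white_if_x * p_x + p_white_if_y * p_y
p_x_if_white = p_x * p_white_if_x / p_white   # Bayes' theorem

print(p_white)            # 7/18
print(p_x_if_white)       # 4/7
print(1 - p_x_if_white)   # urn Y: 3/7
```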

Of course, the two probabilities add up to 1, as they should (the ball comes either from urn X or from urn Y). But care should be taken here: the two conditional probabilities for the white ball do not have to add up to 1. In fact, we have

P(white ball if urn X) + P(white ball if urn Y) = 4/9 + 3/9 = 7/9 .

In the event 'urn X if white ball', we may consider that the drawing out of the white ball is the effect corresponding to the 'urn X' cause. Bayes' theorem is often referred to as a theorem about the probabilities of causes (once the effects are known), and written as follows:

P(cause if effect) = P(cause) × P(effect if cause)/P(effect) .

Let us examine a practical example: the detection of breast cancer by means of a mammography. The mammogram (like many other clinical analyses) is not infallible. In reality, long studies have shown that (frequentist estimates)

P(positive mammogram if breast cancer) = 0.9 ,
P(positive mammogram if no breast cancer) = 0.1 .

Let us assume the random selection of a woman in a given population of a country where it is known that (again a frequentist estimate) P(breast cancer) = 0.01. Using Bayes' theorem we then compute:

P(breast cancer if positive mammogram) = 0.01 × 0.9/(0.01 × 0.9 + 0.99 × 0.1) = 0.08 .

It is a probability that we can consider low in terms of clinical diagnosis, raising reasonable doubt about the isolated use of the mammogram (as for many other clinical analyses, for that matter). Let us now assume that the mammogram is prescribed following a medical consultation. The situation is then quite different: the candidate is no longer a randomly selected woman from the population of a country, but a woman from the female population that seeks a specific medical consultation (undoubtedly because she has found that something is not as it should be). Now the probability of the cause (the so-called prevalence) is different. Suppose we have determined (frequentist estimate)

Suppose we have determined (frequentist estimate)

P(breast cancer, in women seeking the specific consultation) = 0.1 .

Using Bayes' theorem, we now obtain

P(breast cancer if positive mammogram) = 0.1 × 0.9 / (0.1 × 0.9 + 0.9 × 0.1) = 0.5 .

Observe how the prevalence of the cause has dramatic consequences when we infer the probability of the cause assuming that a certain effect is verified.

Let us look at another example, namely the use of DNA tests in assessing the innocence or guilt of someone accused of a crime. Imagine that the DNA of the suspect matches the DNA found at the crime scene, and furthermore that the probability of this happening by pure chance is extremely low, say one in a million. That is,

P(DNA match if innocent) = 0.000 001 , P(DNA match if guilty) = 1 ,

which is, in passing, another situation where the sum of the conditional probabilities is not 1. Let us say that, given the circumstances of the crime and the suspect, there are in principle good reasons for presuming innocence: for the suspect to be guilty, one would require a whole set of circumstances that we estimate to occur only once in a hundred thousand times, that is, P(innocent) = 0.999 99. Using Bayes' theorem, we obtain

P(innocent if DNA match) = 0.999 99 × 0.000 001 / (0.999 99 × 0.000 001 + 0.000 01 × 1) ≈ 0.09 .

In spite of the DNA evidence, the presumption of innocence is still considerable (about one case in eleven). In another scenario let us suppose that, in principle, there is no evidence favoring either the innocence or the guilt (no one is innocent or guilty until there is proof), that is, P(innocent) = P(guilty) = 0.5. Then,

P(innocent if DNA match) = 0.5 × 0.000 001 / (0.5 × 0.000 001 + 0.5 × 1) ≈ 0.000 001 .

In this case the DNA evidence is decisive.
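A sketch of the same two computations (the function name is ours; P(match if guilty) is taken as 1, as in the example):

```python
def p_innocent_given_match(p_innocent, p_match_if_innocent=1e-6):
    # P(match | guilty) = 1, as assumed in the example.
    num = p_innocent * p_match_if_innocent
    den = num + (1 - p_innocent) * 1.0
    return num / den

# Strong presumption of innocence: guilt needs a 1-in-100 000 coincidence.
print(p_innocent_given_match(0.99999))  # ≈ 0.09, about one case in eleven
# No presumption either way.
print(p_innocent_given_match(0.5))      # ≈ 0.000 001
```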

2.5 Probabilities in Daily Life

The normal citizen often has an inaccurate perception of probabilities in everyday life. In fact, several studies have shown that the probability notion is one of the hardest to grasp. For instance, if, after flipping a coin that turned up heads, we ask a child what will happen in the following flip, the most probable answer is that the child will opt for tails. The adult will have no difficulty in giving the correct answer to this simple question (that is, either heads or tails will turn up, meaning that both events are equally likely), but if the question gets more complicated an adult will easily fall into the so-called gambler's fallacy.

Fig. 2.1. The risks of entering the P(D if C) clinic

Let us illustrate the fallacy with the situation where a coin has been flipped several times and tails has been turning up each time. Most gamblers (and not only gamblers) then tend to consider that the probability of heads turning up in the following flips will be higher than before. The same applies to other games. An example of this is imagining that one should place a high bet on a certain roulette number that has not come up for a long time. Who has never said something like: 30 is bound to come now, it's time for black to show up, the 6 is hot, and so on? It looks as if one would deem the information from previous outcomes as useful in determining the following outcome; it is as if there were some kind of memory in a sequence of trials. In fact, rigorously

conducted experiments have even shown that the normal citizen has the tendency to believe that there is a sort of 'memory' phenomenon in chance devices. In this spirit, if a coin has been showing tails, why not let it 'rest' for a while before playing again? Or change the coin?

Suppose a coin has produced tails 30 times in a row. We will find it strange and cast some doubts about the coin or the fairness of the player. However, if, instead of being consecutive, the throws are spaced over one month (say, one throw per day), maybe we would not find the event so strange after all, and maybe we would not attribute the 'cause' of the event to the coin's 'memory', but instead conjecture that we are having an 'unlucky month'.

Consider a raffle with 1 000 tickets and 1 000 players, one ticket for each of the 1 000 players, and only one prize. Once all the tickets have been sold, it comes out that John Doe has won the prize. In a new repetition of the raffle John Doe wins again. It looks as if 'luck' is with John Doe. Since the two raffles are independent, the probability of John Doe winning twice consecutively is (1/1000)² = 0.000 001, one in a million. This is exactly the same probability that corresponds to John Doe winning the first time and Richard Roe the second time. Note that it is also the same probability as if John Doe had instead won a raffle in which a million tickets had been sold, which would certainly seem less strange than the double victory in the two 1000-ticket raffles.

Now suppose the raffle is repeated three times. The probability that the same player (John Doe) wins all 3 times is (independent events) (0.001)³ = 0.000 000 001. In fact, any combination of three winning players has this same probability (0.001)³. On the other hand, the probability that someone wins each raffle is obviously 1 (someone has to win). If the same 1 000 players play the three raffles, there are 1 000 distinct possibilities of the same person winning the three raffles. Thus, the probability that someone (whoever he or she may be) wins the raffle three times is not (0.001)³, but at most 1000 × (0.001)³ = 0.000 001. If the players are not always the same, the probability will be smaller. What will amaze the common citizen, then, is not so much that John Doe has a probability of one in a thousand million of winning three times, but the fact that the probability of someone winning in three repetitions is one in a million.

Let us slightly modify our problem, considering an only-one-prize raffle with 10 000 tickets and the participation of 1 000 players, each one buying the same number of tickets. The probability that a given player (say, John Doe) wins is again 0.001 (the thousand players are equally likely to win), resulting in the above probability.
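These raffle figures are quickly verified; a minimal sketch:

```python
p = 1 / 1000  # probability that a given player wins one raffle

# John Doe wins two consecutive raffles (independent events).
p_john_twice = p ** 2             # one in a million
# John Doe wins three raffles in a row.
p_john_thrice = p ** 3            # one in a thousand million
# Someone among the same 1000 players wins all three raffles.
p_someone_thrice = 1000 * p ** 3  # back to one in a million

print(p_john_twice, p_john_thrice, p_someone_thrice)
```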

Let us now take a look at the following questions, which appeared in tests assessing people's ability to understand probabilities:

1. A hat contains 10 red and 10 blue smarties. I pull out 10 and 8 are red. Which am I more likely to get next?
2. A box contains green and yellow buttons in unknown proportions. I take out 10 and 8 are yellow. Which am I more likely to get next?
3. In the last four matches between Mytholmroyd Athletics and Giggleswick United, Mytholmroyd have kicked off first every time, on the toss of a coin. Which team is more likely to kick off next time?

People frequently find it difficult to give the correct answers to such questions.¹ When dealing with conditional events and the application of Bayes' theorem, the difficulty in arriving at the right answer is even greater.

¹ Correct answers are: blue, yellow, either.

Let us start with an apparently easy problem. We randomly draw a card out of a hat containing three cards: one with both faces red, another with both faces blue, and the remaining one with one face red and the other blue. We pull a card out at random and observe that the face showing is red. The question is: what is the probability that the other face will also be red?

In general, the answer one obtains is based on the following line of thought: if the face shown is red, it can only be one of two cards, red–red or red–blue; therefore, the probability that it is red–red is 1/2. This result is wrong, and the mistake in the line of thought has to do with the fact that the set of elementary events has been incorrectly enumerated. Let us denote the faces by R (red) and B (blue), and let us put a mark, say an asterisk, on a face to indicate the face showing: R∗B means that for the red–blue card the R face is showing, and likewise for the other cards. As for the RR card, any of the faces can be turned up: R∗R or RR∗. The set of elementary events is then:

{R∗R, RR∗, R∗B, RB∗, B∗B, BB∗} .

In this enumeration there are three events with a red face showing, namely R∗R, RR∗ and R∗B, and only two of them correspond to the RR card. Hence,

P(RR if R∗) = P(RR and R∗) / P(R∗) = 2/3 .
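A small enumeration of the six elementary events confirms the 2/3 (the tuple encoding is ours):

```python
from fractions import Fraction

# Each card is equally likely, and each of its two faces is equally likely
# to be the one showing: six elementary events in all, as (card, face shown).
events = [("RR", "R"), ("RR", "R"),   # R*R and RR*
          ("RB", "R"), ("RB", "B"),   # R*B and RB*
          ("BB", "B"), ("BB", "B")]   # B*B and BB*

red_showing = [card for card, face in events if face == "R"]
p_rr_if_red = Fraction(red_showing.count("RR"), len(red_showing))
print(p_rr_if_red)  # 2/3
```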
A taxi was involved in a hit-and-run accident during the night. P (RR if R∗ ) = 2 P (RR and R∗ ) = . In this way. A box contains green and yellow buttons in unknown proportions. and likewise for the other cards. the diﬃculty in arriving at the right answer is even greater. to indicate the face showing. I take out 10 and 8 are yellow.

The following problem was presented by two American psychologists. Two taxi firms operate in the city: green and blue. The following facts are known:

• 85% of the taxis are green and 15% are blue.
• One witness identified the taxi as blue.

The court tested the witness' reliability in the same conditions as on the night of the accident and concluded that the witness correctly identified the colors 80% of the time and failed in the other 20%. What is the probability that the hit-and-run taxi was blue? The psychologists found that the typical answer was 80%. Let us see what the correct answer is. We begin by noting that

P(witness is right, if taxi is blue) = 0.8 , P(witness is wrong, if taxi is green) = 0.2 .

Hence, applying Bayes' theorem:

P(taxi is blue if witness is right) = 0.15 × 0.8 / (0.15 × 0.8 + 0.85 × 0.2) ≈ 0.41 .

We see that the desired probability is below 50% and nearly half of the typical answer. The fact that blue taxis are less prevalent than green taxis leads to a drastic decrease in the probability of the witness being right. In this and similar problems the normal citizen tends to neglect the role of prevalences.

2.6 Challenging the Intuition

There are problems on probabilities that lead to apparently counterintuitive results and are even a source of embarrassment for experts and renowned mathematicians.

2.6.1 The Problem of the Sons

Let us go back to the problem, presented in the last chapter, of finding the probability that a family with two children has at least one boy, assuming the equiprobability of boy and girl. The probability is 3/4. Let us now suppose that we have got the information that the randomly selected family has a boy. Our question is: what is the probability that the other child is also a boy?

Using the same notation as in Chap. 1, we are interested in the conditional probability of {MM} when the event {MF, FM, MM} takes place. We have

P({MM} if {MF, FM, MM}) = (1/4) / (3/4) = 1/3 .

This result may look strange, since apparently the fact that we know the sex of one child should not condition the probability relative to the other one. However, let us not forget that there are two distinct situations for boy–girl, namely MF and FM.

We now modify the problem, involving a visit to a random family with two children. Suppose that we knock at the door and a boy shows up. What is the probability that the other child is a boy? It seems that nothing has changed relative to the above problem, but as a matter of fact this is not true: by assuming that we knock at the door and a boy shows up, we have changed the event space. The event space is now

{M∗M, MM∗, M∗F, MF∗, F∗M, FM∗, F∗F, FF∗} ,

where, as in the previous three-card problem, the asterisk denotes the child that opens the door. We are interested in the conditional probability of {M∗M, MM∗} when {M∗M, MM∗, M∗F, FM∗} takes place. The probability is 1/2. We conclude that the simple fact that a boy came to the door has changed the probabilities!

2.6.2 The Problem of the Aces

Consider a 5-card hand, drawn randomly from a 52-card deck, and suppose that an ace is present in the hand. Let us further specify two categories of hands: in one category, the ace is of clubs; in the other, the ace is any of the 4 aces. Suppose that in either category of hands we are interested in knowing the probability that at least one more ace is present in the hand. We are then interested in the following conditional probabilities:

P(two or more aces, if an ace of clubs was drawn) ,
P(two or more aces, if an ace was drawn) .

Are these probabilities equal? The intuitive answer is affirmative: after all, an ace of clubs is an ace, and it would seem to be all the same whether it be of clubs or of something else. With a little bit more thought, we may try to assess the fact that in the second case we are dealing with any of the 4 aces, suggesting a larger number of possible combinations and, as a consequence, a larger probability. But is that really so?

Let us actually compute the probabilities. It turns out that for this problem the conditional probability rule is not of much help. It is preferable to resort to the classic notion and evaluate the ratio of favorable cases over possible cases. Moreover, since the evaluation of favorable cases would amount to counting hand configurations with two, three and four aces, it is worth trying rather to evaluate the unfavorable cases, i.e., the hands with only one ace.

Let us examine the first category of hand, with an ace of clubs. Once we fix the ace of clubs, there remain 51 cards and 4 positions to fill in. There are therefore C(51, 4) possible cases, where C(n, k) denotes the number of ways of choosing k cards out of n. Concerning the unfavorable cases, no other ace may appear in the remaining 4 positions, which gives C(48, 4) unfavorable cases. Hence,

P(two or more aces, if an ace of clubs was drawn) = [C(51, 4) − C(48, 4)] / C(51, 4) = 0.22 .

In the second category – a hand with at least one ace – the number of possible cases is obtained by subtracting from all possible hands those with no ace: C(52, 5) − C(48, 5). Since there are four distinct possibilities for fixing an ace and, once we fix an ace, we get the same number of unfavorable cases as before, we have 4 × C(48, 4) unfavorable cases. Hence,

P(two or more aces, if an ace was drawn) = [C(52, 5) − C(48, 5) − 4 × C(48, 4)] / [C(52, 5) − C(48, 5)] = 0.12 .

We thus obtain a smaller (!) probability than in the preceding situation.
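Both ratios can be reproduced with the standard library's binomial coefficients:

```python
from math import comb

# First category: the ace of clubs is fixed; 4 positions filled from 51 cards.
p_clubs = (comb(51, 4) - comb(48, 4)) / comb(51, 4)

# Second category: hands with at least one ace; subtract the one-ace hands.
possible = comb(52, 5) - comb(48, 5)
unfavorable = 4 * comb(48, 4)
p_any = (possible - unfavorable) / possible

print(round(p_clubs, 2), round(p_any, 2))  # 0.22 0.12
```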

This apparently strange result becomes more plausible if we inspect what happens in a very simple situation: two-card hands from a deck with only 4 cards, the ace and queen of hearts and clubs (see Fig. 2.2). When we fix the ace of clubs, the probability of a second ace is 1/3 (Fig. 2.2a). If, on the other hand, we are allowed any of the two aces, or both, the probability is 1/5 (Fig. 2.2b).

Fig. 2.2. The problem of the aces for a very simple deck with only 4 cards: ace and queen of hearts and clubs

2.6.3 The Problem of the Birthdays

Imagine a party with a certain number of people, say n. The question is: how large must n be so that the probability of finding two people with the same birthday date (month and day) is higher than 50%? At first sight it may seem that n will have to be quite large, perhaps as large as there are days in a year. We shall find that this is not so.

The probability of a person's birthday falling on a certain date is 1/365, ignoring leap years. Since the birthdays of the n people are independent of each other, any list of n people's birthday dates has a probability given by the product of the n individual probabilities (refer to what was said before about the probability of independent events). Thus,

P(list of n birthdays) = (1/365)^n .

What is the probability that two birthdays do not coincide? Imagine the people labeled from 1 to n. We may arbitrarily assign a value to the first person's birthday, with 365 different ways of doing it. Once this birthday has been assigned, there remain 364 possibilities for selecting a date for the second person with no coincidence. Afterwards, 363 possibilities are available for the third person's birthday, and so on, up to the last remaining person, with 365 − n + 1 possibilities.

Therefore,

P(n birthdays without two coincidences) = [365 × 364 × ... × (365 − n + 1)] / 365^n .

It is now clear that P(n birthdays with two coincidences) is the complement of the above probability with respect to 1. Figure 2.3 shows how this probability varies with n. It can be checked that for n = 23 the probability of finding two coincident birthdays is already higher than 0.5: indeed, P(23 birthdays with two coincidences) = 0.5073. Note that, whereas for 10 people the probability is small (0.1169), for 50 people it is quite considerable (0.9704).

Fig. 2.3. Probability of two out of n people having the same birthday

2.6.4 The Problem of the Right Door

The problem of the right door comes from the USA (where it is known as the Monty Hall paradox) and is described as follows. Consider a TV quiz where 3 doors are presented to a contestant and behind one of them is a valuable prize (say, a car). Behind the other two doors, the prizes are of little value. The contestant presents his/her choice of door to the quiz host. The host, knowing what is behind each door, opens one of the other two doors (one not hiding the big prize) and reveals the worthless prize. The contestant is then asked if he/she wants to change the choice of door. The question is: should the contestant change his/her initial choice? When a celebrated columnist of an American newspaper recommended changing the initial choice, stating that the probability of winning with such a strategy was 2/3, the newspaper received several letters from mathematicians indignant over such a recommendation.

We start by numbering the doors from 1 to 3 and assume that door 1 is the contestant's initial choice.

If the contestant does not change his/her initial choice, there is a 1/3 probability of choosing the right door. If, on the other hand, he/she decides to change his/her choice, we have the situations shown in Table 2.1. The contestant wins in two of the three new choices (marked 'wins' in the table).

Table 2.1. Finding the right door

Prize door   Initial choice   Host opens   New choice
Door 1       Door 1           Door 2       Door 3
Door 2       Door 1           Door 3       Door 2 (wins)
Door 3       Door 1           Door 2       Door 3 (wins)

Therefore, the probability of winning when changing the choice doubles to 2/3! This result is so counter-intuitive that even the Hungarian mathematician Paul Erdős (1913–1996), famous for his contributions to number theory, had difficulty accepting the solution. The problem most people have in accepting the right solution is connected with the strongly intuitive and deeply rooted belief that probabilities are something immutable. Note, however, how the information obtained when the host opens one of the doors changes the probability setting and allows the contestant to increase his/her probability of winning.
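The 1/3 versus 2/3 result is also easy to check by simulation; a minimal sketch (the door numbering, trial count and seed are arbitrary choices of ours):

```python
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        prize = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a door that is neither the choice nor the prize.
        opened = next(d for d in range(3) if d != choice and d != prize)
        if switch:
            # Switch to the one remaining closed door.
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == prize)
    return wins / trials

random.seed(1)
print(play(switch=False))  # ≈ 1/3
print(play(switch=True))   # ≈ 2/3
```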

2.6.5 The Problem of Encounters

We saw in the last chapter that there is a 1/n! probability of randomly inserting all n letters in the right envelopes. We also noticed how this probability decreases very quickly with n. Let us now take a look at the event 'no match with any envelope'. At first sight, one could imagine that the probability of this event would also vary radically with n. We shall see that this is not so.

We begin by noticing that the one-match event has probability 1/n, whence the no-match event for one letter has the probability 1 − 1/n. If the events 'no match for letter 1', 'no match for letter 2', ..., 'no match for letter n' were independent, the probability we wish to compute would be (1 − 1/n)^n, which, as n grows, tends to 1/e. The number denoted e, whose value up to the sixth digit is 2.718 28, is called the Napier constant, in honor of the Scottish mathematician John Napier, who introduced logarithmic calculus (1618). However, these events are not independent, since their complements are not: for instance, if n − 1 letters are in the right envelopes, the nth letter has only one position (the right one) to be inserted in.

It so happens that the problem we are discussing appears under several guises in practical issues, and is generally referred to as the problem of encounters. For letters and envelopes it is formulated as follows: supposing that the letters and the envelopes are numbered according to the natural order 1, 2, ..., n, what is the probability that, when inserting letter i, it will sit in its natural position? (We then say that an encounter has taken place.)

Consider all the possibilities of encounters of letters 1 and 3 in a set of 5 letters: 12345, 12354, 14325, 14352, 15324, 15342. We see that for r encounters (r = 2 in the example) there are only n − r letters that may permute (in the example, 3! = 6 permutations). Any of these permutations has probability (n − r)!/n!.

We can compute the probability of no encounter as the complement of the probability of at least one encounter, and this one is computed using a generalization of the event union formula for more than two events. Assuming n = 3 and abbreviating 'encounter' to 'enc', the probability is computed as follows:

P(at least one encounter in 3 letters)
  = P(enc. letter 1) + P(enc. letter 2) + P(enc. letter 3)
  − P(enc. letters 1 and 2) − P(enc. letters 1 and 3) − P(enc. letters 2 and 3)
  + P(enc. letters 1, 2 and 3)
  = C(3,1) × (3 − 1)!/3! − C(3,2) × (3 − 2)!/3! + C(3,3) × (3 − 3)!/3!
  = 3 × 1/3 − 3 × 1/6 + 1 × 1/6 = 2/3 .

Hence,

P(no encounter in 3 letters) = 1 − 2/3 = 1/3 .

Figure 2.4 shows the Venn diagram for this situation. It can be shown that the above calculations can be carried out more easily as follows for n letters:

P(no encounter in n letters) = d(n) / n! ,

where d(n), known as the number of derangements of n elements (permutations without encounters), is given by the recurrence

d(n) = (n − 1) × [d(n − 1) + d(n − 2)] , with d(0) = 1 and d(1) = 0 .

Fig. 2.4. The gray area in (a) is equal to the sum of the areas of the rectangles corresponding to the events 'enc. letter 1', 'enc. letter 2', and 'enc. letter 3', after subtracting the hatched areas in (b) and adding the double-hatched central area

Table 2.2 lists the values of d(n), the probability of no encounter in n letters, denoted P(n), and the value of the function (1 − 1/n)^n.

Table 2.2. Relevant functions for the problem of encounters

n    d(n)       P(n)      (1 − 1/n)^n
2    1          0.5       0.25
3    2          0.333 3   0.296 3
4    9          0.375 0   0.316 4
5    44         0.366 7   0.327 7
6    265        0.368 1   0.334 9
7    1 854      0.367 9   0.339 9
8    14 833     0.367 9   0.343 6
9    133 496    0.367 9   0.346 4

The probability of no encounter in n letters approaches 1/e = 0.3679, as in the incorrect computation where we assumed the independence of the encounters, but it does so in a much faster way. Above 6 letters, the probability of no match in an envelope is practically constant!
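The recurrence and the table are easy to reproduce, together with a brute-force confirmation of the n = 3 case:

```python
from itertools import permutations
from math import factorial

def d(n):
    # Number of derangements: d(n) = (n - 1) * (d(n - 1) + d(n - 2)).
    if n == 0:
        return 1
    if n == 1:
        return 0
    return (n - 1) * (d(n - 1) + d(n - 2))

# Reproduce Table 2.2: d(n) and P(n) = d(n)/n!.
for n in range(2, 10):
    print(n, d(n), round(d(n) / factorial(n), 4))

# Brute-force check for n = 3: permutations with no letter in its own place.
no_enc = [p for p in permutations(range(3)) if all(p[i] != i for i in range(3))]
print(len(no_enc), "out of", 6)  # 2 out of 6, i.e. 1/3
```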


3 Expecting to Win

3.1 Mathematical Expectation

We have made use in the preceding chapters of the frequentist concept of probability. The frequentist interpretation tells us that, if we toss a coin a very large number of times, about half of the times heads will turn up; in other words, on average heads will turn up once every two tosses. Let us assume that, in a sequence of 8 tosses of a coin, we have obtained: heads, tails, tails, heads, heads, heads, tails, heads. Assigning the value 1 to heads and 0 to tails, we may then add up the values of the sequence 10011101. We obtain 5, that is, an average value of 5/8 = 0.625 heads per toss. This average, computed on the 0-1 sequence, corresponds to the frequency of occurrence f = 0.625 for heads.

As an alternative to the tossing sequence in time, we may imagine a large number of people tossing coins in the air: on average, heads would turn up once for every two people. Of course, this theoretical value of 0.5 for the average (or frequency) presupposes that the coin-toss experiment, either in time or for a set of people, is performed a very large number of times. When the number of tosses is small, one expects large deviations from an average value around 0.5. (The same phenomenon was already illustrated in Fig. 1.)

We now ask the reader to imagine playing a card game with someone, using a pack with only the picture cards (king, queen, jack), the ace and the joker of the usual four suits. Let us denote the cards by K, Q, J, A and F, respectively. The 4 × 5 = 20 cards are fairly shuffled and an attendant extracts cards one by one. After extracting and showing

the card, it is put back in the deck, which is then reshuffled. For each drawn picture card the reader wins 1 euro from the opponent, and pays 1 euro to the opponent if an ace is drawn and 2 euros if a joker is drawn. One may raise the question: what would the reader's average gain be in a long sequence of rounds?

Each card value has an equal probability of being drawn, which is 4/20 = 1/5. Thus, the 1 euro gain occurs three fifths of the time (for K, Q and J), while the 1 euro and the 2 euro losses each occur one fifth of the time. Therefore, one expects the average gain in a long run to be as follows:

expected average gain = 1 × 3/5 + (−1) × 1/5 + (−2) × 1/5 = 0 euro .

It thus turns out that the expected average gain (theoretical average gain) in a long run is zero euros: either the reader or the opponent has the same a priori chances of obtaining the same average gain. In this case the game is considered fair.

Suppose the following sequence turned up:

AKAKAFJQJFAQFAAAJAAQAJAQKKFQKAJKJQJAQKKKF .

For this sequence the reader's accumulated gain varies as −1, 0, −1, 0, −1, −3, −2, −1, 0, −2, and so on. By the end of the first five rounds the average gain per round would be −0.2 euros (therefore, a loss). Figure 3.1 shows how the average gain varies over the above 40 rounds.

Fig. 3.1. Evolution of the average gain in a card game

If, for instance, the king's value was 2 euros, the reader would have an a priori advantage over the opponent: the game would be biased towards the reader, who could expect to win 0.2 euros per round after many rounds. If, on the other hand, the ace's value was −2 euros, we would have the opposite situation: the game would be biased towards the opponent, favoring him/her by 0.2 euros per round after many rounds.
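A small simulation illustrates both the zero expectation and the empirical average of a long run (the seed and the number of rounds are arbitrary):

```python
import random

# Gains: +1 for K, Q, J; -1 for an ace; -2 for a joker. Each value has
# probability 1/5 in the 20-card deck (the card is replaced every round).
gain = {"K": 1, "Q": 1, "J": 1, "A": -1, "F": -2}

expected = sum(gain.values()) / len(gain)  # equal probabilities of 1/5
print(expected)  # 0.0

# The average gain of a long simulated run stays close to the expectation.
random.seed(2)
rounds = [gain[random.choice("KQJAF")] for _ in range(100_000)]
print(sum(rounds) / len(rounds))  # close to 0
```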

Let us take a closer look at how the average gain is computed. One can look at it as if one had to deal with a variable called gain, taking values in the set {−2, −1, 1}, and such that the appropriate probability is assigned to each value, as shown in Fig. 3.2. We say that the gain variable is a random variable, with probability function denoted by f.

Fig. 3.2. The probability function of the gain for a card game

The expected average gain, expected after a large number of rounds, is called the mathematical expectation of the gain variable and is computed, as we saw, using the event probabilities: one simply adds up the products of the values with the probabilities. Note the word 'mathematical'. We may have to play an incredibly large number of times before the average gain per round is sufficiently near the mathematical, theoretical, value of the expected average gain.

There is in fact a physical interpretation of the mathematical expectation, which we shall now explain with the help of Fig. 3.2. Imagine that the 'gain' axis is a wire along which point masses have been placed, with the same values and in the same positions as in Fig. 3.2. The equilibrium point (center of gravity) of the wire would then correspond to the mathematical expectation, in this case the point 0.

At this moment the reader will probably suspect, and rightly so, that in the coin-tossing experiment the probability that heads turns up is the mathematical expectation of the random variable 'frequency of occurrence'.

Fig. 3.3. Evolution of the average gain in 1 000 rounds

3.2 The Law of Large Numbers

The time has come to clarify what is meant by a 'sufficiently long run' or 'a large number of rounds'. For that purpose, we shall use the preceding card game example, with null mathematical expectation.

Figure 3.3 shows how the average gain of the card game varies over a sequence of 1 000 rounds. Note the large initial wanderings followed by a relative stabilization. Another sequence of rounds would most certainly lead to a different average gain sequence; however, it would also stabilize close to the

mathematical expectation = Σ (value of random variable × respective probability) ,

where Σ denotes the total sum over the possible values. Denoting the expectation by E, we have E(AKQJF game) = 0.

The average gain after n rounds is computed by adding the individual gains of the n rounds – that is, the values of the corresponding random variable – and dividing the resulting sum by n. In Fig. 3.3, the values of the random variable for the first 10 rounds were

−1, 1, −1, 1, −1, −2, 1, 1, 1, −2 .

Fig. 3.4. The euphemistic average

We thus obtain the following average gain in 10 rounds:

average gain in 10 rounds = (−1 + 1 − 1 + 1 − 1 − 2 + 1 + 1 + 1 − 2) / 10 = −0.2 .

Figure 3.3 shows a sequence of 1 000 rounds. This is only one sequence out of a large number of possible sequences. With regard to the number of possible sequences, we apply the same counting technique as for the toto: the number of possible sequences is 5^1000, which is closely represented by 9 followed by 698 zeros! (A number far larger than the number of atoms in the Milky Way, estimated as 1 followed by 68 zeros.)

Figure 3.5 shows 10 curves of the average gain in the AKQJF game, corresponding to 10 possible sequences of 1 000 rounds. Let us suppose that we establish a certain deviation around zero, for instance 0.08 (the broken lines in Fig. 3.5), and let us ask whether it is possible to guarantee that, after a certain number n of rounds, all the curves will deviate from zero by less than 0.08. Let us see what happens in the situation shown in Fig. 3.5. We observe a tendency for all curves to stabilize around the mathematical expectation 0. When n = 200, we have 4 curves (out of 10) that exceed the specified deviation. Near n = 350, we have 3 curves. After around 620, we only have one curve and, finally, after approximately 850, no curve exceeds the specified deviation. However, we also observe that some average gain curves that looked as though they were approaching zero departed from it later on. This experimental evidence suggests, therefore, that the probability of finding a curve that deviates from the mathematical expectation by more than a pre-specified tolerance becomes smaller and smaller as n increases.

Fig. 3.5. Ten possible evolutions of the average gain in the card game
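Curves like those of Fig. 3.5 can be generated in a few lines; a sketch (with a different random seed, the checkpoint counts will of course differ from those quoted above):

```python
import random

# Simulate 10 average-gain curves of 1000 rounds each, and count how many
# lie outside the +/-0.08 band around 0 at a few checkpoints.
gain = {"K": 1, "Q": 1, "J": 1, "A": -1, "F": -2}
random.seed(3)

curves = []
for _ in range(10):
    total, curve = 0, []
    for n in range(1, 1001):
        total += gain[random.choice("KQJAF")]
        curve.append(total / n)
    curves.append(curve)

for n in (200, 350, 620, 1000):
    outside = sum(abs(c[n - 1]) > 0.08 for c in curves)
    print(n, outside)
```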

It is precisely such a law that the mathematician Jacob Bernoulli, already mentioned in Chap. 1, was able to demonstrate, and it has come to be known as the law of large numbers or Bernoulli's law: for a sequence of random variables with the same probability law and independent from each other (in our example, no round depends on the others), the probability that the respective average deviates from the mathematical expectation by more than a certain pre-established value tends to zero, for sufficiently large n.

This way, we cannot guarantee that after a certain number of rounds all average gain curves will deviate from the mathematical expectation by less than a certain value, but it is nevertheless possible to determine a degree of certainty (compute the probability) of such an event, and that degree of certainty increases with the number of rounds. In a certain sense, the probability measure (for instance, of heads turning up in coin tossing) does not predict the outcome of a particular sequence, but it 'predicts' the average outcome of a large number of sequences. We cannot predict the detail of all possible evolutions, but we can determine a certain global degree of certainty.

Turning back to our previous question: in a game between John and Mary, if John bets a million with 1/1000 probability of losing and Mary bets 1 000 with 999/1000 probability of losing, the gain expectation is

1 000 × 999/1000 − 1 000 000 × 1/1000 = −1 for John ,
1 000 000 × 1/1000 − 1 000 × 999/1000 = +1 for Mary .

Therefore, in the sense of the law of large numbers, the game is favorable to Mary (see Fig. 3.6). But, if the number of rounds is small, John has the advantage.

Fig. 3.6. Evolution of John's average gain with the number of rounds, in six possible sequences. Only in one sequence is chance unfavorable to John during the first 37 rounds
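John's and Mary's expectations reduce to one line each:

```python
# John risks 1 000 000 with probability 1/1000 of losing;
# Mary risks 1 000 with probability 999/1000 of losing.
e_john = 1_000 * 999 / 1000 - 1_000_000 * 1 / 1000
e_mary = 1_000_000 * 1 / 1000 - 1_000 * 999 / 1000

print(e_john, e_mary)  # -1.0 1.0
```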

When we say 'tends to zero' in the law of large numbers, the meaning is that it is possible to find a sufficiently high value of n after which the probability of a curve exceeding a pre-specified deviation is below any arbitrarily low value we may wish to stipulate. We shall see later what value of n that is for this convergence in probability.

3.3 Bad Luck Games

The notion of mathematical expectation will allow us to make a better assessment of the degree of 'bad luck' in so-called games of chance. Let us consider the European roulette game discussed in Chap. 1, and let us assume that a gambler only bets on a specific roulette number and for that purpose (s)he pays 1 euro to the casino. Let us further assume that if the gambler wins the casino will pay him/her 2 euros. For a sufficiently large number of rounds the player's average gain will approach the mathematical expectation

   E(roulette) = 2 × 1/37 + 0 × 36/37 = 0.054 euros.

Since the player pays 1 euro for each bet, implying an average loss of 1 − 0.054 = 0.946 euros per bet, the roulette game is unfavorable to him/her. As can be expected, the game would only be equitable for the player if (s)he paid the mathematical expectation for betting, namely 0.054 euros for a two euro payment by the casino. For a one euro bet, the game would only be equitable if the casino paid 37 euros (!) for each winning number.

Let us now analyze the slot machine described in Chap. 1. The question is: how much would a gambler have to pay for the right to one equitable round at the slot machine? To answer this question we compute the mathematical expectation using the values presented in Chap. 1, summing each prize multiplied by its probability of turning up; the result is approximately one cent. Therefore the gambler should only pay 1 cent of a dollar per go.

The winning expectation in state games (lottery, toto, lotto, and the like) is also, as can be expected, unfavorable to the gambler.
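For readers who like to check such figures, the roulette expectation above reduces to a few lines of Python (an illustrative sketch; the 2-euro payout and the 37 slots are those of the text):

```python
# Single-number bet on a European roulette (37 slots), 1 euro staked,
# 2 euros paid back on a win, as described in the text.
p_win = 1 / 37
expectation = 2 * p_win + 0 * (1 - p_win)
average_loss = 1 - expectation

print(round(expectation, 3), round(average_loss, 3))

# An equitable game would instead pay 37 euros per winning number,
# returning one euro per euro bet on average:
equitable_return = 37 * p_win
print(equitable_return)
```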

Consider the toto, and let 0.35 euros be the payment for one bet. In state games, from the total amount of money amassed from the bets, only half is spent on prizes, approximately in proportion to the probabilities of each type of hit. As we saw in Chap. 1, for each first prize one expects, on average, 43 second prizes, 258 third prizes, 13 545 fourth prizes and 246 820 fifth prizes. That is, 50% of the amassed money will be used to pay prizes, with a probability of

   P(prizes) = 0.000 000 071 5 × (1 + 43 + 258 + 13 545 + 246 820) = 0.018 6,

so that a gambler wins some prize with probability 0.018 6 and nothing with probability 0.9814. Carrying out the expectation computation with these values, one finds E(toto) = −0.096 euros: the toto gambler loses on average (in the sense of the law of large numbers) 0.096 euros per bet.

3.4 The Two-Envelope Paradox

The two-envelope paradox, also known as the swap paradox, seems to have been created by the Belgian mathematician Maurice Kraitchik (1882–1957) in 1953. There is probably no other paradox (or better, pseudo-paradox) of probability theory that has stimulated so much discussion in books and papers. It is interesting to note that its enunciation is apparently simple and innocuous.

Let us consider a first version of the paradox. Suppose I show the reader two identical and closed envelopes with unknown contents. The only thing that is disclosed is that they contain certain amounts of money, with the amount in one of the envelopes being double the amount in the other one. The reader may choose one envelope, telling me his/her choice. Having told me his/her choice, the reader is given the possibility of swapping his/her choice. What should the reader do? Swap or not swap? It seems obvious that any swap is superfluous. It is the same thing as if I offered the possibility of choosing heads or tails in the toss of a coin. The probability is always 1/2. But let us contemplate the following line of thought:

1. The envelope I chose contains x euros.
2. The envelope I chose contains either the smaller or the bigger amount, as equally likely alternatives.

3. If my envelope contains the smaller amount, I will get 2x euros if I swap and I would therefore win x euros.
4. On the other hand, if my envelope contains the bigger amount, by swapping it I will get x/2 euros and I would thus lose x/2 euros.
5. Therefore, the expected gain if I make the swap is

   E(swap gain) = 0.5 × x − 0.5 × (x/2) = 0.25x euros.

Therefore, I should swap since the expected gain is positive. (The wisest decision is always guided by the mathematical expectation; even if it is not always the winning decision, it is definitely the winning decision with arbitrarily high probability in a sufficiently large number of trials.) But once the swap takes place, the same argument still holds, and therefore I should keep swapping an infinite number of times!

The crux of the paradox lies in the meaning attached to x in the formula for E(swap gain). Let us take a closer look by attributing a value to x. Let us say that x is 10 euros and I am told that the amount in the other envelope is either twice this amount (20 euros) or half of it (5 euros), with equal probability. In this case I should swap, winning on average 2.5 euros with the swap. There is no paradox in this case. Or is there? If I swap, does the same argument not hold? For suppose that the other envelope contains y euros. Then the amount of the initially chosen envelope is either 2y euros or y/2 euros, and the average swap gain is 0.25y euros. Then the advantage in continuing to swap ad infinitum would remain, and so would the paradox. The problem with this argument is that, once we have fixed x, the value of y is not fixed: it is either 20 euros or 5 euros. Therefore we cannot use y in the computation of E(swap gain), since it assumes a fixed value for x, and the calculation in step 5 of the above argument is not legitimate. This aspect becomes clear if we consider the standard version of the paradox, which we analyze in the following.

In the standard version of the paradox, what is considered fixed is the total amount in the two envelopes (the total amount does not depend on my original choice). Let 3x euros be the total amount. That is, my initial choice contains either x euros or 2x euros, with equal probability. If the initial envelope contains the lower amount, x euros, I win x euros with the swap. If, on the other hand, it contains the bigger amount, 2x euros, I do not lose x/2 euros as in the formula of step 5, but in fact x euros. Therefore,

   E(swap gain) = 0.5 × x − 0.5 × x = 0 euros,

and I do not win or lose with the swap.
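The difference between the two versions is easy to reproduce numerically. The sketch below is ours (amounts as in the 10-euro example): swapping gains nothing in the fixed-total standard version, while it really does gain 2.5 euros on average when the other envelope genuinely holds 20 or 5 euros with equal probability:

```python
import random

rng = random.Random(42)
TRIALS = 100_000

# Standard version: the total 3x is fixed at 30 euros;
# my envelope holds 10 or 20 euros with equal probability.
swap_gains = []
for _ in range(TRIALS):
    mine = rng.choice([10.0, 20.0])
    other = 30.0 - mine
    swap_gains.append(other - mine)
avg_standard = sum(swap_gains) / TRIALS

# First version: I hold exactly 10 euros and the other envelope
# really contains 20 or 5 euros with equal probability.
swap_gains = [rng.choice([20.0, 5.0]) - 10.0 for _ in range(TRIALS)]
avg_fixed_x = sum(swap_gains) / TRIALS

print(avg_standard)  # close to 0 euros
print(avg_fixed_x)   # close to +2.5 euros
```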

Other solutions to the paradox have been presented, based on the fact that the use of the formula in step 5 would lead to impossible probabilistic functions for the amount in one of the envelopes (see, e.g., Keith Devlin, 2005).

Do the two-envelope paradox and its analysis have any practical interest, or is it simply an 'entertaining problem'? In fact, this and other paradoxes do sometimes have interesting practical implications. Consider the following problem: let x be a certain amount in euros that we may exchange today into dollars at a rate of 1 dollar per euro, and later convert back to euros at a rate of 2 dollars or 1/2 dollar per euro, with equal probability. Applying the envelope paradox reasoning, we should perform these conversions hoping to win 0.25x euros on average. But then the dollar holder would make the opposite conversions, and we arrive at a paradoxical situation known as the Siegel paradox, whose connection with the two-envelope paradox is obvious (see Friedel Bolle, 2003).

3.5 The Saint Petersburg Paradox

We now discuss a coin-tossing game where a hypothetical casino (extremely confident in the gambler's fallacy) accepts to pay 2 euros if heads turns up at the first throw, 4 euros if heads only turns up at the second throw, 8 euros if it only turns up at the third throw, 16 euros if it only turns up at the fourth throw, and so on and so forth, always doubling the previous amount. The question is: how much money should the player give to the casino as initial down-payment in order to guarantee an equitable game? The discussion and solution of this problem were first presented by Daniel Bernoulli (nephew of Jacob Bernoulli) in his Commentarii Academiae Scientiarum Imperialis Petropolitanae (Comments of the Imperial Academy of Science of Saint Petersburg), published in 1738. The problem (of which there are several versions) came to be known as the Saint Petersburg paradox and was the subject of abundant bibliography and discussions.

The probability of heads turning up at the first throw is 1/2. For the second throw, taking into account the compound event probability rule for independent events, we have

   P(first time tails) × P(second time heads) = (1/2)²,

and likewise for the following throws, always multiplying the previous value by 1/2. At the nth throw the player would finally get the 2^n euro prize. It then comes as no surprise that the casino proposes to pay the player an amount that rapidly grows with the throw number (Fig. 3.7a): as shown in Fig. 3.7b, the probability rapidly decreases with the throw number (the random variable) at which heads turns up for the first time.

The difficulty with this paradox lies in the fact that the mathematical expectation is infinite. Applying the rule that the player must pay the mathematical expectation to make the game equitable, this would correspond to the player having to pay

   E(Saint Petersburg game) = 2 × 1/2 + 4 × 1/4 + 8 × 1/8 + 16 × 1/16 + ··· = 1 + 1 + 1 + 1 + ··· = ∞.

The player would thus have to pay an infinite (!) down-payment to the casino in order to enter the game. Strictly, this game has no mathematical expectation and the law of large numbers is not applicable. Moreover, assuming that for entering the game the player would have to pay the prize value whenever no heads turned up, then, when heads did finally turn up at the nth throw, (s)he would have disbursed

   2 + 4 + 8 + ··· + 2^(n−1) = 2^n − 2 euros,

so that only a net profit of 2 euros is obtained. One is unlikely to find a gambler who would pay out such large amounts for such a meager profit. The paradox is solvable if we modify the notion of equitable game.

Fig. 3.7. (a) The exponential growth of the gain in the paradoxical Saint Petersburg game. (b) The decreasing exponential probability in the same game
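A simulation makes the pathology visible (an illustrative sketch of ours, not the book's): the average payout per round never settles down, but keeps creeping upward as more rounds are played, roughly like the base-2 logarithm of the number of rounds.

```python
import random

def petersburg_payout(rng):
    """Toss until heads first appears; the casino pays 2^n euros
    when that happens at throw number n."""
    n = 1
    while rng.random() < 0.5:  # tails, with probability 1/2
        n += 1
    return 2 ** n

rng = random.Random(1)
for rounds in (100, 10_000, 1_000_000):
    average = sum(petersburg_payout(rng) for _ in range(rounds)) / rounds
    print(rounds, average)
```

No matter how far the simulation is pushed, the running average refuses to converge, which is the practical face of an infinite expectation.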

The idea is as follows. Let us consider that, instead of a fixed amount per throw, the gambler's payments are variable, with an accumulated payment denoted by Pn after n throws, and let Sn denote the sum of the gains after n throws. We now modify the notion of an equitable game by imposing the convergence in probability towards 1 of the ratio Sn/Pn. In other words, we would like any imbalance between what the casino and the gambler have already paid to become more and more improbable as the number of throws increases. In this sense it is possible to guarantee the equitability of the Saint Petersburg game if the player pays in accordance with the following law:

   Pn = n × log₂ n + 1,

corresponding to paying the following amounts per throw: 1, 2, 2.75, 3.25, 3.61, 3.90, 4.14, 4.35, 4.53, 4.69, ….

When Daniel Bernoulli presented the Saint Petersburg paradox in 1738, he also presented a solution based on what he called the utility function. Any variation of an individual's fortune, say f, has different meanings depending on the value of f. Let us denote the variation of f by ∆f. If, for instance, an individual's fortune varies by ∆f = 10 000 euros, the meaning of this variation for an individual whose fortune is f = 10 000 euros (100% variation) is quite different from the meaning it has for another individual whose fortune is f = 1 000 000 euros (1% variation). Let us then consider that any transaction between individuals is deemed equivalent only when the values of ∆f/f are the same. This amounts to saying that there is a utility function of the fortune f, expressed by its logarithm, log(f).

Table 3.1. Relevant functions for the Saint Petersburg paradox

   n     P(n)      Gain f [euro]   Utility log f   Expected utility
   1     1/2       2               0.693 147       0.346 574
   2     1/4       4               1.386 294       0.693 147
   3     1/8       8               2.079 442       0.953 077
   4     1/16      16              2.772 589       1.126 364
   5     1/32      32              3.465 736       1.234 668
   6     1/64      64              4.158 883       1.299 651
   7     1/128     128             4.852 030       1.337 557
   8     1/256     256             5.545 177       1.359 218
   9     1/512     512             6.238 325       1.371 403
   10    1/1024    1024            6.931 472       1.378 172
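The payment law above is easy to tabulate. The sketch below (ours) lists the per-throw payments P(n) − P(n−1); since log₂ 1 = 0, the first accumulated payment is just 1 euro:

```python
import math

def accumulated_payment(n):
    """Accumulated payment P_n = n*log2(n) + 1 after n throws."""
    return n * math.log2(n) + 1

# Per-throw payments P_n - P_(n-1), the amounts listed in the text:
per_throw = [accumulated_payment(1)]
per_throw += [accumulated_payment(n) - accumulated_payment(n - 1)
              for n in range(2, 11)]
print([round(p, 2) for p in per_throw])
```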

(Effectively, the base e logarithm function – see the appendix at the end of the book – enjoys the property that its variation is expressed in that way, viz. ∆f/f.) Table 3.1 shows the expected utilities for the Saint Petersburg game, where the gain represents the fortune f. The expected utilities converge to a limit (1.386 294), corresponding to 4 euros. The rational player, from the money-utility point of view, would then pay an amount lower than 4 euros to enter the game (but we do not know whether the casino would accept it).

3.6 When the Improbable Happens

In Chap. 1 reference was made to the problem of how to obtain heads in 5 throws, either with a fair coin – P(heads) = P(tails) = 0.5 – or with a biased one – P(heads) = 0.65, P(tails) = 0.35. We are now able to compute the mathematical expectation in both situations. Taking into account the probability function for 0 through 5 heads turning up for a fair coin, found in Chap. 1, we have

   E(k heads in 5 throws) = 0 × 0.031 25 + 1 × 0.156 25 + 2 × 0.312 5 + 3 × 0.312 5 + 4 × 0.156 25 + 5 × 0.031 25 = 2.5.

We have obtained the entirely expected result that, for a fair coin, heads turns up on average in half of the throws (for 5-throw sequences or any other number of throws).

There are a vast number of practical situations requiring the determination of the probability of an event occurring k times in n independent repetitions of a certain random experiment. Assuming that the event that interests us (success) has a probability p of occurring in any of the repetitions, the probability function to use is the same as the one found in Chap. 1 for the coin example, that is,

   P(k successes in n) = C(n, k) × p^k × (1 − p)^(n−k).

The distribution of probability values according to this formula is called the binomial distribution. The name comes from the fact that the probabilities correspond to the terms in the binomial expansion of (p + q)^n, with q = 1 − p. It can be shown that the mathematical expectation of the binomial distribution (the distribution mean) is given by

   E(binomial distribution) = n × p.

We can check this result for the above example: E = 5 × 0.5 = 2.5. For the biased coin we have E = 5 × 0.65 = 3.25. Note how the expectation has been displaced towards a higher number of heads. Note also how the expectation corresponds to a central position (center of gravity) of the probability distribution. Incidentally, from f = k/n and E(k) = np, we arrive at the result mentioned in the first section of this chapter, E(f) = p: the mathematical expectation of an event frequency is its probability.

Consider a sequence of 20 throws of a fair coin in which 20 heads turned up:

   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.

Looking at this sequence, one gets the feeling that something very peculiar has happened. As in the gambler's fallacy, there is a great temptation to say that, at the twenty-first throw, the probability of turning up tails will be higher. In other situations one might be tempted to attribute a magical or divine cause to such a sequence. The same applies to other highly organized sequences, such as

   1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0.

Nevertheless, all these sequences have the same probability of occurring as any other sequence of 20 throws, namely (1/2)^20 = 0.000 000 953 67. If we repeat the 20-throw experiment a million times, we expect to obtain such sequences (or any other sequence, for that matter) a number of times given by n × p = 1 000 000 × 0.000 000 95 ≈ 1. That is, there is a high certainty that once in a million such a sequence may occur. The conclusion to be drawn is that, given a sufficiently large number of experiments, the 'improbable' will happen: highly organized sequences, supposedly indicating supernatural powers, the emergence of life on a planet, or our own existence. Likewise, in the case of the toto game, where the probability of the first prize is 0.000 000 627, we may consider what happens if there are 1.6 million bets. The expectation is approximately 1; that is, on average we will find (with high certainty) a first prize among 1.6 million bets.
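All the numbers in this section can be recomputed from the binomial formula itself; the following sketch of ours does so with Python's exact binomial coefficients:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes in n) = C(n,k) * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Expectation by direct summation, for the fair and the biased coin:
E_fair = sum(k * binomial_pmf(k, 5, 0.5) for k in range(6))
E_biased = sum(k * binomial_pmf(k, 5, 0.65) for k in range(6))
print(E_fair, E_biased)  # both agree with n*p, up to float rounding

# One particular 20-throw sequence, such as twenty heads in a row:
p20 = binomial_pmf(20, 20, 0.5)   # (1/2)^20
print(p20, 1_000_000 * p20)       # rare once, but expected in a million tries
```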

3.7 Paranormal Coincidences

Imagine yourself dreaming about the death of someone you knew. When reading the newspaper the next day, you come across the column reporting the death of your acquaintance. Do you have a premonitory gift? People 'studying' so-called paranormal phenomena often present various experiences of this sort as evidence that 'something' such as telepathy or premonition is occurring. Many reports on 'weird' cases contain, in fact, an incorrect appreciation of the probabilities of coincident events. We saw in the birthday problem that it suffices to have 23 people at a party in order to obtain a probability over 50% that two of them share the same birthday. If we consider birthday dates separated by at most one day, the number 23 can be dropped to 14!

Let us go back to the problem of dreaming about the death of someone we knew and subsequently receiving the information that (s)he has died. Let us assume that in 30 years of adult life a person dreams, on average, only once about someone he/she has known during that 30 year span. (This is probably a low estimate.) The probability of dreaming about the death of an acquaintance on the day preceding his/her death (one hit in 30 × 365.25 days) is then 1/(30 × 365.25). Let us further assume that in the 30 year span there are only 15 acquaintances (friends and relatives) whose death is reported in the right time span, so that the coincidence can be detected. With this assumption, the probability of a coincidence, taking into account the 15 deaths, is 15/(30 × 365.25) = 1/730.5 in 30 years of one's life. The probability of one coincidence per year is therefore estimated as 1/(30 × 730.5) = 0.000 045 6. We may then use the binomial distribution formula for the mean in order to estimate how many coincidences occur per year. Finally, let us take, for instance, the Portuguese adult population, estimated at 7 million people. Then, in Portugal:

   average number of coincidences per year = n × p = 7 000 000 × 0.000 045 6 = 319,

that is, about once per day (and of course at a higher rate in countries with a larger population). We may modify some of the previous assumptions without significantly changing the main conclusion: 'paranormal' coincidences are in fact entirely normal when we consider a large population.
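The arithmetic of the estimate is worth laying out explicitly. The sketch below uses only the assumptions stated above (one such dream in 30 years, 15 relevant deaths, 7 million adults):

```python
# Assumptions from the text: one dream about an acquaintance in 30 years
# of adult life, 15 acquaintances whose deaths are reported in time,
# and an adult population of 7 million people.
days_in_span = 30 * 365.25
p_coincidence_lifetime = 15 / days_in_span   # 1/730.5, over the 30 years
p_per_year = p_coincidence_lifetime / 30     # about 0.0000456

population = 7_000_000
coincidences_per_year = population * p_per_year
print(p_per_year, coincidences_per_year)     # about 319 per year
print(coincidences_per_year / 365.25)        # i.e., roughly one per day
```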

The conclusion is that, given a sufficiently large number, the 'improbable' will happen.

3.8 The Chevalier de Méré Problem

On 29 July 1654, Pascal wrote a long letter to Fermat presenting the solution of a game problem raised by the Chevalier de Méré, a member of the court of Louis XIV already mentioned in Chap. 1. The problem was as follows. Two players (say Primus and Secundus) bet 32 pistoles (the French name for a gold coin) in a heads or tails game. The 64 pistoles would stay with the player that first obtained 3 successes, either consecutive or not (a success might be, for instance, heads for Primus and tails for Secundus). In the first round Primus wins, but at that moment the two players are obliged to leave without ever finishing the match. How should the bet be divided in a fair way between them?

Fermat, in a previous letter to Pascal, had proposed a solution based on the enumeration of all possible cases, that is, on the enumeration of all the various paths along which the game might have evolved. Pascal presented Fermat with another, more elegant solution, based on a hierarchical description of the problem and introducing the mathematical concept of expectation. Let us consider Pascal's solution.

It is useful to represent the enumeration in a tree diagram, as shown in Fig. 3.8, where we assume that Primus won the first round. Since a Primus loss is a Secundus win and vice versa, upward branches correspond to Primus wins (P) and downward branches to Secundus wins (S), and the match goes on for only 5 rounds. The circle marked 44 corresponds to the initial situation. For the moment we do not know the justification for this and other numberings of the circles. Suppose that Primus wins again: following the branch marked P (Primus wins), we then move to the circle marked 56. If, on the other hand, Secundus wins the second round, we move downwards, in the tree branch marked S, to the circle marked 32. The circle marked 64 corresponds to a final P win, with three heads for P according to the sequence PPSSP, for instance. The circle marked 0 corresponds to a final win for S, with three tails for S according to the sequence PPSSS.

Let us now analyze the tree from right to left (from the end to the beginning), starting with the gray circles. Had the match been interrupted immediately before this final round, what would P's expectation be? We have already learned how to do this calculation:

   E(P) = 64 × 1/2 + 0 × 1/2 = 32.

We thus obtain the situation represented by the hatched circle. For the paths that start with PPS, at the third round, we obtain in the same way

   E(P) = 64 × 1/2 + 32 × 1/2 = 48.

The process for computing the expectation repeats in the same way, progressing leftward from the end circles, until we arrive at the initial circle, to which the value 44 has been assigned. Thus, when he leaves, Primus must keep 44 pistoles and Secundus 20. A problem that seemed difficult was easily solved using the concept of mathematical expectation.

As a final observation, let us note that there are 10 end circles: 10 possible sequences of rounds. Among these, Primus wins 6 and Secundus 4. In spite of this, the money at stake must not be shared in the ratio 6/10 (38.4 pistoles) for Primus and 4/10 (25.6 pistoles) for Secundus! Once more a simple and entertaining problem has important practical consequences. In fact, Pascal's reasoning, based on an end-to-start analysis, is similar to the reasoning that leads to the price calculation of a stock market option, particularly in the context of a famous formula established by Fischer Black and Myron Scholes in 1973 (which got them the Nobel Prize for Economics in 1997).

Fig. 3.8. Tree diagram for Pascal's solution
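Pascal's right-to-left sweep of the tree is precisely a recursive computation of expectations, and a few lines of Python (our sketch of it, not the book's) reproduce every circled value:

```python
from fractions import Fraction

def value_to_primus(p_wins, s_wins, stake=64):
    """Expected share of the stake for Primus when he has won p_wins
    rounds and Secundus s_wins, each further round being a fair coin
    toss and the first player to reach 3 rounds taking the 64 pistoles."""
    if p_wins == 3:
        return Fraction(stake)
    if s_wins == 3:
        return Fraction(0)
    return (value_to_primus(p_wins + 1, s_wins, stake)
            + value_to_primus(p_wins, s_wins + 1, stake)) / 2

print(value_to_primus(2, 2))  # the hatched-circle situation: 32
print(value_to_primus(2, 1))  # the PPS situation: 48
print(value_to_primus(2, 0))  # the circle marked 56
print(value_to_primus(1, 0))  # the initial circle: 44, Primus's fair share
```

Using exact fractions instead of floats mirrors the fact that every circled value in the tree is an exact average of the two circles to its right.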

3.9 Martingales

The quest for strategies that allow one to win games is probably as old an endeavor as mankind itself. The word 'martingale' is used in games of chance to refer to a winning strategy. The scientific notion of martingale is associated with the notion of an equitable game, already mentioned earlier, and plays an important role in probability theory and other branches of mathematics. This notion appeared for the first time in the book The Doctrine of Chance by Abraham de Moivre.

Let us imagine the following game between two opponents A and B. Both have before them a stack of coins, all of the same type. Each piles them up and takes out a coin from their stack at the same time. If the drawn coins show the same face, A wins from B in the following way: 9 euros if it is heads–heads and 1 euro if it is tails–tails. If the upper faces of the coins are different, B wins 5 euros from A, no matter whether it is heads–tails or tails–heads.

One might think that, since the average gain in the situation that A wins, viz. (9 + 1)/2, and in the situation that B wins, viz. (5 + 5)/2, are the same, it will not really matter how the coins are piled up. However, this is not true. Player A will tend to pile the coins up in such a way as to exhibit heads more often, expecting to find many heads–heads situations. B, knowing that A will try to show up heads, will try to show tails.

Let t represent the fraction of the stack in which B puts heads facing up, randomly distributed in the stack, and let us see how the gain of B varies with t. Let us denote that gain by G. First, consider A piling up the coins so that they always show up heads. B must then pay 9 euros to A during a fraction t of the game, but will win 5 euros during a fraction 1 − t of the game. That is, the gain of B is

   G(stack A: heads) = −9t + 5(1 − t) = −14t + 5.

If the coins of stack A show tails, one likewise obtains

   G(stack A: tails) = +5t − 1(1 − t) = 6t − 1.

What has been said for a stack segment where A placed the same coin face showing up applies to the whole stack. As can be seen from Fig. 3.9, the B gain lines for the two situations intersect at the point t = 0.3. This means that if B stacks his coins in such a way that three tenths of them, randomly distributed in the stack, show up heads, he will expect to win, during a long sequence of rounds, an average of 0.8 euros per round, with a positive gain for B of 0.8 euros whatever A does.
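The two gain lines and their crossing point can be checked directly (a short sketch of ours):

```python
def gain_vs_heads_stack(t):
    """B's average gain per round when A always shows heads."""
    return -9 * t + 5 * (1 - t)      # -14t + 5

def gain_vs_tails_stack(t):
    """B's average gain per round when A always shows tails."""
    return 5 * t - 1 * (1 - t)       # 6t - 1

# The lines cross where -14t + 5 = 6t - 1, i.e. t = 6/20 = 0.3,
# guaranteeing B the same average gain whatever A does:
t = 0.3
print(round(gain_vs_heads_stack(t), 3), round(gain_vs_tails_stack(t), 3))
```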

Fig. 3.9. How the gains of B vary with the fraction t of the stack where he plays heads facing up

The word comes from the Provençal word 'martegalo' (originating from the town of Martigues), and described a leather strap that prevented a horse from lifting up its head. A martingale is similarly intended to hold sway over chance.

One must distinguish true and false martingales. Publications sometimes appear proposing winning systems for the lotto, toto, and so on. One such system for winning the lotto is based on the analysis of graphs representing the frequency of occurrence of each number, as well as combinations of numbers. Of course, such information does not allow one to forecast the winning bet. In this case, this is just one more version of the gambler's fallacy. Other false martingales are more subtle, like the one that we shall describe at the end of Chap. 4.

3.10 How Not to Lose Much in Games of Chance

Let us suppose that the reader, possessing an initial capital of I euros, has decided to play roulette with even–odd- or red–black-type bets, set at a constant value of a euros, with probability p = 18/37 = 0.486 of winning and q = 1 − p of losing. Let us further suppose that the reader decides at the outset to keep playing either until he/she wins F euros or until ruined by losing the initial capital. The gain–loss sequence constitutes, in this case, a coin-tossing type of sequence. Jacob Bernoulli demonstrated in 1680 that such a game ends in a finite number of steps, with a winning probability given by
   P(winning F) = P(F) = [1 − (q/p)^(I/a)] / [1 − (q/p)^(F/a)].

In the case of roulette, q/p = 1.0556. If the reader starts with an initial capital of 1 000 euros and makes 50-euro bets, the probability of reaching 1 200 euros is given by

   P(F) = (1 − 1.0556^20) / (1 − 1.0556^24) = 0.7324.

Table 3.2 shows what happens when we make constant bets of 50, 100 and 200 euros. Looking at the values for situations with F = 1200 euros, printed in bold in the table, we see that the gambler would be advised to stay in the game the least time possible, whence he must risk the highest possible amount in a single move. In the above example, one single 200-euro bet has 81% probability of reaching the goal set by the gambler. This probability value is near the theoretical maximum value P(F) = 1000/1200 = 83.33%. If the bet amounts can vary, there are then betting strategies to obtain a probability close to that maximum value. One such strategy, proposed by Lester Dubins and Leonard Savage in 1956, is inspired by the idea of always betting the highest possible amount that allows one to reach the proposed goal. In the example we have been discussing, we should then set the sequence of bets as shown in Fig. 3.10.

Table 3.2. Roulette for different size bets. F represents the winnings at which the player stops and P(F) is the probability of reaching that value

   50 euro bets:   F      1050    1100    1150    1200    1250    1300    1350    1400
                   P(F)   0.9225  0.8527  0.7896  0.7324  0.6804  0.6330  0.5896  0.5498

   100 euro bets:  F      1100    1200    1300    1400    1500    1600    1700    1800
                   P(F)   0.8826  0.7853  0.7034  0.6337  0.5736  0.5215  0.4758  0.4356

   200 euro bets:  F      1200    1400    1600    1800    2000    2200    2400    2600
                   P(F)   0.8100  0.6747  0.5736  0.4952  0.4328  0.3820  0.3399  0.3045
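Bernoulli's formula reproduces every entry of Table 3.2; the sketch below (ours, with p = 18/37 as in the text) recomputes the F = 1200 column:

```python
def p_reach(capital, goal, bet, p=18 / 37):
    """Probability of reaching `goal` before ruin with constant bets of
    size `bet`, by the formula quoted in the text:
    [1 - (q/p)^(I/a)] / [1 - (q/p)^(F/a)]."""
    r = (1 - p) / p                      # q/p, about 1.0556 for roulette
    return (1 - r ** (capital / bet)) / (1 - r ** (goal / bet))

for bet in (50, 100, 200):
    print(bet, round(p_reach(1000, 1200, bet), 4))
# Table 3.2 lists 0.7324, 0.7853 and 0.8100 for these three cases.
```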

Fig. 3.10. The daring bet strategy proposed by Dubins and Savage

The following probabilities can be computed from the diagram:

   P(going from 1000 to 1200) = p + q × P(going from 800 to 1200),
   P(going from 800 to 1200) = p + q × p × P(going from 800 to 1200).

From this one concludes that

   P(going from 1000 to 1200) = p + qp/(1 − qp) = 0.8195,

closer to the theoretical maximum.

As a final remark, let us point out that the martingale of Dubins and Savage looks quite close to a martingale proposed by the French mathematician Jean Le Rond d'Alembert (1717–1783), according to which the gambler should keep doubling his/her bet until he/she wins or until he/she no longer has enough money to double the bet. For the above example, this would correspond to stopping when reaching 400 euros, with a probability of winning of 0.7363.

Needless to say, these martingales should not be viewed as an incitement to gambling – quite the opposite. As a matter of fact, there is even a result about when one should stop a constant-bet game of chance in order to maximize the winning expectation. The result is: don't play at all! In the meantime, we would like to draw the reader's attention to the fact that all these game-derived results have plenty of practical applications in different areas of human knowledge.
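The bold-play recursions above solve in closed form, and the doubling comparison follows from the chance of losing both allowed bets (a sketch of ours):

```python
p = 18 / 37   # probability of winning an even-money roulette bet
q = 1 - p

# From the diagram of Fig. 3.10:
#   P(800 -> 1200)  = p + q*p*P(800 -> 1200)  =>  p / (1 - q*p)
#   P(1000 -> 1200) = p + q*P(800 -> 1200)
p_800 = p / (1 - q * p)
p_1000 = p + q * p_800
print(round(p_1000, 4))       # about 0.8195, versus 0.7324 with 50-euro bets

# Doubling martingale (200 then 400 euros from a 1000-euro capital):
# the gambler wins unless both bets are lost.
p_doubling = 1 - q * q
print(round(p_doubling, 4))   # about 0.7363, as quoted in the text
```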


4 The Wonderful Curve

4.1 Approximating the Binomial Law

The binomial distribution was introduced in Chap. 1 and its mathematical expectation in the last chapter. We saw how the binomial law allows us to determine the probability that an event, of which we know the individual probability, occurs a certain number of times when the corresponding random experiment is repeated. Take for instance the following problem. The probability that a patient gets infected with a certain bacterial strain in a given hospital environment is p = 0.0055. What is the probability of finding more than 10 infected people in such a hospital with 1000 patients? Applying the binomial law, we have

   P(more than 10 events in 1000) = P(11 events) + P(12 events) + ··· + P(1000 events)
   = C(1000, 11) × 0.0055^11 × 0.9945^989 + C(1000, 12) × 0.0055^12 × 0.9945^988 + ··· + 0.0055^1000 × 0.9945^0.

The difficulty in carrying out these calculations can be easily imagined. For instance, calculating C(1000, 500) involves performing the multiplication 1000 × 999 × ··· × 501 and then dividing by 500 × 499 × ··· × 2.

Even if one applies the complement rule – subtracting from 1 the sum of the probabilities of 0 through 10 events – one still gets entangled in painful calculations. When using a computer, it is not uncommon for this type of calculation to exceed the number representation capacity.

Abraham de Moivre, mentioned several times in previous chapters, was the first to find the need to apply an easier way of computing factorials, when discussing the problem of encounters (Chap. 2). (In his day, performing the above calculations was an extremely tedious task.) That easier way (known as Stirling's formula) led him to an approximation of the binomial law for high values of n. Denoting the mathematical expectation of the binomial distribution by m (m = np) and the quantity np(1 − p) by v, the approximation obtained by Abraham de Moivre, assuming that m is not too small, was the following:

   C(n, k) × p^k × (1 − p)^(n−k) ≈ 1 / [√(2πv) × e^((k−m)²/2v)].

We have already found Napier's number, universally denoted by the letter e, with the approximate value 2.7.

Let us see if De Moivre's approximation works for the example in Chap. 1 of obtaining 3 heads in 5 throws of a fair coin:

   m = np = 5 × 1/2 = 2.5,   v = np(1 − p) = 1.25,   k − m = 3 − 2.5 = 0.5,   (k − m)²/2v = 0.25/2.5 = 0.1,

whence

   P(3 heads in 5 throws) ≈ 1 / [√(2π × 1.25) × e^0.1] = 0.32,

Fig. 4.1. De Moivre's approximation (open circles) of the binomial distribution (black bars). (a) n = 5. (b) n = 30
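De Moivre's approximation and its quality for n = 5 and n = 30 can be verified numerically (a sketch of ours):

```python
from math import comb, exp, pi, sqrt

def de_moivre(k, n, p):
    """De Moivre's normal approximation with m = n*p and v = n*p*(1-p)."""
    m, v = n * p, n * p * (1 - p)
    return exp(-((k - m) ** 2) / (2 * v)) / sqrt(2 * pi * v)

exact = comb(5, 3) * 0.5 ** 5      # 0.3125, the value found in Chap. 1
approx = de_moivre(3, 5, 0.5)
print(exact, round(approx, 2))

# Maximum deviation between the binomial law and the approximation:
devs = {}
for n in (5, 30):
    devs[n] = max(abs(comb(n, k) * 0.5 ** n - de_moivre(k, n, 0.5))
                  for k in range(n + 1))
print(devs)   # the fit improves markedly as n grows
```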

This is really quite close to the value of 0.31 we found in Chap. 1. Figure 4.1 shows the probability function of the binomial distribution (bars) for n = 5 and n = 30, together with Abraham de Moivre's approximation (open circles). For n = 5 the maximum deviation is 0.011; for n = 30 the maximum deviation comes down to a mere 0.0012.

4.2 Errors and the Bell Curve

Abraham de Moivre's approximation (described in a paper published in 1738, with the second edition of the book The Doctrine of Chances) was afterwards worked out in detail by the French mathematician Pierre-Simon Laplace (1749–1827) in his book Théorie Analytique des Probabilités (1812) and used to analyze the errors in astronomical measurements. More or less at the same time (1809), the German mathematician Johann Carl Friedrich Gauss (1777–1855) rigorously justified the distribution of certain sequences of measurement errors (particularly geodesic measurements) in agreement with that formula.

It is important at this point to draw the reader's attention to the fact that, in science, the word 'error' does not have its everyday meaning of 'fault' or 'mistake'. This semantic difference is often misunderstood. When measuring experimental outcomes, the word 'error' refers to the unavoidable uncertainty associated with any measurement, however aptly and carefully the measurement procedure has been applied. The error (uncertainty) associated with a measurement is, therefore, the effect of chance phenomena; in other words, it is a random variable.

Consider weighing a necklace using the spring balance with 2 g minimum scale resolution shown in Fig. 4.2. Suppose you measure 45 g for the weight of the necklace. There are several uncertainty factors influencing this measurement:

• The measurement system resolution: the balance hand and the scale lines have a certain width. Since the hand of Fig. 4.2 has a width almost corresponding to 1 g, it is difficult to discriminate weights differing by less than 1 g.

• The interpolation error that leads one to measure 45 g because the scale hand was approximately half-way between 44 g and 46 g.

• The parallax error, related to the fact that it is difficult to guarantee a perfect vertical alignment of our line of sight and the scale plane: if we look at the hand with our line of sight slightly to its right or to its left, we read slightly different values.

• The background noise, corresponding in this case to spring vibrations that are transmitted to the hand.

Fig. 4.2. Spring balance with 2 g minimum resolution

If we use a digital balance, we eliminate the parallax and interpolation errors, but the effects of limited resolution (the electronic sensor does not measure pressures below a certain value) and background noise (such as the thermal noise in the electronic circuitry) are still present.

Things happen as though the values of a random variable (the 'error') were added to the true necklace weight, this random variable having zero mathematical expectation. It seems reasonable to consider that, since the various error sources act independently of one another, the errors should have a tendency to cancel out when the weight is determined in a long series of measurements, with contributions on either side of the true value. Conscious of these uncertainty factors, suppose we weighed the necklace fifty times and obtained a sequence of fifty measurements, all whole-gram values between 43 g and 49 g: 44, 45, 46, 45, 47, ... The average weight is 45.68 g.

If we subtract the average weight from the above measurement sequence, we obtain the corresponding sequence of errors, with values such as −2.68, −1.68, −0.68, 0.32, 1.32, 2.32 and 3.32. We see that there are positive and negative errors. In a long sequence of measurements we expect, from the law of large numbers, the average of the errors to be close to zero. Let us therefore take 45.68 g as the true necklace weight and consider now the percentage of times that a certain error value will occur (by first subtracting the average from each value of the above sequence). Figure 4.3 shows how these percentages (vertical bars) vary with the error value. For instance, an error of −0.68 g occurred 13 times, which is 26% of the time. This graphical representation of frequencies of occurrence of a given random variable (the error, in our case) is called a histogram.

Now, Laplace and Gauss discovered independently that, for many error sequences of the most varied nature (relating to astronomical, geodesic, geomagnetic, electromagnetic measurements, and so on), the frequency distributions of the errors were well approximated by the values of the function

f(x) = (1/√(2πν)) × e^(−x²/(2ν)) ,

where x is the error variable and ν represents the so-called mean square error. Let us try to understand this. If we get many errors of high magnitude, the graph in Fig. 4.3 will become significantly stretched. In the opposite direction, if all the errors are of very small magnitude, the whole error distribution will be concentrated around zero. One way to focus on the contribution of each error to the larger or smaller stretching of the graph, regardless of its sign (positive or negative), consists in taking its square. So when we add the squares of all the errors and divide by their number, we obtain a measure of the spread of the error distribution. The mean square error is precisely such a measure of the spread:

ν = (1.32² + 2.68² + 0.68² + ⋯ + 0.32²)/50 = 2.58 g² .

The mean square error is also called the variance. The square root of ν, s = √ν = √2.58 = 1.61 g, is called the standard deviation.

The dotted line in Fig. 4.3 shows the graph of the function f(x) found by Laplace and Gauss, computed with ν = 2.58 g²; the value of the standard deviation (1.61 g) is indicated by the double-ended arrow. We see that the function seems to provide a reasonable approximation to the histogram values. As we increase the number of measurement repetitions, the function gradually fits better, until the values of the bars are practically indistinguishable from values taken from the dotted line. This bell-shaped curve often appears when we study chance phenomena. Gauss called it the normal distribution curve, because error distributions normally follow such a law.

Fig. 4.3. Error histogram when weighing a necklace

4.3 A Continuum of Chances

Let us picture a rifle shooting at a target like the one shown in Fig. 4.4, aiming at the target center. The rifle is assumed to be at a perfectly fixed position. Despite the fixed rifle position, the position of the bullet impact is random, depending on chance phenomena such as the wind, rifle vibrations, barrel heating, barrel grooves wearing out, and so on. We are assuming that by construction there can be no shots falling outside the target. If we use the frequentist interpretation of probability, we may easily obtain probability estimates of hitting a certain target area. For instance, if out of 5000 shots hitting the target 2000 fall in the innermost circle, an estimate of P(shot in innermost circle) is 2/5. We may apply the same procedure to the outermost circle; suppose we obtained 3/5.
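Returning for a moment to the weighing example, the whole bookkeeping of average, mean square error and standard deviation fits in a few lines of Python (the list of weighings below is an illustrative stand-in, not the book's data):

```python
import math

# Illustrative stand-in for the fifty weighings, in grams (not the book's data)
weighings = [44, 45, 46, 45, 47, 45, 46, 44, 48, 45] * 5

mean = sum(weighings) / len(weighings)
errors = [w - mean for w in weighings]            # deviations from the average
nu = sum(e * e for e in errors) / len(errors)     # mean square error (variance)
s = math.sqrt(nu)                                 # standard deviation

print(mean, nu, round(s, 2))
```

With the book's actual fifty values the same three lines would give 45.68 g, 2.58 g² and 1.61 g.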
The estimated probability of a hit in the circular ring between the two circles is then 3/5 − 2/5 = 1/5, and that of a hit outside the outermost circle is 1 − 3/5 = 2/5.

Fig. 4.4. Shooting at a target

We now proceed to consider a small circular area of the target (in any arbitrarily chosen position) for which we determine by the above procedure the probability of a hit. In order to obtain arbitrarily good estimates of P(shot in the circle) for a certain value of its area a, we would of course need a large number of shots on the target. Suppose we go on decreasing the circle radius, therefore shrinking the circular region. The probability of a hit will also decrease, in a way approximately proportional to the area of the circle when the area is sufficiently small and the shots are uniformly spread out over the circle. There comes a time when P(shot in the circle) is approximately given by k × a, where a is the area of the circle and k is a proportionality factor, and all the more so as a gets smaller. In a very small region we can thus speak of a probability density per unit area, and our previous k factor is

k = P(shot in region)/a .

When the region shrinks right down to a point, the probability of a hit is zero because the area is zero:

P(hit a point) = 0 (always, in continuous domains) .

It is as if the total probability of hitting the target was finely smeared out over the whole target. We thus draw two conclusions from our imaginary experiment of shooting at a target: the hit probability in a certain region continually decreases as the region shrinks, and, for sufficiently small regions, the probability is practically proportional to the area of the region.

We are now going to imagine that the whole rifle contraption was mounted in such a way that the shots could only hit certain pre-established points in the target and nothing else. If we randomly select a small region and proceed to shrink it, we will almost always obtain discontinuous variations of P(shot in region), depending on whether the region includes the pre-established points or not.

We have in fact the situation of point-like probabilities of discrete events (the pre-established points) that we have already considered in the preceding chapters. In this case the probability of hitting a point is not zero for the pre-established points: the total probability is not continuously smeared out on the target, it is discretely divided among the pre-established points. The domain of possible impact points is no longer continuous but discrete, and in an unending series of shots, hitting a specific point would also happen an infinite number of times.

Let us now go back to the issue of P(hit a point) equalling zero for continuous domains. Does this mean that the point-bullet never hits a specified point? Surely not! But it does so only a finite number of times when the number of shots is infinite, so that, according to the frequentist approach, the value of P(hit a point) is 0.

Operating with probabilities in a continuum of chances demands some extension of the notions we have already learned (not to mention some technical difficulties that we will not discuss). In order to do that, let us consider a situation similar to shooting at a target, but rather simpler, with the 'shot' on a straight-line segment between 0 and 1. Figure 4.5a illustrates this situation: many of the bullets shot by the rifle fall near the center (dark gray) and, as we move towards the segment ends, the hit probability decreases. Each point of the segment has a certain probability density of a hit; the figure also shows the probability density function, abbreviated to p.d.f. We see how the probability density is large at the center and drops towards the segment ends. Once again the probability of hitting a given point is zero; the probability of the shot falling into a given straight-line interval is, however, not zero. It is as if our degree of certainty that the shot falls in a given interval were finely smeared out along the line segment, resulting in a probability density per unit length. We may divide the interval into smaller intervals and still obtain non-zero probabilities.

In Fig. 4.5b the probability density is constant: no interval is favored with more hits than any other equal-length interval. This is the so-called uniform distribution. Uniformity is the equivalent for continuous domains of equiprobability (equal likelihood) for discrete domains. If we wish to determine the probability of a shot hitting the interval from 0.6 to 0.8, we have only to determine the hatched area of Fig. 4.5b. Thinking of the probability measure as a mass, there would exist a mass density along the line segment (and along any segment interval). (In the mass analogy, the hatched area is the mass of the interval.)
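A quick computer experiment (ours, a sketch) illustrates both points at once: an individual pre-chosen point is essentially never hit exactly, yet the frequency of hits in the interval from 0.6 to 0.8 approaches its length, 0.2:

```python
import random

random.seed(1)

# 'Shots' uniformly distributed on the segment [0, 1]
n = 200_000
shots = [random.random() for _ in range(n)]

in_interval = sum(1 for x in shots if 0.6 <= x <= 0.8) / n
exactly_half = sum(1 for x in shots if x == 0.5) / n   # one pre-chosen point

print(in_interval)     # close to the interval length, 0.2
print(exactly_half)    # 0.0 in practice: single points are never hit
```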

In the initial situation of shooting at a target, the p.d.f. is a certain surface, high at the center and decaying towards the periphery. The probability in a certain region corresponds to the volume enclosed by the surface on that region. Either the total area (Fig. 4.5) or the total volume (Fig. 4.4) represents the probability of the sure event; that is, they are equal to 1.

Previously, in order to determine the probability of an event supported by a discrete domain, we added probabilities of elementary events. Now, in continuous domains, we substitute sums by areas and volumes. With the exception of this modification, we can continue to use the basic rules of Chap. 1. The normal probability function presented in the last section is in fact a probability density function. Here, as in other situations, one must bear in mind that a continuous distribution is merely a mathematical model of reality. In the measurement error example, it is easy to see that the values are discrete (for instance, there are no error values between 0.32 g and 0.68 g); the error set is not infinite, and the respective values do not constitute a continuum.

Fig. 4.5. Two cases of the probability density of shots on [0, 1]. (a) The density is larger at the center. (b) The density is uniform

4.4 Buffon's Needle

Georges-Louis Leclerc, Count of Buffon (1707–1788), is known for his remarkable contributions to zoology, botany and geology, and in particular for a range of bold ideas, many ahead of his time. For instance, he argued that natural causes formed the planets and that life on Earth was the consequence of the formation of organic essences resulting from the action of heat on bituminous substances. Less well known is his important contribution to mathematics, and particularly to probability theory: Buffon can be considered a pioneer in the experimental study and application of probability theory, a specific example of which is the famous Buffon's needle problem. It can be described as follows. Consider randomly throwing a needle of length l on a floor with parallel lines separated by a distance d ≥ l (think, for instance, of the plank separations of a wooden floor).

What is the probability that the needle falls over a line? We are dealing here with a problem of continuous probabilities, suggested to Buffon by a fashionable game of his time known as franc-carreau: the idea was to bet that a coin tossed over a tile floor would fall completely inside a tile. For the needle problem, Buffon demonstrated that the required probability is given by

P = 2l/(πd) .

For d = l, one then obtains 2/π. In other words, Buffon's needle experiment offers a way of computing π! The interested reader may find several Internet sites with simulations of Buffon's needle experiment and estimates of π using this method.

4.5 Monte Carlo Methods

It seems that Buffon did indeed perform long series of needle throws on a floor in order to verify experimentally the result that he had derived mathematically. Nowadays, we would hardly engage in such labor, since we dispose of a superb tool for carrying out any experiments relating to chance phenomena: the computer. The Monte Carlo method – due to the mathematician Stanislaw Ulam (1909–1984) – refers to any problem-solving method based on the use of random numbers, usually with computers.

Consider the problem of computing the shaded area of Fig. 4.6, circumscribed by a unit square. In this example, the formula of the curve delimiting the area of interest is assumed to be known, in this case y = e^(−x).

Fig. 4.6. Area under the curve y = e^(−x)
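Buffon's own experiment is easily replayed on a computer. A minimal simulation sketch (ours), using the standard geometric setup with l = d = 1, so that the crossing probability is 2/π:

```python
import math, random

random.seed(7)

def buffon(n, l=1.0, d=1.0):
    """Simulate n needle throws on a lined floor (needle length l,
    line spacing d, with l <= d); return the fraction crossing a line."""
    crossings = 0
    for _ in range(n):
        y = random.uniform(0, d / 2)              # centre's distance to nearest line
        theta = random.uniform(0, math.pi / 2)    # acute angle with the lines
        if y <= (l / 2) * math.sin(theta):
            crossings += 1
    return crossings / n

freq = buffon(200_000)        # approaches 2/pi, about 0.64, for l = d
pi_estimate = 2 / freq
print(freq, pi_estimate)
```

The needle crosses a line exactly when its centre lies closer to the line than half its projected width, which is what the `sin` test checks.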

A well-known method for computing the area is to divide the horizontal axis into sufficiently small intervals and add up the vertical 'slices' of the relevant region, approximating each slice by a trapezium. Such computation methods are deterministic: they use a well-defined rule, that is, mathematical operations. This allows us to obtain the value of the area as 1 − 1/e, and hence the value of e. However, it often happens that the curve formula is such that one cannot directly determine the value of the area, or the formula may even be unknown. In such cases one has to resort to numerical methods. In certain situations it may be more attractive (in particular, with regard to computation time) to use another approach that is said to be stochastic (from the Greek 'stokhastikós', for a phenomenon or method whose individual cases depend on chance), because it uses probabilities.

Let us investigate the stochastic or Monte Carlo method for computing the area. For this purpose we imagine that we throw points in a haphazard way onto the domain containing the interesting area (the square domain of Fig. 4.6). We assume that the randomly thrown points are uniformly distributed in the respective domain, that is, for a sufficiently large number of points any equal-area subdomain has an equal probability of being hit by a point. Let d be the number of points falling inside the relevant region and n the total number of points thrown. Then, for sufficiently large n, the ratio d/n is an estimate of the relevant area with a given accuracy (according to the law of large numbers). Figure 4.7 shows sets of points obtained with a uniform probability density function. This method is easy to implement using a computer: we only need a random number generator, a function that supplies numbers as if they were generated by a random process.

Fig. 4.7. Examples of uniformly distributed points, thrown onto a unit square. (a) 100 points. (b) 4000 points
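Both routes, the deterministic trapezium rule and the stochastic point-throwing, are easy to try side by side; a sketch in Python (ours, not the book's):

```python
import math, random

random.seed(3)

# Deterministic route: trapezium rule for the area under y = exp(-x) on [0, 1]
steps = 1000
h = 1 / steps
trap = sum((math.exp(-i * h) + math.exp(-(i + 1) * h)) / 2 * h
           for i in range(steps))

# Stochastic route: throw points uniformly into the unit square and count
# those falling under the curve
n = 200_000
d = sum(1 for _ in range(n) if random.random() < math.exp(-random.random()))
mc = d / n

true_area = 1 - 1 / math.e            # about 0.632
e_estimate = 1 / (1 - mc)             # invert the estimated area to recover e
print(trap, mc, e_estimate)
```

The trapezium sum is accurate to many decimal places with only a thousand slices, while the Monte Carlo estimate wanders around the true area, narrowing in only slowly as n grows.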

(One could also use roulette-generated numbers, whence the name Monte Carlo.) Many programming languages provide random (in fact, pseudo-random) number generating functions. The randomness of the generated numbers (the quality of the generator) has an influence on the results of Monte Carlo methods.

Figure 4.8 shows the dependence e(n) of the estimate of e on n in six Monte Carlo trials corresponding to Fig. 4.6. The estimate tends to stabilize around the value of e (dotted line) for sufficiently large n; one sees that the law of large numbers is also at work here. Monte Carlo methods as described here are currently used in many scientific and technical areas: aerodynamic studies, video games, special effects in movies, forecasts involving a large number of variables (meteorology, economics, etc.) and optimization problems, to name but a few.

Fig. 4.8. Six Monte Carlo trials for the estimation of e

4.6 Normal and Binomial Distributions

If the reader compares the formula approximating the binomial distribution, found by Abraham de Moivre, with the probability density function of the normal distribution, found by Laplace and Gauss, (s)he will immediately notice their similarity. As a matter of fact, for large n, the binomial distribution is well approximated by the normal one, if one uses

x = k − m , m = np , ν = np(1 − p) .

We can now go back to the initial problem of this chapter. Let us then consider the normal law using

m = np = 1000 × 0.0055 = 5.5 , ν = np(1 − p) = 5.5 × 0.9945 = 5.47 ,

whence the standard deviation is s = √5.47 = 2.34. For n = 1000 the approximation of the normal law to the binomial law is excellent (it is usually considered good for most applications when n is larger than 20 and the product np is not too low, conditions satisfied in our example). We then seek the value of P(more than 10 events in 1000), corresponding to the shaded area of Fig. 4.9. There are tables and computer programs that supply values of the normal p.d.f. and areas.¹ In this way, one can obtain the value of the area as 0.01. The probability of finding more than 10 infected people among the 1000 hospital patients is therefore 1%.

The mathematical expectation (mean) of the normal distribution is the same as that of the binomial distribution: m = np. The variance is np(1 − p), whence the standard deviation is s = √(np(1 − p)).

¹ Assuming zero mean and unit variance; simple manipulations are required to find the area for other values.

Fig. 4.9. The value of P(more than 10 events in 1000) (shaded area) using the normal approximation

4.7 Distribution of the Arithmetic Mean

When talking about the measurement error problem we mentioned the arithmetic mean as a better estimate of the desired measurement. We are now in a position to analyze this matter in greater detail. Consider a sequence of n values, obtained independently of each other and originating from identical random phenomena.
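Coming back for a moment to the 1% figure: instead of tables, one can use the complementary error function from Python's standard library. A sketch (ours; we take 'more than 10' to mean 11 or more, our assumption about where the shaded area starts):

```python
import math

m = 1000 * 0.0055                 # mean, 5.5
v = m * (1 - 0.0055)              # variance, about 5.47
s = math.sqrt(v)                  # standard deviation, about 2.34

def normal_tail(a, m, s):
    """P(X > a) for a normal variable with mean m and standard deviation s."""
    return 0.5 * math.erfc((a - m) / (s * math.sqrt(2)))

p_more_than_10 = normal_tail(11, m, s)
print(round(p_more_than_10, 4))   # about 0.01, i.e. 1%
```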

Such is the usual measurement scenario: no measurement depends on the others and all of them are under the influence of the same random error sources. An important result, independently discovered by Laplace and Gauss, is the following: if each measurement follows the normal distribution with mean (mathematical expectation) m and standard deviation s, then the arithmetic mean (the computed average), itself a random variable, also follows the normal distribution, with the same mean (mathematical expectation) and with standard deviation given by s/√n. We may then interpret the mathematical expectation as the 'true value of the arithmetic mean' ('the true weight value for the necklace') that would be obtained had we performed an infinite number of measurements. As we increase the number of measurements, the distribution of the arithmetic mean gets gradually more concentrated, forcing it to converge rapidly (in probability) to the mathematical expectation.

Let us assume that in the necklace weighing example some divinity had informed us that the true necklace weight was 45 g and the true standard deviation of the measurement process was 1.5 g. The standard deviation of the arithmetic mean of 10 measurements is then s/√n = 1.5/√10 = 0.47 g, about three times smaller; the p.d.f. of the arithmetic mean is therefore about three times more concentrated, as illustrated in Fig. 4.10, where the solid line shows the normal p.d.f. of the error relative to one measurement, while the dotted line shows the p.d.f. of the arithmetic mean of 10 measurements. Both have the same mathematical expectation of 45 g.

Fig. 4.10. Weighing a necklace: p.d.f. of the error in a single measurement (solid curve) and for the average of 10 measurements (dotted curve)

Suppose we wish to determine an interval of weights around the true weight value into which one measurement would fall with 95% probability. It so happens that, for a normal p.d.f. and under the above conditions, such an interval starts (to a good approximation) at the distribution mean minus two standard deviations and ends at the mean plus two standard deviations.
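The s/√n concentration is easy to observe by simulation; a sketch (ours) using the divinity-given values of the example, 45 g and 1.5 g:

```python
import random, statistics

random.seed(5)

true_weight, s = 45.0, 1.5      # the 'divinity-given' values from the text

# Average of 10 measurements, repeated many times
averages = [
    statistics.fmean(random.gauss(true_weight, s) for _ in range(10))
    for _ in range(20_000)
]

spread = statistics.pstdev(averages)   # approaches s / sqrt(10), about 0.47
print(spread)
```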

Thus, for the necklace example, the interval is 45 ± 3 g (3 = 2 × 1.5). If we take the average of 10 measurements, then for the same degree of certainty the interval is reduced to 45 ± 0.94 g (0.94 = 2 × 0.47). Whereas for one measurement one has a 95% degree of certainty that it will fall in the interval 45 ± 3 g, for the average of 10 measurements the weight estimate is clearly better: in the first case, 95 out of 100 measurements will produce results that may be up to 3 g away from the true value; in the second case, 95 out of 100 averages of 10 measurements will produce results that are at most about 1 gram away. Computing averages of repeated measurements is clearly an advantageous method of obtaining better estimates.

The difficulty with the above calculations is that no divinity tells us the true values of the mean and standard deviation. (If it did, we would not have to worry about this problem at all.) What can then be said when computing the arithmetic mean and standard deviation of a set of measurements? For the necklace example one computes, for the fifty measurements, a mean of 45.68 g and a standard deviation of s = 1.61 g. The standard deviation of the arithmetic mean is therefore 1.61/√50 = 0.23 g, and the interval corresponding to two standard deviations is thus 45.68 ± 0.46 g. This interval is usually called the 95% confidence interval of the arithmetic mean, and for this reason it is widely used in practice. It can be shown that, whatever the true values of the mean and standard deviation may be, in a large number of trials nearly 95% of the intervals 'arithmetic mean plus or minus two standard deviations' contain the true mean value (the value that only the gods know).

Supposing we computed the arithmetic mean and the standard deviation of 10 measurements many times, we would get a situation like the one illustrated in Fig. 4.11, which shows, for every 10-measurement set, the arithmetic mean (solid circle) plus or minus two standard deviations (double-ended segment) computed from the respective mean square errors, with trials numbered 1 through 15. The arithmetic means and standard deviations vary for each 10-measurement set; 14 out of the 15 segments, i.e., 93% of the segments, intersect the line corresponding to the true mean value. What does it mean, then, to say that the 95% confidence interval of the necklace's measured weight is 45.68 ± 0.46 g? Does it mean that we are 95% sure that the true value is inside that interval?

8 The Law of Large Numbers Revisited There is one issue concerning the convergence of means to mathematical expectations that has still not been made clear enough: when do we consider a sequence of experiments to be suﬃciently large to ensure an acceptable estimate of a mathematical expectation? The notion of conﬁdence interval is going to supply us with an answer. We may say that it is an atypical measurement set. In fact.11 can help us to work out the correct interpretation. When computing the 95% conﬁdence interval. if the true weight were 45 g and the true standard deviation were 3 g. Consider the necklace-weighing example. The interpretation to be given to the conﬁdence interval is thus as follows: when proposing the 95% conﬁdence interval for the true mean.82 4 The Wonderful Curve Fig. We may have had such bad luck that the 50-measurement set obtained is like set 8 in Fig.11. Computed mean ±2 standard deviations in several trials (numbered 1 to 15) of 10 measurements 95% sure that the true value is inside that interval? This is an oft-stated but totally incorrect interpretation.11. What is an ‘acceptable estimate’ of the necklace weight? It all depends on the usefulness of our . the average of the 50 measurements would fall in the interval 45 ± 0.94 g. 4. The interpretation is quite diﬀerent and much more subtle. 4. Whatever the true values of the mean and standard deviation may be. we are only running a 5% risk of using an atypical interval that does not contain the true mean. and not in 45. we have already seen that.68 ± 0. For this set there is 100% certainty that it does not contain the true mean (the true weight). we are computing an interval that will not contain the true mean value only 5% of the time. there will always exist 50-measurement sets with conﬁdence intervals distinct from the conﬁdence interval of the true mean value. Figure 4. a set that does not contain the true mean value. with 95% certainty. 4.46 g.

. Now. maybe we can tolerate a deviation of 2 g between our estimate and the true weight and maybe one single weighing is enough. (Note once again the division by n.) We thus arrive at a vicious circle: in order to estimate p. it is easy to check that the worst case (largest spread) occurs for p = 0.e. corresponding to n = 16 measurements. i. where p is the probability we wish to estimate. implying n = 10 000.8 The Law of Large Numbers Revisited 83 estimate.01.05. To clear this matter up we wish to determine how many times we must fairly throw the die in order to obtain an acceptable estimate of face 6 turning up with 95% conﬁdence. it is possible to show that.05 g? The answer is now easy to obtain. for instance.5. because face 6 seems to turn up more often than would be expected. when k is well approximated by the normal distribution.25 =√ . we must √ have 2 × 0.4. does not contain the true p value. Suppose we throw the die n times and face 6 turns up k times. which for large n is well approximated by the normal distribution with mean np and variance np(1 − p). only after 10 000 throws do we attain a risk of no more than 5% that the conﬁdence interval f ± 0. we need to know the standard deviation which depends on p. We know that k follows the binomial distribution. Since we saw that a 95% conﬁdence level corresponds to two standard deviations we have √ 1 0. Assuming a 0.01 is atypical.1/ n = 0.1 g standard deviation. The easiest way out of this vicious circle is to place ourselves in a worst case situation of the standard deviation. say 0. then so is f . 2× √ n n √ We just have√ equate this simple 1/ n formula to the required 1% to tolerance: 1/ n = 0. If the necklaces are made of diamonds we will certainly require a smaller tolerance of deviations. Hence. and with mean p and standard √ √ deviation p(1 − p)/ n. We assume that an acceptable estimate is one that diﬀers from the true probability value by no more than 1%. .05 g (1/4 of carat). 
We may then ask: how many measurements of a diamond necklace must we perform to ensure that the 95% conﬁdence interval of the measurements does not deviate by more than 0. The variable we use to estimate p is the frequency of occurrence of face 6: f = k/n. a die we suspect has been tampered with. If we want to select pearl necklaces. Given its formula. Let us now consider the problem of estimating the probability of an event. Consider.
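The worst-case reasoning above is easily packaged as a small helper (a sketch; the function name is ours):

```python
import math

def throws_needed(tolerance, confidence_sds=2, p_worst=0.5):
    """Worst-case number of trials so that the confidence interval half-width
    confidence_sds * sqrt(p(1-p)/n) does not exceed the given tolerance."""
    n = (confidence_sds * math.sqrt(p_worst * (1 - p_worst)) / tolerance) ** 2
    return math.ceil(n)

print(throws_needed(0.01))                 # the die example: 10 000 throws
print(math.ceil((2 * 0.1 / 0.05) ** 2))    # the diamond necklace: 16 weighings
```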

4.9 Surveys and Polls

Every day we see reports of surveys and polls in the newspapers and media. Consider, for instance, the following report of an investigation carried out amongst the population of Portugal: having interviewed 880 people by phone, 125 answered 'yes', while the rest answered 'no'; the 'yes' percentage is, therefore, 14.2%, with an error of 3.4% for a confidence degree of 95%.

The vast majority of people have trouble understanding what this means, so let us analyze the meaning of the survey report. Does it mean that the 'yes' percentage is 14.2%, but 3.4% of the Portuguese population will change the 14.2% percentage? Maybe it means that only 3.4% of the people interviewed filled in the survey forms in the wrong way or gave 'incorrect' answers to the interviewers? Is it the case that only 14.2 − 3.4 = 10.8% of those interviewed deserve 95% confidence (i.e., will not change their opinion)? Or does it perhaps mean that we are 95% sure (however this may be interpreted) that the 'yes' vote of the Portuguese population will fall in the interval from 10.8% to 17.6%? Indeed, many and varied interpretations seem to be to hand.

First of all, everything hinges on estimating a certain variable: the proportion of Portuguese that vote 'yes' (with regard to some question). That estimate is based on an analysis of the answers from a certain set of 880 people, selected for that purpose. One usually refers to a set of cases from a population, studied in order to infer conclusions applicable to the whole population, as a sample. In the previous section we estimated the probability of turning up a 6 with a die by analyzing the results of 10 000 throws. Instead of throwing the same die 10 000 times, we could of course consider 10 000 people throwing a die; if the throws are made under the same conditions, as far as we can guarantee an identical throwing method and equivalent dice, it makes no difference whether we consider performing experiments in time or across a set of people. For instance, imagine robots performing the throws instead of people. But one difficulty with surveys of this kind is that people are not robots. They are unequal with regard to knowledge, education, standard of living, geographical position, and many other factors. We must therefore assume that these are sources of uncertainty and that the results we hope to obtain reflect all these sources of uncertainty. For instance, if the survey only covers illiterate people, its result will not reflect the source of uncertainty we refer to as 'level of education': any conclusion drawn from the survey would only apply to the illiterate part of the Portuguese population.

In order to reﬂect what is going on in the Portuguese population.9 Surveys and Polls 85 the survey would only apply to the illiterate part of the Portuguese population. for instance. Suppose now that in an electoral opinion poll. We are not rid of the fact that. by random choice of phone numbers. . an atypical set of 880 people may have been surveyed. We now imagine seven urns. It is more usual to use other ‘almost random’ selection rules. performed under the same conditions as in the preceding survey. We are only 95% sure that our conﬁdence interval is a typical interval (one of the many.6%.8% to 17. respectively: 40. Suppose we possess information concerning the 7 million identity cards of the Portuguese adult population. The interpretation is completely diﬀerent. the people in our sample must reﬂect all possible nuances in the sources of uncertainty. which therefore intersect the true values of the respective percentages. for which the conﬁdence interval does not contain the true value of the proportion. people are not selected in this way when carrying out such a survey. The penalty is that in this case the survey results do not apply to the whole Portuguese population but only to the population that has a telephone (the source of uncertainty referred to as ‘standard of living’ has thus been reduced). This is the sampling problem which arises when one must select a sample. in the sense that if we repeated the survey many times.4% ‘error’. We may number the identity cards from 1 to 7 000 000. by bad luck (occurring 5% of the time in many repetitions).8%. corresponding to the millionth digit. One way to achieve this is to select the people randomly.0337.√ is half of the an It width of the estimated conﬁdence interval: 1/ n = 1/ 880 = 0. But is this necessarily so? Assuming typical conﬁdence intervals. one often hears people drawing the conclusion that it is almost certain (95% certain) that party A will win over party B. 
6 with 10 balls numbered 0 through 9 and the seventh with 8 balls numbered 0 through 7. In second place. as we have already seen in the last section. the following voting percentages are obtained for political parties A and B. For obvious practical reasons. 95%. the so-called ‘error’ is not √ error. Since the diﬀerence between these two values is well above the 3.4. of the intervals that in many random repetitions of the survey would contain the true value of the ‘yes’ proportion of voters). Random extraction with replacement from the 7 urns would provide a number selecting an identity card and a person.5% and 33. 95% of the time the percentage estimate of ‘yes’ answers would fall in that interval. We are not therefore 95% sure that the probability of a ‘yes’ falls into the interval from 10.
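Before answering, the numbers themselves are easy to check. The following is a minimal sketch (the variable names are ours, and the poll figures — 40.5%, 33.8%, 3.4% 'error' — are the ones used in this section) of why the 'error' alone does not settle the race:

```python
import math

moe = 1 / math.sqrt(880)                 # half-width of the 95% interval: about 3.4%
assert round(moe, 3) == 0.034

a_low, a_high = 40.5 - 3.4, 40.5 + 3.4   # 95% interval for party A (in %)
b_low, b_high = 33.8 - 3.4, 33.8 + 3.4   # 95% interval for party B (in %)

# The two intervals overlap: the true B percentage may exceed the true A one.
assert a_low < b_high                    # 37.1 < 37.2
```

The overlap of the two intervals is exactly the point made in the text that follows.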

Assuming typical confidence intervals, which therefore intersect the true values of the respective percentages, it may happen that the true percentage voting A is 40.5 − 3.4 = 37.1%, whereas the true percentage voting B is 33.8 + 3.4 = 37.2%. In other words, it may happen that votes for B will outnumber votes for A! In conclusion, we will only be able to discriminate the voting superiority of A over B if the respective estimates are set apart by more than the width of the confidence interval (6.8% in this case). If we want a tighter discrimination, say 2%, the 'error' will then have to be 1%, that is, 1/√n = 0.01, implying a much bigger sample: n = 10 000!

4.10 The Ubiquity of Normality

Many random phenomena follow the normal distribution law, namely those that can be considered as emerging from the addition of many causes acting independently from one another, even when such causes have non-normal distributions. Why does this happen? The answer to this question has to do with a famous result known as the central limit theorem, whose first version was presented by Abraham de Moivre; afterwards the theorem was generalized in several ways. There is a geometrical construction that can quickly convince us of this result. First, we have to know how to add two random variables. Take a very simple variable, with only two values, 0 and 1, exactly as if we tossed a coin in the air. Let us denote this variable by x1 and let us assume that the probabilities are P(x1 = 0) = 0.4 and P(x1 = 1) = 0.6. Supposing we add to x1 an identical and independent variable, denoted x2, what would be the probability function of x1 + x2? The possible values of the sum are 0, 1 and 2. The value 0 only occurs when the value of both variables is 0; as the variables are independent, the corresponding probability of x1 + x2 = 0 is 0.4 × 0.4 = 0.16. The value 1 occurs when x1 = 0 and x2 = 1 or, in the opposite way, x1 = 1 and x2 = 0; thus, P(x1 + x2 = 1) = 0.4 × 0.6 + 0.6 × 0.4 = 0.48. Finally, x1 + x2 = 2 only occurs when the value of both variables is 1; thus, P(x1 + x2 = 2) = 0.6 × 0.6 = 0.36. One thus obtains the probability function of Fig. 4.12b.

Let us now see how we can obtain P(x1 + x2) with a geometric construction. For this purpose we imagine that the P(x1) graph is fixed, while the P(x2) graph, reflected as in a mirror, 'walks' in front of the P(x1) graph in the direction of increasing x1 values. In the leftmost graph of Fig. 4.13, the P(x2) reflected graph overlaps the P(x1) graph at the point 0; one thus obtains the product 0.4 × 0.4, which is the value of P(x1 + x2) at point 0. Along the 'walk' we add the products of the overlapping values of P(x1) and the reflected P(x2). The central graph of Fig. 4.13 corresponds to a more advanced phase of the 'walk': there are now two overlapping values, whose sum of products is P(x1 + x2 = 1) = 0.4 × 0.6 + 0.6 × 0.4 = 0.48. In the rightmost graph of Fig. 4.13, the P(x2) reflected graph only overlaps at point 2, and one obtains, by the same process, the value P(x1 + x2 = 2) = 0.6 × 0.6 = 0.36, shown with white bars in Fig. 4.12b. From this point on there is no overlap and P(x1 + x2) is zero; likewise, while the reflected graph still lags behind the P(x1) graph, there is no overlap and the product is zero. The operation we have just described is called convolution. If the random variables are continuous, the convolution is performed with the probability density functions (one of them is reflected), computing the area below the graph of the product in the overlap region.

Fig. 4.12. Probability function of (a) a random variable with only two values and (b) the sum of two random variables with the same probability function

Fig. 4.13. The geometric construction leading to Fig. 4.12b
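The 'walk' just described is an ordinary discrete convolution, and can be reproduced in a few lines. A minimal sketch (the function and variable names are ours):

```python
def convolve(p: list[float], q: list[float]) -> list[float]:
    """Probability function of the sum of two independent discrete
    variables with probability functions p and q (discrete convolution)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj      # add products of overlapping values
    return out

p_x1 = [0.4, 0.6]                      # P(x1 = 0), P(x1 = 1)
p_sum = convolve(p_x1, p_x1)           # probability function of x1 + x2
assert [round(v, 2) for v in p_sum] == [0.16, 0.48, 0.36]
```

The three values returned are exactly those obtained by the geometric construction above.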


But what has this to do with the central limit theorem? As unbelievable as it may look, if we repeat the convolution with any probability functions or any probability density functions (with rare exceptions), we will get closer and closer to the Gauss function! Figure 4.14 shows the result of adding variables as in the above example, for several values of the number of variables. The scale on the horizontal axis runs from 0 to 1, with 1 corresponding to the maximum value of the sum. Thus, for instance in the case n = 3, the value 1 of the horizontal axis corresponds to the situation where the value of the three variables is 1 and their sum is 3; the circles signaling the probability values are therefore placed on 0 × (1/3), 1 × (1/3), 2 × (1/3), and 3 × (1/3). The construction is the same for other values of n. Note that in this case (variables with the same distribution), the mean of the sum is the mean of any distribution multiplied by the number of added variables: 3 × 0.6 = 1.8, 5 × 0.6 = 3, 10 × 0.6 = 6, 100 × 0.6 = 60. In the graphs of Fig. 4.14, these means are placed at 0.6, given the division by n (standardization to the interval 0 through 1).

Fig. 4.14. Probability functions (p.f.) of the sum of random variables (all with the p.f. of Fig. 4.12a). Note how the p.f. of the sum gets closer to the Gauss curve for increasing n
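The approach to the Gauss curve can also be checked numerically by repeating the convolution. A sketch (our own code, using the same two-valued variable as in Fig. 4.12a); instead of plotting, it verifies that the sum of n = 100 variables has mean n × 0.6 and variance n × 0.24 (0.24 being the variance of a single variable), as the central limit theorem setting requires:

```python
def convolve(p, q):
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

p1 = [0.4, 0.6]            # single variable: P(0) = 0.4, P(1) = 0.6
p = p1
for _ in range(99):        # probability function of the sum of 100 variables
    p = convolve(p, p1)

mean = sum(k * pk for k, pk in enumerate(p))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
assert abs(mean - 100 * 0.6) < 1e-6    # mean of the sum: 60
assert abs(var - 100 * 0.24) < 1e-6    # variance of the sum: 24
```

With the mean and variance fixed in this way, the resulting probability function is the bell-shaped curve of Fig. 4.14.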


Fig. 4.15. The normal distribution on parade

In conclusion, it is the convolution law that makes the normal distribution ubiquitous in nature. In particular, when measurement errors depend on several sources of uncertainty (as we saw in the section on measurement errors), even if each one does not follow the normal law, the sum of all of them is in general well approximated by the normal distribution. Besides measurement errors, here are some other examples of data whose distribution is well approximated by the normal law: heights of people of a certain age group; diameters of sand grains on the beach; fruit sizes on identical trees; hair diameters for a certain person; defect areas on surfaces; chemical concentrations of certain wine components; prices of certain shares on the stock market; and heart beat variations.

4.11 A False Recipe for Winning the Lotto

In Chap. 1 we computed the probability of winning the first prize in the lotto as 1/13 983 816. Is there any possibility of devising a scheme to change this value? Intuition tells us that there cannot be, and rightly so, but over and over again there emerge 'recipes' with differing degrees of sophistication promising to raise the probability of winning. Let us describe one of them. Consider the sum of the 6 randomly selected numbers (out of 49). For reasons already discussed, the distribution of this sum will closely follow the normal distribution around the mean value 6 × (1 + 49)/2 = 150 (see Fig. 4.16). It then looks like a good idea to choose the six numbers in such a way that their sum falls into a certain interval around the mean, or is equal to the mean itself, i.e., 150. Does this work?
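The figures involved — and the count of sum-150 combinations used below — can be checked by exhaustive counting. A sketch (the dynamic-programming helper and its name are ours, not the book's):

```python
import math

def sum_counts(n_max: int = 49, k: int = 6) -> list[int]:
    """counts[s] = number of k-element subsets of {1..n_max} whose sum is s,
    computed by dynamic programming over the numbers 1..n_max."""
    max_sum = sum(range(n_max - k + 1, n_max + 1))   # largest possible sum (279)
    # dp[j][s] = number of j-element subsets of the numbers seen so far with sum s
    dp = [[0] * (max_sum + 1) for _ in range(k + 1)]
    dp[0][0] = 1
    for num in range(1, n_max + 1):
        for j in range(min(k, num), 0, -1):          # descending: each number used once
            for s in range(max_sum, num - 1, -1):
                dp[j][s] += dp[j - 1][s - num]
    return dp[k]

counts = sum_counts()
assert sum(counts) == math.comb(49, 6) == 13_983_816   # total number of possible bets
assert counts[150] == 165_772                          # combinations summing to 150
assert counts[149] == counts[151]                      # the distribution is symmetric about 150
```

The symmetry follows from the substitution of every number x by 50 − x, which maps a combination of sum s into one of sum 300 − s.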

Fig. 4.16. Histogram of the sum of the lotto numbers in 20 000 extractions

First of all, let us note that it is precisely around the mean that the number of combinations of numbers with the same sum is huge. For instance, suppose we chose a combination of lotto numbers with sum precisely 150. There are 165 772 possible combinations of six numbers between 1 and 49 whose sum is 150. But this is a purely technical detail. Let us proceed to an analysis of the probabilities by considering the following events: A The sum of the numbers of the winning set is 150. B The sum of the numbers of the reader’s bet is 150. C The reader wins the ﬁrst prize (no matter whether or not the sum of the numbers is 150). Let us suppose the reader chooses numbers whose sum is 150. The probability of winning is then P (C and A) = P (C if A) × P (A) =

1 165 772 1 × = . 165 772 13 983 816 13 983 816

Therefore, the fact that the reader has chosen particular numbers did not change the probability of the reader winning the ﬁrst prize. On the other hand, we can also see that P (C if B) = = P (B and C) P (B if C) × P (C) = P (B) P (B) 1 (165 772/13 983 816) × (1/13 983 816) = . 165 772/13 983 816 13 983 816


Thus P (C if B) = P (C), that is, the events ‘the sum of the numbers of the reader’s bet is 150’ and ‘the reader wins the ﬁrst prize’ are independent. The probability of winning is the same whether or not the sum of the numbers is 150! As a matter of fact, this recipe is nothing else than a sophisticated version of the gambler’s fallacy, using the celebrated normal curve in an attempt to make us believe that we may forecast the outcomes of chance.
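The two calculations above can be redone in exact arithmetic. A sketch using Python's `fractions` module (event names as in the text; the count 165 772 is the one quoted above):

```python
from fractions import Fraction
from math import comb

total = comb(49, 6)                 # 13 983 816 possible draws
sum150 = 165_772                    # draws whose six numbers sum to 150 (from the text)

P_A = Fraction(sum150, total)       # A: winning set sums to 150
P_C = Fraction(1, total)            # C: reader wins the first prize
P_C_if_A = Fraction(1, sum150)      # wins, given the winning set sums to 150

# P(C and A) = P(C if A) × P(A) = 1/13 983 816
assert P_C_if_A * P_A == Fraction(1, 13_983_816)

# P(C if B) = P(B if C) × P(C) / P(B), with P(B if C) = P(B) = 165 772/13 983 816
P_B = Fraction(sum150, total)
P_B_if_C = Fraction(sum150, total)
assert P_B_if_C * P_C / P_B == P_C  # B and C are independent events
```

No rounding is involved, so the independence of B and C appears as an exact equality.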


5 Probable Inferences

5.1 Observations and Experiments

We are bombarded daily with news and advertisements involving inferences based on sets of observations: healthy children drink yoghurt A; regular use of toothpaste B reduces dental caries; detergent X washes whiter; pharmaceutical product C promotes a 30% decrease in infected cells; recent studies leave no doubt that tobacco smoke may cause cancer; a US investigation revealed that exposure to mobile phone radiation causes an increase in cancer victims. All these statements aim to generalize a conclusion drawn from the analysis of a certain, more or less restricted data set. The underlying idea of all these statements is as follows: there are two variables, say X and Y, and one hopes to elucidate a measure of the relationship between them.

Consider the statement 'healthy children drink yoghurt A'. Variable X is 'healthy children' and variable Y is 'drink yoghurt A'. One may presuppose that the conclusion is based on a survey involving a limited number of children. Measurements of both variables are influenced by chance phenomena. The values of X can be established by means of clinical examinations of the children and consequent assignment of a rank to their state of health; however, such ranks are not devoid of errors, because any clinical analysis always has a non-zero probability of producing an incorrect value. In the same way, variable Y has an obvious random component with regard to the ingested quantities of yoghurt. In summary, X and Y are random variables, and we wish to elucidate whether or not there is a relation between them, notwithstanding the influence of chance. Will it be possible to draw some general conclusion about a possible relation between them? If the relation exists, it can either be causal, or otherwise.

In the latter case, several situations may occur, which we exemplify as follows for the statement 'healthy children drink yoghurt A':

1. A third variable is involved, viz., Z causes X and Y; e.g., Z = standard of living of the child's parents, which causes a 'good immune system in the children' and, at the same time, the habit of drinking yoghurt A.
2. There are multiple causes: Z and U together cause X; e.g., Z = standard of living of the child's parents, U = child's dietary regime.
3. There are remote causes in causal chains, namely those of a remote causality; e.g., G = happy parents causes 'joyful children', which causes 'children love reading comics', which causes 'children love comics containing the advertisement for yoghurt A'.

The influences assignable to situations 1, 2 and 3 cannot be avoided. In all the above advertisement examples, with the exception of the last one, the data are usually observational, that is, resulting from observations gathered in a certain population; many statements of the kind found in the media are based on observational data coming from surveys. Regarding the last statement, 'pharmaceutical product C promotes a 30% decrease in infected cells', there is a different situation: it is based on experimental data, in which the method for assessing the random variables is subject to a control, and where the conditions underlying the experiment are rigorously controlled in order to avoid the influence of any factors foreign to those one hopes to relate. The data gathered in physics experiments are also frequently experimental data.

In the case of the pharmaceutical industry, it is current practice to compare the results obtained in a group of patients to whom the drug is administered with those of another group of patients identical in all relevant aspects (age, sex, clinical status, etc.) to whom a placebo is administered (control group). In this way, one takes measures to avoid as far as possible the influence of further variables and multiple causes. In practice, it is always difficult to guarantee the absence of perturbing factors, often for ethical reasons (we cannot compel non-smokers to smoke or expose people with or without cancer to mobile phone radiation). When what is at stake is a causal link, it is absolutely essential to establish experimental conditions that allow one either to accept or to reject it. The great intellectual challenge in scientific work consists precisely in elucidating the interaction between experimental conditions and alternative hypotheses explaining the facts.

Statistics makes an important contribution to this issue.

5.2 Statistics

The works of Charles Darwin (1809–1882) on the evolution of species naturally aroused the interest of many zoologists of the day. One of them, the English scientist Walter Weldon (1860–1906), began the painstaking task of measuring the morphological features of animals, seeking to establish conclusions based on those measurements. He once said: ". . . the questions raised by the Darwinian hypothesis are purely statistical."

In his day, 'statistics' meant a set of techniques for representing, describing and summarizing large sets of data. The word seems to have come from the German 'Statistik' (originally from the Latin 'statisticum') or 'Staatskunde', meaning the gathering and description of state data (censuses, taxes, administrative management, etc.).

Coming back to Walter Weldon, the desire to analyze biometrical data led him to use some statistical methods developed by Francis Galton (1822–1911), particularly those based on the concept of correlation of deviations from the mean, thus pioneering the field of biometrics. (Galton is also known for his work on linear regression, a method for fitting straight lines to data sets. The idea here is to minimize the squares of the deviations from the straight line. This is the mean square error method, already applied by Laplace and Gauss.)

The studies of Weldon and Galton were continued by Karl Pearson (1857–1936), professor of applied Mathematics in London, generally considered to be the founder of modern statistics. Statistics is based on probability theory and, besides representational and descriptive techniques, its main aim is to establish general conclusions (inferences) from data sets. In this role it has become an inescapable tool in every area of science, as well as in other areas of study: social, political, administrative, entrepreneurial, etc. The celebrated English writer H. G. Wells (1866–1946) once said: "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

Statistics can be descriptive (e.g., a typical family produces five kilos of domestic waste daily; the average salary increased by 2.5% in . . . ; the main household commodities suffered a 3.1% inflation in . . . ; child mortality in third world countries has lately decreased, although it is still on average above 5%) or inferential, seeking to establish conclusions from the observed data.

We saw in Chap. 4 how an estimate of the mean allows one to establish conclusions (to infer) about the true value of the mean. We then drew the reader's attention to two essential aspects: the sampling process (e.g., was the enquiry on domestic waste sufficiently random to allow conclusions about the typical family?) and the sample size (e.g., was the number of people surveyed sufficiently large in order to extract a conclusion with a high degree of certainty, 95% or more?). Many misuses of statistics are related to these two essential features. There are, however, many other types of misuse, in particular when what is at stake is to establish a relation involving two or more chance phenomena.

Fig. 5.1. Statistical uncertainty

5.3 Chance Relations

Relations between two or more chance phenomena are encountered whenever we examine natural phenomena or everyday events. Let us consider the height and weight of adults. We all know that there is some relation between height and weight: in general, taller people are heavier than shorter ones, although there are exceptions. Figure 5.2 shows the values of these two random variables (in kilos and meters, respectively) measured for a hundred people; the average weight and height are marked by broken lines. Let us now observe the two cases lying on the dotted line: both have approximately the same above-average height; however, one of the cases has a weight above the average and the other a weight below the average. If there were no relation between height and weight, we would get many pairs of cases like the ones lying on the dotted line.

That is, the simple fact of knowing that someone has an above-average height would not tell us, on average, anything about the person's weight. The situation would then be as pictured in Fig. 5.3: a cloud of points of almost circular shape. But that is not what happens. Figure 5.2 definitely shows us that there is a relationship of some kind between the two variables: the points tend to concentrate around an upward direction. If for the hundred people we were to measure the waist size, we would also find a certain relation between this variable and the other two. There is then a strong temptation to say that one phenomenon is 'the cause' and the other 'the effect'. Before discussing this issue, let us first get acquainted with the usual way of measuring relationships between two chance phenomena.

Fig. 5.2. Height and weight of one hundred adults

Fig. 5.3. Fictitious height and weight

5.4 Correlation

One of the most popular statistical methods for analyzing the relation between variables is the correlation-based method. We shall now explain this by considering simple experiments in which balls are drawn out of one urn containing three balls numbered 1, 2 and 3. Let us denote the balls by B1 and B2, for the first and second extraction, respectively. Figure 5.4a shows a two-dimensional histogram for the extraction, with replacement, of two balls, repeated 15 times; the vertical bars in the histogram show the number of occurrences of each two-ball combination. We may assume that this was a game where B1 represented the amount in euros that a casino would have to pay the reader and B2 the amount in euros the reader would have to pay the casino. Counting, in the 15 rounds, the total number of times that each value from 1 to 3 for B1 occurs, we may then compute the reader's average gain:

average gain (B1) = 1 × 6/15 + 2 × 5/15 + 3 × 4/15 = 28/15 = 1.87 euros .

In a similar way one would compute the average loss (B2) to be 1.93 euros.

Let us make use of the idea, alluded to in the last section, of measuring the relation between B1 and B2 through their deviations from the mean. For that purpose, let us multiply together those deviations. For instance, when B1 has value 1 and B2 has value 3, the product of the deviations is (1 − 1.87) × (3 − 1.93) = −0.93; in this case the deviations of B1 and B2 were discordant (one above and the other below the mean) and the product is negative. When the values of B1 and B2 are simultaneously above or below the mean, the deviations are concordant and the product is positive. If there were no relation between B1 and B2, concordant and discordant products would tend to cancel out, and the average product of the deviations would be nearly 0. For Fig. 5.4a, we obtain the following sum of the products of the deviations: 2 × (1 − 1.87) × (1 − 1.93) + · · · + 2 × (3 − 1.87) × (3 − 1.93) = 1.87. The average product of the deviations is thus 1.87/15 = 0.124 (square euros). If gains and losses were two times greater (2, 4 and 6 euros instead of 1, 2 and 3 euros), we would find a different value for the average product of the deviations (four times greater), although the histogram in Fig. 5.4a would be similar in both cases, evidencing an equal relationship.

This is clearly not what we were hoping for: the same relationship should not yield a different measure just because of a change of scale. We solve this difficulty (making our relation measure independent of the measurement scale used) by dividing the average product of the deviations by the product of the standard deviations. Carrying out the computations of the standard deviations of B1 and B2, we get s(B1) = 0.81 euros and s(B2) = 0.85 euros. Denoting our measure by C, we get

C(B1, B2) = (average product of deviations of B1 and B2)/(product of standard deviations) = 0.124/(0.81 × 0.85) = 0.18 ,

which is now a dimensionless measure (no more square euros).

The histogram of Fig. 5.4b corresponds to n = 2000 extractions, with the vertical bars representing frequencies of occurrence f = k/n, where k denotes the number of occurrences of each pair in the n = 2000 extractions. We see that the values of f are practically the same: as we increase the number of extractions n, the number of occurrences of each pair approaches np = n/9. The measure C, which we have computed/estimated above for a 15-sample case, then approaches the so-called correlation coefficient between the two variables, of which the above value is a mere estimate. We may obtain the exact value of this coefficient by calculating the mathematical expectation of the product of the deviations from the true means:

E[(B1 − 2) × (B2 − 2)] = (1/9)[(1 − 2)(1 − 2) + (3 − 2)(3 − 2) + (1 − 2)(3 − 2) + (3 − 2)(1 − 2)] = (1/9) × (1 + 1 − 1 − 1) = 0 ,

the remaining terms being zero because the deviation of the value 2 is zero. Thus, for extraction with replacement of the two balls, the correlation is zero and we say that the variables B1 and B2 are uncorrelated. This is also the case in Fig. 5.3: there is no 'privileged direction' in the cloud of points. Incidentally, note that the variables B1 and B2 are independent, and it so happens that, whenever two variables are independent, they are also uncorrelated (the opposite is not always true).

Fig. 5.4. (a) Histogram of 15 extractions, with replacement, of two balls from one urn. (b) The same, for 2000 extractions

Let us now consider ball extraction without replacement. For n = 15, we obtained in a computer simulation the histogram of occurrences shown in Fig. 5.5a. Repeating the previous computations, the estimate of the correlation between the two variables is −0.59. The exact value is (see Fig. 5.5b)

C(B1, B2) (without replacement) = E[(B1 − 2) × (B2 − 2)]/(√(2/3) × √(2/3)) = (−1/3)/(2/3) = −0.5 .

In this situation the values of the balls are inversely related: if the value 3 (high) comes first, the probability that 1 (low) comes out the following time increases, and vice versa. B1 and B2 are negatively correlated.

Fig. 5.5. (a) Histogram of 15 extractions, without replacement, of two balls from one urn. (b) The corresponding probability function
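Both exact values can be checked by enumerating the equally likely pairs of extractions. A sketch in exact rational arithmetic (the helper function and its name are ours):

```python
from fractions import Fraction
from itertools import product, permutations

def cov_and_vars(pairs):
    """Exact covariance and variances for a list of equally likely (b1, b2) pairs."""
    n = len(pairs)
    m1 = Fraction(sum(b1 for b1, _ in pairs), n)
    m2 = Fraction(sum(b2 for _, b2 in pairs), n)
    cov = sum((b1 - m1) * (b2 - m2) for b1, b2 in pairs) / n
    v1 = sum((b1 - m1) ** 2 for b1, _ in pairs) / n
    v2 = sum((b2 - m2) ** 2 for _, b2 in pairs) / n
    return cov, v1, v2

# With replacement: all 9 ordered pairs from {1, 2, 3} -> zero covariance.
cov, v1, v2 = cov_and_vars(list(product([1, 2, 3], repeat=2)))
assert cov == 0

# Without replacement: the 6 ordered pairs of distinct balls.
cov, v1, v2 = cov_and_vars(list(permutations([1, 2, 3], 2)))
assert cov == Fraction(-1, 3) and v1 == v2 == Fraction(2, 3)
# correlation = cov / (sqrt(2/3) × sqrt(2/3)) = (-1/3) / (2/3) = -1/2
assert cov / Fraction(2, 3) == Fraction(-1, 2)
```

Since √(2/3) × √(2/3) = 2/3, the final division reproduces the −0.5 of the text exactly.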

(Note the diagonal of zeros in the histograms of Fig. 5.5, representing impossible combinations.) It is easily shown that the correlation measure can only take values between −1 and +1. We have already seen what a zero correlation corresponds to. The −1 and +1 values correspond to a total relation, where the data graph is a straight line. In the case of weight and height, a +1 correlation would mean that one could tell the exact weight of a person simply by knowing his/her height, and vice versa. A −1 correlation reflects a perfect inverse relation; in the urn experiment, this would be the case if the second ball extraction only depended on the first one according to the formula B2 = 4 − B1.

5.5 Significant Correlation

An important contribution of Karl Pearson to statistics was precisely the study of correlation and the establishment of criteria supporting the significance of the correlation coefficient (also known as Pearson's correlation coefficient). In the case of the urn represented in Fig. 5.4a, we got a correlation estimate of 0.18, when in fact the true correlation value was 0 (uncorrelated B1 and B2). The 0.18 value suggesting some degree of relation between the variables could be called an artifact, due only to the small number of experiments performed. Karl Pearson approached the problem of the 'significance' of the correlation value in the following way: assuming the variables to be normally distributed and uncorrelated, how many cases n would be needed to ensure that, with a given degree of certainty (for instance, in 95% of the experiments with n cases), one would obtain a correlation value at least as high as the one computed in the sample? In other words, how large must the sample be so that with high certainty (95%) the measured correlation is not due to chance alone (assuming normally governed chance)? The values of a significant estimate of the correlation for a 95% degree of certainty are listed in Table 5.1 for several values of n.

Let us go back to the correlation between adult weight and height. In the height–weight data set, n is precisely 100, and the computed value of the correlation coefficient is C(height, weight) = 0.83. Can this value be attributed any kind of significance? For n = 100, we have 95% certainty that a correlation at least as high as 0.2 is not attributable to chance. The value 0.83 is therefore, with 95% certainty, significant.
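For large samples, the thresholds of Table 5.1 below can be approximated by a simple rule. This is a sketch of our own (the rule r ≈ 1.96/√n is the standard large-sample approximation for a 95% level, not the book's exact computation, and it is poor for small n):

```python
import math

def r_significant(n: int, z: float = 1.96) -> float:
    """Approximate 95%-significance threshold for Pearson's r under the
    null hypothesis of uncorrelated, normally distributed variables."""
    return z / math.sqrt(n)

assert round(r_significant(100), 2) == 0.20   # matches Table 5.1 for n = 100
assert round(r_significant(200), 2) == 0.14   # matches Table 5.1 for n = 200
```

Note how the threshold shrinks like 1/√n: small samples demand very large correlations before chance can be ruled out.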

Table 5.1. Significant estimate of the correlation for a 95% degree of certainty

  n                          5      10     15     20     50     100    200
  Significant correlation    0.88   0.63   0.51   0.44   0.27   0.20   0.14

(In fact, the normality assumption is not important for n above 50.) Note once again the expression of a degree of certainty assigned to conclusions of random data analysis. Of course, there will always exist a non-null probability that a relation suggested by a significant correlation does not hold. It should not be forgotten that the statistical proof so widely used in scientific research is very different from standard mathematical proof, and in general there is other evidence besides correlation. As an example, we may think of (re)determining the Boyle–Mariotte law for ideal gases through the statistical analysis of a set of experiments producing results on the pressure–volume variation: in physics, the statistical proof applied to experimental results is often only one among many proofs, and the law itself – the inverse relation between pressure and volume – is supported by other evidence beyond the statistical evidence. This is an important aspect of statistical proof. If, for instance, we wish to prove the relation between tobacco smoke and lung cancer using the correlation, we cannot expect to observe all cases of smokers that have developed lung cancer.

5.6 Correlation in Everyday Life

Correlation between random variables is one of the most abused topics in the media and in careless research and technical reports. The way that results of analyses involving correlations are usually presented leads the normal citizen to think that a high correlation implies a cause–effect association. Such abuse already shows up with the idea that, if the correlation between two variables is zero, then there is no relation between them. This is false: correlation is a linear measure of association, so there may be some relation between the variables even when their Pearson correlation is zero. Moreover, the word 'correlation' itself hints at some 'intimate' or deep relation between two variables, whereas it is

nothing other than a simple measure of linear association, which may be completely fortuitous. Let us look at two examples.

Figure 5.6 shows the area (in ha) of forest burnt annually in Portugal during the period 1943 through 1978, as a function of the total number of annual forest fires. There is here an obvious cause–effect relationship, which translates into a high correlation of 0.9.

Let us now consider the tobacco smoke issue. The graph of Fig. 5.7 is based on the results of a study carried out in the USA in 1958, which involved over a hundred thousand observations of adult men aged between 50 and 69 years. The vertical axis represents the frequency of occurrence of lung cancer relative to the total number of detected cancer cases. The horizontal axis shows the average number of cigarettes smoked per day, from 0 up to 2.5 packets. The correlation is high: 0.95. The probability of obtaining such a correlation due to chance alone (in a situation of non-correlation and normal distributions) is only 1%. Does the significantly high value of the correlation have anything to do with a cause–effect relation? In fact, although lung cancer may occur in smokers in a way that is not uniquely caused by tobacco smoke, an intrinsic cause–effect type relation may sometimes exist. In truth, several complementary studies concerning certain substances (carcinogenic substances) present in the tobacco smoke point in that direction.

Fig. 5.6. Area (in ha) of annual burnt forest plotted against the number of forest fires in the period 1943 through 1978 (Portugal)

Fig. 5.7. Frequency of lung cancer (f.l.c.) and average number of smoked cigarettes in packets (USA study, 1958)

5.104 5 Probable Inferences but instead by the interaction of smoke with the ingestion of alcoholic drinks. The correlation is high. or other causes. concerns housing conditions in Boston. This is said to be a spurious correlation. likewise. This is a highly signiﬁcant value for the 506 cases (the probability of such a value due to chance alone is less than 1/1 000 000). looking at Fig. in this case. respectively. the correlation between the percentage of poor people (lower status of the population) and the median value of the owner-occupied homes is negative: −0. However. the weather conditions and the extent of the ﬁres. In many other situations. It is obvious that ice-creams do not cause death by drowning or vice-versa. viz. evaluated in diﬀerent residential areas.

even if we remove atypical cases from Fig. 5.8a, the correlation will remain practically the same.) Imagine now that we stumble on a newspaper column like the following: 'Enquiry results about the quality of life in Boston revealed that it is precisely the lower status of the population that most contributes to town pollution, measured by the concentration of nitric oxides in the atmosphere.' To add credibility to the column, we may even imagine that the columnist went to the lengths of including the graph of the two variables (something we very rarely come across in any newspaper!), as shown in Fig. 5.8c, in which one clearly sees that the concentration of nitric oxides (in parts per 10 million) increases with the percentage of poor people in the studied residential area. The correlation is high: 0.59. In the face of this supposed causal relation – poor people cause pollution! – we may even imagine politicians proposing measures for solving the problem, compelling poor people to pay an extra pollution tax. Meanwhile, the wise researcher will try to understand whether that correlation is real or spurious. He may, for instance, study the relation

Fig. 5.8. Graphs and correlations for pairs of variables in the Boston housing data set: (a) −0.74, (b) 0.59, (c) 0.59, (d) 0.76

between percentage of area used by industry and percentage of poor people in the same residential area and verify that a high correlation exists (0.6). He may likewise study the relation between percentage of area used by industry and concentration of nitric oxides (see Fig. 5.8d) and verify the existence of yet another high correlation, viz. 0.76. Complementary studies would then suggest the following causality diagram:

                        Industry
                       ↙        ↘
    Nitric oxide pollution      Poor residential zone implantation

In summary, the correlation triggering the comments in the newspaper column is spurious, and politicians should think instead of suitably taxing the polluting industries.

Contrary to a widely held opinion, correlation alone does not imply causality; it only hints at, suggests, or opens up possibilities. In order to support a causality relation, the data will have to be supplemented by other kinds of evidence, e.g. the right sequencing in time, even when using experimental data as in physics research. Besides the correlation measure, statistics also provides other methods for assessing relations between variables. For instance, in the Boston pollution case, one might establish average income ranks in the various residential areas and obtain the corresponding averages of nitric oxides; one could then use statistical methods for comparing these averages. The question is whether this or other statistical methods can be used to establish a causality inference. The answer is negative: the crux of the problem lies not in the methods, but in the data. Finally, there are many other examples of correlation abuse. Maybe the most infamous are those pertaining to studies that aim to derive racist or eugenic conclusions, interpreting in a causal way skull measurements and intelligence coefficients.
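The correlation measure discussed above, and the way a hidden third variable manufactures a spurious one, can be made concrete in a few lines. Below is a minimal Python sketch (illustrative, not from the book); the confounder here, labelled temperature, and all numeric settings are hypothetical choices of mine:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(1)
# A hidden third variable (summer temperature) drives both quantities;
# neither causes the other, yet they come out strongly correlated.
temperature = [random.gauss(20, 5) for _ in range(500)]
ice_creams = [10 * t + random.gauss(0, 20) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 2) for t in temperature]

print(pearson_r(ice_creams, drownings))  # clearly positive, yet spurious
```

The two causally unrelated series show a clearly positive correlation, which is exactly why a high r alone proves nothing about causation.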

6 Fortune and Ruin

6.1 The Random Walk

The coin-tossing game, modest as it may look, is nevertheless rather instructive when we wish to consider what may be expected in general from games, and in particular the ups and downs of a player's fortunes. But let us be quite clear: when we speak of games, we are referring to games that depend on chance alone. We exclude games in which a player can apply certain logical strategies to counter the previous moves of his or her opponent (as happens in many card games, chess, checkers, etc.). On the other hand, we nevertheless understand 'games' in a wide sense: results derived from the analysis of coin-tossing sequences are applicable to other random sequences occurring in daily life.

For this purpose we consider arbitrarily long sequences of throws of a fair coin and, instead of coding heads and tails respectively by 1 and 0, as we did in Chap. 4, we code them by +1 and −1. We may consider these values as winning or losing 1 euro. Let us compute the average gain along the sequence as we did in Chap. 4. Figure 6.1 shows six possible sequences of throws, exhibiting the stabilization of the average gain around zero with growing n. For every possible tossing sequence the average gain will evolve in a distinct way, but always according to the law of large numbers: for a sufficiently large number of throws n, the probability that the average gain deviates from the mathematical expectation (0) by more than a preset value is as small as one may wish.

Fig. 6.1. Six sequences of average gain in 4000 throws of a fair coin

Instead of average gains we may simply plot the gains, as in Fig. 6.2.

Fig. 6.2. Gain curves for the six sequences of Fig. 6.1

The analysis of the coin-tossing game can be carried out in an advantageous manner by resorting to a specific interpretation of the meaning of each toss. In the new interpretation of the game, instead of thinking in terms of gain, we think in terms of upward and downward displacements of a particle. We may then imagine that the throws are performed at the end of equal time intervals, say 1 second, and that the outcome of each throw determines how a point particle moves from its previous position: +1 (in a certain space dimension) if the outcome is heads, and −1 if it is tails. At the beginning the particle is assumed to be at the origin. We say that the particle has performed a random walk.

Let us substantiate this idea by considering the top curve of Fig. 6.2, whose first 10 throws form a sequence dominated by heads, with a few tails. The particle trajectory for the 10 mentioned throws is displayed in Fig. 6.3a. We may consider the six curves of Fig. 6.2 as corresponding to six random walks in one dimension: the particle either moves upwards or downwards along the vertical axis. We may also group the six curves of Fig. 6.2 in three pairs of curves representing random walks of three particles in a two-axis system: one axis represents up and down movements and the other right and left movements. Figure 6.3b shows the result for 1000 throws: the equivalent of three particles moving on a plane. We may also group
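The ±1 coding and the accumulated-gain (particle position) interpretation are easy to simulate. The following Python sketch is my own illustration of the setup, not code from the book; the seed and names are arbitrary:

```python
import random

def random_walk(n, seed=None):
    """Simulate n fair coin tosses coded as +1 (heads) / -1 (tails);
    return the particle's position (accumulated gain) after each step."""
    rng = random.Random(seed)
    position, path = 0, []
    for _ in range(n):
        position += rng.choice((+1, -1))
        path.append(position)
    return path

path = random_walk(4000, seed=42)
# The average gain after 4000 throws should lie close to the expectation 0.
print(path[-1] / 4000)
```

Plotting `path` against the step index reproduces one gain curve of the kind shown in Fig. 6.2; dividing each entry by its index gives the average-gain curves of Fig. 6.1.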

the curves in threes, obtaining three-dimensional random walks.

Fig. 6.3. (a) Random walk for ten throws. (b) Random walk of three particles in two dimensions

6.2 Wild Rides

The random walk, in one or more dimensions, is a good mathematical model of many natural and social phenomena involving random sequences, as in Brownian motion (first described by the botanist Robert Brown for pollen particles suspended in a liquid). Random walk analysis thus has important consequences in a broad range of disciplines, such as physics and economics, allowing an enhanced insight into their properties. The random walk, despite its apparent simplicity, possesses a great many amazing and counterintuitive properties, some of which we shall now present.

Let us start with the probability of obtaining a given walk for n time units (walk steps). As there are 2^n distinct and equiprobable walks, the probability of obtaining a given walk is 1/2^n. This probability quickly decreases with n. For only 100 steps the probability is 1/2^100 ≈ 0.000 000 000 000 000 000 000 000 000 000 8. For 160 steps, 1/2^160 ≈ 7 × 10^−49, it is roughly the probability of a random choice of one atom from all the matter making up the Earth. For 260 steps (far fewer than in Fig. 6.2), 1/2^260 ≈ 5 × 10^−79, it is roughly the probability of a random choice of one atom from all known galaxies!

Let us go back to interpreting the random walk as a heads–tails game, say between John and Mary. If heads turns up, John wins 1 euro from Mary; if tails turns up, he pays 1 euro to Mary. The game is thus equitable. We may assume that the curves of Fig. 6.2 represent John's accumulated gains. However, we observe something odd: there are several gain curves (5 out of 6) that stay above or below zero for long periods of time. Let us take a closer look at this.

We note that the number of steps for reaching zero gain will have to be even (the same number of upward and downward steps), say 2n. It can be shown that the probability of reaching zero gain for the first time in 2n steps is given by

P(zero gain for the first time in 2n steps) = C(2n, n) / (2^(2n) (2n − 1)) ,

where C(2n, n) denotes the number of ways of choosing n out of 2n steps. Using this formula, one observes that the probability of zero gain for the first time in 2 steps is 0.5, and for the first time in four steps is 0.125. Let us denote by P(2n) the sum

P(zero gain for the first time in 2 steps) + P(zero gain for the first time in 4 steps) + · · · + P(zero gain for the first time in 2n steps) .

This sum represents the probability of John's and Mary's gains having equalized at least once in 2n steps. For instance, the probability that a zero gain has occurred in four steps, whether or not it is the first time, is 0.625. It can be shown that P(2n) converges to 1, which means that at least one equalization is sure to occur for sufficiently high n. However, looking at the P(2n) curve in Fig. 6.4a, we notice that the convergence is rather slow. For instance, after 64 steps the probability that no gain equalization has yet occurred is 1 − 0.9 = 0.1. After 100 steps the situation has not improved much: the probability that no equalization has yet occurred is 1 − 0.92 = 0.08, which is still quite high! The random walk really does look like a wild ride. Contrary to common intuition, this means that, if ten games are being played simultaneously, the most probable situation (on average terms) is that in one of the games the same player will always be winning. Game rides really are frantic and violent. (The word 'random', coming from the old English 'randon', meaning a frantic and violent ride, has its origin in an old Frank word, 'rant'.)
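The first-return formula and the cumulative probability P(2n) quoted above can be checked numerically. Here is a minimal Python sketch (function names are mine, not the book's):

```python
from math import comb

def first_return(n):
    """P(zero gain for the first time in 2n steps) = C(2n, n) / (2^(2n) (2n - 1))."""
    return comb(2 * n, n) / (2 ** (2 * n) * (2 * n - 1))

def equalized_at_least_once(n):
    """P(2n): probability of at least one gain equalization within 2n steps."""
    return sum(first_return(k) for k in range(1, n + 1))

print(first_return(1))                   # 0.5
print(first_return(2))                   # 0.125
print(equalized_at_least_once(2))        # 0.625
print(1 - equalized_at_least_once(50))   # ~0.08: no equalization in 100 steps
```

The last line reproduces the slow convergence described in the text: even after 100 steps there is still roughly an 8% chance that the two players' gains have never been equal.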

Fig. 6.4. (a) Probability of at least a zero gain in 2n steps. (b) Probability of the last zero gain occurring after x

6.3 The Strange Arcsine Law

We have seen above how the probability of not obtaining gain equalization in 2n steps decreases rather slowly. Common intuition would tell us that in a long series of throws John and Mary would each be in a winning position about half of the time and that the name of the winner should swap frequently. However, random walk surprises become even more dramatic when one studies the probability of obtaining zero gain for the last time after a certain number of rounds, say 2k in a sequence of 2n rounds. Letting x = k/n represent the proportion of the total rounds corresponding to the 2k rounds (x varies between 0 and 1), it can be shown that this probability is given by the following formula:

P(last zero gain at time x in 2n rounds) ≈ 1 / (nπ √(x(1 − x))) .

For instance, the probability that the last zero gain has occurred in the middle of a sequence of 20 rounds (n = 10) is computed as follows:

P(last zero gain at the tenth of 20 rounds) ≈ 1 / (10 × 3.14 × √(0.5 × 0.5)) = 0.064 .

The curves of Fig. 6.4b show this probability for several values of n (the probabilities for x = 0 and x = 1 are given by another formula). Note that, for each n, only certain values of x are admissible (vertical bars shown in the figure for n = 10 and n = 20), with spacing 1/n; the bigger n is, the more values there are.

Let us now see what the curves in Fig. 6.4b tell us. Consider the events x < 0.5 and x > 0.5, in other words, 'the last zero gain occurred somewhere in the first half of 2n rounds' and 'the last zero gain occurred somewhere in the second half of 2n rounds', which is the same as saying 2k < n and 2k > n, respectively. Since the curves are symmetric, these events are equiprobable: P(x < 0.5) and P(x > 0.5) are equal [they correspond to adding up the vertical bars from 0 to 0.5 (exclusive) and from 0.5 (exclusive) to 1]. That is, with probability close to 1/2, no zero gain occurs during half of the game; the most probable situation is that one of the two, John or Mary, will spend half of the game permanently in the winning position, independently of the total number of rounds! Moreover, the curves in Fig. 6.4b show that the last time that a zero gain occurred is more likely to happen either at the beginning or at the end of the game!

In order to obtain the probability that the last zero gain occurred before and including a certain time x, which we will denote by P(x), one only has to add up the preceding probabilities represented in Fig. 6.4b for values smaller than or equal to x. For sufficiently large n (say, at least 100), it can be shown that P(x) is well approximated using the inverse of the sine function, called the arcsine function and denoted arcsin (see Appendix A):

P(last zero gain occurred at a time ≤ x) = P(x) ≈ (2/π) arcsin(√x) .

The probability of the last zero gain occurring before a certain time x is said to follow the arcsine law. Table 6.1 shows some values of this law, displayed in Fig. 6.5.

Fig. 6.5. Probability of the last zero gain occurring before or at x, according to the arcsine law

Imagine that on a certain day several pairs of players passed 8 hours playing heads–tails, tossing a coin every 10 seconds. The number of
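The arcsine approximation P(x) ≈ (2/π) arcsin(√x) is a one-liner to evaluate. The sketch below (my own naming, not the book's) reproduces the kind of values collected in Table 6.1:

```python
import math

def arcsine_law(x):
    """P(last zero gain occurred at a time <= x), for sufficiently long games."""
    return (2 / math.pi) * math.asin(math.sqrt(x))

print(round(arcsine_law(0.025), 3))  # 0.101: about one pair in ten
print(round(arcsine_law(0.205), 3))  # 0.299: about one pair in three
print(round(arcsine_law(0.5), 3))    # 0.5: symmetric about mid-game
```

Note the symmetry: arcsine_law(0.5) = 0.5, so the last equalization is as likely to fall in the first half of the game as in the second, while the extreme values of x are the most probable individually.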

tosses (8 × 60 × 6 = 2880) is sufficiently high for a legitimate application of the arcsine law.

Table 6.1. Values of the arcsine law

x      0.005  0.010  0.015  0.020  0.025  0.030  0.095  0.145  0.205  0.345  0.500
P(x)   0.045  0.064  0.078  0.090  0.101  0.111  0.199  0.249  0.299  0.400  0.500

Consulting the above table, we observe that on average one out of 10 pairs of players would obtain the last gain equalization during the first 12 minutes (0.025 × 2880 = 72 tosses), passing the remaining 7 hours and 48 minutes without having any swap of winner position! Moreover, one of each 3 pairs would pass the last 6 hours and 22 minutes without changing the winner!

6.4 The Chancier the Better

An amazing aspect of the convergence of averages to expectations is their behavior when the event probabilities change over a sequence of random experiments. Let us suppose for concreteness that the probabilities of a heads–tails type of game were changed during the game. For instance, John and Mary did a first round with a fair coin, a second round with a tampered coin producing on average two times more heads than tails, and so on. Denoting the probability of heads turning up in round k by pk, we would have p1 = 0.5, p2 = 0.67, and so on. The question is, since the probability of turning up heads changes in each round, will the arithmetic mean still follow the law of large numbers? The French mathematician Siméon Poisson (1781–1840) proved in 1837 that it does indeed, and that the arithmetic mean of the gains converges to the arithmetic mean of the p values, viz.

(p1 + p2 + · · · + pn) / n .

(Poisson is better known for a probability law that carries his name and corresponds to an approximation of the binomial distribution for a very low probability of the relevant event.) Meanwhile, common intuition would tell us that it might be more difficult to obtain convergence, the larger the variability of the p values. Yet the mathematician Wassily Hoeffding (1914–1991) demonstrated, more than a century after Poisson's result, that, contrary to intuition, the probability of a given deviation of the average of the x values from the average of the p values reaches a maximum when the pk are constant. In other words: the larger the variability of the p values, the better.
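Poisson's result — that the average outcome tracks the average of the changing pk — can be illustrated by simulation. In this Python sketch (a rough illustration of mine; the interval bounds, seed and names are arbitrary choices) the heads probability is redrawn uniformly each round:

```python
import random

def average_vs_mean_p(n, p_low, p_high, seed=0):
    """Toss n coins whose heads probability p_k is drawn uniformly from
    [p_low, p_high] each round (Poisson's setting); return the gap between
    the average outcome and the average of the p_k."""
    rng = random.Random(seed)
    ps = [rng.uniform(p_low, p_high) for _ in range(n)]
    xs = [1 if rng.random() < p else 0 for p in ps]
    return abs(sum(xs) / n - sum(ps) / n)

# Variable probabilities (Poisson's setting) ...
print(average_vs_mean_p(100_000, 0.3, 0.7))
# ... versus a constant probability: both gaps are small for large n.
print(average_vs_mean_p(100_000, 0.5, 0.5))
```

Both gaps shrink as n grows, in agreement with Poisson's law; Hoeffding's refinement says the constant-p case is actually the worst case for a given deviation.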

This counterintuitive result is illustrated in Fig. 6.6, where it can be seen that the convergence of the averages is faster with variable probabilities randomly selected between 0.3 and 0.7. It is as if one said that by increasing the 'disorder' during the game, one achieves order in the limit more quickly!

Fig. 6.6. Average gains in 20 sequences of 1000 rounds of the heads–tails game. (a) Constant probability pk = 0.5. (b) Variable probability pk uniformly random between 0.3 and 0.7

6.5 Averages Without Expectation

We saw how the law of large numbers assured a convergence of the average to the mathematical expectation. There is an aspect of this law that is easily overlooked: in order to talk of convergence, the mathematical expectation must exist. This fact looks so trivial at first sight that it does not seem to deserve much attention; indeed, at first sight, it does not seem possible that there could be random distributions without a mathematical expectation. But is it really so?

Consider two random variables with normal distribution and with zero expectation and unit variance. To fix ideas, suppose the variables are the horizontal position x and the vertical position y of rifle shots at a target, as in Chap. 4. We assume that the bull's-eye is a very small region around the point (x = 0, y = 0). Since x and y have zero expectation, the rifle often hits the bull's-eye on average terms. It may happen that a systematic deviation along y is present when a systematic deviation along x occurs. In order to assess whether this is the case, let us suppose that we computed the ratio z = x/y for 500 shots and plotted the corresponding histogram. If the deviation along the vertical

direction is strongly correlated with the deviation along the horizontal direction, we will get a histogram that is practically one single bar placed at the correlation value; for instance, if y is approximately equal to x, the histogram is practically a bar at 1. Now, consider the situation where x and y are uncorrelated. We will get a broad histogram such as the one in Fig. 6.7a. Looking at Fig. 6.7a, we may suspect that the probability density function of z is normal and with unit variance, like the broken curve also shown there. However, our suspicion is wrong, and the mistake becomes clearly visible when we look at the histogram over a sufficiently large interval, as in Fig. 6.7b. It now becomes clear that the normal curve is unable to approximate the considerably extended histogram tails. For instance, for z exceeding 5 in absolute value, we would expect to obtain, according to the normal law, about 6 shots in 10 million; that is, for the 500 shots we do not expect to obtain any with z exceeding 5. In reality that is not what happens, and we obtain a considerable number of shots with z in excess of 5: in fact, we obtained 74 such shots! That is, z does not follow the normal law, but another law, discovered by the French mathematician Augustin Louis Cauchy (1789–1857) and called the Cauchy distribution in his honor. The Cauchy probability density function

f(z) = 1 / (π(1 + z²))

is represented by the solid line in Fig. 6.7b.

Fig. 6.7. Histogram of the ratio of normal deviations in 500 shots: (a) with the normal probability density function (broken curve); (b) with the Cauchy probability density function (continuous curve)
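The tail comparison made in the text — about 6 shots in 10 million beyond |z| = 5 under the normal law, versus dozens out of 500 under the Cauchy law — can be computed exactly from the two distribution functions. A small Python sketch (my own; it uses only the standard-library functions erfc and atan):

```python
import math

def normal_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))

def cauchy_tail(z):
    """P(|Z| > z) for a standard Cauchy variable (the ratio of two
    independent standard normals follows this law)."""
    return 1 - (2 / math.pi) * math.atan(z)

print(normal_tail(5))               # ~6e-7: about 6 in 10 million
print(cauchy_tail(5))               # ~0.126
print(round(500 * cauchy_tail(5)))  # 63: dozens expected among 500 shots
```

So roughly 63 of 500 ratios are expected beyond |z| = 5 under the Cauchy law, consistent with the 74 observed in the experiment described above.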

Visually, Cauchy's curve represented in Fig. 6.7b does not look much of a monster; it even looks like the bell-shaped curve, and there is a tendency to think that its expectation is the central 0 point (as in the analogy with the center of gravity). But this impression is deceptive. The reason is that the Cauchy law does not really have a mathematical expectation. Let us consider what happens with a sequence of values generated by such a distribution. A normally distributed sequence is characterized by the fact that it very rarely exhibits values that stand out from the 'background noise'. In contrast, we see that a Cauchy sequence exhibits many occurrences of large values (sometimes, very large). For example, in Fig. 6.8a, we observe, in a sequence close to 400 rounds, a large jump with a big deviation from zero, clearly standing out from most of the values in the sequence. The impulsive behavior of the Cauchy distribution is related to the very slow decay of its tails for very small or very large values of the random variable; this slow decay is reflected in a considerable probability for the occurrence of extreme values.

Suppose now that we take the random values with Cauchy distribution as wins and losses of a game and compute the average gain. Figure 6.8b shows 6 average gain sequences for 1000 rounds. It seems that there is some stabilization around 0 for some sequences. However, had we carried out a large number of experiments with a large number of rounds, we would then confirm that the averages do not stabilize around zero. Since such laws do not have a mathematical expectation, it has no meaning to speak of the law of large numbers for those sequences, and it makes no sense to speak of convergence or long term behavior. In some everyday phenomena, such as risk assessment in financial markets, sequences obeying such strange laws as the Cauchy law do emerge; fortune and ruin may occur abruptly.

Fig. 6.8. (a) A sequence of rounds with Cauchy distribution. (b) Six average gain sequences for 1000 rounds
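The impulsive behavior of Cauchy sequences, and the failure of their running averages to settle down, can be reproduced by simulation. A Python sketch of mine (the inverse-transform construction tan(π(U − 1/2)) is a standard way to draw standard Cauchy values; seed and names are arbitrary):

```python
import math
import random

def cauchy_sample(n, seed=0):
    """Draw n standard Cauchy values by the inverse-transform method:
    tan(pi * (U - 1/2)) with U uniform on (0, 1)."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

values = cauchy_sample(10_000, seed=7)

# Running averages: a single extreme value can drag the mean far from
# zero no matter how many rounds have already been averaged.
running_means = []
total = 0.0
for i, v in enumerate(values, 1):
    total += v
    running_means.append(total / i)

print(max(abs(m) for m in running_means[1000:]))
```

Unlike the coin-tossing averages of Fig. 6.1, these running means keep jumping even after thousands of rounds, mirroring the behavior of Fig. 6.8b; the sample median, by contrast, does stabilize near 0, since the Cauchy law has a well-defined median even though it has no expectation.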

6.6 Borel's Normal Numbers

We spoke of confidence intervals in Chap. 4. On the basis of the normal law, we were able to determine how many throws of a die were needed in order to be 95% sure of the typicality of a ±0.01 interval around the probability p for the frequency of occurrence f in the worst case (worst standard deviation). Denoting the number of throws by n, we could write

P(deviation of f from p is below 0.01) = 0.95 ,  for n > 10 000 ,

using the word 'deviation' to denote the absolute value of the deviation of an average (f, in the example) from the mathematical expectation (p, in the example). This result constituted an illustration of the law of large numbers. In general, the law of large numbers stipulates that, whatever the tolerance of the deviation might be (0.01 in the above example), it is always possible to find a minimum value of n above which one obtains a degree of certainty as large as one may wish (0.95 in the above example). In the limit, the degree of certainty is 1. That is, denoting any preselected tolerance by t, we may write

limit of P(deviation of f from p is below t) = 1 .

Thus, when applying the law of large numbers, we first compute a probability and thereafter its limit. In fact, it does frequently happen that we do not know how many times the random experiment will be repeated, and we are more interested in first computing the limit of the deviations between average and expectation and only afterward determining the probability of the limit. Therefore, right from the start we must consider infinite sequences of repetitions. Setting aside the artificiality of the notion of an 'infinite sequence', which has no practical correspondence, the study of infinite sequences is in fact rather insightful.

Let us take a sequence of n tosses of a coin. Any of the 2^n possible sequences is equiprobable, with probability 1/2^n. Imagine then that the sequence continued without limit. Since 1/2^n decreases with n, in the limit, for arbitrarily large n, all of them would have zero probability; we would thus be led to the paradoxical conclusion that no sequence is feasible. This difficulty can be overcome if we only assign zero probability to certain sequences.

For this purpose, consider representing heads and tails respectively by 1 and 0, as we have already done on several occasions. For instance, the sequence heads, tails, heads, heads is represented by 1011. We may assume that 1011 represents a number in a binary numbering system. Instead of the decimal system that we are accustomed to from an early age, where the position of each digit represents a power of 10 (e.g., 375 = 3 × 10² + 7 × 10¹ + 5 × 10⁰), in the binary numbering system the position of each digit (there are only two, 0 and 1), called a bit, represents a power of 2. So 1011 in the binary system is evaluated as 1 × 2³ + 0 × 2² + 1 × 2¹ + 1 × 2⁰, that is, 11 (eleven) in the decimal system.

Let us now take the 1011 sequence as the representation of a number in the interval [0, 1], that is, 0.1011 (we just omit the 0 followed by the decimal point). To compute its value in the decimal system we shall have to multiply each bit by a power of 1/2 (by analogy, in the decimal system each digit is multiplied by a power of 1/10). 0.1011 is then evaluated as

1 × 1/2 + 0 × 1/4 + 1 × 1/8 + 1 × 1/16 = 0.6875 .

Table 6.2 shows some more examples.

Table 6.2. Assigning numbers to sequences

Sequence                               Binary     Decimal value
Tails, tails, tails, tails, heads      0.00001    0.03125
Heads, tails, heads, tails, heads      0.10101    0.65625
Heads, heads, heads, heads, heads      0.11111    0.96875

We may visualize this binary representation by imagining that each digit corresponds to dividing the interval [0, 1] into two halves. The first digit represents the division of [0, 1] into two intervals: from 0 to 0.5 (excluding 0.5) and from 0.5 to 1; the second digit represents the division of each of these two intervals into halves, in succession, and so on, as shown in Fig. 6.9, assuming a sequence of only n = 3 digits. It is easy to understand that all infinite heads–tails sequences can thus be put in correspondence with numbers in the interval [0, 1], with 1 represented by 0.111111… In the case of 1/2, which may be represented by both 0.1000000… and 0.0111111111…, we adopt the convention of taking the latter (likewise for 1/4, 1/8, etc.).
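The binary-fraction mapping of Table 6.2 amounts to a one-line sum. A minimal Python sketch (not from the book; names are mine):

```python
def sequence_to_number(tosses):
    """Map a heads-tails sequence to a number in [0, 1]: heads -> bit 1,
    tails -> bit 0, the k-th bit weighted by (1/2)^k."""
    return sum((1 if t == 'heads' else 0) / 2 ** k
               for k, t in enumerate(tosses, start=1))

print(sequence_to_number(['heads', 'tails', 'heads', 'heads']))         # 0.6875
print(sequence_to_number(['tails', 'tails', 'tails', 'tails', 'heads']))  # 0.03125
```

The two printed values reproduce the 0.1011 example and the first row of Table 6.2.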

Fig. 6.9. The first 3 digits (d1, d2, d3) and the intervals they determine in [0, 1]: the value 0.625 corresponds to the dotted intervals

We now establish a mapping between probabilities of binary sequences and their respective intervals in [0, 1]. Take the sequence heads, tails, heads, represented by 0.101, that is, 0.625 (1 × 1/2 + 0 × 1/4 + 1 × 1/8). Figure 6.9 shows the intervals that 0.625 'occupies': the small interval between 0.625 and 0.75, with width 1/8. We may measure the probability that this sequence occurs by taking the width of the corresponding interval in [0, 1]. From the point of view of assigning probabilities to the sequence heads, tails, heads, it comes to the same thing as randomly throwing a point onto the straight line segment between 0 and 1 (excluding 0), with the segment representing a kind of target cut into slices 0.125 units wide. If the point-throwing process has a uniform distribution, each number has probability 1/8, as required.

Now consider infinite sequences. To each sequence corresponds a number in [0, 1]. Take, for instance, the sequence 0.101100000… (always 0 from the fifth digit onward): it represents the rational number 11/16. Now take the sequence 0.1010101010… (repeating forever): it represents the rational number 2/3. Let us look at those sequences where 0 and 1 show up the same number of times (thus, the probability of finding 0 or 1 is 1/2), as well as any of the four two-bit subsequences 00, 01, 10 and 11 (each showing up 1/4 of the time), any of the three-bit subsequences 000, 001, 010, 011, 100, 101, 110 and 111 (each showing up 1/8 of the time), and so forth. The real numbers in [0, 1] corresponding to such sequences are called Borel's normal numbers. Obviously, no rational number (represented by a sequence that either ends in an infinite tail of zeros or ones, or by a periodic sequence) is a Borel normal number. The French mathematician Émile Borel (1871–1956) proved in 1909 that the set of non-normal numbers occupies such a meaningless 'space' in [0, 1] that we may consider their probability to be zero. The set of all sequences corresponding to rational numbers thus has zero probability; in other words, the probability that we get an infinite sequence corresponding to a rational

number when playing heads–tails is zero. Since the set of all normal numbers has probability 1, we may then say that practically all real numbers in [0, 1] are normal (once again, a result that may seem contrary to common intuition). Yet it has been found to be a difficult task to prove that a given number is normal. A few numbers are known to be normal. One of them, 0.0100011011000001010011100101110111…, was presented by the economist David Champernowne (1920–2000) and corresponds to concatenating the distinct binary blocks of length 1, 2, 3, etc.: 0, 1, 00, 01, 10, 11, 000, … The so-called Copeland–Erdős constant 0.235 711 131 719 232 931 373 941…, obtained by concatenating the prime numbers 2, 3, 5, 7, 11, …, is also normal, in the base 10 system. Are the decimal expansions of √2, e or π normal? No one knows (although there is some evidence that they are).

We end with another amazing result. It is well known that the set of all infinite 0–1 sequences is uncountable, that is, it is not possible to establish a one-to-one mapping between these sequences and the natural numbers. Well, although the set of non-normal numbers has zero probability, as already pointed out in the last section, it is also uncountable!

6.7 The Strong Law of Large Numbers

Let us look at sequences of n zeros and ones as coin tosses, denoting by f the frequency of occurrence of 1. In Chap. 4, the Bernoulli law of large numbers stated that

limit of P(deviation is below t) = 1 ,

where the term 'deviation' refers to the absolute value of the difference between the average and the expectation (1/2, for coin throwing). The normal numbers satisfy, by definition, the following somewhat obvious property:

P(limit of the deviation of f from 1/2 is 0) = 1 .

This result is a particular case of a new version of the law of large numbers, demonstrated by the Russian mathematician Andrei Kolmogorov (1903–1987) in 1930 (more than two centuries after the Bernoulli law). The new law is stated as follows:
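Champernowne's binary construction described above can be generated mechanically, and the balance of 0s and 1s checked. A Python sketch under my own naming:

```python
from itertools import product

def champernowne_bits(max_len):
    """Concatenate all binary blocks of length 1, 2, ..., max_len,
    in order: 0, 1, 00, 01, 10, 11, 000, ... (the construction of
    Champernowne's binary constant)."""
    bits = []
    for length in range(1, max_len + 1):
        for block in product('01', repeat=length):
            bits.extend(block)
    return bits

bits = champernowne_bits(12)
print(''.join(bits[:10]))           # 0100011011 -- the constant's first digits
print(bits.count('1') / len(bits))  # 0.5: exactly balanced over complete groups
```

Within each group of blocks of a given length, every bit position is 1 in exactly half of the blocks, so the frequency of 1 over complete groups is exactly 1/2 — the first of the properties a normal number must satisfy (normality itself requires the same balance for every subsequence, which Champernowne proved for his constant).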

P(limit of the deviation is 0) = 1 ,

with t the tolerance of the deviations. Let us examine the difference between the two formulations using Fig. 6.10, which displays several deviation curves as a function of n. In the first version, once we have fixed the tolerance t (the gray band), the probability of a deviation larger than t must converge to 0. In Fig. 6.10a there are three curves above t at n0, two at n1 and one at n2: we find fewer curves above t as we proceed toward larger values of n. The law says nothing about the deviation values; in the case of Fig. 6.10a, the deviation at n2 is even more prominent than the deviation at n1. In the second version of the law, illustrated in Fig. 6.10b, in order for the limit of the average to equal the limit of the expectation with high probability, large wanderings above any preselected t must become less frequent each time. An equivalent formulation consists in saying that, in the worst case – the dotted line in Fig. 6.10b – the probability of exceeding t converges to zero. Thus, the second version of the law of large numbers imposes a stronger restriction on the behavior of the averages than the first version. It is thus appropriate to call it the strong law of large numbers; the Bernoulli law is then the weak law. Understanding the laws of large numbers is an essential prerequisite to understanding the convergence properties of averages of chance sequences.

Fig. 6.10. Convergence conditions for the weak law (a) and the strong law (b) of large numbers

6.8 The Law of Iterated Logarithms

We spoke of the central limit theorem in Chap. 4, which allowed us to compute the value of n above which there is a high certainty of obtaining a frequency close to the corresponding probability value. We were thus able to determine that, when p = 1/2 (the coin-tossing example),

only after 10 000 throws was there a risk of at most 5% that the f ± 0.01 confidence interval was not typical, i.e., did not contain the true value of p. On the other hand, the strong law of large numbers shows us that there is an almost sure convergence of f toward p. How fast is that convergence? Take a sequence of experiments with only two possible outcomes, as in the coin-tossing example, and denote the probability of one of the outcomes by p. Let S(n) be the accumulated gain after n outcomes and consider the absolute value of the discrepancy between S(n) and its expectation, which we know to be np (binomial law). Now, the Russian mathematician Alexander Khintchine demonstrated the following result in 1924:

P(limit, in the worst case, of a √(2npq log log n) deviation) = 1 .

This result, known as the theorem of iterated logarithms (the logarithm is repeated), allows one to conclude that f = S(n)/n will almost certainly be between p plus or minus a deviation of √(2pq log log n / n), where 'almost certainly' means that, if we add an infinitesimal increase to that quantity, only a finite subset of all possible infinite sequences will deviate by more than that quantity. Note that the statement is formulated in terms of a worst case deviation curve as in Fig. 6.10b. For instance, for 10 000 throws of a coin we are almost sure that all sequences will deviate by less than

√(2 × 0.5 × 0.5 × log log 10 000 / 10 000) = 0.0105 .

Figure 6.11 shows 50 sequences of 10 000 throws of a coin with the deviation of the iterated logarithm (black curves). Note how, after 10 000 throws, practically all curves are inside 0.5 plus or minus the deviation of the iterated logarithm.

Fig. 6.11. Frequencies of heads or tails in 50 sequences of 10 000 throws of a coin, with the deviation of the iterated logarithm (black curves)
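The 0.0105 figure, and the behaviour of Fig. 6.11, can be checked with a short simulation (our own sketch; the 50 simulated sequences stand in for the curves of the figure):

```python
import math
import random

# Iterated-logarithm deviation bound sqrt(2*p*q*log(log(n))/n) at n = 10 000,
# for a fair coin (p = q = 1/2).
n = 10_000
p = q = 0.5
bound = math.sqrt(2 * p * q * math.log(math.log(n)) / n)
print(f"bound at n = {n}: {bound:.4f}")   # ≈ 0.0105, as in the text

# Simulate 50 sequences of 10 000 throws and count how many end up
# with a frequency deviating from 1/2 by more than the bound.
random.seed(7)
count = 0
for _ in range(50):
    heads = sum(random.randint(0, 1) for _ in range(n))
    if abs(heads / n - 0.5) > bound:
        count += 1
print(f"{count} of 50 sequences beyond the bound at n = {n}")
```

As in the figure, practically all sequences stay inside the band; the bound is a worst-case statement over all n, so an occasional sequence slightly outside it at a single n is not a contradiction.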

Note also the slow decay of the iterated logarithm. The strong law of large numbers and the law of iterated logarithms show us that, even when we consider infinite sequences of random variables, there is order in the limit.

Fig. 6.12. Pushing the poor souls to the limit


7 The Nature of Chance

7.1 Chance and Determinism

The essential property characterizing chance phenomena is the impossibility of predicting any individual outcome. When we toss a fair die or a fair coin in a fair way we assume that that condition is fulfilled. Yet one might think that it would be possible in principle to predict any individual outcome, either of the die or of the coin, if one knew all the initial conditions of motion. Let us look at the example of the coin. The initial conditions that would have to be determined are the force vector applied to the coin, the position of its center of gravity relative to the ground, the shape of the coin and its orientation at the instant of throwing, the value of the acceleration due to gravity at the place of the experiment, the viscosity of the air, and the friction and elasticity factors of the ground. Once all these conditions are known, together with the laws of motion, one might think it possible to forecast the outcome. Every aspect of the problem (coin fall, coin rebound, motion on the ground with friction, etc.) obeys perfectly defined and known laws. The problem is thus deterministic. However, writing and solving the respective motion equations would certainly be a very complex task, although not impossible. The only thing we may perhaps be able to tell with a given degree of certainty, based on the laws of large numbers, is the average outcome after a large number of experiments. At the beginning of the twentieth century there was a general conviction that the laws of the universe were totally deterministic. Probability theory was considered to be a purely mathematical trick allowing us to deal with our ignorance of the initial conditions of many phenomena.

According to Pierre-Simon Laplace in his treatise on probability theory:

If an intelligence knew at a given instant all the forces animating matter, as well as the position and velocity of any of its molecules, and if moreover it was sufficiently vast to submit these data to analysis, it would comprehend in a single formula the movements of the largest bodies of the Universe and those of the tiniest atom. For such an intelligence nothing would be irregular, and the curve described by an air or vapor molecule would appear to be governed in as precise a way as is for us the course of the Sun. But due to our ignorance of all the data needed for the solution of this grand problem, and the impossibility, due to our limited abilities, of subjecting most of the data in our possession to calculation, even when the amount of data is very limited, we attribute the phenomena that seem to us to occur and succeed without any particular order to changeable and hidden causes, whose action is designated by the word chance, a word that after all is only the expression of our ignorance. Probability relates partly to this ignorance and in other parts to our knowledge.

The development of quantum mechanics at the beginning of the twentieth century, starting with the pioneering work of the physicists Max Planck (research on black body radiation, which established the concept of the emission of discrete energy quantities, i.e., energy quanta), Albert Einstein (research on the photoelectric effect, establishing the concept of photons as light quanta) and Louis de Broglie (dual wave–particle nature of the electron), came to change the Laplacian conception of the laws of nature, revealing the existence of many so-called quantum phenomena whose probabilistic description is not simply the result of resorting to a comfortable way of dealing with our ignorance about the conditions that produced them. For quantum phenomena the probabilistic description is not just a handy trick, but an absolute necessity imposed by their intrinsically random nature. Quantum phenomena are 'pure chance' phenomena. Quantum mechanics thus made the first radical shift away from the classical concept that everything in the universe is perfectly determined. However, since quantum phenomena are mainly related to elementary particles, i.e., microscopic particles, in the case of macroscopic phenomena (such as tossing a coin or a die) the conviction remained that the probabilistic description was merely a handy way of looking at things

and that it would in principle be possible to forecast any individual outcome. This conviction persisted for quite some time. In the middle of the twentieth century several researchers showed that this concept would also have to be revised. It was observed that in many phenomena governed by perfectly deterministic laws there was in many circumstances such a high sensitivity to initial conditions that one would never be able to guarantee the same outcome by properly adjusting those initial conditions. However small a deviation from the initial conditions might be, it could sometimes lead to an unpredictably large deviation in the experimental outcomes. Moreover, it was observed that much of this behavior, highly sensitive to initial conditions, gave birth to sequences of values – called chaotic orbits – with no detectable periodic pattern, rendering it impossible to forecast future outcomes. The only way of avoiding such disorderly behavior would be to define the initial conditions with infinite accuracy, which amounts to an impossibility. It is precisely this chaotic behaviour that characterizes a vast number of chance events in nature, even though the applicable laws are purely deterministic.

We can thus say that chance has a threefold nature:

• A consequence of our incapacity or inability to take into account all factors that influence the outcome of a given phenomenon, although we know that the applicable laws are purely deterministic. This is the coin-tossing scenario.

• The intrinsic random nature of phenomena involving microscopic particles, which are described by quantum mechanics. Unlike the preceding case, the best that one can do is to determine the probability of a certain outcome, since the random nature of quantum phenomena is an intrinsic feature.

• A consequence of the extreme sensitivity of many phenomena to certain initial conditions, even though the applicable laws are purely deterministic. This is the nature of deterministic chaos, which is manifest in weather changes, for instance, and which plays a role in a vast number of macroscopic phenomena in everyday life.

In what follows we first focus our attention on some amazing aspects of probabilistic descriptions of quantum mechanics. We then proceed to a more detailed analysis of deterministic chaos.

7.2 Quantum Probabilities

The description of nature at extremely small scales, as is the case of photons, electrons, protons, neutrons and other elementary particles, is ruled by the laws of quantum mechanics. Such particles are generically named quanta. There is a vast popular science literature on this topic (a few indications can be found in the references), describing in particular the radical separation between classical physics, where laws of nature are purely deterministic, and quantum mechanics, where laws of nature at a sufficiently small, microscopic, scale are probabilistic. In this section we restrict ourselves to outlining some of the more remarkable aspects of the probabilistic descriptions of quantum mechanics (now considered to be one of the best validated theories of physics).

A first remarkable feature consists in the fact that, for a given microscopic system, there are variables whose values are objectively undefined. When the system undergoes a certain evolution, with or without human intervention, it is possible to obtain values of these variables which only depend on purely random processes: objective, pure, randomness. In quantum mechanics, event probabilities are objective. More clearly, the chance nature of the obtained outcome does not depend on our ignorance or on incomplete specifications, as is the case for chaotic phenomena.

An experiment illustrating the objective nature of chance in quantum mechanics consists in letting the light of an extremely weak light source fall upon a photographic plate, for a very short time. Suppose that some object is placed between the light source and the plate. Since the light intensity is extremely weak, only a small number of photons falls on the plate, which will then reveal some spots distributed in a haphazard way around the photographed object. If we repeat the experiment several times the plates will reveal different spot patterns. It is observed that there is no regularity in the process, as would happen for instance if the first half of the photons impacting the plate were to do so in the middle region and the other half at the edges. The process is genuinely random. However, if we combine the images of a large number of plates, the law of large numbers comes into play and the image (silhouette) of the photographed object then emerges from a 'sequence of chance events'. It is as though we had used a Monte Carlo method, by throwing tiny points onto the plate, as a way of defining the contour of the object.
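That Monte Carlo reading of the experiment can be sketched in a few lines (the square plate, disc-shaped object and photon counts below are our illustrative assumptions, not the book's):

```python
import random

# Random 'photons' fall on a square plate; an opaque disc blocks those
# that would land behind it. With few photons the disc is invisible;
# with many, the law of large numbers makes its silhouette (here, its
# area fraction) emerge from the spot pattern.

def fraction_blocked(n_photons, radius=0.5):
    blocked = 0
    for _ in range(n_photons):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y < radius * radius:
            blocked += 1
    return blocked / n_photons

random.seed(42)
# True blocked fraction is pi * r^2 / 4, roughly 0.196 for r = 0.5.
for n in (100, 100_000):
    print(n, "photons:", fraction_blocked(n))
```

With 100 photons the estimate wanders noticeably; with 100 000 it settles close to the true area fraction, just as the silhouette sharpens when many plates are combined.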

One strange aspect of quantum mechanics lies in the fact that certain properties of a microscopic particle, such as its position, are characterized by a probability wave (described by the famous wave equation discovered by the Austrian physicist Erwin Schrödinger in 1926). We cannot, for instance, say at which point a given particle is at a given instant. In the absence of a measurement there is a non-null probability that the particle is 'positioned' at several points at the same time, being present partly 'here', with a certain probability, and partly 'there', with another probability. We write 'positioned' between quotes because in fact it does not have the usual meaning: it really is as though the particle were 'smeared out' in space. The particle is said to be in a superposition state. This superposition state aspect and the wave nature, and therefore vector nature, of the quantum description give rise to counterintuitive results which are impossible to obtain when dealing with uncertainties in everyday life, such as the outcomes of games, forecasts in the stock market, weather forecasts, and so on, characterized by a scalar probability measure.

There is a much discussed experiment, the two-slit experiment, that illustrates the strange behavior of the probabilistic vector nature of quantum phenomena. A narrow beam of light shines on a target, separated from the light source by a screen with two small holes (or slits). Let us interpret the experiment as in Chap. 4 for the example of shooting at a target, treating the photons like tiny bullets that pass, with a certain probability, either through A or through B, as shown in Fig. 7.1a. We may even assume a normal probability density for each of the two possible trajectories. With this assumption the probability of a certain target region being hit by a bullet is the sum of the probabilities for each of the two possible bullet trajectories.

Fig. 7.1. Probability density functions in the famous two-slit experiment. (a) According to classical mechanics. (b) What really happens

In any small region of the target the bullet-hit probability is the sum of the probabilities corresponding to each of the normal distributions, that is, the probability density of the impact due to the two possible trajectories is obtained as the sum of the two densities, shown with a dotted line in Fig. 7.1a. This would certainly be the result with macroscopic bullets. Now, that is not what happens with photons, where a so-called interference phenomenon like the one shown in Fig. 7.1b is observed, with concentric rings of smaller or larger light intensity than the one corresponding to a single slit. This result can be explained in the framework of classical wave theory. Quantum theory also explains this phenomenon by associating a probability wave to each photon.

In order to get an idea as to how the calculations are performed, we suppose that an arrow (vector) is associated with each possible trajectory of a photon, with the base of the arrow moving at the speed of light and its tip rotating like a clock hand. At a given trajectory point the probability of finding the photon at that point is given by the square of the arrow length. Figure 7.1b shows two arrows corresponding to photon trajectories passing simultaneously (!) through A and B; this is possible due to the state superposition. Consider the point P. When arriving at P, the clock hands exhibit different orientations, since the lengths of the trajectories FAP and FBP are different. The total probability is obtained as follows: the two arrows are added according to the parallelogram rule (the construction illustrated in Fig. 7.1b), and the arrow ending at S is obtained as a result. The square of the length of the latter arrow corresponds to the probability of an impact at P. At the central point Q, the arrows arrive in a concordant way and the length of the sum arrow is practically double the length of each vector, since the lengths of FAQ and FBQ are now equal. There are also points, such as R, asymmetric in relation to the horizontal bearing, where, depending on the choice of the distance between the slits, one may obtain arrows with the same bearing but opposite directions, which cancel each other when summing. It is then possible to observe null probabilities and probabilities that are four times larger than those corresponding to a single slit. We thus arrive at the surprising conclusion that the quantum description does not always comply with

P(passing through A or passing through B) = P(passing through A) + P(passing through B) ,

as would be observed in a classical description!
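The arrow arithmetic is easy to mimic with complex numbers, each 'clock hand' being a unit amplitude whose angle grows with the path length (the path lengths and wavelength below are arbitrary illustrative values, not taken from the book):

```python
import cmath

def hit_probability(path_a, path_b, wavelength=1.0):
    """Squared length of the sum of two unit 'arrows', one per trajectory."""
    phase = lambda length: cmath.exp(2j * cmath.pi * length / wavelength)
    amplitude = phase(path_a) + phase(path_b)   # parallelogram rule
    return abs(amplitude) ** 2

# Equal path lengths (a point like Q): arrows agree, probability is
# four times that of a single slit (which would give 1 here).
print(hit_probability(10.0, 10.0))

# Paths differing by half a wavelength (a point like R): arrows have the
# same bearing but opposite directions and cancel, giving probability 0.
print(hit_probability(10.0, 10.5))
```

Summing amplitudes before squaring, rather than summing squared magnitudes, is exactly what breaks the classical rule P(A or B) = P(A) + P(B).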

What happens when a single photon passes through a single slit? Does it smear out in the target as a blot with a bell-shaped intensity? No. The photon ends at an exact (but unknown) point. The bell-shaped blot is a result of the law of large numbers. There is no intrinsic feature deciding which photons hit the center and which ones hit the periphery. Only the probabilistic laws that govern the phenomena are responsible for the final result. Incidentally, if we place photon detectors just after the two slits A and B, by acting with a detector the state superposition is broken and we can say which slit a given photon has passed through; however, the interference phenomenon then no longer takes place. The interference phenomenon has also been verified experimentally with other particles of reduced dimensions such as electrons, atoms, and even with larger molecules, as in recent experiments with the buckminsterfullerene molecule, composed of 60 carbon atoms structured like a football.

Another class of experiments illustrating the strangeness of quantum probabilities relates to properties of microscopic particles exhibiting well defined values. It is a known fact that light has a wave nature as well as a particle nature, the wave being represented by the oscillation of an electromagnetic field propagating as a wave, with the electric field vector vibrating perpendicularly to the propagation direction. Common light includes many wave components with various vibration directions, called polarizations. With the help of a polarizing device (for instance, a sheet of a special substance called polaroid or a crystal of the mineral calcite), one may obtain a single light component with a specific direction of polarization. In Fig. 7.2b, horizontally polarized light, denoted by h (the electric field vibrates perpendicularly to the page), shines on a polarizer whose direction of polarization makes a 45-degree angle with the horizontal direction (see Fig. 7.2a).

Fig. 7.2. (a) Polarization directions in a plane perpendicular to the propagation direction. (b) Horizontally polarized light passing through two polaroids. (c) As in (b) but in the opposite order

Let Eh be the electric field intensity of h. The intensity coming out of the 45° polarizer corresponds to the field projection along this direction:

E1 = Eh cos 45° = Eh/√2 .

The light passes next through a vertical polarizer (the electric field vibrates in the page, perpendicularly to the direction of propagation), and likewise

E2 = E1 sin 45° = E1/√2 = Eh/2 .

Let us now suppose that we change the order of the polarizers, as in Fig. 7.2c. We obtain

E1 = Eh cos 90° = 0 , hence E2 = 0 .

These results, involving electric field projections along polarization directions, are obvious in the framework of the classical wave theory. Let us now interpret them according to quantum mechanics. In fact, light intensity, which is proportional to the square of the electric field amplitude, measures the probability of a photon passing through the two-polarizer sequence. But we thus arrive at the conclusion, contrary to common sense, that in the quantum description the probabilities of a chain of events are not always commutative, that is, independent of the order of events, since the probability associated with the experimental result of Fig. 7.2b is different from the one associated with Fig. 7.2c.

Suppose that instead of light we had a flow of sand grains passing through a sequence of two sieves: sieve 1 and sieve 2. In the classical description, the commutative property holds:

P(passing through sieve 1) × P(passing through sieve 2 if it has passed through sieve 1) = P(passing through sieve 2) × P(passing through sieve 1 if it has passed through sieve 2) .

In the quantum description the probability of the intersection depends on the order of the events! This non-commutative feature of quantum probabilities has also been observed in many other well-defined properties of microscopic particles, such as the electron spin.

The laws of large numbers also come into play in all quantum phenomena, as we saw earlier in the case where an object was photographed with very weak light.

In the interference example, if the source light only emits a small number of photons per second, one initially observes a poorly defined interference pattern, but it will begin to build up as time goes by. Another particularly interesting illustration of the law of large numbers acting on quantum phenomena is obtained in an experiment in which the intensity of a light beam with a certain polarization, say h, is measured after it has passed through a polarizer with polarization axis at an angle θ to h. We have already seen that the light intensity is then proportional to the square of cos θ; in fact, it is proportional to the number of photons passing through the polarizer, viz. Np = N cos²θ, where N is the number of photons hitting the polarizer. If we carry out the experiment with a light beam of very weak intensity (in such a way that only a small number of photons hits the polarizer) and, for each value of θ, we count how many photons pass through (this can be done with a device called a photomultiplier), we obtain a graph like the one shown in Fig. 7.3. This graph clearly shows random deviations in the average number of photons Np/N passing through the polarizer from the theoretical value of cos²θ. But in this case the law of large numbers comes into play, and in such a way that the average number of photons passing through the polarizer exhibits a standard deviation proportional to 1/√N (as already seen in Chap. 4). In summary, for an everyday type of light beam, with very high N, the law of large numbers explains the cos²θ proportionality, as deduced in the framework of classical physics for the macroscopic object we call a light beam. An experiment with a normal light beam can thus be interpreted as carrying out multiple experiments such as the one portrayed in Fig. 7.3.

Fig. 7.3. Average number of photons passing through a polarizer for several polarizer angles (simulation)
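A simulation in the spirit of Fig. 7.3 is only a few lines long (our own sketch: each photon is taken to pass independently with probability cos²θ):

```python
import math
import random

# Each photon passes the polarizer at angle theta with probability
# cos^2(theta); the measured fraction fluctuates around cos^2(theta)
# with a spread shrinking like 1/sqrt(N).

def fraction_passing(theta_deg, n_photons):
    p = math.cos(math.radians(theta_deg)) ** 2
    passed = sum(1 for _ in range(n_photons) if random.random() < p)
    return passed / n_photons

random.seed(3)
for theta in (0, 30, 45, 60, 90):
    observed = fraction_passing(theta, 2_000)
    expected = math.cos(math.radians(theta)) ** 2
    print(f"theta = {theta:2d}  Np/N = {observed:.3f}  cos^2 = {expected:.3f}")
```

With N = 2 000 the observed fractions already sit close to cos²θ; increasing N tightens them further, which is the law of large numbers recovering the classical intensity law from individually random photons.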

Another strange aspect of quantum mechanics, which still raises much debate, concerns the transition between a quantum description and a classical description. In other words, when is a particle large enough to ensure that quantum behavior is no longer observed? The modern decoherence theory presents an explanation consistent with experimental results, according to which, for sufficiently large particles, the state superposition is no longer preserved, where sufficient means that an interaction with other particles in the surrounding space is inevitable.

The intrinsically random nature of quantum phenomena was a source of embarrassment for many famous physicists. It was even suggested that instead of probability waves, there existed in the microscopic particles some internal variables, called hidden variables, that were deterministically responsible for quantum phenomena. However, a result discovered in 1964 by the physicist John Bell (known as Bell's theorem) proved that no theory of local hidden variables would be able to reproduce in a consistent way the probabilistic outcomes of quantum mechanics. The random nature of quanta is therefore the stronghold of true randomness, of true chance. Quanta are the quintessential chance generators, putting dice, coins and the like to shame.

7.3 A Planetary Game

Playing dice in space may not be a very entertaining activity. In weightless conditions, astronauts would see the dice floating in the air, without much hope of deciding what face to read off. Imagine a space game for astronauts, using balls instead of dice, and making use of the weightless conditions in order to obtain orbital ball trajectories. In this planetary game two balls are used, playing the role of planets with co-planar orbits. One of the balls, of largest mass, plays the role of the Sun in a mini-planetary system. Let us first take a single planet P1 and, to fix ideas, let us suppose that the Sun and planet have masses 10 kg and 1 kg, respectively. The mass of the planet is reasonably smaller than the mass of the Sun, so that its influence on the Sun's position can be neglected. We may then consider that the Sun is fixed at the point (0, 0) of a coordinate system with orthogonal axes labelled x and y in the orbital plane. If an astronaut places P1 1.5 m away from the Sun with an initial velocity of 7 cm/hr along one of the orthogonal directions, (s)he will be able to confirm the universal law of gravitation, observing P1 moving round in a slightly elliptic orbit with an orbital period of about 124 hr (assuming

that the astronaut is patient enough). We now suppose that a second planet P2, also with mass 1 kg, is placed farther away from the Sun than P1, with an initial velocity of 6 cm/hr along one of the orthogonal directions. The game consists in predicting where (to within some tolerance) one of the planets can be found after a certain number of hours. If P2 is placed at an initial distance of r = 3 m from the Sun, we get the stable orbit situation shown in Fig. 7.4a and the position of the planets is predictable. In this situation a slight deviation in the position of P2 does not change the orbits that much, as shown in Fig. 7.4b.

Fig. 7.4. Stable orbits of the planets, with the 'Sun' at (0, 0). (a) r = 3 m. (b) r = 3.01 m

On the other hand, if P2 is placed 2 m from the Sun, we get the chaotic orbits shown in Fig. 7.5a and the positions of the planets can no longer be predicted. Even as tiny a change in the position as an increment of 0.00000001 m (a microbial increment!) causes a large perturbation, as shown in Fig. 7.5b. One cannot even predict the answer to such a simple question as: Will P1 have made an excursion beyond 3 m after 120 hr? Faced with such a question, the astronauts are in the same situation as if they were predicting the results of a coin-tossing experiment. The chaotic behavior of this so-called three-body problem was first studied by the French mathematician Jules Henri Poincaré (1854–1912), pioneer in the study of chaotic phenomena. Poincaré and other researchers have shown that, in contrast to what happens with two bodies, where the motion is never chaotic, by randomly specifying the initial conditions of three or more bodies, the resulting motion is almost always chaotic. For three or more bodies no one can tell, in general, which trajectories they will follow. This is the situation with certain asteroids.
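The single-planet case can be checked with a crude integration (our own sketch, not from the book; with the rounded 7 cm/hr the period comes out nearer 100 hours, of the same order as the roughly 124 hours quoted, which corresponds to the circular-orbit speed of about 7.6 cm/hr at 1.5 m):

```python
# Two-body sanity check: 'Sun' of 10 kg fixed at the origin, planet
# started 1.5 m away at 7 cm/hr, integrated with a leapfrog scheme.
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
GM = G * 10.0          # Sun mass 10 kg (planet mass cancels out)
HOUR = 3600.0

def accel(x, y):
    r3 = (x * x + y * y) ** 1.5
    return -GM * x / r3, -GM * y / r3

x, y = 1.5, 0.0
vx, vy = 0.0, 0.07 / HOUR      # 7 cm/hr in m/s
dt, t, crossed = 10.0, 0.0, None
ax, ay = accel(x, y)
while crossed is None and t < 300 * HOUR:
    prev_y = y
    vx += 0.5 * dt * ax; vy += 0.5 * dt * ay   # half kick
    x += dt * vx; y += dt * vy                  # drift
    ax, ay = accel(x, y)
    vx += 0.5 * dt * ax; vy += 0.5 * dt * ay   # half kick
    t += dt
    if t > HOUR and prev_y < 0 <= y:            # back to the start side
        crossed = t / HOUR

print(f"orbital period ~ {crossed:.0f} hours")
```

A change of a few hundredths of a cm/hr in the initial speed shifts the period by hours, but for two bodies the motion stays regular and predictable, in contrast to the three-body game.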

The problem can only be treated for certain specific sets of conditions. In the following sections, we shall try to cast some light on the nature of these phenomena.

Fig. 7.5. Chaotic orbits of the planets, with the 'Sun' at (0, 0). (a) r = 2 m. (b) r = 2.000 000 01 m

7.4 Almost Sure Interest Rates

Consider the calculation of accumulated capital after n years, in a certain bank with an annual constant interest rate of j. Assume an initial capital, denoted by C0, of 1 000 euro, and an interest rate of 3% per year. After the first year the accumulated capital C1 is

C1 = C0 + C0 × j = 1 000 + 1 000 × 0.03 = C0 × 1.03 .

After the second year we get, in the same way,

C2 = C1 × 1.03 = C0 × 1.03² .

Thus, after n years, iteration of the preceding rule shows that the capital is Cn = C0 × 1.03^n. After carrying out the computations with a certain initial capital C0, suppose we came to the conclusion that its value was wrong and we should have used C0 + E0. For instance, instead of 1 000 euro we should have used 1 001 euro, corresponding to a relative error of E0/C0 = 1/1 000. Do we have to redo all calculations, or is there a simple rule for determining how the initial E0 = 1 euro error propagates after n years? Let us see. After the first year the capital becomes C0 × 1.03 + E0 × 1.03 .
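A quick numerical check answers the question (a minimal sketch with the values from the text):

```python
# Compound C0 and C0 + E0 at 3% and compare: under the linear rule
# Cn+1 = Cn * (1 + j), the *relative* error stays constant, so the
# result can be corrected without redoing the computation.
j = 0.03
C0, E0 = 1_000.0, 1.0

def capital(c, years, rate=j):
    for _ in range(years):
        c *= 1 + rate
    return c

C10 = capital(C0, 10)
C10_wrong = capital(C0 + E0, 10)
print(f"C10 = {C10:.2f} euro")
print(f"relative error after 10 years: {(C10_wrong - C10) / C10:.6f}")
```

The printed relative error is still 1/1 000: to correct C10 it suffices to add one thousandth of its value, exactly the rule derived in the text.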

We get a difference relative to C1 of E0 × 1.03. The relative error (relative to C1) is again E0/C0 (in detail, E0 × 1.03 / (C0 × 1.03) = E0/C0). Repeating the same reasoning, it is easy to conclude that the initial error E0 propagates in a fairly simple way over the years: the propagation is such that the relative error remains constant. If we suppose a scenario with n = 10 years and imagine that we had computed C10 = 1 343.92 euro, then given the preceding rule, the relative error 1/1 000 will remain constant and we do not therefore need to redo the whole calculation: we just have to add to C10 one thousandth of its value in order to obtain the correct value. Note how this simple rule is the consequence of a linear relation between the capital Cn after a given number of years and the capital Cn+1 after one further year: Cn+1 = Cn × 1.03. (In a graph of Cn+1 against Cn, this relation corresponds to a straight line passing through the origin and with slope 1.03, which is what justifies use of the word 'linear'.) This is of course a trivial observation and it seems that not much remains to be said about it. But let us take a closer look at how the difference of capital growth evolves when the interest rates are close to each other.

Suppose that another bank offers an interest rate of 3.05% per year. Is this small difference very important? Let us see by what factor the capital grows in the two cases after 10 years:

3% interest rate after 10 years: 1.343 916 ;
3.05% interest rate after 10 years: 1.350 455 .

If the initial capital is 100 euro, we get a difference that many of us would consider negligible, namely 65 cents. But, if the initial capital is one million euros, the difference will no doubt look quite important: 6 538 euro. Let us now look at another aspect of capital growth. Figure 7.6 shows what happens for two truly usurious interest rates: 50% and 51%. The graphical construction is straightforward. The horizontal axis represents the present capital Cn; the vertical axis shows the capital Cn+1 after one further year: Cn+1 = Cn(1 + j), in this case the straight line with slope 1.5 or 1.51. Starting with C0 = 1, we extend this value vertically until we meet the straight line 1 + j. We thus obtain the value C1, which we extend horizontally until we reach the line at 45°, so that we use the same C1 value on the horizontal axis when doing the next iteration. The process is repeated over and over again. The solid staircase shows how the capital grows for the 51% interest rate. One sees that as the number of years increases

(see Fig. 7.6) the difference between the two capital growth factors increases in proportion to the number of years.

Fig. 7.6. Capital growth for (usurious!) interest rates: 50% and 51%

Figure 7.6 clearly illustrates that the difference in values of the capital is proportional to n, i.e., changes linearly with n. This linearly growing difference in capital for distinct (and constant) interest rates is also well known. For the 3% and 3.05% interest rates, the difference in values of the capital varies with n as a straight line with slope (0.030 5 − 0.03)/1.03 = 0.000 485 per year.

Banks compute accumulated capitals with computers. These may use different numbers of bits in their calculations, corresponding to different numbers of decimal digits. However, what most people do not realize is that numerical computations are always performed with finite accuracy. That is, it does not matter how the calculations are carried out, by hand or by computer: the number of digits used in the numerical representation is necessarily finite, and the results are generally approximate. For instance, we never get exact calculations for an interest rate of 1/30 per year, since 1/30 = 0.033 333 3 . . . (continuing infinitely). Personal computers usually perform the calculations using 32 bits (the so-called simple precision representation, equivalent to about 7 significant decimal digits) or 64 bits (the so-called double precision representation, equivalent to about 15 significant decimal digits). Imagine a firm deciding to invest one million euros in a portfolio of shares with an 8% annual interest rate. Applying this constant 8% interest rate, the firm's accountant calculates the accumulated capital on the firm's computer with 32-bit precision. The bank performs the same calculation but with 64-bit precision. They compare their calculations and find that after 25 years there is a 41

Unfortunately. Moreover. why not use a 128-bit or a 256-bit computer instead? In fact.7. If there is a noticeable diﬀerence when using a 64-bit computer. linear way. they verify that the diﬀerence does not progress in a systematic. We now consider a non-linear phenomenon. we will always have to deal with uncertainty factors in our computations due to the ﬁnite precision of number representation and error propagation. either in natural phenomena or in everyday life. In other cases error propagation can have dramatic consequences. 7.51 euro.. After 50 years the diﬀerence is already 6. In some cases the mentioned uncertainty is insigniﬁcant (the ﬁrm would probably not pay much attention to a 41 cent diﬀerence in six million eight hundred thousand euros). This linear dependency is reﬂected in the well-behaved evolution of the capital growth for a small ‘perturbation’ of the initial capital. 7. But the bottom line is this: no matter how super the computer is.7. Let us denote the number of individuals by Nn and let N be the maximum .7.5 Populations in Crisis In the above interest rate problem. for problems where numerical precision is crucial one may use supercomputers that allow a very large number of bits for their computations. the formula Cn+1 = Cn (1 + j) for capital growth is a linear formula. it shows chaotic ﬂuctuations that are diﬃcult to forecast.5 Populations in Crisis 139 Fig. linear behavior is more the exception than the rule. 7. In order to clear up this issue they compare the respective calculations for a 50-year period and draw the graph of the diﬀerences between the two calculations. as in Fig. relating to the evolution of the number of individuals in a population after n time units. Annual evolution of the difference of accumulated capital computed by two diﬀerent computers cent diﬀerence in an accumulated capital of about six million eight hundred thousand euros. i.e. Cn+1 is directly proportional to Cn .

reference value allowed by the available food resources. We may, for instance, imagine a population of rabbits living inside a fenced field. If our time unit is the month, we may further assume that after three months we have N3 = 60 rabbits living in a field that can only support up to N = 300 rabbits. Let Pn = Nn/N denote the fraction of the maximum number of individuals that can be supported by the available resources; in this case we have P3 = 0.2. If the population grew at a constant rate r, we would get a situation similar to the one for the interest rate: Pn+1 = Pn(1 + r). However, with an increasing population, the necessary resources become scarcer, and one may reach a point where mortality overtakes natality owing to lack of resources. A formula reflecting the need to impose a growth rate that decreases as one approaches the maximum number of individuals is

Pn+1 = Pn[1 + r(1 − Pn)] .

Note that the above formula can be written as Pn+1 = Pn(1 + r) − rPn^2. According to this rule the population grows or dwindles at the rate r(1 − Pn). If Pn is below 1 (the Pn value for the population maximum), the population grows, and all the more as the deviation from the maximum is greater; the growth rate becomes a dwindle rate if the maximum is exceeded. As soon as Pn exceeds 1, Pn will then decrease: a dwindling phase sets in.

Suppose we start with P0 = 0.01 (one per cent of the reference maximum) and study the population evolution for r = 1.5. We find that the population keeps on growing until it stabilizes at the maximum (Fig. 7.8a). If r = 2.5, we get an oscillating pattern (Fig. 7.8b) with period 4 (months), clearly showing the non-linear, quadratic, dependency of Pn+1 on Pn. Sometimes the periodic behavior is more complex than the one in Fig. 7.8b: in the case of Fig. 7.9, the period is 32 and it takes about 150 time units until a stable oscillation is reached.

But strange things start to happen when r approaches 3. For r = 3 and P0 = 0.1, we get the behavior illustrated in Fig. 7.10a. A very tiny error in the initial condition P0, for instance P0 = 0.100 001, is enough to trigger a totally different evolution, as shown in Fig. 7.10b. For the above-mentioned values, if we set a 0.75 threshold, interpreting as 1 the values above the threshold and as 0 the values below,
we get the following periodic sequence: 111010101110101011101010111011101110101011101010111 . 7. we may further assume that after three months we have N3 = 60 rabbits living in a ﬁeld that can only support up to N = 300 rabbits. for instance. 7. and all the more as the deviation from the maximum is greater. For instance.75 threshold.10b. Sometimes the periodic behavior is more complex than the one in Fig.1.5. 7. we get the behavior illustrated in Fig.
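The population rule can be iterated directly. The following Python sketch uses the parameter values of the text (r = 3, P0 = 0.1 against the perturbed P0 = 0.100 001) and measures how the two evolutions drift apart; in exact arithmetic the iterates remain in the interval [0, 4/3]:

```python
def evolve(p0: float, r: float, steps: int) -> list[float]:
    """Iterate the population rule P_{n+1} = P_n * (1 + r*(1 - P_n))."""
    ps = [p0]
    for _ in range(steps):
        p = ps[-1]
        ps.append(p * (1 + r * (1 - p)))
    return ps

a = evolve(0.1, 3.0, 200)        # the evolution of Fig. 7.10a
b = evolve(0.100001, 3.0, 200)   # a one-part-in-a-hundred-thousand change of P0

# difference between the two evolutions at each time step
diffs = [abs(x - y) for x, y in zip(a, b)]
print(f"initial difference: {diffs[0]:.2e}, largest difference: {max(diffs):.3f}")
```

The tiny initial difference of 10^-6 is amplified at every step until the two populations bear no resemblance to each other, which is exactly the sensitivity to initial conditions illustrated by Figs. 7.10a and 7.10b.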

Fig. 7.8. Population evolution starting from P0 = 0.01 and with two different growth rates. (a) r = 1.5. (b) r = 2.5

Fig. 7.9. Population evolution starting from P0 = 0.5 and with r = 2.569

This is a completely different scenario from the error propagation issue in the interest rate problem. There, a small deviation in the initial conditions was propagated in a controlled, linear, way. In the present case, error propagation is completely chaotic. What we have here is not only a problem of high sensitivity to changes in initial conditions, as in the planetary game. The crux of the problem is that the law governing such sensitivity is unknown! To be precise: given the high sensitivity to initial conditions, the reader might say, for instance, that the population for r = 3 and P0 = 0.1 reaches a local maximum at the 60th generation. However, such a statement does not make sense. The graphs shown are those obtained on my computer; only by using an infinitely large number of digits would one be able to reach a conclusion as to whether or not the value attained at the 60th generation is a local maximum. Figure 7.11 shows

the difference of Pn values between the evolutions shown in Fig. 7.10.

Fig. 7.10. Chaotic population evolution for r = 3. (a) P0 = 0.1. (b) P0 = 0.100 001

Fig. 7.11. Time evolution of the difference in the populations of Fig. 7.10

It is clear that after a certain time no rule allows us to say how the initial perturbation evolves with n. In the case of the interest rate, we had no chaos, and a very simple rule applied to a 'perturbation' of the initial capital: the perturbation was propagated in such a way that the relative differences were kept constant. In deterministic chaos there are no such rules. The best one can do is to study statistically how, on average, the perturbations are propagated. This is the nature of deterministic chaos.

7.6 An Innocent Formula

There is a very simple formula that reflects a quadratic dependency of the next value of a given variable x relative to its present value. It is


called a quadratic iterator and is written as follows:

xn+1 = axn(1 − xn) .

The name 'iterator' comes from the fact that the same formula is applied repeatedly to any present value xn (as in the previous formulas). Since the formula can be rewritten xn+1 = axn − axn^2, the quadratic dependency becomes evident. (We could say that the formula for the interest rate is the formula of a linear iterator.) The formula is so simple that one might expect exemplary deterministic behavior. Nevertheless, and contrary to one's expectations, the quadratic iterator allows a simple and convincing illustration of the deterministic chaos that characterizes many non-linear phenomena. For instance, setting a = r + 1 and xn = rPn/(r + 1), one obtains the preceding formula for the population growth. Let us analyze the behavior of the quadratic iterator as in the case of the interest rate and population growth. For this purpose we start by noticing that the function y = ax(1 − x) describes a parabola passing through the points (0, 0) and (1, 0), as shown in Fig. 7.12a. The parameter a influences the height of the parabola vertex, occurring at x = 0.5 with value a/4. If a does not exceed 4, any value of x in the interval [0, 1] produces a value of y in the same interval. Figure 7.12a shows the graph of the iterations, in the same way as for the interest rate problem, for a = 2.5 and the initial value x0 = 0.15. Successive iterations generate the following sequence of values: 0.15, 0.3187, 0.5429, 0.6204, 0.5888, 0.6053, 0.5973, 0.6013, 0.5993,

Fig. 7.12. Quadratic iterator for a = 2.5 and x0 = 0.15. (a) Graph of the iterations. (b) Orbit


0.6003, 0.5998, and so on. This sequence, converging to 0.6, is shown in Fig. 7.12b. As in the previous examples we may assume that the iterations occur in time. The graph of Fig. 7.12b is said to represent one orbit of the iterator. If we introduce a small perturbation of the initial value (for instance, x0 = 0.14 or x0 = 0.16), the sequence still converges to 0.6. Figure 7.13 shows a non-convergent chaotic situation for a = 4. A slight perturbation of the initial value produces very diﬀerent orbits, as shown in Fig. 7.14. Between the convergent case and the chaotic case the quadratic iterator also exhibits periodic behavior, as we had already detected in the population growth model.
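The convergent orbit just described is easy to reproduce. A minimal Python sketch of the quadratic iterator, using the values of the text (a = 2.5, x0 = 0.15, with the nearby starting values 0.14 and 0.16 also converging to 0.6):

```python
def quad_orbit(a: float, x0: float, steps: int) -> list[float]:
    """Orbit of the quadratic iterator x_{n+1} = a * x_n * (1 - x_n)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(a * x * (1 - x))
    return xs

orbit = quad_orbit(2.5, 0.15, 200)
# first iterate: 2.5 * 0.15 * 0.85 = 0.31875, matching the sequence in the text;
# for a = 2.5 the orbit settles on the fixed point 1 - 1/a = 0.6
print(orbit[:6], "->", orbit[-1])
```

Perturbing the initial value, e.g. quad_orbit(2.5, 0.16, 200), yields an orbit that ends up at the same fixed point 0.6, in agreement with the stable behavior described above; chaos only appears for larger values of a.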

Fig. 7.13. Fifteen iterations of the quadratic iterator for a = 4. (a) x0 = 0.2. (b) x0 = 0.199

Fig. 7.14. Quadratic iterator orbits for a = 4. (a) x0 = 0.2. (b) x0 = 0.199999


Now suppose we have divided the [0, 1] interval into small subintervals of width 0.01 and that we try to find the frequency with which any of these subintervals is visited by the quadratic iterator orbit, for a = 4 and x0 = 0.2. Figure 7.15 shows what happens after 500 000 iterations. As the number of iterations grows, the graph obtained does not change when we specify different initial values x0. Interpreting frequencies as probabilities, we see that the probability of a visit to a subinterval near 0 or 1 is considerably larger than that of a visit to a subinterval near 0.5. In fact, it is possible to prove that for infinitesimal subintervals one would obtain the previously mentioned arcsine law (see Chap. 6) for the frequency of visits of an arbitrary point. That is, the probability of our chaotic orbit visiting a point x is equal to the probability of the last zero gain having occurred before time x in a coin-tossing game! Consider the following variable transformation for the quadratic iterator:

xn = sin^2(πyn) .

For a = 4, it can be shown that this transformation allows one to rewrite the iterator formula as

yn+1 = 2yn (mod 1) ,

where the mod 1 (read modulo 1) operator means removing the integer part of the expression to which it applies (2yn in this case). The chaotic orbits of the quadratic iterator are therefore mapped onto chaotic orbits of yn. We may also express the yn value relative to an initial value y0. As in the case of the interest rate, we get

yn = 2^n y0 (mod 1) .

Fig. 7.15. Probability of the chaotic quadratic iterator (a = 4) visiting a subinterval of [0, 1] of width 0.01, estimated in 500 000 iterations
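The frequency count behind Fig. 7.15 can be reproduced with a short sketch (here with 200 000 rather than 500 000 iterations, which is already enough to see the U-shaped, arcsine-like profile):

```python
def visit_counts(a: float, x0: float, iters: int, width: float = 0.01) -> list[int]:
    """Count how often the orbit of x_{n+1} = a*x*(1-x) visits each
    subinterval of [0, 1] of the given width."""
    nbins = round(1 / width)
    counts = [0] * nbins
    x = x0
    for _ in range(iters):
        x = a * x * (1 - x)
        counts[min(int(x / width), nbins - 1)] += 1
    return counts

counts = visit_counts(4.0, 0.2, 200_000)
# subintervals near 0 and 1 are visited far more often than those near 0.5
print(counts[0], counts[50], counts[99])
```

The counts in the edge bins exceed those of the central bin by roughly an order of magnitude, consistent with the arcsine density, which diverges at 0 and 1 and has its minimum at 0.5.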


Let us see what this equation means, by considering y0 expressed in binary numbering, for instance y0 = 0.101 100 100 010 111 1. In the first iteration we have to multiply by 2 and remove the integer part. In binary numbering, multiplying by 2 means a one-bit leftward shift of the whole number, with a zero carried in at the right (similarly to what happens when we multiply by 10 in decimal numbering). We get 1.011 001 000 101 111 0. Removing the integer part we get 0.011 001 000 101 111 0. The same operations are performed in the following iterations. Suppose we only look at the successive bits of the removed integer parts. What the iterations of the above formula then produce are the consecutive bits of the initial value, which can be interpreted as an infinite sequence of coin throws. Now, we already know that, with probability 1, a Borel normal number can be obtained by random choice in [0, 1]. As a matter of fact, we shall even see in Chap. 9 that, with probability 1, an initial value y0 randomly chosen in [0, 1] turns out to be a number satisfying a strict randomness criterion, whose bits cannot be predicted either by experiment or by theory. Chaos already exists in the initial y0 value! The quadratic iterator does not compute anything; it only reproduces the bits of y0. The main issue here is not an infinite precision issue. The main issue is that we do not have any means of producing the bits of the sequence other than having a prestored list of them (we shall return to this point in Chap. 9). However, not all equations of the above type produce chaotic sequences, in the sense that they do not propagate chaos to the following iterations. For instance, the equation xn+1 = xn + b (mod 1) does not propagate chaos into the following iteration. We see this because, if x0 and x′0 are very close initial conditions, after n iterations we will get x′n − xn = x′0 − x0. That is, the difference of the initial values persists unchanged indefinitely.
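The bit-shifting argument can be checked directly. The sketch below iterates yn+1 = 2yn (mod 1) using exact rational arithmetic (floating point would lose the bits after about 50 doublings) and collects the integer parts removed at each step; they are exactly the binary digits of y0, here the 16-bit example of the text:

```python
from fractions import Fraction

def doubling_bits(y0: Fraction, n: int) -> str:
    """Iterate y_{n+1} = 2*y_n (mod 1) exactly, recording the integer part
    removed at each step: these are just the binary digits of y0."""
    bits = []
    y = y0
    for _ in range(n):
        y = 2 * y
        bit = int(y)      # the removed integer part (0 or 1)
        bits.append(str(bit))
        y -= bit          # the (mod 1) operation
    return ''.join(bits)

y0 = Fraction(0b1011001000101111, 2**16)   # 0.101 100 100 010 111 1 in binary
print(doubling_bits(y0, 16))
```

The iterator indeed computes nothing new: it merely plays back, one bit per iteration, whatever was stored in the initial value.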

7.7 Chance in Natural Systems

Many chance phenomena in natural systems composed of macroscopic objects are of a chaotic nature. These systems generally involve several variables to characterize the state of the system as time goes by, rather than one as in the example of population evolution: • for the 3-sphere system described above, the characterizing variables are the instantaneous values of two or three components of position and velocity, • for a container where certain reagents are poured, the characterizing variables are the concentrations of reagents and reaction products,


• for weather forecasting, the characterizing variables are the pressure and temperature gradients, • for cell culture evolution the relevant variables may be nutrient and cell concentrations, • for financial investments, they may be rates of return and other market variables, and so on and so forth. Let x denote a variable characterizing the system and let xn and xn+1 represent its values at consecutive instants of time, assumed close enough to each other to ensure that the difference xn+1 − xn reflects the instantaneous time evolution in a sufficiently accurate way. If these differences vary linearly with xn, no chaotic phenomena take place: chaos in the initial conditions does not propagate. Chaos requires non-linearities in order to propagate, as was the case for the quadratic dependence in the population growth example. Now it just happens that the time evolution of a huge range of natural systems depends in a non-linear way on the variables characterizing the system. Consider the example of a sphere orbiting in the xy plane around another sphere of much larger mass placed at (0, 0). Due to the law of gravitation, the time evolution of the sphere's coordinates depends on the reciprocal of its distance √(x^2 + y^2) to the center. We thus get a non-linear dependency. In the case of a single orbiting sphere no chaotic behavior will arise. However, by adding a new sphere to the system we are adding an extra non-linear term and, as we have seen, chaotic orbits may arise. The French mathematician Poincaré carried out pioneering work on the time evolution of non-linear systems, a scientific discipline which has attracted great interest not only in physics and chemistry, but also in the fields of biology, physiology, economics, meteorology and others. Many non-linear systems exhibit chaotic evolution under certain conditions. There is an extensive popular science literature describing the topic of chaos, where many bizarre kinds of behavior are presented and characterized.
What interests us here is the fact that a system with chaotic behavior is a chance generator (like tossing coins or dice), in the sense that it is not possible to predict the values of the variables. The only thing one can do is to characterize the system variables probabilistically. An example of a chaotic chance generator is given by the thermal convection of the so-called Bénard cells. The generator consists of a thin layer of liquid between two glass plates. The system is placed over


a heat source. When the temperature of the plate directly in contact with the heat source reaches a certain critical value, small domains are formed in which the liquid layer rotates by thermal convection. These are the so-called Bénard cells. The direction of rotation of the cells (clockwise or counterclockwise) is unpredictable and depends on microscopic perturbations. From this point of view, one may consider that the rotational state of a Bénard cell is a possible substitute for coin tossing. If after obtaining the convection cells we keep on raising the temperature, a new critical value is eventually reached where the fluid motion becomes chaotic (turbulent), corresponding to an even more complex chance generator. Many natural phenomena exhibit some similarity with the Bénard cells, e.g., meteorological phenomena. Meteorologists are perfectly familiar with turbulent atmospheric phenomena and with their hypersensitivity to small changes in certain conditions. In fact, it was a meteorologist (and mathematician), Edward Lorenz, who discovered in 1961 the lack of predictability of certain deterministic systems, when he noticed that weather forecasts based on certain computer-implemented models were highly sensitive to tiny perturbations (initially Lorenz mistrusted the computer and repeated the calculations with several machines!). As Lorenz says in one of his writings: "One meteorologist remarked that if the theory were correct, one flap of a seagull's wings could change the course of weather forever." This high sensitivity to initial conditions afterwards came to be called the butterfly effect. The Lorenz equations, which are also applied to the study of convection in Bénard cells, are

xn+1 − xn = σ(yn − xn) ,
yn+1 − yn = xn(τ − zn) − yn ,

zn+1 − zn = xn yn − βzn . In these equations, x represents the intensity of the convective motion, y the temperature diﬀerence between ascending and descending ﬂuid, and z the deviation from linearity of the vertical temperature proﬁle. Note the presence of the non-linear terms xn zn and xn yn , responsible for chaotic behavior for certain values of the control parameters, as illustrated in Fig. 7.16. This three-dimensional orbit is known as the Lorenz attractor. Progress in other studies of chaotic phenomena has allowed the identiﬁcation of many other systems of inanimate nature with chaotic behavior, e.g., variations in the volume of ice covering the Earth, vari-


Fig. 7.16. Chaotic orbit of Lorenz equations for σ = 10, τ = 28, β = 8/3

ations in the Earth’s magnetic ﬁeld, river ﬂows, turbulent ﬂuid motion, water dripping from a tap, spring vibrations subject to non-linear forces, and asteroid orbits.
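The Lorenz equations above are written as differences between consecutive instants; to iterate them on a computer one must take those instants close together. The Python sketch below assumes a small time step dt = 0.01 (a choice of mine, not given in the text) and uses the control parameters of Fig. 7.16, comparing two trajectories whose initial conditions differ by one part in a hundred million:

```python
def lorenz_orbit(x0, y0, z0, steps, dt=0.01, sigma=10.0, tau=28.0, beta=8/3):
    """Step-by-step iteration of the Lorenz equations with a small time step dt."""
    x, y, z = x0, y0, z0
    orbit = []
    for _ in range(steps):
        dx = sigma * (y - x)          # intensity of convective motion
        dy = x * (tau - z) - y        # temperature difference term (non-linear x*z)
        dz = x * y - beta * z         # vertical profile term (non-linear x*y)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        orbit.append((x, y, z))
    return orbit

a = lorenz_orbit(1.0, 1.0, 1.0, 5000)
b = lorenz_orbit(1.0 + 1e-8, 1.0, 1.0, 5000)   # butterfly-effect perturbation
```

The trajectories remain confined to the bounded, butterfly-shaped attractor of Fig. 7.16, yet the microscopic initial difference is amplified until the two orbits become completely unrelated.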

7.8 Chance in Life

While the existence of chaos in inanimate nature may not come as a surprise, the same cannot be said when we observe chaos in living nature. In fact, life is ruled by laws that tend to maintain equilibrium of the biological variables involved, in order to guarantee the development of metabolic processes compatible with life itself. For instance, if we have been running and we are then allowed to rest, the heart rate which increased during the run goes back to its normal value after some time; concomitantly, the energy production mode that was adjusted to suit the greater muscular requirements falls back into the normal production mode which prevails in the absence of muscular stress. One then observes that the physiological variables (as well as the biological variables at cellular level) fall back to the equilibrium situation corresponding to an absence of effort. If on the other hand we were not allowed to rest and were instead obliged to run indefinitely, there would come a point of death by exhaustion, when it would no longer be possible to sustain the metabolic processes in a way compatible with life. Life thus demands the maintenance of certain basic balances, and one might think that such balances demand a perfectly controlled, non-chaotic, evolution of the physiological and biological variables. Surprisingly enough, this is not the case. Consider the heart rate of a resting adult human. One might think that the heart rate should stay at a practically constant value or at most exhibit simple, non-chaotic

fluctuations. However, it is observed that the heart rate of a healthy adult (in fact, of any healthy animal and in any stage of life) exhibits a chaotic behavior, even at rest, as shown in Fig. 7.17.

Fig. 7.17. Heart rate of an adult man. The frequency is expressed in beats per minute

Heart rate variability is due to two factors: the pacemaker and the autonomic nervous system. The pacemaker consists of a set of self-exciting cells in the right atrium of the heart. It is therefore an oscillator that maintains a practically constant frequency in the absence of stimuli from the autonomic nervous system. The autonomic nervous system is composed of two subsystems: the sympathetic and the parasympathetic. The first is responsible for triggering responses appropriate to stress situations, such as fear or extremes of physical activity; an increase in heart rate is one of the elicited responses. The parasympathetic subsystem is responsible for triggering energy saving and recovery actions after stress; a decrease in heart rate is one of the elicited actions. The latter is also responsible for maintaining certain involuntary internal functions approximately constant; these include digestion, breathing, and the metabolism. Although these functions are usually outside voluntary control, they can be influenced by the mental state.

The variability one observes in the heart rate is therefore influenced by the randomness of external phenomena causing stress situations. However, it has also been observed that there is a chaotic behavior of the heart rate associated with the pacemaker itself, whose causes are not yet fully understood. (However, some researchers have been able to explain the mechanism leading to the development of chaotic phenomena in heart cells of chicken.) Whatever its causes, this chaotic heart rate behavior attenuates with age, reflecting the heart's loss of flexibility in its adaptation to internal and external stimuli.

Beside the heart beat, chaotic phenomena can also be observed in the electrical impulses transmitted by nervous cells, the so-called action potentials. An action potential is nothing other than an electrical impulse sent by a neuron to other neurons. These impulses have an amplitude of about 0.1 volt and last about 2 milliseconds. Two researchers (Glass and Mackey, in 1988) were able to present a deterministic model for the time intervals tn between successive peaks of action potentials of nervous cells stimulated periodically. The model is non-linear and stipulates that the following inter-peak interval is proportional to the logarithm of the absolute value of 1 − 2e^(−Rtn), where R is a control parameter of the model. This model exhibits chaotic behavior of the kind shown in Fig. 7.18 for certain values of the control parameter, and this behavior has in fact been observed in practice.

Fig. 7.18. Evolution of successive action potentials in nervous cells (t0 = 0.1 s, R = 3 s−1)

Many other physiological phenomena exhibit chaotic behavior, e.g., electrical potentials in cell membranes, breathing tidal volumes, vascular movements, metabolic phenomena like the conversion of glucose into adenosine triphosphate, and electroencephalographic signals.

8 Noisy Irregularities

8.1 Generators and Sequences

Imagine that you were presented with two closed boxes named C1 and C2, both producing a tape with a sequence of sixteen zeros and ones whenever a button is pressed. The reader is informed that one of the boxes uses a random generator to produce the tape, while the other box uses a periodic generator. The reader's task is to determine, using only the tapes produced by the boxes, which one, C1 or C2, contains the random generator. The reader presses the button and obtains:

C1 sequence : 0011001100110011
C2 sequence : 1101010111010101

The reader then decides that box C2 contains the random generator, because the corresponding sequence looks irregular. For the moment the 'measure' of irregularity is only subjective; later on, we present a few objective methods for measuring sequence irregularity. The boxes are opened and, to the reader's amazement, the random generator is in box C1 (where, for instance, a mechanical device tosses a coin), whereas box C2 contains the periodic generator. This example illustrates the impossibility of deciding the nature (deterministic or random) of a sequence generator on the basis of observation of one of its sequences (necessarily finite), independently of whatever method we choose to measure irregularity, and in fact on the basis of any randomness measurement. Note that segments of periodic sequences characterized by long and complex periods tend to be considered as random. Consider, for instance, the following sequence:

29, 28, 52, 14, 16, 37, 42, 50, 57, 12, 44, 45, 30, 24, 43, 31, 33, 26, 62, 56, 11, 9, 25, 58, 21, 13, 49, 15, 51, 61, 5, 4, 0, 41, 40, 22, 17, 30, 34, 25, 39, 13, 18, 23, 55, 6, 33, 26, 59, 7, 8, 54, 29, 2, . . .

It is difficult, of course, to perceive the existence of periodicity in such a sequence. If we convert the sequence into binary numbers, viz.,

001000101011011010000101011100001111001110001001110000
110011000010001101000100010111110110010001011000111011
101010010101101100011111011110011001000000000011010010
011101010100100111000110100001101000001011111010100101
111100101111101110101001010000010011100010101101100100
110111010110110001111000011011001010110101001100111111
111110111001100000100011110010111101110100000111100110
000001001000101011011010000101011100001111001110001001
110000110011000010001101000100010111110110010001011000
111011101010010101101100011111011110011001000000000011
010010011101010100100111000110100001101000001011111010
100101

it becomes even more difficult. This sequence is in fact generated by a periodic generator of period 64 of the population evolution type described in Chap. 7.

Now, consider sequences generated by random generators such as coin-tossing. As we know already, any sequence of n bits has the same probability of occurring, namely 1/2^n, and only after generating a large number of sequences of n bits can we expect to elucidate the nature of the generator. Denoting by S the number of ones in a sequence, we expect about 95% of the generated sequences to exhibit average values S/n within 0.5 ± 1/√n. For n = 16, this interval is 0.5 ± 0.25. If more than 5% of the sequences are outside that interval, we then have strong doubts (based on the usual 95% confidence level) that our generator is random. For instance, the sequences 0000000000000000 (S/n = 0) and 0111111111111110 (S/n = 0.87) are outside that interval. We may, however, obtain sequences such as the one generated by C1 or other, even more anomalous sequences,
such as 0000000000000000 or 0000000011111111. . 14. consider sequences generated by random generators such as coin-tossing. 28. 48. 63. 59. For n = 16. 36. 24. as we know already. namely 1/2n . however. 3. 60. even more anomalous sequences. 40. 54. 7.
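The acceptance band 0.5 ± 1/√n mentioned above can be made concrete: for a fair coin, the number of ones S in a sequence of n bits follows a binomial distribution, so the probability of S/n falling inside the band can be computed exactly. A short Python sketch for n = 16 (where the exact coverage is in fact slightly above 95%):

```python
from math import comb, sqrt

n = 16
# S = number of ones in a fair-coin sequence of n bits; S ~ Binomial(n, 1/2).
# Probability that S/n falls inside the band 0.5 +/- 1/sqrt(n):
lo = n * (0.5 - 1 / sqrt(n))   # = 4 for n = 16
hi = n * (0.5 + 1 / sqrt(n))   # = 12 for n = 16
inside = sum(comb(n, s) for s in range(n + 1) if lo <= s <= hi) / 2 ** n
print(f"P(0.25 <= S/n <= 0.75) = {inside:.4f}")
```

Sequences such as 0000000000000000 (S = 0) and 0111111111111110 (S = 14) fall outside this band, so a generator producing them too often would quickly arouse suspicion; a truly random generator produces them only very rarely.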

Moreover, if the generator produces independent outcomes – a usual assumption in fair coin-tossing – we then hope to find an equal rate of occurrence of the pairs 00, 01, 10 and 11 in a large number of sequences, and the same applies to all distinct, equal-length subsequences. It is also possible in this case to establish confidence intervals that would exclude sequences such as 0111111111111110, where 01 and 10 only occur once and 00 never occurs. There are also statistical tests that allow one to reject the randomness assumption, with a certain degree of certainty, when the number of equal-value subsequences (called runs) is either too large – as in 0100110011001100, exhibiting 9 equal-value subsequences – or too small, as in 0000000011111111. Briefly, given a large number of sequences, there are statistical tests allowing one to infer something, with a certain degree of certainty, about the randomness of the generator. The important point is this: the probability measure associated with the generator outcomes is only manifest in sets of a large number of experiments (laws of large numbers). Probability (generator randomness) is an ensemble property. From all that has been said, an important observation emerges: generator randomness and sequence randomness are distinct issues.

In daily life, however, we often cannot afford the luxury of repeating experiments. We cannot repeat the evolution of share values in the stock market; we cannot repeat the historical evolution of economic development; we cannot repeat the evolution of the universe in its first few seconds; we cannot repeat the evolution of life on Earth and assess the impact of gene mutations. We are then facing unique sequences of events on the basis of which we hope to judge determinism or randomness. From this chapter onwards we embark on a quite different topic from those discussed previously: is it possible to speak of sequence randomness? What does it mean?

8.2 Judging Randomness

How can one judge the randomness of a sequence of symbols? The Russian mathematicians Kolmogorov and Usphenski established the following criteria of sequence randomness, based on a certain historical consensus and a certain plausibility as described in probability theory:

• Typicality: the sequence must belong to a majority of sequences that do not exhibit obvious patterns.
• Chaoticity: there is no simple law governing the evolution of the sequence terms.
• Frequency stability: equal-length subsequences must exhibit the same frequency of occurrence of the different symbols.

In any case, there is no definite proof of randomness. The practical assessment of sequence randomness is based on results of statistical tests and on values of certain measures, some of which will be discussed later on. No set of these tests or measures is infallible. Moreover, although there is a good definition of sequence randomness, in the strict and well-defined sense presented in the next chapter, it is always possible to come up with sequences that pass the tests or present 'good' randomness measures and yet do not satisfy one or another of the Kolmogorov–Usphenski criteria. In this sense, the best that can be done in practice using the available tests and measures is to judge not randomness itself, but rather a certain degree of irregularity or chaoticity in the sequences.

It is an interesting fact that the human specification of random sequences is a difficult task. Psychologists have shown that the notion of randomness emerges at a late phase of human development. If after tossing a coin which turned up heads we ask a child what will come up next, the most probable answer the child will give is tails. It seems as though we each have an internal prototype of randomness: if a sequence is presented which does not fit the prototype, the most common tendency is to reject it as non-random. People believe in a sort of 'law of small numbers', according to which even short sequences must be representative of randomness. Psychologists Maya Bar-Hillel and Willem Wagenaar performed an extensive set of experiments with volunteers who were required to write down random number sequences. They came to the conclusion that the sequences produced by humans exhibited:

• too few symmetries and too few long runs,
• too much symbol frequency balancing in short segments,
• too many alternating values, i.e., a tendency to think that deviations from equal likelihood confer more randomness,
• a tendency to use local representations of randomness.

It was also observed that these shortcomings are manifest not only in lay people, but in probability experts as well.

8.3 Random-Looking Numbers

So-called 'random number' sequences are more and more often used in science and technology, where many analyses and studies of phenomena and properties of objects are based on stochastic simulations (using the Monte Carlo method mentioned in Chap. 4; see also what was said about the quadratic iterator in the last chapter). The simulation of random number sequences in computers is often based on a simple formula:

xn+1 = axn mod m .

The mod symbol (read modulo) represents the remainder after integer division (in this case of axn by m). Suppose we take a = 2, m = 10, and that the initial value x0 of the sequence is 6. We thus obtain x1 = 2 × 6 mod 10 = 2, the remainder of the integer division of 12 by 10. Repeating the same reasoning, successive iterations produce the periodic sequence 2, 4, 8, 6, 2, 4, 8, 6, . . . , shown in Fig. 8.1a. Since the sequence values are given by the remainder after integer division, they necessarily belong to the interval from 0 to m − 1, and the sequence is necessarily periodic: as soon as it reaches the initial value, in at most m iterations, the sequence repeats itself.

One can obtain the maximum period by using a new version of the above formula, xn = (axn−1 + c) mod m, setting appropriate values for a and c. Figure 8.1b shows 15 iterations using the same m and an initial value of 2, with a = 11 and c = 3. The generated sequence is the following, with period 10: 25814703692581470369 . . .

Fig. 8.1. Pseudo random-number generator starting at 2 and using m = 10. (a) Period 4. (b) Period 10, with a = 11 and c = 3
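Both generators just described can be sketched in a few lines of Python; the helper below is an illustration, reproducing the period-4 sequence for a = 2, m = 10, x0 = 6 and the full-period sequence of Fig. 8.1b:

```python
def lcg(a: int, c: int, m: int, x0: int, count: int) -> list[int]:
    """Generate count successive values of x_{n+1} = (a*x_n + c) mod m."""
    xs = []
    x = x0
    for _ in range(count):
        x = (a * x + c) % m
        xs.append(x)
    return xs

print(lcg(2, 0, 10, 6, 8))    # multiplicative case: 2, 4, 8, 6 repeating (period 4)
print(lcg(11, 3, 10, 2, 10))  # a = 11, c = 3: visits all ten digits (period 10)
```

With c = 0 the cycle only covers four of the ten possible values, while the a = 11, c = 3 choice runs through every value from 0 to 9 before repeating, which is why such constants are said to achieve the maximum period m.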

Note how the latter sequence looks random when we observe the first 10 values. This type of chaotic behavior induced by the mod operator is essentially of the same nature as the chaoticity observed in Fig. 7.7 for the difference in interest rates. For sufficiently large m, the periodic nature of the sequence is of little importance for a very wide range of practical applications; in computer applications the period is often at least as large as 2^16 = 65 536 numbers. For this reason it is customary to call the sequence pseudo-random, although it is deterministic and periodic. Figure 8.2 illustrates one such sequence obtained by the above method and with period 64.

Fig. 8.2. Pseudo-random sequence of period 64 (a = 5, c = 3, m = 64)

There are other generators that use sophisticated versions of the above formula. For instance, the next iterated sequence number, instead of depending only on the preceding number, may depend on other past values, as in the formula xn = (axn−1 + bxn−2 + c) mod m. A generator of this type that has been and continues to be much used for the generation of pseudo-random binary sequences is xn = (xn−p + xn−q) mod 2, with integer p and q. In 2004, the researchers Heiko Bauke, from the Centre for Theoretical Physics, Trieste, Italy, and Stephan Mertens, from the University of Magdeburg, Germany, showed (more than twenty years after the invention of this generator and its worldwide adoption!), on the basis of a study that involved the analysis of sequences of 26 000 bits, that this generator was strongly biased, producing on a regular basis certain subsequences of zeros and ones exhibiting strong deviations from what would be expected from a true random sequence.
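The binary generator xn = (xn−p + xn−q) mod 2 can be sketched in a few lines. The lags p = 5, q = 2 and the all-ones seed below are illustrative choices only, not the values analysed by Bauke and Mertens:

```python
# Sketch of the lagged binary generator x_n = (x_{n-p} + x_{n-q}) mod 2.
# The lags and the seed are illustrative, not the published ones.
def lagged_bits(p, q, seed, n):
    bits = list(seed)                       # the first values must be supplied
    while len(bits) < n:
        bits.append((bits[-p] + bits[-q]) % 2)
    return bits

stream = lagged_bits(5, 2, [1, 1, 1, 1, 1], 30)
print(''.join(map(str, stream)))
```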

And so a reputable random generator that had been in use for over 20 years without much suspicion was suddenly discredited.

Fig. 8.3. Drunk and disorderly

8.4 Quantum Generators

There are some applications of random number sequences that are quite demanding as far as their randomness is concerned. One example is the use of random numbers in cryptography. Let us take the following message: ANNARRIVESTODAY. This message can be coded using a random sequence of letters, known as a key, such as KDPWWYAKWDGBBEFAHJOJJJXMLI. The procedure is as follows. For the English language the alphabet has 26 letters, which we number from 0 through 25; we consider the letters represented by their number when in alphabetic order. We then take the letters corresponding to the message and the key and add the respective order numbers.

We next take the remainder of the integer division by 26 (the number of letters in the alphabet). For the above message and key the first letters are, respectively, A and K, with order numbers 0 and 10. The sum is 10, and the remainder of its integer division by 26 is 10. The letter of order 10 is K and it will be the first letter of the coded message. Repeating the process one obtains: KQCWNPIFAVZPEED.

If the key is completely random, each letter of the key is unpredictable and the ‘opponent’ does not possess any information which can be used to decipher it. If it is pseudo-random, there are regularities in the key that a cryptanalyst may use in order to decipher it. For this sort of application, requiring high quality sequence randomness, there are commercialized random number generators based on quantum phenomena. Here are some examples of quantum phenomena presently used by random number generators:

• Radioactive emission.
• Arrival event detection of photons in optical fibers.
• Thermal noise generated by passing an electric current through a resistance.
• Noise detected at the input of the sound drive of a computer (essentially thermal noise).
• Noise detected in the computer supply line (with thermal noise component).
• Atmospheric noise.

8.5 When Nothing Else Looks the Same

We mentioned the existence of statistical methods for randomness assessment, or more rigorously for assessing sequence irregularity. Methods such as the confidence interval of S/n (arithmetic average) or the runs test allow one to assess criterion 3 of Kolmogorov and Uspenski. However, these tests have severe limitations, allowing only the rejection of the irregularity hypothesis in obvious cases. Consider the sequence 00110011001100110011. It can be checked that it passes the single proportion test and the runs test; that is, on the basis of those tests, one cannot reject the irregularity hypothesis with 95% confidence. However, the sequence is clearly periodic and therefore does not satisfy criterion 2 of Kolmogorov and Uspenski.

There is an interesting method for assessing irregularity which is based on the following idea: if there is no rule that will generate the following sequence elements starting with a given element, then a certain sequence segment will not resemble any other equal-length segment. On the other hand, if there are similar segments, an element-generating rule may then exist.

Take the very simple sequence 20102. We now present a ‘similarity’ measure consisting of adding up the products of corresponding elements between the original sequence and a shifted version of it. Figure 8.4a shows this sequence (open bars) and another version of it (gray bars) perfectly aligned (for the purposes of visualization, bars in the same position are represented side by side; the reader must imagine them superimposed). Figures 8.4b and c show what happens when the replica of the original sequence is shifted left (say, corresponding to positive shift values). In the case of Fig. 8.4a, the sum of products is 2 × 2 + 1 × 1 + 2 × 2 = 9. In the case of Fig. 8.4b, there is a one-position shift and the sum of products is 0 (when shifting we assume that zeros are carried in); in fact, zero values of one sequence are in this case in correspondence with non-zero values of the other. In the case of Fig. 8.4c, there is a two-position shift and the sum of products is 4.

The sum of products thus works as a similarity measure: when the two sequences are perfectly aligned, the similarity between the two sequences is perfect and the measure reaches its maximum; when the shift is only by one position, the similarity reaches its minimum. Figure 8.5a shows, for the 20102 sequence, the curve of the successive values of this similarity measure, known as autocorrelation.

Fig. 8.4. Correlation of 20102 with itself for 0 (a), 1 (b) and 2 shifts (c)
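The sum-of-products measure is equally simple to reproduce; shifting the replica beyond the end of the sequence brings in zeros, as in the text:

```python
# 'Sum of products' similarity between a sequence and a shifted replica.
# zip() truncates to the overlap, which is equivalent to carrying in zeros.
def similarity(seq, shift):
    return sum(a * b for a, b in zip(seq, seq[shift:]))

seq = [2, 0, 1, 0, 2]
print([similarity(seq, s) for s in range(3)])  # [9, 0, 4], as in Fig. 8.4
```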

The name is due to the fact that we are (co)relating a sequence with itself. The 20102 sequence exhibits a certain zero–non-zero periodicity, and the autocorrelation also exhibits that periodicity. Now suppose we shuffled the sequence values so that it looked like a random sequence of 0, 1 and 2, for instance 01220. The autocorrelation curve, shown in Fig. 8.5b, does not reveal any periodicity.

Fig. 8.5. Autocorrelation curves of (a) 20102, (b) 01220

For a random sequence we expect the autocorrelation to be different from zero only when the sequence and its replica are perfectly aligned: any shift must produce zero similarity (total dissimilarity); nothing looks the same. As a matter of fact, we are interested in a canceling effect between concordant and discordant values, and for that purpose, as we did with the correlation measure described in Chap. 5, we make use of the products of the deviations from the mean value. Figure 8.6 shows the autocorrelations, computed using the deviations from the mean, for the previous periodic binary sequence (at the beginning of this section) and for the first twenty values of the long-period sequence presented at the beginning of this chapter. The irregularity of the latter is evident.

Fig. 8.6. Autocorrelation curves of (a) 00110011001100110011 and (b) 00100010101101101000

Finally, Fig. 8.7 shows the autocorrelation of a 20 000-bit sequence generated by a quantum phenomenon (radioactive emission). Figure 8.7 is a good illustration that, off perfect alignment, nothing looks the same.

Fig. 8.7. Autocorrelation of a 20 000-bit sequence generated by a radioactive emission phenomenon

8.6 White Noise

Light shining from the Sun or from the lamps in our homes, besides its corpuscular nature (photons), also has a wave nature (electromagnetic fields) decomposable into sinusoidal vibrations of the respective fields. Newton’s decomposition of white light reproducing the rainbow phenomenon is a well-known experiment. The French mathematician Joseph Fourier (1768–1830) introduced functional analysis based on sums of sinusoids (the celebrated Fourier transform), which became a widely used tool for analysis in many areas of science. We may also apply it to the study of random sequences.

Suppose we add two sinusoids (over time t) as shown in Fig. 8.8a. We obtain a periodic curve. We may vary the amplitudes, the phases (deviation from the origin of the sinusoid starting point), and the frequencies (related to the inter-peak interval), and the result is always periodic. In Fourier analysis the inverse operation is performed: given the signal (solid curve of Fig. 8.8a), one obtains the sinusoids that compose it (the rainbow); Figure 8.8b shows the result of such an analysis.
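Returning for a moment to the canceling-effect autocorrelation just described (products of deviations from the mean), it can be sketched as follows, applied here to the periodic binary sequence of the previous section:

```python
# Autocorrelation using products of deviations from the mean, so that
# concordant deviations add up and discordant ones cancel.
def autocorr(seq, shift):
    mean = sum(seq) / len(seq)
    dev = [x - mean for x in seq]
    return sum(a * b for a, b in zip(dev, dev[shift:]))

periodic = [int(c) for c in "00110011001100110011"]
# Large positive values recur where the replica realigns (shifts 4, 8, ...),
# revealing the periodicity of the sequence.
print([autocorr(periodic, s) for s in range(6)])
```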

Only the amplitudes and the frequencies are retained: this is the so-called Fourier spectrum. One of the sinusoids of Fig. 8.8a has amplitude A = 2, frequency 0.1 (the reciprocal of the period T = 10) and phase delay ∆ ≈ 2.

Fig. 8.8. (a) Sum of two sinusoids (dotted curves) resulting in a periodic curve (solid curve). (b) Spectrum of the periodic signal

What happens if we add up an arbitrarily large number of sinusoids of the same frequency but with random amplitudes and phases? Well, we obtain a sinusoid with the same frequency (this result is easily deducible from trigonometric formulas). Let us now suppose that we keep the amplitudes and phases of the sinusoids constant, but randomly and uniformly select their frequencies; that is, if the highest frequency value is 1, then according to the uniform law mentioned in Chap. 4, no subinterval of [0, 1] is favored relative to others (of equal length) in the random selection of a frequency value. The result obtained is now completely different: strange as it may seem at first, we obtain a random signal! Figure 8.9 shows a possible example for the sum of 500 sinusoids with random frequencies. As we increase the number of sinusoids, the autocorrelation of the sum approaches a single impulse and the signal spectrum reflects an even sinusoidal composition for all frequency intervals.

Fig. 8.9. An approximation of white noise (1 024 values) obtained by adding up 500 sinusoids with uniformly distributed frequencies
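The experiment of adding sinusoids with uniformly random frequencies can be sketched as follows; the number of sinusoids and samples follow Fig. 8.9, and the seed is an arbitrary choice:

```python
import math, random

# Sum of 500 sinusoids with equal amplitude and phase but uniformly
# distributed random frequencies in [0, 1]: an approximation of white noise.
def sinusoid_noise(n_sins=500, n_samples=1024, seed=1):
    rng = random.Random(seed)
    freqs = [rng.uniform(0.0, 1.0) for _ in range(n_sins)]
    return [sum(math.sin(2 * math.pi * f * t) for f in freqs)
            for t in range(n_samples)]

noise = sinusoid_noise()
# By the central limit theorem the histogram of these values approaches
# the Gauss curve (compare Fig. 8.11).
```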

The same result (approximately) is obtained for any other similar experiment of sinusoidal addition: the spectrum tends to a constant value (1, for unit amplitudes of the sinusoids). In fact, it does not matter what the frequency distribution is, uniform or otherwise! This result is better observed in the autocorrelation spectrum. Figure 8.10 shows the autocorrelation and its spectrum for the Fig. 8.9 sequence.

Fig. 8.10. Autocorrelation and its spectrum for the sequence of Fig. 8.9

A signal of the type shown in Fig. 8.9 is called white noise (by analogy with white light). If we hear a signal of this type, it resembles the sound of streaming water. An interesting aspect is that, although the frequencies are uniformly distributed, the amplitudes of the white noise are not uniformly distributed. Their distribution is governed by the wonderful Gauss curve (see Fig. 8.11). Once again, the central limit theorem is at work here, requiring in the final result that the amplitudes be distributed according to the normal law.

Fig. 8.11. Histogram of a white noise sequence (1 024 values) obtained by adding up 50 000 sinusoids with random amplitudes

White noise is, therefore, nothing other than a random sequence in which the different values are independent of each other (zero autocorrelation for any non-zero shift), produced by a Gaussian random generator. We thus see that the autocorrelation and spectrum analysis methods tell us something about the randomness of sequences. For instance, if we compute the autocorrelation and the spectrum of a sequence of coin throws (0 and 1), we obtain results that are similar to those obtained for white noise (Fig. 8.10). However, these are not infallible either. In fact, the spectrum of Champernowne’s normal number (mentioned in Chap. 6) suggests the existence of five periodic regimes which are not visually apparent. On the other hand, the spectrum of the sequence 0000000010000000..., clearly not random, is also 1.

8.7 Randomness with Memory

White noise behaves like a sequence produced by coin-tossing. Let us now calculate the accumulated gain by adding the successive values of white noise, as we did in the coin-tossing game. We obtain a possible instance of the one-dimensional Brownian motion already known to us, in a more realistic way than using the coin-tossing generator, which imposed a constant step in every direction (see Chap. 6). Figure 8.12a shows the accumulated gain of a possible instance of white noise. On the basis of this result, we may simulate Brownian motions in two (or more) dimensions (see Fig. 8.13).

In 1951, the hydraulic engineer Harold Hurst (1880–1970) presented the results of his study on the river Nile floods (in the context of the Assuan dam project, in Egypt). At first sight one might think that the average yearly flows of any river constitute a random sequence such as the white noise sequence.
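The accumulated-gain construction above can be sketched directly, with a Gaussian generator standing in for the white-noise values:

```python
import random

# One-dimensional Brownian motion as the running sum ('accumulated gain')
# of independent Gaussian white-noise values.
def brownian(n, seed=7):
    rng = random.Random(seed)
    path, total = [], 0.0
    for _ in range(n):
        total += rng.gauss(0.0, 1.0)   # one white-noise increment
        path.append(total)
    return path

walk = brownian(1000)   # a possible instance, as in Fig. 8.12a
```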

However, in that study and on the basis of flow records spanning 800 years, Hurst discovered that there was a correlation among the flows of successive years. There was a tendency for a year of good (high) flow rate to be followed by another year of good flow rate, and for a year of bad (low) flow rate to be followed by another year of bad flow rate. It is almost as if there were a ‘memory’ in the flow rate evolution. The study also revealed that the evolution of river flows was well approximated by a mathematical model where, instead of simply adding up white noise increments, one would add up the increments multiplied by a certain weighting factor influencing the contribution of past increments. Let t be the current time (year) and s some previous time (year). The factor in the mathematical model is given by (t − s)^(H−0.5).

Fig. 8.12. Brownian motion sequence (a) and its autocorrelation spectrum (b). The dotted line with slope −2 corresponds to the 1/f^2 evolution

Fig. 8.13. Simulation of Brownian motion using white noise

Here H (known as the Hurst exponent) is a value greater than zero and less than one. Consider the case with H = 0.8. The weight (t − s)^0.3 will then inflate the contribution of past values of s and one will obtain smooth curves with a ‘long memory of the past’, such as the one in Fig. 8.14c. The current flow value tends to stray away from the mean in the same way as the previous year’s value. Besides river flows, there are other sequences that behave in this persistent way, such as currency exchange rates and certain share values on the stock market. On the other hand, if H = 0.2, the (t − s)^(−0.3) term will quickly decrease for past increments and we obtain curves with high irregularity, with only ‘short-term memory’, where each new value tends to revert to the mean. We observe this anti-persistent behavior, as in Fig. 8.14a, in several phenomena, such as heart rate tracings.

Fig. 8.14. Examples of fractional Brownian motion and their respective autocorrelation spectra (right). (a) H = 0.2, (c) H = 0.8. The spectral evolutions shown in (b) and (d) reveal the 1/f^(2H+1) decay, with 2H + 1 equal to 1.4 and 2.6 for (a) and (c), respectively
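The weighting idea can be sketched directly from the formula. The naive O(n²) sum below merely illustrates persistence versus anti-persistence; it is not a rigorous simulator of fractional Brownian motion:

```python
import random

# Each past white-noise increment at time s contributes to the value at
# time t with weight (t - s)**(H - 0.5): H > 0.5 inflates past
# contributions (persistence), H < 0.5 damps them quickly (anti-persistence).
def weighted_motion(n, H, seed=3):
    rng = random.Random(seed)
    inc = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum((t - s) ** (H - 0.5) * inc[s] for s in range(t))
            for t in range(1, n + 1)]

smooth = weighted_motion(200, H=0.8)   # long memory of the past
rough = weighted_motion(200, H=0.2)    # short-term memory only
```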

8.8 Noises and More Than That

Between the two extremes, for H = 0.5, we have a constant value of (t − s)^(H−0.5), equal to 1, corresponding to Brownian motion. There is no memory in this case. The cases with H different from 0.5 are called fractional Brownian motion.

Besides river flows there are many other natural phenomena, involving chance factors, that are well described by fractional Brownian motion. An interesting example is the heart rate, including the fetal heart rate, a tracing of which is shown in Fig. 8.15 (D = 1.35). In this, as in other fields, the application of mathematical models reveals its usefulness.

Fig. 8.15. Sequence of (human) fetal heart rate in calm sleep (H = 0.65)

Let us consider how chance phenomena are combined in nature and daily life. We have already seen that phenomena resulting from the addition of several random factors, acting independently of one another, are governed by the normal distribution law. There are also specific laws for phenomena resulting from multiplication (instead of addition) of random factors, as we shall see in a moment.

If we record white noise and play it at another speed, it will still sound like white noise (the sound of streaming water or ‘static’ radio noise). If we record a phrase of violin music, similar to a sum of sinusoids as in Fig. 8.8a, and we play it at another speed, the sound is distorted and the distinctive quality of the violin sound is lost. The increase or decrease of the playing speed used to hear the sound is an observation scale. There are, however, classes of sounds whose quality does not change in any appreciable way when heard at other speeds. Such sounds are called scalable noises.

White noise remains white at any observation scale. Let us now consider Brownian motion sequences. Brownian ‘music’ sounds less chaotic than white noise and, when played at different speeds, sounds different while nevertheless retaining its Brownian characteristic. In fact, it can be shown that the fluctuations of Brownian motion are correlated in a given time scale and the corresponding (autocorrelation) spectrum varies as 1/f^2 (see Fig. 8.12b); in this case the standard deviation of the noise amplitude can be shown to vary with the observation scale. We have seen before that sequences of fractional Brownian motion also exhibit correlations; briefly, their spectra vary as 1/f^a with a = 1 + 2H (see Figs. 8.14b and d), taking values between 1 and 3, so that fractional Brownian noises have a spectrum above the 1/f spectrum.

What kind of noise does the 1/f spectrum have? Well, if we play Brownian music with a 1/f^2 spectrum, we get an impression of monotony, whereas if we play ‘noise’ with a 1/f spectrum, we get a certain amount of aesthetic feeling. Related to this fact, it has been shown that music by Bach and other composers also has a 1/f spectrum.

There are many natural phenomena exhibiting scale invariance, known as 1/f phenomena. In Chap. 3 we mentioned the Bernoulli utility function U(f) = log f. As a matter of fact, it has long been known that animals do not respond to the absolute value of some stimuli (such as smell, skin pressure, electric shock, heat, etc.), but rather to the logarithm of the stimuli. Imagine that I pick up the temporal evolution of the fortune of some individual and I multiply all the values by a scale factor a. What happens to the evolution of the respective utilities? Since log(af) = log a + log f, the evolution of the scaled utility shows the same pattern, with the minor difference that there is a shift of the original utility by the constant value log a. That is, the utility function exhibits scale invariance: the variation of U, which is nothing other than the instantaneous variation of the logarithm, i.e. the relative variation of f, is maintained. Utility increments also obey the 1/f rule. There are also many socio-economical 1/f phenomena, such as the wealth distribution in many societies, the frequency with which words are used, and the production of scientific papers. Do chance phenomena generating 1/f spectrum sequences occur that often?

The existence of 1/f random phenomena has been attributed to the so-called multiplicative effect of their causes. Consider the product f of several independent random variables acting in a multiplicative way. Given the central limit theorem and the fact that the logarithm of a product converts into a sum of logarithms, the probability law of the logarithm of the product is, in general, well approximated by the normal law. Now, if log f is well approximated by the normal law, f is well approximated by another law (called the lognormal law) whose shape tends to 1/f as we increase the number of multiplicative causes. Multiplicative effects are quite common: for instance, people that are already rich tend to accumulate extra wealth by a multiplicative process, mediated by several forms of investment and/or financial speculation. In the same way, in the case of the production of scientific papers, it is well known that a senior researcher tends to co-author many papers corresponding to works of supervised junior researchers, more papers than he would individually have produced. Thus, 1/f phenomena also tend to be ubiquitous in nature and everyday life.

8.9 The Fractality of Randomness

There is another interesting aspect of the irregularity of Brownian sequences. Suppose that I am asked to measure the total length of the curve in Fig. 8.12a and for that purpose I am given a very small non-graduated ruler, of length 1 in certain measurement units, say 1 mm. All I can do is to see how many times the ruler (the whole ruler) ‘fits’ into the curve end-to-end. Suppose that I obtain a length which I call L(1), with L(1) = 91 mm. Next, I am given a ruler of length 2 mm and I am once again asked to measure the curve length. I find that the ruler ‘fits’ 32 times, and therefore L(2) = 64 mm. Now, 64 mm is approximately L(1)/√2. By going on with this process I would come to the conclusion that the following relation holds between the measurement scale r (ruler length) and the total curve length L(r): L(r) = L(1)/√r, which we may rewrite as L(r) = L(1) r^(−0.5), with exponent α = −0.5.

We thus have an object (a Brownian motion sequence) with a property (length) satisfying a power law (r^α) relative to the scale (r) used to measure the property. Whenever such a power law is satisfied, we are in the presence of a fractal property. In fact, the length of a sequence of Brownian motion is a fractal. Much has been written about fractals and there is a vast popular literature on this topic. The snowflake curve (Fig. 8.16) is a well-known fractal object characterized by an infinite length, while staying confined inside a bounded plane region.

Fig. 8.16. The first four iterations of the snowflake curve

As one increases the number of iterations, the curve length tends to infinity. In fact, the number of snowflake segments N(r) measured with resolution r is given by N(r) = (1/r)^1.26 = r^(−1.26): the length grows as we decrease r, but not as quickly as 1/r would suggest for the segment count alone. Were it a ‘common’, non-fractal curve, we would obtain N(r) = (1/r)^1 = r^(−1): if, for instance, a 1-meter straight line is measured in centimeters, we find 1/0.01 = 100 one-centimeter segments. In the same way, for a ‘common’, non-fractal area, we would obtain N(r) = (1/r)^2 = r^(−2): if, for instance, I have to fill up a square with side 1 meter with tiles of side 20 cm, I need (1/0.2)^2 = 25 tiles. For ‘common’ objects, the exponent of r is always an integer; for fractal objects such as the snowflake, the exponent of r often contains a fractional part, leading to the ‘fractal’ designation. (As a matter of fact there are also fractal objects with integer exponents.) We say that 1.26 is the fractal dimension of the snowflake. The snowflake curve is placed somewhere between a flat ‘common’ curve of dimension 1 and a plane region of dimension 2.

We thus arrive at the conclusion that the length of the Brownian motion sequence behaves like the length of the snowflake curve, the only difference being that the growth is faster: instead of counting segments, we measure the curve length and obtain L(r) = L(1)(1/r)^0.5. In general we have the relation L(r) = L(1)(1/r)^(D−1), where D is the fractal dimension. The Brownian motion sequence has fractal dimension D = 1.5 (since D − 1 = 0.5).

When talking about fractal objects it is customary to draw attention to the self-similarity property of these objects. For instance, when observing a flake bud at the fourth iteration (enclosed by a dotted ball in Fig. 8.16), one sees an image corresponding to several buds of the previous iteration. Does something similar happen with the Brownian motion sequence? Let us take the sequence in Fig. 8.12a and zoom in on one of its segments. Will the zoomed segment resemble the original sequence? If it did, something repetitive would be present.
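The segment-counting law for the snowflake can be checked from the construction rule itself: at each iteration every segment is replaced by 4 segments of one third the length.

```python
import math

# At iteration k the snowflake consists of 4**k segments of length
# (1/3)**k, so the exponent relating N(r) to 1/r is
# log(4**k) / log(3**k) = log 4 / log 3, for any k.
def koch_dimension(k):
    r = (1.0 / 3.0) ** k      # resolution after k iterations
    n_segments = 4 ** k       # segments counted at that resolution
    return math.log(n_segments) / math.log(1.0 / r)

print(round(koch_dimension(5), 4))  # 1.2619, the dimension 1.26 of the text
```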

Does self-similarity reside in the type of sequence variability? We know that the standard deviation of a sequence of numbers is a measure of its degree of variability. Let us then proceed as follows: we divide the sequence into segments of length (number of values) r and determine the average standard deviation for all segments of the same length. Figure 8.17 shows how the average standard deviation s(r) varies with r. It varies approximately as s(r) = √r. The conclusion is therefore affirmative: there does in fact exist a ‘similarity’ of the standard deviations – and hence of the variabilities – among segments of Brownian motion observed at different scales. It does come as a surprise that a phenomenon entirely dependent on chance should manifest this statistical self-similarity.

It can be shown that the curves of fractional Brownian motion have fractal dimension D = 2 − H. Thus, the one in Fig. 8.14a has fractal dimension 1.8, and is therefore close to filling up a certain plane region, whereas the curve of Fig. 8.14c has dimension 1.2 and is close to a ‘common’ curve. In the first case there is a reduced self-similarity and instead of Fig. 8.17 one would obtain a curve whose standard deviation changes little with r (varying as r^0.2); in the latter case the variation with r (as r^0.8) is closer to linear. For values of H very close to 1, the variation with r is almost linear and the self-similarity is perfect, corresponding to an absence of irregularity: the sequence would not have a random behavior.

Fig. 8.17. Evolution of the average standard deviation for segments of length r of the curve in Fig. 8.12a
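The segment procedure of Fig. 8.17 can be sketched as follows, with a simulated Brownian sequence standing in for the curve of Fig. 8.12a:

```python
import math, random

# Average standard deviation s(r) over segments of length r; for Brownian
# motion s(r) grows roughly as the square root of r.
def avg_segment_std(seq, r):
    stds = []
    for i in range(0, len(seq) - r + 1, r):
        seg = seq[i:i + r]
        m = sum(seg) / r
        stds.append(math.sqrt(sum((x - m) ** 2 for x in seg) / r))
    return sum(stds) / len(stds)

# A Brownian path built as a running sum of Gaussian white noise.
rng = random.Random(11)
path, total = [], 0.0
for _ in range(4096):
    total += rng.gauss(0.0, 1.0)
    path.append(total)

for r in (8, 32, 128):
    print(r, round(avg_segment_std(path, r), 2))
```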


9 Chance and Order

9.1 The Guessing Game

The reader is probably acquainted with the guess-the-noun game, where someone thinks of a noun, say an animal, and the reader has to guess which animal their interlocutor thought of by asking him/her closed questions, which can be answered either “Yes” or “No”. A possible game sequence could be:

– Is it a mammal? – Yes.
– Is it a feline? – Yes.
– Is it a herbivore? – No.
– Is it the lion? – No.

And so on and so forth. The game will surely end, because the number of set elements of the game is finite, independent of the player’s knowledge. It may, however, take too much time for the patience and knowledge of the player, who will end up just asking the interlocutor which animal (s)he had thought of. Just imagine that your interlocutor thought of the Atlas beetle and the reader had to ask, in the worst case, about a million questions, as many as the number of classified insects!

Let us move on to a different guessing game, where one knows all the set elements of the game. Suppose the game set is the set of positions on a chess board

where a queen is placed, but the reader does not know where. Let us assume the situation pictured in Fig. 9.1. The reader is invited to ask closed questions aiming at determining the queen’s position. The most efficient way of completing the game, the one corresponding to the smallest number of questions, consists in performing successive dichotomies of the board: a dichotomous question divides the current search domain into halves. A possible sequence of dichotomous questions and their respective answers could be:

– Is it in the left half? – Yes.
– Is it in the upper half? (In this and following questions it is always understood half of the previously determined domain, the left half in this case.) – No.
– Is it in the left half? – No.
– Is it in the upper half? – Yes.
– Is it in the left half? – Yes.
– Is it in the upper half? – No.

(And the game ends because the queen’s position has now been found.) Whatever the queen’s position may be, it will always be found in at most the above six questions.

Let us analyze this guessing game in more detail. We may code each “Yes” and “No” answer by 1 and 0, respectively, which we will consider as in Chap. 6 (when talking about Borel’s normal numbers) as representing binary digits (bits). The sequence of answers is therefore a sequence of bits (Yes = 1, No = 0).

Fig. 9.1. Chessboard with a queen
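The dichotomy strategy is a binary search. Coding Yes as 1 and No as 0, six questions identify any of the 64 squares; the square numbering below is an illustrative convention, standing in for the board geometry:

```python
import math

# Number of Yes/No questions needed for a set of n equally likely elements.
def questions_needed(n):
    return math.ceil(math.log2(n))

# Successive dichotomies over squares numbered 0..63: each answer halves
# the search domain, so the answer sequence is a 6-bit code of the square.
def locate(square, n_squares=64):
    answers, lo, hi = [], 0, n_squares
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if square < mid:        # 'Is it in the lower half?' -- Yes
            answers.append(1)
            hi = mid
        else:                   # -- No
            answers.append(0)
            lo = mid
    return answers

print(questions_needed(64))     # 6
print(locate(37))               # [0, 1, 1, 0, 1, 0]
```

Note that `questions_needed` is exactly the log2 N measure discussed in the next paragraphs: 6 bits for the chessboard, 3 bits for an 8-element set, 5 bits per symbol for a 32-symbol alphabet.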

In 1920, the electronic engineer Ralph Hartley (1888–1970), working at the Bell Laboratories in the United States, was interested by the problem of defining a measure quantifying the amount of information in a message. The word ‘information’ is used here in an abstract sense: it does not reflect either the form or the contents of the message; it only depends on the symbols used to compose the message. Let us start with a very simple situation where the message sender can only send the symbol “Yes”; that is, the sender can only send sequences of ones. There is then no possible information, in the sense that, from the mere fact that we received “Yes”, there has been no decrease in our initial uncertainty (already zero) as to the symbol being sent. The simplest case from the point of view of getting non-zero information is the one where the emitter uses with equal probability the symbols from the set {Yes, No} and sends one of them. We then say that the information or degree of uncertainty is of 1 bit (a binary digit codes {Yes, No}).

Let us now consider the following set of letters, which we call E: E = {a, b, c, d, e, f, g, h}. Suppose that the sender sends each of the symbols of the set E with equal probability. We may code the set elements by {000, 001, 010, 011, 100, 101, 110, 111}. Finding a letter in E corresponds to asking 3 questions, as many as the number of bits needed to code the elements of E. For instance, finding the letter c corresponds to the following sequence of answers regarding whether or not the letter is in the left half of the previous domain: Yes (c is in the left half of E), No (c is in the right half of {a, b, c, d}), Yes (c is in the left half of {c, d}).

We notice that in the first example of the chessboard we had N = 2^6 = 64 distinct elements and needed (at most) 6 questions, whereas in this second example we have N = 2^3 = 8 distinct elements and need 3 questions. Let N be the number of elements in the set. It is not hard to see that the number of questions needed is in general given by log2 N, the base 2 logarithm of N, which is the exponent to which one must raise 2 in order to obtain N (a brief explanation of logarithms can be found in Appendix A at the end of the book). The information obtained from one symbol, reflecting the degree of uncertainty relative to the symbol value,

is equivalent to the number of Yes/No questions one would have to make in order to select it. we would then have an information measure of 5 bits per symbol. as it should in order to be useful. When including other symbols such as a space. For instance. and that we want to determine the amount of information when the sender sends one of the two symbols.e. 0 and 1. The choice in E1 implies 1 bit of information.. i.04% of the time. 9. Since the message has 8 symbols it will have 5 × 8 = 40 bits of information in total. with probabilities P0 = 3/4 and P1 = 1/4. Hartley’s deﬁnition of the information measure when the sender uses an equally probable N symbol set is precisely log2 N . In the English alphabet there are 26 letters. We clearly see that this information measure only has to do with the number of possible choices and not with the usual meaning we attach to the word ‘information’. Suppose that we are only using two symbols. In 1948 the American mathematician Claude Shannon . and so forth. This is not often the case. we started with the selection of an element from E1 = {left half of E. comma. consider the following telegram: ANN SICK. it is preferable to call this information measure an uncertainty measure. According to the additivity rule the choice of an element of E then involves 3 bits. We would like to say that we need 1 bit of information to describe the experiment in which one coin is tossed and 2 bits of information to describe the experiment in which two coins are tossed. If the occurrence of each symbol were equiprobable. This message is a sequence of capital letters and spaces.2 Information and Entropy In the last section we assumed that the sending of each symbol was equally likely. whereas the letter z only occurs about 0. according to the outcome. It can be shown that even when N is not a power of 2. as in the previous examples.178 9 Chance and Order to the symbol value. As a matter of fact. hyphen. 
What we really mean by this is that ANN SICK has the same information as ANN DEAD or as any other of the N = 240 possible messages using 8 symbols from the set. this measure still makes sense. The choice in either the left or right half of E implies 2 bits. we then proceeded to choose an element from the left or right half of E. Suppose that in the example of selecting an element of E. Now. the letter e occurs about 10% of the time in English texts. The information unit is the bit and the measure satisﬁes the additivity property. we arrive at a 32-symbol set. uncertainty as to the possible choices by sheer chance. exactly as expected. right half of E} and.
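The question-counting rule is easy to check numerically. The sketch below (Python; the function name is ours, not the book's) plays the halving game on a set of N elements and confirms that the number of Yes/No questions needed is the base 2 logarithm of N, rounded up when N is not a power of 2:

```python
import math

def questions_needed(n_elements: int) -> int:
    """Number of Yes/No questions that always suffice to single out
    one element among n_elements, by repeatedly halving the candidates."""
    questions = 0
    remaining = n_elements
    while remaining > 1:
        remaining = math.ceil(remaining / 2)  # keep the half containing the target
        questions += 1
    return questions

# The chessboard example: 64 squares need 6 questions; 8 elements need 3.
assert questions_needed(64) == 6 == math.ceil(math.log2(64))
assert questions_needed(8) == 3
# Even when N is not a power of 2 the measure still makes sense:
assert questions_needed(100) == math.ceil(math.log2(100)) == 7
```

The same halving strategy is what a binary code of the set elements implements implicitly: each answered question fixes one bit of the element's code.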

(1916–2001), considered as the father of information theory, presented the solution to this problem in a celebrated paper. As a matter of fact, he gave the solution not only for the particular two-symbol case, but for any other set of symbols. We present here a simple explanation of what happens in the above two-symbol example.

We first note that if P(0) = 3/4, this means that, in a long sequence of symbols, 0 occurs three times more often than 1. We may look at the experiment as if the sender, instead of randomly selecting a symbol in {0, 1}, picked it from a set whose composition reflects those probabilities. We thus have a set E with 8 elements which is the union of two sets, E1 and E2, containing 6 zeros and two ones, respectively: E = {0, 0, 0, 0, 0, 0, 1, 1}, as shown in Fig. 9.2. The information corresponding to making a random selection of a symbol from E, which we know to be log2 8, decomposes into two terms (additivity rule):

• The information corresponding to selecting E1 or E2, which is the term we are searching for and which we denote by H.
• The average information corresponding to selecting an element in E1 or E2. We already know how to compute the average (mathematical expectation) of any random variable, in this case the quantity of information for E1 and E2 (log2 6 and log2 2, respectively):

Average information in the selection of an element in E1 or E2 = P0 × log2 6 + P1 × log2 2 = P0 log2 (8P0) + P1 log2 (8P1).

Hence,

log2 8 = H + P0 log2 (8P0) + P1 log2 (8P1).

Taking into account the fact that we may multiply log2 8 by P0 + P1, since it is equal to 1, one obtains, after some simple algebraic manipulation,

H = −P0 log2 P0 − P1 log2 P1.

Fig. 9.2. A set with three times more 0s than 1s
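The decomposition can be checked numerically. The short sketch below (Python) computes H both ways, via the additivity rule and via Shannon's closed form, and confirms they agree:

```python
import math

P0, P1 = 3/4, 1/4  # probabilities of the symbols 0 and 1

# Average information of picking an element inside E1 (6 zeros) or E2 (2 ones)
avg_within = P0 * math.log2(6) + P1 * math.log2(2)

# Additivity rule: log2 8 = H + average information within the halves
H_from_decomposition = math.log2(8) - avg_within

# Shannon's closed form
H_direct = -P0 * math.log2(P0) - P1 * math.log2(P1)

assert abs(H_from_decomposition - H_direct) < 1e-12
assert abs(H_direct - 0.811) < 0.001   # about 0.81 bit per symbol
```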

For the above-mentioned probabilities, one obtains H = 0.81. If P0 = P1 = 1/2 (total disorder), one obtains H = 1, the maximum entropy and information in the selection of an element from E. On the other hand, if P0 or P1 is equal to 1 (total order), one obtains H = 0, corresponding to the entropy minimum (by convention 0 log2 0 = 0), the previously mentioned case of zero information. Between the two extremes, we get an entropy value between 0 and 1.

Shannon called this information measure entropy. The general definition of entropy, for any set E of elementary events with probabilities P, is a straightforward generalization of the above result:

H = −sum of P log2 P for all elementary events of E.

It is always non-negative and constitutes a generalization of Hartley's formula. If E has k elements, the entropy maximum is log2 k and occurs when all elements are equally probable (uniform distribution). As a matter of fact, the entropy maximum of an unrestricted discrete distribution occurs when the distribution is uniform.

Let us once again take the set E = {a, b, c, d, e, f, g, h} and assume that each symbol is equiprobable. Then H = log2 8 = 3 and we need 3 bits in order to code each symbol. Now imagine that the symbols are not equiprobable and that the respective probabilities are {0.5, 0.15, 0.12, 0.10, 0.04, 0.04, 0.03, 0.02}. Applying the above formula one obtains H = 2.246, less than 3. Shannon proved that the entropy H is the minimum number of bits needed to code the complete statistical description of a symbol set. Is there any way to code the symbols so that an average information of 2.246 bits per symbol is attained? There is in fact such a possibility and we shall give an idea how such a code may be built.

Take the following coding scheme, which assigns more bits to the symbols occurring less often: a = 1, b = 001, c = 011, d = 010, e = 00011, f = 00010, g = 00001, h = 00000. Note that, in a sequence built with these codes, whenever a single 0 occurs (and not three consecutive 0s), we know that the following two bits discriminate the symbol; whenever three consecutive 0s occur, we know that they are followed by two discriminating bits. There is then no possible confusion. Consider the following message obtained with a generator that used those probabilities: bahaebaaaababaaacge. This message would be coded as

001100000100011001111100110011110110000100011

and would use 45 bits instead of the 57 bits needed if we had used three bits per symbol. Applying the formula for the average, one easily
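The coding scheme is a prefix code, so it can be checked mechanically: encode the message, decode it back, and compare the average code length with the entropy. A sketch (Python; variable names are ours):

```python
import math

code = {'a': '1', 'b': '001', 'c': '011', 'd': '010',
        'e': '00011', 'f': '00010', 'g': '00001', 'h': '00000'}
probs = {'a': 0.5, 'b': 0.15, 'c': 0.12, 'd': 0.10,
         'e': 0.04, 'f': 0.04, 'g': 0.03, 'h': 0.02}

message = "bahaebaaaababaaacge"
encoded = "".join(code[ch] for ch in message)
assert len(encoded) == 45          # versus 3 * 19 = 57 bits at 3 bits/symbol

# Decode greedily: since no codeword is a prefix of another,
# there is no possible confusion.
inverse = {bits: ch for ch, bits in code.items()}
decoded, buffer = [], ""
for bit in encoded:
    buffer += bit
    if buffer in inverse:
        decoded.append(inverse[buffer])
        buffer = ""
assert "".join(decoded) == message

# Average code length versus entropy.
avg_bits = sum(probs[ch] * len(code[ch]) for ch in code)
entropy = -sum(p * math.log2(p) for p in probs.values())
assert abs(avg_bits - 2.26) < 1e-9     # average bits per symbol
assert abs(entropy - 2.246) < 0.001    # entropy lower bound
```

The average of 2.26 bits per symbol sits just above the entropy of 2.246 bits, as Shannon's theorem requires.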

verifies that an average of 2.26 bits per symbol is used with the above coding scheme, close to the entropy value.

9.3 The Ehrenfest Dogs

The word 'entropy' was coined for the first time in 1865 by the German physicist Rudolf Clausius (1822–1888) in relation to the concept of transformation/degradation of energy, postulated by the second principle of thermodynamics. In fact 'entropy' comes from the Greek word 'entropê', meaning 'transformation'. Later on, in 1877, the Austrian physicist Ludwig Boltzmann (1844–1906) derived a formula for the entropy of an isolated system which reflects the number of possible states at molecular level (microstates), establishing a statistical description of an isolated system consistent with its thermodynamic properties. Boltzmann's formula for the entropy, designated by S, is

S = k log Ω,

where Ω is the number of equiprobable microstates needed to describe a macroscopic system. In Boltzmann's formula the logarithm is taken in the natural base (number e) and k is the so-called Boltzmann constant: k ≈ 1.38 × 10^−23 joule/kelvin. Note the similarity between the physical and mathematical entropy formulas. The only noticeable difference is that the physical entropy has the dimensions of an energy divided by temperature, whereas mathematical entropy is dimensionless. However, if temperature is measured in energy units (say, in boltzmanns instead of K, °C or °F), the similarity is complete (the factor k can be absorbed by taking the logarithm in another base).

In 1907, the Dutch physicists Tatiana and Paul Ehrenfest proposed an interesting example which helps to understand how entropy evolves in a closed system. Suppose we have two dogs, Rusty and Toby, side by side, exchanging fleas between them, and no one else is present. Initially, Rusty has 50 fleas and Toby has none. Suppose that after this perfectly ordered initial state, a flea is randomly chosen at each instant of time to move from one dog to the other. What happens as time goes by? Figure 9.3 shows a possible result of the numerical simulation of the (isolated) dog–flea system. As time goes by the distribution of the fleas between the two dogs tends to an equilibrium situation with small fluctuations around an equal distribution of the fleas between the two dogs: this is the situation of total disorder. At each instant of time, one can determine the probability distribution of the fleas by performing a large number of simulations. The result is shown in Fig. 9.4. The distribution converges rapidly to a binomial
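The flea exchange is straightforward to simulate. Below is a minimal sketch (Python; the number of steps and the random seed are our own choices, not the book's):

```python
import random

def ehrenfest(n_fleas=50, n_steps=5000, seed=0):
    """Simulate the Ehrenfest dog-flea model: at each instant a flea is
    chosen uniformly at random and jumps to the other dog.
    Returns the number of fleas on Toby at every instant."""
    rng = random.Random(seed)
    toby = 0  # all fleas start on Rusty: the perfectly ordered state
    history = [toby]
    for _ in range(n_steps):
        if rng.randrange(n_fleas) < toby:
            toby -= 1   # the chosen flea was on Toby
        else:
            toby += 1   # the chosen flea was on Rusty
        history.append(toby)
    return history

history = ehrenfest()
# After a burn-in period the counts fluctuate around an equal split (25 fleas).
tail = history[1000:]
mean_tail = sum(tail) / len(tail)
assert all(0 <= x <= 50 for x in history)
assert abs(mean_tail - 25) < 5
```

Running many such simulations and histogramming the counts at each instant reproduces the converging distribution of Fig. 9.4.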

distribution very close to the normal one. Effectively, the normal distribution is, among all possible continuous distributions with constrained mean value (25 in this case), the one exhibiting maximum entropy. It is the 'most random' of the continuous distributions (as the uniform distribution is the most random of the discrete distributions). Figure 9.5 shows how the entropy of the dog–flea system grows, starting from the ordered state (all fleas on Rusty) and converging to the disordered state.

Fig. 9.3. Time dependence of the number of fleas on Toby

Fig. 9.4. The flea distribution as it evolves in time. The vertical axis represents the probability of the number of fleas for each time instant (estimated in 10 000 simulations)

Fig. 9.5. Entropy evolution in the isolated dog–flea system

The entropy evolution in the Ehrenfest dogs is approximately what one can observe in a large number of physical systems. Is it possible for a situation where the fleas are all on Rusty to occur? Yes, theoretically, it may indeed occur. If the distribution were uniform, each possible flea partition would have a 1/2^50 = 0.000 000 000 000 000 89 probability of occurring. (As a matter of fact, for the binomial distribution, this is also the probability of getting an ordered situation, either all fleas on Toby or all fleas on Rusty.) If the time unit for a change in the flea distribution is the second, one would have to wait about 2^50 seconds, which amounts to more than 35 million years (!), in order to have a reasonable probability of observing such a situation.

Similarly to the dog–flea system, given a recipient initially containing lukewarm water, we do not expect to observe water molecules with higher kinetic energy accumulating at one side of the recipient, which would result in an ordered partition with the emergence of hot water at one side and cold water at the other. If instead of water we take the simpler example of one liter of dilute gas at normal pressure and temperature, we are dealing with something like 2.7 × 10^22 molecules, a huge number compared to the 50-flea system! Not even the whole lifetime of the Universe would suffice to reach a reasonable probability of observing a situation where hot gas (molecules with high kinetic energy) would miraculously be separated from cold gas (molecules with low kinetic energy)! It is this tendency-to-disorder phenomenon and its implied entropy rise in an isolated system that is postulated by the famous second principle of thermodynamics. The entropy rise in many isolated physical systems is the trademark of time's 'arrow'. (As a matter of fact, no
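The waiting-time figure quoted above is simple arithmetic; the check below (Python) converts 2^50 seconds into years:

```python
ordered_prob = 1 / 2**50          # probability of one particular flea partition
seconds = 2**50                   # expected waiting time at one change per second
years = seconds / (365.25 * 24 * 3600)

assert 8.8e-16 < ordered_prob < 8.9e-16   # the 0.000...89 of the text
assert 35e6 < years < 36e6                # more than 35 million years
```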

system is completely isolated, in a strict sense it would have to embrace the whole Universe, and the whole analysis is more complex than the basic ideas we have presented.)

9.4 Random Texts

We now take a look at 'texts' written by randomly selecting lower-case letters from the English alphabet plus the space character; we are dealing with a 27-symbol set. If the character selection is governed by the uniform distribution, usually called a zero-order model, we obtain the maximum possible entropy for our random-text system, i.e., log2 27 = 4.75 bits. A possible 85-character text produced by such a random-text system with the maximum of disorder is:

rlvezsfipc icrkbxknwk awlirdjokmgnmxdlfaskjn jmxckjvar jkzwhkuulsk odrzguqtjf hrlywwn

No similarity with a real text written in English is apparent. Now, suppose that instead of using the uniform distribution, we used the probability distribution of the characters in (true) English texts. In fact, in English texts the character that most often occurs is the space, occurring about 18.2% of the time, while the letter occurring most often is e, and z is the least frequent character, with a mere 0.04% of occurrences. A possible random text for this first-order model is:

vsn wehtmvbols p si e moacgotlions hmelirtrl ro esnelh s t emtamiodys ee oatc taa p invoj oetoi dngasndeyse ie mrtrtlowh t eehdsafbtre rrau drnwsr

In order to generate this text we used the probabilities listed in Table 9.1, which were estimated from texts of a variety of magazine articles made available by the British Council web site, totalizing more than 523 thousand characters. Using this first-order model, the entropy decreases to 4.1 bits, reflecting a lesser degree of disorder.

We may go on imposing restrictions on the character distribution in accordance with what happens in English texts. For instance, in English q is always followed by u, and the letter j never occurs before a space, i.e., in English it never ends a word (with the exception of foreign words). In addition, given a character c, we can take into account the probabilities of c being followed by

another character, say c1, written P(c1 if c). For instance, P(space if j) = 0 and P(u if q) = 1. The entropy for this model of conditional probability, the so-called second-order model, is known as conditional entropy and is computed in the following way:

H(letter 1 if letter 2) = H(letter 1 and letter 2) − H(letter 2).

Performing the calculations, one may verify that the entropy of this second-order model is 3.38 bits. Here is an example produced by the second-order model:

litestheeneltr worither chethantrs wall o p the junsatinlere moritr istingaleorouson atout oven olle trinmobbulicsopexagender leturorode meacs iesind cl whall meallfofedind f cincexilif ony o agre fameis othn d ailalertiomoutre pinlas thorie the st m

Note that, in this 'phrase', the last letter of the 'words' is a valid end letter in English. Moreover, some legitimate English words do appear, such as 'wall', 'oven' and 'the'. Any texts randomly generated by this probability model have a slight resemblance with a text in English.

Moving to third- and fourth-order models, we use, when generating the text, the conditional probabilities estimated in English texts of a given character being followed by another two or three characters, respectively. Here is a phrase produced by the third-order model:

l min trown redle behin of thrion sele ster doludicully to in the an agapt weace din ar be a ciente gair as thly eantryoung othe of reark brathey re the strigull sucts and cre ske din tweakes hat he eve id iten hickischrit thin ficated on bel pacts of shoot ity numany prownis a cord we danceady suctures acts i he sup fir is pene

Table 9.1. Frequency of occurrence of the letters of the alphabet and the space in a selection of magazine articles written in English. The space is the most frequent symbol (0.1818), followed by the letter e (0.1020); the least frequent letter is z (0.0004)
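The conditional-entropy formula can be applied directly to bigram counts. A sketch (Python; the sample texts are our own toy examples, not the book's corpus):

```python
import math
import random
from collections import Counter

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text):
    """H(next letter if current letter) = H(pair) - H(single),
    estimated from overlapping bigrams of the text."""
    singles = Counter(text[:-1])
    pairs = Counter(text[i:i+2] for i in range(len(text) - 1))
    return entropy(pairs) - entropy(singles)

# A perfectly periodic text: the next letter is fully determined by the
# current one, so the conditional entropy is (approximately) zero.
periodic = "ab" * 500
assert abs(conditional_entropy(periodic)) < 0.01

# A text where the next letter is unpredictable has high conditional entropy.
rng = random.Random(1)
scrambled = "".join(rng.choice("ab") for _ in range(1000))
assert conditional_entropy(scrambled) > 0.9
```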

Fig. 9.6. Automating the art of public speaking

And a phrase produced by the fourth-order model:

om some people one mes a love by today feethe del purestiling touch one from hat giving up imalso the dren you migrant for mists of name dairst on drammy arem nights abouthey is the the dea of envire name a collowly was copicawber beliver aken the days dioxide of quick lottish on diffica big tem revolvetechnology as in a country laugh had

The entropies are 2.71 and 2.06, respectively, thus enhancing the resemblance to legitimate English phrases. It would be possible to add syntactic and semantic rules imposing further restrictions on the generated phrases. These additional rules would of course contribute to further lowering the entropy. Note that in all these examples we are only restricting probability distributions, with the result that the texts are only vaguely reminiscent of English texts.

As we go on imposing restrictions and as a consequence increasing the order of the model, the entropy steadily decreases; each added restriction means added information, therefore a decrease in the uncertainty of the text. The question is: how far will it go? It is difficult to arrive at an accurate answer, since for high orders one would need extraordinarily large sequences in order to estimate the respective probabilities with some confidence. As a matter of fact, for a fifth-order model there are 27^5 possible 5-letter sequences, more than 14 million combinations (although a large number of them never occur in English texts)! There is, however, a possibility

of estimating that limit using the so-called Zipf law. The American linguist George Zipf (1902–1950) studied the statistics of word occurrence in several languages, having detected an inverse relation between the frequency of occurrence of any word and its rank in the corresponding frequency table. Let n denote the rank value of the words and P(n) the corresponding occurrence probability. Then Zipf's law prescribes the following dependency of P(n) on n:

P(n) = b / n^a.

The parameter a has a value close to 1 and is characteristic of the language. If for instance the word 'the' is the one that most often occurs in English texts and the word 'with' is ranked in twentieth position, then according to Zipf's law (presented in 1930), the word 'with' occurs about 1/20^a of the times that 'the' does. The same happens to other words. This relation is better visualized in a graph if one performs a logarithmic transformation: log P(n) = log b − a log n. Hence, the graph of log P(n) against log n is a straight line with slope −a.

Table 9.2 shows the relative frequencies of some words estimated from the previously mentioned English texts. Figure 9.7 shows the application of Zipf's law using these estimates and the above-mentioned logarithmic transformation. Figure 9.7 also shows a possible straight line fitting the observed data. According to the straight line in Fig. 9.7, we obtain

P(n) ≈ 0.08 / n^0.96.

Table 9.2. Relative frequencies of certain words in English as estimated from a large selection of magazine articles

the   0.06271    and   0.03060    of    0.03043    to    0.02731
a     0.02702    in    0.02321    is    0.01427    that  0.01157
it    0.01015    for   0.00923    are   0.00860    was   0.00819
you   0.00712    they  0.00705    on    0.00632    as    0.00631
from  0.00631    their 0.00586    can   0.00584    at    0.00564
there 0.00559    one   0.00541    were  0.00521    about 0.00513
have  0.00482    people 0.00465   I     0.00443    with  0.00442
or    0.00407    be    0.00400    but   0.00394    this  0.00390
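The straight-line fit can be reproduced with an ordinary least-squares regression on the log-transformed data. The sketch below (Python, standard library only; the synthetic data are ours) recovers the exponent exactly when the frequencies follow the law perfectly:

```python
import math

def fit_zipf(probabilities):
    """Least-squares fit of log P(n) = log b - a * log n.
    probabilities[0] is the frequency of the rank-1 word."""
    xs = [math.log(n) for n in range(1, len(probabilities) + 1)]
    ys = [math.log(p) for p in probabilities]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return -slope, math.exp(intercept)   # (a, b)

# Synthetic rank-frequency data obeying P(n) = 0.08 / n^0.96 exactly:
data = [0.08 / n ** 0.96 for n in range(1, 33)]
a, b = fit_zipf(data)
assert abs(a - 0.96) < 1e-9
assert abs(b - 0.08) < 1e-9
```

Applied to the real frequencies of Table 9.2, the same regression gives the slightly noisy straight line of Fig. 9.7.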

Fig. 9.7. Zipf's law applied to English texts

Zipf's law is empirical; a theoretical explanation of the law remains to be found, although its applicability to other types of data has been demonstrated (for instance, city populations, business volumes and genetic descriptions). One may then apply the entropy formula to the P(n) values for a sufficiently large n. An entropy H ≈ 9.2 bits per word is then obtained. Given the average size of 7.3 letters per word in English (again estimated from the same texts), one finally determines a letter entropy estimate for English texts of about 1.26 bits per letter. Better estimates based on larger and more representative texts yield smaller values, about 1.2 bits per letter. Other languages manifest other entropy values. For instance, in Portuguese the number of characters to consider is not 27 as in English but 37 (this is due to the use of accents and of ç). However, entropy is higher in Portuguese texts.

9.5 Sequence Entropies

We now return to the sequence randomness issue. In a random sequence we expect to find k-symbol blocks appearing with equal probability for all possible symbol combinations. This is, in fact, the property exhibited by Borel's normal numbers presented in Chap. 6. Let us assume that we measure the entropy for the combinations of k symbols using the respective probabilities of occurrence P(k), as we

did before in the case of texts:

H(k) = −sum of P(k) × log2 P(k).

In the case of an infinite binary sequence of the fair coin-tossing type, H(k) is always equal to 1. If it is a tampered coin with P(1) = 0.8, a value of H(k) equal to −0.2 × log2 0.2 − 0.8 × log2 0.8 = 0.72 is always obtained. For random sequences there are no dependencies and H(k) remains constant.

For non-random sequences, however, the entropy depends on the block length, and it is revealing to look at the differences H(k) − H(k−1), the extra uncertainty brought by each additional symbol. For instance, the sequence consisting of indefinitely repeating 01 has entropy 1 for one-symbol blocks and zero entropy for more than one symbol: for two-symbol blocks, only 01 and 10 occur, with equal probability, so knowing one symbol removes all uncertainty about the next. The sequence consisting of indefinitely repeating 00011011 has entropy 1 for blocks of 1 and 2 symbols; the reader may confirm that all 2-symbol combinations are present in the sequence, whereas there are 2 combinations of 3 symbols and 8 of 4 symbols that never occur (and there is a deficit for other combinations). For 3 symbols, the entropy falls to 0.5, signaling an uncertainty drop for 3-symbol blocks, and it falls again to 0.25 for 4-symbol blocks. Let us then take the limit of the H(k) − H(k−1) differences for arbitrarily large k. One then obtains a number known as the sequence (Shannon) entropy, which measures the average uncertainty content per symbol when all symbol dependencies are taken into account.

In the last section, we obtained an entropy estimate for an arbitrarily long text sequence in English. As a matter of fact, in order to estimate the entropy of fourth-order models of English texts, some researchers have used texts with 70 million letters! The sequence length required for entropy estimation reduces drastically when dealing with binary sequences. However, if there are many symbols, we will also need extremely long sequences in order to obtain reliable probability estimates.

It is interesting to determine and compare the entropy values of several binary sequences. Table 9.3 shows estimated values for several sequences (long enough to guarantee acceptable estimates), from H(1) up to H(3). The sequences named e, π and √2, expressed in binary, are the sequences of the fractional parts of these constants. Here are the first 50 bits of these sequences:
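These block-entropy increments can be verified directly. A sketch (Python; we count k-blocks cyclically over one period of the repeating pattern, which gives the exact probabilities of the infinite sequence):

```python
import math
from collections import Counter

def block_entropy(pattern, k):
    """Entropy (in bits) of the k-symbol blocks of the infinite
    sequence obtained by repeating `pattern`, counted cyclically."""
    doubled = pattern * 2
    counts = Counter(doubled[i:i+k] for i in range(len(pattern)))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def increment(pattern, k):
    """Extra uncertainty H(k) - H(k-1) per additional symbol."""
    return block_entropy(pattern, k) - block_entropy(pattern, k - 1)

# Repeating 01: 1 bit for single symbols, nothing more afterwards.
assert abs(block_entropy("01", 1) - 1) < 1e-12
assert abs(increment("01", 2)) < 1e-12

# Repeating 00011011: increments 1, 0.5 and 0.25 for k = 2, 3, 4.
assert abs(increment("00011011", 2) - 1.0) < 1e-12
assert abs(increment("00011011", 3) - 0.5) < 1e-12
assert abs(increment("00011011", 4) - 0.25) < 1e-12
```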

e:  01000101100101101000100000100011011101000100101011
π:  01010000011011001011110111011010011100111100001110
√2: 00100100001101000011000010100011111111101100011000

The other sequences are as follows:

• normal number refers to Champernowne's binary number mentioned in Chap. 6.
• quantum number is a binary sequence obtained with the quantum generator mentioned in Chap. 8 (atmospheric noise).
• quadratic iterator is the chaotic sequence shown in Fig. 7.14a.
• normal FHR is a sequence of heart rate values (expressed in binary) of a normal human fetus sleeping at rest (FHR stands for fetal heart rate).
• abnormal FHR is the same for a fetus at severe risk.

Table 9.3. Estimated entropy values for several sequences. See text for explanation

Sequence             H(1)     H(2)     H(3)
e                    1        1        1
π                    1        0.9999   0.9998
√2                   1        0.9999   0.9997
Normal number        1        1        0.9999
Quantum number       1        1        0.9999
Quadratic iterator   0.9071   0.6663   0.6664
Normal FHR           0.9186   0.8753   0.8742
Abnormal FHR         0.4586   0.4531   0.4483

Note the irregularity of e, π, √2, the normal number and the quantum number according to the entropy measure: constant H(k) equal to 1. The FHR sequences, referring to fetal heart rate, both look like coin-tossing sequences with tampered coins, where the abnormal FHR is far more tampered with than the normal FHR. Finally, note that the quadratic iterator sequence is less irregular than the normal fetus sequence!

The American researcher Steve Pincus proposed, in 1991, a new irregularity/chaoticity measure combining the ideas of sequence entropy and autocorrelation, called approximate entropy. In this new measure,
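Pincus's approximate entropy can be sketched in a few lines. The following is a simplified implementation (Python; the parameter choices m = 2, r = 0 and the test sequences are ours), following the usual definition: for each position, count the fraction of m-symbol segments that match within tolerance r, and compare with (m+1)-symbol segments:

```python
import math
import random

def approximate_entropy(seq, m=2, r=0):
    """Approximate entropy ApEn(m, r) of a numeric sequence:
    phi(m) - phi(m+1), where phi averages the log-fraction of
    template matches within tolerance r (self-matches included)."""
    def phi(mm):
        n = len(seq) - mm + 1
        templates = [tuple(seq[i:i+mm]) for i in range(n)]
        total = 0.0
        for t in templates:
            matches = sum(1 for u in templates
                          if max(abs(a - b) for a, b in zip(t, u)) <= r)
            total += math.log(matches / n)
        return total / n
    return phi(m) - phi(m + 1)

# A perfectly regular sequence has approximate entropy near zero.
regular = [0, 1] * 50
# A coin-tossing sequence is far more irregular.
rng = random.Random(7)
tossed = [rng.randint(0, 1) for _ in range(100)]

assert abs(approximate_entropy(regular)) < 0.01
assert approximate_entropy(tossed) > 0.3
assert approximate_entropy(tossed) > approximate_entropy(regular)
```

Note that, unlike the block entropies above, this estimate already behaves sensibly on a sequence of only one hundred values, which is exactly the advantage mentioned in the text.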

the probabilities used to calculate entropy, instead of referring to the possible k-symbol combinations, refer to occurrences of autocorrelations of k-symbol segments which are below a specified threshold. This new measure has the important advantage of not demanding very long sequences in order to obtain an adequate estimate. In fact, it has been successfully applied to the study of many practical problems, such as: dynamic heart rate assessment, the effects of aging on the heart rate, early detection of the sudden infant death syndrome, anesthetic effects in electroencephalograms, variability of breathing flow, variability of muscular tremor, regularity of hormonal secretions, regularity of genetic sequences, assessment of the complexity of information transmission in the human brain, and characterization of fetal heart rate variability.

9.6 Algorithmic Complexity

We saw that the e, π, and √2 sequences, as well as the sequence for Champernowne's normal numbers, have maximum entropy. Should we, on the basis of this fact, consider them to be random? Note that all these sequences obey a perfectly defined rule, in spite of the 'disorder' one may detect using the entropy measures of the digit sequences of these numbers (the empirical evidence that they are Borel normal numbers, as mentioned in Chap. 6). Champernowne's number consists of simply generating the successive blocks of k bits, which can easily be performed by (binary) addition of 1 to the preceding block; each digit is perfectly predictable. The digit sequences of the other numbers can also be obtained with any required accuracy using certain formulas, such as:

e = 1 + 1/1! + 1/2! + 1/3! + 1/4! + 1/5! + 1/6! + 1/7! + 1/8! + ··· ,

√2 = 1 + 1/(2 + 1/(2 + 1/(2 + ···))) ,

4/π = 1 + 1²/(2 + 3²/(2 + 5²/(2 + 7²/(2 + 9²/(2 + ···))))) .

This being so, it seems unreasonable to qualify such numbers as random. That is, for a random sequence

one would not expect any of its digits to be predictable. In around 1975, the mathematicians Andrei Kolmogorov, Gregory Chaitin and Ray Solomonoff developed, independently of each other, a new theory for classifying a number as random, based on the idea of its computability. Suppose we wished to obtain the binary sequence consisting of repeating 01 eight times: 0101010101010101. We may generate this sequence in a computer by means of a program with the following two instructions:

REPEAT 8 TIMES:
PRINT 01

Imagine that the REPEAT and PRINT instructions are coded with 8 bits and that the '8 TIMES' and '01' arguments are coded with 16 bits. We then have a program composed of 48 bits for the task of printing a 16-bit sequence. Even for eight repeats of 01, it would have been more economical if the program were simply PRINT 0101010101010101, in which case it would have only 24 bits. However, if the sequence is composed of 1000 repetitions of 01, we may also use the above program, replacing 8 by 1000 (also codable with 16 bits); the 48-bit program allows one to compact the information of a 2000-bit sequence.

Kolmogorov, Chaitin and Solomonoff defined the algorithmic complexity¹ of a binary sequence s as the length of the smallest program s∗, also a binary sequence, that outputs s (see Fig. 9.8). In this definition of algorithmic complexity it does not matter in which language the program is written (we may always translate from one language to another), nor which type of computer is used. (In fact, the algorithmic complexity may be low in a computer where the instructions are coded with a smaller number of bits, dependent on the computer being used, but this is by and large an irrelevant aspect. Elaborating this idea, the definition can be referred to a certain type of minimalist computer.) In the case where s is a thousand repeats of 01, the algorithmic complexity is clearly low: the rather simple program shown above reproduces the 2000 bits of s using only 48 bits. Instruction codes represent a constant surplus, which we may neglect

Fig. 9.8. Computer generation of a sequence

¹ Not to be confused with computational complexity, which refers to an algorithm's computational efficiency in run time.
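Although true algorithmic complexity is uncomputable, an ordinary compressor gives a rough, practical upper bound on it. A sketch (Python, using the standard zlib module; the sequences are our own examples):

```python
import random
import zlib

# A highly regular 2000-bit sequence: a thousand repeats of 01.
regular = ("01" * 1000).encode()

# A coin-tossing sequence of the same length.
rng = random.Random(42)
tossed = "".join(rng.choice("01") for _ in range(2000)).encode()

short = len(zlib.compress(regular, 9))
long_ = len(zlib.compress(tossed, 9))

# The regular sequence admits a far shorter description, much as the
# 48-bit REPEAT program does; the random-looking one resists compression.
assert short < 50
assert short < long_
```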

when applying the definition; moreover, one has only to take a sufficiently long length for s to justify neglecting the surplus.

From the point of view of algorithmic complexity, a random sequence is one which cannot be described in a more compact way than by simply instructing the computer to write down its successive digits. The instruction that generates the sequence is then the sequence itself (neglecting the PRINT surplus). Thus, the maximum complexity of an n-bit sequence is approximately equal to n (equal to n, removing instruction surplus). At the other extreme, a sequence composed of either 0s or 1s has minimum complexity, approximately equal to log2 n, and its pattern is obvious.

We may apply the definition of algorithmic complexity to the generation of any number (independently of whether it is expressed in binary or in any other numbering system). Consider the generation of e. Its digits seem to occur in a haphazard way, but as a matter of fact by using the above formula one may generate an arbitrarily large number of digits of e using a much shorter program than would correspond to printing out the whole digit sequence. Hence, the algorithmic complexity of e, and likewise of π and √2, and therefore the length of the minimal program (the smallest length program outputting s), is rather low. For the same reason, Champernowne's number is not algorithmically random.

Algorithmic complexity has a deep connection with information theory. For instance, consider a binary sequence s with n bits and a certain proportion p of 1s. If p is different from 1/2, the entropy is then smaller than 1. As we saw previously when talking about entropy-based coding, there exists a sequence sc of length nH which codes s using Shannon's coding by exploring the redundancies of s. But if s can be compressed we may build the s∗ program simply as a decoding of sc. Neglecting the decoding surplus, the algorithmic complexity is nH < n. In conclusion, in order for s to be algorithmically random, the proportions of 0s and 1s must be equal; otherwise, one would be able to shorten s∗ relative to s. Moreover, every possible equal-length block of 0s and 1s must occur in equal proportions, once again by exploring the conditional entropies. Thus, infinite and algorithmically random sequences must have maximum entropy (log2 k, for sequences using a k-symbol set) and must be normal numbers! However, not all normal numbers are algorithmically random. In brief, algorithmically random means irreducible, or incompressible. If P(0) = 3/4 and P(1) = 1/4, we

have a middle situation: the complexity is approximately nH = 0.8n (see Sect. 9.2).

Note that a minimal program s∗ must itself be algorithmically random. If it were not, there would then exist another program s∗∗ outputting s∗, and we would be able to generate s with a program instructing the generation of s∗ with s∗∗, followed by the generation of s with s∗. The latter program would only have a few more bits than s∗∗ (those needed for storing s∗ and running it); s∗ would therefore not be a minimal program, which is a contradiction.

An interesting fact is that it can be demonstrated that fewer than a fraction 1/2^c of the sequences have complexity smaller than n − c. For instance, for c = 10, fewer than 1/1024 of the sequences have complexity less than n − 10. Incidentally, if one considers the interval [0, 1] representing infinite sequences of 0s and 1s, as we did in Chap. 6, there is an infinite subset of Borel normal numbers that is algorithmically random. It is then not at all surprising that the quadratic iterator has infinitely many chaotic orbits; a chaotic orbit is its shortest description and its fastest computer. Unfortunately, although there are infinitely many algorithmically random sequences, it is practically impossible to prove whether or not a given sequence is algorithmically random.

9.7 The Undecidability of Randomness

The English mathematician Alan Turing (1912–1954), pioneer of computer theory, demonstrated in 1936 a famous theorem known as the halting problem theorem. In simple terms it states the following: a general program deciding in finite time whether a program for a finite input finishes running or will run forever (solving the halting problem) cannot exist for all possible program–input pairs. One may build programs which decide whether or not a given program comes to a halt. The impossibility refers to building a program which would decide the halting condition for every possible program–input pair; note that the statement applies to all possible program–input (finite input) pairs.

The halting problem theorem is related to the famous theorem about the incompleteness of formal systems, demonstrated by the German mathematician Kurt Gödel (1906–1978): there is no consistent and complete system of axioms allowing one to decide whether any statement built with the axioms of the system is true or false. From these results it is possible to demonstrate the impossibility of proving whether or not a sequence is algorithmically random. To see

To see why, suppose that for every positive integer n we wish to determine whether there is a sequence whose complexity exceeds n. For that purpose we suppose that we have somehow got a program that can assess the algorithmic complexity of sequences. The program would then have to include instructions for the evaluation of algorithmic complexity, occupying k bits, plus the specification of n, occupying log2 n bits. This program works in such a way that, as soon as it detects a sequence with complexity larger than n, it prints it and stops. For sufficiently large n the value of k + log2 n is smaller than n (see Appendix A), and we arrive at a contradiction: a program whose length is smaller than n computes the first sequence whose complexity exceeds n. The contradiction can only be avoided if the program never halts! That is, we will never be able to decide whether or not a sequence is algorithmically random.

9.8 A Ghost Number

In 1976, the American mathematician Gregory Chaitin defined a truly amazing number (presently known as Chaitin's constant), denoted by Ω. This number corresponds to the probability of a randomly chosen program coming to a halt in a given computer. The random choice of the program means that the computer is fed with binary sequences interpreted as programs, those sequences being generated by a random process of the coin-tossing type. Assuming that the programs are delimited (with begin–end marks or a prefix indicating the length), the calculation of Ω is made by adding up, for each program of k bits coming to a halt, a contribution of 1/2^k to Ω. With this construction one may prove that Ω is between 0 and 1, excluding these values: 0 would correspond to no program coming to a halt and 1 to all programs halting, and both situations are manifestly impossible.

Ω is a Borel normal number that can be described but not computed. No one knows its value (or better, one of its possible values), but we know what it represents. There is no set of formulas (as for instance the above formulas for the calculation of e, π, or √2) and/or rules (as, for instance, the one used to generate Champernowne's number) which, implemented by a program, would allow one to compute it.

Imagine that some goddess supplied us the sequence of the first n bits of Ω, which we denote by Ωn. We would then be able to decide about the halting or not halting of all programs with length shorter than or equal to n.

The procedure is as follows: we would check, for each k between 1 and n, whether any of the possible 2^k programs of length k stopped or not. In order to avoid getting stuck with the first non-halting program, we would have to run one step of all the programs in shared time. Each halting program of length k contributes 1/2^k to an accumulated sum. When the accumulated sum of all these contributions reaches the number Ωn, we have solved the halting problem for all programs of length up to n bits (the computation time for not too small values of n would be enormous!).

Let us now attempt to determine whether or not Ωn is algorithmically random. If the algorithmic complexity of Ωn were smaller than n, there would exist a program of length shorter than n outputting Ωn. Knowing Ωn, we also know all programs up to n bits that come to a halt, among them the programs producing random sequences of length up to and including n bits. We would then feed the computer with all sequences of 1, 2, 3, . . . , n bits, using the process described above. As a consequence, a program of length shorter than n — the one outputting Ωn, which was miraculously given to us — would produce all random sequences of length n, which is a contradiction. Ω is therefore algorithmically random. Effectively, it is a sort of quintessence of randomness.

Besides the purely theoretical interest, Chaitin's constant finds interesting applications in the solution of some mathematical problems (in the area of Diophantine equations). Meanwhile, another interpretation of Ω is the following: if we repeatedly toss a coin, noting down 1 if it turns up heads and 0 if it turns up tails, the probability of eventually reaching a sequence representing a halting program is Ω.

Recently, Cristian Calude, Michael Dinneen, and Chi-Kou Shu have computed the first 64 bits of Ω on a special machine (they actually computed 84 bits, but only the first 64 are trustworthy). Here they are:

0.000 000 100 000 010 000 011 000 100 001 101 000 111 111 001 011 101 110 100 001 000 0

The corresponding decimal number is 0.078 749 969 978 123 844.
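The "one step of all the programs in shared time" bookkeeping used in the argument above is today called dovetailing, and is easy to sketch. The toy programs below, with their halting times given in advance, are our own invention purely for illustration — for a real machine the halting times are of course unknown, which is the whole point:

```python
def dovetail(programs, rounds):
    """Run one step of every program per round, so that a non-halting
    program (halting time None) can never block the others.
    programs: dict mapping a program name to its halting step, or None."""
    halted = set()
    for step in range(1, rounds + 1):
        for name, halting_step in programs.items():
            if halting_step is not None and halting_step == step:
                halted.add(name)   # observed to halt at this shared step
    return halted

toy = {"p1": 3, "p2": None, "p3": 7, "p4": None, "p5": 1}
assert dovetail(toy, 10) == {"p1", "p3", "p5"}
```

The non-halting programs "p2" and "p4" consume only one step per round, so every halting program is eventually observed.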

religious thought. Occam contributed to separating logic from metaphysics, paving the way toward an attitude of scientific explanation.

A famous principle used by Occam in his works, known as the principle of parsimony (also called the principle of economy or principle of simplicity), has been raised to the level of an assessment and comparison criterion for the scientific reasonableness of contending theories. It states the following: one should not increase the number of explanatory assumptions for a hypothesis or theory beyond what is absolutely necessary (unnecessary assumptions should be shaved off). The principle is popularly known as Occam's razor. When several competing theories are equal in other respects, it recommends selecting the theory that introduces the fewest assumptions and postulates the fewest hypothetical entities. Or, in a more compact form: the simpler the explanation, the better it is.

Algorithmic complexity theory brings interesting support for Occam's razor. If the assumptions required by a theory are coded as sequences of symbols (for instance, as binary sequences), a scientific theory capable of explaining a set of observations can be viewed as a minimal program reproducing the observations and allowing the prediction of future observations.

Let us then consider the probability P(s) of a randomly selected program producing the sequence s. Each program outputting s contributes 1/2^r to P(s), where r is the program length. It can be shown that P(s) = 1/2^C(s), where C(s) represents the algorithmic complexity of s; in fact, a possible cause for the production of s is the existence of another sequence s∗, of length C(s), used as input in a certain machine, and the probability of obtaining s∗ by random choice among all programs of that length is 1/2^C(s). Thus sequences with low complexity occur more often than those with high complexity: the most probable causes are the shortest programs.

Another argument stands out when considering the production of binary sequences of length n. If they are produced by a pure chance mechanism, they have probability 1/2^n of showing up. Thus, if C(s) is much smaller than n, the probability that s occurred by pure luck is very small. The ratio 2^(−C(s))/2^(−n) = 2^(n−C(s)) thus measures how much more probable it is that s occurred due to a definite cause rather than by pure luck. One more reason to prefer simple theories!
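The final ratio can be restated in two lines of code; the figures below are arbitrary illustrations, not data from the book:

```python
def cause_versus_luck_odds(n, complexity):
    """The ratio 2^(n - C(s)) from the text: how much more probable a
    definite cause is than pure luck for a length-n binary sequence s."""
    return 2 ** (n - complexity)

# A 100-bit sequence compressible to 20 bits almost surely has a cause...
assert cause_versus_luck_odds(100, 20) == 2 ** 80
# ...while an incompressible one gains nothing over luck.
assert cause_versus_luck_odds(100, 100) == 1
```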


10 Living with Chance

10.1 Learning, in Spite of Chance

Although immersed in a vast sea of chance, we have so far demonstrated that we are endowed with the necessary learning skills to solve the problems arising in our daily life, including those whose solution is a prerequisite to our survival as a species. Up to now, we have also shown that we are able to progress along the path of identifying the laws (regularities) of nature, i.e., that we possess the capability of detecting regularities in phenomena where chance is present, a capability which basically capitalizes on the law of large numbers.

But it is not only the human species that possesses learning skills. A large set of living beings also manifests the learning skills needed for survival. Ants, for instance, are able to determine the shortest path leading to some food source. The process unfolds as follows. Several ants leave their nest to forage for food, following different paths. There is therefore a random variation of the individual itineraries, but a learning mechanism is superimposed on that random variation: an increased probability of following the path with a higher level of pheromone. The ants that, by chance, reach a food source along the shortest path are sooner to reinforce that path with pheromones, because they are sooner to come back to the nest with the food. Those that subsequently go out foraging find a higher concentration of pheromones on the shortest path, and therefore have a greater tendency (higher probability) to follow it, even if they did not follow that path before. This type of social learning, which we shall call sociogenetic, is common among insects.
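The pheromone mechanism just described can be imitated in a few lines. The sketch below is a deliberately minimal model — two paths, round-trip costs 1 and 2, choice probability proportional to pheromone — and all its numbers are illustrative assumptions, not the book's:

```python
import random

def ant_colony(trips=5000, seed=3):
    """Share of pheromone on the short path after `trips` foraging trips."""
    random.seed(seed)
    pheromone = {"short": 1.0, "long": 1.0}   # initially undifferentiated
    cost = {"short": 1.0, "long": 2.0}        # round-trip durations
    for _ in range(trips):
        total = pheromone["short"] + pheromone["long"]
        # probability of picking a path is proportional to its pheromone
        path = "short" if random.random() < pheromone["short"] / total else "long"
        # a shorter round trip deposits pheromone at a higher rate
        pheromone[path] += 1.0 / cost[path]
    return pheromone["short"] / (pheromone["short"] + pheromone["long"])
```

Running this, the share of pheromone on the short path grows well past one half: random variation plus reinforcement is enough to single out the shortest itinerary.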

Other classes of animals provide not only examples of social learning, but also of individual learning, frequently depending on a scheme of punishment and reward. For instance, many animals learn to distinguish certain food sources (leaves, fruits, nuts, etc.) according to whether they have a good taste (reward) or a bad taste (punishment). Human beings begin a complex learning process at a tender age, a process in which social and individual mechanisms complement and intertwine with each other, and which we shall call ontogenetic.

Let us restrict the notion of learning here to the ability to classify events/objects: short or long path, tasty or bitter fruit, etc. There are other manifestations of learning, such as the ability to assign values to events/objects or the ability to plan a future action on the basis of previous values of events/objects. However, in the last instance, these manifestations can be reduced to a classification: a fine-grained classification (involving a large number of classes) in the case of assigning values to events/objects, or a fine-grained and extrapolated classification in the case of future planning.

Many processes of individual learning are similar to the guessing game mentioned in Chap. 9. Imagine a patient with a suspicion of cardiovascular disease going to a cardiologist. The event/object the cardiologist has to learn is the classification of the patient in one of several diagnostic classes. This classification depends on the answer to several questions:

• Does the patient have a high cholesterol level?
• Is the patient overweight?
• Is the electrocardiogram normal?
• Does s(he) have cardiac arrhythmia?

We may call this type of learning, in which the learner is free to put whatever questions s(he) wants, active learning. As seen in Chap. 9, active learning guarantees that the target object, in a possible domain of n objects (in the case of the cardiologist, n is the number of possible diagnostic classes), is learned after presenting at most log2 n questions. (Of course, things are not usually so easy in practice. At the start of the diagnostic process one does not have the answers to all possible questions, and some of the available answers may be ambiguous.)

We may formalize the guessing game by assuming that one intends to guess a number x belonging to the interval [0, 1] that some person X thought of. To guess the number, following the guessing game strategy, L successively halves the interval known to contain x. Imagine that the number that X thought of is 0.3 and that the following sequence of questions and answers takes place:

L : Is it greater than 0.5? X : No.
L : Is it greater than 0.25? X : Yes.

And so on and so forth. After n questions, L comes to know x with a deviation that is guaranteed not to exceed 1/2^n. Learning means here classifying x in one of the 2^n intervals of [0, 1] with width 1/2^n.
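The question-and-answer exchange above is plain bisection, and its 1/2^n guarantee can be checked directly (a sketch of ours, with X's answers simulated):

```python
def guess_by_questions(x, n):
    """Locate x in [0, 1] with n 'is it greater than...?' questions."""
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if x > mid:        # X answers the question truthfully
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2   # centre of the remaining interval

# After n questions the error cannot exceed the interval width 1/2^n.
assert abs(guess_by_questions(0.3, 10) - 0.3) <= 1 / 2 ** 10
```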
Things get more complicated when we are not free to present the questions we want: we are then dealing with a passive learning process. Coming back to the cardiologist example, suppose that the consultation is done in a remote way and the cardiologist has only an abridged list of the 'answers'. In this case the cardiologist is constrained to learn the diagnosis using only the available data. Is it also possible to guarantee learning in this case? The answer is negative. (As a matter of fact, were the answer positive, we would probably already have learned everything there is to learn about the universe!)

Passive learning can be exemplified by learning the probability of turning up heads in the coin-tossing game. L is given a list of the results of n throws, and L tries to determine, on the basis of this list, the probability P of heads. For this purpose L determines the frequency f of heads in the n throws. The learning of P is obviously not guaranteed, in the sense that the best one can obtain is a degree of certainty, for some tolerance of the deviation between f and P. When talking about Bernoulli's law of large numbers in Chap. 4, we saw how to establish a minimum value of n for probability estimation, for several values of the deviation tolerance between f and P, and using a 95% certainty level. The dotted line in Fig. 10.1 shows these minimum values of n when learning a probability. In passive learning there is therefore a new element: there is no longer an absolute certainty of reaching a given deviation between what we have learned and the ideal object/value to be learned — as there was when actively choosing a number x in [0, 1]. Rather, we are dealing with degrees of certainty (confidence intervals) associated with the deviations.

Let us once again consider learning a number x in the interval [0, 1], but in a passive way this time. Let us assume that X has created a sequence of random numbers, all with the same distribution in [0, 1] and independent of each other, assigning to each one a label or code according to whether it belongs (say, label 1) or not (say, label 0) to [0, x]. The distribution law for the random numbers is assumed unknown to L.

Fig. 10.1. Learning curves showing the minimum number of cases (logarithmic scale) needed to learn a probability (dotted curve) and a number (solid curve) with a given tolerance and 95% certainty

To fix ideas, suppose that X thought of x = 0.3 and L was given the list:

(0.37, 0) (0.13, 1) (0.78, 0) (0.29, 1) (0.92, 0) (0.01, 1)

A list like this one, consisting of a set of labeled points (the objects in this case), is known as a training set. On the basis of this training set, the best that L can do is to choose a value between 0.29, the largest number labeled 1, and 0.37, the smallest number labeled 0 — say their middle point — thereby getting close to the true value 0.3. Now suppose that the random numbers were such that L received the list:

(0.63, 0) (0.48, 0) (0.87, 0) (0.92, 0) (0.78, 0) (0.83, 0)

A list like this one does not help L much: it only reveals that x lies below 0.48. In general, if x < 0.5, there is a 1/2^n probability that all n numbers fall in [0.5, 1], in which case we may get a deviation as large as 0.5. It can be shown that in this learning of a number x we need more cases than when learning the probability value in the coin-tossing experiment: larger training sets are required in order to be protected against atypical training sets (for instance, all points in [0.5, 1]). The minimum values of n in the passive learning of a number, assuming that the distribution of the random numbers generated by X is uniform, for several values of the deviation tolerance and using 95% certainty, are shown by the solid curve in Fig. 10.1. Learning a number when the probability law of the training set is unknown is more difficult than learning a probability. Here is a concrete example: in order to learn a probability with 0.1 tolerance and 95% certainty, 185 cases are enough; to learn a number under the same conditions requires at least 1 696 trials.

Finally, there is a third type of learning, which we shall call phylogenetic.
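Passive learning of the number x can also be simulated. The rule below — bracket x between the largest point labelled 1 and the smallest point labelled 0 — is one natural choice of ours, and the uniform training data are synthetic:

```python
import random

def passive_estimate(x, n, seed=7):
    """Estimate x from n uniform points labelled 1 if they fall in [0, x]."""
    random.seed(seed)
    points = [random.random() for _ in range(n)]
    ones = [p for p in points if p <= x]    # labelled 1
    zeros = [p for p in points if p > x]    # labelled 0
    lo = max(ones) if ones else 0.0
    hi = min(zeros) if zeros else 1.0
    return (lo + hi) / 2

# Typically the bracket shrinks as n grows, but nothing guarantees it:
# an atypical sample can leave the estimate far from x.
```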

species: the mechanisms of natural selection which, via mutations and hereditary laws, grant a higher success to certain genomes to the detriment of others. Learning works as follows here: several gene combinations are generated inside the same species or phylum, so that a non-homogeneous population of individuals becomes available in which a genomic variability component is present. Some individuals are better off than others in the prosecution of certain vital goals, tending therefore to a better success in genomic transmission to the following generation. Learning corresponds here to better performance in the face of the external environment, as well as to the individual traits that reflect it.

We may establish an analogy with the coin-tossing case by assuming a living being, called a 'coin', whose best performance given the external environment corresponds to fairness: P(heads) = 1/2. At some instant there is a 'coin' population with different genes, to which correspond different values of P(heads). The decision about which 'coins' to eliminate from each generation depends on the departure of P(heads) from the better performance objective (fairness): the 'coins' departing from fairness get a non-null probability of being eliminated from the population, with higher probability the more they deviate from fairness. As a consequence, in the following generation the distribution of P(heads) is different from the one existing in the previous generation.

In ontogenetic learning we have a single individual carrying out, or being confronted with, random experiments as time goes by. This is the case of 'temporal average' convergence that we have spoken about so often: the more experiments there are, the better the learning will be. In the simplest sociogenetic learning model we have a homogeneous set of individuals, each of them performing a distinct random experiment; everything happens as if several individuals were throwing coins, with the throws performed under the same conditions. Here learning resembles the convergence of an 'ensemble average': the more individuals there are, the better the learning will be. In phylogenetic learning the individuals are heterogeneous; it too depends on 'ensemble average' convergence, and from generation to generation we also have a temporal convergence of the average resembling the one in Sect. 6.4.

In all these learning models chance phenomena are present and, in the last instance, it is the laws of large numbers that make learning possible. These different learning schemes are currently being used to build so-called intelligent systems. Examples are ant colony optimization algorithms for sociogenetic learning, genetic algorithms for phylogenetic learning, and artificial neural networks for ontogenetic learning.
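The coin-population analogy can be turned into a toy genetic algorithm. Population size, mutation spread and the number of generations below are arbitrary choices for illustration:

```python
import random

def evolve_coins(pop_size=30, generations=60, seed=2):
    """Mean P(heads) after selecting, each generation, the 'coins'
    closest to fairness and refilling with mutated copies of them."""
    random.seed(seed)
    population = [random.random() for _ in range(pop_size)]  # P(heads) values
    for _ in range(generations):
        population.sort(key=lambda p: abs(p - 0.5))  # fitness: closeness to 1/2
        survivors = population[: pop_size // 2]      # the rest are eliminated
        children = [min(1.0, max(0.0, p + random.gauss(0.0, 0.02)))
                    for p in survivors]              # hereditary transmission + mutation
        population = survivors + children
    return sum(population) / len(population)

# The ensemble average of P(heads) drifts toward the fairness objective 1/2.
```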

10.2 Learning Rules

In the preceding examples, learning consisted in getting close, in probability, to a specified goal (class label) for a training set having n cases. This notion of learning is intimately related to the Bernoulli law of large numbers and entirely in accordance with common sense. For instance, when learning how to pronounce a word, say in French, by pronouncing it n times and receiving the comments of a French teacher, we use some rule (or set of rules) for adjusting the pronunciation so as to ensure the above-mentioned convergence; that is, as n increases, we expect to get close to the correct pronunciation. As another example, when a physician needs to establish a diagnosis on the basis of a list of observations, s(he) uses a specific rule (or set of rules) with which s(he) expects to get close to the exact diagnosis with a given degree of certainty. The observations gathered in n clinical consultations, together with the evidence obtained afterwards about whether the patient really had some disease, constitute the physician's training set. The rule tries to value or weight the observations in such a way as to get close to an exact diagnosis — for instance, having or not having some disease — with a degree of certainty that increases with the size of that training set.

In all examples of task learning, there is the possibility of applying one rule (supposedly a learning rule), chosen from several possible rules, to the same observations. Let us go back to the problem of determining a number x in [0, 1], in order to illustrate this aspect. We may establish a practical correspondence for this problem: the task of determining the point of contact of two underground structures (or rocks) on the basis of drilling results along some straight line. The drillings are assumed to be made at random points, with uniform distribution in some interval, as illustrated in Fig. 10.2. They reveal whether, at the drilling point, we are above structure 1 or above structure 2. Let d be the estimate of the unknown contact point ∆ supplied by the rule. There are several feasible rules for this task: select the maximum point revealed as being from structure 1, that is, d = xM (rule 1); select the minimum point from structure 2, d = xm (rule 2); or again take d = (xM + xm)/2 (rule 3). If a rule behaves like the Bernoulli law of large numbers in the determination of the contact point, we say that the rule is a learning rule for the task at hand.
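The three rules are easy to compare on simulated drillings. The simulation below is ours (uniform drillings, ∆ = 0.5), in the spirit of the experiments reported in Fig. 10.3:

```python
import random

def estimate_contact(delta, n, seed=5):
    """Apply rules 1-3 to n random drillings in [0, 1]."""
    random.seed(seed)
    drills = [random.random() for _ in range(n)]
    structure1 = [p for p in drills if p <= delta]  # revealed as structure 1
    structure2 = [p for p in drills if p > delta]   # revealed as structure 2
    xM = max(structure1) if structure1 else 0.0     # rule 1
    xm = min(structure2) if structure2 else 1.0     # rule 2
    return xM, xm, (xM + xm) / 2                    # rules 1, 2 and 3

xM, xm, mid = estimate_contact(0.5, 5000)
assert xM <= 0.5 < xm          # rule 1 approaches from below, rule 2 from above
assert abs(mid - 0.5) < 0.01   # all three close in on the contact point
```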

Fig. 10.2. Drillings in some interval, designated by [0, 1], aiming to determine the contact point ∆ between structure 1 (open circles) and structure 2 (solid circles)

Fig. 10.3. Average deviation d − ∆ curves (solid lines) with ±1 standard deviation bounds (dotted lines), computed in 25 simulations for ∆ = 0.5 and two learning rules: (a) d = xM , (b) d = (xM + xm)/2

The three rules produce a value d converging in probability (Bernoulli law of large numbers) to ∆, depending on the training set size n, as illustrated in Fig. 10.3.

We now consider another version of the problem of finding a point on a line segment. In contrast to the above problem, where there was a separation between the two categories of points, we now suppose that this separation does not exist, and instead one observes an overlap of the respective probability distributions. Let us concretize this scenario by considering a clinical analysis for detecting the presence or absence of some disease. Our intention is to determine which value d, produced by the analysis, best separates the two classes of cases: the negative ones (without disease) and the positive ones (with disease), which we denote N and P, respectively. Usually, clinical analyses do not allow a total class separation. Figure 10.4 shows an example of this situation, where we assume equal-variance normal distributions for the two classes of diagnosis. Whatever the threshold value d we may choose on the straight line, we will always have to cope with a non-null

classification error.

Fig. 10.4. Upper: Probability densities of a clinical analysis value for two classes, N and P. Lower: Possible training set

In the case of Fig. 10.4, the best value for d — the one yielding a minimum error — is the one indicated in the figure (as suggested by the symmetry of the problem): d = ∆. The wrongly classified cases are then the positive ones producing analysis values below ∆ and the negative ones producing values above ∆. Unfortunately, the probability distributions are, in general, unknown; we are only given a training set like the one represented at the bottom of Fig. 10.4. What is then the behavior of the previous rules 1, 2 and 3? Rules 1 and 2, for infinite-tailed distributions as in Fig. 10.4, converge to values that are clearly far away from ∆, and hence they do not behave as learning rules in this case. As for rule 3, one might expect the same, since there is a non-null probability of obtaining xM and xm values arbitrarily far from ∆. Surprisingly enough, however, my research team has shown that rule 3, even with its parameters (1/2, 1/2), is still a learning rule for symmetrical distributions, since on average, and for the wrongly classified cases, it leads to a balance between positive and negative deviations from ∆. Had we written rule 3 as d = (xM + xm)/3, changing the weight of the rule parameters (xM, xm), the rule would no longer be a learning rule.

In fact, in many practical problems where one is required to learn the task 'what is the best d separating positive from negative cases?', rule 3 is not a learning rule. There are, therefore, more sophisticated rules that behave as learning rules for large classes of problems and distributions. One such rule is based on a technique that reaches back to Abraham de Moivre: it consists in adjusting d so as to minimize the sum of squares of the deviations for the wrongly classified cases. Another rule is based on the notion of entropy presented in the last chapter: it consists in adjusting d so as to minimize the entropy of the classification errors, that is, their degree of disorder around the null error. (For certain classes of problems, it may be preferable to maximize instead of minimizing the error entropy.)
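A rule of this family can be sketched by direct search: pick the threshold d minimising some cost over the training set. Below, the cost is simply the number of misclassified training cases, and the two synthetic normal classes are our stand-ins for the situation of Fig. 10.4:

```python
import random

def best_threshold(n=400, seed=11):
    """Threshold minimising the training errors for two overlapping
    normal classes, N centred at 0 and P centred at 2 (so Delta = 1)."""
    random.seed(seed)
    neg = [random.gauss(0.0, 1.0) for _ in range(n)]  # class N
    pos = [random.gauss(2.0, 1.0) for _ in range(n)]  # class P

    def errors(d):
        return sum(v > d for v in neg) + sum(v <= d for v in pos)

    return min(sorted(neg + pos), key=errors)

# The minimising threshold lands near the densities' crossing point, d = 1.
```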

10.3 Impossible Learning

As strange as it may look at first sight, there exist scenarios where learning is impossible. Some have to do with the distribution of the data; others correspond to stranger situations.

Let us first take a look at an example of impossible learning for certain data distributions. Let us again consider the previous example of finding out a threshold value ∆ separating two classes, N and P. The only difference is that we now use the following rule 4: we determine the averages of the values for classes N and P, say mN and mP, and thereafter their average,

d = (mN + mP)/2 .

This is a learning rule for a large class of distributions, given the convergence in probability of averages. But we now consider the situation where the distributions in both classes are Cauchy. The situation we will then encounter is the one illustrated in Fig. 10.5. As we saw in Chap. 6, the law of large numbers is not applicable to Cauchy distributions, and the convergence in probability of averages does not hold in this case. Learning is impossible.

Fig. 10.5. The application of rule 4, under the same conditions as in Fig. 10.4, for two classes with Cauchy distributed values

There are other and stranger scenarios where learning is impossible. Let us examine an example which has to do with expressing numbers of the interval [0, 1] in binary, as we did in the section on Borel's normal numbers in Chap. 6. Let us assume that X chooses an integer n that L has to guess using active learning. For L everything happens as if X's choice were made completely at random: L does not know any rule that might lead X to favor some integers over others. Since the aim is to learn/guess n, the game stipulates that
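The failure of averages for Cauchy data is easy to witness numerically. In the sketch below (our own, with illustrative sizes) the sample means of uniform data agree run after run, while Cauchy sample means keep scattering no matter how large n is:

```python
import math
import random

def spread_of_means(draw, n=20000, runs=20, seed=4):
    """Spread (max - min) of `runs` sample means of n draws each."""
    random.seed(seed)
    means = []
    for _ in range(runs):
        means.append(sum(draw() for _ in range(n)) / n)
    return max(means) - min(means)

uniform = random.random                                       # law of large numbers holds
cauchy = lambda: math.tan(math.pi * (random.random() - 0.5))  # standard Cauchy
```

Comparing `spread_of_means(uniform)` with `spread_of_means(cauchy)` shows the uniform means clustering tightly while the Cauchy means never settle: the sample mean of Cauchy values is itself Cauchy distributed, whatever n is.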

L should ask X whether a certain number in the interval [0, 1] contains the binary digit dn, that is, whether the term dn = 1/2^n shows up in the binary representation of that number. Here is an example of questions and answers when X chooses n = 3 (d3, which corresponds in decimal to 0.125):

L : 0.1? X : 0 (the binary representation of 0.1 does not contain 0.125).
L : 0.2? X : 1 (0.2 contains 0.125).
L : 0.3? X : 0 (0.3 does not contain 0.125).
L : 0.22? X : 1 (0.22 contains 0.125).

For instance, the number 0.3 is represented in binary by 0.01001100110011 . . . (repetition of 0011), and this expansion does not contain d3. The question now is: will L be able to guess n after a finite number of questions? We first note that, whatever di the learner L decides to present as a guess, s(he) always has a 50% probability of hitting upon the value chosen by X; in other words, any random choice made by L always has a 50% probability of a hit.

Consider the average of the absolute differences between what is produced by the rule and what is produced by the right choice. In the above example, the values produced by the right choice d3 correspond to the 0101 sequence of answers. Supposing that L, using some rule, decided d2, which corresponds in decimal to 0.25, we would get the 0010 sequence. The average of the absolute differences between what is produced by the rule and what is produced by the right choice is (0 + 1 + 1 + 1)/4 = 0.75. With an increasing number of questions this value always converges to 0.5, independently of the chosen di, because the difference between any pair (dn, di) encloses a 0.5 area for a uniformly distributed x in [0, 1] (as the reader may check geometrically). We may conclude that either L makes a hit, and the error is then zero, or s(he) does not make a hit, and the error is then still 50%. There is no middle term in the learning process.

Is it possible that, by using a sufficiently 'clever' rule, we will be able to substantially improve on this 50% probability? The surprising aspect comes now. Is it at least possible to arrive at a sophisticated rule allowing us to reach the right solution in a finite number of steps? The answer is negative.
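X's answers can be generated mechanically by expanding the queried number in binary, which also lets the reader verify the dialogue above (the function is our illustration):

```python
def contains_digit(x, n):
    """Does the binary expansion of x in (0, 1) have a 1 in position n,
    i.e. does the term 1/2^n show up in it?"""
    for position in range(1, n + 1):
        x *= 2
        bit = int(x)   # next binary digit of x
        x -= bit
    return bit == 1

# X chose n = 3, the 1/8 = 0.125 position:
assert contains_digit(0.1, 3) is False
assert contains_digit(0.2, 3) is True
assert contains_digit(0.3, 3) is False
assert contains_digit(0.22, 3) is True
```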

Will L reach the right solution in a finite number of steps? The answer is negative: it is impossible to learn n from questions about the di set.

10.4 To Learn Is to Generalize

Let us look back at the example in Fig. 10.4. In a previous section we were interested in obtaining a learning rule that would determine a separation threshold with minimum classification error. Meanwhile, we may summarize the learning task in the following way:

• Task to be learned: classify the value of a single variable (e.g., a clinical analysis) in one of two classes.
• Classifier type: a number (threshold) x.
• Learning rule: rule 3.

If we use rule 3, we are making the assumption that the best type of classifier for the two classes (N and P) is a point: a single threshold. What can provide us with the guarantee that the surmised classifier type is acceptable? It could well happen that the diagnostic classes were 'nested', as in Fig. 10.6a. In that case, instead of considering one threshold we would consider three, and the classifier would be: if x is below d1 or between d2 and d3, then x belongs to class N; otherwise, it belongs to class P. In the case of Fig. 10.6b, we would consider five thresholds and the following classifier: if x is below d1, between d2 and d3, or between d4 and d5, then x belongs to class N; otherwise, it belongs to class P. Classifiers involving one, three or five thresholds are linear in the respective parameters. In other scenarios it could well happen that the most appropriate classifier is quadratic, as shown in Fig. 10.7, where we separate the two classes with: if x^2 is greater than d, it belongs to class P; otherwise, it belongs to class N. We may go on imagining the need to use ever more complex types of classifier if we are confronted with the configurations of Figs. 10.6 and 10.7, or some other situation.

That is, if we knew the class configuration, we could follow a deductive approach to the classification problem: we progress from the general (classifier type) to the particular (explaining the observations). Once the type of classifier has been decided, as well as the weighting of its parameters, we apply the classifier in order to classify/explain new facts. However, in general, we never know the class configuration. All we usually have available are simply the observations constituting our training set.
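The threshold rules just described are easy to state as code. The following Python sketch only illustrates the shape of the rules; the numeric thresholds are invented for the example, not taken from the book's figures.

```python
def one_threshold(x, d):
    """Single-threshold classifier: class N below d, class P otherwise."""
    return 'N' if x < d else 'P'

def three_thresholds(x, d1, d2, d3):
    """Classifier for 'nested' classes (cf. Fig. 10.6a): class N below d1
    or between d2 and d3, class P elsewhere."""
    return 'N' if x < d1 or d2 <= x < d3 else 'P'

# Hypothetical threshold values, for illustration only:
print(one_threshold(0.3, 0.5))                 # N
print(three_thresholds(0.45, 0.2, 0.4, 0.6))   # N
print(three_thresholds(0.9, 0.2, 0.4, 0.6))    # P
```

The five-threshold rule follows the same pattern with two more cut points; each added threshold is one more parameter that a learning rule would have to adjust.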

How can we then choose which type of classifier to use? In this situation we intend to determine the classifier type and its parameters simply on the basis of the observations, without any assumptions, or with only a minimum of assumptions. Learning is now inductive. We progress from the particular (the observations) to the general (classifier type). The answer to the question of what type of classifier to use is related to the difficult generalization capability issue of inductive learning.

Fig. 10.6. Situation of classes N (white) and P (grey) separated by three (a) and five (b) thresholds. The training set is the same as in Fig. 10.4

Fig. 10.7. Quadratic classifier

Suppose from now on that in the above example the distributions of the analysis value x for the two classes of diagnosis did correspond, in fact, to the four-structure situation of Fig. 10.6a. The classification rule based on three thresholds is then adequate. However, the person who designed the classifier does not know this fact; s(he) only knows the training set. The only thing we have to do is to use a learning rule for this type of classifier, which adjusts the three thresholds to the three unknown values ∆1, ∆2, and ∆3, in order to minimize the number of errors, affording the smallest probability of error.

Let us further suppose that in the case of Fig. 10.6a we were fortunate with the training set supplied to us, and we succeeded in determining d1, d2, and d3 quite close to the true thresholds ∆1, ∆2, and ∆3. The designed classifier would allow generalization and would generalize well (with the smallest probability of error, on average). Our solution would then behave equally well, on average terms, for other (new) sets of values. We could have been unfortunate, however, having to deal with an atypical training set, leading to the determination of d1, d2, and d3 far away from the true values. In this case, the classifier designed on the basis of the training set would not generalize to our satisfaction.

In summary, we can never be certain that a given type of classifier allows generalization. The best we can have is some degree of certainty. But what does this degree of certainty depend on?

In fact, it depends on two things: the number of elements of the training set and the complexity of the classifier (number of parameters and type of classifier). As far as the number of training set elements is concerned, it is easily understandable that, as we increase that number, the probability of the training set being typical also increases, in the sense of the strong law of large numbers mentioned in Chap. 6.

Let us now focus on the complexity issue. In the present example the complexity has to do with the number of thresholds used by the classifier. The simpler the classifier, the easier the generalization. The classifier with a single threshold does not need such large training sets as the three-threshold classifier. It generalizes, but does so rather poorly: its generalization is always accompanied by a non-null classification error. We are thus in a sub-learning situation. At the other extreme, the five-threshold classifier does not generalize in any reasonable way, since its performance behaves erratically: in training sets like those of Fig. 10.6 it attains zero errors, whereas in new data sets it may produce very high error rates. We are thus in an over-learning situation, where the classifier, very much influenced by the random idiosyncrasies of the training set, gets so stuck to the training set details that it will not behave in a similar way when applied to other (new) data sets. We may say that the one-threshold classifier only describes the coarse-grained structure of the training set, while the five-threshold classifier describes a structure that is too finely grained. For the above example, the three-threshold classifier is the best compromise solution for the learning task. Finding such a best compromise is not always an easy task in practice.

10.5 The Learning of Science

Occam's razor told us that, in the presence of uncertainty, the simplest explanations are the best. Consider the intelligibility of the universe, or in other words the scientific understanding of material phenomena. The intelligibility of the universe exists only if infinitely many observations can be explained by laws with a reduced number of parameters. If, in order to explain a given set of observations of these phenomena, say n observations, we needed to resort to a rule using n parameters, we could say that no intelligibility was present: there is no difference between knowing the rule and knowing the observations. In fact, to understand is to compress. It amounts to having a small program that is able to produce infinitely many outcomes.
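The idea that understanding amounts to a small program producing unlimited outcomes can be made concrete with a toy Python sketch (mine, not the book's): a rule a few characters long replaces any number of stored observations of a perfectly regular sequence.

```python
from itertools import islice

def alternating():
    """A tiny 'law': a complete description of the infinite sequence 0, 1, 0, 1, ..."""
    bit = 0
    while True:
        yield bit
        bit = 1 - bit

# The few-line rule above stands in for as many observations as we ask for:
observations = list(islice(alternating(), 1_000_000))
print(len(observations), observations[:6])  # 1000000 [0, 1, 0, 1, 0, 1]
```

A million observations are compressed into a rule; a sequence with no such rule (an algorithmically complex one) could only be stored outcome by outcome.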

This is the aim of the 'theory of everything' that physicists are searching for, encouraged by the appreciation that, up to now, simple, low complexity (in the above sense) laws have been found that are capable of explaining a multitude of observations. We will never be certain (fortunately!) of having found the theory of everything.

As a matter of fact, inductively learning the laws of the universe, that is, obtaining laws based on observations alone, presents the problem of their generalization: can we be certain, with a given level of certainty, that the laws we have found behave equally well for as yet unobserved cases? The Russian mathematician Vladimir Vapnik developed pioneering work in statistical learning theory, establishing (in 1991) a set of theoretical results which allow one in principle to determine a degree of certainty with which a given classifier (or predictive system) generalizes, supplying an objective foundation for inductive validity. Statistical learning theory clarifies what is involved in the inductive learning model. One of the interesting aspects of the theory developed by Vapnik is the characterization of the generalizing capability by means of a single number, known as the Vapnik–Chervonenkis dimension (often denoted by DVC). A learning model can only be generalized if its DVC is finite, and all the more easily as its value is smaller.

There are certain models with infinite DVC. The parameters of such models can be learned so that a complete separation of the values of any training set into the respective classes can be achieved, no matter how complex those values are and even if we deliberately add incorrect class-labeled values to the training set. According to this theory we may say that the objective foundation of inductive validity can be stated in a compact way as follows: in order to have an acceptable degree of certainty about the generalization of our inductive model (e.g., classifier type), one must never use a more complex type of model than is allowed by the available data.

The fact that an inductive model must have finite DVC in order to be valid and therefore be generalizable is related to the falsifiability criterion proposed by the Austrian philosopher Karl Popper for demarcating a theory as scientific. Now, it can be said that non-scientific theories will never err: they always explain the observations, no matter how strange they are, in the same way as classifiers with infinite DVC will never produce incorrect classifications on any training set (even if deliberately incorrectly labeled observations are added). On the other hand, if scientific laws are used, thus capable of generalization, we are in a situation corresponding to the use of finite DVC classifiers.
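Returning to models with infinite DVC: a minimal Python sketch (an assumed toy example, not from the book) is a classifier that simply memorizes its training set. It separates any training set perfectly, even one with purely random labels, yet on new data it does no better than coin tossing.

```python
import random

random.seed(0)
labels = 'NP'

# Training set with purely random labels (the worst case for any real structure):
train = [(random.random(), random.choice(labels)) for _ in range(50)]
memory = dict(train)

def memorizer(x):
    # Perfect recall on training values, arbitrary answer ('P') on anything unseen.
    return memory.get(x, 'P')

train_errors = sum(memorizer(x) != y for x, y in train)
test = [(random.random(), random.choice(labels)) for _ in range(1000)]
test_errors = sum(memorizer(x) != y for x, y in test)
print(train_errors, test_errors / 1000)  # 0 on training, about 0.5 on new data
```

Zero training error here tells us nothing about generalization: the model's capacity is unbounded, so fitting the data perfectly is no achievement at all.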

That is, it is possible to present observational data that is not explainable by the theory (falsifying the theory).

Vapnik's theory goes a step further. Occam's razor taught us to be parsimonious in the choice of the number of parameters of the classifier (more generally, of the learning system). Vapnik's theory tells us the level of complexity of the classifier (or learning system) to which we may go for a certain finite number of observations, in such a way that learning with generalization is achievable.

10.6 Chance and Determinism: A Never-Ending Story

It seems that God does play dice after all, contradicting Albert Einstein's famous statement. It seems that the decision of where a photon goes or when a beta emission takes place is of divine random origin. We have seen that it was not always so. Suffice it to remember the famous statement by Laplace, quoted in Chap. 7.

The vast majority of the scientific community presently supports the vision of a universe where chance events of an objective nature do exist. There is, however, a minority of scientists who defend the idea of a computable universe, with these proposals ranging from an extreme vision of a virtual reality universe – a universe without matter, some sort of hologram – to a less extreme vision of an inanimate universe behaving as if it were a huge computer. This idea – perhaps better referred to as a belief, because those that advocate it recognize that it is largely speculative and not supported by demonstrable facts – is mainly based on some results of computer science: the computing capacity of the so-called cellular automata and the algorithmic compression of information. Cellular automata are somewhat like two- and three-dimensional versions of the quadratic iterator presented in Chap. 7. By using a set of simple rules, they can give rise to very complex structures, which may be used as models of such complex phenomena as biological growth. Algorithmic compression of information deals with the possibility of a simple program being able to generate a sequence that passes all irregularity tests, like those mentioned in Chap. 9 concerning the digit sequences of π, e and √2. On the basis of these ideas, it has been proposed that everything in the universe is computable.

There are of course many relevant and even obvious objections to such visions of a computable universe [42].
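The 'simple rules, very complex structures' point about cellular automata can be illustrated even in one dimension. The sketch below uses Wolfram's elementary rule 30, a standard example (not one discussed in the book): the update rule is a single line, yet the printed rows quickly look irregular.

```python
def step(cells, rule=30):
    """One update of an elementary cellular automaton with wrap-around:
    each cell's next state is the rule bit selected by its 3-cell neighborhood."""
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]

row = [0] * 31
row[15] = 1                        # start from a single 'on' cell
for _ in range(12):
    print(''.join('#' if c else '.' for c in row))
    row = step(row)
```

Each cell looks only at itself and its two neighbours; the binary digits of the rule number dictate the next state for each of the eight possible neighbourhoods.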

Fig. 10.8. When further observations are required

The strongest objection to this neo-deterministic resurrection has to do with the non-computability of pure chance. In other words, if there were no chance events, the temporal cause–effect sequences would be deterministic and, therefore, computable, in such a way that we might forecast, for instance, which path a photon will follow, although we might need huge amounts of information for that purpose. It remains to be seen whether one would ever be able to compute those vast quantities of information in a useful time. As a matter of fact, the profound conviction of the scientific community at present is that objective chance is here to stay.

The most profound question regarding the intelligibility of the universe thus asks whether it is completely ordered or chaotic, i.e., whether chance is a deterministic product of nature or, as believed by the majority of scientists, one of its fundamental features. If the universe has infinite complexity, it will never be completely understood. We do not know, for instance, which rules influence biological evolution. Are mutations a phenomenon of pure chance? We do not know. Further, we do not know whether the complexity of a living being can be explained by a simple program. Life seems to be characterized by an infinite complexity which we will never be able to learn in its entirety.

Notwithstanding, as we said already in the preface, we may be sure of one thing – really sure, with a high degree of certainty – that chance and determinism will forever remain a subject of debate, since objective chance will always remain absolutely beyond the cognition of human understanding, as is the knowledge of the infinitely many algorithmically complex numbers.

A Some Mathematical Notes

A.1 Powers

For integer n (i.e., a whole number, without fractional part), the expression x^n, read 'x to the power n', designates the result of multiplying x by itself n times; x is called the base and n the exponent. If n is positive, the calculation of x^n is straightforward. We may apply the definition directly, e.g.,

2.5^3 = 2.5 × 2.5 × 2.5 = 15.625.

From the definition, one has

x^n × x^m = x^(n+m), e.g., 2.5^3 × 2.5^2 = 2.5^5 = 97.65625,

(x^n)^m = x^(nm), where nm means n × m, e.g., (0.5^2)^3 = (0.5)^6 = 0.015625.

Note also that 1^n is 1. Moreover, from x^n × x^m = x^(n+m) comes the convention x^0 = 1 (for any x). In the book, expressions such as (2/3)^6 sometimes occur. We may apply the definition directly, or perform the calculation as (2/3)^6 = 2^6/3^6 = 64/729. For a negative exponent, −n, we have

x^(−n) = 1/x^n.
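The rules above can be checked numerically in Python; math.isclose guards against floating-point rounding in the fractional cases.

```python
import math

x = 2.5
assert math.isclose(x**3, 15.625)            # 2.5 x 2.5 x 2.5
assert math.isclose(x**3 * x**2, x**5)       # x^n * x^m = x^(n+m)
assert math.isclose((x**2)**3, x**6)         # (x^n)^m = x^(nm)
assert math.isclose((0.5**2)**3, 0.015625)
assert x**0 == 1 and 1**100 == 1             # conventions x^0 = 1 and 1^n = 1
assert math.isclose(2**-3, 1 / 8)            # x^(-n) = 1/x^n
assert math.isclose((2/3)**6, 64 / 729)      # (2/3)^6 = 2^6 / 3^6
print('all power rules check out')
```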

This follows since x^n × x^(−n) = x^0 = 1. For instance,

2^(−3) = 1/2^3 = 1/8.

Powers with fractions as exponents and non-negative bases are calculated as a generalization of (x^n)^m = x^(nm). For example, from (x^(1/2))^2 = x, one has x^(1/2) = √x.

A.2 Exponential Function

The figure below shows the graphs of the functions y = x^4 and y = 2^x. In the first function, the exponent is constant: the function is a power function. In the second function, the base is constant: it is an exponential function. Figure A.1 shows the fast growth of the exponential function in comparison with the power function. Given any power and exponential functions, it is always possible to find a value x beyond which the exponential is above the power.

Fig. A.1. Comparing the growth of the exponential function and the power function

Given an exponential function a^x, its growth rate (measured by the slope at each point of the curve) is given by Ca^x, where C is a constant. C can be shown to be 1 when a is the value to which (1 + 1/h)^h tends when the integer h increases arbitrarily. Let us see some values of this quantity:

(1 + 1/2)^2 = 2.25,
(1 + 1/10)^10 = 2.593742,
(1 + 1/1000)^1000 = 2.716924.
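The listed values of (1 + 1/h)^h, and the limit they approach, can be reproduced in a few lines:

```python
import math

for h in (2, 10, 1000):
    print(h, round((1 + 1/h)**h, 6))   # 2 2.25 / 10 2.593742 / 1000 2.716924

# For very large h the quantity is already close to the limit e = 2.71828...
assert math.isclose((1 + 1/10**6)**10**6, math.e, rel_tol=1e-5)
```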

These quantities converge to the natural or Napier number, denoted by e. The value of e to 6 significant figures is 2.71828. The natural exponential e^x thus enjoys the property that its growth rate is equal to itself.

A.3 Logarithm Function

The logarithm function is the inverse of the exponential function: given y = a^x, the logarithm of y in base a is x. This function is usually denoted by log_a. Here are some examples:

• log_1.1(2.593742) = 10, because 1.1^10 = 2.593742.
• log_1.5(2.25) = 2, because 1.5^2 = 2.25.
• log_a(1) = 0, because, as we have seen, a^0 = 1 for any a.

The word 'logarithm', like the word 'algorithm', comes from Khwarizm, presently Khiva in Uzbekistan, the birth town of Abu Ja'far Muhammad ibn Musa al-Khwarizmi (780–850), author of the first treatise on algebra.

The logarithm function in the natural base, log_e(x), or simply log(x), is frequently used. An important property of log(x) is that its growth rate is 1/x. The graphs of these functions are shown in Fig. A.2.

Fig. A.2. The logarithm function
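The examples above can be verified with math.log, which takes the base as its optional second argument (a quick check, not part of the book):

```python
import math

assert math.isclose(math.log(2.593742, 1.1), 10, rel_tol=1e-6)  # 1.1^10 = 2.593742
assert math.isclose(math.log(2.25, 1.5), 2)                     # 1.5^2  = 2.25
assert math.log(1, 7) == 0                                      # log_a(1) = 0 for any a

# The logarithm inverts the exponential: log_a(a^x) = x.
for a, x in [(2.0, 10.0), (10.0, 0.5), (math.e, 3.0)]:
    assert math.isclose(math.log(a**x, a), x)
print('logarithm checks pass')
```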

The logarithm function provides a useful transformation in the analysis of fast growth phenomena. In Fig. A.1 it is not possible to compare the growth of the two functions for very small or very high values of x. If we apply the log transformation, we get

log(x^4) = 4 log(x), log(2^x) = x log(2),

whose graphs are shown in Fig. A.3. The behavior of the two functions is now graphically comparable. It becomes clearly understandable that, in the long run, any exponential function overtakes any power function.

Fig. A.3. The behavior of the functions log(x^4) and log(2^x)

The following results are a consequence of the properties of the power function:

log(xy) = log(x) + log(y),
log(x^y) = y log(x).

A.4 Factorial Function

The factorial function of a positive integer number n, denoted n!, is defined as the product of all integers from n down to 1:

n! = n × (n − 1) × (n − 2) × … × 2 × 1.

The factorial function is a very fast-growing function. In the left-hand graph of Fig. A.4, n! is represented together with x^4 and 2^x, while the respective logarithms are shown on the right. In the long run, the factorial function overtakes any exponential function, as can be checked in the graphs of Fig. A.4.
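Both growth claims can be checked numerically. In this sketch (mine, not the book's) the log identities are verified directly, and the crossovers are located exactly: 2^x meets x^4 at x = 16 and stays above it from then on, while n! first exceeds 2^n at n = 4 and never falls behind again.

```python
import math

# The log transformation of the two curves:
x = 50.0
assert math.isclose(math.log(x**4), 4 * math.log(x))
assert math.isclose(math.log(2**x), x * math.log(2))

# Exponential vs. power: 2^16 = 16^4 = 65536, and 2^n wins for every larger n.
assert 2**16 == 16**4 == 65536
assert all(2**n > n**4 for n in range(17, 300))

# Factorial vs. exponential: n! overtakes 2^n at n = 4 for good.
first = next(n for n in range(1, 100) if math.factorial(n) > 2**n)
assert first == 4
assert all(math.factorial(n) > 2**n for n in range(4, 300))
print('growth comparisons check out')
```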

Fig. A.4. Left: the functions n!, x^4 and 2^x. Right: logarithms of the three functions on the left

A.5 Sinusoids

The sine function sin(x) (sine of x) is defined by considering the radius OP of a unit-radius circle enclosing the angle x, measured in the anticlockwise direction in Fig. A.5. The function represents the length of the segment AP drawn perpendicularly from P to the horizontal axis. The length can be either positive or negative, depending on whether the segment is above or below the horizontal axis. With the angle measured in radians, that is, as an arc length represented by a certain fraction of the unitary radius, and since 2π is the circumference, one has

sin(0) = sin(π) = sin(2π) = 0, sin(π/2) = 1, sin(3π/2) = −1, sin(π/4) = √2/2.

The inverse function of sin(x) is arcsin(x) (arcsine of x). Thus, arcsin(√2/2) = π/4 (for an arc between 0 and 2π).

The cosine function cos(x) (cosine of x) corresponds to the segment OA in Fig. A.5. Thus,

cos(0) = cos(2π) = 1, cos(π) = −1, cos(π/2) = cos(3π/2) = 0, cos(π/4) = √2/2.

Considering point P rotating uniformly around O, so that it describes a whole circle every T seconds, the segment AP will vary as sin(2πt/T), where t is the time, or equivalently as sin(2πft), where f = 1/T is the frequency (cycles per second). If the movement of P starts at the position corresponding to the angle φ, called the phase, the segment will vary as sin(2πft + φ) (see Fig. 8.8).
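The unit-circle values and the frequency/phase form can be checked with the math module (angles in radians; a small sketch, with the helper name sinusoid chosen here for illustration):

```python
import math

assert math.isclose(math.sin(math.pi / 2), 1)
assert math.isclose(math.sin(math.pi / 4), math.sqrt(2) / 2)
assert math.isclose(math.cos(math.pi), -1)
assert abs(math.sin(math.pi)) < 1e-12                 # zero, up to rounding
assert math.isclose(math.asin(math.sqrt(2) / 2), math.pi / 4)

def sinusoid(t, f, phi=0.0):
    """sin(2*pi*f*t + phi): frequency f in cycles per second, phase phi."""
    return math.sin(2 * math.pi * f * t + phi)

# After a whole period (t = T = 1/f) the sinusoid is back to sin(phi):
f = 2.0
assert math.isclose(sinusoid(1 / f, f, phi=0.3), math.sin(0.3))
print('sinusoid checks pass')
```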

Fig. A.5. Unit circle and definition of the sine and cosine functions

From Pythagoras' theorem, one always has, for any x,

sin^2(x) + cos^2(x) = 1.

A.6 Binary Number System

In the decimal system, which we are all accustomed to, the value of any number is obtained from the given representation by multiplying each digit by the corresponding power of ten. For instance, the digit sequence 375 represents

3 × 10^2 + 7 × 10^1 + 5 × 10^0 = 300 + 70 + 5.

The rule is the same when the number is represented in the binary system (or any other system base, for that matter). In the binary system there are only two digits (commonly called bits): 0 and 1. Here are some examples: 101 has value

1 × 2^2 + 0 × 2^1 + 1 × 2^0 = 4 + 1 = 5,

while 0.101 has value

1 × 2^(−1) + 0 × 2^(−2) + 1 × 2^(−3) = 1/2 + 1/8 = 0.5 + 0.125 = 0.625.
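The same positional rule in code, using Python's built-in base-2 parsing for the integer part and a short helper (binary_fraction, a name of my own choosing) for the fractional part:

```python
def binary_fraction(bits):
    """Value of the fractional binary digits after the point, e.g. '101' -> 0.101 in base 2."""
    return sum(int(b) * 2**-(i + 1) for i, b in enumerate(bits))

assert 3 * 10**2 + 7 * 10**1 + 5 * 10**0 == 375         # decimal 375
assert int('101', 2) == 1 * 2**2 + 0 * 2**1 + 1 * 2**0 == 5
assert binary_fraction('101') == 0.5 + 0.125 == 0.625   # 1/2 + 1/8
print(int('101', 2), binary_fraction('101'))  # 5 0.625
```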

References

The following list of references includes works of popular science and other sources that are not technically demanding.

1. Alvarez, L.: A pseudo-experience in parapsychology (letter). Science 148, 1541 (1965)
2. Amaral, L., Goldberger, A., Ivanov, P., Stanley, H.: Modeling heart rate variability by stochastic feedback. Comp. Phys. Communications 121–2, 126–128 (1999)
3. Ambegaokar, V.: Reasoning about Luck. Probability and Its Uses in Physics. Cambridge University Press (1996)
4. Bar-Hillel, M., Wagenaar, W.: The perception of randomness. Advances in Applied Mathematics 12, 428–454 (1991)
5. Bassingthwaighte, J., Liebovitch, L., West, B.: Fractal Physiology. Oxford University Press (1994)
6. Beltrami, E.: What is Random? Springer-Verlag (1999)
7. Bennett, D.: Randomness. Harvard University Press (1998)
8. Bernstein, P.: Against the Gods. The Remarkable Story of Risk. John Wiley & Sons (1996)
9. Blackmore, S.: The lure of the paranormal. New Scientist 22, 62–65 (1990)
10. Bolle, F.: The envelope paradox, the Siegel paradox, and the impossibility of random walks in equity and financial markets. www.econ.euv-frankfurt-o.de/veroeffentlichungen/bolle.html (2003)
11. Bouchaud, J.-P.: Les lois des grands nombres. La Recherche 26, 784–788 (1995)
12. Brown, J.: Is the universe a computer? New Scientist 14, 37–39 (1990)

13. Casti, J.: Truly, madly, randomly. New Scientist 155, 32–35 (1997)
14. Chaitin, G.: Randomness and mathematical proof. Scientific American 232, No. 5, 47–52 (1975)
15. Chaitin, G.: Randomness in arithmetic. Scientific American 259, No. 1, 52–57 (1988)
16. Chaitin, G.: Le hasard des nombres. La Recherche 22, 610–615 (1991)
17. Chaitin, G.: L'Univers est-il intelligible? La Recherche 370, 34–41 (2003)
18. Crutchfield, J., Farmer, J.D., Packard, N., Shaw, R.S.: Chaos. Scientific American 255, No. 6, 38–49 (1986)
19. Deheuvels, P.: La probabilité, le hasard et la certitude. Collection Que Sais-Je?, Presses Universitaires de France (1982)
20. Devlin, K.: The two envelopes paradox. Devlin's angle, MAA Online, www.maa.org/devlin/devlin_0708_04.html (2005)
21. Diaconis, P., Mosteller, F.: Methods for studying coincidences. Journal of the American Statistical Association 84, 853–861 (1989)
22. Droesbeke, J.-J., Tassi, P.: Histoire de la statistique. Collection Que Sais-Je?, Presses Universitaires de France (1990)
23. Everitt, B.: Chance Rules. An Informal Guide to Probability, Risk and Statistics. Springer-Verlag (1999)
24. Feynman, R.: QED. The Strange Theory of Light and Matter. Penguin, London (1985)
25. Ford, J.: What is chaos that we should be mindful of? In: The New Physics, ed. by P. Davies. Cambridge University Press (1989)
26. Gamow, G., Stern, M.: Jeux Mathématiques. Quelques casse-tête. Dunod (1961)
27. Gell-Mann, M.: The Quark and the Jaguar. Freeman (1994)
28. Ghirardi, G.: Sneaking a Look at God's Cards. Unraveling the Mysteries of Quantum Mechanics. Princeton University Press (2005)
29. Gindikin, S.: Tales of Physicists and Mathematicians. Birkhäuser, Boston (1988)
30. Goldberger, A., Rigney, D., West, B.: Chaos and fractals in human physiology. Scientific American 262, No. 2, 35–41 (1990)
31. Haken, H., Wunderlin, A.: Le chaos déterministe. La Recherche 21, 1248–1255 (1990)

32. Hanley, J.: Jumping to coincidences: Defying odds in the realm of the preposterous. The American Statistician 46, 197–202 (1992)
33. Haroche, S., Brune, M., Raimond, J.-M.: Le chat de Schrödinger se prête à l'expérience. La Recherche 301, 50–55 (1997)
34. Hénon, M.: La diffusion chaotique. La Recherche 209, 490–498 (1989)
35. Idrac, J.: Mesure et instrument de mesure. Dunod (1960)
36. Jacquard, A.: Les Probabilités. Collection Que Sais-Je?, Presses Universitaires de France (1974)
37. Kac, M.: What is Random? American Scientist 71, 405–406 (1983)
38. Layzer, D.: The arrow of time. Scientific American 233, 56–69 (1975)
39. Mandelbrot, B.: A multifractal walk down Wall Street. Scientific American 280, 50–53 (1999)
40. Martinoli, A., Theraulaz, G., Deneubourg, J.-L.: Quand les robots imitent la nature. La Recherche 358, 56–62 (2002)
41. McGrew, T., Shier, D., Silverstein, M.: The two-envelope paradox resolved. Analysis 57.1, 28–33 (1997)
42. Mitchell, M.: L'Univers est-il un calculateur? Quelques raisons pour douter. La Recherche 360, 38–43 (2003)
43. Nicolis, G.: Physics of far-from-equilibrium systems and self-organisation. In: The New Physics, ed. by P. Davies. Cambridge University Press (1989)
44. Ollivier, H., Pajot, P.: La décohérence, espoir du calcul quantique. La Recherche 378, 34–37 (2004)
45. Orléan, A.: Les désordres boursiers. La Recherche 22, 668–672 (1991)
46. Paulson: Using lottery games to illustrate statistical concepts and abuses. The American Statistician 46, 202–204 (1992)
47. Peitgen, H.-O., Jürgens, H., Saupe, D.: Chaos and Fractals. New Frontiers of Science. Springer-Verlag (2004)
48. Perakh, M.: Improbable probabilities. www.nctimes.net/~mark/bibl_science/probabilities.htm (1999)
49. Pierce, J.: An Introduction to Information Theory. Symbols, Signals and Noise. Dover Publications (1980)
50. Pincus, S.: Assessing serial irregularity and its implications for health. Proc. Natl. Acad. Sci. USA 954, 245–267 (2001)
51. Polkinghorne, J.: The Quantum World. Princeton University Press (1984)

52. Postel-Vinay, O.: L'Univers est-il un calculateur? Les nouveaux démiurges. La Recherche 360, 33–37 (2003)
53. Rae, A.: Quantum Physics: Illusion or Reality?, 2nd edn. Cambridge University Press (2004)
54. Rittaud, B.: L'Ordinateur à rude épreuve. La Recherche 381, 28–35 (2004)
55. Ruelle, D.: Hasard et chaos. Collection Points, Editions Odile Jacob (1991)
56. Salsburg, D.: The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. Freeman (2001)
57. Schmidhuber, J.: A computer scientist's view of life, the universe, and everything. http://www.idsia.ch/~juergen/everything (1997)
58. Shimony, A.: Conceptual foundations of quantum mechanics. In: The New Physics, ed. by P. Davies. Cambridge University Press (1989)
59. Shlesinger, M., West, B.: The noise in natural phenomena. American Scientist 78, 40–45 (1990)
60. Styer, D.: The Strange World of Quantum Mechanics. Cambridge University Press (2000)
61. Taylor, J.: An Introduction to Error Analysis, 2nd edn. University Science Books (1997)
62. Tegmark, M., Wheeler, J.A.: 100 years of quantum mysteries. Scientific American 284, 54–61 (2001)
63. UCI Machine Learning Repository (The Boston Housing Data): www.ics.uci.edu/~mlearn/MLSummary.html
64. West, B., Goldberger, A.: Physiology in fractal dimensions. American Scientist 75, 354–364 (1987)
65. Wikipedia: Ordre de grandeur (nombre). fr.wikipedia.org/wiki/Ordre_de_grandeur_(nombre) (2005)