
The Probable Universe

Roberto C. Alamino

Cover Picture:
Author: Unknown

All pictures in this book are in the public domain.

1. A Universe of Possibilities
   How Do You Know?
   You Cannot Avoid Probabilities
   The Measure of our Ignorance
   Are You Afraid of Maths?
2. Games of Chance
   The Many Faces of Fairness
   Not-so-noble Beginnings
   Bayes and Laplace
3. Making Sense of Randomness
   Predictability
   Chaos (and Mayhem)
   Organised Disorder
   False Randomness
   Compressing Issues
   Pattern Recognition... or Not
   True Randomness
   Back to Vegas
4. The Logic Behind
   The Cox's Postulates
   A Bit of Logic
   Liars
   Enters Consistency
   Kolmogorov's Axioms
   Logic, Mathematics and Physics
   Messages
5. Information
   Age of Information
   Encoding Information
   Insufficient Reason
   Maximum Entropy
   Frequencies
6. It Depends...
   Conditioned Information
   Everything is Subjective
   Objectivity and Consistency
   The Holographic Universe
7. Probability Zoo
   Intermission
   Anatomy Lesson
   A Dangerous Creature
   Specimen #1: Poisson
   Specimen #2: Zipf's Law
   The Continuum
   Specimen #3: The Gaussian
   Specimen #4: Pareto's Distribution
   The Tail of the Beast
   Biodiversity
8. Changing Mind
   Decisions
   Priors and Posteriors
   The Likelihood
   Evaluating Hypotheses
   The Inference Time Arrow
   Normalisations
   Taking Decisions
   The Bayesian Way
9. The Catch
   Law and Disorder
   Models
   Noise, Errors and Codes
   What about Fallacies?
10. The Universe and Everything Else
   Fundamental Uncertainty
   Laymen Quantum Mechanics
   The Large, the Small and the Complex
   The Psychologist Paradox
   Statistical Physics
   The Great Bridge
   Other Corners
11. The Highest Levels
   Models once Again
   Science and Bayes
   Limits to Knowledge
   The Method
   Checking
   Beauty and the Beast
   Addressing Oneself
   Entropy always Increases
12. Answers
Bibliography
Appendix A. Internet Material
   Random Useful Websites
   arXiv
   Google Scholar
   Open Access Journals
Appendix B. Mathematical Symbols
Appendix C. Scientist List
Appendix D. Greek Letters

A Universe of Possibilities

How do you know we are not living inside the Matrix (or the next best thing)?
Can you ever tell whether everything is not actually an illusion inside your mind?
Isn't science just a belief system not unlike religion?
How do we know that elementary particles exist if we cannot see them?
Can we ever hope to find an answer to any of these questions?
And finally: what is the relation of these questions with this book?

Among the few, but key, characteristics that differentiate humans
from the rest of the living organisms on Earth, the ability to question is the
deepest, the least appreciated and the most annoying. What makes it
annoying is the fact that answering some questions is not easy. The search
for those answers forces us to admit our own ignorance, to face the
uncomfortable truth that we are full of prejudices and to observe in awe
our odd ability to accommodate a universe of contradictions inside our
minds without any effort or even regret.
When faced with one of these hard questions, the great majority of
us just find it easier to take a step backwards. Instead of engaging in the
lengthy and painful task of trying to decrease our ignorance by studying,
revising our prejudices, perhaps throwing away many of them, and
resolving the contradictions by changing our mind, most people simply
convince themselves of one or more of the following reasons not to do so:
I do not want to know the answer.
I do not need to know the answer.
There is no answer.
Even if there is an answer, it is too complicated for me to understand.
I am fine the way I am. I just need to get back to browsing the Internet.
But some of us, including myself, are not satisfied with any of these
reasons. This is, of course, a choice that not everyone is obliged to pursue.
It has nothing to do with reason, but with an emotion called curiosity. It is
the urge to seek explanations that makes some of us want to push the
boundaries of what we (think we) know as much as we can and, even when
we start to feel that the boundaries are not going to move anymore, we
want to keep pushing just in case.
It is undeniable that there are cases when we cannot decide which
of several answers to a question is the correct one. But even in those
situations, we might be able to understand why that happens. When we
cannot choose with complete certainty one among several possibilities,
there is one other thing that we can try to do: find out the odds of each
possible answer being the right one.
To know the odds of something is again not an easy task. It is
necessary to weigh whatever we know for and against each one of
the possibilities in such a way that we can create a ranking. For instance,
imagine that you want to download an app for your smartphone. You go to
the Internet and there you find a list of five apps that do what you want.
Which one will you be most satisfied with?
The way most of us face this decision problem is by ranking the
apps according to the reviews of other people. We assume that, the more
people liked a particular app, the greater are the chances that we will like it
too. Although we could use the number of stars to calculate a rough
estimate of the odds of a general person liking that app, this quantitative
information would not make much of a difference in this situation. The
ranking is usually enough.
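The ranking idea just described can be sketched in a few lines of code. This is only an illustration: the app names and review counts below are invented, and the score is simply the fraction of positive reviews.

```python
# Rank candidate apps by the fraction of positive reviews, a rough
# estimate of the odds that we will like each app too.
# The names and counts are invented for illustration.
reviews = {
    "AppA": (90, 10),    # (positive reviews, negative reviews)
    "AppB": (45, 5),
    "AppC": (300, 200),
}

def positive_fraction(pos, neg):
    """Fraction of reviewers who liked the app."""
    return pos / (pos + neg)

ranking = sorted(reviews, key=lambda app: positive_fraction(*reviews[app]),
                 reverse=True)
print(ranking)  # ['AppA', 'AppB', 'AppC']
```

A real app store would also weight by review count and recency; the point here is only the ranking step itself.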
This weighting procedure used to create a ranking is what we call
inference. In other words,
Inference is the name given to a procedure in which we compare
different answers to a question and try to evaluate the chances of
each one being correct (possibly choosing the one with the best chances)
based on the given evidence.
Evidence has the broadest possible meaning and it must include
every single piece of information, of every quality or size, which we
accumulate as the result of our exploration of the world using our senses
or our reason, including whatever we can derive through a chain of
reasoning (assuming that the reasoning was sound).
An odour can be evidence for nearby food and this can be seen
from two completely different perspectives. One can justify it via our
experience as from the very beginning of our lives we learn that this
association works. Another way is to rely on acquired knowledge, which
can also be understood as some kind of experience, but of a different
quality. We know that odours are the result of our nose sensing molecules
of food in the air. If there are molecules of a certain food floating around in
the air, the food either has been there or still is.
Sunburns are evidence that something is coming out of the sun,
crossing the whole distance between that star and our planet Earth, and
finally hitting your skin with enough energy to damage it. This is indirect
evidence that this something-which-is-coming-out-of-the-sun exists.
Notice that the line between direct and indirect evidence is a thin
and blurry one, if the distinction makes sense at all.
We usually attribute the term direct evidence to something that can be
measured and the term indirect evidence to something that is implied by
some rational consideration. Is an odour direct or indirect evidence for
nearby food? If you say indirect because you cannot see the food, think
again. Are you sure that you can attribute a higher level of reality to shapes
formed by reflected light captured by your retina than to molecules of a
substance detected through the sense of smell? Because of this, it is
unnatural and even arbitrary to separate different kinds of evidence, and
we will consider every piece of information on the same footing, unless we
have strong reasons not to.
Inference is something that might indeed be difficult in its details.
Reconstruction of a story using pieces of evidence can be tricky as the
number of possibilities to fill the gaps can be larger than one can even
imagine. But, surprisingly, we do understand the overall process well
enough to be able to mechanise its fundamental workings. Even more
impressive is the fact that this mechanisation is so simple that I can give
you the final answer in one line. Do you doubt it? I will show you.
Consider a certain question and all the relevant information we
possess to answer it. Call the set formed by all of this put together
(question + relevant information) the dataset D. Suppose we want to know
the odds of a certain specific answer, which we will call A, being the correct
one. These odds can be written as a simple-looking mathematical formula
(do not get desperate because of the mathematics, just bear with me):

P(A|D) = P(D|A) P(A) / P(D)
That is all. Seriously. We do need to understand what each part of
this equation means but, written as it is above, it can be readily
programmed in a computer. To be entirely fair, the above symbols give you
one of many different ways of doing inference. It turns out that, although
there are many ways, the above formula gives you the most general and
correct one. It is known by the name of Bayesian Inference. There are
others, but they are all either approximations, particular cases or plainly wrong.
We can see Bayesian inference as a computer program which runs
in our brains every time we stumble into a question. Let us call this
program BAYES. The data that must be fed to BAYES in order to get a result is:
- The question we want to answer. This question is considered as an
element of the dataset D.
- The possible answers to the question, one of which is A.
- How the question is related to its possible answers. This is what the
factor P(D|A) symbolises.
- All information which is relevant to connect the question and the
answers. Like the question itself, this information is also part of D.
- All extra information we can collect about the answers alone,
represented by the factor P(A).
Then, BAYES gives you back the odds of each answer, which is
symbolised by the notation P(A|D). The following diagram is a way to
graphically visualise BAYES:
graphically visualise BAYES:

The program BAYES: once it is fed with all information about a question and its possible
answers, the program spills out the probabilities of each possibility.
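For a finite list of possible answers, the whole program can in fact be written down in a few lines. The sketch below assumes the standard discrete form of Bayes' theorem (posterior proportional to likelihood times prior); the coin question and all the numbers are invented for illustration.

```python
# A minimal "BAYES": feed it P(answer) and P(data | answer) for every
# possible answer, and it returns P(answer | data) for each of them.
def bayes(prior, likelihood):
    """prior: {answer: P(answer)}, likelihood: {answer: P(data | answer)}.
    Returns the posterior {answer: P(answer | data)}."""
    unnormalised = {a: likelihood[a] * prior[a] for a in prior}
    evidence = sum(unnormalised.values())  # P(data), the normalisation
    return {a: w / evidence for a, w in unnormalised.items()}

# Toy question: is a coin fair or double-headed, given that we saw two heads?
prior = {"fair": 0.5, "double-headed": 0.5}
likelihood = {"fair": 0.25, "double-headed": 1.0}  # P(two heads | answer)
print(bayes(prior, likelihood))  # {'fair': 0.2, 'double-headed': 0.8}
```

Note that the program needs the full list of possible answers up front; it only weighs them, it cannot invent them.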
The largest part of this book will be concerned with understanding
each one of those three P's you see in the above diagram. That is why you
do not need to get upset if you do not understand them right now. In the
course of the book we will also refine the above diagram as, the way it is
right now, it is missing some details. That will be done gradually and I will
explain each step exhaustively.
Now, if you reread the previous paragraphs, you will notice that
BAYES needs to be fed with the possible answers. Why? Once we learn a bit
more about probabilities, this will be clearer, but the rough justification is
that we cannot calculate the odds of a certain thing happening unless we know
all the possible results. It makes a huge difference for calculating the odds.
For instance, if there is only one possible result, its odds are 100%. If
there are one million possible results, the odds of many of them will have
to be less than one in a million.
This is actually something very deep and fundamental. It is. I am
neither joking nor exaggerating. BAYES cannot find the possible answers to
a question or the possible results of an experiment. The only thing that
BAYES can do, although it does it very well, is to evaluate the chances of
each possibility. Sometimes though, depending on the information given to
it, not even that! There are instances in which all that can be done is to
know which possibility has better odds of being right without knowing the
exact value of those odds!
The task of finding out the possible answers/outcomes of
something is the one task that even today has not been mechanised. This is
the place in which imagination and creativity lie, and always will. It is in
devising possible scenarios that humans have the greatest advantage over
other species. I am not saying that creativity will never be understood to
the point of being programmable. I do not know. But I do know that when
machines start doing that at the same level we do, we should start to look
at them from a completely different point of view.
A fair question you might ask at this point would be: if BAYES is
actually such a simple program to write, why on Earth are you writing a
whole book about that? An even more practical question that might be
hovering inside your mind would be:
Do we really need to understand all those symbols if we are simply
interested in using BAYES to decide upon something?
The answer for this, and for the infinite number of similar
questions that can be asked about almost every kind of knowledge one can
acquire, is an obvious no. You do not need to understand it to use it as
long as the necessary information comes processed to you. Similarly, you
do not need to know how your mobile or your car works to use them too.
When one of them breaks down, you can take them to a technician, but
BAYES has a subtlety. BAYES does not break down. When it fails, it is
because you are doing something wrong at some point. Worse yet, you
might never know there is a problem until the wrong decision has been taken.
You do not need to understand BAYES to continue to live your life,
but this book is for those who want to. It is written for those who are
interested in knowing how and mainly why that program works. In
understanding the logic behind the program, why should we trust it and
also its limitations, we open the way to use it more efficiently and,
consequently, more often.
There is an additional argument. Even those who would be
satisfied with simply running the program will need to feed it with the
correct data in the correct form. Another aim of this book will then be to
answer this other question:
How can we write down everything we know about something in a
way that BAYES will be able to process it?
Do all those things sound a little familiar to you? If all of this rings a
bell, that is because it should. We run BAYES or some of its simplified and
approximated versions, almost every single moment of our lives, even
without knowing it. And we do exactly the same thing even when we are
dealing with the most sophisticated of our intellectual endeavours. Yes,
both the almighty philosophy and science rely on BAYES. In all of those
situations, surprisingly, the program to be used is exactly the same! The
only difference is the nature of the questions and what we do with the answers.
If you do not look at reality in a different way after reading the rest
of this book, if you do not feel uncomfortable at any moment and if you
remain the same person at the end, I suggest you read it again, more slowly.
How do I know you will? It is just a question of inference.

You Cannot Avoid Probabilities

Believe me. I tried. When I was in high school I discovered that I loved
geometry and hated probability. I could visualise inside my mind every
single geometric concept and understand how it worked, but a simple
calculation of the odds of something would give me a headache. I could
never get all the possible combinations of that card deck right!
I am a synesthete. I perceive colours when I think about maths. I do
not see colours. That is not how synesthesia works. I feel colours associated
with numbers or formulas or even whole theories. The number 7 is red and
so is classical mechanics. I do not see the number in red colour when I look
at it. It is still printed in plain black letters. But it is red. The number 3 is
green, just like electromagnetism. Some of these associations, especially
the ones related to physics, I know exactly where they come from. The
numbers, I have no idea.
The fact that geometry, inside my mind, was represented as such a
varied mixture of colours while probability theory was all black and white
gives you an idea of my despair. Once I finished high school and started my
physics degree, I was relieved that I would use a minimum of probability
theory. Maybe I would never need to use it again at all. I was naive to think
that I could choose a route in physics that would use mainly geometry and
only very basic probability.
If you are thinking of becoming a good theoretical physicist, be very aware
that you cannot avoid using probabilities a lot. Quantum mechanics, the
best description of the microscopic world we know at the moment, can
only be connected with experiments through the use of probabilities. But
the presence of probabilities is much more pervasive than that. Even
thermodynamics, the theory of heat that probably conjures steampunk
images for most science fiction fans, and its twin area of statistical physics
require a deep knowledge of them as well.
You might somehow dodge probability by judiciously choosing an
area of science other than theoretical physics, but one of the things that
you will learn in this book is that you cannot really understand science
without understanding the meaning of probabilities. You do not really need
to know how to calculate them rigorously, but you do need to grasp their
fundamental concepts. Not taking the time to acquire this knowledge is
one of the biggest problems of modern scientists, many of whom are
happy to become more like calculators than thinkers. Nothing against
that, people do what they want with their lives, but there are always
consequences, in this case, for science and even how it is perceived by
those who are not scientists.
This does not seem appealing for non-scientists, I know. But as I
said in the previous section, science is not the only place where
probabilities are important. Every situation in which more than one thing
can happen and we do not know which one will, we are led to think about
the odds of the possible results. Odds, of course, is just a different term
for chances or probabilities.
When BAYES spills out odds for possible answers, it is spilling out
probabilities. That is the reason why its end product is P(A|D). The P in
this case is for probability. By analogy, you can readily infer that
probabilities are also the correct way to feed in the information to BAYES.
There you have the three P's. Because of that, we have to learn how to
translate information into the language of probabilities. As you see, there is
no escape. One way or another, we will need to understand probabilities
better before being able to understand and use BAYES.
Another argument to show you how probabilities are indeed
everywhere is the fact that you do not need to look for mathematicians or
physicists to find people whose job is to actually calculate them.
Bookmakers are examples of professional probability calculators who, in
general, are not scientists. People working in different jobs in the
financial market, traders for instance, are also probability professionals.
Every decision we have to take in our lives requires some rough
estimation of probabilities. Remember that this is how the ranking
procedure we called inference is done; we are just replacing the word
'odds' with the word 'probabilities'.
We all know how difficult it is to take a decision and how important
it is to take into consideration as much information as we can. We do that,
usually, in a very structured way. We usually start by identifying the full
spread of options available to us. When we take a deeper look at how
probabilities are defined, we will see that this delimitation of the set of
possibilities is the first fundamental step in constructing probabilities. We
do that naturally.
We then consider similar situations in our own lives, search in
books and newspapers, or ask people if they ever had to go through something
similar. We also make some assumptions. For instance, we hope that if the
situation is similar, doing similar things will bring similar results. This is
another step in calculating probabilities. We will learn this as the technique
of counting frequencies. You assume that if you find many similar
situations, the fraction of times a certain consequence results from a given
decision will be roughly the same. This is already a measure of probability.
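The counting just described can be sketched as follows; the list of past outcomes is invented for illustration, and the estimate is simply the fraction of times each outcome occurred.

```python
# Estimate probabilities by counting how often each outcome occurred in
# similar past situations. The recorded outcomes are invented.
from collections import Counter

past_outcomes = ["late", "on time", "on time", "late", "on time",
                 "on time", "on time", "late", "on time", "on time"]

counts = Counter(past_outcomes)
estimates = {outcome: n / len(past_outcomes) for outcome, n in counts.items()}
print(estimates)  # {'late': 0.3, 'on time': 0.7}
```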
But we also take into consideration our emotions. Sometimes, they
are so strong that they even make us ignore all the other evidence we
collected, no matter how convincing it was. It is usually not simple,
but in the end we do it. The big surprise is that we will learn that even the
emotional influence enters BAYES. And it enters also as a probability! In
fact, one of the main mistakes of many other inference procedures which
are not Bayesian is that they ignore this influence.
Because this book is about taking decisions in the best way you
can, we are then forced to delve into the world of probabilities. Decisions
are the most important and unavoidable things in our lives. Taking
decisions transcends science, business and personal life. It is amazing that
in the last two centuries we not only learned there is a right way to do it,
but in one of the greatest exceptions in history, we actually discovered the
right way! That is one of the greatest unsung (among the general public)
achievements of all time.
In the best no-free-lunch way though, the fact that we know the
right way to decide does not mean that it is simple to implement. Still, I
want to show you that it is simple to understand. And once you understand
Bayesian inference and start to use BAYES and all its principles in your
personal life and your work, you will finally see the world in a totally
different way. This will make you, like me, lose your fear and stop trying to
avoid probability. You will, probably, learn to love it.

The Measure of our Ignorance

Probabilities are always related to things that we do not know for sure.
Whenever our knowledge about something is uncertain, be it a natural
phenomenon like where the lightning will strike next time or a daily-life
conundrum such as which is the fastest way to work today, we
naturally think about probabilities.
The language of probability is part of our culture. Whenever we are
completely sure about something, we do not hesitate to say that we have
100% of certainty about that. This 100% is one of the many forms of
expressing a probability estimate. It means that we analysed all relevant
information (or at least the information we deemed relevant) to reach that
conclusion and that all other conclusions are, according to that
information, surely wrong. They have 0% chance of being right.
When we are not sure about our answer, this 100% certainty
decreases and can even get to 0%. The actual number, most of the time, is
not a precise calculation, but a rough estimation based on some key
numbers. If we have to consider the possibility of two different outcomes
for something, like whether the child is going to be a boy or a girl before
the ultrasound, and we have no idea whatsoever which alternative is the
correct one, we simply pick one of them and attribute to it 50% of
certainty. This is the measure of how much we do not know about that
outcome, or of how much information we do not have to infer it correctly.
Notice the emphasis I am giving to the fact that we talk about
probabilities according to the information we have (or not) at our disposal.
This is, again, a reference to our inference program BAYES. We need to
give information to get probabilities and consequently the probabilities we
get do depend on the given information very strongly. If we do not know
some aspect of a problem, most of the time we will not be able to choose
one of its possible answers with 100% certainty. Ignorance leads to
probabilities. The opposite is not always true, though. Sometimes, even if
we know everything we can know about something, we still cannot do
better than just estimating probabilities. This is what the scientific
community, especially the physicists, had to accept when it became
evident that quantum mechanics was the right description of physical
reality. Do not worry about that right now, we will have time to talk more
about quantum mechanics later on.
Although the association of probabilities with information (or
rather, the lack of it) makes a lot of sense when explained with the
arguments I presented up to this point, it took a long time to formalise this
idea in the appropriate mathematical/philosophical way. After a lot of
effort, wrong turns and a series of mistakes which are common in scientific
research, we finally discovered that it is not the probability itself that
measures how much you do not know about something, but a slightly
different quantity. In order to calculate this quantity, we have to know the
probabilities for all possible answers. If we feed these probabilities to a
certain program, it will give us back the size of our lack of knowledge
about the subject. This quantity, which is nothing but a positive real
number, is something that everyone has heard about at some point. It pops up
once in a while on TV shows, popular science magazines and internet
videos. It is called entropy and it is indeed a measure of missing
information, in other words, ignorance. It is a fundamental concept
underlying BAYES and we will learn a lot about it here.
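The "certain program" just mentioned can even be sketched in a few lines of Python (an illustration of my own, not something we will rely on later). It implements Shannon's formula for entropy, H = −Σ p log₂ p, which takes the probabilities of all possible answers and returns the missing information in bits:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2(p)): the missing information, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))  # fair coin: 1.0 bit of ignorance
print(entropy_bits([1.0]))       # a certain outcome: 0.0 bits
print(entropy_bits([0.9, 0.1]))  # biased coin: about 0.47 bits
```

Notice how the fair coin gives exactly one bit of ignorance, while certainty gives zero.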
Invariably, when a new technology is incorporated into our society, it
brings with it a new way of interpreting the world. Steam power made
people think in terms of energy. Electromagnetism incorporated the idea of
fields into popular culture. After the Second World War, something
similar happened with information. The increasing importance and
popularisation of computers and communication media since then led to
another change in point of view by the mid-nineties and to the
recognition that information should be treated as a physical entity, at the
same level of reality as energy or mass (Landauer, 1996). It is true that
information is a kind of physical entity which we do not fully understand
yet in all its subtleties, but which pervades science wherever you look.
We have some good grasp of it, but once again it becomes more
complicated when we look at the quantum world and, as we know, the
quantum world is the actual world.
To argue that we have to know how much we do not know in order
to improve our learning sounds like advice from a Buddhist monk, but it
could not be more objective. It is actually a bit obvious. Without measuring
our ignorance about a subject, how could we evaluate the improvement in
our learning of it as we acquire more information? Learning something
does not only mean to accumulate new information, but also to review our
concepts concerning the subject based on that new information. For
instance, what if the new information we get is simply useless? Useless
information does not decrease our ignorance. Unless we actually know
how much our ignorance changes when the review of concepts is carried
out, we will not be able to evaluate the best way of doing that.
Right now, you must be thinking the following: "It makes sense. By
knowing the amount of ignorance, I can devise a way to make it decrease
as fast as possible!" Right? The surprising answer is NO. It is more subtle
than that. It turns out that the best way to learn a subject is by
guaranteeing that, each time new information arrives, the new conclusions
we draw incorporate no assumption that is not contained in that piece of
information.
Think about this. You see a video of a poor man robbing a shop and
then you conclude that poor people are robbers. That last bit was a
conclusion that was not implied by the fact. You are including an extra
assumption, the one that allows you to generalise to a whole population
the behaviour of a single individual.
The amazing fact that we will understand later in this book is that,
to guarantee that our conclusions on the face of new information are
unbiased, we have actually to find a way to maximise our ignorance about
anything that does not concern the information we have. It looks a bit
frustrating, but I assure you that it does make sense. Be patient and we will
get there one step at a time.
Are You Afraid of Maths?
We are all afraid of criminals, accidents and failures. What do we do? Most
of us find ways around them. We put locks on our doors, insure our car and
study harder for exams. We might not like doing those things, but once we
are forced to, we just do them one way or another.
We are also forced to use maths every day to deal with money. We
have to calculate taxes, change, interest rates, mortgages and so on.
Slowly and with a lot of effort, we end up learning how to do those things
simply because we cannot avoid them. Well, we can, but then we would
have to accept a very different kind of life. Some do.
I am not going to lie to you. Probabilities require maths. The good
news is that they do not require more maths than you already know. You
will get along very well by knowing simply the four basic arithmetic
operations: addition, subtraction, multiplication and division. You will
have to learn a lot of new concepts, but as far as calculations are
concerned, those four will be enough for our
purposes. That does not mean that we will not see other mathematical
ideas in this book, but they will appear in the role of side dishes which you
should feel free not to eat if you do not want to.
My own experience is that there are two very scary features about
mathematics that make it different from any other subject. The first is that
each new thing you learn in mathematics requires you to understand
almost everything that comes before. You will not understand square roots
without understanding squares, which you cannot understand without
multiplication, which you cannot understand without addition, which you
cannot understand without knowing what a number is. The main source of
this difficulty is that it is usually not enough to memorise. Given enough
time and enough storage space on our digital devices, memorising is not a
problem. The problem is that you are forced to actually understand the
concepts, and that is something that electronic devices still cannot do for you.
And that brings us to the second scary feature of mathematics: we
are forced to think. Most of the time, we are required to think very hard.
That is the feature that often puts people off. If you do not like to exercise
your brain too much, that is going to be a problem. I will assume that, since
you decided to read this book, you are up to the challenge.
Otherwise, I will understand if you furtively put this book back on the shelf.
Right now some of you must be shaking your heads and saying "No.
It's not that. I just don't like formulas. That's it. Can't you just explain
everything without using formulas?" Let me show you that your problem with
formulas is, actually, a non-problem. Formulas are friends, not enemies.
When you think about a formula, you think about the modern
incarnation of it. You think about those modern mathematical symbols
with some equal sign, or its cousins, inserted somewhere in between a
sequence of strange letters. A good example of how weird (or artistic,
according to one's preferences) they are is the following:

∇ × B = μ₀J + μ₀ε₀ ∂E/∂t

This "doodle", as my mother used to call the symbols she would see
on the sheets of paper lying on my desk, is one of the laws of
electromagnetism. It is one of the famous Maxwell equations and it is
meant to encode a beautiful experimental fact. When people started to
play with the connection between electricity and magnetism, they
discovered that whenever you have an electric current and/or an electric
field that changes with time, you will create a magnet. The above formula,
in addition to telling that to the trained eye, gives the precise amount of
magnetic field generated when the variation in the electric field and the
current are such and such. By measuring the latter, the formula allows
you to know with high precision the value of the former.
The symbols themselves are actually not compulsory. We could
describe the whole equation with words. But the symbols do have a
purpose. They work as abbreviations for a thread of thought which can be
very complex. You can imagine the combination of symbols in a formula as
a sequence of instructions in a computer program. Some of these
instructions are once again abbreviations, now for other programs in a
nested sequence that goes all the way down to simple additions and
multiplications.
The problem with describing this program symbolised by an
equation using only words is that it can take whole books to explain what
some of them really do in detail. We do not want to write books every
time we describe some phenomenon, nor read a book every time
we want to calculate some quantity. Formulas are a healthy application of
human laziness. The Greeks, for instance, did not have our modern
notation and they were able to do a lot of mathematics themselves, but
doing mathematics with the Greek notation is something that I, at least,
have no wish to attempt today.
In summary, formulas are convenient devices and not using them
would be a waste of everybody's time. But if, even after being presented
with all these reasons, you are still not convinced that you need them, there are
two things I can say to you. The first is the nice one: just jump over the
formulas and read the text. That is what most scientists, including myself,
do the first time we meet a formula we do not know. If you ever tried to
learn a foreign language, you know how it works. When you start reading a
text, you look for all the unknown words in the dictionary, but after one or
two paragraphs you just give up and try to make as much sense of the text
as you can with what you already know. At the end, we can always learn
something and the most important result is that, next time, we have that
feeling that we are somehow more familiar with the language.
The second thing will be a slap in the face. Do you seriously think
that you can truly understand mathematics without knowing how to read
formulas? Can you imagine someone saying that she understands English
but cannot read a book? You can somehow get an idea or a general feeling
of a subject that requires maths, but if you do not make an effort to learn
the language of mathematical symbols you will always have only a
superficial understanding of the subject. If that is what you want though,
then that is fine. It is your decision to make.

Games of Chance
The objective of the previous chapter was the same as that of a film trailer:
hook you and give you a taste of the things we are going to talk about. Among
those things, there is one that stands out: probabilities. Everything we will
learn in this book involves probabilities and therefore that will be the first
thing we will look at more closely, starting... now.
What is the probability of getting heads when you toss a fair coin? The
sensible, intuitive and often (approximately) correct answer is the obvious
value of 1/2, or 50%. The apparent triviality of this answer hides a series of
reasonable, but in no sense trivial, assumptions which are so natural that
they are hardly ever noticed by us. Each one of those assumptions, not
mentioning the way they influence one another, plays an important role in
reaching the final values we gave above. Even understanding those values,
as we shall see, is a slippery task.
If you carefully read the previous paragraph, for instance, you will
notice that a precise definition of what should be understood by a coin
has not been given at any moment. Do I need to waste my time doing that?
This piece of information is surely part of the background culture of almost
every human being on this planet, and certainly of all those who are reading
this book. For the great majority, the image it conveys is that of a metal
disc with different inscriptions, or drawings, on each side. One of the sides
will usually portray the head of an authority, which is obviously the reason
it is denominated heads in English. The other side has the less classy
name of tails (in Brazil, we call them face and crown respectively).
Depending on the country, the situation or the time, there will be some
variations. In some countries, coins have holes in the middle or have a
square instead of a circular shape. Some non-currency coins are made of
plastic or even wood. But some characteristics will usually remain constant,
the most important of them being the flatness of the coin.
The role played by the flatness assumption is twofold. First of all, it
tells us that we are dealing with an object with only two sides, which
translates to only two possible results of the coin tossing: heads or tails. I
could, for instance, have asked instead the probability of getting any one of
the faces when rolling a fair dice instead of tossing a fair coin. The answer
that comes to the mind of most people is now 1/6, which has a much
higher chance of being the wrong one! Why? Because contrary to the word
coin, the word dice can describe a much more varied class of objects, not
only the typical six-sided casino dice.
In fact, if you are a geek or an RPG (Role Playing Game) player as I
was for many years (I'm still a geek, by the way), you would readily ask
"What kind of dice?" If you ever played Dungeons & Dragons, you would
know that regular solids are used as dice in the game and called by the self-
explanatory terminology of d4, d6, d8, d10, d12 and d20 (d for dice,
followed by the number of faces). Sometimes, a coin is considered a
generalised dice and called a d2 in RPG terminology. There is even the
case when you might be required to toss a d100, which is often not actually
a dice with 100 faces, but in fact the combination of two d10 tossed at the
same time, with each one representing a digit of a number between 1 and
100. Given the convenience and obvious generality of this notation, I will
use it as my standard notation for dice in this book.
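For the curious, the two-d10 recipe can be simulated in a few lines of Python (my own sketch; conventions for reading "00" vary from table to table, so take the details as illustrative):

```python
import random

def roll_d100():
    """Simulate a d100 as two d10: one die gives the tens digit,
    the other the units; a roll of 00 is read as 100."""
    tens = random.randrange(10)    # 0-9
    units = random.randrange(10)   # 0-9
    value = 10 * tens + units
    return value if value != 0 else 100

rolls = [roll_d100() for _ in range(100000)]
print(min(rolls), max(rolls))  # the possible values run from 1 to 100
```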
Dice with different numbers of faces normally used in Role Playing Games like Dungeons & Dragons.
The second idea conveyed by the flatness assumption in a coin (or,
alternatively, d2) is that of symmetry. This is an important concept in all
areas of science, because it summarises the idea of a set of things which
are indistinguishable such that there is no reason whatsoever to prefer one
over the other. As a fancy but extremely important example, the most
fundamental laws of physics are based on a symmetry concept called
gauge symmetry, which among other things is the reason for the existence
of electromagnetic fields and the explanation of why electric charges are
conserved much like energy.
Back to our coin, when we assume that it is flat, this means that
both its sides are equally flat, which then guarantees that there is no
influence of the shape on the chances of getting either one or the other
result when we toss it. The same happens with dice. As long as all faces
have the same shape and the dice is geometrically symmetric, in the sense
that it looks the same when you are facing any of its sides, there is no
reason to think that one of them is privileged. This observation is
connected to another piece of information contained in the initial question,
the quality of being fair.
Fair coins or dice are those which, in addition to being symmetric in
shape, are not loaded. As we have already seen, the geometric symmetry is
there to ensure that the external shape will not have any influence in the
final result, but coins and dice are physical objects and there is another
way to make one face more likely than the others: tampering with their
internal structure.
Fairness, in this case, goes beyond the external appearance of the
objects and requires the symmetry to be also an internal one. This means
that there should be no variations in the density of the material of which
the object is made at any point in its three-dimensional structure. In
practice, there will always be some irregularities, but we expect them to be
small enough not to influence the final result too much. Besides, if these
irregularities are truly just random, they will have the same chance of
increasing or decreasing the local density, making them cancel out in
practice. Although perfect fair coins or dice do not really exist, these
idealisations are very useful and we will use them many times in this book
for the sake of clarity.
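Since perfect fairness only exists as an idealisation, a computer is the natural place to play with it. The Python sketch below (mine, purely illustrative) rolls an idealised fair d6 many times and compares the observed frequencies with the symmetry value of 1/6:

```python
import random
from collections import Counter

random.seed(42)                 # make the run reproducible
N = 60000
counts = Counter(random.randint(1, 6) for _ in range(N))
for face in sorted(counts):
    # each observed frequency lands close to 1/6 ≈ 0.1667
    print(face, round(counts[face] / N, 4))
```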
We have finally described all the relevant details of our coin, but
we are not done yet. Having opened up your eyes to the fact that there are
some details which are very important when calculating a probability, I still
have not described the most important one how the tossing itself is being
carried out.
Tossing a coin, or rolling a dice, is something that we have seen so
many times in our life that we hardly stop to think about the several ways
this can be done. We immediately assume another fairness quality to the
tossing (not the coin) with the meaning that the person who is carrying it
out is going to throw the object upwards or downwards with enough speed
and power to make it impossible to predict how it is going to stop. The
conviction that this is the agreed way of doing a tossing is so strong that if
the coin-tossing person simply chose one of the sides and gently put it
facing upwards, that would result in a wave of angry complaints from all the
other players of the game. That is especially true in an RPG session, where
emotions can run amazingly high.
You should now stop reading for a couple of minutes and reflect on
how eye-opening the above discussion is. We started with an almost trivial
action and saw that we automatically attach to it a whole series of
assumptions that might be completely wrong! Be assured that this happens
with almost every action we carry out in our daily lives. If you think enough,
you will start to see it everywhere.
One can now appreciate how assumptions based on some
previously acquired knowledge are put together in our minds to create a
scenario in which there is no reason to believe that one side of our coin has
a higher chance of turning upwards than the other. This symmetry, which
depends on the shape, the structure and even the way the coin is tossed, is
then translated into the number 1/2. Why? Well, we simply imagine that we
have a whole thing that has two possibilities. We attribute to this whole
thing the number 1 and then it is only natural to divide it into two halves,
neither being more probable than the other, resulting in the number 1/2.
The particular numbers, 1 and 1/2, obviously need a better justification, but
that is something more complicated and we will spend some time to
understand it in the following chapters.
But even without knowing the precise reasons why this is done, we
are all used to attribute numbers to probabilities. We see it on the TV or in
places like horse races all the time and we develop some understanding of
it. My preferred way to express probabilities in this book is by giving a
number between 0 and 1 instead of the more popular percentage. This is
really just a question of preference, of course. A percentage simply means
something divided by one hundred, and you can convert from percentages
to numbers between zero and one simply by dividing the percentage by
100. For instance, 25% is the same as 25/100 = 0.25. The
conversion rule also goes the other way round. If I say that some probability is
0.3, you just multiply it by 100 to get the percentage, in this case 30%.
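The two conversion rules are short enough to be written as a pair of one-line Python functions (a trivial sketch of my own):

```python
def percent_to_probability(pct):
    """A percentage is just something divided by one hundred."""
    return pct / 100

def probability_to_percent(p):
    """Going the other way round: multiply by one hundred."""
    return p * 100

print(percent_to_probability(25))   # 0.25
print(probability_to_percent(0.3))  # 30.0 (up to floating-point rounding)
```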
Back to our discussion on fairness, it serves to illustrate another
notion which is so important that we cannot emphasise it enough. Whenever
we calculate the probability of a certain event, meaning the result of any
kind of experiment that is carried out, we start by making a series of
assumptions about how that event might happen. We make a mental
model (remember this word!) of that particular experiment and use it to
deduce symmetries and other rules that we think make sense, like for
instance how the coin or dice will move in space.
All the assumptions that enter in the building of a mental model
are based on some information we have before the actual experiment is
carried out and, because of that, it is called prior information. The idea of
prior information is one of the most important concepts we will
encounter in this book, not to mention in science and life, and we will
dedicate more time to it later on. From everything we have learned, it must
be clear that we use this prior information to construct the probabilities for
the possible outcomes of an event. Because this kind of probability is based
on prior information, we call it by the obvious name of prior probability.
For our games, the prior probability we are interested in is the chance of
each face turning upwards in a dice rolling or coin tossing.
But what happens once the coin is tossed? In real life, especially
when we go to a casino, we can never really trust that all the symmetry and
fairness assumptions are indeed correct. Consider the following situation. I
take a d10 out of my pocket and show it to you. At first sight it looks really
symmetric, which I swear to you is the truth. You take it into your hands
and examine it as closely as you want. Although it is difficult to evaluate
the homogeneity of its internal density, a quick check does not seem to
indicate any serious tampering. I also guarantee to you that I will throw it
as high as I can to make the action as fair as possible. This should be
enough to convince you that the probability of any side is 1/10, a number
meaning that each of them has the same chance of turning upwards at the
end. In other words, the odds for any face are 1 in
10. At this point, we say that your prior probability or, to get rid of too
many words, simply your prior for this dice rolling is 1/10 for each face. I
then throw the dice and the result is 1. Then I do it again and I get another
1. If that happens a third time, you would start to suspect that something is
wrong. Either the d10 was not as symmetric as you thought or I am using
some wicked sleight of hand to influence the rolling.
If I roll the d10 one hundred times and all the results turn out to be
1, you would have no doubts that the probabilities for each side are not the
same and that your prior, as plausible as it seemed to you at the beginning,
was completely wrong. If I stop at that point and say to you that, if you
predict the next result correctly, I will give you one thousand pounds, what
would be your best shot? No matter how symmetric the d10 looked in
the beginning or how high I throw it each time, any answer other than 1
would obviously not be a very smart one. Excluding, of course, the case in
which I am really a magician who can actually manipulate the results and
want to get some easy money from you, everything points out to the fact
that the particular d10 I am rolling will always give the same result. Forget
the prior information. It was obviously wrong.
What happened here, as you surely can appreciate, is another
example of our central theme inference. As we discussed briefly in the
beginning of this book, whenever we have to predict something, either
about the future or about the past, we rely on some previously acquired
information to do that. The prior probability summarises all this initial
amount of information. In the d10 rolling case that was described above, all
available prior information would suggest to us that any number from 1 to 10
would be equiprobable (having the same probability). But another
important lesson in life is that information might be either wrong or
incomplete and, once we recognise it, we have to change our beliefs. And
by changing our beliefs, we consequently change our next predictions! If
we insist on not changing our beliefs in the face of new information, our
predictions will be wrong.
Let me add a small digression about prediction at this point.
Although most of the time the word is applied with the meaning of using
information to make an educated guess about a future event, throughout
this book I am going to use it also to denote a hypothesis about a past
event that is still unknown. For instance, based on some oral legends about
some historical character, one might be able to predict her birthplace even
if it is not yet known. The same can happen with the location of some city
(like Troy) or archaeological site, even though these are events that
happened in the chronological past. Some people would use the word
retrodict, but this level of distinction is unnecessary and I will not be using it.
Back to inference, I shall stress time and again that in order to
predict things correctly it is not enough to use all prior information; we
also need to update that information in light of new results! The seemingly
infinite series of 1s in our d10 virtually forces us to consider a new
probability for each face, one where the face 1 is going to turn up with
probability 1 and all the others with probability 0. This new updated
probability, after new information is incorporated, is what is called a
posterior probability. This process of incorporating new information to a
prior to get a posterior is usually referred to as updating the probabilities.
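The updating recipe is simple enough to sketch in Python (the hypothesis names and numbers below are my own illustration, not a standard). We weigh two models of the suspicious d10, a fair one giving each face probability 1/10 and a rigged one that always shows 1, and update our belief after each observed 1:

```python
def update(prior, likelihood):
    """One Bayesian update step: posterior is proportional to prior times likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}

# Prior: we start almost convinced the d10 is fair.
belief = {'fair': 0.99, 'always 1': 0.01}

# Each observed roll of 1 has probability 1/10 under 'fair' and 1 under 'always 1'.
for _ in range(5):
    belief = update(belief, {'fair': 0.1, 'always 1': 1.0})

print(belief)  # after five 1s, 'always 1' dominates the posterior
```

After only five consecutive 1s, the posterior already favours the rigged hypothesis overwhelmingly, exactly the shift of belief described above.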
We can now at the same time refine and shorten the definition we
gave of inference in the previous chapter as
Inference is a method for updating probabilities.
In a more mundane way, it is a method to change our beliefs
about how probable things are. As I have said before, Bayesian inference,
or our computer program BAYES, is the most fundamental method to
update probabilities and all other correct methods are derived from it.
The word Bayesian, repeated so many times here, comes from the
surname of Reverend Thomas Bayes, a clergyman who proposed a
simplified version of the full method in a now classic paper published in 1763
(Bayes, 1763). However, as often happens in every area of human
knowledge, the greatest contributions and the general method were only
rigorously formulated later, in this case by the French mathematician Pierre
Simon Laplace not long after. Especially in the last decades, Bayesian
inference has proven invaluable in many areas ranging from artificial
intelligence to genetics with profound, although not largely recognised,
philosophical consequences in science. Still, there are many scientists who
are reluctant to use it purely on the basis of prejudice and ignorance.
We shall explore all aspects of Bayesian inference as we travel
through the principles and the history of an area of mathematics that
touches virtually every aspect of nature. The initial point, and one of
utmost importance, is a question that we overlooked in our discussion so
far. What is a probability? But before we answer that, let us take a
historical look on how the question has arisen and the attempts to answer
it. As it always happens, the information contained in history is crucial in
arriving at the right answer.
Pierre Simon Laplace, the great French mathematician of the late 18th and
early 19th centuries, and a character we will meet again soon, is often
quoted as saying that the theory of probabilities is "at bottom nothing but
common sense reduced to calculation".
One of the aims of this book is to give, or rather to be, supporting
evidence in favour of the last part of this quote. Right now, though, we
shall spend some time appreciating the very practical origins of the
mathematical interest in probabilities.
Probability, the technical word used in mathematics for chance or
odds, has been invariably associated with gambling. When I say
"gambling" I am not referring to the innocent Sunday game with friends,
but to the actual game of chance which is played legally in the casinos and
more furtively in closed rooms late at night. The kind of gambling in which
the stakes are real money and people's lives can be ruined at the turning of
a card or the rolling of a dice.
Gambling was actually the driving force behind the development of
what we call today probability theory in its very beginning. The first
systematic study of probability was put together by an Italian physician
called Gerolamo Cardano at some point between 1525 and 1565. The
treatise itself was only published in 1663 (Cardano, 1663), 87 years after his
death.
Gerolamo Cardano (1501-1576)
Cardano was a very controversial character. A physician by
profession, he had many interests which included astrology, mathematics
and gambling. He had a life with highs and lows. For instance, one of his
sons was executed after poisoning a prostitute with white arsenic. He
wrote many books on several subjects. In most of them, he basically
compiled information about some topic adding, according to some of his
critics, very little to the actual knowledge of the area.
It is not easy to check all the claims, either in favour of or against his
contributions, but he is usually credited with the discovery of the
formula for the general solution of a particular kind of third-degree
polynomial equation, that of the form

x³ + bx + c = 0

Remember that in the above formula, x is the unknown variable,
the one we have to find. The other letters, b and c, are given numbers.
This is called a polynomial equation because a polynomial is a formula that
is composed of a sum of powers of a main variable (in this case x, but it
could be any other letter), each one multiplied by a number. The degree of
the polynomial is always the largest power appearing in the formula. For
instance, the following formulas are polynomials:

2x + 1 (polynomial of degree 1 in x)

x² − 3x + 5 (polynomial of degree 2 in x)

x²¹ + 4x² − 7 (polynomial of degree 21 in x)

Notice that it is not necessary for all the powers to appear in a
given polynomial. When we force some polynomial to be equal to a certain
value, which is usually zero, we have a polynomial equation with the
degree given by the degree of the polynomial. We have all learned in
school to solve polynomial equations of degree 1 and 2. Equations of
degree 1 are trivial, as all we need to do is to move all numbers to one side
and the variable to the other. For instance, using the first polynomial above,

2x + 1 = 0 ⇒ x = −1/2
The symbol ⇒ means "implies". This is, however, not a general
solution. It is a solution for a particular equation. A general polynomial
equation of degree 1 can be written as

ax + b = 0

This equation is also known as a linear polynomial equation. This is
because it can be used in geometry to describe a straight line (we will see
that later). The letters a and b can be any numbers with one exception: a
cannot be zero. If a is zero, then we clearly do not have a polynomial
equation at all. The general solution for the linear polynomial equation is
then given by

x = −b/a
The equation and the solution are general because any linear
polynomial equation can be obtained by an appropriate choice of a and b.
Similarly, we have all learned that the general polynomial equation of
degree 2, also known as a quadratic equation, is given by

ax² + bx + c = 0

and this usually has two solutions that we call x₁ and x₂, given by the
following formulas:

x₁ = (−b + √(b² − 4ac)) / 2a,  x₂ = (−b − √(b² − 4ac)) / 2a
You can appreciate that the solutions are much more complicated
than the solution for the linear equation. And there is also an extra
complication here. Due to the presence of the square root, we have to be
careful with the solutions. If the number inside the square root is negative,
there are no real solutions for the equation, which means that no real
number can satisfy the polynomial equation. For instance, consider the
equation

x² + 1 = 0

We have that

b² − 4ac = 0 − 4 × 1 × 1 = −4 < 0

and, therefore, we cannot find any real value for x satisfying the equation.
If you do not believe it, feel free to keep trying.
Today, we know that in cases like the one above, although there
are no real solutions for the equation, we can find complex numbers
satisfying it. Remember that complex numbers are a generalisation of the
real numbers in which it becomes legal to take square roots of negative
numbers. We will study them a bit later on.
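For readers who like to see things run, here is a small Python sketch of the quadratic formula (my own illustration) using the standard cmath module, which takes square roots of negative numbers without complaint:

```python
import cmath

def solve_quadratic(a, b, c):
    """Both roots of a x^2 + b x + c = 0; cmath.sqrt happily
    returns a complex value when the discriminant is negative."""
    root = cmath.sqrt(b * b - 4 * a * c)
    return (-b + root) / (2 * a), (-b - root) / (2 * a)

print(solve_quadratic(1, -5, 6))  # two real roots: 3 and 2
print(solve_quadratic(1, 0, 1))   # no real roots, but two complex ones: i and -i
```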
Before the complex numbers were taken seriously, however, the
attitude towards solutions of equations in which square roots of negative
numbers appeared was to say that there were simply no solutions. Then
came Cardano's solution for his polynomial equation of degree 3, also
known as the cubic equation.
The equation whose solution was presented by Cardano is still not
the most general cubic equation, because it is lacking the term with the
square of the variable. Because of that, it sometimes gets the unusual
name of depressed cubic. Still, even in this simplified case, the
solution is already more complicated than the one for the quadratic
equation. I am not going to write it in full here; it is enough to know that it
also involves square roots, but in a more complicated way.
In the same way as we have seen for the quadratic equation, it might happen that the numbers appearing inside the square roots are negative. For the quadratic equation, when this happens the solution is complex and, back when people did not know about the complex numbers, we
have seen that they would simply say that there are no solutions. However,
in Cardanos formula, the square roots would appear in an intermediate
step and what could happen is that we could have perfectly real solutions
even if those square roots were of negative numbers!
Because the square roots of negative numbers could be
manipulated to give real numbers as the final solutions for the equations,
Cardano was led to consider their existence as legal mathematical entities
that could be used in calculations and this led some to attribute to him also
the invention of the imaginary numbers, those which are square roots of
negative numbers.
Although imaginary numbers will not necessarily be used by us in this book, they are connected to probability via (guess what!) quantum mechanics. We will have only a very brief encounter with them, but if you are interested in more details about this connection and about Cardano himself, I would recommend Penrose's Shadows of the Mind (Penrose, 1994).
Cardano's love for gambling resulted in him being probably the first person to create and study systems to earn money with it. Cardano's book was called Liber de ludo aleae, which is Latin for Book on games of chance and, as we have already seen, was only published posthumously in 1663. The book was actually more of a gambling manual than an actual mathematics treatise, but it is still considered the first of its kind.
One of the most interesting things about Cardano's book is that it is possibly the first place in which prior probabilities for dice rolling are calculated taking into consideration arguments of symmetry. To be fair, the idea is not exposed in this way, but it more or less follows the same steps we used in the previous section. The possible results are enumerated and then the same odds are attributed to each one. Later on, Galileo would write a brief treatise named Considerazione sopra il Giuoco dei Dadi, whose exact date is unknown, in which he too would perform the same feat. It is
however not certain if Galileo had the idea independently or if he knew Cardano's work beforehand.
Although other scientists and mathematicians, like Kepler for
instance, also touched on the subject several times, the next great leap to
probability theory came with Pierre de Fermat and Blaise Pascal, and was
once again tied to gambling issues.

Pierre de Fermat (1601-1665)

Blaise Pascal (1623-1662)

Antoine Gombaud, known also as the Chevalier de Méré, was a French thinker and writer, often described as a gamester, who proposed to Pascal a problem concerning an unfinished game of chance. The original formulation of the problem is less important than the idea it conveys and was described using what was familiar at that time. As our aim is to understand its essence, we are going to use a simplified version of the game.
Suppose two men are playing a number n of games of heads and tails with a fair coin. As we are all used to in these situations, the number is chosen as an odd number simply to guarantee that there will always be a winner. It is just like saying the best of three, or five, or seven. Let us use the last option and take n = 7. To be the winner, it is enough to guarantee 4 victories. If this happens before the 7 games are played, obviously there is no need to proceed and we can stop tossing the coin. To
make things more exciting, let us assume also that there is a money prize
to be won. What happens if the sequence of games has to be interrupted in
the middle? For instance, suppose the players are two guys, one of which
did not finish the washing up before going to the bar and his wife has just
arrived with that angry look on her face (the male readers probably know
very well what I am talking about).
Assume that this happens after the 4th coin tossing. Surely, if both
players are tied at this point, no one would care too much. They would
simply stop the game and go back to their homes. However, if one of the players has already won 3 games, he would not be satisfied to go back home empty-handed. After all, he was one game away from getting all
the money! How should the money be divided among the two guys in this
situation? What kind of arrangement would leave both of them satisfied?
The creative reader might come up with many solutions, but the
actual intention of the puzzle is to illustrate a situation in which one has to
calculate the odds of someone winning based on what happened so far.
This puzzle was attacked by both Fermat and Pascal in a series of now
famous letters exchanged during 1654 of which only three survived. These
letters are considered to be the landmark that defines the creation of the
mathematical foundations of probability theory. If you understood the
problem well, you probably realised that it is actually a problem of inference.
From the Pascal-Fermat letters, two methods to calculate
probabilities were born. They were called the classical method and the
frequency method. Pay attention to these two terms as this is a distinction
that is in the kernel of even the most modern discussions about probability,
in particular those concerning Bayesian methods. Once I explain what both of them are in detail, you will surely understand the problem.
The idea of the classical method is to break down an experiment in
an exhaustive set of equiprobable outcomes. This set is technically called
the sample space. The feature of being exhaustive is extremely important
here. It means that this set contains all possible outcomes, nothing more and nothing less. Therefore, if the total number of outcomes is N, the probability of each one should, according to the equiprobable assumption, be given by 1/N. In the case of a fair coin, the sample space
would be

{heads, tails}

and this gives the 1/2 probability for each possibility that we became familiar with. For a fair d6, it gives probability 1/6 for each face and so on.
As you can readily appreciate, this allows one to attribute probabilities
before actually carrying out the experiment. The classical method is, in the
end, the simplest possible recipe to calculate a prior probability!
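The classical recipe is simple enough to be written in a few lines. Here is a sketch in Python (the function name is mine, not a standard one) that takes an exhaustive sample space and attributes probability 1/N to each outcome:

```python
from fractions import Fraction  # exact probabilities, no rounding


def classical_probability(sample_space):
    """Classical method: every outcome in an exhaustive sample
    space gets the same probability 1/N."""
    n = len(sample_space)
    return {outcome: Fraction(1, n) for outcome in sample_space}


print(classical_probability(["heads", "tails"]))   # 1/2 for each side
print(classical_probability([1, 2, 3, 4, 5, 6]))   # a fair d6: 1/6 per face
```

Note that the probabilities are assigned before any coin is tossed: this really is the simplest possible prior.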
There are, however, situations in which it is difficult to infer prior
probabilities. This happens when there are no simple symmetry arguments
to help. Get a sheet of paper and make a ball with it. Can you give any
probability whatsoever about how it is going to stop if you throw it?
Another example would be a loaded coin. Although it would be possible in
principle to calculate probabilities if one has complete information about
the density variations, that would be a very complicated problem that
would probably require some computer program to solve in practice.
Because of all these complicated scenarios, which are actually
more frequent than the simple ones, the frequency method was
suggested. The idea is to count the number of times a certain result is
obtained if the experiment is repeated several, several... several times. The more times the experiment is repeated, the better the estimation of the probabilities, a result that we shall know by the name of Law of Large Numbers. To be more precise, the usual formulation of the Law of Large Numbers is concerned with average values. The version that concerns frequencies is called Borel's Law of Large Numbers. Émile Borel was another French mathematician. He lived in the late 19th and early 20th centuries and was also a member of the French resistance during the Second World War.
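The frequency method can be watched in action with a short simulation. The sketch below (Python; the random seed is arbitrary, fixed only for reproducibility) tosses a simulated fair coin and shows the frequency of heads approaching 1/2 as the number of tosses grows, just as Borel's Law of Large Numbers promises:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible


def frequency_of_heads(n_tosses):
    """Frequency method: estimate P(heads) by counting heads
    among n_tosses simulated fair-coin tosses."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The estimate wobbles for small n and settles near 1/2 for large n.
for n in (10, 1000, 100000):
    print(n, frequency_of_heads(n))
```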
But the frequency method is not an ultimate solution; it has its own problems too. If you need to repeat the experiment to calculate probabilities, how can you determine the frequency of an unrepeatable event? For instance, what is the probability that a certain person will die from tuberculosis or cancer, let's say, tomorrow? That is in no way repeatable many times. In this case, what is often done is to use some
extra assumptions which seem reasonable. These assumptions are always
implicitly based on some symmetry principle. For instance, physicians
assume that if you count the frequency of deaths in the whole population,
this frequency can be extended to any person inside the population
because people's biology is approximately the same, it is approximately symmetric. Rather than a trivial and obvious assumption, this can be
something prone to a whole series of problems when the deviations from
symmetry become important.
Another issue with the frequency method is the assumption that is
called statistical independence. When you infer the probability of heads
or tails in a loaded coin by counting frequencies of these individual
results, you are secretly assuming that these probabilities are independent,
meaning that the previous results of the tossing will not influence the next
one. This seems to be logical in the case of a real coin, but might not be the
case for other processes in nature! In fact, there is a general class of processes in nature, called Markovian Processes in honour of the Russian mathematician Andrey Markov, whose main characteristic is the dependence on previous results. Think about rain, for instance. The simple fact that we can actually talk about the existence of rainy seasons indicates that the event of rain today is far more probable if it rained yesterday than if yesterday was the driest day of the year.
Of course one can fix this issue by being judicious about how to count frequencies. If one knows that each result depends only on the previous one, then we can count frequencies of pairs of events. However, if one does not know a priori what is the extension of the dependency (pairs,
triples and so on), this again becomes a prior assumption that has to be vindicated by more experiments. As you can see, it is not easy to get rid of prior assumptions.
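The idea of counting pairs can be illustrated with a toy rain model. The sketch below (Python; the two transition probabilities, 0.8 and 0.2, are hypothetical values chosen purely for illustration) simulates a two-state Markovian process and then counts pair frequencies, revealing the dependence that single-result frequencies would hide:

```python
import random

random.seed(1)  # fixed seed for a reproducible run

# Hypothetical transition probabilities: tomorrow depends on today.
P_RAIN_AFTER_RAIN = 0.8
P_RAIN_AFTER_DRY = 0.2


def simulate_weather(days):
    """Generate a sequence of 'rain'/'dry' days from the chain above."""
    state = "dry"
    history = []
    for _ in range(days):
        p = P_RAIN_AFTER_RAIN if state == "rain" else P_RAIN_AFTER_DRY
        state = "rain" if random.random() < p else "dry"
        history.append(state)
    return history


h = simulate_weather(100000)

# Counting frequencies of PAIRS of days exposes the dependence.
rain_after_rain = sum(1 for a, b in zip(h, h[1:]) if a == "rain" and b == "rain")
rain_days = sum(1 for a in h[:-1] if a == "rain")
print(rain_after_rain / rain_days)  # close to 0.8, far from the overall rain frequency
```

Counting only single-day frequencies here would give roughly 1/2 for rain, completely missing the fact that a rainy day makes the next one much more likely to be rainy too.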
Clearly, logical consistency requires that both the frequency and
the classical methods give the same answer when the situation is such that
both of them can be used to calculate the desired probabilities. Otherwise,
something should be very wrong with one or both of them.
For the sake of completeness, and to satisfy the curiosity of some readers, let me explain what the solution of the puzzle stated by the Chevalier de Méré is in modern probability language. If you forgot what the problem
was, go back to the beginning of this section and read it again. To make
things simpler, let us assume that Andrew (player A) always bets heads
and Barney (player B) always chooses tails. We will also assume that
Andrew is the one that won 3 games, while Barney won only 1. The only
way for Barney to win is if the next 3 results turn out to be tails. If we list
all the possible results for the rest of the game, we end up with

HHH, HHT, HTH, HTT, THH, THT, TTH, TTT,

where H stands for heads and T for tails.
By doing this we can easily see that the total number of
possibilities is 8 and, out of that, only 1 possibility will give Barney his
desired victory. Because they are friends and trust each other, they have
no problems agreeing that the coin should be fair and that the 8
possibilities above should have the same probability. Therefore, if Barney
has a chance of winning in only 1 of the 8 possible scenarios, they happily
agree that Barney should receive 1/8 of the total prize, while Andrew keeps
the other 7/8 of it. This is the essence of the solution found by Pascal and
Fermat. In fact, the way I described it above, considering all possible
combinations of results, was actually due to Fermat. If Shakespeare's famous quote "Kill all the lawyers!" were somehow realised, probability theory, and many other areas of mathematics, might be much less developed right now, as Fermat was actually an amateur mathematician. His true profession was that of a lawyer.
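Fermat's enumeration can be reproduced directly. The sketch below (Python) lists all 8 equally likely continuations of the interrupted game and counts the only one in which Barney wins:

```python
from itertools import product

# Andrew (heads) needs 1 more win; Barney (tails) needs 3 more.
# Fermat's method: enumerate every continuation of 3 further tosses.
outcomes = list(product("HT", repeat=3))

# Barney wins only if no heads ever appears, i.e. the sequence is TTT.
barney_wins = sum(1 for seq in outcomes if "H" not in seq)
print(barney_wins, "out of", len(outcomes))  # 1 out of 8

barney_share = barney_wins / len(outcomes)
print(barney_share)  # 0.125, i.e. Barney's fair share is 1/8 of the prize
```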
Explained this way, the problem seems so trivial that it is almost
unbelievable it took the efforts of two of the greatest mathematicians in
history to find its solution. However, you must remember that those were
the beginnings of probability theory and the ideas involved were not well understood at that time. Besides, once you already know the answer to something, it all looks simple to you. That is good advice to keep in mind when you are writing a test for your students...
One fact is worth noting for our future discussions. James Bernoulli's Ars Conjectandi, published in 1713, showed that the frequency method and the classical method were indeed consistent. The interesting thing about James Bernoulli is that he was also called by the names Jakob (or Jacob) and Jacques. So, you might see the above book attributed to three or four different Bernoullis, who are in fact all the same person.
Leibniz, the famous philosopher who shared the discovery of
calculus with Newton and earned the enmity of the latter for the rest of his
life because of that, published a dissertation in 1666 about combinatorics
called Dissertatio de Arte Combinatoria.
The term combinatorics is used to describe a branch of
mathematics concerned with counting the total number of elements in a
given set. The name comes from the fact that it was initially concerned
with calculating how many combinations one could find by grouping a
certain number of objects. For instance, consider that you have three balls,
one RED, one GREEN and one BLUE. How many combinations of two balls
out of these three can you get? The total is easily found as being 3:
RED+GREEN, RED+BLUE and GREEN+BLUE. This is an easy case that can be
solved simply by listing all possibilities. However, how would you calculate
the number of combinations if you have 100 different balls and need to
arrange them in groups of 5?
The importance of combinatorics for probabilities is that many
times the possibilities of an experiment are obtained by combinations. One
well known example is the lottery. A usual version is that in which there are
100 numbers and you need to predict a combination of 5 of them to earn
the prize. Because all numbers have the same probability of being chosen
in a draw, all combinations should have the same probability too by
symmetry. Therefore, the probability of each combination, by the classical
method, should be 1 divided by the number of combinations. To
discourage you, the number of combinations is 75 287 520, which means
that the odds for you to win this particular version of the lottery by
choosing one combination is 1 in approximately 76 million.
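Nowadays the counting itself is a single library call. The sketch below (Python; math.comb computes the number of combinations) checks both the three-ball example and the lottery figure quoted above:

```python
from math import comb  # number of ways to choose k items out of n

# Three balls, choose 2: RED+GREEN, RED+BLUE, GREEN+BLUE
print(comb(3, 2))     # 3

# The lottery from the text: choose 5 numbers out of 100
print(comb(100, 5))   # 75287520, so odds of about 1 in 76 million
```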
A curious fact about Leibniz's book concerns the notation he suggested for combinations. Calculus today is often written using Leibniz's notation, which is more modern and even more powerful than the one suggested by Newton. So, it is not a surprise that he tried to introduce a new notation for combinations as well. From the point of view of Internet notation, Leibniz can be seen as four centuries ahead of his time. He suggested to use, instead of the word combination, the term com2nation for all arrangements of two objects, con3nation for three objects and so on. Our lottery guess would be a com5nation. If only he lived in today's world, maybe it would have been a huge success, but at that time, it was not (Todhunter, 1865).
Many other great mathematicians wrote treatises about
probability. Great names like Huygens, Leibniz, Daniel Bernoulli (not
Jacob!), Montmort, de Moivre, Euler, d'Alembert and Newton all
contributed their bit to the development of this area of human knowledge.
The list is very extensive, but the focus continued to be mainly on games of
chance, with occasional excursions on insurance and demographics
problems until Laplace took the stage.
Pierre-Simon Laplace was a prominent member of the great French mathematical school of the 19th century. He was also an astronomer. In fact, he and the English scientist John Michell were the first ones to propose that there might be stars so dense that the escape velocity from their gravitational fields would be greater than the speed of light. Because light coming from the surface of these bodies would not be able to escape to outer space, they would appear as black balls to an external observer. They called them black stars. Of course, at that time, the accepted theory of gravity did not predict that light could be attracted by massive bodies. More than a century later, though, Einstein suggested that light would indeed be affected by gravity. It did not take long for the idea to make a huge return in the form we know today: the black hole.
Among many works, one of the most important achievements of Laplace, the most important for us at least, was his rediscovery of the work of an English clergyman and mathematician from the 18th century called, as you might expect, the Reverend Thomas Bayes.
Thomas Bayes (1701-1761)

Pierre-Simon Laplace (1749-1827)
Two years after Bayes died, in 1763, one of his works, called An Essay Towards Solving a Problem in the Doctrine of Chances, was presented by Richard Price to the Royal Society and then published. This problem in
the doctrine of chances, doctrine of chances being the old name for
probability theory, was something that, in modern language, we would call
an inverse problem on probabilities. The direct problem in probability
theory is to predict the outcome of an experiment given the probabilities of
each one of the possible results, while the inverse problem consists in
predicting the probabilities themselves given the observed outcomes.
Price, a philosopher and preacher, thought that Bayes's work helped prove the existence of God. On that matter, he was certainly wrong.
What the paper actually contains is a formula for finding the
posterior probability of a certain hypothesis concerning an experiment
once two other pieces of information are given to you: (1) the prior
probability of the hypothesis and (2) a set of observed outcomes of that
experiment. This is, if you remember, our program BAYES.
A hypothesis about an experiment is some feature or assumption
you attribute to it. When we consider fair dice rolling, we assume a
symmetry hypothesis. If you remember our example of the loaded d10,
which would always give 1, the symmetry hypothesis would lead us to think
that all the faces would have the same probability. Because the observed outcomes were always 1, which was in disagreement with that hypothesis, we had to decrease the probability that the symmetry hypothesis was true.
This is a subtle point and one where even I get confused sometimes. The probability of the symmetry hypothesis is one thing; the probability of the faces of the dice is something completely different. Do not worry if you do not get this right now; we will go through all of these concepts many times and very slowly throughout this book. The first time I wrote the previous paragraph, for instance, I started to talk about the probabilities of the faces as being the probability of the hypothesis and had to erase everything.
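The distinction between the probability of a hypothesis and the probabilities of the faces can be made concrete with a toy version of the loaded-d10 story. The sketch below (Python; the two competing hypotheses and the 50/50 prior are my simplifying assumptions, not anything from the original story) updates the probability of the symmetry hypothesis with Bayes' theorem after each observed 1:

```python
from fractions import Fraction

# Two hypotheses about a d10: "fair" (each face has probability 1/10)
# and "loaded" (the die always shows 1). Prior: both equally plausible.
posterior = {"fair": Fraction(1, 2), "loaded": Fraction(1, 2)}


def likelihood(hypothesis, outcome):
    """Probability of seeing this outcome IF the hypothesis is true."""
    if hypothesis == "fair":
        return Fraction(1, 10)
    return Fraction(1) if outcome == 1 else Fraction(0)


def update(posterior, outcome):
    """One step of Bayes' theorem: posterior is proportional to likelihood times prior."""
    unnorm = {h: likelihood(h, outcome) * p for h, p in posterior.items()}
    total = sum(unnorm.values())
    return {h: p / total for h, p in unnorm.items()}


# Observe three 1s in a row: P(fair) drops to 1/11, then 1/101, then 1/1001.
for outcome in (1, 1, 1):
    posterior = update(posterior, outcome)
    print(posterior)
```

Notice that the faces' probabilities (1/10 each, or always 1) never change inside each hypothesis; what changes is how much we believe each hypothesis.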
Bayes's original work was, once again, a study concerned with games of chance. Laplace, however, had more heavenly preoccupations on
his mind. He wanted to predict properties not of games of chance, but of
celestial bodies. In fact, what Laplace wanted to do was really revolutionary
for his time. His idea was to collect data from different sources to estimate
a property which was not intrinsically random! He did that, for instance, to
estimate the mass of the planet Saturn.
Let us highlight here the novelty of the idea, because it is a stroke of genius. It is one thing to estimate a random property, like the next number to turn up in a dice throw. It is clearly going to be different each time (given a series of assumptions, as we have already seen). But the mass of a planet, especially such a huge one, is not supposed to vary significantly from time to time in a random way. Not within the relevant precision, anyway. Once the planet's mass is defined, it is going to stay that way, for some time at least. Therefore, who would use probability theory to estimate it? How could one do that? It does not even make sense.
But it did. What Laplace noticed is that the source of randomness
in the problem was not in the mass of Saturn, but in the errors of
measurements. When I was in my first undergraduate year in physics, we
had a laboratory task that required us simply to measure a set of plastic
pieces hundreds of times. The measurements were made with a very
precise tool and, guess what, within the precision of the tool each time we
would get a different number. That is because we could never measure the
piece from the exact same angle, in the exact same way. The shapes were
not changing randomly, but small uncontrollable variations, caused by our inability to exactly reproduce the position of the measuring device, produced a kind of noise (remember this word) in our measurements.
If this could happen with small plastic objects that could fit in the
palm of our hands, imagine how much worse this can be in the case of
measuring the mass of something as distant as Saturn. What was random
was not the actual mass of the planet, but, in the same way as the plastic
shapes I once measured, it was its measured mass which was influenced by
random errors. In fact, the measurements were not even of Saturn's mass directly; they were of its position in the sky, which, together with the equations of celestial motion, would give the planet's mass in a very indirect (and error-prone) way.
The whole idea then was the following. By putting together all the information we know about celestial mechanics given by the Newtonian theory of gravity (prior information) and the measurements made by several different observatories (experimental observations), we could calculate a posterior probability for the hypothesis that the value of the mass of Saturn was such and such! Once we have this probability, we can
then calculate an average mass and the uncertainty in this prediction. We
have not talked about this last part yet, but for now, just be aware that we
can actually do that with almost every probability of interest in science.
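The effect of averaging away measurement noise can be simulated in a few lines. The sketch below (Python; the noise level, the number of observatories and the seed are invented for illustration, and a "true" value of roughly 95 Earth masses stands in for Saturn's mass) produces an estimate together with its uncertainty:

```python
import random
import statistics

random.seed(2)  # fixed seed for reproducibility

TRUE_MASS = 95.16  # stand-in "true" value, in Earth masses (illustrative)

# Each observatory measures the same fixed quantity but adds its own
# random error: the randomness lives in the errors, not in the mass.
measurements = [TRUE_MASS + random.gauss(0, 2.0) for _ in range(50)]

estimate = statistics.mean(measurements)
# Standard error of the mean: the uncertainty shrinks as sqrt(N) grows.
uncertainty = statistics.stdev(measurements) / len(measurements) ** 0.5
print(f"estimate = {estimate:.2f} +/- {uncertainty:.2f}")
```

Under the common assumption of Gaussian noise, the average of the measurements is exactly the centre of the posterior probability, which is why pooling many noisy observations recovers the fixed underlying value so well.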
The formula that Laplace used was the same that Bayes discovered in his work. Today it is known as Bayes' Theorem. Laplace's estimate of Saturn's mass using this method was so accurate that the improvement on his estimate up to this date is of only 0.63% (Sivia and Skilling, 2006). That was a revolution, but one obscured by the two other revolutions that followed: relativity and quantum mechanics. Only today, more than one hundred years later, have Laplace's ideas received the credit they are due.
After Laplace released probability theory from the shackles of gambling and monetary applications, he used it to solve problems in every other area he could think of, from law to medicine. Although Laplace's Bayesian methods were vigorously opposed by many in his time (and by a dwindling few even today), probability theory has become part of virtually every area of science since then.
In the years that followed, physicists like Boltzmann, Maxwell and Einstein would use it to prove the existence of atoms. In an even more revolutionary twist, the probabilistic character of the laws of nature was mercilessly imposed upon us by the discovery of quantum mechanics, showing that probability was built into a much deeper level of our universe.
Biology, psychology, medicine, geology. All sciences today rely on
probabilistic estimates to find correlations that allow one to make
predictions of phenomena either too complicated to be modelled in detail
or too difficult to be controlled with the necessary precision.
Even the foundations of what we understand today as being science are deeply rooted in probability theory. But, like every deep subject, this one is still full of controversy.
Before we get there, though, we have to endure a long but exciting
journey through some of the most intriguing questions and discoveries of
science and mathematics involving the very nature of what can and what
cannot be predicted about the universe. We will spend the next chapters in
this journey, trying to understand deep issues about the limits that nature
imposes on our ability to describe it. Welcome to the realm of the random
and the uncertain.

Making Sense of Randomness
We spent the largest part of the previous chapter revisiting the history of
probability theory and how a large part of it was connected to gambling, at
least until Laplace started to apply it elsewhere. But this connection with
gambling is not a simple freak accident. A moment of reflection is enough to convince oneself that casinos are, after all and to Laplace's dismay, the most convenient places to see probabilities in action. Inside them,
thousands of daily repetitive experiments are carried out and person after
person, at practically every second, tries to infer what would be the next
result in one of them.
Those experiments (rolling dice, tossing coins, spinning the roulette) have one characteristic in common: nobody can acquire enough information to exactly predict their result. We call experiments like these by the name random experiments. Although there are sophisticated
methods that can be devised to generate a profit in the long run, if a dice is
rolled with enough strength, nobody will be able to measure every single
piece of information with the necessary precision to always give the right
answer for each roll.
It is not difficult to accept that some things in the world are
random in the sense that the result is difficult to be predicted in advance,
but the property of randomness is a very tricky one to define and can be
extremely deceptive when one starts to think too much about it. In this
sense, it is a property similar to life or consciousness.
In most practical situations it is not that difficult to say that
something is or is not random, but when you try to define it precisely, or
when you are faced with some particular borderline cases, you get into
trouble. It is like trying to decide if a virus or a prion (a kind of rogue replicating protein, the most famous of which causes the mad cow disease) is alive. Or even if a computer virus is. We would like it not to be,
but when we try to precisely characterise what is alive or not, we can
always find a loophole in the definition in such a way that something which
we would not like to be alive will satisfy all the criteria and, sometimes,
things that should be will not.

In order: The Bird Flu virus (computer model), the Mad Cow disease prion (computer
model) and the computer virus Blaster Worm. They all can live, spread and reproduce in
their appropriate environments.
Are they alive?
Suppose, for instance, that you are in Las Vegas and decide to bet your money on one of the numbers of the roulette. Classical physics, the laws discovered by Newton before quantum mechanics, are the ones that
govern the movement both of the ball and of the roulette within the
precision required. The modifications in the required calculations due to
quantum mechanics have negligible effects in this case, no matter what
some gurus might tell you.
The equations of classical physics state that if you know the
position and the velocity of an object at any time with enough precision,
you can predict its future arbitrarily well. Actually, that is what you would read in most popular accounts of physics, but that is not quite true. You
also need to know with a good precision all the constraints that affect your
problem. This means that you have to know things like, for instance, the
shape and the material of the table on which you will roll your dice. This
kind of information constrains the possible ways the dice will bounce and
move around. If you roll your dice inside a round bowl, that will be different from rolling them inside a square box.
The main problem is that you will never be able to gather all the
necessary information to predict the required result, although, in principle,
classical physics states that you could. I emphasised in principle because
that is something you read a lot in physics texts. It means something that is
not forbidden by the known laws of physics, even if, in practice, that thing
is so difficult that nobody will really do that.
In the coin tossing or the dice rolling from the previous chapters,
we have seen that we need to know details of the geometry and density of
these objects as well as what was going to be the velocity with which they
would leave the hands of the person who was throwing them. And also, we
need to know the point where they leave the hand. Oh, and the velocity,
density and temperature of the air around it. Yes, I almost forgot, and
also... Classical physics does not forbid you from measuring all of that, but unless
you are in a very controlled environment, with high precision equipment,
you will not be able to do that.
In a situation like this, randomness crawls in the experiment
because of our lack of complete information about all the variables
involved. If we could use all the resources of a very advanced laboratory, all our knowledge about classical physics and as many supercomputers as we needed, we could in principle measure things with precision enough to give very good predictions. But the requirements would be so enormous that we would probably have to run an entire universe full of computers for billions of years to get the prediction. If you think that this is worth one prediction, think again. And do not stop thinking until you change your mind.
By the above arguments, it seems tempting to at least recognise a
quality that we can call practical randomness. If true randomness
(whatever that means) is elusive, its practical version can be associated simply with our incapacity to predict, within a certain precision, the results of an experiment. This accommodates situations like the ones we saw above, in which the experiment is predictable in principle, but not in practice.
As with any definition, though, we need to be very specific about its requirements. When we talk about prediction in systems that are continuously evolving in time, for instance, we should specify whether we are interested in predicting the long term or the short term behaviour of a system. This kind of difference can have serious consequences for some particularly troublesome systems, especially when we are interested in long term predictions. A good example is provided by the famous chaotic systems, which deserve a closer look.
Chaos (and Mayhem)
Chaotic systems (Gleick, 1988) became famous worldwide in the late 80s due to a series of mathematical breakthroughs that allowed a better understanding of their behaviour. In particular, the advancement of computer technology was one of the decisive elements in these breakthroughs.
Many physical systems can have their time evolution predicted by simple equations. For instance, a car moving with speed 80 km/h and starting its journey at the kilometre 20 of a motorway can have its position predicted at any moment of time t by the formula

x(t) = 20 + 80t.

For a time t given in hours, x(t) provides the position of the car in kilometres. After 2 hours of trip, the car can be found at kilometre 180 if its velocity stays constant. We call numbers like the starting position of the car initial conditions. They are the numbers we need to provide to the equations to start the evolution of our systems. Different initial
conditions result in different evolution histories for the systems. If the car
starts at kilometre 100, then its whole trip will be different, passing
through different cities at the times given by the driver's watch.
But that is not the whole story. Suppose you do not know for sure
the starting point. Suppose you think that the car started its journey at
kilometre 18 instead of 20. If the car keeps going for 20 hours, the correct
position would be kilometre 1620, but you would predict the car to be at
kilometre 1618. The point is, no matter how far in time you try to
predict, your error will always be 2 kilometres. It will not be worse than
that. If you miss the starting point by, let's say, 2 metres instead of 2
kilometres, you have to agree that your predictions will be so good for the
car's position that the error will hardly be felt.
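This constancy is easy to check numerically. A minimal sketch in Python, using the numbers from the example above:

```python
# Linear motion: position after t hours for a car moving at constant speed.
def position(start_km, speed_kmh, t_hours):
    return start_km + speed_kmh * t_hours

# Compare the true trajectory (start at km 20) with a prediction
# based on a wrong initial condition (km 18).
for t in [1, 5, 20]:
    true_pos = position(20, 80, t)
    predicted = position(18, 80, t)
    print(t, true_pos, predicted, true_pos - predicted)  # error is always 2 km
```

The error column never grows: in a linear system, initial errors are simply carried along unchanged.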
What happens then with random errors in the measurement of the
initial position of the car? They will result in random errors in your
prediction which will make it slightly wrong. However, these errors will not
increase with time and very little predictability will be lost. Systems like
these go by the name of linear systems. Chaotic systems are different.
The equations describing chaotic systems are so sensitive to the
initial conditions you use to start them that, if we try to predict their
evolution using an initial value which is wrong by just a tiny amount (for
some this can be as small as one part in one million), after some time our
prediction is so different from the actual result that our calculations are of
no value at all.
You probably read about the connection of chaotic systems with
those beautiful pictures of geometric structures known as fractals. They
are sets of points that are generated by chaotic systems following some
special procedures. The most famous of them is probably the Mandelbrot
Set, discovered by the French/American mathematician Benoit Mandelbrot
in 1979. It is a picture with an infinite level of self-similar detail, which
means that if you zoom in enough, you will find the initial set repeated
over and over again.
Different levels of detail of the Mandelbrot Set.
Note how the initial shape appears again in each one of those levels.

By looking at fractals one might imagine that the equations needed
to create them are extremely complex. It might be a shock to know that
they are not. Many of them are the result of extremely simple equations
that use nothing more complex than additions and multiplications. The
Mandelbrot Set itself is generated by an extremely simple equation but, as
it involves calculating with complex numbers, I will instead use as an
example another equation which is even simpler: the logistic map.
This equation describes a system characterised by a single real
number at integer values of time. This number is given by the variable x_t,
in which t is the symbol for time and assumes the values 0, 1, 2, 3, 4 and so
on. Other values of time are not allowed in this equation, which is then
written as

x_{t+1} = r x_t (1 − x_t).

The variable r is an arbitrary real number which is fixed for each
system. What is done is to choose a value for r, give an initial value for x_0
and iterate this equation to get what can be called the trajectory of
the system, which is the sequence of values assumed by x_t as time
increases. It is basically the position in which we would find the car of the
previous example and that is the reason for the name trajectory. We call
this a map because it maps a certain value x_t at a time t into a new
value x_{t+1} at time t + 1.
The first thing you have to notice is that if we start the system with
x_0 = 0, all other subsequent values will also be zero. This is called a fixed
point of the equation and it is not the only one. For instance, choosing
r = 2 and x_0 = 1/2 you have another one, as 2 × (1/2) × (1/2) = 1/2 and the
value never changes. To get a more interesting trajectory, let us then
choose r = 2 and x_0 = 2. Then we get

x_1 = 2 × 2 × (1 − 2) = −4,
x_2 = 2 × (−4) × (1 + 4) = −40,
x_3 = 2 × (−40) × (1 + 40) = −3280,

and so on. The trajectory would be given by the following set

{2, −4, −40, −3280, ...},

which blows up very fast as you can see. You can also appreciate that there
is nothing random about this trajectory. Given the initial value and the
parameter r, one can predict with 100% certainty the value of x_t at any
point in time. The simplicity of this equation, though, hides a darker side.
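The iteration is easy to reproduce in a few lines of Python (the values r = 2 and an initial value of 2 are an illustrative choice; any starting point outside the interval from 0 to 1 blows up in a similar way):

```python
def logistic_trajectory(r, x0, steps):
    """Iterate the logistic map x_{t+1} = r * x_t * (1 - x_t)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

print(logistic_trajectory(2.0, 0.5, 3))  # a fixed point: [0.5, 0.5, 0.5, 0.5]
print(logistic_trajectory(2.0, 2.0, 3))  # blows up: [2.0, -4.0, -40.0, -3280.0]
```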
The equation can also be written as

x_{t+1} = r x_t − r x_t².

Because the last term of the equation contains the square of x_t, we
cannot say this is a linear system anymore; we call it a quadratic map.
When squares appear in the equations, they can bring chaos (and
mayhem). To see this, look at the two pictures below. They are graphs
showing the value of the trajectories on the vertical axis and time on the
horizontal axis of the logistic map with two different, but very close, initial
conditions.
The initial values for x_0 are given in the captions of the graphs and
differ by only 10%. The parameter r is chosen to be equal to 3.7 in both
cases. This choice requires an explanation. The logistic map shows chaos,
but not for all values of r. Some values will give very repetitive behaviour
and will lead to systems that are extremely easy to predict. There is a magic number,
though, which is approximately 3.56995, above which almost all values of r
will lead to chaotic behaviour. Did you notice the 'approximately' and the
'almost' in the previous sentence? Yes, even this feature of chaotic systems
is difficult to predict.
At first sight, one might think that there is some regularity in the
graphs. In a sense, there is. The amplitude of the oscillations seems to
decrease and increase almost regularly. Almost. There is no specific
frequency and no way to predict with 100% certainty when these pseudo-regularities
will appear, nor their exact amplitude, unless you know
the exact initial value. The graphs are definitely not completely random,
but they are also not completely regular and surely not predictable.
If you think about these graphs as representing real trajectories, as
if the systems were actually cars on a road, the following graph shows the
distance between the two systems with time.

Once again, there seems to be some regularity but, after some
time, one cannot predict what the distance between the two systems will be.
If you pay attention to the vertical scale, you will see that the
distance oscillates between zero and 0.6, which is almost as large as the
position of each system, which varies approximately between 0 and 0.9.
This shows that if we are trying to predict the position of the first system,
but we have a slightly wrong initial condition (in this case only 10% wrong),
then after some time we simply cannot predict it even approximately!
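The divergence is simple to reproduce. A sketch, iterating the map from two initial values that differ by 10%, at the same r = 3.7 used above:

```python
def logistic_step(r, x):
    return r * x * (1 - x)

r = 3.7            # chaotic regime of the logistic map
x, y = 0.5, 0.55   # two initial conditions differing by 10%
for t in range(60):
    x, y = logistic_step(r, x), logistic_step(r, y)

# Both trajectories stay bounded between 0 and 1, yet after 60 steps
# they bear no useful resemblance to each other.
print(x, y, abs(x - y))
```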
It is important to notice that not all is lost here. By the above
argument, one might think that we should abandon all hope of keeping
track of chaotic systems. That is not true. One of the most common
chaotic systems is a set of three celestial bodies. However, we
have to deal with this kind of system every day when working with
satellites, astronomy and so on. The point is that we can not only allow for
some uncertainty, but we can also keep correcting our predictions
continually. For instance, in the example of the two systems above, we can
measure the position of the actual car at regular intervals of
time and correct our predictions. Because predictions take some time to
deteriorate too much, if we correct them fast enough we can keep the
system under a certain control.
Not all chaotic systems are the same. Some are more complicated
than others and they lose predictability at different paces. Some seem
more random and others less. The rate at which the prediction
deteriorates in a chaotic system is measured by quantities called Lyapunov
exponents, and they vary from system to system.
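For the logistic map, a standard numerical recipe estimates the Lyapunov exponent by averaging, along a trajectory, the logarithm of how much each step stretches nearby points. A sketch (the parameter values are illustrative):

```python
import math

def lyapunov_logistic(r, x0=0.3, n=100_000, burn=1_000):
    """Estimate the Lyapunov exponent of the logistic map by averaging
    log|f'(x_t)| = log|r * (1 - 2*x_t)| along a trajectory."""
    x = x0
    for _ in range(burn):              # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r * (1 - 2 * x)))
        x = r * x * (1 - x)
    return total / n

print(lyapunov_logistic(3.7))  # positive: chaotic, predictions decay
print(lyapunov_logistic(3.2))  # negative: periodic, easy to predict
```

A positive exponent means initial errors grow exponentially with time; a negative one means they shrink.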
Also, the amount of predictability lost depends on the precision
you are aiming for. The weather, for instance, is a chaotic system, but you
can have some predictability depending on what you want to measure. If
you want to know the exact temperature at some time, that will be almost
impossible, but you can still identify seasons very well in different parts of
the world.
Chaos is something that is present in many systems in nature. It is
very common in fluids, in which it is associated with turbulence. Whenever
a fluid becomes turbulent, we lose predictability concerning the way it is
going to flow. Because of its common presence in science, there are many
methods that allow us to deal with it. Although systems like these are
always on the verge of randomness, we can still extract some information
from them.
Another example of systems which are very difficult to predict is provided by those
said to possess Self-Organised Criticality, or SOC for short (Bak, 1996). SOC
systems live in a very delicate equilibrium between order and disorder and
some of their behaviour is so difficult to predict that even probabilities can
be tricky.
One of the prototypical phenomena associated with SOC is that of
earthquakes. In order to understand what the problem really is, let us first
talk about a physical concept called a typical scale.
Think about the following question: where does the division
between biology and chemistry lie? There is, of course, no exact answer, as
there are a lot of chemical phenomena happening in biological systems, but
one can still identify the difference between a biologist and a chemist in
the typical case. The difference is in the typical scale of the phenomena
each one studies.
Because biology is so extensive, let us focus on zoology. If you are a
zoologist, you will rarely study objects of the size of a molecule. Not that
you won't, but the typical object of your study will be animals and their
organs, which can be thought of as being in the scale of metres. Ask chemists
if they use this unit of length very often and you will hear them laughing.
Their objects of study are even smaller than a nanometre, which is a
billionth of a metre. This is a huge difference of scales and, in some sense,
these typical scales are enough to characterise different areas of research
or even whole disciplines.
One of the main advantages of the existence of typical scales in
most phenomena is related to predictability. If you are a chemist and some
prediction about the size of a molecule gives you a number which is in the
scale of metres, then you either found a Nobel Prize winning phenomenon
or you must have committed a serious mistake. In both cases, something
very odd is happening.
The existence and usefulness of scales is terribly neglected in the
usual school education. A result of this is, for instance, students finding a
completely nonsensical result in a physics exercise and simply accepting it
without any questioning. I remember a story told to me by my physics
teacher in high school in which a student calculated the size of a
bathroom pipe during an exam and found a diameter of 30 metres (THIRTY
METRES!). The student did not even consider that this could be a wrong result.
Scientists use scales all the time to make rough estimates before
rushing to calculate things exactly. This always gives an initial idea of what
one can expect of the final result. The ability to estimate things by using
knowledge about the relevant scales in a problem is fundamental for
professional scientists, but also very useful for any person. If you are
interested, I would suggest a fantastic little book with the odd name
How Many Licks? Or, How to Estimate Damn Near Anything by Aaron
Santos (Santos, 2009).
I have been talking about size scales all this time, but virtually all
quantities in science will have a typical scale for certain phenomena. The
amount of energy, for instance, characterises the difference between
everyday physics and particle physics, the latter involving higher energy
scales, which you can appreciate by the huge sizes of the particle
accelerators.
Some phenomena, however, do not present this separation of
scales. They live in a special kind of state called criticality, where all scales
contribute equally to it. Usually, criticality needs to be induced and
controlled in physical systems to survive for long times. However, there are
some special systems that evolve naturally to a critical state and stay there
by themselves. These are the systems that present self-organised criticality.
What does all this have to do with earthquakes? We all know the
Richter scale of earthquakes. This scale measures the amount of energy of
a certain earthquake. Here comes SOC. It turns out that there is no typical
scale for the energy liberated by an earthquake! This has the gloomy
consequence that one can never predict the typical size of earthquakes
before they actually happen. This is not a failure of technology or science, it
is a feature of their own nature.
In this sense, predictability becomes lost in a worse way than in
chaotic systems. While we can still talk about scales in many cases where
chaos is present, this does not work anymore in SOC systems. Even
probabilistic statements become difficult as, in the case of the size of
earthquakes for instance, all scales become equally probable.
Once again, if you knew all the initial conditions and had a
full description of the earthquake phenomenon, you could in principle
predict everything about them. The practical randomness comes once
more from the practical impossibility of gathering the necessary
information. Will we never find true randomness?
In the back of your mind, a kind of discomfort should be growing right now.
If one cannot have a truly random physical system, how can we trust that
things like the lottery numbers are really being drawn fairly? If
you ignore the fact that people always find a way to cheat, as long as the
results are random enough, the difference between true and false
randomness should not be a big concern in many practical cases.
It is very common in science today to run computer simulations.
These simulations require, most of the time, the generation of random
numbers by a computer. But computers are physical systems whose
behaviour is designed to be predictable. How can they generate random
numbers?

The answer is that computers cannot generate true random
numbers by their very nature. What they do is to generate something that
resembles a random sequence of numbers. These are then appropriately
called pseudorandom numbers. How? Well, that is a good question. You
need to be very ingenious to do that.
If you remember our discussion of chaotic systems, you must now
be thinking that using them could be a good idea. However, if you look
again at the graphs of the logistic map, you will see that there is some
limited predictability in the way that the amplitude of the oscillations
increases and decreases almost periodically. In scientific applications, one
would like to have something even more unpredictable than that.
Creating a computer algorithm that generates pseudorandom
numbers is as much an art as it is a science. There are many algorithms
today, each one with their own disadvantages. Still, for most applications,
we actually have very good generators, meaning that any attempt to find
any kind of pattern in the generated sequence usually fails (Press, 1992).
There are catches, though. Firstly, random numbers are generated much
like the logistic map in the sense that you need to start the generator with
some initial value. This value is usually called a seed. If you start a
pseudorandom generator with the same seed twice, you will get exactly
the same sequence twice. Secondly, most current generators are not even
chaotic; they repeat themselves after some time. This time is called the
period, but it is usually designed to be so large that for all practical
purposes it does not really matter.
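The first catch is easy to see with Python's standard random module:

```python
import random

# Two generators started with the same seed...
gen1 = random.Random(42)
gen2 = random.Random(42)

seq1 = [gen1.randint(0, 9) for _ in range(10)]
seq2 = [gen2.randint(0, 9) for _ in range(10)]

# ...produce exactly the same "random" sequence.
print(seq1 == seq2)  # True
```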
Pseudorandom numbers are as close as we can get to generating
randomness with our present computers. In fact, if the universe were
governed by the equations of classical physics, that would be the best we
could do. There would be no true randomness. As it turns out, though, nature is
much more interesting and tricky.
We are getting there, but before we find the holy grail of true randomness,
let us give its characterisation a bit more thought. Although practical
predictability seems a good way to characterise randomness, we have
actually been overlooking a lot of details.
First of all, when we associated randomness with predictability,
this implied that we were always analysing a sequence of numbers in time.
According to this, there is no sense in asking a question like: 'Is the number
9 random?' A question that does not make sense within
some framework is called, in mathematics, an ill-posed question. This
means that there is no correct way to answer it simply because it does not
fit in the scheme we have at our hands.
The fact is that there is no definitive answer to the question of the
absolute randomness of something, but there are methods that can be
applied as long as one accepts some desired characteristics that may even
vary according to the situation. We can say that, in some sense,
predictability is the best guiding criterion, but we need a bit of sophistication
to generalise it and enlarge its scope of application.
Consider the following two numbers:
111111111 and 498762839
As numbers, they are no more random than the number 9 or the
number 100, but one cannot avoid, by looking at them, the feeling that
the second one should be considered more random than the first. We
could argue that we are, inadvertently, reading them as a sequence of
digits in time and attaching, once again, the idea of predictability of the
next digit. Well, that is exactly what it is. But suppose we write the same
numbers above as
111 498
111 762
111 839
The second group still looks more random than the first even if we
are now looking at a two-dimensional array instead of a sequence in time.
The fact is that there is still predictability in the sense that if one of the
digits is erased, we can guess the missing digit much better in the first
group than in the second one.
This seems to indicate that we might be able to make sense of the
concept of a predictable object somehow. The key is to associate
predictability with a repeating pattern.
Consider, instead of numbers, the following sequence of symbols

ABABABABAB

It does not look very random, right? In fact, you can see that it is
only the two symbols A and B repeated 5 times in sequence. It is highly
predictable. If we wanted to, we could write this sequence as

5(AB)
with an obvious meaning and a visible economy of characters. What we did
above can be seen as compressing the initial sequence. As a general rule,
whenever we can find regularities in a sequence or any other kind of
object, we can use them to write a description that is itself smaller than the
original object. In other words, regularities allow for compression.
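You can watch this happen with any general-purpose compressor. The sketch below uses Python's zlib to compress a perfectly regular string and a pseudorandom one of the same length; only the first one shrinks appreciably:

```python
import random
import zlib

regular = ("12" * 500).encode()  # 1000 bytes of a repeating pattern
rng = random.Random(0)
irregular = bytes(rng.randrange(256) for _ in range(1000))  # 1000 pseudorandom bytes

print(len(zlib.compress(regular)))    # tiny: the regularity is exploited
print(len(zlib.compress(irregular)))  # around 1000: nothing to exploit
```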
The idea is that, in principle, random objects should have no
regularities. That is because regularity brings predictability. This
predictability does not need to be in time, but can also be in space. As
another example, think about a circle. There are not many things more
regular than a circle. If I tell you that I am going to draw a circle and I
present you the following figure
you can easily complete it to get the full circle. You can predict the rest of
the picture. How can you compress a circle? Easily. Every circle can be
described by a symbol and a number, namely, the circle's radius.
Note that the circle is something extremely symmetric. Is that a
general rule? Yes. The more symmetric something is, the more
compressible is its description. Symmetry appears here once again, but this
time the role is different. In our previous encounter, symmetry was
responsible for our lack of reason to choose one result over another,
generating randomness. Here, it is the opposite. Symmetry decreases
randomness by increasing predictability. This shows that you must be very
careful when applying some concepts and think deeply about how they fit in
each different situation.
The logical consequence of the ideas above is then the proposal
that a true random object should be one whose description cannot be
compressed. In other words, the smallest description of a random object
should be the object itself. Can we measure that more precisely? The
answer is an almost yes.
During the 60s, two great mathematicians introduced the idea of
measuring the complexity of an object by the size of the
smallest computer program that could generate that object. One of them
was Ray Solomonoff, from the USA, and the other was the Russian mathematician
Andrey Kolmogorov, about whom we will hear more later.
Although Solomonoff published first, in 1960, while Kolmogorov
published his results in 1965, the measure ended up being known as
Kolmogorov Complexity (Li, 1997), or KC for short.
The choice of a computer program should not be seen as
something too fundamental here. It is just a way of characterising
a description mathematically. It has been proved that the exact language in
which the program is written is not important, which means that we can
also use any normal language to describe our objects and the result will be
essentially the same.

Finally, the idea of a truly random object is then equivalent to an
object whose KC is that of a program composed simply of the
instruction 'Print the object ...'. In other words, no program that generates
the object can be smaller than the object itself.
KC is a very powerful concept that leads to many
rigorous results but, as happens with anything else when we try to study
matters concerning complexity, computation and randomness, there is a
catch and, in this case, a big one. It can be proved that KC is what we call
incomputable. This means that there is no program capable of calculating
the KC of an arbitrary object fed to it. However, we will see soon that there is a
concept very close to KC that comes to the rescue and that will be
enough for us. It comes from an unexpected place, thermodynamics, and
we will call it by the name entropy. But we will only be able to understand
it if we learn a bit more about probability, so you will have to be a bit more
patient.
It seems that we made some progress, right? The more we can compress
an object, the less random it is. Therefore, a completely random object is
one that follows no pattern at all. That should be easy to recognise, right?
Err... not really.
There is a nice discussion about our perception powers and how
our senses help us to recognise patterns in objects in Stephen Wolfram's
book about cellular automata (Wolfram, 2001). A cellular automaton is a
mathematical structure composed of a grid of cells, each of which can
assume one of several colours. The simplest case is that of two colours. The cells
change colour at each time step by looking at the colours of their
neighbouring cells and following a certain rule. An example is John Conway's
Game of Life, whose rule is that a black cell, considered as being alive,
remains black if two or three of its eight neighbours are also black and
changes to white, or dead, otherwise. In addition, white cells change to
black if they have exactly three black neighbours.
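The whole rule fits in a few lines. A sketch (a common set-based implementation, not code from Wolfram's book) that advances a configuration of live cells by one step:

```python
from collections import Counter

def life_step(alive):
    """One step of Conway's Game of Life; `alive` is a set of (x, y)
    coordinates of black (live) cells."""
    # Count live neighbours of every cell adjacent to a live cell.
    counts = Counter((x + dx, y + dy)
                     for (x, y) in alive
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in alive)}

# A "blinker": three cells in a row oscillate between horizontal and vertical.
blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(life_step(blinker)))  # [(1, 0), (1, 1), (1, 2)]
```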
Some cellular automata rules generate a whole universe of
different patterns that can vary with the initial configuration of the cells.
Some patterns are boring and repetitive, while others look completely
random. Of course, because the rules are very well defined, this
randomness cannot be true randomness. The fact is that, for many of these
produced patterns, it is extremely difficult, if not impossible, to find the
rule that generates it from the pattern itself.
Wolfram's book once again has a thorough discussion of this
subject and many pictures showing that, even if you use very sophisticated
algorithms to try to compress the produced patterns, you end up with
descriptions that are many times larger than the pattern itself. As we have
discussed in the previous section, this is a strong indication of randomness,
the true one!
You might think that this is a fancy example. After all, even in the
case of the logistic map in its chaotic regime we can somehow identify
some regularity. The equation that defines the logistic map is a
compression of the data we saw in the generated graphs. Those graphs
have some similarity between them and, therefore, maybe it is possible in
simpler cases to always find the correct compression.
Look then at the following two graphs. What is the difference
between them?
Not much, right? They both seem very similar. Both are sequences
of integers from 0 to 9. Unlike the graphs of the logistic map, they
look very random. In fact, you can run on these sequences pretty much
every algorithm that tries to find regularity in data and nothing will come
out. Neither, however, is really random. The second sequence is a
pseudorandom sequence generated with a computer program. It is not a
surprise that it looks random, as it is designed to be that way.
The first graph is, however, somewhat of a shock. It is actually the
sequence of digits of the following well-known number:

3.1415926535 8979323846 2643383279 ...

Can you recognise it? Yes, that is nothing more than good old
π. Its infinite sequence of digits can be calculated by many different finite
computer programs and can be compressed into the very short description
'the ratio of the circumference of a circle to its diameter'. Even with the
most sophisticated automated methods of pattern recognition, one would
not be able to find this amazing description of that sequence.
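To make the point that a finite program generates the infinite sequence, here is one of the many possible recipes, a sketch based on Machin's classic formula π = 16 arctan(1/5) − 4 arctan(1/239), evaluated with integer arithmetic:

```python
def pi_digits(n):
    """Return the first n decimal digits of pi (as a string)."""
    scale = 10 ** (n + 10)  # 10 guard digits against truncation errors

    def arctan_inv(x):
        # arctan(1/x) * scale from the Taylor series 1/x - 1/(3x^3) + ...
        term = scale // x
        total, x2, k, sign = term, x * x, 1, 1
        while term:
            term //= x2
            k += 2
            sign = -sign
            total += sign * (term // k)
        return total

    pi_scaled = 16 * arctan_inv(5) - 4 * arctan_inv(239)
    return str(pi_scaled)[:n]

print(pi_digits(20))  # 31415926535897932384
```

A few lines of code stand in for infinitely many digits: that is compression at its most extreme.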
Here comes a SPOILER ALERT!
If you haven't read Contact by Carl Sagan (Sagan, 1985) yet, be aware that I
will be talking about something that happens literally at the end of the
book. Be warned. At the very end of the novel, the main character, an
astronomer called Ellie, is running a program to find a message that has
been left hidden in the digits of π by the supposed 'designers of the
universe'. The program suddenly spills out a sequence of 1s that somehow
forms the picture of a circle.
Now, you might be truly amazed to know that Carl is right:
there is indeed such a sequence hidden in the digits of π! That exact
sequence, by the way! Have I just changed your life? Before you start
tweeting about this amazing news, let us review a couple of facts about
that number.

We have seen that π is the number which results from dividing the
length of a circle by its diameter. In flat Euclidean space, which is the one
obeying the geometric properties you learned in school, this
works for every circle. It does not work in curved spaces like those present
in General Relativity, for instance, which you might be tempted to interpret
as saying that π somehow contains information about the flatness of space
around you.
But π is a very interesting number in many other respects too. One
of them is the fact that it is an irrational number. This means that there is
no way to write π as a fraction, or ratio, of two integer
numbers. I once received a paper from someone who claimed to have
found the 'true' value of π and provided a fraction. When you are
a scientist, sometimes you have to deal with such people.
A consequence of π's irrationality is that its decimal digits cannot
(ever, never) be periodic. A periodic sequence is one that repeats itself
after a certain number of digits, like the following ones:

111111111111 ...
121212121212 ...
123456123456 ...

The three dots at the end mean that these sequences repeat
forever in the obvious way (I call the last one the 'Mambo Sequence', by
the way).

The first sequence has period one, the second has period 2 and the
Mambo Sequence has period 6. The period is then, clearly, the number of
digits that keep repeating. A rational number, one that can be written as a
fraction of two integers, always finishes with a periodic sequence. It
can take a while to reach that sequence, but it is always there. The
converse is also true: if the decimal expansion of a number becomes periodic at
some point, then the number is rational. For instance,

0.4356666666 ...

where the last digit '6' repeats forever, is a rational number. In irrational
numbers, like π, this never happens, and that is the main reason why the
sequence of its digits looks random.
What does this have to do with Sagan's Contact? We are getting
there. Another detail about irrational numbers is that the number of
decimal digits in their expansions is infinite. That is because any number
whose sequence of decimal digits is finite IS a rational number. All you
need to do to find its representation as a fraction is to multiply it by an
appropriate power of ten until it becomes an integer. The number is then
that integer divided by the power of ten. For instance,

3.1416 = 31416 / 10000,

a ratio between two integers.
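You can watch a fraction's expansion settle into its periodic tail by doing the long division yourself. Since each new digit depends only on the current remainder, and a denominator den allows at most den different remainders, the digits of any fraction must eventually cycle. A sketch:

```python
def decimal_expansion(num, den, digits=30):
    """Digits of num/den after the decimal point, by long division."""
    out = []
    r = num % den
    for _ in range(digits):
        r *= 10
        out.append(str(r // den))
        r %= den
    return "".join(out)

print(decimal_expansion(1, 7))  # 142857142857...: period 6
print(decimal_expansion(1, 4))  # 250000000000...: ends in a repeating 0
```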
Many of you must know Jorge Luis Borges' story The Library of
Babel. In it, Borges imagines a library containing books in which every
combination of the letters of the alphabet is present, in a random order.
This means that, if you only look at the books with, say, 400 pages, the
library contains all stories and all scientific books that have ever been
written, or that will one day be written, as long as they fit in 400 pages!
Even things that have not been discovered yet! Even stories that nobody
has written yet, but that one day someone will! In fact, because the library is
infinite, it contains all books that have ever been written or that will ever
be written.
Although Borges' library is fictional, it illustrates a truly amazing
property of the infinite. When you put together infinity and randomness,
you get something even more amazing. It can be proved that in an infinite
random sequence, ANY finite sequence of characters appears an infinite
number of times! Now, the punchline:

Every finite sequence of numbers appears an infinite number of times in
the sequence of decimal digits of π.

(Strictly speaking, this relies on π being what mathematicians call a normal
number, which is widely believed but has not yet been proved.)
Think about this. In the same way as you can encode computer files
in binary form, you can also encode any information in decimal form. If you
doubt it, just write down the binary representation of any file. That is an
integer number. Write that integer in decimal base and there you have it.
This means that every text that has ever been written or that will ever be
written can be found somewhere in the sequence of decimal digits of π! An
infinite number of times! This means that, whatever Sagan's character
found, it was not a message from another race, but simply the result of
good old randomness.
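A quick experiment shows how generously short patterns appear in long pseudorandom digit strings (the 4-digit "message" and the seed below are arbitrary choices):

```python
import random

rng = random.Random(123)
digits = "".join(str(rng.randrange(10)) for _ in range(1_000_000))

pattern = "1620"              # any short "message" you care to look for
print(digits.count(pattern))  # typically around a hundred occurrences
```

In an infinite random sequence, the count would be infinite with probability one.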
If you are worried that there is so much information hidden in π, or
maybe trying to devise a plan to extract future information from it like in
the Bible Code, be aware that this is futile. Because the digits are random,
there is no way to know where the information is beforehand, or even
which information is correct or not, because the same information appears
with all possible mistakes!

The bottom line: randomness is elusive, get over it.
I have probably convinced you by now that, even if true randomness exists, we
would not recognise it if we had our faces glued to it. But nature was kind
to us and, at least apparently, has allowed us to look into the very face of
randomness, the true and clean one. We call it quantum mechanics.
There is nothing more fundamental and mysterious than this. Born
at the beginning of the 20th century, quantum mechanics is the most
fundamental description of physical phenomena that we have and it says
that, even if we could gather all possible information about a system with
the maximum allowed precision, we would only be able to predict
probabilities for the outcomes of an experiment, not the actual value of the
outcome with 100% certainty.
Quantum mechanical randomness is a very special kind of
randomness which is not the result of a lack of information, but a basic
characteristic of some aspects of nature itself. If anything deserves to be
called true randomness, this is it. But because quantum mechanics does
require some familiarity with probabilities, we will leave its discussion to a
later point in this book. Right now, we will come back to more mundane
subjects... we will be back to Vegas!
From our journey to the core of randomness, we have learned that there
is an incredible number of situations in life in which we will not be able to
gather enough information to predict a result. But can't we at least give our
best guess?
Of course we can and we usually do that. Randomness, be it true or
false, does not prevent us from saying something about those systems. But
we do not simply guess, we make an educated guess, as we physicists like
to call it. Even without realising it, that is what we always do. We look at the
roulette and satisfy ourselves that it seems to be constructed in a way that
gives the same chance for the ball to stop in any of its holes. Symmetry,
remember? That allows us to infer that all of them should have the same
probability. Then, we choose any number and bet on it, hoping that the
probabilities are indeed equal.
Roberto C. Alamino


Probabilities are our main tool to tame randomness in all its facets.
Now that we have a better understanding of how randomness appears in
our experiments, we need a language to deal with it. This language is
provided by probability theory, and we now have to understand how
probabilities can be evaluated.
We have already touched on this point when we talked about coins and
dice. There, where everything was nicely symmetrical, the discussion was
almost straightforward. We have also seen that in more complicated
situations, one might say that there are at least two different answers.
They were subtly different, the difference being whether one uses the
classical or the frequency method to calculate these so-called probabilities.
When we used the classical method, equal probabilities meant that
there were no reasons to expect any result in particular and we did not
have to actually do the random experiment to infer that. The idea was a
direct consequence of the symmetry arguments. The classical method is a
sort of static answer, because you calculate it once, you have your
information and you are done.
The frequency method, on the other hand, can be seen as a
dynamic answer to deal with randomness. It is based on the repetition of
an experiment. If all numbers have the same probability, by repeating the
random experiment many times, the Law of Large Numbers tells us that
each result will appear approximately the same number of times. If the
probabilities are different, these differences will be reflected in the
obtained frequencies as long as we repeat the experiment enough times.
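The behaviour described here is easy to watch in a simulation. The sketch below, assuming a fair six-sided die and Python's pseudo-random generator, rolls the die more and more times and reports how far the observed frequencies stray from the classical value 1/6:

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible

# Roll a fair d6 many times and compare the observed frequencies with
# the classical probability 1/6. By the Law of Large Numbers, the
# largest deviation should shrink as the number of rolls grows.
for n_rolls in (60, 6_000, 600_000):
    counts = Counter(random.randint(1, 6) for _ in range(n_rolls))
    freqs = {face: counts[face] / n_rolls for face in range(1, 7)}
    worst = max(abs(f - 1 / 6) for f in freqs.values())
    print(f"{n_rolls:>7} rolls: largest deviation from 1/6 = {worst:.4f}")
```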
There is a raging battle here. The kernel of it is the dispute
concerning which one should be the correct method. The classical method
works by calculating probabilities as properties of the information we have
about a system, while the frequency method clearly gives a property of the
system. Because of this, many mathematicians call probabilities calculated
by the frequency method objective probabilities, while those calculated
from the information one has about the system are frequently called
subjective probabilities.
If you believe in a physical reality out there, you would be tempted
to say that the real probability should be that given by the frequencies. As
we have learned in this chapter, however, unless you are dealing with
quantum mechanics, you will then never truly have objective probabilities.
We saw that in all other cases, randomness is actually related to our lack of
information about something, leading to a lack of predictability. Does that
mean that frequencies are then the wrong answer? The actual answer is
more subtle than that.
One of our goals will be to understand that there is no need to
choose one of the options above to the detriment of the other. In order to do
that, we need to acquire a deeper understanding of how probability is
measured in general. We have to make sense of all those fractions and
percentages in a way that they stop being simply numbers and start to
acquire more meaningful visualisations inside our minds. There are
different ways in which this can be done. The path we will choose here is
via Cox's Postulates, and we will do that in the next chapter.
The Logic Behind
In his Philosophical Essay on Probabilities (Laplace, 1825), Laplace wrote
that the theory of probabilities is at bottom nothing but common sense
reduced to calculation.
The great insight of Laplace was that probability was an extension
of logic. It was logic adapted to situations in which you cannot be
completely sure of something. This should have the consequence that,
somehow, all the rules that govern probabilities should reflect logical
principles. When things are certain, probability should also reduce to
deductive logic, which is that kind of logic that states 'if this, then this'
without room for doubts.
Let us go back and analyse once again our d10. We have seen that
the probability of any side was 1/10 because this number was supposed to
represent the symmetry of the whole situation. Therefore, there must be a
path connecting rules of common sense and non-negative real numbers
between 0 and 1. These rules should also include how to combine them.
That is because, for instance, we want to say that if the face 1 and the face
2 both have probability 1/10, the chances of getting either 1 or 2 in our
dice rolling should be 1/10 + 1/10 = 2/10.
This connection is made by a set of 3 (yes, only three) simple
requirements that are called Cox's Postulates, sometimes also called
Cox's Axioms. As with any set of postulates, they are assumed to be true from
the start. That is not cheating at all. We want probabilities to reflect
something, so we need to tell the mathematics what this something we
want is. Once we have defined it, it can be proved that probabilities can be
represented by real numbers just like those we are using for the d10. The
beauty of these postulates is that they can actually be fully reproduced
here, as their essence is so clear that no technical knowledge is needed.
They are
C1. Probabilities are real numbers representing degrees of plausibility.
C2. These real numbers should follow all properties required by common sense.
C3. The numbers should be consistent.
They look like pretty sensible requirements, don't you agree? We will
go through all of them one by one. The first postulate, to which we will
refer as C1, expresses the most fundamental idea behind all we are going
to do. In plain English, it says that it is possible to quantify the plausibility
for anything to happen. This plausibility is, of course, what we will end up
calling by the name probability.
Postulate C1 alone, however, is still not enough to fix the range of
numbers inside of which probabilities will be defined. Even if we decide to
define the number 1 as describing the certain event (that which has 100%
of chance to occur), C1 does not force us to attribute the value 0 to the
impossible event, although it would be a very convenient and fairly
intuitive choice. I have to agree that C1 has a sort of philosophical feeling
attached to it. It is more like a statement of a goal. But there is one thing
we can extract from it: the idea that, to propositions with different levels
of plausibility, we should associate different real numbers.
C1 is very important as a starting point, but let us analyse a little
more its limitations. Consider two propositions to which we would like to
assign some degree of plausibility. I put the word proposition in boldface
to draw your attention to something extremely important. By targeting
propositions, we are now able to assign values of plausibility not only to
random physical experiments, but to any kind of sentence you can
formulate in any language, as long as it has some meaning, of course.
Examples of propositions are A = 'It's going to rain tomorrow' and
B = 'I will spot a dodo today'. Notice that I am labelling my
propositions by the letters A and B. This is just a trick that helps us avoid
rewriting the whole sentence again and again while we analyse it. What
postulate C1 suggests in this case is that we can attribute two real numbers,
dop(A) and dop(B) (dop = degree of plausibility), which respectively give
the plausibility of propositions A and B.
Because of the name we chose for these numbers (again, 'degree of
plausibility') and knowing that B is obviously less plausible than A (even if
you live in a desert), we might be tempted to require these numbers to
satisfy

dop(B) < dop(A)
Although the above equation looks pleasant to the eye, it is worth
reminding yourself that there is quite an ordinary counterexample to this
kind of ordering. Think about the property of coldness. The coldness of an
object can be easily measured by measuring its temperature. When one
says that an object with temperature T1 is colder than another one with
temperature T2, we are obviously saying that T1 < T2, so that colder
means a smaller temperature. Because we are used to it, we are not
confused by that terminology. The other reason this does not sound
strange to us is that we do not use the term 'degree of coldness' to
describe temperature, although we could even get used to that if we had
spent a long time applying that name.
It would sound very strange, but be totally acceptable, to assign a
smaller number to propositions which are more plausible, although it
would then be more sensible to call that number a 'degree of
implausibility'. We, for instance, feel much more comfortable saying that
temperature measures how warm something is than saying that it
measures how cold it is, even with both assertions providing the same
information. Therefore, keep in mind that the order given in the above
equation for the dops is not necessarily implied by the postulates, but is
conveniently chosen simply because of the name we chose for these numbers.
But the order is not the only thing that is not implied by C1. The
usual 0-to-1 interval that I mentioned so many times, written in
mathematical notation as [0,1], is nothing more than a convention. And we
will see in a while that even putting all three postulates together, it is still
going to be a convention. We are used to considering that an impossible
proposition corresponds to the value zero and an absolutely sure proposition
to the value one. If you are a mathematically inclined reader, you can find a
detailed analysis of Cox's postulates in Jaynes' book (Jaynes, 2003). There it
is shown that we could just as well work on the interval from 1 to infinity,
which I will prefer to write with the shortcut notation [1, ∞], by assigning
the impossible event to the infinite value, the certain event to the value 1
once again, and using the 'degree of implausibility' way of thinking.
Mathematicians would be shouting in anger at me right now. The
reason is my interval [1, ∞]. In general, the notation for intervals in
mathematics states that we use the symbols [ or ] to indicate that the
points to which they are closer are considered to be included in the
interval. That is because we can choose to delete one or both of those
points and keep the rest. For instance, the interval (0,1] is almost the same
as the interval [0,1], but does not contain the zero, and the interval (0,1)
does not contain both the zero and the one. In rigorous mathematics,
infinity is not an actual number, but a kind of limit. Therefore, as a limit, it
is never reached and is never actually included inside the interval. This
means that the rigorously correct way to write the infinite interval starting
at 1 would be [1, ∞). I will use the excuse that I am doing this because I
want infinity to be a possible value, not only a limit. I hope I can be
forgiven.
If you are not a mathematician, you might be suspicious about a
different issue. How can I change the interval from [0,1] to [1, ∞] and say
that they are equivalent descriptions if the interval from 1 to ∞ is clearly
larger than the one between 0 and 1? Well, it all depends on what you
mean by the word 'larger'. If you measure with a ruler, then it is, but what
is important for them to be equivalent for our purposes is the fact that
both have exactly the same quantity of numbers.
It looks like I am cheating, because it is not very difficult to
understand that both have an infinite number of real numbers, but the
issue runs deeper than that. Although there are infinities which are of
different sizes, these two are the same. The trick is to understand that we can
associate to each number in one interval a unique number in the other. To
see this, call the numbers in the interval from 1 to ∞ the degrees of
implausibility, or doi, of the propositions. To check that the number of
doi points is the same as the number of dop points, we associate them by the
following formula:

dop = 1/doi

To each value of doi, this formula associates exactly one value of
dop. In addition, any value of dop has its associated doi and vice versa.
There are no unmatched points in the two intervals! Therefore, they have
different, for instance, from the number of points in the set of natural
numbers (the numbers 1, 2, 3, 4 and so on). We will need to know about
this distinction later and we will discuss that when that time comes, so do
not be very worried about this right now.
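One concrete pairing, assuming the matching formula dop = 1/doi (so that certainty, doi = 1, pairs with dop = 1, and the impossible event appears as the limit of larger and larger doi), can be checked in a few lines:

```python
# Pairing the two conventions via the assumed matching formula
# dop = 1/doi: each degree of implausibility in [1, infinity) gets a
# unique degree of plausibility in (0, 1], and vice versa.
def dop_from_doi(doi):
    return 1.0 / doi

def doi_from_dop(dop):
    return 1.0 / dop

for doi in (1.0, 2.0, 10.0, 1e6):
    dop = dop_from_doi(doi)
    # applying the inverse map recovers the original value,
    # so no point of either interval is left unmatched
    assert abs(doi_from_dop(dop) - doi) < 1e-9 * doi
    print(doi, "->", dop)
```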
We have extracted all we possibly could from C1 without delving
into more serious mathematical calculations. We can now safely proceed
to C2, which is probably the subtlest of all three postulates, as we all know
from our daily life how difficult it is to agree on what common sense really
is.
I will adopt a very pragmatic view concerning what should be
understood by the term common sense. We are going to assume that this
should be equivalent to the requirement that the numbers we assign to our
degrees of plausibility, based on C1, should obey all sensible rules of logical
induction.
The difference between logical induction and logical deduction is
that, in a sense, the former is a relaxed version of the latter. Deduction is
concerned with proving assertions, while induction is concerned with
increasing or decreasing the likelihood of an assertion being true without
requiring definitive proof. The choice of induction in what we are doing is
then obvious. Apart from this, though, most of the content will be the
same for both.
Defining common sense as logical induction sounds a bit abstract,
but we will see that, although there is always a bit of required abstraction
when you are dealing with matters of logic, the rules of induction are indeed
the ones we apply all the time without paying much attention. Every time we
want to draw conclusions from some piece of information, we are
invariably using logical induction. If you keep that in mind as we go along,
you will be able to find many examples of daily decisions in which those
rules are being applied.
As we have discussed before, instead of using plain words all the
time, it becomes convenient to introduce a bit of mathematical notation at
this point. Try to remember what we talked about mathematical symbols
and do not be scared. They are only shortcuts to words or whole
sentences. You can even think about them as Chinese characters, only
simpler to draw, although with longer meanings.
First of all, let us agree once and for all to use the 'degree of
plausibility' point of view and, consequently, our numbers will always be
between 0 and 1. I will then introduce a very subtle modification in the way
we interpret the symbol dop(A). We will define it as an abbreviation for
the sentence 'the degree of plausibility for proposition A to be true'. This
seemingly innocent change of wording is actually very important, because
it is here that we open up the doors for logic to enter the stage.
This modification will now require that the proposition A can only
assume two truth values: true and false. Anything that can assume two
different values can actually be thought of as something that can be either
true or false. But how then can we attribute plausibilities to propositions
that can assume more than two values? What about our dice, for instance?
The number of possible results of a dice rolling is equal to the number of
faces of the dice, which obviously can be larger than only two.
Although at first sight this would present a difficulty, there is no
real problem. We are going to talk much more about how this works in the
next chapter, but to easy your mind, think about this. If we roll a d6, the
result can be broken in 6 propositions which can only have the true/false
values. They are the propositions the result was 1, the result was 2 and
so on. We can always do that for any experiment. We simply list the
number of possible results and consider each one as a binary proposition,
i.e., a proposition that can assume two values.
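This bookkeeping can be sketched in a couple of lines (the exact wording of the propositions is just illustrative):

```python
# Re-express a single d6 result as six true/false propositions of the
# form "the result was k". For any given roll, exactly one of them is true.
def propositions(result):
    return {f"the result was {k}": (result == k) for k in range(1, 7)}

props = propositions(4)  # an example roll
for sentence, truth in props.items():
    print(sentence, "->", truth)
assert sum(props.values()) == 1  # exactly one proposition is true
```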
The word binary might ring some bells. Computers work using
binary numbers, right? The binary numbers are numbers that contain only
two digits, which are usually chosen to be 0 and 1. A variable that can
assume two values can alternatively be called a bit, although the more
correct way of putting that is that it carries a bit of information. We will
come back to that at some point, as information is one of the things we will
be concerned with in this book.
There are many reasons why a computer uses a binary
representation for numbers. It turns out that this is enough to deal with
logic. The connection is obvious. We associate the value 0 to the truth
value FALSE and the value 1 to the truth value TRUE. This is actually a very
convenient (and abbreviated) way of symbolising the truth values of our
propositions.
Some caution with notation is required now as logic is prone to a
phenomenon called abuse of notation. This happens all the time in science
and is a pain in the neck of those who are starting to learn some subject.
Once you get used to some mathematical notation, you start to use it in an
even more abbreviated way which, most of the time, turns out to be
rigorously wrong. However, because you are experienced and know what
you are doing, you get away with it. This happens a lot with logical
notation.
Remember that our proposition A is a sentence. We can attach a
degree of plausibility to it and now we have learned that we can also attach
a binary number named its truth value. I will use the notation

A = TRUE

to indicate that the truth value of A is 1 (TRUE), with the obvious
corresponding notation when its value is 0 (FALSE). The danger is that,
many times, you will see people writing, instead,

A = 1
Strictly speaking, that is wrong. It is true that one can work with
plain numbers instead of propositions in mathematics, and then this would
be correct, but in our case that would be an abuse of notation and I will try
to avoid it as much as I can.
One thing we can do with binary propositions is to modify them.
For instance, we can negate them. We use the symbol Ā (a bar over the letter
corresponding to our original proposition) to indicate the negation of
proposition A and we call it, quite obviously, 'not A'. There are actually
many symbols for this and each book will use its favourite one. Books on
logic will many times use the notation ¬A or ~A, or even !A, which will be
familiar to programmers of C or Java. If you think about A as a sentence
like 'it is going to rain tomorrow', then Ā becomes quite obviously 'it is not
going to rain tomorrow'.
The word not is an example of what is called a logical operator
and we will many times write it in uppercase letters as NOT in order to
emphasise it. In the same way as BAYES, logical operators can be thought of
as small computer programs that are fed with one or more propositions
and return a single proposition constructed by combining the initial
propositions in some way. In other words, the operator performs some modification on
the original propositions.
In the case of the NOT operator, it is fed with one proposition and
spills out another one. What happens in terms of truth values? Let us see. If
the truth value of A is 0, then the truth value of Ā is 1, and vice-versa. This
can be summarised by the following truth table:

A  Ā
0  1
1  0
The second column of the table gives the truth value corresponding
to the negation of the proposition in the first column. Notice that this
table shows truth values, not the propositions themselves.
Common sense becomes relatively easy to apply here. We just
require that either A or Ā, and never both at the same time, should be true.
Either it is going to rain tomorrow or it is not. There is no other possibility.
You can appreciate this by noticing that each line in the truth table has
different numbers. There is no repetition in each line. Our degree of
plausibility must reflect this feature.
If you know about quantum mechanical superpositions, you must
have a smile of superiority in the corner of your mouth right now. Yes, it is
true that in quantum theory things can be both TRUE and FALSE at the
same time in some very particular sense, but Coxs Postulates deal mainly
with classical logic, not any modification of it. Our conscious mind, after all,
works on a classical world doing classical logic even if you are a physicist
working with quantum theory. Still, even when dealing with quantum
mechanics, Coxs Postulates and in fact everything else we will learn in this
book, will continue to work when correctly interpreted.
The NOT operator is not the only one we use when dealing with
propositions. Another thing we usually do is to combine two or more
propositions into one. One way we can do that is by using the AND
operator.
Consider two propositions A and B. We use the symbol AB to
indicate a proposition which is true only if both propositions A and B are
true at the same time, and read this symbol in the obvious way as 'A AND
B'. The meaning of the 'and' is the same as in usual language. Just
think about it. If we say that such and such is true, we mean that both
things are true. It is with this form of combining two propositions that we
associate the AND operator, which once again can be interpreted as a
program that takes two propositions A and B and returns a single one
called AB.
In the same way as the NOT operator could be represented as a
truth table, AND also has its own, given by

A  B  AB
0  0  0
0  1  0
1  0  0
1  1  1

This table summarises what happens with the truth values when
we use the AND operator. The last column holds the result
of AB, which is only 1 (TRUE) when both propositions are 1 at the same
time.
In a similar fashion, we can introduce the OR operator with the
obvious interpretation in terms of usual language. The symbol we will use
for the combined proposition is A + B, and it corresponds to either only A
being true, or only B being true, or both being true at the same time. That
is how the word 'or' usually works in any language, if you think about it.
The corresponding truth table is

A  B  A + B
0  0    0
0  1    1
1  0    1
1  1    1
These three logical operators are enough to represent all logical
combinations and manipulations of truth values. If you think in terms of
truth tables, it is obvious that we did not exhaust all possibilities. For
instance, there could be a truth table like this one, whose result is 1
exactly when the two propositions have different truth values (the
'exclusive or'):

A  B  result
0  0    0
0  1    1
1  0    1
1  1    0

But this particular case can be represented as (A + B) AND NOT(AB), and you
are invited to check all the possibilities to see that, indeed, the above table
corresponds to this expression. Because of this, these three operations
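Claims like this can be checked mechanically. As a sketch, with truth values represented by the integers 0 and 1, the following verifies that a table which is true exactly when the two propositions differ (the 'exclusive or') is built out of NOT, AND and OR alone:

```python
# Truth values as the integers 0 and 1, with the three basic
# operators defined on them.
def NOT(a):    return 1 - a
def AND(a, b): return a * b      # 1 only when both inputs are 1
def OR(a, b):  return max(a, b)  # 1 when at least one input is 1

# Exhaustively verify that the "exclusive or" (true exactly when A
# and B have different truth values) is expressible with these three.
for a in (0, 1):
    for b in (0, 1):
        built = AND(OR(a, b), NOT(AND(a, b)))
        assert built == (1 if a != b else 0)
        print(a, b, "->", built)
```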
become extremely important in computer science and electronics. In
computer chips, those operators are implemented already in the hardware
level. They receive the alternate name of logical gates and are represented
by three standard circuit drawings (not reproduced here), depicting
respectively the AND, OR and NOT gates. The incoming
wires (the lines at the left of each drawing) represent the bits that will be
operated on. The outgoing wire is the result of the operation. They are
represented as wires because that is what they are (or the equivalent) in
real circuits. The truth value 0 represents a physical situation in which no
current is passing through the corresponding wire, while the truth value 1
represents one in which a current is.
In fact, it can be shown that one can construct all possible logical
operations with only one logical operator! This operator is not unique, but
the most commonly used is the NOT AND operator, or NAND. This is just
the result of applying first the AND operator to two propositions and then
applying the NOT operator in sequence. There are many more operators,
but I will stop this discussion here as NOT, AND and OR are not only
enough, but actually the most convenient set for us to work with, as it can
be readily translated to daily language.
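The claim that a single operator suffices is also easy to verify mechanically. A sketch, building NOT, AND and OR out of NAND alone and checking them against their usual truth tables:

```python
# NAND is 0 only when both inputs are 1; everything below is built
# from it, then checked against the usual truth tables.
def NAND(a, b):
    return 1 - a * b

def NOT(a):    return NAND(a, a)
def AND(a, b): return NOT(NAND(a, b))
def OR(a, b):  return NAND(NOT(a), NOT(b))

for a in (0, 1):
    for b in (0, 1):
        assert NOT(a) == 1 - a        # NOT truth table
        assert AND(a, b) == a * b     # AND truth table
        assert OR(a, b) == max(a, b)  # OR truth table
print("NOT, AND and OR all reproduced from NAND alone")
```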
Because AND and OR are such common operators, you will find
many other notations for them as well. The most common of these is to
depict them, respectively, by the symbols ∧ and ∨.
Let us now see how all of this connects with our degrees of
plausibility. Remembering that we have agreed to work in the interval
[0,1], we should then require the following properties:

dop(AĀ) = 0   and   dop(A + Ā) = 1

If you are put off by the mathematical symbols, simply try to
substitute them by words. For instance, the first bit of the line above
would read

'the degree of plausibility (dop) of A and not A being true
at the same time is zero'.
If our definitions follow usual logic, this has to be true. As we have
already discussed, we are assuming that propositions are either true or
false, but not both at the same time. Never. That is why the probability of
both being true at the same time must be zero. This can be easily visualised
in our truth tables. Looking at the NOT truth table, one can see that A and
Ā always have opposite truth values. The only case in which the AND of
two propositions is true is when both have the truth value 1, which in this
case never happens.
The second property, concerning the OR operator, translates to

'the degree of plausibility of either A or not A being
true is one'.
That is simply because either one or the other has to be true. It is
the same as saying that A is either true or false. It cannot be neither. These
two properties are logical consequences one would require from any
proposition. If something can be only true or false, but not both or neither
at the same time, then A and Ā cannot be true at the same time, but at
least one of them should be. This kind of reasoning, amazingly, narrows
least one of them should be. This kind of reasoning, amazingly, narrows
down the possibilities for our degrees of plausibility enormously. Just to be
complete, the above reasoning would also work if we had chosen to work
in the interval [1, ∞].
You might think that all propositions that make sense should be able to fall
into the tendrils of logical analysis. I am not talking about sentences like those
appearing in Lewis Carroll's 'Jabberwocky' (Carroll, 1871), which is a poem
containing many nonsensical sentences. Some of them, of course, cannot
be either TRUE or FALSE. For instance, its opening stanza, beginning

''Twas brillig, and the slithy toves / Did gyre and gimble in the wabe',

does not really mean anything and, therefore, has no defined truth value.
Random ensembles of words are another example as they also have no real
meaning. Amazingly, languages are unconstrained enough to allow
sentences which do have meaning, but at the same time do not have a
defined truth value. Most of them are related to something to which we
will return later in this book: the ability of languages to talk about
themselves.
Consider the following sentence:

'This sentence is true.'
Notice that it is a sentence talking about itself. Can you decide if
this sentence is either true or false? No. In fact, it can be both. You can
attribute whatever truth value you want to the sentence. Still, you would
be a bit reluctant to attribute both values and to say that it is both true
and false at the same time, right? That is because you are thinking of the
sentence as an object, but imagine that you write that sentence twice with
different colours, like this:

'This sentence is true.' (in red)
'This sentence is true.' (in blue)
You then can say that the red sentence is true, while the blue one
is false. Although that would make you more comfortable, both sentences
are still the same! So, we can say that the sentence is true and false at the
same time.
But there is an even worse consequence of addressing oneself
which is embodied by something called the Epimenides Paradox.
Epimenides was a Cretan philosopher who lived around the 6th century BC.
In one of his writings, he stated

'All Cretans are liars.'

The paradox did not seem to be evident either to him or to many
people for several centuries. But wait, if the above sentence was stated by
a Cretan, is it a lie or not? That is equivalent to asking if the above
proposition is TRUE or FALSE. Let us analyse that.
If all Cretans are indeed liars, meaning that the above sentence is
TRUE, then Epimenides being a Cretan must be lying and the sentence
should be FALSE, which is a contradiction. If the sentence is FALSE and
Cretans are not liars, then Epimenides must be telling the truth and the
sentence must be TRUE, which is also a contradiction. Then the sentence is
neither FALSE nor TRUE!
You might object to such a radical interpretation by saying that
liars do not lie all the time and honest people lie once in a while. Fair
enough. Consider then this clearer version of Epimenides' Paradox, which is
very similar to the sentence that was both true and false at the same time:

'This sentence is false.'

There is no social interpretation in this sentence. It states that it is
FALSE. Now, if the above sentence is TRUE, then according to itself it must
be FALSE, a contradiction. If it is FALSE, then it must be TRUE, another
contradiction. The above sentence cannot be either FALSE or TRUE. There
you go.
The lesson is that some languages, including mathematics, can hide
a few secrets and you should always be careful when assuming things to be
obvious. That is the reason why mathematicians are always concerned
about rigorous proofs of everything.
And with this we are done with what we called common sense. We can
finally move on to the last postulate in the set. If we now require C3 also to
hold, apart from the freedom of choosing the interval, our dop function
acquires virtually all mathematical properties that probabilities as we know
them have!
The postulate C3 is not difficult to understand and it is just the very
sensible requirement (at least for most people) that if we calculate our
degrees of plausibility using two different lines of reasoning, as long as C1
and C2 hold and we choose only one of the possible intervals, the final
values must be the same. Strictly speaking, the last postulate, C3, can also
be seen as a kind of higher-level common sense requirement. After all, how
good can a theory be if it gives different answers for exactly the same
question just because you changed the way you asked it?
I cannot help stressing here how important postulate C3 really is.
By using Cox's Postulates as our foundation to define probabilities, instead
of only using frequencies, we are adopting what most people call the Bayesian
interpretation of probability. This, as I mentioned already, has been called
by many the subjective interpretation of probability because it is based on
using available information to calculate the numerical values of the
probabilities, and available information is obviously a feature of the
observer who is doing the calculation, not only of the system, and can be
different for different users.
However, when we require C3 to be valid, if two people have
access to the same information, they should end up with the same values
of the probability. This is not more subjective than any other calculation
either in mathematics or in science as a whole. In fact, consistency is the
maximum we can require from any definition of an objective reality which
is free of contradictions. Our Bayesian probabilities have it for sure. Pay
attention to consistency, this is not the only time we are going to see it.
Once we force our degrees of plausibility to lie in the interval [0,1]
and to comply with everyone's expectations (more technically, with Cox's
Postulates), it becomes legitimate to officially call them probabilities once
and for all. To make that official, instead of dop(A), we will
finally start to use the scientist's beloved notation P(A) for the probability
that proposition A is true.
Take a moment now to appreciate the beauty of what we have
accomplished here. With a minimum of mathematics, and a lot of logical
reasoning, we derived the most fundamental aspects of probabilities. This
was not a small achievement. Even today, many mathematicians, scientists
and philosophers might look at this with some disdain, but apart from a
matter of taste, there is nothing to frown upon here. On the contrary, we
have a theory that should please all three of them with the additional
advantage of being extremely easy to grasp.
For a great number of mathematicians and for those who have a
technical knowledge of probability theory, the derivation of probabilities
from Cox's Postulates (remember, CP) might lack that professional and
technical feeling. That is because the modern mathematical theory of
probability is conventionally based on something called measure theory.
This theory relies on the axioms proposed by a character that we have
already met before, Andrey Kolmogorov.
Kolmogorov's Axioms, which we will abbreviate here as KA, use
ideas of set theory to define probabilities. Although it can be shown that all
of KA can be derived from CP (for technical details you can look at Jaynes,
2003), the former are less clear in terms of interpretation than the latter.
The advantage of basing our derivation of probabilities on CP
rather than KA is that, because of the way they are formulated, it becomes
much clearer how and why we are doing it. In addition, if you are not a
mathematician, it is much easier to grasp the fundamental ideas of
probability using CP than using KA. Finally, the connection with logic and
information theory is much more direct with CP.
As an example, let us analyse how one would see the throwing of a
d10 from KA's point of view. We start by creating the set of all possible
mutually exclusive results of one roll of the d10. We have seen this set
before. We called it the sample space. Each one of its elements will be
called now a sample point. The most common symbol in mathematics used
to name sample spaces is the uppercase version of the last letter in the
Greek alphabet, the omega. This symbol is Ω. Then, for our d10, we could
write the sample space as the set

Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
According to KA, we then assign a number, or a measure, to each
one of the sample points. This number, of course, is the probability of that
particular result. KA are then roughly equivalent to the following three statements:
(1) Probabilities are non-negative real numbers.
(2) The probability of the whole sample space is 1.
(3) The probability of any combination of results (meaning that they are
put together with the OR operator) that cannot happen at the same
time should be equal to the sum of the probabilities of each result.
You can notice that KA choose the probability interval from the start,
but apart from that, there is no mathematical difference between the
approaches and, sometimes, we will use the visual picture of sets of
elements to make things easier to understand.
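To make the three statements concrete, here is a small sketch (my own illustration, not part of the original text) that checks them for the d10, under the assumed uniform assignment of 1/10 to each face:

```python
# A minimal check of Kolmogorov's Axioms (KA) for a d10.
# Assumption: each face gets measure 1/10 (the uniform assignment).

from fractions import Fraction

# Sample space: the ten mutually exclusive results of one roll.
omega = {face: Fraction(1, 10) for face in range(1, 11)}

# (1) Probabilities are non-negative real numbers.
assert all(p >= 0 for p in omega.values())

# (2) The probability of the whole sample space is 1.
assert sum(omega.values()) == 1

# (3) For mutually exclusive results, the probability of their OR
#     is the sum of the individual probabilities.
p_even = sum(omega[face] for face in omega if face % 2 == 0)
assert p_even == Fraction(1, 2)  # five faces out of ten
```

Exact fractions are used instead of decimals so the sums check out without rounding error.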
You should always remember that the two points of view are
complementary, not mutually exclusive. This is something that will happen
all the time. Bayesian inference is not meant to substitute for what works, but
to enlarge the scope and improve the understanding of probability theory.
I hid something underneath the carpet in the course of the previous
sections. You will find the clues scattered all around the text. I said that
common sense was a difficult thing to agree upon and that all we were
doing would change a little if quantum mechanics entered the stage,
but that everything would be alright in the end.
Everything is indeed alright, as long as we take a deeper look at
what "alright" actually means. That sounds a little too philosophical, but
philosophy is at the basis of everything we do in science and mathematics.
Ignoring this, like many do, is like behaving as a computer program that
calculates things without knowing why. Computers hardly ever know if they
are making mistakes because they simply do not know what a mistake is
unless you program them.
In 1960, one of the greatest unknown-to-the-public physicists of
the last century, Eugene Wigner, published a paper that became a kind of
holy text for physicists. The paper is called The Unreasonable Effectiveness
of Mathematics in the Natural Sciences (Wigner, 1960) and it deserves to
be part of our discussion. The weight of Wigner's stature among the
scientific community of the time can be inferred by the fact that the paper,
although published in a journal of mathematics, is purely philosophical.
There is not a single equation in all of it! Would a less famous researcher
Had a less famous researcher tried a similar stunt, the paper would have
bounced back from the editors in less than five minutes.
Wigner's worries, expressed in that paper, reflect the questions
which were occupying the minds of the physics community during his time.
Since the Greek philosophers discovered that we could use mathematics to
describe nature instead of using divine explanations, the amount of
patterns discovered in the natural world has only increased and become
more complex. The fact that we can create a purely mathematical line of
reasoning from experiments to predictions that, afterwards, are confirmed
with outstanding precision is almost a miracle.
In other words, Wigner points to the fact that mathematical
concepts entering physics via some analogy with similar but purely
mathematical constructions many times lead to conclusions, based only on
the mathematical manipulations of the used symbols, which are so
accurate when compared to the experiments that there seems to be no
sensible explanation for that.
One example, probably the one that has impressed every
physicist who began to learn the principles of quantum mechanics, is the
necessity of complex numbers in quantum theory. We have talked briefly
about complex numbers in connection with Cardano. One of the main tasks
of complex numbers in mathematics is to guarantee that any polynomial
equation with complex coefficients has solutions that are also complex
numbers. This property is called algebraic closure.
Algebraic closure is not a trivial property and it is not difficult to see
that. Even if your current profession has nothing to do with mathematics,
you might remember from high school that not all polynomial equations
with real coefficients have real solutions. Consider, for instance, the simple
quadratic equation

x² + 1 = 0
As anyone can see, there are only good old real numbers in the
above formula. Nothing involves the square root of -1, the telltale sign of
the complexes. However, to solve it one needs to find a number whose
square is -1. We all know that such a number cannot be real. In fact, it is the
imaginary unit i. This means that the reals are not algebraically closed
because an equation that can be written using only reals needs non-real
complexes to solve it. By considering from the start the whole set of complex
numbers, this completely changes, meaning that every polynomial
equation that can be written by combining complexes has roots which are
necessarily complex.
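As a quick illustration of this point (mine, not the book's), Python's built-in complex numbers exhibit the missing root directly:

```python
# The equation x^2 + 1 = 0 has no real solution, but it does have
# complex ones: the imaginary unit i (written 1j in Python) and -i.
import cmath

root = cmath.sqrt(-1)                       # the imaginary unit i
assert abs(root - 1j) < 1e-12
assert abs(root * root - (-1)) < 1e-12      # its square is indeed -1
assert abs((-root) * (-root) - (-1)) < 1e-12  # and so is the square of -i
```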
Those who still remember complex numbers will be scratching
their heads asking what else can exist beyond them. It turns out that a lot
of things with strange names exist, like quaternions and p-adic numbers.
Everything depends in essence on what you really call a number. In
mathematics, numbers are entities that obey some rules. In general, what
we see as a number is usually what in mathematics we call a field. Still, we
can stretch the boundaries a little and call other structures also numbers in
a sort of way.
Complex numbers indeed form a field in the mathematical sense,
so they deserve to be called numbers as much as the reals. One could
spend a whole book showing how complex analysis, the branch of
mathematics that studies complex numbers and their consequences, is
beautiful and how it simplifies so many derivations. However, when it
comes to classical physics, at first sight there is no reason to introduce
complex numbers apart from some tricks one can do to solve some
equations. They are useful, but they are not an integral part of the theory.
Go to the internet now and look up any article on quantum
mechanics. It is all about complex numbers. The most impressive fact about
the use of complex numbers in quantum theory is that, by following their
mathematical rules, one can indeed reach conclusions that can be directly
translated to phenomena in the real world. This kind of pattern happens in
physics all the time, not only for complexes. Once in a while, someone finds
a way of expressing part of physical reality by associating to it some
abstract mathematical structure and mathematics itself does the rest of
the trick! The fact that this works so often is what Wigner saw as a mystery.
There is more to the article, but that is its essence. However, things
changed since Wigners time and we are in a better position to look into
this philosophical problem from a different point of view. First of all, we
have to understand that mathematics is, in some sense, above science.
"Above" here means that it provides a language in which one can not only
encode rules of inference that are a result of our observations about the
physical world, but also any rule that we can invent. Science, on the other
hand, is always constrained by physical reality.
In addition, we actually know that things are not as simple as
Wigner painted them. Not every mathematical structure we try out ends up
being perfectly successful in describing the world when we run the
mathematical machinery. Not all solutions to the equations of physics are
realized in the real world, no matter how rigorously the mathematical rules
are followed. In those cases, we either change the principles we used to
derive the result or we change the mathematical rules themselves.
This works even for the most basic mathematical structures.
Imagine that someone comes to you with a mathematical challenge. The
person says that her age is the solution of the following mathematical
equation:

x² - 16x - 36 = 0
This equation has two solutions, -2 and 18. Which one are you
going to pick? Given that negative ages have no meaning for us, the -2
solution is just an artefact of the equation and we have to discard it. It
is not that we cannot represent ages by numbers and operate them with
minus, plus and squares. We can, but we need to be aware of the
limitations of it.
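A sketch of this pruning step (my own illustration; I assume the quadratic x² - 16x - 36 = 0, whose roots are the -2 and 18 mentioned in the text):

```python
# Solve x^2 - 16x - 36 = 0 with the quadratic formula and keep only
# the physically meaningful (non-negative) root.
import math

a, b, c = 1, -16, -36
disc = math.sqrt(b * b - 4 * a * c)                     # sqrt(400) = 20
roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]  # [-2.0, 18.0]

# Negative ages are an artefact of the equation; discard them.
ages = [r for r in roots if r >= 0]
assert ages == [18.0]
```

The mathematics happily produces both roots; it is our knowledge of what the number represents that rejects one of them.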
A more sophisticated and interesting example is Boolean algebra.
Boolean algebra, in a nutshell, is just the way logic is represented inside
computers. We have seen it before. It comprises the rules for operating
with the numbers 0 and 1, representing, respectively, false and true,
together with the logical operators AND, OR and NOT.
Boolean algebra is constructed such that the mathematics mirrors
what we would expect by substituting the numbers and symbols by words.
We saw that, if we work with propositions, attribute truth values 0
and 1 to them and combine them using AND, OR or NOT, we will not end
up with nonsensical results. Remember that, when discussing Cox's
Postulates, we attributed truth values to propositions like A = "I am reading
this book now" and used the idea that A and NOT A (not A, or "I am not
reading this book now") could not both be true at the same time. In terms
of Boolean algebra, that meant that A AND (NOT A) can only assume the
value 0, never 1.
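These rules can be sketched in a few lines (my own illustration, encoding false and true as 0 and 1):

```python
# Boolean algebra with 0 and 1: a truth-table check that a proposition
# and its negation can never both be true.

def AND(a, b):
    return a & b        # 1 only when both inputs are 1

def OR(a, b):
    return a | b        # 1 when at least one input is 1

def NOT(a):
    return 1 - a        # flips 0 and 1

for A in (0, 1):
    assert AND(A, NOT(A)) == 0   # A AND (NOT A) is always false
    assert OR(A, NOT(A)) == 1    # A OR (NOT A) is always true
```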

Think about this for a moment. Why do we insist that we can either
be doing or not doing something, but not both at the same time? When we
were constructing probabilities via Coxs Postulates, this exclusion rule
was imposed by us because of our daily experience with the physical
reality. We never see in our daily routine people doing and not doing
something at the same time, so we infer that this must be an acceptable
rule. We generalise this rule and deem it applicable to everything around
us.
By contemplating the example above, we could be inclined to
accept that Boolean algebra must be the mathematical structure that
describes logic in nature. This is fine, as long as we are not in the realm of
quantum physics. It is by now widely known that, while classical
computers work with bits which can be either 0 or 1 but not both at the
same time, it may be possible to construct quantum
computers which work with qubits (quantum bits), strange entities that can
be both 0 and 1 at the same time!
We will study quantum mechanics in more details later on, but
right now it is enough to know that everything in quantum theory is
defined by a mathematical object called a state vector. A state vector is a
mathematical construction that contains all possible information necessary
to describe anything at some particular moment in time. State vectors are
used to relate systems to their possible states, which are defined by
characteristics that can be measured. For instance, in physics we usually
talk about energy states. If you consider a highway with cars that have the
same mass, cars with the same velocity will have the same kinetic energy.
Each different value of this kinetic energy defines then a different energy
state. The value of the kinetic energy works as a label to identify the state.
Two cars with the same velocity are said to be in the same energy state.
The same concept works for other quantities too, not only energy. We can
have position states, electric current states, mass states and virtually any
kind of state associated with something measurable. These measurable
quantities in quantum mechanics are called observables, for obvious
reasons.
Classically, bits are systems for which we choose an observable
that can assume only two values. We then re-label one of the values 0 and
the other 1. In classical physics systems can only be in one or the other
state. Right now, for instance, you are in a state that would include among
many other pieces of information the fact that you are reading this text. If
we forget about everything else, we could simply say that your state is
given by the above proposition A = "you are reading this book now".
Quantum effects are only perceptible for very small things or for very high
energies in our world and, because we are characterized by neither, we are
better described by the rules of classical physics. In classical physics, you
are now either reading or not reading these lines, but not both. That is a
completely consistent mathematical description, which means that it does
not lead to any internal problems in the physical theory describing this
particular phenomenon (of reading this book now). Therefore, your
reading state can be associated with a classical bit by re-labelling the
proposition A as 1 and its negation, NOT A, as 0.
A subatomic particle, on the other hand, is small enough to be
affected by quantum physics. Be aware that this is not a change of the rules
of nature. Quantum physics describes both us and the subatomic particle.
What happens is that the differences between quantum physics and
classical physics become small when things are large and slow (like us). The
effect of quantum physics on the subatomic particle allows it to break the
rule that it can either be doing something or not doing it, but not both at
the same time. This means that while classical objects can either assume
the state described by A or the one described by NOT A, but not one
described by A AND (NOT A), a quantum object would be allowed to
actually assume it. This strange situation is described by saying that the
system is in a superposition of the states A and NOT A, and this
superposition will be described by a combined state vector.

Of course there are subtleties and we will talk about them later on.
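For readers who like to see numbers, here is a small sketch (mine, not the book's, and it anticipates material covered later) of an equal superposition, in which the squared magnitudes of the two amplitudes behave as probabilities:

```python
# An equal superposition of the two basis states "0" and "1".
# Assumption (anticipating a later chapter): the squared magnitudes of
# the amplitudes give the probabilities of finding each state.
import math

state = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # the combined state vector

p0 = state[0] ** 2
p1 = state[1] ** 2

assert abs(p0 + p1 - 1) < 1e-12   # the probabilities still sum to 1
assert abs(p0 - 0.5) < 1e-12      # a 50/50 chance of "0" or "1"
```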
Do not get desperate if things do not make sense for you at this
point. The situation in quantum physics is a subtle and very confusing one.
It takes time to get used to it. As strange as it might seem though, this state
of affairs is the result of us physicists trying to make sense of a series of
experiments in such a way that does not lead to an internal contradiction in
our description of nature. It is our attempt to create a consistent theory
describing those phenomena. The only way we found to do that was
accepting that, at the quantum level, things can be described by assuming
that there is a state, like the superposition above, that they can assume,
which would be impossible at the classical level. There is no contradiction
in that because the theory is
constructed from the start in such a way that, when we deal with daily life
objects (remember, slow and large), Boolean algebra is recovered.
The bottom line is that even the rules that we take for granted in
nature are a consequence of inference made on the basis of collected
information. We use mathematical structures because the experiments
seem to indicate that it makes sense to do so. You might say that it is not
only the experiments, but also logic, that guides us. Just remember that logic is also inferred
from experiments, so it is basically all the same. Even science itself,
including all the rules we use for working with it, is a question of inference.
In this sense, the fact that mathematics is useful in physical sciences is a
result of us inferring the right rules.
There are, of course, philosophical alternatives. In a paper
originally released on the internet in 1997, Max Tegmark (Tegmark, 1997)
proposed that every mathematical structure actually exists as a different
reality. This would answer Wigners question in a different way by saying
that the fact that mathematics is useful is inevitable as any mathematical
description will fit some physical reality and every physical reality has a
mathematical description fitting it. It says that there is a one-to-one
relation between mathematical structures and realities.
This idea is, as many others, not provable in its present state, which
characterizes it as a hypothesis, not a scientific theory. It is not even clear
whether it is a scientific hypothesis, although that does not take away its merit. But
even if that hypothesis turns out to be true somehow, we still would have
to infer which are the rules of our mathematical structure. That would take
us back to our game of inference.
The main message of this chapter is that probability theory can be
constructed from simple principles of inductive logic. By starting with some
basic requirements, which many people more technically call desiderata
(things you wish for), we built step by step a formal framework that
allows us not only to talk about probabilities, but to calculate them in such
a way that we can compare the numbers we obtain to the results of actual
experiments. Probability is logic.
The second important message that you should keep with you not
only while reading this book, but also when thinking about science in
general, is that the mathematical frameworks that we develop, no matter
how complicated they might seem, are there to encode patterns in
whatever way we want to. The connection of mathematics with the real
world is made by selecting those patterns which we actually observe in
nature. If we decide to, we can modify the mathematical rules, although
respecting some limits, to make them reproduce whatever behaviour we
want.
Mathematics, philosophy and science are a triad that lies at the
foundations of all our knowledge. Although you can ignore their
connections if you wish, a consistent understanding of everything we learn
and do cannot be complete without considering all three at the same time.
This unity will be lurking behind every page of this book and we will have a
chance to feel it at its full power as we journey to its end.

Science, as we discussed at the end of the previous chapter, uses
mathematics to describe the world we live in. These mathematical
descriptions, which we have seen to be abbreviated ways of encoding a
series of rules, are called models. Every science, from psychology to
physics, works by creating models of phenomena and trying to test if they
work or not.
I have already lost count of how many times people come to me
and say that this or that theory has been proven wrong. Newtonian
mechanics has been proven wrong by relativity. Classical physics has been
proven wrong by quantum mechanics. If those theories are wrong, why do
we keep using them? To answer this question in full, we need to
understand a bit more what it means for a theory about nature to be
right or wrong. That will take us through a long path.
It is true that the scientist in general, but even more strongly the
physicist, has an inner desire to believe that nature is describable by one
single, mathematically coherent model: the famous and exaggeratedly
named Theory of Everything or, in one of those many instances in which
scientists like to make a joke, a TOE.
We are not there yet and, honestly, there is no guarantee that we
will ever be. But if there is one thing we are certain about, it is that all models
we have now are wrong in some way. The beauty of this is that it does not
matter if our models are right or wrong in every single detail as long as we
can use them to make predictions about what they are supposed to
describe and within certain acceptable limits.
But what about models that offer different descriptions for the
same aspect of nature? In our daily life, most people agree on the colours,
shapes and other directly detectable characteristics of things around us,
which means those which can be detected by one of our senses. The
amazing thing about our present knowledge of nature is that we have
models describing things that are not detectable by our bodies unless we
use some kind of tool to indirectly measure them.
The classic example is atoms. The existence of atoms was not fully
accepted until the work of Einstein on Brownian motion at the
beginning of the 20th century. Ernst Mach, for instance, used to say publicly
that he did not believe in atoms. Today, atoms can be visualised, but only
by using a tunnelling microscope.
Another simple example is radio waves. We can literally see with
our eyes electromagnetic waves within a certain band of the spectrum. We
call it simply visible light. We can detect it because some proteins in our
eyes change shape when they are hit by light and this shape-changing is
transformed into electrochemical signals which are transferred to our brains
for post-processing. Light (the visible kind) differs from other kinds of
electromagnetic waves simply by its frequency. All other frequencies,
including radio waves, need special devices to be detected. In the case of
radio, we use antennas and electronic circuits to detect them and
electronic circuits to interpret them as images or sounds.
What is real and what is not is a very complicated, but not
unimportant, philosophical question. It is however one which we will only
discuss here very briefly. When we detect something with our senses, we
tend to attribute to it a much greater quality of being real than when
we are forced to use indirect measurements. We usually forget that the
shapes, colours and sounds are all interpretations by our brain of the
information it receives from our detectors. In fact, this information is
already pre-processed by the nerve cells that convert it into electric
signals transmitted from neuron to neuron. This means that part of reality
is already lost and changed in this process. Many experiments in
psychology show that the brain's interpretation of the information it
receives is so subjective in some aspects that it can be heavily affected by
our memories and experiences.
This realisation, that the reality we create in our brains is a model
resulting from the processing and interpretation of electric signals, puts in
evidence a different player in physical reality whose importance could only
be appreciated once we entered the present computer age. This element is
information.
Science and technology are complementary in the sense that
advances in one bring advances in the other. When new technologies are
incorporated to our daily routines, they change our culture and even our
way of thinking. This new way of thinking usually bounces back on the way
scientists interpret nature. That happened with the steam age, which
brought the energy paradigm in physics, and happened again with the
information age, which forced physicists to think about the world also in
terms of how information is acquired and processed between systems.
Computers are systems that eat, digest and spill information all the
time. We are so used to them nowadays that we find it very simple to
understand things in terms of information. Anyone today can appreciate
that, given enough bits, we can construct any kind of image, sound and
even actual three dimensional objects. We might still not be able to
reproduce some things like smells, but that is a technological issue, not a
fundamental one.
Because we created them, we can still understand how computers
work. At least, most of us have a feeling that we do. Not always in detail,
but in general. There is though a much more complex system which also
eats, digests and spills information called the human brain. Repeating what
I wrote before, our brain is all the time receiving information about things
and creating a model that gives us our perception of reality. But there is
also another level in which this process happens. It is happening right now.
While you read this text, the words in it are being processed by your brain
and associated with your memories to create a meaning. If you read the
word ball, you know what it is without seeing the picture. In fact, right
now you are seeing a ball without really seeing it. If you think about your
favourite music, you will hear it without actually hearing! If you stop to
think about it for a moment, it gives you chills.
What is happening is that I am using a kind of code to store
information in a way that you can detect, decode and interpret later. We
call this code language. This particular language you are reading right now
is called English. This encoding allows me to think about something and
transmit it to you without you having to take a look inside my head.
Learning English means learning how to associate sequences of letters with
images, sounds, smells and so on. Some words, however, are associated
with higher level concepts. The word English itself describes something
that is much more complex.
Mathematics, as I have argued before, is another kind of language.
It was developed for a different (admittedly more restricted) purpose,
but it is still a code used to store, transmit and also to process information.
It turns out that it is the most efficient language to do science. In science,
we learned that it is extremely convenient to encode information about the
external world as mathematical structures. We then use mathematics itself
to process this information until we can extract some hidden pieces of it
from the original data.
Probabilities are one of those structures used to codify and process
information, in particular when there is a lot of uncertainty around, which
is almost always the case. In fact, we are going to see that they are one of
the most fundamental tools to do that in any case, with certainty being
just a special situation. We have already learned that probabilities encode
Cox's Postulates. In the next sections, we are going to understand what
else they can encode and how.
It is time to roll our dice again. Let us use the d10. After all we have
learned, we can safely agree that saying that the probability of getting a 10
when rolling the d10 is the same as the probability of the proposition A =
"When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 10 upwards" being true. We can then use
the shortcut mathematical notation P(A), which I hope at this
point has become clear enough. We can now look at this number using our
new knowledge about what a probability is.
The important thing is to notice that, when we associated
probabilities with degrees of plausibility, the former could be interpreted
as a way to actually encode whatever information we had about a
proposition. The greatest challenge is how to do this encoding. There is not
a unique answer to this problem. There are many ways of doing it
according to which type of information we have in our hands. This is one of
those places in science in which creativity is essential. Although there are
some basic principles that can be used as guidelines, unfortunately (or
fortunately if you are a scientist who is afraid of being replaced by a
computer) there is no general procedure that works for all cases.
Rolling a d10 is a physical situation which is simple enough to allow
us to do that with little difficulty. Here too it is convenient to be economic
with mathematical symbols. In terms of information theory, whenever we
try to be economic with symbols, this is equivalent to saying that we would like
to compress information. To do that, we have to agree on some basic
conventions that will allow us to write our propositions and the
probabilities corresponding to them in the most compact way possible and
to recover their actual meaning whenever we see them. We start by trying
to find a condensed description to refer to all 10 possible flavours of
our proposition, which written in full become:
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 1 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 2 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 3 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 4 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 5 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 6 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 7 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 8 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 9 upwards
When I throw my d10 up into the air and it hits the table, it will stop with
the face marked by the number 10 upwards
I am pretty sure that after the second or third sentence you were
already losing your patience. I could have used much less space to describe
the above 10 possibilities. I can simply omit from the description of the
proposition the whole procedure of how the dice is being rolled. We keep a
kind of rulebook where we describe that procedure only once and then
assume that we agreed to follow that exact procedure for all dice rolls.
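This rulebook trick is ordinary data compression, and a small sketch (my own illustration) shows the saving:

```python
# Store the fixed part of the proposition once (the "rulebook") and
# keep only the varying piece: the face number.
template = ("When I throw my d10 up into the air and it hits the table, "
            "it will stop with the face marked by the number {} upwards")

propositions = [template.format(face) for face in range(1, 11)]

assert len(propositions) == 10
assert propositions[9].endswith("number 10 upwards")

# The compressed description is far shorter than the ten full sentences.
compressed = len(template) + len("12345678910")
assert compressed < len("".join(propositions))
```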
Next, we give a name to the result of the dice rolling. Let us call it
X. Then, each one of the ten propositions giving the results of the dice
rolling can be written in the following compact form

X = x, with x = 1, 2, ..., 10

This is visibly a huge amount of compression. For instance, in this
notation we have now a very compact way to write our proposition as
X = 10. Because it is a notation that makes very clear what our proposition
is, we will many times use it inside the brackets of the probability symbol in
the following way

P(X = 10)
The variable X is what is called a random variable in probability
theory and it is a convention in many books to use capital letters for them.
Random variables should not be confused with our propositions. A
proposition is, in fact, an assignment of a value to a random variable. You
can think of the random variable as a kind of incomplete proposition, one in
which something is missing. In the above case, this something is the value
of X.
The act of assigning probabilities to each value of our random
variable is then exactly equivalent to saying that a certain face has such and
such odds of being the result of the dice rolling. So what can we say now
about these odds? Remembering our previous discussions, let us assume
that our d10 is an exactly symmetric solid with 10 faces. In addition, we will
assume that, along the entire d10, the density of its material is constant and
that the small painting indicating the number on each face is so thin that the
paint's mass is negligible and will not affect the result.
We discussed previously that, unless we live in a completely
chaotic world where things do not make any sense, like in a cartoon, it
Roberto C. Alamino


seems logical that, if the dice is thrown with enough energy and in a
careless enough way, all faces should have the same probability. Finally, we
also know that only one face can occur each time we roll the dice. That
summarises all our information about the entire game. But before we start
encoding this information into probabilities, I will introduce some
even lazier notation and compress things a bit more.
Because all that is relevant to the situation can be described by the
result of one single random variable, we do not need to repeat its name all
the time. We know that, at least for now, we will be dealing only with X.
Therefore, there should be no need to keep repeating the name of the
variable over and over again as long as we remain working with the same
problem. So, instead of writing P(X = 3), we can simply use P(3) to
describe the probability of X being 3. That is a risky move, as sometimes
omitting the random variable can be the source of a lot of confusion. Our
situation here is, however, clear enough for this not to happen.
For what we have to do now, we will need to combine propositions
with the logical operators AND and OR, the same two that we learned
about when we were dealing with Cox's Axioms. The convention was that
combining two propositions with AND would be denoted by writing them
in sequence. For instance, the combined proposition A AND B would
simply be written as AB. Similarly, we would write A OR B as A + B.
There is only one inconvenience when using the AND convention
together with our present compressed notation for dice rolling. The
problem is that it is visually confusing to put numbers together like
propositions. For instance, 1 AND 2 would be written 12, which could be
misunderstood as a result of twelve in the dice rolling, which cannot
happen for a d10. Therefore, instead of writing 12 for the proposition
1 AND 2, we will separate the numbers with a comma and write 1,2.
We now have a very compact and convenient notation to encode
the information about the d10 rolling. We can now summarise it by the
following requirements
The Probable Universe


1. All probabilities should be the same (symmetry):

   P(1) = P(2) = ... = P(10)

2. When the d10 is rolled, at least one of the faces has to end up
   facing upwards:

   P(1 + 2 + ... + 10) = 1

3. It is impossible to get more than one face at each time:

   P(1,2) = P(1,3) = ... = P(9,10) = 0

The first equation, which encodes the idea that all faces have the
same probability, is a simple way to encode the symmetry of this problem.
You need to be careful with the second equation. Remember that we are
using the convention that the symbol + means OR and not the usual
addition of numbers. This means that 1+2 is not equal to 3 in this notation.
The last set of equations seems to encode such an obvious fact that
it looks almost superfluous. But remember that we are using a variation of
the lawyer's rule which says that, if something is not explicitly written
down (in this case, encoded), then it does not exist. We need to encode
somewhere the idea that only one face can be the result of the rolling,
otherwise we open a potentially dangerous loophole in our encoding.
This impossibility of having two or more faces at the same time is
important here. It turns out that the rules of probability state that when
two propositions, say A and B, are mutually exclusive, with P(AB) = 0, this
implies that

P(A + B) = P(A) + P(B)
This is called the sum rule and, although it has admittedly a simpler
interpretation when one considers probabilities as frequencies, it works for
any proposition. The way to prove it involves some mathematical
manipulations and I will skip them here. If you are one of those people
who (correctly) do not believe in something just because a book says so,
you can find the proof in chapter 2 of Jaynes' book (Jaynes, 2003). Notice
that the sum rule is not always valid. The most complete form of it is
given by

P(A + B) = P(A) + P(B) - P(AB)
and it states that when we sum the individual probabilities for A and B, we
are counting twice the probability of A and B happening at the same time.
Because we counted it twice, we have to discount it once to get the correct
value. When the discounted value is zero, which is when there is no chance
for both to happen at the same time, we recover the initial sum rule.
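Since the d10 already gives us concrete numbers, the sum rule is easy to check. Here is a small sketch (in Python, which is not part of the text) that treats propositions about the d10 as sets of faces; the 1/10 probability per face comes from the derivation above, and the particular propositions A and B are just illustrative choices.

```python
from fractions import Fraction

# Probabilities of a fair d10: 1/10 per face, as derived in the text.
p = {n: Fraction(1, 10) for n in range(1, 11)}

# Mutually exclusive case: "face is 1" OR "face is 2", with P(1,2) = 0.
p_1_or_2 = p[1] + p[2]
print(p_1_or_2)  # 1/5

# Overlapping case: A = "even face", B = "face is at most 4".
A = {2, 4, 6, 8, 10}
B = {1, 2, 3, 4}
p_A = sum(p[n] for n in A)
p_B = sum(p[n] for n in B)
p_AB = sum(p[n] for n in A & B)   # faces 2 and 4 satisfy both
p_A_or_B = p_A + p_B - p_AB       # general sum rule with the correction
print(p_A_or_B)  # 7/10
```

The overlapping case shows why the correction term is needed: without it, the faces 2 and 4 would be counted twice.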
There is also the complementary rule called the product rule,
which says that the probability of the conjunction (AND) of two
propositions which are independent of each other is the product of their
probabilities, or, in mathematical symbols,

P(AB) = P(A)P(B)
The independence property is extremely important here and
means that the occurrence of one of the propositions has no influence
whatsoever in the occurrence of the other. If the propositions are
somehow related like = We are in the Sahara desert and = It is going
to rain today, which very obviously influence each other, the rule changes
in a way that we will see later on. But it cannot be written as above
anymore. In the same way as there was a generalisation of the sum rule
that would work always, there is also a generalisation of the product rule
that works even when the propositions are not independent. However, this
generalisation involves something called a conditional probability,
something that is very important for our program BAYES, but which we will
talk about only later on.
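The product rule can also be watched in action. The sketch below (Python, illustrative only) rolls two independent d10s many times and compares the observed frequency of a double 1 with the prediction (1/10) × (1/10) = 1/100.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible
N = 200_000

# Roll two independent d10s N times and count how often both show 1.
# The product rule predicts P(first = 1 AND second = 1) = 1/10 * 1/10.
both_ones = 0
for _ in range(N):
    first = random.randint(1, 10)
    second = random.randint(1, 10)
    if first == 1 and second == 1:
        both_ones += 1

print(both_ones / N)  # close to the predicted 0.01
```

Independence enters through the two separate calls to the random number generator: neither roll knows anything about the other.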
Because the occurrences of faces in the d10 rolling are mutually
exclusive according to the rules of our game, using the sum rule the
equation containing the OR of all faces is equivalent to

P(1) + P(2) + ... + P(10) = 1
The symmetry information says that all these probabilities should
be the same. Therefore, we can call all of them by the same name, which I
am choosing to be the letter p. This results in the simplified equation

10p = 1,

which finally gives us the probability of any face to be p = 1/10, in a
relieving agreement with what we thought to be justified by common sense.
Can you now look at the simple expression P(n) = 1/10 in the same
way as you did before? Take a moment to think about how much
information is contained in this simple-looking equality. Of course, once
you get the knack of it, the act of encoding information into probabilities
becomes easier, but it does not mean that it becomes easy. It requires a lot
of thinking and a lot of experimenting to find out which are the relevant
pieces of information and how to put them together in order to find
numbers which will allow you to make good predictions. In some cases, like
the stock market for instance, this might never be possible within the
accuracy we would like. Get used to uncertainty. Life is full of it.
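If you want to see the p = 1/10 conclusion at work rather than take the algebra on faith, a short simulation helps. This sketch (Python, not from the book) throws a simulated fair d10 a hundred thousand times and reports the observed frequency of each face; all of them hover around 0.1.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is reproducible

# Simulate 100,000 throws of a fair d10; by the symmetry argument,
# each face should appear with probability 1/10.
rolls = [random.randint(1, 10) for _ in range(100_000)]
freq = Counter(rolls)

for face in range(1, 11):
    print(face, round(freq[face] / len(rolls), 3))
```

The frequencies are never exactly 0.1 for a finite number of throws; they only approach it as the number of throws grows, which is itself a good reminder of the gap between probabilities and observed frequencies.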
The symmetry principle we used to find the probabilities for our d10 in the
previous section is a very powerful and very deep one. It is called Laplace's
Principle of Insufficient Reason and is a simple statement of rational consistency.
Symmetry is a fundamental concept in our current understanding
of nature and, fortunately, is a very simple one. We have already seen it
working in two different ways: to allow us to calculate probabilities and to
generate uncertainty. We say that something is symmetric or has a
symmetry when, if we change this something in some way, there is at least
one characteristic of it that remains the same. If everything changes, there
is no symmetry at all.
For instance, the human body has what is called bilateral
symmetry. This is a fancy term for the fact that, if you take a photo of the
right side of your body, you can use graphics software like Photoshop (or
GIMP, the free alternative to it) to complete the rest of the picture by
simply reflecting it. This also means that if you swap your left and right
sides, your picture remains the same.
Nobody has perfect bilateral symmetry though. We call this
situation an approximate symmetry and, in most cases, this is already
powerful enough for most purposes. It is actually very rare to have a
perfect symmetry in any situation, but fortunately we can most of the time
ignore the small imperfections and assume that the symmetry is perfect
enough given the precision we are working with. Just remember that we
ignored differences in the density of each face of our dice caused by the
inscription of the numbers on it because they would not cause too much
deviation from symmetry.
I will interrupt the text flow here because this last issue
cannot be emphasised enough. In most tasks we perform in our lives, we
only need to be precise within certain limits. This is true for science as well.
Most of the time, perfection is not only unachievable, but is also
unimportant and looking for total precision can be a waste of resources.
Keep that in mind.
Returning to the main topic, in order to appreciate the power of
symmetry concepts in science, in particular in physics, it is worth knowing
that the most fundamental descriptions of nature's laws that we presently
have are based on something called gauge symmetry. This is a very difficult
symmetry to visualize, because it is associated with the way we describe
some things mathematically. The first place in which it was identified in
physics was in electromagnetic theory.
We all know what a voltage is. Voltages are written everywhere on
our electric devices. They measure differences of something called an
electric potential. It turns out that there is no meaning in an absolute
value of the electric potential. It is always relative to some fixed reference
potential. You can change the numerical values of the potentials, but the
voltages remain the same. This was exactly what happened in our
definition of symmetry. This is a very simplified example of the gauge
symmetry present in the electromagnetic theory. It is called a gauge
symmetry because we can gauge the electric potential in the most
convenient way for us without changing the actual physics.
A full understanding of gauge symmetry would require a level of
mathematics that is beyond what we can have here. The consequences of
gauge symmetry, however, are far reaching. For instance, when we go to
quantum theory, if we take as a fundamental requirement that some gauge
symmetries should be satisfied, this implies that fields similar to the
electromagnetic field must exist. If they did not exist, the symmetry could
not exist either.
There are even more consequences of symmetries. In 1915, the
German mathematician Emmy Noether proved what is probably the most
beautiful theorem of all physics and, to show that prizes are not everything
in life, she never earned the Nobel Prize for it. In one of the few cases
in which deserved credit is correctly assigned, the theorem is known today
as Noether's Theorem, although its beauty more than provides
grounds to suggest that it should be called Noether's Poem.
The theorem was only published in 1918, in German, and was
translated into English in 1971 (Noether, 1971). What Noether proved is very
simple, but incredibly amazing. Her theorem shows that each continuous
symmetry of a physical system implies the existence of a conserved
quantity. Not only that, the theorem is so complete that it gives you the
formula to calculate the conserved quantities. The hypothesis and the
conclusion of the theorem are not difficult to understand, even if you are
not a mathematician, although the proof cannot be given without maths.
The key concepts are continuous symmetries and conserved quantities.
Let me explain what they mean.
Continuous symmetries refer to symmetries of an object when the
change we make on it depends on a parameter that can vary continuously
from the initial unchanged configuration of the object to the final, changed
one. For instance, think about a round plate. If you hold the plate in front
of you and rotate it by any angle, the plate looks just the same (once again
ignoring the tiny imperfections on it). But at any moment between the
initial position and the final position, the plate also looked the same, no
matter what the value of the angle was. Consider that, with respect to the
initial position of the plate, you rotated it by 90 degrees. In order to do
that, you needed to pass through 87, 87.5, 87.9, 87.999 and so on before
reaching 90 degrees. Any value between 0 and 90 is acceptable for the
angle of rotation. Inside this interval, there were no forbidden values, they
could be varied continuously and at each value the plate would still look
the same.
We say that the angle varies continuously in contrast to what we
would call a discrete change. When something changes discretely, it goes
through jumps. The real numbers, the numbers in a straight line, vary
continuously because there is always another real number between any
two. The integers, contrary to that, vary discretely. There are no integers
between 1 and 2. There is a jump from 1 to 2.
Consider the picture below.
The second row shows a square which is rotated by the angles
given in the first row, which are 0, 10, 45, 87 and finally 90 degrees. You
can see that the square will only look the same if you rotate it by 90
degrees or by multiples of it, like 180, 270 and so on. The third row shows a
circle (or a plate) for comparison, which is always the same no matter what
the angle of rotation is.
For the square, any angle strictly between 0 and 90 degrees is not a
symmetry. We call cases like this a discrete symmetry, because the
symmetry appears only in jumps of 90 degrees. This kind of symmetry does
not obey the hypothesis of Noether's Theorem and we have no guarantee
that it leads to conserved quantities, although sometimes it does.
The other concept we need is that of a conserved quantity. This is
a quantity that does not change in the system as time goes by. The
example we are all used to is energy. Everyone has heard that energy
cannot be created or destroyed, just transformed. This is another way to
say that energy is conserved: if you calculate the total value of it coming
from all possible sources, that value does not change. But have you ever
wondered why?
Here comes the beauty of Noether's Theorem. Using it, we can
show that if the laws of physics are symmetric by time translations, then
energy must be conserved! A time translation is just a fancy name by
which we call the passage of time. As far as we know, time seems to flow
continuously. Up to the precision we can measure, it seems to be
possible to divide any time interval indefinitely. There is no minimum time
step. This matter, though, is not settled for sure. In any case, we can
assume that time is continuous and see what the consequences are.
By assuming that time flows continuously, we open up the way to
use Noether's Theorem. The fact that the laws of physics do not change
with time in their fundamental description can be summarised by the idea
that if we do the exact same experiment today or on any other day, keeping
all experimental conditions as equal as possible, the results should be the
same. This allows us to calculate, via Noether's Theorem, a quantity
that is conserved. This quantity turns out to have exactly the formula of
the energy.
In the same way, the laws of physics seem not to change from
place to place. If we do the same experiment in the UK and in the US, apart
from environmental differences, the results are the same. We call this
symmetry by spatial translations in analogy with that by time
translations. Because space is also continuous (within the precision we can
measure) this also leads to another conserved quantity, one we call
momentum. One of the consequences is that, unless any forces act on a
body, it will keep moving with the exact same velocity forever.
Just to complete the most famous trio of conserved quantities,
because there is no preferential direction for the laws of physics in space,
or, in other words, because physics is symmetric by spatial rotations, the conserved quantity
is the angular momentum. Very similarly to the non-angular momentum,
the angular version has the effect that if we do not try to stop a spinning
object, it will keep spinning with the same angular velocity forever.
There are many other consequences of symmetries, and even of
the failure of a system to be symmetric, but we are not going into the
details of it. One very important side effect of symmetries is that they are
accompanied by ignorance. In the example of the rotationally symmetric
plate, you will never know whether someone touched your plate if the final
effect was only a rotation. Any rotated position of the plate will be the
same. Of course, if you had access to some equipment that allowed you to
detect fingerprints, you would know that someone touched the plate. This
means that the symmetry, in this case, can also be the result of some lack
of information about some non-symmetric aspect of the plate.
We can think of the roulette as a discretised version of the plate.
The rotation symmetry in the roulette is discrete, like that of the
square, because the wheel is divided into compartments of finite size. When it is
rotated in such a way that the divisions between compartments coincide
for the initial and final positions, if we ignore the differences in the printed
numbers, we can consider both configurations as symmetric.
As we have seen with the d10 previously, the symmetry of the
compartments of the roulette tells us that there is no reason to assume
that one number is more probable than another. In other words, there is
insufficient reason to prefer one result over another. The fact that we
have no reason to favour any of a number of symmetric configurations of a
system is Laplace's Principle of Insufficient Reason.
In other words, what Laplace proposed was that, if the symmetries
of a system are such that there is no way to differentiate between one
result and another, the most reasonable thing to do, given all the
information you have, is to assume that every state is equiprobable. It does
not matter if 11 is your lucky number, nothing in the roulette says that it is
more probable and you would not be rational if you attributed a higher
probability to it.
Laplace's Principle is not something that can be proved
mathematically. It is a postulate based on a logical requirement. It has a
philosophical underpinning which reflects a rational consistency of the
world around us. Why on Earth (or, in this case, in the whole universe)
would one of two exactly symmetric configurations be more probable than
the other? The only rational reason would be if there was something that
would differentiate them. Of course, we cannot rule out that something like
that might occur in nature just because we cannot understand it, but
unless strong experimental evidence appears, it seems that we live in a
more or less sensible universe.
One of the places in physics in which Laplace's Principle has a great
impact, for instance, is in statistical physics, although this is rarely seen in
this way. Statistical physics is the area of physics that deals with systems
consisting of a very large number of smaller parts. Any macroscopic object
has more than 10²³ atoms, which by most accounts is a very large number.
The way these atoms interact to result in a solid, a gas or a liquid, and the
conditions under which these different phases change into one another, are
subjects of statistical physics.
Because it is not practical to keep track of that amount of atoms,
statistical physics resorts to descriptions that involve probabilities. The
reason, again, is the lack of total information about the system. In principle,
total information could be obtained with enough time and resources, but in
practice it is not worth doing so. You do not want, and actually do not
need, to measure all information about a cube of ice to know that it will
melt at about zero degrees Celsius.
In statistical physics, the strategy used is to focus on the energy
states of a system, and this is the point where Laplace's Principle makes its
appearance. The fundamental assumption of statistical physics is that all
states with the same energy have the same probability. The symmetry
assumption here is a subtle one. It corresponds to the idea that states with
the same energy behave in the same way macroscopically. Never forget that
assumptions like this should always be subjected to experimental
validation. As it turns out, it has been working quite well for more than a
century.
Laplace's Principle, in its plain formulation, is easy to apply in
situations where the description of the symmetries is straightforward, but
not so much in others. Fortunately, it can be translated into a very
powerful formulation which is known these days by the name of Principle
of Maximum Entropy. Odds are that you are familiar with the word
entropy as describing disorder. That is one of the ways to look at entropy,
but there is a more modern point of view which is even deeper and
reduces to the old one when the situation is appropriate. We can look at
entropy as lack of information.
Claude Shannon was a brilliant engineer from the US. During
his studies of communication systems, he discovered that he could
define mathematically the concept of lack of information in a written
message. Legend tells that he was at Princeton University at that time
and talked about his result to John von Neumann, who told him that his
formula was nothing more than the formula for the entropy that was used
in statistical physics!
The main reason why this connection was a surprising one is that
Shannon found his formula by starting from some reasonable requirements
he thought a mathematical quantity would have if it was supposed to
measure lack of information. To understand better what is meant by lack of
information, think about a silly language where all texts are just sequences
of the letter A. The only information that you might not have is the actual
size of the message, but if you forget about that, you know everything you
need to know about any message! They are just sequences of As. There is
no missing information and, therefore, the entropy of a message like that is
zero.
Another useful way to understand missing information is in terms
of the average amount of surprise each time you receive a new symbol. In
the above message, there is no surprise at all. You always know that the
next symbol will be another A. Surprise, of course, is just another way to
describe the amount of information revealed to you by each new symbol.
Imagine now that we have our full Latin alphabet at our disposal,
all 26 letters from A to Z, plus some special characters like the blank
space, the comma and the full stop. If we ignore the particular accented
variations present in different languages, then each language corresponds
to a different set of rules to put the same symbols together. Shannon
considered these rules to be probabilities of one symbol coming after
another. A forbidden sequence has zero probability. In English, for
instance, the sequence djhfasdf strictly speaking does not
correspond to a grammatically correct word. Of course it has some
probability of occurring in a text, as it just did here, but a very low one.
Clearly, the lower the probability, the greater the surprise when we receive
a word like that. But it happens so rarely that, on average, the amount of
surprise will not be very big as long as it appears just once in a while.
However, if we still insist on using a language where everything is
written just with As, we have no surprise at all in any message, no
missing information and, therefore, according to Shannon's idea, zero
entropy. Notice that we still have not made the connection with disorder
here. Wait a bit more and we will get there.
What would be the situation in which we would get maximum
surprise when each new symbol arrived? Whenever we have a symbol
which is more probable than another, that messes up the average
(important word here!) surprise in the sense that we expect more of that
symbol to appear and it will. The only situation in which we cannot have
any expectations about the next symbol is when all of them have the same
probability! That is exactly the situation of maximum entropy! It also
means that each symbol carries the maximum possible amount of
information, because we cannot predict the next symbol at all.
This should remind you of our discussion on randomness and
predictability. When we have access to the probabilities of something, then
we can evaluate how random that object is by calculating the entropy
resulting from those probabilities. The entropy will give its highest
numerical value when all probabilities are the same and will give exactly
zero when one of the results has probability one and the others zero. This,
as you can imagine, is the case in which there is no randomness and the
result is completely predictable.
Now, look at the following text with only As:

AAAAAAAAAAAAAAAAAAAAAAAAAA

And compare it with the following text containing every letter with
the same frequency; in this case, each one appears only once, in a random
order:

JXKQVWBZUFMPYGHDETNROAILSC

Which one looks more disordered to you? For me, it is clearly the
second one, the one with the largest entropy. The first sequence, full of As,
could not be more organised. Think about a very organised bedroom. If
something is out of place, you can readily tell, which would not happen if
the place was a mess. Consider that I make a small change to the first
sequence:

AAAAAAAAAAAAABAAAAAAAAAAAA

It should be quite easy to find out where I put the different letter.
What if I do the same with the second sequence?

JXKQVWBZUFMPYGWDETNROAILSC
A bit less immediate, isn't it? For a small sequence like that, you
might still think that it is not too difficult, but imagine how it is to find one
typo in the Bible. The fact that the formula discovered by Shannon
considering text messages is exactly the same as the entropy used by John
Willard Gibbs for general physical systems should be taken as a sign of
something deeper. Nature usually does not provide such coincidences
without some underlying connection between them. Today, after this
connection was scrutinised time and again, we finally understand that
entropy is a measure of lack of information about a system.
There are many ways to appreciate the analogy between texts and
physical systems. We can think of a particular text as the current state of a
sequence of letters. Each particular sequence is a text-state. A
chromosome, for instance, can be thought of as a kind of text with only the 4
letters A, T, C and G. Take man's Y chromosome. Each man has a different
sequence of letters (bases) in this chromosome, although the number of
letters is basically the same. Different men have their Y chromosomes in
different states.
Material objects can be seen as a sequence of atoms juxtaposed in
some spatial order. In the same way as you can tell a story with different
words, an object can have modifications in the arrangements of its atoms
and still be the same object. For instance, you are the same person even if
your body composition is changing at every moment. Entropy can be
associated with the number of states something can have without changing
some important macroscopic property of it, be it the essence of the story
or the individuality of a person. You name the property. Of course, when
you change things and some property does not change, we are led to
consider symmetries.
As strange as this can sound, disorder brings symmetry. This is
consistent with all we have learned before as we have seen that symmetry
brings and is the result of lack of information. Therefore, the larger the
symmetry of a system, the higher the entropy associated with it.
It would be totally understandable if you have difficulty accepting
that disorder brings symmetry. Our first impulse is always to associate
symmetry with order, but that is a great misconception that I want to
correct before we can go any further.
Go to your kitchen now, get a glass of clear water and observe it.
The water looks the same everywhere. Now, if you add to it a drop of black
ink, or any other kind of coloured fluid, the spot created by the ink is said
to break the symmetry of the water. From that moment on, the water is
not the same everywhere because you can identify every point in the glass
by its position relative to the ink drop. But if you wait long enough, the ink
will mix with the water. If you are impatient, you can even stir the mix to
help the process. The highly organised initial configuration, with the ink
arranged in a small drop, will decay to a chaotic situation where the ink is
everywhere. But then the glass is all the same once again! Disorder
created symmetry.
The same idea works for our messages as well. You might right now
be arguing that the text with only As is more symmetric than the one with
all the letters. You would be right if the disorder in the text was associated
with some geometric translation of the symbols, but remember that we
associated the entropy of the text with the appearance of the different
symbols. Changing the symbols is the relevant transformation here to
evaluate symmetry. What happened when we changed only one symbol in
the first sequence? It changed completely, which was reflected by the fact
that we found the change very easily. When we changed the other one, the
change was less obvious. If you do that in a larger text, it will basically
remain the same. There you get your symmetry. You change something
and it remains (almost) the same. I know it is hard to swallow, so keep
thinking about that as much as you can.
Let us recollect things. Laplace's Principle says that we should not
attribute different probabilities to results if we do not have any reason to
do so. Not having any reason to do so, as we have seen, means that any
states which are related by symmetries should be assigned equal
probabilities. A system in which all possible states are related by a symmetry,
and so cannot be distinguished, has maximal symmetry; therefore disorder is
maximal and the lack of information is also maximal. This system,
therefore, has maximal entropy.
Because the mathematical formula for the entropy involves only
the probabilities of the possible states of a system, if we maximise this
entropy, we then get the correct formulas for the probabilities! This is the
general formulation of Laplace's Principle, or of Maximum Entropy if you
prefer, that we were after. This allows us to attribute probabilities by
using all the information about a system and nothing else for a far larger
class of systems than we could before.
There are some small details that I have skipped. Clearly, symmetry
is not the only kind of information we have about a system. Suppose, for
instance, that we are playing a dice game and we notice that something
very strange happens: the average value of the results on our d10 seems to
be fixed at 3. Even if we do not know how this happens, we can include this
information in the Maximum Entropy formulation by introducing what is
called constraints. Constraints are nothing more than pieces of information
we know about the system. They are called this way because they are like
rules that constrain the behaviour of the system. For those who are
interested in the mathematics, this is accomplished by a technique called
Lagrange Multipliers and you can find it, once again, in Jaynes book
(Jaynes, 2003). It is enough for us to know that this can be done, but we
will not spend any time with the maths.
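For readers who like to see things numerically, here is a small sketch (my own illustration, not from Jaynes's book) of how a constraint enters Maximum Entropy. For a d10 with the average fixed to 3, the Lagrange-multiplier solution takes the exponential form p_i proportional to exp(-lam*i), and all that remains is to tune the single number lam until the average matches the constraint:

```python
import math

# Maximum Entropy sketch: find probabilities p_1..p_10 for a d10
# whose mean is constrained to be 3. The Lagrange-multiplier
# solution has the form p_i proportional to exp(-lam * i); we tune
# lam by bisection until the mean hits the constraint.

FACES = range(1, 11)

def distribution(lam):
    weights = [math.exp(-lam * i) for i in FACES]
    total = sum(weights)
    return [w / total for w in weights]

def mean(probs):
    return sum(i * p for i, p in zip(FACES, probs))

# Bisection on lam: the mean decreases as lam increases.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(distribution(mid)) > 3:
        lo = mid
    else:
        hi = mid

probs = distribution((lo + hi) / 2)
print([round(p, 3) for p in probs])
print(round(mean(probs), 3))
```

With no constraint at all, the same procedure would give lam = 0, which is exactly the uniform 1/10 assignment of Laplace's Principle.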
Let us do another pit stop and, one more time, appreciate the
beauty of all this. Maximum Entropy, as a generalisation of Laplace's
Principle, is a statement that the physical world must follow the rules of
what we consider everyday logic. It encompasses a concept which, as we
The Probable Universe


will see later on, is the basis of all human search for a rational description
of the world. It guides us in finding probabilities about phenomena by using
the information given by experiments in a maximal way, trying to avoid any
kind of emotional bias (unless we really want to be biased). The most
amazing fact is that all of this is related to a fundamental quantity of nature
which was rediscovered many times in different contexts: entropy. This
relation is just one of the connections between information and philosophy
of science. We will see many more as we proceed.
Encoding information in probabilities in the way we did via Maximum
Entropy is deeply connected to the classical way to calculate probabilities.
We did not have to repeat an experiment many times to calculate
probabilities. Our propositions could even be non-repeatable experiments
and nothing would change. Nothing prohibits us from assigning
probabilities to the proposition that the world will end tomorrow, which
would be a highly non-repeatable experiment (without using some clever
tricks).
Everything we have studied until now seems to be biased towards
using only the classical method of obtaining probabilities and virtually
marginalising the frequency method. In a sense, I confess I did that. So, in
order to reverse this bias, let us talk about the relation between
probabilities and frequencies. This association, in fact, will be very useful
for visualising some examples. Still, you have to bear in mind that
frequencies are particular cases and the view of probabilities as degrees of
plausibility is more general, though in no way incompatible with them.
When one considers experiments that can be repeated more than
once, there is a mathematical theorem that guarantees that, if we repeat
them enough times, probabilities coincide with the relative frequencies of
the possible results. This theorem is the famous Law of Large Numbers, the
name being highly self-explanatory.
By the relative frequency of a certain result one understands the
number of times that result occurred during the experiment divided by the
total number of times the experiment was repeated. This definition
immediately guarantees that the frequency is a number between zero and
one, just as we chose probabilities to be. Another obvious consequence is
that, in the case the results are mutually exclusive (no more than one at each
time), the frequencies of the individual results add to one in a
straightforward way.
As it should be for everything to fit together nicely, the AND and
OR rules also work for frequencies. In fact, it is working with frequencies
that it becomes easier to understand these rules. Let us then see how they
work. Just be careful to not forget that, although these rules have a nice
explanation with frequencies, they are also valid for general propositions
which might not be repeatable.
Let us continue to refer to our d10 rolling game. Dice rolling is a
good example for dealing with frequencies because, if we ignore the
inevitable but tiny differences appearing each time we roll them, we can
consider the rolling as a repetitive experiment.
Consider, for instance, the probability P(1 OR 2), which in our
notation means the probability of getting either a 1 or a 2 in the dice
rolling. We have seen before that, because 1 and 2 cannot appear at the
same time, this probability is simply given by

P(1 OR 2) = P(1) + P(2) = 1/10 + 1/10 = 1/5.

We can then attribute a frequency interpretation to this probability
by saying that the odds of getting either 1 or 2 are one in five. In fact, if you
ever went to horse or dog races, this is the terminology used in those
places.

Now, the Law of Large Numbers guarantees that if you roll your
d10 enough times, the relative frequency of each face will approach 1/10.
This means that if you roll it N times, with N very, very large, and if you
call n_i the number of times the i-th face appeared, then the frequencies

f_i = n_i / N

will become close to 1/10 as long as the dice is completely symmetric and
the rolling is fair enough. In mathematical notation, when a number
approaches another, we use a little arrow to indicate it in the obvious way
(as if one was going in the direction of the other). The statement that f_i
approaches 1/10 can then be written as

f_i → 1/10.
This should be valid for all faces if the dice is perfectly symmetric. If
you ever tried this kind of experiment, as I had to do in my first physics
laboratory at university, you know that no matter how many times you
throw the dice, the number of times you get each face will never be exactly
the same. The frequencies will become closer and closer to 1/10 for large
N, but they are never exactly equal.
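The Law of Large Numbers is easy to watch in action. The simulation below is my own illustration (Python's `random` module stands in for a fair d10): it rolls the dice N times for increasing N and prints the largest deviation of any face's relative frequency from 1/10.

```python
import random

# Watching the Law of Large Numbers: as the number of rolls N
# grows, the relative frequency of each face of a fair d10 drifts
# towards 1/10 without ever being exactly equal to it.

random.seed(1)

def frequencies(n_rolls):
    counts = [0] * 10
    for _ in range(n_rolls):
        counts[random.randrange(10)] += 1  # one fair d10 roll
    return [c / n_rolls for c in counts]

spreads = {}
for n in (100, 10_000, 200_000):
    freqs = frequencies(n)
    # Largest deviation of any face's frequency from the ideal 1/10.
    spreads[n] = max(abs(f - 0.1) for f in freqs)
    print(n, round(spreads[n], 4))
```

The printed spread shrinks as N grows, but it never reaches zero: exactly the behaviour described above.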
This variation away from the exact result, which will be present in
every actual experiment, is called, quite appropriately, the variance.
Variance is a term which indicates fluctuations away from some average
value. In this case, we can consider the exact probabilities as some kind of
average value, as frequencies always are. Usually, the variance decreases
with N, but sometimes it does not. Those cases in which the variance
refuses to go down are the ones in which the Law of Large Numbers does
not work and it becomes more difficult to work with frequencies directly. It
does not mean that we cannot find clever tricks that will still allow us to
work with frequencies, but the more tricks you use, the less distinguishable
from the Bayesian approach the procedure gets.
For those who like economics, variance is the same as volatility.
When market indicators vary too much, generating those graphs which
look like rugged surfaces instead of smooth curves, economists say that the
volatility is high. The ruggedness is nothing but fluctuations away from
average smooth values; in other words, variance. The picture below shows
two series of 50 numbers generated using a certain random rule such that
the average value of each series of numbers is zero. The difference is that
the blue sequence has a higher variance than the red sequence, meaning
that the blue numbers are more spread away from the average value (in
this case zero) than the red ones.

Two sequences of 50 numbers generated using the same rule with the only difference
being the variance. Both sequences have as their average value zero, but the blue
sequence has a variance equal to 3 while the red sequence has a variance equal to 0.05.
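The figure can be roughly recreated in a few lines. The snippet below is an assumed reconstruction (the author's actual generating rule is not given): it draws two sequences of 50 numbers with mean zero and variances 3 and 0.05, then measures how spread out each one actually came out.

```python
import random

# Two sequences of 50 numbers with mean zero: one generated with
# variance 3 (widely spread) and one with variance 0.05 (tightly
# clustered around zero), mimicking the blue and red series.

random.seed(42)
high = [random.gauss(0, 3 ** 0.5) for _ in range(50)]    # variance 3
low = [random.gauss(0, 0.05 ** 0.5) for _ in range(50)]  # variance 0.05

def sample_variance(xs):
    # Average squared deviation from the sample mean.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(round(sample_variance(high), 2))
print(round(sample_variance(low), 3))
```

The measured variances only fluctuate around the nominal 3 and 0.05, which is itself a small demonstration of the point: finite samples never reproduce the exact values.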
Let us assume now that we have been rolling our d10 long enough
to have variances so small that they can be safely ignored in practice.
When we reach that point, we can simply say that

f_i = 1/10.

Can you guess now how we would attribute a frequency to the
probability of getting either 1 or 2? We simply count the number of times
in which we got either 1 or 2. This obviously amounts to n_1 + n_2
(because they can never appear at the same time). To get the
corresponding probability, we calculate the relative frequency by dividing it
by the total number of repetitions, which gives us

(n_1 + n_2) / N = f_1 + f_2 = 1/10 + 1/10 = 1/5.
The AND rule is similar. We just need to count how many times
both 1 and 2 appear at the same time. For our d10 rolling game, this will be
obviously zero.
Frequencies are very closely associated with the Kolmogorov way
of defining probabilities via sets. It is not very difficult to see that counting
the relative ratios of elements in a set is completely equivalent to counting
frequencies. Just imagine that you write down on a piece of paper each
result of your experiment and put them all together in a bag. The bag is the
equivalent of the set and the written papers are the equivalents of the
elements of the set.
As you can see, thinking about frequencies is really easy, useful and
absolutely fine as long as you understand the limits of what you are doing.
Because it is such a straightforward way to visualise probabilities, most
scientists are extremely attached to the frequency way of thinking. The
devil is hidden in one detail: what exactly you consider to be a repetitive
experiment. It is clear that no experiment is exactly repetitive, but if we
require only a finite precision and allow for some unimportant variations,
repetitiveness can be roughly achieved most of the time.
We did that with the dice rolling. We ignored dice imperfections
and differences in throwing styles. Physicians do exactly that when they
tell someone the odds that some disease will kill a person. They count the
number of people who died from a disease and divide by the number of
people who contracted it. What they are doing is to consider one
repetitive experiment, called 'catching the disease', with two mutually
exclusive possible results: survival or death.
When they do that, however, they are clearly ignoring age, social
conditions, genetic predispositions and so many other differences in all
possible parts of this process. The frequency they get is not useless; it does
give some information, but one must be aware that this information is only
approximate. Because each person is different, the proposition 'John died
from that disease' is not a repeatable experiment, as John can only die once
and there will never be another John exactly equal to him. Still, by
compromising on some details, one can extract some information from
frequencies of similar experiments. This does not mean that physicians do
not know about the differences. The good ones do. So much so that, if you
push them a bit, they might be able to tell you the death rate according to
age, social conditions and so on.
The assumption when we use frequencies is always that similar
experiments should give similar probabilities. The trick is to know exactly
where the dissimilarities hide and understand when and how they can be
ignored. This depends, of course, on information about the details of the
experiment. In one way or another, be it via Maximum Entropy or
frequencies, we are once again trying to find a way to encode information
in the form of probabilities.

It Depends...
In the same article in which he mused about the role of mathematics in
nature (Wigner, 1960), the physicist Eugene Wigner wrote the following
observation (the underlining was added by me)
The above excerpt is more than 50 years old and it contains a
statement about one of the basic foundations of the Bayesian
interpretation of probabilities. I am not aware of how much Wigner knew
about Bayesian inference, but he correctly identified the important fact
that every piece of information we know, which includes what we
call the laws of physics, is conditional on some other previously collected
information. I have repeated several times by now that the
probabilities we assigned for our d10 work only under the condition that
the dice and the way it is thrown into the air are both fair, where we
defined 'fair' as a condition that summarises a series of more detailed rules
and specifications that are required to hold in the set-up of the whole
experiment or game. If the specifications determining the initial set-up (i.e.,
the information we start with) change, then we would have to review our
probabilities (i.e., the information we deduced about the possible results of
the dice rolling).
Let us keep everything as general as possible and deal once more
with (possibly non-repeatable) propositions. Consider the proposition A =
'The temperature tomorrow will be below zero degrees Celsius.' We
already know that we can assign a probability P(A) to this proposition and
we also know that this probability will be different if the conditions under
which we calculate it are different, just like in the case of our dice. For
instance, common sense dictates that P(A) should assume different values
in Brazil and in Antarctica.
Most of the time the conditions under which a probability is
calculated are not included in the notation P(A), where I am using the
symbol A as a placeholder for any proposition. That happens mainly
because we usually know what the conditions are and do not need to
be reminded of them all the time, but also because we want to save space
and writing time.
Sometimes, however, it becomes important or convenient to write
down explicitly some of the conditions on which a probability depends.
When we want to do that, we use the symbol | and call it the conditional
operator. In the same way as the AND and OR operators, the conditional
operator connects two propositions, but this time in a slightly more
complex way. If propositions A and B are connected by it, in a way that we
will see very soon, we write the combined proposition as A|B and call it a
conditional proposition. This name is just a way to make explicit the fact
that we are considering some condition, but always keep in mind that, as
stated by Wigner in his article, all propositions are actually conditional,
even if this is not explicitly said or written. Accordingly, whenever we find a
conditional proposition, the associated probabilities are then called
conditional probabilities.
Let us use the example of the temperature to understand how the
conditional operator is actually used. The proposition about tomorrow's
temperature depends on several pieces of information which we usually
take for granted, like the definition of 'temperature' and what day we
mean by 'tomorrow', but, as we have seen, it makes no explicit reference
to where that phenomenon is going to happen. If you are talking to a
friend, the place to which the proposition refers might even be
implicit, but let us assume that it is not. As I said before, if we change the
place, the probability should change accordingly.
In order to include the information about the place, we start by
writing it as two different propositions: B = 'We are in Brazil.' and C = 'We
are in Antarctica.' We can then use the conditional operator to construct
two different conditional propositions in the following way:

A|B = 'The temperature tomorrow will be below zero degrees Celsius given
that we are in Brazil.'

A|C = 'The temperature tomorrow will be below zero degrees Celsius given
that we are in Antarctica.'

The conditional operator | is usually read as 'given' (or 'given that',
when it makes more sense grammatically) and A|B is read as 'A given B'.
A|B is then a conditional statement and the probability P(A|B), the
probability of A given B, is a conditional probability. Mathematically, a
conditional probability can be calculated if we know both the probability of
A AND B and the probability of B irrespective of A. It is then given by
dividing the former by the latter:

P(A|B) = P(A, B) / P(B).
Here is a place where thinking about probabilities as the relative
amount of things inside sets, as used in Kolmogorov's Axioms, can be
helpful in order to visualise the meaning of this equation. Once again, if
you do not feel very comfortable with formulas, you might want to skip it
on a first reading, but I would strongly advise you to come back to it again
after some time.
The above formula for conditional probabilities is, of course, valid
not only for our propositions about climate, but for any two propositions.
This can include things like dice rolling or coin tossing, numbers in the
lottery or the chances of having a baby girl and a baby boy in a row.
To understand this, consider a set of children's LEGO bricks coming
in two shapes, which we will call squares and rectangles. As everybody
knows, LEGO comes in many different colours, but we will consider only
two, say red and green. Imagine a box with a total of 6 bricks, numbered
from 1 to 6: bricks 1, 4 and 5 are red and bricks 2, 3 and 6 are green, while
bricks 4, 5 and 6 are squares and bricks 1, 2 and 3 are rectangles. So 3
bricks are red and 3 are green, and 3 are squares and 3 are rectangles.
Although I briefly said that relative frequencies and sets were
related (but be aware that they are not the same!) in probability theory via
Kolmogorov's Axioms, I never really explained the connection in detail.
This is a good place to do that.
The link between these two things can be made by imagining a
simple experiment, namely, inserting our hand inside the LEGO box and
picking a certain brick from it with our eyes closed. As long as you put the
brick back in the box after you look at it, this is clearly a repeatable
experiment in which all the initial conditions are roughly the same. This
implies that we can count the number of times a certain result (which kind
of brick we picked) is obtained. Using a bit of common sense, which
means that we will throw away any irrelevant information, we can envision
a random drawing and agree that it makes sense to say that the probability
of picking a brick with certain characteristics must reflect the ratio
between how many bricks have that characteristic and the total number
of bricks in the box.
Because we are considering the LEGO box as being a set, we can
then say that each LEGO brick is an element of that set. If we draw the
bricks fairly enough (with all the implications and conditions you by now
should have learned to consider in the back of your mind) we can assign an
equal probability of being picked to each one of them. Because we have six
bricks, we assign a probability of 1/6 for each one of them. These results
are obviously mutually exclusive, as we already agreed that we can only
pick one brick at a time. We can attach to our bricks numbered labels
ranging from 1 to 6. The mutual exclusiveness then means that the
probability of picking, let's say, either brick 1 OR brick 2 is, according to our
rules, 1/6 + 1/6 = 2/6 = 1/3. For all practical purposes, if we ignore colours
and shapes, this part of the game ends up being basically the same as a d6
rolling. But when we consider the added information about these two
properties, things obviously change.
Let us call P(green) the probability of picking a green brick. It does
not matter whether it is a square or a rectangle; only the colour is important
in this case. After all we have said up to this point, it clearly makes sense
that, because there are 3 green bricks out of 6, this probability should be
given
by the fraction 3/6, or in other words, P(green) = 1/2. We can arrive at
this result also by using the OR operator. Picking a green brick is equivalent
to picking either brick 2 OR brick 3 OR brick 6, which by our rules should
give the result P(green) = 1/6 + 1/6 + 1/6 = 3/6 = 1/2.
The same reasoning gives the probability of picking a red brick as
P(red) = 1/2 (brick 1 OR brick 4 OR brick 5), of a square brick as
P(square) = 1/2 (brick 4 OR brick 5 OR brick 6) and of a rectangular brick as
P(rectangle) = 1/2 (brick 1 OR brick 2 OR brick 3). The fact that each one of
these probabilities is numerically equal is a coincidence. It happened
because the ratios between the number of bricks with the corresponding
particular property and the total number of bricks are the same in this
case. Most of the time, probabilities like these will be different.
Because probabilities are very subtle things, it is worthwhile to stop
a bit at this point in order to make an important observation. The reason
we could assign ratios to each one of the above probabilities was that we
assumed that each brick had the same probability of being picked. If we
want to calculate the probability of picking a brick with a certain property,
we just count the number of bricks with that property, let's call it n, and
divide it by the total number of bricks, N (in this case, N = 6). The
probability would then be

P = n / N,
which is misleadingly similar to assigning probabilities as frequencies! But
be careful! They are not the same! The above probability is not calculated
by repeating the experiment and is exact, without any fluctuation involved.
So remember: ratios are different from frequencies. There is a connection,
made once again by the Law of Large Numbers: if we repeat the
brick-picking experiment a large number of times, the relative frequencies
should approach the above ratio, but there will always be fluctuations in
the actual measured frequencies.
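The distinction can be made concrete with a few lines of code (my own sketch, not the author's). The ratio below is computed exactly by counting, with no experiment at all; the frequency comes from actually repeating the drawing, and it only fluctuates around the ratio.

```python
import random

# Ratio versus frequency for the brick box: three red and three
# green bricks. The ratio is exact; the frequency is an experiment.

random.seed(7)
colours = ["red", "red", "red", "green", "green", "green"]

# Exact ratio: just counting, no experiment and no fluctuation.
exact_ratio = colours.count("green") / len(colours)

# Measured frequency: repeat the drawing many times and count.
draws = 10_000
green_freq = sum(random.choice(colours) == "green"
                 for _ in range(draws)) / draws

print(exact_ratio)           # 0.5, exactly
print(round(green_freq, 3))  # fluctuates around 0.5
```

By the Law of Large Numbers, increasing `draws` squeezes the frequency ever closer to the exact ratio, but the fluctuation never disappears entirely.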
With that distinction cleared up, let us proceed, in the most ancient
tradition of probability theory, with a game. Suppose that I am playing a
guessing game with a friend and say to her: 'I've just picked a red brick; is
it a square or a rectangle?' If my friend knows the number of bricks of each
kind, her best guess will obviously be a square, because there are two
red squares while there is only one red rectangle. Instinctively, she is
calculating the conditional probabilities P(square|red), the probability of a
brick being square given that it is red, and P(rectangle|red), the probability
of a brick being rectangular given that it is red, and then choosing the
largest one. We all would do that, even if we did not know anything about
probability theory (which now you do). It is reassuring when the theory
gives sensible results when used in practice, as sometimes that might fail
to happen.
As you can see, it makes a lot of sense to consider the ratios of
bricks inside the box relative to a certain kind. For instance, out of a total of
3 red bricks, 2 are squares and 1 is a rectangle. Therefore, the ratio of
squares among the reds is 2/3 and that of rectangles is 1/3. The highest
probability case is obviously that of squares. The formula for conditional
probabilities that we saw before does exactly this calculation
automatically. We need that formula because we do not always have such
a clear picture of ratios when we are dealing with general propositions. The
formula does the job of encoding this procedure in such a way that we can
still find the correct result even if we cannot visualise sets or ratios.
It remains for us to see that the formula actually works for our
LEGO box. Suppose we start by calculating the conditional probability of
our brick being square given that we know it is red, or P(square|red).
According to the formula for conditional probabilities,

P(square|red) = P(square, red) / P(red),
remembering that we are using the comma as an alternative notation for
the AND operator acting on two propositions. In the above case,
P(square, red) is the probability of picking a square AND red brick. Of
course, the formula is
only useful if we know the probability of the combined case. In technical
language, when we put together two events with the AND operator, it is
common to call the resulting probability the joint probability of the two
events. In the above formula, we then have the joint probability of 'square'
and 'red'. Notice that here, unlike the faces of a dice, the two cases are not
mutually exclusive. Colour and shape are not exclusive properties of the
bricks, because every brick has both.
For our LEGO box, it is easy to calculate the required joint
probability. There are only 2 bricks out of the 6 which are square and red at
the same time, namely bricks number 4 and 5. Therefore,
P(square, red) = 2/6 = 1/3. We have already found the probability for a
brick to be red, P(red) = 1/2, and, therefore, we can put everything
together in our formula as

P(square|red) = (1/3) / (1/2) = 2/3.

This is, fortunately, the same result we found before. For this
simple case, we can check that the answer is correct right away by
inspection. Because probabilities should add to 1, and there are no other
shapes besides square and rectangle, it is obvious that the above value
implies that P(rectangle|red) = 1 - 2/3 = 1/3, which we know is the right
answer.
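For those who want to check the arithmetic mechanically, here is a small sketch (mine, not the author's) that encodes the LEGO box as data and applies the conditional probability formula directly, using exact fractions so no rounding gets in the way.

```python
from fractions import Fraction

# The LEGO box as plain data. The numbering follows the text:
# bricks 1, 4, 5 are red; 2, 3, 6 are green; 4, 5, 6 are squares.
bricks = {
    1: ("red", "rectangle"),
    2: ("green", "rectangle"),
    3: ("green", "rectangle"),
    4: ("red", "square"),
    5: ("red", "square"),
    6: ("green", "square"),
}

def prob(predicate):
    # Every brick is equally likely, so a probability is just a
    # ratio of counts, exactly as discussed above.
    hits = sum(1 for b in bricks.values() if predicate(b))
    return Fraction(hits, len(bricks))

p_red = prob(lambda b: b[0] == "red")                      # 1/2
p_square_and_red = prob(lambda b: b == ("red", "square"))  # 1/3

# The conditional probability formula: P(square|red) = P(square, red) / P(red)
p_square_given_red = p_square_and_red / p_red
print(p_square_given_red)  # 2/3, matching "2 squares out of 3 reds"
```

Changing the contents of the `bricks` dictionary lets you rerun the same check for any box you like; the formula keeps agreeing with direct counting.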
Once we understand what is behind a formula, we can start to
trust it (at least a bit). Back to our climate example, we see that in it we do
not have such a clear way to visualise the meaning of the conditional
probability formula. But there is one thing we know: if there is any way to
calculate the required probabilities, and if this way is correct, then we
should arrive at the conclusion that

P(A|C) >> P(A|B),
where >> is just the way we physicists say that one thing is much
larger than another, not just plainly larger, which is indicated by the more
common symbol >. This means that the probability of having subzero
temperatures in Antarctica must be much higher than in Brazil. If after all
we did we get a different result, something must be very wrong. The
information necessary to use the formula in this case is more difficult to
obtain. It must be inferred from measurements. But you can be assured
that, when all data is correctly gathered, the formula keeps working.
We have reached a point at which we have managed to construct
probabilities by encoding information using a mixture of common sense,
logical deduction and experience. In the process, we learned that every
probability we calculate is conditioned on what we know about the
situation. This conditioning is the defining property of what has been
conventionally called the Bayesian interpretation of probability, or simply
Bayesian probability.
Although everything seems extremely reasonable, what we have
done has been repeatedly called the subjective interpretation of
probability, a term which carries a partially prejudicial meaning. The
reason for this is the idea that, because every probability we calculate is
conditioned on the information we have about something, we are not
calculating an objective property of a system, but only a subjective point of
view about it. Persons with different information about a system will
calculate different probabilities. Change the proposition after the symbol
| and you change the value of the probability. Think about the LEGO box:
knowing that the picked brick is red changes the probability of it being a
square from 1/2 to 2/3.
A huge number of scientists feel very uncomfortable with this. It
seems nice that we have deduced the probabilities for the faces of our d10,
but where exactly is the connection with reality? How do we check if our
inferred probability is really correct? Of course, we have seen that we can
always connect our calculations with actual experiments using frequencies
if we allow for some degree of imprecision. However, many scientists and
mathematicians go as far as to say that the only real probabilities are
frequencies. This is the essence of the so-called frequentist interpretation
of probability, also called by its followers the objective interpretation of
probability in a clear attempt to make it look the only sensible
interpretation when compared to the Bayesian one. In this version,
probabilities can only be defined by frequencies of experiments, not by
encoded information. This is an appealing interpretation for natural
scientists as probabilities become measurable objective properties of some
physical system. If you are naive enough to not think too deeply, this would
seem the correct thing to assume.
It is not rare to hear people arguing that the frequentist
approach is the correct interpretation of probabilities because frequencies
are objective and the Bayesian view is subjective, and 'subjective'
is a blasphemous word in science. Although it is understandable that
scientists have some aversion to that word, especially after the post-
modernist pseudo-science madness, discarding something because of the
name it was given is nothing but a logical fallacy.
In order to see where the above arguments against the Bayesian
interpretation actually fail, we need to identify where and how subjectivity
enters our derivation of the probabilities. Let us do this for our friend the
d10. What people point to as an element of subjectivity in that derivation
is the fact that we chose the pieces of information we wanted to
encode in our probability, and someone with different choices of
information would arrive at a different probability. Therefore, frequentists
claim, the calculated probability is not objective in the sense of not being a
property of the system.
First of all, let us consider the issue of different persons with access
to different information calculating different numbers. After that, we will
talk about the issue of choosing which information to use.
Consider two persons, each one within an inertial frame, with
relative speed 0.9c with respect to each other. In physics language, this
means that the relative speed of the two observers is 90% of the
speed of light, which we call by the letter c. Relativity implies that if one
person measures a constant electric field in her frame, the other one will
measure that field as being a mixture of an electric and a magnetic field.
How subjective is this description of physics? Is the magnetic field
measured by one of the persons less real than the electric field? They are
different descriptions, both valid, of the same physical situation. What
causes the difference?
The different descriptions happen because the two observers use
different information to calculate things under different conditions, but if
we have two persons in the same frame they both will use the same
information and arrive at the same description. Even more than that, if the
person in one reference system knows exactly what information is
available to that in the other, this person can calculate (if she knows
enough physics) what the latter will describe and they will both agree if
they are using relativity theory correctly. This agreement, in fact, is the
defining property of objectivity of a physical model (we will talk more
about that in the next section).
Things get even worse when you go to general relativity, the
theory of gravity that incorporates the principle of relativity. The weirdest
example is the physical description of a black hole. If you imagine two
persons, one very far from a black hole and the other falling into it,
their descriptions of what happens are completely different. For the distant
observer, the person who is falling towards the hole never actually reaches
the event horizon, which is the surface around the hole beyond which,
once anything passes it, it can never escape. But wait! If the person never
reaches the horizon, how can she pass it and never return? This is where
things get strange. For the person who is falling, she will indeed reach the
horizon at some moment in time, but will never feel anything different
when she does! But, once again, given the same information, both
observers can
actually calculate correctly what the other one will be seeing. Admittedly,
things become more complicated when one adds quantum mechanics to
this situation, but that is not the point here.
Bayesian probabilities are like the relativistic descriptions above.
Two persons with the same information always calculate the same
probabilities. It might be acceptable to say that they are subjective in the
sense that persons with different information arrive at different
probabilities, but then you have to admit that relativity, and actually all of
science, is subjective too. I am personally not completely averse to the
word 'subjective', but I have to admit that, given the margin for
misinterpretation, it is a very inconvenient term to keep in scientific
discussions.

It is worth repeating here once again that associating frequencies
with probabilities is not contrary to associating information with
probabilities.
When one tabulates frequencies, one is simply collecting information
about an experiment or a system. In fact, counting frequencies is a way of
doing inference and, as we have already cited some time ago, Bayesian
inference will account for including this new information in our
probabilities. But if you remember the Law of Large Numbers, frequencies
can only approximate some ideal property of the system. An extrapolation
is always necessary and any extrapolation requires information coming
from outside the experiment.
If you start to roll our d10 and count the frequency of each face, at
some point you will see that the frequency of every face is almost the same
but not quite so. You might arrive at the result that the frequency of 1 is
0.1013 while that of 5 is 0.0989. That is when most of us will make the
jump and regard the differences in the frequencies as irrelevant. But that
information, the one that says that we are allowed to make this jump, is
not coming from that experiment alone, but from bits and pieces of
information that we know from everything else we have ever experienced
before!
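A quick simulation makes the point concrete. Rolling a simulated fair d10 many times gives frequencies that hover around 0.1 without ever matching it exactly (the seed and the number of rolls below are arbitrary choices of mine):

```python
import random
from collections import Counter

random.seed(42)  # arbitrary seed, chosen only for reproducibility
rolls = 100_000
counts = Counter(random.randint(1, 10) for _ in range(rolls))

# Each face comes out close to, but never exactly at, frequency 0.1.
for face in range(1, 11):
    print(f"face {face:2d}: frequency {counts[face] / rolls:.4f}")
```

Run it a few times with different seeds and the small deviations will move around, but they never vanish.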
The second issue concerns the choice of the information which you
want to encode in your probabilities. Yes, that sounds arbitrary and person-
dependent, but so is every experiment we carry out. In every experiment
there is an infinite number of things you might measure and an infinite
amount of choice in what precision is relevant and what can be thrown
away. We discussed that before. Choosing what to encode is equivalent to
making a decision about what is relevant to the situation you are
interested in understanding. Making those choices is, in the end, the only
sensible way to do science. But when two persons come together and
agree on what is relevant, they must get the same probabilities.
Although you will find a lot of physicists subscribing to the philosophy
known as Shut-Up-And-Calculate, the one saying that you do not need to
think too much about the philosophy behind a phenomenon if you can do
the necessary calculations, that is a narrow-minded attitude that we
cannot assume if we want to better understand the fundamentals that
underlie what we are doing. Without that, we are not much better than
simple spreadsheets whose only function is indeed to shut up and
calculate.
From the discussion of the previous section, many of you must now
be feeling a bad aftertaste in your mouths. The subjective versus objective
matter does not seem to be completely settled when we think hard enough
about it. If we can only deal with the information we can access about
something, what would we really call an objective property of a system? Or
am I conceding victory to the infamous post-modernists and saying that
there is indeed no objective reality?
These questions, like any really deep questions, are not settled
among scientists and philosophers. That is because even the meaning of
the words themselves is a bit uncertain. Most people will associate the
word objective with the word real in the sense that most scientists,
myself included, will claim that science is the search for the rules
underlying an objective reality. But the problem is that reality is probably
the most complicated concept to define in philosophy. You might think that
the answer is obvious: something is real if it exists. But that is just
carrying the problem to the next level. What do we mean when we say that
something exists? If you think carefully, you will see that sooner or later
you begin to walk on swampy terrain.
Most scientists assign an element of reality to sensorial models of
nature. Sensorial is a very important word here because that is what you
do as well, and also what I do if I am distracted and relaxed. Think about it.
Almost all things (if not all) that you are sure are real have a sensorial
model inside your mind. If you are able to see, the odds are that these
models are in their great majority visual. Even those things you cannot see,
like sound, microwaves and atoms, have a visual representation in your
mind. All of that is associated with some picture. You should right now be
imagining them. If you have been blind since birth, odds are that you have
different mental models associated with the other senses you possess.
They are probably associations with sounds or tactile sensations. All our
intuition about reality is based on models our brains create by
amalgamating information from our biosensors, be they eyes, skin, ears
or any other.
When one can detect things directly through one of these sensors,
it is easy to say that what one is detecting is real. What about things like
light polarization? By using polarization filters, we can block or let pass light
with a certain property that we call by convention polarization. The black
3D glasses (not the red and blue ones) are nothing but two polarization
filters, each one letting light from a different polarization pass through it.
The problem is that we cannot actually see or sense directly with
our naturally evolved sensors the polarization of light. What we can detect
is the presence or absence of light after we use the filter and then we
associate the polarization with this light/shadow result. Can we say that
polarization is real? Would you answer "No, because we cannot detect it
directly"? Although we cannot, bees actually can detect
polarization directly. They have naturally evolved sensors in their eyes that
can instantly detect it. Does it make any sense to say then that polarization
is real for bees and not real for humans? I am not sure about you, but for
me that sounds like nonsense. The sensible solution is to admit that things
that we can detect indirectly must exist as well, as long as we can measure
their effects.
Another example is that of colours. We all see colours and, indeed,
today we can associate colours with the frequency of the light we are
detecting. Each frequency, within some precision, is interpreted by our
brains as a different hue. However, nobody is exactly the same and the
precision with which each person perceives different colours can vary
according to details of each person's biochemistry. Even today, I still
disagree with some old friends concerning the colours of public telephones
in my city more than twenty years ago. Some of us say they were yellow,
others that they were green.

Public telephones in São Paulo, Brazil: Yellow or green? Don't rush, take your time.
The extreme case is that of persons who are colour-blind. It is not
that a colour-blind person sees in black and white, but due to a biochemical
glitch, they cannot correctly tell apart some colours. Consider a population
in which half of the persons are colour-blind. If you give this population a
colour chart and ask how many colours exist on it, half of the population
will disagree with the other half. How do you know which half is right? How
do you really know that the non-colour-blind half is not imagining colours
that do not actually exist?
The solution here can only be given in terms of consistency.
Consistency is the property of not having logical contradictions when
comparing information. The main argument when deciding on the issue of
colours is to say that if the non-colour-blind half is imagining the colours,
how can everyone imagine them in the same way? The solution which has
no logical contradictions, which is consistent, is the one in which it is the
colour-blind people who cannot see some of the colours the other half can.
But we can go even further once we have a theory that can associate the
frequency of the light with the colour. We can vary the frequency of a light
ray in an experiment in a systematic way and check that the colour-blind
people will associate the same colour to different frequencies, while the
other half will consistently attribute different colours to different
frequencies in the same way. This adds another level of required
consistency which is only satisfied by the description in which the different
colours seen by the non-colour-blind population actually exist.
Consider the examples about relativity I gave in the previous
section. In particular, the black hole one. The distant observer would see
the in-falling observer (the one falling into the black hole) never cross the
horizon, while the in-falling observer will think that she does, but feels
nothing special when doing that. Which description is right in this case?
The answer is that both are right from their points of view. That is because
all experiments that the distant observer can do will always be consistent
with her description, while all experiments that the in-falling observer can
do will also be consistent with her description!
This example concerning black holes is known as the black hole
complementarity principle. It should work for classical (non-quantum)
relativity, but when issues concerning black hole evaporation are
considered, which happens only when quantum mechanics is added to the
description of the phenomenon, it is not clear that things really happen
that way. However, purely classical relativity is a consistent model and, in
this case, the complementarity should be valid and, as we have seen
before, the two descriptions are also consistent in the sense that, if you
describe the whole situation for both observers in the same way, they can
calculate what each one of them will measure, and these calculations will
agree.
What happens is that, although each observer is in some sense
describing her own point of view in a different way, the whole physical
description is free of contradictions and the two observers arrive at the
same conclusions about each other if given the same information. Both
descriptions are consistent in the sense that there are no logical
contradictions. All measurements that can be done can be correctly
calculated by, let us say, a computer program fed with the same data. That
is all we can ask of these descriptions. In the case of (classical) relativity, we
have a consistent theory given by a mathematical description of the
situation. The same happens with our example of polarization.
Electromagnetic theory can describe it consistently, and that is all we can
ask.
At this point I might have convinced you that things that are
indirectly measured should be considered real because they are part of a
consistent description of nature. Therefore, things like polarization, the
gravitational field and electromagnetic waves should all exist because they are
part of consistent theories. Not only internally consistent, but these
theories are all consistent when compared with one another. That solves
the issue, right?
Well, that is when things start to get blurrier. Because our
biosensors are very limited, there are many more things that we cannot
detect directly than things we can. Science has used the power of our
brains to create mathematical models describing all those things that are
reachable by indirect detection. But what happens when all information
collected about a phenomenon leads to two mathematical models which
do not differ in their measurable predictions, but do differ in their internal
structure? Which model is real and which is not? Or can we say that both of
them are real at the same time?
What I am calling the internal structure of a model is the
mathematical entities it uses to describe the data of experiments.
Electromagnetic waves and fields are part of the internal structure of
electromagnetism. A spacetime fabric that bends and stretches in the
presence of matter and energy is part of the internal structure of relativity
(Greene, 2005). Energy is part of the internal structure of mechanics.
Two theories or models (the words are interchangeable) differing
in their internal structure might use different concepts. For instance, one
might not make use of the idea of energy in its description of a
phenomenon. What if, when measurements concerning the phenomenon
are done, both always predict the same results? If one of the theories uses
one concept and the other does not, what theory (and by extension what
concept) is real?
The first thing that might come to your mind is to invoke Occam's
Razor. This principle says that, if you have two valid descriptions of the
same phenomenon, you should keep the simplest one. In this sense, we
would be advised to check which theory is the simplest of the two and
throw away all but the least complicated model. The problem is that, as
helpful as it is in practice, Occam's Razor is just a guideline; it cannot be
used to actually state that the simplest theory is the one which is real
while the others are not. It cannot separate the right theory from the
wrong one. It is a principle of practicality, but not of reality.
If you are interested in doing calculations and shutting up, then
Occam's Razor might be enough for you, but it cannot answer the question
concerning the reality of either theory. Most scientists, contrary to
philosophers, would dismiss the question altogether. But lack of interest is
not an excuse in this case. We must at least know if the question is sensible
before we discard it!
One of the possibilities is that, when we put all theories of nature
together, there is only one description which will be fully consistent. It
might be that, whenever we have two different theories that describe
some piece of nature equally well, one of them will end up being
inconsistent with other natural phenomena. At this point in time, what
might be happening is that we still cannot single out one theory among the
others simply because we do not know all natural phenomena yet. That is
indeed possible and many physicists have hopes that this is the truth.
However, the prospects for that are not good. In fact, if (and this is a big
"if") some hypotheses about high-energy physics turn out to be correct, we
might be forced to admit that two different descriptions of nature, with
different elements, cannot be ruled out! This is such an important thing
that it deserves a digression.
There are many problems in physics that have not been solved yet. Our
current theory of gravity and mechanics for things that are large and heavy
is general relativity, and it is consistent as long as we do not try to apply it
to things that are too small. When things get small, we use a theory called quantum
field theory, which is a sophistication of quantum mechanics that includes
some aspects of relativity. However, in phenomena where these two
theories meet, they are unfortunately inconsistent. Black holes, as I
mentioned briefly before, are an example of phenomena where this
happens. This obviously means that one or both of them are not
completely right, only approximately right.
Currently, we do not know the solution, but we have some working
hypotheses which are, however, not confirmed yet. You surely have heard
about string theory and might or might not have heard about loop quantum
gravity. There are others, which differ in their details. They all result from the
attempt to find a consistent description of phenomena where general
relativity and quantum field theory meet.
It turns out that, from many works on these theories and on these
borderline problems, some hints of what we call a duality have
appeared. A duality is a connection between two things in which, if you
know one of them, the other is completely defined. You can say in a sense
that numbers of the form x and 1/x are dual as long as x is not zero
(or if you allow infinity as a number), although this is not very conventional.
The point is that there is a very precise procedure to find one if the other is
known.
The duality that seems to exist in physics is not completely proven,
but it is a conjecture against which there is no counterexample yet,
although it must be said that it is difficult to find many complete examples
as well. This duality is called the Holographic Principle. In its most technical
realisation, it is known by the strange name of AdS/CFT Correspondence,
or the Maldacena Conjecture in honour of the Argentinean physicist Juan
Maldacena who first found a mathematical model where it was satisfied.
The principle states that we can describe all physics either by a
theory of quantum gravity in 3 spatial dimensions and 1 time dimension or
by a theory of fields which does not contain gravity and has 2 spatial
dimensions and 1 time dimension. The holographic part of the name
comes from the fact that it states that all information to describe the 3D
(excluding the time) world in which we live is contained in a 2D surface,
which according to the principle is in fact the boundary of that same 3D
space.
In order to understand more or less how this works, consider a
sphere like the one represented in the picture below.
Imagine that this is a glass sphere completely filled with jelly. The
jelly is the interior of the sphere and is called its bulk. The glass,
corresponding to the surface of the sphere, would be its boundary. The
correspondence says that everything that happens in the bulk can be
described by a theory that concerns only the boundary.
Because the principle is a duality, both descriptions are exactly
equivalent. Neither is better than the other. In this case, even Occam's Razor
cannot save us, as some calculations are easier in one theory and harder in
the other. There is no way to decide which one is the simplest.
You might say "Okay, let us admit that both theories are real if they
are consistent. What is the problem then?" Answer this question: is gravity
real? Gravity exists in the bulk theory (the 3D-space one) but not in the
boundary theory. Each of these theories, although they are dual, can stand
apart on their own entirely. What would you say then?
Again, of course, it might happen that this is a false duality and in the
future a counterexample will be found, but there is a possibility that it
stands up to experimental validation. What happens if it does?
Should we give up on reality, objectivity and related concepts? Fortunately,
not yet. But this leads us to think about demoting elements of the theories
from being real. Maybe only the interactions themselves can be said to be
real. But how do we know we can stop there?
So where exactly do we stand? One of the greatest lessons of being
a critical thinker is that you have to accept that some questions are difficult
and you have to live (at least for a while) in doubt about the answer. There
is nothing wrong about saying "I don't know". We have to find a way to
keep going with our explorations of nature by going around the
difficulty until we have more information helping us to decide.
If we still do not know when something exists or is real at a
fundamental level, how can we characterise objectivity? How can we say
that science is objective, but religion is not? Fortunately, consistency is all
that we need for all practical purposes. Although we cannot completely
characterise what the essence of a fundamental objective reality actually is,
we can at least assume that, if something objective exists, it must have the
basic property of consistency. This is actually the property we will require
in this book to characterise objectivity. You may feel that it is a weak
definition, but think deep enough and you will hardly find a better working
one.
Is it wishful thinking that objectivity, and ultimately reality,
should require consistency? We still cannot discard that possibility, and we
do not know if we will ever be able to do so, but we must draw a line from
which we can start. The consistency requirement seems to be the most
reasonable place to put it, if not the only reasonable one. Whether this is the
correct concept can only be known by always challenging the concept
itself, always keeping in mind the possibility that it might not survive.
Nobody said that understanding reality would be an easy task.
Probability Zoo
Before we actually enter the zoo, it is worthwhile to locate ourselves and
find out how far in our journey we have come. In the last chapters, I have
tried to introduce you to the idea of probabilities. Do you remember
why? If not, look at BAYES again:

BAYES is the program that would allow us to carry out inference
and, as you can clearly see, it depends on probabilities. When we started,
we did not understand what the symbols entering and leaving the program
meant. Now we do. We can understand now that, in order for BAYES to
calculate the conditional probability of a model encoded by the proposition
M given the data encoded by the proposition D, which is what we want, it
has to be fed with the conditional probability of the data given the model
and the prior probability for the model.
What we learned is that it is not trivial to go from the actual
propositions M and D to the probabilities P(D|M) and P(M). These are,
themselves, procedures that can be seen as separate programs. In fact, we
could expand the above picture into the following one

The program temporarily called ??? will be given a name later on. The
program PRIOR takes a proposition M and calculates its prior probability. As
we learned before, this is a bit misleading as, to calculate a prior, we
always need some more information than simply the proposition M.
Consider it as an abbreviated way to represent it. We have seen some very
simple examples of priors, but there are in fact an infinite number of
possibilities for the mathematical object spat out by PRIOR. The walk in the
zoo of this chapter will show you the most common of them.
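As a sketch of what a program like BAYES does with these ingredients, here is a minimal version in code. The two candidate models and all the numbers are invented for illustration; only the structure matters: multiply the likelihood by the prior and normalise.

```python
# A minimal sketch of the BAYES "program": fed with the likelihood
# P(D|M) and the prior P(M) for each candidate model M, it returns
# the posterior P(M|D). The models and numbers below are made up.

def bayes(likelihoods, priors):
    """Return the posterior probability P(M|D) for each model M."""
    joint = {m: likelihoods[m] * priors[m] for m in priors}
    evidence = sum(joint.values())  # P(D), the normalising constant
    return {m: joint[m] / evidence for m in joint}

likelihoods = {"fair die": 0.1, "loaded die": 0.25}  # P(D|M), invented
priors = {"fair die": 0.9, "loaded die": 0.1}        # P(M), invented

posterior = bayes(likelihoods, priors)
print(posterior)
```

Feeding it P(D|M) and P(M) for each model returns P(M|D), exactly the flow described above.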
Have you ever heard about the bell curve? Long-tail distributions? Have you
ever read The 4-Hour Workweek by Tim Ferriss (Ferriss, 2011) and
become amazed by the Pareto distribution? If you have, you might want to
know that we are at a stage in which we can start to understand what they
mean. But before we can do that, we have to understand some very simple
but deep facts about numbers.
Dice are very simple objects and they have a certain definite
number of faces, which translates to a certain definite number of results in
a dice roll. If the dice is small enough, like the d10, we can count the
number of faces or results on our fingers. The fact that we attributed
numbers to these faces is, in a sense, irrelevant. We could label them in
any way we wanted, with letters or shapes for instance. Because in this
situation there is an end to the number of results we can get, we say that
this number is finite. Therefore, dice have a finite number of faces and we
have a finite number of fingers.
Collections or sets of finite things can be counted using natural
numbers. These are the first ones we learn: 0, 1, 2, 3 and so on. Although
we do not realise that when we are children, at some point we notice that
the natural numbers form an infinite set, which means that they never
end. If you tell me any natural number, no matter how high it is, all I need
to do to get a higher one is to add 1 to yours. This process never ends.
The second characteristic that we need to know about natural
numbers is that they are something called discrete. For numbers, this
means that, if you take two natural numbers in sequence like 2 and 3, there
is no other natural number between them. The number 2.5 does not count
as a natural number. Only integer numbers count as natural. In terms of
objects, if you can separate any two objects of your set, you have a discrete
collection of them. We then say that the set of natural numbers is discrete,
but infinite.
The natural numbers are very important for us because they are
what we use to count things. The act of counting can in fact be rigorously
defined as the act of associating things with natural numbers. The idea is
that, when you count objects, it is like assigning to them labels, with each
label being a non-zero natural number. This is so important that
mathematicians give a name to this. A set to which we can assign a label
corresponding to a natural number to each element is called a countable
set. Discrete sets are clearly countable.
Let us consider finite discrete sets of numbers. If you remember
your school days, or if you are still there, you must know that sets are
represented by objects inside curly brackets. If we consider the set of faces
of a d10, we can write it as

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Does it ring a bell? This was the sample space of our dice rolling
game for the d10. This set is finite and discrete. The size of a set is called its
cardinality. In this case, the cardinality is simply 10. It is quite easy to
attribute probabilities to each one of the elements of a finite discrete set.
In fact, when our sample space is given by a discrete set, be it finite or not,
we can always talk about the probability of each element of this set. We
saw that with the d10, in which the probability of each face was 1/10.
The nice thing about probabilities on discrete sets is that we can
easily visualise them graphically by points in a graph. A graph is basically an
aid to visualise tables of numbers with two columns. Let us see an example.
Our d10, with all equal probabilities, can be represented by the graph

This graph, or plot, is a graphical representation of the following
boring table
Face Probability
1 1/10
2 1/10
3 1/10
4 1/10
5 1/10
6 1/10
7 1/10
8 1/10
9 1/10
10 1/10

The rules for plotting the graph are:
The graph has two axes (plural of axis) which are the thick lines
with arrows on their tips that cross each other. If nothing is
specified, the standard assumption is that they cross at the point
where both are zero.
The horizontal axis corresponds to the first column, while the
vertical axis to the second.
The arrows indicate the direction in which values increase.
Each line of the table corresponds to one of the points in the graph,
which I have exaggerated to red circles.
To know which values of the table each point represents, put your
finger in the point and slide it downwards until it touches the
horizontal axis (follow the dashed lines downwards). This point
corresponds to the value in the first column. Now do the same, but
go towards the vertical axis, which gives you the value in the
second column.
The table and the graph are two different ways of representing the
same object which is, in this case, a probability distribution. When you are
attributing probabilities to elements of a discrete set, the values in the
second column of the table, or on the vertical axis, represent the actual
probability of each element. We will many times use the term discrete
distribution for a probability distribution defined on a discrete set.
A more interesting (less boring) probability distribution, would be
the following one:

This time you can see that the probabilities for each face are
different. If this is a d10, it is clearly loaded; so much so that faces 1 and
10 will never turn up! I will leave to you the task of obtaining the table for
this probability distribution based on the graph above.
There are some standard names describing properties and pieces
of probability distributions. One of them, which is worth knowing before
we proceed, is the mean. The mean, as most people already know, is what
you obtain if you multiply each element of your set by its probability and
add them up. In the case of equal probabilities for all elements, the mean
gives the same as the usual averaging process of adding everything and
dividing by the total amount of numbers. For a fair d6, this would be given
by

(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Notice that the mean does not need to be a natural number! If
you plot the graph of the distribution for a fair d6, you have the following

The red line I added is marking the place where the mean is
located. You can see that the red line does not correspond to any point in
the distribution and, therefore, is not part of its graph. There is no way for
you to roll a d6 and get 3.5 as a result. The mean, in case of discrete
distributions, just indicates the place of a hypothetical average element
of your set.
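The recipe above, multiply each element by its probability and add everything up, is short enough to check in a couple of lines. A minimal sketch for the fair d6:

```python
# Mean of a discrete distribution: multiply each element by its
# probability and add everything up. Here, a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6

mean = sum(face * p for face, p in zip(faces, probabilities))
print(mean)  # 3.5, even though no face ever shows 3.5
```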
There are more properties of probability distributions with strange
names like kurtosis and skewness, but we will not need most of them.
Those we do, I will introduce as we go along.
Everything looks good, but there is one thing with which we always have to
be careful: infinity. Nothing prevents countable sets from being infinite.
In this case, we usually will not be able to write full tables or plot the entire
graphs of the probability distributions for obvious reasons, but we can still
do it for parts of them, or write down a rule as an equation. To see this,
suppose we have a dice with N sides, all with the same probability. We
already know that, in this case, the probability of each side should be 1/N.
Now, if these N faces were just the first N sides of an infinite-sided dice,
we could still have a probability distribution such that these faces have
probability 1/N and all others have zero probability. We could write this as

P(X = x) = 1/N for x between 1 and N, and P(X = x) = 0 for x greater than N.

Remember that here X is a random variable that represents the
proposition "the result of the rolling is..." The value one gives to X
corresponds to a possible face of the infinite dice. This is a completely legal
probability distribution that can be assigned to the whole set of non-zero
natural numbers. It is a bit of cheating, of course, but allowed nevertheless.
If we sum the probabilities for all natural numbers, the result is 1, as it
should be. Each probability is also positive and between zero and one, as
they should be.
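You can verify both properties directly. The sketch below uses N = 10 and checks the faces one by one (both numbers are arbitrary choices):

```python
# The "infinite dice": probability 1/N for the first N faces and zero
# for every other natural number. N = 10 is an arbitrary choice.
N = 10

def p(x, n=N):
    """P(X = x) for the infinite dice."""
    return 1 / n if 1 <= x <= n else 0.0

# Sum over a long stretch of natural numbers: only the first N contribute.
total = sum(p(x) for x in range(1, 10_001))
print(total)  # 1, up to floating-point rounding

# Every individual probability lies between zero and one.
print(all(0 <= p(x) <= 1 for x in range(1, 10_001)))  # True
```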
The above distribution is clearly very unbalanced. What if we want
a probability distribution that gives the same probability to all natural
numbers? That would be equivalent to the following situation. A friend
of yours asks you to think about a natural number. Any natural number. If
your friend guesses this number in a completely unbiased way, what would
be the probability of guessing it right? We can find it by a process named
taking the limit. We have seen that when we have N numbers, the
corresponding probability would be 1/N. We then start with N = 1 and
keep increasing it to see if we are going somewhere:

N 1/N
1 1
2 0.5
4 0.25
10 0.1
1000 0.001
1000000 0.000001
The more we increase N, the smaller the probability of guessing
a number correctly becomes. If N goes to infinity, the actual size of the set
of natural numbers, the above table clearly shows that the probability should
go to zero! So, the probability of guessing any natural number correctly is
zero! But now comes the strangest thing. All probabilities go to zero, but
their sum stays always equal to 1! To see that, notice that if you add N
copies of the probability 1/N, this is the same as multiplying N by 1/N. But
no matter how large N is, this is always 1!
You can now appreciate how tricky it is to deal with infinite things.
If you are careful enough, though, you can live with it. The important thing
is that, once again, your probabilities should add to 1 and be between zero
and 1. If you are worried because a sum of an infinite number of numbers
will not be finite, look again at the result of the previous paragraph. What
happens is that, if each term of the sum is small enough, then the sum
might be finite. We say, in mathematics lingo, that the sum converges to 1.
The list below contains some other infinite sums that converge,
some of them to values other than 1. You can get an idea of how this
happens by using your computer to keep adding each successive term to
see how the total approaches the final result, and try to guess what it is.
Have fun.
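As one concrete example (my own choice, not necessarily one from the list): the sum 1/2 + 1/4 + 1/8 + ... converges to 1, and you can watch its partial sums settle down:

```python
# Partial sums of the geometric series 1/2 + 1/4 + 1/8 + ...,
# which converges to 1.
total = 0.0
for n in range(1, 31):
    total += 1 / 2**n
    if n in (1, 2, 5, 10, 20, 30):
        print(f"after {n:2d} terms: {total:.10f}")
# The totals creep closer and closer to 1 without ever exceeding it.
```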

With Anatomy 101 cleared, we can now start our taxonomy classes. Of
course, we will not be able to go through mathematical details. If you are
interested, the standard introductory book, which contains a lot of
examples, is Feller's book (Feller, 1968). You can get more information
about specific distributions either in Wikipedia or in Wolfram's MathWorld
(see the appendices for a list of Internet links).
Our first probability distribution is a discrete one called the Poisson
Distribution. Named after the French mathematician Siméon Denis
Poisson, this is one of those probability distributions which appear
everywhere you look in nature. Its usual shape is given by the following

As an aid to the eye, I have connected the points and added some
colour to the space below each one of the above three curves. Each curve
represents a Poisson distribution, the only difference between them being
their means. Notice that the probabilities become so close to zero after
some point that they cannot be seen in the graph. The larger the mean of
the distribution, the more to the right this point is.
Suppose you have a certain time interval inside of which a certain
number of events might occur. One usual case is that of people arriving at a
certain shop. If people choose randomly and with the same probability to
go to the shop and the average amount of people arriving at any specific
hour is always the same, then the number of people at any given hour is
roughly given by a Poisson distribution!
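The shop example can be checked with a short simulation. The numbers below are made up for illustration: 1,000 potential customers, each deciding independently with probability 0.003 to visit in a given hour, so the matching Poisson distribution has mean 3:

```python
import math
import random

random.seed(1)
n_people, p_visit, hours = 1000, 0.003, 2000

# Count, hour by hour, how many of the 1,000 people decide to show up.
counts = [sum(random.random() < p_visit for _ in range(n_people))
          for _ in range(hours)]

lam = n_people * p_visit  # mean of the matching Poisson distribution: 3
for k in range(7):
    empirical = counts.count(k) / hours
    poisson = lam ** k * math.exp(-lam) / math.factorial(k)
    print(f"P({k} arrivals): simulated {empirical:.3f}, Poisson {poisson:.3f}")
```

The smaller the individual probability and the larger the crowd, the better the Poisson approximation becomes.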
That works with space intervals as well and, more interestingly,
with areas. A very interesting situation in which you can find the Poisson
distribution is in the case of colonies of bacteria in a Petri plate. Feller's
book (Feller, 1968) has this example. Colonies of bacteria will appear as
spots in the jelly inside the circular plate. If you take a picture and draw a
square grid over the picture, you can count the number of squares in your
grid with spots in them. This number approximately follows a Poisson
distribution and it is not difficult to check it in real experiments. Of course,
the distribution is approximate if drawn in this way, but it is very close!
Zipf's law or Zipf's distribution is a discrete version of a whole class of
distributions which are called power law distributions. Oddly enough,
these are also very common in nature. For instance, the probability of an
earthquake having a certain energy is given by a power law, although a
continuous one (we will talk about continuous distributions soon). The
name power law comes from the fact that the probability for a certain
value n is proportional to a negative power of n. For instance, the power law

P(n) ∝ 1/n^2

has a power 2.
I said proportional because we always need to guarantee that our
probabilities add to 1. Because Zipf's law is discrete, n must of course be a
natural number. The most common place where this distribution appears is
in the frequency of words in a certain language.
Suppose you list the words of a certain language in order of usage,
giving the most used one the rank 1, the second most used the rank 2 and
so on. If you choose a text at random, the probability of finding in it a
word whose rank in the above list is n will be proportional to

1/n^s

where s is a power that depends on the specific language. For most known
human languages, the number s is very close to 1. There are many
suggested explanations for this phenomenon, but none is perfect in
accounting for it completely.
Depending on the value of s, Zipf's law can have very strange
properties. For instance, when s = 1 we cannot make the probabilities add
to 1! That is because the sum of the inverses of the natural numbers can be
shown to be infinite. That is why, in general, Zipf's law is only considered
up to a certain maximum natural number N. For s larger than 1 this
problem does not appear and N can even be infinite. However, if s lies
between 1 and 2, it is now the mean of the distribution that becomes
infinite if N is also infinite! This means that there is no mean value. This
can be seen by using a computer to generate numbers according to this
distribution: if you try to average them, your average will increase forever
and never settle on a definite value.
Finally, another curious thing appears for N infinite and s between 2
and 3. In this case, if your computer generates numbers according to this
distribution and you try to calculate their average, although mathematically
the mean exists, you will never get to it by this method! Your calculated
average will keep jumping from one number to another without any
apparent direction. This is because, in this case, the deviations away from
the mean are so large that they are never averaged away.
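You can check the s = 1 problem numerically. The sketch below compares the partial sums for s = 1 (the harmonic series, which grows without bound) with those for s = 2 (which converge to π²/6):

```python
import math

# s = 1: the harmonic series; its partial sums grow without bound (like log N).
harmonic = sum(1 / n for n in range(1, 100001))
# s = 2: the partial sums settle down, converging to pi^2/6 = 1.644934...
inverse_squares = sum(1 / n ** 2 for n in range(1, 100001))

print(f"sum of 1/n   up to 100,000: {harmonic:.4f}")
print(f"sum of 1/n^2 up to 100,000: {inverse_squares:.6f}")
print(f"pi^2/6 for comparison:      {math.pi ** 2 / 6:.6f}")
```

Doubling the number of terms barely moves the second sum, while the first one just keeps climbing.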
Before we take a look at the next species, we need to understand the
concept of the continuum. Discrete probabilities are straightforward
beasts. You simply give numbers to each one of the elements and these
numbers represent the probabilities of drawing that element according to
the valid rules of the game. But not all probabilities can be expressed as
discrete distributions. Think about the real line as in the picture below

If the only positions for points on the line were at the marks from 1
to 10 and the probability of finding a point at these positions was the
same, we know the result: the same as our d10. Suppose now that a point
can fall in any of the first 10 intervals between the numbers in the above
line. If the probability of a point falling in each interval is the same, then we
have the same result again if we ask what is the probability of finding a
point in the n-th interval, with n ranging from 1 to 10.
But what if I say to you that there is the same chance of a point
falling anywhere in the interval from 0 to 10? What is the probability of
finding this point at exactly, let us say, 2.013?
We can answer this question by an approach we used before:
taking a limit. If only the 10 positions marked by the numbers 1 to 10 are
allowed, then the probability is surely 1/10. If we want to allow any place in
the interval, we should increase the number of allowed points. Let us say
that we put an extra point in the middle of each interval. Now we have 20
possible positions, all with equal chances, and therefore the probability is
1/20. You can easily see that, in order to cover the whole interval, we need
an infinite number of points. Therefore, very similar to the case of an
infinite number of discrete points, the probability for a point falling in one
exact place of the interval from 0 to 10 is zero! However, it is clear that the
chance of a point being in one of the ten intervals will still be 1/10. How
can we deal with that?
The interval from 0 to 10 is said to be continuous because there is
always another point between any two points in this interval, contrary to
the discrete case. When this happens, talking about the probability of a
certain value is always going to give the result zero, because the number of
points in any continuous interval is infinite, no matter how small the
interval is.
But why can't we do something like what we did with the Poisson
distribution or Zipf's law? Both these distributions can be applied to an
infinite number of points as long as each new point has a probability that
becomes ever smaller, which allows others to have non-zero probabilities.
The difference is that the types of infinity you have in these two
cases are different. Believe it or not, the infinity of the continuous interval
is larger than the infinity of the natural numbers! Although this might seem
confusing, this is related to the concept of counting that we saw before.
I said that counting is the same as associating objects with natural
numbers (0, 1, 2, 3, 4, ...). It turns out that you can extend this concept to
infinity. If you can associate a sequence of objects, even if it is infinite, with
the natural numbers, both infinities must be of the same size! Here is an
example. As hard as it is to believe, the amount of even numbers is exactly
equal to the amount of natural numbers. To see this, notice that to every
even number you can associate a natural number corresponding to its half:
0 2 4 6 8 10 12 14 16 ...
0 1 2 3 4 5 6 7 8 ...

Using the rule in the above table, you can associate a natural
number to every even number and vice-versa. Therefore, you have to have
the same amount of them! How can this be possible? Because these two
amounts are infinite, and of the same kind. Every infinite sequence of
numbers that can be put on an exhaustive list has the same amount of
elements as the natural numbers. We call this the cardinality of the set.
The cardinality of the integer numbers, when you also count the negative
ones, is also the same as that of the natural numbers and also the same as
that of the rational numbers, which comprise all fractions you can make
out of integers! This infinite number is given a very special symbol: ℵ₀.
This is the first letter of the Hebrew alphabet with a subscript zero
attached to it and is called aleph-null. The theory concerning different
types of infinities was developed by the German mathematician Georg
Cantor by the end of the 19th century. He called them transfinite numbers.
The cardinality of any continuous interval is different from ℵ₀.
Cantor proved that with a very ingenious method which is today called
Cantor's Diagonal Slash, quite a catchy name. The details need some
mathematics, but the idea is that, if you try to organise the real numbers
inside any interval (finite or infinite) into a list, you will always end up
missing one or more. Because of that, their amount, or cardinality, must be
larger than that of the integers. This cardinality has its own symbol, which
is far less fancy than the one for the integers: c, naturally standing for
continuum.
You might be asking right now whether there is any cardinality in
between ℵ₀ and c. The answer is, literally: you decide. The assertion that
there is not is called the continuum hypothesis and it was proved to be
optional in mathematics. The foundations of mathematics are usually
stated in a series of axioms defining how sets work. These are called the
Zermelo-Fraenkel Axioms. In 1963, the American mathematician Paul Cohen
proved that you can choose either the continuum hypothesis or its
negation as an independent axiom without affecting the usual set theory,
and he won the Fields Medal, the most coveted mathematical award,
for this.
What all this means is that, in the end, we cannot do to continuous
intervals the same trick we used to define probabilities for an infinite
number of discrete points. What is the way out then? How can we talk
about the probability of a point in a continuous interval? Short answer: we
cannot. We can only talk about probabilities of intervals. But one thing we
can do is to talk about probability densities. Let us see how this works.
In physics, to find the density of a substance you divide the mass of
a certain quantity of the substance by its volume. We usually write this as
ρ = m/V, using the Greek letter ρ, called rho, for the density, the letter m
for the mass and the letter V for the volume. The density of lead is higher
than that of water because the same cup filled with water will be much
lighter than if it is filled with lead.
Notice that we cannot talk about the mass of a point of a
substance, because when we divide the substance enough times, we enter
the subatomic realm and there is no sense in talking about the same
substance anymore. In a sense, we consider only large chunks of
substances such that we can calculate masses.
But although it does not make sense literally, within practical limits
we usually talk about continuous substances. Consider water once again.
We usually imagine water as a continuous material and forget that it is
made of molecules and empty space. The same is true for a block of solid
substance. As long as we remain far from the subatomic domain, we can
consider it as approximately continuous. We still cannot talk about the
mass of a point, because points have zero volume and, therefore, zero
mass, but we can talk about the density at that point. How? Once again, we
take a limit.
We start by weighing the block and dividing the mass by the total
volume. This gives us the average density for the whole block. If the
substance is the same everywhere, like our fair d10 should be, then the
average density is the same as the density at each point and we are done. If
not, we have to choose the point in which we are interested and do the
following. We divide the block into two pieces and calculate the average
density for the piece containing our point. Then we do it again and again. If
we are lucky enough, these values will start to converge to some fixed
value: as we divide the block into smaller pieces, the successive average
densities (in kg/m³) agree with one another in ever more decimal places.
If we need a precision of only 3 decimal places, we can take the
density at our point to be 1.260 with a good approximation. That is exactly
what we do to define probability densities. The only difference is that mass
becomes probability and the volume becomes the length of the interval.
For our previous interval from 0 to 10, this works fine as we are
now going to see. Suppose that every point in the interval is equiprobable.
That means that the average density (not the probability!) should be equal
to the density at every point. We know that the total probability (mass) in
the interval must be 1. Therefore, to obtain the probability density p(x) at
the point x, we just divide the total probability (mass) 1 by the length of
the interval (volume) to get

p(x) = 1/10

which is called a uniform probability density, as it is everywhere the same.
Unfortunately, most texts will use the letter p both for the
actual probability and for the density, and I will do the same. The difference
should be clear from the context, but as a rule of thumb, whenever we are
dealing with continuous variables, we will use densities and we will use
probabilities for discrete variables.
The above formula then gives the probability density for the points
of our interval and it is all we can say about the individual points. In order
to calculate probabilities, not densities, we need to choose intervals. Now,
the actual density of pure water at some standard temperature and
pressure is about 1,000 kg/m³. If you want to know what the mass in a
volume comprising 2 cubic meters of water is, what do you do? You
obviously multiply the density by two, right? What about in half a cubic
meter? You divide it by two. In any case, you multiply the density by the
volume to obtain the mass. Guess what you do to obtain the probability of
a smaller interval. Correct. You multiply the probability density by the
length of the interval.
Therefore, the probability of being in an interval from 0 to 1, or from 2 to
3, is just the length 1 times the probability density, which gives 1/10,
exactly as we calculated from symmetry arguments!
But what happens if the density changes from point to point? That
is a good question. The answer has the threatening name of an integral,
but it hides an almost trivial idea: easy to understand, although it can be
difficult to calculate in general. For us, the idea is enough.
The graph below is a graphic representation of our uniform density
from 0 to 10:

Instead of 10 points, now we have a straight red line. That is
because the probability density is defined at every point in the interval, not
only for the natural numbers. If you look at the whole graph, you can easily
see that the region delimited by the density, the two axes and the dashed
green line coming upwards from the point 10 at the end, is a rectangle. The
area of this rectangle is 10 × 1/10 = 1, which is the total probability of
being at any place in the interval. This is no coincidence; it is always true.
Suppose we have a different density given by the graph below

For the above red line to be an acceptable probability density, the
total probability represented by it must add to 1. As I said before, this total
probability is the area delimited by the density and the axes in the whole
interval. In the above case, this area is the area of the orange triangle in the
picture. This implies that the area of the triangle must be 1 and, therefore,
the height of the triangle (marked with a question mark) has to be such
that the base times the height, divided by two, gives an area of 1.
What is the probability of a point falling inside a certain interval
when you have the shape of the probability density? It is just the area
below the density in that interval! Let us go back to our uniform density. If
you look at its graph, there are three regions painted with different
colours. The first region, in red, is the area below the density in the interval
[1,2] (try to remember our notation for intervals). Therefore, its area is
equal to the probability of a point being between 1 and 2. It is easy to
calculate this area: it is just 1/10. The area of the orange region then gives
the probability of a point being in the interval [5,8], and it is just 3/10.
Finally, the blue region goes from 9 to 9.5 and, therefore, the probability of
a point being in the interval [9,9.5] is 1/10 divided by 2, or 1/20.
You have just learned the notion of an integral from calculus. The
integral of any function that can be plotted as a two-dimensional graph like
our probability densities is just the area between the curve and the
horizontal axis within a certain interval. Because of that, we can always say
that the probability associated with some interval is the integral of the
probability density. It works for any probability density. For instance, the
one below

As long as the total area below the red curve is 1, this is a perfectly
valid probability density between 0 and 8 (or 10, but it is trivially zero after
8). The probability for a point being in the interval [1,4] is then the area of
the orange region. Sometimes, if the density has a shape which is too
strange, it might not be very easy to calculate the area, but the fact that it
is the probability is still true.
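This area recipe can be carried out numerically with no calculus at all: chop the interval into thin strips, multiply each strip's height (the density there) by its width, and add everything up. A sketch, assuming a hypothetical triangular density p(x) = x/50 on the interval [0, 10]:

```python
def density(x):
    # A hypothetical triangular density on [0, 10]: p(x) = x/50.
    # Its total area is (1/2) * base * height = (1/2) * 10 * 0.2 = 1.
    return x / 50 if 0 <= x <= 10 else 0.0

def probability(a, b, strips=10000):
    # Riemann sum: chop [a, b] into thin strips and add up height * width.
    width = (b - a) / strips
    return sum(density(a + (i + 0.5) * width) for i in range(strips)) * width

print(probability(0, 10))  # total probability: very close to 1
print(probability(1, 4))   # exact answer by geometry: (4*4 - 1*1)/100 = 0.15
```

Swap in any density shape you like; as long as the total area is 1, the same strip-counting gives the probability of any interval.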
The Gaussian distribution, also known as the Normal distribution
or, sometimes, the Bell curve, is the most important of all continuous
distributions, not only in mathematics but also in nature. This is the result
of a combination of many different features which we will discuss here.
First of all, let us see a plot of this distribution. From now on, I will
use the word distribution to mean both probability distribution and
probability density. Hopefully, the context will always be clear enough for
you to know which one I am talking about. Here it is:

The shape of the curve makes it evident why it is called a bell curve.
First of all, this curve is perfectly symmetric if reflected through the vertical
dashed green line. This green line marks another special place for this
distribution. The point on the horizontal axis marked by this line
corresponds to the mean of this distribution, which is the average value of
the quantity it describes. The most common notation for the mean of the
Gaussian is the Greek letter mu: μ. You can read its value for the graph
above off the position of the dashed green line.
For the Gaussian, the position of the mean coincides with the point
in which the distribution has its highest value, a point called the mode of
the distribution. In discrete distributions, the mode is the most probable
point; for continuous ones, the probability of being around the mode is
higher than that of being around any other point. The fact that the mean and the
mode of the Gaussian coincide is a special feature of it and does not
happen with all distributions.
The Gaussian is a very simple distribution and needs only two
parameters to be completely defined. One is the mean, the other is its
variance. We have already seen that the variance measures fluctuations
around the mean. In the graph above, this corresponds to the width of the
curve marked by the orange double-arrow. In order to understand this,
consider the three Gaussians below:

They are plotted in the same scale. Notice that the narrower the
distribution is, the smaller is its value for points far from the location of the
mean. This means that it is less probable to be far from the mean. As the
variance measures how probable it is for a point to be far from the mean, the
narrower the distribution, the smaller is the variance. In the picture above,
the variance increases to the right. Usually, the variance is symbolised by
the Greek letter sigma: σ. Most of the time, though, it is much more
convenient to talk about σ² instead, because in the formula characterising
this distribution the variance always appears squared.
If you are curious to see the formula, here it is:

p(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))

otherwise you can simply ignore it and go on reading. Notice that, apart
from the random variable, which we named x in this case, there are no
other variables in the formula except the mean and the variance.
There are, of course, more interesting properties of the Gaussian
than its simplicity. Suppose you are searching for the least biased
distribution to describe a continuous variable. You only know that there is a
fixed mean and a fixed variance. If you remember, to find the least biased
distribution we have to maximise the entropy. What do you think we get if
we search for a distribution that maximises the entropy given a mean and a
variance? You got it. It is the Gaussian!
But an even more interesting property is the one called the Central
Limit Theorem, which I will call CLT for short. The CLT is one of those
wonders of nature that makes you think that there is something special
about the universe. It is very simple, very powerful and very beautiful.
Consider a set of independently and identically distributed (or
i.i.d., in mathematical lingo) random variables

X₁, X₂, ..., Xₙ

What I mean by this is that the probability density of all those variables is
given by exactly the same distribution (identically distributed) and that
they all have to do with experiments which do not influence one another
(independent). Let us now consider the variable formed by the sum of all
of them:

S = X₁ + X₂ + ... + Xₙ

The variable S can be understood in the following way. Each Xᵢ is generated
by a random process according to its probability density. We generate each
one of them and then add all the results. Clearly, S is a random variable as
well, as its value is not pre-determined but depends on the drawn values
of the Xᵢ's. Given the distributions of the Xᵢ's, we can indeed calculate the
distribution of S, but there is something even more interesting that happens
when the number n of Xᵢ's is very large. The CLT states that the larger the
number n is, the closer the distribution of S becomes to a Gaussian! It goes
as far as to give you the values of the mean and the variance of this
Gaussian. The details involve more mathematics than we will use here, but
if you are interested you can take a look at Feller's book (Feller, 1968).
I am not sure you understood the beauty of this. Let me explain
again. ANY very large set of random processes which are i.i.d. gives values
which, if you add them, have a probability distribution given by a Gaussian.
A-n-y-o-n-e. Whatsoever. If that does not impress you, I give up.
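You can see the CLT at work with a short simulation. A sketch, assuming uniform random numbers between 0 and 1 (each with mean 1/2 and variance 1/12), so that sums of twelve of them should be close to a Gaussian with mean 6 and variance 1:

```python
import random
import statistics

random.seed(42)
n_terms, n_samples = 12, 20000

# Each sample of S is the sum of 12 uniform random numbers from [0, 1),
# each of which has mean 1/2 and variance 1/12.
sums = [sum(random.random() for _ in range(n_terms)) for _ in range(n_samples)]

# The CLT predicts a near-Gaussian with mean 12 * 1/2 = 6 and variance 12 * 1/12 = 1.
print(f"sample mean:     {statistics.mean(sums):.3f}")
print(f"sample variance: {statistics.variance(sums):.3f}")

# Roughly 68% of a Gaussian lies within one standard deviation of its mean:
inside = sum(abs(s - 6) < 1 for s in sums) / n_samples
print(f"fraction within one sigma: {inside:.3f}")
```

Replace the uniform draws with dice rolls, coin flips, or anything else i.i.d., and the same bell shape emerges.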
There is one subspecies of the Gaussian which is very important in
probability and physics. It is called a Dirac delta in honour of the engineer
turned physicist Paul Dirac, who first introduced it. The delta in the name
comes from the symbol used for it:

p(x) = δ(x − x₀)
Delta is the Greek letter that you see on the right hand side of
the above formula. Do not despair, it is very simple to understand the
essence of the formula above. Imagine a Gaussian whose variance
becomes very, very small. Look back at the pictures to see that the smaller
the variance, the more the Gaussian is concentrated around the mean. The
Dirac delta is nothing more than a Gaussian with zero variance and with
mean x₀. In terms of a picture, it is a very high spike at the point x₀. In fact,
because we are dealing with a continuous distribution and, as we have
seen, things get weird in the continuum, the height of the Dirac delta is
infinite! The meaning is that the whole probability is concentrated at x₀,
which means that it is the only value that can appear. In other words, the
Dirac delta is a probability distribution representing the certainty that a
continuous variable will have the exact value x₀. Simply that.
The reason why I am including the Pareto distribution in our zoo is that it
has long been popular in the economics and social sciences literature and
was, naturally, introduced by an Italian economist, Vilfredo Pareto.
The Pareto distribution can be seen as a continuous version of
Zipfs law and is another example of a power law distribution. The
interesting thing about it is that it has what is called a cut-off: it is zero
below a certain value of the random variable. An example is the following
Notice that the Pareto distribution in the above example is zero
below 2 and becomes ever smaller as the values of the random variable
increase. It is also defined by only two parameters: the point of the cut-off
and the speed with which it goes to zero. As we have seen before, power
laws are very common in nature and approximate Pareto distributions fit
many natural phenomena very well. Last time I checked, Wikipedia had a list
containing, among other things:
City sizes
The size of Bose-Einstein condensates at very low temperature
Total area of a forest fire
Sizes of meteorites
These are phenomena that can safely be said not to have any
common causes. But the main reason for the fame of Pareto distribution is
the so-called Pareto principle, which says that 80% of the results come
from 20% of the causes. I have seen this principle used to justify ideas like
the one that you just need to study 20% of something to be able to
understand 80% of it. Although there are many examples where this works,
the underlying reason being that the processes can be approximately
modelled by appropriate Pareto distributions, do not trust it blindly!
The Pareto distribution is one of many distributions that occur in nature,
and I can tell you with 100% certainty that there are many processes for
which the Pareto principle will not work. More than that, we do not know
in advance which processes it works for. So, be wary of what you read
around.
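For the curious, Pareto samples are easy to generate with the standard inverse-transform trick: if u is uniform in (0, 1], then x_min/u^(1/α) is Pareto-distributed. The cut-off 2 below matches the example above; the tail exponent 3 is an arbitrary choice of mine:

```python
import random

random.seed(7)

def pareto_sample(x_min, alpha):
    # Inverse-transform sampling: if u is uniform in (0, 1], then
    # x_min / u**(1/alpha) is Pareto-distributed with cut-off x_min
    # and tail exponent alpha.
    u = 1.0 - random.random()  # shift [0, 1) away from exactly zero
    return x_min / u ** (1 / alpha)

samples = [pareto_sample(2.0, 3.0) for _ in range(50000)]
print(f"smallest value drawn: {min(samples):.3f}")  # never below the cut-off 2
print(f"sample mean: {sum(samples) / len(samples):.3f}")
# With alpha = 3 the mean exists: alpha * x_min / (alpha - 1) = 3.
```

Try alpha = 1.5 instead and watch the running average refuse to settle down, exactly the pathology described for Zipf's law.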
A bit more anatomy. This time we will talk about tails. We now know
that if we want to define probability distributions for any value, including
the infinities in the two possible directions, be they discrete or continuous,
we need to guarantee that the distributions go quickly to zero when the
random variables become either too large or too negative. We can look
again at the Gaussian. The one below is plotted for a very wide range:

You can see that when the distance of the values on the horizontal
axis from the mean becomes much larger than around 3, the value of the
distribution becomes so small that it is practically indistinguishable from
zero. The values away from the mean form the tail of the distribution. In
the Gaussian, this tail is said to be short because it goes very fast to zero at
both sides of the mean. This means that most of the probability mass of
the distribution is located in the neighbourhood of the mean. However,
that does not happen for some distributions. For distributions in which
there is more mass in the tail than a Gaussian, the name long tail
distribution was coined.
Although it is not always the case, when the tail of the distribution
is too big, we can have the pathological effects that we have already
discussed for Zipf's law. Sometimes the mean will not exist, while in other
cases the variance will be infinite, invalidating the law of large numbers
and preventing one from estimating the mean by averaging the results of
repeated experiments.
There are many more probability distributions than the ones listed in this
chapter. In fact, there is an infinite number of them, but only some have
properties that make them interesting. In general, these properties are
associated with the fact that these probabilities describe physical
phenomena in a very condensed (or compressed) way.
One common variation of probability distributions that we have
not seen in this chapter is the kind that describes more than one variable
at the same time. These are usually called multivariate distributions. For two
variables, we can draw three-dimensional graphs, like the one below which
is that of a two-dimensional Gaussian, but when there are more variables
than two, visualisation becomes trickier.

In the above Gaussian distribution, the two variables have zero
mean and their values are in the horizontal plane, while the values of the
distribution are in the vertical axis. Notice how the bump has a very similar
shape to the one-dimensional Gaussian, but it is symmetric around the
zero. Not all two-dimensional Gaussians are symmetric like that. If I choose
different variances for the two variables, the shape you obtain is very
different.
The biodiversity of probability distributions is very high and
new interesting species are being discovered all the time. If you are
interested in a day off in a larger probability zoo, Wikipedia has a nice list of
probability distributions (see the appendices for a list of Internet links).
Enjoy it.

Changing Mind
The moment is finally ripe for you to understand what Bayesian inference
actually is, how it works and how everyone can, or should, use it in their
lives. This point should be made over and over again. Bayesian inference is
a tool that is not only restricted to technical books or hard science research.
You do not need to be a professional mathematician to make use of it in
your daily life. Almost nobody goes around calculating numerical
probabilities all the time, but almost everyone goes around taking decisions
based on some information and, to take a decision, we are always
evaluating the relative importance of probable outcomes. What Bayesian
inference does is to help weighting the best one. It can give you numbers if
you need to, but it can also guide you with very little use of them.
Taking decisions is obviously not an easy task, as anyone knows. It
is a complicated issue that requires us to constantly change our minds.
When we face a difficult situation for a second time in our lives, our
reaction will not be the same as in the first time. That happens because the
available information for taking the decision has invariably changed. You
now know the consequences of your first decision. Because time has
passed, the world changed and there are also new things to consider. You
are not the same person as before. Your needs changed. Your opinions
changed. Possibly your values changed. All this change forces us to change
our beliefs about the situation requiring a decision and that is what
inference is all about. Changing your mind. Changing your beliefs.
You must remember that we calculated the probability for our
extremely fair d10 rolling game as 1/10 (or 0.1, or 10%) for each different
face. We arrived at this by assuming many things. The fact is that all of
them might be completely wrong. For instance, the d10 might be loaded,
either because it was intentionally tampered with or simply because the
company that manufactured it was not careful enough to produce a good
quality dice.
The easiest way to check if we can really trust our fairness
assumptions is to roll the dice and acquire ever more information by
keeping track of frequencies. Frequentists, rejoice! As we have seen, the
Law of Large Numbers tells us that, within certain limits, those tabulated
frequencies will get closer and closer (although they will never get exactly
there) to the true probabilities of each face. As long as they do not differ
too much from 1/10, we will be satisfied within our desired precision.
However, if the numbers we get from the measured frequencies
are too different from those we are expecting, then we have to revise our
information about the d10 and the conditions of the game, because
something is not right. By repeating this many times we might be able to
infer (again, approximately) the actual conditions of the dice, which will
allow us to calculate correct probabilities and make good predictions.
Making good predictions is nothing more than deciding on which results
are more probable.
This procedure can obviously be applied to anything and can be
written in the following algorithmic way:
1. Use all information you have to create an initial model for your problem.
2. With that model in your hands, calculate the probabilities of each
possible outcome.
3. Test your model by doing the experiment and checking if the
frequencies agree with your calculations (taking into consideration the
limitations of this procedure).
4. If there is no agreement, you have to go back to 1 and change your
model. If there is agreement, you can relax for a while (until someone
finds a disagreement and you have to start again).
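In Python, the whole loop can be sketched in a few lines. The simulated die, the number of rolls and the tolerance below are illustrative choices, not prescriptions:

```python
import random

def test_fairness_model(n_rolls=100_000, tolerance=0.01, seed=0):
    """Steps 1-4 for a d10: assume a uniform model, roll, compare frequencies."""
    rng = random.Random(seed)
    model = {face: 1 / 10 for face in range(1, 11)}   # steps 1-2: uniform model
    counts = {face: 0 for face in range(1, 11)}
    for _ in range(n_rolls):                          # step 3: do the experiment
        counts[rng.randint(1, 10)] += 1
    # Faces whose measured frequency disagrees with the model beyond tolerance:
    return [face for face in model
            if abs(counts[face] / n_rolls - model[face]) > tolerance]

# Step 4: an empty list means we can relax for a while; otherwise, revise.
print(test_fairness_model())  # a fair simulated d10 prints []
```

With a hundred thousand rolls, the frequencies of a fair simulated d10 sit well within one percent of 1/10, so the model survives step 4.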
That should ring more than simply one bell. That is because the
above procedure is extremely similar to what people call the scientific
method. This is no coincidence. We will see, by the end of this book, that
Bayesian inference is exactly the scientific method. There is a lot to say
about this connection and we will spend a whole chapter of this book
explaining that.
The key part of Bayesian inference is then to change your beliefs
about something whenever new information about it appears. This new
information might either confirm your previous beliefs or indicate that they
were wrong. If they are wrong, the sensible thing to do is to change them.
That sounds obvious, but that is the most difficult part of the whole
procedure for the great majority of people in the world.
Although we use the word belief in a technical sense when associated
with probabilistic inference, meaning a probability assignment to some
piece of information, this is no different from the common use of the word.
A belief is something that you think is true, which includes the values of
the probabilities that you calculate for anything. The 1/10 probabilities that
we calculated for the faces of our d10 are beliefs that must be changed if
we measure frequencies that are too far from them.
This makes sense, doesn't it? The more we believe in something,
the higher is the probability we assign to that thing being true. In the same
way as we might have assigned wrong probabilities to our d10, any other
belief can also be assigned wrong probabilities of being true. The second
trickiest thing, after admitting that changing beliefs is necessary, is to find
the right way of doing that. Of course, this right way is Bayesian inference,
the small formula we have learned about in the very beginning of this book
and that we used to create the program BAYES. It is now time to
understand each piece of that formula in detail, and that is what we are
going to do now.
Let us look again at the formula from the beginning of the book:

P(A|D) ∝ P(D|A) P(A)

As you already know, this formula is called Bayes' Theorem in
honour of our late Reverend Thomas Bayes, and now you have a much
better understanding of what it means. It is not so alien anymore.
The probability distribution before the sign ∝, which means
'proportional to', we can now identify as the conditional probability P(A|D) of the
proposition A given the data D. The symbol ∝ simply means that, in order
for this to be an equality, there is something else multiplying P(D|A) P(A),
which we will not write right now because it is less important at this point.
This first probability before ∝ is, of course, what we want to
calculate with our inference procedure, our beloved BAYES. Let us forget
about the first probability after the ∝ and concentrate simply on P(A). I
have said in the beginning of the previous chapter that this is the prior
probability of A. This is because, in the Bayesian inference task, this probability
represents all the information we have about A before we include the
information contained in D.
However, this association of P(A) with the prior probability of A is
not straightforward and we will discuss in detail how this happens in the
following sections. Let us go over its meaning once again, but now using
our deeper understanding of probabilities.
One very important question that one should always ask when
faced with priors is: prior to what? We have been using the name prior to
indicate all the information we know about the results of an experiment
before the experiment is done. But what if we do the experiments more
than once? Do we call the prior only the probability distribution that we
calculate before all the experiments are done?
The answer is no, not only that one. In terms of Bayesian inference, the
prior is a probability distribution calculated with the information we have
before each new step in our inference process. A prior probability is always
prior relative to new information that is incorporated. Of course, you might
think about a sort of ultimate prior when you have absolutely no
information about something, but by the Principle of Insufficient Reason
that would be just a uniform distribution, a probability in which all
possibilities have the same odds. Our super-fair assignment of probabilities
to our d10 in which all faces have probability 1/10 is, for instance, a
uniform distribution.
The prior is like a book that codifies all the acquired information
just before a new piece of data comes in. The inference task will add a new
page to that book and update the prior to a new form. This updated prior is
called the posterior distribution, or simply the posterior, as I have mentioned very
briefly before. The process of incorporating information into a prior so that it
becomes a posterior is the inference process. We can do that as many
times as we want. Each time we decide to do the experiment again and
update our knowledge or beliefs about the probabilities of the result, the
old posterior becomes the new prior, and it is then updated to a new
posterior. Consider the figure below.
This picture describes an inference task about some proposition A.
You can think about A as being, for instance, one of the faces of our d10. In
the beginning of time, which is represented by the horizontal line with an
arrow indicating in which direction it increases, we have the prior
probability P(A) that encodes everything we know about A before doing
any experiment. The first arrow indicates the moment we do our first
experiment, let us say, we roll the dice. We can roll it once, twice or as
many times as we want. When we finish, we collect all the data, which in
the case of a dice rolling might be the observed frequencies of the faces,
and put it into a dataset that we are calling D1. When we incorporate the
information contained in D1 into our prior distribution, we end up with the
posterior distribution P(A|D1), and now you see that what we have at the
left hand side of the inference formula at the beginning of this section is
exactly this posterior.
But we do not need to stop there. We might keep doing our
experiment to confirm if the new probabilities are already the correct
ones. Let us call the next experiment, or bunch of them, D2. The posterior
P(A|D1) becomes now a new prior, a prior relative to the experiment D2. It is still
the same probability but, only for cosmetic purposes, I call it P1(A). This is
at the same time a prior for D2 and a posterior for D1. Once the
information coming from D2 is also incorporated, we now have the
posterior P1(A|D2), which can also be written as P(A|D1,D2) because it
contains information of both bunches of experiments.
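This chaining, in which the posterior for one bunch of data is reused as the prior for the next, can be sketched in Python. The two candidate explanations and all their numbers below are invented for illustration:

```python
def update(prior, likelihoods, data):
    """One inference step: multiply prior by likelihood, renormalise."""
    post = {h: prior[h] * likelihoods[h](data) for h in prior}
    norm = sum(post.values())
    return {h: p / norm for h, p in post.items()}

# Two made-up candidate explanations for a d10:
likelihoods = {
    'fair':   lambda rolls: (1 / 10) ** len(rolls),
    'loaded': lambda rolls: (1 / 2) ** sum(r == 1 for r in rolls)
                          * (1 / 18) ** sum(r != 1 for r in rolls),
}

prior = {'fair': 0.5, 'loaded': 0.5}
d1, d2 = [1, 1, 1], [1, 2, 1]               # two bunches of rolls
step1 = update(prior, likelihoods, d1)      # posterior for d1...
step2 = update(step1, likelihoods, d2)      # ...reused as the prior for d2
both = update(prior, likelihoods, d1 + d2)  # all the data at once
print(step2, both)  # the two routes give the same numbers
```

Updating in two steps or in one single batch gives exactly the same posterior, which is precisely why the old posterior can safely play the role of the new prior.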
This pattern can be repeated as many times as we want. Before
each new experiment is carried out, the previous posterior becomes the
new prior and so on. After each experiment, the gathered information is
incorporated to it and it becomes a new posterior. In theory, one should
repeat this forever; in practice, we stop once the probabilities stop
changing within our desired precision.
It should be clear that two persons with different information have
different priors for the same experiment. That is because one or both
might have incomplete information about it, with the missing information
being different in each case. As we have seen before, there is no
contradiction here as the description is completely consistent. Once both
start to do experiments and incorporate (the same) new information, their
probabilities should start to become similar to each other.
Consider our completely symmetric, completely homogeneous
d10. If we had not discussed those matters before, we might be
tempted to say that the 1/10 probability for each face is an inbuilt,
objective property of the d10. We know that it is not. We saw that this
depends not only on the geometry of the dice, but also on the procedure
one uses to throw it and the environmental conditions in which it happens.
The uniform probabilities can then be seen as a kind of compressed
computer file containing only what is important about the above
description of the d10 tossing before we throw the dice. They are the prior
probabilities of the d10 throwing. The confusion comes because, in this
case, most people assume everything to be as fair as possible and end up
assigning the same priors.
When we roll our d10, though, all of that can change. Consider a
geek party at which each person is arguing in favour of a different
prior for the d10 rolling. One person says that all faces have the same
probability. Another one swears that the dice is loaded and 1 is the most
probable result. Another yet says that she does not know about the other
faces, but she knows that the dice will never give a 10. The host of the
party then decides that experiments will be done. The dice is rolled in
bunches of 10 and, after each bunch, people are allowed to review their probability
assignments. At the end of the night, as long as people are not drunk and
they are all looking at the same experiments, their posterior probabilities
will be practically the same. The subjectivity of the initial priors will be
erased by the experimental information and, each time a new
experiment is done, their influence on the posterior will decrease.
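The geek party can even be simulated. In the sketch below, the three candidate models echo the guests' claims, but every number (and the party itself) is invented:

```python
import random

def update_beliefs(prior, rolls):
    """Multiply a prior over candidate models by each roll's likelihood."""
    models = {
        'fair':     [1 / 10] * 10,          # all faces equal
        'loves-1':  [1 / 2] + [1 / 18] * 9, # face 1 strongly favoured
        'never-10': [1 / 9] * 9 + [0.0],    # face 10 declared impossible
    }
    post = dict(prior)
    for r in rolls:
        post = {m: post[m] * models[m][r - 1] for m in post}
        norm = sum(post.values())
        post = {m: p / norm for m, p in post.items()}
    return post

rng = random.Random(1)
rolls = [rng.randint(1, 10) for _ in range(200)]   # the host's fair d10
# Two guests start the night with very different priors...
guest_a = update_beliefs({'fair': 0.8, 'loves-1': 0.1, 'never-10': 0.1}, rolls)
guest_b = update_beliefs({'fair': 0.1, 'loves-1': 0.1, 'never-10': 0.8}, rolls)
# ...but the shared data washes the difference away.
print(guest_a['fair'], guest_b['fair'])
```

After a couple of hundred shared rolls, the two posteriors for 'fair' are practically indistinguishable, however different the starting priors were.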
Wait a minute! That might seem to invalidate everything we said
before! After all, is that not the same as saying that the only true
probabilities are the frequencies? Do not forget about the fluctuations. But
even if you do, what would happen if you measured your dice with an
infinite precision and found it perfectly symmetric, but still the frequencies
would not be the same? I would assume that the dice is loaded, but what if
someone challenges that explanation? What happens if we measure and
weigh the dice in all possible ways and find out that it is perfectly
symmetric? Would you give up reason and assume that a perfectly
symmetric dice would have different probabilities for each face? I would
not. I would look for some problem in the throwing of the dice. Why?
Because logic says that it does not make sense for a perfectly symmetric dice
to have different frequencies for each face, and logic is a piece of prior
information that we know works and cannot be thrown away lightly. If you
ever find a situation with perfect symmetry in which probabilities are
different, call me. You are on the verge of a revolutionary discovery. The
most probable outcome, though, given all my years of posterior about life, is that you
are wrong.
The more we repeat something, the more familiar it becomes. Therefore,
let us take another look at Bayes' Theorem:

P(A|D) ∝ P(D|A) P(A)

We now know that P(A|D) is the posterior and P(A) is the prior.
What is left is the probability P(D|A) that makes this connection possible. That is
the '???' in our larger representation of the BAYES program. This
probability is called the likelihood function of A given D, or simply the
likelihood, and is symbolised by L(A|D). The meaning is clear from the
probability definition: the likelihood of a proposition A given the data D is
equal to how probable the data is if you assume that the proposition is true.
You have every right to be confused, because the mathematicians
really messed up the names here. It would be very natural for the
likelihood of an explanation A given the data D it tries to explain to be
the probability of A given D. But unfortunately, it is exactly the opposite.
The choice of the name was made because, as we are going to see, the likelihood
is usually used to choose between different explanations for a fixed
dataset, thus the 'given D' part. But, in fact, it is always given by P(D|A). To
add to the confusion, many times the notation L(A|D) is used for the
likelihood of A given D and we will have the equivalence

L(A|D) = P(D|A)

Too bad. We have to live with that.
The way it works is by thinking of the proposition A as being a
tentative explanation of why we have observed that specific dataset D. The
explanation which is more likely to be true (thus, likelihood of A) is the one
for which D is more probable.
Let us roll a d10 again. Suppose that we roll it ten times and we
obtain ten 1s. What we call the explanation A is, here, an explanation of why we
observed those results. Well, if we give a probability to each face, that will
explain why we observed a certain amount of faces of each kind.
Therefore, A would be, in this case, a probability assignment for the faces.
Let us say that we have two possible explanations E1 and E2. E1 is our
uniform probability where every face has probability 1/10. E2 is a
revolutionary explanation in which the face 1 has probability 0.63 and each
one of the other faces has probability of approximately 0.04. Which explanation is more
likely to be true for the data 1-1-1-1-1-1-1-1-1-1 that we observed?
Let us do the calculation together. Because we want the probability
of 1 in the first roll AND 1 in the second roll AND 1 in the third roll AND...
up to ten, we multiply the probabilities for 1 each time. For explanation E1,
this number is 0.0000000001, or a 0.00000001% chance of this happening!
On the other hand, according to explanation E2, the probability of the
observed data would be approximately 0.01 or 1%! The probability of
observing that data is ONE HUNDRED MILLION TIMES larger if the second
explanation is the correct one! Which one do you think is more likely?
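The arithmetic takes two lines of Python. The value 0.63 for face 1 under the second explanation is the one consistent with the roughly 1% figure quoted above:

```python
# Likelihood of ten 1s in a row under each explanation of the d10.
p_e1 = (1 / 10) ** 10   # E1: uniform, every face has probability 1/10
p_e2 = 0.63 ** 10       # E2: face 1 favoured with probability 0.63
print(p_e1)             # about 1e-10
print(p_e2)             # about 0.0098, i.e. roughly 1%
print(p_e2 / p_e1)      # on the order of one hundred million
```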
Of course, for every dataset there is an explanation that explains it
better than any other, which is the explanation that says that every
time you do an observation you will observe exactly that. For instance, this
explanation for the above d10 rolling would be that the probability of 1 is 1
and all the other faces have probability zero. If we call it the explanation
E3, the probability of the observed data under E3 is 100%! It is obviously as
large as you can get. This explanation is, however, what is called fitting the
data. You just use the data as its own explanation. It is the simplest thing
you can do when you do not have any other information. However, this is a
hypothesis that should be tested against not only all other information you
have, but also against more experiments. We will talk a lot about that.
Quickly revising: conditional probabilities allowed us to think about
the probability of something given the knowledge of something else. This
naturally led to the idea of changing probabilities when the conditions
themselves change, which is equivalent to saying that extra information
became available. This connection is given by the inference formula which
we called Bayes' Theorem: the prior becomes the posterior after the new
information is incorporated. Finally, we have learned that this connection is
done by multiplying the prior by the likelihood.
When we look at Bayes' Theorem, we can understand it also as a
means to evaluate the probability of some hypothesis, which in our notation
would be the proposition A, given some piece of evidence D. Being a
general proposition, this hypothesis clearly can be anything, even
something completely imaginary like (spoiler alert for kids!) the existence
of Santa Claus.
We will call the task of calculating the probability of a certain
hypothesis hypothesis evaluation. This task is obviously central in
virtually all areas of human enterprise. It is the bread and butter of
science and also of financial markets. In science, the hypothesis H is many
times the theory that explains some natural phenomenon D, where we are
associating the phenomenon itself with the data describing it. In a bank, the
hypothesis H could be the percentage of change in the price of some stock
by tomorrow given the news D about the corresponding company that
was announced today.
This kind of hypothesis evaluation is also what happens when a
crime is committed and the investigators need to evaluate possible
scenarios based on forensic evidence. Solving a crime is a task full of
uncertainties of course. The investigator must put together a story based
on a possibly very small number of clues, or pieces of evidence. In an
intuitive way, the investigator needs to evaluate how likely each possible
story is, given the collected evidence. That is nothing more than evaluating
the conditional probability of the scenario H given the forensic evidence D,
or P(H|D).
Another very common situation is a trial, where either one person
(the judge) or a group of persons (the jury) needs to evaluate not only
the physical evidence but also the information given in the form of the
accounts of several witnesses, experts and victims. During the sessions,
lawyers representing the different parties will sew all of this together and
create stories given that evidence. The judge or jury will then need to
evaluate the conditional probability of those stories given the information
provided and all else they know about crimes, human nature, society and
so on. Because each one of them has different life experiences, the given
information will be different and different probabilities will be calculated.
The task is even more difficult because there is also the need to evaluate if
the information itself is reliable or not, in other words, the probability that
the information is true also needs to be evaluated at the same time.
Think about the murder of a billionaire lady, for instance. We can
think about the murder itself and all the collected evidence as the data
M (for murder). Suppose that we have two suspects, the husband H and,
naturally, the butler B. What we want to do is to discover which
explanation is more likely. According to the discussion of the previous
section, this can be accomplished by evaluating the probability P(M|H) of the
murder given that the killer is the husband and the probability P(M|B) of
the murder given that the killer is the butler. If we have

P(M|H) > P(M|B)

then we can say that the likelihood that the husband is the murderer given
the evidence, L(H|M), is larger than that the butler committed it, L(B|M),
or that it is more likely that the husband is the killer given the evidence,
which would be written

L(H|M) > L(B|M)
But if that is all, why should we use Bayes' Theorem? According to
the above explanation, we just need to calculate the likelihood. We did not
use priors and posteriors. We did not, but we should! As we have seen, the
likelihood of a hypothesis measures its ability to reproduce the data. This is
a very important point to be made. All information used in the likelihood
comes from the measured dataset. No other information enters in this
evaluation. The more probable a dataset is to be generated by the
hypothesis, the larger the likelihood of the hypothesis being right if we take
into consideration only that data. I will not get tired of repeating it! Is there
anything else that we should take into consideration? Yes: they go by the
name of priors.
Suppose that we know for sure that the husband is a good person
and never did anything wrong, but the butler killed other people before
and is a sadistic, cold psychopath! This could be reflected in, for instance,
their criminal records. That does not really prove that the butler is the
murderer, but the problem here is that many times we cannot find
conclusive proofs for anything! When conclusive proofs cannot be found
and we still need to give a verdict, all probabilities should contribute! Even
the previous history of each one of the suspects counts. This previous
history is, of course, the prior and what we can infer from it will be the
posterior. Because the likelihood and the prior multiply each other, they
will compete for supremacy. The reason why this competition is in the form
of a multiplication is related to something we have already seen before:
maximum entropy. It is maximum entropy that will give the correct
justification of Bayes' Theorem. We already have in our hands all the
mathematical ideas needed to understand it. Showtime.
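Before the derivation, the competition between prior and likelihood can be put in numbers. Everything below is invented, and it assumes, for simplicity, that exactly one of the two suspects is guilty:

```python
def posterior(priors, likelihoods):
    """Posterior ∝ likelihood × prior, renormalised over the suspects."""
    unnorm = {s: priors[s] * likelihoods[s] for s in priors}
    norm = sum(unnorm.values())
    return {s: u / norm for s, u in unnorm.items()}

# The forensic evidence alone slightly favours the husband...
likelihoods = {'husband': 0.30, 'butler': 0.20}   # P(M | suspect did it)
# ...but the butler's criminal record gives him a far larger prior.
priors = {'husband': 0.05, 'butler': 0.95}

print(posterior(priors, likelihoods))  # the butler still comes out on top
```

Because prior and likelihood multiply, a strong enough prior can overturn a mild advantage in the likelihood, and vice versa.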
Let us talk about two generic propositions A and B without
specifying what exactly they are. Remember that the conjunction A AND B
corresponds to the proposition that both A and B are true at the same
time. That is what we called the logical AND operation. It should be obvious
that A AND B and B AND A must describe exactly the same proposition, as their
conjunction does not have any relation to temporal order. If and when
temporal order makes a difference, we have to include it in the propositions
themselves, but it will not be part of the AND operator. As long as we use
the rules of usual logic, there will be no special or pathological situations in
which that will not be true. We can write this obvious observation in our
mathematical condensed form as

P(A AND B) = P(B AND A)
which is the short way of saying that the probabilities of the conjunction of
two propositions do not depend on the order in which we write them. We
can say that the probability is symmetric if we exchange A and B. Remember
that a symmetry is something that remains unchanged when we change
something else. In this case, exchanging A and B keeps the probability of the
conjunction unchanged, therefore we have a symmetry.
Because this is always true, we can use it in the formula we wrote
for conditional probabilities. If you go back to that section and look at that
formula, you can take the denominator of the fraction and move it to the
left side multiplying. This will allow us to write the following equation

P(A AND B) = P(A|B) P(B)

In this equation, we are relating the probability of the conjunction A AND B
to the conditional probability of A given B and to the probability of B
irrespectively of what A is. Note that the conditional probability and the
probability of B alone are not symmetric under the exchange of A and B
separately. P(A|B) is not the same as P(B|A), and P(B) is obviously not
the same as P(A). The symmetry miraculously appears only when we
multiply both together.
We can do the same trick for the probability of the conjunction in
the different order to get

P(B AND A) = P(B|A) P(A)

Because both need to be equal, we then have

P(A|B) P(B) = P(B|A) P(A)
This is valid for any two propositions. They do not even need to be
related to each other. A can be the proposition that we can find a polar
bear in Africa and B can be the proposition that there are aliens living on
Neptune. The above formulas still work. In particular, the formula works for
our hypothesis testing task of the previous section and, to make things
more mnemonic and easier to understand, we will use it as our working
notation. Instead of A, we will then use the letter H to make clear that our
proposition is now a Hypothesis about something. Instead of B, we now
write D for the Data we collected. By moving the second probability in the
left hand side of the equation to the other side dividing, we finally have

P(H|D) = P(D|H) P(H) / P(D)
Believe it or not, that is Bayes' Theorem. This is the complete
formula. You see that the difference to the one I gave you before,

P(H|D) ∝ P(D|H) P(H),

is that I skipped the equal sign and the division by P(D). This is because, as
we will see later, P(D) is usually just a constant number and will not need
to be calculated in some applications. In any case, we know that it should
be there. We will talk more about that later.
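The whole chain, from the symmetry of the conjunction to the final formula, can be checked by brute-force counting. Here is a sketch with a made-up box of LEGO bricks like the ones mentioned earlier in the book:

```python
# A made-up box of LEGO bricks: (colour, shape) -> how many such bricks.
counts = {('red', 'square'): 3, ('red', 'round'): 1,
          ('blue', 'square'): 2, ('blue', 'round'): 4}
total = sum(counts.values())

def p(pred):
    """Probability that a uniformly drawn brick satisfies pred."""
    return sum(n for brick, n in counts.items() if pred(brick)) / total

red    = lambda b: b[0] == 'red'
square = lambda b: b[1] == 'square'
both   = lambda b: red(b) and square(b)

lhs  = p(both)                            # P(red AND square)
rhs1 = (p(both) / p(square)) * p(square)  # P(red | square) P(square)
rhs2 = (p(both) / p(red)) * p(red)        # P(square | red) P(red)
print(lhs, rhs1, rhs2)                    # all three agree
```

Both conditional decompositions recover the same conjunction probability, which is exactly the symmetry the derivation rests on.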
Are you surprised at how easily we got to Bayes' Theorem? But what
about the maximum entropy relation I promised? You are missing it
because, although we found Bayes' Theorem, it is still not the formula for
inference! You got me. I have been slightly misleading you from the
beginning, but that is because this point is very subtle.
Inference requires what we will call Bayes' Rule. The difference
between it and Bayes' Theorem is that, as becomes clear from the
considerations above, the latter is simply a consequence of the definition
of conditional probabilities. Conditional probabilities are valid for any
propositions and have nothing to do with temporal order. Inference, on the
other hand, has a very clear temporal order associated with it. We have a
prior, collect data and change the prior to a posterior. It is something
that happens in time.
It is worthwhile to spend as much time as needed at this point
because this is the most important distinction in this whole book. This
distinction is many times overlooked even by those who work with it. The
reason why this happens is that whoever looks at the formula of Bayes'
Theorem naturally interprets P(H|D) as the new probability (the posterior) of
H after D is taken into consideration; after all, we read it 'H given D', right?
The problem arises because one attributes to the symbol | a temporal
interpretation that it does not really possess. Go back and look at the
example of the LEGO bricks. We do not talk about the probability of a brick
being square before and after it was known to be red. All probabilities on
both sides of the equation we call Bayes' Theorem are defined at the same
instant of time and, because inference is always related to change, we need
to add something extra to the equation, and we need to justify why including
this inference time arrow in it works!
But beware! As happens in many other places, even if those
things are different, people do not care too much about using different
notations, and even I, when I am at work, write things in the form of Bayes'
Theorem. As long as you understand what you are doing, you are
excused on the grounds of simplicity of writing.
Let us proceed. As we have seen, inference concerns changing the
probabilities we assign to some proposition after new data has arrived, in
other words, changing a prior to a posterior. The act of adding new data to
our database defines a before the new data and an after the new data.
What we call our database is obviously our current probabilities, as you
should know by now that probabilities are simply a method to encode
information.
Because there is a change in this database, one immediately gets
the concept of a temporal order of events. One very important thing that
you must bear in mind is that the time passage implied by the arrival of
new data is not the same as the time in which each piece of data was
generated. Confusing the temporal order of arrival with the order in which
each piece of information was created is the source of huge
misinterpretations. We can call the latter, the actual order in which things
happen, the causal time arrow, because it relates causes and effects.
Therefore, we must be very careful and always remember that the causal
arrow is different from the inference arrow.
To make this distinction clearer, let us consider an archaeologist
who is trying to piece together the life of Tutankhamun, the Egyptian
teenager pharaoh. Let us indicate his life history by the letter T. Each time
someone finds a new artefact or document related to Tutankhamun, this
new piece of information has to be added to the database, which can be
imagined as a huge book where everything that was ever found about the
pharaoh, which we call the data D, has been recorded, encoded in some
language such as English, for instance.
Say that the archaeologist has to include in the book three new
findings about Tutankhamun's life which were discovered last year. Call
these findings by the names D1, D2 and D3, where the numbers indicate
the order in which they were found (not the historical order in which they
happened). Let us say that D1 was found on the first day at the digging site,
D2 was found one week later and D3 two months after that. Each
time one of these findings was added to the book, the archaeologist had to
spend the whole night with the history team reviewing Tutankhamun's life
history T. This is inference time. However, even if D1 comes before D2 in
inference order, nothing forbids D2 to be a document written 1000 years
after the statue D1 was sculpted.
Bayes' Rule, not Bayes' Theorem, is the actual algorithm for
changing the probability of a hypothesis, like the life history of
Tutankhamun T, as new evidence arrives. We will emphasise this by
changing a bit the way we write probabilities. We will attach a small label t
to the probability for T and write it as P_t(T).
The label t represents inference time, and a summary of how it
works for our digging is given by the table below.

P_0(T): The probability of T for t = 0. This is the initial probability for T
before taking into consideration any of the archaeological findings we
found last year in the digging. This is the book about Tutankhamun's life
before last year.

P_1(T): The probability of T at t = 1, right after the information conveyed
by the statue D1 has been taken into account.

P_2(T): The probability of T at t = 2, after the evidence provided by the
canopy D2 has been also taken into account.

P_3(T): Finally, the probability of T at t = 3, with the information from all
3 findings having been found and analysed by the history team.

To automate this process, we need a rule, an algorithm, a program,
which starts with P_t(T) and calculates the new probability P_{t+1}(T) every
time a new piece of data, D_{t+1}, is added to our knowledge database. We
already gave a name to this program. We called it BAYES. When we first
talked about it, we did not include the temporal details, but they are
important. BAYES is not Bayes' Theorem, but actually Bayes' Rule.
To be honest, the truth is that in the past it was usual to assume
that Bayes' Theorem and Bayes' Rule were the same. Bayes' Theorem was
considered to be the rule to do inference simply because it seemed to make
sense but, if we consider everything very carefully, it does not. Only a long
time later was it shown that Bayes' Rule, with its temporal inference order,
could be derived from something much more basic: Laplace's Principle of
Insufficient Reason or its generalisation, Maximum Entropy.
We talked a lot about both of them before. There, we saw that
these principles' objective is to guarantee that we use only the available
information, without making any extra assumptions, when we calculate
probabilities. Although the situations we analysed at that point and the
inference or hypothesis evaluation tasks seem to be different ones, they
are actually the same! What we are trying to do now is nothing more than
to create a new probability that includes the information encoded in the
old one plus the new information. We are trying simply to append new
information to our probability distribution. Again, the ideal way to do that
is by encoding only these two pieces of information, the old probability
(the prior) and the new data, without making any extra assumptions.
The mathematical proof of how to arrive at Bayes' Rule using
maximum entropy is not simple to describe. If you are interested, you can
find it in Ariel Caticha's paper listed at the end of this book (Caticha, 2010).
The important thing is that this line of reasoning takes us to the following
formula:

P_{t+1}(T) = P_t(T|D) = P_t(D|T) P_t(T) / P_t(D)

This formula says that the new probability, the one at the inference
time t+1, which is in fact the posterior probability we are looking for after
including the new data D, is given by the conditional probability
P_t(T|D). We then can use Bayes' Theorem at time t and write this
conditional probability using the formula we have derived for it, as long as
all probabilities at the right hand side are calculated at inference time t.
This finally is Bayes' Rule. You might be very unsatisfied at this
point, as there seems to be little difference between this formula and
Bayes' Theorem, but there is a difference, and a very crucial one! It is
contained in the first equality in the formula above, given by

P_{t+1}(T) = P_t(T|D)

This formula means that the best rule to update probabilities is to
use the conditional probability given by the previous probability function
P_t as the new probability P_{t+1}(T). This formula actually is Bayes'
Rule, while the rest is just a reminder of how we can use Bayes' Theorem to
calculate P_t(T|D), but which has nothing to do with actually updating
probabilities.
This is a good time to stop and review the names of all the symbols
used above. These names, in the end, are what really make the connection
with our intuition, so it is very important to keep them in mind all the time.
The probability P_{t+1}(T) plays the role of the posterior probability of
Tutankhamun's life history because it is the probability of T after the new
data D is included.
Can you remember what we call P_t(T)? (Hint: t comes before t+1.)
Because this probability encodes all the knowledge we had before we
started to include new data, it is nothing more than the prior probability of
T. You might now want to revise our definition of likelihood, and soon you
will realise that P_t(D|T) is the likelihood of Tutankhamun's life history T
being a good account given the new data.
There is still one term in that formula to which we have not
assigned any name. We have been avoiding talking too much about it. This
is of course the term $P_{\text{old}}(D)$, which for a long time we hid behind the $\propto$
symbol. The usual name for this term is the "evidence" for the new data,
but I personally hate that name and hardly use it, especially because, in
practical situations, this terminology is usually unnecessary. In fact, most of
the time, this number is simply what we call a normalisation. Sometimes,
in particular in physics, it will have its importance. In those cases, this term
will be known by the much more interesting name of partition function. I
will open a parenthesis here and briefly digress about the former term,
normalisation. We will talk about partition functions when we see how
everything connects to physics.
At the very beginning of this book, when we deduced probabilities from
Cox's Postulates, we agreed that they would always be real numbers
between 0 and 1. We did this arbitrarily, as a question of convenience, since
we could have chosen the interval from 0 to $\infty$. A consequence of our
choice was that if we consider all possible outcomes of some experiment
and assign probabilities to them, they have to add up to 1.
Now let us have a more detailed chat about that small symbol $\propto$,
the one we have been calling "proportional to". It looks like the Greek letter
alpha, $\alpha$, and most people (including me) do not bother to write them
differently when using a pen, but they are actually different, and $\propto$ is a
symbol that will be quite convenient many times.
We say that a certain quantity, let us call it $y$, is proportional to
another quantity $x$ whenever $y$ is equal to $x$ times a constant number. For
instance, we say that the height of a building is proportional to the number
of storeys it has. Suppose that each storey has a fixed height, let us say
3 metres. If we call the height of the building $h$ and the number of storeys
$n$, then we can write the simple relation

$$h = 3n$$

If you give me the number of storeys, I just need to multiply it by 3
to get the height of the building. Therefore, the height is proportional to
the number of storeys and we could write it as

$$h \propto n$$
Of course, with this notation we are losing the information about
what the proportionality constant is (that is the name we give to the
number 3), but this notation is used when that constant is not important.
When would it not be important? Sometimes we just want to know things
like: how many times will the height of the building increase if we double
the number of storeys? The actual height of a storey does not matter; the
important thing is that if we double $n$, because $h$ is proportional to $n$, it will
also double. If we increase the number of storeys by 10%, then the height
of the building will also increase in total by 10% and, sometimes, that is all
we need to know.
To understand when something is not proportional to another
thing, consider the infamous body mass index (BMI), which is supposed to
indicate how much someone should weigh according to his or her height.
The formula for this index is

$$\text{BMI} = \frac{m}{h^2}$$

where $m$ is the weight, or more precisely the body mass of the person, and
$h$ is the person's height. The idea is that you weigh and measure yourself,
put those numbers in the formula above and get another number that will
tell you if you are too thin, too fat or just right. I will not comment on the
precision of this procedure, which is actually very poor, but let me use it to
illustrate the concept of proportionality.
Looking at the formula, we can say that the BMI is proportional to
the body mass and write

$$\text{BMI} \propto m$$

The information conveyed by this is that, if you double your
weight, your BMI will also double. If you get 20% lighter, your BMI will also
decrease by 20%. Now, can we say that the BMI is proportional to the
height? Of course not! Suppose that you compare two persons with the
same weight, but with different heights. Let us call the height of the first
person $h_1$ and the height of the second person $h_2$. If the first person is
twice the height of the second person, we can write this as

$$h_1 = 2h_2$$

If we square both sides we get

$$h_1^2 = 4h_2^2$$

and this means that the BMI of person 1 will be four times smaller than
that of person 2. Proportionality would require that the BMI should, like
the height, be two times larger for the first person. Therefore, we cannot
say that the BMI of a person is proportional to their height. However, we can
say that it is proportional to the inverse square of the height and write

$$\text{BMI} \propto \frac{1}{h^2}$$

That is because if you multiply the inverse square of the height by
any number, let us say by 3, like this:

$$\frac{1}{h^2} \longrightarrow \frac{3}{h^2}$$

then the BMI is multiplied by the same number, which in this case would
also be 3:

$$\text{BMI} \longrightarrow 3\,\text{BMI}$$

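These scaling rules are easy to check numerically. Below is a minimal sketch; the function name and the sample numbers are my own illustration, not from the text.

```python
def bmi(mass_kg, height_m):
    # Body mass index: mass divided by the square of the height.
    return mass_kg / height_m ** 2

base = bmi(70, 1.75)

# BMI is proportional to the mass: doubling the mass doubles the BMI.
print(bmi(140, 1.75) / base)   # 2.0

# BMI is NOT proportional to the height: doubling the height
# divides the BMI by four (inverse-square behaviour).
print(bmi(70, 3.50) / base)    # 0.25
```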
Caution! You should be careful when the same quantity appears on
both sides. You cannot say that $h^2$ is proportional to $h$ just because it is
$h \times h$, as in this case the extra factor of $h$ is not a constant.
I said that in many cases we do not need the proportionality
constant, but I have to admit that sometimes we do. The good thing about
probabilities is that, exactly because we agreed that probabilities always
add to 1, we can calculate proportionality constants even when we do not
write them. That is why we wrote Bayes Theorem at the beginning of this
book as a proportionality.
Let us see this in action for a dice roll. Just to change a bit (and
simplify things), suppose that we have a loaded d3. Someone built it in such
a way as to make the face with the number 3 on it three times more
probable to turn up than the face with a 1 on it. The person also took the
trouble to guarantee that the face with a 2 on it is twice as probable as
the face with a 1. Let us call $p$ the probability of getting the number 1 when
rolling this dice. The above considerations mean that

$$P(1) = p, \qquad P(2) = 2p, \qquad P(3) = 3p$$

and because $p$ is the same constant number for all three probabilities, we
can also write

$$P(1) \propto 1, \qquad P(2) \propto 2, \qquad P(3) \propto 3$$

where the proportionality constant is the same for all three cases, being
obviously $p$. We could write the above 3 probabilities in a condensed
formula in two ways. Either

$$P(n) = pn \qquad \text{or} \qquad P(n) \propto n$$

with $n$ being the number on the face, either 1, 2 or 3. The second way is
just a more simplified way of writing the first one. Now, because
probabilities are obliged to add up to 1, we can easily find the actual
numerical value of the constant $p$ by solving the following equation

$$p + 2p + 3p = 6p = 1$$
This would give $p = 1/6$. Finding $p$ when we do not know it
beforehand is called normalising the probabilities. Note that 6 is actually
the sum of all the values that appear after the $\propto$ symbol in the
probabilities. The constant that normalises the probabilities is usually
called the normalising factor or simply the normalisation. It is just a
constant that appears in all probabilities but does not depend on each one
individually, only on the sum of them all. The mathematical convention is
that the normalisation is the number which divides everything. So, in the
above example, the normalisation would be $6$, not $1/6$.
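In code, normalising is just "divide everything by the sum". Here is a sketch of the loaded d3 in Python (the variable names are my own choices):

```python
# Unnormalised weights for the loaded d3: face 2 is twice as probable
# as face 1, and face 3 is three times as probable.
weights = {1: 1, 2: 2, 3: 3}

# The normalisation is the sum of the values after the "proportional to".
normalisation = sum(weights.values())   # 6

# Dividing by it gives probabilities that add up to 1.
probabilities = {face: w / normalisation for face, w in weights.items()}

print(normalisation)       # 6
print(probabilities[1])    # 0.1666..., which is p = 1/6
```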
We are ready to understand why the probability of the new data,
$P_{\text{old}}(D)$, is just the normalisation in the Bayes Theorem part of the
inference equation. Notice that Bayes Theorem gives the posterior
probability of the life history $T$. If we could sum over all possible $T$'s, we
should arrive at the value 1, as that is the way we agreed our probabilities
should work. In this case, $P_{\text{old}}(D)$ would be a constant exactly like $p$ above.
We do not need to worry too much about it because we could calculate it
at some point, when we really need it, in the same way as we calculated $p$
above.
This means that the right-hand side of Bayes Rule is proportional
to the left-hand side, with the proportionality constant being $1/P_{\text{old}}(D)$, or
in other words

$$P_{\text{new}}(T) \propto P_{\text{old}}(D \mid T)\,P_{\text{old}}(T)$$

This is a formula which involves just the posterior in terms of the
prior and the likelihood. If you really need the normalisation because you
want to calculate the exact numerical value of the posterior, then you can
calculate it, as I have already said, by summing over all possible values of $T$,
which we write as

$$P_{\text{old}}(D) = \sum_{T} P_{\text{old}}(D \mid T)\,P_{\text{old}}(T)$$

which is a formula that, again, involves only the prior and the likelihood.
The big symbol $\Sigma$ that you see in the above equation is the capital sigma
letter of the Greek alphabet. Sigma roughly corresponds to our S and it is
for that reason that it is used with the meaning of summation. The letter $T$,
or whatever other letter we put underneath this symbol, is used to indicate
what variable we are summing over.
Consider that there are three possible histories of Tutankhamun's
life, which are labelled by the names $T_1$, $T_2$ and $T_3$. In this case, the above
formula would be equivalent to

$$P_{\text{old}}(D) = P_{\text{old}}(D \mid T_1)P_{\text{old}}(T_1) + P_{\text{old}}(D \mid T_2)P_{\text{old}}(T_2) + P_{\text{old}}(D \mid T_3)P_{\text{old}}(T_3)$$

Therefore, if we wanted to, we could completely get rid of $P_{\text{old}}(D)$
by writing Bayes Rule as

$$P_{\text{new}}(T) = \frac{P_{\text{old}}(D \mid T)\,P_{\text{old}}(T)}{\sum_{T'} P_{\text{old}}(D \mid T')\,P_{\text{old}}(T')}$$

In this way, $P_{\text{old}}(D)$ would be gone even without using the "proportional to"
symbol. That is why I do not really fancy it. It does not need to be there at
all.
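For concreteness, here is the whole update carried out numerically for three candidate histories. All the prior and likelihood numbers below are invented purely for illustration; they are not from the text.

```python
# Invented priors P(T) and likelihoods P(D|T) for three candidate histories.
priors      = {"T1": 0.5, "T2": 0.3, "T3": 0.2}
likelihoods = {"T1": 0.1, "T2": 0.4, "T3": 0.4}

# The normalisation P(D): the sum of likelihood x prior over all histories.
p_data = sum(likelihoods[t] * priors[t] for t in priors)   # ~0.25

# Bayes Rule: posterior = likelihood x prior / normalisation.
posteriors = {t: likelihoods[t] * priors[t] / p_data for t in priors}

print(posteriors)                # roughly {'T1': 0.2, 'T2': 0.48, 'T3': 0.32}
print(sum(posteriors.values()))  # adds up to 1, as it must
```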
But, as I said before, this is not the whole story for normalisations.
The first thing that can happen is that the normalisation can be difficult to
calculate. This is a problem because, in some cases, the normalisation itself
can become a very important tool which can be even more useful than the
probability. How can it be? If you have a formula for the normalisation, you
might be able to use some mathematical tricks to extract from it
interesting pieces of information like averages and variance. Due to these
tricks, normalisations play a central role in one of the most important areas
of physics, Statistical Physics. It is there that they are called partition
functions. We will get back to it.
All we have been learning up to this point is concerned with one main
objective: taking decisions. This is nothing but the task we analysed in the
beginning of this chapter. In order to take decisions, we have to evaluate
the relative magnitude of several competing hypotheses. We will usually
not need to calculate the precise value of these probabilities, because they
all will have the same normalisation factor which, in this case, will not be
necessary to decide which one is the best.
Let us simplify things and consider only two hypotheses. If we have
more than that, we simply compare them in pairs. The problem then
becomes to evaluate how many times more probable a hypothesis $H_1$ is
compared to a second hypothesis $H_2$ if the dataset $D$ used to evaluate
both is the same. This would be the equivalent of, for instance, piecing
together two different accounts of Tutankhamun's life based on the
archaeological evidence in $D$ and trying to decide which one is his true life
story (or at least the truer one).
A warning: I am going to use the notation I said was the wrong one,
mainly for reasons of laziness. Therefore, pay a lot of attention to it and,
when in doubt, come back to this paragraph to remind yourself of what we
are doing. Because I do not want to be carrying time indices all the time, we
will forget about all of them when writing Bayes Rule and, in addition,
instead of using $P_{\text{new}}(H)$ to denote the posterior I will prefer to use
$P(H \mid D)$:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$

The notation makes it look like Bayes Theorem,
but it is actually Bayes Rule that I want to write!
Yes, you will be right in complaining after I spent a whole section
highlighting the differences between Bayes Rule and Bayes Theorem. But
now that you know the difference, you can make the appropriate
corrections in your mind as long as the interpretations for each one of the
probabilities in the above formula, and the time at which each one is
related to, are clear in your head. When in doubt, go back to the previous
sections. Welcome to the world of confusing notations in which
professional scientists live.
Now, because we want to compare the effect of the data $D$ on
the relative probabilities of the two hypotheses $H_1$ and $H_2$, what we need
to do is simply divide the posteriors of both hypotheses to obtain

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \times \frac{P(H_1)}{P(H_2)}$$

The normalisation cancelled because it is the same in both
formulas. This is one of those cases in which it simply is not necessary. The
above equation has the following meaning: the odds that hypothesis $H_1$ is
more correct than hypothesis $H_2$ depend on (1) the ratio between the
likelihoods of each hypothesis and (2) the ratio between their priors.
The first ratio is called the likelihood ratio, but we will begin our
analysis by focusing on the second one, the ratio between the priors of the
two hypotheses.

When evaluating the relative probability of two hypotheses, the
ratio between the priors is many times overlooked because one considers
both priors equal, and their ratio becomes simply 1. Of course, if without
considering the data there is no reason to favour either of the hypotheses,
then their priors are indeed equal, disappearing from the formula. In other
words, if there is no reason to favour one of the hypotheses, then the
problem is reduced to calculating the likelihood ratio or, as we have seen,
how well each hypothesis explains the data.

HOWEVER, if the priors are different, the hypothesis with the
largest prior is favoured from the start, even if both explain the data equally
well. To understand that, remember that the concept of a prior is always
relative to the addition of new data, in this case represented by $D$.
For example, suppose that you are a teacher and you have two
students who are suspected of stealing the answers to an examination.
Call them Alice and Beth. After all evidence has been collected about the
crime, one thing is very clear: one of them must be the culprit, with
100% certainty. You interrogate both and they both say that they are
innocent. Both Alice's story (we will call it $A$) and Beth's story (we will call
it $B$) explain the fact that the answers disappeared equally well, meaning
that the likelihood for the answer-stealing is the same for both stories.
Who would you think is lying? If Alice is that kind of student with a clean
profile, who always studied a lot and was never involved in any
wrongdoing, while Beth was always creating trouble, it is obvious that, in
the absence of any other evidence, you would think that the guilty one
should be Beth.
Let us understand the Bayesian reasoning working behind the
curtains here. If you call the evidence of cheating $D$, you can translate that
into equations by writing $P(D \mid A)$ for how well Alice's story explains the
cheating. Remember that we are using here the notation where we ignore
all time indices. In the same way, how well Beth's story explains all the
evidence is written as $P(D \mid B)$. Because both stories explain the situation
equally well, these two probabilities are the same. This means that the
likelihood-ratio part of the equation disappears and likelihoods alone
cannot help you take a decision. This is where the prior ratio enters. The
priors here represent how much you believe in the stories given your
prejudices about the two students. Prejudice? Yes, exactly. Prejudices are
nothing more than priors. In a perfect world, people would change their
prejudices with data using Bayesian inference, but we know people never
really do that, right?
Back to the crime. As we have seen, you have known both students
long enough to believe that it is extremely unlikely that Alice would be lying.
On the other hand, given Beth's previous problems, it is much more probable
that she is indeed lying. Your preconception about Alice telling the truth is
your prior $P(A)$, the probability that Alice's story is true according to
your knowledge about her student profile. The same preconception for
Beth is $P(B)$. Because for you $P(A) > P(B)$, this implies that

$$\frac{P(A)}{P(B)} > 1$$

Because the likelihoods are the same,

$$\frac{P(D \mid A)}{P(D \mid B)} = 1$$

If we put this in the formula, we have

$$\frac{P(A \mid D)}{P(B \mid D)} = \frac{P(A)}{P(B)} > 1$$

and, therefore,

$$P(A \mid D) > P(B \mid D)$$
meaning that you are inclined to believe Alice even though neither story
can be favoured by the evidence alone!
The decision might sound unfair, and it is. If we had the possibility
of not punishing anyone until irrefutable proof appeared, that would be
the moral thing to do, as probabilities are not certainties. But what if one
of them has to be punished? What if there is absolutely no way to spare both
students? Even if the truth cannot be discovered, we would have to punish
Beth! Even if, just this once, the guilty one is Alice and she is abusing the
system. This is certainly unfair, but it comes from the necessity to take a
decision. This happens every day in every part of the world inside courts.
A judge cannot suspend judgement forever. A final verdict always has
to be given.
We see here how the prior affects the decision if the likelihoods
are the same. But you must also be aware that even if the likelihoods are
not the same, but the priors are very different, the difference in the priors
can overcome the difference in the likelihoods! For an example, suppose that
Beth's story is actually 2 times more plausible than Alice's or, in symbols,

$$\frac{P(D \mid A)}{P(D \mid B)} = \frac{1}{2}$$

However, if you think that Alice is 4 times more honest than Beth,

$$\frac{P(A)}{P(B)} = 4$$

By inserting these two ratios in the formula we have

$$\frac{P(A \mid D)}{P(B \mid D)} = \frac{1}{2} \times 4 = 2$$

which implies that $P(A \mid D) = 2\,P(B \mid D)$, or that it is still twice more
probable that Alice's story is the correct one and she should still be
considered innocent! Lesson for life:
Never underestimate the power of prejudice,
even in the light of evidence!
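The arithmetic behind Alice and Beth's case takes only a few lines. The numbers are the ones from the example above; the function name is my own.

```python
def posterior_odds(likelihood_ratio, prior_ratio):
    # Posterior odds = likelihood ratio x prior ratio.
    # The normalisation P(D) is the same for both hypotheses, so it cancels.
    return likelihood_ratio * prior_ratio

likelihood_ratio = 1 / 2   # Beth's story explains the data twice as well...
prior_ratio = 4            # ...but Alice is judged four times more honest.

print(posterior_odds(likelihood_ratio, prior_ratio))   # 2.0: Alice still wins
```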
Let us now consider what happens when the priors are equal. In
that case, our equation becomes

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)}$$

The ratio of the probabilities of the two hypotheses in this case will
only depend on the likelihood ratio, as I had already pointed out. This
sounds less unfair, but that is just because I told you that priors
represent prejudice, and prejudice is never a nice word to hear. But you
must remember that the prior includes all previous information, which
might not be only prejudice in the bad sense of the word, but proved and
verified information that we had before we started to consider the data,
and it might be wrong to ignore it.
The above formula clearly makes sense if you think about it.
When we do not have any reasons other than the data to choose between
two explanations, the one that explains the data better should be the
correct one. Coming back to Alice and Beth, now imagine that both
students have clean profiles. Neither has ever had any problems in the
school. In this case, the priors for them would be the same,

$$P(A) = P(B)$$

and the one with the best story would win. You might be shocked by the
fact that it is not really the truth that wins, but the best account of the
facts. The point is that all we know about the truth is either in the priors or
in the data. How true each account is has to be judged only on the basis of
how consistent it is when compared with the data. That is why, as unfair as
it is, having the best lawyer ends up making the whole difference in a
judgement, and this can only be compensated by a good judge, one that has
enough experience with cases to put together good priors!
There is a subtle point here which, in fact, is extremely important.
It is the idea that we do not really need to know the exact value of the
probabilities to take a decision. All we need to know is which one is more
probable. In everyday life, we know that, but once we learn the above
arguments and formulas and get hooked by them, it is easy to forget the
simple things. Instead of calculating probabilities of hypotheses up to
several digits after the decimal point, we just need to rank them.
The importance of this simple observation goes very deep. It can
be appreciated by considering nothing less than science itself. There is no
way to evaluate exactly the probability of a scientific theory being correct,
because we simply will never know all possible theories. Why would we
need to know all theories? Because to calculate the exact value of a
probability we need to calculate the normalisation, and the normalisation
requires us to calculate a sum over all possible values of our variables!
However, because the normalisation disappears when we calculate ratios,
even without knowing the exact probability of a certain theory being right, we
can still rank theories in order of plausibility! Science still works.
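A sketch of this idea: rank competing theories by their unnormalised scores (likelihood times prior) without ever computing the normalisation. The theory names and scores below are invented for illustration.

```python
# Unnormalised posterior scores: likelihood x prior for each theory.
# The shared normalisation is never computed; it would not change the order.
scores = {
    "theory_A": 0.3 * 0.5,   # 0.15
    "theory_B": 0.6 * 0.2,   # 0.12
    "theory_C": 0.1 * 0.3,   # 0.03
}

# Sorting by score ranks the theories by plausibility, even though none
# of these numbers is an actual probability.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)   # ['theory_A', 'theory_B', 'theory_C']
```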
There is nothing more to BAYES than that. Seriously. We have finally arrived
at the complete formulation, both conceptual and mathematical, of
Laplace's ideas based on the insight of Bayes. The previous section
summarises all you need to know to take decisions in the most unbiased
way possible considering all available information. In fact, it even shows you
how to take biased decisions knowing that you are doing so.

As you could see, you did not need much more mathematics than
you learned in high school. You need to multiply, divide and pass
things from one side of an equation to the other. That is pretty much all the
mathematics we have used. I had to introduce some strange letters, but
they are still just abbreviations of the four basic arithmetic operations. The
sophistication of Bayesian reasoning is not in the formal techniques
required to use it, but in the philosophical framework underlying the idea.
Although Bayesian inference is a powerful tool for science, I believe
it is very clear after all the examples we have seen that its importance
is much wider. Bayesian reasoning is not only a tool for developing theories
or taking managerial decisions, although these are surely included. By
understanding how it works, we understand the very foundations of what
is meant by rational thinking. More than that, it shows us exactly where
our biases lie, how they enter into every single choice we make and what we
should do to modify that. You can even pinpoint where emotions enter to
bias the process! As you might guess, it will always be in the prior.
Now we finally can complete our picture:

In order to appreciate all the consequences and the power of both
probabilities and the Bayesian way of thinking, though, we need to see them
in action. That is what we are going to do in the next chapters. I almost
forgot to say one last thing: there's a catch.

The Catch
After a whole book on Bayes, it seems clear that a court of justice should be
the place where Bayesian inference would be most important to
guarantee that the level of injustice remains at its minimum. Judgements,
by their very nature, admit only two results: guilty or not guilty. If the
analysis of the whole case leads to the conclusion that no conclusion can
be reached, then the result should be the not-guilty verdict. There is then
only one option, you might think: calculate the ratio of the posteriors and
decide the case by choosing the highest one.
Lately, there has been much noise about decisions forbidding
the use of Bayesian arguments inside courts in Britain. These rulings, if
taken literally, are simply ridiculous because, as we have seen in this entire
book, you either use BAYES or you are doing something wrong. Some of the
arguments are indeed very naive, to say the least. For instance, consider
the comment given by Lord Justice Toulson on an appeal, in which he
argued that one cannot attribute a probability to a past event, since the
event either happened or it did not.
The opinion expressed in the above comment is obviously wrong.
We know that we can attribute probabilities to past events. We did exactly
that with the life of Tutankhamun. The argument is itself weird because, in
the same way as something in the past either happened or not, a smoking
person will either die or not, unless she becomes trapped in some kind of
interdimensional limbo, but that seems highly improbable. We have
learned that probabilities arise from lack of knowledge, not from any
temporal or causal order. Whenever information is missing, the entropy is
not zero and probabilities will reflect uncertainty. The refusal to accept
Bayesian reasoning here is just a question of misjudgement due to improper
knowledge of the subject. It is not the first time this has happened in courts
and it will not be the last, as we all know.
Another court decision, in 2010, also ruled out Bayesian reasoning
using a slightly different argument. The argument this time was that the
numbers used to calculate the likelihood ratio of one of the pieces of
evidence being reliable or not were not precise. Of course they will never
be, because they are probabilities. In this case, however, the complaint was
that this was not explained well to the jury and therefore it was unfair.
Without entering the discussion of whether being judged by a group of
people without guaranteed expertise in logical deduction, deception and
inference, and fully susceptible to whatever influence from popular culture
and media, is actually fair, we have spent around 200 pages just to
understand in a very basic way why BAYES is correct. And we were not
under pressure! There is no way to explain this to a jury in a couple of hours.
It is true that the prior and the likelihood contain uncertainties and
these should be taken into consideration. This is one of the catches in
Bayesian inference. However, the alternative of not using BAYES is even
worse. The way out of this is not easy. It consists of finding guiding
principles to encode information in the probabilities, like we did with
symmetry, and of continually testing them against new data. If they do not
work, like any other belief, you need to change them.
But that is not the only problem we face when taking decisions.
Why do judges worry so much about the uncertainty in the probabilities
after all? The answer is obvious, but it is worth going through it explicitly.
Think about a homicide case in the USA, in one of those states in
which the death penalty is legal. Call the proposition that the accused is
guilty $G$ and the proposition that the accused is innocent $I$. Suppose that
the priors are the same and disappear from the process of taking a decision.
Suppose also that, once the likelihoods given the evidence are calculated,
you end up with

$$\frac{P(G \mid D)}{P(I \mid D)} = 1.0001$$

This means that the probability of the accused being guilty is only
0.01% larger than the probability of that person being innocent. With such
a small margin, would you send this person to the electric chair? I would
not. That is because I know that, for this kind of probability estimation, both
the priors and the likelihoods have a high chance of being wrongly
estimated, with errors that are probably larger than this. Am I sure? No.
But killing someone is something that cannot be undone. I would rather
not risk it.
Here is the biggest catch of taking decisions: the cost of being
wrong. Whenever you use probabilities to estimate anything, there is a
chance that you are going to be wrong. This cost is not part of Bayes Rule
and cannot be calculated from any general principle of information theory.
Costs in taking decisions can even be purely emotional, without any
rational justification for them (except for the fact that being unhappy on
purpose is not rational at all). They vary with the situation and even from
person to person.
Let us say that a friend comes with a coin and wants to bet on which
side is going to turn up when the coin is flipped. He says to you that, just
for fun, he wants to bet a toothpick on the game. You probably will not
mind and will probably go on with the game. But what if your friend says
that whoever loses the game has to cross the motorway from one side to
the other with their eyes shut? You start to doubt that he is actually your friend.

The difference in these two cases is the cost of losing. There is no
problem in giving up a toothpick, you can always get another one in the
nearby cafe, but you cannot find a new life in there. What if your friend
rolls a d10 and says that you only need to cross the motorway if the number
1 turns up? I am not sure about you, but I need the chance of dying to be
much smaller than 1/10 before I agree to risk my life. However, that is just me.
Everywhere in the world you can find people willing to enter bets which are
not much different from this one.
Why do people enter bets like that? It is because the cost of
losing has to be compensated by the temptation of winning. If the prize
one receives upon winning is very attractive, you will always find people
who will accept the cost of losing. There is an interplay between these two
quantities, but it is something which is quite difficult to estimate.

On the other hand, there is also the cost of not taking the decision.
If your friend has a gun and says that you either accept his bet or he
will shoot you on the spot, then your willingness to accept the game
will instantly change. There is no easy method to quantify costs in general.
Each case is a particular case.
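One standard way to make this interplay concrete, not developed in the text but borrowed from elementary decision theory, is to compare the expected gain of a bet against its expected loss. The numbers below are invented.

```python
def expected_value(p_win, prize, cost_of_losing):
    # Win the prize with probability p_win; pay the cost otherwise.
    return p_win * prize - (1 - p_win) * cost_of_losing

# Toothpick bet on a fair coin: stakes so small the decision is easy.
print(expected_value(0.5, prize=1, cost_of_losing=1))   # 0.0

# A 1-in-10 chance of a catastrophic loss: even a 90% chance of a
# decent prize leaves the expected value hugely negative.
print(expected_value(0.9, prize=100, cost_of_losing=1_000_000))  # about -99910
```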
In this sense, when a judge shows worry about using Bayes Rule
in court, what he or she is worried about is that, if the calculation of the
probabilities was done in too uncertain a way, the costs of deciding
wrongly will be extremely high. Because we know that each case is
different, we also know that the model used to calculate likelihoods and
priors can be very poor. This, of course, should not prevent us from using
BAYES, but it should force us to analyse our models carefully and
substantiate them with enough supporting data. If even with that judges
still do not accept it, then they will be making a serious mistake.
When we talked about Maximum Entropy, we were in the business of
encoding information about the symmetry of a situation into prior
probabilities. But, in fact, there is no known general method to calculate
priors in all possible situations. The likelihood suffers from the same problem,
as it also encodes information, although of a different kind. When building the
likelihood of a proposition $A$ given the data $B$, or $P(B \mid A)$, what we really
want is to encode all the influences and relationships between these two
things in a mathematical formula. Trying to build this mathematical
formula out of the known relationships between the variables has a name
in science and mathematics. It is called creating a model, or modelling.
Another equivalent way of visualising a model is to look at it as a
program or a mathematical function. Every model can be broken up into a
collection of possibly interconnected programs, which you can consider as
a kind of super-program. In the same way as we have seen for the program
BAYES, you feed some information to models and they give you back a
different piece of information, processed in a way that is more convenient
for us to use according to the problem in our hands.

For instance, the program PRIOR asks you for a proposition $A$ and
all the information you have ever collected about that proposition. In
return, it gives you back the probability distribution that we called $P(A)$.
The probability distribution itself is information, but processed in a form
more useful for, let us say, doing inference using Bayes Rule.
The program LIKELIHOOD works in a very similar way, although it is
a bit more complex, as it is concerned with the influence of one variable on
another. This program requires you to enter two propositions $A$ and $B$ and
all the information about how one influences the other that you might have
collected somehow. It then returns to you a function $P(B \mid A)$ which takes
into consideration how a certain value of $A$ will determine the value of $B$.
Consider the case in which every value of A determines the value
of B with complete certainty. We call models of that kind deterministic, as
opposed to probabilistic models, in which we can only give probabilities
for the values of B. Suppose that x is the distance of a vehicle from a starting
position on a road and t is the time at which that distance is measured. Let
us say that the car has a speed of 30 km/h. How far from the starting point
will the car be after 2 hours? The answer is clear, 60 km, and we can write a
very simple formula relating x and t that will always work:

x = 30t

If you give any value, in hours, to t, then the value of x in
kilometres is given with complete certainty. This is a very simple
deterministic model. There is a formal way to write a deterministic model
as a probability by using something that we have seen in the zoo: the Dirac
delta. The Dirac delta is the way to encode deterministic models as
likelihoods. In the case of the above model, our notation becomes

P(x|t) = δ(x - 30t)

The meaning of the above equation is that, although we write it as
a conditional probability, the delta is telling us that we in fact know that
the only possible value of x given t is the one that makes whatever is
inside the brackets equal to zero. As you can easily confirm, if you do that
in the likelihood above you recover our formula for the position of the car
given the instant of time at which it is measured. You can think as if the
possible values for x were given by a Gaussian with zero variance at the
point x = 30t.
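As a tiny sketch, this deterministic model is a one-line program (the function and argument names are my own):

```python
def position_km(t_hours, speed_kmh=30):
    # Deterministic model: the time fixes the position exactly,
    # like a Dirac delta concentrated at x = 30t.
    return speed_kmh * t_hours

# The same input always gives the same output, with no uncertainty.
print(position_km(2))  # 60
```

Feed it any time and it answers with complete certainty; no probabilities are needed anywhere.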
Roberto C. Alamino


Probabilistic models are a generalisation of deterministic models.
In deterministic models there is no uncertainty in predicting some
variable. When uncertainty starts to appear, we then need to introduce
probabilities. If we consider the Dirac delta above as a Gaussian with zero
variance, then we can introduce uncertainty by increasing the variance
from zero to any other positive number. This would create a probabilistic
instead of a deterministic model, in which we would know the probable
position of the car, but not the exact one.
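A hedged sketch of that idea in code: replace the zero-variance delta by a Gaussian with some spread around the deterministic answer (the names and the 1 km spread are my own choices, not the book's):

```python
import random

def probable_position_km(t_hours, speed_kmh=30, sigma_km=1.0):
    # Probabilistic model: a Gaussian centred on the deterministic
    # answer; sigma_km = 0 would recover the Dirac-delta case.
    return random.gauss(speed_kmh * t_hours, sigma_km)

# Individual calls differ, but they cluster around 60 km for t = 2.
samples = [probable_position_km(2) for _ in range(10000)]
mean_km = sum(samples) / len(samples)
```

Each call gives a different answer, yet the answers concentrate around the deterministic value: probable position, not exact position.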
As another example, we know that in a coin tossing, if the
probability of heads is p, then the probability of tails is 1 - p. If we call the
result of the coin tossing r, with the two possible values heads (H) and
tails (T), then we can write a probabilistic model for the coin tossing game as

P(r|p) = p if r = H, and 1 - p if r = T

Now, if I say to you that p is 1/3, you cannot give me the exact
value of r; the most you can say is that r has a probability 1/3 of being
heads and a probability 2/3 of being tails. That uncertainty characterises
the model as a probabilistic one. This is a general model that works for any
value of p. Just choose one and plug it into the above formula.
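In code, this probabilistic model is just a pair of rules plus a sampler (the labels H and T and the function names are mine):

```python
import random

def coin_likelihood(r, p):
    # P(r | p): the model assigns probability p to heads, 1 - p to tails.
    return p if r == "H" else 1 - p

def toss(p):
    # Sampling from the model: we cannot predict r, only its odds.
    return "H" if random.random() < p else "T"

p = 1 / 3
probs = (coin_likelihood("H", p), coin_likelihood("T", p))  # p and 1 - p
```

Plug in any p you like; the model tells you the odds of each result, never the result itself.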
We have already seen an example of how to calculate the prior of a
d10 roll as long as we assume some fairness conditions. In that case, we
were building a probabilistic model of the prior. That modelling was based
on very clear assumptions. Nevertheless, we argued that we should always
test our prior by actually rolling the d10 and changing it if necessary, but up
to this point, we never really talked about changing the likelihood.
Of course, if everything we said before remains true, likelihoods
should also be changed if they are wrong, in exactly the same way as we
change priors to posteriors using Bayes' Rule. The only complication is
that the task now becomes to encode priors of the likelihoods themselves
and to modify those priors into posteriors of the likelihoods. For instance, if we
call the likelihood of B given A by L = P(B|A), then the prior of the likelihood
is something like

P(L),

a probability distribution over the possible likelihood functions themselves.
Sometimes, these are probabilities of the values of parameters in
your model, sometimes they are probabilities of how these parameters
relate to each other. Looks confusing? It can be, even for an experienced
researcher. Let us try to understand it using a very simple example. Let us
go back to the coin.
As we have seen, we can understand likelihoods as theories about
a certain phenomenon. If you are considering a coin tossing game, you
immediately create a theory about it inside your mind. A theory is created
by collecting a certain number of rules or, as we physicists like to call them,
laws. In the case of tossing a coin, we can consider the following three
basic laws:
1. The coin has two sides and only one of them can be the result of a
toss at each time.
2. The result of a coin tossing at some time does not depend on what
happened before.
3. The chances of one side being a result of the tossing do not change
with time.
All the above three laws are very natural, but they all might be
wrong! What if the coin hides some electronic mechanism that makes the
next result depend on the previous one? What if the coin is magic and the
faces keep changing? What if somehow the coin is melting and an
imbalance between the two faces keeps growing with time? We can call
the above three laws our Coin Tossing Theory, which we will call
CTT for short. We can now start to work with those laws and try to deduce
something more useful from them.
Let us start by creating some terminology that will make things
simpler for us to talk about experiments. As the coin has only two sides
according to the first law, let us call one of them heads, or H, and the other
tails, or T. Because of the second and third laws of CTT, each time we toss
the coin we can say that there is a constant probability for each side. Let us
call the probability that a result is H by the letter p and the probability that
a result is T by q. By combining CTT with what we learned about probability
theory, we know that p + q = 1 and therefore q = 1 - p. This shows that,
in fact, we can get rid of q and say that the probability of getting H is p and
of getting T is 1 - p.
What you have to notice now is that CTT does not say anything
about the fairness of the coin tossing. Therefore, we cannot say that
p = 1/2. As far as CTT is concerned, p can be anything. We call p a free
parameter of our theory. The only way to know p is by throwing the coin
and updating our information. How do we do that? The answer, as you
might expect, is Bayes' Rule. Suppose that we call the result of a coin
toss by the letter r. Each time we toss the coin, r can be either H or T.
This means that, if we want to infer the value of p, we have to find its
most probable value given the results of the coin tosses. This can
be found by using Bayes' Rule as

P(p|r) = P(r|p) P(p) / P(r)
Remember that P(p) is just the prior over the possible values of p
before that specific coin toss. Notice the subtlety of the problem here!
Because p is itself a probability, P(p) is the probability of a probability! It
should include all the information about the possible values of p that we
gathered from all previous times we tossed our coin. Clearly, if we are
talking about the very first coin toss, unless we have reasons to believe
the contrary, the most unbiased thing to do is to start by assuming that p can
be any number between zero and one (because it is a probability) with
the same odds. That prior is, however, not important to us right now. What
we are interested in discussing is actually the likelihood. The likelihood
P(r|p) basically encodes our CTT. It would not be an exaggeration to say
that P(r|p) is CTT, if we neglect some unimportant subtleties. This means
that we would like to be able to actually calculate P(r|p). In the simple
case we are studying, we can.
It is not difficult to see how this can be done if you remember that
P(r|p) means the probability of the result being r given that the
probability of getting an H is p. This just means that if r = H, then
P(r|p) = p, and if r = T, then P(r|p) = 1 - p. This is
a very neat result, but it was obtained only because we decided to assume
that the three laws of CTT are valid. If we now start to toss the coin and
estimate p using Bayes' Rule, we will of course get some result, but if CTT is
wrong, that form of Bayes' Rule will not tell us that!
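A minimal sketch of that inference, using a uniform prior over a grid of candidate values of p (the grid resolution and the particular toss sequence are arbitrary choices of mine):

```python
# Candidate values of p and a uniform prior over them.
grid = [i / 100 for i in range(101)]
posterior = [1.0 / len(grid)] * len(grid)

def bayes_update(posterior, result):
    # One application of Bayes' Rule: multiply the current belief by
    # the CTT likelihood (p for heads, 1 - p for tails), then normalise.
    post = [w * (p if result == "H" else 1 - p)
            for w, p in zip(posterior, grid)]
    total = sum(post)
    return [w / total for w in post]

for r in "HHTHHHTH":            # six heads, two tails
    posterior = bayes_update(posterior, r)

best_p = grid[posterior.index(max(posterior))]  # 0.75, i.e. 6/8
```

Note that the update blindly trusts the likelihood: if the coin secretly violates CTT, the program still returns an answer and never complains.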
Imagine, for instance, that you are not actually watching the coin
tosses with your own eyes, but that they are being broadcast to you via
radio. You simply assumed that a coin would have two faces, and you are
happily doing your inference on p until the person at the other side says to
you: "This time it is neither H nor T, it is the third side!" Your inference
crashes at that point simply because you do not have a theory that allows
you to calculate the likelihood when r is neither H nor T!
If you want to continue the inference game from that point
onwards, you will have to change CTT to allow for a three-sided coin. If you
knew from the beginning that the number of sides was a variable, let us say
N, you could even have included it as another free parameter in CTT and
done the inference of p and N altogether using Bayes' Rule with a
likelihood given by P(r|p, N, ...), where the three dots mean any other
parameters that would be necessary, but you did not know that in advance.
What do you do now?
Well, you do what every professional scientist does at this point:
you try to find new laws. The simplest modification would be to change the
first law to allow for three sides in your coin. This means that you will have
three, not two, probabilities to calculate: p for H, q for T and s for E, our
new side. You still can say that p + q + s = 1, but now you have two free
parameters instead of one. You can choose any pair since, once you find two of
them, the third is automatically defined. Let us choose p and s. Then, our
likelihood changes to

P(r|p, s) = p if r = H, s if r = E, and 1 - p - s if r = T
This is our new model, which we now call the Generalised Coin
Tossing Theory, or GCTT. In a sense, this is a more general model because,
if the coin has only two sides after all, we will end up inferring s as being
zero at some point and we will be able, with some precision, to
eliminate it from GCTT. If that happens, GCTT reduces itself to CTT.
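A sketch of the generalised likelihood, with my own labels for the sides (H, T and a third side E) and for the two free parameters:

```python
def gctt_likelihood(r, p, s):
    # P(r | p, s) for a possibly three-sided coin.
    if r == "H":
        return p
    if r == "E":            # the new, third side
        return s
    return 1 - p - s        # tails gets whatever probability is left

# If inference drives s to zero, the third side never appears and
# GCTT hands back the two-sided CTT: p for heads, 1 - p for tails.
```

The reduction to CTT is visible directly in the code: setting s = 0 collapses the three cases back to two.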
Now you can appreciate the problem created by the uncertainty of
the model. What if we need to judge a case based on the above coin
tossing, but the true coin has three sides and we are using the two-sided
model? Even if Bayes Rule gives us the best guess for the probabilities, by
using the wrong model we can only get wrong results.
What is the solution for this conundrum? The solution is that we
also need to be constantly revising our models. Sometimes, we will not
know how to do that using Bayes Rule and it is in those cases that
creativity takes place. A computer can run BAYES at any time, but only our
naturally evolved computer has the capability of imagining new models, at
least for now.
There is one last thing that we need to talk about in the case of
models. You noticed that we could either infer the free parameters of the
model or change the mathematical structure of the model itself. Although
these two things seem very different at first sight, at a higher level they are
just the same. The mathematical structure is also a piece of information
and, as any piece of information, can be written as a numerical parameter
that should be inferred!
If you do not believe this, think about the following. The
mathematical structure of electromagnetism can be contained in a book.
That book can be scanned and put into digital format, if it has not been
written like that already. In any case, the book becomes a computer file
and, as we all know, a computer file is simply a sequence of zeroes and
ones. But a sequence of zeroes and ones is nothing more than a very big
integer number! Therefore, the entire electromagnetic theory can be
encoded into an integer number, and we do that all the time without
realising it. Discovering the correct electromagnetic theory might be seen
as inferring the right integer number. The problem is how difficult it is to do
that in practice.

Suppose we have a perfectly deterministic model for our likelihood, the
car example from the previous section, for instance. Let us just change the
notation to make things easier to understand and write

P(x|t) = δ(x - vt)
This means that we are interested in the probability of the car
being at position x given that the measured time is t. The speed of the car
is v. If the clock measuring t is perfectly precise, then the position of the
car is given with perfect precision by the value x = vt.
What if, instead of you measuring the time, it is a friend of
yours, and your friend has to tell you the measured time via radio? So far,
so good, but what if the radio is very bad and full of interference? The
formula giving the car's position at a certain time remains the same, but if
you hear the numbers given by your friend wrongly, you will
calculate the wrong position of the car. If you know that it might be wrong,
then you now have an uncertainty, and probabilities are back in the game.
The radio interference is an example of what is called noise. In the
real world, whenever we measure something, that measurement always
comes with some errors, and those errors are collectively called noise. We
say that the measurement is corrupted by noise. This happens, for
instance, when you copy files in your computer. Because of imperfections
of the materials, power surges and even the probabilistic nature of
quantum mechanics, there is always a chance that some of the bits
composing your files will be flipped. This means that if they are 1 they
might become 0 and vice-versa. In computer science, this has the obvious
name of flip noise.
It would be very good if we always knew when a bit has been
corrupted by flip noise in a file, but just by looking at that particular bit there
is no way to know that. To see how this can affect a model, let us consider
the following simplified situation. Call it the deterministic coin tossing
game. Instead of actually tossing a coin, we enter a number n in the
computer. If the number is zero, the computer returns heads; if it is one,
the computer returns tails. That is a very boring game with the following
values for the likelihood of the result r of the coin given the
entered number n:

P(r = H | n = 0) = 1, P(r = T | n = 0) = 0,
P(r = H | n = 1) = 0, P(r = T | n = 1) = 1.
However, if your computer suffers from a hardware problem that
flips every bit you enter with a probability 1/2, then the results of our coin
tossing cannot be predicted with certainty anymore! Although internally
the computer is generating numbers following a deterministic rule, now
there is noise in the data you are entering! Noise is a very tricky fellow. It
can be anywhere and appear at any time. In the above case, it would be
useless to use the deterministic model for the coin tossing as you would be
wrong half of the time! The best you can do, given that you know the
strength of the noise, is to say that you have now

P(r = H | n) = P(r = T | n) = 1/2,

no matter the value of the number you enter. Noise washed away your
certainty and forced you to use probabilities once again.
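A sketch of this flip noise in code (the function name and the simulation size are my own):

```python
import random

def noisy_entry(bit, flip_prob=0.5):
    # Flip noise: the entered bit is inverted with probability flip_prob.
    return 1 - bit if random.random() < flip_prob else bit

# With flip_prob = 1/2 the output says nothing about the input:
# entering 0 every time still yields ones about half of the time.
results = [noisy_entry(0) for _ in range(10000)]
fraction_ones = sum(results) / len(results)  # close to 1/2
```

Lowering flip_prob towards zero recovers the deterministic game; raising it to 1/2 destroys all the information in the entered number.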
Whenever there is a chance that noise is present, either in your
measurements or anywhere else in your model, you have to take that into
consideration when building it. If you do not, you will just be fooling yourself.
The deeper meaning behind this is related to something that we
discussed before. In order to check our models, we also need a model of
how the measurements are taken and how noise corrupts those
measurements. But to check the noise model, we need another model that
says something about the errors involved in checking the noise model and
so on ad infinitum. The way out is, as we have seen before, to rely on the
consistency of our models.
But we do not need to go as deep as that to fight noise in general.
The branch of mathematics known as information theory, a great part of
which can be attributed to the works of Shannon, revealed to us many
techniques that can be used to shield us from errors. It is variations of
these techniques that allow us to transmit information back and forth
through the web without losing the contents irreparably. All of them are
based on a very simple concept called redundancy.
Yu can probly understd tis sentnc in Englsh ven wit sme o te letrs
missin. U r also capable of understand words when they r not completely
written. Eevn mroe aznmialgy, you can uaterndnsd tihs etnrie sncnetee in
wcihh all the wdors are wnlgory wtiertn epecxt for the frsit and lsat lteerts!
The last example is the most difficult because there are many more
errors, but you can still understand the sentence with some effort. How is
it possible that we can correct the errors of a text with such a high level of
noise? What is happening here is that all languages, English included, have
a high level of redundancy. By that, I mean that the number of words that
really exist is so much smaller than the number of possible combinations of
letters that only a few of these combinations make sense. The larger the
word, the easier the correction becomes, because there are far fewer possibilities.
But languages have other characteristics which create extra
redundancy, like grammar and semantics. These rules guarantee that there
is a certain rational order in combining the words that, if subverted,
renders the sentences meaningless. For instance, the following sentence is
extremely rare because it does not make any sense:

THE OF RUNS BLUE QUICKLY CAR

The sequence of words in this sentence does not make sense, so
upon reading it we know something must be wrong. In fact, if the publisher
of this book exchanged any of the words in my original sentence, there is
no way to know that! On the other hand, you can easily fill in the missing
word in the following sentence:

HE PUT THE HAT ON HIS ____.

Surely there are other possibilities, but it is almost sure that the
missing word of this sentence is HEAD. More than that, the above sentence
makes sense because it has nouns and verbs arranged in an order that we
know is correct. This kind of redundancy can even help you when you are
learning languages that are not your native one.
But what happens if you have in your hands a message in a
language that you do not know? How do you know there is a mistake in it?
This is what happens in computers and other digital devices all the time.
Computers copy information, but they do not interpret it. How can they
correct the possible errors then? The answer is in the grammar rules. In
particular, the spelling rules of the language. Some languages have very
strict orthographical rules and, once you learn them, you can identify errors
even in words whose meaning you do not know.
For instance, Portuguese has a rule that says that you never use "n"
before "p" or "b"; it is always "m". This means that, even without knowing what
the word "campo" means, if you see it written as "canpo" you immediately
know it is wrong. There are, of course, exceptions, but they are few enough
for the rule to work most of the time and give you a good error-correction scheme.
Error-correction is indeed the technical term used in information
theory for the process of identifying and correcting errors in messages. It
would be much easier to correct errors if we could read every message, but
nobody wants that, right? Well, except maybe the NSA, but that is a subject
for another book. The solution found was to create a code in which the
orthographical rules are so strict that one cannot violate them without
someone else noticing.
The problem then becomes to translate every message into this
code. When we are dealing with computers, this code has only two
characters: 0 and 1. You can imagine how tricky it is to create a set of
rules of this kind in such a way that we can translate every kind of
information into this code. Shannon, once again, showed that this is possible,
but only if you introduce redundancy in the encoded message. The more
redundancy you introduce, the easier it becomes to recover the message.
These codes are appropriately called error-correction codes.
Let us see a classical example called, for reasons that will become
obvious in a moment, repetition codes. Suppose you want to send a
message to a friend which is either 0 or 1, but something will happen to
your message and there is a 1-in-5 chance of your bit being flipped. How can
you guarantee that your friend will receive the correct message? The
answer is obvious: you send the same message several times. As we have
learned, the Law of Large Numbers guarantees that, the more times we
send it, the closer the fraction of flipped bits will get to 1/5 of the total
bits received by your friend. The only thing your friend needs to do is to
count which bit appears more times.
Clearly, the more probable the flipping, the more bits you have
to send to guarantee that it will be possible to recover your message. For
instance, suppose that exactly 1 in every 5 bits is flipped. Then, if you send
a chain of five 1s, your friend will receive something like

1 1 0 1 1

and it is not difficult to see that the correct message is 1. The most
frequent number wins. This is called, technically, the majority rule. If your
friend receives

0 0 1 0 0

the majority of the bits is 0 and, therefore, your message was probably 0.
You can also calculate the average value of the received bits. In the first
case you would have 0.8 and in the second it would be 0.2. As you can
notice, you would just need to choose the bit closest to the average to get
the correct answer.
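The majority rule fits in a one-line decoder. A sketch, assuming the received message arrives as a list of bits:

```python
def decode_majority(received):
    # Repetition-code decoding: the most frequent bit wins, which is
    # the same as picking the bit closest to the average.
    return 1 if sum(received) > len(received) / 2 else 0

decoded_a = decode_majority([1, 1, 0, 1, 1])  # average 0.8, decodes to 1
decoded_b = decode_majority([0, 0, 1, 0, 0])  # average 0.2, decodes to 0
```

The decoder never looks at what the message means; it only exploits the redundancy that the repetition introduced.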
Now, suppose that exactly 1 in every 2 bits is flipped. What would
be the original message if you receive this?

1 0 0 1 1 0

In fact, unless your friend knows that your message can only be 1,
there is no way in this case to discover the message! You have only two
characters and they can either flip or not with equal probability each time.
No repetition code can help here. The problem is that the noise is so high
in this case that no amount of redundancy will allow you to correct errors.
In this case, of course, the only way out would be to find better hardware
for your transmission.
When noise is not so severe, though, there are much more efficient
schemes than repetition codes. These are used all the time in computers.
As I said before, the efficiency of these codes does not depend on the
content of the message, although knowing that would definitely help. We
can now generalise this idea beyond simple binary messages to every kind
of information we collect from the world.
It is because of noise that we repeat the same measurement many
times in science. We hope that by repeating the experiment, if the noise is
small enough, we can detect what is the most probable correct result.
Consistency of the results among different scientific theories is a kind of
redundancy that allows us to check if the result is correct or not, or maybe
if the theories themselves are correct or not.
Noise can be anywhere. It can be acoustic noise, which prevents us
from hearing correctly what another person is saying. It can be visual noise,
like when you are short-sighted and without your glasses. It can be the
strong rain which prevents you from seeing the car in front of yours. Our
brain is a wonderful error-correcting machine, but it is so because it uses
prior knowledge and full information integration to create redundancy and
fill in the missing gaps. Even so, no matter how good it is, if you have
your eyes closed you cannot see the car ahead.
The fact that noise is everywhere means that there is always some
information lacking and, because of this, most of the time we have to use
probabilistic theories instead of deterministic ones. We have to learn how
to deal with noise and never forget to take it into consideration.
Roughly speaking, a fallacy is an error in a chain of logical reasoning. The
error can be anywhere. It might be hiding in the initial hypothesis or in any
one of the steps we take to reach the final conclusion.
Each one of the steps in a logical reasoning is based on some rule
of logic we assume to be true. We have seen some of those rules as logical
operators. For instance, if I say that either I am a physicist or my brother is
a physicist and then I say that my brother is not a physicist, you can
conclude that I am a physicist. This is the rule that:
If (A OR B) is TRUE, then if A is FALSE, B must be TRUE.
This is a simple rule that we assume to be true. Sometimes,
though, we have a tendency to assume rules which are not necessarily
true. These are fallacies.
Probably the most common fallacy is the authority argument. This
is an error of reasoning that happens by assuming that a recognised
authority in a subject is always right. For instance, consider the following
chain of reasoning:
The doctor said I have a disease.
Doctors are authorities on diseases.
Therefore, I have the disease that the doctor diagnosed.
If that was always true, we would not have deaths due to medical
errors. Although it seems logical to believe in physicians because they
studied much more about diseases than other people, that does not
prevent them from being humans and committing errors of judgement.
However, the clever reader must now be pointing to the fact that,
although it is not certain, it is probable that the physician is right. The odds
The Probable Universe


that a physician knows better about a disease are higher than those for a
non-physician. Without any other information, if you are forced to choose
between the advice of the physician and of the non-physician, the best
thing to do is to go with the physician. That is equivalent to choosing a
hypothesis based on the prior information that a physician has a higher
knowledge on matters of health.
Therefore, based on BAYES, the probability that the physician is
right is higher, but it is not 100%! There is a non-zero chance that the
physician is wrong, and you have to prepare yourself for that. With luck, we
might be able to change our decision with time. If the treatment is not
working, we change our belief in the physician and look for another one.
Incredibly enough, a fallacy can be (though not all of them actually are)
evidence that increases the probability of something, but never enough to
make that thing 100% certain. That is why you have to be careful with fallacies
and with how they are used.
Consensus is a fallacy that works in a similar way. It is logically
wrong, but has a large probability weight. Although 1000 expert opinions
do not constitute a proof for something, the more experts agree with
something, the more probable it is that they are correct (but still, they can
be wrong). All the experts who once agreed that the Sun revolves around
the Earth were still wrong, no matter their number.
What you always need to remember also is that more probable
does not mean highly probable. Suppose you have three doors and behind
one of them is a prize. The fact that there is a chance of 0.01% of the prize
being in door #1 and 1% of being in door #2 means that door #2 is 100
times more probable than door #1. Still, the probability of the prize being
in door #3 is 98.99 %! More probable in comparison to another probability
does not mean highly probable in absolute terms! Once again, though, if
you are forced to choose only between doors #1 and #2, you would be a
fool not to choose the latter.
As I said, though, not all fallacies can be seen as probabilistic
evidence in favour of something. For instance, the argument from
ignorance, a fallacy in which something is considered to be true because it
cannot be proved wrong, has no probabilistic use whatsoever. Not being
able to prove a proposition wrong does not increase its chances of being
correct, as can be easily seen with Carl Sagan's invisible dragon (Sagan, 1995).
The invisible dragon is a dragon that a child swears to her parents
exists in their garage. The parents go there to check and the child says that
they cannot see it because it is invisible. The parents then try to find its
footprints, but the child says that the dragon flies. The parents then
prepare a trap in which paint is sprayed on the dragon, but the child, once
again, argues that the dragon can become intangible. The existence of the
dragon cannot be disproved. Still, that does not make it more probable.
The list of fallacies is extensive and changes depending on the
source. One good resource is Wikipedia's list of fallacies.
Each one of these fallacies works in a different way, and their
validity as evidence in favour of some reasoning can only be evaluated by
analysing it carefully case by case. BAYES does not change the fact that they
are wrong when used to reach certainties, but they can have some
probabilistic value once in a while. The important thing is to know how to
recognise when.
The Universe and Everything Else
Physics is the branch of science that studies the simplest systems in the
universe. Although simple, they span a huge range of sizes, from the whole
universe to subatomic particles and beyond. By the 17th century, with
Isaac Newton, a very precise model of the phenomena which we agree today
to fall under the physics umbrella was developed. This model, known as
Newtonian Mechanics, was a deterministic one. The underlying idea was
that the natural patterns were not probabilistic but deterministic,
and probabilities, if needed at all, would appear from incomplete
knowledge of the initial conditions for each new prediction.
The 19th century saw the rise of electromagnetism and
thermodynamics, but both were models which fitted very well into
Newton's theory. Even with the discovery of relativity and the recognition
that simultaneity is a relative concept, the theory of relativity remained a
deterministic model. Given a set of initial conditions for any system, one
could predict its entire future with complete certainty.
Something, however, started to change at more or less the same
time relativity appeared. When physicists tried to predict the amount of
energy emitted by an object kept at a certain temperature using Newton's
model, complemented by Maxwell's electromagnetic theory, they ended
up finding an infinite value for this energy. Clearly an object at a certain
temperature cannot be emitting infinite amounts of energy, otherwise we
would never need to worry about sustainability.
The solution to this and other small inconsistencies between
theory and experiment was at first seen as a detail that would be sorted out
sooner or later. The prior of Newton's theory was too strong, as it had been
working without failing for over 200 years up to that point. But in the best
Bayesian style, too much evidence started to accumulate, indicating that
there was something wrong with the Newtonian description of those phenomena.
It turns out that the solution would involve a series of changes in
the prevailing models of physical phenomena, one of the most radical
being that we were forced to accept that probabilities are a fundamental
part of the description of nature even at its most basic level. I am talking, of
course, of quantum mechanics.
In order to understand where probabilities are hidden in quantum
mechanics, I have first to give you a crash course on it. It will involve a little
bit of mathematics, of course, but it will not be more than we had to use to
understand probabilities up to this point. Of course, we will need new
symbols, but you are already used to that.
The shroud of mystery covering quantum mechanics is partially
disappearing due to the fact that it is now the leading technology of our
time. There is still an air of fantasy and science fiction around it, and the
idea that it is something almost impossible to grasp is always circulating.
Quantum mechanics is indeed counter-intuitive in the sense that in
the microscopic world, where its effects are more accentuated, the results
drastically differ from what we are used to seeing in our daily, macroscopic
lives, and those are the observations that shape our prior about the world
in which we live. It would be fair to say that we still cannot understand
the principles of quantum mechanics by applying our classical thinking,
where by classical I tautologically mean everything which is not quantum.
In fact, we might never be able to fill this gap as the quantum description
seems to be more fundamental than any classical reasoning and some
argue that we should find a way to get used to it as it is.
If one assumes this posture and accepts its strangeness at least for
now, the amount of mathematics needed to gain some grasp of it is not
much worse than high school level and not much more complicated than
what we used to learn about probabilities. If you are a professional, of
course you will need some more involved concepts, like differential
equations and integrals, but we will not need them here.
The basic object of the model we call quantum mechanics is called
a state vector. You probably learnt that vectors are arrows of some size
pointing in some direction. They are used to represent quantities that have
a certain magnitude and a certain direction, like velocities for instance. It
turns out that the concept of an arrow can be generalised to a mathematical entity in such a way that it keeps all the properties that the original arrows have: they can be added, subtracted and multiplied by
numbers to make them larger, smaller or simply to change their direction.
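These arrow operations are easy to sketch in code. The snippet below (a minimal illustration of my own, not from the text, using plain Python lists as two-dimensional arrows) shows vectors being added and multiplied by numbers:

```python
# Two-dimensional vectors as plain Python lists: [x, y].

def add(u, v):
    """Add two vectors component by component."""
    return [u[0] + v[0], u[1] + v[1]]

def scale(c, v):
    """Multiply a vector by a number c, stretching, shrinking or flipping it."""
    return [c * v[0], c * v[1]]

u = [1.0, 2.0]
v = [3.0, -1.0]

print(add(u, v))       # [4.0, 1.0]
print(scale(2.0, u))   # [2.0, 4.0]  -- twice as long, same direction
print(scale(-1.0, u))  # [-1.0, -2.0] -- same length, opposite direction
```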
A state vector is like one of these arrows, but it is designed to
represent all the information about a physical system. The simplest of all
these state vectors is the one that describes the spin of an electron. The
electron, like other fundamental particles, is also a very tiny magnet, with a
north and a south magnetic pole. To represent this, we use a small arrow, drawn in such a way that the head indicates the north pole and the tail the south pole.
The name spin comes from the fact that we can associate the
magnetic field of the electron to a magnetic field produced by a current
spinning around. In this picture, the electron would be a spinning sphere
and its magnetic field would be produced by the effect of the negative
charge of the electron spinning. The correct picture is a little more
complicated and involves concepts about symmetry under rotations, but
Roberto C. Alamino


we will not talk much about that. The important thing is that spin is
associated somehow with rotations and not directly to charge or magnetic
fields. Even a neutral particle can have spin as it can, in a sense, rotate.
It happens that when a big object like a tennis ball spins around an
axis, this axis can be pointing in any direction. We say that it can assume a continuum of directions because we can rotate the axis smoothly
in whatever way we want. But experiments (data) in quantum mechanics
force us to accept that the electron spin is slightly different. Whenever we
try to measure in which direction each one of a bunch of electrons is
spinning, we can only find two mutually exclusive answers! They are always
spinning with the same speed in parallel directions, like the arrows ↑ and ↓.
Because of this property, we say that the spin of the electron is
quantised and assumes one of two states that we call up and down
(depending on the alignment of the axis, it could be left and right for
instance, but that is not important) and write them with the symbols |↑⟩ and |↓⟩.
The arrows have an obvious meaning, but they are enclosed by a
strange-looking mixed bracket. That kind of bracket is called a ket and it is
a notation for vectors invented by the physicist Paul Dirac and therefore
called the Dirac notation for vectors. Many of us learned another notation in high school. There, for instance, when we wanted to label a vector by the letter v, we asserted its vectorial character by putting an arrow over it and writing v⃗. This notation is also valid, but the Dirac notation is more
convenient when you want to give larger labels, like words, or strange ones
like the spin arrows.
The strangeness of quantum mechanics does not stop in the
quantisation of the spin. In the macroscopic world, things are usually in one
single state at a time. A tennis ball either spins in one direction or the
other. Fundamental particles, however, cannot be said to be spinning in
one direction or another until you measure them. Before you do, we say
that they are spinning in both directions at the same time! How do we
know it? Well, that is a tricky question.
As we can easily understand by now, quantum mechanics, like any
other model about the physical reality, was created to encode all the
information we collected from microscopic experiments. The way we
describe the spin of an electron in this model is by the state kets, but the
mysterious thing is that if we try to assign a definite spin up or down to the
electron before we measure it, we arrive at inconsistencies, also known as paradoxes.
Einstein, together with the physicists Boris Podolsky and Nathan
Rosen, discovered one of these paradoxes and, since then, it has been
called the EPR paradox. They discovered that, if we assign definite spins for
a pair of electrons that are created together in a very special way called an
entangled state, we have problems. In an entangled pair of electrons, if
one of the electrons has a spin up, the other has a spin down, and vice-versa. No matter how far they travel after creation, if you measure one of
the spins and another person measures the other, you both will find
opposite answers every time.
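A toy sketch of this perfect anti-correlation can be written in a few lines of Python. To be clear, this is not a simulation of quantum mechanics, only of the same-axis measurement statistics described above: each outcome is random, but the two outcomes always disagree.

```python
import random

def measure_entangled_pair():
    """Same-axis measurement results for one entangled pair (toy model)."""
    alice = random.choice(["up", "down"])    # one result is random...
    bob = "down" if alice == "up" else "up"  # ...the other is always opposite
    return alice, bob

for _ in range(10):
    a, b = measure_entangled_pair()
    print(a, b)  # the two results disagree every time
```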
If you use the laws of quantum mechanics to calculate what
happens and consider that both electrons have their spins defined at the
moment of creation, EPR discovered that this implies an interaction between the electrons that travels faster than light, which, especially for Einstein, could not be true, as relativity implied that this could not happen.
Indeed, today we know of only two solutions for this paradox. The
solution that guarantees that interactions cannot act instantaneously at
distance requires the spins not to exist until they are measured. This is the
usual interpretation and is the one assumed by usual quantum mechanics.
In some theories, called non-local theories, we can assign a definite state
to the electron, but then we have to admit that physics is non-local,
meaning that actions at a distance can be instantaneous. Usually, this is a
very unpopular solution. In both cases, information itself cannot be
transmitted faster than light and the spirit of Einstein's relativity (no signal
travels faster than light) remains intact.
Because in the usual quantum mechanics we cannot assign a
definite state to the electron until we measure it, we represent the state of
an electron by what we call a linear combination of the up and down
states. A linear combination is simply what you get if you multiply each
state ket by a number (remember they are vectors and, therefore, we can
do this) and add them. In symbols, we write the rather esoteric equation

|ψ⟩ = α|↑⟩ + β|↓⟩
The symbol ψ is the Greek letter psi and must be familiar to
psychologists. I am only using this letter to describe the combined state of
the electron because it is traditional in physics. I could have used any other
label if I wanted. We are getting very close to the place in which
probabilities enter quantum mechanics. Stay with me.
Let us understand what the physical meaning of the symbols in the
above equation is. As is the rule in quantum mechanics, α and β are not usual numbers in the sense that they are not real numbers; they are
complex numbers. Complex numbers include the usual real numbers plus
all the square roots of negative numbers, usually called imaginary numbers for obvious reasons, plus all the mixed additions and
subtractions between members of the two classes. Remember our
discussion about Cardano and polynomial equations of degree 3? They
appeared there for the first time.
Complex numbers can all be written in the form r + si, where r and s are real numbers and i = √−1. You will notice that, because of that, every complex number is completely characterised by a pair of real numbers, r and s. The number r is called the real part of the complex number and s, because it is multiplying i, is called the imaginary part.
Consequently, we can associate to every complex number a two-
dimensional vector, by which I mean an arrow on a plane! You can see a
picture of this below

The arrow, for which I chose the name z, is the graphical representation of the complex number. We can write this as z = r + si.
The shaded area below it forms a right triangle. This allows us to find the size of the arrow by using the well-known Pythagoras' Theorem. The size of the arrow associated to z is called its modulus, and it has the symbol |z|. If you look at the picture, you will see that the modulus of z is
the hypotenuse of the triangle and, therefore, Pythagoras' Theorem gives

|z| = √(r² + s²)
Why is this important at all? Because the modulus of a complex
number holds the last clue to understand the physical meaning of a
quantum mechanical state!
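Python handles complex numbers natively, so we can check this formula directly. A quick sketch (the value of z is an arbitrary example of mine):

```python
import math

z = 3 + 4j  # the complex number r + s*i with r = 3, s = 4

# Pythagoras' theorem on the components of the arrow (r, s)...
modulus = math.sqrt(3**2 + 4**2)

# ...agrees with Python's built-in modulus of a complex number.
print(modulus)  # 5.0
print(abs(z))   # 5.0
```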
Right in the beginning of quantum theory, the physicist Max Born
proposed a rule that ended up being the correct way of understanding
what a state ket like our

|ψ⟩ = α|↑⟩ + β|↓⟩

really means. Long story short, whenever we measure an electron in the state |ψ⟩, we get a spin up with probability proportional to |α|² and a spin down with probability proportional to |β|².
Of course, we want to normalise these probabilities and that is easily done.
In the above case, all we need to do is to divide the ket by √(|α|² + |β|²). In
quantum mechanics, two kets differing by a multiplicative constant are
considered to represent the same physical state.
And that's how probabilities enter quantum mechanics!
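We can check Born's rule numerically with a short simulation. In this sketch (the amplitudes α and β below are arbitrary example values of mine), the probability of spin up is |α|²/(|α|² + |β|²), and after many simulated measurements the relative frequency of "up" approaches it, even though each individual outcome remains unpredictable:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

alpha = 1 + 1j  # amplitude of |up>   (arbitrary example)
beta = 1 + 0j   # amplitude of |down> (arbitrary example)

# Born's rule after normalisation: probability of measuring spin up.
p_up = abs(alpha) ** 2 / (abs(alpha) ** 2 + abs(beta) ** 2)  # = 2/3 here

n = 100_000
ups = sum(random.random() < p_up for _ in range(n))  # simulated measurements

print(p_up)     # ~0.667
print(ups / n)  # close to 0.667, but only as a relative frequency
```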
There are many other strange aspects in quantum mechanics, but
it is during the act of measurement that probabilities enter it. Whenever
you measure a state which is a superposition of two others in quantum
mechanics, you cannot predict the result, only the probabilities. This is not
because you are losing some information. Even if you have a perfect
measuring device, you could only predict probabilities. They are not a
consequence of incomplete knowledge in quantum mechanics. They are a
fundamental characteristic of the model.
Many people now argue that these probabilities are fundamental,
they do not represent states of knowledge or belief, but fundamental
frequencies in the sense that if the experiment is repeated many times, the
relative frequencies of the outcome are a property of the system, a very
objective one. As we have seen before, that is not the correct point of view.
Consider the following experiment. Imagine that you have a two-
dimensional sheet of atoms. There is one real example of that, called
graphene. Graphene is a sheet of carbon atoms organised in a hexagonal
lattice, which is the grid of points formed by the vertices of a series of tiled
hexagons like the picture below.

At each one of the vertices of this lattice, like the one marked by
the small green circle, lies one carbon atom. Imagine that instead of carbon
atoms, we have a lattice with fixed electrons. If we measure the spin of
each electron in the direction perpendicular to the plane, we will either
measure a spin up or down, as we have already seen.
Assume that we can arrange an initial configuration of electrons in
the lattice such that, for each vertex, the probability of up or down spin is
given by our state ket above already normalised. Is the probability a
feature of the electron? The answer is a clear no. It is a feature of the
entire system, which means it also depends on how the experiment is performed.
If we decide to use some device to generate a magnetic field
perpendicular to the plane on which the lattice lies, the spins of the
electrons will tend to align with the field and the failure to align perfectly
will depend on the temperature of the system. In this case, the probability
of measuring the spin up and down will change. You need information
about the whole experimental setup to calculate it.
It is never enough to repeat this: the probability stores information
about the whole system. In many cases, this information can be
summarised in the symmetries, or lack of them, in the system. In classical mechanics, probabilities arise from our inability to reproduce the exact initial configuration at each repetition of the experiment, which results in a lack of knowledge (remember chaos?). For quantum mechanics, even a
perfect reproduction of the initial state will allow us to predict only
probabilities. If there is one sense in which probabilities can be said to be fundamental, it is this one; but they still capture information about the whole setup and can be wrongly inferred.
If you like science to the point of reading books by Stephen Hawking and Carl Sagan, you probably heard that the main goal of physics for the last 60
or more years has been to find a theory of quantum gravity. You also
probably heard that quantum mechanics and general relativity are not
consistent when used together and that, maybe, string theory can do the
magic. However, nobody knows if string theory is right because it is difficult
to test. That is an inspiring story, but it is only partially true.
Quantum gravity is indeed one of the main goals of theoretical
physics today, but it is not the only main goal. Quantum gravity would be
the culminating point of extending physics in two directions: the very large
and the very small.
Quantum mechanics, as we have seen, is our present model for the
physics of very small systems. If gravity is not taken into consideration, it
can be consistently described together with special relativity in a model
which is called quantum field theory, or QFT. This model is capable of
describing with the best precision known in science most experiments
related to microscopic physics.
On the other extreme, we have general relativity, or GR, which is also very successful in describing most of the astronomical observations we have recorded since we started to look at the stars. Where Newtonian gravity fails, even if by a very small margin, general relativity corrects it with high precision.
QFT and GR however cannot be used in conjunction when we deal
with phenomena involving high energy and gravity. In the usual
microscopic world, masses are small and GR is not very important. In the
usual cosmological world, things are large and QFT is not very important.
But because relativity teaches us that mass and energy are equivalent,
whenever too much energy is concentrated in a too small region of space,
we should use both QFT and GR at the same time. This happens, for
instance, in the centre of black holes where a very strange object called a
singularity might exist (although most people bet it does not). The problem
is, if you try to calculate things in that regime, mathematical
inconsistencies start to appear and you cannot get any useful result. As
nature apparently works well everywhere, we tend to think that the
problem is that we did not get the models right.
I told you before that many theoretical physicists call the yet-to-be-
found consistent model that includes QFT and GR by the exaggerated name
of Theory of Everything, or TOE. String theory is considered a candidate for a TOE as it is an attempt at a unified description of QFT and GR. But there is a detail left out here. If, once we unify QFT and GR, we have a
theory of everything, then what else remains?
The point is that QFT and GR are both fundamental theories and,
consequently, the TOE will be a fundamental theory as well. Fundamental
theories are supposed to contain sets of fundamental principles that
should be obeyed by all other physical models and, in this sense, if we
find all fundamental principles we have already achieved a great goal.
However, we still cannot explain everything.
The reason why we cannot is that our goal is to use the
fundamental principles to predict the behaviour of physical systems. When
we have very simple physical systems, it is very easy to apply those
fundamental principles. In fact, most of the fundamental principles are
supposed to be readily applicable to simple situations, but become more difficult to apply to complex ones.
The best example of a complex system that does not look that
complex at first sight is one which is very familiar to us and that we used as
an example in many other situations in this book: water. Water is made of
molecules containing two atoms of Hydrogen and one of Oxygen. If you
look at one such molecule, it is not that difficult to predict its behaviour. We can measure its speed and mass and calculate its trajectory with acceptable precision for all practical purposes. But one molecule of H₂O is not really water! Water is something much more complex than one molecule. Water is what you get when you put a mind-boggling number of H₂O molecules together at a certain temperature. If you change
the temperature, you change this collection of molecules from water to ice
or to vapour. And all three have completely different behaviours! You do
not wash your clothes with ice and you do not drink vapour when you are
thirsty. The whole process of passing from one to the other, in which these characteristics change, depends on the complex interactions taking place between the molecules.
To describe scenarios like this one, a whole new bunch of concepts
had to be introduced which deal only with situations where you have a
huge number of interacting units. Concepts that, afterwards, made their
way back to more fundamental theories. Temperature is one of them.
Entropy, our old friend, has its origins in the physics of the very complex
too. As you might imagine, the more complex a system becomes, the more
difficult it is to keep track of everything happening within it, and one thing
we know is that when this happens we need probabilities!
The level of complexity increases in general as we move from physics towards the social sciences. In the 1900s, quantum mechanics
found a bridge between the very simple systems of physics and the more
complex ones of chemistry. The next step after chemistry, known as
biology, requires an even bigger jump which has not been fully made even
today. The same happens as we climb up the ladder. Each time, we get a
higher number of different systems and more diverse interactions between
them. This is the barrier of the complex that we are also trying to break.
Breaking this barrier is an even more unifying goal than the alleged TOE, as it is not restricted to a theory of physics, but aims to bring together in a consistent way models
throughout all areas of knowledge. From biology, we go to psychology,
then to sociology and economics and ecology.
No matter how complex, the miracle is that the systems keep
presenting to us a series of repeating patterns depending on the way we
look at them. The art is to throw away the correct amount of information
to isolate those patterns in a sea of noise. To salvage the right datasets that
will allow us to develop our probabilistic models of nature is the ultimate
goal of science.
Some time ago, I was trying to understand why people believe in irrational
things. I did what every physicist would do: I drafted a very simple model
and looked at it from every point of view I could imagine. I changed a bit
here and there until I could get some basic understanding out of it. It
looked interesting. Then, I talked about it to a friend of mine who is a
psychologist with the hope that he could point me to more interesting
problems or that we could start a collaboration and try to apply the model
to some interesting problems. My friend looked at it and replied with "It is interesting, but I don't think it can be right. Every human is different, you cannot understand everyone with the same formula." I looked at my friend and did not say anything as I did not want to be rude, but I immediately thought "If everyone is completely different, how can psychology even work?"
If every human being on the planet reacted completely differently
to everything, then no one could infer patterns of behaviour. This would
mean that, if you are a psychologist, everything you learned at university about behaviour would never be useful, because it was learned by studying people who will not be your patients and, therefore, will have
completely different behaviours. Still, we do know two things that, at least
for my friend, should seem paradoxical: every individual is different and
psychology does work. This is what I call the Psychologist Paradox, in
honour of my friend.
Of course there is no real paradox. The misleading idea here is that
people are completely different. We all know that people are not
completely different. As a poet once said, if we are cut we all bleed the
same, right? We are all variations over the same theme with noise helping
to increase our variety. Almost everyone will shout if burned by a cigarette.
That is why we do not do it (in general) to other people. But a very small
number of people do not feel pain. These are exceptions, or as we learned
previously, noise.
In the case of the sensation of pain, natural selection was
responsible for decreasing the amount of noise. Those individuals that did not feel pain ended up dying more easily. Those who felt too much pain could not carry out the tasks necessary for survival. But whatever lies in between these extremes became fair game. We can safely say that the typical human being feels pain, some more, some less. In popular lingo, typical usually means normal. In fact, even the psychologists' measure of the normal is equivalent to the typical.
What my friend's intuition was in fact trying to say was that there is a lot of variation in human beings and that, because he had very little experience with mathematical models of physical phenomena, he could not
believe that such a simple model could capture the similarities between
individuals. There is indeed a huge amount of noise in the making of a
human being. That is because the places in which this noise can enter are
extremely numerous. Still, miraculously, in some cases the noise is kept extremely small, mostly due to some kind of selection mechanism, be it natural or cultural.
In any case, there are clear patterns in human behaviour. The tricky part is to deal with the noise and the consequent missing information. As we
have already seen, we deal with noise by using a statistical description. We
talked about that when we saw what physicians really mean by "your chances of survival are 20%". They are calculating the average case of
many patients and hoping that the Law of Large Numbers is indicating the
right probabilities.
Probabilities are being increasingly used to describe biological and
social systems when they are composed of a large number of interacting
parts. We call systems like that, in a very broad sense, complex systems. In
biology, for instance, we can think of the constituent parts as cells or
neurons (in the case of the brain). In sociology, we deal with a huge
number of human beings. In all cases, we justify our hopes that we will get
useful results by using the ubiquitous Law of Large Numbers. Why?
Because it works so many times that we are led to have a very optimistic
prior about it.
Although each one of the sciences has its particularities and
different objectives, there are lots of overlaps. In many of the above
systems, one is interested in identifying and understanding something
which is called an emergent behaviour. The example of the water changing phases (from ice to liquid, from liquid to vapour...) is a type of emergent behaviour.
If you again imagine an isolated water molecule, there is nothing in
it that hints at the fact that it can change from water to ice to vapour.
These changes are called in physics phase transitions and are a result of
the interactions between the molecules. Because they can only happen
when a collection of molecules is present, since water/ice/vapour are nonsensical words to apply to a single molecule, they are called collective behaviour. And because they emerge from simpler laws, they are called emergent.
This kind of behaviour is present in many different complex
systems. For instance, consciousness is the ultimate emergent behaviour as it
is surely a result of the interaction of billions of neurons. In the same way,
groups of people can organise themselves into a revolution, or the car
traffic can suddenly become jammed without a clear reason. All these are
emergent behaviours. Presently, due to the power of modern computers
that allow us to simulate this kind of system, complex systems became very
fashionable in all areas of science.
The idea that everything in the universe should obey the same set of laws
comes from the belief that the division of nature into areas of knowledge is
something created by us to understand it better, but that nature is in
reality just one. This, among other things, means that if we knew all
important laws governing the simple things, we should be able to somehow
make a bridge between them and the laws we know that rule the complex ones.
One of the simplest things we know is an atom, so let us start
there. The idea that matter was composed of atoms was first proposed by the Greeks, with the first references pointing to Democritus in the 5th century BC. However, the first experimental evidence of the existence of atoms had to wait for Einstein's work on Brownian motion, nearly twenty-four centuries later, in 1905. It would take 76 more years for us to construct a device
powerful enough to allow us to indirectly picture single atoms, the
scanning tunnelling microscope.
Scanning Tunnelling Microscope image of a piece of gold. Each one of the visible blobs
represents a gold atom.
A hydrogen atom has a size of the order of 10⁻¹⁰ meters, with "of the order" meaning that the size is some one-digit number times that. This
is about one-hundred thousand (100 000) times smaller than a thin human
hair. Due to its size, light does not bounce off an atom in a way that enables
our eyes to detect enough information to create a visual model of it. In
more mundane words, atoms are too small to see even with strong lenses.
The reason is not that we do not possess lenses strong enough to magnify things that much, but that atoms are already small enough to make the quantum effects that our brain usually ignores make a lot of difference.
In order for our eyes to directly detect an object, light must hit the
object and be reflected by it. This reflected light brings to our eyes
information, which is then interpreted by our brain. In order for something
to reflect light, it must interact with it and, usually, the object must be as
large as the wavelength of the light. The wavelength of any wave is the
distance that it takes for the wave to repeat itself. You can visualise it from
the picture below, where the distance marked by the black line
corresponds to the wavelength of the wave.
The problem is that the minimum wavelength our eyes can detect is of the order of 10⁻⁷ meters, which means that the kind of light that can interact with one single atom is 1000 times smaller than what we can detect. Even if the atom was 1000 times larger, we would still have the problem of the limited resolution of our eyes. We would still need a microscope to see it, in the same way we do with cells. No wonder the existence of cells was proved in 1665, when Robert Hooke actually saw a cell via a microscope, 240 years before atoms were proved to exist.
But as we all know, when the number of atoms is large enough and
they are packed together really close to each other, we can see the object
they form. They are the objects of our daily life. Solids, liquids, gases and
everything in between. How can we see the whole object if each atom
cannot be seen individually?
For those who believed in atoms before their existence was
proved, questions like that posed a big problem. If everything is made of
atoms and, at least after Newton, we believed that even atoms should
follow the laws of mechanics, can we explain the phenomena we see
happening with matter around us starting from things as simple as atoms?
Today, we know that the answer is "very probably yes", although we are not completely sure (Stewart, 1997). However, even if that is indeed
possible, it might be so difficult that it becomes impractical. Even with our
reasonably powerful computers, calculating the properties of matter starting from its atomic structure, something that physicists call ab initio calculations, is too
complicated to be useful except in very simplified situations. The problem
is not only that we have so many particles to keep track of, but it has to do
with the fact that many of the necessary computations belong to a class of
problems which we usually call hard computational problems, for which no fast algorithms are known.
Still, even in the 17th century, physics had already started to study the
laws of heat and how they affect many properties of matter. By the time
the atom was proved to exist, there was already a good deal of work that
allowed the study of phase transitions, which as we have already seen is
the phenomenon associated with matter changing from one phase to another.
When applied to the substances we use every day, phases are
what in common language people call states of matter, like liquid, solid and
gaseous. We have already used this word to describe ice, water and
vapour. The word state is never used in this sense in physics. Technically,
the term state is used to describe characteristics of any system, even a
single particle, either in classical or quantum mechanics and can cause
some confusion, like the energy states we talked about before.
The area of physics that deals with the phase transitions occurring in substances is the well-known thermodynamics. In 1738, Daniel
Bernoulli, a name that we heard before, published the book
Hydrodynamica (Bernoulli, 1738) in which he assumed that liquids were
formed by a huge number of molecules in movement and used this to
calculate some of their observed properties. Other scientists, like Clausius
and Maxwell (the same of the electromagnetism), used similar ideas to
describe not only liquids, but also gases.
The revolutionary idea of Bernoulli was to assume that the
molecules were moving in a random way inside the liquid and to use
probabilities to calculate average properties of it. He then associated the
average properties with what we can really measure. More than one
century later, Boltzmann took these ideas to their limits. Today, the latter is considered the father of statistical physics, also known as statistical mechanics.
But thermodynamics had already existed before statistical physics.
Boltzmann was one of the first to think about the idea we have called emergent properties. He was sure that the laws of
thermodynamics should emerge from the simpler laws of Newtonian
mechanics and the key to accomplish that should be, following the example
of Bernoulli's hydrodynamics, probabilities.
Systems composed of atoms and molecules at a certain
temperature (as long as the temperature is different from zero) have to be
described by probabilities, because heat is a kind of disorder which moves
things around randomly. Heat is, fundamentally, noise and temperature is
a measure of how high this noise is.
Here is where the distinction between phase and state becomes
even more important. We describe systems by the probabilities of them
being in some state, meaning the probabilities of values for all microscopic
variables that characterise the system. One of the great results of
Boltzmann's work was that, if we are given the normalisation of these
probabilities, there are methods that allow us to calculate everything we
are interested in about the system! We are even able to predict things like
at which temperature a phase transition should occur. In an interesting
twist, while normalisations are usually ignored in the inference tasks we
have seen before, they are actually a key calculational tool when we deal
with complex systems!
Because of this amazing fact, the normalisation is given a new,
more important name. It is called the partition sum and given the letter Z, which stands for the German word Zustandssumme, or sum over states, because, at that time, German was the most important language in science and also because Boltzmann himself was Austrian.
Unfortunately, the no free lunch theorem applies here. The more
complex your system is, the more difficult it is to calculate its partition
function. Many times, we are forced to do that numerically, which
nowadays we do using computers. A numerical calculation is one in which,
instead of trying to find a nice compact formula to express the
normalisation, we calculate each term separately as a number and add
them all by brute force.
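As a minimal illustration of such a brute-force calculation (a sketch of my own, with units chosen so that Boltzmann's constant equals 1, not an example from the text), here is the partition sum of a single spin in a magnetic field h at temperature T. The two states have energies −h (up, aligned with the field) and +h (down), and Z is just the sum of the Boltzmann weights exp(−E/T):

```python
import math

def partition_sum(h, T):
    """Z for one spin in field h at temperature T (units with k_B = 1)."""
    return math.exp(h / T) + math.exp(-h / T)  # weights of up and down

def p_up(h, T):
    """Probability of finding the spin aligned with the field."""
    return math.exp(h / T) / partition_sum(h, T)

print(p_up(1.0, 0.1))    # ~1.0: low temperature, little noise, spin aligns
print(p_up(1.0, 100.0))  # ~0.5: high temperature, noise wins, both states equal
```

In this two-state case Z can of course be written in closed form, but the same term-by-term sum works, in principle, for any number of states, which is exactly the brute-force numerical route just described.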
The normalisation that we calculate in statistical physics is still
independent of the particular state because it is a sum over all of them, but
it might depend on other variables, like the temperature at which the
system is kept and the magnetic fields acting on it.
However, it is far simpler to calculate partition functions than to work out
what is going to happen to a system by tracking it atom by atom, although
some people actually do that with computers as a different line of research.
This and many other great insights make statistical physics such a
powerful tool to deal with complex systems that it can probably be
considered to have transcended physics to become a general mathematical
framework that can be applied to every area of human knowledge. The
methods of statistical physics today are used in areas as diverse as social
sciences and artificial intelligence.
All probabilistic concepts we have seen before are brought
together in statistical physics. One of its basic assumptions is that there are
some elementary states of a system which all have the same probability
under certain conditions. This last part is very important because, as we
have repeated many times by now, every probability is conditional.
In fact, we can associate the condition under which these states
are equivalent with the idea of a cost function, or simply a cost. We talked
briefly about that when we were discussing the cost of making decisions.
Here, although the concept is related to that one, it is used in a slightly
different way.
Roberto C. Alamino
When we observe the world around us, we can immediately
recognise that some states are more difficult to maintain than others. For
instance, if you raise a rock above the ground, it will not stay there unless
you keep holding it. The moment you let it go, gravity guarantees that it
will go back to the ground. This means that there is a cost associated with
keeping the rock above the ground. In physics, this cost is usually
associated with the word energy. Therefore, energy is a cost function.
There is a chance that you still remember that everything in the
world tries to achieve a state of minimum energy. This seems to be at odds
with the idea that energy is conserved. When we say that a system tries to
achieve the state of minimum energy we are actually considering that the
system is interacting with others in such a way that it can exchange energy
with them. This guarantees that the energy remains conserved.
Another slight complication with this picture of minimum
energy appears when heat is present. Because heat is noise, it disrupts
the minimum energy state in a way that we will understand in a bit, but the
main idea, that the universe tries to minimise its costs, can still be seen to
be valid. What happens is that, due to some restrictions imposed by the
environment, the system is prevented from doing so.
The main point is this: the key assumption of statistical physics is
that, if a system can assume different states, we can associate a different
cost or energy value to each state. Then, if we do not know in which state
the system is and we measure it, states with the same energy have the
same chance of being measured.
This also means that if you know the exact value of the energy for a
system and if you also know that the total number of states with that
energy is, let us say, ten, then the system will be in one of those states with
probability 1/10. For all practical purposes, the system is the same as a
fair d10 rolling experiment!
I guess you can appreciate that this is a symmetry assumption.
States with the same energy can be considered symmetric in the sense
that, if we change the state, the energy remains the same. Remember that
we defined a symmetry as being a change we do to a system that keeps
some aspects of the system unchanged.
When we are studying systems whose energy we know with
certainty, we say that we are working in the microcanonical ensemble. The
actual explanation of the term is not important. The important thing is
that, in the microcanonical ensemble, the energy of the system is fixed and
all states have the same probability. The only problem is that the most
common situation in practice is that we also do not know the exact value of
the energy of a system.
The reason is that systems are exchanging energy with other
systems all the time. It is not easy to isolate systems in nature. Think
about the amount of work needed to construct a thermos flask. Today we
know that heat is a form of energy and heat exchanges are energy
exchanges. It is not at all easy to keep your coffee warm the whole day.
Actually, it is virtually impossible.
Because it is so difficult to control the exchange of energy, we were
forced to go one step beyond the microcanonical ensemble. We are forced
to use what is called the canonical ensemble. Although in the canonical
ensemble we do not know the precise value of the energy in the system,
there is still one thing that we assume we know: its average value. This
average value of the energy is fixed by something we can measure and
control much more easily than the energy itself: the temperature.
Temperature, mathematically, can be interpreted as a numerical
parameter which, once fixed, fixes the average value of the energy of the
system. But there is still one thing missing. In the microcanonical ensemble,
once we knew the energy and the number of states, we knew how to
calculate the probabilities. In the canonical ensemble the rule that equal
energies correspond to equal probabilities is still valid, but there is an
infinite number of functions that assign the same probability to states with
the same energy and which have the given average energy. Our probability zoo is full of
probability distributions with the same mean. We need to find a way to
attribute the correct probabilities to all states, even those with different
energies, which will reproduce the same results we observe in real systems
in nature.
Now comes the time to mix in a little bit of inference. How do we
calculate probabilities when we have all the necessary constraints and do
not want to assume anything else? Yes... we maximise the entropy!
If we fix the average energy and look for the distribution that maximises
the entropy while giving the same probability to states with the same
energy, we find exactly what we call in physics the Gibbs distribution.
This distribution is one of the most fundamental results of statistical
physics and works for both classical and quantum systems almost
unchanged. In a nutshell, it says that the probability of finding a system in a
state with a certain energy is exponentially smaller the larger the energy is.
The exact value depends on the temperature.
In the Gibbs distribution, the temperature regulates the relative
probability between states of different energy. If the temperature is very
high, the energy does not matter and all the states have the same
probability. On the other hand, when the temperature is almost zero only
states with the minimum value of the energy can be found, all others have
zero probability. With the correct probability we can find the correct
normalisation and, as we have seen before, everything follows.
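The behaviour of the Gibbs distribution can be illustrated with a minimal Python sketch (my own example, not from the book; it assumes units in which Boltzmann's constant is 1, and the function name `gibbs_probabilities` is mine):

```python
import math

def gibbs_probabilities(energies, temperature):
    # Boltzmann weight exp(-E/T) for each state, in units where k_B = 1.
    weights = [math.exp(-e / temperature) for e in energies]
    z = sum(weights)  # the partition sum Z: the normalisation
    return [w / z for w in weights]

energies = [0.0, 1.0, 2.0]  # three states; state 0 has the minimum energy

# Very high temperature: energy barely matters, probabilities near uniform.
print(gibbs_probabilities(energies, 1000.0))

# Very low temperature: almost all probability on the minimum-energy state.
print(gibbs_probabilities(energies, 0.01))
```

Notice that the same quantity z that makes the probabilities sum to one is the partition sum discussed above.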
But we still cannot bridge the gap between physics and the other
disciplines, can we? Yes, we can. This kind of research is in its early stages
(even though it is decades old), but we can already put together, under the same
framework, that of statistical physics, many collective phenomena.
The concept of phases and phase transitions is probably the most
important in this whole idea. Phases are characterised by an overall
behaviour of the system. For instance, you can think about democracy and
authoritarian rule as two political phases of a society. Then you would be
interested in knowing which social parameters make a culture
change from one to the other and back again. That might be difficult but, in
principle, it is possible.
Several other things in many different areas are presently being
investigated using the methods of statistical physics. For instance, how
birds collectively organise themselves in flocks. Or at what moment, and
how, voters polarise towards one candidate in an election. How
swarms of robots can be programmed to collectively achieve some goal. All
those things involve throwing away detailed information about the
individuals to look at the system from far away, as one single complex
system. Just as ice and water look the same if you examine them closely
enough, but completely different from far enough away, so it is with all
these systems. By throwing this information away, we need probabilities,
and that is how we enter the realm of statistical physics.
There are, of course, many other ways in which probability enters science.
In biology, genetics is an area where probabilities abound. What is the
probability of having a baby girl or a baby boy? How probable is it that
someone is the parent of someone else? As we have seen before,
physicians are always talking about death rates, which are probabilistic
concepts. Epidemiologists need to know how probable it is for a disease to
spread. Meteorologists need to predict the weather, but there are so many
variables involved that only probabilities can be given. The same happens
with stock markets.
Each one of the sciences needs to use different concepts to make
different predictions with different kinds of information. These predictions
should match real outcomes to have any validity. In addition, given real
phenomena, these sciences need to make meta-predictions, meaning that
they need to predict which theory is best for making predictions.
The way to make these meta-predictions is, of course, Bayesian
inference, and it is at the core of science. Now that we have some
idea about how probability enters in the most fundamental aspects of the
universe, it is time to go to higher levels.
The Highest Levels
The idea that science is a quest to understand how reality works is a
beautiful but subtle one. It is probably the one thing that is deep inside the
heart of every scientist as a drive and a hope, but what each person sees as
understanding is generally very subjective.
We have seen before that our best shot at a rigorous definition of
an objective reality is embodied by the requirement of consistency.
Something similar happens when we try to define what we mean by
understanding. Surprisingly, when one looks deeply enough, the best
definition of what understanding actually means is to say that we are
capable of creating a consistent model of whatever phenomenon we are
trying to understand. Consistency here, as we have seen before, is used in
the sense that the model is free of contradictions with all collected (and
connected) pieces of evidence.
The first reaction to this kind of definition is to be reluctant and
sceptical, as it is healthy to be. We all have a deep feeling that
understanding has something to do with the capacity to explain something
until we have answers to all possible questions concerning the subject. Let us
think about that. Imagine something that you are sure you understand.
Right now, you are probably hesitant to pick anything, even those
things that you always considered you understood. You are probably
becoming aware of many questions that someone could ask you to
which you would have to answer "I don't know".
It turns out that different people become satisfied with different
levels of explanation and, once this level is achieved, they do not
bother to formulate further questions even if they are possible. The truth is
that, unless you postulate that you cannot answer something above some
level, it is not even clear that you will ever find an end to the sequence of
possible questions. For most people, the final explanation is the religious
one simply because they assume that any question about that has no
answer. Religion, many times, works as a cut-off to the endless string of
doubts that can be raised about almost every subject.
This is the equivalent of assuming a set of axioms which are to be
accepted without questioning. Whenever we can explain things by using
the axioms, we say that we understood those things. Of course, one then
might say that we do not understand the postulates. The religious position is
that this is not acceptable. The postulates should not be questioned. The
scientific position is that anything can be questioned.
Because of this possibility of a never-ending questioning, when we
scientists say that we understood something, what we really mean is that
we have a model that describes that phenomenon, accommodates well all
collected data and is free of contradictions with other models. When a
contradiction appears, we say that there is a lack of understanding about
that point. This definition goes well with the daily-life one too. As we have
already seen, our brain works by constructing models from data. When we
do not understand something, it means a failure to create a consistent
model of the world around us because the new information does not fit
with the already established model (or models).
If there is one thing that we have learned during this book, it is
that the most efficient way we have ever found of encoding models of
natural phenomena is mathematics. That is why scientific models
are constructed using it. It is also because we can use mathematics to
generate predictions that we can test by measuring data. We can do this
because mathematics allows us to do logical inference simply by following
mechanical substitution rules of symbols.
All theories of science, all descriptions of phenomena, are
ultimately based on mathematical models. In order to make a connection
with our daily life we use mathematics plus our more elementary
communication codes, like English. That is because our brains were trained
to model the world in terms of languages long before we learn maths.
We have already studied how probability enters the
construction of models for the phenomena of our world. But remember
that there is also one level above that of creating models, one in which
probabilities enter into their selection.
Sometimes we have more than one description of the same
phenomenon, i.e. more than one model, which we were able to create
based on all collected data and we need to decide between these models.
We do that in the same way as we decide everything else: by doing
Bayesian inference. Bayesian inference, in this sense, sits one level above
science itself, as the mathematical description of a philosophical
framework. It is Bayesian inference that guides us in how to
do science. That is because one of the tasks of a scientist is to choose the
best models that fit evidence without assuming anything else and, when
more evidence becomes available, the models should be updated.
If you remember our discussion about the life history of
Tutankhamun, you will remember that we needed to compare an ever-increasing
body of evidence against possible stories and choose the
best one. We did not discuss it then, but if all stories were incompatible with
the evidence, what should we do? The answer is that we should invent one
which is compatible. How would we know that that story is the right
one? We start the cycle of finding more evidence and comparing again.
Does that seem similar to anything that you have learned about science?
I remember that by the end of my Ph.D. in the Institute of Physics of the
University of Sao Paulo, in Brazil, my supervisor, Prof Nestor Caticha, took
me to have lunch at the fancy restaurant of the university's business
school. These days I have collected enough evidence to assert that it is an
international truth that business schools have the best restaurants on any
campus. It was a buffet and, while we were serving ourselves, I
remember that somehow the discussion ended up being about the
foundations of science.
"Isn't science just a belief system like any other?" Nestor asked me
casually.
"I used to think that," I answered. That was true. I had even
convinced a friend of my wife, who is a lawyer, of that. But then I added,
"but after I understood Bayesian inference, I realised that that is not the case."
Nestor looked at me, smiled and said, "If that is all you take with
you from your Ph.D., I consider my role successful. Now go and spread the word."
I am not exaggerating when I say that understanding Bayesian
inference changed my whole way of seeing the world around me. It is now
time for me to guarantee that it is going to do the same for you as well. If
that is the only thing you take with you from this book, I will have paid my
debt to Nestor.
We have been talking about many areas of knowledge, especially
scientific knowledge, but there is one level that we have touched only
slightly up to this point, which you might already be guessing: how
to do science itself. Many people tried to suggest during the 20th
century that
science is nothing more than a belief system not unlike any religion or
mythology. Are they right? What would be the difference, if any, between
science and the rest?
If you have understood the rest of this book correctly, you are smiling in
your chair right now because you know the answer. It is a good feeling to
understand what is wrong with post-modernism, isn't it? The answer is, of
course, Bayesian inference.
What is the objective of science? It is to understand the laws that
organise the universe. But how exactly does science intend to do that
differently from, let us say, religion? After all, religion also tries, in its own
way, to find a theory of the universe. Both science and religion create
models of the universe in their own ways. They use different languages and
different symbols, but both can be summarised as inventing descriptions,
called explanations, for natural phenomena.
The crucial point is that science's fundamental idea is to find
explanations, or models, for natural phenomena by only using the
information given by nature itself, with minimal extra assumptions. Yes,
that is maximum entropy. In addition, whenever something new is
discovered, we want to review our previous beliefs using the same
principle. There you are. Bayesian inference.
Compare it with other systems similar to religion. For instance,
think about the concept of faith. Faith is a nice idea from the emotional
point of view. It evokes noble feelings. It is romantic, poetic, but it is a
complete disaster when it is used for reasoning. Why? Why is faith so bad
for reasoning? For one single reason: faith is the idea that you should not
update your beliefs. You might object by saying that, in fact, you can
update your beliefs, but that you should give such a high prior to information
handed down by an authority that they change very little. That does not quite
work, because if you keep including new evidence, sooner or later your
beliefs will change to reflect the evidence. Religions, ideologies and football
therefore rely on completely crushing the addition of new evidence that contradicts
your beliefs.
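To see how evidence eventually overwhelms even a very strong prior, here is a toy sketch in Python (my own example, not from the book): a coin believed almost certainly fair, updated by Bayes' rule after each observed head against the hypothesis that it always lands heads.

```python
prior_fair, prior_biased = 0.999, 0.001   # a very strong prior for "fair"
p_heads_fair, p_heads_biased = 0.5, 1.0   # likelihood of heads under each model

# Observe 20 heads in a row, applying Bayes' rule after each toss.
for _ in range(20):
    evidence = prior_fair * p_heads_fair + prior_biased * p_heads_biased
    prior_fair = prior_fair * p_heads_fair / evidence
    prior_biased = prior_biased * p_heads_biased / evidence

print(prior_biased)  # the once-tiny belief now dominates
```

Blocking the evidence from ever entering the update is the only way to keep the initial belief intact, and that is exactly what faith demands.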
But what about science? What about the absolute truths of
science, like gravity? If you did not get it by now, it is time. Especially
because after understanding Bayesian inference you should be completely
prepared for that:
There are no absolute truths in science.
Everything in science has a plausibility of being right, but that can
always change if new evidence appears. Every single concept in science
hangs by a thread which is as thick as the amount of evidence collected
to support that concept. Review the evidence and the concept can change
at any time.
From the emotional point of view, that seems quite unsatisfactory.
The certainty provided by religion is comforting. The fact that everything
in science is always changing is disturbing, and the great majority of people
cannot cope with that. That is too bad, but nature does not care if you are
prepared for the truth or not.
But we surely have to rely on some kind of fixed concept, right? Am
I not saying that Bayesian inference, or even deeper, maximum entropy is
the one thing we should not doubt? No. Even that we should question.
Even consistency? Yes, even consistency. Nobody said understanding the
world (whatever it means) would be easy. But we should continue to
believe that it is possible until further evidence forces us to update this.
You do not need to be in dismay, though. There is some order in
the chaos as long as we use Bayesian inference. It is an irony that the first
one to glimpse this was a reverend.
Contrary to what many philosophers will say to you, it is possible to outline
a rigorous framework for science and the scientific method. This
framework entirely relies on the concepts we have seen throughout this
book and, in a sense, is its climax.
The first thing we need to do is to leave aside our strong feelings
about the subject and try to find an, at least rough, definition of what we
mean by science. Broadly, we can say that science is the art (or whichever
word you prefer here) of encoding the information we obtain
from the world around us into models that are falsifiable. Here we have a
new concept and we need to talk a bit about it.
The first person to propose falsifiability as a requirement of
scientific models was Karl Popper (Popper, 1935) in the 20th century.
Roughly, this means that the models of science should be testable in the
sense that there must be some experiment that can be done in principle
(although sometimes not in practice) that the model has the possibility of
answering wrongly. This is basically the requirement that every model
should be able to predict something. If a model cannot make a prediction,
it can never be wrong. If it can never be wrong, it cannot be evaluated and
is therefore out of the scope of science (although it is still in the range of
philosophy).
Note that I am not saying that a non-falsifiable model cannot be
true, but this is a very muddy philosophical area. What falsifiability makes
evident is a fundamental logical limitation of science. The idea of science is
a set of rules that would allow us to infer the laws of nature from
observations. But as any good lawyer knows, given a finite set of
observations without anything else, you can always create a good story
unless you have external limitations.
This is something that happens here as well. When one creates a
model to fit a certain dataset of observations, unless there is some
limitation, we can make the corresponding likelihood as complicated as it is
required to give a probability one for all our observations. There is a very
simple visual way to understand that in terms of certain mathematical
objects which we talked about before: polynomials.
Remember that we learned that polynomials are functions of one
or more variables involving only natural powers of the variables multiplied
by numbers and added together. The most common are those involving
only one variable, for which the most used letter is x. The degree of the
polynomial is the value of the highest power appearing in the formula.
These are some additional examples:

2x + 1,    x² − 3x + 2,    x³ − x

The first in the list is a polynomial of degree 1, the second of
degree 2 and the third of degree 3. It turns out that we can plot
polynomials of one variable in a graph by giving values to x and calculating
the resulting number. For the above three, we get the following graphs:

The first one is a straight line, the second is called a parabola and
the third has no special name.
Now, everybody knows that through any two points one can draw a
straight line. Less known is that through any three points one can draw a
parabola, which is a polynomial of the second degree, passing through all
the points. The pattern continues for all polynomials of one variable. If you
have n points, you can always find a polynomial of degree n − 1 which
passes exactly through all the given points. Finding the polynomial that
passes through a certain number of points is called fitting a curve. It does
not require a big leap to notice that fitting a curve is a very simple case of
finding a model for a dataset.
But if you can use a parabola to fit any three points, you surely can
fit any two points with a parabola, right? Right. This means that the more
correct assertion is that through any n points you can always fit a
polynomial of degree n − 1 or higher. But if polynomials are models for our
points, does that mean that, given a certain number of points, we can find
an infinite number of models that explain them? Yes, that is correct. Now
you start to see the problem. If all you have is two points, your story to
explain those two points can be a straight line (the one rarely used by
lawyers), a parabola or any other polynomial. How do you know which one
is the correct model?
Suppose you had two points and fitted a straight line. How would
you try to find out if your straight line is the correct explanation for all
possible points? The obvious answer is that you need a third point to check
it! Without a third point, there is no way to decide. The third point is the
equivalent of an experiment that needs to be done to test your model.
Each time a new experiment gives a point on your straight line, it
increases the confidence in your model. The tricky part is that you never
know whether the next point will still be on that line. All you can do, thanks to
Bayes, is be sure that each new point on the line increases the
probability that it is the correct explanation.
Now, notice that "can be explained" is not the same as "is the
correct explanation". Most of you must remember a trigonometric function
called the sine. The sine is the prototype of a wave and its graph is given by
the picture below, in which two points were highlighted.
Although the two points were generated by a sine function, the
green dashed straight line also fits them. You could say that the straight
line also explains them. To check if the straight line is correct or not, one
would need at least one more point. Notice that in this case, if the next
point was exactly where the axes cross each other, the zero-zero point, the
straight line would still be over it! In that case, we would have our certainty
about the line increased even with it not being the correct explanation! In
daily life, this is called a coincidence. One more point, however, would be enough
for us to see that the straight line is the wrong function. The
problem is that sometimes people like the straight line so much that they
start to ignore the next points.
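The situation can be sketched in a few lines of Python (my own toy example, with arbitrarily chosen measurement positions): two points sampled from a sine are fitted perfectly by a straight line, and a third measurement falsifies the line.

```python
import math

# Two points sampled from the "true law of nature", a sine.
x1, x2 = 0.5, 2.0
y1, y2 = math.sin(x1), math.sin(x2)

# Fit the unique straight line through the two points.
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1

# A third measurement: does the line's prediction survive?
x3 = 3.0
predicted = slope * x3 + intercept
measured = math.sin(x3)
print(abs(predicted - measured))  # a large gap: the straight line is falsified
```

Two points are never enough to distinguish the two explanations; only the extra experiment does.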
But what happens now if someone says that there are no sines in
the universe? Everything is a polynomial; you just need to find its right
degree. Is there any experiment one can do to disprove this polynomial
theory of the universe? If our experiments are limited to a finite number of
points, the answer is no! Even if the function generating the points is actually a sine, there is
no way to tell which theory is correct by doing any experiment. Theories
like that have another characteristic: they cannot make predictions.
The sine-theory, on the other hand, is testable in principle. We can
start to do experiments and compare with the sine graph. If a (correctly
measured) point falls outside the curve, then the sine is falsified. We might
never know if the next point will be wrong, which means that we might
never know for sure that the sine is the correct theory, but we have the
possibility of disproving it in principle.
The sine-theory has the property we want to attribute to scientific
theories, whereas the polynomial-theory, even though it might be true, does
not. It is said to be non-falsifiable. In fact, unless you get some emotional
comfort from knowing that the universe is explained by a polynomial, this
theory is completely useless.
Consider solipsism. This is the idea that your mind is the only
existing thing in the universe and it creates everything else, including this
book. Solipsism is non-falsifiable. Can it be true? Yes. Does it matter?
Philosophically, a lot; in practice, not at all. It is completely undecidable
and, because of that, has no effect whatsoever on our lives
(except maybe, as I already said, an emotional one).
Another problem with non-falsifiable models or theories is that the
number of them that can be created is limited only by the creator's
imagination. Although the reality of non-falsifiable theories is a legitimate
and difficult philosophical problem, this is the point where science departs
from philosophy. Science is concerned with falsifiable theories and that
should be part of its definition.
We have now limited scientific models to those which are falsifiable. This
defines the scope of our subject. It does not mean that, from now on, we
are forbidden to imagine non-falsifiable things but, being non-testable,
they are out of the scope of science simply because we cannot ever decide
whether they are right or wrong by any observable effect.
We have spent the most part of this book understanding how to
test models: using Bayesian inference. That is exactly how we should test
scientific models. The scientific method is an outline of a procedure to
create and test falsifiable models. In a nutshell, it can be described as a
disordered chain containing the following steps in any arbitrary sequence:
1. Information Gathering: most of the time, this phase consists of
measuring things. But it also includes organising collections of
concepts and even models themselves. Going to the library and picking
up books is also information gathering. Playing with axioms to find new
theorems can also be included here, with the new theorems standing
for new information.
2. Modelling: encoding all gathered information into a model, or as we
have seen, a likelihood. As we extensively discussed in this book, there
is no rigid way of doing that. This is where creativity enters. However,
some rules must be followed. The most important of them is that the
model should be falsifiable. It also needs to aim to be consistent.
3. Testing: this is a complicated phase. We have to check the consistency
of the model with the rest of the information we have. In other words,
we have to try to falsify the model and to update the posteriors. Then, we
compare models and, possibly, choose the best one.
Welcome to science. Notice that phases 1 and 2 could also be
claimed to be realised by any religion or pseudoscience, were it not for the
requirement of falsifiability. It is then in phase 3 that falsification takes
place. That is why it is important to understand that phase in more detail.
Checking for consistency, falsifying and choosing are all instances of
decision making. Let us see each one separately through Bayesian lenses.
Checking for consistency requires embedding your model in a
larger context. If you call your model M, then you are looking for a
probability P(M | R), where R means all the other models which you assume
are trustworthy enough for you to require yours to be consistent with.
If we can find a way to actually check the consistency decisively, we
can discard inconsistent models as they acquire a zero probability.
Sometimes we cannot do that. Science can be very complicated, to the point
that proposed models might not be easy to check for consistency. As long
as those models have not already been proved inconsistent, the usual thing
to do is to keep them as possible hypotheses.
The same happens with falsifiability. As long as a model is not non-falsifiable,
it does no harm to keep an eye on it, especially if it has
other attractive characteristics, which can vary enormously.
Consider string theory, for instance. We still do not know if string
theory is or is not falsifiable. As far as we are aware, it might be
that string theory is a sophisticated framework that is capable of fitting
many different universes. But up to this point, nobody has proved that it is
either non-falsifiable or inconsistent (either internally or with the rest of
science). Its appeal is that it has inspired amazing ideas and
generated beautiful mathematics. Still, that is not enough, but all we can
do is to keep working on it. It is still a model under consideration which has
not been discarded.
If we have several competing models, we can rank them by finding
their posterior ratios. This allows us to choose the best theories in a set.
But wouldn't that also allow us to say whether a theory is right or wrong by
considering this as a binary decision?
The answer is generally no. That is because it is not in general
possible to find the likelihood for the proposition that a theory is not right.
Let us think about it. Consider that we have a data point x and a theory
T. Now, suppose that we try to judge the probability of T being right or
wrong. We can see this as two meta-theories, respectively T and ¬T. If we
define T properly, we should be able to calculate P(x|T), as it is simply the
likelihood of the data under the theory, but we cannot calculate P(x|¬T)! That is because this
probability can be anything in the theories that are not T!
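When the competing models are fully specified, though, a posterior ratio can actually be computed. Here is a minimal Python sketch with two invented coin models (a fair coin against one biased towards heads) and equal priors, so that the posterior ratio reduces to the likelihood ratio:

```python
from math import prod

# Invented data: ten coin flips, 1 for heads, 0 for tails (8 heads in total).
flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

def likelihood(p_heads, data):
    """Probability of the observed sequence if heads occurs with probability p_heads."""
    return prod(p_heads if x == 1 else 1.0 - p_heads for x in data)

L_fair = likelihood(0.5, flips)    # model 1: fair coin
L_biased = likelihood(0.8, flips)  # model 2: coin biased towards heads

# With equal priors, the posterior ratio is just the likelihood ratio.
posterior_ratio = L_biased / L_fair
print(posterior_ratio)  # greater than 1: the data favour the biased model
```

Both models here assign a definite probability to every possible data set. That is exactly what the vague proposition "the theory is wrong" fails to do, which is why no likelihood, and hence no such ratio, can be computed against it.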
Of course, all three phases of the scientific method present their
own difficulties, but they do form a consistent system for checking models
and for guaranteeing that we choose the best possible ones
to account for the phenomena we see in our world. Thanks, once again, to
Bayes and Laplace.
Because I am a theoretical physicist, I think I will be excused for saying
that theoretical physics is one of the most beautiful areas of science. Relating symmetry,
geometry and mathematics to the patterns of nature in an intricate but
consistent way is something that never ceases to amaze you once you start
to understand how it works. But... yes, there is always a catch!
There are many stories about experiments that gave results
contradicting the predictions of theoretical physicists and that were, afterwards,
shown to be wrong themselves. Should we give more confidence to established
theories than to experiments? Should we then simply discard the experiments?
If so, should we stop requiring experimental evidence to be so
fundamental in science?
The saying that "if the data contradict the theory, then throw away
the data" has been attributed to many great theoretical physicists and was probably
actually spoken by a good number of them. That, however, should be
considered a kind of joke. Experimental evidence should never be
discarded, but it does need to be critically examined, like any other piece
of information.
On the other hand, evidence can be distilled into models in the form
of mathematical patterns that repeat themselves time and again. This
repetition is nothing more than another kind of evidence as it is a pattern
in itself. Beauty, as seen by many theoretical scientists, is just a realisation
of these repeating patterns.
There is a problem, though, when the scientist's level of confidence
in the mathematical models becomes closer to faith. When experimental
evidence is scarce, we saw that BAYES does not allow it to change the
prior too much, but as evidence starts to pile up, it affects the posterior in a
significant way. Because of this, even contradictory experiments should be
taken into consideration and not thrown away. If considered in the right
way, they cannot overthrow a correct model.
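How a pile of small pieces of evidence eventually overwhelms even a strong prior can be seen in a short simulation. The numbers are invented: a prior belief of 0.99 in a model, and a sequence of independent experiments, each of which favours the rival by a modest factor of 2.

```python
def update(belief, likelihood_ratio):
    """One Bayesian update in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = belief / (1.0 - belief)
    odds *= likelihood_ratio
    return odds / (1.0 + odds)

belief = 0.99          # strong prior confidence in the model
beliefs = []
for _ in range(20):
    belief = update(belief, 0.5)  # each contrary experiment halves the odds
    beliefs.append(belief)

print(beliefs[0])   # after one contrary experiment: barely moved (about 0.98)
print(beliefs[-1])  # after twenty: the accumulated evidence dominates (below 0.01)
```

A single weak experiment barely dents the prior, exactly as the text says, but twenty of them in a row leave almost nothing of it.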
Nothing, though, prevents one from analysing the validity of the
contradictory experiment itself using BAYES. You can always calculate the
probability of the experiment being right or wrong according to the rest of
the relevant information. Although it is not possible to do that for general
theories, as we have seen, for experiments this becomes easier as some
background theories are always considered true.
All this seems reasonable and consistent, but how do we know that BAYES, or its
basis, maximum entropy, is actually correct? If I say that I should check
BAYES by using BAYES on itself, am I not cheating?
This is an instance of something called self-reference, which is
when a system talks about itself. Remember the Epimenides Paradox in
which a sentence was neither true nor false? That was a case of self-
reference as the sentence was about itself. Self-reference is full of pitfalls,
but also full of wonderful things. Just think that what makes us feel
conscious as an independent being is the fact that we can actually think
about ourselves.
In mathematics, all sorts of strange things happen when systems are
powerful enough to reason about themselves. One famous instance is
Gödel's Theorem, a surprising mathematical result that appeared during
the first half of the 20th century. This theorem says that in some very well
established mathematical structures, like the arithmetic of the integer
numbers, although the system can talk about itself, there are some
statements that are true but whose truth the system cannot decide. A system
like this is called incomplete. Incompleteness
happens with some systems, but not with others. There are technically
subtle issues that make it difficult to identify when this happens.
It can be shown via Gödel's Theorem that, for systems obeying
certain conditions, self-reference being one of them, if the system is complete,
then it has to be inconsistent! This means that there are some propositions in the
system that are true and false at the same time!
Let us now get back to the first question: are we cheating if we use
BAYES to test BAYES? Well, as long as it is consistent, we can try and see if
it leads to some undesirable or wrong consequence. That is all we can do.
Now, if we allow that, we are allowing BAYES to talk about itself. Should we
be worried that it falls under Gödel's Theorem?
We argued that science is based on BAYES. If BAYES turns out to
result in an inconsistent system, then we are in trouble, because some
results can end up being true and false at the same time and that might be
a disaster. What about incompleteness? That is less harmful, because in
science we have what is called an oracle. An oracle is something we
can always ask to decide the truth of a statement, without
using the theory itself. In science, nature is our oracle. Of course, we
always need to check the validity of the experiments...
But still, it is not clear that Gödel's Theorem applies to science. This
is a complicated subject, and one we do not actually need to be sure about
in order to continue doing science in the best way we can.
I need to include an observation before we finish our conversation. Many
people will object to this last part about science saying that scientists, like
any human being, can be corrupt. Greedy researchers can and do hide
information that contradicts their ideas. Some simply do their experiments
in a wrong way. Others stick to non-falsifiable ideas and defend them as if
they were scientific ones. If science is what scientists do, doesn't that
contradict the purity that I described in the previous sections?
It does. That is because science is not what scientists do. That is a
horrible definition. At least, not the science with the objectives we have
discussed in this book. That is the same as saying that justice is what judges
do. If you find a corrupt one, should you believe that every decision he or
she makes is fair? Of course not. In both justice and science, what defines
these concepts should be sought in higher-level principles. Nobody said it is
easy to do that, but in science we have a very good idea!
Any system will tend to be corrupted with time. That is a
consequence of the fact that entropy (disorder) tends to increase. We all
know that governments, schools and virtually all human institutions might
start with good intentions, but are not immune to degradation. Corruption
can appear in every system, including the scientific community. That is why
it is important to keep the separation between science and the scientist.
We humans are not consistent machines. We are driven by instincts which
mainly guide our survival, no matter what it takes. There is some evidence
that our survival also depends on Bayesian inference, but that is a story for
another book.
One can always redefine the meaning of words. Nobody has a
monopoly over them. They change meaning with time. Science, in the
beginning, did not refer to falsifiable knowledge, but we have learned
that without falsifiability we cannot check the validity of a model, and so
falsifiability was incorporated into the meaning. Today, the meaning of science seems to be
shifting from the procedure to the profession. This is dangerous if people
start to confuse the two things. As humans, that is what we invariably do.

How well can we, after a whole book, answer the questions we proposed in
the beginning? Let us see:
Q. How do you know we are not living inside the Matrix (or the next best thing)?
A. We need first to define what this Matrix is by creating a model to explain
the data we collect from our observations of the world. Then, we gather
information to increase or decrease the probability of it by using BAYES. If
the model is not falsifiable, though, we will never know if it is true or false.
Q. Can you ever tell whether everything is not actually an illusion inside
your mind?
A. No. That is solipsism and it is not falsifiable.
Q. Isnt science just a belief system not unlike religion?
A. No. Science uses BAYES; religion does not.
Q. How do we know that elementary particles exist if we cannot see them?
A. Existing is complicated philosophically. We can say that the model with
particles is consistent with our observations of the world, because those
observations provide indirect evidence that supports it.
Q. Can we ever hope to find an answer to any of these questions?
A. We just did.
And finally: what is the relation of these questions with this book?

Obs: books highlighted in red in this section are technical books or papers
and might need an extra level of scientific or mathematical knowledge to
be understood.
Bak, P. (1996) How Nature Works: The Science of Self-Organized Criticality, Copernicus
Bayes, T. (1763) An Essay Towards Solving a Problem in the Doctrine of
Chances, Philos. Trans. R. Soc. London 53, 370-418
Bernoulli, D. (1738) Hydrodynamica
Cardano, G. (1663) Liber de Ludo Aleae
Carroll, L. (1871) Through the Looking-Glass, and What Alice Found There
Caticha, A. (2010) Entropic Inference, arXiv:1011.0723
David, F.N. (1962) Games, Gods and Gambling, Hafner Publishing Company
Feller, W. (1968) An Introduction to Probability Theory and Its Applications
Vols. 1 and 2, John Wiley & Sons
Ferriss, T. (2011) The 4-Hour Work Week: Escape the 9-5, Live Anywhere and
Join the New Rich, Vermilion
Gleick, J. (1988) Chaos: Making a New Science, Penguin Books
Greene, B. (2005) The Fabric of the Cosmos: Space, Time and the Texture of
Reality, Penguin Press Science
Jaynes, E.T. (2003) Probability Theory: The Logic of Science, Oxford
University Press
Landauer, R. (1996) The Physical Nature of Information, Phys. Lett. A 217, 188-193
Laplace, P.S. (1825) Philosophical Essay on Probabilities
Li, M., Vitanyi, P. (1997) An Introduction to Kolmogorov Complexity and Its
Applications, Springer-Verlag
Noether, E. (1971) Invariant Variation Problems, translated by M. Tavel,
Transport Theory and Statistical Physics 1, 186-207
Penrose, R. (1994) Shadows of the Mind
Popper, K. (1935) The Logic of Scientific Discovery
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. (1992)
Numerical Recipes in C: The Art of Scientific Computing, Cambridge
University Press
Sagan, C. (1985) Contact
Sagan, C. (1997) The Demon-Haunted World: Science as a Candle in the
Dark, Ballantine Books
Santos, A. (2009) How Many Licks? Or, How to Estimate Damn Near Anything, Running Press
Sivia, D.S., Skilling, J. (2006) Data Analysis: A Bayesian Tutorial, Oxford
University Press
Stewart, I., Cohen, J. (1997) Figments of Reality: The Evolution of the
Curious Mind, Cambridge University Press
Todhunter, I. (1865) A History of the Mathematical Theory of Probability
from the Time of Pascal to that of Laplace, Macmillan and Co
Wigner, E. (1960) The Unreasonable Effectiveness of Mathematics in the
Natural Sciences, Comm. Pure Appl. Math. 13, 1-14
Wolfram, S. (2001) A New Kind of Science, Wolfram Media

Appendix A.
Internet Material
The reference material found in the bibliography of this book comes
mainly from three sources: scientific journals, books and websites. If you have
difficulty locating them, you can find them with the appropriate links on my
website on the internet at the address:
All you need to do is to click on the link and it will take you to the official
reference. Some of the material can also be found for free and, when this
option is available, I have put in the appropriate link as well.
Wolfram Alpha:
Among other things, this site can be used as a simplified version of
Mathematica, the famous software from the same company. You can use it to
plot graphs, solve equations and do other kinds of calculations. It has an
impressive recognition algorithm. Try, for instance, typing Gauss(1,2) to
see what happens.
Wolfram Mathworld:
Previously known as Eric Weisstein's World of Mathematics, after its
creator, this website is a mathematics encyclopaedia containing virtually
everything you would like to know about mathematics at a technical level.
It contains a lot of interesting information about random numbers. It also
has a handy random number generator for you to make experiments.
Public Domain Books
Gutenberg Project:
Open Library:
For scientific papers, there are two main ways to check for and obtain them
when they are available: the arXiv and Google Scholar. Both are described below.
Some of the material cited in this book is available for free on the
internet. References containing the word arXiv are preprints that can
be found at the website
The arXiv is today the standard preprint repository for physics and
has been running since the beginning of the 90s. A preprint is a preliminary
version of a scientific article or paper which is made available before it has
passed through the process of peer review at an official scientific
journal. Most physicists, including famous ones, have the helpful habit of
posting their preprints on the arXiv before sending them for peer review.
In addition to articles, the arXiv also contains lecture notes, PhD
and Master's theses, drafts of books and many other helpful documents. As
with any kind of information, wherever you get it, you must be careful not to
blindly trust what is in them, but that is also true even for what is published
in the official journals.
As an example, if you want to access the reference
Caticha, A. (2010) Entropic Inference, arXiv:1011.0723 []
you should type in your browser
This will take you to a page containing the abstract of the paper and
options of formats for downloading it. A PDF is available most of the time, as
are PS and source files for those who know how to handle them. This is a
relatively modern reference. Older ones have a slightly different code, like
this one:
In this case, the address would be
Most scientific journals charge relatively high fees for you to
download a paper. It is worth remembering that the authors never get any
of that money; they even give up the copyright of the paper
to the journal without receiving a penny. Because many authors got
annoyed with this situation (although not enough of them), today several scientific
journals allow the authors to keep a copy of their papers on their personal
websites. Google Scholar can be used to find these copies. All you need to
do is to type the title of the paper in the search box of Scholar at:
Once you find the paper, search for a downloadable version under the link "All
n versions" right below it.
Because scientific journals have not been able to keep the papers protected
on the internet, they have opted for a different strategy. Instead of charging for
downloads, they charge the authors amounts that vary between 1000 and
3000 US dollars, and the paper becomes free to access.
A list of these journals with the respective links can be found on Wikipedia:

Appendix B.
Mathematical Symbols
P(A)      The probability (density) of proposition A
¬A        Not A
∝         Proportional to
ℵ₁        Cardinality (size) of the set of real numbers
Z         Partition function (statistical physics)
ℵ₀        Aleph Zero, cardinality (size) of the set of natural numbers
[a, b]    Real interval starting at a and ending at b, including the endpoints
δ(x − a)  Dirac delta centred at the point a
Ω         Sample space
i         Imaginary unit, i = √(−1)
|ψ⟩       Ket, quantum state vector labelled by the letter ψ
|x|       Absolute value or modulus of the number x
∧         AND logical operator
∨         OR logical operator
Appendix C.
Scientist List
We have seen many players during this book, so many that it is worthwhile
to have a list of them. There are two extra advantages in doing that. The
first is that you can get a better idea of who lived at what time and,
second, of what their specialities were.
Bayes, Thomas (1701-1761): English statistician, philosopher and Presbyterian minister
Boltzmann, Ludwig (1844-1906): Austrian physicist and philosopher
Borel, Félix Édouard Justin Émile (1871-1956): French mathematician and politician
Cantor, Georg Ferdinand Ludwig (1845-1918): German mathematician
Cardano, Gerolamo (1501-1576): Italian mathematician, physician, astrologer, philosopher and gambler
Cohen, Paul Joseph (1934-2007): American (USA) mathematician
Dirac, Paul Adrien Maurice (1902-1984): British physicist
Einstein, Albert (1879-1955): German physicist
Fermat, Pierre de (1601-1665): French lawyer and mathematician
Galileo Galilei (1564-1642): Italian physicist, mathematician, engineer, astronomer and philosopher
Gauss, Johann Carl Friedrich (1777-1855): German mathematician
Gibbs, Josiah Willard (1839-1903): American (USA) scientist
Kolmogorov, Andrey (1903-1987): Soviet mathematician
Laplace, Pierre Simon (1749-1827): French mathematician and astronomer
Mach, Ernst Waldfried Josef Wenzel (1838-1916): Austrian physicist and philosopher
Maldacena, Juan Martín (1968-): Argentinean physicist
Mandelbrot, Benoit B. (1924-2010): French-American mathematician
Markov, Andrey Andreyevich (1856-1922): Russian mathematician
Maxwell, James Clerk (1831-1879): Scottish physicist
Michell, John (1724-1793): English clergyman and natural philosopher
Newton, Isaac (1642-1727): English physicist and mathematician
Pareto, Vilfredo Federico Damaso (1848-1923): Italian engineer, sociologist, economist, political scientist and philosopher
Pascal, Blaise (1623-1662): French mathematician, physicist, inventor, writer and Catholic theologian
Podolsky, Boris (1896-1966): American (USA) physicist
Price, Richard (1723-1791): Welsh philosopher and Nonconformist minister
Rosen, Nathan (1909-1995): American (USA)-Israeli physicist
Shannon, Claude Elwood (1916-2001): American (USA) mathematician, electronic engineer and cryptographer
Solomonoff, Ray (1926-2009): American (USA) mathematician
Zipf, George Kingsley (1902-1950): American (USA) linguist

Appendix D.
Greek Letters
The Greeks might not have invented mathematics, but they surely took it
to higher levels by inventing science. It is only natural that so many of their
letters appear as mathematical symbols today. To help you at least to spell
their names, here is a table with the Greek letters, their names and some
common variations.
Small   Capital   Name
α       Α         alpha
β       Β         beta
γ       Γ         gamma
δ       Δ         delta
ε (ϵ)   Ε         epsilon
ζ       Ζ         zeta
η       Η         eta
θ (ϑ)   Θ         theta
ι       Ι         iota
κ       Κ         kappa
λ       Λ         lambda
μ       Μ         mu
ν       Ν         nu
ξ       Ξ         xi
ο       Ο         omicron
π (ϖ)   Π         pi
ρ (ϱ)   Ρ         rho
σ (ς)   Σ         sigma
τ       Τ         tau
υ       Υ         upsilon
φ (ϕ)   Φ         phi
χ       Χ         chi
ψ       Ψ         psi
ω       Ω         omega
Brainstorming Area
Talk about other methods of inference which are not Bayesian and their
relations. -> when not to use Bayes (although using) - simplifications
Add How to Solve It by Pólya
Talk about Bayesian inference in biology. The brain might be Bayesian
[probabilities are related to what we do not know] , and we all know
that those are the things that scare us most
Generalisation vs memorisation, non-testable hypothesis, no extra
experiments to be done.
Talk about average faces when defining the mean?
Research the origin of the name aleph-zero/aleph-null
Talk about averages and how they enter the estimation process.
Means and averages as estimators
Explain that entropy is a functional.
Forgetting and adaptability.
Add the letters between Fermat and Pascal on Probability as an
I have to answer at some point the question: Why do we keep
using wrong theories?
Include the non-interacting universes problem.