DRAFT - WIP - Feedback appreciated


Roberto C. Alamino

Cover Picture:

Source: http://pixabay.com

Author: Unknown

Notice:

All pictures in this book are on Public Domain and were taken from either

http://pixabay.com or http://wikipedia.com

Contents

1. A Universe of Possibilities ....................................................................... 1

How Do You Know? ................................................................................ 1

You Cannot Avoid Probabilities .............................................................. 8

The Measure of our Ignorance ............................................................. 11

Are You Afraid of Maths? ..................................................................... 15

2. Games of Chance ................................................................................. 19

The Many Faces of Fairness .................................................................. 19

Not-so-noble Beginnings ...................................................................... 27

Bayes and Laplace ................................................................................ 40

3. Making Sense of Randomness .............................................................. 45

Predictability ........................................................................................ 45

Chaos (and Mayhem) ........................................................................... 48

Organised Disorder ............................................................................... 55

False Randomness ............................................................................. 57

Compressing Issues .............................................................................. 59

Pattern Recognition... or Not ............................................................... 62

True Randomness ................................................................................. 68

Back to Vegas ....................................................................................... 69

4. The Logic Behind .................................................................................. 72

The Cox's Postulates ............................................................. 72

A Bit of Logic ......................................................................................... 77

Liars ...................................................................................................... 85

Enters Consistency ................................................................................ 87

Kolmogorov's Axioms ........................................................... 89

Logic, Mathematics and Physics ........................................................... 90

Messages .............................................................................................. 98

5. Information ........................................................................................... 99

Age of Information................................................................................ 99

Encoding Information ......................................................................... 103

Insufficient Reason ............................................................................. 109

Maximum Entropy .............................................................................. 117

Frequencies ........................................................................................ 123

6. It Depends... ........................................................................................ 129

Conditioned Information .................................................................... 129

Everything is Subjective ...................................................................... 137

Objectivity and Consistency ................................................................ 141

The Holographic Universe ................................................................... 147

7. Probability Zoo ................................................................................... 151

Intermission ........................................................................................ 151

Anatomy Lesson.................................................................................. 152

A Dangerous Creature ........................................................................ 157

Specimen #1: Poisson ......................................................................... 160

Specimen #2: Zipf's Law ....................................................... 161

The Continuum ................................................................................... 163

Specimen #3: The Gaussian ................................................................ 170

Specimen #4: Pareto's Distribution .................................... 174

The Tail of the Beast ........................................................................... 176

Biodiversity ......................................................................................... 177

8. Changing Mind ................................................................................... 179

Decisions ............................................................................................ 179

Priors and Posteriors .......................................................................... 182

The Likelihood .................................................................................... 186

Evaluating Hypotheses ....................................................................... 188

The Inference Time Arrow .................................................................. 191

Normalisations ................................................................................... 198

Taking Decisions ................................................................................. 204

The Bayesian Way .............................................................................. 210

9. The Catch ............................................................................................ 212

Law and Disorder ................................................................................ 212

Models................................................................................................ 216

Noise, Errors and Codes ..................................................................... 223

What about Fallacies? ........................................................................ 230

10. The Universe and Everything Else .................................................... 233

Fundamental Uncertainty................................................................... 233

Laymen Quantum Mechanics ............................................................. 234

The Large, the Small and the Complex ............................................... 242

The Psychologist Paradox ................................................................... 245

Statistical Physics ................................................................................ 248

The Great Bridge ................................................................................. 256

Other Corners ..................................................................................... 257

11. The Highest Levels ............................................................................ 259

Models once Again ............................................................................. 259

Science and Bayes ............................................................................... 262

Limits to Knowledge ........................................................................... 264

The Method ........................................................................................ 269

Checking ............................................................................................. 270

Beauty and the Beast .......................................................................... 272

Addressing Oneself ............................................................................. 273

Entropy always Increases .................................................................... 274

12. Answers ............................................................................................ 276

Bibliography ............................................................................................ 278

Appendix A. Internet Material .................................................. 281

Random Useful Websites .................................................................... 281

arXiv .................................................................................................... 282

Google Scholar .................................................................................... 283

Open Access Journals.......................................................................... 284

Appendix B. Mathematical Symbols ....................................................... 285

Appendix C. Scientist List ........................................................................ 286

Appendix D. Greek Letters ...................................................................... 288

The Probable Universe


1.

A Universe of Possibilities

HOW DO YOU KNOW?

How do you know we are not living inside the Matrix (or the next best

thing)?

Can you ever tell whether everything is not actually an illusion inside your

mind?

Isn't science just a belief system, not unlike religion?

How do we know that elementary particles exist if we cannot see them?

Can we ever hope to find an answer to any of these questions?

And finally: what is the relation of these questions with this book?

Among the few, but key, characteristics that differentiate humans

from the rest of the living organisms on Earth, the ability to question is the

deepest, the least appreciated and the most annoying. What makes it

annoying is the fact that answering some questions is not easy. The search

for those answers forces us to admit our own ignorance, to face the

uncomfortable truth that we are full of prejudices and to observe in awe

our odd ability to accommodate a universe of contradictions inside our

minds without any effort or even regret.


When faced with one of these hard questions, the great majority of

us just find it easier to take a step backwards. Instead of engaging in the

lengthy and painful task of trying to decrease our ignorance by studying,

revising our prejudices, eventually throwing away many of them, and

resolving the contradictions by changing our mind, most people simply

convince themselves of one or more of the following reasons not to do so:

I do not want to know the answer.

I do not need to know the answer.

There is no answer.

Even if there is an answer, it is too complicated for me to understand.

I am fine the way I am. I just need to get back to browsing the Internet.

But some of us, including myself, are not satisfied with any of these

reasons. This is, of course, a choice that not everyone is obliged to pursue.

It has nothing to do with reason, but with an emotion called curiosity. It is

the urge to seek explanations that makes some of us want to push the

boundaries of what we (think) we know as much as we can and, even when

we start to feel that the boundaries are not going to move anymore, we

want to keep pushing just in case.

It is undeniable that there are cases when we cannot decide which

of several answers to a question is the correct one. But even in those

situations, we might be able to understand why that happens. When we

cannot choose with complete certainty one among several possibilities,

there is one other thing that we can try to do: find out the odds of each

possible answer being the right one.

To know the odds of something is again not an easy task. It is necessary
to weigh whatever we know in favour of and against each one of the
possibilities in such a way that we can create a ranking. For instance,

imagine that you want to download an app for your smartphone. You go to

the Internet and there you find a list of five apps that do what you want.

Which one will you be most satisfied with?


The way most of us face this decision problem is by ranking the

apps according to the reviews of other people. We assume that, the more

people liked a particular app, the greater are the chances that we will like it

too. Although we could use the number of stars to calculate a rough

estimate of the odds of a typical person liking that app, this quantitative

information would not make much of a difference in this situation. The

ranking is usually enough.

This weighting procedure used to create a ranking is what we call
inference. In other words,

Inference is the name given to a procedure in which we compare
different answers to a question and try to evaluate the chances of
each one being correct (eventually choosing the one with the best chances)

based on given evidence.

Evidence has the broadest possible meaning and it must include

every single piece of information, of every quality or size, which we

accumulate as the result of our exploration of the world using our senses

or our reason, including whatever we can derive through a chain of

reasoning (assuming that the reasoning was sound).

An odour can be evidence for nearby food and this can be seen

from two completely different perspectives. One can justify it via our
experience: from the very beginning of our lives we learn that this

association works. Another way is to rely on acquired knowledge, which

can also be understood as some kind of experience, but of a different

quality. We know that odours are the result of our nose sensing molecules

of food in the air. If there are molecules of a certain food floating around in

the air, the food either has been there or still is.

Sunburn is evidence that something is coming out of the sun,

crossing the whole distance between that star and our planet Earth, and

finally hitting your skin with enough energy to damage it. This is indirect

evidence that this something-which-is-coming-out-of-the-sun exists.


Notice that the line between direct and indirect evidence is a thin
and blurry one, if such a line can be drawn at all.

We usually attribute the term direct evidence to something that can be

measured and the term indirect evidence to something that is implied by

some rational consideration. Is an odour direct or indirect evidence for

nearby food? If you say "indirect" because you cannot see the food, think

again. Are you sure that you can attribute a higher level of reality to shapes

formed by reflected light captured by your retina than to molecules of a

substance detected through the sense of smell? Because of this, it is

unnatural and even arbitrary to separate different kinds of evidence and

we will consider every piece of information on the same footing, unless we

have strong reasons not to.

Inference is something that might indeed be difficult in its details.

Reconstruction of a story using pieces of evidence can be tricky as the

number of possibilities to fill the gaps can be larger than one can even

imagine. But, surprisingly, we do understand the overall process well

enough to be able to mechanise its fundamental workings. Even more

impressive is the fact that this mechanisation is so simple that I can give

you the final answer in one line. You doubt? I will show you.

Consider a certain question and all the relevant information we

possess to answer it. Call the set formed by all of this put together

(question + relevant information) the dataset D. Suppose we want to know
the odds of a certain specific answer, which we will call A, being the
correct one. These odds can be written as a simple-looking mathematical
formula given by (do not get desperate because of the mathematics, just
bear with me)

P(A|D) = P(D|A) × P(A) / P(D)

That is all. Seriously. We do need to understand what each part of

this equation means, but, written as it is above, it can readily be
programmed into a computer. To be entirely fair, the above symbols give you

one of many different ways of doing inference. It turns out that, although


there are many ways, the above formula gives you the most general and

correct one. It is known by the name of Bayesian Inference. There are

others, but they are all either approximations, particular cases or wrong

methods.
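For readers who like to see things running, here is a minimal sketch of Bayesian inference as an actual computer program. The function and variable names, and the numbers, are my own invention for illustration; the two inputs play the roles of the prior P(A) and the likelihood P(D|A), and the output is the posterior P(A|D).

```python
# A sketch of Bayesian inference over a finite set of answers.
# Names and numbers are illustrative, not from the book.

def bayes(prior, likelihood):
    """Return P(answer | data) for every answer.

    prior[a]      -- P(a): the odds of answer a before seeing the data
    likelihood[a] -- P(D | a): how well answer a explains the data D
    """
    unnormalised = {a: prior[a] * likelihood[a] for a in prior}
    total = sum(unnormalised.values())  # this plays the role of P(D)
    return {a: w / total for a, w in unnormalised.items()}

# Two candidate answers, equally likely beforehand; the data is three
# times more probable under answer "A" than under answer "B".
posterior = bayes(prior={"A": 0.5, "B": 0.5},
                  likelihood={"A": 0.3, "B": 0.1})
print(posterior)  # close to {'A': 0.75, 'B': 0.25}
```

Note that the denominator P(D) never has to be supplied separately: it is just the sum of the prior-times-likelihood products, which is what makes the odds add up to 100%.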

We can see Bayesian inference as a computer program which runs

in our brains every time we stumble into a question. Let us call this

program BAYES. The data that must be fed to BAYES in order to get a result

is

The question we want to answer. This question is considered as an
element of the dataset D.

The possible answers to the question, one of which is A.

How the question is related to its possible answers. This is what the
factor P(D|A) symbolises.

All information which is relevant to connect the question and the
answers. Like the question itself, this information is also part of D.

All extra information we can collect about the answers alone,
represented by the factor P(A).

Then, BAYES gives you back the odds of each answer, which is
symbolised by the notation P(A|D). The following diagram is a way to

graphically visualise BAYES:

The program BAYES: once it is fed with all information about a question and its possible

answers, the program spills out the probabilities of each possibility.


The largest part of this book will be concerned with understanding

each one of those three P's you see in the above diagram. That is why you

do not need to get upset if you do not understand them right now. In the

course of the book we will also refine the above diagram as, the way it is

right now, it is missing some details. That will be done gradually and I will

explain exhaustively each step.

Now, if you reread the previous paragraphs, you will notice that

BAYES needs to be fed with the possible answers. Why? Once we learn a bit

more about probabilities, this will be clearer, but the rough justification is

that we cannot calculate the odds of a certain thing happening unless we know

all the possible results. It makes a huge difference for calculating the odds.

For instance, if there is only one possible result, the odds for it are
100%. If there are one million possible results, the odds of most of them
will have to be around one in a million or less, since all the odds
together must add up to 100%.

This is actually something very deep and fundamental. It is. I am

neither joking nor exaggerating. BAYES cannot find the possible answers to

a question or the possible results to an experiment. The only thing that

BAYES can do, although it does it very well, is to evaluate the chances of

each possibility. Sometimes though, depending on the information given to

it, not even that! There are instances in which all that can be done is to

know which possibility has better odds of being right without knowing the

exact value of those odds!
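This last point can be sketched with invented numbers. If we can compute the product prior × likelihood for each answer but not the normalising constant, we can still say which answer is more probable, and even by how much, without ever knowing the exact odds:

```python
# Unnormalised weights: prior * likelihood for each answer.
# The numbers are invented purely for illustration.
weight = {"A": 0.5 * 0.3,
          "B": 0.5 * 0.1}

best = max(weight, key=weight.get)   # the answer with the better odds
ratio = weight["A"] / weight["B"]    # unchanged by the missing constant

print(best)   # 'A' is the better bet...
print(ratio)  # ...and about 3 times more probable than 'B'
```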

The task of finding out the possible answers/outcomes of

something is the one task that still today has not been mechanised. This is

the place in which imagination and creativity lie, and always will. It is in

devising possible scenarios that humans have the greatest advantage over

other species. I am not saying that creativity will never be understood to

the point of being programmable. I do not know. But I do know that when

machines start doing that at the same level we do, we should start to look

at them from a completely different point of view.


A fair question you might ask at this point would be: if BAYES is

actually such a simple program to write, why on Earth are you writing a

whole book about that? An even more practical question that might be

hovering inside your mind would be:

Do we really need to understand all those symbols if we are simply

interested in using BAYES to decide upon something?

The answer to this, and to the infinite number of similar

questions that can be asked about almost every kind of knowledge one can

acquire, is an obvious no. You do not need to understand it to use it as

long as the necessary information comes processed to you. Similarly, you

do not need to know how your mobile or your car works in order to use them.

When one of them breaks down, you can take them to a technician, but

BAYES has a subtlety. BAYES does not break down. When it fails, it is

because you are doing something wrong at some point. Worse yet, you

might never know there is a problem until the wrong decision has been

taken.

You do not need to understand BAYES to continue to live your life,

but this book is for those who want to. It is written for those who are

interested in knowing how and mainly why that program works. In

understanding the logic behind the program, why we should trust it and
also its limitations, we open the way to using it more efficiently and,

consequently, more often.

There is an additional argument. Even those who would be

satisfied with simply running the program will need to feed it with the

correct data in the correct form. Another aim of this book will then be to

answer this other question:

How can we write down everything we know about something in a

way that BAYES will be able to process it?


Do all those things sound a little familiar to you? If all of this rings a

bell, that is because it should. We run BAYES or some of its simplified and

approximated versions, almost every single moment of our lives, even

without knowing it. And we do exactly the same thing even when we are

dealing with the most sophisticated of our intellectual endeavours. Yes,

both the almighty philosophy and science rely on BAYES. In all of those

situations, surprisingly, the program to be used is exactly the same! The
only difference is the nature of the questions and what we do with the
answers.

If you do not look at reality in a different way after reading the rest

of this book, if you do not feel uncomfortable at any moment and if you

remain the same person at the end, I suggest you read it again, more slowly.

How do I know you will? It is just a question of inference.

YOU CANNOT AVOID PROBABILITIES

Believe me. I tried. When I was in high school I discovered that I loved

geometry and hated probability. I could visualise inside my mind every

single geometric concept and understand how it worked, but a simple

calculation of the odds of something would give me a headache. I could

never get all the possible combinations of that card deck right!

I am a synesthete. I perceive colours when I think about maths. I do

not see colours. That is not how synesthesia works. I feel colours associated

with numbers or formulas or even whole theories. The number 7 is red and

so is classical mechanics. I do not see the number in red colour when I look

at it. It is still printed in plain black letters. But it is red. The number 3 is

green, just like electromagnetism. Some of these associations, especially

the ones related to physics, I know exactly where they come from. The

numbers, I have no idea.

The fact that geometry, inside my mind, was represented as such a

varied mixture of colours while probability theory was all black and white


gives you an idea of my despair. Once I finished high school and started my

physics degree, I was relieved that I would use a minimum of probability

theory. Maybe I would never need to use it again at all. I was naive to think

that I could choose a route in physics that would use mainly geometry and

only very basic probability.

If you are thinking of becoming a good theoretical physicist, be very aware

that you cannot avoid using probabilities a lot. Quantum mechanics, the

best description of the microscopic world we know at the moment, can

only be connected with experiments through the use of probabilities. But

the presence of probabilities is much more pervasive than that. Even

thermodynamics, the theory of heat that probably conjures steampunk

images for most science fiction fans, and its twin area of statistical physics

require a deep knowledge of them as well.

You might somehow dodge probability by judiciously choosing

an area of science other than theoretical physics, but one of the things that

you will learn in this book is that you cannot really understand science

without understanding the meaning of probabilities. You do not really need

to know how to calculate them rigorously, but you do need to grasp their

fundamental concepts. Not taking the time to acquire this knowledge is

one of the biggest problems of modern scientists, many of them happy to
become more like calculators than thinkers. Nothing against that; people
do what they want with their lives. But there are always consequences, in
this case for science and even for how it is perceived by

those who are not scientists.

This does not seem appealing for non-scientists, I know. But as I

said in the previous section, science is not the only place where

probabilities are important. In every situation in which more than one
thing can happen and we do not know which one will, we are led to think
about the odds of the possible results. "Odds", of course, is just a
different term for "chances" or "probabilities".


When BAYES spills out odds for possible answers, it is spilling out
probabilities. That is the reason why its end product is P(A|D). The P in
this case is for probability. By analogy, you can readily infer that
probabilities are also the correct way to feed in the information to BAYES.
There you have the three P's. Because of that, we have to learn how to

translate information into the language of probabilities. As you see, there is

no escape. One way or another, we will need to understand probabilities

better before being able to understand and use BAYES.

Another argument to show you how probabilities are indeed

everywhere is the fact that you do not need to look for mathematicians or

physicists to find people whose job is to actually calculate them.

Bookmakers are examples of professional probability calculators who, in

general, are not scientists. People working in different jobs inside the

financial market, traders for instance, are also probability professionals.

Every decision we have to take in our lives requires some rough

estimation of probabilities. Remember that this is how the ranking

procedure we called inference is done, we are just changing the word

odds by the word probabilities.

We all know how difficult it is to take a decision and how important

it is to take into consideration as much information as we can. We do that,

usually, in a very structured way. We usually start by identifying the full

spread of options available to us. When we take a deeper look at how

probabilities are defined, we will see that this delimitation of the set of

possibilities is the first fundamental step in constructing probabilities. We

do that naturally.

We then consider similar situations in our own lives, search in

books, newspapers, or ask people if they ever had to go through something
similar. We also make some assumptions. For instance, we hope that if the

situation is similar, doing similar things will bring similar results. This is

another step in calculating probabilities. We will learn this as the technique

of counting frequencies. You assume that if you find many similar


situations, the fraction of times a certain consequence results from a given
decision will be roughly the same. This is already a measure of probability.
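The counting-frequencies idea just described can be sketched in a few lines of code; the list of past outcomes below is invented purely for illustration.

```python
# Estimate the probability of each outcome as the fraction of similar
# past situations in which it occurred (the frequency-counting idea).
from collections import Counter

past_outcomes = ["good", "good", "bad", "good", "bad",
                 "good", "good", "bad", "good", "good"]

counts = Counter(past_outcomes)
estimates = {outcome: n / len(past_outcomes) for outcome, n in counts.items()}

print(estimates)  # {'good': 0.7, 'bad': 0.3}
```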

But we also take into consideration our emotions. Sometimes, they

are so strong that they even make us ignore all other evidence we

collected, no matter how convincing it was. It is usually not simple,
but in the end we do it. The big surprise is that we will learn that even the

emotional influence enters BAYES. And it enters also as a probability! In

fact, one of the main mistakes of many other inference procedures which

are not Bayesian is that they ignore this influence.

Because this book is about taking decisions in the best way you

can, we are then forced to delve into the world of probabilities. Decisions

are the most important and unavoidable things in our lives. Taking

decisions transcends science, business and personal life. It is amazing that

in the last two centuries we not only learned there is a right way to do it,

but in one of the greatest exceptions in history, we actually discovered the

right way! That is one of the greatest unsung (among the general public)

achievements of all time.

In the best no-free-lunch way though, the fact that we know the

right way to decide does not mean that it is simple to implement. Still, I

want to show you that it is simple to understand. And once you understand

Bayesian inference and start to use BAYES and all its principles in your

personal life and your work, you will finally see the world in a totally

different way. This will make you, like me, lose your fear and stop trying to

avoid probability. You will, probably, learn to love it.

THE MEASURE OF OUR IGNORANCE

Probabilities are always related to things that we do not know for sure.

Whenever our knowledge about something is uncertain, be it a natural

phenomenon like where the lightning will strike next time or a daily life


conundrum such as which is the fastest way to get to work today, we

naturally think about probabilities.

The language of probability is part of our culture. Whenever we are

completely sure about something, we do not hesitate to say that we have

100% certainty about it. This 100% is one of the many forms of
expressing a probability estimate. It means that we analysed all relevant

information (or at least the information we deemed relevant) to reach that

conclusion and that all other conclusions are, according to that

information, surely wrong. They have 0% chance of being right.

When we are not sure about our answer, this 100% certainty

decreases and can even get to 0%. The actual number, most of the time, is

not a precise calculation, but a rough estimation based on some key

numbers. If we have to consider the possibility of two different outcomes

for something, like whether the child is going to be a boy or a girl before

the ultrasound, and we have no idea whatsoever which alternative is the

correct one, we simply pick one of them and attribute 50% certainty to it.
This is the measure of how much we do not know about that

outcome, or of how much information we do not have to infer it correctly.

Notice the emphasis I am giving to the fact that we talk about

probabilities according to the information we have (or not) at our disposal.

This is, again, a reference to our inference program BAYES. We need to

give information to get probabilities and consequently the probabilities we

get do depend on the given information very strongly. If we do not know

some aspect of a problem, most of the time we will not be able to choose

one of its possible answers with 100% certainty. Ignorance leads to

probabilities. The opposite is not always true, though. Sometimes, even if

we know everything we can know about something, we still cannot do

better than just estimating probabilities. This is what the scientific

community, especially the physicists, had to accept when it became

evident that quantum mechanics was the right description of physical

reality. Do not worry about that right now, we will have time to talk more

about quantum mechanics later on.

The Probable Universe

13

Although the association of probabilities with information (or

rather, the lack of it) makes a lot of sense when explained with the

arguments I presented up to this point, it took a long time to formalise this

idea in the appropriate mathematical/philosophical way. After a lot of

effort, wrong turns and a series of mistakes which are common in scientific

research, we finally discovered that it is not the probability itself that

measures how much you do not know about something, but a slightly

different quantity. In order to calculate this quantity, we have to know the

probabilities for all possible answers. If we feed these probabilities to a

certain program, it will give us back the size of our lack of knowledge

about the subject. This quantity, which is nothing but a positive real

number, is something that everyone has heard about at some point. It pops up

once in a while on TV shows, popular science magazines and internet

videos. It is called entropy and it is indeed a measure of missing

information, in other words, ignorance. It is a fundamental concept

underlying BAYES and we will learn a lot about that here.
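To make the idea concrete, here is a small Python sketch (the function and the example distributions are my own illustrations, not anything from the text): it computes the Shannon entropy, in bits, of a list of probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the size of the missing information."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Complete certainty: one outcome has probability 1, so nothing is missing.
print(entropy([1.0, 0.0]))   # 0.0

# Complete ignorance between two outcomes (boy or girl): maximal entropy.
print(entropy([0.5, 0.5]))   # 1.0

# Partial knowledge sits somewhere in between.
print(entropy([0.9, 0.1]))   # about 0.47
```

Notice how the even split, the distribution expressing the least knowledge, gives the largest number; the more lopsided the probabilities, the smaller our measured ignorance.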

Invariably, when a new technology is incorporated into our society, it

brings with it a new way of interpreting the world. Steam power made

people think in terms of energy. Electromagnetism incorporated the idea of

fields into popular culture. After the Second World War, something

similar happened with information. The increasing importance and

popularisation of computers and communication media since then led to

another change in point of view by the mid-nineties and to the

recognition that information should be treated as a physical entity, at the

same level of reality as energy or mass (Landauer, 1996). It is true that

information is a kind of physical entity which we do not fully understand

yet in all its subtleties, but which pervades science wherever you look.

We have some good grasp of it, but once again it becomes more

complicated when we look at the quantum world and, as we know, the

quantum world is the actual world.

To argue that we have to know how much we do not know in order

to improve our learning sounds like advice from a Buddhist monk, but it

could not be more objective. It is actually a bit obvious. Without measuring

our ignorance about a subject, how could we evaluate the improvement in

our learning of it as we acquire more information? Learning something

does not only mean to accumulate new information, but also to review our

concepts concerning the subject based on that new information. For

instance, what if the new information we get is simply useless? Useless

information does not decrease our ignorance. Unless we actually know

how much our ignorance changes when the review of concepts is carried

out, we will not be able to evaluate the best way of doing that.

Right now, you must be thinking the following: It makes sense. By

knowing the amount of ignorance I can devise a way to make this decrease

as fast as possible! Right? The surprising answer is NO. It is more subtle

than that. It turns out that the best way to learn a subject is by

guaranteeing that, each time new information arrives, the new conclusions

we draw are reached without incorporating any assumption not contained in

that piece of information.

Think about this. You see a video of a poor man robbing a shop and

then you conclude that poor people are robbers. That last bit was a

conclusion that was not implied by the fact. You are including an extra

assumption, the one that allows you to generalise to a whole population

the behaviour of a single individual.

The amazing fact that we will understand later in this book is that,

to guarantee that our conclusions on the face of new information are

unbiased, we have actually to find a way to maximise our ignorance about

anything that does not concern the information we have. Looks a bit

frustrating, but I assure you that it does make sense. Be patient and we will

get there one step at a time.

ARE YOU AFRAID OF MATHS?

We are all afraid of criminals, accidents and failures. What do we do? Most

of us find ways around them. We put locks on our doors, insure our cars and

study harder for exams. We might not like doing those things, but once we

are forced to, we just do them one way or another.

We are also forced to use maths every day to deal with money. We

have to calculate taxes, change, interest rates, mortgages and so on.

Slowly and with a lot of effort, we end up learning how to do those things

simply because we cannot avoid them. Well, we can, but then we would

have to accept a very different kind of life. Some do.

I am not going to lie to you. Probabilities require maths. The good

news is that they do not require more maths than you already know. You

will get along very well by knowing simply the four basic arithmetic

operations: addition, subtraction, multiplication and division. You will have to learn a lot of new concepts, but

as far as calculations are concerned, those four will be enough for our

purposes. That does not mean that we will not see other mathematical

ideas in this book, but they will appear in the role of side dishes which you

should feel free not to eat if you do not want to.

My own experience is that there are two very scary features about

mathematics that make it different from any other subject. The first is that

each new thing you learn in mathematics requires you to understand

almost everything that comes before. You will not understand square roots

without understanding squares, which you cannot understand without

multiplication, which you cannot understand without addition, which you

cannot understand without knowing what a number is. The main source of

this difficulty is that it is usually not enough to memorise. Given enough

time and enough storage space on our digital devices, memorising is not a

problem. The problem is that you are forced to actually understand the

concepts, and that is something that electronic devices still cannot do for

us.

And that brings us to the second scary feature of mathematics: we

are forced to think. Most of the time, we are required to think very hard.

That is the feature that often puts people off. If you do not like to exercise

your brain too much, that is going to be a problem. I will assume that, once

you have decided to read this book, you are up to the challenge.

Otherwise, I will understand if you furtively put this book back on the shelf.

Right now some of you must be shaking your heads and saying "No.

It's not that. I just don't like formulas. That's it. Can't you just explain

everything without using formulas?" Let me show you that your problem with

formulas is, actually, a no-problem. Formulas are friends, not enemies.

When you think about a formula, you think about the modern

incarnation of it. You think about those modern mathematical symbols

with some equal sign, or its cousins, inserted somewhere in between a

sequence of strange letters. A good example of how weird (or artistic,

according to one's preferences) they are is the following

∇ × B = μ₀J + μ₀ε₀ ∂E/∂t

This "doodle", as my mother used to call the symbols she would see

on the sheets of paper lying on my desk, is one of the laws of

electromagnetism. It is one of the famous Maxwell equations and it is

meant to encode a beautiful experimental fact. When people started to

play with the connection between electricity and magnetism, they

discovered that whenever you have an electric current and/or an electric

field that changes with time, you will create a magnet. The above formula,

in addition to telling that to the trained eye, gives the precise amount of

magnetic field generated when the variation in the electric field and the

current are such and such. By measuring the latter ones, the formula allows

you to know with a high precision the value of the former.

The symbols themselves are actually not compulsory. We could

describe the whole equation with words. But the symbols do have a

purpose. They work as abbreviations for a thread of thought which can be

very complex. You can imagine the combination of symbols in a formula as

a sequence of instructions in a computer program. Some of these

instructions are once again abbreviations, now for other programs in

a nested sequence that goes all the way down to simple additions and

subtractions!
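This nesting can be made literal in code. The toy Python sketch below (entirely my own illustration) builds multiplication out of repeated addition and squaring out of multiplication, so evaluating a small formula really does bottom out in sums:

```python
def multiply(x, times):
    """Multiplication by a non-negative whole number, as repeated addition."""
    total = 0
    for _ in range(times):
        total += x
    return total

def square(x):
    """Squaring a non-negative whole number, built from multiplication."""
    return multiply(x, x)

def formula(x):
    """The 'formula' x² + 3x + 1, evaluated as a nested program."""
    return square(x) + multiply(x, 3) + 1

print(formula(4))   # 16 + 12 + 1 = 29
```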

The problem with describing this program symbolised by an

equation using only words is that it can take whole books to explain what

some of them really do in detail. We do not want to write books every

time we describe some phenomenon, nor read a book every time

we want to calculate some quantity. Formulas are a healthy application of

human laziness. The Greeks, for instance, did not have our modern

notation and they were able to do a lot of mathematics themselves, but

doing mathematics with the Greek notation is something that at least I do

not wish to attempt today.

In summary, formulas are convenient devices and not using them

would be a waste of everybodys time. But if even after being presented to

all these reasons you are still not convinced that you need them, there are

two things I can say to you. The first is the nice one: just jump over the

formulas and read the text. That is what most scientists, including myself,

do the first time we meet a formula we do not know. If you ever tried to

learn a foreign language, you know how it works. When you start reading a

text, you look for all the unknown words in the dictionary, but after one or

two paragraphs you just give up and try to make as much sense of the text

as you can with what you already know. In the end, we can always learn

something and the most important result is that, next time, we have that

feeling that we are somehow more familiar with the language.

The second thing will be a slap in the face. Do you seriously think

that you can truly understand mathematics without knowing how to read

formulas? Can you imagine someone saying that she understands English

but cannot read a book? You can somehow get an idea or a general feeling

of a subject that requires maths, but if you do not make an effort to learn

the language of mathematical symbols you will always have only a

superficial understanding of the subject. If that is what you want though,

then that is fine. It is your decision to make.

2.

Games of Chance

The objective of the previous chapter was the same as that of a film trailer: hook

you and give you a taste of the things we are going to talk about. Among

those things, there is one that stands out: probabilities. Everything we will

learn in this book involves probabilities and therefore that will be the first

thing we will look at more closely, starting... now.

THE MANY FACES OF FAIRNESS

What is the probability of getting heads when you toss a fair coin? The

sensible, intuitive and often (approximately) correct answer is the obvious

value of 1/2, or 50%. The apparent triviality of this answer hides a series of

reasonable, but in no sense trivial, assumptions which are so natural that

they are hardly ever noticed by us. Each one of those assumptions, not to

mention the way they influence one another, plays an important role in

reaching the final values we gave above. Even understanding those values,

as we shall see, is a slippery task.

If you carefully read the previous paragraph, for instance, you will

notice that a precise definition of what should be understood by a coin

has not been given at any moment. Do I need to waste my time doing that?

This piece of information is surely part of the background culture of almost

every human being on this planet, and certainly of all those who are reading

this book. For the great majority, the image it conveys is that of a metal

disc with different inscriptions, or drawings, on each side. One of the sides

will usually portray the head of an authority, which is the reason that side is

called heads in English. The other side has the less classy

name of tails (in Brazil, we call them face and crown respectively).

Depending on the country, the situation or the time, there will be some

variations. In some countries, coins have holes in the middle or have a

square instead of a circular shape. Some non-currency coins are made of

plastic or even wood. But some characteristics will usually be a constant,

the most important of them being the flatness of the coin.

The role played by the flatness assumption is twofold. First of all, it

tells us that we are dealing with an object with only two sides, which

translates to only two possible results of the coin tossing: heads or tails. I

could, for instance, have asked instead the probability of getting any one of

the faces when rolling a fair dice instead of tossing a fair coin. The answer

that comes to the mind of most people is now 1/6, which has a much

higher chance of being the wrong one! Why? Because contrary to the word

coin, the word dice can describe a much more varied class of objects, not

only the typical six-sided casino dice.

In fact, if you are a geek or an RPG (Role Playing Game) player as I

was for many years (I'm still a geek, by the way), you would readily ask

"What kind of dice?" If you ever played Dungeons & Dragons, you would

know that regular solids are used as dice in the game and called by the self-

explanatory terminology of d4, d6, d8, d10, d12 and d20 (d for dice,

followed by the number of faces). Sometimes, a coin is considered a

generalised dice and called a d2 in RPG terminology. There is even the

case when you might be required to toss a d100, which is often not actually

a dice with 100 faces, but in fact the combination of two d10 tossed at the

same time, with each one representing a digit of a number between 1 and

100. Given the convenience and obvious generality of this notation, I will

use it as my standard notation for dice in this book.

Dice with different numbers of faces, normally used in Role Playing Games like Dungeons &

Dragons.
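As a side note for the curious, the two-d10 trick is easy to simulate. Here is a Python sketch of mine, using one common reading of the d100 convention, in which each d10 supplies one digit and a double zero is read as 100 (tables vary):

```python
import random

def roll(faces):
    """One roll of a die with the given number of faces: a d2, a d6, a d10..."""
    return random.randint(1, faces)

def roll_d100():
    """A d100 built from two d10: one for the tens digit, one for the units."""
    tens = roll(10) - 1    # read the first d10 as 0-9
    units = roll(10) - 1   # same for the second die
    value = 10 * tens + units
    return 100 if value == 0 else value  # the pair (0, 0) counts as 100

rolls = [roll_d100() for _ in range(10_000)]
print(min(rolls), max(rolls))   # everything lands between 1 and 100
```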

The second idea conveyed by the flatness assumption in a coin (or,

alternatively, d2) is that of symmetry. This is an important concept in all

areas of science, because it summarises the idea of a set of things which

are indistinguishable such that there is no reason whatsoever to prefer one

over the other. As a fancy but extremely important example, the most

fundamental laws of physics are based on a symmetry concept called

gauge symmetry, which among other things is the reason for the existence

of electromagnetic fields and the explanation of why electric charges are

conserved much like energy.

Back to our coin, when we assume that it is flat, this means that

both its sides are equally flat, which then guarantees that there is no

influence of the shape on the chances of getting either one or the other

result when we toss it. The same happens with dice. As long as all faces

have the same shape and the dice is geometrically symmetric, in the sense

that it looks the same when you are facing any of its sides, there is no

reason to think that one of them is privileged. This observation is

connected to another piece of information contained in the initial question,

the quality of being fair.

Fair coins or dice are those which, in addition to being symmetric in

shape, are not loaded. As we have already seen, the geometric symmetry is

there to ensure that the external shape will not have any influence on the

final result, but coins and dice are physical objects and there is another

way to make one face more likely than others: tampering with their

internal structure.

Fairness, in this case, goes beyond the external appearance of the

objects and requires the symmetry to be also an internal one. This means

that there should be no variations in the density of the material of which

the object is made at any point in its three-dimensional structure. In

practice, there will always be some irregularities, but we expect them to be

small enough not to influence the final result too much. Besides, if these

irregularities are truly just random, they will have the same chance of

increasing or decreasing the local density, making them cancel out in

practice. Although perfect fair coins or dice do not really exist, these

idealisations are very useful and we will use them many times in this book

for the sake of clarity.

We have finally described all the relevant details of our coin, but

we are not done yet. Having opened up your eyes to the fact that there are

some details which are very important when calculating a probability, I still

have not described the most important one: how the tossing itself is being

carried out.

Tossing a coin, or rolling a dice, is something that we have seen so

many times in our life that we hardly stop to think about the several ways

this can be done. We immediately attribute another fairness quality to the

tossing (not the coin), meaning that the person who is carrying it

out is going to throw the object upwards or downwards with enough speed

and power to make it impossible to predict how it is going to stop. The

conviction that this is an agreed means of doing a tossing is so strong that if

the coin-tossing person simply chooses one of the sides and gently puts it

upwards, that would result in a wave of angry complaints from all other

players of the game. That is especially true in an RPG session, where

emotions can run amazingly high.

You should now stop reading for a couple of minutes and reflect on

how eye-opening the above discussion is. We started with an almost trivial

action and saw that we automatically attach to it a whole series of

assumptions that might be completely wrong! Be assured that this happens

with almost every action we carry out in our daily lives. If you think enough,

you will start to see it everywhere.

One can now appreciate how assumptions based on some

previously acquired knowledge are put together in our minds to create a

scenario in which there is no reason to believe that one side of our coin has

a higher chance of turning upwards than the other. This symmetry, which

depends on the shape, the structure and even the way the coin is tossed, is

then translated into the number . Why? Well, we simply imagine that we

have a whole thing that has two possibilities. We attribute to this whole

thing the number 1 and then it is only natural to divide it in two halves,

neither being more probable than the other, resulting in the number 1/2.

The particular numbers, 1 and , obviously need a better justification, but

that is something more complicated and we will spend some time to

understand it in the following chapters.
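The idealised fair coin can also be checked by brute force. In this Python sketch (my own toy simulation, not anything from the text), the observed frequency of heads in a long run of simulated tosses settles near the value 1/2 that the symmetry argument suggested:

```python
import random

def toss_fair_coin():
    """One toss of an idealised fair coin: no side is preferred."""
    return random.choice(["heads", "tails"])

tosses = [toss_fair_coin() for _ in range(100_000)]
frequency = tosses.count("heads") / len(tosses)
print(frequency)   # close to 0.5 on any long run
```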

But even without knowing the precise reasons why this is done, we

are all used to attributing numbers to probabilities. We see it on TV or in

places like horse races all the time and we develop some understanding of

it. My preferred way to express probabilities in this book is by giving a

number between 0 and 1 instead of the more popular percentage. This is

really just a question of preference, of course. A percentage simply means

something divided by one hundred and you can convert from percentages

to numbers between zero and one simply by dividing the percentage by

100. For instance, 25% is the same as 25/100 = 0.25. The

conversion rule also goes the other way round. If I say that some probability is

0.3, you just multiply it by 100 to get the percentage, in this case 30%.
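The two conversion rules fit into a pair of one-line Python functions (a trivial sketch, with names of my own choosing):

```python
def percent_to_probability(percent):
    """'Per cent' literally means 'per hundred': divide by 100."""
    return percent / 100

def probability_to_percent(probability):
    """The rule in reverse: multiply by 100."""
    return probability * 100

print(percent_to_probability(25))    # 0.25
print(probability_to_percent(0.3))   # 30.0
```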

Back to our discussion on fairness, it serves to illustrate another

notion which is so important that we cannot emphasise it enough. Whenever

we calculate the probability of a certain event, meaning the result of any

kind of experiment that is carried out, we start by making a series of

assumptions about how that event might happen. We make a mental

model (remember this word!) of that particular experiment and use it to

deduce symmetries and other rules that we think make sense, like for

instance how the coin or dice will move in space.

All the assumptions that enter in the building of a mental model

are based on some information we have before the actual experiment is

carried out and, because of that, it is called prior information. The idea of

prior information is one of the most important concepts we will

encounter in this book, not to mention in science and life, and we will

dedicate more time to it later on. From everything we have learned, it must

be clear that we use this prior information to construct the probabilities for

the possible outcomes of an event. Because this kind of probability is based

on prior information, we call it by the obvious name of prior probability.

For our games, the prior probability we are interested in is the chance of

each face coming up in a dice or coin tossing.

But what happens once the coin is tossed? In real life, especially

when we go to a casino, we can never really trust that all the symmetry and

fairness assumptions are indeed correct. Consider the following situation. I

take a d10 out of my pocket and show it to you. At first sight it looks really

symmetric, which I swear to you is the truth. You take it into your hands

and examine it as closely as you want. Although it is difficult to evaluate

the homogeneity of its internal density, a quick check does not seem to

indicate any serious tampering. I also guarantee to you that I will throw it

as high as I can to make the action as fair as possible. This should be

enough to convince you that the probability of any side is 1/10, which is a

number with the meaning that each of them has the same chance of

turning upwards at the end. In other words, the odds for any face are 1 in

10. At this point, we say that your prior probability or, to get rid of too

many words, simply your prior for this dice rolling is 1/10 for each face. I

then throw the dice and the result is 1. Then I do it again and I get another

1. If that happens a third time, you would start to suspect that something is

wrong. Either the d10 was not as symmetric as you thought or I am using

some wicked sleight of hand to influence the rolling.

If I roll the d10 one hundred times and all the results turn out to be

1, you would have no doubts that the probabilities for each side are not the

same and that your prior, as plausible as it seemed to you at the beginning,

was completely wrong. If I stop at that point and say to you that, if you

predict the next result correctly, I will give you one thousand pounds, what

would be your best shot? No matter how symmetric the d10 looked in

the beginning or how high I throw it each time, any other answer than 1

would obviously not be a very smart one. Excluding, of course, the case in

which I am really a magician who can actually manipulate the results and

want to get some easy money from you, everything points to the fact

that the particular d10 I am rolling will always give the same result. Forget

the prior information. It was obviously wrong.

What happened here, as you surely can appreciate, is another

example of our central theme: inference. As we discussed briefly in the

beginning of this book, whenever we have to predict something, either

about the future or about the past, we rely on some previously acquired

information to do that. The prior probability summarises all this initial

amount of information. In the d10 rolling case that was described above, all

available prior information would suggest that any number from 1 to 10

would be equiprobable (having the same probability). But another

important lesson in life is that information might be either wrong or

incomplete and, once we recognise it, we have to change our beliefs. And

by changing our beliefs, we consequently change our next predictions! If

we insist on not changing our beliefs in the face of new information, our

predictions will be wrong.

Let me add a small digression about prediction at this point.

Although most of the time the word is applied with the meaning of using

information to make an educated guess about a future event, throughout

this book I am going to use it also to denote a hypothesis about a past

event that is still unknown. For instance, based on some oral legends about

some historical character, one might be able to predict her birthplace even

if it is not yet known. The same can happen with the location of some city

(like Troy) or archaeological site, even though these are events that

happened in the chronological past. Some people would use the word

retrodict, but this level of distinction is unnecessary and I will not be using

it.

Back to inference, I shall stress time and again that in order to

predict things correctly it is not enough to use all prior information, but we

also need to update that information in light of new results! The seemingly

infinite series of 1s in our d10 virtually forces us to consider a new

probability for each face, one where the face 1 is going to turn up with

probability 1 and all the others with probability 0. This new updated

probability, after new information is incorporated, is what is called a

posterior probability. This process of incorporating new information to a

prior to get a posterior is usually referred to as updating the probabilities.
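The whole prior-to-posterior cycle for the suspicious d10 can be sketched in a few lines of Python. Everything here is an illustrative toy of my own: two competing models of the die (a fair one and a "loaded" one that always shows 1), a prior that starts out trusting the die, and Bayes' rule applied once per observed roll.

```python
# Two hypothetical models of the d10; the names and prior weights are
# illustrative choices, not anything prescribed by the text.
prior = {"fair": 0.99, "loaded": 0.01}

def likelihood(model, outcome):
    """Probability of a single observed roll under each model."""
    if model == "fair":
        return 1 / 10                        # every face equally likely
    return 1.0 if outcome == 1 else 0.0      # the loaded die always shows 1

def update(beliefs, outcome):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalised = {m: p * likelihood(m, outcome) for m, p in beliefs.items()}
    total = sum(unnormalised.values())
    return {m: p / total for m, p in unnormalised.items()}

beliefs = dict(prior)
for outcome in [1] * 10:                     # ten 1s in a row
    beliefs = update(beliefs, outcome)

print(beliefs["loaded"])                     # very close to 1 after only ten rolls
```

Each roll of 1 multiplies the odds in favour of the loaded model by a factor of ten, so the initial 99% trust in the die evaporates after just a handful of observations.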

We can now at the same time refine and shorten the definition we

gave of inference in the previous chapter as

Inference is a method for updating probabilities.

In a more mundane way, it is a method to change our beliefs

about how probable things are. As I have said before, Bayesian inference,

or our computer program BAYES, is the most fundamental method to

update probabilities and all other correct methods are derived from it.

The word Bayesian, repeated so many times here, comes from the

surname of Reverend Thomas Bayes, a clergyman who proposed a

simplified version of the full method in a now classic paper published in 1763

(Bayes, 1763). However, as it often happens in every area of human

knowledge, the greatest contributions and the general method were only

rigorously formulated later, in this case by the French mathematician Pierre

Simon Laplace not long after. Especially in the last decades, Bayesian

inference has proven invaluable in many areas ranging from artificial

intelligence to genetics with profound, although not widely recognised,

philosophical consequences in science. Still, there are many scientists who

are reluctant to use it purely on the basis of prejudice and ignorance.

We shall explore all aspects of Bayesian inference as we travel

through the principles and the history of an area of mathematics that

touches virtually every aspect of nature. The initial point, and one of

utmost importance, is a question that we overlooked in our discussion so

far. What is a probability? But before we answer that, let us take a

historical look at how the question arose and the attempts to answer

it. As always happens, the information contained in history is crucial in

arriving at the right answer.

NOT-SO-NOBLE BEGINNINGS

Pierre Simon Laplace, the great French mathematician of the late 18th and

early 19th centuries, and a character we will meet again soon, is often

quoted as saying

IT IS REMARKABLE THAT A SCIENCE, WHICH COMMENCED WITH A

CONSIDERATION OF GAMES OF CHANCE, SHOULD BE ELEVATED TO

THE RANK OF THE MOST IMPORTANT SUBJECTS OF HUMAN

KNOWLEDGE.

One of the aims of this book is to give, or rather to be, supporting

evidence in favour of the last part of this quote. Right now, though, we

shall spend some time appreciating the very practical origins of the

mathematical interest in probabilities.

Probability, the technical word used in mathematics for chance or

odds, has always been associated with gambling. When I say

gambling, I am not referring to the innocent Sunday game with friends,

but to the actual games of chance played legally in casinos and

more furtively in closed rooms late at night. The kind of gambling in which

the stakes are real money and people's lives can be ruined at the turning of

a card or the rolling of a dice.

Gambling was actually the driving force behind the development of

what we call today probability theory in its very beginning. The first

systematic study of probability was put together by an Italian physician

called Gerolamo Cardano at some point between 1525 and 1565. The

treatise itself was only published in 1663 (Cardano, 1663), 87 years after his

death.

Gerolamo Cardano (1501-1576)

Cardano was a very controversial character. A physician by

profession, he had many interests which included astrology, mathematics

and gambling. He had a life with highs and lows. For instance, one of his

sons was executed after poisoning a prostitute with white arsenic. He

wrote many books on several subjects. In most of them, he basically

compiled information about some topic, adding, according to some of his

critics, very little to the actual knowledge of the area.

It is not easy to check all the claims, either in favour of or against his

contributions, but he is usually credited with the discovery of the

formula for the general solution of a particular kind of third degree

polynomial equation, that of the form

x³ + px + q = 0.

Remember that in the above formula, x is the unknown variable,

the one we have to find. The other letters, p and q, are given numbers.

This is called a polynomial equation because a polynomial is a formula that

is composed of a sum of powers of a main variable (in this case x, but it

could be any other letter), each one multiplied by a number. The degree of

the polynomial is always the largest power appearing in the formula. For

instance, the following formulas are polynomials

2x + 1   (polynomial of degree 1 in x)

3y² − y + 4   (polynomial of degree 2 in y)

z²¹ + 5z³ − 2   (polynomial of degree 21 in z)

Notice that it is not necessary for all the powers to appear in a

given polynomial. When we force some polynomial to be equal to a certain

value, which is usually zero, we have a polynomial equation with the

degree given by the degree of the polynomial. We all have learned in

school to solve polynomial equations of degree 1 and 2. Equations of

degree 1 are trivial, as all we need to do is to move all numbers to one side

and the variable to the other. For instance, using the first polynomial above

2x + 1 = 0  ⇒  2x = −1  ⇒  x = −1/2.

The symbol ⇒ means "implies". This is, however, not a general

solution. It is a solution for a particular equation. A general polynomial

equation of degree 1 can be written as

ax + b = 0.

This equation is also known as a linear polynomial equation. This is because it can be used in geometry to describe a straight line (we will see that later). The letters a and b can be any numbers with one exception: a cannot be zero. If a is zero, then we clearly do not have a polynomial equation of degree 1 at all. The general solution for the linear polynomial equation is then given by

x = -b/a.

The equation and the solution are general because any linear

polynomial equation can be obtained by an appropriate choice of a and b.

Similarly, we have all learned that the general polynomial equation of degree 2, also known as a quadratic equation, is given by

ax² + bx + c = 0,

and this usually has two solutions, that we call x₊ and x₋, given by the following formulas

x₊ = ( -b + √(b² - 4ac) ) / 2a,
x₋ = ( -b - √(b² - 4ac) ) / 2a.

You can appreciate that the solutions are much more complicated

than the solution for the linear equation. And there is also an extra

complication here. Due to the presence of the square root, we have to be

careful with the solutions. If the number inside the square root is negative,

there are no real solutions for the equation, which means that no real


number can satisfy the polynomial equation. For instance, consider the equation

x² + x + 1 = 0.

We have that

b² - 4ac = 1 - 4 = -3 < 0,

and, therefore, we cannot find any real value for x satisfying the equation.

If you do not believe it, feel free to keep trying.
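As a small illustration (mine, not the book's), a few lines of Python can apply the quadratic formula and report when the discriminant is negative; the sample equations are chosen just for the demonstration:

```python
import math

def real_roots(a, b, c):
    """Solve a*x**2 + b*x + c = 0 over the reals.

    Returns a list of real solutions; the list is empty when the
    discriminant b**2 - 4*a*c is negative.
    """
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real number satisfies the equation
    r = math.sqrt(disc)
    return [(-b + r) / (2 * a), (-b - r) / (2 * a)]

print(real_roots(1, -5, 6))  # x**2 - 5x + 6 = 0 -> [3.0, 2.0]
print(real_roots(1, 1, 1))   # x**2 + x + 1 = 0 -> [] (negative discriminant)
```

The second call is exactly the situation discussed above: the formula asks for the square root of a negative number, so no real solution comes out.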

Today, we know that in cases like the one above, although there

are no real solutions for the equation, we can find complex numbers

satisfying it. Remember that complex numbers are a generalisation of the

real numbers in which it becomes legal to take square roots of negative

numbers. We will study them a bit later on.

Before the complex numbers were taken seriously, however, the attitude towards solutions of equations in which square roots of negative numbers appeared was to simply say that there were no solutions. Then came Cardano's solution for his polynomial equation of degree 3, also known as the cubic equation.

The equation whose solution was presented by Cardano is still not the most general cubic equation, because it is lacking the term with the square of the variable. Because of that, it sometimes gets the unusual name of depressed cubic. Still, even in this simplified case, the solution is already more complicated than the one for the quadratic equation. I am not going to write it in full here; it is enough to know that it also involves square roots, but in a more complicated way.

In the same way as we have seen for the quadratic equation, it might happen that the numbers appearing inside the square roots are negative. For the quadratic equation, when this happens the solution is complex and, as we have seen, when people did not know about the complex numbers they would simply say that there were no solutions. However, in Cardano's formula, the square roots appear in an intermediate step, and what could happen is that we could have perfectly real solutions even if those square roots were of negative numbers!

Because the square roots of negative numbers could be

manipulated to give real numbers as the final solutions of the equations, Cardano was led to consider their existence as legitimate mathematical entities that could be used in calculations, and this led some to also attribute to him the invention of the imaginary numbers, those which are square roots of negative numbers.
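This effect can be checked numerically. The sketch below (my illustration; the sample equation x³ - 15x = 4 is a standard textbook example, not one from this book) runs Cardano's recipe through Python's complex numbers: the intermediate square root is of a negative number, yet the final root is the perfectly real number 4:

```python
import cmath

# Cardano's recipe for a depressed cubic of the form x**3 + p*x = q,
# tried on x**3 - 15*x = 4 (p = -15, q = 4).
p, q = -15.0, 4.0

# The number under this square root is negative (-121)...
inner = cmath.sqrt((q / 2) ** 2 + (p / 3) ** 3)
u = (q / 2 + inner) ** (1 / 3)  # principal complex cube root
v = (q / 2 - inner) ** (1 / 3)

x = u + v  # ...yet the two complex pieces add up to a real root
print(inner)       # 11j: a square root of a negative number
print(abs(x - 4))  # essentially zero: x = 4 indeed solves the cubic
```

Here u and v are roughly 2 + i and 2 - i; their imaginary parts cancel, leaving the real solution Cardano's contemporaries would have declared impossible.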

Although imaginary numbers will not necessarily be used by us in this book, they are connected to probability via (guess what!) quantum mechanics. We will have only a very brief encounter with them, but if you are interested in more details about this connection, and about Cardano himself, I would recommend Penrose's Shadows of the Mind (Penrose, 1994).

Cardano's love for gambling made him probably the first person to create and study systems to earn money from it. Cardano's book was called Liber de ludo aleae, which is Latin for Book on Games of Chance, and, as we have already seen, was only published posthumously in 1663. The book was actually more of a gambling manual than an actual mathematics treatise, but it is still considered the first of its kind.

One of the most interesting things about Cardano's book is that it is possibly the first place in which prior probabilities for dice rolling are calculated by taking into consideration arguments of symmetry. To be fair, the idea is not exposed in this way, but it more or less follows the same steps we used in the previous section. The possible results are enumerated and then the same odds are attributed to each one. Later on, Galileo would write a brief treatise named Considerazione sopra il Giuoco dei Dadi, whose exact date is unknown, in which he would perform the same feat. It is, however, not certain whether Galileo had the idea independently or if he knew Cardano's work beforehand.

Although other scientists and mathematicians, like Kepler for instance, also touched on the subject several times, the next great leap in probability theory came with Pierre de Fermat and Blaise Pascal, and was once again tied to gambling issues.

Pierre de Fermat (1601-1665)

Blaise Pascal (1623-1662)

Antoine Gombaud, also known as the Chevalier de Méré, was a French thinker, writer and, by many accounts, gambler who proposed to Pascal a problem concerning an unfinished game of chance. The original formulation of the problem is less important than the idea it conveys, and it was described using what was familiar at that time. As our aim is to understand its essence, we are going to use a simplified version of the game.

Suppose two men are playing a number n of games of heads and tails with a fair coin. As we are all used to in these situations, the number n is chosen to be odd simply to guarantee that there will always be a winner. It is just like saying the best of three, or five, or seven. Let us use the last option and take n = 7. To be the winner, it is enough to guarantee 4 victories. If this happens before the 7 games are played, there is obviously no need to proceed and we can stop tossing the coin. To


make things more exciting, let us assume also that there is a money prize

to be won. What happens if the sequence of games has to be interrupted in

the middle? For instance, suppose the players are two guys, one of whom

did not finish the washing up before going to the bar and his wife has just

arrived with that angry look on her face (the male readers probably know

very well what I am talking about).

Assume that this happens after the 4th coin toss. Surely, if both players are tied at this point, no one would care too much. They would simply stop the game and go back to their homes. However, if one of the players had already won 3 games, he would not be satisfied to come back home empty-handed. After all, he was one game away from getting all the money! How should the money be divided between the two guys in this situation? What kind of arrangement would leave both of them satisfied?

The creative reader might come up with many solutions, but the

actual intention of the puzzle is to illustrate a situation in which one has to

calculate the odds of someone winning based on what happened so far.

This puzzle was attacked by both Fermat and Pascal in a series of now

famous letters exchanged during 1654 of which only three survived. These

letters are considered to be the landmark that defines the creation of the

mathematical foundations of probability theory. If you understood the

problem well, you probably realised that it is actually a problem of

inference.

From the Pascal-Fermat letters, two methods to calculate

probabilities were born. They were called the classical method and the

frequency method. Pay attention to these two terms as this is a distinction

that lies at the kernel of even the most modern discussions about probability, in particular those concerning Bayesian methods. Once I explain what both of them are in detail, you will surely understand the problem.

The idea of the classical method is to break down an experiment into an exhaustive set of equiprobable outcomes. This set is technically called the sample space. The feature of being exhaustive is extremely important here. It means that this set contains all possible outcomes, nothing more and nothing less. Therefore, if the total number of outcomes is N, the probability of each one should, according to the equiprobable assumption, be given by 1/N. In the case of a fair coin, the sample space would be

{heads, tails},

and this gives the 1/2 probability for each possibility that we became familiar with. For a fair d6, it gives probability 1/6 for each face and so on. As you can readily appreciate, this allows one to attribute probabilities before actually carrying out the experiment. The classical method is, in the end, the simplest possible recipe to calculate a prior probability!

There are, however, situations in which it is difficult to infer prior

probabilities. This happens when there are no simple symmetry arguments

to help. Get a sheet of paper and crumple it into a ball. Can you assign any probability whatsoever to how it is going to land if you throw it?

Another example would be a loaded coin. Although it would be possible in

principle to calculate probabilities if one has complete information about

the density variations, that would be a very complicated problem that

would probably require some computer program to solve in practice.

Because of all these complicated scenarios, which are actually

more frequent than the simple ones, the frequency method was

suggested. The idea is to count the number of times a certain result is

obtained if the experiment is repeated several, several... several times. The

more times the experiment is repeated, the better the estimation of the probabilities becomes, a result that we shall come to know by the name of the Law of Large Numbers. To be more precise, the usual formulation of the Law of Large

Numbers is concerned with average values. The version that concerns

frequencies is called Borel's Law of Large Numbers. Émile Borel was another French mathematician. He lived in the late 19th and early 20th centuries and was also a member of the French Resistance during the Second World War.
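A few lines of Python (my sketch, not from the book) show the frequency method converging to the classical answer for a simulated fair coin:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def head_frequency(n_tosses):
    """Estimate P(heads) of a fair coin by the frequency method."""
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The more repetitions, the closer the frequency gets to the
# classical answer 1/2, Borel's law of large numbers at work.
for n in (10, 1000, 100000):
    print(n, head_frequency(n))
```

With 10 tosses the estimate is typically well off 1/2; with 100 000 it agrees with the classical method to a couple of decimal places.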


But the frequency method is not an ultimate solution; it has its own problems too. If you need to repeat the experiment to calculate probabilities, how can you determine the frequency of an unrepeatable event? For instance, what is the probability that a certain person will die from tuberculosis or cancer, let's say, tomorrow? That is in no way repeatable many times. In this case, what is often done is to use some extra assumptions which seem reasonable. These assumptions are always

implicitly based on some symmetry principle. For instance, physicians assume that if you count the frequency of deaths in the whole population, this frequency can be extended to any person inside the population because people's biology is approximately the same; it is approximately symmetric. Far from being a trivial and obvious assumption, this can be prone to a whole series of problems when the deviations from symmetry become important.

Another issue with the frequency method is the assumption called statistical independence. When you infer the probability of heads or tails for a loaded coin by counting the frequencies of these individual results, you are secretly assuming that the tosses are independent, meaning that the previous results will not influence the next one. This seems logical in the case of a real coin, but might not be the case for other processes in nature! In fact, there is a general class of processes in nature, called Markovian processes in honour of the Russian mathematician Andrey Markov, whose main characteristic is that each result depends on the previous ones. Think about rain, for instance. The simple fact that we can actually talk about the existence of rainy seasons indicates that rain today is far more probable if it rained yesterday than if yesterday was the driest day of the year.

Of course, one can fix this issue by being judicious about how to count frequencies. If one knows that each result depends only on the previous one, then we can count frequencies of pairs of events. However, if one does not know a priori the extent of the dependency (pairs, triples and so on), this again becomes a prior assumption that has to be vindicated by more experiments. As you can see, it is not easy to get rid of priors.
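To see how counting pairs uncovers this kind of dependence, here is a small simulation (an invented toy model, not anything from the book; the transition probabilities are arbitrary illustrative choices) of a rainy-season Markov chain:

```python
import random

random.seed(1)  # fixed seed for reproducibility

# A toy "rainy season" chain: today's weather depends on yesterday's.
# These transition probabilities are invented for illustration.
P_RAIN_AFTER_RAIN = 0.8
P_RAIN_AFTER_DRY = 0.2

days = ["rain"]
for _ in range(200_000):
    p = P_RAIN_AFTER_RAIN if days[-1] == "rain" else P_RAIN_AFTER_DRY
    days.append("rain" if random.random() < p else "dry")

# Counting frequencies of pairs, as suggested above, recovers the
# dependence that single-result frequencies would completely miss.
def rain_frequency_after(prev):
    nxt = [b for a, b in zip(days, days[1:]) if a == prev]
    return sum(d == "rain" for d in nxt) / len(nxt)

print(rain_frequency_after("rain"))  # close to 0.8
print(rain_frequency_after("dry"))   # close to 0.2
```

The overall frequency of rainy days hovers around 1/2, so counting single results alone would make this chain look just like a fair coin; only the pair counts reveal the memory.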

Clearly, logical consistency requires that both the frequency and

the classical methods give the same answer when the situation is such that

both of them can be used to calculate the desired probabilities. Otherwise,

something would be very wrong with one or both of them.

For the sake of completeness, and to satisfy the curiosity of some

readers, let me explain what the solution of the puzzle posed by the Chevalier de Méré is in modern probability language. If you forgot what the problem

was, go back to the beginning of this section and read it again. To make

things simpler, let us assume that Andrew (player A) always bets heads

and Barney (player B) always chooses tails. We will also assume that

Andrew is the one that won 3 games, while Barney won only 1. The only

way for Barney to win is if the next 3 results turn out to be tails. If we list

all the possible results for the rest of the game, we end up with

HHH

HHT

HTH

THH

HTT

THT

TTH

TTT

By doing this we can easily see that the total number of

possibilities is 8 and, out of that, only 1 possibility will give Barney his

desired victory. Because they are friends and trust each other, they have

no problems agreeing that the coin should be fair and that the 8

possibilities above should have the same probability. Therefore, if Barney

has a chance of winning in only 1 of the 8 possible scenarios, they happily

agree that Barney should receive 1/8 of the total prize, while Andrew keeps


the other 7/8 of it. This is the essence of the solution found by Pascal and Fermat. In fact, the way I described it above, considering all possible combinations of results, was actually due to Fermat. If Shakespeare's famous line "Kill all the lawyers!" had somehow been realised, probability theory, and many other areas of mathematics, might be much less developed right now, as Fermat was actually an amateur mathematician. His true profession was that of a lawyer.
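Fermat's enumeration is short enough to spell out in code. A Python sketch of the counting (my illustration, not anything from the letters themselves):

```python
from itertools import product

# Fermat's enumeration: every way the remaining (at most three)
# tosses could come out, H for heads (Andrew) and T for tails (Barney).
outcomes = list(product("HT", repeat=3))

# Barney, needing three more wins, is saved only by T, T, T.
barney_wins = [seq for seq in outcomes if seq == ("T", "T", "T")]
barney_share = len(barney_wins) / len(outcomes)

print(len(outcomes))  # 8 equiprobable scenarios
print(barney_share)   # 0.125, i.e. Barney's fair share is 1/8
```

Note that some of the eight sequences would never be played to the end (the game stops as soon as Andrew gets his fourth win), but listing them all at full length keeps every scenario equiprobable, which is exactly what makes the counting argument work.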

Explained this way, the problem seems so trivial that it is almost

unbelievable it took the efforts of two of the greatest mathematicians in

history to find its solution. However, you must remember that those were

the beginnings of probability theory and the ideas involved were not well

understood at that time. Besides, once you already know the answer to something, it all looks simple to you. That is good advice to keep in mind when you are writing a test for your students...

One fact is worth noting for our future discussions. James Bernoulli's Ars Conjectandi, published in 1713, showed that the frequency method and the classical method were indeed consistent. The interesting thing about James Bernoulli is that he was also known by the names Jakob (or Jacob) and Jacques. So, you might see the above book attributed to three or four different Bernoullis, who are in fact all the same person.

Leibniz, the famous philosopher who shared the discovery of

calculus with Newton and earned the enmity of the latter for the rest of his

life because of that, published a dissertation in 1666 about combinatorics

called Dissertatio de Arte Combinatoria.

The term combinatorics is used to describe a branch of

mathematics concerned with counting the total number of elements in a

given set. The name comes from the fact that it was initially concerned

with calculating how many combinations one could find by grouping a

certain number of objects. For instance, consider that you have three balls,

one RED, one GREEN and one BLUE. How many combinations of two balls

out of these three can you get? The total is easily found as being 3:


RED+GREEN, RED+BLUE and GREEN+BLUE. This is an easy case that can be

solved simply by listing all possibilities. However, how would you calculate

the number of combinations if you have 100 different balls and need to

arrange them in groups of 5?

The importance of combinatorics for probabilities is that many

times the possibilities of an experiment are obtained by combinations. One

well known example is the lottery. A usual version is that in which there are

100 numbers and you need to predict a combination of 5 of them to earn

the prize. Because all numbers have the same probability of being chosen

in a draw, all combinations should have the same probability too by

symmetry. Therefore, the probability of each combination, by the classical

method, should be 1 divided by the number of combinations. To

discourage you, the number of combinations is 75 287 520, which means

that the odds of winning this particular version of the lottery by choosing one combination are 1 in approximately 76 million.
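The counting itself is a one-liner nowadays; a quick check with Python's standard library:

```python
from math import comb  # number of ways to choose k objects out of n

# The three-ball warm-up: RED+GREEN, RED+BLUE, GREEN+BLUE.
print(comb(3, 2))    # 3

# The lottery described above: groups of 5 numbers out of 100.
print(comb(100, 5))  # 75287520, odds of about 1 in 76 million
```

The same function answers the harder question posed earlier, how many groups of 5 can be formed from 100 different balls, without listing a single one of them.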

A curious fact about Leibniz's book concerns the notation he suggested for combinations. Calculus today is often written using Leibniz's notation, which is more modern and even more powerful than the one suggested by Newton. So, it is not a surprise that he tried to introduce a new notation for combinations as well. From the point of view of Internet notation, Leibniz can be seen as four centuries ahead of his time. He suggested using, instead of the word combination, the term com2nation for all arrangements of two objects, con3nation for three objects and so on. Our lottery guess would be a com5nation. If only he had lived in today's world, maybe it would have been a huge success, but at that time, it was not (Todhunter, 1865).

Many other great mathematicians wrote treatises about probability. Great names like Huygens, Leibniz, Daniel Bernoulli (not Jacob!), Montmort, de Moivre, Euler, D'Alembert and Newton all contributed their bit to the development of this area of human knowledge. The list is very extensive, but the focus continued to be mainly on games of chance, with occasional excursions into insurance and demographics problems, until Laplace took the stage.

BAYES AND LAPLACE

Pierre-Simon Laplace was a prominent member of the great French mathematical school of the 19th century. He was also an astronomer. In fact, he and the English scientist John Michell were the first ones to propose that there might be stars so dense that the escape velocity from their gravitational fields would be greater than the speed of light. Because light coming from the surface of these bodies would not be able to escape to outer space, they would appear as black balls to an external observer. They called them black stars. Of course, at that time, the current theory of gravity did not predict that light could be attracted by massive bodies. More than a century later, though, Einstein suggested that light would indeed be affected by gravity. It did not take long for the idea to make a huge return in the form we know today: the black hole.

Among his many works, one of the most important achievements of Laplace, the most important for us at least, was his rediscovery of the work of an English clergyman and mathematician from the 18th century called, as you might expect, the Reverend Thomas Bayes.


Thomas Bayes (1701-1761)

Pierre-Simon Laplace (1749-1827) by Guérin

Two years after Bayes died, in 1763, one of his works, called An Essay Towards Solving a Problem in the Doctrine of Chances, was presented by Richard Price to the Royal Society and then published. This problem in the doctrine of chances, doctrine of chances being the old name for probability theory, was something that, in modern language, we would call an inverse problem in probabilities. The direct problem in probability theory is to predict the outcome of an experiment given the probabilities of each one of the possible results, while the inverse problem consists in predicting the probabilities themselves given the observed outcomes. Price, a philosopher and preacher, thought that Bayes' work helped prove the existence of God. On that matter, he was certainly wrong.

What the paper actually contains is a formula for finding the

posterior probability of a certain hypothesis concerning an experiment

once two other pieces of information are given to you: (1) the prior

probability of the hypothesis and (2) a set of observed outcomes of that

experiment. This is, if you remember, our program BAYES.

A hypothesis about an experiment is some feature or assumption

you attribute to it. When we consider fair dice rolling, we assume a

symmetry hypothesis. If you remember our example of the loaded d10,

which would always give 1, the symmetry hypothesis would lead us to think that all the faces would have the same probability. Because the observed outcomes were always 1, which was in disagreement with that hypothesis, we had to decrease the probability that the symmetry hypothesis was true.

This is a subtle point and it is where even I get confused sometimes.

The probability of the symmetry hypothesis is one thing, the probability of

the faces of the dice is something completely different. Do not worry if you

do not get this right now, we will go through all of these concepts many

times and very slowly throughout this book. The first time I wrote the

previous paragraph, for instance, I started to talk about the probabilities of

the faces as being the probability of the hypothesis and had to erase

everything.
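To make the distinction concrete, here is a small Python sketch (my own illustration; the 50/50 prior and the specific "always shows 1" alternative are assumed numbers, not the book's) of how repeated 1s erode the probability of the symmetry hypothesis itself:

```python
# Bayesian updating for a loaded d10 that keeps showing 1.
# The 50/50 prior and the alternative hypothesis are illustrative choices.
p_fair = 0.5    # prior probability of the symmetry hypothesis
p_loaded = 0.5  # prior probability of the "always shows 1" hypothesis

for roll in [1, 1, 1, 1, 1]:
    like_fair = 1 / 10                       # a fair d10: each face has prob 1/10
    like_loaded = 1.0 if roll == 1 else 0.0  # the loaded die only shows 1
    # Bayes' theorem: posterior is prior times likelihood, renormalised
    p_fair *= like_fair
    p_loaded *= like_loaded
    total = p_fair + p_loaded
    p_fair, p_loaded = p_fair / total, p_loaded / total
    print(round(p_fair, 6))
```

Notice that the probability being updated is the probability of the fairness hypothesis, while the fixed 1/10 inside the loop is the probability of a face under that hypothesis: the two different probabilities the text warns about.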

Bayes' original work was, once again, a study concerned with

games of chance. Laplace, however, had more heavenly preoccupations on

his mind. He wanted to predict properties not of games of chance, but of

celestial bodies. In fact, what Laplace wanted to do was really revolutionary

for his time. His idea was to collect data from different sources to estimate

a property which was not intrinsically random! He did that, for instance, to

estimate the mass of the planet Saturn.

Let us highlight here the novelty of the idea, because it is the stroke of a genius. It is one thing to estimate a random property, like the next number to turn up in a dice throw. It is clearly going to be different each time (given a series of assumptions, as we have already seen). But the mass of a planet, especially such a huge one, is not supposed to vary significantly from time to time in a random way. Not within the relevant precision, anyway. Once the planet's mass is defined, it is going to stay that way, for some time at least. Therefore, who would use probability theory to estimate it? How could one do that? It does not even make sense.

But it did. What Laplace noticed is that the source of randomness

in the problem was not in the mass of Saturn, but in the errors of

measurements. When I was in my first undergraduate year in physics, we

had a laboratory task that required us simply to measure a set of plastic


pieces hundreds of times. The measurements were made with a very

precise tool and, guess what, within the precision of the tool each time we

would get a different number. That is because we could never measure the

piece from the exact same angle, in the exact same way. The shapes were

not changing randomly, but small uncontrollable variations caused by our

inability to exactly reproduce the position of the measuring device caused a

kind of noise (remember this word) in our obtained measurements.

If this could happen with small plastic objects that could fit in the

palm of our hands, imagine how much worse this can be in the case of

measuring the mass of something as distant as Saturn. What was random

was not the actual mass of the planet, but, in the same way as the plastic

shapes I once measured, it was its measured mass which was influenced by

random errors. In fact, the measurements were not even of Saturn's mass directly; they were of its position in the sky, which, together with the equations of celestial motion, would give the planet's mass in a very indirect (and error-prone) way.

The whole idea then was the following. By putting together all

information we know about celestial mechanics given by the Newtonian

theory of gravity (prior information) and the measurements made by

several different observatories (experimental observations) we could

calculate a posterior probability for the hypothesis that the value of the

mass of Saturn was such and such! Once we have this probability, we can

then calculate an average mass and the uncertainty in this prediction. We

have not talked about this last part yet, but for now, just be aware that we

can actually do that with almost every probability of interest in science.
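Laplace's insight can be mimicked in a few lines: simulate repeated noisy measurements of one fixed quantity and watch the estimate sharpen. All numbers below are invented for illustration (this is not Saturn data):

```python
import random

random.seed(42)  # fixed seed for reproducibility

# The quantity itself is fixed; only the measurement noise is random.
TRUE_VALUE = 95.2  # an invented "true" value
NOISE = 2.0        # an invented measurement error scale

measurements = [TRUE_VALUE + random.gauss(0, NOISE) for _ in range(1000)]

estimate = sum(measurements) / len(measurements)
scatter = (sum((m - estimate) ** 2 for m in measurements)
           / len(measurements)) ** 0.5

print(estimate)                            # close to 95.2
print(scatter / len(measurements) ** 0.5)  # uncertainty, roughly NOISE/sqrt(N)
```

No single measurement is trustworthy, yet combining a thousand of them pins the fixed quantity down to a small fraction of the individual error, exactly the kind of averaging-with-uncertainty that Laplace's posterior delivers.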

The formula that Laplace used was the same one that Bayes discovered in his work. Today it is known as Bayes' Theorem. Laplace's estimate of Saturn's mass using this method was so accurate that the improvement factor on his estimate up to this date is only 0.63% (Sivia and Skilling, 2006). That was a revolution, but one obscured by two others that appeared in the following century: relativity and quantum mechanics. Only today, more than one hundred years later, have Laplace's ideas received their due credit.


After Laplace released probability theory from the shackles of gambling and monetary applications, he used it to solve problems in every other area he could think of, from law to medicine. Although Laplace's Bayesian methods were vigorously opposed by many in his time (and by a decreasing few even today), probability theory has become part of virtually every area of science since then.

In the years that followed, for instance, physicists like Boltzmann, Maxwell and Einstein would use it to prove the existence of atoms. In an even more revolutionary twist, the probabilistic character of the laws of nature was mercilessly imposed upon us by the discovery of quantum mechanics, showing that probability was built into a much deeper level of our universe.

Biology, psychology, medicine, geology. All sciences today rely on

probabilistic estimates to find correlations that allow one to make

predictions of phenomena either too complicated to be modelled in detail

or too difficult to be controlled with the necessary precision.

Even the foundations of what we understand today as being

science are deeply rooted in probability theory. But, like every deep subject, this one is still full of controversy.

Before we get there, though, we have to endure a long but exciting

journey through some of the most intriguing questions and discoveries of

science and mathematics involving the very nature of what can and what

cannot be predicted about the universe. We will spend the next chapters in

this journey, trying to understand deep issues about the limits that nature

imposes on our ability to describe it. Welcome to the realm of the random

and the uncertain.


3. Making Sense of Randomness

PREDICTABILITY

We spent the largest part of the previous chapter revisiting the history of

probability theory and how a large part of it was connected to gambling, at

least until Laplace started to apply it elsewhere. But this connection with

gambling is not a simple freak accident. A moment of reflection is enough

to convince oneself that casinos are, after all and for Laplaces dismay, the

most convenient places to see probabilities in action. Inside of them,

thousands of daily repetitive experiments are carried out and person after

person, at practically every second, tries to infer what would be the next

result in one of them.

Those experiments (rolling dice, tossing coins, spinning the roulette) have one characteristic in common: nobody can acquire enough information to exactly predict their result. We call experiments like these random experiments. Although sophisticated methods can be devised to generate a profit in the long run, if a die is rolled with enough strength, nobody will be able to measure every single piece of information with the necessary precision to always give the right answer for each roll.

It is not difficult to accept that some things in the world are

random in the sense that the result is difficult to predict in advance,

but the property of randomness is a very tricky one to define and can be

extremely deceptive when one starts to think too much about it. In this

sense, it is a property similar to life or consciousness.


In most practical situations it is not that difficult to say that

something is or is not random, but when you try to define it precisely, or

when you are faced with some particular borderline cases, you get into

trouble. It is like trying to decide if a virus or a prion, a kind of rogue replicating protein (the most famous of which causes mad cow disease), is alive. Or even if a computer virus is. We would like it not to be,

but when we try to precisely characterise what is alive or not, we can

always find a loophole in the definition in such a way that something which

we would not like to be alive will satisfy all the criteria and, sometimes,

things that should be will not.

In order: The Bird Flu virus (computer model), the Mad Cow disease prion (computer

model) and the computer virus Blaster Worm. They all can live, spread and reproduce in

their appropriate environments.

Are they alive?

Suppose, for instance, that you are in Las Vegas and decide to bet

your money on one of the numbers of the roulette. Classical physics, the

laws discovered by Newton before quantum mechanics, are the ones that

govern the movement both of the ball and of the roulette within the

precision required. The modifications in the required calculations due to

quantum mechanics have negligible effects in this case, no matter what

some gurus might tell you.

The equations of classical physics state that if you know the

position and the velocity of an object at any time with enough precision,

you can predict its future arbitrarily well. Actually, that is what you would

read in most popular accounts of physics, but that is not quite true. You

also need to know with a good precision all the constraints that affect your

problem. This means that you have to know things like, for instance, the


shape and the material of the table on which you will roll your dice. This

kind of information constrains the possible ways the dice will bounce and

move around. If you roll your dice inside a round bowl, that will be

different from rolling them inside a square box.

The main problem is that you will never be able to gather all the

necessary information to predict the required result, although, in principle,

classical physics states that you could. I emphasised in principle because

that is something you read a lot in physics texts. It means something that is

not forbidden by the known laws of physics, even if, in practice, that thing

is so difficult that nobody will really do that.

In the coin tossing or the dice rolling from the previous chapters, we have seen that we need to know details of the geometry and density of these objects, as well as the velocity with which they would leave the hands of the person throwing them. And also, we need to know the point where they leave the hand. Oh, and the velocity, density and temperature of the air around them. Yes, I almost forgot, and also... Classical physics does not forbid you from measuring all of that, but unless you are in a very controlled environment, with high-precision equipment, you will not be able to do it.

In a situation like this, randomness crawls into the experiment because of our lack of complete information about all the variables involved. If we could use all the resources of a very advanced laboratory, all our knowledge about classical physics and as many supercomputers as we needed, we could in principle measure things with enough precision to give very good predictions. But the requirements would be so enormous that we would probably have to run an entire universe full of computers for billions of years to get the prediction. If you think that this is worth one prediction, think again. And do not stop thinking until you change your mind.

By the above arguments, it seems tempting to at least recognise a

quality that we can call practical randomness. If true randomness


(whatever that means) is elusive, its practical version can be associated simply with our incapacity to predict, within a certain precision, the results of an experiment. This accommodates situations like those above, in which the experiment is predictable in principle, but not in practice.

As with any definition, though, we need to be very specific about its requirements. When we talk about prediction in systems that are continuously evolving in time, for instance, we should specify whether we are interested in predicting the long-term or the short-term behaviour of a system. This kind of difference can have serious consequences for some particularly troublesome systems, especially when we are interested in long-term predictions. A good example is provided by the famous chaotic systems, which deserve a closer look.

CHAOS (AND MAYHEM)

Chaotic systems (Gleick, 1988) became famous worldwide in the late 80s due to a series of mathematical breakthroughs that allowed a better understanding of their behaviour. In particular, the advancement of computer technology was one of the decisive elements in these achievements.

Many physical systems can have their time evolution predicted by simple equations. For instance, a car moving with speed 80 km/h and starting its journey at kilometre 20 of a motorway can have its position predicted at any moment of time t by the formula

x(t) = 20 + 80t

For a given time t in hours, x(t) provides the position of the car in kilometres. After 2 hours of trip, the car can be found at kilometre 180 if

its velocity stays constant. We call numbers like the starting position of the

car initial conditions. They are the numbers we need to provide to the

equations to start the evolution of our systems. Different initial


conditions result in different evolution histories for the systems. If the car

starts at kilometre 100, then its whole trip will be different, passing

through different cities at the times given by the driver's watch.

But that is not the whole story. Suppose you do not know for sure

the starting point. Suppose you think that the car started its journey at

kilometre 18 instead of 20. If the car keeps going for 20 hours, the correct

position would be kilometre 1620, but you would predict the car to be in

kilometre 1618. The point is, no matter how far ahead in time you try to predict, your error will always be 2 kilometres. It will not be worse than that. If you miss the starting point by, let's say, 2 metres instead of 2 kilometres, you have to agree that your predictions of the car's position will be so good that the error will hardly be felt.

What happens then with random errors in the measurement of the

initial position of the car? They will result in random errors in your

prediction which will make it slightly wrong. However, these errors will not

increase with time and very little predictability will be lost. Systems like

these go by the name of linear systems. Chaotic systems are different.

The equations describing chaotic systems are so sensitive to the

initial conditions you use to start them that, if we try to predict their

evolution using an initial value which is wrong by just a tiny amount (for

some this can be as small as one part in one million), after some time our

prediction is so different from the actual result that our calculations are of

no value at all.

You have probably read about the connection of chaotic systems with those beautiful pictures of geometric structures known as fractals. They are sets of points that are generated by chaotic systems following some special procedures. The most famous of them is probably the Mandelbrot Set, discovered by the French-American mathematician Benoit Mandelbrot in 1979: a picture with an infinite level of self-similar detail, which means that if you zoom in enough, you will find the initial shape repeated over and over again.


Different levels of detail of the Mandelbrot Set.

Note how the initial shape appears again in each one of those levels.

(Source: http://en.wikipedia.org/wiki/Mandelbrot_set)

By looking at fractals one might imagine that the equations needed

to create them are extremely complex. It might be a shock to know that

they are not. Many of them are the result of extremely simple equations

that use nothing more complex than additions and multiplications. The

Mandelbrot Set itself is generated by an extremely simple equation, but as it involves calculating with complex numbers, I will instead use as an example another equation which is even simpler: the logistic map.

This equation describes a system characterised by a single real number at integer values of time. This number is given by the variable x(t), in which t is the symbol for time, which assumes the values 0, 1, 2, 3, 4 and so on. Other values of time are not allowed in this equation, which is then written as

x(t+1) = r·x(t)·(1 − x(t))

The variable r is an arbitrary real number which is fixed for each system. What is done is to choose a value for r, give an initial value for x(0) and iterate this equation to get what can be called the trajectory of the system, which is the sequence of values assumed by x(t) as time

increases. It is basically the position in which we would find the car of the

previous example and that is the reason for the name trajectory. We call


this a map because it maps a certain value of x(t) at a time t into a new value x(t+1) at the time t+1.

The first thing you have to notice is that if we start the system with x(0) = 0, all other subsequent values will also be zero. This is called a fixed point of the equation and it is not the only one. For instance, choosing r = 2 and x(0) = 1/2 you have another one. To get a more interesting trajectory, let us then choose r = 2 and x(0) = 2. Then we get x(1) = 2·2·(1 − 2) = −4, then x(2) = 2·(−4)·(1 − (−4)) = −40 and so on. The trajectory would be given by the following set

{2, −4, −40, −3280, ...}

which blows up very fast as you can see. You can also appreciate that there is nothing random about this trajectory. Given the initial value and the parameter r, one can predict with 100% certainty the value of x(t) at any point in time. The simplicity of this equation, though, hides a darker side.

The equation can also be written as

x(t+1) = r·x(t) − r·x(t)²

Because the last term of the equation contains the square of x(t), we

cannot say this is a linear system anymore, we call it a quadratic map.

When squares appear in the equations, they can bring chaos (and

mayhem). To see this, look at the two pictures below. They are graphs

showing the values of the trajectories on the vertical axis and time on the horizontal axis for the logistic map with two different, but very close, initial conditions.


The initial values for x(0) are given in the captions of the graphs and differ by only 10%. The parameter r is chosen to be equal to 3.7 in both cases. This choice requires an explanation. The logistic map shows chaos, but not for all values of r. Some values will give very repetitive behaviour and will lead to extremely easy to predict systems. There is a magic number though, which is approximately 3.56995, above which almost all values of r will lead to chaotic behaviour. Did you notice the approximately and the


almost in the previous sentence? Yes, even this feature of chaotic systems

is difficult to predict.

At first sight, one might think that there is some regularity in the

graphs. In a sense, there is. The amplitude of the oscillations seems to

decrease and increase almost regularly. Almost. There is no specific frequency and no way to predict with 100% certainty when these pseudo-regularities will appear, nor their exact amplitude, unless you know the exact initial value. The graphs are definitely not completely random, but they are also not completely regular and surely not predictable.
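If you want to see this sensitivity for yourself, the logistic map takes only a few lines of Python to iterate. A sketch (r = 3.7 as above; the starting value 0.4 and the 10% offset are my own illustrative choices):

```python
def logistic_trajectory(x0, r=3.7, steps=60):
    # iterate x -> r * x * (1 - x), collecting every value on the way
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.40)            # one trajectory
b = logistic_trajectory(0.44)            # initial condition 10% off
gaps = [abs(x - y) for x, y in zip(a, b)]
# the gap between the two trajectories starts tiny, but after a few
# dozen steps it becomes as large as the oscillations themselves
```

Printing `gaps` shows the error growing until the two "cars" are essentially unrelated, which is exactly what the graphs display.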

If you think about these graphs as representing real trajectories, as

if the systems were actually cars on a road, the following graph shows the

distance between the two systems with time.

Once again, there seems to be some regularity but, after some time, one cannot predict what the distance between the two systems will be. If you pay attention to the vertical scale, you will see that the

distance oscillates between zero and 0.6, which is almost as large as the

position of each system which varies approximately between 0 and 0.9.

This shows that if we are trying to predict the position of the first system,

but we have a slightly wrong initial condition (in this case only 10% wrong),

then after some time we simply cannot predict it even approximately!


It is important to notice that not all is lost here. By the above

argument, one might think that we should abandon all hope of keeping

track of chaotic systems. That is not true. One of the most common

systems which is chaotic is a set of three celestial bodies. However, we

have to deal with this kind of system every day when working with

satellites, astronomy and so on. The point is that we can not only allow for

some uncertainty, but we can also keep correcting our predictions

continually. For instance, in the example of the two systems above, one can

try to measure the position of the actual car at regular intervals of time and correct our predictions. Because predictions take some time to

deteriorate too much, if we correct them fast enough we can keep the

system under a certain control.

Not all chaotic systems are the same. Some are more complicated

than others and they lose predictability at different paces. Some seem

more random and others less. The rate at which the prediction

deteriorates in a chaotic system is measured by quantities called Lyapunov

exponents and they vary from system to system.

Also, the amount of predictability lost depends on the precision

you are aiming for. The weather, for instance, is a chaotic system, but you

can have some predictability depending on what you want to measure. If

you want to know the exact temperature at some time, that will be almost

impossible, but you can still identify seasons very well in different parts of

the world.

Chaos is something that is present in many systems in nature. It is

very common in fluids, in which it is associated with turbulence. Whenever

a fluid becomes turbulent, we lose predictability concerning the way it is

going to flow. Because of its common presence in science, there are many

methods that allow us to deal with it. Although systems like these are

always on the verge of randomness, we can still extract some information

from them.


ORGANISED DISORDER

Another example of systems which are very difficult to predict is given by those said to possess Self-Organised Criticality, or SOC for short (Bak, 1996). SOC

systems live in a very delicate equilibrium between order and disorder and

some of their behaviour is so difficult to predict that even probabilities can

be tricky.

One of the prototypical phenomena associated with SOC is that of

earthquakes. In order to understand what the problem really is, let us first

talk about a physical concept called a typical scale.

Think about the following question: where does the division

between biology and chemistry lie? There is, of course, no exact answer, as there are a lot of chemical phenomena happening in biological systems, but

one still can identify the difference between a biologist and a chemist in

the typical case. The difference is in the typical scale of the phenomena

involved.

Because biology is so extensive, let us focus on zoology. If you are a zoologist, you will rarely study objects of the size of a molecule. Not that you won't, but the typical objects of your study will be animals and their organs, which can be thought of as being on the scale of metres. Ask chemists if they use this unit of length very often and you will hear them laughing.

Their objects of study are even smaller than a nanometre, which is a

billionth of a metre. This is a huge difference of scales and, in some sense,

these typical scales are enough to characterise different areas of research

or even whole disciplines.

One of the main advantages of the existence of typical scales in

most phenomena is related to predictability. If you are a chemist and some

prediction about the size of a molecule gives you a number which is in the

scale of metres, then you either found a Nobel Prize winning phenomenon

or you must have made a serious mistake. In both cases, something

very odd is happening.


The existence and usefulness of scales is terribly neglected in the

usual school education. A result of this is, for instance, students finding a

completely nonsensical result of a physics exercise and simply accepting it

without any questioning. I remember a story told to me by my high-school physics teacher in which a student calculated the size of a bathroom pipe during an exam and found a diameter of 30 metres (THIRTY METRES!). The student did not even think that this could be a wrong result.

Scientists use scales all the time to make rough estimates before

rushing to calculate things exactly. This always gives an initial idea of what

one can expect of the final result. The ability to estimate things by using

knowledge about the relevant scales in a problem is fundamental for

professional scientists, but also very useful for anyone. If you are interested, I would suggest a fantastic little book with the odd name

How Many Licks? Or, How to Estimate Damn Near Anything by Aaron

Santos (Santos, 2009).

I have been talking about size scales all this time, but virtually all quantities in science will have a typical scale for certain phenomena. The

amount of energy, for instance, characterises the difference between

everyday physics and particle physics, the latter involving higher energy

scales, which you can appreciate by the huge sizes of the particle

accelerators.

Some phenomena, however, do not present this separation of

scales. They live in a special kind of state called criticality, where all scales

contribute equally to it. Usually, criticality needs to be induced and

controlled in physical systems to survive for long times. However, there are

some special systems that evolve naturally to a critical state and stay there

by themselves. These are the systems that present self-organised criticality.

What does all this have to do with earthquakes? We all know the

Richter scale of earthquakes. This scale measures the amount of energy of

a certain earthquake. Here comes SOC. It turns out that there is no typical scale for the energy released by an earthquake! This has the gloomy


consequence that one can never predict the typical size of earthquakes before they actually happen. This is not a failure of technology or science; it is a feature of their own nature.

In this sense, predictability becomes lost in a worse way than in

chaotic systems. While we can still talk about scales in many cases where

chaos is present, this does not work anymore in SOC systems. Even

probabilistic statements become difficult as, in the case of the size of

earthquakes for instance, all scales become equally probable.
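The absence of a typical scale is usually expressed through the empirical Gutenberg-Richter law: each unit step up in magnitude makes earthquakes roughly ten times rarer, and that factor is the same at every magnitude, so no size stands out as "typical". A sketch (the constants a and b below are illustrative choices, not values fitted to real catalogues):

```python
def expected_quakes(magnitude, a=8.0, b=1.0):
    # Gutenberg-Richter law: N(M) = 10**(a - b*M) quakes of magnitude >= M
    return 10 ** (a - b * magnitude)

# the ratio between successive magnitudes is always the same factor of 10,
# whether we compare magnitudes 3 and 4 or magnitudes 7 and 8
ratios = [expected_quakes(m) / expected_quakes(m + 1) for m in range(3, 8)]
```

A distribution with a typical scale (like human heights) would have a bump at some preferred value; this one never does.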

Once again, if you knew all the initial conditions and a full description of the earthquake phenomenon, you could in principle predict everything about them. The practical randomness comes once

more from the practical impossibility of gathering the necessary

information. Will we never find true randomness?

FALSE RANDOMNESS

In the back of your mind, a kind of discomfort should be growing right now.

If one cannot have a truly random physical system, how can we trust that things like the numbers of the lottery are really being drawn fairly? Well, if you ignore the fact that people always find a way to cheat, then as long as the results are random enough, the difference between true and false randomness should not be a big concern in many practical cases.

It is very common in science today to run computer simulations.

These simulations require, most of the time, the generation of random

numbers by a computer. But computers are physical systems whose

behaviour is designed to be predictable. How can they generate random

numbers?

The answer is that computers cannot generate true random

numbers by their very nature. What they do is to generate something that

resembles a random sequence of numbers. These are then appropriately


called pseudorandom numbers. How? Well, that is a good question. You

need to be very ingenious to do that.

If you remember our discussion of chaotic systems, you must now be thinking that using them could be a good idea. However, if you look again at the graphs of the logistic map, you will see that there is some limited predictability in the way that the amplitude of the oscillations increases and decreases almost periodically. In scientific applications, one

would like to have something even more unpredictable than that.

Creating a computer algorithm that generates pseudorandom

numbers is as much an art as it is a science. There are many algorithms today, each one with its own disadvantages. Still, for most applications,

we actually have very good generators, meaning that any attempt to find

any kind of pattern in the generated sequence usually fails (Press, 1992).

There are catches though. Random numbers are generated much

like the logistic map in the sense that you need to start the generator with

some initial value. This value is usually called a seed. If you start a

pseudorandom generator with the same seed twice, you will get exactly

the same sequence twice. Secondly, most current generators are not even chaotic: they repeat themselves after some time. This time is called the period, but it is usually designed to be so large that for all practical

purposes it does not really matter.
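Both catches are easy to demonstrate with a toy generator. The sketch below is a linear congruential generator, one of the oldest pseudorandom recipes, with deliberately tiny constants of my own choosing so that its period becomes visible (real generators use huge ones):

```python
def lcg(seed, a=5, c=1, m=16):
    # linear congruential generator: x -> (a*x + c) mod m
    x = seed % m
    while True:
        x = (a * x + c) % m
        yield x

# same seed, same sequence -- every single time
g1, g2 = lcg(7), lcg(7)
same = [next(g1) for _ in range(6)] == [next(g2) for _ in range(6)]

# with these tiny constants the period is only 16: after 16 numbers
# the sequence starts all over again, exactly
g = lcg(0)
sequence = [next(g) for _ in range(32)]
```

Here the second half of `sequence` repeats the first half digit for digit; a modern generator behaves the same way in principle, but its period is astronomically long.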

Pseudorandom numbers are as close as we can get to generating randomness with our present computers. In fact, if the universe were governed by the equations of classical physics, that would be the best we could do. There would be no true randomness. Fortunately, nature is much more interesting and tricky.


COMPRESSING ISSUES

We are getting there, but before we find the holy grail of true randomness,

let us give its characterisation a bit more thought. Although practical predictability seems a good way to characterise randomness, we have actually been overlooking a lot of details.

First of all, when we associated randomness with predictability,

this implied that we are always analysing a sequence of numbers in time.

According to this, there is no sense in asking a question like: is the number 9 random? A question that does not make sense within some framework is called, in mathematics, an ill-posed question. This means that there is no correct way to answer it simply because it does not fit the scheme we have at hand.

The fact is that there is no definitive answer to the question of the

absolute randomness of something, but there are methods that can be

applied as long as one accepts some desired characteristics that may even

vary according to the situation. We can say that, in some sense, predictability is the best guiding criterion, but we need a bit of sophistication to generalise it and enlarge its scope of application.

Consider the following two numbers:

111111111 and 498762839

As numbers, they are no more random than the number 9 or the number 100, but one cannot avoid, by looking at them, the feeling that the second one should be considered more random than the first. We could argue that we are, inadvertently, treating them as sequences of digits in time and attaching, once again, the idea of predictability of the next digit. Well, that is exactly what it is. But suppose we write the same numbers above as


111 498

111 762

111 839

The second group still looks more random than the first, even though we are now looking at a two-dimensional array instead of a sequence in time. The fact is, there is still predictability in the sense that if one of the digits is erased, we can guess the missing digit much better in the first group than in the second.

This seems to indicate that we might be able to make sense of the

concept of a predictable object somehow. The key is to associate

predictability with a repeating pattern.

Consider, instead of numbers, the following sequence of symbols

ABABABABAB

It does not look very random, right? In fact, you can see that it is only the two symbols A and B repeated 5 times in sequence. It is highly predictable. If we wanted to, we could write this sequence as

5 × AB

with an obvious meaning and a visible economy of characters. What we did

above can be seen as compressing the initial sequence. As a general rule,

whenever we can find regularities in a sequence or any other kind of

object, we can use them to write a description that is itself smaller than the

original object. In other words, regularities allow for compression.
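General-purpose compressors exploit exactly this. A quick experiment with zlib, Python's standard compression library (the exact byte counts vary a little between versions, so take them as indicative):

```python
import os
import zlib

regular = b"AB" * 500            # 1000 bytes with an obvious pattern
random_ = os.urandom(1000)       # 1000 bytes of (near-)random data

print(len(zlib.compress(regular)))   # a handful of bytes
print(len(zlib.compress(random_)))   # about 1000 bytes: no pattern to exploit
```

The patterned sequence shrinks to almost nothing; the random one refuses to shrink at all, which is precisely the behaviour we will use to characterise randomness.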

The idea is that, in principle, random objects should have no

regularities. That is because regularity brings predictability. This

predictability does not need to be in time, but can also be in space. As

another example, think about a circle. There are not many things more

regular than a circle. If I tell you that I am going to draw a circle and show you only a part of its arc,


you can easily complete it to get the full circle. You can predict the rest of

the picture. How can you compress a circle? Easily. Every circle can be

described by a symbol and a number, namely, the circle's radius.

Note that the circle is something extremely symmetric. Is that a

general rule? Yes. The more symmetric something is, the more

compressible is its description. Symmetry appears here once again, but this

time the role is different. In our previous encounter, symmetry was

responsible for our lack of reason to choose one result over another,

generating randomness. Here, it is the opposite. Symmetry decreases

randomness by increasing predictability. This shows that you must be very careful when applying some concepts and think deeply about how they fit in each different situation.

The logical consequence of the ideas above is then the proposal that a truly random object should be one whose description cannot be

compressed. In other words, the smallest description of a random object

should be the object itself. Can we measure that more precisely? The

answer is an almost yes.

During the 60s two great mathematicians introduced the idea of

measuring the complexity of an object by measuring the size of the

smallest computer program that could generate that object. One of them was Ray Solomonoff, from the USA, and the other was the Russian mathematician Andrey Kolmogorov, about whom we will hear more later. Although Solomonoff published first, in 1960, while Kolmogorov published his results in 1965, the measure ended up being known as Kolmogorov Complexity (Li, 1997), or KC for short.


The choice of a computer program should not be seen as

something too fundamental here. It is just a way of characterising

mathematically a description. It has been proved that the exact language in

which the program is written is not important, which means that we can

also use any normal language to describe our objects and the result will be

equivalent.

Finally, the idea of a truly random object is then equivalent to an object whose KC is that of a program composed simply of the instruction 'Print the object ...'. In other words, no program that generates the object can be any smaller than the object's own description.

KC is a concept which is very powerful and that leads to many

rigorous results, but as it happens with anything else when we try to study

matters concerning complexity, computation and randomness, there is a

catch and, in this case, a big one. It can be proved that KC is what we call

incomputable. This means that there is no program capable of calculating

KC for any object fed to it. However, we will see soon that there is a

concept which is very close to KC that comes to the rescue and that will be enough for us. It comes from an unexpected place, thermodynamics, and we will call it by the name entropy. But we will only be able to understand it if we learn a bit more about probability, so you will have to be a bit more patient.

PATTERN RECOGNITION... OR NOT

It seems that we made some progress, right? The more we can compress

an object, the less random it is. Therefore, a completely random object is

one that follows no pattern at all. That should be easy to recognise, right?

Err... not really.

There is a nice discussion about our perception powers and how

our senses help us to recognise patterns in objects in Stephen Wolfram's book about cellular automata (Wolfram, 2001). A cellular automaton is a


mathematical structure composed of a grid of cells, each of which can assume one of many colours. The simplest case is that of two colours. The cells change colours at each time step by looking at the colours of their neighbouring cells and following a certain rule. An example is John Conway's

game of life, whose rule is that a black cell, considered as being alive,

remains black if two or three of all its eight neighbours are also black and

changes to white, or dead, otherwise. In addition, white cells change to

black if they have exactly three black neighbours.
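Conway's rule fits comfortably in a few lines of code. Here is a sketch that represents the grid as a set of live-cell coordinates and applies one time step; the "blinker", a bar of three live cells, is a classic pattern that flips between horizontal and vertical forever:

```python
from itertools import product

def life_step(alive):
    # count, for every cell, how many of its eight neighbours are alive
    counts = {}
    for (row, col) in alive:
        for dr, dc in product((-1, 0, 1), repeat=2):
            if (dr, dc) != (0, 0):
                cell = (row + dr, col + dc)
                counts[cell] = counts.get(cell, 0) + 1
    # birth with exactly 3 live neighbours; survival with 2 or 3
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in alive)}

blinker = {(1, 0), (1, 1), (1, 2)}   # a horizontal bar of three live cells
# one step turns the bar vertical; a second step restores it
```

Rules this short already produce the whole zoo of patterns discussed below, which is the point of the example.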

Some cellular automata rules generate a whole universe of

different patterns that can vary with the initial configuration of the cells.

Some patterns are boring and repetitive, while others look completely

random. Of course, because the rules are very well defined, this

randomness cannot be true randomness. The fact is that, for many of these

produced patterns, it is extremely difficult, if not impossible, to find the rule that generated them from the pattern itself.

Wolfram's book once again has a thorough discussion of this

subject and many pictures showing that, even if you use very sophisticated

algorithms to try to compress the produced patterns, you end up with

descriptions that are many times larger than the pattern itself. As we have

discussed in the previous section, this is a strong indication of randomness,

the true one!

You might think that this is a fancy example. After all, even in the

case of the logistic map in its chaotic regime we can somehow identify

some regularity. The equation that defines the logistic map is a

compression of the data we saw in the generated graphs. Those graphs

have some similarity between them and, therefore, maybe it is possible in simpler cases to always find the correct compression.

Look then at the following two graphs. What is the difference

between them?


Not much, right? They both seem very similar. Both are sequences of integers from 0 to 9. Unlike the graphs of the logistic map, they look very random. In fact, you can run on these sequences pretty much every algorithm that tries to find regularity in data and nothing will come out. Neither, however, is really random. The second sequence is a pseudorandom sequence generated with a computer program. It is no surprise that it looks random, as it is designed to be that way.


The first graph is, however, something of a shock. It is actually the

sequence of digits in the following well known number:

3.1415926535897932384626433832795028841971693

993751058209749445923078164062862089986280348

25342117068

Can you recognise it? Yes, that is nothing more than good old π. Its infinite sequence of digits can be calculated by many different finite computer programs and can be compressed into the very short description 'the ratio of the circumference of a circle to its diameter'. Even with the most sophisticated automated methods of pattern recognition, one would not be able to find this amazing description of that sequence.
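To make the point concrete: that "random-looking" digit stream really can be produced by a program only a few lines long. The snippet below is the well-known unbounded spigot algorithm for the digits of π (after Jeremy Gibbons); in a very real sense, the whole infinite sequence is compressed into it:

```python
def pi_digits():
    # unbounded spigot algorithm for the decimal digits of pi
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4*q + r - t < n*t:
            yield n
            q, r, t, k, n, l = (10*q, 10*(r - n*t), t, k,
                                (10*(3*q + r)) // t - 10*n, l)
        else:
            q, r, t, k, n, l = (q*k, (2*q + r)*l, t*l, k + 1,
                                (q*(7*k + 2) + r*l) // (t*l), l + 2)

gen = pi_digits()
digits = "".join(str(next(gen)) for _ in range(20))   # "31415926535897932384"
```

A dozen lines that print forever: the Kolmogorov Complexity of π's digit sequence is tiny, no matter how patternless it looks.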

Here comes a SPOILER ALERT!

If you haven't read Contact by Carl Sagan (Sagan, 1985) yet, be aware that I

will be talking about something that happens literally at the end of the

book.

Be warned. At the very end of the novel, the main character, an

astronomer called Ellie, is running a program to find a message that has been left hidden in the digits of π by the supposed 'designers of the universe'. The program suddenly spills out a sequence of 1s that somehow forms the picture of a circle.

Now, you might be truly amazed to know that Carl is right: there is indeed such a sequence hidden in the digits of π! That exact sequence, by the way! Have I just changed your life? Before you start tweeting about this amazing news, let us review a couple of facts about that number.

We have seen that π is a number which is a result of dividing the length of a circle by its diameter. In flat Euclidean space, which is the one


obeying the geometric properties you learned in school, this works for every circle. It does not work for curved spaces like those present in General Relativity, for instance, which you might be tempted to interpret as saying that π somehow contains information about the flatness of space around you.

But π is a very interesting number in many other aspects too. One of them is the fact that it is an irrational number. This means that there is no way to write π as a fraction, or ratio, between two integer numbers. I once received a paper from someone who claimed that he had found the true value of π and provided a fraction. When you are a scientist, sometimes you have to deal with those people.

A consequence of π's irrationality is that its decimal digits cannot (ever, never) be periodic. A periodic sequence is one that repeats itself after a certain point, like the following ones

11111111111...

121212121212...

123567123567123567...

The three dots at the end mean that these sequences repeat

forever in the obvious way (I call the last one the 'Mambo Sequence', by

the way).

The first sequence has period 1, the second has period 2 and the Mambo Sequence has period 6. The period is then, clearly, the number of

digits that are repeating. A rational number, one that can be written as a

fraction between two integers, always finishes with a periodic sequence. It

can take a while to reach that sequence, but it is always there. The


converse is also true: if the decimal expansion of a number becomes periodic at some point, then the number is rational. For instance:

1.234566666666....

where the last digit '6' repeats forever is a rational number. In irrational numbers, like π, this never happens, and that is the main reason why the sequence of its digits looks random.
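This periodic tail is easy to see by carrying out the long division yourself. Below is a short, illustrative Python sketch (my own, not from the book) that generates the decimal digits of a fraction by repeatedly multiplying the remainder by ten:

```python
def decimal_digits(num, den, n):
    """First n decimal digits of num/den (with 0 <= num < den), by long division."""
    digits = []
    remainder = num % den
    for _ in range(n):
        remainder *= 10
        digits.append(remainder // den)  # next digit of the expansion
        remainder %= den                 # this remainder feeds the next step
    return digits

# 1/7 = 0.142857142857... repeats with period 6
print(decimal_digits(1, 7, 12))  # [1, 4, 2, 8, 5, 7, 1, 4, 2, 8, 5, 7]
```

Because there are only `den` possible remainders, some remainder must eventually repeat, and from that point on the digits cycle: that is exactly why every rational expansion ends in a periodic block.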

What does this have to do with Sagan's Contact? We are getting

there. Another detail about irrational numbers is that the number of

decimal digits in their expansions is infinite. That is because any number

whose sequence of decimal digits is finite IS a rational number. All you

need to do to find its representation as a fraction is to multiply it by an

appropriate power of ten until it becomes an integer. The number is then

that integer divided by the power of ten. For instance, 1.23 × 100 = 123, which means that 1.23 = 123/100, a ratio between two integers.
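The power-of-ten trick can be checked with Python's standard `fractions` module (the example number 1.625 is my own, chosen because it is represented exactly in floating point):

```python
from fractions import Fraction

x = 1.625                    # any number with finitely many decimal digits
as_int = x * 10**3           # multiply by a power of ten until it is an integer
print(as_int)                # 1625.0
print(Fraction(int(as_int), 10**3))  # 13/8: the number as a ratio of two integers
```

`Fraction` automatically reduces 1625/1000 to lowest terms, but any finite decimal gives some integer over a power of ten, which is all the argument needs.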

Many of you must know Jorge Luis Borges' story The Library of

Babel. In it, Borges imagines a library containing books in which every

combination of the letters of the alphabet is present, in a random order.

This means that, if you only look at the books with say 400 pages, the

library contains all stories and all scientific books that have ever been

written, or that will one day be written, as long as they fit in 400 pages!

Even things that have not been discovered yet! Even stories that nobody

wrote yet, but that one day someone will! In fact, because the library is

infinite, it contains all books that have ever been written or that will ever

be.

Although Borges' library is fictional, it illustrates a truly amazing

property of the infinite. When you put together infinity and randomness,


you get something even more amazing. It can be proved that in an infinite

random sequence, ANY finite sequence of characters appears an infinite

number of times! Now, the punchline:

Every finite sequence of numbers appears an infinite number of times in the sequence of decimal digits of π.

(A word of caution: for π specifically this is still a conjecture. It would follow if π is what mathematicians call a normal number, which is widely believed but not yet proven. Let us assume it here.)

Think about this. In the same way as you can encode computer files in binary form, you can also encode any information in decimal form. If you doubt it, just write down the binary representation of any file. That is an integer number. Write that integer in decimal base and there you have it. This means that every text that has ever been written or that will ever be written can be found somewhere in the sequence of decimal digits of π! An infinite number of times! This means that, whatever Sagan's character found, it was not a message from another race, but simply the result of good old randomness.
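Both halves of this argument can be played with in a few lines of Python. The sketch below is my own illustration (the sample text and sizes are arbitrary): it first encodes a text as a string of decimal digits via its bytes, then searches a long stream of random digits for a short pattern. Short patterns show up almost immediately; a whole encoded book would need an unimaginably longer stream:

```python
import random

def text_to_decimal(text):
    """Encode a text as a decimal digit string via its bytes."""
    return str(int.from_bytes(text.encode("utf-8"), "big"))

print(text_to_decimal("Hi"))  # 18537: "Hi" is just the number 0x4869 in decimal

# In a long stream of random digits, any short pattern shows up.
random.seed(42)  # fixed seed so the run is reproducible
stream = "".join(random.choice("0123456789") for _ in range(100_000))
print(stream.find("123"))  # index of the first occurrence of '123' in the stream
```

A three-digit pattern is expected roughly once every thousand digits, so finding it in 100,000 random digits is a near certainty; the expected waiting distance grows by a factor of ten for every extra digit in the pattern.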

If you are worried that there is so much information hidden in π, or maybe trying to devise a plan to extract future information from it like in the Bible Code, be aware that this is futile. Because the digits are random, there is no way to know beforehand where the information is, or even which pieces of information are correct, because the same information appears with all possible mistakes!

The bottom line: randomness is elusive, get over it.

TRUE RANDOMNESS

I have probably convinced you that, even if true randomness exists, we would not recognise it if we had our faces glued to it. But nature was kind to us and, at least apparently, has allowed us to look into the very face of randomness, the true and clean one. We call it quantum mechanics.


There is nothing more fundamental and mysterious than this. Born at the beginning of the 20th century, quantum mechanics is the most fundamental description of physical phenomena that we have, and it says that, even if we could gather all possible information about a system with the maximum allowed precision, we would only be able to predict probabilities for the outcomes of an experiment, not the actual value of the outcome with 100% certainty.

Quantum mechanical randomness is a very special kind of

randomness which is not a result of a lack of information, but a basic

characteristic of some aspects of nature itself. If something deserves to be

called true randomness, this is it. But because quantum mechanics does

require some familiarity with probabilities, we will leave its discussion to a

later point in this book. Right now, we will come back to more mundane

subjects... we will be back to Vegas!

BACK TO VEGAS

From our journey to the core of randomness, we have learned that there are countless occasions in life in which we will not be able to gather enough information to predict a result. But can't we at least give our best guess?

Of course we can, and we usually do. Randomness, be it true or false, does not prevent us from saying something about those systems. But we do not just guess, we make an educated guess, as we physicists like to call it. Even without realising it, that is what we always do. We look at the roulette wheel and satisfy ourselves that it seems to be constructed in a way that gives the ball the same chance of stopping in any of its holes. Symmetry, remember? That allows us to infer that all of them should have the same probability. Then, we choose a number and bet on it, hoping that the probabilities are indeed equal.


Probabilities are our main tool to tame randomness in all its facets. Now that we have a better understanding of how randomness appears in our experiments, we need a language to deal with it. That language is provided by probability theory, and we now have to understand how probabilities can be evaluated.

We have touched on this point already when we talked about coins and dice. There, where everything was nicely symmetrical, the discussion was almost straightforward. We have also seen that in more complicated situations, one might say that there are at least two different answers. They were subtly different, the difference being between using the classical method and the frequency method to calculate these so-called probabilities.

When we used the classical method, equal probabilities meant that there were no reasons to expect any result in particular, and we did not have to actually perform the random experiment to infer that. The idea was a direct consequence of the symmetry arguments. The classical method is a sort of static answer: you calculate it once, you have your information and you are done.

The frequency method, on the other hand, can be seen as a dynamic answer to deal with randomness. It is based on the repetition of an experiment. If all numbers have the same probability, by repeating the random experiment many times, the Law of Large Numbers tells us that each result will appear approximately the same number of times. If the probabilities are different, those differences will be reflected in the observed frequencies as long as we repeat the experiment enough times.
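The frequency method can be watched in action with a quick simulation. The sketch below is my own illustration, not the author's: roll a simulated fair d6 many times and the observed frequencies crowd around 1/6 ≈ 0.167, just as the Law of Large Numbers promises:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the rolls are reproducible
rolls = [random.randint(1, 6) for _ in range(60_000)]
counts = Counter(rolls)

for face in range(1, 7):
    frequency = counts[face] / len(rolls)
    print(face, round(frequency, 3))  # each frequency lands close to 1/6
```

With more rolls, the spread of the frequencies around 1/6 shrinks; with only a handful of rolls, it can be wildly off, which is why the method is inherently dynamic.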

There is a raging battle here. At its kernel is the dispute concerning which one should be the correct method. The classical method works by calculating probabilities as properties of the information we have about a system, while the frequency method clearly gives a property of the system itself. Because of this, many mathematicians call probabilities calculated by the frequency method objective probabilities, while those calculated


from the information one has about the system are frequently called subjective probabilities.

If you believe in a physical reality out there, you might be tempted to say that the real probability should be the one given by the frequencies. As we have learned in this chapter, however, unless you are dealing with quantum mechanics, you will never truly have objective probabilities. We saw that in all other cases, randomness is actually related to our lack of information about something, leading to a lack of predictability. Does that mean that frequencies are the wrong answer? The actual answer is more subtle than that.

One of our goals will be to understand that there is no need to choose one of the options above to the detriment of the other. In order to do that, we need to acquire a deeper understanding of how probability is measured in general. We have to make sense of all those fractions and percentages in a way that they stop being mere numbers and start to acquire more meaningful visualisations inside our minds. There are different ways in which this can be done. The path we will choose here is via Cox's Postulates, and we will follow it in the next chapter.


4.

The Logic Behind

COX'S POSTULATES

In his Philosophical Essay on Probabilities (Laplace, 1825), Laplace wrote

ONE SEES IN THIS ESSAY THAT THE THEORY OF PROBABILITIES

IS BASICALLY ONLY COMMON SENSE REDUCED TO A CALCULUS.

The great insight of Laplace was that probability is an extension of logic: logic adapted to situations in which you cannot be completely sure of something. This should have the consequence that, somehow, all the rules that govern probabilities should reflect logical principles. When things are certain, probability should reduce to deductive logic, the kind of logic that states if this, then that, without room for doubt.

Let us go back and analyse once again our d10. We have seen that the probability of any side was 1/10 because this number was supposed to represent the symmetry of the whole situation. Therefore, there must be a path connecting rules of common sense and non-negative real numbers between 0 and 1. These rules should also say how to combine those numbers. That is because, for instance, we want to say that if face 1 and face 2 both have probability 1/10, the chance of getting either 1 or 2 in our dice roll should be 1/10 + 1/10 = 2/10.

This connection is made by a set of 3 (yes, only three) simple requirements that are called Cox's Postulates, sometimes also called


Cox's Axioms. As with any set of postulates, they are assumed to be true from the start. That is not cheating at all. We want probabilities to reflect something, so we need to tell the mathematics what this something is. Once we have defined it, it can be proved that probabilities can be represented by real numbers just like those we are using for the d10. The beauty of these postulates is that they can actually be fully reproduced here, as their essence is so clear that no technical knowledge is needed. They are

C1. Probabilities are real numbers representing degrees of plausibility.

C2. These real numbers should follow all properties required by common

sense.

C3. The numbers should be consistent.

They look like pretty sensible requirements, don't you agree? We will go through them one by one. The first postulate, to which we will refer as C1, expresses the most fundamental idea behind all we are going to do. In plain English, it says that it is possible to quantify the plausibility of anything happening. This plausibility is, of course, what we will end up calling probability.

Postulate C1 alone, however, is still not enough to fix the range of numbers inside which probabilities will be defined. Even if we decide to define the number 1 as describing the certain event (that which has a 100% chance of occurring), C1 does not force us to attribute the value 0 to the impossible event, although that would be a very convenient and fairly intuitive choice. I have to agree that C1 has a sort of philosophical feeling attached to it. It is more like the statement of a goal. But there is one thing we can extract from it: the idea that, to propositions with different levels of plausibility, we should associate different real numbers.

C1 is very important as a starting point, but let us analyse its limitations a little more. Consider two propositions to which we would like to assign some degree of plausibility. I put the word proposition in boldface


to draw your attention to something extremely important. By targeting propositions, we are now able to assign values of plausibility not only to random physical experiments, but to any kind of sentence you can formulate in any language, as long as it has some meaning, of course. Examples of propositions are A = 'It is going to rain tomorrow' and B = 'I will spot a dodo today'. Notice that I am labelling my propositions by the letters A and B. This is just a trick that helps us avoid rewriting the whole sentence again and again while we analyse it. What postulate C1 suggests in this case is that we can attribute two real numbers, dop(A) and dop(B) (dop = degree of plausibility), which respectively give the plausibility of propositions A and B.

Because of the name we chose for these numbers (again, degree of plausibility) and knowing that B is obviously less plausible than A (even if you live in a desert), we might be tempted to require these numbers to satisfy

dop(B) < dop(A).

Although the above equation looks pleasant to the eye, it is worth reminding yourself that there is quite an ordinary counterexample to this kind of ordering. Think about the property of coldness. The coldness of an object can easily be measured by measuring its temperature. When one says that an object with temperature T1 is colder than another with temperature T2, one means that T1 is smaller than T2: a larger coldness means a smaller temperature. Because we are used to it, we are not confused by that terminology. The other reason this does not sound strange to us is that we do not use the term 'degree of coldness' to describe temperature, although we could even get used to that if we had spent a long time applying that name.

It sounds very strange, but would be totally acceptable, to assign a smaller number to propositions which are more plausible, although it would then be more sensible to attach the name 'degree of implausibility' to that


number. We, for instance, feel much more comfortable saying that temperature measures how warm something is than saying that it measures how cold it is, even though both assertions provide the same information. Therefore, keep in mind that the order given in the above equation for the dops is not necessarily implied by the postulates, but is conveniently chosen simply because of the name we chose for that quantity.

But the order is not the only thing not implied by C1. The usual 0-to-1 interval that I have mentioned so many times, written in mathematical notation as [0,1], is nothing more than a convention. And we will see in a while that, even putting all three postulates together, it is still going to be a convention. We are used to considering that an impossible proposition corresponds to the value zero and an absolutely sure proposition to the value one. If you are a mathematically inclined reader, you can find a detailed analysis of Cox's postulates in Jaynes' book (Jaynes, 2003). There it is shown that we could just as well work on the interval from 1 to infinity, which I will prefer to write with the shortcut notation [1, ∞], by assigning the impossible event to the infinite value, the certain event to the value 1 once again, and using the 'degree of implausibility' way of thinking.

Mathematicians would be shouting at me in anger right now. The reason is my interval [1, ∞]. In general, the notation for intervals in mathematics states that we use the symbols [ or ] to indicate that the points they are closest to are included in the interval. We can instead choose to exclude one or both of those endpoints and keep the rest, which is indicated with round brackets. For instance, the interval (0,1] is almost the same as the interval [0,1], but does not contain the zero, and the interval (0,1) contains neither the zero nor the one. In rigorous mathematics, infinity is not an actual number, but a kind of limit. Therefore, as a limit, it is never reached and is never actually included inside the interval. This means that the rigorously correct way to write the infinite interval starting at 1 would be [1, ∞). I will use the excuse that I am doing this because I


want infinity to be a possible value, not only a limit. I hope I can be

forgiven.

If you are not a mathematician, you might be suspicious about a different issue. How can I change the interval from [0,1] to [1, ∞] and say that they are equivalent descriptions if the interval from 1 to ∞ is clearly larger than the one between 0 and 1? Well, it all depends on what you mean by the word 'larger'. If you measure with a ruler, then it is, but what is important for them to be equivalent for our purposes is the fact that both have exactly the same quantity of numbers.

It looks like I am cheating, because it is not very difficult to understand that both contain an infinite number of real numbers, but the issue runs deeper than that. Although there are infinities of different sizes, these two are the same. The trick is to understand that we can associate to each number in one interval a unique number in the other. To see this, call the numbers in the interval from 1 to ∞ the degrees of implausibility, or doi, of the propositions. To check that the number of doi points is the same as the number of dop points, we associate them by the following formula

doi = 1/dop.

To each value of dop, this formula associates exactly one value of doi. In addition, any value of doi has its associated dop and vice versa. There are no unmatched points in the two intervals! Therefore, they have the same number of points. The number of points in both these intervals is different, for instance, from the number of points in the set of natural numbers (the numbers 1, 2, 3, 4 and so on). We will need this distinction later and we will discuss it when the time comes, so do not be too worried about it right now.
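This pairing can be made concrete in code. One map that does the job is the reciprocal, doi = 1/dop (my choice of formula, consistent with sending the certain event to 1 and the impossible one towards infinity):

```python
def doi(dop):
    """Degree of implausibility on [1, ∞) from a degree of plausibility in (0, 1]."""
    return 1.0 / dop

def dop(doi_value):
    """The inverse map: back from implausibility to plausibility."""
    return 1.0 / doi_value

print(doi(1.0))        # 1.0: the certain event maps to implausibility 1
print(doi(0.5))        # 2.0: less plausible means a larger implausibility
print(dop(doi(0.25)))  # 0.25: the round trip matches, no points are lost
```

The impossible event itself (dop = 0) is the one point the reciprocal cannot handle numerically, which mirrors the fact that ∞ is a limit rather than an ordinary number.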


A BIT OF LOGIC

We have extracted all we possibly could from C1 without delving

into more serious mathematical calculations. We can now safely proceed

to C2, which is probably the subtlest of all three postulates, as we all know

from our daily life how difficult it is to agree on what common sense really

means.

I will adopt a very pragmatic view concerning what should be

understood by the term common sense. We are going to assume that this

should be equivalent to the requirement that the numbers we assign to our

degrees of plausibility, based on C1, should obey all sensible rules of logical

induction.

The difference between logical induction and logical deduction is

that, in a sense, the former is a relaxed version of the latter. Deduction is

concerned with proving assertions, while induction is concerned with

increasing or decreasing the likelihood of an assertion being true without

requiring definitive proof. The choice of induction in what we are doing is

then obvious. Apart from this, though, most of the content will be the

same for both.

Defining common sense as logical induction sounds a bit abstract,

but we will see that, although there is always a bit of required abstraction

when you are dealing with matters of logic, the rules of induction are indeed the

ones we apply all the time without paying much attention. Every time we

want to draw conclusions from some piece of information, we are

invariably using logical induction. If you keep that in mind as we go along,

you will be able to find many examples of daily decisions in which those

rules are being applied.

As we have discussed before, instead of using plain words all the

time, it becomes convenient to introduce a bit of mathematical notation at

this point. Try to remember what we said about mathematical symbols and do not be scared. They are only shortcuts for words or whole


sentences. You can even think about them as Chinese characters, only

simpler to draw, although with longer meanings.

First of all, let us agree once and for all to use the degree-of-plausibility point of view and, consequently, our numbers will always be between 0 and 1. I will then introduce a very subtle modification in the way we interpret the symbol dop(A). We will define it as an abbreviation for the sentence 'the degree of plausibility for proposition A to be true'. This seemingly innocent change of wording is actually very important, because it is here that we open the doors for logic to enter the stage.

This modification now requires that the proposition A can only assume two truth values: true and false. Anything that can assume two different values can actually be thought of as something that can be either true or false. But how then can we attribute plausibilities to propositions that can assume more than two values? What about our dice, for instance? The number of possible results of a dice roll is equal to the number of faces of the dice, which can obviously be larger than only two.

Although at first sight this seems to present a difficulty, there is no real problem. We are going to talk much more about how this works in the next chapter, but to ease your mind, think about this. If we roll a d6, the result can be broken into 6 propositions which can only have true/false values. They are the propositions 'the result was 1', 'the result was 2' and so on. We can always do that for any experiment. We simply list the possible results and consider each one as a binary proposition, i.e., a proposition that can assume two values.
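The splitting of a many-valued result into binary propositions can be sketched in a few lines (my own illustration; the function name is made up):

```python
def as_binary_propositions(result):
    """Turn a single d6 result into six true/false propositions."""
    return {face: (result == face) for face in range(1, 7)}

propositions = as_binary_propositions(4)
print(propositions[4])             # True: "the result was 4"
print(propositions[1])             # False: "the result was 1"
print(sum(propositions.values()))  # 1: exactly one proposition is true per roll
```

Notice that the six propositions are not independent: exactly one of them is true for any given roll, which is precisely the structure the plausibilities will have to respect.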

The word binary might ring some bells. Computers work using binary numbers, right? Binary numbers are numbers that contain only two digits, usually chosen to be 0 and 1. A variable that can assume two values can alternatively be called a bit, although the more correct way of putting it is that it carries a bit of information. We will come back to that at some point, as information is one of the things we will be concerned with in this book.


There are many reasons why a computer uses a binary representation for numbers. It turns out that this is enough to deal with logic. The connection is obvious: we associate the value 0 with the truth value FALSE and the value 1 with the truth value TRUE. This is actually a very convenient (and abbreviated) way of symbolising the truth value of our propositions.

Some caution with notation is required now, as logic is prone to a phenomenon called abuse of notation. This happens all the time in science and is a pain in the neck for those who are starting to learn a subject. Once you get used to some mathematical notation, you start to use it in an even more abbreviated way which, most of the time, turns out to be rigorously wrong. However, because you are experienced and know what you are doing, you get away with it. This happens a lot with propositions.

Remember that our proposition A is a sentence. We can attach a degree of plausibility to it, and now we have learned that we can also attach to it a binary number named its truth value. I will use the notation v(A) = 1 (v for 'value') to indicate that the truth value of A is 1 (TRUE), with the obvious corresponding notation when its value is 0 (FALSE). The danger is that, many times, you will see people writing, instead,

A = 1.

Strictly speaking, that is wrong. It is true that one can work with plain numbers instead of propositions in mathematics, and then this would be correct, but in our case that would be an abuse of notation and I will try to avoid it as much as I can.

One thing we can do with binary propositions is to modify them. For instance, we can negate them. We use the symbol ¬A (a negation sign in front of the letter corresponding to our original proposition) to indicate the negation of


proposition A, and we call it, quite obviously, 'not A'. There are actually many symbols for this and each book will use its favourite one. Books on logic will many times use the notation ~A, or even !A, which will be familiar to programmers of C or Java. If you think about A as a sentence like 'it is going to rain tomorrow', then ¬A is the sentence 'it is not going to rain tomorrow'.

The word 'not' is an example of what is called a logical operator, and we will many times write it in uppercase letters as NOT in order to emphasise it. In the same way as BAYES, logical operators can be thought of as small computer programs that are fed with one or more propositions and return a single proposition constructed by combining the initial propositions in some way. In other words, an operator performs some modification on the original propositions.

In the case of the NOT operator, it is fed with one proposition and spills out another one. What happens in terms of truth values? Let us see. If the truth value of A is 0, then the truth value of NOT A is 1, and vice versa. This can be summarised by the following truth table

A  NOT A
0    1
1    0

The second column of the table gives the truth value corresponding to the negation of the proposition in the first column. Notice that this table shows truth values, not the propositions themselves.

Common sense becomes relatively easy to apply here. We just require that either A or NOT A must be true, but never both at the same time. Either it is going to rain tomorrow or it is not. There is no other possibility. You can appreciate this by noticing that each line in the truth table has


different numbers in its two columns. There is no repetition within a line. Our degree of plausibility must reflect this feature.

If you know about quantum mechanical superpositions, you must have a smile of superiority in the corner of your mouth right now. Yes, it is true that in quantum theory things can be both TRUE and FALSE at the same time in some very particular sense, but Cox's Postulates deal with classical logic, not any modification of it. Our conscious mind, after all, works in a classical world doing classical logic, even if you are a physicist working with quantum theory. Still, even when dealing with quantum mechanics, Cox's Postulates, and in fact everything else we will learn in this book, will continue to work when correctly interpreted.

The NOT operator is not the only one we use when dealing with

propositions. Another thing we usually do is to combine two or more

propositions into one. One way we can do that is by using the AND

operator.

Consider two propositions A and B. We use the symbol A AND B to indicate a proposition which is true only if both propositions A and B are true at the same time, and we read this symbol in the obvious way as 'A and B'. The meaning of the 'and' is the same as in usual language. Just think about it. If we say that such and such is true, we mean that both things are true. It is with this form of combining two propositions that we associate the AND operator, which once again can be interpreted as a program that takes two propositions A and B and returns a single one called A AND B.

In the same way as the NOT operator could be represented as a truth table, AND also has its own, given by

A  B  A AND B
0  0     0
0  1     0
1  0     0
1  1     1


This table summarises what happens to the truth values when we use the AND operator. The last column holds the result of A AND B, which is only 1 (TRUE) when both propositions are 1 at the same time.

In a similar fashion, we can introduce the OR operator, with the obvious interpretation in terms of usual language. The symbol we will use for the combined proposition is A OR B, and it corresponds to either only A being true, or only B being true, or both being true at the same time. That is how the word 'or' usually works in any language, if you think about it. The corresponding truth table is

A  B  A OR B
0  0    0
0  1    1
1  0    1
1  1    1

These three logical operators are enough to represent all logical combinations and manipulations of truth values. If you think in terms of truth tables, it is obvious that we have not yet exhausted all possibilities. For instance, there could be a truth table like this

A  B  result
0  0    0
0  1    1
1  0    1
1  1    0


But this particular case can be represented as (A OR B) AND NOT (A AND B), and you are invited to check all the possibilities to see that, indeed, the above table corresponds to this expression. Because of this, these three operations

become extremely important in computer science and electronics. In computer chips, those operators are implemented directly at the hardware level. They receive the alternative name of logic gates, and each of the AND, OR and NOT gates is depicted by its own standard circuit drawing. The incoming wires (the lines at the left of each drawing) represent the bits that will be operated on. The outgoing wire carries the result of the operation. They are represented as wires because that is what they are (or the equivalent) in real circuits. The truth value 0 represents a physical situation in which no current is passing through the corresponding wire, while the truth value 1 represents one in which a current is.
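A few lines of Python can play the role of these gates. The sketch below (my own illustration) writes NOT, AND and OR as tiny functions over 0/1 truth values and evaluates the combination (A OR B) AND NOT (A AND B) for all four input pairs; it comes out as 1 exactly when the inputs differ:

```python
def NOT(a):
    return 1 - a   # flips the truth value

def AND(a, b):
    return a & b   # 1 only when both inputs are 1

def OR(a, b):
    return a | b   # 1 when at least one input is 1

for a in (0, 1):
    for b in (0, 1):
        combined = AND(OR(a, b), NOT(AND(a, b)))
        print(a, b, combined)  # 1 exactly when a and b differ
```

Running through all four rows like this is nothing but building a truth table by brute force, which is always possible because binary propositions have only finitely many cases.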

In fact, it can be shown that one can construct all possible logical operations with only one logical operator! This operator is not unique, but the most commonly used is the NOT AND operator, or NAND. It is just the result of applying first the AND operator to two propositions and then applying the NOT operator to the result. There are many more operators, but I will stop this discussion here, as NOT, AND and OR are not only


enough, but actually the most convenient set for us to work with, as they can be readily translated into daily language.
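The universality claim for NAND can also be checked by machine. Below is a small sketch of my own (using 0/1 truth values as before) that builds NOT, AND and OR out of nothing but NAND:

```python
def NAND(a, b):
    return 1 - (a & b)           # NOT applied to the result of AND

def NOT(a):
    return NAND(a, a)            # NAND of a value with itself flips it

def AND(a, b):
    return NOT(NAND(a, b))       # undo the NOT hidden inside NAND

def OR(a, b):
    return NAND(NOT(a), NOT(b))  # De Morgan: A OR B = NOT(NOT A AND NOT B)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NOT(a))
```

The OR construction relies on De Morgan's law: saying that at least one of A and B is true is the same as denying that both are false.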

Because AND and OR are such common operators, you will find many other notations for them as well. The most common is to depict them, respectively, by the symbols ∧ and ∨.

Let us now see how all of this connects with our degrees of plausibility. Remembering that we have agreed to work in the interval [0,1], we should then require the following properties

dop(A AND NOT A) = 0

and

dop(A OR NOT A) = 1.

If you are put off by the mathematical symbols, simply try to substitute them with words. For instance, the first line above would read

'the degree of plausibility for A and NOT A to be true at the same time is zero'.

If our definitions follow usual logic, this has to be true. As we have already discussed, we are assuming that propositions are either true or false, but not both at the same time. Never. That is why the plausibility of both being true at the same time must be zero. This can be easily visualised in our truth tables. Looking at the NOT truth table, one can see that A and NOT A always have opposite truth values. The only case in which the AND of two propositions is true is when both have the truth value 1, which in this case never happens.

The second property, concerning the OR operator, translates to

'the degree of plausibility for either A or NOT A to be true is one'.

That is simply because either one or the other has to be true. It is the same as saying that A is either true or false. It cannot be neither. These two properties are logical consequences one would require from any


proposition. If something can only be true or false, but not both or neither at the same time, then A AND NOT A can never be true and A OR NOT A always is, since at least one of them must be. This kind of reasoning, amazingly, narrows down the possibilities for our degrees of plausibility enormously. Just to be complete, the above reasoning would also work if we had chosen to work in the interval [1, ∞].

LIARS

You might think that all propositions that make sense should fall into the tendrils of logical analysis. I am not talking about sentences like those appearing in Lewis Carroll's Jabberwocky (Carroll, 1871), which is a poem containing many nonsensical sentences. Some of them, of course, cannot be either TRUE or FALSE. For instance, its stanza

be either TRUE or FALSE. For instance, its stanza

TWAS BRYLLYG, AND YE SLYTHY TOVES

DID GYRE AND GYMBLE IN YE WABE:

ALL MIMSY WERE YE BOROGOVES;

AND YE MOME RATHS OUTGRABE.

does not really mean anything and, therefore, has no defined truth value.

Random ensembles of words are another example as they also have no real

meaning. Amazingly, languages are unconstrained enough to allow

sentences which do have meaning, but at the same time do not have a

defined truth value. Most of them are related to something to which we

will return later in this book, the ability of languages to talk about

themselves.

Consider the following sentence:

THIS SENTENCE IS TRUE.


Notice that it is a sentence talking about itself. Can you decide whether this sentence is true or false? No. It can, in fact, be either. You can attribute whatever truth value you want to the sentence. Still, you would be a bit reluctant to attribute both values and to say that it is both true and false at the same time, right? That is because you are thinking of the sentence as an object, but imagine that you write the sentence twice, in different colours, like this

THIS SENTENCE IS TRUE. (in red)

THIS SENTENCE IS TRUE. (in blue)

You can then say that the red sentence is true while the blue one is false. Although that would make you more comfortable, both sentences are still the same! So we can say that the sentence is true and false at the same time.

But there is an even worse consequence of self-reference, embodied by something called the Epimenides Paradox. Epimenides was a Cretan philosopher who lived around the 6th century BC. In one of his writings, he stated

CRETANS, ALL LIARS!

The paradox did not seem evident either to him or to many people for several centuries. But wait: if the above sentence was stated by a Cretan, is it a lie or not? That is equivalent to asking whether the above proposition is TRUE or FALSE. Let us analyse that.

If all Cretans are indeed liars, meaning that the above sentence is TRUE, then Epimenides, being a Cretan, must be lying and the sentence should be FALSE, which is a contradiction. If the sentence is FALSE and Cretans are not liars, then Epimenides must be telling the truth and the sentence must be TRUE, which is also a contradiction. Then the sentence is neither FALSE nor TRUE!

You might object to such a radical interpretation by saying that liars do not lie all the time and honest people lie once in a while. Fair enough. Consider then this clearer version of the Epimenides Paradox, which is very similar to the sentence that was both true and false at the same time:

THIS SENTENCE IS FALSE.

There is no social interpretation in this sentence. It states that it is

FALSE. Now, if the above sentence is TRUE, then according to itself it must

be FALSE, a contradiction. If it is FALSE, then it must be TRUE, another

contradiction. The above sentence cannot be either FALSE or TRUE. There

you go.
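A brute-force check makes the argument concrete. The Python sketch below (purely an illustration; the function name is made up) tries every possible truth value for each of the two self-referential sentences and keeps only the consistent ones:

```python
# Try every possible truth value for a self-referential sentence.
# "THIS SENTENCE IS TRUE"  asserts its own value: v is consistent iff v == v.
# "THIS SENTENCE IS FALSE" asserts its own negation: consistent iff v == (not v).

def consistent_values(asserts_own_negation):
    """Truth values in {False, True} the sentence can take without contradiction."""
    return [v for v in (False, True)
            if v == ((not v) if asserts_own_negation else v)]

print(consistent_values(False))  # [False, True]: "...IS TRUE" can be either
print(consistent_values(True))   # []: "...IS FALSE" can be neither
```

The first sentence admits both truth values, the second admits none: exactly the two pathologies discussed above.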

The lesson is that some languages, including mathematics, can hide

a few secrets and you should always be careful when assuming things to be

obvious. That is the reason why mathematicians are always concerned

about rigorous proofs of everything.

ENTERS CONSISTENCY

And with this we are done with what we called common sense. We can

finally move on to the last postulate in the set. If we now require C3 also to hold then, apart from the freedom of choosing the interval, our function

acquires virtually all mathematical properties that probabilities as we know

them have!

The postulate C3 is not difficult to understand; it is just the very sensible requirement (at least for most people) that, if we calculate our degrees of plausibility using two different lines of reasoning, as long as C1 and C2 hold and we choose only one of the possible intervals, the final values must be the same. Strictly speaking, the last postulate, C3, can also be seen as a kind of higher-level common sense requirement. After all, how good can a theory be if it gives different answers for exactly the same question just because you changed the way you asked it?

I cannot help stressing here how important postulate C3 really is. By using Cox's Postulates as our foundation to define probabilities, instead of only using frequencies, we are adopting what most people call the Bayesian interpretation of probability. This, as I mentioned already, has been called by many the subjective interpretation of probability because it is based on using available information to calculate the numerical values of the probabilities, and available information is obviously a feature of the observer doing the calculation, not only of the system, and can be different for different users.

However, when we require C3 to be valid, if two people have access to the same information, they should end up with the same values for the probability. This is no more subjective than any other calculation either in mathematics or in science as a whole. In fact, consistency is the most we can require from any definition of an objective reality which is free of contradictions. Our Bayesian probabilities have it for sure. Pay attention to consistency: this is not the only time we are going to see it.

Once we force our degrees of plausibility to lie in the interval [0,1] and to comply with everyone's expectations (more technically, with Cox's Postulates), it becomes legitimate to officially call them probabilities once and for all. In order to make that official, instead of our earlier notation for degrees of plausibility, we will finally start to use the scientist's beloved notation P(A) for the probability that proposition A is true.

Take a moment now to appreciate the beauty of what we have

accomplished here. With a minimum of mathematics, and a lot of logical

reasoning, we derived the most fundamental aspects of probabilities. This

was not a small achievement. Even today, many mathematicians, scientists

and philosophers might look at this with some disdain, but apart from a

matter of taste, there is nothing to frown upon here. On the contrary, we


have a theory that should please all three of them with the additional

advantage of being extremely easy to grasp.

KOLMOGOROV'S AXIOMS

For a great number of mathematicians and for those who have a

technical knowledge of probability theory, the derivation of probabilities

from Cox's Postulates (remember, CP) might lack that professional and

technical feeling. That is because the modern mathematical theory of

probability is conventionally based on something called measure theory.

This theory relies on the axioms proposed by a character that we have

already met before, Andrey Kolmogorov.

Kolmogorov's Axioms, which we will abbreviate here by KA, use

ideas of set theory to define probabilities. Although it can be shown that all

of KA can be derived from CP (for technical details you can look at Jaynes,

2003), the former are less clear in terms of interpretation than the latter.

The advantage of basing our derivation of probabilities on CP

rather than KA is that, because of the way they are formulated, it becomes

much clearer how and why we are doing it. In addition, if you are not a

mathematician, it is much easier to grasp the fundamental ideas of

probability using CP than using KA. Finally, the connection with logic and information theory is much more direct in CP.

As an example, let us analyse how one would see the throwing of a

d10 through KA's point of view. We start by creating a set of all possible

mutually exclusive results of the d10 for one rolling. We have seen this set

before. We called it the sample space. Each one of its elements will be

called now a sample point. The most common symbol in mathematics used

to name sample spaces is the uppercase version of the last letter in the

Greek alphabet, the omega. This symbol is Ω. Then, for our d10, we could write the sample space as the set Ω = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.


According to KA, we then assign a number, or a measure, to each

one of the sample points. This number, of course, is the probability of that

particular result. KA are then roughly equivalent to the following three

conditions:

(1) Probabilities are non-negative real numbers.

(2) The probability of the whole sample space is 1.

(3) The probability of any combination of results (meaning that they are

put together with the OR operator) that cannot happen at the same

time should be equal to the sum of the probabilities of each result.
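The three conditions can be verified directly for a concrete assignment. The Python sketch below checks each one for the d10; the uniform choice of 1/10 per face is an illustrative assumption, not something the axioms themselves dictate:

```python
from fractions import Fraction

# Sample space for one roll of the d10 and a uniform probability measure.
omega = range(1, 11)
p = {outcome: Fraction(1, 10) for outcome in omega}

# (1) Non-negativity: every probability is a non-negative number.
assert all(p[x] >= 0 for x in omega)

# (2) The probability of the whole sample space is 1.
assert sum(p.values()) == 1

# (3) Additivity for mutually exclusive results, e.g. "the result is even"
# is the OR of five results that cannot happen together.
even = {2, 4, 6, 8, 10}
p_even = sum(p[x] for x in even)
print(p_even)  # 1/2
```

Using exact fractions instead of floating-point numbers keeps the checks free of rounding issues.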

You can notice that KA choose the probability interval from the start,

but apart from that, there is no mathematical difference between the

approaches and, sometimes, we will use the visual picture of sets of

elements to make things easier to understand.

You should always remember that the two points of view are

complementary, not mutually exclusive. This is something that will happen

all the time. Bayesian inference is not meant to replace what works, but

to enlarge the scope and improve the understanding of probability theory.

LOGIC, MATHEMATICS AND PHYSICS

I hid something underneath the carpet in the course of the previous

sections. You will find the clues scattered all around the text. I said that

common sense was a difficult thing to agree upon and that all we were

doing would change a little if quantum mechanics were to enter the stage, but that everything would be alright in the end.

Everything is indeed alright, as long as we take a deeper look at what alright actually means. That sounds a little too philosophical, but philosophy is at the basis of everything we do in science and mathematics. Ignoring this, as many do, is like behaving as a computer program that calculates things without knowing why. Computers hardly ever know when they are making mistakes because they simply do not know what a mistake is unless you program them to.

In 1960, one of the greatest unknown-to-the-public physicists of

the last century, Eugene Wigner, published a paper that became a kind of

holy text for physicists. The paper is called The Unreasonable Effectiveness

of Mathematics in the Natural Sciences (Wigner, 1960) and it deserves to

be part of our discussion. The weight of Wigner's stature among the scientific community of the time can be inferred from the fact that the paper, although published in a journal of mathematics, is purely philosophical. There is not a single equation in all of it! Had a less famous researcher tried a similar stunt, the paper would have bounced back from the editors in less than five minutes.

Wigner's worries, expressed in that paper, reflect the questions that were occupying the minds of the physics community in his time.

Since the Greek philosophers discovered that we could use mathematics to

describe nature instead of resorting to divine explanations, the number of

patterns discovered in the natural world has only increased and become

more complex. The fact that we can create a purely mathematical line of

reasoning from experiments to predictions that, afterwards, are confirmed

with outstanding precision is almost a miracle.

In other words, Wigner points to the fact that mathematical concepts entering physics by analogy with similar but purely mathematical constructions often lead to conclusions, based only on the mathematical manipulation of the symbols used, which are so accurate when compared to experiments that there seems to be no sensible explanation for it.

One example, and probably the one that has always impressed any physicist beginning to learn the principles of quantum mechanics, is the

necessity of complex numbers in quantum theory. We have talked briefly

about complex numbers in connection with Cardano. One of the main tasks


of complex numbers in mathematics is to guarantee that any polynomial

equation with complex coefficients has solutions that are also complex

numbers. This property is called algebraic closure.

Algebraic closure is not a trivial property and it is not difficult to see

that. Even if your current profession has nothing to do with mathematics,

you might remember from high school that not all polynomial equations

with real coefficients have real solutions. Consider, for instance, the simple quadratic equation x² + 1 = 0.

As anyone can see, there are only good old real numbers in the above formula. Nothing involves the square root of -1, the telltale sign of the complexes. However, to solve it one needs to find a number whose square is -1. We all know that it cannot be a real number. In fact, it is the imaginary unit i. This means that the reals are not algebraically closed, because an equation that can be written using only reals needs non-real complexes to solve it. By considering from the start the whole set of complex numbers, this completely changes, meaning that every polynomial equation that can be written by combining complexes has roots which are necessarily complex.
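We can watch closure happen numerically. The sketch below (an illustration using Python's standard complex arithmetic) solves x² + 1 = 0 with the familiar quadratic formula and confirms that the roots, while not real, are perfectly good complex numbers:

```python
import cmath

# Solve x^2 + 1 = 0 with the quadratic formula, allowing complex results.
# The coefficients a, b, c are all real, yet the roots are not.
a, b, c = 1, 0, 1
disc = cmath.sqrt(b * b - 4 * a * c)   # sqrt(-4) = 2j
roots = [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]
print(roots)  # [1j, -1j]

# Algebraic closure in action: substituting each root back in gives zero.
assert all(abs(r * r + 1) < 1e-12 for r in roots)
```

The key ingredient is `cmath.sqrt`, which, unlike `math.sqrt`, happily returns a complex number for a negative argument.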

Those who still remember complex numbers will be scratching

their heads asking what else can exist beyond them. It turns out that a lot

of things with strange names exist, like quaternions and p-adic numbers. Everything depends, in essence, on what you really call a number. In mathematics, numbers are entities that obey some rules. In general, what we see as a number is usually what in mathematics we call a field. Still, we can stretch the boundaries a little and also call other structures numbers, in a sort of way.

Complex numbers indeed form a field in the mathematical sense,

so they deserve to be called numbers as much as the reals. One could

spend a whole book showing how complex analysis, the branch of


mathematics that studies complex numbers and their consequences, is

beautiful and how it simplifies so many derivations. However, when it

comes to classical physics, at first sight there is no reason to introduce

complex numbers apart from some tricks one can do to solve some

equations. They are useful, but they are not an integral part of the theory.

Go to the internet now and look up any article on quantum

mechanics. It is all about complex numbers. The most impressive fact about

the use of complex numbers in quantum theory is that, by following their

mathematical rules, one can indeed reach conclusions that can be directly

translated to phenomena in the real world. This kind of pattern happens in

physics all the time, not only for complexes. Once in a while, someone finds

a way of expressing part of physical reality by associating with it some abstract mathematical structure, and mathematics itself does the rest of

the trick! The fact that this works so often is what Wigner saw as a mystery.

There is more to the article, but that is its essence. However, things have changed since Wigner's time and we are in a better position to look into

this philosophical problem from a different point of view. First of all, we

have to understand that mathematics is, in some sense, above science.

Above here means that it provides a language in which one can not only

encode rules of inference that are a result of our observations about the

physical world, but also any rule that we can invent. Science, on the other

hand, is always constrained by physical reality.

In addition, we actually know that things are not as simple as

Wigner painted them. Not every mathematical structure we try out ends up being perfectly successful in describing the world when we pull the mathematical lever. Not all solutions to the equations of physics are

realized in the real world, no matter how rigorously the mathematical rules

are followed. In those cases, we either change the principles we used to

derive the result or we change the mathematical rules themselves.

This works even for the most basic mathematical structures.

Imagine that someone comes to you with a mathematical challenge. The person says that her age is the solution of the following mathematical equation: x² - 16x - 36 = 0.

This equation has two solutions, -2 and 18. Which one are you

going to pick? Given that negative ages have no meaning for us, the -2

solution is just an artefact of the equation and we have to discard it. It is not that we cannot represent ages by numbers and operate on them with minus, plus and squares. We can, but we need to be aware of the limitations of doing so.
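The discard-the-unphysical-root step can be sketched in Python. The specific quadratic below, x² - 16x - 36 = 0, is one consistent choice reproducing the quoted solutions -2 and 18:

```python
import math

# A quadratic whose two solutions are the ones quoted in the text, -2 and 18.
a, b, c = 1, -16, -36
disc = math.sqrt(b * b - 4 * a * c)          # sqrt(256 + 144) = 20
roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]
print(roots)  # [-2.0, 18.0]

# Ages cannot be negative, so the physical answer keeps only one root.
age = [r for r in roots if r >= 0]
print(age)  # [18.0]
```

The mathematics happily hands back both roots; it is the modelling step, our knowledge that ages are non-negative, that throws one away.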

A more sophisticated and interesting example is Boolean algebra.

Boolean algebra, in a nutshell, is just the way logic is represented inside

computers. We have seen it before. It comprises the rules for operating with the numbers 0 and 1, representing, respectively, false and true, together with the logical operators AND, OR and NOT.

Boolean algebra is constructed such that the mathematics mirrors what we would expect by substituting the numbers and symbols with words. We saw that, if we work with propositions, attribute truth values 0 and 1 to them and combine them using AND, OR or NOT, we will not end up with nonsensical results. Remember that, when discussing Cox's Postulates, we attributed truth values to propositions like A = (I am reading this book now) and used the idea that A and NOT A (I am NOT reading this book now) could not both be true at the same time. In terms of Boolean algebra, that meant that A AND (NOT A) = 0.
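The exclusion rule just described can be checked mechanically for every truth value. A minimal Python sketch of Boolean algebra (the operator names are spelled out for readability):

```python
# Boolean algebra on the truth values 0 and 1.
def NOT(a):
    return 1 - a

def AND(a, b):
    return a & b

def OR(a, b):
    return a | b

# For every truth value a, the exclusion rule A AND (NOT A) = 0 and the
# completeness rule A OR (NOT A) = 1 both hold.
for a in (0, 1):
    assert AND(a, NOT(a)) == 0   # never both true
    assert OR(a, NOT(a)) == 1    # always at least one true
print("exclusion and completeness hold for both truth values")
```

Two values and three operators are all it takes: the loop exhausts every case, which is why these identities can be checked rather than merely believed.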

Think about this for a moment. Why do we insist that we can either

be doing or not doing something, but not both at the same time? When we

were constructing probabilities via Cox's Postulates, this exclusion rule was imposed by us because of our daily experience with physical reality. We never see in our daily routine people doing and not doing something at the same time, so we infer that this must be an acceptable rule. We generalise this rule and deem it applicable to everything around us.

By contemplating the example above, we could be inclined to

accept that Boolean algebra must be the mathematical structure that

describes logic in nature. This is fine, as long as we are not in the realm of

quantum physics. It is by now widely known that, while classical computers work with bits, which can be either 0 or 1 but not both at the same time, it may be possible to construct quantum computers which work with qubits (quantum bits), strange entities that can be both 0 and 1 at the same time!

We will study quantum mechanics in more detail later on, but

right now it is enough to know that everything in quantum theory is

defined by a mathematical object called a state vector. A state vector is a

mathematical construction that contains all possible information necessary

to describe anything at some particular moment in time. State vectors are

used to relate systems to their possible states, which are defined by

characteristics that can be measured. For instance, in physics we usually

talk about energy states. If you consider a highway with cars that have the

same mass, cars with the same velocity will have the same kinetic energy.

Each different value of this kinetic energy defines then a different energy

state. The value of the kinetic energy works as a label to identify the state.

Two cars with the same velocity are said to be in the same energy state.

The same concept works for other quantities too, not only energy. We can

have position states, electric current states, mass states and virtually any

kind of state associated with something measurable. These measurable

quantities in quantum mechanics are called observables, for obvious

reasons.
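The highway picture above can be sketched directly: group the cars by the value of their kinetic energy and each group is one energy state. The numbers below are made up purely for illustration:

```python
from collections import defaultdict

# Cars of equal mass on a highway: the kinetic energy E = (1/2) m v^2 works
# as a label for the energy state, so cars sharing a velocity share a state.
mass = 1000.0                                  # kg, same for every car
velocities = [20.0, 25.0, 20.0, 30.0, 25.0]    # m/s, one entry per car

states = defaultdict(list)
for car, v in enumerate(velocities):
    energy = 0.5 * mass * v ** 2               # the state label
    states[energy].append(car)

for energy, cars in sorted(states.items()):
    print(f"E = {energy:.0f} J -> cars {cars}")
```

Cars 0 and 2 end up in one state, cars 1 and 4 in another, and car 3 alone in a third: the energy value really does nothing more than label groups of systems.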

Classically, bits are systems for which we choose an observable

that can assume only two values. We then re-label one of the values 0 and

the other 1. In classical physics systems can only be in one or the other

state. Right now, for instance, you are in a state that would include, among many other pieces of information, the fact that you are reading this text. If we forget about everything else, we could simply say that your state is given by the proposition A = (you are reading this book now). Quantum effects are only perceptible for very small things or for very high energies in our world and, because we are characterized by neither, we are better described by the rules of classical physics. In classical physics, you are now either reading or not reading these lines, but not both. That is a completely consistent mathematical description, which means that it does not lead to any internal problems in the physical theory describing this particular phenomenon (that of reading this book now). Therefore, your reading state can be associated with a classical bit by re-labelling the proposition A as 1 and its negation, NOT A, as 0.

A subatomic particle, on the other hand, is small enough to be

affected by quantum physics. Be aware that this is not a change of the rules

of nature. Quantum physics describes both us and the subatomic particle.

What happens is that the differences between quantum physics and

classical physics become small when things are large and slow (like us). The

effect of quantum physics on the subatomic particle allows it to break the

rule that it can either be doing something or not doing it, but not both at

the same time. This means that while classical objects can only assume either the state described by A or the one described by NOT A, a quantum object would be allowed to actually assume both at once, a situation described by saying that the system is in a superposition of the states A and NOT A. Of course there are subtleties and we will talk about them later on.
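A minimal numerical sketch of a qubit, assuming only the standard rule that the squared moduli of the amplitudes give the measurement probabilities (the helper function is made up for illustration):

```python
import math

# A qubit state is a pair of complex amplitudes (alpha, beta) attached to
# the classical values 0 and 1, normalised so |alpha|^2 + |beta|^2 = 1.
# The squared moduli give the probabilities of finding 0 or 1 on measurement.

def probabilities(alpha, beta):
    return abs(alpha) ** 2, abs(beta) ** 2

# A classical bit is the special case with all amplitude on one value:
p0, p1 = probabilities(1, 0)
print(p0, p1)                      # 1 0 -> certainly in the state 0

# An equal superposition: "both 0 and 1 at the same time", each outcome
# found with probability 1/2 when measured.
s = 1 / math.sqrt(2)
p0, p1 = probabilities(s, s)
print(round(p0, 3), round(p1, 3))  # 0.5 0.5
```

The classical bit sits at one corner of the description; the superposition is everything in between, which is exactly what Boolean algebra has no room for.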

Do not get desperate if things do not make sense for you at this

point. The situation in quantum physics is a subtle and very confusing one.

It takes time to get used to it. As strange as it might seem though, this state

of affairs is the result of us physicists trying to make sense of a series of

experiments in such a way that does not lead to an internal contradiction in

our description of nature. It is our attempt to create a consistent theory

describing those phenomena. The only way we found to do that was

accepting that, at the quantum level, things can be described by assuming that there is a state with no counterpart at the classical level. There is no contradiction in that because the theory is constructed from the start in such a way that, when we deal with daily-life objects (remember, slow and large), Boolean algebra is recovered.

The bottom line is that even the rules that we take for granted in

nature are a consequence of inference made on the basis of collected

information. We use mathematical structures because the experiments

seem to indicate that it makes sense to do so. You might say that it is not only the experiments, but also logic. Just remember that logic is also inferred

from experiments, so it is basically all the same. Even science itself,

including all the rules we use for working with it, is a question of inference.

In this sense, the fact that mathematics is useful in physical sciences is a

result of us inferring the right rules.

There are, of course, philosophical alternatives. In a paper

originally released on the internet in 1997, Max Tegmark (Tegmark, 1997)

proposed that every mathematical structure actually exists as a different

reality. This would answer Wigner's question in a different way by saying

that the fact that mathematics is useful is inevitable as any mathematical

description will fit some physical reality and every physical reality has a

mathematical description fitting it. It says that there is a one-to-one

relation between mathematical structures and realities.

This idea is, like many others, not provable in its present state, which characterizes it as a hypothesis, not a scientific theory. It is not even clear whether it is a scientific hypothesis, although that does not take away its merit. But even if that hypothesis somehow turns out to be true, we would still have to infer the rules of our mathematical structure. That would take us back to our game of inference.


MESSAGES

The main message of this chapter is that probability theory can be

constructed from simple principles of inductive logic. By starting with some

basic requirements, which many people more technically call desiderata (things you wish for), we built, step by step, a formal framework that

allows us not only to talk about probabilities, but to calculate them in such

a way that we can compare the numbers we obtain to the results of actual

experiments. Probability is logic.

The second important message that you should keep with you not

only while reading this book, but also when thinking about science in

general, is that the mathematical frameworks that we develop, no matter

how complicated they might seem, are there to encode patterns in

whatever way we want to. The connection of mathematics with the real

world is made by selecting those patterns which we actually observe in

nature. If we decide to, we can modify the mathematical rules, while respecting some limits, to make them reproduce whatever behaviour we

desire.

Mathematics, philosophy and science are a triad that lies at the foundations of all our knowledge. Although you can ignore their

connections if you wish, a consistent understanding of everything we learn

and do cannot be complete without considering all three at the same time.

This unity will be lurking behind every page of this book and we will have a

chance to feel it at its full power as we journey to its end.


5.

Information

AGE OF INFORMATION

Science, as we discussed at the end of the previous chapter, uses

mathematics to describe the world we live in. These mathematical

descriptions, which we have seen to be abbreviated ways of encoding a

series of rules, are called models. Every science, from psychology to

physics, works by creating models of phenomena and testing whether they work or not.

I have already lost count of how many times people come to me

and say that this or that theory has been proven wrong. Newtonian

mechanics has been proven wrong by relativity. Classical physics has been

proven wrong by quantum mechanics. If those theories are wrong, why do

we keep using them? To answer this question in full, we need to understand a bit better what it means for a theory about nature to be right or wrong. That will take us down a long path.

It is true that the scientist in general, but even more strongly the physicist, has an inner desire to believe that nature is describable by one single, mathematically coherent model: the famous and exaggeratedly named Theory of Everything or, in one of those many instances in which scientists like to make a joke, a TOE.

We are not there yet and, honestly, there is no guarantee that we

will ever be. But if there is one thing we are certain about, it is that all models we have now are wrong in some way. The beauty of this is that it does not matter if our models are right or wrong in every single detail, as long as we can use them to make predictions about what they are supposed to describe, within certain acceptable limits.

But what about models that offer different descriptions for the

same aspect of nature? In our daily life, most people agree on the colours,

shapes and other directly detectable characteristics of things around us, meaning those which can be detected by one of our senses. The

amazing thing about our present knowledge of nature is that we have

models describing things that are not detectable by our bodies unless we

use some kind of tool to indirectly measure them.

The classic example is atoms. The existence of atoms was not fully accepted until Einstein's work on Brownian motion at the beginning of the 20th century. Ernst Mach, for instance, used to say publicly

that he did not believe in atoms. Today, atoms can be visualised, but only

by using a tunnelling microscope.

Another simple example is radio waves. We can literally see with

our eyes electromagnetic waves within a certain band of the spectrum. We

call it simply visible light. We can detect it because some proteins in our

eyes change shape when they are hit by light, and this shape-changing is transformed into electrochemical signals which are transferred to our brains

for post-processing. Light (the visible kind) differs from other kinds of

electromagnetic waves simply by its frequency. All other frequencies,

including radio waves, need special devices to be detected. In the case of

radio, we use antennas to detect them and electronic circuits to interpret them as images or sounds.

What is real and what is not is a very complicated, but not

unimportant, philosophical question. It is however one which we will only

discuss here very briefly. When we detect something with our senses, we

tend to attribute to it a much greater quality of being real than when we are forced to use indirect measurements. We usually forget that shapes, colours and sounds are all interpretations by our brains of the information they receive from our detectors. In fact, this information is already pre-processed by the nerve cells that convert it into electric signals transmitted from neuron to neuron. This means that part of reality is already lost and changed in this process. Many experiments in psychology show that the brain's interpretation of the information it receives is so subjective in some aspects that it can be heavily affected by our memories and experiences.

This realisation, that the reality we create in our brains is a model

resulting from the processing and interpretation of electric signals, brings into evidence a different player in physical reality whose importance could only

be appreciated once we entered the present computer age. This element is

information.

Science and technology are complementary in the sense that

advances in one bring advances in the other. When new technologies are

incorporated into our daily routines, they change our culture and even our

way of thinking. This new way of thinking usually bounces back on the way

scientists interpret nature. That happened with the steam age, which

brought the energy paradigm in physics, and happened again with the

information age, which forced physicists to think about the world also in

terms of how information is acquired and processed between systems.

Computers are systems that eat, digest and spill information all the

time. We are so used to them nowadays that we find it very simple to

understand things in terms of information. Anyone today can appreciate

that, given enough bits, we can construct any kind of image, sound and

even actual three-dimensional objects. We might still not be able to

reproduce some things like smells, but that is a technological issue, not a

fundamental one.

Because we created them, we can still understand how computers

work. At least, most of us have a feeling that we do. Not always in detail,

but in general. There is, though, a much more complex system which also eats, digests and spills information: the human brain. Repeating what I wrote before, our brain is receiving information about things all the time and creating a model that gives us our perception of reality. But there is

also another level in which this process happens. It is happening right now.

While you read this text, the words in it are being processed by your brain

and associated with your memories to create a meaning. If you read the

word ball, you know what it is without seeing the picture. In fact, right

now you are seeing a ball without really seeing it. If you think about your

favourite music, you will hear it without actually hearing it! If you stop to

think about it for a moment, it gives you chills.

What is happening is that I am using a kind of code to store

information in a way that you can detect, decode and interpret later. We

call this code language. This particular language you are reading right now

is called English. This encoding allows me to think about something and

transmit it to you without you having to take a look inside my head.

Learning English means learning how to associate sequences of letters with images, sounds, smells and so on. Some words, however, are associated

with higher level concepts. The word English itself describes something

that is much more complex.

Mathematics, as I have argued before, is another kind of language.

It was developed for a different (an admittedly more restricted) purpose,

but it is still a code used to store, transmit and also to process information.

It turns out that it is the most efficient language to do science. In science,

we learned that it is extremely convenient to encode information about the

external world as mathematical structures. We then use mathematics itself

to process this information until we can extract some hidden pieces of it

from the original data.

Probabilities are one of those structures used to codify and process

information, in particular when there is a lot of uncertainty around, which

is almost always the case. In fact, we are going to see that they are one of

the most fundamental tools to do that in any case, with certainty being

just a special situation. We have already learned that probabilities encode

Cox's Postulates. In the next sections, we are going to understand what

else they can encode and how.

The Probable Universe

103

ENCODING INFORMATION

It is time to roll our dice again. Let us use the d10. After all we have

learned, we can safely agree that the probability of getting a 10 when rolling the d10 is the same as the probability of the proposition A = "When I throw my d10 up into the air and it hits the table, it will stop with the face marked by the number 10 upwards" being true. We can then use the shortcut mathematical notation P(A), which I hope at this point has become clear enough. We can now look at this number using our new knowledge about what a probability is.

The important thing is to notice that, when we associated

probabilities with degrees of plausibility, the former could be interpreted

as a way to actually encode whatever information we had about a

proposition. The greatest challenge is how to do this encoding. There is not

a unique answer to this problem. There are many ways of doing it

according to which type of information we have in our hands. This is one of

those places in science in which creativity is essential. Although there are

some basic principles that can be used as guidelines, unfortunately (or

fortunately if you are a scientist who is afraid of being substituted by a

computer) there is no general procedure that works for all cases.

Rolling a d10 is a physical situation which is simple enough to allow

us to do that with little difficulty. Here too it is convenient to be economic

with mathematical symbols. In terms of information theory, whenever we

try to be economic with symbols this is equivalent to say that we would like

to compress information. To do that, we have to agree on some basic

conventions that will allow us to write our propositions and the

probabilities corresponding to them in the most compact way possible and

to recover their actual meaning whenever we see them. We start by trying

to find a condensed description to refer to all 10 possible flavours of

proposition which, written in full, become:


When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 1 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 2 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 3 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 4 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 5 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 6 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 7 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 8 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 9 upwards

When I throw my d10 up into the air and it hits the table, it will stop with

the face marked by the number 10 upwards

I am pretty sure that after the second or third sentence you were

already losing your patience. I could have used much less space to describe

the above 10 possibilities. I can simply omit from the description of the

proposition the whole procedure of how the dice is being rolled. We keep a

kind of rulebook where we describe that procedure only once and then

assume that we agreed to follow that exact procedure for all dice rolling.


Next, we give a name to the result of the dice rolling. Let us call it X. Then, each one of the ten propositions giving the results of the dice rolling can be written in the following compact form

X = n, with n being one of the numbers from 1 to 10.

This is visibly a huge amount of compressing. For instance, in this notation we have now a very compact way to write our proposition as

A = (X = 10).

Because it is a notation that makes very clear what our proposition is, we will many times use it inside the brackets of the probability symbol in the following way

P(X = 10).

The variable X is what is called a random variable in probability

theory and it is a convention in many books to use capital letters for them.

Random variables should not be confused with our propositions. A

proposition is, in fact, an assignment of a value to a random variable. You

can think of the random variable as a kind of incomplete proposition, one in

which something is missing. In the above case, this something is the value

of X.

The act of assigning probabilities for each value of our random

variable is then exactly equivalent to saying that a certain face has such and

such odds of being the result of the dice rolling. So what can we say now

about these odds? Remembering our previous discussions, let us assume

that our d10 is an exactly symmetric solid with 10 faces. In addition, we will

assume that along the entire d10, the density of its material is constant and

that the small painting indicating the numbers on each face is so thin that the paint's mass is negligible and will not affect the result.

We discussed previously that, unless we live in a completely

chaotic world where things do not make any sense, like in a cartoon, it


seems logical that if the dice is thrown with enough energy and in a

careless enough way all faces should have the same probability. Finally, we

also know that only one face can occur each time we roll the dice. That

summarises all our information about the entire game. But before we start

the encoding of this information into probabilities, I will introduce some

even lazier notation and compress things a bit more.

Because all that is relevant to the situation can be described by the

result of one single random variable, we do not need to repeat its name all

the time. We know that, at least for now, we will be only dealing with X. Therefore, there should be no need to keep repeating the name of the variable over and over again as long as we remain working with the same problem. So, instead of writing P(X = n), we can simply use P(n) to describe the probability of X being n. That is a risky move, as sometimes

omitting the random variable can be the source of a lot of confusion. Our

situation here is, however, clear enough for this not to happen.

For what we have to do now, we will need to combine propositions

with the logical operators AND and OR, the same two that we learned

about when we were dealing with Cox's Axioms. The convention was that combining two propositions with AND would be denoted by writing them in sequence. For instance, the combined proposition A AND B would simply be written as AB. Similarly, we would write A OR B as A + B.

There is only one inconvenience when using the AND convention

together with our present compressed notation for dice rolling. The

problem is that it is visually confusing to put numbers together like

propositions. For instance, 1 AND 2 would be written 12, which could be

misunderstood as a result of twelve in the dice rolling, which cannot

happen for a d10. Therefore, instead of writing 12 for the proposition 1 AND 2, we will separate them with a comma and write 1,2.

We now have a very compact and convenient notation to encode

the information about the d10 rolling. We can now summarise it by the

following requirements


1. All probabilities should be the same (symmetry):

P(1) = P(2) = ... = P(10)

2. When the d10 is rolled, at least one of the faces has to end up upwards:

P(1 + 2 + ... + 10) = 1

3. It is impossible to get more than one face at each time:

P(n, m) = 0 for any two different faces n and m

The first equation, which encodes the idea that all faces have the

same probability, is a simple way to encode the symmetry of this problem.

You need to be careful with the second equation. Remember that we are

using the convention that the symbol + means OR and not the usual

addition of numbers. This means that 1+2 is not equal to 3 in this notation.

The last set of equations seems to encode such an obvious fact that

it looks almost superfluous. But remember that we are using a variation of

the lawyer's rule which says that, if something is not explicitly written

down (in this case, encoded), then it does not exist. We need to encode

somewhere the idea that only one face can be the result of the rolling,

otherwise we open a potentially dangerous loophole in our encoding

procedure.

This impossibility of having two or more faces at the same time is

important here. It turns out that the rules of probability state that when

two propositions, say A and B, are mutually exclusive, with P(A, B) = 0, this implies that


P(A + B) = P(A) + P(B).

This is called the sum rule and, although it has admittedly a simpler

interpretation when one considers probabilities as frequencies, it works for

any proposition. The way to prove it involves some mathematical

manipulations and I will skip them here. If you are one of those people who (correctly) do not believe in something just because a book says so, you can find the proof in chapter 2 of Jaynes's book (Jaynes, 2003). Notice

that the sum rule is not always valid. The most complete form of it is given

by

P(A + B) = P(A) + P(B) - P(A, B)

and it states that when we sum the individual probabilities for A and B, we are counting twice the probability of A and B happening at the same time.

Because we counted it twice, we have to discount it once to get the correct

value. When the discounted value is zero, which is when there is no chance

for both to happen at the same time, we recover the initial sum rule.
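Both forms of the sum rule can be checked by brute-force enumeration over the d10 faces. Below is a minimal sketch in Python; the two events chosen ("even result" and "result greater than 6") are my own illustration, not from the text:

```python
from fractions import Fraction

faces = range(1, 11)               # the ten faces of the d10
p = Fraction(1, 10)                # probability of each face

def prob(event):
    """Probability of an event, represented as a set of faces."""
    return sum(p for f in faces if f in event)

A = {f for f in faces if f % 2 == 0}   # proposition: the result is even
B = {f for f in faces if f > 6}        # proposition: the result is greater than 6

# General sum rule: P(A + B) = P(A) + P(B) - P(A, B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Mutually exclusive propositions (e.g. faces 1 and 2): the simple form holds
assert prob({1, 2}) == prob({1}) + prob({2})
```

Using exact fractions instead of floating-point numbers keeps the equalities exact.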

There is also the complementary rule called the product rule,

which says that the probability of the conjunction (AND) of two

propositions which are independent of each other is the product of their

probabilities, or in mathematical symbols

P(A, B) = P(A) P(B).

The independence property is extremely important here and

means that the occurrence of one of the propositions has no influence whatsoever on the occurrence of the other. If the propositions are somehow related, like A = "We are in the Sahara desert" and B = "It is going to rain today", which very obviously influence each other, the rule changes

in a way that we will see later on. But it cannot be written as above

anymore. In the same way as there was a generalisation of the sum rule

that would work always, there is also a generalisation of the product rule

that works even when the propositions are not independent. However, this

generalisation involves something called a conditional probability,


something that is very important for our program BAYES, but which we will

talk about only later on.
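For independent propositions, the product rule can also be checked by enumeration. A sketch assuming two independent d10 rolls; the propositions "first roll is 3" and "second roll is 7" are my own example:

```python
from fractions import Fraction
from itertools import product

faces = range(1, 11)
# All pairs of results for two rolls; each pair is equally likely
outcomes = list(product(faces, faces))
p_pair = Fraction(1, len(outcomes))            # 1/100 per pair

def prob(event):
    """Probability of an event, given as a predicate on (first, second)."""
    return sum(p_pair for o in outcomes if event(o))

p_A = prob(lambda o: o[0] == 3)                # P(first roll is 3) = 1/10
p_B = prob(lambda o: o[1] == 7)                # P(second roll is 7) = 1/10
p_AB = prob(lambda o: o[0] == 3 and o[1] == 7) # P(both) = 1/100

# Product rule for independent propositions: P(A, B) = P(A) P(B)
assert p_AB == p_A * p_B
```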

Because the occurrences of faces in the d10 rolling are mutually exclusive according to the rules of our game, using the sum rule the equation containing the OR of all faces is equivalent to

P(1) + P(2) + ... + P(10) = 1

The symmetry information says that all these probabilities should be the same. Therefore, we can call all of them by the same name, which I am choosing to be the letter p. This results in the simplified equation

10p = 1

which finally gives us the probability of any face to be p = 1/10, in a

relieving agreement with what we thought to be justified by common

sense.
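The whole encoding argument fits in a few lines of code. The sketch below (the variable names are my own) writes down the symmetry and normalization requirements and checks that p = 1/10 satisfies them:

```python
from fractions import Fraction

n_faces = 10

# Symmetry: every face gets the same unknown probability p.
# Normalization plus mutual exclusivity: the sum rule turns
# P(1 + 2 + ... + 10) = 1 into n_faces * p = 1, so:
p = Fraction(1, n_faces)

probabilities = {face: p for face in range(1, n_faces + 1)}

assert sum(probabilities.values()) == 1        # requirement 2: certainty
assert len(set(probabilities.values())) == 1   # requirement 1: all equal
```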

Can you now look at the simple expression P(n) = 1/10 in the same

way as you did before? Take a moment to think about how much

information is contained in this simple-looking equality. Of course, once

you get the knack of it, the act of encoding information into probabilities

becomes easier, but it does not mean that it becomes easy. It requires a lot

of thinking and a lot of experimenting to find out which are the relevant

pieces of information and how to put them together in order to find

numbers which will allow you to make good predictions. In some cases, like

the stock market for instance, this might never be possible within the

accuracy we would like. Get used to uncertainty. Life is full of it.

INSUFFICIENT REASON

The symmetry principle we used to find the probabilities for our d10 in the

previous section is a very powerful and very deep one. It is called Laplace's


Principle of Insufficient Reason and is a simple statement of rational

neutrality.

Symmetry is a fundamental concept in our current understanding

of nature and, fortunately, is a very simple one. We have already seen it

working in two different ways: to allow us to calculate probabilities and to

generate uncertainty. We say that something is symmetric or has a

symmetry when, if we change this something in some way, there is at least

one characteristic of it that remains the same. If everything changes, there

is no symmetry at all.

For instance, the human body has what is called bilateral

symmetry. This is a fancy term for the fact that, if you take a photo of the

right side of your body, you can use a graphic software like Photoshop (or

GIMP, the free alternative to it) to complete the rest of the picture by

simply reflecting it. This also means that if you swap your left and right

sides, your picture remains the same.

Nobody has perfect bilateral symmetry though. We call this

situation an approximate symmetry and, in most cases, this is already

powerful enough for most purposes. It is actually very rare to have a

perfect symmetry in any situation, but fortunately we can most of the time

ignore the small imperfections and assume that the symmetry is perfect

enough given the precision we are working with. Just remember that we

ignored differences in the density of each face of our dice caused by the

inscription of the numbers on it because they would not cause too much

deviation from symmetry.

I will interrupt the text flow here because this last issue

cannot be emphasized enough. In most tasks we perform in our lives, we

only need to be precise within certain limits. This is true for science as well.

Most of the time, perfection is not only unachievable, but is also

unimportant and looking for total precision can be a waste of resources.

Keep that in mind.


Returning to the main topic, in order to appreciate the power of

symmetry concepts in science, in particular in physics, it is worth knowing

that the most fundamental descriptions of nature's laws that we presently

have are based on something called gauge symmetry. This is a very difficult

symmetry to visualize, because it is associated with the way we describe

some things mathematically. The first place in which it was identified in

physics was in electromagnetic theory.

We all know what a voltage is. Voltages are written everywhere on our electric devices. They measure differences of something called an

electric potential. It turns out that there is no meaning in an absolute value

of an electric potential. They are always relative to some fixed reference

potential. You can change the numerical values of the potentials, but the

voltages remain the same. This was exactly what happened in our

definition of symmetry. This is a very simplified example of the gauge

symmetry present in the electromagnetic theory. It is called a gauge

symmetry because we can gauge the electric potential in the most

convenient way for us without changing the actual physics.

A full understanding of gauge symmetry would require a level of

mathematics that is beyond what we can have here. The consequences of

gauge symmetry, however, are far reaching. For instance, when we go to

quantum theory, if we take as a fundamental requirement that some gauge

symmetries should be satisfied, this implies that fields similar to the

electromagnetic field must exist. If they do not exist, the symmetry cannot

exist as well.

There are even more consequences of symmetries. In 1915, the

German mathematician Emmy Noether proved what is probably the most

beautiful theorem of all physics and, to show that prizes are not everything

in life, she never earned the Nobel Prize for that. In one of the few cases in which deserved credit is correctly assigned, the theorem is known today as Noether's Theorem, although the beauty of it more than provides grounds to suggest that it should be called Noether's Poem.


The theorem was only published in 1918 in German and was

translated to English in 1971 (Noether, 1971). What Noether proved is very

simple, but incredibly amazing. Her theorem shows that each continuous

symmetry of a physical system implies the existence of a conserved

quantity. Not only that, the theorem is so complete that it gives you the

formula to calculate the conserved quantities. The hypothesis and the

conclusion of the theorem are not difficult to understand, even if you are

not a mathematician, although the proof cannot be given without maths.

The key concepts are continuous symmetries and conserved quantities.

Let me explain what they mean.

Continuous symmetries refer to symmetries of an object when the

change we make on it depends on a parameter that can vary continuously

from the initial unchanged configuration of the object to the final, changed

one. For instance, think about a round plate. If you hold the plate in front

of you and rotate it by any angle, the plate looks just the same (once again

ignoring the tiny imperfections on it). But at any moment between the

initial position and the final position, the plate also looked the same, no

matter what the value of the angle was. Consider that, with respect to the

initial position of the plate, you rotated it by 90 degrees. In order to do

that, you needed to pass through 87, 87.5, 87.9, 87.999 and so on before

reaching 90 degrees. Any value between 0 and 90 is acceptable for the

angle of rotation. Inside this interval, there were no forbidden values, they

could be varied continuously and at each value the plate would still look

the same.

We say that the angle varies continuously in contrast to what we

would call a discrete change. When something changes discretely, it goes

through jumps. The real numbers, the numbers in a straight line, vary

continuously because there is always another real number between any

two. The integers, contrary to that, vary discretely. There are no integers

between 1 and 2. There is a jump from 1 to 2.

Consider the picture below.


The second row shows a square which is rotated by the angles

given in the first row, which are 0, 10, 45, 87 and finally 90 degrees. You

can see that the square will only look the same if you rotate it by 90

degrees or by multiples of it, like 180, 270 and so on. The third row shows a

circle (or a plate) for comparison, which is always the same no matter what

the angle of rotation is.

For the square, any angle between 0 and 90 degrees is not a

symmetry. We call cases like this a discrete symmetry, because the

symmetry appears only in jumps of 90 degrees. This kind of symmetry does

not obey the hypothesis of Noether's Theorem and we have no guarantees

that it leads to conserved quantities, although sometimes it does.

The other concept we need is that of a conserved quantity. This is

a quantity that does not change in the system as time goes by. The

example we are all used to is energy. Everyone has heard that energy

cannot be created or destroyed, just transformed. This is another way to

say that energy is conserved. Because the total value of it, calculated from all possible sources, does not change, energy is said to be a

conserved quantity. But have you ever wondered why?

Here comes the beauty of Noether's Theorem. Using it, we can

show that if the laws of physics are symmetric by time translations then

energy should be conserved! A time translation is just a fancy name by


which we call the passage of time. As far as we know, time seems to flow

continuously. Up to the precision we can measure, it seems to be possible to divide any time interval indefinitely. There is no minimum time

step. This matter, though, is not settled for sure. In any case, we can

assume that time is continuous and see what the consequences are.

By assuming that time flows continuously, we open up the way to

use Noether's Theorem. The fact that the laws of physics do not change

with time in their fundamental description can be summarised by the idea

that if we do the exact same experiment today or at any other day, keeping

all experimental conditions as equal as possible, the results should be the

same. Then, this allows us to calculate, via Noether's Theorem, a quantity

that is conserved. This quantity turns out to have exactly the formula of

energy.

In the same way, the laws of physics seem not to change from

place to place. If we do the same experiment in the UK and in the US, apart

from environmental differences, the results are the same. We call this

symmetry by spatial translations in analogy with that by time

translations. Because space is also continuous (within the precision we can

measure) this also leads to another conserved quantity, one we call

momentum. One of the consequences is that, unless any forces act on a

body, it will keep moving with the exact same velocity forever.

Just to complete the most famous trio of conserved quantities,

because there is no preferential direction for the laws of physics in space,

or if physics is symmetric by spatial rotations, then the conserved quantity

is the angular momentum. Very similarly to the non-angular momentum,

the angular version has the effect that if we do not try to stop a spinning

object, it will keep spinning with the same angular velocity forever.

There are many other consequences of symmetries, and even of

the failure of a system to be symmetric, but we are not going into the

details of it. One very important side effect of symmetries is that they are

accompanied by ignorance. In the example of the rotationally symmetric


plate, you will never know whether someone touched your plate if the final

effect was only a rotation. Any rotated position of the plate will be the

same. Of course, if you had access to some equipment that allowed you to

detect fingerprints, you would know that someone touched the plate. This

means that the symmetry, in this case, can also be the result of some lack

of information about some non-symmetric aspect of the plate.

We can think of the roulette as a discretised version of the plate.

The rotation symmetry in the roulette is discrete, like the one of the

square, because it is divided into compartments of finite size. When it is

rotated in such a way that the divisions between compartments coincide

for the initial and final positions, if we ignore the differences in the printed

numbers, we can consider both configurations as symmetric.

As we have seen with the d10 previously, the symmetry of the

compartments of the roulette tells us that there is no reason to assume

that one number is more probable than the other. In other words, there is

insufficient reason to prefer one result over the other. The fact that we

have no reason to favour any of a number of symmetric configurations of a

system is Laplace's Principle of Insufficient Reason.

In other words, what Laplace proposed was that, if the symmetries

of a system are such that there is no way to differentiate between one

result and another, the most reasonable thing to do, given all the

information you have, is to assume that every state is equiprobable. It does

not matter if 11 is your lucky number, nothing in the roulette says that it is

more probable and you would not be rational if you attributed a higher

probability to it.

Laplace's Principle is not something that can be proved

mathematically. It is a postulate based on a logical requirement. It has a

philosophical underpinning which reflects a rational consistency of the

world around us. Why on Earth (or in this case, on the whole universe)

would one of two exactly symmetric configurations be more probable than

the other? The only rational reason would be if there was something that


would differentiate them. Of course, we cannot rule out that something like

that might occur in nature just because we cannot understand it, but

unless strong experimental evidence appears, it seems that we live in a

more or less sensible universe.

One of the places in physics in which Laplace's Principle has a great

impact, for instance, is in statistical physics, although this is rarely seen in

this way. Statistical physics is the area of physics that deals with systems

consisting of a very large number of smaller parts. Any macroscopic object

has more than 10²³ atoms.

The way these atoms interact to result in a solid, a gas or a liquid and the

conditions over which these different phases change into one another are

subjects of statistical physics.

Because it is not practical to keep track of that amount of atoms,

statistical physics resorts to descriptions that involve probabilities. The

reason, again, is the lack of total information about the system. In principle,

total information could be obtained with enough time and resources, but in

practice it is not worth doing. You do not want, and actually you do not

need, to measure all information about a cube of ice to know that it will

melt at about zero degrees Celsius.

In statistical physics, the strategy used is to focus on the energy

states of a system and this is the point where Laplace's Principle makes its

appearance. The fundamental assumption of statistical physics is that all

states with the same energy have the same probability. The symmetry

assumption here is a subtle one. It corresponds to the idea that states with the same energy behave in the same way macroscopically. Never forget that

assumptions like this should always be subjected to experimental

validation. As it turns out, it has been working quite well for more than a

century.


MAXIMUM ENTROPY

Laplace's Principle, in its plain formulation, is easy to apply in some

situations where the description of the symmetries is straightforward, but

not that much in others. Fortunately, it can be translated into a very

powerful formulation which is known these days by the name of Principle

of Maximum Entropy. Odds are that you are familiar with the word

entropy as describing disorder. That is one of the ways to look at entropy,

but there is a more modern point of view which is even deeper and

reduces to the old one when the situation is appropriate. We can look at

entropy as lack of information.

Claude Shannon was a brilliant engineer from the US. During

his studies about communications systems, he discovered that he could

define mathematically the concept of lack of information in a written

message. Legend has it that he was at Princeton University at that time and talked about his result to John von Neumann, who told him that his

formula was nothing more than the formula for the entropy that was used

in statistical physics!

The main reason why this connection was a surprising one is that

Shannon found his formula by starting from some reasonable requirements

he thought a mathematical quantity would have if it was supposed to

measure lack of information. To understand better what is meant by lack of

information, think about a silly language where all texts are just sequences

of the letter A. The only information that you might not have is the actual

size of the message, but if you forget about that, you know everything you

need to know about any message! They are just sequences of As. There is

no missing information and, therefore, the entropy of a message like that is

zero.

Another useful way to understand missing information is in terms

of the average amount of surprise each time you receive a new symbol. In

the above message, there is no surprise at all. You always know that the


next symbol will be another A. Surprise, of course, is just another way to

describe the amount of information revealed to you by each new symbol.

Imagine now that we have our full Latin alphabet at our disposal, all 26 letters from A to Z, and some special characters like the blank space, the comma and the full stop. If we ignore the particular variations present in different languages, like accented letters, then each language corresponds

to a different set of rules to put the same symbols together. Shannon

considered these rules to be probabilities of one symbol coming after another. A forbidden sequence has zero probability. In English, for instance, the sequence djhfasdf strictly speaking does not

correspond to a grammatically correct word. Of course it has a probability

of occurring in a text, like it just did here, but it is a very low probability.

Clearly, the lower the probability, the greater the surprise when we receive

a word like that. But it happens so rarely that, on average, the amount of

surprise will not be very big as long as it appears just once in a while.

However, if we still insist on using a language where everything is

written just with an A, we still have no surprise at all in any message, no

missing information and therefore, according to Shannon's idea, zero

entropy. Notice that we still have not made the connection with disorder

here. Wait a bit more and we will get there.

What would be the situation in which we would get maximum

surprise when each new symbol arrived? Whenever we have a symbol

which is more probable than another, that messes up the average

(important word here!) surprise in the sense that we expect more of that

symbol to appear and it will. The only situation in which we cannot have

any expectations about the next symbol is when all of them have the same

probability! That is exactly the situation of maximum entropy! It also

means that each symbol contains the maximum amount of information

possible, because each new symbol is extremely important as we

cannot predict it.


This should remind you of our discussion on randomness and

predictability. When we have access to the probabilities of something, then

we can evaluate how random that object is by calculating the entropy

resulting from those probabilities. The entropy will give its highest

numerical value when all probabilities are the same and will give exactly

zero when one of the results has probability one and the others zero. This,

as you can imagine, is the case in which there is no randomness and the

result is completely predictable.
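These two extreme cases are easy to verify numerically with Shannon's formula. A small sketch in Python; the function name and the example distributions are my own:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p), taking 0 log 0 = 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

uniform = [0.1] * 10             # all ten outcomes equally likely
certain = [1.0] + [0.0] * 9      # one outcome is completely certain
skewed = [0.5] + [0.5 / 9] * 9   # one outcome favoured over the rest

print(entropy(uniform))  # log2(10), about 3.32 bits: the maximum
print(entropy(certain))  # 0.0 bits: no randomness at all
print(entropy(skewed))   # strictly between the two extremes
```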

Now, look at the following text with only As:

AAAAAAAAAAAAAAAAAAAAAAAAAA

And compare it with the following text containing every letter with the same frequency; in this case, each one appears only once, in a random order:

RUAEHKXNIFMVGJBZLWDOPSCQYT

Which one looks more disordered to you? For me, it is clearly the

second one, the one with the largest entropy. The first sequence full of As

could not be more organised. Think about a very organised bedroom. If

something is out of place, you can readily tell it, what would not happen if

the place was a mess. Consider that I make a small change to the first

sequence:

AAAAAAAAAAAAAAACAAAAAAAAAA

It should be quite easy to find out where I put the different letter.

What if I do the same with the second sequence?

Roberto C. Alamino

120

RUAEHKXNIEMVGJBZLWDOPSCQYT

A bit less immediate, isn't it? For a small sequence like that, you might still think that it is not too difficult, but imagine trying to find one typo in the Bible. The fact that the formula Shannon discovered by considering text messages is exactly the same as the entropy used by Josiah Willard Gibbs for general physical systems should be taken as a sign of something deeper. Nature usually does not provide such coincidences without some underlying connection between them. Today, after this connection has been scrutinised time and again, we finally understand that entropy is a measure of lack of information about a system.

There are many ways to appreciate the analogy between texts and physical systems. We can think of a particular text as the current state of a sequence of letters. Each particular sequence is a text-state. A chromosome, for instance, can be thought of as a kind of text with only the 4 letters A, T, C and G. Take a man's Y chromosome. Each man has a different sequence of letters (bases) in this chromosome, although the number of letters is basically the same. Different men have their Y chromosomes in different states.

Material objects can be seen as a sequence of atoms juxtaposed in some spatial order. In the same way as you can tell a story with different words, an object can have modifications in the arrangement of its atoms and still be the same object. For instance, you are the same person even though your body composition is changing at every moment. Entropy can be associated with the number of states something can have without changing some important macroscopic property of it, be it the essence of the story or the individuality of a person. You name the property. Of course, when you change things and some property does not change, we are led to consider symmetries.

As strange as this can sound, disorder brings symmetry. This is consistent with all we have learned before, as we have seen that symmetry both brings and is the result of a lack of information. Therefore, the larger the symmetry of a system, the higher the entropy associated with it.

It would be totally understandable if you have difficulty accepting that disorder brings symmetry. Our first impulse is always to associate symmetry with order, but that is a great misconception that I want to correct before we can go any further.

Go to your kitchen now, get a glass of clear water and observe it. The water looks the same everywhere. Now, if you add to it a drop of black ink, or of any other kind of coloured fluid, the point created by the ink is said to break the symmetry of the water. From that moment on, the water is not the same everywhere, because you can identify every point in the glass by its position relative to the ink drop. But if you wait long enough, the ink will mix with the water. If you are impatient, you can even stir the mix to help the process. The highly organised initial configuration, with the ink arranged in a small drop, will decay to a chaotic situation where the ink is everywhere. But now the glass is all the same once again! Disorder created symmetry.

The same idea works for our messages as well. You might right now be arguing that the text with only As is more symmetric than the one with all the letters. You would be right if the disorder in the text were associated with some geometric translation of the symbols, but remember that we associated the entropy of the text with the appearance of the different symbols. Changing the symbols is the relevant transformation here for evaluating symmetry. What happened when we changed only one symbol in the first sequence? It changed completely, which was reflected in the fact that we found the change very easily. When we changed one in the second sequence, the change was less obvious. If you do that in a larger text, it will basically remain the same. There you get your symmetry: you change something and it remains (almost) the same. I know it is hard to swallow, so keep thinking about it as much as you can.


Let us recollect things. Laplace's Principle says that we should not attribute different probabilities to results if we do not have any reason to do so. Not having any reason to do so, as we have seen, means that any states which are related by symmetries should be assigned equal probabilities. A system in which all possible states are related by a symmetry and cannot be distinguished has maximal symmetry; therefore its disorder is maximal and the lack of information is also maximal. This system, therefore, has maximal entropy.

Because the mathematical formula for the entropy involves only the probabilities of the possible states of a system, if we maximise this entropy, we then get the correct formulas for the probabilities! This is the general formulation of Laplace's Principle, or of Maximum Entropy if you prefer, that we were after. It allows us to attribute probabilities by using all the information about a system, and nothing else, for a far larger class of systems than we could before.

There are some small details that I have skipped. Clearly, symmetry is not the only kind of information we can have about a system. Suppose, for instance, that we are playing a dice game and we notice that something very strange happens: the average value of the results of our d10 seems to be fixed at 3. Even if we do not know how this happens, we can include this information in the Maximum Entropy formulation by introducing what are called constraints. Constraints are nothing more than pieces of information we know about the system. They are called this way because they are like rules that constrain the behaviour of the system. For those who are interested in the mathematics, this is accomplished by a technique called Lagrange Multipliers and you can find it, once again, in Jaynes's book (Jaynes, 2003). It is enough for us to know that this can be done, but we will not spend any time with the maths.
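As an illustration of how such a constraint works in practice, here is a hedged sketch. For a d10 whose average is fixed at 3, the Lagrange-multiplier calculation yields probabilities of the exponential form p(k) proportional to exp(-λk), and the value of λ can then be found numerically; the bisection search below is just one simple way to do that, and the variable names are mine.

```python
import math

faces = range(1, 11)  # the ten faces of a d10

def mean_for(lam):
    """Mean of the maximum-entropy distribution p(k) proportional to exp(-lam*k)."""
    weights = [math.exp(-lam * k) for k in faces]
    return sum(k * w for k, w in zip(faces, weights)) / sum(weights)

# mean_for decreases as lam grows (from 10 towards 1), so a simple
# bisection finds the lam whose distribution has average exactly 3.
lo, hi = -5.0, 5.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_for(mid) > 3:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

weights = [math.exp(-lam * k) for k in faces]
total = sum(weights)
probs = [w / total for w in weights]  # the maximum-entropy probabilities

print([round(p, 3) for p in probs])  # probabilities decay with the face value
```

The resulting probabilities are no longer uniform: low faces become more likely than high ones, which is exactly what is needed to pull the average down from 5.5 to 3.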

Let us make another pit stop and, one more time, appreciate the beauty of all this. Maximum Entropy, as a generalisation of Laplace's Principle, is a statement that the physical world must follow the rules of what we consider everyday logic. It encompasses a concept which, as we will see later on, is the basis of all human search for a rational description of the world. It guides us in finding probabilities for phenomena by using the information given by experiments in a maximal way, trying to avoid any kind of emotional bias (unless we really want to be biased). The most amazing fact is that all of this is related to a fundamental quantity of nature which was rediscovered many times in different contexts: entropy. This relation is just one of the connections between information and philosophy of science. We will see many more as we proceed.

FREQUENCIES

Encoding information in probabilities the way we did via Maximum Entropy is deeply connected to the classical way of calculating probabilities. We did not have to repeat an experiment many times to calculate probabilities. Our propositions could even concern non-repeatable experiments and nothing would change. Nothing prohibits us from assigning a probability to the proposition that the world will end tomorrow, which would be a highly non-repeatable experiment (without using some clever tricks).

Everything we have studied until now seems to be biased towards using only the classical method of obtaining probabilities, virtually marginalising the frequency method. In a sense, I confess I did that. So, in order to reverse this bias, let us talk about the relation between probabilities and frequencies. This association, in fact, will be very useful for visualising some examples. Still, you have to bear in mind that frequencies are particular cases; the view of probabilities as degrees of plausibility is more general, but in no way contradictory.

When one considers experiments that can be repeated more than

once, there is a mathematical theorem that guarantees that, if we repeat

them enough times, probabilities coincide with the relative frequencies of

the possible results. This theorem is the famous Law of Large Numbers, the

name being highly self-explanatory.


By the relative frequency of a certain result one understands the number of times that result occurred during the experiment divided by the total number of times the experiment was repeated. This definition immediately guarantees that the frequency is a number between zero and one, as we chose probabilities to be. Another obvious consequence is that, when results are mutually exclusive (no more than one at a time), the frequencies of the individual results add up to one in a straightforward way.

As it should be for everything to fit together nicely, the AND and OR rules also work for frequencies. In fact, it is when working with frequencies that these rules become easier to understand. Let us then see how they work. Just be careful not to forget that, although these rules have a nice explanation in terms of frequencies, they are also valid for general propositions which might not be repeatable.

Let us continue to refer to our d10 rolling game. Dice rolling is a good example for dealing with frequencies because, if we ignore the inevitable but tiny differences appearing each time we roll, we can consider the rolling a repeatable experiment.

Consider, for instance, the probability P(1 OR 2), which in our notation means the probability of getting either a 1 or a 2 in the dice rolling. We have seen before that, because 1 and 2 cannot appear at the same time, this probability is simply given by

P(1 OR 2) = P(1) + P(2) = 1/10 + 1/10 = 1/5.

We can then attribute a frequency interpretation to this probability by saying that the odds of getting either 1 or 2 are one in five. In fact, if you have ever been to horse or dog races, this is the terminology used in those places.

Now, the Law of Large Numbers guarantees that if you roll your

d10 enough times, the relative frequency of each face will approach 1/10.


This means that if you roll it N times, with N very, very large, and if you call n(k) the number of times that face k appeared, then the ratio n(k)/N will become close to 1/10 as long as the dice is completely symmetric and the rolling is fair enough. In mathematical notation, when a number approaches another, we use a little arrow to indicate it in the obvious way (as if one were going in the direction of the other). This means that "n(k)/N approaches 1/10" can be written as

n(k)/N → 1/10.

This should be valid for all faces if the dice is perfectly symmetric. If you have ever tried this kind of experiment, as I had to do in my first physics laboratory at university, you know that no matter how many times you throw the dice, the number of times you get each face will never be the same. The frequencies will become closer and closer for large N, but they are never exactly equal.
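That behaviour is easy to see in a quick simulation (a sketch of my own, with a fixed random seed so the run is reproducible):

```python
import random

random.seed(1)  # fixed seed so the run below is reproducible

def face_frequencies(n_rolls):
    """Roll a fair d10 n_rolls times; return the relative frequency of each face."""
    counts = [0] * 10
    for _ in range(n_rolls):
        counts[random.randrange(10)] += 1
    return [c / n_rolls for c in counts]

# The worst deviation from the exact probability 1/10 shrinks with the
# number of rolls, but it never becomes exactly zero.
for n in (100, 10_000, 1_000_000):
    spread = max(abs(f - 0.1) for f in face_frequencies(n))
    print(n, spread)
```

Running it shows the spread away from 1/10 getting smaller as the number of rolls grows, exactly as the Law of Large Numbers promises.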

This variation away from the exact result, which will be present in every actual experiment, is called, quite appropriately, by the name variance. Variance is a term which indicates fluctuations away from some average value. In this case, we can consider the exact probabilities as a kind of average value, as frequencies always are. Usually, the variance decreases with N, but sometimes it does not. The cases in which the variance refuses to go down are those in which the Law of Large Numbers does not work and it becomes more difficult to work with frequencies directly. It does not mean that we cannot find clever tricks that will still allow us to work with frequencies, but the more tricks you use, the less distinguishable from the Bayesian approach it gets.

For those who like economics, variance is the same as volatility. When market indicators vary too much, generating those graphs which look like rugged surfaces instead of smooth curves, economists say that the volatility is high. The ruggedness is nothing but fluctuations away from average smooth values, in other words, variance. The picture below shows two series of 50 numbers generated using a certain random rule such that the average value of each series is zero. The difference is that the blue sequence has a higher variance than the red one, meaning that the blue numbers are more spread away from the average value (in this case zero) than the red ones.

Two sequences of 50 numbers generated using the same rule with the only difference

being the variance. Both sequences have as their average value zero, but the blue

sequence has a variance equal to 3 while the red sequence has a variance equal to 0.05.
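The text does not specify the random rule used for the figure, so here is a hypothetical reconstruction using Gaussian random numbers with the same two variances:

```python
import random

random.seed(42)  # reproducible; the Gaussian rule here is my own choice

# Two sequences of 50 numbers drawn from the same rule (a Gaussian of
# mean zero); only the variance differs, as in the figure's description.
high_var = [random.gauss(0, 3 ** 0.5) for _ in range(50)]    # variance 3
low_var = [random.gauss(0, 0.05 ** 0.5) for _ in range(50)]  # variance 0.05

def sample_variance(xs):
    """Average squared deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(sample_variance(high_var))  # fluctuates around 3
print(sample_variance(low_var))   # fluctuates around 0.05
```

Note that even the measured variances fluctuate around their exact values, for the same reason frequencies fluctuate around probabilities.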

Let us assume now that we have been rolling our d10 long enough to have variances so small that they can be safely ignored in practice. When we reach that point, we can simply say that

P(k) = n(k)/N.

Can you guess now how we would attribute frequencies to the probability of either 1 or 2? We simply count the number of times in which we got either 1 or 2. This obviously amounts to n(1) + n(2) (remember that 1 and 2 never appear at the same time). To get the corresponding probability, we calculate the relative frequency by dividing it by the total number of repetitions, which gives us

P(1 OR 2) = [n(1) + n(2)]/N = n(1)/N + n(2)/N = P(1) + P(2).


The AND rule is similar. We just need to count how many times both 1 and 2 appear at the same time. For our d10 rolling game, this will obviously be zero.

Frequencies are very closely associated with the Kolmogorov way of defining probabilities via sets. It is not very difficult to see that counting the relative ratios of elements in a set is completely equivalent to counting frequencies. Just imagine that you write down on a piece of paper each result of your experiment and put them all together in a bag. The bag is the equivalent of the set and the written papers are the equivalents of the elements of the set.

As you can see, thinking in terms of frequencies is really easy, useful and absolutely fine as long as you understand the limits of what you are doing. Because it is such a straightforward way to visualise probabilities, most scientists are extremely attached to the frequency way of thinking. The devil is hidden in one detail: what exactly you consider to be a repeatable experiment. It is clear that no experiment is exactly repeatable, but if we require only a finite precision and allow for some unimportant variations, this can be roughly achieved most of the time.

We did that with the dice rolling. We ignored dice imperfections and differences in throwing styles. Physicians do exactly that when they tell someone the odds that some disease will kill a person. They count the number of people who died from the disease and divide by the number of people who contracted it. What they are doing is considering one repeatable experiment, "catching the disease", with two mutually exclusive possible results: survival or death.

When they do that, however, they are clearly ignoring age, social conditions, genetic predispositions and many other differences in all possible parts of this process. The frequency they get is not useless, it gives some information, but one must be aware that this information is only approximate. Because each person is different, the proposition "John died from that disease" is not a repeatable experiment, as John can only die once and there will never be another John exactly equal to him. Still, by compromising on some details, one can extract some information from the frequencies of similar experiments. This does not mean that physicians do not know about the differences. The good ones do. So much so that, if you push them a bit, they might be able to tell you the death rate according to age, social conditions and so on.

The assumption when we use frequencies is always that similar experiments should give similar probabilities. The trick is to know exactly where the dissimilarities hide and to understand when and how they can be ignored. This depends, of course, on information about the details of the experiment. One way or another, be it via Maximum Entropy or via frequencies, we are once again trying to find a way to encode information in the form of probabilities.


6.

It Depends...

CONDITIONED INFORMATION

In the same article in which he mused about the role of mathematics in nature (Wigner, 1960), the physicist Eugene Wigner wrote the following observation (the underlining was added by me):

(...) THE LAWS OF NATURE ARE ALL CONDITIONAL STATEMENTS

AND THEY RELATE ONLY TO A VERY SMALL PART OF OUR

KNOWLEDGE OF THE WORLD. (...) IT SHOULD BE MENTIONED,

FOR THE SAKE OF ACCURACY, THAT WE DISCOVERED ABOUT

THIRTY YEARS AGO THAT EVEN THE CONDITIONAL STATEMENTS

CANNOT BE ENTIRELY PRECISE: THAT THE CONDITIONAL

STATEMENTS ARE PROBABILITY LAWS WHICH ENABLE US ONLY TO

PLACE INTELLIGENT BETS ON FUTURE PROPERTIES OF THE

INANIMATE WORLD, BASED ON THE KNOWLEDGE OF THE PRESENT

STATE. THEY DO NOT ALLOW US TO MAKE CATEGORICAL

STATEMENTS, NOT EVEN CATEGORICAL STATEMENTS CONDITIONAL

ON THE PRESENT STATE OF THE WORLD.

The above excerpt is more than 50 years old and it contains a statement about one of the basic foundations of the Bayesian interpretation of probability. I am not aware of how much Wigner knew about Bayesian inference, but he correctly identified the important fact that every piece of information we know, which includes what we call the laws of physics, is conditional on some other, previously collected amount of information. I have repeated several times by now that the probabilities we assigned for our d10 work only under the condition that the dice and the way it is thrown into the air are both fair, where we defined "fair" as a condition that summarises a series of more detailed rules and specifications that are required to hold in the set-up of the whole experiment or game. If the specifications determining the initial set-up (i.e., the information we start with) change, then we would have to revise our probabilities (i.e., the information we deduced about the possible results of the dice rolling).

Let us keep everything as general as possible and deal once more with (possibly non-repeatable) propositions. Consider the proposition A = "The temperature tomorrow will be below zero degrees Celsius." We already know that we can assign a probability P(A) to this proposition and we also know that this probability will be different if the conditions under which we calculate it are different, just as in the case of our dice. For instance, common sense dictates that P(A) should assume different values in Brazil and in Antarctica.

Most of the time, the conditions under which a probability is calculated are not included in the notation P(A), where I am using the symbol A as a placeholder for any proposition. That happens mainly because we usually know what the conditions are and do not need to be reminded of them all the time, but also because we want to save space and writing time.

Sometimes, however, it becomes important or convenient to write down explicitly some of the conditions on which a probability depends. When we want to do that, we use the symbol | and call it the conditional operator. In the same way as the AND and OR operators, the conditional operator connects two propositions, but this time in a slightly more complex way. If propositions A and B are connected by it, in a way that we will see very soon, we write the combined proposition as A|B and call it a conditional proposition. This name is just a way of making explicit the fact that we are considering some condition, but always keep in mind that, as stated by Wigner in his article, all propositions are actually conditional, even if this is not explicitly said or written. Accordingly, whenever we have a conditional proposition, the associated probabilities are called conditional probabilities.

Let us use the example of the temperature to understand how the conditional operator is actually used. The proposition about tomorrow's temperature depends on several pieces of information which we usually take for granted, like the definition of temperature and what day we mean by "tomorrow", but, as we have seen, it makes no explicit reference to where that phenomenon is going to happen. If you are talking to a friend, the place the proposition refers to might even be implicit, but let us assume that it is not. As I said before, if we change the place, the probability should change accordingly.

In order to include the information about the place, we start by writing it as two different propositions, B = "We are in Brazil." and C = "We are in Antarctica." We can then use the conditional operator to construct two different conditional propositions in the following way:

A|B = "The temperature tomorrow will be below zero degrees Celsius given that we are in Brazil."

A|C = "The temperature tomorrow will be below zero degrees Celsius given that we are in Antarctica."

The conditional operator | is usually read as "given" (or "given that", when it makes more sense grammatically), and A|B is read as "A given B". A|B is then a conditional statement and the probability P(A|B), the probability of A given B, is a conditional probability. Mathematically, a conditional probability can be calculated if we know both the probability of A AND B and the probability of B irrespective of A. It is then given by dividing the former by the latter:

P(A|B) = P(A, B)/P(B).


Here is a place where thinking about probabilities as relative amounts of things inside sets, as used in Kolmogorov's Axioms, can be helpful for visualising the meaning of this equation. Once again, if you do not feel very comfortable with formulas, you might want to skip this in a first reading, but I would strongly advise you to come back to it after some time.

The above formula for conditional probabilities is, of course, valid not only for our propositions about the climate, but for any two propositions. This can include things like dice rolling or coin tossing, numbers in the lottery or the chances of having a baby girl and a baby boy in a row.

To understand this, consider two kinds of children's LEGO bricks: call the first kind "squares" and the second kind "rectangles". As everybody knows, LEGO comes in many different colours, but we will consider only two, say red and green. The picture represents a box with a total of 6 bricks: 3 of them are red and 3 are green, 3 are squares and 3 are rectangles.

Although I briefly said that relative frequencies and sets are related (but be aware that they are not the same!) in probability theory via Kolmogorov's Axioms, I never really explained the connection in detail. This is a good place to do that.

This is a good place to do that.

The link between these two things can be made by imagining a

simple experiment, namely, inserting our hand inside the LEGO box and

picking a certain brick from it with our eyes closed. As long as you put the

brick back in the box after you look at it, this is clearly a repeatable

experiment in which all the initial conditions are roughly the same. This

implies that we can count the number of times a certain result which kind

of block we picked is obtained. Using a bit of common sense, which

means that we will throw away any irrelevant information, we can envision

a random drawing and agree that it makes sense to say that the probability

of picking a brick with certain characteristics must reflect the ratio

between how many bricks have that characteristic and the total number

of bricks in the box.

Because we are considering the LEGO box as a set, we can then say that each LEGO brick is an element of that set. If we draw the bricks fairly enough (with all the implications and conditions you should by now have learned to consider in the back of your mind), we can assign an equal probability of being picked to each one of them. Because we have six bricks, we assign a probability of 1/6 to each one. These results are obviously mutually exclusive, as we have already agreed that we can only pick one brick at a time. We can attach to our bricks numbered labels ranging from 1 to 6. Mutual exclusiveness then means that the probability of picking, let's say, either brick 1 OR brick 2 is, according to our rules, 1/6 + 1/6 = 2/6 = 1/3. For all practical purposes, if we ignore colours and shapes, this part of the game ends up being basically the same as a d6 rolling. But when we consider the added information about these two properties, things obviously change.

Let us call P(green) the probability of picking a green brick. It does not matter whether it is a square or a rectangle; only the colour is important in this case. After all we have said up to this point, it clearly makes sense that, because there are 3 green bricks out of 6, this probability should be given by the fraction 3/6, or in other words, P(green) = 1/2. We can arrive at this result also by using the OR operator. Picking a green brick is equivalent to picking either brick 2 OR brick 3 OR brick 6, which by our rules should give the result 1/6 + 1/6 + 1/6 = 1/2.

The same reasoning gives the probability of picking a red brick as P(red) = 1/2 (brick 1 OR brick 4 OR brick 5), of a square brick as P(square) = 1/2 (brick 4 OR brick 5 OR brick 6) and of a rectangular brick as P(rectangle) = 1/2 (brick 1 OR brick 2 OR brick 3). The fact that each one of these probabilities is numerically equal is a coincidence. It happened because the ratios between the number of bricks with each particular property and the total number of bricks are the same in this case. Most of the time, probabilities like these will be different.

Because probabilities are very subtle things, it is worthwhile stopping a bit at this point in order to make an important observation. The reason we could assign ratios to each one of the above probabilities was that we assumed that each brick had the same probability of being picked. If we want to calculate the probability of picking a brick with a certain property, we just count the number of bricks with that property, let's call it n, and divide it by the total number of bricks N (in this case, N = 6). The probability would then be

P = n/N,

which is misleadingly similar to assigning probabilities as frequencies! But be careful! They are not the same! The above probability is not calculated by repeating the experiment and it is exact, without any fluctuation involved. So remember: ratios are different from frequencies. There is a connection, which is made, once again, by the Law of Large Numbers. If we repeat the brick-picking experiment a large number of times, the relative frequencies should approach the above ratio, but there will always be fluctuations in the actual measured frequencies.
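The distinction can be made concrete with a short simulation of my own. The brick numbering below follows the assignments given above (red: 1, 4, 5; green: 2, 3, 6; squares: 4, 5, 6; rectangles: 1, 2, 3):

```python
import random

random.seed(7)  # fixed seed for reproducibility

# The six bricks, numbered 1 to 6 with the colours and shapes used in the text.
bricks = {
    1: ("red", "rectangle"),
    2: ("green", "rectangle"),
    3: ("green", "rectangle"),
    4: ("red", "square"),
    5: ("red", "square"),
    6: ("green", "square"),
}

# The ratio is exact: 3 green bricks out of 6, no fluctuation involved.
ratio_green = sum(1 for colour, _ in bricks.values() if colour == "green") / len(bricks)

# The frequency is measured: draw with replacement and count the greens.
n_draws = 100_000
greens = sum(1 for _ in range(n_draws)
             if bricks[random.randrange(1, 7)][0] == "green")
freq_green = greens / n_draws

print(ratio_green)  # exactly 0.5
print(freq_green)   # close to 0.5, but it fluctuates from run to run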


With that distinction cleared up, let us proceed, in the most ancient tradition of probability theory, with a game. Suppose that I am playing a guessing game with a friend and say to her: "I've just picked a red brick; is it a square or a rectangle?" If my friend knows the number of bricks of each kind, her best guess will obviously be a square, because there are two red squares while there is only one red rectangle. Instinctively, she is calculating the conditional probabilities P(square|red), the probability of a brick being square given that it is red, and P(rectangle|red), the probability of a brick being rectangular given that it is red, and then choosing the largest one. We would all do that, even if we did not know anything about probability theory (which now you do). It is reassuring when the theory gives sensible results when used in practice, as sometimes that might fail to happen.

As you can see, it makes a lot of sense to consider the ratios of bricks inside the box relative to a certain kind. For instance, out of a total of 3 red bricks, 2 are squares and 1 is a rectangle. Therefore, the ratio of squares among the reds is 2/3 and that of rectangles is 1/3. The highest probability case is obviously that of squares. The formula for conditional probabilities that we saw before does exactly this calculation automatically. We need that formula because we do not always have this clear picture of ratios when we are dealing with general propositions. The formula does the job of encoding this procedure in such a way that we can still find the correct result even when we cannot visualise sets or ratios.

It remains for us to see that the formula actually works for our LEGO box. Suppose we start by calculating the conditional probability of our brick being square given that we know it is red, or P(square|red). According to the formula for conditional probabilities,

P(square|red) = P(square, red)/P(red),

remembering that we are using the comma as an alternative notation for the AND operator between two propositions. In the above case, P(square, red) is the probability of picking a square AND red brick. Of course, the formula is only useful if we know the probability of the combined case. In technical language, when we put together two events with the AND operator, it is common to call the resulting probability the joint probability of the two events. In the above formula, we then have the joint probability of square and red. Notice that here the two cases are not mutually exclusive, as happened with the faces of a dice. Colour and shape are not exclusive properties of the bricks, because every brick has both.

For our LEGO box, it is easy to calculate the required joint probability. There are only 2 bricks out of the 6 which are square and red at the same time, namely bricks number 4 and 5. Therefore,

P(square, red) = 2/6 = 1/3.

We have already found the probability P(red) = 1/2 for a brick to be red and, therefore, we can put everything together in our formula as

P(square|red) = P(square, red)/P(red) = (1/3)/(1/2) = 2/3.

This is, fortunately, the same result we found before. For this simple case, we can check that the answer is correct right away by inspection. Because probabilities should add up to 1, and there are no other shapes besides square and rectangle, it is obvious that the above value implies P(rectangle|red) = 1/3, which we know is the right answer.
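The same check can be done by brute-force counting in a few lines (a sketch; the brick table follows the numbering used above, and the helper function is mine):

```python
# The brick table from the text: bricks 1 to 6 with their colour and shape.
bricks = {
    1: ("red", "rectangle"),
    2: ("green", "rectangle"),
    3: ("green", "rectangle"),
    4: ("red", "square"),
    5: ("red", "square"),
    6: ("green", "square"),
}

def prob(predicate):
    """Probability as a ratio: bricks satisfying the predicate over all bricks."""
    return sum(1 for b in bricks.values() if predicate(b)) / len(bricks)

p_red = prob(lambda b: b[0] == "red")                      # 3/6 = 1/2
p_square_and_red = prob(lambda b: b == ("red", "square"))  # 2/6 = 1/3

# The conditional-probability formula: divide the joint by the condition.
p_square_given_red = p_square_and_red / p_red

print(p_square_given_red)  # 0.666..., that is, 2/3
```

Counting ratios directly and applying the conditional-probability formula agree: both give 2/3.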

Once we understand what is behind a formula, we can start to trust it (at least a bit). Back to our climate example: there we do not have such a clear way to visualise the meaning of the conditional probability formula. But there is one thing we know. If there is any way to calculate the required probabilities, and if this way is correct, then we should arrive at the conclusion that

P(A|C) >> P(A|B),

where >> is just the way we physicists say that one thing is much larger than another, not just plainly larger, which is expressed by the more common symbol >. This means that the probability of having subzero temperatures in Antarctica must be much higher than in Brazil. If after all we did we got a different result, something must be very wrong. The information necessary to use the formula in this case is more difficult to obtain. It must be inferred from measurements. But you can rest assured that, when all the data is correctly gathered, the formula keeps working.

EVERYTHING IS SUBJECTIVE

We reached a point in which we managed to construct

probabilities by encoding information using a mixture of common sense,

logical deduction and experience. In the process, we learned that every

probability we calculate is conditioned on what we know about the

situation. This conditioning is the defining property of what has been

conventionally called the Bayesian interpretation of probability, or simply

Bayesian probability.

Although everything seems extremely reasonable, what we have

done has been repeatedly called the subjective interpretation of

probability, a term which carries a partly pejorative connotation. The

reason for this is the idea that, because every probability we calculate is

conditioned on the information we have about something, we are not

calculating an objective property of a system, but only a subjective point of

view about it. Persons with different information about a system will

calculate different probabilities. Change the proposition after the symbol

| and you change the value of the probability. Think about the LEGO

bricks.

A huge number of scientists feel very uncomfortable with this. It

seems nice that we have deduced the probabilities for the faces of our d10,

but where exactly is the connection with reality? How do we check if our

inferred probability is really correct? Of course, we have seen that we can


always connect our calculations with actual experiments using frequencies

if we allow for some degree of imprecision. However, many scientists and

mathematicians go as far as to say that the only real probabilities are

frequencies. This is the essence of the so-called frequentist interpretation

of probability, also called by its followers the objective interpretation of

probability, in a clear attempt to make it look like the only sensible

interpretation when compared to the Bayesian one. In this version,

probabilities can only be defined by frequencies of experiments, not by

encoded information. This is an appealing interpretation for natural

scientists as probabilities become measurable objective properties of some

physical system. If you are naive enough to not think too deeply, this would

seem the correct thing to assume.

It is not rare to hear people even arguing that the frequentist

approach is the correct interpretation for probabilities because frequencies

are objective and the Bayesian view is subjective, and the word subjective

is a blasphemous word in science. Although it is understandable that

scientists have some aversion to that word, especially after the post-

modernist pseudo-science madness, discarding something because of the

name it was given is nothing but a logical fallacy.

In order to see where the above arguments against the Bayesian

interpretation actually fail, we need to identify where and how subjectivity

enters our derivation of the probabilities. Let us do this for our friend the

d10. What people point to as an element of subjectivity in that derivation

comes from the fact that we chose the pieces of information we wanted to

encode in our probability and someone with different choices of

information would arrive at a different probability. Therefore, frequentists

claim, the calculated probability is not objective in the sense of not being a

property of the system.

First of all, let us consider the issue of different persons with access

to different information calculating different numbers. After that, we will

talk about the issue of choosing which information to use.


Consider two persons, each one within an inertial frame moving with

relative speed v = 0.9c with respect to the other. In physics language, this is

used to indicate that the relative speed of the two observers is 90% of the

speed of light, which we call by the letter c. Relativity implies that if one

person measures a constant electric field in her frame, the other one will

measure that field as being a mixture of an electric and a magnetic field.

How subjective is this description of physics? Is the magnetic field

measured by one of the persons less real than the electric field? They are

different descriptions, both valid, of the same physical situation. What

causes the difference?

The different descriptions happen because the two observers use

different information to calculate things under different conditions, but if

we have two persons in the same frame they both will use the same

information and arrive at the same description. Even more than that, if the

person in one reference system knows exactly what information is

available to that in the other, this person can calculate (if she knows

enough physics) what the latter will describe and they will both agree if

they are using relativity theory correctly. This agreement, in fact, is the

defining property of objectivity of a physical model (we will talk more

about that in the next section).

Things get even worse when you go to general relativity, the

theory of gravity that incorporates the principle of relativity. The weirdest

example is of the physical description of a black hole. If you imagine two

persons, one very far from a black hole and the other falling down on it,

their descriptions of what happens are completely different. For the distant

observer, the person who is falling towards the hole never actually reaches

the event horizon, which is the surface around the hole beyond which, once

anything passes it, it can never escape. But wait! If the person never reaches the

horizon, how can she pass it and never return? This is where things get

strange. For the person who is falling, she will indeed reach the horizon at

some moment in time, but will never feel anything different when she

does! But, once again, given the same information, both observers can


actually calculate correctly what the other one will be seeing. Admittedly,

things become more complicated when one adds quantum mechanics to

this situation, but that is not the point here.

Bayesian probabilities are like the relativistic descriptions above.

Two persons with the same information always calculate the same

probabilities. It might be acceptable to say that they are subjective in the

sense that persons with different information arrive at different

probabilities, but then you have to admit that relativity, and actually all of

science, is subjective too. I am personally not completely averse to the

word subjective, but I have to admit that, given the margin for

misinterpretation, it is a very inconvenient term to keep in scientific

books.

It is worth repeating here once again that associating frequencies

to probabilities is not contrary to associating information to probabilities.

When one tabulates frequencies, one is simply collecting information

about an experiment or a system. In fact, counting frequencies is a way of

doing inference and, as we mentioned some time ago, Bayesian

inference will account for including this new information in our

probabilities. But if you remember the Law of Large Numbers, frequencies

can only approximate some ideal property of the system. An extrapolation

is always necessary and any extrapolation requires information coming

from outside the experiment.

If you start to roll our d10 and count the frequency of each face, at

some point you will see that the frequency of every face is almost the same

but not quite so. You might arrive at the result that the frequency of 1 is

0.1013 while that of 5 is 0.0989. That is when most of us will do the jump

and see the differences in the frequencies as irrelevant. But that

information, the one that says that we are allowed to do this jump, is not

coming from that experiment alone, but from bits and pieces of

information that we know from everything else we have ever

experienced before!
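This jump is easy to see in a simulation. The sketch below rolls a simulated fair d10 a hundred thousand times; the frequencies come out close to 1/10 but never exactly equal, and deciding that the leftover differences are noise is exactly the extrapolation described above:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

n_rolls = 100_000
counts = {face: 0 for face in range(1, 11)}
for _ in range(n_rolls):
    counts[random.randint(1, 10)] += 1

frequencies = {face: count / n_rolls for face, count in counts.items()}

# Every frequency hovers near 0.1, but none of them is exactly 0.1.
# Treating the small differences as irrelevant is information brought
# in from outside this particular experiment.
for face in sorted(frequencies):
    print(face, frequencies[face])
```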


The second issue concerns the choice of the information which you

want to encode in your probabilities. Yes, that sounds arbitrary and person-

dependent, but so is every experiment we carry out. In every experiment

there is an infinite number of things you might measure and an infinite

amount of choice in what precision is relevant and what can be thrown

away. We discussed that before. Choosing what to encode is equivalent to

making a decision about what is relevant to the situation which you are

interested in understanding. Making those choices is, in the end, the only

sensible way to do science. But when two persons come together and

agree on what is relevant, they must get the same probabilities.

OBJECTIVITY AND CONSISTENCY

Although you will find a lot of physicists subscribing to the philosophy

known as Shut-Up-And-Calculate, the one saying that you do not need to

think too much about the philosophy behind a phenomenon as long as you

can do the necessary calculations, that is a narrow-minded attitude that we

cannot assume if we want to better understand the fundamentals that

underlie what we are doing. Without that, we are not much better than

simple spreadsheets whose only function is indeed to shut up and

calculate.

From the discussion of the previous section, many of you must now

be feeling a bad aftertaste in your mouths. The subjective versus objective

matter does not seem to be completely settled when we think hard enough

about it. If we can only deal with the information we can access about

something, what would we really call an objective property of a system? Or

am I conceding victory to the infamous post-modernists and saying that

there is indeed no objective reality?

These questions, like any really deep questions, are not settled

among scientists and philosophers. That is because even the meaning of

the words themselves is a bit uncertain. Most people will associate the

word objective with the word real in the sense that most scientists,


myself included, will claim that science is the search for the rules

underlying an objective reality. But the problem is that reality is probably

the most complicated concept to define in philosophy. You might think that

the answer is obvious (something is real if it exists), but that is just

carrying the problem to the next level. What do we mean when we say that

something exists? If you think carefully, you will see that you begin to walk

in swampy terrain sooner or later.

Most scientists assign an element of reality to sensorial models of

nature. Sensorial is a very important word here because that is what you

do as well, and also what I do if I am distracted and relaxed. Think about it.

Almost all things (if not all) that you are sure are real have a sensorial

model inside your mind. If you are able to see, the odds are that these

models are in their great majority visual. Even those things you cannot see,

like sound, microwaves and atoms have a visual representation in your

mind. All of that is associated with some picture. You should be right now

imagining them. If you have been blind since birth, odds are that you have

different mental models associated with the other senses you possess.

They are probably associations with sounds or tactile sensations. All our

intuition about reality is based on models our brains create by

amalgamating information from our biosensors, be they eyes, skin, ears

or any other.

When one can detect things directly through one of these sensors,

it is easy to say that what one is detecting is real. What about things like

light polarization? By using polarization filters, we can block or let pass light

with a certain property that we call by convention polarization. The black

3D glasses (not the red and blue ones) are nothing but two polarization

filters, each one letting light from a different polarization pass through it.

The problem is that we cannot actually see or sense directly with

our naturally evolved sensors the polarization of light. What we can detect

is the presence or absence of light after we use the filter and then we

associate the polarization with this light/shadow result. Can we say that

polarization is real? Would you answer this with a No, because we cannot

detect it directly? Although we cannot, bees actually can detect

polarization directly. They have naturally evolved sensors in their eyes that

can instantly detect it. Does it make any sense to say then that polarization

is real for bees and not real for humans? I am not sure about you, but for

me that sounds like nonsense. The sensible solution is to admit that things

that we can detect indirectly must exist as well, as long as we can measure

their effects.

Another example is that of colours. We all see colours and, indeed,

we today can associate colours with the frequency of the light we are

detecting. Each frequency, within some precision, is interpreted by our

brains as a different hue. However, nobody is exactly the same and the

precision with which each person perceives different colours can vary

according to details of each person's biochemistry. Even today, I still

disagree with some old friends concerning the colours of public telephones

in my city more than twenty years ago. Some of us say they were yellow,

others that they were green.

Public telephones in São Paulo, Brazil: Yellow or green? Don't rush, take your time.

The extreme case is that of persons who are colour-blind. It is not

that a colour-blind person sees in black and white, but due to a biochemical


glitch, they cannot correctly tell apart some colours. Consider a population

in which half of the persons are colour-blind. If you give this population a

colour-chart and ask how many colours exist on it, half of the population

will disagree with the other half. How do you know which half is right? How

do you really know that the non-colour-blind half is not imagining colours

that do not actually exist?

The solution here can only be given in terms of consistency.

Consistency is the property of not having logical contradictions when

comparing information. The main argument when deciding on the issue of

colours is to say that if the non-colour-blind half is imagining the colours,

how can everyone imagine them in the same way? The solution which has

no logical contradictions, which is consistent, is the one in which it is the

colour-blind people who cannot see some of the colours the other half can.

But we can go even further once we have a theory that can associate the

frequency of the light with the colour. We can vary the frequency of a light

ray in an experiment in a systematic way and check that the colour-blind

people will associate the same colour to different frequencies, while the

other half will consistently attribute different colours to different

frequencies in the same way. This adds another level of required

consistency which is only satisfied by the description in which the different

colours seen by the non-colour-blind population actually exist.

Consider the examples about relativity I gave in the previous

section. In particular, the black hole one. The distant observer would see

the in-falling observer (the one falling into the black hole) never cross the

horizon, while the in-falling observer will think that she does, but feels

nothing special when doing that. Which description is right in this case?

The answer is that both are right from their points of view. That is because

all experiments that the distant observer can do will always be consistent

with her description, while all experiments that the in-falling observer can

do will also be consistent with her description!

This example concerning black holes is known as the black hole

complementarity principle. It should work for classical (non-quantum)


relativity, but when issues concerning black hole evaporation are

considered, which happens only when quantum mechanics is added to the

description of the phenomenon, it is not clear that things really happen

that way. However, purely classical relativity is a consistent model and, in

this case, the complementarity should be valid and, as we have seen

before, the two descriptions are also consistent in the sense that, if you

describe the whole situation for both observers in the same way, they can

calculate what each one of them will measure and these calculations will

coincide.

What happens is that, although each observer is in some sense

describing her own point of view in a different way, the whole physical

description is free of contradictions and the two observers arrive at the

same conclusions about each other if given the same information. Both

descriptions are consistent in the sense that there are no logical

contradictions. All measurements that can be done can be correctly

calculated by, let us say, a computer program fed with the same data. That

is all we can ask of these descriptions. In the case of (classical) relativity, we

have a consistent theory given by a mathematical description of the

situation. The same happens with our example of polarization.

Electromagnetic theory can describe it consistently and that is all we can

ask.

At this point I might have convinced you that things that are

indirectly measured should be considered real because they are part of a

consistent description of nature. Therefore, things like polarization, the

gravitational field and electromagnetic waves should all exist because they are

part of consistent theories. Not only internally consistent, but these

theories are all consistent when compared with one another. That solves

the issue, right?

Well, that is when things start to get blurrier. Because our

biosensors are very limited, there are many more things that we cannot

detect directly than things we can. Science has used the power of our

brains to create mathematical models describing all those things that are


reachable by indirect detection. But what happens when all information

collected about a phenomenon leads to two mathematical models which

do not differ in their measurable predictions, but do differ in their internal

structure? Which model is real and which is not? Or can we say that both of

them are real at the same time?

What I am calling the internal structure of a model is the set of

mathematical entities it uses to describe the data of experiments.

Electromagnetic waves and fields are part of the internal structure of

electromagnetism. A spacetime fabric that bends and stretches in the

presence of matter and energy is part of the internal structure of relativity

(Greene, 2005). Energy is part of the internal structure of mechanics.

Two theories or models (the words are interchangeable) differing

in their internal structure might use different concepts. For instance, one

might not make use of the idea of energy in its description of a

phenomenon. What if, when measurements concerning the phenomenon

are done, both always predict the same results? If one of the theories uses

one concept and the other does not, what theory (and by extension what

concept) is real?

The first thing that might come to your mind is to invoke Occam's

Razor. This principle says that, if you have two valid descriptions of the

same phenomenon, you should keep the simplest one. In this sense, we

would be advised to check which theory is the simplest of both and throw

away all but the least complicated model. The problem is that, as helpful as

it is in practice, Occam's Razor is just a guideline; it cannot be used to

actually state that the simplest theory is the one which is real while the

others are not. It cannot separate the right theory from the wrong one. It is

a principle of practicality, but not of reality.

If you are interested in doing calculations and shutting up, then

Occam's Razor might be enough for you, but it cannot answer the question

concerning the reality of either theory. Most scientists, contrary to

philosophers, would dismiss the question altogether. But lack of interest is


not an excuse in this case. We must at least know if the question is sensible

before we discard it!

One of the possibilities is that, when we put all theories of nature

together, there is only one description which will be fully consistent. It

might be that, whenever we have two different theories that describe

some piece of nature equally well, one of them will end up being

inconsistent with other natural phenomena. At this point in time, what

might be happening is that we still cannot single out one theory among the

others simply because we do not know all natural phenomena yet. That is

indeed possible and many physicists have hopes that this is the truth.

However, the prospects for that are not good. In fact, if (and this is a big

if) some hypotheses about high energy physics turn out to be correct, we

might be forced to admit that two different descriptions of nature, with

different elements, cannot be ruled out! This is such an important thing

that it deserves a digression.

THE HOLOGRAPHIC UNIVERSE

There are many problems in physics that have not been solved yet. Our

current theory of gravity and mechanics for things that are large and heavy

is general relativity and it is consistent when we do not try to apply it for

too small things. When things get small, we use a theory called quantum

field theory, which is a sophistication of quantum mechanics that includes

some aspects of relativity. However, in phenomena where these two

theories meet, they are unfortunately inconsistent. Black holes, as I

mentioned briefly before, are an example of phenomena where this

happens. This obviously means that one or both of them are not

completely right, only approximately right.

Currently, we do not know the solution, but we have some working

hypotheses which are, however, not confirmed yet. You surely have heard

about string theory and might or not have heard about loop quantum

gravity. There are others, which differ in their details. They all result from the


attempt to find a consistent description of phenomena where general

relativity and quantum field theory meet.

It turns out that, from many works on these theories and in these

borderline problems, some hints of what we call a duality have

appeared. A duality is a connection between two things in which, if you

know one of them, the other is completely defined. You can say in a sense

that numbers of the form x and 1/x are dual as long as they are not zero

(or if you allow infinity as a number), although this is not very conventional.

The point is that there is a very precise procedure to find one if the other is

given.

The duality that seems to exist in physics is not completely proven,

but it is a conjecture against which there is no counterexample yet,

although it must be said that it is difficult to find many complete examples

as well. This duality is called the Holographic Principle. In its most technical

realisation, it is known by the strange name of AdS/CFT Correspondence,

or the Maldacena Conjecture in honour of the Argentinean physicist Juan

Maldacena who first found a mathematical model where it was satisfied.

The principle states that we can describe all physics either by a

theory of quantum gravity in 3 spatial dimensions and 1 time dimension or

by a theory of fields which does not contain gravity and has 2 spatial

dimensions and 1 time dimension. The holographic part of the name

comes from the fact that it states that all the information needed to describe the 3D

(excluding the time) world in which we live is contained in a 2D surface,

which according to the principle is in fact the boundary of that same 3D

space.

In order to understand more or less how this works, consider a

sphere such as the one represented in the picture below.


Imagine that this is a glass sphere completely filled with jelly. The

jelly is the interior of the sphere and is called its bulk. The glass,

corresponding to the surface of the sphere, would be its boundary. The

correspondence says that everything that happens in the bulk can be

described by a theory that concerns only the boundary.

Because the principle is a duality, both descriptions are exactly

equivalent. Neither is better than the other. In this case, even Occam's Razor

cannot save us as some calculations are easier in one theory and harder in

the other. There is no way to decide which one is the simplest.

You might say: "Okay, let us admit that both theories are real if they

are consistent. What is the problem then?" Answer this question: is gravity

real? Gravity exists in the bulk theory (the 3D-space one) but not in the

boundary theory. Each of these theories, although they are dual, can stand

apart on their own entirely. What would you say then?

Again, of course, it might happen that this is a false duality and in the

future a counterexample will be found, but there is a possibility that it

stands up to experimental validation. What do we do if it does?

Should we give up on reality, objectivity and related concepts? Fortunately,

not yet. But this leads us to think about demoting elements of the theories

from being real. Maybe only the interactions themselves can be said to be

real. But how do we know we can stop there?


So where exactly do we stand? One of the greatest lessons of being

a critical thinker is that you have to accept that some questions are difficult

and you have to live (at least for a while) in doubt about the answer. There

is nothing wrong with saying "I don't know". We have to find a way of

keeping our explorations of nature going by working around the

difficulty until we have more information helping us to decide.

If we still do not know when something exists or is real at a

fundamental level, how can we characterise objectivity? How can we say

that science is objective, but religion is not? Fortunately, consistency is all

that we need for all practical purposes. Although we cannot completely

characterise what is actually the essence of a fundamental objective reality,

we can at least assume that if something objective exists it must have the

basic property of consistency. This is actually the property we will require

in this book to characterise objectivity. You may feel that it is a weak

definition, but think deep enough and you will hardly find a better working

one.

Is it wishful thinking that objectivity, and ultimately reality,

should require consistency? We still cannot discard that possibility, and we

do not know if we will ever be able to do so, but we must draw a line from

which we can start. The consistency requirement seems to be the most

reasonable place to put it, if not the only reasonable one. If this is the

correct concept, it can only be known by always challenging the concept

itself, always keeping in mind the possibility that it might not survive.

Nobody said that understanding reality would be an easy task.


7.

Probability Zoo

INTERMISSION

Before we actually enter the zoo, it is worthwhile to locate ourselves and

find out how far in our journey we have come. In the last chapters, I have

tried to introduce you to the idea of probabilities. Do you remember

why? If not, look at BAYES again:

BAYES is the program that would allow us to carry out inference

and, as you can clearly see, it depended on probabilities. When we started,

we did not understand what the symbols entering and leaving the program

meant. Now we do. We understand now that, in order for BAYES to

calculate the conditional probability of the proposition encoding a model

given the proposition encoding the data, which is what we want, it

has to be fed with the conditional probability of the data given the model

and the prior probability for the model.
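A minimal sketch of what BAYES does can be written in Python. The two candidate models and all the numbers below are hypothetical, chosen only to show the mechanics of combining a likelihood with a prior:

```python
from fractions import Fraction

def bayes(prior, likelihood, data):
    """Posterior probabilities of models given observed data.

    prior:      P(model) for each model
    likelihood: P(data | model) for each model
    """
    unnormalised = {m: prior[m] * likelihood[m][data] for m in prior}
    evidence = sum(unnormalised.values())  # P(data), the normaliser
    return {m: p / evidence for m, p in unnormalised.items()}

# Hypothetical question: is our d10 fair, or loaded towards face 1?
prior = {"fair": Fraction(1, 2), "loaded": Fraction(1, 2)}
likelihood = {
    "fair":   {"rolled a 1": Fraction(1, 10)},
    "loaded": {"rolled a 1": Fraction(3, 10)},
}

posterior = bayes(prior, likelihood, "rolled a 1")
print(posterior["fair"], posterior["loaded"])  # 1/4 3/4
```

Seeing a 1 makes the loaded model three times as probable as the fair one, precisely because, under the loaded model, that piece of data was three times as likely.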

What we learned is that it is not trivial to go from the actual

propositions for the model and the data to their probabilities. These are,


themselves, procedures that can be seen as separate programs. In fact, we

could expand the above picture into the following one

The program temporarily called ??? will be given a name later on. The

program PRIOR takes a proposition and calculates its prior probability. As

we learned before, this is a bit misleading as, to calculate a prior, we

always need some more information than simply the proposition itself.

Consider it as an abbreviated way to represent it. We have seen some very

simple examples of priors, but there are in fact an infinite number of

possibilities for the mathematical object spat out by PRIOR. The walk in the

zoo of this chapter will show you the most common of them.

ANATOMY LESSON

Have you ever heard about the bell curve? Long tail distributions? Have you

ever read The 4-Hour Workweek by Tim Ferriss (Ferriss, 2011) and

became amazed by the Pareto distribution? If you have, you might want to

know that we are at a stage in which we can start to understand what they

mean. But before we can do that, we have to understand some very simple

but deep facts about numbers.


Dice are very simple objects and they have a certain definite

number of faces, which translates to a certain definite number of results in a

dice rolling. If the dice is small enough, like the d10, we can count the

number of faces or results on our fingers. The fact that we attributed

numbers to these faces is, in a sense, irrelevant. We could label them in

any way we wanted, with letters or shapes for instance. Because in this

situation there is an end to the number of results we can get, we say that

this number is finite. Therefore, dice have a finite number of faces and we

have a finite number of fingers.

Collections or sets of finite things can be counted using natural

numbers. These are the first ones we learn: 0, 1, 2, 3 and so on. Although

we do not realise that when we are children, at some point we notice that

the natural numbers form an infinite set, which means that they never

end. If you tell me any natural number, no matter how high it is, all I need

to do to get a higher one is to add 1 to yours. This process never ends.

The second characteristic that we need to know about natural

numbers is that they are something called discrete. For numbers, this

means that, if you take two natural numbers in sequence like 2 and 3, there

is no other natural number between them. The number 2.5 does not count

as a natural number. Only non-negative integers count as natural. In terms of

objects, if you can separate any two objects of your set, you have a discrete

collection of them. We then say that the set of natural numbers is discrete,

but infinite.

The natural numbers are very important for us because they are

what we use to count things. The act of counting can in fact be rigorously

defined as the act of associating things with natural numbers. The idea is

that, when you count objects, it is like assigning to them labels, with each

label being a non-zero natural number. This is so important that

mathematicians give it a name. A set in which each element can be assigned

a label corresponding to a natural number is called a countable

set. Discrete sets are clearly countable.
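The counting-as-labelling definition can be seen directly in code. Python's enumerate assigns exactly such natural-number labels to the elements of a discrete collection; the four shape names below are an arbitrary illustrative set:

```python
# Counting a discrete set = assigning each element a label that is a
# non-zero natural number. The particular elements are arbitrary.
shapes = ["triangle", "square", "pentagon", "hexagon"]

labels = {label: shape for label, shape in enumerate(shapes, start=1)}
print(labels)  # {1: 'triangle', 2: 'square', 3: 'pentagon', 4: 'hexagon'}

# The set is counted: the highest label used is its number of elements.
print(max(labels))  # 4
```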


Let us consider finite discrete sets of numbers. If you remember

your school days, or if you are still there, you must know that sets are

represented by objects inside curly brackets. If we consider the set of faces

of a d10, we can write it as

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

Does it ring a bell? This was the sample space of our dice rolling

game for the d10. This set is finite and discrete. The size of a set is called its

cardinality. In this case, the cardinality is simply 10. It is quite easy to

attribute probabilities to each one of the elements of a finite discrete set.

In fact, when our sample space is given by a discrete set, be it finite or not,

we can always talk about the probability of each element of this set. We

saw that with the d10, in which the probability of each face was 1/10.

The nice thing about probabilities on discrete sets is that we can

easily visualise them graphically by points in a graph. A graph is basically an

aid to visualise tables of numbers with two columns. Let us see an example.

Our d10, with all equal probabilities, can be represented by the graph

below

This graph, or plot, is a graphical representation of the following

boring table

The Probable Universe

155

Face Probability

1 1/10

2 1/10

3 1/10

4 1/10

5 1/10

6 1/10

7 1/10

8 1/10

9 1/10

10 1/10

The rules for plotting the graph are:

- The graph has two axes (plural of axis), which are the thick lines with arrows on their tips that cross each other. If nothing is specified, the standard assumption is that they cross at the point where both are zero.
- The horizontal axis corresponds to the first column, while the vertical axis corresponds to the second.
- The arrows indicate the direction in which values increase.
- Each line of the table corresponds to one of the points in the graph, which I have exaggerated to red circles.
- To know which values of the table each point represents, put your finger on the point and slide it downwards until it touches the horizontal axis (follow the dashed lines downwards). This point corresponds to the value in the first column. Now do the same, but go towards the vertical axis, which gives you the value in the second column.

The table and the graph are two different ways of representing the

same object which is, in this case, a probability distribution. When you are

attributing probabilities to elements of a discrete set, the values in the

second column of the table, or on the vertical axis, represent the actual


probability of each element. We will many times use the term discrete

distribution for a probability distribution defined on a discrete set.

A more interesting (less boring) probability distribution would be the following one:

This time you can see that the probabilities for each face are

different. If this is a d10, it is clearly loaded; so much that the faces 1 and

10 will never turn up! I will leave to you the task of obtaining the table for

this probability distribution based on the graph above.

There are some standard names describing properties and pieces

of probability distributions. One of them, which is worth knowing before

we proceed, is the mean. The mean, as most people already know, is what

you obtain if you multiply each element of your set by its probability and

add them up. In the case of equal probabilities for all elements, the mean

gives the same as the usual averaging process of adding everything and

dividing by the total amount of numbers. For a fair d6, this would be given by

(1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5


Notice that the mean does not need to be a natural number! If

you plot the graph of the distribution for a fair d6, you have the following

The red line I added marks the place where the mean is located. You can see that the red line does not correspond to any point in the distribution and, therefore, is not part of its graph. There is no way for you to roll a d6 and get 3.5 as a result. The mean, in the case of discrete distributions, just indicates the place of a hypothetical average element of your set.
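The recipe for the mean, multiply each element by its probability and add everything up, is a one-liner on a computer. A sketch, using the fair d6 from the text:

```python
# Mean of a discrete distribution: sum of value * probability.
def mean(distribution):
    """distribution: dict mapping outcome -> probability."""
    return sum(value * prob for value, prob in distribution.items())

fair_d6 = {face: 1 / 6 for face in range(1, 7)}
print(mean(fair_d6))  # 3.5 -- not itself a possible roll!
```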

There are more properties of probability distributions with strange

names like kurtosis and skewness, but we will not need most of them.

Those we do, I will introduce as we go along.

A DANGEROUS CREATURE

Everything looks good, but there is one thing with which we always have to

be careful: the infinite. Nothing prevents countable sets from being infinite.

In this case, we usually will not be able to write full tables or plot the entire graphs for the probability distributions, for obvious reasons, but we can still do it for parts of them or write down a rule as an equation. To see this,


suppose we have a dice with N sides, all with the same probability. We already know that, in this case, the probability of each side should be 1/N. Now, if these N faces were just the first N sides of an infinite-sided dice, we could still have a probability distribution such that these faces have probability 1/N and all others have zero probability. We could write this as

P(X = x) = 1/N if x ≤ N, and P(X = x) = 0 if x > N.

Remember that X here is a random variable that represents the proposition "the result of the rolling is x". The value x one gives to X corresponds to a possible face of the infinite dice. This is a completely legal

probability distribution that can be assigned to the whole set of non-zero

natural numbers. It is a bit of cheating, of course, but allowed nevertheless.

If we sum the probabilities for all natural numbers, the result is 1, as it

should be. Each probability is also positive and between zero and one, as

they should be.

The above distribution is clearly very unbalanced. What if we want

a probability distribution that gives the same probability for all natural

numbers? That would be the equivalent of the following situation. A friend

of yours asks you to think about a natural number. Any natural number. If

your friend guesses this number in a completely unbiased way, what would

be the probability of guessing it right? We can find it by a process named

taking the limit. We have seen that when we have N numbers, the corresponding probability would be 1/N. We then start with N = 1 and keep increasing it to see if we are going somewhere:

N          1/N
1          1
2          0.5
4          0.25
10         0.1
1000       0.001
1000000    0.000001


The more we increase N, the smaller is the probability of guessing a number correctly. If N goes to infinity, the actual size of the set of natural numbers, the above table clearly shows that the probability should go to zero! So, the probability of guessing any natural number correctly is zero! But now comes the strangest thing. All probabilities go to zero, but their sum stays always equal to 1! To see that, notice that if you add N times the probabilities 1/N, this is the same as multiplying N by 1/N. But no matter how large N is, this is always 1!

You can now appreciate how tricky it is to deal with infinite things. If you are careful enough, though, you can live with it. The important thing is that, once again, your probabilities should add to 1 and be between zero and 1. If you are worried because a sum of an infinite number of numbers will not be finite, look again at the result of the previous paragraph. What happens is that, if each term of the sum is small enough, then the sum might be finite. We say, in mathematics lingo, that the sum converges to 1.

The list below contains some other infinite sums that converge, some of them to other values than 1. You can have an idea of how it happens by using your computer to keep adding each successive term to see how the total approaches the final result and try to guess what it is. Have fun.

1/2 + 1/4 + 1/8 + 1/16 + ...

1 + 1/4 + 1/9 + 1/16 + 1/25 + ...

1/(1×2) + 1/(2×3) + 1/(3×4) + ...
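Here is one way to do that experiment. The sketch below computes partial sums of two classic convergent series, the halving series 1/2 + 1/4 + 1/8 + ... and the sum of inverse squares; watching the last few partial sums shows them settling down:

```python
def partial_sums(term, n):
    """First n partial sums of the infinite series term(1) + term(2) + ..."""
    total, sums = 0.0, []
    for k in range(1, n + 1):
        total += term(k)
        sums.append(total)
    return sums

# Two classic convergent series: 1/2 + 1/4 + 1/8 + ... and 1 + 1/4 + 1/9 + ...
geometric = partial_sums(lambda k: 1 / 2**k, 50)
inverse_squares = partial_sums(lambda k: 1 / k**2, 100000)
print(geometric[10], geometric[-1])              # the totals settle down
print(inverse_squares[10], inverse_squares[-1])  # ...each to its own limit
```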


SPECIMEN #1: POISSON

With Anatomy 101 cleared, we can now start our taxonomy classes. Of

course, we will not be able to go through mathematical details. If you are

interested, the standard introductory book, which contains a lot of examples, is Feller's book (Feller, 1968). You can get more information about specific distributions either in Wikipedia or in Wolfram's MathWorld (see the appendices for a list of Internet links).

Our first probability distribution is a discrete one called the Poisson distribution. Named after the French mathematician Siméon Denis Poisson, this is one of those probability distributions which appear everywhere you look in nature. Its usual shape is given by the following graph

As an aid to the eye, I have connected the points and added some

colour to the space below each one of the above three curves. Each curve

represents a Poisson distribution, the only difference between them being

their means. Notice that the probabilities become so close to zero after

some point that they cannot be seen in the graph. The larger the mean of

the distribution, the more to the right this point is.

Suppose you have a certain time interval inside of which a certain

number of events might occur. One usual case is that of people arriving at a


certain shop. If people choose randomly, and with the same probability, whether to go to the shop, and the average number of people arriving in any specific hour is always the same, then the number of people arriving in a given hour is roughly given by a Poisson distribution!
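The Poisson probabilities themselves are easy to compute: the chance of exactly k events when the mean number of events is λ is e^(-λ) λ^k / k!. A sketch, where the figure of 4 arrivals per hour is purely invented for illustration:

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean number of events is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 4  # hypothetical average number of arrivals per hour
for k in range(9):
    print(k, round(poisson_pmf(k, lam), 4))
# The probabilities rise to a peak near the mean, then fall towards zero.
```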

That works with space intervals as well and, more interestingly,

with areas. A very interesting situation in which you can find the Poisson distribution is the case of colonies of bacteria in a Petri dish. Feller's book (Feller, 1968) has this example. Colonies of bacteria appear as spots in the jelly inside the circular dish. If you take a picture and draw a square grid over it, you can count the number of squares in your grid with spots in them. This number approximately follows a Poisson distribution, and it is not difficult to check this in real experiments. Of course, the distribution obtained this way is approximate, but it is very close!

SPECIMEN #2: ZIPF'S LAW

Zipf's law, or Zipf's distribution, is a discrete version of a whole class of distributions which are called power law distributions. Oddly enough, these are also very common in nature. For instance, the probability of an earthquake having a certain energy is given by a power law, although a continuous one (we will talk about continuous distributions soon). The name power law comes from the fact that the probability for a certain value x is proportional to a negative power of x. For instance, the following power law

P(x) ∝ 1/x²

has a power 2.

I said proportional because we always need to guarantee that our probabilities add to 1. Because Zipf's law is discrete, x must of course be a natural number. The most common place where this distribution appears is in the frequency of words in a certain language.


Suppose you list the words of a certain language in order of usage, giving to the most used one the rank 1, the second most used the rank 2 and so on. If you choose a text at random, the probability of finding in it a word whose rank in the above list is r will be proportional to

P(r) ∝ 1/r^s

where s is a power that depends on the specific language. For most known human languages, the number s is very close to 1. There are many suggested explanations for this phenomenon, but none is perfect in accounting for it completely.

Depending on the value of s, Zipf's law can have very strange properties. For instance, when s = 1 we cannot make the probabilities add to 1! That is because the sum of the inverses of the natural numbers can be shown to be infinite. That is why, in general, Zipf's law is only considered up to a certain maximum natural number N. For s larger than 1 this problem does not appear and N can even be infinite. However, if s lies between 1 and 2, it is now the mean of the distribution that becomes infinite if N is also infinite! This means that there is no mean value. This can be seen by using a computer to generate numbers according to this distribution. If you try to average them, your average will increase forever and never reach a certain number.

Finally, another curious thing appears for infinite N and s between 2 and 3. In this case, if your computer generates numbers according to this distribution and you try to calculate their average, although mathematically the mean exists, you will never get to it by this method! Your calculated average will keep jumping from one number to another without any apparent direction. This is because, in this case, the deviations away from the mean are so large that they are never averaged away.
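You can watch those lurching averages yourself. The sketch below samples from a continuous power law rather than the discrete Zipf's law, simply because the continuous version is easy to sample by the inverse-transform trick; with the power s = 2.5 the mean exists (it equals (s − 1)/(s − 2) = 3) but the variance is infinite, so the running average wanders for a long time:

```python
import random

def power_law_sample(s, rng):
    """One sample from the continuous density proportional to 1/x**s for x >= 1."""
    u = 1 - rng.random()        # uniform in (0, 1]
    return u ** (-1 / (s - 1))  # inverse-transform sampling

rng = random.Random(0)
s = 2.5   # mean exists (= 3) but the variance is infinite
total = 0.0
for n in range(1, 100001):
    total += power_law_sample(s, rng)
    if n % 20000 == 0:
        print(n, total / n)  # the running average wanders around the true mean 3
```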


THE CONTINUUM

Before we take a look at the next species, we need to understand the

concept of the continuum. Discrete probabilities are straightforward

beasts. You simply give numbers to each one of the elements and these

numbers represent the probabilities of drawing that element according to

the valid rules of the game. But not all probabilities can be expressed as

discrete distributions. Think about the real line as in the picture below.

If the only allowed positions for points on the line were at the marks from 1 to 10, and the probability of finding a point at each of these positions was the same, we know the result: the same as our d10. Suppose now that a point can fall in any of the first 10 intervals between the numbers in the above line. If the probability of a point falling in each interval is the same, then we have the same result again if we ask what is the probability of finding a point in the i-th interval, with i ranging from 1 to 10.

But what if I say to you that there is the same chance of a point falling anywhere in the interval from 0 to 10? What is the probability of finding this point at exactly, let us say, 2.013?

We can answer this question by an approach we used before:

taking a limit. If only the 10 positions marked by the numbers 1 to 10 are

allowed, then the probability is surely 1/10. If we want to allow any place in

the interval, we should increase the number of allowed points. Let us say

that we put an extra point in the middle of each interval. Now we have 20

possible positions, all with equal chances, and therefore the probability is

1/20. You can easily see that, in order to cover the whole interval, we need

an infinite number of points. Therefore, very similar to the case of an

infinite number of discrete points, the probability for a point falling in one

exact place of the interval from 0 to 10 is zero! However, it is clear that the


chance of a point being in one of the ten intervals will still be 1/10. How

can we deal with that?

The interval from 0 to 10 is said to be continuous because there is

always another point between any two points in this interval, contrary to

the discrete case. When this happens, talking about the probability of a

certain value is always going to give the result zero, because the number of

points in any continuous interval is infinite, no matter how small the

interval is.

But why can't we do something like what we did with the Poisson distribution or Zipf's law? Both these distributions can be applied to an

infinite number of points as long as each new point has a probability that

becomes ever smaller, which allows others to have non-zero probabilities.

The difference is that the types of infinity you have in these two cases are different. Believe it or not, the infinity of the continuous interval is larger than the infinity of the natural numbers! Although this might seem confusing, this is related to the concept of counting that we saw before.

I said that counting is the same as associating objects with natural

numbers (0, 1, 2, 3, 4, ...). It turns out that you can extend this concept to

infinity. If you can associate a sequence of objects, even if it is infinite, with

the natural numbers, both infinities must be of the same size! Here is an

example. As hard as it is to believe, the amount of even numbers is exactly

equal to the amount of natural numbers. To see this, notice that to every

even number you can associate a natural number corresponding to its half:

Even:     0  2  4  6  8  10  12  14  16  ...
Natural:  0  1  2  3  4  5   6   7   8   ...

Using the rule in the above table, you can associate a natural

number to every even number and vice-versa. Therefore, you have to have

the same amount of them! How can this be possible? Because these two sets are infinite, and of the same kind. Every infinite sequence of


numbers that can be put on an exhaustive list has the same amount of

elements as the natural numbers. We call this the cardinality of the set.

The cardinality of the integer numbers, when you also count the negative

ones, is also the same as that of the natural numbers and also the same as

that of the rational numbers, which comprise all fractions you can make

out of integers! This infinite number is given a very special symbol: ℵ₀.

This is the first letter of the Hebrew alphabet with a subscript zero

attached to it and is called aleph-null. The theory concerning different

types of infinities was developed by the German mathematician Georg

Cantor by the end of the 19th century. He called them transfinite numbers.
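The even-number pairing from a few paragraphs back is literally a pair of functions, each undoing the other, which is all a one-to-one correspondence is:

```python
# Pair every natural number n with the even number 2n; since the pairing is
# reversible, neither set can be "bigger" than the other.
def to_even(n):
    return 2 * n

def to_natural(even):
    return even // 2

pairs = [(n, to_even(n)) for n in range(9)]
print(pairs)  # [(0, 0), (1, 2), (2, 4), ..., (8, 16)]
assert all(to_natural(e) == n for n, e in pairs)
```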

The cardinality of any continuous interval is different from ℵ₀. Cantor proved that with a very ingenious method which is today called Cantor's Diagonal Slash, quite a catchy name. The details need some mathematics, but the idea is that, if you try to organise the real numbers inside any interval (finite or infinite) into a list, you will always end up missing one or more. Because of that, their amount, or cardinality, must be larger than that of the integers. This cardinality has its own symbol, which is far less fancy than the one for the integers: c, naturally standing for continuum.

You might be asking right now whether there is any cardinality in between ℵ₀ and c. The hypothesis that there is not is called the continuum hypothesis and it was proved to be optional in mathematics. The foundations of mathematics are usually stated in a series of axioms defining how sets work. These are called the Zermelo-Fraenkel Axioms. In 1963, the American mathematician Paul Cohen proved that you can choose either the continuum hypothesis or its negation as an independent axiom which will not affect the usual set theory, and won the Fields Medal, the most desired mathematical award, for this.


What all this means is that, in the end, we cannot do to continuous

intervals the same trick we used to define probabilities for an infinite

number of discrete points. What is the way out then? How can we talk

about the probability of a point in a continuous interval? Short answer: we

cannot. We can only talk about probabilities of intervals. But one thing we

can do is to talk about probability densities. Let us see how this works.

In physics, to find the density of a substance you divide the mass of a certain quantity of the substance by its volume. We usually write this as

ρ = m/V

using the Greek letter ρ, called rho, for the density, the letter m for mass and the letter V for volume. The density of lead is higher than that of water because the same cup filled with water will be much lighter than if it is filled with lead.

Notice that we cannot talk about the mass of a point of a

substance, because when we divide the substance enough times, we enter the subatomic realm and there is no sense in talking about the same

substance anymore. In a sense, we consider only large chunks of

substances such that we can calculate masses.

But although it does not make sense literally, within practical limits

we usually talk about continuous substances. Consider water once again.

We usually imagine water as a continuous material and forget that it is

made of molecules and empty space. The same is true for a block of solid

substance. As long as we remain far from the subatomic domain, we can

consider it as approximately continuous. We still cannot talk about the

mass of a point, because points have zero volume and, therefore, zero

mass, but we can talk about the density at that point. How? Once again, we

take a limit.

We start by weighing the block and dividing the mass by the total

volume. This gives us the average density for the whole block. If the

substance is the same everywhere, like our fair d10 should be, then the

average density is the same as the density at each point and we are done. If

not, we have to choose the point in which we are interested and do the


following. We divide the block in two pieces and calculate the average

density for the piece containing our point. Then we do it again and again. If

we are lucky enough, these values will start to converge to some fixed

value. For instance, this would be a possible sequence of average densities in kg/m³ as we divide the block into smaller pieces:

1.3, 1.27, 1.263, 1.2602, 1.26004, ...

If we need a precision of only 3 decimal places, we can take the

density at our point to be 1.260 with a good approximation. That is exactly

what we do to define probability densities. The only difference is that mass

becomes probability and the volume becomes the length of the interval.

For our previous interval from 0 to 10, this works fine as we are

now going to see. Suppose that every point in the interval is equiprobable.

That means that the average density (not the probability!) should be equal

to the density at every point. We know that the total probability (mass) in

the interval must be 1. Therefore, to obtain the probability density p(x) at the point x, we just divide the total probability (mass) 1 by the length of the interval (volume) to get

p(x) = 1/10

which is called a uniform probability density, as it is everywhere the same.

Unfortunately, most texts will use the letter p both for the actual probability and for the density, including myself. The difference

should be clear from the context, but as a rule of thumb, whenever we are

dealing with continuous variables, we will use densities and we will use

probabilities for discrete variables.

The above formula then gives the probability density for the points

of our interval and it is all we can say about the individual points. In order

to calculate probabilities, not densities, we need to choose intervals. Now,

the actual density of pure water at some standard temperature and pressure is about 1000 kg/m³. If you want to know what the mass in a volume comprising 2 cubic meters of water is, what do you do? You obviously multiply the density by two, right? What about in half a cubic meter? You divide by two. In any case, you multiply the density by the volume to obtain

the mass. Guess what you do to obtain the probability of a smaller interval.

Correct. You multiply the probability density by the length of the interval.

Therefore, the probability of being in an interval from 0 to 1 or from 2 to 3

is just 1 times the probability density, which is 1/10 as we calculated from

symmetry arguments!

But what happens if the density changes from point to point? That

is a good question. The answer has the threatening name of an integral, but it hides an almost trivial idea, easy to understand, even though it can be difficult to calculate in general. For us, the idea is enough.

The graph below is a graphic representation of our uniform density

from 0 to 10:

Instead of 10 points, now we have a straight red line. That is

because the probability density is defined at every point in the interval, not

only for the natural numbers. If you look at the whole graph, you can easily

see that the region delimited by the density, the two axes and the dashed

green line coming upwards from the point 10 at the end, is a rectangle. The


area of this rectangle is its base times its height, 10 × (1/10) = 1, which is exactly the total probability of finding the point at any place in the interval. This is no coincidence, it is always true.

Suppose we have a different density given by the graph below

For the above red line to be an acceptable probability density, the

total probability represented by it must add to 1. As I said before, this total

probability is the area delimited by the density and the axes in the whole

interval. In the above case, this area is the area of the orange triangle in the

picture. This implies that the area of the triangle must be 1 and, therefore, the height h of the triangle has to be such that the area, which is half the base times h, is exactly 1.

What is the probability of a point falling inside a certain interval

when you have the shape of the probability density? It is just the area

below the density in that interval! Let us go back to our uniform density. If

you look at its graph, there are three regions painted with different

colours. The first region, in red, is the area below the density in the interval

[1,2] (try to remember our notation for intervals). Therefore, its area is

equal to the probability of a point being between 1 and 2. It is easy to

calculate this are, it is just 1/10. The area of the orange region then gives

the probability of a point being in the interval [5,8] and it is just 3/10.


Finally, the blue region goes from 9 to 9.5 and, therefore, the probability of

a point being in the interval [9,9.5] is 1/10 divided by 2, or 1/20.
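For a uniform density the rule "probability = density × length" fits in two lines of code. A sketch that reproduces the three coloured regions just described:

```python
def uniform_interval_probability(a, b, low=0.0, high=10.0):
    """Probability that a uniformly distributed point on [low, high] lands in [a, b]."""
    density = 1 / (high - low)  # here 1/10 everywhere
    return density * (b - a)

print(uniform_interval_probability(1, 2))    # 1/10   (red region)
print(uniform_interval_probability(5, 8))    # ~ 3/10 (orange region)
print(uniform_interval_probability(9, 9.5))  # 1/20   (blue region)
```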

You have just learned the notion of an integral from calculus. The

integral of any function that can be plotted as a two-dimensional graph like

our probability densities is just the area between the curve and the

horizontal axis within a certain interval. Because of that, we can always say

that the probability associated with some interval is the integral of the

probability density. It works for any probability density. For instance, the

one below

As long as the total area below the red curve is 1, this is a perfectly

valid probability density between 0 and 8 (or 10, but it is trivially zero after

8). The probability for a point being in the interval [1,4] is then the area of

the orange region. Sometimes, if the density has a shape which is too

strange, it might not be very easy to calculate the area, but the fact that it

is the probability is still true.
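Even when the shape is awkward, a computer can estimate the area by chopping the interval into thin strips and adding up rectangle areas, which is precisely the idea of the integral. A sketch using a triangular density; the exact shape is my own assumption, since only the unit area is fixed by the argument:

```python
def riemann_area(f, a, b, strips=100000):
    """Approximate the area under f between a and b by thin rectangles."""
    width = (b - a) / strips
    return sum(f(a + (i + 0.5) * width) for i in range(strips)) * width

# An assumed triangular density on [0, 10]: height 1/5 at x = 0, falling
# linearly to 0 at x = 10, so that the total area is 10 * (1/5) / 2 = 1.
def triangle(x):
    return 0.2 * (1 - x / 10)

print(riemann_area(triangle, 0, 10))  # ~ 1.0: a valid density
print(riemann_area(triangle, 1, 4))   # probability of landing in [1, 4]
```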

SPECIMEN #3: THE GAUSSIAN

The Gaussian distribution, also known as the Normal distribution

or, sometimes, the Bell curve, is the most important of all continuous

distributions not only in mathematics, but also in nature. This is the result

of a combination of many different features which we will discuss here.


First of all, let us see a plot of this distribution. From now on, I will

use the word distribution to mean both probability distribution and

probability density. Hopefully, the context will always be clear enough for

you to know which one I am talking about. Here it is:

The shape of the curve makes it evident why it is called a bell curve.

To begin with, this curve is perfectly symmetric if reflected through the vertical

dashed green line. This green line marks another special place for this

distribution. The point on the horizontal axis marked by this line

corresponds to the mean of this distribution, which is the average value of

the quantity corresponding to this distribution. The most common notation

for the mean of the Gaussian is the Greek letter mu: μ. You can see where μ sits in the graph above: it is the point marked by the dashed green line.

For the Gaussian, the position of the mean coincides with the point

in which the distribution has its highest value, a point called the mode of

the distribution. In discrete distributions, the mode is the most probable point; for continuous ones, the probability of being around the mode is

higher than being around any other point. The fact that the mean and the

mode of the Gaussian coincide is a special feature of it and does not

happen with all distributions.

The Gaussian is a very simple distribution and needs only two

parameters to be completely defined. One is the mean, the other is its

variance. We have already seen that the variance measures fluctuations

around the mean. In the graph above, this corresponds to the width of the


curve marked by the orange double-arrow. In order to understand this,

consider the three Gaussians below:

They are plotted in the same scale. Notice that the narrower the

distribution is, the smaller is its value for points far from the location of the

mean. This means that it is less probable to be far from the mean. As the

variance measures how probable it is for a point to be far from the mean, the

narrower the distribution, the smaller is the variance. In the picture above,

the variance increases to the right. Usually, the width of the curve is symbolised by the Greek letter sigma: σ. Most of the time, though, it is much more convenient to talk about its square, σ², which is the variance, because in the formula of this distribution sigma always appears squared.

If you are curious to see the formula, here it is:

p(x) = exp(-(x - μ)² / 2σ²) / √(2πσ²)

otherwise you can simply ignore it and go on reading. Notice that, apart from the random variable, which we named x in this case, there are no other variables in the formula except the mean and the variance.
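If you want to play with the formula, here it is in code, together with a numerical check of the symmetry and of the mode sitting at the mean; the values of μ and σ² are purely illustrative:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of a Gaussian with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = 3.0, 4.0  # illustrative values
# Symmetric about the mean:
print(gaussian_pdf(mu + 1.5, mu, sigma2) == gaussian_pdf(mu - 1.5, mu, sigma2))
# Highest at the mean (the mode):
print(gaussian_pdf(mu, mu, sigma2) > gaussian_pdf(mu + 0.1, mu, sigma2))
```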

There are, of course, more interesting properties of the Gaussian

than its simplicity. Suppose you are searching for the least biased

distribution to describe a continuous variable. You only know that there is a

fixed mean and a fixed variance. If you remember, to find the least biased

distribution we have to maximise the entropy. What do you think we get if

we search for a distribution that maximises the entropy given a mean and a

variance? You got it. It is the Gaussian!


But an even more interesting property is the one called the Central

Limit Theorem, which I will call CLT for short. The CLT is one of those

wonders of nature that makes you think that there is something special

about the universe. It is very simple, very powerful and very beautiful.

Consider a set of independently and identically distributed (or i.i.d., in mathematical lingo) random variables X1, X2, ..., XN.

What I mean by this is that the probability density of all those variables is

given by exactly the same distribution (identically distributed) and that

they all have to do with experiments which do not influence one another

(independent).

Let us now consider the variable S formed by the sum of all these variables:

S = X1 + X2 + ... + XN

The variable S can be understood in the following way. Each X is generated by a random process according to its probability density. We generate each one of them and then add all the results. Clearly, S is a random variable as well, as its value is not pre-determined, but depends on the drawn values of the X's. Given the distributions of the X's, we can indeed calculate the distribution of S, but there is something even more interesting that happens when the number N of X's is very large. The CLT states that, the larger the number N is, the closer to a Gaussian the distribution of S becomes! It goes as far as to give you the values of the mean and the variance of this Gaussian.

more mathematics than we will use here, but if you are interested you can

take a look at Feller's book (Feller, 1968).

I am not sure you understood the beauty of this. Let me explain

again. ANY very large set of random processes which are i.i.d. gives values

which, if you add them, have a probability distribution given by a Gaussian.

A-n-y. Whatsoever. If that does not impress you, I give up.
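If you have a computer at hand, you can watch the CLT at work. The sketch below adds 30 uniform random numbers at a time; uniforms are just one convenient choice of i.i.d. variables, and the CLT predicts the sums cluster with mean N × 1/2 and variance N × 1/12:

```python
import random

def sum_of_uniforms(n, rng):
    """Sum of n i.i.d. uniform [0, 1) variables: nearly Gaussian for large n."""
    return sum(rng.random() for _ in range(n))

rng = random.Random(42)
n, trials = 30, 20000
sums = [sum_of_uniforms(n, rng) for _ in range(trials)]

empirical_mean = sum(sums) / trials
empirical_variance = sum((s - empirical_mean) ** 2 for s in sums) / trials
print(empirical_mean)      # close to the CLT prediction n * 1/2 = 15
print(empirical_variance)  # close to the CLT prediction n * 1/12 = 2.5
```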

There is one subspecies of Gaussian which is very important in

probability and physics. It is called a Dirac delta in honour of the engineer


turned physicist Paul Dirac, who introduced it. The delta in the name comes from the symbol used for it:

p(x) = δ(x - μ)

Delta, δ, is the Greek letter that you see on the right-hand side of

the above formula. Do not be in despair, it is very simple to understand the

essence of the formula above. Imagine a Gaussian whose variance

becomes very, very small. Look back at the pictures to see that the smaller

the variance, the more the Gaussian is concentrated around the mean. The

Dirac delta is nothing more than the limit of a Gaussian as its variance goes to zero, keeping the mean at μ. In fact, because we are dealing with a continuous distribution and, as we have seen, things get weird in the continuum, the height of the Dirac delta is infinite!

The meaning is that the whole probability is concentrated at x = μ, which means that it is the only value that can appear. In other words, the Dirac delta is a probability distribution representing the certainty that a continuous variable will have the exact value μ. Simply that.
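You can also watch the collapse towards the Dirac delta numerically: as the variance shrinks, the probability of landing inside a fixed small window around the mean climbs towards 1. A sketch, reusing the thin-strip area idea from before:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def probability_near(mu, sigma2, eps, strips=20000):
    """Area under the Gaussian between mu - eps and mu + eps."""
    width = 2 * eps / strips
    return sum(gaussian_pdf(mu - eps + (i + 0.5) * width, mu, sigma2)
               for i in range(strips)) * width

for sigma2 in (1.0, 0.1, 0.01, 0.0001):
    print(sigma2, probability_near(0.0, sigma2, eps=0.1))
# The probability inside [-0.1, 0.1] approaches 1 as the variance shrinks.
```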

SPECIMEN #4: PARETO'S DISTRIBUTION

The reason why I am including the Pareto distribution in our zoo is that it has long been popular in the economics and social sciences literature; it was introduced, naturally, by an Italian economist called Vilfredo Pareto.

The Pareto distribution can be seen as a continuous version of Zipf's law and is another example of a power law distribution. The interesting thing about it is that it has what is called a cut-off: it is zero below a certain value of the random variable. An example is the following graph:

The Probable Universe

175

Notice that the Pareto distribution in the above example is zero below 2 and becomes ever smaller as the values of the random variable increase. It is also defined by only two parameters: the point of the cut-off and the speed with which it goes to zero. As we have seen before, power laws are very common in nature, and approximate Pareto distributions fit many natural phenomena very well. Last time I checked, Wikipedia had a list containing, among other things:

- City sizes
- The size of Bose-Einstein condensates at very low temperature
- Total area of a forest fire
- Sizes of meteorites
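If you want to play with the beast yourself, here is a minimal sampler (a sketch of mine, using the standard inverse-CDF trick with made-up parameter values):

```python
import random

def sample_pareto(x_min, alpha, n, seed=1):
    """Draw n values from a Pareto distribution by inverse-CDF sampling."""
    rng = random.Random(seed)
    # u is uniform in (0, 1]; x = x_min * u**(-1/alpha) is then Pareto.
    return [x_min * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

samples = sample_pareto(x_min=2.0, alpha=3.0, n=50_000)
print(min(samples))  # never below the cut-off at 2
# The survival function is P(X > x) = (x_min / x)**alpha, so about
# (2/4)**3 = 12.5% of the draws should land beyond 4.
print(sum(s > 4.0 for s in samples) / len(samples))
```

The two parameters of the text appear directly: x_min is the cut-off and alpha is the speed with which the tail dies off.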

These are phenomena that can safely be said to have no common cause. But the main reason for the fame of the Pareto distribution is the so-called Pareto principle, which says that 80% of the results come from 20% of the causes. I have seen this principle used to justify ideas like the one that you just need to study 20% of something to be able to understand 80% of it. Although there are many examples where this works, the underlying reason being that the processes can be approximately modelled by appropriate Pareto distributions, do not always trust it!

The Pareto distribution is just one of the many distributions that occur in nature and I can tell you with 100% certainty that there are many processes for which the Pareto principle will not work. More than that, we do not know exactly which processes it does work for. So, be careful with the things you read around.

THE TAIL OF THE BEAST

A bit more anatomy. This time we will talk about tails. We now know that if we want to define probability distributions for any value, including the infinities in the two possible directions, be they discrete or continuous, we need to guarantee that the distributions go to zero fast when the random variables become either too large or too negative. We can look again at the Gaussian. The one below is plotted over a very wide range:

You can see that when the distance of the values on the horizontal axis from the mean becomes much larger than around 3, the value of the distribution becomes so small that it is practically indistinguishable from zero. The values away from the mean form the tail of the distribution. In the Gaussian, this tail is said to be short because it goes to zero very fast on both sides of the mean. This means that most of the probability mass of the distribution is located in the neighbourhood of the mean. However, that does not happen for some distributions. For distributions in which there is more mass in the tail than in a Gaussian, the name long tail distribution was coined.

Although it is not always the case, when the tail of the distribution is too big we can have the pathological effects that we have already discussed for Zipf's law. Sometimes the mean will not exist, while in other cases the variance will be infinite, invalidating the law of large numbers and preventing one from estimating the mean by averaging the results of experiments.
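The contrast between short and long tails is easy to quantify. The snippet below (with illustrative parameters of my own) compares the survival probability of a Gaussian tail with that of a power-law tail:

```python
import math

def gaussian_survival(k):
    """P(X > mean + k standard deviations) for any Gaussian."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def pareto_survival(x, x_min=2.0, alpha=2.0):
    """P(X > x) for a Pareto distribution with cut-off x_min."""
    return (x_min / x) ** alpha if x > x_min else 1.0

# Five standard deviations out, the Gaussian tail holds about 3 parts in
# ten million, while the power law still keeps a full 1% beyond x = 20.
print(gaussian_survival(5.0))
print(pareto_survival(20.0))
```

The short Gaussian tail is essentially empty a few standard deviations out; the long power-law tail still carries real probability mass far from the bulk, which is exactly why averages over it behave so badly.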

BIODIVERSITY

There are many more probability distributions than the ones listed in this

chapter. In fact, there is an infinite number of them, but only some have

properties that make them interesting. In general, these properties are

associated with the fact that these probabilities describe physical

phenomena in a very condensed (or compressed) way.

One common variation of probability distributions that we have not seen in this chapter is that which describes more than one variable at the same time. These are usually called multivariate distributions. For two variables, we can draw three-dimensional graphs, like the one below, which is that of a two-dimensional Gaussian, but when there are more than two variables, visualisation becomes trickier.

In the above Gaussian distribution, the two variables have zero mean and their values are in the horizontal plane, while the values of the distribution are on the vertical axis. Notice how the bump has a shape very similar to the one-dimensional Gaussian, but is symmetric around zero. Not all two-dimensional Gaussians are symmetric like that. If I choose different variances for the two variables, the shape you obtain is very different.
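For the curious, here is a small sketch (mine, with arbitrary spreads) that samples such a two-dimensional Gaussian and confirms that the two axes can have very different variances:

```python
import random
import statistics

def sample_2d_gaussian(sigma_x, sigma_y, n, seed=2):
    """Two independent Gaussian coordinates with different spreads per axis."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, sigma_x), rng.gauss(0.0, sigma_y)) for _ in range(n)]

points = sample_2d_gaussian(sigma_x=1.0, sigma_y=3.0, n=20_000)
xs = [x for x, _ in points]
ys = [y for _, y in points]
# Different variances stretch the symmetric bump into an elongated ridge:
print(statistics.variance(xs))  # near 1.0
print(statistics.variance(ys))  # near 9.0
```

With equal spreads the bump is round; with unequal spreads, as here, it is stretched along one axis, which is the asymmetric shape described in the text.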

The biodiversity of probability distributions is very high and interesting new species are being discovered all the time. If you are interested in a day off at a larger probability zoo, Wikipedia has a nice list of probability distributions at

http://en.wikipedia.org/wiki/List_of_probability_distributions

Enjoy it.


8.

Changing Your Mind

A DECISION WAS WISE, EVEN THOUGH IT LED TO DISASTROUS

CONSEQUENCES, IF THE EVIDENCE AT HAND INDICATED IT WAS

THE BEST ONE TO MAKE; AND A DECISION WAS FOOLISH, EVEN

THOUGH IT LED TO THE HAPPIEST POSSIBLE CONSEQUENCES, IF

IT WAS UNREASONABLE TO EXPECT THOSE CONSEQUENCES.

- HERODOTUS (500 BC)

DECISIONS

The moment is finally ripe for you to understand what Bayesian inference actually is, how it works and how everyone can, or should, use it in their lives. This point should be made over and over again: Bayesian inference is a tool that is not restricted to technical books or hard science research. You do not need to be a professional mathematician to make use of it in your daily life. Almost nobody goes around calculating numerical probabilities all the time, but almost everyone goes around making decisions based on some information and, to make a decision, we are always evaluating the relative importance of the probable outcomes. What Bayesian inference does is help you weigh those outcomes and pick the best option. It can give you numbers if you need them, but it can also guide you with very little use of them.

Making decisions is obviously not an easy task, as anyone knows. It is a complicated issue that requires us to constantly change our minds. When we face a difficult situation for the second time in our lives, our reaction will not be the same as the first time. That happens because the information available for making the decision has invariably changed. You now know the consequences of your first decision. Because time has passed, the world has changed and there are new things to consider. You are not the same person as before. Your needs changed. Your opinions changed. Possibly your values changed. All this change forces us to change our beliefs about the situation requiring a decision, and that is what inference is all about. Changing your mind. Changing your beliefs.

You must remember that we calculated the probability for our

extremely fair d10 rolling game as 1/10 (or 0.1, or 10%) for each different

face. We arrived at this by assuming many things. The fact is that all of

them might be completely wrong. For instance, the d10 might be loaded,

either because it was intentionally tampered with or simply because the

company that manufactured it was not careful enough to produce a good

quality dice.

The easiest way to check if we can really trust our fairness

assumptions is to roll the dice and acquire ever more information by

keeping track of frequencies. Frequentists, rejoice! As we have seen, the

Law of Large Numbers tells us that, within certain limits, those tabulated

frequencies will get closer and closer (although they will never get exactly

there) to the true probabilities of each face. As long as they do not differ

too much from 1/10, we will be satisfied within our desired precision.
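A few lines of Python make the point concrete (the seed and the numbers of rolls are arbitrary choices of mine):

```python
import random
from collections import Counter

def face_frequencies(n_rolls, seed=3):
    """Roll a fair d10 n_rolls times and tabulate the relative frequencies."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 10) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 11)}

# The frequency of face 1 crowds around 1/10 as the number of rolls grows,
# but at any finite number of rolls it fluctuates around it rather than
# hitting it exactly, which is what the Law of Large Numbers promises.
for n in (100, 10_000, 1_000_000):
    print(n, face_frequencies(n)[1])
```
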

However, if the numbers we get from the measured frequencies

are too different from those we are expecting, then we have to revise our

information about the d10 and the conditions of the game, because

something is not right. By repeating this many times we might be able to

infer (again, approximately) the actual conditions of the dice, which will

allow us to calculate correct probabilities and make good predictions.

Making good predictions is nothing more than deciding on which results

are more probable.

This procedure can obviously be applied to anything and can be

written in the following algorithmic way:


1. Use all information you have to create an initial model for your

experiment.

2. With that model in your hands, calculate the probabilities of each

result.

3. Test your model by doing the experiment and checking if the

frequencies agree with your calculations (taking into consideration the

limitations of this procedure).

4. If there is no agreement, you have to go back to 1 and change your

model. If there is agreement, you can relax for a while (until someone

finds a disagreement and you have to start again).
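The four steps can be sketched in code. In the toy below, everything is invented for illustration: a secretly loaded d10 plays the role of the world, and the initial model is the fair one from step 1:

```python
import random
from collections import Counter

def observed_frequencies(rng, probs, n):
    """Roll a (possibly loaded) d10 n times and tabulate relative frequencies."""
    faces = list(range(1, len(probs) + 1))
    counts = Counter(rng.choices(faces, weights=probs, k=n))
    return [counts[f] / n for f in faces]

def model_agrees(model, observed, tolerance=0.02):
    """Step 3: compare predicted probabilities with measured frequencies."""
    return all(abs(m - o) <= tolerance for m, o in zip(model, observed))

rng = random.Random(4)
secret_probs = [0.3] + [0.7 / 9] * 9    # the d10 is secretly loaded towards 1
model = [0.1] * 10                      # step 1: assume a fair d10
observed = observed_frequencies(rng, secret_probs, 100_000)  # step 3
if not model_agrees(model, observed):   # step 4: disagreement, so...
    model = observed                    # ...revise the model and test again
print(model[0])  # the revised model now puts roughly 0.3 on face 1
```

In real problems step 1 would produce a richer model than a bare frequency table, but the loop of predict, test and revise is the same.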

That should ring more than one bell. That is because the above procedure is extremely similar to what people call the scientific method. This is no coincidence. We will see, by the end of this book, that Bayesian inference is exactly the scientific method. There is a lot to say about this connection and we will spend a whole chapter of this book explaining it.

The key part of Bayesian inference is then to change your beliefs

about something whenever new information about it appears. This new

information might either confirm your previous beliefs or indicate that they

were wrong. If they are wrong, the sensible thing to do is to change them.

That sounds obvious, but that is the most difficult part of the whole

procedure for the great majority of people in the world.

Although we use the word belief in a technical sense when it is associated with probabilistic inference, meaning a probability assignment to some piece of information, this is no different from the common use of the word. A belief is something that you think is true, which includes the values of the probabilities that you calculate for anything. The 1/10 probabilities that we calculated for the faces of our d10 are beliefs that must be changed if we measure frequencies that are too far from them.

This makes sense, doesn't it? The more we believe in something, the higher the probability we assign to that thing being true. In the same way as we might have assigned wrong probabilities to our d10, any other belief can also be assigned a wrong probability of being true. The second trickiest thing, after admitting that changing beliefs is necessary, is finding the right way of doing it. Of course, this right way is Bayesian inference, the small formula we learned about in the very beginning of this book and that we used to create the program BAYES. It is now time to understand each piece of that formula in detail, and that is what we are going to do now.

PRIORS AND POSTERIORS

Let us look again at the formula from the beginning of the book

P(A|D) ∝ P(D|A) P(A)

As you already know, this formula is called Bayes' Theorem in honour of our late Reverend Thomas Bayes, and by now you have a much better understanding of what it means. It is not so alien anymore.

The probability distribution before the sign ∝, which means "proportional to", we can now identify as the conditional probability of the proposition A given the data D. The symbol ∝ simply means that, in order for this to be an equality, there is something else multiplying P(D|A) P(A) which we will not write right now because it is less important at this point.

This first probability, before the ∝, is of course what we want to calculate with our inference procedure, our beloved BAYES. Let us forget about the first probability after the ∝ and concentrate simply on P(A). I said in the beginning of the previous chapter that this is the prior probability of A. This is because, in a Bayesian inference task, this probability represents all the information we have about A before we include the information contained in D.

However, this association of P(A) with the prior probability of A is not straightforward and we will discuss in detail how it happens in the following sections. Let us go over its meaning once again, but now using our deeper understanding of probabilities.

One very important question that one should always ask when faced with priors is: prior to what? We have been using the name prior to indicate all the information we know about the results of an experiment before the experiment is done. But what if we do the experiments more than once? Do we call the prior only the probability distribution that we calculate before all the experiments are done?

The answer is no. Not only. In terms of Bayesian inference, the prior is a probability distribution calculated with the information we have before each new step in our inference process. A prior probability is always prior relative to the new information that is incorporated. Of course, you might think about a sort of ultimate prior for when you have absolutely no information about something, but by the Principle of Insufficient Reason that would be just a uniform distribution, a probability in which all possibilities have the same odds. Our super-fair assignment of probabilities to our d10, in which all faces have probability 1/10, is, for instance, a uniform distribution.

The prior is like a book that codifies all the acquired information just before a new piece of data comes in. The inference task will add a new page to that book and update the prior to a new form. This updated prior is called the posterior distribution, or simply the posterior, as I mentioned briefly before. The process of incorporating information into a prior so that it becomes a posterior is the inference process. We can do that as many times as we want. Each time we decide to repeat the experiment and update our knowledge or beliefs about the probabilities of the result, the old posterior becomes the new prior and is then updated to a new posterior.

Consider the figure below.


This picture describes an inference task about some proposition A. You can think about A as being, for instance, one of the faces of our d10. In the beginning of time, which is represented by the horizontal line with an arrow indicating in which direction it increases, we have the prior probability P(A) that encodes everything we know about A before doing any experiment. The first arrow indicates the moment we do our first experiment, let us say, we roll the dice. We can roll it once, twice or as many times as we want. When we finish, we collect all the data, which in the case of a dice rolling might be the observed frequencies of the faces, and put it into a dataset that we are calling D1. When we incorporate the information contained in D1 into our prior distribution, we end up with the posterior distribution P(A|D1), and now you see that what we have on the left hand side of the inference formula at the beginning of this section is exactly this posterior.

But we do not need to stop there. We might keep doing our experiment to confirm whether the new probabilities are already the correct ones. Let us call the next experiment, or bunch of them, D2. The posterior P(A|D1) now becomes a new prior, a prior relative to the experiment D2. It is still the same probability; only what we call it changes. It is at the same time a prior for D2 and a posterior for D1. Once the information coming from D2 is also incorporated, we have the posterior P(A|D1,D2), written this way because it contains the information of both bunches of experiments.

This pattern can be repeated as many times as we want. Before each new experiment is carried out, the previous posterior becomes the new prior and so on. After each experiment, the gathered information is incorporated and the prior becomes a new posterior. In theory, one should repeat this forever; in practice, we stop once the probabilities stop changing.
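Here is the whole prior-to-posterior relay in miniature (a sketch of mine with two made-up hypotheses about a d10, not anything from the text):

```python
def update(prior, likelihoods):
    """One Bayesian step: the posterior is proportional to likelihood times prior."""
    unnorm = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def likelihood_of_roll(face, hypothesis):
    """Probability of seeing this face under each hypothesis about the d10."""
    if hypothesis == "fair":
        return 0.1
    return 0.5 if face == 1 else 0.5 / 9   # hypothetical loaded model

hypotheses = ("fair", "loaded")
posterior = [0.5, 0.5]                     # start from an even-handed prior
for face in [1, 1, 1, 1, 5, 1]:            # two bunches of rolls, merged
    # Yesterday's posterior is today's prior:
    posterior = update(posterior,
                       [likelihood_of_roll(face, h) for h in hypotheses])
print(posterior)  # the loaded hypothesis ends up overwhelmingly favoured
```

Notice that the loop literally implements the relay described above: the variable holding the posterior is fed back in as the prior of the next update.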

It should be clear that two people with different information will have different priors for the same experiment. That is because one or both might have incomplete information about it, with the missing information being different for each of them. As we have seen before, there is no contradiction here, as the description is completely consistent. Once both start to do experiments and incorporate (the same) new information, their probabilities will start to become similar to each other.

Consider our completely symmetric, completely homogeneous d10. If we had not discussed those matters before, we might be tempted to say that the 1/10 probability for each face is an inbuilt, objective property of the d10. We know that it is not. We saw that this depends not only on the geometry of the dice, but also on the procedure one uses to throw it and the environmental conditions in which that happens. The uniform probabilities can then be seen as a kind of compressed computer file containing only what is important about the above description of the d10 tossing before we throw the dice. They are the prior probabilities of the d10 throwing. The confusion comes because, in this case, most people assume everything to be as fair as possible and end up assigning the same priors.

When we roll our d10, though, all of that can change. Consider that you are at a geek party and each person is arguing in favour of a different prior for the d10 rolling. One person says that all faces have the same probability. Another swears that the dice is loaded and 1 is the most probable result. Yet another says that she does not know about the other faces, but she knows that the dice will never give a 10. The host of the party then decides that experiments will be done. The dice is rolled in bunches of ten and, after each bunch, people are allowed to revise their probability assignments. At the end of the night, as long as people are not drunk and they are all looking at the same experiments, their posterior probabilities will be practically the same. The subjectivity of the initial priors will be erased by the experimental information and, each time a new experiment is done, their influence on the posterior will decrease.

Wait a minute! That might seem to invalidate everything we said before! After all, is that not the same as saying that the only true probabilities are the frequencies? Do not forget about the fluctuations. But even if you do, what would happen if you measured your dice with infinite precision and found it perfectly symmetric, but still the frequencies were not the same? I would assume that the dice is loaded, but what if someone challenges that explanation? What happens if we measure and weigh the dice in all possible ways and find out that it is perfectly symmetric? Would you give up reason and assume that a perfectly symmetric dice can have different probabilities for each face? I would not. I would look for some problem in the throwing of the dice. Why? Because logic says that it does not make sense for a perfectly symmetric dice to have different frequencies for each face, and logic is a piece of prior information that we know works and cannot be thrown away lightly. If you ever find a situation with perfect symmetry in which the probabilities are different, call me. You are on the verge of a revolutionary discovery. The most probable conclusion though, given all my years of accumulated posterior about life, is that you are wrong.

THE LIKELIHOOD

The more we repeat something, the more familiar it becomes. Therefore, let us take another look at Bayes' Theorem

P(A|D) ∝ P(D|A) P(A)

We now know that P(A|D) is the posterior and P(A) is the prior. What is left is the probability P(D|A) that makes this connection possible. That is the ??? in our larger representation of the BAYES program. This probability is called the likelihood function of A given D, or simply the likelihood, and is symbolised by L(A|D). The meaning is clear from the probability definition: the likelihood of a proposition A given the data D is equal to how probable the data is if you assume that the proposition is true.

You have every right to be confused, because the mathematicians really messed up the names here. It would be very natural for the likelihood of an explanation A, given the data D it tries to explain, to be the probability of A given D. Unfortunately, it is exactly the opposite. The name was chosen because, as we are going to see, the likelihood is usually used to choose between different explanations for a fixed dataset, thus the "given D" part. But, in fact, it is always given by P(D|A). To add to the confusion, many times the notation L(A|D) is used for the likelihood of A given D, and we have the equivalence

L(A|D) = P(D|A)

Too bad. We have to live with that.

The way it works is by thinking of the proposition A as a tentative explanation of why we have observed that specific dataset D. The explanation which is more likely to be true (thus, the likelihood of A) is the one under which the data D is more probable.

Let us roll a d10 again. Suppose that we roll it ten times and we obtain ten 1s. What we call the explanation is an explanation of why we observed those results. Well, if we give a probability to each face, that will explain why we observed a certain amount of faces of each kind. Therefore, an explanation would be, in this case, a probability assignment for the faces. Let us say that we have two possible explanations, E1 and E2. E1 is our uniform probability, where every face has probability 1/10. E2 is a revolutionary explanation in which the face 1 has a probability of about 0.63 and the small remaining probability is split equally among the other faces. Which explanation is more likely to be true for the data 1-1-1-1-1-1-1-1-1-1 that we observed?

Let us do the calculation together. Because we want the probability of 1 in the first roll AND 1 in the second roll AND 1 in the third roll AND... up to ten, we multiply the probabilities of 1 each time. For explanation E1, this number is 0.0000000001, or a 0.00000001% chance of this happening! On the other hand, according to explanation E2, the probability of the observed data would be approximately 0.01, or 1%! The probability of observing that data is ONE HUNDRED MILLION TIMES larger if the second explanation is the correct one! Which one do you think is more likely?
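The arithmetic is short enough to check in a couple of lines (taking, for illustration, 0.63 as the loaded probability of face 1, a value that reproduces the roughly 1% figure):

```python
# Probability of rolling 1 ten times in a row under each explanation.
uniform_likelihood = 0.1 ** 10   # E1: all faces equally probable
loaded_likelihood = 0.63 ** 10   # E2: face 1 assumed to have probability 0.63

print(uniform_likelihood)                      # 1e-10, i.e. 0.00000001%
print(loaded_likelihood)                       # about 0.0098, i.e. roughly 1%
print(loaded_likelihood / uniform_likelihood)  # about one hundred million
```
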

Of course, for every dataset there is an explanation that explains it better than any other, which is the explanation that says that every time you make an observation you will observe exactly that. For instance, this explanation for the above d10 rolling would be that the probability of 1 is 1 and all the other faces have probability zero. If we call it the explanation E3, the probability of the observed data under E3 is 100%! It is obviously as large as you can get. This explanation is, however, an instance of what is called fitting the data. You just use the data as its own explanation. It is the simplest thing you can do when you do not have any other information. However, this is a hypothesis that should be tested not only against all other information you have, but also against more experiments. We will talk a lot about that.

EVALUATING HYPOTHESES

Quickly revising: conditional probabilities allowed us to think about the probability of something given the knowledge of something else. This naturally led to the idea of changing probabilities when the conditions themselves change, which is equivalent to saying that extra information became available. This connection is given by the inference formula, which we called Bayes' Theorem: the prior becomes the posterior after the new information is incorporated. Finally, we have learned that this connection is made by multiplying the prior by the likelihood.

When we look at Bayes' Theorem, we can understand it also as a means to evaluate the probability of some hypothesis, which in the formula would be the proposition A, given some piece of evidence D. Being a general proposition, this hypothesis can clearly be anything, even something completely imaginary like (spoiler alert for kids!) the existence of Santa Claus.

We will call the task of calculating the probability of a certain hypothesis hypothesis evaluation. This task is obviously central in virtually all areas of human enterprise. It is the bread and butter of science and also of financial markets. In science, the hypothesis is many times the theory that explains some natural phenomenon, where we associate the phenomenon itself with the data describing it. In a bank, the hypothesis could be the percentage change in the price of some stock by tomorrow, given the news about the corresponding company announced today.

This kind of hypothesis evaluation is also what happens when a crime is committed and the investigators need to evaluate possible scenarios based on forensic evidence. Solving a crime is, of course, a task full of uncertainties. The investigator must put together a story based on a possibly very small number of clues, or pieces of evidence. In an intuitive way, the investigator needs to evaluate how likely each possible story is, given the collected evidence. That is nothing more than evaluating the conditional probability of the scenario H given the forensic evidence D, or P(H|D).

Another very common situation is a trial, where either one person (the judge) or a group of persons (the jury) needs to evaluate not only the physical evidence but also the information given in the form of the accounts of several witnesses, experts and victims. During the sessions, lawyers representing the different parties will sew all of this together and create stories given that evidence. The judge or jury will then need to evaluate the conditional probability of those stories given the information provided and everything else they know about crimes, human nature, society and so on. Because each one of them has different life experiences, the given information will be different and different probabilities will be calculated. The task is even more difficult because there is also the need to evaluate whether the information itself is reliable or not; in other words, the probability that the information is true also needs to be evaluated at the same time.

Think about the murder of a billionaire lady, for instance. We can think about the murder itself and all the collected evidence as the data M (for murder). Suppose that we have two suspects, the husband H and, naturally, the butler B. What we want to do is to discover which explanation is more likely. According to the discussion of the previous section, this can be accomplished by evaluating the probability of the murder given that the killer is the husband, P(M|H), and the probability of the murder given that the killer is the butler, P(M|B). If we have

P(M|H) > P(M|B)

then we can say that the likelihood that the husband is the murderer given the evidence, L(H|M), is larger than the likelihood that the butler committed it, L(B|M), or that it is more likely that the husband is the killer given the evidence, which would be written

L(H|M) > L(B|M)

But if that is all, why should we use Bayes' Theorem? According to the above explanation, we just need to calculate the likelihood. We did not use priors and posteriors. We did not, but we should! As we have seen, the likelihood of a hypothesis measures its ability to reproduce the data. This is a very important point. All information used in the likelihood comes from the measured dataset. No other information enters this evaluation. The more probable a dataset is to be generated by the hypothesis, the larger the likelihood of the hypothesis being right if we take into consideration only that data. I will not get tired of repeating it! Is there anything else that we should take into consideration? Yes, and it goes by the name of priors.

Suppose that we know for sure that the husband is a good person and never did anything wrong, but the butler has killed other people before and is a sadistic, cold psychopath! This could be reflected, for instance, in their criminal records. That does not really prove that the butler is the murderer, but the problem here is that many times we cannot find conclusive proof of anything! When conclusive proof cannot be found and we still need to give a verdict, all probabilities should contribute! Even the previous history of each of the suspects counts. This previous history is, of course, the prior, and what we can infer from it will be the posterior. Because the likelihood and the prior multiply each other, they will compete for supremacy. The reason why this competition is in the form of a multiplication is related to something we have already seen before: maximum entropy. It is maximum entropy that gives the correct justification of Bayes' Theorem. We already have in our hands all the mathematical ideas needed to understand it. Showtime.

THE INFERENCE TIME ARROW

Let us talk about two generic propositions A and B without specifying what exactly they are. Remember that the conjunction A∧B corresponds to the proposition that both A and B are true at the same time. That is what we called the logical AND operation. It should be obvious that A∧B and B∧A must describe exactly the same proposition, as the conjunction does not have any relation to temporal order. If and when temporal order makes a difference, we have to include it in the propositions themselves, but it will not be part of the AND operator. As long as we use the rules of usual logic, there will be no special or pathological situations in which that is not true. We can write this obvious observation in our mathematical condensed form as

P(A∧B) = P(B∧A)

which is the short way of saying that the probability of the conjunction of two propositions does not depend on the order in which we write them. We can say that the probability is symmetric if we exchange A and B. Remember that a symmetry is something that remains unchanged when we change something else. In this case, exchanging A and B keeps the probability of the conjunction unchanged, therefore we have a symmetry.

Because this is always true, we can use it in the formula we wrote for conditional probabilities. If you go back to that section and look at that formula, you can take the denominator of the fraction and move it to the left side multiplying. This will allow us to write the following equation

P(A∧B) = P(A|B) P(B)

In this equation, we are relating the probability of the conjunction to the conditional probability of A given B and to the probability of B irrespective of what A is. Note that the conditional probability and the probability of a single proposition alone are not symmetric under the exchange of A and B separately. P(A|B) is not the same as P(B|A), and P(B) is obviously not the same as P(A). The symmetry miraculously appears only when we multiply both together.

We can do the same trick for the probability of the conjunction written in the other order to get

P(B∧A) = P(B|A) P(A)

Because both need to be equal, we then have

P(A|B) P(B) = P(B|A) P(A)

This is valid for any two propositions. They do not even need to be related to each other. A can be the proposition that we can find a polar bear in Africa and B can be the proposition that there are aliens living on Neptune. The above formulas still work. In particular, they work for our hypothesis evaluation task of the previous section and, to make things more mnemonic and easier to understand, we will use it as our working example.
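A tiny numerical example (a joint table I made up) shows the symmetry appearing only in the product:

```python
# A toy joint distribution over two propositions A and B (numbers made up).
p_joint = {(True, True): 0.2, (True, False): 0.2,
           (False, True): 0.1, (False, False): 0.5}

p_a = sum(v for (a, _), v in p_joint.items() if a)   # P(A) = 0.4
p_b = sum(v for (_, b), v in p_joint.items() if b)   # P(B) = 0.3
p_a_given_b = p_joint[(True, True)] / p_b            # P(A|B)
p_b_given_a = p_joint[(True, True)] / p_a            # P(B|A)

# Neither factor is symmetric on its own, but both products recover the
# same conjunction probability P(A and B) = 0.2:
print(p_a_given_b * p_b)
print(p_b_given_a * p_a)
```

Swap which proposition you condition on and the individual factors change, yet the product stubbornly comes out the same, which is the whole content of the symmetry.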

Instead of , we will then use the letter to make clear that our

proposition is now a Hypothesis about something. Instead of , we now

The Probable Universe

193

write for the Data we collected. By moving the second probability in the

left hand side of the equation to the other side dividing, we finally have

Believe it or not, that is Bayes' Theorem. This is the complete formula. You see that the difference to the one I gave you before,

P(H|D) ∝ P(D|H) P(H),

is that I skipped the equal sign and the division by P(D). This is because, as we will see later, P(D) is usually just a constant number and will not need to be calculated in some applications. In any case, we know that it should be there. We will talk more about that later.
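To see the complete formula in action once, here is a minimal sketch; all the numbers are invented for illustration only:

```python
# Bayes' Theorem: P(H|D) = P(D|H) P(H) / P(D), with invented numbers.
p_h = 0.3          # prior probability of the hypothesis H
p_d_given_h = 0.8  # likelihood: how well H explains the data D
p_d = 0.4          # probability of the data D, a constant for fixed D

p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)  # the posterior, here 0.8 * 0.3 / 0.4 = 0.6
```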

Are you surprised at how easily we got to Bayes' Theorem? But what

about the maximum entropy relation I promised? You are missing it

because, although we found Bayes' Theorem, it is still not the formula for

inference! You got me. I have been slightly misleading you from the

beginning, but that is because this point is very subtle.

Inference requires what we will call Bayes' Rule. The difference between it and Bayes' Theorem is that, as it becomes clear from the

considerations above, the latter is simply a consequence of the definition

of conditional probabilities. Conditional probabilities are valid for any

propositions and have nothing to do with temporal order. Inference, on the

other hand, has a very clear temporal order associated with it. We have a

prior, collect data and change the prior to a posterior. It is something

different.

It is worthwhile to spend as much time as needed at this point

because this is the most important distinction in this whole book. This

distinction is many times overlooked even by those who work with it. The

reason why this happens is because whoever looks at the formula of Bayes' Theorem naturally interprets P(H|D) as the new probability (the posterior) of H


after D is taken into consideration. After all, we read it as 'H given D', right?

The problem arises because one attributes to the symbol | a temporal

interpretation that it does not really possess. Go back and look at the

example of the LEGO bricks. We do not talk about the probability of a brick

being square before and after it was known to be red. All probabilities on

both sides of the equation we call Bayes' Theorem are defined at the same

instant of time and because inference is always related to change, we need

to add something extra to the equation. We need to justify why it works to

include this inference time arrow in it!

But beware! As it happens in many other places, even if those

things are different, people do not care too much about using different

notations and even I, when I am at work, write things in the form of Bayes' Theorem. As long as you understand what you are doing, you are

excused on the grounds of simplicity of writing.

Let us proceed. As we have seen, inference concerns changing the

probabilities we assign to some proposition after new data has arrived, in

other words, changing a prior to a posterior. The act of adding new data to

our database defines a before the new data and an after the new data.

What we call our database is obviously our current probabilities, as you

should know by now that probabilities are simply a method to encode

information.

Because there is a change in this database, one immediately gets

the concept of a temporal order of events. One very important thing that

you must bear in mind is that this time passage implied by the arrival of

new data is not the same as the time in which each piece of data was

generated. Confusing the temporal order of arrival with the order in which

each piece of information was created is the source of huge

misinterpretations. We can call the latter, the actual order in which things

happen, as the causal time arrow because it relates causes and effects.

Therefore, we must be very careful and always remember that the causal

arrow is different from the inference arrow.


To make this distinction clearer, let us consider an archaeologist

who is trying to piece together the life of Tutankhamun, the Egyptian

teenager pharaoh. Let us indicate his life history by the letter T. Each time someone finds a new artefact or document related to Tutankhamun, this new piece of information has to be added to the database, which can be imagined as a huge book where everything that was ever found about the pharaoh, which we call the data D, has been recorded, encoded in some language such as English, for instance.

Say that the archaeologist has to include in the book three new findings about Tutankhamun's life which were discovered last year. Call these findings by the names D1, D2 and D3, in the order in which they were found (not the historical order in which they happened). Let us say that D1 is an ancient book, D2 is a canopic jar and D3 is a statue. Each time one of these findings was added to the book, the archaeologist had to spend the whole night with the history team reviewing Tutankhamun's life history T. This is inference time. However, even if D1 comes before D3 in inference order, nothing forbids D1 to have been created after the statue D3 was sculpted.

Bayes' Rule, not Bayes' Theorem, is the actual algorithm for changing the probability of a hypothesis, like the life history of Tutankhamun T, as new evidence arrives. We will emphasise this by changing a bit the way we write probabilities. We will attach a small label n to the probability for T and write it as P_n(T).

The label n represents inference time and a summary of how it works for our digging is given by the table below.

P_0(T): the initial probability for T before taking into consideration any of the archaeological findings we found last year in the digging. This is the book about Tutankhamun's life before last year.

P_1(T): the probability for T after the information conveyed by the ancient book D1 has been taken into consideration.

P_2(T): the probability for T after the evidence provided by the canopic jar D2 has been taken into consideration.

P_3(T): the probability for T after all 3 findings have been found and analysed by the history team.

To automate this process, we need a rule, an algorithm, a program, which starts with P_0(T) and updates the probability of T each time a new piece of data arrives. We already gave a name for this program. We called it BAYES. When we first talked about it, we did not include the temporal details, but they are important. BAYES is not Bayes' Theorem, but actually Bayes' Rule.

To be honest, the truth is that in the past it was usual to assume that Bayes' Theorem and Bayes' Rule were the same. Bayes' Theorem was considered the rule for doing inference simply because it seemed to make sense but, if we consider everything very carefully, it does not. Only a long time later was it shown that Bayes' Rule, with its temporal inference order, could be derived from something much more basic: Laplace's Principle of Insufficient Reason or its generalisation, Maximum Entropy.

We talked a lot about both of them before. There, we saw that

these principles' objective is to guarantee that we use only the available

information without making any extra assumptions when we calculate

probabilities. Although the situations we analysed at that point and the


inference or hypothesis evaluation tasks seem to be different ones, they

are actually the same! What we are trying to do now is nothing more than

to create a new probability that includes the information encoded in the

old one plus the new information. We are trying simply to append new

information to our probability distribution. Again, the ideal way to do that

is by encoding only these two pieces of information, the old probability

(the prior) and the new data, without making any extra assumptions.

The mathematical proof of how to arrive at Bayes' Rule using maximum entropy is not simple to describe. If you are interested, you can find it in Ariel Caticha's paper listed at the end of this book (Caticha, 2010). The important thing is that this line of reasoning takes us to the following conclusion

P_{n+1}(T) = P_n(T|D) = P_n(D|T) P_n(T) / P_n(D)

This formula says that the new probability, the one at the inference time n+1, which is in fact the posterior probability we are looking for after including the new data D, can be calculated as a conditional probability using the formula we have derived for it as long as all probabilities at the right hand side are calculated at inference time n.

This finally is Bayes' Rule. You might be very unsatisfied at this point as there seems to be little difference between this formula and Bayes' Theorem, but there is a difference and a very crucial one! It is contained in the first equality in the formula above, given by

P_{n+1}(T) = P_n(T|D)

This formula means that the best rule to update probabilities is to use the conditional probability given by the previous probability functions. This equality is the actual Bayes' Rule, while the rest is just a reminder of how we can use Bayes' Theorem to calculate it without adding any extra information!
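As an algorithm, Bayes' Rule is this update applied in a loop. The sketch below is only an illustration: the three candidate histories and every likelihood number are invented, but the update function is the rule just described, with the normalisation P_n(D) computed at each inference time n.

```python
# Sketch of the BAYES update loop (all numbers invented for illustration).
# prior: P_0(T) over three candidate life histories.
prior = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}

# likelihoods[d][t] = P(d|t): how well history t explains finding d.
likelihoods = {
    "D1_book":   {"T1": 0.8, "T2": 0.4, "T3": 0.1},
    "D2_jar":    {"T1": 0.6, "T2": 0.5, "T3": 0.2},
    "D3_statue": {"T1": 0.7, "T2": 0.3, "T3": 0.9},
}

def bayes_update(p, likelihood):
    """One inference-time step: P_{n+1}(T) = P_n(T|D)."""
    unnormalised = {t: likelihood[t] * p[t] for t in p}
    evidence = sum(unnormalised.values())  # P_n(D), the normalisation
    return {t: u / evidence for t, u in unnormalised.items()}

p = prior
for d in ["D1_book", "D2_jar", "D3_statue"]:  # the inference-time order
    p = bayes_update(p, likelihoods[d])

print(p)  # the final posterior P_3(T); here T1 ends up the most probable
```

Each pass through the loop turns the current prior into a posterior, which then becomes the prior for the next finding.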

This is a good time to stop and review the names of all the symbols used above. These names, in the end, are what really make the connection with our intuition, so it is very important to keep them in mind all the time. The probability P_{n+1}(T) is called the posterior probability of Tutankhamun's life history because it is the probability of T after the new data D is included.

Can you remember how we call P_n(T)? Because this probability encodes all the knowledge we had before we started to include new data, it is nothing more than the prior probability of T. You might now want to revise our definition of likelihood and soon you will realise that P_n(D|T) is the likelihood of the history T being a good account given the new data.

There is still one term in that formula to which we have not assigned any name. We have been avoiding talking too much about it. This is of course the term P_n(D), the one under the division symbol. The usual name for this term is the evidence for the new data, but I personally hate that name and hardly use it, especially because, in practical situations, this terminology is usually unnecessary. In fact, most of the time, this number is simply what we call a normalisation. Sometimes, in particular in physics, it will have its importance. In those cases, this term will be known by the much more interesting name of partition function. I will open a parenthesis here and briefly digress about the former term, normalisation. We will talk about partition functions when we see how everything connects to physics.

NORMALISATIONS

At the very beginning of this book, when we deduced probabilities from Cox's Postulates, we agreed that they would always be real numbers between 0 and 1. We did this arbitrarily, for a question of convenience, as we could have chosen the interval from 0 to infinity. A consequence of our choice was that if we consider all possible outcomes for some experiment and assign probabilities to them, they have to add to 1.

Now let us have a more detailed chat about that small symbol ∝, the one we have been calling proportional to. It looks like the Greek letter alpha, α, and most people (including me) do not bother to write them differently when using a pen, but they are actually different and ∝ is a symbol that will be quite convenient many times.

We say that a certain quantity, let us call it y, is proportional to another quantity x whenever y is equal to x times a constant number. For instance, we say that the height of a building is proportional to the number of storeys it has. Suppose that each storey has a fixed height, let us say 3 metres. If we call the height of the building h and the number of storeys s, then we can write the simple relation

h = 3s

If you give me the number of storeys, I just need to multiply it by 3 to get the height of the building. Therefore, the height is proportional to the number of storeys and we could write it as

h ∝ s

Of course, with this notation we are losing the information about what the proportionality constant (that is the name we give to the number 3) is, but this notation is used when it is not important. When would it not be important? Sometimes, we just want to know things like: by how many times will the height of the building increase if we double the number of storeys? The actual height of each storey does not matter; the important thing is that if we double s, because h is proportional to s, it will also double. If we increase the number of storeys by 10%, then the height of the building will also increase by 10% and, sometimes, that is all we need to know.


To understand when something is not proportional to another thing, consider the infamous body mass index (BMI), which is supposed to indicate how much someone should weigh according to his or her height. The formula for this index is

BMI = w/h²

where w is the weight, or more precisely the body mass of the person, and h is the person's height. The idea is that you weigh and measure yourself, put those numbers in the formula above and get another number that will tell you if you are too thin, too fat or just right. I will not comment on the precision of this procedure, which is actually very poor, but let me use it to illustrate the concept of proportionality.

Looking at the formula, we can say that the BMI is proportional to the body mass and write

BMI ∝ w

The information conveyed by this is that, if you double your weight, your BMI will also double. If you get 20% lighter, your BMI will also decrease by 20%. Now, can we say that the BMI is proportional to the height? Of course not! Suppose that you compare two persons with the same weight, but with different heights. Let us call the height of the first person h1 and that of the second person h2. If the first person is twice the size of the second, we can write this as

h1 = 2 h2

If we square both sides we get

h1² = 4 h2²

and this means that the BMI of person 1 will be four times smaller than that of person 2. Proportionality would require that the BMI should, like the height, be two times larger for the first person. Therefore, we cannot


say that the BMI of a person is proportional to his or her height. However, we can say that it is proportional to the inverse square of the height and write

BMI ∝ 1/h²

That is because if you multiply the inverse square of the height by any number, let us say by 3, like this

1/h² → 3/h²

then the BMI is multiplied by the same number, which in this case would also be 3:

BMI → 3 BMI

Caution! You should be careful when dealing with the same quantities. You cannot say that BMI ∝ 1/h, as in this case the remaining factor w/h is not a constant.
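These scaling statements are easy to check numerically; in this sketch the weight and height values are arbitrary:

```python
# The BMI formula: bmi = w / h**2, with w in kilograms and h in metres.
def bmi(w, h):
    return w / h ** 2

w, h = 70.0, 1.75  # arbitrary values

# Proportional to the weight: doubling w doubles the BMI.
print(bmi(2 * w, h) / bmi(w, h))  # 2.0

# Not proportional to the height: doubling h divides the BMI by 4,
# because the BMI is proportional to the inverse square of the height.
print(bmi(w, 2 * h) / bmi(w, h))  # 0.25
```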

I said that in many cases we do not need the proportionality

constant, but I have to admit that sometimes we do. The good thing about

probabilities is that, exactly because we agreed that probabilities always

add to 1, we can calculate proportionality constants even when we do not

write them. That is why we wrote Bayes' Theorem at the beginning of this

book as a proportionality.

Let us see this in action for a dice rolling. Just to change a bit (and simplify things) suppose that we have a loaded d3. Someone built it in such a way as to make the face with the number 3 on it three times more probable to turn up than the face with a 1 on it. The person also took the trouble to guarantee that the face with a 2 on it is twice as probable as the face with a 1. Let us call p the probability of getting the number 1 when rolling this dice. The above considerations mean that

P(1) = p,  P(2) = 2p,  P(3) = 3p,

and because p is the same constant number for all three probabilities, we can also write

P(1) ∝ 1,  P(2) ∝ 2,  P(3) ∝ 3,

where the proportionality constant is the same for all three cases, being obviously p. We could write the above 3 probabilities in a condensed formula in two ways. Either

P(n) = pn

or

P(n) ∝ n,

with n being the number on the face, either 1, 2 or 3. The second way is just a more simplified way of writing the first one. Now, because probabilities are obliged to add up to 1, we could easily find the actual numerical value of the constant p by solving the following equation

p + 2p + 3p = 1

This would give p = 1/6. Finding p when we do not know it

beforehand is what we call normalising the probabilities. Note that 6 is actually the sum of all the values that appear after the ∝ symbol in the probabilities. The constant that normalises the probabilities is usually called the normalising factor or simply the normalisation. It is just a constant that appears in all probabilities but does not depend on each one individually, only on the sum of them all. The mathematical convention is that the normalisation is the number which divides everything. So, in the above example, the normalisation would be 6, not 1/6.
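The whole normalisation procedure for the loaded d3 fits in a few lines; this is just the calculation above written out:

```python
# Loaded d3: P(n) is proportional to the face value n.
weights = {1: 1, 2: 2, 3: 3}           # the values after the proportionality symbol

normalisation = sum(weights.values())  # 1 + 2 + 3 = 6
probabilities = {n: w / normalisation for n, w in weights.items()}

print(probabilities)                # P(1) = 1/6, P(2) = 1/3, P(3) = 1/2
print(sum(probabilities.values()))  # the probabilities add up to 1
```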

We are ready to understand why P_n(D) works as a normalisation in the Bayes' Theorem part of the inference equation. Notice that Bayes' Theorem gives the posterior probability of the life history T. If we could sum over all possible Ts, we should arrive at the value 1, as that is the way we agreed our probabilities should work. In this case, P_n(D) would be a constant exactly like the normalisation above. We do not need to worry too much about it because we could calculate it at some point when we really need it, in the same way as we calculated p above.

This means that the right hand side of Bayes' Rule is proportional to the left hand side with the proportionality constant being 1/P_n(D), or in other words

P_{n+1}(T) ∝ P_n(D|T) P_n(T)

This is a formula which involves just the posterior in terms of the prior and the likelihood. If you really need the normalisation because you want to calculate the exact numerical value of the posterior, then you can calculate it, as I have already said, by summing over all possible values of T, which we write as

P_n(D) = Σ_T P_n(D|T) P_n(T),

which is a formula that, again, involves only the prior and the likelihood. The big symbol Σ that you see in the above equation is the capital sigma letter of the Greek alphabet. Sigma roughly corresponds to our S and it is for that reason that it is used with the meaning of summation. The letter T, or whatever other letter we put underneath this symbol, is used to indicate the variable we are summing over.

Consider that there are three possible histories of Tutankhamun's life which are labelled by the names T1, T2 and T3. Then the above formula would be equivalent to

P_n(D) = P_n(D|T1) P_n(T1) + P_n(D|T2) P_n(T2) + P_n(D|T3) P_n(T3)

Therefore, if we wanted to, we could completely get rid of the P_n(D) symbol. That is why I do not really fancy it. It does not need to be there at all!
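The Σ formula is an ordinary sum and can be written as one. In this sketch the three histories and all the numbers are invented; only the shape of the computation matters:

```python
# P_n(D) = sum over T of P_n(D|T) P_n(T), with three invented histories.
prior = {"T1": 0.5, "T2": 0.3, "T3": 0.2}       # P_n(T)
likelihood = {"T1": 0.9, "T2": 0.4, "T3": 0.1}  # P_n(D|T)

p_d = sum(likelihood[t] * prior[t] for t in prior)
print(p_d)  # the sum 0.45 + 0.12 + 0.02
```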

But, as I said before, this is not the whole story for normalisations.

The first thing that can happen is that the normalisation can be difficult to

calculate. This is a problem because, in some cases, the normalisation itself

can become a very important tool which can be even more useful than the

probability. How can it be? If you have a formula for the normalisation, you

might be able to use some mathematical tricks to extract from it

interesting pieces of information like averages and variances. Due to these

tricks, normalisations play a central role in one of the most important areas

of physics, Statistical Physics. It is there that they are called partition

functions. We will get back to it.

TAKING DECISIONS

All we have been learning up to this point is concerned with one main

objective: taking decisions. This is nothing but the task we analysed in the

beginning of this chapter. In order to take decisions, we have to evaluate

the relative probabilities of several competing hypotheses. We will usually

not need to calculate the precise value of these probabilities, because they

all will have the same normalisation factor which, in this case, will not be

necessary to decide which one is the best.

Let us simplify things and consider only two hypotheses. If we have more than that, we simply compare them in pairs. The problem then becomes to evaluate how many times more probable a hypothesis H1 is compared to a second hypothesis H2 when the data D available for both is the same. This would be the equivalent of, for instance, piecing together two different accounts of Tutankhamun's life based on the archaeological evidence in D and trying to decide which one is his true life story (or at least truer).


A warning: I am going to use the notation I said was the wrong one, mainly for reasons of laziness. Therefore, pay a lot of attention to it and, in doubt, come back to this paragraph to remind yourself of what we are doing. Because I do not want to be carrying time indices all the time, we will forget about all of them when writing Bayes' Rule and, in addition, instead of using P_n we will write simply P. The result will look exactly like Bayes' Theorem, but it is actually Bayes' Rule that I want to write!

Yes, you will be right in complaining after I spent a whole section

highlighting the differences between Bayes' Rule and Bayes' Theorem. But

now that you know the difference, you can make the appropriate

corrections in your mind as long as the interpretations for each one of the

probabilities in the above formula, and the time to which each one is related, are clear in your head. In doubt, go back to the previous

sections. Welcome to the world of confusing notations in which

professional scientists live.

Now, because we want to compare the effect of the data in D on the relative probabilities of the two hypotheses H1 and H2, what we need to do is simply to divide the posteriors of both hypotheses to obtain

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

The normalisation cancelled because it is the same in both formulas. This is one of those cases in which it simply is not necessary. The above equation has the following meaning: the odds that hypothesis H1 is more correct than hypothesis H2 are given by the product of (1) the ratio between the likelihoods of each hypothesis and (2) the ratio between their priors.


The first ratio is called the likelihood ratio, but we will begin our

analysis by focusing on the second one, the ratio between the priors of the

hypotheses.

When evaluating the relative probability of two hypotheses, the

ratio between the priors is many times overlooked because one considers

both priors equal and their ratio becomes simply 1. Of course, if without

considering the data there is no reason to favour any of the hypotheses,

then their priors are indeed equal, disappearing from the formula. In other

words, if there is no reason to favour one of the hypotheses, then the

problem is reduced to calculating the likelihood ratio or, as we have seen,

how well each hypothesis explains the data.

HOWEVER, if the priors are different, the hypothesis with the

largest prior is favoured from the start, even if both explain the data equally

well. To understand that, remember that the concept of a prior is always

relative to the addition of new data, in this case represented by .

For example, suppose that you are a teacher and you have two students who are suspected of stealing the answers for an examination. Call them Alice and Beth. After all evidence has been collected about the crime, one thing is very clear: one of them must be the criminal with 100% certainty. You interrogate both and they both say that they are innocent. Both Alice's story (we will call it A) and Beth's story (we will call it B) explain the fact that the answers disappeared equally well, meaning that the likelihood for the answer-stealing is the same for both stories.

Who would you think is lying? If Alice is that kind of student with a clean

profile, who always studied a lot and was never involved in any

wrongdoing, while Beth was always creating trouble, it is obvious that in

the absence of any other evidence, you would think that the guilty one

should be Beth.

Let us understand the Bayesian reasoning working behind the

curtains here. If you call the evidence of cheating E, you can translate that to equations by writing P(E|A) for how well Alice's story explains the cheating. Remember that we are using here the notation where we ignore all time indices. In the same way, how well Beth's story explains all the evidence is written as P(E|B). Because both stories explain the situation equally well, these two probabilities are the same. This means that the

likelihood ratio part of the equation disappears and likelihoods alone

cannot help you take a decision. This is where the prior ratio enters. The

priors here represent how much you believe in the stories given your

prejudices about the two students. Prejudice? Yes, exactly. Prejudices are

nothing more than priors. In a perfect world, people should change their

prejudices with data using Bayesian inference, but we know people never

really do that, right?

Back to the crime. As we have seen, you know both students long enough to believe that it is extremely unlikely that Alice would be lying. On the other hand, given Beth's previous problems, it is much more probable that she is indeed lying. Your preconception about Alice saying the truth is your prior P(A), the probability that Alice's story is true according to your knowledge about her student profile. The same preconception for Beth is P(B). Because, for you, Alice is much more trustworthy, this implies that

P(A) > P(B)

Because the likelihoods are the same,

P(E|A) = P(E|B)

If we put this in the formula, we have

P(A|E) / P(B|E) = P(A) / P(B) > 1

and, therefore,

P(A|E) > P(B|E),


meaning that you are inclined to believe Alice even though neither story can be favoured by the evidence alone!

The decision might sound unfair, and it is. If we had the possibility

of not punishing anyone until an irrefutable proof appeared, that would be

the moral thing to do as probabilities are not certainties. But what if one

has to be punished? What if there is absolutely no way to spare both

students? Even if the truth cannot be discovered, we would have to punish

Beth! Even if only this time the guilty one is Alice and she is abusing the

system. This is certainly unfair, but it comes from the necessity to take a

decision. This happens every day in every part of the world inside courts.

The judge cannot suspend the judgement forever. A final verdict always has to be given.

We see here how the prior affects the decision if the likelihoods are the same. But you must also be aware that even if the likelihoods are not the same, but the priors are very different, their difference can overcome the difference in the likelihoods! For an example, suppose that Beth's story is actually 2 times more plausible than Alice's, or in symbols

P(E|B) = 2 P(E|A).

However, if you think that Alice is 4 times more honest than Beth, or

P(A) = 4 P(B),

by inserting these two ratios in the formula we have

P(A|E) / P(B|E) = (1/2) × 4 = 2,

which implies that P(A|E) = 2 P(B|E), or that it is still twice more probable that Alice's story is the correct one and she should still be considered innocent! Lesson for life:


Never underestimate the power of prejudice,

even in the light of evidence!
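In code, the whole comparison collapses to one multiplication. Using the numbers from the example just given (the likelihoods favour Beth by a factor of 2, the priors favour Alice by a factor of 4):

```python
# Posterior odds = likelihood ratio times prior ratio.
likelihood_ratio = 1 / 2  # P(E|A) / P(E|B): Beth's story explains E twice as well
prior_ratio = 4           # P(A) / P(B): Alice is judged 4 times more trustworthy

posterior_odds = likelihood_ratio * prior_ratio
print(posterior_odds)  # 2.0: Alice's story is still twice as probable
```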

Let us now consider what happens when the priors are equal. In

that case, our equation becomes

P(H1|D) / P(H2|D) = P(D|H1) / P(D|H2)

The ratio of the probabilities of the two hypotheses in this case will

only depend on the likelihood ratio as I had already pointed out. This

sounds less unfair, but this is just because I said to you that priors

represent prejudice and prejudice is never a nice word to hear. But you

must remember that the prior includes all previous information, which

might not be only prejudice in the bad sense of the word, but proved and

verified information that we had before we started to consider the data

and it might be wrong to ignore it.

The above formula clearly makes sense if you think about that.

When we do not have any other reasons but the data to choose between

two explanations, the one that explains the data better should be the correct one. Coming back to Alice and Beth, now imagine that both students have clean profiles. Neither has ever had any problems in the school. In this case, the priors for them would be the same,

P(A) = P(B),

and the one with the best story would win. You might be shocked by the

fact that it is not really the truth that wins, but the best account of the

facts. The point is that all we know about the truth is either in the priors or

in the data. How true each account is has to be judged only on the basis of

how consistent it is when compared with the data. That is why, as unfair as

it is, having the best lawyer ends up making the whole difference in a


judgement and this can only be compensated by a good judge, one that has

enough experience with cases to put together good priors!

There is a subtle point here which, in fact, is extremely important.

It is the idea that we do not really need to know the exact value of the

probabilities to take a decision. All we need to know is which one is more

probable. In everyday life, we know that, but once we learn the above

arguments and formulas and get hooked by them, it is easy to forget the

simple things. Instead of calculating probabilities of hypotheses up to

several digits after the decimal point, we just need to rank them.

The importance of this simple observation goes very deep. It can

be appreciated by considering nothing less than science itself. There is no

way to evaluate exactly the probability of a scientific theory being correct,

because we simply will never know all possible theories. Why would we

need to know all theories? Because to calculate the exact value of a

probability we need to calculate the normalisations, but normalisations

require us to calculate a sum over all possible values of our variables!

However, because the normalisation disappears when we calculate ratios,

even without knowing the probability of a certain theory being right, we

can still rank them in order of plausibility! Science still works.

THE BAYESIAN WAY

There is nothing more to BAYES than that. Seriously. We finally arrived at

the complete formulation, both conceptually and mathematically, of

Laplace's ideas based on the insight of Bayes. The previous section

summarises all you need to know to take decisions in the most unbiased

way possible considering all available information. In fact, it even shows you how to take biased decisions knowing that you are doing so.

As you could see, you did not need much more mathematics than

you have learned in high school. You need to multiply, divide and pass

things from one side to the other of an equation. That is pretty much all


mathematics we have used. I had to introduce some strange letters, but

they are still just abbreviations of the four basic arithmetic operations. The

sophistication of Bayesian reasoning is not in the formal techniques

required to use it, but in the philosophical framework underlying the idea.

Although Bayesian inference is a powerful tool for science, I believe

it is very clear after all the examples we have seen that its importance

is much wider. Bayesian reasoning is not only a tool for developing theories

or taking managerial decisions, although these are surely included. By

understanding how it works, we understand the very foundations of what

is meant by rational thinking. More than that, it shows us exactly where

our biases lie, how they enter in every single choice we make and what we

should do to modify that. You can even pinpoint where emotions enter to

bias the process! As you might guess, it will always be in the prior.

Now we can finally complete our picture:

In order to appreciate all the consequences and the power of both

probabilities and the Bayesian way of thinking, we need to see it in action

though. That is what we are going to do in the next chapters. I almost

forgot to say one last thing: there's a catch.


9.

The Catch

I BEG YOUR PARDON,

I NEVER PROMISED YOU A ROSE GARDEN.

- LYNN ANDERSON

LAW AND DISORDER

After a whole book on Bayes, it seems clear that a justice court should be

the place in which Bayesian inference would be most important to

guarantee that the level of injustice remains at its minimum. Judgements,

by their own nature, admit only two results: guilty or not guilty. If the analysis of the whole case leads to the conclusion that no conclusion can be reached, then the result should be the not-guilty verdict. There is then

only one option, you might think: calculate the ratio of the posteriors and

decide the case by choosing the highest one.

Lately, there has been much noise about decisions forbidding the use of Bayesian arguments inside courts in Britain. These rulings, if

taken literally, are simply ridiculous because, as we have seen in this entire

book, you either use BAYES or you are doing something wrong. Some of the

arguments are indeed very naive, to say the least. For instance, consider

the following comment given by Lord Justice Toulson on an appeal on

24/01/2013

THE CHANCES OF SOMETHING HAPPENING IN THE FUTURE MAY BE EXPRESSED IN TERMS OF PERCENTAGE. EPIDEMIOLOGICAL EVIDENCE MAY ENABLE DOCTORS TO SAY THAT ON AVERAGE SMOKERS INCREASE THEIR RISK OF LUNG CANCER BY X%. BUT YOU CANNOT PROPERLY SAY THAT THERE IS A 25 PER CENT CHANCE THAT SOMETHING HAS HAPPENED: HOTSON V EAST BERKSHIRE HEALTH AUTHORITY [1987] AC 750. EITHER IT HAS OR IT HAS NOT.

The opinion expressed in the above comment is obviously wrong. We know that we can attribute probabilities to past events; we did exactly that with the life of Tutankhamun. The argument is itself weird because, in the same way as something either happened or not, the smoking person will either die or not, unless she becomes trapped in some kind of interdimensional limbo, which seems highly improbable. We have learned that probabilities arise from lack of knowledge, not from any temporal or causal order. Whenever information is missing, the entropy is not zero and probabilities will reflect that uncertainty. The refusal to accept Bayesian reasoning here is just a misjudgement due to improper knowledge of the subject. It is not the first time this happens in courts, and it will not be the last, as we all know.

Another court decision, in 2010, also ruled out Bayesian reasoning, using a slightly different argument: that the numbers used to calculate the likelihood ratio for one of the pieces of evidence being reliable or not were not precise. Of course they will never be, because they are probabilities. In this case, however, the complaint was that this was not explained well to the jury and it was therefore unfair. Without entering the discussion of whether being judged by a group of people without guaranteed expertise in logical deduction, deception and inference, and fully susceptible to whatever influence from popular culture and the media, is actually fair, we have spent around 200 pages just to understand in a very basic way why BAYES is correct. And we were not under pressure! There is no way to explain this to a jury in a couple of hours.

It is true that the prior and the likelihood contain uncertainties, and these should be taken into consideration. This is one of the catches of Bayesian inference. However, the alternative of not using BAYES is even worse. The way out is not easy. It consists of finding guiding principles to encode information in the probabilities, like we did with symmetry, and of keeping testing them against new data. If they do not work, like any other belief, you need to change them.

But that is not the only problem we face when taking decisions.

Why do judges worry so much about the uncertainty in the probabilities

after all? The answer is obvious, but it is worth going through it explicitly.

Think about a homicide case in the USA, in one of those states in which the death penalty is legal. Call the proposition that the accused is guilty G and the proposition that the accused is innocent I. Suppose that the priors are the same and disappear from the process of taking the decision. Suppose also that, once the likelihoods given the evidence E are calculated, you end up with

P(E|G) / P(E|I) = 1.0001

This means that the probability of the accused being guilty is only 0.01% larger than the probability of that person being innocent. With such a small margin, would you send this person to the electric chair? I would not. That is because I know that, for this kind of probability estimation, both the priors and the likelihoods have a high chance of being wrongly estimated, with errors that are probably larger than this margin. Am I sure? No. But killing someone is something that cannot be undone. I would rather not risk it.

Here is the biggest catch of taking decisions: the cost of being

wrong. Whenever you use probabilities to estimate anything, there is a

chance that you are going to be wrong. This cost is not part of Bayes Rule

and cannot be calculated from any general principle of information theory.

Costs in taking decisions can even be purely emotional, without any rational justification for them (except for the fact that being unhappy on purpose is not rational at all). They vary with the situation and even from person to person.

Let us say that a friend comes to you with a coin and wants to bet on which side is going to turn up when the coin is flipped. He says that, just for fun, he wants to bet a toothpick on the game. You probably will not mind and will go on with the game. But what if your friend says that whoever loses has to cross the motorway from one side to the other with their eyes shut? You start to doubt that he is actually your friend.

The difference between these two cases is the cost of losing. There is no problem in giving up a toothpick; you can always get another one in the nearby cafe, but you cannot find a new life there. What if your friend rolls a d10 and says that you only need to cross the motorway if the number 1 turns up? I am not sure about you, but I need much less than a 1/10 chance of dying before I agree to risk my life. However, that is just me. Everywhere in the world you can find people willing to enter bets which are not much different from this one.

Why do people enter bets like that? Because the cost of losing has to be compensated by the temptation of winning. If the prize received upon winning is attractive enough, you will always find people willing to accept the cost of losing. There is an interplay between these two quantities, but it is quite difficult to estimate.

On the other hand, there is also the cost of not taking the decision. If your friend has a gun and says that you either accept his bet or he will shoot you on the spot, then your willingness to accept the game will instantly change. There is no easy method to quantify costs in general. Each case is a particular case.

In this sense, when a judge worries about using Bayes Rule in court, what he or she is worried about is that, if the calculation of the probabilities was done in too uncertain a way, the cost of deciding wrongly will be extremely high. Because we know that each case is different, we also know that the model used to calculate likelihoods and priors can be very poor. This, of course, should not prevent us from using BAYES, but it should force us to analyse our models carefully and substantiate them with enough supporting data. If, even with that, judges still do not accept it, then they will be making a serious mistake.
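For readers who like to experiment, the interplay between posteriors and costs can be sketched in a few lines of Python. This is my own illustration, not a method from the book or from any court: the loss values below are invented purely to show that the decision threshold depends on the costs, not on the posterior alone.

```python
# Hypothetical loss values: convicting an innocent person is taken to be
# far worse than acquitting a guilty one. These numbers are invented.
LOSS = {
    ("convict", "guilty"): 0.0,
    ("convict", "innocent"): 100.0,   # cost of convicting the innocent
    ("acquit", "guilty"): 10.0,       # cost of letting the guilty go free
    ("acquit", "innocent"): 0.0,
}

def best_decision(p_guilty):
    """Pick the action with the smallest expected loss under the posterior."""
    p_innocent = 1.0 - p_guilty
    expected = {
        action: LOSS[(action, "guilty")] * p_guilty
                + LOSS[(action, "innocent")] * p_innocent
        for action in ("convict", "acquit")
    }
    return min(expected, key=expected.get)

# A razor-thin posterior margin is nowhere near enough to convict:
print(best_decision(0.50005))   # acquit
# Only an overwhelming posterior tips the expected loss the other way:
print(best_decision(0.95))      # convict
```

With these particular losses, the expected loss of convicting only drops below that of acquitting when the posterior of guilt exceeds about 0.91; change the losses and the threshold moves with them.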

MODELS

When we talked about Maximum Entropy, we were in the business of encoding information about the symmetry of a situation into prior probabilities. But, in fact, there is no known general method for calculating priors in all possible situations. The likelihood suffers from the same problem, as it also encodes information, although of a different kind. When building the likelihood of a proposition A given the data D, written P(A|D), what we really want is to encode all the influences and relationships between these two things in a mathematical formula. Trying to build this formula out of the known relationships between the variables has a name in science and mathematics. It is called creating a model, or modelling.

Another equivalent way of visualising a model is to look at it as a program or a mathematical function. Every model can be broken up into a collection of possibly interconnected programs, which you can consider as a kind of super-program.

BAYES, you feed some information to models and they give you back a

different piece of information, processed in a way that is more convenient

for us to use according to the problem in our hands.

For instance, the program PRIOR asks you for a proposition A and all the information you have ever collected about that proposition. In return, it gives you back the probability distribution that we called P(A). The probability distribution itself is information, but processed in a form more useful for, let us say, doing inference using Bayes Rule.


The program LIKELIHOOD works in a very similar way, although it is a bit more complex as it is concerned with the influence of one variable on another. This program requires you to enter two propositions A and B and all the information about how one influences the other that you might have collected somehow. It then returns to you a function P(A|B) which takes into consideration how a certain value of B will determine the value of A.

Consider the case in which every value of B determines the value of A with complete certainty. We call models of that kind deterministic, as opposed to probabilistic models, in which we can only model probabilities for the values of A. Suppose that x is the distance of a vehicle from a starting position on a road and t is the time at which that distance is measured. Let us say that the car has a speed of 30 km/h. How far from the starting point will the car be after 2 hours? The answer is clearly 60 km, and we can write a very simple formula relating x and t that will always work:

x = 30t

If you give any value, in hours, to t, then the value of x in kilometres is given with complete certainty. This is a very simple deterministic model. There is a formal way to write a deterministic model as a probability by using something that we have seen in the zoo: the Dirac delta. The Dirac delta is the way to encode deterministic models as likelihoods. In the case of the above model, our notation becomes

P(x|t) = δ(x − 30t).

The meaning of the above equation is that, although we write it as a conditional probability, the delta is telling us that we in fact know that the only possible value of x given t is the one that makes whatever is inside the brackets equal to zero. As you can easily confirm, if you do that in the likelihood above, you recover our formula for the position of the car given the instant of time at which it is measured. You can think of it as if the possible values of x were given by a Gaussian with zero variance centred at the point x = 30t.


Probabilistic models are a generalisation of deterministic models. In deterministic models there is no uncertainty in predicting a variable. When uncertainty starts to appear, we then need to introduce probabilities. If we consider the Dirac delta above as a Gaussian with zero variance, then we can introduce uncertainty by increasing the variance from zero to any other positive number. This would create a probabilistic instead of a deterministic model, in which we would know the probable position of the car, but not the exact one.
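To make this concrete, here is a small Python sketch, my own illustration rather than anything from the book's formulas beyond x = 30t, of a Gaussian likelihood centred at the deterministic prediction. Shrinking the standard deviation concentrates the probability there, approaching the Dirac delta:

```python
import math

# Gaussian likelihood P(x|t): centred at the deterministic answer 30*t,
# with standard deviation sigma controlling the uncertainty.
def likelihood(x, t, sigma):
    mu = 30.0 * t                      # deterministic model: x = 30t
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

t = 2.0   # hours; the deterministic model predicts x = 60 km
# With a large sigma, positions away from 60 km are still quite probable...
print(likelihood(55.0, t, sigma=5.0))
# ...but as sigma shrinks, the density piles up at 60 km, approaching the
# Dirac delta of the deterministic model.
print(likelihood(55.0, t, sigma=0.5))
print(likelihood(60.0, t, sigma=0.5))
```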

As another example, we know that, in a coin tossing, if the probability of heads is p, then the probability of tails is 1 − p. If we call the result of the coin tossing r, with possible values heads and tails, then we can write a probabilistic model for the coin tossing game as

P(r = heads|p) = p,    P(r = tails|p) = 1 − p.

Now, if I say to you that p is 1/3, you cannot give me the exact value of r; the most you can say is that r has a probability 1/3 of being heads and a probability 2/3 of being tails. That uncertainty characterises the model as a probabilistic one. This is a general model that works for any value of p. Just choose one and plug it into the above formula.

We have already seen an example of how to calculate the prior for a d10 roll, as long as we assume some fairness conditions. In that case, we were building a probabilistic model of the prior. That modelling was based on very clear assumptions. Nevertheless, we argued that we should always test our prior by actually rolling the d10 and changing it if necessary; up to this point, however, we never really talked about changing the likelihood.

Of course, if everything we said before remains true, likelihoods should also be changed when they are wrong, in exactly the same way as we change priors into posteriors using Bayes Rule. The only complication is that the task now becomes to encode priors of the likelihoods themselves and modify those priors into posteriors of the likelihoods. For instance, if we call the likelihood of A given B by P(A|B), then the prior of the likelihood is something like

P(P(A|B)).

Sometimes these are probabilities of the values of the parameters in your model; sometimes they are probabilities of how these parameters relate to each other. Does it look confusing? It can be, even for an experienced researcher. Let us try to understand it using a very simple example. Let us go back to the coin.

As we have seen, we can understand likelihoods as theories about a certain phenomenon. If you are considering a coin tossing game, you immediately create a theory about it inside your mind. A theory is created by collecting a certain number of rules or, as we physicists like to call them, laws. In the case of tossing a coin, we can consider the following three basic laws:

1. The coin has two sides and only one of them can be the result of a tossing at each time.

2. The result of a coin tossing at some time does not depend on what happened before.

3. The chances of one side being the result of the tossing do not change with time.

All three laws above are very natural, but they all might be wrong! What if the coin hides some electronic mechanism that makes the next result depend on the previous one? What if the coin is magic and the faces keep changing? What if the coin is somehow melting and an imbalance between the two faces keeps growing with time? We can call the above three laws part of our Coin Tossing Theory, which we will call CTT for short. We can now start to work with those laws and try to deduce something more useful from them.

Let us start by creating some terminology that will make it simpler for us to talk about experiments. As the coin has only two sides according to the first law, let us call one of them heads, or H, and the other tails, or T. Because of the second and third laws of CTT, each time we toss the coin we can say that there is a constant probability for each side. Let us call the probability that a result is H by the letter p and the probability that a result is T by q. By combining CTT with what we learned about probability theory, we know that p + q = 1 and therefore q = 1 − p. This shows that, in fact, we can get rid of q and say that the probability of getting H is p and of getting T is 1 − p.

What you have to notice now is that CTT does not say anything about the fairness of the coin tossing. Therefore, we cannot say that p = 1/2. As far as CTT is concerned, p can be anything. We call p a free parameter of our theory. The only way to know p is by throwing the coin and updating our information. How do we do that? The answer, as you might expect, is Bayes Rule. Suppose that we call the result of a coin tossing by the letter r. Each time we toss the coin, r can be either H or T. This means that, if we want to infer the value of p, we have to find its most probable value given the results of the coin tossing. This can be found by using Bayes Rule as

P(p|r) = P(r|p) P(p) / P(r).

Remember that P(p) is just the prior over the possible values of p before that specific coin tossing. Notice the subtlety of the problem here! Because p is itself a probability, P(p) is the probability of a probability! It should include all the information about the possible values of p that we gathered from all the previous times we tossed our coin. Clearly, if we are talking about the very first coin tossing, unless we have reasons to believe the contrary, the most unbiased thing to do is to start by assuming that p can be any number between zero and one (because it is a probability) with the same odds. That prior is, however, not important to us right now. What we are interested in discussing is actually the likelihood P(r|p), which basically encodes our CTT. It would not be an exaggeration to say that P(r|p) is CTT, if we neglect some unimportant subtleties. This means that we would like to be able to actually calculate P(r|p). In the simple case we are studying, we can.

It is not difficult to see how this can be done if you remember that P(r|p) means the probability of the result being r given that the probability of getting an H is p. This just means that if r = H, then P(r|p) = p, and if r = T, then P(r|p) = 1 − p. This is a very neat result, but it was obtained only because we decided to assume that the three laws of CTT are valid. If we now start to toss the coin and estimate p using Bayes Rule, we will of course get some result, but if CTT is wrong, that form of Bayes Rule will not tell us that!
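For readers who want to see the updating at work, here is a small Python sketch, my own illustration rather than anything prescribed by CTT, that applies Bayes Rule toss after toss over a grid of candidate values of p:

```python
import random

# Inferring the coin bias p with Bayes Rule on a grid of candidate values.
grid = [i / 100 for i in range(101)]          # candidate values of p
prior = [1 / len(grid)] * len(grid)           # uniform prior P(p)

def update(prior, result):
    """One Bayes Rule step: posterior proportional to likelihood times prior."""
    # The likelihood of CTT: P(r=H|p) = p and P(r=T|p) = 1 - p.
    likelihood = [p if result == "H" else 1 - p for p in grid]
    unnorm = [lk * pr for lk, pr in zip(likelihood, prior)]
    z = sum(unnorm)                           # the evidence P(r)
    return [u / z for u in unnorm]

random.seed(0)
true_p = 0.3                                  # the bias we are trying to recover
belief = prior
for _ in range(1000):
    r = "H" if random.random() < true_p else "T"
    belief = update(belief, r)

best = grid[belief.index(max(belief))]        # most probable value of p
print(best)                                   # ends up close to 0.3
```

Notice that the code can only ever conclude something about p; if the coin violated the laws of CTT, nothing in this loop would warn us.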

Imagine, for instance, that you are not actually watching the coin tossing with your own eyes, but that it is being broadcast to you via radio. You simply assumed that a coin would have two faces, and you are happily doing your inference on p until the person at the other side says to you: "This time it is neither H nor T, it is the third side!" Your inference crashes at that point simply because you do not have a theory that allows you to calculate the likelihood when r is neither H nor T!

If you want to continue the inference game from that point onwards, you will have to change CTT to allow for a three-sided coin. If you had known from the beginning that the number of sides was a variable, let us say n, you could even have included it as another free parameter in CTT and done the inference of p and n altogether using Bayes Rule with a likelihood given by P(r|p, n, ...), where the three dots mean any other parameters that would be necessary. But you did not know that in advance.

What do you do now?

Well, you do what every professional scientist does at this point: you try to find new laws. The simplest modification would be to change the first law to allow for three sides in your coin. This means that you will have three, not two, probabilities to calculate: p for H, q for T and s for X, our new side. You can still say that p + q + s = 1, but now you have two free parameters instead of one. You can choose any pair since, once you find two of them, the third is automatically defined. Let us choose p and q. Then, our likelihood changes to

P(r = H|p, q) = p,    P(r = T|p, q) = q,    P(r = X|p, q) = 1 − p − q.

This is our new model, which we now call the Generalised Coin Tossing Theory, or GCTT. In a sense, this is a more general model because, if the coin has only two sides after all, we will end up inferring s as being zero at some point and will be able, with some precision, to eliminate it from GCTT. If that happens, GCTT reduces itself to CTT.

Now you can appreciate the problem created by the uncertainty of

the model. What if we need to judge a case based on the above coin

tossing, but the true coin has three sides and we are using the two-sided

model? Even if Bayes Rule gives us the best guess for the probabilities, by

using the wrong model we can only get wrong results.

What is the solution to this conundrum? It is that we also need to be constantly revising our models. Sometimes we will not know how to do that using Bayes Rule, and it is in those cases that creativity takes place. A computer can run BAYES at any time, but only our naturally evolved computer has the capability of imagining new models, at least for now.

There is one last thing that we need to talk about in the case of models. You noticed that we could either infer the free parameters of the model or change the mathematical structure of the model itself. Although these two things seem very different at first sight, at a higher level they are just the same. The mathematical structure is also a piece of information and, like any piece of information, it can be written as a numerical parameter that should be inferred!


If you do not believe this, think about the following. The

mathematical structure of electromagnetism can be contained in a book.

That book can be scanned and put into digital format if it has not been

written like that already. In any case, the book becomes a computer file

and, as we all know, a computer file is simply a sequence of zeroes and

ones. But a sequence of zeroes and ones is nothing more than a very big

integer number! Therefore, the entire electromagnetic theory can be

encoded into an integer number, and we do that all the time without

realising it. Discovering the correct electromagnetic theory might be seen

as inferring the right integer number. The problem is how difficult it is to do

that.
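You can check this yourself with a couple of lines of Python. The sketch below is my own; the short string stands in for an actual book on electromagnetism, but the mechanism is identical for a file of any size:

```python
# Any text, hence any theory written down, is a sequence of bytes, and a
# sequence of bytes is one (very large) integer.
theory = "Maxwell's theory of electromagnetism"    # stand-in for a whole book
data = theory.encode("utf-8")                       # text -> bytes (zeroes and ones)
number = int.from_bytes(data, byteorder="big")      # bytes -> one big integer
print(number)

# The encoding is reversible: the integer contains the whole text.
recovered = number.to_bytes((number.bit_length() + 7) // 8, "big").decode("utf-8")
assert recovered == theory
```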

NOISE, ERRORS AND CODES

Suppose we have a perfectly deterministic model for our likelihood; the car from the previous section, for instance. Let us just change the notation to make things easier to understand and write

P(x|t) = δ(x − vt).

This means that we are interested in the probability of the car being at position x given that the measured time is t. The speed of the car is v. If the clock measuring t is perfectly precise, then the position of the car is given with perfect precision by the value x = vt.

What if, instead of you measuring the time, it is a friend of yours, and your friend has to tell you the measured time via radio? So far, so good; but what if the radio is very bad and full of interference? The formula giving the car's position at a certain time remains the same, but if you get the numbers given by your friend wrong, you will calculate the wrong position of the car. If you know that the time might be wrong, then you now have an uncertainty, and probabilities are back in the game.


The radio interference is an example of what is called noise. In the real world, whenever we measure something, that measurement always comes with some errors, and those errors are collectively called noise. We say that the measurement is corrupted by noise. This happens, for instance, when you copy files on your computer. Because of imperfections in the materials, power surges and even the probabilistic nature of quantum mechanics, there is always a chance that some of the bits composing your files will be flipped. This means that if they are 1 they might become 0 and vice-versa. In computer science, this has the obvious name of flip noise.

It would be very good if we always knew when a bit in a file has been corrupted by flip noise but, just by looking at that particular bit, there is no way to know. To see how this can affect a model, let us consider the following simplified situation; call it the deterministic coin tossing game. Instead of actually tossing a coin, we enter a number n in the computer. If the number is zero, the computer returns heads; if it is one, the computer returns tails. That is a very boring game, with the following values for the likelihood P(r|n) of the result r of the coin given the entered number n:

P(heads|0) = 1,    P(tails|0) = 0,
P(heads|1) = 0,    P(tails|1) = 1.

However, if your computer suffers from a hardware problem that flips every bit you enter with probability 1/2, then the results of our coin tossing cannot be predicted with certainty anymore! Although internally the computer is generating results following a deterministic rule, there is now noise in the data you are entering! Noise is a very tricky fellow. It can be anywhere and appear at any time. In the above case, it would be useless to use the deterministic model for the coin tossing, as you would be wrong half of the time! The best you can do, given that you know the strength of the noise, is to say that you now have

P(heads|n) = P(tails|n) = 1/2,

no matter the value of the number n you enter. Noise washed away your certainty and forced you to use probabilities once again.
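Here is a small Python sketch, my own illustration of the corrupted game: the deterministic rule is still inside the computer, but with the bits flipping half of the time the deterministic prediction fails as often as it succeeds:

```python
import random

# The deterministic coin game corrupted by flip noise: each entered bit n
# is flipped with probability q before the deterministic rule
# (0 -> heads, 1 -> tails) is applied.
def noisy_game(n, q, rng):
    if rng.random() < q:       # flip noise corrupts the input bit
        n = 1 - n
    return "heads" if n == 0 else "tails"

rng = random.Random(42)
q = 0.5                        # noise strength: every bit flips half the time
trials = 10000
# We always enter 0 and predict "heads", as the noiseless model dictates.
correct = sum(noisy_game(0, q, rng) == "heads" for _ in range(trials))
print(correct / trials)        # roughly 0.5: right only half of the time
```

Lowering q towards zero recovers the deterministic model, which is another way of seeing that probabilistic models generalise deterministic ones.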

Whenever there is a chance that noise is present either in your

measurements or anywhere else in your model, you have to take that into

consideration when building it. If you do not, you will just be fooling

yourself.

The deeper meaning behind this is related to something we discussed before. In order to check our models, we also need a model of how the measurements are taken and of how noise corrupts those measurements. But to check the noise model, we need yet another model that says something about the errors involved in checking the noise model, and so on ad infinitum. The way out, as we have seen before, is to rely on the consistency of our models.

But we do not need to go that deep to fight noise in general. The branch of mathematics known as information theory, a great part of which can be attributed to the works of Shannon, revealed to us many techniques that can be used to shield us from errors. It is variations of these techniques that allow us to transmit information back and forth through the web without losing the contents irreparably. All of them are based on a very simple concept called redundancy.

Yu can probly understd tis sentnc in Englsh ven wit sme o te letrs

missin. U r also capable of understand words when they r not completely

written. Eevn mroe aznmialgy, you can uaterndnsd tihs etnrie sncnetee in

wcihh all the wdors are wnlgory wtiertn epecxt for the frsit and lsat lteerts!

The last example is the most difficult because there are many more errors, but you can still understand the sentence with some effort. How is it possible that we can correct the errors of a text with such a high level of noise? What is happening here is that all languages, English included, have a high level of redundancy. By that, I mean that the number of words that really exist is so much smaller than the number of possible combinations of letters that only a few of those combinations make sense. The longer the word, the easier the correction becomes, because there are far fewer possibilities.

But languages have other characteristics which create extra redundancy, like grammar and semantics. These rules guarantee that there is a certain rational order in combining words which, if subverted, renders the sentences meaningless. For instance, the following sentence is extremely rare because it does not make any sense:

CAR WINDOW WALL MAKES TO BE TO SAY AMONG FILES FLIES BAD BLUE SMALL.

The sequence of words in this sentence does not make sense, so upon reading it we know something must be wrong. In fact, if the publisher of this book exchanged any of the words in my original sentence, there would be no way to know! On the other hand, you can easily fill in the missing word in the following sentence:

I HAVE BEEN HAVING STRONG HEADACHES SINCE I HIT MY ____ ON THE WALL.

Surely there are other possibilities, but it is almost certain that the missing word is HEAD. More than that, the above sentence makes sense because it has nouns and verbs arranged in an order that we know is correct. This kind of redundancy can even help you when you are learning languages other than your native one.


But what happens if you have in your hands a message in a language that you do not know? How do you know there is a mistake in it? This is what happens in computers and other digital devices all the time. Computers copy information, but they do not interpret it. How can they correct the possible errors then? The answer is in the grammar rules and, in particular, in the spelling rules of the language. Some languages have very strict orthographic rules and, once you learn them, you can identify errors even in words whose meaning you do not know.

For instance, Portuguese has a rule that says that you never use "n" before "p" or "b"; it is always "m". This means that, even without knowing what the word "campo" means, if you see it written as "canpo" you immediately know it is wrong. There are, of course, exceptions, but they are few enough for the rule to work most of the time and give you a good error-correction rule.

Error correction is indeed the technical term used in information theory for the process of identifying and correcting errors in messages. It would be much easier to correct errors if we could read every message, but nobody wants that, right? Well, except maybe the NSA, but that is a subject for another book. The solution found was to create a code in which the "orthographical rules" are so strict that one cannot violate them without someone else noticing.

The problem then becomes to translate every message into this code. When we are dealing with computers, this code has only two characters: 0 and 1. You can imagine how tricky it is to create a set of rules of this kind in such a way that every kind of information can be translated into the code. Shannon, once again, showed that this is possible, but only if you introduce redundancy into the encoded message. The more redundancy you introduce, the easier it becomes to recover the message. These codes are appropriately called error-correcting codes.

Let us see a classical example called, for reasons that will become obvious in a moment, a repetition code. Suppose you want to send a message to a friend which is either 0 or 1, but something will happen to your message along the way and there is a 1-in-5 chance of your bit being flipped. How can you guarantee that your friend will receive the correct message? The answer is obvious: you send the same message several times. As we have learned, the Law of Large Numbers guarantees that the more bits we send, the closer the fraction of flipped bits will get to 1/5 of the total received by your friend. The only thing your friend needs to do is to count which bit appears more times.

Clearly, the more probable the flipping is, the more bits you have to send to guarantee that it will be possible to recover your message. For instance, suppose that exactly 1 in every 5 bits is flipped. Then, if you send a chain of five 1s, your friend will receive something like

1 1 0 1 1

and it is not difficult to see that the correct message is 1. The most frequent number wins. This is technically called the majority rule. If your friend receives

0 1 0 0 0

the majority of the bits is 0 and, therefore, your message was probably 0. You can also calculate the average value of the received bits. In the first case you would have 0.8 and in the second 0.2. As you can notice, you would just need to choose the bit closest to the average to get the correct answer.
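The whole scheme fits in a few lines of Python. The sketch below is my own illustration of the five-fold repetition code with majority-rule decoding over the 1-in-5 flip channel described above:

```python
import random

# A repetition code with majority-rule decoding.
def encode(bit, n=5):
    """Repeat the bit n times before transmission."""
    return [bit] * n

def channel(bits, flip_prob, rng):
    """Flip each transmitted bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def decode(received):
    """Majority rule: the most frequent bit wins."""
    return 1 if sum(received) > len(received) / 2 else 0

rng = random.Random(1)
flip_prob = 0.2                      # the 1-in-5 noise from the text
errors = sum(decode(channel(encode(1), flip_prob, rng)) != 1
             for _ in range(10000))
print(errors / 10000)                # far below the raw 0.2 error rate
```

Decoding fails only when three or more of the five bits flip, which happens much more rarely than a single flip; that is the redundancy doing its job.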

Now, suppose that exactly 1 in every 2 bits is flipped. What would the original message be if you received this?

1 0 1 0 1 0

In fact, unless your friend already knows that your message can only be 1, there is no way in this case to discover it! You have only two characters and each either flips or not with equal probability every time. No repetition code can help here. The problem is that the noise is so high in this case that no amount of redundancy will allow you to correct the errors. Here, of course, the only way out would be to find better hardware for your transmission.

When the noise is not so severe, though, there are much more efficient schemes than repetition codes, and these are used all the time in computers. As I said before, the efficiency of these codes does not depend on the content of the message, although knowing it would definitely help. We can now generalise this idea beyond simple binary messages to every kind of information we collect from the world.

It is because of noise that we repeat the same measurement many times in science. We hope that by repeating the experiment, if the noise is small enough, we can detect the most probable correct result. Consistency of results among different scientific theories is a kind of redundancy that allows us to check whether the result is correct or not, or maybe whether the theories themselves are.

Noise can be anywhere. It can be acoustic noise which prevents us

from hearing correctly what another person is saying. It can be visual noise,

like when you have short sight and are without your glasses. It can be the

strong rain which prevents you from seeing the car in front of yours. Our

brain is a wonderful error correcting machine, but it is so because it uses

prior knowledge and full information integration to create redundancy and

fill in the gaps. Even so, no matter how good it is, if you have

your eyes closed you cannot see the car ahead.

The fact that noise is everywhere means that there is always some

information lacking and, because of this, most of the time we have to use

probabilistic theories instead of deterministic ones. We have to learn how

to deal with noise and never forget to take it into consideration.

WHAT ABOUT FALLACIES?

Roughly speaking, a fallacy is an error in a chain of logical reasoning. The

error can be anywhere. It might be hiding in the initial hypothesis or in any

one of the steps we take to reach the final conclusion.

Each one of the steps in a logical reasoning is based on some rule

of logic we assume to be true. We have seen some of those rules as logical

operators. For instance, if I say that either I am a physicist or my brother is

a physicist and then I say that my brother is not a physicist, you can

conclude that I am a physicist. This is the rule that:

If (A OR B) is TRUE, then if A is FALSE, B must be TRUE.

This is a simple rule that we assume to be true. Sometimes,

though, we have a tendency to assume rules which are not necessarily

true. These are fallacies.
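The valid rule above, by contrast, can even be checked exhaustively, since A and B admit only four combinations of truth values. A short sketch (the function name is mine):

```python
from itertools import product

def rule_holds():
    """Check every row of the truth table: whenever (A OR B) is TRUE
    and A is FALSE, B must be TRUE."""
    return all(b for a, b in product([True, False], repeat=2)
               if (a or b) and not a)

print(rule_holds())  # True: the rule never fails
```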

Probably the most common fallacy is the argument from authority. This

is an error of reasoning that happens by assuming that a recognised

authority in a subject is always right. For instance, consider the following

chain of reasoning:

The doctor said I have a disease.

Doctors are authorities on diseases.

Therefore, I have the disease that the doctor diagnosed.

If that was always true, we would not have deaths due to medical

errors. Although it seems logical to believe in physicians because they

studied much more about diseases than other people, that does not

prevent them from being humans and committing errors of judgement.

However, the clever reader must now be pointing to the fact that,

although it is not certain, it is probable that the physician is right. The odds

that a physician knows better about a disease are higher than those for a

non-physician. Without any other information, if you are forced to choose

between the advice of the physician and of the non-physician, the best

thing to do is to go with the physician. That is equivalent to choosing a

hypothesis based on the prior information that a physician has greater

knowledge of matters of health.

Therefore, based on BAYES, the probability that the physician is

right is higher, but it is not 100%! There is a non-zero chance that the

physician is wrong and you have to prepare yourself for that. With luck, we

might be able to change our decision with time. If the treatment is not

working, we change our belief in the physician and look for another one.
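This updating can be made concrete with a little Bayes' rule arithmetic. The numbers below are invented for illustration, not real medical statistics:

```python
def bayes_update(prior, p_evidence_if_right, p_evidence_if_wrong):
    """Posterior probability that the diagnosis is right, given some
    evidence (e.g. a failed treatment), computed by Bayes' rule."""
    num = prior * p_evidence_if_right
    den = num + (1 - prior) * p_evidence_if_wrong
    return num / den

# Start trusting the physician: 90% prior that the diagnosis is right.
# A failed treatment is more likely if the diagnosis is wrong (0.8 vs 0.2).
posterior = bayes_update(0.9, 0.2, 0.8)
print(posterior)  # belief drops from 0.9 to roughly 0.69
```

After the failed treatment, the belief in the diagnosis drops but does not vanish: exactly the gradual change of opinion described above.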

Incredibly enough, a fallacy can be (though not all of them are)

evidence that increases the probability of something, but never enough to

make that thing 100% certain. That is why you have to be careful with them

and how they are used.

Appeal to consensus is a fallacy that works in a similar way. It is logically

wrong, but has a large probability weight. Although 1000 expert opinions

do not constitute a proof for something, the more experts agree with

something, the more probable it is that they are correct (but still, they can

be wrong). Had every medieval specialist said that the Earth was flat, they

would still have been wrong, no matter their number.

What you also always need to remember is that more probable

does not mean highly probable. Suppose you have three doors and behind

one of them is a prize. The fact that there is a chance of 0.01% of the prize

being in door #1 and 1% of being in door #2 means that door #2 is 100

times more probable than door #1. Still, the probability of the prize being

in door #3 is 98.99%! More probable in comparison to another probability

does not mean highly probable in absolute terms! Once again though, if

you are forced to choose only between doors #1 and #2, you would be a

fool if you did not choose the latter.
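The arithmetic of the three doors is worth spelling out in a few lines of code:

```python
# Probabilities from the example: door #1 and door #2 are both long shots.
p1 = 0.0001        # 0.01% chance of the prize being in door #1
p2 = 0.01          # 1% chance for door #2
p3 = 1 - p1 - p2   # whatever probability is left goes to door #3

print(p2 / p1)  # door #2 is 100 times more probable than door #1
print(p3)       # yet door #3 carries 98.99% of the probability
```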

As I said, though, not all fallacies can be seen as probabilistic

evidence in favour of something. For instance, the argument from

ignorance, a fallacy in which something is considered to be true because it

cannot be proved wrong, has no probabilistic use whatsoever. Not being

able to prove a proposition wrong does not increase its chances of being

correct, as can easily be seen with Carl Sagan's invisible dragon (Sagan,

1997).

The invisible dragon is a dragon that a child swears to her parents

exists in their garage. The parents go there to check and the child says that

they cannot see it because it is invisible. The parents then try to find its

footprints, but the child says that the dragon flies. The parents then

prepare a trap in which paint is sprayed on the dragon, but the child, once

again, argues that the dragon can become intangible. The existence of the

dragon cannot be disproved. Still, that does not make it more probable.

The list of fallacies is extensive and changes depending on the

source. One good resource is Wikipedia:

http://en.wikipedia.org/wiki/List_of_fallacies

Each one of these fallacies works in a different way and their

validity as evidence in favour of some reasoning can only be evaluated by

analysing each case carefully. BAYES does not change the fact that they

are wrong when used to reach certainties, but they can have some

probabilistic value once in a while. The important thing is to know how to

recognise when.

10.

The Universe and Everything Else

FUNDAMENTAL UNCERTAINTY

Physics is the branch of science that studies the simplest systems in the

universe. Although simple, they span a huge range of sizes, from the whole

universe to subatomic particles and beyond. By the 17th century, with

Isaac Newton, a very precise model of phenomena, which today we agree

to fall under the physics umbrella, was developed. This model, known as

Newtonian Mechanics, was a deterministic one. The underlying idea was

that the natural patterns were not probabilistic but deterministic,

and probabilities, if needed at all, would arise from incomplete

knowledge of initial conditions for each new prediction.

The 19th century saw the rise of electromagnetism and

thermodynamics, but both were models which fitted very well into

Newton's theory. Even with the discovery of relativity and the recognition

that simultaneity was a relative concept, the theory of relativity remained a

deterministic model. Given a set of initial conditions for any system, one

could predict its entire future with complete certainty.

Something, however, started to change more or less at the same

time relativity appeared. When physicists tried to predict the amount of

energy emitted by an object kept at a certain temperature using Newton's

model, complemented by Maxwell's electromagnetic theory, they ended

up finding an infinite value for this energy. Clearly an object at a certain

temperature cannot be emitting infinite amounts of energy, otherwise we

would never need to worry about sustainability.

These and other small inconsistencies between theory and

experiment were at first seen as details that would be resolved sooner or

later. The prior of Newton's theory was too strong, as it had been working

without failing for over 200 years up to that point. But in the best

Bayesian style, too much evidence started to accumulate, indicating that

there was something wrong with the Newtonian description of those

phenomena.

It turns out that the solution would involve a series of changes in

the prevailing models of physical phenomena, one of the most radical

being that we were forced to accept that probabilities are a fundamental

part of the description of nature even at its most basic level. I am talking, of

course, of quantum mechanics.

In order to understand where probabilities are hidden in quantum

mechanics, I first have to give you a crash course on it. It will involve a little

bit of mathematics, of course, but it will not be more than we had to use to

understand probabilities up to this point. Of course, we will need new

symbols, but you are already used to that.

QUANTUM MECHANICS FOR THE LAYPERSON

The shroud of mystery covering quantum mechanics is partially

disappearing due to the fact that it is now the leading technology of our

time. There is still an air of fantasy and science fiction around it and the

idea that it is something almost impossible to grasp is still in

circulation.

Quantum mechanics is indeed counter-intuitive in the sense that in

the microscopic world, where its effects are more accentuated, the results

drastically differ from what we are used to seeing in our daily, macroscopic

lives, and those are the observations that shape our prior about the world

in which we live. It would be fair to say that we still cannot understand

the principles of quantum mechanics by applying our classical thinking,

where by classical I tautologically mean everything which is not quantum.

In fact, we might never be able to fill this gap as the quantum description

seems to be more fundamental than any classical reasoning and some

argue that we should find a way to get used to it as it is.

If one assumes this posture and accepts its strangeness at least for

now, the amount of mathematics needed to gain some grasp of it is not

much worse than high school level and not much more complicated than

what we needed to learn about probabilities. If you are a professional, of

course you will need some more involved concepts, like differential

equations and integrals, but we will not need them here.

The basic object of the model we call quantum mechanics is called

a state vector. You probably learnt that vectors are arrows of some size

pointing in some direction. They are used to represent quantities that have

a certain magnitude and a certain direction, like velocities for instance. It

turns out that the concept of an arrow can be generalised to a

mathematical entity in a certain way that it keeps all properties that the

original arrows have: they can be added, subtracted and multiplied by

numbers to make them larger, smaller or simply to change their direction.

A state vector is like one of these arrows, but it is designed to

represent all the information about a physical system. The simplest of all

these state vectors is the one that describes the spin of an electron. The

electron, like other fundamental particles, is also a very tiny magnet, with a

north and a south magnetic pole. To represent this, we use a small arrow

drawn in such a way that the head indicates the north pole and the tail the

south pole.

The name spin comes from the fact that we can associate the

magnetic field of the electron to a magnetic field produced by a current

spinning around. In this picture, the electron would be a spinning sphere

and its magnetic field would be produced by the effect of the negative

charge of the electron spinning. The correct picture is a little more

complicated and involves concepts about symmetry under rotations, but

we will not talk much about that. The important thing is that spin is

associated somehow with rotations and not directly to charge or magnetic

fields. Even a neutral particle can have spin as it can, in a sense, rotate.

It happens that when a big object like a tennis ball spins around an

axis, this axis can be pointing in any direction. We say that it can assume a

continuous number of directions because we can rotate the axis smoothly

in whatever way we want. But experiments (data) in quantum mechanics

force us to accept that the electron spin is slightly different. Whenever we

try to measure in which direction each one of a bunch of electrons is

spinning, we can only find two mutually exclusive answers! They are always

spinning with the same speed in one of two opposite, parallel directions.

Because of this property, we say that the spin of the electron is

quantised and assumes one of two states that we call up and down

(depending on the alignment of the axis, it could be left and right for

instance, but that is not important) and write them with the symbols

|↑⟩ and |↓⟩.

The arrows have an obvious meaning, but they are enclosed by a

strange-looking mixed bracket. That kind of bracket is called a ket and it is

a notation for vectors invented by the physicist Paul Dirac and therefore

called the Dirac notation for vectors. Many of us learned another notation

in high school. There, for instance, when we wanted to label a vector

by the letter v, we asserted its vectorial character by putting an arrow over

it and writing v⃗. This notation is also valid, but the Dirac notation is more

convenient when you want to give larger labels, like words, or strange ones

like the spin arrows.

The strangeness of quantum mechanics does not stop in the

quantisation of the spin. In the macroscopic world, things are usually in one

single state at a time. A tennis ball either spins in one direction or the

other. Fundamental particles, however, cannot be said to be spinning in

one direction or another until you measure them. Before you do, we say

that they are spinning in both directions at the same time! How do we

know it? Well, that is a tricky question.

As we can easily understand by now, quantum mechanics, like any

other model of physical reality, was created to encode all the

information we collected from microscopic experiments. The way we

describe the spin of an electron in this model is by the state kets, but the

mysterious thing is that if we try to assign a definite spin up or down to the

electron before we measure it, we arrive at inconsistencies, also known as

paradoxes.

Einstein, together with the physicists Boris Podolsky and Nathan

Rosen, discovered one of these paradoxes and, since then, it has been

called the EPR paradox. They discovered that, if we assign definite spins for

a pair of electrons that are created together in a very special way called an

entangled state, we have problems. In an entangled pair of electrons, if

one of the electrons has spin up, the other has spin down, and vice

versa. No matter how far they travel after creation, if you measure one of

the spins and another person measures the other, you both will find

opposite answers every time.

If you use the laws of quantum mechanics to calculate what

happens and consider that both electrons have their spins defined at the

moment of creation, EPR discovered that this implies that the electrons

have an interaction that travels faster than light, which, especially for

Einstein, could not be true, as relativity forbids such influences.

Indeed, today we know of only two solutions for this paradox. The

solution that guarantees that interactions cannot act instantaneously at

distance requires the spins not to exist until they are measured. This is the

usual interpretation and is the one assumed by standard quantum mechanics.

In some theories, called non-local theories, we can assign a definite state

to the electron, but then we have to admit that physics is non-local,

meaning that actions at a distance can be instantaneous. Usually, this is a

very unpopular solution. In both cases, information itself cannot be

transmitted faster than light and the spirit of Einstein's relativity (no signal

travels faster than light) remains intact.

Because in the usual quantum mechanics we cannot assign a

definite state to the electron until we measure it, we represent the state of

an electron by what we call a linear combination of the up and down

states. A linear combination is simply what you get if you multiply each

state ket by a number (remember they are vectors and, therefore, we can

do this) and add them. In symbols, we write the rather esoteric equation

|ψ⟩ = α|↑⟩ + β|↓⟩.

The symbol ψ is the Greek letter psi and must be familiar to

psychologists. I am only using this letter to describe the combined state of

the electron because it is traditional in physics. I could have used any other

label if I wanted. We are getting very close to the place in which

probabilities enter quantum mechanics. Stay with me.

Let us understand what the physical meaning of the symbols in the

above equation is. As is the rule in quantum mechanics, α and β are not

usual numbers in the sense that they are not real numbers, they are

complex numbers. Complex numbers include the usual real numbers plus

all the square roots of negative numbers (usually called imaginary

numbers, for obvious reasons) plus all the mixed additions and

subtractions between members of the two classes. Remember our

discussion about Cardano and polynomial equations of degree 3? They

appeared there for the first time.

Complex numbers can all be written in the form z = r + s·i, where

r and s are real numbers and i = √(−1). You will notice that, because of that,

every complex number is completely characterised by a pair of real

numbers, r and s. The number r is called the real part of the complex

number and s, because it is multiplying i, is called the imaginary part.

Consequently, we can associate to every complex number a two-

dimensional vector, by which I mean an arrow on a plane! You can see a

picture of this below:

[Figure: a complex number drawn as an arrow on the plane; its real and imaginary parts form the legs of a right triangle]

The arrow, for which I chose the name z, is the graphical

representation of the complex number. We can write this as

z = (r, s).

The shaded area below it forms a right triangle. This allows us to

find the size of the arrow by using the well-known Pythagorean theorem.

The size of the arrow associated to z is called its modulus, and it has the

symbol |z|. If you look at the picture, you will see that the modulus of z is

the hypotenuse of the triangle and, therefore, the Pythagorean theorem gives

us |z| = √(r² + s²).

Why is this important at all? Because the modulus of a complex

number holds the last clue to understanding the physical meaning of a

quantum mechanical state!

Right in the beginning of quantum theory, the physicist Max Born

proposed a rule that ended up being the correct way of understanding

what a state ket like our |ψ⟩ really means. Long story short, whenever we

measure an electron in the state |ψ⟩ we get a spin up with probability

proportional to |α|² and a spin down with probability proportional to |β|².

Of course, we want to normalise these probabilities and that is easily done.

In the above case, all we need to do is to divide the ket by

√(|α|² + |β|²). In

quantum mechanics, two kets differing by a multiplicative constant are

considered to represent the same physical state.
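Python's built-in complex numbers make this rule a one-liner to try out: `abs` computes exactly the modulus described above. The two amplitudes below, one multiplying the up state and one multiplying the down state, are made up for illustration:

```python
# Hypothetical amplitudes for a state (amplitude_up)|up> + (amplitude_down)|down>.
amplitude_up = 1 + 2j
amplitude_down = 2 - 1j

# abs() gives the modulus, i.e. the hypotenuse from the Pythagorean theorem.
norm_sq = abs(amplitude_up) ** 2 + abs(amplitude_down) ** 2

p_up = abs(amplitude_up) ** 2 / norm_sq      # Born's rule for spin up
p_down = abs(amplitude_down) ** 2 / norm_sq  # Born's rule for spin down

print(p_up, p_down, p_up + p_down)  # the two probabilities sum to 1
```

Dividing by the squared norm is the normalisation step from the text: after it, the probabilities of the only two possible outcomes add up to one.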

And that's how probabilities enter quantum mechanics!

There are many other strange aspects in quantum mechanics, but

it is during the act of measurement that probabilities enter it. Whenever

you measure a state which is a superposition of two others in quantum

mechanics, you cannot predict the result, only the probabilities. This is not

because you are losing some information. Even if you have a perfect

measuring device, you could only predict probabilities. They are not a

consequence of incomplete knowledge in quantum mechanics. They are a

fundamental characteristic of the model.

Many people now argue that these probabilities are fundamental:

they do not represent states of knowledge or belief, but fundamental

frequencies in the sense that if the experiment is repeated many times, the

relative frequencies of the outcome are a property of the system, a very

objective one. As we have seen before, that is not the correct point of

view.

Consider the following experiment. Imagine that you have a two-

dimensional sheet of atoms. There is one real example of that, called

graphene. Graphene is a sheet of carbon atoms organised in a hexagonal

lattice, which is the grid of points formed by the vertices of a series of tiled

hexagons like the picture below:

[Figure: a hexagonal lattice, with one vertex marked by a small green circle]

At each one of the vertices of this lattice, like the one marked by

the small green circle, lies one carbon atom. Imagine that instead of carbon

atoms, we have a lattice with fixed electrons. If we measure the spin of

each electron in the direction perpendicular to the plane, we will either

measure a spin up or down, as we have already seen.

Assume that we can arrange an initial configuration of electrons in

the lattice such that, for each vertex, the probability of up or down spin is

given by our state ket above, already normalised. Is the probability a

feature of the electron? The answer is a clear no. It is a feature of the

entire system, which means it also depends on how the experiment is

done.

If we decide to use some device to generate a magnetic field

perpendicular to the plane on which the lattice lies, the spins of the

electrons will tend to align with the field and the failure to align perfectly

will depend on the temperature of the system. In this case, the probability

of measuring the spin up and down will change. You need information

about the whole experimental setup to calculate it.

It can never be repeated enough: the probability stores information

about the whole system. In many cases, this information can be

summarised in the symmetries, or lack of them, in the system. In classical

mechanics, probabilities arise from our inability to reproduce the exact

initial configurations at each repetition of the experiment, which results in

lack of knowledge (remember chaos?). For quantum mechanics, even a

perfect reproduction of the initial state will allow us to predict only

probabilities. If there is one sense in which probabilities can be said to be

fundamental, it is this one, but they still capture information about the

whole setup and can be wrongly inferred.

THE LARGE, THE SMALL AND THE COMPLEX

If you like science to the point of reading books by Stephen Hawking and

Carl Sagan, you have probably heard that the main goal of physics for the last 60

or more years has been to find a theory of quantum gravity. You have also

probably heard that quantum mechanics and general relativity are not

consistent when used together and that, maybe, string theory can do the

magic. However, nobody knows if string theory is right because it is difficult

to test. That is an inspiring story, but it is only partially true.

Quantum gravity is indeed one of the main goals of theoretical

physics today, but it is not the only main goal. Quantum gravity would be

the culminating point of extending physics in two directions: the very large

and the very small.

Quantum mechanics, as we have seen, is our present model for the

physics of very small systems. If gravity is not taken into consideration, it

can be consistently described together with special relativity in a model

which is called quantum field theory, or QFT. This model is capable of

describing with the best precision known in science most experiments

related to microscopic physics.

At the other extreme, we have general relativity, or GR, which is also

very successful in describing most astronomical observations we have recorded

since we started to look at the stars. Where Newtonian gravity fails, even

by a very small margin, general relativity corrects it with high precision.

QFT and GR, however, cannot be used in conjunction when we deal

with phenomena involving high energy and gravity. In the usual

microscopic world, masses are small and GR is not very important. In the

usual cosmological world, things are large and QFT is not very important.

But because relativity teaches us that mass and energy are equivalent,

whenever too much energy is concentrated in too small a region of space,

we should use both QFT and GR at the same time. This happens, for

instance, in the centre of black holes where a very strange object called a

singularity might exist (although most people bet it does not). The problem

is, if you try to calculate things in that regime, mathematical

inconsistencies start to appear and you cannot get any useful result. As

nature apparently works well everywhere, we tend to think that the

problem is that we did not get the models right.

I told you before that many theoretical physicists call the yet-to-be-

found consistent model that includes QFT and GR by the exaggerated name

of Theory of Everything, or TOE. String theory is considered as a candidate

for a TOE as it is an attempt at a unified description of QFT and GR. But

there is a detail left out here. If, once we unify QFT and GR, we have a

theory of everything, then what else remains?

The point is that QFT and GR are both fundamental theories and,

consequently, the TOE will be a fundamental theory as well. Fundamental

theories are supposed to contain sets of fundamental principles that

should be obeyed by all other physical models and, in this sense, if we

find all fundamental principles we have already achieved a great goal.

However, we still cannot explain everything.

The reason why we cannot is that our goal is to use the

fundamental principles to predict the behaviour of physical systems. When

we have very simple physical systems, it is very easy to apply those

fundamental principles. In fact, most of the fundamental principles are

supposed to be readily applicable to simple situations, but become more

difficult to apply to complex ones.

The best example of a complex system that does not look that

complex at first sight is one which is very familiar to us and that we used as

an example in many other situations in this book: water. Water is made of

molecules containing two atoms of hydrogen and one of oxygen. If you

look at one such molecule, it is not that difficult to predict its behaviour.

We can measure its speed, mass and calculate its trajectory with

acceptable precision for all practical purposes. But one molecule of H₂O is

not really water! Water is something much more complex than one

molecule. Water is what you get when you put a mind-boggling

number of H₂O molecules together at a certain temperature. If you change

the temperature, you change this collection of molecules from water to ice

or to vapour. And all three have completely different behaviours! You do

not wash your clothes with ice and you do not drink vapour when you are

thirsty. The whole process of passing from one to the other in which these

characteristics change depends on the complex interactions taking place

there.

To describe scenarios like this one, a whole new bunch of concepts

had to be introduced which deal only with situations where you have a

huge number of interacting units. Concepts that, afterwards, made their

way back to more fundamental theories. Temperature is one of them.

Entropy, our old friend, has its origins in the physics of the very complex

too. As you might imagine, the more complex a system becomes, the more

difficult it is to keep track of everything happening within it, and one thing

we know is that when this happens we need probabilities!

The level of complexity increases in general as we move from physics

in the direction of the social sciences. In the 1900s, quantum mechanics

found a bridge between the very simple systems of physics and the more

complex ones of chemistry. The next step after chemistry, known as

biology, requires an even bigger jump which has not been fully made even

today. The same happens as we climb up the ladder. Each time, we get a

higher number of different systems and more diverse interactions between

them. This is the barrier of the complex that we are also trying to break.

Breaking this barrier would be even more unifying than the alleged TOE, as the aim is not only a

theory of physics, but to bring together in a consistent way models

throughout all areas of knowledge. From biology, we go to psychology,

then to sociology and economics and ecology.

No matter how complex, the miracle is that the systems keep

presenting to us a series of repeating patterns depending on the way we

look at them. The art is to throw away the correct amount of information

to isolate those patterns in a sea of noise. To salvage the right datasets that

will allow us to develop our probabilistic models of nature is the ultimate

goal of science.

THE PSYCHOLOGIST PARADOX

Some time ago, I was trying to understand why people believe in irrational

things. I did what every physicist would do: I drafted a very simple model

and looked at it from every point of view I could imagine. I changed a bit

here and there until I could get some basic understanding out of it. It

looked interesting. Then, I talked about it to a friend of mine who is a

psychologist with the hope that he could point me to more interesting

problems or that we could start a collaboration and try to apply the model

to some real problems. My friend looked at it and replied with "It is

interesting, but I don't think it can be right. Every human is different; you

cannot understand everyone with the same formula." I looked at my friend

and did not say anything as I did not want to be rude, but I immediately

thought "If everyone is completely different, how can psychology even

exist?"

If every human being on the planet reacted completely differently

to everything, then no one could infer patterns of behaviour. This would

mean that, if you are a psychologist, everything you learned at

university about behaviour would never be useful, because it was learned by

studying people who will not be your patients and, therefore, will have

completely different behaviours. Still, we do know two things that, at least

for my friend, should seem paradoxical: every individual is different and

psychology does work. This is what I call the Psychologist Paradox, in

honour of my friend.

Of course there is no real paradox. The misleading idea here is that

people are completely different. We all know that people are not

completely different. As a poet once said, if we are cut we all bleed the

same, right? We are all variations over the same theme with noise helping

to increase our variety. Almost everyone will shout if burned by a cigarette.

That is why we do not do it (in general) to other people. But a very small

number of people do not feel pain. These are exceptions, or as we learned

previously, noise.

In the case of the sensation of pain, natural selection was

responsible for decreasing the amount of noise. Those individuals that did

not feel pain ended up dying more easily. Those who felt too much pain could

not carry on their necessary tasks to survive. But whatever lies in between

these extremes became fair game. We can safely say that the typical

human being feels pain, some more, some less. In popular lingo, typical

usually means normal. In fact, even by the psychologists' measure, the

normal is equivalent to the typical.

What my friend's intuition was in fact trying to say was that there

was a lot of variation in human beings and that, because he had very little

intuition with mathematical models of physical phenomena, he could not

believe that such a simple model could capture the similarities between

individuals. There is indeed a huge amount of noise in the making of a

human being. That is because the places in which this noise can enter are

extremely numerous. Still, miraculously, in some cases noise is kept

extremely small, most of the time due to some kind of selection mechanism,

be it natural or cultural.

In any case, there are clear patterns in human behaviour. The tricky part is to deal with the noise and the consequent missing information. As we have already seen, we deal with noise by using a statistical description. We talked about that when we saw what physicians really mean by "your chances of survival are 20%". They are calculating the average case over many patients and hoping that the Law of Large Numbers is indicating the right probabilities.
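If you know a little programming, the Law of Large Numbers at work here is easy to see in a quick simulation. The sketch below uses made-up numbers, assuming a true survival probability of 20%: it counts outcomes over many imaginary patients, and the observed frequency drifts towards the true probability as the number of patients grows.

```python
import random

def empirical_survival_rate(n_patients, p_survive=0.2, seed=1):
    """Simulate n_patients independent outcomes, each surviving with
    probability p_survive, and return the observed fraction of survivors."""
    rng = random.Random(seed)
    survivors = sum(1 for _ in range(n_patients) if rng.random() < p_survive)
    return survivors / n_patients

# The observed frequency approaches the true 20% as n grows.
for n in (10, 1_000, 100_000):
    print(n, empirical_survival_rate(n))
```

With only ten patients the observed rate can be far from 20%; with a hundred thousand it is reliably close, which is exactly what the physicians are counting on.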

Probabilities are being increasingly used to describe biological and

social systems when they are composed of a large number of interacting

parts. We call systems like that, in a very broad sense, complex systems. In

biology, for instance, we can think of the constituent parts as cells or

neurons (in the case of the brain). In sociology, we deal with a huge

number of human beings. In all cases, we justify our hopes that we will get

useful results by using the ubiquitous Law of Large Numbers. Why?

Because it works so many times that we are led to have a very optimistic

prior about it.

Although each one of the sciences has its particularities and

different objectives, there are lots of overlaps. In many of the above

systems, one is interested in identifying and understanding something

which is called an emergent behaviour. The example of the water changing

phases (from ice to liquid, from liquid to vapour...) is a type of emergent

behaviour.

If you again imagine an isolated water molecule, there is nothing in it that hints at the fact that it can change from water to ice to vapour. These changes are called in physics phase transitions and are a result of the interactions between the molecules. Because they can only happen when a collection of molecules is present, given that water/ice/vapour are nonsensical words to apply to a single molecule, they are called collective

Roberto C. Alamino

248

behaviour. And because they emerge from simpler laws, they are called

emergent.

This kind of behaviour is present in many different complex systems. For instance, consciousness is the ultimate emergent behaviour, as it is surely a result of the interaction of billions of neurons. In the same way, groups of people can organise themselves into a revolution, or car traffic can suddenly become jammed without a clear reason. All these are emergent behaviours. Presently, due to the power of modern computers that allows us to simulate this kind of system, complex systems have become very fashionable in all areas of science.

STATISTICAL PHYSICS

The idea that everything in the universe should obey the same set of laws

comes from the belief that the division of nature into areas of knowledge is

something created by us to understand it better, but that nature is in

reality just one. This, among other things, means that if we knew all

important laws governing the simple things, we should be able to somehow

make a bridge between them and the laws we know that rule the complex

things.

One of the simplest things we know is an atom, so let us start

there. The idea that matter was composed of atoms was first proposed by the Greeks, with the first references pointing to Democritus in the 5th century BC. However, the first experimental evidence of the existence of atoms had to wait for Einstein's work on Brownian motion more than two millennia later, in 1905. It would take 76 more years for us to construct a device

powerful enough to allow us to indirectly picture single atoms, the

scanning tunnelling microscope.


Scanning Tunnelling Microscope image of a piece of gold. Each one of the visible blobs

represents a gold atom.

A hydrogen atom has a size of the order of 10⁻¹⁰ meters, with "of the order" meaning that the size is some one-digit number times that. This is about one hundred thousand (100 000) times smaller than a thin human hair. Due to its size, light does not bounce off an atom in a way that enables our eyes to detect enough information to create a visual model of it. In more mundane words, atoms are too small to see even with strong lenses. The reason is not that we do not possess lenses strong enough to magnify things that much, but that atoms are already small enough to make the quantum effects that our brain usually ignores make a lot of difference.

In order for our eyes to directly detect an object, light must hit the

object and be reflected by it. This reflected light brings to our eyes

information, which is then interpreted by our brain. In order for something

to reflect light, it must interact with it and, usually, the object must be at least as large as the wavelength of the light. The wavelength of any wave is the

distance that it takes for the wave to repeat itself. You can visualise it from

the picture below, where the distance marked by the black line

corresponds to the wavelength of the wave.


The problem is that the minimum wavelength our eyes can detect is of the order of 10⁻⁷ meters, while the wavelength that would interact with one single atom is 1000 times smaller than what we can detect. Even if the atom were 1000 times larger, we would still have the

problem of the limited resolution of our eyes. We would still need a

microscope to see it, in the same way we do with cells. No wonder the

existence of cells was proved in 1665, when Robert Hooke actually saw a

cell via a microscope, which is 240 years before atoms were proved to

exist.

But as we all know, when the number of atoms is large enough and

they are packed together really close to each other, we can see the object

they form. They are the objects of our daily life. Solids, liquids, gases and

everything in between. How can we see the whole object if each atom

cannot be seen individually?

For those who believed in atoms before their existence was

proved, questions like that posed a big problem. If everything is made of

atoms and, at least after Newton, we believed that even atoms should

follow the laws of mechanics, can we explain the phenomena we see

happening with matter around us starting from things as simple as atoms?

Today, we know that the answer is "very probably yes", although we are not completely sure (Stewart, 1997). However, even if that is indeed possible, it might be so difficult that it becomes impractical. Even with our reasonably powerful computers, calculating the properties of matter starting from its atomic structure, something that physicists call ab initio calculations, is too


complicated to be useful except in very simplified situations. The problem is not only that we have so many particles to keep track of; it is also that many of the necessary computations belong to a class of problems usually called computationally hard problems, for which no fast algorithms are known.

Still, even in the 17th century, physics had already started to study the

laws of heat and how they affect many properties of matter. By the time

the atom was proved to exist, there was already a good deal of work that

allowed the study of phase transitions, which as we have already seen is

the phenomenon associated with matter changing from one phase to the

other.

When applied to the substances we use every day, phases are

what in common language people call states of matter, like liquid, solid and

gaseous. We have already used this word to describe ice, water and

vapour. The word state is never used in this sense in physics. Technically,

the term state is used to describe characteristics of any system, even a

single particle, either in classical or quantum mechanics and can cause

some confusion, like the energy states we talked about before.

The area of physics that deals with the phase transitions occurring

in the substances is the well-known thermodynamics. In 1738, Daniel

Bernoulli, a name we have heard before, published the book Hydrodynamica (Bernoulli, 1738), in which he assumed that liquids were

formed by a huge number of molecules in movement and used this to

calculate some of their observed properties. Other scientists, like Clausius

and Maxwell (the same of the electromagnetism), used similar ideas to

describe not only liquids, but also gases.

The revolutionary idea of Bernoulli was to assume that the

molecules were moving in a random way inside the liquid and to use

probabilities to calculate average properties of it. He then associated the

average properties with what we can really measure. More than one

century later, Boltzmann took these ideas to their limits. Today, the latter is


considered the father of statistical physics, also known as statistical

mechanics.

But thermodynamics already existed before statistical physics. Boltzmann was one of the first to think about the idea we have called emergent properties. He was sure that the laws of thermodynamics should emerge from the simpler laws of Newtonian mechanics, and that the key to accomplishing that should be, following the example of Bernoulli's hydrodynamics, probabilities.

Systems composed of atoms and molecules at a certain

temperature (as long as the temperature is different from zero) have to be

described by probabilities, because heat is a kind of disorder which moves

things around randomly. Heat is, fundamentally, noise and temperature is

a measure of how high this noise is.

Here is where the distinction between phase and state becomes

even more important. We describe systems by the probabilities of them

being in some state, meaning the probabilities of values for all microscopic

variables that characterise the system. One of the great results of

Boltzmann's work was that, if we are given the normalisation of these

probabilities, there are methods that allow us to calculate everything we

are interested in about the system! We are even able to predict things like

at which temperature a phase transition should occur. In an interesting

twist, while normalisations are usually ignored in the inference tasks we

have seen before, they are actually a key calculational tool when we deal

with complex systems!

Because of this amazing fact, the normalisation is given a new,

more important name. It is called the partition sum and given the letter Z, which stands for the German word Zustandssumme, or sum over states, because, at that time, German was the most important language in science and also because Boltzmann himself was Austrian.


Unfortunately, the no free lunch theorem applies here. The more

complex your system is, the more difficult it is to calculate its partition

function. Many times, we are forced to do that numerically, which

nowadays we do using computers. A numerical calculation is one in which,

instead of trying to find a nice compact formula to express the

normalisation, we calculate each term separately as a number and add

them all by brute force.

The normalisation that we calculate in statistical physics is still

independent of the particular state because it is a sum over all of them, but

it might depend on other variables like the temperature in which the

system is being kept and the magnetic fields acting on the system.

However, it is way simpler to calculate partition functions than to calculate

what is going to happen with a system by tracking atom by atom, although

some people actually do that with computers as a different line of

research.
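The brute-force numerical route just described can be sketched in a few lines of code. This is only an illustration with a made-up three-state system; the exponential weight exp(-E/T) used here is the Gibbs form that this chapter introduces shortly, with units chosen so that Boltzmann's constant equals 1.

```python
import math

def partition_sum(energies, temperature):
    """Compute the partition sum Z numerically: evaluate the weight of
    every state as a number and add them all up, term by term."""
    return sum(math.exp(-energy / temperature) for energy in energies)

# A hypothetical toy system with three states of energies 0, 1 and 2.
energies = [0.0, 1.0, 2.0]
for temperature in (0.5, 1.0, 2.0):
    print(temperature, partition_sum(energies, temperature))
```

Notice that Z does not depend on any particular state, since it is a sum over all of them, but it does change with the temperature, just as the text describes.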

This and many other great insights make statistical physics such a

powerful tool to deal with complex systems that it can probably be

considered to have transcended physics to become a general mathematical

framework that can be applied to every area of human knowledge. The

methods of statistical physics today are used in areas as diverse as social

sciences and artificial intelligence.

All probabilistic concepts we have seen before are brought

together in statistical physics. One of its basic assumptions is that there are

some elementary states of a system which all have the same probability

under certain conditions. This last part is very important because, as we

have repeated many times by now, every probability is conditional.

In fact, we can associate this condition under which these states are equivalent with the idea of a cost function, or simply a cost. We talked briefly about that when we were discussing the cost of making decisions.

Here, although the concept is related to that one, it is used in a slightly

different way.


When we observe the world around us, we can immediately

recognise that some states are more difficult to maintain than others. For

instance, if you raise a rock above the ground, it will not stay there unless

you keep holding it. The moment you let it go, gravity guarantees that it

will go back to the ground. This means that there is a cost associated with

keeping the rock above the ground. In physics, this cost is usually

associated with the word energy. Therefore, energy is a cost function.

There is a chance that you still remember that everything in the

world tries to achieve a state of minimum energy. This seems to be at odds

with the idea that energy is conserved. When we say that a system tries to

achieve the state of minimum energy we are actually considering that the

system is interacting with others in such a way that it can exchange energy

with them. This guarantees that the energy remains conserved.

Another slight complication related to this picture of minimum

energy also appears when heat is present. Because heat is noise, it disrupts

the minimum energy state in a way that we will understand in a bit, but the

main idea, that the universe tries to minimise its costs, can still be seen to

be valid. What happens is that, due to some restrictions imposed by the

environment, it is prevented from doing so.

The main point is that the key assumption of statistical physics is

that, if a system can assume different states, we can associate a different

cost or energy value to each state. Then, if we do not know in which state

the system is and we measure it, states with the same energy have the same chance of being measured.

This also means that if you know the exact value of the energy for a

system and if you also know that the total number of states with that

energy is, let us say, ten, then the system will be in one of those states with

probability 1/10. For all practical purposes, the system is the same as a fair d10 rolling experiment!
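That equal-probability rule is simple enough to express directly. Here is a minimal sketch, with made-up energy values: count the states that share the known energy and give each one the same probability.

```python
from fractions import Fraction

def microcanonical_probability(state_energies, known_energy):
    """Probability of any single state in the microcanonical ensemble:
    1 divided by the number of states with the known energy."""
    count = sum(1 for energy in state_energies if energy == known_energy)
    return Fraction(1, count)

# A hypothetical system: ten states with energy 5, plus two others.
energies = [5] * 10 + [7, 9]
print(microcanonical_probability(energies, 5))  # 1/10, a fair d10
```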


I guess you can appreciate that this is a symmetry assumption.

States with the same energy can be considered symmetric in the sense

that, if we change the state, the energy remains the same. Remember that

we defined a symmetry as being a change we make to a system that keeps

some aspects of the system unchanged.

When we are studying systems whose energy we know with

certainty, we say that we are working in the microcanonical ensemble. The

actual explanation of the term is not important. The important thing is

that, in the microcanonical ensemble, the energy of the system is fixed and

all states have the same probability. The only problem is that, in the most common practical situations, we do not know the exact value of the energy of a system either.

The reason is that systems are exchanging energy with other

systems all the time. It is not easy to isolate systems in nature. Think

about the amount of work needed to construct a thermal bottle. Today we

know that heat is a form of energy and heat exchanges are energy

exchanges. It is not at all easy to keep your coffee warm the whole day.

Actually, it is virtually impossible.

Because it is so difficult to control the exchange of energy, we were

forced to go one step beyond the microcanonical ensemble. We are forced

to use what is called the canonical ensemble. Although in the canonical

ensemble we do not know the precise value of the energy in the system,

there is still one thing that we assume we know: its average value. This

average value of the energy is fixed by something we can measure and control much more easily than the energy: the temperature.

Temperature, mathematically, can be interpreted as a numerical

parameter which, when fixed, fixes the average value of the energy of the

system. But there is still one thing missing. In the microcanonical ensemble,

once we knew the energy and the number of states, we knew how to

calculate the probabilities. In the canonical ensemble the rule that equal

energies correspond to equal probabilities is still valid, but there is an


infinite number of functions that give the same probability to the same energy and have the same fixed average energy. Our probability zoo is full of probability distributions with the same mean. We need to find a way to

attribute the correct probabilities to all states, even those with different

energies, which will reproduce the same results we observe in real systems

in nature.

Now comes the time to mix in a little bit of inference. How do we

calculate probabilities when we have all the necessary constraints and do

not want to assume anything else? Yes... we maximise the entropy!
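For readers comfortable with a little more notation, the maximisation just mentioned can be written out explicitly. The following is a standard sketch using Lagrange multipliers, not a derivation taken from the text:

```latex
% Maximise the entropy subject to normalisation and a fixed average energy:
S[p] = -\sum_i p_i \ln p_i ,
\qquad
\sum_i p_i = 1 ,
\qquad
\sum_i p_i E_i = \langle E \rangle .

% With Lagrange multipliers \alpha and \beta, setting the derivative
% with respect to each p_i to zero gives
-\ln p_i - 1 - \alpha - \beta E_i = 0 ,

% so the solution is an exponential of the energy: the Gibbs distribution,
p_i = \frac{e^{-\beta E_i}}{Z} ,
\qquad
Z = \sum_i e^{-\beta E_i} ,

% where Z, the partition sum, enforces normalisation, and \beta is fixed
% by the average energy (in physics, \beta is the inverse temperature).
```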

If we fix the average energy and look for the probability distribution that maximises the entropy while giving the same probability to states with the same energy, we find exactly what we call in physics the Gibbs distribution.

This distribution is one of the most fundamental results of statistical

physics and works for both classical and quantum systems almost

unchanged. In a nutshell, it says that the probability of finding a system in a

state with a certain energy is exponentially smaller the larger the energy is.

The exact value depends on the temperature.

In the Gibbs distribution, the temperature regulates the relative

probability between states of different energy. If the temperature is very

high, the energy does not matter and all the states have the same

probability. On the other hand, when the temperature is almost zero, only states with the minimum value of the energy can be found; all others have

zero probability. With the correct probability we can find the correct

normalisation and, as we have seen before, everything follows.
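Those two temperature limits are easy to verify numerically. Here is a sketch with a hypothetical three-state system, with units chosen so that Boltzmann's constant equals 1:

```python
import math

def gibbs_distribution(energies, temperature):
    """Gibbs probabilities: each state gets weight exp(-E / T),
    normalised by the partition sum."""
    weights = [math.exp(-energy / temperature) for energy in energies]
    z = sum(weights)  # the partition sum, i.e. the normalisation
    return [weight / z for weight in weights]

energies = [0.0, 1.0, 2.0]  # made-up energies for three states
print(gibbs_distribution(energies, 100.0))  # high T: nearly uniform
print(gibbs_distribution(energies, 0.01))   # low T: lowest energy dominates
```

At a temperature of 100 the three probabilities are almost equal; at 0.01 virtually all the probability sits on the minimum-energy state.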

THE GREAT BRIDGE

But we still cannot bridge the gap between physics and the other

disciplines, can we? Yes, we can. This kind of research is in its early stages

(even though it is decades old), but we can already put together under the same

framework, that of statistical physics, many collective phenomena.


The concept of phases and phase transitions is probably the most

important in this whole idea. Phases are characterised by an overall

behaviour of the system. For instance, you can think about democracy and

authoritarian rule as two political phases of a society. Then you would be

interested in knowing which social parameters make a culture change from one to the other and back again. That might be difficult, but in principle it should be possible.

Several other things in many different areas are presently being

investigated using the methods of statistical physics. For instance, how

birds collectively organise themselves in flocks. Or when and how voters polarise towards one candidate in an election. How swarms of robots can be programmed to collectively achieve some goal. All those things involve throwing away detailed information about the individuals to look at the system from far away, as one single complex system. If you look at ice and water closely enough they will look the same, but if you look from far enough away they are completely different; the same happens with all these systems. Because we throw this information away, we need probabilities, and that is how we enter the

realm of statistical physics.

OTHER CORNERS

There are, of course, many other ways in which probability enters science. In biology, genetics is an area where probabilities abound. What is the probability of having a baby girl or a baby boy? How probable is it that someone is the parent of someone else? As we have seen before, physicians are always talking about death rates, which are probabilistic concepts. Epidemiologists need to know how probable it is for a disease to

spread. Meteorologists need to predict the weather, but there are so many

variables involved that only probabilities can be given. The same happens

with stock markets.


Each one of the sciences needs to use different concepts to make

different predictions with different kinds of information. These predictions

should match real outcomes to have any validity. In addition, given real

phenomena, these sciences need to make meta-prediction, meaning that

they need to predict what is the best theory to make predictions.

The way to make these meta-predictions is, of course, Bayesian inference, and it is at the core of science. Now that we have some

idea about how probability enters in the most fundamental aspects of the

universe, it is time to go to higher levels.


11.

The Highest Levels

MODELS ONCE AGAIN

The idea that science is a quest to understand how reality works is a

beautiful but subtle one. It is probably the one thing that is deep inside the

heart of every scientist as a drive and a hope, but what each person sees as

understanding is generally very subjective.

We have seen before that the best shot at a rigorous definition of an objective reality is embodied by the requirement of consistency.

Something similar happens when we try to define what we mean by

understanding. Surprisingly, when one looks deeply enough, the best

definition of what understanding actually means is to say that we are

capable of creating a consistent model of whatever phenomenon we are

trying to understand. Consistency here, as we have seen before, is used in

the sense that the model is free of contradictions with all collected (and

connected) pieces of evidence.

The first reaction to this kind of definition is to be reluctant and

sceptical, as it is healthy to be. We all have a deep feeling that

understanding has something to do with the capacity to explain something

until we have answers to all possible questions concerning the subject. Let us think about that. Imagine something that you are sure you understand. Right now, you are probably reluctant to pick anything, even those things that you always considered you understood. You are probably becoming aware of many questions that someone could ask you to which you would have to answer "I don't know".


It turns out that different people become satisfied with different

levels of explanation and, once this level is achieved, they do not

bother to formulate further questions even if they are possible. The truth is

that, unless you postulate that you cannot answer something above some

level, it is not even clear that you will ever find an end to the sequence of

possible questions. For most people, the final explanation is the religious

one simply because they assume that any question about that has no

answer. Religion, many times, works as a cut-off to the endless string of

doubts that can be raised about almost every subject.

This is the equivalent of assuming a set of axioms which are to be

accepted without questioning. Whenever we can explain things by using

the axioms, we say that we understood those things. Of course, one then

might say that we do not understand the postulates. The religious position is

that this is not acceptable. The postulates should not be questioned. The

scientific position is that anything can be questioned.

Because of this possibility of a never-ending questioning, when we

scientists say that we understood something, what we really mean is that

we have a model that describes that phenomenon, accommodates well all

collected data and is free of contradictions with other models. When a

contradiction appears, we say that there is a lack of understanding about

that point. This definition goes well with the daily-life one too. As we have

already seen, our brain works by constructing models with data. When we

do not understand something, this means a failure to create a consistent

model of the world around us because the new information does not fit

with the already established model (or models).

If there is one thing that we have learned during this book, it is the fact that the most efficient way of encoding the models of natural

phenomena we ever found is mathematics. That is why scientific models

are constructed using it. It is also because we can use mathematics to

generate predictions that we can test by measuring data. We can do this

because mathematics allows us to do logical inference simply by following

mechanical substitution rules of symbols.


All theories of science, all descriptions of phenomena, are

ultimately based on mathematical models. In order to make a connection

with our daily life we use mathematics plus our more elementary

communication codes, like English. That is because our brains were trained

to model the world in terms of languages long before we learn maths.

We have already studied how probability enters in the

constructions of models for the phenomena of our world. But remember

that there is also one level above that of creating models in which probabilities enter: their selection.

Sometimes we have more than one description of the same

phenomenon, i.e. more than one model, which we were able to create

based on all collected data and we need to decide between these models.

We do that in the same way as we decide everything else: by doing Bayesian inference. Bayesian inference is one level above science itself, in the sense that it is a mathematical description of a

philosophical framework. It is Bayesian inference that guides us on how to

do science. That is because one of the tasks of a scientist is to choose the

best models that fit evidence without assuming anything else and, when

more evidence becomes available, the models should be updated.
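As a toy illustration of that model selection (entirely made up, not an example from the text): suppose we have two models of a coin, one fair and one biased towards heads, and we observe 8 heads in 10 flips. With equal prior probabilities for the two models, Bayes' rule says the posterior odds equal the ratio of how well each model predicts the data.

```python
def sequence_probability(p_heads, heads, tails):
    """Probability a model with head-probability p_heads assigns to a
    specific observed sequence with the given counts."""
    return p_heads**heads * (1 - p_heads)**tails

heads, tails = 8, 2
evidence_fair = sequence_probability(0.5, heads, tails)
evidence_biased = sequence_probability(0.8, heads, tails)

# With equal priors, posterior odds = ratio of evidences (the Bayes factor).
bayes_factor = evidence_biased / evidence_fair
print(bayes_factor)  # about 6.9: the data favour the biased model
```

If more flips arrive, we simply update the counts and compare again, which is exactly the cycle described above.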

If you remember our discussion about the life history of Tutankhamun, you will remember that we needed to compare an ever-increasing amount of evidence against possible stories and choose the best one. We did not discuss it then, but if all stories were incompatible with the evidence, what should we do? The answer is that we should invent one which is compatible. How do we know that that story would be the right one? We start the cycle of finding more evidence and comparing again.

Does that seem similar to anything that you have learned about science?


SCIENCE AND BAYES

I remember that towards the end of my Ph.D. at the Institute of Physics of the University of Sao Paulo, in Brazil, my supervisor, Prof Nestor Caticha, took me to have lunch in the fancy restaurant of the university's business school. These days I have collected enough evidence to assert that it is an international truth that business schools have the best restaurants on any campus. It was a buffet and, while we were serving ourselves, I remember that somehow the discussion ended up being about the foundations of science.

"Isn't science just a belief system like any other?" Nestor asked me casually.

"I used to think that," I answered. That was true. I had even convinced a friend of my wife, who is a lawyer, of that. But then I added, "but after I understood Bayesian inference I realised that that is not the case."

Nestor looked at me, smiled and said, "If that is all you take with you from your Ph.D., I consider my role successful. Now go and spread the word."

I am not exaggerating when I say that understanding Bayesian

inference changed my whole way of seeing the world around me. It is now

time for me to guarantee that it is going to do the same for you as well. If

that is the only thing you take with you from this book, I will have paid my

debt to Nestor.

We have been talking about many areas of knowledge, especially

of scientific knowledge, but there is one level that we have touched only slightly up to this point, which, as you might already be guessing, is how to do science. Many people suggested during the 20th century that science is nothing more than a belief system not unlike any religion or


mythology. Are they right? What would be the difference, if any, between

science and the rest?

If you have understood the rest of this book correctly, you are smiling in your chair right now because you know the answer. It is a good feeling to understand what is wrong with post-modernism, isn't it? The answer is, of

course, Bayesian inference.

What is the objective of science? It is to understand the laws that

organise the universe. But how exactly does science intend to do that

differently from, let us say, religion? After all, religion also tries to find

somehow a theory about the universe. Both science and religion create

models of the universe in their own way. They use different languages and

different symbols, but both can be summarised as inventing descriptions, called explanations, for natural phenomena.

The crucial point is that science's fundamental idea is to find

explanations, or models, for natural phenomena by only using the

information given by nature itself, with minimal extra assumptions. Yes,

that is maximum entropy. In addition, whenever something new is

discovered, we want to review our previous beliefs using the same

principle. There you are. Bayesian inference.

Compare it with other systems similar to religion. For instance,

think about the concept of faith. Faith is a nice idea from the emotional

point of view. It evokes noble feelings. It is romantic, poetic, but it is a

complete disaster when it is used for reasoning. Why? Why is faith so bad for reasoning? For one single reason: faith is the idea that you should not update your beliefs. You might object to this by saying that, in fact, you can update your beliefs, but that you should assign such a high prior to information given by an authority that your beliefs change very little. That is not completely

true, because if you keep including new evidence, sooner or later your

beliefs will change to reflect the evidence. Religions, ideologies and football

rely on completely crushing the addition of new evidence that contradicts

your beliefs.
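The point that evidence eventually overwhelms even a very strong prior can be checked with a small sketch (all numbers here are hypothetical): start with a belief of 99.9% and repeatedly apply Bayes' rule with observations that are five times more likely under the alternative.

```python
def bayes_update(prior, likelihood_h, likelihood_alt):
    """One application of Bayes' rule for hypothesis H versus its alternative."""
    numerator = prior * likelihood_h
    return numerator / (numerator + (1 - prior) * likelihood_alt)

belief = 0.999  # a near-certain, authority-backed prior
for _ in range(10):
    # each observation is 5 times more likely if H is false
    belief = bayes_update(belief, likelihood_h=0.1, likelihood_alt=0.5)
print(belief)  # the once near-certain belief has collapsed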


But what about science? What about the absolute truths of

science, like gravity? If you did not get it by now, it is time. Especially

because after understanding Bayesian inference you should be completely

prepared for that:

There are no absolute truths in science.

Everything in science has a plausibility of being right, but that can

always change if new evidence appears. Every single concept in science

hangs on by a thread which is as thick as the amount of evidence collected

to support that concept. Review the evidence and the concept can change

at any time.

From the emotional point of view, that seems quite unsatisfactory.

The certainty provided by religion is comforting. The fact that everything

in science is always changing is disturbing, and the great majority of people

cannot cope with that. That is too bad, but nature does not care if you are

prepared for the truth or not.

But we surely have to rely on some kind of fixed concept, right? Am

I not saying that Bayesian inference, or even deeper, maximum entropy is

the one thing we should not doubt? No. Even that we should question.

Even consistency? Yes, even consistency. Nobody said understanding the

world (whatever it means) would be easy. But we should continue to

believe that it is possible until further evidence forces us to update this.

You do not need to be in dismay, though. There is some order in

the chaos as long as we use Bayesian inference. It is an irony that the one

who first had a glance at this was a reverend.

LIMITS TO KNOWLEDGE

Contrary to what many philosophers will tell you, it is possible to outline a rigorous framework for science and the scientific method. This framework relies entirely on the concepts we have seen throughout this book and, in a sense, is its climax.

The first thing we need to do is to leave aside our strong feelings about the subject and try to find an, at least rough, definition of what we mean by science. Broadly, we can say that science is the art (or whatever word you prefer to use here) of encoding the information we obtain from the world around us into models that are falsifiable. Here we have a new concept and we need to talk a bit about it.

The first person to propose falsifiability as a requirement for scientific models was Karl Popper (Popper, 1935) in the 20th century. Roughly, this means that the models of science should be testable in the sense that there must be some experiment that can be done in principle (although sometimes not in practice) which the model has the possibility of answering wrongly. This is basically the requirement that every model should be able to predict something. If a model cannot make a prediction, it can never be wrong. If it can never be wrong, it cannot be evaluated and is therefore out of the scope of science (although it is still in the range of philosophy).

Note that I am not saying that a non-falsifiable model cannot be true, but this is a very muddy philosophical area. What falsifiability makes evident is a fundamental logical limitation of science. The idea of science is to have a set of rules that would allow us to infer the laws of nature from observations. But as any good lawyer knows, given a finite set of observations and nothing else, you can always create a good story unless you have external limitations.

The same thing happens here. When one creates a model to fit a certain dataset of observations, unless there is some limitation, we can make the corresponding likelihood as complicated as is required to give probability one to all our observations. There is a very simple visual way to understand this in terms of certain mathematical objects which we talked about before: the polynomials.


Remember that we learned that polynomials are functions of one or more variables involving only natural powers of the variables multiplied by numbers and added together. The most common are those involving only one variable, whose most used letter is x. The degree of the polynomial is the value of the highest power appearing in the formula. These are some additional examples:

2x + 1,    x² − 3x + 2,    x³ − 2x² + x − 1.

The first in the list is a polynomial of degree 1, the second of degree 2 and the third of degree 3. It turns out that we can plot polynomials of one variable in a graph by giving values to x and calculating the resulting number. For the above three, we get the following graphs:

The first one is a straight line, the second is called a parabola and the third has no special name.

Now, everybody knows that through any two points one can draw a straight line. Less known is that through any three points one can draw a parabola, the graph of a polynomial of the second degree, passing through all the points. The pattern continues for all polynomials of one variable. If you have n points, you can always find a polynomial of degree n − 1 which passes exactly through all the given points. Finding the polynomial that passes through a certain number of points is called fitting a curve. It does not require a big leap to notice that fitting a curve is a very simple case of finding a model for a dataset.
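This curve-fitting fact is easy to check numerically. Below is a minimal sketch that builds the unique degree n − 1 polynomial through n points using Lagrange interpolation; the three sample points are my own invention, chosen only for illustration.

```python
def lagrange_poly(points):
    """Return the unique degree n-1 polynomial through n points, as a function."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            # Basis term: equals yi at xi and 0 at every other given x.
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

pts = [(0.0, 1.0), (1.0, 3.0), (2.0, 7.0)]   # hypothetical observations
p = lagrange_poly(pts)

print([p(x) for x, _ in pts])  # → [1.0, 3.0, 7.0]: the parabola hits every point
```

Since three points determine a unique parabola, the fitted polynomial reproduces each observation exactly; with more points you would simply get a higher-degree curve doing the same thing.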

But if you can use a parabola to fit any three points, you surely can fit any two points with a parabola, right? Right. This means that the more correct assertion is that through any n points you can always fit a polynomial of degree n − 1 or higher. But if polynomials are models for our points, does that mean that given a certain number of points we can find an infinite number of models that explain them? Yes, that is correct. Now you start to see the problem. If all you have is two points, your story to explain those two points can be a straight line (the one rarely used by lawyers), a parabola or any other polynomial. How do you know which one is the correct model?

Suppose you had two points and fitted a straight line. How would you try to find out if your straight line is the correct explanation for all possible points? The obvious answer is that you need a third point to check it! Without a third point, there is no way to decide. The third point is the equivalent of an experiment that needs to be done to test your model. Each time a new experiment gives a point on your straight line, it increases the confidence in your model. The tricky part is that you never know if the next point will still be on that line. All you can do, thanks to Bayes, is to be sure that each new point on the line increases the probability that it is the correct explanation.

Now, notice that "can be explained" is not the same as "is the correct explanation". Most of you must remember a trigonometric function called the sine. The sine is the prototype of a wave and its graph is given by the picture below, in which two points were highlighted.


Although the two points were generated by a sine function, the green dashed straight line also fits them. You could say that the straight line also explains them. To check whether the straight line is correct or not, one would need at least one more point. Notice that in this case, if the next point were exactly where the axes cross each other, the zero-zero point, the straight line would still pass over it! In that case, our certainty about the line would increase even though it is not the correct explanation! In daily life, this is called a coincidence. One more point would then be enough for us to see that the straight line is the wrong function. The problem is that sometimes people like the straight line so much that they start to ignore the next points.
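The coincidence described above is easy to reproduce numerically. In the sketch below (the sampling positions are my own choice, not taken from the book's figure), two points generated by a sine also determine a straight line, and that line happens to pass through the origin too; only a fourth experiment separates the two stories.

```python
import math

def line_through(p, q):
    """The straight line fitted through two points, as a function of x."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)

# Two observations secretly generated by a sine.
p, q = (-2.0, math.sin(-2.0)), (2.0, math.sin(2.0))
line = line_through(p, q)

# Third experiment at x = 0: the line predicts 0 and sin(0) = 0 as well.
# The wrong model survives -- a coincidence.
print(abs(line(0.0) - math.sin(0.0)) < 1e-9)   # True

# Fourth experiment at x = 1 finally exposes the straight line.
print(abs(line(1.0) - math.sin(1.0)) < 1e-9)   # False
```

The symmetry of the chosen points is what manufactures the coincidence at the origin; a model passing one extra test is evidence in its favour, not proof.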

But what happens if someone now says that there are no sines in the universe? Everything is a polynomial; you just need to find its right degree. Is there any experiment one can do to disprove this polynomial theory of the universe? If our experiments are limited to points, the answer is no! Even if the function generating the points is actually a sine, there is no way to tell which theory is correct by doing any experiment. Theories like that have another characteristic: they cannot make predictions.

The sine-theory, on the other hand, is testable in principle. We can start to do experiments and compare them with the sine graph. If a (correctly measured) point falls outside the curve, then the sine is falsified. We might never know if the next point will be wrong, which means that we might never know for sure that the sine is the correct theory, but we have the possibility of disproving it in principle.
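In practice, "falls outside the curve" must take measurement error into account. Here is a minimal sketch of such a falsification check; the error bar, the 3-sigma tolerance and the two "measurements" are all invented for illustration.

```python
import math

def consistent(prediction, measurement, sigma, k=3.0):
    """Crude falsification test: is the measurement within k standard deviations?"""
    return abs(prediction - measurement) <= k * sigma

# Sine-theory prediction at x = 1.0, with a hypothetical error bar sigma.
x, sigma = 1.0, 0.05
prediction = math.sin(x)   # about 0.84

print(consistent(prediction, 0.86, sigma))  # True: the sine survives this test
print(consistent(prediction, 0.45, sigma))  # False: the sine would be falsified
```

A surviving test never proves the sine-theory; it only raises its posterior. A clearly failing test, if the measurement itself is trusted, is what rules it out.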

The sine-theory has the property we want to attribute to scientific theories, while the polynomial-theory, even though it might be true, does not. It is said to be non-falsifiable. In fact, unless you get some emotional comfort from knowing that the universe is explained by a polynomial, this theory is completely useless.

Consider solipsism. This is the idea that your mind is the only existing thing in the universe and that it creates everything else, including this book. Solipsism is non-falsifiable. Can it be true? Yes. Does it matter? Philosophically, a lot; in practice, not at all. It is completely undecidable and, because of that, does not have any effect whatsoever on our lives (except maybe, as I already said, an emotional one).

Another problem with non-falsifiable models or theories is that the number of them that can be created is limited only by their creators' imagination. Although the reality of non-falsifiable theories is a legitimate and difficult philosophical problem, this is the point where science departs from philosophy. Science is concerned with falsifiable theories, and that should be part of its definition.

THE METHOD

We have now limited scientific models to those which are falsifiable. This defines the scope of our subject. It does not mean that, from now on, we are forbidden to imagine non-falsifiable things, but being non-testable, they are out of the scope of science simply because we cannot ever decide whether they are right or wrong by any observable effect.

We have spent most of this book understanding how to test models: using Bayesian inference. That is exactly how we should test scientific models. The scientific method is an outline of a procedure to create and test falsifiable models. In a nutshell, it can be described as a disordered chain containing the following steps in any arbitrary sequence:

1. Information Gathering: most of the time, this phase consists in measuring things. But it also consists of organising collections of concepts and even models themselves. Going to the library and picking up books is also information gathering. Playing with axioms to find new theorems can also be included here, with the new theorems standing for new information.

2. Modelling: encoding all gathered information into a model or, as we have seen, a likelihood. As we extensively discussed in this book, there is no rigid way of doing that. This is where creativity enters. However, some rules must be followed. The most important of them is that the model should be falsifiable. It should also aim to be consistent.

3. Testing: this is a complicated phase. We have to check the consistency of the model with the rest of the information we have. In other words, we have to try to falsify the model and to update the posteriors. Then, we compare models and, possibly, choose the best.

Welcome to science. Notice that phases 1 and 2 could also be claimed to be realised by any religion or pseudoscience were it not for the requirement of falsifiability. It is then in phase 3 that falsification takes place. That is why it is important to understand that phase in more detail.

CHECKING

Checking for consistency, falsifying and choosing are all instances of decision making. Let us see each one separately through Bayesian lenses.

Checking for consistency requires embedding your model in a larger context. If you call your model M, then you are looking for a probability P(M|R), where R means all the other models which you assume are trustworthy enough for you to require yours to be consistent with.


If we can find a way to actually check consistency decisively, we can discard inconsistent models as they acquire a zero probability. Sometimes we cannot do that. Science can be very complicated, to the point that proposed models might not be easy to check for consistency. As long as those models have not already been proved to be inconsistent, the usual thing to do is to keep them as possible hypotheses.

The same happens with falsifiability. As long as a model has not been shown to be non-falsifiable, it does no harm to keep an eye on it, especially if it has other attractive characteristics, which can vary enormously.

Consider string theory, for instance. We still do not know whether string theory is falsifiable or not. As far as we are aware, it might happen that string theory is a sophisticated framework that is capable of fitting many different universes. But up to this point, nobody has proved that it is either non-falsifiable or inconsistent (either internally or with the rest of science). Its appeal is that it has inspired amazing ideas and generated beautiful mathematics. Still, that is not enough, but all we can do is to keep working on it. It is still a model under consideration which has not been discarded.

If we have several competing models, we can rank them by finding their posterior ratios. This allows us to choose the best theories in a set. But wouldn't that also allow us to say whether a theory is right or wrong by considering this as a binary decision?

The answer is generally no. That is because it is not in general possible to find the likelihood for the proposition that a theory is not right. Let us think about it. Consider that we have a data point D and a theory T. Now, suppose that we try to judge the probability of T being right or wrong. We can see this as two meta-theories, respectively T and not-T. If we define T properly, we should be able to calculate the likelihood P(D|T), but we cannot calculate P(D|not-T)! That is because this probability can be anything in other theories that are not T!
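When the competitors are fully specified, though, the ranking by posterior ratios is straightforward, because each model supplies its own likelihood. A toy sketch with two hypothetical coin models and an invented toss record:

```python
from math import prod

# Two fully specified models for a coin: M1 says P(heads) = 0.5, M2 says 0.8.
# 'data' is a hypothetical record of tosses (1 = heads, 0 = tails).
data = [1, 1, 0, 1, 1, 1, 0, 1]

def likelihood(p_heads, tosses):
    """P(data | model) for independent tosses with a fixed heads probability."""
    return prod(p_heads if t else 1.0 - p_heads for t in tosses)

# With equal priors, the posterior ratio reduces to the likelihood ratio.
ratio = likelihood(0.8, data) / likelihood(0.5, data)
print(ratio > 1.0)  # True: this toss record favours M2 over M1
```

Nothing similar can be written for the bare negation of a model: "everything except this model" does not supply a likelihood of its own, which is exactly the obstacle just described.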


Of course, all three phases of the scientific method present their own difficulties, but they do form a consistent system for checking models and for guaranteeing that we are choosing those that are the best possible to account for the phenomena we see in our world. Thanks, once again, to Bayes and Laplace.

BEAUTY AND THE BEAST

Because I am a theoretical physicist, I think I will be excused when I say that it is one of the most beautiful areas of science. Relating symmetry, geometry and mathematics to the patterns of nature in an intricate but consistent way is something that never ceases to amaze you once you start to understand how it works. But... yes, there is always a catch!

There are many stories about experiments that gave results contradicting predictions of theoretical physicists and which were, afterwards, shown to be wrong. Should we give more confidence to established theories than to experiments? Should we then simply discard the experiments? If so, should we simply stop requiring experimental evidence to be so fundamental in science?

The saying that "if the data contradict the theory, then throw away the data" has been attributed to many great theoretical physicists and was probably actually spoken by a good number of them. That, however, should be considered as a kind of joke. Experimental evidence should never be discarded, but it does need to be critically considered like any other piece of information.

On the other hand, evidence can be filtered into models in the form of mathematical patterns that repeat themselves time and again. This repetition is nothing more than another kind of evidence, as it is a pattern in itself. Beauty, as seen by many theoretical scientists, is just a realisation of these repeating patterns.


There is a problem, though, when the scientist's level of confidence in the mathematical models becomes closer to faith. When experimental evidence is scarce, we saw that BAYES does not allow it to change the prior too much, but as evidence starts to pile up, it affects the posterior in a significant way. Because of this, even contradictory experiments should be taken into consideration and not thrown away. If considered in the right way, they cannot overthrow a correct model.

Nothing, though, prevents one from analysing the validity of the contradictory experiment itself using BAYES. You can always calculate the probability of the experiment being right or wrong according to the rest of the relevant information. Although it is not possible to do that for general theories, as we have seen, for experiments this becomes easier as some background theories are always considered true.

ADDRESSING ONESELF

All this seems reasonable and consistent, but how do we know that BAYES, or its basis, maximum entropy, is actually correct? If I say that I should check BAYES by using BAYES on itself, am I not cheating?

This is an instance of something called self-reference, which is when a system talks about itself. Remember the Epimenides Paradox, in which a sentence was neither true nor false? That was a case of self-reference, as the sentence was about itself. Self-reference is full of pitfalls, but also full of wonderful things. Just think that what makes us feel conscious as independent beings is the fact that we can actually think about ourselves.

In mathematics, all sorts of strange things happen when systems are powerful enough to reason about themselves. One famous instance is Gödel's Theorem, a surprising mathematical result that appeared during the first half of the 20th century. This theorem says that in some very well established mathematical structures, like the arithmetic of the integer numbers, although the system can talk about itself, there are some propositions that, although true, the system cannot decide whether they are true or not. A system like this is called incomplete. Incompleteness happens with some systems, but not with others. There are technically subtle issues that make it difficult to identify when this happens.

It can be shown via Gödel's Theorem that for systems that obey certain conditions, self-reference being one of them, if the system is complete, then it has to be inconsistent! This means that there are some propositions in the system that are true and false at the same time!

Let us now get back to the first question: are we cheating if we use BAYES to test BAYES? Well, as long as it is consistent, we can try it and see if it leads to some undesirable or wrong consequence. That is all we can do. Now, if we allow that, we are allowing BAYES to talk about itself. Should we be worried that it complies with Gödel's Theorem?

We argued that science is based on BAYES. If BAYES turns out to result in an inconsistent system, then we are in trouble, because some results can end up being true and false at the same time, and that might be a disaster. What about incompleteness? That is less harmful, because in science we have what is called an oracle. An oracle is something that we can always use to decide upon the truth of something by asking it, without using the theory. In science, nature is our oracle. Of course, we always need to check the validity of the experiments...

But still, it is not clear that Gödel's Theorem applies to science. This is a complicated subject, and one we do not actually need to settle in order to continue to do science in the best way we can.

ENTROPY ALWAYS INCREASES

I need to include an observation before we finish our conversation. Many people will object to this last part about science, saying that scientists, like any human being, are corrupt. Greedy researchers can and do hide information that contradicts their ideas. Some simply do their experiments in a wrong way. Others stick to non-falsifiable ideas and defend them as if they were scientific ones. If science is what scientists do, doesn't that contradict the purity that I described in the previous sections?

It does. That is because science is not what scientists do. That is a horrible definition. At least, it is not the science with the objectives we have discussed in this book. That is the same as saying that justice is what judges do. If you find a corrupt one, should you believe that every decision he or she makes is fair? Of course not. Both in justice and in science, what defines these concepts should be sought in higher-level principles. Nobody said it is easy to do that, but in science we have a very good idea!

Any system will tend to be corrupted with time. That is a consequence of the fact that entropy (disorder) tends to increase. We all know that governments, schools and virtually all human institutions might start with good intentions, but they are not immune to degradation. Corruption can appear in every system, including the scientific community. That is why it is important to keep the separation between science and the scientist. We humans are not consistent machines. We are driven by instincts which mainly guide our survival, no matter what it takes. There is some evidence that our survival also depends on Bayesian inference, but that is a story for another book.

One can always redefine the meaning of words. Nobody has a monopoly on them. They change meaning with time. Science, in the beginning, did not describe falsifiable knowledge, but we have learned that without falsifiability we cannot check the validity of a model, and so falsifiability was included in the meaning. Today, the meaning of science seems to be shifting from the procedure to the profession. This is dangerous if people start to confuse the two things. As humans, that is what we invariably do.


12.

Answers

How well can we, after a whole book, answer the questions we proposed in

the beginning? Let us see:

Q. How do you know we are not living inside the Matrix (or the next best thing)?

A. We first need to define what this Matrix is by creating a model to explain the data we collect from our observations of the world. Then, we gather information to increase or decrease its probability by using BAYES. If the model is not falsifiable, though, we will never know whether it is true or false.

Q. Can you ever tell whether everything is not actually an illusion inside

your mind?

A. No. That is solipsism and it is not falsifiable.

Q. Isn't science just a belief system not unlike religion?

A. No. Science uses BAYES; religion does not.

Q. How do we know that elementary particles exist if we cannot see them?

A. "Existing" is philosophically complicated. We can say that the model with particles is consistent with our observations of the world because those observations provide indirect evidence that supports it.

Q. Can we ever hope to find an answer to any of these questions?


A. We just did.

And finally: what is the relation of these questions to this book?

A. BAYES


Bibliography

Obs: books highlighted in red in this section are technical books or papers

and might need an extra level of scientific or mathematical knowledge to

be understood.

Bak, P. (1996) How Nature Works: The Science of Self-Organized Criticality,

Springer-Verlag

Bayes, T. (1763) An Essay Towards Solving a Problem in the Doctrine of

Chances, Philos. Trans. R. Soc. London 53, 370-418

Bernoulli, D. (1738) Hydrodynamica

Cardano, G. (1663) Liber de Ludo Aleae

Carroll, L. (1871) Through the Looking-Glass, and What Alice Found There

Caticha, A. (2010) Entropic Inference, arXiv:1011.0723 [physics.data-an]

David, F.N. (1962) Games, Gods and Gambling, Hafner Publishing Company

Feller, W. (1968) An Introduction to Probability Theory and Its Applications

Vols. 1 and 2, John Wiley & Sons

Ferriss, T. (2011) The 4-Hour Work Week: Escape the 9-5, Live Anywhere and Join the New Rich, Vermillion

Gleick, J. (1988) Chaos: Making a New Science, Penguin Books

Greene, B. (2005) The Fabric of the Cosmos: Space, Time and the Texture of

Reality, Penguin Press Science

Jaynes, E.T. (2003) Probability Theory: The Logic of Science, Oxford University Press


Landauer, R. (1996) The Physical Nature of Information, Phys. Lett. A 217,

188-193

Laplace, P.S. (1825) Philosophical Essay on Probabilities

Li, M., Vitanyi, P. (1997) An introduction to Kolmogorov complexity and its

applications, Springer-Verlag

Noether, E. (1971) Invariant Variation Problems, translated by M. Tavel,

Transport Theory and Statistical Physics 1, 186-207

Penrose, R. (1994) Shadows of the Mind

Popper, K. (1935) The Logic of Scientific Discovery

Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. (1992)

Numerical Recipes in C, The Art of Scientific Computing, Cambridge

University Press

Sagan, C. (1985) Contact

Sagan, C. (1997) The Demon-Haunted World: Science as a Candle in the

Dark, Ballantine Books

Santos, A. (2009) How Many Licks? Or, How to Estimate Damn Near

Anything

Sivia, D.S., Skilling, J. (2006) Data Analysis: A Bayesian Tutorial, Oxford

University Press

Stewart, I., Cohen, J. (1997) Figments of Reality: The Evolution of the Curious Mind, Cambridge University Press

Todhunter, I. (1865) A History of the Mathematical Theory of Probability

from the Time of Pascal to that of Laplace, Macmillan and Co

Wigner, E. (1960) The Unreasonable Effectiveness of Mathematics in the

Natural Sciences, Comm. Pure Appl. Math. 13, 1-14


Wolfram, S. (2001) A New Kind of Science, Wolfram Media


Appendix A.

Internet Material

The reference material found in the bibliography of this book comes mainly from three sources: scientific journals, books and websites. If you are having difficulty finding them, you can access them through the appropriate links on my website at the address:

http://alamino.org/the-probable-universe/

All you need to do is to click on the link and it will take you to the official reference. Some of the material can also be found for free and, if this option is available, I have put the appropriate link as well.

RANDOM USEFUL WEBSITES

Maths

Wolfram Alpha: http://www.wolframalpha.com/

Among other things, this site can be used as a simplified version of Mathematica, the famous software from the same company. You can use it to plot graphs, solve equations and do other kinds of calculations. It has an impressive recognition algorithm. Try, for instance, typing Gauss(1,2) to see what happens.

Wolfram Mathworld: http://mathworld.wolfram.com/

Previously known as Eric Weisstein's World of Mathematics, after its creator, this website is a mathematics encyclopaedia containing virtually everything you would like to know about mathematics at a technical level.


Random.org: http://random.org

It contains a lot of interesting information about random numbers. It also has a handy random number generator for you to experiment with.

Public Domain Books

Gutenberg Project: http://www.gutenberg.org/

Open Library: https://openlibrary.org/

For scientific papers, there are two main ways to check and obtain them if they are available: the arXiv and Google Scholar. Both are described below.

ARXIV

Some of the material cited in this book is available for free on the internet. References containing the word "arXiv" are preprints that can be found on the website

http://www.arxiv.org

The arXiv is today the standard preprint repository for physics and has been running since the end of the 90s. A preprint is a preliminary version of a scientific article or paper which is made available before it has passed through the process of peer review at some official scientific journal. Most physicists, including famous ones, have the helpful habit of posting their preprints on the arXiv before sending them for peer review.

In addition to articles, the arXiv also contains lecture notes, PhD and Master's theses, drafts of books and many other helpful documents. As with any kind of information wherever you get it, you must be careful not to blindly trust what is in them, but that is also true even for what is published in the official journals.

As an example, if you want to access the reference


Caticha, A. (2010) Entropic Inference, arXiv:1011.0723 [physics.data-an]

you should type in your browser

http://arxiv.org/abs/1011.0723

This will take you to a page containing the abstract of the paper and options of formats to download it. PDF is available most of the time, as are PS and source files for those who know how to handle them. This is a relatively modern reference. Older ones have a slightly different code, like this one:

arXiv:hep-th/9203054

In this case, the address would be

http://arxiv.org/abs/hep-th/9203054.

GOOGLE SCHOLAR

Most scientific journals charge relatively expensive fees for you to download a paper. It is worth remembering that the authors never get any of that money; they even give up the copyright of the paper to the journal without receiving one penny. Because many authors got annoyed with this situation (although not enough of them), today several scientific journals allow the authors to keep a copy of their papers on their personal websites. Google Scholar can be used to find these copies. All you need to do is to type the title of the paper in the search box of Scholar at:

http://scholar.google.com

Once you find the paper, search for a downloadable version under the link "All n versions" right below it.


OPEN ACCESS JOURNALS

Because scientific journals have not been able to keep papers protected on the internet, they have opted for a different strategy. Instead of charging for downloads, they charge the authors amounts that vary between 1000 and 3000 US dollars, and the paper becomes free to access.

A list of these journals with the respective links can be found on Wikipedia:

http://en.wikipedia.org/wiki/List_of_open-access_journals


Appendix B.

Mathematical Symbols

P(A): The probability (density) of proposition A
Ā: Not A
A|B: A given B
A + B: A or B
AB: A and B
∝: Proportional to
𝔠: Cardinality (size) of the set of real numbers
Z: Partition function (statistical physics)
ℕ, ℤ, ℝ: Sets of natural, integer and real numbers
∞: Infinity
[a, b]: Real interval starting in a and ending in b, including the endpoints
Ω: Sample space
i: Imaginary unit, i² = −1
|ψ⟩: Ket, quantum state vector labelled by the letter ψ
|x|: Absolute value or modulus of the number x
⇒: Implies
∧: AND logical operator
∨: OR logical operator


Appendix C.

Scientist List

We have met many players during this book, so many that it is worthwhile to have a list of them. There are two extra advantages in doing that. The first is that you can get a better idea of who lived at what time and, second, of what their specialities were.

Bayes, Thomas (1701-1761): English statistician, philosopher and Presbyterian minister
Boltzmann, Ludwig (1844-1906): Austrian physicist and philosopher
Borel, Félix Édouard Justin Émile (1871-1956): French mathematician and politician
Cantor, Georg Ferdinand Ludwig Philipp (1845-1918): German mathematician
Cardano, Gerolamo (1501-1576): Italian mathematician, physician, astrologer, philosopher and gambler
Cohen, Paul Joseph (1934-2007): American (USA) mathematician
Dirac, Paul Adrien Maurice (1902-1984): British physicist
Einstein, Albert (1879-1955): German physicist
Fermat, Pierre de (1601-1665): French lawyer and mathematician
Galileo Galilei (1564-1642): Italian physicist, mathematician, engineer, astronomer and philosopher
Gauss, Johann Carl Friedrich (1777-1855): German mathematician
Gibbs, Josiah Willard (1839-1903): American (USA) scientist
Kolmogorov, Andrey Nikolaevich (1903-1987): Soviet mathematician
Laplace, Pierre Simon (1749-1827): French mathematician and astronomer
Mach, Ernst Waldfried Josef Wenzel (1838-1916): Austrian physicist and philosopher
Maldacena, Juan Martín (1968-): Argentinean physicist
Mandelbrot, Benoit B. (1924-2010): French-American mathematician
Markov, Andrey Andreyevich (1856-1922): Russian mathematician
Maxwell, James Clerk (1831-1879): Scottish physicist
Michell, John (1724-1793): English clergyman and natural philosopher
Newton, Isaac (1642-1727): English physicist and mathematician
Pareto, Vilfredo Federico Damaso (1848-1923): Italian engineer, sociologist, economist, political scientist and philosopher
Pascal, Blaise (1623-1662): French mathematician, physicist, inventor, writer and philosopher
Podolsky, Boris Yakovlevich (1896-1966): American (USA) physicist
Price, Richard (1723-1791): Welsh philosopher and preacher
Rosen, Nathan (1909-1995): American (USA)-Israeli physicist
Shannon, Claude Elwood (1916-2001): American (USA) mathematician, electronic engineer and cryptographer
Solomonoff, Ray (1926-2009): American (USA) mathematician
Zipf, George Kingsley (1902-1950): American (USA) linguist


Appendix D.

Greek Letters

The Greeks might not have invented mathematics, but they surely took it
to higher levels by inventing science. It is only natural that so many of their
letters appear as mathematical symbols today. To help you at least spell
their names, here is a table with the Greek letters, their names and some
common variations.

Small   Capital   Name
α       Α         Alpha
β       Β         Beta
γ       Γ         Gamma
δ       Δ         Delta
ε, ϵ    Ε         Epsilon
ζ       Ζ         Zeta
η       Η         Eta
θ, ϑ    Θ         Theta
ι       Ι         Iota
κ       Κ         Kappa
λ       Λ         Lambda
μ       Μ         Mu
ν       Ν         Nu
ξ       Ξ         Xi
ο       Ο         Omicron
π, ϖ    Π         Pi
ρ, ϱ    Ρ         Rho
σ, ς    Σ         Sigma
τ       Τ         Tau
υ       Υ         Upsilon
φ, ϕ    Φ         Phi
χ       Χ         Chi
ψ       Ψ         Psi
ω       Ω         Omega
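If you ever need to produce these symbols on a computer, they are all part of the Unicode standard. The short Python sketch below (an illustration only, not something you need for the rest of the book) prints the small and capital forms directly from the official Unicode character names.

```python
import unicodedata

# Official Unicode letter names for the 24 Greek letters; note that
# Unicode spells "Lambda" as "LAMDA" in its character names.
NAMES = ["ALPHA", "BETA", "GAMMA", "DELTA", "EPSILON", "ZETA", "ETA",
         "THETA", "IOTA", "KAPPA", "LAMDA", "MU", "NU", "XI", "OMICRON",
         "PI", "RHO", "SIGMA", "TAU", "UPSILON", "PHI", "CHI", "PSI",
         "OMEGA"]

for name in NAMES:
    small = unicodedata.lookup(f"GREEK SMALL LETTER {name}")
    capital = unicodedata.lookup(f"GREEK CAPITAL LETTER {name}")
    # Use the conventional spelling in the printed label.
    label = "Lambda" if name == "LAMDA" else name.capitalize()
    print(f"{small}  {capital}  {label}")
```

The variant forms in the table (ϵ, ϑ, ϖ, ϱ, ς, ϕ) have their own code points as well, under names such as GREEK LUNATE EPSILON SYMBOL and GREEK SMALL LETTER FINAL SIGMA.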


Brainstorming Area

Talk about other methods of inference which are not Bayesian and their
relations. -> when not to use Bayes (although using it) - simplifications

Add How to Solve It by Pólya.

Talk about Bayesian inference in biology. The brain might be hardwired
to be Bayesian.

[probabilities are related to what we do not know] , and we all know

that those are the things that scare us most

Generalisation vs memorisation, non-testable hypothesis, no extra

experiments to be done.

Talk about average faces when defining the mean?

Research the origin of the name aleph-zero/aleph-null

Talk about averages and how they enter the estimation process.

Means and averages as estimators

Explain that entropy is a functional.

Forgetting and adaptability.

Add the letters between Fermat and Pascal on Probability as an

appendix.

I have to answer at some point the question: Why do we keep

using wrong theories?

Include the non-interacting universes problem.
