THE REAL THING
Richard Jeffrey
© 2002
November 4, 2002
Contents, 1
1 Probability Primer, 8
1.1 Bets and Probabilities, 8
1.2 Why Probabilities are Additive, 11
1.3 Probability Logic, 15
1.4 Conditional Probability, 18
1.5 Why ‘|’ Cannot be a Connective, 21
1.6 Bayes’s Theorem, 22
1.7 Independence, 23
1.8 Objective Chance, 25
1.9 Supplements, 28
1.10 References, 33
2 Testing Scientific Theories, 35
2.1 Quantifications of Confirmation, 36
2.2 Observation and Sufficiency, 38
2.3 Leverrier on Neptune, 40
2.4 Dorling on the Duhem Problem, 41
2.4.1 Einstein/Newton 1919, 43
2.4.2 Bell’s Inequalities: Holt/Clauser, 44
2.4.3 Laplace/Adams, 46
2.4.4 Dorling’s Conclusions, 48
2.5 Old News Explained, 49
2.6 Supplements, 52
3 Probability Dynamics; Collaboration, 55
3.1 Conditioning, 55
3.2 Generalized Conditioning, 57
3.3 Probabilistic Observation Reports, 59
3.4 Updating Twice; Commutativity, 61
3.4.1 Updating on alien probabilities for diagnoses, 62
3.4.2 Updating on alien factors for diagnoses, 62
3.5 Softcore Empiricism, 63
4 Expectation Primer, 66
4.1 Probability and Expectation, 66
4.2 Conditional Expectation, 67
4.3 Laws of Expectation, 67
4.4 Mean and Median, 71
4.5 Variance, 73
4.6 A Law of Large Numbers, 75
4.7 Supplements, 76
5(a) Updating on Statistics, 79
5.1 Where do Probabilities Come From?, 79
5.1.1 Probabilities from Statistics: Minimalism, 80
5.1.2 Probabilities from Statistics: Exchangeability, 81
5.1.3 Exchangeability: Urn Examples, 82
5.1.4 Supplements, 84
5(b) Diaconis and Freedman on de Finetti’s
Generalizations of Exchangeability, 85
5.2 Exchangeability Itself, 86
5.3 Two Species of Partial Exchangeability, 88
5.3.1 2 × 2 Tables, 88
5.3.2 Markov Dependency, 89
5.4 Finite Forms of de Finetti’s Theorem on Partial Exchangeability, 90
5.5 Technical Notes on Infinite Forms, 92
5.6 Concluding Remarks, 96
5.7 References, 99
6 Choosing, 102
6.1 Preference Logic, 102
6.2 Causality, 106
6.3 Supplements, 109
6.3.1 “The Mild and Soothing Weed”, 109
6.3.2 The Flagship Newcomb Problem, 112
6.3.3 Hofstadter, 114
6.3.4 Conclusion, 116
6.4 References, 117
Preface
Here is an account of basic probability theory from a thoroughly “subjective”
point of view,¹ according to which probability is a mode of judgment. From
this point of view probabilities are “in the mind”—the subject’s, say, yours.
If you say the probability of rain is 70% you are reporting that, all things
considered, you would bet on rain at odds of 7:3, thinking of longer or shorter
odds as giving an unmerited advantage to one side or the other.² A more
familiar mode of judgment is flat, “dogmatic” assertion or denial, as in ‘It
will rain’ or ‘It will not rain’. In place of this “dogmatism”, the probabilistic
mode of judgment offers a richer palette for depicting your state of mind, in
which the colors are all the real numbers from 0 to 1. The question of the
precise relationship between the two modes is a delicate one, to which I know
of no satisfactory detailed answer.³
Chapter 1, “Probability Primer”, is an introduction to basic probability
theory, so conceived. The object is not so much to enunciate the formal rules
of the probability calculus as to show why they must be as they are, on pain
of inconsistency.
Chapter 2, “Testing Scientific Theories,” brings probability theory to bear
on vexed questions of scientific hypothesis-testing. It features Jon Dorling’s
“Bayesian” solution of Duhem’s problem (and Quine’s), the dreaded holism.⁴
Chapter 3, “Probability Dynamics; Collaboration”, addresses the problem
of changing your mind in response to generally uncertain observations of
your own or your collaborators, and of packaging uncertain reports for use
by others who may have diﬀerent background probabilities. Conditioning on
certainties is not the only way to go.
Chapter 4, “Expectation Primer”, is an alternative to chapter 1 as an
introduction to probability theory. It concerns your “expectations” of the
values of “random variables”. Hypotheses turn out to be 2-valued random
¹ In this book double quotes are used for “as they say”, where the quoted material is
both used and mentioned. I use single quotes for mentioning the quoted material.
² This holds in test cases where, over the possible range of gains and losses, your utility
for income is a linear function of that income. The thought is that the same concept of
probability should hold in all cases, linear or not and monetary or not.
³ It would be a mistake to identify assertion with probability 1 and denial with probability
0, e.g., because someone who is willing to assert that it will rain need not be prepared
to bet life itself on rain.
⁴ According to this, scientific hypotheses “must face the tribunal of sense-experience as
a corporate body.” See the end of Willard van Orman Quine’s much-anthologized ‘Two
dogmas of empiricism’.
variables—the values being 0 (for falsehood) and 1 (for truth). Probabilities
turn out to be expectations of hypotheses.
Chapter 5, “Updating on Statistics”, presents Bruno de Finetti’s dissolution
of the so-called “problem of induction”. We begin with his often-overlooked
miniature of the full-scale dissolution. At full scale, the active
ingredient is seen to be “exchangeability” of probability assignments: In an
exchangeable assignment, probabilities and expectations can adapt themselves
to observed frequencies via conditioning on frequencies and averages
“out there”. Sec. 5.2, on de Finetti’s generalizations of exchangeability, is an
article by other people, adapted as explained in a footnote to the section title
with the permission of the authors and owners of the copyright.
Chapter 6, “Choosing”, is a brief introduction to decision theory, focussed
on the version floated in my Logic of Decision (McGraw-Hill, 1965; University
of Chicago Press, 1983, 1990). It includes an analysis in terms of probability
dynamics of the difference between seeing truth of one hypothesis as a probabilistic
cause of truth of another hypothesis or as a mere symptom of it. In
these terms, “Newcomb” problems are explained away.
Acknowledgements
Dear Comrades and Fellow Travellers in the Struggle for Bayesianism!
It has been one of the greatest joys of my life to have been given so much
by so many of you. (I speak seriously, as a fond foolish old fart dying of a
surfeit of Pall Malls.) It began with open-handed friendship offered to a very
young beginner by people I then thought of as very old logical empiricists—
Carnap in Chicago (1946-51) and Hempel in Princeton (1955-7), the sweetest
guys in the world. It seems to go with the territory.
Surely the key was a sense of membership in the cadre of the righteous,
standing out against the background of McKeon’s running dogs among the
graduate students at Chicago, and against people playing comparable parts
elsewhere. I am surely missing some names, but at Chicago I especially
remember the participants in Carnap’s reading groups that met in his
apartment when his back pain would not let him come to campus: Bill Lentz
(Chicago graduate student), Ruth Barcan Marcus (postdoc, bopping down
from her teaching job at Roosevelt College in the Loop), Chicago graduate
student Norman Martin, Abner Shimony (visiting from Yale), Howard Stein
(super-scholar, all-round brain), Stan Tennenbaum (ambient mathematician)
and others.
It was Carnap’s booklet, Philosophy and Logical Syntax, that drew me into
logical positivism as a 17-year-old at Boston University. As soon as they let
me out of the Navy in 1946 I made a beeline for the University of Chicago to
study with Carnap. And it was Hempel, at Princeton, who made me aware of
de Finetti’s and Savage’s subjectivism, thus showing me that, after all, there
was work I could be doing in probability even though Carnap had seemed
to be vacuuming it all up in his 1950 and 1952 books. (I think I must have
learned about Ramsey in Chicago, from Carnap.)
As to Princeton, I have Edith Kelman to thank for letting me know that,
contrary to the impression I had formed at Chicago, there might be something
for a fellow like me to do in a philosophy department. (I left Chicago for a
job in the logical design of a digital computer at MIT, having observed that
the rulers of the Chicago philosophy department regarded Carnap not as a
philosopher but as—well, an engineer.) But Edith, who had been a senior
at Brandeis, assured me that the phenomenologist who taught philosophy
there regarded logical positivism as a force to be reckoned with, and that
philosophy departments existed where I might be accepted as a graduate
student, and even get a fellowship. So it happened at Princeton.
(Carnap had told me that Hempel was moving to Princeton. At that time,
he himself had ﬁnally escaped Chicago for UCLA.)
This is my last book (as they say, “probably”). Like my first, it is dedicated
to my beloved wife, Edith. The dedication there, ‘Uxori delectissimae
Laviniae’, was a fraudulent in-group joke, the Latin having been provided
by Elizabeth Anscombe. I had never managed to learn the language. Here I
come out of the closet with my love.
When the dean’s review of the philosophy department’s reappointment
proposal gave me the heave-ho from Stanford I applied for a visiting year
at the Institute for Advanced Study in Princeton, NJ. Thanks, Bill Tait, for
the suggestion; it would never have occurred to me. Gödel found the L of D
interesting, and decided to sponsor me. That blew my mind.

My project was what became The Logic of Decision (1965). I was at the
IAS for the first half of 1963-4; the other half I filled in for Hempel, teaching
at Princeton University; he was on leave at the Center for Advanced Study
in the Behavioral Sciences in Palo Alto. At Stanford, Donald Davidson had
immediately understood what I was getting at in that project. And I seem
to have had a similar sensitivity to his ideas. He was the best thing about
Stanford. And the decanal heave-ho was one of the best things that ever
happened to me, for the sequel was a solution by Ethan Bolker of a mathematical
problem that had been holding me up for a couple of years. Read
about it in The Logic of Decision, especially the 2nd edition.
I had been trying to prove uniqueness of utility except for two arbitrary
constants. Gödel conjectured (what Bolker turned out to have proved) that in
my system there has to be a third constant, adjustable within a certain range.
It was typical of Gödel that he gave me time to work the thing out on my
own. We would meet every Friday for an hour before he went home for lunch.
Only toward the end of my time at the IAS did he phone me, one morning,
and ask if I had solved that problem yet. When I said ‘No’ he told me he had
an idea about it. I popped right over and heard it. Gödel’s proof used linear
algebra, which I did not understand. But, assured that Gödel had the right
theorem and that some sort of algebraic proof worked, I hurried on home
and worked out a clumsy proof using high-school algebra. More accessible to
me than Gödel’s idea was Bolker’s proof, using projective geometry. (In his
almost finished thesis on measures on Boolean algebras Bolker had proved
the theorem I needed. He added a chapter on the application to decision
theory and (he asks me to say) stumbled into the honor of a citation in the
same paragraph as Kurt Gödel.) And I finished my book. Bliss.
My own Ph.D. dissertation was not about decision theory but about a
generalization of conditioning (of “conditionalization”) as a method of
updating your probabilities when your new information is less than certain.
This seemed to be hard for people to take in. But when I presented that idea
at a conference at the University of Western Ontario in 1975 Persi Diaconis
and Sandy Zabell were there and caught the spark. The result was the
first serious mathematical treatment of probability kinematics (JASA, 1980).
More bliss. (See also Carl Wagner’s work, reported in chapters 2 and 3 here.)

As I addressed the problem of mining the article they had given me on partial
exchangeability in volume 2 of Studies in Inductive Logic and Probability
(1980) for recycling here, it struck me that Persi and his collaborator David
Freedman could do the job far better. Feigning imminent death, I got them
to agree (well, they gave me “carte blanche”, so the recycling remained my
job) and have obtained the necessary permission from the other interested
parties, so that the second, bigger part of chapter 5 is their work. Parts of it
use mathematics a bit beyond the general level of this book, but I thought
it better to post warnings on those parts and leave the material here than to
go through the business of petty fragmentation.
Since we met at the University of Chicago in—what? 1947?—my guru Abner
Shimony has been turning that joke title into a simple description. For
one thing, he proves to be a monster of intellectual integrity: after earning
his Ph.D. in philosophy at Yale with his dissertation on “Dutch Book” arguments
for Carnap’s “strict regularity”, he went back for a Ph.D. in physics
at Princeton before he would present himself to the public as a philosopher
of science and go on to do his “empirical metaphysics”, collaborating with
experimental physicists in tests of quantum theory against the Bell inequalities
as reported in journals of philosophy and of experimental physics. For
another, as a graduate student of physics he took time out to tutor me in
Aristotle’s philosophy of science while we were both graduate students at
Princeton. In connection with this book, Abner has given me the highly
prized feedback on Dorling’s treatment of the Holt-Clauser issues in chapter
2, as well as other matters, here and there. Any remaining errors are his
responsibility.
Over the past 30 years or so my closest and constant companion in the
matters covered in this book has been Brian Skyrms. I see his traces
everywhere, even where unmarked by references to publications or particular
personal communications. He is my main Brother in Bayes and source of
Bayesian joy.
And there are others—as, for example, Arthur Merin in far-off Konstanz
and, close to home, Ingrid Daubechies, Our Lady of the Wavelets (alas! a
closed book to me) but who does help me as a Sister in Bayes and fellow
slave of LaTeX according to my capabilities. And, in distant Bologna, my
longtime friend and collaborator Maria Carla Galavotti, who has long led me
in the paths of de Finetti interpretation. And I don’t know where to stop:
here, anyway, for now.
Except, of course, for the last and most important stop, to thank my
wonderful family for their loving and effective support. Namely: The divine
Edith (wife to Richard), our son Daniel and daughter Pamela, who grew
unpredictably from child apples of our eyes to a union-side labor lawyer
(Pam) and a computernik (Dan) and have been so generously attentive and
helpful as to shame me when I think of how I treated my own ambivalently
beloved parents, whose deserts were so much better than what I gave them.
And I have been blessed with a literary son-in-law who is like a second
son, Sean O’Connor, daddy of young Sophie Jeffrey O’Connor and slightly
younger Juliet Jeffrey O’Connor, who are also incredibly kind and loving.
Let me not now start on the Machatonim, who are all every bit as remarkable.
Maybe I can introduce you some time.
Richard Jeffrey
dickjeff@princeton.edu
Chapter 1
Probability Primer
Yes or no: was there once life on Mars? We can’t say. What about intelligent
life? That seems most unlikely, but again, we can’t really say. The simple
yes-or-no framework has no place for shadings of doubt, no room to say that
we see intelligent life on Mars as far less probable than life of a possibly very
simple sort. Nor does it let us express exact probability judgments, if we have
them. We can do better.
1.1 Bets and Probabilities
What if you were able to say exactly what odds you would give on there
having been life, or intelligent life, on Mars? That would be a more nuanced
form of judgment, and perhaps a more useful one. Suppose your odds were
1:9 for life, and 1:999 for intelligent life, corresponding to probabilities of
1/10 and 1/1000, respectively. (The colon is commonly used as a notation
for ‘/’, division, in giving odds—in which case it is read as “to”.)

Odds m:n correspond to probability m/(m+n).
That means you would see no special advantage for either player in risking
one dollar to gain nine in case there was once life on Mars; and it means you
would see an advantage on one side or the other if those odds were shortened
or lengthened. And similarly for intelligent life on Mars when the risk is a
thousandth of the same ten dollars (1 cent) and the gain is 999 thousandths
($9.99).
Here is another way of saying the same thing: You would think a price
of one dollar just right for a ticket worth ten if there was life on Mars and
CHAPTER 1. PROBABILITY PRIMER 9
nothing if there was not, but you would think a price of only one cent right
if there would have had to have been intelligent life on Mars for the ticket to
be worth ten dollars. These are the two tickets:

Worth $10 if there was life on Mars.             Price: $1.     Probability: .1
Worth $10 if there was intelligent life on Mars. Price: 1 cent. Probability: .001
So if you have an exact judgmental probability for truth of a hypothesis, it
corresponds to your idea of the dollar value of a ticket that is worth 1 unit
or nothing, depending on whether the hypothesis is true or false. (For the
hypothesis of mere life on Mars the unit was $10; the price was a tenth of
that.)
Of course you need not have an exact judgmental probability for life on
Mars, or for intelligent life there. Still, we know that any probabilities anyone
might think acceptable for those two hypotheses ought to satisfy certain
rules, e.g., that the ﬁrst cannot be less than the second. That is because
the second hypothesis implies the ﬁrst. (See the implication rule at the end
of sec. 1.3 below.) In sec. 1.2 we turn to the question of what the laws of
judgmental probability are, and why. Meanwhile, take some time with the
following questions, as a way of getting in touch with some of your own ideas
about probability. Afterward, read the discussion that follows.
Questions

1 A vigorously flipped thumbtack will land on the sidewalk. Is it reasonable
for you to have a probability for the hypothesis that it will land point
up?

2 An ordinary coin is to be tossed twice in the usual way. What is your
probability for the head turning up both times?
(a) 1/3, because 2 heads is one of three possibilities: 2, 1, 0 heads?
(b) 1/4, because 2 heads is one of four possibilities: HH, HT, TH, TT?

3 There are three coins in a bag: ordinary, two-headed, and two-tailed.
One is shaken out onto the table and lies head up. What should be your
probability that it’s the two-headed one:
(a) 1/2, since it can only be two-headed or normal?
(b) 2/3, because the other side could be the tail of the normal coin, or either
side of the two-headed one? (Suppose the sides have microscopic labels.)
4 It’s a goy!¹
(a) As you know, about 49% of recorded human births have
been girls. What is your judgmental probability that the first child born after
time t (say, t = the beginning of the 22nd century, GMT) will be a girl?
(b) A goy is defined as a girl born before t or a boy born thereafter. As you
know, about 49% of recorded human births have been goys. What is your
judgmental probability that the first child born in the 22nd century will be
a goy?
Discussion
1 Surely it is reasonable to suspect that the geometry of the tack gives
one of the outcomes a better chance of happening than the other; but if you
have no clue about which of the two has the better chance, it may well be
reasonable to have judgmental probability 1/2 for each. Evidence about the
chances might be given by statistics on tosses of similar tacks, e.g., if you
learned that in 20 tosses there were 6 up’s you might take the chance of up to
be in the neighborhood of 30%; and whether or not you do that, you might
well adopt 30% as your judgmental probability for up on the next toss.
2, 3 These questions are meant to undermine the impression that judgmental
probabilities can be based on analysis into cases in a way that does
not already involve probabilistic judgment (e.g., the judgment that the cases
are equiprobable). In either problem you can arrive at a judgmental probability
by trying the experiment (or a similar one) often enough, and seeing
the statistics settle down close enough to 1/2 or to 1/3 to persuade you that
more trials will not reverse the indications. In each of these problems it is the
finer of the two suggested analyses that happens to make more sense; but
any analysis can be refined in significantly different ways, and there is no
point at which the process of refinement has to stop. (Head or tail can be
refined to head-facing-north or head-not-facing-north or tail.) Indeed some
of these analyses seem more natural or relevant than others, but that reflects
the probability judgments we bring with us to the analyses.
4 Goys and birls. This question is meant to undermine the impression
that judgmental probabilities can be based on frequencies in a way that
does not already involve judgmental probabilities. Since all girls born so far
have been goys, the current statistics for girls apply to goys as well: these
days, about 49% of human births are goys. Then if you read probabilities off
statistics in a straightforward way your probability will be 49% for each of
these hypotheses:

(1) The first child born after t will be a girl.
(2) The first child born after t will be a goy.

Thus pr(1) + pr(2) = 98%. But it is clear that those probabilities should sum
to 100%, since (2) is logically equivalent to

(3) The first child born after t will be a boy,

and pr(1) + pr(3) = 100%. To avoid this contradiction you must decide which
statistics are relevant to pr(1): the 49% of girls born before t, or the 51%
of boys. And that is not a matter of statistics but of judgment—no less so
because we would all make the same judgment: the 51% of boys.

¹ This is a fanciful adaptation of Nelson Goodman’s (1983, pp. 73-4) “grue” paradox.

[Diagram: Girl and Boy columns crossed with rows “Born before t?” Yes/No;
shaded cells (49%) are goys, blank cells (51%) are birls.]
1.2 Why Probabilities are Additive
Authentic tickets of the Mars sort are hard to come by. Is the ﬁrst of them
really worth $10 to me if there was life on Mars? Probably not. If the truth is
not known in my lifetime, I cannot cash the ticket even if it is really a winner.
But some probabilities are plausibly represented by prices, e.g., probabilities
of the hypotheses about athletic contests and lotteries that people commonly
bet on. And it is plausible to think that the general laws of probability ought
to be the same for all hypotheses—about planets no less than about ball
games. If that is so, we can justify laws of probability if we can prove all
betting policies that violate them to be inconsistent. Such justiﬁcations are
called “Dutch book arguments”.² We shall give a Dutch book argument for
the requirement that probabilities be additive in this sense:
² In British racing jargon a book is the set of bets a bookmaker has accepted, and a
book against someone—a “Dutch” book—is one the bookmaker will suffer a net loss on no
matter how the race turns out. I follow Brian Skyrms in seeing F. P. Ramsey as holding
Dutch book arguments to demonstrate actual inconsistency. See Ramsey’s ‘Truth and
Probability’ in his Philosophical Papers, D. H. Mellor, ed.: Cambridge, 1990.
Finite Additivity. The probability of a hypothesis that can be
true in a ﬁnite number of incompatible ways is the sum of the
probabilities of those ways.
Example 1, Finite additivity. The probability p of the hypothesis

(H) A sage will be elected

is q + r + s if exactly three of the candidates are sages and their probabilities
of winning are q, r, and s. In the following diagram, A, B, C, D, E, . . .
are the hypotheses that the various different candidates win—the first three
being the sages. The logical situation is diagrammed as follows, where the
points in the big rectangle represent all the ways the election might come
out, specified in minute detail, and the small rectangles represent the ways
in which the winner might prove to be A, or B, or C, or D, etc.

[Diagram: a large rectangle divided into small rectangles A, B, C, D, . . . ,
with probabilities q, r, s, t, . . . , respectively.]
1.2.1 Dutch Book Argument for Finite Additivity.
For deﬁniteness we suppose that the hypothesis in question is true in three
cases, as in example 1; the argument diﬀers inessentially for other examples,
with other ﬁnite numbers of cases. Now consider the following four tickets.
Worth $1 if H is true. Price $p
Worth $1 if A is true. Price $q
Worth $1 if B is true. Price $r
Worth $1 if C is true. Price $s
Suppose I am willing to buy or sell any or all of these tickets at the stated
prices. Why should p be the sum q + r + s? Because no matter what it is
worth—$1 or $0—the ticket on H is worth exactly as much as the tickets
on A, B, C together. (If H loses it is because A, B, C all lose; if H wins it
is because exactly one of A, B, C wins.) Then if the price of the H ticket is
different from the sum of the prices of the other three, I am inconsistently
placing different values on one and the same contract, depending on how it
is presented.
If I am inconsistent in that way, I can be fleeced by anyone who will ask me
to buy the H ticket and sell or buy the other three depending on whether p is
more or less than q + r + s. Thus, no matter whether the equation p = q + r + s
fails because the left-hand side is more or less than the right, a book can be
made against me. That is the Dutch book argument for additivity when the
number of ultimate cases under consideration is finite. The talk about being
fleeced is just a way of dramatizing the inconsistency of any policy in which
the dollar value of the ticket on H is anything but the sum of the values of the
other three tickets: to place a different value on the three tickets on A, B, C
from the value you place on the H ticket is to place different values on the
same commodity bundle under two demonstrably equivalent descriptions.
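The fleecing can be checked by brute force. The following Python sketch (names and the particular prices are mine, for illustration) buys whichever side is underpriced and sells the other, and shows the profit is the same no matter which candidate wins:

```python
def dutch_book_profit(p, q, r, s):
    """Profit from trading against inconsistent prices: the H ticket at p
    versus the A, B, C tickets at q, r, s. H is true exactly when one of
    A, B, C is true, so the payoffs always cancel and only the price
    difference p - (q + r + s) remains, pocketed by the other party."""
    profits = []
    for winner in ("A", "B", "C", "D"):      # D stands in for any non-sage
        h_pays = 1 if winner in ("A", "B", "C") else 0
        abc_pay = h_pays  # exactly one of A, B, C pays $1 iff H is true
        if p < q + r + s:
            # Buy H cheap, sell A, B, C dear.
            profit = (h_pays - p) + (q + r + s) - abc_pay
        else:
            # Sell H dear, buy A, B, C cheap.
            profit = (p - h_pays) + abc_pay - (q + r + s)
        profits.append(round(profit, 10))
    return profits

# p = .5 but q + r + s = .4: a guaranteed $0.10 in every outcome.
print(dutch_book_profit(0.5, 0.1, 0.2, 0.1))  # [0.1, 0.1, 0.1, 0.1]
```

When p = q + r + s the profit is zero in every outcome, so no book can be made.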
When the number of cases is infinite, a Dutch book argument for additivity
can still be given—provided the infinite number is not too big! It turns out
that not all infinite sets are the same size:

Example 2, Cantor’s Diagonal Argument. The sets of positive integers
(I⁺’s) cannot be counted off as first, second, . . ., with each such set
appearing as n’th in the list for some finite positive integer n. This was proved
by Georg Cantor (1895) as follows. Any set of I⁺’s can be represented by an
endless string of plusses and minuses (“signs”), e.g., the set of even I⁺’s by the
string −+−+. . . in which plusses appear at the even numbered positions
and minuses at the odd, the set {2, 3, 5, 7, . . .} of prime numbers by an endless
string that begins −++−+−+, the set of all the I⁺’s by an endless string of
plusses, and the set of no I⁺’s by an endless string of minuses. Cantor proved
that no list of endless strings of signs can be complete. He used an amazingly
simple method (“diagonalization”) which, applied to any such list, yields an
endless string d̄ of signs which is not in that list. Here’s how. For definiteness,
suppose the first four strings in the list are the examples already given, so
that the list has the general shape

s₁: −+−+. . .
s₂: −++−. . .
s₃: ++++. . .
s₄: −−−−. . .
etc.
Define the diagonal of that list as the string d consisting of the first sign in
s₁, the second sign in s₂, and, in general, the n’th sign in sₙ:

d: −++−. . .

And define the antidiagonal d̄ of that list as the result of reversing all the
signs in the diagonal:

d̄: +−−+. . .
In general, for any list s₁, s₂, s₃, s₄ . . ., d̄ cannot be any member sₙ of the
list, for, by definition, the n’th sign in d̄ is different from the n’th sign of sₙ,
whereas if d̄ were some sₙ, those two strings would have to agree, sign by
sign. Then the set of I⁺’s defined by the antidiagonal of a list cannot be in
that list, and therefore no list of sets of I⁺’s can be complete.
Countability. A countable set is defined as one whose members
(if any) can be arranged in a single list, in which each member
appears as the n’th item for some finite n.

Of course any finite set is countable in this sense, and some infinite sets
are countable. An obvious example of a countably infinite set is the set
I⁺ = {1, 2, 3, . . .} of all positive whole numbers. A less obvious example is the set I
of all the whole numbers, positive, negative, or zero: {. . . , −2, −1, 0, 1, 2, . . .}.
The members of this set can be rearranged in a single list of the sort required
in the definition of countability:

0, 1, −1, 2, −2, 3, −3, . . .

So the set of all the whole numbers is countable. Order does not matter, as
long as every member of I shows up somewhere in the list.
Example 3, Countable additivity. In example 1, suppose there were
an endless list of candidates, including no end of sages. If H says that a sage
wins, and A₁, A₂, . . . identify the winner as the first, second, . . . sage, then an
extension of the law of finite additivity to countably infinite sets would be
this:

Countable Additivity. The probability of a hypothesis H that
can be true in a countable number of incompatible ways A₁, A₂, . . .
is the sum pr(H) = pr(A₁) + pr(A₂) + . . . of the probabilities of
those ways.
This equation would be satisfied if the probability of one or another sage’s
winning were pr(H) = 1/2, and the probabilities of the first, second, third,
etc. sage’s winning were 1/4, 1/8, 1/16, etc., decreasing by half each time.
1.2.2 Dutch book argument for countable additivity.

Consider the following infinite array of tickets, where the mutually incompatible
A’s collectively exhaust the ways in which H can be true (as in example
3).³

Pay the bearer $1 if H is true.  Price $pr(H)
Pay the bearer $1 if A₁ is true. Price $pr(A₁)
Pay the bearer $1 if A₂ is true. Price $pr(A₂)
. . .

Why should my price for the first ticket be the sum of my prices for the
others? Because no matter what it is worth—$1 or $0—the first ticket is
worth exactly as much as all the others together. (If H loses it is because the
others all lose; if H wins it is because exactly one of the others wins.) Then
if the first price is different from the sum of the others, I am inconsistently
placing different values on one and the same contract, depending on how it
is presented.
Failure of additivity in these cases implies inconsistency of valuations: a
judgment that certain transactions are at once (1) reasonable and (2) sure to
result in an overall loss. Consistency requires additivity to hold for countable
sets of alternatives, finite or infinite.
1.3 Probability Logic

The simplest laws of probability are the consequences of finite additivity
under this additional assumption:
³ No matter that there is not enough paper in the universe for an infinity of tickets. One
small ticket can save the rain forest by doing the work of all the A tickets together. This
eco-ticket will say: ‘For each positive whole number n, pay the bearer $1 if Aₙ is true.’
Probabilities are real numbers in the range from 0 to 1, with
the endpoints reserved for certainty of falsehood and of truth,
respectively.

This makes it possible to read probability laws off diagrams, much as we read
ordinary logical laws off them. Let’s see how that works for the ordinary ones,
beginning with two surprising examples (where “iff” means if and only if):
De Morgan’s Laws
(1) (G∧H) = G∨H (“Not both true iﬀ at least one false”)
(2) (G∨ H) = G∧ H (“Not even one true iﬀ both false”)
Here the bar, the wedge and juxtaposition stand for not, or and and. Thus, if G and H are two hypotheses,

G ∧ H (or GH) says that they are both true: G and H
G ∨ H says that at least one is true: G or H
Ḡ (or −G or ¬G) says that G is false: not G

In the following diagrams for De Morgan's laws the upper and lower rows represent G and Ḡ and the left- and right-hand columns represent H and H̄. Now if R and S are any regions, R ∧ S (or 'RS') is their intersection, R ∨ S is their union, and R̄ is the whole big rectangle except for R.
Diagrams for De Morgan's laws (1) and (2):
(1) Shaded: ¬(G ∧ H) = Ḡ ∨ H̄
(2) Shaded: ¬(G ∨ H) = Ḡ ∧ H̄
Adapting such geometrical representations to probabilistic reasoning is just a matter of thinking of the probability of a hypothesis as its region's area, assuming that the whole rectangle, H ∨ H̄ (= G ∨ Ḡ), has area 1. Of course the empty region, H ∧ H̄ (= G ∧ Ḡ), has area 0. It is useful to denote those two extreme regions in ways independent of any particular hypotheses H, G. Let's call them ⊤ and ⊥:
Logical Truth. ⊤ = H ∨ H̄ = G ∨ Ḡ
Logical Falsehood. ⊥ = H ∧ H̄ = G ∧ Ḡ
We can now verify some further probability laws informally, in terms of
areas of diagrams.
Not: pr(H̄) = 1 − pr(H)
Verification. The nonoverlapping regions H and H̄ exhaust the whole rectangle, which has area 1. Then pr(H) + pr(H̄) = 1, so pr(H̄) = 1 − pr(H).
Or: pr(G ∨ H) = pr(G) + pr(H) − pr(G ∧ H)
Verification. The G ∨ H area is the G area plus the H area, except that when you simply add pr(G) + pr(H) you count the G ∧ H part twice. So subtract it on the right-hand side.
The word 'but'—a synonym for 'and'—may be used when the conjunction may be seen as a contrast, as in 'it's green but not healthy', G ∧ H̄:

But Not: pr(G ∧ H̄) = pr(G) − pr(G ∧ H)
Verification. The G ∧ H̄ region is what remains of the G region after the G ∧ H part is deleted.
Dyadic Analysis: pr(G) = pr(G ∧ H) + pr(G ∧ H̄)
Verification. See the diagram for De Morgan (1). The G region is the union of the nonoverlapping G ∧ H and G ∧ H̄ regions.
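Each of these rules can be checked numerically by treating the four cells of the G/H rectangle as areas summing to 1. This sketch is my own illustration, not part of the text; any nonnegative cell areas would do.

```python
import random

random.seed(0)

# Random areas for the cells G∧H, G∧¬H, ¬G∧H, ¬G∧¬H, normalized to total 1.
raw = [random.random() for _ in range(4)]
gh, g_nh, ng_h, ng_nh = (x / sum(raw) for x in raw)

pr_g, pr_h = gh + g_nh, gh + ng_h
pr_g_or_h = gh + g_nh + ng_h

eps = 1e-12
assert abs((1 - pr_h) - (g_nh + ng_nh)) < eps          # Not
assert abs(pr_g_or_h - (pr_g + pr_h - gh)) < eps       # Or
assert abs((pr_g - gh) - g_nh) < eps                   # But Not
assert abs(pr_g - (gh + g_nh)) < eps                   # Dyadic Analysis
print("Not, Or, But Not and Dyadic Analysis all check out")
```

Because the rules are identities of areas, they hold for every normalization of the four cells, not just this random one.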
In general, there is a rule of n-adic analysis for each n, e.g., for n = 3:

Triadic Analysis: If H₁, H₂, H₃ partition ⊤,⁴ then
pr(G) = pr(G ∧ H₁) + pr(G ∧ H₂) + pr(G ∧ H₃).
⁴ This means that, as a matter of logic, the H's are mutually exclusive (H₁ ∧ H₂ = H₁ ∧ H₃ = H₂ ∧ H₃ = ⊥) and collectively exhaustive (H₁ ∨ H₂ ∨ H₃ = ⊤). The equation also holds if the H's merely pr-partition ⊤ in the sense that pr(Hᵢ ∧ Hⱼ) = 0 whenever i ≠ j and pr(H₁ ∨ H₂ ∨ H₃) = 1.
[Diagram: a rectangle divided into columns H₁, H₂, H₃ and rows G, Ḡ, with the G-row cells labeled GH₁, GH₂, GH₃.]
The next rule follows immediately from the fact that logically equivalent
hypotheses are always represented by the same region of the diagram—in
view of which we use the sign ‘=’ of identity to indicate logical equivalence.
Equivalence: If H = G, then pr(H) = pr(G).
(Logically equivalent hypotheses are equiprobable.)
Finally: To be implied by G, the hypothesis H must be true in every case in which G is true. Diagrammatically, this means that the G region is entirely included in the H region. Then if G implies H, the G region can have no larger an area than the H region.
Implication: If G implies H, then pr(G) ≤ pr(H).
1.4 Conditional Probability
We identified your ordinary (unconditional) probability for H as the price representing your valuation of the following ticket:

Worth $1 if H is true. Price: $pr(H)

Now we identify your conditional probability for H given D as the price representing your valuation of this ticket:

(1) Worth $1 if D ∧ H is true; worth $pr(H|D) if D is false. Price: $pr(H|D)
The old ticket represented a simple bet on H; the new one represents a conditional bet on H—a bet that is called off (the price of the ticket is refunded) in case the condition D fails. If D and H are both true, the bet is on and you win: the ticket is worth $1. If D is true but H is false, the bet is on and you lose: the ticket is worthless. And if D is false, the bet is off: you get your $pr(H|D) back as a refund.
With that understanding we can construct a Dutch book argument for the following rule, which connects conditional and unconditional probabilities:

Product Rule: pr(H ∧ D) = pr(H|D)pr(D)
Dutch Book Argument for the Product Rule.⁵ Imagine that you own three tickets, which you can sell at prices representing your valuations. The first is ticket (1) above. The second and third are the following two, which represent unconditional bets of $1 on H ∧ D and of $pr(H|D) against D:

(2) Worth $1 if H ∧ D is true. Price: $pr(H ∧ D)
(3) Worth $pr(H|D) if D is false. Price: $pr(H|D)pr(D̄)

Bet (3) has a peculiar payoff: not a whole dollar, but only $pr(H|D). That is why its price is not the full $pr(D̄) but only the fraction pr(D̄) of the $pr(H|D) that you stand to win. This payoff was chosen to equal the price of the first ticket, so that the three fit together into a neat book:
Observe that in every possible case regarding truth and falsity of H and D the tickets (2) and (3) together have the same dollar value as ticket (1). (You can verify that claim with pencil and paper.) Then there is nothing to choose between ticket (1) and tickets (2) and (3) together, and therefore it would be inconsistent to place different values on them. Thus, your price for (1) ought to equal the sum of your prices for (2) and (3):

pr(H|D) = pr(H ∧ D) + pr(D̄)pr(H|D)

Now set pr(D̄) = 1 − pr(D), multiply through, cancel pr(H|D) from both sides and solve for pr(H ∧ D). The result is the product rule. To violate that rule is to place different values on the same commodity bundle in different guises: (1), or the package (2, 3).
The product rule is more familiar in a form where it is solved for the conditional probability pr(H|D):
⁵ de Finetti (1937, 1980).
Quotient Rule: pr(H|D) = pr(H ∧ D)/pr(D), provided pr(D) > 0.
Graphically, the quotient rule expresses pr(H|D) as the fraction of the D region that lies inside the H region. It is as if calculating pr(H|D) were a matter of trimming the whole D ∨ D̄ rectangle down to the D part, and using that as the new unit of area.
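Read as areas, the quotient rule is easy to verify with exact arithmetic. The sketch below is my own illustration; the four cell areas are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical areas for the four cells D∧H, D∧¬H, ¬D∧H, ¬D∧¬H.
dh, d_nh, nd_h, nd_nh = F(12, 100), F(28, 100), F(18, 100), F(42, 100)
assert dh + d_nh + nd_h + nd_nh == 1

pr_d = dh + d_nh                 # trim the rectangle down to the D part...
pr_h_given_d = dh / pr_d         # ...and use it as the new unit of area
print(pr_h_given_d)              # 3/10

# The product rule, read back the other way:
assert dh == pr_h_given_d * pr_d
```

Exact fractions are used so the product-rule identity holds with equality rather than merely to floating-point tolerance.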
The quotient rule is often called the definition of conditional probability. It is not. If it were, we could never be in the position we are often in, of making a conditional judgment—say, about how a coin that may or may not be tossed will land—without attributing some particular positive value to the probability of the condition. We can judge that pr(head | tossed) = 1/2 even though

pr(head ∧ tossed)/pr(tossed) = undefined/undefined.
Nor—perhaps less importantly—would we be able to make judgments like the following, about a point (of area 0!) on the Earth's surface:

pr(in western hemisphere | on equator) = 1/2

even though

pr(in western hemisphere ∧ on equator)/pr(on equator) = 0/0.
The quotient rule merely restates the product rule; and the product rule is no
deﬁnition but an essential principle relating two distinct sorts of probability.
By applying the product rule to the terms on the right-hand sides of the analysis rules in sec. 1.3 we get the rule of

Total Probability: If the D's partition ⊤,⁶ then
pr(H) = Σᵢ pr(Dᵢ)pr(H|Dᵢ).⁷

⁶ Here the sequence of D's is finite or countably infinite.
⁷ That is, pr(H) = pr(D₁)pr(H|D₁) + pr(D₂)pr(H|D₂) + ⋯.

[Diagram: Urn 1 and Urn 2, showing their contents of black and white balls.]
Example. A ball will be drawn blindly from urn 1 or urn 2, with odds 2:1 of being drawn from urn 2. Is black or white the more probable outcome?

Solution. By the rule of total probability with H = black and Dᵢ = drawn from urn i, we have pr(H) = pr(H|D₁)pr(D₁) + pr(H|D₂)pr(D₂) = (3/4 × 1/3) + (1/2 × 2/3) = 1/4 + 1/3 = 7/12 > 1/2: black is the more probable outcome.
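The example's arithmetic can be reproduced with exact fractions. This sketch uses the conditional probabilities and the 1/3 : 2/3 priors just computed.

```python
from fractions import Fraction as F

pr_black_given = {1: F(3, 4), 2: F(1, 2)}   # pr(black | urn i)
pr_urn = {1: F(1, 3), 2: F(2, 3)}           # odds 2:1 on urn 2

# Rule of total probability:
pr_black = sum(pr_urn[i] * pr_black_given[i] for i in (1, 2))
print(pr_black)                              # 7/12
assert pr_black > F(1, 2)                    # black is the more probable outcome
```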
1.5 Why '|' Cannot be a Connective
The bar in 'pr(H|D)' is not a connective that turns pairs H, D of propositions into new, conditional propositions, H if D. Rather, it is as if we wrote the conditional probability of H given D as 'pr(H, D)': the bar is a typographical variant of the comma. Thus we use 'pr' for a function of one variable as in 'pr(D)' and 'pr(H ∧ D)', and also for the corresponding function of two variables as in 'pr(H|D)'. Of course the two are connected—by the product rule.

Then in fact we do not treat the bar as a statement-forming connective, 'if'; but why couldn't we? What would go wrong if we did? This question was answered by David Lewis in 1976, pretty much as follows.⁸ Consider the simplest special case of the rule of total probability:

pr(H) = pr(H|D)pr(D) + pr(H|D̄)pr(D̄)

Now if '|' is a connective and D and C are propositions, then D|C is a proposition too, and we are entitled to set H = D|C in the rule. Result:

(1) pr(D|C) = pr[(D|C)|D]pr(D) + pr[(D|C)|D̄]pr(D̄)

So far, so good. But remember: '|' means if. Therefore, '(D|C)|X' means If X, then if C then D. And as we ordinarily use the word 'if', this comes to the same as If X and C, then D:

(2) (D|C)|X = D|XC
⁸ For Lewis's "trivialization" result (1976), see his (1986). For subsequent developments, see Ellery Eells and Brian Skyrms (eds., 1994)—especially the papers by Alan Hájek and Ned Hall.
(Recall that the identity means the two sides represent the same region, i.e., the two sentences are logically equivalent.) Now by two applications of (2) to (1) we have

(3) pr(D|C) = pr(D|D ∧ C)pr(D) + pr(D|D̄ ∧ C)pr(D̄)

But as D ∧ C and D̄ ∧ C respectively imply and contradict D, we have pr(D|D ∧ C) = 1 and pr(D|D̄ ∧ C) = 0. Therefore, (3) reduces to

(4) pr(D|C) = pr(D)

Conclusion: If '|' were a connective ('if') satisfying (2), conditional probabilities would not depend on their conditions at all. That means that 'pr(D|C)' would be just a clumsy way of writing 'pr(D)'. And it means that pr(D|C) would come to the same thing as pr(D|C̄), and as pr(D|X) for any other statement X.

That is David Lewis's "trivialization result". In proving it, the only assumption needed about 'if' was the equivalence (2) of 'If X, then if C then D' with 'If X and C, then D'.⁹
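A related quick check: reading 'if' as the material conditional will not rescue the connective reading either, since pr(Ā ∨ B) and pr(B|A) generally differ. The joint distribution below is my own illustration, not part of the text.

```python
from fractions import Fraction as F

# A hypothetical joint distribution over the four A/B cells.
p = {("A", "B"): F(1, 8), ("A", "¬B"): F(3, 8),
     ("¬A", "B"): F(2, 8), ("¬A", "¬B"): F(2, 8)}
assert sum(p.values()) == 1

# Material conditional ¬A ∨ B versus the conditional probability pr(B|A):
pr_material = p[("¬A", "B")] + p[("¬A", "¬B")] + p[("A", "B")]
pr_b_given_a = p[("A", "B")] / (p[("A", "B")] + p[("A", "¬B")])

print(pr_material, pr_b_given_a)     # 5/8 1/4 — not equal
```

One counterexample is enough to show the identity fails in general; characterizing exactly when it holds is the exercise the text leaves to the reader.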
1.6 Bayes’s Theorem
A well-known corollary of the product rule allows us to reverse the arguments of the conditional probability function pr( | ), provided we multiply the result by the ratio of the probabilities of those arguments in the original order.

Bayes's Theorem (Probabilities). pr(H|D) = pr(D|H) × pr(H)/pr(D)

Proof. By the product rule, the right-hand side equals pr(D ∧ H)/pr(D); by the quotient rule, so does the left-hand side.

For many purposes Bayes's theorem is more usefully applied to odds than to probabilities. In particular, suppose there are two hypotheses, H and G, to which observational data D are relevant. If we apply Bayes's theorem for probabilities to the conditional odds between H and G, pr(H|D)/pr(G|D), the unconditional probability of D cancels out:

⁹ Note that the result does not depend on assuming that 'if' means 'or not'; no such fancy argument is needed in order to show that pr(Ā ∨ B) = pr(B|A) only under very special conditions. (Prove it!)

Bayes's Theorem (Odds).
pr(H|D)/pr(G|D) = (pr(H)/pr(G)) × (pr(D|H)/pr(D|G))
Terminological note. The second factor on the right-hand side of the odds form of Bayes's theorem is the "likelihood ratio." In these terms the odds form says:

Conditional odds = prior odds × likelihood ratio
Bayes's theorem is often stated in a form attuned to cases in which you have clear probabilities pr(H₁), pr(H₂), . . . for mutually incompatible, collectively exhaustive hypotheses H₁, H₂, . . ., and have clear conditional probabilities pr(D|H₁), pr(D|H₂), . . . for data D on each of them. For a countable collection of such hypotheses we have an expression for the probability, given D, of any one of them, say Hᵢ:

Bayes's Theorem (Total Probabilities):¹⁰
pr(Hᵢ|D) = pr(Hᵢ)pr(D|Hᵢ) / Σⱼ pr(Hⱼ)pr(D|Hⱼ)
Example. In the urn example (1.4), suppose a black ball is drawn. Was it more probably drawn from urn 1 or urn 2? Let D = a black ball is drawn, Hᵢ = it came from urn i, and pr(Hᵢ) = 1/2. Here are two ways to go:

(1) Bayes's theorem for total probabilities:
pr(H₁|D) = (1/2 × 3/4) / (1/2 × 3/4 + 1/2 × 1/2) = 3/5

(2) Bayes's theorem for odds:
pr(H₁|D)/pr(H₂|D) = (1/2)/(1/2) × (3/4)/(2/4) = 3/2

These come to the same thing: probability 3/5 = odds 3:2. Urn 1 is the more probable source.
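Both routes of the example can be retraced with exact fractions; this sketch uses the example's numbers.

```python
from fractions import Fraction as F

prior = {1: F(1, 2), 2: F(1, 2)}      # pr(H_i)
like = {1: F(3, 4), 2: F(1, 2)}       # pr(D | H_i): black from urn i

# (1) Total-probabilities form:
denom = sum(prior[j] * like[j] for j in (1, 2))
post_1 = prior[1] * like[1] / denom
print(post_1)                          # 3/5

# (2) Odds form: prior odds × likelihood ratio.
odds = (prior[1] / prior[2]) * (like[1] / like[2])
print(odds)                            # 3/2

assert post_1 / (1 - post_1) == odds   # probability 3/5 = odds 3:2
```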
1.7 Independence

Definitions
• H₁ ∧ . . . ∧ Hₙ is a conjunction. The H's are conjuncts.
• H₁ ∨ . . . ∨ Hₙ is a disjunction. The H's are disjuncts.

¹⁰ The denominator = pr(H₁)pr(D|H₁) + pr(H₂)pr(D|H₂) + ⋯.
• For you, hypotheses are:
independent iff your probability for the conjunction of any two or more is the product of your probabilities for the conjuncts;
conditionally independent given G iff your conditional probability given G for the conjunction of any two or more is the product of your conditional probabilities given G for the conjuncts; and
equiprobable (or conditionally equiprobable given G) iff they have the same probability for you (or the same conditional probability, given G).
Example 1, Rolling a fair die. Hᵢ means that the 1 ("ace") turns up on roll number i. These Hᵢ are both independent and equiprobable for you: your pr for the conjunction of any distinct n of them will be 1/6ⁿ.
Example 2, Coin & die. pr(head) = 1/2, pr(1) = 1/6, pr(head ∧ 1) = 1/12; outcomes of a toss and a roll are independent but not equiprobable.
Example 3, Urn of known composition. You know it contains N balls, of which b are black, and that after a ball is drawn it is replaced and the contents of the urn mixed. H₁, H₂, . . . mean: the first, second, . . . balls drawn will be black. Here, if you think nothing fishy is going on, you will regard the H's as equiprobable and independent: pr(Hᵢ) = b/N, pr(HᵢHⱼ) = b²/N² if i ≠ j, pr(HᵢHⱼHₖ) = b³/N³ if i ≠ j ≠ k ≠ i, and so on.
It turns out that three propositions H₁, H₂, H₃ can be independent in pairs but fail to be independent because pr(H₁ ∧ H₂ ∧ H₃) ≠ pr(H₁)pr(H₂)pr(H₃).
Example 4, Two normal tosses of a normal coin. If we define Hᵢ as 'head on toss #i' and D as 'different results on the two tosses' we find that H₁, H₂, D are independent in pairs: pr(H₁ ∧ H₂) = pr(H₁)pr(H₂) = 1/4, pr(H₁ ∧ D) = pr(H₁)pr(D) = 1/4, and pr(H₂ ∧ D) = pr(H₂)pr(D) = 1/4. But pr(H₁ ∧ H₂ ∧ D) = 0.
Here is a useful fact about independence:

(1) If n propositions are independent, so are those obtained by denying some or all of them.

To illustrate (1), think about the case n = 3. Suppose H₁, H₂, H₃ are independent. Writing hᵢ = pr(Hᵢ), this means that the following four equations hold:

(a) pr(H₁ ∧ H₂ ∧ H₃) = h₁h₂h₃,
(b) pr(H₁ ∧ H₂) = h₁h₂, (c) pr(H₁ ∧ H₃) = h₁h₃, (d) pr(H₂ ∧ H₃) = h₂h₃
Here, H₁, H₂, H̄₃ are also independent. Writing h̄ᵢ = pr(H̄ᵢ), this means:

(e) pr(H₁ ∧ H₂ ∧ H̄₃) = h₁h₂h̄₃,
(f) pr(H₁ ∧ H₂) = h₁h₂, (g) pr(H₁ ∧ H̄₃) = h₁h̄₃, (h) pr(H₂ ∧ H̄₃) = h₂h̄₃
Equations (e)–(h) follow from (a)–(d) by the rule for 'but not'.

Example 5, (h) follows from (d). pr(H₂ ∧ H̄₃) = h₂h̄₃ since, by 'but not', the left-hand side = pr(H₂) − pr(H₂ ∧ H₃), which, by (d), = h₂ − h₂h₃, and this = h₂(1 − h₃) = h₂h̄₃.
A second useful fact follows immediately from the quotient rule:

(2) If pr(H₁) > 0, then H₁ and H₂ are independent iff pr(H₂|H₁) = pr(H₂).
1.8 Objective Chance
It is natural to think there is such a thing as real or objective probability
(“chance”, for short) in contrast to merely judgmental probability.
Example 1, The Urn. An urn contains 100 balls, of which an unknown number n are of the winning color—green, say. You regard the drawing as honest in the usual way. Then you think the chance of winning is n/100. If all 101 possible values of n are equiprobable in your judgment, then your prior probability for winning on any trial is 50%, and as you learn the results of more and more trials, your posterior odds between any two possible compositions of the urn will change from 1:1 to various other values, even though you are sure that the composition itself does not change.
“The chance of winning is 30%.” What does that mean? In the urn
example, it means that n = 30. How can we ﬁnd out whether it is true or
false? In the urn example we just count the green balls and divide by the
total number. But in other cases—die rolling, horse racing, etc.—there may
be no process that does the “Count the green ones” job. In general, there
are puzzling questions about the hypothesis that the chance of H is p that
do not arise regarding the hypothesis H itself.
David Hume’s skeptical answer to those questions says that
chances are simply projections of robust features of judgmental probabili
ties from our minds out into the world, whence we hear them clamoring to
be let back in. That is how our knowledge that the chance of H is p guar
antees that our judgmental probability for H is p : the guarantee is really a
presupposition. As Hume sees it, the argument
(1) pr(the chance of H is p) = 1, so pr(H) = p
is valid because our conviction that the chance of H is p is just a firmly felt commitment to p as our continuing judgmental probability for H.
What if we are not sure what the chance of H is, but think it may be p? Here, the relevant principle ("Homecoming") specifies the probability of H given that its chance is p—except in cases where we are antecedently sure that the chance is not p because, for some chunk of the interval from 0 to 1 with p in its interior, pr(the chance of H is inside the chunk) = 0.

Homecoming. pr(H | chance of H is p) = p, unless p is excluded as a possible value, being in the interior of an interval we are sure does not contain the chance of H:

0 ----(---- p ----)---- 1
Note that when pr(the chance of H is p) = 1, homecoming validates argument (1). The name 'Homecoming' is loaded with philosophical baggage. 'Decisiveness' would be a less tendentious name, acceptable to those who see chances as objective features of the world, independent of what we may think. The condition 'the chance of H is p' is decisive in that it overrides any other evidence represented in the probability function pr. But a decisive condition need not override other conditions conjoined with it to the right of the bar. In particular, it will not override the hypothesis that H is true, or that H is false. Thus, since pr(H|H ∧ C) = 1 when C is any condition consistent with H, we have

(2) pr(H|H ∧ the chance of H is .3) = 1, not .3,

and that is no violation of decisiveness.
On the Humean view it is ordinary conditions, making no use of the word 'chance', that appear to the right of the bar in applications of the homecoming principle, e.g., conditions specifying the composition of an urn. Here, you are sure that if you knew n, your judgmental probability for the next ball's being green would be n%:

(3) pr(Green next | n of the hundred are green) = n%
Then in example 1 you take the chance of green next to be a magnitude,

ch(green next) = (number of green balls in the urn)/(total number of balls in the urn),

which you can determine empirically by counting. It is the fact that for you the ratio of green ones to all balls in the urn satisfies the decisiveness condition that identifies n% as the chance of green next, in your judgment.
On the other hand, the following two examples do not respond to that
treatment.
Example 2, The proportion of greens drawn so far. This won't do as your ch(green next) because it lacks the robustness property: on the second draw it can easily change from 0 to 1 or from 1 to 0, and until the first draw it gives the chance of green next the unrevealing form of 0/0.
Example 3, The Loaded Die. Suppose H predicts ace on the next toss. Perhaps you are sure that if you understood the physics better, knowledge of the mass distribution in the die would determine for you a definite robust judgmental probability of ace next: you think that if you knew the physics, then you would know of an f that makes this principle true:

(4) pr(Ace next | The mass distribution is M) = f(M)

But you don't know the physics; all you know for sure is that f(M) = 1/6 in case M is uniform.
When we are unlucky in this way—when there is no decisive physical parameter for us—there may still be some point in speaking of the chance of H next, i.e., of a yet-to-be-identified physical parameter that will be decisive for people in the future. In the homecoming condition we might read 'chance of H' as a placeholder for some presently unavailable description of a presently unidentified physical parameter. There is no harm in that—as long as we don't think we are already there.
In examples 1 and 2 it is clear to us what the crucial physical parameter, X, can be, and we can specify the function, f, that maps X into the chance: X can be n/100, in which case f is the identity function, f(X) = X; or X can be n, with f(X) = n/100. In the die example we are clear about the parameter X, but not about the function f. And in other cases we are also unclear about X, identifying it in terms of its salient effect ("blood poisoning" in the following example), while seeking clues about the cause.
Example 4, Blood Poisoning.

'At last, early in 1847, an accident gave Semmelweis the decisive clue for his solution of the problem. A colleague of his, Kolletschka, received a puncture wound in the finger, from the scalpel of a student with whom he was performing an autopsy, and died after an agonizing illness during which he displayed the same symptoms that Semmelweis had observed in the victims of childbed fever. Although the role of microorganisms in such infections had not yet been recognized at that time, Semmelweis realized that "cadaveric matter" which the student's scalpel had introduced into Kolletschka's blood stream had caused his colleague's fatal illness. And the similarities between the course of Kolletschka's disease and that of the women in his clinic led Semmelweis to the conclusion that his patients had died of the same kind of blood poisoning'¹¹
But what is the common character of the X’s that we had in the ﬁrst two
examples, and that Semmelweis lacked, and of the f’s that we had in the
ﬁrst, but lacked in the other two? These questions belong to pragmatics, not
semantics; they concern the place of these X’s and f’s in our processes of
judgment; and their answers belong to the probability dynamics, not statics.
The answers have to do with invariance of conditional judgmental probabili
ties as judgmental probabilities of the conditions vary. To see how that goes,
let’s reformulate (4) in general terms:
(4) Staying home. pr(H|X = a) = f(a) if a is not in the interior of an interval that surely contains no values of X.

Here we must think of pr as a variable whose domain is a set of probability assignments. These may assign different values to the condition X = a; but, for each H and each a that is not in an excluded interval, they assign the same value, f(a), to the conditional probability of H given a.
1.9 Supplements
1 (1) Find a formula for pr(H₁ ∨ H₂ ∨ H₃) in terms of probabilities of H's and their conjunctions.
(2) What happens in the general case, pr(H₁ ∨ H₂ ∨ . . . ∨ Hₙ)?
2 Exclusive 'or'. The symbol '∨' stands for 'or' in a sense that is not meant to rule out the possibility that the hypotheses flanking it are both true: H₁ ∨ H₂ = H₁ ∨̄ H₂ ∨ (H₁ ∧ H₂). Let us use the symbol ∨̄ for 'or' in an exclusive sense: H₁ ∨̄ H₂ = (H₁ ∨ H₂) ∧ ¬(H₁ ∧ H₂). Find formulas for
(1) pr(H₁ ∨̄ H₂) and
(2) pr(H₁ ∨̄ H₂ ∨̄ H₃)
in terms of probabilities of H's and their conjunctions.
(3) Does H₁ ∨̄ H₂ ∨̄ H₃ mean that exactly one of the three H's is true? (No.)
¹¹ Hempel (1966), p. 4.
What does it mean?
(4) What does H₁ ∨̄ . . . ∨̄ Hₙ mean?
3 Diagnosis.¹² The patient has a breast mass that her physician thinks is probably benign: the frequency of malignancy among women of that age, with the same symptoms, family history, and physical findings, is about 1 in 100. The physician orders a mammogram and receives the report that in the radiologist's opinion the lesion is malignant, i.e., the mammogram is positive. Based on the available statistics, the physician's probabilities for true and false positive mammogram results were as follows, and her prior probability for the patient's having cancer was 1%. What will her conditional odds on malignancy be, given the positive mammogram?

pr(row|column)   Malignant   Benign
+ mammogram        80%        10%
− mammogram        20%        90%
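For problem 3, the odds form of Bayes's theorem does the work in one line: prior odds on malignancy times the likelihood ratio of a positive report. The sketch below is my own; the helper name is mine, not the text's.

```python
from fractions import Fraction as F

def posterior_odds(prior_odds, tp_rate, fp_rate):
    """Odds form of Bayes's theorem: prior odds × likelihood ratio."""
    return prior_odds * F(tp_rate) / F(fp_rate)

# Prior odds 1:99 (1% malignant); pr(+|malignant) = 80%, pr(+|benign) = 10%.
odds = posterior_odds(F(1, 99), "0.80", "0.10")
print(odds)               # 8/99
print(odds / (1 + odds))  # 8/107, the corresponding posterior probability
```

The same helper, with different inputs, handles the taxicab problem that follows.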
4 The Taxicab Problem.¹³ "A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
"(a) 85% of the cabs in the city are Green, 15% are Blue.
"(b) A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time. What is the probability that the cab involved in the accident was Blue rather than Green [i.e., conditionally on the witness's identification]?"
5 The Device of Imaginary Results.¹⁴ This is meant to help you identify your prior odds—e.g., on the hypothesis "that a man is capable of extrasensory perception, in the form of telepathy. You may imagine an experiment performed in which the man guesses 20 digits (between 0 and 9) correctly. If you feel that this would cause the probability that the man has telepathic powers to become greater than 1/2, then the [prior odds] must be assumed to be greater than 10⁻²⁰ . . . . Similarly, if three consecutive correct guesses would leave the probability below 1/2, then the [prior odds] must be less than 10⁻³."
Verify these claims about the prior odds.
Verify these claims about the prior odds.
¹² Adapted from Eddy (1982).
¹³ Kahneman, Slovic and Tversky (1982), pp. 156-158.
¹⁴ From I. J. Good (1950), p. 35.
6 The Rare Disease.¹⁵ "You are suffering from a disease that, according to your manifest symptoms, is either A or B. For a variety of demographic reasons disease A happens to be 19 times as common as B. The two diseases are equally fatal if untreated, but it is dangerous to combine the respective appropriate treatments. Your physician orders a certain test which, through the operation of a fairly well understood causal process, always gives a unique diagnosis in such cases, and this diagnosis has been tried out on equal numbers of A and B patients and is known to be correct on 80% of those occasions. The test reports that you are suffering from disease B. Should you nevertheless opt for the treatment appropriate to A, on the supposition that the probability of your suffering from A is 19/23? Or should you opt for the treatment appropriate to B, on the supposition [. . .] that the probability of your suffering from B is 4/5? It is the former opinion that would be irrational for you. Indeed, on the other view, which is the one espoused in the literature, it would be a waste of time and money even to carry out the tests, since whatever their results, the base rates would still compel a more than 4/5 probability in favor of disease A. So the literature is propagating an analysis that could increase the number of deaths from a rare disease of this kind."

Diaconis and Freedman (1981, pp. 33-34) suggest that "the fallacy of the transposed conditional" is being committed here, i.e., confusion of the following quantities—the second of which is the true positive rate of the test for B: pr(It is B|It is diagnosed as B), pr(It is diagnosed as B|It is B). Use the odds form of Bayes's theorem to verify that if your prior odds on A are 19:1 and you take the true positive rate (for A, and for B) to be 80%, your posterior probability for A should be 19/23.
7 On the Credibility of Extraordinary Stories.¹⁶

"There are, broadly speaking, two different ways in which we may suppose testimony to be given. It may, in the first place, take the form of a reply to an alternative question, a question, that is, framed to be answered by yes or no. Here, of course, the possible answers are mutually contradictory, so that if one of them is not correct the other must be so: —Has A happened, yes or no?" . . .

"On the other hand, the testimony may take the form of a more original statement or piece of information. Instead of saying, Did A happen? we may ask, What happened? Here if the witness speaks the truth he must be supposed, as before, to have but one way of doing so; for the occurrence of some specific event was of course contemplated. But if he errs he has many ways of going wrong" . . .
¹⁵ L. J. Cohen (1981), see p. 329.
¹⁶ Adapted from pp. 409 ff. of Venn (1888, 1962).
(a) In an urn with 1000 balls, one is green and the rest are red. A ball is drawn at random and seen by no one but a slightly colorblind witness, who reports that the ball was green. What is your probability that the witness was right on this occasion, if his reliability in distinguishing red from green is .9, i.e., if pr(He says it is X|It is X) = .9 when X = Red and when X = Green?
(b) “We will now take the case in which the witness has many ways of going
wrong, instead of merely one. Suppose that the balls were all numbered,
from 1 to 1000, and the witness knows this fact. A ball is drawn, and he tells
me that it was numbered 25, what is the probability that he is right?” In
answering you are to “assume that, there being no apparent reason why he
should choose one number rather than another, he will be likely to announce
all the wrong ones equally often.”
What is now your probability that the 90% reliable witness was right?
8.1 The Three Cards. One is red on both sides, one is black on both sides,
and the other is red on one side and black on the other. One card is drawn
blindly and placed on a table. If a red side is up, what is the probability that
the other side is red too?
8.2 The Three Prisoners. Of three prisoners—Alice, Bill, and Clara—an unknown two will be shot, the other freed. Alice asks the warder for the name of one other than herself who will be shot, explaining that as there must be at least one, the warder won't really be giving anything away. The warder agrees, and says that Bill will be shot.
This cheers Alice up a little: Her judgmental probability for being shot is
now 1/2 instead of 2/3. Show (via Bayes’s theorem) that
(a) Alice is mistaken if she thinks the warder is as likely to say ‘Clara’ as
‘Bill’ when he can honestly say either; but that
(b) She is right if she thinks the warder will say ‘Bill’ when he honestly can.
8.3 Monty Hall. As a contestant on a TV game show, you are invited to
choose any one of three doors and receive as a prize whatever lies behind it—
i.e., in one case, a car, or, in the other two, a goat. When you have chosen,
the host opens one of the other two doors, behind which he knows there is a
goat, and oﬀers to let you switch your choice to the third door. Would that
be wise?
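A simulation settles 8.3 empirically: switching wins exactly when your first pick was a goat, which happens about 2/3 of the time. This sketch is my own illustration; the reported frequencies are approximate.

```python
import random

random.seed(1)

def trial(switch):
    doors = [0, 1, 2]
    car, choice = random.choice(doors), random.choice(doors)
    # The host opens a goat door that is neither the car nor your choice.
    opened = random.choice([d for d in doors if d != car and d != choice])
    if switch:
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == car

n = 100_000
win_switch = sum(trial(True) for _ in range(n)) / n
win_stay = sum(trial(False) for _ in range(n)) / n
print(win_switch, win_stay)    # close to 2/3 and 1/3: switching is wise
```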
9 Causation vs. Diagnosis.¹⁷ "Let A be the event that before the end of next year, Peter will have installed a burglar alarm in his home. Let B denote the event that Peter's home will have been burgled before the end of next year.

"Question: Which of the two conditional probabilities, pr(A|B) or pr(A|B̄), is higher?

"Question: Which of the two conditional probabilities, pr(B|A) or pr(B|Ā), is higher?

"A large majority of subjects (132 of 161) stated that pr(A|B) > pr(A|B̄) and that pr(B|A) < pr(B|Ā), contrary to the laws of probability."

Substantiate this critical remark by showing that the following is a law of probability.

pr(A|B) > pr(A|B̄) iff pr(B|A) > pr(B|Ā)

¹⁷ From p. 123 of Kahneman, Slovic and Tversky (eds., 1982).
10 Prove the following, assuming that conditions all have probability > 0.
a If A implies D then pr(A|D) = pr(A)/pr(D).
b If D implies A then pr(Ā|D̄) = pr(Ā)/pr(D̄). ("TJ's Lemma")
c pr(C|A ∨ B) is between pr(C|A) and pr(C|B) if pr(A ∧ B) = 0.
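A small finite model is a useful sanity check on problem 10 before attempting the proofs. This is a spot-check on one example of my own choosing, not a proof.

```python
from fractions import Fraction as F

points = set(range(1, 7))                    # six equiprobable points

def pr(s):
    return F(len(s), len(points))

def given(s, t):                             # pr(s | t) by counting
    return F(len(s & t), len(t))

# (a) A implies D:
A, D = {1}, {1, 2, 3}
assert given(A, D) == pr(A) / pr(D)

# (b) "TJ's Lemma": D implies A, so not-A implies not-D.
A, D = {1, 2, 3}, {1, 2}
not_a, not_d = points - A, points - D
assert given(not_a, not_d) == pr(not_a) / pr(not_d)

# (c) With A, B incompatible, pr(C|A∨B) lies between pr(C|A) and pr(C|B).
A, B, C = {1, 2}, {3, 4}, {1, 3, 4}
lo, hi = sorted([given(C, A), given(C, B)])
assert lo <= given(C, A | B) <= hi
print("a, b and c check out on this model")
```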
11 Sex Bias at Berkeley?¹⁸ In the fall of 1973, when 8442 men and 4321
women applied to graduate departments at U. C. Berkeley, about 44% of
the men were admitted, but only about 35% of the women. It looked like
sex bias against women. But when admissions were tabulated for the
separate departments—as below, for the six most popular departments,
which together accounted for over a third of all the applicants—there
seemed to be no such bias on a department-by-department basis. And the
tabulation suggested an innocent one-sentence explanation of the overall
statistics. What was it?
Hint: What do the statistics indicate about how hard the different
departments were to get into?
Department A admitted 62% of 825 male applicants, 82% of 108 females.
Department B admitted 63% of 560 male applicants, 68% of 25 females.
Department C admitted 37% of 325 male applicants, 34% of 593 females.
Department D admitted 33% of 417 male applicants, 35% of 375 females.
Department E admitted 28% of 191 male applicants, 24% of 393 females.
Department F admitted 6% of 373 male applicants, 7% of 341 females.
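The aggregate figures for these six departments can be recovered from the table. The Python sketch below is an addition to the text; the admission counts are back-computed from the rounded percentages, so the totals are approximate. It confirms that the overall rates and the department-by-department rates point in opposite directions.

```python
# dept: (male applicants, male % admitted, female applicants, female %)
depts = {
    'A': (825, 62, 108, 82), 'B': (560, 63, 25, 68),
    'C': (325, 37, 593, 34), 'D': (417, 33, 375, 35),
    'E': (191, 28, 393, 24), 'F': (373, 6, 341, 7),
}

m_apps = sum(m for m, _, _, _ in depts.values())
f_apps = sum(f for _, _, f, _ in depts.values())
m_rate = sum(m * pm for m, pm, _, _ in depts.values()) / (100 * m_apps)
f_rate = sum(f * pf for _, _, f, pf in depts.values()) / (100 * f_apps)
# Six-department totals: men ≈ 44.5%, women ≈ 30.3%.
print(f"aggregate: men {m_rate:.1%}, women {f_rate:.1%}")

# ... even though the women's rate matches or beats the men's in most departments.
favors_women = sum(pf >= pm for _, pm, _, pf in depts.values())
print(f"departments where women's rate >= men's: {favors_women} of 6")
```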
12 The Birthday Problem. Of twenty-three people selected at random,
what is the probability that at least two have the same birthday?
Hint: 365/365 × 364/365 × 363/365 × · · · (23 factors) ≈ .49
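The hint's product is quick to evaluate directly. This Python sketch is an addition to the text; it assumes 365 equally likely birthdays and ignores leap days.

```python
from math import prod

def p_shared_birthday(k, days=365):
    """Probability that at least two of k people share a birthday
    (uniform birthdays, no leap day)."""
    p_all_distinct = prod((days - i) / days for i in range(k))
    return 1 - p_all_distinct

print(f"{p_shared_birthday(23):.4f}")  # ≈ 0.5073: better than even money
```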
13 How would you explain the situation to Méré?
“M. de Méré told me he had found a fallacy in the numbers for the following
reason: With one die, the odds on throwing a six in four tries are 671:625.
With two dice, the odds are against throwing a double six in 24 tries. Yet 24
is to 36 (which is the number of pairings of the faces of two dice) as 4 is to 6
(which is the number of faces of one die). This is what made him so indignant
and made him tell everybody that the propositions were inconsistent and
that arithmetic was self-contradictory; but with your understanding of the
principles, you will easily see that what I say is right.” (Pascal to Fermat, 29
July 1654)
18. Freedman, Pisani and Purves (1978), pp. 12-15.
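One way to explain the situation to Méré is simply to compute the two chances. The short Python sketch below is an addition to the text; it shows that the proportion 24:36 :: 4:6 does not carry over to the probabilities, which straddle 1/2.

```python
# Chance of at least one six in 4 throws of one die,
# vs. at least one double six in 24 throws of two dice.
p_one_die = 1 - (5/6)**4        # ≈ 0.5177, i.e. odds 671:625 in favor
p_two_dice = 1 - (35/36)**24    # ≈ 0.4914, i.e. odds against
print(f"{p_one_die:.4f} vs {p_two_dice:.4f}")
```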
14 Independence, sec. 1.7.
(a) Complete the proof, begun in example 5, of the case n = 3 of fact (1).
(b) Prove fact (1) in the general case. Suggestion: Use mathematical induction
on the number f = 0, 1, . . . of denied H’s, where 2 ≤ n = t + f.
15 Sample Spaces, sec. 1.3. The diagrammatic method of this section is an
agreeable representation of the set-theoretical models in which propositions
are represented by sets of items called ‘points’—a.k.a. (“possible”) ‘worlds’
or ‘states’ (“of nature”). If ⊤ = Ω = the set of all such possible states in a
particular set-theoretical model then ω ∈ H ⊆ Ω means that the proposition
H is true in possible state ω. To convert such an abstract model into a sample
space it is necessary to specify the intended correspondence between actual
or possible happenings and members and subsets of Ω. Not every subset
of Ω need be counted as a proposition in the sample space, but the ones
that do count are normally assumed to form a Boolean algebra. A Boolean
“σ-algebra” is a B.A. which is closed under all countable disjunctions (and,
therefore, conjunctions)—not just the finite ones. A probability space is a
sample space with a countably additive probability assignment to the Boolean
algebra B of propositions.
1.10 References
L. J. Cohen, The Behavioral and Brain Sciences 4 (1981) 317-331.
Bruno de Finetti, ‘La Prévision...’, Annales de l’Institut Henri Poincaré 7
(1937), translated in Henry Kyburg, Jr. and Howard Smokler (eds.), 1980.
David M. Eddy, ‘Probabilistic reasoning in clinical medicine’, in Kahneman,
Slovic, and Tversky (eds., 1982).
Ellery Eells and Brian Skyrms (eds.), Probabilities and Conditionals,
Cambridge U. P., 1994.
David Freedman, Robert Pisani, and Roger Purves, Statistics, W. W.
Norton, New York, 1978.
I. J. Good, Probability and the Weighing of Evidence, London, 1950.
Nelson Goodman, Fact, Fiction and Forecast, Harvard U. P., 4th ed., 1983.
Carl G. Hempel, Foundations of Natural Science, Prentice-Hall, 1966.
Daniel Kahneman, Paul Slovic, and Amos Tversky (eds.), Judgment Under
Uncertainty, Cambridge U. P., 1982.
Henry Kyburg, Jr. and Howard Smokler, Studies in Subjective Probability,
2nd ed., Huntington, N.Y.: Robert E. Krieger, 1980.
David Lewis, ‘Probabilities of Conditionals and Conditional Probabilities’
(1976), reprinted in his Philosophical Papers, II, Oxford University Press,
1986.
Frank Ramsey, ‘Truth and Probability’, Philosophical Papers, D. H. Mellor
(ed.), Cambridge U. P., 1990.
Brian Skyrms,
John Venn, The Logic of Chance, 3rd ed., 1888 (Chelsea Pub. Co. reprint,
1962).
Chapter 2
Testing Scientiﬁc Theories
Christian Huyghens gave this account of the scientific method in the
introduction to his Treatise on Light (1690):
“. . . whereas the geometers prove their propositions by ﬁxed and
incontestable principles, here the principles are veriﬁed by the
conclusions to be drawn from them; the nature of these things not
allowing of this being done otherwise. It is always possible thereby
to attain a degree of probability which very often is scarcely less
than complete proof. To wit, when things which have been demonstrated
by the principles that have been assumed correspond perfectly to the
phenomena which experiment has brought under observation; especially
when there are a great number of them,
and further, principally, when one can imagine and foresee new
phenomena which ought to follow from the hypotheses which one
employs, and when one ﬁnds that therein the fact corresponds to
our prevision. But if all these proofs of probability are met with
in that which I propose to discuss, as it seems to me they are,
this ought to be a very strong conﬁrmation of the success of my
inquiry; and it must be ill if the facts are not pretty much as I
represent them.”
In this chapter we interpret Huyghens’s methodology and extend it to the
treatment of certain vexed methodological questions — especially, the Duhem-
Quine problem (“holism”, sec. 2.4) and the problem of old evidence (sec. 2.5).
2.1 Quantiﬁcations of Conﬁrmation
The thought is that you see an episode of observation, experiment or
reasoning as confirming or infirming a hypothesis to a degree that depends
on whether your probability for it increases or decreases during the episode,
i.e., depending on whether your posterior probability, new(H), is greater or
less than your prior probability, old(H). Your
Probability Increment, new(H) − old(H),
is one measure of that change—positive for confirmation, negative for
infirmation. Other measures are the probability factor and the odds factor,
in both of which the turning point between confirmation and infirmation is
1 instead of 0. These are the factors by which old probabilities or odds are
multiplied to get the new probabilities or odds.
Probability Factor: π(H) =df new(H)/old(H)
By your odds on one hypothesis against another—say, on G against an
alternative H—is meant the ratio pr(G)/pr(H) of your probability of G to
your probability of H; by your odds simply “on G” is meant your odds on G
against Ḡ:
Odds on G against H = pr(G)/pr(H)
Odds on G = pr(G)/pr(Ḡ)
Your “odds factor”—or, better, “Bayes factor”—is the factor by which your
old odds can be multiplied to get your new odds:
Bayes factor (odds factor):
β(G : H) = [new(G)/new(H)] / [old(G)/old(H)] = π(G)/π(H)
Where updating old → new is by conditioning on D, your Bayes factor =
your old likelihood ratio, old(D|G) : old(D|H). Thus,
β(G : H) = old(D|G)/old(D|H) when new( ) = old( |D)
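These identities can be verified numerically. In the Python sketch below (an addition to the text; the priors and likelihoods are invented for illustration), updating by conditioning on D multiplies the old odds on G against H by the likelihood ratio, and the Bayes factor equals the ratio of probability factors.

```python
# Invented example: three exhaustive hypotheses with priors and
# likelihoods old(D | .); update by conditioning on D.
old = {'G': 0.2, 'H': 0.5, 'other': 0.3}
lik = {'G': 0.9, 'H': 0.3, 'other': 0.5}

p_D = sum(old[h] * lik[h] for h in old)            # total probability of D
new = {h: old[h] * lik[h] / p_D for h in old}      # posterior probabilities

beta = (new['G'] / new['H']) / (old['G'] / old['H'])
assert abs(beta - lik['G'] / lik['H']) < 1e-12     # Bayes factor = likelihood ratio

pi = {h: new[h] / old[h] for h in old}             # probability factors
assert abs(beta - pi['G'] / pi['H']) < 1e-12       # beta = pi(G)/pi(H)
print(f"beta(G:H) = {beta:.1f}")
```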
A useful variant of the Bayes factor is its logarithm, dubbed by I. J. Good¹
the “weight of evidence”:
w(G : H) =df log β(G : H)
As odds vary from 0 to ∞, their logarithms vary from −∞ to +∞. High
probabilities (say, 100/101 and 1000/1001) correspond to widely spaced odds
(100:1 and 1000:1); but low probabilities (say, 1/100 and 1/1000) correspond
to odds (1:99 and 1:999) that are roughly as cramped as the probabilities.
Going from odds to log odds moderates the spread at the high end, and treats
high and low symmetrically. Thus, high odds 100:1 and 1000:1 become log
odds 2 and 3, and low odds 1:100 and 1:1000 become log odds −2 and −3.
A. M. Turing was a playful, enthusiastic advocate of log odds, as his
wartime codebreaking colleague I. J. Good reports:
“Turing suggested . . . that it would be convenient to take over
from acoustics and electrical engineering the notation of bels and
decibels (db). In acoustics, for example, the bel is the logarithm
to base 10 of the ratio of two intensities of sound. Similarly, if
f [our β] is the factor in favour of a hypothesis, i.e., the ratio
of its final to its initial odds, then we say that the hypothesis
has gained log₁₀ f bels or 10 log₁₀ f db. This may be described
as the weight of evidence . . . and 10 log₁₀ o db may be called the
plausibility corresponding to odds o. Thus . . . Plausibility gained
= weight of evidence”²
“The weight of evidence can be added to the initial log-odds
of the hypothesis to obtain the final log-odds. If the base of
the logarithms is 10, the unit of weight of evidence was called
a ban by Turing (1941) who also called one tenth of a ban a
deciban (abbreviated to db). I hope that one day judges,
detectives, doctors and other earth-ones will routinely weigh
evidence in terms of decibans because I believe the deciban is an
intelligence-amplifier.”³
“The reason for the name ban was that tens of thousands of
sheets were printed in the town of Banbury on which weights of
1. I. J. Good, Probability and the Weighing of Evidence, London, 1950, chapter 6.
2. I. J. Good, op. cit., p. 63.
3. I. J. Good, Good Thinking (Minneapolis, 1983), p. 132. Good means ‘intelligence’ in
the sense of ‘information’: Think MI (military intelligence), not IQ.
evidence were entered in decibans for carrying out an important
classiﬁed process called Banburismus. A deciban or halfdeciban
is about the smallest change in weight of evidence that is directly
perceptible to human intuition. . . . The main application of the
deciban was . . . for discriminating between hypotheses, just as in
clinical trials or in medical diagnosis.” [Good (1979) p. 394]⁴
Playfulness ≠ frivolity. Good is making a serious proposal, for which he
claims support by extensive testing in the early 1940’s at Bletchley Park,
where the “Banburismus” language for hypothesis-testing was used in
breaking each day’s German naval “Enigma” code during World War II.⁵
2.2 Observation and Suﬃciency
In this chapter we confine ourselves to the most familiar way of updating
probabilities, that is, conditioning on a statement when an observation
assures us of its truth.⁶
Example 1, Huyghens on light. Let H be the conjunction of Huyghens’s
hypotheses about light and let C represent one of the ‘new phenomena which
ought to follow from the hypotheses which one employs’.
If we know that C follows from H, and we can discover by observation
whether C is true or false, then we have the means to test H—more or
4. I. J. Good, “A. M. Turing’s Statistical Work in World War II”, Biometrika 66 (1979)
393-396, p. 394.
5. See Andrew Hodges, Alan Turing: the Enigma (New York, 1983). The dead hand
of the British Official Secrets Act is loosening; see the Alan Turing home page,
http://www.turing.org.uk/turing/, maintained by Andrew Hodges.
6. In chapter 3 we consider updating by a generalization of conditioning, using statements
that an observation makes merely probable to various degrees.
less conclusively, depending on whether we find that C is false or true. If C
proves false (shaded region), H is refuted decisively, for H ∧ C̄ = ⊥. On
the other hand, if C proves true, H’s probability changes from
old(H) = (area of the H circle)/(area of the ⊤ square) = old(H)/1
to
new(H) = old(H|C) = (area of the H circle)/(area of the C circle) = old(H)/old(C)
so that the probability factor is
π(H) = 1/old(C)
It is the antecedently least probable conclusions whose unexpected
verification raises H’s probability the most.⁷ The corollary to Bayes’s
theorem in section 1.6 generalizes this result to cases where pr(C|H) < 1.
Before going on, we note a small point that will loom larger in chapter 3:
In case of a positive observational result it may be inappropriate to update
by conditioning on the simple statement C, for the observation may provide
further information which overﬂows that package. In example 1 this problem
arises when the observation warrants a report of form C∧E, which would be
represented by a subregion of the C circle, with old(C ∧ E) < old(C). Here
(unless there is further overﬂow) the proposition to condition on is C ∧ E,
not C. But things can get even trickier, as in this homely jellybean example:
Example 2, The Green Bean.⁸
H, This bean is lime-flavored. C, This bean is green.
You are drawing a bean from a bag in which you know that half of the beans
are green, all the lime-flavored ones are green, and the green ones are equally
divided between lime and mint flavors. So before looking at the bean or
tasting it, your probabilities are these: old(C) = 1/2 = old(H|C); old(H) =
1/4. But although new(C) = 1, your probability new(H) for lime can drop
below old(H) = 1/4 instead of rising to old(H|C) = 1/2 in case you yourself
are the observer—for instance, if, when you see that the bean is green, you
also get a whiff of mint, or also see that the bean is of a special shade of green
that you have found to be associated with the mint-flavored ones. And here
there need be no further proposition E you can articulate that would make
7. “More danger, more honor”: George Pólya, Patterns of Plausible Inference, 2nd ed.,
Princeton, 1968, vol. 2, p. 126.
8. Brian Skyrms [reference?]
C ∧ E the thing to condition on, for you may have no words for the telltale
shade of green, or for the whiﬀ of mint. The diﬃculty will be especially
bothersome as an obstacle to collaboration with others who would like to
update their own prior probabilities by conditioning on your ﬁndings. (In
chapter 3 we will see what to do about it.)
2.3 Leverrier on Neptune
We now turn to a methodological story more recent than Huyghens’s:⁹
“On the basis of Newton’s theory, the astronomers tried to compute the
motions of [. . . ] the planet Uranus; the differences between theory and
observation seemed to exceed the admissible limits of error. Some
astronomers suspected that these deviations might be due to the attraction
of a planet revolving beyond Uranus’ orbit, and the French astronomer
Leverrier investigated this conjecture more thoroughly than his colleagues.
Examining the various explanations proposed, he found that there was just
one that could account for the observed irregularities in Uranus’s motion:
the existence of an extra-Uranian planet [sc., Neptune]. He tried to compute
the orbit of such a hypothetical planet from the irregularities of Uranus.
Finally Leverrier succeeded in assigning a definite position in the sky to the
hypothetical planet [say, with a 1 degree margin of error]. He wrote about
it to another astronomer whose observatory was the best equipped to
examine that portion of the sky. The letter arrived on the 23rd of
September 1846 and in the evening of the same day a new planet was found
within one degree of the spot indicated by Leverrier. It was a large
ultra-Uranian planet that had approximately the mass and orbit predicted
by Leverrier.”
We treated Huyghens’s conclusion as a strict deductive consequence of his
principles. But Pólya made the more realistic assumption that Leverrier’s
prediction C (a bright spot near a certain point in the sky at a certain time)
was highly probable but not 100%, given his H (namely, Newton’s laws and
observational data about Uranus). So old(C|H) ≈ 1. And presumably the
rigidity condition was satisfied so that new(C|H) ≈ 1, too. Then verification
9. George Pólya, op. cit., pp. 130-132.
of C would have raised H’s probability by a factor ≈ 1/old(C), which is large
if the prior probability old(C) of Leverrier’s prediction was ≈ 0.
Pólya offers a reason for regarding 1/old(C) as at least 180, and perhaps
as much as 13131: The accuracy of Leverrier’s prediction proved to be better
than 1 degree, and the probability of a randomly selected point on a circle
or on a sphere being closer than 1 degree to a previously speciﬁed point is
1/180 for a circle, and about 1/13131 for a sphere. Favoring the circle is the
fact that the orbits of all known planets lie in the same plane (“the ecliptic”).
Then the great circle cut out by that plane gets the lion’s share of probability.
Thus, if old(C) is half of 1%, H’s probability factor will be about 200.
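Pólya's two bounds come from simple geometry: a 1 degree tolerance marks off an arc of 2 degrees out of 360 on a circle, and a spherical cap of angular radius 1 degree on a sphere. A short Python check (an addition to the text):

```python
from math import cos, radians

# Probability that a uniformly random point lies within 1 degree
# of a previously specified point:
p_circle = 2 / 360                      # a +/- 1 degree arc on the circle
p_sphere = (1 - cos(radians(1))) / 2    # a 1 degree spherical cap
print(f"circle: 1/{1/p_circle:.0f}, sphere: about 1/{1/p_sphere:.0f}")
```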
2.4 Dorling on the Duhem Problem
Skeptical conclusions about the possibility of scientific hypothesis-testing
are often drawn from the presumed arbitrariness of answers to the question
of which to give up—theory, or auxiliary hypothesis—when they jointly
contradict empirical data. The problem, addressed by Duhem in the first
years of the 20th century, was agitated by Quine in midcentury. As drawn
by some of Quine’s readers, the conclusion depends on his assumption that
aside from our genetical and cultural heritage, deductive logic is all we’ve
got to go on when it comes to theory testing. That would leave things pretty
much as Descartes saw them, just before the mid-17th century emergence
in the hands of Fermat, Pascal, Huyghens and others of the probabilistic
(“Bayesian”) methodology that Jon Dorling has brought to bear on various
episodes in the history of science.
The conclusion is one that scientists themselves generally dismiss, thinking
they have good reason to evaluate the effects of evidence as they do, but
regarding formulation and justification of such reasons as someone else’s
job—the methodologist’s. Here is an introduction to Dorling’s work on that
job, using extracts from his important but still unpublished 1982 paper¹⁰—which
10. Jon Dorling, ‘Bayesian personalism, the methodology of research programmes, and
Duhem’s problem’, Studies in History and Philosophy of Science 10 (1979) 177-187. More
along the same lines: Michael Redhead, ‘A Bayesian reconstruction of the methodology of
scientific research programmes’, Studies in History and Philosophy of Science 11 (1980)
341-347. Dorling’s unpublished paper from which excerpts appear here in secs. 2.4.1-2.4.3
is ‘Further illustrations of the Bayesian solution of Duhem’s problem’ (29 pp., photocopied,
1982). References here (‘sec. 4’, etc.) are to the numbered sections of that paper. Dorling’s
work is also discussed in Colin Howson and Peter Urbach, Scientific Reasoning: the
Bayesian Approach (Open Court, La Salle, Illinois, 2nd ed., 1993).
is reproduced in the web page http://www.princeton.edu/~bayesway
with Dorling’s permission.
The material is presented here in terms of odds factors. Assuming rigidity
relative to D, the odds factor for a theory T against an alternative theory S
that is due to learning that D is true will be the left-hand side of the following
equation, the right-hand side of which is called “the likelihood ratio”:
Bayes Factor = Likelihood Ratio:
[old(T|D)/old(S|D)] / [old(T)/old(S)] = old(D|T)/old(D|S)
The empirical result D is not generally deducible or refutable by T alone,
or by S alone, but in interesting cases of scientiﬁc hypothesis testing D is
deducible or refutable on the basis of the theory and an auxiliary hypothesis
A (e.g., the hypothesis that the equipment is in good working order). To
simplify the analysis, Dorling makes an assumption that can generally be
justiﬁed by appropriate formulation of the auxiliary hypothesis:
Prior Independence
old(A ∧ T) = old(A)old(T),
old(A ∧ S) = old(A)old(S).
Generally speaking, S is not simply the denial of T, but a deﬁnite scientiﬁc
theory in its own right, or a disjunction of such theories, all of which agree on
the phenomenon of interest, so that, as an explanation of that phenomenon,
S is a rival to T. In any case Dorling uses the independence assumption to
expand the righthand side of the Bayes Factor = Likelihood Ratio equation:
β(T : S) = [old(D|T ∧ A)old(A) + old(D|T ∧ Ā)old(Ā)] /
           [old(D|S ∧ A)old(A) + old(D|S ∧ Ā)old(Ā)]
To study the effect of D on A, he also expands β(A : Ā) with respect to T
(and similarly with respect to S, although we do not show that here):
β(A : Ā) = [old(D|A ∧ T)old(T) + old(D|A ∧ T̄)old(T̄)] /
           [old(D|Ā ∧ T)old(T) + old(D|Ā ∧ T̄)old(T̄)]
2.4.1 Einstein/Newton, 1919
In these terms Dorling analyzes two famous tests that were duplicated, with
apparatus diﬀering in seemingly unimportant ways, with conﬂicting results:
one of the duplicates conﬁrmed T against S, the other conﬁrmed S against
T. But in each case the scientiﬁc experts took the experiments to clearly
conﬁrm one of the rivals against the other. Dorling explains why the experts
were right:
“In the solar eclipse experiments of 1919, the telescopic observations were
made in two locations, but only in one location was the weather good
enough to obtain easily interpretable results. Here, at Sobral, there were
two telescopes: one, the one we hear about, confirmed Einstein; the other,
in fact the slightly larger one, confirmed Newton. Conclusion: Einstein was
vindicated, and the results with the larger telescope were rejected.”
(Dorling, sec. 4)
Notation
T: Einstein: light-bending effect of the sun
S: Newton: no light-bending effect of the sun
A: Both telescopes are working correctly
D: The conflicting data from both telescopes
In the odds factor β(T : S) above, old(D|T ∧ A) = old(D|S ∧ A) = 0,
since if both telescopes were working correctly they would not have given
contradictory results. Then the first terms of the sums in numerator and
denominator vanish, so that the factors old(Ā) cancel, and we have
β(T : S) = old(D|T ∧ Ā)/old(D|S ∧ Ā)
“Now the experimenters argued that one way in which A might easily be
false was if the mirror of one or the other of the telescopes had distorted in
the heat, and this was much more likely to have happened with the larger
mirror belonging to the telescope which confirmed S than with the smaller
mirror belonging to the telescope which confirmed T. Now the effect of
mirror distortion of the kind envisaged would be to shift the recorded
images of the stars from the positions predicted by T to or beyond those
predicted by S. Hence old(D|T ∧ Ā) was regarded as having an appreciable
value, while, since it was very hard to think of any similar effect which
could have shifted the positions of the stars in the other telescope from
those predicted by S to those predicted by T, old(D|S ∧ Ā) was regarded
as negligibly small, hence the result as overall a decisive confirmation of T
and refutation of S.”
Thus the Bayes factor β(T : S) is very much greater than 1.
2.4.2 Bell’s Inequalities: Holt/Clauser
“Holt’s experiments were conducted first and confirmed the predictions of
the local hidden variable theories and refuted those of the quantum theory.
Clauser examined Holt’s apparatus and could find nothing wrong with it,
and obtained the same results as Holt with Holt’s apparatus. Holt refrained
from publishing his results, but Clauser published his, and they were
rightly taken as excellent evidence for the quantum theory and against
hidden variable theories.” (Dorling, sec. 4)
Notation
T: Quantum theory
S: Disjunction of local hidden variable theories
A: Holt’s setup is reliable¹¹ enough to distinguish T from S
D: The specific correlations predicted by T and contradicted by S are not
detected by Holt’s setup
The characterization of D yields the first two of the following equations. In
conjunction with the characterization of A it also yields old(D|T ∧ Ā) = 1,
for if A is false, Holt’s apparatus was not sensitive enough to detect the
correlations that would have been present according to T; and it yields
old(D|S ∧ Ā) = 1 because of the wild improbability of the apparatus
“hallucinating” those specific correlations.¹²
old(D|T ∧ A) = 0,
old(D|S ∧ A) = old(D|T ∧ Ā) = old(D|S ∧ Ā) = 1
11. This means: sufficiently sensitive and free from systematic error. Holt’s setup proved
to incorporate systematic error arising from tension in tubes for mercury vapor, which
made the glass optically sensitive so as to rotate polarized light. Thanks to Abner Shimony
for clarifying this. See also his supplement 6 in sec. 2.6.
12. Recall Venn on the credibility of extraordinary stories: supplement 7 in sec. 1.9.
Substituting these values, we have
β(T : S) = old(Ā)
Then with a prior probability of 4/5 for adequate sensitivity of Holt’s
apparatus, the odds between quantum theory and the local hidden variable
theories shift strongly in favor of the latter, e.g., with prior odds 45:55
between T and S, the posterior odds are only 9:55, a 14% probability for T.
Now why didn’t Holt publish his result?
Because the experimental result undermined confidence in his apparatus.
Setting T̄ = S in the expansion of β(A : Ā) above, because T and S were
the only theories given any credence as explanations of the results, and
making the same substitutions as before, we have
β(A : Ā) = old(S)
so the odds on A fall from 4:1 to 2.2:1; the probability of A falls from 80%
to 69%. Holt is not prepared to publish with better than a 30% chance that
his apparatus could have missed actual quantum mechanical correlations; the
swing to S depends too much on a prior conﬁdence in the experimental setup
that is undermined by the same thing that caused the swing.
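Dorling's arithmetic for the Holt episode can be retraced in a few lines. The Python sketch below is an addition to the text; it simply re-derives the figures quoted above in exact arithmetic.

```python
from fractions import Fraction as F

old_A = F(4, 5)                    # prior probability that Holt's setup is reliable
beta_TS = 1 - old_A                # beta(T : S) = old(not-A) = 1/5
odds_T = F(45, 55)                 # prior odds on T against S
new_odds_T = odds_T * beta_TS      # posterior odds 9:55
p_T = new_odds_T / (1 + new_odds_T)
print(p_T)                         # 9/64, about 14%

old_S = F(55, 100)
beta_A = old_S                     # beta(A : not-A) = old(S)
new_odds_A = 4 * beta_A            # odds on A fall from 4:1 to 2.2:1
p_A = new_odds_A / (1 + new_odds_A)
print(float(p_A))                  # about 0.69, down from 0.80
```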
Why did Clauser publish?
Notation
T: Quantum theory
S: Disjunction of local hidden variable theories
C: Clauser’s setup is reliable enough
E: The specific correlations predicted by T and contradicted by S are
detected by Clauser’s setup
Suppose that old(C) = .5. At this point, although old(A) has fallen by
11%, both experimenters still trust Holt’s well-tried setup better than
Clauser’s. Suppose Clauser’s initial results E indicate presence of the
quantum mechanical correlations pretty strongly, but still with a 1% chance
of error. Then E strongly favors T over S:¹³
β(T : S) = [old(E|T ∧ C)old(C) + old(E|T ∧ C̄)old(C̄)] /
           [old(E|S ∧ C)old(C) + old(E|S ∧ C̄)old(C̄)] = 50.5
13. Numerically: (1 × .5 + .01 × .5)/(.01 × .5 + .01 × .5) = 50.5.
Starting from the low 9:55 to which T’s odds fell after Holt’s experiment,
odds after Clauser’s experiment will be 909:110, an 89% probability for T.
The result E boosts confidence in Clauser’s apparatus by a factor of
β(C : C̄) = [old(E|C ∧ T)old(T) + old(E|C ∧ S)old(S)] /
            [old(E|C̄ ∧ T)old(T) + old(E|C̄ ∧ S)old(S)] = 15
This raises the initially even odds on C to 15:1, raises the probability from
50% to 94%, and lowers the 50% probability of the eﬀect’s being due to
chance down to 6 or 7 percent.
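The Clauser figures can also be rechecked directly. The Python sketch below is an addition to the text; for β(C : C̄) it uses the post-Holt value old(T) = 9/64, which is the assumption that reproduces the factor of about 15 quoted above.

```python
old_C = 0.5                                   # prior reliability of Clauser's setup
lik = {('T', 'C'): 1.0, ('T', 'noC'): 0.01,   # old(E | theory and C / not-C)
       ('S', 'C'): 0.01, ('S', 'noC'): 0.01}

def beta(t1, t2):
    """Odds factor for t1 against t2, expanding over C and not-C."""
    num = lik[(t1, 'C')] * old_C + lik[(t1, 'noC')] * (1 - old_C)
    den = lik[(t2, 'C')] * old_C + lik[(t2, 'noC')] * (1 - old_C)
    return num / den

b = beta('T', 'S')
new_oT = 9 * b / 55                           # from the 9:55 left by Holt's experiment
print(f"beta = {b:.1f}, P(T) = {new_oT/(1 + new_oT):.0%}")   # beta = 50.5, P(T) = 89%

old_T = 9 / 64                                # post-Holt probability of T (assumed)
beta_C = (1.0 * old_T + 0.01 * (1 - old_T)) / 0.01
print(f"beta(C : not-C) = {beta_C:.0f}")      # about 15
```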
2.4.3 Laplace/Adams
Finally, note one more class of cases: a theory T remains highly probable
although (with auxiliary hypothesis A) it is incompatible with subsequently
observed data D. With S = T̄ in the formulas for β in sec. 2.4 and with
old(D|T ∧ A) = 0 (so that the first terms in the numerators vanish), and
writing
t = old(D|T ∧ Ā)/old(D|T̄ ∧ Ā),    s = old(D|T̄ ∧ A)/old(D|T̄ ∧ Ā)
for two of the likelihood ratios, it is straightforward to show that
β(T : T̄) = t/(1 + (s × old odds on A)),
β(A : Ā) = s/(1 + (t × old odds on T)),
β(T : A) = (t/s) × (old(Ā)/old(T̄)).
These formulas apply to
“a famous episode from the history of astronomy which clearly illustrated
striking asymmetries in ‘normal’ scientists’ reactions to confirmation and
refutation. This particular historical case furnished an almost perfect
controlled experiment from a philosophical point of view, because owing to
a mathematical error of Laplace, later corrected by Adams, the same
observational data were first seen by scientists as confirmatory and later as
disconfirmatory of the orthodox theory. Yet their reactions were strikingly
asymmetric: what was initially seen as a great triumph and of striking
evidential weight in favour of the Newtonian theory, was later, when it had
to be reanalyzed as disconfirmatory after the discovery of Laplace’s
mathematical oversight, viewed merely as a minor embarrassment and of
negligible evidential weight against the Newtonian theory. Scientists
reacted in the ‘refutation’ situation by making a hidden auxiliary
hypothesis, which had previously been considered plausible, bear the brunt
of the refutation, or, if you like, by introducing that hypothesis’s negation
as an apparently ad hoc face-saving auxiliary hypothesis.” (Dorling, sec. 1)
Notation
T: The theory, Newtonian celestial mechanics
A: The hypothesis that disturbances (tidal friction, etc.) make a negligible
contribution to D
D: The observed secular acceleration of the moon
Dorling argues on scientific and historical grounds for approximate
numerical values
t = 1,    s = 1/50.
The thought is that t = 1 because with A false, truth or falsity of T is
irrelevant to D, and t = 50s because in plausible partitions of T̄ into rival
theories predicting lunar accelerations, old(R|T̄) = 1/50 where R is the
disjunction of rivals not embarrassed by D. Then for a theorist whose odds
are 3:2 on A and 9:1 on T (probabilities 60% for A and 90% for T),
β(T : T̄) = 100/103,    β(A : Ā) = 1/500,    β(T : A) = 200.
Thus the prior odds 900:100 on T barely decrease, to 900:103; the new
probability of T, 900/1003, agrees with the original 90% to two decimal
places. But odds on the auxiliary hypothesis A drop sharply, from prior 3:2
to posterior 3:1000, i.e., the probability of A drops from 60% to about three
tenths of 1%; odds on the theory against the auxiliary hypothesis increase
by a factor of 200, from a prior value of 3:2 to a posterior value of 300:1.
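The Laplace/Adams figures check out exactly. A Python sketch (an addition to the text) applying the formulas for β displayed above:

```python
from fractions import Fraction as F

t, s = F(1), F(1, 50)            # Dorling's likelihood-ratio estimates
odds_A = F(3, 2)                 # prior odds on A
odds_T = F(9, 1)                 # prior odds on T

beta_T = t / (1 + s * odds_A)    # beta(T : not-T)
beta_A = s / (1 + t * odds_T)    # beta(A : not-A)
oldA, oldT = F(3, 5), F(9, 10)
beta_TA = (t / s) * ((1 - oldA) / (1 - oldT))   # beta(T : A)

assert beta_T == F(100, 103)
assert beta_A == F(1, 500)
assert beta_TA == 200
new_p_T = odds_T * beta_T / (1 + odds_T * beta_T)
print(new_p_T)                   # 900/1003, still about 0.90
```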
2.4.4 Dorling’s Conclusions
“Until recently there was no adequate theory available of how sci
entists should change their beliefs in the light of evidence. Stan
dard logic is obviously inadequate to solve this problem unless
supplemented by an account of the logical relations between de
grees of belief which fall short of certainty. Subjective probability
theory provides such an account and is the simplest such ac
count that we posess. When applied to the celebrated Duhem
(or DuhemQuine) problem and to the related problem of the
use of ad hoc, or supposedly ad hoc, hypotheses in science, it
yields an elegant solution. This solution has all the properties
which scientists and philosophers might hope for. It provides stan
dards for the validity of informal deductive reasoning comparable
to those which traditional logic has provided for the validity of
informal deductive reasoning. These standards can be provided
with a rationale and justiﬁcation quite independent of any ap
peal to the actual practice of scientists, or to the past success
of such practices.
14
Nevertheless they seem fully in line with the
intuitions of scientists in simple cases and with the intuitions
of the most experienced and most successful scientists in trick
ier and more complex cases. The Bayesian analysis indeed vindi
cates the rationality of experienced scientists’ reactions in many
cases where those reactions were superﬁcially paradoxical and
where the relevant scientists themselves must have puzzled over
the correctness of their own intuitive reactions to the evidence. It
is clear that in many such complex situations many less experi
enced commentators and critics have sometimes drawn incorrect
conclusions and have mistakenly attributed the correct conclu
sions of the experts to scientiﬁc dogmatism. Recent philosophical
[14] Here a long footnote explains the Putnam-Lewis Dutch book argument for conditioning. Putnam stated the result, or, anyway, a special case, in a 1963 Voice of America broadcast, ‘Probability and Confirmation’, reprinted in his Mathematics, Matter and Method, Cambridge University Press (1975) 293-304. Paul Teller, ‘Conditionalization and observation’, Synthese 26 (1973) 218-258, reports a general argument to that effect which Lewis took to have been what Putnam had in mind.
CHAPTER 2. TESTING SCIENTIFIC THEORIES 49
and sociological commentators have sometimes generalized this mistaken reaction into a full-scale attack on the rationality of men of science, and as a result have mistakenly looked for purely sociological explanations for many changes in scientists’ beliefs, or the absence of such changes, which were in fact, as we now see, rationally de rigueur.’ (Dorling, sec. 5)
2.5 Old News Explained
In sec. 2.2 we analyzed the case Huyghens identified as the principal one in his Treatise on Light: a prediction C long known to follow from a hypothesis H is now found to be true. Here, if the rigidity condition is satisfied, new(H) = old(H|C), so that the probability factor is π(H) = 1/old(C). But what if some long-familiar phenomenon C, a phenomenon for which old(C) = 1, is newly found to follow from H in conjunction with familiar background information B about the solar system, and thus to be explained by H ∧ B? Here, if we were to update by conditioning on C, the probability factor would be 1 and new(H) would be the same as old(H). No confirmation here.[15]
Wrong, says Daniel Garber:[16] the information prompting the update is not that C is true, but that H ∧ B implies C. To condition on that news Garber proposes to enlarge the domain of the probability function old by adding to it the hypothesis that C follows from H ∧ B, together with all further truth-functional compounds of that new hypothesis with the old domain. Using some extension old* of old to the enlarged domain, we might then have new(H) = old*(H | H ∧ B implies C). That is an attractive approach to the problem, if it can be made to work.[17]
The alternative approach that we now illustrate defines the new assignment on the same domain that old had. It analyzes the old → new transition by embedding it in a larger process, ur →(obs) old →(expl) new, in which ur represents an original state of ignorance of C’s truth and its logical relationship to H:
[15] This is what Clark Glymour has dubbed ‘the paradox of old evidence’: see his Theory and Evidence, Princeton University Press, 1980.
[16] See his ‘Old evidence and logical omniscience in Bayesian confirmation theory’ in Testing Scientific Theories, ed. John Earman, University of Minnesota Press, 1983. For further discussion of this proposal see ‘Bayesianism with a human face’ in my Probability and the Art of Judgment, Cambridge University Press, 1992.
[17] For a critique, see chapter 5 of John Earman’s Bayes or Bust? (MIT Press, 1992).
Example 1, The perihelion of Mercury.
Notation.
H: GTR applied to the Sun-Mercury system
C: Advance of 43 seconds of arc per century[18]
In 1915 Einstein presented a report to the Prussian Academy of Sciences explaining the then-known advance of approximately 43'' per century in the perihelion of Mercury in terms of his (“GTR”) field equations for gravitation. An advance of 38'' per century had been reported by Leverrier in 1859, due to ‘some unknown action on which no light has been thrown’:[19]
“The only way to explain the eﬀect would be ([Leverrier] noted) to
increase the mass of Venus by at least 10 percent, an inadmissible
modiﬁcation. He strongly doubted that an intramercurial planet
[“Vulcan”], as yet unobserved, might be the cause. A swarm of
intramercurial asteroids was not ruled out, he believed.”
Leverrier’s figure of 38'' was soon corrected to its present value of 43'', but the difficulty for Newtonian explanations of planetary astronomy was still
[18] Applied to various Sun-planet systems, the GTR says that all planets’ perihelions advance, but that Mercury is the only one for which that advance should be observable. This figure for Mercury has remained good since its publication in 1882 by Simon Newcomb.
[19] ‘Subtle is the Lord. . .’, the Science and Life of Albert Einstein, Abraham Pais, Oxford University Press, 1982, p. 254.
in place 56 years later, when Einstein finally managed to provide a general relativistic explanation ‘without special assumptions’ (such as Vulcan)—an explanation which was soon accepted as strong confirmation for the GTR.
In the diagram above, the left-hand path, ur → prn → new, represents an expansion of the account in sec. 2.2 of Huyghens’s “principal” case (‘prn’ for ‘principal’), in which the confirming phenomenon is verified after having been predicted via the hypothesis which its truth confirms. The “ur” probability distribution, indicated schematically at the top, represents a time before C was known to follow from H. To accommodate that discovery the ‘b’ in the ur-distribution is moved left and added to the ‘a’ to obtain the top row of the prn distribution. The reasoning has two steps. First: since we now know that if H is true, C must be true as well, prn(H ∧ C̄) is set equal to 0. Second: since implying a prediction that may well be false neither confirms nor infirms H, the prn probability of H is to remain at its ur-value even though the probability of H ∧ C̄ has been nullified. Therefore the probability of H ∧ C is increased to prn(H ∧ C) = a + b. And this prn distribution is where the Huyghens on light example in sec. 2.2 begins, leading us to the bottom distribution, which assigns the following odds to H:

new(H)/new(H̄) = (a + b)/c.
In terms of the present example this new distribution is what Einstein arrived at from the distribution labelled old, in which

old(H)/old(H̄) = a/c.

The rationale is commutativity: the new distribution is the same, no matter if the observation or the explanation comes first.
Now the Bayes factor gives the clearest view of the transition old(H) → new(H):[20]

[20] The approach here is taken from Carl Wagner, “Old evidence and new explanation III”, PSA 2000 (J. A. Barrett and J. McK. Alexander, eds.), part 1, pp. S165-S175 (Proceedings of the 2000 biennial meeting of the Philosophy of Science Association, supplement to Philosophy of Science 68 [2001]), which reviews and extends earlier work, “Old evidence and new explanation I” and “Old evidence and new explanation II” in Philosophy of Science 64, No. 3 (1997) 677-691 and 66 (1999) 283-288.
β(H : H̄) = [(a + b)/c] / [a/c] = 1 + ur(C̄|H)/ur(C|H).

Thus the new explanation of old news C increases the odds on H by a factor of (1 + the ur-odds against C, given H). Arguably, this is very much greater than 1, since, in a notation in which C = C₄₃ = an advance of (43 ± .5)'' per century, and similarly for other Cᵢ’s, C̄ is a disjunction of very many “almost” incompatible disjuncts: C̄ = ⋯ ∨ C₃₈ ∨ C₃₉ ∨ C₄₀ ∨ C₄₁ ∨ C₄₂ ∨ C₄₄ ∨ ⋯.[21]
That’s the good news.[22]
And on the plausible assumption[23] of ur-independence of H from C, the formula for β becomes even simpler:

β(H : H̄) = 1 + ur(C̄)/ur(C)
The bad news is that ur and prn are new constructs, projected into a mythical methodological past. But maybe that’s not so bad. As we have just seen in a very sketchy way, there seem to be strong reasons for taking the ratio ur(C̄|H)/ur(C|H) to be very large. This is clearer in the case of ur-independence of H and C:
Example 2, Ur-independence. If Wagner’s independence assumption is generally persuasive, the physics community’s ur-odds against C (“C₄₃”) will be very high, since, behind the veil of ur-ignorance, there is a large array of Cᵢ’s which, to a first approximation, are equiprobable. Perhaps 999 of them? Then β(H : H̄) ≥ 1000, and the weight of evidence (2.1) is w(H : H̄) ≥ 3.
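The ur → prn bookkeeping above can be checked in a few lines of code. The cell probabilities here are hypothetical, chosen only to make the fractions come out neatly; they are not figures from the text.

```python
from fractions import Fraction as F

# Hypothetical ur-distribution over the four cells of the H/C partition
# (illustrative numbers only):
#   ur(H & C) = a,  ur(H & ~C) = b,  ur(~H & C) = c,  ur(~H & ~C) = d
a, b, c, d = F(3, 100), F(57, 100), F(2, 100), F(38, 100)
assert a + b + c + d == 1

old_odds = a / c            # old = ur conditioned on C: old(H)/old(~H) = a/c
new_odds = (a + b) / c      # the explanation first moves b onto the H & C cell
beta = new_odds / old_odds  # Bayes factor beta(H : ~H)

# The closed form from the text: beta = 1 + ur(~C|H)/ur(C|H) = 1 + b/a
assert beta == 1 + b / a
print(beta)                 # 20
```

With b much larger than a, as the many-disjunct argument suggests, the factor is large even though conditioning on C alone would have left H untouched.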
2.6 Supplements
1 “Someone is trying to decide whether or not T is true. He notes that T is a consequence of H. Later he succeeds in proving that H is false. How does this refutation affect the probability of T?”[24]
2 “We are trying to decide whether or not T is true. We derive a sequence of consequences from T, say C₁, C₂, C₃, . . . . We succeed in verifying C₁, then
[21] “Almost”: ur(Cᵢ ∧ Cⱼ) = 0 for any distinct integers i and j.
[22] “More [ur-]danger, more honor.” See Pólya, quoted in sec. 2.2 here.
[23] Carl Wagner’s, again.
[24] This problem and the next are from George Pólya, ‘Heuristic reasoning and the theory of probability’, American Mathematical Monthly 48 (1941) 450-465.
C₂, then C₃, and so on. What will be the effect of these successive verifications on the probability of T?”
3 Fallacies of “yes/no” confirmation.[25] Each of the following plausible rules is unreliable. Find counterexamples—preferably, simple ones.
(a) If D confirms T, and T implies H, then D confirms H.
(b) If D confirms H and T separately, it must confirm their conjunction, TH.
(c) If D and E each confirm H, then their conjunction, DE, must also confirm H.
(d) If D confirms a conjunction, TH, then it cannot infirm each conjunct separately.
4 Hempel’s ravens. It seems evident that black ravens confirm (H) ‘All ravens are black’ and that nonblack nonravens do not. Yet H is logically equivalent to ‘All nonravens are nonblack’. Use probabilistic considerations to resolve this paradox of “yes/no” confirmation.[26]
5 Wagner III (sec. 2.5).[27] Carl Wagner extends his treatment of old evidence newly explained to cases in which updating is by generalized conditioning on a countable sequence 𝒞 = ⟨C₁, C₂, . . .⟩ where none of the updated probabilities need be 1, and to cases in which each of the Cᵢ’s “follows only probabilistically” from H in the sense that the conditional probabilities of the Cᵢ’s, given H, are high but < 1. Here we focus on the simplest case, in which 𝒞 has two members, 𝒞 = ⟨C, C̄⟩. Where (as in the diagram in sec. 2.5) updates are not always from old to new, we indicate the prior and posterior probability measures by a subscript and a superscript: β_prior^post. In particular, we shall write β_old^new(H : H̄) explicitly, instead of β(H : H̄) as in (3) and (4) of sec. 2.5. And we shall use ‘β*’ as shorthand: β* = β_ur^old(C : C̄).
(a) Show that in the framework of sec. 2.5 commutativity is equivalent to Uniformity:

β_old^new(A : A′) = β_ur^prn(A : A′) if A, A′ ∈ {H∧C, H∧C̄, H̄∧C, H̄∧C̄}.

(b) Show that uniformity holds whenever the updates ur → old and prn → new are both by generalized conditioning.
[25] These stem from Carl G. Hempel’s ‘Studies in the logic of confirmation’, Mind 54 (1945) 1-26 and 97-121, which is reprinted in his Aspects of Scientific Explanation, The Free Press, New York, 1965.
[26] The paradox was first floated (1937) by Carl G. Hempel, in an abstract form: see pp. 50-51 of his Selected Philosophical Essays, Cambridge University Press, 2000. For the first probabilistic solution, see “On confirmation” by Janina Hosiasson-Lindenbaum, The Journal of Symbolic Logic 5 (1940) 133-148. For a critical survey of more recent treatments, see pp. 69-73 of John Earman’s Bayes or Bust?
[27] This is an easily detachable bit of Wagner’s “Old evidence and new explanation III” (cited above in sec. 2.5).
Now verify that where uniformity holds, so do the following two formulas:

(c) β_ur^new(H : H̄) = [(β* − 1) prn(C|H) + 1] / [(β* − 1) prn(C|H̄) + 1]

(d) β_ur^old(H : H̄) = [(β* − 1) ur(C|H) + 1] / [(β* − 1) ur(C|H̄) + 1]

And show that, given uniformity,

(e) if H and C are ur-independent, then old(H) = ur(H) and, therefore,

(f) β_old^new(H : H̄) = [(β* − 1) prn(C|H) + 1] / [(β* − 1) prn(C|H̄) + 1], so that

(g) Given uniformity, if H and C are ur-independent then
(1) β_old^new(H : H̄) depends only on β*, prn(C|H) and prn(C|H̄);
(2) if β* > 1 and prn(C|H) > prn(C|H̄), then β_old^new(H : H̄) > 1;
(3) β_old^new(H : H̄) → prn(C|H)/prn(C|H̄) as β* → ∞.
6 Shimony on Holt-Clauser.[28] “Suppose that the true theory is local hidden variables, but Clauser’s apparatus [which supported QM] was faulty. Then you have to accept that the combination of systematic and random errors yielded (within the rather narrow error bars) the quantum mechanical prediction, which is a definite number. The probability of such a coincidence is very small, since the part of experimental space that agrees with a numerical prediction is small. By contrast, if quantum mechanics is correct and Holt’s apparatus is faulty, it is not improbable that results in agreement with Bell’s inequality (hence with local hidden variables theories) would be obtained, because agreement occurs when the relevant number falls within a rather large interval. Also, the errors would usually have the effect of disguising correlations, and the quantum mechanical prediction is a strict correlation. Hence good Bayesian reasons can be given for voting for Clauser over Holt, even if one disregards later experiments devised in order to break the tie.”

Compare: relative ease of the Rockies eventually wearing down to Adirondacks size, as against improbability of the Adirondacks eventually reaching the size of the Rockies.

[28] Personal communication, 12 Sept. 2002.
Chapter 3
Probability Dynamics; Collaboration

We now look more deeply into the matter of learning from experience, where a pair of probability assignments represents your judgments before and after you change your mind in response to the result of experiment or observation. We start with the simplest, most familiar mode of updating, which will be generalized in sec. 3.2 and applied in secs. 3.3 and 3.4 to the problem of learning from other people’s experience-driven probabilistic updating.

3.1 Conditioning

Suppose your judgments were once characterized by an assignment (“old”) which describes all of your conditional and unconditional probabilities as they were then. And suppose you became convinced of the truth of some data statement D. That would change your probabilistic judgments; they would no longer correspond to your prior assignment old, but to some posterior assignment new, where you are certain of D’s truth.

certainty: new(D) = 1

How should your new probabilities for hypotheses H other than D be related to your old ones?
The simplest answer goes by the name of “conditioning” (or “conditionalization”) on D.
conditioning: new(H) = old(H|D)

This means that your new unconditional probability for any hypothesis H will simply be your old conditional probability for H given D.
When is it appropriate to update by conditioning on D? It is easy to see—once you think of it—that certainty about D is not enough.

Example 1, The Red Queen. You learn that a playing card, drawn from a well-shuffled, standard 52-card deck, is a heart: your new(♥) = 1. Since you know that all hearts are red, this means that your new(red) = 1 as well. But since old(Queen ∧ ♥ | ♥) = 1/13 whereas old(Queen ∧ ♥ | red) = 1/26, you will have different new probabilities for the card’s being the Queen of Hearts, depending on which certainty you update on, ♥ or red.
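The two conditional probabilities in the Red Queen example can be checked by brute enumeration of the deck; this is a sketch, and the card encoding is ours, not the text’s.

```python
from fractions import Fraction as F
from itertools import product

# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))

def prob(event, given):
    # Conditional probability by counting, uniform over the deck.
    cases = [card for card in deck if given(card)]
    return F(sum(1 for card in cases if event(card)), len(cases))

queen_of_hearts = lambda card: card == ('Q', 'hearts')
heart = lambda card: card[1] == 'hearts'
red = lambda card: card[1] in ('hearts', 'diamonds')

print(prob(queen_of_hearts, heart))  # 1/13
print(prob(queen_of_hearts, red))    # 1/26
```

Both conditioning propositions are certainties after the draw, yet they disagree about the Queen of Hearts, which is the point of the example.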
Of course, in example 1 we are in no doubt about which certainty you should condition on: it is the more informative one, ♥. And if what you actually learned was that the card was a court card as well as a heart, then that is what you ought to update on, so that your new(Q ∧ ♥) = old(Q ∧ ♥ | (Jack ∨ King ∨ Queen) ∧ ♥) = 1/3. But what if—although the thing you learned with certainty was indeed that the card was a heart—you also saw something you can’t put into words, that gave you reason to suspect, uncertainly, that it was a court card?
Example 2, A brief glimpse determines your new probabilities as 1/20 for each of the ten numbered hearts, and as 1/6 for each of the J, K and Q. Here, where some of what you have learned is just probabilistic, conditioning on your most informative new certainty does not take all of your information into account. In sec. 3.2 we shall see how you might take such uncertain information into account; but meanwhile, it is clear that ordinary conditioning is not always the way to go.
What is the general condition under which it makes sense for you to plan to update by conditioning on D if you should become sure that D is true? The answer is confidence that your conditional probabilities given D will not be changed by the experience that makes you certain of D’s truth.

Invariant conditional probabilities:
(1) For all H, new(H|D) = old(H|D)

Since new(H|D) = new(H) when new(D) = 1, it is clear that certainty and invariance together imply conditioning. And conversely, conditioning implies both certainty and invariance.[1]

[1] To obtain certainty, set H = D in conditioning. Now invariance follows from conditioning and certainty since new(H) = new(H|D) when new(D) = 1.
An equivalent invariance is that of your unconditional odds between different ways in which D might be true.

Invariant odds:
(2) If A and B each imply D, new(A)/new(B) = old(A)/old(B).

And another is invariance of the “probability factors” π(A) by which your old probabilities of ways of D’s being true can be multiplied in order to get their new probabilities.

Invariant probability factors:
(3) If A implies D then π(A) = π(D) — where π(A) =df new(A)/old(A).
Exercise. Show that (1) implies (3), which implies (2), which implies (1).
There are special circumstances in which the invariance condition dependably holds:

Example 3, The Statistician’s Stooge.[2] Invariance will surely hold in the variant of example 1 in which you do not see the card yourself, but have arranged for a trusted assistant to look at it and tell you simply whether or not it is a heart, with no further information. Under such conditions your unconditional probability that the card is a heart changes to 1 in a way that cannot change any of your conditional probabilities, given ♥.
3.2 Generalized Conditioning[3]

Certainty is quite demanding. It rules out not only the far-fetched uncertainties associated with philosophical skepticism, but also the familiar uncertainties that affect real empirical inquiry in science and everyday life. But it is a condition that turns out to be dispensable: As long as invariance holds,

[2] I. J. Good’s term.
[3] Also known as probability kinematics and Jeffrey conditioning. For a little more about this, see Persi Diaconis and Sandy Zabell, ‘Some alternatives to Bayes’s rule’ in Information Pooling and Group Decision Making, Bernard Grofman and Guillermo Owen (eds.), JAI Press, Greenwich, Conn. and London, England, pp. 25-38. For much more, see ‘Updating subjective probability’ by the same authors, Journal of the American Statistical Association 77 (1982) 822-830.
updating is valid by a generalization of conditioning to which we now turn. We begin with the simplest case, of a yes/no question.

If invariance holds relative to each answer (D, D̄) to a yes/no question, and something makes you (say) 85% sure that the answer is ‘yes’, a way of updating your probabilities is still available for you. Invariance with respect to both answers means that for each hypothesis, H, new(H|D) = old(H|D) and new(H|D̄) = old(H|D̄). The required updating rule is easily obtained from the law of total probability in the form

new(H) = new(H|D) new(D) + new(H|D̄) new(D̄).

Rewriting the two conditional probabilities in this equation via invariance relative to D and to D̄, we have the following updating scheme:

new(H) = old(H|D) new(D) + old(H|D̄) new(D̄)

More generally, with any countable partition of answers, the applicable rule of total probability has one term on the right for each member of the partition. If the invariance condition holds for each answer, Dᵢ, we have the updating scheme new(H) = old(H|D₁) new(D₁) + old(H|D₂) new(D₂) + . . .,

“Probability Kinematics”: new(H) = Σᵢ₌₁ⁿ old(H|Dᵢ) new(Dᵢ).
This is equivalent to invariance with respect to every answer: new(H|Dᵢ) = old(H|Dᵢ) for i = 1, . . . , n.
Example 4, A consultant’s prognosis. In hopes of settling on one of the following diagnoses, Dr. Jane Doe, a histopathologist, conducts a microscopic examination of a section of tissue surgically removed from a tumor. She is sure that exactly one of the three is correct.

D₁ = Islet cell carcinoma, D₂ = Ductal cell ca, D₃ = Benign tumor.

Here n = 3, so (using accents to distinguish her probability assignments from yours) her new′(H) for the prognosis H of 5-year survival will be

old′(H|D₁) new′(D₁) + old′(H|D₂) new′(D₂) + old′(H|D₃) new′(D₃).
Suppose that, in the event, the examination does not drive her probability for any of the diagnoses to 1, but leads her to assign new′(Dᵢ) = 1/3, 1/6, 1/2 for i = 1, 2, 3. Suppose, further, that her conditional probabilities for H given the diagnoses are unaffected by her examination: old′(H|Dᵢ) = new′(H|Dᵢ) = 4/10, 6/10, 9/10. Then by probability kinematics her new probability for 5-year survival will be a weighted average new′(H) = 41/60 of the values that her old′(H) would have had if she had been sure of the three diagnoses, where the weights are her new probabilities for those diagnoses.
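Dr. Doe’s kinematical update is a one-line weighted average; the fractions below are the ones given in Example 4.

```python
from fractions import Fraction as F

# Her new'(D_i) for the three diagnoses, and her invariant old'(H|D_i).
new_D = [F(1, 3), F(1, 6), F(1, 2)]
old_H_given_D = [F(4, 10), F(6, 10), F(9, 10)]

# Probability kinematics: new'(H) = sum_i old'(H|D_i) new'(D_i).
new_H = sum(p * d for p, d in zip(old_H_given_D, new_D))
print(new_H)  # 41/60
```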
Note that the calculation of new′(H) makes no use of Dr Doe’s old probabilities for the diagnoses. Indeed, her prior old′ may have been a partially defined function, assigning no numerical values at all to the Dᵢ’s. Certainly her new′(Dᵢ)’s will have arisen through an interaction of features of her prior mental state with her new experiences at the microscope. But the results new′(Dᵢ) of that interaction are data for the kinematical formula for updating old′(H); the formula itself does not compute those results.
3.3 Probabilistic Observation Reports[4]

We now move outside the native ground of probability kinematics into a region where your new probabilities for the Dᵢ are to be influenced by someone else’s probabilistic observation report. You are unlikely to simply adopt such an observer’s updated probabilities as your own, for they are necessarily a confusion of what the other person has gathered from the observation itself, which you would like to adopt as your own, with that person’s prior judgmental state, for which you may prefer to substitute your own.

We continue in the medical setting. Suppose you are a clinical oncologist who wants to make the best use you can of the observations of a histopathologist whom you have consulted. Notation: old and new are your probability assignments before and after the histopathologist’s observation has led her to change her probability assignment from old′ to new′.

If you do simply adopt the expert’s new probabilities for the diagnoses, setting your new(Dᵢ) = new′(Dᵢ) for each i, you can update by probability kinematics even if you had no prior diagnostic opinions old(Dᵢ) of your own; all you need are her new′(Dᵢ)’s and your invariant conditional prognoses old(H|Dᵢ). But suppose you have your own priors, old(Dᵢ), which you

[4] See Wagner (2001) and Wagner, ‘Probability Kinematics and Exchangeability’, Philosophy of Science 69 (2002) 266-278.
take to be well-founded, and although you have high regard for the pathologist’s ability to interpret histographic slides, you view her prior probabilities old′(Dᵢ) for the various diagnoses as arbitrary and uninformed: she has no background information about the patient, but for the purpose of formulating her report she has adopted certain numerically convenient priors to update on the basis of her observations. It is not from the new′(Dᵢ) themselves but from the updates old′(Dᵢ) → new′(Dᵢ) that you must extract the information contributed by her observation. How?

Here is a way: express your new(Dᵢ) as a product π(Dᵢ) old(Dᵢ) in which the factor π(Dᵢ) is constructed by combining the histopathologist’s probability factors π′(Dᵢ) for the diagnoses with your priors for them, as follows.[5]

(1) π(Dᵢ) := π′(Dᵢ) / Σᵢ π′(Dᵢ) old(Dᵢ)   (Your Dᵢ probability factor)
Now, writing your new(Dᵢ) = π(Dᵢ) old(Dᵢ) in the formula in sec. 3.2 for updating by probability kinematics, and applying the product rule, we have

(2) new(H) = Σᵢ π(Dᵢ) old(H ∧ Dᵢ)

as your updated probability for a prognosis, in the light of the observer’s report and your prior probabilities, and

(3) new(H₁)/new(H₂) = Σᵢ π′(Dᵢ) old(H₁ ∧ Dᵢ) / Σᵢ π′(Dᵢ) old(H₂ ∧ Dᵢ)

as your updated odds between two prognoses.[6]
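Formulas (1) and (2) are easy to mechanize. The sketch below uses made-up factors and priors, chosen only to show that the normalization in (1) makes your new diagnosis probabilities sum to 1; they are not the numbers of the worked example in the text.

```python
from fractions import Fraction as F

# Hypothetical inputs: the observer's probability factors pi'(D_i)
# and your own priors old(D_i) on a three-member partition.
pi_expert = [F(2), F(1), F(1, 2)]
old_D = [F(1, 2), F(1, 4), F(1, 4)]

# Formula (1): your normalized probability factors pi(D_i).
norm = sum(p * o for p, o in zip(pi_expert, old_D))
pi_yours = [p / norm for p in pi_expert]

# Your new diagnosis probabilities new(D_i) = pi(D_i) old(D_i).
new_D = [p * o for p, o in zip(pi_yours, old_D)]
assert sum(new_D) == 1  # the denominator in (1) guarantees this

# Formula (2): new(H) from your joint priors old(H & D_i), also hypothetical.
old_H_and_D = [F(1, 4), F(1, 8), F(1, 8)]
new_H = sum(p * j for p, j in zip(pi_yours, old_H_and_D))
print(new_H)  # 1/2 with these inputs
```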
Example 5, Using the consultant’s diagnostic updates. In example 4 the histopathologist’s unspecified priors were updated to new values new′(Dᵢ) = 1/3, 1/6, 1/2 for i = 1, 2, 3 by her observation. Now if all her old′(Dᵢ)’s were 1/3, her probability factors would be π′(Dᵢ) = 1, 1/2, 3/2; if your old(Dᵢ)’s were 1/4, 1/2, 1/4, your diagnostic probability factors (1) would be π(Dᵢ) = 4/3, 2/3, 2; and if your old(H ∧ Dᵢ)’s were 3/16, 3/32, 3/32 then by (2) your new(H) would be 1/2 as against old(H) = 3/8, so that your π(5-year survival) = 4/3.
It is noteworthy that formulas (1)-(3) remain valid when the probability factors are replaced by anchored odds (or “Bayes”) factors:

[5] Without the normalizing denominator in (1), the sum of your new(Dᵢ)’s might turn out to be ≠ 1.
[6] Combine (1) and (2), then cancel the normalizing denominators to get (3).
(4) β(Dᵢ : D₁) = [new(Dᵢ)/new(D₁)] / [old(Dᵢ)/old(D₁)] = π(Dᵢ)/π(D₁)   (Dᵢ : D₁ Bayes factor)

This is your Bayes factor for Dᵢ against D₁; it is the number by which your old odds on Dᵢ against D₁ can be multiplied in order to get your new odds; it is what remains of your new odds when the old odds have been factored out. Choice of D₁ as the anchor diagnosis with which all the Dᵢ are compared is arbitrary. Since odds are arbitrary up to a constant multiplier, any fixed diagnosis Dₖ would do as well as D₁, for the Bayes factors β(Dᵢ : Dₖ) are all of form β(Dᵢ : D₁) · c, where the constant c is β(D₁ : Dₖ).
Bayes factors have wide credibility as probabilistic observation reports, with prior probabilities “factored out”.[7] But probability factors carry some information about priors, as when a probability factor of 2 tells us that the old probability must have been ≤ 1/2 since the new probability must be ≤ 1. This residual information about priors is neutralized in the context of formulas (1)-(3) above, which remain valid with anchored Bayes factors in place of probability factors. This matters (a little) because probability factors are a little easier to compute than Bayes factors, starting from old and new probabilities.
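A quick check of formula (4), and of the claim that the choice of anchor is arbitrary, with hypothetical old and new diagnosis probabilities:

```python
from fractions import Fraction as F

# Hypothetical old and new probabilities on a three-member partition.
old = [F(1, 2), F(1, 4), F(1, 4)]
new = [F(1, 4), F(1, 4), F(1, 2)]

pi = [n / o for n, o in zip(new, old)]  # probability factors pi(D_i)
beta_anchor1 = [p / pi[0] for p in pi]  # formula (4), anchored at D_1
beta_anchor3 = [p / pi[2] for p in pi]  # the same, anchored at D_3

# Re-anchoring multiplies every Bayes factor by the constant beta(D_3 : D_1).
c = beta_anchor1[2]
assert all(b1 == b3 * c for b1, b3 in zip(beta_anchor1, beta_anchor3))
```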
3.4 Updating Twice: Commutativity

Here we consider the outcome of successive updating on the reports of two different experts—say, a histopathologist (′) and a radiologist (″)—assuming invariance of your conditional probabilities relative to both experts’ partitions. Question: if you update twice, should order be irrelevant? Should →′ →″ = →″ →′?

The answer depends on particulars of
(1) the partitions on which →′ and →″ are defined;
(2) the mode of updating (by probabilities? Bayes factors?); and
(3) your starting point, P.

[7] In a medical context, this goes back at least as far as Schwartz, Wolfe, and Pauker, “Pathology and Probabilities: a new approach to interpreting and reporting biopsies”, New England Journal of Medicine 305 (1981) 917-923.
3.4.1 Updating on alien probabilities for diagnoses

A propos of (2), suppose you accept two new probability assignments to one and the same partition, in turn.

• Can order matter?

Certainly. Since in both updates your new(H) depends only on your invariant conditional probabilities old(H|Dᵢ) and the new(Dᵢ)’s that you adopt from that update’s expert, the second assignment simply replaces the first.

• When is order immaterial?

When there are two partitions, and updating on the second leaves probabilities of all elements of the first unchanged—as happens, e.g., when the two partitions are independent relative to your old.[8]
3.4.2 Updating on alien factors for diagnoses[9]

Notation: throughout each formula, f = β or f = π, as you wish.

In updating by Bayes or probability factors f for diagnoses as in sec. 3.3, order cannot matter.[10]

Example 6, One partition. Adopting as your own both a pathologist’s factors fᵢ′ and a radiologist’s factors fᵢ″ on the same partition—in either order—you come to the same result: your overall factors will be products fᵢ′fᵢ″ of the pathologist’s and radiologist’s factors. Your final probabilities for the diagnoses and for the prognosis H will be

new(Dᵢ) = fᵢ old(Dᵢ),   new(H) = Σᵢ fᵢ old(H ∧ Dᵢ)

where fᵢ is your normalized factor, constructed from the alien fᵢ′ and fᵢ″:

fᵢ = fᵢ′fᵢ″ / Σᵢ old(Dᵢ) fᵢ′fᵢ″
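That order cannot matter when you adopt factors, unlike when you adopt probabilities as in sec. 3.4.1, comes down to the commutativity of multiplication. A sketch with invented factors:

```python
from fractions import Fraction as F

def factor_update(old, factors):
    # Multiply a distribution by the adopted factors, then renormalize.
    scaled = [o * f for o, f in zip(old, factors)]
    total = sum(scaled)
    return [p / total for p in scaled]

old = [F(1, 3), F(1, 3), F(1, 3)]
f_pathologist = [F(2), F(1), F(1, 2)]  # hypothetical factors f'_i
f_radiologist = [F(1, 2), F(3), F(1)]  # hypothetical factors f''_i

one_order = factor_update(factor_update(old, f_pathologist), f_radiologist)
other_order = factor_update(factor_update(old, f_radiologist), f_pathologist)
assert one_order == other_order  # order cannot matter
```

Each pass scales the distribution by the adopted factors and renormalizes, so the composite update is proportional to the product of the two factor lists, in either order.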
Example 7, Two partitions. A pathologist’s update, →′, with partition {Dᵢ′} and factors fᵢ′ (i = 1, . . . , m), and a radiologist’s, →″, with partition {Dⱼ″} and
[8] For more about this, see Diaconis and Zabell (1982), esp. 825-6.
[9] For more about this, see Carl Wagner’s ‘Probability Kinematics and Commutativity’, Philosophy of Science 69 (2002) 266-278.
[10] Proofs are straightforward. See pp. 52-64 of Richard Jeffrey, Petrus Hispanus Lectures 2000: After Logical Empiricism, Sociedade Portuguesa de Filosofia, Edições Colibri, Lisbon, 2002 (edited, translated and introduced by António Zilhão. Also in Portuguese: Depois do empirismo lógico.)
factors fⱼ″ (j = 1, . . . , n). These must commute, for in either order they are equivalent to a single mapping, →, with partition {Dᵢ′ ∧ Dⱼ″ | old(Dᵢ′ ∧ Dⱼ″) > 0} and factors fᵢ′fⱼ″. Now in terms of your normalization

f_{i,j} = fᵢ′fⱼ″ / Σ_{i,j} old(Dᵢ′ ∧ Dⱼ″) fᵢ′fⱼ″

of the alien factors, your updating equations on the partition elements Dᵢ′ ∧ Dⱼ″ can be written new(Dᵢ′ ∧ Dⱼ″) = f_{i,j} old(Dᵢ′ ∧ Dⱼ″). Then your new probability for the prognosis will be

new(H) = Σ_{i,j} f_{i,j} old(H ∧ Dᵢ′ ∧ Dⱼ″).
3.5 Softcore Empiricism

Here is a sample of mid-20th century hardcore empiricism:

Subtract, in what we say that we see, or hear, or otherwise learn from direct experience, all that conceivably could be mistaken; the remainder is the given content of the experience inducing this belief. If there were no such hard kernel in experience—e.g., what we see when we think we see a deer but there is no deer—then the word ‘experience’ would have nothing to refer to.[11]

Hardcore empiricism puts some such “hard kernel” to work as the “purely experiential component” of your observations, about which you cannot be mistaken. Lewis himself thinks of this hard kernel as a proposition, since it has a truth value (i.e., true), and has a subjective probability for you (i.e., 1). Early and late, he argues that the relationship between your irrefutable kernel and your other empirical judgments is the relationship between fully believed premises and uncertain conclusions, which have various probabilities conditionally upon those premises. Thus, in 1929 (Mind and the World Order, pp. 328-9) he holds that

the immediate premises are, very likely, themselves only probable, and perhaps in turn based upon premises only probable. Unless

[11] C. I. Lewis, An Analysis of Knowledge and Valuation, Open Court, LaSalle, Illinois, 1947, pp. 182-3. Lewis’s emphasis.
this backward-leading chain comes to rest finally in certainty, no probability-judgment can be valid at all. . . . Such ultimate premises . . . must be actual given data for the individual who makes the judgment.

And in 1947 (An Analysis of Knowledge and Valuation, p. 186):

If anything is to be probable, then something must be certain. The data which themselves support a genuine probability, must themselves be certainties. We do have such absolute certainties in the sense data initiating belief and in those passages of experience which later may confirm it.

In effect, Lewis subscribes to Carnap’s view[12] of inductive probability as prescribing, as your current subjective probability for a hypothesis H, your old(H | D₁ ∧ . . . ∧ Dₜ), where the Dᵢ are your fully believed data sentences from the beginning (D₁) to date (Dₜ) and old is a probability assignment that would be appropriate for you to have prior to all experience.
Lewis's arguments for this view seem to be based on the ideas, which we have been undermining in this chapter, that (a) conditioning on certainties is the only rational way to form your degrees of belief, and (b) if you are rational, the information encoded in your probabilities at any time is simply the conjunction of all your hardcore data sentences up to that time. But perhaps there are softcore, probabilistic data sentences on which simple conditioning is possible to the same effect as probability kinematics:
Example 6, Is the shirt blue or green? There are conditions (blue/green colorblindness, poor lighting) under which what you see can lead you to adopt probabilities, say 2/3 and 1/3, for blue and green, where there is no hardcore experiential proposition E you can cite for which your old(Blue|E) = 2/3 and old(Green|E) = 1/3. The required softcore experiential proposition c would be less accessible than Lewis's 'what we see when we think we see a deer but there is no deer', since what we think is not that the shirt is blue, but that new(blue) = 2/3 and new(green) = 1/3. With c as the proposition new(blue) = 2/3 ∧ new(green) = 1/3, it is possible to expand the domain of the function old so as to allow conditioning on c in a way that yields the same result you would get via probability kinematics. Formally, this c behaves like the elusive hardcore E at the beginning of this example.
[12] See, e.g., Rudolf Carnap, The Logical Foundations of Probability, University of Chicago Press (1950, 1964).
This is how it works.[13] For an n-fold partition, where

c = [new(D_1) = d_1 ∧ … ∧ new(D_n) = d_n].[14]

If

(1) old(D_i | new(D_i) = d_i) = d_i

and

(2) old(H | D_i ∧ c) = old(H | D_i),

then

(3) old(H | c) = d_1·old(H|D_1) + … + d_n·old(H|D_n).

In the presence of the certainty condition, new(c) = 1, the invariance condition new(H|c) = old(H|c) reduces to new(H) = old(H|c) and we have probability kinematics,

(4) new(H) = d_1·old(H|D_1) + … + d_n·old(H|D_n).
Perhaps c could be regarded as an experiential proposition in some soft sense. It may be thought to satisfy the certainty and invariance conditions, new(c) = 1 and new(H|c) = old(H|c), and it does stand outside the normal run of propositions to which we assign probabilities, as do the presumed experiential propositions with which we might hope to cash those heavily context-dependent epistemological checks that begin with 'looks'. The price of that trip would be a tricky expansion of the probability assignment, from the ("first-order") objective propositions such as diagnoses and prognoses that figure in the kinematical update formula to subjective propositions of the second order (c and new(D_i) = d_i) and third (old(c) = 1) and beyond.[15]
[13] Brian Skyrms, Causal Necessity (Yale, 1980), Appendix 2.
[14] Such boxes are equivalent to fore-and-aft parentheses, and easier to parse.
Chapter 4
Expectation Primer
Your expectation, ex(X), of an unknown number, X, is usually defined as a weighted average of the numbers you think X might be, in which the weights are your probabilities for X's being those numbers: ex is usually defined in terms of pr. But in many ways the opposite order is preferable, with pr defined in terms of ex as in this chapter.[1]
4.1 Probability and Expectation
If you see only finitely many possible answers, x_1, …, x_n, to the question of what number X is, your expectation can be defined as follows:

(1) ex(X) = Σ_{i=1}^n p_i x_i, where p_i = pr(X = x_i).
Example, X = The year Turing died. If I am sure it was one of the years from 1950 to 1959, with equal probabilities, then my ex(X) will be 1950/10 + 1951/10 + … + 1959/10, which works out to be 1954.5. And if you are sure Turing died in 1954, then for you n = 1, p_1 = 1 and ex(X) = 1954.
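Definition (1) is easy to put into code. The sketch below is illustrative: the `ex` helper and the (value, probability) encoding are conventions of this example, not the book's.

```python
# Definition (1) as code: ex(X) = sum_i p_i * x_i, with p_i = pr(X = x_i).

def ex(dist):
    """Expectation of a distribution given as (value, probability) pairs."""
    return sum(p * x for x, p in dist)

# Equal probabilities 1/10 over the years 1950..1959:
turing = [(year, 1 / 10) for year in range(1950, 1960)]
ex_turing = ex(turing)              # 1954.5, as in the text

# Certainty that the year is 1954 (n = 1, p_1 = 1):
ex_certain = ex([(1954, 1.0)])      # 1954
```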
These numbers X are called random variables, R.V.’s. Since an R.V.
is always some particular number, random variables are really constants—
known or unknown. The phrase ‘the year Turing died’ is a constant which,
[1] —and as in the earliest textbook of probability theory: Christiaan Huygens, De Ratiociniis in Ludo Aleae, 1657 ("On Calculating in Games of Luck"). Notable 20th-century expectation-based treatments are Bruno de Finetti's Teoria delle Probabilità, v. 1, Einaudi, 1970 (= Theory of Probability, Wiley, 1975) and Peter Whittle's Probability via Expectation, Springer, 4th ed., 2000.
[Figure (sec. 4.2): a ticket worth $X if H is true and worth $p, its price, if H is false.]
CHAPTER 4. EXPECTATION PRIMER 67
as a matter of empirical fact, denotes the number 1954, quite apart from
whether you or I know it.
We could have taken expectation as the fundamental concept, and deﬁned
probabilities as expectations of particularly simple R.V.’s, called ‘indicators’.
The indicator of a hypothesis H is a constant, I_H, which is 1 if H is true and 0 if H is false. Now probabilities are expectations of indicators,

(2) pr(H) = ex(I_H).

This is equivalent to definition (1).
Definition (2) is quite general. In contrast, definition (1) works only in cases where you are sure that X belongs to a particular finite list of numbers. Of course, definition (1) can be generalized, but this requires relatively advanced mathematical concepts which definition (2) avoids, or, anyway, sweeps under the carpet.[2]
Observe that expectations of R.V.'s need not be values those R.V.'s can have. In the Turing example my expectation of the year of Turing's death was 1954.5, which is not the number of a year. Nor need your expectation of the indicator of past life on Mars be one of the values, 0 or 1, that indicators can assume; it may well be 1/10, as in the story in sec. 1.1.
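Definition (2) can be checked on a toy outcome space. This sketch assumes nothing beyond the text: the die-roll space and the `ex` helper are illustrative conventions.

```python
# Probabilities as expectations of indicators, per definition (2):
# pr(H) = ex(I_H). Toy outcome space: a fair die; H = "the roll is even".

space = [(w, 1 / 6) for w in range(1, 7)]

def ex(rv):
    """Expectation of a random variable over the weighted outcome space."""
    return sum(p * rv(w) for w, p in space)

indicator_even = lambda w: 1 if w % 2 == 0 else 0   # I_H
pr_even = ex(indicator_even)                         # = 1/2
```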
4.2 Conditional Expectation
Just as we defined your conditional probabilities as your buying-or-selling prices for tickets that represent conditional bets, so we define your conditional expectation of a random variable X given truth of a statement H as your buying-or-selling price for a ticket that is worth $X if H is true and worth its price, $p, back if H is false. We write 'ex(X|H)' for your conditional expectation of X given H.
[2] For example, the Riemann-Stieltjes integral: if you think X might be any real number from a to b, then (1) becomes: ex(X) = ∫_a^b x dF(x), where F(x) = pr(X < x).
[Figure: five tickets, (a)-(e), whose payoffs depend on H and on X; they are compared in the Dutch book argument below.]
The following rule might be viewed as defining conditional expectations as quotients of unconditional ones when pr(H) ≠ 0, just as the quotient rule for probabilities might be viewed as defining pr(G|H) as pr(GH)/pr(H).

Quotient Rule: ex(X|H) = ex(X·I_H)/ex(I_H) = ex(X·I_H)/pr(H) if pr(H) ≠ 0.

Of course this can be written as follows, without the condition pr(H) ≠ 0.

Product Rule: ex(X·I_H) = ex(X|H)·pr(H)
A "Dutch book" consistency argument can be given for this product rule rather like the one given in sec. 1.4 for the probability product rule: Consider the five tickets, (a)-(e), shown in the figure. Clearly, ticket (a) has the same dollar value as (b) and (c) together on each hypothesis as to whether H is true or false. And (d) and (e) have the same values as (b) and (c), since X·I_H = X if H is true and 0 if H is false, and (e) = (c). Then unless your price for (a) is the sum of your prices for (d) and (e), so that the condition ex(X|H) = ex(X·I_H) + ex(X|H)·pr(¬H) is met, you are inconsistently placing different values on the same prospect depending on whether it is described in one or the other of two provably equivalent ways. Now setting pr(¬H) = 1 − pr(H) and simplifying, the condition boils down to the product rule.
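The Quotient and Product Rules can be verified by brute-force enumeration. The fair die, the choice of X, and the hypothesis H below are illustrative assumptions of this sketch.

```python
# The Quotient and Product Rules checked on a fair die:
# ex(X|H) = ex(X * I_H) / pr(H), equivalently ex(X * I_H) = ex(X|H) * pr(H).

space = [(w, 1 / 6) for w in range(1, 7)]
X = lambda w: w                       # the number rolled
H = lambda w: w >= 5                  # hypothesis: the roll is 5 or 6

def ex(rv):
    return sum(p * rv(w) for w, p in space)

pr_H = ex(lambda w: 1 if H(w) else 0)                 # 1/3
ex_X_times_IH = ex(lambda w: X(w) if H(w) else 0)     # ex(X * I_H) = 11/6
ex_X_given_H = ex_X_times_IH / pr_H                   # Quotient Rule: 5.5
product_rhs = ex_X_given_H * pr_H                     # Product Rule right side
```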
Historical Note. In the 18th century Thomas Bayes deﬁned probability in
terms of expectation as follows:
"The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."[3]

This is simply the product rule, solved for pr(H) = ex(X·I_H)/ex(X|H). Explanation: ex(X·I_H) is "an expectation [$X] depending on the happening of the event" and ex(X|H) is "the value of the thing expected upon its [H's] happening".
4.3 Laws of Expectation

The axioms for expectation can be taken to be the product rule and

(3) Linearity: ex(aX + bY + c) = a·ex(X) + b·ex(Y) + c,

where the small letters stand for numerals (= random variables whose values everybody who understands the language is supposed to know). Three notable special cases of linearity are obtained if we replace 'a, b, c' by '1, 1, 0' (additivity) or 'a, 0, 0' (proportionality) or '0, 0, c' (constancy, then proportionality with 'c' in place of 'a'):

(4) Additivity: ex(X + Y) = ex(X) + ex(Y)
    Proportionality: ex(aX) = a·ex(X)
    Constancy: ex(c) = c
If n hypotheses H_i are well-defined, they have actual truth values (truth or falsehood) even if we do not know what they are. Therefore we can define the number of truths among them as I_{H_1} + … + I_{H_n}. If you have definite expectations ex(I_{H_i}) for them, you will have a definite expectation for that sum, which, by additivity, will be ex(I_{H_1}) + … + ex(I_{H_n}).
Example 1, Calibration. Suppose you attribute the same probability, p, to success on each trial of a certain experiment: pr(H_1) = … = pr(H_n) = p. Consider the indicators of the hypotheses, H_i, that the different trials succeed. The number of successes in the n trials will be the unknown sum of the indicators, so the unknown success rate will be S_n = (1/n) Σ_{i=1}^n I_{H_i}. Now by additivity and constancy, your expectation of the success rate must be p.[4]
[3] 'Essay toward solving a problem in the doctrine of chances,' Philosophical Transactions of the Royal Society 50 (1763), p. 376, reprinted in Facsimiles of Two Papers by Bayes, New York: Hafner, 1963.
[4] ex((1/n) Σ_{i=1}^n I_{H_i}) = (1/n) Σ_{i=1}^n ex(I_{H_i}) = (1/n) Σ_{i=1}^n pr(H_i) = (1/n)(np) = p.
The term 'calibration' comes from the jargon of weather forecasting; forecasters are said to be well calibrated — say, last year, for rain, at the level p = .8 — if last year it rained on about 80% of the days for which the forecaster turns out to have said there would be an 80% chance of rain.[5]
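The computation in footnote 4 can be checked concretely. In this sketch the trials happen to be independent, but that is only a convenience for building a joint distribution: the result uses nothing beyond pr(H_i) = p for each trial.

```python
# Footnote 4's computation, checked by enumeration: if pr(H_i) = p for
# every trial, your expectation of the success rate (1/n) * sum_i I_Hi
# is p, whatever the joint distribution.

from itertools import product

p, n = 0.8, 3
joint = {}                     # outcome -> probability, outcomes in {0,1}^n
for outcome in product([0, 1], repeat=n):
    weight = 1.0
    for bit in outcome:
        weight *= p if bit else 1 - p
    joint[outcome] = weight

ex_success_rate = sum(w * sum(outcome) / n for outcome, w in joint.items())
```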
Why Expectations are Additive. Suppose x and y are your expectations of R.V.'s X and Y — say, rainfall in inches during the first and second halves of next year — and z is your expectation for next year's total rainfall, X + Y. Why should z be x + y? The answer is that in every eventuality about rainfall in the two half-years, the value of the first two of these tickets together is the same as the value of the third:

Values: $X + $Y = $(X + Y)
Prices: $x + $y = $(x + y)

Then unless the prices you would pay for the first two add up to the price you would pay for the third, you are inconsistently placing different values on the same prospect, depending on whether it is described to you in one or the other of two provably equivalent ways.
Typically, a random variable might have any one of a number of values as far as you know. Convexity is a consequence of linearity according to which your expectation for the R.V. cannot be larger than all of those values, or smaller than all of them:

Convexity: If you are sure that X is one of a finite collection of numbers, ex(X) lies in the range from the largest to the smallest of them.
Another connection between conditional and unconditional expectations:

Law of Total Expectation. If no two of H_1, …, H_n are compatible and collectively they are exhaustive, then

ex(X) = Σ_{i=1}^n ex(X|H_i)·pr(H_i).

Proof: X = Σ_{i=1}^n X·I_{H_i} is an identity between R.V.'s. Apply ex to both sides, then use additivity and the product rule.
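The Law of Total Expectation can be checked by enumeration. The fair die, the choice of X, and the three-cell partition below are illustrative assumptions of this sketch.

```python
# Law of Total Expectation checked on a fair die:
# ex(X) = sum_i ex(X|H_i) * pr(H_i) for a partition H_1, H_2, H_3.

space = [(w, 1 / 6) for w in range(1, 7)]        # fair die
X = lambda w: w * w                               # an arbitrary R.V.
partition = [lambda w: w <= 2, lambda w: 3 <= w <= 4, lambda w: w >= 5]

def ex(rv):
    return sum(p * rv(w) for w, p in space)

total = 0.0
for H in partition:
    pr_H = ex(lambda w: 1 if H(w) else 0)
    ex_X_given_H = ex(lambda w: X(w) if H(w) else 0) / pr_H
    total += ex_X_given_H * pr_H

unconditional = ex(X)                             # 91/6, equal to `total`
```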
When conditions are certainties for you, conditional expectations reduce
to unconditional ones:
[5] For more about calibration, etc., see Morris DeGroot and Stephen Fienberg, "Assessing Probability Assessors: Calibration and Refinement," in Shanti S. Gupta and James O. Berger (eds.), Statistical Decision Theory and Related Topics III, Vol. 1, New York: Academic Press, 1982, pp. 291-314.
[Figure (sec. 4.4, hydraulic analogy): gains 2, 1, 1, 0 in the four F/S regions; conditional expectations 1.5 and 0.5, unconditional expectation 1.]
Certainty: ex(X|H) = ex(X) if pr(H) = 1.

Note that in a context of form ex(Y | Y = X) it is always permissible to rewrite Y as X at the left. Example: ex(Y² | Y = 2X) = ex(4X² | Y = 2X). Applying conditions:

ex(Y | Y = X) = ex(X | Y = X) (OK!)

But we cannot generally discharge a condition Y = X by rewriting Y as X at the left and dropping the condition. Example: ex(Y² | Y = 2X) cannot be relied upon to equal ex(4X²).

The Discharge Fallacy: ex(Y | Y = X) = ex(X) (NOT!)
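The fallacy shows up numerically. The joint distribution below is an invented illustration (not from the text), chosen so that conditioning on Y = 2X actually changes the distribution of X.

```python
# The Discharge Fallacy numerically: ex(Y^2 | Y = 2X) need not equal
# ex(4X^2). The joint distribution is an invented illustration.

joint = {                    # (x, y) -> probability
    (1, 2): 0.3, (1, 6): 0.2,
    (3, 2): 0.4, (3, 6): 0.1,
}

def ex(f, dist):
    return sum(p * f(x, y) for (x, y), p in dist.items())

# Condition on the event Y = 2X and renormalize:
cond = {xy: p for xy, p in joint.items() if xy[1] == 2 * xy[0]}
z = sum(cond.values())
cond = {xy: p / z for xy, p in cond.items()}

lhs = ex(lambda x, y: y * y, cond)         # ex(Y^2 | Y = 2X) = 12
rhs = ex(lambda x, y: 4 * x * x, joint)    # ex(4X^2) = 20: condition dropped
```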
Example 2, The problem of the two sealed envelopes. One contains a check for an unknown whole number of dollars, the other a check for twice or half as much. Offered a free choice, you pick one at random. What is wrong with the following argument for thinking you should have chosen the other?

"Let X and Y be the values of the checks in the one and the other. As you think Y equally likely to be .5X or 2X, ex(Y) will be .5 ex(.5X) + .5 ex(2X) = 1.25 ex(X), which is larger than ex(X)."
4.4 Median and Mean
Hydraulic Analogy. Let ‘F’ and ‘S’ mean heads on the ﬁrst and second tosses
of an ordinary coin. Suppose you stand to gain a dollar for each head. Then
your net gain in the four possibilities for truth and falsity of F and S will be
as shown in the figure.
Think of that as a map of flooded walled fields in a plain, with the numbers indicating water depths in the four sections—e.g., the depth is X = 2
[Figure: a pound of clay distributed along a beam; axis marked 0, 8, 20, labeled 'Months lived after diagnosis'.]
throughout the F ∧ S region.[6] In the four regions, depths are values of X and areas are probabilities. To find your conditional expectation for X given F, remove the wall between the two sections of F so that the water reaches a single level in the two. That level will be ex(X|F), which is 1.5 in the diagram. Similarly, removing the wall between the two sections of ¬F, your conditional expectation for X given ¬F is seen to be ex(X|¬F) = 0.5. To find your unconditional expectation of gain, remove both walls so that the water reaches the same level throughout: ex(X) = 1.
But there is no mathematical reason for magnitudes X to have only a finite number of values; e.g., we might think of X as the birth weight in pounds of the next giant panda to be born in captivity—to no end of decimal places of accuracy, as if that meant something.[7] Nor is there any reason why X cannot have negative values, as in the case of temperature.
Balance. The following analogy is more easily adapted to the continuous
case. On a weightless rigid beam, positions represent values of a magnitude X
that might go negative as well as positive. Pick a zero, a unit, and a positive
direction on the beam. Get a pound of modelling clay, and distribute it
along the beam so that the weight of clay on each section represents your
probability that the true value of X is in that section—say, like this:
Locations on the beam are months lived after diagnosis; the weight of clay
on the interval from 0 to m is the probability of still being alive in m months.
"The Median is not the Message."[8] "In 1982, I learned I was suffering from a rare and serious cancer. After surgery, I asked my doctor what the best technical literature on the cancer was. She told me ... that there was nothing really worth reading. I soon realized why she had offered that humane advice: my cancer is incurable, with a median mortality of eight months after discovery."
In terms of the beam analogy, here are the key definitions of the terms "median" and "mean"—the latter being a synonym for "expectation":
[6] And therefore, ex(X|F ∧ S) = 2.
[7] It does not. The commonplace distinction between panda and ambient moisture, dirt, etc. is not drawn finely enough to let us take the remote decimal places seriously.
[8] See Stephen Jay Gould, Full House (New York, 1996), sec. 4.
The median is the point on the beam that divides the weight of
clay in half: the probabilities are equal that the true value of X is
represented by a point to the right and to the left of the median.
The mean (= your expectation) is the point of support at which
the beam would just balance.
Gould continues:
“The distribution of variation had to be right skewed, I reasoned.
After all, the left of the distribution contains an irrevocable lower
boundary of zero (since mesothelioma can only be identiﬁed at
death or before). Thus there isn’t much room for the distribution’s
lower (or left) half—it must be scrunched up between zero and
eight months. But the upper (or right) half can extend out for
years and years, even if nobody ultimately survives.”
As this probability distribution is skewed (stretched out) to the right, the
median is to the left of its mean; Gould’s life expectancy is greater than
8 months. (The mean of 20 months suggested in the graph is my ignorant
invention.)
The effect of skewness can be seen especially clearly in the case of discrete distributions like the following. Observe that if the right-hand weight is pushed further right the mean will follow, while the median remains between the second and third blocks.
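The point can be checked in code. The weights below are an invented illustration in the spirit of the discrete figure; only the skewness behavior matters.

```python
# Median vs. mean for right-skewed discrete distributions: pushing the
# right-hand weight further right drags the mean but not the median.

def mean(dist):
    return sum(p * x for x, p in dist)

def median(dist):
    cum = 0.0
    for x, p in sorted(dist):     # smallest value at which half the
        cum += p                  # probability has accumulated
        if cum >= 0.5:
            return x

skewed = [(1, 0.25), (2, 0.25), (3, 0.25), (20, 0.25)]
more_skewed = [(1, 0.25), (2, 0.25), (3, 0.25), (80, 0.25)]
# medians agree at 2; the means are 6.5 and 21.5
```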
4.5 Variance
The variance, defined as[9]

var(X) = ex((X − ex(X))²),

is one of the two commonly used measures of your uncertainty about the value of X. The other is the "standard deviation": σ(X) = √var(X). In
[9] Here and below the paradigm for 'ex²X' is 'sin²x' in trigonometry.
terms of the physical analogies in sec. 4.4, these measures of uncertainty correspond to the average spread of mass away from the mean. The obvious measure of this spread is your expectation ex(|X − ex(X)|) of the absolute value of the difference, but the square, which is like the absolute value in being nonnegative, is easier to work with, mathematically; thus, the variance. The move to the standard deviation counteracts the distortion in going from the absolute value to the square.
The definition of the variance can be simplified by writing the right-hand side as the expectation of the square minus the square of the expectation:[10]

(1) var(X) = ex(X²) − ex²(X)

Note that in case X is an indicator, X = I_H, we have

var(I_H) = pr(H)·pr(¬H) ≤ 1/4.

Proof. The inequality pr(H)·pr(¬H) ≤ 1/4 follows from the fact that p(1 − p) reaches its maximum value of 0.25 when p = 0.5, as shown in the graph. And the equality var(I_H) = pr(H)·pr(¬H) follows from (1) because (I_H)² = I_H (since I_H can only assume the values 0 and 1) and ex(I_H) = pr(H).
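Both facts are easy to verify numerically. The small distributions below are invented illustrations.

```python
# Checking (1), var(X) = ex(X^2) - ex^2(X), and the indicator fact
# var(I_H) = pr(H) * pr(not-H) <= 1/4.

def ex(dist):
    return sum(p * x for x, p in dist)

def var(dist):                    # expectation of the squared deviation
    m = ex(dist)
    return sum(p * (x - m) ** 2 for x, p in dist)

dist = [(0, 0.2), (1, 0.5), (4, 0.3)]
rhs = ex([(x * x, p) for x, p in dist]) - ex(dist) ** 2   # ex(X^2) - ex^2(X)

p_H = 0.3
var_indicator = var([(1, p_H), (0, 1 - p_H)])             # 0.3 * 0.7 = 0.21
```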
In contrast to the linearity of expectations, variances are nonlinear: in place of ex(aX + b) = a·ex(X) + b we have

(2) var(aX + b) = a²·var(X).
And in contrast to the unrestricted additivity of expectations, variances are
[10] Proof. (X − ex(X))² = X² − 2X·ex(X) + ex²(X), where ex(X) and its square are constants. By linearity of ex your expectation of this sum = ex(X²) − 2ex²(X) + ex²(X) = ex(X²) − ex²(X).
additive only under special conditions—e.g., pairwise uncorrelation, which is defined as follows:

You see X_1, …, X_n as pairwise uncorrelated if and only if your ex(X_i X_j) = ex(X_i)·ex(X_j) whenever i ≠ j.
Now to be a bit more explicit, this is what we want to prove:

(3) var(Σ_{i=1}^n X_i) = Σ_{i=1}^n var(X_i), given pairwise uncorrelation of X_1, …, X_n.
Proof. Recall that by (1) and additivity of ex,

var(Σ_i X_i) = ex((Σ_i X_i)²) − ex²(Σ_i X_i) = ex((Σ_i X_i)²) − (Σ_i ex(X_i))².

Multiply out the squared sums and use ex additivity and pairwise noncorrelation to get

Σ_i ex(X_i²) + Σ_{i≠j} ex(X_i)ex(X_j) − Σ_i ex²(X_i) − Σ_{i≠j} ex(X_i)ex(X_j).

Cancel identical terms with opposite signs and regroup to get Σ_i (ex(X_i²) − ex²(X_i)). By (1), this = Σ_i var(X_i).
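Result (3) can be confirmed by enumeration. The two independent (hence pairwise uncorrelated) die rolls below are an illustrative choice.

```python
# Variance additivity (3) checked by enumeration: two independent
# (hence pairwise uncorrelated) fair-die rolls.

from itertools import product

space = [(w, 1 / 36) for w in product(range(1, 7), repeat=2)]

def ex(rv):
    return sum(p * rv(w) for w, p in space)

def var(rv):
    m = ex(rv)
    return ex(lambda w: (rv(w) - m) ** 2)

X1 = lambda w: w[0]
X2 = lambda w: w[1]
total_var = var(lambda w: w[0] + w[1])        # var(X1 + X2)
sum_of_vars = var(X1) + var(X2)               # each summand is 35/12
```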
The covariance of two R.V.'s is defined as

cov(X, Y) = ex[(X − ex(X))(Y − ex(Y))] = ex(XY) − ex(X)ex(Y),

so that the covariance of an R.V. with itself is the variance, cov(X, X) = var(X). The coefficient of correlation between X and Y is defined as

ρ(X, Y) = cov(X, Y)/(σ(X)σ(Y)),

where σ is the positive square root of the variance. Noncorrelation, ρ = 0, is a consequence of independence, but it is a weaker property, which does not imply independence; see supplement 4.5 at the end of this chapter.
4.6 A Law of Large Numbers
Laws of large numbers tell you how to expect the "sample average" of the actual values of a number of random variables to behave as that number increases without bound:

"Sample average": S_n = (1/n) Σ_{i=1}^n X_i.
The basic question is how your expectations of sample averages of large finite numbers of random variables must behave. Here we show that, under certain rather broad conditions, your expectation of the squared difference (S_n − S_{n′})² between a sample average and its prolongation will go to 0 as both their lengths, n and n′, increase without bound:

If the random variables X_i all have the same, finite expectations, ex(X_i) = m, variances var(X_i) = σ², and pairwise correlation coefficients ρ(X_i, X_j) = r for i ≠ j, then for all positive n and n′,

(4) ex((S_n − S_{n′})²) = (1/n − 1/n′)σ²(1 − r).
Proof.

(S_n − S_{n′})² = ((1/n) Σ_{i=1}^n X_i)² − 2((1/n) Σ_{i=1}^n X_i)((1/n′) Σ_{i=1}^{n′} X_i) + ((1/n′) Σ_{i=1}^{n′} X_i)².
Multiplying this out we get terms involving (X_i)² and the rest involving X_i X_j with i ≠ j. Since m² + σ² = ex(X_i²) and m² + rσ² = ex(X_i X_j) for i ≠ j, we can eliminate all of the X's and get formula (4) via some hideous algebra.[11]
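Formula (4) can be checked exactly on a small correlated family. The construction below (X_i = Z + W_i with fair ±1 coins) is my own illustration, not the book's: it gives m = 0, σ² = 2, and r = 1/2 for every pair.

```python
# Formula (4) checked by exact enumeration on an invented construction:
# X_i = Z + W_i with Z, W_1..W_3 independent fair +/-1 coins. With
# n = 1, n' = 3, (4) predicts
# ex((S_1 - S_3)^2) = (1/1 - 1/3) * 2 * (1 - 1/2) = 2/3.

from itertools import product

total = 0.0
for z, w1, w2, w3 in product([-1, 1], repeat=4):   # 16 equiprobable cases
    x = [z + w1, z + w2, z + w3]
    s1 = x[0]                                       # S_1
    s3 = sum(x) / 3                                 # S_3, the prolongation
    total += (s1 - s3) ** 2 / 16

predicted = (1 / 1 - 1 / 3) * 2 * (1 - 1 / 2)
```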
4.7 Supplements
4.1 Markov's Inequality: pr(X ≥ ε) ≤ ex(X)/ε if ε > 0 and X ≥ 0.

This provides a simple way of getting an upper bound on your probabilities that nonnegative random variables X for which you have finite expectations are at least ε away from 0. Proof of the inequality. Let Y = ε when X ≥ ε and Y = 0 when X < ε. Then by the law of total expectation, ex(Y) = ε·pr(X ≥ ε) + 0·pr(X < ε) = ε·pr(X ≥ ε), so that pr(X ≥ ε) = ex(Y)/ε. Now since Y ≤ X both when X ≥ ε and when X < ε, we have ex(Y) ≤ ex(X) and, so, Markov's inequality.
4.2 Chebyshev's Inequality: pr(|X − ex(X)| ≥ ε) ≤ var(X)/ε² if ε > 0.

Proof. Since squares are never negative, the Markov inequality implies that pr(X² ≥ ε) ≤ ex(X²)/ε if ε > 0; and from this, since |X| ≥ ε if and only if X² ≥ ε², we have pr(|X| ≥ ε) ≤ ex(X²)/ε² if ε > 0. From this, with 'X − ex(X)' for 'X', we have Chebyshev's inequality.
[11] See Bruno de Finetti, Theory of Probability, volume 2, pp. 215-6. For another treatment, see pp. 84-85 of de Finetti's 'Foresight' in Henry Kyburg and Howard Smokler (eds.), Studies in Subjective Probability, Krieger, Huntington, N.Y., 1980.
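Both inequalities can be checked against exact tail probabilities. The nonnegative distribution below is an invented illustration.

```python
# Markov's and Chebyshev's inequalities checked against exact tail
# probabilities for a small nonnegative distribution.

dist = [(0, 0.4), (1, 0.3), (2, 0.2), (10, 0.1)]
mean = sum(p * x for x, p in dist)                        # 1.7
variance = sum(p * (x - mean) ** 2 for x, p in dist)

eps = 2.0
pr_tail = sum(p for x, p in dist if x >= eps)             # pr(X >= eps)
markov_bound = mean / eps                                  # ex(X)/eps

pr_dev = sum(p for x, p in dist if abs(x - mean) >= eps)  # pr(|X - ex(X)| >= eps)
chebyshev_bound = variance / eps ** 2                      # var(X)/eps^2
```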
4.3 Many Pairwise Uncorrelated R.V.'s. An updated version of the oldest law of large numbers (1713, James Bernoulli's) applies to random variables X_1, …, X_n which you see as pairwise uncorrelated and as having a common, finite upper bound, b, on their variances. Here there will be a finite upper bound b/nε² on your probability that the sample average will differ by more than ε from your expectation of that average. The bound b on the variances of the X's is independent of the sample size, n, so that for large enough samples your probability that the error |S_n − ex(S_n)| in your expectation is ε or more gets as small as you please.
For pairwise uncorrelated R.V.'s X_i whose variances are all ≤ b:

(5) For any ε > 0: pr(|S_n − ex(S_n)| ≥ ε) ≤ b/nε².

This places an upper limit of b/nε² on your probability that the sample average differs from your expectation of it by ε or more.
Proof: By (3), var(S_n) = (1/n²) Σ_{i=1}^n var(X_i). Since each var(X_i) is ≤ b, this is ≤ (1/n²)(nb) = b/n. Setting X = S_n in Chebyshev's inequality, we have (5).
Note that if the X_i are indicators of propositions H_i, sample averages S_n are relative frequencies R_n of truth:

R_n = (1/n) Σ_{i=1}^n I_{H_i} = (1/n)(the number of truths among H_1, …, H_n).
4.4 Corollary: Bernoulli trials. These are sequences of propositions H_i (say, "head on the i'th toss") where for some p (= 1/2 for coin-tossing) your probability for the truth of any n distinct H's is pⁿ.

Corollary. Suppose that, as in the case of Bernoulli trials, for all distinct i and j from 1 to n: pr(H_i) = pr(H_1) and pr(H_i ∧ H_j) = pr²(H_1). Then

(6) For any ε > 0, pr(|R_n − pr(H_1)| ≥ ε) ≤ 1/4nε².

This places a limit of 1/4nε² on your pr for a deviation of ε or more between your pr(H_i) and the actual relative frequency of truths in n trials. Thus, in n = 25,000 tosses of a fair coin, your probability that the success rate will be within ε = .01 of 0.50 will fall short of 1 by at most 1/4nε² = 10%.
Proof. Since ex(X_i) = ex(I_{H_i}) = pr(H_i) = p, linearity of ex says that ex(S_n) = (1/n)·np = p. Now by (2) we can set b = 1/4 in (5).
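The arithmetic of the 25,000-toss case is one line of code:

```python
# The bound in (6), 1/(4 * n * eps^2), evaluated for the case in the
# text: n = 25,000 fair-coin tosses with eps = .01 give a 10% bound.

def bernoulli_bound(n, eps):
    return 1 / (4 * n * eps ** 2)

bound = bernoulli_bound(25_000, 0.01)   # 0.1
```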
4.5 Noncorrelation and independence. Random variables X, Y are said to be uncorrelated from your point of view when your expectation of their product is simply the product of your separate expectations (sec. 4.5). In the special case where X and Y are the indicators of propositions, noncorrelation reduces to independence:

ex(XY) = ex(X)ex(Y) iff pr(X = x ∧ Y = y) = pr(X = x)·pr(Y = y) for all four pairs (x, y) of zeroes and ones, or even for the single pair X = Y = 1.

But in general the uncorrelated pairs of random variables are not simply the independent ones, as the following extreme example demonstrates.
Example, Noncorrelation in spite of deterministic dependency.[12] You have probability 1/4 for each of X = −2, −1, 1, 2; and Y = X², so that you have probability 1/2 for each possible value of Y (= 1 or 4), and probability 1/4 for each conjunction X = x ∧ Y = x² with x = −2, −1, 1, 2. Then you see X and Y as uncorrelated, for ex(X) = 0 and ex(XY) = (−8 − 1 + 1 + 8)/4 = 0 = ex(X)ex(Y). Yet pr(X = 1 ∧ Y = 1) = 1/4 ≠ pr(X = 1)·pr(Y = 1) = (1/4)(1/2).
Then noncorrelation need not imply independence. But independence does always imply noncorrelation. In particular, in cases where X and Y take only finitely many values, we have:[13]

If pr(X = x ∧ Y = y) = pr(X = x)·pr(Y = y) for all values x of X and y of Y, then ex(XY) = ex(X)ex(Y).
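Feller's example transcribes directly into code; only the encoding of the joint distribution is my convention.

```python
# Feller's example: X uniform on {-2, -1, 1, 2} and Y = X^2.
# X and Y are uncorrelated yet deterministically dependent.

pairs = [((x, x * x), 1 / 4) for x in (-2, -1, 1, 2)]

def ex(f):
    return sum(p * f(x, y) for (x, y), p in pairs)

ex_X = ex(lambda x, y: x)            # 0
ex_Y = ex(lambda x, y: y)            # 2.5
ex_XY = ex(lambda x, y: x * y)       # 0 = ex(X) * ex(Y): uncorrelated

pr_11 = sum(p for (x, y), p in pairs if (x, y) == (1, 1))   # 1/4
pr_X1 = sum(p for (x, y), p in pairs if x == 1)             # 1/4
pr_Y1 = sum(p for (x, y), p in pairs if y == 1)             # 1/2
# 1/4 != (1/4)(1/2), so X and Y are not independent
```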
[12] From William Feller, An Introduction to Probability Theory and its Applications, vol. 1, 2nd edition (1957), p. 222, example (a).
[13] Proof. By formula (1) in sec. 4.1, ex(XY) = Σ_{i,j} x_i y_j pr(X = x_i ∧ Y = y_j), or, by independence, Σ_{i,j} x_i y_j pr(X = x_i)·pr(Y = y_j). By the same formula, ex(X)ex(Y) = {Σ_i x_i pr(X = x_i)}{Σ_j y_j pr(Y = y_j)}, which = Σ_{i,j} x_i y_j pr(X = x_i)·pr(Y = y_j) again.
Chapter 5
Updating on Statistics
5.1 Where do Probabilities Come From?
Your "subjective" probability is not something fetched out of the sky on a whim;[1] it is your actual judgment, normally representing what you think your judgment should be, in view of your information to date and of your sense of other people's information, even if you do not regard it as a judgment that everyone must share on pain of being wrong in one sense or another.
But of course you are not always clear about what your judgment is,
or should be. The most important questions in the theory of probability
concern ways and means of constructing reasonably satisfactory probability
assignments to ﬁt your present state of mind. (Think: trying on shoes.) For
this, there is no overarching algorithm. Here we examine two answers to these questions that were floated by Bruno de Finetti in the decade from (roughly) 1928 to 1938. The second of them, "Exchangeability,"[2] postulates a definite sort of initial probabilistic state of mind, which is then updated by conditioning on statistical data. The first ("Minimalism") is more primitive: the input probability assignment will have large gaps, and the output will not arise via conditioning.
[1] In common usage there is some such suggestion, and for that reason I would prefer to speak of 'judgmental' probability. But in the technical literature the term 'subjective' is well established—and this is in part because of the lifelong pleasure de Finetti found in being seen to give the finger to the establishment.
[2] And, more generally, "partial" exchangeability (see 5.2 below), of which simple exchangeability, which got named earlier, is a special case.
CHAPTER 5. UPDATING ON STATISTICS 80
5.1.1 Probabilities from Statistics: Minimalism
Statistical data are a prime determinant of subjective probabilities; that is
the truth in frequentism. But that truth must be understood in the light of
certain features of judgmental probabilizing. One such feature that can be of importance is the persistence, as you learn the relative frequency of truths in a sequence of propositions, of your judgment that they all have the same probability. That is a condition under which the following little theorem of the probability calculus can be used to generate probability judgments.[3]
Law of Little Numbers. In a ﬁnite sequence of propositions
that you view as equiprobable, if you are sure that the relative
frequency of truths is p, then your probability for each is p.
Then if, judging a sequence of propositions to be equiprobable, you learn the relative frequency of truths in a way that does not change your judgment of equiprobability, your probability for each proposition will agree with the relative frequency.[4]
The Law of Little Numbers can be generalized to random variables:

Law of Short Run Averages. In a finite sequence of random variables for which your expectations are equal, if you know only their arithmetical mean, then that is your expectation of each.

Proof. By linearity of ex, if ex(X_i) = p for i = 1, …, n, then p = (1/n) Σ_{i=1}^n ex(X_i).
Then if, while requiring your final expectations for a sequence of magnitudes to be equal, you learn their mean value in a way that does not lead you to change that requirement, your expectation of each will agree with that mean.[5]
[3] See Jeffrey (1992) pp. 59-64. The name "Law of Little Numbers" is a joke, but I know of no generally understood name for the theorem. That theorem, like the next (the "Law of Short Run Averages", another joke) is quite trivial; both are immediate consequences of the linearity of the expectation operator. Chapter 2 of de Finetti (1937) is devoted to them. In chapter 3 he goes on to a mathematically deeper way of understanding the truth in frequentism, in terms of "exchangeability" of random variables (sec. 5.1.2, below).
[4] To appreciate the importance of the italicized caveat, note that if you learn the relative frequency of truths by learning which propositions in the sequence are true, and which false, and as you form your probabilities for those propositions you remember what you have learned, then those probabilities will be zeros and ones instead of averages of those zeros and ones.
[5] If you learn the individual values and calculate the mean as their average without forgetting the various values, you have violated the caveat (unless it happens that all the values were the same), for what you learned will have shown you that they are not equal.
Example. Guessing Weight. Needing to estimate the weight of someone
on the other side of a chain link fence, you select ten people on your side
whom you estimate to have the same weight as that eleventh, persuade them
to congregate on a platform scale, and read their total weight. If the scale
reads 1080 lb., your estimate of the eleventh person's weight will be 108 lb.—if nothing in that process has made you revise your judgment that the eleven weights are equal.[6]
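The example's arithmetic, with the Law of Short Run Averages doing the work, is a one-liner:

```python
# Guessing Weight: judging the eleven weights equal and learning only
# the ten-person total, the mean becomes your estimate for each.

total_scale_reading = 1080          # pounds, from the platform scale
n_on_scale = 10

estimate_each = total_scale_reading / n_on_scale   # 108.0 lb apiece,
# and hence also the estimate for the eleventh person
```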
This is a frequentism in which judgmental probabilities are seen as judgmental expectations of frequencies, and in which the Law of Little Numbers guides the recycling of observed frequencies as probabilities of unobserved instances. It is to be distinguished both from the intelligible but untenable finite frequentism that simply identifies probabilities with actual frequencies (generally, unknown) when there are only finitely many instances overall, and from the unintelligible long-run frequentism that would see the observed instances as a finite fragment of an infinite sequence in which the infinitely long run inflates expectations into certainties that sweep judgmental probabilities under the endless carpet.
5.1.2 Probabilities from Statistics: Exchangeability[7]

On the hypotheses of (a) equiprobability and (b) certainty that the relative frequency of truths is r, the Law of Little Numbers identified the probability as r. Stronger conclusions follow from the stronger hypothesis of exchangeability:

You regard propositions H_1, . . . , H_n as exchangeable when, for any particular t of them, your probability that they are all true and the other f = n − t false depends only on the numbers t and f.[8]

Propositions are essentially indicators, random variables taking only the values 0 (falsity) and 1 (truth).
For random variables X_1, . . . , X_n more generally, exchangeability means invariance of pr(X_1 < r_1 ∧ . . . ∧ X_n < r_n) for each n-tuple
6. Note that turning statistics into probabilities or expectations in this way requires neither conditioning nor Bayes's theorem, nor does it require you to have formed particular judgmental probabilities for the propositions or particular estimates for the random variables prior to learning the relative frequency or mean.
7. Bruno de Finetti, Theory of Probability, vol. 2 (Wiley, 1975), pp. 211-224.
8. This comes to the same thing as invariance of your probabilities for Boolean compounds of finite numbers of the H_i under all finite permutations of the positive integers, e.g., P(H_1 ∧ (H_2 ∨ H_3)) = P(H_100 ∧ (H_2 ∨ H_7)).
of real numbers r_1, . . . , r_n and each permutation of the X's that appear in the inequalities.
Here, again, as in sec. 5.1.1, probabilities will be seen to come from statistics—this time, by ordinary conditioning, under the proviso that your prior probability assignment was exchangeable. Exchangeability turns out to be a strong but reasonably common condition under which the hypotheses of the law of large numbers in sec. 4.6 are met.
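Exchangeable assignments of this kind are easy to construct and check mechanically. Here is a minimal sketch (plain Python; the numbers p_t are illustrative and reappear in Example 1 of sec. 5.1.3): it spreads each p_t evenly over the histories with t truths and verifies permutation invariance.

```python
from itertools import product, permutations
from math import comb, isclose

def exchangeable_assignment(p):
    """Spread p[t] = pr(t truths in n trials) evenly over the
    C(n, t) histories with exactly t ones."""
    n = len(p) - 1
    return {h: p[sum(h)] / comb(n, sum(h))
            for h in product((0, 1), repeat=n)}

# Illustrative numbers for n = 3:
pr = exchangeable_assignment([0.1, 0.2, 0.3, 0.4])

# Permuting a history never changes its probability:
ok = all(isclose(pr[h], pr[tuple(h[i] for i in s)])
         for h in pr for s in permutations(range(3)))
print(ok)
```

Any history with two truths gets probability .3/3 = .1, and the eight history probabilities sum to 1.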
5.1.3 Exchangeability: Urn Examples

The following illustration in the case n = 3 will serve to introduce the basic ideas.

Example 1. Three Draws from an Urn of Unknown Composition. The urn contains balls that are red (r) or green (g), in an unknown proportion. If you knew the proportion, that would determine your probability for red. But you don't. Still, you may think that (say) red is the more likely color. How likely? Well, here is one precise probabilistic state of mind that you might be in about it: your probabilities for drawing a total of 0, 1, 2, 3 red balls might be p_0 = .1, p_1 = .2, p_2 = .3, p_3 = .4. And when there is more than one possible order in which a certain number of reds can show up, you will divide your probability among them equally. In other words: you see the propositions H_i (= red on the i'th trial) as exchangeable.
There are eight possibilities about the outcomes of all three trials:

rrr | rrg rgr grr | rgg grg ggr | ggg

The vertical lines divide clumps of possibilities in which the number of reds is 3, 2, 1, 0. In Table 1, the possibilities are listed under 'h' (for 'history') and again under the H's, where the 0's and 1's are truth values (1 = true, 0 = false) as they would be in the eight possible histories. The value of the random variable T_3 ("tally" of the first 3 trials) at the right is the number of r's in the actual history, the number of 1's under the H's in that row. The entries in the rightmost column are the values of p_t taken from Example 1.
 h     H_1  H_2  H_3   pr{h}    t   pr(T_3 = t)
 rrr    1    1    1    p_3      3   .4
 rrg    1    1    0    p_2/3
 rgr    1    0    1    p_2/3    2   .3
 grr    0    1    1    p_2/3
 rgg    1    0    0    p_1/3
 grg    0    1    0    p_1/3    1   .2
 ggr    0    0    1    p_1/3
 ggg    0    0    0    p_0      0   .1

          Table 1. Three Exchangeable Trials
We can think of propositions as identified by the sets of histories in which they would be true. Examples: H_1 = {rrr, rrg, rgr, rgg} and H_1 ∧ H_2 ∧ H_3 = {rrr}. And by additivity, your probability for "same outcome on the first two trials" is

pr{rrr, rrg, ggr, ggg} = p_3 + p_2/3 + p_1/3 + p_0 = .4 + .3/3 + .2/3 + .1 = 2/3.
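That additive computation can be checked mechanically. A sketch, using the illustrative p_t of Example 1:

```python
from itertools import product
from math import comb, isclose

p = [0.1, 0.2, 0.3, 0.4]                   # p_t = pr(T_3 = t), Example 1
histories = list(product("gr", repeat=3))  # g = green, r = red

def pr(event):
    """Add up the probabilities of the histories in the event;
    a history with t reds gets p[t] / C(3, t), as in Table 1."""
    return sum(p[h.count("r")] / comb(3, h.count("r")) for h in event)

same_first_two = [h for h in histories if h[0] == h[1]]
print(round(pr(same_first_two), 4))        # p_3 + p_2/3 + p_1/3 + p_0 = 0.6667
```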
Evidently, exchangeability undoes the dreaded "combinatorial explosion":

Combinatorial Implosion. Your probabilities for all 2^(2^n) truth-functional compounds of n exchangeable H_i are determined by your probabilities for a particular n of them—namely, by all but one of your probabilities p_0, p_1, . . . , p_n for the propositions saying that the number of truths among the H's is 0, 1, . . . , n.[9]
The following important corollary of the combinatorial implosion is illustrated and proved below. (The R.V. T_m is the tally of the first m trials.)

The Rule of Succession.[10] If H_1, . . . , H_n, H_{n+1} are exchangeable for you and {h} is expressible as a conjunction of t of the first n H's with the denials of the other f = n − t, then

pr(H_{n+1} | {h}) = [(t + 1)/(n + 1)] · [pr(T_{n+1} = t + 1)/pr(T_n = t)] = (t + 1)/(n + 2 + ∗),

where the correction term, ∗, is (f + 1)(pr(T_{n+1} = t)/pr(T_{n+1} = t + 1) − 1).
Example 2. Uniform distribution: pr(T_{n+1} = t) = 1/(n + 2), where t ranges from 0 to n + 1. In this ("Bayes-Laplace-Johnson-Carnap")[11] case the correction term ∗ in the rule of succession vanishes, and the rule reduces to
9. The probability of the missing one will be 1 minus the sum of the other n.
10. The best overall account is Sandy Zabell, "The Rule of Succession", Erkenntnis 31 (1989) 283-321.
11. For the history underlying this terminology, see Zabell's paper, cited above.
pr(H_{n+1} | {h}) = pr(H_{n+1} | T_n = t) = (t + 1)/(n + 2).
Example 3. The Ninth Draw. If you adopt the uniform distribution, your probability for red next, given 6 red and 2 green on the first 8 trials, will be pr(H_9 | T_8 = 6) = pr(H_9 | {h}), where h is any history in the set that represents the proposition T_8 = 6. Then your pr(H_9 | T_8 = 6) = 7/10.
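With the uniform prior the rule needs no correction term, so the computation is one line. A sketch:

```python
from fractions import Fraction

def succession(t, n):
    """Rule of succession under the uniform (Bayes-Laplace-Johnson-Carnap)
    prior: pr(truth next | t truths in n trials) = (t + 1)/(n + 2)."""
    return Fraction(t + 1, n + 2)

print(succession(6, 8))   # Example 3: 6 red in 8 draws -> 7/10
```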
Proof of the rule of succession. Write C^n_t for a conjunction {h} of t of the first n H's with the denials of the other f = n − t, and H^n_t for the proposition T_n = t, so that by implosion pr(C^n_t) = pr(H^n_t)/(n choose t). Then:

(a) pr(H_{n+1} | C^n_t) = pr(H_{n+1} ∧ C^n_t)/pr(C^n_t) = [pr(H^{n+1}_{t+1})/(n+1 choose t+1)] / [pr(H^n_t)/(n choose t)] = [(t + 1) pr(H^{n+1}_{t+1})] / [(n + 1) pr(H^n_t)]

by the quotient rule, implosion, and the fact that (n choose t)/(n+1 choose t+1) = (t + 1)/(n + 1);

(b) pr(C^n_t) = pr(H_{n+1} ∧ C^n_t) + pr(¬H_{n+1} ∧ C^n_t);

(c) pr(H^n_t)/(n choose t) = pr(H^{n+1}_t)/(n+1 choose t) + pr(H^{n+1}_{t+1})/(n+1 choose t+1);

(d) (n choose t)/(n+1 choose t) = (f + 1)/(n + 1) and (n choose t)/(n+1 choose t+1) = (t + 1)/(n + 1);

(e) pr(H^n_t) = [(f + 1)/(n + 1)] pr(H^{n+1}_t) + [(t + 1)/(n + 1)] pr(H^{n+1}_{t+1});

(f) [(t + 1) pr(H^{n+1}_{t+1})] / [(n + 1) pr(H^n_t)] = [(t + 1) pr(H^{n+1}_{t+1})] / [(f + 1) pr(H^{n+1}_t) + (t + 1) pr(H^{n+1}_{t+1})];

(g) the right-hand side of (f) = (t + 1) / [n + 2 + (f + 1)(pr(H^{n+1}_t)/pr(H^{n+1}_{t+1}) − 1)].
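The rule can also be checked numerically against brute-force conditioning. Here is a sketch using the illustrative three-trial assignment of Example 1, with n = 2, t = 2, f = 0, so that {h} = H_1 ∧ H_2:

```python
from math import comb, isclose

p = [0.1, 0.2, 0.3, 0.4]        # pr(T_3 = t) from Example 1

def pr_history(bits):
    """Probability of one exact length-3 history under exchangeability."""
    t = sum(bits)
    return p[t] / comb(3, t)

# Direct conditioning: pr(H_3 | H_1 & H_2) = pr(rrr) / (pr(rrr) + pr(rrg)).
direct = pr_history((1, 1, 1)) / (pr_history((1, 1, 1)) + pr_history((1, 1, 0)))

# Rule of succession, first form: (t+1)/(n+1) * pr(T_{n+1}=t+1)/pr(T_n=t),
# where pr(T_2 = 2) = pr(rrr) + pr(rrg).
t, n = 2, 2
rule = (t + 1) / (n + 1) * p[3] / (pr_history((1, 1, 1)) + pr_history((1, 1, 0)))
print(round(direct, 4), isclose(direct, rule))   # 0.8 True
```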
5.1.4 Supplements
1 Law of Large Numbers. Show that exchangeable random variables satisfy
the three conditions in the law of 4.6.
2 A Pólya Urn Model. At stage 0, an urn contains two balls, one red and one green. At each later stage a ball is drawn with double replacement: it is replaced along with another ball of its color. Prove that the conditional odds on red's being drawn on the n + 1'st trial, given that red was drawn on the first t trials and green on the next f = n − t, are (t + 1)/(f + 1), exactly as in the urn examples of sec. 5.1.3 above.
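A sketch of the computation behind this exercise (exact arithmetic along one history; it is a numerical check, not the requested proof):

```python
from fractions import Fraction

def polya_pr(seq):
    """Probability of drawing an exact red/green sequence from a Polya urn
    that starts with one red and one green ball (double replacement)."""
    red, green, pr = 1, 1, Fraction(1)
    for ball in seq:
        if ball == "r":
            pr *= Fraction(red, red + green); red += 1
        else:
            pr *= Fraction(green, red + green); green += 1
    return pr

t, f = 3, 2
h = "r" * t + "g" * f
odds = polya_pr(h + "r") / polya_pr(h + "g")
print(odds)   # (t + 1)/(f + 1) = 4/3
```

Note that polya_pr("rg") equals polya_pr("gr"): the Pólya urn is exchangeable.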
3 Exchangeability is preserved under mixing. Prove that if H_1, . . . , H_n are exchangeable relative to each of pr_1, . . . , pr_m then they are exchangeable relative to w_1 pr_1 + . . . + w_m pr_m if the w's are nonnegative and sum to 1.
4 Independence is not preserved under mixing. Prove this, using a simple counterexample where m = 2 and w_1 = w_2.
5 Independence, Conditional Independence, Exchangeability. Suppose you are sure that the balls in an urn are either 90% red or 10% red, and you regard those two hypotheses as equiprobable. If balls are drawn with replacement, then conditionally on each hypothesis the propositions H_1, H_2, . . . that the first, second, etc., balls drawn will be red are independent and equiprobable. Is that also so unconditionally?
(a) Are the H's unconditionally equiprobable? If so, what is their probability? If not, why not?
(b) Are they unconditionally independent? Would your probability for red next be the same as it was initially after drawing one ball, which is red?
(c) Are they exchangeable?
5(b) Diaconis and Freedman[12] on de Finetti's Generalizations of Exchangeability
De Finetti wrote about partial exchangeability over a period of half a century.[13] These treatments are rich sources of ideas, which take many readings to digest. Here we give examples of partial exchangeability that we understand well enough to put into crisp mathematical terms. All results in sections 5.2-5.5 involve random quantities taking only two values. We follow de Finetti by giving results for finite as well as infinite sequences. In 5.2 we review exchangeability. In 5.3 we present examples of partially exchangeable sequences—2 × 2 tables and Markov chains—and give general definitions. In 5.4 a finite form of de Finetti's theorem is presented. 5.5 gives some infinite versions of de Finetti's theorem and a counterexample which shows that the straightforward generalization from the exchangeable case sketched by de Finetti is not possible. The last section contains comments about the
12. This is an adaptation of the chapter (#11) by Persi Diaconis and David Freedman in Studies in Inductive Logic and Probability, vol. 2 (Richard Jeffrey, ed.; University of California Press, 1980), which is reproduced here with everybody's kind permission. A bit of startlingly technical material in the original publication (see 5.5 and 5.6) has been retained here, suitably labelled.
13. Beginning with de Finetti (1938).
practical implications of partial exchangeability. We also discuss the closely related work of P. Martin-Löf on repetitive structures.
5.2 Exchangeability Itself

We begin with the case of exchangeability. Consider the following experiment: a coin is spun on a table. We will denote heads by 1 and tails by 0. The experiment is to be repeated ten times. You believe you can assign probabilities to the 2^10 = 1024 possible outcomes by looking at the coin and thinking about what you know. There are no a priori restrictions on the probabilities assigned save that they are nonnegative and sum to 1. Even with so simple a problem, assigning over a thousand probabilities is not a simple task. De Finetti has called attention to exchangeability as a possible simplifying assumption. With an exchangeable probability, two sequences of length 10 with the same number of ones are assigned the same probability. Thus, the probability assignment is symmetric, or invariant under changes in order. Another way to say this is that only the number of ones in the ten trials matters, not the location of the ones. If believed, the symmetry assumption reduces the number of probabilities to be assigned from 1024 to 11—the probability of no ones through the probability of ten ones.
It is useful to single out certain extreme exchangeable probability assignments. Though it is unrealistic, we might be sure there will be exactly one head in the ten spins. The assumption of exchangeability forces each of the ten possible sequences with a single one in it to be equally likely. The distribution is just like the outcome of ten draws without replacement from an urn with one ball marked 1 and nine balls marked 0—the ball marked 1 could be drawn at any stage and must be drawn at some time. There are 11 such extremal urns corresponding to sure knowledge of exactly i heads in the ten draws, where i is a fixed number between zero and ten. The first form of de Finetti's theorem follows from these observations:
(1) Finite form of de Finetti's theorem: Every exchangeable probability assignment on sequences of length N is a unique mixture of draws without replacement from the N + 1 extremal urns.[14]

By a mixture of urns we simply mean a probability assignment over the N + 1 possible urns. Any exchangeable assignment on sequences of length N can

14. This form of the result was given by de Finetti (1938) and by many other writers on the subject. Diaconis (1977), 271-281, is an accessible treatment.
be realized by first choosing an urn and then drawing without replacement until the urn is empty.
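The extremal urns are easy to work with exactly. This sketch computes the probability of each complete draw order from the urn with one ball marked 1 and nine marked 0, confirming that the ten single-1 sequences are equally likely:

```python
from fractions import Fraction

def pr_sequence(seq, ones, zeros):
    """Probability of an exact 0/1 sequence when all balls are drawn
    without replacement from an urn with `ones` 1-balls and `zeros` 0-balls."""
    pr = Fraction(1)
    for x in seq:
        pr *= Fraction(ones if x == 1 else zeros, ones + zeros)
        if x == 1:
            ones -= 1
        else:
            zeros -= 1
    return pr

probs = []
for i in range(10):                 # the single 1 in each possible position
    seq = [0] * 10
    seq[i] = 1
    probs.append(pr_sequence(seq, 1, 9))
print(set(probs))                   # every order has probability 1/10
```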
The extremal urns, representing certain knowledge of the total number of ones, seem like unnatural probability assignments in most cases. While such situations arise (for example, when drawing a sample from a finite population), the more usual situation is that of the coin. While we are only considering ten spins, in principle it seems possible to extend the ten to an arbitrarily large number. In this case there is a stronger form of (1) which restricts the probability assignment within the class of exchangeable assignments.
(2) Infinite form of de Finetti's theorem: Every exchangeable probability assignment pr that can be indefinitely extended and remain exchangeable is a unique mixture µ of draws with replacement from a possibly infinite array of extremal urns.

Again the mixture is defined by a probability assignment to the extremal urns. Let the random variable p be the proportion of red balls in the urn from which the drawing will be made. A simple example where the array is a continuum: for each subinterval I of the real numbers from 0 to 1, µ(p ∈ I) = the length of I. This is the µ that is uniquely determined when pr is the uniform distribution (5.1.3, example 2). If Σ_{p∈C} µ(p) = 1 for a countable set C of real numbers from 0 to 1, the formula is

(3) pr(j ones in k trials) = (k choose j) Σ_{p∈C} p^j (1 − p)^{k−j} µ(p)

In the general case C = [0, 1] and the sum becomes an integral.[15] This holds for every k with the same µ. Spelled out in English, (3) and its general case become (2). We have more to say about this interpretation of exchangeability as expressing ignorance of the true value of p in Remark 1 of 5.6.
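For a countable C, formula (3) is a finite sum that can be evaluated directly. A sketch with an invented two-point mixing measure (weights 1/2 on p = .5 and p = .9):

```python
from math import comb, isclose

def pr_j_ones(j, k, mu):
    """Formula (3): pr(j ones in k trials) under a mixture, over a countable
    set C, of coin-tossing assignments with bias p and weight mu[p]."""
    return sum(comb(k, j) * q**j * (1 - q)**(k - j) * w
               for q, w in mu.items())

mu = {0.5: 0.5, 0.9: 0.5}      # invented two-point mixing measure
dist = [pr_j_ones(j, 10, mu) for j in range(11)]
print(round(sum(dist), 10))    # the eleven probabilities sum to 1
```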
Not all exchangeable measures on sequences of length k can be extended to exchangeable sequences of length n > k. For example, sampling without replacement from an urn with k balls in it cannot be extended to k + 1 trials.[16] The requirement that an exchangeable sequence of length k be infinitely extendable seems out of keeping with de Finetti's general program of restricting attention to finite samples. An appropriate finite version of (3) is given by Diaconis and Freedman (1978), where we show that if an exchangeable
15. pr(j ones in k trials) = (k choose j) ∫_{p∈[0,1]} p^j (1 − p)^{k−j} dµ(p).
16. Necessary and sufficient conditions for extension are found in de Finetti (1969), Crisma (1971), and Diaconis (1977).
probability on sequences of length k is extendable to an exchangeable probability on sequences of length n > k, then (3) almost holds in the sense that there is a measure µ such that for any set A ⊆ {0, 1, 2, . . . , k},

(4) | pr(number of 1's in k trials is in A) − B_µ(number of 1's in k trials is in A) | ≤ 2k/n

uniformly in n, k, and A—where B_µ is the assignment determined by the mixture µ of drawings with replacement from urns with various proportions p of 1's.[17] For example, it is easy to imagine the spins of the coin as the first ten spins in a series of 1000 spins. This yields a bound of .02 on the right-hand side of (4).
Both of the results (3) and (4) imply that for many practical purposes, instead of specifying a probability assignment on the number of ones in n trials, it is equivalent to specify a prior measure µ on the unit interval. Much of de Finetti's discussion in his papers on partial exchangeability is devoted to reasonable choices of µ.[18] We will not discuss the choice of a prior further but rather restrict attention to generalizations of (2), (3), and (4) to partially exchangeable probability assignments.
5.3 Two Species of Partial Exchangeability

In many situations exchangeability is not believable or does not permit incorporation of other relevant data. Here are two examples which will be discussed further in 5.4 and 5.5.
5.3.1 2 × 2 Tables
Consider zero/one outcomes in a medical experiment with n subjects. We are told each subject's sex and whether each subject was given a treatment or was in a control group. In some cases, a reasonable symmetry assumption is the following kind of partial exchangeability: regard all the treated males as exchangeable with one another but not with the subjects in the other three categories; likewise for the other categories. Thus, two sequences of zeros
17. Hairy-style (4): | pr(number of 1's in k trials is in A) − Σ_{j∈A} (k choose j) ∫_{p∈[0,1]} p^j (1 − p)^{k−j} dµ(p) | ≤ 2k/n.
18. The paper by A. Bruno (1964), "On the notion of partial exchangeability", Giorn. Ist. Ital. Attuari 27, 174-196, translated in chapter 10 of de Finetti (1974), is also devoted to this important problem.
and ones of length n which had the same number of ones in each of the four categories would be assigned the same probability. For example, if n = 10, each of the three sequences below must be assigned the same probability.

Trial                1  2  3  4  5  6  7  8  9  10
Sex                  M  M  M  F  F  M  M  F  F  F
Treatment/Control    T  T  T  T  T  C  C  C  C  C
Sequence 1           1  0  1  0  1  1  1  0  1  0
Sequence 2           0  1  1  0  1  1  1  0  0  1
Sequence 3           0  1  1  1  0  1  1  1  0  0

In this example, there are three treated males, two treated females, two control males, and three control females. The data from each of the three sequences can be summarized in a 2 × 2 table which records the number of one outcomes in each group. Each of the three sequences leads to the same table:

(5)        T   C
       M   2   2
       F   1   1
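The 2 × 2 summary is a one-line statistic. This sketch checks that the three sequences above really do collapse to table (5):

```python
def table_2x2(outcomes, sex, group):
    """Number of 1-outcomes in each of the four sex x treatment categories."""
    table = {(s, g): 0 for s in "MF" for g in "TC"}
    for o, s, g in zip(outcomes, sex, group):
        table[(s, g)] += o
    return table

sex   = "MMMFFMMFFF"
group = "TTTTTCCCCC"
seqs = [(1, 0, 1, 0, 1, 1, 1, 0, 1, 0),
        (0, 1, 1, 0, 1, 1, 1, 0, 0, 1),
        (0, 1, 1, 1, 0, 1, 1, 1, 0, 0)]
tables = [table_2x2(s, sex, group) for s in seqs]
print(tables[0] == tables[1] == tables[2])   # True: all give table (5)
```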
5.3.2 Markov Dependency

Consider an experiment in which a thumbtack is placed on the floor and given an energetic flick with the fingers. We record a one if the tack lands point upward and a zero if it lands point to the floor. For simplicity, suppose the tack starts point to the floor. If after each trial the tack were reset to be point to the floor, exchangeability might be a tenable assumption. But if each flick of the tack is given from the position in which the tack just landed, then the result of each trial may depend on the result of the previous trial. For example, there is some chance the tack will slide across the floor without ever turning. It does not seem reasonable to think of a trial depending on what happened two or more trials before. A natural notion of symmetry here is to say that if two sequences of zeros and ones of length n which both begin with zero have the same number of transitions of each type (zero to zero, zero to one, one to zero, and one to one), they should both be assigned the same probability. For example, if n = 10, any of the following three sequences would be assigned the same probability.

Trial        1  2  3  4  5  6  7  8  9  10
Sequence 1   0  1  0  1  1  0  0  1  0  1
Sequence 2   0  1  0  1  0  0  1  1  0  1
Sequence 3   0  0  1  0  1  0  1  0  1  1
Each sequence begins with a zero and has the same transition matrix:

(6)           to
              0   1
    from  0   1   4
          1   3   1

It turns out that there are 16 different sequences starting with zero that have this transition matrix.
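Both claims (that the three sequences share matrix (6), and that exactly 16 length-10 sequences starting with zero do) can be brute-forced. A sketch:

```python
from itertools import product

def transitions(seq):
    """t[i][j] = number of i -> j transitions in a 0/1 sequence."""
    t = [[0, 0], [0, 0]]
    for a, b in zip(seq, seq[1:]):
        t[a][b] += 1
    return t

seqs = [(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
        (0, 1, 0, 1, 0, 0, 1, 1, 0, 1),
        (0, 0, 1, 0, 1, 0, 1, 0, 1, 1)]
target = [[1, 4], [3, 1]]                       # matrix (6)
print(all(transitions(s) == target for s in seqs))

# Enumerate all 2^9 continuations of an initial 0:
n_match = sum(transitions((0,) + tail) == target
              for tail in product((0, 1), repeat=9))
print(n_match)   # 16, as claimed
```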
A general definition of partial exchangeability that includes these examples involves the notion of a statistic: a function T from the sequences of length n into a set X. A probability assignment P on sequences of length n is partially exchangeable for a statistic T if

T(x) = T(y) implies P(x) = P(y),

where x and y are sequences of length n. Freedman (1962) and Lauritzen (1974) have said that P is summarized by T in this situation. In the case of exchangeability the statistic T is the number of ones. Two sequences with the same number of ones get assigned the same probability by an exchangeable probability. This definition is equivalent to the usual one of permutation invariance. In the case of 2 × 2 tables the statistic T is the 2 × 2 table (5). In the Markov example the statistic T is the matrix of transition counts (6) along with the outcome of the first trial.
Such examples can easily be combined and extended: for instance, two-stage Markov dependence, given additional information such as which experimenter reported the trial and the time of day. De Finetti (1938), (1974, section 9.6.2), Martin-Löf (1970), (1974), and Diaconis and Freedman (1978a,b,c) give further examples. In remark 4 of section 5.6 we discuss a more general definition of partial exchangeability.
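The definition translates directly into a finite check: group the sequences by the value of T and ask whether P is constant on each group. A sketch, with T = number of ones (recovering ordinary exchangeability) and the illustrative assignment of sec. 5.1.3:

```python
from itertools import product
from math import comb

def summarized_by(P, T):
    """True iff T(x) == T(y) implies P(x) == P(y), i.e., P is
    partially exchangeable for the statistic T."""
    groups = {}
    for x, px in P.items():
        groups.setdefault(T(x), set()).add(px)
    return all(len(vals) == 1 for vals in groups.values())

p_t = [0.1, 0.2, 0.3, 0.4]
P = {x: p_t[sum(x)] / comb(3, sum(x)) for x in product((0, 1), repeat=3)}
print(summarized_by(P, sum))              # True: exchangeable
print(summarized_by(P, lambda x: x[0]))   # False: first trial alone is too coarse
```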
5.4 Finite Forms of de Finetti's Theorem on Partial Exchangeability

There is a simple analog of theorem (1) for partially exchangeable sequences of length n. We will write {0, 1}^n for the set of sequences of zeros and ones of length n. Let T : {0, 1}^n → X be a statistic taking values t_1, t_2, . . . , t_k. Let S_i = {x ∈ {0, 1}^n : T(x) = t_i} and suppose S_i contains n_i elements. Let pr_i be the probability assignment on {0, 1}^n which picks a sequence x ∈ {0, 1}^n by choosing an x from S_i uniformly, i.e., with probability 1/n_i. Then pr_i is partially exchangeable with respect to T. In terms of these definitions we now state

(7) A finite form of de Finetti's theorem on partial exchangeability. Every probability assignment pr on {0, 1}^n which is partially exchangeable with respect to T is a unique mixture of the extreme measures pr_i. The mixing weights are w_i = pr{x : T(x) = t_i}.[19]

Theorem (7) seems trivial, but in practice a more explicit description of the extreme measures pr_i—like the urns in (1)—can be difficult. We now explore this for the examples of section 5.3.
(8) Example—2 × 2 tables. Suppose we know there are a treated males, b untreated males, c treated females, and d untreated females, with a + b + c + d = n. The sufficient statistic is the 2 × 2 table whose entries are the number of ones in each of the four groups:

         T   U
    M    i   j        where 0 ≤ i ≤ a, 0 ≤ j ≤ b,
    F    k   l              0 ≤ k ≤ c, 0 ≤ l ≤ d.

There are (a + 1) × (b + 1) × (c + 1) × (d + 1) possible values of the matrix. An extreme partially exchangeable probability can be thought of as follows: fix a possible matrix. Make up four urns. In the first urn, labeled TM (for treated males), put i balls marked one and a − i balls marked zero. Similarly, construct urns labeled UM, TF, and UF. To generate a sequence of length n given the labels (sex and treated/untreated), draw without replacement from the appropriate urns. This scheme generates all sequences x ∈ {0, 1}^n with the given matrix equiprobably. Theorem (7) says that any partially exchangeable probability assignment on {0, 1}^n is a unique mixture of such urn schemes. If a, b, c, and d all tend to infinity, the binomial approximation to the hypergeometric distribution will lead to the appropriate infinite version of de Finetti's theorem, as stated in section 5.5.
(9) Example—Markov chains. For simplicity, assume we have a sequence of length n + 1 that begins with a zero. The sufficient statistic is the matrix

    T = | t_00  t_01 |
        | t_10  t_11 |

where t_ij is the number of i-to-j transitions. A counting argument shows that there are n(n + 1)/2 + 1 different values of T possible. Here is an urn model which generates all sequences of

19. In the language of convex sets, the set of partially exchangeable probabilities with respect to T forms a simplex with extreme points pr_i.
length n + 1 with a fixed transition matrix T equiprobably. Form two urns U_0 and U_1 as follows: put t_ij balls marked j into urn U_i. It is now necessary to make an adjustment to make sure the process doesn't run into trouble. There are two cases possible:

Case 1. If t_01 = t_10, remove a zero from U_1.
Case 2. If t_01 = t_10 + 1, remove a one from U_0.
To generate a sequence of length n + 1, let X_1 = 0. Let X_2 be the result of a draw from U_{X_1}; and, in general, let X_i be the result of a draw without replacement from urn U_{X_{i−1}}. If the sequence generated ever forces a draw from an empty urn, make a forced transition to the other urn. The adjustment made above guarantees that such a forced jump can only be made once from either urn and that the process generates all sequences of length n + 1 that start with zero and have transition matrix T with the same probability. Theorem (7) says that every probability assignment on sequences of length n + 1 which is partially exchangeable for the transition matrix T is a unique mixture of the n(n + 1)/2 + 1 different urn processes described above. Again, the binomial approximation to the hypergeometric will lead to an infinite form of de Finetti's theorem in certain cases. This is further discussed in sections 5.5 and 5.6.

Determining when a simple urn model such as the ones given above can be found to describe the extreme partially exchangeable probabilities may be difficult. For instance, we do not know how to extend the urn model for Markov chains to processes taking three values.
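The counting claim in example (9) can be checked by enumeration. (The closed form n(n + 1)/2 + 1 is our reading of the counting argument, which is garbled in some printings; the sketch below confirms it for small n.)

```python
from itertools import product

def transition_matrix(seq):
    """Transition counts of a 0/1 sequence, as a hashable 4-tuple
    (t00, t01, t10, t11)."""
    t = [[0, 0], [0, 0]]
    for a, b in zip(seq, seq[1:]):
        t[a][b] += 1
    return (t[0][0], t[0][1], t[1][0], t[1][1])

for n in range(1, 11):
    # all sequences of length n + 1 that begin with a zero
    mats = {transition_matrix((0,) + tail)
            for tail in product((0, 1), repeat=n)}
    assert len(mats) == n * (n + 1) // 2 + 1
print("distinct matrices = n(n+1)/2 + 1 for n = 1..10")
```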
5.5 Technical interpolation: Inﬁnite Forms
Let us examine the results and problems for inﬁnite sequences in the two
preceding examples.
(10) Example—2 × 2 tables. Let X_1, X_2, X_3, . . . be an infinite sequence of random variables, each taking values 0 or 1. Suppose each trial is labeled as Male or Female and as Treated or Untreated, and that the number of labels in each of the four possible categories (M, U), (M, T), (F, U), (F, T) is infinite. Suppose that for each n the distribution of X_1, X_2, . . . , X_n is partially exchangeable with respect to the sufficient statistic that counts the number of zeros and the number of ones for each label. Thus, T_n = (a_1, b_1, a_2, b_2, a_3, b_3, a_4, b_4) where, for example, a_1 is the number of ones labeled (M, U), b_1 is the number of zeros labeled (M, U), a_2 is the number of ones labeled (M, T), b_2 is the number of zeros labeled (M, T), and so on. Then there is a unique probability distribution µ such that for every n and each sequence x_1, x_2, . . . , x_n,

(11) pr(X_1 = x_1, . . . , X_n = x_n) = ∫ ∏_{i=1}^{4} p_i^{a_i} (1 − p_i)^{b_i} dµ(p_1, p_2, p_3, p_4)

where a_i, b_i are the values of T_n(x_1, x_2, . . . , x_n). The result can be proved by passing to the limit in the urn model (8) of 5.4.[20]
(12) Example—Markov chains. Let X_1, X_2, . . . be an infinite sequence of random variables, each taking values 0 or 1. For simplicity assume X_1 = 0. Suppose that for each n the joint distribution of X_1, . . . , X_n is partially exchangeable with respect to the matrix of transition counts. Suppose that the following recurrence condition is satisfied:

(13) P(X_n = 0 infinitely often) = 1.

Then there is a unique probability distribution µ such that for every n, and each sequence x_2, x_3, . . . , x_n of zeros and ones,

(14) P(X_1 = 0, X_2 = x_2, . . . , X_n = x_n) = ∫ p_11^{t_11} (1 − p_11)^{t_10} p_00^{t_00} (1 − p_00)^{t_01} dµ(p_11, p_00)

where the t_ij are the four entries of the transition matrix of x_1, x_2, . . . , x_n.
De Finetti appears to state (pp. 218-219 of de Finetti [1974]) that the representation (14) is valid for every partially exchangeable probability assignment in this case. Here is an example to show that the representation (14) need not hold in the absence of the recurrence condition (13). Consider the probability assignment which goes 001111111... (all ones after two zeros) with probability one. This probability assignment is partially exchangeable and not representable in the form (14). It is partially exchangeable because, for any n, the first n symbols of this sequence form the only sequence with transition matrix

    | 1    1   |
    | 0  n − 3 |

To see that it is not representable in the form (14), write p_k for the probability that the last zero occurs at trial k (k = 1, 2, 3, . . .). For a mixture of Markov chains the numbers p_k can be represented as

(15) p_k = c ∫_0^1 p_00^{k−1} (1 − p_00) dµ(p_00), k = 1, 2, . . . ,
20. A different proof is given in G. Link's chapter in R. Jeffrey (ed.) (1980).
where c is the probability mass the mixing distribution puts on p_11 = 1. The representation (15) implies that the numbers p_k are decreasing. For the example 001111111..., p_1 = 0, p_2 = 1, p_3 = p_4 = . . . = 0. So this example doesn't have a representation as a mixture of Markov chains. A detailed discussion of which partially exchangeable assignments are mixtures of Markov chains is in Diaconis and Freedman (1978b).
When can we hope for representations like (3), (11), and (14) in terms of averages over a naturally constructed "parameter space"? The theory we have developed for such representations still uses the notion of a statistic T. Generally, the statistic will depend on the sample size n, and one must specify the way the different T_n's interrelate.[21] Lauritzen (1973) compares S-structure with several other ways of linking together sufficient statistics. We have worked with a somewhat different notion in Diaconis and Freedman (1978c) and developed a theory general enough to include a wide variety of statistical models.
De Finetti sketches out what appears to be a general theory in terms of what he calls the type of an observation. In de Finetti (1938) he only gives examples which generalize the examples of 2 × 2 tables. Here things are simple. In the example there are four types of observations, depending on the labels (M, U), (M, T), (F, U), (F, T). In general, for each i we observe the value of another variable giving information like sex, time of day, and so on. An analog of (11) clearly holds in these cases. In de Finetti (1974), section 9.6.2, the type of an observation is allowed to depend on the outcome of past observations, as in our Markov chain example. In this example there are three types of observations: observations X_i that follow a 0 are of type zero; observations X_i that follow a 1 are of type one; and the first observation X_1 is of type two. For the original case of exchangeability there is only one type.
In attempting to define a general notion of type we thought of trying to define a type function t_n : {0, 1}^n → {1, 2, . . . , r}^n which assigns each symbol in a string a "type" depending only on past symbols. If there are r types, then it is natural to consider the statistic T_n(x_1, . . . , x_n) = (a_1, b_1, . . . , a_r, b_r), where a_i is the number of ones of type i and b_i is the number of zeros of type i. We suppose that the type functions t_n were chosen so that the statistics T_n had what Lauritzen (1973) has called Σ-structure: T_{n+1} can be computed from T_n and x_{n+1}. Under these circumstances we hoped that a probability pr which was partially exchangeable with respect to the T_n's would have a
21. Freedman (1962b) introduced the notion of S-structure, in which, roughly, from the value of T_n(x_1, x_2, . . . , x_n) and T_m(x_{n+1}, x_{n+2}, . . . , x_{n+m}) one can compute T_{n+m}(x_1, x_2, . . . , x_{n+m}).
representation of the form

(16) pr{X_1 = x_1, . . . , X_n = x_n} = ∫ ∏_{i=1}^{r} p_i^{a_i} (1 − p_i)^{b_i} dµ(p_1, . . . , p_r)

for some measure µ.
The following contrived counterexample indicates the difficulty. There will be two types of observations. However, there will be some partially exchangeable probabilities which fail to have the representation (16), even though the number of observations of each type becomes infinite.

Example. You are observing a sequence of zeros and ones X_0, X_1, X_2, X_3, . . .. You know that one of three possible mechanisms generates the process:

• All the positions are exchangeable.
• The even positions X_0, X_2, X_4, . . . are exchangeable with each other, and the odd positions are generated by reading off a fixed reference sequence x of zeros and ones.
• The even positions X_0, X_2, X_4, . . . are exchangeable with each other, and the odd positions are generated by reading off the complement x̄ (the complement has x̄_j = 1 − x_j).
Knowing that the reference sequence is x = 11101110 . . . , you keep track of two types of X_i. If i is odd and all the preceding X_j with j odd lead to a sequence which matches x or x̄, you call i of type 1. In particular, X_1 and X_3 are always of type 1. You count the number of zeros and ones of each type. Let T_n = (a_1, b_1, a_2, b_2) be the type counts at time n. Any process of the kind described above is partially exchangeable for these T_n. Moreover, the sequence T_n has Σ structure. Now consider the process which is fair coin tossing on the even trials and equals the ith coordinate of x on trial 2i − 1. The number of zeros and ones of each type becomes infinite for this process. However, the process cannot be represented in the form (16). The class of all processes with these T_n’s is studied in Diaconis and Freedman (1978c), where the extreme points are determined.
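The two-type bookkeeping of this example can be sketched in code. Classifying even positions, and odd positions after a mismatch, as type 2 is my reading of the text, and the test sequence is invented:

```python
X_REF = [1, 1, 1, 0, 1, 1, 1, 0]       # reference sequence x = 11101110...
X_BAR = [1 - v for v in X_REF]         # its complement

def matches(prefix, ref):
    return all(p == r for p, r in zip(prefix, ref))

def type_counts(xs):
    # T_n = (a_1, b_1, a_2, b_2): position i (from 0) is type 1 iff i is
    # odd and the preceding odd-position values match a prefix of x or of
    # x-bar; everything else is counted as type 2 (my reading of the text)
    a, b = [0, 0], [0, 0]
    odd_history = []
    for i, v in enumerate(xs):
        if i % 2 == 1:
            ok = matches(odd_history, X_REF) or matches(odd_history, X_BAR)
            t = 0 if ok else 1
            odd_history.append(v)
        else:
            t = 1
        (a if v == 1 else b)[t] += 1
    return (a[0], b[0], a[1], b[1])

# X_1 and X_3 are always type 1; the mismatch at the third odd value
# (1, 1, 0 matches neither 111 nor 000) pushes X_7 into type 2
assert type_counts([1, 1, 0, 1, 1, 0, 0, 1]) == (2, 1, 3, 2)
```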
We do not know for which type functions the representation (16) will hold. Some cases when a parametric representation is possible have been determined by Martin-Löf (1970, 1974) and Lauritzen (1976). A general theory of partial exchangeability and more examples are in Diaconis and Freedman (1978c).
5.6 Concluding Remarks
1. Some Bayesians are willing to talk about “tossing a coin with unknown p”. For them, de Finetti’s theorem can be interpreted as follows: If a sequence of events is exchangeable, then it is like the successive tosses of a p-coin with unknown p. Other Bayesians do not accept the idea of p-coins with unknown p: de Finetti is a prime example. Writers on subjective probability have suggested that de Finetti’s theorem bridges the gap between the two positions. We have trouble with this synthesis, and the following quote indicates that de Finetti has reservations about it:^22
The sensational effect of this concept (which went well beyond its intrinsic meaning) is described as follows in Kyburg and Smokler’s preface to the collection Studies in Subjective Probability which they edited (pp. 13–14):

In a certain sense the most important concept in the subjective theory is that of “exchangeable events”. Until this was introduced (by de Finetti (1931)) the subjective theory of probability remained little more than a philosophical curiosity. None of those for whom the theory of probability was a matter of knowledge or application paid much attention to it. But, with the introduction of the concept of “equivalence or symmetry” or “exchangeability”, as it is now called, a way was discovered to link the notion of subjective probability with the classical problem of statistical inference.
It does not seem to us that the theorem explains the idea of a coin with unknown p. The main point is this: Probability assignments involving mixtures of coin tossing were used by Bayesians long before de Finetti. The theorem gives a characterizing feature of such assignments—exchangeability—which can be thought about in terms involving only opinions about observable events.

The connection between mixtures and exchangeable probability assignments allows a subjectivist to interpret some of the classical calculations involving mixtures. For example, consider Laplace’s famous calculation of the chance that the Sun will rise tomorrow given that it has risen on n − 1 previous days (5.1.3, example 2). Laplace took a uniform prior on [0, 1] and calculated the chance as the posterior mean of the unknown parameter:
^22 “Probability: Beware of falsifications!”, Scientia 111 (1976), 283–303.
pr(Sun rises tomorrow | Has risen on n − 1 previous days) = ∫_0^1 p^n dµ / ∫_0^1 p^{n−1} dµ = n/(n + 1).
The translation of this calculation is as follows: Let X_i = 1 if the Sun rises on day i, X_i = 0 otherwise. Laplace’s uniform mixture of coin tossing is exchangeable and P(X_1 = 1, . . . , X_k = 1) = 1/(k + 1), k = 1, 2, 3, . . . . With this allocation, P(X_n = 1 | X_1 = 1, . . . , X_{n−1} = 1) = P(X_1 = 1, . . . , X_n = 1)/P(X_1 = 1, . . . , X_{n−1} = 1) = n/(n + 1). A similar interpretation can be given to any calculation involving averages over a parameter space.
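The rule-of-succession arithmetic can be checked exactly; all that is needed is the Beta integral ∫_0^1 p^k dp = 1/(k + 1):

```python
from fractions import Fraction

def pr_all_ones(k):
    # uniform prior on p: pr(X_1 = 1, ..., X_k = 1) = integral of p^k = 1/(k+1)
    return Fraction(1, k + 1)

def pr_next_one(n):
    # Laplace: pr(X_n = 1 | X_1 = ... = X_{n-1} = 1)
    return pr_all_ones(n) / pr_all_ones(n - 1)

assert pr_next_one(100) == Fraction(100, 101)   # n/(n+1)
```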
Technical Interpolations (2–4)
2. The exchangeable form of de Finetti’s theorem (1) is also a useful computational device for specifying probability assignments. The result is more complicated and much less useful in the case of real-valued variables. Here there is no natural notion of a parameter p. Instead, de Finetti’s theorem says real-valued exchangeable variables {X_i}_{i=1}^∞ are described as follows: There is a prior measure π on the space of all probability measures on the real line such that pr(X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∫ ∏_{i=1}^n p(X_i ∈ A_i) dπ(p). The space of all probability measures on the real line is so large that it is difficult for subjectivists to describe personally meaningful distributions on this space.^23 Thus, for real-valued variables de Finetti’s theorem is far from an explanation of the type of parametric estimation—involving a family of probability distributions parametrized by a finite-dimensional parameter space—that Bayesians from Bayes and Laplace to Lindley have been using in real statistical problems.
In some cases, more restrictive conditions than exchangeability are reasonable to impose, and do single out tractable classes of distributions. Here are some examples adapted from Freedman.^24
Example: Scale mixtures of normal variables

When can a sequence of real-valued variables {X_i}, 1 ≤ i < ∞, be represented as a scale mixture of normal variables,

(19) pr(X_1 ≤ t_1, . . . , X_n ≤ t_n) = ∫_0^∞ ∏_{i=1}^n Φ(δt_i) dπ(δ),
^23 Ferguson, T., “Prior distributions on spaces of probability measures”, Ann. Stat. 2 (1974), 615–629, contains examples of the various attempts to choose such a prior.

^24 Freedman, D., “Mixtures of Markov processes”, Ann. Math. Stat. 33 (1962a), 114–118; “Invariants under mixing which generalize de Finetti’s theorem”, Ann. Math. Stat. 33 (1962b), 916–923.
where Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s^2/2} ds?

In Freedman (1963) it is shown that a necessary and sufficient condition for (19) to hold is that for each n the joint distribution of X_1, X_2, . . . , X_n be rotationally symmetric. This result is related to the derivation of Maxwell’s distribution for velocity in a monatomic ideal gas.^25
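Rotational symmetry is easy to see numerically: under (19) the joint density of (X_1, X_2) given δ is δ²φ(δx_1)φ(δx_2), which depends only on x_1² + x_2². A sketch with an invented two-point mixing measure π:

```python
import math

# invented two-point mixing measure pi over the scale delta
PI = [(0.5, 1.0), (0.5, 3.0)]

def phi(t):
    # standard normal density
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def density(x1, x2):
    # joint density of (X_1, X_2) under (19):
    # integral of delta^2 phi(delta x1) phi(delta x2) d pi(delta)
    return sum(w * d * d * phi(d * x1) * phi(d * x2) for w, d in PI)

# rotational symmetry: the density depends only on x1^2 + x2^2
assert abs(density(0.6, 0.8) - density(1.0, 0.0)) < 1e-12
assert abs(density(0.6, 0.8) - density(0.0, -1.0)) < 1e-12
```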
Example: Poisson distribution

Let X_i (1 ≤ i < ∞) take integer values. In Freedman (1962) it is shown that a necessary and sufficient condition for X_i to have a representation as a mixture of Poisson variables,

pr(X_1 = a_1, . . . , X_n = a_n) = ∫_0^∞ ∏_{i=1}^n (e^{−λ} λ^{a_i} / a_i!) dπ(λ),
is as follows: For every n the joint distribution of X_1, X_2, . . . , X_n given S_n = ∑_{i=1}^n X_i must be multinomial, like the joint distribution of S_n balls dropped at random into n boxes.
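The multinomial condition can be verified by brute force for a small mixture. The mixing measure here is invented; the point is that the conditional law given S_n does not depend on it:

```python
import math
from itertools import product

# invented two-point mixing measure pi over lambda
PI = [(0.3, 0.7), (0.7, 2.5)]

def pr_vec(a):
    # mixture of iid Poissons: integral of prod_i e^{-lam} lam^{a_i}/a_i!
    return sum(w * math.prod(math.exp(-lam) * lam**ai / math.factorial(ai)
                             for ai in a)
               for w, lam in PI)

def multinomial(a):
    # s!/(a_1! ... a_n!) (1/n)^s: s balls dropped at random into n boxes
    s, n = sum(a), len(a)
    coef = math.factorial(s)
    for ai in a:
        coef //= math.factorial(ai)
    return coef / n**s

def conditional(a):
    # pr(X_1 = a_1, ..., X_n = a_n | S_n = s), by brute-force enumeration
    n, s = len(a), sum(a)
    pr_s = sum(pr_vec(v) for v in product(range(s + 1), repeat=n)
               if sum(v) == s)
    return pr_vec(a) / pr_s

for a in [(2, 0, 1), (1, 1, 1), (0, 3, 0)]:
    assert abs(conditional(a) - multinomial(a)) < 1e-12
```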
Many further examples and some general theory are given in Diaconis and
Freedman (1978c).
3. De Finetti’s generalizations of partial exchangeability are closely connected to work of P. Martin-Löf in the 1970s.^26 Martin-Löf does not seem to work in a Bayesian context; rather, he takes the notion of sufficient statistic as basic and from this constructs the joint distribution of the process in much the same way as de Finetti. Martin-Löf connects the conditional distribution of the process with the microcanonical distributions of statistical mechanics. The idea is to specify the conditional distribution of the observations given the sufficient statistics. In this paper, and in Martin-Löf’s work, the conditional distribution has been chosen as uniform. We depart from this assumption in Diaconis and Freedman (1978c), as we have in the example of mixtures of Poisson distributions. The families of conditional distributions we work with still “project” in the right way. This projection property allows us to use the well-developed machinery of Gibbs states as developed by Lanford (1973), Föllmer (1974), and Preston (1977).

^25 Khinchin, A., Mathematical Foundations of Statistical Mechanics (1949), Chap. VI.

^26 Martin-Löf’s work appears most clearly spelled out in a set of mimeographed lecture notes (Martin-Löf 1970), unfortunately available only in Swedish. A technical treatment of part of this may be found in Martin-Löf (1974). Further discussions of Martin-Löf’s ideas are in Lauritzen (1973), (1975) and the last third of Tjur (1974). These references contain many new examples of partially exchangeable processes and their extreme points. We have tried to connect Martin-Löf’s treatment to the general version of de Finetti’s theorem we have derived in Diaconis and Freedman (1978c).
4. In this paper we have focused on generalizations of exchangeability involving a statistic. A more general extension involves the idea of invariance with respect to a collection of transformations of the sample space into itself. This contains the idea of partial exchangeability with respect to a statistic, since we can consider the class of all transformations leaving the statistic invariant. A typical case which cannot be neatly handled by a finite-dimensional statistic is de Finetti’s theorem for zero/one matrices. Here the distribution of the doubly infinite random matrix is to be invariant under permutations of the rows and columns. David Aldous has recently identified the extreme points of these matrices and shown that no representation as a mixture of finite-parameter processes is possible.
5. Is exchangeability a natural requirement on subjective probability assignments? It seems to us that much of its appeal comes from the (forbidden?) connection with coin tossing. This is most strikingly brought out in the thumbtack, Markov chain example. If someone were thinking about assigning probability to 10 flips of the tack and had never heard of Markov chains, it seems unlikely that they would hit on the appropriate notion of partial exchangeability. The notion of symmetry seems strange at first. Its appeal comes from the connection with Markov chains with unknown transition probabilities. A feeling of naturalness only appears after experience and reflection.
5.7 References
Bruno, A. (1964). “On the notion of partial exchangeability”. Giorn. Ist.
It. Attuari, 27, 174–196. Translated in Chapter 10 of de Finetti (1974).
Crisma, L. (1971). “Sulla proseguibilità di processi scambiabili”, Rend. Matem. Trieste 3, 96–124.
de Finetti, B. (1931). Probabilismo: Saggio critico sulla teoria delle probabilità e sul valore della scienza, Biblioteca di Filosofia diretta da Antonio Aliotta. Naples: Perrella.
———. (1937). “La prévision: ses lois logiques, ses sources subjectives”, Ann. Inst. H. Poincaré (Paris) 7, 1–68. Translated in Kyburg and Smokler (1964).
———. (1938). “Sur la condition d’équivalence partielle” (Colloque Genève, 1937), Act. Sci. Ind. 739, 5–18. Paris: Hermann. Translated in R. Jeffrey, ed. (1980).
———. (1959). “La probabilità e la statistica nei rapporti con l’induzione, secondo i diversi punti di vista”. In Corso C.I.M.E. su Induzione e Statistica. Rome: Cremonese. Translated as Chapter 9 of de Finetti (1974).
———. (1969). “Sulla proseguibilità di processi aleatori scambiabili”, Rend. Matem. Trieste 1, 53–67.
———. (1974). Probability, Induction and Statistics. New York: Wiley.
———. (1976). “Probability: Beware of falsifications!”, Scientia 111, 283–303.
Diaconis, P. (1977). “Finite forms of de Finetti’s theorem on exchangeability”, Synthese 36, 271–281.
Diaconis, P., and Freedman, D. (1978a). “Finite exchangeable sequences”, Ann. Prob. 8, 745–764.
———. (1978b). “De Finetti’s theorem for Markov chains”, Ann. Prob. 8, 115–130.
———. (1978c). “Partial exchangeability and sufficiency”, Proc. Indian Stat. Inst. Golden Jubilee Int. Conf., Applications and New Directions, 205–236, J. K. Ghosh and J. Roy, eds., Indian Stat. Inst., Calcutta.
Ferguson, T. (1974). “Prior distributions on spaces of probability measures”, Ann. Stat. 2, 615–629.
Föllmer, H. (1974). “Phase transitions and Martin boundary”, Springer Lecture Notes in Mathematics 465, 305–317. New York: Springer-Verlag.
Freedman, D. (1962a). “Mixtures of Markov processes”, Ann. Math. Stat. 33, 114–118.
———. (1962b). “Invariants under mixing which generalize de Finetti’s theorem”, Ann. Math. Stat. 33, 916–923.
Heath, D., and Sudderth, W. (1976). “De Finetti’s theorem on exchangeable variables”, Am. Stat. 30, 188–189.
Jeffrey, R. (ed.) (1980). Studies in Inductive Logic and Probability, vol. 2. Berkeley and Los Angeles: University of California Press.
Kendall, D. (1967). “On finite and infinite sequences of exchangeable events”, Studia Sci. Math. Hung. 2, 319–327.
Khinchin, A. (1949). Mathematical Foundations of Statistical Mechanics. New York: Dover.
Kyburg, H. E., and Smokler, H. E. (1964). Studies in Subjective Probability. New York: Wiley; 2nd ed. (1980), Huntington, NY: Krieger.
Lanford, O. E. (1975). “Time evolutions of large classical systems, theory and applications”, Springer Lecture Notes in Physics 38. New York: Springer-Verlag.
Lauritzen, S. L. (1973). “On the interrelationships among sufficiency, total sufficiency, and some related concepts”, Institute of Math. Stat., University of Copenhagen.
———. (1975). “General exponential models for discrete observations”, Scand. J. Statist. 2, 23–33.
Martin-Löf, P. (1970). Statistiska Modeller, Anteckningar från seminarier läsåret 1969–70, utarbetade av R. Sundberg, Institutionen för Matematisk Statistik, Stockholms Universitet.
———. (1973). “Repetitive structures and the relation between canonical and microcanonical distributions in statistics and statistical mechanics”. In Proc. Conference on Foundational Questions in Statistical Inference, Department of Theoretical Statistics, University of Aarhus, Barndorff-Nielsen, P. Blæsild, and G. Schou, eds.
Preston, C. (1976). Random Fields, Springer Lecture Notes in Mathematics 534. New York: Springer-Verlag.
Tjur, T. (1974). “Conditional probability distributions”, Lecture Notes 2,
Institute of Mathematical Statistics, University of Copenhagen.
Chapter 6
Choosing
This is an exposition of the framework for decision-making that Ethan Bolker and I floated about 40 years ago, in which options are represented by propositions and any choice is a decision to make some proposition true.^1 It now seems to me that the framework is pretty satisfactory, and I shall present it here on its merits, without spending much time defending it against what I have come to see as fallacious counterarguments.
6.1 Preference Logic
To the option of making the proposition A true corresponds the conditional probability distribution pr(· | A), where the unconditional distribution pr(·) represents your prior probability judgment—prior, that is, to deciding which option-proposition to make true. And your expectation of utility associated with the A-option will be your conditional expectation ex(u | A) of the random variable u (for “utility”). This conditional expectation is also known as your desirability for truth of A, and denoted ‘des A’:

des A = ex(u | A)

Now preference (≻), indifference (≈), and preference-or-indifference (⪰) go by desirability, so that
^1 See Bolker (1965, 1966, 1967) and Jeffrey (1965, 83, 90, 96). When Bolker and I met, in 1963, I had worked out the logic as in 6.1 and was struggling to prove what turned out to be the wrong uniqueness theorem, while Bolker proved to have found what I needed: a statement and proof of the right uniqueness theorem (along with a statement and proof of the corresponding existence theorem—see chapter 9 of Jeffrey 1983 or 1990).
A ≻ B if des A > des B,
A ≈ B if des A = des B,
A ⪰ B if des A ≥ des B,
and similarly for ≺ and ⪯. Note that it is not only option-propositions that appear in preference rankings; you can perfectly well prefer a sunny day tomorrow (truth of ‘Tomorrow will be sunny’) to a rainy one even though you know you cannot affect the weather.^2
The basic connection between pr and des is that their product is additive, just as pr is.^3 Here is another way of saying that, in a different notation:
Basic probability–desirability link. If the S_i form a partitioning of ⊤, then

(1) des(A) = ∑_i des(S_i ∧ A) pr(S_i | A).
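Formula (1) can be checked in a toy Bayesian frame — worlds, probabilities and utilities all invented — with des computed directly as a conditional expectation:

```python
# toy Bayesian frame: four worlds with invented probabilities and utilities
PR = {'w1': 0.1, 'w2': 0.2, 'w3': 0.3, 'w4': 0.4}
U = {'w1': 5.0, 'w2': 1.0, 'w3': 2.0, 'w4': 0.0}

def pr(X):
    # probability of a proposition, i.e. a set of worlds
    return sum(PR[w] for w in X)

def des(X):
    # des X = ex(u | X), the conditional expectation of utility
    return sum(PR[w] * U[w] for w in X) / pr(X)

A = {'w1', 'w2', 'w3'}
S = [{'w1', 'w4'}, {'w2'}, {'w3'}]      # a partitioning of the whole space

# formula (1): des(A) = sum_i des(S_i ∧ A) pr(S_i | A)
rhs = sum(des(Si & A) * pr(Si & A) / pr(A) for Si in S if Si & A)
assert abs(des(A) - rhs) < 1e-12
```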
Various principles of preference logic can now be enunciated, and fallacies identified, as in the following samples. The first is a fallacious mode of inference according to which denial reverses preferences. (The symbol ‘⊢’ stands for valid implication: the premise, to the left of the turnstile, validly implies the conclusion, to its right.)
6.1.1 Denial Reverses Preferences: A ≻ B ⊢ ¬B ≻ ¬A.
Counterexample: Death before dishonor. Suppose you are an exemplary Roman of the old school, so that if you were dishonored you would certainly be dead—if necessary, by suicide: pr(death | dishonor) = 1. Premise: A ≻ B, you prefer death (A) to dishonor (B). Then 6.1.1 is just backward; it must be that you prefer a guarantee of life (¬A) to a guarantee of not being dishonored (¬B), for in this particular example your conditional probability pr(¬B | ¬A) = 1 makes the second guarantee an automatic consequence of the first.
^2 Rabinowicz (2002) effectively counters arguments by Levi and Spohn that Dutch Book arguments are untrustworthy in cases where the bet is on the agent’s own future betting behavior, since “practical deliberation crowds out self-prediction” much as the gabble of cell-phone conversations may make it impossible to pursue simple trains of thought in railway cars. The difficulty is one that cannot arise where it is this very future behavior that is at issue.
^3 Define τ(H) = pr(H) des(H); then τ(H_1 ∨ H_2 ∨ . . .) = ∑_i τ(H_i) provided τ(H_i ∧ H_j) = 0 whenever i ≠ j.
6.1.2 If A ∧ B ⊢ ⊥ and A ⪰ B then A ⪰ A ∨ B ⪰ B.
Proof. Set w = pr(A | A ∨ B). Then des(A ∨ B) = w(des A) + (1 − w)(des B). This is a convex combination of des A and des B, which must therefore lie either between them or at one or the other. This one is valid.
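The convexity step can be replayed with invented numbers:

```python
# disjoint options A and B with invented probabilities and desirabilities
pr_A, des_A = 0.2, 4.0
pr_B, des_B = 0.3, 1.0

w = pr_A / (pr_A + pr_B)        # pr(A | A ∨ B), since A ∧ B is inconsistent
des_AorB = w * des_A + (1 - w) * des_B

# des(A ∨ B) lies between des B and des A (or at one of them)
assert min(des_A, des_B) <= des_AorB <= max(des_A, des_B)
```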
6.1.3 The “Sure Thing” (or “Dominance”) Principle:
(A ∧ B) ≻ C, (A ∧ ¬B) ≻ C ⊢ A ≻ C
In words, with premises to the left of the turnstile: If you would prefer A to C knowing that B is true, and knowing that B is false, then you prefer A to C. This snappy statement is not a valid form of inference. Really? If A is better than C no matter how other things (B) turn out, then surely A is at least as good as C. How could that be wrong?
Well, here is one way, with C = Quit smoking in 6.1.3:
Counterexample: The Marlborough Man, “MM”. He reasons: ‘Sure, smoking (A) makes me more likely to die before my time (B), but a smoker can live to a ripe old age. Now if I am to live to a ripe old age I’d rather do it as a smoker, and if not I would still enjoy the consolation of smoking. In either case, I’d rather smoke than quit. So I’ll smoke.’ In words, again:
Premise: Smoking is preferable to quitting if I die early.
Premise: Smoking is preferable to quitting if I do not die early.
Conclusion: Smoking is preferable to quitting.
I have heard that argument put forth quite seriously. And there are versions of decision theory which seem to endorse it, but the wise heads say ‘Nonsense, dear boy, you have chosen a partition {B, ¬B} relative to which the STP fails.’ But there are provisos under which the principle is quite trustworthy. Sir Ronald Fisher (1959) actually sketched a genetic scenario in which that would be the case. (See 6.3.1 below.) But he never suggested that 6.1.3, the unrestricted STP, is an a priori justification for the MM’s decision.
For the record, this is the example that L. J. Savage (1954), the man who
brought us the STP, used to motivate it.
A businessman contemplates buying a certain piece of property. He considers the outcome of the next presidential election relevant to the attractiveness of the purchase. So, to clarify the matter for himself, he asks whether he would buy if he knew that the Republican candidate were going to win, and decides that he would do so. Similarly, he considers whether he would buy if he knew that the Democratic candidate were going to win, and again finds that he would do so. Seeing that he would buy in either event, he decides that he should buy, even though he does not know which event obtains, or will obtain, as we would ordinarily say. It is all too seldom that a decision can be arrived at on the basis of the principle used by this businessman, but, except possibly for the assumption of simple ordering, I know of no other extralogical principle governing decisions that finds such ready acceptance. (Savage 1954, p. 21)
Savage’s businessman and my Marlborough Man seem to be using exactly
the same extralogical principle (deductive logic, that is). What is going on?
BJ-ing the MM

This is how it turns out that people with the MM’s preferences normally come to prefer quitting to smoking—even if they cannot bring themselves to make the preferential choice. It is all done with conditional probabilities. Grant the MM’s desirabilities and probabilities as in the following table, where ‘long’ and ‘shorter’ refer to expected length of life and w, s, l (“worst”, “smoke”, “long”) are additive components of desirability with s, l > 0 and w + s + l < 1.
des(long ∧ smoke) = l + s + w        pr(long | smoke) = q
des(long ∧ quit) = l + w             pr(long | quit) = p
des(shorter ∧ smoke) = s + w         pr(shorter | smoke) = 1 − q
des(shorter ∧ quit) = w              pr(shorter | quit) = 1 − p
Question: How must MM’s conditional probabilities p, q be related in order for the BJ figure of merit to advise his quitting instead of continuing to smoke? Answer: des(quit) > des(smoke) if and only if (l + w)p + w(1 − p) > (l + s + w)q + (s + w)(1 − q), i.e., if and only if l(p − q) > s. Then BJ tells MM to smoke iff p − q < s/l.
These are the sorts of values of p and q that MM might have gathered from Consumers Union (1963): p = .59, q = .27. Then p − q = .32; and as long as MM is not so addicted or otherwise dedicated to smoking that the increment of desirability from it is at least 32% of the increment from long life (that is, as long as s/l, which by hypothesis is positive and less than 1, stays below .32), formula (1) in the basic probability–desirability link will have him choose to quit.
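The bookkeeping can be replayed in a quick sketch. The components l, s, w are invented (subject to the stated constraints s, l > 0 and w + s + l < 1), and the Consumers Union figures are read as survival probabilities of .59 for quitters and .27 for smokers:

```python
# invented desirability components with s, l > 0 and w + s + l < 1
l, s, w = 0.5, 0.1, 0.2          # "long", "smoke", "worst"

# Consumers Union (1963) style survival probabilities
pr_long_given_quit = 0.59
pr_long_given_smoke = 0.27

def des_smoke(p_long):
    # (l+s+w) if long, (s+w) if shorter, weighted by survival probability
    return (l + s + w) * p_long + (s + w) * (1 - p_long)

def des_quit(p_long):
    return (l + w) * p_long + w * (1 - p_long)

prefers_quit = des_quit(pr_long_given_quit) > des_smoke(pr_long_given_smoke)

# algebraically, quitting wins iff the survival gain exceeds s/l
gain = pr_long_given_quit - pr_long_given_smoke      # about 0.32
assert prefers_quit == (gain > s / l)
assert prefers_quit                                   # s/l = 0.2 < 0.32
```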
BJ-ing the STP

If we interpret Savage (1954) as accepting the basic probability–desirability link (1) as well as the STP, he must be assuming that acts are probabilistically independent of states of nature. The businessman must then consider that his buying or not will have no influence on the outcome of the election; he must see p and q as equal, and accept the same reasoning for the MM and the businessman: choose the dominant act (smoke, buy).^4
6.1.4 Bayesian Frames.
Given a figure of merit for acts, Bayesian decision theory represents a certain structural concept of rationality. This is contrasted with substantive criteria of rationality having to do with the aptness of particular probability and utility functions in particular predicaments. With Donald Davidson^5 I would interpret this talk of rationality as follows. What remains when all substantive questions of rationality are set aside is bare logic, a framework for tracking your changing judgments, in which questions of validity and invalidity of argument-forms can be settled as illustrated above. The discussion goes one way or another, depending on particulars of the figure of merit that is adopted for acts. Here we compare and contrast the Bolker–Jeffrey figure of merit with Savage’s.
A complete, consistent set of substantive judgments would be represented by a “Bayesian frame” consisting of (1) a probability distribution over a space Ω of “possible worlds”, (2) a function u assigning “utilities” u(ω) to the various worlds ω ∈ Ω, and (3) an assignment of subsets of Ω as values of the sentence letters ‘A’, ‘B’, . . .. Then subsets A ⊆ Ω represent propositions; A is true in world ω iff ω ∈ A; and conjunction, disjunction and denial of propositions are represented by the intersection, union and complement with respect to Ω of the corresponding subsets. In any logic, validity of an argument is truth of the conclusion in every frame in which all the premises are true. In a Bayesian logic of decision, Bayesian frames represent possible answers to substantive questions of rationality; we can understand that without knowing how to determine whether particular frames would be substantively rational for you on particular occasions. So in Bayesian decision theory we can understand validity of an argument as truth of its conclusion in any Bayesian frame in which all of its premises are true, and understand
^4 This is a reconstructionist reading of Savage (1954), not widely shared.

^5 Davidson (1980) 2734. See also Jeffrey (1987) and sec. 12.8 of Jeffrey (1965, 1983, 1990).
consistency of a judgment (say, affirming A ≻ B while denying B ≻ A) as existence of a nonempty set of Bayesian frames in which the judgment is true. On this view, bare structural rationality is simply representability in the Bayesian framework.
6.2 Causality
In decision-making it is deliberation, not observation, that changes your probabilities. To think you face a decision problem rather than a question of fact about the rest of nature is to expect whatever changes arise in your probabilities for those states of nature during your deliberation to stem from changes in your probabilities of choosing options. In terms of the analogy with mechanical kinematics: as a decision-maker you regard probabilities of options as inputs driving the mechanism, not driven by it.
Is there something about your judgmental probabilities which shows that you are treating truth of one proposition as promoting truth of another—rather than as promoted by it or by truth of some third proposition which also promotes truth of the other? Here the promised positive answer to this question is used to analyze puzzling problems in which we see acts as mere symptoms of conditions we would promote or prevent if we could. Such “Newcomb problems” (Nozick, 1963, 1969, 1990) seem to pose a challenge to the decision theory floated in Jeffrey (1965, 1983, 1990), where notions of causal influence play no rôle. The present suggestion about causal judgments will be used to question the credentials of Newcomb problems as decision problems.

The suggestion is that imputations of causal influence do not show up simply as momentary features of probabilistic states of mind, but as intended or expected features of their evolution. Recall the following widely recognized necessary condition for the judgment that truth of one proposition (“cause”) promotes truth of another (“effect”):

Correlation: pr(effect | cause) > pr(effect | ¬cause).

But aside from the labels, what distinguishes cause from effect in this relationship? The problem is that it is a symmetrical relationship, which continues to hold when the labels are interchanged: pr(cause | effect) > pr(cause | ¬effect). It is also problematical that correlation is a relationship between contemporaneous values of the same conditional probability function.
What further condition can be added, to produce a necessary and sufficient pair? With Arntzenius (1990), I suggest the following answer, i.e., rigidity relative to the partition {cause, ¬cause}.^6

Rigidity: Constancy of pr(effect | cause) and pr(effect | ¬cause) as pr(cause) varies.
Rigidity is a condition on a variable (‘pr’) that ranges over a set of probability functions. The functions in the set represent ideally definite momentary probabilistic states of mind for the deliberating agent, as they might be at different times. This is invariance of conditional probabilities (3.1), shown in 3.2 to be equivalent to probability kinematics as a mode of updating. In statistics, the corresponding term is ‘sufficiency’: A sufficient statistic is a random variable whose sets of constancy (“data”) form a partition satisfying the rigidity condition. Clearly, pr can vary during deliberation, for if deliberation converges toward choice of a particular act, the probability of the corresponding proposition will rise toward 1. In general, agents’ intentions or assumptions about the kinematics of pr might be described by maps of possible courses of evolution of probabilistic states of mind—often, very simple maps. These are like road maps in that paths from point to point indicate feasibility of passage via the anticipated mode of transportation, e.g., ordinary automobiles, not “all terrain” vehicles. Your kinematical map represents your understanding of the dynamics of your current predicament, the possible courses of development of your probability and desirability functions.
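Probability kinematics under the rigidity condition can be sketched directly: move pr(cause) while holding both conditionals fixed, and check that they indeed stay put (the numbers are invented):

```python
# invented conditional probabilities for a cause/effect pair
p_eff_given_cause, p_eff_given_not = 0.8, 0.1

def joint(pc):
    # probability kinematics: move pr(cause) = pc while holding both
    # conditionals fixed -- the rigidity condition
    return {('c', 'e'): pc * p_eff_given_cause,
            ('c', '~e'): pc * (1 - p_eff_given_cause),
            ('~c', 'e'): (1 - pc) * p_eff_given_not,
            ('~c', '~e'): (1 - pc) * (1 - p_eff_given_not)}

for pc in (0.1, 0.3, 0.9):
    j = joint(pc)
    cond = j[('c', 'e')] / (j[('c', 'e')] + j[('c', '~e')])
    assert abs(cond - p_eff_given_cause) < 1e-12    # rigidity holds
    assert abs(sum(j.values()) - 1) < 1e-12         # still a probability
```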
The Logic of Decision used conditional expectation of utility given an act as the figure of merit for the act, namely, its desirability, des(act). Newcomb problems (Nozick 1969) led many to see that figure as acceptable only on special causal assumptions, and a number of versions of “causal decision theory” have been proposed as more generally acceptable.^7 But if Newcomb problems are excluded as bogus, then in genuine decision problems des(act) will remain constant throughout deliberation, and will be an adequate figure of merit.
Now there is much to be said about Newcomb problems, and I have said quite a lot elsewhere, as in Jeffrey (1996), and I am reluctant to finish this
^6 In general the partition need not be twofold. Note that if ‘cause’ denotes one element of a partition and ‘¬cause’ denotes the disjunction of all the other elements, ¬cause need not satisfy the rigidity condition even though all elements of the original partition do.

^7 In the one I like best (Skyrms 1980), the figure of merit for choice of an act is the agent’s expectation of des(A) on a partition K of causal hypotheses:

(CFM) Causal figure of merit relative to K = ex_K(u | A).

In the discrete case, K = {K_k : k = 1, 2, . . .} and the CFM is ∑_k pr(K_k) des(A ∧ K_k).
[Fig. 1: two diagrams relating deep states ±C, acts ±A, and plain states ±B, with conditional probabilities a, a′, b, b′, p, p′ attached to the arrows. (a) Ordinarily, acts ±A screen off deep states ±C from plain states ±B. (b) In Newcomb problems deep states screen off acts from plain states. Solid/dashed arrows indicate stable/labile conditional probabilities.]
book with close attention to what I see as a side issue. I would rather finish with an account of game theory, what Robert Aumann calls “interactive decision theory”. But I don’t seem to have time for that, and if you have not been round the track with Newcomb problems you may find the rest worthwhile, so I leave much of it in, under the label “supplements”.
6.3 Supplements: Newcomb Problems
Preview. The simplest sort of decision problem is depicted in Fig. 1(a), where C and ¬C represent states of affairs that may incline you one way or the other between options A and ¬A, and where your choice between those options may tend to promote or prevent the state of affairs B. But there is no probabilistic causal influence of ±C directly on ±B, without passing through ±A. Fig. 1(b) depicts the simplest “Newcomb” problem, where the direct probabilistic influence runs from ±C directly to ±A and to ±B, but there is no direct influence of ±A on ±B. Thus, Newcomb problems are not decision problems.

Stability of your conditional probabilities p = pr(B | A) and p′ = pr(B | ¬A) as your unconditional pr(A) varies is a necessary condition for you to view ±A as a direct probabilistic influence on truth or falsity of B.
6.3.1 “The Mild and Soothing Weed”
For smokers who see quitting as prophylaxis against cancer, preferability goes by initial des(act) as in Fig. 1(a); but there are views about smoking and cancer on which these preferences might be reversed. Thus, R. A. Fisher (1959) urged serious consideration of the hypothesis of a common inclining cause of (A) smoking and (B) bronchial cancer in (C) a bad allele of a certain gene, possessors of which have a higher chance of being smokers and of developing cancer than do possessors of the good allele (independently, given their allele). On that hypothesis, smoking is bad news for smokers but not bad for their health, being a mere sign of the bad allele, and, so, of bad health. Nor would quitting conduce to health, although it would testify to the agent’s membership in the low-risk group.
On Fisher's hypothesis, where acts ±A and states ±B are seen as inde-
pendently promoted by genetic states ±C, i.e., by presence (C) or absence
(¬C) of the bad allele of a certain gene, the kinematical constraints on pr
are the following. (Thanks to Brian Skyrms for this.)

Rigidity: The following are constant as c = pr(C) varies:
a = pr(A|C), a′ = pr(A|¬C), b = pr(B|C), b′ = pr(B|¬C).

Correlation: p > p′, where p = pr(B|A) and p′ = pr(B|¬A).

Indeterminacy: None of a, b, a′, b′ are 0 or 1.

Independence: pr(A ∧ B|C) = ab, pr(A ∧ B|¬C) = a′b′.

Since in general pr(F|G ∧ H) = pr(F ∧ G|H)/pr(G|H), the independence
and rigidity conditions imply that ±C screens off A and B from each other
in the following sense.

Screening-off: pr(A|B ∧ C) = a, pr(A|B ∧ ¬C) = a′,
pr(B|A ∧ C) = b, pr(B|A ∧ ¬C) = b′.
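These conditions can be checked numerically. The sketch below (the particular values of a, a′, b, b′ and c are illustrative assumptions of mine, not the text's; a2 and b2 stand in for a′ and b′) builds the joint distribution pr(±A ∧ ±B ∧ ±C) from the rigidity and independence constraints and confirms the screening-off identities:

```python
# Verify that independence of A and B given ±C makes ±C screen them off:
# pr(A|B∧C) = a and pr(B|A∧C) = b.  Illustrative numbers only.
a, a2 = 0.8, 0.2   # a = pr(A|C), a2 = pr(A|¬C)   (a2 plays the role of a′)
b, b2 = 0.7, 0.1   # b = pr(B|C), b2 = pr(B|¬C)
c = 0.3            # c = pr(C)

def joint(A, B, C):
    """pr(±A ∧ ±B ∧ ±C), with A and B independent given ±C."""
    pA = (a if C else a2); pA = pA if A else 1 - pA
    pB = (b if C else b2); pB = pB if B else 1 - pB
    return (c if C else 1 - c) * pA * pB

def pr(event):
    """Probability of an event, given as a predicate on (A, B, C)."""
    return sum(joint(A, B, C)
               for A in (True, False)
               for B in (True, False)
               for C in (True, False)
               if event(A, B, C))

# Screening-off: conditioning on B is idle once ±C is given.
prA_given_BC = pr(lambda A, B, C: A and B and C) / pr(lambda A, B, C: B and C)
prB_given_AC = pr(lambda A, B, C: A and B and C) / pr(lambda A, B, C: A and C)
assert abs(prA_given_BC - a) < 1e-12
assert abs(prB_given_AC - b) < 1e-12
```

The same bookkeeping verifies the other two screening-off identities with ¬C in place of C.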
Under these constraints, preference between A and ¬A can change as
pr(C) = c moves out to either end of the unit interval in thought-experiments
addressing the question "What would des(A) − des(¬A) be if I found I had the
bad/good allele?" To carry out these experiments, note that we can write

p = pr(B|A) = pr(A ∧ B)/pr(A)
  = [pr(A ∧ B|C)pr(C) + pr(A ∧ B|¬C)pr(¬C)] / [pr(A|C)pr(C) + pr(A|¬C)pr(¬C)],

and similarly for p′ = pr(B|¬A). Then we have

p = [abc + a′b′(1 − c)] / [ac + a′(1 − c)],
p′ = [(1 − a)bc + (1 − a′)b′(1 − c)] / [(1 − a)c + (1 − a′)(1 − c)].
Now final p and p′ are equal to each other, and to b or b′, depending on whether
final c is 1 or 0. Since it is c's rise to 1 or fall to 0 that makes pr(A) rise
or fall as much as it can without going off the kinematical map, the (quasi-
decision) problem has two ideal solutions, i.e., mixed acts in which the final
unconditional probability of A is the rigid conditional probability, a or a′,
depending on whether c is 1 or 0. But p = p′ in either case, so each solution
satisfies the conditions under which the dominant pure outcome (A) of the
mixed act maximizes des(±A). (This is a quasi-decision problem because
what is imagined as moving c is not the decision but factual information
about C.)
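As a numerical sketch of these thought-experiments (again with illustrative values of my own for the rigid a, a′, b, b′, written a2, b2), one can treat p and p′ as functions of c and watch both collapse to b at c = 1 and to b′ at c = 0, with the "bad news" correlation p > p′ holding everywhere in between:

```python
# Labile p = pr(B|A) and p2 = pr(B|¬A) as functions of c = pr(C), given
# rigid a = pr(A|C), a2 = pr(A|¬C), b = pr(B|C), b2 = pr(B|¬C).
a, a2, b, b2 = 0.8, 0.2, 0.7, 0.1   # illustrative values, mine

def p(c):
    return (a*b*c + a2*b2*(1 - c)) / (a*c + a2*(1 - c))

def p2(c):
    return ((1-a)*b*c + (1-a2)*b2*(1 - c)) / ((1-a)*c + (1-a2)*(1 - c))

# At the ends of the unit interval the correlation of B with A vanishes:
assert abs(p(1.0) - b) < 1e-12 and abs(p2(1.0) - b) < 1e-12
assert abs(p(0.0) - b2) < 1e-12 and abs(p2(0.0) - b2) < 1e-12
# In between, smoking is "bad news": p > p2.
assert all(p(c) > p2(c) for c in (0.1, 0.3, 0.5, 0.7, 0.9))
```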
As a smoker who believes Fisher’s hypothesis you are not so much trying
to make your mind up as trying to discover how it is already made up. But
this may be equally true in ordinary deliberation, where your question “What
do I really want to do?” is often understood as a question about the sort of
person you are, a question of which option you are already committed to,
unknowingly. The diagnostic mark of Newcomb problems is a strange linkage
of this question with the question of which state of nature is actual—strange,
because where in ordinary deliberation any linkage is due to an inﬂuence
of acts ±A on states ±B, in Newcomb problems the linkage is due to an
inﬂuence, from behind the scenes, of deep states ±C on acts ±A and plain
states ±B. This diﬀerence explains why deep states (“the sort of person I
am”) can be ignored in ordinary decision problems, where the direct eﬀect
of such states is wholly on acts, which mediate any further eﬀect on plain
states. But in Newcomb problems deep states must be considered explicitly,
for they directly aﬀect plain states as well as acts (Fig. 1).
In the kinematics of decision the dynamical role of forces can be played
by acts or deep states, depending on which of these is thought to inﬂuence
plain states directly. Ordinary decision problems are modelled kinematically
by applying the rigidity condition to acts as causes. Ordinarily, acts screen oﬀ
deep states from plain ones in the sense that B is conditionally independent
of ±C given ±A, so that while it is variation in c that makes pr(A) and pr(B)
vary, the whole of the latter variation is accounted for by the former (Fig.
1a). But to model Newcomb problems kinematically we apply the rigidity
condition to the deep states, which screen oﬀ acts from plain states (Fig.
1(b)). In Fig. 1(a), the probabilities b and b′ vary with c in ways determined by
the stable a's and p's, while in Fig. 1(b) the stable a's and b's shape the labile
p's as we have seen above:

p = [abc + a′b′(1 − c)] / [ac + a′(1 − c)],
p′ = [(1 − a)bc + (1 − a′)b′(1 − c)] / [(1 − a)c + (1 − a′)(1 − c)].
Similarly, in Fig. 1(a) the labile probabilities are

b = [apc + a′p′(1 − c)] / [ac + a′(1 − c)],
b′ = [(1 − a)pc + (1 − a′)p′(1 − c)] / [(1 − a)c + (1 − a′)(1 − c)].
While C and ¬C function as causal hypotheses, they do not announce
themselves as such, even if we identify them by the causal rôles they are
meant to play, as when we identify the "bad" allele as the one that promotes
cancer and inhibits quitting. If there is such an allele, it is a still unidentified
feature of human DNA. Fisher was talking about hypotheses that further
research might specify, hypotheses he could only characterize in causal and
probabilistic terms—terms like "malaria vector" as used before 1898, when
the anopheles mosquito was shown to be the organism playing that aetio-
logical rôle. But if Fisher's science fiction story had been verified, the status
of certain biochemical hypotheses C and ¬C as the agent's causal hypothe-
ses would have been shown by satisfaction of the rigidity conditions, i.e.,
constancy of pr(−|C) and of pr(−|¬C), with C and ¬C spelled out as tech-
nical specifications of alternative features of the agent's DNA. Probabilistic
features of those biochemical hypotheses, e.g., that they screen acts off from
states, would not be stated in those hypotheses, but would be shown by inter-
actions of those hypotheses with pr, B and A, i.e., by truth of the following
consequences of the kinematical constraints:

pr(B|act ∧ C) = pr(B|C),    pr(B|act ∧ ¬C) = pr(B|¬C).
No purpose would be served by packing such announcements into the
hypotheses themselves, for even if true, such announcements would be re-
dundant. The causal talk, however useful as commentary, does no work in
the matter commented upon.⁸
6.3.2 The Flagship Newcomb Problem
Nozick’s (1969) ﬂagship Newcomb problem resolutely fends oﬀ naturalism
about deep states, making a mystery of the common inclining cause of acts
and plain states while suggesting that the mystery could be cleared up in
various ways, pointless to elaborate. Thus, Nozick (1969) begins:
Suppose a being [call her ‘Alice’] in whose power to predict your
choices you have enormous conﬁdence. (One might tell a science
ﬁction story about a being from another planet, with an advanced
⁸See Joyce (2002); also Leeds (1984) in another connection.
[Fig. 2. Two patterns of density migration: (a) initial, flat density; (b) push toward x = 1, density (t+1)x^t; (c) push toward x = 0, density (t+1)(1−x)^t.]
technology and science, who you know to be friendly, and so
on.) You know that this being has often correctly predicted your
choices in the past (and has never, so far as you know, made
an incorrect prediction about your choices), and furthermore you
know that this being has often correctly predicted the choices of
other people, many of whom are similar to you, in the particular
situation to be described below. One might tell a longer story,
but all this leads you to believe that almost certainly this being’s
prediction about your choice in the situation to be discussed will
be correct. There are two boxes ...
Alice has surely put $1,000 in one box, and (¬B) she has left the second
box empty or she has (B) put $1,000,000 in it, depending on whether she
predicts that you will (A2) take both boxes, or that you will (A1) take only
the second—the one that may contain the $1,000,000. (In the previous notation, A2 = A.)
Here you are to imagine yourself in a probabilistic frame of mind where
your desirability for A1 is greater than that for A2 because although you
think A's truth or falsity has no influence on B's, your conditional probability
p = pr(B|A1) is ever so close to 1 and your p′ = pr(B|A2) is ever so close to 0.
Does that seem a tall order? Not to worry! High correlation is a red herring;
a tiny bit will do, e.g., if desirabilities are proportional to dollar payoffs, then
the 1-box option, A1, maximizes desirability as long as p − p′ is greater than
.001.
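Taking desirabilities proportional to dollar payoffs, that threshold can be checked directly. The bookkeeping below is a sketch on my reading of the setup (the million is in the second box iff B; p and p2 stand for pr(B|A1) and pr(B|A2)):

```python
# One-box (A1) vs two-box (A2) desirabilities when des is expected dollars.
def des_A1(p):   # take only the second box: $1,000,000 iff B
    return 1_000_000 * p

def des_A2(p2):  # take both: $1,000 for sure, plus $1,000,000 iff B
    return 1_000 + 1_000_000 * p2

# des(A1) > des(A2)  iff  1,000,000*(p - p2) > 1,000  iff  p - p2 > .001
assert des_A1(0.0015) > des_A2(0.0)     # a whisker of correlation suffices
assert des_A1(0.9991) > des_A2(0.998)   # high correlation is not the point
assert des_A1(0.0005) < des_A2(0.0)     # below the threshold, two-boxing wins
```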
To see how that might go numerically, think of the choice and the predic-
tion as determined by independent drawings by the agent and the predictor
from the same urn, which contains tickets marked '2' and '1' in an unknown
proportion x : 1 − x. Initially, the agent's unit of probability density over the
range [0,1] of possible values of x is flat (Fig. 2a), but in time it can push
[Table 4. Deterministic 1-box solution; pr(shaded) = 0; β = 1 until decision. Panels: (a) Indecision, 0 < r < 1; (b) Decision, pr('2') = 1; (c) Decision, pr('1') = 1. Cell entries range over good, bad, best, worst.]
toward one end of the unit interval or the other, e.g., as in Fig. 2(b), (c).⁹ At t =
997 these densities determine the probabilities and desirabilities in Table 3(b)
and (c), and higher values of t will make des(A2) − des(A1) positive. Then if t is
calibrated in thousandths of a minute this map has the agent preferring the
2-box option after a minute's deliberation. The urn model leaves the deep
state mysterious, but clearly specifies its mysterious impact on acts and plain
states.
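On one reconstruction of the urn model (mine, not the text's: payoffs in dollars, x the proportion of '2' tickets, agent and predictor drawing independently given x, and the density after t pushes toward x = 1 taken to be (t+1)x^t, i.e., Beta(t+1, 1)), the t = 997 calibration can be verified: the desirability gap between the 1-box and 2-box acts changes sign exactly there.

```python
# Urn model sketch: tickets '2' and '1' in proportion x : 1-x.  After t
# pushes toward x = 1 the density over x is f_t(x) = (t+1)*x**t, a
# Beta(t+1, 1) density, with E[x] = (t+1)/(t+2) and E[x^2] = (t+1)/(t+3).
def des_gap(t):
    """des(A1) - des(A2) in dollars, on this reconstruction."""
    ex  = (t + 1) / (t + 2)          # pr(a given draw is '2')
    ex2 = (t + 1) / (t + 3)          # pr(both draws are '2') = E[x^2]
    exy = ex - ex2                   # pr(agent '2', predictor '1') = E[x(1-x)]
    e1  = 1 - ex                     # pr(a given draw is '1')
    e11 = 1 - 2 * ex + ex2           # pr(both draws are '1') = E[(1-x)^2]
    des_A1 = 1_000_000 * e11 / e1            # one box: million iff predicted '1'
    des_A2 = 1_000 + 1_000_000 * exy / ex    # both: thousand, plus million iff predicted '1'
    return des_A1 - des_A2

# Simplifying, des_gap(t) = 1_000_000/(t+3) - 1_000: zero exactly at t = 997.
assert des_gap(996) > 0 > des_gap(998)
assert abs(des_gap(997)) < 1e-5
```

Note that footnote 9 keeps slightly different books for pr(A) and pr(B|A); the sketch above is offered only as one self-consistent accounting that reproduces the t = 997 indifference point.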
The irrelevant detail of high correlation, or high K = pr(B|A) − pr(B|¬A),
was a bogus shortcut to the 1-box conclusion, obtained if K is not just high
but maximum, which happens when p = 1 and p′ = 0. This means that the
"best" and "worst" cells in the payoff table have unconditional probability 0.
Then taking both boxes means a thousand, taking just one means a million,
and preference between acts is clear, as long as r is neither 0 nor 1
and p − p′ remains maximum, 1. The density functions of Fig. 2 are replaced by
probability assignments r and 1 − r to the possibilities that the ratio of 2-
box tickets to 1-box tickets in the urn is 1:0 and 0:1, i.e., to the two ways
in which the urn can control the choice and the prediction deterministically
and in the same way. In place of the smooth density spreads in Fig. 2 we
now have point-masses r and 1 − r at the two ends of the unit interval,
with desirabilities of the two acts constant as long as r is neither 0 nor 1.
Now the 1-box option is preferable throughout deliberation, up to the very
moment of decision.¹⁰
But of course this reasoning uses the premise that
pr(predict 2|take 2) − pr(predict 2|take 1) = β = 1 through deliberation, a
premise making abstract sense in terms of uniformly stocked urns, but very
hard to swallow as a real possibility.
⁹In this kinematical map pr(A) = ∫₀¹ x^(t+1) f(x) dx and pr(B|A) = ∫₀¹ x^(t+2) f(x) dx / pr(A),
with f(x) as in Fig. 2(b) or (c). Thus, with f(x) as in (b), pr(A) = (t + 1)/(t + 3) and
pr(B|A) = (t + 2)/(t + 3). See Jeffrey, 1988.
¹⁰At the moment of decision the desirabilities of shaded rows in (b) and (c) are not
determined by ratios of unconditional probabilities, but continuity considerations suggest
that they remain good and bad, respectively.
6.3.3 Hofstadter
Hofstadter (1983) saw prisoners' dilemmas as down-to-earth Newcomb prob-
lems. Call the prisoners Alma and Boris. If one confesses and the other does
not, the confessor goes free and the other serves a long prison term. If nei-
ther confesses, both serve short terms. If both confess, both serve intermedi-
ate terms. From Alma's point of view, Boris's possible actions (B, confess,
or ¬B, don't) are states of nature. She thinks they think alike, so that her
choices (A, confess; ¬A, don't) are pretty good predictors of his, even though
neither's choices influence the other's. If both care only to minimize their
own prison terms this problem fits the format of Table 1(a). The prisoners
are thought to share a characteristic determining their separate probabilities
of confessing in the same way—independently, on each hypothesis about
that characteristic. Hofstadter takes that characteristic to be rationality, and
compares the prisoners' dilemma to the problem Alma and Boris might have
faced as bright children, independently working the same arithmetic prob-
lem, whose knowledge of each other's competence and ambition gives them
good reason to expect their answers to agree before either knows the answer:
"If reasoning guides me to [...], then, since I am no different from anyone else
as far as rational thinking is concerned, it will guide everyone to [...]." The
deep states seem less mysterious here than in the flagship Newcomb problem;
here they have some such form as C_x = We are both likely to get the right
answer, i.e., x. (And here ratios of utilities are generally taken to be on the
order of 10:1 instead of the 1000:1 ratios that made the other endgame so
demanding. With utilities 0, 1, 10, 11 instead of 0, 1, 1000, 1001, indifference
between confessing and remaining silent now comes at p − p′ = 1/10.) To heighten simi-
larity to the prisoners' dilemma let us suppose the required answer is the
parity of x, so that the deep states are simply C = We are both likely to get
the right answer, i.e., even, and ¬C = We are both likely to get the right
answer, i.e., odd.
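With the 0, 1, 10, 11 utilities the indifference point can be checked numerically. The assignment of utilities to outcome cells below (11 = go free, 10 = short term, 1 = intermediate term, 0 = long term) is my reading of the story, not an assignment stated in the text:

```python
# Alma's desirabilities.  p = pr(Boris confesses | Alma confesses),
# p2 = pr(Boris confesses | Alma stays silent).
def des_confess(p):
    return 1 * p + 11 * (1 - p)    # = 11 - 10*p

def des_silent(p2):
    return 0 * p2 + 10 * (1 - p2)  # = 10 - 10*p2

# Indifference comes at p - p2 = 1/10 (against .001 in the flagship problem):
assert abs(des_confess(0.6) - des_silent(0.5)) < 1e-9   # p - p2 = 0.1
assert des_silent(0.5) > des_confess(0.65)              # p - p2 > 0.1
assert des_confess(0.55) > des_silent(0.5)              # p - p2 < 0.1
```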
What's wrong with Hofstadter's view of this as justifying the coöperative
solution? [And with von Neumann and Morgenstern's (p. 148) transcendental
argument, remarked upon by Skyrms (1990, pp. 13-14), for expecting rational
players to reach a Nash equilibrium?] The answer is failure of the rigidity
conditions for acts, i.e., variability of pr(He gets x | I get x) with pr(I get x)
in the decision maker's kinematical map. It is Alma's conditional probability
functions pr(−|±C) rather than pr(−|±A) that remain constant as her
probabilities for the conditions vary. The implausibility of initial des(act) as
a figure of merit for her act is simply the implausibility of positing constancy
of pr(−|±A) as her probability function pr evolves in response to changes in pr(A). But
the point is not that confessing is the preferable act, as causal decision theory
would have it. It is rather that Alma's problem is not indecision about which
act to choose, but ignorance of which allele is moving her.

Hofstadter's (1983) version of the prisoners' dilemma and the flagship
Newcomb problem have been analyzed here as cases where plausibility de-
mands a continuum [0,1] of possible deep states, with opinion evolving as
smooth movements of probability density toward one end or the other draw
probabilities of possible acts along toward 1 or 0. The problem of the smoker
who believes Fisher's hypothesis was simpler in that only two possibilities
(C, ¬C) were allowed for the deep state, neither of which determined the
probability of either act as 0 or 1.
6.3.4 Conclusion
The flagship Newcomb problem owes its bizarrerie to the straightforward
character of the pure acts: surely you can reach out and take both boxes, or
just the opaque box, as you choose. Then as the pure acts are options, you
cannot be committed to either of the non-optional mixed acts. But in the
Fisher problem, those of us who have repeatedly "quit" easily appreciate the
smoker's dilemma as humdrum entrapment in some mixed act, willy-nilly.
That the details of the entrapment are describable as cycles of temptation,
resolution and betrayal makes the history no less believable—only more petty.
Quitting and continuing are not options, i.e., pr(A) ≈ 0 and pr(A) ≈ 1 are not
destinations you think you can choose, given your present position on your
kinematical map, although you may eventually find yourself at one of them.
The reason is your conviction that if you knew your genotype, your value of
pr(A) would be either a or a′, neither of which is ≈ 0 or ≈ 1. (Translation:
"At places on the map where pr(C) is at or near 0 or 1, pr(A) is not.") The
extreme version of the story, with a ≈ 1 and a′ ≈ 0, is more like the flagship
Newcomb problem; here you do see yourself as already committed to one of
the pure acts, and when you learn which that is, you will know your genotype.

I have argued that Newcomb problems are like Escher's famous staircase
on which an unbroken ascent takes you back where you started.¹¹ We know
there can be no such things, but see no local flaw; each step makes sense, but
there is no way to make sense of the whole picture; that's the art of it.¹²
¹¹Escher (1960), based on Penrose and Penrose (1958), pp. 31-33; see Escher (1989), p. 78.
¹²Elsewhere I have accepted Newcomb problems as decision problems, and accepted "2-
box solutions" as correct. In Jeffrey 1983, secs. 1.7 and 1.8, I proposed a new criterion of
acceptability of an act—"ratifiability"—which proved to break down in certain cases (see
the corrected paperback 2nd edition, 1990, p. 20). In Jeffrey (1988) and (1993) ratifiability
was recast in terms more like the present ones, but still treating Newcomb problems as
decision problems—a position which I now renounce.
In writing The Logic of Decision in the early 1960s I failed to see Bob
Nozick's deployment of Newcomb's problem in his Ph.D. dissertation as a
serious problem for the theory floated in chapter 1, and failed to see the mate-
rial on rigidity in chapter 11 as solving that problem. It took a long time for
me to take the "causal" decision theorists (Stalnaker, Gibbard and Harper,
Skyrms, Lewis, Sobel) as seriously as I needed to in order to appreciate that
my program of strict exclusion of causal talk would be an obstacle for me—
an obstacle that can be overcome by recognizing causality as a concept that
emerges naturally when we patch together the differently dated or clocked
probability assignments that arise in formulating and then solving a decision
problem. I now conclude that in Newcomb problems, "One box or two?" is
not a question about how to choose, but about what you are already set to
do, willy-nilly. Newcomb problems are not decision problems.
6.4 References
Frank Arntzenius, 'Physics and common causes', Synthese 82 (1990) 77-96.
Kenneth J. Arrow et al. (eds.), Rational Foundations of Economic Behaviour, St. Martin's (1996).
Ethan Bolker, Functions Resembling Quotients of Measures, Ph.D. dissertation (1965), Harvard University.
———, 'Functions resembling quotients of measures', Transactions of the American Mathematical Society 124 (1966) 292-312.
———, 'A simultaneous axiomatization of utility and subjective probability', Philosophy of Science 34 (1967) 333-340.
Donald Davidson, Essays on Actions and Events, Oxford: Clarendon Press (1980).
Bruno de Finetti, 'Probabilismo', Pirella, Naples, 1931.
Maurits Cornelis Escher, Escher on Escher, New York: Abrams, 1989.
———, 'Ascending and Descending' (lithograph, 1960), based on Penrose and Penrose (1958).
James H. Fetzer (ed.), Probability and Causality, Reidel, Dordrecht, 1988.
Ronald Fisher, Smoking: The Cancer Controversy, Edinburgh and London: Oliver and Boyd, 1959.
Haim Gaifman, 'A theory of higher-order probability', in Brian Skyrms and William Harper (eds.), Causality, Chance, and Choice, Reidel, 1988.
Richard Jeffrey, The Logic of Decision, McGraw-Hill, 1965; 2nd ed., revised, University of Chicago Press, 1983; further revision of 2nd ed., 1990.
———, 'Risk and human rationality', The Monist 70 (1987).
———, 'How to Probabilize a Newcomb Problem', in Fetzer (ed., 1988).
———, 'Decision Kinematics', in Arrow et al., eds. (1996) 3-19.
James M. Joyce, 'Levi on causal decision theory and the possibility of predicting one's own actions', Philosophical Studies 110 (2002) 69-102.
Steven Leeds, 'Chance, realism, quantum mechanics', Journal of Philosophy 81 (1984) 567-578.
George W. Mackey, Mathematical Foundations of Quantum Mechanics, New York: W. A. Benjamin, 1963.
Robert Nozick, The Normative Theory of Individual Choice, Princeton University, Ph.D. dissertation (1963).
———, photocopy of Nozick (1963) with new preface, New York: Garland, 1990.
———, 'Newcomb's problem and two principles of choice', in Rescher (ed.) (1969).
L. S. Penrose and R. Penrose, 'Impossible objects: a special type of visual illusion', The British Journal of Psychology 49 (1958) 31-33.
Wlodek Rabinowicz, 'Does practical deliberation crowd out self-prediction?', Erkenntnis 57 (2002) 91-122.
Nicholas Rescher (ed.), Essays in Honor of Carl G. Hempel, Dordrecht: Reidel, 1969.
Brian Skyrms, Causal Necessity, Yale U.P. (1980).
John von Neumann and Oskar Morgenstern, Theory of Games and Economic Behavior, Princeton University Press, 1943, 1947.
who grew unpredictably from child apples of our eyes to a unionside labor lawyer (Pam) and a computernik (Dan) and have been so generously attentive and helpful as to shame me when I think of how I treated my own ambivalently beloved parents. for example. our son Daniel and daughter Pamela. as well as other matters. Except. whose deserts were so much better than what I gave them. for the last and most important stop.
Chapter 1

Probability Primer

Yes or no: was there once life on Mars? We can't say. What about intelligent life? That seems most unlikely, but again, we can't really say. The simple yes-or-no framework has no place for shadings of doubt; no room to say that we see intelligent life on Mars as far less probable than life of a possibly very simple sort. Nor does it let us express exact probability judgments, if we have them. We can do better.

1.1 Bets and Probabilities

What if you were able to say exactly what odds you would give on there having been life, or intelligent life, on Mars? That would be a more nuanced form of judgment, and perhaps a more useful one. Odds m:n correspond to probability m/(m+n). (The colon is commonly used as a notation for '/', division, in giving odds—in which case it is read as "to".) Suppose your odds were 1:9 for life, and 1:999 for intelligent life, corresponding to probabilities of 1/10 and 1/1000, respectively. That means you would see no special advantage for either player in risking one dollar to gain nine in case there was once life on Mars, and it means you would see an advantage on one side or the other if those odds were shortened or lengthened. And similarly for intelligent life on Mars, when the risk is a thousandth of the same ten dollars (1 cent) and the gain is 999 thousandths ($9.99).

Here is another way of saying the same thing: You would think a price of one dollar just right for a ticket worth ten if there was life on Mars and nothing if there was not, but you would think a price of only one cent right if there would have had to have been intelligent life on Mars for the ticket to be worth ten dollars. These are the two tickets:

    [Worth $10 if there was life on Mars.]               Price $1      Probability .1
    [Worth $10 if there was intelligent life on Mars.]   Price 1 cent  Probability .001

So if you have an exact judgmental probability for truth of a hypothesis, it corresponds to your idea of the dollar value of a ticket that is worth 1 unit or nothing, depending on whether the hypothesis is true or false. (For the hypothesis of mere life on Mars the unit was $10; the price was a tenth of that.)

Of course you need not have an exact judgmental probability for life on Mars, or for intelligent life there. Still, we know that any probabilities anyone might think acceptable for those two hypotheses ought to satisfy certain rules, e.g., that the first cannot be less than the second. That is because the second hypothesis implies the first. (See the implication rule at the end of sec. 1.3 below.) In sec. 1.2 we turn to the question of what the laws of judgmental probability are, and why. Meanwhile, take some time with the following questions, as a way of getting in touch with some of your own ideas about probability. Afterward, read the discussion that follows.

Questions

1 A vigorously flipped thumbtack will land on the sidewalk. Is it reasonable for you to have a probability for the hypothesis that it will land point up?

2 An ordinary coin is to be tossed twice in the usual way. What is your probability for the head turning up both times?
(a) 1/3, because 2 heads is one of three possibilities: 2 heads, 1 head, 0 heads?
(b) 1/4, because 2 heads is one of four possibilities: HH, HT, TH, TT?

3 There are three coins in a bag: ordinary, two-headed, and two-tailed. One is shaken out onto the table and lies head up. What should be your probability that it's the two-headed one:
(a) 1/2, since it can only be two-headed or normal?
(b) 2/3, because the other side could be the tail of the normal coin, or either side of the two-headed one? (Suppose the sides have microscopic labels.)

4 It's a goy!(1)
(a) As you know, about 49% of recorded human births have been girls. What is your judgmental probability that the first child born after time t (say, t = the beginning of the 22nd century, GMT) will be a girl?
(b) A goy is defined as a girl born before t or a boy born thereafter. As you know, about 49% of recorded human births have been goys. What is your judgmental probability that the first child born in the 22nd century will be a goy?

[Footnote 1: This is a fanciful adaptation of Nelson Goodman's (1983, pp. 73-4) "grue" paradox.]

Discussion

1 Surely it is reasonable to suspect that the geometry of the tack gives one of the outcomes a better chance of happening than the other; but if you have no clue about which of the two has the better chance, it may well be reasonable to have judgmental probability 1/2 for each. Evidence about the chances might be given by statistics on tosses of similar tacks, e.g., if you learned that in 20 tosses there were 6 up's you might take the chance of up to be in the neighborhood of 30%; and whether or not you do that, you might well adopt 30% as your judgmental probability for up on the next toss.

2, 3 In either problem you can arrive at a judgmental probability by trying the experiment (or a similar one) often enough, and seeing the statistics settle down close enough to 1/2 or to 1/3 to persuade you that more trials will not reverse the indications. In each of these problems it is the finer of the two suggested analyses that happens to make more sense, but any analysis can be refined in significantly different ways, and there is no point at which the process of refinement has to stop. (Head or tail can be refined to head-facing-north or head-not-facing-north or tail.) Indeed some of these analyses seem more natural or relevant than others, but that reflects the probability judgments we bring with us to the analyses. These questions are meant to undermine the impression that judgmental probabilities can be based on analysis into cases in a way that does not already involve probabilistic judgment (e.g., the judgment that the cases are equiprobable).
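The odds-to-probability conversion of sec. 1.1 is mechanical enough to check by computation. Here is a minimal sketch in exact arithmetic; the function names are mine, not the text's:

```python
from fractions import Fraction

def odds_to_probability(m, n):
    # Odds m:n in favor of a hypothesis correspond to probability m/(m+n).
    return Fraction(m, m + n)

def fair_ticket_price(stake, m, n):
    # Break-even price for a ticket worth `stake` if the hypothesis is
    # true and nothing if it is false.
    return stake * odds_to_probability(m, n)

# Odds 1:9 for life on Mars: probability 1/10, so the $10 ticket costs $1.
assert fair_ticket_price(10, 1, 9) == 1
# Odds 1:999 for intelligent life: probability 1/1000, so the price is one cent.
assert fair_ticket_price(10, 1, 999) == Fraction(1, 100)
```

Exact fractions rather than floats keep these small book-keeping identities exact.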
4 Goys and birls. This question is meant to undermine the impression that judgmental probabilities can be based on frequencies in a way that does not already involve judgmental probabilities. Since all girls born so far have been goys, the current statistics for girls apply to goys as well: these days, about 49% of human births are goys. Then if you read probabilities off statistics in a straightforward way your probability will be 49% for each of these hypotheses:

(1) The first child born after t will be a girl.
(2) The first child born after t will be a goy.

Thus pr(1) + pr(2) = 98%. But it is clear that those probabilities should sum to 100%, since (2) is logically equivalent to

(3) The first child born after t will be a boy,

and pr(1) + pr(3) = 100%.

[Diagram: a 2-by-2 table of Girl (49%) and Boy (51%) against born before t (Yes/No); shaded cells are goys, blank cells birls.]

To avoid this contradiction you must decide which statistics are relevant to pr(2): the 49% of goys born before t, or the 51% of boys. And that is not a matter of statistics but of judgment—no less so because we would all make the same judgment: the 51% of boys.

1.2 Why Probabilities are Additive

Authentic tickets of the Mars sort are hard to come by. Is the first of them really worth $10 to me if there was life on Mars? Probably not. If the truth is not known in my lifetime, I cannot cash the ticket even if it is really a winner. But some probabilities are plausibly represented by prices, e.g., probabilities of the hypotheses about athletic contests and lotteries that people commonly bet on. And it is plausible to think that the general laws of probability ought to be the same for all hypotheses—about planets no less than about ball games. If that is so, we can justify laws of probability if we can prove all betting policies that violate them to be inconsistent. Such justifications are called "Dutch book arguments".(2) In British racing jargon a book is the set of bets a bookmaker has accepted, and a book against someone—a "Dutch" book—is one the bookmaker will suffer a net loss on no matter how the race turns out. We shall give a Dutch book argument for the requirement that probabilities be additive in this sense:

Finite Additivity. The probability of a hypothesis that can be true in a finite number of incompatible ways is the sum of the probabilities of those ways.

[Footnote 2: I follow Brian Skyrms in seeing F. P. Ramsey as holding Dutch book arguments to demonstrate actual inconsistency. See Ramsey's 'Truth and Probability' in his Philosophical Papers, D. H. Mellor, ed.: Cambridge, 1990.]

Example 1. The probability p of the hypothesis

(H) A sage will be elected

is q + r + s if exactly three of the candidates are sages and their probabilities of winning are q, r, and s. The logical situation is diagrammed as follows, where the points in the big rectangle represent all the ways the election might come out, specified in minute detail, and the small rectangles represent the ways in which the winner might prove to be A, or B, or C, or D, etc. Here A, B, C, D, ... are the hypotheses that the various different candidates win—the first three being the sages.

[Diagram: the big rectangle, divided into regions H true / H false, partitioned into small rectangles A, B, C, D, ...; the probabilities of cases A, B, C, D, ... are q, r, s, t, ....]

1.2.1 Dutch Book Argument for Finite Additivity. For definiteness we suppose, as in example 1, that the hypothesis in question is true in three cases; the argument differs inessentially for other examples, with other finite numbers of cases. Now consider the following four tickets.

    [Worth $1 if H is true.  Price $p]
    [Worth $1 if A is true.  Price $q]
    [Worth $1 if B is true.  Price $r]
    [Worth $1 if C is true.  Price $s]

Suppose I am willing to buy or sell any or all of these tickets at the stated prices. Why should p be the sum q + r + s? Because no matter what it is
worth—$1 or $0—the ticket on H is worth exactly as much as the tickets on A, B, C together. (If H loses it is because A, B, C all lose; if H wins it is because exactly one of A, B, C wins.) Then if the price of the H ticket is different from the sum of the prices of the other three, I am inconsistently placing different values on one and the same contract, depending on how it is presented.

If I am inconsistent in that way, I can be fleeced by anyone who will ask me to buy the H ticket and sell or buy the other three, depending on whether p is more or less than q + r + s. Thus, no matter whether the equation p = q + r + s fails because the left-hand side is more or less than the right, a book can be made against me. That is the Dutch book argument for additivity when the number of ultimate cases under consideration is finite. The talk about being fleeced is just a way of dramatizing the inconsistency of any policy in which the dollar value of the ticket on H is anything but the sum of the values of the other three tickets: to place a different value on the three tickets on A, B, C from the value you place on the H ticket is to place different values on the same commodity bundle under two demonstrably equivalent descriptions.

When the number of cases is infinite, a Dutch book argument for additivity can still be given—provided the infinite number is not too big! It turns out that not all infinite sets are the same size:

Example 2, Cantor's Diagonal Argument. The sets of positive integers (I+'s) cannot be counted off as first, second, etc., with each such set appearing as n'th in the list for some finite positive integer n. This was proved by Georg Cantor (1895) as follows. Any set of I+'s can be represented by an endless string of plusses and minuses ("signs"): e.g., the set {2, 3, 5, 7, ...} of prime numbers by an endless string that begins −++−+−+, the set of even I+'s by the string −+−+..., in which plusses appear at the even-numbered positions and minuses at the odd, the set of all the I+'s by an endless string of plusses, and the set of no I+'s by an endless string of minuses. Cantor proved that no list of endless strings of signs can be complete. He used an amazingly simple method ("diagonalization") which, applied to any such list, yields an endless string d̄ of signs which is not in that list. For definiteness, suppose the first four strings in the list are the examples already given, so that the list has the general shape

    s1: − + − + ...
    s2: − + + − ...
    s3: + + + + ...
    s4: − − − − ...
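The diagonalization just described—flip the n'th sign of the n'th string—can be tried out mechanically. In this sketch (mine, not the text's), finite prefixes stand in for the endless strings:

```python
def antidiagonal(strings):
    # Flip the n'th sign of the n'th string; the result differs from
    # every listed string in at least one position.
    flip = {'+': '-', '-': '+'}
    return ''.join(flip[s[n]] for n, s in enumerate(strings))

listing = ['-+-+', '-++-', '++++', '----']   # s1..s4, truncated to length 4
d_bar = antidiagonal(listing)
assert d_bar == '+--+'
assert all(d_bar[n] != s[n] for n, s in enumerate(listing))
```

However long the (finite) list, the same one-line construction produces a string that disagrees with each entry at the flipped position.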
Define the diagonal of that list as the string d consisting of the first sign in s1, the second sign in s2, and, in general, the n'th sign in sn: − + + − .... And define the antidiagonal d̄ of that list as the result of reversing all the signs in the diagonal: + − − + .... In general, d̄ cannot be any member sn of the list, for, by definition, the n'th sign in d̄ is different from the n'th sign of sn, whereas if d̄ were some sn, those two strings would have to agree, sign by sign. Then the set of I+'s defined by the antidiagonal of a list cannot be in that list, and therefore no list of sets of I+'s can be complete.

Countability. A countable set is defined as one whose members (if any) can be arranged in a single list, in which each member appears as the n'th item for some finite n. Of course any finite set is countable in this sense, and some infinite sets are countable. An obvious example of a countably infinite set is the set I+ = {1, 2, 3, ...} of all positive whole numbers. A less obvious example is the set I of all the whole numbers, negative, positive, or zero: {..., −2, −1, 0, 1, 2, ...}. The members of this set can be rearranged in a single list of the sort required in the definition of countability: 0, 1, −1, 2, −2, 3, −3, .... So the set of all the whole numbers is countable. Order does not matter, as long as every member of I shows up somewhere in the list.

Now an extension of the law of finite additivity to countably infinite sets would be this:

Countable Additivity. The probability of a hypothesis H that can be true in a countable number of incompatible ways A1, A2, ... is the sum pr(H) = pr(A1) + pr(A2) + ... of the probabilities of those ways.

Example 3. In example 1, suppose there were an endless list of candidates, including no end of sages. If H says that a sage wins, and A1, A2, ... identify the winner as the first, second, ... sage, then countable additivity requires that pr(H) = pr(A1) + pr(A2) + ....
This equation would be satisfied if the probability of one or another sage's winning were pr(H) = 1/2, and the probabilities of the first, second, third, ... sage's winning were 1/4, 1/8, 1/16, ..., decreasing by half each time.

1.2.2 Dutch book argument for countable additivity. Consider the following infinite array of tickets, where the mutually incompatible A's collectively exhaust the ways in which H can be true (as in example 3):(3)

    [Pay the bearer $1 if H is true.   Price $pr(H)]
    [Pay the bearer $1 if A1 is true.  Price $pr(A1)]
    [Pay the bearer $1 if A2 is true.  Price $pr(A2)]
    ···

Why should my price for the first ticket be the sum of my prices for the others? Because no matter what it is worth—$1 or $0—the first ticket is worth exactly as much as all the others together. (If H loses it is because the others all lose; if H wins it is because exactly one of the others wins.) Then if the first price is different from the sum of the others, I am inconsistently placing different values on one and the same contract, depending on how it is presented. Failure of additivity in these cases implies inconsistency of valuations: a judgment that certain transactions are at once (1) reasonable and (2) sure to result in an overall loss. Consistency requires additivity to hold for countable sets of alternatives, finite or infinite.

[Footnote 3: No matter that there is not enough paper in the universe for an infinity of tickets. One small ticket can save the rain forest by doing the work of all the A tickets together. This eco-ticket will say: 'For each positive whole number n, pay the bearer $1 if An is true.']
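The endless-sages assignment above (pr(H) = 1/2, with 1/4, 1/8, 1/16, ... for the individual sages) can be checked in exact arithmetic: the partial sums approach 1/2, with a shortfall that halves at every step.

```python
from fractions import Fraction

def partial_sum(k):
    # Sum of the first k sage probabilities: 1/4 + 1/8 + ... + 1/2**(k+1).
    return sum(Fraction(1, 2 ** (i + 1)) for i in range(1, k + 1))

assert partial_sum(3) == Fraction(7, 16)
# The shortfall from 1/2 after k terms is exactly 1/2**(k+1).
assert Fraction(1, 2) - partial_sum(10) == Fraction(1, 2 ** 11)
```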
1.3 Probability Logic

The simplest laws of probability are the consequences of finite additivity under this additional assumption:

Probabilities are real numbers in the range from 0 to 1, with the endpoints reserved for certainty of falsehood and of truth, respectively.

This makes it possible to read probability laws off diagrams, much as we read ordinary logical laws off them. Let's see how that works for the ordinary ones, beginning with two surprising examples (where "iff" means if and only if):

De Morgan's Laws
(1) ¬(G ∧ H) = ¬G ∨ ¬H   ("Not both true iff at least one false")
(2) ¬(G ∨ H) = ¬G ∧ ¬H   ("Not even one true iff both false")

Here the bar, the wedge and juxtaposition stand for not, or and and, respectively. Thus, if G and H are two hypotheses,

    G ∧ H (or GH) says that they are both true: G and H
    G ∨ H says that at least one is true: G or H
    ¬G (or −G or Ḡ) says that G is false: not G

In the diagrams for De Morgan's laws the upper and lower rows represent G and ¬G, and the left and right-hand columns represent H and ¬H.

[Diagrams for De Morgan's laws (1) and (2): in (1) the shaded region is ¬(G ∧ H) = ¬G ∨ ¬H; in (2) it is ¬(G ∨ H) = ¬G ∧ ¬H.]

Adapting such geometrical representations to probabilistic reasoning is just a matter of thinking of the probability of a hypothesis as its region's area, assuming that the whole rectangle, H ∨ ¬H (= G ∨ ¬G), has area 1. Of course the empty region, H ∧ ¬H (= G ∧ ¬G), has area 0. Now if R and S are any regions, R ∧ S (or 'RS') is their intersection, R ∨ S is their union, and ¬R is the whole big rectangle except for R. It is useful to denote those two extreme regions in ways independent of any particular hypotheses G, H. Let's call them ⊤ and ⊥:

Logical Truth.      ⊤ = H ∨ ¬H = G ∨ ¬G
Logical Falsehood.  ⊥ = H ∧ ¬H = G ∧ ¬G

We can now verify some further probability laws informally, in terms of areas of diagrams.

Not: pr(¬H) = 1 − pr(H)
Verification. The non-overlapping regions H and ¬H exhaust the whole rectangle, which has area 1. Then pr(H) + pr(¬H) = 1, so pr(¬H) = 1 − pr(H).

Or: pr(G ∨ H) = pr(G) + pr(H) − pr(G ∧ H)
Verification. The G ∨ H area is the G area plus the H area, except that when you simply add pr(G) + pr(H) you count the G ∧ H part twice. So subtract it on the right-hand side.

But Not: pr(G ∧ ¬H) = pr(G) − pr(G ∧ H)
Verification. The G ∧ ¬H region is what remains of the G region after the G ∧ H part is deleted. (The word 'but'—a synonym for 'and'—may be used when the conjunction may be seen as a contrast, as in 'it's green but not healthy', G ∧ ¬H.)

Dyadic Analysis: pr(G) = pr(G ∧ H) + pr(G ∧ ¬H)
Verification. The G region is the union of the non-overlapping G ∧ H and G ∧ ¬H regions. See the diagram for De Morgan (1).

In general, there is a rule of n-adic analysis for each n, e.g., for n = 3:

Triadic Analysis: If H1, H2, H3 partition ⊤,(4) then pr(G) = pr(G ∧ H1) + pr(G ∧ H2) + pr(G ∧ H3).

[Diagram: the rectangle divided into columns H1, H2, H3, with G cutting across them in regions GH1, GH2, GH3.]

[Footnote 4: This means that, as a matter of logic, the H's are mutually exclusive (H1 ∧ H2 = H1 ∧ H3 = H2 ∧ H3 = ⊥) and collectively exhaustive (H1 ∨ H2 ∨ H3 = ⊤). The equation also holds if the H's merely pr-partition ⊤, in the sense that pr(Hi ∧ Hj) = 0 whenever i ≠ j and pr(H1 ∨ H2 ∨ H3) = 1.]
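The area arguments above can be mirrored by summing point probabilities over the four cells of the G/H diagram. The cell weights below are arbitrary (my choice, summing to 1); the laws hold whatever they are:

```python
from fractions import Fraction as F

# Probabilities of the four cells: G∧H, G∧¬H, ¬G∧H, ¬G∧¬H.
p = {('G', 'H'): F(3, 10), ('G', '-H'): F(1, 5),
     ('-G', 'H'): F(1, 10), ('-G', '-H'): F(2, 5)}
assert sum(p.values()) == 1

pr_G = p[('G', 'H')] + p[('G', '-H')]
pr_H = p[('G', 'H')] + p[('-G', 'H')]
pr_G_and_H = p[('G', 'H')]
pr_G_or_H = 1 - p[('-G', '-H')]          # De Morgan (2): ¬(G∨H) = ¬G∧¬H

assert 1 - pr_H == p[('G', '-H')] + p[('-G', '-H')]   # Not
assert pr_G_or_H == pr_G + pr_H - pr_G_and_H          # Or
assert p[('G', '-H')] == pr_G - pr_G_and_H            # But Not
assert pr_G == pr_G_and_H + p[('G', '-H')]            # Dyadic Analysis
```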
The next rule follows immediately from the fact that logically equivalent hypotheses are always represented by the same region of the diagram—in view of which we use the sign '=' of identity to indicate logical equivalence.

Equivalence: If H = G, then pr(H) = pr(G). (Logically equivalent hypotheses are equiprobable.)

Finally: To be implied by G, the hypothesis H must be true in every case in which G is true. Diagrammatically, this means that the G region is entirely included in the H region. Then if G implies H, the G region can have no larger an area than the H region:

Implication: If G implies H, then pr(G) ≤ pr(H).

1.4 Conditional Probability

We identified your ordinary (unconditional) probability for H as the price representing your valuation of the following ticket:

    [Worth $1 if H is true.  Price: $pr(H)]

Now we identify your conditional probability for H given D as the price representing your valuation of this ticket:

(1) [Worth $1 if D ∧ H is true; worth $pr(H|D) if D is false.  Price: $pr(H|D)]

The old ticket represented a simple bet on H; the new one represents a conditional bet on H—a bet that is called off (the price of the ticket is refunded) in case the condition D fails. If D and H are both true, the bet is on and you win: the ticket is worth $1. If D is true but H is false, the bet is on and you lose: the ticket is worthless. And if D is false, the bet is off: you get your $pr(H|D) back as a refund.

With that understanding we can construct a Dutch book argument for the following rule, which connects conditional and unconditional probabilities:

Product Rule: pr(H ∧ D) = pr(H|D)pr(D)

Dutch Book Argument for the Product Rule.(5) Imagine that you own three tickets, which you can sell at prices representing your valuations. The first is ticket (1) above. The second and third are the following two, which represent unconditional bets of $1 on H ∧ D and of $pr(H|D) against D:

(2) [Worth $1 if H ∧ D is true.       Price: $pr(H ∧ D)]
(3) [Worth $pr(H|D) if D is false.    Price: $pr(H|D)pr(¬D)]

Bet (3) has a peculiar payoff: not a whole dollar, but only $pr(H|D). This payoff was chosen to equal the price of the first ticket, so that the three fit together into a neat book; that is why its price is not the full $pr(¬D) but only the fraction pr(¬D) of the $pr(H|D) that you stand to win.

Observe that in every possible case regarding truth and falsity of H and D, the tickets (2) and (3) together have the same dollar value as ticket (1). (You can verify that claim with pencil and paper.) Then there is nothing to choose between ticket (1) and tickets (2) and (3) together, and therefore it would be inconsistent to place different values on them. Thus, your price for (1) ought to equal the sum of your prices for (2) and (3):

    pr(H|D) = pr(H ∧ D) + pr(¬D)pr(H|D)

To violate that rule is to place different values on the same commodity bundle in different guises: (1), or the package (2, 3). Now set pr(¬D) = 1 − pr(D), multiply through, cancel pr(H|D) from both sides and solve for pr(H ∧ D). The result is the product rule.

[Footnote 5: de Finetti (1937, 1980).]
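The pencil-and-paper check that tickets (2) and (3) together match ticket (1) in every case can be tabulated. The value 2/3 for pr(H|D) below is a hypothetical stand-in; any value in [0, 1] passes:

```python
from fractions import Fraction as F

c = F(2, 3)  # hypothetical value of pr(H|D)

def ticket1(h, d):
    # Conditional bet: $1 if H∧D, refund of c if D fails, else worthless.
    return F(1) if h and d else (c if not d else F(0))

def ticket2(h, d):
    # Unconditional bet: $1 if H∧D is true.
    return F(1) if h and d else F(0)

def ticket3(h, d):
    # Side bet against D with the peculiar payoff c.
    return c if not d else F(0)

# In all four truth-value cases for H and D, (2) + (3) equals (1).
for h in (True, False):
    for d in (True, False):
        assert ticket1(h, d) == ticket2(h, d) + ticket3(h, d)
```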
The product rule is more familiar in a form where it is solved for the conditional probability:

Quotient Rule: pr(H|D) = pr(H ∧ D)/pr(D), provided pr(D) > 0.

Graphically, the quotient rule expresses pr(H|D) as the fraction of the D region that lies inside the H region. It is as if calculating pr(H|D) were a matter of trimming the whole D ∨ ¬D rectangle down to the D part, and using that as the new unit of area.

The quotient rule is often called the definition of conditional probability. It is not. If it were, we could never be in the position we are often in, of making a conditional judgment—say, about how a coin that may or may not be tossed will land—without attributing some particular positive value to the condition:

    pr(head | tossed) = 1/2   even though   pr(head ∧ tossed)/pr(tossed) = undefined/undefined

Nor—perhaps less importantly—would we be able to make judgments like the following, about a point (of area 0!) on the Earth's surface:

    pr(in western hemisphere | on equator) = 1/2   even though   pr(in western hemisphere ∧ on equator)/pr(on equator) = 0/0

The quotient rule merely restates the product rule; and the product rule is no definition but an essential principle relating two distinct sorts of probability.

By applying the product rule to the terms on the right-hand sides of the analysis rules in sec. 1.3 we get the rule of

Total Probability: If the D's partition ⊤, then
    pr(H) = Σi pr(Di)pr(H|Di) = pr(D1)pr(H|D1) + pr(D2)pr(H|D2) + ...

Here the sequence of D's is finite or countably infinite.
Example. A ball will be drawn blindly from urn 1 or urn 2, with odds 2:1 of its being drawn from urn 2. Is black or white the more probable outcome?

[Diagram: urn 1, in which 3 of the 4 balls are black; urn 2, in which half the balls are black.]

Solution. By the rule of total probability with H = black and Di = drawn from urn i, we have

    pr(H) = pr(H|D1)pr(D1) + pr(H|D2)pr(D2) = (3/4 · 1/3) + (1/2 · 2/3) = 1/4 + 1/3 = 7/12 > 1/2:

Black is the more probable outcome.

1.5 Why '|' Cannot be a Connective

The bar in 'pr(H|D)' is not a connective that turns pairs H, D of propositions into new, conditional propositions, H if D. Rather, it is as if we wrote the conditional probability of H given D as 'pr(H, D)': the bar is a typographical variant of the comma. Thus we use 'pr' for a function of one variable, as in 'pr(D)' and 'pr(H ∧ D)', and also for the corresponding function of two variables, as in 'pr(H|D)'. Of course the two are connected—by the product rule. Then in fact we do not treat the bar as a statement-forming connective, 'if'; but why couldn't we? What would go wrong if we did? This question was answered by David Lewis in 1976, pretty much as follows.(8)

Consider the simplest special case of the rule of total probability:

    pr(H) = pr(H|D)pr(D) + pr(H|¬D)pr(¬D)

Now if '|' is a connective and D and C are propositions, then D|C is a proposition too, and we are entitled to set H = D|C in the rule. Result:

(1) pr(D|C) = pr[(D|C)|D]pr(D) + pr[(D|C)|¬D]pr(¬D)

So far, so good. But remember: '|' means if. Thus '(D|C)|X' means If X, then if C then D. And as we ordinarily use the word 'if', this comes to the same as If X and C, then D:

(2) (D|C)|X = D|(X ∧ C)

(Recall that the identity means the two sides represent the same region, i.e., the two sentences are logically equivalent.) Now by two applications of (2) to (1) we have

(3) pr(D|C) = pr(D|D ∧ C)pr(D) + pr(D|¬D ∧ C)pr(¬D)

But as D ∧ C and ¬D ∧ C respectively imply and contradict D, we have pr(D|D ∧ C) = 1 and pr(D|¬D ∧ C) = 0. Therefore, (3) reduces to

(4) pr(D|C) = pr(D)

Conclusion: If '|' were a connective ('if') satisfying (2), conditional probabilities would not depend on their conditions at all. That means that 'pr(D|C)' would be just a clumsy way of writing 'pr(D)', and it means that pr(D|C) would come to the same thing as pr(D|¬C), and as pr(D|X) for any other statement X. That is David Lewis's "trivialization result". In proving it, the only assumption needed about 'if' was the equivalence (2) of 'If X, then if C then D' with 'If X and C, then D'.(9)

[Footnote 8: For Lewis's "trivialization" result (1976), see his (1986). For subsequent developments, see Ellery Eells and Brian Skyrms (eds., 1994)—especially the papers by Alan Hájek and Ned Hall.]

[Footnote 9: Note that the result does not depend on assuming that 'if' means 'or not'. No such fancy argument is needed in order to show that pr(¬A ∨ B) = pr(B|A) only under very special conditions. (Prove it!)]

1.6 Bayes's Theorem

A well-known corollary of the product rule allows us to reverse the arguments of the conditional probability function pr( | ), provided we multiply the result by the ratio of the probabilities of those arguments in the original order:

Bayes's Theorem (Probabilities): pr(H|D) = pr(D|H) × pr(H)/pr(D)

Proof. By the product rule, the right-hand side equals pr(D ∧ H)/pr(D); by the quotient rule, so does the left-hand side.

For many purposes Bayes's theorem is more usefully applied to odds than to probabilities. In particular, suppose there are two hypotheses, H and G, to which observational data D are relevant. If we apply Bayes's theorem for probabilities to the conditional odds between H and G, pr(H|D)/pr(G|D), the unconditional probability of D cancels out:

Bayes's Theorem (Odds): pr(H|D)/pr(G|D) = pr(H)/pr(G) × pr(D|H)/pr(D|G)

Terminological note. The second factor on the right-hand side of the odds form of Bayes's theorem is the "likelihood ratio." In these terms the odds form says:

    Conditional odds = prior odds × likelihood ratio

Bayes's theorem is often stated in a form attuned to cases in which you have clear probabilities pr(H1), pr(H2), ... for mutually incompatible, collectively exhaustive hypotheses H1, H2, ..., and have clear conditional probabilities pr(D|H1), pr(D|H2), ... for data D on each of them. For a countable collection of such hypotheses we have an expression for the probability, given D, of any one of them, say Hi:

Bayes's Theorem (Total Probabilities):(10) pr(Hi|D) = pr(Hi)pr(D|Hi) / Σj pr(Hj)pr(D|Hj)

[Footnote 10: The denominator = pr(H1)pr(D|H1) + pr(H2)pr(D|H2) + ···.]

Example. In the urn example (1.4), suppose a black ball is drawn. Was it more probably drawn from urn 1 or urn 2? Let D = A black ball is drawn, let Hi = It came from urn i, and suppose pr(Hi) = 1/2. Here are two ways to go:

(1) Bayes's theorem for total probabilities:
    pr(H1|D) = (1/2 · 3/4) / ((1/2 · 3/4) + (1/2 · 1/2)) = 3/5

(2) Bayes's theorem for odds:
    pr(H1|D)/pr(H2|D) = (1/2)/(1/2) × (3/4)/(2/4) = 3/2

These come to the same thing: probability 3/5 = odds 3:2. Urn 1 is the more probable source.
• For you, hypotheses are: independent iff your probability for the conjunction of any two or more is the product of your probabilities for the conjuncts; conditionally independent given G iff your conditional probability given G for the conjunction of any two or more is the product of your conditional probabilities given G for the conjuncts; and equiprobable (or conditionally equiprobable given G) iff they have the same probability for you (or the same conditional probability, given G).

Example 1, Rolling a fair die. Here, Hi means that the 1 ("ace") turns up on roll number i. These Hi are both independent and equiprobable for you: Your pr for the conjunction of any distinct n of them will be 1/6^n.

Example 2, Coin & die. pr(head) = 1/2, pr(1) = 1/6, pr(head ∧ 1) = 1/12: outcomes of a toss and a roll are independent but not equiprobable.

Example 3, Urn of known composition. You know it contains N balls, of which b are black, and that after a ball is drawn it is replaced and the contents of the urn mixed. Here H1, H2, . . . mean: the first, second, . . . balls drawn will be black. If you think nothing fishy is going on, you will regard the H's as equiprobable and independent: pr(Hi) = b/N, pr(Hi ∧ Hj) = b²/N² if i ≠ j, pr(Hi ∧ Hj ∧ Hk) = b³/N³ if i ≠ j ≠ k ≠ i, and so on.

It turns out that three propositions H1, H2, H3 can be independent in pairs but fail to be independent, because pr(H1 ∧ H2 ∧ H3) ≠ pr(H1)pr(H2)pr(H3):

Example 4, Two normal tosses of a normal coin. If we define Hi as 'head on toss #i' and D as 'different results on the two tosses', we find that pr(H1 ∧ H2) = pr(H1 ∧ D) = pr(H2 ∧ D) = 1/4, as pairwise independence requires, but pr(H1 ∧ H2 ∧ D) = 0 ≠ pr(H1)pr(H2)pr(D).

Here is a useful fact about independence:

(1) If n propositions are independent, so are those obtained by denying some or all of them.

Example 5. To illustrate (1), think about the case n = 3. Suppose H1, H2, H3 are independent. Writing hi = pr(Hi), this means that the following four equations hold:

(a) pr(H1 ∧ H2 ∧ H3) = h1h2h3
(b) pr(H1 ∧ H2) = h1h2
(c) pr(H1 ∧ H3) = h1h3
(d) pr(H2 ∧ H3) = h2h3

Then fact (1) tells us that denied H's may replace affirmed ones. Writing h̄i = pr(¬Hi), it implies, e.g.:

(e) pr(H1 ∧ H2 ∧ ¬H3) = h1h2h̄3
(f) pr(H1 ∧ ¬H2) = h1h̄2
(g) pr(H1 ∧ ¬H3) = h1h̄3
(h) pr(H2 ∧ ¬H3) = h2h̄3
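Example 4 (two normal tosses) is easy to verify by brute-force enumeration of the four equiprobable outcomes; the code below is our illustration, not the book's.

```python
from itertools import product

# Enumerate the four equiprobable outcomes of two normal coin tosses.
outcomes = list(product(['head', 'tail'], repeat=2))

def pr(event):
    """Probability of an event (a predicate on outcomes) under the uniform measure."""
    return sum(1 for w in outcomes if event(w)) / len(outcomes)

H1 = lambda w: w[0] == 'head'   # head on toss #1
H2 = lambda w: w[1] == 'head'   # head on toss #2
D  = lambda w: w[0] != w[1]     # different results on the two tosses

both = lambda a, b: (lambda w: a(w) and b(w))
all3 = lambda w: H1(w) and H2(w) and D(w)

# Pairwise independence holds ...
assert pr(both(H1, H2)) == pr(H1) * pr(H2) == 0.25
assert pr(both(H1, D)) == pr(H1) * pr(D) == 0.25
assert pr(both(H2, D)) == pr(H2) * pr(D) == 0.25
# ... but three-way independence fails:
assert pr(all3) == 0.0
assert pr(all3) != pr(H1) * pr(H2) * pr(D)
```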
Equations (e)-(h) follow from (a)-(d) by the rule for 'but not'. Thus (f) follows from (b), since by 'but not' the left-hand side = pr(H1) − pr(H1 ∧ H2), which, by (b), = h1 − h1h2, and this = h1(1 − h2) = h1h̄2. Similarly, (h) follows from (d).

A second useful fact follows immediately from the quotient rule:

(2) If pr(H1) > 0, then H1 and H2 are independent iff pr(H2|H1) = pr(H2).

1.8 Objective Chance

It is natural to think there is such a thing as real or objective probability ("chance", for short) in contrast to merely judgmental probability.

Example 1, The Urn. An urn contains 100 balls, of which an unknown number n are of the winning color, green, say. You regard the drawing as honest in the usual way. Then you think the chance of winning is n/100. If all 101 possible values of n are equiprobable in your judgment, then your prior probability for winning on any trial is 50%; and as you learn the results of more and more trials, your posterior odds between any two possible compositions of the urn will change from 1:1 to various other values, even though you are sure that the composition itself does not change.

"The chance of winning is 30%." What does that mean? In the urn example, it means that n = 30. How can we find out whether it is true or false? In the urn example we just count the green balls and divide by the total number. But in other cases—die rolling, horse racing, etc.—there may be no process that does the "count the green ones" job. There are puzzling questions about the hypothesis that the chance of H is p that do not arise regarding the hypothesis H itself. David Hume's skeptical answer to those questions says that chances are simply projections of robust features of judgmental probabilities from our minds out into the world, whence we hear them clamoring to be let back in. On the Humean view, the argument

(1) pr(the chance of H is p) = 1, so pr(H) = p

is valid because our conviction that the chance of H is p is just a firmly felt commitment to p as our continuing judgmental probability for H. That is how our knowledge that the chance of H is p guarantees that our judgmental probability for H is p: the guarantee is really a presupposition.
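The urn example of this section can be simulated: conditioning on each draw multiplies your odds between any two compositions by the corresponding likelihood ratio. A sketch under the example's stated assumptions (101 equiprobable compositions, drawing with replacement); the function names are ours.

```python
from fractions import Fraction as F

# 101 hypotheses: n = 0..100 green balls among 100. Uniform prior.
priors = {n: F(1, 101) for n in range(101)}

def update(pr, green_drawn):
    """Condition on one draw: likelihood n/100 for green, 1 - n/100 otherwise."""
    post = {n: p * (F(n, 100) if green_drawn else 1 - F(n, 100))
            for n, p in pr.items()}
    total = sum(post.values())
    return {n: p / total for n, p in post.items()}

# Prior probability of green on the next draw is 50% ...
assert sum(p * F(n, 100) for n, p in priors.items()) == F(1, 2)

# ... and observing a green draw shifts the odds between compositions:
pr1 = update(priors, True)
# odds between "n = 60" and "n = 40" move from 1:1 to 60:40 = 3:2
assert pr1[60] / pr1[40] == F(3, 2)
```

The composition itself never changes; only your judgmental odds over the possible compositions do, exactly as the text describes.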
Thus, in the urn example, you are sure that if you knew n, your judgmental probability for the next ball's being green would be n%:

(3) pr(Green next | n of the hundred are green) = n%

Then in example 1 you take the chance of green next to be a magnitude,

ch(green next) = number of green balls in the urn / total number of balls in the urn,

which you can determine empirically by counting.

What if we are not sure what the chance of H is, but think it may be p? Here the relevant principle ("Homecoming") specifies the probability of H given that its chance is p—except in cases where we are antecedently sure that the chance is not p because, in your judgment, p is in the interior of an interval we are sure does not contain the chance of H:

0——· · · p · · ·——1

Homecoming. pr(H | the chance of H is p) = p, unless p is excluded as a possible value by being in the interior of some chunk (· · · · · ·) of the interval from 0 to 1 for which pr(the chance of H is inside the chunk) = 0.

Note that when pr(the chance of H is p) = 1, homecoming validates argument (1).

The name 'Homecoming' is loaded with philosophical baggage; 'Decisiveness' would be a less tendentious name, acceptable to those who see chances as objective features of the world, independent of what we may think. The condition 'the chance of H is p' is decisive in that it overrides any other evidence represented in the probability function pr. But a decisive condition need not override other conditions conjoined with it to the right of the bar. In particular, it will not override the hypothesis that H is true, or that H is false: since pr(H|H ∧ C) = 1 when C is any condition consistent with H, we have

(2) pr(H | H ∧ the chance of H is .3) = 1, not .3,

and that is no violation of decisiveness. On the Humean view it is ordinary conditions, making no use of the word 'chance', that appear to the right of the bar in applications of the homecoming principle—e.g., conditions specifying the composition of an urn. It is the fact that, for you, the ratio of green ones to all balls in the urn satisfies the decisiveness condition that identifies n% as the chance of green next.
Example 2, The Loaded Die. Suppose H predicts ace on the next toss. Perhaps you are sure that if you understood the physics better, knowledge of the mass distribution in the die would determine for you a definite robust judgmental probability of ace next: you think that if you knew the physics, then you would know of an f that makes this principle true:

(4) pr(Ace next | The mass distribution is M) = f(M)

But you don't know the physics; all you know for sure is that f(M) = 1/6 in case M is uniform.

In examples 1 and 2 it is clear to us what the crucial physical parameter, X, can be, and we can specify the function, f, that maps X into the chance: X can be n/100, in which case f is the identity function, f(X) = X; or X can be n, with f(X) = n/100. In the die example we are clear about the parameter X, but not about the function f. And in other cases we are also unclear about X. On the other hand, the following two examples do not respond to that treatment. When we are unlucky in this way—when there is no decisive physical parameter for us—there may still be some point in speaking of the chance of H, identifying it in terms of its salient effect ("blood poisoning" in the following example), or of a yet-to-be-identified physical parameter that will be decisive for people in the future. In the homecoming condition we might read 'chance of H' as a placeholder for some presently unavailable description of a presently unidentified physical parameter. There is no harm in that—as long as we don't think we are already there.

Example 3, Blood Poisoning. While seeking clues about the cause of childbed fever, Semmelweis found one in an accident:

'At last, early in 1847, an accident gave Semmelweis the decisive clue for his solution of the problem. A colleague of his, Kolletschka, received a puncture wound in the finger, from the scalpel of a student with whom he was performing an autopsy, and died after an agonizing illness during which he displayed the same symptoms that Semmelweis had observed in the victims of childbed fever. Although the role of microorganisms in such infections had not yet been recognized at that time, Semmelweis
realized that "cadaveric matter," which the student's scalpel had introduced into Kolletschka's blood stream, had caused his colleague's fatal illness. And the similarities between the course of Kolletschka's disease and that of the women in his clinic led Semmelweis to the conclusion that his patients had died of the same kind of blood poisoning.'11

But what is the common character of the X's that we had in the first two examples, and of the f's that we had in the first, but lacked in the other two? These questions belong to pragmatics, not semantics: they concern the place of these X's and f's in our processes of judgment, and their answers belong to probability dynamics, not statics. The answers have to do with invariance of conditional judgmental probabilities as judgmental probabilities of the conditions vary. To see how that goes, let's reformulate (4) in general terms:

(4) Staying home. pr(H|X = a) = f(a) if a is not in the interior of an interval that surely contains no values of X.

Here we must think of pr as a variable whose domain is a set of probability assignments. These may assign different values to the condition X = a; but, for each H and each a that is not in an excluded interval, they assign the same value, f(a), to the conditional probability of H given a.

11 Hempel (1966).

1.9 Supplements

1 (1) Find a formula for pr(H1 ∨ H2 ∨ H3) in terms of probabilities of H's and their conjunctions. (2) What happens in the general case, pr(H1 ∨ H2 ∨ . . . ∨ Hn)?

2 Exclusive 'or'. The symbol '∨' stands for 'or' in a sense that is not meant to rule out the possibility that the hypotheses flanking it are both true: H1 ∨ H2 = H1 ∨ H2 ∨ (H1 ∧ H2). Let us use the symbol '⊻' for 'or' in an exclusive sense: H1 ⊻ H2 = (H1 ∨ H2) ∧ ¬(H1 ∧ H2). Find formulas for (1) pr(H1 ⊻ H2) and (2) pr(H1 ⊻ H2 ⊻ H3) in terms of probabilities of H's and their conjunctions. (3) Does H1 ⊻ H2 ⊻ H3 mean that exactly one of the three H's is true? (No.) What does it mean? (4) What does H1 ⊻ . . . ⊻ Hn mean?
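For supplement 1, the answer is the inclusion-exclusion formula, pr(H1 ∨ H2 ∨ H3) = Σ pr(Hi) − Σ pr(Hi ∧ Hj) + pr(H1 ∧ H2 ∧ H3). The formula and the enumeration check below are ours, not the book's worked answer; the weights are an arbitrary probability assignment over the eight truth cases.

```python
from fractions import Fraction as F
from itertools import product

# Sample space: all 8 truth assignments to (H1, H2, H3), with arbitrary weights
# 1/36, 2/36, ..., 8/36 (which sum to 1).
weights = {w: F(k + 1, 36) for k, w in enumerate(product([False, True], repeat=3))}
assert sum(weights.values()) == 1

def pr(event):
    return sum(p for w, p in weights.items() if event(w))

H = [lambda w, i=i: w[i] for i in range(3)]

union = pr(lambda w: any(w))
incl_excl = (sum(pr(H[i]) for i in range(3))
             - sum(pr(lambda w: w[i] and w[j])
                   for i in range(3) for j in range(i + 1, 3))
             + pr(lambda w: all(w)))

assert union == incl_excl
```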
What does it mean? (4) What does H1 ⊻ . . . ⊻ Hn mean?

3 Diagnosis.12 The patient has a breast mass that her physician thinks is probably benign: frequency of malignancy among women of that age, with the same symptoms, family history, and physical findings, is about 1 in 100. The physician orders a mammogram and receives the report that in the radiologist's opinion the lesion is malignant, i.e., the mammogram is positive. Based on the available statistics, the physician's probabilities for true and false positive mammogram results were as follows, and her prior probability for the patient's having cancer was 1%. What will her conditional odds on malignancy be, given the positive mammogram?

pr(row|column)    Malignant   Benign
+ mammogram       80%         10%
− mammogram       20%         90%

4 The Taxicab Problem.13 "A cab was involved in a hit-and-run accident at night. Two cab companies, the Green and the Blue, operate in the city. You are given the following data:
"(a) 85% of the cabs in the city are Green, 15% are Blue.
"(b) A witness identified the cab as Blue. The court tested the reliability of the witness under the same circumstances that existed on the night of the accident and concluded that the witness correctly identified each one of the two colors 80% of the time and failed 20% of the time.
What is the probability that the cab involved in the accident was Blue rather than Green [i.e., conditionally on the witness's identification]?"

5 The Device of Imaginary Results.14 This is meant to help you identify your prior odds, e.g., on the hypothesis "that a man is capable of extrasensory perception, in the form of telepathy. You may imagine an experiment performed in which the man guesses 20 digits (between 0 and 9) correctly. If you feel that this would cause the probability that the man has telepathic powers to become greater than 1/2, then the [prior odds] must be assumed to be greater than 10^−20. Similarly, if three consecutive correct guesses would leave the probability below 1/2, then the [prior odds] must be less than 10^−3." Verify these claims about the prior odds.

12 Adapted from Eddy (1982).
13 From Kahneman, Slovic and Tversky (1982), pp. 156-158.
14 From I. J. Good (1950), p. 35.
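Supplements 3 and 4 are both one-line applications of the odds form: conditional odds = prior odds × likelihood ratio. A quick check with the stated rates (the code and function names are ours):

```python
from fractions import Fraction as F

def posterior_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes's theorem."""
    return prior_odds * likelihood_ratio

def odds_to_prob(o):
    return o / (1 + o)

# Supplement 3: prior odds on malignancy 1:99; likelihood ratio 80%/10% = 8.
mam_odds = posterior_odds(F(1, 99), F(80, 10))
# Supplement 4: prior odds Blue:Green = 15:85; likelihood ratio 80%/20% = 4.
cab_odds = posterior_odds(F(15, 85), F(80, 20))

print(mam_odds, odds_to_prob(mam_odds))  # 8/99, about 0.075
print(cab_odds, odds_to_prob(cab_odds))  # 12/17, i.e. 12/29, about 0.41
```

In both problems the modest base rate keeps the posterior well below the 80% "reliability" figure, which is the point of the exercises.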
6 The Rare Disease.15 "You are suffering from a disease that, according to your manifest symptoms, is either A or B. For a variety of demographic reasons disease A happens to be 19 times as common as B. The two diseases are equally fatal if untreated, but it is dangerous to combine the respective appropriate treatments. Your physician orders a certain test which, through the operation of a fairly well understood causal process, always gives a unique diagnosis in such cases, and this diagnosis has been tried out on equal numbers of A and B patients and is known to be correct on 80% of those occasions. The tests report that you are suffering from disease B. Should you nevertheless opt for the treatment appropriate to A, on the supposition that the probability of your suffering from A is 19/23? Or should you opt for the treatment appropriate to B, on the supposition [. . .] that the probability of your suffering from B is 4/5? It is the former opinion that would be irrational for you. Indeed, on the other view, which is the one espoused in the literature, it would be a waste of time and money even to carry out the tests, since whatever their results, the base rates would still compel a more than 4/5 probability in favor of disease A. So the literature is propagating an analysis that could increase the number of deaths from a rare disease of this kind."

Diaconis and Freedman (1981, pp. 33-34) suggest that "the fallacy of the transposed conditional" is being committed here, i.e., confusion of the following quantities—the second of which is the true positive rate of the test for B:

pr(It is B | It is diagnosed as B),  pr(It is diagnosed as B | It is B).

Use the odds form of Bayes's theorem to verify that if your prior odds on A are 19:1 and you take the true positive rate (for A, and for B) to be 80%, your posterior probability for A should be 19/23.

7 On the Credibility of Extraordinary Stories.16 "There are, broadly speaking, two different ways in which we may suppose testimony to be given. It may, in the first place, take the form of a reply to an alternative question, a question framed to be answered by yes or no. Here, of course, the possible answers are mutually contradictory, so that if one of them is not correct the other must be so: —Has A happened, yes or no? . . . On the other hand, the testimony may take the form of a more original statement or piece of information. Instead of saying, Did A happen? we may ask, What happened? Here if the witness speaks the truth he must be supposed, as before, to have but one way of doing so; for the occurrence of some specific event was of course contemplated. But if he errs he has many ways of going wrong" . . .

15 L. J. Cohen (1981), see p. 329.
16 Adapted from pp. 409 ff. of Venn (1888, 1962).
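The verification asked for in supplement 6 runs as follows (our sketch): prior odds on A of 19:1, and a 'B' report carrying a likelihood ratio of 20%/80% = 1/4 in favor of A.

```python
from fractions import Fraction as F

prior_odds_A = F(19, 1)   # A is 19 times as common as B

# The test says 'B'. It says 'B' for 20% of A patients and 80% of B patients,
# so the likelihood ratio in favour of A is 0.2/0.8 = 1/4.
lr_A = F(20, 80)

post_odds_A = prior_odds_A * lr_A            # 19:4
post_prob_A = post_odds_A / (1 + post_odds_A)

assert post_odds_A == F(19, 4)
assert post_prob_A == F(19, 23)  # still better than 4/5 in favour of A
```

Since 19/23 > 4/5, the base rate does outweigh the 80% true positive rate, which is Cohen's point against the "transposed conditional" reading.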
CHAPTER 1. Alice asks the warder for the name of one other than herself who will be shot. in the other two. explaining that as there must be at least one. if pr(He says it is XIt is X) = . A ball is drawn at random and seen by no one but a slightly colorblind witness. a goat. When you have chosen. Diagnosis.9. there being no apparent reason why he should choose one number rather than another. Slovic and Tversky eds. What is your probability that the witness was right on this occasion. One is red on both sides. PROBABILITY PRIMER 31 (a) In an urn with 1000 balls.e. the host opens one of the other two doors. from 1 to 1000. he will be likely to announce all the wrong ones equally often. 123 of Kahneman. i..3 Monty Hall.9 when X = Red and when X = Green? (b) “We will now take the case in which the witness has many ways of going wrong. the other freed. As a contestant on a TV game show. what is the probability that he is right?” In answering you are to “assume that. 8. in one case. one is black on both sides. Let B denote the event that Peter’s home will have been burgled before the end of next year. and he tells me that it was numbered 25. This cheers Alice up a little: Her judgmental probability for being shot is now 1/2 instead of 2/3. one is green and the rest are red. a car. the warder won’t really be giving anything away. and the witness knows this fact. if his reliability in distinguishing red from green is . what is the probability that the other side is red too? 8. A ball is drawn. One card is drawn blindly and placed on a table. you are invited to choose any one of three doors and receive as a prize whatever lies behind it— i. Would that be wise? 9 Causation vs. instead of merely one.e. who reports that the ball was green. and the other is red on one side and black on the other. behind which he knows there is a goat. An unknown two will be shot. (1982). The warder agrees.2 The Three Prisoners. 17 From p. 
and oﬀers to let you switch your choice to the third door.” What is now your probability that the 90% reliable witness was right? 8. . Peter will have installed a burglar alarm in his home. If a red side is up. or. but that (b) She is right if she thinks the warder will say ‘Bill’ when he honestly can.17 “Let A be the event that before the end of next year..1 The Three Cards.. and says that Bill will be shot. Show (via Bayes’s theorem) that (a) Alice is mistaken if she thinks the warder is as likely to say ‘Clara’ as ‘Bill’ when he can honestly say either. Suppose that the balls were all numbered.
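For the Monty Hall problem, a small exact computation over the weighted cases settles it (our sketch; the host is assumed to pick at random between the two goat doors when your first pick hides the car).

```python
from fractions import Fraction as F

# Doors 0, 1, 2. You always pick door 0. Each (car location, host choice)
# case is listed with its exact probability.
cases = []  # (probability, car, opened)
for car in range(3):
    p_car = F(1, 3)
    if car == 0:
        # Host may open either of doors 1, 2; assume he chooses at random.
        for opened in (1, 2):
            cases.append((p_car * F(1, 2), car, opened))
    else:
        # Host must open the one door that is neither yours nor the car's.
        opened = 3 - car
        cases.append((p_car, car, opened))

win_by_staying = sum(p for p, car, _ in cases if car == 0)
# Switching means taking the door that is neither 0 nor the opened one.
win_by_switching = sum(p for p, car, opened in cases
                       if car != 0 and car == 3 - opened)

assert win_by_staying == F(1, 3)
assert win_by_switching == F(2, 3)  # switching doubles your chance of the car
```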
"Question: Which of the two conditional probabilities, pr(A|B) or pr(A|¬B), is higher?
"Question: Which of the two conditional probabilities, pr(B|A) or pr(B|¬A), is higher?
"A large majority of subjects (132 of 161) stated that pr(A|B) > pr(A|¬B) and that pr(B|A) < pr(B|¬A)," contrary to the laws of probability. Substantiate this critical remark by showing that the following is a law of probability:

pr(A|B) > pr(A|¬B) iff pr(B|A) > pr(B|¬A)

10 Prove the following, assuming that conditions all have probability > 0.
(a) If A implies D then pr(A|D) = pr(A)/pr(D).
(b) If D implies A then pr(¬A|¬D) = pr(¬A)/pr(¬D).
(c) ("TJ's Lemma") pr(C|A ∨ B) is between pr(C|A) and pr(C|B) if pr(A ∧ B) = 0.

11 Sex Bias at Berkeley?18 In the fall of 1973, when 8442 men and 4321 women applied to graduate departments at U.C. Berkeley, about 44% of the men were admitted, but only about 35% of the women. It looked like sex bias against women. But when admissions were tabulated for the separate departments—as below, for the six most popular departments, which together accounted for over a third of all the applicants—there seemed to be no such bias on a department-by-department basis. And the tabulation suggested an innocent one-sentence explanation of the overall statistics. What was it? Hint: What do the statistics indicate about how hard the different departments were to get into?

Department A admitted 62% of 825 male applicants, 82% of 108 females.
Department B admitted 63% of 560 male applicants, 68% of 25 females.
Department C admitted 37% of 325 male applicants, 34% of 593 females.
Department D admitted 33% of 417 male applicants, 35% of 375 females.
Department E admitted 28% of 191 male applicants, 24% of 393 females.
Department F admitted 6% of 373 male applicants, 7% of 341 females.

12 The Birthday Problem. Of twenty-three people selected at random, what is the probability that at least two have the same birthday?

Hint: 365/365 × 364/365 × 363/365 × · · · (23 factors) ≈ .49

13 How would you explain the situation to M. de Méré?

"M. de Méré told me he had found a fallacy in the numbers for the following

18 Freedman, Pisani and Purves (1978), pp. 12-15.
reason: With one die, the odds on throwing a six in four tries are 671:625. With two dice, the odds are against throwing a double six in twenty-four tries. Yet 24 is to 36 (which is the number of pairings of the faces of two dice) as 4 is to 6 (which is the number of faces of one die). This is what made him so indignant and made him tell everybody that the propositions were inconsistent and that arithmetic was self-contradictory; but with your understanding of the principles, you will easily see that what I say is right." (Pascal to Fermat, 29 July 1654)

14 Independence. (a) Complete the proof, begun in example 5, of the case n = 3 of fact (1). (b) Prove fact (1) in the general case, where 2 ≤ n = t + f. Suggestion: Use mathematical induction on the number f = 0, 1, . . . of denied H's.

15 Sample Spaces. The diagrammatic method of this chapter is an agreeable representation of the set-theoretical models in which propositions are represented by sets of items called 'points'—a.k.a. ('possible') 'worlds' or 'states' ('of nature'). If Ω = the set of all such possible states in a particular set-theoretical model, then ω ∈ H ⊆ Ω means that the proposition H is true in possible state ω. To convert such an abstract model into a sample space it is necessary to specify the intended correspondence between actual or possible happenings and members and subsets of Ω. Not every subset of Ω need be counted as a proposition in the sample space, but the ones that do count are normally assumed to form a Boolean algebra. A Boolean "σ-algebra" is a Boolean algebra which is closed under all countable disjunctions (and, therefore, conjunctions)—not just the finite ones. A probability space is a sample space with a countably additive probability assignment to the Boolean algebra B of propositions.

1.10 References

L. J. Cohen, in The Behavioral and Brain Sciences 4 (1981) 317-331.
Bruno de Finetti, 'La Prévision . . .', Annales de l'Institut Henri Poincaré 7 (1937), translated in Kyburg and Smokler (1980).
David M. Eddy, 'Probabilistic reasoning in clinical medicine', in Kahneman, Slovic, and Tversky (eds.) (1982).
Ellery Eells and Brian Skyrms (eds.), Probabilities and Conditionals: Cambridge U. P., 1994.
David Freedman, Robert Pisani, and Roger Purves, Statistics: New York, W. W. Norton, 1978.
I. J. Good, Probability and the Weighing of Evidence: London, 1950.
Nelson Goodman, Fact, Fiction and Forecast, 4th ed.: Cambridge, Mass., Harvard U. P., 1983.
Carl G. Hempel, Philosophy of Natural Science: Prentice-Hall, 1966.
Daniel Kahneman, Paul Slovic, and Amos Tversky (eds.), Judgment Under Uncertainty: Cambridge U. P., 1982.
Henry Kyburg, Jr., and Howard Smokler (eds.), Studies in Subjective Probability, 2nd ed.: Huntington, N.Y., Robert E. Krieger, 1980.
David Lewis, 'Probabilities of Conditionals and Conditional Probabilities' (1976), reprinted in his Philosophical Papers, vol. II: Oxford University Press, 1986.
Frank Ramsey, 'Truth and Probability', in D. H. Mellor (ed.), Philosophical Papers: Cambridge U. P., 1990.
John Venn, The Logic of Chance, 3rd ed., 1888 (Chelsea Pub. Co. reprint, 1962).
Chapter 2

Testing Scientific Theories

Christian Huyghens gave this account of the scientific method in the introduction to his Treatise on Light (1690):

". . . whereas the geometers prove their propositions by fixed and incontestable principles, here the principles are verified by the conclusions to be drawn from them; the nature of these things not allowing of this being done otherwise. It is always possible thereby to attain a degree of probability which very often is scarcely less than complete proof. To wit, when things which have been demonstrated by the principles that have been assumed correspond perfectly to the phenomena which experiment has brought under observation; especially when there are a great number of them, and further, principally, when one can imagine and foresee new phenomena which ought to follow from the hypotheses which one employs, and when one finds that therein the fact corresponds to our prevision. But if all these proofs of probability are met with in that which I propose to discuss, as it seems to me they are, this ought to be a very strong confirmation of the success of my inquiry; and it must be ill if the facts are not pretty much as I represent them."

In this chapter we interpret Huyghens's methodology and extend it to the treatment of certain vexed methodological questions—especially, the Duhem-Quine problem ("holism", sec. 2.6) and the problem of old evidence (sec. 2.7).
2.1 Quantifications of Confirmation

The thought is that you see an episode of observation, experiment or reasoning as confirming or infirming a hypothesis to a degree that depends on whether your probability for it increases or decreases during the episode, i.e., depending on whether your posterior probability, new(H), is greater or less than your prior probability, old(H). Your Probability Increment, new(H) − old(H), is one measure of that change—positive for confirmation, negative for infirmation. Other measures are the probability factor and the odds factor, in both of which the turning point between confirmation and infirmation is 1 instead of 0. These are the factors by which old probabilities or odds are multiplied to get the new probabilities or odds.

Probability Factor: π(H) =df new(H)/old(H)

By your odds on one hypothesis against another—say, on G against an alternative H—is meant the ratio pr(G)/pr(H) of your probability of G to your probability of H; by your odds simply "on G" is meant your odds on G against ¬G:

Odds on G against H = pr(G)/pr(H);  Odds on G = pr(G)/pr(¬G)

Your "odds factor"—or, better, "Bayes factor"—is the factor by which your old odds can be multiplied to get your new odds:

Bayes factor (odds factor): β(G : H) = [new(G)/new(H)] / [old(G)/old(H)] = π(G)/π(H)

Where updating old → new is by conditioning on D, your Bayes factor = your old likelihood ratio:

β(G : H) = old(D|G)/old(D|H) when new( ) = old( |D).
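All three measures can be read off a single update. A sketch with toy numbers of our own choosing, using the conditioning case, where the Bayes factor equals the old likelihood ratio:

```python
from fractions import Fraction as F

# Toy prior (ours): old(G) = 1/4, old(H) = 1/2, old(D|G) = 3/4, old(D|H) = 1/4.
old_G, old_H = F(1, 4), F(1, 2)
lik_G, lik_H = F(3, 4), F(1, 4)

# Assume the remaining 1/4 of prior mass gives D probability 1/2,
# contributing 1/8 to pr(D).
old_D = lik_G * old_G + lik_H * old_H + F(1, 8)

# Updating by conditioning on D:
new_G = lik_G * old_G / old_D
new_H = lik_H * old_H / old_D

increment_G = new_G - old_G                        # positive: D confirms G
prob_factor_G = new_G / old_G                      # pi(G)
bayes_factor = (new_G / new_H) / (old_G / old_H)   # beta(G : H)

assert bayes_factor == lik_G / lik_H               # = old likelihood ratio = 3
```

Note that the Bayes factor depends only on the two likelihoods, not on the prior, which is why it is the preferred summary of what the data did.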
A useful variant of the Bayes factor is its logarithm, dubbed by I. J. Good1 the "weight of evidence":

w(G : H) =df log β(G : H)

As odds vary from 0 to ∞, their logarithms vary from −∞ to +∞. High probabilities (say, 100/101 and 1000/1001) correspond to widely spaced odds (100:1 and 1000:1); but low probabilities (say, 1/100 and 1/1000) correspond to odds (1:99 and 1:999) that are roughly as cramped as the probabilities. Going from odds to log odds moderates the spread at the high end, and treats high and low symmetrically. Thus, high odds 100:1 and 1000:1 become log odds 2 and 3, and low odds 1:100 and 1:1000 become log odds −2 and −3.

A. M. Turing was a playful, enthusiastic advocate of log odds, as his wartime codebreaking colleague I. J. Good reports:

"Turing suggested . . . that it would be convenient to take over from acoustics and electrical engineering the notation of bels and decibels (db). In acoustics, for example, the bel is the logarithm to base 10 of the ratio of two intensities of sound. Similarly, if f [our β] is the factor in favour of a hypothesis, i.e., the ratio of its final to its initial odds, then we say that the hypothesis has gained log10 f bels or 10 log10 f db. This may be described as the weight of evidence . . . and 10 log o db may be called the plausibility corresponding to odds o. Thus . . . Plausibility gained = weight of evidence"2

"The weight of evidence can be added to the initial log-odds of the hypothesis to obtain the final log-odds. If the base of the logarithms is 10, the unit of weight of evidence was called a ban by Turing (1941), who also called one tenth of a ban a deciban (abbreviated to db). I hope that one day judges, detectives, doctors and other earth-ones will routinely weigh evidence in terms of decibans because I believe the deciban is an intelligence-amplifier."3

"The reason for the name ban was that tens of thousands of sheets were printed in the town of Banbury on which weights of
1 I. J. Good, Probability and the Weighing of Evidence, London, 1950, chapter 6.
2 I. J. Good, op. cit., p. 63.
3 I. J. Good, Good Thinking (Minneapolis, 1983), p. 132. Good means 'intelligence' in the sense of 'information': Think MI (military intelligence), not IQ.
evidence were entered in decibans for carrying out an important classified process called Banburismus. A deciban or half-deciban is about the smallest change in weight of evidence that is directly perceptible to human intuition. . . . The main application of the deciban was . . . for discriminating between hypotheses, just as in clinical trials or in medical diagnosis." [Good (1979), p. 394]4
Playfulness ≠ frivolity. Good is making a serious proposal, for which he claims support by extensive testing in the early 1940's at Bletchley Park, where the "Banburismus" language for hypothesis-testing was used in breaking each day's German naval "Enigma" code during World War II.5
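In Turing's units, a Bayes factor β is worth 10·log10 β decibans, and decibans add where Bayes factors multiply. A quick illustration (the code is ours):

```python
import math

def decibans(bayes_factor):
    """Weight of evidence in decibans: 10 * log10 of the Bayes factor."""
    return 10 * math.log10(bayes_factor)

# A factor of 2 in favour of a hypothesis is about 3 db ...
assert round(decibans(2), 1) == 3.0
# ... and weights add where Bayes factors multiply:
assert math.isclose(decibans(2) + decibans(5), decibans(10))
assert decibans(10) == 10.0  # one ban = 10 decibans
```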
2.2 Observation and Sufficiency
In this chapter we confine ourselves to the most familiar way of updating probabilities, that is, conditioning on a statement when an observation assures us of its truth.6

Example 1, Huyghens on light. Let H be the conjunction of Huyghens's hypotheses about light and let C represent one of the 'new phenomena which ought to follow from the hypotheses which one employs'.
If we know that C follows from H, and we can discover by observation whether C is true or false, then we have the means to test H—more or
4 I. J. Good, "A. M. Turing's Statistical Work in World War II", Biometrika 66 (1979), 393-6, at p. 394.
5 See Andrew Hodges, Alan Turing, the Enigma (New York, 1983). The dead hand of the British Official Secrets Act is loosening; see the A. M. Turing home page, http://www.turing.org.uk/turing/, maintained by Andrew Hodges.
6 In chapter 3 we consider updating by a generalization of conditioning, using statements that an observation makes merely probable to various degrees.
less conclusively, depending on whether we find that C is false or true. If C proves false (shaded region), H is refuted decisively, for H ∧ ¬C = ⊥. On the other hand, if C proves true, H's probability changes from

old(H) = area of the H circle / area of the square = old(H)/1

to

new(H) = old(H|C) = area of the H circle / area of the C circle = old(H)/old(C),

so that the probability factor is

π(H) = 1/old(C)
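When H implies C, conditioning on a verified C simply rescales: new(H) = old(H)/old(C), so π(H) = 1/old(C), and the less probable the prediction was, the bigger the boost on verification. A numerical sketch with numbers of our own choosing:

```python
from fractions import Fraction as F

def condition_on_verified_prediction(old_H, old_C):
    """If H implies C, then old(H & C) = old(H), so new(H) = old(H|C) = old(H)/old(C)."""
    assert old_H <= old_C  # H implies C: the H region lies inside the C region
    return old_H / old_C

old_H = F(1, 10)
for old_C in (F(9, 10), F(1, 2), F(1, 5)):
    new_H = condition_on_verified_prediction(old_H, old_C)
    print(old_C, new_H, new_H / old_H)  # probability factor pi(H) = 1/old(C)

# The least probable prediction (old(C) = 1/5) raises pr(H) the most: to 1/2.
```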
It is the antecedently least probable conclusions whose unexpected verification raises H's probability the most.7 The corollary to Bayes's theorem in section 1.6 generalizes this result to cases where pr(C|H) < 1.

Before going on, we note a small point that will loom larger in chapter 3: In case of a positive observational result it may be inappropriate to update by conditioning on the simple statement C, for the observation may provide further information which overflows that package. In example 1 this problem arises when the observation warrants a report of form C ∧ E, which would be represented by a subregion of the C circle, with old(C ∧ E) < old(C). Here (unless there is further overflow) the proposition to condition on is C ∧ E, not C. But things can get even trickier, as in this homely jellybean example:

Example 2, The Green Bean.8 H, This bean is lime-flavored. C, This bean is green. You are drawing a bean from a bag in which you know that half of the beans are green, all the lime-flavored ones are green, and the green ones are equally divided between lime and mint flavors. So before looking at the bean or tasting it, your probabilities are these: old(C) = 1/2 = old(H|C); old(H) = 1/4. But although new(C) = 1, your probability new(H) for lime can drop below old(H) = 1/4 instead of rising to old(H|C) = 1/2 in case you yourself are the observer—for instance, if, when you see that the bean is green, you also get a whiff of mint, or also see that the bean is of a special shade of green that you have found to be associated with the mint-flavored ones. And here there need be no further proposition E you can articulate that would make C ∧ E the thing to condition on, for you may have no words for the telltale shade of green, or for the whiff of mint.

7 "More danger, more honor": George Pólya, Patterns of Plausible Inference, 2nd ed. (Princeton, 1968), vol. 2, p. 126.
8 Brian Skyrms [reference?]
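The green-bean arithmetic can be checked with a small calculation. A sketch: the whiff likelihoods (0.8 given mint, 0.05 given lime) are illustrative assumptions of mine, not figures from the text; the point is only that such an unarticulated input can push new(H) below old(H) even though new(C) = 1.

```python
# Green Bean example: old(lime) = old(mint) = 1/4 (the green beans),
# old(non-green) = 1/2. Seeing green eliminates the non-green half;
# a whiff of mint then reweights lime against mint.
# The whiff likelihoods are illustrative assumptions.

def new_prob_lime(whiff_given_mint, whiff_given_lime):
    p_lime, p_mint = 0.25, 0.25  # the two green possibilities
    num = p_lime * whiff_given_lime
    return num / (num + p_mint * whiff_given_mint)

# No whiff evidence (equal likelihoods): plain conditioning on green.
print(new_prob_lime(1.0, 1.0))    # 0.5 = old(H|C)
# With a whiff of mint, new(H) drops below old(H) = 0.25:
print(new_prob_lime(0.8, 0.05))   # ≈ 0.06
```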
The difficulty will be especially bothersome as an obstacle to collaboration with others who would like to update their own prior probabilities by conditioning on your findings. (In chapter 3 we will see what to do about it.)

2.3 Leverrier on Neptune

We now turn to a methodological story more recent than Huyghens's:9

"On the basis of Newton's theory, the astronomers tried to compute the motions of [. . . ] the planet Uranus; the differences between theory and observation seemed to exceed the admissible limits of error. Some astronomers suspected that these deviations might be due to the attraction of a planet revolving beyond Uranus' orbit, and the French astronomer Leverrier investigated this conjecture more thoroughly than his colleagues. Examining the various explanations proposed, he found that there was just one that could account for the observed irregularities in Uranus's motion: the existence of an extra-Uranian planet [sc. Neptune]. He tried to compute the orbit of such a hypothetical planet from the irregularities of Uranus. Finally Leverrier succeeded in assigning a definite position in the sky to the hypothetical planet [say, with a 1 degree margin of error]. He wrote about it to another astronomer whose observatory was the best equipped to examine that portion of the sky. The letter arrived on the 23rd of September 1846 and in the evening of the same day a new planet was found within one degree of the spot indicated by Leverrier. It was a large ultra-Uranian planet that had approximately the mass and orbit predicted by Leverrier."

We treated Huyghens's conclusion as a strict deductive consequence of his principles, so old(C|H) ≈ 1, given his H (namely, Newton's laws and observational data about Uranus). And presumably the rigidity condition was satisfied, so that new(C|H) ≈ 1, too. But Pólya made the more realistic assumption that Leverrier's prediction C (a bright spot near a certain point in the sky at a certain time) was highly probable but not 100%, given his H. Then verification of C would have raised H's probability by a factor ≈ 1/old(C), which is large if the prior probability old(C) of Leverrier's prediction was ≈ 0.

Pólya offers a reason for regarding 1/old(C) as at least 180, and perhaps as much as 13131: The accuracy of Leverrier's prediction proved to be better than 1 degree, and the probability of a randomly selected point on a circle or on a sphere being closer than 1 degree to a previously specified point is 1/180 for a circle, and about 1/13131 for a sphere. Favoring the circle is the fact that the orbits of all known planets lie in the same plane ("the ecliptic"); then the great circle cut out by that plane gets the lion's share of probability. Thus, if old(C) is half of 1%, H's probability factor will be about 200.

2.4 Dorling on the Duhem Problem

Skeptical conclusions about the possibility of scientific hypothesis-testing are often drawn from the presumed arbitrariness of answers to the question of which to give up—theory, or auxiliary hypothesis—when they jointly contradict empirical data. The problem, addressed by Duhem in the first years of the 20th century, was agitated by Quine in mid-century. As drawn by some of Quine's readers, the conclusion depends on his assumption that aside from our genetical and cultural heritage, deductive logic is all we've got to go on when it comes to theory testing. That would leave things pretty much as Descartes saw them, just before the mid-17th century emergence, in the hands of Fermat, Pascal, Huyghens and others, of the probabilistic ("Bayesian") methodology that Jon Dorling has brought to bear on various episodes in the history of science. The conclusion is one that scientists themselves generally dismiss, thinking they have good reason to evaluate the effects of evidence as they do, but regarding formulation and justification of such reasons as someone else's job—the methodologist's. Here is an introduction to Dorling's work on that job, using extracts from his important but still unpublished 1982 paper.10 References here ('sec. 4', etc.) are to the numbered sections of that paper.

9 George Pólya, op. cit., pp. 130-132.
10 'Bayesian personalism, the methodology of research programmes, and Duhem's problem', Studies in History and Philosophy of Science 10 (1979) 177-187. More along the same lines: Michael Redhead, 'A Bayesian reconstruction of the methodology of scientific research programmes', Studies in History and Philosophy of Science 11 (1980) 341-347. Dorling's work is also discussed in Colin Howson and Peter Urbach, Scientific Reasoning: the Bayesian Approach (Open Court, La Salle, Illinois, 2nd ed., 1993). Dorling's unpublished paper from which excerpts appear here in sec. 2.4 is 'Further illustrations of the Bayesian solution of Duhem's problem' (29 pp., photocopied, 1982).
It is reproduced in the web page "http:\\www.princeton.edu\~bayesway" with Dorling's permission.

The material is presented here in terms of odds factors. Assuming rigidity relative to D, the odds factor for a theory T against an alternative theory S that is due to learning that D is true will be the left-hand side of the following equation, the right-hand side of which is called "the likelihood ratio":

Bayes Factor = Likelihood Ratio: (old(T|D)/old(S|D)) / (old(T)/old(S)) = old(D|T)/old(D|S)

The empirical result D is not generally deducible or refutable by T alone, or by S alone, but in interesting cases of scientific hypothesis testing D is deducible or refutable on the basis of the theory and an auxiliary hypothesis A (e.g., the hypothesis that the equipment is in good working order). To simplify the analysis, Dorling makes an assumption that can generally be justified by appropriate formulation of the auxiliary hypothesis:

Prior Independence: old(A ∧ T) = old(A)old(T), old(A ∧ S) = old(A)old(S).

Generally speaking, S is a rival to T: S is not simply the denial of T, but a definite scientific theory in its own right, or a disjunction of such theories, all of which agree on the phenomenon of interest, as an explanation of that phenomenon. In any case Dorling uses the independence assumption to expand the right-hand side of the Bayes Factor = Likelihood Ratio equation:

(1) β(T : S) = (old(D|T ∧ A)old(A) + old(D|T ∧ ¬A)old(¬A)) / (old(D|S ∧ A)old(A) + old(D|S ∧ ¬A)old(¬A))

To study the effect of D on A, he also expands β(A : ¬A) with respect to T (and similarly with respect to S, although we do not show that here):

(2) β(A : ¬A) = (old(D|A ∧ T)old(T) + old(D|A ∧ ¬T)old(¬T)) / (old(D|¬A ∧ T)old(T) + old(D|¬A ∧ ¬T)old(¬T))
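Dorling's expansion of the odds factor is easy to mechanize. A minimal sketch; the function and variable names are mine, and the example likelihoods are illustrative assumptions:

```python
# Dorling's expansion of beta(T:S), assuming prior independence of the
# auxiliary hypothesis A from each theory:
#   beta(T:S) = [old(D|T&A)old(A) + old(D|T&~A)old(~A)]
#             / [old(D|S&A)old(A) + old(D|S&~A)old(~A)]

def odds_factor(d_given_TA, d_given_TnotA, d_given_SA, d_given_SnotA, old_A):
    num = d_given_TA * old_A + d_given_TnotA * (1 - old_A)
    den = d_given_SA * old_A + d_given_SnotA * (1 - old_A)
    return num / den

# If both theories fit D equally well whatever the state of the
# apparatus, learning D moves nothing:
print(odds_factor(0.3, 0.3, 0.3, 0.3, 0.8))  # 1.0
```

Multiplying the prior odds old(T)/old(S) by this factor gives the posterior odds; the subsections below substitute particular values into it.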
2.4.1 Einstein/Newton, 1919

In these terms Dorling analyzes two famous tests that were duplicated, with apparatus differing in seemingly unimportant ways, with conflicting results: one of the duplicates confirmed T against S, the other confirmed S against T. But in each case the scientific experts took the experiments to clearly confirm one of the rivals against the other. Dorling explains why the experts were right:

"In the solar eclipse experiments of 1919, the telescopic observations were made in two locations, but only in one location was the weather good enough to obtain easily interpretable results. Here, at Sobral, there were two telescopes: one, the one we hear about, confirmed Einstein; the other, in fact the slightly larger one, confirmed Newton. Conclusion: Einstein was vindicated, and the results with the larger telescope were rejected." (Dorling, sec. 4)

Notation
T : Einstein: light-bending effect of the sun
S : Newton: no light-bending effect of the sun
A : Both telescopes are working correctly
D : The conflicting data from both telescopes

In the odds factor β(T : S) above, old(D|T ∧ A) = old(D|S ∧ A) = 0, since if both telescopes were working correctly they would not have given contradictory results. Then the first terms of the sums in numerator and denominator vanish, the factors old(¬A) cancel, and we have

β(T : S) = old(D|T ∧ ¬A) / old(D|S ∧ ¬A)

"Now the experimenters argued that one way in which A might easily be false was if the mirror of one or the other of the telescopes had distorted in the heat, and this was much more likely to have happened with the larger mirror belonging to the telescope which confirmed S than with the smaller mirror belonging to the telescope which confirmed T. Now the effect of mirror distortion of the kind envisaged would be to shift the recorded images of the stars from the positions predicted by T to or beyond those predicted by S." Hence old(D|T ∧ ¬A) was regarded as having an
appreciable value, while old(D|S ∧ ¬A) was regarded as negligibly small, since it was very hard to think of any similar effect which could have shifted the positions of the stars in the other telescope from those predicted by S to those predicted by T. Thus the Bayes factor β(T : S) is very much greater than 1, and hence the result was overall a decisive confirmation of T and refutation of S.

2.4.2 Bell's Inequalities: Holt/Clauser

"Holt's experiments were conducted first and confirmed the predictions of the local hidden variable theories and refuted those of the quantum theory. Holt refrained from publishing his results, but Clauser published his, and they were rightly taken as excellent evidence for the quantum theory and against hidden-variable theories. Clauser examined Holt's apparatus and could find nothing wrong with it, and obtained the same results as Holt with Holt's apparatus." (Dorling, sec. 4)

Notation
T : Quantum theory
S : Disjunction of local hidden variable theories
A : Holt's setup is reliable11 enough to distinguish T from S
D : The specific correlations predicted by T and contradicted by S are not detected by Holt's setup

The characterization of D yields the first two of the following equations. In conjunction with the characterization of A it also yields old(D|T ∧ ¬A) = 1, for if A is false, Holt's apparatus was not sensitive enough to detect the correlations that would have been present according to T; and it yields old(D|S ∧ ¬A) = 1 because of the wild improbability of the apparatus "hallucinating" those specific correlations.12

old(D|T ∧ A) = 0, old(D|S ∧ A) = old(D|T ∧ ¬A) = old(D|S ∧ ¬A) = 1

11 This means: sufficiently sensitive and free from systematic error. Holt's setup proved to incorporate systematic error arising from tension in tubes for mercury vapor, which made the glass optically sensitive so as to rotate polarized light. Thanks to Abner Shimony for clarifying this. See also his supplement 6 in sec. 2.6.
12 Recall Venn on the credibility of extraordinary stories: supplement 7 in sec. 1.9.
Substituting these values, we have

β(T : S) = old(¬A)

Then with a prior probability of 4/5 for adequate sensitivity of Holt's apparatus, and with prior odds 45:55 between T and S, the posterior odds are only 9:55, i.e., a 14% probability for T. Setting ¬T = S in (2), because T and S were the only theories given any credence as explanations of the results, we have

β(A : ¬A) = old(S)

so the odds on A fall from 4:1 to 2.2:1; the probability of A falls from 80% to 69%.

Now why didn't Holt publish his result? Because the experimental result undermined confidence in his apparatus. The odds between quantum theory and the local hidden variable theories shift strongly in favor of the latter, but the swing to S depends too much on a prior confidence in the experimental setup that is undermined by the same thing that caused the swing: although old(A) has fallen by only 11%, Holt is not prepared to publish with better than a 30% chance that his apparatus could have missed actual quantum mechanical correlations. At this point, both experimenters still trust Holt's well-tried setup better than Clauser's.

Why did Clauser publish?

Notation
T : Quantum theory
S : Disjunction of local hidden variable theories
C : Clauser's setup is reliable enough
E : The specific correlations predicted by T and contradicted by S are detected by Clauser's setup

Suppose that old(C) = .5, and suppose Clauser's initial results E indicate presence of the quantum mechanical correlations pretty strongly, but still with a 1% chance of error. Then E strongly favors T over S:13

β(T : S) = (old(E|T ∧ C)old(C) + old(E|T ∧ ¬C)old(¬C)) / (old(E|S ∧ C)old(C) + old(E|S ∧ ¬C)old(¬C)) = (1×.5 + .01×.5) / (.01×.5 + .01×.5) = 50.5

13 Making the same substitutions as in (4).
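The Holt and Clauser figures above can be checked directly. A sketch; the variable names are mine, the probability values are the ones given in the text:

```python
# Holt: with old(D|T&A) = 0 and the other three likelihoods 1,
# Dorling's expansions give beta(T:S) = old(~A) and beta(A:~A) = old(S).
old_A, old_T, old_S = 0.8, 0.45, 0.55

beta_TS = (0 * old_A + 1 * (1 - old_A)) / (1 * old_A + 1 * (1 - old_A))
post_odds_T = (old_T / old_S) * beta_TS   # 45:55 -> 9:55
print(post_odds_T)                        # ≈ 0.164 = 9/55
print(9 / (9 + 55))                       # ≈ 0.14, the 14% probability for T

beta_A = (0 * old_T + 1 * old_S) / (1 * old_T + 1 * old_S)
print(4 * beta_A)                         # ≈ 2.2: odds on A fall from 4:1 to 2.2:1

# Clauser: old(C) = .5 and a 1% error rate give beta(T:S) = 50.5.
beta_clauser = (1 * 0.5 + 0.01 * 0.5) / (0.01 * 0.5 + 0.01 * 0.5)
print(beta_clauser)                       # ≈ 50.5
```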
Starting from the low 9:55 to which T's odds fell after Holt's experiment, the odds after Clauser's experiment will be 909:110, i.e., an 89% probability for T. The result E also boosts confidence in Clauser's apparatus, by a factor of

β(C : ¬C) = (old(E|C ∧ T)old(T) + old(E|C ∧ S)old(S)) / (old(E|¬C ∧ T)old(T) + old(E|¬C ∧ S)old(S)) = 15

This raises the initially even odds on C to 15:1, i.e., raises the probability from 50% to 94%, and lowers the 50% probability of the effect's being due to chance down to 6 or 7 percent.

2.4.3 Laplace/Adams

Finally, note one more class of cases: a theory T remains highly probable although (with auxiliary hypothesis A) it is incompatible with subsequently observed data D. With S = ¬T in the formulas for β in 2.4, with old(D|T ∧ A) = 0 (so that the first terms in the numerators vanish), and writing

t = old(D|T ∧ ¬A) / old(D|¬T ∧ ¬A),  s = old(D|¬T ∧ A) / old(D|¬T ∧ ¬A)

for two of the likelihood ratios, it is straightforward to show that

β(T : ¬T) = t / (1 + (s × old odds on A)),  β(A : ¬A) = s / (1 + (t × old odds on T)),  β(T : A) = (t/s) × (old(¬A)/old(¬T)).

These formulas apply to "a famous episode from the history of astronomy which clearly illustrated striking asymmetries in 'normal' scientists' reactions to confirmation and refutation. This particular historical case furnished an almost perfect controlled experiment from a philosophical point of view, because owing to a mathematical error of Laplace, later corrected by Adams, the same observational data were first seen by scientists as confirmatory and later as disconfirmatory of the orthodox theory. Yet their reactions were strikingly asymmetric: what was initially seen as a great triumph and of striking evidential weight in favour of the Newtonian theory, was later, when it had to be reanalyzed as disconfirmatory after the discovery of Laplace's mathematical oversight, viewed merely as a minor embarrassment and of negligible evidential weight against the Newtonian theory. Scientists reacted in the 'refutation' situation by making a hidden auxiliary hypothesis, which had previously been considered plausible, bear the brunt of the refutation, or, if you like, by introducing that hypothesis's negation as an apparently ad hoc face-saving auxiliary hypothesis." (Dorling, sec. 1)

Notation
T : The theory, Newtonian celestial mechanics
A : The hypothesis that disturbances (tidal friction, etc.) make a negligible contribution to D
D : The observed secular acceleration of the moon

Dorling argues on scientific and historical grounds for approximate numerical values

t = 1,  s = 1/50.

The thought is that t = 1 because with A false, truth or falsity of T is irrelevant to D, and t = 50s because in plausible partitions of ¬T into rival theories predicting lunar accelerations, old(R|¬T) = 1/50, where R is the disjunction of rivals not embarrassed by D. Then for a theorist whose odds are 3:2 on A and 9:1 on T (probabilities 60% for A and 90% for T):

β(T : ¬T) = 100/103,  β(A : ¬A) = 1/500,  β(T : A) = 200.

Thus the prior odds 900:100 on T barely decrease, to 900:103; the new probability of T, 900/1003, agrees with the original 90% to two decimal places. But odds on the auxiliary hypothesis A drop sharply, from prior 3:2 to posterior 3/1000; the probability of A drops from 60% to about three tenths of 1%. And odds on the theory against the auxiliary hypothesis increase by a factor of 200, from a prior value of 3:2 to a posterior value of 300:1.
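Dorling's Laplace/Adams figures can be reproduced from the formulas above. A sketch using exactly the values given in the text; the variable names are mine:

```python
# Laplace/Adams: t = 1, s = 1/50, with odds 3:2 on A and 9:1 on T
# (probabilities 60% for A, 90% for T).
t, s = 1.0, 1 / 50
odds_A, odds_T = 3 / 2, 9.0
p_A, p_T = 0.6, 0.9

beta_T_notT = t / (1 + s * odds_A)            # 100/103: T barely moves
beta_A_notA = s / (1 + t * odds_T)            # 1/500: A collapses
beta_T_A = (t / s) * ((1 - p_A) / (1 - p_T))  # 200

print(beta_T_notT * 9)         # ≈ 8.74 = 900/103: posterior odds on T
print(beta_A_notA * odds_A)    # ≈ 0.003 = 3/1000: posterior odds on A
print(beta_T_A * (p_T / p_A))  # ≈ 300: posterior odds of T against A
```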
2.4.4 Dorling's Conclusions

"Until recently there was no adequate theory available of how scientists should change their beliefs in the light of evidence. Standard logic is obviously inadequate to solve this problem unless supplemented by an account of the logical relations between degrees of belief which fall short of certainty. Subjective probability theory provides such an account and is the simplest such account that we possess. When applied to the celebrated Duhem (or Duhem-Quine) problem and to the related problem of the use of ad hoc, or supposedly ad hoc, hypotheses in science, it yields an elegant solution. This solution has all the properties which scientists and philosophers might hope for. It provides standards for the validity of informal inductive reasoning comparable to those which traditional logic has provided for the validity of informal deductive reasoning. These standards can be provided with a rationale and justification quite independent of any appeal to the actual practice of scientists, or to the past success of such practices.14 Nevertheless they seem fully in line with the intuitions of scientists in simple cases and with the intuitions of the most experienced and most successful scientists in trickier and more complex cases. The Bayesian analysis indeed vindicates the rationality of experienced scientists' reactions in many cases where those reactions were superficially paradoxical and where the relevant scientists themselves must have puzzled over the correctness of their own intuitive reactions to the evidence. It is clear that in many such complex situations many less experienced commentators and critics have sometimes drawn incorrect conclusions and have mistakenly attributed the correct conclusions of the experts to scientific dogmatism. Recent philosophical

14 Here a long footnote explains the Putnam-Lewis Dutch book argument for conditioning. Putnam stated the result, or, anyway, a special case, in a 1963 Voice of America Broadcast, 'Probability and Confirmation', reprinted in his Mathematics, Matter and Method, Cambridge University Press (1975) 293-304. Paul Teller, 'Conditionalization and observation', Synthese 26 (1973) 218-258, reports a general argument to that effect which Lewis took to have been what Putnam had in mind.
and sociological commentators have sometimes generalized this mistaken reaction into a full-scale attack on the rationality of men of science, and as a result have mistakenly looked for purely sociological explanations for many changes in scientists' beliefs, or the absence of such changes, which were in fact, as we now see, rationally de rigueur." (Dorling, sec. 5)

2.5 Old News Explained

In sec. 2.2 we analyzed the case Huyghens identified as the principal one in his Treatise on Light: A prediction C long known to follow from a hypothesis H is now found to be true. Here, if the rigidity condition is satisfied, the probability factor is π(H) = 1/old(C). But what if some long-familiar phenomenon C, a phenomenon for which old(C) = 1, is newly found to follow from H in conjunction with familiar background information B about the solar system, and thus to be explained by H ∧ B? Here, if we were to update by conditioning on C, the probability factor would be 1 and new(H) would be the same as old(H). No confirmation here.15

Wrong, says Daniel Garber:16 The information prompting the update is not that C is true, but that H ∧ B implies C. To condition on that news Garber proposes to enlarge the domain of the probability function old by adding to it the hypothesis that C follows from H ∧ B, together with all further truth-functional compounds of that new hypothesis with the old domain. Using some extension old* of old to the enlarged domain, we might then have new(H) = old*(H | H ∧ B implies C). That is an attractive approach to the problem, if it can be made to work.17

The alternative approach that we now illustrate defines the new assignment on the same domain that old had. It analyzes the old → new transition by embedding it in a larger process, ur → old → new, in which ur represents an original state of ignorance of C's truth and its logical relationship to H.

15 This is what Clark Glymour has dubbed 'the paradox of old evidence': see his Theory and Evidence, Princeton University Press, 1980.
16 See his 'Old evidence and logical omniscience in Bayesian confirmation theory' in Testing Scientific Theories, ed. John Earman, University of Minnesota Press, 1983.
17 For a critique, see chapter 5 of John Earman's Bayes or Bust? (MIT Press, 1992). For further discussion of this proposal see 'Bayesianism with a human face' in my Probability and the Art of Judgment, Cambridge University Press, 1992.
Example 1. The perihelion of Mercury.

H : GTR applied to the Sun-Mercury system
C : Advance of 43 seconds of arc per century18

In 1915 Einstein presented a report to the Prussian Academy of Sciences explaining the then-known advance of approximately 43″ per century in the perihelion of Mercury in terms of his ("GTR") field equations for gravitation. An advance of 38″ per century had been reported by Leverrier in 1859, due to 'some unknown action on which no light has been thrown':19

"The only way to explain the effect would be ([Leverrier] noted) to increase the mass of Venus by at least 10 percent, an inadmissible modification. He strongly doubted that an intramercurial planet ["Vulcan"], as yet unobserved, might be the cause. A swarm of intramercurial asteroids was not ruled out, he believed."

Leverrier's figure of 38″ was soon corrected to its present value of 43″. This figure for Mercury has remained good since its publication in 1882 by Simon Newcomb, but the difficulty for Newtonian explanations of planetary astronomy was still

18 Applied to various Sun-planet systems, the GTR says that all planets' perihelions advance, but that Mercury's is the only one for which that advance should be observable.
19 Abraham Pais, 'Subtle is the Lord . . . ', the Science and Life of Albert Einstein, Oxford University Press, 1982, p. 254.
in place 65 years later, when Einstein finally managed to provide a general relativistic explanation 'without special assumptions' (such as Vulcan)—an explanation which was soon accepted as strong confirmation for the GTR.

In the diagram above, the "ur" probability distribution, indicated schematically at the top, represents a time before C was known to follow from H. The left-hand path, ur → prn → new, represents an expansion of the account in sec. 2.2 of Huyghens's "principal" case ('prn' for 'principal'), in which the confirming phenomenon is verified after having been predicted via the hypothesis which its truth confirms. The reasoning has two steps. First: Since we now know that if H is true, C must be true as well, prn(H ∧ ¬C) is set equal to 0. To accommodate that discovery, the 'b' in the ur-distribution is moved left and added to the 'a' to obtain the top row of the prn distribution, so that the probability of H ∧ C is increased to prn(H ∧ C) = a + b. Second: Since implying a prediction that may well be false neither confirms nor infirms H, the prn probability of H is to remain at its ur-value even though the probability of H ∧ ¬C has been nullified. And this prn distribution is where the Huyghens on light example in sec. 2.2 begins: verification of C leads us to the bottom distribution, new, which assigns the following odds to H:

new(H)/new(¬H) = (a + b)/c

In terms of the present example this new distribution is what Einstein arrived at from the distribution labelled old, in which

old(H)/old(¬H) = a/c

The rationale is commutativity: The new distribution is the same, no matter if the observation or the explanation comes first. Now the Bayes factor gives the clearest view of the transition old(H) → new(H):20

20 The approach here is taken from Carl Wagner, "Old evidence and new explanation I" and "Old evidence and new explanation II", Philosophy of Science 64, No. 3 (1997) 677-691 and 66 (1999) 283-288, and "Old evidence and new explanation III", pp. S165-S175 in PSA 2000 (J. A. Barrett and J. McK. Alexander, eds.), part 1, supplement to Philosophy of Science 68 [2001] (Proceedings of the 2000 biennial meeting of the Philosophy of Science Association), which reviews and extends earlier work.
Perhaps. American Mathematical Monthly 48 (1941) 450465. .5) /c and similarly for other Ci ’s. This is clearer in the case of urindependence of H and C: Example 2. 2.1) is w(H : ¬H) ≥ 3. there seem to be strong reasons for taking the ratio ur(¬CH)/ur(CH) to be very large. in a notation in which C = C43 = an advance of (43 ± . . . projected into a mythical methodological past. since. . again. .22 And on the plausible assumption23 of urindependence of H from C. He notes that T is a consequence of H.21 That’s the good news.2 here. Urindependence. We derive a sequence of consequences from T . behind the veil of urignorance. more honor. But maybe that’s not so bad. 999 of them? Then β(H : ¬H) ≥ 1000. are equiprobable. quoted in sec. TESTING SCIENTIFIC THEORIES β(H : ¬H) = a+b a ur(¬CH) / =1+ . 24 This problem and the next are from George P´lya. to a ﬁrst approximation. o 23 Carl Wagner’s. We succeed in verifying C1 . Arguably. C38 ∨ C39 ∨ C40 ∨ C41 ∨ C42 ∨ C44 ∨ . and the weight of evidence (2. If Wagner’s independence assumption is generally persuasive. ‘Heuristic reasoning and the theory o of probability’. C2 . As we have just seen in a very sketchy way. “More [ur]danger. given H). there is a large array of Ci ’s. . which. c c ur(CH) 52 Thus the new explanation of old news C increases the odds on H by a factor of (1 + the urodds against C. . C3 . this is very much greater than 1. ¬C is a disjunction of very many “almost” incompatible disjuncts: ¬C = . 2. .6 Supplements 1 “Someone is trying decide whether or not T is true. since.” See P´lya. . How does this refutation aﬀect the probability of T ?”24 2 “We are trying to decide whether or not T is true. say C1 . the physics community’s urodds against C (“C43 ”) will be very high.CHAPTER 2. then “Almost”: ur(Ci ∧ Cj ) = 0 for any distinct integers i and j. 21 22 . Later he succeeds in proving that H is false. 
the formula for β becomes even simpler: β(H : ¬H) = 1 + ¯ ur(C) ur(C) The bad news is that ur and prn are new constructs.
3 Fallacies of "yes/no" confirmation.25 Each of the following plausible rules is unreliable. Find counterexamples—preferably, simple ones.
(a) If D confirms T, and T implies H, then D confirms H.
(b) If D confirms H and T separately, it must confirm their conjunction, T ∧ H.
(c) If D and E each confirm H, then their conjunction, D ∧ E, must also confirm H.
(d) If D confirms a conjunction, T ∧ H, then it cannot infirm each conjunct separately.

4 Hempel's ravens. It seems evident that black ravens confirm (H) 'All ravens are black' and that non-black non-ravens do not. Yet H is logically equivalent to 'All non-ravens are non-black'. Use probabilistic considerations to resolve this paradox of "yes/no" confirmation.26

5 Wagner III (sec. 2.6).27 Carl Wagner extends his treatment of old evidence newly explained to cases in which updating is by generalized conditioning on a countable sequence C = C1, C2, . . . , where none of the updated probabilities need be 1, and to cases in which each of the C's "follows only probabilistically" from H in the sense that the conditional probabilities of the C's, given H, are high but < 1. Here we focus on the simplest case, in which C has two members, C = C, ¬C. Where (as in the diagram in sec. 2.6) updates are not always from old to new, we indicate the prior and posterior probability measures by a subscript and superscript: β_prior^post. In particular, we shall write β_old^new(H : ¬H) explicitly, instead of β(H : ¬H) as in (3) and (4) of sec. 2.6. And we shall use 'β*' as shorthand: β* = β_ur^prn(C : ¬C).

(a) Show that in the framework of sec. 2.6 commutativity is equivalent to

Uniformity: β_old^new(A : A′) = β_ur^prn(A : A′) if A, A′ ∈ {H ∧ C, H ∧ ¬C, ¬H ∧ C, ¬H ∧ ¬C}.

(b) Show that uniformity holds whenever the updates ur → old and prn → new are both by generalized conditioning.

25 These stem from Carl G. Hempel's 'Studies in the logic of confirmation', Mind 54 (1945) 1-26 and 97-121, which is reprinted in his Aspects of Scientific Explanation, The Free Press, New York, 1965.
26 The paradox was first floated (1937) by Carl G. Hempel, in an abstract form: see pp. 50-51 of his Selected Philosophical Essays, Cambridge University Press, 2000. For the first probabilistic solution, see 'On confirmation' by Janina Hosiasson-Lindenbaum, The Journal of Symbolic Logic 5 (1940) 133-148. For a critical survey of more recent treatments, see pp. 69-73 of John Earman's Bayes or Bust?
27 This is an easily detachable bit of Wagner's "Old evidence and new explanation III" (cited above in sec. 2.5).
Now verify that where uniformity holds, so do the following two formulas:

(c) β_ur^new(H : ¬H) = ((β* − 1)prn(C|H) + 1) / ((β* − 1)prn(C|¬H) + 1)

(d) β_ur^old(H : ¬H) = ((β* − 1)ur(C|H) + 1) / ((β* − 1)ur(C|¬H) + 1)

And show that, given uniformity:

(e) if H and C are ur-independent, then old(H) = ur(H) and, therefore,

(f) β_old^new(H : ¬H) = ((β* − 1)prn(C|H) + 1) / ((β* − 1)prn(C|¬H) + 1)

(g) Given uniformity, if H and C are ur-independent then
(1) β_old^new(H : ¬H) depends only on β*, prn(C|H) and prn(C|¬H);
(2) if β* > 1 and prn(C|H) > prn(C|¬H), then β_old^new(H : ¬H) > 1;
(3) β_old^new(H : ¬H) → prn(C|H)/prn(C|¬H) as β* → ∞.

6 Shimony on Holt/Clauser.28 "Suppose that the true theory is local hidden variables, but Clauser's apparatus [which supported QM] was faulty. Then you have to accept that the combination of systematic and random errors yielded (within the rather narrow error bars) the quantum mechanical prediction. The probability of such a coincidence is very small, since the part of experimental space that agrees with a numerical prediction is small, and the quantum mechanical prediction is a strict correlation, which is a definite number. By contrast, if quantum mechanics is correct and Holt's apparatus is faulty, it is not improbable that results in agreement with Bell's inequality (hence with local hidden variables theories) would be obtained, because agreement occurs when the relevant number falls within a rather large interval. Also, the errors would usually have the effect of disguising correlations. Hence good Bayesian reasons can be given for voting for Clauser over Holt, even if one disregards later experiments devised in order to break the tie." Compare: Relative ease of the Rockies eventually wearing down to Adirondacks size, as against improbability of the Adirondacks eventually reaching the size of the Rockies.

28 Personal communication, 12 Sept. 2002.
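The limiting claims in supplement 5(g) can be spot-checked numerically. A sketch; the probability values are arbitrary test inputs of mine, chosen only so that prn(C|H) > prn(C|¬H):

```python
# Formula (f): beta_old_new(H:~H) = ((b - 1)*p1 + 1) / ((b - 1)*p2 + 1),
# with p1 = prn(C|H), p2 = prn(C|~H), b = beta* = beta_ur_prn(C:~C).
def beta_old_new(b, p1, p2):
    return ((b - 1) * p1 + 1) / ((b - 1) * p2 + 1)

p1, p2 = 0.9, 0.3          # arbitrary test values with p1 > p2
# (2): beta* > 1 and p1 > p2 make the factor exceed 1:
print(beta_old_new(5.0, p1, p2) > 1)        # True
# (3): as beta* grows, the factor tends to p1/p2 = 3:
for b in (10.0, 1000.0, 10**6):
    print(beta_old_new(b, p1, p2))
```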
Chapter 3

Probability Dynamics; Collaboration

We now look more deeply into the matter of learning from experience, where a pair of probability assignments represents your judgments before and after you change your mind in response to the result of experiment or observation. We start with the simplest, most familiar mode of updating, which will be generalized in sec. 3.2 and applied in sec. 3.3 and 3.4 to the problem of learning from other people's experience-driven probabilistic updating.

3.1 Conditioning

Suppose your judgments were once characterized by an assignment ("old") which describes all of your conditional and unconditional probabilities as they were then. And suppose you became convinced of the truth of some data statement D. That would change your probabilistic judgments; they would no longer correspond to your prior assignment old, but to some posterior assignment new, where you are certain of D's truth:

certainty: new(D) = 1

How should your new probabilities for hypotheses H other than D be related to your old ones? The simplest answer goes by the name of "conditioning" (or "conditionalization") on D:
conditioning: new(H) = old(H | D)

This means that your new unconditional probability for any hypothesis H will simply be your old conditional probability for H given D.

When is it appropriate to update by conditioning on D? It is easy to see—once you think of it—that certainty about D is not enough.

Example 1, The Red Queen. You learn that a playing card, drawn from a well-shuffled, standard 52-card deck, is a heart: your new(♥) = 1. Since you know that all hearts are red, this means that your new(red) = 1 as well. But since old(Queen ∧ ♥ | ♥) = 1/13 whereas old(Queen ∧ ♥ | red) = 1/26, you will have different new probabilities for the card's being the Queen of hearts, depending on which certainty you update on, ♥ or red. Of course, in example 1 we are in no doubt about which certainty you should condition on: it is the more informative one, ♥. And if what you actually learned was that the card was a court card as well as a heart, then that is what you ought to update on, so that your new(Q ∧ ♥) = old(Q ∧ ♥ | (Ace ∨ King ∨ Queen) ∧ ♥) = 1/3. But what if—although the thing you learned with certainty was indeed that the card was a heart—you also saw something you can't put into words, that gave you reason to suspect, uncertainly, that it was a court card?

Example 2. A brief glimpse determines your new probabilities as 1/20 for each of the ten numbered hearts, and as 1/6 for each of the J, K and Q. Here, conditioning on your most informative new certainty does not take all of your information into account; in sec. 3.2 we shall see how you might take such uncertain information into account. Meanwhile, it is clear that ordinary conditioning is not always the way to go.

What is the general condition under which it makes sense for you to plan to update by conditioning on D if you should become sure that D is true? The answer is confidence that your conditional probabilities given D will not be changed by the experience that makes you certain of D's truth:

Invariant conditional probabilities: (1) For all H, new(H | D) = old(H | D)

Since new(H | D) = new(H) when new(D) = 1, it is clear that certainty and invariance together imply conditioning. And conversely, conditioning implies both certainty and invariance: to obtain Certainty, set H = D in Conditioning; and Invariance follows from Conditioning and Certainty, since new(H) = new(H | D) when new(D) = 1.
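A minimal sketch of conditioning on the deck of example 1; the encoding of cards and events (ranks 1–13, with 12 as the Queen) is mine, not the book's:

```python
from fractions import Fraction

# Old probabilities for a card drawn from a well-shuffled 52-card deck.
# Events are modeled as predicates on cards; old is the uniform measure.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]

def old(event):
    return Fraction(sum(1 for c in deck if event(c)), len(deck))

def old_given(event, data):
    # old(H | D) = old(H and D) / old(D)
    return old(lambda c: event(c) and data(c)) / old(data)

hearts = lambda c: c[1] == "H"
red = lambda c: c[1] in ("H", "D")
queen_of_hearts = lambda c: c == (12, "H")

# Conditioning on the more informative certainty (hearts) vs. on red:
print(old_given(queen_of_hearts, hearts))  # 1/13
print(old_given(queen_of_hearts, red))     # 1/26
```

The two answers differ, which is the point of example 1: which certainty you condition on matters.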
3.2 Generalized Conditioning2

Certainty is quite demanding. It rules out not only the far-fetched uncertainties associated with philosophical skepticism, but also the familiar uncertainties that affect real empirical inquiry in science and everyday life, where some of what you have learned is just probabilistic. But certainty turns out to be dispensable: as long as invariance holds, a way of updating your probabilities is still available for you.

An equivalent invariance is that of your unconditional odds between different ways in which D might be true:

Invariant odds: (2) If A and B each imply D, new(A)/new(B) = old(A)/old(B).

And another is invariance of the "probability factors" π(A) by which your old probabilities of ways of D's being true can be multiplied in order to get their new probabilities:

Invariant probability factors: (3) If A implies D then π(A) = π(D), where π(A) =df new(A)/old(A).

Exercise. Show that (1) implies (3), which implies (2), which implies (1).

There are special circumstances in which the invariance condition dependably holds:

Example 3, The Statistician's Stooge.3 Invariance will surely hold in the variant of example 1 in which you do not see the card yourself, but have arranged for a trusted assistant to look at it and tell you simply whether or not it is a heart, with no further information. Under such conditions your unconditional probability that the card is a heart changes to 1 in a way that cannot change any of your conditional probabilities given ♥.

2 Also known as probability kinematics and Jeffrey conditioning. For a little more about this, see Persi Diaconis and Sandy Zabell, 'Updating subjective probability', Journal of the American Statistical Association 77 (1982) 822–830; for much more, see 'Some alternatives to Bayes's rule' by the same authors, in Information Pooling and Group Decision Making, Bernard Grofman and Guillermo Owen (eds.), JAI Press: Greenwich, Conn., and London, England, pp. 25–38.
3 I. J. Good's term.
When the relevant invariance conditions hold but your new probability for D falls short of 1—something makes you (say) 85% sure that the answer to a yes/no question is 'yes'—updating is valid by a generalization of conditioning to which we now turn.

We begin with the simplest case, of a yes/no question. The required updating rule is easily obtained from the law of total probability in the form

new(H) = new(H | D) new(D) + new(H | ¬D) new(¬D).

Invariance with respect to both answers means that, for each hypothesis H, new(H | D) = old(H | D) and new(H | ¬D) = old(H | ¬D). Rewriting the two conditional probabilities in this equation via invariance relative to D and to ¬D, we have the following updating scheme:

new(H) = old(H | D) new(D) + old(H | ¬D) new(¬D)

More generally, with any countable partition of answers Di, the applicable rule of total probability has one term on the right for each member of the partition. If the invariance condition holds for each answer—that is, new(H | Di) = old(H | Di) for i = 1, . . . , n—we have the updating scheme

"Probability Kinematics": new(H) = Σ(i=1 to n) old(H | Di) new(Di)

Example 4, A consultant's prognosis. Dr. Jane Doe, a histopathologist, conducts a microscopic examination of a section of tissue surgically removed from a tumor, in hopes of settling on one of the following diagnoses: D1 = Islet cell carcinoma; D2 = Ductal cell carcinoma; D3 = Benign tumor. She is sure that exactly one of the three is correct. Here n = 3, so (using accents to distinguish her probability assignments from yours) her new′(H) for the prognosis H of 5-year survival will be

new′(H) = old′(H | D1) new′(D1) + old′(H | D2) new′(D2) + old′(H | D3) new′(D3).
Suppose, further, that her conditional probabilities for H given the diagnoses are unaffected by her examination:

old′(H | Di) = new′(H | Di) = 4/10, 6/10, 9/10 for i = 1, 2, 3.

And suppose that, in the event, the examination does not drive her probability for any of the diagnoses to 1, but leads her to assign new′(Di) = 1/3, 1/6, 1/2 for i = 1, 2, 3. Then by probability kinematics her new probability for 5-year survival will be new′(H) = 41/60: a weighted average of the values that her old′(H) would have had if she had been sure of the three diagnoses, where the weights are her new probabilities for those diagnoses.

Notation: old and new are your probability assignments before and after the histopathologist's observation has led her to change her probability assignment from old′ to new′.

Note that the calculation of new′(H) makes no use of Dr Doe's old probabilities for the diagnoses, old′(Di). Certainly her new′(Di)'s will have arisen through an interaction of features of her prior mental state with her new experiences at the microscope; but while the results new′(Di) of that interaction are data for the kinematical formula, the formula itself does not compute those results. Indeed, her prior old′ may have been a partially defined function, assigning no numerical values at all to the Di's.

3.3 Probabilistic Observation Reports4

We now move outside the native ground of probability kinematics into a region where your new probabilities for the Di are to be influenced by someone else's probabilistic observation report. We continue in the medical setting: suppose you are a clinical oncologist who wants to make the best use you can of the observations of a histopathologist whom you have consulted. If you do simply adopt the expert's new probabilities for the diagnoses, setting your new(Di) = new′(Di) for each i, all you need are her new′(Di)'s and your invariant conditional prognoses old(H | Di); you can update by probability kinematics even if you had no prior diagnostic opinions old(Di) of your own. But you are unlikely to simply adopt such an observer's updated probabilities as your own, for they are necessarily a confusion of what the other person has gathered from the observation itself with that person's prior judgmental state, for which you may prefer to substitute your own.

4 See Wagner (2001) and Wagner, 'Probability Kinematics and Exchangeability', Philosophy of Science 69 (2002) 266–278.
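Example 4's kinematical update can be sketched in code, with the diagnostic probabilities as reconstructed above (4/10, 6/10, 9/10 for the invariant conditionals and 1/3, 1/6, 1/2 for her new diagnostic probabilities):

```python
from fractions import Fraction as F

def kinematics(new_D, old_H_given_D):
    # Probability kinematics: new(H) = sum_i old(H | D_i) * new(D_i)
    return sum(p * q for p, q in zip(old_H_given_D, new_D))

new_D = [F(1, 3), F(1, 6), F(1, 2)]             # her new'(D_i)
old_H_given_D = [F(4, 10), F(6, 10), F(9, 10)]  # invariant old'(H | D_i)

print(kinematics(new_D, old_H_given_D))  # 41/60
```

The result is the weighted average 41/60 mentioned in the text.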
But suppose you have your own priors old(Di) for the diagnoses, which you take to be well-founded; and although you have high regard for the pathologist's ability to interpret histographic slides, you view her prior probabilities old′(Di) for the various diagnoses as arbitrary and uninformed: she has no background information about the patient, but for the purpose of formulating her report she has adopted certain numerically convenient priors to update on the basis of her observations. It is not from the new′(Di) themselves but from the updates old′(Di) → new′(Di) that you must extract the information contributed by her observation. How? Here is a way: express your new(Di) as a product π(Di) old(Di) in which the factor π(Di) is constructed by combining the histopathologist's probability factors π′(Di) for the diagnoses with your priors for them, as follows:5

(1) π(Di) := π′(Di) / Σi π′(Di) old(Di)   (Your Di probability factor)

Now, writing your new(Di) = π(Di) old(Di) in the formula in sec. 3.2 for updating by probability kinematics, and applying the product rule, we have

(2) new(H) = Σi π(Di) old(H ∧ Di)

as your updated probability for a prognosis, and

(3) new(H1)/new(H2) = Σi π(Di) old(H1 ∧ Di) / Σi π(Di) old(H2 ∧ Di)

as your updated odds between two prognoses, in the light of the observer's report and your prior probabilities.6

Example 5. In example 4 the histopathologist's unspecified priors were updated to new values new′(Di) = 1/3, 1/6, 1/2 by her observation. Now if all her old′(Di)'s were 1/3, her probability factors would be π′(Di) = 1, 1/2, 3/2; and if your old(Di)'s were 1/4, 1/2, 1/4, your diagnostic probability factors (1) would be π(Di) = 8/7, 4/7, 12/7. And if your old(H ∧ Di)'s were 3/16, 1/32, 5/32, then by (2) your new(H) would be 1/2, as against old(H) = 3/8, so that your π(5-year survival) = 4/3.

It is noteworthy that formulas (1)–(3) remain valid when the probability factors are replaced by anchored odds (or "Bayes") factors:

5 Without the normalizing denominator in (1), the sum of your new(Di)'s might turn out to be ≠ 1.
6 Combine (1) and (2), then cancel the normalizing denominators to get (3).
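Formulas (1) and (2) can be sketched as follows, using the example 5 values (treat the specific fractions as reconstructed from the text, not authoritative):

```python
from fractions import Fraction as F

def combine(pi_expert, old_D):
    # (1): your factors pi(D_i) = pi'(D_i) / sum_i pi'(D_i) old(D_i)
    norm = sum(p * q for p, q in zip(pi_expert, old_D))
    return [p / norm for p in pi_expert]

def new_prob(pi, old_H_and_D):
    # (2): new(H) = sum_i pi(D_i) old(H & D_i)
    return sum(p * q for p, q in zip(pi, old_H_and_D))

pi_expert = [F(1), F(1, 2), F(3, 2)]          # her pi'(D_i) = new'/old'
old_D = [F(1, 4), F(1, 2), F(1, 4)]           # your priors old(D_i)
old_H_and_D = [F(3, 16), F(1, 32), F(5, 32)]  # your joint priors

pi = combine(pi_expert, old_D)
print(pi)                            # [8/7, 4/7, 12/7]
print(new_prob(pi, old_H_and_D))     # 1/2, as against old(H) = 3/8
```

The numbers are internally consistent: the new diagnostic probabilities pi(Di) old(Di) sum to 1, and new(H)/old(H) = 4/3 as in the text.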
(4) β(Di : D1) = (new(Di)/new(D1)) / (old(Di)/old(D1)) = π(Di)/π(D1)   (Di : D1 Bayes factor)

This is your Bayes factor for Di against D1: the number by which your old odds on Di against D1 can be multiplied in order to get your new odds; it is what remains of your new odds when the old odds have been factored out. Choice of D1 as the anchor diagnosis with which all the Di are compared is arbitrary: since odds are arbitrary up to a constant multiplier, any fixed diagnosis Dk would do as well as D1, for the Bayes factors β(Di : Dk) are all of form β(Di : D1) × c, where the constant c is β(D1 : Dk).

Bayes factors have wide credibility as probabilistic observation reports, with prior probabilities "factored out"; in a medical context, this goes back at least as far as Schwartz, Wolfe, and Pauker.7 But probability factors carry some information about priors, as when a probability factor of 2 tells us that the old probability must have been ≤ 1/2, since the new probability must be ≤ 1. This residual information about priors is neutralized in the context of formulas (1)–(3) above, which remain valid with anchored Bayes factors in place of probability factors. This matters (a little) because probability factors are a little easier to compute than Bayes factors.

3.4 Updating Twice: Commutativity

Here we consider the outcome of successive updating on the reports of two different experts—say, a histopathologist (′) and a radiologist (″)—assuming invariance of your conditional probabilities relative to both experts' partitions. Question: if you update twice, should order be irrelevant? Should updating on the pathologist's report and then on the radiologist's come to the same thing as updating in the reverse order?

The answer depends on particulars of (1) the partitions on which the two updates are defined, (2) the mode of updating (by probabilities? Bayes factors?), and (3) your starting point.

7 Schwartz, Wolfe, and Pauker, "Pathology and Probabilities: a new approach to interpreting and reporting biopsies", New England Journal of Medicine 305 (1981) 917–923.
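When updating goes by factors on two partitions, as in example 7 below, order provably cannot matter; a toy check, with a 2 × 2 joint prior and factor values of my own invention:

```python
from fractions import Fraction as F
from itertools import product

# Joint old over a 2x2 grid of partition cells D_i ^ D''_j (toy numbers).
old = {(i, j): F(1, 4) for i, j in product(range(2), range(2))}
f1 = [F(2), F(1, 2)]   # pathologist's factors on {D_i}
f2 = [F(3), F(1)]      # radiologist's factors on {D''_j}

def update(dist, factors, axis):
    # One factor update on one margin, then renormalize.
    raw = {cell: p * factors[cell[axis]] for cell, p in dist.items()}
    total = sum(raw.values())
    return {cell: p / total for cell, p in raw.items()}

a = update(update(old, f1, 0), f2, 1)   # pathologist first
b = update(update(old, f2, 1), f1, 0)   # radiologist first
print(a == b)  # True: the two orders agree
```

Either order multiplies each cell's old probability by the same product of factors before normalizing, which is why the two results coincide.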
3.4.1 Updating on alien probabilities for diagnoses

• Can order matter? Certainly: if you accept, in turn, two new probability assignments to one and the same partition, the second assignment simply replaces the first.

• When is order immaterial? When there are two partitions, and updating on the second leaves probabilities of all elements of the first unchanged—as happens, e.g., when the two partitions are independent relative to your old.8

3.4.2 Updating on alien factors for diagnoses9

Notation: throughout each formula, f = β or f = π, as you wish.

Example 6, One partition, with partition {Di} and factors fi (i = 1, . . . , m). In updating by Bayes or probability factors for diagnoses as in sec. 3.3, order cannot matter, since in both updates your new(H) depends only on your invariant conditional probabilities old(H | Di) and the new(Di)'s that you adopt from that update's expert. Adopting as your own both a pathologist's factors f′i and a radiologist's factors f″i on the same partition—in either order—you come to the same result: your overall factors will be products f′i f″i of the pathologist's and radiologist's factors. Your final probabilities for the diagnoses and for the prognosis H will be10

new(Di) = old(Di) fi,   new(H) = Σi old(H ∧ Di) fi,

where fi is your normalized factor, constructed from the alien f′ and f″:

fi = f′i f″i / Σi old(Di) f′i f″i.

Example 7, Two partitions: a pathologist's, with partition {Di} and factors f′i (i = 1, . . . , m), and a radiologist's, with partition {D″j} and factors f″j (j = 1, . . . , n). These must commute, for in either order they are equivalent to a single mapping, with partition {Di ∧ D″j : old(Di ∧ D″j) > 0} and factors f′i f″j. Now, in terms of your normalization

fi,j = f′i f″j / Σi,j old(Di ∧ D″j) f′i f″j,

your updating equations on the partition elements Di ∧ D″j can be written

new(Di ∧ D″j) = fi,j old(Di ∧ D″j),

and your new probability for the prognosis will be

new(H) = Σi,j fi,j old(H ∧ Di ∧ D″j).

8 For more about this, see Diaconis and Zabell (1982), esp. pp. 825–6; also pp. 52–64 of Richard Jeffrey, Petrus Hispanus Lectures 2000: After Logical Empiricism, edited, translated and introduced by António Zilhão, Sociedade Portuguesa de Filosofia, Edições Colibri, Lisbon, 2002. (Also in Portuguese: Depois do empirismo lógico.)
9 For more about this, see Carl Wagner, 'Probability Kinematics and Commutativity', Philosophy of Science 69 (2002) 266–278.
10 Proofs are straightforward.

3.5 Softcore Empiricism

Here is a sample of mid-20th-century hardcore empiricism:

Subtract, in what we say that we see, or hear, or otherwise learn from direct experience, all that conceivably could be mistaken; the remainder is the given content of the experience inducing this belief. If there were no such hard kernel in experience—e.g., what we see when we think we see a deer but there is no deer—then the word 'experience' would have nothing to refer to.11

Hardcore empiricism puts some such "hard kernel" to work as the "purely experiential component" of your observations, about which you cannot be mistaken. Lewis himself thinks of this hard kernel as a proposition, since it has a truth value (i.e., true) and has a subjective probability for you (i.e., 1). Early and late, he argues that the relationship between your irrefutable kernel and your other empirical judgments is the relationship between fully believed premises and uncertain conclusions. Thus, in 1929 (Mind and the World Order, pp. 328–9) he holds that the immediate premises are, very likely, themselves only probable, and perhaps in turn based upon premises only probable, which have various probabilities conditionally upon those premises.

11 C. I. Lewis, An Analysis of Knowledge and Valuation, Open Court: LaSalle, Illinois, 1947, pp. 182–3. Lewis's emphasis.
For Lewis, this backward-leading chain of merely probable premises comes to rest finally in certainty. And in 1947 (An Analysis of Knowledge and Valuation, p. 186):

If anything is to be probable, then something must be certain. The data which themselves support a genuine probability, must themselves be certainties. We do have such absolute certainties, in the sense data initiating belief and in those passages of experience which later may confirm it. Such ultimate premises . . . must be actual given data for the individual who makes the judgment. Unless [this condition is met], no probability-judgment can be valid at all.

In effect, Lewis subscribes to Carnap's view12 of inductive probability as prescribing, as your current subjective probability for a hypothesis H, your old(H | D1 ∧ . . . ∧ Dt), where the Di are your fully believed data sentences from the beginning (D1) to date (Dt) and old is a probability assignment that would be appropriate for you to have prior to all experience. On this view, the information encoded in your probabilities at any time is simply the conjunction of all your hardcore data sentences up to that time. Lewis's arguments for this view seem to be based on the ideas (a) that conditioning on certainties is the only rational way to form your degrees of belief, and (b) that, if you are rational, you do have such certainties to condition on—ideas which we have been undermining in this chapter. But perhaps there are softcore, probabilistic data sentences on which simple conditioning is possible, to the same effect as probability kinematics:

Example 8, Is the shirt blue or green? There are conditions (blue/green colorblindness, poor lighting) under which what you see can lead you to adopt probabilities—say, 2/3 and 1/3—for blue and green, where there is no hardcore experiential proposition E you can cite for which your old(Blue | E) = 2/3 and old(Green | E) = 1/3: what we think is not that the shirt is blue, but that new(blue) = 2/3 and new(green) = 1/3. Formally, with E as the proposition new(blue) = 2/3 ∧ new(green) = 1/3, it is possible to expand the domain of the function old so as to allow conditioning on E in a way that yields the same result you would get via probability kinematics. In effect, this E behaves like the elusive hardcore E at the beginning of this example—though the required softcore experiential proposition E would be less accessible than Lewis's 'what we see when we think we see a deer but there is no deer', as are the presumed experiential propositions with which we might hope to cash those heavily context-dependent epistemological checks that begin with 'looks'.

This is how it works.13 For an n-fold partition D1, . . . , Dn, let E be the proposition new(D1) = d1 ∧ . . . ∧ new(Dn) = dn. If

(1) old(Di | new(Di) = di) = di,14 and
(2) old(H | Di ∧ E) = old(H | Di),

then

(3) old(H | E) = d1 old(H | D1) + . . . + dn old(H | Dn).

In the presence of the certainty condition new(E) = 1, the invariance condition new(H | E) = old(H | E) reduces to new(H) = old(H | E), and we have probability kinematics:

(4) new(H) = d1 old(H | D1) + . . . + dn old(H | Dn).

Perhaps E could be regarded as an experiential proposition in some soft sense: it may be thought to satisfy the certainty and invariance conditions, new(E) = 1 and new(H | E) = old(H | E), and it does stand outside the normal run of propositions to which we assign probabilities. The price of that trip would be a tricky expansion of the probability assignment, from the ("first-order") objective propositions such as diagnoses and prognoses that figure in the kinematical update formula to subjective propositions of the second order (E and new(Di) = di) and third (old(E) = 1) and beyond.15

12 See, e.g., Rudolf Carnap, The Logical Foundations of Probability, University of Chicago Press (1950, 1964).
13 Brian Skyrms, Causal Necessity (Yale, 1980), Appendix 2.
14 Such boxes are equivalent to fore-and-aft parentheses, and easier to parse.
Chapter 4

Expectation Primer

Your expectation, exX, of an unknown number, X, is usually defined as a weighted average of the numbers you think X might be, in which the weights are your probabilities for X's being those numbers: ex is usually defined in terms of pr. But in many ways the opposite order is preferable, with pr defined in terms of ex, as in this chapter1—and as in the earliest textbook of probability theory: Christiaan Huygens, De Ratiociniis in Ludo Aleae ("On Calculating in Games of Luck"), 1657.

4.1 Probability and Expectation

If you see only finitely many possible answers, x1, . . . , xn, to the question of what number X is, your expectation can be defined as follows:

(1) exX = Σ(i=1 to n) pi xi, where pi = pr(X = xi).

These numbers X are called random variables. Since an R.V. is always some particular number, random variables are really constants—known or unknown.

Example. X = The year Turing died. The phrase 'the year Turing died' is a constant which, as a matter of empirical fact, denotes the number 1954, quite apart from whether you or I know it. If I am sure it was one of the years of the 1950s, with equal probabilities, then my ex(X) will be (1950 + 1951 + . . . + 1959)/10, which works out to be 1954.5. And if you are sure Turing died in 1954, then for you n = 1, p1 = 1 and exX = 1954.

1 Turing died in 1954. Notable 20th-century expectation-based treatments are Bruno de Finetti's Teoria delle Probabilità, Einaudi, 1970 (= Theory of Probability, Wiley, 1975), and Peter Whittle's Probability via Expectation, 4th ed., Springer, 2000.
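Definition (1) can be sketched directly in code. The uniform-over-the-1950s prior below follows the Turing example as reconstructed above; treat the exact year range as an assumption of mine:

```python
from fractions import Fraction as F

def ex(values, probs):
    # (1): exX = sum_i p_i x_i
    return sum(p * x for x, p in zip(values, probs))

years = list(range(1950, 1960))   # the years of the 1950s
uniform = [F(1, 10)] * 10
print(float(ex(years, uniform)))  # 1954.5

# Certainty that Turing died in 1954: n = 1, p1 = 1.
print(ex([1954], [F(1)]))         # 1954
```

Note that the first answer, 1954.5, is not itself a possible value of X, which is the point made in the next paragraph.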
Observe that expectations of R.V.'s need not be values those R.V.'s can have. In the Turing example my expectation of the year of Turing's death was 1954.5, which is not the number of a year. Nor need your expectation of the indicator of past life on Mars be one of the values, 0 or 1, that indicators can assume; it may well be 1/10.

Now probabilities are expectations of indicators. The indicator of a hypothesis H is a constant, IH, which is 1 if H is true and 0 if H is false. We could have taken expectation as the fundamental concept, and defined probabilities as expectations of particularly simple R.V.'s, called 'indicators':

(2) pr(H) = ex(IH)

Of course, definition (1) works only in cases where you are sure that X belongs to a particular finite list of numbers. Definition (1) can be generalized, e.g. via the Riemann–Stieltjes integral: if you think X might be any real number from a to b, then (1) becomes exX = ∫(a to b) x dF(x), where F(x) = pr(X < x). This is equivalent to definition (1) in the finite case, but it requires relatively advanced mathematical concepts which definition (2) avoids—or, anyway, sweeps under the carpet. Definition (2) is quite general.

4.2 Conditional Expectation

Just as we defined your conditional probabilities as your buying-or-selling prices for tickets that represent conditional bets, as in the story in sec. 1.1, so we define your conditional expectation of a random variable X given truth of a statement H as your buying-or-selling price for the following ticket:

[ Worth $X if H is true; worth $p if H is false. Price: $p ]

We write 'ex(X | H)' for your conditional expectation of X given H.2
The following rule might be viewed as defining conditional expectations as quotients of unconditional ones when pr(H) ≠ 0, just as the quotient rule for probabilities might be viewed as defining pr(G | H) as pr(G ∧ H)/pr(H):

Quotient Rule: ex(X | H) = ex(X · IH)/ex(IH) = ex(X · IH)/pr(H) if pr(H) ≠ 0.

Of course this can be written as follows, without the condition pr(H) ≠ 0:

Product Rule: ex(X · IH) = ex(X | H) · pr(H)

A "Dutch book" consistency argument can be given for this product rule rather like the one given in sec. 1.4 for the probability product rule. Consider the following five tickets: (a) is worth $X if H is true and $ex(X | H) if H is false; (b) is worth $X if H is true and $0 if H is false; (c) is worth $0 if H is true and $ex(X | H) if H is false; (d) is worth $X · IH; and (e) is worth $ex(X | H) if H is false and $0 otherwise. Clearly, ticket (a) has the same dollar value as (b) and (c) together on each hypothesis as to whether H is true or false. And (d) and (e) have the same values as (b) and (c): (d) = (b), since X · IH = X if H is true and 0 if H is false; and (e) = (c). Then you are inconsistently placing different values on the same prospect, depending on whether it is described in one or the other of two provably equivalent ways, unless your price for (a) is the sum of your prices for (d) and (e)—that is, unless the condition ex(X | H) = ex(X · IH) + ex(X | H) pr(¬H) is met. Now, setting pr(¬H) = 1 − pr(H) and simplifying, the condition boils down to the product rule.

Historical Note. In the 18th century Thomas Bayes defined probability in terms of expectation as follows: "The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."3 This is simply the product rule, solved for prH:

pr(H) = ex(X · IH) / ex(X | H).

Explanation: ex(X · IH) is "an expectation [$X] depending on the happening of the event", and ex(X | H) is "the value of the thing expected upon its [H's] happening".
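The quotient and product rules can be checked on a toy example; here X is the number of heads in two fair tosses and H the hypothesis that the first toss is heads (the modelling is mine, not the book's):

```python
from fractions import Fraction as F

# Two fair coin tosses; 1 = heads.
outcomes = [(a, b) for a in (0, 1) for b in (0, 1)]
pr_each = F(1, 4)

def ex(f):
    return sum(pr_each * f(o) for o in outcomes)

X = lambda o: o[0] + o[1]      # number of heads
I_H = lambda o: o[0]           # indicator of "first toss is heads"

pr_H = ex(I_H)                                       # pr(H) = ex(I_H) = 1/2
ex_X_given_H = ex(lambda o: X(o) * I_H(o)) / pr_H    # quotient rule

# Product rule: ex(X * I_H) = ex(X | H) * pr(H)
print(ex(lambda o: X(o) * I_H(o)) == ex_X_given_H * pr_H)  # True
print(ex_X_given_H)  # 3/2
```

Given the first toss is heads, the expected number of heads is 3/2, as one would guess directly.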
4.3 Laws of Expectation

The axioms for expectation can be taken to be the product rule and

(3) Linearity: ex(aX + bY + c) = a ex(X) + b ex(Y) + c,

where the small letters stand for numerals (= random variables whose values everybody who understands the language is supposed to know). Three notable special cases of linearity are obtained if we replace 'a, b, c' by '1, 1, 0' (additivity), by 'a, 0, 0' (proportionality), or by '0, 0, c' (constancy):

(4) Additivity: ex(X + Y) = ex(X) + ex(Y)
    Proportionality: ex(aX) = a ex(X)
    Constancy: ex(c) = c
If n hypotheses Hi are well-defined, they have actual truth values (truth or falsehood) even if we do not know what they are. Therefore we can define the number of truths among them as IH1 + . . . + IHn. If you have definite expectations ex(IHi) for them, you will have a definite expectation for that sum, which, by additivity, will be ex(IH1) + . . . + ex(IHn).

Example 1, Calibration. Suppose you attribute the same probability, p, to success on each trial of a certain experiment: pr(H1) = . . . = pr(Hn) = p. Consider the indicators of the hypotheses, Hi, that the different trials succeed. The number of successes in the n trials will be the unknown sum of the indicators, so the unknown success rate will be Sn = (1/n) Σ(i=1 to n) IHi. Now by additivity and constancy, your expectation of the success rate must be p.4

3 'Essay toward solving a problem in the doctrine of chances,' Philosophical Transactions of the Royal Society 50 (1763), p. 376; reprinted in Facsimiles of Two Papers by Bayes, New York: Hafner, 1963.
4 ex((1/n) Σi IHi) = (1/n) Σi ex(IHi) = (1/n) Σi pr(Hi) = (1/n)(np) = p.
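Footnote 4's little calculation can be sketched as follows (n = 10 and p = 4/5 are my own illustrative choices):

```python
from fractions import Fraction as F

# n trials, each with success probability p; expectation of the success rate.
n, p = 10, F(4, 5)
ex_indicators = [p] * n                    # ex(I_Hi) = pr(Hi) = p
ex_success_rate = sum(ex_indicators) / n   # additivity + proportionality
print(ex_success_rate == p)  # True
```

However the trials turn out, your prior expectation of the success rate is exactly p, which is what makes "calibration" a meaningful standard for forecasters.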
The term 'calibration' comes from the jargon of weather forecasting; forecasters are said to be well calibrated — say, last year, for rain, at the level p = .8 — if last year it rained on about 80% of the days for which the forecaster turns out to have said there would be an 80% chance of rain.5

Why Expectations are Additive. Suppose x and y are your expectations of R.V.'s X and Y — say, rainfall in inches during the first and second halves of next year — and z is your expectation for next year's total rainfall, X + Y. Why should z be x + y? The answer is that in every eventuality about rainfall in the two periods, the value of the first two of these tickets together is the same as the value of the third:

Values: $X + $Y = $(X + Y)
Prices: $x + $y = $(x + y)

Then unless the prices you would pay for the first two add up to the price you would pay for the third, you are inconsistently placing different values on the same prospect, depending on whether it is described to you in one or the other of two provably equivalent ways.

Typically, a random variable might have any one of a number of values as far as you know. Convexity is a consequence of linearity according to which your expectation for the R.V. cannot be larger than all of those values, or smaller than all of them:

Convexity: If you are sure that X is one of a finite collection of numbers, exX lies in the range from the smallest to the largest of them.

Another connection between conditional and unconditional expectations:

Law of Total Expectation. If no two of H1, . . . , Hn are compatible and collectively they exhaust all the possibilities, then

ex(X) = Σ(i=1 to n) ex(X | Hi) pr(Hi).
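The law just displayed can be checked numerically; a toy sketch with a fair die (my own example):

```python
from fractions import Fraction as F

# Roll of a fair die: X = face value; partition = {even, odd}.
faces = range(1, 7)
p = F(1, 6)

ex_X = sum(p * x for x in faces)                              # 7/2
pr_even = sum(p for x in faces if x % 2 == 0)                 # 1/2
ex_even = sum(p * x for x in faces if x % 2 == 0) / pr_even   # 4
pr_odd = 1 - pr_even
ex_odd = sum(p * x for x in faces if x % 2 == 1) / pr_odd     # 3

# ex(X) = ex(X | even) pr(even) + ex(X | odd) pr(odd)
print(ex_even * pr_even + ex_odd * pr_odd == ex_X)  # True
```

The conditional expectations 4 and 3, weighted by the probabilities of the two cells, recover the unconditional expectation 7/2.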
Proof: X = Σ(i=1 to n) X · IHi is an identity between R.V.'s. Apply ex to both sides, then use additivity and the product rule.

When conditions are certainties for you, conditional expectations reduce to unconditional ones:
5 For more about calibration, etc., see Morris DeGroot and Stephen Fienberg, "Assessing Probability Assessors: Calibration and Refinement," in Shanti S. Gupta and James O. Berger (eds.), Statistical Decision Theory and Related Topics III, Vol. 1, New York: Academic Press, 1982, pp. 291–314.

Certainty: ex(X | H) = ex(X) if pr(H) = 1.
Note that in a context of form ex(· · · Y · · · | Y = X) it is always permissible to rewrite Y as X at the left. Example: ex(Y² | Y = 2X) = ex(4X² | Y = 2X).

Applying conditions: ex(· · · Y · · · | Y = X) = ex(· · · X · · · | Y = X) (OK!)

But we cannot generally discharge a condition Y = X by rewriting Y as X at the left and dropping the condition. Example: ex(Y² | Y = 2X) cannot be relied upon to equal ex(4X²).

The Discharge Fallacy: ex(· · · Y · · · | Y = X) = ex(· · · X · · ·) (NOT!)
Example 2, The problem of the two sealed envelopes. One contains a check for an unknown whole number of dollars, the other a check for twice or half as much. Offered a free choice, you pick one at random. What is wrong with the following argument for thinking you should have chosen the other? "Let X and Y be the values of the checks in the one and the other. As you think Y equally likely to be .5X or 2X, ex(Y) will be .5 ex(.5X) + .5 ex(2X) = 1.25 ex(X), which is larger than ex(X)."
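A quick simulation shows what the fallacious argument misses: with any prior for the smaller check (here a hypothetical uniform one), your expectations for the two envelopes are equal, so there is no reason to switch:

```python
import random

random.seed(0)
trials = 100_000
avg_x = avg_y = 0.0
for _ in range(trials):
    small = random.randint(1, 100)   # hypothetical prior on the smaller check
    pair = (small, 2 * small)
    x = random.choice(pair)          # the envelope you picked
    y = pair[1] if x == pair[0] else pair[0]
    avg_x += x / trials
    avg_y += y / trials

# By symmetry ex(X) = ex(Y); the simulated averages agree closely.
print(abs(avg_x - avg_y) < 1.5)  # True
```

The bogus step is the discharge fallacy: ".5 ex(.5X) + .5 ex(2X)" discharges the conditions Y = .5X and Y = 2X as if they carried no information about X's size.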
4.4 Median and Mean
Hydraulic Analogy. Let 'F' and 'S' mean heads on the first and second tosses of an ordinary coin. Suppose you stand to gain a dollar for each head. Then your net gain X in the four possibilities for truth and falsity of F and S will be as shown here:

Gains:        S    ¬S
       F      2     1
       ¬F     1     0

Conditional expectations: ex(X | F) = 1.5, ex(X | ¬F) = 0.5. Unconditional expectation: ex(X) = 1.

Think of the diagram as a map of flooded walled fields in a plain, with the numbers indicating water depths in the four sections—e.g., the depth is X = 2 throughout the F ∧ S region. In the four regions, depths are values of X and areas are probabilities. To find your conditional expectation for X given F, remove the wall between the two sections of F so that the water reaches a single level in the two; that level will be ex(X | F), which is 1.5 in the diagram. Similarly, removing the wall between the two sections of ¬F, your conditional expectation for X given ¬F is seen to be ex(X | ¬F) = 0.5. To find your unconditional expectation of gain, remove both walls so that the water reaches the same level throughout: ex(X) = 1. And, e.g., ex(X | F ∧ S) = 2.

But there is no mathematical reason for magnitudes X to have only a finite number of values; we might think of X as the birth weight in pounds of the next giant panda to be born in captivity—to no end of decimal places of accuracy.6 Nor is there any reason why X cannot have negative values, as in the case of temperature. The following analogy is more easily adapted to the continuous case.

Balance. On a weightless rigid beam, positions represent values of a magnitude X that might go negative as well as positive. Pick a zero, a unit, and a positive direction on the beam. Get a pound of modelling clay, and distribute it along the beam so that the weight of clay on each section represents your probability that the true value of X is in that section. Stephen Jay Gould's essay "The Median is not the Message" provides an example:8

"In 1982, I learned I was suffering from a rare and serious cancer. After surgery, I asked my doctor what the best technical literature on the cancer was. She told me . . . that there was nothing really worth reading. I soon realized why she had offered that humane advice: my cancer is incurable, with a median mortality of eight months after discovery."

As if that meant something.7 Here, locations on the beam are months lived after diagnosis, and the weight of clay on the interval from 0 to m is the probability of still being alive in m months:

[Diagram: a beam labelled "Months lived after diagnosis", with half the clay (probability 1/2) on the interval from 0 to 8 and half beyond, balancing at about 20.]

In terms of the beam analogy, here are the key definitions of the terms "median" and "mean"—the latter being a synonym for "expectation":

The median is the point on the beam that divides the weight of clay in half: the probabilities are equal that the true value of X is represented by a point to the right and to the left of the median. The mean (= your expectation) is the point of support at which the beam would just balance.

As this probability distribution is skewed (stretched out) to the right, the median is to the left of its mean; thus, Gould's life expectancy is greater than 8 months, even if nobody ultimately survives. (The mean of 20 months suggested in the graph is my ignorant invention.)

6 The commonplace distinction between panda and ambient moisture, dirt, etc., is not drawn finely enough to let us take the remote decimal places seriously.
7 It does not.
8 See Stephen Jay Gould, Full House (New York, 1996).
The median is the point on the beam that divides the weight of clay in half: the probabilities are equal that the true value of X is represented by a point to the right and to the left of the median. The mean (= your expectation) is the point of support at which the beam would just balance.

As this probability distribution is skewed (stretched out) to the right, the median is to the left of its mean; thus Gould's life expectancy is greater than 8 months. (The mean of 20 months suggested in the graph is my ignorant invention.) Gould continues:

"The distribution of variation had to be right skewed, I reasoned. After all, the left of the distribution contains an irrevocable lower boundary of zero (since mesothelioma can only be identified at death or before). Thus there isn't much room for the distribution's lower (or left) half—it must be scrunched up between zero and eight months. But the upper (or right) half can extend out for years and years, even if nobody ultimately survives."

The effect of skewness can be seen especially clearly in the case of discrete distributions like the following:

[Diagram: equal blocks of clay along the beam, with the median between the second and third blocks and the mean to its right.]

Observe that if the righthand weight is pushed further right, the mean will follow, while the median remains between the second and third blocks.

4.5 Variance

The variance of X, defined as⁹

varX = ex(X − exX)²,

is one of the two commonly used measures of your uncertainty about the value of X. The other is the "standard deviation": σ(X) = √varX.

⁹ Here and below the paradigm for 'ex²X' is 'sin²x' in trigonometry: ex²X = (exX)².
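The definition is easy to try out numerically. The following sketch is illustrative only (the three-point distribution is my own invention, not Jeffrey's); it computes varX from the definition, using exact fractions, and checks it against the algebraically equivalent form ex(X²) − ex²(X):

```python
from fractions import Fraction as F

def ex(f, dist):
    """Expectation of f(x) under a finite distribution {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

def var(dist):
    """Variance by the definition: the expected squared deviation from the mean."""
    m = ex(lambda x: x, dist)
    return ex(lambda x: (x - m) ** 2, dist)

# An arbitrary three-point distribution (illustrative values only).
dist = {0: F(1, 2), 1: F(1, 3), 4: F(1, 6)}

# varX = ex(X^2) - ex^2(X):
assert var(dist) == ex(lambda x: x * x, dist) - ex(lambda x: x, dist) ** 2

# For an indicator I_H with pr(H) = 1/3: var(I_H) = pr(H)pr(not-H), at most 1/4.
ind = {1: F(1, 3), 0: F(2, 3)}
assert var(ind) == F(1, 3) * F(2, 3)
assert var(ind) <= F(1, 4)
```

Exact rational arithmetic avoids floating-point noise in checks of this kind.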
In terms of the physical analogies in sec. 4.4, these measures of uncertainty correspond to the average spread of mass away from the mean. The obvious measure of this spread is your expectation ex|X − exX| of the absolute value of the difference; but the square, which is like the absolute value in being nonnegative, is easier to work with, mathematically, and the move to the standard deviation counteracts the distortion in going from the absolute value to the square.

The definition of the variance can be simplified by writing the righthand side as the expectation of the square minus the square of the expectation:¹⁰

(1) varX = ex(X²) − ex²(X)

¹⁰ Proof: (X − exX)² = X² − 2X exX + ex²X, where exX and its square are constants. By linearity of ex, your expectation of this sum = ex(X²) − 2ex²X + ex²X = ex(X²) − ex²X.

Note that in case X is an indicator, X = I_H, we have varI_H = pr(H)pr(¬H) ≤ 1/4. The equality var(I_H) = pr(H)pr(¬H) follows from (1) because (I_H)² = I_H (since I_H can only assume the values 0 and 1) and ex(I_H) = pr(H). The inequality pr(H)pr(¬H) ≤ 1/4 follows from the fact that p(1 − p) reaches its maximum value of 0.25 when p = 0.5, as shown in the graph:

[Graph: p(1 − p) plotted against p, with its maximum of .25 at p = .5.]

In contrast to the linearity of expectations, variances are nonlinear: in place of ex(aX + b) = a exX + b we have

(2) var(aX + b) = a²varX.

And in contrast to the unrestricted additivity of expectations, variances are additive only under special conditions.
One such condition is pairwise uncorrelation: you see X1, . . . , Xn as pairwise uncorrelated if and only if your ex(XiXj) = ex(Xi)ex(Xj) whenever i ≠ j. The covariance of two R.V.'s is defined as

cov(X, Y) = ex[(X − exX)(Y − exY)] = ex(XY) − ex(X)ex(Y),

so that the covariance of an R.V. with itself is the variance: cov(X, X) = var(X). The coefficient of correlation between X and Y is defined as

ρ(X, Y) = cov(X, Y)/σ(X)σ(Y),

where σ is the positive square root of the variance; noncorrelation is the case ρ = 0. Noncorrelation is a consequence of independence, but it is a weaker property, which does not imply independence; see supplement 4.7.5 at the end of this chapter.

Now to be a bit more explicit, this is what we want to prove: given pairwise uncorrelation of X1, . . . , Xn,

(3) var(X1 + · · · + Xn) = varX1 + · · · + varXn.

Proof. Recall that by (1) and additivity of ex, var(Σi Xi) = ex(Σi Xi)² − (ex Σi Xi)². Multiply out the squared sums and use additivity of ex and pairwise noncorrelation to get Σi exXi² + Σ_{i≠j} exXi exXj − Σi ex²Xi − Σ_{i≠j} exXi exXj. Cancel identical terms with opposite signs and regroup to get Σi (exXi² − ex²Xi); by (1), this = Σi varXi.

4.6 A Law of Large Numbers

Laws of large numbers tell you how to expect the "sample average"

Sn = (1/n) Σ_{i=1}^{n} Xi

of the actual values of a number of random variables to behave as that number increases without bound.
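Formula (3) can be spot-checked by exact enumeration. In the sketch below (my illustration, not part of the text) the Xi are three independent fair-coin indicators, so they are in particular pairwise uncorrelated, and the variance of their sum comes out equal to the sum of the three variances:

```python
from fractions import Fraction as F
from itertools import product

# Three independent fair-coin indicator variables, enumerated exactly.
outcomes = list(product([0, 1], repeat=3))
prob = {w: F(1, 8) for w in outcomes}

def ex(f):
    return sum(f(w) * p for w, p in prob.items())

def var(f):
    m = ex(f)
    return ex(lambda w: (f(w) - m) ** 2)

# Pairwise noncorrelation: ex(Xi Xj) = ex(Xi) ex(Xj) whenever i != j.
for i in range(3):
    for j in range(3):
        if i != j:
            assert ex(lambda w: w[i] * w[j]) == ex(lambda w: w[i]) * ex(lambda w: w[j])

# (3): the variance of the sum equals the sum of the variances (both 3/4 here).
total = lambda w: sum(w)
assert var(total) == sum(var(lambda w, i=i: w[i]) for i in range(3))
```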
The basic question is how your expectations of sample averages of large finite numbers of random variables must behave. Here we show that, under certain rather broad conditions, your expectation of the squared difference (Sn − Sn′)² between a sample average and its prolongation will go to 0 as both their lengths, n and n′, increase without bound:

If the random variables Xi all have the same, finite, expectations ex(Xi) = m, variances var(Xi) = σ², and pairwise correlation coefficients ρ(Xi, Xj) = r where i ≠ j, then for all positive n and n′,

(4) ex(Sn − Sn′)² = (1/n − 1/n′)σ²(1 − r).¹¹

¹¹ See Bruno de Finetti, Theory of Probability, volume 2, pp. 215-6. For another treatment, see pp. 84-85 of de Finetti's 'Foresight' in Henry Kyburg and Howard Smokler (eds.), Studies in Subjective Probability, Huntington, N.Y.: Krieger, 1980.

Proof. (Sn − Sn′)² = ((1/n) Σ Xi)² − 2((1/n) Σ Xi)((1/n′) Σ Xi) + ((1/n′) Σ Xi)². Multiplying this out, we get terms involving (Xi)² and the rest involving XiXj with i ≠ j. Since m² + σ² = ex(Xi²) and m² + rσ² = ex(XiXj) with i ≠ j, we can eliminate all of the X's and get formula (4) via some hideous algebra.

4.7 Supplements

4.7.1 Markov's Inequality: pr(X ≥ ε) ≤ exX/ε if ε > 0 and X ≥ 0.

Proof. Let Y = ε when X ≥ ε, and Y = 0 when X < ε. By the law of total expectation, ex(Y) = ε · pr(X ≥ ε) + 0 · pr(X < ε) = ε · pr(X ≥ ε). Now since Y ≤ X both when X ≥ ε and when X < ε, we have ex(Y) ≤ ex(X), so that ε · pr(X ≥ ε) ≤ exX.

This provides a simple way of getting an upper bound on your probabilities that nonnegative random variables X for which you have finite expectations are at least ε away from 0.

4.7.2 Chebyshev's Inequality: pr(|X − ex(X)| ≥ ε) ≤ var(X)/ε² if ε > 0.

Proof. Since squares are never negative, Markov's inequality implies that pr(X² ≥ ε) ≤ ex(X²)/ε if ε > 0; and since |X| ≥ ε if and only if X² ≥ ε², we have pr(|X| ≥ ε) ≤ ex(X²)/ε² if ε > 0. From this, with 'X − ex(X)' for 'X', we have Chebyshev's inequality.
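Both inequalities are easy to test numerically. The distribution below is a hypothetical example of mine; the assertions check Markov's and Chebyshev's bounds for several values of ε:

```python
from fractions import Fraction as F

# A nonnegative random variable with an explicit (hypothetical) distribution.
dist = {0: F(1, 2), 1: F(1, 4), 4: F(1, 8), 10: F(1, 8)}

mean = sum(x * p for x, p in dist.items())                    # ex(X) = 2
variance = sum((x - mean) ** 2 * p for x, p in dist.items())  # var(X) = 43/4

def pr(event):
    """Probability that the event (a predicate on values) holds."""
    return sum(p for x, p in dist.items() if event(x))

for eps in (F(1), F(2), F(5), F(8)):
    # Markov: pr(X >= eps) <= ex(X)/eps, for nonnegative X.
    assert pr(lambda x: x >= eps) <= mean / eps
    # Chebyshev: pr(|X - ex(X)| >= eps) <= var(X)/eps^2.
    assert pr(lambda x: abs(x - mean) >= eps) <= variance / eps ** 2
```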
4.7.3 Many Pairwise Uncorrelated R.V.'s. For pairwise uncorrelated R.V.'s Xi which you see as having a common, finite upper bound, b, on their variances:

(5) For any ε > 0: pr(|Sn − ex(Sn)| ≥ ε) ≤ b/nε²

This places an upper limit of b/nε² on your probability that the sample average differs from your expectation of it by ε or more. The bound b on the variances is independent of the sample size, so that for large enough samples your probability that the error |Sn − ex(Sn)| is ε or more gets as small as you please.

Proof. By (3), var(Sn) = (1/n²) Σ_{i=1}^{n} var(Xi). Since each var(Xi) is ≤ b, this is ≤ (1/n²)(nb) = b/n. Setting X = Sn in Chebyshev's inequality, we have (5).

Note that if the Xi are indicators of propositions Hi, sample averages Sn are relative frequencies Rn of truth:

Rn = (1/n) Σ_{i=1}^{n} I_Hi = (the number of truths among H1, . . . , Hn)/n

4.7.4 Corollary: Bernoulli Trials. An updated version of the oldest law of large numbers (1713, James Bernoulli's) applies to Bernoulli trials. These are sequences of propositions Hi (say, "head on the i'th toss") where for some p (= 1/2 for cointossing) your probability for the truth of any n distinct H's is pⁿ. Suppose that, as in the case of Bernoulli trials, for all distinct i and j from 1 to n: pr(Hi) = pr(H1) and pr(Hi ∧ Hj) = pr²(H1). Then

(6) For any ε > 0: pr(|Rn − pr(H1)| ≥ ε) ≤ 1/4nε²

This places a limit of 1/4nε² on your pr for a deviation of ε or more between your pr(Hi) and the actual relative frequency of truths in n trials. Thus, in n = 25,000 tosses of a fair coin, your probability that the success rate will be within ε = .01 of 0.50 will fall short of 1 by at most 1/4nε² = 10%.

Proof. Since exXi = ex(I_Hi) = pr(Hi) = p, linearity of ex says that ex(Sn) = (1/n) · np = p. By formula (1) in sec. 4.5, var(I_Hi) = pr(Hi)pr(¬Hi) ≤ 1/4, so we can set b = 1/4 in (5), which yields (6).

4.7.5 Noncorrelation and Independence. Random variables X, Y are said to be uncorrelated from your point of view when your expectation of their product is simply the product of your separate expectations: ex(XY) = ex(X)ex(Y) (sec. 4.5). Independence does always imply noncorrelation; in cases where X and Y take only finitely many values:¹³

If pr(X = x ∧ Y = y) = pr(X = x) · pr(Y = y) for all values x of X and y of Y, then ex(XY) = ex(X)ex(Y).

¹³ Proof. ex(XY) = Σ_{i,j} xi yj pr(X = xi ∧ Y = yj), which, by independence, = Σ_{i,j} xi yj pr(X = xi) pr(Y = yj). And ex(X)ex(Y) = {Σi xi pr(X = xi)}{Σj yj pr(Y = yj)} = Σ_{i,j} xi yj pr(X = xi) pr(Y = yj) again.

But in general the uncorrelated pairs of random variables are not simply the independent ones: noncorrelation need not imply independence, as the following extreme example demonstrates.¹²

Example. Noncorrelation in spite of deterministic dependency. You have probability 1/4 for each of X = −2, −1, 1, 2, with Y = X², so that you have probability 1/4 for each conjunction X = x ∧ Y = x² with x = −2, −1, 1, 2, and probability 1/2 for each possible value of Y (= 1 or 4). Then you see X and Y as uncorrelated, for ex(X) = 0 and ex(XY) = (−8 − 1 + 1 + 8)/4 = 0 = ex(X)ex(Y). Yet pr(X = 1 ∧ Y = 1) = 1/4 ≠ pr(X = 1)pr(Y = 1) = (1/4)(1/2).

¹² From William Feller, An Introduction to Probability Theory and its Applications, vol. 1, 2nd edition (1957), p. 222, example (a).

In the special case where X and Y are the indicators of propositions, noncorrelation reduces to independence: ex(XY) = ex(X)ex(Y) iff pr(X = x ∧ Y = y) = pr(X = x)pr(Y = y) for all four pairs (x, y) of zeroes and ones.
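The Feller-style example in supplement 4.7.5 can be verified mechanically. This sketch enumerates the joint distribution of X and Y = X² and confirms noncorrelation without independence:

```python
from fractions import Fraction as F

# X uniform on {-2, -1, 1, 2}; Y = X^2 deterministically.
pts = [-2, -1, 1, 2]
pr = {(x, x * x): F(1, 4) for x in pts}

def ex(f):
    return sum(f(x, y) * p for (x, y), p in pr.items())

# Uncorrelated: ex(XY) = 0 = ex(X)ex(Y).
assert ex(lambda x, y: x * y) == ex(lambda x, y: x) * ex(lambda x, y: y)

# ...but not independent:
p_x1 = sum(p for (x, y), p in pr.items() if x == 1)                # 1/4
p_y1 = sum(p for (x, y), p in pr.items() if y == 1)                # 1/2
p_both = sum(p for (x, y), p in pr.items() if x == 1 and y == 1)   # 1/4
assert p_both != p_x1 * p_y1
```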
Chapter 5
Updating on Statistics

5.1 Where do Probabilities Come From?

Your "subjective" probability is not something fetched out of the sky on a whim;¹ it is your actual judgment, normally representing what you think your judgment should be, in view of your information to date and of your sense of other people's information, even if you do not regard it as a judgment that everyone must share on pain of being wrong in one sense or another.

¹ In common usage there is some such suggestion, and for that reason I would prefer to speak of 'judgmental' probability. But in the technical literature the term 'subjective' is well established—and this is in part because of the lifelong pleasure de Finetti found in being seen to give the finger to the establishment.

But of course you are not always clear about what your judgment is, or should be; and the most important questions in the theory of probability concern ways and means of constructing reasonably satisfactory probability assignments to fit your present state of mind. For this there is no overarching algorithm. (Think: trying on shoes.) Here we examine two answers to these questions that were floated by Bruno de Finetti in the decade from (roughly) 1928 to 1938. The first ("Minimalism") is more primitive: the input probability assignment will have large gaps, and the output will not arise via conditioning. The second, "Exchangeability",² postulates a definite sort of initial probabilistic state of mind, which is then updated by conditioning on statistical data.

² And, more generally, "partial" exchangeability (see 5.2 below), of which simple exchangeability, which got named earlier, is a special case.
5.1.1 Probabilities from Statistics: Minimalism

Statistical data are a prime determinant of subjective probabilities; that is the truth in frequentism.³ But that truth must be understood in the light of certain features of judgmental probabilizing. One such feature that can be of importance is persistence, as you learn the relative frequency of truths in a sequence of propositions, of your judgment that they all have the same probability. That is a condition under which the following little theorem of the probability calculus can be used to generate probability judgments.

³ Chapter 2 of de Finetti (1937) is devoted to them. In chapter 3 he goes on to a mathematically deeper way of understanding the truth in frequentism, in terms of "exchangeability" of random variables (sec. 5.1.2 below).

Law of Little Numbers. In a finite sequence of propositions that you view as equiprobable, if you are sure that the relative frequency of truths is p, then your probability for each is p.

Then if, judging a sequence of propositions to be equiprobable, you learn the relative frequency of truths in a way that does not change your judgment of equiprobability, your probability for each proposition will agree with the relative frequency.⁴

⁴ To appreciate the importance of the caveat, note that if you learn the relative frequency of truths by learning which propositions in the sequence are true, and which false, and as you form your probabilities for those propositions you remember what you have learned, then those probabilities will be zeros and ones instead of averages of those zeros and ones, for what you learned will have shown you that they are not equal.

The name "Law of Little Numbers" is a joke, but I know of no generally understood name for the theorem. That theorem, like the next (the "Law of Short Run Averages", another joke), is quite trivial: both are immediate consequences of the linearity of the expectation operator. The law of little numbers can be generalized to random variables:

Law of Short Run Averages. In a finite sequence of random variables for which your expectations are equal, if you know only their arithmetical mean, then that is your expectation of each.

Then if, while requiring your final expectations for a sequence of magnitudes to be equal, you learn their mean value in a way that does not lead you to change that requirement, your expectation of each will agree with that mean.⁵

⁵ If you learn the individual values and calculate the mean as their average without forgetting the various values, you have violated the caveat (unless it happens that all the values were the same). See Jeffrey (1992), pp. 59-64.

Proof. By linearity of ex, if ex(Xi) = p for i = 1, . . . , n, then p = (1/n) Σ_{i=1}^{n} ex(Xi); so if the n expectations have a common value and you learn that their mean is p, that common value must be p. (For the Law of Little Numbers, read the Xi as the indicators of the propositions, so that the mean is the relative frequency of truths and expectations are probabilities.)
Example. Guessing Weight. Needing to estimate the weight of someone on the other side of a chain link fence, you select ten people on your side whom you estimate to have the same weight as that eleventh, persuade them to congregate on a platform scale, and read their total weight. If the scale reads 1080 lb., your estimate of the eleventh person's weight will be 108 lb.—if nothing in that process has made you revise your judgment that the eleven weights are equal.

This is a frequentism in which judgmental probabilities are seen as judgmental expectations of frequencies, and in which the Law of Little Numbers guides the recycling of observed frequencies as probabilities of unobserved instances.⁶ It is to be distinguished both from the intelligible but untenable finite frequentism that simply identifies probabilities with actual frequencies (generally, unknown) when there are only finitely many instances overall, and from the unintelligible long-run frequentism that would see the observed instances as a finite fragment of an infinite sequence in which the infinitely long run inflates expectations into certainties that sweep judgmental probabilities under the endless carpet.

⁶ Note that turning statistics into probabilities or expectations in this way requires neither conditioning nor Bayes's theorem, nor does it require you to have formed particular judgmental probabilities for the propositions or particular estimates for the random variables prior to learning the relative frequency or mean.

5.1.2 Probabilities from Statistics: Exchangeability⁷

⁷ Bruno de Finetti, Theory of Probability, vol. 2, pp. 211-224: Wiley, 1975.

On the hypotheses of (a) equiprobability and (b) certainty that the relative frequency of truths is r, the Law of Little Numbers identified the probability as r. Stronger conclusions follow from the stronger hypothesis of exchangeability: You regard propositions H1, . . . , Hn as exchangeable when, for any particular t of them, your probability that they are all true and the other f = n − t false depends only on the numbers t and f.⁸ Propositions are essentially indicators, i.e., random variables taking only the values 0 (falsity) and 1 (truth). For random variables X1, . . . , Xn more generally, exchangeability means invariance of pr(X1 < r1 ∧ . . . ∧ Xn < rn) for each n-tuple of real numbers r1, . . . , rn and each permutation of the X's that appear in the inequalities.

⁸ This comes to the same thing as invariance of your probabilities for Boolean compounds of finite numbers of the Hi under all finite permutations of the positive integers, e.g., P(H1 ∧ (H2 ∨ ¬H3)) = P(H100 ∧ (H2 ∨ ¬H7)).
5.1.3 Exchangeability: Urn Examples

The following illustration in the case n = 3 will serve to introduce the basic ideas. Here, as in sec. 5.1.1, probabilities will be seen to come from statistics—this time, under the proviso that your prior probability assignment was exchangeable, by ordinary conditioning. Exchangeability turns out to be a strong but reasonably common condition under which the hypotheses of the law of large numbers in sec. 4.6 are met.

Example 1. Three Draws from an Urn of Unknown Composition. The urn contains balls that are red (r) or green (g), in an unknown proportion. If you knew the proportion, that would determine your probability for red. But you don't. Still, you may think that (say) red is the more likely color. Here is one precise probabilistic state of mind that you might be in, about it: your probabilities for drawing a total of 0, 1, 2, 3 red balls might be p0 = .1, p1 = .2, p2 = .3, p3 = .4. There are eight possibilities about the outcomes of all three trials:

rrr | rrg rgr grr | rgg grg ggr | ggg

The vertical lines divide clumps of possibilities in which the number of reds is 3, 2, 1, 0. And when there is more than one possible order in which a certain number of reds can show up, you will divide your probability among them equally. In other words: you see the propositions Hi (= red on the i'th trial) as exchangeable.

In Table 1, the possibilities are listed under 'h' (for 'history') and again under the H's, where the 0's and 1's are truth values (1 = true, 0 = false) as they would be in the eight possible histories. The value of the random variable T3 ("tally" of the first 3 trials) at the right is the number of r's in the actual history, i.e., the number of 1's under the H's in that row. The entries in the rightmost column are the values of pt taken from example 1.
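Example 1 can be checked by enumeration. The sketch below (mine; it just mechanizes the exchangeable assignment described above) spreads each pt equally over the histories with t reds and computes two probabilities, one by additivity and one by conditioning:

```python
from fractions import Fraction as F
from itertools import product
from math import comb

# pr(T3 = t) = p_t, as in Example 1.
p = {0: F(1, 10), 1: F(2, 10), 2: F(3, 10), 3: F(4, 10)}

# Exchangeability: each history with t reds gets probability p_t / C(3, t).
pr = {}
for h in product('rg', repeat=3):
    t = h.count('r')
    pr[h] = p[t] / comb(3, t)

assert sum(pr.values()) == 1

# By additivity: pr(same outcome on the first two trials) = 2/3.
same12 = sum(q for h, q in pr.items() if h[0] == h[1])
assert same12 == F(2, 3)

# By conditioning: pr(red on trial 3, given red on trials 1 and 2) = 4/5.
num = sum(q for h, q in pr.items() if h == ('r', 'r', 'r'))
den = sum(q for h, q in pr.items() if h[0] == 'r' and h[1] == 'r')
assert num / den == F(4, 5)
```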
We can think of propositions as identified by the sets of histories in which they would be true. Examples: H1 = {rrr, rrg, rgr, rgg} and H1 ∧ H2 ∧ H3 = {rrr}. And by additivity, your probability for "same outcome on the first two trials" is pr{rrr, rrg, ggr, ggg} = p3 + p2/3 + p1/3 + p0 = 2/3.

Table 1. Three Exchangeable Trials

h     H1  H2  H3   pr{h}    t   pr(T3 = t)
rrr    1   1   1    p3      3      .4
rrg    1   1   0    p2/3    2      .3
rgr    1   0   1    p2/3    2      .3
grr    0   1   1    p2/3    2      .3
rgg    1   0   0    p1/3    1      .2
grg    0   1   0    p1/3    1      .2
ggr    0   0   1    p1/3    1      .2
ggg    0   0   0    p0      0      .1

Evidently, exchangeability undoes the dreaded "combinatorial explosion":

Combinatorial IMplosion. Your probabilities for all 2^(2^n) truth-functional compounds of n exchangeable H's are determined by your probabilities for a particular n of them—namely, by all but one of your probabilities p0, p1, . . . , pn for the propositions saying that the number of truths among the H's is 0, 1, . . . , n.⁹

⁹ The probability of the missing one will be 1 minus the sum of the other n.

The following important corollary of the combinatorial implosion is illustrated and proved below.¹⁰ (The R.V. Tm is the tally of the first m trials.)

¹⁰ The best overall account is by Sandy Zabell, "The Rule of Succession", Erkenntnis 31 (1989) 283-321.

The Rule of Succession. If H1, . . . , Hn+1 are exchangeable for you and {h} is expressible as a conjunction of t of the first n H's with the denials of the other f = n − t, then

pr(Hn+1 | {h}) = [(t + 1)/(n + 1)] × [pr(Tn+1 = t + 1)/pr(Tn = t)] = (t + 1)/(n + 2 + ∗),

where the correction term, ∗, is (f + 1)(pr(Tn+1 = t)/pr(Tn+1 = t + 1) − 1).

Example 2. Uniform distribution: pr(Tn+1 = t) = 1/(n + 2), where t ranges from 0 to n + 1. In this ("Bayes-Laplace-Johnson-Carnap")¹¹ case the correction term ∗ in the rule of succession vanishes, and the rule reduces to:

¹¹ For the history underlying this terminology, see Zabell's paper, cited above.
pr(Hn+1 | {h}) = pr(Hn+1 | Tn = t) = (t + 1)/(n + 2).

Example 3. The Ninth Draw. If you adopt the uniform distribution, your probability for red next, given 6 red and 2 green on the first 8 trials, will be pr(H9 | T8 = 6) = pr(H9 | {h}), where h is any history in the set that represents the proposition T8 = 6. Then your pr(H9 | T8 = 6) = 7/10.

Proof of the rule of succession. Write Ht^n for the proposition that Tn = t, and let Ck be any one of the conjunctions of t of the first n H's with the denials of the other f = n − t, so that, by exchangeability, pr(Ck) = pr(Ht^n)/C(n, t), where C(n, t) is the number of such conjunctions. Then:

(a) pr(Hn+1 | Ck) = pr(Hn+1 ∧ Ck)/pr(Ck) = [pr(H(t+1)^(n+1))/C(n+1, t+1)] / [pr(Ht^n)/C(n, t)] = (t + 1) pr(H(t+1)^(n+1)) / (n + 1) pr(Ht^n), by the quotient rule and the fact that C(n, t)/C(n+1, t+1) = (t + 1)/(n + 1).

(b) pr(Ck) = pr(¬Hn+1 ∧ Ck) + pr(Hn+1 ∧ Ck) = pr(Ht^(n+1))/C(n+1, t) + pr(H(t+1)^(n+1))/C(n+1, t+1).

(c) C(n, t)/C(n+1, t) = (f + 1)/(n + 1) and C(n, t)/C(n+1, t+1) = (t + 1)/(n + 1).

(d) By (b) and (c), pr(Ht^n) = [(f + 1)/(n + 1)] pr(Ht^(n+1)) + [(t + 1)/(n + 1)] pr(H(t+1)^(n+1)).

(e) Substituting (d) in (a), pr(Hn+1 | Ck) = (t + 1) pr(H(t+1)^(n+1)) / [(f + 1) pr(Ht^(n+1)) + (t + 1) pr(H(t+1)^(n+1))].

(f) Dividing numerator and denominator by pr(H(t+1)^(n+1)), pr(Hn+1 | Ck) = (t + 1) / [(f + 1) pr(Ht^(n+1))/pr(H(t+1)^(n+1)) + (t + 1)].

(g) Since (f + 1) + (t + 1) = n + 2, the righthand side of (f) = (t + 1) / [n + 2 + (f + 1)(pr(Ht^(n+1))/pr(H(t+1)^(n+1)) − 1)], as was to be shown.

5.1.4 Supplements

1 Law of Large Numbers. Show that exchangeable random variables satisfy the three conditions in the law of large numbers of sec. 4.6.

2 A Pólya Urn Model. At stage 0, an urn contains two balls, one red and one green. At each later stage a ball is drawn with double replacement: it is replaced along with another ball of its color. Prove that the conditional odds on red's being drawn on the n+1'st trial, given that red was drawn on the first t trials and green on the next f = n − t, are (t + 1)/(f + 1)—exactly as in the urn examples of sec. 5.1.3 above.
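Supplement 2's Pólya urn can be explored by exact enumeration. Two standard facts worth checking (this sketch is mine, under the two-ball starting condition stated above): the number of reds in n draws is uniformly distributed, as in the Bayes-Laplace case, and the probability of red next after t reds and f greens is (t + 1)/(n + 2), giving odds (t + 1) : (f + 1):

```python
from fractions import Fraction as F

def polya_paths(n):
    """Yield (history, probability) for n draws from a Polya urn that
    starts with one red and one green ball, with double replacement."""
    def rec(hist, red, green, prob):
        if len(hist) == n:
            yield hist, prob
            return
        total = red + green
        yield from rec(hist + 'r', red + 1, green, prob * F(red, total))
        yield from rec(hist + 'g', red, green + 1, prob * F(green, total))
    yield from rec('', 1, 1, F(1))

n = 4
tally = {}
for hist, prob in polya_paths(n):
    tally[hist.count('r')] = tally.get(hist.count('r'), F(0)) + prob

# Uniform distribution over the tally: pr(Tn = t) = 1/(n + 1).
assert all(tally[t] == F(1, n + 1) for t in range(n + 1))

# Red next, given t = 2 reds then f = 2 greens: (t+1)/(n+2) = 1/2,
# i.e. odds (t+1) : (f+1) = 3 : 3.
prefix = 'rrgg'
num = sum(p for h, p in polya_paths(n + 1) if h.startswith(prefix) and h[4] == 'r')
den = sum(p for h, p in polya_paths(n + 1) if h.startswith(prefix))
assert num / den == F(3, 6)
```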
3 Exchangeability is preserved under mixing. Prove that if H1, . . . , Hn are exchangeable relative to each of pr1, . . . , prm, then they are exchangeable relative to w1pr1 + · · · + wmprm if the w's are nonnegative and sum to 1.

4 Independence is not preserved under mixing. Prove this, using a simple counterexample where m = 2 and w1 = w2.

5 Independence, Conditional Independence. Suppose you are sure that the balls in an urn are either 90% red or 10% red, and you regard those two hypotheses as equiprobable. If balls are drawn with replacement, then conditionally on each hypothesis the propositions, suitably labelled, that the first, second, etc., balls drawn will be red are independent and equiprobable. Is that also so unconditionally? (a) Are the H's unconditionally equiprobable? If so, what is their probability? If not, why not? (b) Are they unconditionally independent? Would your probability for red next be the same as it was initially after drawing one ball, which is red? (c) Are they exchangeable?

5(b) Diaconis and Freedman on de Finetti's Generalizations of Exchangeability¹²

¹² This is an adaptation of the chapter (#11) by Persi Diaconis and David Freedman in Studies in Inductive Logic and Probability, vol. 2 (Richard Jeffrey, ed., University of California Press, 1980), which is reproduced here with everybody's kind permission.

De Finetti wrote about partial exchangeability over a period of half a century.¹³ These treatments are rich sources of ideas, which take many readings to digest. Here we give examples of partial exchangeability that we understand well enough to put into crisp mathematical terms. We follow de Finetti by giving results for finite as well as infinite sequences; all results in sections 5.2-5.5 involve random quantities taking only two values. A bit of startlingly technical material in the original publication (see 5.5) has been retained here.

¹³ Beginning with de Finetti (1938).

In 5.2 we review exchangeability. In 5.3 we present examples of partially exchangeable sequences – 2 × 2 tables and Markov chains – and give general definitions. In 5.4 a finite form of de Finetti's theorem is presented. 5.5 gives some infinite versions of de Finetti's theorem and a counterexample which shows that the straightforward generalization from the exchangeable case sketched by de Finetti is not possible. The last section contains comments about the practical implications of partial exchangeability.
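Supplements 4 and 5 can be settled by direct computation. The sketch below (mine) uses the 90%/10% urn: the equal-weight mixture is exchangeable, with pr(red) = 1/2 on each trial, but the propositions are not unconditionally independent:

```python
from fractions import Fraction as F

# Equal weights on two hypotheses about the chance of red.
p_hi, p_lo, w = F(9, 10), F(1, 10), F(1, 2)

# Unconditionally: pr(H1) is the mixture of the two chances.
pr_H1 = w * p_hi + w * p_lo
assert pr_H1 == F(1, 2)

# pr(H1 and H2): draws are independent conditionally on each hypothesis.
pr_H1H2 = w * p_hi**2 + w * p_lo**2
assert pr_H1H2 == F(41, 100)

# Not independent: 41/100 != 1/4; learning one red raises pr(red next).
assert pr_H1H2 != pr_H1 * pr_H1
assert pr_H1H2 / pr_H1 == F(41, 50)   # pr(red on 2 | red on 1) = .82 > .5
```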
We also discuss the closely related work of P. Martin-Löf on repetitive structures.

5.2 Exchangeability Itself

We begin with the case of exchangeability. Consider the following experiment: a coin is spun on a table. We will denote heads by 1 and tails by 0. The experiment is to be repeated ten times. You believe you can assign probabilities to the 2¹⁰ = 1024 possible outcomes by looking at the coin and thinking about what you know. There are no a priori restrictions on the probabilities assigned save that they are nonnegative and sum to 1. Even with so simple a problem, assigning over a thousand probabilities is not a simple task.

De Finetti has called attention to exchangeability as a possible simplifying assumption. With an exchangeable probability, two sequences of length 10 with the same number of ones are assigned the same probability. Another way to say this is that only the number of ones in the ten trials matters, not the location of the ones: the probability assignment is symmetric, or invariant under changes in order. If believed, the symmetry assumption reduces the number of probabilities to be assigned from 1024 to 11—the probability of no ones through the probability of ten ones.

It is useful to single out certain extreme exchangeable probability assignments. Though it is unrealistic, we might be sure there will be exactly one head in the ten spins. The assumption of exchangeability forces each of the ten possible sequences with a single one in it to be equally likely. The distribution is just like the outcome of ten draws without replacement from an urn with one ball marked 1 and nine balls marked 0—the ball marked 1 could be drawn at any stage and must be drawn at some time. There are 11 such extremal urns, corresponding to sure knowledge of exactly i heads in the ten draws, where i is a fixed number between zero and ten.

The first form of de Finetti's theorem follows from these observations:

(1) Finite form of de Finetti's theorem: Every exchangeable probability assignment on sequences of length N is a unique mixture of draws without replacement from the N + 1 extremal urns.¹⁴

¹⁴ This form of the result was given by de Finetti (1938) and by many other writers on the subject. Diaconis (1977), 271-281, is an accessible treatment.

By a mixture of urns we simply mean a probability assignment over the N + 1 possible urns. Any exchangeable assignment on sequences of length N can be realized by first choosing an urn and then drawing without replacement until the urn is empty.
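Theorem (1) can be illustrated concretely. In this sketch (mine; N = 3 and the fair-coin assignment are arbitrary choices) the i.i.d. fair-coin assignment is recovered exactly as a mixture of draws without replacement from the four extremal urns, with mixing weight C(N, i)/2^N on the urn containing i ones:

```python
from fractions import Fraction as F
from itertools import product
from math import comb

N = 3
seqs = list(product([0, 1], repeat=N))

def extremal_urn(i):
    """Draws without replacement from the urn with i ones among N balls:
    each sequence with exactly i ones is equally likely; others impossible."""
    return {s: (F(1, comb(N, i)) if sum(s) == i else F(0)) for s in seqs}

# An exchangeable assignment: N independent fair-coin tosses.
coin = {s: F(1, 2) ** N for s in seqs}

# Mixing weights: the coin's own probabilities for "exactly i ones".
weights = {i: F(comb(N, i), 2 ** N) for i in range(N + 1)}

mixture = {s: sum(weights[i] * extremal_urn(i)[s] for i in range(N + 1))
           for s in seqs}

assert mixture == coin
```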
Again the mixture is defined by a probability assignment to the extremal urns.

The extremal urns, representing certain knowledge of the total number of ones, seem like unnatural probability assignments in most cases. While such situations arise (for example, when drawing a sample from a finite population), the more usual situation is that of the coin: while we are only considering ten spins, in principle it seems possible to extend the ten to an arbitrarily large number. In this case there is a stronger form of (1) which restricts the probability assignment within the class of exchangeable assignments:

(2) Infinite form of de Finetti's theorem: Every exchangeable probability assignment pr that can be indefinitely extended and remain exchangeable is a unique mixture µ of draws with replacement from a possibly infinite array of extremal urns.

Let the random variable p be the proportion of red balls in the urn from which the drawing will be made. If µ(p ∈ C) = 1 for a countable set C of real numbers from 0 to 1, the formula, spelled out, is

(3) pr(j ones in k trials) = C(k, j) Σ_{p∈C} p^j (1 − p)^(k−j) µ(p).

In the general case, C = [0, 1] and the sum becomes an integral:

pr(j ones in k trials) = C(k, j) ∫ p^j (1 − p)^(k−j) dµ(p),

the integral being taken over [0, 1]. This holds for every k with the same µ. A simple example where the array is a continuum: for each subinterval I of the real numbers from 0 to 1, let µ(p ∈ I) = the length of I. This is the µ that is uniquely determined when pr is the uniform distribution (5.1.3, example 2). We have more to say about this interpretation of exchangeability as expressing ignorance of the true value of p in Remark 1 of 5.6.

Not all exchangeable measures on sequences of length k can be extended to exchangeable sequences of length n > k. For example, sampling without replacement from an urn with k balls in it cannot be extended to k + 1 trials.¹⁵ The requirement that an exchangeable sequence of length k be infinitely extendable seems out of keeping with de Finetti's general program of restricting attention to finite samples. An appropriate finite version of (3) is given by Diaconis and Freedman (1978).

¹⁵ Necessary and sufficient conditions for extension are found in de Finetti (1969), Crisma (1971) and Diaconis (1977).
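For the uniform µ, the integral in (3) is a beta integral with a closed form: C(k, j) ∫ p^j (1 − p)^(k−j) dp = C(k, j) · j!(k − j)!/(k + 1)! = 1/(k + 1), the uniform distribution on the number of ones, as in example 2 of 5.1.3. A short check (my sketch):

```python
from fractions import Fraction as F
from math import comb, factorial

def pr_j_ones(j, k):
    """(3) with uniform mu on [0,1]: C(k,j) * B(j+1, k-j+1), where the
    beta integral B(j+1, k-j+1) equals j!(k-j)!/(k+1)!."""
    beta = F(factorial(j) * factorial(k - j), factorial(k + 1))
    return comb(k, j) * beta

k = 10
# Every count of ones from 0 to k gets the same probability 1/(k+1).
assert all(pr_j_ones(j, k) == F(1, k + 1) for j in range(k + 1))
```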
There we show that if an exchangeable probability on sequences of length k is extendable to an exchangeable probability on sequences of length n > k, then (3) almost holds, in the sense that there is a measure µ such that for any set A ⊆ {0, 1, . . . , k},

(4) | pr(number of 1's in k trials is in A) − Bµ(number of 1's in k trials is in A) | ≤ 2k/n,

uniformly in n, k, and A—where Bµ is the assignment determined by the mixture µ of drawings with replacement from urns with various proportions p of 1's:

Bµ(number of 1's in k trials is in A) = Σ_{j∈A} C(k, j) ∫ p^j (1 − p)^(k−j) dµ(p).¹⁷

¹⁷ For example, it is easy to imagine the spins of the coin as the first ten spins in a series of 1000 spins. This yields a bound of .02 on the right hand side of (4).

Both of the results (3) and (4) imply that for many practical purposes, instead of specifying a probability assignment on the number of ones in n trials, it is equivalent to specify a prior measure µ on the unit interval. Much of de Finetti's discussion in his papers on partial exchangeability is devoted to reasonable choices of µ.¹⁸ We will not discuss the choice of a prior further but rather restrict attention to generalizations of (2), (3), and (4) to partially exchangeable probability assignments.

¹⁸ The paper by A. Bruno (1964), "On the notion of partial exchangeability", Giorn. Ist. Ital. Attuari 27, 174-196, translated in Chapter 10 of de Finetti (1974), is also devoted to this important problem.

5.3 Two Species of Partial Exchangeability

In many situations exchangeability is not believable or does not permit incorporation of other relevant data. Here are two examples which will be discussed further in 5.4 and 5.5.

5.3.1 2 × 2 Tables

Consider zero/one outcomes in a medical experiment with n subjects. We are told each subject's sex and if each subject was given a treatment or was in a control group. In some cases, a reasonable symmetry assumption is the following kind of partial exchangeability: regard all the treated males as exchangeable with one another but not with the subjects in the other three categories; likewise for the other categories. Thus, two sequences of zeros and ones must get the same probability if they have the same number of ones within each of the four categories. For example, if n = 10, any of the following three sequences would be assigned the same probability:

Trial              1  2  3  4  5  6  7  8  9  10
Sex                M  M  M  F  F  M  M  F  F  F
Treatment/Control  T  T  T  T  T  C  C  C  C  C
1                  1  0  1  0  1  1  1  0  1  0
2                  0  1  1  0  1  1  1  0  0  1
3                  0  1  1  1  0  1  1  1  0  0

In this example, there are three treated males, two control males, two treated females, and three control females. The data from each of the three sequences can be summarized in a 2 × 2 table which records the number of one outcomes in each group. Each of the three sequences leads to the same table:

        T  C
(5)  M  2  2
     F  1  1

5.3.2 Markov Dependency

Consider an experiment in which a thumbtack is placed on the floor and given an energetic flick with the fingers. We record a one if the tack lands point upward and a zero if it lands point to the floor. If after each trial the tack were reset to be point to the floor, exchangeability might be a tenable assumption. But if each flick of the tack was given from the position in which the tack just landed, then the result of each trial may depend on the result of the previous trial—for example, there is some chance the tack will slide across the floor without ever turning. It does not seem reasonable to think of a trial depending on what happened two or more trials before. For simplicity, suppose the tack starts point to the floor.

A natural notion of symmetry here is to say that if two sequences of zeros and ones of length n which both begin with zero have the same number of transitions: zero to zero, zero to one, one to zero, and one to one, then they should both be assigned the same probability. For example, if n = 10, each of the three sequences below must be assigned the same probability. Each sequence begins with a zero and has the same transition matrix:

Trial  1  2  3  4  5  6  7  8  9  10
1      0  1  0  1  1  0  0  1  0  1
2      0  1  0  1  0  0  1  1  0  1
3      0  0  1  0  1  0  1  0  1  1
UPDATING ON STATISTICS 89 and ones of length n which had the same number of ones in each of the four categories would be assigned the same probability.
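These symmetry claims are easy to check by brute force. The sketch below (Python; not part of the original text—the labels and outcome sequences follow the n = 10 examples in this section) verifies that the three medical sequences share the same 2 × 2 table, and counts the length-10 sequences starting with zero that share the transition counts t00 = 1, t01 = 4, t10 = 3, t11 = 1.

```python
from itertools import product

# Labels for the ten trials of the medical example.
sex = ['M', 'M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F']
grp = ['T', 'T', 'T', 'T', 'T', 'C', 'C', 'C', 'C', 'C']
med = [(1, 0, 1, 0, 1, 1, 1, 0, 1, 0),
       (0, 1, 1, 0, 1, 1, 1, 0, 0, 1),
       (0, 1, 1, 1, 0, 1, 1, 1, 0, 0)]

def table(seq):
    """2x2 table: number of one-outcomes in each sex/treatment group."""
    t = {}
    for s, g, x in zip(sex, grp, seq):
        t[(s, g)] = t.get((s, g), 0) + x
    return t

# Each medical sequence leads to the same table (5): M row 2 2, F row 1 1.
assert all(table(s) == {('M', 'T'): 2, ('M', 'C'): 2,
                        ('F', 'T'): 1, ('F', 'C'): 1} for s in med)

def transitions(seq):
    """Matrix of transition counts [[t00, t01], [t10, t11]]."""
    t = [[0, 0], [0, 0]]
    for a, b in zip(seq, seq[1:]):
        t[a][b] += 1
    return t

# The three Markov sequences all have the matrix of (6).
markov = [(0, 1, 0, 1, 1, 0, 0, 1, 0, 1),
          (0, 1, 0, 1, 0, 0, 1, 1, 0, 1),
          (0, 0, 1, 0, 1, 0, 1, 0, 1, 1)]
target = [[1, 4], [3, 1]]
assert all(transitions(s) == target for s in markov)

# Count all length-10 sequences starting with zero that have this matrix.
n = sum(1 for rest in product((0, 1), repeat=9)
        if transitions((0,) + rest) == target)
print(n)  # the text's count: 16
```

Enumerating all 512 candidates confirms the count of 16 claimed in the next section.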
(6)        to
           0  1
  from  0  1  4
        1  3  1

It turns out that there are 16 different sequences starting with zero that have this transition matrix. Such examples can easily be combined and extended: for instance, two-stage Markov dependence, with additional information such as which experimenter reported the trial and the time of day being given.

A general definition of partial exchangeability that includes these examples involves the notion of a statistic: a function from the sequences of length n into a set X. We will write {0, 1}^n for the set of sequences of zeros and ones of length n. Let T : {0, 1}^n → X be a statistic taking values t1, t2, . . . , tk. A probability assignment P on sequences of length n is partially exchangeable for a statistic T if T(x) = T(y) implies P(x) = P(y), where x and y are sequences of length n. Freedman (1962) and Lauritzen (1974) have said that P was summarized by T in this situation. In the case of exchangeability the statistic T is the number of ones: two sequences with the same number of ones get assigned the same probability by an exchangeable probability, and this definition is equivalent to the usual one of permutation invariance. In the case of 2 × 2 tables the statistic T is the 2 × 2 table (5). In the Markov example the statistic T is the matrix of transition counts (6) along with the outcome of the first trial. In remark 4 of Section 6 we discuss a more general definition of partial exchangeability. De Finetti (1938), (1974, Section 9.6), Martin-Löf (1970), (1974), and Diaconis and Freedman (1978a,b,c) give further examples.

5.4 Finite Forms of de Finetti's Theorem on Partial Exchangeability

There is a simple analog of Theorem (1) for partially exchangeable sequences of length n. Let Si = {x ∈ 2^n : T(x) = ti} and suppose Si contains ni elements. Let pri be the probability assignment on 2^n which picks a sequence x ∈ 2^n by choosing an x from Si uniformly, i.e. with probability 1/ni. Then pri is partially exchangeable with respect to T. In terms of these definitions we now state
(7) A finite form of de Finetti's theorem on partial exchangeability: Every probability assignment pr on {0, 1}^n which is partially exchangeable with respect to T is a unique mixture of the extreme measures pri. The mixing weights are wi = pr{x : T(x) = ti}. [Footnote 19: In the language of convex sets, the set of partially exchangeable probabilities with respect to T forms a simplex with extreme points pri.]

Theorem (7) seems trivial, but in practice a more explicit description of the extreme measures pri—like the urns in (1)—can be difficult. We now explore this for the examples of Section 3.

(8) Example—2 × 2 tables. Suppose we know there are a treated males, b untreated males, c treated females, and d untreated females, with a + b + c + d = n. The sufficient statistic is the 2 × 2 table with entries the number of ones in each of the four groups. There are (a + 1) × (b + 1) × (c + 1) × (d + 1) possible values of the matrix. An extreme partially exchangeable probability can be thought of as follows. Fix a possible matrix

        T  U
    M   i  j
    F   k  l

where 0 ≤ i ≤ a, 0 ≤ j ≤ b, 0 ≤ k ≤ c, 0 ≤ l ≤ d. Make up four urns. In the first urn, labeled TM (for treated males), put i balls marked one and a − i balls marked zero. Similarly, construct urns labeled UM, TF, and UF. To generate a sequence of length n, given the labels (sex and treated/untreated), draw without replacement from the appropriate urns. This scheme generates all sequences x ∈ 2^n with the given matrix equiprobably. Theorem (7) says that any partially exchangeable probability assignment on 2^n is a unique mixture of such urn schemes. If a, b, c, and d all tend to infinity, the binomial approximation to the hypergeometric distribution will lead to the appropriate infinite version of de Finetti's theorem as stated in Section 5.5.

(9) Example—Markov chains. For simplicity, assume we have a sequence of length n + 1 that begins with a zero. The sufficient statistic is the matrix

    T = ( t00  t01 )
        ( t10  t11 )

where tij is the number of i to j transitions. A counting argument shows how many different values of T are possible. Here is an urn model which generates all sequences of length n + 1 with a fixed transition matrix T equiprobably.
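Theorem (7) can be checked mechanically on a small case. The sketch below (Python; the four-trial setup and the particular weights are illustrative assumptions, not from the text) builds an assignment that is partially exchangeable for a miniature treated/control statistic, forms the extreme uniform measures on the fibers of T, and confirms that the mixture with weights wi = pr{x : T(x) = ti} reproduces the assignment exactly.

```python
from itertools import product
from fractions import Fraction

# Four trials: the first two "treated", the last two "control".  The
# statistic is the pair (ones among treated, ones among control).
def T(x):
    return (x[0] + x[1], x[2] + x[3])

seqs = list(product((0, 1), repeat=4))

# An arbitrary partially exchangeable P: probability depends on x only
# through T(x).  (Unnormalized weights chosen arbitrarily.)
raw = {t: Fraction(1 + t[0] + 3 * t[1]) for t in set(map(T, seqs))}
Z = sum(raw[T(x)] for x in seqs)
P = {x: raw[T(x)] / Z for x in seqs}

# Extreme measures pr_i: uniform on each fiber S_i = {x : T(x) = t_i}.
fibers = {}
for x in seqs:
    fibers.setdefault(T(x), []).append(x)

# Mixing weights w_i = P{x : T(x) = t_i}, as Theorem (7) prescribes.
w = {t: sum(P[x] for x in S) for t, S in fibers.items()}

# Reconstruct P as the mixture sum_i w_i * pr_i; equality is exact.
mix = {x: w[T(x)] * Fraction(1, len(fibers[T(x)])) for x in seqs}
assert mix == P
print("P equals the mixture of uniform fiber measures")
```

The exact rational arithmetic makes the uniqueness-of-weights point visible: the reconstruction matches term by term, not merely approximately.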
Form two urns, U0 and U1, as follows: put tij balls marked j into urn Ui. It is now necessary to make an adjustment to make sure the process doesn't run into trouble. There are two cases possible: Case 1. If t01 = t10, remove a zero from U1. Case 2. If t01 = t10 + 1, remove a one from U0. To generate a sequence of length n + 1, let X1 = 0, and let X2 be the result of a draw from the urn indexed by X1; in general, let Xi be the result of a draw without replacement from the urn indexed by Xi−1. If the sequence generated ever forces a draw from an empty urn, make a forced transition to the other urn. The adjustment made above guarantees that such a forced jump can only be made once from either urn, and that the process generates all sequences of length n + 1 that start with zero and have transition matrix T with the same probability. Theorem (7) says that every probability assignment on sequences of length n + 1 which is partially exchangeable for the transition matrix T is a unique mixture of the different urn processes described above, one for each possible value of T. Again, the binomial approximation to the hypergeometric will lead to an infinite form of de Finetti's theorem in certain cases. This is further discussed in Sections 5 and 6.

Determining when a simple urn model such as the ones given above can be found to describe the extreme partially exchangeable probabilities may be difficult. For instance, we do not know how to extend the urn model for Markov chains to processes taking three values.

5.5 Technical interpolation: Infinite Forms

Let us examine the results and problems for infinite sequences in the two preceding examples.

(10) Example—2 × 2 tables. Let X1, X2, . . . be an infinite sequence of random variables each taking values 0 or 1. Suppose each trial is labeled as Male or Female and as Treated or Untreated, and that the number of labels in each of the four possible categories (M, T), (M, U), (F, T), (F, U) is infinite. Suppose that for each n the distribution of X1, X2, . . . , Xn is partially exchangeable with respect to the sufficient statistic that counts the number of zeros and the number of ones for each label: Tn = (a1, a2, a3, a4, b1, b2, b3, b4), where, for example, a1 is the number of ones labeled (M, T), a2 is the number of ones labeled (M, U), b1 is the number of zeros labeled (M, T), b2 is the number of zeros labeled (M, U), and so on. Then there is a unique probability distribution µ such that for every n and each sequence x1, x2, . . . , xn of zeros and ones,

(11)  pr(X1 = x1, . . . , Xn = xn) = ∫ Π_{i=1..4} pi^(ai) (1 − pi)^(bi) dµ(p1, p2, p3, p4),

where the ai, bi are the values of Tn(x1, x2, . . . , xn). The result can be proved by passing to the limit in the urn model (8) of 5.4. [Footnote 20: A different proof is given in G. Link's chapter in R. Jeffrey (ed.) (1980).]

(12) Example—Markov chains. Let X1, X2, . . . be an infinite sequence of random variables each taking values 0 or 1. For simplicity assume X1 = 0. Suppose that for each n the joint distribution of X1, . . . , Xn is partially exchangeable with respect to the matrix of transition counts. Suppose that the following recurrence condition is satisfied:

(13)  P(Xn = 0 infinitely often) = 1.

Then there is a unique probability distribution µ such that for every n and each sequence x2, . . . , xn,

(14)  P(X1 = 0, X2 = x2, . . . , Xn = xn) = ∫ p11^(t11) (1 − p11)^(t10) p00^(t00) (1 − p00)^(t01) dµ(p11, p00),

where the tij are the four entries of the transition matrix of x1, x2, . . . , xn. De Finetti appears to state (pp. 218–219 of de Finetti (1974)) that the representation (14) is valid for every partially exchangeable probability assignment in this case. Here is an example to show that the representation (14) need not hold in the absence of the recurrence condition (13). Consider the probability assignment which goes 001111111. . . (all ones after two zeros) with probability one. It is partially exchangeable because, for any n, the first n symbols of this sequence form the only sequence with transition matrix

    ( 1  1     )
    ( 0  n − 3 ).

To see that it is not representable in the form (14), write pk for the probability that the last zero occurs at trial k (k = 1, 2, 3, . . .). For a mixture of Markov chains the numbers pk can be represented as

(15)  pk = c ∫_0^1 p00^(k−1) (1 − p00) dµ(p00),

where c is the probability mass the mixing distribution puts on p11 = 1. The representation (15) implies that the numbers pk are decreasing. For the example 001111111. . . , p1 = 0, p2 = 1, and p3 = p4 = . . . = 0. So this example doesn't have a representation as a mixture of Markov chains.
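To see concretely why (15) defeats the 001111. . . assignment: whatever the mixing measure, the integrand p00^(k−1)(1 − p00) is non-increasing in k for each fixed p00 in [0, 1], so the pk must decrease. A small numeric sketch (Python; the two-point mixing measure and c = 1 are illustrative assumptions, not from the text):

```python
# p_k = c * integral of p00^(k-1) * (1 - p00) dmu(p00).  Illustrative
# choice: c = 1 and mu uniform over the two points p00 = 0.3 and 0.9.
pk = [0.5 * 0.3 ** (k - 1) * 0.7 + 0.5 * 0.9 ** (k - 1) * 0.1
      for k in range(1, 21)]

# The p_k are strictly decreasing, as (15) requires of any mixture of
# Markov chains -- so no mixture can produce p1 = 0 followed by p2 = 1.
assert all(a > b for a, b in zip(pk, pk[1:]))
print(pk[0], pk[1])
```

Any other mixing measure gives the same monotonicity, which is all the counterexample needs.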
When can we hope for representations like (3), (11), and (14) in terms of averages over a naturally constructed "parameter space"? De Finetti sketches out what appears to be a general theory in terms of what he calls the type of an observation. In de Finetti (1938) he only gives examples which generalize the examples of 2 × 2 tables: for each i we observe the value of another variable giving information like sex or time of day. In the example there are four types of observations, depending on the labels (M, T), (M, U), (F, T), (F, U). In attempting to define a general notion of type we thought of trying to define a type function tn : 2^n → {1, . . . , r}^n which assigns each symbol in a string a "type" depending only on past symbols. In this way the type of an observation is allowed to depend on the outcome of past observations, as in our Markov chain example: there are three types of observations—observations Xi that follow a 0 are of type zero, observations Xi that follow a 1 are of type one, and the first observation X1 is of type two. For the original case of exchangeability there is only one type. If there are r types, then it is natural to consider the statistic Tn(x1, x2, . . . , xn) = (a1, . . . , ar, b1, . . . , br), where ai is the number of ones of type i and bi is the number of zeros of type i. Generally, the statistic will depend on the sample size n, and one must specify the way the different Tn's interrelate. We suppose that the type functions tn were chosen so that the statistics Tn have what Lauritzen (1973) has called structure—Tn+1 can be computed from Tn and xn+1. [Footnote 21: Freedman (1962b) introduced the notion of S-structures, in which, roughly, from the value of Tn(x1, . . . , xn) and Tm(xn+1, . . . , xn+m) one can compute Tn+m(x1, . . . , xn+m). Lauritzen (1973) compares S-structure with several other ways of linking together sufficient statistics. We have worked with a somewhat different notion in Diaconis and Freedman (1978c) and developed a theory general enough to include a wide variety of statistical models.] Under these circumstances we hoped that a probability pr which was partially exchangeable with respect to the Tn's would have a representation of the form

(16)  pr{X1 = x1, . . . , Xn = xn} = ∫ Π_{i=1..r} pi^(ai) (1 − pi)^(bi) dµ(p1, . . . , pr)

for some measure µ. However, there will be some partially exchangeable probabilities which fail to have the representation (16), even though the number of observations of each type becomes infinite. The following contrived counterexample indicates the difficulty.

Example. You are observing a sequence of zeros and ones X0, X1, X2, . . . . You know that one of 3 possible mechanisms generates the process:

• All the positions are exchangeable.
• The even positions X0, X2, X4, . . . are exchangeable with each other, and the odd positions are generated by reading off a fixed reference sequence x of zeros and ones.
• The even positions X0, X2, X4, . . . are exchangeable with each other, and the odd positions are generated by reading off the complement x̄ (the complement has x̄j = 1 − xj).

Knowing the reference sequence x = 11101110 . . . , you keep track of two types of Xi: if i is odd and all the preceding Xj with j odd lead to a sequence which matches x or x̄, you call i of type 1; the remaining observations are of type 2. You count the number of zeros and ones of each type. Let Tn = (a1, a2, b1, b2) be the type counts at time n. The sequence Tn has structure, and any process of the kind described above is partially exchangeable for these Tn. Now consider the process which is fair coin-tossing on the even trials and equals the ith coordinate of x on trial 2i − 1. For this process X1 and X3 are always of type 1, and the number of zeros and ones of each type becomes infinite. However, the process cannot be represented in the form (16). The class of all processes with these Tn's is studied in Diaconis and Freedman (1978c), where the extreme points are determined.

We do not know for which type functions the representation (16) will hold. Some cases where a parametric representation is possible have been determined by Martin-Löf (1970, 1974) and Lauritzen (1976). A general theory of partial exchangeability and more examples are in Diaconis and Freedman (1978c).
5.6 Concluding Remarks

1. Some Bayesians are willing to talk about "tossing a coin with unknown p". For them, de Finetti's theorem can be interpreted as follows: if a sequence of events is exchangeable, then it is like the successive tosses of a p-coin with unknown p. Other Bayesians do not accept the idea of p-coins with unknown p; de Finetti is a prime example. For them, the theorem gives a characterizing feature of such assignments—exchangeability—which can be thought about in terms involving only opinions about observable events. Writers on subjective probability have suggested that de Finetti's theorem bridges the gap between the two positions. We have trouble with this synthesis, and the following quote indicates that de Finetti has reservations about it. [Footnote 22: 'Probability: Beware of falsifications!', Scientia 111 (1976), 283–303.] The sensational effect of this concept (which went well beyond its intrinsic meaning) is described as follows in Kyburg and Smokler's preface to the collection Studies in Subjective Probability which they edited (pp. 13–14):

  In a certain sense the most important concept in the subjective theory is that of "exchangeable events". Until this was introduced (by de Finetti (1931)) the subjective theory of probability remained little more than a philosophical curiosity. None of those for whom the theory of probability was a matter of knowledge or application paid much attention to it. But, with the introduction of the concept of "equivalence or symmetry" or "exchangeability", as it is now called, a way was discovered to link the notion of subjective probability with the classical problem of statistical inference.

It does not seem to us that the theorem explains the idea of a coin with unknown p. The main point is this: probability assignments involving mixtures of coin tossing were used by Bayesians long before de Finetti. The connection between mixtures and exchangeable probability assignments allows a subjectivist to interpret some of the classical calculations involving mixtures. For example, consider Laplace's famous calculation of the chance that the Sun will rise tomorrow given that it has risen on n − 1 previous days (5.1.3, example 2). Laplace took a uniform prior µ on [0, 1] and calculated the chance as the posterior mean of the unknown parameter:

    pr(Sun rises tomorrow | has risen on n − 1 previous days) = ∫_0^1 p^n dµ / ∫_0^1 p^(n−1) dµ = n/(n + 1).

The translation of this calculation is as follows. Let Xi = 1 if the Sun rises on day i, and Xi = 0 otherwise. Laplace's uniform mixture of coin tossing is exchangeable, and P(X1 = 1, . . . , Xk = 1) = 1/(k + 1), k = 1, 2, . . . . With this allocation,

    P(Xn = 1 | X1 = 1, . . . , Xn−1 = 1) = P(X1 = 1, . . . , Xn = 1) / P(X1 = 1, . . . , Xn−1 = 1) = n/(n + 1).

A similar interpretation can be given to any calculation involving averages over a parameter space.

2. The exchangeable form of de Finetti's theorem (1) is also a useful computational device for specifying probability assignments.

3. The result is more complicated and much less useful in the case of real valued variables. De Finetti's theorem says real valued exchangeable variables {Xi}, i = 1, 2, . . . , are described as follows: there is a prior measure π on the space of all probability measures on the real line such that

    pr(X1 ∈ A1, . . . , Xn ∈ An) = ∫ Π_{i=1..n} p(Xi ∈ Ai) dπ(p).

Here there is no natural notion of a parameter p. The space of all probability measures on the real line is so large that it is difficult for subjectivists to describe personally meaningful distributions on this space. [Footnote 23: Ferguson, T., "Prior distributions on spaces of probability measures", Ann. Stat. 2 (1974), 615–629, contains examples of the various attempts to choose such a prior.] Thus, for real valued variables de Finetti's theorem is far from an explanation of the type of parametric estimation—involving a family of probability distributions parametrized by a finite dimensional parameter space—that Bayesians from Bayes and Laplace to Lindley have been using in real statistical problems.

Technical interpolations. In some cases, more restrictive conditions than exchangeability are reasonable to impose, and do single out tractable classes of distributions. Here are some examples adapted from Freedman. [Footnote 24: Freedman, D., "Mixtures of Markov processes", Ann. Math. Stat. 33 (1962a), 114–118; "Invariants under mixing which generalize de Finetti's theorem", Ann. Math. Stat. 33 (1962b), 916–923.]

Example: Scale mixtures of normal variables. When can a sequence of real valued variables {Xi}, 1 ≤ i < ∞, be represented as a scale mixture of normal variables,

(19)  pr(X1 ≤ t1, . . . , Xn ≤ tn) = ∫_0^∞ Π_{i=1..n} Φ(δ ti) dπ(δ),
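The sunrise calculation can be confirmed with exact rational arithmetic. A minimal sketch (Python, not part of the original text):

```python
from fractions import Fraction

def rise_prob(n):
    """Posterior chance the Sun rises on day n given n-1 sunrises,
    under Laplace's uniform prior: (int p^n dp) / (int p^(n-1) dp)."""
    num = Fraction(1, n + 1)   # integral of p^n over [0, 1]
    den = Fraction(1, n)       # integral of p^(n-1) over [0, 1]
    return num / den           # = n / (n + 1)

assert rise_prob(100) == Fraction(100, 101)

# The exchangeable translation: P(X1=...=Xk=1) = 1/(k+1), and the ratio
# of successive such probabilities is again n/(n+1).
P_all = lambda k: Fraction(1, k + 1)
assert all(P_all(n) / P_all(n - 1) == rise_prob(n) for n in range(2, 50))
print(rise_prob(3))  # prints 3/4
```

The agreement of the two routes is exactly the "translation" described in the text: the parameter-space average and the purely exchangeable calculation give the same answer.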
where Φ(t) = (1/√(2π)) ∫ from −∞ to t of e^(−s²/2) ds? In Freedman (1963) it is shown that a necessary and sufficient condition for (19) to hold is that for each n the joint distribution of X1, X2, . . . , Xn be rotationally symmetric. This result is related to the derivation of Maxwell's distribution for velocity in a monatomic ideal gas. [Footnote 25: Khinchin, Mathematical Foundations of Statistical Mechanics (1949), Chap. VI.]

Example: Poisson distribution. Let Xi (1 ≤ i < ∞) take integer values. In Freedman (1962) it is shown that a necessary and sufficient condition for the Xi to have a representation as a mixture of Poisson variables,

    pr(X1 = a1, . . . , Xn = an) = ∫_0^∞ Π_{i=1..n} (e^(−λ) λ^(ai) / ai!) dπ(λ),

is as follows: for every n the joint distribution of X1, . . . , Xn given Sn = Σ_{i=1..n} Xi must be multinomial, like the joint distribution of Sn balls dropped at random into n boxes. Many further examples and some general theory are given in Diaconis and Freedman (1978c).

De Finetti's generalizations of partial exchangeability are closely connected to work of P. Martin-Löf in the 1970's. Martin-Löf does not seem to work in a Bayesian context; rather he takes the notion of sufficient statistic as basic and from this constructs the joint distribution of the process in much the same way as de Finetti. The idea is to specify the conditional distribution of the observations given the sufficient statistics. In the examples discussed by de Finetti, and in Martin-Löf's work, the conditional distribution has been chosen as uniform. We depart from this assumption in Diaconis and Freedman (1978c), as we have in the example of mixtures of Poisson distributions. Martin-Löf connects the conditional distribution of the process with the microcanonical distributions of statistical mechanics. The families of conditional distributions we work with still "project" in the right way. This projection property allows us to use the well developed machinery of Gibbs states as developed by Khinchin, Lanford (1973), Föllmer (1974), and Preston (1977). Martin-Löf's work appears most clearly spelled out in a set of mimeographed lecture notes (Martin-Löf (1970)), unfortunately available only in Swedish. A technical treatment of part of this may be found in Martin-Löf (1974). Further discussion of Martin-Löf's ideas is in Lauritzen (1973), (1975), and the last third of Tjur (1974). These references contain many new examples of partially exchangeable processes and their extreme points. We have tried to connect Martin-Löf's treatment to the general version of de Finetti's theorem we have derived in Diaconis and Freedman (1978c).
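The multinomial criterion can be observed numerically. The sketch below (Python; the two-point prior π on λ is an illustrative assumption, not from the text) checks that for a mixture of i.i.d. Poisson variables, conditioning on the sum yields the "balls dropped at random into boxes" distribution, independent of the prior.

```python
from math import exp, factorial, comb, isclose

def pois(a, lam):
    """Poisson probability mass e^(-lam) lam^a / a!."""
    return exp(-lam) * lam ** a / factorial(a)

# Mixture over a two-point prior: lam = 2 with weight 0.3, lam = 7
# with weight 0.7 (illustrative numbers).
def joint(a1, a2):
    return 0.3 * pois(a1, 2) * pois(a2, 2) + 0.7 * pois(a1, 7) * pois(a2, 7)

# Conditional on S2 = s, the pair (X1, X2) should be multinomial:
# s balls dropped at random into 2 boxes, i.e. binomial(s, 1/2).
s = 6
norm = sum(joint(a, s - a) for a in range(s + 1))
for a in range(s + 1):
    assert isclose(joint(a, s - a) / norm, comb(s, a) / 2 ** s)
print("conditional on the sum, the mixture is multinomial")
```

The check works for any prior because each Poisson component already has the multinomial conditional, and mixing preserves it.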
4. Is exchangeability a natural requirement on subjective probability assignments? It seems to us that much of its appeal comes from the (forbidden?) connection with coin tossing. The notion of symmetry seems strange at first; a feeling of naturalness only appears after experience and reflection. This is most strikingly brought out in the thumbtack/Markov chain example: if someone were thinking about assigning probability to 10 flips of the tack and had never heard of Markov chains, it seems unlikely that they would hit on the appropriate notion of partial exchangeability. Its appeal comes from the connection with Markov chains with unknown transition probabilities.

5. In this paper we have focused on generalizations of exchangeability involving a statistic. A more general extension involves the idea of invariance with respect to a collection of transformations of the sample space into itself. This contains the idea of partial exchangeability with respect to a statistic, since we can consider the class of all transformations leaving the statistic invariant. A typical case which cannot be neatly handled by a finite dimensional statistic is de Finetti's theorem for zero/one matrices. Here the distribution of the doubly infinite random matrix is to be invariant under permutations of the rows and columns. David Aldous has recently identified the extreme points of these matrices and shown that no representation as a mixture of finite parameter processes is possible.

5.7 References

Bruno, A. (1964). "On the notion of partial exchangeability". Giorn. Ist. It. Attuari 27, 174–196. Translated in Chapter 10 of de Finetti (1974).

Crisma, L. (1971). "Sulla proseguibilità di processi scambiabili". Rend. Matem. Trieste 3, 96–124.

de Finetti, B. (1931). Probabilismo: Saggio critico sulla teoria delle probabilità e sul valore della scienza. Biblioteca di Filosofia diretta da Antonio Aliotta. Naples.

de Finetti, B. (1937). "La prévision: ses lois logiques, ses sources subjectives". Ann. Inst. H. Poincaré (Paris) 7, 1–68. Translated in Kyburg and Smokler (1964).

de Finetti, B. (1938). "Sur la condition d'équivalence partielle" (Colloque Genève, 1937). Act. Sci. Ind. 739, 5–18. Paris: Hermann. Translated in Jeffrey (1980).

de Finetti, B. (1959). "La probabilità e la statistica nei rapporti con l'induzione, secondo i diversi punti di vista". In Corso C.I.M.E. su Induzione e Statistica. Rome: Cremonese. Translated as Chapter 9 of de Finetti (1974).

de Finetti, B. (1969). "Sulla proseguibilità di processi aleatori scambiabili". Rend. Matem. Trieste 1, 53–67.

de Finetti, B. (1974). Probability, Induction and Statistics. New York: Wiley.

de Finetti, B. (1976). "Probability: Beware of falsifications!" Scientia 111, 283–303.

Diaconis, P. (1977). "Finite forms of de Finetti's theorem on exchangeability". Synthese 36, 271–281.

Diaconis, P. and Freedman, D. (1978a). "Finite exchangeable sequences". Ann. Prob. 8, 745–764.

Diaconis, P. and Freedman, D. (1978b). "De Finetti's theorem for Markov chains". Ann. Prob. 8, 115–130.

Diaconis, P. and Freedman, D. (1978c). "Partial exchangeability and sufficiency". In Proc. Indian Statistical Institute Golden Jubilee International Conference on Statistics: Applications and New Directions, eds. J. K. Ghosh and J. Roy. Calcutta: Indian Statistical Institute, 205–236.

Ferguson, T. (1974). "Prior distributions on spaces of probability measures". Ann. Stat. 2, 615–629.

Föllmer, H. (1974). "Phase transitions and Martin boundary". Springer Lecture Notes in Mathematics 465. New York: Springer-Verlag.

Freedman, D. (1962a). "Mixtures of Markov processes". Ann. Math. Stat. 33, 114–118.

Freedman, D. (1962b). "Invariants under mixing which generalize de Finetti's theorem". Ann. Math. Stat. 33, 916–923.

Heath, D. and Sudderth, W. (1976). "De Finetti's theorem on exchangeable variables". Am. Stat. 30, 188–189.

Jeffrey, R. (ed.) (1980). Studies in Inductive Logic and Probability, vol. 2. Berkeley and Los Angeles: University of California Press.

Kendall, D. (1967). "On finite and infinite sequences of exchangeable events". Studia Sci. Math. Hung. 2, 319–327.

Khinchin, A. (1949). Mathematical Foundations of Statistical Mechanics. New York: Dover.

Kyburg, H. E. and Smokler, H. E. (eds.) (1964). Studies in Subjective Probability. New York: Wiley. Second edition (1980): Huntington, NY: Krieger.

Lanford, O. (1975). "Time evolutions of large classical systems: theory and applications". Springer Lecture Notes in Physics 38. New York: Springer-Verlag.

Lauritzen, S. (1973). "On the interrelationships among sufficiency, total sufficiency, and some related concepts". Institute of Mathematical Statistics, University of Copenhagen.

Lauritzen, S. (1975). "General exponential models for discrete observations". Scand. J. Statist. 2, 23–33.

Martin-Löf, P. (1970). Statistiska Modeller. Anteckningar från seminarier läsåret 1969–70, utarbetade av R. Sundberg. Stockholm: Institutionen för Matematisk Statistik, Stockholms Universitet.

Martin-Löf, P. (1974). "Repetitive structures and the relation between canonical and microcanonical distributions in statistics and statistical mechanics". In Proc. Conference on Foundational Questions in Statistical Inference, eds. O. Barndorff-Nielsen, P. Blæsild and G. Schou. Aarhus: Department of Theoretical Statistics, University of Aarhus.

Preston, C. (1976). Random Fields. Springer Lecture Notes in Mathematics 534. New York: Springer-Verlag.

Tjur, T. (1974). "Conditional probability distributions". Lecture Notes 2, Institute of Mathematical Statistics, University of Copenhagen.
A ≻ B if des A > des B,
A ⪰ B if des A ≥ des B,
A ≈ B if des A = des B,

and similarly for ≺ and ⪯.²

The basic connection between pr and des is that their product is additive, just as pr is.³ Here is another way of saying that, in a different notation:

Basic probability–desirability link. If the Sᵢ form a partitioning of A, then

(1)  des A = Σᵢ des(Sᵢ ∧ A) pr(Sᵢ|A)

Note that it is not only option-propositions that appear in preference rankings: you can perfectly well prefer a sunny day tomorrow (truth of 'Tomorrow will be sunny') to a rainy one even though you know you cannot affect the weather.

Various principles of preference logic can now be enunciated, and fallacies identified, as in the following samples. (The symbol '⊨' means valid implication: that the premise, to the left of the turnstile, validly implies the conclusion, to its right.)

6.1.1 Denial Reverses Preferences: A ≻ B ⊨ ¬B ≻ ¬A

The first is a fallacious mode of inference according to which denial reverses preferences.

Counterexample: Death before dishonor. Suppose you are an exemplary Roman of the old school: you prefer death (A) to dishonor (B), so that if you were dishonored you would certainly be dead, if necessary by suicide: pr(death|dishonor) = 1. Premise: A ≻ B. Conclusion: ¬B ≻ ¬A. But surely it must be that you prefer a guarantee of life (¬A) to a guarantee of not being dishonored (¬B): 6.1.1 is just backward, for in this particular example your conditional probability pr(A|B) = 1 makes the second guarantee an automatic consequence of the first.

²Rabinowicz (2002) effectively counters arguments by Levi and Spohn that Dutch Book arguments are untrustworthy in cases where the bet is on the agent's own future betting behavior, since "practical deliberation crowds out self-prediction" much as the gabble of cellphone conversations may make it impossible to pursue simple trains of thought in railway cars. The difficulty is one that cannot arise where it is this very future behavior that is at issue.

³Define τ(H) = pr(H) des(H); then τ(H₁ ∨ H₂ ∨ . . .) = Σᵢ τ(Hᵢ) provided τ(Hᵢ ∧ Hⱼ) = 0 whenever i ≠ j.
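The basic probability–desirability link (1) above can be checked numerically on a toy frame. The sketch below is not from the text: the "worlds", probabilities and utilities are invented for illustration. It computes des A directly as the conditional expectation of utility, and again via formula (1), confirming that any partitioning gives the same answer.

```python
# Toy check of des A = ex(u|A) and the link (1):
#   des A = sum_i des(S_i & A) * pr(S_i | A).
# Worlds, probabilities and utilities are invented for illustration.
worlds = {            # world: (probability, utility)
    "w1": (0.2, 10.0),
    "w2": (0.3,  4.0),
    "w3": (0.4, -2.0),
    "w4": (0.1,  6.0),
}

def pr(prop):
    """Probability of a proposition, represented as a set of worlds."""
    return sum(worlds[w][0] for w in prop)

def des(prop):
    """Desirability: conditional expectation of utility given the proposition."""
    return sum(worlds[w][0] * worlds[w][1] for w in prop) / pr(prop)

A = {"w1", "w2", "w3"}
S = [{"w1", "w4"}, {"w2"}, {"w3"}]   # a partitioning whose cells meet A

direct = des(A)
via_link = sum(des(Si & A) * (pr(Si & A) / pr(A)) for Si in S if Si & A)
assert abs(direct - via_link) < 1e-12   # formula (1) agrees with ex(u|A)
```

The same check goes through for any partitioning, since the numerators pr(Sᵢ ∧ A) des(Sᵢ ∧ A) sum to pr(A) des(A), which is the additivity of the product mentioned in the text.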
6.1.2 If A ∧ B ⊨ ⊥ and A ≻ B, then A ≻ A ∨ B ≻ B

This one is valid. Proof: Set w = pr(A|A ∨ B). Then des(A ∨ B) = w(des A) + (1 − w)(des B). This is a convex combination of des A and des B, which must therefore lie either between them or at one or the other.

6.1.3 The "Sure Thing" (or "Dominance") Principle: (A ∧ B) ≻ C, (A ∧ ¬B) ≻ C ⊨ A ≻ C

In words, with premises to the left of the turnstile: If you would prefer A to C knowing that B is true, and knowing that B is false, then you prefer A to C. Really? If A is better than C no matter how other things (B) turn out, then surely A is better than C. How could that be wrong? Well, here is one way.

Counterexample: The Marlborough Man, "MM".

Premise: Smoking is preferable to quitting if I die early.
Premise: Smoking is preferable to quitting if I do not die early.
Conclusion: Smoking is preferable to quitting.

He reasons: 'Sure, smoking (A) makes me more likely to die before my time (B); but a smoker can live to a ripe old age, though the wise heads say "Nonsense, dear boy." Now if I am to live to a ripe old age I'd rather do it as a smoker, and if not I would still enjoy the consolation of smoking. In either case, I'd rather smoke than quit. So I'll smoke.' I have heard that argument put forth quite seriously, and there are versions of decision theory which seem to endorse it. But this snappy statement is not a valid form of inference: with C = Quit smoking in 6.1.3, you have chosen a partition {B, ¬B} relative to which the STP fails. There are provisos under which the principle is quite trustworthy, and Sir Ronald Fisher (1959) actually sketched a genetic scenario in which the MM's conclusion would be the right one (see 6.3.1 below); but he never suggested that 6.1.3, the unrestricted STP, is an a priori justification for the MM's decision.

For the record, here is the example that L. J. Savage (1954), the man who brought us the STP, used to motivate it:

A businessman contemplates buying a certain piece of property. He considers the outcome of the next presidential election relevant to the attractiveness of the purchase. So, to clarify the matter for himself, he asks whether he would buy if he knew that the Republican candidate were going to win, and decides that he would do so. Similarly, he considers whether he would buy if he knew that the Democratic candidate were going to win, and again finds that he would do so. Seeing that he would buy in either event, he decides that he should buy, even though he does not know which event obtains, or will obtain, as we would ordinarily say. It is all too seldom that a decision can be arrived at on the basis of the principle used by this businessman, but, except possibly for the assumption of simple ordering, I know of no other extralogical principle governing decisions that finds such ready acceptance. (Savage 1954, p. 21)

Savage's businessman and my Marlborough Man seem to be using exactly the same extralogical principle ('extralogical' relative to deductive logic, that is). What is going on? Here we compare and contrast the Bolker–Jeffrey figure of merit with Savage's.

BJing the MM

This is how it normally turns out that people with the MM's preferences prefer quitting to smoking, even if they cannot bring themselves to make the preferential choice. It's all done with conditional probabilities. Grant the MM's preferences and probabilities as on the following assumptions about desirabilities and probabilities, where 'long' and 'shorter' refer to expected length of life and w, s, l ('worst', 'smoke', 'long') are additive components of desirability, with, by hypothesis, s, l > 0 and w + s + l < 1:

des(long ∧ smoke) = l + s + w        pr(long|smoke) = p
des(long ∧ quit) = l + w             pr(long|quit) = q
des(shorter ∧ smoke) = s + w         pr(shorter|smoke) = 1 − p
des(shorter ∧ quit) = w              pr(shorter|quit) = 1 − q

Question: How must MM's conditional probabilities p, q be related in order for the BJ figure of merit to advise his quitting instead of continuing to smoke?

Answer: des(quit) > des(smoke) if and only if (l + w)q + w(1 − q) > (l + s + w)p + (s + w)(1 − p), i.e., if and only if q − p > s/l, where, by hypothesis, s/l is positive and less than 1.

These are the sorts of values of p and q that MM might have gathered from Consumers Union (1963): p = .27, q = .59. Then BJ tells MM to smoke iff s/l > q − p = .32; so as long as MM is not so addicted or otherwise dedicated to smoking that the increment of desirability from it is at least 32% of the increment from long life, formula (1) in the basic probability–desirability link will have him choose to quit.

BJing the STP

If we interpret Savage (1954) as accepting the basic probability–desirability link (1) as well as the STP, and as accepting the same reasoning for the MM and the businessman (choose the dominant act: smoke, buy), then he must be assuming that acts are probabilistically independent of states of nature: he must then see p and q as equal, and the businessman must then consider that his buying or not will have no influence on the outcome of the election.⁴

6.1.4 Bayesian Frames

Given a figure of merit for acts, the discussion goes one way or another, depending on particulars of the figure of merit that is adopted for acts. Bayesian decision theory represents a certain structural concept of rationality. This is contrasted with substantive criteria of rationality having to do with the aptness of particular probability and utility functions in particular predicaments. With Donald Davidson⁵ I would interpret this talk of rationality as follows.

⁴This is a reconstructionist reading of Savage (1954).

⁵Davidson (1980) 273-4. See also Jeffrey (1987) and sec. 12.8 of Jeffrey (1965, 1983, 1990).
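The Marlborough Man arithmetic in "BJing the MM" above can be checked directly. In the sketch below the desirability components w, s, l are invented values satisfying the text's constraints (s, l > 0, w + s + l < 1); p and q are the Consumers Union figures quoted in the text.

```python
# BJ figure of merit for the Marlborough Man, per formula (1).
# w = "worst" baseline, s = smoking increment, l = long-life increment.
# These three values are illustrative, not from the text.
w, s, l = 0.1, 0.2, 0.65

p = 0.27   # pr(long | smoke), Consumers Union (1963) figure as in the text
q = 0.59   # pr(long | quit)

des_smoke = (l + s + w) * p + (s + w) * (1 - p)   # simplifies to l*p + s + w
des_quit  = (l + w) * q + w * (1 - q)             # simplifies to l*q + w

# Quitting is advised exactly when q - p > s/l:
assert (des_quit > des_smoke) == (q - p > s / l)
# Here s/l = 0.2/0.65 < .32 = q - p, so BJ advises quitting:
assert des_quit > des_smoke
```

With these numbers the smoking increment is about 31% of the long-life increment, just under the 32% break-even point, so quitting wins; raising s above .32·l flips the advice, as the text says.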
A complete, consistent set of substantive judgments would be represented by a "Bayesian frame" consisting of (1) a probability distribution over a space Ω of "possible worlds", (2) a function u assigning "utilities" u(ω) to the various worlds ω ∈ Ω, and (3) an assignment of subsets of Ω as values of the sentence letters 'A', 'B', . . . . Subsets A ⊆ Ω then represent propositions: A is true in world ω iff ω ∈ A, and conjunction, disjunction and denial of propositions are represented by the intersection, union and complement with respect to Ω of the corresponding subsets.

In any logic, validity of an argument is truth of the conclusion in every frame in which all the premises are true. So in a Bayesian logic of decision we can understand validity of an argument as truth of its conclusion in any Bayesian frame in which all of its premises are true, and understand consistency of a judgment (say, affirming A ≻ B while denying ¬B ≻ ¬A) as existence of a nonempty set of Bayesian frames in which the judgment is true. Bayesian frames represent possible answers to substantive questions of rationality, and we can understand that without knowing how to determine whether particular frames would be substantively rational for you on particular occasions. What remains when all substantive questions of rationality are set aside is bare logic, a framework for tracking your changing judgments, in which questions of validity and invalidity of argument-forms can be settled as illustrated above; bare structural rationality is simply representability in the Bayesian framework.

6.2 Causality

In decision-making it is deliberation, not observation, that changes your probabilities. To think you face a decision problem rather than a question of fact about the rest of nature is to expect whatever changes arise in your probabilities for those states of nature during your deliberation to stem from changes in your probabilities of choosing options. In terms of the analogy with mechanical kinematics: as a decision-maker you regard probabilities of options as inputs driving the mechanism, not driven by it.

Newcomb problems (Nozick 1969, 1990) seem to pose a challenge to the decision theory floated in Jeffrey (1965, 1983, 1990), where notions of causal influence play no rôle. Is there something about your judgmental probabilities which shows that you are treating truth of one proposition as promoting truth of another, rather than as promoted by it or by truth of some third proposition which also promotes truth of the other? The suggestion is that imputations of causal influence do not show up simply as momentary features of probabilistic states of mind, but as intended or expected features of their evolution. Here the promised positive answer to this question is used to analyze puzzling problems in which we see acts as mere symptoms of conditions we would promote or prevent if we could; the present suggestion about causal judgments will be used to question the credentials of Newcomb problems as decision problems.

Recall the following widely recognized necessary condition for the judgment that truth of one proposition ("cause") promotes truth of another ("effect"):

Correlation: pr(effect|cause) > pr(effect|¬cause)

But aside from the labels, what distinguishes cause from effect in this relationship? The problem is that correlation is a symmetrical relationship, which continues to hold when the labels are interchanged: pr(cause|effect) > pr(cause|¬effect). It is also problematical that correlation is a relationship between contemporaneous values of the same conditional probability function.
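That the correlation condition really is symmetrical can be verified mechanically: for any joint distribution over the four cells, pr(effect|cause) > pr(effect|¬cause) holds exactly when pr(cause|effect) > pr(cause|¬effect), since both inequalities reduce to the same cross-product condition on the cell probabilities. A sketch (the function name and the random test are mine, not the text's):

```python
# Correlation is symmetric: pr(E|C) > pr(E|~C)  iff  pr(C|E) > pr(C|~E).
# cc, cn, nc, nn are joint probabilities of C&E, C&~E, ~C&E, ~C&~E.
import random

def corr_both_ways(cc, cn, nc, nn):
    """Return the two correlation comparisons for one joint distribution."""
    e_given_c = cc / (cc + cn) > nc / (nc + nn)   # pr(E|C) > pr(E|~C)
    c_given_e = cc / (cc + nc) > cn / (cn + nn)   # pr(C|E) > pr(C|~E)
    return e_given_c, c_given_e

random.seed(0)
for _ in range(1000):
    cells = [random.random() + 1e-9 for _ in range(4)]
    total = sum(cells)
    e_c, c_e = corr_both_ways(*(x / total for x in cells))
    assert e_c == c_e   # the two inequalities always agree
```

Both inequalities are equivalent to cc·nn > cn·nc, which is unchanged when the "cause"/"effect" labels are swapped; hence correlation alone cannot distinguish cause from effect.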
Agents' intentions or assumptions about the kinematics of pr might be described by maps of possible courses of evolution of probabilistic states of mind: often, very simple maps. Your kinematical map represents your understanding of the dynamics of your current predicament, i.e., the possible courses of development of your probability and desirability functions. These are like road maps in that paths from point to point indicate feasibility of passage via the anticipated mode of transportation: ordinary automobiles, e.g., not "all terrain" vehicles.

Rigidity: Constancy of pr(effect|cause) and pr(effect|¬cause) as pr(cause) varies.⁶

Rigidity is a condition on a variable ('pr') that ranges over a set of probability functions; the functions in the set represent ideally definite momentary probabilistic states of mind for the deliberating agent, as they might be at different times. Clearly, pr can vary during deliberation. Note that if 'cause' denotes one element of a partition and '¬cause' denotes the disjunction of all the other elements, then ¬cause need not satisfy the rigidity condition even though all elements of the original partition do; in general the partition need not be twofold. In statistics, the corresponding term is 'sufficiency': a sufficient statistic is a random variable whose sets of constancy ("data") form a partition satisfying the rigidity condition.

What further condition can be added, to produce a necessary and sufficient pair? With Arntzenius (1990), I suggest the following answer: rigidity relative to the partition {cause, ¬cause}.

The Logic of Decision used conditional expectation of utility given an act, des(act), as the figure of merit for the act. Newcomb problems (Nozick 1969) led many to see that figure as acceptable only on special causal assumptions, and a number of versions of "causal decision theory" have been proposed as more generally acceptable.⁷ But if Newcomb problems are excluded as bogus, then in genuine decision problems des(act) will remain constant throughout deliberation, and will be an adequate figure of merit: for if deliberation converges toward choice of a particular act, the probability of the corresponding proposition will rise toward 1. Now there is much to be said about Newcomb problems, and I have said quite a lot elsewhere, as in Jeffrey (1996).

⁶This is invariance of conditional probabilities (3.1), shown in 3.2 to be equivalent to probability kinematics as a mode of updating.

⁷In the one I like best (Skyrms 1980), the figure of merit for choice of an act is the agent's expectation of des(A) on a partition K of causal hypotheses: (CFM) causal figure of merit relative to K = ex_K(u|A). In the discrete case, K = {Kₖ : k = 1, 2, . . .} and the CFM is Σₖ pr(Kₖ) des(A ∧ Kₖ).
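The contrast between the two figures of merit can be made concrete. The sketch below assumes the discrete form CFM(A) = Σₖ pr(Kₖ) des(A ∧ Kₖ) for the Skyrms-style causal figure, and compares it with the evidential figure des(A) = Σₖ pr(Kₖ|A) des(A ∧ Kₖ); the worlds, probabilities and utilities are invented toy numbers in which A is evidence for the favorable hypothesis K1 without causing anything.

```python
# Worlds: (probability, utility, A true?, causal hypothesis).
# Invented Newcomb-flavored numbers: A is correlated with K1 but, by
# assumption, does not influence it.
worlds = [
    (0.4, 10.0, True,  "K1"),
    (0.1,  0.0, False, "K1"),
    (0.1,  5.0, True,  "K2"),
    (0.4,  1.0, False, "K2"),
]

def des(cond):
    """Conditional expectation of utility given cond (predicate on (a, k))."""
    sel = [(p, u) for (p, u, a, k) in worlds if cond(a, k)]
    tot = sum(p for p, _ in sel)
    return sum(p * u for p, u in sel) / tot

pr_K = {K: sum(p for (p, _, _, k) in worlds if k == K) for K in ("K1", "K2")}

des_A = des(lambda a, k: a)                                # evidential: ~9.0
cfm_A = sum(pr_K[K] * des(lambda a, k, K=K: a and k == K)  # causal:     ~7.5
            for K in ("K1", "K2"))
assert des_A > cfm_A   # the evidential figure credits A with mere news value
```

The divergence arises because des(A) weights the hypotheses by pr(Kₖ|A), letting A claim credit for being a symptom of K1, while the CFM weights them by the unconditional pr(Kₖ).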
But I am reluctant to finish this book with close attention to what I see as a side-issue; I would rather finish with an account of game theory, what Robert Aumann calls "interactive decision theory". But I don't seem to have time for that, so I leave much of it in, under the label "supplements", and if you have not been round the track with Newcomb problems you may find the rest worthwhile. Preview: Newcomb problems are not decision problems.

6.3 Supplements: Newcomb Problems

The simplest sort of decision problem is depicted in Fig. 1(a), where C and ¬C represent states of affairs that may incline you one way or the other between options A and ¬A, and where your choice between those options may tend to promote or prevent the state of affairs B. Here there is no probabilistic causal influence of ±C directly on ±B, without passing through ±A. Stability of your conditional probabilities p = pr(B|A) and p′ = pr(B|¬A) as your unconditional pr(A) varies is a necessary condition for you to view ±A as a direct probabilistic influence on truth or falsity of B. Fig. 1(b) depicts the simplest "Newcomb" problem, where the direct probabilistic influence runs from ±C directly to ±A and to ±B, but there is no direct influence of ±A on ±B.

[Fig. 1. Solid/dashed arrows indicate stable/labile conditional probabilities (a, a′, b, b′, p, p′). (a) Ordinarily, acts ±A screen off deep states ±C from plain states ±B. (b) In Newcomb problems, deep states ±C screen off acts ±A from plain states ±B.]

6.3.1 "The Mild and Soothing Weed"

For smokers who see quitting as prophylaxis against cancer, preferability goes by initial des(act) as in Fig. 1(a). But there are views about smoking and cancer on which these preferences might be reversed. R. A. Fisher (1959) urged serious consideration of the hypothesis of a common inclining cause of (A) smoking and (B) bronchial cancer in (C) a bad allele of a certain gene, possessors of which have a higher chance of being smokers and of developing cancer than do possessors of the good allele (independently, given their allele). On that hypothesis, smoking is bad news for smokers but not bad for their health, being a mere sign of the bad allele, i.e., of bad health. Nor would quitting conduce to health, although it would testify to the agent's membership in the low-risk group.

In the kinematics of decision, where acts ±A and states ±B are seen as independently promoted by genetic states ±C, i.e. by presence (C) or absence (¬C) of the bad allele, the kinematical constraints on pr are the following.

Rigidity: The following are constant as c = pr(C) varies: a = pr(A|C), a′ = pr(A|¬C), b = pr(B|C), b′ = pr(B|¬C).

Independence: pr(A ∧ B|C) = ab, pr(A ∧ B|¬C) = a′b′.

Correlation: p > p′, where p = pr(B|A) and p′ = pr(B|¬A).

Indeterminacy: None of a, a′, b, b′ are 0 or 1.

Since in general pr(F|G ∧ H) = pr(F ∧ G|H)/pr(G|H), the independence and rigidity conditions imply that ±C screens off A and B from each other in the following sense.

Screening-off: pr(A|B ∧ C) = a, pr(A|B ∧ ¬C) = a′, pr(B|A ∧ C) = b, pr(B|A ∧ ¬C) = b′.

Then we have

p = [abc + a′b′(1 − c)] / [ac + a′(1 − c)],    p′ = [(1 − a)bc + (1 − a′)b′(1 − c)] / [(1 − a)c + (1 − a′)(1 − c)].

To verify this, note that we can write

p = pr(B|A) = pr(A ∧ B)/pr(A) = [pr(A|B ∧ C) pr(B|C) pr(C) + pr(A|B ∧ ¬C) pr(B|¬C) pr(¬C)] / [pr(A|C) pr(C) + pr(A|¬C) pr(¬C)],

and similarly for p′ = pr(B|¬A).

Under these constraints, preference between A and ¬A can change as pr(C) = c moves out to either end of the unit interval in thought-experiments addressing the question "What would des(A) − des(¬A) be if I found I had the bad/good allele?" (Thanks to Brian Skyrms for this.) Final p and p′ are equal to each other, to b or b′ depending on whether final c is 1 or 0. Since it is c's rise to 1 or fall to 0 that makes pr(A) rise or fall as much as it can without going off the kinematical map, the (quasi-decision) problem has two ideal solutions: mixed acts in which the final unconditional probability of A is the rigid conditional probability, a or a′, depending on whether c is 1 or 0. But p = p′ in either case, so each solution satisfies the conditions under which the dominant pure outcome (A) of the mixed act maximizes des(±A). (This is a quasi-decision problem because what is imagined as moving c is not the decision but factual information about C.) As a smoker who believes Fisher's hypothesis you are not so much trying to make your mind up as trying to discover how it is already made up. But this may be equally true in ordinary deliberation, where your question "What do I really want to do?" is often understood as a question about the sort of person you are, a question of which option you are already committed to. The diagnostic mark of Newcomb problems is a strange linkage of this question with the question of which state of nature is actual: strange because, where in ordinary deliberation any linkage is due to an influence of acts ±A on states ±B, in Newcomb problems the linkage is due to an influence, from behind the scenes, of deep states ±C on acts ±A and plain states ±B.

Ordinary decision problems are modelled kinematically by applying the rigidity condition to acts as causes: acts screen off deep states from plain ones, in the sense that B is conditionally independent of ±C given ±A. In ordinary deliberation the direct effect of deep states is wholly on acts, which mediate any further effect on plain states: while it is variation in c that makes pr(A) and pr(B) vary, the whole of the latter variation is accounted for by the former (Fig. 1a). This difference explains why deep states ("the sort of person I am") can be ignored in ordinary decision problems. But in Newcomb problems deep states must be considered explicitly, for they directly affect plain states as well as acts (Fig. 1b); to model Newcomb problems kinematically we apply the rigidity condition to the deep states, which screen off acts from plain states. In the kinematics of decision the dynamical role of forces can be played by acts or deep states, depending on which of these is thought to influence plain states directly. In Fig. 1b the stable a's and b's shape the labile p's as we have seen above, while in Fig. 1a the probabilities b and b′ vary with c in ways determined by the stable a's and p's.
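The kinematical claims above can be checked numerically: holding the rigid conditionals a, a′, b, b′ fixed while c = pr(C) varies, the labile p and p′ follow the displayed formulas, correlation (p > p′) holds at intermediate c, and as c goes to 1 or 0 the two converge to the common value b or b′, wiping out the correlation, exactly as in the two "ideal solutions". The a and b values below are invented for illustration.

```python
# Fisher-hypothesis kinematics.  Rigid: a = pr(A|C), a_ = pr(A|~C),
# b = pr(B|C), b_ = pr(B|~C).  Labile: p = pr(B|A), p_ = pr(B|~A),
# as functions of c = pr(C).  Illustrative values for the rigid conditionals:
a, a_, b, b_ = 0.8, 0.3, 0.7, 0.2   # bad allele raises pr(smoke) and pr(cancer)

def p(c):    # pr(B|A), with C screening off A from B
    return (a*b*c + a_*b_*(1 - c)) / (a*c + a_*(1 - c))

def p_(c):   # pr(B|~A)
    return ((1-a)*b*c + (1-a_)*b_*(1 - c)) / ((1-a)*c + (1-a_)*(1 - c))

for c in (0.1, 0.5, 0.9):
    assert p(c) > p_(c)                       # correlation at intermediate c
assert abs(p(1e-9) - p_(1e-9)) < 1e-6         # c -> 0: both approach b_
assert abs(p(1 - 1e-9) - p_(1 - 1e-9)) < 1e-6 # c -> 1: both approach b
```

So the apparent act-to-outcome correlation is entirely an artifact of uncertainty about the deep state: once c is settled at either end of the map, p = p′ and dominance reasoning is safe.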
Similarly, in Fig. 1a the labile probabilities are

b = [apc + a′p′(1 − c)] / [ac + a′(1 − c)],    b′ = [(1 − a)pc + (1 − a′)p′(1 − c)] / [(1 − a)c + (1 − a′)(1 − c)].

While C and ¬C function as causal hypotheses, they do not announce themselves as such, even if we identify them by the causal rôles they are meant to play, as when we identify the "bad" allele as the one that promotes cancer and inhibits quitting. Fisher was talking about hypotheses that further research might specify, hypotheses he could only characterize in causal and probabilistic terms: terms like "malaria vector" as used before 1898, when the anopheles mosquito was shown to be the organism playing that aetiological rôle. If there is such an allele, it is a still unidentified feature of human DNA. But if Fisher's science fiction story had been verified, with C and ¬C spelled out as technical specifications of alternative features of the agent's DNA, the status of those biochemical hypotheses as the agent's causal hypotheses would have been shown by satisfaction of the rigidity conditions, i.e., constancy of pr(−|C) and of pr(−|¬C), and by truth of the following consequences of the kinematical constraints: that they screen acts off from states, i.e., pr(B|act ∧ C) = pr(B|C) and pr(B|act ∧ ¬C) = pr(B|¬C). Probabilistic features of those biochemical hypotheses would not be stated in the hypotheses themselves, but would be shown by interactions of those hypotheses with pr. No purpose would be served by packing such announcements into the hypotheses themselves, for even if true, such announcements would be redundant; the causal talk, however useful as commentary, does no work in the matter commented upon.⁸

6.3.2 The Flagship Newcomb Problem

Nozick's (1969) flagship Newcomb problem resolutely fends off naturalism about deep states, making a mystery of the common inclining cause of acts and plain states while suggesting that the mystery could be cleared up in various ways, pointless to elaborate. Nozick (1969) begins:

Suppose a being [call her 'Alice'] in whose power to predict your choices you have enormous confidence. (One might tell a science-fiction story about a being from another planet, with an advanced technology and science, who you know to be friendly, and so on.) You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below. One might tell a longer story, but all this leads you to believe that almost certainly this being's prediction about your choice in the situation to be discussed will be correct. There are two boxes . . .

Alice has surely put $1,000 in one box, and (B) she has put $1,000,000 in the second box or (¬B) she has left it empty, depending on whether she predicts that you will (A₁) take only the second box, or that you will (A₂) take both. (In the previous notation, A₂ = A and A₁ = ¬A.) Here you are to imagine yourself in a probabilistic frame of mind where your desirability for A₁ is greater than that for A₂ because, although you think A's truth or falsity has no influence on B's, your conditional probability p = pr(B|A₁) is ever so close to 1 and your p′ = pr(B|A₂) is ever so close to 0.

Does that seem a tall order? Not to worry! High correlation is a red herring; a tiny bit will do: if desirabilities are proportional to dollar payoffs, the 1-box option maximizes desirability as long as p − p′ is greater than .001.

To see how that might go numerically, think of the choice and the prediction as determined by independent drawings by the agent and the predictor from the same urn, which contains tickets marked '2' and '1' in an unknown proportion x : 1 − x. Initially, the agent's unit of probability density over the range [0, 1] of possible values of x is flat (Fig. 2a). It can remain flat, but in time it can push toward one end of the unit interval or the other, as in Fig. 2(b) or (c).

[Fig. 2. Two patterns of density migration. (a) Initial: flat density. (b) Push toward x = 1: density (t + 1)xᵗ. (c) Push toward x = 0: density (t + 1)(1 − x)ᵗ.]

In this kinematical map⁹ pr(A) = (t + 1)/(t + 3) and pr(B|A) = (t + 2)/(t + 3), and higher values of t will make des A₁ − des A₂ positive. At t = 997 these densities determine the probabilities and desirabilities in Table 3(b) and (c). Then if t is calibrated in thousandths of a minute, the map of Fig. 2(b) has the agent preferring the 2-box option after a minute's deliberation.

Deterministic 1-box solution. The density functions of Fig. 2 are replaced by probability assignments r and 1 − r to the possibilities that the ratio of 2-box tickets to 1-box tickets in the urn is 1:0 or 0:1, i.e., to the two ways in which the urn can control the choice and the prediction deterministically and in the same way. In place of the smooth density spreads of Fig. 2 we now have point-masses r and 1 − r at the two ends of the unit interval. Then taking both boxes means a thousand; taking just one means a million. This means that the "best" and "worst" cells in the payoff table have unconditional probability 0. Now the 1-box option is preferable throughout deliberation, with desirabilities of the two acts constant as long as r is neither 0 nor 1, i.e., as long as p − p′ is neither 0 nor 1, and preference between acts is clear.¹⁰

[Table 4. Deterministic 1-box solution. Payoff cells "best", "good", "bad", "worst" under (a) Indecision: 0 < r < 1; (b) Decision: pr(2) = 1; (c) Decision: pr(1) = 1.]

But of course this reasoning uses the premise that pr(predict 2|take 2) − pr(predict 2|take 1) = β = 1 throughout deliberation, up to the very moment of decision (see Jeffrey 1988): a premise making abstract sense in terms of uniformly stocked urns, but very hard to swallow as a real possibility. The urn model leaves the deep state mysterious, but clearly specifies its mysterious impact on acts and plain states. And the irrelevant detail of high correlation, or high K = pr(B|A) − pr(B|¬A) (obtained if K is not just high but maximum, which happens when p = 1 and p′ = 0), was a bogus shortcut to the 1-box conclusion.

⁸See Joyce (2002); also Leeds (1984) in another connection.

⁹In this kinematical map pr(A) = ∫₀¹ x^(t+1) f(x) dx and pr(B|A) = ∫₀¹ x^(t+2) f(x) dx / pr(A), with f(x) as in (b).

¹⁰At the moment of decision the desirabilities of the shaded rows in (b) and (c) are not determined by ratios of unconditional probabilities, since pr(shaded) = 0; but continuity considerations suggest that they remain good and bad, respectively.
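The .001 threshold can be seen directly from the payoffs. The sketch below takes desirabilities proportional to the dollar amounts in Nozick's story; the function names are mine, not the text's.

```python
# One box or two, with desirabilities proportional to dollar payoffs.
# p  = pr(million in the opaque box | take one box)
# pp = pr(million in the opaque box | take both boxes)
M, K = 1_000_000, 1_000

def des_one(p):
    return p * M            # expected dollars from taking only the opaque box

def des_two(pp):
    return K + pp * M       # sure $1,000 plus expected opaque-box contents

# 1-boxing wins exactly when p - pp exceeds K/M = .001; high correlation
# is a red herring, a tiny difference suffices.
for p, pp in [(0.9, 0.1), (0.502, 0.5), (0.5005, 0.5)]:
    assert (des_one(p) > des_two(pp)) == (p - pp > K / M)
```

So the 1-box recommendation of the evidential figure of merit needs only p − p′ > .001, which is why the chapter treats near-perfect prediction as an irrelevant dramatization.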
6.3.3 Hofstadter

Hofstadter (1983) saw prisoners' dilemmas as down-to-earth Newcomb problems. Call the prisoners Alma and Boris. If one confesses and the other does not, the confessor goes free and the other serves a long prison term. If neither confesses, both serve short terms; if both confess, both serve intermediate terms. The prisoners are thought to share a characteristic determining their separate probabilities of confessing in the same way, independently, on each hypothesis about that characteristic. From Alma's point of view, Boris's possible actions (B, ¬B: confess, don't) are states of nature. She thinks they think alike, so that her choices (A, ¬A: confess, don't) are pretty good predictors of his, even though neither's choices influence the other's. If both care only to minimize their own prison terms, then with utilities 0, 1, 10, 11 instead of 0, 1, 1000, 1001, indifference between confessing and remaining silent now comes at p − p′ = 1/10. (And here ratios of utilities are generally taken to be on the order of 10:1 instead of the 1000:1 ratios that made the other endgame so demanding.)

Hofstadter takes the shared characteristic to be rationality, and compares the prisoners' dilemma to the problem Alma and Boris might have faced as bright children, independently working the same arithmetic problem, whose knowledge of each other's competence and ambition gives them good reason to expect their answers to agree before either knows the answer: "If reasoning guides me to [. . .], then, since I am no different from anyone else as far as rational thinking is concerned, it will guide everyone to [. . .]."

The deep states seem less mysterious here than in the flagship Newcomb problem: they are simply C = We are both likely to get the right answer, and ¬C = We are both unlikely to get the right answer. To heighten similarity to the prisoners' dilemma, let us suppose the required answer is the parity of x, even or odd; here the deep states have some such form as Cₓ = We are both likely to get the right answer, x.

What's wrong with Hofstadter's view of this as justifying the coöperative solution? [And with von Neumann and Morgenstern's (p. 148) transcendental argument, remarked upon by Skyrms (1990, pp. 13-14), for expecting rational players to reach a Nash equilibrium?] The answer is failure of the rigidity conditions for acts, i.e., variability of pr(He gets x | I get x) with pr(I get x) in the decision maker's kinematical map. It is Alma's conditional probability functions pr(− | ±C) rather than pr(− | ±A) that remain constant as her probabilities for the conditions vary. The implausibility of initial des(act) as a figure of merit for her act is simply the implausibility of positing constancy of pr(B|A) and pr(B|¬A) as her probability function pr evolves in response to changes in pr(A).
But the point is not that confessing is the preferable act. It is rather that Alma's problem is not indecision about which act to choose, but ignorance of which deep state is moving her.

6.3.4 Conclusion

The flagship Newcomb problem owes its bizarrerie to the straightforward character of the pure acts: surely you can reach out and take both boxes, or just the opaque box, as you choose. Then, as the pure acts are options, you cannot be committed to either of the non-optional mixed acts. But in the Fisher problem, those of us who have repeatedly "quit" easily appreciate the smoker's dilemma as humdrum entrapment in some mixed act; here you do see yourself as already committed, willy-nilly, to one of the pure acts, but ignorant of which. Quitting and continuing are not options. (Translation: "Given your present position on your kinematical map, at places on the map where pr(C) is at or near 0 or 1, pr A ≈ 0 and pr A ≈ 1 are not destinations you think you can choose, although you may eventually find yourself at one of them.") The reason is your conviction that if you knew your genotype, your value of pr(A) would be either a or a′, neither of which is ≈ 0 or ≈ 1. The extreme version of the story, with a ≈ 1 and a′ ≈ 0, is more like the flagship Newcomb problem. That the details of the entrapment are describable as cycles of temptation, resolution and betrayal makes the history no less believable, only more petty.

Hofstadter's (1983) version of the prisoners' dilemma and the flagship Newcomb problem have been analyzed here as cases where plausibility demands a continuum [0, 1] of possible deep states, with opinion evolving as smooth movements of probability density toward one end or the other draw probabilities of possible acts along toward 1 or 0. The problem of the smoker who believes Fisher's hypothesis was simpler, in that only two possibilities (C, ¬C) were allowed for the deep state, neither of which determined the probability of either act as 0 or 1.

Elsewhere I have accepted Newcomb problems as decision problems, and accepted "2-box solutions" as correct.¹² I have argued that Newcomb problems are like Escher's famous staircase on which an unbroken ascent takes you back where you started:¹¹ each step makes sense, but there is no way to make sense of the whole picture, though we see no local flaw.

¹¹We know there can be no such things; but see Escher (1989), p. 78: 'Ascending and Descending' (lithograph, 1960), based on Penrose and Penrose (1958).

¹²In Jeffrey 1983, sec. 1.7 and 1.8, I proposed a new criterion of acceptability of an act, "ratifiability", which proved to break down in certain cases. In Jeffrey (1988) and (1993) ratifiability was recast in terms more like the present ones, but still treating Newcomb problems as decision problems, a position which I now renounce.
In writing The Logic of Decision in the early 1960's I failed to see Bob Nozick's deployment of Newcomb's problem in his Ph.D. dissertation (1963) as a serious problem for the theory floated in chapter 1, and failed to see the material on rigidity in chapter 11 as solving that problem. It took a long time for me to take the "causal" decision theorists (Stalnaker, Gibbard and Harper, Lewis, Skyrms, Sobel) as seriously as I needed to in order to appreciate that my program of strict exclusion of causal talk would be an obstacle for me, an obstacle that can be overcome by recognizing causality as a concept that emerges naturally when we patch together the differently dated or clocked probability assignments that arise in formulating and then solving a decision problem. I now conclude that in Newcomb problems "One box or two?" is not a question about how to choose, but about what you are already set to do, willy-nilly. Newcomb problems are not decision problems.

6.4 References

Frank Arntzenius, 'Physics and common causes', Synthese 82 (1990) 77-96.

Kenneth J. Arrow et al. (eds.), Rational Foundations of Economic Behaviour, New York: St. Martin's (1996).

Ethan Bolker, Functions Resembling Quotients of Measures, Ph.D. dissertation, Harvard University (1965).

Ethan Bolker, 'Functions resembling quotients of measures', Transactions of the American Mathematical Society 124 (1966) 292-312.

Ethan Bolker, 'A simultaneous axiomatization of utility and subjective probability', Philosophy of Science 34 (1967) 333-340.

Donald Davidson, Essays on Actions and Events, Oxford: Clarendon Press (1980).

Bruno de Finetti, 'Probabilismo', Naples: Perrella, 1931.

Maurits Cornelius Escher, Escher on Escher, New York: Abrams, 1989.

Maurits Cornelius Escher, 'Ascending and Descending' (lithograph, 1960), based on Penrose and Penrose (1958).

James H. Fetzer (ed.), Probability and Causality, Dordrecht: Reidel, 1988.

Ronald Fisher, Smoking: The Cancer Controversy, Edinburgh and London: Oliver and Boyd, 1959.

Haim Gaifman, 'A theory of higher-order probability', in Brian Skyrms and William Harper (eds.), Causation, Chance, and Credence, Dordrecht: Reidel, 1988.

Richard Jeffrey, The Logic of Decision, New York: McGraw-Hill, 1965; 2nd ed., Chicago: University of Chicago Press, 1983; further revised paperback 2nd ed., 1990.

Richard Jeffrey, 'Risk and human rationality', The Monist 70 (1987).

Richard Jeffrey, 'How to Probabilize a Newcomb Problem', in Fetzer (ed.) (1988).

Richard Jeffrey, 'Decision Kinematics', in Arrow et al. (eds.) (1996) 3-19.

James M. Joyce, 'Levi on causal decision theory and the possibility of predicting one's own actions', Philosophical Studies 110 (2002) 69-102.

Steven Leeds, 'Chance, realism, quantum mechanics', Journal of Philosophy 81 (1984) 567-578.

George W. Mackey, Mathematical Foundations of Quantum Mechanics, New York: Benjamin, 1963.

Robert Nozick, The Normative Theory of Individual Choice, Ph.D. dissertation, Princeton University (1963).

Robert Nozick, 'Newcomb's problem and two principles of choice', in Nicholas Rescher (ed.), Essays in Honor of Carl G. Hempel, Dordrecht: Reidel (1969).

Robert Nozick, photocopy of Nozick (1963) with new preface, New York: Garland, 1990.

L. S. Penrose and R. Penrose, 'Impossible objects: a special type of visual illusion', The British Journal of Psychology 49 (1958) 31-33.

Wlodek Rabinowicz, 'Does practical deliberation crowd out self-prediction?', Erkenntnis 57 (2002) 91-122.

Brian Skyrms, Causal Necessity, New Haven: Yale University Press (1980).

John von Neumann and Oskar Morgenstern, Theory of Games and Economic Behavior, Princeton: Princeton University Press, 1943; 2nd ed. 1947.