Probabilistic thinking was a mid-17th century artifact originating in a famous correspondence between Fermat and Pascal -- a correspondence on which Huygens based a widely read textbook: On Calculating in Games of Luck (1657). The probabilistic framework didn't exist until those people cobbled it together. It remains in use today, much as in Huygens's book.

Within the framework, we make up our minds by adopting probability functions -- or, anyway, features of such functions. This is not a matter of eliciting what is already in mind; rather, it is a matter of artifice, a family of arts of probabilistic judgment. And that includes making up our minds about how to change our minds. With the founders (so I think), and certainly with Ramsey and de Finetti, who revived the view of probability as a mode of judgment, I see the probability calculus as a logic. Mere logic does not tell us how to make up our minds. It does help us spot inconsistencies within projected mental makeups, but fails to underwrite the further gradations we make in classifying survivors of the first cut as reasonable or not, or in grading them as more or less reasonable. These finer distinctions separate cases on the basis of standards rooted in our actual appreciation of ongoing methodological experience.

The basic ideas, floated in 1 and 3, are applied in 2 and 4 to troubling questions about scientific method and practical decision-making. The question of normativity is addressed in 5. I'd be glad to have corrigenda and other suggestions.

Richard Jeffrey
7 Dec 99

Please write to with any comments or suggestions.

"Yes or no: was there once life on Mars?" I can't say. "What about intelligent life?" That seems most unlikely, but again, I can't really say. The simple yes-or-no framework has no place for shadings of doubt; no room to say that I see intelligent life on Mars as far less probable than life of a possibly very simple sort. Nor does it let me express exact probability judgments, if I have them. We can do better.

1.1 Bets and Probabilities What if I were able to say exactly what odds I'd give on there having been life, or intelligent life, on Mars? That would be a more nuanced form of judgment, and perhaps a more useful one. Suppose my odds were 1:9 for life, and 1:999 for intelligent life, corresponding to probabilities of 1/10 and 1/1000, respectively; odds m:n correspond to probability m/(m+n). That means I'd see no special advantage for either player in risking one dollar to gain nine in case there was once life on Mars; and it means I'd see an advantage on one side or the other if those odds were shortened or lengthened. And similarly for intelligent life on Mars when the risk is 1 thousandth of the same ten dollars (1¢) and the gain is 999 thousandths ($9.99). Here is another way of saying the same thing: I'd think a price of one dollar just right for a ticket worth ten if there was life on Mars and nothing if there wasn't, but I'd think a price of only one cent right if there has to have been intelligent life on Mars for the ticket to be worth ten dollars.
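To make the odds-to-price bookkeeping concrete, here is a minimal Python sketch; the function names are mine, not the text's, and the numbers are the Mars figures from the paragraph above.

```python
from fractions import Fraction

def odds_to_probability(m, n):
    """Odds of m:n on a hypothesis correspond to probability m/(m+n)."""
    return Fraction(m, m + n)

def fair_price(m, n, ticket_value):
    """Fair price for a ticket worth `ticket_value` if the hypothesis is true,
    nothing otherwise, at odds m:n."""
    return odds_to_probability(m, n) * ticket_value

# Odds 1:9 for life on Mars: probability 1/10, so a $10 ticket is worth $1.
print(fair_price(1, 9, 10))           # 1
# Odds 1:999 for intelligent life: probability 1/1000, so the price is one cent.
print(float(fair_price(1, 999, 10)))  # 0.01
```

Exact fractions are used so that prices like 1/1000 of $10 come out exactly, with no rounding.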

So if I have an exact judgmental probability for truth of a hypothesis, it corresponds to my idea of the right price for a ticket worth 1 unit or nothing depending on whether the hypothesis is true or false. (For the life on Mars ticket the unit was $10; the price was a tenth of that.) Of course I have no exact judgmental probability for there having been life on Mars, or intelligent life there. Still, I know that any probabilities anyone might think acceptable for those two hypotheses ought to satisfy certain rules, e.g., that the first can't be less than the second. That's because the second hypothesis implies the first: see the implication rule in sec. 3 below. Another such rule, for 'not': the probabilities that a hypothesis is and is not true must add to 1. In sec. 2 we'll turn to the question of what the laws of judgmental probability are, and why. Meanwhile, take some time with these questions, as a way of getting in touch with some of your own ideas about probability. Afterward, read the discussion that follows.


1 A vigorously flipped thumbtack will land on the sidewalk. Is it reasonable for you to have a probability for the hypothesis that it will land point up?

2 An ordinary coin is to be tossed twice in the usual way. What is your probability for the head turning up both times -- (a) 1/3, because 2 heads is one of three possibilities: 2 heads, 1 head, 0 heads? (b) 1/4, because 2 heads is one of four possibilities: HH, HT, TH, TT?

3 There are three coins in a bag: ordinary, two-headed, and two-tailed. One is shaken out onto the table and lies head up. What should be your probability that it's the two-headed one--

(a) 1/2, since it can only be two-headed or normal? (b) 2/3, because the other side could be the tail of the normal coin, or either side of the two-headed one?

4 "It's a goy!" (a) As you know, about 49% of recorded human births have been girls. What's your judgmental probability that the first child born in the 21st century will be a girl? (b) A goy is defined as a girl born before the beginning of the 21st century or a boy born thereafter. As you know, about 49% of recorded human births have been goys. What is your judgmental probability that the first child born in the 21st century will be a goy?


1 Surely it is reasonable to suspect that the geometry of the tack gives one of the outcomes a better chance of happening than the other; but if you have no clue about which of the two has the better chance, it may well be reasonable to have judgmental probability 1/2 for each. Evidence about the chances might be given by statistics on tosses of similar tacks, e.g., if you learned that in 20 tosses there were 6 "up"s you might take the chance of "up" to be in the neighborhood of 30%; and whether or not you do that, you might well adopt 30% as your judgmental probability for "up" on the next toss.

2, 3. These questions are meant to undermine the impression that judgmental probabilities can be based on analysis into cases in a way that doesn't already involve probabilistic judgment (e.g., the judgment that the cases are equiprobable). In either problem you can arrive at a judgmental probability by trying the experiment (or a similar one) often enough, and seeing the statistics settle down close enough to 1/2 or to 1/3 to persuade you that more trials won't reverse the indications. In each of these problems it's the finer of the two suggested analyses that makes more sense; but any analysis can be refined in significantly different ways, and there's no point at which the process of refinement has to stop. (Head or tail can be refined to head-facing-north or head-not-facing-north or tail.) Indeed some of these analyses seem more natural or relevant than others, but that reflects the relevance of probability judgments that you bring with you to the analyses.
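The finer analyses in problems 2 and 3 can be checked by enumeration. The point to notice is that the equiprobability of the enumerated cases is itself a probability judgment fed into the count, not a product of it; the Python below is my own illustration, not part of the text.

```python
from fractions import Fraction
from itertools import product

# Problem 2: judge the four ordered outcomes HH, HT, TH, TT equiprobable.
tosses = list(product("HT", repeat=2))
p_two_heads = Fraction(sum(t == ("H", "H") for t in tosses), len(tosses))
print(p_two_heads)  # 1/4

# Problem 3: judge each (coin, visible face) pair equiprobable, then
# condition on the visible face being a head.
coins = {"ordinary": ["H", "T"], "two-headed": ["H", "H"], "two-tailed": ["T", "T"]}
head_up = [(name, face) for name, faces in coins.items() for face in faces
           if face == "H"]
p_two_headed = Fraction(sum(name == "two-headed" for name, _ in head_up),
                        len(head_up))
print(p_two_headed)  # 2/3
```

Choosing a different case analysis (say, coins rather than coin-face pairs) would give the coarser answers 1/3 and 1/2, which is exactly the moral of the two problems.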

4. Goys and birls.

This question is meant to undermine the impression that judgmental probabilities can be based on frequencies in a way that doesn't already involve judgmental probabilities. Since all girls born so far have been goys, the current statistics for girls apply to goys as well: these days, about 49% of human births are goys. Then if you read probabilities off statistics in a straightforward way your probability will be 49% for each hypothesis: (1) the first child born in the 21st century will be a girl; and (2) the first child born in the 21st century will be a goy. Thus P(1)+P(2) = 98%. But it's clear that those probabilities should sum to 1, since (2) is logically equivalent to (3) the first child born in the 21st century will be a boy, and P(1)+P(3) = 100%. Contradiction. What you must do is decide which statistics are relevant: the 49% of girls or the 51% of boys. That's not a matter of statistics but of judgment -- no less so because we'd all make the same judgment, P(2) = 51%.

1.2 Why Probabilities are Additive Authentic tickets of the Mars sort are hard to come by. Is the first of them really worth $10 to me if there was life on Mars? Probably not. If the truth isn't known in my lifetime, I can't cash the ticket even if it's really a winner. But some probabilities are plausibly represented by prices, e.g., probabilities of the hypotheses about athletic contests and lotteries that people commonly bet on. And it is plausible to think that the general laws of probability ought to be the same for all hypotheses -- about planets no less than about ball games. If that's so, we can justify laws of probability if we can prove all betting policies that violate them to be inconsistent. Such justifications are called "Dutch book arguments." (In racing jargon your book is the set of bets you've accepted, and a book against you -- a Dutch book -- is one on which you inevitably suffer a net loss.) We now give a Dutch book argument for the requirement that probabilities be additive in this sense:

Finite Additivity. The probability of any hypothesis is the sum of the probabilities of the cases in which it is true, provided there is only a finite number of cases, incompatible and exhaustive.

Example 1. The probability p of the hypothesis (H) A woman will be elected is q+r+s if exactly three of the candidates are women, and their probabilities of winning are q, r and s. In the following diagram, A, B, C, D,... are the hypotheses that the various different candidates win; the first three are the women in the race.

Proof. For definiteness, we suppose that the hypothesis in question is true in three cases as in the example. The argument differs inessentially for other examples, with other finite numbers of cases. Now consider the following array of tickets. Suppose I am willing to buy or sell any or all of these tickets at the stated prices. Why should p be the sum q+r+s? Because no matter what it's worth -- $1 or $0 -- the ticket on H is worth exactly as much as the tickets on A, B and C together. (If H loses it's because A, B and C all lose; if H wins it's because exactly one of A, B, C wins.) Then if the price of the H ticket is different from the sum of the prices of the other three, I am inconsistently placing different values on one and the same contract, depending on how it is described.


If I am inconsistent in that way, I can be fleeced by anyone who'll ask me to sell the H ticket and buy the other three (in case p is less than q+r+s) or buy the H ticket and sell the other three (in case p is more). Thus, no matter whether the equation p = q+r+s fails because the left-hand side is less than the right or more, a book can be made against me. That's the Dutch book argument for additivity when the number of ultimate cases under consideration is finite. The talk about being fleeced is just a way of dramatizing the inconsistency of any policy in which the dollar value of the ticket on H is anything but the sum of the values of the other three tickets: to place a different value on the three tickets on A, B, C from the value you place on the H ticket is to place different values on the same commodity bundle under two demonstrably equivalent descriptions.

When the number of cases is infinite, a Dutch book argument for additivity can still be given -- provided the infinite number is not too big! It turns out that not all infinite sets are the same size. The smallest infinite sets are said to be "countable." A countable set is one whose members can be listed: first, second, etc., with each member of the set appearing as the n'th item for some finite n. Of course any finite set is countable in this sense, and some infinite sets are countable. An obvious example of a countably infinite set is the set { 1, 2, 3, ... } of the positive whole numbers. A less obvious example is the set { ... , -2, -1, 0, 1, 2, ... } of all the whole numbers; it can be rearranged in a list (with a beginning): 0, 1, -1, 2, -2, 3, -3, ... . Then it is countable. Order doesn't matter, as long as they're all in the list. But there are uncountably infinite sets, too (example 3).
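The listing trick for the whole numbers can be written down as a generator; this small Python sketch (the names are mine) makes the "each member appears as the n'th item for some finite n" claim mechanical.

```python
from itertools import islice

def all_integers():
    """Yield the whole numbers in the order 0, 1, -1, 2, -2, 3, -3, ...
    Every integer shows up at some finite position, so the set is countable."""
    yield 0
    n = 1
    while True:
        yield n
        yield -n
        n += 1

print(list(islice(all_integers(), 7)))  # [0, 1, -1, 2, -2, 3, -3]
```

Any positive n appears at position 2n-1 and -n at position 2n (counting from 0), so no integer is missed even though the list has a beginning and no end.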

Example 2. In the election example, suppose there were an endless list of candidates, including no end of women. If H says that a woman wins, and A1, A2, etc., identify the winner as the first, second, etc. woman, then an extension of the finite additivity law to countably infinite sets would be as follows, with no end of terms on the right. P(H) = P(A1) + P(A2) + ... Thus, if the probability of a woman's winning were 1/2, and the probabilities of winning for the first, second, third, etc. woman were 1/4, 1/8, 1/16, etc. (decreasing by half each time), the equation would be satisfied.
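In Example 2 the stated probabilities are P(An) = 1/2^(n+1): 1/4, 1/8, 1/16, and so on. A quick exact-arithmetic check (my own, using Python's fractions) shows the partial sums closing in on P(H) = 1/2, with the remainder after 50 terms being exactly 1/2^51:

```python
from fractions import Fraction

# Partial sum of P(A1) + ... + P(A50), where P(An) = 1/2^(n+1).
partial = sum(Fraction(1, 2 ** (n + 1)) for n in range(1, 51))
print(partial == Fraction(1, 2) - Fraction(1, 2 ** 51))  # True
print(float(partial))  # very nearly 0.5
```

No finite partial sum reaches 1/2; countable additivity is the requirement that the full infinite sum does.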

Dutch book argument for additivity in the countably infinite case. Whatever my probabilities P(An) may be, if they don't add up to P(H) there will be an infinite set of $1 bets on truth of A1, A2, ... separately, on which my net gain will surely be the same as my gain from betting $1 on truth of H. (Note that this infinity of bets can be arranged by a finite contract: "In consideration of $1 paid in advance, Bookmaker hereby undertakes to pay Bettor the amount $P(the true one) when the true one has been identified.") This will be a Dutch book if the sum P(A1) + P(A2) + ... is greater or less than P(H)--against me if it's greater, against the bookmaker if it's less.

Summarizing, the following additivity law holds for any countable set of alternatives, finite or infinite.

Countable Additivity. If the possible cases are countable, the probability of a hypothesis is the sum of the probabilities of the cases in which it is true.

Example 3. Cantor's Diagonal Argument. The collection of all sets of positive whole numbers is not enumerable. For, given any list N1, N2, ..., there will be a "diagonal" set D consisting of the positive whole numbers n that do not belong to the corresponding sets Nn in the list. For example, suppose the first two entries in the list are N1 = the odd numbers = {1, 3, ...}, and N2 = the powers of 10 = {1, 10, ...}. Then it is false that D = N1, because 1 is in N1 but not in D; and it is false that D = N2, because 2 is in D but not in N2. (For it to be true that D = N2 it must be that each number is in both D and N2 or in neither.) In general, D cannot be anywhere in the list N1, N2, ... because by definition of D, each positive whole number n is in one but not the other of D and Nn.
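The diagonal construction can be animated for the two listed sets of the example; for the rest of the list I arbitrarily use the multiples of i (that choice is mine, purely for illustration -- any list would do).

```python
def N(i, n):
    """Is n a member of the i-th listed set? N1 = the odds, N2 = the powers
    of 10, and (arbitrarily) Ni = the multiples of i for i >= 3."""
    if i == 1:
        return n % 2 == 1
    if i == 2:
        while n % 10 == 0:
            n //= 10
        return n == 1
    return n % i == 0

def D(n):
    """Diagonal set: n is in D exactly when n is NOT in the n-th listed set."""
    return not N(n, n)

# D differs from each N_i at position i: i belongs to exactly one of them.
for i in range(1, 20):
    assert D(i) != N(i, i)
print(D(1), D(2))  # False True: 1 is not in D, 2 is, matching the text
```

The assertion in the loop is the whole argument in miniature: whatever set sits at position i, D disagrees with it about i itself.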

1.3 Laws of Probability The simplest laws of probability are the consequences of additivity under this assumption:

Probabilities are real numbers in the unit interval, 0 to 1, with the endpoints reserved for certainty of falsity and of truth, respectively.

This makes it possible to read laws of probability off diagrams, much as we read laws of logic off them. Let's pause to recall how that works for laws of logic. Example:

De Morgan's Law. -(G&H) = -Gv-H

Here the symbols '-', '&' and 'v' stand for not, and, and or. Thus, if G is the hypothesis that the water is green, and H is the hypothesis that it's hot, then G&H is the hypothesis that it's green and hot, GvH is the hypothesis that it's green or hot (not excluding the possibility that it's both), and -G and -H are the hypothesis that it's not green, and not hot. Here is a diagram for De Morgan's law.

Stippled: -Gv-H

Points in such diagrams stand for the ultimate cases -- say, complete possible courses of events, each specified in enough detail to make it clear whether each of the hypotheses under consideration is true or false in it. The cases where G and H are both true are represented by points in the upper left-hand corner; that's the G&H region. The cases where at least one of G, H is true make up the GvH region, which covers everything but the points in the lower right-hand corner, where G and H are both false (-G&-H). And so on. In general, the merger of two regions covers the cases where one hypothesis or the other is true, and the intersection of two regions covers the cases where both hypotheses are true.

Now in the diagram for De Morgan's law, above, the stippled region covers what's outside the G&H corner; then it represents the denial -(G&H) of G&H. At the same time it represents the merger (-Gv-H) of the lower region, where G is false, with the right-hand region, where H is false. So the law says: denying that G and H are both true, -(G&H), is the same as (=) asserting that G is false or H is, -Gv-H.

Adapting that sort of thing to probabilistic reasoning is just a matter of thinking of the probability of a hypothesis as its region's fraction of the area of the whole diagram. Of course the fraction for the whole Hv-H rectangle is 1, and the fraction for the empty H&-H region is 0. It's handy to be able to denote those two in neutral ways. Let's call them 1 and 0:

The Whole Rectangle: 1 = Hv-H = Gv-G etc. The Empty Region: 0 = H&-H = G&-G etc.

Now let's read a couple of probability laws off diagrams.

Addition. P(GvH) = P(G)+P(H)-P(G&H)

Proof. The GvH area is the G area plus the H area, except that when you simply add, you count the G&H bit twice. So subtract it on the right-hand side.

Subtraction. P(G&-H) = P(G)-P(G&H)

Proof. The G&-H region is what remains of the G strip after you delete the G&H region.

We will often abbreviate by dropping ampersands (&), e.g., writing the subtraction law as follows.

Subtraction. P(G-H) = P(G)-P(GH)

Solving that for P(G), we have the rule of

Analysis. P(G) = P(GH)+P(G-H)

In general, there is a rule of n-adic analysis for each n, e.g., for n=3:

P(G) = P(GH1)+P(GH2)+P(GH3), where H1, H2, H3 are incompatible and exhaustive.

You can verify the next two rules on your own, via diagrams.

Not. P(-D) = 1-P(D)

If. P(Hv-D) = P(DH)+P(-D)

In the second, 'H if D' is understood truth-functionally, i.e., as synonymous with 'H, unless not D': H or not D.

The idea is that saying "If D then H" is a guarded way of saying "H", for in case 'D' is false, the "if" statement makes no claim at all -- about "H" or anything else.
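Both the "If" rule and the contrast that sec. 1.6 will draw -- the truth-functional "if" is not the conditional probability -- can be checked on a toy distribution. The four case-probabilities below are invented for illustration.

```python
from fractions import Fraction

# Invented probabilities for the four truth-value cases of (D, H).
P = {(True, True): Fraction(3, 10), (True, False): Fraction(1, 10),
     (False, True): Fraction(2, 10), (False, False): Fraction(4, 10)}

def prob(pred):
    """Probability of the set of cases satisfying pred(d, h)."""
    return sum(p for case, p in P.items() if pred(*case))

# The "If" rule: P(H v -D) = P(DH) + P(-D).
lhs = prob(lambda d, h: h or not d)
rhs = prob(lambda d, h: d and h) + prob(lambda d, h: not d)
assert lhs == rhs  # both 9/10 here

# Contrast: the conditional probability P(H | D) is a different number.
cond = prob(lambda d, h: d and h) / prob(lambda d, h: d)
print(lhs, cond)  # 9/10 vs 3/4
```

Here the guarded "H if D" gets probability 9/10 while P(H | D) is only 3/4; the two notions come apart whenever P(-D&-H) is positive.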

The next rule is an immediate consequence of the fact that logically equivalent hypotheses, e.g., -(GH) and -Gv-H, are always represented by the same region of the diagram.

Equivalence. Logically equivalent hypotheses are equiprobable.

That fact is also presupposed when we write '=' to indicate logical equivalence. Thus, since -(GH) = -Gv-H, the probability of the one must be the same as the probability of the other, for the one is the other. Recall that to be implied by G, H must be true in every case in which G is true, not just in the actual case. (In other words, the conditional "H if G" must be valid: true as a matter of logic.) Then the G region must lie entirely inside the H region. This gives us the following rule.

Implication. If G implies H, then P(G) ≤ P(H).

1.4 Conditional Probability Just as we identified your ordinary (unconditional) probability for H as the price you would think fair for the ticket at the left below, we now identify your conditional probability for H given D as the price you would think fair for the ticket at its right. We wrote 'P(H)' for the first of these prices. We write 'P(H | D)' for the second.

The tickets are represented as follows in diagrammatic form, with numbers indicating dollar values in the various cases.

The first ticket represents a simple bet on H; the second represents a conditional bet on H, i.e., a bet that's called off (the price of the ticket is refunded) in case the condition D fails. If D and H are both true the bet's on and you win; if D is true but H is false the bet's on and you lose; and if D is false, the bet's off: you get your $P(H | D) back. With that understanding we can construct a Dutch book argument for the rule connecting conditional and unconditional probabilities:

Product Rule. P(DH) = P(D)P(H | D)

Imagine that your pockets are stuffed with money and tickets whose prices you think fair -- including the following three tickets. The first represents a conditional bet on H given D; the second and third represent unconditional bets on DH and against D, respectively. The third bet has an odd payoff, i.e., not a whole dollar, but only $P(H | D). That's why its price isn't the full $P(-D) but only the fraction P(-D) of the $P(H | D) that you stand to win. This third payoff was chosen to equal the price of the first ticket. That's what makes the three fit together into a neat book.

The three tickets are shown below in compact diagrammatic form. In each, the upper and lower halves represent D and -D, and the left and right halves represent H and -H. The number in each region shows the ticket's value when the corresponding hypothesis is true.

Observe that in every possible case regarding truth and falsity of D and H the second two tickets together have the same value as the first. Then there is nothing to choose between the first and the other two together, and so it would be inconsistent to place different values on them. Thus, the price you think fair for the first ought to equal the sum of the prices you think fair for the other two: P(H | D) = P(DH)+P(-D)P(H | D). Rewriting P(-D) as 1-P(D), this boils down to

P(H | D) = P(DH) + P(H | D) - P(D)P(H | D).

Cancelling the term on the left and the second term on the right and transposing, we have the product rule.
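The key step -- that in every case the second two tickets together are worth exactly what the first is worth -- can be verified mechanically. A minimal sketch, with an arbitrary illustrative value for p = P(H | D):

```python
from fractions import Fraction

p = Fraction(2, 3)  # hypothetical value for P(H | D)

def conditional_ticket(d, h):
    """Conditional bet on H given D: $1 if DH, $0 if D&-H,
    and the price $p refunded if D is false."""
    return (1 if h else 0) if d else p

def bet_on_DH(d, h):
    """$1 if D and H are both true, else nothing."""
    return 1 if (d and h) else 0

def bet_against_D(d, h):
    """Pays the odd amount $p (not a whole dollar) if D is false."""
    return p if not d else 0

# The bundle of tickets 2 and 3 matches ticket 1 in all four cases.
for d in (True, False):
    for h in (True, False):
        assert conditional_ticket(d, h) == bet_on_DH(d, h) + bet_against_D(d, h)
print("the two-ticket bundle matches the conditional ticket in every case")
```

Since the payoffs agree case by case, consistency forces the prices to agree too, which is the product rule.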

That's the Dutch book argument for the product rule: to violate the rule is to place different values on the same commodity bundle when it is described in two provably equivalent ways.

1.5 Laws of Conditional Probability Here is the product rule in a slightly different form:

Quotient Rule. P(H | D) = P(DH)/P(D), provided P(D) > 0

Graphically, the quotient rule expresses P(H | D) as the fraction of the area of the D strip that lies in the H region. It's as if calculating P(H | D) were a matter of trimming the square down to the D strip by discarding the blank region, and taking the stippled region as the new unit of area. Thus the conditional probability distribution assigns to H as its probability the H fraction of the D strip, the fraction P(HD)/P(D).

The quotient rule is often spoken of as a definition of conditional probability in terms of unconditional ones -- when the unconditional probability of the condition D is positive. But if P(D) is zero then by the implication rule so is P(DH), and the quotient P(DH)/P(D) assumes the indeterminate form 0/0. Then if the quotient rule really were its definition, the conditional probability would be undefined in all such cases. Yet, in many cases in which P(D) = 0, we do assign definite values to P(H | D).

Example: the spinner. Although the probability is 0 that when the spinner stops it will point straight up (U) or straight down (D), we want to say that the conditional probability of up, given up or down, is 1/2: although P(U)/P(UvD) = 0/0, still P(U | UvD) = 1/2.

By applying the product rule to each term on the right-hand side of the analysis rule, P(D) = P(DH1) + P(DH2) + ..., we get the rule of

Total Probability If the H's are incompatible and exhaustive, P(D) = P(D|H1)P(H1) + P(D|H2)P(H2) + ...

Example. A ball will be drawn at random from urn 1 or urn 2, with odds 2:1 of being drawn from urn 2. Three quarters of the balls in urn 1 are black, as are half of those in urn 2. Is black or white the more probable outcome?

Solution. By the rule of total probability with n=2 and D=black, we have P(D) = P(D | H1)P(H1)+P(D | H2)P(H2) = (3/4) (1/3)+(1/2) (2/3) = 1/4 + 1/3 = 7/12, i.e., a bit over 1/2. So black is the more probable outcome.
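The same computation, done with exact fractions; the ingredients are read off the solution above.

```python
from fractions import Fraction

p_urn1, p_urn2 = Fraction(1, 3), Fraction(2, 3)   # odds 2:1 for urn 2
p_black_given_urn1 = Fraction(3, 4)
p_black_given_urn2 = Fraction(1, 2)

# Total probability: P(black) = P(black|urn1)P(urn1) + P(black|urn2)P(urn2).
p_black = p_black_given_urn1 * p_urn1 + p_black_given_urn2 * p_urn2
print(p_black)                   # 7/12
print(p_black > Fraction(1, 2))  # True: black is the more probable outcome
```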

Finally, note that for any fixed proposition D of positive probability, the function P( | D) obeys all the laws of unconditional probability, e.g., additivity:

P(GvH | D) = P(G | D) + P(H | D) - P(G&H | D)

(Proof. Multiply both sides of the equation by P(D), and apply the product rule.) Therefore we sometimes write the function P( | D) as PD( ), e.g., in the additivity law:

PD(GvH) = PD(G) + PD(H) - PD(G&H)

If we condition again, on E, PD becomes PD&E:

PD(H | E) = PDE(H) = P(DEH)/P(DE) = P(H | DE)
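The identity PD(H | E) = P(H | DE) can be checked by brute force on a small model. The distribution below is invented, chosen only so that every conditioning event has positive probability.

```python
from fractions import Fraction
from itertools import product

# An arbitrary toy distribution over the eight truth-value cases for (D, E, H).
cases = list(product((True, False), repeat=3))
weights = [5, 3, 2, 1, 4, 2, 2, 1]
P = {c: Fraction(w, sum(weights)) for c, w in zip(cases, weights)}

def prob(pred, dist=P):
    return sum(p for (d, e, h), p in dist.items() if pred(d, e, h))

# Condition once on D: renormalize within the D cases to get P_D.
P_D = {c: p / prob(lambda d, e, h: d) for c, p in P.items() if c[0]}

# Condition P_D again on E ...
lhs = prob(lambda d, e, h: e and h, P_D) / prob(lambda d, e, h: e, P_D)
# ... and compare with conditioning P on D&E in one step.
rhs = prob(lambda d, e, h: d and e and h) / prob(lambda d, e, h: d and e)
assert lhs == rhs
print(lhs)  # 5/8 with these weights
```

Conditioning twice, first on D and then on E, lands on the same function as conditioning once on the conjunction DE, exactly as the displayed identity says.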

1.6 Why '|' Can't be a Connective The bar in 'P(H | D)' isn't a connective that turns pairs H, D of propositions into new, conditional propositions, H if D. Rather, it is as if we wrote the conditional probability of H given D as 'P(H, D)': the bar is a typographical variant of the comma. Thus we use 'P' for a function of one variable as in 'P(D)' and 'P(HD)', and also for the corresponding function of two variables as in 'P(H | D)'. The ambiguity is harmless because in every context, presence or absence of the bar clearly marks the distinction between the two uses. But of course the two are connected, i.e., by the product rule, P(HD) = P(H | D)P(D). That's why it's handy to make 'P' do double duty. But what is it that goes wrong when we treat the bar as a statement-forming connective, 'if'? This question was answered by David Lewis in 1976, pretty much as follows. Consider the simple special case of the rule of total probability where there are only two hypotheses, H and -H:

P(X) = P(X | H)P(H) + P(X | -H)P(-H)

Now if '|' is a connective, H | D is a proposition, and we are entitled to set X = H | D above. Result:

(*) P(H | D) = P[(H | D) | H] P(H) + P[(H | D) | -H] P(-H)

So far, so good. But remember: '|' means if, so '((H | D) | G)' means If G, then if D then H. And as we ordinarily use the word 'if', this comes to the same as If D and G, then H:

(H | D) | G = H | DG

(The identity means that the two sides represent the same region, i.e., the two sentences are logically equivalent.) Now we can rewrite (*) as follows.

P(H | D) = P(H | DH)P(H) + P(H | D-H)P(-H)

-- where the two terms on the right reduce to 1·P(H) and 0·P(-H), so that (*) itself reduces to

P(H | D) = P(H).

Conclusion: If '|' is a connective ("if"), conditional probabilities don't depend on their conditions at all. That means that 'P(H | D)' would be just a clumsy way of writing 'P(H)'. And it means that P(H | D) would come to the same thing as P(H | -D), and as P(H | G) for any other statement G. That's David Lewis's "trivialization result." In proving this, the only assumption needed about "if" was that "If A, then if B then C" is equivalent to "If A and B then C": whatever region of a diagram represents (C | B) | A must also represent C | BA.


Huygens gave this account of the scientific method in the introduction to his Treatise on Light (1690):

"... whereas the geometers prove their propositions by fixed and incontestable principles, here the principles are verified by the conclusions to be drawn from them; the nature of these things not allowing of this being done otherwise. It is always possible thereby to attain a degree of probability which very often is scarcely less than complete proof. To wit, when things which have been demonstrated by the principles that have been assumed correspond perfectly to the phenomena which experiment has brought under observation; especially when there are a great number of them, and further, principally, when one can imagine and foresee new phenomena which ought to follow from the hypotheses which one employs, and when one finds that therein the fact corresponds to our prevision. But if all these proofs of probability are met with in that which I propose to discuss, as it seems to me they are, this ought to be a very strong confirmation of the success of my inquiry; and it must be ill if the facts are not pretty much as I represent them."

Here we interpret and extend Huygens's methodology in the light of the discussion of rigidity, conditioning, and generalized conditioning in 1.7 and 1.8.

2.1 Confirmation The thought is that you see an episode of observation, experiment, or reasoning as confirming or infirming a hypothesis depending on whether your probability for it increases or decreases during the episode, i.e., depending on whether your posterior probability, Q(H), is greater or less than your prior probability, P(H). The degree of confirmation

Q(H) - P(H)

can be a useful measure of that change--positive for confirmation, negative for infirmation. Others are the probability factor, and the odds factor, greater than 1 for confirmation, less than 1 for infirmation:

probability factor = Q(H)/P(H),   odds factor = [Q(H)/Q(-H)] / [P(H)/P(-H)]

These are the factors by which prior probabilities P(H) or odds P(H)/P(-H) are multiplied to get the posterior probabilities Q(H) or posterior odds Q(H)/Q(-H). By the odds on one hypothesis against another -- say, on a theory T against an alternative S -- is meant the ratio of the probability of T to the probability of S. In these terms the plain odds on T are simply the odds on T against -T. The definition of the odds factor is easily modified for the case where S is not simply -T:

Odds factor for T against S = [Q(T)/Q(S)] / [P(T)/P(S)]

The odds factor can also be expressed as the ratio of the probability factor for T to that for S:

Odds factor for T against S = [Q(T)/P(T)] / [Q(S)/P(S)]

T is confirmed against S, or S against T, depending on whether the odds factor is greater than 1, or less. We will choose among these measures case by case, depending on which measure seems most illuminating.
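As a quick numerical sketch of the two equivalent forms of the odds factor -- the priors and posteriors below are invented for illustration:

```python
from fractions import Fraction

def odds_factor(prior_t, prior_s, post_t, post_s):
    """Posterior odds on T against S, divided by prior odds on T against S."""
    return (post_t / post_s) / (prior_t / prior_s)

# Hypothetical numbers: evidence raises T from 1/4 to 1/2 while S drops
# from 1/2 to 1/4.
f = odds_factor(Fraction(1, 4), Fraction(1, 2), Fraction(1, 2), Fraction(1, 4))
print(f)  # 4 > 1, so T is confirmed against S

# The same number as the ratio of the two probability factors:
pf_t = Fraction(1, 2) / Fraction(1, 4)
pf_s = Fraction(1, 4) / Fraction(1, 2)
print(f == pf_t / pf_s)  # True
```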

2.2 Huygens on Light Let H represent Huygens's principles and C the conclusions he drew from them -- i.e., the conclusions which "verify" the principles.

If C follows from H, and we can discover by observation whether C is true or false, then we have the means to test H -- more or less conclusively, depending on whether we find that C is false or true. If C proves false, H is refuted decisively, for then reality lies somewhere in the shaded region of the diagram, outside the "H" circle. If C proves true, H's probability changes from

P(H) = (area of "H" circle)/(area of square)

to

P(H | C) = (area of "H" circle)/(area of "C" circle)

So verification of C multiplies H's probability by 1/P(C). Therefore it is the antecedently least probable conclusions whose unexpected verification raises H's probability the most. As George Pólya put it: "More danger, more honor."

2.3 Observation and Sufficiency It only remains to clarify the rationale for updating P to Q by conditioning on C or on -C , i.e., setting Q(H) = P(H | +/- C) depending on whether what we observe assures us of C's truth or of its falsity. According to the analysis in 1.7, the warrant for this must be rigidity (sufficiency) of truth or falsity of C as evidence about H, assuring us that whatever information the observation provides over and above a bare report of C's truth or falsity has no further relevance to H. This is guaranteed if the information about C arrives in a pre-arranged 1-word telegram: "true," or "false." But if the observers are the very people whose judgmental states are to be updated by the transition from P to Q, the possibility must be considered that the information about H conveyed by the observation will overflow the package provided by the sentence +/- C. Of course there will be no overflow if C is found to be false, for since the shaded region is disjoint from the "H" circle, any conditional probability function must assign 0 to H given falsity of C. This guarantees rigidity relative to -C:

Q(H | -C) = P(H | -C) = 0

No matter what else observation might reveal about the circumstances of C's falsity, H would remain refuted. But overflow is possible in case of a positive result, verification of C. In this case, observation may provide further information that complicates matters by removing our warrant to update by conditioning. Example: the Green Bean, yet again. H: the next bean will be lime-flavored. C: the next bean will be green. You know that half the beans in the bag are green, all the lime-flavored ones are green, and the green ones are equally divided between lime and mint flavors. So P(C) = 1/2 = P(H | C), and P(H) = 1/4. But although Q(C) = 1, your probability Q(H) for lime can drop below P(H) = 1/4 instead of rising to 1/2 = P(H | C) -- e.g. if, when you see that the bean is green you also get a whiff of mint, or also see that it has a special shade of green that you have found to be associated with the mint-flavored ones.
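The prior numbers in the example can be checked by enumerating a miniature bag. The four beans below, and the "other" flavor for the non-green ones, are my own stand-ins for the proportions the text describes.

```python
from fractions import Fraction

# Miniature bag matching the stated proportions: 2 green beans (1 lime, 1 mint)
# and 2 non-green beans of some unspecified flavor ("other" is a placeholder).
beans = [("green", "lime"), ("green", "mint"),
         ("white", "other"), ("white", "other")]

def p(pred):
    return Fraction(sum(pred(color, flavor) for color, flavor in beans),
                    len(beans))

p_C = p(lambda c, f: c == "green")                       # P(C)
p_H = p(lambda c, f: f == "lime")                        # P(H)
p_H_given_C = p(lambda c, f: c == "green" and f == "lime") / p_C
print(p_C, p_H, p_H_given_C)  # 1/2 1/4 1/2
```

What the enumeration cannot capture is the overflow: a whiff of mint is extra evidence about H that the bare sentence "the bean is green" does not carry, and it is exactly what breaks the warrant for setting Q(H) = P(H | C).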

2.4 Leverrier on Neptune We now turn to a more recent methodological story. This is how Pólya tells it:

"On the basis of Newton's theory, the astronomers tried to compute the motions of ... the planet Uranus; the differences between theory and observation seemed to exceed the admissible limits of error. Some astronomers suspected that these deviations may be due to the attraction of a planet revolving beyond Uranus' orbit, and the French astronomer Leverrier investigated this conjecture more thoroughly than his colleagues. Examining the various explanations proposed, he found that there was just one that could account for the observed irregularities in Uranus' motion: the existence of an extra-Uranian planet [sc., Neptune]. He tried to compute the orbit of such a hypothetical planet from the irregularities of Uranus.

Finally Leverrier succeeded in assigning a definite position in the sky to the hypothetical planet [say, with a 1° margin of error]. He wrote about it to another astronomer whose observatory was the best equipped to examine that portion of the sky. The letter arrived on the 23rd of September 1846 and in the evening of the same day a new planet was found within one degree of the spot indicated by Leverrier. It was a large ultra-Uranian planet that had approximately the mass and orbit predicted by Leverrier."

We treated Huygens's conclusion as a strict deductive consequence of his principles. But Pólya made the more realistic assumption that Leverrier's prediction C (a bright spot near a certain point in the sky at a certain time) was highly probable but not 100%, given his H (i.e., Newton's laws and observational data about Uranus). So P(C | H) ~ 1; and presumably the rigidity condition was satisfied so that Q(C | H) ~ 1, too. Then verification of C would have raised H's probability by a factor ~ 1/P(C), which is large if the prior probability P(C) of Leverrier's prediction was ~ 0. Pólya offers a reason for regarding 1/P(C) as at least 180 -- and perhaps as much as 13131: The accuracy of Leverrier's prediction proved to be better than 1°, and the probability of a randomly selected point on a circle or on a sphere being closer than 1° to a previously specified point is 1/180 for a circle, and about 1/13131 for a sphere. Favoring the circle is the fact that the orbits of all known planets lie close to a common plane ("of the ecliptic"). Then the great circle cut out by that plane gets the lion's share of probability. Thus, if P(C) is half of 1%, H's probability factor will be about 200.
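Pólya's two bounds are just geometry, and a few lines verify them (a sketch; the variable names are mine):

```python
import math

# Circle: a 1-degree error either way spans a 2-degree arc out of 360.
p_circle = 2.0 / 360.0
assert abs(1 / p_circle - 180) < 1e-9

# Sphere: the fraction of the surface within 1 degree of a fixed point
# (a spherical cap) is (1 - cos 1 degree)/2, about 1/13131.
p_sphere = (1 - math.cos(math.radians(1.0))) / 2
assert abs(1 / p_sphere - 13131) < 2

# With P(C) equal to half of 1%, verification multiplies H's
# probability by about 1/P(C) = 200.
assert round(1 / 0.005) == 200
```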

2.5 Multiple Uncertainties In Pólya's story, Leverrier loosened Huygens's tight hypothetico-deductive reasoning by backing off from deductive certainty to values of P(C | H) falling somewhat short of 1 -- which he treated as approximately 1. But what is the effect of different shortfalls? Similarly, we can back off from observational certainty to Q(C) values less than 1. What if the confirming observation had raised the probability of Leverrier's C from a prior value of half of 1% to some posterior value short of 1; say, Q(C) = 95%? Surely that would have increased H's probability by a factor smaller than Pólya's 200; but how much smaller? Again, it would be more realistic to tell the story in terms of a point prediction with stated imprecision -- say, ±1°. (In fact the new planet was observed within

that margin, i.e., 57' from the point.) As between two theories that make such predictions, the one making the more precise prediction can be expected to gain the more from a confirming observation. But how much more? The following formula for H's probability factor is due to John Burgess; it answers such questions provided C and -C satisfy the rigidity condition.

                [Q(C) - P(C)] [P(C | H) - P(C)]
 pf(H,C) = 1 + ---------------------------------
                          P(C)P(-C)

By lots of algebra you can derive this formula from basic laws of probability and generalized conditioning with n=2 (sec. 1.8). If we call the term added to 1 in pf(H,C) the strength of confirmation for H in view of C's change in probability, then we have

            [Q(C) - P(C)] [P(C | H) - P(C)]
 sc(H,C) = ---------------------------------
                      P(C)P(-C)

The sign distinguishes confirmation (+) from infirmation (-, "negative confirmation").
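The Burgess formula also answers the "how much smaller" question of 2.5. A sketch in Python (the function name is mine), run with Pólya's numbers:

```python
def pf(P_C, P_C_given_H, Q_C):
    """Burgess's probability factor Q(H)/P(H) when C's probability
    moves from P(C) to Q(C), assuming rigidity for C and -C."""
    sc = (Q_C - P_C) * (P_C_given_H - P_C) / (P_C * (1 - P_C))
    return 1 + sc

# Polya's case: P(C) = .005, P(C|H) ~ 1, observation certain.
assert abs(pf(0.005, 1.0, 1.0) - 200.0) < 1e-9

# A less-than-certain observation, Q(C) = .95, gives a smaller
# but still large factor: 190 rather than 200.
assert abs(pf(0.005, 1.0, 0.95) - 190.0) < 1e-6
```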

Exercises. What does sc reduce to in these cases? (a) Q(C)=1 (b) P(C | H)=1 (c) Q(C)=P(C | H)=1 (d) P(C) = 0 or 1, i.e., prior certainty about C.

To see the effect of precision, suppose that C predicts that a planet will be found within ±e of a certain point in the sky -- a prediction that is definitely confirmed, within the limits of observational error. Thus P(C | H) = Q(C) = 1, and P(C) increases with e. Here sc(H,C) = P(-C)/P(C) = the prior odds against C, and H's probability factor is 1/P(C). Thus, if it was thought certain that the observed position would be in the plane of the ecliptic, P(C) might well be proportional to e, P(C) = ke.

Exercise. (e) On this assumption of proportionality, what happens to H's probability factor when e doubles?

2.6 Dorling on the Duhem problem Skeptical conclusions about scientific hypothesis-testing are often drawn from the presumed arbitrariness of answers to the question of which to give up -- theory, or auxiliary hypothesis -- when they jointly contradict empirical data. The problem, addressed by Duhem in the first years of the 20th century, was agitated by Quine in mid-century. As drawn by some of Quine's readers, the conclusion depends on his assumption that aside from our genetical and cultural heritage, deductive logic is all we've got to go on when it comes to theory testing. That would leave things pretty much as Descartes saw them, just before the mid-17th century emergence in the hands of Fermat, Pascal, Huygens and others of the probabilistic ("Bayesian") methodology that Jon Dorling has brought to bear on various episodes in the history of science. The conclusion is one that scientists themselves generally dismiss, thinking they have good reason to evaluate the effects of evidence as they do, but regarding formulation and justification of such reasons as someone else's job -- the methodologist's. Here is an introduction to Dorling's work on the job, using extracts from his important but still unpublished 1982 paper. It is presented here in terms of probability factors. Assuming rigidity relative to D, the probability factor for a theory T against an alternative theory S is the left-hand side of the following equation. The right-hand side is called the likelihood ratio. The equation follows from the quotient rule.

 P(T | D) / P(S | D)     P(D | T)
 -------------------- = ----------
    P(T) / P(S)          P(D | S)

The empirical result D is not generally deducible or refutable by T alone, or by S alone, but in interesting cases of scientific hypothesis testing D is deducible or refutable on the basis of the theory and an auxiliary hypothesis A (e.g., the hypothesis that the equipment is in good working order). To simplify the analysis, Dorling makes an assumption that can generally be justified by appropriate formulation of the auxiliary hypothesis:

Prior independence P(AT) = P(A)P(T), P(AS) = P(A)P(S)

In some cases S is simply the denial, -T, of T; in others it is a definite scientific theory R, a rival to T. In any case Dorling uses the independence assumption to expand the right-hand side of the odds factor = likelihood ratio equation. Result, with f for odds factor:

                  P(D | TA)P(A) + P(D | T-A)P(-A)
 (1) f(T,S) = -----------------------------------
                  P(D | SA)P(A) + P(D | S-A)P(-A)

To study the effect of D on A, he also expands f(A,-A) with respect to T (and similarly with respect to S):

                   P(D | AT)P(T) + P(D | A-T)P(-T)
 (2) f(A,-A) = ------------------------------------
                   P(D | -AT)P(T) + P(D | -A-T)P(-T)
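Expansion (1) is easy to mechanize. The sketch below (hypothetical numbers; the function name is mine, not Dorling's) computes f(T,S) from the four likelihoods and P(A); when D refutes T given A, and all the other likelihoods are 1, the factor collapses to P(-A):

```python
def odds_factor(P_D, P_A):
    """Dorling's expansion (1): odds factor f(T,S) for theory T against
    rival S on data D, given prior independence of A from T and S.
    P_D maps 'TA', 'T~A', 'SA', 'S~A' to the likelihoods P(D | .)."""
    num = P_D["TA"] * P_A + P_D["T~A"] * (1 - P_A)
    den = P_D["SA"] * P_A + P_D["S~A"] * (1 - P_A)
    return num / den

# Hypothetical case: D is impossible under T-with-A, certain otherwise.
f = odds_factor({"TA": 0.0, "T~A": 1.0, "SA": 1.0, "S~A": 1.0}, P_A=0.8)
assert abs(f - 0.2) < 1e-12   # reduces to P(-A)
```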


2.7 Einstein vs. Newton, 1919 In these terms Dorling analyzes two famous tests that were duplicated, with apparatus differing in seemingly unimportant ways, with conflicting results: one of the duplicates confirmed T against R, the other confirmed R against T. But in each case the scientific experts took the experiments to clearly confirm one of the rivals against the other. Dorling explains why the experts were right: "In the solar eclipse experiments of 1919, the telescopic observations were made in two locations, but only in one location was the weather good enough to obtain easily interpretable results. Here, at Sobral, there were two telescopes: one, the one we hear about, confirmed Einstein; the other, in fact the slightly larger one, confirmed Newton. Conclusion: Einstein was vindicated, and the results with the larger telescope were rejected." (§4)


T: General Relativistic light-bending effect of the sun
R: No light-bending effect of the sun
A: Both telescopes are working correctly
D: The actual, conflicting data from both telescopes

Set S=R in the odds factor (1), and observe that P(D | TA) = P(D | RA) = 0. Then (1) becomes

                  P(D | T-A)
 (3) f(T,R) = -------------
                  P(D | R-A)

"Now the experimenters argued that one way in which A might easily be false was if the mirror of one or the other of the telescopes had distorted in the heat, and this was much more likely to have happened with the larger mirror belonging to the telescope which confirmed R than with the smaller mirror belonging to the telescope which confirmed T. Now the effect of mirror distortion of the kind envisaged would be to shift the recorded images of the stars from the positions predicted by T to or beyond those predicted by R. Hence P(D | T-A) was regarded as having an appreciable value, while, since it was very hard to think of any similar effect which could have shifted the positions of the stars in the other telescope from those predicted by R to those predicted by T, P(D | R-A) was regarded as negligibly small, hence the result as overall a decisive confirmation of T and refutation of R." (§4) Thus in (3) we have f(T,R) >> 1.

2.8 Bell's Inequalities: Holt vs. Clauser "Holt's experiments were conducted first and confirmed the predictions of the local hidden variable theories and refuted those of the quantum theory. Clauser examined Holt's apparatus and could find nothing wrong with it, and obtained the same results as Holt with Holt's apparatus. Holt refrained from publishing his results, but Clauser published his, and they were rightly taken as excellent evidence for the quantum theory and against hidden-variable theories." (§4)

Notation

T: Quantum theory
R: Disjunction of local hidden variable theories
A: Holt's setup is sensitive enough to distinguish T from R
D: The specific correlations predicted by T and contradicted by R are not detected by Holt's setup

The characterization of D yields the first two of the following equations. In conjunction with the characterization of A it also yields P(D | T-A) = 1, for if A is false, Holt's apparatus was not sensitive enough to detect the correlations that would have been present according to T; and it yields P(D | R-A) = 1 because of the wild improbability of the apparatus "hallucinating" those specific correlations.

P(D | TA) = 0, P(D | RA) = 1, P(D | T-A) = P(D | R-A) = 1

Setting S=R in (1), these substitutions yield

 (4) f(T,R) = P(-A)

Then with a prior probability 4/5 for adequate sensitivity of Holt's apparatus, the odds between quantum theory and the local hidden variable theories shift strongly in favor of the latter, e.g., with prior odds 45:55 between T and R, the posterior odds are only 9:55, a 14% probability for T. Why, then, did Holt not publish his result? Because the experimental result undermined confidence in his apparatus. Setting -T = R in (2), because T and R were the only theories given any credence as explanations of the results, and making the same substitutions as in (4), we have


f(A,-A) = P(R)

so the odds on A fall from 4:1 to 2.2:1; the probability of A falls from 80% to 69%. Holt is not prepared to publish with better than a 30% chance that his apparatus could have missed actual quantum mechanical correlations; the swing to R depends too much on a prior confidence in the experimental setup that is undermined by the same thing that caused the swing. Now why did Clauser publish?

Notation

T: Quantum theory
R: Disjunction of local hidden variable theories
C: Clauser's setup is sensitive enough
E: The specific correlations predicted by T and contradicted by R are detected by Clauser's setup

Suppose that P(C) = .5. At this point, although P(A) has fallen by 11%, both experimenters still trust Holt's well-tried set-up better than Clauser's. Suppose Clauser's initial results E indicate presence of the quantum mechanical correlations pretty strongly, but still with a 1% chance of error. Then E strongly favors T over R:


            P(E | TC)P(C) + P(E | T-C)P(-C)     (1)(.5) + (.01)(.5)
 f(T,R) = --------------------------------- = --------------------- = 50.5
            P(E | RC)P(C) + P(E | R-C)P(-C)     (.01)(.5) + (.01)(.5)

Starting from the low 9:55 to which T's odds fell after Holt's experiment, odds after Clauser's experiment will be 909:110, an 89% probability for T. The result E boosts confidence in Clauser's apparatus by a factor of

             P(E | CT)P(T) + P(E | CR)P(R)
 f(C,-C) = --------------------------------- = 15
             P(E | -CT)P(T) + P(E | -CR)P(R)

This raises the initially even odds on C to 15:1, raises the probability from 50% to 94%, and lowers the 50% probability of the effect's being due to chance down to 6 or 7 percent.

2.9 Laplace vs. Adams Finally, note one more class of cases: a theory T remains highly probable although (with auxiliary hypothesis A) it implies a false prediction D. With S=-T in formulas (1) and (2), with P(D | TA)=0, and setting

      P(D | T-A)            P(D | -TA)
 t = ------------ ,    s = ------------
      P(D | -T-A)           P(D | -T-A)

we have


                    t                              s
 f(T,-T) = ----------------- ,   f(A,-A) = -----------------
            sP(A)/P(-A) + 1                 tP(T)/P(-T) + 1


                 tP(-A)
 (10) f(T,-A) = --------
                 sP(-T)

These formulas apply to (§1) "a famous episode from the history of astronomy which clearly illustrated striking asymmetries in `normal' scientists' reactions to confirmation and refutation. This particular historical case furnished an almost perfect controlled experiment from a philosophical point of view, because owing to a mathematical error of Laplace, later corrected by Adams, the same observational data were first seen by scientists as confirmatory and later as disconfirmatory of the orthodox theory. Yet their reactions were strikingly asymmetric: what was initially seen as a great triumph and of striking evidential weight in favour of the Newtonian theory, was later, when it had to be re-analyzed as disconfirmatory after

the discovery of Laplace's mathematical oversight, viewed merely as a minor embarrassment and of negligible evidential weight against the Newtonian theory. Scientists reacted in the `refutation' situation by making a hidden auxiliary hypothesis, which had previously been considered plausible, bear the brunt of the refutation, or, if you like, by introducing that hypothesis's negation as an apparently ad hoc face-saving auxiliary hypothesis."


T: The theory, Newtonian celestial mechanics
A: The hypothesis that disturbances (tidal friction, etc.) make a negligible contribution to the moon's secular acceleration
D: The observed secular acceleration of the moon

Dorling argues on scientific and historical grounds for approximate numerical values

t=1, s=1/50

The general drift: t = 1 because with A false, truth or falsity of T is irrelevant to D, and t = 50s because in plausible partitions of -T into rival theories predicting lunar accelerations, P(R | -T) = 2% where R is the disjunction of rivals not embarrassed by D. Then for a theorist whose odds are 3:2 on A and 9:1 on T (probabilities 60% for A and 90% for T),

f(T,-T) = 100/103, f(A,-A) = 1/500, f(T,-A) = 200.

Thus the prior odds 900:100 on T barely decrease, to 900:103; the new probability of T, 900/1003, agrees with the original 90% to two decimal places. But odds on

the auxiliary hypothesis A drop sharply, from prior 3:2 to posterior 3:1000, i.e., the probability of A drops from 60% to about three tenths of 1%.
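Dorling's numbers can be pushed through the formulas of 2.9 in exact arithmetic (a sketch; variable names mine):

```python
from fractions import Fraction

t, s = Fraction(1), Fraction(1, 50)            # Dorling's estimates
P_A, P_T = Fraction(3, 5), Fraction(9, 10)     # odds 3:2 on A, 9:1 on T

f_T = t / (s * P_A / (1 - P_A) + 1)            # f(T,-T)
f_A = s / (t * P_T / (1 - P_T) + 1)            # f(A,-A)
assert f_T == Fraction(100, 103)
assert f_A == Fraction(1, 500)

# T's probability barely moves: 900:100 becomes 900:103.
post_odds_T = Fraction(9, 1) * f_T
assert post_odds_T / (1 + post_odds_T) == Fraction(900, 1003)

# A's collapses: 3:2 becomes 3:1000, about 0.3%.
post_odds_A = Fraction(3, 2) * f_A
assert post_odds_A == Fraction(3, 1000)
```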

2.10 Dorling's conclusions "Until recently there was no adequate theory available of how scientists should change their beliefs in the light of evidence. Standard logic is obviously inadequate to solve this problem unless supplemented by an account of the logical relations between degrees of belief which fall short of certainty. Subjective probability theory provides such an account and is the simplest such account that we possess. When applied to the celebrated Duhem (or Duhem-Quine) problem and to the related problem of the use of ad hoc, or supposedly ad hoc, hypotheses in science, it yields an elegant solution. This solution has all the properties which scientists and philosophers might hope for. It provides standards for the validity of informal inductive reasoning comparable to those which traditional logic has provided for the validity of informal deductive reasoning. These standards can be provided with a rationale and justification quite independent of any appeal to the actual practice of scientists, or to the past success of such practices. [Here a long footnote explains the Putnam-Lewis Dutch book argument for conditioning.] Nevertheless they seem fully in line with the intuitions of scientists in simple cases and with the intuitions of the most experienced and most successful scientists in trickier and more complex cases. The Bayesian analysis indeed vindicates the rationality of experienced scientists' reactions in many cases where those reactions were superficially paradoxical and where the relevant scientists themselves must have puzzled over the correctness of their own intuitive reactions to the evidence. It is clear that in many such complex situations many less experienced commentators and critics have sometimes drawn incorrect conclusions and have mistakenly attributed the correct conclusions of the experts to scientific dogmatism.
Recent philosophical and sociological commentators have sometimes generalized this mistaken reaction into a full-scale attack on the rationality of men of science, and as a result have mistakenly looked for purely sociological explanations for many changes in scientists' beliefs, or the absence of such changes, which were in fact, as we now see, rationally de rigueur. "It appears that in the past even many experts have sometimes been misled in trickier reasoning situations of this kind. A more widespread understanding of the adequacy and power of the kinds of Bayesian analyses illustrated in this paper could prevent such mistakes in the future and could form a useful part of standard scientific education. It would be an exaggeration to say that it would offer a wholly new level of precision to informal scientific reasoning, for of course the quantitative subjective probability assignments in such calculations are merely representative surrogates for informal qualitative judgments. Nevertheless the

qualitative conclusions which can be extracted from these relatively arbitrary quantitative illustrations and calculations seem acceptably robust under the relevant latitudes in those quantitative assignments. Hence if we seek to avoid qualitative errors in our informal reasoning in such scientific contexts, such illustrative quantitative analyses are an exceptionally useful tool for ensuring this, as well as for making explicit the logical basis for those qualitative conclusions which follow correctly from our premises, but which are sometimes nevertheless surprising and superficially paradoxical." (§5)

2.11 Problems 1 "Someone is trying to decide whether or not T is true. He notes that T is a consequence of H. Later he succeeds in proving that H is false. How does this refutation affect the probability of T?" In particular, what is P(T) - P(T | -H)?

2 "We are trying to decide whether or not T is true. We derive a sequence of consequences from T, say C1, C2, C3, ... . We succeed in verifying C1, then C2, then C3, and so on. What will be the effect of these successive verifications on the probability of T?" In particular, setting pn = P(T | C1 & C2 & ... & Cn), what is the probability factor pn+1/pn?

3 Four Fallacies. Each of the following plausible rules is unreliable. Find counterexamples to (b), (c), and (d) on the model of the one for (a) given below. (a) If D confirms T, and T implies H, then D confirms H. Counterexample: in an eight-ticket lottery, let D mean that the winner is ticket 2 or 3, T that it is 3 or 4, H that it is neither 1 nor 2. (b) If D confirms H and T separately, it must confirm their conjunction, T&H. (c) If D and E each confirm H, then their conjunction, D&E, must also confirm H. (d) If D confirms a conjunction, T&H, then it can't infirm each conjunct separately.
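The counterexample given for (a) can be verified by enumeration, leaving (b)-(d) to the reader (a sketch; the helper name is mine):

```python
from fractions import Fraction

tickets = range(1, 9)                       # eight equiprobable tickets

def P(event, given=lambda w: True):
    """Probability of event, optionally conditional on given."""
    world = [w for w in tickets if given(w)]
    return Fraction(len([w for w in world if event(w)]), len(world))

D = lambda w: w in (2, 3)                   # winner is ticket 2 or 3
T = lambda w: w in (3, 4)                   # winner is ticket 3 or 4
H = lambda w: w not in (1, 2)               # winner is neither 1 nor 2

assert all(H(w) for w in tickets if T(w))   # T implies H
assert P(T, D) > P(T)                       # D confirms T: 1/2 > 1/4
assert P(H, D) < P(H)                       # yet D infirms H: 1/2 < 3/4
```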

2.12 Notes Sec. 2.1. The term "Bayes factor" or simply "factor" is more commonly used than "odds factor". Call it `f'. A useful variant is its logarithm, sc., the weight of evidence for T against S:

w(T, S) = log f(T, S)

As the probability factor varies from 0 through 1 to ∞, its logarithm varies from -∞ through 0 to +∞, thus equalizing the treatments of confirmation and infirmation. Where the odds factor is multiplicative for odds, weight of evidence is additive for logarithms of odds (`lods'):

(new odds) = f × (old odds)
log(new odds) = w + log(old odds)

Sec. 2.2: "More danger, more honor." See George Pólya, Patterns of Plausible Inference, 2nd ed., Princeton University Press 1968, vol. 2, p. 126.

Sec. 2.4. See Pólya, op. cit., pp. 130-132.

Sec. 2.6. See Jon Dorling, "Bayesian personalism, the methodology of research programmes, and Duhem's problem", Studies in History and Philosophy of Science 10 (1979) 177-187. More along the same lines: Michael Redhead, "A Bayesian reconstruction of the methodology of scientific research programmes," Studies in History and Philosophy of Science 11 (1980) 341-347. Dorling's unpublished paper from which excerpts appear here in secs. 2.7-2.10 is "Further illustrations of the Bayesian solution of Duhem's problem" (29 pp., photocopied, 1982). References here ("§4" etc.) are to the numbered sections of that paper. Dorling's work is also discussed in Colin Howson and Peter Urbach, Scientific Reasoning: the Bayesian Approach (Open Court, La Salle, Illinois, 2nd ed., 1993).

Sec. 2.10, the Putnam-Lewis Dutch book argument (i.e., for conditioning as the only legitimate updating policy). Putnam stated the result, or, anyway, a special case, in a 1963 Voice of America Broadcast, "Probability and Confirmation", reprinted in his Mathematics, Matter and Method, Cambridge University Press (1975)293-304. Paul Teller, "Conditionalization and observation", Synthese 26(1973)218-258, reports--and attributes to David Lewis--a general argument to that effect which Lewis had devised as a reconstruction of what Putnam must have had in mind.

Sec. 2.11. Problems 1 and 2 are from George Pólya, "Heuristic reasoning and the theory of probability", American Mathematical Monthly 48 (1941) 450-465. Problem 3 relates to Carl G. Hempel's "Studies in the logic of confirmation", Mind 54 (1945) 1-26 and 97-121. Reprinted in Hempel's Aspects of Scientific Explanation, The Free Press, New York, 1965.

It was in terms of gambling that Pascal, Fermat, Huygens and others in their wake floated the modern probability concept. Betting was their paradigm for action under uncertainty; adoption of odds or probabilities was the relevant form of factual judgment. They saw probabilistic factual judgment and graded value judgment as a pair of hands to shape decision.

3.1 Desirability The matter was put as follows in the final section of a most influential 17th century How to Think book, "The Port-Royal Logic" (1662). "To judge what one must do to obtain a good or avoid an evil, it is necessary to consider not only the good and the evil themselves, but also the probability that they happen, or not; and to view geometrically the proportion that all these things have together." This "geometrical" view takes seriously the perennial image of deliberation as a weighing in the balance. Where a course of action might eventuate in a good or an evil, we are to weigh the probabilities of those outcomes in a balance whose arms are proportional to the gain and the loss that the outcomes would bring. To consider the "good and the evil themselves" is to compare their desirability differences, g-f and f-e in Fig. 1.

Fig. 1. Lengths are proportional to desirability differences, weights to probabilities. The desirability of the course of action is represented by the position f of the fulcrum about which the opposed turning effects of the weights just cancel.

Example 1. The last digit. The action under consideration is a bet on the last digit of the serial number of a $5 bill in your pocket: if it's one of the 8 digits from 2 to 9, you give me the bill; if it's 0 or 1, I give you $20. Then my odds on winning are 4:1. In the balance diagram, that's the ratio of weights in the pans. Suppose I have $100 on hand. Crassly, I might see my present desirability level as f = 100, and equate the desirabilities g and e with the cash I'll have on hand if I win and lose: g=105 and e=80. Now the 4:1 odds between the good and the evil agree with the 4:1 ratio of loss (f-e = 20) to gain (g-f = 5) as I see it. I'd think the bet fair.

In example 1, the options ("acts") were (G) take the gamble, and (-G) don't. My desirabilities des(act) for these were averages of my desirabilities des(level & act) for possible levels of wealth after acts, weighted with my probabilities P(level | act) for levels given acts:

des(G) = des($80 & G)P($80 | G) + des($105 & G)P($105 | G) des(-G) = des($100 & -G)

If desirability equals wealth in dollars no matter whether it is a gain, a loss, or the status quo, these work out as:

des(G) = (80)(.2) + (100)(0) + (105)(.8) = 100 des(-G) = (80)(0) + (100)(1) + (105)(0) = 100

Then my desirabilities for the two acts are the same, and I am indifferent between them.
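The averaging in Example 1 takes only a few lines in exact fractions (the helper `des` is my name for the weighted average):

```python
from fractions import Fraction

def des(outcomes):
    """Desirability of an act: probability-weighted average of the
    desirabilities of its possible outcomes."""
    return sum(p * d for p, d in outcomes)

# Example 1: with $100 on hand, win $5 at odds 4:1 or lose $20.
gamble = des([(Fraction(4, 5), 105), (Fraction(1, 5), 80)])
decline = des([(Fraction(1, 1), 100)])
assert gamble == decline == 100   # indifference: the bet looks fair
```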

In the next example preference between acts reveals a previously unknown feature of desirabilities.

Example 2. The Heavy Smoker. The following statistics were provided by the American Cancer Society in the early 1960's.

Percentage of American Men Aged 35 Expected to Die before Age 65

  Nonsmokers                      23%
  Cigar and pipe smokers          25%
  Cigarette smokers:
    Less than 1/2 pack a day      27%
    1/2 to 1 pack a day           34%
    1 to 2 packs a day            38%
    2 or more packs a day         41%

In 1965, Diamond Jim, a 35-year-old American man, had found that if he smoked cigarettes at all, he smoked 2 or more packs a day. Thinking himself incapable of quitting altogether, he saw his options as the following two.

  C = Continue to smoke 2 or more packs a day
  S = Switch to pipes and cigars

And he saw these as the relevant conditions:

  L = He lives to age 65 or more
  D = He dies before the age of 65

His probabilities came from the statistics in the normal way, so that, e.g., P(D | C) = 41% and P(D | S) = 25%. Thus, his conditional probability matrix was as follows.

        L     D
  C    59%   41%
  S    75%   25%

Unsure of the desirabilities of the four conjunctions of C and S with D and L, he was clear that DS (= die before age 65 in spite of having switched) was the worst of them; and he thought that longevity and cigarette-smoking would contribute independent increments of desirability, say l and c:

des(LS) = des(DS)+l, des(LC) = des(DC)+l des(LC) = des(LS)+c, des(DC) = des(DS)+c

Then if we set the desirability of the worst conjunction equal to d, his desirability matrix is this:

        L        D
  C   d+c+l    d+c
  S   d+l      d

Now in Diamond Jim's judgment the desirability of (C) continuing to smoke 2 packs a day and of (S) switching are as follows.

des(C) = des(LC)P(L | C) + des(DC)P(D | C) = (d+c+l)(.59) + (d+c)(.41) = d + c + .59l
des(S) = des(LS)P(L | S) + des(DS)P(D | S) = (d+l)(.75) + (d)(.25) = d + .75l

The difference des(C)-des(S) between these is c - .16l. If Diamond Jim preferred to continue smoking, this was positive; if he preferred switching, it was negative. Fact: Diamond Jim switched. Then the difference was negative, i.e., c was less than 16% of l: his preference for cigarettes over pipes and cigars was less than 16% as intense as his preference for living to age 65 or more over dying before age 65.
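Diamond Jim's comparison goes through with d, c, and l left unknown, since only their coefficients matter. A sketch in exact fractions, tracking (d, c, l) coefficient triples (the helper names are mine):

```python
from fractions import Fraction as F

def mix(pairs):
    """Probability-weighted sum of (d, c, l) coefficient triples."""
    return tuple(sum(p * v[i] for p, v in pairs) for i in range(3))

LC, DC = (1, 1, 1), (1, 1, 0)   # coefficients of d, c, l in each payoff
LS, DS = (1, 0, 1), (1, 0, 0)

des_C = mix([(F(59, 100), LC), (F(41, 100), DC)])
des_S = mix([(F(75, 100), LS), (F(25, 100), DS)])

diff = tuple(a - b for a, b in zip(des_C, des_S))
assert diff == (0, 1, F(-16, 100))   # des(C) - des(S) = c - .16 l
```

So the unknowns d, c, l never need numerical values: the switch reveals only that c < .16 l.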

3.2 Problems 1 Train or Plane? With regard to cost and safety, train and plane are equally good ways of getting from Los Angeles to San Francisco. The trip takes 8 hours by train but only 1 hour by plane, unless the San Francisco airport proves to be fogged in, in which case the plane trip takes 15 hours. The weather forecast says there are 7 chances in 10 that San Francisco will be fogged in. If your desirabilities are simply negatives of travel times, how should you go?

2 The point of Balance. What must the probability of fog be, in problem 1, to make you indifferent between plane and train?

3 You assign the following desirabilities to wealth.

$: 0 10 20 30 40 des($): 0 10 17 22 26

a. With assets of $20 you are offered a gamble to win $10 with probability .58 or otherwise lose $10. Work out your desirabilities for accepting and rejecting the offer. Note that you should reject it. b. What if you had been offered a gamble consisting of two independent plays of the gamble in (a)? Should you have accepted?

4 The Allais Paradox. You may choose one of the following options at no cost to yourself. Don't calculate, just decide!

A: One million dollars ($1M) for sure. B: $10M or $1M or $0 with probabilities 10%, 89%, 1%.

What if you were offered the following options instead? Decide!

C: $1M or $0 with probabilities 11%, 89%. D: $10M or $0 with probabilities 10%, 90%.

Note your intuitive answers; then compute your desirabilities for the four options, using x, y, z for the desirabilities of $10M, $1M, $0. Verify that the excess desirability of A over B must be the same as that of C over D. Thus a policy of maximizing conditional expectation of dollar payoffs would have you prefer C to D if you prefer A to B.

5 The Ellsberg Paradox. A ball will be drawn from an urn containing 90 balls: 30 red, the rest black and yellow in some unknown ratio. As in problem 4, choose between A and B, and between C and D. Then calculate.

A: $100 if red, $0 if not. B: $100 if black, $0 if not.

C: $0 if red, $100 if not. D: $0 if black, $100 if not.

6 Deutero Russian Roulette. You've got to play Russian roulette, using a six-shooter that has 2 loaded cylinders. You've got a million, and would pay it all to empty both cylinders before you have to pull the trigger. Show that if dying rich is no better than dying poor, and it's the prospects of being dead, or being alive at various levels of wealth to which you attach your various desirabilities, the present decision theory would advise you to pay the full million to have just 1 bullet removed if originally there were 4 in the cylinder.

7 Proto Russian Roulette. If dying rich is no better than dying poor, and des(Dead)=0, des(Rich)=1, how many units of desirability is it worth to remove a single bullet before playing Russian Roulette when the six-shooter has e empty chambers?

8 In the Allais and Ellsberg paradoxes, and in Proto Russian Roulette, many people would choose in ways incompatible with the analyses suggested above. Thus, in problem 4, the desirability of being so unlucky as to win nothing in option B, having passed up the option (A) of a sure million, is often seen as much lower than the desirability of winning nothing in option C or D. Verify that the view of decision-making as desirability maximization needn't then see preference for A over B and for D over C as irrational. Review the Allais and Ellsberg paradoxes in that light. (Note that in each case the question of irrationality is addressed to the agent's values, i.e., determinants of desirability, rather than to how the agent weighs those values together with probabilities.)

9 It takes a dollar to ride the subway. You and I each have a half-dollar coin, and sorely need a second. For each, desirabilities of cash are as in the graph above, so we decide to toss one coin and give both to you or to me depending on whether the head or tail turns up. In dollars, each thinks the gamble neither advantageous nor disadvantageous, since the expectation is 50 cents, i.e., halfway between losing ($0) and winning ($1). But in desirability, each thinks the gamble advantageous. To see why, read (a) des(gamble) and (b) des(don't) off the graph.

10 The Certainty Equivalent of a Gamble

According to the graph, $50 in hand is more desirable than a ticket worth $100 or $0, each with probability 1/2, a ticket of "actuarial value" $50. How many dollars in hand would be exactly as desirable as the ticket?

11 The St. Petersburg Paradox.

"Peter tosses a coin and continues to do so until it should land "heads" when it comes to the ground. He agrees to give Paul one ducat if he gets "heads" on the very first throw, two ducats if he gets it on the second, four if on the third, eight if on the fourth, and so on, so that with each additional throw the number of ducats he must pay is doubled. Suppose we seek to determine the value of Paul's expectation."

Paul's probability that the first head comes on the nth toss is pn = 1/2^n, and in that case Paul's receipt is rn = 2^(n-1). Then Paul's expectation of gain, p1r1 + p2r2 + ..., will be 1/2 + 1/2 + ... = ∞. Then should Paul be glad to pay any finite sum for the privilege of playing?
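The divergence is easy to confirm numerically; the following sketch (in Python, with exact fractions) sums the first n terms of Paul's expectation:

```python
from fractions import Fraction

def partial_expectation(n_terms):
    """Sum of p_n * r_n over the first n_terms tosses, where the first
    head on toss n has probability 1/2**n and pays 2**(n-1) ducats."""
    return sum(Fraction(1, 2**n) * 2**(n - 1) for n in range(1, n_terms + 1))

# Each term contributes exactly 1/2, so the partial sums grow without bound.
print(partial_expectation(10))   # 5
print(partial_expectation(100))  # 50
```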

"This seems absurd because no reasonable man would be willing to pay 20 ducats as equivalent. You ask for an explanation of the discrepancy between the mathematical calculation and the vulgar evaluation. I believe that it results from the fact that, in their theory, mathematicians evaluate money in proportion to its quantity while, in practice, people with common sense evaluate money in proportion to the utility they can obtain from it."

If the desirability des(r) of receiving r ducats increases more and more slowly, des(gamble) might be finite: des(gamble) = des(r1)/2 + des(r2)/4 + ... + des(rn)/2^n + ...

"If, for example, we suppose the moral value of goods to be directly proportionate to the square root of their mathematical quantities, e.g., that the satisfaction provided by 40,000,000 is double that provided by 10,000,000, my psychic expectation becomes 1/2+ 2/4+ 4/8+ 8/16... = 1/(2- 2)."

On this reckoning Paul should not be willing to pay as much as 3 ducats to play the game, for des(3) is √3, i.e., 1.73..., which is larger than des(gamble) = 1.70... But the paradox reappears as long as des(r) does eventually exceed any preassigned value, for then a variant of the St. Petersburg game can be devised in which the payoffs rn are large enough so that des(r1)p1 + des(r2)p2 + ... = ∞. Problem. With des(r) = √r as above, find payoffs rn that restore the paradox.
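Both calculations, and one payoff schedule that restores the paradox, can be checked numerically. The schedule rn = 4^n used below is an assumption of this sketch; with it, every term of the desirability series equals 1, so the series diverges:

```python
import math

# des(r) = sqrt(r); the original game pays 2**(n-1) ducats with probability 1/2**n.
des_gamble = sum(math.sqrt(2 ** (n - 1)) / 2 ** n for n in range(1, 200))
print(des_gamble)                 # ~1.7071
print(1 / (2 - math.sqrt(2)))     # the closed form, also ~1.7071
print(math.sqrt(3))               # ~1.7321: three ducats in hand beat the gamble

# With payoffs r_n = 4**n, each term sqrt(4**n) / 2**n equals 1,
# so the desirability series diverges.
terms = [math.sqrt(4 ** n) / 2 ** n for n in range(1, 6)]
print(terms)                      # [1.0, 1.0, 1.0, 1.0, 1.0]
```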

3.3 Rescaling A balanced beam would remain balanced if expanded or contracted uniformly about the fulcrum, e.g., if each inch stretched to a foot, or shrank to a centimeter. That's because balance is a matter of cancellation of the net clockwise and counterclockwise turning effects, and uniform expansion or contraction would multiply each of these by a common factor k, e.g., k=12 if inches stretch to feet and k=.3937 if inches shrink to centimeters. Applying the laws of the lever to choice, we conclude that nothing relevant to decision-making depends on the size of the unit of the desirability scale.

Furthermore, nothing relevant to decision-making depends on the location of the zero of the desirability scale. In physics this corresponds to the fact that if a beam is in balance then the turning effects of the various forces on it about any one point add up to 0, whether or not the point is the fulcrum -- provided we view the fulcrum as pressing upward with a force equal to the net weight of the loaded beam.

Then if numbers des(H) accurately represent your judgments of how good it would be for hypotheses H to be true, so will the numbers ades(H), where a is any positive constant. That's because multiplying by a positive constant is just a matter of uniformly shrinking or stretching the scale -- depending on whether the constant is greater or less than 1 (as when lengths in feet look 12 times as great in inches and 1/3 as great in yards). And if numbers des(H) accurately represent your valuations, so will ades(H)+b, where a is positive and b is any constant at all; for moving the origin of coordinates left (positive b) or right (negative b) by the same amount, b, leaves distances between points (gains and losses) unchanged. E.g., in example 13.2, we can set d=0, l=1 without thereby making any substantive assumptions about Diamond Jim's desirabilities. On that scale, desirabilities of the acts are simply des(C) = c + .59 and des(S) = .75.

Two desirability assignments des and des' determine the same preferences among options if the graph of one against the other is a straight line des'(H) = ades(H)+b as in Fig. 1, sloping up to the right, so that des' is a positive linear transform of des. The multiplicative constant a is the line's slope (rise per unit run); the additive constant b is the des'-intercept, the height at which the line cuts the des'-axis.

Fig. 1. des and des' are equivalent desirability scales.
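The invariance claim can be sketched in code; the desirability numbers below are made up for illustration:

```python
# Hypothetical desirabilities for four hypotheses.
des = {"C": 0.69, "S": 0.75, "L": 1.0, "D": 0.0}

def rescale(des, a, b):
    """Positive linear transform: new scale des'(H) = a*des(H) + b, with a > 0."""
    assert a > 0
    return {h: a * v + b for h, v in des.items()}

des2 = rescale(des, a=100.0, b=-7.0)

# The ranking of hypotheses is the same on either scale.
rank = sorted(des, key=des.get)
rank2 = sorted(des2, key=des2.get)
print(rank == rank2)  # True
```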

There is less scope for rescaling probabilities. If the weights in all pans are multiplied by the same positive constant, balance will not be disturbed; but no other sort of change, e.g., adding the same extra weight to all pans, can be relied upon never to unbalance the scales. It would be all right to use different positive numbers as probabilities of a sure thing in different problems -- perhaps, the upward force at the fulcrum. Although we do use 1 as the probability of a sure thing in every problem, that is just a convention; any positive constant would do. On the other hand, we adopt no such convention for desirabilities, e.g., we do not insist that the desirability of a sure thing (Av-A) always be 0, or that the desirability of an impossibility (A&-A) always be 0.

3.4 Expectations, RV's, Indicators My expectation of any unknown quantity -- any so-called "random variable" (or "RV" for short) -- is a weighted average of the values I think it can assume, in which the weights are my probabilities for those values.

Example 1. Giant Pandas. Let X = the birth weight to the nearest pound of the next giant panda (ailuropoda melanoleuca) to be born in captivity. If I have definite probabilities p0, p1, etc. for the hypotheses that X = 0, 1, ... , 99, etc., my expectation of X will be

0·p0 + 1·p1 + ... + 99·p99 + ...

This sum can be stopped after the 99th term without affecting its value, since I attribute probability 0 to values of X of 100 or more.
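As a sketch in code, with a made-up (hypothetical) probability assignment over a few possible weights:

```python
from fractions import Fraction as F

# Hypothetical probabilities for the panda's birth weight X (in pounds).
probs = {2: F(1, 10), 3: F(4, 10), 4: F(4, 10), 5: F(1, 10)}  # sums to 1

# Expectation: each possible value weighted by its probability.
expectation = sum(x * p for x, p in probs.items())
print(expectation)   # 7/2, a probability-weighted average of the values
```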

It turns out that probability itself is an expectation:

Indicator Property. My probability p for truth of a hypothesis is my expectation of a random variable (the "indicator" of the hypothesis) that has value 1 or 0 depending on whether the hypothesis is true or false.

Proof. As 1 and 0 are the only values this RV can assume, its expectation is 1p + 0(1-p), i.e., p.

Observe that an expected value needn't be one of the values the random variable can actually assume. Thus in the Panda example X must be a whole number; but its expected value, which is an average of whole numbers, need not be a whole number. Nor need my expectation of the indicator of past life on Mars be one of the values that indicators can assume, i.e., 0 or 1; it might well be 1/10, as in the story at the beginning of chapter 1. The indicator property of expectation is basic. So is this:

Additivity. Your expectation of a sum is the sum of your expectations of its terms.

From additivity it follows that your expectation of X+X is twice your expectation of X, your expectation of X+X+X is three times your expectation of X, and for any whole number n,

Proportionality. Your expectation of nX is n times your expectation of X.

Example 2. Success Rates. I attribute the same probability, p, to success on each trial of an experiment. Consider the indicators of the hypotheses that the different trials succeed. The number of successes in the first n trials will be the sum of the first n indicators; the success rate in the first n trials will be that sum divided by n. Now by additivity, my expectation of the number of successes must be the sum of my expectations of the separate indicators. Then by the indicator property, my expectation of the number of successes in the first n trials is np. Therefore my expectation of the success rate in the first n trials must be np divided by n, i.e., p itself.

That last statement deserves its own billing:

Calibration Theorem. If you have the same probability (say, p) for success on each trial of an experiment, then p will also be your expectation of the success rate in any finite set of trials.

The name comes from the jargon of weather forecasting; forecasters are said to be well calibrated when the fraction of truths ("success rate") is p among statements to which they have attributed p as probability. Thus the theorem says: forecasters expect to be well calibrated.

3.5 Why Expectations are Additive Like probabilities, expectations can be related to prices. My expectation of a magnitude X can be identified with the buying-or-selling price I'd think fair for a ticket that can be cashed for X units of currency. A ticket for X as in the Panda example is shown above. By comparing prices and values of combinations of such tickets we can give a Dutch book argument for additivity of expectations.

Suppose x and y are your expectations of magnitudes X and Y -- say, rainfall in inches during the first and second halves of next year -- and z is your expectation for next year's total rainfall, X+Y. Why should z be x+y? Because in every eventuality about rainfall in the two periods, the first two of these tickets together are worth the same as the third:

Then unless the prices you would pay for the first two add up to the price you would pay for the third, you are inconsistently placing different values on the same prospect, depending on whether it is described to you in one or the other of two provably equivalent ways.

3.6 Conditional Expectation Just as we defined your conditional probabilities as the prices you'd think fair for tickets that represent conditional bets, so we define conditional expectations:

Your conditional expectation E(X | H) of the random variable X given truth of the statement H is the price you'd think fair for the following ticket:

Corresponding to the notation E(X | H) for your conditional expectation for X, we use E(X) for your unconditional expectation for X, and E(XY) for your unconditional expectation of the product of the magnitudes X and Y. The following rule might be viewed as a definition of conditional expectations in terms of unconditional ones in case P(H) != 0, just as the quotient rule for probabilities might be viewed as a definition of P(G|H) as the quotient P(G&H)/P(H). On the right-hand side, IH is the indicator of H. Therefore P(H) = E(IH).

Quotient Rule. E(X | H) = E(X. IH)/P(H)

The quotient rule is equivalent to the following relationship between conditional and unconditional expectations.

Product Rule. E(X. IH) = E(X | H)P(H)

A "Dutch book" consistency argument for this relationship can be modelled on the one given in sec. 4 of for the corresponding relationship between probabilities. Consider the following tickets. Clearly, the first has the same value as the pair to its right, whether H is true or false. And those two have the same values as the ones under them, for XIH is X if H is true and 0 if H is false, and the last ticket just duplicates the one above it. Then unless your price for the first ticket is the sum of your prices for the last two, i.e., unless the condition

E(X | H) = E(X. IH) + E(X | H)P(-H),

is met, you are inconsistently placing different values on the same prospect depending on whether it is described in one or the other of two provably equivalent ways. Now set P(-H) = 1-P(H) in this condition, and simplify. It boils down to the product rule.
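The product rule can also be confirmed directly on a toy probability space (four equiprobable points and made-up values of X):

```python
from fractions import Fraction as F

# Four equiprobable points; X is a magnitude, H a hypothesis (a set of points).
P = {"w1": F(1, 4), "w2": F(1, 4), "w3": F(1, 4), "w4": F(1, 4)}
X = {"w1": 2, "w2": 6, "w3": 0, "w4": 10}
H = {"w1", "w2"}

P_H = sum(P[w] for w in H)
E_X_IH = sum(P[w] * X[w] for w in H)   # X*I_H equals X on H and 0 elsewhere
E_X_given_H = E_X_IH / P_H             # quotient rule

print(E_X_given_H, P_H, E_X_IH)        # 4 1/2 2
print(E_X_IH == E_X_given_H * P_H)     # True: the product rule
```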

Historical Note. Solved for P(H), the product rule determines H's probability as the ratio of E(X. IH) to E(X | H). As IH is the indicator of H, X. IH is X or 0 depending on whether H is true or false. Thus, viewed as a statement about P(H), the product rule corresponds to Thomas Bayes's definition of probability:

"The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."

E(X. IH) is the value you place on the ticket at the left; E(X | H) is the value you place on the ticket at the right:

The first ticket is "an expectation depending on the happening of the event", i.e., an expectation of $X depending on the truth of H, an unconditional bet on H.Your price for the second ticket, $E(X | H), is "the value of the thing expected [$X] upon its [H's] happening": as you get the price back if H is false, your uncertainty about H doesn't dilute your expectation of X here, as it does if the ticket is worthless when H is false.

3.7 Laws of Expectation The basic properties of expectation are the product rule and

Linearity. E(aX+bY+c) = aE(X)+bE(Y)+c

Three notable special cases of the linearity equation are obtained by setting a=b=1 and c=0 (additivity), b=c=0 (proportionality), and a=b=0 (constancy):

Additivity. E(X+Y) = E(X)+E(Y)
Proportionality. E(aX) = aE(X)
Constancy. E(c) = c

By repeated application, the additivity equation can be seen to hold for arbitrary finite numbers of terms -- e.g., for 3 terms, by applying 2-term additivity to X+(Y+Z):

E(X+Y+Z) = E(X)+E(Y)+E(Z)

The magnitudes of which we have expectations are called "random variables" since they may have various values with various probabilities. May have: in the linearity property, a, b, and c are constants, so that, as we have seen, E(c) makes sense, and equals c. But more typically, a random variable might have any of a number of values as far as you know. Convexity says that your expectation for the variable cannot be larger than all of those values, or smaller than all of them:

Convexity. E(X) lies in the range from the largest to the smallest values that X can assume.

Where X can assume only a finite number of values, convexity follows from linearity.

The following connection between conditional and unconditional expectations is of particular importance. Here, the H's are any hypotheses whatever.

Total Expectation. If no two of H1, H2, ... are compatible, and H is their disjunction, then E(X | H) = E(X | H1)P(H1| H) + E(X | H2)P(H2| H)+ ...

Proof. X . IH = X . IH1 + X . IH2 + ... ; apply E to both sides, then use additivity and the product rule. Divide both sides by P(H), and use the fact that P(Hi)/P(H) = P(Hi| H) since Hi&H = Hi.
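A numeric check of total expectation, on a made-up six-world space:

```python
from fractions import Fraction as F

P = {w: F(1, 6) for w in range(6)}          # six equiprobable worlds
X = {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11}   # made-up values of X
H1, H2 = {0, 1}, {2, 3}                     # incompatible hypotheses
H = H1 | H2                                 # their disjunction

def prob(A):
    return sum(P[w] for w in A)

def E_given(A):
    """Conditional expectation of X given the event A."""
    return sum(P[w] * X[w] for w in A) / prob(A)

lhs = E_given(H)
rhs = E_given(H1) * prob(H1) / prob(H) + E_given(H2) * prob(H2) / prob(H)
print(lhs, rhs)   # 4 4: the two sides of the total expectation rule agree
```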

Note that when conditions are certainties, conditional expectations reduce to unconditional ones:

Certainty. E(X | H) = E(X) if P(H)=1

Applying conditions. It is always OK to apply conditions of form Y = blah (e.g., Y = 2X) appearing at the right of the bar, to rewrite Y as blah at the left:

E(... Y... | Y=blah) = E(... blah... | Y=blah) (OK!)

The Discharge Fallacy. But we cannot generally discharge a condition Y=blah by rewriting Y as blah at the left and dropping the condition; e.g., E(3Y2 | Y=2X) cannot be relied upon to equal E(3(2X)2). In general:

E(... Y... | Y=blah) = E(... blah... ) (NOT!)

The problem of the two sealed envelopes. One contains a check for an unknown whole number of dollars, the other a check for twice or half as much. Offered a free choice, you pick one at random. You might as well have chosen the other, since you think them equally likely to contain the larger amount. What is wrong with the following argument for thinking you have chosen badly? "Let X and Y be the values of the checks in the one and the other. As you think Y equally likely to be .5X or 2X, E(Y) will be .5E(.5X) + .5E(2X) = 1.25E(X), which is larger than E(X)."
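A simulation can make the situation vivid. It requires a definite prior for the smaller amount, which the problem leaves open; the uniform $1-$100 prior below is purely a device for this sketch:

```python
import random

random.seed(1)
trials = 200_000
total_x = total_y = 0

for _ in range(trials):
    s = random.randint(1, 100)   # hypothetical prior: smaller check is $1..$100
    pair = [s, 2 * s]
    random.shuffle(pair)         # you pick one envelope at random
    x, y = pair
    total_x += x
    total_y += y

mean_x, mean_y = total_x / trials, total_y / trials
# Both averages come out near 1.5 * E(smaller) = 75.75: neither envelope
# does better, contrary to the argument's conclusion that E(Y) = 1.25E(X).
print(mean_x, mean_y)
```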

A Valid Special Case of the Discharge Fallacy. As an unrestricted rule of inference, the discharge fallacy is unreliable. (That's what it is, to be a fallacy.) But it becomes a valid rule of inference when "blah" represents a constant, e.g., as in "Y=.5". Show that the following is valid.

E(... Y... | Y=constant) = E(... constant... )

3.8 Physical Analogies; Mean and Median Hydraulic Analogy. Let "F" and "S" mean heads on the first and second tosses of an ordinary coin. Suppose you stand to gain a dollar for each head. Then your net gain in the four possibilities for +/- F and +/- S will be as shown at the left below.

Think of that as a map of flooded walled fields in a plain, with the numbers indicating water depths in the four sections. In the four regions, depths are values of X and areas are probabilities. To find your expectation for X, open sluices so that the water reaches a common level in all sections. That level will be E(X). To find your conditional expectation for X given F, open a sluice between the two sections of F so that the water reaches a single level in F. That level will be E(X | F), i.e., 1.5. Similarly, E(X | -F) = 0.5. To find your unconditional expectation of gain, open all four sluices so that the water reaches the same level throughout: E(X) = 1.

There is no mathematical reason for magnitudes X to have only a finite number of values, e.g., we might think of X as the birth weight in pounds of the next giant panda to be born in captivity -- to no end of decimal places of accuracy, as if that meant something. (It doesn't. The commonplace distinction between panda and ambient moisture, dirt, etc. isn't drawn finely enough to let us take the remote decimal places seriously.) We can extend the hydraulic analogy to such continuous magnitudes by supposing that the fields may be pitted and contoured so that water depth (X) can vary continuously from point to point. But cases like temperature, where X can also go negative, require more tinkering -- e.g., solid state H2O, with heights on the iceberg representing negative values.

Balance. The balance analogy (sec. 13, Fig. 1) is more easily adapted to the continuous case. The narrow rigid beam itself is weightless. Positions on it represent values of a magnitude X that can go negative as well as positive. Pick a zero, a unit, and a positive direction on the beam. Get a pound of modelling clay, and distribute it along the beam so that the weight of clay on each section represents the probability that the true value of X is in that section. (Fig. 1 below is an example -- where, as it happens, X cannot go negative.)

Example. "The Median isn't the Message" In 1985 Stephen Gould wrote: "In 1982, I learned I was suffering from a rare and serious cancer. After surgery, I asked my doctor what the best technical literature on the cancer was. She told me ... that there was nothing really worth reading. I soon realized why she had offered that humane advice: my cancer is incurable, with a median mortality of eight months after discovery." In terms of the balanced beam analogy, here are the key definitions, of the terms "median" and "mean" -- the latter being a synonym for "expectation":

The median is the point on the beam that divides the weight of clay in half: the probabilities are equal that the true value of X is represented by a point to the right and to the left of the median.

The mean (= your expectation ) is the point of support at which the beam would just balance.

Gould continues: "The distribution of variation had to be right skewed, I reasoned. After all, the left of the distribution contains an irrevocable lower boundary of zero (since mesothelioma can only be identified at death or before). Thus there isn't much room for the distribution's lower (or left) half -- it must be scrunched up between zero and eight months. But the upper (or right) half can extend out for years and years, even if nobody ultimately survives." See Fig. 1, below. Being skewed (stretched out) to the right, the median of this probability distribution is to the left of its mean; Gould's life expectancy is greater than 8 months. (The mean of 24 months suggested in the graph is my invention. I don't know the statistics.)

Fig. 1. Locations on the beam are months lived after diagnosis; the weight of clay on the interval from 0 to m is the probability of still being alive in m months.

The effect of skewness can be seen especially clearly in the case of discrete distributions like the following. Observe that if the right-hand weight is pushed further right the mean will follow, while the median stays fixed. The effect is most striking in the case of the St. Petersburg game, where the median gain is between 1 and 2 ducats but the expected (mean) gain is infinite.

Fig. 2. The median stays between the second and third blocks no matter how far right you move the fourth block.
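The same point in code, with illustrative block positions:

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s, n = sorted(xs), len(xs)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

blocks = [1, 2, 3, 10]                 # four equally probable values
print(median(blocks), mean(blocks))    # 2.5 4.0

# Pushing the right-hand block further right drags the mean along,
# while the median stays put between the second and third blocks.
blocks2 = [1, 2, 3, 100]
print(median(blocks2), mean(blocks2))  # 2.5 26.5
```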

3.9 Desirabilities as Expectations Desirability is a mixture of judgments of fact and value; your desirability for H represents a survey of your desirabilities for the various ways you think H could happen, combined into a single figure by multiplying each number by your probability for its being the desirability of the way H actually happens. In effect, your desirability for H is your conditional expectation of a magnitude ("U"), commonly called "utility":

des(H) = E(U | H)

U(s) is your desirability for a complete scenario, s, that says exactly how everything turns out. Then U(s) records a pure value judgment, untainted by uncertainty about details, whereas des(H) mixes pure value judgments with pure factual judgments.

Example 1. The Heavy Smoker, Again (cf. example 13.2). In Fig. 1(a), U's actual value is the depth of the unknown point representing the real situation, and desirabilities are average depths -- e.g., the desirability des(SL)=1 of switching and living at least 5 more years is a probability-weighted average of all manner of ways for that to happen -- hideous, delightful, or middling. With P(L | C)=.60 (nearly; it's really .59) and P(L | S)=.75, switching raises the odds on L:D from 3:2 to 3:1. Then in (b), desirabilities des(C)=.69 and des(S)=.75 are mixtures -- of 1.1 with .1 in the ratio 3:2, and 1 with 0 in the ratio 3:1.

(a) des(act & outcome) (b) des(act) Fig. 1. Hydraulic Analogy

(a) Initial P(act & outcome) (b) Final P(act & outcome) Fig. 2. Unconditional probabilities. (Depths as in Fig. 1.)
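The desirabilities in the example can be recomputed as conditional expectations; the value .1 assigned below to continuing and dying is inferred from the figures quoted in the text, not stated there outright:

```python
# Conditional probabilities from the example: P(L|C) = .59, P(L|S) = .75.
# Desirabilities of act-and-outcome conjunctions; the .1 "smoking bonus"
# (so that des(C&L) = 1.1, des(C&D) = .1) is an inference from the text.
des_CL, des_CD = 1.1, 0.1
des_SL, des_SD = 1.0, 0.0

des_C = 0.59 * des_CL + 0.41 * des_CD   # desirability of continuing
des_S = 0.75 * des_SL + 0.25 * des_SD   # desirability of switching

print(round(des_C, 2), round(des_S, 2))  # 0.69 0.75: switching wins
```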

Figures 1(a, b) and 2(a) are drawn with P(C) = P(S) = 1/2; the upper and lower sections have equal areas. That is one way to represent Diamond Jim's initial state of mind, unsure of his action. Another is to say he has no numbers at all in mind for P(C) and P(S), even though he does for P(L | C) and P(L | S). Either way, he has clearly not yet made up his mind, for when he has, P(C) and P(S) will be numbers near the ends of the unit interval. In fact, deliberation ends with a realization that switching has the higher desirability; he decides to make S true, or try to; final P(S) will be closer to 1 than initial P(S), i.e., 2/3 instead of 1/2 in Fig. 2. (Diamond Jim is far from sure he will carry out the chosen line of action.)

Warning. des(act) measures choiceworthiness only if odds on outcomes conditionally on acts remain constant as odds on acts vary -- e.g. as in Fig. 2, where odds on L:D given C and given S remain constant at 3:2 and 3:1 as odds on C:S vary from 1:1 to 1:2. This warning is important in "Newcomb" problems, i.e., quasi-decision problems in which acts are seen as mere symptoms of outcomes that agents would promote or prevent if they could.

Example 2. Genetic Determinism. Suppose Diamond Jim attributes the observed correlations between smoking habits and longevities to the existence in the human population of two alleles (good, bad) of a certain gene, where the bad allele promotes heavy cigarette smoking and early death, and works against switching. Jim thinks it's the allele, not the habit, that's good or bad for you; he sees his act and his life expectancy as conditionally independent given his allele -- whether good or bad. And he sees the allele as hegemonic (sec. 11), determining the chances P(act | allele), P(outcome | allele) of acts and outcomes. Then higher odds on switching are a sign that his allele is the longevity-promoting one. He sees reason to hope that he will (try to) switch, but no reason to try.

3.10 Notes Sec. 3.1 For the statistics in example 2, see The Consumers Union Report on Smoking and the Public Interest (Consumers Union, Mt. Vernon, N.Y., 1963, p. 69). This example is adapted from R. C. Jeffrey, The Logic of Decision (2nd ed., Chicago: U. of Chicago Press, 1983).

Sec. 3.2, Problems 1 and 2 come from The Logic of Decision. 3 is from D. V. Lindley, Making Decisions (New York: Wiley-Interscience, 1971), p. 96.

4 is from Maurice Allais, "Le comportement de l'homme rationnel devant le risque," Econometrica 21 (1953): 503-46. Translated in Maurice Allais and Ole Hagen (eds.), Expected Utility and the Allais Paradox, Dordrecht: Reidel, 1979. 5 is from Daniel Ellsberg, "Risk, Ambiguity, and the Savage Axioms," Quarterly Journal of Economics 75 (1961): 643-69. 6 is Alan Gibbard's variation of problem 7 (i.e., Richard Zeckhauser's) that Daniel Kahneman and Amos Tversky report in Econometrica 47 (1979): 283. See also "Risk and human rationality" by R. C. Jeffrey, The Monist 70 (1987): 223-36.

9. In the diagram, "marginal desirability" (rate of increase of desirability) reaches a maximum at des = 4, and then shrinks to a minimum at des = 6. The second half dollar increases des twice as much as the first.

11. The St. Petersburg paradox. Daniel Bernoulli's "Exposition of a new theory on the measurement of risk" (in Latin) appeared in the Proceedings of the St. Petersburg Imperial Academy of Sciences 5 (1738). Translation: Econometrica 22 (1954): 123-36, reprinted in Utility Theory: A Book of Readings, ed. Alfred N. Page (New York: Wiley, 1968). The three quotations are from correspondence between Daniel's uncle Nicholas Bernoulli and (first, 1713) Pierre de Montmort, and (second and third, 1728) Gabriel Cramer. It seems to have been Karl Menger (1934) who first noted that the paradox reappears as long as U(r) is unbounded; see the translation of that paper in Essays in Mathematical Economics, Martin Shubik (ed.), Princeton U. P., 1967, especially the first footnote on p. 211.

Sec. 3.4. For more about calibration, etc., see Morris DeGroot and Stephen Fienberg, "Assessing Probability Assessors: Calibration and Refinement," in Shanti S. Gupta and James O. Berger (eds.), Statistical Decision Theory and Related Topics III, Vol. 1, New York: Academic Press, 1982, pp. 291-314.

Sec. 3.5, 3.6. The Dutch book theorems for expectations and conditional expectations are bits of Bruno de Finetti's treatment of the subject in vol. 1 of his Theory of Probability, New York: Wiley, 1974.

Sec. 3.6. Bayes's definition of probability is from his "Essay toward solving a problem in the doctrine of chances," Philosophical Transactions of the Royal Society 50 (1763), p. 376, reprinted in Facsimiles of Two Papers by Bayes, New York: Hafner, 1963.

Sec. 3.8. "The Median isn't the Message" by Stephen Jay Gould) appeared in Discover 6(June 1985)40-42.

Sec. 3.9, example 2. This is a "Newcomb" problem; see Robert Nozick, "Newcomb's Problem and Two Principles of Choice" in Essays in Honor of Carl G. Hempel, N. Rescher, ed. (Dordrecht: Reidel Publishers, 1969). For recent references and further discussion, see Richard Jeffrey, "Causality in the Logic of Decision" in Philosophical Topics 21 (1993) 139-151.

SOLUTIONS

Sec. 3.2

1 Train.

2 1/2

3(b) Yes.

5 If you prefer A to B, you should prefer D to C.

6 With 0 = des[die], 1 = des[rich and alive], and u = des[a million dollars poorer, but alive], suppose that u = des[get rid of the two bullets]; you'd pay everything you have. Then you are indifferent between A and B below.

To see that you must be indifferent between C and D as well, observe that if you plug the A diagram in at the "u" position in the D diagram, you get the DA diagram. But there, the probability of getting 1 is 1/3 (i.e., 1/2 times 2/3), so the probability of getting 0 one way or the other must be 2/3, and the DA diagram is equivalent to the C diagram. Thus you should be indifferent between C and D if you are indifferent between A and B.

7 1/(e+1)

9 des(gamble) = 3, des(don't) = 2.

10 $10

11 rn = (2^n)^2

Sec. 3.7, The Two Envelopes. It's the discharge fallacy. To see why, apply the law of total expectation and the assumption that P(Y=.5X) = P(Y=2X) = 1/2, to get this:

E(Y) = .5E(Y | Y=.5X) + .5E(Y | Y=2X)

By the discharge fallacy, we would then have

E(Y) = .5E(.5X) + .5E(2X) = 1.25E(X) (NOT!)

But in fact, what we have is this:

E(Y) = .5E(.5X | Y=.5X) + .5E(2X | Y=2X)

= .25E(X | Y=.5X) + E(X | Y=2X)

In fact, E(X) and E(Y) are the same mixture of your larger and smaller expectations of X when Y is the larger (2X) or smaller (X/2) amount: E(X) = E(Y) = five parts of E(X | Y=.5X) with two parts of E(X | Y=2X).


Suppose you regard two events as positively correlated, i.e., your personal probability for both happening is greater than the product of their separate personal probabilities. Is there something that shows you see one of these as promoting the other, or see both as promoted by some third event? Here an answer is suggested in terms of the dynamics of your probability judgments. This answer is then applied to resolve a prima facie difficulty for the logic of decision posed by ("Newcomb") problems in which you see acts as mere symptoms of conditions you would promote or prevent if you could. We begin with an account of preference as conditional expected utility, as in my 1965 book, The Logic of Decision.

4.1 Preference Logic In Theory of Games and Economic Behavior, von Neumann and Morgenstern represented what we do in adopting options as a matter of adopting particular probability distributions over the states of nature. The thought was that your preference ranking of options will then agree with your numerical ranking of expectations of utility as computed according to the adopted probabilities. In fact, they took your utilities for states of nature to be the same, no matter which option you choose, as when states of nature determine definite dollar gains or losses, and these are all you care about. This condition may seem overly restrictive, for the means by which gains are realized may affect your final utility -- as when you prefer work to theft as a chancy way of gaining $1000 (success) or $0 (failure). But the condition can always be met by making distinctions, e.g., by splitting each state of nature in which you realize an outcome by work-or-theft into one in which you realize it by work, and another in which you realize it by theft. (The difference between the work and theft options will be encoded in their associated probability distributions: each assigns probability 0 to the set of states in which you opt for the other.)

This means taking a naturalistic view of the decision-maker, whose choices are blended with states of nature in a single space, Ω; each point in that space represents a particular choice by the decision-maker as well as a particular state of (the rest of) nature. In 1965 Bolker and Jeffrey offered a framework of that sort in which options are represented by propositions (i.e., subsets of Ω: in statistical jargon, "events"), and any choice is a decision to make some proposition true. Each such proposition

corresponds to a definite probability distribution in the von Neumann-Morgenstern scheme. To the option of making the proposition A true corresponds the conditional probability distribution P(-- | A), where the unconditional distribution P(--) represents your prior probability judgment -- i.e., prior to deciding which option-proposition to make true. And your expectation of utility associated with the A-option will be your conditional expectation of utility, E(u | A) -- also known as your desirability for truth of A, and denoted "des A":

des A = E(u | A)

Now preference (>), indifference (=), and preference-or-indifference (>= ) go by desirability, so that

A > B if des A > des B
A = B if des A = des B
A >= B if des A >= des B

Note that it is not only option-propositions that appear in preference rankings; you can perfectly well prefer a sunny day tomorrow (= truth of "Tomorrow will be sunny") to a rainy one even though you know you cannot affect the weather. Various principles of preference logic can now be enunciated, and fallacies identified, as in the following two examples. The first is a fallacious mode of inference according to which you must prefer B's falsity to A's if you prefer A's truth to B's:

A > B ---------B > -A Counterexample: Death before dishonor. A = You are dead tomorrow; B = You are dishonored today. (You mean to commit suicide if dishonored.) If your probabilities and desirabilities for the four cases tt, tf, ft, ff concerning truth and falsity of AB are as follows, then your desirabilities for A, B, -B, -A will be 0.5, 2.9, 5.5, 6.8, so that the premise is true but the conclusion false. Invalid:

case:        tt    tf    ft    ff
P(case):    .33   .33   .01   .33
des(case):    0     1  -100    10

The second is a valid mode of inference:
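The counterexample can be checked in a few lines. The sketch below is mine, not part of the text; it encodes the four cases from the table and the definition des A = E(u | A), then computes the four desirabilities.

```python
# Sketch (mine): verifying the death-before-dishonor counterexample.
# des E = E(u | E), the probability-weighted average utility within E.
P = {"tt": .33, "tf": .33, "ft": .01, "ff": .33}      # P(case) from the table
u = {"tt": 0.0, "tf": 1.0, "ft": -100.0, "ff": 10.0}  # des(case) from the table

def des(event):
    """Desirability of an event (a set of cases): E(u | event)."""
    total = sum(P[c] for c in event)
    return sum(P[c] * u[c] for c in event) / total

A,  B  = {"tt", "tf"}, {"tt", "ft"}   # A true in tt, tf; B true in tt, ft
nA, nB = {"ft", "ff"}, {"tf", "ff"}

print(round(des(A), 1), round(des(B), 1))    # 0.5 -2.9: premise A > B holds
print(round(des(nB), 1), round(des(nA), 1))  # 5.5 6.8: conclusion -B > -A fails
```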

Valid:

    A >= B
  ----------------   (if A and B are incompatible)
  A >= AvB >= B

Proof: Given the proviso, and setting w = P(A | AvB), we find that des(AvB) = w(des A) + (1-w)(des B). This is a convex combination of des A and des B, which must therefore lie between them or at an endpoint.

Bayesian decision theory is said to represent a certain structural concept of rationality. This is contrasted with substantive criteria of rationality having to do with the aptness of particular probability and utility functions to particular predicaments. With Davidson, I would interpret this talk of rationality as follows. What remains when all substantive questions of rationality are set aside is bare logic, a framework for tracking your changing judgments, in which questions of validity and invalidity of argument-forms can be settled as illustrated above.

A complete set of substantive judgments would be represented by a Bayesian frame, consisting of (1) a probability distribution over a space Ω of "possible worlds," (2) a function u assigning "utilities" to the various worlds in Ω, and (3) an assignment of subsets of Ω as values to letters "A", "B", etc., where the letters represent sentences and the corresponding subsets represent "propositions." In any logic, validity of an argument is truth of the conclusion in every frame in which all the premises are true. In a Bayesian logic of decision, Bayesian frames represent possible answers to substantive questions of rationality; we can understand that without knowing how to determine whether particular frames would be substantively rational for you on particular occasions. So in Bayesian decision theory we can understand validity of an argument as truth of its conclusion in any Bayesian frame in which all of its premises are true, and understand consistency of a judgment (e.g., affirming A > B while denying -B > -A) as existence of a nonempty set of Bayesian frames in which the judgment is true. On this view, consistency -- bare structural rationality -- is simply representability in the Bayesian framework.
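The convexity step in the proof can be illustrated numerically. The sketch below is mine; it reuses the four-case model above, with a disjoint pair of propositions of my own choosing.

```python
# Sketch (mine): for incompatible A and B, des(AvB) = w des A + (1-w) des B,
# where w = P(A | AvB); so des(AvB) lies between des A and des B.
P = {"tt": .33, "tf": .33, "ft": .01, "ff": .33}
u = {"tt": 0.0, "tf": 1.0, "ft": -100.0, "ff": 10.0}

def des(event):
    total = sum(P[c] for c in event)
    return sum(P[c] * u[c] for c in event) / total

A, B = {"tf"}, {"ft"}      # incompatible: no case in common
AvB = A | B
w = sum(P[c] for c in A) / sum(P[c] for c in AvB)   # w = P(A | AvB)

assert abs(des(AvB) - (w * des(A) + (1 - w) * des(B))) < 1e-9
assert min(des(A), des(B)) - 1e-9 <= des(AvB) <= max(des(A), des(B)) + 1e-9
```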

4.2 Kinematics

In the design of mechanisms, kinematics is the discipline in which rigid rods and distortionless wheels, gears, etc. are thought of as faithful, prompt communicators of motion. The contrasting dynamic analysis takes forces into account, so that, e.g., elasticity may introduce distortion, delay, and vibration; but kinematical analyses often suffice, or, anyway, suggest relevant dynamical questions. That is the metaphor behind the title of this section and behind use of the term "rigidity" below for constancy of conditional probabilities. (Here, as in the case of mechanisms, rigidity assumptions are to be understood as holding only within rough bounds defining normal conditions of use, the analogs of load limits for bridges.)

In choosing a mixed option -- in which you choose one of two options, O1 or O2, depending on whether some proposition C is true or false -- you place the following constraint on your probabilities.

Stable conditional probabilities P(O1 | C) and P(O2 | -C) are set near 1

(Near: you may mistake C's truth value, or bungle an attempt to make Oi true, or revise your decision.) As choosing the mixed option involves expecting to learn whether C is true or false, choosing it involves expecting your probabilities for C and -C to move toward the extremes:

Labile probabilities of conditions P(C) and P(-C) change from middling values to extreme values, near 0 or 1

This combination of stable conditional probabilities and labile probabilities of conditions is analogous to the constraints under which modus ponens (below) is a useful mode of inference; for if confidence in the second premise is to serve as a channel transmitting confidence in the first premise to the conclusion as well, the increase in P(first premise) had better not be accompanied by a decrease in P(second premise).

Modus Ponens

  C
  D or not C
  -----------
  D
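The point about confidence transmission can be made quantitative: the probability of the conclusion is bounded below by P(first premise) + P(second premise) - 1, so confidence leaks away only to the extent that the premises' probabilities fall short of 1. A toy check (the joint distribution is my own illustration, not from the text):

```python
# Sketch (my own toy numbers): a joint distribution over the truth values
# of C and D, checking the bound P(D) >= P(C) + P(D or not-C) - 1.
P = {("t", "t"): .85, ("t", "f"): .05, ("f", "t"): .02, ("f", "f"): .08}

p_first  = sum(v for (c, d), v in P.items() if c == "t")              # P(C) = .90
p_second = sum(v for (c, d), v in P.items() if d == "t" or c == "f")  # P(D or not-C) = .95
p_concl  = sum(v for (c, d), v in P.items() if d == "t")              # P(D) = .87

assert p_concl >= p_first + p_second - 1   # .87 >= .85
```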

When unconditional probabilities change, some conditional probabilities may remain fixed; but others will change. Example. Set

a = P(A | C), a' = P(A | -C), o = P(C)/P(-C)

and suppose a and a' both remain fixed as the odds o on C change. Then P(C | A) and P(-C | A) must change, since

P(C | A) = a/(a + a'/o)        P(-C | A) = a'/(a' + ao)

What alternative is there to conditioning, as a way of updating probabilities? Mustn't the rational effect of an observation always be certainty of the truth of some data proposition? Surely not. Much of the perception on which we reasonably base action eludes that sort of formulation. Our vocabulary for describing what we see, hear, taste, smell, and touch is no match for our visual, auditory, etc. sensitivities, and the propositional judgments we make with confidence are not generally tied to confident judgments expressible in terms of bare sensation.

Example. One day on Broadway, my wife and I saw what proved to be Mayor Dinkins. There were various clues: police, cameramen, etc. We looked, and smiled tentatively. He came and shook hands. Someone gave us Dinkins badges. We had known an election was coming and the candidates campaigning. At the end we had no doubt it was the Mayor. But there was no thought of describing the sensations on which our progress toward conviction was founded, no hope of formulating sensory data propositions that brought our probabilities up the unit interval toward 1:

Pold(It's Dinkins) < Pold(It's Dinkins | data1) < Pold(It's Dinkins | data1 & data2) < ...

Of course the people in uniform and the slight, distinguished figure with the moustache might all have been actors. Our visual, auditory and tactile experiences did combine with our prior judgments to make us nearly certain it was the Mayor, but there seems to be no way to represent that process by conditioning on data propositions that are sensory certainties. The accessible data propositions were chancy claims about people on Broadway, not authoritative reports of events on our retinas, palms, and eardrums. We made reasonable moves (smiling, shaking the hand) on the basis of relatively diffuse probability distributions over a partition of such chancy propositions -- distributions not obtained by conditioning their predecessors on fresh certainties. (Jeffrey 1992, pp. 1-13, 78-82, etc.)

Here are two generalizations of conditioning that you can prove are applicable in such cases, provided rigidity conditions hold for a partition {C1, C2, ...} of Ω.

If the conditions Q(A | Ci) = P(A | Ci) all hold, then probabilities and factors can be updated so:

(Probabilities)   Q(A) = Σi Q(Ci)P(A | Ci)

(Factors)   f(A) = Σi f(Ci)P(Ci | A)

In the second condition, f(A) and f(Ci) are the factors Q(A)/P(A) and Q(Ci)/P(Ci) by which probabilities P(A), P(Ci) are multiplied in the updating process.

Generalized conditioning allows probabilistic response to observations which prompt no changes in your conditional probabilities given any of the Ci but do prompt definite new probabilities or factors for the Ci. (If your new probability for one of the Ci is 1, this reduces to ordinary conditioning.)

Probabilistic judgment is not generally a matter of assigning definite probabilities to all propositions of interest, any more than yes/no judgment is a matter of assigning definite truth values to all of them. Typical yes/no judgment identifies some propositions as true, leaving truth values of others undetermined. Similarly, probabilistic judgment may assign values to some propositions, none to others. The two sorts of generalized conditioning tolerate different sorts of indefiniteness. The probability version determines a definite value for Q(A) even if you had no old probabilities in mind for the Ci, as long as you have definite new values for them and definite old values for A conditionally on them. The factor version, determining the probability ratio f(A), tolerates indefiniteness about your old and new probabilities of A and of the Ci as long as your P(Ci | A) values are definite. Both versions illustrate the use of dynamic constraints to represent probabilistic states of mind. In the next section, judgments of causal influence are analyzed in that light.
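Both update rules are easy to compute with. The sketch below is mine; the three-cell partition and all numbers are illustrative assumptions. It implements the probability version and the factor version and confirms that they cohere, i.e., that f(A) = Q(A)/P(A).

```python
# Sketch (mine; illustrative numbers) of the two generalized-conditioning rules.
Cs = ["C1", "C2", "C3"]
P_C   = {"C1": .5, "C2": .3, "C3": .2}   # old probabilities of the cells
P_A_C = {"C1": .9, "C2": .4, "C3": .1}   # rigid conditionals P(A | Ci)
Q_C   = {"C1": .2, "C2": .3, "C3": .5}   # new probabilities for the cells

# Probability version: Q(A) = sum_i Q(Ci) P(A | Ci)
Q_A = sum(Q_C[c] * P_A_C[c] for c in Cs)

# Factor version: f(A) = sum_i f(Ci) P(Ci | A), where f(X) = Q(X)/P(X)
P_A   = sum(P_C[c] * P_A_C[c] for c in Cs)            # old P(A)
P_C_A = {c: P_C[c] * P_A_C[c] / P_A for c in Cs}      # P(Ci | A), by Bayes
f_C   = {c: Q_C[c] / P_C[c] for c in Cs}
f_A   = sum(f_C[c] * P_C_A[c] for c in Cs)

assert abs(f_A - Q_A / P_A) < 1e-9   # the two versions agree
print(round(Q_A, 2), round(f_A, 3))  # 0.35 0.593
```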

4.3 Causality

In decision-making it is deliberation, not observation, that changes your probabilities. To think you face a decision problem rather than a question of fact about the rest of nature is to expect whatever changes arise in your probabilities for those states of nature during your deliberation to stem from changes in your probabilities of choosing options. In terms of the analogy with mechanical kinematics: as a decision-maker you regard probabilities of options as inputs, driving the mechanism, not driven by it.

Is there something about your judgmental probabilities which shows that you are treating truth of one proposition as promoting truth of another -- rather than as promoted by it, or by truth of some third proposition which also promotes truth of the other? Here a positive answer to this question is proposed, and used to analyze puzzling problems in which we see acts as mere symptoms of conditions we would promote or prevent if we could. Such "Newcomb problems" (Nozick 1963, 1969, 1990) pose a challenge to the decision theory floated in the first edition of The Logic of Decision (Jeffrey 1965), where notions of causal influence play no rôle. The present suggestion about causal judgments will be used to question the credentials of Newcomb problems as decision problems.

The suggestion (cf. Arntzenius) is that imputations of causal influence are shown not simply by momentary features of probabilistic states of mind, but by intended or expected features of their evolution. The following is a widely recognized probabilistic consequence of the judgment that truth of one proposition ("cause") promotes truth of another ("effect").

Δ > 0   P(effect | cause) - P(effect | -cause) > 0

(Here Δ names the difference on the left.)

But what distinguishes cause from effect in this relationship? -- i.e., a relationship equivalent to

P(cause | effect) - P(cause | -effect) > 0

With Arntzenius, I suggest the following answer, i.e., rigidity relative to the partition {cause, -cause}.

Rigidity Constancy of P(effect | cause) and P(effect | -cause) as P(cause) varies

Both Δ > 0 and rigidity are conditions on a variable "pr" ranging over a set of probability functions. The functions in the set represent ideally definite momentary probabilistic states of mind for the deliberating agent. Clearly, pr can vary during deliberation, for if deliberation converges toward choice of a particular act, the probability of the corresponding proposition will rise toward 1. In general, agents' intentions or assumptions about the kinematics of pr might be described by maps of possible courses of evolution of probabilistic states of mind -- often, very simple maps. These are like road maps in that paths from point to point indicate feasibility of passage via the anticipated mode of transportation, e.g., ordinary automobiles, not "all terrain" vehicles. Your kinematical map represents your understanding of the dynamics of your current predicament, the possible courses of development of your probability and desirability functions.

The Logic of Decision used conditional expectation of utility given an act as the figure of merit for the act, sc., its desirability, des(act). Newcomb problems (Nozick 1969) led many to see that figure as acceptable only on special causal assumptions, and a number of versions of "causal decision theory" were proposed as more generally acceptable. In the one I like best (Skyrms 1980), the figure of merit for choice of an act is the agent's unconditional expectation of its desirability on various incompatible, collectively exhaustive causal hypotheses. But if Newcomb problems are excluded as bogus, then in genuine decision problems

des(act) will remain constant throughout deliberation, and will be an adequate figure of merit. In any decision problem whose outcome is not clear from the beginning, probabilities of possible acts will vary during deliberation, for finally an act will be chosen and so have probability near 1, a probability no act had initially. Newcomb problems (Table 1) seem ill posed as decision problems because too much information is given about conditional probabilities, i.e., enough to fix the unconditional probabilities of the acts. We are told that there is an association between acts (making A true or false) and states of nature (truth or falsity of B) which makes acts strong predictors of states, and states of acts, in the sense that p and q are large relative to p' and q' -- the four terms being the agent's conditional probabilities:

p = P(B | A),  p' = P(B | -A),  q = P(A | B),  q' = P(A | -B)

But the values of these terms themselves fix the agent's probability for A, for they fix the odds on A as

P(A)/P(-A) = qp'/((1-q)p)

Of course this formula doesn't fix P(A) if the values on the right are not all fixed, but as decision problems are normally understood, values are fixed, once given. Normally, p and p' might be given, together with the desirabilities of the act-state combinations, i.e., just enough information to determine the desirabilities of A's truth and falsity, which determine the agent's choice. But normally, p and p' remain fixed as P(A) varies, and q and q', unmentioned because irrelevant to the problem, vary with P(A).
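The odds formula follows from pP(A) = P(AB) = qP(B) and p'P(-A) = P(-AB) = (1-q)P(B). A quick numerical check (the joint distribution below is my own illustration, not from the text):

```python
# Sketch (my own numbers): the four conditionals p, p', q, q' fix the odds
# on A via P(A)/P(-A) = qp'/((1-q)p). Start from a full joint distribution,
# read off the four terms, and compare.
J = {("A", "B"): .40, ("A", "-B"): .10, ("-A", "B"): .05, ("-A", "-B"): .45}

p  = J[("A", "B")]  / (J[("A", "B")]  + J[("A", "-B")])    # P(B | A)  = .8
p_ = J[("-A", "B")] / (J[("-A", "B")] + J[("-A", "-B")])   # P(B | -A) = .1
q  = J[("A", "B")]  / (J[("A", "B")]  + J[("-A", "B")])    # P(A | B)
q_ = J[("A", "-B")] / (J[("A", "-B")] + J[("-A", "-B")])   # P(A | -B)

P_A = J[("A", "B")] + J[("A", "-B")]                       # .5
odds_A = (q * p_) / ((1 - q) * p)
assert abs(odds_A - P_A / (1 - P_A)) < 1e-9   # both give odds 1:1 here
```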

4.4 Fisher

We now examine a Newcomb problem that would have made sense to R. A. Fisher in the late 1950's.

For smokers who see quitting as prophylaxis against cancer, preferability goes by initial des(act) as in Table 1b; but there are views about smoking and cancer on which these preferences might be reversed. Thus, R. A. Fisher (1959) urged serious consideration of the hypothesis of a common inclining cause of (A) smoking and (B) bronchial cancer in (C) a bad allele of a certain gene, possessors of which have a higher chance of being smokers and developing cancer than do possessors of the good allele (independently, given their allele). On that hypothesis, smoking is bad news for smokers but not bad for their health, being a mere sign of the bad allele, and, so, of bad health. Nor would quitting conduce to health, although it would testify to the agent's membership in the low-risk group. On Fisher's hypothesis, where +/- A and +/- B are seen as independently promoted by +/- C, i.e., by presence (C) or absence (-C) of the bad allele, the kinematical constraints on pr are the following. (Thanks to Brian Skyrms for this.)

Rigidity   The following are constant as c = P(C) varies:
           a = P(A | C),  a' = P(A | -C),  b = P(B | C),  b' = P(B | -C)

Δ > 0   P(B | A) > P(B | -A), i.e., p > p'

Indeterminacy   None of a, b, a', b' are 0 or 1.

Independence   P(AB | C) = ab,  P(AB | -C) = a'b'

Since in general P(F | GH) = P(FG | H)/P(G | H), the independence and rigidity conditions imply that +/- C screens off A and B from each other, in the following sense.

Screening-off   P(A | BC) = a,  P(A | B-C) = a'
                P(B | AC) = b,  P(B | A-C) = b'

Under these constraints, preference between A and -A can change as P(C) = c moves out to either end of the unit interval in thought-experiments addressing the question "What would des A - des -A be if I found I had the bad/good allele?" To carry out these experiments, note that we can write

p = P(B | A) = P(AB)/P(A)
  = [P(A | BC)P(B | C)P(C) + P(A | B-C)P(B | -C)P(-C)] / [P(A | C)P(C) + P(A | -C)P(-C)]

and similarly for p' = P(B | -A). Then we have

p = (abc + a'b'(1-c)) / (ac + a'(1-c))        p' = ((1-a)bc + (1-a')b'(1-c)) / ((1-a)c + (1-a')(1-c))

Now final p and p' are equal to each other, and to b or b', depending on whether final c is 1 or 0. Since it is c's rise to 1 or fall to 0 that makes P(A) rise or fall as much as it can without going off the kinematical map, the (quasi-decision) problem has two ideal solutions, i.e., mixed acts in which the final unconditional probability of A is the rigid conditional probability, a or a', depending on whether c is 1 or 0. But p = p' in either case, so each solution satisfies the conditions under which the dominant pure outcome (A) of the mixed act maximizes des +/- A. (This is a quasi-decision problem because what is imagined as moving c is not the decision but factual information about C.)

The initial probabilities .093 and .025 in Table 1b were obtained by making the following substitutions in the formulas for p and p' above.

a = .9, a' = .5, b = .2, b' = .01, c (initially) = .3
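Those numbers can be verified directly from the formulas. The sketch below is mine; ap and bp stand in for a' and b'.

```python
# Sketch (mine): plugging the substitutions into the formulas for p and p'
# and checking the initial values and the limits as c goes to 1 or 0.
a, ap, b, bp = .9, .5, .2, .01

def p(c):       # P(B | A) as a function of c = P(C)
    return (a*b*c + ap*bp*(1 - c)) / (a*c + ap*(1 - c))

def p_(c):      # P(B | -A)
    return ((1-a)*b*c + (1-ap)*bp*(1 - c)) / ((1-a)*c + (1-ap)*(1 - c))

print(round(p(.3), 3), round(p_(.3), 3))   # 0.093 0.025, as in Table 1b
print(round(p(1), 3), round(p_(1), 3))     # 0.2 0.2: both equal b when c = 1
print(round(p(0), 3), round(p_(0), 3))     # 0.01 0.01: both equal b' when c = 0
print(round(a*.3 + ap*.7, 2))              # 0.62: initial P(continue) = ac + a'(1-c)
```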

As p and p' rise toward b = .2 or fall toward b' = .01, tracking the rise or fall of c toward 1 or 0, the negative difference des(continue) - des(quit) = -1.8 in Table 1b rises toward the positive values 5-4b = 4.2 and 5-4b' = 4.96 in Table 2.

Unless you, the smoker, somehow become sure of your allele, neither of the two judgmental positions shown in Table 2 will be yours. The table only shows that for you, continuing is preferable to quitting in either state of certainty about the allele. The kinematical map leads you to that conclusion on any assumption about initial c. And initial uncertainty about the allele need not be modelled by a definite initial value of c. Instead, an indefinite initial probabilistic state can be modelled by the set of all pr assigning the values a, a', b, b' as above, and with c = P(bad allele) anywhere in the unit interval.

If you are a smoker convinced of Fisher's hypothesis, your unconditional probabilities of continuing and quitting lag behind P(good) or P(bad) as your probability for the allele rises toward 1. In particular, your probability ac + a'(1-c) for continuing rises to a = .9 from its initial value of .62, or falls to a' = .5, as c rises to 1 from its initial value of .3 or falls to 0. Here you see yourself as committed by your genotype to one or the other of two mixed acts, analogs of gambles whose possible outcomes are pure acts of continuing and quitting, at odds of 9:1 or 1:1. You do not know which of these mixed acts you are committed to; your judgmental odds between them, c:(1-c), are labile, or perhaps undefined. This genetic commitment antedates your current deliberation. The mixed acts are not options for you; still less are their pure outcomes. (Talk about pure acts as options is shorthand for talk about mixed acts assigning those pure acts probabilities near 1.) Then there is much to be said for the judgment that quitting is preferable to continuing (sc., as the more desirable "news item"), for quitting and continuing are not options.

As a smoker who believes Fisher's hypothesis you are not so much trying to make your mind up as trying to discover how it is already made up. But this may be equally true in ordinary deliberation, where your question "What do I really want to do?" is often understood as a question about the sort of person you are, a question of which option you are already committed to, unknowingly.

The diagnostic mark of Newcomb problems is a strange linkage of this question with the question of which state of nature is actual -- strange, because where in ordinary deliberation any linkage is due to an influence of acts +/- A on states +/- B, in Newcomb problems the linkage is due to an influence, from behind the scenes, of deep states +/- C on acts +/- A and plain states +/- B. This difference explains why deep states ("the sort of person I am") can be ignored in ordinary decision problems, where the direct effect of such states is wholly on acts, which mediate any further effect on plain states. But in Newcomb problems deep states must be considered explicitly, for they directly affect plain states as well as acts (Fig. 1).

In the kinematics of decision the dynamical role of forces can be played by acts or deep states, depending on which of these is thought to influence plain states directly. Ordinary decision problems are modelled kinematically by applying the rigidity condition to acts as causes. Ordinarily, acts screen off deep states from plain ones in the sense that B is conditionally independent of +/- C given +/- A, so that while it is variation in c that makes P(A) and P(B) vary, the whole of the latter variation is accounted for by the former (Fig. 1a). But to model Newcomb problems kinematically we apply the rigidity condition to the deep states, which screen off acts from plain states (Fig. 1b). In Fig. 1a, the probabilities b and b' vary with c in ways determined by the stable a's and p's, while in Fig. 1b the stable a's and b's shape the labile p's as we have seen above:

p = (abc + a'b'(1-c)) / (ac + a'(1-c))        p' = ((1-a)bc + (1-a')b'(1-c)) / ((1-a)c + (1-a')(1-c))

Similarly, in Fig. 1(a) the labile probabilities are

b = (apc + a'p'(1-c)) / (ac + a'(1-c))        b' = ((1-a)pc + (1-a')p'(1-c)) / ((1-a)c + (1-a')(1-c))

While C and -C function as causal hypotheses, they do not announce themselves as such, even if we identify them by the causal rôles they are meant to play, e.g., when we identify the "bad" allele as the one that promotes cancer and inhibits quitting. If there is such an allele, it is a still unidentified feature of human DNA. Fisher was talking about hypotheses that further research might specify, hypotheses he could only characterize in causal and probabilistic terms -- terms like "malaria vector" as used before 1898, when the anopheles mosquito was shown to be the organism playing that aetiological rôle. But if Fisher's science fiction story had been verified, the status of certain biochemical hypotheses C and -C as the agent's causal hypotheses would have been shown by satisfaction of the rigidity conditions, i.e., constancy of P(-- | C) and of P(-- | -C), with C and -C spelled out as technical specifications of alternative features of the agent's DNA. Probabilistic features of those biochemical hypotheses, e.g., that they screen acts off from states, would not be stated in those hypotheses, but would be shown by interactions of those hypotheses with pr, B, and A, i.e., by truth of the following consequences of the kinematical constraints.

P(B | act & C) = P(B | C), P(B | act & -C) = P(B | -C)

As Leeds (1984) points out in another connection, no purpose would be served by packing such announcements into the hypotheses themselves, for at best -- i.e., if true -- such announcements would be redundant. The causal talk, however useful as commentary, does no work in the matter commented upon.

4.5 Newcomb

The flagship Newcomb problem resolutely fends off naturalism about deep states, making a mystery of the common inclining cause of acts and plain states while suggesting that the mystery could be cleared up in various ways, pointless to elaborate. Thus, Nozick (1969) begins:

Suppose a being in whose power to predict your choices you have enormous confidence. (One might tell a science-fiction story about a being from another planet, with an advanced technology and science, who you know to be friendly, and so on.) You know that this being has often correctly predicted your choices in the past (and has never, so far as you know, made an incorrect prediction about your choices), and furthermore you know that this being has often correctly predicted the choices of other people, many of whom are similar to you, in the particular situation to be described below. One might tell a longer story, but all this leads you to believe that almost certainly this being's prediction about your choice in the situation to be discussed will be correct. There are two boxes ...

... The being has surely put $1,000 in one box, and (B) left the second empty or (-B) put $1,000,000 in it, depending on whether the being predicts that you will take (A) both boxes, or (-A) only the second. Here you are to imagine yourself in a probabilistic frame of mind where your desirability for -A is greater than that of A because although you think A's truth or falsity has no influence on B's, your Δ (sec. 4.3) is near 1, i.e., p is near 1, p' near 0.

Does that seem a tall order? Not to worry! High Δ is a red herring; a tiny bit will do, e.g., if desirabilities are proportional to dollar payoffs, then the 1-box option, -A, maximizes desirability as long as Δ is greater than .001. To see how that might go, think of the choice and the prediction as determined by independent drawings by the agent and the predictor from the same urn, which contains tickets marked "2" and "1" in an unknown proportion x : 1-x. Initially, the agent's unit of probability density over the range [0,1] of possible values of x is flat (Fig. 2a), but in time it can push toward one end of the unit interval or the other, e.g., as in Fig. 2b, c. At t = 997 these densities determine the probabilities and desirabilities in Table 3b and c, and higher values of t will make des A - des -A positive. Then if t is calibrated in thousandths of a minute this map has the agent preferring the 2-box option after a minute's deliberation. The urn model leaves the deep state mysterious, but clearly specifies its mysterious impact on acts and plain states.

The irrelevant detail of high Δ was a bogus shortcut to the 1-box conclusion, obtained if Δ is not just high but maximum, which happens when p = 1 and p' = 0. This means that the "best" and "worst" cells in the payoff table have unconditional probability 0. Then taking both boxes means a thousand, taking just one means a million, and preference between acts is clear, as long as the probability r of A (take both boxes) is neither 0 nor 1, and Δ remains maximum, 1. The density functions of Fig. 2 are replaced by probability assignments r and 1-r to the possibilities that the ratio of 2-box tickets to 1-box tickets in the urn is 1:0 and 0:1, i.e., to the two ways in which the urn can control the choice and the prediction deterministically and in the same way. In place of the smooth density spreads in Fig. 2 we now have point-masses r and 1-r at the two ends of the unit interval, with desirabilities of the two acts constant as long as r is neither 0 nor 1. Now the 1-box option is preferable throughout deliberation, up to the very moment of decision. But of course this reasoning uses the premise that Δ = 1 through deliberation, a premise making abstract sense in terms of uniformly stocked urns, but very hard to swallow as a real possibility.

4.6 Hofstadter

Hofstadter (1983) saw prisoners' dilemmas as down-to-earth Newcomb problems. Call the prisoners Alma and Boris. If one confesses and the other does not, the confessor goes free and the other serves a long prison term. If neither confesses, both serve short terms. If both confess, both serve intermediate terms. From Alma's point of view, Boris's possible actions (B, confess, or -B, don't) are states of nature. She thinks they think alike, so that her choices (A, confess, -A, don't) are pretty good predictors of his, even though neither's choices influence the other's. If both care only to minimize their own prison terms this problem fits the format of Table 1(a). The prisoners are thought to share a characteristic determining their separate probabilities of confessing in the same way -- independently, on each hypothesis about that characteristic. Hofstadter takes that characteristic to be rationality, and compares the prisoners' dilemma to the problem Alma and Boris might have faced as bright children, independently working the same arithmetic problem, whose knowledge of each other's competence and ambition gives them good reason to expect their answers to agree before either knows the answer: "If reasoning guides me to [...], then, since I am no different from anyone else as far as rational thinking is concerned, it will guide everyone to [...]." The deep states seem less mysterious here than in the flagship Newcomb problem; here they have some such form as Cx = We are both likely to get the right answer, i.e., x. (And here ratios of utilities are generally taken to be on the order of 10:1 instead of the 1000:1 ratios that made the other endgame so demanding. With utilities 0, 1, 10, 11 instead of 0, 1, 1000, 1001, indifference between confessing and remaining silent now comes at Δ = 10% instead of one tenth of 1%.)
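The Δ = 10% figure can be checked with a small sketch (mine; the payoff encoding is my own assumption: u(A,B) = 1, u(A,-B) = 11, u(-A,B) = 0, u(-A,-B) = 10, with p = P(B | A) and p' = P(B | -A)):

```python
# Sketch (mine): with utilities 0, 1, 10, 11, des(-A) - des(A) = 10*(p - p') - 1,
# so indifference between the acts comes exactly at Delta = p - p' = 0.1.
def des_gap(p, p_):
    des_A  = 1 * p + 11 * (1 - p)     # confess / take both boxes
    des_nA = 0 * p_ + 10 * (1 - p_)   # stay silent / take one box
    return des_nA - des_A

assert abs(des_gap(.55, .45)) < 1e-9   # Delta = .10: indifference
assert des_gap(.60, .40) > 0           # Delta = .20: silence preferred
assert des_gap(.52, .48) < 0           # Delta = .04: confession preferred
```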
But to heighten similarity to the prisoners' dilemma let us suppose the required answer is the parity of x, so that the deep states are simply C = We are both likely to get the right answer, i.e., even, and -C = We are both likely to get the right answer, i.e., odd. What's wrong with Hofstadter's view of this as justifying the coöperative solution? [And with von Neumann and Morgenstern's (p. 148) transcendental argument, remarked upon by Skyrms (1990, pp. 13-14), for expecting rational players to reach a Nash equilibrium?] The answer is failure of the rigidity conditions for acts, i.e., variability of P(He gets x | I get x) with P(I get x) in the decision maker's kinematical map. It is Alma's conditional probability functions P(-- | +/- C) rather than P(-- | +/- A) that remain constant as her probabilities for the conditions vary. The implausibility of initial des(act) as a figure of merit for her act is simply the implausibility of positing constancy of P(-- | +/- A) as her probability function pr evolves in response to changes in P(A). But the point is not that confessing is the preferable act, as causal decision theory would have it. It is rather that Alma's problem is not indecision about which act to choose, but ignorance of which allele is moving her.

4.7 Conclusion

Hofstadter's (1983) version of the prisoners' dilemma and the flagship Newcomb problem have been analyzed here as cases where plausibility demands a continuum [0,1] of possible deep states, with opinion evolving as smooth movements of probability density toward one end or the other draw probabilities of possible acts along toward 1 or 0. The problem of the smoker who believes Fisher's hypothesis was simpler in that only two possibilities (C, -C) were allowed for the deep state, neither of which determined the probability of either act as 0 or 1. The story was meant to be a credible, down-to-earth Newcomb problem; after all, Fisher (1959) honestly did give his hypothesis some credit. But if your genotype commits you to one mixed act or the other, to objective odds of 9:1 or 1:1 on continuing, there is no decision left for you to make. Yet, the story persuaded us that, given your acceptance of the Fisher hypothesis, you would be foolish to quit, or to try to quit: continuing would be the wiser move. This is not to say you will surely continue to smoke, i.e., not to say you see a mixed act at odds of 1:0 on continuing as an option, and, in fact, as the option you will choose. It only means you prefer continuing as the pure outcome of whichever mixed act you are unknowingly committed to. "Unknowingly" does not imply that you have no probabilistic judgment about the matter -- although, indeed, you may have none, i.e., c may be undefined. In fact, with c = .3, you think it unlikely that your commitment makes odds of 9:1 on continuing; you think the odds most likely to be 1:1. But whatever the odds, you prefer the same pure outcome: continuing. You don't know which "gamble" you face, but you know what constitutes winning: continuing to smoke, i.e., the less likely outcome of the more desirable "gamble."
These scare-quotes emphasize that your mixed act is not a matter of spinning a wheel of fortune and passively awaiting the outcome; you yourself are the chance mechanism. You think there is an objective, real probability of your quitting, i.e., .9 or .5, depending on whether you have the bad genotype or the good one; there is a fact of the matter, you think, even though you do not know the fact. If the real odds on your continuing to smoke are even, that is because your tropism toward smoking is of the softer kind, stemming from the good allele; you are lucky in your genotype. But how does that work? How does the patch of DNA make you as likely to quit as continue? How do we close the explanatory gap from biochemistry to preference and behavior, i.e., to things like the relative importance you place on different concomitants of smoking, on the positive side a certain stimulus and sensual gratification, on the negative a certain inconvenience and social pressure? These influences play themselves out in the micro-moves which add up to the actual macro-outcome: continue, or quit. And if the odds are 9:1, that will stem from a

different pattern of interests and sensitivities, forming a strong tropism toward continuing to smoke, somehow or other rooted in your DNA. What's weird about Fisher's science fiction story is not its premise, that the mental and physical states of reasoning animals are interconnected, but the thought that we might have the sort of information about the connection that his story posits -- information unneeded in ordinary deliberation, where acts screen it off. The flagship Newcomb problem owes its bizarrerie to the straightforward character of the pure acts: surely you can reach out and take both boxes, or just the opaque box, as you choose. Then as the pure acts are options, you cannot be committed to either of the non-optional mixed acts. But in the Fisher problem, those of us who have repeatedly "quit" easily appreciate the smoker's dilemma as humdrum entrapment in some mixed act, willy-nilly. That the details of the entrapment are describable as cycles of temptation, resolution and betrayal makes the history no less believable -- only more petty. Quitting and continuing are not options, i.e., pr A ~ 0 and pr A ~ 1 are not destinations you think you can choose, given your present position on your kinematical map, although you may eventually find yourself at one of them. The reason is your conviction that if you knew your genotype, your value of pr A would be either a or a', neither of which is ~ 0 or ~ 1. (Translation: "At places on the map where pr C is at or near 0 or 1, pr A is not.") The extreme version of the story, with a ~ 1 and a' ~ 0, is more like the flagship Newcomb problem; here you do see yourself as already committed to one of the pure acts, and when you learn which that is, you will know your genotype. I have argued that Newcomb problems are like Escher's famous staircase on which an unbroken ascent takes you back where you started. 
We know there can be no such things, but see no local flaw; each step makes sense, but there is no way to make sense of the whole picture; that's the art of it.

4.8 Notes (End of sec. 4.1) See Jeffrey, "Risk and human rationality," and sec. 12.8 of The Logic of Decision. The point is Davidson's; e.g., see pp. 272-3. (Sec. 4.2) Rigidity is also known as "sufficiency" (Diaconis and Zabell). A sufficient statistic is a random variable whose sets of constancy ("data") form a partition satisfying the "rigidity" condition.


(Sec. 4.3) The "regression coefficient" of a random variable Y on another, X, is cov(X,Y)/var(X), where cov(X,Y) = E[(X - EX)(Y - EY)] and var(X) = E(X - EX)^2. If X and Y are indicators of propositions (sc., "cause" and "effect"), cov(X,Y) = P(cause & effect) - P(cause)P(effect) and var(X) = P(cause)P(-cause), and the coefficient reduces to the left-hand side of the inequality.

(Sec. 4.3, rigidity) For random variables generally, rigidity is constancy of the conditional probability distribution of Y given X as the unconditional probability distribution of X varies.

(Sec. 4.4) In the example, I take it that the numerical values a = .9, a' = .5, b = .2, b' = .01 hold even when c is 0 or 1, e.g., b = P(ca | bad) = .2 even when P(bad) = 0; the equation b · P(bad) = P(ca & bad) isn't what defines b.

(Sec. 4.5, Fig. 2) In this kinematical map, P(A) = ∫₀¹ x^(t+1) f(x) dx and P(B | A) = ∫₀¹ x^(t+2) f(x) dx / P(A), with f(x) as in Fig. 2(b) or (c). Thus, with f(x) as in (b), P(A) = (t+1)/(t+3) and P(B | A) = (t+2)/(t+3). See Jeffrey (1988).

(Sec. 4.5, end) At the moment of decision the desirabilities of shaded rows in (b) and (c) are not determined by ratios of unconditional probabilities, but continuity considerations suggest that they remain good and bad, respectively.

(Sec. 4.7, start of third paragraph) "You think there is an objective, real probability... " See the hard-core subjectivist's guide to objective chance in The Logic of Decision, sec. 12, and note that the "no one chooses to have sacked Troy" passage from the Nichomachean Ethics, used by Skyrms (1980, p. 128) to introduce causal decision theory, also fits the present skepticism about Newcomb problems. (Sec. 4.7, end of third paragraph) Cf. Davidson's conclusion, that "nomological slack between the mental and the physical is essential as long as we conceive of man as a rational animal" (p. 223). (Sec. 4.7, Escher staircase) "Ascending and Descending" (lithograph, 1960), based on Penrose (1958); see Escher (1989, p. 78). Elsewhere I have accepted Newcomb problems as decision problems, and accepted "2-box" solutions as correct. Jeffrey (1983, sec. 1.7 and 1.8) proposed a new criterion for acceptability of an act -- "ratifiability" -- which proved to break down in certain cases (see Jeffrey 1990, p. 20). In Jeffrey (1988, 1993), ratifiability was recast in terms more like the present ones -- but still treating Newcomb problems as decision problems.

4.9 References
Arntzenius, F. (1990), 'Physics and common causes', Synthese, vol. 82, pp. 77-96.
Bolker, E. (1965), Functions Resembling Quotients of Measures, Ph.D. dissertation (Harvard University).
------ (1966), 'Functions resembling quotients of measures', Transactions of the American Mathematical Society, vol. 124, pp. 292-312.
------ (1967), 'A simultaneous axiomatization of utility and subjective probability', Philosophy of Science, vol. 34, pp. 333-340.
Davidson, D. (1980), Essays on Actions and Events (Oxford: Clarendon Press).
Diaconis, P. and Zabell, S. (1982), 'Updating subjective probability', Journal of the American Statistical Association, vol. 77, pp. 822-830.
Escher, M.C. (1989), Escher on Escher (New York: Abrams).
Fisher, R. (1959), Smoking: The Cancer Controversy (London: Oliver and Boyd).
Hofstadter, D.R. (1983), 'The calculus of coöperation is tested through a lottery', Scientific American, vol. 248, pp. 14-28.
Jeffrey, R.C. (1965; 1983, 1990), The Logic of Decision (New York: McGraw-Hill; Chicago: University of Chicago Press).
------ (1987), 'Risk and human rationality', The Monist, vol. 70, no. 2, pp. 223-236.
------ (1988), 'How to probabilize a Newcomb problem', in Fetzer, J.H. (ed.), Probability and Causality (Dordrecht: Reidel).
------ (1992), Probability and the Art of Judgment (Cambridge: Cambridge University Press).
------ (1993), 'Probability kinematics and causality', in Hill, D., Forbes, M. and Okruhlik, K. (eds.), PSA 92, vol. 2 (E. Lansing, MI: Philosophy of Science Association, Michigan State University).
Kolmogorov, A.N. (1933), 'Grundbegriffe der Wahrscheinlichkeitsrechnung', Ergebnisse der Mathematik, vol. 2, no. 3 (Berlin: Springer). Translation: Foundations of the Theory of Probability (New York: Chelsea, 1950).

Leeds, S. (1984), 'Chance, realism, quantum mechanics', Journal of Philosophy, vol. 81, pp. 567-578.
Nozick, R. (1963), The Normative Theory of Individual Choice, Ph.D. dissertation (Princeton University).
------ (1969), 'Newcomb's problem and two principles of choice', in N. Rescher (ed.), Essays in Honor of Carl G. Hempel (Dordrecht: Reidel).
------ (1990), photocopy of Nozick (1963), with new preface (New York: Garland).
Penrose, L.S. and Penrose, R. (1958), 'Impossible objects: a special type of visual illusion', The British Journal of Psychology, vol. 49, pp. 31-33.
Skyrms, B. (1980), Causal Necessity (New Haven: Yale).
------ (1990), The Dynamics of Rational Deliberation (Cambridge, Mass.: Harvard).
von Neumann, J. and Morgenstern, O. (1943, 1947), Theory of Games and Economic Behavior (Princeton: Princeton University Press).

Introduction What reason is there to suppose that the future will resemble the past, or that unobserved particulars will resemble observed ones? None, of course, until resemblances are further specified, e.g., because we do not and should not expect the future to resemble the past in respect of being past, nor do or should we expect the unobserved to resemble the observed in respect of being observed. Thus Nelson Goodman replaces the old problem ('Hume's') of justifying induction by the new problem of specifying the respects in which resemblances are expectable between past and future, observed and unobserved. The old problem is thereby postponed, not solved. As soon as the new problem is solved, the old one returns, as a request for the credentials of the solution: "What reason is there to expect the future/unobserved to resemble the past/observed with respect to such-and-such dichotomies or classificatory schemes or magnitudes?" The form of the question is further modified when we talk in terms of judgmental probability instead of all-or-none expectation of resemblance, but the new problem still waits, suitably modified. It seems to me that Hume did not pose his problem before the means were at hand to solve it, in the probabilism that emerged in the second half of the seventeenth century, and that we know today primarily in the form that Bruno de Finetti gave it in the decade from 1928 to 1938. The solution presented here (to the old and new problems at once) is essentially present in Chapter 2 of de Finetti's 'La prévision' (1937), but he stops short of the last step, shifting in Chapter 3 to a different sort of solution, one that uses the notion of exchangeability. At the end of this paper I shall compare and contrast the two solutions, and say why I think it is that de Finetti overlooked (or, anyway, silently balked at) the solution that lay ready to hand at the end of his Chapter 2.

5.1 Probabilism, what In a nutshell: probabilism sees opinions as more or less precise estimates of various magnitudes, i.e., probability-weighted averages of the form

(1) est X = x0 p0 + x1 p1 + ...,

where the xi are the different values that the magnitude X can assume, and each pi is the probability that the value actually assumed is xi. (If X is a continuous magnitude, replace the sum by an integral.) Estimation is not a matter of trying to guess the true value, e.g., 2.4 might be an eminently reasonable estimate of the number of someone's children, but it would be a ridiculous guess. (If that was my estimate, my guess might be 2.) Similarly, taking the truth value of a proposition to be 1 or 0 depending on whether it is true or false, my estimate of the truth value of the proposition that I shall outlive the present century is about 1/2, which couldn't be the truth value of any proposition. The probability you attribute to a proposition is your estimate of its truth value: if X is a proposition, then

(2) prob X = est X.

Here I follow de Finetti in taking propositions to be magnitudes that assume the value 1 at worlds where they are true, and 0 where false. This comes to the same thing as the more familiar identification of propositions with the sets of worlds at which they are true, and makes for smoothness here. Observe that (2) follows from (1), for as X can take only the two values 0 and 1, we can set x0 = 0 and x1 = 1 in (1) to get est X = 0p0 + 1p1 = p1, where p1 is the probability that X assumes the value 1, i.e., the probability (prob X) that X is true. Still following de Finetti, I take estimation to be the basic concept, and define probability in terms of it. (The opposite tack, with estimation defined in terms of probability as in (1), is more familiar.) Then (2) is given the status of a definition, and the following axioms are adopted for the estimation operator.

(3) Additivity: est X + Y = est X + est Y
(4) Positivity: If X >= 0 then est X >= 0
(5) Normalization: est 1 = 1

(1 is the magnitude that assumes the value 1 everywhere, i.e., the necessary proposition. 'X >= 0' means that X assumes negative values nowhere.) 
Once understood, these axioms are as obvious as the laws of logic -- in token of which fact I shall call them and their consequences laws of 'probability logic' (de Finetti's 'logic of the probable'). Here they are, in English:

(3) An estimate of the sum of two magnitudes must be the sum of the two separate estimates.
(4) An estimate must not be negative if the magnitude estimated cannot be negative.
(5) If the magnitude is certainly 1, the estimate must be 1.

Additivity implies1 that for each real number k,

(6) est kX = k est X.

The Kolmogorov axioms for probability are easy consequences of axioms (3)-(5) for estimation, together with (2) as a notational convention, i.e.,

(7) If X can take no values but 0 and 1, then est X = prob X.

Here are the Kolmogorov axioms.

(8) Additivity: If XY = 0 then prob X + Y = prob X + prob Y
(9) Positivity: prob X >= 0
(10) Normalization: prob 1 = 1

(10) is just copied from the normalization axiom (5) for est, with 'est' transcribed as 'prob'. Positivity is the same for 'prob' as for 'est', given that when we write 'prob X' it goes without saying that X >= 0, since a proposition X can assume no values but 0 and 1. And additivity of prob as above comes to the same thing as the more familiar version, i.e., If X and Y are incompatible propositions, prob X v Y = prob X + prob Y. (With 0 for falsehood and 1 for truth, the condition XY = 0 that the product of X and Y be 0 everywhere comes to the same thing as logical incompatibility of X and Y; and under the condition XY = 0 the disjunction X v Y, i.e., X + Y - XY in the present notation, comes to the same thing as the simple sum X + Y.) A precise, complete opinion concerning a collection of propositions would be represented by a probability function defined on the truth-functional closure of that collection. More generally, a precise, complete opinion concerning a collection of magnitudes might be represented by an estimation operator on the closure of that collection under the operations of addition and multiplication of magnitudes, and of multiplication of magnitudes by constants. (One might also include other operations, e.g. closure under exponentiation, X^Y.)
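The relation between the estimation axioms and the Kolmogorov axioms can be checked numerically. The following is a minimal sketch (the worlds, probabilities, and names are invented for illustration, not drawn from the text): an estimation operator defined as a probability-weighted average over a finite set of worlds, verified against axioms (3)-(5) and the identity (7) of estimate and probability for a proposition.

```python
# Sketch (illustrative, not from the text): estimation as the
# probability-weighted averaging of (1), over a finite set of worlds.

worlds = [0, 1, 2, 3]
p = [0.1, 0.2, 0.3, 0.4]          # a probability assignment over the worlds

def est(X):
    """Estimate of magnitude X: sum of value times probability, as in (1)."""
    return sum(X(w) * pw for w, pw in zip(worlds, p))

# Two magnitudes and a proposition (a magnitude taking only values 0 and 1)
X = lambda w: w
Y = lambda w: w * w
A = lambda w: 1 if w >= 2 else 0   # the proposition "w >= 2"

# (3) Additivity: est (X + Y) = est X + est Y
assert abs(est(lambda w: X(w) + Y(w)) - (est(X) + est(Y))) < 1e-12
# (4) Positivity: X assumes negative values nowhere, so est X >= 0
assert est(X) >= 0
# (5) Normalization: est 1 = 1
assert abs(est(lambda w: 1) - 1) < 1e-12
# (7)/(2): for a proposition, the estimate is its probability
assert abs(est(A) - (0.3 + 0.4)) < 1e-12   # prob A = 0.7
```

Here est X = 2.0 and est A = 0.7; the assertions confirm that the Kolmogorov-style facts fall out of the weighted-average definition, as the derivation above says.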

Such are precise, complete opinions, according to probabilism. But for the most part our opinions run to imprecision and incompleteness. Such opinions can be represented by conditions on the variable 'est' or, equivalently, by the sets of particular estimation operators that satisfy those conditions. Such sets will usually be convex, i.e., if the operators est0 and est1 both belong to one, then so will the operator w0 est0 + w1 est1, if the w's are non-negative real numbers that sum to 1. An example is given by what de Finetti (1970; 1974, sec. 3.10) calls 'The Fundamental theorem of probability': Given a coherent assignment of probabilities to a finite number of propositions, the probability of any further proposition is either determined or can be coherently assigned any value in a certain closed interval. Thus, the set of probability measures that assign the given values to the finite set of propositions must be convex. And incomplete, imprecise opinions can arise in other ways, e.g. in my book, The Logic of Decision (1965, 1983), a complete preference ranking will normally determine an infinite set of probability measures, so that probabilities of propositions may be determined only within intervals: see sec. 6.6, 'Probability quantization'. Probabilism would have you tune up your opinions with the aid of the probability calculus, or, more generally, the estimation calculus: probability logic, in fact. This is a matter of tracing consequences of conditions on estimation operators that correspond to your opinion. When you trace these consequences you may find that you had misidentified your opinion, i.e., you may see that after all, the conditions whose consequences you traced are not all such as you really accept. Note that where your opinion is incomplete or imprecise, there is no estimation operator you can call your own. Example: the condition est (X - est X)^2 <= 1 is not a condition on your unknown estimation operator, est. 
Rather: in that condition, 'est' is a variable, in terms of which your opinion can be identified with the set {est : est (X - est X)^2 <= 1} of estimation operators that satisfy it.
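The convexity claim above can also be checked concretely. A minimal sketch (the two operators and their weights are invented for illustration): a mixture of two estimation operators satisfies the same linear constraint each component satisfies, and remains additive and normalized.

```python
# Sketch (illustrative, not Jeffrey's code): a mixture of two estimation
# operators is again an estimation operator, so sets of operators picked
# out by linear constraints are convex.

worlds = [0, 1]

def make_est(p1):
    """Estimation operator fixed by the probability p1 of world 1."""
    return lambda X: X(0) * (1 - p1) + X(1) * p1

est0, est1 = make_est(0.2), make_est(0.8)
w0, w1 = 0.5, 0.5                                # non-negative, sum to 1
mix = lambda X: w0 * est0(X) + w1 * est1(X)

X = lambda w: w          # indicator of the proposition "world 1 obtains"
Y = lambda w: 1 - w      # its denial

# Both components satisfy the constraint est X <= 0.8 ...
assert est0(X) <= 0.8 + 1e-12 and est1(X) <= 0.8 + 1e-12
# ... and so does the mixture (here mix X = 0.5), as convexity requires.
assert abs(mix(X) - 0.5) < 1e-12
# The mixture is still additive and normalized:
assert abs(mix(lambda w: X(w) + Y(w)) - (mix(X) + mix(Y))) < 1e-12
assert abs(mix(lambda w: 1) - 1) < 1e-12
```

The same bookkeeping applies to any constraint that is linear in the estimates, which is why the sets of operators that de Finetti's fundamental theorem describes come out convex.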

5.2 Induction, what Here is the probabilistic solution that I promised, of the new problem of induction. It turns on the linearity of the expectation operator, in view of which we have

est (X1 + ... + Xn)/n = (est X1 + ... + est Xn)/n,

i.e., in words:

(11) An estimate of the average of any (finite) number of quantities must equal the average of the estimates of the separate quantities. (Proof: use (6) to get 1/n out front, and then apply (3) n - 1 times.) From (11) we get what I shall call the

ESTIMATION THEOREM. If your opinion concerning the magnitudes X1, ..., Xn, ..., Xn+m is characterized by the constraints est Xi = est Xj for all i, j = 1, ..., n + m (among other constraints, perhaps), then your estimate of the average of the last m of them will equal the observed average of the first n -- if you know that average, or think you do.

Implicitly, this assumes that although you know that the average of the first n X's is (say) x, you don't know the individual values assumed by X1, ..., Xn separately unless it happens that they are all exactly x, for if you did, and they weren't, the constraint est Xi = est Xj would not characterize your opinion where the known value of Xi differs from the known value of Xj.

Proof of the estimation theorem. If you assign probability 1 to the hypothesis that the average of the first n X's is x, then by (1)2 you must estimate that average as x. The constraints then give est Xi = x for the last m X's, and the conclusion of the estimation theorem follows by (11).

Example 1: Guessing weight. For continuous magnitudes, estimates serve as guesses. Suppose that you will be rewarded if you guess someone's weight to within an accuracy of one pound. One way to proceed is to find someone who seems to you to be of the same build, to be dressed similarly, etc., so that where X2 is the weight you wish to guess correctly, and X1 is the other person's weight, your opinion satisfies the constraint est X1 = est X2. Now have the other person step on an accurate scale, and use that value of X1 as your estimate of X2. This is an application of the estimation theorem with n = m = 1. 
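The arithmetic of the estimation theorem can be laid out in a few lines. This sketch uses invented weights (they are not data from the text), with n = 10 and m = 1 as in the platform-scale example below:

```python
# Illustrative sketch of the estimation theorem (numbers invented here):
# if all n + m magnitudes get the same estimate and you learn the average
# of the first n, that average becomes your estimate for each of the rest.

n, m = 10, 1
observed = [160, 172, 155, 168, 171, 158, 166, 174, 162, 164]  # first n values

# You know only the average of the first n X's ...
known_average = sum(observed) / n          # 165.0

# ... and your opinion satisfies est Xi = est Xj for i, j = 1, ..., n + m,
# so by (11) the common estimate -- hence your estimate of the average of
# the last m magnitudes -- must be that observed average.
est_common = known_average
est_of_last_m_average = est_common

assert est_of_last_m_average == 165.0
```

Note that the code mirrors the proof: probability 1 on the hypothesis that the observed average is 165 forces the estimate of that average to 165, and the equality constraints then propagate it to the unobserved magnitudes.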
Mind you: it is satisfaction of the constraint after the weighing that justifies (or amounts to) taking the other person's actual weight as your estimate, and under some circumstances your opinion might change as a result of the weighing so as to cease satisfying the constraint. Example: the other person's weight might prove to be so far from your expectation as to undermine your prior judgement that the two were relevantly similar, i.e., the basis for your prior opinion's satisfaction of the constraint. Here is a case where you had antecedently judged the two people both to have weights in a certain interval (say, from 155 to 175 pounds), so that when the other person's weight proved to be far outside this interval (perhaps, 120 pounds) your opinion changed from {est : 155 <= est X1 = est X2 < 175} to something else, because the weighing imposed the further condition est X1 = 120

on your opinion, i.e., a condition incompatible with the previously given ones. Probability logic need not tell you how to revise your opinion in such cases, any more than deductive logic need tell you which of an inconsistent set of premises to reject.

Example 2: Using averages as guesses. If you can find (say) ten people, each of whom strikes you as similar in the relevant respects to an eleventh person, whose weight you wish to guess, then have the ten assemble on a large platform scale, read their total weight, and use a tenth of that as your estimate of X11. This is an application of the estimation theorem with n = 10, m = 1. The estimation theorem does not endorse this estimate as more accurate than one based on a single person's weight, but under favorable conditions it may help you form an opinion of your estimate's accuracy, as follows.

Example 3: Variance. The variance of a magnitude X is defined relative to an estimation function: var X = est (X - est X)^2 = est X^2 - (est X)^2. Thus, relative to an estimation function that characterizes your precise opinion, the variance of the eleventh person's weight is your estimate of the square of your error in estimating that weight, and this turns out to be equal to the amount by which the square of your estimate of the magnitude falls short of your estimate of the square of the magnitude. Now the estimation theorem can be applied to the magnitudes X1^2, ..., Xn+1^2 to establish that under the constraints est Xi^2 = est Xj^2 for i, j = 1, ..., n + 1, your estimate of the square of the eleventh person's weight must equal the observed average of the squares of the first ten people's weights.3 To get the variance of X11, simply subtract from that figure the square of the estimate of X11 formed in Example 2. It is worthwhile to ring the changes on these examples, e.g. 
imagining that you are estimating weights not by eye, but on the basis of significant but limited statistical data, say, age and sex of the members of the sample, and of the person whose weight is to be estimated; and imagining that it is not weight that is to be estimated, but length of life -- the estimate of which has the familiar name, 'life expectancy'. (In this case the members of the sample are presumably dead already: people no younger than the one whose life expectancy is sought, who were relevantly similar to that one, at that age.) The estimation theorem was inspired by a class of applications (in 'La prévision', Chapter 2) of what I shall call

DE FINETTI'S LAW OF SMALL NUMBERS: your estimate of the number of truths among the propositions A1, ..., An must equal the sum of the probabilities you attribute to them. Here is another formulation, obtained by dividing both sides of the equation by n and applying the linearity of est:

(12) Your estimate of the relative frequency of truths among the propositions A1, ..., An must equal the average of the probabilities you attribute to them.

Proof of de Finetti's law. The claim is that est (A1 + ... + An) = prob A1 + ... + prob An, which is true by (3) and (7).

Example 4: Applying de Finetti's law. To form my probabilistic opinion concerning the proposition A101 that the 101st toss of a certain (possibly loaded) die will yield an ace, I count the number of times the ace turns up on the first hundred tosses. Say the number is 21, and suppose I don't keep track of the particular tosses that yielded the aces. It is to be expected that I attribute the same probability to all 101 propositions of form Ai, i.e., it is to be expected that my opinion satisfies the constraints est Ai = est Aj (i, j = 1, ..., 101). Then by de Finetti's law with n = 100, the common value of those probabilities will be 21%, and that will be the probability of A101 as well.

Observe that in the case of propositions, i.e., magnitudes whose only possible values are 0 and 1, the variance is determined by the estimate, i.e., by the probability attributed to the proposition:

(13) If X is a proposition of probability p, then var X = p(1 - p).

Proof. As X is a proposition, X^2 = X, and therefore var X, in the form est X^2 - (est X)^2, can be written as est X - (est X)^2, i.e., p - p^2, i.e., p(1 - p). Thus, the variance of a proposition is null when its probability is extreme: 0, or 1. And variance is maximum (i.e., 1/4) when p is 1/2. You can see that intuitively by considering that the possible value of 'p' that is furthest from both of the possible values of X is the one squarely in the middle.
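The arithmetic of Example 4 and of (13) can be verified directly. A short sketch using the die-toss numbers above (the variable names are mine, not Jeffrey's):

```python
# Worked check of de Finetti's law of small numbers and of (13):
# for a proposition of probability p, var X = p(1 - p).

# Example 4's judgment: the same probability 0.21 for each of 101 propositions.
probs = [0.21] * 101

# Law of small numbers: the estimate of the number of truths among
# A1, ..., A101 is the sum of their probabilities ...
est_number_of_truths = sum(probs)
assert abs(est_number_of_truths - 21.21) < 1e-9

# ... and (12): the estimated relative frequency is their average.
est_rel_freq = est_number_of_truths / len(probs)
assert abs(est_rel_freq - 0.21) < 1e-9

# (13): for a proposition, est X^2 - (est X)^2 reduces to p(1 - p),
# since X^2 = X when X takes only the values 0 and 1.
def var_prop(p):
    return p - p * p   # est X - (est X)^2

assert abs(var_prop(0.21) - 0.21 * 0.79) < 1e-12
assert var_prop(0.0) == 0.0 and var_prop(1.0) == 0.0   # null at the extremes
assert var_prop(0.5) == 0.25                            # maximum, i.e., 1/4
```

The last two assertions are the facts noted in the proof: the variance of a proposition vanishes at probability 0 or 1 and peaks at 1/4 when p is squarely in the middle.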

5.3 Justifying induction These examples show how probabilism would have us form our opinion about the future on the basis of past experience, in simple cases of the very sorts concerning which the problem of induction is commonly posed. The estimation theorem, and

de Finetti's laws of large and small numbers, are especially accessible parts of probabilism's solution to the new problem of induction. As the theorem and the law are consequences of the axioms (3)-(5) of probability logic, this solution can be seen as borrowing its credentials from those axioms. Thus the old problem of induction, in the form in which it bears on probabilism's solution to the new problem, is the question of the credentials of the axioms of probability logic. Note that I say 'probability logic', not 'logical probability'. De Finetti's subjectivism implies that the basic axioms governing est (or, if you prefer, the corresponding axioms for prob) are all the universally valid principles there are, for this logic. In contrast, Carnap (1950, 1952, 1971, 1980) tentatively proposed further principles as universal validities for what he called 'logical probability', i.e., either a particular probability function, e.g. c* (1945, 1950), or a class of them, e.g. {cλ : 0 < λ < ∞} (1952). But such attempts to identify a special class of one or more probability functions as the 'logical' ones strike me as hopeless. Example: the functions of form cλ do have interesting properties that recommend them for use as subjective probability functions in certain sorts of cases, but there are plenty of other sorts of cases where none of those functions are suitable.4 Carnap (1980, sec. 17) did see that, and finally added further adjustable parameters in an effort to achieve full generality. But I see no reason to think that would have been the end of the broadening process, had Carnap lived to continue it, or had others taken sufficient interest in the project to pursue it after Carnap's death.5 With de Finetti, I take the laws of probability logic to be the axioms (3)-(5) and their consequences.6 It seems appropriate to call these axioms 'logical' because of the strength and quality of their grip, as constraints that strike us as appropriate for estimation functions. 
The feel of that grip is like that of the grip of such logical laws (i.e., constraints on truth-value assignments) as that any proposition, X, implies its disjunction with any proposition, Y. In our notation, this comes out as X <= X + Y - XY, given that X and Y take no values other than 0 and 1. I am at a loss to think of more fundamental principles from which to deduce these axioms: to understand is to acknowledge, given what we understand by 'estimate.' But this is not to deny that illustrations can serve to highlight this logical character of the axioms: a notable class of such illustrations are the 'Dutch book' arguments, which proceed by considering situations in which your estimates of magnitudes will be the prices, in dollars, at which you are constrained to buy or sell (on demand) tickets that can be exchanged for numbers of dollars equal to the true values of those magnitudes.

Example. The Dutch book argument for axiom (3) goes like this, where x and y are the unknown true values of the magnitudes X and Y. For definiteness, we consider the case where the axiom fails because the left-hand side is the greater. (If it fails

because the left-hand side is the smaller, simply interchange 'buy' and 'sell' in the following argument, to show that you are willing to suffer a sure loss of est X + est Y - est (X + Y).) If est (X + Y) exceeds est X + est Y, you are willing to buy for est (X + Y) dollars a ticket worth x + y dollars; and for a lower price, i.e., est X + est Y dollars, you are willing to sell a pair of tickets of that same combined worth: x + y dollars. Thus, you are willing to suffer a sure loss, viz., est (X + Y) - est X - est Y dollars. To paraphrase Brian Skyrms (1980, p. 119): if your estimates violate axiom (3) in such cases, you are prepared to pay different amounts for the same good, depending on how it is described. For a single ticket worth x + y dollars you will pay est (X + Y) dollars, but for two tickets jointly worth x + y dollars you will pay a different amount, i.e., est X + est Y dollars. In a certain clear sense, this is an inconsistency. Such Dutch book arguments serve to highlight the credentials of (3)-(5) as axioms of the logic of estimation, i.e., of probability logic. One might even think of them as demonstrating the a priori credentials of the axioms, sc., as 'logical' truths in a certain sense, under the hypothesis that you are prepared to make definite estimates of all the magnitudes that appear in them, in circumstances where those estimates will be used as your buying-or-selling prices for tickets whose dollar values equal the true values of those magnitudes. But the point of such demonstrations is blunted if one's opinions are thought to be sets of estimation functions, or conditions on estimation functions, for then the hypothesis that you are prepared to make the definite estimates that the Dutch book arguments relate to will be satisfied only in the extreme cases where opinion is precise and complete, relative to the magnitudes in question.
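The bookkeeping of the Dutch book argument for axiom (3) is easy to mechanize. A minimal sketch with invented prices (the numbers are mine, chosen only to violate the axiom):

```python
# Sketch of the Dutch book bookkeeping for axiom (3) (illustrative numbers,
# not from the text): if est (X + Y) exceeds est X + est Y, you buy one
# ticket dear and sell two tickets of the same combined worth cheap,
# losing the difference no matter what the true values x, y turn out to be.

est_X, est_Y, est_X_plus_Y = 3.0, 4.0, 8.0   # violates (3): 8 > 3 + 4

def net_gain(x, y):
    """Your net gain when the true values are x and y."""
    bought = (x + y) - est_X_plus_Y    # pay 8 for a ticket worth x + y
    sold = (est_X + est_Y) - (x + y)   # receive 7 for tickets worth x + y
    return bought + sold

# Whatever x and y are, the x + y terms cancel and the loss is exactly
# est (X + Y) - est X - est Y = 1 dollar.
for x, y in [(0.0, 0.0), (10.0, -2.0), (3.5, 4.5)]:
    assert net_gain(x, y) == -1.0
```

The cancellation of x + y in `net_gain` is the whole point of the argument: the loss is "sure" because it does not depend on the unknown true values at all.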

5.4 Solution or evasion? Even if you see the Dutch book arguments as only suggestive, not demonstrative, you are unlikely to balk at the logicist solution to the old problem of induction (sec. 5.3) if you accept the probabilistic solution floated in sec. 5.2 for the new problem. But many will see probabilism as an evasion, not a solution; and while there can be no perfectly decisive answer to such doubts, it would be evasive to end this paper without some effort to meet them. The doubt can be illustrated well enough in connection with Example 1: "Indeed, if you maintain your initial opinion, according to which your estimates of the two people's weights will be the same, the information that one of them weighs (say) 132 pounds will produce a new opinion in which both weights are estimated as 132 pounds. But to use your initial opinion as a datum in this way is to beg the question

("What shall your opinion be?") that the new problem poses. It is only by begging that question that the old problem can be misconceived as a request for the credentials of the general constraints (3) (5) on all estimation functions, rather than for the special constraints that characterize your own opinion." That's the question which a probabilist must fault as question- begging in a characteristically objectivistic way. For that question contrasts your initial opinion, according to which est X1, = est X2, with the information that the true value of X1, is 132. But what is thus dignified as 'information' is more cautiously described as a new feature of your opinion, specified by two new constraints: estX1 = 132, varX1 = 0 It is the second of these that corresponds to the objectivistic characterization of the new estimate, 132, as information, not mere opinion. To interpret 'information' more strongly than this is to beg the question that the agent answers (in effect) by forming his new opinion in accordance with the new constraints: it is to say that his opinion about X, is not only definite (estX1 = 132) and confident (varX1 = 0) but correct. In turn, objectivists will identify this move as a typical subjectivistic refusal to see the difference between subjective confidence and objective warrant. "If the only constraints that my opinion must meet are the basic axioms, I am free to adopt any further constraints I please, as long as they are consistent. But then I don't need to go through the business of finding somebody who strikes me as similar in build, weight of clothing, etc., to the person whose weight I wish to guess, and weighing that similar person. I can simply decide on any number at all (it might even be 132, as luck would have it), and use that as my guess: est X2 = 132. Nor does anything prevent me from adopting great confidence regarding that guess: var X2=0". 
(Note that in Example 1, where var X1 was 0, there was no reason to set var X2 = 0 as well.) But this is nonsense, on a par with "If God is dead then everything is permitted". Look at it for a minute: God didn't die, either suddenly or after a long illness. The hypothesis is rather that there is no such entity, and never was. Our morality has no divine basis. Instead, its basis is in us: it is as it is because and insofar as we are as we are. ('Insofar as': humans are not as uniform as the divine template story suggests.) Moses smuggled the ten commandments up Mt. Sinai, in his heart. You know the line. I take the same line about Carnap's efforts, and Henry Kyburg's, to tell us just what constraints on our probabilistic opinions would be rationally justified by this or that corpus of fully-held beliefs. If you think that some such set of epistemological commandments must be produced and justified if we are to form warranted probabilistic opinions, then you will find subjectivistic probabilism offensively

nihilistic: a license for wishful thinking and all other sorts of epistemological sin. But I think such fears unwarranted. Wishful thinking is more commonly a feature of (shared or private) fantasy than of judgement. In forming opinion we aim to the truth, for the most part. The fact that I would be violating no laws of logic if I were simply to decide on a number out of the blue, as my estimate of someone's weight, does not mean that I would or could do that. (I could say '132' or '212' easily enough, but could I believe it?) In a way, the contrast with von Mises' frequentism is more illuminating than that with Carnap's logicism. Mises sought to establish probability theory as an independent science, with its own subject- matter: mass phenomena. If that were right, there would be a general expertise for determining probabilities of all sorts: concerning horse races, the weather, U235--whatever. But according to the sort of probabilism I am putting forward, such expertise is topical: you go to different people for your opinions about horse races, weather, U235 etc. Probability logic provides a common framework within which all manner of opinion can be formulated, and tuned. In the weight- guessing example you are supposed to have sought, and found, a person you saw as sufficiently similar in relevant respects to the one whose weight you wished to estimate, to persuade you that within broad limits, any estimate you form for the one (e.g. by weighing him on a scale you trust) will be your estimate for the other as well. Objectivism is willing to accept your judgement that the true value of X1, is 132, based on the scale reading, but rejects the part of your judgement that identifies the two estimates. But probabilism views both scalereading and equality-judging (for people's weights) as acquired skills. (The same goes for judging the accuracy of scales.) 
The point about scale-reading is that it is a more widely and uniformly acquired skill than is equality-judgement for people's weights. But when it comes down to it, your opinion reflects your own assessment of your own skills of those sorts. Here is how de Finetti (1938) expressed the basic attitude: "[one] must invert the roles of inductive reasoning and probability theory: it is the latter that has autonomous validity, whereas induction is the derived notion. One is thus led to conclude with Poincaré that 'whenever we reason by induction we make more or less conscious use of the calculus of probabilities'."

The difference between the approach to the problem of induction that I suggest here and the one de Finetti espoused in 'La prévision...' is a consequence of the difference between de Finetti's drive to express uncertainty by means of definite probability or estimation functions (e.g. 'exchangeable' ones (1937) and 'partially exchangeable' ones (1938)), and the looser course taken here, where opinions can be represented by constraints on estimation functions in cases where no one function adequately represents the opinion. By taking this looser point of view, one can use the mathematically trivial estimation theorem to find that under the constraints est X1 = . . . = est Xn+m, observed averages must be used as estimates of future averages, on pain of incoherence, i.e., inconsistency with the canons of probability logic.

In contrast, de Finetti (1937) uses the mathematically nontrivial law of large numbers, according to which one's degree of belief in any two of the averages differing by more than (say) 10^-10 can be made as small as you like by making the numbers of magnitudes Xi that appear in the two averages both large enough. The constraints in de Finetti's version of the law of large numbers are stronger than those in the estimation theorem: they require existence of real numbers a, b, c for which we have est Xi = a, est Xi² = b, est(XiXj) = c for all i, j = 1, 2, ... with i ≠ j. (Only the first of these constraints applies to the estimation theorem, and then only for i = 1, ..., n + m.)

I think that de Finetti spurns or overlooks the estimation theorem because he insists on representing opinions by definite probability and estimation functions. He then uses conditionalization to take experience into account. As the presumed initial opinion is precise and complete, he gets not only precise estimates of averages in this way, but precise variances too. It is a merit of the estimation theorem that it uses a very diffuse initial opinion, i.e., one that need satisfy no constraints but est X1 = . . . = est Xn+m. There is no use of conditionalization in proving or applying the estimation theorem. If variances are forthcoming, it is by a further application of the estimation theorem, as in Example 3: empirically, in large part.8 The claim is that through the estimation theorem, probabilism makes what considerable sense there is to be made of naive frequentism,9 i.e., of Hume's inductivism in its statistical avatar.
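The step from the equality constraints to the observed average can be spelled out in one line (a sketch, using only the linearity property (6) established in Note 1):

```latex
% Under the constraints est X_1 = \cdots = est X_{n+m},
% linearity of est forces the estimate of the future average
% to coincide with the estimate of the observed average:
\[
\mathrm{est}\,\frac{X_{n+1}+\cdots+X_{n+m}}{m}
  = \frac{1}{m}\sum_{i=1}^{m}\mathrm{est}\,X_{n+i}
  = \mathrm{est}\,X_{1}
  = \frac{1}{n}\sum_{i=1}^{n}\mathrm{est}\,X_{i}
  = \mathrm{est}\,\frac{X_{1}+\cdots+X_{n}}{n}.
\]
% So if the first n magnitudes have been observed to average r,
% so that r is the estimate of the observed average, coherence
% requires r as the estimate of the future average as well.
```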

Notes

1. For positive integers k, additivity clearly yields (6) by induction, since est (n+1)X = est X + est nX. Then for positive integral k, est (1/k)X = (1/k) est X, since k est (1/k)X = est X. This yields (6) for positive rational k, whence (6) follows for positive real k by the density of the rationals in the reals. By (3) and (5), est(1 + 0) = 1 + est 0, so that since 1 + 0 = 1, (5) yields est 0 = 0 and, thus, (6) for k = 0. Finally, to get (6) for negative real k it suffices to note that est(-1) = -1, since 0 = est(1 + (-1)) = 1 + est(-1) by (3) and (5). Here we have supposed that for real a and b, est(aX + bY) is defined whenever est X and est Y are, i.e., we have assumed that the domain on which est is defined is closed under addition and under multiplication by reals.
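As an illustration only, the linearity property (6) can be checked numerically by modeling est as expectation under one definite probability assignment over finitely many worlds (the special case the theory allows); the worlds, probabilities, and magnitudes below are invented for the purpose:

```python
# Numerical sanity check of (6): est(aX + bY) = a*est(X) + b*est(Y),
# with est modeled as expectation under a definite probability
# assignment. All the particular numbers here are made up.

def est(magnitude, prob):
    """Expectation of a magnitude, given as one value per world."""
    return sum(p * x for p, x in zip(prob, magnitude))

prob = [0.2, 0.5, 0.3]      # probabilities of three "worlds"
X = [130.0, 132.0, 135.0]   # e.g. a weight in pounds, world by world
Y = [1.0, 0.0, 1.0]         # an indicator magnitude (0 or 1)

a, b = 2.0, -7.0
lhs = est([a * x + b * y for x, y in zip(X, Y)], prob)
rhs = a * est(X, prob) + b * est(Y, prob)
assert abs(lhs - rhs) < 1e-9   # linearity holds exactly
```

The same check goes through for any choice of prob, X, Y, a, b, which is just what Note 1's derivation from additivity guarantees.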

2. (1) is deducible from (3), (6), and (7) in case X assumes only finitely many different values xi, for then we have X = Σi xi Xi, where Xi is the proposition that X = xi, i.e., Xi assumes the value 1 (0) at worlds where that proposition is true (false).

3. The constraints est Xi² = est Xj² represent a judgement quite different from that represented by the constraints est Xi = est Xj, a judgement we are less apt to make, or to feel confident of having made. (Note that estimating X² is not generally just a matter of squaring your estimate of X!)

4. As Johnson (1932) showed, and Kemeny (1963) rediscovered, the cases where one of the functions cλ is suitable are precisely those in which the user takes the appropriate degree of belief in the next item's belonging to cell Pi of the partitioning {P1, ..., Pk} to depend only on (i) the number of items already sorted into cells of that partitioning, and (ii) the number among them that have been assigned to cell Pi.

5. Not everyone would agree that nobody is continuing work on Carnap's project; e.g., Costantini (1982) sees himself as doing that. But as I see it, his program - a very interesting one - is very different from Carnap's.

6. There are two more axioms, which de Finetti does not acknowledge: an axiom of continuity, and an axiom that Lewis (1980) calls 'The Principal Principle' and others call by other names: prob(H | chance H = x) = x, where chance H is the objective chance of H's truth. De Finetti also omits axiom (5), presumably on the ground that the estimates of magnitudes are to represent estimates of the utilities you expect from them, where the estimated utility est X need not be measured in the same units as X itself, e.g. where X is income in florins, est X might be measured in dollars.

7. I gather that it originates with Keynes (1921), reappears with Koopman (1940), and is given essentially the form used here by Good (1950, e.g. p. 3). It is espoused by Levi (1974, 1980) as part of a rationalistic program.
I first encountered it, or something like it, in Kyburg (1961), but it took me 20 years to see its merits. Among statisticians, the main support for this way of representing imprecise or incomplete opinion comes from Good (1950, 1962), Smith (1961), Dempster (1967, 1968), and Shafer (1976). In practice, the business of reasoning in terms of a variable, 'prob' or 'est', that satisfies certain constraints is widespread - but with an unsatisfactory rationale, according to which one is reasoning about an unknown, definite function, which the variable denotes.

8. Anyway, in larger part than in de Finetti's approach. Use of the observed average is common to both, but the further constraints on est are weaker in this approach than in de Finetti's. Observe that with the symmetric flat prior probability function (Carnap's c*), conditioning on the proposition that m of the first n trials have been successes yields a posterior probability function prob relative to which we always have prob X1 = . . . = prob Xn = m/n, but have prob Xn+1 = (m + 1)/(n + 2) ≠ m/n unless n = 2m. The case is similar for other nonextreme symmetric priors, e.g. for all of form

9. Not von Mises' science of limiting relative frequencies in irregular collectives, but the prior, plausible intuition.

References

Carnap, R.: 1945, 'On inductive logic', Philosophy of Science 12, 72-97.
Carnap, R.: 1950, Logical Foundations of Probability, Univ. of Chicago Press.
Carnap, R.: 1952, The Continuum of Inductive Methods, Univ. of Chicago Press.
Carnap, R.: 1971, 'A basic system of inductive logic', in Carnap and Jeffrey (eds.) (1971) and Jeffrey (ed.) (1980).
Carnap, R. and R. Jeffrey (eds.): 1971, Studies in Inductive Logic and Probability, Vol. 1, Univ. of California Press.
Costantini, D.: 1982, 'The role of inductive logic in statistical inference', to appear in Proceedings of a Conference on the Foundations of Statistics and Probability, Luino, September 1981.
Dempster, Arthur P.: 1967, 'Upper and lower probabilities induced by a multivalued mapping', Annals of Mathematical Statistics 38, 325-339.
Dempster, Arthur P.: 1968, 'A generalization of Bayesian inference', J. Royal Stat. Soc., Series B 30, 205-247.
Finetti, Bruno de: 1937, 'La prévision: ses lois logiques, ses sources subjectives', Annales de l'Institut Henri Poincaré 7, 1-68. (English translation in Kyburg and Smokler.)
Finetti, Bruno de: 1938, 'Sur la condition d'équivalence partielle', Actualités Scientifiques et Industrielles, No. 739, Hermann & Cie., Paris. (English translation in Jeffrey (1980).)
Finetti, Bruno de: 1970, 1974, Teoria delle probabilità, Torino; English translation, Theory of Probability, Vol. 1, Wiley, New York. (Vol. 2, 1975.)
Good, I. J.: 1950, Probability and the Weighing of Evidence, Griffin, London.
Good, I. J.: 1962, 'Probability as the measure of a non-measurable set', in Ernest Nagel, Patrick Suppes, and Alfred Tarski (eds.), Logic, Methodology, and Philosophy of Science: Proceedings of the 1960 International Congress, Stanford Univ. Press. Reprinted in Kyburg and Smokler.
Goodman, N.: 1979, Fact, Fiction and Forecast, Hackett Publ. Co., Indianapolis.
Hume, D.: 1739, A Treatise of Human Nature, London.
Jeffrey, Richard C.: 1965, 1983, The Logic of Decision, McGraw-Hill; 2nd ed., Univ. of Chicago Press.
Jeffrey, Richard C. (ed.): 1980, Studies in Inductive Logic and Probability, Vol. 2, Univ. of California Press.
Johnson, W. E.: 1932, 'Probability', Mind 41, 1-16, 281-296, 408-423.
Kemeny, J.: 1963, 'Carnap's theory of probability and induction', in P. A. Schilpp (ed.), The Philosophy of Rudolf Carnap, La Salle, Ill.
Keynes, John M.: 1921, A Treatise on Probability, London.
Kolmogorov, A. N.: 1933, Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergebnisse der Math., Band II, No. 3. (English translation, Chelsea, N.Y., 1946.)
Koopman, B. O.: 1940, 'The bases of probability', Bulletin of the American Mathematical Society 46, 763-774. Reprinted in Kyburg and Smokler.
Kyburg, Henry E., Jr.: 1961, Probability and the Logic of Rational Belief, Wesleyan Univ. Press.
Kyburg, Henry E., Jr. and Howard Smokler (eds.): 1980, Studies in Subjective Probability, 2nd ed., Krieger Publ. Co., Huntington, N.Y.
Levi, I.: 1974, 'On indeterminate probabilities', J. Phil. 71, 391-418.
Levi, I.: 1980, The Enterprise of Knowledge, MIT Press.
Lewis, David K.: 1980, 'A subjectivist's guide to objective chance', in Jeffrey (ed.) (1980).
Mises, Richard von: 1919, 'Grundlagen der Wahrscheinlichkeitsrechnung', Math. Zs. 5.
Shafer, G.: 1976, A Mathematical Theory of Evidence, Princeton Univ. Press.
Skyrms, B.: 'Higher order degrees of belief', in D. H. Mellor (ed.), Prospects for Pragmatism, Cambridge Univ. Press.

Smith, C. A. B.: 1961, 'Consistency in statistical inference and decision', J. Royal Stat. Soc., Series B 23, 1-25.

Dept. of Philosophy
Princeton University
Princeton, N.J. 08544, U.S.A.
