You are on page 1of 316
Springer Series in Statistics DF, Andrews and A. M. Herzberg, Data: A Collection of Problems from Many Fields for the Student and Research Worker. xx, 442 pages, 1965. F J, Anscombe, Computing in Statistical Science through APL. xvi. 426 pages, 1981 4.0. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd edition. xiv, 425 pages, 1986. P. Brémaud, Point Processes ané Queues: Martingale Dynamics. xvii, 954 pages, 1981 K. Dzhaparidze, Parameter Estimation and Hypothesis Testing in Spectral Analysis Of Stationary Time Series. xi, 300 pages, 1985. AH, Farrell, Multivariate Calculation. xvi, 367 pages, 1985, LA. Goodman and W. H. Kruskal, Measures of Association for Gross Classifications, x, 146 pages, 1979. J. A. Hartigan, Bayes Theory. xii, 145 pages, 1983. H, Heyor, Theory of Statistical Experiments. x, 269 pages, 1982. M. Kres, Statistical Tables for Multivariate Analysis, xxi, 504 pages, 1983. M.A. Leadbetter, G. Lindgren and H. Rootzén, Extremes and Related Properties of Random Sequences and Processes. xi, 336 pages, 1983, A.G. Miler, Je, Simultaneous Statistical Inference, 2nd edition. xvi, 298 pages, 1961 F. Mosteller and D. S. Wallace, Applied Bayesian and Ciassical Inference: The Case of The Federalist Papers. xxv, 301 pages, 1984. . Pollard, Convergence of Stochastic Processes. xiv, 215 pages, 1984. J. W. Pratt and J. D. Gibbons, Concepts of Nonparametric Theory. xvi, 462 pages, ‘981 L.Sachs, Applied Statistics: A Handbook of Techniques, 2nd edition. xxvill,706 pages, 1984. E. Seneta, Non-Negative Matrices and Markov Chains. xv, 279 pages, 1981, , Siegmund, Sequential Analysis: Tests and Confidence Intervals. xi, 272 pages, 1986. \V. Vapnik, Estimation of Dependences Based on Empirical Data. xvi, 399 pages, 1982, K. M. Wotter, Introduction to Variance Estimation. xi, 428 pages, 1985. James O. Berger Statistical Decision Theory and Bayesian Analysis Second Edition With 23 Illustrations Springer-Verlag New York Berlin Heidelberg Tokyo James O. Berger Department of Statistics Purdue University West Lafayette, IN 47907 USA AMS Classification: 6OCXX. "4, OKT. 1985 Library of Congress Cataloging in Publication Data Berger, James O. ‘Statistical decision theory and Bayesian analysis, (Springer series in statistics) Bibliography: p- Includes index. IE Statistical decision. 2, Bayesian statistical decision theory. I. Title. II, Series. QA2704.B46 1985 519.942——_—85-9891 “This isthe second edition of Statistical Decision Theory: Foundations, Concepts, and ‘Methods. © 1980 Springer-Verlag New York Inc. © 1980, 1985 by Springer-Verlag New York Inc. All rights reserved. No part of this book may be translated or reproduced in any form without written permission {rom Springer-Verlag, 175 Fifth Avenue, New York, New York 10010, US.A. ‘Typeset by J. W. Arrowsmith Ltd, Bristol, England. Printed and bound by R. R. 
Donnelley & Sons, Harrisonburg, Virginia Printed in the United States of Americs 987654321 ISBN 0-387-96098-8 Springer-Verlag New York Berlin Heidelberg Tokyo ISBN 3-540-96098-8 Springer-Verlag Berlin Heidelberg New York Tokyo To Ann, Jill, and Julie Preface Statistical decision theory and Bayesian analysis are related at a number of levels, First, they are both needed to solve real decision problems, each embodying a description of one of the key elements of a decision problem, Ata deeper level, Bayesian analysis and decision theory provide unified outlooks towards statistics; they give a foundational framework for thinking about statistics and for evaluating proposed statistical methods, ‘The relationships (both conceptual and mathematical) between Bayesian analysis and statistical decision theory are so strong that it is somewhat unnatural to learn one without the other. Nevertheless, major portions of cach have developed separately. On the Bayesian side, there is an extensively developed Bayesian theory of statistical inference (both subjective and objective versions). This theory recognizes the importance of viewing statis- tical analysis conditionally (ie., treating observed data as known rather than unknown), even when no loss function is to be incorporated into the analysis. There is also a well-developed (frequentist) decision theory, which avoids formal utilization of prior distributions and seeks to provide a foundation for frequentist statistical theory. Although the central thread of the book will be Bayesian decision theory, both Bayesian inference and non-Bayesian decision theory will be extensively discussed. Indeed, the book is written so as to allow, say, the teaching of a course on either subject separately, Bayesian analysis and, especially, decision theory also have split per- sonalities with regard to their practical orientation, Both can be discussed. at a very practical level, and yet they also contain some of the most difficult and elegant theoretical developments in statistics. The book contains a fair amount of material of each type. There is extensive discussion on how to actually do Bayesian decision theory and Bayesian inference, including how a preface prior distributions and loss functions, as well as hos 1 utilize to construct Gntroductions are given to some of the beautiful them. At the other extreme, theoretical developments in these areas. et tistical level ofthe book is formally rather low, in Wat Piet knowledge of Bayesian analysis, decision theory, oF advanced statistics is oe eetry, The book will probably be rough going, howeyeh for those wierately serious statistics course. For without previous exposure {0 @ mo warrance previous exposure to such concepls as sulficiens> desirable. It “Hould also be mentioned that parts of the Book are philosophically very ‘disagreements that exist among statisticians, con these fundamental challenging; the extcem: crenine the correct approach to statistics, suggest that coring ve conceptually dificult Periodie rereading of such material (€.8.. Seegons 1.6, 41, and 412), as one proceeds through the book, is recom= mended. ded athematical level of the book is, for the most parts at £0 Set advamosd calculus Tevel, Some knowledge of probability 1 required; at tena cay, a knowledge of expectations and conditional probability. 
From cae fo dime (especially in later chapters) some higher ‘mathematical facts time 10 Gmployed, but knowledge of advanced mathematics s 038 required wa low most ofthe text Because ofthe imposed matheraica! limitations, 10 oli he stated theorems need, say, additional measurably conditions se cr pletely precise. Also less important (but ronignorabe) technical xo eons for some developments are sometimes omitted, Put stich develop- nents are called “Results,” rather than “Theorems. “The book is primarily concerned with discussing basic ists ciples of Bayesian analysis and decision theory. Ne systematic attempt is — will be con- sidered. This condition is satisfied by all loss functions of interest. Chapter 2 will be concerned with showing why a loss function will typically exist in a decision problem, and with indicating how a loss function can be determined. ‘When a statistical investigation is performed to obtain information about 6, the outcome (a random variable) will be denoted X. Often X will be a vector, as when X =(X;, Xz,..., X,),the X, being independent observa- tions from a common distribution. (From now on vectors will appear in boldface type: thus X.) A particular realization of X will be denoted x, The set of possible outcomes is the sample space, and will be denoted #: (Usually 2 will be a subset of R*, n-dimensional Euclidean space.) The probability distribution of X will, of course, depend upon the unknown state of nature @ Let P,(A) or P,(X ¢ A) denote the probability 1, asic Concepts 4 of the event A(Ac 2), when 0 isthe true state of nature, For simpli XX wil be assumed to be either a continuous ofa discrete random variable, swith density f(x] 4). Thus if X is continuous (i.e., has a density with resps to Lebesgue measure), then Paes while if X is discrete, then PA(A= E fl). Certain common probability densities and their relevant properties are given anne tations over randot Pel requenly be necesary {0 consider expectations over random varlables The expectation (over X) of a function h(x), fOr 8 give of @, is defined to be {f h(x)f(x| 8)dx (continuous case), BKx1=} moo) | L h(xyfixley (discrete case), would be cumbersome to have to deal separately with these two diferent expressions for Es[h(X)]. Therefore, as a convenience, we will defi fo EAN (i tt ete ute Stun Fnston of j dF*(x| 0). eta \ction was prior not PAA ym variables over 1.2, taste Elements : in terms of probabilities of various possible values of @ being true.) The symbol 7(@) will be used to represent a prior density of @ (again for either the continuous or discrete case), Thus if Ac @, (discrete case) Chapter 3 discusses the construction of prior probability distributions, and also indicates what is meant by probabilities concerning @, (After all, in ‘most situations there is nothing “‘random" about 0, A typical example is when @ is an unknown but fixed physical constant (say the speed of light) which is to be determined. The basic idea is that probability statements concerning 0 are then to be interpreted as “personal probabilities” reflecting the degree of pe:sonal belief in the likelihood of the given statement.) ‘Three exampies of use of the above terminology follow. EXAMPLE 1. In the drug example of the introduction, assume it is desired {0 estimate 0. Since 0, is a proportion, itis clear that © =(6;:0< 0,= 1} = (0, 1]. Since the goal is to estimate @,, the action taken will simply be the choice of a number as an estimate for #.. Hence sa =[0, 1). (Usually af = © for estimation problems.) 
The company might determine the loss function to be @-a it by 2a~63) a>0, uored-[Fcay temas (The loss is in units of “utility,” a concept that will be discussed in Chapter 2.) Note that an overestimate of demand (and hence overproduction of the drug) is considered twice as costly as an underestimate of demand, and that otherwise the loss is linear in the error. A reasonable experiment which could be performed to obtain sample information about #; would be to conduct a sample survey. For example, assume n people are interviewed, and the number X who would buy the drug is observed. It might be reasonable to assume that X is 93(n, 05) (see Appendix 1), in which case the sample density is feateny=(") ose There could well be considerable prior information about @ arising from previous introductions of new similar drugs into the market. Let's say that, in the past, new drugs tended to capture between +b and } of the market, with all values between and {being equally likely. This prior information could be modeled by giving # a 2(0.1,0.) prior density, i, letting, (02) = 10ho.,02\(6). 1. Basie Conceots and zis quite orude, and usually much quired to obtain satisfactory results. veloped as we proceed. The above development of L, f more detailed constructions are re techniques for doing this will be de received by a radio company. It seu 2: 8 sipment of ranstors i reseed ya aio company er ettenk he peormance of esh ansior sept 0 aor cede chet Ie shipment sale, A random sample 22am ee tvom te shipment and ested fased upon X the of ast i astrsin esp, the shipment wil be 2cepted fs the av vo pons acton:ey—aeep the PMN 2a ejec the shipment fi small compared the SPM Ue a ree ng co have a (a, 6) distribution, where 08th Pi x casing trumaistors inthe shipment see sspany determines thatthe L(8, a2) =1. (When a, is decided (i.¢., ° se eich yetecn cots de to Homers, dl nd crea eicme saps. hey dene hit sceepted he es radon priced. The factor 1 indieates the rela “ Ta pant eueved numerous other transistor inet yes the ps esved numerous oe arson sera nnn campany ene ay aves en me peel off post shipments, Inded 2 sia of data concerning ats eed tat # a tbe ceding £08 B.(0.05, 1) distribution. Hence (8) = (0.05) 6 °** Kra.n(8). ir loss function is L(#, aoa Toc in wejeted) the Toss isthe a 0 inconvenience, delay, and costs involved i desde whether or not to buy rather tsk bende they an beredeemed at mat Fer ne or) Thee could, however, bea eta on the Bondo ne En of Seal $1000 invement would be fos. 1 the investor eh ase IN mney in asa" investment, he wil be guaranteed nel instead po i nea ime period. The inventor esimatese probability fain of 300 ove Sa default co be 01 Here af ~tan a) where a Examece 3. An investor must ZZZ bonds. Ifthe investor buys t stands for buying the bonds and a for nt ithe state of nate "0 ineuite (0, 0), where 0 denotes th re aa ana ste state a defi osu” Realing shat ein [ca resented by gative loss, the loss function is given DY table. @, | -s00 | -300 | fi tere ; | 1000 | ~300 | 1.2, Basie Elements 7 (When both © and sf are finite, the loss function is most easily represented by such a table, and is called a loss matrix. Actions are typically placed along the top of the table, and 9 values along the side.) The prior information can be written as 1(@)=0.9 and 7(8;)=0.1 Note that in this example there is no sample information from an associated statistical experiment. Such a problem is called a no-data problem. 
It should not be construed from the above examples that every problem will have a well-defined loss function and explicit prior information. In many problems these quantities will be very vague or even nonuinique. The most important examples of this are problems of statistical inference. In statistical inference the goal is not to make an immediate decision, but is instead to provide a “summary” of the statistical evidence which @ wide variety of future “users” of this evidence can easily incorporate into their own decision-making processes. Thus a physicist measuring the speed of light cannot reasonably be expected to know the losses that users of result will have. Because of this point, many statisticians use “statistical inference” as a shield to ward off consideration of losses and prior information. This is a mistake for several reasons. The firs is that reports from statistical inferences should (ideally) be constructed so that they can be easily utilized in individual decision making, We will see that a number of classical inferences are failures in this regard, A second reason for considering losses and prior information in inference is that the investigator may very well possess such information: he will often be very informed about the uses to which his inferences are likely to >be put, and may have considerable prior knowledge about the situation. It is then almost imperative that he present such information in his analysis, although care should be taken to clearly separate “subjective” and “objec. tive” information (but see Subsection 1.6.5 and Section 3.7) The final reason for involvement of losses and prior information in inference is that choice of an inference (beyond mere data summarization) can be viewed as a decision problem, where the action space is the set of all possible inference statements and 2 loss function reflecting the success in conveying knowledge is used. Such “inference losses” will be discussed in Subsections 2.4.3 and 4.4.4. And, similarly, “inference priors” can be constructed (see Sections 3.3 and 4.3) and used to compelling advantage in inference. While the above reasons justify specific incorporation of loss functions and prior information into inference, decision theory can be useful even when such incorporation is proscribed. This is because many standard inference criteria can be formally reproduced as decision-theoretic criteria with respect to certain formal loss functions. We will encounter numerous illustrations of this, together with indications of the value of using decision- theoretic machinery to then solve the inference problem. 1. asic Concepts 8 1.3. Expected Loss, Decision Rules, and Risk s mentioned in the Introduction, we will be involved with decision making ine presence of uncertainty. Hence the actual incurred Toss, L(2 2), wil never be known with certsnty (at the ime of decision making) A natural method of proceeding in the face of this uncertainty isto consider the Toss of making devision, and then choose an “optimal 1d loss. In this section we consider “expected” decision with respect to this expecte: several standard types of expected loss. 1.3.1. 
Bayesian Expected Loss rom an intuitive viewpoint, the most natural expected loss to consider i be nveling the uncertain nce sal tht unkrowa atthe ime ofnating the Jesnon, We have altedy mentioned tht Kis possible to tea Oasaandom quant witha probably dsb, nd consiesing Capected loss with sept to ths probabil dation i emi Eee (and il indeed be jstiied in Chapters 2 3 and ) 2°) isthe belived probability dssbuton of 0 atthe Saab te . f an action a is Definitior time of decision making, the Bayesian expected loss of pat, a)= 10,0 [ (6, a)dF""(8). ExaMce 1 (continued). Assume no data is obtained, so that the believed distribution of 62 is simply 7(8:) = 10K(y4,9.2)(62). Then otz.ai= [ L(0, a) #(62)d03, = ie 2a~ A)101o oaiedors (0~ a)0ho102)(82)4B: 1Sa*—4a+03 if0.1sa aus-a fas, -{s -0.3 ifa=0.2. Exampce 3 (continued). Here pl, ay) = E*L(4, ay) = 10), as) (0) + L(@>, @1) 16) (500)(0.9) +(1000)(0.1) = -350, pm a2) = E*L(6, a2) = L(84, a3) 7(84)+ L(O>, a3) 7102) = 300. 13. Expected Loss, Decision Rules, and Risk ° We use * in Definition 1, rather than 7, because 7 will usually refer to the initial prior distribution for @, while “* will typically be the final (posterior) distribution of @ after seeing the data (see Chapter 4). Note that it is being implicitly assumed here (and throughout the book) that choice of @ will not affect the distribution of @ When the action does have an effect, one can replace 7*(0) by 73(6), and still consider expected loss. Sce Jeffrey (1983) for development. 13.2. Frequentist Risk ‘The non-Bayesian school of decision theory, which will henceforth be called the frequentist or classical school, adopts a quite different expected loss based on an average over the random X, As a first step in defining this expected loss, itis necessary to define a decision rule (or decision procedure). Definition 2. A (nonrandomized) decision rule (x) is a function from & into sf. (We will always assume that functions introduced are appropriately “measurable.”) If X = x is the observed value of the sample information, then 8(.x) is the action that will be taken. (For a no-data problem, a decision rule is simply an action.) Two decision rules, 8, and 42, are considered equivalent if P,(8,(X) = 6,(X)) = for all 8, Exampte 1 (continued). For the situation of Example 1, 5(x) = x/n is the standard decision rule for estimating 8,. (In estimation problems, a decision rule will be called an estimator.) This estimator does not make use of the Joss funetion or prior information given in Example 1. It will be seen later how to develop estimators which do so. EXxaMpLe 2 (continued). The decision rule a itx/n008, se (e: reson is a standard type of rule for this problem, The frequentist decision-theorist seeks to evaluate, for each @, how much he would “expect” to lose if he used 8(X) repeatedly with varying X in the problem. (See Subsection 1.6.2 for justification of this approach.) Defi 3. The risk function of a decision rule 8(.x) is defined by R(8,6)= Es(L(6, ann [ (0, (x) )dF*(x] 8). (For a no-data problem, R(#, 8)= L(8, 3).) 10 1. asic Concepts To a frequentist, itis desirable to use a decision rule 6 which has small (0,8). However, whereas the Bayesian expected loss of an action was a Single number, the risk isa function on @, and since 0 is unknown we have fa problem in saying what “small” means, The following partial ordering of decision rules is a first step in defining a “good” decision rule. 
Definition 4, A decision rule 5, is R-better than a decision rule 6: if (6, 5,) = R(O, 6s) for all 8, with strict inequality for some 6. A rule 5 ig Reeguivalent to 6; if R(4, 6,) = RO, 6) for all 6 Definition 5. A decision rule 6 is admissible if there exists no R-better decision rule. A decision rule 6 is inadmissible if there does exist an R-better decision rule. It is fairly clear that an inadmissible decision rule should not be used, since a decision rule with smaller risk can be found, (One might take exception to this statement if the inadmissible decision rule is simple and easy {0 use, while the improved rule is very complicated and offers only slight improvement. Another more philosophical objection to this exclusion of inadmissible rules will be presented in Section 4.8.) Unfortunately, there is usually a large class of admissible decision rules for a particular problem ‘These rules will have risk functions which cross, i. which are better in different places. An example of these ideas is given below. Exampe 4, Assume X is (0,1), and that it is desired to estimate @ under oss £(6, a) =(0—a)*. (This loss is called squared-error loss.) Consider the decision rules 6,(x) = cx. Clearly R(B,B.) = ES 10, 6(X))= BS(G= eX)? = EN e(o-xX14{1- 19" = EMO X]'+2e(1-C)OEMO~ X]+U~ PO = e+ (1-00 Since for e>1, R(0, 5) =1— eH (L090 = ROB), 8, is Rebetter than 8, for c>1. Hence the rules 6, are inadmissible for cot. On the other hand, for 0e=1 the tules are noncomparable. For ‘example, the risk functions of the rules 8, and 6,/2 are graphed in Figure 11. The risk functions clearly cross. Indeed it will be seen later that for O=c=1, 8. is admissible. Thus the “standard” estimator 5, is admissible. So, however, is the rather silly estimator 8y, which estimates @ to be zero rno matter what x is observed. (This indicates that while admissibility may be a desirable property for a decision rule, it gives no assurance that the decision rule is reasonable.) 13. Expected Loss, Decision Rules, and Ris 1d Risk n Risk | R045) : 0.54) Figure 1.1 .s matrix of a particular no-data problem. a | 0} -1 The rule (action) a; is Rebetter th L(0,. a3) . etter than ay since L(A, a;) = L(0, as) for all 6, with strict inequality for 6. (Recall that, in a no-data problem, the risk is Simply the fos.) Hence 4 inadmisibe, The acon and a: are noncomparable, in that L(0, a4) ~< L(6,, a3) for @, and 0, while the rev sae aestiiin FaTEatTA rtssssd a TesadTAHTESTTTTaeTTanaoet “li 20k ne wil ony conse eon as whit isk More prmally, we wil asume thatthe only (nonrandomized) dein fl under consideration are those in the class : a 2 {all decision rules 5: R(6, 5) <:° forall 6€ @}, (There are actually technical reasons for allowing infinite risk decision rules in certain abstract settings, but we will encounter no such situations in this 00k; our life will be made somewhat simpler by book: oui jewhat simpler by not having to worry about spit defer dscusion ofthe dilerencesbetneen using Bayesian expected loss and the rik funtion until Section 1.6 (and elsewhere inthe hook) ere is, however, one other relevant expected loss to consider, and that is the expected loss which averages over both 8 and X. : Defniion 6. The Boyes risk of a éedsin rl distribution = on @, is defined a proeaeiete Hm, 8 E"[R(0, 8)1 1. Basle Concepts 2 Exampte 4 (continued). Suppose that (6) is a (0, 1°) density. Then, for the decision rule 8, 6.) = E(R(G, 6.) 
HEH (1 cP EO] “The Bayes risk of a decision rule will be seen to play an important role in virtually any approach to decision theory. Eve +(1- 06] : Pee) r( 1.4. Randomized Decision Rules In some decision situations it is necessary to take actions in a random manner, Such situations most commonly arise when an intelligent adversary js involved, As an example, consider the following game called “matching pennies. Exampce 6 (Matching Pennies), You and your opponent are to simul- taneously uncover a penny. If the two coins match (i.e. are both heads or both tails) you win SI from your opponent. If the coins don't match, your ‘opponent wins $1 from you. The actions which are available to you are “choose heads, or a;—choose tails. The possible states of nature are : 1,—the opponent's coin is a tail. 6,—the opponent's coin is a head, and ‘The loss matrix in this game is 6, [=a ara =) a, and ay are admissible actions. However, played" amber of times, then kt would clearly be a very poor idea Co Gece tones oF ensue Your opponent would we ssi ocatze your strategy, and simply choose his action to guarantee SUEY [Skewes any patted choice of and a; could be discerned by sreeelaigent opponent who could then develop winning strategy. The aasetin way of preventing uimate deeat, therefore, €0 cho0se ay 2a be sme random mechanisn, A natural ay fo do this simply to Sener na wih probabiiiesp and 1p respectively, The formal Seamtion of sucha randomized decion rule follows 4; if the game is to be Definition 7. A randomized decision rule 5°(x,-) is, for each x, a probability distribution on i, with the interpretation that if x is observed, 6*(x, A) is 1.4. Randomized Decision Rules B the probability that an action in A (a subset of sf) will be chosen. In no-data problems a randomized decision rule, also called a randomized action, will simply be noted 8*(-), and is again a probability distribution on sf. Nonran- domized decision rules will be considered a special case of randomized rules, in that they correspond to the randomized rules which, for each x, ‘choose a specific action with probability one, Indeed if 6(x) isa nonrandom. ized decision rule, let (6) denote the equivalent randomized rule given by 1 ifa(xeA, (Bs, A)= 1A(6(30) = wear=Wiseo=Ly iremea EXAMPLE 6 (continued). The randomized action discussed in this example is defined by 5*a,)=p and 8"(a;)=1~p. A convenient way to express this, using the notation of the preceding definition, is 8" = play) +(1—p)ay) ExaMpLe 7. Assume that a 98(n, 6) random variable X is observed, and that it is desired to test Hy: @=@, versus H,: @= 8, (where 0)> 8). The nonrandomized “most powerful” tests (see Chapter 8) have rejection regions of the form C={xe#: x=j}(j=0,1,2,-..,). Since % is discrete and finite, the size of these tests (namely a ~ Py.(C)) can attain only a finite ‘number of values. If other values of a are desired, randomized rules must be used. It suffices to consider the randomized rules given by Litre), wrsanel ifx=j, 0 ifx>) and 8}(x, a) = 1~87(x, a), where a, denotes accepting Hy, (Thus if x is observed, Hy will never be rejected (ie, will always be accepted). If x= is observed, a randomization will be performed, rejecting H, with probability p and accepting with probability 1p. Through proper choice of j and 7, a most powerful test of the above form can be found for any given size «) The natural way to define the loss function and the risk function of a randomized decision rule is in terms of expected loss. (This will be justified in Chapter 2.) 
Definition 8. The loss function L(4, 8*(x,-)) of the randomized rule 8* is defined to be 10, 6*(x,-)) = E°"[L(0, a), where the expectation is taken over a (which by an abuse of notation will denote a random variable with distribution 8*(x, -)). The risk function of 1, Basie Concepts 14 5% will then be defined to be R(6, 5") = EN{L(6, 8°(X, °))) Exampct 6 (continued). As this is a no-data problem, the risk is just the loss, Clearly 16, 8°) = E*[L(G, a)}= 6(a,)L(8, a1) +3*(a2)LC, as) »L( 8, a) +(1—p)L(8, a2) p+(-p)=1-2p if Note that if p=! is chosen, the loss is zero no matter what the opponent {thus guarantees an expected loss does. The randomized rule 6* with of zero. Exarce 7 (continued): Assume that the loss is zero if a correct decision is made and one if an incorrect decision is made. Thus . 0 ifims, LOT nine ‘The loss of the randomized rule 6* when @~ Gp is then given by LUO, 691, -)) = BML (5 a1 BFL, ao) LC Boy a) + 8F% 41) L(y 4) = 6}(4, a). Hence EX[6MX,a))] ). f) = E*LL(G0, 87%, ) Pf X + (1 p)Pa LX =I [As with nonrandomized rules, we will restrict attention to randomized rules with finite risk Definition 9, Let 2 be the set of all randomized decision rules 8° for which (9, *) 1.96, where x is the sample mean. Now it is unlikely that the null hypothesis is ever exactly true. Suppose, for instance, that 0 = 10-"°, which while nonzero is probably a meaningless difference from zer0 in most practical contexts. If now a very large sample, say n~ 10° is taken, then with extremely high probability X will be within 10""' of the true mean @= 107". (The standard deviation of X is only 10) But, for ¥ in this region, it is clear that 10”|z|> 1.96. Hence the classical testis virtually certain to reject Hq, even though the true mean is negligibly different from zero. This same phenomenon exists no matter what size a>0 is chosen and no matter how small the difference, e>0, is between zero and the true mean. For a large enough sample size, the classical test will be virtually certain to reject 1.6, Foundations m1 The point of the above example is that it is meaningless to state only that a point null hypothesis is rejected by a size a test (or is rejected at significance level a). We know from the beginning that the point null hypothesis is almost certainly not exactly true, and that this will always be confirmed by a large enough sample. What we are really interested in determining is whether or not the null hypothesis is approximately true (see Subsection 4.3.3). In Example 8, for instance, we might really be interested in detecting a difference of at Ieust 10-° from zero, in which case a better null hypothesis would be Hy: |é|~=10. (There are certain situations in which it is reasonable to formulate the problem as a test of a point aull hypothesis, but even then serious questions arise concerning the “final precision” of the classical test. This issue will be discussed in Subsection 433) As another example of this basic problem, consider standard “tests of fit,” in which ic is desired to see if the data fits the assumed model. (A typical example is a test for normality.) Again itis virtually certain that the model is not exactly correct, so a large enough sample will almost always reject the model. The problem here is considerably harder to correct than in Example 8, because it is much harder to specify what an “approximately correct” model is, A historically interesting example of this phenomenon (told to me by Herman Rubin) involves Kepler's laws of planetary motion. 
Of interest is his first law, which states that planetary orbits are ellipses. For the observa- tional accuracy of Kepler's time, this model fit the data well. For todays data, however, (oF even for the data just 100 years after Kepler) the null hypothesis that orbits are ellipses would be rejected by a statistical signifi- cance test, due to perturbations in the orbits caused by planetary interactions. The elliptical orbit model is, of course, essentially correct, the error caused by perturbations being minor. The concern here i that an essentially correct model can be rejected by too accurate data if statistical significance tests are blindly applied without regard to the actual size of the discrepancies The above discussion shows that a “statistically significant” difference between the true parameter (or true model) and the null hypothesis can be ‘an unimportant difference practically. Likewise a difference that is not significant statistically can nevertheless be very important practically. Con. sider the following example. Exampue 9. The effectiveness of a drug is measured by X ~.(4,9). The null hypothesis is that @=0. A sample of 9 observations results in X=. This is not significant (for a one-tailed test) at, say, the a = 0.05 significance level. It is significant at the a =0.16 significance level, however, which moderately convincing. If | were a practically important difference from zero, we would certainly be very interested in the drug. Indeed if we had to make a decision solely on the basis of the given data, we would probably decide that the drug was effective. 2 1, Basic Concepts ste above problems ae, ofcourse, wel secognize by classical satis oun nny at east Berkson (193) who whe using the framework of aaa eel ypotheses, do concern themselves wh the eal por a int seen somewhat nonsesial, however, {0 deliberately etetsae 1 problem wrong and then inan adhoc fashion explain the al serra ade eesonale terms, Also, there ae unfortunately any ses etice who go mol understand the pifals ofthe incorrestcasca : ‘One of the main benefits of decision theory is that it forces one to think spout he cere formulation ofa problem. A numberof decsion theoreti sree cote caceal sinicance ests wl be introduced a we proceed, aihougho systematic study of such aerades wil be undenaken 1.6.2. The Frequentist Perspective On the face oft it may seem rather peculiar fo use a risk (or any other Frequent, nieawire such as confidence, eror probabilities, bias, ete) ia thetenor from an experiment since they involve averaging the performance ra procedure overal possible data while tis known which data ocurred In the setion we will ely discus the motivation for using fequentst menithough one can undoubtedly find earlier traces, the fist systematic development of frequents Keas canbe found inthe ealy writings of J Negi and Pearson (ef. Neyman (1967). The original diving force petind their Fequenist development seemed to be the desire to produce vegas uhich Bid not depend on for any prior knowledge about @ The Method of doing tht was to consider a procedure 3(x) and some ereion rreeesn Ld, s)) and then find a number R such tha repeated use of 8 would yield average long run performance of at least R. EXAMPLE 10. For dealing with standard univariate normal theory problems, consider the usual 95% confidence rule for the unknown mean 8, a(x) where ¥ and s are the sample mean and standard deviation, respectively, and tis the appropriate percentile from the relevant ¢ distribution. 
Suppose that we measure the performance of 5 by e Il) = 1 Note that, if this is treated as a decision-theoretic loss, the risk becomes (including the unknown standard deviation o as part of the parameter) Rots, 8415), if ee o(x), ay if ae (x). R((4,0), 8) = Exe L(0, 8(X)) = Poo(5(X) does not contain 6) = 0.05. 16. Foundations 2 The idea now is to imagine that we will use & repeatedly on a series of (independent, say) normal problems with means 6, standard deviations @;, and data X'”. Itis then an easy calculation, using the law of large numbers, to show that (with probability one) fim 1 E100, 8(X"%)) 0.05% & ay no matter what sequence of (6,, 7) is encountered, ‘The above frequentist motivation carries considerable appeal. As stats ticians we can “proclaim” that the universal measure of performance that is to be reported with 8 is R =0.05, since, on the average in repeated use, 5 will fail to contain the true mean only $% of the time. This appealing motivation for the frequentist perspective was formalized as the Confidence Principle by Birnbaum (see Cox and Hinkley (1974) and Birnbaum (1977) for precise formulations). Other relevant works of Neyman on this issue are Neyman (1957 and 1977). ‘Two important points about the above frequentist justification should be stressed. These are: (i) the motivation is based on repeated use of 6 for different problems; and (ii) a bound & on performance must be found which applies to any sequence of parameters from these different problems. The elimination of either of these features considerably weakens the case for the frequentist measure. And, indeed, the risk R(0, 8), that we have so far considered, seems to violate both of these conditions: it is defined as the repeated average loss if one were to use 8 on a series of data from the same problem (since @ is considered fixed), and a report of the function (A, 5) has not eliminated dependence on 8 Several justifications for R(4, 8) can stillbe given in terms ofthe “primary motivation,” however. The first is that risk dominance of 6, over 6 will usually imply that 4, is better than 8, in terms of the primary motivation. The second is that R(, 5) may have an upper bound R, and, if so, this can typically be shown to provide the needed report for the primary motivation, To see the problem in using just R(4, 5), consider the following example. ExaMpLe 1, Consider testing the simple null hypothesis Hy: = 6) versus the simple alternative hypothesis Hy: 6 = 6). Ifthe loss is chosen to be "0-1" Joss (see Subsection 2.4.2), the risk function of a test 5 turns out to be given by R(@, 5) = ay ~ Py, (Type I error) and R( 4, 5)~ a, = Py, (Type Il error). Suppose now that one always uses the most powerful test of level a= 0.01 ‘This would allow one to make the frequentist statement, upon rejecting Ho, “my procedure ensures that only 1% of true null hypotheses will be rejected.” Unfortunately, this says nothing about how often one errs when rejecting. For instance, suppose a, = 0.99 (admittedly terrible Type II error probabil- ity, but useful for making the point) and that the null and alternative 24 1. asic Concepts parameter values occur equally often in repetitive use of the test. (Again, ive are imagining repeated use of the a= 0.01, « = 0.99 most powerful test ‘on a sequence of different simple versus simple testing problems.) Then it can easily be shown that half of all rejections of the null will actually be in error. 
And this is the “error” that really measures long run performance of the test (when rejecting). Thus one cannot make useful statements about the actual error rate incurred in repetitive use, without a satisfactory bound on R(@, 6) for all 6. Other justifications for R(#,5) can be given involving experimental design and even “Bayesian robustness” (see Subsection 1.6.5 and Section 4.7). [twill be important, however, to bear in mind that all these justifications are somewhat secondary in nature, and that assigning inherent meaning to (R(8,5), as an experimental report, is questionable. For more extensive discussion of this issue, see Berger (1984b), which also provides other references. 1.6.3. The Conditional Perspective ‘The conditional approach to statistics is concerned with reporting data specific measures of accuracy. The overall performance of a procedure 5 is deemed to be of (at most) secondary interest; what is considered to be ‘of primary importance is the performance of 8(x) for the actual dara x that is observed in a given experiment. The following simple examples show that there can be a considerable difference between conditional and frequen tist measures. ExaMpLe 12. Suppose that X; and X; are independent with identical distribution given by Py(X/= 8-1) = PAX = O41) =h, where —c0< @ <0 is unknown. The procedure (letting X =(X1, X2)) the point (Xy+NG) if X# Xa, is easily seen to be a frequentist 75% confidence procedure of smallest size {(ie., Pp(6(X) = 6) =0.75 for all 9). However, a conditionalist would reason fas follows, depending on the particular x observed: if x has x, x2, then we know that 4(x, + %;) = 6 (since one of the observations must be @—1 and the other must be 6+1), while, if x,=x2, the data fails to distinguish in any way between the two possible @ values x;~L and x,+1. Hence, condi- tionally, 8(x) would be 100% certain to contain @ if x, # x2, while if x1 = x2 it would be 50% certain to contain 8. 1.6, Foundations 25 Careful consideration of this example will make the difference between the conditional and frequentist viewpoints clear. The overall performance of 6, in any type of repeated use, would indeed be 75%, but this arises because half the time the actual performance will be 100% and half the time the actual performance will be 50%. And, for any given application, fone knows whether one is in the 100% or 50% case. It clearly would make little sense to conduct an experiment, use 5(x), and actually report 75% as the measure of accuracy, yet the frequentist viewpoint suggests doing so. Here is another standard example. EXAMPLE 13. Suppose X is 1, 2, or 3 and 0 is 0 or 1, with X having the following probability density in each case: 1 1 2 3 fixj0) | 0.005 | 0.005 | 0.99 1 fix|1) | 0.0051 | 0.9849 | The asia most powefl tet of Hy 8=0 versus Hy: @=1, at level tor, concludes Hr when X-=1 of 3 and this test also has 4 Type coor proba of 001. Hence, Standard frequent, upon obsering sold report thatthe decison and thatthe et had cor prataiis of O01 This certainly geste inpresion that one can place Tren deal ofconfidnce inthe conclusion, buts thse as? Condon ‘casing show thatthe ansberksometines no! When Lis osened, the Ietihood rat betven #0 and 81 (OOS)/(OU1) whch very ese (0 one Toa condanalit and to mos ote ais cana), ‘ketnod rate near one mean tat the daa does vr ite distinguish Between? -Oand #1 Hence the candionlwonfdence inthe decison to conchde Hy when x= obsered, would be only about 30%. 
(OT ‘The next example is included for historical reasons, and also because it turns out to be a key example for development of the important Likelihood Principle in the next subsection. This example is a variant of the famous Cox (1958) conditioning example. EXAMPLE 14, Suppose a substance to be analyzed can be sent either to a laboratory in New York or a laboratory in California, The two labs seem equally good, so a fair coin is fipped to choose between them, with “heads” denoting that the lab in New York will be chosen. The coin is flipped and 26 1. Basie Coneepes comes up tails, 0 the California lab is used. After a while, the experimental results come back and a conclusion and report must be developed, Should this conclusion take into account the fact that the coin could have been hheads, and hence that the experiment in New York might have been performed instead? Common sense (and the conditional viewpoint) cries ho, that only the experiment actually performed is relevant, but frequentist reasoning would call for averaging over all possible data, even the possible New York data. The above examples were kept simple (o illustrate the ideas. Many complex and common statistical situations in which conditioning seems very important can be found in Berger and Wolpert (1984) and the references therein. An example is the use of observed, rather than expected, Fisher information (see Subsection 4.7.8). Examples that will be encountered in this book include hypothesis testing (see Subsection 4.3.3), several decision theoretic examples, and the very important example of optional stopping, which will be considered in Section 7.7. (The conditional viewpoint leads to the conclusion that many types of optional stopping of an experiment can be ignored, a conclusion that can have a drastic effect on, for instance, the running of clinical trials.) ‘avage (1962) used the term Initial precision to describe frequentist ‘measures, and used the term final precision to describe conditional measures. Initially, Le., before seeing the data, one can only measure how well 3 is likely to perform through a frequentist measure, but after seeing the data one can give a more precise final measure of performance. (The necessity for using at least partly frequentist measures in designing experiments is apparent.) The examples above make abundantly clear the necessity for consider- ation of conditioning in statistics. The next question, therefore, is—What kind of conditional analysis should be performed? There are a wide variety of candidates, among them Bayesian analysis, fiducial analysis (begun by R.A. Fisher, see Fisher (1935)), various “likelihood methods” (ef. Edwards (1972) and Hinde and Aitkin (1984)), structural inference (begun by D. A. S. Fraser, see Fraser (1968)), pivotal inference (see Barnard (1980)), and even a number of conditional frequentist approaches (see Kiefer (1977a) or Berger (1984, 1984e)), Discussion of these and other conditional approaches (as well as related conditional ideas such as that of a “relevant subset") can be found in Barnett (1982), Berger and Wolpert (1984), and Berger (1984d), along with many references. In this book we will almost exclusively use the Bayesian approach to conditioning, but a few words should be said about the conditional frequentist approaches because they can provide important avenues for generalizing the book's frequentist decision theory to allow for conditioning (cf. Kiefer (1976, 1977a) and Brown (1978)). 
Kiefer (1977a) discussed two types of conditional frequentist approaches, calling them “conditional confidence” and “estimated confidence.” The 1.6, Foundations 7 idea behind conditional confidence is to use frequentist measures, but conditioned on subsets of the sample space. Thus, in Example 12, it would be possible to condition on (x:x) =x} and (xx, X), and then use frequentist reasoning to arrive at the “correct” measures of confidence. And in Example 14, one could condition on the outcome of the coin flip. The estimated confidence approach does not formally involve condition ing, but instead allows the reported confidence to be data dependent. Thus, in Example 12, one could report a confidence of 100% or 50% as x17 Xp or x = xy, respectively. The Frequentist aspect of this estimated confidence approach is that the average reported performance in repeated use will be equal to the actual average performance, thus satisfying the primary frequen- tist motivation, For a rigorous statement of this, and a discussion of the interesting potential that estimated confidence has for the frequentist view- point, see Kiefer (19774) and Berger (1984 and 1984c]. 1.6.4. The Likelihood Principle In attempting to settle the controversies surrounding the choice of a para- digm or methodology for statistical analysis, many statisticians turn to foundational arguments. These arguments generally involve the proposal of axioms or principles that any statistical paradigm should follow, together with a logical deduction from these axioms of particular paradigm or ‘more general principle that should be followed. The most common such foundational arguments are those that develop axioms of “rational behavior” and prove that any analysis which is “rational” must correspond {to some form of Bayesian analysis. (We will have a fair amount to say about these arguments later in the book.) A much simpler, and yet profoundly important, foundational development is that leading to the Likelihood Principle. Indeed the Likelihood Principle, by itself, can go a long way in settling the dispute as to which statistical paradigm is correct. Tt also says a great deal about how one should condition. ‘The Likelihood Principle makes explicit the natural conditional idea that only the actual observed x should be relevant to conclusions or evidence about 0. The key concept in the Likelihood Principle is that of the likelihood function, Definition 11. For observed data, x, the function 1(6) = f(x| 8), considered as a function of 6, is called the likelihood function. ‘The intuitive reason for the name “likelihood function” is that a @ for which f(x] ) is large is more “likely” to be the true @ than a @ for which J(x|4) is small, in that x would be a more plausible occurrence if f(x| 0) ‘were large. 28 1. asic Concepts ‘The Likelihood Principle. In making inferences or decisions about 8 after x is observed, all relevant experimental information is contained in the likelihood function for the observed x, Furthermore, two likelihood functions contain the same information about 8 if they are proportional to each other (as functions of 8). EXAMPLe 15 (Lindley and Phillips (1976)). We are given a coin and are interested in the probability, 8, of having it come up heads when flipped is desired to test Hy: @=! versus H,: >}. An experiment is conducted by flipping the coin (independently) in a series of trials, the result of which is the observation of 9 heads and 3 tails. 
This is not yet enough information to specily fix|4), since the “series of trials” was not explained. Two possibilities are: (1) the experiment consisted of a predetermined 12 flips, so that X=[# heads] would be (12, 6); or (2) the experiment consisted of flipping the coin until 3 tails were observed, so that X would be .\(3, 0). The likelihood functions in eases (1) and (2), respectively, would be Wuo)=sisie=(“)ei-w > =@206% 0-0) and next as wea=siaiay=("" "aay (ssa ~0) “The Likelihood Principle says that, in either case, (0) is all we need to know from the experiment, and, futhermore, that /, and I, would contain the sume information about @ since they are proportional as functions of 8. Thus we didnot really need to know anything about the "series of tials” knowing that independent fips gave 9 heads and 3 tails would, by itself, tell us that the likelihood function would be proportional to 6°(1~ 0)". Classical analyses, in contrast, are quite dependent on knowing f(s] 9), and not just for the observed x. Consider classical significance testing, for instance. For the Binomial model, the significance level of ¥=9 (against, 0=4) would be 4 = Py X 29) = f(9]3) +410] 2) + HCN) + A213) 0.075, For the negative binomial model, the significance level would be = Py lX 9) = 409 |3)+f2(10| 3) = 0.0325, If significance at the 5% level was desired, the two models would thus lead to quite different conclusions, in contradiction to the Likelihood Principle. 16, Foundations 29 Several important points, illustrated in the above example, should be emphasized. First the correspondence of information from proportional likelihood functions applies only when the two likelihood functions are for the same parameter. (In the example, @ is the probability of heads for the given coin on a single flip, and is thus defined independently of which experiment is performed. If J, had applied to one coin, and /; to a different coin, the Likelihood Principle would have had nothing to say.) A second point is that the Likelihood Principle does nor say that all information about @ is contained in (8), just that all experimental informa tion is. There may well be other information relevant to the statistical analysis, Such as prior information or considerations of loss. The example also reemphasizes the difference between a conditional perspective and a frequentist type of perspective. The significance level calculations involve not just the observed x =9, but also the “more extreme” x=10, Again it seems somewhat peculiar to involve, in the evaluation, observations that have not occurred. No one has phrased this better than Jelireys (1961): a hypothesis which may be true may be rejected because it has not predicted observable results which have not occurred.” Thus, in Example 15, the null hypothesis that 9=5 certainly would not predict that X would be larger than 9, and indeed such values do not occur. Yet the probabilities of these unpredicted and not occurring observations are included in the classical evidence against the hypothesis, Here is another interesting example (from Berger and Wolpert (1984). ExaMpLe 16. 
Let 2={1,2,3) and @= {0,1}, and consider experiments E, and E, which consist of observing X, and X,, respectively, both having sample space # and the same unknown #, The probability densities of X, and X, are (for #=0 and @ = 1) fz ; CT Aeslo) | 090 | aos | 0.05 | fixjo) | 026 | 073 | oor Acalt) | a09 | 0.085 | ass | f(x!1) | 0.026 | 0803 | 0.171 If, now, x)=1 is observed, the Likelihood Principle states that the information about @ should depend on the experiment only through (f,(1|0), f,(1}1))=(0.90, 0.09). Furthermore, since this is proportional to (0.26, 0.026) = (f2(1|0), fa(1|1)), the Likelihood Principle states that x= 1 would provide the same information about # as x,=1. Another way of stating the Likelihood Principle for testing simple hypotheses, as here, is that the experimental evidence about 0 is contained in the likelihood ratio 30 1. Basie Concepts for the observed x, Note that the likelihood ratios for the two experiments, are also the same when 2 is observed, and also when 3 is observed. Hence, no matter which experiment is performed, the same conclusion about 6 should be reached for the given observation, This example clearly indicates the startling nature of the Likelihood Principle. Experiments E, and E: are very different from a frequentist perspective. For instance, the test which accepts @ = 0 when the observation is 1 and decides ¢=1 otherwise is a most powerful test with error prob- abilities (of Type I and Type II, respectively) 0.10 and 0.09 for Ey, and 0.74 and 0.026 for E,, Thus the classical frequentist would report drastically different information from the two experiments, ‘The above example emphasizes the important distinction between initial precision and final precision. Experiment E, is much more likely to provide useful information about 8, as evidenced by the overall better error prob- abilities (which are measures of initial precision). Once x is at hand, however, this initial precision is no longer relevant, and the Likelihood Principle states that whether x came from E, or Eis irrelevant. This example also provides a good testing ground for the various conditional ‘methodologies that were mentioned in Subsection 1.6.3. For instance, either of the conditional frequentist approaches has a very hard time in dealing with the example. So far we have not given any reasons why one should believe in the Likelihood Principle. Examples 15 and 16 are suggestive, but could perhaps be viewed as refutations of the Likelihood Principle by die-hard classicists. Before giving the axiomatic justification that exists for the Likelihood Principle, we indulge in one more example in which it would be very hard to argue against the Likelihood Principle. ExaMpLe 17 (Pratt (1962}).""An engineer draws random sample of electron tubes and measures the plate voltages under certain conditions with a very accurate voltmeter, accurate enough so that measurement error is negligible compared with the variability of the tubes. A statistician examines the measurements, which look normally distributed and vary from 75 to 99 volts with a mean of 87 and a standard deviation of 4, He makes the ordinary normal analysis, giving a confidence interval for the true mean. Later he visits the engineer's laboratory, and notices that the voltmeter used reads only as far as 100, so the population appears to be ‘censored’. This necessi- tates a new analysis, if the statistician is orthodox. 
However, the engineer says he has another meter, equally accurate and reading to 1000 volts, which hhe would have used if any voltage had been over 100. This is a relief to the orthodox statistician, because it means the population was effectively uncensored after all. But the next day the engineer telephones and says, “I just discovered my high-range voltmeter was not working the day | did the experiment you analyzed for me.’ The statistician ascertains that the engineer 1.6. Foundations 31 would not have held up the experiment until the meter was fixed, and informs him that a new analysis will be required. The engineer is astounded. He says, ‘But the experiment turned out just the same as if the high-range meter had been working. I obtained the precise voltages of my sample anyway, so I learned exactly what I would have learned if the high-range meter had been available. Next you'll be asking about my oscilloscope.” In this example, two different sample spaces are being discussed. If the high-range voltmeter had been working, the sample space would have effectively been that of a usual normal distribution, Since the high-range voltmeter was broken, however, the sample space was truncated at 100, and the probability distribution of the observations would have a point mass at 100. Classical analyses (such as the obtaining of confidence intervals) would be considerably alfected by this difference. The Likelihood Principle, on the other hand, states that this difference should have no effect on the analysis, since values of x which did not occur (here x= 100) have no bearing on infe or decisions concerning the true mean. (A formal verification is left for the exercises.) Rationales for at least some forms of the Likelihood Principle exist in carly works of R. A. Fisher (cf. Fisher (1959)) and especially of G. A. Barnard (cf. Barnard (1949). By far the most persuasive argument for the Likelihood Principle, however, was given in Birnbaum (1962). (It should bbe mentioned that none of these three pioneers were unequivocal supporters of the Likelihood Principle. See Basu (1975) and Berger and Wolpert (1984) for reasons, and also a more extensive historical discussion and other references. Also, the history of the concept of “likelihood” is reviewed in Edwards (1974).) The argument of Birnbaum for the Likelihood Principle was a proof of its equivalence with two other almost universally accepted natural principles. The first of these natural principles is the sufficiency principle (see Section 1.7) which, for one reason or another, almost everyone accepts. The second natural principle is the (weak) conditionality principle, which is nothing but a formalization of Example 14. (Basu (1975) explicitly named the “weak” version.) ‘The Weak Conditionality Principle. Suppose one can perform either of two experiments E, or E>, both pertaining to 0, and that the actual experiment conducted is the mixed experiment of first choosing J = 1 or 2 with probability each (independent of 6), and then performing experiment E, Then the actual information about @ obtained from the overall mixed experiment should depend only on the experiment E, that is actually performed. For a proof that sufficiency together with weak conditionality imply the Likelihood Principle in the case of discrete , see Birnbaum (1962) of Berger and Wolpert (1984): the latter work also gives a similar development 2 1. 
The argument poses a serious challenge to all who are unwilling to believe the Likelihood Principle; the only alternatives are to reject the sufficiency principle (which would itself cause havoc in classical statistics) or to reject the weak conditionality principle; yet what could be more obvious?

There have been a number of criticisms of Birnbaum's axiomatic development, including concerns about the existence of the likelihood function (i.e., of f(x|θ)), and even of the existence of "information from an experiment about θ." Also, some of the consequences of the Likelihood Principle are so startling (such as the fact that the Likelihood Principle implies that optional stopping of an experiment should usually be irrelevant to conclusions; see Section 7.7) that many statisticians simply refuse to consider the issue. Basu (1975), Berger and Wolpert (1984), and Berger (1984d) present (and answer) essentially all of the criticisms that have been raised, and also extensively discuss the important consequences of the Likelihood Principle (and the intuitive plausibility of these consequences).

It should be pointed out that the Likelihood Principle does have several inherent limitations. One has already been mentioned, namely that, in designing an experiment, it is obviously crucial to take into account all x that can occur; frequentist measures (though perhaps Bayesian frequentist measures) must then be considered. The situation in sequential analysis is similar, in that, at a given stage, one must decide whether or not to take another observation. This is essentially a design-type problem and, in making such a decision, it may be necessary to know more than the likelihood function for θ from the data observed up until that time. (See Section 7.7 for further discussion.) A third related problem is that of prediction of future observables, in which one wants to predict a future value of X. Again, there may be information in the data beyond that in the likelihood function for θ. Actually, the Likelihood Principle will apply in all these situations if θ is understood to consist of all unknowns relevant to the problem, including the future random X, and not just of unknown model parameters. See Berger and Wolpert (1984) for discussion.

The final, yet most glaring, limitation of the Likelihood Principle is that it does not indicate how the likelihood function is to be used in making decisions or inferences about θ. One proposal has been to simply report the entire likelihood function, and to educate people in its interpretation. This is perhaps reasonable, but is by no means the complete solution. First of all, it is frequently also necessary to consider the prior information and loss, and the interaction of these quantities with the likelihood function. Secondly, it is not at all clear that the likelihood function, by itself, has any particular meaning. It is natural to attempt to interpret the likelihood function as some kind of probability density for θ. The ambiguity arises in the need to then specify the "measure" with respect to which it is a density. There are often many plausible choices for this measure, and the choice can have a considerable effect on the conclusion reached. This problem is basically that of choosing a "noninformative" prior distribution, and will be discussed in Chapter 3.
Of the methods that have been proposed for using the likelihood function to draw conclusions about θ (see Berger and Wolpert (1984) for references), only the Bayesian approach seems generally appropriate. This will be indicated in the next section, and in Chapter 4. (More extensive such arguments can be found in Basu (1975) and Berger and Wolpert (1984).) It will also be argued, however, that a good Bayesian analysis may sometimes require slight violation of the Likelihood Principle, in attempting to protect against the uncertainties in the specification of the prior distribution. The conclusion that will be reached is that analysis compatible with the Likelihood Principle is an ideal towards which we should strive, but an ideal which is not always completely attainable.

In the remainder of the book, the Likelihood Principle will rarely be used to actually do anything (although conditional Bayes implementation of it will be extensively considered). The purpose in having such a lengthy discussion of the principle was to encourage the "post-experimental" way of thinking. Classical statistics teaches one to think in terms of "pre-experimental" measures of initial precision. The Likelihood Principle states that this is an error; that one should reason only in terms of the actual sample and likelihood function obtained. Approaching a statistical analysis with this viewpoint in mind is a radical departure from traditional statistical reasoning. And note that, while the Likelihood Principle is the "stick" urging adoption of the conditional approach, there is also the "carrot" that the conditional approach often yields great simplification in the statistical analysis: it is usually much easier to work with just the observed likelihood function, rather than having to involve f(x|θ) for all x, as a frequentist must (see also Sections 4.1 and 7.7).

1.6.5. Choosing a Paradigm or Decision Principle

So far we have discussed two broad paradigms, the conditional and the frequentist, and, within each, a number of possible principles or methodologies that could be followed. As these various paradigms and decision principles are discussed throughout the book, considerable effort will be spent in indicating when the methods seem to work and, more importantly, when they do not. The impression that may emerge from the presentation is that statistics is a collection of useful methodologies, and that one should "keep an open mind as to which method to use in a given application." This is indeed the most common attitude among statisticians. While we endorse this attitude in a certain practical sense (to be made clearer shortly), we do not endorse it fundamentally. The basic issue is: how can we know that we have a sensible statistical analysis? For example, how can we be certain that a particular frequentist analysis has not run afoul of a conditioning problem? It is important to determine what fundamentally constitutes a sound statistical analysis, so that we then have a method of judging the practical soundness and usefulness of the various methodologies.

We have argued that this desired fundamental analysis must be compatible with the Likelihood Principle. Furthermore, we will argue in Chapter 4 that it is conditional Bayesian analysis that is the only fundamentally correct conditional analysis.
From a practical viewpoint, however, things are not so clearcut, since the Bayesian approach requires specification of a prior distribution π for θ, and this can never be done with complete assurance (see Section 4.7). Hence we will modify our position (in Section 4.7) and argue that the fundamentally correct paradigm is the "robust Bayesian" paradigm, which takes into account uncertainty in the prior. Unfortunately, robust Bayesian analysis turns out to be quite difficult; indeed for many problems it is technically almost impossible. We thus run into the need for what Good (1983) calls "Type II rationality": when time and other realistic constraints in performing a statistical analysis are taken into account, the optimal analysis may be an analysis which is not rigorously justifiable (from, say, the robust Bayesian viewpoint). The employment of any alternative methodology should, however, be justified from this perspective, the justification being that one is in this way most likely to be "close to" the philosophically correct analysis.

With the above reasoning, we will be able to justify a number of uses of frequentist measures such as R(θ, δ). Also, recall that partially frequentist reasoning is unavoidable in many statistical domains, such as design of experiments and sequential analysis. A final justification for consideration of R(θ, δ) is that, whether we like it or not, the bulk of statistical analyses that will be performed will use prepackaged procedures. Although the primary concern should be to see that such procedures are developed so as to be conditionally sound, the fact that they will see repeated use suggests that verification of acceptable long-run performance would only be prudent. In spite of all these reasons, we would strongly argue that conditional (Bayesian) reasoning should be the primary weapon in a statistician's arsenal.

It should be noted that we did not attempt to justify use of frequentist measures on certain "traditional" grounds such as the desire for "objectivity" or avoidance of use of subjective inputs (such as prior information). Objectivity is clearly very difficult in decision theory, since one cannot avoid subjective choice of a loss function. Even more to the point, strong arguments can be made that one can never do truly objective (sensible) statistical analyses; analyses that have the appearance of objectivity virtually always contain hidden, and often quite extreme, subjective assumptions. (For instance, the choice of a model is usually a very sharp subjective input.) Some indications of this will be seen throughout the book, although for more thorough discussions (of this and the other foundational issues), see Jeffreys (1961), Zellner (1971), Box and Tiao (1973), Good (1983), Jaynes (1983), Berger (1984a), and Berger and Wolpert (1984) (all of which also have other references).

With the exception of Chapters 3 and 4, the book will tend to emphasize methodologies based on R(θ, δ). The reason is mainly historical: the bulk of existing statistical decision-theoretic methodology is frequentist in nature. We will often pause, however, to view things from the conditional perspective.

1.7. Sufficient Statistics

The concept of a sufficient statistic (due to Fisher (1920, 1922)) is of great importance in simplifying statistical problems.
Intuitively, a sufficient statistic is a function of the data which summarizes all the available sample information concerning θ. For example, if an independent sample X₁, ..., Xₙ from a 𝒩(μ, σ²) distribution is to be taken, it is well known that T = (X̄, S²) is a sufficient statistic for θ = (μ, σ²). (Here X̄ is the sample mean and S² = Σᵢ(Xᵢ − X̄)²/(n − 1).) It is assumed that the reader is familiar with the concept of sufficiency and with the methods of finding sufficient statistics. We will content ourselves here with a rather brief discussion of sufficiency, including a presentation of the major decision-theoretic result concerning sufficiency. (For an in-depth examination of sufficiency, see Huzurbazar (1976).) The following formal definition of sufficiency uses the concept of a conditional distribution, with which the reader is also assumed familiar.

Definition 12. Let X be a random variable whose distribution depends on the unknown parameter θ, but is otherwise known. A function T of X is said to be a sufficient statistic for θ if the conditional distribution of X, given T(X) = t, is independent of θ (with probability one).

For understanding the nature of a sufficient statistic and for the development of the decision-theoretic result concerning sufficiency, the concept of a partition of the sample space must be introduced.

Definition 13. If T(X) is a statistic with range 𝒯 (i.e., 𝒯 = {T(x): x ∈ 𝒳}), the partition of 𝒳 induced by T is the collection of all sets of the form 𝒳_t = {x ∈ 𝒳: T(x) = t}, for t ∈ 𝒯.

Note that if t₁ ≠ t₂, then 𝒳_{t₁} ∩ 𝒳_{t₂} = ∅, and also observe that ⋃_{t∈𝒯} 𝒳_t = 𝒳. Thus 𝒳 is divided up (or partitioned) into the disjoint sets 𝒳_t.

Definition 14. A sufficient partition of 𝒳 is a partition induced by a sufficient statistic T.

Consider now the formal definition of sufficiency given in Definition 12. The conditional distribution of X, given T(X) = t, is clearly a distribution giving probability one to the set 𝒳_t. Indeed the distribution can usually be represented by a density, to be denoted f_t(x), on 𝒳_t. The density does not depend upon θ, since by Definition 12 the conditional distribution is independent of θ. This implies, in particular, that the densities f_t(x) are known, being explicitly calculable from f(x|θ).

The intuitive reason that a sufficient statistic is said to contain all the sample information concerning θ can be seen from the above considerations. Basically, the random variable X can be thought of as arising first from the random generation of T, followed by the random choice of x from 𝒳_t (t being the observed value of T) according to the density f_t(x). This second stage involves a randomization not involving θ, and so, from a number of intuitive viewpoints, carries no information about θ.

In developing the decision-theoretic result concerning sufficiency, the concept of a conditional expectation will be needed. The conditional expectation of a function h(x), given T = t, will be denoted E^{X|t}[h(X)], and, providing the conditional density f_t exists, is given by

\[
E^{X|t}[h(X)] =
\begin{cases}
\displaystyle\int_{\mathcal{X}_t} h(x) f_t(x)\,dx & \text{(continuous case)},\\[6pt]
\displaystyle\sum_{x \in \mathcal{X}_t} h(x) f_t(x) & \text{(discrete case)}.
\end{cases}
\]

We will also need the standard probabilistic result that

\[
E^{X}[h(X)] = E^{T} E^{X|T}[h(X)].
\]

Finally, for any statistic T(X), we will define randomized decision rules δ*(t, ·), based on T, to be the usual randomized decision rules with 𝒯 being considered the sample space. The risk function of such a rule is clearly

\[
R(\theta, \delta^*) = E^{T}\bigl[L(\theta, \delta^*(T, \cdot))\bigr].
\]
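Before stating the main decision-theoretic result, Definition 12 can be made concrete with a small numerical check (an illustration added here, not part of the original text): for a Bernoulli sample, the conditional distribution of the data given the sufficient statistic T = ΣXᵢ is uniform over the compatible arrangements, whatever the value of θ.

```python
# Illustration of Definition 12 (an added sketch): for X_1, ..., X_n i.i.d.
# Bernoulli(theta), T = sum(X_i) is sufficient, and the conditional distribution
# of the full sample given T = t is uniform over the C(n, t) arrangements,
# no matter what theta is.
from itertools import product
from math import comb

def conditional_probs(n, t, theta):
    """P(X = x | T = t) for every binary sequence x of length n with sum t."""
    joint = {x: theta**t * (1 - theta)**(n - t)          # P(X = x), same for all such x
             for x in product([0, 1], repeat=n) if sum(x) == t}
    p_t = comb(n, t) * theta**t * (1 - theta)**(n - t)   # P(T = t)
    return {x: p / p_t for x, p in joint.items()}

print(conditional_probs(4, 2, 0.3))  # every compatible sequence has probability 1/6
print(conditional_probs(4, 2, 0.8))  # identical output: the conditional law is free of theta
```

Changing θ changes the distribution of T, but not the second-stage randomization within 𝒳_t, which is exactly the two-stage description given above.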
Theorem 1. Assume that T is a sufficient statistic for θ, and let δ₀(x, ·) be any randomized rule in 𝒟*. Then (subject to measurability conditions) there exists a randomized rule δ₁(t, ·), depending only on T(x), which is R-equivalent to δ₀.

Proof. For A ⊂ 𝒜 and t ∈ 𝒯, define

\[
\delta_1(t, A) = E^{X|t}\bigl[\delta_0(X, A)\bigr].
\]

Thus δ₁(t, ·) is formed by averaging δ₀ over 𝒳_t, with respect to the conditional distribution of X given T = t. It is easy to check that, for each t, δ₁(t, ·) is a probability distribution on 𝒜. Assuming it is also appropriately measurable, it follows that δ₁ is a randomized decision rule based on T(x). Note that the sufficiency of T is needed to ensure that δ₁ does not depend on θ.

Observe next that

\[
L(\theta, \delta_1(t, \cdot)) = E^{\delta_1(t,\cdot)}[L(\theta, a)] = E^{X|t} E^{\delta_0(X,\cdot)}[L(\theta, a)].
\]

It follows that

\[
R(\theta, \delta_1) = E^{T}\bigl[L(\theta, \delta_1(T, \cdot))\bigr]
= E^{T} E^{X|T} E^{\delta_0(X,\cdot)}[L(\theta, a)]
= E^{X} E^{\delta_0(X,\cdot)}[L(\theta, a)]
= E^{X}\bigl[L(\theta, \delta_0(X, \cdot))\bigr] = R(\theta, \delta_0). \qquad \square
\]

The above theorem applies also to a nonrandomized rule δ₀, through the identification of δ₀ and ⟨δ₀⟩ discussed in Section 1.4. Note, however, that even though δ₀ is nonrandomized, the equivalent δ₁ may be randomized. Indeed it is clear that

\[
\delta_1(t, A) = E^{X|t}\bigl[\langle \delta_0 \rangle(X, A)\bigr]
= E^{X|t}\bigl[I_A(\delta_0(X))\bigr] = P^{X|t}\bigl(\delta_0(X) \in A\bigr).
\]

When evaluating decision rules through risk functions, Theorem 1 implies that it is only necessary to consider rules based on a sufficient statistic. If a rule is not a function of the sufficient statistic, another rule can be found that is a function of the sufficient statistic and has the same risk function. It will, in fact, often be the case that a decision rule which is not solely a function of the sufficient statistic will be inadmissible. (One such situation is discussed in the next section.) This is an important point because sufficiency is not a universally accepted principle (although it is one of the few points that classical statisticians and Bayesians agree upon). Inadmissibility is a serious criticism, however (even to conditional Bayesians, see Section 4.8), so that violation of sufficiency is hard to justify. (See Berger (1984d) for further discussion and references.)

It should come as no surprise that the Likelihood Principle immediately implies that a sufficient statistic contains all the sample information about θ, because sufficiency was a major component of Birnbaum's derivation of the Likelihood Principle. For completeness, however, note that (under mild conditions) the factorization theorem (cf. Lehmann (1959)) for a sufficient statistic shows that the likelihood function can be written as

\[
l(\theta) = f(x \mid \theta) = h(x)\, g(T(x) \mid \theta),
\]

where h does not depend on θ. Hence the likelihood function is proportional to g(T(x)|θ), and the Likelihood Principle implies that all decisions and inferences concerning θ can be made through T.
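As a concrete check on the averaging construction in the proof of Theorem 1, the following sketch (an added illustration; the Bernoulli model, the squared-error loss, and the deliberately wasteful rule δ₀(x) = x₁ are choices made here, not taken from the text) forms δ₁ by conditioning on T = ΣXᵢ and verifies that the two risk functions coincide.

```python
# Numerical check of Theorem 1 in a simple case (illustrative choices, not from the text):
# X_1, ..., X_n i.i.d. Bernoulli(theta), sufficient statistic T = sum(X_i),
# squared-error loss, and the wasteful nonrandomized rule delta0(x) = x_1.
# Theorem 1's construction gives the randomized rule delta1(t, .) that picks
# action 1 with probability P(X_1 = 1 | T = t) = t/n and action 0 otherwise.
from math import comb

def risk_delta0(theta):
    # R(theta, delta0) = E[(X_1 - theta)^2] = Var(X_1) = theta(1 - theta)
    return theta * (1 - theta)

def risk_delta1(theta, n):
    # R(theta, delta1) = sum_t P(T = t) [ (t/n)(1 - theta)^2 + (1 - t/n) theta^2 ]
    risk = 0.0
    for t in range(n + 1):
        p_t = comb(n, t) * theta**t * (1 - theta)**(n - t)
        risk += p_t * ((t / n) * (1 - theta)**2 + (1 - t / n) * theta**2)
    return risk

for theta in (0.2, 0.5, 0.9):
    print(theta, risk_delta0(theta), risk_delta1(theta, n=4))  # identical risks
```

Here δ₁ is genuinely randomized even though δ₀ is not: given T = t it selects action 1 with probability t/n, exactly as the remark following the proof anticipates.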
