BAYESIAN
THEORY
BAYESIAN
THEORY
José M. Bernardo
Professor of Statistics, Universidad de Valencia, Spain
Adrian F. M. Smith
Professor of Statistics, Imperial College of Science, Technology and Medicine, London, UK
JOHN WILEY & SONS, LTD
Chichester · New York · Weinheim · Brisbane · Singapore · Toronto
Copyright © 2000 by John Wiley & Sons, Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England

National 01243 779771
International (+44) 1243 779777

email (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com

First published in hardback 1994 (ISBN 0 471 92416 4)
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court Road, London, UK W1P 9HE, without the permission in writing of the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the publication.
Neither the authors nor John Wiley & Sons Ltd accept any responsibility or liability for loss or damage occasioned to any person or property through using the material, instructions, methods or ideas contained herein, or acting or refraining from acting as a result of such use. The authors and Publisher expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on the authors or Publisher to correct any errors or defects in the software. Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA

Weinheim · Brisbane · Singapore · Toronto
Library of Congress Cataloging-in-Publication Data
Bernardo, José M.
  Bayesian theory / José M. Bernardo, Adrian F. M. Smith.
    p. cm. (Wiley series in probability and mathematical statistics)
  Includes bibliographical references and indexes.
  ISBN 0 471 92416 4
  1. Bayesian statistical decision theory. I. Smith, Adrian F. M. II. Title. III. Series.
  QA279.5.B47 1993
  519.5'42 dc20    93-31554 CIP
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library ISBN 0 471 49464 X
TO
MARINA and DANIEL
Preface
This volume, first published in hardback in 1994, presents an overview of the foundations and key theoretical concepts of Bayesian Statistics. Our original intention had been to produce further volumes on computation and methods. However, these projects have been shelved as tailored Markov chain Monte Carlo methods have emerged and are being refined as the standard Bayesian computational tools. We have taken the opportunity provided by this reissue to make a number of typographical corrections.

The original motivation for this enterprise stemmed from the impact and influence of de Finetti's two-volume Theory of Probability, which one of us helped translate into English from the Italian in the early 1970's. This was widely acknowledged as the definitive exposition of the operationalist, subjectivist approach to uncertainty, and provided further impetus at that time to a growth in activity and interest in Bayesian ideas. From a philosophical, foundational perspective, the de Finetti volumes provide, in the words of the author's dedication to his friend Segre, "a necessary document for clarifying one point of view in its entirety."
From a statistical, methodological perspective, however, the de Finetti volumes end abruptly, with just the barest introduction to the mechanics of Bayesian inference. Some years ago, we decided to try to write a series of books which would take up the story where de Finetti left off, with the grandiose objective of “clarifying in its entirety” the world of Bayesian statistical theory and practice.
It is now clear that this was a hopeless undertaking. The world of Bayesian Statistics has been changing shape and growing in size rapidly and unpredictably, most notably in relation to developments in computational methods and the subsequent opening up of new application horizons. We are greatly relieved that we were too incompetent to finish our books a few years ago! And, of course, these changes and developments continue. There is no static world of Bayesian Statistics to describe in a once-and-for-all way. Moreover, we are dealing with a field of activity where, even among those whose intellectual perspectives fall within the broad paradigm, there are considerable differences of view at the level of detail and nuance of interpretation.

This volume on Bayesian Theory attempts to provide a fairly complete and up-to-date overview of what we regard as the key concepts, results and issues. However, it necessarily reflects the prejudices and interests of its authors, as well as the temporal constraints imposed by a publisher whose patience has been sorely tested for far too long. We can but hope that our sins of commission and omission are not too grievous.

Too many colleagues have taught us too many things for it to be practical to list everyone to whom we are beholden. However, Dennis Lindley has played a special role, not least in supervising us as Ph.D. students, and we should like to record our deep gratitude to him. We also shared many enterprises with Morrie DeGroot and continue to miss his warmth and intellectual stimulation. For detailed comments on earlier versions of material in this volume, we are indebted to our colleagues M. J. Bayarri, J. O. Berger, J. de la Horra, P. Diaconis, F. J. Girón, M. A. Gómez-Villegas, D. V. Lindley, M. Mendoza, J. Muñoz, E. Moreno, L. R. Pericchi, A. van der Linde, C. Villegas and M. West. We are also grateful, in more ways than one, to the State of Valencia.
It has provided a beautiful and congenial setting for much of the writing of this book. And, in the person of the Governor, Joan Lerma, it has been wonderfully supportive of the celebrated series of Valencia International Meetings on Bayesian Statistics. During the secondment of one of us as scientific advisor to the Governor, it also provided resources to enable the writing of this book to continue.

This volume has been produced directly in TEX and we are grateful to Maria Dolores Tortajada for all her efforts. Finally, we thank past and present editors at John Wiley & Sons for their support of this project: Jamie Cameron for saying "Go!" and Helen Ramsey for saying "Stop!"
Valencia, Spain
January 26, 2000
J. M. Bernardo A. F. M. Smith
Contents
1. INTRODUCTION 1
1.1. Thomas Bayes 1
1.2. The subjectivist view of probability 2
1.3. Bayesian Statistics in perspective 3
1.4. An overview of Bayesian Theory 5
     1.4.1. Scope 5
     1.4.2. Foundations 5
     1.4.3. Generalisations 6
     1.4.4. Modelling 7
     1.4.5. Inference 7
     1.4.6. Remodelling 8
     1.4.7. Basic formulae 8
     1.4.8. Non-Bayesian theories 9
1.5. A Bayesian reading list 9

2. FOUNDATIONS 13
2.1. Beliefs and actions 13
2.2. Decision problems 16
     2.2.1. Basic elements 16
     2.2.2. Formal representation 18
2.3. Coherence and quantification 23
     2.3.1. Events, options and preferences 23
     2.3.2. Coherent preferences 23
     2.3.3. Quantification 28
2.4. Beliefs and probabilities 33
     2.4.1. Representation of beliefs 33
     2.4.2. Revision of beliefs and Bayes' theorem 38
     2.4.3. Conditional independence 45
     2.4.4. Sequential revision of beliefs 47
2.5. Actions and utilities 49
     2.5.1. Bounded sets of consequences 49
     2.5.2. Bounded decision problems 50
     2.5.3. General decision problems 54
2.6. Sequential decision problems 56
     2.6.1. Complex decision problems 56
     2.6.2. Backward induction 59
     2.6.3. Design of experiments 63
2.7. Inference and information 67
     2.7.1. Reporting beliefs as a decision problem 67
     2.7.2. The utility of a probability distribution 69
     2.7.3. Approximation and discrepancy 75
     2.7.4. Information 77
2.8. Discussion and further references 81
     2.8.1. Operational definitions 81
     2.8.2. Quantitative coherence theories 83
     2.8.3. Related theories 85
     2.8.4. Critical issues 92

3. GENERALISATIONS 105
3.1. Generalised representation of beliefs 105
     3.1.1. Motivation 105
     3.1.2. Countable additivity 106
3.2. Review of probability theory 109
     3.2.1. Random quantities and distributions 109
     3.2.2. Some particular univariate distributions 114
     3.2.3. Convergence and limit theorems 125
     3.2.4. Random vectors, Bayes' theorem 127
     3.2.5. Some particular multivariate distributions 133
3.3. Generalised options and utilities 141
     3.3.1. Motivation and preliminaries 141
     3.3.2. Generalised preferences 145
     3.3.3. The value of information 147
3.4. Generalised information measures 150
     3.4.1. The general problem of reporting beliefs 150
     3.4.2. The utility of a general probability distribution 151
     3.4.3. Generalised approximation and discrepancy 154
     3.4.4. Generalised information 157
3.5. Discussion and further references 160
     3.5.1. The role of mathematics 160
     3.5.2. Critical issues 161

4. MODELLING 165
4.1. Statistical models 165
     4.1.1. Beliefs and models 165
4.2. Exchangeability and related concepts 167
     4.2.1. Dependence and independence 167
     4.2.2. Exchangeability and partial exchangeability 168
4.3. Models via exchangeability 172
     4.3.1. The general model 172
     4.3.2. The Bernoulli and binomial models 176
     4.3.3. The multinomial model 177
4.4. Models via invariance 181
     4.4.1. The normal model 181
     4.4.2. The multivariate normal model 185
     4.4.3. The exponential model 187
     4.4.4. The geometric model 189
4.5. Models via sufficient statistics 190
     4.5.1. Summary statistics 190
     4.5.2. Predictive sufficiency and parametric sufficiency 191
     4.5.3. Sufficiency and the exponential family 197
     4.5.4. Information measures and the exponential family 207
4.6. Models via partial exchangeability 209
     4.6.1. Models for extended data structures 209
     4.6.2. Several samples 211
     4.6.3. Structured layouts 217
     4.6.4. Covariates 219
     4.6.5. Hierarchical models 222
4.7. Pragmatic aspects 226
     4.7.1. Finite and infinite exchangeability 226
     4.7.2. Parametric and nonparametric models 228
     4.7.3. Model elaboration 229
     4.7.4. Model simplification 233
     4.7.5. Prior distributions 234
4.8. Discussion and further references 235
     4.8.1. Representation theorems 235
     4.8.2. Subjectivity and objectivity 236
     4.8.3. Critical issues 237

5. INFERENCE 241
5.1. The Bayesian paradigm 241
     5.1.1. Observables, beliefs and models 242
     5.1.2. The role of Bayes' theorem 243
     5.1.3. Predictive and parametric inference 247
     5.1.4. Sufficiency, ancillarity and stopping rules
     5.1.5. Decisions and inference summaries 255
     5.1.6. Implementation issues 263
5.2. Conjugate analysis 265
     5.2.1. Conjugate families 265
     5.2.2. Canonical conjugate analysis 269
     5.2.3. Approximations with conjugate families 279
5.3. Asymptotic analysis 285
     5.3.1. Discrete asymptotics 286
     5.3.2. Continuous asymptotics 287
     5.3.3. Asymptotics under transformations 295
5.4. Reference analysis 298
     5.4.1. Reference decisions 299
     5.4.2. One-dimensional reference distributions 302
     5.4.3. Restricted reference distributions 316
     5.4.4. Nuisance parameters 320
     5.4.5. Multiparameter problems 333
5.5. Numerical approximations 339
     5.5.1. Laplace approximation 340
     5.5.2. Iterative quadrature 346
     5.5.3. Importance sampling 348
     5.5.4. Sampling-importance-resampling 350
     5.5.5. Markov chain Monte Carlo 353
5.6. Discussion and further references 356
     5.6.1. An historical footnote 356
     5.6.2. Prior ignorance 357
     5.6.3. Robustness 367
     5.6.4. Hierarchical and empirical Bayes 371
     5.6.5. Further methodological developments 373
     5.6.6. Critical issues 374

6. REMODELLING 377
6.1. Model comparison 377
     6.1.1. Ranges of models 377
     6.1.2. Perspectives on model comparison 383
     6.1.3. Model comparison as a decision problem 386
     6.1.4. Zero-one utilities and Bayes factors 389
     6.1.5. General utilities 395
     6.1.6. Approximation by cross-validation 403
     6.1.7. Covariate selection 407
6.2. Model rejection 409
     6.2.1. Model rejection through model comparison 409
     6.2.2. Discrepancy measures for model rejection 412
     6.2.3. Zero-one discrepancies 413
     6.2.4. General discrepancies 415
6.3. Discussion and further references 417
     6.3.1. Overview 417
     6.3.2. Modelling and remodelling 418
     6.3.3. Critical issues 418

A. SUMMARY OF BASIC FORMULAE 427
A.1. Probability distributions 427
A.2. Inferential processes 436

B. NON-BAYESIAN THEORIES 443
B.1. Overview 443
B.2. Alternative approaches 445
     B.2.1. Classical decision theory 445
     B.2.2. Frequentist procedures 449
     B.2.3. Likelihood inference 454
     B.2.4. Fiducial and related theories 456
B.3. Stylised inference problems 460
     B.3.1. Point estimation 460
     B.3.2. Interval estimation 465
     B.3.3. Hypothesis testing 469
     B.3.4. Significance testing 475
B.4. Comparative issues 478
     B.4.1. Conditional and unconditional inference 478
     B.4.2. Nuisance parameters and marginalisation 479
     B.4.3. Approaches to prediction 482
     B.4.4. Aspects of asymptotics 485
     B.4.5. Model choice criteria 486

REFERENCES 489
SUBJECT INDEX 555
AUTHOR INDEX 57…
Chapter 1

Introduction

Summary

A brief historical introduction to Bayes' theorem and its author is given, as a prelude to a statement of the perspective adopted in this volume regarding Bayesian Statistics. An overview is provided of the material to be covered in successive chapters and appendices, and a Bayesian reading list is provided.

1.1 THOMAS BAYES

According to contemporary journal death notices and the inscription on his tomb in Bunhill Fields cemetery in London, Thomas Bayes died on 7th April, 1761, at the age of 59. Definitive records of Bayes' birth do not seem to exist, but, allowing for the calendar reform of 1752 and accepting that he died at the age of 59, it seems likely that he was born in 1701 (an argument attributed to Bellhouse in the Inst. Math. Statist. Bull.). The inscription on top of the tomb reads:

Rev. Thomas Bayes. Son of the said Joshua and Ann Bayes (59). 7 April 1761. In recognition of Thomas Bayes's important work in probability. The vault was restored in 1969 with contributions received from statisticians throughout the world.

Some background on the life and the work of Bayes may be found in Barnard (1958), Holland (1962), Pearson (1978), Stigler (1986a), Dale (1990, 1991) and Earman (1990). That his name lives on in the characterisation of a modern statistical methodology is a consequence of the publication of An essay towards solving a problem in the doctrine of chances, attributed to Bayes and communicated to the Royal Society after Bayes' death by Richard Price in 1763 (Phil. Trans. Roy. Soc., 53, 370-418). The technical result at the heart of the essay is what we now know as Bayes' theorem. In its simplest form, if H denotes an hypothesis and D denotes data, the theorem states that

P(H | D) = P(D | H) x P(H) / P(D).

With P(H) regarded as a probabilistic statement of belief about H before obtaining data D, the left-hand side P(H | D) becomes a probabilistic statement of belief about H after obtaining D. Having specified P(D | H) and P(D), the mechanism of the theorem provides a solution to the problem of how to learn from data. Actually, Bayes only stated his result for a uniform prior. According to Stigler (1986b), it was Laplace (1774/1986), apparently unaware of Bayes' work, who stated the theorem in its general (discrete) form. See, also, Gillies (1987).

However, from a purely formal perspective there is no obvious reason why this essentially trivial probability result should continue to excite interest. Like any theorem in probability, at the technical level Bayes' theorem merely provides a form of "uncertainty accounting", which asserts that the left-hand side of the equation must equal the right-hand side. The interest and controversy, of course, lie in the interpretation and assumed scope of the formal inputs to the two sides of the equation, and it is here that past and present commentators part company in their responses to the idea that Bayes' theorem can or should be regarded as a central feature of the statistical learning process. At the heart of the controversy is the issue of the philosophical interpretation of probability (objective or subjective?) and the appropriateness and legitimacy of basing a scientific theory on the latter.

What Thomas Bayes, from the tranquil surroundings of Bunhill Fields, where he lies in peace with Richard Price for company, has made of all the fuss over the last 233 years we shall never know. We would like to think that he is a subjectivist fellow-traveller but, in any case, he is in no position to complain at the liberties we are about to take in his name.

1.2 THE SUBJECTIVIST VIEW OF PROBABILITY

Throughout this work, we shall adopt a wholehearted subjectivist position regarding the interpretation of probability. The definitive account and defence of this position are given in de Finetti's two-volume Theory of Probability (1970/1974, 1970/1975).
The following brief extract from the Preface to that work perfectly encapsulates the essence of the case.

The only relevant thing is uncertainty, the extent of our own knowledge and ignorance. The actual fact of whether or not the events considered are in some sense determined, or known by other people, and so on, is of no consequence. The numerous, different, opposed attempts to put forward particular points of view which, in the opinion of their supporters, would endow Probability Theory with a 'nobler' status, or a 'more scientific' character, or 'firmer' philosophical or logical foundations, have only served to generate confusion and obscurity, and to provoke well-known polemics and disagreements, even between supporters of essentially the same framework. The main points of view that have been put forward are as follows. The classical view, based on physical considerations of symmetry, in which one should be obliged to give the same probability to such 'symmetric' cases. But which symmetry? And, in any case, why? The original sentence becomes meaningful if reversed: the symmetry is probabilistically significant, in someone's opinion, if it leads him to assign the same probabilities to such events. The logical view is similar, but much more superficial and irresponsible inasmuch as it is based on similarities or symmetries which no longer derive from the facts and their actual properties, but merely from the sentences which describe them, and from their formal structure or language. The frequentist (or statistical) view presupposes that one accepts the classical view, in that it considers an event as a class of individual events, the latter being 'trials' of the former. The individual events not only have to be 'equally probable', but also 'stochastically independent' (these notions when applied to individual events are virtually impossible to define or explain in terms of the frequentist interpretation). In this case, also, it is straightforward, by means of the subjective approach, to obtain, under the appropriate conditions, in a perfectly valid manner, the result aimed at (but unattainable) in the statistical formulation. It suffices to make use of the notion of exchangeability. The result, which acts as a bridge connecting the new approach with the old, has often been referred to by the objectivists as "de Finetti's representation theorem". It follows that all the three proposed definitions of 'objective' probability, although useless per se, turn out to be useful and good as valid auxiliary devices when included as such in the subjectivist theory. (de Finetti, 1970/1974, Preface, xi-xii)

1.3 BAYESIAN STATISTICS IN PERSPECTIVE

The theory and practice of Statistics span a range of diverse activities, which are motivated and characterised by varying degrees of formal intent. Activity in the context of initial data exploration is typically rather informal, activity relating to concepts and theories of evidence and uncertainty is somewhat more formally structured, and activity directed at the mathematical abstraction and rigorous analysis of these structures is intentionally highly formal.

What is the nature and scope of Bayesian Statistics within this spectrum of activity? Bayesian Statistics offers a rationalist theory of personalistic beliefs in contexts of uncertainty, with the central aim of characterising how an individual should act in order to avoid certain kinds of undesirable behavioural inconsistencies. The theory establishes that expected utility maximisation provides the basis for rational decision making and that Bayes' theorem provides the key to the ways in which beliefs should fit together in the light of changing evidence. The goal, in effect, is to establish rules and procedures for individuals concerned with disciplined uncertainty accounting. The theory is not descriptive, in the sense of claiming to model actual behaviour. Rather, it is prescriptive, in the sense of saying "if you wish to avoid the possibility of these undesirable consequences you must act in the following way".

From the very beginning, the development of the theory necessarily presumes a rather formal frame of discourse, within which uncertain events and available actions can be described and axioms of rational behaviour can be stated. But this formalism is preceded and succeeded in the scientific learning cycle by activities which cannot readily be seen as part of the formalism. In any field of application, a prerequisite for arriving at a structured frame of discourse will typically be an informal phase of exploratory data analysis. Also, it can happen that evidence arises which discredits a previously assumed and accepted formal framework and necessitates a rethink. Part of the process of realising that a change is needed can take place within the currently accepted framework using Bayesian ideas, but the process of rethinking is again outside the formalism. Both these phases of initial structuring and subsequent restructuring might well be guided by "Bayesian thinking", by which we mean keeping in mind the objective of creating or recreating a formal framework for uncertainty analysis and decision making, but they are not themselves part of the Bayesian formalism. There is, of course, often a pragmatic ambiguity about the boundaries of the formal and the informal.

The emphasis in this book is on ideas and we have sought throughout to keep the level of the mathematical treatment as simple as is compatible with giving what we regard as an honest account. That said, there are sections where the full story would require a greater level of abstraction than we have adopted, and we have drawn attention to this whenever appropriate.
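The prescriptive rule that the theory establishes, choose the action with maximum expected utility under one's current beliefs, can be sketched in a few lines. The events, probabilities and utilities below are entirely hypothetical and not drawn from the book.

```python
# A minimal sketch of expected utility maximisation for a discrete decision
# problem. Beliefs P(event) and utilities u(action, event) are invented.

beliefs = {"rain": 0.3, "dry": 0.7}            # P(event)
utilities = {                                   # u(action, event)
    "take umbrella": {"rain": 0.8, "dry": 0.6},
    "leave it":      {"rain": 0.0, "dry": 1.0},
}

def expected_utility(action):
    """Expectation of u(action, .) with respect to current beliefs."""
    return sum(beliefs[e] * utilities[action][e] for e in beliefs)

# The rational choice, on this account, is the action maximising expected utility.
best = max(utilities, key=expected_utility)
print(best)
```

Note that the rule is prescriptive rather than descriptive: it says nothing about what people actually do, only what follows from the stated beliefs and utilities.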
1.4 AN OVERVIEW OF BAYESIAN THEORY

1.4.1 Scope

This volume on Bayesian Theory focuses on the basic concepts and theory of Bayesian Statistics, with chapters covering elementary Foundations, mathematical Generalisations of the Foundations, Modelling, Inference and Remodelling. In addition, there are two appendices providing a Summary of Basic Formulae and a review of Non-Bayesian Theories. A detailed treatment of analytical and numerical techniques for implementing Bayesian procedures, the How?, will be provided in the volume Bayesian Computation. A systematic study of the methods of analysis for a wide range of commonly encountered model and problem types, the What?, will be provided in the volume Bayesian Methods. The emphasis throughout this volume is on general ideas: the Why? of Bayesian Statistics.

The selection of topics and the details of approach adopted in this volume necessarily reflect our own preferences and prejudices. We acknowledge, hopefully avoiding too dogmatic a tone, that even colleagues who are committed to the Bayesian paradigm will disagree with at least some points of detail and emphasis in our account. Where we hold strong views, these are, for the most part, rather clearly and forcefully stated. For this reason, and to avoid complicating the main text with too many digressionary asides and references, each of Chapters 2 to 6 concludes with a Discussion and Further References section, in which some of the key issues in the chapter are critically re-examined.

In most cases, the omission of a topic, or its abbreviated treatment in this volume, reflects the fact that a detailed treatment will be given in one or other of the volumes Bayesian Computation and Bayesian Methods. Topics falling into this category include Design of Experiments, Image Analysis, Linear Models, Multivariate Analysis, Nonparametric Inference, Prior Elicitation, Robustness, Sequential Analysis, Survival Analysis and Time Series. There are, however, important topics, such as Game Theory and Group Decision Making, which are omitted simply because a proper treatment seemed to us to involve too much of a digression from our central theme. For a convenient source of discussion and references at the interface of Decision Theory and Game Theory, see French (1986).

1.4.2 Foundations

In Chapter 2, the concept of rationality is explored in the context of representing beliefs or choosing actions in situations of uncertainty. Here, and throughout this volume, we stress the importance of a decision-oriented framework in providing a disciplined setting for the discussion of issues relating to uncertainty and rationality. We introduce a formal framework for decision problems and an axiom system for the foundations of decision theory, which we believe to have considerable intuitive appeal and to be an improvement on the many such systems that have been previously proposed. The dual concepts of probability and utility are formally defined and analysed within this decision making context and the criterion of maximising expected utility is shown to be the only decision criterion which is compatible with the axiom system. The analysis of sequential decision problems is shown to reduce to successive applications of the methodology introduced.

A key feature of our approach is that statistical inference is viewed simply as a particular form of decision problem; specifically, a decision problem where an action corresponds to reporting a probability belief distribution for some unknown quantity of interest. Thus defined, the inference problem can be analysed within the general decision theory framework, rather than requiring a separate "theory of inference". An important special feature of what we shall call a pure inference problem is the form of utility function to be adopted. We establish that the logarithmic utility function, more often referred to as a score function in this context, plays a special role as the natural utility function for describing the preferences of an individual faced with a pure inference problem. Within this framework, measures of the discrepancy between probability distributions and the amount of information contained in a distribution are naturally defined in terms of expected loss and expected increase, respectively, in logarithmic utility. These measures are mathematically closely related to well-known information-theoretic measures pioneered by Shannon (1948) and employed in statistical contexts by Kullback (1959/1968). A resulting characteristic feature of our approach is therefore the systematic appearance of these information-theoretic quantities as key elements in the Bayesian analysis of inference and general decision problems.

1.4.3 Generalisations

In Chapter 3, the ideas and results of Chapter 2 are extended to a much more general mathematical setting. An additional postulate concerning the comparison of a countable collection of events is appended to the axiom system of Chapter 2, and is shown to provide a justification for restricting attention to countably additive probability as the basis for representing beliefs. The elements of mathematical probability theory required in our subsequent development are then reviewed. The notions of actions and utilities, introduced in a simple discrete setting in Chapter 2, are extended in a natural way to provide a very general mathematical framework for our development of decision theory. A further additional mathematical postulate regarding preferences is introduced and, within this more general framework, the criterion of maximising expected utility is shown to be the only decision making criterion compatible with the extended axiom system.
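The information-theoretic quantities mentioned above can be sketched concretely for discrete distributions: the logarithmic score of a reported distribution q when outcome i occurs is log q_i, and the expected loss from reporting q when one's actual beliefs are p is the Kullback-Leibler discrepancy, the sum over i of p_i log(p_i / q_i). The distributions below are invented for illustration; this is our sketch, not the book's notation.

```python
# Logarithmic score and Kullback-Leibler discrepancy for discrete distributions.
import math

def log_score(q, i):
    """Logarithmic utility of reporting distribution q when outcome i obtains."""
    return math.log(q[i])

def kl_discrepancy(p, q):
    """Expected loss, under beliefs p, from reporting q instead of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # actual beliefs (hypothetical)
q = [0.4, 0.4, 0.2]   # some other reported distribution (hypothetical)

print(kl_discrepancy(p, p))      # 0.0: no expected loss from reporting one's actual beliefs
print(kl_discrepancy(p, q) > 0)  # True: any other report incurs positive expected loss
```

That the discrepancy vanishes only when q equals p is one way of seeing why the logarithmic score makes honest reporting of beliefs the optimal action in a pure inference problem.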
In this generalised setting, inference problems are again considered simply as special cases of decision problems and generalised definitions of score functions and measures of information and discrepancy are given.

1.4.4 Modelling

In Chapter 4, we examine in detail the role of familiar mathematical forms of statistical models and the possible justifications, from a subjectivist perspective, for their use as representations of actual beliefs about observable random quantities. A feature of our approach is an emphasis on the primacy of observables and the notion of a model as a (probabilistic) prediction device for such observables. From this perspective, the role of conventional parametric statistical modelling is problematic, and requires fundamental re-examination.

The problem is approached by considering simple structural characteristics, such as symmetry with respect to the labelling of individual counts or measurements, common to many individual beliefs about sequences of observables. The key concept here is that of exchangeability, which we motivate, formalise and then use to establish a version of de Finetti's celebrated representation theorem. This demonstrates that judgements of exchangeability lead to general mathematical representations of beliefs that justify and clarify the use and interpretations of such familiar statistical concepts as parameters, random samples, likelihoods and prior distributions. Going beyond simple exchangeability, we show that beliefs which have certain additional invariance properties, for example, to rotation of the axes of measurements, or translation of the origin, can lead to mathematical representations involving other familiar specific forms of parametric distributions, such as normals and exponentials. A further approach to characterising belief distributions is considered, based on data reduction. The concept of a sufficient statistic is introduced and related to representations involving the exponential family of distributions. Various forms of partial exchangeability judgements about data structures are then discussed in a number of familiar contexts and links are established with a number of other commonly used statistical models. Structures considered include those of several samples, multiway layouts, problems involving covariates, and hierarchies.

1.4.5 Inference

In Chapter 5, the key role of Bayes' theorem in the updating of beliefs about observables in the light of new information is identified and related to conventional mechanisms of predictive and parametric inference. The roles of sufficiency, ancillarity and stopping rules in such inference processes are also examined. Various standard forms of statistical problems, such as point and interval estimation and hypothesis testing, are re-examined within the general Bayesian decision framework and related to formal and informal inference summaries. The mathematical convenience and elegance of conjugate analysis are illustrated in detail, as are the mathematical approximations available under the assumption of the validity of large-sample asymptotic analysis.

A particular feature of this volume is the extended account of so-called reference analysis, which can be viewed as a Bayesian formalisation of the idea of "letting the data speak for themselves". A closely related idea is that of how to represent "vague beliefs" or "ignorance". We provide a detailed historical review of attempts that have been made to solve this problem and compare and contrast some of these with the reference analysis approach.

The problems of implementing Bayesian procedures are discussed at length. A brief account is given of recent analytic approximation strategies derived from Laplace-type methods, together with outline accounts of numerical quadrature, importance sampling, sampling-importance-resampling and Markov chain Monte Carlo methods.

1.4.6 Remodelling

In Chapter 6, it is argued that, whether viewed from the perspective of a sensitive individual modeller or from that of a group of modellers, there are good reasons for systematically entertaining a range of possible belief models, rather than predicating all analysis on a single assumed model. A feature of our treatment of this topic is that, throughout, a clear distinction is drawn among three rather different perspectives on the comparison of and choice from among a range of competing models. The first perspective arises when the range of models under consideration is assumed to include the "true" model. The second perspective arises when the range of models is assumed to be under consideration in order to provide a more conveniently implemented proxy for an actual, but intractable, belief model. The third perspective arises when the range of models is under consideration because the models are "all there is available", in the absence of any specification of an actual belief model. A variety of decision problems are examined within this framework, some involving model choice only, some involving model choice followed by a terminal action, others involving only a terminal action. Our discussion relates and links these ideas with aspects of hypothesis testing, significance testing and cross-validation.

1.4.7 Basic Formulae

In Appendix A, we collect together for convenience, in tabular format, summaries of the main univariate and multivariate probability distributions that appear in the text.

1.4.8 Non-Bayesian Theories

In Appendix B, we review a number of alternative approaches to inference; namely, classical decision theory, frequentist procedures, likelihood theory, fiducial and related theories, and hypothesis and significance testing. Through counterexamples and general discussion, we indicate why we find all these alternatives seriously deficient as formal inference theories.

1.5 A BAYESIAN READING LIST

As we have already remarked, this work is necessarily a selective account of Bayesian theory, reflecting our own interests and perspectives. The following is a list of other Bayesian books, by no means exhaustive. In those cases where there are several editions, or when the original is not in English, we quote both the original date and the date of the most recent English edition. Thus, Jeffreys (1939/1961) refers to Jeffreys' Theory of Probability, first published in 1939, whose most recent edition was published in 1961; similarly, de Finetti (1970/1974) refers to the first volume of de Finetti's Theory of Probability, published in 1970, and to its English translation (published in 1974).

Pioneering Bayesian books include Laplace (1812), Keynes (1921/1929), Jeffreys (1939/1961), Good (1950, 1965), Schlaifer (1959, 1961), Pratt et al. (1965), Lindley (1965, 1972), Dubins and Savage (1965/1976), de Finetti (1970/1974, 1970/1975), DeGroot (1970), Lavalle (1970) and Box and Tiao (1973). Elementary and intermediate Bayesian textbooks include those of Savage (1968), Schmitt (1969), Tribus (1969), Lindley (1971/1985), Winkler (1972), Kleiter (1980), Bernardo (1981b), Daboni and Wedlin (1982), Regazzini (1983), Iversen (1984), Berger (1985a), O'Hagan (1988a), Press (1989), Savchuk (1989), Scozzafava (1989), Florens et al. (1990), Wichmann (1990), Robert (1992), O'Hagan (1994a) and Berry (1996). Polson and Tiao (1995) is a two-volume collection of classic papers in Bayesian inference. Special topics have also been examined from a Bayesian point of view; these include Actuarial Science (Klugman, 1992), Biostatistics (GirelliBruni, 1981),
we review what we perceive to be the main alternatives to the Bayesian approach.RaiffaandSchlaifer( 1961). Savage( 1954/1972. 1992).1. Mostellerand Wallace( 1964/1984).whose contents would provide a significant complement to the material in this volume. Cifarelli and Muliere ( 1989). . We compare and contrast these alternatives in the context of "stylised" inference problems such as point and interval estimation. de Finetti (1970/1974) refers to the original (1970) Italian version of de Finetti's Teoria delle Probabilitu vol. More advanced Bayesian monographs include Hartigan ( 1983). Lee ( 1989). Borovcnik ( 1992).and to its most recent (3rd) edition. together with summaries of the prior/posterior/predictive forms corresponding to these distributions in the context of conjugate and reference analyses.
Savage (1961. Smith. . 1979. 1985.Justice. Stael von Holstein and Matheson. Hadley. Cooke. 1978. 1987). 1965/1983. Educutional und Psvchoiogical Research (Novick and Jackson. 1988a. Mmirnunr Enrrupy (Levine and Tribus. Rios. Viertl(1987). 1996). Howson and Urbach 1989. Box et ul. 1979.Erickson and Smith. I97 I . 1985). 1975. 1992). 1973. Bmai et al. 1967. Skilling. Geisser et trl. 1970. I97 1 . Marinell and Seeber. Verbraak. 1989. Economics arid Econometrics (Morales. 1982. I974). 1988. 1989).1992.Fearn and O'Hagan (1993). 1959. Control Theory (Aoki. 1992.Satnple Surveys (Rubin. Aitken and Stoney. Pollard. Berry and Stangl. 1966. Savage (1981). Morris. 1990. 1967. 1960. Good (1983). 1976. Fienberg and Zellner ( 1974). 1993). Oliver and Smith ( 1990). 1987. Fellner. Raiffa. 1986. 1979. and last. 1991).. 1978. Berger and Wolpert. 1965: Roberts. Fishburn. 1991.Ghosh and Pathak ( 1992). Dawid and Smith (1983). 1984.. 1985). Foundations (Fishburn. 1968. 1976. 1967. 1957. Keeney and Raiffa. Gtxl and Zellner ( 1986).Smith and Grandy. 1970. West and Harrison. Florens et (11. Rosenkranz. Among these. Among these. 1977: Lavalle. Schlaifer. 1983. 1987. Broemeling. 1978. I996 and 1999). 1989. 1984. General discussions of Bayesian Statistics may be found in review papers and encyclopedia articles. 1994). Pilz. Halter and Dean. 1967.White and Bowen (19751. Martin. 1968. 1989. Geisser. 1973)and Spectral Anu1ysi. 1971: Learner. 1984). 1985). 1992. 1964. Richard. we note de Finetti ( 195 I ).Multivtiritrtr Analysis (Press.Decision Atialvsis (Duncan and Raiffa. Pole et al. 1987. 1971. I988a).stction (Mockus. 1988. Gatsonis Y I ul. Grandy and Schick. 1977. 1980. Sawagari et ul. Parenti (1978). Fougere. 199 I 1. 1981. Chernoff and Moses. 1987). Berger and DasGupta.Osteyee and Gcxxl. 1990. 1986). A number of collected works also include a wealth of Bayesian material. but not least. 1986. ( 1993). 1984. 0ptinii. 1969. 1994). (1990). 1988. 1985. 1982b.. 1970. Lusted. 1993). French. 
MohammadDjafari and Demoment. Ansconibe (1961)..10 I Iniroditction Lecoutre. Dynanric Forecasting (Spall. Grayson. we note particularly. Prediction (Aitchison and Dunsmore. Kyburg and Smokler ( 1964/ 1980). Rivadulla. Kadane ( 1984). 1970.Goel and Iyengar ( 1992). Reliability (Martz and Waller. Freeman and Smith ( 1994). Gardenfors and Sahlin (1988). History (Dale. Smith and Erickson. Roberts. 1990. Jaynes ( 1983). Zellner (1980).s (Bretthorst. Edwards and Tversky. Pattern Recognition (Simon. Inforniurion Theory ( Y a glom and Yaglom. 1971.Social Science (Phillips. Aykaq and Brumat (1977). 1968. 1982. Bauwens.Godambeand Sprott ( 197 1 ). Box (1985). 1967). Lrrw and fimnsic Science (DeGroot etal. Logic und Philosophy qfScience (Jeffrey. Lindley ( 1953. 1984/1988. Boyer and Kihlstrom. 1988). 1991). 19911. Probabiliry Assessnient (Stael von Holstein. Brown. 197211982.de Finetti ( 1993). the Proceedings of the kleticia Internutional Meetings on Stryesitin Statistics (Bernard0 et al. French et al. Seidenfeld. 1960/ 1983. (1983). 1978. Cyert and DeGroot. Kapur and Kesavan. 1988). Smith and Dawid (1987). 1984. Lindgren. (1983. Linear Models (Lempers. Gupta and Berger (1988. Zellner. Meyer and Collier ( 1970). Aitchison. 1988. I99 I ). Hinkelmann (1990). 1974. 1982~. 19x2: Claroti and Lindley. 1983/1991.
1.1978).de Finetti ( I968). 1987. Bartholomew (1994). (1986).Zellner ( 1988c). Eric9 son ( 1988). Dawid (1983a). 1988b). (1992). Bartholomew (1963.Edwards et al. I985a).Good (1976.Birnbaum ( 1968.1986. Arnold (1993). 1986).Trader ( 1989). Dawid (1983b. Good (1982. (1963). 1988b). Pack (1986a. Smith (1991). Berger (1994) and Hill (1994). RacinePoon etal. see Luce and Suppes ( 1965). Cifarelli (1987). 1988a.Lindley ( I 91). 1986b). Kadane (1993). Joshi ( 1983). . 1987. 1988a). For discussion of important specific topics. 1992). Hodges (1987). Fishburn ( I98 1. Geisser (1982. Ghosh (1991) and Berger (1993). 1992).Breslow (1 W). Ferguson et al. 1985. LaMotte (1985).5 A Buyesian Reading List 11 1970). Bernard0 (l989). Roberts (1978). Goldstein (1986~). Cornfield (1969). Zellner ( 1985. DeGroot (1982). Press ( I980a. Barlow and Irony (1992). Smith (l984). Genest and Zidek (19861. 1986a. Dickey (1982).
Chapter 2

Foundations

Summary

The concept of rationality is explored in the context of representing beliefs or choosing actions in situations of uncertainty. An axiomatic basis, with intuitive operational appeal, is introduced for the foundations of decision theory. The dual concepts of probability and utility are formally defined and analysed within this context. The criterion of maximising expected utility is shown to be the only decision criterion which is compatible with the axiom system. Statistical inference is viewed as a particular decision problem which may be analysed within the framework of decision theory. The logarithmic score is established as the natural utility function to describe the preferences of an individual faced with a pure inference problem. Within this framework, the concept of discrepancy between probability distributions and the quantification of the amount of information in new data are naturally defined in terms of expected loss and expected increase in utility, respectively. The analysis of sequential decision problems is shown to reduce to successive applications of the methodology introduced.

2.1 BELIEFS AND ACTIONS

We spend a considerable proportion of our lives, both private and professional, in a state of uncertainty. This uncertainty may relate to past situations, where direct knowledge or evidence is not available, or has been lost or forgotten; or to present and future developments which are not yet completed. Whatever the circumstances, there is a sense in which all states of uncertainty may be described in the same way: namely, an individual feeling of incomplete knowledge in relation to a specified situation (a feeling which may, of course, be shared by other individuals).

And yet it is obvious that we do not attempt to treat all our individual uncertainties with the same degree of interest or seriousness. Many feelings of uncertainty are rather insubstantial, and we neither seek to analyse them, nor to order our thoughts and opinions in any kind of responsible way. This typically happens when we feel no actual or practical involvement with the situation in question, when we feel that we have no (or only negligible) capacity to influence matters, or that the possible outcomes have no (or only negligible) consequences so far as we are concerned. In such cases, we are not motivated to think carefully about our uncertainty, either because nothing depends on it, or because the potential effects are trivial in comparison with the effort involved in carrying out a conscious analysis.

On the other hand, we all regularly encounter uncertain situations in which we at least aspire to behave "rationally" in some sense. This might be because we face the direct practical problem of choosing from among a set of possible actions, where each involves a range of uncertain consequences and we are concerned to avoid making an "illogical" choice. Alternatively, we might be called upon to summarise our beliefs about the uncertain aspects of the situation, bearing in mind that others may subsequently use this summary as the basis for choosing an action. In this case, we are concerned that our summary be in a form which will enable a "rational" choice to be made at some future time. We might then regard the summary itself, i.e. the choice of a particular mode of representing and communicating our beliefs, as being a form of action to which certain criteria of "rationality" might be directly applied.

Our basic concern in this chapter is with exploring the concept of "rationality" in the context of representing beliefs or choosing actions in situations of uncertainty.

To choose the best among a set of actions would, in principle, be immediate if we had perfect information about the consequences to which they would lead. In other words, interesting decision problems are those for which such perfect information is not available, and we must take uncertainty into account as a major feature of the problem. It might be argued that there are complex situations where we do have complete information and yet still find it difficult to take the best decision. For example, it is typically not easy to decide on the optimal strategy to rebuild a Rubik cube, or on the cheapest diet fulfilling specified nutritional requirements, even though we have, in principle, complete information. In such cases, the difficulty is technical, not conceptual: the difficulties result from the large number of possible strategies and, frequently, they reduce to the mathematical problem of finding a minimum under certain constraints. We take the view that such problems are purely technical, and in this work we shall not consider these kinds of combinatorial or mathematical programming problems. But in neither case is there any doubt about the decision criterion to be used, and we shall assume that in the presence of complete information we can, in principle, always choose the best alternative. Our concern, instead, is with the decision criterion to be adopted when we do not have complete information and are thus faced with, at least, some elements of uncertainty.

To avoid any possible confusion, we should emphasise that we do not interpret "actions in situations of uncertainty" in a narrow, directly "economic" sense. Thus, within our purview we include the situation of an individual scientist summarising his or her own current beliefs following the results of an experiment, or trying to facilitate the task of others seeking to decide upon their beliefs in the light of the experimental results. In such cases, the emphasis is on the inference rather than the decision aspect of the problem, although formally it can still be considered a decision problem if the inferential statement itself is interpreted as the decision to be taken (Lehmann, 1959/1986). Frequently, it is a question of providing a convenient summary of the data.

It is assumed in our approach to such problems that the notion of "rational belief" cannot be considered separately from the notion of "rational action". Either a statement of beliefs in the light of available information is, actually or potentially, an input into the process of choosing some practical course of action, or, alternatively, a statement of beliefs might be regarded as an end in itself, in which case the choice of the form of statement to be made constitutes an action. In other words, it is not asserted that a belief does actually lead to action, but that it would lead to action in suitable circumstances, just as a lump of arsenic is called poisonous not because it actually has killed or will kill anyone, but because it would kill anyone if he ate it (Ramsey, 1926). We can therefore explore the notion of "rationality" for both beliefs and actions by concentrating on the latter, and asking ourselves what kinds of rules should govern preference patterns among sets of alternative actions in order that choices made in accordance with such rules commend themselves to us as "rational", in that they cannot lead us into forms of behavioural inconsistency which we specifically wish to avoid.

In Section 2.2, we describe the general structure of problems involving choices under uncertainty and introduce the idea of preferences between options. In Section 2.3, we make precise the notion of "rational" preferences in the form of axioms.
In Sections 2.4 and 2.5, we prove that, in order to conform with the principles of quantitative coherence, degrees of belief about uncertain events should be described in terms of a (finitely additive) probability measure, relative values of individual possible consequences should be described in terms of a utility function, and the rational choice of an action is to select one which has the maximum expected utility. We describe these as principles of quantitative coherence because they specify the ways in which preferences need to be made quantitatively precise and fit together, or cohere, if "illogical" forms of behaviour are to be avoided. In Section 2.6, we discuss sequential decision problems and show that their analysis reduces to successive applications of the maximum expected utility methodology; in particular, we identify the design of experiments as a particular case of a sequential decision problem. In Section 2.7, we make precise the sense in which choosing a form of a statement of beliefs can be viewed as a special case of a decision problem. This identification of inference as decision provides the fundamental justification for beginning our development of Bayesian Statistics with the discussion of decision theory. Finally, a general review of ideas and references is given in Section 2.8.

2.2 DECISION PROBLEMS

2.2.1 Basic Elements

We shall describe any situation in which choices are to be made among alternative courses of action with uncertain consequences as a decision problem, whose structure is determined by three basic elements:

(i) a set {a_i, i ∈ I} of available actions, one of which is to be selected;
(ii) for each action a_i, a set {E_j, j ∈ J} of uncertain events, describing the uncertain outcomes of taking action a_i;
(iii) corresponding to each set {E_j, j ∈ J}, a set of consequences {c_j, j ∈ J}.

The idea is as follows. Each set of events {E_j, j ∈ J} forms a partition (an exclusive and exhaustive decomposition) of the total set of possibilities, so that if we choose action a_i, then one and only one of the uncertain events E_j, j ∈ J, occurs and leads to the corresponding consequence c_j. Naturally, both the set of consequences and the partition which labels them may depend on the particular action considered, so that a more precise notation would be {E_ij, j ∈ J_i} and {c_ij, j ∈ J_i} (for each i). However, to simplify notation, we shall omit this dependence, while remarking that it should always be borne in mind; we shall come back to this point in Section 2.6. In practical problems, the labelling sets I and J are typically finite. In such cases, the decision problem can be represented schematically by means of a decision tree, as shown in Figure 2.1.

Figure 2.1 Decision tree

The square represents a decision node, where the choice of an action is required. The circle represents an uncertainty node, where the outcome is beyond our control. Following the choice of an action a_i and the occurrence of a particular event E_j, the branch leads us to the corresponding consequence c_j.

It is clear, either from our general discussion or from the decision tree representation, that we can formally identify any a_i, i ∈ I, with the combination of {E_j, j ∈ J} and {c_j, j ∈ J} to which it leads; i.e., to choose a_i is, essentially, to opt for the uncertain scenario labelled by the pairs (E_j, c_j), j ∈ J. We shall write a_i = {c_j | E_j, j ∈ J} to denote this identification, where the notation c_j | E_j signifies that event E_j leads to consequence c_j.

An individual's perception of the state of uncertainty resulting from the choice of any particular a_i is very much dependent on the information currently available. Further information, of a kind which leads to a restriction on what can be regarded as the total set of possibilities, will change the perception of the uncertainties, in that some of the E_j's may become very implausible (or even logically impossible) in the light of the new information, whereas others may become more plausible. Preferences about the uncertain scenarios resulting from the choices of actions depend on attitudes to the consequences involved and assessments of the uncertainties attached to the corresponding events. The latter are clearly subject to change as new information is acquired, and this may well change overall preferences among the various courses of action. It is therefore of considerable importance to bear in mind that a representation such as Figure 2.1 only captures the structure of a decision problem as perceived at a particular point in time: {E_j, j ∈ J} forms a partition of the total set of relevant possibilities as the individual decision-maker now perceives them to be. Of course, most practical problems involve sequential considerations but, as shown in Section 2.6, these reduce, essentially, to repeated analyses based on the above structure.

The notion of preference is, of course, very familiar in the everyday context of actual or potential choice. In everyday terms, an individual decision-maker often prefaces an actual choice (from a menu, an investment portfolio, a range of possible forms of medical treatment, etc.) with the phrase "I prefer . . . " (caviar, equities, surgery, etc.). Similarly, the idea of indifference between two courses of action also has a clear operational meaning: it signifies a willingness to accept an externally determined choice (for example, letting a disinterested third party choose, or tossing a coin).

2.2.2 Formal Representation

When considering a particular, concrete decision problem, we do not usually confine our thoughts to only those outcomes and options explicitly required for the specification of that problem. Typically, we expand our horizons to encompass analogous problems, which we hope will aid us in ordering our thoughts by providing suggestive points of reference or comparison. The collection of uncertain scenarios defined by the original concrete problem is therefore implicitly embedded in a somewhat wider framework of actual and hypothetical scenarios, within which the comparisons of scenarios are to be carried out. We begin by describing this wider frame of discourse.

In addition to representing the structure of a decision problem using the three elements discussed above, we must also be able to represent the idea of preference as applied to the comparison of some or all of the pairs of available options. To prefer action a1 to action a2 means that if these were the only two options available, a1 would be chosen (conditional, of course, on the information available at the time). We shall therefore need to consider a fourth basic element of a decision problem:

(iv) the relation ≤, which expresses the individual decision-maker's preferences between pairs of available actions, so that a1 ≤ a2 signifies that a1 is not preferred to a2.

It is to be understood that the initial specification of any such particular frame of discourse, together with the preferences among options within it, are dependent on the decision-maker's overall state of information at that time. Throughout, we shall denote this initial state of mind by M0.

We now give a formal definition of a decision problem. This will be presented in a rather compact form; detailed elaboration is provided in the remarks following the definition. The development which follows is largely based on Bernardo, Ferrándiz and Smith (1985).

Definition 2.1. (Decision problem). A decision problem is defined by the elements (E, C, A, ≤), where:
(i) E is an algebra of relevant events;
(ii) C is a set of possible consequences;
(iii) A is a set of options, consisting of functions which map finite partitions of Ω, the certain event in E, to compatibly-dimensioned, ordered sets of elements of C;
(iv) ≤ is a preference order, taking the form of a binary relation between some of the elements of A.

These four basic elements have been introduced in a rather informal manner. We now discuss each of them in detail.

Within the wider frame of discourse, an individual decision-maker will wish to consider the uncertain events judged to be relevant in the light of the initial state of information M0. Typically, the algebra E will consist of what we might call the real-world events (that is, those occurring in the structure of any concrete, actual decision problem that we may wish to consider), together with any other hypothetical events which it may be convenient to bring to mind as an aid to thought. Technically, we are assuming that the class of relevant events has the structure of an algebra. This is natural: if E1 ∈ E and E2 ∈ E are judged to be relevant events, then it may also be of interest to know about their joint occurrence, or whether at least one of them occurs, so that E1 ∩ E2 and E1 ∪ E2 should also be assumed to belong to E. Similarly, it is natural to require E to be closed under complementation, so that E^c ∈ E. Repetition of this argument suggests that E should be closed under the operations of arbitrary finite intersections and unions. In particular, these requirements ensure that the certain event Ω and the impossible event ∅ both belong to E. The class E will simply be referred to as the algebra of (relevant) events.

We denote by C the set of all consequences that the decision-maker wishes to take into account. Preferences among such consequences will later be assumed to be independent of the state of information concerning relevant events. It can certainly be argued that this is too rigid an assumption; we shall provide further discussion of this and related issues in Section 2.8. The class C will simply be referred to as the set of (possible) consequences.

In our introductory discussion, we used the term action to refer to each potential act available as a choice at a decision node. Within the wider frame of discourse, since the general, formal framework may include hypothetical scenarios (possibly rather far removed from potential concrete actions, or potential acts), we prefer the term option. So far as the definition of an option as a function is concerned, we note that this is a rather natural way to view options from a mathematical point of view: an option consists precisely of the linking of a partition of Ω, {E_j, j ∈ J}, with a corresponding set of consequences, {c_j, j ∈ J}, with the interpretation that event E_j leads to consequence c_j. To represent such a mapping, we shall adopt the notation {c_j | E_j, j ∈ J}. The class A of options, or potential actions, will simply be referred to as the action space.
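To fix ideas, the basic elements just described can be sketched in code. The following minimal illustration is not from the text: it represents each option as a mapping from the events of a partition to consequences and, anticipating the maximum expected utility criterion established in Sections 2.4 and 2.5, compares actions by expected utility. The events, probabilities and utilities are entirely hypothetical.

```python
# A minimal sketch (not from the text) of a decision problem's basic elements.
# Each action a_i is identified with an option {c_j | E_j, j in J}: a mapping
# from the events of a partition to consequences. The degrees of belief and
# utilities below are hypothetical placeholders; the book derives their
# existence from the coherence axioms (Sections 2.4 and 2.5).

# Options: each action maps the events of its partition to a consequence.
umbrella = {"rain": "dry but encumbered", "no rain": "encumbered"}
no_umbrella = {"rain": "soaked", "no rain": "unencumbered"}
actions = {"take umbrella": umbrella, "leave umbrella": no_umbrella}

# Hypothetical degrees of belief over the partition {rain, no rain} ...
belief = {"rain": 0.3, "no rain": 0.7}
# ... and hypothetical utilities attached to the consequences.
utility = {"dry but encumbered": 0.8, "encumbered": 0.6,
           "soaked": 0.0, "unencumbered": 1.0}

def expected_utility(option, belief, utility):
    """Expected utility of an option {c_j | E_j}: sum over j of P(E_j) u(c_j)."""
    return sum(p * utility[option[event]] for event, p in belief.items())

best = max(actions, key=lambda a: expected_utility(actions[a], belief, utility))
for name, option in actions.items():
    print(name, round(expected_utility(option, belief, utility), 3))
print("chosen:", best)
```

Note that the decision tree of Figure 2.1 is implicit here: each key of `actions` is a branch from the decision node, and each event of the partition is a branch from the corresponding uncertainty node.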
It follows immediately from the definition of an option that the ordering of the labels within J is irrelevant. Thus, for example, with E ∈ E, the options {c1 | E, c2 | E^c} and {c2 | E^c, c1 | E} are identical. Similarly, for any c ∈ C, the options {c | E, c | E^c} and {c | Ω} are completely equivalent, as are forms such as {c | E1, c | E2, c_j | E_j, j ∈ J} and {c | E1 ∪ E2, c_j | E_j, j ∈ J}. Sometimes, the interpretation of an option with a rather cumbersome description is clarified by an appropriate reformulation of this kind.

We can identify individual consequences as special cases of options by writing c = {c | Ω}. Without introducing further notation, we shall simply regard c as denoting either an element of C, or the element {c | Ω} of A; which form is used in any particular context is purely a matter of convenience, and there will be no danger of any confusion arising from this identification. Sometimes we shall also use the composite function notation a = {a1 | G, a2 | G^c}. Thus, for example, the option a = {c1 | E ∩ G, c2 | E^c ∩ G, c3 | G^c} may be more compactly written as a = {a1 | G, c3 | G^c}, with a1 = {c1 | E, c2 | E^c}.

In defining options, the assumption of a finite partition into events of E seems to us to correspond most closely to the structure of practical problems. However, an extension to admit the possibility of infinite partitions has certain mathematical advantages and will be fully discussed, together with other mathematical extensions, in Chapter 3.

In introducing the preference binary relation ≤, we are not assuming that all pairs of options (a1, a2) ∈ A × A can necessarily be related by ≤. If the relation can be applied, in the sense that either a1 ≤ a2 or a2 ≤ a1 (or both), we say that the two options are comparable. From ≤, we can derive a number of other useful binary relations.

Definition 2.2. (Induced binary relations).
(i) a1 ~ a2 if and only if a1 ≤ a2 and a2 ≤ a1;
(ii) a1 < a2 if and only if a1 ≤ a2 and it is not true that a2 ≤ a1;
(iii) a1 ≥ a2 if and only if a2 ≤ a1;
(iv) a1 > a2 if and only if a2 < a1.

Definition 2.2 is to be understood as referring to any options a1, a2 in A; to simplify the presentation, we shall omit such universal quantifiers when there is no danger of confusion. Together with the interpretation of ≤, the induced binary relations are to be interpreted to mean that a1 is equivalent to a2 if and only if a1 ~ a2, and that a1 is strictly preferred to a2 if and only if a1 > a2. Together, these suffice to describe all cases where pairs of options can be compared.
Continuing the (convenient and harmless) abuse of notation, we shall write c1 ≤ c2 if and only if {c1 | Ω} ≤ {c2 | Ω}, and say that consequence c1 is not preferred to consequence c2. Strictly speaking, we should introduce a new symbol to replace ≤ when referring to a preference relation over C × C, since ≤ is defined over A × A. However, this parsimonious abuse of notation creates no danger of confusion, and we shall routinely adopt such usage in order to avoid a proliferation of symbols. We shall proceed similarly with the binary relations ~ and < introduced in Definition 2.2. To avoid triviality, we shall later formally assume that there exist at least two consequences c1 and c2 such that c1 < c2.

The basic preference relation between options can also be used to define a binary relation on E × E, the collection of all pairs of relevant events. This binary relation will capture the intuitive notion of one event being "more likely" than another.

Definition 2.3. (Uncertainty relation). For all c1 < c2, E ≤ F if and only if {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c}; we then say that E is not more likely than F, given the state of information described by M0.

Since, once again, there is no danger of confusion, we shall further economise on notation and also use the symbol ≤ to denote this new uncertainty binary relation between events, and we shall also use the derived binary relations given in Definition 2.2 to describe uncertainty relations between events. Thus, E ~ F if and only if E and F are equally likely, and E > F if and only if E is strictly more likely than F. It is always true, as one would expect, that ∅ < Ω. A statement such as E > F is to be interpreted as "this individual, in the light of his or her initial state of information M0, considers event E to be more likely than event F". It is worth stressing once again at this point that all the order relations over A × A, and hence over C × C and E × E, are to be understood as personal, in the sense that each individual is free to express his or her own personal preferences.

The intuitive content of the definition is clear. If we compare two dichotomised options, involving the same pair of consequences and differing only in terms of their uncertain events, we will prefer the option under which we feel it is "more likely" that the preferred consequence will obtain. Moreover, as one would expect, the force of this argument applies independently of the choice of the particular consequences c1 and c2, provided that our preferences between the latter are assumed independent of any considerations regarding the events E and F. The basic preference relation between options, together with the uncertainty relation it induces between events, given an agreed structure for a decision problem, is of fundamental importance.
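The claim that Definition 2.3 does not depend on the particular pair c1 < c2 can be illustrated numerically. Anticipating the expected utility representation derived later in the chapter (it is not assumed at this point of the text), the option {c2 | E, c1 | E^c} has expected utility u1 + P(E)(u2 - u1), which is increasing in P(E) whenever u1 < u2; the probabilities and utilities below are hypothetical.

```python
# A numerical sketch (not from the text) of why the uncertainty relation of
# Definition 2.3 is independent of the choice of consequences c1 < c2, under
# the expected utility representation derived later in the chapter. The
# option {c2 | E, c1 | E^c} has expected utility (1 - P(E)) u1 + P(E) u2,
# which is strictly increasing in P(E) whenever u1 < u2. All numbers are
# hypothetical.

def eu_dichotomy(p_event, u1, u2):
    """Expected utility of {c2 | E, c1 | E^c} when P(E) = p_event."""
    return (1 - p_event) * u1 + p_event * u2

p_E, p_F = 0.4, 0.6  # hypothetical degrees of belief, with P(E) < P(F)

# Whatever utilities u1 < u2 we attach to c1 < c2, the F-option is preferred:
for u1, u2 in [(0.0, 1.0), (-5.0, 2.0), (10.0, 10.5)]:
    assert eu_dichotomy(p_E, u1, u2) < eu_dichotomy(p_F, u1, u2)

# Hence comparing the two dichotomised options amounts exactly to comparing
# P(E) with P(F): E is "not more likely" than F iff P(E) <= P(F).
```

This is, of course, only a consistency check under the quantitative representation, not a substitute for the qualitative definition, which is stated entirely in terms of preferences.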
Recall that Definition 2.3 provides such statements with an operational meaning, since, for all c1 < c2, E > F is equivalent to an agreement to choose option {c2 | E, c1 | E^c} in preference to option {c2 | F, c1 | F^c}.

Throughout this section, we have stressed that preferences, initially defined among options but inducing binary relations among consequences and events, are conditional on the current state of information. The initial state of information, taking as an arbitrary "origin" the first occasion on which an individual thinks systematically about the problem, has been denoted by M0. Subsequently, we shall need to take into account further information. To complete our discussion of basic ideas and definitions, we therefore need to consider one further important topic: preferences conditional on new information.

Given the assumed occurrence of a possible event G, preferences between options will be described by a new binary relation ≤_G, taking into account both the initial information M0 and the additional information provided by G. The intuitive content of this conditional relation is clear: comparison of options which are identical if G^c occurs depends entirely on consideration of what happens if G occurs. Thus, if we do not prefer a1 to a2, then this preference obviously carries over to any pair of options leading, respectively, to a1 or a2 if G occurs, and defined identically if G^c occurs. Note, however, that conditioning on G may yield new preference patterns; naturally, the induced binary relations set out in Definition 2.2 have their obvious conditional counterparts, <_G and ~_G. Obviously, if G = Ω, all conditional relations reduce to their unconditional counterparts; subsequently, when we come, in Section 2.3, to discuss the desirable properties of ≤ and ≤_G, we shall make formal assumptions which imply this, and it is only when ∅ < G < Ω that conditioning on G may yield genuinely new preference patterns.

The definition of the conditional uncertainty relation ≤_G between events is a simple translation of Definition 2.3 to a conditional preference setting. The induced binary relation between consequences is obviously defined by c1 ≤_G c2 if and only if c1 ≤ c2, so that preferences between pure consequences are not affected by additional information regarding the uncertain events in E. The conditional uncertainty relation ≤_G, as one would expect, provides the key to investigating the way in which uncertainties about events should be modified in the light of new information.
2.3 COHERENCE AND QUANTIFICATION

2.3.1 Events, Options and Preferences

The formal representation of the decision-maker's "wider frame of discourse" includes an algebra of events E, a set of consequences C, and a set of options A, whose generic element has the form {c_j | E_j, j ∈ J}, where {E_j, j ∈ J} is a finite partition of the certain event Ω, c_j ∈ C and E_j ∈ E. The set A × A is equipped with a collection of binary relations ≤_G, G ∈ E, G > ∅, representing the preference relation on A × A conditional on the assumed occurrence of a possible event G, with ≤ (i.e. ≤_Ω) representing the notion that one option is not preferred to another. In addition, all preferences are assumed conditional on an initial state of information M0.

We now wish to make precise our assumptions about these elements of the formal representation of a decision problem. Bearing in mind the overall objective of developing a rational approach to choosing among options, our assumptions, presented in the form of a series of axioms, can be viewed as responses to the questions: "what rules should preference relations obey?" and "what events should be included in E?" Each formal axiom will be accompanied by a detailed discussion of the intuitive motivation underlying it.

It is important to recognise that the axioms we shall present are prescriptive, not descriptive: they do not purport to describe the ways in which individuals actually do behave in formulating problems or making choices; neither do they assert, on some presumed "ethical" basis, the ways in which individuals should behave. The axioms simply prescribe constraints which it seems to us imperative to acknowledge in those situations where an individual aspires to choose among alternatives in such a way as to avoid certain forms of behavioural inconsistency.

2.3.2 Coherent Preferences

We shall begin by assuming that problems represented within the formal framework are non-trivial and that we are able to compare any pair of simple dichotomised options.

Axiom 1. (Comparability of consequences and dichotomised options).
(i) There exist consequences c1, c2 such that c1 < c2.
(ii) For all consequences c1, c2 and events E, F, either {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c} or {c2 | F, c1 | F^c} ≤ {c2 | E, c1 | E^c}.
Discussion of Axiom 1. Condition (i) has obvious intuitive support. If all consequences were equivalent, there would not be a decision problem in any real sense, since all choices would certainly lead to precisely equivalent outcomes. We have already noted that, in any given decision problem, C can be defined as simply the set of consequences required for that problem.

Condition (ii) is also very natural. In most practical problems, there will typically be a high degree of similarity in the form of the consequences (e.g. all monetary), although it is easy to think of examples where this form is complex (e.g. combinations of monetary, health and industrial relations elements). Resource allocation among competing health care programmes involving different target populations and morbidity and mortality rates is one obvious such example. There are certainly many situations where we find the task of comparing simple options, and even consequences, very difficult. The difficulty of comparing options in such cases does not, of course, obviate the need for such comparisons if we are to aspire to responsible decision making. Condition (ii) does not therefore assert that we should be able to compare any pair of conceivable options, however bizarre or fantastic; we are not, at this stage, making the direct assumption that all options, however complex, can be compared. Rather, condition (ii) is to be interpreted in the following sense: "if we aspire to make a rational choice between alternative options, then we must at least be willing to express preferences between simple dichotomised options." There could be no possibility of an orderly and systematic approach if we were unwilling to express preferences among simple dichotomised options and hence (with E = F = Ω) among the consequences themselves. We are trying to capture the essence of what is required for an orderly and systematic approach to comparing alternatives of genuine interest.

We shall now state our assumptions about the ways in which preferences should fit together, or cohere, in terms of the order relation over A × A.

Axiom 2. (Transitivity of preferences).
(i) a ≤ a;
(ii) if a1 ≤ a2 and a2 ≤ a3, then a1 ≤ a3.

Discussion of Axiom 2. Condition (i) has obvious intuitive support. It would make little sense to assert that an option was strictly preferred to itself, and it would also seem strangely perverse to claim to be unable to compare an option with itself! Condition (ii) requires preferences to be transitive. The intuitive basis for such a requirement is perhaps best illustrated by considering the consequences of intransitive preferences. Suppose, for example, that we found ourselves expressing the preferences a1 < a2, a2 < a3 and a3 < a1 among three options a1, a2 and a3. We note that, from Definition 2.2, the assertion of strict preference rules out equivalence between any pair of the options.
Our expressed preferences thus reveal that we perceive some actual difference in value (no matter how small) between the two options in each case. It is intuitively clear that if we assert strict preference there must be some amount of money (or grains of wheat, or beads, or whatever), having a "value" less than the perceived difference in "value" between the two options, that we would be willing to pay in order to move from a position of having to accept the less preferred option to one where we have, instead, the more preferred option. At this stage, our discussion is, of course, informal and appeals to directly intuitive considerations; it would be inappropriate to become involved in a formal discussion of terms such as "value" and "price".

Let us now examine the behavioural implications of the expressed preferences a1 < a2, a2 < a3 and a3 < a1. Suppose that we are confronted with the prospect of having to accept option a1. By virtue of the expressed preference a1 < a2 and the above discussion, we are implicitly stating that there exists a "price", say x, however small, that we are willing to pay in order to exchange option a1 for option a2. Let y and z denote the corresponding "prices" for switching from a2 to a3 and from a3 to a1, respectively. Then, by virtue of the preference a2 < a3, we are willing to pay y in order to exchange a2 for a3; and, since a3 < a1, we are willing to pay z in order to avoid a3 and have, instead, a1. We would thus have paid x + y + z in order to find ourselves in precisely the same position as we started from! What is more, we could find ourselves arguing through this cycle over and over again. Willingness to act on the basis of intransitive preferences is thus seen to be equivalent to a willingness to suffer unnecessarily the certain loss of something to which one attaches positive value.

We regard this as inherently inconsistent behaviour and recall that the purpose of the axioms is to impose rules of coherence on preference orderings that will exclude the possibility of such inconsistencies. Axiom 2(ii) is to be understood in the following sense: "if we aspire to avoid expressing preferences whose behavioural implications are such as to lead us to the certain loss of something we value, then we must ensure that our preferences fit together in a transitive manner."

The following consequences of Axiom 2 are easily established and will prove useful in our subsequent development.

Proposition 2.1. (Transitivity of uncertainties).
(i) E ≤ E;
(ii) E1 ≤ E2 and E2 ≤ E3 imply E1 ≤ E3.

Proof. This is immediate from Definition 2.3 and Axiom 2.
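The behavioural implications of intransitive preferences can be illustrated with a small numerical sketch. The options a1, a2, a3 and the exchange "prices" below are hypothetical, not from the text: an agent holding the cyclic strict preferences a1 < a2, a2 < a3, a3 < a1 pays a premium at each exchange and, after traversing the cycle, holds a1 again, poorer by x + y + z.

```python
def money_pump(prices, cycles=1):
    """Total amount paid traversing the cycle a1 -> a2 -> a3 -> a1 'cycles' times."""
    x, y, z = prices  # prices (in cents, hypothetical) paid for a1->a2, a2->a3, a3->a1
    return cycles * (x + y + z)

# One traversal returns the agent to option a1, 40 cents poorer: a certain loss.
print(money_pump((10, 25, 5)))

# Repeating the argument compounds the loss indefinitely.
print(money_pump((10, 25, 5), cycles=10))
```

Any strictly positive prices exhibit the same certain loss, which is the operational content of the transitivity requirement.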
Proposition 2.2. (Derived transitive properties).
(i) If a1 ~ a2 and a2 ~ a3, then a1 ~ a3; if E1 ~ E2 and E2 ~ E3, then E1 ~ E3.
(ii) If a1 < a2 and a2 ≤ a3, then a1 < a3; if E1 < E2 and E2 ≤ E3, then E1 < E3.

Proof. To prove (i), let a1 ~ a2 and a2 ~ a3, so that, by Definition 2.2, a1 ≤ a2, a2 ≤ a1, a2 ≤ a3 and a3 ≤ a2. Then, by Axiom 2(ii), a1 ≤ a3 and a3 ≤ a1, and thus a1 ~ a3. A similar argument applies to events, using Proposition 2.1. Part (ii) follows rather similarly: if a1 < a2 and a2 ≤ a3 then, by Axiom 2(ii), a1 ≤ a3; if, in addition, a3 ≤ a1, then a2 ≤ a3 and a3 ≤ a1 would give a2 ≤ a1, contradicting a1 < a2; hence a1 < a3.

Axiom 3. (Consistency of preferences).
(i) If c1 ≤ c2 then, for any event G > ∅ and any option a, {c1 | G, a | G^c} ≤ {c2 | G, a | G^c}.
(ii) If, for some c, {a1 | G, c | G^c} ≤ {a2 | G, c | G^c}, then, for any option a, {a1 | G, a | G^c} ≤ {a2 | G, a | G^c}.
(iii) If, for some c1 < c2, {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c}, then this preference holds for any c1 < c2.

Discussion of Axiom 3. Condition (i) formalises the idea that preferences between pure consequences should not be affected by the acquisition of further information regarding the uncertain events in E. Condition (ii) asserts that if we have the stated preference for some c, then we should have it whatever is substituted for c. This latter condition is a version of what might be called the sure-thing principle: if two situations are such that, whatever the outcome of the first, there is a preferable corresponding outcome of the second, then the second situation is preferable overall. Condition (iii) asserts that if we have the preference {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c} for some c1 < c2, then we should have this preference for any c1 < c2. This formalises the intuitive idea that the stated preference should only depend on the "relative likelihood" of E and F and should not depend on the particular consequences used in constructing the options. Conditions (ii) and (iii) ensure that Definitions 2.3 and 2.4 have operational content.

An important implication of Axiom 3 is that preferences between consequences are invariant under changes in the information "origin" regarding events in E.

Proposition 2.3. (Invariance of preferences between consequences). c1 ≤ c2 if and only if there exists G > ∅ such that c1 ≤_G c2; moreover, c1 ≤ c2 implies c1 ≤_G c2 for all G > ∅.

Proof. If c1 ≤ c2 then, by Axiom 3(i), for any G > ∅ and any option a, {c1 | G, a | G^c} ≤ {c2 | G, a | G^c}; hence, by Definition 2.4, c1 ≤_G c2 for all G > ∅. Conversely, suppose c1 ≤_G c2 for some G > ∅ and yet c2 < c1; then an application of Definition 2.4 and Axiom 3 would yield G ~ ∅, thus contradicting G > ∅. Hence c1 ≤ c2.
Another important consequence of Axiom 3 is that uncertainty orderings of events respect logical implications, in the sense that if E logically implies F, i.e. if E ⊆ F, then F cannot be considered less likely than E.

Proposition 2.4. (Monotonicity). If E ⊆ F, then E ≤ F; in particular, ∅ ≤ E ≤ Ω for any event E.

Proof. If E ⊆ F then, for any c1 < c2,
{c2 | E, c1 | E^c} = {c2 | E, c1 | F − E, c1 | F^c} and {c2 | F, c1 | F^c} = {c2 | E, c2 | F − E, c1 | F^c}.
Since c1 ≤ c2, it follows from Axiom 3 that {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c} and hence, from Definition 2.3, that E ≤ F. In particular, since ∅ ⊆ E ⊆ Ω, we have ∅ ≤ E ≤ Ω.

This last result is an example of how coherent qualitative comparisons of uncertain events in terms of the "not more likely" relation conform to intuitive requirements. We shall mostly work, however, with events for which this ordering is strict, the "significant" events.

Definition 2.5. (Significant events). An event E is significant given G > ∅ if c1 <_G c2 implies that c1 <_G {c2 | E, c1 | E^c} <_G c2. If G = Ω, we shall simply say that E is significant.

Intuitively, if E is judged to be significant given G then, given G, one would strictly prefer the option {c2 | E, c1 | E^c} to c1 for sure, since it provides an additional perceived possibility of obtaining the more desirable consequence c2; similarly, one would strictly prefer c2 for sure to the stated option.

Proposition 2.5. (Characterisation of significant events). An event E is significant given G > ∅ if and only if ∅ < E ∩ G < G; E is significant if and only if ∅ < E < Ω.

Proof. Given G > ∅ and assuming c1 <_G c2, suppose ∅ < E ∩ G < G. Then, using Definitions 2.4 and 2.5 and Proposition 2.4, c1 <_G {c2 | E, c1 | E^c} <_G c2, so that E is significant given G. Conversely, if E ∩ G ~ ∅ then {c2 | E, c1 | E^c} ~_G c1, and if E ∩ G ~ G then {c2 | E, c1 | E^c} ~_G c2, so that in neither case is E significant given G. If G = Ω, then E ∩ G = E and the condition reduces to ∅ < E < Ω.

Thus, significant events given G are those operationally perceived by the decision-maker as "practically possible but not certain" given the information provided by G.
2.3.3 Quantification

The notion of preference between options, formalised by the binary relation ≤, provides a qualitative basis for comparing options and, by extension, for comparing consequences and events. The coherence axioms (Axioms 1 to 3) then provide a minimal set of rules to ensure that qualitative comparisons based on ≤ cannot have intuitively undesirable implications. We shall argue shortly that this purely qualitative framework is inadequate for serious, systematic comparisons of options.

First, however, we note that the operational essence of "learning from experience" is that a decision-maker's preferences may change in passing from one state of information to a new state brought about by the acquisition of further information regarding the occurrence of events in E, which leads to changes in assessments of uncertainty. There are, however, too many complex ways in which such changes in assessments can take place for us to be able to capture the idea in a simple form. On the other hand, the very special case in which preferences do not change is easy to describe in terms of the concepts thus far available to us.

Definition 2.6. (Pairwise independence of events). We say that E and F are (pairwise) independent, denoted by E ⊥ F, if, and only if, for all c, c1, c2:
(i) c ⋈ {c2 | E, c1 | E^c} if and only if c ⋈_F {c2 | E, c1 | E^c};
(ii) c ⋈ {c2 | F, c1 | F^c} if and only if c ⋈_E {c2 | F, c1 | F^c};
where ⋈ is any one of the relations <, ~ or >.

We interpret E ⊥ F as "E is independent of F". The definition is given for the simple situation of preferences between pure consequences and dichotomised options; the condition stated captures, in an operational form, the notion that uncertainty judgements about E are unaffected by the additional information F. An alternative characterisation will be given in Proposition 2.13.
Consider, for example, the relations not heavier than, not longer than and not hotter than, representing, respectively, qualitative orderings of weight, length and temperature. It is abundantly clear that these cannot suffice, as they stand, as an adequate basis for the physical sciences. Instead, we need to introduce in each case some form of quantification by setting up a standard unit of measurement, such as the kilogram, the metre, or the centigrade interval, together with an (implicitly) continuous scale, such as arbitrary decimal fractions of a kilogram, a metre, or a centigrade interval. This enables us to assign a numerical value, representing weight, length, or temperature, to any given physical or chemical entity. Precision, through quantification, is thus achieved by introducing some form of numerical standard into a context already equipped with a coherent qualitative ordering relation. This can be achieved by carrying out, implicitly or explicitly, a series of qualitative pairwise comparisons of the feature of interest with appropriately chosen points on the standard scale. For example, in quantifying the length of a stick, we place one end against the origin of a metre scale and then use a series of qualitative comparisons, based on "not longer than" (and derived relations, such as "strictly longer than"). If the stick is "not longer than" the scale mark of 2.5 metres, but is "strictly longer than" the scale mark of 2.4 metres, we might lazily report that the stick is "2.45 metres long". If we needed to, we could continue to make qualitative comparisons of this kind with finer subdivisions of the scale, thus extending the number of decimal places in our answer. The example is, of course, a trivial one, but the general point is extremely important. We shall regard it as essential to be able to aspire to some kind of quantitative precision in the context of comparing options. It is therefore necessary that we have available some form of standard options, whose definitions have close links with an easily understood numerical scale, and which will play a role analogous to the standard metre or standard kilogram. As a first step towards this, we make the following assumption about the algebra of events.

Axiom 4. (Existence of standard events). There exists a subalgebra S of E, and a function μ : S → [0, 1], such that:
(i) S1 ≤ S2 if, and only if, μ(S1) ≤ μ(S2);
(ii) S1 ∩ S2 = ∅ implies that μ(S1 ∪ S2) = μ(S1) + μ(S2);
(iii) for any number α in [0, 1], there is a standard event S such that μ(S) = α;
(iv) S1 ⊥ S2 implies that μ(S1 ∩ S2) = μ(S1)μ(S2);
(v) if E ⊥ F, E ⊥ S and F ⊥ S, then E ~ S implies E ~_F S.

Discussion of Axiom 4. A family of events satisfying conditions (i) and (ii) is easily identified by imagining an idealised roulette wheel of unit circumference, with events corresponding to the ball landing within specified connected arcs, or finite unions and intersections of such arcs. We suppose that no point on the circumference is "favoured" as a resting place for the ball (considered as a point), in the sense that options {c1 | S1, c2 | S1^c} and {c1 | S2, c2 | S2^c} are considered equivalent if and only if μ(S1) = μ(S2), where μ is the function mapping the "arc-event" to its total length.
Conditions (i) and (ii) are then intuitively obvious. Condition (iv) is clearly satisfied if we think of the circumferences for two independent plays as unravelled to form the sides of a unit square: the underlying image is then that of a point landing in the unit square, with S denoting a region of area μ(S) and μ mapping events to the areas they define. The remainder of (iii) is intuitively obvious; standard events with, for example, dyadic measures correspond to binary expansions consisting of zeros from some specified location onwards. Finally, (v) encapsulates an obviously desirable consequence of independence: namely, that if E is independent of F and S, and F is independent of S, then a judgement of equivalence between E and S should not be affected by the occurrence of F.

Other forms of standard family satisfying (i) to (v) are easily imagined. For example, it is obvious that a roulette wheel of unit circumference could be imagined cut at some point and "unravelled" to form a unit interval. The underlying image would then be that of a point landing in the unit interval, with an event S such that μ(S) = l denoting a subinterval of length l, the discussion for the unit interval being virtually identical to that given for the roulette wheel. The obvious intuitive content of conditions (i) to (v) can clearly be similarly motivated in these cases.

We will refer to S as a standard family of events in E and will think of E as the algebra generated by the relevant events in the decision problem together with the elements of S. Note that S is required to be an algebra, and thus both ∅ and Ω are standard events. It follows from Proposition 2.4 and Axiom 4(i) that μ(∅) = 0 and μ(Ω) = 1.

It is important to emphasise that we do not require the assumption that standard families of events actually, physically exist, or could be precisely constructed in accordance with conditions (i) to (v). We only require that we can invoke such a set-up as a mental image. We note first that the basic idea of an idealised roulette wheel does assume that each "play" on such a wheel is "independent", in the sense of Definition 2.6, of any other events, including previous "plays" on the same wheel. Thus, for any events E, F in E, we can always think of an "independent" play which generates independent events S in S with μ(S) = α for any specified α in (0, 1). There is, of course, an element of mathematical idealisation involved in thinking about all μ ∈ [0, 1], rather than, say, some subset of the rationals reflecting the inherent limits of accuracy in any actual procedure for determining arc lengths or areas. Our argument for accepting this degree of mathematical idealisation in setting up our formal system is the same as would apply in the physical sciences, where quantities are taken, in principle, to be real numbers rather than a subset of the rationals chosen to reflect the limits of accuracy in the physical measurement procedure being employed: many irrelevant technical difficulties are avoided, while no serious conceptual distortion is introduced. The same is true, of course, of all scientific discourse in which measurements are taken.
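The unit-interval image of a standard family can be made concrete with a toy computational model (an illustrative sketch of Axiom 4's conditions, not the book's formal construction): standard events are finite unions of disjoint subintervals of [0, 1], μ assigns total length, disjoint unions add their measures, and the "unit square" picture of two independent plays multiplies them.

```python
# Toy model of a standard family S on the unit interval: an event is a list
# of disjoint (a, b) subintervals, and mu is its total length.

def mu(event):
    """Measure of a standard event: sum of the lengths of its intervals."""
    return sum(b - a for a, b in event)

# Condition (iii): for any alpha in [0, 1], construct S with mu(S) = alpha.
def standard_event(alpha):
    return [(0.0, alpha)]

# Condition (ii): disjoint standard events satisfy mu(S1 u S2) = mu(S1) + mu(S2).
S1 = [(0.0, 0.2)]
S2 = [(0.5, 0.8)]
print(mu(S1 + S2))  # length of the union of two disjoint events

# Condition (iv): two independent plays unravelled as the sides of a unit
# square; the joint event is a rectangle whose area multiplies the measures.
def joint_measure(Sa, Sb):
    return mu(Sa) * mu(Sb)

print(joint_measure(S1, S2))
```

The interval representation is one convenient mental image among several; the axioms only require that some such family can be invoked.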
Proposition 2.6. (Collections of disjoint standard events). For any finite collection {α1, ..., αn} of real numbers such that αj ≥ 0 and α1 + ... + αn ≤ 1, there exists a corresponding collection {S1, ..., Sn} of disjoint standard events such that μ(Sj) = αj, 1 ≤ j ≤ n.

Proof. By Axiom 4(iii) there exists S1 such that μ(S1) = α1. Suppose inductively that S1, ..., Sj are disjoint, with μ(Si) = αi, i = 1, ..., j, and define Bj = S1 ∪ ... ∪ Sj, so that μ(Bj) = α1 + ... + αj. By Axiom 4(iii, iv), there exists Tj+1 in S, independent of Bj, such that μ(Tj+1) = αj+1/(1 − μ(Bj)). Define Sj+1 = Tj+1 ∩ Bj^c. Then Sj+1 and S1, ..., Sj are disjoint, and μ(Sj+1) = μ(Tj+1 ∩ Bj^c) = μ(Bj^c){αj+1/(1 − μ(Bj))} = αj+1, and the result follows.

Axiom 5. (Precise measurement of preferences and uncertainties).
(i) For any consequences c1 ≤ c ≤ c2, there exists a standard event S such that c ~ {c2 | S, c1 | S^c}.
(ii) For each event E, there exists a standard event S such that E ~ S.

Discussion of Axiom 5. In the introduction to this section, we discussed the idea of precision through quantification and pointed out, using analogies with other measurement systems such as weight, length and temperature, that the process is based on successive comparisons with a standard. The essence of Axiom 5(i) is that any given consequence c, such that c1 ≤ c ≤ c2, can be "sandwiched" arbitrarily tightly between standard options and, in the limit, be judged equivalent to one of the standard options defined in terms of c1 and c2. We start with the obvious preferences {c2 | ∅, c1 | Ω} ≤ c ≤ {c2 | Ω, c1 | ∅}, and then begin to explore comparisons with standard options: letting Sx denote a standard event such that μ(Sx) = x and gradually increasing x away from 0 and decreasing y away from 1, we arrive at comparisons such as {c2 | Sx, c1 | Sx^c} ≤ c ≤ {c2 | Sy, c1 | Sy^c}, with 0 < x < y < 1 and the difference y − x becoming increasingly small. Intuitively, as we increase x, {c2 | Sx, c1 | Sx^c} becomes more and more "attractive" as an option, whereas, as we decrease y, {c2 | Sy, c1 | Sy^c} becomes less "attractive".

The step from the finite to the infinite implicit in making use of real numbers raises issues such as those concerning the non-closure of a set of numbers with respect to operations of interest. Our view is that, from the perspective of the foundations of decision-making, this step is simply a pragmatic convenience, whereas the step from comparing a finite set of possibilities to comparing an infinite set has more substantive implications. This argument is not universally accepted, and a related discussion of the issue is provided in Section 2.8. We have emphasised this latter point by postponing infinite extensions of the decision framework until Chapter 3.
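The inductive construction in the proof of Proposition 2.6 has a simple form in the unit-interval image of the standard family (an illustrative model, not the formal proof): given α1, ..., αn with each αj ≥ 0 and Σαj ≤ 1, disjoint standard events of the required measures can be laid down as adjacent subintervals.

```python
# Unit-interval sketch of Proposition 2.6: construct disjoint standard
# events (here, subintervals of [0, 1]) with mu(S_j) = alpha_j.
# The specific alphas below are hypothetical examples.

def disjoint_standard_events(alphas):
    assert all(a >= 0 for a in alphas) and sum(alphas) <= 1 + 1e-12
    events, left = [], 0.0
    for a in alphas:
        events.append((left, left + a))  # next interval of length a
        left += a                        # starts where the previous one ends
    return events

S = disjoint_standard_events([0.1, 0.25, 0.4])
print(S)
print([round(b - a, 10) for a, b in S])  # measures recovered as lengths
```

The formal proof achieves the same effect abstractly, using independence (Axiom 4(iv)) to carve each new event out of the complement of the union of its predecessors.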
Condition (ii) extends the idea of precise comparison to events. For any event E and for all consequences c1 < c2, the option {c2 | E, c1 | E^c} can be compared precisely with the family of standard options {c2 | Sx, c1 | Sx^c}. We can begin with the obvious ordering {c2 | ∅, c1 | Ω} ≤ {c2 | E, c1 | E^c} ≤ {c2 | Ω, c1 | ∅}, and then consider refinements of this with options of the form {c2 | Sx, c1 | Sx^c}, x increasing gradually from 0, and {c2 | Sy, c1 | Sy^c}, y decreasing gradually from 1, so that, in terms of the ordering of the events, Sx ≤ E ≤ Sy. Again, the essence of the axiom is that this "sandwiching" can be refined arbitrarily closely by an increasing sequence of x's and a decreasing sequence of y's tending to a common limit, in the sense that we judge Sx ≤ E ≤ Sy with y − x becoming increasingly small.

The preceding argument certainly again involves an element of mathematical idealisation, given the intuitive content of the relation "not more likely than". In practice, there might, in fact, be some interval of indifference: we judge Sx ≤ E ≤ Sy for some (possibly rational) x and y, but feel unable to express a more precise form of preference. This is analogous to the situation where a physical measuring instrument has inherent limits, enabling one to conclude that a reading is in the range 3.126 to 3.135, say, but not permitting a more precise statement. In this case, we would typically report the measurement to be 3.13 and proceed as if this were a precise measurement. Every physicist or chemist knows that there are inherent limits of accuracy in any given laboratory context but, so far as we know, no one has suggested developing the structures of theoretical physics or chemistry on the assumption that quantities appearing in fundamental equations should be constrained to take values in some subset of the rationals. We formulate the theory on the prescriptive assumption that we aspire to exact measurement (exact comparisons in our case), whilst acknowledging that, in practice, we have to make do with the best level of precision currently available (or devote some resources to improving our measuring instruments!).

In the context of measuring beliefs, several authors have suggested that this imprecision be formally incorporated into the axiom system. For many applications, this would seem to us an unnecessary confusion of the prescriptive and the descriptive. However, it may well be that there are situations where imprecision in the context of comparing consequences is too basic and problematic a feature to be adequately dealt with by an approach based on theoretical precision, tempered with pragmatically acknowledged approximation. We shall return to this issue in Section 2.8.
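The "sandwiching" process of Axiom 5 can be mimicked computationally. In the sketch below (a hypothetical illustration, not part of the formal development), a comparison oracle stands in for the decision-maker's qualitative judgements, and repeated comparisons of E against standard events Sx tighten the bracket [x, y] until the two bounds approach a common limit.

```python
# Sketch of precise measurement by successive comparison with a standard.
# 'not_more_likely(x)' plays the role of the qualitative judgement
# "E is not more likely than a standard event of measure x"; here we
# simulate a decision-maker whose underlying degree of belief is p_true.

def sandwich(not_more_likely, tol=1e-6):
    """Tighten the bracket x <= P(E) <= y by repeated qualitative comparisons."""
    x, y = 0.0, 1.0  # obvious initial ordering: S_0 <= E <= S_1
    while y - x > tol:
        mid = (x + y) / 2
        if not_more_likely(mid):  # E <= S_mid: decrease y towards the limit
            y = mid
        else:                     # S_mid < E: increase x towards the limit
            x = mid
    return x, y

p_true = 0.3217  # hypothetical degree of belief, unknown to the procedure
x, y = sandwich(lambda m: p_true <= m)
print(x, y)  # a bracket of width at most tol containing p_true
```

An "interval of indifference" corresponds to stopping the loop early, at a coarser tolerance, exactly as a physical instrument stops at its inherent limit of accuracy.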
The particular standard option to which c is judged equivalent will, of course, depend on c1 and c2.

2.4 BELIEFS AND PROBABILITIES

2.4.1 Representation of Beliefs

It is clear that an individual's preferences among options in any decision problem should depend, at least in part, on the "degrees of belief" which that individual attaches to the uncertain events forming part of the definitions of the options. The principles of coherence and quantification by comparison with a standard, expressed in axiomatic form in the previous section, will enable us to give a formal definition of degree of belief, thus providing a numerical measure of the uncertainty attached to each event. The conceptual basis for this numerical measure will be seen to derive from the formal rules governing quantitative, coherent preferences, irrespective of the nature of the uncertain events under consideration. This is in vivid contrast to what are sometimes called the classical and frequency approaches to defining numerical measures of uncertainty (see Section 2.8), where the existence of symmetries and the possibility of indefinite replication, respectively, play fundamental roles in defining the concepts for restricted classes of events. We cannot emphasise strongly enough the important distinction between defining a general concept and evaluating a particular case: although our definition will depend only on the logical notions of quantitative, coherent preferences, our practical evaluations will often make use of perceived symmetries and observed frequencies.

Proposition 2.3 implies that our "attitudes" or "values" regarding consequences are fixed throughout the analysis of any particular decision problem. Our valuation of consequences may, of course, change over time, but we have implicitly assumed that it does not depend on any information we might have concerning the occurrence of real-world events. It is intuitively obvious that, if the time-scale on which values change were not rather long compared with the time-scale within which individual problems are analysed, there would be little hope for rational analysis of any kind.

We begin by establishing some basic results concerning the uncertainty relation between events.

Proposition 2.7. (Complete comparability of events). Either E1 > E2, or E2 > E1, or E1 ~ E2.

Proof. By Axiom 5(ii), there exist S1 and S2 such that E1 ~ S1 and E2 ~ S2; the complete ordering now follows from Axiom 4(i) and Proposition 2.1.
or a computer generated 'random' integer being an odd number.34 2 Foundations We see from Proposition 2.lu(CB)u(CnB) 5 DU(BC)u(CnB)=BuD. Thus. 4 (13 5 u4. that the uncertainty relation induced between events is complete. maybe a conceptual perfect coin s falling heads. the statement P ( E ) = 0. C 5 L) A n C = B n D = VJ. .3. the nieaning of a probability statement i s clear. Moreover. a The final statement follows from essentially the same argument. .then A u C < H u L). This definition provides a natural. although the order relation 5 between options was not assumed to be complete (i. i f . With our operational definition. Given an uncertainty relutioti 5.5 precisely means that E is judged to be equally likely a a standard event of 'measure' 0. Definition 2.5.3. and using again Definition 2. for any G. We first show that. if A n G = BnC = 0 then A 5 U e A u G < B U C .. not all pairs of options were assumed to be comparable). A 5 B I a 5 a?.3. If '4 5 R . (Measureof degree of beliej). as a consequence of Axiom 5 (the axiom of precise measurement). by Definition 2.7 that. by linking the equivalence of any E E E to some S E S and exploiting the fact that the nature of the construction of S provides a direct obvious quantification of the uncertainty regarding S. l n G = B n n = = . a1 5 a2 5 n4 . then A U C 5 U u D. it turns out. Proof. .7. For instance. d e f i n e : 1 Then. (Additivity of uncertainty relations). F o r a n y q > r] .8.4 < 1 or 3 C' < D. the probability P ( E ) of an event E is the real number p( S ) associated with any stancklrd event S such that E  S.5. by Axiom 3.. u G 5 R u C.   AuC=.e. A similar result concerning the comparability of all options will be established in Section 2. operational extension of the qualitative uncertainty relation encapsulated in Definition 2. and Proposition 2. We now make the key definition which enables us to move to a yuantirari\*e notion of degree of belief.
Proposition 2.9. (Existence and uniqueness). Given an uncertainty relation ≤, there exists a unique probability P(E) associated with each event E.

Proof. Existence follows from Axiom 5(ii). For uniqueness, note that if E ~ S₁ and E ~ S₂ then, by Proposition 2.7, S₁ ~ S₂ and hence, by Axiom 4(i), μ(S₁) = μ(S₂).

Definition 2.8. (Compatibility). A function f: ℰ → ℜ is said to be compatible with an order relation ≤ on ℰ × ℰ if, for all events E and F, E ≤ F iff f(E) ≤ f(F).

Proposition 2.10. (Compatibility of probability and degrees of belief). The probability function P(·) is compatible with the uncertainty relation ≤.

Proof. By Axiom 5(ii), there exist standard events S₁ and S₂ such that E ~ S₁ and F ~ S₂. By Proposition 2.7, E ≤ F iff S₁ ≤ S₂ and hence, by Axiom 4(i), iff μ(S₁) ≤ μ(S₂); the result follows from Definition 2.7.

The following proposition is of fundamental importance. It establishes that coherent, quantitative degrees of belief have the structure of a finitely additive probability measure over ℰ. Moreover, it establishes that significant events, events which are "practically possible but not certain", should be assigned probability values in the open interval (0, 1).

Proposition 2.11. (Probability structure of degrees of belief).
(i) P(∅) = 0 and P(Ω) = 1;
(ii) if E ∩ F = ∅, then P(E ∪ F) = P(E) + P(F);
(iii) E is significant if, and only if, 0 < P(E) < 1.

It should be emphasised that, within the framework we are discussing, probabilities are always personal degrees of belief, in the sense that they are a numerical representation of the decision-maker's personal uncertainty relation ≤ between events. Moreover, probabilities are always conditional on the information currently available; it makes no sense, according to Definition 2.7, to qualify the word probability with adjectives such as "objective", "correct" or "unconditional". Since probabilities are obviously conditional on the initial state of information M₀, a more precise and revealing notation in Definition 2.7 would have been P(E | M₀). In order to avoid cumbersome notation, we shall stick to the shorter version, but the implicit conditioning on M₀ should always be borne in mind.
Proof. (i) For any standard event S, ∅ ≤ S ≤ Ω, so that, by Proposition 2.10, P(∅) ≤ μ(S) ≤ P(Ω) for every μ(S) in [0, 1]; since 0 ≤ P(E) ≤ 1 for all E, it follows that P(∅) = 0 and P(Ω) = 1.
(ii) If E = ∅ or F = ∅, the result is trivially true. If E > ∅ and F > ∅, let α = P(E) and β = P(E ∪ F); since E ∪ F > E, we have α < β. By Axiom 4(iii), there exist disjoint standard events S₁ and S₂ such that P(S₁) = α and P(S₂) = β − α. If F > S₂ then, by Proposition 2.8, E ∪ F > S₁ ∪ S₂ and hence P(E ∪ F) > β, which is impossible; similarly, if F < S₂ then E ∪ F < S₁ ∪ S₂ and P(E ∪ F) < β, which, again, is impossible. Hence F ~ S₂, so that P(F) = β − α and P(E ∪ F) = P(E) + P(F).
(iii) By Proposition 2.10, E is significant iff ∅ < E < Ω, i.e., iff 0 < P(E) < 1.

Corollary. (Finitely additive structure of degrees of belief).
(i) If {E_j, j ∈ J} is a finite collection of disjoint events, then P(∪_j E_j) = Σ_j P(E_j).
(ii) For any event E, P(Eᶜ) = 1 − P(E).

Proof. The first part follows by induction from Proposition 2.11(ii); the second part is a special case of (i), since if ∪_j E_j = Ω then Σ_j P(E_j) = 1.

In short, coherent degrees of belief are probabilities. This result is crucial: it establishes formally that coherent, quantitative measures of uncertainty about events must take the form of probabilities, therefore justifying the nomenclature adopted in Definition 2.6 for this measure of degree of belief.

Definition 2.9. (Probability distribution). If {E_j, j ∈ J} is a finite partition of Ω, with P(E_j) = p_j, then {p_j, j ∈ J} is said to be a probability distribution over the partition.

The idea is that total belief (in Ω, having measure 1) is distributed among the events of the partition according to the relative degrees of belief {p_j, j ∈ J}. It will often be convenient for us to use probability terminology of this kind without explicit reference to the fact that the mathematical structure is merely serving as a representation of (personal) degrees of belief. The latter fact should, however, be constantly borne in mind.
This terminology will prove useful in later discussions.

Starting from the qualitative ordering among events, we have derived a quantitative measure over ℰ, compatible with the qualitative ordering ≤, and shown that, expressed in conventional mathematical terminology, it has the form of a finitely additive probability measure. We now establish that this is the only probability measure over ℰ compatible with ≤.

Proposition 2.12. (Uniqueness of the probability measure). P is the only probability measure compatible with the uncertainty relation ≤.

Proof. If P′ were another compatible measure, then, by Proposition 2.10, we would always have P′(E) ≤ P′(F) iff P(E) ≤ P(F), so that there exists a monotonic function f of [0, 1] into itself such that P′(E) = f{P(E)}. By Axiom 4(iii), for all non-negative α, β with α + β ≤ 1, there exist disjoint standard events S₁ and S₂ such that P(S₁) = α and P(S₂) = β. Hence,

f(α + β) = P′(S₁ ∪ S₂) = P′(S₁) + P′(S₂) = f(α) + f(β),

and so (Eichhorn, 1978, p. 63) f(α) = kα for all α in [0, 1]. But P′(Ω) = P(Ω) = 1, so that k = 1 and we have P′(E) = P(E) for all E.

We shall now establish that our operational definition of (pairwise) independence of events is compatible with its more standard, ad hoc, product definition.

Proposition 2.13. (Characterisation of independence). E ⊥ F iff P(E ∩ F) = P(E)P(F).

Proof. Suppose E ⊥ F. By Axiom 4(v), there exist standard events S₁ and S₂ such that E ~ S₁ and F ~ S₂, with E ⊥ S₁, F ⊥ S₁ and S₁ ⊥ S₂. For any consequences c₁ < c₂ we then have, given F,

{c₂ | E ∩ F, c₁ | (E ∩ F)ᶜ} ~ {c₂ | S₁ ∩ F, c₁ | S₁ᶜ ∩ F, c₁ | Fᶜ},

so that E ∩ F ~ S₁ ∩ F; similarly, S₁ ∩ F ~ S₁ ∩ S₂ and hence, by Proposition 2.10 and the construction of the standard events, P(E ∩ F) = P(S₁ ∩ S₂) = P(S₁)P(S₂) = P(E)P(F).

Conversely, suppose P(E ∩ F) = P(E)P(F). By Axiom 4(iii), there exists a standard event S such that P(S) = P(F), F ~ S and E ⊥ S. Then, by the first part of the proof, P(E ∩ S) = P(E)P(S) = P(E)P(F) = P(E ∩ F), so that, by Proposition 2.10, E ∩ F ~ E ∩ S, hence establishing that E ⊥ F.
A similar argument can obviously be given reversing the roles of E and F.

2.4.2 Revision of Beliefs and Bayes' Theorem

The assumed occurrence of a real-world event will typically modify preferences between options by modifying the degrees of belief attached, by an individual, to the events defining the options. In this section, we use the assumptions of Section 2.3 in order to identify the precise way in which coherent modification of initial beliefs should proceed.

The starting point for analysing order relations between events, given the assumed occurrence of a possible event G, is the conditional uncertainty relation ≤_G: given the assumed occurrence of G > ∅, the ordering ≤ between acts is replaced by ≤_G. Analogues of the earlier propositions are trivially established and we recall (Proposition 2.3) that, for consequences, c₂ ≤_G c₁ iff c₂ ≤ c₁.

Proposition 2.14. (Properties of conditional beliefs). For any G > ∅:
(i) E ≤_G F iff E ∩ G ≤ F ∩ G;
(ii) if there exist c₂ > c₁ such that {c₂ | E, c₁ | Eᶜ} ≤_G {c₂ | F, c₁ | Fᶜ}, then E ≤_G F.

Proof. By the definition of ≤_G, and substituting equivalent options, E ≤_G F iff, for all c₂ ≥ c₁,

{c₂ | E ∩ G, c₁ | (E ∩ G)ᶜ} ≤ {c₂ | F ∩ G, c₁ | (F ∩ G)ᶜ},

and this is true iff E ∩ G ≤ F ∩ G. The second part then follows from Axiom 3(ii) and part (i) of this proposition.
Definition 2.10. (Conditional measure of degree of belief). Given a conditional uncertainty relation ≤_G, G > ∅, the conditional probability P(E | G) of an event E, given the assumed occurrence of G, is the real number μ(S) such that E ~_G S, where S is a standard event independent of G.

Generalising the idea encapsulated in Definition 2.7, P(E | G) provides a quantitative, operational measure of the uncertainty attached to E given the assumed occurrence of the event G. The intention of the terminology used above is to emphasise the additional conditioning resulting from the occurrence of G. We have, of course, already stressed that all degrees of belief are conditional: the initial state of information M₀, although omitted throughout for notational convenience, is always present as a conditioning factor.

The following fundamental result provides the key to the process of revising beliefs in a coherent manner in the light of new information. It relates the conditional measure of degree of belief P(· | G) to the initial measure of degree of belief P(·).

Proposition 2.15. (Conditional probability). For any G > ∅,

P(E | G) = P(E ∩ G) / P(G).

Proof. By Axiom 4(iii), there exists a standard event S, independent of G, such that μ(S) = P(E ∩ G)/P(G). By Proposition 2.13,

P(S ∩ G) = P(S)P(G) = μ(S)P(G) = P(E ∩ G).

Hence, by Proposition 2.10, S ∩ G ~ E ∩ G and thus, by Proposition 2.14, S ~_G E. By Definition 2.10, therefore, P(E | G) = μ(S) = P(E ∩ G)/P(G).
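The identity P(E | G) = P(E ∩ G)/P(G) just established lends itself to direct numerical illustration. The sketch below is not from the text: the two-dice sample space, the uniform judgement over its 36 outcomes, and all function names are assumptions made purely for illustration.

```python
from fractions import Fraction

# Sample space: ordered pairs of two die rolls, each outcome judged equally likely.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def prob(event):
    """P(E) for a finite, uniform sample space (illustrative only)."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond_prob(event, given):
    """P(E | G) = P(E ∩ G) / P(G)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

E = lambda w: w[0] + w[1] == 7   # the two dice sum to seven
G = lambda w: w[0] == 3          # the first die shows three

# Conditioning on G leaves exactly one favourable outcome among six.
p = cond_prob(E, G)
```

Note that here P(E ∩ G) = P(E)P(G), so conditioning on G leaves the degree of belief in E unchanged, anticipating the characterisation of independence discussed later in the section.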
Note that, in our formulation, P(E | G) = P(E ∩ G)/P(G) is a logical derivation from the axioms, not an ad hoc definition. In fact, this is the simplest version of Bayes' theorem; an extended form is given later in Proposition 2.19.

Proposition 2.16. (Compatibility of conditional probability and conditional degrees of belief). E ≤_G F iff P(E | G) ≤ P(F | G).

Proof. By Proposition 2.14, E ≤_G F iff E ∩ G ≤ F ∩ G which, by Proposition 2.10, holds if and only if P(E ∩ G) ≤ P(F ∩ G); the result now follows from Proposition 2.15.

We now extend Proposition 2.11 to degrees of belief conditional on the occurrence of significant events.

Proposition 2.17. (Probability structure of conditional degrees of belief). For any event G > ∅:
(i) P(∅ | G) = 0 and P(Ω | G) = 1;
(ii) if E ∩ F ∩ G = ∅, then P(E ∪ F | G) = P(E | G) + P(F | G);
(iii) E is significant given G iff 0 < P(E | G) < 1.

Proof. (i) By Proposition 2.15, P(∅ | G) = P(∅)/P(G) = 0; moreover, since Ω ∩ G = G, P(Ω | G) = P(G)/P(G) = 1. (ii) If E ∩ F ∩ G = ∅ then, by Propositions 2.11 and 2.15, P(E ∪ F | G) = [P(E ∩ G) + P(F ∩ G)]/P(G) = P(E | G) + P(F | G). (iii) By Proposition 2.16, E is significant given G iff ∅ <_G E <_G Ω, i.e., iff 0 < P(E ∩ G) < P(G) and thus, by Proposition 2.15, iff 0 < P(E | G) < 1.

Corollary. (Finitely additive structure of conditional degrees of belief). For all G > ∅:
(i) if {E_j, j ∈ J} is a finite collection of disjoint events, then P(∪_j E_j | G) = Σ_j P(E_j | G);
(ii) for any event E, P(Eᶜ | G) = 1 − P(E | G).

Proof. This parallels the proof of the Corollary to Proposition 2.11.
Proposition 2.18. (Uniqueness of the conditional probability measure). P(· | G) is the only probability measure compatible with the conditional uncertainty relation ≤_G.

Proof. This parallels the proof of Proposition 2.12.

The following example provides an instructive illustration of the way in which the formalism of conditional probabilities provides a coherent resolution of an otherwise seemingly paradoxical situation.

Example 2.1. (Simpson's paradox). Suppose that the results of a clinical trial involving 800 sick patients are as shown in Table 2.1, where recovery and the receipt of treatment by individuals are represented as events: T, Tᶜ denote, respectively, that patients did or did not receive a certain treatment, and R, Rᶜ denote, respectively, that the patients did or did not recover.

Table 2.1  Trial results for all patients

          R      Rᶜ     Total    Recovery rate
  T      200    200      400         50%
  Tᶜ     160    240      400         40%

Intuitively, it seems clear that the treatment is beneficial and, were one to base probability judgements on these reported figures, it would seem reasonable to specify P(R | T) = 0.5, P(R | Tᶜ) = 0.4.

Suppose now, however, that one became aware of the trial outcomes for male and female patients separately, and that these have the summary forms described in Tables 2.2 and 2.3.

Table 2.2  Trial results for male patients

          R      Rᶜ     Total    Recovery rate
  T      180    120      300         60%
  Tᶜ      70     30      100         70%

The results surely seem paradoxical. Tables 2.2 and 2.3 tell us that the treatment is beneficial neither for males nor for females, but Table 2.1 tells us that overall it is beneficial! How are we to come to a coherent view in the light of this apparently conflicting evidence?
Table 2.3  Trial results for female patients

          R      Rᶜ     Total    Recovery rate
  T       20     80      100         20%
  Tᶜ      90    210      300         30%

Were one to base probability judgements on the figures reported in Tables 2.2 and 2.3, it would seem reasonable to specify P(R | M ∩ T) = 0.6, P(R | M ∩ Tᶜ) = 0.7, P(R | Mᶜ ∩ T) = 0.2, P(R | Mᶜ ∩ Tᶜ) = 0.3, with M, Mᶜ denoting, respectively, the events that a patient is either male or female.

The seeming paradox is easily resolved by an appeal to the logic of probability which, after all, we have just demonstrated to be the prerequisite for the coherent treatment of uncertainty. To see that these judgements do indeed cohere with those based on Table 2.1, we note, from the Corollary to Proposition 2.17, that

P(R | T) = P(R | M ∩ T)P(M | T) + P(R | Mᶜ ∩ T)P(Mᶜ | T),
P(R | Tᶜ) = P(R | M ∩ Tᶜ)P(M | Tᶜ) + P(R | Mᶜ ∩ Tᶜ)P(Mᶜ | Tᶜ),

where P(M | T) = 0.75 and P(M | Tᶜ) = 0.25. The probability formalism reveals that the seeming paradox has arisen from the confounding of sex with treatment as a consequence of the unbalanced trial design. See Simpson (1951), Blyth (1972, 1973) and Lindley and Novick (1981) for further discussion.

Proposition 2.19. (Bayes' theorem). For any finite partition {E_j, j ∈ J} of Ω, and G > ∅,

P(E_i | G) = P(G | E_i)P(E_i) / Σ_j P(G | E_j)P(E_j).

Proof. By Proposition 2.15, P(E_i | G) = P(G | E_i)P(E_i)/P(G); the result now follows from the Corollary to Proposition 2.11 applied to G = ∪_j (G ∩ E_j), which gives P(G) = Σ_j P(G | E_j)P(E_j).
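The arithmetic of Example 2.1 can be checked mechanically. The sketch below is illustrative (the code and its function names are ours, not the text's); the counts are those of Tables 2.2 and 2.3, and the aggregate rates of Table 2.1 are recovered by the law of total probability.

```python
from fractions import Fraction

# (sex, treated) -> (recovered, total), from Tables 2.2 and 2.3.
counts = {
    ("M", True): (180, 300), ("M", False): (70, 100),
    ("F", True): (20, 100),  ("F", False): (90, 300),
}

def rate(sex, treated):
    """Sex-specific recovery rate, e.g. P(R | M ∩ T)."""
    r, n = counts[(sex, treated)]
    return Fraction(r, n)

def aggregate_rate(treated):
    """Overall recovery rate: pooling the counts is equivalent to weighting
    each sex-specific rate by P(sex | treatment)."""
    r = sum(counts[(s, treated)][0] for s in "MF")
    n = sum(counts[(s, treated)][1] for s in "MF")
    return Fraction(r, n)

# Within each sex the treatment looks harmful ...
assert rate("M", True) < rate("M", False)
assert rate("F", True) < rate("F", False)
# ... yet in aggregate it looks beneficial: the paradox of Table 2.1.
assert aggregate_rate(True) > aggregate_rate(False)
```

The reversal is driven entirely by the weights P(M | T) = 0.75 and P(M | Tᶜ) = 0.25: the treated group is dominated by the better-recovering sex.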
Bayes' theorem is a simple mathematical consequence of the fact that quantitative coherence implies that degrees of belief should obey the rules of probability. It acquires a particular significance in the case where the uncertain events {E_j, j ∈ J} correspond to an exclusive and exhaustive set of hypotheses about some aspect of the world (for example, in a medical context, the set of possible diseases from which a patient may be suffering) and the event G corresponds to a relevant piece of evidence, or data (for example, the outcome of a clinical test). If we adopt the more suggestive notation E_j = H_j, j ∈ J, G = D, Bayes' theorem may be written in the form

P(H_j | D) = P(D | H_j)P(H_j) / P(D),    j ∈ J,

where P(D) = Σ_j P(D | H_j)P(H_j), characterising the way in which initial beliefs about the hypotheses, P(H_j), j ∈ J, are modified by the data, D, into a revised set of beliefs, P(H_j | D), j ∈ J. This process is seen to depend crucially on the specification of the quantities P(D | H_j), j ∈ J, which reflect how beliefs about obtaining the given data, D, vary over the different underlying hypotheses, thus defining the "relative likelihoods" of the latter. This form of the theorem occurs, in various guises, throughout Bayesian statistics and it is convenient to have a standard terminology available.

Definition 2.11. (Prior, posterior, and predictive probabilities). If {H_j, j ∈ J} are exclusive and exhaustive events (hypotheses), then, for any event (data) D:
(i) P(H_j), j ∈ J, are called the prior probabilities of the H_j;
(ii) P(D | H_j), j ∈ J, are called the likelihoods of the H_j, given D;
(iii) P(H_j | D), j ∈ J, are called the posterior probabilities of the H_j;
(iv) P(D) is called the predictive probability of D implied by the likelihoods and the prior probabilities.

It is important to realise that the terms "prior" and "posterior" only have significance given an initial state of information and relative to an additional piece of information. As usual, we omit explicit notational reference to the initial state of information M₀, so that, for example, P(H_j | D) could more properly be written as P(H_j | M₀ ∩ D). The four elements, P(H_j), P(D | H_j), P(H_j | D) and P(D), are related by Bayes' theorem, which may also be written in the form

P(H_j | D) ∝ P(D | H_j)P(H_j),

since the missing proportionality constant is [P(D)]⁻¹ = [Σ_j P(D | H_j)P(H_j)]⁻¹. Since the {H_j, j ∈ J} form a partition, Σ_j P(H_j | D) = 1 and thus it is always possible to normalise the products P(D | H_j)P(H_j) by dividing by their sum. This form of the theorem is often very useful in applications. From another point of view, it may also be established (Zellner, 1988b) that, under some reasonable desiderata, Bayes' theorem is an optimal information processing system.
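The prior, likelihood, posterior and predictive probabilities just introduced translate directly into a computation: multiply each prior by the corresponding likelihood and normalise by the sum. The sketch below is illustrative only; the hypothesis labels and the numerical values are invented.

```python
from fractions import Fraction

def bayes(prior, likelihood):
    """Posterior over a finite partition {H_j}: P(H_j | D) ∝ P(D | H_j) P(H_j).
    Returns the posterior distribution and the predictive probability P(D)."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    predictive = sum(joint.values())        # P(D), by the law of total probability
    posterior = {h: joint[h] / predictive for h in prior}
    return posterior, predictive

# Hypothetical three-hypothesis example (numbers invented for illustration).
prior = {"H1": Fraction(1, 2), "H2": Fraction(1, 4), "H3": Fraction(1, 4)}
likelihood = {"H1": Fraction(1, 10), "H2": Fraction(1, 2), "H3": Fraction(1, 5)}

posterior, predictive = bayes(prior, likelihood)
```

Because the {H_j} form a partition, the posterior probabilities automatically sum to one; the normalising constant is precisely the predictive probability of the data.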
Here, P(H_j), or, more properly, P(H_j | M₀), represents beliefs prior to conditioning on data D, but posterior to conditioning on whatever history led to the state of information described by M₀; similarly, P(H_j | D), or, more properly, P(H_j | M₀ ∩ D), represents beliefs posterior to conditioning on M₀ and D, but prior to conditioning on any further data which may be obtained subsequent to D. The predictive probability P(D), logically implied by the likelihoods and the prior probabilities, provides a basis for assessing the compatibility of the data D with our beliefs (see Box, 1980). We shall consider this in more detail in Chapter 6.

In simple problems of medical diagnosis, Bayes' theorem often provides a particularly illuminating form of analysis of the various uncertainties involved.

Example 2.2. (Medical diagnosis). For simplicity, let us consider the situation where a patient may be characterised as belonging either to state H₁, or to state H₂, representing the presence or absence, respectively, of a specified disease, and where further information is available in the form of the result of a single clinical test, whose outcome is either positive (suggesting the presence of the disease and denoted by D = T), or negative (suggesting the absence of the disease and denoted by D = Tᶜ). The quantities P(T | H₁) and P(Tᶜ | H₂) represent the true positive and true negative rates of the clinical test (often referred to as the test sensitivity and test specificity, respectively), and P(H₁) represents the prevalence rate of the disease in the population to which the patient is assumed to belong. The systematic use of Bayes' theorem then enables us to understand the manner in which these characteristics of the test combine with the prevalence rate to produce varying degrees of diagnostic discriminatory power. In particular, for a given clinical test of known sensitivity and specificity, we can investigate the range of underlying prevalence rates for which the test has worthwhile diagnostic value.
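The interplay of prevalence, sensitivity and specificity described in the example is easy to explore numerically. The following sketch is ours, not the text's; the sensitivity 0.9 and specificity 0.875 are illustrative values of the order considered in the text.

```python
def diagnostic_posteriors(prevalence, sensitivity, specificity):
    """P(H1 | T) and P(H1 | Tc) by Bayes' theorem for a binary clinical test."""
    # Predictive probability of a positive result, P(T).
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    post_pos = sensitivity * prevalence / p_pos            # P(H1 | T)
    p_neg = 1 - p_pos                                      # P(Tc)
    post_neg = (1 - sensitivity) * prevalence / p_neg      # P(H1 | Tc)
    return post_pos, post_neg

# Low prevalence: even a positive result leaves the disease fairly unlikely.
lo_pos, lo_neg = diagnostic_posteriors(0.01, 0.9, 0.875)
# Moderate prevalence: the test discriminates well between the two states.
mid_pos, mid_neg = diagnostic_posteriors(0.5, 0.9, 0.875)
```

Plotting post_pos and post_neg against the prevalence reproduces the qualitative conclusion of the text: the test is most informative when the prevalence is neither very low nor very high.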
As an illustration of this process, let us consider the assessment of the diagnostic value of stress thallium-201 scintigraphy, a technique involving analysis of Gamma camera image data as an indicator of coronary heart disease. On the basis of a controlled experimental study, Murray et al. (1981) concluded that P(T | H₁) = 0.900 and P(Tᶜ | H₂) = 0.875 were reasonable orders of magnitude for the sensitivity and specificity of the test. Insight into the diagnostic value of the test can be obtained by plotting values of P(H₁ | T) and P(H₁ | Tᶜ) against P(H₁), where, for D = T or D = Tᶜ,

P(H₁ | D) = P(D | H₁)P(H₁) / [P(D | H₁)P(H₁) + P(D | H₂)P(H₂)].

In cases where P(H₁) has very low or very high values (for example, for large population screening, or following individual patient referral on the basis of suspected coronary disease, respectively), there is limited diagnostic value in the test. However, in clinical situations where there is considerable uncertainty about the presence of coronary heart disease, say 0.25 ≤ P(H₁) ≤ 0.75, the test may be expected to provide valuable diagnostic information. As a single, overall measure of the discriminatory power of the test, one may consider the difference P(H₁ | T) − P(H₁ | Tᶜ), plotted against P(H₁) in the accompanying figure.

One further point about the terms prior and posterior is worth emphasising. They are not necessarily to be interpreted in a chronological sense, with the assumption that "prior" beliefs are specified first and then later modified into "posterior" beliefs. It is true that the natural order of assessment does coincide with the "chronological" order in a number of practical applications, but it is important to realise that this is a pragmatic issue and not a requirement of the theory. Propositions 2.15 and 2.17 do not involve any such chronological notions: they merely indicate that, for coherence, specifications of degrees of belief must satisfy the given relationships. In any given situation, the particular order in which we specify degrees of belief and check their coherence is a pragmatic one. Thus, for example, in Proposition 2.15 one might first specify P(G) and P(E | G) and then use the relationship stated in the theorem to arrive at a coherent specification of P(E ∩ G). Some assessments seem straightforward and we feel comfortable in making them directly, while we are less sure about other assessments and need to approach them indirectly via the relationships implied by coherence.

2.4.3 Conditional Independence

An important special case of Proposition 2.15 arises when E and G are such that P(E | G) = P(E), so that beliefs about E are unchanged by the assumed occurrence of G. Not surprisingly, this is directly related to our earlier operational definition of (pairwise) independence.

Proposition 2.20. For all F > ∅,

E ⊥ F iff P(E | F) = P(E).
Proof. By Proposition 2.13, E ⊥ F iff P(E ∩ F) = P(E)P(F) and, by Proposition 2.15, P(E ∩ F) = P(E | F)P(F); since P(F) > 0, the result follows.

Note that Proposition 2.20 derives from the uncertainty relation ≤, and therefore reflects an inherently personal judgement (although coherence may rule out some events from being judged independent: for example, any E, E′ such that ∅ < E ⊆ E′ < Ω).

In the case of three events, E, F and G, the situation is somewhat more complicated in that, from an intuitive point of view, we would regard our degree of belief for E as being "independent" of knowledge of F and G if and only if P(E | H) = P(E) for any of the four possible forms of H describing the combined occurrences, or otherwise, of F and G, namely F ∩ G, F ∩ Gᶜ, Fᶜ ∩ G, Fᶜ ∩ Gᶜ (and, of course, similar conditions must hold for the "independence" of F from E and G, and of G from E and F). These considerations motivate the following formal definition, which generalises Definition 2.6 and can be shown (see, e.g., Feller, 1950/1968, pp. 125-128) to be necessary and sufficient for encapsulating, in the general case, the intuitive conditions discussed above.

Definition 2.12. (Mutual independence). Events {E_j, j ∈ J} are said to be mutually independent if, for any I ⊆ J,

P(∩_{i ∈ I} E_i) = Π_{i ∈ I} P(E_i).

An important consequence of the fact that coherent degrees of belief combine in conformity with the rules of (finitely additive) mathematical probability theory is that the task of specifying degrees of belief for complex combinations of events is often greatly simplified. Instead of being forced into a direct specification, we can attempt to represent the complex event in terms of simpler events for which we feel more comfortable in specifying degrees of belief; the latter are then recombined, using the probability rules, to obtain the desired specification for the complex event. Definition 2.12 makes clear that the judgement of independence for a collection of events leads to considerable additional simplification when complex intersections of events are to be considered.
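The requirement of the definition of mutual independence is genuinely stronger than the pairwise product rule: the condition must hold for every subset of indices. A classic two-coin counterexample (ours, not the text's) shows three pairwise independent events that are not mutually independent.

```python
from fractions import Fraction
from itertools import combinations

omega = [(a, b) for a in (0, 1) for b in (0, 1)]   # two fair coin tosses

def prob(event):
    """P(E) for the uniform judgement over the four outcomes (illustrative)."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 1      # first coin shows heads
B = lambda w: w[1] == 1      # second coin shows heads
C = lambda w: w[0] != w[1]   # the two results differ

events = {"A": A, "B": B, "C": C}

# Every pair satisfies the product rule ...
pairwise = all(
    prob(lambda w: events[x](w) and events[y](w)) == prob(events[x]) * prob(events[y])
    for x, y in combinations(events, 2)
)
# ... but the triple intersection does not: A ∩ B ∩ C is impossible,
# whereas the product of the three individual probabilities is 1/8.
triple = prob(lambda w: A(w) and B(w) and C(w))
```

Learning that A and B have both occurred changes beliefs about C completely, even though knowledge of either one alone changes nothing.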
There is a sense, however, in which the judgement of independence for large classes of events of interest reflects a rather extreme form of belief, in that the scope for learning from experience is very much reduced. This motivates consideration of the following weaker form of independence judgement.

Definition 2.13. (Conditional independence). The events {E_j, j ∈ J} are said to be conditionally independent given G > ∅ if, for any I ⊆ J,

P(∩_{i ∈ I} E_i | G) = Π_{i ∈ I} P(E_i | G).

For any subalgebra ℱ of ℰ, the events {E_j, j ∈ J} are said to be conditionally independent given ℱ if, and only if, they are conditionally independent given any G > ∅ in ℱ.

Definitions 2.12 and 2.13 could, of course, have been stated in primitive terms of choices among options; having seen in detail the way in which the latter approach leads to the standard "product definition" of pairwise independence, it will be clear that a similar equivalence holds in these more general cases, but that the algebraic manipulations involved are somewhat more tedious. For a detailed analysis of the concept of conditional independence, see Dawid (1979a, 1979b, 1980b).

The form of degree of belief judgement encapsulated in Definition 2.13 is one which is utilised, in some way or another, in a wide variety of practical contexts and statements of scientific theories. Thus, for example, in simple Mendelian theory, the genotypes of successive offspring are typically judged to be independent events, given the knowledge of the two genotypes forming the mating. Similarly, in the practical context of sampling, with or without replacement, from large dichotomised populations (of voters, manufactured items, or whatever), successive outcomes (voting intention, marketable quality, and so on) may very often be judged independent, given exact knowledge of the proportional split in the dichotomised population. In the absence of such knowledge, however, in neither case would the judgement of independence for successive outcomes be intuitively plausible, since earlier outcomes provide information about the unknown population or mating composition and this, in turn, influences judgements about subsequent outcomes. A detailed discussion of the kinds of circumstances in which it may be reasonable to structure beliefs on the basis of such judgements will be a main topic of Chapter 4.

2.4.4 Sequential Revision of Beliefs

Bayes' theorem characterises the way in which current beliefs about a set of mutually exclusive and exhaustive hypotheses, H_j, j ∈ J, are revised in the light of new data, D. In practice, we typically receive data in successive stages, so that the process of revising beliefs is sequential.
As a simple illustration of this process, let us suppose that data are obtained in two stages, corresponding to the occurrence of real-world events D₁ and D₂. Omitting, for convenience, explicit conditioning on M₀, revision of beliefs on the basis of the first piece of data, D₁, is described by

P(H_j | D₁) = P(D₁ | H_j)P(H_j) / P(D₁),    j ∈ J.

When it comes to the further, subsequent revision of beliefs in the light of D₂, all judgements are now conditional on D₁, so that the likelihoods and prior probabilities to be used in Bayes' theorem are P(D₂ | H_j ∩ D₁) and P(H_j | D₁), j ∈ J, respectively, whereupon we obtain

P(H_j | D₁ ∩ D₂) = P(D₂ | H_j ∩ D₁)P(H_j | D₁) / P(D₂ | D₁),    j ∈ J,

where P(D₂ | D₁) = Σ_j P(D₂ | H_j ∩ D₁)P(H_j | D₁). From an intuitive standpoint, we would obviously anticipate that coherent revision of beliefs in the light of the combined data, D₁ ∩ D₂, should not depend on whether D₁, D₂ were analysed successively or in combination. This is easily verified by substituting the expression for P(H_j | D₁) into the expression for P(H_j | D₁ ∩ D₂), the latter being the direct expression obtained from Bayes' theorem when D₁ ∩ D₂ is treated as a single piece of data.

The generalisation of this sequential revision process to any number of stages proceeds straightforwardly. If we write D⁽ᵏ⁾ = D₁ ∩ D₂ ∩ ... ∩ D_k to denote all the data received up to and including stage k, we have, for all j ∈ J,

P(H_j | D⁽ᵏ⁺¹⁾) = P(D_{k+1} | H_j ∩ D⁽ᵏ⁾)P(H_j | D⁽ᵏ⁾) / P(D_{k+1} | D⁽ᵏ⁾),

which provides a recursive algorithm for the revision of beliefs.

There is, however, a potential practical difficulty in implementing this process, since there is an implicit need to specify the successively conditioned likelihoods P(D_{k+1} | H_j ∩ D⁽ᵏ⁾), a task which, in the absence of simplifying assumptions, may appear to be impossibly complex if k is at all large. One possible form of simplifying assumption is the judgement of conditional independence for D₁, D₂, ..., given any H_j, as in Definition 2.13, so that we then only need the evaluations P(D_{k+1} | H_j ∩ D⁽ᵏ⁾) = P(D_{k+1} | H_j), for all j ∈ J. Another possibility might be to assume a rather weak form of dependence by making the judgement that a (Markov) property such as P(D_{k+1} | H_j ∩ D⁽ᵏ⁾) = P(D_{k+1} | H_j ∩ D_k) holds for all k. As we shall see later, these kinds of simplifying structural assumptions play a fundamental role in statistical modelling and analysis.
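Under the judgement of conditional independence of D₁ and D₂ given each H_j, the recursive algorithm above must agree with a single application of Bayes' theorem to D₁ ∩ D₂. The sketch below checks this numerically (the hypotheses and likelihood values are invented for illustration) and also exhibits the two-hypothesis odds form of the learning process.

```python
from fractions import Fraction

def update(prior, likelihood):
    """One step of Bayes' theorem over a finite partition."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    total = sum(joint.values())
    return {h: joint[h] / total for h in prior}

prior = {"H1": Fraction(1, 3), "H2": Fraction(2, 3)}
lik_d1 = {"H1": Fraction(3, 4), "H2": Fraction(1, 4)}   # P(D1 | Hj), invented
lik_d2 = {"H1": Fraction(1, 2), "H2": Fraction(1, 5)}   # P(D2 | Hj), invented

# Sequential revision: condition on D1, then on D2 ...
sequential = update(update(prior, lik_d1), lik_d2)
# ... agrees with treating D1 ∩ D2 as a single datum, by conditional independence.
batch = update(prior, {h: lik_d1[h] * lik_d2[h] for h in prior})
assert sequential == batch

# Odds form of the learning process, in "favour" of H1.
posterior_odds = sequential["H1"] / sequential["H2"]
prior_odds = prior["H1"] / prior["H2"]
lr = (lik_d1["H1"] / lik_d1["H2"]) * (lik_d2["H1"] / lik_d2["H2"])
```

The final three lines verify numerically that posterior odds equal prior odds multiplied by the accumulated likelihood ratio.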
In the case of two hypotheses, H₁ and H₂, we can provide an alternative description of the process of revising beliefs by noting that, in an obvious notation,

P(H₁ | D⁽ᵏ⁾) / P(H₂ | D⁽ᵏ⁾) = [P(D_k | H₁ ∩ D⁽ᵏ⁻¹⁾) / P(D_k | H₂ ∩ D⁽ᵏ⁻¹⁾)] × [P(H₁ | D⁽ᵏ⁻¹⁾) / P(H₂ | D⁽ᵏ⁻¹⁾)].

With due regard to the relative nature of the terms prior and posterior, we can thus summarise the learning process (in "favour" of H₁) as follows:

posterior odds = prior odds × likelihood ratio.

Later, we shall examine in more detail the key role played by the sequential revision of beliefs in the context of complex, sequential decision problems.

2.5 ACTIONS AND UTILITIES

2.5.1 Bounded Sets of Consequences

At the beginning of Section 2.4, we argued that choices among options are governed, in part, by the relative degrees of belief that an individual attaches to the uncertain events forming part of the definitions of the options. The measurement framework of Axiom 5 provided us with a direct, intuitive way of introducing a numerical measure of degree of belief, in such a way that the latter has a coherent, operational basis. It is equally clear that choices among options should depend on the relative values that an individual attaches to the consequences flowing from the events, and the same measurement framework provides us with a direct, intuitive way of introducing a numerical measure of value for consequences. Before we do this, we need to consider a little more closely the nature of the set of consequences C; in this context, we recall that all consequences are to be thought of as relevant consequences in the context of the decision problem.

The following special case provides a useful starting point for our development of a measure of value for consequences.

Definition 2.14. (Extreme consequences). The pair of consequences c∗ and c* are called, respectively, the worst and the best consequences in a decision problem if, for any other consequence c ∈ C, c∗ ≤ c ≤ c*.

It could be argued that all real decision problems actually have extreme consequences, and that assuming their existence merely eliminates pathological, mathematically motivated choices of C, which could be constructed in such a way as to rule out the existence of extreme consequences. For example, in mathematical modelling of decision problems involving monetary consequences, C is often taken to be the real line ℜ or, in a no-loss situation with current assets k, to be the interval [k, ∞). Such C's would not contain both a best and a worst consequence but, on the
Nevertheless, despite the force of the pragmatic argument that extreme consequences always exist, it must be admitted that insisting upon problem formulations which satisfy the assumption of the existence of extreme consequences can sometimes lead to rather tedious complications of a conceptual or mathematical nature. Consider, for example, a medical decision problem for which the consequences take the form of different numbers of years of remaining life for a patient. Assuming that more value is attached to longer survival, it would appear rather difficult to justify any particular choice of realistic upper bound c^*, even though we believe there to be one. To choose a particular c^* would be tantamount to putting forward c^* years as a realistic possible survival time, but regarding c^* + 1 years as impossible! In such cases, it is attractive to have available the possibility, for conceptual and mathematical convenience, of dealing with sets of consequences not possessing extreme elements (and the same is true of many problems involving monetary consequences), even though such formulations clearly do not correspond to concrete, practical problems. For this reason, we shall also deal (in Section 2.5.3) with the situation in which extreme consequences are not assumed to exist. In the next section, however, we shall consider the solution to decision problems for which extreme consequences are assumed to exist.

2.5.2 Bounded Decision Problems

Let us consider a decision problem (E, C, A, <=) for which extreme consequences c_* < c^* are assumed to exist. We shall refer to such decision problems as bounded.

Definition 2.15. (Canonical utility function for consequences). Given a preference relation <=, the utility u(c) = u(c | c_*, c^*) of a consequence c, relative to the extreme consequences c_* < c^*, is the real number mu(S) associated with any standard event S such that c ~ {c^* | S, c_* | S^c}. The mapping u : C -> R is called the utility function.

It is important to note that the definition of utility only involves comparison among consequences and options constructed with standard events. Since the preference patterns among consequences are unaffected by additional information, we would expect the utility of a consequence to be uniquely defined and to remain unchanged as new information is obtained. This is indeed the case.

Proposition 2.21. (Existence and uniqueness of bounded utilities). For any bounded decision problem (E, C, A, <=) with extreme consequences c_* < c^*:
(i) for all c, the utility u(c) = u(c | c_*, c^*) exists and is unique;
(ii) the value u(c | c_*, c^*) is unaffected by the assumed occurrence of an event G > 0;
(iii) 0 = u(c_* | c_*, c^*) <= u(c | c_*, c^*) <= u(c^* | c_*, c^*) = 1.
Proof. (i) Existence follows immediately from Axiom 5(i). For uniqueness, note that if c ~ {c^* | S1, c_* | S1^c} and c ~ {c^* | S2, c_* | S2^c} then, by transitivity, {c^* | S1, c_* | S1^c} ~ {c^* | S2, c_* | S2^c} and hence mu(S1) = mu(S2).
(ii) To establish this, for any G > 0 choose S2 such that G is independent of S2 and c ~ {c^* | S2, c_* | S2^c}. Then, given G, we still have c ~ {c^* | S2, c_* | S2^c}, and so the utility of c given G is just the original value mu(S2); the result now follows from Axiom 4(i).
(iii) Finally, since c^* ~ {c^* | Omega, c_* | 0} and c_* ~ {c^* | 0, c_* | Omega}, we have u(c^* | c_*, c^*) = mu(Omega) = 1 and u(c_* | c_*, c^*) = mu(0) = 0. That 0 <= u(c | c_*, c^*) <= 1 then follows, by Definition 2.15 and Axiom 4(i).

It is interesting to note that u(c | c_*, c^*), which we shall often simply denote by u(c), can be given an operational interpretation in terms of degrees of belief. Indeed, if we consider a choice between the fixed consequence c and the option {c^* | E, c_* | E^c}, for some event E, then the utility of c can be thought of as defining a threshold value for the degree of belief in E, in the sense that values greater than u would lead an individual to prefer the uncertain option, whereas values less than u would lead the individual to prefer c for certain. The value u itself corresponds to indifference between the two options and is the degree of belief in the occurrence of the best, rather than worst, consequence. This suggests one possible technique for the experimental elicitation of utilities, a subject which has generated a large literature (with contributions from economists and psychologists, as well as from statisticians). We shall illustrate the ideas in Example 2.3.

It remains now to investigate how an overall numerical measure of value can be attached to an option, whose form depends both on the events of a finite partition of the certain event Omega and on the particular consequences to which these events lead.

Definition 2.16. (Conditional expected utility). Given a preference relation <= with respect to the extreme consequences c_* < c^*, an option a = {c_j | E_j, j in J} and an event G > 0, the conditional expected utility of a, given G, is

u-bar(a | c_*, c^*, G) = sum_{j in J} u(c_j | c_*, c^*) P(E_j | G).

If G = Omega, we shall simply write u-bar(a | c_*, c^*) in place of u-bar(a | c_*, c^*, Omega); u-bar(a | c_*, c^*) is the expected utility of the option a.
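The threshold interpretation above is a one-line calculation: facing the fixed consequence c against the gamble {c^* | E, c_* | E^c}, a coherent individual prefers the gamble exactly when the degree of belief in E exceeds u(c). A minimal sketch with hypothetical numbers:

```python
# Threshold interpretation of canonical utility (hypothetical numbers):
# the gamble {c^* if E, c_* if not E} has expected utility
# p_E * 1 + (1 - p_E) * 0 = p_E, so it beats "c for sure" iff p_E > u(c).

def prefers_gamble(u_c, p_E):
    """True when the degree of belief in E exceeds the utility threshold u(c)."""
    return p_E > u_c

u_c = 0.44                         # canonical utility u(c | c_*, c^*) of c
print(prefers_gamble(u_c, 0.6))    # True: belief in E above the threshold
print(prefers_gamble(u_c, 0.3))    # False: take c for certain
```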
Proposition 2.22. (Decision criterion for a bounded decision problem). For any bounded decision problem with extreme consequences c_* < c^*, and any event G > 0,

a1 <= a2, given G, if and only if u-bar(a1 | c_*, c^*, G) <= u-bar(a2 | c_*, c^*, G).

Proof. Let a_l = {c_i^l | E_i^l, i = 1, …, n_l}, for l = 1, 2. By Definition 2.15, for each consequence c_i^l there exists a standard event S_i^l such that c_i^l ~ {c^* | S_i^l, c_* | (S_i^l)^c}, with u(c_i^l) = mu(S_i^l). By Axioms 5(ii), 4(iii) and 3(ii), substituting these equivalent options, with the standard events chosen independent of the E's and of G, each a_l is equivalent, given G, to an option of the form {c^* | A_l, c_* | A_l^c}, where A_l is the union over i of E_i^l ∩ G ∩ S_i^l. Hence a1 <= a2, given G, if and only if P(A_1 | G) <= P(A_2 | G); but, by Propositions 2.14(ii) and 2.15,

P(A_l | G) = sum_i mu(S_i^l) P(E_i^l | G) = u-bar(a_l | c_*, c^*, G),

and so the result follows.

In the language of mathematical probability theory (see Chapter 3), if the utility value of an option is considered as a "random quantity", then u-bar is simply the expected value of that utility when the probabilities of the events are considered conditional on G. In summary form, the resulting prescription for quantitative, coherent decision-making is: choose the option with the greatest expected utility.

The result just established is sometimes referred to as the principle of maximising expected utility. In our development, this is clearly not an independent "principle", but rather an implication of our assumptions and definitions. Technically, Proposition 2.22 merely establishes a complete ordering of the options considered and does not guarantee the existence of an optimal option for which the expected utility is a maximum. In more abstract mathematical formulations, the existence of a maximum will depend on analytic features of the set of options and of the utility function u : C -> R. However, in most (if not all) concrete, practical problems the set of options considered will be finite, and so a best option (not necessarily unique) will exist.
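The prescription "choose the option with the greatest expected utility" is a short computation once the utilities and probabilities are in hand. A sketch with hypothetical events, probabilities and canonical utilities:

```python
# Decision criterion sketch: pick the option maximising expected utility.
# Probabilities P(E_j | G) and canonical utilities u(c_j) are hypothetical.

probs = {"E1": 0.5, "E2": 0.3, "E3": 0.2}

options = {
    "a1": {"E1": 0.9, "E2": 0.1, "E3": 0.4},   # u(c_j) per labelling event
    "a2": {"E1": 0.2, "E2": 0.8, "E3": 0.7},
}

def expected_utility(option):
    return sum(u * probs[e] for e, u in option.items())

best = max(options, key=lambda name: expected_utility(options[name]))
print(best, expected_utility(options[best]))   # a1: 0.45 + 0.03 + 0.08 = 0.56
```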
Example 2.3. (Utilities of oil wildcatters). One of the earliest reported systematic attempts at the quantification of utilities in a practical decision-making context was that of Grayson (1960), whose decision-makers were oil wildcatters engaged in exploratory searches for oil and gas. The consequences of drilling decisions and their outcomes are ultimately changes in the wildcatters' monetary assets, and Grayson's work focuses on the assessment of utility functions for this latter quantity.

For the purposes of illustration, suppose that we restrict attention to changes in monetary assets ranging, in units of one thousand dollars, from -150 (the worst consequence) to +825 (the best consequence). Assuming u(-150) = 0 and u(825) = 1, the above development suggests ways in which we might try to elicit an individual wildcatter's values of u(c) for various c in the range -150 < c < 825. For example, one could ask the wildcatter, for some specified p, which option he or she would prefer out of the following:
(i) c for sure;
(ii) entry into a venture having outcome 825 with probability p and an outcome -150 with probability 1 - p.
For a coherent individual, indifference between (i) and (ii) implies that u(c) = p u(825) + (1 - p) u(-150) = p. If p_c emerges from such interrogation as an approximate "indifference" value, then u(c) = p_c. Repeating this exercise for a range of values of c provides a series of (c, p_c) pairs, from which a "picture" of u(c) over the range of interest can be obtained. An alternative procedure would be to fix c, perform an interrogation for various p until an "indifference" value is found, and then repeat this procedure for a range of values of c to obtain a series of (c, p) pairs.

Figure 2.3 William Beard's utility function for changes in monetary assets
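The fix-c, vary-p interrogation can be mimicked mechanically. In the sketch below, a hypothetical risk-averse respondent (whose hidden preferences we simulate with v(x) = log(x + 200); this stand-in is our assumption, not Grayson's procedure or data) answers the comparison queries, and bisection on p recovers the indifference value u(c):

```python
import math

def respondent_prefers_gamble(c, p):
    """Stand-in for the interrogation: a hypothetical risk-averse subject,
    simulated with hidden utility v(x) = log(x + 200), says whether they
    prefer the venture {825 w.p. p, -150 w.p. 1 - p} to c for sure."""
    v = lambda x: math.log(x + 200.0)
    return p * v(825) + (1 - p) * v(-150) > v(c)

def elicit_u(c, tol=1e-6):
    """Bisect on p until indifference; the indifference value p is u(c)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if respondent_prefers_gamble(c, mid):
            hi = mid        # gamble preferred: indifference value is lower
        else:
            lo = mid
    return 0.5 * (lo + hi)

# u(-150) ~ 0 and u(825) ~ 1 by construction; u(0) lies well above the
# risk-neutral value 150/975 ~ 0.15, reflecting the built-in risk aversion.
print(elicit_u(-150), elicit_u(0), elicit_u(825))
```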
Figure 2.3 shows the results obtained by Grayson using procedures of this kind to interrogate oil company executive William Beard, on October 23, 1957. A "picture" of Beard's utility function clearly emerges from the empirical data. In particular, over the range concerned, the utility function reflects considerable risk aversion, in the sense that even quite small asset losses lead to large (negative) changes in utility compared with the (positive) changes associated with asset gains.

Our restricted definition of utility (Definition 2.15) relied on the existence of extreme consequences, derived by referring to the best and worst consequences. Since the expected utility u-bar is a linear combination of values of the utility function, Proposition 2.22 guarantees that preferences among options are invariant under changes in the origin and scale of the utility measure used, i.e., invariant with respect to transformations of the form A u(.) + B, with A > 0. Such an origin and scale can therefore be chosen for convenience in any given problem, and we can simply refer to the expected utility of an option without needing to specify the (positive linear) transformation of the utility function which has been used. However, there may be bounded decision problems where the probabilistic interpretation discussed above makes it desirable to work in terms of normalised utilities. In the next section, we shall provide an extension of these ideas to more general decision problems where extreme consequences are not assumed to exist.

2.5.3 General Decision Problems

We begin with a more general definition of the utility of a consequence which preserves the linear combination structure and the invariance discussed above. In general, we have to select some reference consequences c1 < c2, which are to define a utility scale by being assigned the values 0 and 1, respectively. In the absence of the boundedness assumption, we cannot then assume that c1 <= c <= c2 for all c in C; for consequences outside this range, we shall require negative assignments for c < c1 and assignments greater than one for c > c2.
Definition 2.17. (General utility function). Given a preference relation <=, the utility u = u(c | c1, c2) of a consequence c, relative to the consequences c1 < c2, is defined to be the real number u such that:
(i) if c < c1 and c1 ~ {c2 | S, c | S^c}, then u = -x/(1 - x);
(ii) if c1 <= c <= c2 and c ~ {c2 | S, c1 | S^c}, then u = x;
(iii) if c2 < c and c2 ~ {c | S, c1 | S^c}, then u = 1/x;
where, in each case, x = mu(S) is the measure associated with the standard event S.

The definition is motivated by a desire to maintain the linear features of the utility function obtained in the case where extreme consequences exist, with c1, c2 playing the roles of c_* and c^*, so that the orientation of "best" and "worst" is not changed. In particular, u(c1 | c1, c2) = 0 and u(c2 | c1, c2) = 1, and the definition given ensures that, for any G > 0, the value of u(c | c1, c2) is unaffected by the occurrence of G.
Proposition 2.23. (Existence and uniqueness of utilities). For any pair of reference consequences c1 < c2, the utility u(c | c1, c2) exists and is unique; moreover, its value is unaffected by the assumed occurrence of an event G > 0.

Proof. This is virtually identical to the proof of Proposition 2.21.

The following result extends Proposition 2.21 to the general utility function defined above, and guarantees that the utilities of consequences are linearly transformed if the pair of consequences chosen as a reference is changed.

Proposition 2.24. (Linearity). For all c1 < c2 and c3 < c4, there exist A > 0 and B such that, for all c,

u(c | c1, c2) = A u(c | c3, c4) + B,

where A = u(c4 | c1, c2) - u(c3 | c1, c2) and B = u(c3 | c1, c2).

Proof. Suppose first that c3 <= c1 < c2 <= c4, and let c(1) <= c(2) <= c(3) denote any permutation of c, c1, c2 with c(1) < c(3). By Axiom 5(ii), there exists a standard event S such that c(2) ~ {c(3) | S, c(1) | S^c} and hence, with y = P(S),

u(c(2) | c1, c2) = y u(c(3) | c1, c2) + (1 - y) u(c(1) | c1, c2),

and the same linear relation holds relative to the reference pair (c3, c4). Repeated use of relations of this form, for the various possible orderings of c relative to c1 and c2, establishes that u(. | c1, c2) is a positive linear function of u(. | c3, c4); the constants A and B are then identified by evaluating at c = c3 and c = c4, using u(c3 | c3, c4) = 0 and u(c4 | c3, c4) = 1. The case in which the c's have arbitrary order follows by chaining transformations of this kind through a pair of reference consequences comparable with all those involved.
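Because expected utility is linear in u, the linearity result implies that rankings of options are unchanged under any positive linear rescaling of the utility function. A quick numerical check, with all figures hypothetical:

```python
# Proposition 2.24 in action (hypothetical utilities): a positive linear
# rescaling u -> A*u + B changes expected utilities but never the ranking.

def eu(utils, probs):
    return sum(u * p for u, p in zip(utils, probs))

probs = [0.5, 0.3, 0.2]                 # P(E_j | G), hypothetical
a1 = [0.9, 0.1, 0.4]                    # utilities of a1's consequences
a2 = [0.2, 0.8, 0.7]

A, B = 3.0, -5.0                        # any A > 0 and B give the same order
rescale = lambda us: [A * u + B for u in us]

print(eu(a1, probs), eu(a2, probs))                    # 0.56 vs 0.48
print(eu(rescale(a1), probs), eu(rescale(a2), probs))  # shifted, same order
```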
Finally, we generalise Proposition 2.22 to unbounded decision problems.

Proposition 2.25. (General decision criterion). For any decision problem, pair of consequences c1 < c2, and event G > 0,

a1 <= a2, given G, if and only if u-bar(a1 | c1, c2, G) <= u-bar(a2 | c1, c2, G).

Proof. Suppose a1 and a2 are as given, and let c', c'' be such that c' <= c <= c'' for all consequences c involved in a1 and a2. By Proposition 2.24, there exist A > 0 and B such that u(c | c', c'') = A u(c | c1, c2) + B, so that u-bar(a_i | c', c'', G) = A u-bar(a_i | c1, c2, G) + B. But, by Proposition 2.22, a1 <= a2, given G, if and only if u-bar(a1 | c', c'', G) <= u-bar(a2 | c', c'', G), and so the result follows.

An immediate implication of Proposition 2.25 is that all options can be compared among themselves. We recall that we did not directly assume that comparisons could be made between all pairs of options (an assumption which is often criticised as unjustified; see, for example, Fine, 1973, p. 221); we merely assumed that all consequences could be compared among themselves and with the (very simply structured) standard dichotomised options, and that the latter could be compared among themselves. Starting from the primitive notion of preference, this indirectly induces a preference ordering among all options which is necessarily coherent.

This completes our elaboration of the axiom system set out in Section 2.3. We have shown that quantitative, coherent comparisons of options must proceed as if a utility function has been assigned to consequences, probabilities to events, and the choice of an option made on the basis of maximising expected utility. If, instead, we begin by defining a utility function u : C -> R, any function can serve as a utility function (subject only to the existence of the expected utility for each option, a problem which does not arise in the case of finite partitions), and the choice is a personal one. In some contexts, however, there are further formal considerations which may delimit the form of function chosen. An important special case is discussed in detail in Section 2.7.

2.6 SEQUENTIAL DECISION PROBLEMS

2.6.1 Complex Decision Problems

Many real decision problems would appear to have a more complex structure than that encapsulated in Definition 2.1. For instance, in the fields of market research and production engineering, investigators often consider first whether or not to run a pilot study and only then, in the light of information obtained (or on the basis of initial information if the study is not undertaken), decide among the major options considered. Such a two-stage process provides a simple example of a sequential decision problem, involving successive, interdependent decisions. In this section, we shall demonstrate that complex problems of this kind can be solved with the tools already at our disposal, thus substantiating our claim that the principles of quantitative coherence suffice to provide a prescriptive solution to any decision problem.
Before explicitly considering sequential problems, we shall review, using a more detailed notation, some of our earlier developments. Let A = {a_i, i in I} be the set of alternative actions we are willing to consider. For each a_i, there is a class {E_ij, j in J} of exhaustive and mutually exclusive events which label the possible consequences {c_ij, j in J} which may result from action a_i. If M0 is our initial state of information and G > 0 is additional information obtained subsequently, we write P(E_ij | a_i, G, M0) for the degree of belief in the occurrence of event E_ij, conditional on action a_i having been taken, and the state of information being (M0, G). By using this extended notation, rather than the more economical P(E_j | G) used previously, we are emphasising that:
(i) the actual events considered may depend on the particular action envisaged;
(ii) the information available certainly includes the initial information together with G > 0;
(iii) degrees of belief in the occurrence of events such as E_ij, j in J, are understood to be conditional on action a_i having been assumed to be taken, so that the possible influence of the decision-maker on the real world is taken into account.
In writing c_ij and E_ij, we are merely emphasising the obvious dependence of both the consequences and the events on the action from which they result.

With this notation, the main result of the previous section (Proposition 2.25) may be restated as follows. For behaviour consistent with the principles of quantitative coherence, action a1 is to be preferred to action a2, given (M0, G), if and only if

sum_j u(c_1j) P(E_1j | a1, G, M0) > sum_j u(c_2j) P(E_2j | a2, G, M0),

where u(c_ij) is the value attached to the consequence foreseen if action a_i is taken and the event E_ij occurs.

It is often convenient to describe the relevant events in a sequential form, conditionally on other events. For example, in considering the relevant events which label the consequences of a surgical intervention for cancer, one may first think of whether the patient will survive the operation and then, conditional on survival, whether or not the tumour will eventually reappear were this particular form of surgery to be performed.
These situations are most easily described diagrammatically using decision trees, with as many successive random nodes as necessary, such as that shown in Figure 2.4. Obviously, this does not represent any formal departure from our previous structure, since the problem can be restated with a single random node where relevant events are defined in terms of appropriate intersections. It is usually the case, however, that in practice it is easier to elicit the relevant degrees of belief conditionally: in the example shown, P(E_j ∩ F_k | a_i, G, M0) would often be best assessed by combining the separately assessed terms P(F_k | E_j, a_i, G, M0) and P(E_j | a_i, G, M0).

Consider, for example, the problem of placing a bet on the result of a race, after which the total amount bet is to be divided up among those correctly guessing the winner. It may appear at first sight that this is a decision problem where the utilities involved in an action (the possible prizes to be obtained from a bet) depend on the probabilities of the corresponding uncertain events (the possible winning horses), a possibility not contemplated in our structure. A closer analysis reveals, however, that the structure of the problem is similar to that of Figure 2.4. The prize received depends on the bet you place (a_i), the related betting behaviour of other people (E_j) and the outcome of the race (F_k). It is only natural to assume that our degree of belief in the possible outcomes of the race may be influenced by the betting behaviour of other people. Clearly, if we bet on the favourite we have a higher probability of winning; but, if the favourite wins, many people will have guessed correctly and the prize will be small. This conditional analysis straightforwardly resolves the initial, apparent complication. Conditional analysis of this kind is usually necessary in order to understand the structure of complicated situations.

We now turn to considering sequences of decision problems.
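The race-betting example can be put into numbers. Everything below is hypothetical (two crowd-behaviour events and a two-way win/lose simplification of the race are our assumptions), but it shows how the separately assessed P(F | E) and P(E) combine, and why a likely win can still have modest expected value:

```python
# Hypothetical race-betting sketch: the prize depends on both the race
# outcome (F) and others' betting behaviour (E); assess P(F | E) and P(E)
# separately and combine via P(E and F) = P(F | E) P(E).

p_E = {"crowd_on_fav": 0.7, "crowd_split": 0.3}          # P(E), hypothetical
p_fav_wins = {"crowd_on_fav": 0.6, "crowd_split": 0.4}   # P(fav wins | E)
prize_fav = {"crowd_on_fav": 10.0, "crowd_split": 40.0}  # shared with many/few
prize_long = {"crowd_on_fav": 90.0, "crowd_split": 60.0}

def eu(bet):
    total = 0.0
    for e, pe in p_E.items():
        p_win = p_fav_wins[e] if bet == "favourite" else 1 - p_fav_wins[e]
        prize = prize_fav[e] if bet == "favourite" else prize_long[e]
        total += pe * p_win * prize   # P(E) * P(win | E) * prize(E)
    return total

print(eu("favourite"), eu("longshot"))   # 9.0 vs 36.0
```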
We shall consider situations where, after an action has been taken and its consequences observed, a new decision problem arises, conditional on the new circumstances. For example, when the consequences of a given medical treatment have been observed, a physician has to decide whether to continue the same treatment, or to change to an alternative treatment, or to declare the patient cured. In such cases, the situation may be diagrammatically described by means of a decision tree like that of Figure 2.5, with several successive decision nodes. In colloquial terms, we typically cannot decide what to do today without thinking first of what we might do tomorrow, since tomorrow's options will typically depend on the possible consequences of today's actions. If a decision problem involves a succession of decision nodes, it is intuitively obvious that the optimal choice at the first decision node depends on the optimal choices at the subsequent decision nodes. In the next section, we consider a technique, backward induction, which makes it possible to solve these problems within the framework we have already established.

In any actual decision problem, the number of scenarios which may be contemplated at any given time is necessarily finite and, at each node, the possibilities are finite in number. Moreover, bearing in mind that the analysis is only strictly valid under certain fixed general assumptions, and that we cannot seriously expect these to remain valid for an indefinitely long period, we should be able to define a finite horizon, after which no further decisions are envisaged in the particular problem formulation. Consequently, the number of decision nodes to be considered in any given sequential problem will be assumed to be finite.

Figure 2.5 Decision tree with several decision nodes

2.6.2 Backward Induction

Let n be the number of decision stages considered and let a^(m) denote an action being considered at the mth stage.
Using the notation for composite options introduced in Section 2.2, all first-stage actions may be compactly described in the form

a^(1) = { a^(2)_jk, k in K(j) | E_1j, j in J^(1) },

where {E_1j, j in J^(1)} is the partition of relevant events which corresponds to a^(1). The "consequence" of choosing a^(1) and having E_1j occur is that we are confronted with a set of options {a^(2)_jk, k in K(j)}, from which we can choose that option which is preferred on the basis of our pattern of preferences at that stage. Similarly, second-stage options may be written in terms of third-stage options, and the process continued until we reach the nth stage, consisting of "ordinary" options defined in terms of the events and consequences to which they may lead.

It follows from Proposition 2.25 that, at each stage m, the "maximisation" is naturally to be understood in the sense of our conditional preference ordering among the available options at that stage: the notation max_k a^(m)_k refers to the most preferred of the set of options {a^(m)_k, k in K}, given the occurrence of the conditioning events. Formally, if G is the relevant information available and u(.) is the (generalised) utility function, we may write

u-bar(a^(1) | G) = sum_j P(E_1j | a^(1), G) max_k u-bar(a^(2)_jk | E_1j, G).

This means that one has first to solve the final (nth) stage, by maximising the appropriate expected utility; then one has to solve the (n - 1)th stage by maximising the expected utility conditional on making the optimal choice at the nth stage; and so on, working backwards, until the optimal first-stage option has been obtained. It is now apparent that sequential decision problems are a special case of the general framework which we have developed.
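The recursion can be sketched directly: evaluate a node's expected utility by replacing every embedded decision problem with the value of its best option, starting from the final stage. The two-stage tree and all its numbers below are hypothetical:

```python
# Backward induction sketch: solve the final stage first, then fold its
# optimal expected utility back into the earlier stages.

def solve_stage(options):
    """options: {action: [(prob, value), ...]}, where value is either a
    utility (number) or a nested {action: ...} decision problem.
    Returns (best action, its expected utility)."""
    def value(v):
        return solve_stage(v)[1] if isinstance(v, dict) else v
    best, best_eu = None, float("-inf")
    for a, lottery in options.items():
        eu = sum(p * value(v) for p, v in lottery)
        if eu > best_eu:
            best, best_eu = a, eu
    return best, best_eu

# Stage-2 problems reached after stage-1 outcomes E1, E2 (made-up numbers).
stage2_after_E1 = {"continue": [(1.0, 0.9)], "stop": [(1.0, 0.5)]}
stage2_after_E2 = {"continue": [(1.0, 0.2)], "stop": [(1.0, 0.4)]}
tree = {"treat": [(0.7, stage2_after_E1), (0.3, stage2_after_E2)],
        "no-treat": [(1.0, 0.6)]}
print(solve_stage(tree))   # treat: 0.7*0.9 + 0.3*0.4 = 0.75 > 0.6
```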
At each stage, working backwards progressively, the expected utility of an option is computed conditional on optimal behaviour at all subsequent stages, until the optimal first-stage option has been obtained; this procedure is often referred to as dynamic programming. The process of backward induction satisfies the requirement that, at any stage of the procedure, the continuation of the procedure must be identical to the optimal procedure starting at that stage with the information then available. This requirement is usually known as Bellman's optimality principle (Bellman, 1957). As with the "principle" of maximising expected utility, we see that this is not required as a further assumed "principle" in our formulation, but is simply a consequence of the principles of quantitative coherence.

Example 2.4. (An optimal stopping problem). We now consider a famous problem, usually referred to in the literature as the "marriage problem" or the "secretary problem". Suppose that a specified number of objects n >= 2 are to be inspected sequentially, one at a time, in order to select one of them, receiving, as a result, the object under inspection when the process terminates. The n objects are presented in a completely random order. At any stage r, 1 <= r <= n, the only information available to the inspector is the relative rank (1 = best, r = worst) of the current object among those inspected so far. No backtracking is permitted, and if the inspection process has not terminated before the nth stage, the outcome is that the nth object is received.

When should the inspection process be terminated? Intuitively, if the inspector stops too soon, there is a good chance that objects more preferred to those seen so far will remain uninspected; however, if the inspection process goes on too long, there is a good chance that the overall preferred object will already have been encountered and passed over. This kind of dilemma is inherent in a variety of practical problems, such as property purchase in a limited seller's market when a bid is required immediately after inspection, or staff appointment in a skill-shortage area when a job offer is required immediately after interview. More exotically, and assuming a rather egocentric inspection process, again with no backtracking possibilities, this stopping problem has been suggested as a model for choosing a mate. Potential partners are encountered sequentially: the proverb "marry in haste, repent at leisure" warns against settling down too soon; but such hesitations have to be balanced against painful future realisations of missed golden opportunities.

Let c_i, i = 1, …, n, denote the possible consequences of the inspection process, with c_i = i if the eventual object chosen has rank i out of all n objects, and let u(c_i) = u(i) denote the inspector's utility for these consequences. Now suppose that r < n objects have been inspected and that the relative rank, among these, of the object under current inspection is x, where 1 <= x <= r. The information available at the rth stage is G_r = (x, r), together with the knowledge that the n objects are being presented in a completely random order. Were the process to continue, the information available at the (r + 1)th stage would be G_{r+1} = (y, r + 1), where y, the rank of the next object relative to the r + 1 then inspected, takes the values 1 <= y <= r + 1, all equally likely since the n objects are inspected in a random order. There are two actions available at the rth stage: a1 = stop; a2 = continue (where, to simplify notation, we have dropped the superscript).
We shall denote by u-bar_0(x, r) the expected utility of stopping, given G_r = (x, r), and by u-bar_*(x, r) the expected utility of acting optimally, given G_r. Values of u-bar_*(x, r) can be found from the final condition u-bar_*(x, n) = u(x) and the technique of backwards induction:

u-bar_*(x, r) = max { u-bar_0(x, r), (r + 1)^{-1} sum_{y=1}^{r+1} u-bar_*(y, r + 1) },

the second term being the expected utility of continuing, given G_r. The optimal procedure is then seen to be: (i) continue if u-bar_0(x, r) is less than the expected utility of continuing; (ii) stop otherwise.

For illustration, suppose that the inspector's preference ordering corresponds to a "nothing but the best" utility function, defined by u(1) = 1 and u(i) = 0, for i = 2, …, n. It is then easy to show that u-bar_0(1, r) = r/n and u-bar_0(x, r) = 0 for x > 1. This implies that inspection should never be terminated if the current object is not the best seen so far. The decision as to whether to stop if x = 1 is determined from the equation above, which is easily verified by induction, and the optimal procedure is then: (i) continue until at least r* objects have been inspected; (ii) if the r*th object is the best so far, stop; (iii) otherwise, continue until the object under inspection is the best seen so far, then stop (stopping in any case if the nth stage is reached); where r* is the smallest positive integer for which

sum_{j=r*}^{n-1} 1/j <= 1.

If n is large, approximation of the sum in the above inequality by an integral readily yields the approximation r* ~ n/e. For further details, see DeGroot (1970, Chapter 13), whose account is based closely on Lindley (1961). For reviews of further, related work on this fascinating problem, see Freeman (1983) and Ferguson (1989).
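A small Monte Carlo check of the threshold rule (skip roughly n/e objects, then take the first object better than everything before it) is easy to write; for n = 20 the success probability should land near 1/e ~ 0.37. The simulation below is illustrative, not part of the original analysis:

```python
import math
import random

def simulate(n, trials=20000, seed=0):
    """Monte Carlo success rate of the 'skip about n/e, then take the first
    best-so-far' rule; success means choosing the overall best object."""
    rng = random.Random(seed)
    r_star = max(1, round(n / math.e))        # approximate threshold r* ~ n/e
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))                # 0 denotes the overall best
        rng.shuffle(ranks)
        best_seen = min(ranks[:r_star - 1], default=n)
        chosen = ranks[-1]                    # forced to take the last object
        for x in ranks[r_star - 1:]:          # from the r*-th object onwards
            if x < best_seen:                 # first object beating all before it
                chosen = x
                break
        wins += (chosen == 0)
    return wins / trials

print(simulate(20))   # near 1/e ~ 0.368
```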
Applied to the problem of "choosing a mate", and assuming that potential partners are encountered uniformly over time between the ages of 16 and 60, the above analysis suggests delaying a choice until one is at least 32 years old, thereafter ending the search as soon as one encounters someone better than anyone encountered thus far. Readers who are suspicious of putting this into practice have the option, of course, of staying at home and continuing their study of this volume.

2.6.3 Design of Experiments

A simple, but very important, example of a sequential problem is provided by the situation where we have available a class of experiments, one of which is to be performed in order to provide information for use in a subsequent decision problem, and we want to choose the "best" experiment. Sequential decision problems are thus further illustrated by considering this important special case of situations involving an initial choice of experimental design, which embraces the topic usually referred to as the problem of experimental design. The structure of this problem may be diagrammatically described by means of a sequential decision tree such as that shown in Figure 2.6.

Figure 2.6 Decision tree for experimental design

We must first choose an experiment e and then, in the light of the data D obtained, take an action a which, were event E to occur, would produce a consequence whose utility, modifying earlier notation in order to be explicit about the elements involved, we denote by u(e, D, a, E).
Within the general structure for sequential decision problems developed in the previous section, we note that the possible sets of data obtainable may depend on the particular experiment performed, the set of available actions may depend on the results of the experiment performed, and the sets of consequences and labelling events may depend on the particular combination of experiment and action chosen. Strictly speaking, therefore, the utilities attached to different experiments are different functions defined on different spaces.

For each pair (e, D_i), consisting of an experiment and the data it might produce, we can determine the best possible continuation: namely, that action which maximises the expected utility, given the information available at the stage when the action is to be taken. Thus, the expected utility of the pair (e, D_i) is

u*(e, D_i) = max_a Sum_j u(e, a, D_i, E_j) P(E_j | D_i, e, a),

and, if P(D_i | e) denotes the degree of belief attached to the occurrence of data D_i if e were the experiment chosen, the unconditional expected utility of performing e is

u*(e) = Sum_i u*(e, D_i) P(D_i | e).

On the other hand, we also have the possibility, denoted by e_0 and referred to as the null experiment, of directly choosing an action without performing any experiment. The expected utility of performing no experiment and choosing that action a which maximises the (prior) expected utility is

u*(e_0) = max_a Sum_j u(e_0, a, E_j) P(E_j),

so that an experiment e is worth performing if and only if u*(e) > u*(e_0).

We have seen, in Section 2.6, that to solve a sequential decision problem we start at the last stage and work backwards. We are now in a position to determine the best possible experiment: this is that e which maximises u*(e) in the class of available experiments.

Proposition 2.25. (Optimal experimental design). The optimal action is to perform that experiment e* for which u*(e*) = max_e u*(e), provided that u*(e*) > u*(e_0); otherwise, the optimal action is to perform no experiment.

Proof. This is immediate from the backwards induction argument used to solve sequential decision problems.
In our subsequent development we will use a simplified notation which suppresses these possible dependencies, in order to centre attention on other, more important, aspects of the problem. In particular, to simplify the notation and without danger of confusion, we shall always use u* ("u-bar") to denote an expected utility, whatever the elements on which it depends.
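The comparison of u*(e) with u*(e_0) can be made concrete for finite partitions. The following is a minimal numerical sketch (the function names and the two-hypothesis example are ours, not the book's):

```python
def max_expected_utility(weights, utility):
    """max over actions a of sum_j weights[j] * utility[a][j]."""
    return max(sum(w * u for w, u in zip(weights, row)) for row in utility)

def ubar_null(prior, utility):
    """u*(e0): expected utility of acting on prior beliefs alone."""
    return max_expected_utility(prior, utility)

def ubar_experiment(prior, likelihood, utility):
    """u*(e): for each possible datum D_i, choose the action maximising
    posterior expected utility, then average over the data distribution."""
    total = 0.0
    for i in range(len(likelihood[0])):
        # joint[j] = P(E_j) P(D_i | E_j); maximising the joint-weighted sum
        # over actions equals P(D_i | e) times the posterior expected utility
        joint = [prior[j] * likelihood[j][i] for j in range(len(prior))]
        total += max_expected_utility(joint, utility)
    return total

# hypothetical example: action a_1 pays 1 if E_1 occurs, a_2 pays 1 if E_2
prior = [0.5, 0.5]
utility = [[1.0, 0.0], [0.0, 1.0]]      # utility[a][j] = u(a, E_j)
likelihood = [[0.8, 0.2], [0.2, 0.8]]   # likelihood[j][i] = P(D_i | E_j)
```

Here u*(e_0) = 0.5 while u*(e) = 0.8, so this (cost-free) experiment is worth performing.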
It is often interesting to determine the value which additional information might have in the context of a given decision problem. The expected value of the information provided by new data may be computed as the (posterior) expected difference between the utilities which correspond to the optimal actions after and before the data have been obtained.

Definition 2.18. (The value of additional information). (i) The expected value of the data D_i provided by an experiment e is

v(e, D_i) = Sum_j { u(a*_{D_i}, E_j) - u(a*_0, E_j) } P(E_j | D_i, e),

where a*_{D_i} and a*_0 are, respectively, the optimal actions given D_i and with no data, i.e., such that

Sum_j u(a*_{D_i}, E_j) P(E_j | D_i, e) = max_a Sum_j u(a, E_j) P(E_j | D_i, e),
Sum_j u(a*_0, E_j) P(E_j) = max_a Sum_j u(a, E_j) P(E_j);

(ii) the expected value of an experiment e is given by

v(e) = Sum_i v(e, D_i) P(D_i | e).

It is sometimes convenient to have an upper bound for the expected value v(e) of an experiment e. Let us therefore consider the optimal actions which would be available with perfect information, i.e., were we to know the particular event E_j which will eventually occur. For each j, let a*_j be the optimal action conditional on E_j, so that, for all a, u(a*_j, E_j) >= u(a, E_j). The opportunity loss which would be suffered if action a were taken and event E_j occurred is then

l(a, E_j) = u(a*_j, E_j) - u(a, E_j) >= 0,

so that, for a = a*_0, the optimal action under prior information, the loss suffered by choosing any other action a will be l(a*_0, E_j).

Definition 2.19. (Expected value of perfect information). If a*_0 is the optimal action under prior information, the expected value of perfect information is given by

v*(e_0) = Sum_j l(a*_0, E_j) P(E_j).

Since l(a*_0, E_j) measures the value of perfect information conditional on E_j, its expected value will provide, under appropriate conditions, an upper bound for the increase in utility which additional data about the E_j's could be expected to provide.
It is important to bear in mind that the functions v(e, D_i), v(e), v*(e_0) and the number u*(e_0) all crucially depend on the (prior) probability distributions {P(E_j)}, {P(D_i | e)}, although, for notational convenience, we have not made this dependency explicit.

In many situations, the utility function u(e, a, D_i, E_j) may be thought of as made up of two separate components: one is the (experimental) cost of performing e and obtaining D_i; the other is the (terminal) utility of directly choosing a and then finding that E_j occurs. Assuming additivity of the two components, we may write

u(e, a, D_i, E_j) = u(a, D_i, E_j) - c(e, D_i),   c(e, D_i) >= 0.

Often, the terminal component does not actually depend on the preceding e and D_i, so that u(a, D_i, E_j) = u(a, E_j); moreover, the probability distributions over the events are often independent of the action taken. When these conditions apply, we can establish a useful upper bound for the expected value of an experiment in terms of the difference between the expected value of perfect information and the expected cost of the experiment itself.

Proposition 2.27. (Additive decomposition). If the utility function has the form

u(e, a, D_i, E_j) = u(a, E_j) - c(e, D_i),   c(e, D_i) >= 0,

and the probability distributions are such that P(E_j | D_i, e, a) = P(E_j | D_i, e), then, for any available experiment e,

v(e) <= v*(e_0) - c*(e),

where c*(e) = Sum_i c(e, D_i) P(D_i | e) is the expected cost of e.

Proof. Using Definitions 2.18 and 2.19, together with the additive form of the utility function and the fact that, for each D_i and each j, u(a*_{D_i}, E_j) <= u(a*_j, E_j), we have

v(e) = Sum_i Sum_j { u(a*_{D_i}, E_j) - c(e, D_i) - u(a*_0, E_j) } P(E_j | D_i, e) P(D_i | e)
     <= Sum_j { u(a*_j, E_j) - u(a*_0, E_j) } P(E_j) - c*(e)
     = v*(e_0) - c*(e),

since Sum_i P(E_j | D_i, e) P(D_i | e) = P(E_j).
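The bound of Proposition 2.27 is easy to verify numerically. A minimal sketch, reusing the hypothetical two-hypothesis example above (function names and numbers are ours):

```python
def evpi(prior, utility):
    """v*(e0): expected value of perfect information."""
    u0 = max(sum(p * u for p, u in zip(prior, row)) for row in utility)
    # with perfect information, the best action is chosen for each E_j
    u_perfect = sum(p * max(row[j] for row in utility)
                    for j, p in enumerate(prior))
    return u_perfect - u0

def value_of_experiment(prior, likelihood, utility, cost):
    """v(e), with a constant additive cost c(e, D_i) = cost for every datum."""
    u0 = max(sum(p * u for p, u in zip(prior, row)) for row in utility)
    total = 0.0
    for i in range(len(likelihood[0])):
        joint = [prior[j] * likelihood[j][i] for j in range(len(prior))]
        total += max(sum(w * u for w, u in zip(joint, row)) for row in utility)
    return total - cost - u0

prior = [0.5, 0.5]
utility = [[1.0, 0.0], [0.0, 1.0]]
likelihood = [[0.8, 0.2], [0.2, 0.8]]
```

With cost 0.1, v(e) = 0.2, while v*(e_0) - c*(e) = 0.5 - 0.1 = 0.4, so the bound holds with room to spare: an imperfect experiment is worth less than perfect information.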
The results on quantitative coherence (Sections 2.2 to 2.5) establish that if we aspire to analyse a given decision problem in accordance with the axioms of quantitative coherence, we must represent degrees of belief about uncertain events in the form of a finite probability measure over E, and values for consequences in the form of a utility function over C. The probability measure represents an individual's beliefs conditional on his or her current state of information; options are then to be compared on the basis of expected utility. We have shown, moreover, that the simple decision problem structure introduced in Section 2.2, and the tools developed in Sections 2.3 to 2.5, suffice for the analysis of complex, sequential problems which, at first sight, appear to go beyond that simple structure. In particular, we have seen that the important problem of experimental design can be analysed within the sequential decision problem framework; later, we shall study in more detail the special case of experimental design in situations where data are being collected for the purpose of pure inference, rather than as an input into a directly practical decision problem.

2.7 INFERENCE AND INFORMATION

2.7.1 Reporting Beliefs as a Decision Problem

We shall now use this framework to analyse the very special form of decision problem posed by statistical inference. Given the initial state of information described by M_0, and further information in the form of the assumed occurrence of a significant event G, an individual's beliefs are described by the corresponding conditional probability measure. We now wish to specialise our discussion somewhat to the case where G can be thought of as a description of the outcome of an investigation (typically a survey, or an experiment) involving the deliberate collection of data (usually, in numerical form). The event G will then be defined directly in terms of the counts or measurements obtained, either as a precise statement of the recorded values, or involving a description of intervals within which readings lie.
In what follows, we shall see that the analysis of this specialised problem establishes the fundamental relevance of these foundational arguments for statistical theory and practice.
To emphasise the fact that G characterises the actual data collected, we shall denote the event which describes the new information obtained by D. An individual's degree of belief measure over E will then be denoted P(. | D), where, for notational convenience, we have suppressed the explicit dependence on the initial state of information M_0.

By way of preliminary clarification, let us recall from Section 2.1 that we distinguished two, possibly distinct, reasons for trying to think rationally about uncertainty. On the one hand, we know that our statements of uncertainty may be used by others in contexts representable within the decision framework, even if an immediate decision problem does not appear to exist. On the other hand, we noted that an inference, or inference statement, may sometimes be regarded as an end in itself, to be judged independently of any "practical" decision problem. It is this second case that we wish to consider in more detail in this section.

So far as uncertainty about the events of E is concerned, P(. | D) constitutes a complete encapsulation of the information provided by D: it provides all that is necessary for the calculation of the expected utility of any option and, hence, in conjunction with the specification of a utility function, for the solution of any decision problem defined in terms of the frame of reference adopted. Starting from the decision problem framework, we thus have a formal justification for the main topic of this book, namely, the study of models and techniques for analysing the way in which beliefs are modified by data. Our main purpose in this section is therefore to demonstrate that the problem of reporting inferences is essentially a special case of a decision problem. However,
many eminent writers have argued that basic problems of reporting scientific inferences do not fall within the framework of decision problems as defined in earlier sections:

"Statistical inferences involve the data, a specification of the set of possible populations sampled and a question concerning the true population. ... Decisions are based on not only the considerations listed for inferences, but also on an assessment of the losses resulting from wrong decisions. ... The differences between these two situations seem to the author many and wide." (Cox, 1958)

"... a considerable body of doctrine has attempted to explain, or rather to reinterpret, these (significance) tests on the basis of quite a different model, namely as means to making decisions in an acceptance procedure." (Fisher, 1956/1973)

Related positions may be found in Ramsey (1926) and Lehmann (1959/1986). If views such as these were accepted, they would, of course, undermine our conclusion that problems concerning uncertainty are to be solved by revising degrees of belief in the light of new data in accordance with Bayes' theorem. We shall argue, however, that even when an inference is regarded as an end in itself, it can be regarded as falling within the general framework of the earlier sections, so that, again, our conclusion holds.
To complete the basic decision problem framework, we must specify the set of available actions and the utilities of the resulting consequences. From a strictly realistic viewpoint, a pure inference problem may be described as one in which we seek to learn which of a set of mutually exclusive "hypotheses" ("theories", "states of nature", or "model parameters"), say {H_j, j in J}, is true. We shall regard this set of hypotheses as equivalent to a partition of the certain event into events {E_j, j in J}, having the interpretation E_j = "the hypothesis H_j is true", conditional on some relevant data D. In the discussion which follows, we shall only consider the case of finite partitions: there is always, implicitly, a finite set of such hypotheses compatible with the information D, although it may be mathematically convenient to work as if this were not the case. Mathematical extensions will be discussed in Chapter 3.

The inference reporting problem can thus be viewed as one of choosing a probability distribution to serve as an inference statement: the action space A relates to the class Q of conditional probability distributions over {E_j, j in J}, the latter constituting the uncertain events corresponding to each action, and the record of what the individual put forward as an appropriate inference statement, together with what actually turned out to be the case, will be a consequence. Indeed, we know from our earlier development that options cannot be ordered without an (implicit or explicit) specification of utilities for the consequences. A particular form of utility function for inference statements will be introduced below, and it will then be seen that the idea of inference as decision leads to rather natural interpretations of commonly used information measures in terms of expected utility. We shall consider this specification and its implications in the following sections.
Formalising the first sentence of the remark of Cox, given above, a pure inference problem is one in which we seek to learn, on the basis of data D, which of a class of exclusive and exhaustive "hypotheses" {E_j, j in J} is true, where {E_j, j in J} is a partition consisting of elements of E. The actions available to an individual are the various inference statements that might be made about the events {E_j, j in J}. If we aspire to quantitative coherence in such a framework, we know that our uncertainty about the {E_j, j in J} should be represented by {P(E_j | D), j in J}, where P(. | D) denotes our current degree of belief measure, given data D in addition to the initial information M_0. But there is nothing (so far) in this formulation which leads to the conclusion that the best action is to state one's actual beliefs.

2.7.2 The Utility of a Probability Distribution

We have argued above that the provision of a statistical inference statement about a partition {E_j, j in J}, compatible with the information D, may be precisely stated as a decision problem.
We complete the specification of this decision problem by inducing the preference ordering through direct specification of a utility function u{q, E_j}, which describes the "value" of reporting the probability distribution q = {q_j, j in J} as the final inferential summary of the investigation, were E_j to turn out to be the true "state of nature". Here, q_j is assumed to be the probability which, conditional on the available data D, an individual reports as the probability of E_j, i.e., of the hypothesis H_j being true. The set of consequences C consists of all pairs (q, E_j), representing the conjunctions of reported beliefs and true hypotheses, and the action corresponding to the choice of q is defined as {(q, E_j), j in J}. Throughout this section, we shall denote by p = {p_j, j in J} the probability distribution which describes, conditional again on the available data D, the individual's actual beliefs about the alternative "hypotheses". We emphasise again that, in the structure described so far, there is no logical requirement which forces an individual to report the distribution p which describes his or her personal beliefs.

To avoid triviality, we assume that none of the hypotheses is certain and that all are significant given D, in the sense that, for all j in J, the empty event is strictly contained in E_j intersected with D, which is strictly contained in D. It then follows from Proposition 2.17(iii) that each of the personal degrees of belief attached by the individual to the conflicting hypotheses, given the data, must be strictly positive. If this were not so, we could simply discard any incompatible hypotheses.

Our next task is to investigate the properties which such a utility function should possess in order to describe a preference pattern which accords with what a scientific community ought to demand of an inference statement. It seems natural to assume that score functions should be smooth, in the intuitive sense that small changes in the reported distribution should produce only small changes in the obtained score.
Definition 2.20. (Score function). A score function u for probability distributions q = {q_j, j in J} defined over a partition {E_j, j in J} is a mapping which assigns a real number u{q, E_j} to each pair (q, E_j). The function is said to be smooth if it is continuously differentiable as a function of each q_j.

This special class of utility functions is often referred to as the class of score functions (see also Section 2.8), since the functions describe the possible "scores" to be awarded to the individual as a "prize" for his or her "prediction". The mathematical condition imposed is a simple and convenient representation of the intuitive notion of smoothness.
We have characterised the problem faced by an individual reporting his or her beliefs about conflicting "hypotheses" as a problem of choice among probability distributions q = {q_j, j in J} over {E_j, j in J}, with preferences described by a score function. This is a well specified decision problem, whose solution, in accordance with our development based on quantitative coherence, is to report that distribution q which maximises the expected utility

Sum_j u{q, E_j} P(E_j | D),

taken over the class Q of all probability distributions over {E_j, j in J}. Whether a scientific report presents the inference of a single scientist, or a range of inferences purporting to represent those that might be made by some community of scientists, we should wish to be reassured that any reported inference could be justified as a genuine current belief. In order to ensure that a coherent individual is also honest, we need a form of u{., .} which guarantees that the expected utility is maximised if, and only if, the reported distribution coincides with the individual's actual beliefs; otherwise, the individual's best policy could be to report something other than his or her true beliefs. This motivates the following definition.

Definition 2.21. (Proper score function). A score function u is proper if, for each strictly positive probability distribution p = {p_j, j in J} defined over a partition {E_j, j in J}, the supremum

sup_q Sum_j u{q, E_j} p_j,

taken over the class Q of all probability distributions q = {q_j, j in J} over {E_j, j in J}, is attained if, and only if, q = p.

The simplest proper score function is the quadratic function (Brier, 1950; de Finetti, 1962), defined below. Smooth,
proper score functions have been successfully used in practice in the following contexts: (i) to determine an appropriate fee to be paid to meteorologists in order to encourage them to report reliable predictions (Murphy and Epstein, 1967); (ii) to score multiple choice examinations so that students are encouraged to assign, over the possible answers, probability distributions which truly describe their beliefs (de Finetti, 1965); (iii) to devise general procedures to elicit personal probabilities and expectations (Savage, 1971); and (iv) to select best subsets of variables for prediction purposes in political or medical contexts (Bernardo and Bermudez, 1985).
Definition 2.22. (Quadratic score function). A quadratic score function for probability distributions q = {q_j, j in J} defined over a partition {E_j, j in J} is any function of the form

u{q, E_j} = A ( 2 q_j - Sum_i q_i^2 ) + B_j,   A > 0.

An alternative expression for the quadratic score function is

u{q, E_j} = A ( 1 - Sum_i (1_{ij} - q_i)^2 ) + B_j,

where 1_{ij} = 1 if i = j and 0 otherwise, which makes explicit the role of a "penalty" equal to the squared euclidean distance from q to a perfect prediction.

Proposition 2.28. A quadratic score function is proper.

Proof. We have to maximise, over q, the expected score

Sum_j p_j { A ( 2 q_j - Sum_i q_i^2 ) + B_j }.

Taking derivatives with respect to the q_j's and equating them to zero, we obtain the system of equations 2 p_j - 2 q_j = 0, so that q_j = p_j for all j. It is easily checked that this gives a maximum.

Note that in the proof of Proposition 2.28 we did not need to use the condition Sum_k q_k = 1: this is a rather special feature of the quadratic score function.

A further condition is required for score functions in contexts, which we shall refer to as "pure inference problems", where the value of a distribution is to be assessed solely in terms of "the truth". It is natural in such problems that if E_j turns out to be true, the individual scientist should be assessed (i.e., scored) only on the basis of his or her reported judgement about the plausibility of E_j.

Definition 2.23. (Local score function). A score function u for probability distributions q = {q_j, j in J} defined over a partition {E_j, j in J} is local if, for each j in J, there exist functions u_j(.) such that u{q, E_j} = u_j(q_j).

It is intuitively clear that the preferences of an individual scientist faced with a pure inference problem should correspond to the ordering induced by a local score function.
Under a local score function, the value of a reported distribution, when a particular event in the partition turns out to be true, is assessed only in terms of the probability which that distribution assigned to the actual outcome.
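The propriety of the quadratic score (Proposition 2.28) can be checked numerically: honest reporting beats any distortion of one's beliefs. A minimal sketch (function names and the numerical beliefs are ours, not the book's); note that the quadratic score depends on the whole vector q, so it is proper but not local:

```python
def quadratic_score(q, j, A=1.0, B=0.0):
    """u{q, E_j} = A(2 q_j - sum_i q_i^2) + B_j, with all B_j = B here."""
    return A * (2.0 * q[j] - sum(x * x for x in q)) + B

def expected_score(p, q):
    """Expected quadratic score of reporting q when actual beliefs are p."""
    return sum(p[j] * quadratic_score(q, j) for j in range(len(p)))

p = [0.7, 0.2, 0.1]   # hypothetical actual beliefs over three hypotheses
```

Reporting p itself yields expected score 0.54 (which equals Sum p_j^2 with A = 1, B = 0); any other report, whether overconfident or hedged toward uniformity, scores strictly less in expectation.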
The situation described by a local score function is, of course, an idealised, limit situation, but one which seems, at least approximately, appropriate in reporting pure scientific research. This can be contrasted with the forms of "score" function that would typically be appropriate in more directly practical contexts. In stock control, for example, probability judgements about demand would usually be assessed in the light of the relative seriousness of under- or over-stocking, rather than by just concentrating on the belief previously attached to what turned out to be the actual level of demand. In addition, by permitting a different u_j(.) for each E_j in Definition 2.23, so that the functional form of the dependence of the score on the probability attached to the true E_j is allowed to vary with the particular E_j considered, we allow for the possibility that "bad predictions" regarding some "truths" may be judged more harshly than others. Later in this section we shall see that certain well-known criteria for choosing among experimental designs are optimal if, and only if, preferences are described by a smooth, proper, local score function.

Proposition 2.29. (Characterisation of proper local score functions). If u is a smooth, proper, local score function for probability distributions q = {q_j, j in J} defined over a partition {E_j, j in J} which contains more than two elements, then it must be of the form

u{q, E_j} = A log q_j + B_j,

where A > 0 and the B_j's are arbitrary constants.

Proof. Since u{., .} is local and proper, writing p = {p_j, j in J} for the distribution describing actual beliefs, we seek functions {u_j(.)} such that

sup_q Sum_j u_j(q_j) p_j,

subject to Sum_j q_j = 1, is attained when q = p. Introducing a multiplier L for the constraint, {q_j, j in J} must make

F(q) = Sum_j u_j(q_j) p_j + L ( 1 - Sum_j q_j )

stationary at q = p; to make F stationary it is necessary (see, e.g., Jeffreys and Jeffreys, 1946, p. 315) that
the derivative of F with respect to each q_j vanishes at q = p. This yields the system of equations

u_j'(p_j) p_j = L,   j in J,

where u_j' denotes the derivative of u_j, so that all the functions u_j must satisfy the same differential equation, p u_j'(p) = L. Hence, u_j(p) = A log p + B_j for some constants A and B_j and, since u is proper, the extremal found must indeed be a maximum; the condition A > 0 suffices to guarantee this.

If the partition {E_j, j in J} only contains two elements, so that the partition is simply {H, H^c}, then the locality condition is, of course, vacuous, since the score function may be written as a function of the probability q_1 attached to H alone, whether or not H occurs. Writing u{(q_1, 1 - q_1), H} = f(q_1, 1) and u{(q_1, 1 - q_1), H^c} = f(q_1, 0), for u to be proper we must have, for all 0 < p < 1,

sup_{q_1} { p f(q_1, 1) + (1 - p) f(q_1, 0) }

attained if, and only if, q_1 = p, a functional equation which admits solutions other than the logarithmic form.

Definition 2.24. (Logarithmic score function). A logarithmic score function for strictly positive probability distributions q = {q_j, j in J} defined over a partition {E_j, j in J} is any function of the form

u{q, E_j} = A log q_j + B_j,   A > 0.
In the two-element case, the logarithmic form f(x, 1) = A log x + B_1, f(x, 0) = A log(1 - x) + B_2 is then just one of the many possible solutions (see Good, 1952). It is worth noting, however, that since we place no (strictly positive) lower bound on the possible q_j, we have an example of an unbounded decision problem, i.e., a decision problem without extreme consequences. From a technical point of view, this needs to be borne in mind, particularly within the mathematical extensions to be considered in Chapter 3.

If preferences are described by a logarithmic score function, we have no difficulty in calculating expected utilities, and, because the logarithmic score function is proper, the expected utility of reporting q is maximised if, and only if, q = p.

Proposition 2.30. (Expected loss in probability reporting). If preferences are described by a logarithmic score function, then the expected loss of utility in reporting a probability distribution q = {q_j, j in J}, defined over a partition {E_j, j in J}, rather than the distribution p = {p_j, j in J} representing actual beliefs, is given by

d{q | p} = A Sum_j p_j log ( p_j / q_j ),   A > 0;

moreover, d{q | p} >= 0, with equality if, and only if, q = p.

Proof. The expected utility of reporting q when p is the actual distribution of beliefs is u*(q) = Sum_j ( A log q_j + B_j ) p_j, and thus the expected loss is u*(p) - u*(q) = A Sum_j p_j log(p_j / q_j). The final statement is a consequence of the fact that the logarithmic score function is proper, so that u*(p) >= u*(q), with equality if, and only if, q = p.

2.7.3 Approximation and Discrepancy

We have argued that the optimal solution to an inference reporting problem (either for an individual, or for each of several individuals) is to state the appropriate actual beliefs, p. Often, however, the precise computation of p may be difficult, and we may choose instead to report an approximation, q say, which is much easier to calculate, on the grounds that q is "close" to p. The justification of such a procedure requires a study of the notion of "closeness" between two distributions.
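Proposition 2.30 identifies the expected loss with what is now usually called the Kullback-Leibler divergence, and the identity u*(p) - u*(q) = d{q | p} is easy to verify numerically. A minimal sketch with A = 1 and all B_j = 0 (the helper names and numbers are ours):

```python
import math

def expected_log_score(p, q, A=1.0):
    """Expected logarithmic score (B_j = 0) of reporting q under beliefs p."""
    return sum(pj * A * math.log(qj) for pj, qj in zip(p, q) if pj > 0)

def discrepancy(p, q):
    """d{q | p} = sum_j p_j log(p_j / q_j): the expected loss of Prop. 2.30."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p = [0.5, 0.3, 0.2]   # hypothetical actual beliefs
q = [0.4, 0.4, 0.2]   # a candidate approximation
```

The discrepancy vanishes only for q = p, and otherwise equals exactly the drop in expected logarithmic score incurred by reporting q.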
Combining Propositions 2.29 and 2.30, it is clear that, when preferences are described by a logarithmic score function, we have available a natural, general measure of "lack of fit", or discrepancy, between a distribution and an approximation.

Definition 2.25. (Discrepancy of an approximation). The discrepancy between a strictly positive probability distribution p = {p_j, j in J} over a partition {E_j, j in J} and an approximation p^ = {p^_j, j in J} is defined by

d{p^ | p} = Sum_j p_j log ( p_j / p^_j ).

By Proposition 2.30, d{p^ | p} >= 0, with equality if, and only if, p^ = p. An immediate direct proof is obtained using the fact that, for all x > 0, log x <= x - 1, with equality if, and only if, x = 1. The quantity d{p^ | p}, which arises here as a difference between two expected utilities, was introduced by Kullback and Leibler (1951) as an ad hoc measure of (directed) divergence between two probability distributions.

Generally speaking, the "tails" of the distribution are extremely important in pure inference problems: an individual with preferences approximately described by a proper local score function should beware of approximating by zero the probability of any event which is not actually impossible. This is in contrast to many practical decision problems, where the form of the utility function often makes the solution robust with respect to changes in the "tails" of the distribution assumed.

The behaviour of d{p^ | p} is well illustrated by a familiar, elementary example.

Example 2.5. (Poisson approximation to a binomial distribution). Consider the binomial distribution

Bi(x | n, t) = C(n, x) t^x (1 - t)^(n - x),   x = 0, 1, ..., n,

and let Pn(x | n t), the Poisson distribution with mean n t, be its Poisson approximation.
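The discrepancy of Example 2.5 can be computed directly, summing over the support of the binomial distribution. A minimal sketch (function names are ours; logarithms are taken to base 2, as in the book's figure):

```python
import math

def binom(x, n, t):
    """Binomial probability Bi(x | n, t)."""
    return math.comb(n, x) * t ** x * (1.0 - t) ** (n - x)

def poisson(x, lam):
    """Poisson probability Pn(x | lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def delta_bin_poisson(n, t):
    """Discrepancy (in bits) of the Poisson approximation with mean n*t,
    summed over the binomial support x = 0, ..., n."""
    lam = n * t
    return sum(binom(x, n, t) * math.log(binom(x, n, t) / poisson(x, lam), 2)
               for x in range(n + 1))
```

Evaluating this for several (n, t) pairs reproduces the qualitative behaviour described below: the discrepancy is always positive, and it shrinks as n increases or t decreases.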
Figure 2.7 Discrepancy between a binomial distribution and its Poisson approximation (logarithms to base 2)

It is apparent from Figure 2.7, which plots the discrepancy against t for several values of n, that d{p^ | p} decreases as either n increases or t decreases, or both, and that the second factor is far more important than the first. It also follows from our previous discussion that it would not be a good idea to reverse the roles and try to approximate a Poisson distribution by a binomial distribution, since the latter assigns zero probability to outcomes which the former does not. The best approximation within a given family will be that which minimises the discrepancy; Definition 2.25 thus provides a systematic approach to approximation in pure inference contexts.

When logarithms to base 2 are used, the utility and the discrepancy are measured on the well-known scale of bits of information (or entropy), which can be interpreted in terms of the expected number of yes-no questions required to identify the true event in the partition (see, for example, de Finetti, 1962/1970, and Renyi, 1970/1974, p. 564).

2.7.4 Information

We have seen that, for quantitative coherence, any new information D should be incorporated into the analysis by updating beliefs via Bayes' theorem, so that the initial representation of beliefs P(.) is updated to the conditional probability measure P(. | D). We have also seen that, within the context of the pure inference reporting problem, utility is defined in terms of the logarithmic score function.
In this context, because of the use of the logarithmic score function, expected utility assumes a special form, which establishes a link between utility theory and classical information theory. This motivates the following definitions.

Definition 2.26. (Information from data). The amount of information about {E_j, j in J} provided by the data D, when the initial distribution over {E_j, j in J} is p_0 = {P(E_j), j in J}, is defined to be

I(D | p_0) = Sum_j P(E_j | D) log [ P(E_j | D) / P(E_j) ],

where {P(E_j | D), j in J} is the conditional probability distribution, given the data D.

Proposition 2.31. (Expected utility of data). If preferences are described by a logarithmic score function for the class of probability distributions defined over a partition {E_j, j in J}, then, when the initial distribution over {E_j, j in J} is p_0, the expected increase in utility provided by the data D is A I(D | p_0), where A > 0 is arbitrary. Moreover, this expected increase in utility is non-negative, and is zero if, and only if, P(E_j | D) = P(E_j) for all j.

Proof. By Definition 2.24, the utilities of reporting P(.) or P(. | D), respectively, were E_j known to be true, would be A log P(E_j) + B_j and A log P(E_j | D) + B_j, so that the increase in utility provided by D, were E_j true, is A log [ P(E_j | D) / P(E_j) ]. Conditional on D, the expected increase in utility is therefore

Sum_j P(E_j | D) A log [ P(E_j | D) / P(E_j) ] = A I(D | p_0),

which, by Proposition 2.30, is non-negative, and is zero if, and only if, P(E_j | D) = P(E_j) for all j.
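The quantity I(D | p_0) of Definition 2.26 is straightforward to compute once the posterior has been obtained by Bayes' theorem. A minimal sketch in bits (the likelihood values are a hypothetical illustration, not from the book):

```python
import math

def bayes_update(prior, likelihood):
    """P(E_j | D) from P(E_j) and P(D | E_j) via Bayes' theorem."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

def information(posterior, prior):
    """I(D | p0) in bits: sum_j P(E_j|D) log2( P(E_j|D) / P(E_j) )."""
    return sum(q * math.log(q / p, 2)
               for q, p in zip(posterior, prior) if q > 0)

prior = [0.5, 0.5]
posterior = bayes_update(prior, [0.9, 0.3])   # hypothetical P(D | E_j) values
```

Here the posterior is (0.75, 0.25), and the information provided by D is about 0.19 bits; if the data leave the beliefs unchanged, the information is zero, in accordance with Proposition 2.31.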
It follows from Definition 2.26 that the amount of information provided by data D is equal to the discrepancy δ{p_0 | p_D} which results if p_0 = {P(E_j), j ∈ J} is considered as an approximation to p_D = {P(E_j | D), j ∈ J}.

Another interesting interpretation of I(D | p_0) arises from the following analysis. Conditional on E_j, known to be true, −log P(E_j) and −log P(E_j | D) measure, respectively, how good the initial and the conditional distributions are in "predicting" the "true hypothesis" E_j, so that log P(E_j | D) − log P(E_j) is a measure of the value of D, were E_j known to be true; I(D | p_0) is simply the expected value of that difference, calculated with respect to p_D. It should be clear from the preceding discussion that I(D | p_0) measures indirectly the information provided by the data in terms of the changes produced in the probability distribution of interest. The amount of information is thus seen to be a relative measure, which obviously depends on the initial distribution.

Attempts to define absolute measures of information have systematically failed to produce concepts of lasting value. In the finite case, the entropy of the distribution p = {p_1, . . . , p_n}, defined by

H{p} = − Σ_j p_j log p_j,

has been proposed and widely accepted as an absolute measure of uncertainty, with −H{p} as a measure of "absolute" information. The fact that, in the finite case, this seems to work correctly is explained (from our perspective) by the fact that, when the initial distribution is uniform,

I(D | p_0) = log n − H{p_D},

so that, apart from an unimportant additive constant, −H{p} may be interpreted, in terms of the above discussion, as the amount of information which is necessary to obtain p = {p_1, . . . , p_n} from an initial discrete uniform distribution (see Section 3.2.2). The recognised fact that the apparently natural extension of entropy to the continuous case does not make sense (if only because it is heavily dependent on the particular parametrisation used) should, however, have raised doubts about the universality of this concept. For detailed discussion of H{p} and other proposed entropy measures, see Renyi (1961). As we shall see in detail later, the problem of extending the entropy concept to continuous distributions is closely related to that of defining an "origin" or "reference" measure of uncertainty in the continuous case, a role unambiguously played by the uniform distribution in the finite case.

We shall on occasion wish to consider the amount of information which may be expected from an experiment e, the expectation being calculated before the results of the experiment are actually available.
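The identity linking entropy to information relative to a uniform initial distribution can be verified numerically. In this sketch (the conditional distribution is an assumed illustrative value), natural logarithms are used throughout:

```python
import math

def entropy(p):
    """H{p} = -sum_j p_j log p_j  (natural logarithms)."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)

def information_from_data(prior, posterior):
    return sum(q * math.log(q / p) for p, q in zip(prior, posterior) if q > 0)

pD = [0.7, 0.2, 0.1]          # assumed conditional distribution
n = len(pD)
uniform = [1.0 / n] * n

# With a uniform initial distribution, I(D | p0) = log n - H{pD}:
assert abs(information_from_data(uniform, pD)
           - (math.log(n) - entropy(pD))) < 1e-12
```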
Definition 2.27. (Expected information from an experiment). The expected information to be provided by an experiment e about a partition {E_j, j ∈ J}, corresponding to the initial distribution p_0 = {P(E_j), j ∈ J}, is

I(e | p_0) = Σ_i P(D_i) I(D_i | p_0),

where the possible results {D_i, i ∈ I} of the experiment e occur with probabilities {P(D_i), i ∈ I}.

Proposition 2.32. (Alternative expression for the expected information).

I(e | p_0) = Σ_i Σ_j P(E_j ∩ D_i) log [ P(E_j ∩ D_i) / ( P(E_j) P(D_i) ) ].

Moreover, I(e | p_0) ≥ 0, with equality if, and only if, for all E_j and D_i, P(E_j ∩ D_i) = P(E_j) P(D_i).

Proof. By Definitions 2.26 and 2.27,

I(e | p_0) = Σ_i P(D_i) Σ_j P(E_j | D_i) log [ P(E_j | D_i) / P(E_j) ],

and the first result follows from the fact that, by Bayes' theorem, P(E_j ∩ D_i) = P(E_j | D_i) P(D_i) = P(D_i | E_j) P(E_j). Since I(D_i | p_0) ≥ 0 for each i, it follows that I(e | p_0) ≥ 0, with equality if, and only if, for all i and j, P(E_j | D_i) = P(E_j). ◁

The expression for I(e | p_0) given by Proposition 2.32 is Shannon's (1948) measure of expected information. We have thus found, in Proposition 2.32, a natural interpretation of this famous measure: Shannon's expected information is the expected utility provided by an experiment in a pure inference context, when an individual's preferences are described by a smooth, proper, local score function.

In conclusion, we have suggested that the problem of reporting inferences can be viewed as a particular decision problem, and thus should be analysed within
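Proposition 2.32 can be checked numerically: the weighted average of Definition 2.27 and the joint-distribution (Shannon) form agree term by term. In this sketch the joint probabilities P(E_j ∩ D_i) are our own illustrative values:

```python
import math

# Assumed joint probabilities P(Ej ∩ Di): rows are results Di, columns events Ej.
joint = [[0.30, 0.10],
         [0.20, 0.40]]

pD = [sum(row) for row in joint]            # P(Di)
p0 = [sum(col) for col in zip(*joint)]      # P(Ej), the initial distribution

def info(prior, posterior):
    return sum(q * math.log(q / p) for p, q in zip(prior, posterior) if q > 0)

# Definition 2.27: I(e | p0) = sum_i P(Di) I(Di | p0).
I_e = sum(pDi * info(p0, [pij / pDi for pij in row])
          for pDi, row in zip(pD, joint))

# Proposition 2.32 (Shannon's form): sum over the joint distribution.
I_shannon = sum(pij * math.log(pij / (p0j * pDi))
                for pDi, row in zip(pD, joint)
                for p0j, pij in zip(p0, row))

assert I_e > 0                       # the events and results are dependent
assert abs(I_e - I_shannon) < 1e-12  # the two expressions agree
```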
the framework of decision theory. We have established that, within this framework, preferences should be described by a logarithmic score function, and that discrepancy and amount of information are naturally defined in terms of expected loss of utility and expected increase in utility, respectively. We have also seen that maximising expected Shannon information is a particular instance of maximising expected utility. We shall see in Section 3.4 how these results, established here for finite partitions, extend straightforwardly to the continuous case, with a natural characterisation of an individual's utility function when faced with a pure inference problem.

2.8 DISCUSSION AND FURTHER REFERENCES

2.8.1 Operational Definitions

In everyday conversation, the way in which we use language is typically rather informal and unselfconscious, and we tolerate each other's ambiguities and vacuities for the most part, occasionally seeking an ad hoc clarification of a particular statement or idea if the context seems to justify the effort required in trying to be a little more precise. (For a detailed account of the ambiguities which plague qualitative probability expressions in English, see Mosteller and Youtz, 1990.)

In the context of scientific and philosophical discourse, however, there is a paramount need for statements which are meaningful and unambiguous. The everyday, tolerant, ad hoc response will therefore no longer suffice. More rigorous habits of thought are required, and we need to be self-consciously aware of the precautions and procedures to be adopted if we are to arrive at statements which make sense, rather than remaining at the level of mere words or phrases. A prerequisite for "making sense" is that the fundamental concepts which provide the substantive content of our statements should themselves be defined in an essentially unambiguous manner.

This kind of approach to definitions is closely related to the philosophy of pragmatism, as formulated in the second half of the nineteenth century by Peirce, who insisted that clarity in thinking about concepts could only be achieved by concentrating attention on the conceivable practical effects associated with a concept, or the practical consequences of adopting one form of definition rather than another. In Peirce (1878), this point of view was summarised as follows:

Consider what effects, that might conceivably have practical bearings, we conceive the object of our conception to have. Then, our conception of these effects is the whole of our conception of the object.

We are thus driven to seek for definitions of fundamental notions which can be reduced ultimately to the touchstone of actual or potential personal experience.
Since many critics of the personalistic Bayesian viewpoint claim to find great difficulty with this feature of the approach, often suggesting that it undermines the entire theory, we have stressed this aspect of our thinking at several points in Section 2.2, where we made the practical, operational idea of preference between options the fundamental starting point and touchstone for all other definitions. Throughout this work, we shall seek to adhere to the operational approach to defining concepts in order to arrive at meaningful and unambiguous statements in the context of representing beliefs and taking actions in situations of uncertainty.

In some respects, however, this position is not entirely satisfactory, in that it fails to go far enough in elaborating what is to be understood by the term "practical". This crucial elaboration was provided by Bridgman (1927) in a book entitled The Logic of Modern Physics, where the key idea of an operational definition is introduced and illustrated by considering the concept of "length":

. . . what do we mean by the length of an object? We evidently know what we mean by length if we can tell what the length of any and every object is, and for the physicist nothing more is required. To find the length of an object, we have to perform certain physical operations. The concept of length is therefore fixed when the operations by which length is measured are fixed: that is, the concept of length involves as much as, and nothing more than, the set of operations by which length is determined. In general, we mean by any concept nothing more than a set of operations; the concept is synonymous with the corresponding set of operations. . . . If the concept is physical, the operations are actual physical measurements; or if the concept is mental, the operations are mental operations.

We also noted the inevitable element of idealisation, or approximation, implicit in the operational approach to our concepts, and it is worth noting Bridgman's very explicit recognition that all experience is subject to error and that all we can do is to take sufficient precautions when specifying sets of operations to ensure that remaining unspecified variations in procedure have negligible effects on the results of interest. This is well illustrated by Bridgman's account of the operational concept of length and its attendant idealisations and approximations:

. . . we take a measuring rod, lay it on the object so that one of its ends coincides with one end of the object, mark on the object the position of the other end of the rod, then move the rod along in a straight line extension of its previous position until the first end coincides with the previous position of the second end, repeat this process as often as we can, and call the length the total number of times the rod was applied. This procedure, apparently so simple, is in practice exceedingly complicated, and doubtless a full description of all the precautions that must be taken would fill a large treatise. We must, for example, be sure that the temperature of the rod is the standard temperature at which its length is defined, or else we must make a correction for it; or we must correct for the gravitational distortion of the rod if we measure a vertical length; or we must be sure that the rod is not a magnet or is not subject to electrical forces. . . . we must go further and specify all the details by which the rod is moved from one position to the next on the object: its precise path through space and its velocity and acceleration in getting from one position to another. Practically, of course, precautions such as these are not taken, but the justification is in our experience that variations of procedure of this kind are without effect on the final result.

This pragmatic recognition that there are inevitable limitations in any concrete application of a set of operational procedures is precisely the spirit of our discussion of Axioms 4 and 5 in Section 2.3. In principle, we could indefinitely refine our measurement operations; in practice, we have to stop somewhere. What matters is to be able to achieve sufficient accuracy to avoid unacceptable distortion in any analysis of interest.

2.8.2 Quantitative Coherence Theories

In a comprehensive review of normative decision theories leading to the expected utility criterion, Fishburn (1981) lists over thirty different axiomatic formulations of the principles of coherence, reflecting a variety of responses to the underlying conflict between axiomatic simplicity and structural flexibility in the representation of decision problems. Fishburn sums up the dilemma as follows:

On the one hand, we would like our axioms to be simple, interpretable, intuitively clear, and capable of convincing others that they are appealing criteria of coherency and consistency in decision making under uncertainty, but to do this it seems essential to invoke strong structural conditions. On the other hand, we would like our theory to adhere to the loose structures that often arise in realistic decision situations, but if this is done then we will be faced with fairly complicated axioms that accommodate these loose structures.

In addition, in conformity with the operational philosophy outlined above, we should like the definitions of the basic concepts of probability and utility to have strong and direct links with practical assessment procedures. With these considerations in mind, our purpose here is to provide a brief historical review of the foundational writings which seem to us the most significant. This will serve in part to acknowledge our general intellectual indebtedness and orientation, and in part to explain and further motivate our own particular choice of axiom system.

The earliest axiomatic approach to the problem of decision making under uncertainty is that of Ramsey (1926), who presented the outline of a formal system. The key postulate in Ramsey's theory is the existence of a so-called ethically neutral event.
Such an event E, expressed in terms of our notation for options, has the property that {c_1 | E, c_2 | E^c} ~ {c_2 | E, c_1 | E^c} for any consequences c_1, c_2. It is then rather natural to define the degree of belief in such an event to be 1/2, and, from this quantitative basis, it is straightforward to construct an operational measure of utility for consequences. This, in turn, is used to extend the definition of degree of belief to general events by means of an expected utility model. Ramsey's theory seems to us, as indeed it has to many other writers, a revolutionary landmark in the history of ideas. However, the treatment is rather incomplete, and no mathematical completion of Ramsey's theory seems to have been published, although a closely related development can be found in Pfanzagl (1967, 1968). It was not until 1954, with the publication of Savage's (1954) book The Foundations of Statistics, that the first complete formal theory appeared.

Savage's major innovation in structuring decision problems is to define what he calls acts (options, in our terminology) as functions from the set of uncertain possible outcomes into the set of consequences. His key coherence assumption is then that of a complete, transitive order relation among acts, and this is used to define qualitative probabilities. These are extended into quantitative probabilities by means of a "continuously divisible" assumption about events. Utilities are subsequently introduced using ideas similar to those of von Neumann and Morgenstern (1944/1953), who had, ten years earlier, presented an axiom system for utility alone, assuming the prior existence of probabilities. The Savage axiom system is a great historical achievement and provides the first formal justification of the personalistic approach to probability and decision making; for a modern appraisal, see Shafer (1986) and the lively ensuing discussion. Of course, many variations on an axiomatic theme are possible, and other Savage-type axiom systems have been developed since by Stigum (1972), Roberts (1974), Fishburn (1975) and Narens (1976). See, also, Hens (1992). Suppes (1956) presented a system which combined elements of Savage's and Ramsey's approaches.

There are, however, two major difficulties with Savage's approach. The first of these stems from the "continuously divisible" assumption about events, which Savage uses as the basis for proceeding from qualitative to quantitative concepts. From a mathematical point of view, such an assumption imposes severe constraints on the allowable forms of structure for the set of uncertain outcomes: in fact, it even prevents the theory from being directly applicable to situations involving a finite or countably infinite set of possible outcomes. One way of avoiding this embarrassing structural limitation is to introduce a quantitative element into the system by a device like that of Ramsey's ethically neutral event. All Ramsey requires is that (at least) one such event be included in the representation of the uncertain outcomes; this is directly defined to have probability 1/2, and thus enables him to get the quantitative ball rolling without imposing undue constraints on the structure. Indeed, a generalisation of Ramsey's idea re-emerges in the form of canonical lotteries, introduced by Anscombe and Aumann (1963) for defining degrees of belief, and by Pratt, Raiffa and Schlaifer (1964, 1965) as a basis for simultaneously quantifying personal degrees of belief and utilities in a direct and intuitive manner. The basic idea is essentially that of a standard measuring device, in some sense external to the real-world events and options of interest. It seems to us that this idea ties in perfectly with the kind of operational considerations described above, and the standard events and options that we introduced in Section 2.3 play this fundamental operational role in our own system. Other systems using standard measuring devices (sometimes referred to as external scaling devices) are those of Fishburn (1967b, 1969) and Balch and Fishburn (1974). A theory which, like ours, combines a standard measuring device with a fundamental notion of conditional preference is that of Luce and Krantz (1971).

The second major difficulty with Savage's theory, and one that also exists in many other theories (see Table 1 in Fishburn, 1981), is that the Savage axioms imply the boundedness of utility functions (an implication of which Savage was apparently unaware when he wrote The Foundations of Statistics, but which was subsequently proved by Fishburn). The theory does not therefore justify the use of many mathematically convenient and widely used utility functions, for example, those implicit in forms such as "quadratic loss" and "logarithmic score".

In our view, in presenting the basic quantitative coherence axioms it is important not to confuse the primary definitions and coherence principles with the secondary issues of the precise forms of the various sets involved. We take the view, already hinted at in our brief discussion of medical and monetary consequences in Section 2.5, that it is often conceptually and mathematically convenient to be able to use structural representations going beyond what we perceive to be the essentially finitistic and bounded characteristics of real-world problems. For this reason, we have so far always taken options to be defined by finite partitions, and, within this simple structure, uncomplicated by structural complexities, we hope that the essence of the quantitative coherence theory has already been clearly communicated. Motivated by considerations of mathematical convenience, we shall, however, in Chapter 3, relax the constraint imposed on the form of the action space. We shall then arrive at a sufficiently general setting for all our subsequent developments and applications.

2.8.3 Related Theories

Our previous discussion centred on complete axiomatic approaches to decision problems, involving a unified development of both probability and utility concepts. However, there have been a number of attempted developments of probability ideas separate from utility considerations, as well as separate developments of utility ideas presupposing the existence of probabilities.
In addition, there is a considerable literature on information-theoretic ideas closely related to those of Section 2.7. In this section, we shall provide a summary overview of a number of these related theories, together with some brief comments, grouped under the following subheadings: (i) Monetary Bets and Degrees of Belief; (ii) Scoring Rules and Degrees of Belief; (iii) Axiomatic Approaches to Degrees of Belief; (iv) Axiomatic Approaches to Utilities; and (v) Information Theories. For the most part, we shall simply give what seem to us the most important historical references. The first two topics will, however, be treated at greater length, partly because of their close relation with the main concerns of this book, and partly because of their connections with the important practical topic of the assessment of beliefs.

Monetary Bets and Degrees of Belief

An elegant demonstration that coherent degrees of belief satisfy the rules of (finitely additive) probability was given by de Finetti (1937/1964), without explicit use of the utility concept. If consequences are assumed to be monetary, de Finetti's approach can be summarised as follows. If, given an arbitrary monetary sum m and an uncertain event E, an individual's preferences among options are such that {pm | Ω} ~ {m | E, 0 | E^c} (for any E and m), then the individual's degree of belief in E is defined to be p. This definition is virtually identical to Bayes' own definition of probability (see our later discussion under the heading of Axiomatic Approaches to Degrees of Belief). In modern economic terminology, probability can be considered to be a marginal rate of substitution or, more simply, a kind of "price".

It is now straightforward to verify that coherent degrees of belief have the properties of finitely additive probabilities. To demonstrate that 0 ≤ p ≤ 1, we can argue as follows. An individual who assigns p > 1 is implicitly agreeing to pay a stake larger than m to enter a gamble in which the maximum prize he or she can win is m; an individual who assigns p < 0 is implicitly agreeing to offer a gamble in which he or she will pay out either m or nothing in return for a negative stake, which is equivalent to paying an opponent to enter such a gamble. In either case, a bet can be arranged which will result in a certain loss to the individual, a so-called "Dutch book", and avoidance of this possibility requires that 0 ≤ p ≤ 1.

To demonstrate the additive property of degrees of belief for exclusive and exhaustive events E_1, . . . , E_n, we proceed as follows. If an individual specifies p_1, . . . , p_n to be his or her degrees of belief in those events, this is an implicit agreement to pay a total stake of p_1 m_1 + · · · + p_n m_n in order to enter a gamble resulting in a prize of m_i if E_i occurs, and thus a "gain", or "net return", of g_i = m_i − (p_1 m_1 + · · · + p_n m_n), which could, of course, be negative. Given that an individual has specified his or her degrees of belief for some collection of events by repeated use of the above definition, either it is possible to arrange a form of monetary bet in terms of these events which is such that the individual will certainly lose, for some choice of the m_i's, or such an arrangement is impossible; in the latter case, the individual is said to have specified a coherent set of degrees of belief. In order to avoid the possibility of the m_i's being chosen in such a way as to guarantee the negativity of the g_i's, it is necessary that the determinant of the matrix relating the m_i's to the g_i's be zero, so that the linear system cannot be solved for an arbitrary choice of negative g_i's; this turns out to require that p_1 + p_2 + · · · + p_n = 1. Moreover, it is easy to check that this is also a sufficient condition for coherence: it implies Σ_i p_i g_i = 0, and hence the impossibility of all the returns being negative.

The extension of these ideas to cover the revision of degrees of belief conditional on new information proceeds in a similar manner, except that an individual's degree of belief in an event E conditional on an event F is defined to be the number q such that, given any monetary sum m, we have the equivalence {qm | Ω} ~ {m | E ∩ F, 0 | E^c ∩ F, qm | F^c}, according to the individual's preference ordering among options. The interpretation of this definition is straightforward: having paid a stake of qm, if F occurs we are confronted with a gamble with prizes m if E occurs and nothing otherwise; if F does not occur, the bet is "called off" and the stake returned.

However, despite the intuitive appeal of this simple and neat approach, it has two major shortcomings from an operational viewpoint. In the first place, it is clear that the definitions cannot be taken seriously in terms of arbitrary monetary sums: the "perceived value" of a stake or a return is not equivalent to its monetary value, and the missing "utility" concept is required in order to overcome the difficulty. This point was later recognised by de Finetti (see Kyburg and Smokler, 1964/1980, p. 62, footnote (a)), but it has its earlier origins in the celebrated St. Petersburg paradox (first discussed in terms of utility by Daniel Bernoulli, 1730/1954). An ad hoc modification of de Finetti's approach would be to confine attention to "small" stakes (thus, in effect, restricting attention to a range of outcomes over which the "utility" can be taken as approximately linear), and the argument, thus modified, has considerable pedagogical and, perhaps, practical use. Additionally, one may explicitly recognise that some people have a positive utility for gambling (see, for example, Conlisk, 1993). For further discussion of possible forms of "utility for money", see Pratt (1964), Lindley (1971/1985, Chapter 5) and Hull et al. (1973); see, also, LaValle (1968).

In addition to the problem of "non-linearity in the face of risk", there is also the difficulty that unwanted game-theoretic elements may enter the picture if we base a theory on ideas such as "opponents" choosing the levels of prizes in gambles. For this reason, de Finetti himself later preferred to use an approach based on scoring rules. A more formal argument based on the avoidance of certain losses in betting formulations has been given by Freedman and Purves (1969).
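A minimal sketch of the Dutch book argument just described (our own illustration, with assumed numerical values): for exclusive, exhaustive events E_1, . . . , E_n with announced degrees of belief p_1, . . . , p_n, the individual pays total stake Σ_i p_i m_i and receives the prize m_j when E_j occurs, for a net return g_j = m_j − Σ_i p_i m_i.

```python
def net_gains(p, m):
    """Net return in each of the n possible outcomes."""
    stake = sum(pi * mi for pi, mi in zip(p, m))
    return [mj - stake for mj in m]

# Incoherent beliefs (sum = 1.2 > 1): equal prizes force a sure loss.
# (For a sum below 1, negative prizes would do the same.)
assert all(g < 0 for g in net_gains([0.5, 0.4, 0.3], [1.0, 1.0, 1.0]))

# Coherent beliefs (sum = 1): the p-weighted net returns sum to zero,
# so no choice of prizes can make every outcome a loss.
p = [0.5, 0.3, 0.2]
for m in ([1.0, 1.0, 1.0], [3.0, -1.0, 2.0], [0.5, 2.0, -4.0]):
    g = net_gains(p, m)
    assert abs(sum(pj * gj for pj, gj in zip(p, g))) < 1e-12
    assert any(gj >= 0 for gj in g)
```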
Related arguments have also been used by Cornfield (1969), Heath and Sudderth (1972) and Buehler (1976) to expand on de Finetti's concept of coherent systems of bets.

Scoring Rules and Degrees of Belief

The scoring rule approach to the definition of degrees of belief, and the derivation of their properties when constrained to be coherent, is due to de Finetti (1963, 1964), with important subsequent generalisations by Savage (1971) and Lindley (1982a). In terms of the quadratic scoring rule, the development proceeds as follows. Given an uncertain event E, an individual is asked to select a number, p, with the understanding that if E occurs he or she is to suffer a penalty (or loss) of L = (1 − p)², whereas if E does not occur he or she is to suffer a penalty of L = p². The number p which the individual chooses is defined to be his or her degree of belief in E. Using the indicator function for E, a concept we have already introduced, the penalty can be written in the general form L = (1_E − p)².

Suppose now that E_1, . . . , E_n are an exclusive and exhaustive collection of uncertain events for which the individual, using the quadratic scoring rule scheme, has to specify degrees of belief p_1, . . . , p_n, subject now to the penalty

L = (1_{E_1} − p_1)² + · · · + (1_{E_n} − p_n)².

Given such a specification, either it is possible to find an alternative specification q_1, . . . , q_n such that, for any assignment of the value 1 to one of the indicators and 0 to the others, the penalty is strictly smaller, or it is not possible to find such q_1, . . . , q_n. In the latter case, the individual is said to have specified a coherent set of degrees of belief. A simple geometric argument now establishes that, for coherence, we must have 0 ≤ p_i ≤ 1, for i = 1, . . . , n, and p_1 + p_2 + · · · + p_n = 1. To see this, note that the n logically compatible assignments of values 1 and 0 to the indicators define n points in ℝⁿ. Thinking of p_1, . . . , p_n as defining a further point in ℝⁿ, the coherence condition can be reinterpreted as requiring that this latter point cannot be moved in such a way as to reduce its distance from all the other n points; that is, p_1, . . . , p_n must define a point in the convex hull of the other n points. The underlying idea in this development is clearly very similar to that of de Finetti's (1937/1964) approach, where the avoidance of a "Dutch book" is the basic criterion of coherence.
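The sense in which the quadratic scoring rule elicits degrees of belief honestly can be illustrated as follows (a sketch with assumed values): for an individual whose degree of belief in E is p, the expected penalty of reporting q is p(1 − q)² + (1 − p)q², which is minimised by the honest report q = p.

```python
def expected_penalty(p, q):
    """Expected quadratic penalty when P(E) = p and the report is q."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

p = 0.3                                  # assumed degree of belief
grid = [i / 100 for i in range(101)]     # candidate reports on [0, 1]
best_q = min(grid, key=lambda q: expected_penalty(p, q))
assert abs(best_q - p) < 1e-9            # honesty is optimal
```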
The extension of this approach to cover the revision of degrees of belief conditional on new information proceeds as follows. An individual's degree of belief in an event E conditional on the occurrence of an event F is defined to be the number q which he or she chooses when confronted with a penalty defined by L = 1_F (1_E − q)², a formulation which is clearly related to the idea of "called-off" bets used in de Finetti's 1937 approach. The interpretation of this penalty is straightforward: if F occurs, the specification of q proceeds according to the penalty (1_E − q)²; if F does not occur, there is no penalty.

To derive the constraints on p, q and r imposed by coherence, we argue as follows. Suppose that, in addition to the conditional degree of belief q, the numbers p and r are the individual's degrees of belief, respectively, for the events E ∩ F and F, specified subject to the penalty L = (1_{E∩F} − p)² + (1_F − r)². If u, v and w are the values which the total penalty takes in the cases where E ∩ F, E^c ∩ F and F^c occur, respectively, then p, q, r satisfy the equations

u = (1 − q)² + (1 − p)² + (1 − r)²,
v = q² + p² + (1 − r)²,
w = p² + r².

If p, q, r defined a point where the Jacobian of the transformation defined by the above equations did not vanish, it would be possible to move from that point in a direction which simultaneously reduced the values of u, v and w, whatever the logically compatible outcomes of the events are. Coherence therefore requires that the Jacobian be zero. A simple calculation shows that this reduces to the condition q = p/r, which is, again, Bayes' theorem. De Finetti's "penalty criterion" and related ideas have been critically re-examined by a number of authors; relevant additional references are Myerson (1979), Regazzini (1983), Gatsonis (1984), Piccinato (1986), Eaton (1992) and Gilio (1992a).

Axiomatic Approaches to Degrees of Belief

Historically, the idea of probability as "degree of belief" has received a great deal of distinguished support, including contributions from James Bernoulli (1713/1899), Laplace (1774/1986, 1814/1952), De Morgan (1847) and Borel (1924/1964). So far as we know, however, none of these writers attempted an axiomatic development of the idea. The first recognisably "axiomatic" approach to a theory of degrees of belief was that of Bayes (1763), and the magnitude of his achievement has been clearly recognised in the two centuries following his death by the adoption of the adjective Bayesian as a description of the philosophical and methodological developments which have been inspired, directly or indirectly, by his essay.
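The called-off penalty calculation can likewise be illustrated numerically (a sketch, with P(E ∩ F) and P(F) assumed for illustration): minimising the expected penalty under L = 1_F (1_E − q)² recovers q = P(E ∩ F)/P(F), in accordance with Bayes' theorem.

```python
p_EF = 0.2      # P(E ∩ F)  (assumed)
p_F = 0.5       # P(F), so P(Ec ∩ F) = p_F - p_EF = 0.3

def expected_called_off_penalty(q):
    """E[ 1_F (1_E - q)^2 ] = P(E∩F)(1-q)^2 + P(Ec∩F) q^2."""
    return p_EF * (1 - q) ** 2 + (p_F - p_EF) * q ** 2

grid = [i / 1000 for i in range(1001)]
best_q = min(grid, key=expected_called_off_penalty)
assert abs(best_q - p_EF / p_F) < 1e-9   # optimal report is P(E | F) = 0.4
```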
Bayes' profound contribution was recognised in the two centuries following his death by the adoption of the adjective Bayesian as a description of the philosophical and methodological developments which have been inspired, directly or indirectly, by his essay. See, for example, the evaluations of his work by Geisser (1980a), by Good (1980a) and by Lindley (1980a), in the volume edited by Zellner (1980). Bayes' formulation is, of course, extremely informal, and a more formal, modern approach only began to emerge a century and a half later, in a series of papers by Wrinch and Jeffreys (1919, 1921).

Formal axiom systems which wholeheartedly embrace the principle of revising beliefs through systematic use of Bayes' theorem are discussed in detail by Jeffreys (1931/1973, 1939/1961), whose profound philosophical and methodological contributions to Bayesian statistics are now widely recognised. By present day standards, however, the flavour of Jeffreys' approach seems to us to place insufficient emphasis on the inescapably personal nature of degrees of belief, resulting in an overconcentration on "conventional" representations of degrees of belief derived from "logical" rather than operational considerations (despite the fact that Jeffreys was highly motivated by real world applications!). Similar criticisms seem to us to apply to the original and elegant formal development given by Cox (1946, 1961), who showed that the probability axioms constitute the only consistent extension of ordinary (Aristotelian) logic in which degrees of belief are represented by real numbers; see, also, Jaynes (1958). The work of Keynes (1921/1929) and Carnap (1950/1962) deserves particular mention and will be further discussed later in this section. See Good (1965, Chapter 2) for a discussion of the variety of attitudes to probability compatible with a systematic use of the Bayesian paradigm. We should point out that our emphasis on operational considerations and the subjective character of degrees of belief would be criticised by many colleagues who, in other respects, share a basic commitment to the Bayesian approach to statistical problems.

There are, of course, many other examples of axiomatic approaches to quantifying uncertainty in some form or another. In the finite case, this includes work by Kraft et al. (1959), Scott (1964), Fishburn (1970, 1982), Domotor and Stelzer (1971), Krantz et al. (1971), Suppes and Zanotti (1976, 1982), Heath and Sudderth (1978), Luce and Narens (1978), French (1982) and Chuaqui and Malitz (1983). Fishburn (1986) provided an authoritative review of the axiomatic foundations of subjective probability, which is followed by a long, stimulating discussion.

Axiomatic Approaches to Utilities

Assuming the prior existence of probabilities, von Neumann and Morgenstern (1944/1953) presented axioms for coherent preferences which led to a justification of utilities as numerical measures of value for consequences, and to the optimality criterion of maximising expected utility. Much of Savage's (1954/1972) system
was directly inspired by this seminal work of von Neumann and Morgenstern, and the influence of their ideas extends into a great many of the systems we have mentioned. Other early developments which concentrate on the utility aspects of the decision problem include those of Friedman and Savage (1948, 1952), Marschak (1950), Arrow (1951a), Herstein and Milnor (1953), Edwards (1954) and Debreu (1960). DeGroot (1970, Chapter 7) presents a general axiom system for utilities which imposes rather few mathematical constraints on the underlying decision problem structure. General accounts of utility are given in the books by Blackwell and Girshick (1954), Luce and Raiffa (1957), Chernoff and Moses (1959) and Fishburn (1970). Discussions of the experimental measurement of utility are provided by Edwards (1954), Davidson et al. (1957), Suppes and Walsh (1959), DeGroot (1963), Becker and McClintock (1967), Savage (1971) and Hull et al. (1973). Multiattribute utility theory is discussed, among others, by Fishburn (1964) and Keeney and Raiffa (1976). Other discussions of utility theory include Fishburn (1967a, 1988b) and Machina (1982, 1987); extensive bibliographies are given in Savage (1954/1972) and Fishburn (1968, 1981).

Information Theories

Measures of information are closely related to ideas of uncertainty and probability, and there is a considerable literature exploring the connections between these topics. The logarithmic information measure was proposed independently by Shannon (1948) and Wiener (1948) in the context of communication engineering; Lindley (1956) later suggested its use as a statistical criterion in the design of experiments. The logarithmic divergency measure was first proposed by Kullback and Leibler (1951) and was subsequently used as the basis for an information-theoretic approach to statistics by Kullback (1959/1968). A formal axiomatic approach to measures of information in the context of uncertainty was provided by Good (1966). Seminal references are reprinted in Page (1968); other relevant references on information concepts are Renyi (1964, 1966, 1967) and Särndal (1970).

The mathematical results which lead to the characterisation of the logarithmic scoring rule for reporting probability distributions have been available for some considerable time. Logarithmic scores seem to have been first suggested by Good (1952), who has made numerous contributions to the literature of the foundations of decision making and the evaluation of evidence, but he only dealt with dichotomies, for which the uniqueness result is not applicable. The first characterisation of the logarithmic score for a finite distribution was attributed to Gleason by McCarthy (1956); Aczél and Pfanzagl (1966), Arimoto (1970) and Savage (1971) have also given derivations of this form of scoring rule under various regularity conditions. See, also, Schervish et al. (1990). By considering the inference reporting problem as a particular case of a decision problem, we have provided (in Section 2.7) a natural, unifying account of the fundamental and close relationship between information-theoretic ideas and the Bayesian treatment of "pure inference" problems. Based on the work of Bernardo (1979a), this analysis will be extended, in Chapter 3, to cover continuous distributions.
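To make the logarithmic divergency concrete, here is a small sketch (our illustration, with hypothetical distributions): the Kullback-Leibler divergence between two finite distributions is non-negative and vanishes only when they coincide, which is the property underlying the propriety of the logarithmic score.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler logarithmic divergency between finite distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]   # "true" distribution (hypothetical numbers)
q = [0.4, 0.4, 0.2]   # an alternative reported distribution

# E_p[log p(x)] - E_p[log q(x)] = KL(p || q) >= 0, so the expected
# logarithmic score is maximised by reporting p itself.
print(kl_divergence(p, q) > 0)           # True
print(abs(kl_divergence(p, p)) < 1e-12)  # True
```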
2.8.4 Critical Issues

We shall conclude this chapter by providing a summary overview of our position in relation to some of the objections commonly raised against the foundations of Bayesian statistics. These will be dealt with under the following subheadings: (i) Dynamic Frame of Discourse; (ii) Updating Subjective Probability; (iii) Relevance of an Axiomatic Approach; (iv) Structure of the Set of Relevant Events; (v) Prescriptive Nature of the Axioms; (vi) Precise, Complete, Quantitative Preferences; (vii) Subjectivity of Probability; (viii) Statistical Inference as a Decision Problem; and (ix) Communication and Group Decision Making.

Dynamic Frame of Discourse

As we indicated in Chapter 1, our concern in this volume is with coherent beliefs and actions in relation to a limited set of specified possibilities, currently assumed necessary and sufficient to reflect key features of interest in the problem under study. In the language of Section 2.2, we are operating in terms of a fixed frame of discourse, defined in the light of our current knowledge and assumptions. We accept that the mode of reasoning encapsulated within the quantitative coherence theory as presented here is ultimately conditional, and thus not directly applicable to every phase of the scientific process. In the more general, evolving, dynamic context of scientific learning and decision making, this activity constitutes only one static phase of the wider process, and has to be viewed as sandwiched between two other vital processes: on the one hand, the creative generation of the set of possibilities to be considered, either potentially or actually, by inventing new models or theories; on the other hand, the critical questioning of the adequacy of the currently entertained set of possibilities (see, for example, Box, 1980).

The problem of generating the frame of discourse seems to us to be one which currently lies outside the purview of any "statistical" formalism. Substantive subject-matter inputs would seem to be of primary importance, and the activity offers considerable intellectual excitement and satisfaction in its own right; but, as many critics have pointed out, there can be no purely "statistical" theory of model formulation, although the process of passing from such ideas to their mathematical representation can often be subjected to formal analysis. Informal exploratory data analysis is no doubt a necessary adjunct, particularly in the context of the possibilities opened up by modern computer graphics. But we do not accept, as Box (1980) appears to, that alternative formal statistical theories have a convincing, complementary role to play, although some limited formal clarification is actually possible within the Bayesian framework.

The problem of criticising the frame of discourse also seems to us to remain essentially unsolved by any "statistical" theory. On the one hand, exploratory diagnostic probing would seem to have a role to play in confirming that specific forms of local elaboration of the frame of discourse should be made. The logical catch here, however, is that such specific diagnostic probing can only stem from the prior realisation that the corresponding specific elaborations might be required. The latter could therefore be incorporated ab initio into the frame of discourse and a fully coherent analysis carried out. The issue here is one of pragmatic convenience, rather than of circumscribing the scope of the coherent theory; we shall return to these issues in Chapter 6. On the other hand, in the absence of such "externally" directed revision or extension of the current frame of discourse, it is not clear what questions one should pose in order to arrive at an "internal" assessment of adequacy in the light of the information thus far available. In the case of a "revolution", or change in scientific paradigm (Kuhn, 1962), the issue is resolved for us as statisticians by the consensus of the subject-matter experts, and we simply begin again on the basis of the frame of discourse implicit in the new paradigm. But the issue of assessing adequacy in relation to a total absence of any specific suggested elaborations seems to us to remain an open problem. Indeed, it is not clear that the "problem" as usually posed is well-formulated: is the key issue that of "surprise", or is some kind of extension of the notion of a decision problem required in order to give an operational meaning to the concept of "adequacy"? Readers interested in this topic will find in Box (1980), and the ensuing discussion, a range of reactions. Related issues arise in discussions of the general problem of assessing, or "calibrating", the external, empirical performance of an internally coherent individual; see, for example, Dawid (1982a).

Overall, our responses to critics who question the relevance of the coherent approach based on a fixed frame of reference can be summarised as follows. So far as the scope and limits of Bayesian theory are concerned: (i) we acknowledge that the mode of reasoning encapsulated within the quantitative coherence theory is ultimately conditional, and thus not directly applicable to every phase of the scientific process; (ii) informal, exploratory techniques are an essential part of the process of generating ideas, but this aspect of the scientific process is not part of the foundational debate; (iii) we all lack a decent theoretical formulation of and solution to the problem of global model criticism in the absence of concrete suggested alternatives. Moreover, critics of the Bayesian approach should recognise that an enormous amount of current theoretical and applied statistical activity is concerned with the analysis of uncertainty in the context of models which are accepted, for the purposes of the analysis, as working frames of discourse, subject only to local probing of specific potential elaborations.
Our arguments thus far, and those to follow, are an attempt to convince the reader that within this latter context there are compelling reasons for adopting the Bayesian approach to statistical theory and practice.

Updating Subjective Probability

An issue related to the topic just discussed is that of the mechanism for updating subjective probabilities. In Section 2.4, we defined, in terms of a conditional uncertainty relation, the notion of the conditional probability, P(E | G), of an event E given the assumed occurrence of an event G. From this, we derived Bayes' theorem, which establishes that

P(E | G) = P(G | E) P(E) / P(G).

If we actually know for certain that G has occurred, P(E | G) becomes our actual degree of belief in E, and the prior probability P(E) has been updated to the posterior probability P(E | G). The quantitative coherence approach is based on the assumption that this identification is legitimate, an assumption which leads to the Bayesian paradigm for the revision of belief. However, a number of authors have questioned whether it is justified to identify assessments made conditional on the assumed occurrence of G with actual beliefs once G is known. Detailed discussion and relevant references can be found in Diaconis and Zabell (1982), who discuss, albeit from different perspectives, Jeffrey's rule (Jeffrey, 1965/1983). See, also, Good (1977), who examines the role of temporal coherence, and Goldstein (1985). We shall not pursue this issue further, although we acknowledge its interest and potential importance.

Relevance of the Axiomatic Approach

Arguments against overconcern with foundational issues come in many forms. At one extreme, we have heard proponents of supposedly "model-free" exploratory methodology proclaim that we can evolve towards "good practice" by simply giving full encouragement to the creative imagination and then "seeing what works". At the other extreme, we have heard Bayesian colleagues argue that the mechanics and flavour of the Bayesian inference process have their own sufficient, direct, intuitive appeal and do not need axiomatic reinforcement. Another form of this argument asserts that developments from axiom systems are "pointless" because the conclusions are, tautologically, contained in the premises. Our objection to both these attitudes is that they each implicitly assume the existence of a commonly agreed notion of what constitutes "desirable statistical practice". This does not seem to us a reasonable assumption at all; at the very least, an operational definition of the notion is required. Moreover, we simply do not accept that the methodological imperatives which flow from the assumptions of quantitative coherence are in any way "obvious" to someone contemplating the axioms.
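Returning to the updating mechanism itself, the passage from prior to posterior via Bayes' theorem is easily illustrated numerically. The following sketch is ours, with hypothetical numbers not taken from the text.

```python
# Hypothetical illustration of Bayes' theorem:
#   P(E | G) = P(G | E) P(E) / P(G),
# with P(G) computed by the law of total probability.
p_E = 0.01             # prior degree of belief in E
p_G_given_E = 0.95     # degree of belief in G, assuming E
p_G_given_notE = 0.10  # degree of belief in G, assuming not-E

p_G = p_G_given_E * p_E + p_G_given_notE * (1 - p_E)
p_E_given_G = p_G_given_E * p_E / p_G
print(round(p_E_given_G, 4))  # 0.0876: the prior 0.01 updated on learning G
```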
Structure of the Set of Relevant Events

But is the structure assumed for the set of relevant events too rigid? In particular, is it reasonable to assume that, in each and every context involving uncertainty, desirable practice requires the logical description of the possibilities to be forced into the structure of an algebra (or σ-algebra), in which each event has the same logical status? It seems to us that this may not always be reasonable, and that there is a potential need for further research into the implications of applying appropriate concepts of quantitative coherence to event structures other than simple algebras. Interestingly, this problem has already been considered in relation to the foundations of quantum mechanics, where the notion of "sample space" has been generalised to allow for the simultaneous representation of the outcomes of a set of "related" experiments (see, for example, Randall and Foulis, 1975). In that context, it has been established that there exists a natural extension of the Bayesian paradigm to the more general setting.

Another form of query relating to the logical status of events is sometimes raised (see, for example, Barnard, 1980a). This draws attention to the interpretational asymmetry between a statement like "the underlying distribution is normal" and its negation, and raises questions about their implicitly symmetric treatment within the framework given in Section 2.2. Choices of the elements to be included in the set of relevant events are, for the most part, bound up with general questions of "modelling", and the issue here seems to us to be one concerning sensible modelling strategies. We shall return to this topic in Chapters 4 and 6.

Another area where the applicability of the standard paradigm has been questioned is that of so-called "knowledge-based expert systems", which often operate on knowledge representations involving complex and loosely structured spaces of possibilities, including hierarchies and networks. Proponents of such systems have argued that (Bayesian) probabilistic reasoning is incapable of analysing these structures and that novel forms of quantitative representation of uncertainty are required (see Spiegelhalter and Knill-Jones, 1984, for references to these ideas, which include "fuzzy logic", "belief functions" and "confirmation theory"). Such alternative proposals are, for the most part, ad hoc, and the challenge to the probabilistic paradigm seems to us to be elegantly answered by Lauritzen and Spiegelhalter (1988). We shall return to this topic later in this section.

Prescriptive Nature of the Axioms

When introducing our formal development, we emphasised that the Bayesian foundational approach is prescriptive and not descriptive.
We are not concerned with sociological or psychological description of actual behaviour; we are concerned with understanding how we ought to proceed, if we wish to avoid a specified form of behavioural inconsistency, such as the Dutch-book inconsistencies discussed earlier. Despite this, many critics of the Bayesian approach have somehow taken comfort from the fact that there is empirical evidence, from experiments involving hypothetical gambles, which suggests that people often do not act in conformity with the coherence axioms: see, for example, Allais (1953) and Ellsberg (1961). See, also, Wallsten (1974), Allais and Hagen (1979), Kahneman and Tversky (1979), Savage (1980), Kahneman et al. (1982), Machina (1987), Bordley (1992), Luce (1992) and Yilmaz (1992).

Allais' criticism is based on a study of the actual preferences of individuals in contexts where they are faced with pairs of hypothetical situations, in each of which a choice has to be made between two options. In one such pair, with C standing for current assets and the numbers describing thousands of units of a familiar currency, the options are:

Situation 1:
  Option 1:  500 + C with probability 1.00;
  Option 2:  2500 + C with probability 0.10, 500 + C with probability 0.89, C with probability 0.01.
Situation 2:
  Option 3:  500 + C with probability 0.11, C with probability 0.89;
  Option 4:  2500 + C with probability 0.10, C with probability 0.90.

It has been found (see, for example, Allais and Hagen, 1979) that there are a great many individuals who prefer option 1 to option 2 in the first situation, and at the same time prefer option 4 to option 3 in the second situation. To examine the coherence of these two revealed preferences, we note that, if they are to correspond to a consistent utility ordering, there must exist a utility function u(.), defined over consequences (in this case, total assets in thousands of monetary units), satisfying the inequalities

u(500 + C) > 0.10 u(2500 + C) + 0.89 u(500 + C) + 0.01 u(C),
0.10 u(2500 + C) + 0.90 u(C) > 0.11 u(500 + C) + 0.89 u(C).

But simple rearrangement reveals that these inequalities are logically incompatible for any function u(.): the stated preferences are incoherent. How should one react to this conflict between the compelling intuitive attraction (for many individuals) of the originally stated preferences, and the realisation that they are not in accord with the prescriptive requirements of the formal theory? Allais and his followers would argue that the force of examples of this kind is so powerful that it undermines the whole basis of the axiomatic approach set out in Section 2.3. This seems to us a very peculiar argument. It is as if one were to argue for the abandonment of ordinary logical or arithmetic rules, on the grounds that individuals can often be shown to perform badly at deduction or long division. The conclusion to be drawn is surely the opposite: namely, the more liable people are to make mistakes, the more need there is to have the formal prescription available, both as a reference point, to enable us to discover the kinds of mistakes and distortions to which we are prone in ad hoc reasoning, and also as a suggestive source of improved strategies for thinking about and structuring problems.

Savage (1954/1972, Chapter 5) pointed out that a concrete realisation of the options described in the two situations could be achieved by viewing the outcomes as prizes from a lottery involving one hundred numbered tickets, as shown in Table 2.4.

Table 2.4  Savage's reformulation of Allais' example

                          Ticket number
                     1         2-11       12-100
Situation 1
  Option 1       500 + C    500 + C     500 + C
  Option 2          C      2500 + C     500 + C
Situation 2
  Option 3       500 + C    500 + C        C
  Option 4          C      2500 + C        C

When the problem is set out in this form, it is clear that if any of the tickets numbered from 12 to 100 is chosen it will not matter, in either situation, which of the options is selected. Preferences in both situations should therefore only depend on considerations relating to tickets in the range from 1 to 11. But, for this range of tickets, situations 1 and 2 are identical in structure, so that preferring option 1 to option 2 and at the same time preferring option 4 to option 3 is now seen to be indefensible.
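The logical incompatibility asserted above can also be verified by brute force. The following sketch is ours (any utility values may be tried): it searches for an assignment of utilities satisfying both revealed preferences and finds none, in line with the algebraic rearrangement.

```python
import random

def satisfies_both(u0, u1, u2):
    """Allais' revealed preferences under expected utility, with
    u0 = u(C), u1 = u(500 + C), u2 = u(2500 + C):
    option 1 preferred to option 2, and option 4 to option 3."""
    ineq1 = u1 > 0.10 * u2 + 0.89 * u1 + 0.01 * u0
    ineq2 = 0.10 * u2 + 0.90 * u0 > 0.11 * u1 + 0.89 * u0
    return ineq1 and ineq2

# The two inequalities reduce to 0.11*u1 > 0.10*u2 + 0.01*u0 and its
# exact reverse, so no utility assignment whatsoever can satisfy both.
random.seed(0)
found = any(satisfies_both(random.random(), random.random(), random.random())
            for _ in range(100_000))
print(found)  # False
```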
The lesson of Savage's analysis is that, when confronted with complex or tricky problems, we must be prepared to shift our angle of vision in order to view the structure in terms of more concrete and familiar images with which we feel more comfortable. Viewed in this way, Allais' problem takes on the appearance of a decision-theoretic version of an "optical illusion", achieved through the distorting effects of "extreme" consequences, which go far beyond the ranges of our normal experience.

Ellsberg's (1961) criticism is of a similar kind to Allais', but the "distorting" elements which are present in his hypothetical gambles stem from the rather vague nature of the uncertainty mechanisms involved, rather than from the extreme nature of the consequences: confusion is engendered by the probabilities rather than the utilities, often as a result of thinking that there is a "right answer" if the problem seems predominantly to do with sorting out "experimentally assigned" probabilities. Raiffa (1961) and Roberts (1963) have provided clear and convincing rejoinders to the Ellsberg criticism. The form of argument used is similar to that in Savage's rejoinder to Allais, and we shall not repeat the details here. Roberts presents a particularly lucid and powerful defence of the axioms, also making use of the analogy with "optical" and "magical" illusions. In particular, the perceived incoherence may, in fact, disappear if one takes into account the possibility that the experimental subjects' utility may be a function of more than one attribute: in addition to the monetary consequences specified in the hypothetical gambles, we may need, for example, to consider the attribute "avoidance of looking foolish". Even without such refinements, however, the rejoinders stand. For a recent discussion of both the Allais and Ellsberg phenomena, see Kadane (1992).

Precise, Complete, Quantitative Preferences

In our axiomatic development we have not made the a priori assumption that all options can be compared directly using the preference relation. We have, however, assumed, in Axiom 5, that all consequences and certain general forms of dichotomised options can be compared with dichotomised options involving standard events. This latter assumption then turns out to imply a quantitative basis for all preferences, and hence for beliefs and values. The view has been put forward by some writers (e.g., Keynes, 1921/1929, and Koopman, 1940) that not all degrees of belief are quantifiable, or even comparable. Beginning with Jeffreys' review of Keynes' Treatise (see, also, Jeffreys, 1931/1973), the general response to this view has been that some form of quantification is essential if we are to have an operational, scientifically useful theory. Other references, together with a thorough review of the mathematical consequences of these kinds of assumptions, are given by Fine (1973, Chapter 2). Nevertheless, there has been a widespread feeling that the demand for precise quantification, implicit in "standard" axiom systems, is rather severe and certainly ought to be questioned.
The question of whether the quantification of beliefs and values should be precise, or allowed to be imprecise, is certainly an open, debatable one, and we would not wish to be dogmatic about this. It might well be argued that "measurement" of beliefs and values is not totally analogous to that of physical "length"; on the other hand, it is rather as though physicists and surveyors were to feel the need to rethink their practices on the basis of a physical theory incorporating explicit concepts of upper and lower lengths. We should consider, therefore, some of the kinds of suggestions that have been put forward from this latter perspective. In essence, the suggestion in relation to probabilities is to replace the usual representation of a degree of belief in terms of a single number by an interval defined by two numbers, to be interpreted as "upper" and "lower" probabilities. Among the attempts to present formal alternatives to the assumption of precise quantification are those of Good (1950, 1962), Smith (1961), Kyburg (1961), Dempster (1967, 1968), Fine (1973), Walley and Fine (1979), Girón and Ríos (1980), DeRobertis and Hartigan (1981), Chateauneuf and Jaffray (1984), Walley (1987, 1991) and Nakamura (1993); for a related practical discussion, see Hacking (1965). Particular ideas, such as Dempster's (1968) generalization of the Bayesian inference mechanism, have been shown to be suspect (see, for example, Aitchison), but have led on themselves to further generalizations, such as Shafer's (1976, 1982a) theory of "belief functions". The latter has attracted some interest (see, e.g., Wasserman, 1990a, 1990b), but its operational content has thus far eluded us. An obvious, if often technically involved, solution is to consider simultaneously all probabilities which are compatible with elicited comparisons; this and other forms of "robust Bayesian" approaches will be reviewed in Section 5.6. So far as decisions are concerned, such theories lead to the identification of a class of "would-be" actions, but provide no operational guidance as to how to choose from among these.

Our basic commitment is to quantitative coherence. We accept that the assumption of precise quantification, i.e., that comparisons with standard options can be successively refined without limit, is clearly absurd if taken literally and interpreted in a descriptive sense. We therefore echo our earlier detailed commentary on Axiom 5 in Section 2.3, to the effect that these kinds of proposed extension of the axioms seem to us to be based on a confusion of the descriptive and the prescriptive, and to be largely unnecessary. In general, we shall proceed on the basis of a prescriptive theory which assumes precise quantification, but then pragmatically acknowledges that, in practice, all this should be taken with a large pinch of salt and a great deal of systematic sensitivity analysis.

Subjectivity of Probability

As we stressed in Section 2.2, the primitive operational concept which underlies all our other definitions, the notion of preference between options, is to be understood as personal, in the sense that it derives from the response of a particular individual to a decision making situation under uncertainty.
A particular consequence of this is that the concept which emerges is personal degree of belief, defined in Section 2.4 and subsequently shown to combine for compound events in conformity with the properties of a finitely additive probability measure. The "individual" referred to above could, of course, be some kind of group, such as a committee, provided the latter had agreed to "speak with a single voice"; to the extent that we ignore the processes by which the group arrives at preferences, it can conveniently be regarded as a "person". Further comments on the problem of individuals versus groups will be given later under the heading Communication and Group Decision Making.

This idea that personal (or subjective) probability should be the key to the "scientific" or "rational" treatment of uncertainty has proved decidedly unpalatable to many statisticians and philosophers, although in some application areas, such as actuarial science, it has met with a more favourable reception (see Clarke, 1954). At the very least, it appears to offend directly against the general notion that the methods of science should, above all else, have an "objective" character. Partly in response to this, there have emerged two alternative kinds of approach to the definition of "probability", both seeking to avoid the subjective degree of belief interpretation.

The first of these retains the idea of probability as a measurement of partial belief, but rejects the subjectivist interpretation of the latter, regarding it, instead, as a unique degree of partial logical implication between one statement and another. The logical view was given its first explicit formulation by Keynes (1921/1929) and was later championed by Carnap (1950/1962) and others; it is interesting to note, however, that Keynes seems subsequently to have changed his view and acknowledged the primacy of the subjectivist interpretation (see Good, 1965). The logical view is entirely lacking in operational content: unique probability values are simply assumed to exist as a measure of the degree of implication between one statement and another, to be intuited, in some undefined way, from the formal structure of the language in which these statements are presented. Brown (1993) proposes the related concept of "impersonal" probability.

The second approach, by far the most widely accepted in some form or another, asserts that the notion of probability should be related in a fundamental way to certain "objective" aspects of physical reality, such as symmetries or frequencies. The symmetry (or classical) view asserts that physical considerations of symmetry lead directly to a primitive notion of "equally likely cases". From an historical point of view, the first systematic foundation of the frequentist approach is usually attributed to Venn (1886), with later influential contributions from von Mises (1928) and Reichenbach (1935).

The case for the subjectivist approach and against the objectivist alternatives can be summarised as follows. From the objectivistic standpoint, bitter though the subjectivist pill may be, and admittedly difficult to swallow, the alternatives are either inert, or have unpleasant and unexpected side-effects, or, to the extent that they appear successful, are found to contain subjectivist ingredients.
The identification of probability with frequency or symmetry seems to us to be profoundly misguided. It is of paramount importance to maintain the distinction between the definition of a general concept and the evaluation of a particular case. In the subjectivist approach, the definition derives from logical notions of quantitative coherent preferences; practical evaluations in particular instances often derive from perceived symmetries and observed frequencies, and it is only in this evaluatory process that the latter have a role to play.

But any uncertain situation typically possesses many plausible "symmetries": a truly "objective" theory would therefore require a procedure for choosing a particular symmetry and for justifying that choice. The subjectivist view explicitly recognises that regarding a specific symmetry as probabilistically significant is itself, inescapably, an act of personal judgement. Similarly, the frequency view can only attempt to assign a measure of uncertainty to an individual event by embedding it in an infinite class of "similar" events having certain "randomness" properties, a "collective" in von Mises' (1928) terminology, and then identifying "probability" with some notion of limiting relative frequency. This requires, in addition, an operational definition of what is meant by "similar", and there are obvious difficulties in defining the underlying notions of "similar" and "randomness" without lapsing into some kind of circularity. Moreover, an individual event can be embedded in many different "collectives", with no guarantee of the same resulting limiting relative frequencies: a truly "objective" theory would therefore require a procedure for justifying the choice of a particular embedding sequence. The subjectivist view explicitly recognises that any assertion of "similarity" among different, individual events is itself, inescapably, an act of personal judgement. In the subjectivist approach, this latter requirement finds natural expression in the concept of an exchangeable sequence of events which, via the celebrated de Finetti representation theorem (to be discussed at length in Chapter 4), provides an elegant and illuminating explanation, from an entirely subjectivistic perspective, of the fundamental role of symmetries and frequencies in the structuring and evaluation of personal beliefs. It also provides a meaningful operational interpretation of the word "objective" in terms of "intersubjective consensus".

The subjectivist point of view outlined above is, of course, not new and has been expounded at considerable length and over many years by a number of authors. The idea of probability as an individual "degree of confidence" in an event whose outcome is uncertain seems to have been first put forward by James Bernoulli (1713/1899), but it was not until Thomas Bayes' (1763) famous essay that it was explicitly used as a definition: "The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."
An exhaustive and profound discussion of all aspects of subjective probability is given in de Finetti's magisterial Theory of Probability (1970/1974, 1970/1975). A number of later contributions to the field of subjective probability are collected together and discussed in the volume edited by Kyburg and Smokler (1964/1980), which includes important seminal papers by Ramsey (1926) and de Finetti (1937/1964). Other interpretations of probability are discussed in Renyi (1955), Good (1959), Kyburg (1961, 1974), Fine (1973), Barnett (1973/1982), Hacking (1975), de Finetti (1978), Walley and Fine (1979) and Shafer (1990).

Statistical Inference as a Decision Problem

Stylised statistical problems have often been approached from a decision-theoretical viewpoint; see, for instance, the books by Fishburn (1964), Ferguson (1967), DeGroot (1970) and Berger (1985a), and references therein. Many approaches to statistical inference do not, however, assign a primary role to reporting probability distributions, and concentrate instead on stylised estimation and hypothesis testing formulations of the problem (see Appendix B, Section 3). We shall deal with these topics in more detail in Chapters 5 and 6.

However, we have already made clear that, in our view, the supposed dichotomy between inference and decision is illusory, since any report or communication of beliefs following the receipt of information inevitably itself constitutes a form of action. In Section 2.7, we formalised this argument and characterised the utility structure that is typically appropriate for consequences in the special case of a "pure inference" problem. The expected utility of an "experiment" in this context was then seen to be identified with expected information (in the Shannon sense), and a number of information-theoretic ideas and their applications were given a unified interpretation within a purely subjectivist Bayesian framework.

Communication and Group Decision Making

The Bayesian approach which has been presented in this chapter is predicated on the primitive notion of individual preference. A seemingly powerful argument against the use of the Bayesian paradigm is therefore that it provides an inappropriate basis for the kinds of interpersonal communication and reporting processes which characterise both public debate about beliefs regarding scientific and social issues, and also "cohesive-small-group" decision making processes. We believe that the two contexts, "public" and "cohesive-small-group", pose rather different problems, requiring separate discussion.
In the case of the revision and communication of beliefs in the context of general scientific and social debate, we need to distinguish two rather different activities: on the one hand, the prescriptive processes by which we ought individually to revise our beliefs in the light of new information if we aspire to coherence; on the other hand, the pragmatic processes by which we seek to report to and share perceptions with others. The first of these processes leads us inescapably to the conclusion that beliefs should be handled using the Bayesian paradigm; the second reminds us that a "one-off" application of the paradigm to summarise a single individual's revision of beliefs is inappropriate in this context. Indeed, so far as we are aware, no Bayesian statistician has ever argued that the latter would be appropriate. On the contrary, the whole basis of the subjectivist philosophy predisposes Bayesians to seek to report a rich range of the possible belief mappings induced by a data set, the range being chosen both to reflect (and even to challenge) the initial beliefs of a range of interested parties. Any approach to scientific inference which seeks to legitimise a single answer in response to complex uncertainty seems to us a totalitarian parody of a would-be rational human learning process. So far as these issues are concerned, we feel that criticism of the Bayesian paradigm is largely based on a misunderstanding of the issues involved, and on an oversimplified view of the paradigm itself. Indeed, one of the most attractive features of the Bayesian approach is its recognition of the legitimacy of the plurality of (coherently constrained) responses to data.

We concede that much remains to be done in developing Bayesian reporting technology, and we conjecture that modern interactive computing and graphics will have a major role to play. Some discussion of the Bayesian reporting process may be found in Dickey (1973), Dickey and Freeman (1975) and Smith (1978), together with a review of the connections between this issue and the role of models in facilitating communication and consensus; further discussion is given in Smith (1984). This latter topic will be further considered in Chapter 4. Some of the literature on expert systems is also relevant here; see, for instance, Lindley (1987), Spiegelhalter (1987) and Gaul and Schader (1988). It is not yet clear to us whether the analyses of these issues will impinge directly on the broader controversies regarding scientific inference methodology, and so we shall not attempt a detailed review of the considerable literature that is emerging.

In the "cohesive-small-group" context, on the other hand, there may be an imposed need for group belief and decision. A variety of problems can be isolated within this framework, depending on whether the emphasis is on combining probabilities, or utilities, or both, and on how the group is structured in relation to such issues as "democracy", "information-sharing", "negotiation" or "competition". Useful introductions to the extensive literature on the amalgamation of beliefs or utilities, together with most of the key references, are provided by Arrow (1951b), Edwards (1954), Luce and Raiffa (1957), Luce (1959), Stone (1961) and Fishburn (1964).
Other relevant references include Arrow and Raynaud (1987), Barlow et al., Bayarri and DeGroot (1988, 1989), Blackwell and Dubins (1962), Brown and Lindley (1982, 1986), Bunn (1984), Caro et al., Chankong and Haimes (1982), Clemen (1989), Clemen and Winkler (1987, 1993), Cochrane and Zeleny (1973), DeGroot (1974), DeGroot and Fienberg (1982, 1983, 1986), DeGroot and Kadane (1980), DeGroot and Mortera (1991), De Waal et al., Edwards and Newman (1982), Eliashberg and Winkler (1981), Fishburn (1964, 1970), French (1981, 1985, 1986), French et al., Genest (1984a, 1984b), Genest and Zidek (1986), Gilardoni and Clayton (1993), Goel et al. (1992), Goicoechea et al. (1982), Hogarth (1980), Huseby (1988), Hylland and Zeckhauser (1981), Kadane and Seidenfeld (1992), Keeney (1992), Kelly (1991), Kim and Roush (1987), Kogan and Wallace (1964), Krantz et al. (1971), Lindley (1983, 1985), Lindley et al. (1979), Lindley and Singpurwalla (1991, 1993), Marschak and Radner (1972), Morris (1974), Normand and Tritchler (1992), Press (1978, 1980b), Raiffa (1982), Rios (1990), Rios et al. (1992), Roberts (1979), Saaty (1980), Seidenfeld et al. (1989), Sen (1970), Smith (1988b), Weerahandi and Zidek (1981, 1983), West (1988, 1992a), White (1976a, 1976b), White and Bowen (1975), Winkler (1968, 1981), Young and Smith (1991) and Yu (1985). Important seminal papers are reproduced in Gärdenfors and Sahlin (1988). A recent review of related topics, followed by an informative discussion, is provided by Kadane (1993). For related discussion in the context of policy analysis, see Hodges (1987). References relating to the Bayesian approach to game theory include Harsanyi (1967, 1968), Wilson (1968, 1986), Aumann (1987), Kadane and Larkey (1982, 1983) and Nau and McCardle (1990).
Chapter 3

Generalisations

Summary

The ideas and results of Chapter 2 are extended to a much more general mathematical setting. An additional postulate concerning the comparison of a countable collection of events is introduced, and is shown to provide a justification for restricting attention to countably additive probability as the basis for representing beliefs. The elements of mathematical probability theory are reviewed. The notions of options and utilities are extended to provide a very general mathematical framework for decision theory. A further additional postulate regarding preferences is introduced, and is shown to justify the criterion of maximising expected utility within this more general framework. In the context of inference problems, generalised definitions of score functions and of measures of information and discrepancy are given.

3.1 GENERALISED REPRESENTATION OF BELIEFS

3.1.1 Motivation

The developments of Chapter 2, based on Axioms 1 to 5, led to the fundamental result that quantitatively coherent degrees of belief for events belonging to the algebra E should fit together in conformity with the mathematical rules of finitely additive probability. From a directly practical and intuitive point of view, there seems no compelling reason to require anything beyond this finitistic framework, a view argued forcefully, and in great detail, by de Finetti (1970/1974, 1970/1975). However, as we remarked in Chapter 2 when discussing bounded sets of consequences, there are many situations where the implied necessity of choosing a particular finitistic representation of a problem can lead to annoying conceptual and mathematical complications. The example given in that context involved the problem of representing the length of remaining life of a medical patient: most people would accept that there is an implicit upper bound, but find it difficult to justify any particular choice of its value. Similar problems obviously arise in representing other forms of survival time (of equipment, transplanted organs, or whatever), and further difficulties occur in representing the possible outcomes of many other measurement processes, since these are generally regarded as being on a continuous scale.

For these reasons, it is certainly attractive, from the point of view of descriptive and mathematical convenience, to consider the possibility of extending our ideas beyond the finite, discrete framework, provided we do not feel that in so doing we are distorting essential features of our beliefs. The major mathematical advantage of this generalised framework is that all the standard manipulative tools and results of mathematical probability theory then become available to us; convenient references are, for example, Kingman and Taylor (1966) or Ash (1972). A selection of these tools and results will be reviewed in Section 3.2, and then used in Section 3.3 to develop natural extensions of our finitistic definitions of actions and utilities, thus establishing an extremely general mathematical setting for the representation and analysis of decision problems. The important special case of inference as a decision problem will be considered in Section 3.4, which extends to the general mathematical framework the discussion of the finite, discrete case given in Chapter 2. Finally, a discussion of some particular issues is given in Section 3.5.

3.1.2 Countable Additivity

In Chapter 2, we assumed that the collection E of events included in the underlying frame of discourse should be closed under the operations of arbitrary finite intersections and unions. As the first step in providing a mathematical extension to the infinite domain, we shall now assume that arbitrary countable intersections and unions are also allowed in E, so that the latter is taken to be a sigma-algebra. In this section, we shall provide a formal extension of the quantitative coherence theory to the infinite domain. Our fundamental conclusion about beliefs in this setting will be that quantitatively coherent degrees of belief for events belonging to a sigma-algebra E should fit together in conformity with the mathematical rules of countably additive probability.
Within this extended structure, Axioms 1 to 5 will continue to encapsulate the requirements of quantitative coherence for preferences, and hence for degrees of belief, provided that only finite combinations of events of E are involved. However, if we wish to deal with countable combinations of events, we shall need an extension of the existing requirements for quantitative coherence.

If the relation Ej >= F holds for every member of a decreasing sequence of events E1 ⊇ E2 ⊇ ..., and if we accept the limit event ∩j Ej into our frame of discourse, then it would seem very natural, in terms of "continuity", that the relation should "carry over" to the limit. One possible such extension is encapsulated in the following postulate.

Postulate 1. (Monotone continuity). If E1 ⊇ E2 ⊇ ... is a decreasing sequence of events such that Ej >= F for all j, then ∩j Ej >= F.

The operational justification for considering such a countable sequence of comparisons is certainly open to doubt. However, if we admit, for descriptive and mathematical convenience, the possibility of such comparisons, then this form of continuity would seem to be a minimal requirement for coherence.

We note first that the condition encapsulated in Postulate 1 carries over to conditional preferences. Thus, if, for any G > 0, Ej >=_G F and Ej ⊇ Ej+1 for all j, then, by Proposition 2.14(i), Ej ∩ G >= F ∩ G for all j; since Ej ∩ G ⊇ Ej+1 ∩ G and ∩j (Ej ∩ G) = (∩j Ej) ∩ G, Postulate 1 implies that (∩j Ej) ∩ G >= F ∩ G, and thus that ∩j Ej >=_G F.

This extension of the coherence requirements enables us to establish immediately the countably additive structure of coherent degrees of belief.

Proposition 3.1. (Countably additive structure of degrees of belief). For any G > 0, P(. | G) is a countably additive probability measure; that is, if {Ej, j = 1, 2, ...} are disjoint events in E, then

P(∪j Ej | G) = Σj P(Ej | G).

Proof. We have already established (Propositions 2.16 and 2.17) that P(. | G) is a finitely additive probability measure. Suppose first that F1 ⊇ F2 ⊇ ... is a decreasing sequence of events in E with ∩j Fj = ∅. By Proposition 2.14(i), P(Fj | G) >= P(Fj+1 | G) >= 0 for all j, and so there exists a number p >= 0 such that lim (j -> ∞) P(Fj | G) = p, with P(Fj | G) >= p for all j. By Axiom 4(iii) and Proposition 2.16, there exists a standard event S such that mu(S) = p, and, by the above, Fj >=_G S for all j. By the conditional form of Postulate 1, ∩j Fj >=_G S; but ∩j Fj = ∅, so that ∅ >=_G S, which implies that p = mu(S) = 0. Hence, lim (j -> ∞) P(Fj | G) = 0.

Now let {Ej} be disjoint events in E. Since P(. | G) is finitely additive, we have, for any n >= 1,

P(∪j Ej | G) = Σ (j = 1 to n) P(Ej | G) + P(∪ (j > n) Ej | G).

Setting Fn = ∪ (j > n) Ej, we have Fn ⊇ Fn+1 and ∩n Fn = ∅, so that, by the above, lim (n -> ∞) P(Fn | G) = 0. The result follows by taking limits in the last expression for P(∪j Ej | G).

We have established (in Propositions 2.4 and 2.10) that if E and F are events in E with E ⊆ F, then P(E) <= P(F), so that if P(F) = 0 then P(E) = 0. However, in general, not all subsets of an event of probability zero (a so-called null event) will belong to E, and so we cannot logically even refer to their probabilities, let alone infer that they are zero. In some circumstances it may be desirable, as well as mathematically convenient, to be able to do this. If so, this can be done "automatically" by simply agreeing that E be replaced by the smallest sigma-algebra which contains E and all the subsets of the null events of E (the so-called completion of E). The induced probability measure over the completion is unique and has the property that all subsets of null events are themselves null events; it is called a complete probability measure.

Definition 3.1. (Probability space). A probability space is defined by the elements {Ω, F, P}, where F is a sigma-algebra of subsets of Ω and P is a complete, sigma-additive probability measure on F.

We shall consider the finite versus countable additivity debate in a little more detail in Section 3.5. The debate centres on whether this particular restriction to a subclass of the finitely additive measures should be considered as a necessary feature of quantitative coherence, or whether it is a pragmatic option, outside the quantitative coherence framework encapsulated in Axioms 1 to 5. From a philosophical point of view, we identify strongly with this latter viewpoint. For the present, we simply note that philosophical allegiance to the finitistic framework is in no way incompatible with the systematic adoption and use of countably additive probability measures for the overwhelming majority of applications, and in almost all the developments which follow we shall rarely feel discomfited by implicitly working within a countably additive framework.
3.2 REVIEW OF PROBABILITY THEORY

From now on, our mathematical development will take place within the assumed structure of a probability space {Ω, F, P}. We do not anticipate encountering situations where these mathematical assumptions lead to conceptual distortions beyond the usual, inevitable element of mathematical idealisation which enters any formal analysis. However, as de Finetti (1970/1974, 1970/1975) has so eloquently warned, one must always be on guard and aware that distortions might occur.

The material which follows (in Section 3.2) will differ in flavour somewhat from our preceding discussion of the general foundations of coherent beliefs and actions (Chapter 2 and Section 3.1), and from our subsequent discussions of generalised decision problems (Section 3.3) and of the link between beliefs about observables and the structure of specific models for representing such beliefs (Chapter 4). Those developments systematically invoke the subjectivist, operationalist philosophy as a basic motivation and guiding principle; here, we shall concentrate instead on reviewing, from a purely mathematical standpoint, the concepts and results from mathematical probability theory which will provide the technical underpinning of our theory.

3.2.1 Random Quantities and Distributions

In the framework we have been discussing, the constituent possibilities and probabilities of any decision problem are encapsulated in the structure of the probability space {Ω, F, P}. We might think of Ω as the "primitive" collection of all possible outcomes in a situation of interest: for example, that surrounding the birth of an infant, or the state of international commodity markets at a particular time point. Typically, however, we are not really interested in a "complete description" of the outcomes, even if such were possible, but rather in some numerical summary of them, in the form of counts or measurements. We move, therefore, from {Ω, F, P} to a more explicitly numerical setting by invoking a mapping

x : Ω -> R,

which associates a real number x(ω) with each elementary outcome ω of Ω (our initial exposition will be in terms of a single-valued x; the vector extension will be made later in Section 3.2). Recalling the discussion of Chapter 2, it might be argued that "measurements" are always, in fact, "counts". However, we shall distinguish the two in the usual pragmatic (fuzzy) way: "counts" will typically mean integer-valued data; "measurements" will typically mean data which we pretend are real-valued.
Following de Finetti (1970/1974, 1970/1975), we use the term random quantity to signify a numerical entity whose value is uncertain, rather than use the traditional, but potentially confusing, term random variable, which might suggest a restriction to contexts involving repeated "trials" over which the quantity may vary. Notationally, x may denote a function, or a particular value of that function; we shall use the same symbol for both a random quantity and its value, and the interpretation will always be clear from the context.

Since the probability measure P is defined on F, the induced measure over the real line will be defined only for certain special subsets of R, and this constrains the class of functions x which we would wish to use to define the numerical mapping. The standard requirement is that sets of the form {ω : x(ω) <= a}, a in R, belong to F.

Definition 3.2. (Random quantity). A random quantity x on a probability space {Ω, F, P} is a function x : Ω -> X ⊆ R such that, for all a in R, {ω : x(ω) <= a} belongs to F.

Subsets of Ω are thus mapped into subsets of R, and the probability measure P defined on F induces a probability measure P_x, definable on the sigma-algebra B of Borel sets of R: the smallest sigma-algebra containing the intervals (-∞, a], a in R, and hence intervals such as (a, b], (a, b) and, indeed, all forms of interval, since the latter can be generated by appropriate countable unions and intersections of the intervals (-∞, a]. The induced measure P_x is defined in the natural way by

P_x(B) = P({ω : x(ω) in B}), B in B,

and is easily seen to be a probability measure, which describes the way in which probability is "distributed" over the possible values x in X ⊆ R. This information can also be encapsulated in a single real-valued function.

Definition 3.3. (Distribution function). The distribution function of a random quantity x : Ω -> X ⊆ R on {Ω, F, P} is the function F_x : R -> [0, 1] defined by

F_x(a) = P_x((-∞, a]) = P({ω : x(ω) <= a}), a in R.

If the probability distribution concentrates on a countable set of values, so that X = {x1, x2, ...}, then x is called a discrete random quantity, and the function p_x : R -> [0, 1] such that

p_x(x) = P({ω : x(ω) = x})

is called its probability (mass) function. The distribution function is then a step function with jumps p_x(xi) at each xi. If, instead, the probability distribution is such that there exists a real, non-negative (measurable) function p_x such that

F_x(a) = integral from -∞ to a of p_x(x) dx,

then x is called an (absolutely) continuous random quantity, and p_x is called its density function. In general, of course, we might have a mixture of both discrete and continuous elements. No use of singular distributions will be made in this volume (for discussion of such distributions, see, for example, Ash, 1972, Section 2.2).

We shall use the notation p(x) both to represent the density at a particular x in X, in the continuous case, and to represent the value p_x(x) of the mass function, in the discrete case; moreover, when there is no danger of confusion, we shall often omit the suffix x in p_x and F_x. In measure-theoretic terms, both are, of course, special cases of the Radon-Nikodym derivative with respect to the relevant measure. In addition, we shall use the notation and results of Lebesgue and Lebesgue-Stieltjes integration theory as and when it suits us. Readers unfamiliar with these concepts need not worry: virtually none of the machinery will be visible. To avoid tedious repetition of phrases like "almost everywhere", we shall, when there is no danger of confusion, simply state that densities are equal, leaving it to be understood that, when appropriate, this means "equal, except possibly on a set of measure zero"; the meanings of integrals will rarely depend on the niceties of the interpretation adopted.

If x is a random quantity defined on {Ω, F, P}, with x : Ω -> X ⊆ R, and if g : R -> Y ⊆ R is a function such that (g o x)^(-1)(B) belongs to F for all B in B, then g o x is also a random quantity. We shall typically denote g o x by g(x); whenever we refer to such functions of a random quantity x, it is to be understood that the composite function is indeed a random quantity. Writing y = g(x), the random quantity y induces a probability space {R, B, P_y}, where

P_y(B) = P_x(g^(-1)(B)) = P((g o x)^(-1)(B)), B in B,

and the distribution function F_y and density (or mass) function p_y are defined in the obvious way. These forms are easily related to those of F_x and p_x. For example, if g^(-1) exists and g is strictly monotonic increasing, we have

F_y(y) = P_x(g(x) <= y) = P_x(x <= g^(-1)(y)) = F_x(g^(-1)(y)),

and, in the continuous case, if g is monotonic and differentiable, the density of y is given by

p_y(y) = p_x(g^(-1)(y)) |d g^(-1)(y) / dy|.

Some examples of this relationship are given at the end of Section 3.2.2.
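The monotone change-of-variables formula above is easy to check numerically. In the following sketch (our illustration, not part of the original text; the standard exponential distribution and the transformation g(x) = x^2 are chosen purely for convenience), the density p_y given by the formula is compared with a numerical derivative of F_y(y) = F_x(g^(-1)(y)):

```python
import math

# x has density p_x(x) = exp(-x), x > 0 (standard exponential),
# with distribution function F_x(a) = 1 - exp(-a).
def F_x(a):
    return 1.0 - math.exp(-a)

# y = g(x) = x^2 is strictly increasing on x > 0, with g^{-1}(y) = sqrt(y),
# so F_y(y) = F_x(sqrt(y)) and p_y(y) = p_x(sqrt(y)) * |d sqrt(y) / dy|.
def F_y(y):
    return F_x(math.sqrt(y))

def p_y(y):
    return math.exp(-math.sqrt(y)) / (2.0 * math.sqrt(y))

# Check: the density from the formula matches a central-difference
# numerical derivative of the distribution function F_y.
y0, h = 2.5, 1e-6
numeric = (F_y(y0 + h) - F_y(y0 - h)) / (2.0 * h)
assert abs(numeric - p_y(y0)) < 1e-6
```

Since g is strictly increasing on the support, differentiating F_x(g^(-1)(y)) recovers exactly the density given by the formula, and the numerical check confirms the agreement.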
Definition 3.4. (Expectation). If x is a random quantity and y = g(x), the expectation of y, E[y] = E[g(x)], is defined by

E[g(x)] = Σ g(x) p(x) or E[g(x)] = integral of g(x) p(x) dx,

for the discrete and continuous cases, respectively, assuming, in each case, the right-hand side to exist.

To avoid tiresome duplication, we shall typically use the integral form to represent both the continuous and the discrete cases. Moreover, most sums or integrals over possible values of random quantities will involve the complete set of possible values; to simplify notation, we shall usually omit the range of summation or integration, assuming it to be understood from the context.

The expectation operator of Definition 3.4 is linear, so that if x1, x2 are two random quantities and c1, c2 are finite real numbers, then

E[c1 x1 + c2 x2] = c1 E[x1] + c2 E[x2],

where the equality is to be interpreted in the sense that if either side exists, so does the other, and they are equal.

It is useful to be able to summarise the main features of a probability distribution by quantities defined to encapsulate its location, spread, or shape, often in terms of special cases of Definition 3.4. For a random quantity x, such summary quantities include:

(i) E[x], the mean of the distribution of x;
(ii) E[|x|^k], the kth (absolute) moment;
(iii) V[x] = E[(x - E[x])^2] = E[x^2] - E^2[x], the variance;
(iv) D[x] = V^(1/2)[x], the standard deviation;
(v) Mo[x], a mode of the distribution of x, such that p(Mo[x]) = sup_x p(x);
(vi) Q_a[x], an a-quantile of the distribution of x, such that P(x <= Q_a[x]) >= a and P(x >= Q_a[x]) >= 1 - a, with Me[x] = Q_0.5[x], a median;
(vii) (Q_((1-p)/2)[x], Q_((1+p)/2)[x]], a p-interquantile range.
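For a discrete random quantity, all of these summaries reduce to finite sums over the mass function. The following sketch (ours; the particular mass function is an arbitrary choice for illustration) computes the mean, variance, standard deviation, mode and median:

```python
# A discrete random quantity taking values 0..3, with a given mass function.
p = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}

mean = sum(x * px for x, px in p.items())                # E[x]
var = sum((x - mean) ** 2 * px for x, px in p.items())   # V[x] = E[(x - E[x])^2]
sd = var ** 0.5                                          # D[x] = V^(1/2)[x]
mode = max(p, key=p.get)                                 # Mo[x]: maximises p(x)

# Me[x] = Q_0.5[x]: smallest value x with F(x) >= 0.5 (a 0.5-quantile).
cum, median = 0.0, None
for x in sorted(p):
    cum += p[x]
    if cum >= 0.5:
        median = x
        break
```

Here the mean is 1.9, the variance 0.89, and both the mode and the median equal 2.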
In Chapter 2, we introduced the notion of the independence of two events, with subsequent generalisations to mutual independence (Definition 2.12) and conditional independence (Definition 2.13). These notions can be extended to random quantities in the following way.

Definition 3.5. (Mutual independence). The random quantities x1, ..., xn are mutually independent if, for any t1, ..., tn in R, the events {ω : x1(ω) <= t1}, ..., {ω : xn(ω) <= tn} are mutually independent.

Definition 3.6. (Conditional independence). The random quantities x1, ..., xn are conditionally independent given the random quantity y if, for any t1, ..., tn in R and for all y, the events {ω : xi(ω) <= ti}, i = 1, ..., n, are conditionally independent given the event {ω : y(ω) <= y}.

Conditional independence will play a major role in our later discussion of modelling in Chapter 4. We note that for mutually independent random quantities x1, ..., xn,

E[x1 x2 ... xn] = E[x1] E[x2] ... E[xn] and V[x1 + ... + xn] = V[x1] + ... + V[xn].

In the special case of the transformation g(x) = cx, for some real constant c, we clearly have E[g(x)] = c E[x] = g(E[x]) and D[g(x)] = (V[g(x)])^(1/2) = (c^2 V[x])^(1/2) = |c| D[x]. For general transformations g(x), however, E[g(x)] is not equal to g(E[x]), and the moments of a transformed random quantity g(x) do not relate in any straightforward manner to those of x. Nevertheless, for suitably well-behaved g(x), the following result, which we shall illustrate at the end of Section 3.2.2, often provides useful approximations.

Proposition 3.2. (Approximate mean and variance). If x is a random quantity with E[x] = mu and V[x] = sigma^2, and y = g(x), then, subject to conditions on the distribution of x and the smoothness of g,

E[y] is approximately g(mu) + (sigma^2 / 2) g''(mu), V[y] is approximately [g'(mu)]^2 sigma^2,

where we are assuming regularity conditions sufficient to ensure the adequacy of these approximations in what follows.

Outline proof. Expanding g(x) in a Taylor series about mu, we obtain

g(x) = g(mu) + (x - mu) g'(mu) + (1/2) (x - mu)^2 g''(mu) + ... .

Taking expectations immediately yields the approximate form for E[y]. Subtracting the latter approximation from both sides, squaring, taking expectations and ignoring higher-order terms yields the result for V[y]. More refined approximations are easily obtained by including higher-order terms.
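The accuracy of these approximations is easy to examine numerically by comparing them with exact moments computed by direct summation. (A sketch of our own; the mass function and the transformation g(x) = 1/(1 + x) are arbitrary illustrative choices.)

```python
# x has a simple discrete mass function; g(x) = 1/(1 + x) is smooth near the mean.
p = {1: 0.3, 2: 0.4, 3: 0.3}

mu = sum(x * px for x, px in p.items())                  # E[x] = 2.0
sigma2 = sum((x - mu) ** 2 * px for x, px in p.items())  # V[x] = 0.6

g = lambda x: 1.0 / (1.0 + x)
g1 = lambda x: -1.0 / (1.0 + x) ** 2     # first derivative g'
g2 = lambda x: 2.0 / (1.0 + x) ** 3      # second derivative g''

# Approximations from the Taylor-expansion argument above.
approx_mean = g(mu) + 0.5 * sigma2 * g2(mu)
approx_var = g1(mu) ** 2 * sigma2

# Exact moments of y = g(x), by direct summation over the mass function.
exact_mean = sum(g(x) * px for x, px in p.items())
exact_var = sum((g(x) - exact_mean) ** 2 * px for x, px in p.items())
```

For this example the approximate mean 0.3556 compares with the exact value 0.3583, and the approximate variance 0.0074 with the exact value 0.0098: adequate for many purposes, and improvable by retaining higher-order terms.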
Many forms of technical manipulation of probability distributions are greatly facilitated by working with some suitable transformation of the original density or distribution function. One of the most useful such transforms is the following.

Definition 3.7. (Characteristic function). The characteristic function of a random quantity x is the function phi_x, mapping R to the complex plane, given by

phi_x(t) = E[e^(itx)], t in R.

Among the most important properties of the characteristic function, we note the following:

(i) phi_x(0) = 1, |phi_x(t)| <= 1, and phi_x is a uniformly continuous function of t;
(ii) two random quantities have the same distribution if and only if they have the same characteristic function;
(iii) if x1, ..., xk are independent random quantities, and s = x1 + ... + xk, then phi_s(t) = phi_x1(t) ... phi_xk(t);
(iv) if E[|x|^k] < ∞, then phi_x(t) = Σ (j = 0 to k) (it)^j E[x^j] / j! + o(t^k).

Many similar properties hold for the closely related alternative transforms E[e^(tx)], the moment generating function, and E[t^x], the probability generating function.

3.2.2 Some Particular Univariate Distributions

In this section, we shall review a number of particular univariate distributions which are frequently used in applications, and list some of their properties and characteristics. We shall assume that the reader is familiar with most of this material, and detailed discussion and derivations are therefore not given. The books by Johnson and Kotz (1969, 1970) provide a mass of detail on these and other distributions.

One important initial warning is required! These distributions provide the building blocks for statistical models and are typically defined in terms of "parameters". For the present, these "parameters" should simply be regarded as "labels" of the various mathematical functions we shall be considering, although, as we shall see, these "labelling parameters" often relate closely to one or other of the characteristics of the distribution. The role and interpretation of "models" and "parameters" within the general subjectivist, operationalist framework are extremely important issues, which will be discussed at length in Chapter 4.
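The product rule for characteristic functions of sums of independent random quantities can be verified directly in a simple case: the sum of n independent Bernoulli quantities has a binomial distribution, so the nth power of the Bernoulli characteristic function must equal the binomial one. (A numerical sketch of ours; the particular values of theta, n and t are arbitrary.)

```python
import cmath
from math import comb

theta, n, t = 0.3, 5, 0.7

# Characteristic function of a single Bernoulli(theta) quantity:
# phi(t) = E[exp(itx)] = (1 - theta) + theta * exp(it).
phi_bernoulli = (1 - theta) + theta * cmath.exp(1j * t)

# Characteristic function of Bi(x | theta, n), computed from its
# probability function by direct summation over x = 0, ..., n.
phi_binomial = sum(
    comb(n, x) * theta**x * (1 - theta) ** (n - x) * cmath.exp(1j * t * x)
    for x in range(n + 1)
)

# The sum of n independent Bernoullis is binomial, so the two agree
# (this is just the binomial theorem applied to (1 - theta + theta e^{it})^n).
assert abs(phi_bernoulli**n - phi_binomial) < 1e-12
```

The same computation, with exp(it) replaced by a real argument, checks the analogous property of the probability generating function.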
The Binomial Distribution

A discrete random quantity $x$ has a binomial distribution with parameters $\theta$ and $n$ ($0 < \theta < 1$, $n = 1, 2, \dots$) if its probability function $\mathrm{Bi}(x \mid \theta, n)$ is

$$\mathrm{Bi}(x \mid \theta, n) = \binom{n}{x}\, \theta^x (1-\theta)^{n-x}, \qquad x = 0, 1, \dots, n.$$

The mean and variance are $E[x] = n\theta$ and $V[x] = n\theta(1-\theta)$. A mode is attained at the greatest integer $M[x]$ which does not exceed $(n+1)\theta$; if $(n+1)\theta$ is an integer, then both $(n+1)\theta - 1$ and $(n+1)\theta$ are modes. If $n = 1$, $x$ is said to have a Bernoulli distribution, with probability function denoted by $\mathrm{Br}(x \mid \theta)$. The sum of $k$ independent binomial random quantities with parameters $(\theta, n_i)$, $i = 1, \dots, k$, is a binomial random quantity with parameters $\theta$ and $n_1 + \cdots + n_k$.

The Hypergeometric Distribution

A discrete random quantity $x$ has a hypergeometric distribution with integer parameters $N$, $M$ and $n$ ($n \le N + M$) if its probability function $\mathrm{Hy}(x \mid N, M, n)$ is

$$\mathrm{Hy}(x \mid N, M, n) = \binom{N}{x}\binom{M}{n-x}\Big/\binom{N+M}{n},$$

where $\max(0, n - M) \le x \le \min(n, N)$. The mean and variance are given by

$$E[x] = \frac{nN}{N+M}, \qquad V[x] = \frac{nNM(N+M-n)}{(N+M)^2(N+M-1)}\,.$$

A mode is attained at the greatest integer $M[x]$ which does not exceed $(n+1)(N+1)/(N+M+2)$.
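As a numerical aside (not in the original text), the binomial and hypergeometric moment and mode formulae above can be checked against SciPy's implementations; the parameter values are arbitrary. Note that SciPy's hypergeometric is parameterised as (population size, number of success states, draws), which corresponds to $(N + M, N, n)$ here.

```python
from math import floor

import numpy as np
from scipy import stats

# Binomial Bi(x | theta, n).
theta, n = 0.3, 17
b = stats.binom(n, theta)
mean_b, var_b = b.mean(), b.var()
mode_b = int(np.argmax(b.pmf(np.arange(n + 1))))

# Hypergeometric Hy(x | N, M, ndraw) in the book's notation.
N, M, ndraw = 12, 8, 9
h = stats.hypergeom(N + M, N, ndraw)
mean_h = h.mean()
```

The computed mode agrees with the "greatest integer not exceeding $(n+1)\theta$" rule whenever $(n+1)\theta$ is not itself an integer.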
The Negative-Binomial Distribution

A discrete random quantity $x$ has a negative-binomial distribution with parameters $\theta$ and $r$ ($0 < \theta < 1$, $r = 1, 2, \dots$) if its probability function $\mathrm{Nb}(x \mid \theta, r)$ is

$$\mathrm{Nb}(x \mid \theta, r) = c \binom{r+x-1}{x} (1-\theta)^x, \qquad x = 0, 1, 2, \dots,$$

where $c = \theta^r$. The mean and variance are $E[x] = r(1-\theta)/\theta$ and $V[x] = r(1-\theta)/\theta^2$. If $r(1-\theta) > 1$, the mode $M[x]$ is the least integer not less than $[r(1-\theta) - 1]/\theta$; if $r(1-\theta) = 1$, there are two modes, at 0 and 1; if $r(1-\theta) < 1$, the mode is $M[x] = 0$. If $r = 1$, $x$ is said to have a geometric or Pascal distribution. Moreover, the sum of $k$ independent negative-binomial random quantities with parameters $(\theta, r_i)$, $i = 1, \dots, k$, is a negative-binomial random quantity with parameters $\theta$ and $r_1 + \cdots + r_k$.

The Poisson Distribution

A discrete random quantity $x$ has a Poisson distribution with parameter $\lambda$ ($\lambda > 0$) if its probability function $\mathrm{Pn}(x \mid \lambda)$ is

$$\mathrm{Pn}(x \mid \lambda) = e^{-\lambda}\, \frac{\lambda^x}{x!}\,, \qquad x = 0, 1, 2, \dots.$$

The mean and variance are $E[x] = V[x] = \lambda$. A mode $M[x]$ is attained at the greatest integer which does not exceed $\lambda$; if $\lambda$ is an integer, there are two modes, at $\lambda - 1$ and $\lambda$. The sum of $k$ independent Poisson random quantities with parameters $\lambda_i$, $i = 1, \dots, k$, is a Poisson random quantity with parameter $\lambda_1 + \cdots + \lambda_k$.

The Beta Distribution

A continuous random quantity $x$ has a beta distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Be}(x \mid \alpha, \beta)$ is

$$\mathrm{Be}(x \mid \alpha, \beta) = c\, x^{\alpha-1}(1-x)^{\beta-1}, \qquad 0 < x < 1,$$

where $c = \Gamma(\alpha+\beta)/\{\Gamma(\alpha)\Gamma(\beta)\}$ and

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt.$$

Integer and half-integer values of the gamma function are easily found from the recursive relation $\Gamma(x+1) = x\,\Gamma(x)$, and the values $\Gamma(1) = 1$ and $\Gamma(1/2) = \sqrt{\pi} \approx 1.772$.
Systematic application of the beta integral gives

$$E[x] = \frac{\alpha}{\alpha+\beta}\,, \qquad V[x] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\,.$$

If $\alpha > 1$ and $\beta > 1$, there is a unique mode at $(\alpha-1)/(\alpha+\beta-2)$. If $x$ has a $\mathrm{Be}(x \mid \alpha, \beta)$ density, then $y = 1 - x$ has a $\mathrm{Be}(y \mid \beta, \alpha)$ density. By considering the transformed random quantity $y = a + x(b-a)$, the beta distribution can be generalised to any finite interval $(a, b)$. In particular, if $\alpha = \beta = 1$, $x$ is said to have a uniform distribution $\mathrm{Un}(x \mid 0, 1)$ on $(0, 1)$, and the uniform distribution on $(a, b)$,

$$\mathrm{Un}(y \mid a, b) = (b-a)^{-1}, \qquad a < y < b,$$

has mean $E[y] = (a+b)/2$ and variance $V[y] = (b-a)^2/12$.

The Binomial-Beta Distribution

A discrete random quantity $x$ has a binomial-beta distribution with parameters $\alpha$, $\beta$ and $n$ ($\alpha > 0$, $\beta > 0$, $n = 1, 2, \dots$) if its probability function $\mathrm{Bb}(x \mid \alpha, \beta, n)$ is

$$\mathrm{Bb}(x \mid \alpha, \beta, n) = c \binom{n}{x} \Gamma(\alpha+x)\, \Gamma(\beta+n-x), \qquad x = 0, 1, \dots, n,$$

where

$$c = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)\,\Gamma(\alpha+\beta+n)}\,.$$

The distribution is generated by the mixture

$$\mathrm{Bb}(x \mid \alpha, \beta, n) = \int_0^1 \mathrm{Bi}(x \mid \theta, n)\, \mathrm{Be}(\theta \mid \alpha, \beta)\, d\theta.$$

The mean and variance are given by

$$E[x] = n\,\frac{\alpha}{\alpha+\beta}\,, \qquad V[x] = \frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)}\,.$$

A mode is attained at the greatest integer $M[x]$ which does not exceed $(n+1)(\alpha-1)/(\alpha+\beta-2)$; if this quantity is an integer, both $M[x]$ and $M[x] - 1$ are modes. If $\alpha = \beta = 1$ we obtain the discrete uniform distribution, assigning mass $(n+1)^{-1}$ to each possible $x$, $x = 0, 1, \dots, n$.
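As a numerical aside (not in the original text), the defining beta mixture of the binomial can be checked directly: integrating $\mathrm{Bi}(x \mid \theta, n)\,\mathrm{Be}(\theta \mid \alpha, \beta)$ over a fine grid in $\theta$ reproduces SciPy's closed-form beta-binomial pmf, and the moment formulae above match as well. The parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

a, b, n = 2.5, 4.0, 10
x = np.arange(n + 1)

# Closed-form binomial-beta pmf.
pmf_bb = stats.betabinom.pmf(x, n, a, b)

# The same pmf from the mixture  integral of Bi(x|theta,n) Be(theta|a,b),
# approximated by a Riemann sum over theta.
theta = np.linspace(1e-6, 1 - 1e-6, 20_001)
dtheta = theta[1] - theta[0]
w = stats.beta.pdf(theta, a, b)
mix = (stats.binom.pmf(x[:, None], n, theta[None, :]) * w).sum(axis=1) * dtheta
```

The grid resolution controls the (small) discrepancy between the two computations.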
The Negative-Binomial-Beta Distribution

A discrete random quantity $x$ has a negative-binomial-beta distribution with parameters $\alpha$, $\beta$ and $r$ ($\alpha > 0$, $\beta > 0$, $r = 1, 2, \dots$) if its probability function $\mathrm{Nbb}(x \mid \alpha, \beta, r)$ is

$$\mathrm{Nbb}(x \mid \alpha, \beta, r) = c \binom{r+x-1}{x} \frac{\Gamma(\beta+x)}{\Gamma(\alpha+\beta+r+x)}\,, \qquad x = 0, 1, 2, \dots,$$

where

$$c = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \Gamma(\alpha+r).$$

The distribution is generated by the mixture

$$\mathrm{Nbb}(x \mid \alpha, \beta, r) = \int_0^1 \mathrm{Nb}(x \mid \theta, r)\, \mathrm{Be}(\theta \mid \alpha, \beta)\, d\theta.$$

The mean is

$$E[x] = \frac{r\beta}{\alpha-1}\,, \qquad \alpha > 1,$$

and the variance is given by

$$V[x] = \frac{r\beta(\alpha+\beta-1)(r+\alpha-1)}{(\alpha-1)^2(\alpha-2)}\,, \qquad \alpha > 2.$$

The Gamma Distribution

A continuous random quantity $x$ has a gamma distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Ga}(x \mid \alpha, \beta)$ is

$$\mathrm{Ga}(x \mid \alpha, \beta) = c\, x^{\alpha-1} e^{-\beta x}, \qquad x > 0,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. Systematic application of the gamma integral gives $E[x] = \alpha/\beta$ and $V[x] = \alpha/\beta^2$. If $\alpha > 1$, there is a unique mode at $(\alpha-1)/\beta$; if $\alpha < 1$ there are no modes (the density is unbounded). Moreover, the sum of $k$ independent gamma random quantities with parameters $(\alpha_i, \beta)$, $i = 1, \dots, k$, is a gamma random quantity with parameters $\alpha_1 + \cdots + \alpha_k$ and $\beta$. By considering the transformed random quantities $y = a + x$ or $y = a - x$, the gamma distribution can be generalised to the ranges $(a, \infty)$ or $(-\infty, a)$.

If $\alpha = 1$, $x$ is said to have an exponential $\mathrm{Ex}(x \mid \beta)$ distribution with parameter $\beta$ and density $\mathrm{Ex}(x \mid \beta) = \beta e^{-\beta x}$, $x > 0$. The mode of an exponential distribution is located at zero. If $\alpha$ is an integer, $\alpha > 1$, $x$ is said to have an Erlang distribution. If $\alpha = \nu/2$, $\beta = 1/2$, $x$ is said to have a (central) chi-squared ($\chi^2$) distribution with parameter $\nu$ (often referred to as degrees of freedom) and density denoted by $\chi^2(x \mid \nu)$.
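As a numerical aside (not in the original text), the additive property of the gamma family with a common rate parameter can be checked by simulation; the shape parameters and sample size below are arbitrary, and the Kolmogorov-Smirnov statistic is used as a rough measure of distributional agreement.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
beta = 1.5

# Sum of independent Ga(a_i, beta) quantities (same rate) is Ga(sum a_i, beta).
a = [0.7, 1.3, 2.0]
total = sum(rng.gamma(shape=ai, scale=1 / beta, size=100_000) for ai in a)

# Distance between the empirical distribution of the sum and Ga(4, beta).
ks = stats.kstest(total, stats.gamma(sum(a), scale=1 / beta).cdf).statistic
```

Note that NumPy and SciPy parameterise the gamma by a scale, i.e. $1/\beta$ in the book's rate convention.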
The Inverted-Gamma Distribution

A continuous random quantity $x$ has an inverted-gamma distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Ig}(x \mid \alpha, \beta)$ is

$$\mathrm{Ig}(x \mid \alpha, \beta) = c\, x^{-(\alpha+1)} e^{-\beta/x}, \qquad x > 0,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. The term inverted-gamma derives from the easily established fact that if $y$ has a $\mathrm{Ga}(y \mid \alpha, \beta)$ density then $x = y^{-1}$ has an $\mathrm{Ig}(x \mid \alpha, \beta)$ density. Systematic application of the gamma integral gives

$$E[x] = \frac{\beta}{\alpha-1}\,, \quad \alpha > 1, \qquad V[x] = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}\,, \quad \alpha > 2.$$

There is a unique mode at $\beta/(\alpha+1)$. If $x$ has an inverted-gamma distribution with $\alpha = \nu/2$, $\beta = 1/2$, then $x$ is said to have an inverted-$\chi^2$ distribution. A continuous random quantity $y$ has a square-root inverted-gamma density if $x = y^{-2}$ has a $\mathrm{Ga}(x \mid \alpha, \beta)$ density.

The Poisson-Gamma Distribution

A discrete random quantity $x$ has a Poisson-gamma distribution with parameters $\alpha$, $\beta$ and $\nu$ ($\alpha > 0$, $\beta > 0$, $\nu > 0$) if its probability function $\mathrm{Pg}(x \mid \alpha, \beta, \nu)$ is

$$\mathrm{Pg}(x \mid \alpha, \beta, \nu) = c\, \frac{\Gamma(\alpha+x)}{x!}\, \frac{\nu^x}{(\beta+\nu)^{\alpha+x}}\,, \qquad x = 0, 1, 2, \dots,$$

where $c = \beta^\alpha/\Gamma(\alpha)$. The distribution is generated by the mixture

$$\mathrm{Pg}(x \mid \alpha, \beta, \nu) = \int_0^\infty \mathrm{Pn}(x \mid \nu\lambda)\, \mathrm{Ga}(\lambda \mid \alpha, \beta)\, d\lambda.$$

This compound Poisson distribution is, in fact, a generalisation of the negative-binomial distribution $\mathrm{Nb}(x \mid \theta, r)$, previously defined only for integer $r$. The mean is $E[x] = \nu\alpha/\beta$, and the variance is $V[x] = \nu\alpha(\beta+\nu)/\beta^2$. Moreover, $M[x] = 0$ if $\alpha\nu < \beta + \nu$; there are two modes, at 0 and 1, if $\alpha\nu = \beta + \nu$; and if $\alpha\nu > \beta + \nu$ there is a mode at the least integer not less than $\{\nu(\alpha-1) - \beta\}/\beta$.
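As a numerical aside (not in the original text), the identification of the Poisson-gamma with a negative binomial of non-integer index can be checked with SciPy: $\mathrm{Pg}(x \mid \alpha, \beta, \nu)$ coincides with a negative binomial with "number of successes" $\alpha$ and success probability $\beta/(\beta+\nu)$, and both agree with a grid approximation of the defining mixture. Parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

a, b, v = 2.5, 1.2, 3.0
x = np.arange(0, 60)

# Poisson-gamma pmf written as a negative binomial with non-integer index.
pg = stats.nbinom.pmf(x, a, b / (b + v))

# The same pmf from the mixture  integral of Pn(x|v*lam) Ga(lam|a,b),
# approximated by a Riemann sum over lam.
lam = np.linspace(1e-6, 40, 40_001)
dlam = lam[1] - lam[0]
w = stats.gamma.pdf(lam, a, scale=1 / b)
mix = (stats.poisson.pmf(x[:, None], v * lam[None, :]) * w).sum(axis=1) * dlam
```

The mean and variance computed from the pmf reproduce $\nu\alpha/\beta$ and $\nu\alpha(\beta+\nu)/\beta^2$ up to the (negligible) truncation of the support at $x = 59$.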
The Gamma-Gamma Distribution

A continuous random quantity $x$ has a gamma-gamma distribution with parameters $\alpha$, $\beta$ and $n$ ($\alpha > 0$, $\beta > 0$, $n = 1, 2, \dots$) if its density function $\mathrm{Gg}(x \mid \alpha, \beta, n)$ is

$$\mathrm{Gg}(x \mid \alpha, \beta, n) = c\, \frac{x^{n-1}}{(\beta+x)^{\alpha+n}}\,, \qquad x > 0,$$

where

$$c = \frac{\Gamma(\alpha+n)}{\Gamma(\alpha)\Gamma(n)}\, \beta^\alpha.$$

The distribution is generated by the mixture

$$\mathrm{Gg}(x \mid \alpha, \beta, n) = \int_0^\infty \mathrm{Ga}(x \mid n, \lambda)\, \mathrm{Ga}(\lambda \mid \alpha, \beta)\, d\lambda.$$

The mean and variance are given by

$$E[x] = \frac{n\beta}{\alpha-1}\,, \quad \alpha > 1, \qquad V[x] = \frac{n\beta^2(n+\alpha-1)}{(\alpha-1)^2(\alpha-2)}\,, \quad \alpha > 2.$$

The Pareto Distribution

A continuous random quantity $x$ has a Pareto distribution with parameters $\alpha$ and $\beta$ ($\alpha > 0$, $\beta > 0$) if its density function $\mathrm{Pa}(x \mid \alpha, \beta)$ is

$$\mathrm{Pa}(x \mid \alpha, \beta) = c\, x^{-(\alpha+1)}, \qquad x \ge \beta,$$

where $c = \alpha\beta^\alpha$. The distribution is generated by an exponential mixture: if $z$ has the density

$$\int_0^\infty \mathrm{Ex}(z \mid \theta)\, \mathrm{Ga}(\theta \mid \alpha, \beta)\, d\theta,$$

then $x = \beta + z$ has a $\mathrm{Pa}(x \mid \alpha, \beta)$ density. The mean and variance are given by

$$E[x] = \frac{\alpha\beta}{\alpha-1}\,, \quad \alpha > 1, \qquad V[x] = \frac{\alpha\beta^2}{(\alpha-1)^2(\alpha-2)}\,, \quad \alpha > 2.$$

The mode is $M[x] = \beta$. A continuous random quantity $y$ has an inverted-Pareto density $\mathrm{Ip}(y \mid \alpha, \beta)$ if $y = x^{-1}$, where $x$ has a $\mathrm{Pa}(x \mid \alpha, \beta)$ density, so that

$$\mathrm{Ip}(y \mid \alpha, \beta) = \alpha\beta^\alpha\, y^{\alpha-1}, \qquad 0 < y \le \beta^{-1}.$$
The Normal Distribution

A continuous random quantity $x$ has a normal distribution with parameters $\mu$ and $\lambda$ ($\mu \in \Re$, $\lambda > 0$) if its density function $\mathrm{N}(x \mid \mu, \lambda)$ is

$$\mathrm{N}(x \mid \mu, \lambda) = c \exp\left\{-\tfrac{1}{2}\lambda(x-\mu)^2\right\},$$

where $c = (\lambda/2\pi)^{1/2}$. The distribution is symmetrical about $x = \mu$. The mean and mode are $E[x] = M[x] = \mu$, and the variance is $V[x] = \lambda^{-1}$, so that, where $\sigma^2 = V[x]$ is the variance, $\lambda = \sigma^{-2}$ here represents the precision of the distribution. If $\mu = 0$, $\lambda = 1$, $x$ is said to have a standard normal distribution, with distribution function $\Phi$ given by

$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left\{-\tfrac{1}{2}t^2\right\} dt.$$

If $y = \lambda^{1/2}(x-\mu)$, where $x$ has a normal density $\mathrm{N}(x \mid \mu, \lambda)$, then $y$ has a $\mathrm{N}(y \mid 0, 1)$ (standard) density. Moreover, if $x_1, \dots, x_k$ are mutually independent normal random quantities with densities $\mathrm{N}(x_i \mid \mu_i, \lambda_i)$, and $y = a + \sum_{i=1}^{k} b_i x_i$, then $y$ has a normal density $\mathrm{N}(y \mid a + \sum_i b_i\mu_i, \lambda)$, where $\lambda = \left(\sum_i b_i^2 \lambda_i^{-1}\right)^{-1}$, a weighted harmonic mean of the individual precisions.

The Noncentral $\chi^2$ Distribution

A continuous random quantity $x$ has a noncentral $\chi^2$ distribution with parameters $\nu$ (degrees of freedom) and $\lambda$ (noncentrality) ($\nu > 0$, $\lambda > 0$) if its density function $\chi^2(x \mid \nu, \lambda)$ is

$$\chi^2(x \mid \nu, \lambda) = \sum_{i=0}^{\infty} \mathrm{Pn}(i \mid \lambda/2)\, \chi^2(x \mid \nu + 2i),$$

i.e., a mixture of central $\chi^2$ distributions with Poisson weights. It reduces to a central $\chi^2(\nu)$ when $\lambda = 0$. The mean and variance are $E[x] = \nu + \lambda$ and $V[x] = 2(\nu + 2\lambda)$. The distribution is unimodal.

If $x_1, \dots, x_\nu$ are mutually independent standard normal random quantities, then $z = \sum_{i=1}^{\nu} x_i^2$ has a (central) $\chi^2(\nu)$ distribution. If $x_1, \dots, x_k$ are independent with $\mathrm{N}(x_i \mid \mu_i, \lambda_i)$ densities, then $z = \sum_{i=1}^{k} \lambda_i (x_i - \mu_i)^2$ has a (central) $\chi^2_k$ distribution, and $z = \sum_{i=1}^{k} \lambda_i x_i^2$ has a noncentral $\chi^2$ distribution with parameters $k$ and $\lambda = \sum_{i=1}^{k} \lambda_i \mu_i^2$. The sum of $k$ independent noncentral $\chi^2$ random quantities with parameters $(\nu_i, \lambda_i)$ is a noncentral $\chi^2$ with parameters $\nu_1 + \cdots + \nu_k$ and $\lambda_1 + \cdots + \lambda_k$.
The Logistic Distribution

A continuous random quantity $x$ has a logistic distribution with parameters $\alpha$ and $\beta$ ($\alpha \in \Re$, $\beta > 0$) if its density $\mathrm{Lo}(x \mid \alpha, \beta)$ is

$$\mathrm{Lo}(x \mid \alpha, \beta) = c\, \frac{\exp\{-\beta(x-\alpha)\}}{\left[1 + \exp\{-\beta(x-\alpha)\}\right]^2}\,,$$

where $c = \beta$. An alternative expression for the density function is

$$\mathrm{Lo}(x \mid \alpha, \beta) = \frac{\beta}{4}\, \mathrm{sech}^2\left\{\frac{\beta(x-\alpha)}{2}\right\},$$

so that the logistic is sometimes called the sech-squared distribution. The logistic distribution is most simply expressed in terms of its distribution function,

$$F_x(x) = \left[1 + \exp\{-\beta(x-\alpha)\}\right]^{-1}.$$

The distribution is symmetrical about $x = \alpha$, and has a unique mode $M[x] = \alpha$. The mean is $E[x] = \alpha$, and the variance is $V[x] = \pi^2/(3\beta^2)$.

The Student ($t$) Distribution

A continuous random quantity $x$ has a Student distribution with parameters $\mu$, $\lambda$ and $\alpha$ ($\mu \in \Re$, $\lambda > 0$, $\alpha > 0$) if its density $\mathrm{St}(x \mid \mu, \lambda, \alpha)$ is

$$\mathrm{St}(x \mid \mu, \lambda, \alpha) = c \left[1 + \frac{\lambda(x-\mu)^2}{\alpha}\right]^{-(\alpha+1)/2},$$

where

$$c = \frac{\Gamma\{(\alpha+1)/2\}}{\Gamma(\alpha/2)} \left(\frac{\lambda}{\alpha\pi}\right)^{1/2}.$$

The distribution is symmetrical about $x = \mu$. The mean and mode are given by $E[x] = M[x] = \mu$ (the mean existing for $\alpha > 1$), and the variance is $V[x] = \lambda^{-1}\alpha/(\alpha-2)$, for $\alpha > 2$. The parameter $\alpha$ is usually referred to as the degrees of freedom of the distribution.
The distribution is generated (Dickey, 1968) by the mixture

$$\mathrm{St}(x \mid \mu, \lambda, \alpha) = \int_0^\infty \mathrm{N}(x \mid \mu, \lambda y)\, \mathrm{Ga}(y \mid \alpha/2, \alpha/2)\, dy,$$

and includes the normal distribution as a limiting case, since

$$\mathrm{N}(x \mid \mu, \lambda) = \lim_{\alpha \to \infty} \mathrm{St}(x \mid \mu, \lambda, \alpha).$$

If $\alpha = 1$, $x$ is said to have a Cauchy distribution, with density $\mathrm{Ca}(x \mid \mu, \lambda) = \mathrm{St}(x \mid \mu, \lambda, 1)$. If $y = \lambda^{1/2}(x-\mu)$, where $x$ has a $\mathrm{St}(x \mid \mu, \lambda, \alpha)$ density, then $y$ has a (standard) Student density $\mathrm{St}(y \mid 0, 1, \alpha)$, with $\alpha$ degrees of freedom. If $x$ has a standard normal distribution, $y$ has a $\chi^2_\nu$ distribution, and $x$ and $y$ are mutually independent, then

$$z = \frac{x}{\sqrt{y/\nu}}$$

has a standard Student density $\mathrm{St}(z \mid 0, 1, \nu)$, with $\nu$ degrees of freedom.

The Snedecor ($F$) Distribution

A continuous random quantity $x$ has a Snedecor, or Fisher, distribution with parameters $\alpha$ and $\beta$ (degrees of freedom) ($\alpha > 0$, $\beta > 0$) if its density $\mathrm{Fs}(x \mid \alpha, \beta)$ is

$$\mathrm{Fs}(x \mid \alpha, \beta) = c\, x^{\alpha/2-1}\left(1 + \frac{\alpha}{\beta}\,x\right)^{-(\alpha+\beta)/2}, \qquad x > 0,$$

where

$$c = \frac{\Gamma\{(\alpha+\beta)/2\}}{\Gamma(\alpha/2)\Gamma(\beta/2)} \left(\frac{\alpha}{\beta}\right)^{\alpha/2}.$$

If $\beta > 2$, $E[x] = \beta/(\beta-2)$; moreover, if $\beta > 4$,

$$V[x] = \frac{2\beta^2(\alpha+\beta-2)}{\alpha(\beta-4)(\beta-2)^2}\,,$$

and there is a unique mode at $[\beta/(\beta+2)][(\alpha-2)/\alpha]$, for $\alpha > 2$. If $x$ and $y$ are independent random quantities with central $\chi^2$ distributions, with $\nu_1$ and $\nu_2$ degrees of freedom, respectively, then

$$z = \frac{x/\nu_1}{y/\nu_2}$$

has a Snedecor distribution with $\nu_1$ and $\nu_2$ degrees of freedom.

Relationships between some of the distributions described above can be established using the change-of-variable techniques reviewed earlier in Section 3.2. For a geometrical interpretation of some of these relations, see Bailey (1992).
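As a numerical aside (not in the original text), the constructions just described, and two exact distribution-function identities linking the gamma and beta families to the $\chi^2$ and $F$ families, can be checked with SciPy. The sample-based checks use the Kolmogorov-Smirnov statistic as a rough distance; parameter values and tolerances are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# (a) Constructions from independent pieces: t as a standard normal over
# the square root of a scaled chi-squared, F as a ratio of scaled chi-squareds.
nu1, nu2, m = 5, 7, 200_000
xn = rng.standard_normal(m)
y1 = stats.chi2.rvs(nu1, size=m, random_state=rng)
y2 = stats.chi2.rvs(nu2, size=m, random_state=rng)
ks_t = stats.kstest(xn / np.sqrt(y1 / nu1), stats.t(nu1).cdf).statistic
ks_f = stats.kstest((y1 / nu1) / (y2 / nu2), stats.f(nu1, nu2).cdf).statistic

# (b) Exact identities: if x ~ Ga(a, b) then 2*b*x ~ chi-squared(2a);
# if x ~ Be(a, b) then b*x/(a*(1-x)) ~ Fs(2a, 2b).
a, b = 2.3, 1.7
xs = np.linspace(0.05, 6.0, 50)
gamma_vs_chi2 = np.max(np.abs(
    stats.gamma.cdf(xs, a, scale=1 / b) - stats.chi2.cdf(2 * b * xs, 2 * a)))
us = np.linspace(0.01, 0.99, 50)
beta_vs_f = np.max(np.abs(
    stats.beta.cdf(us, a, b) - stats.f.cdf(b * us / (a * (1 - us)), 2 * a, 2 * b)))
```

The identities in (b) hold exactly, so the discrepancies there are at floating-point level, while those in (a) are Monte Carlo approximations.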
Example 3.2. (Gamma and $\chi^2$ distributions). Suppose that $x$ has a $\mathrm{Ga}(x \mid \alpha, \beta)$ density and let $y = 2\beta x$. Then, for $y > 0$,

$$p(y) = \mathrm{Ga}\left(\frac{y}{2\beta}\,\Big|\,\alpha, \beta\right)\frac{1}{2\beta} = \frac{(1/2)^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-y/2},$$

so that $y$ has a $\chi^2(y \mid 2\alpha)$ density. Since $\chi^2$ distributions are extensively tabulated, this relationship provides a useful basis for numerical work involving any gamma distribution.

Example 3.3. (Beta, binomial and Snedecor ($F$) distributions). Suppose that $x$ has a density $\mathrm{Be}(x \mid \alpha, \beta)$, $0 < x < 1$, and let $y = \beta x/\{\alpha(1-x)\}$. Then, for $y > 0$, changing variables via $x = \alpha y/(\beta + \alpha y)$, we have

$$p(y) \propto y^{\alpha/2 \cdot 2 - 1}\left(1 + \frac{\alpha}{\beta}\,y\right)^{-(\alpha+\beta)},$$

so that $y$ has the Snedecor density $\mathrm{Fs}(y \mid 2\alpha, 2\beta)$. Binomial probabilities may also be obtained from the $F$ distribution using the exact relation between their distribution functions given by Peizer and Pratt (1968). Since $F$ distributions are extensively tabulated, these relationships provide a useful basis for numerical work involving any beta or binomial distribution.

Example 3.4. (Approximate moments for transformed random quantities). Suppose that $x$ has a $\mathrm{Bi}(x \mid \theta, n)$ distribution, but that we are interested in the means and variances of the transformed random quantities

$$y_1 = \frac{x}{n}\,, \qquad y_2 = \sin^{-1}\sqrt{\frac{x}{n}}\,.$$

Recalling that $E[x] = n\theta$ and $V[x] = n\theta(1-\theta)$, and evaluating the derivatives of the transformations at $E[x]$, application of Proposition 3.3 immediately yields the following approximations:

$$E[y_1] \approx \theta, \qquad V[y_1] \approx \frac{\theta(1-\theta)}{n}\,,$$

$$E[y_2] \approx \sin^{-1}\sqrt{\theta}, \qquad V[y_2] \approx \frac{1}{4n}\,.$$

We note that the second-order (correction) terms in the mean approximations will be small for large $n$, and, in particular, that the variance is "stabilised" (i.e., does not depend on $\theta$) under the second transformation.

3.2.3 Convergence and Limit Theorems

Within the countably additive framework for probability which we are currently reviewing, much of the powerful resulting mathematical machinery rests on various notions of limit process. We shall summarise a few of the main ideas and results, beginning with the four most widely used notions of convergence for random quantities.

Definition 3.9. (Convergence). A sequence $x_1, x_2, \dots$ of random quantities:

(i) converges in distribution to a random quantity $x$ if and only if the corresponding distribution functions are such that

$$\lim_{n\to\infty} F_n(t) = F(t)$$

at all continuity points $t$ of $F$ in $\Re$;

(ii) converges in probability to a random quantity $x$ if and only if, for all $\varepsilon > 0$,

$$\lim_{n\to\infty} P(\{\omega : |x_n(\omega) - x(\omega)| > \varepsilon\}) = 0;$$

(iii) converges almost surely to a random quantity $x$ if and only if

$$P\left(\left\{\omega : \lim_{n\to\infty} x_n(\omega) = x(\omega)\right\}\right) = 1;$$

in other words, $x_n(\omega)$ tends to $x(\omega)$ for all $\omega$ except those lying in a set of $P$-measure zero;

(iv) converges in mean square to a random quantity $x$ if and only if $E[x_n^2] < \infty$, for all $n$, and

$$\lim_{n\to\infty} E\left[(x_n - x)^2\right] = 0.$$
Convergence in mean square implies convergence in probability but, for finite random quantities, the converse is false. Almost sure convergence also implies convergence in probability; again, the converse is false. Convergence in probability implies convergence in distribution. Convergence in distribution is completely determined by the distribution functions: the corresponding random quantities need not be defined on the same probability space. Some of the most basic results are the following:

(i) If $F_n \to F$, and $\phi_n(t)$ and $\phi(t)$ are the corresponding characteristic functions, then $\phi_n(t) \to \phi(t)$ for all $t \in \Re$. The converse also holds, provided that $\phi(t)$ is continuous at $t = 0$.

(ii) If $F_n \to F$ then, for every bounded continuous function $g$, the sequence of expected values $E_n[g(x)]$ with respect to $F_n$ converges to the expected value $E[g(x)]$ with respect to $F$.

(iii) (Helly's theorem). Given a sequence $\{F_i\}$ of distribution functions such that for each $\varepsilon > 0$ there exists an $M(\varepsilon)$ such that, for all $i$ sufficiently large, $F_i(M(\varepsilon)) > 1 - \varepsilon$, there exist a distribution function $F$ and a subsequence $F_{n_1}, F_{n_2}, \dots$ such that $F_{n_j} \to F$.

An important class of limit results, the so-called laws of large numbers, link the limiting behaviour of averages of (independent) random quantities with their expectations. Some of the most basic of these are the following:

(i) If $x_1, x_2, \dots$ are independent, identically distributed random quantities with $E[x_i] = \mu$ and $V[x_i] = \sigma^2 < \infty$, then the sequence $\bar{x}_n = n^{-1}\sum_{i=1}^{n} x_i$, $n = 1, 2, \dots$, converges in mean square (and hence in probability) to $\mu$.

(ii) The weak law of large numbers. If $x_1, x_2, \dots$ are independent, identically distributed random quantities with $E[|x_i|] < \infty$ and $E[x_i] = \mu$, then the sequence $\bar{x}_n$, $n = 1, 2, \dots$, converges in probability to $\mu$; that is to say, it converges in distribution to the degenerate, discrete random quantity which assigns probability one to $\mu$.

(iii) The strong law of large numbers. Under the same conditions as in (ii), the sequence $\bar{x}_n$, $n = 1, 2, \dots$, converges almost surely to $\mu$.

In addition, there is a further class of limit results which characterises in more detail the properties of the distance between the sequence and the limit values. Two important examples are the following:

(i) The central limit theorem. If $x_1, x_2, \dots$ are independent, identically distributed random quantities with $E[x_i] = \mu$ and $V[x_i] = \sigma^2 < \infty$, then the sequence of standardised random quantities

$$z_n = \frac{\bar{x}_n - \mu}{\sigma/\sqrt{n}}\,, \qquad n = 1, 2, \dots,$$

converges in distribution to the standard normal distribution.
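As a numerical aside (not in the original text), the law of large numbers and the central limit theorem can be illustrated with standard exponential quantities, for which $\mu = \sigma = 1$; sample sizes and tolerances below are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# reps independent means of n Exponential(1) quantities.
n, reps = 400, 10_000
samples = rng.exponential(size=(reps, n))
means = samples.mean(axis=1)

# Standardised means should be approximately standard normal (CLT),
# and the grand mean close to mu = 1 (law of large numbers).
z = (means - 1.0) * np.sqrt(n)  # sigma = 1 for Exponential(1)
ks = stats.kstest(z, stats.norm.cdf).statistic
max_dev = abs(means.mean() - 1.0)
```

The residual KS distance reflects both Monte Carlo noise and the $O(n^{-1/2})$ skewness correction to the normal approximation.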
(ii) The law of the iterated logarithm. Under the conditions assumed for the central limit theorem,

$$\limsup_{n\to\infty}\, (2\log\log n)^{-1/2}\, \frac{\sqrt{n}\,(\bar{x}_n - \mu)}{\sigma} = 1$$

with probability one.

There are enormously wide-ranging variations and generalisations of these results, but we shall rarely need to go beyond the above in our subsequent discussion.

3.2.4 Random Vectors, Bayes' Theorem

A random quantity represents a numerical summary of the potential outcomes in an uncertain situation. However, in general each outcome has many different numerical summaries which may be associated with it. For example, the birth of an infant might be recorded in terms of weight and heart-rate measurements, as well as an encoding (for example, using a 0-1 convention) of its sex; a description of the state of the international commodity market would typically involve a whole complex of price information. It is necessary therefore to have available the mathematical apparatus for handling a vector of numerical information.

Generalising our earlier discussion given in Section 3.2.1, we move the focus of attention from the underlying probability space $\{\Omega, \mathcal{F}, P\}$ to the context of $\Re^k$ and an induced probability measure $P_x$. Formally, we wish to define a mapping which associates a vector $x(\omega)$ of $k$ real numbers with each elementary outcome $\omega$ of $\Omega$. However, we shall again wish to ensure that $P_x$ is well-defined for particular subsets of $\Re^k$, and this puts mathematical constraints on the form of the function $x$. As in the case of (univariate) random quantities, we shall take this class of subsets to be the smallest $\sigma$-algebra, $\mathcal{B}^k$, containing all forms of $k$-dimensional interval (the so-called Borel sets of $\Re^k$). This then prompts the following definition.

Definition 3.10. (Random vector). A random vector $x$ on a probability space $\{\Omega, \mathcal{F}, P\}$ is a function $x: \Omega \to X \subseteq \Re^k$ such that $x^{-1}(B) \in \mathcal{F}$, for all $B \in \mathcal{B}^k$.

For a random vector $x$, the induced probability measure $P_x$ is defined in the natural way by

$$P_x(B) = P(x^{-1}(B)), \qquad \text{for all } B \in \mathcal{B}^k.$$

The possible forms of distribution for $x$ are potentially much more complicated for a random vector than in the case of a single random quantity, in that they
not only describe the uncertainty about each of the individual component random quantities in the vector, but also the dependencies among them.

As in the case of (univariate) random quantities, we can distinguish discrete distributions, where $x$ takes only a countable number of possible values and the distribution can be described by the probability (mass) function

$$p_x(x) = P(\{\omega : x(\omega) = x\}),$$

and (absolutely) continuous distributions, where the distribution may be described by a density function $p_x$ such that

$$P_x(B) = \int_B p_x(x)\, dx.$$

We could, of course, also have cases where some of the components are discrete and others are continuous, or are themselves a mixture of the two types. In what follows, we shall usually present our discussion using the notation for the continuous case. It will always be clear from the context how to reinterpret things in the discrete (or mixed) cases.

The distribution function of a random vector $x$ is the real-valued function $F_x: \Re^k \to [0, 1]$ defined by

$$F_x(x) = P_x(\{(y_1, \dots, y_k) : y_1 \le x_1, \dots, y_k \le x_k\}).$$

The density $p_x(x) = p_x(x_1, \dots, x_k)$ of the random vector $x$ is often referred to as the joint density of the random quantities $x_1, \dots, x_k$. If the random vector $x$ is partitioned into $x = (y, z)$, where $y = (x_1, \dots, x_j)$ and $z = (x_{j+1}, \dots, x_k)$, the marginal density for the vector $y$ is given by

$$p_y(y) = \int p_x(y, z)\, dz,$$

or, alternatively,

$$p(x_1, \dots, x_j) = \int p(x_1, \dots, x_k)\, dx_{j+1} \cdots dx_k,$$

dropping the subscripts without danger of confusion. This operation of passing from a joint to a marginal density occurs so commonly in Bayesian inference contexts (see Chapter 5, in particular) that it is useful to have available a simple alternative notation, emphasising the operation itself, rather than the technical integration required. To denote the marginalisation operation we shall therefore write

$$p(x_1, \dots, x_j) = \int p(x_1, \dots, x_k)\, d(x_{j+1}, \dots, x_k).$$

As in the one-dimensional case, the conditional density for the random vector $z$, given that $y(\omega) = y$, is defined by

$$p_{z|y}(z \mid y) = \frac{p_x(y, z)}{p_y(y)}\,,$$
or, again dropping the subscripts for convenience,

$$p(z \mid y) = \frac{p(y, z)}{p(y)}\,.$$

We shall almost always use the generic subscript-free notation for densities. It is therefore important to remember that the functional forms of the various marginal and conditional densities will typically differ.

In many cases, we note that if a density function $p(x)$ can be expressed in the form $c\,q(x)$, where $q$ is a function and $c$ is a constant, not depending on $x$, then, since $\int p(x)\,dx = 1$,

$$c = \left\{\int q(x)\, dx\right\}^{-1}.$$

Any such $q(x)$ will be referred to as a kernel of the density $p(x)$.

Proposition 3.4. (Generalised Bayes' theorem).

$$p_{y|x}(y \mid x) = \frac{p_{x|y}(x \mid y)\, p_y(y)}{p_x(x)}\,.$$

Proof. It is obvious that

$$p_{x,y}(x, y) = p_{x|y}(x \mid y)\, p_y(y) = p_{y|x}(y \mid x)\, p_x(x),$$

which immediately yields the result.

It is often convenient to re-express Bayes' theorem in the simple proportionality form

$$p_{y|x}(y \mid x) \propto p_{x|y}(x \mid y)\, p_y(y),$$

since the right-hand side contains all the information required to reconstruct the normalising constant, should the latter be needed explicitly. The proportionality form of Bayes' theorem then makes it clear that, more generally, we can always just work with kernels of densities, up to the final stage of calculating the normalising constant. In many cases, however, it is not explicitly required, since the "shape" of $p_{y|x}(y \mid x)$ is all that one needs to know. This latter observation is often extremely useful for avoiding unnecessary detail when carrying out manipulations involving Bayes' theorem.
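As a numerical aside (not in the original text), the "work with kernels, normalise at the end" strategy can be illustrated on a grid for a binomial observation with a beta prior, where the exact conjugate posterior is available for comparison; the data and prior values below are arbitrary.

```python
import numpy as np
from scipy import stats

# Prior Be(theta | a, b) and an observed Bi(x | theta, n) count.
a, b = 2.0, 3.0
n, x_obs = 20, 13

# Unnormalised kernel p(x|theta) p(theta); the binomial coefficient is a
# constant in theta and can be dropped.
theta = np.linspace(1e-6, 1 - 1e-6, 10_001)
kernel = theta ** x_obs * (1 - theta) ** (n - x_obs) * stats.beta.pdf(theta, a, b)

# Normalise numerically only at the final stage.
posterior = kernel / (kernel.sum() * (theta[1] - theta[0]))

# Exact conjugate answer: Be(theta | a + x, b + n - x).
exact = stats.beta.pdf(theta, a + x_obs, b + n - x_obs)
```

The grid posterior matches the closed-form density to within the discretisation error of the normalising sum.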
For further technical discussion of the generalised Bayes' theorem, see Mouchart (1976), Hartigan (1983, Chapter 3) and Wasserman and Kadane (1990).

The Bayes' theorem operation of passing from a conditional and a marginal density to the "other" conditional density is also fundamental to Bayesian inference and, again, it is useful to have available an alternative notation. To denote the Bayes' theorem operation, we shall therefore write

$$p_{y|z}(y \mid z) \propto p_{z|y}(z \mid y)\, p_y(y).$$

In more explicit terms, extending the use of the terms given in Chapter 2, we shall typically interpret densities such as $p(x_{j+1}, \dots, x_k \mid x_1, \dots, x_j)$ as describing beliefs for the random quantities $x_{j+1}, \dots, x_k$ after (i.e., posterior to) observing $x_1, \dots, x_j$, and $p(x_{j+1}, \dots, x_k)$ as describing beliefs before (i.e., prior to) observing the random quantities $x_1, \dots, x_j$. Bayes' theorem can then be written in the form

$$p(x_{j+1}, \dots, x_k \mid x_1, \dots, x_j) = \frac{p(x_1, \dots, x_j \mid x_{j+1}, \dots, x_k)\, p(x_{j+1}, \dots, x_k)}{p(x_1, \dots, x_j)}\,.$$

Manipulations based on this form will underlie the greater part of the ideas and results to be developed in subsequent chapters. As with marginalising, manipulation is simplified if independence or conditional independence assumptions can be made. For example, if $x_1, \dots, x_j$ were independent, we would have

$$p(x_1, \dots, x_j) = \prod_{i=1}^{j} p(x_i),$$

and if $x_1, \dots, x_j$ were conditionally independent, given $x_{j+1}, \dots, x_k$, we would have

$$p(x_1, \dots, x_j \mid x_{j+1}, \dots, x_k) = \prod_{i=1}^{j} p(x_i \mid x_{j+1}, \dots, x_k).$$

If $x$ is a random vector defined on $\{\Omega, \mathcal{F}, P\}$, such that $x: \Omega \to X \subseteq \Re^k$, and if $g: \Re^k \to \Re^h$ ($h \le k$) is a function such that $(g \circ x)^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}^h$, then $g \circ x$ is also a random vector. We shall typically denote $g \circ x$ by $g(x)$ and, whenever we refer to such vector functions of a random vector $x$, it is to be understood that the composite function is indeed a random vector. Writing $y = g(x)$, the random vector $y$ induces a probability space $\{\Re^h, \mathcal{B}^h, P_y\}$, where

$$P_y(B) = P_x(g^{-1}(B)) = P((g \circ x)^{-1}(B)),$$
and distribution and density functions $F_y$, $p_y$ are defined in the obvious way. These forms are easily related to $F_x$, $p_x$. In particular, if $g$ is a one-to-one differentiable function with inverse $g^{-1}$, we have

$$p_y(y) = p_x(g^{-1}(y))\, |J_{g^{-1}}(y)|,$$

for each $y \in Y$, where

$$J_{g^{-1}}(y) = \det\left[\frac{\partial g_i^{-1}(y)}{\partial y_j}\right]$$

is the Jacobian of the transformation $g^{-1}$. If $h < k$, we are usually able to define an appropriate $z$, with dimension $k - h$, such that $w = (y, z) = f(x)$ is a one-to-one function with inverse $f^{-1}$, and then proceed in two steps to obtain $p_y(y)$ by first obtaining

$$p_w(w) = p_x(f^{-1}(w))\, |J_{f^{-1}}(w)|,$$

and then marginalising to

$$p_y(y) = \int p_w(y, z)\, dz.$$

The expectation concept generalises to the case of random vectors in an obvious way.

Definition 3.11. (Expectation of a random vector). If $x, y$ are random vectors such that $y = g(x)$, $x: \Omega \to X \subseteq \Re^k$, $g: X \to Y \subseteq \Re^h$ ($h \le k$), the expectation of $y$, $E[y] = E[g(x)]$, is a vector whose $i$th component, $i = 1, \dots, h$, is $E[y_i] = E[g_i(x)]$, defined by either

$$\sum_{x} g_i(x)\, p_x(x) \qquad \text{or} \qquad \int_X g_i(x)\, p_x(x)\, dx,$$

for the discrete and absolutely continuous cases, respectively, where all the equalities are to be interpreted in the sense that if either side exists, so does the other, and they are equal.
In particular, the expectation vector with components $E[x_1], \dots, E[x_k]$, $E[x]$, is also called the mean vector. As in the case of a single random quantity, the forms defined by

$$E[x_1^{n_1} \cdots x_k^{n_k}]$$

are called the moments of $x$ of order $n = n_1 + \cdots + n_k$. Important special cases include the first-order moments, $E[x_i]$, $i = 1, \dots, k$, and the second-order moments, $E[x_i x_j]$, $1 \le i, j \le k$. The covariance between $x_i$ and $x_j$ is defined by

$$C[x_i, x_j] = E[(x_i - E[x_i])(x_j - E[x_j])], \qquad 1 \le i \ne j \le k,$$

and the correlation by

$$R[x_i, x_j] = \frac{C[x_i, x_j]}{(V[x_i]\, V[x_j])^{1/2}}\,.$$

The Cauchy-Schwarz inequality establishes that $R[x_i, x_j]^2 \le 1$. If $x$ is a random vector in $\Re^k$, the $k \times k$ matrix with $(i, i)$th entry given by $V[x_i]$, $i = 1, \dots, k$, and $(i, j)$th element $C[x_i, x_j]$, $1 \le i \ne j \le k$, is called the covariance matrix, $V[x]$, of the random vector. If the components of $x$ are independent random quantities, then $V[x]$ reduces to a diagonal matrix. Moreover, if $x_1, \dots, x_k$ are independent, then

$$E[x_1 x_2 \cdots x_k] = \prod_{i=1}^{k} E[x_i].$$

As in the case of a single random quantity, exact forms for moments of an arbitrary transformation, $y = g(x)$, are not available. We shall not need very general results in this area, but the following will occasionally prove useful.

Proposition 3.5. (Approximate mean and covariance). If $x$ is a random vector in $\Re^k$ with $E[x] = \mu$, $V[x] = \Sigma$, and $y = g(x)$ is a one-to-one transformation of $x$ such that $J_g$ exists, then, subject to conditions on the distribution of $x$ and on the smoothness of $g$,

$$E[y_i] = E[(g(x))_i] \approx g_i(\mu) + \tfrac{1}{2}\, \mathrm{tr}\left[\Sigma\, \nabla^2 g_i(\mu)\right],$$

$$V[g(x)] \approx J_g(\mu)\, \Sigma\, J_g(\mu)^t,$$

where $\mathrm{tr}[\cdot]$ denotes the trace of a matrix argument, $\nabla^2 g_i$ is the matrix of second derivatives of $g_i$, and $J_g$ the Jacobian matrix of $g$.

Proof. This follows straightforwardly from a multivariate Taylor expansion; the details are tedious and we will omit them here.
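As a numerical aside (not in the original text), Proposition 3.5 can be checked by Monte Carlo for a simple transformation whose derivatives are easy to write down; the transformation $g(x) = (x_1 x_2,\; x_1 + x_2^2)$, the mean vector and the covariance matrix below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([2.0, 3.0])
Sigma = np.array([[0.04, 0.01],
                  [0.01, 0.09]])

def g(x):
    # g(x) = (x1*x2, x1 + x2^2), vectorised over leading axes.
    return np.stack([x[..., 0] * x[..., 1], x[..., 0] + x[..., 1] ** 2], axis=-1)

# Jacobian and Hessians of g at mu, worked out analytically.
J = np.array([[mu[1], mu[0]],
              [1.0, 2.0 * mu[1]]])
H1 = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hessian of g1 = x1*x2
H2 = np.array([[0.0, 0.0], [0.0, 2.0]])   # Hessian of g2 = x1 + x2^2

approx_mean = g(mu) + 0.5 * np.array([np.trace(Sigma @ H1), np.trace(Sigma @ H2)])
approx_cov = J @ Sigma @ J.T

# Monte Carlo comparison under a normal distribution for x.
x = rng.multivariate_normal(mu, Sigma, size=400_000)
mc_mean = g(x).mean(axis=0)
mc_cov = np.cov(g(x), rowvar=False)
```

Because $g$ here is quadratic, the trace correction to the mean is exact under normality, while the covariance approximation carries a small higher-order error.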
A note on measure theory. Readers familiar with measure theory will be aware that there are many subtle steps in passing to a density representation of the probability measure $P_x$. In particular, a detailed rigorous treatment of densities (Radon-Nikodym derivatives) requires statements about dominating measures and comments on the versions assumed for such densities. Readers unfamiliar with measure theory will already have assumed, correctly!, that we shall almost always be dealing with the "standard" versions of probability mass functions and densities (corresponding to counting and Lebesgue dominating measures and "smoothly" defined). Only occasionally, in Chapter 4, do we refer to general, non-density, forms.

3.2.5 Some Particular Multivariate Distributions

We conclude our review of probability theory with a selection of the more frequently used multivariate probability distributions. As in Section 3.2.2, no very detailed discussion will be given: see, for example, Wilks (1962), Johnson and Kotz (1969, 1972) and DeGroot (1970) for further information.

The Multinomial Distribution

A discrete random vector $x = (x_1, \dots, x_k)$ has a multinomial distribution of dimension $k$, with parameters $\theta = (\theta_1, \dots, \theta_k)$ and $n$ ($0 < \theta_i < 1$, $\sum_{i=1}^{k} \theta_i \le 1$, $n = 1, 2, \dots$) if its probability function $\mathrm{Mu}_k(x \mid \theta, n)$ is

$$\mathrm{Mu}_k(x \mid \theta, n) = c \prod_{i=1}^{k} \theta_i^{x_i} \left(1 - \sum_{i=1}^{k} \theta_i\right)^{n - \sum_{i=1}^{k} x_i},$$

for $x_i = 0, 1, \dots, n$ and $\sum_{i=1}^{k} x_i \le n$, where

$$c = \frac{n!}{x_1! \cdots x_k!\, \left(n - \sum_{i=1}^{k} x_i\right)!}\,.$$

The mean vector and covariance matrix are given by

$$E[x_i] = n\theta_i, \qquad V[x_i] = n\theta_i(1-\theta_i), \qquad C[x_i, x_j] = -n\theta_i\theta_j, \quad i \ne j.$$

The mode(s) of the distribution is (are) located near $E[x]$, conditions on $\theta$ and $n$ restricting the possible modes to relatively few points satisfying these inequalities.
If $k = 1$, $\mathrm{Mu}_1(x \mid \theta, n)$ reduces to the binomial density $\mathrm{Bi}(x \mid \theta, n)$. In the general case, the marginal distribution of a subvector $x^{(m)} = (x_1, \dots, x_m)$, $m < k$, is multinomial, $\mathrm{Mu}_m(x^{(m)} \mid \theta^{(m)}, n)$, with $\theta^{(m)} = (\theta_1, \dots, \theta_m)$. The conditional distribution of $x^{(m)}$ given the remaining $x_{m+1}, \dots, x_k$ is also multinomial, with appropriately renormalised parameters and index $n - (x_{m+1} + \cdots + x_k)$. If $x$ is the sum of $m$ independent random vectors having multinomial densities with parameters $(\theta, n_i)$, $i = 1, \dots, m$, then $x$ also has a multinomial density, with parameters $\theta$ and $(n_1 + \cdots + n_m)$. Moreover, if $x$ has density $\mathrm{Mu}_k(x \mid \theta, n)$ and $y = (y_1, \dots, y_l)$, $l < k$, is formed by summing disjoint subsets of the $x_i$'s, then $y$ has density $\mathrm{Mu}_l(y \mid \phi, n)$, where each $\phi_j$ is the sum of the $\theta_i$'s corresponding to $y_j$; in particular, the distribution of any such sum $s$ of components is binomial, $\mathrm{Bi}(s \mid \phi, n)$, and depends on the components only through their sum. If $x_1, \dots, x_{k+1}$ are $k+1$ independent Poisson random quantities with densities $\mathrm{Pn}(x_i \mid \lambda_i)$, then the joint distribution of $x = (x_1, \dots, x_k)$, given $x_1 + \cdots + x_{k+1} = n$, is the multinomial $\mathrm{Mu}_k(x \mid \theta, n)$, with $\theta_i = \lambda_i/(\lambda_1 + \cdots + \lambda_{k+1})$.

The Dirichlet Distribution

A continuous random vector $x = (x_1, \dots, x_k)$ has a Dirichlet distribution of dimension $k$, with parameters $\alpha = (\alpha_1, \dots, \alpha_{k+1})$ ($\alpha_i > 0$, $i = 1, \dots, k+1$) if its probability density $\mathrm{Di}_k(x \mid \alpha)$ is

$$\mathrm{Di}_k(x \mid \alpha) = c \prod_{i=1}^{k} x_i^{\alpha_i - 1} \left(1 - \sum_{j=1}^{k} x_j\right)^{\alpha_{k+1} - 1},$$

for $0 < x_i < 1$ and $\sum_{i=1}^{k} x_i < 1$, where

$$c = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\prod_{i=1}^{k+1} \Gamma(\alpha_i)}\,.$$

If $k = 1$, $\mathrm{Di}_1(x \mid \alpha)$ reduces to the beta density $\mathrm{Be}(x \mid \alpha_1, \alpha_2)$. Writing $\alpha_0 = \sum_{j=1}^{k+1} \alpha_j$, the mean vector and covariance matrix are given by

$$E[x_i] = \frac{\alpha_i}{\alpha_0}\,, \qquad V[x_i] = \frac{\alpha_i(\alpha_0 - \alpha_i)}{\alpha_0^2(\alpha_0 + 1)}\,, \qquad C[x_i, x_j] = -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0 + 1)}\,, \quad i \ne j.$$
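As a numerical aside (not in the original text), the Poisson characterisation of the multinomial can be checked by simulation: drawing independent Poisson counts and retaining only the draws whose total equals $n$ should reproduce multinomial behaviour with $\theta_i = \lambda_i/\sum_j \lambda_j$. The rates, target total and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
lam = np.array([1.0, 2.0, 3.0])
n_target = 6

# Independent Poisson draws; condition on the total equalling n_target.
draws = rng.poisson(lam, size=(400_000, 3))
kept = draws[draws.sum(axis=1) == n_target]

# Conditional cell means should be n * theta_i with theta_i = lam_i / sum(lam).
theta = lam / lam.sum()
emp_mean = kept[:, :2].mean(axis=0)
expected_mean = n_target * theta[:2]
```

The acceptance rate of the conditioning step is the Poisson probability that the total (itself Poisson with rate $\sum_i \lambda_i$) equals $n$.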
If $\alpha_i > 1$, $i = 1, \dots, k+1$, there is a unique mode given by

$$M[x_i] = \frac{\alpha_i - 1}{\sum_{j=1}^{k+1} \alpha_j - (k+1)}\,, \qquad i = 1, \dots, k.$$

The marginal distribution of a subvector $x^{(m)} = (x_1, \dots, x_m)$, $m < k$, is the Dirichlet $\mathrm{Di}_m(x^{(m)} \mid (\alpha_1, \dots, \alpha_m, \alpha^*))$, with $\alpha^* = \sum_{j=m+1}^{k+1} \alpha_j$; in particular, the marginal distribution of any single component $x_i$ is the beta $\mathrm{Be}(x_i \mid \alpha_i, \alpha_0 - \alpha_i)$. The conditional distribution, given $x_{m+1}, \dots, x_k$, of the appropriately rescaled remaining components is also Dirichlet. Moreover, if $x = (x_1, \dots, x_k)$ has density $\mathrm{Di}_k(x \mid \alpha)$ and $y = (y_1, \dots, y_l)$, $l < k$, is formed by summing disjoint subsets of the $x_i$'s, then $y$ has density $\mathrm{Di}_l(y \mid \phi)$, where each component of $\phi$ is the corresponding sum of $\alpha_i$'s.

The Multinomial-Dirichlet Distribution

A discrete random vector $x = (x_1, \dots, x_k)$ has a multinomial-Dirichlet distribution of dimension $k$, with parameters $\alpha = (\alpha_1, \dots, \alpha_{k+1})$ and $n$ ($\alpha_i > 0$, $i = 1, \dots, k+1$, $n = 1, 2, \dots$) if its probability function $\mathrm{Md}_k(x \mid \alpha, n)$ is

$$\mathrm{Md}_k(x \mid \alpha, n) = c \prod_{i=1}^{k+1} \frac{\alpha_i^{[x_i]}}{x_i!}\,,$$

where $x_i = 0, 1, \dots, n$, with $\sum_{i=1}^{k} x_i \le n$ and $x_{k+1} = n - \sum_{i=1}^{k} x_i$,

$$c = \frac{n!}{\left(\sum_{i=1}^{k+1} \alpha_i\right)^{[n]}}\,,$$

and $\alpha^{[s]} = \alpha(\alpha+1)\cdots(\alpha+s-1)$, $\alpha^{[0]} = 1$, defines the ascending factorial function. The distribution is generated by the mixture

$$\mathrm{Md}_k(x \mid \alpha, n) = \int \mathrm{Mu}_k(x \mid \theta, n)\, \mathrm{Di}_k(\theta \mid \alpha)\, d\theta.$$
The mean vector and covariance matrix are given by
$$E[x_i] = n\,\frac{\alpha_i}{A}\,, \qquad V[x_i] = n\,\frac{\alpha_i(A-\alpha_i)}{A^2}\,\frac{A+n}{A+1}\,, \qquad C[x_i,x_j] = -\,n\,\frac{\alpha_i\alpha_j}{A^2}\,\frac{A+n}{A+1}\,,$$
where $A = \sum_{j=1}^{k+1}\alpha_j$. The marginal distribution of the subset $\{x_1,\dots,x_l\}$, $l < k$, is also multinomial-Dirichlet, with parameters $\{\alpha_1,\dots,\alpha_l,\ \sum_{j=l+1}^{k+1}\alpha_j\}$ and $n$; moreover, the conditional distribution of $\{x_{l+1},\dots,x_k\}$ given $\{x_1,\dots,x_l\}$ is multinomial-Dirichlet, with parameters $\{\alpha_{l+1},\dots,\alpha_{k+1}\}$ and $n - (x_1 + \cdots + x_l)$. For an interesting characterisation of this distribution, see Basu and Pereira (1983).

The Normal-Gamma Distribution. A continuous bivariate random vector $(x, y)$ has a normal-gamma distribution, with parameters $\mu$, $\lambda$, $\alpha$ and $\beta$ ($\mu \in \Re$, $\lambda > 0$, $\alpha > 0$, $\beta > 0$), if its density $\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta)$ is
$$\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta) = \mathrm{N}(x \mid \mu, \lambda y)\,\mathrm{Ga}(y \mid \alpha, \beta),$$
where the normal and gamma densities are as defined earlier. It is clear from the definition that the conditional density of $x$ given $y$ is $\mathrm{N}(x \mid \mu, \lambda y)$ and that the marginal density of $y$ is $\mathrm{Ga}(y \mid \alpha, \beta)$. Moreover, the marginal density of $x$ is $\mathrm{St}(x \mid \mu, \lambda\alpha/\beta, 2\alpha)$. The shape of a normal-gamma distribution is illustrated in Figure 3.1, where a normal-gamma density is displayed both as a surface and in terms of equal density contours.

The Multivariate Normal Distribution. A continuous random vector $x = (x_1,\dots,x_k)$ has a multivariate normal distribution of dimension $k$, with parameters $\mu = (\mu_1,\dots,\mu_k)$ and $\lambda$, where $\mu \in \Re^k$ and $\lambda$ is a $k \times k$ symmetric positive-definite matrix, if its probability density $\mathrm{N}_k(x \mid \mu, \lambda)$ is
$$\mathrm{N}_k(x \mid \mu, \lambda) = c \exp\bigl\{-\tfrac{1}{2}(x-\mu)^t\lambda(x-\mu)\bigr\}, \qquad x \in \Re^k,$$
Figure 3.1 A normal-gamma density $\mathrm{Ng}(x, y \mid \mu, \lambda, \alpha, \beta)$, displayed both as a surface and in terms of equal density contours

where $c = (2\pi)^{-k/2}\,|\lambda|^{1/2}$. If $k = 1$, $\mathrm{N}_k(x \mid \mu, \lambda)$ reduces to the univariate normal density $\mathrm{N}(x \mid \mu, \lambda)$, so that $\lambda$ is a scalar. In the general case, $E[x] = \mu$ and $V[x] = \lambda^{-1}$, and, with $\Sigma = \lambda^{-1}$ of general element $\sigma_{ij}$, $E[x_i] = \mu_i$, $V[x_i] = \sigma_{ii}$ and $C[x_i, x_j] = \sigma_{ij}$. The parameter $\mu$ therefore labels the mean vector and the parameter $\lambda$ the precision matrix (the inverse of the covariance matrix).
If $y = Ax$, where $A$ is an $m \times k$ matrix of real numbers such that $A\Sigma A^t$ is nonsingular, then $y$ has density $\mathrm{N}_m\bigl(y \mid A\mu, (A\Sigma A^t)^{-1}\bigr)$. Moreover, the marginal density for any subvector of $x$ is (multivariate) normal, with mean vector and covariance matrix given by the corresponding subvector of $\mu$ and submatrix of $\Sigma = \lambda^{-1}$. In particular, if $x = (x_1, x_2)$ is a partition of $x$, with $x_i$ of dimension $k_i$ and $k_1 + k_2 = k$, and if the corresponding partitions of $\mu$ and $\lambda$ are
$$\mu = (\mu_1, \mu_2), \qquad \lambda = \begin{pmatrix}\lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\end{pmatrix},$$
then the conditional density of $x_1$ given $x_2$ is also (multivariate) normal, of dimension $k_1$, with mean vector and precision matrix given, respectively, by
$$\mu_1 - \lambda_{11}^{-1}\lambda_{12}(x_2 - \mu_2) \qquad \text{and} \qquad \lambda_{11}.$$
The random quantity $y = (x - \mu)^t\lambda(x - \mu)$ has a $\chi^2(y \mid k)$ density. We also note that, from the form of the multivariate normal density, we can deduce the integral formula
$$\int_{\Re^k} \exp\bigl\{-\tfrac{1}{2}(x-\mu)^t\lambda(x-\mu)\bigr\}\, dx = (2\pi)^{k/2}\,|\lambda|^{-1/2}.$$

The Wishart Distribution. A symmetric, positive-definite matrix $x$ of random quantities $x_{ij}$, $i, j = 1,\dots,k$, has a Wishart distribution of dimension $k$, with parameters $\alpha$ and $\beta$ (with $2\alpha > k - 1$ and $\beta$ a $k \times k$ symmetric, nonsingular matrix), if the density $\mathrm{Wi}_k(x \mid \alpha, \beta)$ of the $k(k+1)/2$-dimensional random vector of the distinct entries of $x$ is
$$\mathrm{Wi}_k(x \mid \alpha, \beta) = c\,|x|^{\alpha - (k+1)/2}\exp\{-\mathrm{tr}(\beta x)\}, \qquad c = \frac{|\beta|^\alpha}{\Gamma_k(\alpha)}\,,$$
where
$$\Gamma_k(\alpha) = \pi^{k(k-1)/4}\prod_{i=1}^{k}\Gamma\Bigl(\frac{2\alpha + 1 - i}{2}\Bigr)$$
is the generalised gamma function and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix argument. If $k = 1$, $\mathrm{Wi}_k(x \mid \alpha, \beta)$ reduces to the gamma density $\mathrm{Ga}(x \mid \alpha, \beta)$, so that $\beta$ is a scalar.
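The normal conditioning identities above can be checked against the more familiar covariance-based computation; the bivariate sketch below (our own illustration, with invented numbers and hypothetical function names) confirms that both routes give the same conditional mean and precision:

```python
def mvn_conditional_precision(mu, lam, x2):
    """Conditional x1 | x2 for a bivariate normal in precision form:
    mean mu1 - (lam12/lam11)(x2 - mu2), precision lam11."""
    (mu1, mu2), ((l11, l12), (_, _)) = mu, lam
    return mu1 - (l12 / l11) * (x2 - mu2), l11

def mvn_conditional_covariance(mu, lam, x2):
    """Same conditional, computed the long way through Sigma = lam^{-1}:
    mean mu1 + (s12/s22)(x2 - mu2), variance s11 - s12^2/s22."""
    (mu1, mu2), ((l11, l12), (l21, l22)) = mu, lam
    det = l11 * l22 - l12 * l21
    s11, s12, s22 = l22 / det, -l12 / det, l11 / det
    m = mu1 + (s12 / s22) * (x2 - mu2)
    v = s11 - s12 * s12 / s22
    return m, 1.0 / v  # return the precision, for comparison

mu, lam, x2 = (1.0, -1.0), ((2.0, 0.6), (0.6, 1.0)), 0.5
print(mvn_conditional_precision(mu, lam, x2))   # (0.55, 2.0)
print(mvn_conditional_covariance(mu, lam, x2))  # agrees, up to rounding
```

The agreement is exact in principle, since $\sigma_{12}/\sigma_{22} = -\lambda_{12}/\lambda_{11}$ and $\sigma_{11} - \sigma_{12}^2/\sigma_{22} = \lambda_{11}^{-1}$ in the bivariate case.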
If $\{x_1,\dots,x_n\}$ is a random sample of size $n > 1$ from a multivariate normal $\mathrm{N}_k(x \mid \mu, \lambda)$, and
$$\bar{x} = n^{-1}\sum_{i=1}^{n} x_i\,, \qquad S = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^t,$$
then $\bar{x}$ is independent of $S$, $\bar{x}$ has density $\mathrm{N}_k(\bar{x} \mid \mu, n\lambda)$, and $S$ has a Wishart distribution, $\mathrm{Wi}_k\bigl(S \mid \tfrac{1}{2}(n-1), \tfrac{1}{2}\lambda\bigr)$.

The following properties of the Wishart distribution are easily established: $E[x] = \alpha\beta^{-1}$ and $E[x^{-1}] = \bigl(\alpha - (k+1)/2\bigr)^{-1}\beta$; if $x_1,\dots,x_n$ are independent $k \times k$ random matrices, each with a Wishart distribution, with parameters $\alpha_i$ and $\beta$, then $x_1 + \cdots + x_n$ has a Wishart distribution with parameters $\alpha_1 + \cdots + \alpha_n$ and $\beta$. Moreover, if $y = AxA^t$, where $A$ is an $m \times k$ matrix ($m \le k$) of real numbers such that $A\beta^{-1}A^t$ is nonsingular, then $y$ has a Wishart distribution of dimension $m$ with parameters $\alpha$ and $(A\beta^{-1}A^t)^{-1}$; in particular, if $x$ and $\beta$ conformably partition into
$$x = \begin{pmatrix}x_{11} & x_{12}\\ x_{21} & x_{22}\end{pmatrix}, \qquad \beta = \begin{pmatrix}\beta_{11} & \beta_{12}\\ \beta_{21} & \beta_{22}\end{pmatrix},$$
where $x_{11}$, $\beta_{11}$ are square $h \times h$ matrices ($1 \le h < k$), then $x_{11}$ has a Wishart distribution of dimension $h$ with parameters $\alpha$ and $\bigl((\beta^{-1})_{11}\bigr)^{-1}$. We also note that, from the form of the Wishart density, we can deduce the integral formula
$$\int |x|^{\alpha - (k+1)/2}\exp\{-\mathrm{tr}(\beta x)\}\, dx = \Gamma_k(\alpha)\,|\beta|^{-\alpha},$$
the integration being understood to be with respect to the $k(k+1)/2$ distinct elements of the matrix $x$.

The Multivariate Student Distribution. A continuous random vector $x = (x_1,\dots,x_k)$ has a multivariate Student distribution of dimension $k$, with parameters $\mu$, $\lambda$ and $\alpha$ ($\mu \in \Re^k$, $\lambda$ a symmetric, positive-definite $k \times k$ matrix, $\alpha > 0$), if its probability density $\mathrm{St}_k(x \mid \mu, \lambda, \alpha)$ is
$$\mathrm{St}_k(x \mid \mu, \lambda, \alpha) = c\Bigl[1 + \frac{1}{\alpha}(x-\mu)^t\lambda(x-\mu)\Bigr]^{-(\alpha+k)/2}, \qquad x \in \Re^k,$$
where
$$c = \frac{\Gamma\bigl(\tfrac{1}{2}(\alpha+k)\bigr)}{\Gamma\bigl(\tfrac{1}{2}\alpha\bigr)(\alpha\pi)^{k/2}}\,|\lambda|^{1/2}.$$
If $k = 1$, $\mathrm{St}_k(x \mid \mu, \lambda, \alpha)$ reduces to the univariate Student density $\mathrm{St}(x \mid \mu, \lambda, \alpha)$, so that $\lambda$ is a scalar. In the general case, $E[x] = \mu$ and $V[x] = \lambda^{-1}\bigl(\alpha/(\alpha-2)\bigr)$, if the latter exists. Although not exactly equal to the inverse of the covariance matrix, the parameter $\lambda$ is often referred to as the precision matrix of the distribution.
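The normalising constant $c$ of the Student density can be spot-checked by crude numerical integration in the $k = 1$ case; this is our own sketch, with an arbitrarily chosen quadrature step and integration range:

```python
from math import gamma, pi, sqrt

def student_pdf(x, mu, lam, a):
    """Univariate Student density St(x | mu, lam, a) in the precision
    parametrisation used here (lam is a precision, not a scale)."""
    c = gamma((a + 1.0) / 2.0) / (gamma(a / 2.0) * sqrt(a * pi)) * sqrt(lam)
    return c * (1.0 + lam * (x - mu) ** 2 / a) ** (-(a + 1.0) / 2.0)

# crude Riemann-sum check that the density integrates to ~1
mu, lam, a = 0.5, 2.0, 5.0
h, lo, hi = 0.001, -60.0, 60.0
n = int((hi - lo) / h)
total = sum(student_pdf(lo + i * h, mu, lam, a) for i in range(n + 1)) * h
print(total)  # ~ 1.0
```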
If $y = Ax$, where $A$ is an $m \times k$ matrix ($m \le k$) of real numbers such that $A\lambda^{-1}A^t$ is nonsingular, then $y$ has density $\mathrm{St}_m\bigl(y \mid A\mu, (A\lambda^{-1}A^t)^{-1}, \alpha\bigr)$. Moreover, the marginal density of any subvector of $x$ is (multivariate) Student, with mean vector and inverse of the precision matrix given by the corresponding subvector of $\mu$ and submatrix of $\lambda^{-1}$. In particular, if $x = (x_1, x_2)$ is a partition of $x$ and the corresponding partitions of $\mu$ and $\lambda$ are
$$\mu = (\mu_1, \mu_2), \qquad \lambda = \begin{pmatrix}\lambda_{11} & \lambda_{12}\\ \lambda_{21} & \lambda_{22}\end{pmatrix},$$
then the conditional density of $x_1$, given $x_2$, is also (multivariate) Student, of dimension $k_1$, with $\alpha + k_2$ degrees of freedom, and mean vector and precision matrix given, respectively, by
$$\mu_1 - \lambda_{11}^{-1}\lambda_{12}(x_2 - \mu_2)$$
and
$$\frac{\alpha + k_2}{\alpha + (x_2 - \mu_2)^t(\lambda_{22} - \lambda_{21}\lambda_{11}^{-1}\lambda_{12})(x_2 - \mu_2)}\;\lambda_{11}.$$
The random quantity $y = k^{-1}(x - \mu)^t\lambda(x - \mu)$ has an $\mathrm{Fs}(y \mid k, \alpha)$ density.

The Multivariate Normal-Gamma Distribution. A continuous random vector $x = (x_1,\dots,x_k)$ and a random quantity $y$ have a joint multivariate normal-gamma distribution of dimension $k$, with parameters $\mu$, $\lambda$, $\alpha$ and $\beta$ ($\mu \in \Re^k$, $\lambda$ a $k \times k$ symmetric, positive-definite matrix, $\alpha > 0$, $\beta > 0$), if the joint probability density of $x$ and $y$, $\mathrm{Ng}_k(x, y \mid \mu, \lambda, \alpha, \beta)$, is
$$\mathrm{Ng}_k(x, y \mid \mu, \lambda, \alpha, \beta) = \mathrm{N}_k(x \mid \mu, \lambda y)\,\mathrm{Ga}(y \mid \alpha, \beta),$$
where the multivariate normal and gamma densities have already been defined. From the definition, the conditional density of $x$ given $y$ is $\mathrm{N}_k(x \mid \mu, \lambda y)$ and the marginal density of $y$ is $\mathrm{Ga}(y \mid \alpha, \beta)$. Moreover, the marginal density of $x$ is $\mathrm{St}_k\bigl(x \mid \mu, (\alpha/\beta)\lambda, 2\alpha\bigr)$.

The Multivariate Normal-Wishart Distribution. A continuous random vector $x$ and a symmetric, positive-definite $k \times k$ matrix of random quantities $y$ have a joint normal-Wishart distribution of dimension $k$, with parameters $\mu$, $\lambda$, $\alpha$ and $\beta$ ($\mu \in \Re^k$, $\lambda > 0$, $2\alpha > k - 1$, $\beta$ a $k \times k$ symmetric, nonsingular matrix), if the probability density of $x$ and the $k(k+1)/2$ distinct elements of $y$, $\mathrm{Nw}_k(x, y \mid \mu, \lambda, \alpha, \beta)$, is
$$\mathrm{Nw}_k(x, y \mid \mu, \lambda, \alpha, \beta) = \mathrm{N}_k(x \mid \mu, \lambda y)\,\mathrm{Wi}_k(y \mid \alpha, \beta),$$
where the multivariate normal and Wishart densities are as defined above. From the definition, the conditional density of $x$ given $y$ is $\mathrm{N}_k(x \mid \mu, \lambda y)$ and the marginal density of $y$ is $\mathrm{Wi}_k(y \mid \alpha, \beta)$. Moreover, the marginal density of $x$ is $\mathrm{St}_k\bigl(x \mid \mu, \lambda\bigl(\alpha - \tfrac{1}{2}(k-1)\bigr)\beta^{-1}, 2\alpha - k + 1\bigr)$.
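The claim that the normal-gamma marginal of $x$ is Student can be verified numerically by integrating out $y$; this is our own sketch, with arbitrarily chosen parameter values and a crude quadrature:

```python
from math import gamma, exp, pi, sqrt

def normal_pdf(x, mu, lam):
    """Normal density N(x | mu, lam), lam a precision."""
    return sqrt(lam / (2.0 * pi)) * exp(-0.5 * lam * (x - mu) ** 2)

def gamma_pdf(y, a, b):
    """Gamma density Ga(y | a, b)."""
    return b ** a / gamma(a) * y ** (a - 1.0) * exp(-b * y)

def student_pdf(x, mu, lam, a):
    """Student density St(x | mu, lam, a), lam a precision."""
    c = gamma((a + 1.0) / 2.0) / (gamma(a / 2.0) * sqrt(a * pi)) * sqrt(lam)
    return c * (1.0 + lam * (x - mu) ** 2 / a) ** (-(a + 1.0) / 2.0)

mu, lam, a, b = 0.0, 2.0, 3.0, 1.5
x = 0.7

# integrate Ng(x, y) = N(x | mu, lam*y) Ga(y | a, b) over y by Riemann sum
h, hi = 0.0005, 40.0
marginal = sum(normal_pdf(x, mu, lam * y) * gamma_pdf(y, a, b) * h
               for y in (h * i for i in range(1, int(hi / h))))
print(marginal)                                   # numerical marginal of x
print(student_pdf(x, mu, lam * a / b, 2.0 * a))   # St(x | mu, lam*a/b, 2a)
```

The two printed values agree to quadrature accuracy, as the stated marginal requires.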
The Bilateral Pareto Distribution. A continuous bivariate random vector $(x, y)$ has a bilateral Pareto distribution, with parameters $\beta_0$, $\beta_1$ and $\alpha$ ($\beta_0 < \beta_1$, $\alpha > 0$), if its density function $\mathrm{Pa}_2(x, y \mid \alpha, \beta_0, \beta_1)$, defined for $x < \beta_0$ and $y > \beta_1$, is
$$\mathrm{Pa}_2(x, y \mid \alpha, \beta_0, \beta_1) = c\,(y - x)^{-(\alpha+2)}, \qquad c = \alpha(\alpha+1)(\beta_1 - \beta_0)^\alpha.$$
The marginal distributions of $t_1 = \beta_1 - x$ and $t_2 = y - \beta_0$ are both $\mathrm{Pa}(t \mid \alpha, \beta_1 - \beta_0)$. The mean and variance are given by
$$E[x] = \beta_0 - \frac{\beta_1 - \beta_0}{\alpha - 1}\,, \qquad E[y] = \beta_1 + \frac{\beta_1 - \beta_0}{\alpha - 1}\,, \qquad \text{if } \alpha > 1,$$
$$V[x] = V[y] = \frac{\alpha(\beta_1 - \beta_0)^2}{(\alpha - 1)^2(\alpha - 2)}\,, \qquad \text{if } \alpha > 2,$$
and the correlation between $x$ and $y$ is $-1/\alpha$.

3.3 GENERALISED OPTIONS AND UTILITIES

3.3.1 Motivation and Preliminaries

For reasons of mathematical or descriptive convenience, it is common in statistical decision problems to consider sets of options which consist of part or all of the real line (as in problems of point estimation), or are part of some more general space, going beyond finite, or even countable, frameworks. It is therefore desirable to extend the concepts and results of Chapter 2 to a much more general mathematical setting, first by taking $\mathcal{E}$ to be a $\sigma$-algebra and then suitably extending the fundamental notion of an option. In the finite case, an option was denoted by $a = \{c_j \mid E_j,\ j \in J\}$, with the straightforward interpretation that, if option $a$ is chosen, $c_j$ is the consequence of the occurrence of the event $E_j$. The extension of this definition to infinite settings clearly requires some form of constructive limit process, analogous to that used in Lebesgue measure and integration theory in passing from simple (i.e., "step") functions to more general functions. Since the development given in Chapter 2 led to the assessment of options in terms of their expected utilities, the "natural" definition of limit that suggests itself is one based fundamentally on the expected utility idea (Bernardo, Ferrándiz and Smith, 1985). Let us therefore consider a decision problem $\{\mathcal{E}, C, \mathcal{A}, \le\}$, described by a probability space $\{\Omega, \mathcal{F}, P\}$ and utility function $u: C \to \Re$, and let $\mathcal{D}$ denote the set of functions $d: \Omega \to C$ for which the expectation of $u \circ d$ exists.
In other words, $\mathcal{D}$ consists of those functions (soon to be called decisions) $d: \Omega \to C$ for which $u \circ d = u(d(\cdot))$ is a random quantity whose expectation exists. In the case of the particular subset $\mathcal{A}$ of $\mathcal{D}$ consisting of simple options, $\bar{u}(a) = \bar{u}(a \mid \Omega)$ is precisely the expected utility of the simple option $a$ (see Definition 2.16, with $G = \Omega$; we shall return later to the case of a general conditioning event $G$ and corresponding probability measure $P(\cdot \mid G)$). In all the definitions and propositions in this section, the utility function $u: C \to \Re$ and the probability space $\{\Omega, \mathcal{F}, P\}$ are to be understood as fixed background specifications.

Definition 3.11. (Convergence in expected utility). Given a utility function $u: C \to \Re$, a sequence of functions $d_1, d_2, \dots$ in $\mathcal{D}$ is said to converge in expected utility to a function $d$ in $\mathcal{D}$, written $d_i \to d$, if and only if:
(i) $u \circ d_i \to u \circ d$ almost surely;
(ii) $\bar{u}(d_i)$ converges to $\bar{u}(d)$.

Definition 3.12. (Decisions). For a given utility function $u: C \to \Re$, a function $d \in \mathcal{D}$ is a decision (generalised option) if and only if there exists a sequence $a_1, a_2, \dots$ of simple options such that $a_i \to d$; $\bar{u}(d) = \lim_i \bar{u}(a_i)$ is then called the expected utility of the decision $d$.

Discussion of Definitions 3.11 and 3.12. The fundamental coherence result of Proposition 2.25 was that simple options should be compared in terms of their expected utilities. In order for this to carry over smoothly to decisions (generalised options), it is natural to require a constructive definition of the latter in terms of a limit concept directly expressed in terms of expected utilities; in abstract mathematical terms, the extension from simple functions to more general functions requires precisely some such limit process. As it stands, however, this constructive definition does not provide a straightforward means of checking whether or not, given a specified utility function, a function $d \in \mathcal{D}$ is or is not a generalised option. However, we can prove that any $d \in \mathcal{D}$ such that $u \circ d$ is essentially bounded (i.e., $u \circ d$ is bounded except on a subset of $\Omega$ of $P$-measure zero) is a decision. More specifically, we can prove the following.

Proposition 3.6. For a given utility function $u: C \to \Re$, any function $d \in \mathcal{D}$ such that $u \circ d$ is essentially bounded is a decision.
Proof. We prove first that, if $u \circ d$ is essentially bounded above, then there exists a sequence of simple options $a_1, a_2, \dots$ such that $a_i \to d$ and, for all $i$, $\bar{u}(a_i) \ge \bar{u}(d)$. (An exactly parallel proof exists if "above" is replaced by "below" and $\ge$ by $\le$.) For each $i$, we begin by defining a partition $\{E_{ij},\ j \in J_i\}$ of $\Omega$, in such a way that two extreme events contain the outcomes with values of $u \circ d(\omega)$ below $-2^i$ or above $2^i$, whereas the other events contain outcomes whose values of $u \circ d(\omega)$ do not differ by more than $2^{-i}$; this establishes a partition of $\Omega$ into $2(2^{2i} + 1)$ events. We now define a sequence $\{a_i\}$ of simple options, $a_i = \{c_{ij} \mid E_{ij},\ j \in J_i\}$, such that:

(i) if $P(E_{ij}) = 0$, then $c_{ij}$ is an arbitrary element of $C$;
(ii) if $P(E_{ij}) > 0$, then $c_{ij} \in d(E_{ij})$ is chosen so that $u(c_{ij}) \ge \bar{u}_{ij}(d)$, the expected utility of $d$ conditional on $E_{ij}$.

To see that the $c_{ij}$ exist and are well defined, note that if, for all $\omega \in E_{ij}$, we had $u(d(\omega)) < \bar{u}_{ij}(d)$, then, taking expectations conditional on $E_{ij}$, we would have $\bar{u}_{ij}(d) < \bar{u}_{ij}(d)$, thus contradicting the definition of $\bar{u}_{ij}(d)$. By construction, $u \circ a_i \to u \circ d$ almost surely, since, for all $\varepsilon > 0$, there exists $i$ with $2^{-i} < \varepsilon$ such that, for almost all $\omega$ outside the extreme events, $|u(a_i(\omega)) - u(d(\omega))| < \varepsilon$; moreover, $\bar{u}(a_i) \ge \bar{u}(d)$ for all $i$. To show that $\bar{u}(a_i) \to \bar{u}(d)$, write $M_i = \{\omega \in \Omega;\ |u(d(\omega))| \ge 2^i\}$ and note that, since $\bar{u}(d)$ exists, $\int_{M_i} |u \circ d|\, dP$ converges to zero as $i \to \infty$, which establishes the required convergence. ⊲

In fact, whether or not $u \circ d$ is essentially bounded, any decision can be obtained as the limit of "bounding sequences of simple options", in the sense made precise in the following.

Proposition 3.7. (Bounding sequences of simple options). Given a utility function $u: C \to \Re$, for any decision $d \in \mathcal{D}$ there exist sequences $a_1, a_2, \dots$ and $a'_1, a'_2, \dots$ of simple options such that $a_i \to d$, $a'_i \to d$ and, for all $i$, $\bar{u}(a_i) \le \bar{u}(d) \le \bar{u}(a'_i)$.

Proof. We first note that either $u \circ d$ is essentially bounded above or it is not. In the former case, the sequence $a'_1, a'_2, \dots$ is provided directly by the proof of Proposition 3.6. In the latter case, choose a consequence $c \in C$ and an increasing sequence of real numbers $x_1, x_2, \dots$, with $u(c) < x_i$ and $x_i \to \infty$ as $i \to \infty$, and define $A_i = \{\omega \in \Omega;\ u \circ d(\omega) > x_i\}$, so that $P(A_i) \to 0$; let $d_i$ agree with $d$ outside $A_i$ and take the value $c$ on $A_i$. Each $u \circ d_i$ is then essentially bounded above, with $\bar{u}(d_i) \to \bar{u}(d)$, and, by Axiom 4, there exists a sequence of standard events $S_1 \supseteq S_2 \supseteq \cdots$ with $P(S_i) \to 0$ which, combined with the construction in the proof of Proposition 3.6 and a diagonal argument, yields simple options $a'_i \to d$ with $\bar{u}(a'_i) \ge \bar{u}(d)$ for all $i$. The parallel construction with "below" in place of "above" yields the sequence $a_1, a_2, \dots$; since $d$ is a decision, subsequences may be chosen so that both bounding sequences converge to $d$ in expected utility, and the required result follows. ⊲

3.3.2 Generalised Preferences

Given the adoption, for mathematical or descriptive convenience, of the extended framework developed in the previous section, it is natural to require that preferences among simple options should "carry over", under the limit process we have introduced, to the corresponding decisions. This is made precise in the following.

Postulate 2. (Extension of the preference relation). Given a utility function $u: C \to \Re$, for any decisions $d_1, d_2$ and sequences of simple options $\{a_i^{(1)}\}$ and $\{a_i^{(2)}\}$ such that $\{a_i^{(1)}\} \to d_1$ and $\{a_i^{(2)}\} \to d_2$, we have:
(i) if, for all $i > i_0$, for some $i_0$, $a_i^{(1)} \ge a_i^{(2)}$, then $d_1 \ge d_2$;
(ii) if, for all $i > i_0$, for some $i_0$, $a_i^{(1)} \ge a_i^{(2)}$ and, for some $\varepsilon > 0$, $\bar{u}(a_i^{(1)}) \ge \bar{u}(a_i^{(2)}) + \varepsilon$, then $d_1 > d_2$.

The first part of the postulate simply captures the notion of the carry-over of preferences in the limit; the second part of the postulate is an obvious necessary condition for strict preference. Together with our previous axioms, this postulate enables us to establish a very general statement of the identification of quantitative coherence with the principle of maximising expected utility.
Proposition 3.8. (Maximisation of expected utility for decisions). Given any two decisions $d_1$, $d_2$, we have $d_1 \ge d_2$ if and only if $\bar{u}(d_1) \ge \bar{u}(d_2)$.

Proof. We first establish that $\bar{u}(d_2) > \bar{u}(d_1)$ implies that $d_2 > d_1$. By Proposition 3.7, there exist sequences of simple options $a_1, a_2, \dots$ and $a'_1, a'_2, \dots$ such that $a_i \to d_1$, $a'_i \to d_2$ and, for all $i$, $\bar{u}(a_i) \ge \bar{u}(d_1)$ and $\bar{u}(a'_i) \le \bar{u}(d_2)$. With $\varepsilon = \bigl(\bar{u}(d_2) - \bar{u}(d_1)\bigr)/3$, we can choose $i_1$, $i_2$ such that, for all $j > \max\{i_1, i_2\}$, $\bar{u}(a_j) - \bar{u}(d_1) \le \varepsilon$ and $\bar{u}(d_2) - \bar{u}(a'_j) \le \varepsilon$, so that $\bar{u}(a'_j) \ge \bar{u}(a_j) + \varepsilon$. It follows from Proposition 2.25 that, for all such $j$, $a'_j > a_j$, and hence, by Postulate 2, $d_2 > d_1$. To complete the proof ($d_1 \sim d_2 \Rightarrow \bar{u}(d_1) = \bar{u}(d_2)$ being obvious), we must show that $\bar{u}(d_1) = \bar{u}(d_2)$ implies that $d_1 \sim d_2$: by Proposition 3.7 and Proposition 2.25, bounding sequences can be chosen along appropriate subsequences so that $a_i^{(1)} \ge a_i^{(2)}$ and $a_i^{(2)} \ge a_i^{(1)}$ for all $i$, whence, by Postulate 2, $d_1 \ge d_2$ and $d_2 \ge d_1$. ⊲

Throughout the above, the probability measure $P(\cdot)$ can be replaced by $P(\cdot \mid G)$, for any $G > \emptyset$ such that $u \circ d$ is integrable with respect to $P(\cdot \mid G)$, without any basic modifications to the proofs and results. Writing
$$\bar{u}(d \mid G) = \int_\Omega u \circ d(\omega)\, dP(\omega \mid G),$$
we have the following.

Proposition 3.9. Given any two decisions $d_1$, $d_2$ and $G > \emptyset$, $d_1 \ge d_2$ given $G$ if and only if $\bar{u}(d_1 \mid G) \ge \bar{u}(d_2 \mid G)$.

Proof. It is easily verified that $P(G)\,\bar{u}(d \mid G) = \bar{u}(d\,1_G)$, where $1_G$ is the indicator function of $G$; the result then follows from Proposition 3.8. ⊲
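The simple-option approximation that drives Propositions 3.6 and 3.7 can be sketched numerically: discretising the utility values of a bounded decision on a grid of width $2^{-i}$ perturbs the expected utility by at most $2^{-i}$. The example below is our own illustration (it floors utility values to the grid rather than reproducing the book's exact construction), with $\Omega = [0, 1]$ and $P$ uniform:

```python
from math import sin

def f(w):
    """A bounded composition u(d(w)) on Omega = [0, 1]."""
    return sin(3.0 * w)

def expected(g, n=100000):
    # midpoint approximation to the expectation under the uniform measure
    return sum(g((j + 0.5) / n) for j in range(n)) / n

def simple_approximation(i):
    """Replace u(d(w)) by the largest grid point of width 2^-i below it:
    a 'simple option' taking finitely many utility values."""
    width = 2.0 ** (-i)
    return lambda w: width * (f(w) // width)

u_d = expected(f)
for i in (1, 3, 6, 10):
    gap = u_d - expected(simple_approximation(i))
    print(i, gap)  # 0 <= gap < 2^-i, so the expected utilities converge
```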
This establishes in full generality that the decision criterion of maximising expected utility is the only criterion compatible with our intuitive set of quantitative coherence axioms and the natural mathematical extensions encapsulated in Postulates 1 and 2.

3.3.3 The Value of Information

For the general decision problem, we have shown that the optimal decision is that which maximises the expected utility
$$\bar{u}(d) = \int_\Omega u(d(\omega))\, p(\omega)\, d\omega,$$
where $\omega \in \Omega$ labels the uncertain outcomes associated with the problem, $u(d(\omega))$ describes the current preferences among consequences, and $p(\omega)$, the probability density of $P$ with respect to the appropriate dominating measure, describes current beliefs about $\omega$. As we saw in Section 2.6, given a decision problem, it is natural before making a decision to consider trying to reduce the current uncertainty by obtaining further information by experimentation. Whether or not this is sensible obviously depends on the relative costs and benefits of such additional information. We shall now extend the notions related to the value of information, introduced in Section 2.6.3, to the more general mathematical framework established in this chapter.

As in Section 2.6.3, the utility notation is extended to make explicit the possible dependence on the experiment performed, $e$ (or $e_0$ if no data are collected), and the data obtained, $x$, so that the expected utility from an optimal decision with no additional information is defined by
$$\bar{u}(e_0) = \sup_d \int_\Omega u(d, e_0, \omega)\, p(\omega)\, d\omega.$$
Let $d^*_x$ be the optimal decision after experiment $e$ has been performed and data $x$ have been obtained, so that
$$\bar{u}(d^*_x \mid e, x) = \sup_d \int_\Omega u(d, e, x, \omega)\, p(\omega \mid e, x)\, d\omega$$
is the expected utility from the optimal decision given $e$ and $x$. Specifically, the expected utility from the optimal decision following $e$ is
$$\bar{u}(e) = \int \bar{u}(d^*_x \mid e, x)\, p(x \mid e)\, dx.$$
The decision tree for experimental design, given originally in Chapter 2, now takes the form given in Figure 3.2.
Figure 3.2 Generalised decision tree for experimental design

where $p(x \mid e)$ describes beliefs about the occurrence of $x$ were $e$ to be performed. Thus, we have the following.

Proposition 3.10. (Optimal experimental design). The optimal decision is to perform experiment $e^*$ if $\bar{u}(e^*) = \max_e \bar{u}(e)$ and $\bar{u}(e^*) > \bar{u}(e_0)$; and to perform no experiment otherwise.
Proof. This follows immediately. ⊲

Let us now consider the optimal decisions which would be available to us if we knew the value of $\omega$. Thus, let $d^*_\omega$ be the optimal decision given $\omega$; i.e., such that, for all $d$, $u(d^*_\omega, \omega) \ge u(d, \omega)$. The expected value of the information provided by additional data $x$ may be computed as the (posterior) expected difference between the utilities which correspond to optimal decisions after and before the data. Thus, we have the following.

Definition 3.13. (The value of additional information).
(i) The expected value of the information provided by $x$ is
$$v(e, x) = \bar{u}(d^*_x \mid e, x) - \bar{u}(d^*_0 \mid e, x),$$
where $d^*_0$ is the optimal decision with no additional data;
(ii) the expected value of the experiment $e$ is
$$v(e) = \int v(e, x)\, p(x \mid e)\, dx.$$
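Definition 3.13 can be illustrated with a small discrete example; the states, likelihoods and utilities below are our own invented numbers, not from the text:

```python
prior = {"w1": 0.7, "w2": 0.3}                 # beliefs p(w)
like = {"w1": {"x1": 0.8, "x0": 0.2},          # p(x | w) for an experiment e
        "w2": {"x1": 0.3, "x0": 0.7}}
u = {"d1": {"w1": 10.0, "w2": 0.0},            # u(d, w)
     "d2": {"w1": 4.0,  "w2": 6.0}}

def posterior(x):
    px = sum(prior[w] * like[w][x] for w in prior)
    return {w: prior[w] * like[w][x] / px for w in prior}, px

def eu(d, dist):
    return sum(dist[w] * u[d][w] for w in dist)

d0 = max(u, key=lambda d: eu(d, prior))        # optimal decision with no data

# v(e) = sum over x of p(x) * [u-bar(d*_x | x) - u-bar(d*_0 | x)]
v_e = 0.0
for x in ("x1", "x0"):
    post, px = posterior(x)
    v_e += px * (max(eu(d, post) for d in u) - eu(d0, post))
print(d0, v_e)  # d1, v(e) = 0.42: e is worth performing if it costs less
```

Here the data only change the decision when $x = x_0$ is observed, and $v(e)$ is the corresponding expected gain.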
Then, for $d \ne d^*_\omega$, the loss suffered by choosing $d$, given $\omega$, would be $u(d^*_\omega, \omega) - u(d, \omega)$. For $d = d^*_0$, the optimal decision with no additional data, this utility difference measures (conditional on $\omega$) the value of perfect information. Its expected value with respect to $p(\omega)$ will define, under certain conditions, an upper bound on the increase in utility which additional information about $\omega$ could be expected to provide.

Definition 3.14. (Expected value of perfect information). The opportunity loss of choosing $d$ is defined to be
$$l(d, \omega) = u(d^*_\omega, \omega) - u(d, \omega),$$
and the expected value of perfect information about $\omega$ is defined by
$$v^*(e_0) = \int_\Omega \bigl\{u(d^*_\omega, \omega) - u(d^*_0, \omega)\bigr\}\, p(\omega)\, d\omega,$$
where $d^*_0$ is the optimal decision with no additional information.

As we remarked in Section 2.6.2, in many situations the utility function may often be thought of as made up of two separate components: the experimental cost of performing $e$ and obtaining $x$, and the utility of directly taking decision $d$ and finding $\omega$ to be the state of the world. Given such an (additive) decomposition, we can establish a useful upper bound for the expected value of an experiment.

Proposition 3.11. (Additive decomposition). If the utility function has the form
$$u(d, e, x, \omega) = u(d, \omega) - c(e, x), \qquad c(e, x) \ge 0,$$
and the probability distributions are such that $p(\omega \mid e, x, d) = p(\omega \mid e, x)$ and $p(\omega \mid e_0, d) = p(\omega \mid e_0)$, then, for any available $e$,
$$v(e) \le v^*(e_0) - \bar{c}(e), \qquad \text{where } \bar{c}(e) = \int c(e, x)\, p(x \mid e)\, dx$$
is the expected cost of $e$.
Proof. This closely parallels the proof, for the finite case, given in Proposition 2.27. ⊲

This concludes the mathematical extension of the basic framework and associated axioms. In the next section, we reconsider the important special problem of statistical inference, previously discussed in detail in its finitistic setting in Section 2.7.
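The expected value of perfect information of Definition 3.14 reduces, in a discrete problem, to a short computation; the numbers below are our own toy example:

```python
p = {"w1": 0.7, "w2": 0.3}                 # beliefs p(w)
u = {"d1": {"w1": 10.0, "w2": 0.0},        # u(d, w): hypothetical utilities
     "d2": {"w1": 4.0,  "w2": 6.0}}

def expected_utility(d):
    return sum(p[w] * u[d][w] for w in p)

d0 = max(u, key=expected_utility)          # optimal decision with no data

# v*(e0) = sum over w of p(w) * [u(d*_w, w) - u(d*_0, w)]
v_star = sum(p[w] * (max(u[d][w] for d in u) - u[d0][w]) for w in p)
print(d0, expected_utility(d0), v_star)    # d1, 7.0, about 1.8
```

By Proposition 3.11, no experiment with expected cost $\bar{c}(e)$ can then be worth more than $v^*(e_0) - \bar{c}(e)$; in this toy problem, no experiment costing more than about 1.8 utility units could be worthwhile.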
3.4 GENERALISED INFORMATION MEASURES

3.4.1 The General Problem of Reporting Beliefs

In Section 2.7, we argued that the problem of reporting a degree of belief distribution for a (finite) class of exclusive and exhaustive "hypotheses" $\{E_j,\ j \in J\}$, conditional on some relevant data $D$ and initial state of information $M_0$, could be formulated as a decision problem $\{\mathcal{E}, C, \mathcal{A}, \le\}$, with the interpretation $E_j$ = "hypothesis $H_j$ is true", where $\{E_j,\ j \in J\}$ is a partition of $\Omega$. In that finitistic setting, we denoted by $p = \{p_j,\ j \in J\}$ the probability measure describing an individual's actual beliefs, where $p_j$ is the probability which, conditional on $D$, an individual reports as the probability of $E_j$. We then proceeded to consider a special class of utility functions (score functions) appropriate to this reporting problem, and to examine the resulting forms of implied decisions and the links with information theory. In this section, we shall generalise these concepts and results to the extended framework developed in the previous sections.

The first generalisation consists in noting that the set of alternative "hypotheses" now corresponds to the set of possible values of a (possibly continuous) random vector $\omega$, labelling the "unknown states of the world", so that the relevant uncertain events are $E_\omega = \{\omega\}$, $\omega \in \Omega$, with the interpretation $E_\omega$ = "the hypothesis $\omega$ is true". Quantitative coherence requires that any particular individual's uncertainty about $\omega$, conditional on data $D$ and initial state of information $M_0$, should be represented by a probability distribution $P$ over a $\sigma$-algebra of subsets of $\Omega$, which we shall assume can be described by a density (to be understood as a mass function in the discrete case) $p_\omega(\cdot \mid D)$. We shall take the set of possible inference statements to be the set $Q$ of probability densities for $\omega$ compatible with $D$. We denote by $\mathcal{D}$ the set of functions $d_q$, one for each $q_\omega(\cdot \mid D) \in Q$, which map $\omega$ to the pair $(q_\omega(\cdot \mid D), \omega)$, and the set of consequences $C$ consists of all pairs $(q, \omega)$, representing the possible conjunction of reported beliefs and true hypotheses.
3.4.2 The Utility of a General Probability Distribution

In this general setting, the problem of providing an inference statement about a class of exclusive and exhaustive "hypotheses" $\{\omega,\ \omega \in \Omega\}$, where $\Omega$ is the set of possible values of the random quantity $\omega$, conditional on data $D$, is a decision problem: the decision space $\mathcal{D}$ consists of the $d_q$'s corresponding to choosing to report the $q_\omega(\cdot \mid D)$'s, defined by $d_q(\omega) = (q_\omega(\cdot \mid D), \omega)$, where $q_\omega(\cdot \mid D)$ is the density which an individual reports as the basis for describing beliefs about $\omega$ conditional on $D$; and the set of consequences $C$ consists of all pairs $(q, \omega)$, representing the conjunction of reported beliefs and true "states of nature". We complete the specification of this decision problem by inducing the preference ordering through direct specification of a utility function $u$, which describes the "value" $u(q_\omega(\cdot \mid D), \omega)$ of reporting the probability density $q_\omega(\cdot \mid D)$ were $\omega$ to turn out to be the true "state of nature". The solution to the decision problem is then to report the density $q_\omega(\cdot \mid D)$ which maximises the expected utility
$$\bar{u}(d_q) = \int_\Omega u(q_\omega(\cdot \mid D), \omega)\, p(\omega \mid D)\, d\omega.$$
As in our earlier development in Chapter 2, we shall wish to restrict utility functions for the reporting problem in such a way as to encourage a coherent individual to be honest, in the sense that his or her expected utility is maximised if and only if $q_\omega(\cdot \mid D) = p_\omega(\cdot \mid D)$. For this purpose, we generalise the notion of score function introduced in Definition 2.20.

Definition 3.15. (Score function). A score function for probability densities $q_\omega(\cdot \mid D) \in Q$ defined on $\Omega$ is a mapping $u: Q \times \Omega \to \Re$. A score function is said to be smooth if it is continuously differentiable as a function of $q(\omega \mid D)$ for each $\omega \in \Omega$.

Throughout this section, we shall denote an individual's actual belief density, conditional on data $D$, by $p_\omega(\cdot \mid D)$, and we shall assume the individual to be coherent. Without loss of generality, we shall assume that $p_\omega(\cdot \mid D)$ and the $q_\omega(\cdot \mid D) \in Q$ are strictly positive probability densities, so that $p(\omega \mid D) > 0$ and $q(\omega \mid D) > 0$ for all $\omega \in \Omega$.
Definition 3.16. (Proper score function). A score function $u$ is proper if, for each strictly positive probability density $p_\omega(\cdot \mid D)$, the expected utility $\bar{u}(d_q)$ is maximised if and only if $q_\omega(\cdot \mid D) = p_\omega(\cdot \mid D)$.

As in the finite case (see Definition 2.22), the simplest proper score function in the general case is the quadratic.

Definition 3.17. (Quadratic score function). A quadratic score function for probability densities $q_\omega(\cdot \mid D) \in Q$ defined on $\Omega$ is a mapping $u: Q \times \Omega \to \Re$ of the form
$$u(q_\omega(\cdot \mid D), \omega) = A\Bigl\{2q(\omega \mid D) - \int q^2(\omega' \mid D)\, d\omega'\Bigr\} + B(\omega), \qquad A > 0,$$
such that the otherwise arbitrary function $B(\cdot)$ ensures the existence of $\bar{u}(d_q)$ for all $d_q \in \mathcal{D}$.

Proposition 3.12. (A quadratic score function is proper).
Proof. Given data $D$, we must choose $q_\omega(\cdot \mid D) \in Q$ to maximise
$$\int u(q_\omega(\cdot \mid D), \omega)\, p(\omega \mid D)\, d\omega,$$
subject to $\int q(\omega \mid D)\, d\omega = 1$. Rearranging, it is easily seen that this is equivalent to maximising
$$-\int \bigl(q(\omega \mid D) - p(\omega \mid D)\bigr)^2\, d\omega,$$
from which it follows that we require $q(\omega \mid D) = p(\omega \mid D)$ for almost all $\omega \in \Omega$. ⊲

We note again (cf. Proposition 2.28) that the constraint $\int q(\omega \mid D)\, d\omega = 1$ has not been needed in establishing this result for the quadratic scoring rule.
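Properness of the quadratic score can be seen numerically by searching over reported distributions on a grid; this is our own discrete sketch, with the density understood as a mass function:

```python
from itertools import product

p = (0.5, 0.3, 0.2)   # actual beliefs over a three-point Omega
A, B = 1.0, 0.0       # constants of the quadratic score (values arbitrary)

def quadratic_score(q, i):
    return A * (2.0 * q[i] - sum(qj * qj for qj in q)) + B

def expected_score(q):
    return sum(pi * quadratic_score(q, i) for i, pi in enumerate(p))

# search a grid of reported distributions q on the simplex
grid = [(a / 10, b / 10, 1 - a / 10 - b / 10)
        for a, b in product(range(11), repeat=2) if a + b <= 10]
best = max(grid, key=expected_score)
print(best)  # (0.5, 0.3, 0.2): honesty maximises the expected quadratic score
```

The expected score rearranges to $A\{\sum_i p_i^2 - \sum_i (q_i - p_i)^2\} + \text{const}$, which makes the maximiser $q = p$ evident.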
In fact, as we argued in Section 2.7, for the problem of reporting pure inference statements it is natural to restrict further the class of appropriate utility functions. The following generalises Definition 2.23.

Definition 3.18. (Local score function). A score function is local if, for each $q_\omega(\cdot \mid D) \in Q$, there exist functions $u_\omega$, $\omega \in \Omega$, defined on $\Re^+$, such that
$$u(q_\omega(\cdot \mid D), \omega) = u_\omega\bigl(q(\omega \mid D)\bigr).$$

Note that, as in Definition 2.23, the dependence of the score function on the density value $q(\omega \mid D)$ which $d_q$ assigns to $\omega$ is allowed to vary with the particular $\omega$ in question. Intuitively, this enables us to incorporate the possibility that "bad predictions", i.e., small values of $q(\omega \mid D)$ for "true states of nature" $\omega$, may be judged more harshly for some $\omega$ than for others. The next result generalises Proposition 2.29 and characterises the form of a smooth, proper, local score function.

Proposition 3.13. (Characterisation of proper local score functions). If $u: Q \times \Omega \to \Re$ is a smooth, proper, local score function, subject to the existence of $\bar{u}(d_q)$ for all $d_q \in \mathcal{D}$, then it must be of the form
$$u(q_\omega(\cdot \mid D), \omega) = A \log q(\omega \mid D) + B(\omega),$$
where $A > 0$ is an arbitrary constant and $B(\cdot)$ is an arbitrary function of $\omega$.

Proof. Since $u$ is local, we need to maximise, with respect to $q_\omega(\cdot \mid D)$, the expected utility
$$\bar{u}(d_q) = \int u_\omega\bigl(q(\omega \mid D)\bigr)\, p(\omega \mid D)\, d\omega, \qquad \text{subject to } \int q(\omega \mid D)\, d\omega = 1.$$
However, this reduces to finding an extremal of
$$F\bigl(q_\omega(\cdot \mid D)\bigr) = \int \Bigl\{u_\omega\bigl(q(\omega \mid D)\bigr)\, p(\omega \mid D) - \lambda\, q(\omega \mid D)\Bigr\}\, d\omega;$$
for $q_\omega(\cdot \mid D)$ to give a stationary value of $F\bigl(q_\omega(\cdot \mid D)\bigr)$, it is necessary that
$$\Bigl[\frac{d}{d\tau}\,F\bigl(q(\omega \mid D) + \tau\eta(\omega)\bigr)\Bigr]_{\tau = 0} = 0$$
for any function $\eta: \Omega \to \Re$ of sufficiently small norm (see, for example, Jeffreys and Jeffreys, 1946, Chapter 10). This condition reduces to the differential equation
$$D_1 u_\omega\bigl(q(\omega \mid D)\bigr)\, p(\omega \mid D) - \lambda = 0,$$
where D1 u . denotes the first derivative of uw. But, since uw is proper, the maxi~ must be attained at qw (. I D ) = pw (. / D ) , so that a smooth. mum of F ( qw (. I D)) proper, local utility function must satisfy the differential equation
D₁u_w(p(w | D)) p(w | D) − λ = 0,
whose solution is given by
u_w(p(w | D)) = A log p(w | D) + B(w),
as stated. ◁

This result prompts us to make the following formal definition.
Definition 3.19. (Logarithmic score function). A logarithmic score function for probability densities q_w(· | D) ∈ Q defined on Ω is a mapping u : Q × Ω → ℜ of the form
u{q_w(· | D), w} = A log q(w | D) + B(w),
A > 0, B(·) arbitrary, subject to the existence of ū(d_q) for all d_q ∈ D.
For additional discussion of generalised score functions see Good (1969) and Buehler (1971).
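The propriety of the logarithmic score is easy to check numerically. The sketch below is our own illustration, not part of the text: for an arbitrary discrete belief distribution p, the expected score Σ_w p(w){A log q(w) + B(w)} is maximised precisely when the reported distribution q coincides with p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Actual beliefs p over three "states of nature" w (arbitrary example values).
p = np.array([0.5, 0.3, 0.2])

def expected_log_score(q, p, A=1.0, B=None):
    """Expected utility of reporting q under beliefs p, for the
    logarithmic score u{q, w} = A log q(w) + B(w)."""
    if B is None:
        B = np.zeros_like(p)
    return float(np.sum(p * (A * np.log(q) + B)))

best = expected_log_score(p, p)
# Propriety: any other reported distribution scores no better.
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_log_score(q, p) <= best + 1e-12
print("honest report is optimal; expected score =", round(best, 4))
```

The loop is a brute-force check of Gibbs' inequality; the constant B(w) plays no role in the comparison, as the Proposition's arbitrary B(·) suggests.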
3.4.3
Generalised Approximation and Discrepancy
As we remarked in Section 2.7.3, although the optimal solution to an inference problem under the above conditions is to state one's actual beliefs, there may be technical reasons why the computation of this "optimal" density p_w(· | D) is difficult. In such cases, we may need to seek a tractable approximation, q_w(· | D), say, which is in some sense "close" to p_w(· | D), but much easier to specify. As in the previous discussion of this idea, we shall need to examine carefully this notion of "closeness". The next result generalises Proposition 2.30.
Proposition 3.14. (Expected loss in probability reporting). If preferences are described by a logarithmic score function, the expected loss of utility in reporting a probability density q(w | D), rather than the density p(w | D) representing actual beliefs, is given by

A ∫ p(w | D) log [ p(w | D) / q(w | D) ] dw ≥ 0,

with equality if and only if q_w(· | D) = p_w(· | D).
Proof. From Definition 3.19, the expected utility of reporting q_w(· | D) when p_w(· | D) is the actual belief distribution is given by

ū(d_q) = ∫ {A log q(w | D) + B(w)} p(w | D) dw,

so that

ū(d_p) − ū(d_q) = A ∫ p(w | D) log [ p(w | D) / q(w | D) ] dw ≥ 0.
The final condition follows either from the fact that u is proper, so that ū(d_p) ≥ ū(d_q) with equality if and only if q_w(· | D) = p_w(· | D), or directly from the fact that, for all x > 0, log x ≤ x − 1 with equality if and only if x = 1 (cf. Proposition 2.30). ◁
As in the finitistic discussion of Section 2.7, the above result suggests a natural, general measure of "lack of fit", or discrepancy, between a distribution and an approximation, when preferences are described by a logarithmic score function.
Definition 3.20. (Discrepancy of an approximation). The discrepancy between a strictly positive probability density p_w(·) and an approximation q̂_w(·), w ∈ Ω, is defined by

δ{q̂_w | p_w} = ∫ p(w) log [ p(w) / q̂(w) ] dw.
Example 3.4. (General normal approximations). Suppose that p(w) > 0, w ∈ ℜ, is an arbitrary density on the real line, with finite first two moments given by

m = ∫ w p(w) dw,   t = ∫ (w − m)² p(w) dw,
and that we wish to approximate p(·) by a q̂(·) corresponding to a normal density, N(w | μ, λ), with labelling parameters μ, λ chosen to minimise the discrepancy measure given in Definition 3.20. It is easy to see that, subject to the given constraints, minimising δ{q̂ | p} with respect to μ and λ is equivalent to minimising

− ∫ p(w) log N(w | μ, λ) dw,

and hence to minimising

− ½ log λ + (λ/2) ∫ p(w) (w − μ)² dw.
Invoking the two moment constraints, and writing (w − μ)² = (w − m + m − μ)², this reduces to minimising, with respect to μ and λ, the expression

λt − log λ + λ(m − μ)².

It follows that the optimal choice is μ = m, λ = t⁻¹. In other words, for the reporting problem with a logarithmic score function, the best normal approximation to a distribution on the real line (whose mean and variance exist) is the normal distribution having the same mean and variance.
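Example 3.4 can be checked by direct numerical integration. In this sketch (our own illustration; the Gamma "true" density and the grid are arbitrary choices) the discrepancy δ{N(· | μ, λ) | p} is smallest, among the perturbations tried, at the moment-matched values μ = m, λ = t⁻¹.

```python
import numpy as np

# "True" density: Gamma(3, 1), a conveniently non-normal choice.
x = np.linspace(1e-6, 40.0, 400001)
dx = x[1] - x[0]
p = x**2 * np.exp(-x) / 2.0          # Gamma(3,1) density; Gamma(3) = 2

m = np.sum(x * p) * dx               # mean      (analytically, 3)
t = np.sum((x - m)**2 * p) * dx      # variance  (analytically, 3)

def discrepancy(mu, var):
    """delta{N | p} = int p(w) log[p(w) / N(w | mu, var)] dw, by Riemann sum."""
    log_n = -(x - mu)**2 / (2 * var) - 0.5 * np.log(2 * np.pi * var)
    return float(np.sum(p * (np.log(p) - log_n)) * dx)

best = discrepancy(m, t)
# Perturbing either labelling parameter increases the discrepancy.
for d in (-0.5, 0.5):
    assert discrepancy(m + d, t) > best
    assert discrepancy(m, t + d) > best
print("delta at matched moments:", round(best, 4))
```

Working in log densities avoids overflow of the normal density in the tails; a finer grid or wider range would sharpen the quadrature.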
Example 3.5. (Normal approximations to Student distributions). Suppose that we wish to approximate the density St(x | μ, λ, α), α > 2, by a normal density. We have just shown that the best normal approximation to any distribution is that with the same first two moments (assuming the latter to exist, corresponding here to the restriction α > 2). Thus, recalling from Section 3.2.2 that the mean and precision of St(x | μ, λ, α) are given by μ and λ(α − 2)/α, respectively, it follows that the best normal approximation to St(x | μ, λ, α) is provided by N(x | μ, λ(α − 2)/α). From Definition 3.20, the corresponding discrepancy will be

δ(N | St) = ∫ St(x | μ, λ, α) log { St(x | μ, λ, α) / N(x | μ, λ(α − 2)/α) } dx.
Figure 3.3 Discrepancy between Student and normal densities
This is easily evaluated (see, for example, Bernardo, 1978a) using the fact that the entropy of a Student distribution is given by

− ∫ St(x | μ, λ, α) log St(x | μ, λ, α) dx
  = ½ log(α/λ) + log B(α/2, ½) + ((α + 1)/2) { ψ((α + 1)/2) − ψ(α/2) },
where ψ(z) = Γ′(z)/Γ(z) denotes the digamma function (see, for example, Abramowitz and Stegun, 1964), from which it follows that δ(N | St) may be written as

δ(N | St) = ½ log { 2πe/(α − 2) } − log B(α/2, ½) − ((α + 1)/2) { ψ((α + 1)/2) − ψ(α/2) },
which only depends on the degrees of freedom, α, of the Student distribution. Figure 3.3 shows a plot of δ(N | St) against α. Using Stirling's approximation,

log Γ(z) ≈ (z − ½) log z − z + ½ log(2π),

we obtain, for moderate to large values of α,

δ(N | St) ≈ [α(α − 2)]⁻¹ = O(1/α²),

so that [α(α − 2)]⁻¹ provides a simple, approximate measure of the departure from normality of a Student distribution.
3.4.4
Generalised Information
In Section 2.7.4, we examined, in the finitistic context, the increase in expected utility provided by data D. We now extend this analysis to the general setting, writing x to denote the observed data D.
Proposition 3.15. (Expected utility of data). If preferences are described by a logarithmic score function for the class of probability densities p(w | x) defined on Ω, then the expected increase in utility provided by data x, when the prior probability density is p(w), is given by

A ∫ p(w | x) log [ p(w | x) / p(w) ] dw,
where p(w | x) is the density of the posterior distribution for w, given x. This expected increase in utility is non-negative, and zero if, and only if, p(w | x) is identical to p(w). Proof. Using Definition 3.19, the expected increase in utility provided by x
is given by

∫ {A log p(w | x) + B(w)} p(w | x) dw − ∫ {A log p(w) + B(w)} p(w | x) dw
  = A ∫ p(w | x) log [ p(w | x) / p(w) ] dw,

which, by Proposition 3.14, is non-negative with equality if and only if p(w | x) and p(w) are identical. ◁
The following natural definition of the amount of information provided by the data extends that given in Definition 2.26.
Definition 3.21. (Information from data). The amount of information about w ∈ Ω provided by data x when the prior density is p(w) is given by

I{x | p_w(·)} = ∫ p(w | x) log [ p(w | x) / p(w) ] dw,
where p(w | x) is the corresponding posterior density.
As in the finite case, it is interesting to note that the amount of information provided by x is equivalent to the discrepancy measure if the prior is considered as an approximation to the posterior. Alternatively, we see that log p(w) and log p(w | x), respectively, measure how "good", on a logarithmic scale, the prior and the posterior are at "predicting" the "true state of nature" w, so that log p(w | x) − log p(w) is a measure of the usefulness of x were w known to be the true value. Thus, I{x | p_w(·)} is simply the expected value of that utility difference with respect to the posterior density, given x.
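Definition 3.21 is straightforward to evaluate numerically. The sketch below (our own illustration, using an arbitrary Beta–Bernoulli setting) computes I{x | p_w(·)} as the discrepancy between posterior and prior: it is zero when no data are observed, and grows with the amount of data.

```python
import numpy as np
from math import lgamma

theta = np.linspace(1e-6, 1 - 1e-6, 200001)
dth = theta[1] - theta[0]

def log_beta(a, b):
    """Log density of Be(theta | a, b) on the grid."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

def information(s, f, a0=1.0, b0=1.0):
    """I{x | p} = int p(th|x) log[p(th|x)/p(th)] dth, for a Be(a0,b0)
    prior updated by data x consisting of s successes and f failures."""
    lp0 = log_beta(a0, b0)
    lp1 = log_beta(a0 + s, b0 + f)
    post = np.exp(lp1)
    return float(np.sum(post * (lp1 - lp0)) * dth)

print(round(information(0, 0), 6))    # no data: zero information
print(round(information(7, 3), 4))    # 10 observations
print(round(information(70, 30), 4))  # 100 observations: more information
```

Note that working with log densities keeps the integrand finite even where the posterior underflows to zero near the boundaries of the parameter space.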
The functional ∫ p(w) log p(w) dw has been used (see, e.g., Lindley, 1956, and references therein) as a measure of the 'absolute' information about w contained in the probability density p(w). The increase in utility from observing x is then

∫ p(w | x) log p(w | x) dw − ∫ p(w) log p(w) dw,
instead of our Definition 3.21. However, this expression is not invariant under one-to-one transformations of w, a property which seems to us to be essential. Note, however, that both expressions have the same expectation with respect to the distribution of x. Draper and Guttman (1969) put forward yet another non-invariant definition of information.
Additional references on statistical information concepts are Renyi (1964, 1966, 1967), Goel and DeGroot (1979) and De Waal and Groenewald (1989). More generally, we may wish to step back to the situation before data become available, and consider the idea of the amount of information to be expected from an experiment e. We therefore generalise Definition 2.27.
Definition 3.22. (Expected information from an experiment). The expected information to be provided by an experiment e about w ∈ Ω, when the prior density is p(w), is given by

I{e | p_w(·)} = ∫ I{x | p_w(·)} p(x | e) dx.
The following result, which is a generalisation of Proposition 2.32, provides an alternative expression for I{e | p_w(·)}.

Proposition 3.16. An alternative expression for the expected information is

I{e | p_w(·)} = ∫∫ p(w, x | e) log [ p(w, x | e) / { p(w) p(x | e) } ] dw dx,

where p(w, x | e) = p(w | x, e) p(x | e) and p(w | x, e) is the posterior density for w given data x and prior density p(w). Moreover, I{e | p_w(·)} ≥ 0, with equality if and only if x and w are independent random quantities, so that p(w, x | e) = p(w) p(x | e) for all w and x.

Proof. By Definition 3.22,

I{e | p_w(·)} = ∫ p(x | e) ∫ p(w | x, e) log [ p(w | x, e) / p(w) ] dw dx,

and the result now follows from the fact that p(w, x | e) = p(w | x, e) p(x | e). Moreover, since, by Proposition 3.14, I{x | p_w(·)} ≥ 0 with equality if and only if p(w | x, e) = p(w), it follows from Definition 3.22 that I{e | p_w(·)} ≥ 0 with equality if and only if, for all w and x, p(w, x | e) = p(w) p(x | e). ◁
Maximisation of the expected Shannon information was proposed by Lindley (1956) as a "reasonable" ad hoc criterion for choosing among alternative experiments. Fedorov (1972) proved later that certain classical design criteria (in particular, D-optimality) are special cases of this when normal distributions are assumed. We have shown that maximising expected information is just a particular (albeit important) case of the general criterion, implied by quantitative coherence, of maximising the expected utility in the case of pure inference problems. See Polson (1992) for a closely related argument.
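Lindley's criterion can be illustrated directly from Definition 3.22. The sketch below is our own toy calculation, assuming a uniform Be(1, 1) prior: it computes the expected information I{e | p_w(·)} of an experiment e consisting of n Bernoulli trials, using the fact that under a uniform prior each outcome x = 0, ..., n has predictive probability 1/(n + 1). As the criterion would demand, experiments with more trials carry more expected information.

```python
import numpy as np
from math import lgamma

theta = np.linspace(1e-6, 1 - 1e-6, 100001)
dth = theta[1] - theta[0]

def log_beta(a, b):
    """Log density of Be(theta | a, b) on the grid."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

def expected_information(n):
    """I{e | p} = sum_x p(x|e) I{x | p} for n Bernoulli trials under a
    uniform Be(1,1) prior; here p(x|e) = 1/(n+1) for each outcome x."""
    lp0 = log_beta(1.0, 1.0)          # log of the uniform prior (all zeros)
    total = 0.0
    for s in range(n + 1):
        lp1 = log_beta(1.0 + s, 1.0 + n - s)
        post = np.exp(lp1)
        total += float(np.sum(post * (lp1 - lp0)) * dth) / (n + 1)
    return total

for n in (1, 5, 20):
    print(n, round(expected_information(n), 4))
```

For n = 1 the value can be checked by hand: each posterior is Be(2, 1) or Be(1, 2), whose discrepancy from the uniform prior is log 2 − ½ ≈ 0.1931.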
It follows from Proposition 2.31 and the last remark that someone who adopts the classical D-optimality criterion of optimal design under standard normality assumptions should, for consistency, have preferences which are described by a logarithmic scoring rule; otherwise, such designs are not optimal with respect to his or her underlying preferences.
There is a considerable literature on the Bayesian design of experiments, which we will not attempt to review here. A detailed discussion will be given in the volume Bayesian Methods. We note that important references include Blackwell (1951, 1953), Lindley (1956), Chernoff (1959), Stone (1959), DeGroot (1962, 1970), Duncan and DeGroot (1976), Bandemer (1977), Smith and Verdinelli (1980), Pilz (1983/1991), Chaloner (1984), Sugden (1985), Mazloum and Meeden (1987), Felsenstein (1988, 1992), DasGupta and Studden (1991), El-Krunz and Studden (1991), Pardo et al. (1991), Mitchell and Morris (1992), Pham-Gia and Turkkan (1992), Verdinelli (1992), Verdinelli and Kadane (1992), Lindley and Deely (1993), Lad and Deely (1994) and Parmigiani and Berry (1994).
3.5
DISCUSSION AND FURTHER REFERENCES

3.5.1
The Role of Mathematics
The translation of any substantive theory into a precise mathematical formalism necessarily involves an element of idealisation. We have already had occasion to remark on aspects of this problem in Chapter 2, in the context of using real numbers rather than subsets of the rationals to represent actual measurements (necessarily "finitised" by inherent accuracy limits of the measuring apparatus). Similar remarks are obviously called for in the context of using, for example, probability densities to represent belief distributions for real-valued observables. In some situations, as we shall see in Chapter 4, the adoption of specific forms of density may follow from simple, structural assumptions about the form of the belief distribution. In other situations, however, if we really try to think of such a density as being practically identified by expressions of preference among, say, standard options, we would encounter the obvious operational problem that, implicitly, an infinite number of revealed preferences would be required. Clearly, in such situations the precise mathematical form of a density is likely to have arisen as an approximation to a "rough shape" obtained from some finite elicitation or observation process, and has been chosen, arbitrarily, for reasons of mathematical convenience, from an available mathematical toolkit. Similar remarks apply to the choice, for descriptive or mathematical convenience, of infinite sets to represent consequences or decisions, with the attendant problems of defining appropriate concepts of expected utility. There are obvious dangers, therefore, in accepting too uncritically any orientation, or would-be insightful mathematical analysis, that flows from arbitrary, idealised mathematical inputs into the general quantitative coherence theory. However, given an awareness of the dangers involved, we can still systematically make use of the power and elegance of the (idealised) mathematics by simultaneously asserting,
as a central tenet of our approach, a concern with the robustness and sensitivity of the output of an analysis to the form of input assumed (see Section 5.6.3). Of course, we shall later have to make precise the sense in which these terms are to be interpreted and the actual forms of procedures to be adopted. That being
understood, our approach, as with the earlier formalism of Chapter 2, will be to work with the mathematical idealisation, in order to exploit its potential power and insight, while constantly bearing in mind the need for a large pinch of salt and a repertoire of sensitivity diagnostics.

3.5.2
Critical Issues
We shall comment further on four aspects of the general mathematical structure we have developed and will be using throughout the remainder of this volume. These will be dealt with under the following subheadings: (i) Finite versus Countable Additivity; (ii) Measure versus Linear Theory; (iii) Proper versus Improper Probabilities; (iv) Abstract versus Concrete Mathematics.
Finite versus Countable Additivity In Chapter 2, we developed, from a directly intuitive and operational perspective, a minimal mathematical framework for a theory of quantitative coherence. The role of the mathematics employed in this development was simply that of a tool to capture the essentials of the substantive concepts and theory; within the resulting finitistic framework we then established that uncertainties should be represented in terms of finitely additive probabilities. The generalisations and extensions of the theory given in the present chapter lead, instead, to the mathematical framework of countable additivity, within which we have available the full panoply of analytic tools from mathematical probability theory. The latter is clearly highly desirable from the point of view of mathematical convenience, but it is important to pause and consider whether the development of a more convenient mathematical framework has been achieved at the expense of a distortion of the basic concepts and ideas. First, let us emphasise that, from a philosophical perspective, the monotone continuity postulate introduced in Section 3.1.2 does not have the fundamental status of the axioms presented in Chapter 2. We regard the latter as encapsulating the essence of what is required for a theory of quantitative coherence. The former is an "optional extra" assumption that one might be comfortable with in specific contexts, but should in no way be obliged to accept as a prerequisite for quantitative coherence. Secondly, we note that the effect of accepting that preferences should conform to the monotone continuity postulate is to restrict one's available (in the sense of coherent) belief specifications to a subset of the finitely additive uncertainty measures; namely, those that are also countably additive.
This is, of course, potentially disturbing from a subjectivist perspective, since a key feature of the theory is that the only constraints on belief specifications should be that they are coherent. For some such representations to be ruled out a priori, as a consequence of a postulate adopted purely for mathematical convenience, would indeed be a distortion of the
theory. This is why we regard such a postulate as different from the basic axioms. However, provided one is aware of, and not concerned about, the implicit restriction of the available belief representations, its adoption may be very natural in contexts where one is, in any case, prepared to work in an extended mathematical setting. Throughout this work, we shall, in fact, make systematic use of concepts and tools from mathematical probability theory, without further concern or debate about this issue. However, to underline what we already said in Section 3.1.2, it is important to be on guard and to be aware that distortions might occur. To this end, we draw attention to some key references to which the reader may wish to refer in order to heighten such awareness and to study in detail the issues involved. De Finetti (1970/1974, pp. 116–133, 173–177 and 228–241; 1970/1975, pp. 267–276 and 340–361) provides a wealth of detailed analysis, illustration and comment on the issues surrounding finite versus countable (and other) additivity assumptions, his own analysis being motivated throughout by the guiding principle that
. . . mathematics is an instrument which should conform itself strictly to the exigencies of the field in which it is to be applied. (1970/1974, p. 3)
Further technical and philosophical discussion is given in de Finetti (1972, Chapters 5 and 6); see, also, Stone (1986). Systematic use of finite additivity in decision-related contexts is exemplified in Dubins and Savage (1965/1976), Heath and Sudderth (1978, 1989), Stone (1979b), Hill (1980), Sudderth (1980), Seidenfeld and Schervish (1983), Hill and Lane (1984), Regazzini (1987) and Regazzini and Petris (1992). A discussion of the statistical implications of finitely additive probability is given by Kadane et al. (1986). In Section 2.8.3 we discussed, within a finitistic framework, several "betting" approaches to establishing probability as the only coherent measure of degree of belief. These ideas may be extended to the general case. Dawid and Stone (1972, 1973) introduce the concept of "expectation consistency", and show the necessity of using Bayes' theorem to construct probability distributions corresponding to fair bets made with additional information. Other generalised discussions on coherence of inference in terms of gambling systems include Lane and Sudderth (1983) and Brunk (1991).
Measure versus Linear Theory
Mathematical probability theory can be developed, equivalently, starting either from the usual Kolmogorov axioms for a set function defined over a σ-field of events (see, for example, Ash, 1972), or from axioms for a linear operator defined over a linear space of random quantities (see, for example, Whittle, 1976). The former deals directly with probability measure; the latter with an expectation operator (or a prevision, in de Finetti's terminology).
In our development of a quantitative coherence theory, the axiomatic approach to preferences among options has led us more naturally towards probability measures as the primary probabilistic element, with expectation (prevision) defined subsequently. In the approach to coherence put forward in de Finetti (1972, 1970/1974, 1970/1975), prevision is the primary element, with probability subsequently emerging as a special case for 0–1 random quantities. The case for adopting the linear rather than the measure theory approach is argued at length by de Finetti, there being many points of contact with the argument regarding finite versus countable additivity, particularly the need to avoid, in the mathematical formulation, going beyond those aspects required for the problem in hand. In the specific context of statistical modelling and inference, Goldstein (1981, 1986a, 1986b, 1987a, 1987b, 1988, 1991, 1994) has systematically developed the linear approach advocated by de Finetti, showing that a version of a subjectivist programme for revising beliefs in the light of data can be implemented without recourse to the full probabilistic machinery developed in this chapter. Lad et al. (1990) provide further discussion on the concept of prevision. We view these and related developments with great interest and with no dogmatic opinion concerning the ultimate relative usefulness and acceptance of "linear" versus "probabilistic" Bayesian statistical concepts and methods. That said, the present volume is motivated by our conviction that, currently, there remains a need for a detailed exposition of the Bayesian approach within the, more or less, conventional framework of full probabilistic descriptions.
Proper versus Improper Probabilities

Whether viewed in terms of finite or countable additivity, we have taken probability to be a measure with values in the interval [0, 1]. However, it is possible to adopt axiomatic approaches which allow for infinite (or improper) probabilities: see, for example, Renyi (1955, 1962/1970, Chapter 2, and references therein), who uses conditional arguments to derive proper probabilities from improper distributions, and Hartigan (1983, Chapter 3), who directly provides an axiomatic foundation for improper or, as he terms them, non-unitary, probabilities. We shall not review such axiomatic theories in detail, but note that we shall encounter improper distributions systematically in Section 5.4.

Abstract versus Concrete Mathematics

When probabilistic mathematics is being used as a tool for the representation and analysis of substantive non-mathematical problems, rather than as a direct mathematical concern in its own right, there is always a dilemma regarding the appropriate level of mathematics to be used. Specifically, there are basic decisions to be made about how much measure-theoretical machinery should be invoked. The introduction of too much abstract mathematics can easily make the substantive content seem totally opaque to the very reader at whom it is most aimed. On the other hand, too
little machinery may prove inadequate to provide a complete mathematical treatment, requiring the omission of certain topics, or the provision of just a partial, non-rigorous treatment, with insight and illustration attempted only by concrete examples. Thus far, we have tried to provide a complete, rigorous treatment of the foundations and generalisations of the theory of quantitative coherence, within the mathematical framework of Chapters 2 and 3. This chapter essentially defines the upper limit of the mathematical machinery we shall be using and, in fact, most of our subsequent development will be much more straightforward. However, it will be the case, for example in Chapter 4, that some results of interest to us require rather more sophisticated mathematical tools than we have made available. Our response to this problem will be to try to make it clear to the reader when this is the case, and to provide references to a complete treatment of such results, together with (hopefully) sufficient concrete discussion and illustration to illuminate the topic. For more sophisticated mathematical treatments of Bayesian theory, the reader is referred to Hartigan (1983) and Florens et al. (1990).
BAYESIAN THEORY
Edited by José M. Bernardo and Adrian F. M. Smith. Copyright © 2000 by John Wiley & Sons, Ltd
Chapter 4
Modelling
Summary

The relationship between beliefs about observable random quantities and their representation using conventional forms of statistical models is investigated. It is shown that judgements of exchangeability lead to representations that justify and clarify the use and interpretation of such familiar concepts as parameters, random samples, likelihoods and prior distributions. Beliefs which have certain additional invariance properties are shown to lead to representations involving familiar specific forms of parametric distributions, such as normals and exponentials. The concept of a sufficient statistic is introduced and related to representations involving the exponential family of distributions. Various forms of partial exchangeability judgements about data structures involving several samples, structured layouts, covariates and designed experiments are investigated, and links established with a number of other commonly used statistical models.
4.1
STATISTICAL MODELS

4.1.1
Beliefs and Models
The subjectivist, operationalist viewpoint has led us to the conclusion that, if we aspire to quantitative coherence, individual degrees of belief, expressed as probabilities, are inescapably the starting point for descriptions of uncertainty. There can
be no theories without theoreticians, no learning without learners and, in general, no science without scientists. To be sure, the object of attention and interest may well be an assumed external, objective reality; but the actuality of the learning process consists in the evolution of individual, subjective beliefs about that reality. It follows that learning processes, whatever their particular concerns and fashions at any given point in time, are necessarily reasoning processes which take place in the minds of individuals. However, it is important to emphasise that the primitive and fundamental notions of individual preference and belief will typically provide the starting point for interpersonal communication and reporting processes. In what follows, and more particularly in Chapter 5, we shall therefore often be concerned to identify and examine features of the individual learning process which relate to interpersonal issues, such as the conditions under which an approximate consensus of beliefs might occur in a population of individuals.

We recall that, in Chapters 2 and 3, we established a very general foundational framework for the study of degrees of belief and their evolution in the light of new information. We now turn to the detailed development of these ideas for the broad class of problems of primary interest to statisticians; namely, those where the events of interest are defined explicitly in terms of random quantities x₁, x₂, ... (discrete or continuous, and possibly vector-valued) representing observed or experimental data.

Within the Bayesian framework, we shall assume, throughout, that an individual's degrees of belief for events of interest are derived from the specification of a joint distribution function P(x₁, ..., xₙ), which we shall typically assume to be representable in terms of a joint density function p(x₁, ..., xₙ) (to be understood as a mass function in the discrete case). Any such specification implicitly defines a number of other degrees of belief specifications of possible interest: for example,

p(x₁, ..., xₘ) = ∫ p(x₁, ..., xₙ) dxₘ₊₁ ··· dxₙ,   1 ≤ m < n,

provides the marginal joint density for x₁, ..., xₘ; similarly,

p(xₘ₊₁, ..., xₙ | x₁, ..., xₘ) = p(x₁, ..., xₙ) / p(x₁, ..., xₘ)

simply indicates that the conditional density for xₘ₊₁, ..., xₙ, as yet unobserved, given the observed x₁, ..., xₘ, is given by the ratio of the specified joint densities. As in our earlier discussion in Section 2.8, this latter conditional form is the key to "learning from experience". In what follows, we shall use notation such as P and p in a generic sense, rather than as specifying particular functions; P may sometimes refer to an underlying probability measure, and sometimes to implied distribution functions, such as P(x₁, ..., xₙ) or P(xₘ₊₁, ..., xₙ | x₁, ..., xₘ), etc. Such usage avoids notational proliferation, and the context will always ensure that there is no confusion of meaning.

We shall therefore need to examine rather closely this process of choosing a specific form of probability measure to represent degrees of belief.

Definition 4.1. (Predictive probability model). A predictive model for a sequence of random quantities x₁, x₂, ... is a probability measure P, which mathematically specifies the form of the joint belief distribution for any subset of x₁, x₂, ....

In actual applications we shall need to choose specific, concrete forms for joint distributions. This is clearly a somewhat daunting task, since direct contemplation and synthesis of the many complex marginal and conditional judgements implicit in such a specification are almost certainly beyond our capacity in all but very simple situations. In some cases, as we shall soon see, we shall find that we are able to identify general types of belief structure which "pin down", in some sense, the mathematical representation strategy to be adopted. In other cases, however, this "formal" approach will not take us very far towards solving the representation problem and we shall have to fall back on rather more pragmatic modelling strategies.

In much statistical writing, the starting point for formal analysis is the assumption of a mathematical model form, typically involving "unknown parameters", the main object of the study being to infer something about the values of these parameters. From our perspective, at this stage, a word of warning is required: this is all somewhat premature and mysterious! We are seeking to represent degrees of belief about observables; nothing in our previous development justifies or gives any insight into the choice of particular "models", and thus far we have no way of attaching any operational meaning to the "parameters" which appear in conventional models. However, the subjectivist, operationalist approach will provide considerable insight into the nature and status of these conventional assumptions.

4.2
EXCHANGEABILITY AND RELATED CONCEPTS

4.2.1
Dependence and Independence

Consider a sequence of random quantities x₁, x₂, ..., and suppose that a predictive model is assumed which specifies that, for all n, the joint density can be written in
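The marginal and conditional specifications implicitly defined by a joint belief distribution are easily illustrated in the discrete case. In this sketch (our own toy example; the joint mass function is arbitrary) the conditional density is obtained as the ratio of joint to marginal, and exhibits "learning from experience":

```python
import numpy as np

# A toy joint mass function p(x1, x2) for two 0-1 random quantities,
# chosen (arbitrarily) to exhibit positive dependence.
p_joint = np.array([[0.40, 0.10],    # rows: x1 = 0, 1
                    [0.10, 0.40]])   # cols: x2 = 0, 1
assert abs(p_joint.sum() - 1.0) < 1e-12

p_x1 = p_joint.sum(axis=1)                 # marginal p(x1)
p_x2_given_x1 = p_joint / p_x1[:, None]    # conditional p(x2 | x1)

# Observing x1 = 1 shifts beliefs about the as yet unobserved x2.
print("p(x2=1)        =", p_joint.sum(axis=0)[1])   # 0.5
print("p(x2=1 | x1=1) =", p_x2_given_x1[1, 1])      # 0.8
```

The same arithmetic, with densities in place of mass functions, underlies the continuous conditional formula above.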
the form

p(x₁, ..., xₙ) = p(x₁) p(x₂) ··· p(xₙ),

so that the x₁, x₂, ... are independent random quantities. It then follows straightforwardly that, for any 1 ≤ m < n,

p(xₘ₊₁, ..., xₙ | x₁, ..., xₘ) = p(xₘ₊₁, ..., xₙ),

so that no learning from experience can take place within this sequence of observations. In other words, past data provide us with no additional information about the possible outcomes of future observations in the sequence. A predictive model specifying such an independence structure is clearly inappropriate in contexts where we believe that the successive accumulation of data will provide increasing information about future events. In such cases, the structure of the joint density p(x₁, ..., xₙ) must encapsulate some form of dependence among the individual random quantities.

In general, there are a vast number of possible subjective assumptions about the form such dependencies might take and there can be no all-embracing theoretical discussion. Instead, what we can do is to concentrate on some particular simple forms of judgement about dependence structures which might correspond to actual judgements of individuals in certain situations. There is no suggestion that the structures we are going to discuss in subsequent subsections have any special status, or ought to be adopted in most cases. They simply represent forms of judgement which may often be felt to be appropriate and whose detailed analysis provides illuminating insight into the specification and interpretation of certain classes of predictive models.

4.2.2
Exchangeability and Partial Exchangeability

Suppose that, in thinking about P(x₁, ..., xₙ), an individual makes the judgement that the subscripts, the "labels" identifying the individual random quantities, are "uninformative", in the sense that he or she would specify all the marginal distributions for the individual random quantities identically, and similarly for all the marginal joint distributions for all possible pairs, triples, etc., of the random quantities. It is easy to see that this implies that the form of the joint distribution must be such that, for any possible permutation π of the subscripts {1, ..., n},

p(x₁, ..., xₙ) = p(x_π(1), ..., x_π(n)).

We formalise this notion of "symmetry" of beliefs for the individual random quantities as follows.
Definition 4.2. (Finite exchangeability). The random quantities x_1, ..., x_n are said to be judged (finitely) exchangeable under a probability measure P if the implied joint degree of belief distribution satisfies

  P(x_1, ..., x_n) = P(x_{π(1)}, ..., x_{π(n)})

for all permutations π defined on the set {1, ..., n}. In terms of the corresponding density or mass function, the condition reduces to

  p(x_1, ..., x_n) = p(x_{π(1)}, ..., x_{π(n)}).

The notion of exchangeability thus involves a judgement of complete symmetry among all the observables x_1, ..., x_n under consideration.

Example 4.1. (Tossing a thumb tack). Consider a sequence of tosses of a standard metal drawing pin (or thumb tack), and let x_i = 1 if the pin lands point uppermost on the ith toss, x_i = 0 otherwise. If the tosses are performed in such a way that time order appears to be irrelevant, and the conditions of the toss appear to be essentially held constant throughout, it would seem to be the case that, for a subjectivist interested in belief distributions for observables, and whatever precise quantitative form those beliefs take, most observers would judge the outcomes of the sequence of tosses x_1, x_2, ... to be exchangeable in the above sense. Clearly, in such situations the exchangeability assumption captures the essence of the idea of a so-called "random sample". This latter notion is, of course, a meaningless phrase thus far within our framework, since it (implicitly) involves the idea of "conditional independence, given the value of the underlying parameter", and is of no direct use to us at this stage.

In many situations, however, a judgement of complete symmetry might be too restrictive an assumption, even though a partial judgement of symmetry is present.

Example 4.1. (cont.). Suppose that the tosses in the sequence are not all made with the same pin, but that the even and odd numbered tosses are made with different pins: an all-metal one for the odd tosses; a plastic-coated one for the even tosses. Alternatively, suppose that the same pin were used throughout, but that the odd tosses are made by a different person, using a completely different tossing mechanism from that used for the even tosses. In such cases, many individuals would retain an exchangeable form of belief distribution within the sequences of odd and even tosses separately, but might be reluctant to make a judgement of symmetry for the combined sequence of tosses.
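As an informal computational illustration of Definition 4.2 (a sketch added here, not part of the original development), the permutation-invariance condition can be checked by brute force for a small joint mass function. The particular p below, a uniform mixture of Bernoulli products, is a hypothetical example chosen because it depends on x_1, ..., x_n only through their sum:

```python
from fractions import Fraction
from itertools import permutations, product
from math import factorial

def p(x):
    # An exchangeable joint mass function for 0-1 quantities: a uniform
    # mixture over theta of Bernoulli products, which works out to
    # p(x_1..x_n) = s!(n - s)!/(n + 1)!, where s = x_1 + ... + x_n.
    n, s = len(x), sum(x)
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))

def is_exchangeable(p, n):
    # Definition 4.2: p must be invariant under every permutation of labels.
    for x in product([0, 1], repeat=n):
        base = p(x)
        for perm in permutations(range(n)):
            if p(tuple(x[i] for i in perm)) != base:
                return False
    return True

assert is_exchangeable(p, 3)
```

Any joint mass function that depends on the outcomes only through their sum passes this check, which is exactly the "labels are uninformative" judgement.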
Example 4.2. (Laboratory measurements). Suppose that x_1, x_2, ... are real-valued measurements of a physical or chemical property of a given substance, all made on the same sample with the same measurement procedure. Under such conditions, many individuals might judge the complete sequence of measurements to be exchangeable. Suppose, however, that sequences of such measurements are combined from l different laboratories, the substance being identical but the measurement procedures varying from laboratory to laboratory. In this case, judgements of exchangeability for each laboratory sequence separately might be appropriate, whereas such a judgement for the combined sequence might not be.

Example 4.3. (Physiological responses). Suppose that {x_1, x_2, ...} are real-valued measurements of a specific physiological response in human subjects when a particular drug is administered. If the drug is administered at more than one dose level, and if there are both male and female subjects spanning a wide age range, most individuals would be very reluctant to make a judgement of exchangeability for the entire sequence of results. However, within each combination of dose-level, sex and appropriately defined age-group, a judgement of exchangeability might be regarded as reasonable.

Judgements of the kind suggested in the above examples correspond to forms of partial exchangeability. In general, there are many possible forms of departure from overall judgements of exchangeability to those of partial exchangeability, and so a formal definition of the term does not seem appropriate. It simply signifies that there may be additional "labels" on the random quantities (for example, the odd and even labels, or the identification of the tossing mechanism, in Example 4.1), with exchangeable judgements made separately for each group of random quantities having the same additional labels. A detailed discussion of various possible forms of partial exchangeability will be given in Section 4.6.

We shall now return to the simple case of exchangeability and examine in detail the form of representation of p(x_1, ..., x_n) which emerges in various special cases. As a preliminary, we shall generalise our previous definition of exchangeability to allow for "potentially infinite" sequences of random quantities. Of course, in practice it should, at least in principle, always be possible to give an upper bound to the number of observables to be considered. However, specifying an actual upper bound may be somewhat difficult or arbitrary, and so, for mathematical and descriptive purposes, it is convenient to be able to proceed as if we were contemplating an infinite sequence of potential observables. Clearly, it will be important to establish that working within the infinite framework does not cause any fundamental conceptual distortion. These and related issues of finite versus infinite exchangeability will be considered in more detail later in this chapter. For the time being, we shall concentrate on the "potentially infinite" case.
However.3. Si = 1) 5 p ( J l = 0. exchangeable.rl = 0. p(J.4. r y = l . we also have p ( r .x. x.S:t = l .KI = 1. S ) = 1) = 0 and so p(r1 = O. . One might be tempted to wonder whether every finite sequence of exchangeable random quantities could be embedded in or extended to an infinitely exchangeable sequence of similarly defined random quantities. Suppose that we define the three random quantities C ~ . X:) = 1.x2: r:{ having probability zero.r4= I ) . 3 . S ) = 1.rr = 0) = 1/3.1.. = where p ( q = l. such that rI. is said to be judged (injnitely) exchangeable if everyfinite subsequence is judged exchangeable in the sense of Definition 4.r. q = 1. x2!. However. = 0. i = 1. this is certainly not the case as the following example shows.. & $= I .r2= O.1 = 1. x r = 1. XA = 0) = p(s1 = 0 .1 = 0. we require.with joint such probability function given by p ( q = O.~ 2.2?3.r3are clearly .x2 = l.C2 = 1. .S. Example 4.S( = O ) .t = l.. (Infinite exchangeability). .r:i = I ) = p ( q = 1. The inJnite sequence of random quantities $1.rI = O . .. 2 = 1. = 1 or .r.3.$ 1. for example. .2 Exchangeability and Related Concepts 171 Definition 4.s) 1 .C:3 = 1) = p(3. = 1. r L = l ) = l / : I . . x:! = 1. with all other combinationsof x i .p ( q = 0. S l = 1).2.t = 0) = 1/3.. let alone an infinitely exchangeable sequence.Zz = 1 . = It follows that a finitely exchangeable sequence cannot even necessarily be embedded in a larger finitely exchangeable sequence. p ( s .4 = 0) = p ( q = O.) = 1 ) = 0. .p ( q = 1.r:i= 1) . We shall now try to identify an sJ..2 = l. C : ~ that either I.1 = 0 ) 5 p(xl = so that 1. *'2 = O. For this to be possible. X J = 1) 1/3p(. (Nonextendibleexchangeability).C? = 0. ~ : ~ = l .1:z = l .22 = 1.S.x j are exchangeable. ~ 4= 0) # p(r1 = 0.so that x i . . = 0. I1 = 0. taking only values 0 and 1.. . . But p ( q = 0. = 0 .22 = 1.ri = 1.S. : C ~ .4. .
4.3 MODELS VIA EXCHANGEABILITY

4.3.1 The Bernoulli and Binomial Models

We consider first the case of an infinitely exchangeable sequence of 0-1 random quantities x_1, x_2, ....

Proposition 4.1. (Representation theorem for 0-1 random quantities). If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random quantities with probability measure P, there exists a distribution function Q such that the joint mass function p(x_1, ..., x_n) for x_1, ..., x_n has the form

  p(x_1, ..., x_n) = ∫_0^1 ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} dQ(θ),

where

  Q(θ) = lim_{n→∞} P[y_n/n ≤ θ],  with  y_n = x_1 + ... + x_n,

and θ = lim_{n→∞} y_n/n.

Proof. (De Finetti, 1930, 1937/1964; here we follow closely the proof given by Heath and Sudderth, 1976; see also Barlow, 1991.) By exchangeability, the probability of any particular sequence x_1, ..., x_n depends only on y_n = x_1 + ... + x_n. For arbitrary N ≥ n, with the summations below taken over the range y_n ≤ y_N ≤ N − (n − y_n),

  p(x_1, ..., x_n) = Σ_{y_N} p(x_1, ..., x_n | y_N) p(y_N),

and, again by exchangeability, p(x_1, ..., x_n | y_N) is given by the hypergeometric form of Chapter 3: intuitively, we can imagine sampling n items without replacement from an urn of N items containing y_N 1's and N − y_N 0's. If we now define Q_N(θ) on ℜ to be the step function which is 0 for θ < 0 and has jumps of p(y_N = s) at θ = s/N, s = 0, 1, ..., N, we see that

  p(x_1, ..., x_n) = ∫_0^1 h_N(θ) dQ_N(θ),

where h_N(θ) is the hypergeometric probability of the observed sequence given y_N = θN. As N → ∞, h_N(θ) tends, uniformly in θ, to θ^{y_n}(1 − θ)^{n−y_n}; moreover, by Helly's theorem (see, for example, Ash, 1972, Section 8.2), there exists a subsequence Q_{N_1}, Q_{N_2}, ... and a distribution function Q such that lim_{j→∞} Q_{N_j} = Q. The result follows.

The interpretation of this representation theorem is of profound significance from the point of view of subjectivist modelling philosophy. It is as if: the x_i are judged to be independent Bernoulli random quantities (see Chapter 3) conditional on a random quantity θ; θ is itself assigned a probability distribution Q; and, by the strong law of large numbers, θ = lim_{n→∞}(y_n/n), so that Q may be interpreted as "beliefs about the limiting relative frequency of 1's".

In more conventional notation and language: it is as if, conditional on θ, x_1, ..., x_n are a random sample from a Bernoulli distribution with parameter θ, generating a parametrised joint sampling distribution

  p(x_1, ..., x_n | θ) = ∏_{i=1}^n θ^{x_i}(1 − θ)^{1−x_i},

where the parameter θ is itself assigned a prior distribution Q(θ). Thought of as a function of θ, we shall refer to the joint sampling distribution as the likelihood function. The operational content of the prior distribution derives from the fact that it is as if we are assessing beliefs about what we would anticipate observing as the limiting relative frequency from a "very large number" of observations.

Note that the assumption of exchangeability for the infinite sequence of 0-1 random quantities x_1, x_2, ... places a strict limitation on the family of probability measures P which can serve as predictive probability models for the sequence. Any such P must correspond to the mixture form given in Proposition 4.1, for some choice of prior distribution Q(θ). As we range over all possible choices of this latter distribution, we generate all possible predictive probability models compatible with the assumption of infinite exchangeability for the 0-1 random quantities.
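The mixture form of Proposition 4.1 can be made concrete with a small numerical sketch (an illustration added here, using a hypothetical two-point prior Q): a joint mass function defined as a mixture of Bernoulli products is automatically exchangeable, and the marginal probability of a 1 equals the prior mean of θ.

```python
from itertools import permutations, product
from math import prod

# Hypothetical discrete prior Q on the limiting relative frequency theta.
Q = {0.2: 0.5, 0.8: 0.5}

def p(x):
    # Mixture form of Proposition 4.1 with a discrete Q:
    # p(x_1..x_n) = sum_theta Q(theta) * prod_i theta^x_i (1-theta)^(1-x_i)
    return sum(q * prod(t if xi == 1 else 1 - t for xi in x)
               for t, q in Q.items())

# Exchangeability: p is invariant under every permutation of the labels.
for x in product([0, 1], repeat=3):
    assert all(abs(p(tuple(x[i] for i in perm)) - p(x)) < 1e-12
               for perm in permutations(range(3)))

# The marginal p(x_i = 1) is the prior expectation of theta.
marginal = sum(p(x) for x in product([0, 1], repeat=3) if x[0] == 1)
assert abs(marginal - 0.5) < 1e-12
```

Ranging over all priors Q generates all predictive models compatible with infinite exchangeability; the two-point Q above is merely one such choice.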
In many applications involving 0-1 random quantities, we may be more interested in a summary random quantity, such as y_n = x_1 + ... + x_n, than in the individual sequences of x_i's. The representation of p(y_n) is straightforwardly obtained from Proposition 4.1.

Corollary 1. Given the conditions of Proposition 4.1, the mass function for y_n = x_1 + ... + x_n has the form

  p(y_n) = ∫_0^1 Bi(y_n | θ, n) dQ(θ),

where

  Bi(y_n | θ, n) = C(n, y_n) θ^{y_n} (1 − θ)^{n−y_n},  C(n, y_n) = n!/(y_n!(n − y_n)!).

Proof. This follows immediately from Proposition 4.1 and the fact that all individual sequences x_1, ..., x_n leading to the same value of y_n have the same probability, there being C(n, y_n) such sequences.

This provides a justification for acting as if we have a binomial likelihood, Bi(y_n | θ, n), with a prior distribution Q(θ) for the binomial parameter θ.

Corollary 2. Given the conditions of Proposition 4.1, the conditional probability function p(x_{m+1}, ..., x_n | x_1, ..., x_m), for n > m, has the form

  p(x_{m+1}, ..., x_n | x_1, ..., x_m) = ∫_0^1 ∏_{i=m+1}^n θ^{x_i}(1 − θ)^{1−x_i} dQ(θ | x_1, ..., x_m),

where

  dQ(θ | x_1, ..., x_m) = ∏_{i=1}^m θ^{x_i}(1 − θ)^{1−x_i} dQ(θ) / ∫_0^1 ∏_{i=1}^m θ^{x_i}(1 − θ)^{1−x_i} dQ(θ).

Proof. Since p(x_{m+1}, ..., x_n | x_1, ..., x_m) = p(x_1, ..., x_n)/p(x_1, ..., x_m), the result follows by applying Proposition 4.1 to both p(x_1, ..., x_n) and p(x_1, ..., x_m) and rearranging the resulting expression.

We thus see that the basic form of the representation of beliefs does not change. All that has happened, expressed in conventional terminology, is that the prior distribution Q(θ) for θ has been revised, via Bayes' theorem, into the posterior distribution Q(θ | x_1, ..., x_m). The conditional probability function p(x_{m+1}, ..., x_n | x_1, ..., x_m) is called the (conditional, or posterior) predictive probability function for x_{m+1}, ..., x_n given x_1, ..., x_m; it also provides the basis for deriving the conditional predictive distribution of any other random quantity defined in terms of the future observations, for example, y_n − y_m, the total number of 1's in x_{m+1}, ..., x_n. A particularly important random quantity defined in terms of future observations is the frequency of 1's in a large sample; expressed in these terms, a posterior distribution for a parameter is seen to be a limiting case of a posterior (conditional) predictive distribution for an observable.

Thus, in this simple case, the key step in the formal learning process is a straightforward consequence of the representation theorem: we establish, "at a stroke", a justification for the conventional model building procedure of combining a likelihood and a prior. The likelihood is defined in terms of an assumption of conditional independence of the observations given a parameter; the latter, and its associated prior distribution, acquire an operational interpretation in terms of a limiting average of observables (in this case, a limiting frequency). Even this simple example provides considerable insight into the learning process. The formal learning process for models such as this will be developed systematically and generally in Chapter 5.
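The prior-to-posterior revision described in Corollary 2 can be sketched numerically. The Beta family used below is an illustrative conjugate choice made here for concreteness, not something introduced at this point in the text (conjugate analysis is treated systematically in Chapter 5): with a Beta(a, b) prior Q(θ), the posterior after observing x_1, ..., x_m is Beta(a + y_m, b + m − y_m), and the predictive probability that the next observation is a 1 is the posterior mean of θ.

```python
from fractions import Fraction

def posterior(a, b, data):
    # Beta(a, b) prior -> Beta(a + s, b + m - s) posterior,
    # where s is the number of 1's among the m observations.
    s = sum(data)
    return a + s, b + len(data) - s

def predictive_prob_one(a, b):
    # p(x_{m+1} = 1 | x_1..x_m) = E[theta | data] = a' / (a' + b')
    return Fraction(a, a + b)

a, b = posterior(1, 1, [1, 0, 1, 1])   # uniform prior, three 1's in four
assert (a, b) == (4, 2)
assert predictive_prob_one(a, b) == Fraction(2, 3)
```

The basic form of the representation is unchanged by conditioning; only the mixing distribution is updated, exactly as the corollary states.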
4.3.2 The Multinomial Model

An alternative way of viewing the 0-1 random quantities discussed in Section 4.3.1 is as defining category membership (given two exclusive and exhaustive categories). We can extend this idea in an obvious way by considering k-dimensional random vectors x_i, whose jth component takes the value 1 to indicate membership of the jth of k + 1 categories. At most one of the k components can take the value 1, and if they all take the value 0 this signifies membership of the (k + 1)th category. Thus, for k = 2, x_i = (1, 0) signifies that the ith observation belongs to category 1, x_i = (0, 1) signifies membership of category 2, and x_i = (0, 0) signifies membership of category 3. In what follows, we shall refer to such x_1, x_2, ... as "0-1 random vectors".

If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random vectors, we can extend Proposition 4.1 in an obvious way. We shall give the representation of p(x_1, ..., x_n), generalise Corollary 1 to Proposition 4.1, and then comment on the interpretation of these results.

Proposition 4.2. (Representation theorem for 0-1 random vectors). If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random vectors with probability measure P, there exists a distribution function Q such that the joint mass function p(x_1, ..., x_n) has the form

  p(x_1, ..., x_n) = ∫ ∏_{i=1}^n ∏_{j=1}^k θ_j^{x_{ij}} (1 − θ_1 − ... − θ_k)^{1 − x_{i1} − ... − x_{ik}} dQ(θ),

where θ = (θ_1, ..., θ_k) and Q(θ) is the limiting joint distribution of the relative frequency vector y_n/n, with y_n = (y_{n1}, ..., y_{nk}) and y_{nj} = x_{1j} + ... + x_{nj} the total number of occurrences of category j in the n observations.

Proof. This is a straightforward, albeit algebraically cumbersome, generalisation of the proof of Proposition 4.1.

As in the previous case, we are often most interested in the summary random vector y_n = (y_{n1}, ..., y_{nk}).

Corollary. Given the conditions of Proposition 4.2, the joint mass function p(y_{n1}, ..., y_{nk}) may be represented as

  p(y_{n1}, ..., y_{nk}) = ∫ Mu_k(y_n | θ, n) dQ(θ),

where Mu_k(y_n | θ, n), the multinomial distribution, has probability function

  Mu_k(y_n | θ, n) = n!/(y_{n1}! y_{n2}! ... y_{nk}! (n − Σ_j y_{nj})!) ∏_{j=1}^k θ_j^{y_{nj}} (1 − Σ_j θ_j)^{n − Σ_j y_{nj}}.

Proof. This follows immediately from the generalisation of the argument used in proving Corollary 1 to Proposition 4.1.

Thus, we see in Proposition 4.2 and its corollary that it is as if we have a likelihood corresponding to the joint sampling distribution of a random sample from a multinomial distribution, together with a prior distribution Q over the multinomial parameter θ, where the component θ_j of the latter can be thought of as the limiting relative frequency of membership of the jth category.
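As a quick numerical check on the Mu_k form (an added illustration), the probability function below takes the full vector of k + 1 category counts, and its values sum to one over all possible count vectors:

```python
from math import factorial, prod

def multinomial_pmf(counts, theta):
    # counts: occupancy numbers for all k+1 categories (summing to n);
    # theta: the corresponding category probabilities (summing to 1).
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)
    return coef * prod(t ** c for t, c in zip(theta, counts))

theta = (0.5, 0.25, 0.25)
n = 4
# Total probability over all count vectors (y_1, y_2, y_3) with sum n is 1.
total = sum(multinomial_pmf((a, b, n - a - b), theta)
            for a in range(n + 1) for b in range(n + 1 - a))
assert abs(total - 1.0) < 1e-12
```

With k = 1 (two categories) this reduces to the binomial form Bi(y_n | θ, n) of Corollary 1 to Proposition 4.1.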
4.3.3 The General Model

We now consider the case of an infinitely exchangeable sequence of real-valued random quantities x_1, x_2, .... As one might expect, the mathematical technicalities of establishing a representation theorem in the real-valued case are somewhat more complicated than in the 0-1 cases, and a rigorous treatment involves the use of measure-theoretic tools beyond the general mathematical level at which this volume is aimed. For this reason, we shall content ourselves with providing an outline proof of a form of the representation theorem, having no pretence at mathematical rigour but, hopefully, providing some intuitive insight into the result, as well as the key ideas underlying a proper proof.

Proposition 4.3. (General representation theorem). If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities with probability measure P, there exists a probability measure Q over F, the space of all distribution functions on ℜ, such that the joint distribution function of x_1, ..., x_n has the form

  P(x_1 ≤ t_1, ..., x_n ≤ t_n) = ∫_F ∏_{i=1}^n F(t_i) dQ(F),

where Q(F) = lim_{n→∞} P(F_n) and F_n is the empirical distribution function defined by x_1, ..., x_n.
Outline proof. For fixed x, write I_i = I[(x_i ≤ x)], i = 1, 2, ..., where I[·] denotes the indicator function, and define

  F_n(x) = n^{-1}(I_1 + ... + I_n),

the proportion of the first n observations not exceeding x. By exchangeability, and noting that I_i² = I_i, the quantities E(I_i) and E(I_i I_j), i ≠ j, do not depend on the particular subscripts involved. A straightforward computation then shows that, for N > n,

  E[(F_n(x) − F_N(x))²] = (1/n − 1/N)[E(I_1) − E(I_1 I_2)],

which tends to zero as n, N → ∞; hence the random quantity F_n(x) tends in probability to some random quantity, F(x) say. Now fix t_1 < ... < t_n and, for N > n, expand E[F_N(t_1) ··· F_N(t_n)] as a sum over indicators among the first N observations. By exchangeability, a straightforward count of the numbers of terms involved in the summations shows that this expectation differs from P(x_1 ≤ t_1, ..., x_n ≤ t_n) only by terms which vanish as N → ∞. Recalling the convergence above, we see that, as N → ∞,

  P(x_1 ≤ t_1, ..., x_n ≤ t_n) = E[F(t_1) ··· F(t_n)] = ∫_F ∏_{j=1}^n F(t_j) dQ(F),

where Q(F) = lim_{N→∞} P(F_N), the limit of the measures induced over the empirical distribution functions. (See Chow and Teicher, 1978/1988, for a rigorous treatment.)

The general form of representation for real-valued exchangeable random quantities is therefore as if we have independent observations x_1, ..., x_n conditional on F, an unknown (i.e., random) distribution function, which plays the role of an infinite-dimensional "parameter" in this case, with a belief distribution Q for F, having the operational interpretation of "what we believe the empirical distribution function would look like for a large sample".

The structure of the learning process for a general exchangeable sequence of real-valued random quantities, with the distribution function representation given in Proposition 4.3, cannot easily be described explicitly. In what follows, we shall therefore find it convenient to restrict attention to those cases where a corresponding representation holds in terms of density functions, labelled by a finite-dimensional parameter, rather than the infinite-dimensional label F. For ease of reference, we present this representation as a corollary to Proposition 4.3.
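The empirical distribution function F_n underlying the interpretation of Q is simple to compute; the sketch below (added for illustration) evaluates F_n(y) = n^{-1} #{i : x_i ≤ y} for a small sample:

```python
def ecdf(xs):
    # Empirical distribution function F_n of Proposition 4.3:
    # F_n(y) = (1/n) * #{i : x_i <= y}.
    n = len(xs)
    def F(y):
        return sum(1 for x in xs if x <= y) / n
    return F

F = ecdf([2.0, 1.0, 3.0, 2.0])
assert F(0.5) == 0.0        # below the sample minimum
assert F(2.0) == 0.75       # three of the four observations are <= 2
assert F(3.0) == 1.0        # at or above the maximum
```

A belief distribution Q over F is then, operationally, a belief about what this step function would look like for a very large sample.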
However. and and rearranging the resulting expression. with x.> 1 matters. .rl. . under the conditions p(x. . from the subjectivist perspective. In cases where the distinction between k = 1 and A. x. a The technical discussion in this section has centred on exchangeable sequences. 1 X I . . The role of Bayes' theorem in the learning process is now easily identified. . parumeters.3 and its corollaries become the To joint distribution functions and densities for the k components of the z.180 of Proposition 4.. in subsequent developments we shall often just write s E X.I and realvaluedcases. wen the simple cases we have presented already provide. In Section 4.). applying the density representation form to both p ( . i r s .) p ( .rll has the form 4 Modelling Corollary 1. . x2. . with p ( . . including farreaching generalisations of the 0 .rt.8. This follows immediately on writing P(.)= . . . . . . ./+I.x 2 . .. x. .r E 'R and x E !Rk. I . . zl. is that the distribution func2 All tions and densities referred to in Proposition 4. r l . of realvalued random quantities. . * X I . Assuming the required densities to exist. . . is an injnitely exchangeable sequence o retrlf valued random quantities admitting a density representation as in Corollarj I . . it will be clear from the context what is intended. In fact. .. in effect. r l . then Corollary 2. .3 the joint density of s1.) [. . ..... 16) denoting the density jiinction corresponding to the "unknocon parameter" 6 E 8.'(. . a deeply satisfying clarification of such fundamental notions as models.. .. that happens. . 2 . . . everything carries over in an obviously analogous manner to the case of exchangeable sequences xl. we shall give detailed references to t e literature on represenh tation theorems for exchangeable sequences. . X I .. r f Proof. XI/) p ( r . . . . ... .E 'Rk.rt.. conditional independence and the relationship between helieji and l i n i i t i t i ~ ~ ~ t ~ ~ u e n c .r. 
avoid tedious distinctions between .. .
4.4 MODELS VIA INVARIANCE

4.4.1 The Normal Model

In the context of the general form of representation given in Proposition 4.3, the assumption of exchangeability for the real-valued random quantities x_1, x_2, ... again places (as in the 0-1 case) a limitation on the family of probability measures P which can serve as predictive probability models. Underlying the conditional independence structure within the mixture is a random distribution function F, so that the "parameter" is, in effect, infinite-dimensional, and the family of coherent predictive probability models is generated by ranging through all possible prior distributions Q(F). The mathematical form of the required representation is well-defined, but the practical task of translating actual beliefs about real-valued random quantities into the required mathematical form of a measure over a function space seems, to say the least, a somewhat daunting prospect.

It is therefore interesting to see whether there exist more complex formal structures of belief, imposing further symmetries or structure beyond simple exchangeability, which lead to more specific and "familiar" model representations. In particular, it is of interest to identify situations in which exchangeability leads to a mixture of conditional independence structures defined in terms of a finite-dimensional parameter, so that the more explicit forms given in the corollaries to Proposition 4.3 can be invoked. Given the interpretation of the components of such a parameter as strong law limits of simple sequences of functions of the observations, the specification of Q, and hence of the complete predictive probability model P, then becomes a much less daunting task.

Suppose, then, that in addition to judging an infinite sequence of real-valued random quantities x_1, x_2, ... to be exchangeable, we consider the possibility of further judgements of invariance, perhaps relating to the "geometry" of the space in which a finite subset x_1, ..., x_n of the observations lies. The following definitions describe two such possible judgements of invariance. As with exchangeability, there is no claim that such judgements have any a priori special status. They are intended, simply, as possible forms of judgement that might be made, and whose consequences might be interesting to explore.

Definition 4.4. (Spherical symmetry). A sequence of random quantities x_1, x_2, ... is said to have spherical symmetry under a predictive probability model P if, for any n, the latter defines the distributions of x = (x_1, ..., x_n) and Ax to be identical, for any (orthogonal) n × n matrix A such that A^t A = I.
This definition encapsulates a judgement of rotational symmetry: although the measurements happened to have been expressed in terms of a particular coordinate system (yielding x_1, ..., x_n), our quantitative beliefs would not change if they had been expressed in a rotated coordinate system. Since rotational invariance fixes "distances" from the origin, this is equivalent to a judgement of identical beliefs for all outcomes of x_1, ..., x_n leading to the same value of x_1² + ... + x_n².

The next result states that if we make the judgement of spherical symmetry (which in turn implies a judgement of exchangeability, since permutation is a special case of orthogonal transformation), the general mixture representation given in Proposition 4.3 assumes a much more concrete and familiar form.

Proposition 4.4. (Representation theorem under spherical symmetry). If x_1, x_2, ... is an infinite sequence of real-valued random quantities with probability measure P, and if, for any n, {x_1, ..., x_n} have spherical symmetry, there exists a distribution function Q on ℜ+ such that the joint distribution function of x_1, ..., x_n has the form

  P(x_1 ≤ t_1, ..., x_n ≤ t_n) = ∫_{ℜ+} ∏_{i=1}^n Φ(λ^{1/2} t_i) dQ(λ),

where Φ is the standard normal distribution function, and

  λ^{-1} = lim_{n→∞} n^{-1}(x_1² + ... + x_n²),

with Q the limiting distribution of the corresponding reciprocal mean sums of squares.

Proof. See, for example, Freedman (1963a) and Kingman (1972); details are omitted here, since the proof of a generalisation of this result will be given in full in Proposition 4.5.

The form of representation obtained in Proposition 4.4 tells us that the judgement of spherical symmetry restricts the set of coherent predictive probability models to those which are generated by acting as if:

(i) the observations are conditionally independent normal random quantities, given the random quantity λ, which corresponds to the precision (i.e., the reciprocal of the variance) and serves as a "labelling parameter";

(ii) λ is itself assigned a distribution Q;

(iii) by the strong law of large numbers, λ^{-1} = lim_{n→∞} n^{-1}(x_1² + ... + x_n²), so that Q may be interpreted as "beliefs about the reciprocal of the limiting mean sum of squares of the observations".
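A small numerical aside (added here): conditional on λ, a product of zero-mean normal densities depends on (x_1, ..., x_n) only through x_1² + ... + x_n², and is therefore unchanged by any orthogonal transformation, which is exactly the invariance that Definition 4.4 asks of the predictive model.

```python
import math

def joint_density(xs, lam=1.0):
    # Product of zero-mean normal densities with precision lam: depends on
    # xs only through sum(x_i^2), hence is invariant under any orthogonal
    # map A (A^t A = I).
    n = len(xs)
    ss = sum(x * x for x in xs)
    return (lam / (2 * math.pi)) ** (n / 2) * math.exp(-0.5 * lam * ss)

x = [1.0, 2.0]
angle = 0.7  # an arbitrary rotation: an orthogonal transformation of R^2
Ax = [math.cos(angle) * x[0] - math.sin(angle) * x[1],
      math.sin(angle) * x[0] + math.cos(angle) * x[1]]
assert abs(joint_density(x) - joint_density(Ax)) < 1e-12
```

Since a permutation matrix is orthogonal, the same computation also confirms that spherical symmetry implies exchangeability.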
For related work, see Dawid (1977, 1978).

We note, however, that the judgement of spherical symmetry implicitly attaches a special significance to the origin of the coordinate system, since it is equivalent to a judgement of invariance in terms of distance from the origin. In general, if we were to feel able to make a judgement of spherical symmetry, it would typically only be relative to an "origin" defined in terms of the "centre" of the random quantities under consideration. To obtain a justification for the usual normal specification, with "unknown mean and precision", we therefore need to generalise the above discussion slightly. This motivates the following definition.

Definition 4.5. (Centred spherical symmetry). A sequence of random quantities x_1, x_2, ... is said to have centred spherical symmetry if, for any n, the random quantities x_1 − x̄_n, ..., x_n − x̄_n have spherical symmetry, where x̄_n = n^{-1}(x_1 + ... + x_n). This is equivalent to a judgement of identical beliefs for all outcomes of x_1, ..., x_n leading to the same values of x_1 + ... + x_n and (x_1 − x̄_n)² + ... + (x_n − x̄_n)².

Proposition 4.5. (Representation theorem under centred spherical symmetry). If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities with probability measure P, and if, for any n, {x_1, ..., x_n} have centred spherical symmetry, then there exists a distribution function Q on ℜ × ℜ+ such that the joint distribution of x_1, ..., x_n has the form

  P(x_1 ≤ t_1, ..., x_n ≤ t_n) = ∫_{ℜ×ℜ+} ∏_{i=1}^n Φ(λ^{1/2}(t_i − μ)) dQ(μ, λ),

where Φ is the standard normal distribution function and

  Q(μ, λ) = lim_{n→∞} P[(x̄_n ≤ μ) ∩ (s_n^{-2} ≤ λ)],

with x̄_n = n^{-1}(x_1 + ... + x_n), s_n² = n^{-1} Σ_{i=1}^n (x_i − x̄_n)², μ = lim_{n→∞} x̄_n and λ^{-1} = lim_{n→∞} s_n². (Smith, 1981.)

Proof. Since the sequence x_1, x_2, ... is exchangeable, by Proposition 4.3 there exists a random distribution function F such that, conditional on F, the random quantities x_1, x_2, ... are independent. There is therefore a random characteristic function, φ, corresponding to F, such that, conditional on φ, for all real t_1, ..., t_n,

  E[exp{i(t_1 x_1 + ... + t_n x_n)} | φ] = φ(t_1) ··· φ(t_n).   (*)
The random quantities x_1 − x̄_n, ..., x_n − x̄_n are spherically symmetric for every n, so that the distribution of any linear combination of the observations whose coefficients sum to zero is fixed by the sum of squares of those coefficients. Expressing this invariance through (*), for suitable choices of such linear combinations and for any real u and v, one obtains a functional equation for φ; taking logarithms, it can be written in the form

  Ψ(u + v) + Ψ(u − v) = A(u) + B(v),  for all real u, v,

where Ψ, A and B are determined by log φ. It follows that log φ(t) is a quadratic in t (see, for example, Kagan, Linnik and Rao, 1973), almost surely with respect to the probability measure P. Recalling that φ(0) = 1, and that φ must be a valid characteristic function, the constant coefficient of this quadratic must be zero, the linear coefficient purely imaginary and the quadratic coefficient real and non-positive, so that

  φ(t) = exp{iμt − t²/(2λ)},

for some random quantities μ ∈ ℜ and λ ∈ ℜ+. This establishes that, conditional on μ and λ, x_1, ..., x_n are independent normally distributed random quantities, each with mean μ and precision λ. But then, by the strong law of large numbers, conditional on μ and λ,

  lim_{n→∞} x̄_n = μ  and  lim_{n→∞} s_n² = λ^{-1},

and the result follows by iterated expectation.

We see, therefore, that the combined judgements of exchangeability and centred spherical symmetry restrict the set of coherent predictive probability models to those which, expressed in conventional terminology, correspond to acting as if:

(i) we have a random sample from a normal distribution with unknown mean and precision parameters, μ and λ, generating a likelihood

  p(x_1, ..., x_n | μ, λ) = ∏_{i=1}^n N(x_i | μ, λ);

(ii) we have a joint prior distribution Q(μ, λ) for the unknown parameters, which can be given an operational interpretation as "beliefs about the sample mean and reciprocal sample variance which would result from a large number of observations".

4.4.2 The Multivariate Normal Model

Suppose now that we have an infinitely exchangeable sequence of random vectors x_1, x_2, ..., taking values in ℜ^k, k ≥ 2, and that, in addition, we judge, for all n and for all c ∈ ℜ^k, that the random quantities c^t x_1, ..., c^t x_n have centred spherical symmetry. The next result then provides a multivariate generalisation of Proposition 4.5.
Proposition 4.6. (Multivariate representation theorem under centred spherical symmetry). If x₁, x₂, … is an infinitely exchangeable sequence of random vectors taking values in ℝᵏ, with probability measure P, such that, for any n and c ∈ ℝᵏ, the random quantities cᵗx₁, …, cᵗxₙ have centred spherical symmetry, then the evaluations under P of probabilities of events defined by x₁, …, xₙ are as if the latter were independent, multivariate normally distributed random vectors, conditional on a random mean vector μ and a random precision matrix λ, with a distribution over μ and λ induced by P.

Proof. Defining yⱼ = cᵗxⱼ, j = 1, 2, …, we see that the random quantities y₁, y₂, … have centred spherical symmetry and so, by Proposition 4.5, there exist μ = μ(c) and λ = λ(c) such that, conditional on μ and λ, y₁, …, yₙ are independent normal random quantities, each with mean μ and precision λ. It follows that, for all t ∈ ℝ and for any n and c ∈ ℝᵏ, the conditional characteristic function of each cᵗxⱼ is that of a normal distribution; since this holds for every linear combination, the structure of evaluations under P is as if x₁, …, xₙ were, conditional on a random mean vector μ and a random precision matrix λ, independent multivariate normally distributed random vectors. ∎
4.4.3 The Exponential Model

Suppose x₁, x₂, … is judged to be an infinitely exchangeable sequence of positive real-valued random quantities. It is interesting to ask under what circumstances an individual might judge, for any pair xᵢ, xⱼ, i ≠ j, an identity of beliefs for any events in the positive quadrant which are symmetrically placed with respect to the 45° line through the origin. In Figure 4.1, for example, we note that this implies that the probabilities assigned to A₁ and A₂, and to B₁ and B₂, respectively, must be equal. In general, however, the assumption of exchangeability would not imply that events such as C₁ and C₂ have equal probabilities, even though they are symmetrically placed with respect to a 45° line (but not the one through the origin).

[Figure 4.1 A₁, A₂ and B₁, B₂, reflections in the 45° line through the origin; C₁, C₂, reflections in the (dashed) 45° line.]

The answer is suggested by the additional (dashed) lines in the figure. If we added to the assumption of exchangeability the judgement that the "origins" of the xᵢ and xⱼ axes are "irrelevant", so far as probability judgements are concerned, then the probabilities of events such as C₁ and C₂ would be judged equal. In perhaps more familiar terms, this would be as though, when making judgements about events in the positive quadrant, an individual's judgement exhibited a form of "lack of memory" property with respect to the origin. If such a judgement is assumed to hold for all subsets of n (rather than just two) random quantities, the resulting representation is as follows.
Proposition 4.7. (Continuous representation under origin invariance). If x₁, x₂, … is an infinitely exchangeable sequence of positive real-valued random quantities with probability measure P, such that, for all n and any event A in ℝ⁺ × ⋯ × ℝ⁺,

P[(x₁, …, xₙ) ∈ A] = P[(x₁, …, xₙ) ∈ A + a]

for all a ∈ ℝ × ⋯ × ℝ such that aᵗ1 = 0 and A + a is an event in ℝ⁺ × ⋯ × ℝ⁺, then the joint density for x₁, …, xₙ has the form

p(x₁, …, xₙ) = ∫ ∏ᵢ₌₁ⁿ θ exp(−θxᵢ) dQ(θ),

where θ = limₙ n/(x₁ + ⋯ + xₙ) and Q is the distribution for θ induced by P.

Outline proof. (Diaconis and Ylvisaker, 1985). By the general representation theorem, there exists a random distribution function F such that, conditional on F, the xᵢ are independent, p(x₁, …, xₙ | F) = ∏ᵢ p(xᵢ | F). It can be shown that the additional invariance property continues to hold conditional on F, so that, for any i ≠ j,

P[(xᵢ > a₁ + a₂) ∩ (xⱼ > 0) | F] = P[(xᵢ > a₁) ∩ (xⱼ > a₂) | F].

By exchangeability, and recalling that P(xⱼ > 0 | F) is certainly positive for all j, this implies that

P(xᵢ > a₁ + a₂ | F) = P(xᵢ > a₁ | F) P(xᵢ > a₂ | F).

But this functional relationship implies that P(xᵢ > x | F) = e^{−θx} for some θ, so that the density, conditional on F, is given by θ exp(−θx), for positive real-valued x. The rest of the result follows on noting that, by the strong law of large numbers, θ⁻¹ = limₙ (x₁ + ⋯ + xₙ)/n. ∎
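The functional equation at the heart of the proof is the "lack of memory" property of the exponential survival function. The following sketch (Python; the parameter values are arbitrary illustrations) checks it both analytically and by simulation.

```python
import math
import random

# Lack of memory: P(x > a1 + a2 | F) = P(x > a1 | F) P(x > a2 | F),
# equivalently P(x > a1 + a2 | x > a1) = P(x > a2), which forces an
# exponential survival function.  Values below are illustrative.
theta, a1, a2 = 1.7, 0.4, 1.1

def survival(a):
    # P(x > a | theta) = exp(-theta a) for the exponential distribution
    return math.exp(-theta * a)

# Analytically, the functional equation holds exactly:
gap = survival(a1 + a2) - survival(a1) * survival(a2)

# Empirically, the conditional probability matches the unconditional one:
rng = random.Random(0)
draws = [rng.expovariate(theta) for _ in range(200_000)]
beyond_a1 = [x for x in draws if x > a1]
cond_prob = sum(x > a1 + a2 for x in beyond_a1) / len(beyond_a1)
```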
Thus, we see that judgements of exchangeability and "lack of memory" for sequences of positive real-valued random quantities constrain the possible predictive probability models for the sequence to be those which are generated by acting as if we have a random sample from an exponential distribution with unknown parameter θ, with a prior distribution Q for the latter. In fact, if Q* denotes the corresponding distribution for φ = θ⁻¹ = limₙ (x₁ + ⋯ + xₙ)/n, it may be easier to use the "reparametrised" representation, since Q* is then more directly accessible as "beliefs about the sample mean from a large number of observations". Recalling the possible motivation given above for the additional invariance assumption on the sequence x₁, x₂, …, it is interesting to note the very specific and well-known "lack of memory" property of the exponential distribution, which appears implicitly in the above proof, namely

P(x > a₁ + a₂ | θ, x > a₁) = P(x > a₂ | θ).

4.4.4 The Geometric Model

Suppose x₁, x₂, … is judged to be an infinitely exchangeable sequence of strictly positive integer-valued random quantities. It is easy to see that we could repeat the entire introductory discussion of Section 4.4.3, except that events would now be defined in terms of sets of points on the lattice ℤ⁺ × ⋯ × ℤ⁺, rather than as regions in ℝ⁺ × ⋯ × ℝ⁺. This enables us to state the following representation result.

Proposition 4.8. (Discrete representation under origin invariance). If x₁, x₂, … is an infinitely exchangeable sequence of positive integer-valued random quantities with probability measure P, such that, for all n and any event A in ℤ⁺ × ⋯ × ℤ⁺,

P[(x₁, …, xₙ) ∈ A] = P[(x₁, …, xₙ) ∈ A + a]

for all a ∈ ℤ × ⋯ × ℤ such that aᵗ1 = 0 and A + a is an event in ℤ⁺ × ⋯ × ℤ⁺, then the joint density for x₁, …, xₙ has the form

p(x₁, …, xₙ) = ∫₀¹ ∏ᵢ₌₁ⁿ θ(1 − θ)^{xᵢ−1} dQ(θ),

where 0 < θ ≤ 1, θ⁻¹ = limₙ (x₁ + ⋯ + xₙ)/n, and Q is the distribution for θ induced by P.
..r.. . 1 I9) is easily seen to be #( 1 Again. ~ . > x I F ) = I 9 ' .. .I l l i l l ( .. = [ r n . recalling the possible motivation for the additional invariance property. ' tr. . /or t... ..rI. the coherent predictive probability models must be those which are generated by acting us ifwe have a rundom sample from a geometric distribution with unknown purumeter 6 . med{X I . the sample size und nirdinn ( k (I / / ) = 2 ) : t. s. I . . it is interesting to note the familiar "lack of memory" property of the geometric distribution.= [ n i .. . XI.. . ).= lim.190 4 Modelling Outline proof. ./j. .5. .sclnIplr si:e..#. . . except that. . by the strong law of large numbers.ri + . but this clearly does not achieve much by way of summarisation since A. I (XI I//) . (.. . with specified sets ofpossible i~uliresY I . Familiar examples of summary statistics are: t. > f11 I F)P(.1 MODELS VIA SUFFICIENT STATISTICS Summary Statistics We begin with a formal definition. but for notational simplicity we shall usually talk in terms of random quantities. .. I9 I = l i i i i . .6. or sample...( / ) / ) = 1 )..I.( ~ 5 )1 1 1 ) is culled X(irt)tlintc. for positive integervalued z.Slllll (k(rrr)= 3 ) : t. This follows precisely the steps in the proof of Proposition 4. 4. . . . > al + a ? 18. J .)I.5 4. . y ( r l 1 F) = p ( s .r.> 0 1 ) = P(. since .. which enables us to discuss the process ofsurnmurising a sequence.I . . > (I:! I F ) P(. forall i. of random quantities.0 < 6 5 1... t h e strrnplr rr/rrgr (A.. .( X I . ... . so that the probability function. ( J ]+ .} .. . . the .. Again. In this case. O ~ S ~ N N M S . . . + t~ statistic. . . ... + x. ). where 19.. respectiwly.: . \ .).. with aprior distribution Q for the latter. . + . (In general..= I H + ....7./. our discussion carries over to the case of random vectors.. ..tc/l Ulld .. . A trivial case of such a statistic would be t. )I. ... functional equation the P(r. .) Definition 4. . .!'I. (I mndoni vrc' .. (Stalislic). .. 
Familiar examples of summary statistics are:

the sample mean (k(m) = 1): tₘ = m⁻¹(x₁ + ⋯ + xₘ);
the sample size and median (k(m) = 2): tₘ = [m, med{x₁, …, xₘ}];
the sample range (k(m) = 2): tₘ = [min{x₁, …, xₘ}, max{x₁, …, xₘ}];
the sample size, mean and sum of squares (k(m) = 3): tₘ = [m, m⁻¹(x₁ + ⋯ + xₘ), x₁² + ⋯ + xₘ²].

A trivial case of such a statistic would be tₘ = (x₁, …, xₘ) itself, but this clearly does not achieve much by way of summarisation, since k(m) = m. To achieve data reduction, we clearly need k(m) < m; moreover, further clarity of interpretation is achieved if k(m) = k, a fixed dimension independent of m.
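For concreteness, a few of these summary statistics can be computed as follows (Python; the sample and the particular selection of statistics are illustrative).

```python
import statistics

def summary_statistics(xs):
    """A few of the summary statistics t_m listed above (illustrative)."""
    m = len(xs)
    return {
        "mean": sum(xs) / m,                                    # k(m) = 1
        "size_and_median": (m, statistics.median(xs)),          # k(m) = 2
        "size_mean_sumsq": (m, sum(xs) / m,
                            sum(x * x for x in xs)),            # k(m) = 3
        "trivial": tuple(xs),                                   # k(m) = m
    }

t4 = summary_statistics([2.0, 5.0, 1.0, 4.0])
```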
4.5.2 Predictive Sufficiency and Parametric Sufficiency

As an example of the way in which a summary statistic might be assumed to play a special role in the evolution of beliefs, let us consider the following general situation. Past observations x₁, …, xₘ are available, and an individual is contemplating, conditional on this given information, beliefs about future observations, to be described by the appropriate conditional density. It might be judged that, given a summary statistic tₘ = tₘ(x₁, …, xₘ), the individual values of x₁, …, xₘ contribute nothing further. The following definition describes one possible way in which assumptions of systematic data reduction might be incorporated into the structure of such conditional beliefs.

Definition 4.7. (Predictive sufficiency). Given a sequence of random quantities x₁, x₂, …, with probability measure P, where xᵢ takes values in Xᵢ, i = 1, 2, …, the sequence of statistics t₁, t₂, …, with tₘ = tₘ(x₁, …, xₘ) defined on X₁ × ⋯ × Xₘ, is said to be predictive sufficient for the sequence x₁, x₂, … if, for all m ≥ 1 and r ≥ 1,

p(x_{i₁}, …, x_{i_r} | x₁, …, xₘ) = p(x_{i₁}, …, x_{i_r} | tₘ), {i₁, …, i_r} ∩ {1, …, m} = ∅,

where p(· | ·) is the conditional density induced by P.

We shall not concern ourselves at this stage with the origin of or motivation for any such choice of particular summary statistics.
The above definition captures the idea that, as with the above examples, conditional on this given information, the individual values of x₁, …, xₘ contribute nothing further to one's evaluation of probabilities of future events defined in terms of as yet unobserved random quantities. Another way of expressing this, as is easily verified from Definition 4.7, is that future observations (xₘ₊₁, xₘ₊₂, …) and past observations (x₁, …, xₘ) are conditionally independent given tₘ. Clearly, from a pragmatic point of view, the assumption of a specified sequence of predictive sufficient statistics will greatly simplify the process of assessing probabilities of future events conditional on past observations. In the next section, we shall examine the formal acceptability and implications of seeking to act as if particular summary statistics have a special status in the context of representing beliefs about a sequence of random vectors; that is, we shall focus attention on the general questions of whether, and under what circumstances, it is coherent to invoke such a form of data reduction and, if so, what forms of representation for predictive probability models might result. From a formal point of view, however, we shall need additional
structure if we are to succeed in using this idea to identify specific forms of the general representation of the joint distribution of x₁, x₂, … . As with our earlier discussion, throughout this section we shall assume that the exchangeability assumption leads to a finitely parametrised mixture representation; that is, we shall assume in what follows that the probability measure P describing our beliefs implies both predictive sufficiency and exchangeability for the infinite sequence x₁, x₂, …, so that, for any n ≥ 1, p(x₁, …, xₙ) has the form

p(x₁, …, xₙ) = ∫ ∏ᵢ₌₁ⁿ p(xᵢ | θ) dQ(θ),

where all integrals, here and in what follows, are assumed to be over the set of possible values of θ. In particular, the conditional density function of xₘ₊₁, …, xₙ given x₁, …, xₘ then has the form

p(xₘ₊₁, …, xₙ | x₁, …, xₘ) = ∫ ∏ᵢ₌ₘ₊₁ⁿ p(xᵢ | θ) dQ(θ | x₁, …, xₘ).

This latter form makes clear that, within our assumed framework, the learning process is "transmitted" within the mixture representation by the updating of beliefs about the "unknown parameter" θ. This suggests another possible way of defining a statistic tₘ = tₘ(x₁, …, xₘ) to be a "sufficient summary" of x₁, …, xₘ.

Definition 4.8. (Parametric sufficiency). If x₁, x₂, … is an infinitely exchangeable sequence of random quantities, where xᵢ takes values in Xᵢ = X, the sequence of statistics t₁, t₂, …, with tₙ defined on X × ⋯ × X, is said to be parametric sufficient for x₁, x₂, … if, for any n ≥ 1 and for any dQ(θ) defining an exchangeable predictive probability model via the representation above,

dQ(θ | x₁, …, xₙ) = dQ(θ | tₙ).

As with our earlier discussion, a mathematically rigorous treatment is beyond the intended level of this book, and so we shall confine ourselves to an informal presentation of the main ideas. Definitions 4.7 and 4.8 both seem intuitively compelling as encapsulations of the notion of a statistic being a "sufficient summary". It is perhaps reassuring, therefore, that, for such exchangeable beliefs, we can establish the following.
Proposition 4.9. (Equivalence of predictive and parametric sufficiencies). Given an infinitely exchangeable sequence of random quantities x₁, x₂, …, admitting a finitely parametrised mixture representation, the sequence of statistics t₁, t₂, … is predictive sufficient if, and only if, it is parametric sufficient.

Heuristic proof. For any x₁, …, xₘ and any sequence of statistics t₁, t₂, …, with tₘ = tₘ(x₁, …, xₘ), the representation theorem implies that

p(xₘ₊₁, …, xₙ | x₁, …, xₘ) = ∫ ∏ᵢ₌ₘ₊₁ⁿ p(xᵢ | θ) dQ(θ | x₁, …, xₘ),

which, conditioning instead on the event A = {(x₁, …, xₘ); tₘ(x₁, …, xₘ) = tₘ}, can be easily shown to be expressible as

p(xₘ₊₁, …, xₙ | tₘ) = ∫ ∏ᵢ₌ₘ₊₁ⁿ p(xᵢ | θ) dQ(θ | tₘ).

It follows that p(xₘ₊₁, …, xₙ | x₁, …, xₘ) = p(xₘ₊₁, …, xₙ | tₘ), for all m and n, if and only if dQ(θ | x₁, …, xₘ) = dQ(θ | tₘ) for all dQ(θ). ∎

To make further progress, we now establish that parametric sufficiency is itself equivalent to certain further conditions on the probability structure.

Proposition 4.10. (Neyman factorisation criterion). The sequence of statistics t₁, t₂, … is parametric sufficient for the infinitely exchangeable sequence x₁, x₂, …, whose joint density, given θ, has the form p(x₁, …, xₙ | θ) = ∏ᵢ₌₁ⁿ p(xᵢ | θ), if, and only if, for any n,

p(x₁, …, xₙ | θ) = hₙ(tₙ, θ) g(x₁, …, xₙ)

for some functions hₙ > 0, g > 0.
Outline proof. Given such a factorisation, for any dQ(θ) we have

dQ(θ | x₁, …, xₙ) = p(x₁, …, xₙ | θ) dQ(θ) / p(x₁, …, xₙ) = hₙ(tₙ, θ) dQ(θ) / ∫ hₙ(tₙ, θ) dQ(θ) = dQ(θ | tₙ),

so that t₁, t₂, … is parametric sufficient. Conversely, given parametric sufficiency, we have, for any dQ(θ) with support Θ, dQ(θ | x₁, …, xₙ) = dQ(θ | tₙ), and hence

p(x₁, …, xₙ | θ) / p(x₁, …, xₙ) = p(tₙ | θ) / p(tₙ),

so that p(x₁, …, xₙ | θ) = hₙ(tₙ, θ) g(x₁, …, xₙ), with hₙ(tₙ, θ) = p(tₙ | θ) and g(x₁, …, xₙ) = p(x₁, …, xₙ)/p(tₙ). ∎

Proposition 4.11. (Sufficiency and conditional independence). The sequence t₁, t₂, … is parametric sufficient for the infinitely exchangeable sequence x₁, x₂, … if, and only if, for any n ≥ 1, p(x₁, …, xₙ | tₙ, θ) is independent of θ.

Outline proof. If p(x₁, …, xₙ | tₙ, θ) is independent of θ, then

p(x₁, …, xₙ | θ) = p(x₁, …, xₙ | tₙ) p(tₙ | θ),

which, by Proposition 4.10, establishes parametric sufficiency. Conversely, given parametric sufficiency, we have, by Proposition 4.10, p(x₁, …, xₙ | θ) = hₙ(tₙ, θ) g(x₁, …, xₙ), for some hₙ > 0, g > 0. Integrating over all values {x₁, …, xₙ} such that tₙ(x₁, …, xₙ) = tₙ, we obtain

p(tₙ | θ) = hₙ(tₙ, θ) G(tₙ), for some G > 0.

Substituting for hₙ(tₙ, θ) in the expression for p(x₁, …, xₙ | θ), we obtain

p(x₁, …, xₙ | tₙ, θ) = p(x₁, …, xₙ | θ)/p(tₙ | θ) = g(x₁, …, xₙ)/G(tₙ),

which is independent of θ. ∎
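Parametric sufficiency can be illustrated numerically: in the Bernoulli case treated in Example 4.5 below, the posterior computed from the full 0-1 sequence coincides exactly with that computed from tₙ = (n, x₁ + ⋯ + xₙ). The sketch below (Python; the three-point discrete prior is an arbitrary illustration) verifies this with exact rational arithmetic.

```python
from fractions import Fraction

# Discrete illustrative prior over three values of theta.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3
xs = [1, 0, 1, 1, 0, 1]
n, s = len(xs), sum(xs)

def normalise(ws):
    tot = sum(ws)
    return [w / tot for w in ws]

# (i) condition on the full sequence, observation by observation:
full_lik = []
for th in thetas:
    L = Fraction(1)
    for x in xs:
        L *= th if x == 1 else 1 - th
    full_lik.append(L)
post_full = normalise([q * L for q, L in zip(prior, full_lik)])

# (ii) condition only on the sufficient statistic (n, s):
post_suff = normalise([q * th ** s * (1 - th) ** (n - s)
                       for q, th in zip(prior, thetas)])
```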
In the approach we have adopted, the definitions and consequences of predictive and parametric sufficiency have been motivated and examined within the general framework of seeking to find coherent representations of subjective beliefs about sequences of observables. From an operational, subjectivist point of view, it seems to us rather mysterious to launch into fundamental definitions about learning processes expressed in terms of conditioning on "parameters" having no status other than as "labels"; the notion of parametric sufficiency has therefore so far only been put forward within the context of exchangeable beliefs, where the operational significance of "parameter" typically becomes clear from the relevant representation theorem. However, as the reader familiar with more "conventional" approaches will have already realised, related concepts of "sufficiency" are also central to non-subjectivist theories. In particular, we note that the non-dependence of the density p(x₁, …, xₙ | tₙ, θ) on θ, established here in Proposition 4.11 as a consequence of our definitions, was itself put forward as the definition of a "sufficient statistic" by Fisher (1922), and the factorisation given in Proposition 4.10 was established by Neyman (1935) as equivalent to the Fisher definition. It is perhaps reassuring, from a technical point of view, that our representation for exchangeable sequences provides a justification for regarding the usual (Fisher) definition as equivalent to predictive and parametric sufficiency: we can thus exploit many of the important mathematical results which have been established using that definition as a starting point.
In the general development which follows, we shall mainly be interested in asking the following questions. When is it coherent to act as if there is a sequence of predictive sufficient statistics associated with an exchangeable sequence of random quantities? What forms of predictive probability model are implied in cases where we can assume a sequence of predictive sufficient statistics? Aside from these foundational and modelling questions, the results given above also enable us to check the form of the predictive sufficient statistics for any given exchangeable representation. We shall illustrate this possibility with some simple examples before continuing with the general development.

Example 4.5. (Bernoulli model). We recall from Proposition 4.1 that if x₁, x₂, … is an infinitely exchangeable sequence of 0-1 random quantities, then we have the general representation

p(x₁, …, xₙ) = ∫₀¹ ∏ᵢ₌₁ⁿ θ^{xᵢ}(1 − θ)^{1−xᵢ} dQ(θ),

where θ = limₙ yₙ/n, with yₙ = x₁ + ⋯ + xₙ. Defining tₙ = tₙ(x₁, …, xₙ) = [n, yₙ] and noting that we can write

p(x₁, …, xₙ | θ) = θ^{yₙ}(1 − θ)^{n−yₙ},

it follows from Propositions 4.9 and 4.10, with hₙ(tₙ, θ) = θ^{yₙ}(1 − θ)^{n−yₙ} and g(x₁, …, xₙ) = 1, that t₁, t₂, … defines a sequence of predictive and parametric sufficient statistics for x₁, x₂, … . This corresponds precisely to the intuitive idea that the sequence length and total number of 1's summarise all the interesting information in any sequence of observed exchangeable 0-1 random quantities. Of course, the sufficient statistic is not unique: since tₙ = [n, yₙ] is sufficient, we could equally well define, for example, tₙ = [n, yₙ/n] as the sequence of sufficient statistics.

Example 4.6. (Normal model). We recall from Proposition 4.5 that if x₁, x₂, … is an exchangeable sequence of real-valued random quantities with the additional property of centred spherical symmetry, then we have the general representation

p(x₁, …, xₙ) = ∫∫ ∏ᵢ₌₁ⁿ N(xᵢ | μ, λ) dQ(μ, λ),

where μ = limₙ x̄ₙ and λ⁻¹ = limₙ n⁻¹ Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)². Defining tₙ = tₙ(x₁, …, xₙ) = [n, x̄ₙ, sₙ²], where x̄ₙ = n⁻¹(x₁ + ⋯ + xₙ) and sₙ² = n⁻¹ Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)², and noting that we can write
p(x₁, …, xₙ | μ, λ) = (λ/2π)^{n/2} exp{ −½λ[nsₙ² + n(x̄ₙ − μ)²] },

inspection of Propositions 4.10 and 4.11 reveals that tₙ = [n, x̄ₙ, sₙ²] defines a sequence of predictive and parametric sufficient statistics for x₁, x₂, … . In view of the centring and spherical symmetry conditions, it is perhaps not surprising that the sample size, mean and sample mean sum of squares about the mean turn out to be sufficient summaries.

Example 4.7. (Exponential model). We recall from Proposition 4.7 that if x₁, x₂, … is an exchangeable sequence of positive real-valued random quantities with an additional "origin invariance" property, then we have the general representation

p(x₁, …, xₙ) = ∫ θⁿ exp{ −θ(x₁ + ⋯ + xₙ) } dQ(θ),

where θ⁻¹ = limₙ (x₁ + ⋯ + xₙ)/n.
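Since the normal likelihood depends on the data only through tₙ = [n, x̄ₙ, sₙ²], any two samples sharing these values must yield identical likelihoods for every (μ, λ). A small numerical check (Python; the two samples are constructed purely for illustration):

```python
import math

def normal_loglik(xs, mu, lam):
    # log of prod_i N(x_i | mu, lam); depends on the data only through
    # n, the sample mean and the sum of squares about the mean
    n = len(xs)
    return 0.5 * n * math.log(lam / (2 * math.pi)) \
        - 0.5 * lam * sum((x - mu) ** 2 for x in xs)

# Two different samples sharing the same t_n = [n, mean, s_n^2]:
a = [0.0, 1.0, 1.0, 2.0]            # n = 4, mean 1, sum of squares about mean 2
r = math.sqrt(0.5)
b = [1 - r, 1 - r, 1 + r, 1 + r]    # same n, mean and s_n^2

gaps = [abs(normal_loglik(a, mu, lam) - normal_loglik(b, mu, lam))
        for mu, lam in [(0.0, 1.0), (1.5, 0.3), (-2.0, 4.0)]]
```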
Again, it is immediate from Propositions 4.10 and 4.11 that tₙ = [n, x₁ + ⋯ + xₙ] defines a sequence of predictive and parametric sufficient statistics. However, in this example, there is not such an obvious link between the form of invariance assumed and the form of the sufficient statistic.

It is clear from the general definition of a sufficient statistic (parametric or predictive) that tₙ = [n, x₁, …, xₙ] is always a sufficient statistic. However, given our interest in achieving simplification through data reduction, it is equally clear that we should like to focus on sufficient statistics which are, in some sense, minimal. This motivates the following definition.

Definition 4.9. (Minimal sufficient statistic). Given random quantities x₁, x₂, …, where xᵢ takes values in Xᵢ, i = 1, 2, …, the sequence of statistics t₁, t₂, …, with tₙ defined on X₁ × ⋯ × Xₙ, is minimal sufficient for x₁, x₂, … if, given any other sequence of sufficient statistics s₁, s₂, …, there exist functions g₁(·), g₂(·), … such that tₙ = gₙ(sₙ), n = 1, 2, … .

It is easily seen that the forms of tₙ identified in Examples 4.5 to 4.7 are minimal sufficient statistics. From now on, references to sufficient statistics should be interpreted as intending minimal sufficient statistics. Finally, to avoid tedious repetition, since n very often appears as part of the sufficient statistic, we shall sometimes, without risk of confusion, omit explicit mention of n and refer to the "interesting function(s) of x₁, …, xₙ" as the sufficient statistic.

4.5.3 Sufficiency and the Exponential Family

In the previous section, we identified some further potential structure in the general representation of joint densities for exchangeable random quantities when predictive sufficiency is assumed. We shall now take this process a stage further by examining in detail representations relating to sufficient statistics of fixed dimension.
Since we have established, in the finite parameter framework, the equivalence of predictive and parametric sufficiency for the case of exchangeable random quantities, and their equivalence with the factorisation criterion of Proposition 4.10, we shall from now on simply use the term sufficient statistic, without risk of confusion. We begin by considering exchangeable beliefs constructed by mixing, with respect to some dQ(θ), over a specified parametric form labelled by θ ∈ Θ ⊆ ℝ, where θ is a one-dimensional parameter. An important class of such p(x | θ) is identified in the following definition.

Definition 4.10. (One-parameter exponential family). A probability density (or mass function) p(x | θ), labelled by θ ∈ Θ ⊆ ℝ, is said to belong to the one-parameter exponential family if it is of the form

p(x | θ) = Ef(x | f, g, h, φ, c) = f(x) g(θ) exp{ c φ(θ) h(x) }, x ∈ X,

for given functions f, g, h, φ and constant c. The family is called regular if X does not depend on θ; otherwise it is called non-regular.

Proposition 4.12. (Sufficient statistics for the one-parameter exponential family). If x₁, x₂, … is an exchangeable sequence such that, for some dQ(θ),

p(x₁, …, xₙ) = ∫ ∏ᵢ₌₁ⁿ Ef(xᵢ | f, g, h, φ, c) dQ(θ), n = 1, 2, …,

then tₙ = tₙ(x₁, …, xₙ) = [n, h(x₁) + ⋯ + h(xₙ)] is a sequence of sufficient statistics.

Proof. This follows immediately from Proposition 4.10, on noting that p(x₁, …, xₙ | θ) factors into

[g(θ)]ⁿ exp{ c φ(θ)(h(x₁) + ⋯ + h(xₙ)) } and ∏ᵢ₌₁ⁿ f(xᵢ). ∎
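As a numerical check of this factorisation (Python), the Poisson distribution, one of the standard examples listed below, factors exactly as in the proof; the data and parameter value here are arbitrary illustrations.

```python
import math

# Poisson written in Ef form: f(x) = 1/x!, g(theta) = exp(-theta),
# h(x) = x, phi(theta) = log(theta), c = 1.
def pois_pmf(x, theta):
    return theta ** x * math.exp(-theta) / math.factorial(x)

def factored_joint(xs, theta):
    # [g(theta)]^n exp{c phi(theta) sum_i h(x_i)} prod_i f(x_i)
    n, t = len(xs), sum(xs)
    f_prod = math.prod(1.0 / math.factorial(x) for x in xs)
    return math.exp(-theta) ** n * math.exp(math.log(theta) * t) * f_prod

xs, theta = [3, 0, 2, 5, 1], 1.7
direct = math.prod(pois_pmf(x, theta) for x in xs)
factored = factored_joint(xs, theta)
```

Note that the factored form touches the data only through n and Σᵢ xᵢ, which is exactly why [n, Σᵢ h(xᵢ)] is sufficient.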
The following standard univariate probability distributions are particular cases of the (regular) one-parameter exponential family with the appropriate choices of f, g, h, φ and c.

Bernoulli

p(x | θ) = Br(x | θ) = θˣ(1 − θ)^{1−x}, x ∈ {0, 1}, θ ∈ [0, 1],
with f(x) = 1, g(θ) = 1 − θ, h(x) = x, φ(θ) = log[θ/(1 − θ)], c = 1.

Poisson

p(x | θ) = Pn(x | θ) = θˣ e^{−θ}/x!, x ∈ {0, 1, 2, …}, θ ∈ ℝ⁺,
with f(x) = 1/x!, g(θ) = e^{−θ}, h(x) = x, φ(θ) = log θ, c = 1.

We note that the term cφ(θ) appearing in the general Ef(·) form could always be simply written as φ(θ), with φ suitably defined; however, it is often convenient to be able to separate the "interesting" function of θ, φ(θ), from the constant which happens to multiply it. In Definition 4.10, we allowed for the possibility (the non-regular case) that the range, X, of possible values of x might itself depend on the labelling parameter θ. Although we have not yet made a connection between this case and forms of representation arising in the modelling of exchangeable sequences, it will be useful at this stage to note examples of the well-known forms of distribution which are covered by this definition. We shall indicate later how the use of such forms in the modelling process might be given a subjectivist justification.

Uniform

p(x | θ) = Un(x | 0, θ) = θ⁻¹, x ∈ (0, θ), θ ∈ ℝ⁺,
with f(x) = 1, g(θ) = θ⁻¹, h(x) = 0, φ(θ) = 0.
Shifted exponential

p(x | θ) = Shex(x | θ) = exp[−(x − θ)], x > θ, θ ∈ ℝ,
with f(x) = e^{−x}, g(θ) = e^{θ}, h(x) = 0, φ(θ) = 0.

For the uniform, if we rewrite the joint density in the form

p(x₁, …, xₙ | θ) = θ^{−n} 1_{(0,θ)}(max{x₁, …, xₙ}) 1_{(0,∞)}(min{x₁, …, xₙ}),

it then follows immediately from Proposition 4.10 that, for n = 1, 2, …, tₙ = [n, max{x₁, …, xₙ}] provides a sequence of sufficient statistics. For the shifted exponential, if we rewrite the joint density in the form

p(x₁, …, xₙ | θ) = e^{nθ} exp{ −(x₁ + ⋯ + xₙ) } 1_{(θ,∞)}(min{x₁, …, xₙ}),

a similar argument shows that tₙ = [n, min{x₁, …, xₙ}] is a sequence of sufficient statistics in this case. The above discussion readily generalises to the case of exchangeable sequences generated by mixing over specified parametric forms involving a k-dimensional parameter θ; in order to identify sequences of sufficient statistics in these and similar cases, we again make use of the factorisation criterion given in Proposition 4.10.
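A quick numerical illustration of the non-regular case (Python; the samples are chosen purely for illustration): for the uniform, any two samples with the same size and maximum have identical likelihood functions in θ.

```python
# For the (non-regular) uniform Un(x | 0, theta), the likelihood is
# theta^{-n} 1(max_i x_i < theta): it depends on x_1, ..., x_n only
# through (n, max), which is therefore sufficient.
def uniform_likelihood(xs, theta):
    return theta ** (-len(xs)) if max(xs) < theta else 0.0

a = [0.3, 2.1, 1.4, 0.9]
b = [2.1, 2.1, 0.05, 1.0]    # a different sample with the same n and maximum

same = all(uniform_likelihood(a, th) == uniform_likelihood(b, th)
           for th in [0.5, 1.0, 2.0, 2.1, 2.5, 10.0])
```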
Definition 4.11. (k-parameter exponential family). A probability density (or mass function) p(x | θ), x ∈ X, which is labelled by θ ∈ Θ ⊆ ℝᵏ, is said to belong to the k-parameter exponential family if it is of the form

p(x | θ) = Efₖ(x | f, g, h, φ, c) = f(x) g(θ) exp{ Σⱼ₌₁ᵏ cⱼ φⱼ(θ) hⱼ(x) }, x ∈ X,

given the functions h = (h₁, …, hₖ), φ = (φ₁, …, φₖ) and the constants c = (c₁, …, cₖ). The family is called regular if X does not depend on θ; otherwise it is called non-regular.

Proposition 4.13. (Sufficient statistics for the k-parameter exponential family). If x₁, x₂, … is an exchangeable sequence such that, for some dQ(θ), given regular k-parameter Efₖ(·),

p(x₁, …, xₙ) = ∫ ∏ᵢ₌₁ⁿ Efₖ(xᵢ | f, g, h, φ, c) dQ(θ),

then

tₙ = [n, Σᵢ₌₁ⁿ h₁(xᵢ), …, Σᵢ₌₁ⁿ hₖ(xᵢ)]

is a sequence of sufficient statistics.

Proof. This is analogous to Proposition 4.12 and is a straightforward consequence of Proposition 4.10. ∎

The following standard probability distributions are particular cases (the first regular, the second non-regular) of the k-parameter exponential family with the appropriate choices of f, g, h, φ and c.

Normal (unknown mean and variance)

p(x | θ) = p(x | μ, λ) = N(x | μ, λ), θ = (μ, λ) ∈ ℝ × ℝ⁺. In this case, k = 2 and

N(x | μ, λ) = (λ/2π)^{1/2} exp{ −½λμ² } exp{ λμx − ½λx² },

so that f(x) = 1, g(θ) = (λ/2π)^{1/2} exp{−½λμ²}, φ(θ) = (λμ, λ), h(x) = (x, x²), c = (1, −½), and tₙ = [n, Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ xᵢ²] is a sequence of sufficient statistics.

Uniform (over the interval [θ₁, θ₂])

p(x | θ) = Un(x | θ₁, θ₂) = (θ₂ − θ₁)⁻¹, x ∈ [θ₁, θ₂], θ = (θ₁, θ₂), θ₁ < θ₂, θ₁, θ₂ ∈ ℝ. In this case, k = 2 and tₙ = [n, min{x₁, …, xₙ}, max{x₁, …, xₙ}] is easily seen to give a sequence of sufficient statistics.

The description of the exponential family forms given in Definitions 4.10 and 4.11 is convenient for some purposes (relating straightforwardly to familiar versions of parametric families), but somewhat cumbersome for others. This motivates the following definition, which we give for the general k-parameter case.

Definition 4.12. (Canonical exponential family). The probability density (or mass function) derived from Efₖ(·) via the transformations y = (y₁, …, yₖ), yᵢ = cᵢhᵢ(x), and ψ = (ψ₁, …, ψₖ), ψᵢ = φᵢ(θ), i = 1, …, k, is called the canonical form of representation of the exponential family, and is written

p(y | ψ) = Cef(y | a, b, ψ) = a(y) exp{ yᵗψ − b(ψ) }.
Systematic use of this canonical form to clarify the nature of the Bayesian learning process will be presented in Section 5.2. Here, we shall use it to examine briefly the nature and interpretation of the function b(ψ), and to identify the distribution of sums of independent Cef random quantities.

Proposition 4.14. (Sufficiency in the canonical exponential family). If y₁, …, yₙ are independent Cef(y | a, b, ψ) random quantities, then s = y₁ + ⋯ + yₙ is a sufficient statistic and has a distribution Cef(s | a⁽ⁿ⁾, nb, ψ), where a⁽ⁿ⁾ is the n-fold convolution of a, and a⁽ⁿ⁾ satisfies

nb(ψ) = log ∫ a⁽ⁿ⁾(s) exp{ ψᵗs } ds.

Proof. Sufficiency is immediate from Proposition 4.10. It is easy to verify that the characteristic function of y conditional on ψ is given by

E(exp{iuᵗy} | ψ) = exp{ b(iu + ψ) − b(ψ) }.

We see immediately that the characteristic function of s is exp{ nb(iu + ψ) − nb(ψ) }. Examination of the density convolution form for n = 2, plus induction, establishes the form of a⁽ⁿ⁾, so that the distribution of s is as claimed. ∎

Moreover, classical results of Darmois (1936), Koopman (1936), Pitman (1936), Hipp (1974) and Huzurbazar (1976) establish, under various regularity conditions, that the exponential family is the only family of distributions for which sufficient statistics of fixed dimension exist.

Proposition 4.15. (First two moments of the canonical exponential family). For y in Definition 4.12,

E(y | ψ) = ∇b(ψ),  V(y | ψ) = ∇²b(ψ).

Proof. This follows straightforwardly from the form of the characteristic function given in the proof of Proposition 4.14. ∎

Our discussion thus far has considered the situation where exchangeable belief distributions are constructed by assuming a mixing over finite-parameter exponential family forms.
In the second part of this subsection, we shall consider, instead, whether characterisations of exchangeable sequences can be established via assumptions about conditional distributions, which imply that the mixing must be over exponential family forms. Previously, in Section 4.4, we considered particular invariance assumptions, which, together with exchangeability, identified the parametric forms that had to appear in the mixture representation; here, the structural assumptions are motivated by sufficiency ideas.

As a preliminary, suppose for a moment that an exchangeable sequence y₁, y₂, … is modelled by

p(y₁, …, yₙ) = ∫ ∏ᵢ₌₁ⁿ Cef(yᵢ | a, b, ψ) dQ(ψ).

Now consider the form of p(y₁, …, yₖ | y₁ + ⋯ + yₙ = s), for k < n. Because of exchangeability, this has a representation as a mixture over ψ. But the latter does not involve ψ, because of the sufficiency of y₁ + ⋯ + yₙ (Propositions 4.11 and 4.14). The exponential family mixture representation thus implies that, for all n and k < n,

p(y₁, …, yₖ | y₁ + ⋯ + yₙ = s) = ∏ᵢ₌₁ᵏ a(yᵢ) a⁽ⁿ⁻ᵏ⁾(s − sₖ)/a⁽ⁿ⁾(s),

where sₖ = y₁ + ⋯ + yₖ.

Now suppose we consider the converse. If we assume y₁, y₂, … to be exchangeable and also assume that, for all n and k < n, the conditional distributions have the above form (for some a defining a Cef(y | a, b, ψ) form), does this imply that p(y₁, …, yₙ) has the corresponding exponential family mixture form? A rigorous mathematical discussion of this question is beyond the scope of this volume (see Diaconis and Freedman, 1990). However, with considerable licence in ignoring regularity conditions, the result and the "flavour" of a proof are given by the following.
4.5 Models via Sufficient Statistics 205

Proposition 4.16. (Representation theorem under sufficiency). If y1, y2, ..., is any exchangeable sequence such that, for all n ≥ 2 and k < n,

p(y1, ..., yk | y1 + ... + yn = s) = ∏_{i=1}^{k} a(yi) a^(n−k)(s − sk) / a^(n)(s),

where sk = y1 + ... + yk, then

p(y1, ..., yn) = ∫ ∏_{i=1}^{n} Cef(yi | a, b, ψ) dQ(ψ),

for some dQ(ψ).

Outline proof. We first note that exchangeability implies a mixture representation, mixing over distributions which make the yi independent, with densities denoted generically by f. Independence implies that

p(y1, ..., yk | y1 + ... + yn = s) = ∏_{i=1}^{k} f(yi) f^(n−k)(s − sk) / f^(n)(s),

where f denotes the marginal density and f^(n) its n-fold convolution. Each of the latter distributions f(·) must therefore satisfy, for n = 2, k = 1,

f(y1) f(s − y1) / f^(2)(s) = a(y1) a(s − y1) / a^(2)(s),

where f^(2)(·) is the two-fold convolution of f. If we now define

u(y1) = log [f(y1)/a(y1)] − log [f(0)/a(0)],

and noting that u(0) = 0, then, setting y2 = s − y1, we obtain

u(y1) + u(s − y1) = u(s), hence u(y1) + u(y2) = u(y1 + y2).

This implies that u(y) = ψ'y, for some ψ, so that

f(y) = a(y) exp{ψ'y − b(ψ)},

so that f(·) defines Cef(y | a, b, ψ). Distributions of this form, mixed over ψ, themselves imply an exchangeable sequence with the required conditional distributions, and the representation follows.

The following example provides a concrete illustration of the general result.
206 4 Modelling

Example 4.8. (Characterisation of the Poisson model). Suppose that the sequence y1, y2, ..., of non-negative integer-valued random quantities is judged exchangeable, with the conditional distribution of y = (y1, ..., yk), given y1 + ... + yn = s, specified to be the multinomial Mu_k(y | s, n^{−1}, ..., n^{−1}), so that

p(y1, ..., yk | y1 + ... + yn = s) = [s! / (y1! ... yk! (s − m)!)] (1/n)^m (1 − k/n)^{s−m},

where m = y1 + ... + yk. Noting that the Poisson distribution Pn(y | λ), which has density e^{−λ} λ^y / y!, can be written in Cef form as

Pn(y | λ) = (1/y!) exp{ψy − e^ψ} = a(y) exp{ψy − b(ψ)},

with ψ = log λ, a(y) = 1/y! and b(ψ) = e^ψ, and that, in terms of a(·), the n-fold convolution is a^(n)(s) = n^s/s!, it is easily checked that the specified conditional distributions are exactly of the form required by Proposition 4.16. It follows that the belief specification for y1, y2, ..., is coherent and implies that

p(y1, ..., yn) = ∫ ∏_{i=1}^{n} Pn(yi | λ) dQ(λ),

for some dQ(λ).

As we remarked earlier, the above heuristic analysis and discussion for the 1-parameter regular exponential family has been given without any attempt at rigour. For the full story, the reader is referred to Diaconis and Freedman (1990). Other relevant references for the mathematics of exponential families include Barndorff-Nielsen (1978), Morris (1982) and Brown (1985).

We conclude this subsection by considering, briefly and informally, what can be said about characterisations of exchangeable sequences as mixtures of non-regular exponential families. For concreteness, we shall focus on the uniform distribution U(x | 0, θ), θ > 0, which has density θ^{−1} for 0 < x < θ, and sufficient statistic max{x1, ..., xn}. This sufficient statistic is clearly not a summation, as is the case for regular families (and as plays a key role in Proposition 4.16). However, given a sample x1, ..., xn from U(x | 0, θ), it is straightforward to check that, conditional on m = max{x1, ..., xn}, the random quantities x1, ..., xk, k < n, are approximately independent U(x | 0, m); this will therefore be true for any exchangeable sequence constructed by mixing over independent U(x | 0, θ) sequences. Conversely, we might wonder whether positive exchangeable sequences having this conditional property are necessarily mixtures of independent uniforms.
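The Poisson characterisation can also be illustrated in the forward direction: for independent Poisson sampling, the conditional distribution of one coordinate given the total is binomial with probability 1/n, whatever the value of λ. A small exact numerical check (our own sketch; the function names and the values of λ, n, s are illustrative):

```python
import math

def pois(k, lam):
    # Poisson probability Pn(k | lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

def conditional_on_sum(lam, n, s):
    # p(y1 = k | y1 + ... + yn = s) for iid Poisson(lam) terms;
    # y2 + ... + yn is itself Poisson((n - 1) * lam)
    num = [pois(k, lam) * pois(s - k, (n - 1) * lam) for k in range(s + 1)]
    z = sum(num)
    return [p / z for p in num]

n, s = 4, 6
# the claimed conditional law: binomial with s trials and probability 1/n
binom = [math.comb(s, k) * (1 / n) ** k * (1 - 1 / n) ** (s - k)
         for k in range(s + 1)]

cond_a = conditional_on_sum(3.2, n, s)   # one value of lam
cond_b = conditional_on_sum(0.5, n, s)   # a different lam gives the same answer
```

The conditional distribution matches the binomial form and does not depend on λ, which is exactly why the conditional specification in the example carries no information about the mixing distribution.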
4.5 Models via Sufficient Statistics 207

This is indeed the case: intuitively, since mn = max{x1, ..., xn} tends, as n → ∞, to a finite limit θ from below, one might expect the result to be true. The interested reader is referred to Diaconis and Freedman (1984), and to the further references discussed in Section 4.8, but a general account of the required mathematical results is beyond our intended scope in this volume.

4.5.4 Information Measures and the Exponential Family

Our approach to the exponential family has been through the concept of predictive or, equivalently, parametric sufficient statistics. It is interesting to note, however, that exponential family distributions can also be motivated through the concept of the utility of a distribution (c.f. Section 3.4), using the derived notions of approximation and discrepancy.

Consider the following problem. We seek to obtain a mathematical representation of a probability density p(x), x ∈ X, which satisfies the k (independent) constraints

∫ hi(x) p(x) dx = mi,  i = 1, ..., k,

where m1, ..., mk are specified constants, and which, in addition, is to be approximated as closely as possible by a specified density f(x). We recall from Definition 3.20 (with a convenient change of notation) that the discrepancy from a probability density p(x), assumed to be true, of an approximation f(x) is given by

δ(f | p) = ∫ p(x) log [p(x)/f(x)] dx,

where f and p are both assumed to be strictly positive densities over the same range. Note that we are interested in deriving a mathematical representation of the true probability density p(x), not of the (specified) approximation f(x). Thus, we minimise δ(f | p) over p subject to the required constraints on p, rather than δ(f | p) over f subject to constraints on f. Hence, incorporating the normalising constraint ∫ p(x) dx = 1, we seek p to minimise

F(p) = ∫ p(x) log [p(x)/f(x)] dx + Σ_{i=1}^{k} θi ∫ hi(x) p(x) dx + c ∫ p(x) dx,

where θ1, ..., θk and c are arbitrary constant multipliers.
208 4 Modelling

Proposition 4.17. (The exponential family as an approximation). The functional F(p) defined above is minimised by

p(x) = Efk(x | f, h, θ, c) = f(x) exp{Σ_{i=1}^{k} θi hi(x) + c},

where h(x) = [h1(x), ..., hk(x)], and the multipliers θ = (θ1, ..., θk) and c are determined by the k constraints together with the requirement that p be a probability density.

Proof. By a standard variational argument (see, for example, Jeffreys and Jeffreys, 1946, Chapter 10), a necessary condition for p to give a stationary value of F(p) is that, for any function τ of sufficiently small norm,

(∂/∂ε) F(p(x) + ετ(x)) |_{ε=0} = 0,

from which it follows that p has the stated form, as required. (For an alternative proof, see Kullback, 1959/1968, Chapter 3.)

The resulting exponential family form for p(x) was thus derived on the basis of a given approximation f(x) and a collection of "constant" functions h(x) = [h1(x), ..., hk(x)]. If we wish to emphasise this derivation of the family, we shall refer to Efk(x | f, h, θ, c) as the exponential family generated by f and h. In general, specification of the sufficient statistic alone does not uniquely identify the form of p(x) within the exponential family framework. Consider, for example, the Ga(x | α, β) family with α known: each distinct α defines a distinct exponential family, all sharing the sufficient statistic Σ xj.
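Proposition 4.17 can be checked numerically on a discrete carrier. The sketch below is ours, not part of the original development: a four-point support, a single constraint function h(x) = x and the particular f are illustrative assumptions. It solves for the multiplier by bisection, and verifies that perturbing p within the constraint set can only increase the discrepancy δ(f | p):

```python
import math

xs = [0, 1, 2, 3]
f = [0.1, 0.4, 0.3, 0.2]        # the specified "approximation" f(x)
h = [0, 1, 2, 3]                # one constraint function, h(x) = x
m = 1.2                         # required value of the expectation of h under p

def p_theta(t):
    # candidate minimiser: p(x) proportional to f(x) * exp(t * h(x))
    w = [fi * math.exp(t * hi) for fi, hi in zip(f, h)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_h(q):
    return sum(qi * hi for qi, hi in zip(q, h))

# solve mean_h(p_theta(t)) = m for t by bisection (the mean increases with t)
lo, hi_ = -20.0, 20.0
for _ in range(200):
    mid = (lo + hi_) / 2
    if mean_h(p_theta(mid)) < m:
        lo = mid
    else:
        hi_ = mid
p = p_theta(lo)

def discrepancy(q):
    # delta(f | q) = sum of q * log(q / f)
    return sum(qi * math.log(qi / fi) for qi, fi in zip(q, f))

# perturb p while preserving both constraints: sum(v) = 0 and sum(v * h) = 0
v = [1.0, -2.0, 1.0, 0.0]
q = [pi + 0.01 * vi for pi, vi in zip(p, v)]
```

Any admissible perturbation strictly increases δ(f | p), consistent with p of the stated exponential family form being the constrained minimiser.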
4.6 Models via Partial Exchangeability 209

Returning to the general problem of choosing p to be "as close as possible" to an "approximation" f, subject to the k constraints defined by h(x), it is interesting to ask what happens if the approximation f is very "vague", in the sense that f is extremely diffusely spread over X. A limiting form of this would be to consider f(x) = constant, which leads us to seek the p minimising ∫ p(x) log p(x) dx subject to the given constraints. The solution is then

p(x) = exp{Σ_{i=1}^{k} θi hi(x)} / ∫_X exp{Σ_{i=1}^{k} θi hi(x)} dx,

which is the so-called maximum entropy choice of p, since minimising ∫ p(x) log p(x) dx is equivalent to maximising the entropy

H(p) = −∫ p(x) log p(x) dx.

Thus, for example, if X = ℜ+ and h(x) = x, the maximum entropy choice for p(x) is Ex(x | φ), the exponential distribution with φ^{−1} = E(x | φ); if, in addition to h(x) = x, we wish to arrive at the Ga(x | α, β) family, we need to specify f(x) = x^{α−1}/Γ(α) in order to identify the family. If X = ℜ and h(x) = (x, x²), the maximum entropy choice for p(x) turns out to be N(x | μ, λ), the normal distribution with μ = E(x | μ, λ) and λ^{−1} = V(x | μ, λ).

In the next section, we shall extend our discussion in order to relate these ideas to the more complex situations which arise when several sequences of observations are involved.

4.6 MODELS VIA PARTIAL EXCHANGEABILITY

4.6.1 Models for Extended Data Structures

In Section 4.5, we discussed various kinds of justification for modelling a sequence of random quantities x1, x2, ..., judged to have various kinds of invariance or sufficiency properties, as a random sample from a parametric family with density p(x | θ), together with a prior distribution dQ(θ) for θ. We also briefly mentioned further possible kinds of judgements, involving assumptions about conditional moments or information considerations, which further help to pinpoint the appropriate specification of a parametric family. In order to concentrate on the basic conceptual issues, however, we have thus far restricted attention to the case of a single sequence of random quantities.
Our discussion of modelling has so far concentrated on the case of beliefs about a single sequence of observations x1, x2, ..., labelled by a single index and unrelated to other random quantities. However, in many (if not most) areas of application of statistical modelling the situation will be more complicated than this, either because several such sequences of observations are involved, or because there are several possible ways of making exchangeable or related judgements about the sequences. We shall need to extend and adapt the basic form of representation to deal with the perceived complexities of the situation.
210 4 Modelling

Among the typical (but by no means exhaustive) kinds of situation we shall wish to consider are the following.

(i) Sequences xi1, xi2, ..., of random quantities are to be observed in each of i ∈ I contexts. For example: we may have sequences of clinical responses to each of I different drugs, or responses to the same drug used on I different subgroups of a population. A modelling framework is required which enables us to learn, in some sense, about differences between some aspect of the responses in the different sequences.

(ii) In each of i ∈ I contexts, j ∈ J different treatments are each replicated k ∈ K times, with xijk denoting the observable response for each context/treatment/replicate combination. For example: we may have I different irrigation systems for fruit trees, J different tree pruning regimes and K trees exposed to each irrigation/pruning combination, with xijk denoting the total yield of fruit in a given year; or I different geographical areas, J different age-groups and K individuals in each of the IJ combinations, with xijk denoting the presence or absence of a specific type of disease, or a coding of voting intention, or whatever. A modelling framework is required which enables us to investigate differences between either contexts, or treatments, or context/treatment combinations.

(iii) Sequences xi1, xi2, ..., of random quantities are to be observed, for i ∈ I, where some form of qualitative assumption has been made about a relationship between the xij and other specified (controlled or observed) quantities zi = (zi1, ..., zik). For example: xij = 1 or 0 might denote the status (dead or alive) of the jth rat exposed to a toxic substance administered at dose level zi, with an assumed form of relationship between zi and the corresponding "death rate"; or xij might denote the height or weight at time zj of a plant or animal following some assumed form of "growth curve"; or xij might denote the output yield on the jth run of a chemical process when k inputs are set at the levels zi = (zi1, ..., zik), and the general form of relationship between process output and inputs is either assumed known or well-approximated by a specified mathematical form. In each case, a modelling framework is required which enables us to learn about the quantitative form of the relationship, and to quantify beliefs (predictions) about the observables corresponding to a specified input or control quantity z.

(iv) Exchangeable sequences xi1, xi2, ..., of random quantities are to be observed in each of i ∈ I contexts, where I is itself a selection from a potentially larger index set.
4.6 Models via Partial Exchangeability 211

In addition to the exchangeability of each individual sequence, the contexts themselves may, in some sense, be judged exchangeable. A modelling framework is required which enables us to exploit such further judgements of exchangeability in order to be able to use information from all the sequences to strengthen, in some sense, the learning process within an individual sequence.

4.6.2 Several Samples

We shall begin our discussion of possible forms of partial exchangeability judgements for several sequences of observables by considering the simple case of joint beliefs about several sequences of 0-1 random quantities. For example: sequence i may consist of 0-1 (success-failure) outcomes on repeated trials with the ith of I similar electronic components; or sequence i may consist of quality measurements of known precision on replicate samples of the ith of I chemically similar dyestuffs.

In addition to the exchangeability of the individual sequences, further judgements of exchangeability across sequences may be appropriate. In the first case, the sequence of long-run frequencies of failures for each of the components might, a priori, be judged to be exchangeable; in the second case, the sequence of large-sample averages of quality for each of the dyestuffs might, a priori, be judged to be exchangeable. (We assume, here and throughout this section, that the relevant strong law limits exist.)

In many situations, including that of a comparative clinical trial, joint beliefs about several sequences of 0-1 random quantities would typically have the property encapsulated in the following definition.

Definition 4.13. (Unrestricted exchangeability for 0-1 sequences). Sequences of 0-1 random quantities xi1, xi2, ..., i = 1, ..., m, are said to be unrestrictedly exchangeable if each sequence is infinitely exchangeable and, in addition, for all ni ≤ Ni, i = 1, ..., m,

p(x1(n1), ..., xm(nm) | y1(N1), ..., ym(Nm)) = ∏_{i=1}^{m} p(xi(ni) | yi(Ni)),

where, here and throughout this section, xi(ni) denotes the vector of random quantities (xi1, ..., xi,ni) and yi(Ni) = xi1 + ... + xiNi.

Thus, this definition encapsulates the judgement that, given the total number of successes, yi(Ni), in the first Ni observations from the ith sequence, only the total for the ith sequence is relevant when it comes to beliefs about the outcomes of any subset of ni observations from that sequence. For example, given y1(N1) = 15 deaths in the first 100 patients receiving Drug 1 (N1 = 100) and y2(N2) = 20 deaths in the first 80 patients receiving Drug 2 (N2 = 80), we would typically judge the information about Drug 2 to be irrelevant to any assessment of the probability that the first three patients receiving Drug 1 survived and the fourth one died (x11 = 0, x12 = 0, x13 = 0, x14 = 1). Of course, the information might well be judged relevant if we were not informed of the value of y1(N1). The definition thus encapsulates a kind of "conditional irrelevance" judgement.

212 4 Modelling

As an example of a situation where this condition does not apply, suppose that x11, x12, ..., is an infinitely exchangeable 0-1 sequence and that a second sequence is defined by x2j = 1 − x1j, j = 1, 2, .... The second sequence is certainly an exchangeable sequence; but, taking n1 = n2 = 1 and N1 = N2 = 2, the product form of Definition 4.13 would assign conditional probability 1/2 × 1/2 to the event (x11 = 1, x21 = 1), given y1(2) = y2(2) = 1, whereas this event has probability zero, since x21 = 1 − x11. Further insight is obtained by noting (from Definition 4.13) that unrestricted exchangeability implies that

p(x11, ..., x1n1, ..., xm1, ..., xmnm) = p(x1π1(1), ..., x1π1(n1), ..., xmπm(1), ..., xmπm(nm)),

for any unrestricted choice of permutations πi of {1, ..., ni}, i = 1, ..., m; whereas, in the case of the above counter-example, we only have invariance of the joint distribution when π1 = π2. For a development starting from this latter condition, see de Finetti (1938).

We can now establish the following generalisation of Proposition 4.1.

Proposition 4.18. (Representation theorem for several sequences of 0-1 random quantities). If x11, x12, ..., ..., xm1, xm2, ..., are unrestrictedly infinitely exchangeable sequences of 0-1 random quantities with joint probability measure P, there exists a distribution function Q such that

p(x1(n1), ..., xm(nm)) = ∫ ∏_{i=1}^{m} θi^{yi(ni)} (1 − θi)^{ni − yi(ni)} dQ(θ1, ..., θm),

where yi(ni) = xi1 + ... + xi,ni.

4.6 Models via Partial Exchangeability 213

Corollary. Under the conditions of Proposition 4.18,

p(y1(n1), ..., ym(nm)) = ∫ ∏_{i=1}^{m} C(ni, yi(ni)) θi^{yi(ni)} (1 − θi)^{ni − yi(ni)} dQ(θ1, ..., θm),

where C(ni, yi(ni)) denotes the binomial coefficient.

Proof. We first note that

p(x1(n1), ..., xm(nm)) = [∏_{i=1}^{m} C(ni, yi(ni))]^{−1} p(y1(n1), ..., ym(nm)),

so that, to prove the proposition, it suffices to establish the corollary. Because of exchangeability, for any N ≥ max{n1, ..., nm}, p(y1(n1), ..., ym(nm)) may be expressed as

Σ ... Σ p(y1(n1), ..., ym(nm) | y1(N), ..., ym(N)) p(y1(N), ..., ym(N)),

where the ith of the m summations ranges over the possible values of yi(N). By Definition 4.13, and a straightforward generalisation of the argument given in Proposition 4.1, the first factor is a product of m hypergeometric probabilities which, uniformly in (y1(N), ..., ym(N)), approaches the corresponding product of binomial forms as N → ∞. Defining the function QN(θ1, ..., θm) on ℜ^m to be the m-dimensional "step" function with "jumps" of p(y1(N), ..., ym(N)) at the points (y1(N)/N, ..., ym(N)/N), and appealing to the multidimensional version of Helly's theorem (see Section 3.2.3), there exists a subsequence QN(j), j = 1, 2, ..., having a limit Q, which is a distribution function on ℜ^m. The result follows.
214 4 Modelling

Considering, for simplicity, the case m = 2, Proposition 4.18 (or its corollary) asserts that if we judge two sequences of 0-1 random quantities to be unrestrictedly exchangeable, we can proceed as if:

(i) the xij are independent Bernoulli random quantities (or the yi(ni) are independent binomial random quantities), conditional on random quantities θ1, θ2;

(ii) (θ1, θ2) are assigned a joint probability distribution Q;

(iii) by the strong law of large numbers, θi = lim_{ni→∞} yi(ni)/ni, so that Q may be interpreted as "joint beliefs about the limiting relative frequencies of 1's in the two sequences".

The model is completed by the specification of dQ(θ1, θ2), whose detailed form will, of course, depend on the particular beliefs appropriate to the actual practical application of the model. At a qualitative level, we note the following possibilities:

(a) knowledge of the limiting relative frequency for one of the sequences would not change beliefs about outcomes in the other sequence, so that we have the independent form of prior specification, dQ(θ1, θ2) = dQ(θ1) dQ(θ2);

(b) the limiting relative frequency for the second sequence will necessarily be greater than that for the first sequence (due, for example, to a known improvement in a drug or an electronic component under test), so that dQ(θ1, θ2) is zero outside the range 0 ≤ θ1 < θ2 ≤ 1;

(c) there is a real possibility that the limiting frequencies could, in fact, turn out to be equal, an event to which an individual assigns probability π, so that, writing θ = θ1 = θ2, dQ(θ1, θ2) has the form π dQ(θ) + (1 − π) dQ*(θ1, θ2), where dQ*(θ1, θ2) assigns probability over the range of values of (θ1, θ2) with θ1 ≠ θ2.

As we shall see later, in Chapter 5, the general form of representation of beliefs for observables defined in terms of the two sequences, together with detailed specifications of dQ(θ1, θ2), enables us to explore coherently any desired aspect of the learning process.
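Under the independent specification (a), such comparisons reduce to ordinary one-sequence updating followed by a probability calculation for the pair. A minimal sketch, using the drug-trial counts quoted earlier (15 deaths in 100, 20 in 80) and illustrative Beta(1, 1) priors for θ1 and θ2 (the conjugate choice here is our assumption, not a prescription of the text), computes the "posterior probability that θ2 exceeds θ1" by simple grid summation:

```python
import math

def beta_pdf(t, a, b):
    log_b = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - log_b)

# observed: 15 deaths in 100 patients (Drug 1), 20 deaths in 80 (Drug 2);
# independent Beta(1, 1) priors give Beta posteriors for (theta1, theta2)
a1, b1 = 1 + 15, 1 + 85
a2, b2 = 1 + 20, 1 + 60

n = 500
grid = [(i + 0.5) / n for i in range(n)]
w1 = [beta_pdf(t, a1, b1) / n for t in grid]   # posterior masses for theta1
w2 = [beta_pdf(t, a2, b2) / n for t in grid]   # posterior masses for theta2

# P(theta2 > theta1 | data): for each theta1 cell, add theta2's upper-tail mass
tail2 = [0.0] * (n + 1)
for i in range(n - 1, -1, -1):
    tail2[i] = tail2[i + 1] + w2[i]
prob = sum(w1[i] * tail2[i + 1] for i in range(n))
```

With these counts the posterior probability that the second death-rate exceeds the first comes out a little above 0.9, which is the kind of summary the limiting-frequency calculation on the next page formalises.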
4.6 Models via Partial Exchangeability 215

For example, we may have observed that, out of the first n1 and n2 patients receiving drug treatments 1 and 2, respectively, y1(n1) and y2(n2) survived, and we may wish, on the basis of this information, to make judgements about the relative performance of the drugs were they to be used on a large future sequence of patients. This might be done by calculating, for example,

lim_{N1→∞} lim_{N2→∞} P(y2(N2)/N2 > y1(N1)/N1 | y1(n1), y2(n2)),

which, in the language of the conventional paradigm, is the "posterior probability that θ2 exceeds θ1, given y1(n1), y2(n2)".

Clearly, the discussion and resulting forms of representation which we have given for the case of unrestrictedly exchangeable sequences of 0-1 random quantities can be extended to more general cases. One possible generalisation of Definition 4.13 is the following.

Definition 4.14. (Unrestricted exchangeability for sequences with predictive sufficient statistics). Sequences of random quantities xi1, xi2, ..., taking values in Xi, i = 1, ..., m, are said to be unrestrictedly infinitely exchangeable if each sequence is infinitely exchangeable and, in addition, for all ni ≤ Ni, i = 1, ..., m,

p(x1(n1), ..., xm(nm) | t1(N1), ..., tm(Nm)) = ∏_{i=1}^{m} p(xi(ni) | ti(Ni)),

where ti(Ni) = ti(xi1, ..., xiNi), i = 1, ..., m, are separately predictive sufficient statistics for the individual sequences.

In general, given m unrestrictedly exchangeable sequences of random quantities for which the parametric families have been identified through consideration of sufficient statistics of fixed dimension, we typically arrive at a representation of the form

p(x1(n1), ..., xm(nm)) = ∫ ∏_{i=1}^{m} ∏_{j=1}^{ni} pi(xij | θi) dQ(θ1, ..., θm),

where the parameters correspond to strong law limits of functions of the sufficient statistics. Most often, the fact that the m sequences are being considered together will mean that the random quantities xij relate to the same form of measurement or counting procedure for all i = 1, ..., m, so that typically we will have pi(x | θi) = p(x | θi), for some common parametric family p(· | ·). The following forms are frequently assumed in applications.
216 4 Modelling

Example 4.9. (Binomial). If yi(ni) denotes the number of 1's in the first ni outcomes of the ith of m unrestrictedly exchangeable sequences of 0-1 random quantities, then

p(y1(n1), ..., ym(nm)) = ∫ ∏_{i=1}^{m} Bi(yi(ni) | θi, ni) dQ(θ1, ..., θm),

where θi = lim_{n→∞} yi(n)/n and Θ = [0, 1]^m.

Example 4.10. (Multinomial). If yi(ni) = (yi1(ni), ..., yik(ni)) denotes the category membership count (into the first k of k + 1 exclusive categories) from the first ni outcomes of the ith of m unrestrictedly exchangeable sequences of "0-1 random vectors" (see Section 4.3), then

p(y1(n1), ..., ym(nm)) = ∫ ∏_{i=1}^{m} Mu_k(yi(ni) | θi, ni) dQ(θ1, ..., θm),

where θi = (θi1, ..., θik), with θil = lim_{n→∞} yil(n)/n, and Θ = {(θ1, ..., θk); 0 ≤ θl ≤ 1, θ1 + ... + θk ≤ 1}^m. This model describes beliefs about an m × (k + 1) contingency table of count data, with row totals n1, ..., nm. It generalises the case of the m × 2 contingency table described in Example 4.9.

Example 4.11. (Normal). If xi1, ..., xi,ni, i = 1, ..., m, denote real-valued observations from m unrestrictedly exchangeable sequences of real-valued random quantities, the assumed sufficiency of the sample sum and sum of squares within each sequence might lead to the representation

p(x1(n1), ..., xm(nm)) = ∫ ∏_{i=1}^{m} ∏_{j=1}^{ni} N(xij | μi, λi) dQ(μ1, ..., μm, λ1, ..., λm),

where μi = lim_{n→∞} x̄i(n) and λi^{−1} = lim_{n→∞} si²(n), with x̄i(n) and si²(n) denoting the sample mean and variance of the first n observations of the ith sequence, and Θ = ℜ^m × (ℜ+)^m. If, as in many applications, the further judgement is made that λ1 = ... = λm = λ, the representation then takes the form

p(x1(n1), ..., xm(nm)) = ∫ ∏_{i=1}^{m} ∏_{j=1}^{ni} N(xij | μi, λ) dQ(μ1, ..., μm, λ).

As in the case of 0-1 random quantities with m = 2, discussed earlier in this section, we could make analogous remarks concerning the various qualitative forms of specification of the prior distribution Q that might be made in these cases. We shall not pursue this further here, but will comment further in Section 4.6.5.
4.6 Models via Partial Exchangeability 217

4.6.3 Structured Layouts

Let us now consider the situation described in (ii) of Section 4.6.1, where the random quantity xijk is triple-subscripted to indicate that it is the kth of K "replicates" of an observable in "context" i ∈ I, subject to "treatment" j ∈ J. In such contexts, with I and J fixed, we have a two-way layout, having I rows and J columns, with K replicates in each of the IJ cells.

In general terms, for any fixed (i, j), we think of xij1, xij2, ..., as a (potentially) infinite sequence of real-valued random quantities, and suppose that the IJ sequences of this kind are judged to be unrestrictedly exchangeable. On the other hand, it is typically not the case that beliefs about the xijk would be invariant under permutations of the subscripts i or j.
of course...5. . We shall return to this possibility in Section 46. = A. for fixed i .jth wlutnn efleect and as the (ij)th intercrcrioir rfleect. ci is referred to as the o\*erull inem.r. ~ assumed independently and normally distributed with means prJ and are variances (A/.. column averages and overall average. Letting denote the strong law limits of the row averages. Interest in applications often centres on whether or not interactions or main effects are close to zero and.. In many cases.)'. are referred to as the main cftecr.tt } and { . } as the } interactions. + + + ?IlJ.$. the nature of the observational process leads to the judgement that limb:.as the . It is possible. we can always write p. In conventional terminology. the { (. for all z .where we shall see that certain further assumptions of invariance lead naturally to the idea of hierarchical rcpresentations of beliefs. .. j).5 and { ?. .. . j . r t J 2 . ( I ! as the ith row eflect. where so that the random quantities x .l.$. from the twoway layout with Z and J fixed. that further forms of symmetric beliefs might be judged reasonable for certain permutations of the i. In the above discussion. .218 4 Modelling the x . . on making inferences about the magnitudes of differences between different row or column effects.j subscripts. if not. J . "./J = /1 C k . k are conditionally independently distributed with The full model representation is then completed by the specification of a prior distribution Q for X and any Z J linearly independent combinations of the p. may say. d. .(K) be assumed to be the same for all ( i . Collectively.j. our exchangeability assumptions were restricted to the sequence . respectively.so that A. .
. . . is a random quantity.6. m. survives. j t h animal receiving dose t. . tl. are to be observed..4. . . Hewlett and Plackett. 1979).. z. . . and if we denote the number of survivors ~ out of ni animals observed in the ith sequence by y. we shall denote the joint density of ~ . Again.1z. . .(z)) 8. of a toxic substance.2.m. Functions having the form G(4: 2.= 1 if the . j by p(zl(n1) . . in recognition of this dependency. of a related sequence of (random) i quantities.2:. Example 4.8(2)= (6JI(z). in some . z i l . where zi. . .. . . . (Bimsay). i = l ? .. for each i = 1. the sequences zil. .12.6.. we gave examples of situations where beliefs about sequences of observables xtllx22.i = 1. = 0 otherwise. If. Suppose that at each of 'rri specified dose levels.(n.i = 1.. . .4 Covariates In (iii) of Section 4. = 1. m. . z l ) . . ~=. We shall refer to the latter as cuvariares and.6 Models via Partial Exchangeability 219 4. z. . with + and G(41 + &z.. there is no suggestion that these particular forms have any special status. typically measured on a logarithmic scale. .. investigators often find it reasonable to assume that where the functional form G (usually monotone increasing from 0 to 1) is specified. z n l ) . for example. 1.18 implies a representation of the form . . ( z ) = linin'y.. ? !. The examples which follow illustrate some of the typical forms assumed in applications. .t n .~ &z.)/{l + + exp(4..... .42 E . . the required representation has the form . R'.1 random quantities. . . + where z = (q.)lz1.a straightforward generalisation of the corollary to Proposition 4. x .. they simply illustrate some of the kinds of models which are commonly Used. zi.)} (the logit model) being the most common. . .. +. . z. sense. and n x In many situations. For any specified C(. but d. .. . ... ...)= x . . sequences of 0 .m are functionally dependent.(n). . . ( n * ..) = G(41 p 2 z l ) with 4' E 92. are judged exchangeable. are widely used (see.) = exp(d. .). 
on the observed values.n. .. 2 . + cj2z. O.
.x. we might take v ! . ( t ) . at which G(o.C > ~ / O ~ corresponds to the (log) dose.): = (0. ~. in.. In the probit and logit cases...1 + .. . Suppose that at each of irt specified time points.02).respectively. Suppose further that the kinds of judgements outlined in Example 4. .(z))and0: = W x (W)“‘. about c ’ ~ Q2then acquire an operational meaning as beliefs about the average growthlevels. 2. In the logistic case.. . where the functional form y (usually monotone increasing) is specified. Beliefs about L‘nI then correspond to beliefs about the (log)dose level for which the survival frequency in a large series of animals would equal 112. (the /ogisric model) and (the straightline model). and ~2 = (ol 02)’. that would be observed from a large number of replicate measurements. Experimenters might typically be more accustomed to thinking in terms of ( . + correspondingto the growth level at the “time origin”. X 1 ( z .).1) transformation. As with Example 4. corresponding to the “saturation” growth level reached as s..) = g ( @ ::.:.220 4 Modellitig with dQ’(0) specifying a prior distribution for 0 E (P. the joint predictive density representation has the form 022.) = 1/2.I.(:. (.I. the judgement is made that X l ( r ) = .. : :.. ~ + I..1 1 are made about the sequences ..r.?. = .‘+ = 1og[0~&/(20~t & ) j / the time at which growth is halfway from the initial to the final level. . Commonly assumed forms include g ( ~2. z. In practice. (Growthcurues). . the socalled LD50 dose.\. 2 0. y(@: 2 . . 5 . .(Z) = = X p..x” “0. z .. and at times “. specification of C) might be facilitated if we reparametrise from q5 to a more suitable ( I . say i . ) ) In many such situations.... . so that we have the representation wherefiJ(z)= ( p l ( z. ’ . a 02 Example 4. are to be observed. . Beliefs . . . than in terms of (0. . ) = cil where dQ(& A ) specifying a prior distribution for @ E @ and X E %*.. for example. . . w . . 
A third possible parameter to which investigators could easily relate i n some applications might be c. 0 ~ c . with t = 1. .r.13.sequences of realvalued random quantities.12.(%I (particularly if measurements are made on a logarithmic scale) and that /I. .. = dl I . for example. = I)(@). + &:.. p . . but @ is a random quantity. is the jth replicate measurement (perhaps on a logarithmic scale) of the size or weight of the subject or object of interest at time 2 . .. .) . the specification of (1 might be facilitated by reparametrising from Q to a more suitable (11) transformation @ = @(@).!. . A. For any specified g ( .. where .Q I /c72.r. . I = 1. say. vl = .
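As a concrete illustration of Example 4.12, the logit form of G and the corresponding parametric component are easy to compute directly. The following is a minimal sketch (the function names are ours, not the text's; the binomial coefficients, which are constant in phi, are omitted from the log-likelihood):

```python
import math

def logit_prob(phi1, phi2, dose):
    """Survival probability G(phi1 + phi2 * z) under the logit model."""
    eta = phi1 + phi2 * dose
    return math.exp(eta) / (1.0 + math.exp(eta))

def binomial_loglik(phi1, phi2, doses, n, y):
    """Log of the parametric component: y_i survivors out of n_i animals
    observed at dose t_i, conditional on phi = (phi1, phi2)."""
    total = 0.0
    for t, ni, yi in zip(doses, n, y):
        p = logit_prob(phi1, phi2, t)
        total += yi * math.log(p) + (ni - yi) * math.log(1.0 - p)
    return total
```

Mixing this conditional form over a prior dQ(phi) then reproduces the representation given in the example.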
Example 4.14. (Multiple regression). Suppose that, for each i = 1, ..., m, sequences of real-valued random quantities x_i1, ..., x_in_i are to be observed, where each x_ij is related to certain specified observed quantities z_i = (z_i^(1), ..., z_i^(k)), and judgements are made which lead to the belief representation

  p(x_1(n_1), ..., x_m(n_m) | z_1, ..., z_m) = INT prod_{i=1}^m prod_{j=1}^{n_i} N(x_ij | mu_i(z), lambda) dQ(mu(z), lambda),

where lambda and mu(z) = (mu_1(z), ..., mu_m(z)) are the usual strong law limits. In many situations, mu(.) is unknown, but the latter is assumed to be a "smooth" function, adequately approximated by a first-order Taylor expansion about some (unspecified) z*, so that

  mu_i(z) ~ mu(z*) + sum_{l=1}^k (z_i^(l) - z*^(l)) d mu(z*)/d z^(l) = a_i theta,

where we define a_i = (1, z_i^(1), ..., z_i^(k)) (row vector) and theta = (theta_0, theta_1, ..., theta_k)' (column vector). Conditional on theta and lambda, the joint distribution of x = (x_11, ..., x_mn_m)' is thus seen to be multivariate normal, and the unconditional representation can therefore be written as

  p(x) = INT N_n(x | A theta, lambda I_n) dQ(theta, lambda),

where A is an n x (k+1) matrix (n = n_1 + ... + n_m) whose rows consist of a_1 replicated n_1 times, followed by a_2 replicated n_2 times, and so on, with I_n denoting the n x n identity matrix. It is conventional to refer to the z^(1), ..., z^(k) as values of the regressor variables, to theta as the vector of regression coefficients and to A as the design matrix. The form mu(z) = A theta is called a regression equation, and the structure E(x | A, theta, lambda) = A theta is said to define a linear model. If k = 1, we have the simple regression (straight-line) model; with k >= 2, we have a multiple regression model. Moreover, within this general structure we can represent various special cases, such as z^(j) = z^j (polynomial regression) or z^(j) = sin(jz/K) (a version of trigonometric regression). In many cases, beliefs about theta relate to beliefs about the intercept (theta_0) of the regression equation and the marginal rates of change (theta_1, ..., theta_k) of E(x_ij | A, theta, lambda) with respect to the regressor variables z^(1), ..., z^(k).

Specification of the kinds of structures which we have illustrated in Examples 4.12 to 4.14 essentially reduces to the same process as we have seen in earlier representations of joint predictive densities as integral mixtures, defined through conditional independence. We proceed as if: (i) the random quantities are conditionally independent, given the values of the relevant covariates and given the unknown parameters; (ii) the latter are assigned a prior distribution. In many cases, the first of these, the likelihood, involves familiar probability models, often of exponential family form (as with the binomial, normal and multivariate normal examples seen above), but with at least some of the usual "labelling" parameters replaced by more complex functional forms involving the covariates. From a conceptual point of view, this is all that really needs to be said for the time being. From an operational point of view, when we consider the applications of such models, it is often useful to have a more structured taxonomy in mind: for example, linear versus non-linear functional forms, normal versus non-normal distributions, and so on, together with the problems of computation, approximation, etc., which arise in implementing the Bayesian learning process.

4.6.5 Hierarchical Models

In Section 4.6.1, we considered the general situation where several sequences of random quantities, x_i1, ..., x_in_i, i = 1, ..., m, are judged unrestrictedly infinitely exchangeable, leading typically to a joint density representation of the form

  p(x_1(n_1), ..., x_m(n_m)) = INT prod_{i=1}^m p(x_i(n_i) | theta_i) dQ(theta_1, ..., theta_m).

We remarked at that time that nothing can be said, in general, about the prior specification Q(theta_1, ..., theta_m), since this must reflect whatever beliefs are appropriate for the specific application being modelled. However, it is often the case that additional judgements about relationships among the m sequences lead to interestingly structured forms of Q(theta_1, ..., theta_m).
In Section 4.6.1, we noted some of the possible contexts in which judgements of exchangeability might be appropriate not only for the random quantities within each of m separate sequences of observables, but also between the m strong law limits of appropriately defined statistics for each of the sequences. The following examples illustrate this kind of structured judgement and the forms of hierarchical model which result.

Example 4.15. (Exchangeable binomial parameters). Suppose that we have m unrestrictedly infinitely exchangeable sequences of 0-1 random quantities, so that, with y_i(n) = x_i1 + ... + x_in and theta_i = lim_{n -> infinity} y_i(n)/n, for i = 1, ..., m,

  p(y_1(n_1), ..., y_m(n_m) | theta_1, ..., theta_m) = prod_{i=1}^m Bi(y_i(n_i) | theta_i, n_i),

where y_i(n_i) is a sufficient statistic for the ith sequence. If, for example, the sequences consist of success-failure outcomes on repeated trials with m different (but, to all intents and purposes, "similar") types of component, it might be reasonable to judge the m "long-run success frequencies" theta_1, ..., theta_m to be themselves exchangeable. This corresponds to specifying an exchangeable form of prior distribution for the parameters theta_1, ..., theta_m. If the m types of component can be thought of as a selection from a potentially infinite sequence of similar components, we then have (see Section 4.3.3) the general representation

  Q(theta_1, ..., theta_m) = INT prod_{i=1}^m G(theta_i) d pi(G),

so that Q(theta_1, ..., theta_m | G) = prod_{i=1}^m G(theta_i), with d pi(G) specifying beliefs about G. The complete model structure is then seen to have the hierarchical form

  p(y_1(n_1), ..., y_m(n_m) | theta_1, ..., theta_m) = prod_{i=1}^m Bi(y_i(n_i) | theta_i, n_i),
  Q(theta_1, ..., theta_m | G) = prod_{i=1}^m G(theta_i),
  pi(G).

In conventional terminology, the first stage of the hierarchy relates data to parameters via binomial distributions; the second stage models the binomial parameters as if they were a random sample from a distribution G; the third, and final, stage specifies beliefs about G.
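To make the three stages concrete, suppose (purely for illustration; the text leaves G unspecified) that beliefs about G concentrate on the Beta family. A single draw from the resulting hierarchy can then be sketched as follows, with all names ours:

```python
import random

def simulate_hierarchical_binomial(alpha, beta, n_list, seed=0):
    """One draw from the hierarchy, with the third stage held fixed:
    second stage: theta_i ~ Beta(alpha, beta), i.i.d. (an assumed parametric G);
    first stage:  y_i | theta_i ~ Binomial(n_i, theta_i)."""
    rng = random.Random(seed)
    thetas = [rng.betavariate(alpha, beta) for _ in n_list]
    ys = [sum(rng.random() < th for _ in range(ni))
          for th, ni in zip(thetas, n_list)]
    return thetas, ys
```

Placing a prior on (alpha, beta) would restore the full third stage of the hierarchy.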
The above example is readily generalised to the case of exchangeable parameters for any one-parameter exponential family, the first stage of the hierarchy relating data to parameters in a form assumed to be independent of G. In practice, beliefs about G might concentrate on a particular parametric family, so that the second stage models the parameters as if they were a random sample from a parametric family labelled by the hyperparameter, and the third, and final, stage specifies beliefs about the hyperparameter. Such beliefs acquire operational significance by identifying the hyperparameter with appropriate strong law limits of observables, as we shall indicate in the following example.

Example 4.16. (Exchangeable normal mean parameters). Suppose that we have m unrestrictedly infinitely exchangeable sequences x_i1, x_i2, ..., i = 1, ..., m, of real-valued random quantities, for which (see Example 4.11) the joint density has the representation

  p(x_1(n_1), ..., x_m(n_m)) = INT prod_{i=1}^m prod_{j=1}^{n_i} N(x_ij | mu_i, lambda) dQ(mu_1, ..., mu_m, lambda),

where we recall that mu_i = lim_{n -> infinity} xbar_i(n), with xbar_i(n) = n^{-1}(x_i1 + ... + x_in), and that, for very large n_i, the sample mean and mean sum of squares about the mean, the large sample analogues of mu_i and lambda^{-1}, were judged sufficient for the ith sequence. So far as the specification of Q(mu_1, ..., mu_m, lambda) is concerned, we first note that in many applications it is helpful to think in terms of the product form

  dQ(mu_1, ..., mu_m, lambda) = dQ_mu(mu_1, ..., mu_m | lambda) dQ_lambda(lambda),

assuming the existence of the appropriate densities. In some cases, knowledge of the strong law limits of sums of squares about the mean may be judged irrelevant to the assessment of beliefs for the strong law limits of the sample averages: in such cases, Q_mu(mu_1, ..., mu_m | lambda) will not depend on lambda. In other cases, we might believe, for example, that variation among the limiting sample averages is certainly bigger (or certainly smaller) than within-sequence variation of observations about the sample mean: in such cases, Q_mu(mu_1, ..., mu_m | lambda) will involve lambda. In either case, it is useful to think in terms of the product form of Q. Now suppose that, conditional on lambda, the limiting sample means are judged exchangeable. If the m sequences can be thought of as a selection from a potentially infinite collection of similar sequences, it would then be natural (see Section 4.3.3) to adopt a further representation of Q_mu and, if beliefs concentrate on a particular parametric family, to take, say,

  q_mu(mu_1, ..., mu_m | lambda) = INT prod_{i=1}^m N(mu_i | phi_1, phi_2) dQ_phi(phi_1, phi_2 | lambda),

so that the complete model then has the hierarchical structure

  p(x_1(n_1), ..., x_m(n_m) | mu_1, ..., mu_m, lambda) = prod_{i=1}^m prod_{j=1}^{n_i} N(x_ij | mu_i, lambda),
  q_mu(mu_1, ..., mu_m | phi_1, phi_2) = prod_{i=1}^m N(mu_i | phi_1, phi_2),
  dQ(phi_1, phi_2, lambda).

From an operational standpoint, the final stage specification of the joint prior distribution for phi_1, phi_2 and lambda then reduces to a specification of beliefs about the following limits of observable quantities (for large m and n_1, ..., n_m):

(i) the mean of all the observations from all the sequences (phi_1);

(ii) the mean sum of squares of the individual sequence means about the overall mean (phi_2^{-1});

(iii) the mean (over sequences) of the mean sum of squares of observations within a sequence about the sequence mean (lambda^{-1}).

The precise form of specification at this stage will, of course, depend on the particular situation in which the model is being applied.

Hierarchical modelling provides a powerful and flexible approach to the representation of beliefs about observables in extended data structures, and is being increasingly used in statistical modelling and analysis. This section has merely provided a brief introduction to the basic ideas and the way such structures arise naturally within a subjectivist modelling framework. In the context of the Bayesian learning process, further brief discussion will be given in Chapter 5, where links will be made with empirical Bayes ideas; a selection of references to the literature on inference for hierarchical models will also be given there. An extensive discussion of hierarchical modelling will be given in the volumes Bayesian Computation and Bayesian Methods.

4.7 PRAGMATIC ASPECTS

4.7.1 Finite and Infinite Exchangeability

The de Finetti representation theorem for 0-1 random quantities, and the various extensions we have been considering in this chapter, characterise forms of p(x_1, ..., x_n) for observables x_1, ..., x_n assumed to be part of an infinite exchangeable sequence. However, mathematical representations which correspond to probabilistic mixing over conditionally independent parametric forms do not, in general, hold for finite exchangeable sequences. To see this, consider n = 2 and finitely exchangeable 0-1 random quantities x_1, x_2 such that

  P(x_1 = 0, x_2 = 1) = P(x_1 = 1, x_2 = 0) = 1/2.

If the de Finetti representation held, we would have, for some Q(theta),

  0 = P(x_1 = 1, x_2 = 1) = INT theta^2 dQ(theta)  and  0 = P(x_1 = 0, x_2 = 0) = INT (1 - theta)^2 dQ(theta),

an impossibility, since the latter would require Q to assign probability one to both theta = 0 and theta = 1 (Diaconis and Freedman, 1980a).
It appears, therefore, that there is a potential conflict between realistic modelling (acknowledging the necessarily finite nature of actual exchangeability judgements) and the use of conventional mathematical representations (derived on the basis of assumed infinite exchangeability). To discuss this problem, let us call an exchangeable sequence x_1, ..., x_n, with x_i in X, N-extendible if it is part of the longer exchangeable sequence x_1, ..., x_N. Practical judgements of exchangeability for specific observables x_1, ..., x_n are typically of this kind: the x_1, ..., x_n can be considered as part of a larger, but finite, potential sequence of exchangeable observables, so that the sequence is N-extendible for some N >> n. Infinite exchangeability corresponds to the possibly unrealistic assumption of N-extendibility for all N > n. In general, the assumption of infinite exchangeability implies that the probability assigned to an event E concerning (x_1, ..., x_n) in X^n is of the form

  P_Q(E) = INT F^n(E) dQ(F),

for some Q. If we denote by P(E) the corresponding probability assigned under N-extendibility for a specific N, a possible measure of the "distortion" introduced by assuming infinite exchangeability is given by

  sup_E | P(E) - P_Q(E) |,

where the supremum is taken over all events in the appropriate sigma-field on X^n. Intuitively, one might feel that if x_1, ..., x_n is N-extendible for some N >> n, this "distortion" should be somewhat negligible. This is made precise by the following.

Proposition 4.19. (Finite approximation of infinite exchangeability). If the exchangeable sequence x_1, ..., x_n, with x_i in X, is N-extendible, there exists Q such that

  sup_E | P(E) - INT F^n(E) dQ(F) | <= f(n)/N,

where f(n) is the number of elements in X, if the latter is finite, and f(n) = n(n - 1) otherwise.

Proof. See Diaconis and Freedman (1980a) for a rigorous statement and technical details.

The message is clear and somewhat comforting: if a realistic judgement of N-extendibility for large, but finite, N is replaced by the mathematically convenient assumption of infinite exchangeability, no important distortion will occur in quantifying uncertainties. For extensions of Proposition 4.19 to multivariate and linear model structures, see Diaconis et al. (1992). For further discussion, see Diaconis (1977), Jaynes (1986) and Hill (1992).
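The bound in Proposition 4.19 is simple arithmetic; the following sketch (names ours) mirrors the statement as given above:

```python
def distortion_bound(n, N, card_X=None):
    """Upper bound f(n)/N on the 'distortion' from treating an N-extendible
    exchangeable sequence of length n as infinitely exchangeable:
    f(n) = number of elements of X (card_X) when X is finite,
    f(n) = n * (n - 1) otherwise."""
    f = card_X if card_X is not None else n * (n - 1)
    return f / N
```

For instance, for n = 10 observations judged extendible to N = 1000, the bound is 90/1000 in the general case, and only 2/1000 for 0-1 observations.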
4.7.2 Parametric and Nonparametric Models

In Section 4.3.3, we saw that the assumption of exchangeability for a sequence x_1, x_2, ... of real-valued random quantities implied a general representation of the joint distribution of x_1, ..., x_n of the form

  p(x_1, ..., x_n) = INT prod_{i=1}^n F(x_i) dQ(F),

where

  Q(F) = lim_{n -> infinity} P(F_n)

and F_n is the empirical distribution function defined by x_1, ..., x_n, so that Q represents our beliefs about "what the empirical distribution would look like for large n". This implies that we should proceed as if we have a random sample from an unknown distribution function F. However, since F is, effectively, an infinite-dimensional parameter, the task of assessing and representing such a belief distribution Q over the set F of all possible distribution functions is by no means straightforward. Conventionally, albeit somewhat paradoxically, representations involving a finite-dimensional parameter are referred to as parametric models, whereas those involving the infinite-dimensional parameter are referred to as nonparametric models! Most of this chapter has therefore been devoted to exploring additional features of beliefs which justify the restriction of F to families of distributions having explicit mathematical forms involving only a finite-dimensional labelling parameter. As we remarked at the end of Section 4.3.3, the use of specific parametric forms can often be given a formal motivation or justification as the coherent representation of certain forms of belief characterised by invariance or sufficiency properties.

The technical key to Bayesian nonparametric modelling is thus seen to be the specification of appropriate probability measures over function spaces, rather than over finite-dimensional real spaces, and the Bayesian analysis of nonparametric models requires considerably more mathematical machinery than the corresponding analysis of parametric models. In the rest of this volume we will deal exclusively with the parametric case, postponing a treatment of nonparametric problems to the volumes Bayesian Computation and Bayesian Methods. Among important references on this topic, we note Whittle (1958), Hill (1968, 1988, 1992), Dickey (1969), Kimeldorf and Wahba (1970), Good and Gaskins (1971, 1980), Leonard (1973), Ferguson (1973, 1974), Antoniak (1974), Doksum (1974), Susarla and van Ryzin (1976), Ferguson and Phadia (1979), Dalal and Hall (1980), Dykstra and Laud (1981), Padgett and Wei (1981), Rolin (1983), Lo (1984), Thorburn (1986), Kestemont (1987), Berliner and Hill (1988), Wahba (1988), Hjort (1990), Lenk (1991) and Lavine (1992a).
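The empirical distribution function F_n, on which the beliefs Q are anchored in the representation above, can be sketched as follows (names ours):

```python
def empirical_cdf(xs):
    """Return the empirical distribution function F_n of x_1, ..., x_n:
    F_n(x) = (1/n) * #{i : x_i <= x}."""
    n = len(xs)
    def F(x):
        # proportion of observations not exceeding x
        return sum(1 for v in xs if v <= x) / n
    return F
```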
Such judgements correspond to acting as if one has a Q which concentrates on a subset of F defined in terms of a finite-dimensional labelling parameter.

4.7.3 Model Elaboration

In practice, in arriving at a particular parametric model specification, a number of simplifying assumptions will necessarily have been made (either consciously or unconsciously), by means of whatever combination of formal and pragmatic judgements have been deemed appropriate. In addition to formal arguments from invariance or sufficiency, there are often less formal, more pragmatic, reasons for choosing to work with a particular parametric model (as there often are for acting, formally, as if particular forms of summary statistic were sufficient!). Specific parametric models are often suggested by exploratory data analysis (typically involving graphical techniques to identify plausible distributional shapes and forms of relationship with covariates), or by experience (i.e., historical reference to "similar" situations, where a given model seemed "to work"), or by scientific theory (which determines that a specific mathematical relationship "must" hold, in accordance with an assumed "law"). In each case, of course, the choice involves subjective judgements: regarding, for example, the "straightness" of a graphical normal plot, the "similarity" between a current and a previous trial, and the "applicability" of a theory to the situation under study. It would always be prudent, therefore, to "expand one's consciousness" a little in relation to an intended model, in order to review the judgements that have been made. Depending on the context, the following kinds of critical questions might be appropriate:

(i) is it reasonable to assume that all the observables form a "homogeneous sample", or might a few of them be "aberrant" in some sense?

(ii) is it reasonable to apply the modelling assumptions to the observables on their original scale of measurement, or should the scale be transformed to logarithms, reciprocals, or whatever?

(iii) when considering temporally or spatially related observables, is it reasonable to have made a particular conditional independence assumption, or should some form of correlation be taken into account?

(iv) if some, but not all, potential covariates have been included in the model, is it reasonable to have excluded the others, or might some of them be important, either individually or in conjunction with covariates already included?

We shall consider each of these possibilities in turn, indicating briefly the kinds of elaboration of the "first thought of" model that might be considered.
Outlier elaboration. Suppose that judgements about a sequence x_1, x_2, ... of real-valued random quantities have led to serious consideration of the model

  p(x_1, ..., x_n) = INT prod_{i=1}^n N(x_i | mu, lambda) dQ(mu, lambda),

but that, on reflection, it is thought wise to allow for the fact that (an unknown) one of x_1, ..., x_n might be aberrant. If aberrant observations are assumed to be such that a sequence of them would have a limiting mean equal to mu, but a limiting mean square about the mean equal to (gamma lambda)^{-1}, where mu and lambda denote the corresponding limits for non-aberrant observations and 0 < gamma < 1, a suitable form of elaborated model might be

  p(x_1, ..., x_n) = INT [ pi prod_{i=1}^n N(x_i | mu, lambda) + (1 - pi) n^{-1} sum_{j=1}^n N(x_j | mu, gamma lambda) prod_{i != j} N(x_i | mu, lambda) ] dQ(mu, lambda, gamma).

This model corresponds to an initial assumption that, with specified probability pi, there are no aberrant observations but, with probability 1 - pi, there is precisely one aberrant observation, which is equally likely to be any one of x_1, ..., x_n. Since, for an aberrant observation, E[(x_j - mu)^2 | mu, lambda, gamma] = (gamma lambda)^{-1}, prior belief in the relative inaccuracy of an aberrant observation as a "predictor" of mu is reflected in the weight attached by the prior distribution Q(gamma) to values of gamma much smaller than 1, since gamma < 1 implies an increased probability that, in the observed sample x_1, ..., x_n, the aberrant observation will "outlie". Generalisations to cover more than one possible aberrant observation can be constructed in an obviously analogous manner. Such models are usually referred to as "outlier" models. De Finetti (1961) and Box and Tiao (1968) are pioneering Bayesian papers on this topic. More recent literature includes Dawid (1973), O'Hagan (1979, 1988b, 1990), Freeman (1980), Smith (1983), West (1984, 1985), Pettit and Smith (1985), Arnaiz and Ruiz-Rivas (1986), Muirhead (1986), Pettit (1986, 1992), Guttman and Peña (1988) and Peña and Guttman (1993).

Transformation elaboration. Suppose now that judgements about a sequence x_1, x_2, ... of real-valued random quantities again lead to a "first thought of" model in which

  p(x_1, ..., x_n) = INT prod_{i=1}^n N(x_i | mu, lambda) dQ(mu, lambda),

but that it is then recognised that the observations, suitably transformed, might more plausibly have this form. For a given gamma, define

  x_i^(gamma) = (x_i^gamma - 1)/gamma, for gamma != 0, and x_i^(0) = log(x_i),

so that, if a suitable gamma were identified, the sequence x_1^(gamma), x_2^(gamma), ... would plausibly have the representation

  p(x_1^(gamma), ..., x_n^(gamma)) = INT prod_{i=1}^n N(x_i^(gamma) | mu, lambda) dQ(mu, lambda).

It then follows that the elaborated model for the observations on their original scale becomes

  p(x_1, ..., x_n) = INT prod_{i=1}^n N(x_i^(gamma) | mu, lambda) x_i^{gamma - 1} dQ(mu, lambda, gamma),

the factors x_i^{gamma - 1} being the Jacobian of the transformation. The case gamma = 1 corresponds to assuming a normal parametric model for the observations on their original scale of measurement. If the support of Q(gamma) includes values such as gamma = -1, gamma = 1/2 and gamma = 0, the elaborated model admits the possibility that transformations such as reciprocal, square root or logarithm, respectively, might provide a better scale on which to assume a normal parametric model. Judgements about the relative plausibilities of these and other possible transformations are then incorporated in Q(gamma). For detailed developments, see Box and Cox (1964), Pericchi (1981) and Sweeting (1984, 1985).

Correlation elaboration. Suppose that judgements about x_1, x_2, ... again lead to a "first thought of" model in which

  p(x_1, ..., x_n) = INT prod_{t=1}^n N(x_t | mu, lambda) dQ(mu, lambda),

but that it is then recognised that there may be a serial correlation structure among x_1, ..., x_n (since, for example, the observations correspond to successive time points). A possible extension of the representation to incorporate such correlation might be to assume that, for a given gamma in (-1, 1), and conditional on mu and lambda,

  p(x_t | x_1, ..., x_{t-1}, mu, lambda, gamma) = p(x_t | x_{t-1}, mu, lambda, gamma), t = 2, ..., n,

with the correlation between x_t and x_{t+h} given by gamma^h.
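Conditional on mu, lambda and gamma, the one-aberrant-observation component of the outlier elaboration above can be evaluated directly. A minimal numerical sketch (names ours; lambda is a precision, as in the text's N(x | mu, lambda) notation):

```python
import math

def normal_pdf(x, mu, lam):
    """N(x | mu, lam) density, lam being a precision."""
    return math.sqrt(lam / (2.0 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)

def one_outlier_density(xs, mu, lam, gamma):
    """Density of x_1, ..., x_n given that exactly one observation,
    equally likely to be any of them, is aberrant with precision gamma*lam."""
    n = len(xs)
    full = 1.0
    for x in xs:
        full *= normal_pdf(x, mu, lam)
    total = 0.0
    for xj in xs:
        # replace the j-th non-aberrant factor by the low-precision one
        total += full / normal_pdf(xj, mu, lam) * normal_pdf(xj, mu, gamma * lam)
    return total / n
```

Mixing pi times the plain product with (1 - pi) times this component, and then over dQ(mu, lambda, gamma), reproduces the elaborated representation above.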
The elaborated model then becomes

  p(x_1, ..., x_n) = INT N(x_1 | mu, lambda) prod_{t=2}^n N(x_t | mu + gamma(x_{t-1} - mu), lambda (1 - gamma^2)^{-1}) dQ(mu, lambda, gamma),

which preserves N(x_t | mu, lambda) as the marginal parametric model for each observation. The "first thought of" model corresponds to gamma = 0, and beliefs about the relative plausibility of this value, compared with other possible values reflecting positive or negative serial correlation, are incorporated in the specification of Q(gamma).

Covariate elaboration. Suppose that the "first thought of" model for the observables x = (x_11, ..., x_mn_m) is the multiple regression model with representation

  p(x) = INT N_n(x | A theta, lambda I_n) dQ(theta, lambda),

as described in Example 4.14 of Section 4.6.4, where A consists of rows a_i = (1, z_i^(1), ..., z_i^(k)), each replicated n_i times, corresponding to the n_i replicate observations at the observed value z_i of the covariates z^(1), ..., z^(k). If it is subsequently thought that covariates z^(k+1), ..., z^(k+r) should also have been taken into account, a suitable elaboration might take the form of an extended regression model

  p(x) = INT N_n(x | A theta + B gamma, lambda I_n) dQ(theta, gamma, lambda),

where B consists of rows b_i = (z_i^(k+1), ..., z_i^(k+r)), replicated n_i times, and gamma = (gamma_1, ..., gamma_r) denotes the regression coefficients of the additional regressor variables. The value gamma = 0 corresponds to the "first thought of" model. Other Bayesian approaches to the problem of covariate selection include Bernardo and Bermúdez (1985), Mitchell and Beauchamp (1988) and George and McCulloch (1993a).

In all these cases, an initially considered representation of the form

  p(x) = INT p(x | phi) dQ(phi)

is replaced by an elaborated representation

  p(x) = INT p(x | phi, gamma) dQ(phi | gamma) dQ*(gamma),

for some Q*, the latter reducing to the original representation on setting the elaboration parameter gamma equal to 0. Inference about such a gamma, imaginatively chosen to reflect interesting possible forms of departure from the original model, often provides a natural basis for checking on the adequacy of an initially proposed model, as well as for learning about the directions in which the model needs extending.
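The power family used in the transformation elaboration above can be sketched as follows (name ours); gamma = 1 leaves the scale unchanged up to a shift, gamma = -1 and gamma = 1/2 correspond to reciprocal and square-root scales, and gamma -> 0 recovers the logarithm:

```python
import math

def power_transform(x, gamma):
    """x^(gamma) = (x**gamma - 1)/gamma for gamma != 0, log(x) for gamma == 0."""
    if gamma == 0:
        return math.log(x)
    return (x ** gamma - 1.0) / gamma
```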
4.7.4 Model Simplification

The process of model elaboration, outlined in the previous section, consists in expanding a "first thought of" model to include additional parameters (and possibly covariates), reflecting features of the situation whose omission from the original model formulation is, on reflection, thought to be possibly injudicious. The process of model simplification is, in a sense, the converse: in reviewing a currently proposed model, we might wonder whether some parameters (or covariates) have been unnecessarily included, in the sense that a simpler form of model might be perfectly adequate. As it stands, this latter consideration is somewhat ill-defined: the "adequacy", or otherwise, of a particular form of belief representation can only be judged in relation to the consequences arising from actions taken on the basis of such beliefs. These and other questions relating to the fundamentally important area of model comparison and model choice will be considered at length in Chapter 6. For the present, it will suffice just to give an indication of some particular forms of model simplification that are routinely considered.

Equality of parameters. In Section 4.6, we analysed the situation where several sequences of observables are judged unrestrictedly infinitely exchangeable, leading to a general representation of the form

  p(x_1(n_1), ..., x_m(n_m)) = INT prod_{i=1}^m p(x_i(n_i) | theta_i) dQ(theta_1, ..., theta_m),

where the parameter theta_i relating to the ith sequence can typically be interpreted as the limit of a suitable summary statistic for the ith sequence. If, on reflection, the simplifying judgement were made that, to all intents and purposes, the labelling of the sequences is irrelevant, and that any combined collection of observables from any or all of the sequences would be completely exchangeable, we would have the representation

  p(x_1(n_1), ..., x_m(n_m)) = INT prod_{i=1}^m p(x_i(n_i) | theta) dQ(theta),

where the same parameter theta now suffices to label the parametric model for each of the sequences. In conventional terminology, the simplified representation is sometimes referred to as the null hypothesis (theta_1 = ... = theta_m = theta) and the original representation as the alternative hypothesis (theta_i != theta_j, for some i != j). However, rather than opt for sure for one or other of these representations, we could take a mixture of the two (with weight pi, say, on the null representation and 1 - pi on the alternative). This form of representation will be considered in more detail in Chapter 6, where it will be shown to provide a possible basis for evaluating the relative plausibility of the "null and alternative hypotheses" in the light of data.

Absence of effects. In Section 4.6, we considered the situation of a structured layout with replicate sequences of observations in each of r x s cells, and a possible parametric model representation involving row effects (alpha_1, ..., alpha_r), column effects (beta_1, ..., beta_s) and interaction effects (gamma_11, ..., gamma_rs). A commonly considered simplifying assumption is that there are no interaction effects (gamma_11 = ... = gamma_rs = 0), so that large sample means in individual cells are just the additive combination of the corresponding large sample row and column means. Further possible simplifying judgements would be that the row (or column) labelling is irrelevant, so that alpha_1 = ... = alpha_r = 0 (or beta_1 = ... = beta_s = 0) and large sample cell means coincide with column (or row) means. Again, conventional terminology would refer to these simplifying judgements as "null hypotheses".

Omission of covariates. Considering, for example, the multiple regression case, described in Example 4.14 of Section 4.6.4 and reconsidered in the previous section on model elaboration, we see that here the simplification process is very clearly just the converse of the elaboration process. If gamma denotes the regression coefficients of the covariates we are considering omitting, then the model corresponding to gamma = 0 provides the required simplification. In fact, in all the cases of elaboration which we considered in the previous section, setting the "elaboration parameter" to 0 provides a natural form of simplification of potential interest. Whether the process of model comparison and choice is seen as one of elaboration or of simplification is then very much a pragmatic issue of whether we begin with a "smaller" model and consider making it "bigger", or vice versa. In any case, issues of model comparison and choice require a separate detailed and extensive treatment, which we defer until Chapter 6.

4.7.5 Prior Distributions

The operational subjectivist approach to modelling views predictive models as representations of beliefs about observables (including limiting, large-sample functions of observables, conventionally referred to as parameters). Invariance and sufficiency considerations have then been shown to justify a structured approach to predictive models in terms of integral mixtures of parametric models with respect to distributions for the labelling parameters. In familiar terminology: we specify a distribution for the observables conditional on unknown parameters (a sampling distribution, defining a likelihood), together with a distribution for the unknown parameters (a prior distribution). It is the combination of prior and likelihood which defines the overall model. From the operational, subjectivist perspective, the specification of a prior distribution for unknown parameters is therefore an essential and unavoidable part of the process of representing beliefs about observables and hence of learning from experience. We are, therefore, in fundamental disagreement with approaches to statistical modelling and analysis which proceed only on the basis of the sampling distribution or likelihood and treat the prior distribution as something optional, irrelevant, or even subversive (see Appendix B). From our standpoint, the two components are inseparable in defining a belief model, and it is meaningless to approach modelling solely in terms of the parametric component, ignoring the prior distribution.
That said. subjectivist perspective. we shall see that the range ofcreative possibilities opened up by the consideration of mixtures. Care must therefore obviously be taken to ensure that prior specifications respect logical or other constraintspertaining to such limits. together with a number of illustrative examples. a general overview of representation strategies. asymptotics. irrelevant. From a practical point of view. Often.8. the specification process will be facilitated by suitable “reparametrisation”. From a conceptual point of view. detailed treatment of specific cases is very much a matter of “methods” rather than “theory” and will be dealt with in the third volume of this series. provides a rich and illuminatingperspective and framework for inference.8 4. However. W are.within which many of the apparent difficulties associated with the precise specification of prior distributions are seen to be of far less significance than is commonly asserted by critics of the Bayesian approach.1 random quantities a p pears in de Finetti ( 1930). in fundamental disagreement with approaches to e statistical modelling and analysis which proceed only on the basis of the sampling distribution or likelihood and treat the prior distribution as something optional. 4.robustness and sensitivity analysis. prior beliefs about parameters typically acquire an operational significance and interpretationas beliefs about limiting (largesample)functions of observables. From the operational. or even subversive (seeAppendix B).8 Discussion and Further References 235 and unavoidable part of the process of representing beliefs about observables and hence of learning from experience. as we have repeatedly stressed throughout this chapter. will be given in the inference context in Chapter 5. it should be readily acknowledged that the process of representing prior beliefs itself involves a number of both conceptual and practical difficulties. 
the concept of exchangeability having been considered earlier by Haag (1924) and also in the early 1930’s by Khintchine (1932). it is meaningless to approach modelling solely in terms of the parametric component and ignoring the prior distribution. and certainly cannot be summarily dealt with in a superficial or glib manner. Extensions to the case of general exchangeable random quantities appear in de Finetti (193711964) and Dynkin (1953). Seminal extensions to more complex forms of symmetry (partial exchangeability) can be found in de Finetti (1938) and Freedman .4. as well as novel and flexible forms of inference reporting. therefore. In particular.1 DISCUSSION AND FURTHER REFERENCES Representation Theorems The original representation theorem for exchangeable 0 . with an abstract analytical version appearing in Hewitt and Savage (1955).
(1962). See Diaconis and Freedman (1980b) and Wechsler (1993) for overviews and generalisations of the concept of exchangeability. Recent and current developments have generated an extensive catalogue of characterisations of distributions via both invariance and sufficiency conditions. Useful reviews are given by Aldous (1985), Ressel (1985) and Diaconis (1988a). Important progress is made in Diaconis and Freedman (1984, 1987, 1988) and, from a rather different perspective, in Lauritzen (1982, 1990) and Küchler and Lauritzen (1989). For related developments from a reliability perspective, see Barlow and Mendel (1992, 1994) and Mendel (1992). The conference proceedings edited by Koch and Spizzichino (1982) also provide a wealth of related material and references.

4.8.2 Subjectivity and Objectivity

Our approach to modelling has been dictated by a subjectivist, operational concern with individual beliefs about (potential) observables. Through judgements of symmetry, partial symmetry, or more complex invariance or sufficiency, we have seen how mixtures over conditionally independent "parameter-labelled" forms arise as typical representations of such beliefs. In contrast, traditional discussion of a statistical model typically refers to the parametric form as "the model", seen as providing the mechanism whereby the observables are generated, these probabilities being determined by the values of the "unknown parameters". Such an approach seeks to make a very clear distinction between the nature of observables and parameters. The parametric model is regarded as part of "objective reality", defining "objective" probabilities for outcomes defined in terms of observables; the "prior", on the other hand, is seen as a "subjective" optional extra, a potential contaminant of the objective statements provided by the parametric model. It is often implicit in such discussion that if the "true" parameter were known, the corresponding parametric form would be the "true" model for the observables, generating them given the "true" parameter. Clearly, this view has little in common with the approach we have systematically followed in this volume: from our standpoint, the parametric model and the prior are actually inseparable in defining a belief model.

There is, however, an interesting sense, even from our standpoint, in which the parametric model and the prior can be seen as having different roles. Instead of viewing these roles as corresponding to an objective/subjective dichotomy, we view them in terms of an intersubjective/subjective dichotomy (following Dawid, 1982b, 1986b). To this end, consider a group of Bayesians, all concerned with their belief distributions for the same sequence of observables. In the absence of any general agreement over assumptions of symmetry, invariance or sufficiency, the individuals are each simply left with their own subjective assessments. However, given some set of common assumptions, the results of this chapter imply that the entire group will structure their beliefs using some common form of mixture representation. Within the mixture, the parametric forms adopted will be the same (the intersubjective component), while the priors for the parameter will differ from individual to individual (the subjective component). Such intersubjective agreement clearly facilitates communication within the group and reduces areas of potential disagreement to just that of different prior judgements for the parameter. As we shall see in Chapter 5, in typical cases judgements about the parameter will tend more towards a consensus as more data are acquired, even if initial judgements were markedly different, so that such a group of Bayesians may eventually come to share very similar beliefs. One might go further and argue that without some element of agreement of this kind there would be great difficulty in obtaining any meaningful form of scientific discussion or possible consensus.

We have noted how this intersubjective/subjective dichotomy illuminates, and puts into perspective, the conventional linguistic separation into "likelihood" (or "sampling model") and "prior" components. We emphasise again, however, that the key element here is intersubjective agreement or consensus. We can find no real role for the idea of objectivity except, perhaps, as a possibly convenient, but potentially dangerously misleading, "shorthand" for intersubjective communality of beliefs.

4.8.3 Critical Issues

We conclude this chapter on modelling with some further comments concerning (i) The Role and Nature of Models, (ii) Structural and Stochastic Assumptions, (iii) Identifiability and (iv) Robustness Considerations.

The Role and Nature of Models

In the approach we have adopted, the fundamental notion of a model is that of a predictive probability specification for observables. The forms of representation theorems we have been discussing provide, if required, a basis for separating out, implicit or explicit, two components: the parametric model and the belief model for the parameters. Non-subjectivist discussions of the role and nature of models in statistical analysis tend to have a rather different emphasis (see, for example, Cox, 1990, and Lehmann, 1990). However, such discussions often end up with a similar message about the importance of models in providing a focused framework to serve as a basis for subsequent identification of areas of agreement and disagreement. In order to think about complex phenomena, one must necessarily work with simplified representations. In any given context, there are typically
a number of different choices of degrees of simplification and idealisation that might be adopted, and these different choices correspond to what Lehmann calls "a reservoir of models". In this context, "particular emphasis is placed on transparent characterisations or descriptions of the models that would facilitate the understanding of when a given model is appropriate" (Lehmann, 1990).

But appropriate for what? Many authors, including Cox and Lehmann, highlight a distinction between what one might call scientific and technological approaches to models. Put very crudely, the essence of the dichotomy is that scientists are assumed to seek explanatory models, which aim at providing insight into and understanding of the "true" mechanisms of the phenomenon under study; whereas technologists are content with empirical models, which are not concerned with the "truth", but simply with providing a reliable basis for practical action in predicting and controlling phenomena of interest. Explanatory modellers take the form of p(x | θ) very seriously; empirical modellers are simply concerned that p(x) "works". For an elaboration of the latter view, see Leonard (1980). The approach we have adopted is compatible with either emphasis.

Whilst we would not dispute that there are typically real differences in motivation and rhetoric between scientists and technologists, it seems to us that theories are always ultimately judged by the predictive power they provide. As we have stressed many times, it is observables which provide the touchstone of experience. When comparing rival belief specifications, all other things being equal, we are intuitively more impressed with the one which consistently assigns higher probabilities to the things that actually happen. If, in fact, a phenomenon is governed by the specific mechanism p(x | θ) with θ = θ_0, a scientist who discovers this and sets p(x) = p(x | θ_0) will certainly have a p(x) that "works". Is there really a meaningful concept of "truth" in this context other than a pragmatic one predicated on p(x)? We shall return to this issue in Chapter 6. We are personally rather sceptical about taking the science versus technology distinction too seriously, but our prejudices are well captured in the adage: "all models are false, but some are useful".

Structural and Stochastic Assumptions

In Section 4.6, we considered several illustrative examples where, in terms of our generic notation, the key role of the parametric model component p(x | θ) was to specify structured forms of expectations for the observables conditional on the parameters, separate from considerations about the complete form of probability specification to be adopted. We recall two examples.
In the case of observables x_ijk in a two-way layout with replications (Section 4.6.3), with parameters corresponding to overall mean, main effects and interactions, we encountered the form

E(x_ijk | μ, α, β, γ) = μ + α_i + β_j + γ_ij;
in the case of a vector of observables x in a multiple regression context with design matrix A (Section 4.6.4, Example 4.14), we encountered the form

E(x | θ, A) = Aθ.
In both of these cases, fundamental explanatory or predictive structure is captured by the specification of the conditional expectation, and this aspect can in many cases be thought through separately from the choice of a particular specification of the full probability distribution.
Identifiability
A parametric model for which an element of the parametrisation is redundant is said to be non-identified. Such models are often introduced at an early stage of model building (particularly in econometrics) in order to include all parameters which may originally be thought to be relevant. Identifiability is a property of the parametric model, but a Bayesian analysis of a non-identified model is always possible if a proper prior on all the parameters is specified. For detailed discussion of this issue, see Morales (1971), Drèze (1974), Kadane (1974), Florens and Mouchart (1986), Hills (1987) and Florens et al. (1990, Section 4.5).
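As a concrete, entirely hypothetical sketch of this point, the following fragment sets up a deliberately non-identified normal model in which only the sum of two location parameters enters the likelihood; with proper priors the posterior is nonetheless a proper distribution, although the data can never pin down the individual components. The model, priors and data are illustrative choices, not taken from the text.

```python
import math

# Hypothetical non-identified model (illustration only):
# x_i | theta1, theta2 ~ Normal(theta1 + theta2, 1); the likelihood
# depends on (theta1, theta2) only through their sum, so the pair is
# non-identified.  Proper N(0, 1) priors still yield a proper posterior.

def posterior_grid(data, lo=-4.0, hi=4.0, m=81):
    step = (hi - lo) / (m - 1)
    grid = [lo + i * step for i in range(m)]
    logpost = {}
    for t1 in grid:
        for t2 in grid:
            lp = -0.5 * (t1 * t1 + t2 * t2)        # log prior, N(0,1) each
            s = t1 + t2                            # likelihood sees only the sum
            lp += sum(-0.5 * (x - s) ** 2 for x in data)
            logpost[(t1, t2)] = lp
    mx = max(logpost.values())
    tot = sum(math.exp(v - mx) for v in logpost.values())
    return {k: math.exp(v - mx) / tot for k, v in logpost.items()}

def marginal_sd(post, component):
    mean = sum(k[component] * w for k, w in post.items())
    var = sum((k[component] - mean) ** 2 * w for k, w in post.items())
    return math.sqrt(var)

data = [1.1, 0.9, 1.0, 1.2, 0.8] * 10      # 50 observations near 1.0
post = posterior_grid(data)
# The posterior is proper (weights sum to 1), yet the marginal posterior
# sd of theta1 cannot shrink towards 0, however much data accumulates.
print(marginal_sd(post, 0))
```

The point of the sketch is exactly the one made in the text: the posterior exists and is proper, but for the non-identified component it remains dominated by the prior.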
Robustness Considerations

For concreteness, in our earlier discussion of these examples we assumed that the p(x | θ) terms were specified in terms of normal distributions. As we demonstrated earlier in this chapter, under the a priori assumption of appropriate invariances, or on the basis of experience with particular applications, such a specification may well be natural and acceptable. However, in many situations the choice of a specific probability distribution may feel a much less "secure" component of the overall modelling process than the choice of conditional expectation structure. For example, past experience might suggest that departures of observables from assumed expectations resemble a symmetric bell-shaped distribution centred around zero. But a number of families of distributions match these general characteristics, including the normal, Student and logistic families. Faced with a seemingly arbitrary choice, what can be done in a situation like this to obtain further insight and guidance? Does the choice matter? Or are subsequent inferences or predictions robust against such choices?
An exactly analogous problem arises with the choice of mathematical specifications for the prior model component. In robustness considerations, theoretical analysis, sometimes referred to as "what if?" analysis, has an interesting role to play. Using the inference machinery which we shall develop in Chapter 5, the desired insight and guidance can often be obtained by studying mathematically the ways in which the various "arbitrary" choices affect subsequent forms of inferences and predictions. For example, a "what if?" analysis might consider the effect of a single, aberrant, outlying observation on inferences for main effects in a multiway layout under the alternative assumptions of a normal or Student parametric model distribution. It can be shown that the influence of the aberrant observation is large under the normal assumption, but negligible under the Student assumption, thus providing a potential basis for preferring one or other of the otherwise seemingly arbitrary choices. More detailed analysis of such robustness issues will be given in Section 5.6.3.
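The normal-versus-Student comparison just described can be sketched numerically. In the following toy computation (illustrative data and a flat grid prior, not the book's own analysis), the posterior mean of a location parameter is computed for the same data, including one aberrant observation, under a normal and under a Student-t (3 degrees of freedom) likelihood.

```python
import math

# Illustrative "what if?" analysis: how much does one outlier move the
# posterior mean of a location parameter under two likelihood choices?

data = [-0.3, 0.1, 0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.0, 8.0]  # 8.0 is aberrant

def log_normal(x, theta):
    # unit-scale normal log-density (up to a constant)
    return -0.5 * (x - theta) ** 2

def log_student3(x, theta):
    # Student-t log-density with 3 degrees of freedom (up to a constant)
    return -2.0 * math.log(1.0 + (x - theta) ** 2 / 3.0)

def posterior_mean(loglik):
    grid = [i / 100.0 for i in range(-300, 901)]       # theta in [-3, 9]
    logs = [sum(loglik(x, t) for x in data) for t in grid]  # flat prior
    mx = max(logs)
    w = [math.exp(v - mx) for v in logs]
    return sum(t * wi for t, wi in zip(grid, w)) / sum(w)

mean_normal = posterior_mean(log_normal)
mean_student = posterior_mean(log_student3)
# The outlier drags the posterior mean under the normal model, but has
# little influence under the heavier-tailed Student model.
print(mean_normal, mean_student)
```

Under the normal likelihood the posterior mean is essentially the sample mean, pulled well away from the bulk of the data by the single outlier; under the Student likelihood it stays close to the clean observations, which is the robustness phenomenon the text appeals to.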
Chapter 5
Inference
Summary
The role of Bayes' theorem in the updating of beliefs about observables in the light of new information is identified and related to conventional mechanisms of predictive and parametric inference. The roles of sufficiency, ancillarity and stopping rules in such inference processes are also examined. Forms of common statistical decisions and inference summaries are introduced and the problems of implementing Bayesian procedures are discussed at length. In particular, conjugate, asymptotic and reference forms of analysis and numerical approximation approaches are detailed.
5.1 THE BAYESIAN PARADIGM

5.1.1 Observables, Beliefs and Models
Our development has focused on the foundational issues which arise when we aspire to formal quantitative coherence in the context of decision making in situations of uncertainty. This development, in combination with an operational approach to the basic concepts, has led us to view the problem of statistical modelling as that of identifying or selecting particular forms of representation of beliefs about observables.
For example, in the case of a sequence x_1, x_2, … of 0-1 random quantities for which beliefs correspond to a judgement of infinite exchangeability, Proposition 4.1 (de Finetti's theorem) identifies the representation of the joint mass function for x_1, …, x_n as having the form

p(x_1, …, x_n) = ∫_0^1 ∏_{i=1}^{n} θ^{x_i} (1 − θ)^{1 − x_i} dQ(θ),

for some choice of distribution Q over the interval [0, 1]. More generally, for sequences of real-valued or integer-valued random quantities x_1, x_2, …, we have seen, in Sections 4.3-4.5, that beliefs which combine judgements of exchangeability with some form of further structure (either in terms of invariance or sufficient statistics) often lead us to work with representations of the form

p(x_1, …, x_n) = ∫ ∏_{i=1}^{n} p(x_i | θ) dQ(θ),
where p(x | θ) denotes a specified form of labelled family of probability distributions and Q is some choice of distribution over ℝ^k. Such representations, and the more complicated forms considered in Section 4.6, exhibit the various ways in which the element of primary significance from the subjectivist, operationalist standpoint, namely the predictive model of beliefs about observables, can be thought of as if constructed from a parametric model together with a prior distribution for the labelling parameter. Our primary concern in this chapter will be with the way in which the updating of beliefs in the light of new information takes place within the framework of such representations.
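The mixture construction just described is easy to illustrate by simulation. In the sketch below (the Beta(2, 2) mixing distribution Q is an arbitrary illustrative choice), 0-1 sequences are generated by first drawing θ from Q and then drawing independent Bernoulli(θ) quantities; the resulting joint distribution is exchangeable, as the near-equal empirical probabilities of permuted outcomes confirm.

```python
import random

# Simulating from a mixture representation: theta ~ Q = Beta(2, 2)
# (an illustrative choice), then x_1, ..., x_n i.i.d. Bernoulli(theta)
# given theta.  Marginally the x's are exchangeable, not independent.

random.seed(42)

def draw_sequence(n):
    theta = random.betavariate(2, 2)   # draw the mixing parameter
    return tuple(1 if random.random() < theta else 0 for _ in range(n))

N = 200_000
counts = {}
for _ in range(N):
    seq = draw_sequence(2)
    counts[seq] = counts.get(seq, 0) + 1

p10 = counts.get((1, 0), 0) / N
p01 = counts.get((0, 1), 0) / N
# Exchangeability: permuted sequences have (up to Monte Carlo error)
# the same probability; here both should be near E[theta(1-theta)] = 0.2.
print(p10, p01)
```

The same construction with any other mixing distribution Q gives the same qualitative result, which is exactly the content of the representation theorem read "generatively".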
5.1.2 The Role of Bayes' Theorem
In its simplest form, within the formal framework of predictive model belief distributions derived from quantitative coherence considerations, the problem corresponds to identifying the joint conditional density of

p(x_{n+1}, …, x_{n+m} | x_1, …, x_n),

for any m ≥ 1, given, for any n ≥ 1, the form of representation of the joint density p(x_1, …, x_{n+m}). In general, of course, this simply reduces to calculating

p(x_{n+1}, …, x_{n+m} | x_1, …, x_n) = p(x_1, …, x_{n+m}) / p(x_1, …, x_n)
and, in the absence of further structure, there is little more that can be said. However, when the predictive model admits a representation in terms of parametric models and prior distributions, the learning process can be essentially identified, in conventional terminology, with the standard parametric form of Bayes' theorem. Thus, for example, if we consider the general parametric form of representation for an exchangeable sequence, with dQ(θ) having density representation p(θ) dθ, we have

p(x_1, …, x_n) = ∫ ∏_{i=1}^{n} p(x_i | θ) p(θ) dθ,

from which it follows that

p(x_{n+1}, …, x_{n+m} | x_1, …, x_n) = ∫ ∏_{i=n+1}^{n+m} p(x_i | θ) p(θ | x_1, …, x_n) dθ,

where

p(θ | x_1, …, x_n) = ∏_{i=1}^{n} p(x_i | θ) p(θ) / ∫ ∏_{i=1}^{n} p(x_i | θ) p(θ) dθ.
This latter relationship is just Bayes' theorem, expressing the posterior density for θ, given x_1, …, x_n, in terms of the parametric model for x_1, …, x_n given θ, and the prior density for θ. The (conditional, or posterior) predictive model for x_{n+1}, …, x_{n+m}, given x_1, …, x_n, is seen to have precisely the same general form of representation as the initial predictive model, except that the corresponding parametric model component is now integrated with respect to the posterior distribution of the parameter, rather than with respect to the prior distribution. We recall from Chapter 4 that, considered as a function of θ,
lik(θ | x_1, …, x_n) = p(x_1, …, x_n | θ)

is usually referred to as the likelihood function. A formal definition of such a concept is, however, problematic; for details, see Bayarri et al. (1988) and Bayarri and DeGroot (1992b).
5.1.3 Predictive and Parametric Inference
Given our operationalist concern with modelling and reporting uncertainty in terms of observables, it is not surprising that Bayes' theorem, in its role as the key to a coherent learning process for parameters, simply appears as a step within the predictive process of passing from

p(x_1, …, x_n) = ∫ p(x_1, …, x_n | θ) p(θ) dθ

to

p(x_{n+1}, …, x_{n+m} | x_1, …, x_n) = ∫ p(x_{n+1}, …, x_{n+m} | θ) p(θ | x_1, …, x_n) dθ

by means of

p(θ | x_1, …, x_n) = p(x_1, …, x_n | θ) p(θ) / p(x_1, …, x_n).

Writing y = {y_1, …, y_m} = {x_{n+1}, …, x_{n+m}} to denote future (or, as yet, unobserved) quantities and x = {x_1, …, x_n} to denote the already observed quantities, these relations may be re-expressed more simply as

p(y | x) = ∫ p(y | θ) p(θ | x) dθ

and

p(θ | x) = p(x | θ) p(θ) / p(x).
However, as we noted on many occasions in Chapter 4, if we proceed purely formally, from an operationalist standpoint it is not at all clear, at first sight, how we should interpret "beliefs about parameters", as represented by p(θ) and p(θ | x), or even whether such "beliefs" have any intrinsic interest. We also answered these questions on many occasions in Chapter 4, by noting that, in all the forms of predictive model representations we considered, the parameters had interpretations as strong law limits of (appropriate functions of) observables. Thus, for example, in the case of the infinitely exchangeable 0-1 sequence (Section 4.3.1), beliefs about θ correspond to beliefs about what the long-run frequency of 1's would be in a future sample; in the context of a real-valued exchangeable sequence with centred spherical symmetry (Section 4.4.1), beliefs about μ and σ², respectively, correspond to beliefs about what the large-sample mean, and the large-sample mean sum of squares about the sample mean, would be in a future sample. Inference about parameters is thus seen to be a limiting form of predictive inference about observables. This means that, although the predictive form is primary, and the role of parametric inference is typically that of an intermediate structural step, parametric inference will often itself be the legitimate end-product of a statistical analysis in situations where interest focuses on quantities which could be viewed as large-sample functions of observables. Either way, parametric inference is of considerable importance for statistical analysis in the context of the models we are mainly concerned with in this volume.
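The operational reading of a parameter as a strong-law limit is easily visualised by simulation. In the sketch below (θ = 0.3 is an arbitrary illustrative value), the sample frequency of 1's in a long simulated 0-1 sequence settles down to θ, which is exactly the sense in which "beliefs about θ" are beliefs about a long-run frequency of observables.

```python
import random

# Strong-law illustration for the 0-1 case: the long-run frequency of
# 1's converges to theta (here an arbitrary illustrative value, 0.3).

random.seed(7)
theta = 0.3
n = 100_000
xs = [1 if random.random() < theta else 0 for _ in range(n)]
freq = sum(xs) / n
print(freq)   # close to theta, by the strong law of large numbers
```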
When a parametric form is involved simply as an intermediate step in the predictive process, we have seen that p(θ | x_1, …, x_n), the full joint posterior density for the parameter vector θ, is all that is required. However, if we are concerned with parametric inference per se, we may be interested in only some subset, φ, of the components of θ, or in some transformed subvector of parameters, g(θ). For example, in the case of a real-valued sequence we may only be interested in the large-sample mean and not in the variance; or, in the case of two 0-1 sequences, we may only be interested in the difference in the long-run frequencies.

In the case of interest in a subvector of θ, let us suppose that the full parameter vector can be partitioned into θ = {φ, λ}, where φ is the subvector of interest, and λ is the complementary subvector of θ, often referred to, in this context, as the vector of nuisance parameters. Since

p(φ, λ | x) = p(x | φ, λ) p(φ, λ) / p(x),

the (marginal) posterior density for φ is given by

p(φ | x) = ∫ p(φ, λ | x) dλ,

where

p(x) = ∫∫ p(x | φ, λ) p(φ, λ) dφ dλ,

with all integrals taken over the full range of possible values of the relevant quantities. Expressed in terms of the notation introduced in Section 3.2.4, marginalisation takes p(φ, λ | x) to p(φ | x).
In some situations, the prior specification p(φ, λ) may be most easily arrived at through the specification of p(λ | φ) p(φ). In such cases, we note that we could first calculate the integrated likelihood for φ,

p(x | φ) = ∫ p(x | φ, λ) p(λ | φ) dλ,

and subsequently proceed without any further need to consider the nuisance parameters, since

p(φ | x) ∝ p(x | φ) p(φ).
In the case where interest is focused on a transformed parameter vector, g(θ), we proceed using standard change-of-variable probability techniques as described in Section 3.2.4. Suppose first that ψ = g(θ) is a one-to-one differentiable transformation of θ. It then follows that

p(ψ | x) = p_θ(g^{-1}(ψ) | x) |J_{g^{-1}}(ψ)|,

where

J_{g^{-1}}(ψ) = ∂g^{-1}(ψ) / ∂ψ

is the Jacobian of the inverse transformation θ = g^{-1}(ψ). Alternatively, by substituting θ = g^{-1}(ψ), we could write p(x | θ) as p(x | ψ) and replace p(θ) by p(g^{-1}(ψ)) |J_{g^{-1}}(ψ)|, to obtain p(ψ | x) = p(x | ψ) p(ψ) / p(x) directly.
If ψ = g(θ) has dimension less than θ, we can typically define y = (ψ, ω) = h(θ), for some ω such that y = h(θ) is a one-to-one differentiable transformation, and then proceed in two steps. We first obtain

p(ψ, ω | x) = p_θ(h^{-1}(y) | x) |J_{h^{-1}}(y)|,

where

J_{h^{-1}}(y) = ∂h^{-1}(y) / ∂y,

and then marginalise to

p(ψ | x) = ∫ p(ψ, ω | x) dω.
These techniques will be used extensively in later sections of this chapter. In order to keep the presentation of these basic manipulative techniques as simple as possible, we have avoided introducing additional notation for the ranges of possible values of the various parameters. In particular, all integrals have been assumed to be over the full ranges of the possible parameter values. In general, this notational economy will cause no confusion and the parameter ranges will be clear from the context. However, there are situations where specific constraints on parameters are introduced and need to be made explicit in the analysis. In such cases, notation for ranges of parameter values will typically also need to be made explicit. Consider, for example, a parametric model, p(x | θ), together with a prior specification p(θ), θ ∈ Θ, for which the posterior density, suppressing explicit use of Θ, is given by

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ.
Now suppose that it is required to specify the posterior subject to the constraint θ ∈ Θ_0 ⊂ Θ, where ∫_{Θ_0} p(θ) dθ > 0. Defining the constrained prior density by

p_0(θ) = p(θ) / ∫_{Θ_0} p(θ) dθ, for θ ∈ Θ_0,

we obtain, using Bayes' theorem,

p_0(θ | x) = p(x | θ) p_0(θ) / ∫_{Θ_0} p(x | θ) p_0(θ) dθ, for θ ∈ Θ_0.

From this, substituting for p_0(θ) in terms of p(θ) and dividing both numerator and denominator by p(x), we obtain

p_0(θ | x) = p(θ | x) / ∫_{Θ_0} p(θ | x) dθ, for θ ∈ Θ_0,

expressing the constrained posterior in terms of the unconstrained posterior (a result which could, of course, have been obtained by direct, straightforward conditioning). Numerical methods are often necessary to analyse models with constrained parameters; see Gelfand et al. (1992) for the use of Gibbs sampling in this context.
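The identity just derived is straightforward to verify numerically. The fragment below (using an illustrative Beta prior and Bernoulli data) computes the constrained posterior both ways: by renormalising the unconstrained posterior over Θ_0 = [0.5, 1], and by applying Bayes' theorem to the constrained prior; the two routes coincide.

```python
# Constrained posteriors two ways, on a grid (illustrative prior/data):
# Route 1 conditions the unconstrained posterior on theta >= 0.5;
# Route 2 constrains the prior first and then applies Bayes' theorem.

a, b = 2.0, 2.0          # prior Beta(a, b)
n, r = 12, 8             # data: r ones in n Bernoulli trials

grid = [(i + 0.5) / 2000 for i in range(2000)]
step = 1 / 2000

def prior(t):
    return t ** (a - 1) * (1 - t) ** (b - 1)   # up to a constant

def lik(t):
    return t ** r * (1 - t) ** (n - r)

# Route 1: unconstrained posterior, then condition on theta >= 0.5.
post = [lik(t) * prior(t) for t in grid]
z = sum(post) * step
post = [p / z for p in post]
mass = sum(p * step for t, p in zip(grid, post) if t >= 0.5)
route1 = [p / mass if t >= 0.5 else 0.0 for t, p in zip(grid, post)]

# Route 2: constrain the prior first, then apply Bayes' theorem.
cprior = [prior(t) if t >= 0.5 else 0.0 for t in grid]
cpost = [lik(t) * cp for t, cp in zip(grid, cprior)]
z2 = sum(cpost) * step
route2 = [p / z2 for p in cpost]

max_diff = max(abs(u - v) for u, v in zip(route1, route2))
print(max_diff)   # identical up to floating-point error
```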
5.1.4 Sufficiency, Ancillarity and Stopping Rules
The concepts of predictive and parametric sufficient statistics were introduced in Section 4.5.2, and shown to be equivalent within the framework of the kinds of models we are considering in this volume. In particular, it was established that a (minimal) sufficient statistic, t(x), for θ, in the context of a parametric model p(x | θ), can be characterised by either of the conditions

p(θ | x) = p(θ | t(x)), for all p(θ),

or

p(x | t(x), θ) = p(x | t(x)).

The important implication of the concept is that t(x) serves as a sufficient summary of the complete data x in forming any required revision of beliefs. The resulting data reduction often implies considerable simplification in modelling and analysis. In
many cases, the sufficient statistic t(x) can itself be partitioned into two component statistics, t(x) = [a(x), s(x)], such that, for all θ,

p(t(x) | θ) = p(s(x) | a(x), θ) p(a(x)).

It then follows that, for any choice of p(θ),

p(θ | x) = p(θ | t(x)) ∝ p(s(x) | a(x), θ) p(θ),

so that, in the prior to posterior inference process defined by Bayes' theorem, it suffices to use p(s(x) | a(x), θ), rather than p(t(x) | θ), as the likelihood function. This further simplification motivates the following definition.

Definition 5.1. (Ancillary statistic). A statistic, a(x), is said to be ancillary, with respect to θ in a parametric model p(x | θ), if p(a(x) | θ) = p(a(x)) for all values of θ.
Example 5.1. (Bernoulli model). In Example 4.5, we saw that for the Bernoulli parametric model

p(x_1, …, x_n | θ) = θ^{r_n} (1 − θ)^{n − r_n},

which only depends on n and r_n = x_1 + ⋯ + x_n. Thus, t_n = [n, r_n] provides a minimal sufficient statistic, and one may work in terms of the joint probability function p(n, r_n | θ). If we now write

p(n, r_n | θ) = p(r_n | n, θ) p(n | θ)

and make the assumption that, for all n ≥ 1, the mechanism by which the sample size, n, is arrived at does not depend on θ, so that p(n | θ) = p(n), n ≥ 1, we see that n is ancillary for θ, in the sense of Definition 5.1. It follows that prior to posterior inference for θ can therefore proceed on the basis of

p(θ | x) = p(θ | n, r_n) ∝ p(r_n | n, θ) p(θ),

for any choice of p(θ), 0 ≤ θ ≤ 1. From Corollary 4.1, we see that

p(r_n | n, θ) = Bi(r_n | θ, n),
so that inferences in this case can be made as if we had adopted a binomial parametric model. However, if we write

p(n, r_n | θ) = p(n | r_n, θ) p(r_n | θ)

and make the assumption that, for all r_n ≥ 1, termination of sampling is governed by a mechanism for selecting r_n which does not depend on θ, so that p(r_n | θ) = p(r_n), r_n ≥ 1, we see that r_n is ancillary for θ, in the sense of Definition 5.1. It follows that prior to posterior inference for θ can therefore proceed on the basis of

p(θ | x) = p(θ | n, r_n) ∝ p(n | r_n, θ) p(θ),

for any choice of p(θ), 0 < θ ≤ 1. It is easily verified that

p(n | r_n, θ) = Nb(n | θ, r_n)

(see Section 3.2.2), so that inferences in this case can be made as if we had adopted a negative-binomial parametric model. We note, incidentally, that whereas in the binomial case it makes sense to consider p(θ) as specified over 0 ≤ θ ≤ 1, in the negative-binomial case it may only make sense to think of p(θ) as specified over 0 < θ ≤ 1, since p(r_n | θ = 0) = 0 for all r_n ≥ 1. So far as prior to posterior inference for θ is concerned, we note that, for any specified p(θ), and assuming that either p(n | θ) = p(n) or p(r_n | θ) = p(r_n), we obtain

p(θ | x_1, …, x_n) = p(θ | n, r_n) ∝ θ^{r_n} (1 − θ)^{n − r_n} p(θ),

since, considered as functions of θ,

p(r_n | n, θ) ∝ p(n | r_n, θ) ∝ θ^{r_n} (1 − θ)^{n − r_n}.
The last part of the above example illustrates a general fact about the mechanism of parametric Bayesian inference which is trivially obvious; namely, for any specified p(θ), if the likelihood functions p_1(x_1 | θ), p_2(x_2 | θ) are proportional as functions of θ, the resulting posterior densities for θ are identical. It turns out, as we shall see in Appendix B, that many non-Bayesian inference procedures do not lead to identical inferences when applied to such proportional likelihoods. The assertion that they should, the so-called Likelihood Principle, is therefore a controversial issue among statisticians. In contrast, in the Bayesian inference context described above, this is a straightforward consequence of Bayes' theorem, rather than an imposed "principle". Note, however, that the above remarks are predicated on a specified p(θ). It may be, of course, that knowledge of the particular sampling mechanism employed has implications for the specification of p(θ), as illustrated, for example, by the comment above concerning negative-binomial sampling and the restriction to 0 < θ ≤ 1.
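The proportional-likelihoods fact can be checked directly. In the fragment below (with illustrative counts and a uniform prior), the binomial likelihood C(n, r) θ^r (1 − θ)^{n−r} and the negative-binomial likelihood C(n−1, r−1) θ^r (1 − θ)^{n−r} differ only by constants, so the normalised posteriors computed on a grid are identical.

```python
import math

# Proportional likelihoods yield identical posteriors: binomial
# (n fixed) versus negative-binomial (r fixed) sampling for the same
# observed counts, under the same (here uniform, illustrative) prior.

n, r = 10, 3
grid = [(i + 0.5) / 1000 for i in range(1000)]

def posterior(lik_const):
    unnorm = [lik_const * t ** r * (1 - t) ** (n - r) for t in grid]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post_bin = posterior(math.comb(n, r))          # binomial constant C(n, r)
post_nb = posterior(math.comb(n - 1, r - 1))   # negative-binomial C(n-1, r-1)

max_diff = max(abs(a - b) for a, b in zip(post_bin, post_nb))
print(max_diff)   # zero up to floating-point rounding
```

The sampling-mechanism constant cancels on normalisation, which is the computational face of the Likelihood Principle as it appears within Bayes' theorem.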
Although the likelihood principle is implicit in Bayesian statistics, it was developed as a separate principle by Barnard (1949), and became a focus of interest when Birnbaum (1962) showed that it followed from the widely accepted sufficiency and conditionality principles. Berger and Wolpert (1984/1988) provide an extensive discussion of the likelihood principle and related issues. Other relevant references are Barnard et al. (1962), Fraser (1963), Pratt (1963), Barnard (1967), Hartigan (1967), Birnbaum (1968, 1978), Durbin (1970), Basu (1975), Dawid (1983a), Joshi (1983), Berger (1985b), Hill (1987) and Bayarri et al. (1988).

Example 5.1 illustrates the way in which ancillary statistics often arise naturally as a consequence of the way in which data are collected. In general, it is very often the case that the sample size, n, is fixed in advance and that inferences are automatically made conditional on n, without further reflection. It is, however, perhaps not obvious that inferences can be made conditional on n if the latter has arisen as a result of such familiar imperatives as "stop collecting data when you feel tired", or "when the research budget runs out". The kind of analysis given above makes it intuitively clear that such conditioning is, in fact, valid, provided that the mechanism which has led to n "does not depend on θ". This latter condition may, however, not always be immediately obviously transparent, and the following definition provides one version of a more formal framework for considering sampling mechanisms and their dependence on model parameters.
Definition 5.2. (Stopping rule). A stopping rule, h, for (sequential) sampling from a sequence of observables x_1 ∈ X_1, x_2 ∈ X_2, …, is a sequence of functions h_n : X_1 × ⋯ × X_n → [0, 1], such that, if x_(n) = (x_1, …, x_n) is observed, then sampling is terminated with probability h_n(x_(n)); otherwise, the (n + 1)th observation is made. A stopping rule is proper if the induced probability distribution p_h(n), n = 1, 2, …, for final sample size guarantees that the latter is finite. The rule is deterministic if h_n(x_(n)) ∈ {0, 1} for all (n, x_(n)); otherwise, it is a randomised stopping rule.
In general, we must regard the data resulting from a sampling mechanism defined by a stopping rule h as consisting of (n, x_(n)), the sample size together with the observed quantities x_1, …, x_n. A parametric model for these data thus involves a probability density of the form p(n, x_(n) | h, θ), conditioning both on the stopping rule (i.e., sampling mechanism) and on an underlying labelling parameter θ. But, either through unawareness or misapprehension, this is typically ignored and, instead, we act as if the actual observed sample size n had been fixed in advance, in effect assuming that
p(n, x_(n) | h, θ) = p(x_(n) | n, θ) = p(x_(n) | θ),
using the standard notation we have hitherto adopted for fixed n. The important question that now arises is the following: under what circumstances, if any, can
we proceed to make inferences about θ on the basis of this (generally erroneous!) assumption, without considering explicit conditioning on the actual form of h? Let us first consider a simple example.

Example 5.2. ("Biased" stopping rule for a Bernoulli sequence). Suppose that x_1, x_2, …, given θ, may be regarded as a sequence of independent Bernoulli random quantities with p(x_i = 1 | θ) = θ, and that a sequential sample is to be obtained using the deterministic stopping rule h, defined by: h_1(1) = 1, h_1(0) = 0, h_2(x_1, x_2) = 1 for all x_1, x_2. In other words, if there is a success on the first trial, sampling is terminated (resulting in n = 1, x_1 = 1); otherwise, two observations are obtained (resulting in either n = 2, x_1 = 0, x_2 = 0 or n = 2, x_1 = 0, x_2 = 1). At first sight, it might appear essential to take explicit account of h in making inferences about θ, since the sampling procedure seems designed to bias us towards believing in large values of θ. Consider, however, the following detailed analysis:

p(n = 1, x_1 = 1 | h, θ) = p(x_1 = 1 | θ),

p(n = 2, x_1 = 0, x_2 = x | h, θ) = p(x_2 = x | x_1 = 0, θ) p(x_1 = 0 | θ) = p(x_1 = 0, x_2 = x | θ), for x = 0, 1.

Thus, for all (n, x_(n)) having non-zero probability, p(n, x_(n) | h, θ) = p(x_(n) | θ), and it then follows trivially from Bayes' theorem that, for any specified p(θ), inferences for θ based on assuming n to have been fixed at its observed value will be identical to those based on a likelihood derived from explicit consideration of h.

Consider now a randomised version of this stopping rule, defined by h_1(1) = π, h_1(0) = 0, h_2(x_1, x_2) = 1 for all x_1, x_2. In this case, we have

p(n = 1, x_1 = 1 | h, θ) = π θ,

p(n = 2, x_1 = 1, x_2 = x | h, θ) = (1 − π) θ θ^x (1 − θ)^{1−x},

p(n = 2, x_1 = 0, x_2 = x | h, θ) = (1 − θ) θ^x (1 − θ)^{1−x},

the latter considered pointwise as functions of θ (i.e., likelihoods). In each case, the likelihood is proportional, as a function of θ, to that of the corresponding fixed-sample-size model, so that the proportionality of the likelihoods once more implies identical inferences from Bayes' theorem.
Similarly, for x = 0, 1,

p(n = 2, x1 = 0, x2 = x | h, θ) = θ^x (1 − θ)^(1−x) (1 − θ),
p(n = 2, x1 = 1, x2 = x | h, θ) = (1 − π) θ · θ^x (1 − θ)^(1−x).

Thus, for all (n, x1, ..., xn) having nonzero probability, we again find that p(n, x1, ..., xn | h, θ) ∝ p(x1, ..., xn | θ) as functions of θ, so that the proportionality of the likelihoods once more implies identical inferences from Bayes' theorem, for any given p(θ).

The analysis of the preceding example showed, perhaps contrary to intuition, that the stopping rule does not lead to a different likelihood from that of the a priori fixed sample size. The following, rather trivial, proposition makes clear that this is true for all stopping rules as defined in Definition 5.2, which we might therefore describe as "likelihood noninformative stopping rules".

Proposition 5.1. (Stopping rules are likelihood noninformative). For any stopping rule h for (sequential) sampling from a sequence of observables x1, x2, ..., defined by the fixed sample size parametric model p(x1, ..., xn | θ), θ ∈ Θ,

p(n, x(n) | h, θ) ∝ p(x(n) | θ), as functions of θ,

for all (n, x(n)) such that p(n, x(n) | h, θ) ≠ 0.

Proof. This follows straightforwardly on noting that

p(n, x(n) | h, θ) = [ ∏(m=1 to n−1) (1 − h_m(x(m))) ] h_n(x(n)) p(x(n) | θ)

and that the term in square brackets does not depend on θ.

Again, it is a trivial consequence of Bayes' theorem that, for any specified prior density p(θ), prior to posterior inference for θ given data (n, x(n)) obtained using a likelihood noninformative stopping rule h can proceed by acting as if x(n) were obtained using a fixed sample size n.
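The Bernoulli stopping-rule example above can be checked numerically. The sketch below (our illustration, not the book's; the function names are ours) enumerates the only outcomes the deterministic rule allows and confirms both that their probabilities sum to one and that each equals the fixed-sample-size Bernoulli likelihood, which is the content of the proposition in this special case.

```python
# Sketch (not from the book): the deterministic stopping rule stops after one
# trial on a success, otherwise takes exactly two trials. For every attainable
# outcome, p(n, x | h, theta) equals the fixed-n Bernoulli likelihood, because
# the stopping factors are all equal to 1.

def outcome_probability(xs, theta):
    """p(n, x | h, theta): the stopping factor is 1 for this deterministic
    rule, so only the Bernoulli terms along the realised sequence remain."""
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else (1.0 - theta)
    return p

def bernoulli_likelihood(xs, theta):
    """Fixed sample size Bernoulli likelihood for the same data."""
    r, n = sum(xs), len(xs)
    return theta ** r * (1.0 - theta) ** (n - r)

theta = 0.3
attainable = [(1,), (0, 0), (0, 1)]   # the only outcomes the rule allows
total = sum(outcome_probability(xs, theta) for xs in attainable)
```

The three outcome probabilities are θ, (1 − θ)² and (1 − θ)θ, which sum to one for any θ.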
However, a notationally precise rendering of Bayes' theorem reveals that knowledge of h might well affect the specification of the prior density! It is for this reason that we use the term "likelihood noninformative" rather than just "noninformative" stopping rules. It cannot be emphasised too often that, although it is often convenient for expository reasons to focus at a given juncture on one or other of the "likelihood" and "prior" components of the model, our discussion in Chapter 4 makes clear their basic inseparability in coherent modelling and analysis of beliefs. This issue is highlighted in the following example.

Example 5.3. ("Biased" stopping rule for a normal mean). Suppose that x1, x2, ..., given θ, may be regarded as a sequence of independent normal random quantities with p(xi | θ) = N(xi | θ, 1), θ ∈ ℝ. Suppose further that an investigator has a particular concern with the parameter value θ = 0 and wants to stop sampling if the sample mean x̄n = n^(−1)(x1 + ... + xn) ever takes on a value that is "unlikely", assuming θ = 0 to be true. For any fixed sample size n, p(x̄n | θ) = N(x̄n | θ, n), and, if "unlikely" is interpreted as "an event having probability less than or equal to α", a possible stopping rule might be: stop at the first n for which

|x̄n| > k(α) n^(−1/2),

for suitable k(α) (for example, k = 1.96 for α = 0.05, k = 2.57 for α = 0.01, k = 3.31 for α = 0.001). It can be shown, using the law of the iterated logarithm (see, for example, Section 3.2), that this is a proper stopping rule, so that termination will certainly occur for some finite n. It follows that h is a likelihood noninformative stopping rule. Now consider prior to posterior inference for θ, given data (n, x1, ..., xn) yielded by this stopping rule, where, for illustration, we assume the prior specification p(θ) = N(θ | μ, λ0), with precision λ0 ≈ 0, to be interpreted as indicating extremely vague prior beliefs about θ.
Since h is a likelihood noninformative stopping rule and, by virtue of the sufficiency of (n, x̄n) for the normal parametric model, we have

p(θ | x1, ..., xn, h) = p(θ | n, x̄n) ∝ N(x̄n | θ, n) N(θ | μ, λ0).

The right-hand side is easily seen to be proportional to exp{−(1/2) Q(θ)}, where

Q(θ) = (n + λ0) [θ − (n x̄n + λ0 μ)/(n + λ0)]² + constant,

which implies that

p(θ | n, x̄n) = N(θ | (n + λ0)^(−1)(n x̄n + λ0 μ), n + λ0).

One consequence of this vague prior specification is that, having observed (n, x̄n), we have p(θ | n, x̄n) ≈ N(θ | x̄n, n) for λ0 ≈ 0, and we are led to the posterior probability statement

P[ θ ∈ ( x̄n − 1.96 n^(−1/2), x̄n + 1.96 n^(−1/2) ) | n, x̄n ] = 0.95.

But the stopping rule h ensures that |x̄n| > k(α) n^(−1/2), so that, for α ≤ 0.05, the value θ = 0 certainly does not lie in the posterior interval to which someone with initially very vague beliefs would attach a high probability. An investigator knowing θ = 0 to be the true value can therefore, by using this stopping rule, mislead someone who, unaware of the stopping rule, acts as if initially very vague.

However, let us now consider an analysis which takes into account the stopping rule. The nature of h might suggest a prior specification p(θ | h) that recognises θ = 0 as a possibly "special" parameter value, which should be assigned nonzero prior probability (rather than the zero probability resulting from any continuous prior density specification). As an illustration, suppose that we specify

p(θ | h) = π 1{θ=0}(θ) + (1 − π) 1{θ≠0}(θ) N(θ | 0, λ1),

which assigns a "spike" of probability π to the special value θ = 0 and assigns 1 − π times a N(θ | 0, λ1) density to the range θ ≠ 0, with λ1 small to reflect vagueness away from zero. Since h is likelihood noninformative and (n, x̄n) are sufficient statistics for the normal parametric model, the complete posterior p(θ | n, x̄n, h) is given by

p(θ | n, x̄n, h) = π* 1{θ=0}(θ) + (1 − π*) 1{θ≠0}(θ) N(θ | (n + λ1)^(−1) n x̄n, n + λ1),

where it is easily verified that

π* = π N(x̄n | 0, n) / [ π N(x̄n | 0, n) + (1 − π) N(x̄n | 0, n λ1 (n + λ1)^(−1)) ].

The posterior distribution thus assigns a "spike" π* to θ = 0 and assigns 1 − π* times a N(θ | (n + λ1)^(−1) n x̄n, n + λ1) density to the range θ ≠ 0. For qualitative insight, consider the case where actually θ = 0 and α has been chosen to be very small, so that k(α) is quite large. In that case n is likely to be very large and, at the stopping point, x̄n² ≈ k²(α)/n, so that for large n the resulting π* is close to 1: knowing the stopping rule and then observing that it results in a large sample size leads to an increasing conviction that θ = 0. On the other hand, if θ is appreciably different from 0, n will tend to be small and the posterior will be dominated by the N(θ | (n + λ1)^(−1) n x̄n, n + λ1) component. The behaviour of this posterior density, derived from a prior taking account of h, is clearly very different from that of the posterior density based on a vague prior taking no account of the stopping rule.

5.1.5 Decisions and Inference Summaries

In Chapter 2, we made clear that our central concern is the representation and revision of beliefs as the basis for decisions. Either beliefs are to be used directly in the choice of an action, or are to be recorded or reported in some selected form, with the possibility or intention of subsequently guiding the choice of a future action. With slightly revised notation and terminology, we recall from Chapters 2 and 3 the elements and procedures required for coherent, quantitative decision-making. The elements of a decision problem in the inference context are:

(i) a ∈ A, available "answers" to the inference problem;
(ii) ω ∈ Ω, unknown states of the world;
(iii) u : A × Ω → ℝ, a function attaching utilities to each consequence (a, ω) of a decision to summarise inference in the form of an "answer", a, and an ensuing state of the world, ω;
(iv) p(ω), a specification, in the form of a probability distribution, of current beliefs about the possible states of the world.

The optimal choice of answer to an inference problem is an a ∈ A which maximises the expected utility,

∫ u(a, ω) p(ω) dω.

Alternatively, if instead of working with u(a, ω) we work with a so-called loss function,

l(a, ω) = f(ω) − u(a, ω),

where f is an arbitrary, fixed function, the optimal choice of answer is an a ∈ A which minimises the expected loss,

∫ l(a, ω) p(ω) dω.

Throughout, we shall have in mind the context of parametric and predictive inference, where the unknown states of the world are parameters or future data values (observables), and current beliefs, p(ω), typically reduce to one or other of the familiar forms:

p(θ), initial beliefs about a parameter vector θ;
p(θ | z), beliefs about θ, given data z;
p(φ | z), beliefs about φ = g(θ), given data z;
p(y | z), beliefs about future data y, given data z.

It is clear from the forms of the expected utilities or losses which have to be calculated in order to choose an optimal answer that, if beliefs about unknown states of the world are to provide an appropriate basis for future decision making, where, as yet, A and u (or l) may be unspecified, we need to report the complete belief distribution p(ω). However, if an immediate application to a particular decision problem, with specified A and u (or l), is all that is required, the optimal answer (maximising the expected utility or minimising the expected loss) may turn out to involve only limited, specific features of the belief distribution, so that these "summaries" of the full distribution suffice for decision-making purposes. In the following headed subsections, we shall illustrate and discuss some of these commonly used forms of summary.
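The elements (i)-(iv) can be sketched in a toy discrete problem (our own illustration, not the book's; the answer and state labels are hypothetical): with a finite answer space and a discrete belief distribution, the optimal answer is found by direct maximisation of expected utility.

```python
# Sketch (ours, not the book's): choosing the Bayes-optimal "answer" a in A
# by maximising the expected utility sum_w u(a, w) p(w) over a discrete
# belief distribution p(w).

def expected_utility(a, utility, belief):
    """E[u(a, w)] under the belief distribution p(w)."""
    return sum(utility[a][w] * pw for w, pw in belief.items())

def optimal_answer(answers, utility, belief):
    """The optimal answer maximises expected utility."""
    return max(answers, key=lambda a: expected_utility(a, utility, belief))

# hypothetical two-state, two-answer problem
belief = {"w0": 0.7, "w1": 0.3}
utility = {
    "a0": {"w0": 1.0, "w1": 0.0},   # a0 is the right answer when w0 holds
    "a1": {"w0": 0.0, "w1": 1.0},   # a1 is the right answer when w1 holds
}
best = optimal_answer(["a0", "a1"], utility, belief)
```

Working with a loss l(a, ω) = f(ω) − u(a, ω) and minimising expected loss would select the same answer, since f does not depend on a.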
Point Estimates

In cases where ω ∈ Ω corresponds to an unknown quantity, so that Ω is ℝ, or ℝ⁺, or ℝᵏ, or ℝ × ℝ⁺, etc., and the required answer, a ∈ A, is an estimate of the true value of ω (so that A = Ω), the corresponding decision problem is typically referred to as one of point estimation. If ω = θ or ω = g(θ), we refer to parametric point estimation; if ω = y, we refer to predictive point estimation. Since one is almost certain not to get the answer exactly right in an estimation problem, statisticians typically work directly with the loss function concept, rather than with the utility function. Direct intuition suggests that, in the one-dimensional case, distributional summaries such as the mean, median or mode of p(ω) could be reasonable point estimates of a random quantity ω. Clearly, however, these could differ considerably, and more formal guidance may be required as to when and why particular functionals of the belief distribution are justified as point estimates. This is provided by the following definition and result.

Definition 5.3. (Bayes estimate). A Bayes estimate of ω with respect to the loss function l(a, ω) and the belief distribution p(ω) is an a ∈ A = Ω which minimises ∫ l(a, ω) p(ω) dω.

A point estimation problem is thus completely defined once A = Ω and l(a, ω) are specified.

Proposition 5.2. (Forms of Bayes estimates).

(i) If A = Ω = ℝᵏ and l(a, ω) = (a − ω)ᵗ H (a − ω), with H symmetric definite positive, the Bayes estimate satisfies Ha = H E(ω). If H^(−1) exists, a = E(ω), and so the Bayes estimate with respect to quadratic form loss is the mean of p(ω), assuming the mean to exist.

(ii) If A = Ω = ℝ and l(a, ω) = c1 (a − ω) 1{ω ≤ a} + c2 (ω − a) 1{ω > a}, the Bayes estimate with respect to linear loss is the quantile such that P(ω ≤ a) = c2/(c1 + c2). If c1 = c2, the right-hand side equals 1/2, and so the Bayes estimate with respect to absolute value loss is a median of p(ω).

(iii) If A = Ω ⊆ ℝᵏ and l(a, ω) = 1 − 1{B_ε(a)}(ω), where B_ε(a) is a ball of radius ε in Ω centred at a, then, as ε → 0, the function to be maximised tends to p(a), and so the Bayes estimate with respect to zero-one loss is a mode of p(ω), assuming a mode to exist.
Proof. Differentiating ∫ (a − ω)ᵗ H (a − ω) p(ω) dω with respect to a and equating to zero yields

2H ∫ (a − ω) p(ω) dω = 0,

which establishes (i). Since

∫ l(a, ω) p(ω) dω = c1 ∫_{ω ≤ a} (a − ω) p(ω) dω + c2 ∫_{ω > a} (ω − a) p(ω) dω,

differentiating with respect to a and equating to zero yields

c1 ∫_{ω ≤ a} p(ω) dω = c2 ∫_{ω > a} p(ω) dω,

whence, adding c2 ∫_{ω ≤ a} p(ω) dω to each side, (c1 + c2) P(ω ≤ a) = c2, and we obtain (ii). Finally, since

∫ [1 − 1{B_ε(a)}(ω)] p(ω) dω = 1 − ∫_{B_ε(a)} p(ω) dω,

this is minimised when ∫_{B_ε(a)} p(ω) dω is maximised, and, as ε → 0, a continuity argument yields (iii).

Further insight into the nature of case (iii) can be obtained by thinking of a unimodal, continuous p(ω) in one dimension. It is then immediate, by a continuity argument, that a should be chosen such that p(a − ε) = p(a + ε). In the case of a unimodal, symmetric belief distribution p(ω), for a single random quantity ω, the mean, median and mode coincide. In general, however, for unimodal, positively skewed, densities we have the relation

mean > median > mode,

and the difference can be substantial if p(ω) is markedly skew. Unless, therefore, there is a very clear need for a point estimate, and a strong rationale for a specific one of the loss functions considered in Proposition 5.2, the provision of a single number to summarise p(ω) may be extremely misleading as a summary of the information available about ω. Of course, such a comment acquires even greater force if p(ω) is multimodal or otherwise "irregular". For further discussion of Bayes estimators, see, for example, Sacks (1963), DeGroot and Rao (1963, 1966), Farrell (1964), Brown (1973), Tiao and Box (1974), Berger and Srinivasan (1978), Berger (1979, 1986), Hwang (1985, 1988), de la Horra (1987, 1988, 1992b), Ghosh (1992a, 1992b), Irony (1992) and Spall and Maryak (1992).
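The mean > median > mode ordering for a positively skewed density is easy to check numerically. The sketch below (ours, not the book's) approximates a Ga(ω | 3, 1) density on a grid using only the standard library and reads off the three summaries; analytically the mean is 3 and the mode is 2, with the median lying between them.

```python
import math

# Sketch (not from the book): for the positively skewed Ga(w | 3, 1) density,
# approximate the mean, median and mode on a grid and observe the ordering
# mean > median > mode.

def gamma_pdf(x, alpha=3.0, beta=1.0):
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

dx = 0.001
xs = [i * dx for i in range(1, 40000)]        # grid over (0, 40)
ps = [gamma_pdf(x) for x in xs]
total = sum(ps) * dx                          # ~1, used to normalise

mean = sum(x * p for x, p in zip(xs, ps)) * dx / total

cum, median = 0.0, None
for x, p in zip(xs, ps):
    cum += p * dx / total
    if cum >= 0.5:
        median = x                            # first grid point past half the mass
        break

mode = max(zip(xs, ps), key=lambda t: t[1])[0]   # grid argmax of the density
```

For a symmetric unimodal density the three summaries would coincide, which is why the choice of loss function matters most for skewed beliefs.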
Credible Regions

We have emphasised that, from a theoretical perspective, uncertainty about an unknown quantity of interest needs to be communicated in the form of the full (prior, posterior or predictive) density, p(ω), if formal calculation of expected loss or utility is to be possible for any arbitrary future decision problem. In practice, however, p(ω) may be a somewhat complicated entity and it may be both more convenient, and also sufficient for general orientation regarding the uncertainty about ω, simply to describe regions C ⊆ Ω of given probability under p(ω). Thus, for example, in the case where Ω ⊆ ℝ, the identification of intervals containing 50%, 90%, 95% or 99% of the probability under p(ω) might suffice to give a good idea of the general quantitative messages implicit in p(ω). This is the intuitive basis of popular graphical representations of univariate distributions such as box plots.

Definition 5.4. (Credible region). A region C ⊆ Ω such that

∫_C p(ω) dω = 1 − α

is said to be a 100(1 − α)% credible region for ω, with respect to p(ω). If Ω ⊆ ℝ, connected credible regions will be referred to as credible intervals. If p(ω) is a (prior-posterior-predictive) density, we refer to (prior-posterior-predictive) credible regions.

Clearly, for any given α, there is not a unique credible region, even if we restrict attention to connected regions. For given p(ω) and fixed α, the problem of choosing among the subsets C ⊆ Ω such that ∫_C p(ω) dω = 1 − α could be viewed as a decision problem, provided that we are willing to specify a loss function, l(C, ω), reflecting the possible consequences of quoting the 100(1 − α)% credible region C. We now describe the resulting form of credible region when a loss function is used which encapsulates the intuitive idea that, for given α, we would prefer to report a credible region C whose size ‖C‖ (volume, area, length) is minimised.

Proposition 5.3. (Minimal size credible regions). Let p(ω) be a probability density for ω ∈ Ω, almost everywhere continuous; given α, 0 < α < 1, if A = {C; P(ω ∈ C) = 1 − α} ≠ ∅ and

l(C, ω) = k‖C‖ − 1_C(ω), k > 0, C ∈ A,

then C is optimal if and only if it has the property that p(ω1) ≥ p(ω2) for all ω1 ∈ C, ω2 ∉ C (except possibly for a subset of Ω of zero probability).

Proof. For any C ∈ A, ∫ l(C, ω) p(ω) dω = k‖C‖ − (1 − α), so that an optimal C must have minimal size. If C has the stated property and D is any other region belonging to A, then, since C = (C ∩ D) ∪ (C ∩ Dᶜ), D = (C ∩ D) ∪ (Cᶜ ∩ D) and P(ω ∈ C) = P(ω ∈ D), we have

inf_{ω ∈ C ∩ Dᶜ} p(ω) ‖C ∩ Dᶜ‖ ≤ ∫_{C ∩ Dᶜ} p(ω) dω = ∫_{Cᶜ ∩ D} p(ω) dω ≤ sup_{ω ∈ Cᶜ ∩ D} p(ω) ‖Cᶜ ∩ D‖,

with inf_{ω ∈ C ∩ Dᶜ} p(ω) ≥ sup_{ω ∈ Cᶜ ∩ D} p(ω), so that ‖C ∩ Dᶜ‖ ≤ ‖Cᶜ ∩ D‖, and hence ‖C‖ ≤ ‖D‖. If C does not have the stated property, there exists A0 ⊆ C such that, for all ω1 ∈ A0, there exists ω2 ∉ C with p(ω2) > p(ω1). Let B ⊆ Cᶜ be such that P(ω ∈ A0) = P(ω ∈ B) and p(ω2) > p(ω1) for all ω2 ∈ B and ω1 ∈ A0, and define D = (C ∩ A0ᶜ) ∪ B. Then D ∈ A and, by a similar argument to that given above, the result follows by showing that ‖D‖ < ‖C‖.

The property of Proposition 5.3 is worth emphasising in the form of a definition (Box and Tiao, 1965).

Definition 5.5. (Highest probability density (HPD) regions). A region C ⊆ Ω is said to be a 100(1 − α)% highest probability density region for ω, with respect to p(ω), if

(i) P(ω ∈ C) = 1 − α;
(ii) p(ω1) ≥ p(ω2) for all ω1 ∈ C and ω2 ∉ C, except possibly for a subset of Ω having probability zero.

If p(ω) is a (prior-posterior-predictive) density, we refer to highest (prior-posterior-predictive) density regions. Clearly, the credible region approach to summarising p(ω) is not particularly useful in the case of discrete Ω, since such regions will only exist for limited choices of α. The above development should therefore be understood as intended for the case of continuous Ω.

For a number of commonly occurring univariate forms of p(ω), there exist tables which facilitate the identification of HPD intervals for a range of values of α (see, for example, Isaacs et al., 1974, Ferrandiz and Sendra, 1982, and Lindley and Scott, 1985).

Figure 5.1a: ω0 almost as "plausible" as all ω ∈ C. Figure 5.1b: ω0 much less "plausible" than most ω ∈ C.

In general, however, the derivation of an HPD region requires numerical calculation.
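Such a numerical calculation can be sketched as follows (our own illustration, not the book's): for a unimodal skewed density, keep grid points of highest density until they carry the required mass; the resulting interval is shorter than the corresponding equal-tails credible interval.

```python
import math

# Sketch (not from the book): numerical 95% HPD interval for the skewed
# density Ga(w | 3, 1), obtained by accumulating grid points in decreasing
# order of density, compared with the 95% equal-tails interval.

def gamma_pdf(x, alpha=3.0, beta=1.0):
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

dx = 0.001
xs = [i * dx for i in range(1, 40000)]
ps = [gamma_pdf(x) for x in xs]
norm = sum(ps) * dx

# HPD: take grid points in decreasing order of density until mass >= 0.95
order = sorted(range(len(xs)), key=lambda i: -ps[i])
mass, kept = 0.0, []
for i in order:
    kept.append(i)
    mass += ps[i] * dx / norm
    if mass >= 0.95:
        break
hpd_lo, hpd_hi = xs[min(kept)], xs[max(kept)]

# equal tails: 2.5% and 97.5% quantiles from the cumulative sums
cum, q025, q975 = 0.0, None, None
for x, p in zip(xs, ps):
    cum += p * dx / norm
    if q025 is None and cum >= 0.025:
        q025 = x
    if q975 is None and cum >= 0.975:
        q975 = x
        break
```

For a unimodal density the kept points form an interval containing the mode, and the HPD construction is exactly the "horizontal cut" through the density implied by Definition 5.5.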
Moreover, particularly if p(ω) does not exhibit markedly skewed behaviour, it may be satisfactory in practice to quote some more simply calculated credible region. For example, in the univariate case, conventional statistical tables facilitate the identification of intervals which exclude equiprobable tails of p(ω) for many standard distributions.

In cases where an HPD credible region C is pragmatically acceptable as a crude summary of the density p(ω), then, particularly for small values of α (for example, α = 0.05, 0.01), a specific value ω0 ∈ Ω will tend to be regarded as somewhat "implausible" if ω0 ∉ C. This, of course, provides no justification for actions such as "rejecting the hypothesis that ω = ω0". If we wish to consider such actions, we must formulate a proper decision problem, specifying alternative actions and the losses consequent on correct and incorrect actions. For the present, it will suffice to note (as illustrated in Figure 5.1) that even the intuitive notion of "implausibility if ω0 ∉ C" depends much more on the complete characterisation of p(ω) than on an either-or assessment based on an HPD region.

Although an appropriately chosen selection of credible regions can serve to give a useful summary of p(ω) when we focus just on the quantity ω, there is a fundamental difficulty which prevents such regions serving, in general, as a proxy for the actual density p(ω). The problem is that of lack of invariance under parameter transformation. Even if ν = g(ω) is a one-to-one transformation, it is easy to see that there is no general relation between HPD regions for ω and for ν. In addition, there is no way of identifying a marginal HPD region for a (possibly transformed) subset of components of ω from knowledge of the joint HPD region. For further discussion of credible regions see, for example, Pratt (1961), Aitchison (1964, 1966), Wright (1986) and DasGupta (1991).

Hypothesis Testing

Inferences about a specific hypothesised value ω0 of a random quantity ω in the absence of alternative hypothesised values are often considered in the general statistical literature under the heading of "significance testing". The basic hypothesis testing problem usually considered by statisticians may be described as a decision problem with elements A = {a0, a1}, where θ ∈ Θ = Θ0 ∪ Θ1, with a0 (a1) corresponding to rejecting hypothesis H0 (H1), together with p(ω) and a loss function l(aj, θ), j ∈ {0, 1}, reflecting the relative seriousness of the four possible consequences of correct and incorrect rejection. Typically, however, the main motivation and the principal use of the hypothesis testing framework is in model choice and comparison, an activity which has a somewhat different flavour from decision-making and inference within the context of an accepted model. For this reason, we shall postpone a detailed consideration of the topic until Chapter 6, where we shall provide a much more general perspective on model choice and criticism.
General discussions of Bayesian hypothesis testing are included in Jeffreys (1939/1961), Good (1950, 1965, 1983), Lindley (1957, 1961b, 1965, 1977), Edwards et al. (1963), Pratt (1965), Smith (1965), Farrell (1968), Dickey (1971, 1974, 1977), Lempers (1971), Rubin (1971), Zellner (1971), DeGroot (1973), Leamer (1978), Box (1980), Shafer (1982b), Gilio and Scozzafava (1985), Smith (1986), Berger and Delampady (1987), Berger and Sellke (1987) and Hodges (1990, 1992).

5.1.6 Implementation Issues

Given a likelihood p(x | θ) and prior density p(θ), the starting point for any form of parametric inference summary or decision about θ is the joint posterior density

p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ,

and the starting point for any predictive inference summary or decision about future observables y is the predictive density

p(y | x) = ∫ p(y | θ) p(θ | x) dθ.

It is clear that to form these posterior and predictive densities there is a technical requirement to perform integrations over the range of θ. Moreover, further summarisation, in order to obtain marginal densities, or marginal moments, or expected utilities or losses in explicitly defined decision problems, will necessitate further integrations with respect to components of θ or y, or transformations thereof. In cases where the likelihood just involves a single parameter, implementation just involves integration in one dimension and is essentially trivial. However, in problems involving a multiparameter likelihood the task of implementation is anything but trivial, since, if θ has k components, two k-dimensional integrals are required just to form p(θ | x) and p(y | x); in the case of p(θ | x), (k − 1)-dimensional integrals are required to obtain univariate marginal density values or moments, (k − 2)-dimensional integrals are required to obtain bivariate marginal densities, and so on. The key problem in implementing the formal Bayes solution to inference reporting or decision problems is therefore seen to be that of evaluating the required integrals, which, if k is at all large, will, in general, lead to challenging technical problems, requiring simultaneous analytic or numerical approximation of a number of multidimensional integrals.

The above discussion has assumed a given specification of a likelihood and prior density function.
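In the one-parameter case the required integration is indeed routine. The sketch below (our illustration, not the book's; the model is a hypothetical non-conjugate one) normalises a posterior density by simple Riemann summation for a single Cauchy observation combined with a standard normal prior, a case with no closed-form normalising constant.

```python
import math

# Sketch (ours): one-dimensional posterior normalisation by quadrature for a
# non-conjugate model: a single Cauchy observation x with location theta,
# combined with a standard normal prior on theta.

def likelihood(x, theta):
    return 1.0 / (math.pi * (1.0 + (x - theta) ** 2))

def prior(theta):
    return math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)

def posterior_on_grid(x, lo=-10.0, hi=10.0, m=20001):
    """Return grid points, normalised posterior values and the grid step."""
    dt = (hi - lo) / (m - 1)
    ts = [lo + i * dt for i in range(m)]
    unnorm = [likelihood(x, t) * prior(t) for t in ts]
    z = sum(unnorm) * dt                  # the normalising integral
    return ts, [u / z for u in unnorm], dt

ts, post, dt = posterior_on_grid(x=1.5)
total_mass = sum(post) * dt               # ~1 after normalising
```

With k parameters the same idea requires a k-dimensional grid, whose cost grows exponentially in k, which is precisely the implementation problem described above.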
However, although a specific mathematical form for the likelihood in a given context is very often implied, or suggested, by consideration of symmetry, sufficiency or experience, the mathematical specification of prior densities is typically more problematic. Some of the problems involved, such as the pragmatic strategies to be adopted in translating actual beliefs into mathematical form, relate more to practical methodology than to conceptual and theoretical issues and will not be discussed in detail in this volume. However, many of the other problems of specifying prior densities are closely related to the general problems of implementation described above, as exemplified by the following questions:

(i) given that, for any specific beliefs, there is some arbitrariness in the precise choice of the mathematical representation of a prior density, are there choices which enable the integrations required to be carried out straightforwardly and hence permit the tractable implementation of a range of analyses, thus facilitating the kind of interpersonal analysis and scientific reporting referred to in Section 4.8.2?

(ii) if the information to be provided by the data is known to be far greater than that implicit in an individual's prior beliefs, is there any necessity for a precise mathematical representation of the latter, or can a Bayesian implementation proceed purely on the basis of this qualitative understanding?

(iii) either in the context of interpersonal analysis, or as a special form of actual individual analysis, is there a formal way of representing the beliefs of an individual whose prior information is to be regarded as minimal, relative to the information provided by the data?

(iv) for general forms of likelihood and prior density, are there analytic/numerical techniques available for approximating the integrals required for implementing Bayesian methods?

Question (i) will be answered in Section 5.2, where the concept of a conjugate prior density will be introduced. Question (ii) will be answered in part at the end of Section 5.2, and again, in more detail, in Section 5.3, where an approximate "large sample" Bayesian theory involving asymptotic posterior normality will be presented. Question (iii) will be answered in Section 5.4, where the information-based concept of a reference prior density will be introduced; an extended historical discussion of this celebrated philosophical problem of how to represent "ignorance" will be given in Section 5.6.2. Question (iv) will be answered in Section 5.5, where classical applied analysis techniques, such as Laplace's approximation for integrals, will be briefly reviewed in the context of implementing Bayesian inference and decision summaries, together with classical numerical analytic techniques, such as Gauss-Hermite quadrature, and stochastic simulation techniques, such as importance sampling, sampling-importance-resampling and Markov chain Monte Carlo.
5.2 CONJUGATE ANALYSIS

5.2.1 Conjugate Families

The first issue raised at the end of Section 5.1.6 is that of tractability: given a likelihood function p(x | θ), for what choices of p(θ) are integrals such as

∫ p(x | θ) p(θ) dθ

easily evaluated analytically? Tractability can be achieved by noting that, since Bayes' theorem may be expressed in the form

p(θ | x) ∝ p(x | θ) p(θ),

both p(θ | x) and p(θ) can be guaranteed to belong to the same general family of mathematical functions by choosing p(θ) to have the same "structure" as p(x | θ), when the latter is viewed as a function of θ. However, as stated, this is a rather vacuous idea, since p(θ | x) and p(θ) would always belong to the same "general family" of functions if the latter were suitably defined. To achieve a more meaningful version of the underlying idea, let us first recall (from Section 4.5) that if t = t(x) is a sufficient statistic we have

p(θ | x) = p(θ | t) ∝ p(t | θ) p(θ),

so that we can restate our requirement for tractability in terms of p(θ) having the same structure as p(t | θ), when the latter is viewed as a function of θ. Again, however, without further constraint on the nature of the sequence of sufficient statistics, the class of possible functions p(θ) is too large to permit easily interpreted matching of beliefs to particular members of the class. Since any particular mathematical form of p(θ) is acting as a representation of beliefs (either of an actual individual, or as part of a stylised sensitivity study involving a range of prior to posterior analyses), we require, in addition to tractability, that the class of mathematical functions from which p(θ) is to be chosen be both rich in the forms of beliefs it can represent and also facilitate the matching of beliefs to particular members of the class. This suggests that it is only in the case of likelihoods admitting sufficient statistics of fixed dimension that we shall be able to identify a family of prior densities which ensures both tractability and ease of interpretation. This motivates the following definition.

Definition 5.6. (Conjugate prior family). The conjugate family of prior densities for θ ∈ Θ, with respect to a likelihood p(x | θ) with sufficient statistic t = t(x) = {n, s(x)} of fixed dimension k independent of that of x, is

{ p(θ | τ); τ = (τ0, τ1, ..., τk) }, with p(θ | τ) ∝ p(s(x) = (τ1, ..., τk) | n = τ0, θ),

for any τ such that the right-hand side, viewed as a function of θ, has a finite integral over Θ.

From Section 4.5, it follows that the likelihoods for which conjugate prior families exist are those corresponding to general exponential family parametric models (Definitions 4.10 and 4.11). The exponential family model is referred to as regular or non-regular, according as the range of x does not or does depend on θ, respectively.

Proposition 5.4. (Conjugate families for regular exponential families). If x = (x1, ..., xn) is a random sample from a regular exponential family distribution such that

p(x | θ) = [ ∏(i=1 to n) f(xi) ] [g(θ)]ⁿ exp{ Σ(j=1 to k) cj φj(θ) Σ(i=1 to n) hj(xi) },

then the conjugate family for θ has the form

p(θ | τ) = [K(τ)]^(−1) [g(θ)]^{τ0} exp{ Σ(j=1 to k) cj φj(θ) τj },

where τ = (τ0, τ1, ..., τk) is such that

K(τ) = ∫_Θ [g(θ)]^{τ0} exp{ Σ(j=1 to k) cj φj(θ) τj } dθ < ∞.

Proof. By Proposition 4.10 (the Neyman factorisation criterion), the sufficient statistics for θ, given x, have the form

t(x) = [n, s(x)] = [n, Σi h1(xi), ..., Σi hk(xi)],

and the result follows on substituting t(x) = (τ0, τ1, ..., τk) in Definition 5.6.
Example 5.4. (Bernoulli likelihood; beta prior). The Bernoulli likelihood has the form

p(x1, ..., xn | θ) = θ^{rn} (1 − θ)^{n − rn}, rn = x1 + ... + xn,
= (1 − θ)ⁿ exp{ log(θ/(1 − θ)) Σi xi }, 0 ≤ θ ≤ 1,

so that, by Proposition 5.4, the conjugate prior density for θ is given by

p(θ | τ0, τ1) = [K(τ0, τ1)]^(−1) (1 − θ)^{τ0} exp{ log(θ/(1 − θ)) τ1 } = [K(τ0, τ1)]^(−1) θ^{τ1} (1 − θ)^{τ0 − τ1},

assuming the existence of K(τ0, τ1) such that the density integrates to one. Writing α = τ1 + 1, β = τ0 − τ1 + 1, and comparing with the definition of a beta density, we have

p(θ | α, β) = Be(θ | α, β) ∝ θ^{α − 1} (1 − θ)^{β − 1}, α > 0, β > 0.

Example 5.5. (Poisson likelihood; gamma prior). The Poisson likelihood has the form

p(x1, ..., xn | θ) = ∏(i=1 to n) θ^{xi} e^{−θ} / xi! = [ ∏i (xi!)^(−1) ] e^{−nθ} exp{ (log θ) Σi xi }, θ > 0,

so that, by Proposition 5.4, the conjugate prior density for θ is given by

p(θ | τ0, τ1) = [K(τ0, τ1)]^(−1) e^{−τ0 θ} exp{ (log θ) τ1 } = [K(τ0, τ1)]^(−1) θ^{τ1} e^{−τ0 θ},

assuming the existence of K(τ0, τ1). Writing α = τ1 + 1 and β = τ0, and comparing with the definition of a gamma density, we have

p(θ | α, β) = Ga(θ | α, β) ∝ θ^{α − 1} e^{−βθ}, α > 0, β > 0.

Example 5.6. (Normal likelihood; normal-gamma prior). The normal likelihood, with unknown mean μ and precision λ, has the form

p(x | μ, λ) = ∏(i=1 to n) N(xi | μ, λ) = (λ/2π)^{n/2} exp{ −(λ/2) Σi (xi − μ)² }
= (2π)^{−n/2} (λ e^{−λμ²})^{n/2} exp{ λμ Σi xi − (λ/2) Σi xi² },

so that, by Proposition 5.4, the conjugate prior density for (μ, λ) is given by

p(μ, λ | τ0, τ1, τ2) = [K(τ)]^(−1) (λ e^{−λμ²})^{τ0/2} exp{ λμ τ1 − (λ/2) τ2 },

assuming the existence of K(τ) = K(τ0, τ1, τ2). Writing

α = (τ0 + 1)/2, μ0 = τ1/τ0, β = (τ2 − τ1²/τ0)/2,

and comparing with the definition of a normal-gamma density, we have

p(μ, λ | τ0, τ1, τ2) = Ng(μ, λ | μ0, τ0, α, β) = N(μ | μ0, τ0 λ) Ga(λ | α, β).
5.2.2 Canonical Conjugate Analysis

Conjugate prior density families were motivated by considerations of tractability in implementing the Bayesian paradigm. The following proposition demonstrates that, in the case of regular exponential family likelihoods and conjugate prior densities, the analytic forms of the joint posterior and predictive densities which underlie any form of inference summary or decision making are easily identified.

Proposition 5.5. (Conjugate analysis for regular exponential families). For the exponential family likelihood and conjugate prior density of Proposition 5.4:

(i) the posterior density for θ is

p(θ | x, τ) = p(θ | τ + t(x)),

the member of the conjugate family labelled by τ + t(x) = (τ0 + n, τ1 + s1(x), ..., τk + sk(x)), where t(x) = [n, s1(x), ..., sk(x)] is the sufficient statistic;

(ii) the predictive density for future observables y = (y1, ..., ym) is

p(y | x, τ) = [ ∏(i=1 to m) f(yi) ] K(τ + t(x) + t(y)) / K(τ + t(x)).

Proof. (i) By Bayes' theorem,

p(θ | x, τ) ∝ p(x | θ) p(θ | τ) ∝ [g(θ)]^{τ0 + n} exp{ Σj cj φj(θ) (τj + sj(x)) },

which, on normalisation, is p(θ | τ + t(x)). (ii) Substituting the exponential family form of p(y | θ) and the conjugate form of p(θ | τ + t(x)) into p(y | x, τ) = ∫ p(y | θ) p(θ | x, τ) dθ, and integrating over θ, yields the stated ratio of normalising constants, which proves (ii).
The inference process defined by Bayes’ theorem is therefore reduced from the essentially infinitedimensionalproblem of the transformation of density functions. is given by [ I ( @ 12. . 11 and r .’(# I Tit. = o + r. 71) x /dz I #)I.. T!).. where q1( ) = n 11. the process reduces to the updating of the + prior to posterior hyperparameters. }(I . A. TI 1 x ( 1 . family of distributions.  + +  Example 5. and.*j) ( z l O ) p ( f l l o ..r. However.4. and writing r. posterior density corresponding to the conjugate prior density. p(B I GI..rI t .. rl ( 7 1 ) = T~ +I. simply defined. the inference process being totally defined by the mapping T under which the labelling parameters of the prior density are simply modified by the addition of the values of the sufficient statistic to form the labelling parameter of the posterior distribution. to a simple.. with respect to the corresponding exponential family likelihood. we could proceed on the basis of the original representation of the Bernoulli likelihood. . Be(H I (I. j).j. Proposition 5.H)” rxp {log () 10 H r. to provide some preliminary insights into the prior posterior predictive process described by Proposition 5. again. T rt. Barnard..o.5(i) establishes that the conjugate family is closed under sumpling.5. (T t. additive finitedimensionaltransformation.. With the Bernoulli likelihood written in its explicit expothe nential family form. ~ ) xp xH’”(j ())‘‘‘f’@I + ’(I 0)‘ I where (k.. Alternatively. (continued). \o that p(fllz.. The forms arising in the conjugate analysisof a number of standard exponential family forms are summarised in Appendix A. = c j ( 1 . . . a concept which seems to be due to G. simplifyingclosure property holds for predictive densities.. = . .. showing explicitly how the inference process reduces to the updating of the prior to posterior hyperparameters by the addition of the sufficient statistics. .(z)).. we shall illustrate the general results by reconsidering Example 5. 
combining it directly with the familiar beta prior density, $\text{Be}(\theta \mid \alpha, \beta)$, so that

$$p(\theta \mid z, \alpha, \beta) \propto \theta^{r_n}(1-\theta)^{n-r_n}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{\alpha+r_n-1}(1-\theta)^{\beta+n-r_n-1},$$

and hence $p(\theta \mid z, \alpha, \beta) = \text{Be}(\theta \mid \alpha + r_n, \beta + n - r_n)$: prior and posterior again belong to the same, simply defined family, as guaranteed by Proposition 5.5(i), while Proposition 5.5(ii) establishes that a similar, simplifying closure property holds for predictive densities.
Clearly, the two notational forms and procedures used in the example are equivalent. Using the standard exponential family form has the advantage of displaying the simple hyperparameter updating by the addition of the sufficient statistics; the second form seems much less cumbersome notationally, and is more transparently interpretable and memorable in terms of the beta density. In general, when analysing particular models we shall work in terms of whatever functional representation seems best suited to the task in hand.

Instead of working with the original Bernoulli likelihood, we could, of course, work with a likelihood defined in terms of the sufficient statistic $(n, r_n)$. In terms of the notation introduced in Section 3.2, if either $n$ or $r_n$ were ancillary, we would use one or other of $\text{Bi}(r_n \mid \theta, n)$ or $\text{Nb}(n \mid \theta, r_n)$; in either case, the prior to posterior operation defined by Bayes' theorem can be simply expressed as

$$p(\theta \mid r_n, n) \propto \text{Bi}(r_n \mid \theta, n)\,\text{Be}(\theta \mid \alpha, \beta) \propto \text{Be}(\theta \mid \alpha + r_n, \beta + n - r_n).$$

The predictive density for future Bernoulli observables, which we denote by $y = (y_1, \ldots, y_m)$, is also easily derived. Writing $r_m = y_1 + \cdots + y_m$, if we were interested in the predictive density for $r_m$ we see that

$$p(r_m \mid z) = \int_0^1 p(r_m \mid \theta)\, \text{Be}(\theta \mid \alpha + r_n, \beta + n - r_n)\, d\theta,$$

a result which could also be obtained directly from Proposition 5.5(ii).
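The prior-to-posterior and predictive calculations of this example are easy to sketch numerically. The following is a minimal illustration using only the Python standard library; the function names are ours, not from any established package:

```python
from math import lgamma, exp, comb  # comb requires Python 3.8+

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_posterior(alpha, beta, r, n):
    # Be(theta | alpha, beta) prior plus r successes in n Bernoulli trials
    # gives the Be(theta | alpha + r, beta + n - r) posterior.
    return alpha + r, beta + n - r

def binomial_beta(alpha, beta, m, r_m):
    # Predictive probability of r_m successes in m future trials:
    # Bb(r_m | alpha, beta, m) = C(m, r_m) B(alpha + r_m, beta + m - r_m) / B(alpha, beta)
    return comb(m, r_m) * exp(log_beta(alpha + r_m, beta + m - r_m) - log_beta(alpha, beta))

a, b = beta_posterior(1, 1, r=7, n=10)   # uniform prior, 7 successes in 10 trials
print(a, b)                              # -> 8 4, i.e. Be(theta | 8, 4)
print(sum(binomial_beta(a, b, 5, k) for k in range(6)))  # predictive mass sums to 1
```

Note that the single-trial predictive $\text{Bb}(1 \mid a, b, 1)$ reduces to the posterior mean $a/(a+b)$.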
Comparison with Section 3.2 reveals this predictive density to have the binomial-beta form,

$$p(r_m \mid z) = \text{Bb}(r_m \mid \alpha + r_n, \beta + n - r_n, m).$$

The particular case $m = 1$ is of some interest, since $p(y_{n+1} = 1 \mid z)$ is then the predictive probability assigned to a success on the $(n+1)$th trial, given $r_n$ observed successes in the first $n$ trials and an initial $\text{Be}(\theta \mid \alpha, \beta)$ belief about the limiting relative frequency of successes. Using the fact that $\Gamma(t+1) = t\,\Gamma(t)$, and recalling, from Section 3.2, the form of the mean of a beta distribution, we see that, on substituting into the above,

$$p(y_{n+1} = 1 \mid z) = E(\theta \mid \alpha + r_n, \beta + n - r_n) = \frac{\alpha + r_n}{\alpha + \beta + n}.$$

With respect to quadratic loss, $E(\theta \mid z)$ is the optimal estimate of $\theta$ given current information, and the above result demonstrates that this should also serve as the evaluation of the probability of a success on the next trial. In the case $\alpha = \beta = 1$, this evaluation becomes $(r_n + 1)/(n + 2)$, which is the celebrated Laplace's rule of succession (Laplace, 1812), a formula which has served historically to stimulate considerable philosophical debate about the nature of inductive inference. For an elementary, but insightful, account of Bayesian inference for the Bernoulli case, see Lindley and Phillips (1976). We shall consider this problem further in Example 5.16 of Section 5.4.
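The rule of succession is immediate from the beta posterior mean; a small sketch (our own naming) makes the $\alpha = \beta = 1$ case explicit:

```python
def prob_next_success(alpha, beta, r, n):
    # P(y_{n+1} = 1 | r successes in n trials) under a Be(alpha, beta) prior:
    # the mean (alpha + r) / (alpha + beta + n) of the beta posterior.
    return (alpha + r) / (alpha + beta + n)

# With alpha = beta = 1 this is Laplace's rule of succession, (r + 1) / (n + 2).
print(prob_next_success(1, 1, r=0, n=0))   # -> 0.5, no data yet
print(prob_next_success(1, 1, r=3, n=10))  # -> 4/12 = 0.333...
```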
5.2.2 Canonical Conjugate Analysis

In presenting the basic ideas of conjugate analysis, we used the following notation for the $k$-parameter exponential family and the corresponding conjugate prior form:

$$p(x \mid \theta) = a(x)\, f(\theta) \exp\left\{ \sum_{j=1}^{k} c_j\, \phi_j(\theta)\, h_j(x) \right\},$$

$$p(\theta \mid \tau) \propto f(\theta)^{\tau_0} \exp\left\{ \sum_{j=1}^{k} c_j\, \phi_j(\theta)\, \tau_j \right\},$$

the latter being defined for $\tau = (\tau_0, \tau_1, \ldots, \tau_k)$ such that the normalising constant $K(\tau) < \infty$. From a notational perspective (cf. Definition 4.12), we can obtain considerable simplification by defining $y = (h_1(x), \ldots, h_k(x))$ and $\psi = (c_1\phi_1(\theta), \ldots, c_k\phi_k(\theta))$, so that these forms become

$$p(y \mid \psi) = a(y) \exp\{ y'\psi - b(\psi) \}, \qquad y \in Y, \quad \psi \in \Psi,$$

$$p(\psi \mid n_0, y_0) = c(n_0, y_0) \exp\{ n_0\, y_0'\psi - n_0\, b(\psi) \},$$

for appropriately defined $Y$, $\Psi$ and real-valued functions $a$, $b$ and $c$. We shall refer to these (Definition 4.12) as the canonical (or natural) forms of the exponential family and its conjugate prior family.
In order for $p(\psi \mid n_0, y_0)$ to be a proper density, we require $n_0 > 0$ and $y_0 \in Y$. We shall typically assume that $\Psi$ consists of all $\psi$ such that $\int p(y \mid \psi)\, dy = 1$, and that $b(\psi)$ is continuously differentiable and strictly convex throughout the interior of $\Psi$; for $\psi \notin \Psi$, the situation is somewhat more complicated (see Diaconis and Ylvisaker, 1979, for details). The motivation for choosing $n_0, y_0$ as notation for the prior hyperparameters is partly clarified by the following proposition, and becomes even clearer in the context of Proposition 5.7.

Proposition 5.6. (Canonical conjugate analysis). If $y_1, \ldots, y_n$ are the values of $y$ resulting from a random sample of size $n$ from the canonical exponential family parametric model $p(y \mid \psi)$, then the posterior density corresponding to the canonical conjugate form $p(\psi \mid n_0, y_0)$ is given by

$$p(\psi \mid y_1, \ldots, y_n) = p\left(\psi \,\Big|\, n_0 + n, \; \frac{n_0 y_0 + n \bar{y}_n}{n_0 + n}\right), \qquad \bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i .$$

Example 5.4. (continued). In the case of the Bernoulli parametric model, we have seen earlier that the pairing of the parametric model and conjugate prior can be expressed as

$$p(x \mid \theta) = (1-\theta) \exp\left\{ x \log\left(\frac{\theta}{1-\theta}\right)\right\}, \qquad \text{Be}(\theta \mid \alpha, \beta).$$

The canonical forms in this case are obtained by setting

$$y = x, \qquad \psi = \log\left(\frac{\theta}{1-\theta}\right), \qquad a(y) = 1, \qquad b(\psi) = \log(1 + e^{\psi}).$$
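The correspondence between the standard and canonical Bernoulli parametrisations can be checked numerically. In this sketch (names ours), the derivative of the cumulant function $b$ recovers the mean parameter $\theta$:

```python
from math import log, exp

def to_canonical(theta):
    # Natural parameter psi = log(theta / (1 - theta))
    return log(theta / (1 - theta))

def b(psi):
    # Cumulant function of the canonical Bernoulli model: b(psi) = log(1 + e^psi)
    return log(1 + exp(psi))

def grad_b(psi, h=1e-6):
    # Central-difference approximation to b'(psi); analytically b'(psi) = E(y | psi) = theta
    return (b(psi + h) - b(psi - h)) / (2 * h)

theta = 0.3
print(round(grad_b(to_canonical(theta)), 6))  # recovers theta
```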
The posterior distribution of the canonical parameter $\psi$ is then immediately given by Proposition 5.6.

Example 5.5. (continued). In the case of the Poisson parametric model, we have seen earlier that the pairing of the parametric model and conjugate form can be expressed as

$$p(x \mid \theta) = \frac{e^{-\theta}\theta^{x}}{x!} = \frac{1}{x!} \exp\{ x \log\theta - \theta \},$$

with a gamma conjugate prior. The canonical forms in this case are obtained by setting

$$y = x, \qquad \psi = \log\theta, \qquad a(y) = \frac{1}{y!}, \qquad b(\psi) = e^{\psi},$$

and the posterior distribution of the canonical parameter $\psi$ is again immediately given by Proposition 5.6.

Example 5.6. (continued). In the case of the normal parametric model, we have seen earlier that the pairing of the parametric model and conjugate form is expressed in terms of the mean and precision, with a normal-gamma conjugate prior. The canonical forms are obtained by a corresponding reparametrisation, and the posterior distribution of the canonical parameters $\psi = (\psi_1, \psi_2)$ is, once more, immediately given by Proposition 5.6.

For specific applications, the choice of the representation of the parametric model and conjugate prior forms is typically guided by the ease of interpretation of the parametrisations adopted; Example 5.6 above suffices to demonstrate that the canonical forms may be very unappealing. From a theoretical perspective, however, the canonical representation often provides valuable unifying insight: the economy of notation makes it straightforward to demonstrate that the learning process just involves a simple weighted average, $(n_0 y_0 + n\bar{y}_n)/(n_0 + n)$, of prior and sample information. Using the canonical forms, as in Proposition 5.6, we can give a more precise characterisation of this weighted average.
Proposition 5.7. (Weighted-average form of posterior expectation). If $y_1, \ldots, y_n$ are the values of $y$ resulting from a random sample of size $n$ from the canonical exponential family parametric model, then, for any given prior hyperparameters $(n_0, y_0)$,

$$E[\nabla b(\psi) \mid y_1, \ldots, y_n] = (1 - \pi_n)\, y_0 + \pi_n\, \bar{y}_n, \qquad \pi_n = \frac{n}{n_0 + n}.$$

Proof. By Proposition 5.6, it suffices to prove that $E[\nabla b(\psi) \mid n_0, y_0] = y_0$. But

$$E[\nabla b(\psi) \mid n_0, y_0] = \int_\Psi \nabla b(\psi)\, c(n_0, y_0) \exp\{ n_0 y_0'\psi - n_0 b(\psi) \}\, d\psi,$$

and, since $\nabla \exp\{ n_0 y_0'\psi - n_0 b(\psi)\} = n_0\,(y_0 - \nabla b(\psi)) \exp\{ n_0 y_0'\psi - n_0 b(\psi)\}$ integrates to zero over $\Psi$, the right-hand side equals $y_0$. Applying this identity with the posterior hyperparameters $n_0 + n$ and $(n_0 y_0 + n\bar{y}_n)/(n_0 + n)$ establishes the result. ⊲

Proposition 5.7 reveals that the posterior expectation of $\nabla b(\psi)$, that is, its Bayes estimate with respect to quadratic loss (see Proposition 5.2), is a weighted average of $y_0$ and $\bar{y}_n$. Since

$$E(y \mid \psi) = E(\bar{y}_n \mid \psi) = \nabla b(\psi),$$

the latter can be viewed as an intuitively "natural" sample-based estimate of $\nabla b(\psi)$, while the former is the prior estimate of $\nabla b(\psi)$. For any given prior hyperparameters, as the sample size $n$ becomes large, the weight $\pi_n$ tends to one and the sample-based information dominates the posterior.

In this context, we make an important point alluded to in our discussion of "objectivity and subjectivity" in Chapter 4. Namely, that in the stylised setting of a group of individuals agreeing on an exponential family parametric form, but assigning different conjugate priors, a sufficiently large sample will lead, in this natural conjugate setting, to more or less identical posterior beliefs. Statements based on the latter might well, in common parlance, be claimed to be "objective". One should always be aware, however, that this is no more than a conventional way of indicating a subjective consensus, resulting from a large amount of data processed in the light of a central core of shared assumptions.
Example 5.4. (continued). We have seen that if $r_n$ denotes the number of successes in $n$ Bernoulli trials, the conjugate beta prior density $\text{Be}(\theta \mid \alpha, \beta)$ leads to a $\text{Be}(\theta \mid \alpha + r_n, \beta + n - r_n)$ posterior for $\theta$, the limiting relative frequency of successes, which has expectation

$$E(\theta \mid z) = \frac{\alpha + r_n}{\alpha + \beta + n} = (1 - \pi_n)\frac{\alpha}{\alpha + \beta} + \pi_n \frac{r_n}{n}, \qquad \pi_n = \frac{n}{\alpha + \beta + n},$$

providing a weighted average between the prior mean for $\theta$ and the frequency estimate provided by the data. In this notation, $\alpha + \beta$ corresponds to $n_0$ and plays an analogous role to the sample size $n$: $\pi_n$ is the weight attached to the data mean, and $1 - \pi_n$ the weight attached to the prior mean. The choice of an $n_0$ which is large relative to $n$ thus implies that the prior will dominate the data in determining the posterior (see, however, Section 5.6.3 for illustration of why a weighted-average form might not be desirable); conversely, the choice of an $n_0$ which is small relative to $n$ ensures that the form of the posterior is essentially determined by the data. In particular, this suggests that a tractable analysis which "lets the data speak for themselves" can be obtained by letting $n_0 \to 0$.

Proposition 5.7 shows that conjugate priors for exponential family parameters imply that posterior expectations are linear functions of the sufficient statistics. It is interesting to ask whether other forms of prior specification can also lead to linear posterior expectations; or, more generally, whether knowing or constraining posterior moments to be of some simple algebraic form suffices to characterise the possible families of prior distributions. It can be shown that, under some regularity conditions, for continuous exponential families linearity of the posterior expectation does imply that the prior must be conjugate. These kinds of questions are considered in detail in, for example, Diaconis and Ylvisaker (1979) and Goel and DeGroot (1980).
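The weighted-average identity for the Bernoulli case can be verified directly; a minimal sketch (function names ours):

```python
def posterior_mean(alpha, beta, r, n):
    # Exact mean of the Be(alpha + r, beta + n - r) posterior
    return (alpha + r) / (alpha + beta + n)

def weighted_form(alpha, beta, r, n):
    # The same quantity as a weighted average of prior mean and sample frequency
    pi_n = n / (alpha + beta + n)            # weight attached to the data
    return (1 - pi_n) * alpha / (alpha + beta) + pi_n * (r / n)

print(posterior_mean(10, 20, 3, 10))  # -> 0.325
print(weighted_form(10, 20, 3, 10))   # identical, up to rounding error
```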
The following example illustrates the use of limiting, improper conjugate priors in the context of the Bernoulli parametric model with beta conjugate prior.

Example 5.4. (continued). The discussion preceding Proposition 5.6 makes clear that the prior parameter $n_0$, appearing explicitly in the prior to posterior updating process given in Proposition 5.6, plays a role analogous to that of the sample size. The choice $n_0 = 0$ typically implies a form of $p(\psi \mid n_0, y_0)$ which does not integrate to unity (a so-called improper density), and thus cannot be interpreted as representing an actual prior belief. In the Bernoulli case, $n_0$ corresponds to $\alpha + \beta$, so that the limiting prior form corresponding to $\alpha = \beta = 0$ would be

$$p(\theta \mid \alpha = 0, \beta = 0) \propto \theta^{-1}(1-\theta)^{-1},$$

which is not a proper density.
Nevertheless, formal use of Bayes' theorem with this improper form playing the role of a prior gives

$$p(\theta \mid z, \alpha = 0, \beta = 0) \propto p(z \mid \theta)\, p(\theta \mid \alpha = 0, \beta = 0) \propto \theta^{r_n}(1-\theta)^{n-r_n}\, \theta^{-1}(1-\theta)^{-1} = \theta^{r_n - 1}(1-\theta)^{n - r_n - 1},$$

which implies a $\text{Be}(\theta \mid r_n, n - r_n)$ posterior for $\theta$. Clearly, any choice of $\alpha, \beta$ small compared with $r_n$ and $n - r_n$ will lead to an almost identical posterior distribution for $\theta$. As a technique for arriving at an approximation to the posterior distribution, it is certainly convenient to make formal use of Bayes' theorem with this improper form playing the role of a prior. It is important to recognise, however, that this is merely an approximation device: it in no way justifies regarding $p(\theta \mid \alpha = 0, \beta = 0)$ as having any special significance as a representation of "prior ignorance", and the result has to be regarded as simply a convenient approximation to the posterior that would have been obtained from the choice of a prior with small, but positive, $n_0$.

A further problem of interpretation arises if we consider inferences for functions of $\theta$. Consider, for example, the choice $\alpha = \beta = 1$, which implies a uniform prior density for $\theta$. At an intuitive level, it might be argued that this represents "complete ignorance" about $\theta$, which should, presumably, entail "complete ignorance" about any function $g(\theta)$ of $\theta$. However, $p(\theta)$ uniform implies that $p(g(\theta))$ is not uniform. This makes it clear that ad hoc intuitive notions of "ignorance",
or of what constitutes a "non-informative" prior distribution (in some sense), cannot be relied upon. There is a need for a more formal analysis of the concept, and this will be given in Section 5.4, with further discussion in Section 5.6.

Proposition 5.2 established the general forms of Bayes estimates for some commonly used loss functions, and Proposition 5.7 provided further insight into the (posterior mean) form arising from quadratic loss in the case of an exponential family parametric model with conjugate prior. The following development, based closely on Gutiérrez-Peña (1992), provides further insight into how the posterior mode can be justified as a Bayes estimate.
As a final preliminary, recall the logarithmic divergence between two members $p(y \mid \psi)$ and $p(y \mid \psi_0)$ of the canonical family,

$$\delta(\psi \mid \psi_0) = \int p(y \mid \psi) \log \frac{p(y \mid \psi)}{p(y \mid \psi_0)}\, dy .$$

We can now establish the following technical results.

Proposition 5.8. (Logarithmic divergence between conjugate distributions). With respect to the canonical form of the $k$-parameter exponential family and its corresponding conjugate prior:

(i) $\delta(\psi \mid \psi_0) = b(\psi_0) - b(\psi) + (\psi - \psi_0)'\nabla b(\psi)$;

(ii) $E[\delta(\psi \mid \psi_0)] = \dfrac{k}{n_0} + b(\psi_0) - \psi_0' y_0 - E[b(\psi)] + E[\psi]'\, y_0$,

where the expectations are with respect to $p(\psi \mid n_0, y_0)$.

Proof. For members of the canonical family,

$$\log \frac{p(y \mid \psi)}{p(y \mid \psi_0)} = y'(\psi - \psi_0) - b(\psi) + b(\psi_0),$$

and taking expectations with respect to $p(y \mid \psi)$, recalling that $E(y \mid \psi) = \nabla b(\psi)$, establishes (i).

For (ii), consider $p(\psi \mid n_0, y_0)$ and define

$$d(s, t) = \log \int_\Psi \exp\{ t'\psi - s\, b(\psi) \}\, d\psi, \qquad s > 0, \quad t = (t_1, \ldots, t_k)' .$$

Differentiating under the integral sign, and interchanging the order of differentiation and integration, $\partial d(s,t)/\partial t_j = E[\psi_j]$ and $\partial d(s,t)/\partial s = -E[b(\psi)]$, evaluated at $(s, t) = (n_0, n_0 y_0)$. Moreover, since $\exp\{t'\psi - s\, b(\psi)\}$ vanishes on the boundary of $\Psi$,

$$\int_\Psi \frac{\partial}{\partial \psi_j} \exp\{ t'\psi - s\, b(\psi) \}\, d\psi = 0, \qquad j = 1, \ldots, k,$$

which gives $E[\nabla b(\psi)] = t/s = y_0$ (cf. Proposition 5.7), while, multiplying the integrand by $\psi_j$ and integrating by parts,

$$t_j\, E[\psi_j] - s\, E\left[\psi_j\, \frac{\partial b(\psi)}{\partial \psi_j}\right] = -1, \qquad \text{so that} \qquad E[\psi'\nabla b(\psi)] = \frac{k + t'E[\psi]}{s} = \frac{k}{n_0} + y_0' E[\psi] .$$

Taking expectations of (i) with respect to $p(\psi \mid n_0, y_0)$,

$$E[\delta(\psi \mid \psi_0)] = b(\psi_0) - E[b(\psi)] + E[\psi'\nabla b(\psi)] - \psi_0' E[\nabla b(\psi)],$$

and substituting the expressions above gives (ii). ⊲
This result now enables us to establish easily the main result of interest.

Proposition 5.9. (Conjugate posterior modes as Bayes estimates). With respect to the loss function $l(\tilde\psi, \psi) = \delta(\tilde\psi \mid \psi)$, the Bayes estimate for $\psi$, derived from the canonical $k$-parameter exponential family $p(y \mid \psi)$ and corresponding conjugate prior $p(\psi \mid n_0, y_0)$, given independent observations $y_1, \ldots, y_n$, is the posterior mode, $\psi^*$, which satisfies

$$\nabla b(\psi^*) = \frac{n_0 y_0 + n \bar{y}_n}{n_0 + n} .$$

Proof. We note first (see the proof of Proposition 5.6) that the logarithm of the posterior density is given, up to an additive constant, by

$$(n_0 y_0 + n\bar{y}_n)'\psi - (n_0 + n)\, b(\psi),$$

from which the claimed estimating equation for the posterior mode, $\psi^*$, is immediately obtained. The Bayes estimate minimises the posterior expected loss $E[\delta(\tilde\psi \mid \psi)]$, and the result now follows by noting that the same estimating equation arises in the minimisation of (ii) of Proposition 5.8, with $n_0 + n$ replacing $n_0$ and $n_0 y_0 + n\bar{y}_n$ replacing $n_0 y_0$. ⊲

For a recent discussion of conjugate priors for exponential families, see Consonni and Veronese (1992b); for an example, see Dawid (1988a).
5.2.3 Approximations with Conjugate Families

Our main motivation in considering conjugate priors for exponential families has been to provide tractable prior to posterior (and predictive) analysis. In addition, we might hope that the conjugate family for a particular parametric model would contain a sufficiently rich range of prior density "shapes" to enable one to approximate reasonably closely any particular actual prior belief function of interest. The next example shows that this might well not be the case; indeed, Diaconis and Ylvisaker (1979) highlight the fact that, in complex problems, conjugate priors may have strong, unsuspected implications. At the same time, the example also indicates how, with a suitable extension of the conjugate family idea, we can achieve both tractability and the ability to approximate closely any actual beliefs.

Example 5.7. (The spun coin). Whereas a tossed coin typically generates equal long-run frequencies of heads and tails, this is not at all the case if a coin is spun on its edge. Experience suggests that these long-run frequencies often turn out, for some coins, to be in the ratio 2:1 or 1:2, and for other coins even as extreme as 1:4; at the same time, some coins do appear to behave symmetrically. Let us consider the repeated spinning, under perceived "identical conditions", of a given coin, about which we have no specific information beyond the general background set out
above. Under the circumstances specified, suppose we judge the sequence of outcomes to be exchangeable, so that a Bernoulli parametric model,

$$p(z \mid \theta) = \prod_{i=1}^{n} \theta^{z_i}(1-\theta)^{1-z_i},$$

together with a prior density $p(\theta)$ for $\theta$, the long-run frequency of heads, completely specifies our belief model. How might we represent this prior density mathematically? We are immediately struck by two things: first, in the light of the information given, any realistic prior shape will be at least bimodal, and possibly trimodal, with somewhat more of the latter than the former; secondly, the conjugate family for the Bernoulli parametric model is the beta family (see Example 5.4), which does not contain bimodal densities. It appears, therefore, that an insistence on tractability, in the sense of restricting ourselves to conjugate priors, would preclude an honest prior specification.

However, we can easily generate multimodal shapes by considering mixtures of beta densities,

$$p(\theta) = \sum_{i=1}^{m} \pi_i \,\text{Be}(\theta \mid \alpha_i, \beta_i), \qquad \pi_i \ge 0, \quad \sum_{i=1}^{m} \pi_i = 1,$$

with mixing weights $\pi_i$ attached to a selection of conjugate densities. Figure 5.2 displays the prior density resulting from the mixture

$$p(\theta) = 0.5\,\text{Be}(\theta \mid 10, 20) + 0.2\,\text{Be}(\theta \mid 15, 15) + 0.3\,\text{Be}(\theta \mid 20, 10),$$

which, among other things, reflects a judgement that about 20% of coins seem to behave symmetrically and most of the rest tend to lead to 2:1 or 1:2 ratios.

Suppose now that we observe $n$ outcomes $z = (z_1, \ldots, z_n)$, and that these result in $r_n = z_1 + \cdots + z_n$ heads. Considering the general mixture prior form, we easily see from Bayes' theorem that

$$p(\theta \mid z) = \sum_{i=1}^{m} \pi_i^* \,\text{Be}(\theta \mid \alpha_i + r_n, \beta_i + n - r_n),$$

where

$$\pi_i^* \propto \pi_i \int_0^1 \theta^{r_n}(1-\theta)^{n-r_n}\,\text{Be}(\theta \mid \alpha_i, \beta_i)\, d\theta,$$
so that the resulting posterior density is itself a mixture of $m$ beta components. This establishes that the general mixture class of beta densities is closed under sampling with respect to the Bernoulli model.

In the case considered above, the suggested prior density corresponds to

$$m = 3, \qquad \pi = (0.5,\, 0.2,\, 0.3), \qquad \alpha = (10,\, 15,\, 20), \qquad \beta = (20,\, 15,\, 10).$$

To illustrate the analysis, suppose that the spun coin results in 3 heads after 10 spins, and in 14 heads after 50 spins. Detailed calculation yields:

for $n = 10$, $r_n = 3$: $\quad \pi^* = (0.77,\, 0.17,\, 0.06)$, $\quad \alpha^* = (13,\, 18,\, 23)$, $\quad \beta^* = (27,\, 22,\, 17)$;

for $n = 50$, $r_n = 14$: $\quad \alpha^* = (24,\, 29,\, 34)$, $\quad \beta^* = (56,\, 51,\, 46)$, with the weight of the first component now exceeding 0.9 and that of the third falling below 0.01;

and the resulting posterior densities are shown in Figure 5.2.

Figure 5.2. Prior and posteriors from a three-component beta mixture prior density.
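The mixture updating in this example is simple to implement: each beta component is updated conjugately, and the mixture weights are rescaled by each component's marginal (binomial-beta) likelihood of the data. A sketch, with our own function names, reproducing the $n = 10$ case:

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    # log B(a, b) via log-gamma functions
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def update_beta_mixture(weights, alphas, betas, r, n):
    # Conjugate update of sum_i pi_i Be(theta | alpha_i, beta_i)
    # after observing r successes in n Bernoulli trials.
    new_a = [a + r for a in alphas]
    new_b = [b + n - r for b in betas]
    # New weights: pi_i * B(alpha_i + r, beta_i + n - r) / B(alpha_i, beta_i), renormalised
    log_w = [log(w) + log_beta_fn(na, nb) - log_beta_fn(a, b)
             for w, a, b, na, nb in zip(weights, alphas, betas, new_a, new_b)]
    peak = max(log_w)
    w = [exp(lw - peak) for lw in log_w]     # stabilised exponentiation
    total = sum(w)
    return [x / total for x in w], new_a, new_b

# The spun-coin prior: 0.5 Be(10, 20) + 0.2 Be(15, 15) + 0.3 Be(20, 10)
w, a, b = update_beta_mixture([0.5, 0.2, 0.3], [10, 15, 20], [20, 15, 10], r=3, n=10)
print([round(x, 2) for x in w], a, b)  # -> [0.77, 0.17, 0.06] [13, 18, 23] [27, 22, 17]
```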
This example demonstrates that, at least in the case of the Bernoulli parametric model and the beta conjugate family, the use of mixtures of conjugate densities both maintains the tractability of the analysis and provides a great deal of flexibility in approximating actual forms of prior belief. In fact, the same is true for any exponential family model and corresponding conjugate family, as we show in the following.

Proposition 5.10. (Mixtures of conjugate priors). Let $z = (x_1, \ldots, x_n)$ be a random sample from a regular exponential family distribution, such that $p(z \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$, and let

$$p(\theta) = \sum_{i=1}^{m} \pi_i\, p(\theta \mid \tau_i),$$

where, for $i = 1, \ldots, m$, the densities $p(\theta \mid \tau_i)$ are elements of the corresponding conjugate family. Then

$$p(\theta \mid z) = \sum_{i=1}^{m} \pi_i^*\, p(\theta \mid \tau_i^*),$$

where $p(\theta \mid \tau_i^*)$ is the conjugate posterior arising from the prior $p(\theta \mid \tau_i)$, and

$$\pi_i^* \propto \pi_i \int p(z \mid \theta)\, p(\theta \mid \tau_i)\, d\theta .$$

Proof. The result follows straightforwardly from Bayes' theorem and Proposition 5.5. ⊲
It is interesting to ask just how flexible mixtures of conjugate priors are. The answer is that any prior density for an exponential family parameter can be approximated arbitrarily closely by such a mixture, as shown by Dalal and Hall (1983) and Diaconis and Ylvisaker (1985). However, their analyses do not provide a constructive mechanism for building up such a mixture. In practice, therefore, we are left with having to judge when a particular tractable choice (typically a conjugate form, a mixture of conjugate forms, or a limiting conjugate form) is "good enough", in the sense that probability statements based on the resulting posterior will not differ radically from the statements that would have resulted from using a more honest, but difficult to specify or intractable, prior. The following result (Dickey, 1976) provides some guidance, in a much more general setting than that of conjugate mixtures, as to when an "approximate" (possibly improper) prior may safely be used in place of an "honest" prior.

Proposition 5.11. (Prior approximation). Suppose that a belief model is defined by $p(z \mid \theta)$ and $p(\theta)$, $\theta \in \Theta$, and that $q(\theta)$ is a non-negative function such that

$$q(z) = \int_\Theta p(z \mid \theta)\, q(\theta)\, d\theta < \infty, \qquad q(\theta \mid z) = \frac{p(z \mid \theta)\, q(\theta)}{q(z)} .$$

Suppose further that, for some $\Theta_0 \subseteq \Theta$ and constants $\alpha, \beta$,

(a) $1 \le p(\theta)/q(\theta) \le 1 + \alpha$, for all $\theta \in \Theta_0$;

(b) $p(\theta)/q(\theta) \le \beta$, for all $\theta \in \Theta$.

Then, writing $P = \int_{\Theta_0} p(\theta \mid z)\, d\theta$ and $q = \int_{\Theta_0} q(\theta \mid z)\, d\theta$:

(i) $1 - P \le \beta(1 - q)/q$;

(ii) $q \le p(z)/q(z) \le (1+\alpha)\,q + \beta(1 - q)$;

(iii) $P/(1+\alpha) \le p(\theta \mid z)/q(\theta \mid z) \le (1+\alpha)/q$, for all $\theta \in \Theta_0$;

(iv) $p(\theta \mid z)/q(\theta \mid z) \le \beta/q$, for all $\theta \in \Theta$;

(v) for any $f: \Theta \to \Re$ such that $|f(\theta)| \le m$,

$$\left| \int f(\theta)\, p(\theta \mid z)\, d\theta - \int f(\theta)\, q(\theta \mid z)\, d\theta \right| \le m\left[ (1-P) + (1-q) + \varepsilon \right],$$

where $\varepsilon = \max\{ (1+\alpha)/q - 1, \; 1 - P/(1+\alpha) \}$.

Proof. Since $p(z)/q(z) = \int_\Theta [p(\theta)/q(\theta)]\, q(\theta \mid z)\, d\theta$, splitting the range of integration into $\Theta_0$ and its complement and using (a) and (b) gives (ii); similarly, $q(z)/p(z) = \int_\Theta [q(\theta)/p(\theta)]\, p(\theta \mid z)\, d\theta \ge P/(1+\alpha)$, by (a). Since $p(\theta \mid z)/q(\theta \mid z) = [p(\theta)/q(\theta)]\,[q(z)/p(z)]$, (iii) and (iv) now follow from (a), (b) and (ii). For (i),

$$1 - P = \int_{\Theta \setminus \Theta_0} \frac{p(\theta)}{q(\theta)} \frac{q(z)}{p(z)}\, q(\theta \mid z)\, d\theta \le \frac{\beta(1-q)}{q},$$

using (b) and the lower bound in (ii). Finally, (v) follows on bounding $\int |p(\theta \mid z) - q(\theta \mid z)|\, d\theta$ by the contribution from outside $\Theta_0$, which is at most $(1-P) + (1-q)$, plus the contribution from within $\Theta_0$, which, by (iii), is at most $\varepsilon$. ⊲
Proposition 5.11 therefore asserts that if a mathematically convenient alternative, $q(\theta)$, to the would-be honest prior, $p(\theta)$, can be found, giving high posterior probability to a set $\Theta_0 \subseteq \Theta$ within which it provides a good approximation to $p(\theta)$, and such that it is nowhere orders of magnitude smaller than $p(\theta)$ outside $\Theta_0$, then $q(\theta)$ may reasonably be used in place of $p(\theta)$. More specifically, if $\Theta_0$ has high probability under $q(\theta \mid z)$, and $\alpha$ is chosen to be small and $\beta$ not too large, then (i) implies that $\Theta_0$ also has high probability under $p(\theta \mid z)$; (ii), (iii) and (iv) establish that the respective predictive and posterior densities are close; and, if $f$ is taken to be the indicator function of any subset of $\Theta$, (v) provides a bound on the inaccuracy of the posterior probability statement made using $q(\theta \mid z)$ rather than $p(\theta \mid z)$. More generally, (v) establishes that the posterior expectations of bounded functions are very close.

Figure 5.3 (Typical conditions for precise measurement) illustrates, in stylised form, a frequently occurring situation in which the choice $q(\theta) = c$, for some constant $c$, provides
a convenient approximation. Given observations $z = (x_1, \ldots, x_n)$, the likelihood is highly peaked relative to $p(\theta)$, which has little curvature in the region of non-negligible likelihood. In this situation of "precise measurement" (Savage, 1962), the choice $q(\theta) = c$, for an appropriate constant $c$, clearly satisfies the conditions of Proposition 5.11, and we obtain the normalised likelihood function as an approximation to the posterior density.

The second of the implementation questions posed at the end of Section 5.1 concerned the possibility of avoiding the need for a precise mathematical representation of the prior density in situations where the information provided by the data is far greater than that implicit in the prior. The above analysis goes some way to answering that question; the following section provides a more detailed analysis.

5.3 ASYMPTOTIC ANALYSIS

In Chapter 4, we saw that, in representations of belief models for observables involving a parametric model $p(x \mid \theta)$ and a prior specification $p(\theta)$, the parameter $\theta$ acquired an operational meaning as some form of strong law limit of observables. Given observations $z = (x_1, \ldots, x_n)$, the posterior distribution, $p(\theta \mid z)$, then describes beliefs about that strong law limit in the light of the information provided by $x_1, \ldots, x_n$. To answer the second question posed at the end of Section 5.1,
i. we shall see that this is. 6. then r1x liin p ( 8 . in the sense that the logarithmic divergences.)be ohser. .1 Discrete Asymptotics Webegin by considering the situation whereC3 = {el. XI.. By Bayes’ theorem. s.}. the corresponding strong law limit. = 1.2. Suppose thut Of E 8 is the true vulue of 8 and thur.12. u n d r h e p r i u r y ( 8 ) = { p i . } consistsofacountable (possibly finite) set of values. . as 11 The righthand side is negative for all i # t. I O ) . . Proof. . and equals ixro for I = I. vurions for which a belief model is defined by the purumetric model p ( . . 82. p where Conditional on O f . . 0 and S. (Discrete asymptotics).[y(r I e.p . by the strong law of large numbers (see Section 3. as 7) cx.I: I 8 ) . Intuitively. p 2 . . . = lim p ( O I 12) = 0.3)..e.. C. i 1 1x # f. indeed.286 5 Inference now wish to examine various properties of p(8 I z) the number of observations as increases. . kt x = (q. is “distinguishable”from the others. I z) 1. .)]d.c are strictly larger than zero.. . the latter is the sum of I I independent identically distributed randomquantities and hence. such that the parametric model correspondingto the true parameter. we would hope that beliefs about 8 would become more and more concentrated around the “true” parameter value. 0 2 } .. p .for all I # t . and assuming that p(z18) = nyFl( x . for all i # t . where8 E 0 = { 8 .e.x for i # t . > 0..rI B f ) / p ( r1 e. Proposition 5.3.. so that. i. the case.rl.. which establishes the result.)logb)(.    .  5.. Under appropriate conditions..
An alternative way of expressing the result of Proposition 5.12 is to say that the posterior distribution function for $\theta$ ultimately degenerates to a step function with a single (unit) step at $\theta = \theta_t$. The result established is for countable $\Theta$; however, under suitable regularity conditions, this result can be shown to hold for much more general forms of $\Theta$. The proofs require considerable measure-theoretic machinery, and the reader is referred to Berk (1966, 1970) for details. A particularly interesting result is that if the true $\theta$ is not in $\Theta$, the posterior degenerates onto the value in $\Theta$ which gives the parametric model closest, in logarithmic divergence, to the true model.

5.3.2 Continuous Asymptotics

Let us now consider what can be said, in the case of general $\Theta$, about the forms of probability statements implied by $p(\theta \mid z)$ for large $n$. Proceeding heuristically for the moment, without concern for precise regularity conditions, we note that, in the case of a parametric representation for an exchangeable sequence of observables,

$$p(\theta \mid z) \propto p(\theta)\, p(z \mid \theta) = \exp\{ \log p(\theta) + \log p(z \mid \theta) \} .$$

If we now expand the two logarithmic terms about their respective maxima, $m_0$ (the prior mode, which maximises $p(\theta)$) and $\hat\theta_n$ (the maximum likelihood estimate, which maximises $p(z \mid \theta)$), assumed to be determined by setting $\nabla \log p(\theta) = 0$ and $\nabla \log p(z \mid \theta) = 0$, respectively, we obtain

$$\log p(\theta) = \log p(m_0) - \tfrac{1}{2}(\theta - m_0)' H_0\, (\theta - m_0) + R_0,$$

$$\log p(z \mid \theta) = \log p(z \mid \hat\theta_n) - \tfrac{1}{2}(\theta - \hat\theta_n)' H(\hat\theta_n)\, (\theta - \hat\theta_n) + R_n,$$

where $R_0$, $R_n$ denote remainder terms, and

$$H_0 = -\,\nabla\nabla' \log p(\theta)\big|_{\theta = m_0}, \qquad H(\hat\theta_n) = -\,\nabla\nabla' \log p(z \mid \theta)\big|_{\theta = \hat\theta_n} .$$

Assuming regularity conditions which ensure that $R_0$ and $R_n$ are small for large $n$, we see that, ignoring constants of proportionality,
$$p(\theta \mid z) \mathrel{\propto} \exp\left\{ -\tfrac{1}{2}(\theta - m_0)' H_0\, (\theta - m_0) - \tfrac{1}{2}(\theta - \hat\theta_n)' H(\hat\theta_n)\, (\theta - \hat\theta_n) \right\}, \qquad \text{approximately.}$$

This heuristic development thus suggests that $p(\theta \mid z)$ will, for large $n$, tend to resemble a multivariate normal distribution, $N_k(\theta \mid m_n, H_n)$, where $k$ is the dimension of $\theta$ and

$$H_n = H_0 + H(\hat\theta_n), \qquad m_n = H_n^{-1}\left( H_0\, m_0 + H(\hat\theta_n)\,\hat\theta_n \right) :$$

a normal distribution whose mean is a matrix-weighted average of a prior (modal) estimate and an observation-based (maximum likelihood) estimate, and whose precision matrix is the sum of the prior precision matrix and the observed information matrix. The Hessian matrix $H(\hat\theta_n)$ measures the local curvature of the log-likelihood function at its maximum, and is often called the observed information matrix. In the case $\theta \in \Theta \subseteq \Re$, the approximate posterior variance is the negative reciprocal of the rate of change of the first derivative of $\log p(z \mid \theta)$ in the neighbourhood of its maximum: sharply peaked log-likelihoods imply small posterior uncertainty, and vice-versa.

Other approximations suggest themselves. For example, for large $n$ the prior precision will tend to be small compared with the precision provided by the data and could be ignored; we might then approximate $p(\theta \mid z)$ by either $N_k(\theta \mid \hat\theta_n, H(\hat\theta_n))$ or $N_k(\theta \mid \hat\theta_n, n I(\hat\theta_n))$, where $I(\theta)$, defined by

$$I_{ij}(\theta) = -\int p(x \mid \theta)\, \frac{\partial^2 \log p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}\, dx, \qquad i, j = 1, \ldots, k,$$

is the so-called Fisher (or expected) information matrix, since, by the strong law of large numbers, $n^{-1} H(\hat\theta_n)$ will, for large $n$ and under suitable regularity conditions, be close to $I(\hat\theta_n)$.

There is a large literature on the regularity conditions required to justify mathematically the heuristics presented above.
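For the Bernoulli model under a uniform prior, the normal approximation $N(\theta \mid \hat\theta_n, H(\hat\theta_n))$ can be compared directly with the exact beta posterior; here $H(\hat\theta_n) = n/(\hat\theta_n(1-\hat\theta_n))$ is the observed information. A sketch of this comparison (our own, not from the text):

```python
from math import sqrt

def normal_approximation(r, n):
    # Mode at the mle; variance = inverse observed information = mle(1 - mle)/n
    mle = r / n
    return mle, sqrt(mle * (1 - mle) / n)

def exact_beta_moments(r, n):
    # Exact posterior under a uniform prior: Be(r + 1, n - r + 1)
    a, b = r + 1, n - r + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

for n in (20, 200, 2000):
    r = int(0.3 * n)
    print(n, normal_approximation(r, n), exact_beta_moments(r, n))
# the two (mean, sd) pairs agree increasingly closely as n grows
```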
Those who have contributed to the field include Laplace (1812), Jeffreys (1939/1961), LeCam (1953, 1956, 1958), Lindley (1961b), Freedman (1963b, 1965), Johnson (1967, 1970), Walker (1969), Chao (1970), Dawid (1970), DeGroot (1970, Chapter 10), Ibragimov and Hasminski (1973), Heyde and Johnstone (1979), Hartigan (1983, Chapter 4), Bermúdez (1985), Chen (1985), Sweeting and Adekola (1987), Fu and Kass (1988), Fraser and McDunnough (1989), Sweeting (1992) and Ghosh et al. (1994). Related work on higher-order expansion approximations, in which the normal appears as a leading term, includes that of Hartigan (1965), Johnson and Ladalla (1979) and Crowder (1988). The account given below is based on Chen (1985).

In what follows, we assume that $\theta \in \Theta \subseteq \Re^k$ and that $\{p_n(\theta)\}$ is a sequence of posterior densities for $\theta$, typically of the form $p_n(\theta) = p(\theta \mid x_1, \ldots, x_n)$, derived from an exchangeable sequence with parametric model $p(x \mid \theta)$ and prior $p(\theta)$, although the mathematical development to be given does not require this. We define

$$L_n(\theta) = \log p_n(\theta),$$

and assume throughout that, for every $n$, there is a strict local maximum, $m_n$, of $p_n$, satisfying

$$L_n'(m_n) = \nabla L_n(\theta)\big|_{\theta = m_n} = 0,$$

and implying the existence and positive-definiteness of

$$\Sigma_n = \left( -L_n''(m_n) \right)^{-1}, \qquad L_n''(m_n) = \nabla\nabla' L_n(\theta)\big|_{\theta = m_n} .$$

Defining $|\theta| = (\theta'\theta)^{1/2}$ and

$$B_\delta(\theta^*) = \{ \theta \in \Theta : |\theta - \theta^*| < \delta \},$$
we shall 8 and = 8 show that the following three basic conditions are sufficient to ensure a valid normal approximation for p . there exists N and 6 > 0 such that. Fu and Kass (1988).) satisfying: L:.3 Asymptotic Analysis 289 1966. (or. Sweeting (1992) and Ghosh et al. Hartigan (1983. p(x 18) and prior p ( 8 ) . Johnson (1967. that the limit of pr. = 2 .) + Essentially. .A ( E 5 L~(8){L”(m. be a global maximum of p n . .2)}~1 ) 5 I t A(&). Chapter lo). L. 0 =V = and implying the existence and positivedefiniteness of Cn = (~::(mn))’. We do not require any assumption that the m. We I C. ~ 1 3 6 i pn(8)d8 1 as n + 30. Chao (1970).
290   5 Inference

Proposition 5.13. (Bounded concentration). Given (c1) and (c2),
$$\lim_{n\to\infty}\,(2\pi)^{k/2}\,|\Sigma_n|^{1/2}\,p_n(m_n) \le 1,$$
with equality if and only if (c3) holds.

Proof. Given $\varepsilon > 0$, consider $n > N$ and $\delta > 0$ as given in (c2). Since $L_n'(m_n) = 0$, a simple Taylor expansion establishes that, for any $\theta \in B_\delta(m_n)$,
$$L_n(\theta) - L_n(m_n) = \tfrac12 (\theta - m_n)^t\, L_n''(\theta^*)\,(\theta - m_n),$$
for some $\theta^*$ lying between $\theta$ and $m_n$, so that, by (c2), writing $z = \Sigma_n^{-1/2}(\theta - m_n)$,
$$-\tfrac12 z^t \{I + A(\varepsilon)\} z \;\le\; L_n(\theta) - L_n(m_n) \;\le\; -\tfrac12 z^t \{I - A(\varepsilon)\} z.$$
Integrating these bounds over $B_\delta(m_n)$ and changing variables to $z$, we obtain
$$P_n(B_\delta(m_n)) \le (2\pi)^{k/2}|\Sigma_n|^{1/2}\,p_n(m_n)\,|I - A(\varepsilon)|^{-1/2},$$
together with the corresponding lower bound involving $|I + A(\varepsilon)|^{-1/2}$ and the normal probability of the transformed region which, by (c1), tends to one as $n \to \infty$. Since $P_n(B_\delta(m_n)) \le 1$ for all $n$, and $|I \pm A(\varepsilon)| \to 1$ as $\varepsilon \to 0$, the required inequality follows; moreover, equality holds if and only if $\lim_n P_n(B_\delta(m_n)) = 1$ for every $\delta > 0$, which is condition (c3). ◁
5.3 Asymptotic Analysis   291

We can now establish the main result.

Proposition 5.14. (Asymptotic posterior normality). For each $n$, consider $p_n(\cdot)$ as the density function of a random quantity $\theta_n$, and define $\phi_n = \Sigma_n^{-1/2}(\theta_n - m_n)$. Then, given (c1) and (c2), condition (c3) is a necessary and sufficient condition for $\phi_n$ to converge in distribution to $\phi$, where $\phi$ has an $N_k(\phi \mid 0, I_k)$ distribution. This may colloquially be stated as "$\theta$ has an asymptotic posterior $N_k(\theta \mid m_n, \Sigma_n^{-1})$", where $\Sigma_n^{-1} = -L_n''(m_n)$ is the precision matrix.

Proof. Writing $b \ge a$, for $a, b \in \mathbb{R}^k$, to denote that all components of $b - a$ are non-negative, it suffices to show that, given (c1) and (c2), as $n \to \infty$,
$$P(a \le \phi_n \le b) \to \int_{Z(0)} \varphi(z)\,dz, \qquad \varphi(z) = (2\pi)^{-k/2}\exp\{-\tfrac12 z^t z\},$$
where $Z(0) = \{z;\ a \le z \le b\}$, if and only if (c3) holds. By a similar argument to that used in Proposition 5.13, for any $\varepsilon > 0$ and sufficiently large $n$, $P(a \le \phi_n \le b)$ is bounded above by
$$(2\pi)^{k/2}|\Sigma_n|^{1/2}\,p_n(m_n)\,|I - A(\varepsilon)|^{-1/2}\int_{Z(\varepsilon)}\varphi(z)\,dz, \qquad Z(\varepsilon) = \{z;\ [I - A(\varepsilon)]^{1/2} a \le z \le [I - A(\varepsilon)]^{1/2} b\},$$
and is bounded below by a similar quantity with $+A(\varepsilon)$ in place of $-A(\varepsilon)$. As $\varepsilon \to 0$, $Z(\varepsilon) \to Z(0)$, and the result then follows from Proposition 5.13, since $(2\pi)^{k/2}|\Sigma_n|^{1/2}\,p_n(m_n) \to 1$ if and only if (c3) holds. ◁
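A simple numerical illustration of Proposition 5.14 (a sketch, not from the book): for Bernoulli data with a Beta posterior, the standardised posterior should lose its skewness and excess kurtosis as $n$ grows, consistent with a standard normal limit. The 30% success rate and the Beta(0.5, 0.5) prior below are arbitrary choices for illustration.

```python
import math

def beta_skew_kurt(a, b):
    # Standard closed forms for the skewness and excess kurtosis of Beta(a, b).
    skew = 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))
    ex_kurt = (6.0 * ((a - b)**2 * (a + b + 1.0) - a * b * (a + b + 2.0))
               / (a * b * (a + b + 2.0) * (a + b + 3.0)))
    return skew, ex_kurt

results = []
for n in (10, 100, 10000):
    s = int(0.3 * n)              # imagine a fixed 30% observed success rate
    a, b = s + 0.5, n - s + 0.5   # posterior from a Beta(0.5, 0.5) prior
    results.append(beta_skew_kurt(a, b))
print(results)
```

Both shape measures shrink towards the normal values (zero) as the sample size increases, which is the content of the "asymptotic posterior normality" statement in this conjugate case.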
292   5 Inference

Conditions (c1) and (c2) are often relatively easy to check in specific applications, but (c3) may not be so directly accessible. It is useful, therefore, to have available alternative conditions which, given (c1) and (c2), imply (c3). Two such are provided by the following:

(c4) for any $\delta > 0$, there exist an integer $N$ and $d > 0$ such that, for any $n > N$ and $\theta \notin B_\delta(m_n)$,
$$L_n(\theta) - L_n(m_n) < -n\,d;$$

(c5) as (c4), but with the bound $L_n(\theta) - L_n(m_n) < G(\theta) - n\,d$, where $G(\theta) = \log g(\theta)$ for some density (or normalisable positive function) $g(\theta)$ over $\Theta$.

Proposition 5.15. (Alternative conditions). Given (c1) and (c2), either (c4) or (c5) implies (c3).

Proof. It is straightforward to verify that, given (c5), for any $n > N$ and $\theta \notin B_\delta(m_n)$ we have $p_n(\theta) \le p_n(m_n)\,e^{-nd}\,g(\theta)$, so that
$$\int_{\theta \notin B_\delta(m_n)} p_n(\theta)\,d\theta \;\le\; \Big[(2\pi)^{k/2}|\Sigma_n|^{1/2}\,p_n(m_n)\Big]\,(2\pi)^{-k/2}\,|\Sigma_n|^{-1/2}\,e^{-nd}.$$
Since $(2\pi)^{k/2}|\Sigma_n|^{1/2}\,p_n(m_n)$ is bounded (Proposition 5.13) and the remaining terms on the right-hand side clearly tend to zero, it follows that the left-hand side tends to zero as $n \to \infty$, which is (c3); the argument given (c4) is similar. ◁

Moreover, (c4) does not even require the use of a proper prior for the vector $\theta$. To understand better the relative ease of checking (c4) or (c5) in applications, we note that, if $p_n(\theta)$ is based on data $z_n$, so that
$$L_n(\theta) = \log p(z_n \mid \theta) + \log p(\theta) - \log p(z_n),$$
then the difference $L_n(\theta) - L_n(m_n)$ does not involve the, often intractable, normalising constant $p(z_n)$. We shall illustrate the use of (c4) for the general case of canonical conjugate analysis for exponential families.
5.3 Asymptotic Analysis   293

Proposition 5.16. (Asymptotic normality under conjugate analysis). Suppose that $y_1, \ldots, y_n$ are data resulting from a random sample of size $n$ from the canonical exponential family form
$$p(y \mid \psi) = a(y)\exp\{y\psi - b(\psi)\},$$
with $b(\psi)$ a continuously differentiable and strictly convex function (see Section 5.2.2), together with the canonical conjugate prior density
$$p(\psi \mid n_0, y_0) \propto \exp\{n_0 y_0 \psi - n_0 b(\psi)\}.$$
For each $n$, consider the posterior density $p_n(\psi) = p(\psi \mid y_1, \ldots, y_n)$ to be the density function of a random quantity $\psi_n$, and define $\phi_n = \Sigma_n^{-1/2}(\psi_n - m_n)$, where, for each $n$, $m_n$ is the posterior mode, defined by
$$b'(m_n) = \frac{n_0 y_0 + n\bar y_n}{n_0 + n}, \qquad \bar y_n = \sum_{i=1}^n y_i / n,$$
and $\Sigma_n^{-1} = (n_0 + n)\,b''(m_n)$. Then $\phi_n$ converges in distribution to $\phi$, where $\phi$ has an $N(\phi \mid 0, 1)$ distribution. Colloquially, $\psi$ has an asymptotic posterior $N(\psi \mid m_n, (n_0 + n)\,b''(m_n))$ distribution.

Proof. For some $c_n$ not involving $\psi$, we have
$$L_n(\psi) = \log p_n(\psi) = (n_0 + n)\,h(\psi) + c_n, \qquad h(\psi) = b'(m_n)\,\psi - b(\psi),$$
where, by the strict convexity of $b$, $h$ is strictly concave, so that $p_n(\psi)$ is unimodal with a maximum at $\psi = m_n$, satisfying $\nabla h(m_n) = 0$, and
$$\Sigma_n^{-1} = -L_n''(m_n) = (n_0 + n)\,b''(m_n) > 0.$$
Condition (c1) then follows, since the eigenvalues of $\Sigma_n$ are of order $(n_0 + n)^{-1}$, and (c2) follows from the continuity of $b''$.
294   5 Inference

Furthermore, for
$$c = \inf\{|\nabla h(\psi)|;\ \psi \notin B_\delta(m_n)\} > 0,$$
the strict concavity of $h$, with its maximum at $m_n$, implies that, for any $\psi \notin B_\delta(m_n)$,
$$L_n(\psi) - L_n(m_n) = (n_0 + n)\{h(\psi) - h(m_n)\} < -(n_0 + n)\,c\,\delta,$$
so that (c4) is satisfied, and the result follows by Propositions 5.14 and 5.15. ◁

Example 5.4. (continued). Suppose that $\text{Be}(\theta \mid \alpha_n, \beta_n)$, with $\alpha_n = \alpha + r_n$ and $\beta_n = \beta + n - r_n$, is the posterior derived from $n$ Bernoulli trials with $r_n = \sum_{i=1}^n x_i$ successes and a $\text{Be}(\theta \mid \alpha, \beta)$ prior. Proceeding directly,
$$L_n(\theta) = \log p_n(\theta) = (\alpha_n - 1)\log\theta + (\beta_n - 1)\log(1 - \theta) - \log B(\alpha_n, \beta_n),$$
so that
$$L_n'(\theta) = \frac{\alpha_n - 1}{\theta} - \frac{\beta_n - 1}{1 - \theta}, \qquad -L_n''(\theta) = \frac{\alpha_n - 1}{\theta^2} + \frac{\beta_n - 1}{(1 - \theta)^2},$$
giving the posterior mode $m_n = (\alpha_n - 1)/(\alpha_n + \beta_n - 2)$. Condition (c1) is clearly satisfied, since $\Sigma_n = \{-L_n''(m_n)\}^{-1} \to 0$ as $n \to \infty$; condition (c2) follows from the fact that $L_n''(\theta)$ is a continuous function of $\theta$; and (c4) may be verified with an argument similar to the one used in the proof of Proposition 5.16. Taking $\alpha = \beta = 1$ for illustration, we see that $m_n = r_n/n$ and
$$-L_n''(m_n) = \frac{n}{m_n(1 - m_n)},$$
and hence that the asymptotic posterior for $\theta$ is
$$N\!\left(\theta \,\Big|\, \frac{r_n}{n},\ \frac{n}{m_n(1 - m_n)}\right).$$
(As an aside, we note the interesting "duality" between this asymptotic form for $\theta$ given $r_n$ and $n$, and the asymptotic distribution for $r_n/n$ given $\theta$ which, by the central limit theorem, has the form $N(r_n/n \mid \theta, n/\{\theta(1-\theta)\})$. Further reference to this kind of "duality" will be given in Appendix B.)
5.3 Asymptotic Analysis   295

Asymptotics under Transformations

The result of Proposition 5.16 is given in terms of the canonical parametrisation of the exponential family underlying the conjugate analysis. This prompts the obvious question as to whether asymptotic posterior normality "carries over", with appropriate transformations of the mean and covariance, to an arbitrary (one-to-one) reparametrisation of the model. More generally, we could ask the same question in relation to Proposition 5.14. A partial answer is provided by the following.

Proposition 5.17. (Asymptotic normality under transformation). With the notation and background of Proposition 5.14, suppose that $\theta_n$ has an asymptotic $N_k(\theta \mid m_n, \Sigma_n^{-1})$ distribution, with $\bar\sigma_n^2 \to 0$ and $m_n \to \theta_0$ in probability under $p(x \mid \theta_0)$, where $\bar\sigma_n^2$ ($\underline\sigma_n^2$) is the largest (smallest) eigenvalue of $\Sigma_n$; suppose further that, given any $\delta > 0$, there is a constant $c(\delta)$ such that $P(|\phi_n| \le c(\delta)) \ge 1 - \delta$ for all sufficiently large $n$. Then, if $v = g(\theta)$ is a transformation such that, at $\theta = \theta_0$, the Jacobian matrix $J_g(\theta) = \partial g(\theta)/\partial\theta$ is nonsingular with continuous entries, $v_n = g(\theta_n)$ has an asymptotic distribution
$$N_k\big(v \,\big|\, g(m_n),\ \{J_g(m_n)\,\Sigma_n\,J_g(m_n)^t\}^{-1}\big).$$
This is a generalisation and Bayesian reformulation of classical results presented in Serfling (1980, Section 3.3).

Proof. For details, see Mendoza (1994). ◁

For any finite $n$, the adequacy of the normal approximation provided by Proposition 5.17 may be highly dependent on the particular transformation used. Anscombe (1964a, 1964b) analyses the choice of transformations which improve asymptotic normality; a related issue is that of selecting appropriate parametrisations for various numerical approximation methods (Hills and Smith, 1992, 1993).

The expression for the asymptotic posterior precision matrix (inverse covariance matrix) given in Proposition 5.17 is often rather cumbersome to work with. A simpler, alternative form is given by the following.

Corollary 1. (Asymptotic precision after transformation). In Proposition 5.17, if $H_n(\theta) = \Sigma_n^{-1}$ denotes the asymptotic precision matrix for $\theta_n$, then the asymptotic precision matrix for $v_n = g(\theta_n)$ has the form
$$J_{g^{-1}}(v_n)^t\, H_n\, J_{g^{-1}}(v_n), \qquad J_{g^{-1}}(v) = \frac{\partial g^{-1}(v)}{\partial v},$$
where $J_{g^{-1}}(v)$ is the Jacobian of the inverse transformation.

Proof. This follows immediately by reversing the roles of $\theta$ and $v$ in Proposition 5.17. ◁
296   5 Inference

In many applications, we simply wish to consider one-to-one transformations of a single parameter. The next result provides a convenient summary of the required transformation result.

Corollary 2. (Asymptotic normality after scalar transformation). Suppose that, given the conditions of Proposition 5.17 with scalar $\theta$, the asymptotic posterior for a parameter $\theta \in \mathbb{R}$ is given by $N(\theta \mid m_n, h_n(m_n))$, where $h_n(m_n) \to \infty$ and $m_n \to \theta_0$ in probability as $n \to \infty$, and that $v = g(\theta)$ is such that $g'(\theta) = dg(\theta)/d\theta$ is continuous and non-zero at $\theta = \theta_0$. Then the asymptotic posterior distribution for $v$ is
$$N\big(v \,\big|\, g(m_n),\ h_n(m_n)\,[g'(m_n)]^{-2}\big).$$

Proof. This is the scalar special case of Proposition 5.17 and its Corollary 1. ◁

Example 5.4. (continued). Suppose, again, that $\text{Be}(\theta \mid \alpha + r_n, \beta + n - r_n)$ is the posterior distribution of the parameter of a Bernoulli distribution after $n$ trials, and suppose now that we are interested in the asymptotic posterior distribution of the variance-stabilising transformation
$$v = g(\theta) = 2\sin^{-1}\!\sqrt{\theta}.$$
Straightforward application of Corollary 2 to Proposition 5.17 leads to the asymptotic distribution
$$N\big(v \,\big|\, 2\sin^{-1}\!\sqrt{r_n/n},\ n\big),$$
whose precision does not depend on the observed proportion of successes.

It is clear from the presence of the term $[g'(m_n)]$ in the form of the asymptotic precision given in Corollary 2 to Proposition 5.17 that things will go wrong if $g'(m_n) \to 0$ as $n \to \infty$. A concrete illustration of the problems that arise when such a condition is not met is given by the following.

Example 5.8. (Non-normal asymptotic posterior). Suppose that $\theta$ has an asymptotic posterior distribution of the form $N(\theta \mid \bar x_n, n)$, perhaps derived from a random sample $x_1, \ldots, x_n$ from $N(x \mid \theta, 1)$, and suppose that the actual value of $\theta$ generating the sequence $x_1, x_2, \ldots$ through $N(x \mid \theta_0, 1)$ is $\theta_0 = 0$. Now consider the transformation $v = g(\theta) = \theta^2$. Technically, $g'(0) = 0$ and the condition of the corollary is not satisfied. Intuitively, it is clear that $v$ cannot have an asymptotic normal distribution, since the sequence $g(m_n) = \bar x_n^2$ is converging in probability to $0$ through strictly positive values, and $v \ge 0$. In fact, it can be shown that the asymptotic posterior distribution of $nv$ is (non-central) $\chi^2_1$ in this case.
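The variance-stabilising property invoked for the Bernoulli case can be checked directly. The sketch below (hypothetical values of the mode `m`; not from the book) applies the scalar delta-method form of the corollary, $[g'(m_n)]^2 \times \text{var}(\theta)$, to $g(\theta) = 2\arcsin\sqrt{\theta}$, whose derivative is $1/\sqrt{\theta(1-\theta)}$, and confirms that the resulting asymptotic variance is $1/n$ whatever the value of the mode.

```python
import math

def g_prime(t):
    # d/dt [ 2*arcsin(sqrt(t)) ] = 1 / sqrt(t * (1 - t))
    return 1.0 / math.sqrt(t * (1.0 - t))

n = 50
variances = []
for m in (0.1, 0.3, 0.5, 0.9):
    var_theta = m * (1.0 - m) / n        # asymptotic posterior variance of theta
    var_nu = g_prime(m)**2 * var_theta   # delta-method variance of nu = g(theta)
    variances.append(var_nu)
print(variances)  # each entry should equal 1/n = 0.02
```

The factor $\theta(1-\theta)$ cancels exactly, which is precisely what "variance-stabilising" means.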
5.3 Asymptotic Analysis   297

One attraction of the availability of the results given in Proposition 5.17 and Corollary 1 is that verification of the conditions for asymptotic posterior normality (as in, for example, Proposition 5.14) may be much more straightforward under one choice of parametrisation of the likelihood than under another. The result given enables us to identify the asymptotic posterior normal form for any convenient choice of parametrisation, subsequently deriving the form for the parameters of interest by straightforward transformation. An indication of the usefulness of this result is given in the following example (and further applications can be found in Section 5.4).

Example 5.9. (Asymptotic posterior normality for a ratio). Suppose that we have a random sample $x_1, \ldots, x_n$ from the model $\{\prod_{i=1}^n N(x_i \mid \theta_1, 1),\ N(\theta_1 \mid 0, \lambda_1)\}$ and, independently, another random sample $y_1, \ldots, y_n$ from the model $\{\prod_{j=1}^n N(y_j \mid \theta_2, 1),\ N(\theta_2 \mid 0, \lambda_2)\}$, where $\lambda_1 \simeq 0$, $\lambda_2 \simeq 0$ and $\theta_2 \neq 0$. We are interested in the posterior distribution of $\phi_1 = \theta_1/\theta_2$.

First, we note that the joint posterior distribution for $\theta = (\theta_1, \theta_2)$ is very easily verified to be (for $\lambda_1 \simeq 0$, $\lambda_2 \simeq 0$) approximately
$$N_2\big(\theta \,\big|\, (\bar x_n, \bar y_n),\ n I_2\big),$$
for which the conditions of Proposition 5.17 are clearly satisfied. Secondly, we note that $g(\theta_1, \theta_2) = (\phi_1, \phi_2) = (\theta_1/\theta_2, \phi_2)$ is a one-to-one transformation provided $\phi_2$ is chosen appropriately; an obvious choice for $\phi_2$ is $\phi_2 = \theta_2$, so that $g^{-1}(\phi_1, \phi_2) = (\phi_1\phi_2, \phi_2)$ and the Jacobian of the inverse transformation,
$$J_{g^{-1}}(\phi) = \begin{pmatrix} \phi_2 & \phi_1 \\ 0 & 1 \end{pmatrix},$$
has determinant $\phi_2$, which is non-zero for $\theta_2 \neq 0$. It follows from Proposition 5.17 and Corollary 1 that the asymptotic posterior of $(\phi_1, \phi_2)$ is normal and, marginalising to $\phi_1$, that the required asymptotic posterior for $\phi_1 = \theta_1/\theta_2$ is normal with mean $\hat\phi_1 = \bar x_n / \bar y_n$ and variance
$$\frac{1 + \hat\phi_1^2}{n\,\bar y_n^2}.$$
Any reader remaining unappreciative of the simplicity of the above analysis may care to examine the form of the likelihood function corresponding to an initial parametrisation directly in terms of $(\phi_1, \phi_2)$, and to contemplate verifying directly the conditions of Proposition 5.14 using the $\phi_1, \phi_2$ parametrisation.
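A quick Monte Carlo sanity check of the ratio example (a sketch with hypothetical posterior means `xbar`, `ybar`; not from the book): sampling $(\theta_1, \theta_2)$ from the bivariate asymptotic posterior and forming the ratio should reproduce, approximately, the delta-method variance $(1/n)(1/\bar y^2 + \bar x^2/\bar y^4)$ derived above.

```python
import math
import random

random.seed(1)
n, xbar, ybar = 200, 1.0, 2.0
sd = 1.0 / math.sqrt(n)   # posterior standard deviation of each component

# Sample the ratio theta1/theta2 under the N2((xbar, ybar), n^{-1} I) posterior.
samples = [(xbar + sd * random.gauss(0, 1)) / (ybar + sd * random.gauss(0, 1))
           for _ in range(200000)]
mean = sum(samples) / len(samples)
var = sum((s - mean)**2 for s in samples) / len(samples)

# Delta-method (asymptotic) variance of the ratio.
delta_var = (1.0 / n) * (1.0 / ybar**2 + xbar**2 / ybar**4)
print(mean, var, delta_var)
```

For this sample size the simulated variance agrees with the delta-method value to within Monte Carlo error of a few percent.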
5.4 REFERENCE ANALYSIS

In the previous section, we have examined situations where data corresponding to large sample sizes come to dominate prior information, leading to inferences which are negligibly dependent on the initial state of information. The third of the questions posed earlier relates to specifying prior distributions in situations where it is felt that, relative to the data, the prior should have a minimal effect: where the data should be expected to dominate prior information because of the "vague" nature of the latter.

However, it is as well to make clear straightaway our own view, very much in the operationalist spirit with which we began our discussion of uncertainty in Chapter 2, that the problem of characterising a "non-informative" or "objective" prior distribution, representing "prior ignorance", "vague prior knowledge" and "letting the data speak for themselves", is far more complex than the apparent intuitive immediacy of these words and phrases would suggest. Put bluntly: data cannot ever speak entirely for themselves; every prior specification has some informative posterior or predictive implications, even for moderate sample sizes; and "vague" is itself much too vague an idea to be useful. There is no "objective" prior that represents ignorance.

On the other hand, we recognise that there is often a pragmatically important need for a form of prior to posterior analysis capturing, in some well-defined sense, the notion of the prior having a minimal effect, relative to the data, on the final inference. Such a reference analysis might be required as an approximation to actual individual beliefs; more typically, it might be required as a limiting "what if?" baseline in considering a range of prior to posterior analyses, or as a default option when there are insufficient resources for detailed elicitation of actual prior knowledge. We seek to move away, therefore, from the rather philosophically muddled debates about "prior ignorance" that have all too often confused these issues, and towards well-defined decision-theoretic and information-theoretic procedures.

In Section 5.6, we shall provide a brief review of the fascinating history of the quest for this "baseline", limiting prior form. In line with the unified perspective we have tried to adopt throughout this volume, the setting for our development of such a reference analysis will be the general decision-theoretic framework, together with the specific information-theoretic tools that have emerged in earlier chapters as key measures of the discrepancies (or "distances") between belief distributions. From the approach we adopt, it will be clear that the reference prior component of the analysis is simply a mathematical tool, whose role and character will be precisely defined. Its main use will be to provide a "conventional" prior, to be used when a default specification having a claim to being non-influential in the sense described above is required. It has considerable pragmatic importance in implementing a reference analysis, but it is not a privileged, "uniquely non-informative" or "objective" prior.
5.4.1 Reference Decisions

Consider a specific form of decision problem with possible decisions $d \in \mathcal{D}$ providing possible answers, $a \in \mathcal{A}$, to an inference problem, with unknown state of the world $\omega = (\omega_1, \omega_2)$, utilities for consequences $(a, \omega)$ given by $u(d(\omega)) = u(a, \omega_1)$, and the availability of an experiment $e$ which consists of obtaining an observation $x$ having parametric model $p(x \mid \omega_2)$ and a prior probability density $p(\omega) = p(\omega_1 \mid \omega_2)\,p(\omega_2)$ for the unknown state of the world. This general structure describes a situation where practical consequences depend directly on the $\omega_1$ component of $\omega$, whereas inference from data $x \in X$ provided by experiment $e$ takes place indirectly, through the $\omega_2$ component of $\omega$, as described by $p(\omega_1 \mid \omega_2)$.

To avoid subscript proliferation, let us now, without any risk of confusion, indulge in a harmless abuse of notation by writing $\omega_1 = \omega$, $\omega_2 = \theta$. This both simplifies the exposition and has the mnemonic value of suggesting that $\omega$ is the state of the world of ultimate interest (since it occurs in the utility function), whereas $\theta$ is a parameter in the usual sense (since it occurs in the probability model). Often, of course, $\omega$ is just some function $\omega = \phi(\theta)$ of $\theta$; in general, the relationship between $\omega$ and $\theta$ is that described in their joint distribution $p(\omega, \theta) = p(\omega \mid \theta)\,p(\theta)$.

Now, for given conditional prior $p(\omega \mid \theta)$, utility function $u(a, \omega)$ and parametric model $p(x \mid \theta)$, let us examine, in utility terms, the influence of the prior $p(\theta)$, relative to the observational information provided by $e$. Using Definition 3.13(ii), the expected utility value of the experiment $e$, given the prior $p(\theta)$, is
$$v_u\{e, p(\theta)\} = \int_X p(x)\int u(a_x^*, \omega)\,p(\omega \mid x)\,d\omega\,dx \;-\; \int u(a_0^*, \omega)\,p(\omega)\,d\omega,$$
where $a_0^*$ denotes the optimal answer under $p(\omega)$, $a_x^*$ denotes the optimal answer under $p(\omega \mid x)$, and $p(\omega \mid x) = \int_\Theta p(\omega \mid \theta)\,p(\theta \mid x)\,d\theta$, assuming, as we do throughout, that $\omega$ is independent of $x$ given $\theta$. If $e(k)$ denotes the experiment consisting of $k$ independent replications of $e$, that is, yielding observations $\{x_1, \ldots, x_k\}$ with joint parametric model $\prod_{i=1}^k p(x_i \mid \theta)$, then $v_u\{e(k), p(\theta)\}$ has the same mathematical form as $v_u\{e, p(\theta)\}$, but with $x = (x_1, \ldots, x_k)$ and $p(x \mid \theta) = \prod_{i=1}^k p(x_i \mid \theta)$. Intuitively, at least in suitably regular cases, as $k \to \infty$ we obtain, from $v_u\{e(k), p(\theta)\}$, the expected (utility) value of perfect information about $\theta$.
300   5 Inference

Clearly, the more valuable the information contained in the prior $p(\theta)$, the less will be the expected value of perfect information about $\theta$ from experimentation; conversely, the less valuable the prior information, the more we would expect to gain from exhaustive experimentation. Writing
$$v_u\{e(\infty), p(\theta)\} = \lim_{k\to\infty} v_u\{e(k), p(\theta)\}$$
for the expected (utility) value of perfect (i.e., complete) information about $\theta$, assuming the limit to exist, this suggests a well-defined "thought experiment" procedure for characterising a minimally valuable prior: choose, from the class of priors which has been identified as compatible with other assumptions about $(\omega, \theta)$, that prior, $\pi(\theta)$ say, which maximises the expected value of perfect information about $\theta$. Such a prior will be called a $u$-reference prior. It is important to note that the limit above is not taken in order to obtain some form of asymptotic "approximation" to reference distributions: the "exact" reference prior is defined as that which maximises the value of perfect information about $\theta$, not as that which maximises the expected value of the experiment. Given actual data $x$, the posterior distributions
$$\pi(\omega \mid x) = \int_\Theta p(\omega \mid \theta)\,\pi(\theta \mid x)\,d\theta, \qquad \pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta),$$
derived from combining $\pi(\theta)$ with actual data $x$, will be called $u$-reference posteriors; and the optimal decision derived from $\pi(\omega \mid x)$ and $u(a, \omega)$ will be called a $u$-reference decision.

Example 5.10. (Prediction with quadratic loss). Suppose that beliefs about a sequence of observables, $x = \{x_1, x_2, \ldots\}$, correspond to assuming the latter to be a random sample from an $N(x \mid \mu, \lambda)$ parametric model, with known precision $\lambda$, together with a prior for $\mu$ to be selected from the class $\{N(\mu \mid \mu_0, \lambda_0),\ \mu_0 \in \mathbb{R},\ \lambda_0 > 0\}$. Assuming a quadratic loss function, the decision problem is to provide a point estimate for a future observation, which we denote simply by $x$, so that, in the general framework, $\omega = x$ and $\theta = \mu$. For the purposes of the "thought experiment", let $z_k = (x_1, \ldots, x_k)$ denote the (imagined) outcomes of $k$ replications of the experiment. By virtue of the normal distributional assumptions, the optimal estimate under quadratic loss is the predictive mean, and the corresponding expected loss is the predictive variance, which, given $z_k$, is $\lambda^{-1} + (\lambda_0 + k\lambda)^{-1}$ and does not depend explicitly on $z_k$. Straightforward manipulations then reveal that
$$v_u\{e(k), p(\mu)\} = \frac{1}{\lambda_0} - \frac{1}{\lambda_0 + k\lambda},$$
so that
$$v_u\{e(\infty), p(\mu)\} = \lim_{k\to\infty} v_u\{e(k), p(\mu)\} = \frac{1}{\lambda_0},$$
and the $u$-reference prior corresponds to the choice $\lambda_0 = 0$, with $\mu_0$ arbitrary.
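The closed form obtained in the prediction example is simple enough to tabulate. The sketch below (not from the book; the numerical settings are arbitrary) evaluates $v(k) = \lambda_0^{-1} - (\lambda_0 + k\lambda)^{-1}$, checks that it increases with $k$ towards the value of perfect information $\lambda_0^{-1}$, and that the latter grows as the prior precision $\lambda_0$ shrinks, which is what drives the reference choice $\lambda_0 = 0$.

```python
# Expected utility value of k replications in the normal prediction problem,
# assuming the formula v(k) = 1/lam0 - 1/(lam0 + k*lam) derived in the text.
def value_of_experiment(k, lam0, lam):
    return 1.0 / lam0 - 1.0 / (lam0 + k * lam)

lam = 1.0
# Value of perfect information, 1/lam0, for a few prior precisions.
v_inf = {lam0: 1.0 / lam0 for lam0 in (2.0, 1.0, 0.1)}
# v(k) for increasing k at lam0 = 1: monotone, converging to 1.
v_seq = [value_of_experiment(k, 1.0, lam) for k in (1, 10, 100, 1000)]
print(v_seq, v_inf)
```

Weaker priors (smaller $\lambda_0$) leave more to be gained from exhaustive experimentation, exactly as the text argues.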
5.4 Reference Analysis   301

Example 5.11. (Variance estimation). Suppose that beliefs about $x = \{x_1, \ldots, x_n\}$ correspond to assuming $x$ to be a random sample from $N(x \mid 0, \lambda)$, together with a gamma prior for $\lambda$ centred on $\lambda_0$, so that $p(\lambda) = \text{Ga}(\lambda \mid \alpha, \alpha\lambda_0^{-1})$, $\alpha > 0$. The decision problem is to provide a point estimate for $\sigma^2 = \lambda^{-1}$, assuming a standardised quadratic loss function, so that $\omega = \sigma^2$, $\theta = \lambda$, and
$$u(a, \sigma^2) = -\left(\frac{a - \sigma^2}{\sigma^2}\right)^2.$$
Under this loss, the optimal estimate is $a^* = E[\lambda]/E[\lambda^2]$. Let $z_k = \{x_1, \ldots, x_{nk}\}$ denote the outcome of $k$ replications of the experiment; then, by conjugacy,
$$p(\lambda \mid z_k) = \text{Ga}\Big(\lambda \,\Big|\, \alpha + \tfrac{nk}{2},\ \alpha\lambda_0^{-1} + \tfrac12\textstyle\sum_i x_i^2\Big),$$
and straightforward manipulations reveal that the expected value of perfect information is maximised, within the class of priors considered, as $\alpha \to 0$, so that the $u$-reference prior corresponds to the choice $\alpha = 0$, with $\lambda_0$ arbitrary.

Given actual data $x = (x_1, \ldots, x_n)$, with $ns^2 = \sum_{i=1}^n x_i^2$, the $u$-reference posterior for $\lambda$ is then $\text{Ga}(\lambda \mid n/2, ns^2/2)$ and, hence, the $u$-reference decision is to give the estimate
$$\hat\sigma^2 = \frac{ns^2/2}{(n/2) + 1} = \frac{ns^2}{n + 2}.$$
Thus, the reference estimator of $\sigma^2$ with respect to standardised quadratic loss is not the usual $s^2$, but a slightly smaller multiple of $s^2$. It is of interest to note that, from a frequentist perspective, $ns^2/(n+2)$ is the best invariant estimator of $\sigma^2$ and is admissible; indeed, it dominates $s^2$, or any other multiple of $s^2$, in terms of frequentist risk (cf. Example 45 in Berger, 1985a). Thus, the $u$-reference approach has led to the "correct" multiple of $s^2$ as seen from a frequentist perspective. For further information, see Bernardo (1981a) and Rabena (1998).

Explicit reference decision analysis is possible when the parameter space $\Theta = \{\theta_1, \ldots, \theta_M\}$ is finite.

5.4.2 One-dimensional Reference Distributions

In Sections 2.7 and 3.4, we noted that reporting beliefs is itself a decision problem, where the "inference answer" space consists of the class of possible belief distributions that could be reported about the quantity of interest, and the utility function is a proper scoring rule which, in pure inference problems, may be identified with the logarithmic scoring rule.
In discussing reference decisions, we have considered a rather general utility structure where practical interest centred on a quantity $\omega$ related to the parameter $\theta$ of an experiment by a conditional probability specification, $p(\omega \mid \theta)$. Here, we shall consider the case where the quantity of interest is $\theta$ itself; more general cases will be considered later. Our development of reference analysis from now on will concentrate on this case, with $\omega = \theta$, for which we simply denote $v_u\{\cdot\}$ by $v\{\cdot\}$ and replace the term "$u$-reference" by "reference".

5.4 Reference Analysis   303

If an experiment $e$ consists of an observation $x \in X$ having parametric model $p(x \mid \theta)$, with $\theta \in \Theta \subseteq \mathbb{R}$, the "inference answer" space is
$$\mathcal{A} = \Big\{q(\cdot);\ q(\theta) > 0,\ \int_\Theta q(\theta)\,d\theta = 1\Big\},$$
and the utility function is the logarithmic scoring rule
$$u\{q(\cdot), \theta\} = A\log q(\theta) + B(\theta), \qquad A > 0.$$
Noting that $u$ is a proper scoring rule, so that the optimal choices of $q(\cdot)$ with respect to $p(\theta)$ and $p(\theta \mid x)$ are $q_0(\cdot) = p(\cdot)$ and $q_x(\cdot) = p(\cdot \mid x)$, respectively, it is easily seen that, for any prior $p(\theta)$, the expected utility value of the experiment $e$—the amount of information about $\theta$ which $e$ may be expected to provide—is
$$I\{e, p(\theta)\} = \int_X p(x)\int_\Theta p(\theta \mid x)\log\frac{p(\theta \mid x)}{p(\theta)}\,d\theta\,dx.$$
The corresponding expected information from the (hypothetical) experiment $e(k)$, yielding the (imagined) observation $z_k = (x_1, \ldots, x_k)$ with parametric model $p(z_k \mid \theta) = \prod_{i=1}^k p(x_i \mid \theta)$, is
$$I\{e(k), p(\theta)\} = \int_{Z} p(z_k)\int_\Theta p(\theta \mid z_k)\log\frac{p(\theta \mid z_k)}{p(\theta)}\,d\theta\,dz_k,$$
where $\pi(\theta \mid z_k) \propto p(z_k \mid \theta)\,p(\theta)$ is the posterior distribution for $\theta$ after $z_k$ has been observed, and so the expected (utility) value of perfect information about $\theta$ is
$$I\{e(\infty), p(\theta)\} = \lim_{k\to\infty} I\{e(k), p(\theta)\},$$
provided that this limit exists. This quantity measures the missing information about $\theta$ as a function of the prior $p(\theta)$. The reference prior for $\theta$, denoted $\pi(\theta)$, is thus defined to be that prior which maximises the missing information functional; given actual data $x$, the reference posterior $\pi(\theta \mid x)$ to be reported is simply derived from Bayes' theorem, as $\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$.

Unfortunately, $\lim_{k\to\infty} I\{e(k), p(\theta)\}$ is typically infinite (unless $\theta$ can only take a finite range of values) and a direct approach to deriving $\pi(\theta)$ along these lines cannot be implemented. However, a natural way of overcoming this technical difficulty is available: we derive the sequence of priors $\pi_k(\theta)$ which maximise $I\{e(k), p(\theta)\}$ for each $k$, and subsequently take $\pi(\theta)$ to be a suitable limit. This approach will now be developed in detail.
For each $k$, the prior which maximises $I\{e(k), p(\theta)\}$ may be characterised by a standard variational argument. For any prior $p(\theta)$ one must have the constraint $\int_\Theta p(\theta)\,d\theta = 1$, and any suitably regular maximising function must be an extremal of the corresponding functional: perturbing $p(\cdot)$ to $p(\cdot) + \varepsilon\tau(\cdot)$, for any function $\tau$ with $\int\tau(\theta)\,d\theta = 0$, and requiring the derivative with respect to $\varepsilon$ to vanish at $\varepsilon = 0$, the required condition becomes, after some algebra, that the desired extremal should satisfy, for all $\theta \in \Theta$,
$$p(\theta) \propto \exp\Big\{\int_Z p(z_k \mid \theta)\,\log p(\theta \mid z_k)\,dz_k\Big\}.$$
However, since the posterior $p(\theta \mid z_k)$ depends on the prior, this only provides an implicit solution for the prior which maximises $I\{e(k), p(\theta)\}$. For large values of $k$, an approximation, $p^*(\theta \mid z_k)$ say, may be found to the posterior distribution of $\theta$ which is independent of the prior $p(\theta)$. It follows that, under suitable regularity conditions, the sequence of positive functions
$$f_k(\theta) = \exp\Big\{\int_Z p(z_k \mid \theta)\,\log p^*(\theta \mid z_k)\,dz_k\Big\}$$
will induce, by formal use of Bayes' theorem, a sequence of posterior distributions, $\pi_k(\theta \mid x) \propto p(x \mid \theta)\,f_k(\theta)$, with the same limiting distributions that would have been obtained from the sequence of posteriors derived from the sequence of priors $\pi_k(\theta)$ which maximise $I\{e(k), p(\theta)\}$. This completes our motivation for Definition 5.7. For further information, see Bernardo (1979b) and the ensuing discussion.
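The function $f_k(\theta)$ can be evaluated numerically in simple cases. The following sketch (not from the book) does so for Bernoulli trials, using the asymptotic normal posterior $N(\theta \mid \hat\theta, \hat\theta(1-\hat\theta)/k)$, $\hat\theta = s/k$, as $p^*$; for large $k$, the ratio $f_k(\theta)/f_k(\theta')$ should approach the corresponding ratio of the Jeffreys kernel $\theta^{-1/2}(1-\theta)^{-1/2}$, the known reference prior in this model. The values $k = 400$, $\theta = 0.2, 0.5$ are arbitrary choices for the check.

```python
import math

def log_fk(theta, k):
    # log f_k(theta) = E_{z_k | theta}[ log p*(theta | z_k) ], summing over the
    # binomial distribution of the success count s (s = 0 and s = k carry
    # negligible mass for these theta values and are skipped to avoid log(0)).
    total = 0.0
    for s in range(1, k):
        p_s = math.comb(k, s) * theta**s * (1.0 - theta)**(k - s)
        t_hat = s / k
        var = t_hat * (1.0 - t_hat) / k   # asymptotic posterior variance
        log_post = (-0.5 * math.log(2.0 * math.pi * var)
                    - (theta - t_hat)**2 / (2.0 * var))
        total += p_s * log_post
    return total

k = 400
ratio = math.exp(log_fk(0.2, k) - log_fk(0.5, k))
jeffreys_ratio = math.sqrt((0.5 * 0.5) / (0.2 * 0.8))   # about 1.25
print(ratio, jeffreys_ratio)
```

The agreement improves as $k$ grows, illustrating how the $f_k$ sequence recovers the Jeffreys form in this one-parameter case.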
rather than just pointwise convergence. It should be clear from the argument which motivates the definition that any asymptotic approximation to the posterior distribution may be used in place of the asymptotic approximation y'(8 I a )defined above. lr(9 I x . 10) the ck(x)s are the required normalising constants und.. relutive to the e.(@.7. The positive functions IT( 0) are merely pragmatically convenient tools for the derivation of reference posterior distributions via Baycs' theorcm. see Berger and Bernardo (1992~). An explicit form for the reference prior is immediately available from Definition 5. Any positiveefunction ~ ( 9such that. for details. Although most of the following discussion refers to reference priors.fi)r almost ull x. I . und dejine Zn. 9 E 0 C 92. 5 Inference Let x be the result of un experiment c which consists of one observation from p ( x 19). ) where Tk(0 I x) = Q(z)p(" f. x).306 Definition 5. and it will be clear from later illustrative examples that the forms which arise may have no direct probabilistic interpretation.xk} he the result oj'k .)d%L. We should stress that the dejinitions and "propositions '' in this section ore by and l u g e heuristic in the sense that they are lacking statements of the technical conditions which would make the theory rigorous. ) will be called a reference prior for 6.rperiment P .of ~ k ( 6 I.7. independent replications off: . it must be stressed that only reference posterior distributions are directly interpretable in probabilistic terms.fi)r some c(x)> 0 und for ull B E 0.x E X . Making the statements and .assuming this limit to exist. where The reference posterior density of9 u8er x has been observed i dejined to be s the logdivergence litnit.. let zk = {XI. . (Onedimensional reference distributions). the natural convergence in this context. The use of convergence in the information sense. is necessary to avoid possibly pathological behaviour.
(Explicitform of the reference prior). Then. k asymptotic approximation to the posterior distribution of 0.. .5. Note that. 1992c)and Berger et al. however. which compared with other proposalsdiscussed in Section 5.O.19. and hence n(e I z)= kcc nk(@ Iz). Proposition 5.18. . by where c > 0. the reader would be best advised to view the procedure as an “algorithm. a If the parameter space is finite. .hi where c(z) is the required normulising constant.6.. at the time of writing.z }a random samplefrom p ( z I 0).. Proposition 5. any f function of the form n(B.a > 0. the limits above will not depend on the particular asymptotic approximation to the posterior distribution used to derive fi((e). provided the limit exists. 1992a. (Reference prior in the finite case). . So far as the contents of this section are concerned. . . A reference prior for t9 relative to the experiment which consists of one observutionfrom p ( z 1 O ) . with zk = { q . ~ k ( Iz) x p ( z I H ) f . under suitable regularity conditions.4 Reference Analysis 307 proofs precise. Let x be the result o one observation from p ( z 1 8 ) .M . z E X. where B E 0 = { 61. . it turns out that the reference prior is uniform. is and convergence in the information sense is verijied. ( @ ) lim 0 a required. 1992b. . i = 1.2appears to produce appealing solutions in all situations thus far examined..is a reference prior and the reference posterior is n(e. would require a different level of mathematics from that used in this book and.Bo E (3. is still an active area of research. Using n ( e )as a formal prior. independently of the experiment performed. and p ” ( 0I x k ) is an . . given.)= a.I 2) = c ( z ) p ( z1 e.6 E 0 C 8.. The reader interested in the technicalities involved is referred to Berger and Bemardo (1989. Proof. i = 1.}. (1989). . .. .).
Proof. We have already established (Proposition 5.12) that if Θ is finite then, for any strictly positive prior, p(θ_i | z_k) will converge to one as k → ∞ if θ_i is the true value of θ, and to zero otherwise. It follows that the integral in the exponent of f_k(θ_i),

  ∫ p(z_k | θ_i) log p*(θ_i | z_k) dz_k,

will converge to zero as k → ∞, so that f_k(θ_i) converges to one for each i. The general form of reference prior now follows immediately from Proposition 5.18. ∎

The preceding result for the case of a finite parameter space is easily derived from first principles. Indeed, in this case the expected missing information is finite and equals the entropy of the prior; this is maximised if and only if the prior is uniform.

The technique encapsulated in Definition 5.7 for identifying the reference prior depends on the asymptotic behaviour of the posterior for the parameter of interest under (imagined) replications of the experiment to be actually performed. Thus far, our derivations have proceeded on the basis of an assumed single observation from a parametric model, p(x | θ). The next proposition establishes that, for experiments e_n involving a sequence of n ≥ 1 observations x_1, …, x_n, which are to be modelled as if they are a random sample conditional on a parametric model p(x | θ), the reference prior does not depend on the size of the experiment and can thus be derived on the basis of a single observation experiment. Note, however, that for experiments involving more structured designs (for example, in linear models) the situation is much more complicated.

Proposition 5.20. (Independence of sample size). Let e_n, n ≥ 1, be the experiment which consists of the observation of a random sample x_1, …, x_n from p(x | θ), x ∈ X, θ ∈ Θ, and let P_n denote the class of reference priors for θ with respect to e_n, derived by considering the sample to be a single observation from ∏_i p(x_i | θ). Then P_n = P_1, for all n.

Proof. P_n consists of priors π(θ) of the form

  π(θ) = c lim_{k→∞} f_k⁽ⁿ⁾(θ) / f_k⁽ⁿ⁾(θ_0),  c > 0, θ_0 ∈ Θ,

where f_k⁽ⁿ⁾(θ) is derived, as in Definition 5.7, from z_{nk}, the result of a k-fold replicate of e_n. But z_{nk} can equally be considered as an nk-fold independent replicate of e_1, and so the limiting ratios are clearly identical. ∎

In considering experiments involving random samples from distributions admitting a sufficient statistic of fixed dimension, it is natural to wonder whether the reference priors derived from the distribution of the sufficient statistic are identical to those derived from the joint distribution for the sample. The next proposition guarantees us that this is indeed the case.

Proposition 5.21. (Compatibility with sufficient statistics). Let e_n, n ≥ 1, be the experiment which consists of the observation of a random sample x_1, …, x_n from p(x | θ), x ∈ X, θ ∈ Θ, which admits a sufficient statistic t_n = t(x_1, …, x_n). Then, for any n, the classes of reference priors derived by considering replications of (x_1, …, x_n) and of t_n, respectively, coincide, and are identical to the class obtained by considering replications of e_1.

Proof. If z_k = {x_1, …, x_{kn}}, considered as the result of a k-fold independent replicate of e_n, and y_k denotes the corresponding k-fold replicate of t_n, then, by the definition of a sufficient statistic, for any prior p(θ), p(θ | z_k) = p(θ | y_k). It follows that the corresponding asymptotic distributions are identical, so that p*(θ | z_k) = p*(θ | y_k). We thus have
  f_k(θ) = exp{ ∫ p(z_k | θ) log p*(θ | z_k) dz_k } = exp{ ∫ p(y_k | θ) log p*(θ | y_k) dy_k },

respectively, so that, by Proposition 5.18, the two classes of reference priors coincide. Identity with the class derived from e_1 follows from Proposition 5.20. ∎

Given a parametric model, p(x | θ), x ∈ X, θ ∈ Θ, we could, of course, reparametrise and work instead with p(x | φ), x ∈ X, φ ∈ Φ, where φ = g(θ) and g : Θ → Φ is a one-to-one mapping. The question now arises as to whether reference priors for θ and φ, derived from the parametric models p(x | θ) and p(x | φ), respectively, are consistent, in the sense that their ratio is the required Jacobian element. The next proposition establishes this form of consistency, and can clearly be extended to mappings which are piecewise monotone.

Proposition 5.22. (Invariance under one-to-one transformations). Suppose that π_θ(θ) and π_φ(φ) are reference priors derived by considering replications of experiments consisting of a single observation from p(x | θ), x ∈ X, θ ∈ Θ, and from p(x | φ), x ∈ X, φ ∈ Φ, respectively, where φ = g(θ) and g : Θ → Φ is a one-to-one monotone mapping. Then, for some c > 0 and for all φ ∈ Φ:
(i) π_φ(φ) = c π_θ(g⁻¹(φ)), if Θ is discrete;
(ii) π_φ(φ) = c π_θ(g⁻¹(φ)) |J_g(φ)|, if J_g(φ) = dg⁻¹(φ)/dφ exists.

Proof. If Θ is discrete, so is Φ, and the result follows from Proposition 5.19. Otherwise, for any proper prior p(θ) the corresponding prior for φ is given by p_φ(φ) = p_θ(g⁻¹(φ)) |J_g(φ)|, and hence, if z_k denotes a k-fold replicate of a single observation from p(x | θ), the asymptotic posterior approximations are related by the same Jacobian element,

  p*_φ(φ | z_k) = p*_θ(g⁻¹(φ) | z_k) |J_g(φ)|.

It follows that the corresponding functions f_k are related by f_k^φ(φ) = f_k^θ(g⁻¹(φ)) |J_g(φ)|, and the result follows from Proposition 5.18. ∎

The assumed existence of the asymptotic posterior distributions that would result from an imagined k-fold replicate of the experiment under consideration clearly plays a key role in the derivation of the reference prior. However, it is important to note that no assumption has thus far been required concerning the form of this asymptotic posterior distribution. As we shall see later, we shall typically consider the case of asymptotic posterior normality, but the following example shows that the technique is by no means restricted to this case.
Example 5.12. (Uniform model). Let e_n be the experiment which consists of observing a random sample x_1, …, x_n from a uniform distribution on [θ − ½, θ + ½], θ ∈ ℜ, n ≥ 1. If z_k = {x_1, …, x_{kn}} is the result of a k-fold replicate of e_n, then

  t_k = (x_min, x_max),  x_min = min{x_1, …, x_{kn}},  x_max = max{x_1, …, x_{kn}},

is a sufficient statistic for θ, and x_max − ½ ≤ θ ≤ x_min + ½. A k-fold replicate of e_n with a uniform prior will result in the posterior uniform distribution

  p*(θ | t_k) = U(θ | x_max − ½, x_min + ½),

a uniform distribution over the set of θ values which remain possible after z_k has been observed. Noting that the distributions of u = x_max − θ + ½ and v = x_min − θ + ½ are Be(u | kn, 1) and Be(v | 1, kn), respectively, the expectation in the exponent of f_k(θ), taken with respect to the distribution of t_k, is, for large k, well approximated by log{(kn + 1)/2}. It follows that f_k(θ) = (kn + 1)/2, which does not depend on θ, and hence that

  π(θ) = c lim_{k→∞} {(kn + 1)/2} / {(kn + 1)/2} = c.

Any reference prior for this problem is therefore a constant and, therefore, given a set of actual data z = (x_1, …, x_n), the reference posterior distribution is

  π(θ | z) ∝ p(z | θ) π(θ) ∝ c,  x_(n) − ½ ≤ θ ≤ x_(1) + ½,

where x_(1) = min{x_1, …, x_n} and x_(n) = max{x_1, …, x_n}; that is, π(θ | z) = U(θ | x_(n) − ½, x_(1) + ½), a uniform distribution over the set of θ values which remain possible after z has been observed.
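As a quick numerical illustration of this example (a sketch of our own, not from the original text; the function name and sample values are arbitrary), the support of the uniform reference posterior can be computed directly from the data:

```python
def uniform_reference_posterior(data):
    """Support of the reference posterior for the location theta of a
    U[theta - 1/2, theta + 1/2] sample: since the reference prior is
    constant, the posterior is uniform on [max(data) - 1/2, min(data) + 1/2]."""
    lo, hi = max(data) - 0.5, min(data) + 0.5
    if lo > hi:
        raise ValueError("data incompatible with the uniform location model")
    return lo, hi

# A sample compatible with theta = 3.0
sample = [2.8, 3.1, 3.4, 2.9]
lo, hi = uniform_reference_posterior(sample)  # (2.9, 3.3)
```

Note how the posterior support shrinks as the sample range grows, exactly as the analysis above predicts.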
Typically, the reference prior can easily be identified from the form of the asymptotic posterior distribution, through an asymptotically sufficient, consistent estimate of θ, a device which is made precise in the next proposition.

Proposition 5.23. (Explicit form of the reference prior when there is a consistent, asymptotically sufficient, estimator). Let e be the experiment which consists of the observation of a random sample x = {x_1, …, x_n} from p(x | θ), x ∈ X, θ ∈ Θ ⊆ ℜ, and let z_k be the result of a k-fold replicate of e. If there exists θ̂_k = θ̂_k(z_k) such that, with probability one, θ̂_k → θ (consistency) and, as k → ∞, p*(θ | z_k) = p*(θ | θ̂_k) (asymptotic sufficiency), then, under suitable regularity conditions, for any c > 0 and θ_0 ∈ Θ, reference priors are defined by

  π(θ) = c lim_{k→∞} f_k(θ) / f_k(θ_0),  f_k(θ) = p*(θ | θ̂_k) evaluated at θ̂_k = θ.

Proof. Since θ̂_k is asymptotically sufficient,

  ∫ p(z_k | θ) log p*(θ | z_k) dz_k = ∫ p(θ̂_k | θ) log p*(θ | θ̂_k) dθ̂_k,

and, since θ̂_k is consistent, for large k the distribution p(θ̂_k | θ) concentrates on θ̂_k = θ, so that the integral converges to log p*(θ | θ̂_k) evaluated at θ̂_k = θ. The result now follows from Proposition 5.18. ∎
Example 5.13. (Deviation from uniformity model). Let e_n be the experiment which consists of obtaining a random sample from

  p(x | θ) = θ(2x)^(θ−1),  0 < x ≤ ½,
  p(x | θ) = θ(2(1 − x))^(θ−1),  ½ < x < 1,

with θ > 0. This defines a one-parameter probability model on [0, 1] which finds application (see Bernardo and Bayarri, 1985) in exploring deviations from the standard uniform model on [0, 1] (given by θ = 1). It is easily verified that if z_k = {x_1, …, x_{kn}} results from a k-fold replicate of e_n, then t_k = t_k(z_k), the mean of the quantities −log{2 min(x_i, 1 − x_i)}, is a sufficient statistic, with

  p(z_k | θ) ∝ θ^{kn} exp{−kn(θ − 1) t_k},

and that p(t_k | θ) = Ga(t_k | kn, knθ), so that E[t_k | θ] = 1/θ. It follows that θ̂_k = t_k⁻¹ provides a consistent estimate of θ and that

  p*(θ | z_k) = Ga(θ | kn + 1, kn t_k)

provides an asymptotic posterior approximation which satisfies the conditions required in Proposition 5.23. From the form of the right-hand side, evaluated at t_k = 1/θ, we see that, for some c > 0, π(θ) = c θ⁻¹ is a reference prior. The reference posterior for θ, having observed actual data z = (x_1, …, x_n) producing the sufficient statistic t = t(z), is therefore

  π(θ | z) ∝ p(z | θ) π(θ) ∝ θ^{n−1} exp{−n(θ − 1) t},

which is a Ga(θ | n, nt) distribution.
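A small Monte Carlo sketch (our own illustration; names and sample sizes are arbitrary) can be used to check this numerically. Under this model −log 2m, with m = min(x, 1 − x), is exponentially distributed with rate θ, which gives a direct way to simulate data; the mean of the Ga(θ | n, nt) reference posterior, 1/t, should then recover the simulating value of θ:

```python
import math
import random

def folded_stat(xs):
    """Sufficient statistic t = -(1/n) sum log(2 m_i), with m_i = min(x_i, 1 - x_i)."""
    return -sum(math.log(2.0 * min(x, 1.0 - x)) for x in xs) / len(xs)

def sample_model(theta, n, rng):
    """Draw from p(x | theta): -log(2m) ~ Exp(theta); the side of 1/2 is a fair coin."""
    xs = []
    for _ in range(n):
        m = 0.5 * math.exp(-rng.expovariate(theta))
        xs.append(m if rng.random() < 0.5 else 1.0 - m)
    return xs

rng = random.Random(1)
theta_true = 2.5
xs = sample_model(theta_true, 5000, rng)
t = folded_stat(xs)
post_mean = 1.0 / t  # mean of the Ga(theta | n, n t) reference posterior
```

With 5000 draws the posterior mean lands close to the simulating value 2.5.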
Under regularity conditions similar to those described in Section 5.2.3, the asymptotic posterior distribution of θ tends to normality. In such cases, we can obtain a characterisation of the reference prior directly in terms of the parametric model in which θ appears.

Proposition 5.24. (Reference priors under asymptotic normality). Let e_n be the experiment which consists of the observation of a random sample x_1, …, x_n from p(x | θ), x ∈ X, θ ∈ Θ ⊆ ℜ. If the asymptotic posterior distribution of θ, given a k-fold replicate of e_n, is normal with precision kn h(θ̂_{kn}), where θ̂_{kn} is a consistent estimate of θ, then, for some c > 0, reference priors have the form

  π(θ) = c {h(θ)}^{1/2}.

Proof. Under the stated conditions, an asymptotic approximation to the posterior distribution of θ, given a k-fold replicate of e_n, is

  p*(θ | z_k) = N(θ | θ̂_{kn}, kn h(θ̂_{kn})),

where θ̂_{kn} is some consistent estimator of θ. Thus, by Proposition 5.23,

  f_k(θ) = p*(θ | θ̂_{kn}) evaluated at θ̂_{kn} = θ, which is proportional to {h(θ)}^{1/2},

and therefore π(θ) ∝ {h(θ)}^{1/2}, as required. ∎

The result of Proposition 5.24 is closely related to the "rules" proposed by Jeffreys (1946, 1939/1961) and by Perks (1947) to derive "non-informative" priors. Typically, under the conditions where asymptotic posterior normality obtains, we find that

  h(θ) = ∫ p(x | θ) ( −∂² log p(x | θ) / ∂θ² ) dx,

which is Fisher's information (Fisher, 1925), so that the reference prior, π(θ) ∝ {h(θ)}^{1/2}, becomes Jeffreys' (or Perks') prior. See Polson (1992) for a related derivation. It should be noted, however, that even under conditions which guarantee asymptotic normality, Jeffreys' formula is not necessarily the easiest way of deriving a reference prior; as illustrated in Examples 5.12 and 5.13 above, it is often simpler to apply Proposition 5.18 using an asymptotic approximation to the posterior distribution.

Example 5.14. (Binomial and negative binomial models). Consider an experiment which consists of the observation of n Bernoulli trials, with n fixed in advance, so that z = {x_1, …, x_n},

  p(x | θ) = θˣ(1 − θ)^{1−x},  x ∈ {0, 1},  0 ≤ θ ≤ 1,

for which, by Proposition 5.24, h(θ) = θ⁻¹(1 − θ)⁻¹, and hence the reference prior is π(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2}. If r = Σ x_i, the reference posterior,

  π(θ | z) ∝ p(z | θ) π(θ) ∝ θ^{r−1/2}(1 − θ)^{n−r−1/2},

is the beta distribution Be(θ | r + ½, n − r + ½). Note that π(θ | z) is proper, whatever the number of successes r. In particular, if r = 0, π(θ | z) = Be(θ | ½, n + ½), from which sensible inference summaries can be made, even though there are no observed successes. (Compare this with the Haldane (1948) prior, π(θ) ∝ θ⁻¹(1 − θ)⁻¹, which produces an improper posterior until at least one success is observed.)

It is important to stress that reference distributions are, by definition, a function of the entire probability model p(x | θ), x ∈ X, θ ∈ Θ, not only of the observed likelihood. Technically, this is a consequence of the fact that the amount of information which an experiment may be expected to provide is the value of an integral over the entire sample space X. We have already encountered, earlier in this chapter, the idea that knowledge of the data generating mechanism may influence the prior specification.

Consider now, instead, an experiment which consists of counting the number x of Bernoulli trials which it is necessary to perform in order to observe a prespecified number of successes, r ≥ 1. The probability model for this situation is the negative binomial,

  p(x | θ) = C(x − 1, r − 1) θʳ(1 − θ)^{x−r},  x = r, r + 1, …,

from which we obtain
  h(θ) = r θ⁻²(1 − θ)⁻¹,

and hence, by Proposition 5.24, the reference prior is π(θ) ∝ θ⁻¹(1 − θ)^{−1/2}. The reference posterior is then given by

  π(θ | x) ∝ p(x | θ) π(θ) ∝ θ^{r−1}(1 − θ)^{x−r−1/2},

which is the beta distribution Be(θ | r, x − r + ½). Again, we note that this distribution is proper, whatever the number of observations x required to obtain r successes. Note that r = 0 is not possible under this model: the use of an inverse binomial sampling design implicitly assumes that r successes will eventually occur for sure, which is not true in direct binomial sampling. This difference in the underlying assumption about θ is duly reflected in the slight difference which occurs between the respective reference prior distributions. See Geisser (1984) and the ensuing discussion for further analysis and discussion of this canonical example.

In reporting results, scientists are typically required to specify not only the data but also the conditions under which the data were obtained (the design of the experiment), so that the data analyst has available the full specification of the probability model p(x | θ), x ∈ X, θ ∈ Θ. In order to carry out the reference analysis described in this section, such a full specification is clearly required.
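The contrast between the two sampling designs can be made concrete with a short sketch (our own; the numbers are illustrative). Both reference posteriors are Beta distributions, and both remain proper in the boundary cases discussed above:

```python
def beta_mean(a, b):
    """Mean of a Be(theta | a, b) distribution."""
    return a / (a + b)

# Binomial sampling (n fixed): reference posterior Be(r + 1/2, n - r + 1/2).
n, r = 10, 0                                 # ten trials, no successes
mean_bin = beta_mean(r + 0.5, n - r + 0.5)   # proper even with r = 0

# Negative binomial sampling (r fixed, x trials needed):
# reference posterior Be(r, x - r + 1/2).
r_nb, x = 3, 10                              # three successes after ten trials
mean_nb = beta_mean(r_nb, x - r_nb + 0.5)
```

The same observed counts lead to slightly different posteriors under the two designs, mirroring the slight difference between the two reference priors.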
We want to stress, however, that the preceding argument is totally compatible with a full personalistic view of probability. A reference prior is nothing but a (limiting) form of rather specific beliefs: namely, those which maximise the missing information which a particular experiment could possibly be expected to provide. Consequently, different experiments generally define different types of limiting beliefs. Reference analysis provides an answer to an important "what if?" question: namely, what can be said about the parameter of interest if prior information were minimal relative to the maximum information which a well-defined, specific experiment could be expected to provide? To report the corresponding reference posteriors (possibly for a range of possible alternative models) is only part of the general prior-to-posterior mapping which interpersonal or sensitivity considerations would suggest should always be carried out.

5.4.3 Restricted Reference Distributions

When analysing the inferential implications of the result of an experiment for a quantity of interest, θ, where, for simplicity, we continue to assume that θ ∈ Θ ⊆ ℜ, it is often interesting, either per se, or on a "what if?" basis, to condition on some assumed features of the prior distribution p(θ), thus defining a restricted class, Q, of priors which consists of those distributions compatible with such conditioning. The concept of a reference posterior may easily be extended to this situation by maximising the missing information which the experiment may possibly be expected to provide within this restricted class of priors.
Repeating the argument which motivated the definition of (unrestricted) reference distributions, we are led to seek the limit of the sequence of posterior distributions, π_k(θ | x), which correspond to the sequence of priors, π_k(θ), obtained by maximising, within Q, the amount of information which could be expected from k independent replications z_k = {x_1, …, x_k} of the single observation experiment.

Definition 5.8. (Restricted reference distributions). Let x be the result of an experiment e which consists of one observation from p(x | θ), x ∈ X, θ ∈ Θ ⊆ ℜ, let z_k = {x_1, …, x_k} be the result of k independent replications of e, and let Q be a subclass of the class of all prior distributions for θ. Define

  π_k(θ | x) ∝ p(x | θ) π_k^Q(θ),

where π_k^Q(θ) is a prior which maximises, within Q, the amount of information which could be expected from z_k. Provided it exists, the Q-reference posterior distribution of θ, after x has been observed, is defined to be π^Q(θ | x) such that

  δ{ π_k(· | x), π^Q(· | x) } → 0 as k → ∞,

where δ is the logarithmic divergence introduced earlier. A positive function π^Q(θ) in Q such that

  π^Q(θ | x) ∝ p(x | θ) π^Q(θ), for almost all x,

is then called a Q-reference prior for θ relative to the experiment e.
The intuitive content of Definition 5.8 is illuminated by the following result, which essentially establishes that the Q-reference prior is the closest prior in Q to the unrestricted reference prior π(θ), in the sense of minimising its logarithmic divergence from π(θ).

Proposition 5.25. (The restricted reference prior as an approximation). Suppose that an unrestricted reference prior π(θ) relative to a given experiment e is proper. Then a Q-reference prior π^Q(θ) relative to e, if it exists, is the prior which minimises, within Q, the logarithmic divergence

  ∫ p(θ) log{ p(θ) / π(θ) } dθ.

Proof. It follows from Proposition 5.18 that π(θ) is proper if and only if ∫ f_k(θ) dθ < ∞, in which case the normalised form of f_k(θ) converges to π(θ). Moreover, the amount of information which could be expected from the k replications may be written as a constant minus ∫ p(θ) log{ p(θ)/π_k(θ) } dθ, where π_k(θ) ∝ f_k(θ); this is maximised if the integral is minimised. Let π_k^Q(θ) be the prior which minimises the integral within Q. Then, by Definition 5.8 and the continuity of the divergence functional, taking k → ∞, π^Q(θ) minimises, within Q, ∫ p(θ) log{ p(θ)/π(θ) } dθ. ∎

If π(θ) is not proper, it is necessary to apply Definition 5.8 directly in order to characterise π^Q(θ). The following result provides an explicit solution for the rather large class of problems where the conditions which define Q may be expressed as a collection of expected value restrictions.

Proposition 5.26. (Explicit form of restricted reference priors). Let e be an experiment which provides information about θ and, for given {(g_i(·), β_i), i = 1, …, m}, let Q be the class of prior distributions p(θ) of θ which satisfy

  ∫ g_i(θ) p(θ) dθ = β_i,  i = 1, …, m.

Let π(θ) be an unrestricted reference prior for θ relative to e; then, a Q-reference prior of θ relative to e, if it exists, is of the form

  π^Q(θ) ∝ π(θ) exp{ Σ_i λ_i g_i(θ) },

where the λ_i are constants determined by the conditions which define Q.

Proof. The calculus of variations argument which underlay the derivation of reference priors may be extended to include the additional restrictions imposed by the definition of Q, thus leading us to seek an extremal of the functional, corresponding to the assumption of a k-fold replicate of e,

  ∫ p(θ) log{ f_k(θ) / p(θ) } dθ + Σ_i λ_i { ∫ g_i(θ) p(θ) dθ − β_i }.

A standard argument now shows that the solution must satisfy

  p(θ) ∝ f_k(θ) exp{ Σ_i λ_i g_i(θ) },

and hence, taking k → ∞, the result follows from Proposition 5.18. ∎
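As a numerical illustration of Proposition 5.26 (a sketch of our own, not from the original text), take a uniform unrestricted reference prior on a fine grid and impose the two moment constraints E[θ] = μ₀ and E[θ²] = σ₀² + μ₀². The restricted prior then has the form exp{λ₁θ + λ₂θ²}; solving for the multipliers by Newton's method recovers a normal shape, with λ₁ ≈ μ₀/σ₀² and λ₂ ≈ −1/(2σ₀²):

```python
import math

def restricted_reference_prior(grid, mu0, var0, iters=100):
    """Solve for p(theta) ∝ exp(l1*theta + l2*theta^2) on a grid, matching
    E[theta] = mu0 and E[theta^2] = var0 + mu0^2 (Proposition 5.26 with a
    uniform unrestricted reference prior).  Newton's method on (l1, l2);
    the Jacobian of the moment map is the covariance matrix of (theta, theta^2)."""
    t1, t2 = mu0, var0 + mu0 * mu0
    l1, l2 = 0.0, -0.5 / var0                  # start near the expected solution
    for _ in range(iters):
        w = [math.exp(l1 * t + l2 * t * t) for t in grid]
        z = sum(w)
        p = [wi / z for wi in w]
        m = [sum(pi * t ** j for pi, t in zip(p, grid)) for j in (1, 2, 3, 4)]
        r1, r2 = m[0] - t1, m[1] - t2
        if abs(r1) < 1e-12 and abs(r2) < 1e-12:
            break
        a = m[1] - m[0] * m[0]                 # var(theta)
        b = m[2] - m[0] * m[1]                 # cov(theta, theta^2)
        d = m[3] - m[1] * m[1]                 # var(theta^2)
        det = a * d - b * b
        l1 -= (d * r1 - b * r2) / det
        l2 -= (a * r2 - b * r1) / det
    return l1, l2, p

grid = [-6.0 + 12.0 * i / 2000 for i in range(2001)]
l1, l2, p = restricted_reference_prior(grid, mu0=0.5, var0=1.0)
# Gaussian shape: l1 ≈ mu0/var0 = 0.5 and l2 ≈ -1/(2*var0) = -0.5
```

This is exactly the mechanism behind the location-model example that follows, where the restricted reference prior turns out to be normal with the specified mean and variance.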
Example 5.15. (Location models). Let z = {x_1, …, x_n} be a random sample from a location model p(x | θ) = h(x − θ), x ∈ ℜ, θ ∈ ℜ, and suppose that the prior mean and variance of θ are restricted to be

  E[θ] = μ₀,  V[θ] = σ₀².

Under suitable regularity conditions, the asymptotic posterior distribution of θ will be of the form p*(θ | x_1, …, x_n) ∝ f(θ̂_n − θ), where θ̂_n is an asymptotically sufficient, consistent estimator of θ, so that, by Proposition 5.23, the unrestricted reference prior will be uniform. It now follows from Proposition 5.26 that the restricted reference prior will be of the form

  π^Q(θ) ∝ exp{ λ₁θ + λ₂(θ − μ₀)² },

with ∫ θ π^Q(θ) dθ = μ₀ and ∫ (θ − μ₀)² π^Q(θ) dθ = σ₀². Thus, the restricted reference prior is the normal distribution with the specified mean and variance, π^Q(θ) = N(θ | μ₀, σ₀⁻²).

5.4.4 Nuisance Parameters

The development given thus far has assumed that θ was one-dimensional and that interest was centred on θ, or on a one-to-one transformation of θ. We shall next consider the case where θ is two-dimensional and interest centres on reporting inferences for a one-dimensional function, φ = φ(θ). Without loss of generality, we may rewrite the vector parameter in the form θ = (φ, λ), φ ∈ Φ, λ ∈ Λ, where φ is the parameter of interest and λ is a nuisance parameter. The problem is to identify a reference prior for θ when the decision problem is that of reporting marginal inferences for φ, assuming a logarithmic score (utility) function.

To motivate our approach to this problem, suppose, for the moment, that a suitable reference form, π(λ | φ), for p(λ | φ) has been specified, and that only π(φ) remains to be identified. Recalling that p(θ) = p(φ, λ) can be thought of in terms of the decomposition p(φ, λ) = p(λ | φ) p(φ), consider z_k to be the result of a k-fold replicate of the experiment which consists in obtaining a single observation x from p(x | θ) = p(x | φ, λ).
The marginal posterior for φ is determined by the integrated model

  p(z_k | φ) = ∫ p(z_k | φ, λ) π(λ | φ) dλ,

which plays a key role in the derivation: Proposition 5.18, applied to p(z_k | φ), implies that the "marginal reference prior" for φ is given by

  π(φ) ∝ exp{ ∫ p(z_k | φ) log p*(φ | z_k) dz_k },  for large k,

where p*(φ | z_k) is an asymptotic approximation to the marginal posterior for φ. Similarly, by conditioning throughout on φ, we see from Proposition 5.18 that the "conditional reference prior" for λ given φ has the form

  π(λ | φ) ∝ exp{ ∫ p(z_k | φ, λ) log p*(λ | φ, z_k) dz_k },  for large k,

where p*(λ | φ, z_k) is an asymptotic approximation to the conditional posterior for λ given φ. Given actual data z, the marginal reference posterior for φ, corresponding to the reference prior

  π(θ) = π(φ, λ) = π(λ | φ) π(φ)

derived from the above procedure, would then be

  π(φ | z) ∝ ∫ p(z | φ, λ) π(λ | φ) π(φ) dλ.

This would appear to provide a straightforward approach to deriving reference analysis procedures in the presence of nuisance parameters. However, there is a major difficulty: as we have already seen, reference priors are typically not proper probability densities. This means that the integrated form p(z_k | φ), derived from π(λ | φ), will typically not be a proper probability model, and the above approach will fail in such cases. In general, then, a more subtle approach is required to overcome this technical problem. However, before turning to the details of such an approach, we present an example, involving finite parameter ranges, where the approach outlined above does produce an interesting solution.

Example 5.16. (Induction). Consider a large, finite dichotomised population, all of whose elements individually may or may not have a specified property. A random sample is taken without replacement from the population, the sample being large in absolute size, but still relatively small compared with the population size. All the elements sampled turn out to have the specified property. Many commentators have argued that, in view of the large absolute size of the sample, one should be led to believe quite strongly that all elements of the population have the property, irrespective of the fact that the population size is greater still, an argument related to Laplace's rule of succession. (See, for example, Wrinch and Jeffreys, 1921, Jeffreys, 1939/1961, pp. 128–132, and Geisser, 1980a.)

Let us denote the population size by N, the sample size by n, the observed number of elements having the property by x, and the actual number of elements in the population having the property by θ. The probability model for the sampling mechanism is then the hypergeometric, which, for possible values of x, has the form

  p(x | θ) = C(θ, x) C(N − θ, n − x) / C(N, n).

If p(θ = r), r = 0, 1, …, N, defines a prior distribution for θ, the posterior probability that θ = N, having observed x = n, is given by Bayes' theorem. Since the set of possible values for θ is finite, Proposition 5.19 implies that

  p(θ = r) = 1/(N + 1),  r = 0, 1, …, N,

is a reference prior. Straightforward calculation then establishes that

  p(θ = N | x = n) = (n + 1)/(N + 1),

which is not close to unity when n is large but n/N is small.

However, careful consideration of the problem suggests that it is not θ which is the parameter of interest: rather, it is the parameter

  φ = 1, if θ = N;  φ = 0, if θ ≠ N.

To obtain a representation of θ in the form (φ, λ), let us define the nuisance parameter

  λ = 1, if θ = N;  λ = θ, if θ ≠ N.
By Proposition 5.19, applied conditionally, the reference priors π(φ) and π(λ | φ) are both uniform over the appropriate ranges, and are given by

  π(φ = 0) = π(φ = 1) = ½,
  π(λ = 1 | φ = 1) = 1,
  π(λ = r | φ = 0) = 1/N,  r = 0, 1, …, N − 1.

These imply a reference prior for θ of the form

  p(θ = N) = ½,  p(θ = r) = 1/(2N),  r = 0, 1, …, N − 1,

and straightforward calculation establishes that

  p(θ = N | x = n) = { 1 + (N − n)/(N(n + 1)) }⁻¹ ≈ (n + 1)/(n + 2),

for large N, which clearly displays the irrelevance of the sampling fraction and the approach to unity for large n (see Bernardo, 1985b, for further discussion).

We return now to the general problem of defining a reference prior for θ = (φ, λ), φ ∈ Φ, λ ∈ Λ, where φ is the parameter of interest and λ is a nuisance parameter. We shall refer to the pair (φ, λ) as an ordered parametrisation of the model. We recall that the problem arises because, in order to obtain the marginal reference prior π(φ) for the first parameter, we need to work with the integrated model

  p(z_k | φ) = ∫ p(z_k | φ, λ) π(λ | φ) dλ.

However, this will only be a proper model if the conditional prior π(λ | φ) for the second parameter is a proper probability density and, typically, this will not be the case.

This suggests the following strategy: identify an increasing sequence {Λ_i} of subsets of Λ, which may depend on φ, such that ∪_i Λ_i = Λ and such that, on each Λ_i, the conditional reference prior π(λ | φ) restricted to Λ_i can be normalised to give a conditional reference prior which is proper. For each i, a proper integrated model can then be obtained and a marginal reference prior π_i(φ) identified. The required reference prior π(φ, λ) is then obtained by taking the limit as i → ∞. The strategy clearly requires a choice of the Λ_i's to be made, but in any specific problem a "natural" sequence usually suggests itself. We formalise this procedure in the next definition.
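The two posterior probabilities appearing in the induction example above can be checked by direct computation (a sketch of our own; the values of N and n are illustrative). The likelihood of observing n out of n sampled elements with the property, given θ = r, is C(r, n)/C(N, n):

```python
from math import comb

def prob_all(N, n, prior):
    """P(theta = N | all n sampled have the property), for prior[r] = p(theta = r)."""
    weights = [prior[r] * comb(r, n) for r in range(n, N + 1)]
    return weights[-1] / sum(weights)

N, n = 1000, 100

uniform = [1.0 / (N + 1)] * (N + 1)
# Reference prior for the ordered parametrisation (phi, lambda):
# mass 1/2 on theta = N, and 1/(2N) on each theta = 0, ..., N - 1.
reference = [1.0 / (2 * N)] * N + [0.5]

p_uniform = prob_all(N, n, uniform)      # = (n + 1)/(N + 1), far from 1
p_reference = prob_all(N, n, reference)  # ≈ (n + 1)/(n + 2), close to 1
```

The uniform prior on θ leaves the "all have the property" hypothesis with small posterior probability, while the reference prior for the ordered parametrisation drives it close to one, independently of the sampling fraction.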
Definition 5.9. (Reference distributions given a nuisance parameter). Let x be the result of an experiment e which consists of one observation from the probability model p(x | φ, λ), x ∈ X, (φ, λ) ∈ Φ × Λ ⊆ ℜ × ℜ. The reference prior for the ordered parametrisation (φ, λ), relative to the experiment e and to the increasing sequences of subsets of Λ, {Λ_i(φ)}, φ ∈ Φ, is defined to be the result of the following procedure:
(i) applying Definition 5.7 to the model p(x | φ, λ), for fixed φ, obtain the conditional reference prior, π(λ | φ), for the nuisance parameter;
(ii) for each φ, normalise π(λ | φ) within each Λ_i(φ) to obtain a sequence of proper priors, π_i(λ | φ);
(iii) use these to obtain a sequence of integrated models,

  p_i(x | φ) = ∫_{Λ_i(φ)} p(x | φ, λ) π_i(λ | φ) dλ;

(iv) use these to derive the sequence of reference priors, π_i(φ);
(v) define π(φ | x) such that, for almost all x,

  δ{ π_i(φ | x), π(φ | x) } → 0 as i → ∞,

where π_i(φ | x) ∝ p_i(x | φ) π_i(φ). The reference posterior for the parameter of interest φ is π(φ | x); the reference prior, relative to the ordered parametrisation (φ, λ), is any positive function π(φ, λ) such that

  π(φ | x) ∝ ∫ p(x | φ, λ) π(φ, λ) dλ, for almost all x.

This will typically be simply obtained as

  π(φ, λ) = lim_{i→∞} π_i(λ | φ) π_i(φ).

Ghosh and Mukerjee (1992) showed that, in effect, the reference prior thus defined maximises the missing information about the parameter of interest, φ, subject to the condition that, given φ, the missing information about the nuisance parameter, λ, is maximised.

In a model involving a parameter of interest and a nuisance parameter, the form chosen for the latter is, of course, arbitrary. Thus, p(x | φ, λ) can be written alternatively as p(x | φ, ψ), for any ψ = ψ(φ, λ) for which the transformation (φ, λ) → (φ, ψ) is one-to-one. Intuitively, we would hope that the reference posterior for φ derived according to Definition 5.9 would not depend on the particular form chosen for the nuisance parameters. The following proposition establishes that this is the case.

Proposition 5.27. (Invariance with respect to the choice of the nuisance parameter). Let e be an experiment which consists in obtaining one observation from p(x | φ, λ), (φ, λ) ∈ Φ × Λ ⊆ ℜ × ℜ, and let e′ be an experiment which consists in obtaining one observation from p(x | φ, ψ), (φ, ψ) ∈ Φ × Ψ ⊆ ℜ × ℜ, where, for each φ, ψ = g_φ(λ) is a one-to-one transformation. Then, given data x, the reference posteriors for φ, relative to [e, {Λ_i(φ)}] and [e′, {Ψ_i(φ)}], where Ψ_i(φ) = g_φ{Λ_i(φ)}, are identical.

Proof. By Proposition 5.22, for given φ, the conditional reference priors are related by

  π_ψ(ψ | φ) = π_λ( g_φ⁻¹(ψ) | φ ) | J_{g_φ}(ψ) |,

where J_{g_φ}(ψ) = ∂g_φ⁻¹(ψ)/∂ψ, and hence the normalised forms over Ψ_i(φ) and Λ_i(φ) are consistently related by the appropriate Jacobian element. It follows that, for the integrated models used in steps (iii) and (iv) of Definition 5.9,

  ∫_{Ψ_i(φ)} p(x | φ, ψ) π_i(ψ | φ) dψ = ∫_{Λ_i(φ)} p(x | φ, λ) π_i(λ | φ) dλ,

and hence that the procedure will lead to identical forms of π(φ | x). ∎
Alternatively, we may wish to consider retaining the same form of nuisance parameter, but redefining the parameter of interest to be a one-to-one function of φ. Intuitively, we would hope that the reference posterior for γ = g(φ) would be consistently related to that of φ by means of the appropriate Jacobian element. The next proposition establishes that this is the case.

Proposition 5.28. (Invariance under one-to-one transformations). Let e be an experiment which consists in obtaining one observation from p(x | φ, λ), (φ, λ) ∈ Φ × Λ ⊆ ℜ × ℜ, and let e′ be an experiment which consists in obtaining one observation from p(x | γ, λ), γ ∈ Γ, λ ∈ Λ, where γ = g(φ) and g is one-to-one. Then, given data x, the reference posteriors for φ and γ, relative to [e, {Λ_i(φ)}] and [e′, {Λ_i(γ)}], where Λ_i(γ) = Λ_i{g(φ)}, are related by:
(i) π_γ(γ | x) = π_φ( g⁻¹(γ) | x ), if Φ is discrete;
(ii) π_γ(γ | x) = π_φ( g⁻¹(γ) | x ) | J_g(γ) |, if J_g(γ) = ∂g⁻¹(γ)/∂γ exists.

Proof. For fixed γ, step (i) of Definition 5.9 clearly results in a conditional reference prior π(λ | γ) = π(λ | g⁻¹(γ)), so that steps (ii) and (iii) produce the same sequence of integrated models for γ as for φ = g⁻¹(γ). If Φ is discrete, step (iv) yields π_i(γ) = π_i(g⁻¹(γ)), by Proposition 5.22; otherwise, π_i(γ) = π_i(g⁻¹(γ)) | J_g(γ) |, again by Proposition 5.22, and the result follows straightforwardly. ∎

In Proposition 5.23, we saw that the identification of explicit forms of reference prior can be greatly simplified if the approximate asymptotic posterior distribution is of the form p*(θ | z_k) = p*(θ | θ̂_k), where θ̂_k is an asymptotically sufficient, consistent estimate of θ. In Proposition 5.24, we then established that even greater simplification results when the asymptotic distribution is normal. We shall now extend this to the nuisance parameter case.

Proposition 5.29. (Bivariate reference priors under asymptotic normality). Let e_n be the experiment which consists of the observation of a random sample x_1, …, x_n from p(x | φ, λ), (φ, λ) ∈ Φ × Λ ⊆ ℜ × ℜ, and let {Λ_i(φ)} be suitably defined sequences of subsets of Λ, as required by Definition 5.9. Suppose that the joint asymptotic posterior distribution of (φ, λ), given a k-fold replicate of e_n, is multivariate normal with precision matrix kn H(φ̂_k, λ̂_k), where (φ̂_k, λ̂_k) is a consistent estimate of (φ, λ), and
Given 4. with n. i = 1. the integrand has the form for large k .). A). (&.5..2.).4 Reference Analysis hi. derive the form of x t ( $ ) .24. Then 327 = . so that . Marginally. is the partition of H corresponding to &. where. {h22(4. we note that To if zk E 2 denotes the result of a kfold replication of e..2. Proof. j n(A 14) 1. where ho = (hll .).ik.(A 14) denoting the normalised version of x(A 14) over A l ( $ ) . where dA} .29 then follows from Proposition 5. the asymptotic distribution of $ is univariate normal with precision knh. 4 v 2 : deJinea reference prior relative to the ordered parametrisation ($. A. The first part of Proposition 5. ik. the asymptotic conditional distribution of A is normal with precision knhlZ($kn.h12hG1h21).
Corollary. Suppose that, under the conditions of Proposition 5.29, the asymptotic precisions factorise as

$$ h_\phi(\phi, \lambda) = f_1(\phi)\, g_1(\lambda), \qquad h_{22}(\phi, \lambda) = f_2(\phi)\, g_2(\lambda), $$

and suppose also that the subsets $\{\Lambda_i = \Lambda_i(\phi)\}$ do not depend on $\phi$. Then a reference prior relative to the ordered parametrisation $(\phi, \lambda)$ is defined by

$$ \pi(\phi, \lambda) \propto f_1^{1/2}(\phi)\, g_2^{1/2}(\lambda). $$

Proof. By Proposition 5.29, $\pi(\lambda \mid \phi) \propto g_2^{1/2}(\lambda)$, so that its normalised version over $\Lambda_i$ is $\pi_i(\lambda \mid \phi) = a_i\, g_2^{1/2}(\lambda)$, with $a_i^{-1} = \int_{\Lambda_i} g_2^{1/2}(\lambda)\, d\lambda$, which does not depend on $\phi$. It then follows that

$$ \pi(\phi) \propto \exp\left\{ \int_{\Lambda_i} a_i\, g_2^{1/2}(\lambda)\, \log\left[ f_1^{1/2}(\phi)\, g_1^{1/2}(\lambda) \right] d\lambda \right\} = f_1^{1/2}(\phi)\, e^{b_i}, $$

where $b_i = a_i \int_{\Lambda_i} g_2^{1/2}(\lambda) \log g_1^{1/2}(\lambda)\, d\lambda$ does not depend on $\phi$, and the result easily follows.

In many cases, the forms of $h_\phi(\phi, \lambda)$ and $h_{22}(\phi, \lambda)$ factorise into products of separate functions of $\phi$ and $\lambda$, and the subsets $\{\Lambda_i\}$ do not depend on $\phi$. In such cases, the reference prior takes on a very simple form.

Example 5.17. (Normal mean and standard deviation). Let $e_n$ be the experiment which consists in the observation of a random sample $x = \{x_1, \ldots, x_n\}$ from a normal distribution, with both mean, $\mu$, and standard deviation, $\sigma$, unknown. We shall first obtain a reference analysis for $\mu$, taking $\sigma$ to be the nuisance parameter. Since the distribution belongs to the exponential family, asymptotic normality obtains and the results of Proposition 5.29 can be applied. We therefore first obtain the Fisher (expected) information matrix, whose elements, we recall, are given by

$$ H(\mu, \sigma) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 2\sigma^{-2} \end{pmatrix}, $$

from which it is easily verified that the asymptotic precisions as functions of $\theta = (\mu, \sigma)$ are given by

$$ h_\mu(\mu, \sigma) = \sigma^{-2}, \qquad h_{22}(\mu, \sigma) = 2\sigma^{-2}, $$

so that, for example, $\Lambda_i = \{\sigma;\ e^{-i} \le \sigma \le e^{i}\}$, $i = 1, 2, \ldots$, provides a suitable sequence of subsets of $\Lambda = \Re^{+}$, not depending on $\mu$, over which $\pi(\sigma \mid \mu)$ can be normalised, and the corollary to Proposition 5.29 can be applied. It follows that

$$ \pi(\mu, \sigma) = \pi(\sigma \mid \mu)\,\pi(\mu) \propto \sigma^{-1} \times 1 $$

provides a reference prior relative to the ordered parametrisation $(\mu, \sigma)$. The corresponding reference posterior for $\mu$, given $x$, is

$$ \pi(\mu \mid x) \propto \int_0^\infty p(x \mid \mu, \sigma)\, \pi(\mu, \sigma)\, d\sigma \propto \int_0^\infty \sigma^{-(n+1)} \exp\left\{ -\frac{n}{2\sigma^2}\left[ s^2 + (\mu - \bar x)^2 \right] \right\} d\sigma \propto \left[ s^2 + (\mu - \bar x)^2 \right]^{-n/2}, $$

so that $\pi(\mu \mid x) = \mathrm{St}(\mu \mid \bar x,\ (n-1)s^{-2},\ n-1)$, where $ns^2 = \sum_i (x_i - \bar x)^2$.
If we now reverse the roles of $\mu$ and $\sigma$, so that the latter is the parameter of interest and $\mu$ is the nuisance parameter, we obtain

$$ h_\sigma(\sigma, \mu) = 2\sigma^{-2}, \qquad h_{22}(\sigma, \mu) = \sigma^{-2}, $$

so that $\{h_{22}(\sigma, \mu)\}^{1/2} = \sigma^{-1}$, constant in $\mu$, and, for example, $\Lambda_i = \{\mu;\ -e^{i} \le \mu \le e^{i}\}$, $i = 1, 2, \ldots$, provides a suitable sequence of subsets of $\Lambda = \Re$, not depending on $\sigma$, over which $\pi(\mu \mid \sigma)$ can be normalised, and the corollary to Proposition 5.29 can be applied. By a similar analysis to the above,

$$ \pi(\sigma, \mu) = \pi(\sigma)\, \pi(\mu \mid \sigma) \propto \sigma^{-1} \times 1 $$

provides a reference prior relative to the ordered parametrisation $(\sigma, \mu)$. The corresponding reference posterior for $\sigma$, given $x$, is

$$ \pi(\sigma \mid x) \propto \pi(\sigma) \int p(x \mid \mu, \sigma)\, d\mu \propto \sigma^{-n} \exp\left\{ -\frac{n s^2}{2\sigma^2} \right\}, $$

the integral over $\mu$ being proportional to $\sigma$, by comparison with an $\mathrm{N}(\mu \mid \bar x, n\sigma^{-2})$ density. Changing the variable to $\lambda = \sigma^{-2}$, this implies that

$$ \pi(\lambda \mid x) = \mathrm{Ga}\left( \lambda \mid \tfrac{1}{2}(n-1),\ \tfrac{1}{2} n s^2 \right) \quad \text{or, alternatively,} \quad \pi(n s^2 \lambda \mid x) = \chi^2(n s^2 \lambda \mid n - 1). $$

One feature of the above example is that the reference prior did not, in fact, depend on which of the parameters was taken to be the parameter of interest. In the following example the form does change when the parameter of interest changes.

Example 5.18. (Standardised normal mean). We consider the same situation as that of Example 5.17, but we now take $\phi = \mu/\sigma$ to be the parameter of interest. If $\sigma$ is taken as the nuisance parameter (by Proposition 5.27 the choice is irrelevant), $(\phi, \sigma) = g(\mu, \sigma)$ is clearly a one-to-one transformation of $(\mu, \sigma)$.
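The change of variable $\lambda = \sigma^{-2}$ can likewise be verified numerically. In this sketch (ours, with hypothetical values of $n$ and $s^2$), the density of $\lambda$ implied by $\pi(\sigma \mid x) \propto \sigma^{-n}\exp\{-ns^2/2\sigma^2\}$ is compared directly with the stated Gamma form:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

n, s2 = 10, 2.0  # hypothetical: n*s2 = sum of squared deviations

def unnorm_sigma_post(sig):
    # pi(sigma | x) proportional to sigma^{-n} exp{-n s^2 / (2 sigma^2)}
    return sig**(-n) * np.exp(-n * s2 / (2 * sig**2))

norm, _ = quad(unnorm_sigma_post, 0, np.inf)

def lam_post(lam):
    # change of variables lambda = sigma^{-2}: sigma = lam^{-1/2}, |dsigma/dlam| = lam^{-3/2}/2
    return unnorm_sigma_post(lam ** -0.5) * 0.5 * lam ** -1.5 / norm

lams = np.linspace(0.1, 2.0, 8)
target = gamma.pdf(lams, a=(n - 1) / 2, scale=2 / (n * s2))  # Ga(lam | (n-1)/2, n s^2 / 2)
approx = np.array([lam_post(l) for l in lams])
print(np.max(np.abs(approx - target)))
```

The maximum discrepancy is at the level of quadrature error, confirming the Ga$(\lambda \mid \frac{1}{2}(n-1), \frac{1}{2}ns^2)$ form.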
It is easily seen that, with $\mu = \phi\sigma$, the asymptotic precision matrix in terms of $(\phi, \sigma)$ is

$$ H(\phi, \sigma) = \begin{pmatrix} 1 & \phi/\sigma \\ \phi/\sigma & (2 + \phi^2)/\sigma^2 \end{pmatrix}, $$

so that

$$ h_\phi(\phi, \sigma) = \frac{2}{2 + \phi^2}, \qquad h_{22}(\phi, \sigma) = \frac{2 + \phi^2}{\sigma^2}, $$

and, using the corollary to Proposition 5.29, the reference prior relative to the ordered parametrisation $(\phi, \sigma)$ is

$$ \pi(\phi, \sigma) \propto (2 + \phi^2)^{-1/2}\, \sigma^{-1}. $$

In the $(\mu, \sigma)$ parametrisation this corresponds to

$$ \pi(\mu, \sigma) \propto \left( 2 + \frac{\mu^2}{\sigma^2} \right)^{-1/2} \sigma^{-2}, $$

which is clearly different from the form obtained in Example 5.17. Further discussion of this example will be provided in Example 5.26 of Section 5.6.2.

We conclude this subsection by considering a rather more involved example, where a natural choice of the required $\Lambda_i(\phi)$ subsequence does depend on $\phi$. In this case, we use Proposition 5.29 itself, since its corollary does not apply.

Example 5.19. (Product of normal means). Consider the case where independent random samples $x = \{x_1, \ldots, x_n\}$ and $y = \{y_1, \ldots, y_m\}$ are to be taken, respectively, from $\mathrm{N}(x \mid \alpha, 1)$ and $\mathrm{N}(y \mid \beta, 1)$, $\alpha > 0$, $\beta > 0$, so that the complete parametric model is

$$ p(x, y \mid \alpha, \beta) = \prod_{i=1}^{n} \mathrm{N}(x_i \mid \alpha, 1) \prod_{j=1}^{m} \mathrm{N}(y_j \mid \beta, 1), $$

for which the Fisher information matrix is easily seen to be

$$ H(\alpha, \beta) = \begin{pmatrix} n & 0 \\ 0 & m \end{pmatrix}. $$

Suppose now that we make the one-to-one transformation $(\phi, \lambda) = (\alpha\beta, \alpha/\beta) = g(\alpha, \beta)$, so that $\phi = \alpha\beta$ is taken to be the parameter of interest and $\lambda = \alpha/\beta$ is taken to be the nuisance parameter. Such a parameter of interest arises, for example, when inference about the area of a rectangle is required from data consisting of measurements of its sides.

Using the Jacobian of the inverse transformation, $\alpha = (\phi\lambda)^{1/2}$, $\beta = (\phi/\lambda)^{1/2}$, we find, after some manipulation, that

$$ h_\phi(\phi, \lambda) = \frac{nm\lambda}{\phi(n\lambda^2 + m)}, \qquad h_{22}(\phi, \lambda) = \frac{\phi(n\lambda^2 + m)}{4\lambda^3}, $$

so that, by Proposition 5.29, the conditional reference prior is $\pi(\lambda \mid \phi) \propto \lambda^{-3/2}(n\lambda^2 + m)^{1/2}$ as a function of $\lambda$. The question now arises as to what constitutes a "natural" sequence $\{\Lambda_i(\phi)\}$ over which to define the normalised versions $\pi_i(\lambda \mid \phi)$ required by the algorithm. A natural increasing sequence of subsets of the original parameter space would be the sets

$$ S_i = \{(\alpha, \beta);\ 0 < \alpha < i,\ 0 < \beta < i\}, \qquad i = 1, 2, \ldots, $$

which transform, in the space of $\lambda \in \Lambda$, into the sequence

$$ \Lambda_i(\phi) = \{\lambda;\ \phi/i^2 < \lambda < i^2/\phi\}. $$

We note that, unlike in the previous cases we have considered, this does depend on $\phi$. To complete the analysis, it can be shown, after some manipulation, that for large $i$ the $\phi$-dependence of the normalisation over $\Lambda_i(\phi)$ contributes a factor $\phi^{1/2}$, leading to a reference prior, relative to the ordered parametrisation $(\phi, \lambda)$, given by

$$ \pi(\phi, \lambda) \propto \phi^{1/2}\, \lambda^{-3/2}\, (n\lambda^2 + m)^{1/2}. $$
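As a purely numerical sanity check (our sketch, not part of the original text), one can confirm that the $(\phi, \lambda)$ prior just displayed and the original-parametrisation form $\pi(\alpha, \beta) \propto (n\alpha^2 + m\beta^2)^{1/2}$ (Berger and Bernardo, 1989) are related by the Jacobian $|\partial(\phi,\lambda)/\partial(\alpha,\beta)| = 2\alpha/\beta$ of the transformation:

```python
import numpy as np

n, m = 7, 3  # hypothetical sample sizes

def prior_phi_lam(phi, lam):
    # pi(phi, lam) proportional to phi^{1/2} lam^{-3/2} (n lam^2 + m)^{1/2}
    return phi**0.5 * lam**-1.5 * (n * lam**2 + m) ** 0.5

def prior_alpha_beta(alpha, beta):
    # pi(alpha, beta) proportional to (n alpha^2 + m beta^2)^{1/2}
    return (n * alpha**2 + m * beta**2) ** 0.5

rng = np.random.default_rng(0)
pts = rng.uniform(0.1, 5.0, size=(20, 2))
ratios = []
for alpha, beta in pts:
    phi, lam = alpha * beta, alpha / beta
    jac = 2 * alpha / beta  # |det d(phi, lam) / d(alpha, beta)|
    ratios.append(prior_phi_lam(phi, lam) * jac / prior_alpha_beta(alpha, beta))
ratios = np.array(ratios)
print(ratios)  # constant: the two forms agree up to proportionality
```

The ratio is the same at every point, so the two densities differ only by a normalising constant, as invariance requires.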
In the original parametrisation, this corresponds to

$$ \pi(\alpha, \beta) \propto (n\alpha^2 + m\beta^2)^{1/2}, $$

which depends on the sample sizes through the ratio $m/n$ and reduces, in the case $n = m$, to

$$ \pi(\alpha, \beta) \propto (\alpha^2 + \beta^2)^{1/2}, $$

a form originally proposed for this problem in an unpublished 1982 Stanford University technical report by Stein, who showed that it provides approximate agreement between Bayesian credible regions and classical confidence intervals for $\phi$. For a detailed discussion of this example, and of the consequences of choosing a different sequence $\Lambda_i(\phi)$, see Berger and Bernardo (1989).

We note that the preceding example serves to illustrate the fact that reference priors may depend explicitly on the sample sizes defined by the experiment. There is, of course, nothing paradoxical in this, since the underlying notion of a reference analysis is a "minimally informative" prior relative to the actual experiment to be performed.

Multiparameter Problems

The approach to the nuisance parameter case considered above was based on the use of an ordered parametrisation whose first and second components were $(\phi, \lambda)$, referred to, respectively, as the parameter of interest and the nuisance parameter. The reference prior for the ordered parametrisation $(\phi, \lambda)$ was then constructed by conditioning, to give the form $\pi(\phi, \lambda) = \pi(\lambda \mid \phi)\, \pi(\phi)$.
When the model parameter vector $\theta$ has more than two components, this successive conditioning idea can obviously be extended by considering $\theta$ as an ordered parametrisation, $(\theta_1, \ldots, \theta_m)$, and generating, by successive conditioning, a reference prior, relative to this ordered parametrisation, of the form

$$ \pi(\theta) = \pi(\theta_m \mid \theta_1, \ldots, \theta_{m-1}) \times \cdots \times \pi(\theta_2 \mid \theta_1)\, \pi(\theta_1). $$

In order to describe the algorithm for producing this successively conditioned form, in the standard, regular case, we shall first need to introduce some notation. Assuming the parametric model $p(x \mid \theta)$, $\theta \in \Theta$, to be such that the Fisher information matrix $H(\theta)$ has full rank, we define $S(\theta) = H^{-1}(\theta)$, denote by $S_j(\theta)$ the corresponding upper left $j \times j$ submatrix of $S(\theta)$, and by $h_j(\theta)$ the lower right element of $S_j^{-1}(\theta)$. Further, for $j = 1, \ldots, m$, we define the component vectors $\theta_{[j]} = (\theta_1, \ldots, \theta_j)$ and $\theta_{[\,j+1\,]} = (\theta_{j+1}, \ldots, \theta_m)$. Finally, we assume that $\Theta = \Theta_1 \times \cdots \times \Theta_m$, with $\theta_j \in \Theta_j$, and take $\{\Theta^i\}$, $i = 1, 2, \ldots$, to be an increasing sequence of compact subsets of $\Theta$.
With the above notation, and under regularity conditions extending those of Proposition 5.29 in an obvious way, the algorithm may be summarised as follows.

Proposition 5.30. (Ordered reference priors under asymptotic normality). The reference prior $\pi(\theta)$, relative to the ordered parametrisation $(\theta_1, \ldots, \theta_m)$, is given by the successively conditioned form

$$ \pi(\theta) = \pi(\theta_m \mid \theta_1, \ldots, \theta_{m-1}) \times \cdots \times \pi(\theta_2 \mid \theta_1)\, \pi(\theta_1), $$

where the conditionals are obtained from the $\{h_j(\theta)\}$ by the following recursion: (i) for $j = m$, the innermost conditional is proportional to $h_m^{1/2}(\theta)$, normalised over the compact sets $\{\Theta^i\}$; (ii) for $j = m-1, \ldots, 2, 1$, the $j$th conditional is proportional to the exponential of the expectation of $\log h_j^{1/2}(\theta)$ with respect to the conditionals already derived, again normalised over $\{\Theta^i\}$; the reference prior is then obtained as the limit of the resulting sequence, exactly as in the two-parameter case.

Proof. This follows closely the development given in Proposition 5.29, in an obvious way. For details, see Berger and Bernardo (1992a, 1992b, 1992c).

The derivation of the ordered reference prior is greatly simplified if the $\{h_j(\theta)\}$ terms depend only on $\theta_{[j]}$, and even greater simplification obtains if $H(\theta)$ is block diagonal, particularly if the $j$th term can be factored into a product of a function of $\theta_j$ and a function not depending on $\theta_j$.

Corollary. If $H(\theta)$ is block diagonal (i.e., $\theta_1, \ldots, \theta_m$ are mutually orthogonal), with

$$ h_j(\theta) = f_j(\theta_j)\, g_j(\theta), \qquad j = 1, \ldots, m, $$

where $g_j(\theta)$ does not depend on $\theta_j$, and if the $\Theta_j$'s do not depend on $\theta$, then

$$ \pi(\theta) \propto \prod_{j=1}^{m} f_j^{1/2}(\theta_j), $$

and there is no need to consider subset sequences $\{\Theta^i\}$.

Proof. The results follow from the recursion of Proposition 5.30.

The question obviously arises as to the appropriate ordering to be adopted in any specific problem. At present, no formal theory exists to guide such a choice, but experience with a wide range of examples suggests that, at least for non-hierarchical models (see Section 4.6.5), the best procedure is to order the components of $\theta$ on the basis of their inferential interest.

Example 5.20. (Reference analysis for $m$ normal means). Let $e_n$ be an experiment which consists in obtaining $\{x_1, \ldots, x_n\}$, $n \ge 2$, a random sample from the multivariate normal model $\mathrm{N}_m(x \mid \mu, \tau I_m)$, $m \ge 1$, for which the Fisher information matrix is easily seen to be

$$ H(\mu, \tau) = \begin{pmatrix} \tau I_m & 0 \\ 0 & m/(2\tau^2) \end{pmatrix}. $$

It follows from Proposition 5.30 that the reference prior relative to the natural parametrisation $(\mu_1, \ldots, \mu_m, \tau)$ is given by

$$ \pi(\mu_1, \ldots, \mu_m, \tau) \propto \tau^{-1}, $$

or, equivalently, $\pi(\mu_1, \ldots, \mu_m, \sigma) \propto \sigma^{-1}$ if we parametrise in terms of $\sigma = \tau^{-1/2}$. Furthermore, since the parameters are all mutually orthogonal, in this example the result does not, in fact, depend on the order in which the parametrisation is taken, and it is thus the appropriate reference form if we are interested in any of the individual parameters. The reference posterior for any $\mu_j$ is easily shown to be the Student density

$$ \pi(\mu_j \mid x) = \mathrm{St}\left( \mu_j \mid \bar x_j,\ (n-1)s^{-2},\ m(n-1) \right), \qquad nm\,s^2 = \sum_{i=1}^{n} \sum_{k=1}^{m} (x_{ik} - \bar x_k)^2, $$

which agrees with the standard argument according to which one degree of freedom should be lost by each of the unknown means.
Example 5.21. (Multinomial model). Let $x = \{r_1, \ldots, r_m\}$ be an observation from a multinomial distribution (see Section 3.2.2), so that

$$ p(r_1, \ldots, r_m \mid \theta_1, \ldots, \theta_m) \propto \theta_1^{r_1} \cdots \theta_m^{r_m} (1 - \theta_1 - \cdots - \theta_m)^{\,n - r_1 - \cdots - r_m}. $$

Noting the form of $H(\theta_1, \ldots, \theta_m)$, the conditional asymptotic precisions used in Proposition 5.30 are easily identified. In fact, the conditional reference priors derived using the recursion turn out to be proper,

$$ \pi(\theta_j \mid \theta_1, \ldots, \theta_{j-1}) \propto \theta_j^{-1/2}\, (1 - \theta_1 - \cdots - \theta_j)^{-1/2}, \qquad 0 < \theta_j < 1 - \theta_1 - \cdots - \theta_{j-1}, $$

and there is no need to consider subset sequences $\{\Theta^i\}$. The required reference prior relative to the ordered parametrisation $(\theta_1, \ldots, \theta_m)$ is then given by

$$ \pi(\theta_1, \ldots, \theta_m) \propto \prod_{j=1}^{m} \theta_j^{-1/2}\, (1 - \theta_1 - \cdots - \theta_j)^{-1/2}, $$

and the corresponding reference posterior for $\theta_1$, obtained by integrating out $\theta_2, \ldots, \theta_m$ from the joint posterior, reduces, after some algebra, to

$$ \pi(\theta_1 \mid r_1, \ldots, r_m) = \mathrm{Be}\left( \theta_1 \mid r_1 + \tfrac{1}{2},\ n - r_1 + \tfrac{1}{2} \right). $$

This, as one could expect, coincides with the reference posterior which would have been obtained had we initially collapsed the multinomial analysis to a binomial model and then carried out a reference analysis for the latter. Clearly, by symmetry considerations, the above analysis applies to any $\theta_j$, after appropriate changes in labelling, and it is independent of the particular order in which the parameters are taken. For a detailed discussion of this example, see Berger and Bernardo (1992a). Further comments on the ordering of parameters are given in Section 5.6.

Example 5.22. (Normal correlation coefficient). Let $\{x_1, \ldots, x_n\}$ be a random sample from a bivariate normal distribution, $\mathrm{N}_2(x \mid \mu, \Sigma)$, where

$$ \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. $$

Suppose that the correlation coefficient $\rho$ is the parameter of interest, and consider the ordered parametrisation $\{\rho, \mu_1, \mu_2, \sigma_1, \sigma_2\}$. It is easily seen that the Fisher information matrix has the structure required for the conditional asymptotic precisions of Proposition 5.30 to be readily identified.
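The collapsing property in Example 5.21 can be checked numerically in a trinomial case ($m = 2$). The sketch below (ours, with hypothetical counts) integrates $\theta_2$ out of the joint reference posterior and compares the result with $\mathrm{Be}(\theta_1 \mid r_1 + \frac{1}{2}, n - r_1 + \frac{1}{2})$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import beta

r1, r2, n = 4, 2, 10  # hypothetical trinomial counts (two cells plus the remainder)

def unnorm_marginal_theta1(t1):
    # joint reference posterior proportional to
    #   t1^{r1-1/2} (1-t1)^{-1/2} t2^{r2-1/2} (1-t1-t2)^{n-r1-r2-1/2};
    # integrate out t2 over (0, 1-t1)
    integrand = lambda t2: t2 ** (r2 - 0.5) * (1 - t1 - t2) ** (n - r1 - r2 - 0.5)
    val, _ = quad(integrand, 0, 1 - t1)
    return t1 ** (r1 - 0.5) * (1 - t1) ** (-0.5) * val

norm, _ = quad(unnorm_marginal_theta1, 0, 1)
t1s = np.linspace(0.05, 0.9, 8)
approx = np.array([unnorm_marginal_theta1(t) / norm for t in t1s])
target = beta.pdf(t1s, r1 + 0.5, n - r1 + 0.5)
print(np.max(np.abs(approx - target)))
```

The two densities agree to quadrature accuracy, which is the "collapse to binomial" property stated in the example.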
After some algebra, it can be shown that this leads to the reference prior

$$ \pi(\rho, \mu_1, \mu_2, \sigma_1, \sigma_2) \propto (1 - \rho^2)^{-1}\, \sigma_1^{-1}\, \sigma_2^{-1}, $$

whatever ordering of the nuisance parameters $\mu_1, \mu_2, \sigma_1, \sigma_2$ is taken. Furthermore, the corresponding reference posterior distribution for $\rho$, given data $x$, is

$$ \pi(\rho \mid x) \propto (1 - \rho^2)^{(n-3)/2}\, (1 - \rho r)^{-n + 3/2}\, F\left( \tfrac{1}{2}, \tfrac{1}{2};\ n - \tfrac{1}{2};\ \tfrac{1}{2}(1 + \rho r) \right) $$

(where $F$ is the hypergeometric function), which only depends on the data through the sample correlation coefficient $r$, whose sampling distribution only depends on $\rho$, as one could expect from Fisher's (1915) original analysis. This agrees with Lindley's (1965, p. 219) analysis. For a detailed analysis of this example, see Bayarri (1981); see, also, Ye and Berger (1991) and Berger and Bernardo (1992b) for derivations of the reference distributions for a variety of other interesting models.

Infinite discrete parameter spaces

The infinite discrete case presents special problems, due to the non-existence of an asymptotic theory comparable to that of the continuous case. It is, however, often possible to obtain an approximate reference posterior by embedding the discrete parameter space within a continuous one.

Example 5.23. (Infinite discrete case). In the context of capture-recapture problems, suppose it is of interest to make inferences about an integer $\theta \in \{1, 2, \ldots\}$ on the basis of a random sample $z = \{x_1, \ldots, x_n\}$ from the appropriate sampling model $p(x \mid \theta)$. For several plausible "diffuse looking" prior distributions for $\theta$ one finds that the corresponding posterior virtually ignores the data. Intuitively, this has to be interpreted as suggesting that such priors actually contain a large amount of information about $\theta$ compared with that provided by the data. A more careful approach to providing a "non-informative" prior is clearly required. See, for example, Hills (1987).

One possibility would be to embed the discrete space $\{1, 2, \ldots\}$ in the continuous space $(0, \infty)$ since, for each $\theta > 0$, $p(x \mid \theta)$ is still a probability density for $x$. Then, applying the one-parameter reference algorithm to the embedding model, the appropriate reference prior is $\pi(\theta) \propto h(\theta)^{1/2}$, where $h(\theta)$ is the corresponding Fisher information, and it is easily verified that this prior leads to a posterior in which the data are no longer overwhelmed. If the physical conditions of the problem require the use of discrete $\theta$ values, one could always use, for example, the probability assigned by the continuous reference posterior $\pi(\theta \mid z)$ to a unit interval containing $j$ as an approximate discrete reference posterior $P(\theta = j \mid z)$.
Prediction and Hierarchical Models

Two classes of problems that are not covered by the methods so far discussed are hierarchical models and prediction problems. The difficulty with these problems is that there are unknowns (typically the unknowns of interest) whose distributions are only conditionally specified. For instance, if one wants to predict $y$ based on $x$ when $(y, x)$ has density $p(y, x \mid \theta)$, the unknown of interest is $y$, but its distribution is conditionally specified; one needs a reference prior for $\theta$, not $y$. Likewise, in a hierarchical model with, say, $\mu_1, \mu_2, \ldots, \mu_p$ being $\mathrm{N}(\mu_i \mid \mu_0, \lambda)$, the $\mu_i$'s may be the parameters of interest, but a prior is only needed for the hyperparameters $\mu_0$ and $\lambda$.

The obvious way to approach such problems is to integrate out the variables with conditionally known distributions ($y$ in the prediction problem and the $\{\mu_i\}$ in the hierarchical model), and find the reference prior for the remaining parameters based on this marginal model. The difficulty that then arises is how to identify parameters of interest and nuisance parameters so as to construct the ordering necessary for applying the reference prior method, the real parameters of interest having been integrated out. In future work, we propose to deal with this difficulty by defining the parameter of interest in the reduced model to be the conditional mean of the original parameter of interest. Thus, in the prediction problem, $E[y \mid \theta]$ (which will be either $\theta$ or some transformation thereof) will be the parameter of interest, and in the hierarchical model $E[\mu_i \mid \mu_0, \lambda] = \mu_0$ will be defined to be the parameter of interest. This technique has so far worked well in the examples to which it has been applied, but further study is clearly needed.

5.5 NUMERICAL APPROXIMATIONS

Section 5.3 considered forms of approximation appropriate as the sample size becomes large relative to the amount of information contained in the prior distribution. Section 5.4 considered the problem of approximating a prior specification maximising the expected information to be obtained from the data. In this section, we shall consider numerical techniques for implementing Bayesian methods for arbitrary forms of likelihood and prior specification, and arbitrary sample size.
Specifically, given a likelihood $p(x \mid \theta)$ and a prior density $p(\theta)$, the starting point for all subsequent inference summaries is the joint posterior density for $\theta$ given by

$$ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta}. $$

From this, we may be interested in obtaining univariate marginal posterior densities for the components of $\theta$, bivariate joint marginal posterior densities for pairs of components of $\theta$, or marginal posterior densities for functions of components of $\theta$, such as ratios or products, together with summary moments, highest posterior density intervals and regions, and so on. In all these cases, the technical key to the implementation of the formal solution given by Bayes' theorem, for specified likelihood and prior, is the ability to perform a number of integrations. First, we need to evaluate the denominator in Bayes' theorem in order to obtain the normalising constant of the posterior density; then we need to integrate over complementary components of $\theta$, or transformations of $\theta$, in order to obtain marginal (univariate or bivariate) densities, together with summary moments, highest posterior density intervals and regions, or whatever. Except in certain rather stylised problems (e.g., exponential families together with conjugate priors), the required integrations will not be feasible analytically and, thus, efficient approximation strategies will be required. In this section, we shall outline five possible numerical approximation strategies, which will be discussed under the subheadings: Laplace Approximation; Iterative Quadrature; Importance Sampling; Sampling-importance-resampling; Markov Chain Monte Carlo. An exhaustive account of these and other methods will be given in the second volume of this series, Bayesian Computation.

5.5.1 Laplace Approximation

We motivate the approximation by noting that the technical problem of evaluating quantities required for Bayesian inference summaries is typically that of evaluating an integral of the form

$$ E[g(\theta) \mid x] = \int g(\theta)\, p(\theta \mid x)\, d\theta, $$

where $p(\theta \mid x)$ is derived from a parametric model, or from a predictive model with an appropriate representation as a mixture of parametric models, and $g(\theta)$ is some real-valued function of interest; often, $g(\theta)$ is a first or second moment. Focusing initially on this situation of a required inference summary for $g(\theta)$, and assuming $g(\theta)$ almost everywhere positive, we note that the posterior expectation of interest can be written as a ratio of two integrals,

$$ E[g(\theta) \mid x] = \frac{\int g(\theta)\, p(x \mid \theta)\, p(\theta)\, d\theta}{\int p(x \mid \theta)\, p(\theta)\, d\theta}. $$

Essentially, the Laplace approximations for the two integrals defining the numerator and denominator consist of retaining quadratic terms in Taylor expansions of the exponents of the integrands, and are thus equivalent to normal-like approximations to the integrands (see, for example, Jeffreys and Jeffreys, 1946).

Let us consider first the case of a single unknown parameter, $\theta \in \Re$, with the vector $x = (x_1, \ldots, x_n)$ of observations fixed. In the context we are considering, the functions $h(\cdot)$ and $h^*(\cdot)$ are defined by

$$ n h(\theta) = \log p(\theta) + \log p(x \mid \theta), \qquad n h^*(\theta) = \log g(\theta) + \log p(\theta) + \log p(x \mid \theta), $$

and we define $\hat\theta$, $\sigma$ and $\theta^*$, $\sigma^*$ such that

$$ h(\hat\theta) = \sup_\theta h(\theta), \quad \sigma = [-h''(\hat\theta)]^{-1/2}, \qquad h^*(\theta^*) = \sup_\theta h^*(\theta), \quad \sigma^* = [-h^{*\prime\prime}(\theta^*)]^{-1/2}. $$

Assuming $h(\cdot)$ and $h^*(\cdot)$ to be suitably smooth functions, the Laplace approximations for the denominator and numerator of $E[g(\theta) \mid x]$ are given by

$$ \sqrt{2\pi}\, \sigma\, n^{-1/2} \exp\{ n h(\hat\theta) \} \qquad \text{and} \qquad \sqrt{2\pi}\, \sigma^*\, n^{-1/2} \exp\{ n h^*(\theta^*) \}, $$

respectively, and it then follows immediately that the resulting approximation for $E[g(\theta) \mid x]$ has the form

$$ \hat E[g(\theta) \mid x] = \left( \frac{\sigma^*}{\sigma} \right) \exp\left\{ n \left[ h^*(\theta^*) - h(\hat\theta) \right] \right\}. $$

Tierney and Kadane (1986) have shown that

$$ E[g(\theta) \mid x] = \hat E[g(\theta) \mid x]\, \left( 1 + O(n^{-2}) \right). $$

The Laplace approximation approach, exploiting the fact that Bayesian inference summaries typically involve ratios of integrals, is thus seen to provide a potentially very powerful general approximation technique. See, also, Tierney, Kass and Kadane (1989a, 1989b), Kass, Tierney and Kadane (1988, 1991) and Wong and Li (1992) for further underpinning of, and extensions to, this methodology.

Considering now the general case of $\theta \in \Re^k$, the Laplace approximation to $E[g(\theta) \mid x]$ is completely analogous to the univariate case. With $n h(\theta)$ and $n h^*(\theta)$ defined as above, $\hat\theta$ and $\theta^*$ the corresponding maximising values, and $\Sigma$, $\Sigma^*$ the inverses of the Hessian matrices of $-n h$ and $-n h^*$ evaluated at $\hat\theta$ and $\theta^*$, we see that

$$ \hat E[g(\theta) \mid x] = \left( \frac{\det \Sigma^*}{\det \Sigma} \right)^{1/2} \exp\left\{ n \left[ h^*(\theta^*) - h(\hat\theta) \right] \right\}. $$

If $\theta = (\phi, \lambda)$ and the required inference summary is the marginal posterior density for $\phi$, application of the Laplace approximation approach corresponds to obtaining $p(\phi \mid x)$ pointwise, by fixing $\phi$ in the numerator integral. It is easily seen that this leads to

$$ \hat p(\phi \mid x) \propto \left| \Sigma_\lambda(\phi) \right|^{1/2} \exp\{ n h_\phi(\hat\lambda_\phi) \}, $$

where

$$ n h_\phi(\lambda) = \log p(\phi, \lambda) + \log p(x \mid \phi, \lambda), $$

considered as a function of $\lambda$ for fixed $\phi$, $\hat\lambda_\phi$ maximises $h_\phi(\lambda)$, and $\Sigma_\lambda(\phi)$ is the inverse of the Hessian of $-n h_\phi$ at $\hat\lambda_\phi$. The form $\hat p(\phi \mid x)$ thus provides (up to proportionality) a pointwise approximation to the ordinates of the marginal posterior density for $\phi$.
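A minimal one-parameter implementation of the Tierney-Kadane form is sketched below (our illustration, not the authors' code). It is tested on a hypothetical conjugate Poisson-Gamma example, where the exact posterior mean $(a + s)/(b + n)$ is available for comparison; the maximisations and second derivatives are done numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_posterior_mean(log_post, log_g, lo, hi, eps=1e-4):
    """Tierney-Kadane estimate of E[g(theta) | x].

    log_post(theta) = log p(theta) + log p(x | theta)  (i.e. n*h);
    log_g(theta)    = log g(theta), so log_post + log_g is n*h*."""
    def fit(f):
        opt = minimize_scalar(lambda t: -f(t), bounds=(lo, hi), method="bounded")
        t0 = opt.x
        # numerical second derivative for the sigma term
        d2 = (f(t0 + eps) - 2 * f(t0) + f(t0 - eps)) / eps**2
        return f(t0), np.sqrt(-1.0 / d2)

    h_hat, sig = fit(log_post)
    h_star, sig_star = fit(lambda t: log_post(t) + log_g(t))
    return (sig_star / sig) * np.exp(h_star - h_hat)

# Conjugate check (hypothetical data): Poisson counts with total s from n observations,
# Ga(a, b) prior, so the posterior is Ga(a + s, b + n) with known mean (a + s)/(b + n).
a, b = 2.0, 1.0
s, n = 20, 5
log_post = lambda t: (a + s - 1) * np.log(t) - (b + n) * t
approx = laplace_posterior_mean(log_post, np.log, 1e-6, 50.0)
exact = (a + s) / (b + n)
print(approx, exact)
```

As the $O(n^{-2})$ result suggests, the relative error here is of the order $10^{-5}$, despite both integrands being visibly skewed.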
The form $n h_\phi(\lambda)$, considered as a function of $\lambda$ for fixed $\phi$ and evaluated at the maximising value $\hat\lambda_\phi$, is closely related to forms arising in likelihood-based inference: if $p(\phi, \lambda)$ is constant, $p(x \mid \phi, \hat\lambda_\phi)$ is usually called the profile likelihood for $\phi$, and the approximation to the marginal density for $\phi$ given by $\hat p(\phi \mid x)$ has a form often referred to as the modified profile likelihood (see, for example, Cox and Reid, 1987, for a convenient discussion of this terminology). Approximation to Bayesian inference summaries through Laplace approximation is therefore seen to have links with forms of inference summary proposed and derived from a non-Bayesian perspective. For further references, see Appendix B, Section 4.2.

In relation to the above analysis, we note that the Laplace approximation is essentially derived by considering normal approximations to the integrands appearing in the numerator and denominator of the general form $E[g(\theta) \mid x]$. If the forms concerned are not well approximated by second-order Taylor expansions of the exponent terms of the integrands, which may be the case with small or moderate samples, particularly when components of $\theta$ are constrained to ranges other than the real line, we may be able to improve substantially on this direct Laplace approximation approach.

One possible alternative is to attempt to approximate the integrands by forms other than normal, perhaps resembling more the actual posterior shapes, such as gammas or betas. Such an approach has been followed in the one-parameter case by Morris (1988), who develops a general approximation technique based around the Pearson family of densities. These are characterised by parameters $m$, $\mu_0$ and a quadratic function $Q$, which specify a density for $\theta$ satisfying

$$ \frac{d}{d\theta} \log p(\theta) = -\, m\, \frac{\theta - \mu_0}{Q(\theta)}, $$

where $Q(\theta) = q_0 + q_1\theta + q_2\theta^2$ and the range of $\theta$ is such that $0 < Q(\theta) < \infty$. It is shown by Morris (1988) that, at least if $\theta$ is a scalar parameter, an analogue to the Laplace-type approximation of an integral of a unimodal function $f(\theta)$ is available, in which the mode and curvature of $\log f$ appearing in the standard Laplace form are replaced by the maximiser $\hat\theta$ of $r(\theta) = \log[f(\theta)\,Q(\theta)]$ and a scale term constructed from $r''(\hat\theta)$ and $Q(\hat\theta)$. Details of the forms of $Q$ and $\mu_0$ for familiar forms of Pearson densities are given in Morris (1988), where it is also shown that the approximation can often be further simplified. This approach, for a judicious choice of the member of the Pearson family, can therefore provide approximations which are, in some sense, preselected to be best.

A second alternative is to note that the version of the Laplace approximation proposed by Tierney and Kadane (1986) is not invariant to changes in the (arbitrary) parametrisation chosen when specifying the likelihood and prior density functions. It may be, therefore, that by judicious reparametrisation (of the likelihood, together with the appropriate, Jacobian-adjusted, prior density) the Laplace approximation can itself be made more accurate, even in contexts where the original parametrisation does not suggest the plausibility of a normal-type approximation to the integrands. We note, incidentally, that such a strategy is also available in multiparameter contexts, whereas the Pearson family approach does not seem so readily generalisable.

To provide a concrete illustration of these alternative analytic approximation approaches, consider the following.

Example 5.24. (Approximating the mean of a beta distribution). Suppose that a posterior beta distribution, $\mathrm{Be}(\theta \mid s + \frac{1}{2}, n - s + \frac{1}{2})$, has arisen from a $\mathrm{Bi}(s \mid \theta, n)$ likelihood, together with a $\mathrm{Be}(\theta \mid \frac{1}{2}, \frac{1}{2})$ prior (the reference prior, derived in Example 5.14). We can, in fact, identify the analytic form of the posterior mean in this case,

$$ E[\theta \mid s, n] = \frac{s + \frac{1}{2}}{n + 1}, $$

but we shall ignore this for the moment and examine the approximations implied by the techniques discussed above.

First, defining $g(\theta) = \theta$, we see, after some algebra, that the Tierney-Kadane form of the Laplace approximation gives the estimated posterior mean

$$ \hat E[\theta \mid s, n] = \left( \frac{\sigma^*}{\sigma} \right) \exp\left\{ h^*(\theta^*) - h(\hat\theta) \right\}, \qquad \hat\theta = \frac{s - \frac{1}{2}}{n - 1}, \quad \theta^* = \frac{s + \frac{1}{2}}{n}. $$

If, instead, we reparametrise to $\delta = \sin^{-1}\sqrt{\theta}$, together with the appropriate, Jacobian-adjusted, prior density, the required integrals are defined in terms of the corresponding functions of $\delta$, and the Laplace approximation can be computed in exactly the same way. Alternatively, if we work via the Pearson family, with $Q(\theta) = \theta(1 - \theta)$ as the "natural" choice for a beta-like posterior, we obtain a third estimate. By considering the percentage errors of estimation, defined by

$$ 100 \times \frac{|\,\text{true} - \text{estimated}\,|}{\text{true}}, $$

we can study the performance of the three estimates for various values of $n$ and $s$. Details are given in Achcar and Smith (1989); here, we simply summarise, in Table 5.1, the results for $n = 5$, $s = 3$, which typify the performance of the estimates for small $n$.

Table 5.1  Approximation of $E[\theta \mid s, n]$ from $\mathrm{Be}(\theta \mid s + \frac{1}{2}, n - s + \frac{1}{2})$ (percentage errors in parentheses)

    True value    Laplace        Laplace (reparametrised)    Pearson
    0.583         0.563 (3.5)    0.580 (0.6)                 0.585 (0.3)

We see from Table 5.1 that the Pearson approximation, which is, in some sense, preselected to be best, does, in fact, outperform the others. However, it is striking that the Laplace approximation under reparametrisation leads to such a considerable improvement over that based on the original parametrisation, and is a very satisfactory alternative to the "optimal" Pearson form. Further examples are given in Achcar and Smith (1989).

In general, it would appear that, in cases involving a relatively small number of parameters, the Laplace approach, in combination with judicious reparametrisation, can provide excellent approximations to general Bayesian inference summaries, whether in the form of posterior moments or marginal posterior densities. However, there are awkward complications if the integrands are multimodal. In addition, in multiparameter contexts there may be numerical problems with the evaluation of local derivatives in cases where analytic forms are unobtainable or too tedious to identify explicitly. At the time of writing, this area of approximation theory is very much still an active research field, and the full potential of this and related methods (see, also, Lindley, 1980b; Leonard et al., 1989) has yet to be clarified.
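The two Laplace entries of Table 5.1 can be reproduced directly. The sketch below (ours) evaluates the Tierney-Kadane estimate in the original parametrisation and under $\delta = \sin^{-1}\sqrt{\theta}$ for $n = 5$, $s = 3$; the grid-based maximisation is simply to keep the example dependency-free:

```python
import numpy as np

n, s = 5, 3
true_mean = (s + 0.5) / (n + 1)  # exact mean of Be(theta | s+1/2, n-s+1/2)

def tk_mean(h, h_star, lo, hi):
    # Tierney-Kadane: (sigma*/sigma) exp{h*(t*) - h(t_hat)}, maxima found on a fine grid
    grid = np.linspace(lo, hi, 200_001)
    def fit(f):
        vals = f(grid)
        t0 = grid[np.argmax(vals)]
        eps = 1e-4
        d2 = (f(t0 + eps) - 2 * f(t0) + f(t0 - eps)) / eps**2
        return np.max(vals), np.sqrt(-1.0 / d2)
    h_hat, sig = fit(h)
    h_max, sig_star = fit(h_star)
    return (sig_star / sig) * np.exp(h_max - h_hat)

# (i) original parametrisation: h = log prior + log likelihood, h* = h + log(theta)
h = lambda t: (s - 0.5) * np.log(t) + (n - s - 0.5) * np.log(1 - t)
h_star = lambda t: h(t) + np.log(t)
est_theta = tk_mean(h, h_star, 1e-4, 1 - 1e-4)

# (ii) reparametrised: theta = sin^2(delta), Jacobian included, g(delta) = sin^2(delta)
hd = lambda d: 2 * s * np.log(np.sin(d)) + 2 * (n - s) * np.log(np.cos(d))
hd_star = lambda d: hd(d) + 2 * np.log(np.sin(d))
est_delta = tk_mean(hd, hd_star, 1e-4, np.pi / 2 - 1e-4)

print(est_theta, est_delta, true_mean)
```

The run recovers approximately 0.563 and 0.580 against the true value 0.583, matching Table 5.1.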
) . then an rwiluciting /he tipoint GaussHermite rule will prove effective for . respectively. covers many of the likelihood x prior shapes we typically encounter for Moreover. prior information. = w. that.. using the same (iterated) . (see.log(h .firsfmid .sitnul~cinc~oits!\ norrntrlising c w i s f r i n f and thc>. and 2. is the ith zero of the Hermite polynomial H. log(f ) or log(1 . if h ( t ) is a suitably well behaved function and g(t) = h ( t )( 2 K I T L ) "'cxp { . = /I + J2a t . Naylor and Smith. It follows that GaussHermite rules are likely to prove very efficient for functions which. for example.. if f ( t ) is a polynomial of degree at most 211 . for example. the applicability of this approximaparameters defined on (x.5. given reasonable starting values (from any convenient source. using. for example.. In particular. This implies.3. even for moderate n (less than 12. then the quadrature rule approximates the integral without error. Moreover. x).t ) . we typically can successfully iterate the quadrature rule.2 5 Irtference lteratfve Quadrature It is well known that univariate integrals of the type are often well approximated by GaussHermite quadrature rules of the form where 1 .). substituting estimates of the posterior mean and variance obtained using previous values of 'in. :. exp(t . tion is vastly extended by working with suitable transformations of parameters h defined on other ranges such as (0.346 5. say).(t). closely resemble "polynomial x normal" forms. expressed in informal terms. It turns out that. In fact.secoiid tttoitietirs. x) or (0. ' then where n ) . this is a rather rich class which. Of course.. ) a. to use the above form we must specify 11and cr in the normal component..(J}/ l t1 . maximum likelihood estimates. 1982).1. we note that if the posterior density i s wellapproximated by the product of a normal and a polynomial of degree at most 21.u) . etc. .
Our discussion so far has been for the one-dimensional case. Clearly, however, the need for an efficient strategy is most acute in higher dimensions. The "obvious" extension of the above ideas is to use a Cartesian product rule, giving the approximation

∫ g(t) dt ≈ Σ_{i_1=1}^{n_1} ⋯ Σ_{i_k=1}^{n_k} m_{i_1} ⋯ m_{i_k} g(z_{i_1}, …, z_{i_k}),

where the grid points and the weights are found by substituting the appropriate iterated estimates of μ and σ² corresponding to each marginal component. The problem with this "obvious" strategy is that the product form is only efficient if we are able to make an (at least approximate) assumption of posterior independence among the individual components. In particular, if high posterior correlations exist, these will lead to many of the lattice points falling in areas of negligible posterior density, thus causing the Cartesian product rule to provide poor estimates of the normalising constant and moments.

To overcome this problem, we could first apply individual parameter transformations of the type discussed above and then attempt to transform the resulting parameters, via an appropriate linear transformation, to a new, approximately orthogonal, set of parameters. In this case, the lattice of integration points formed from the product of the one-dimensional grids will efficiently cover the bulk of the posterior density. The following general strategy has proved highly effective for problems involving up to six parameters (see, for example, Naylor and Smith, 1982, 1988; Smith et al., 1985, 1987).

(1) Reparametrise individual parameters so that the resulting working parameters all take values on the real line.

(2) Using initial estimates of the joint posterior mean vector and covariance matrix for the working parameters, transform further to a centred, scaled, more "orthogonal" set of parameters. At the first step, this linear transformation derives from an initial guess or estimate of the posterior covariance matrix (for example, based on the observed information matrix from a maximum likelihood analysis). Successive transformations are then based on the estimated covariance matrix from the previous iteration.

(3) Using the derived initial location and scale estimates for these "orthogonal" parameters, carry out, on suitably dimensioned grids, Cartesian product integration of the functions of interest. In practice, it is efficient to begin with a small grid size (n = 4 or n = 5) and then to gradually increase the grid size until stable answers are obtained both within and between the last two grid sizes used.

(4) Iterate, successively updating the mean and covariance estimates, until stable results are obtained both within and between grids of specified dimension.
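The four-step strategy can be sketched as follows (an illustrative sketch assuming NumPy, with all names ours; here the Cholesky factor of the current covariance estimate supplies the centring, scaling and approximately "orthogonalising" linear transformation):

```python
import numpy as np

def iterated_product_rule(log_post, mu0, cov0, n=10, iters=5):
    """Iterated Cartesian-product Gauss-Hermite rule: at each pass,
    transform to approximately orthogonal coordinates via the Cholesky
    factor of the current covariance estimate, integrate exp(log_post)
    on the product grid, then update the mean and covariance estimates."""
    t, w = np.polynomial.hermite.hermgauss(n)
    mu, cov = np.asarray(mu0, float), np.asarray(cov0, float)
    k = mu.size
    for _ in range(iters):
        L = np.linalg.cholesky(cov)                  # linear "orthogonalising" map
        T = np.array(np.meshgrid(*([t] * k), indexing="ij")).reshape(k, -1)
        W = np.array(np.meshgrid(*([w] * k), indexing="ij")).reshape(k, -1)
        Z = mu[:, None] + np.sqrt(2.0) * (L @ T)     # lattice of integration points
        # product weights, including the Jacobian of the transformation
        M = np.prod(W * np.exp(T ** 2), axis=0) * 2.0 ** (k / 2.0) * np.prod(np.diag(L))
        g = np.exp(log_post(Z))
        c = np.sum(M * g)                            # normalising constant
        mu = Z @ (M * g) / c                         # updated mean estimate
        D = Z - mu[:, None]
        cov = (D * (M * g)) @ D.T / c                # updated covariance estimate
    return c, mu, cov
```

In practice, as the text suggests, one would also grow the grid size n between passes until the answers agree both within and between grids.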
The efficiency of numerical quadrature methods is often very dependent on the particular parametrisation used; see Hills and Smith (1992, 1993) and Marriott and Smith (1992); for related discussion, see Marriott (1988). Full details of this approach will be given in the volume Bayesian Computation; for a brief introduction, see Smith (1991).

For problems involving larger numbers of parameters, say between six and twenty, Cartesian product approaches become computationally prohibitive and alternative approaches to numerical integration are required. One possibility is the use of spherical quadrature rules (Stroud, 1971), derived by transforming from Cartesian to spherical polar coordinates and constructing optimal integration formulae based on symmetric configurations over concentric spheres. Other relevant references on numerical quadrature include Shaw (1988b), Flournoy and Tsutakawa (1991), O'Hagan (1991) and Dellaportas and Wright (1992).

The ideas outlined above relate to the use of numerical quadrature formulae to implement Bayesian statistical methods. It is amusing to note that the roles can be reversed and Bayesian statistical methods used to derive optimal numerical quadrature formulae! See, for example, Diaconis (1988b) and O'Hagan (1992); for related discussion, see Kass and Slate (1992).

5.5.3 Importance Sampling

The importance sampling approach to numerical integration is based on the observation that, if f is a function and g is a probability density function, then

∫ f(x) dx = ∫ [f(x)/g(x)] g(x) dx,

which suggests the "statistical" approach of generating a sample from the distribution function G (referred to in this context as the importance sampling distribution) and using the average of the values of the ratio f/g as an unbiased estimator of ∫ f(x) dx. However, the variance of such an estimator clearly depends critically on the choice of G, it being desirable to choose g to be "similar" to the shape of f. In multiparameter Bayesian contexts, exploitation of this idea requires designing importance sampling distributions which are efficient for the kinds of integrands arising in typical Bayesian applications. A considerable amount of work has focused on the use of multivariate normal or Student forms, or modifications thereof.
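In its basic form the estimator is only a few lines (an illustrative sketch assuming NumPy; the choice of a heavier-tailed standard Cauchy as g in the usage example is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def importance_estimate(f, g_logpdf, g_sample, n=50_000):
    """Estimate the integral of f by averaging the ratio f/g over draws
    from the importance sampling distribution G; also return a standard
    error for the estimate."""
    x = g_sample(n)
    r = f(x) / np.exp(g_logpdf(x))     # values of the ratio f/g
    return r.mean(), r.std() / np.sqrt(n)

# Example: integral of exp(-x^2/2) over the real line (true value
# sqrt(2*pi)), using a standard Cauchy as the importance distribution.
est, se = importance_estimate(
    lambda x: np.exp(-0.5 * x**2),
    lambda x: -np.log(np.pi * (1.0 + x**2)),   # Cauchy log density
    rng.standard_cauchy,
)
```

The standard error reported here is exactly the quantity that a good choice of G (one "similar" in shape to f) keeps small.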
Much of the work on multivariate normal or Student importance sampling distributions has been motivated by econometric applications; see, for example, the contributions of Kloek and van Dijk (1978), van Dijk and Kloek (1983, 1985), van Dijk et al. (1987) and Geweke (1988, 1989).

An alternative line of development (Shaw, 1988a) proceeds as follows. In the univariate case, if we choose g to be heavier-tailed than f, and if we work with y = G(x), the required integral is the expected value of f[G⁻¹(y)]/g[G⁻¹(y)] with respect to a uniform distribution on the interval (0, 1). Owing to the periodic nature of the ratio function over this interval, we are likely to get a reasonable approximation to the integral by simply taking some equally spaced set of points on (0, 1), rather than actually generating "uniformly distributed" random numbers. If f is a function of more than one argument (k, say), an exactly parallel argument suggests that the choice of a suitable g, followed by the use of a suitably selected "uniform" configuration of points in the k-dimensional unit hypercube, will provide an efficient multidimensional integration procedure.

As we remarked earlier, the effectiveness of all this depends on choosing a suitable G, bearing in mind that we need to have available a flexible set of possible distributional shapes for which G⁻¹ is available explicitly. In the univariate case, one such family defined on ℜ is provided by considering the random variable

θ = a h(u) − (1 − a) h(1 − u),

where u is uniformly distributed on (0, 1), h : (0, 1) → ℜ is a monotone increasing function such that h(u) → ∞ as u → 1, and 0 ≤ a ≤ 1 is a constant. The choice a = 0.5 leads to symmetric distributions; as a → 0 or a → 1 we obtain increasingly skew distributions (to the left or right, respectively). The tail behaviour of the distributions is governed by the choice of the function h: for example, h(u) = log(u) leads to a family whose symmetric member is the logistic distribution, and h(u) = −tan[π(1 − u)/2] leads to a family whose symmetric member is the Cauchy distribution. The median is linear in a, and the moments of the distributions are polynomials in a (of corresponding order), so that sample information about such quantities provides (for any given choice of h) operational guidance on the appropriate choice of a.

To use this family in the multiparameter case, we again employ individual parameter transformations, so that all parameters belong to ℜ, together with "orthogonalising" transformations, so that parameters can be treated "independently". Since the effectiveness of the method depends on the choice of G, it is natural to consider an iterative importance sampling strategy which attempts to learn about an appropriate choice of G for each parameter. Moreover, part of this strategy requires the specification of "uniform" configurations of points in the k-dimensional unit hypercube.
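As a concrete sketch of this construction (taking the form θ = a h(u) − (1 − a) h(1 − u), as reconstructed above, at face value; the code and names are ours), the family can be evaluated directly over an equally spaced "uniform configuration":

```python
import numpy as np

def shaw_quantile(u, a, h):
    """Transform u in (0,1) into a value from the family
    theta = a*h(u) - (1-a)*h(1-u); a controls skewness, h the tails."""
    return a * h(u) - (1.0 - a) * h(1.0 - u)

# An equally spaced "uniform configuration" on (0,1), as suggested above,
# pushed through the member with h = log and a = 0.5 (the symmetric case).
u = np.arange(1, 100) / 100.0
theta = shaw_quantile(u, 0.5, np.log)
```

Evaluating the transform at u = 0.5 confirms the property quoted in the text that the median, (2a − 1) h(0.5), is linear in a.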
This problem has been extensively studied by number theorists, and systematic experimentation with various suggested forms of "quasi-random" sequences has identified effective forms of configuration for importance sampling purposes; for details, see Shaw (1988a). The general strategy is then the following.

(1) Reparametrise individual parameters so that the resulting working parameters all take values on the real line.

(2) Using initial estimates of the posterior mean vector and covariance matrix for the working parameters, transform to a centred, scaled, more "orthogonal" set of parameters.

(3) In terms of these transformed parameters, construct an importance sampling distribution as a product of univariate distributions g_j, j = 1, …, k, for "suitable" choices of g_j from a flexible family such as that described above.

(4) Use the inverse distribution function transformation to reduce the problem to that of calculating an average over a "suitable" uniform configuration in the k-dimensional hypercube.

(5) Use information from this "sample" to learn about skewness, tailweight, etc., for each g_j, and revise estimates of the mean vector and covariance matrix, hence choosing "better" g_j, j = 1, …, k.

(6) Iterate until the sample variance of replicate estimates of the integral value is sufficiently small.

For further advocacy and illustration of the use of (non-Markov-chain) Monte Carlo methods in Bayesian Statistics, see Stewart (1979, 1983, 1985, 1987), Stewart and Davis (1986), Shao (1989, 1990) and Wolpert (1991). Teichroew (1965) provides a historical perspective on simulation techniques.

5.5.4 Sampling-importance-resampling

Instead of just using importance sampling to estimate integrals, and hence to calculate posterior normalising constants and moments, we can also exploit the idea in order to produce simulated samples from posterior or predictive distributions. This technique is referred to by Rubin (1988) as sampling-importance-resampling (SIR). Our account is based on Smith and Gelfand (1992).

We begin by taking a fresh look at Bayes' theorem from this sampling-importance-resampling perspective, shifting the focus in Bayes' theorem from densities to samples. As a first step, we note the essential duality between a sample and the distribution from which it is generated: clearly, the distribution can generate the sample; conversely, given a sample we can re-create, at least approximately, the distribution (as a histogram, an empirical distribution function, a kernel density estimate, or whatever).
In terms of densities, Bayes' theorem defines the inference process as the modification of the prior density p(θ) to form the posterior density p(θ | z) through the medium of the likelihood function p(z | θ). Shifting to a sampling perspective, this corresponds to the modification of a sample from p(θ) to form a sample from p(θ | z) through the medium of the likelihood function p(z | θ).

To gain insight into the general problem of how a sample from one density may be modified to form a sample from a different density, consider, for mathematical convenience, the univariate case. Suppose that a sample of random quantities has been generated from a density g(θ), but that what is required is a sample from the density

h(θ) = f(θ) / ∫ f(θ) dθ,

where only the functional form of f(θ) is specified. Given f(θ) and the sample from g(θ), how can we derive a sample from h(θ)?

In cases where there exists an identifiable constant M > 0 such that f(θ)/g(θ) ≤ M for all θ, an exact sampling procedure follows immediately from the well-known rejection method for generating random quantities (see, for example, Ripley, 1987, p. 60):

(i) consider a θ generated from g(θ);
(ii) generate u from Un(u | 0, 1);
(iii) if u ≤ f(θ)/Mg(θ), accept θ; otherwise repeat (i)-(iii).

Any accepted θ is then a random quantity from h(θ). Given a sample of size N from g(θ), it is immediately verified that the expected sample size from h(θ) is M⁻¹N ∫ f(x) dx.

In cases where the bound M in the above is not readily available, we can approximate samples from h(θ) as follows. Given θ_1, …, θ_N from g(θ), calculate

q_i = w_i / Σ_{j=1}^{N} w_j,  where w_i = f(θ_i)/g(θ_i).

If we now draw θ* from the discrete distribution {θ_1, …, θ_N} having mass q_i on θ_i, then θ* is approximately distributed as a random quantity from h(θ). To see this, note that if P describes the actual distribution of θ*, then, under appropriate regularity conditions, P converges to the distribution defined by h(θ) as N → ∞.
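The exact rejection steps (i)-(iii) translate directly into code (an illustrative sketch assuming NumPy; the choice of a normal-shaped f, a Cauchy g and the bound M = 4 in the test below are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def rejection_sample(f, g_sample, g_pdf, M, n):
    """Exact sampling from h proportional to f, using proposals from g,
    valid when f(x)/g(x) <= M for all x; implements steps (i)-(iii)."""
    accepted = []
    while len(accepted) < n:
        x = g_sample()                      # (i)   draw a candidate from g
        u = rng.uniform()                   # (ii)  draw u ~ Un(0, 1)
        if u <= f(x) / (M * g_pdf(x)):      # (iii) accept with prob f/(M g)
            accepted.append(x)
    return np.array(accepted)
```

As the text notes, on average a fraction M⁻¹ ∫ f(x) dx of the proposals survive, so a loose bound M directly wastes draws.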
Since sampling with replacement is not ruled out, the sample size generated in this case can be as large as desired. Clearly, however, the less h(θ) resembles g(θ), the larger N will need to be if the distribution of θ* is to be a reasonable approximation to h(θ).

With this sampling-importance-resampling procedure in mind, let us return to the prior to posterior sample process defined by Bayes' theorem. For fixed z, define f_z(θ) = p(z | θ) p(θ). Bayes' theorem then takes the simple form

p(θ | z) = f_z(θ) / ∫ f_z(θ) dθ.

If θ̂ maximising p(z | θ) is available, the rejection procedure given above can be applied to a sample from p(θ) to obtain a sample from p(θ | z) by taking, for pedagogical illustration, g(θ) = p(θ), f(θ) = f_z(θ) and M = p(z | θ̂). For each θ in the prior sample, we then accept θ into the posterior sample with probability

f_z(θ) / [M p(θ)] = p(z | θ) / p(z | θ̂).

The likelihood therefore acts in an intuitive way to define the resampling probability: those θ with high likelihoods are more likely to be represented in the posterior sample. Alternatively, if M is not readily available, we can use the approximate resampling method, which selects θ_i into the posterior sample with probability

q_i = p(z | θ_i) / Σ_{j=1}^{N} p(z | θ_j).

Again we note that this is proportional to the likelihood, so that the inference process via sampling proceeds in an intuitive way.

The sampling-resampling perspective outlined above opens up the possibility of novel applications of exploratory data analytic and computer graphical techniques in Bayesian statistics; see Albert (1993). For an illustration of the method in the context of sensitivity analysis and intractable reference analysis, see Stephens and Smith (1992). We shall not pursue these ideas further here, however, since the topic is more properly dealt with in the subsequent volume Bayesian Computation.
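The approximate (weighted bootstrap) version of this prior-to-posterior resampling is easily sketched (illustrative code assuming NumPy; the uniform-prior binomial example is ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def sir_posterior_sample(prior_draws, loglik, m):
    """Approximate posterior sample: resample the prior sample with
    probabilities proportional to the likelihood (weighted bootstrap)."""
    l = loglik(prior_draws)
    w = np.exp(l - l.max())        # likelihoods, rescaled for stability
    q = w / w.sum()                # resampling probabilities q_i
    idx = rng.choice(prior_draws.size, size=m, replace=True, p=q)
    return prior_draws[idx]

# Example (ours): uniform prior on a binomial parameter, 7 successes in
# 10 trials; the resampled draws approximate a Beta(8, 4) posterior.
prior = rng.uniform(size=200_000)
post = sir_posterior_sample(prior, lambda t: 7 * np.log(t) + 3 * np.log1p(-t), 50_000)
```

Note that only the likelihood enters the resampling weights, exactly as in the text: the prior enters through the sample being resampled.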
5.5.5 Markov Chain Monte Carlo

The key idea is very simple. Suppose that we wish to generate a sample from a posterior distribution p(θ | z) for θ ∈ Θ ⊆ ℜ^k, but cannot do this directly. However, suppose that we can construct a Markov chain with state space Θ which is straightforward to simulate from, and whose equilibrium distribution is p(θ | z). If we then run the chain for a long time, simulated values of the chain can be used as a basis for summarising features of the posterior p(θ | z) of interest. To implement this strategy, we simply need algorithms for constructing chains with specified equilibrium distributions. For recent accounts and discussion, see, for example, Gelfand and Smith (1990), Casella and George (1992), Gelman and Rubin (1992a, 1992b), Geyer (1992), Raftery and Lewis (1992), Ritter and Tanner (1992), Roberts (1992), Tierney (1992), Besag and Green (1993), Chan (1993), Gilks et al. (1993) and Smith and Roberts (1993); see, also, Tanner and Wong (1987) and Tanner (1991).

Under suitable regularity conditions, asymptotic results exist which clarify the sense in which the sample output from a chain with equilibrium distribution p(θ | z) can be used to mimic a random sample from p(θ | z), or to estimate the expected value, with respect to p(θ | z), of a function g(θ) of interest. If θ¹, θ², …, θ^t, … is a realisation from an appropriate chain, typically available asymptotic results as t → ∞ include

θ^t → θ ~ p(θ | z), in distribution,

and

(1/t) Σ_{i=1}^{t} g(θ^i) → E{g(θ) | z}, almost surely.

Clearly, successive θ^t will be correlated, so that, if the first of these asymptotic results is to be exploited to mimic a random sample from p(θ | z), suitable spacings will be required between realisations used to form the sample, or parallel independent runs of the chain might be considered. The second of the asymptotic results implies that ergodic averaging of a function of interest over realisations from a single run of the chain provides a consistent estimator of its expectation. In what follows, we outline two particular forms of Markov chain scheme which have proved particularly convenient for a range of applications in Bayesian statistics.

The Gibbs Sampling Algorithm

Suppose that θ, the vector of unknown quantities appearing in Bayes' theorem, has components θ_1, …, θ_k, and that our objective is to obtain summary inferences from the joint posterior p(θ | z) = p(θ_1, …, θ_k | z). As we have already observed in this section, except in simple, stylised cases this will typically lead, unavoidably, to challenging problems of numerical integration.
In fact, this apparent need for sophisticated numerical integration technology can often be avoided by recasting the problem as one of iterative sampling of random quantities from appropriate distributions to produce an appropriate Markov chain. To this end, we note that the so-called full conditional densities

p(θ_i | z, θ_1, …, θ_{i−1}, θ_{i+1}, …, θ_k),  i = 1, …, k,

are typically easily identified, as functions of θ_i, by inspection of the form of p(θ | z) ∝ p(z | θ) p(θ) in any given application. Suppose then that, given an arbitrary set of starting values

θ_2^{(0)}, …, θ_k^{(0)}

for the unknown quantities, we implement the following iterative procedure:

draw θ_1^{(1)} from p(θ_1 | z, θ_2^{(0)}, …, θ_k^{(0)});
draw θ_2^{(1)} from p(θ_2 | z, θ_1^{(1)}, θ_3^{(0)}, …, θ_k^{(0)});
draw θ_3^{(1)} from p(θ_3 | z, θ_1^{(1)}, θ_2^{(1)}, θ_4^{(0)}, …, θ_k^{(0)});
…
draw θ_k^{(1)} from p(θ_k | z, θ_1^{(1)}, …, θ_{k−1}^{(1)});

and so on. Then (see, for example, Geman and Geman, 1984; Roberts and Smith, 1994), θ^t = (θ_1^{(t)}, …, θ_k^{(t)}) is a realisation of a Markov chain with transition probabilities given by

p(θ, θ') = Π_{i=1}^{k} p(θ'_i | z, θ'_1, …, θ'_{i−1}, θ_{i+1}, …, θ_k),

and, as t → ∞, (θ_1^{(t)}, …, θ_k^{(t)}) tends in distribution to a random vector whose joint density is p(θ | z); in particular, θ_i^{(t)} tends in distribution to a random quantity whose density is p(θ_i | z). Now suppose that the above procedure is continued through t iterations and is independently replicated m times, so that from the current iteration we have m replicates of the sampled vector θ^t = (θ_1^{(t)}, …, θ_k^{(t)}). It follows that, for large t, the replicates (θ_{i1}^{(t)}, …, θ_{im}^{(t)}) are approximately a random sample from p(θ_i | z).
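As a small illustration of the iterative procedure (our own example, not the book's): for a normal sample with unknown mean and precision, under the hypothetical priors μ ~ N(0, 10²) and λ ~ Gamma(1, 1), both full conditionals are standard distributions and the scheme becomes:

```python
import numpy as np

rng = np.random.default_rng(3)

def gibbs_normal(x, iters=6_000, burn=1_000):
    """Gibbs sampler for N(x | mu, 1/lam) data under the hypothetical
    priors mu ~ N(0, 10^2) and lam ~ Gamma(1, 1): each full conditional
    is a standard distribution, sampled in turn."""
    n, xbar = x.size, x.mean()
    mu, lam = 0.0, 1.0                      # arbitrary starting values
    draws = []
    for t in range(iters):
        # p(mu | lam, x): normal, combining the N(0, 100) prior and the data
        prec = 1.0 / 100.0 + n * lam
        mu = rng.normal(n * lam * xbar / prec, 1.0 / np.sqrt(prec))
        # p(lam | mu, x): gamma, combining the Gamma(1, 1) prior and the data
        lam = rng.gamma(1.0 + 0.5 * n, 1.0 / (1.0 + 0.5 * np.sum((x - mu) ** 2)))
        if t >= burn:
            draws.append((mu, lam))
    return np.array(draws)
```

After discarding an initial burn-in, the retained (μ, λ) pairs are treated as (correlated) draws from the joint posterior, exactly in the sense of the asymptotic results quoted above.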
Thus, by making m suitably large, an estimate p̃(θ_i | z) of the marginal density p(θ_i | z) is easily obtained, either as a kernel density estimate derived from the replicates (θ_{i1}^{(t)}, …, θ_{im}^{(t)}), or from the average

p̃(θ_i | z) = m^{−1} Σ_{l=1}^{m} p(θ_i | z, θ_{jl}^{(t)}, j ≠ i),

the average being over the replicated conditioning values (which have an approximate p(θ | z) distribution for large t). Similarly, inferences for arbitrary functions of θ_1, …, θ_k are easily obtained (we simply form a sample of the appropriate function from the samples of the θ_i's), as are predictions (for example, p̃(y | z) = m^{−1} Σ_{l=1}^{m} p(y | θ_l^{(t)})). We note, in particular, that simulation approaches are ideally suited to providing summary inferences: we simply report an appropriate summary of the sample.

So far as sampling from the full conditionals p(θ_i | z, θ_j, j ≠ i) is concerned, either the full conditionals assume familiar forms, in which case computer routines are typically already available, or they are simple arbitrary mathematical forms, in which case general stochastic simulation techniques, such as envelope rejection and the ratio of uniforms, are available and can be adapted to the specific forms (see, for example, Devroye, 1986; Ripley, 1987; Wakefield et al., 1991; Gilks, 1992; Gilks and Wild, 1992; Dellaportas and Smith, 1993).

The potential of this iterative scheme for the routine implementation of Bayesian analysis has been demonstrated in detail for a wide variety of problems: see, for example, Gelfand and Smith (1990), Gelfand et al. (1990), Carlin and Gelfand (1991) and Gilks et al. (1993). We shall not provide a more extensive discussion here, since illustration of the technique in complex situations more properly belongs to the second volume of this work.

The Metropolis-Hastings Algorithm

This algorithm constructs a Markov chain θ¹, θ², …, θ^t, … with state space Θ and equilibrium distribution p(θ | z) by defining the transition probability from θ^t = θ to the next realised state θ^{t+1} as follows. Let q(θ, θ') denote a (for the moment arbitrary) transition probability function, such that, if θ^t = θ, the vector θ' drawn from q(θ, θ') is considered as a proposed possible value for θ^{t+1}. However, a further randomisation now takes place. With some probability α(θ, θ'), we actually accept θ^{t+1} = θ'; otherwise, we reject the value generated from q(θ, θ') and set θ^{t+1} = θ. This construction defines a Markov chain with transition probabilities given by

p(θ, θ') = q(θ, θ') α(θ, θ') + I(θ' = θ) [1 − ∫ q(θ, θ'') α(θ, θ'') dθ''],

where I(·) is the indicator function.
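A minimal sketch of the algorithm (ours, assuming NumPy), for the common special case of a symmetric normal random-walk q, in which the acceptance probability reduces to min{1, p(θ' | z)/p(θ | z)}:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_walk_metropolis(log_post, theta0, step, iters=20_000, burn=2_000):
    """Metropolis-Hastings with a symmetric normal random-walk proposal;
    log_post need only be known up to an additive constant, i.e. the
    posterior only up to proportionality."""
    theta, lp = float(theta0), log_post(float(theta0))
    draws = []
    for _ in range(iters):
        prop = theta + step * rng.normal()        # proposal from q(theta, .)
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept with prob alpha
            theta, lp = prop, lp_prop             # move to theta'
        draws.append(theta)                       # otherwise remain at theta
    return np.array(draws[burn:])
```

Note that the code only ever evaluates differences of log posterior values, which is precisely the point made below about needing the target distribution only up to proportionality.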
It is important to note that the (equilibrium) distribution of interest, p(θ | z), only enters p(θ, θ') through the ratio p(θ' | z)/p(θ | z). This is quite crucial, since it means that knowledge of the distribution up to proportionality (given by the likelihood multiplied by the prior) is sufficient for implementation. If now we set

α(θ, θ') = min{1, [p(θ' | z) q(θ', θ)] / [p(θ | z) q(θ, θ')]},

it is easy to check that

p(θ | z) p(θ, θ') = p(θ' | z) p(θ', θ),

which, provided that the thus far arbitrary q(θ, θ') is chosen to be irreducible and aperiodic on a suitable state space, is a sufficient condition for p(θ | z) to be the equilibrium distribution of the constructed chain. This general algorithm is due to Hastings (1970), building on Metropolis et al. (1953); see, also, Peskun (1973), Tierney (1992), Besag and Green (1993), Roberts and Smith (1994) and Smith and Roberts (1993).

5.6 DISCUSSION AND FURTHER REFERENCES

5.6.1 An Historical Footnote

Blackwell (1988) gave a very elegant demonstration of the way in which a simple finite additivity argument can be used to give powerful insight into the relation between frequency and belief probability. The calculation involved has added interest in that, according to Stigler (1982), it might very well have been made by Bayes himself. The argument goes as follows.

Suppose that the 0-1 observables x_1, …, x_{n+1} are finitely exchangeable. We observe z = (x_1, …, x_n), and wish to evaluate P(x_{n+1} = 1 | x_1, …, x_n). Writing s = x_1 + ⋯ + x_n and p(t) = P(x_1 + ⋯ + x_{n+1} = t), this probability is, by virtue of exchangeability, easily seen to be equal to

(s + 1) p(s + 1) / [(s + 1) p(s + 1) + (n + 1 − s) p(s)],

which, if p(s) ≅ p(s + 1), and s and n − s are not too small, is essentially the observed frequency.
This can be interpreted as follows. If, before observing x_1, …, x_n, we considered s and s + 1 to be about equally plausible as values for x_1 + ⋯ + x_{n+1}, then the resulting posterior odds for x_{n+1} = 1 will be essentially the frequency odds based on the first n trials. Inverting the argument, we see that if one wants to have this "convergence" of beliefs and frequencies, it is necessary that p(s) ≅ p(s + 1). But what does this entail? Reverting to an infinite exchangeability assumption, and hence the familiar binomial framework, suppose we require that p(θ) be chosen such that

p(s) = ∫₀¹ [(n + 1)! / (s! (n + 1 − s)!)] θ^s (1 − θ)^{n+1−s} p(θ) dθ

does not depend on s. An easy calculation shows that this is satisfied if p(θ) is taken to be uniform on (0, 1), the so-called Bayes (or Bayes-Laplace) Postulate. Stigler (1982) has argued that an argument like the above could have been Bayes' motivation for the adoption of this uniform prior.

5.6.2 Prior Ignorance

To many attracted to the formalism of the Bayesian inferential paradigm, the idea of a noninformative prior distribution, representing "ignorance" and "letting the data speak for themselves", has proved extremely seductive, often being regarded as synonymous with providing objective inferences. It will be clear from the general subjective perspective we have maintained throughout this volume that we regard this search for "objectivity" to be misguided; it will also be clear from our detailed development in Section 5.4 that we recognise the rather special nature and role of the concept of a "minimally informative" prior specification, appropriately defined! In any case, the considerable body of conceptual and theoretical literature devoted to identifying "appropriate" procedures for formulating prior representations of "ignorance" constitutes a fascinating chapter in the history of Bayesian Statistics. In this section we shall provide an overview of some of the main directions followed in this search for a Bayesian "Holy Grail".

In the early works by Bayes (1763) and Laplace (1814/1952), the definition of a noninformative prior is based on what has now become known as the principle of insufficient reason, or the Bayes-Laplace postulate (see Section 5.6.1). According to this principle, in the absence of evidence to the contrary, all possibilities should have the same initial probability. In particular, if an unknown quantity, φ, say, can only take a finite number, M, say, of values, the noninformative prior suggested by the principle is the discrete uniform distribution p(φ) = {1/M, …, 1/M}. This may, at first sight, seem intuitively reasonable.
"ignorance about 0" should surely imply "equal ignorance" about a onetoone transformation of o.' . where In effect. p.the (often improper) density A ( 0 ) x h(o)'. However. in countably infinite. f element i s h ( ~ 5 ) ' /and that natural length elements o Riemannian nietrics are invariant to reparametrisation. Jeffreys noted that the logarithmic divergence locally behaves like the square of a distance. involving a parametric model which depends on a single parameter o. If the space.. log log I t r. for the case of the integers.p(.23.. of @ values is a continuum (say. if some procedure yields p ( o ) as a noninformative prior for o and the same procedure yields p ( < ) as a noninformative prior for a onetoone transformation ( = <( o)of Q. a uniform distribution for o implies a nonuniform distribution for any nonlinear monotone transformation of 62 and thus the BayesLaplace postulate is inconsistent in the sense that. However. 238) suggested..consistency would seem to demand that p ( <)d< = p(o)do.. with respect to an experiment f = {.x . . as indicated in Example 5.'. Based on these invariance considerations.2. discrete cases the uniform (now intproper) prior is known to produce unappealing results. embedding the discrete problem within a continuous framework and subsequently discretising the resulting reference prior for the continuous case may produce better results. the prior ir(/I) x I/'..o.Y. a procedure for obtaining the "ignorance" prior should presumably be invariant under onetoone reparametrisation. but Example 5. more generally. determined by a Riemannian metric. intuitively. Moreover. Kass ( 1989) elaborated on this geometrical interpretation by arguing that. I6 showed that even in simple. Specifically.. the real line) the principle of' insufficient reason has been interpreted as requiring a uniform distribution over @. More recently... Jeffreys ( I939/ I96 I . t ) = 1. finite. whose natural length ~. 
discrete cases care can be required in appropriately defining the unknown yucmtip of inreresf. In an illuminating paper.=l.intuitively reasonable. Rissanen ( 1983) used a coding theory argument to motivate the prior ir(71) ..r lo)}.x ?t 1 x I log ?Z ~ 1 x . Jeffreys ( 1946)proposed as a noninformative prior. in the sense that equal mass . @. thus. natural volume elements generate "uniform" measures on manifolds.
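As a worked instance of Jeffreys' rule (our own illustrative example, not from the text), consider a single Bernoulli observation with p(x | φ) = φ^x (1 − φ)^{1−x}:

```latex
h(\varphi)
  = -\,\mathrm{E}\!\left[\frac{\partial^{2}}{\partial \varphi^{2}}
      \log p(x \mid \varphi)\right]
  = \mathrm{E}\!\left[\frac{x}{\varphi^{2}}
      + \frac{1-x}{(1-\varphi)^{2}}\right]
  = \frac{1}{\varphi}+\frac{1}{1-\varphi}
  = \frac{1}{\varphi(1-\varphi)},
\qquad
\pi(\varphi) \propto \varphi^{-1/2}(1-\varphi)^{-1/2}.
```

That is, Jeffreys' prior here is the Beta(1/2, 1/2) density. The invariance property can be checked directly: under ζ = 2 arcsin √φ one has dφ/dζ = sin(ζ/2) cos(ζ/2), so h(ζ) = h(φ)(dφ/dζ)² = 1, and the rule yields a uniform prior for ζ, consistent with transforming the prior for φ.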
In his work, Jeffreys explored the implications of such a noninformative prior for a large number of inference problems. He found that his rule (by definition restricted to a continuous parameter) works well in the one-dimensional case, but that it can lead to unappealing results (Jeffreys, 1939/1961, p. 182) when one tries to extend it to multiparameter situations. The procedure was, moreover, rather ad hoc, in that there are many other procedures (some of which he described) which exhibit the required type of invariance. His intuition as to what is required, however, was rather good. Jeffreys' solution for the one-dimensional continuous case has been widely adopted, and a number of alternative justifications of the procedure have been provided.

Perks (1947) used an argument based on the asymptotic size of confidence regions to propose a noninformative prior of the form

π(φ) ∝ s(φ)^{−1},

where s(φ) is the asymptotic standard deviation of the maximum likelihood estimate of φ. Under regularity conditions which imply asymptotic normality, this turns out to be equivalent to Jeffreys' rule.

Lindley (1961b) argued that, in practice, one can always replace a continuous range of φ by discrete values over a grid whose mesh size, δ(φ), say, describes the precision of the measuring process, and that a possible operational interpretation of "ignorance" is a probability distribution which assigns equal probability to all points of this grid. To determine δ(φ) in the context of an experiment e = {X, Φ, p(x | φ)}, Lindley showed that if the quantity can only take the values φ or φ + δ(φ), then the amount of information that e may be expected to provide about φ is 2δ²(φ)h(φ). This expected information will be independent of φ if δ(φ) ∝ h(φ)^{−1/2}, thus defining an appropriate mesh; arguing as before, equal probabilities on such a grid imply, in the continuous case, Jeffreys' prior π(φ) ∝ h(φ)^{1/2}. Akaike (1978a) used a related argument to justify Jeffreys' prior as "locally impartial".

Welch and Peers (1963) and Welch (1965) discussed conditions under which there is formal mathematical equivalence between one-dimensional Bayesian credible regions and corresponding frequentist confidence intervals. They showed that, under suitable regularity assumptions, one-sided intervals asymptotically coincide if the prior used for the Bayesian analysis is Jeffreys' prior. Hartigan (1966b) and Peers (1968) discuss two-sided intervals, and Peers (1965) later showed that the argument does not extend to several dimensions. Tibshirani (1989), Mukerjee and Dey (1993) and Nicolau (1993) extend the analysis to the case where there are nuisance parameters.
this implies that n ( 0 )= / / ( 0 ) ' . p(X 1e) 4 ! (((0) f ( J ' ) ] .1 that ~ ( owhich implies a uniform prior for a function 4 = <(@) such that p(. Box and Tiao (1973. /  Using standard asymptotic theory. Following Jeffreys ( 1955). and constant similarily for current and future observations. Since the logarithmic score is proper. Kashyap ( 197 1 ) provided a similar. asymptotically.r I C ) is. Hartigan (197 I . under suitable regularity conditions and for large samples. 1983.3) argued for selecting a prior by convention to be used as a standard of'rejerencv. Jeffreys' prior. and hence is maximised by reporting the true distribution. for some functions C and f. an axiom system is used to justify the use of an information measure as a payoff function and Jeffreys' prior is shown to be a minimax solution in a two personzero sum game. such that. where the statistician chooses the "noninformative" prior and nature chooses the "true" prior. this will happen if .d)) is If.15). They suggested that the principle of insufficient reason may be sensible in location problems. Chapter 5) defines a similarity measure for events E. Good ( 1969)derived Jeffreys' prior as the"least favourable" initial distribution with respect to a logarithmic scoring rule. Jeffreys' prior may technically be described. again. 2 . more detailed argument. and for ) proposed as a conventional prior " ( 0 ) a modcl parameter f. Section 1. which is. they showed that. under suitable regularity conditions. in particular. I: to be P ( E n F ) / P ( E ) P ( F ) shows that Jeffreys' priorensures. as a minimax solution to the problem of scientific reporting when the utility function is the logarithmic score function. in the sense that it minimises the expected score from reporting the true distribution. one uses the discrepancy measure as a natural loss function (see Definition 3. 
Similarly, Hartigan (1965) reported that the prior density which minimises the asymptotic bias of the estimator d of φ associated with the loss function l(d, φ) given by the discrepancy measure (see Definition 3.15) is, at least approximately, Jeffreys' prior.
Unfortunately, although many of the arguments summarised above generalise to the multiparameter continuous case, leading to the so-called multivariate Jeffreys' rule π(θ) ∝ |H(θ)|^{1/2}, where H(θ) is Fisher's information matrix, the results thus obtained typically have intuitively unappealing implications. An example of this, pointed out by Jeffreys himself (Jeffreys, 1939/1961, p. 182), is provided by the simple location-scale problem, where the multivariate rule leads to π(θ, σ) ∝ σ⁻², with θ the location and σ the scale parameter; this conflicts with the use of the widely adopted reference prior π(μ, σ) ∝ σ⁻¹.

Example 5.25. (Univariate normal model). Let {x₁, ..., xₙ} be a random sample from N(x | μ, λ), and consider σ, the (unknown) standard deviation. In the case of known mean, say μ = 0, the appropriate (univariate) Jeffreys' prior is π(σ) ∝ σ⁻¹, which implies that the posterior distribution of [Σᵢ xᵢ²]/σ² is χ². In the case of unknown mean, if we used the multivariate Jeffreys' prior π(μ, σ) ∝ σ⁻², the posterior distribution of σ would be such that [Σᵢ (xᵢ − x̄)²]/σ² is, again, χ², in that one does not lose any degrees of freedom even though one has lost the knowledge that μ = 0. This is widely recognised as unacceptable. For an illustration of this, see Geisser and Cornfield (1963); for an elaboration of the idea, see Zellner (1986a), who extends the analysis by conditioning on an ancillary statistic; see, also, Kass (1990).

At this point, one may wonder just what has become of the intuition motivating the arguments outlined above. Similar problems arise with other intuitively appealing desiderata, since in higher dimensions the forms obtained may depend on the path followed to obtain the limit.
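The degrees-of-freedom discrepancy in Example 5.25 can be checked numerically. In the sketch below (an illustration under standard conjugate results, not part of the original text), the prior π(μ, σ) ∝ σ^{−k} leads, after integrating μ out analytically, to p(σ | x) ∝ σ^{−(n+k−1)} exp(−S/2σ²) with S = Σ(xᵢ − x̄)², so that S/σ² is a posteriori χ² on n + k − 2 degrees of freedom: n for the multivariate Jeffreys rule (k = 2), n − 1 for the reference prior (k = 1).

```python
import math

def posterior_mean_of_S_over_sigma2(n, S, k, grid=20000, lo=1e-3, hi=50.0):
    # Posterior for sigma under prior pi(mu, sigma) ∝ sigma^{-k}, after
    # integrating mu out: p(sigma|x) ∝ sigma^{-(n+k-1)} exp(-S/(2 sigma^2)).
    # Returns E[S / sigma^2 | x] by midpoint quadrature on a sigma grid;
    # since S/sigma^2 is chi-squared on n+k-2 d.f., the answer is n+k-2.
    num = den = 0.0
    for i in range(grid):
        s = lo + (hi - lo) * (i + 0.5) / grid
        w = s ** (-(n + k - 1)) * math.exp(-S / (2 * s * s))
        num += w * S / (s * s)
        den += w
    return num / den

n, S = 10, 8.0
# multivariate Jeffreys rule, pi ∝ sigma^{-2}: S/sigma^2 ~ chi^2 on n d.f.
assert abs(posterior_mean_of_S_over_sigma2(n, S, k=2) - n) < 0.05
# reference prior, pi ∝ sigma^{-1}: S/sigma^2 ~ chi^2 on n-1 d.f.
assert abs(posterior_mean_of_S_over_sigma2(n, S, k=1) - (n - 1)) < 0.05
```

The extra degree of freedom under k = 2 is exactly the "nothing lost by not knowing μ" anomaly discussed in the example.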
Although the implied information limits are mathematically well-defined in one dimension, difficulties of this kind reappear in several dimensions; see Stein (1962). Similarly, the Box and Tiao suggestion of a uniform prior following transformation to a parametrisation ensuring data translation generalises, in the multiparameter setting, to the requirement of uniformity following a transformation which ensures that credible regions are of the same size. The kind of problem exemplified above led Jeffreys to the ad hoc recommendation, widely adopted in the literature, of independent a priori treatment of location and scale parameters: applying his rule separately to each of the two subgroups of parameters, and then multiplying the resulting forms together to arrive at the overall prior specification.
Unfortunately, in several dimensions, such credible regions can be of the same size but very different in form. If it is conceded that Jeffreys' original requirement of invariance under reparametrisation remains perhaps the most intuitively convincing, it follows that approaches which do not have this property, whatever their apparent motivating intuition, should be regarded as unsatisfactory. Such approaches include the use of limiting forms of conjugate priors, as in Haldane (1948), Novick and Hall (1965), DeGroot (1970, Chapter 10) and Piccinato (1973, 1977b); a predictivistic version of the principle of insufficient reason, Geisser (1984); and different forms of information-theoretical arguments, such as those put forward by Zellner (1977, 1989), Geisser (1979) and Torgersen (1981). Other information-based suggestions are those of Eaton (1982), Spall and Hill (1990) and Rodríguez (1991). Partially satisfactory results have nevertheless been obtained in multiparameter problems where the parameter space can be considered as a group of transformations of the sample space. Invariance considerations within such a group suggest the use of relatively invariant (Hartigan, 1964) priors like the Haar measures. This idea was pioneered by Barnard (1952); see, also, Villegas (1969, 1991), Chang and Villegas (1986) and Chang and Eaves (1990). It is reassuring that, in those one-dimensional problems for which a group of transformations does exist, the right Haar measure coincides with the relevant Jeffreys' prior. The problem, of course, is that a large group of interesting models do not have any group structure, so that these arguments cannot produce general solutions. For some undesirable consequences of the left Haar measure, see Bernardo (1978b).

Example 5.26. (Standardised mean). Let {x₁, ..., xₙ} be a random sample from a normal distribution N(x | μ, λ). The standard prior recommended by group invariance arguments is π(μ, σ) ∝ σ⁻¹.
Stone (1965) recognised that, in an appropriate sense, it should be possible to approximate the results obtained using a noninformative prior by those obtained using a convenient sequence of proper priors. He went on to show that, if a group structure is present, the corresponding right Haar measure is the only prior for which such a desirable convergence is obtained. However, even when the parameter space may be considered as a group of transformations there is no definitive answer: in such situations, the right Haar measures are the obvious choices, and yet even these are open to criticism. Further developments involving Haar measures are provided by Novick (1969), Zidek (1969), Villegas (1971, 1977a, 1981) and Florens (1978, 1982); Dawid (1983b) provides an excellent review of work up to the early 1980's. Maximising the expected information (as opposed to maximising the expected missing information) gives invariant, but unappealing, results, producing priors that can have finite support (Berger et al., 1989). Although the prior π(μ, σ) ∝ σ⁻¹ gives adequate results if one wants to make inferences about either μ or σ, it is quite unsatisfactory if inferences about the standardised mean φ = μ/σ are required. Stone and Dawid (1972) show that the posterior distribution of φ obtained from such a prior depends on the data only through a statistic t whose sampling distribution, p(t | μ, σ) = p(t | φ), depends only on φ.
One would, therefore, expect to be able to "match" the original inferences about φ by the use of p(t | φ) together with some appropriate prior for φ. However, no such prior exists. On the other hand, the reference prior relative to the ordered partition (φ, σ) is π(φ, σ) ∝ (2 + φ²)^{−1/2} σ⁻¹, and the corresponding posterior distribution for φ contains a factor in square brackets which is proportional to p(t | φ), so that the inconsistency disappears. This type of marginalisation paradox, further explored by Dawid, Stone and Zidek (1973), appears in a large number of multivariate problems and makes it difficult to believe that, for any given model, a single prior may be usefully regarded as "universally" noninformative; Jaynes (1980) disagrees. An acceptable general theory for noninformative priors should be able to provide consistent answers to the same inference problem whenever this is posed in different, but equivalent, forms. Although this idea has failed to produce a constructive procedure for deriving priors, it may be used to discard those methods which fail to satisfy this rather intuitive requirement.

Example 5.27. (Correlation coefficient). Let {(x₁, y₁), ..., (xₙ, yₙ)} be a random sample from a bivariate normal distribution, and suppose that inferences about the correlation coefficient ρ are required. It may be shown that if the prior is of the form π(μ₁, μ₂, σ₁, σ₂, ρ) = π(ρ)(σ₁σ₂)⁻ᵃ, for some a — a class which includes all proposed "noninformative" priors for this model that we are aware of — then the posterior distribution of ρ is given by π(ρ | x, y) = π(ρ | r),
where r is the sample correlation coefficient; thus, with this form of prior, r is sufficient, and the posterior distribution of ρ only depends on the data through r. Moreover, the sampling distribution of r, p(r | ρ), which involves the hypergeometric function, depends only on ρ. Hence, one would expect to be able to match, using this reduced model, the posterior π(ρ | r) given previously. Using the transformations λ = tanh⁻¹ρ and t = tanh⁻¹r, comparison between π(ρ | r) and p(r | ρ) shows that this is possible if and only if a = 1. Thus, to avoid inconsistency, the joint reference prior must be of the form π(ρ)(σ₁σ₂)⁻¹ with π(ρ) ∝ (1 − ρ²)⁻¹, which is precisely the reference prior relative to the natural order {ρ, μ₁, μ₂, σ₁, σ₂}. Jeffreys' prior for the univariate model based on r is found to be of the same form, π(ρ) ∝ (1 − ρ²)⁻¹ (see Lindley, 1965, pp. 215–219). On the other hand, it is easily checked that neither Jeffreys' multivariate prior nor the "two-step" Jeffreys' multivariate prior which separates the location and scale parameters is of this form. For further detailed discussion of this example, see Bayarri (1981). Once again, this example suggests that different noninformative priors may be appropriate depending on the particular function of interest or, more generally, on the ordering of the parameters. Finally, although marginalisation paradoxes disappear when one uses proper priors, to use proper approximations to noninformative priors as an approximate description of "ignorance" does not solve the problem either.
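The tanh⁻¹ transformation used above is Fisher's variance-stabilising z transform: the sampling variance of tanh⁻¹ r is approximately 1/(n − 3), almost independently of ρ. A quick simulation (illustrative code, not from the original text) confirms this:

```python
import math, random

def sample_corr(rho, n, rng):
    # one sample correlation coefficient r from n bivariate normal pairs,
    # generated as x = z1, y = rho*z1 + sqrt(1-rho^2)*z2
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
n, reps = 30, 4000
variances = []
for rho in (0.0, 0.5, 0.9):
    zs = [math.atanh(sample_corr(rho, n, rng)) for _ in range(reps)]
    m = sum(zs) / reps
    v = sum((z - m) ** 2 for z in zs) / reps
    variances.append(v)
    # var(atanh r) ≈ 1/(n-3), roughly the same for every rho
    assert abs(v - 1 / (n - 3)) < 0.01
```

This near-constant variance is what makes t = tanh⁻¹ r the natural scale on which to compare π(ρ | r) with p(r | ρ).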
Example 5.28. (Stein's paradox). Let x = {x₁, ..., xₙ} be a random sample from a multivariate normal distribution N(x | μ, I_k), with μ = (μ₁, ..., μₖ). The universally recommended "noninformative" prior for this model is π(μ₁, ..., μₖ) = 1, which may be approximated by a proper density with precision λ, where λ is very small. Let x̄ᵢ be the mean of the n observations from coordinate i, and let t = Σᵢ x̄ᵢ². The sampling distribution of nt is a noncentral χ² distribution with k degrees of freedom and noncentrality parameter nφ, where φ = Σᵢ μᵢ², so that E[t | φ] = φ + k/n. However, if inferences about φ are desired, the use of this prior overwhelms what the data have to say about φ. Indeed, with such a prior the posterior distribution of nφ is a noncentral χ² distribution with k degrees of freedom and noncentrality parameter nt, so that E[φ | z] = t + k/n. Thus, for large k, naive use of an apparently "noninformative" prior distribution can lead to posterior distributions whose corresponding credible regions have untenable coverage probabilities, in the sense that, for some region C, the corresponding posterior probabilities P(C | z) may be completely different from the conditional values P(C | θ) for almost all θ values. Such a phenomenon is often referred to as strong inconsistency (see, for example, Stone, 1976). For further details, see Stein (1959), Efron (1973), Bernardo (1979b) and Ferrándiz (1982). Jaynes (1968) introduced a more general formulation of the problem: he allowed for the existence of a certain amount of initial "objective" information, and then tried to determine a prior which reflected this initial information.
With, say, k = 100, n = 1 and t = 200, we have E[φ | z] ≈ 300, whereas the unbiased estimator based on the sampling distribution gives φ̂ = t − k/n = 100. However, by carefully distinguishing between parameters of interest and nuisance parameters, reference analysis avoids this type of inconsistency: the reference posterior for φ relative to p(t | φ) is π(φ | z) ∝ π(φ) p(t | φ), whose mode is close to φ̂. It may be shown that this is also the posterior distribution of φ derived from the reference prior relative to the ordered partition {φ, . . .} obtained by reparametrising to polar coordinates in the full model. An illuminating example is provided by the reanalysis by Bernardo (1979b, reply to the discussion) of Stone's (1976) Flatland example. For further discussion of strong inconsistency and related topics, see Appendix B.
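The numerical claims in Example 5.28 are easy to check by simulation. The sketch below (illustrative code, not from the original text) uses k = 100 coordinates with n = 1 observation each and true φ = 100, verifying both the sampling identity E[t | φ] = φ + k/n and the fact that the flat-prior posterior mean t + k/n sits near 300:

```python
import random

rng = random.Random(0)
k = 100
mu = [1.0] * k                      # true phi = sum mu_i^2 = 100
phi = sum(m * m for m in mu)

reps = 2000
ts = []
for _ in range(reps):
    x = [m + rng.gauss(0, 1) for m in mu]   # one N(mu_i, 1) draw per coordinate
    ts.append(sum(xi * xi for xi in x))
mean_t = sum(ts) / reps

# sampling theory: E[t | phi] = phi + k (n = 1), so t - k is unbiased for phi
assert abs(mean_t - (phi + k)) < 5

# flat prior pi(mu) = 1 gives mu | x ~ N(x, I), hence E[phi | x] = t + k:
# the naive posterior mean sits near 300 when the true phi is 100
flat_mean = mean_t + k
assert flat_mean - phi > 150
```

The ±k/n gap between the two answers is exactly the "overwhelming" effect of the uniform prior on the squared-norm parameter.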
Jaynes' solution is to introduce a "reference" density π(θ) in order to define an "invariantised" entropy, and to use the prior which maximises this expression, subject, again, to any initial "objective" information one might have. Jaynes suggests invariance arguments to select the reference density, if a convenient group of transformations is present; however, no general procedure is proposed. If no such information exists and θ can only take a finite number of values, Jaynes' maximum entropy solution reduces to the Bayes–Laplace postulate. If θ is continuous, the non-invariant entropy functional H{p(θ)} = −∫ p(θ) log p(θ) dθ no longer has a sensible interpretation in terms of uncertainty (see, also, Csiszár, 1985).

The quest for noninformative priors could be summarised as follows. (i) In the finite case, Jaynes' principle of maximising the entropy is convincing. (ii) In one-dimensional continuous regular problems, Jeffreys' prior is appropriate. (iii) The infinite discrete case can often be handled by suitably embedding the problem within a continuous framework. (iv) In continuous multiparameter situations there is no hope for a single, unique, "noninformative prior", appropriate for all the inference problems within a given model. To avoid having the prior dominate the posterior for some function φ of interest, the prior has to depend not only on the model but also on the parameter of interest or, more generally, on some notion of the order of importance of the parameters.

The reference prior theory introduced in Bernardo (1979b) and developed in detail in Section 5.4 avoids most of the problems encountered with other proposals. It reduces to Jaynes' form in the finite case and to Jeffreys' form in one-dimensional regular continuous problems, avoiding marginalisation paradoxes by insisting that the reference prior be tailored to the parameter of interest. However, subsequent work by Berger and Bernardo (1989) has shown that the heuristic arguments in Bernardo (1979b) can be misleading in complicated situations, thus necessitating more precise definitions; see Berger and Bernardo (1992a, 1992b). Context-specific "noninformative" Bayesian analyses have also been produced for specific classes of problems; these include dynamic models (Pole and West, 1989) and finite population survey sampling (Meeden and Vardeman, 1991).
Jaynes considered the entropy of a distribution to be the appropriate measure of uncertainty, subject to any "objective" information one might have. His arguments are quite convincing in the finite case, but cannot be extended to the continuous case, where the entropy functional loses its interpretation in terms of uncertainty. Unfortunately, in Jaynes' "invariantised" formulation, π(θ) must itself be a representation of ignorance about θ, so that no progress has been made.
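In the finite case, Jaynes' maximum entropy prior subject to a moment constraint has the exponential form p_i ∝ exp(λ x_i), with λ chosen to satisfy the constraint. The sketch below (illustrative code, not from the original text) solves Jaynes' classic die example by bisection, and confirms that with no effective constraint the solution collapses to the Bayes–Laplace uniform postulate:

```python
import math

def maxent_with_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    # Finite-case maximum entropy distribution subject to E[X] = target_mean.
    # The solution is p_i ∝ exp(lam * x_i); the induced mean is increasing
    # in lam, so lam can be found by bisection.
    def mean_for(lam):
        ws = [math.exp(lam * v) for v in values]
        z = sum(ws)
        return sum(w * v for w, v in zip(ws, values)) / z
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    ws = [math.exp(lam * v) for v in values]
    z = sum(ws)
    return [w / z for w in ws]

# Jaynes' die: faces 1..6 constrained to have mean 4.5
p = maxent_with_mean([1, 2, 3, 4, 5, 6], 4.5)
assert abs(sum(p) - 1) < 1e-12
assert abs(sum(pi * v for pi, v in zip(p, range(1, 7))) - 4.5) < 1e-6
assert p[5] > p[0]        # probability mass tilts toward larger faces

# unconstrained mean 3.5: the maxent solution is uniform (Bayes-Laplace)
q = maxent_with_mean([1, 2, 3, 4, 5, 6], 3.5)
assert all(abs(qi - 1 / 6) < 1e-6 for qi in q)
```

The difficulty discussed in the text is precisely that no analogue of this clean construction survives the passage to continuous θ without an extra reference density.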
"Noninformative" distributions have sometimes been criticised on the grounds that they are typically improper and may lead, for continuous parameters, to inadmissible estimates (see, for instance, Stein, 1956). However, posterior distributions derived from "noninformative" priors need not be themselves admissible, but only arbitrarily close to admissible posteriors; regarded as a "baseline" for admissible inferences, this seems to us a sensible compromise. A related objection is that the choice of a "noninformative" prior can seem arbitrary: why should one's knowledge, or ignorance, of a quantity depend on the experiment being used to determine it? (Lindley, 1972, p. 71). In general, we feel that it is sensible to choose a noninformative prior which expresses ignorance relative to information which can be supplied by a particular experiment: "If the experiment is changed, then the expression of relative ignorance can be expected to change correspondingly" (Box and Tiao, 1973, p. 46). Posterior distributions obtained from actual prior opinions could then be compared with those derived from a reference analysis in order to assess the relative importance of the initial opinions on the final inference; this approach was described in detail in Section 5.4, where reference priors are best seen as technical devices to produce reference posteriors. Finally, there can be no final word on this topic! For example, recent work by Eaton (1992), Clarke and Wasserman (1993), George and McCulloch (1993b) and Ye (1993) seems to open up new perspectives and directions; Ye (1993), in particular, derives reference priors for sequential experiments.
For many subjectivists, the initial density p(θ) is a description of the opinions held about θ, whereas "noninformative" priors depend on the likelihood function; this is recognised to be potentially inconsistent with a personal interpretation of probability. However, as we argued earlier, priors which reflect knowledge of the experiment can sometimes be genuinely appropriate in Bayesian inference, and may also have a useful role to play (see, e.g., the discussion of stopping rules in Section 5.1.4). Moreover, sensible "noninformative" priors may be seen to be, in an appropriate sense, limits of proper priors (Stone, 1963, 1965, 1970; Akaike, 1980a). Later work showed that the partition into parameters of interest and nuisance parameters may not go far enough, and that reference priors should be viewed relative to a given ordering — or, more generally, a given ordered grouping — of the parameters.

5.6.3 Robustness

In Section 4, we noted that some aspects of model specification, either for the parametric model or the prior distribution components, can seem arbitrary.
In this section, we shall provide some discussion of how insight and guidance into appropriate choices might be obtained. Consider, as an example, the case of the choice between normal and Student-t distributions as a parametric model component to represent departures of observables from their conditional expected values. The mechanism of Bayes' theorem involves multiplication of the two model components, p(x | θ) and p(θ), followed by normalisation — a somewhat "opaque" operation from the point of view of comparing specifications of p(x | θ) or p(θ) on a "what if?" basis. We begin with a simple, direct approach to examining the ways in which a posterior density for a parameter depends on the choices of parametric model or prior distribution components; for related discussion, see Ramsey and Novick (1980) and Smith (1983). Consider, for simplicity, a single observable x ∈ ℝ having a parametric density p(x | θ), with θ ∈ ℝ having prior density p(θ). Suppose we take logarithms in Bayes' theorem and subsequently differentiate with respect to θ, to obtain

(∂/∂θ) log p(θ | x) = (∂/∂θ) log p(x | θ) + (∂/∂θ) log p(θ).

The first term on the right-hand side is (apart from a sign change) a quantity known in classical statistics as the efficient score function (see, for example, Cox and Hinkley, 1974); this is the quantity which transforms the prior into the posterior, and hence opens the way to insight into the effect of a particular choice of p(x | θ) given the form of p(θ). Conversely, examination of the second term on the right-hand side, for given p(x | θ), may provide insight into the implications of the mathematical specification of the prior. With x̄ denoting the mean of n independent observables from a normal distribution with mean θ and precision λ, and defining s(x) = (d/dx) log p(x), where p(x) = ∫ p(x | θ) p(θ) dθ, this approach results in a linear form for the posterior mean in terms of s(x̄).
For convenience of exposition — and perhaps because the prior component is often felt to be the less secure element in the model specification — we shall focus the following discussion on the sensitivity of characteristics of p(θ | x) to the choice of p(θ), for given p(x | θ); similar ideas apply to the choice of p(x | θ) given the form of p(θ). We shall illustrate these ideas by considering the form of the posterior mean for θ when p(x | θ, λ) = N(x | θ, λ) and p(θ) is of "arbitrary" form.
It can be shown (see, for example, Pericchi and Smith, 1992) that E(θ | x) = x + λₙ⁻¹ s(x), where λₙ is the precision of the likelihood. Suppose we now carry out a "what if?" analysis by asking how the behaviour of the posterior mean depends on the mathematical form adopted for p(θ).

What if we take p(θ) to be normal? With p(θ) = N(θ | μ, λ₀), μ ∈ ℝ, the evaluation of p(x) and s(x) is straightforward: the reader can easily verify that p(x) will be normal, and hence s(x) will be a linear combination of x and the prior mean. The formula given for E(θ | x) therefore reproduces the weighted average of sample and prior means that we obtained in Section 5.2; the posterior mean is unbounded in x − μ, the departure of the observed mean from the prior mean.

What if we take p(θ) to be Student-t? With p(θ) = St(θ | μ, λ₀, α), the exact treatment of p(x) and s(x) becomes intractable. However, detailed but tedious analysis (Pericchi and Smith, 1992) provides an approximation under which, for small x − μ, the posterior mean is approximately linear in x − μ, whereas for x − μ very large the posterior mean approaches x.

What if we take p(θ) to be double-exponential? In this case, after some algebra — see Pericchi and Smith (1992) — it can be shown that E(θ | x) = w(x)(x + b) + [1 − w(x)](x − b), where w(x) is a weight function, 0 ≤ w(x) ≤ 1, and b depends on the prior scale, so that the posterior mean is bounded, with limits equal to x plus or minus a constant.

Examination of the three forms for E(θ | x) reveals striking qualitative differences. In the case of the normal, the shift of the posterior mean away from x is unbounded; in the case of the Student-t, the prior is eventually discounted and the posterior mean approaches x; in the case of the double-exponential, the shift is bounded by a constant.
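These three qualitative behaviours can be verified directly by quadrature. The sketch below (an illustration, not from the original text; the prior scales are arbitrary choices, and the Student-t case is taken with one degree of freedom, i.e., Cauchy) computes E(θ | x) for a single N(x | θ, 1) observation:

```python
import math

def post_mean(x, log_prior, lo=-60.0, hi=60.0, grid=8000):
    # E(theta | x) for one observation x ~ N(theta, 1), by midpoint
    # quadrature; log_prior is the unnormalised log prior density.
    num = den = 0.0
    for i in range(grid):
        th = lo + (hi - lo) * (i + 0.5) / grid
        w = math.exp(-0.5 * (x - th) ** 2 + log_prior(th))
        num += w * th
        den += w
    return num / den

normal = lambda th: -0.5 * th * th                 # N(0, 1) prior
student = lambda th: -math.log(1 + th * th)        # Student-t, 1 d.f. (Cauchy)
laplace = lambda th: -abs(th)                      # double-exponential prior

for x in (2.0, 10.0, 20.0):
    # normal prior: fixed weighted average x/2, so the shift x - E grows
    # without bound as x moves away from the prior mean
    assert abs(post_mean(x, normal) - x / 2) < 1e-3
# heavy-tailed prior: for extreme x the prior is discounted and E -> x
assert abs(post_mean(20.0, student) - 20.0) < 0.2
# double-exponential prior: the shift is bounded (close to 1 for large x)
assert abs(20.0 - post_mean(20.0, laplace) - 1.0) < 0.05
```

Plotting x − E(θ | x) against x for the three priors reproduces the linear, redescending and bounded influence curves described in the text.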
The approach illustrated above is well-suited to probing qualitative differences in the posterior by considering, individually, the effects of a small number of potential alternative choices of model component (parametric model or prior distribution). See Jeffreys (1939/1961) for seminal ideas relating to the effect of the tailweight of the distribution of the parametric model on posterior inferences, together with the seminal papers of Box and Tiao (1962, 1964); other relevant references include Masreliez (1975), O'Hagan (1979, 1981, 1988b), Main (1988) and Polson (1991). See Smith (1983) and Pericchi et al. (1993) for further discussion and elaboration. Suppose, instead, that someone has in mind a specific candidate component specification, p₀ say, but is all too aware that aspects of the specification have involved somewhat arbitrary choices. It is then natural to be concerned about whether posterior conclusions might be highly sensitive to the particular specification p₀, viewed in the context of alternative choices in an appropriately defined neighbourhood of p₀. In the case of specifying a prior component, such concern might be motivated by the fact that elicitation of prior opinion has only partly determined the specification (for example, by identifying a few quantiles), with considerable remaining arbitrariness in "filling out" the rest of the distribution. Here, a suitable neighbourhood of p₀ might consist of a class of priors all having the specified quantiles but with other characteristics varying: see, for example, O'Hagan and Berger (1988). From a mathematical perspective, this formulation of the robustness problem presents some intriguing challenges. How should one formulate interesting neighbourhood classes of distributions? How should one calculate, with respect to such prior classes, bounds on posterior quantities of interest such as expectations or probabilities? Should neighbourhoods be defined parametrically or nonparametrically? And, if nonparametrically, should the elements p of the neighbourhood be those such that the density ratio p/p₀ is bounded in some sense? Or such that the maximum difference in the probability assigned to any event under p and p₀ is bounded? Or such that p can be written as a "contamination" of p₀, p = (1 − ε)p₀ + εq, for small ε and q belonging to a suitable class? At the time of writing, this is an area of intensive research.
In the case of specifying a parametric component p₀ — for example, an "error" model for differences between observables and their (conditional) expected values — such concern might be motivated by definite knowledge of symmetry and unimodality, but an awareness of the arbitrariness of choosing a conventional distributional form such as normality. Here, a suitable neighbourhood might be formed by taking p₀ to be normal and forming a class of distributions whose tailweights deviate (lighter and heavier) from normal: see, for example, West (1981), Gordon and Smith (1993) and O'Hagan and Le (1994). Consideration of the qualitative differences identified above might likewise provide guidance regarding an otherwise arbitrary choice if, for example, one knew how one would like the Bayesian learning mechanism to react to an "outlying" x, far from the prior location. More generally, one must ask what measures of distance should be used to define a neighbourhood "close" to p₀.
As yet, few issues seem to be resolved, and we shall not, therefore, attempt a detailed overview. There are excellent reviews by Berger (1984a, 1985a, 1990, 1994), which together provide a wealth of further references. Relevant references include, for instance, Edwards et al. (1963), Dempster (1973, 1975), Hill (1975), Meeden and Isaacson (1977), Kadane and Chuang (1978), Berger (1980, 1982, 1985a), DeRobertis and Hartigan (1981), Hartigan (1983), Kadane (1984), Berger and Berliner (1986), Kempthorne (1986), Berger and O'Hagan (1988), Cuevas and Sanz (1988), Pericchi and Nazaret (1988), Polasek and Pötzelberger (1988, 1994), Carlin and Dempster (1989), Delampady (1989), Sivaganesan and Berger (1989, 1993), Berliner and Goel (1990), Delampady and Berger (1990), Doksum and Lo (1990), Ríos (1990, 1992), Wasserman and Kadane (1990, 1992a, 1992b), Angers and Berger (1991), Berger and Fan (1991), Berger and Mortera (1991b, 1994), Lavine (1991a, 1991b), Moreno and Cano (1991), Pericchi and Walley (1991), Pötzelberger and Polasek (1991), Sivaganesan (1991), Walley (1991), Berger and Jefferys (1992), Gilio (1992b), Gómez-Villegas and Main (1992), Moreno and Pericchi (1992, 1993), Nau (1992), Sansó and Pericchi (1992), Wasserman (1989, 1992a), Lavine et al. (1993), Liseo et al. (1993), Osiewalski and Steel (1993), Bayarri and Berger (1994), de la Horra and Fernández (1994), Delampady and Dey (1994), O'Hagan (1994b), Pericchi and Pérez (1994), Ríos and Martín (1994) and Salinetti (1994).

† The term bootstrap is more familiar to most statisticians as a computationally intensive frequentist data-based simulation method for statistical inference — in particular, as a computer-based method for assigning frequentist measures of accuracy to point estimates. For an introduction to the method, and to the related technique of jackknifing, see Efron (1982); for a recent textbook treatment, see Efron and Tibshirani (1993).
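A simple version of the Bayesian bootstrap referred to here — Rubin's (1981) scheme applied to a mean — can be sketched as follows (illustrative code, not from the original text; Dirichlet(1, ..., 1) weights are generated from unit-rate exponentials):

```python
import random

def bayesian_bootstrap_means(data, draws, rng):
    # Rubin's (1981) Bayesian bootstrap: each posterior draw places
    # Dirichlet(1,...,1) weights on the observed points and returns
    # the corresponding weighted mean.
    n = len(data)
    out = []
    for _ in range(draws):
        g = [rng.expovariate(1.0) for _ in range(n)]   # Gamma(1) variates
        s = sum(g)
        w = [gi / s for gi in g]                       # Dirichlet(1,...,1)
        out.append(sum(wi * xi for wi, xi in zip(w, data)))
    return out

rng = random.Random(7)
data = [1.0, 2.0, 3.0, 4.0, 10.0]
ms = bayesian_bootstrap_means(data, 5000, rng)
post_mean = sum(ms) / len(ms)
xbar = sum(data) / len(data)
# the posterior for the mean is centred at the sample mean ...
assert abs(post_mean - xbar) < 0.1
# ... and every draw stays within the convex hull of the data
assert min(ms) >= min(data) and max(ms) <= max(data)
```

Because all posterior mass is placed on the observed values, the method avoids any parametric model specification, which is precisely its appeal — and its limitation.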
Finally, in the case of a large data sample, one might wonder whether the data themselves could be used to suggest a suitable form of parametric model component, thus removing the need for detailed specification and hence the arbitrariness of the choice. The so-called Bayesian bootstrap provides such a possible approach; see, in particular, Rubin (1981) and Lo (1987, 1988). However, since it is a heavily computationally based method, we shall defer discussion to the volume Bayesian Computation.

5.6.4 Hierarchical and Empirical Bayes

In Section 4, we motivated and discussed model structures which take the form of an hierarchy.
Expressed in terms of generic densities, an hierarchical model has the three-stage form {p(z | θ), p(θ | φ), p(φ)}. The basic interpretation is as follows. Observables z₁, ..., zₖ are available from k different, but related, sources: for example, k individuals in a homogeneous population, or k clinical trial centres involved in the same study. The first stage of the hierarchy specifies parametric model components for each of the k observables. Because of the "relatedness" of the k observables, the parameters θ₁, ..., θₖ are themselves judged to be exchangeable, and the second and third stages of the hierarchy thus provide a prior for θ of the familiar mixture representation form p(θ) = ∫ p(θ | φ) p(φ) dφ. Here, the "hyperparameter" φ typically has an interpretation in terms of characteristics — for example, the mean and covariance — of the population (of individuals, trial centres) from which the k units are drawn. In many applications, it may be of interest to make inferences both about the unit characteristics, the θᵢ's, and about the population characteristics, φ. In either case, straightforward probability manipulations involving Bayes' theorem provide the required posterior inferences as follows:

p(θ | z) = ∫ p(θ | φ, z) p(φ | z) dφ,

where p(θ | φ, z) ∝ p(z | θ) p(θ | φ), p(φ | z) ∝ p(z | φ) p(φ), and p(z | φ) = ∫ p(z | θ) p(θ | φ) dθ.

Of course, actual implementation requires the evaluation of the appropriate integrals, and this may be non-trivial in many cases. However, as we shall see in the volumes Bayesian Computation and Bayesian Methods, such models can be implemented in a fully Bayesian way using appropriate computational techniques.
A detailed analysis of hierarchical models will be provided in those volumes; for the present, we simply note some key references: Ericson (1969a), Lindley and Smith (1972), Smith (1973a, 1973b), Goldstein and Smith (1974), Leonard (1975), Mouchart and Simar (1980), Goel (1983), Dawid (1988b) and, more recently, Berger and Robert (1990), Schervish et al. (1992), van der Merwe and van der Merwe (1992) and George et al. (1993, 1994).

A tempting approximation is suggested by the first line of the analysis above. We note that if p(φ | z) were fairly sharply peaked around its mode, φ* say, we would have p(θ | z) ≈ p(θ | φ*, z). The form that results can be thought of as if we first use the data to estimate φ and then use Bayes' theorem for the first two stages of the hierarchy, with φ* as a "plug-in" value. The analysis thus has the flavour of a Bayesian analysis, but with an "empirical" prior based on the data. Such short-cut approximations to a fully Bayesian analysis of hierarchical models have become known as Empirical Bayes methods. This is actually slightly confusing. The naive approximation outlined above is clearly deficient in that it ignores the uncertainty in φ, and much of the development following the line of Efron and Morris (1972, 1975) and Morris (1983) has been directed to finding more defensible approximations: see Deely and Lindley (1981), Kass and Steffey (1989) and Ghosh (1992a).
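The plug-in idea is easy to exhibit for the normal-normal hierarchy. In the sketch below (an illustration, not from the original text) the unit means satisfy θᵢ ~ N(φ, τ²) with one N(θᵢ, 1) observation per unit; the hyperparameters are estimated by moments and plugged in, and the resulting shrinkage estimates beat the raw unit-by-unit estimates in total squared error:

```python
import random

rng = random.Random(42)
k, tau2_true, phi_true = 200, 0.25, 5.0
theta = [phi_true + rng.gauss(0, tau2_true ** 0.5) for _ in range(k)]
x = [t + rng.gauss(0, 1) for t in theta]     # one N(theta_i, 1) draw per unit

# empirical Bayes: moment estimates of the hyperparameters, then plug in
phi_hat = sum(x) / k
s2 = sum((xi - phi_hat) ** 2 for xi in x) / (k - 1)
tau2_hat = max(s2 - 1.0, 0.0)                # marginal var of x_i is tau2 + 1
shrink = tau2_hat / (tau2_hat + 1.0)         # posterior weight on the data
eb = [phi_hat + shrink * (xi - phi_hat) for xi in x]

sse_raw = sum((xi - t) ** 2 for xi, t in zip(x, theta))
sse_eb = sum((e - t) ** 2 for e, t in zip(eb, theta))
# shrinking toward the estimated population mean beats the raw estimates
assert sse_eb < sse_raw
```

The deficiency discussed in the text is visible here too: the plug-in analysis treats `phi_hat` and `tau2_hat` as known, ignoring their sampling uncertainty, whereas a fully Bayesian treatment would integrate over p(φ | z).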
The confusion is that the term was originally used to describe frequentist estimation of the second-stage distribution — see Robbins (1955, 1964) — whereas, more recently, it has come to refer mainly to work aimed at approximating (aspects of) posterior distributions arising from hierarchical models. An eclectic account of empirical Bayes methods is given by Maritz and Lwin (1989); see, also, Gilliland et al. (1982), Pérez and Pericchi (1992) and Wolpert and Warren-Hicks (1992). For more wholehearted Bayesian approaches, some key references are Good (1965, 1980b), Ericson (1969b), Hill (1969, 1974), Lindley (1971), Goel and DeGroot (1981) and Bernardo and Girón (1988).

5.6.5 Further Methodological Developments

The distinction between theory and methods is not always clear-cut, and the extensive Bayesian literature on specific methodological topics obviously includes a wealth of material relating to Bayesian concepts and theory. We shall review this material in the volume Bayesian Methods, and confine ourselves here to simply providing a few references. Among the areas which have stimulated the development of Bayesian theory, we note the following: Actuarial Science and Insurance (Jewell, 1974; Singpurwalla and Wilson, 1992); Calibration (Dunsmore, 1968); and Classification and Discrimination (Geisser, 1964, 1966, 1970; Bernardo, 1988, 1994; Bernardo and Girón, 1989).
5.6.5 Further Methodological Developments

The distinction between theory and methods is not always clear-cut, and the extensive Bayesian literature on specific methodological topics obviously includes a wealth of material relating to Bayesian concepts and theory. We shall review this material in the volume Bayesian Methods, and confine ourselves here to simply providing a few references. Among the areas which have stimulated the development of Bayesian theory, we note the following: Actuarial Science and Insurance (Jewell, 1974, 1988; Singpurwalla and Wilson, 1992); Calibration (Dunsmore, 1968; Hoadley, 1970; Brown and Mäkeläinen, 1992); Classification and Discrimination (Geisser, 1964, 1966; Binder, 1978; Bernardo, 1988; Bernardo and Girón, 1989); Econometrics (Florens et al., 1990; Mills, 1992; Steel, 1992); Law (Dawid, 1994); Meta-Analysis (DuMouchel and Harris, 1983; Wolpert and Warren-Hicks, 1992); Missing Data (Little and Rubin, 1987; Rubin, 1987; Meng and Rubin, 1992); Splines (Wahba, 1978, 1983; Gu, 1992; Ansley et al., 1993); Mixtures (Titterington et al., 1985; Bernardo and Girón, 1988; West, 1992; Diebolt and Robert, 1994); Quality Assurance (Wetherill and Campling, 1966; Hald, 1968; Booth and Smith, 1976; Irony et al., 1992); Contingency Tables (Lindley, 1964; Good, 1965, 1967; Leonard, 1975); Finite Population Sampling (Godambe, 1966, 1969; Basu, 1969, 1971; Ericson, 1969b, 1988; Smouse, 1984; Lo, 1986); Image Analysis (Geman and Geman, 1984; Besag, 1986, 1989; Geman, 1988; Grenander and Miller, 1994); Control Theory (Aoki, 1967; Sawaragi et al., 1967); Multivariate Analysis (Dawid, 1988; Mardia et al., 1992; Brown et al., 1994); Time Series and Forecasting (Meinhold and Singpurwalla, 1983; Ameen and Harrison, 1985; West, Harrison and Migon, 1985; Smith and Gathercole, 1986; Harrison and West, 1987; West and Harrison, 1989; Carlin and Polson, 1992; Singpurwalla and Soyer, 1992; Gamerman and Migon, 1993; West et al., 1993; McCulloch and Tsay, 1994; Pole et al., 1994); and Stochastic Approximation (Makov, 1988).

5.6.6 Critical Issues

We conclude this chapter on inference by briefly discussing some further issues under the headings: (i) Model Conditioned Inference; (ii) Prior Elicitation; (iii) Sequential Methods; and (iv) Comparative Inference.

Model Conditioned Inference

We have remarked on several occasions that the Bayesian learning process is predicated on a more or less formal framework, in the sense that all prior to posterior or predictive inferences take place within the closed world of an assumed model structure. If we accept the model, then the mechanics of Bayesian learning, derived ultimately from the requirements of quantitative coherence, provide the appropriate uncertainty accounting and dynamics. It has therefore to be frankly acknowledged and recognised that all such inference is conditional. But what if, as individuals, we acknowledge some insecurity about the model? Or need to communicate with other individuals whose own models differ? Clearly, issues of model criticism, model comparison and, ultimately, model choice are as much a part of the general world of confronting uncertainty as model conditioned thinking. We shall therefore devote Chapter 6 to a systematic exploration of these issues.
Prior Elicitation

We have emphasised, over and over, that our interpretation of a model requires, in conventional parametric representations, both a likelihood and a prior. In accounts of Bayesian Statistics from a theoretical perspective, like that of this volume, discussions of the prior component inevitably focus on stylised forms, such as conjugate or reference specifications, which are amenable to mathematical treatment, thus enabling general results and insights to be developed. However, there is a danger of losing sight of the fact that, in real applications, prior specifications should be encapsulations of actual beliefs rather than stylised forms. This, of course, leads to the problem of how to elicit and encode such beliefs: how to structure questions to an individual, and how to process the answers, in order to arrive at a formal representation. Much has been written on this topic, which clearly goes beyond the boundaries of statistical formalism and has proved of interest and importance to researchers from a number of other disciplines, including psychology and economics. However, despite its importance, the topic has a focus and flavour substantially different from the main technical concerns of this volume, and will be better discussed in the volume Bayesian Methods. We shall therefore not attempt here any kind of systematic review of the very extensive literature. Some key references are de Finetti (1967), Winkler (1967a, 1967b), Edwards et al. (1968), Hogarth (1975, 1980), Dickey (1980), French (1980), Kadane (1980), Lindley (1982d), Garthwaite and Dickey (1992), Leonard and Hsu (1992) and West and Crosse (1992). From the perspective of applications, the best known elicitation protocol seems to be that described by Staël von Holstein and Matheson (1979), the use of which in a large number of case studies has been reviewed by Merkhofer (1987). Warnings about the problems and difficulties of elicitation are given in Kahneman et al. (1982), Jaynes (1985) and Goodwin and Wright (1991). General discussion in a textbook setting is provided, for example, by Morgan and Henrion (1990).

Sequential Methods

In Section 2.6 we gave a brief overview of sequential decision problems but, for most of our developments, we assumed that data were treated globally. It is obvious, however, that, in real applications, data are often available in sequential form; moreover, there are often computational advantages in processing data sequentially, even if they are all immediately available. There is a large Bayesian literature on sequential analysis and on sequential computation, which we will review in the volumes Bayesian Computation and Bayesian Methods. Key references include the seminal monograph of Wald (1947), Jackson (1960), who provides a bibliography of early work, Wetherill (1961), and the classic texts of Wetherill (1966) and DeGroot (1970). Berger and Berry (1988) discuss the relevance of stopping rules in statistical inference; some other references, primarily dealing with the analysis of stopping rules, are Amster (1963), Bartholomew (1967, 1971), Roberts (1967), Basu (1975) and Irony (1993). Witmer (1986) reviews multistage decision problems.

Comparative Inference

In this and in other chapters, our main concern has been to provide a self-contained systematic development of Bayesian ideas. However, both for completeness, and for the very obvious reason that there are still some statisticians who do not currently subscribe to the position adopted here, it seems necessary to make some attempt to compare and contrast Bayesian and non-Bayesian approaches. We shall therefore provide, in Appendix B, a condensed critical overview of mainstream non-Bayesian ideas and developments. Any reader for whom our treatment is too condensed should consult Thatcher (1964), Pratt (1965), Barnard (1967), Press (1972/1982), Barnett (1973/1982), Cox and Hinkley (1974), Box (1983), Anderson (1984), DeGroot (1987), Casella and Berger (1987, 1990), Piccinato (1992) and Poirier (1993).
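As a closing aside on the sequential-processing remark above: for conjugate models, processing the data one observation at a time, with each day's posterior serving as the next day's prior, reproduces the global analysis exactly. A minimal sketch of ours, using the Beta-Bernoulli model:

```python
from fractions import Fraction

def update(alpha, beta, x):
    """One conjugate Beta-Bernoulli prior-to-posterior step:
    Be(alpha, beta) prior, a single 0-1 observation x."""
    return alpha + x, beta + 1 - x

# Processing the data sequentially ...
data = [1, 0, 1, 1, 0, 1]
a, b = Fraction(1), Fraction(1)          # uniform Be(1, 1) prior
for x in data:
    a, b = update(a, b, x)

# ... yields exactly the same posterior as a single global analysis.
a_global = 1 + sum(data)
b_global = 1 + len(data) - sum(data)
print((a, b) == (a_global, b_global))    # True
```

Exact rational arithmetic is used only to make the equality literal; the same identity holds, up to rounding, in floating point, and it is this coherence of sequential and global updating that underlies the computational schemes referred to above.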
Chapter 6

Remodelling

Summary

It is argued that, whether viewed from the perspective of a sensitive individual modeller, or from that of a group of modellers, there are good reasons for systematically entertaining a range of possible belief models. A variety of decision problems are examined within this framework: some involving model choice only; some involving model choice followed by a terminal action, such as prediction; others involving only a terminal action. Throughout, a clear distinction is drawn between three rather different perspectives: first, the case where the range of models under consideration is assumed to include the "true" belief model; secondly, the case where the range of models is being considered in order to provide a proxy for a specified, but intractable, actual belief model; finally, the case where the range of models is being considered in the absence of specification of an actual belief model. Links with hypothesis testing, significance testing and cross-validation are established.

6.1 MODEL COMPARISON

6.1.1 Ranges of Models

We recall from Chapter 4 that our ultimate modelling concern is with predictive beliefs for sequences of observables. More specifically, most of our detailed development has centred on belief models corresponding to judgements of exchangeability or, more generally, various forms of partial exchangeability. In each case, the predictive model typically has a mixture representation in terms of a random sample from a labelled parametric model, together with a prior distribution for the label. For example, we saw that, for an exchangeable real-valued sequence x1, x2, ..., the predictive belief distribution has the general representation

    P(x1 ≤ t1, ..., xn ≤ tn) = ∫_𝓕 ∏_{i=1}^{n} F(t_i) dQ(F).

This corresponds to an (as if) assumption of a random sample from an unknown distribution function, F, the latter being interpretable in terms of a strong law limit of observables, together with a prior distribution, Q, for F, defined over the space, 𝓕, of all distribution functions on ℝ. However, the very general nature of this representation precludes it, at least in terms of current limitations on intuition and technique, from providing a practical basis for routine concrete applications. This is why, in Chapter 4, much of our subsequent development was based on formal assumptions of further invariance or sufficiency structure, or pragmatic appeal to historical experience or scientific authority, in order to replace the general representation by mixtures involving finite-parameter families of densities. Inescapably, this passage from the general, but intractable, form to a specific, but tractable, model involves judgements and assumptions going far beyond the simple initial judgement of exchangeability. These further judgements, and hence the models that result from them, are therefore typically much less securely based in terms of individual beliefs, and certainly much less likely to be mutually acceptable in an interpersonal context, than the straightforward symmetry judgement.

Both from the perspective of a sensitive individual modeller and also from that of a group of modellers, there are therefore strong reasons for systematically entertaining a range of possible belief models (see, for example, Dickey, 1973, and Smith, 1986). In such cases, the range of different belief models, P1, P2, ..., can each be represented in the general form

    P_j : P_j(x1 ≤ t1, ..., xn ≤ tn) = ∫_𝓕 ∏_{i=1}^{n} F(t_i) dQ_j(F),  j = 1, 2, ...,

for some Q1, Q2, ..., the latter encapsulating the particular alternative judgements that characterise the different models. The following stylised examples serve to illustrate some of the kinds of ranges of models that might be entertained in applications involving simple exchangeability judgements.
Using density representations throughout, our subsequent development will be expressed in terms of a possibly infinite sequence of models, P1, P2, ..., although, in practice, we typically only work with a finite range, P1, ..., Pk, for some k ≥ 2. The range of models can either be thought of as generated by a single, non-dogmatic individual (seeking to avoid commitment to one specific form), or generated as concrete suggestions by a group of individuals (each committed to one of the forms), or generated purely formally, as an imaginative proxy for models thought likely to correspond to the ranges of judgements which might be made by the eventual readership of inference reports based on the models.

Inference for a Location Parameter

Suppose that observations x1, ..., xn can be thought of as measurements of μ with errors e1, ..., en, so that xi = μ + ei, i = 1, ..., n, with e1, e2, ... judged exchangeable, conditional on μ. Various beliefs are then possible about the "error distribution". For example: appeal to the central limit theorem (Section 3.2.3) might suggest the assumption of normality; past experience might suggest a substantial proportion of "aberrant" or "outlying" measurements, thus requiring a distribution with heavier tails than normality; alternatively, different past experience might suggest that the experimenter automatically suppresses any observations suspected of being "aberrant", thus requiring the assumption of a distribution with lighter tails than normality. With k = 3, a choice of a range of models to cover these possibilities might be

    P_j : p_j(x1, ..., xn) = ∫∫ ∏_{i=1}^{n} p_j(x_i | μ, σ) p_j(μ, σ) dμ dσ,  j = 1, 2, 3,

with p1(x | μ, σ), p2(x | μ, σ) and p3(x | μ, σ) corresponding to what are usually referred to as the normal, double-exponential and uniform parametric models, respectively, and with p_j(μ, σ) specifying prior beliefs for the location and scale parameters appearing in these families. In this case, the interpretations of the parameters, as strong law limits of observable measures of the location and spread of the measurements, are the same for each j, so that p_j(μ, σ) might not depend on j, although, in general, the p_j(μ, σ) could differ.

Normality versus non-Normality

Suppose that N ⊂ 𝓕 is the set of all normal distributions on the real line, and hence that N^c is the set of all distributions other than normal. Then, given the assumption of exchangeability for a real-valued sequence, an individual dogmatically asserting normality is specifying, in the general representation, a Q1(F) which concentrates with probability one on N; conversely, an individual dogmatically asserting non-normality is specifying a Q2(F) which concentrates with probability one on N^c. Our purpose here is mainly to point out how choices within the general exchangeable framework correspond to specifications of Q: anyone contemplating the "size" of 𝓕 cannot but be struck by the monumental dogmatism implicit in Q1!

Parametric Hypotheses

Suppose now that Q1 and Q2 each assign probability one to the same family with parametric form p(x | θ), θ ∈ Θ, but that Q1 further focuses on a specific parameter value, θ1, whereas Q2 simply assigns a prior density p(θ) over Θ. The rival models then have the representations

    P1 : p1(x1, ..., xn) = ∏_{i=1}^{n} p(x_i | θ1),

    P2 : p2(x1, ..., xn) = ∫_Θ ∏_{i=1}^{n} p(x_i | θ) p(θ) dθ,

corresponding to what are usually referred to as a simple hypothesis and a general alternative. If, instead, each Q_j, j = 1, 2, focuses on a specific parameter value, θ_j, so that

    P_j : p_j(x1, ..., xn) = ∏_{i=1}^{n} p(x_i | θ_j),  j = 1, 2,

this is often referred to as a situation of two simple hypotheses. Such specifications are even more dogmatic, in that they not only focus on a single parametric family, but also identify the values of the parameter.
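The three-model range for the location parameter can be compared through the integrated likelihoods p_j(x). The sketch below is ours, not from the text, and makes crude simplifying assumptions: σ is treated as known, the prior for μ is taken to be N(0, 10²), and the integral over μ is done by naive quadrature on a grid; the double-exponential and uniform models are scaled to match the normal variance.

```python
import numpy as np

SIGMA = 1.0  # scale treated as known: a simplification of the text's p_j(mu, sigma)

def normal_lik(x, mu):
    return np.prod(np.exp(-0.5 * ((x[:, None] - mu) / SIGMA) ** 2)
                   / (SIGMA * np.sqrt(2 * np.pi)), axis=0)

def double_exp_lik(x, mu):
    b = SIGMA / np.sqrt(2)               # matches the variance of the normal
    return np.prod(np.exp(-np.abs(x[:, None] - mu) / b) / (2 * b), axis=0)

def uniform_lik(x, mu):
    h = SIGMA * np.sqrt(3)               # Uniform(mu - h, mu + h), same variance
    inside = np.abs(x[:, None] - mu) <= h
    return np.prod(np.where(inside, 1 / (2 * h), 0.0), axis=0)

def integrated_likelihood(lik, x, prior_sd=10.0, grid=8001, span=40.0):
    """Naive quadrature of p(x | model) = integral of
    lik(x | mu) * N(mu | 0, prior_sd^2) d(mu)."""
    mu = np.linspace(-span, span, grid)
    dmu = mu[1] - mu[0]
    prior = np.exp(-0.5 * (mu / prior_sd) ** 2) / (prior_sd * np.sqrt(2 * np.pi))
    return float(np.sum(lik(x, mu) * prior) * dmu)

if __name__ == "__main__":
    x = np.array([0.8, 1.9, 0.3, 1.4, -2.2, 2.5, 1.1, 0.2, 1.7, 0.9])
    p = {name: integrated_likelihood(f, x)
         for name, f in [("normal", normal_lik),
                         ("double-exponential", double_exp_lik),
                         ("uniform", uniform_lik)]}
    total = sum(p.values())
    for name, v in p.items():   # posterior probabilities under equal prior weights
        print(name, v / total)
```

Note that for these data the uniform model receives integrated likelihood exactly zero: the sample range exceeds the width of its support for every value of μ, a simple instance of a dogmatic model being eliminated outright by the data.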
In contexts involving judgements of partial, rather than full, exchangeability, alternative models are defined by different forms of Q, and the many versions of partial exchangeability discussed in Chapter 4 clearly provide considerable scope for positing interesting ranges of models in any given application, expanding somewhat on the earlier discussion of model elaboration and simplification. The examples which follow illustrate just a few of these possibilities.

Several Samples

Consider the situation of m unrestrictedly exchangeable sequences of zero-one random quantities, discussed in detail in Chapter 4. We recall that, if z(n) denotes the combined observed sequences, with s_i ones among the n_i observations of the ith sequence, the general representation of the joint predictive density is given by

    p(z(n)) = ∫ ∏_{i=1}^{m} θ_i^{s_i} (1 − θ_i)^{n_i − s_i} dQ(θ1, ..., θm),

where θ1, ..., θm are interpretable as the limiting frequencies of ones in each of the m sequences, any further (non-degenerate) relationships among them being defined by the specific form of Q. As a stylised illustration of the possibilities, in the context of 0-1 responses in m clinical trial treatment groups, we might consider:

Q1 : assigning probability one to θ1 = ... = θm = θ, say, together with a non-degenerate prior for θ, so that dQ(θ1, ..., θm) reduces to dQ1(θ);

Q2 : assigning probability one to θ1 = ... = θ_{m−1} = θ, say, together with a non-degenerate specification for (θ, θm), so that dQ(θ1, ..., θm) reduces to dQ2(θ, θm);

Q3 : retaining a general, non-degenerate, form dQ(θ1, ..., θm) over the limiting frequencies.

Loosely speaking, Q1 corresponds to the hypothesis that all treatments have the same effect; Q2 corresponds to the hypothesis that one of the treatments (possibly a "control") is different from all the other treatments, which themselves have the same effects; and Q3 corresponds to a general hypothesis that all treatments have different effects.
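The choice between Q1 and Q3 in the several-samples setting can be made operational by comparing integrated likelihoods. The sketch below is our own stylised version: it assumes a uniform Be(1, 1) prior for the common θ under Q1 and independent uniform priors for θ1, ..., θm under Q3, so that each integral is a Beta function evaluated in closed form.

```python
from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_common(counts):
    """Q1: one common limiting frequency theta, uniform Be(1, 1) prior.
    counts is a list of (successes, trials) pairs, one per group."""
    s = sum(si for si, ni in counts)
    n = sum(ni for si, ni in counts)
    return log_beta(s + 1, n - s + 1)   # integral of theta^s (1-theta)^(n-s)

def log_marginal_separate(counts):
    """Q3: unrelated theta_1, ..., theta_m, independent uniform priors."""
    return sum(log_beta(si + 1, ni - si + 1) for si, ni in counts)

def posterior_prob_common(counts, prior_common=0.5):
    """Posterior probability of Q1 versus Q3, given prior weight prior_common."""
    b = exp(log_marginal_common(counts) - log_marginal_separate(counts))
    odds = b * prior_common / (1 - prior_common)
    return odds / (1 + odds)

if __name__ == "__main__":
    similar = [(7, 20), (8, 20), (6, 20)]     # comparable response rates
    different = [(2, 20), (17, 20), (9, 20)]  # plainly discrepant rates
    print(posterior_prob_common(similar))
    print(posterior_prob_common(different))
```

With comparable observed rates the data favour the "all treatments equal" specification Q1; with plainly discrepant rates the general alternative Q3 dominates, as one would hope.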
Structured Layouts

In Chapter 4, we considered triply subscripted random quantities, x_ijk, representing the kth of a number of replicates of an observable in "context" i ∈ I, subject to "treatment" j ∈ J, where the predictive model might be thought of as generated via conditionally independent normal distributions, N(x_ijk | μ + α_i + β_j + γ_ij, λ), together with a prior distribution Q for μ, {α_i}, {β_j}, {γ_ij} and λ. As a stylised illustration of alternative modelling possibilities, we might consider:

Q1 : specifying γ_ij = 0, for all i, j, together with a non-degenerate specification for {α_i}, {β_j}, μ and λ;

Q2 : specifying γ_ij = 0 and β_j = 0, for all i, j, together with a non-degenerate specification for {α_i}, μ and λ;

Q3 : specifying γ_ij = 0, α_i = 0 and β_j = 0, for all i, j, together with a non-degenerate specification for μ and λ.

The reader familiar with analysis of variance methods will readily identify these prior specifications with conventional forms of hypotheses regarding absences of interaction and main effects.

Covariates

In Section 4.6, we discussed a variety of models involving covariates, where beliefs about one sequence of observables (the x's) were structurally dependent on another set of observables (the z's). Given the enormous potential variety of such covariate-dependent models, it does not seem appropriate to attempt a notationally precise illustration of all the possibilities. Instead, we shall simply indicate, in general terms, the kinds of alternative models that might be considered for each of the cases considered there.

Example 6.1. (Bioassay). Alternative models for a single experiment might correspond to different assumptions about the functional dependence of the survival probabilities on the dose (for example, logit versus probit). In the case of several separate experiments, alternative models might assume the same functional form, but differ in whether or not they constrain model parameters, for example the LD50s, to be equal.

Example 6.2. (Growth curves). Alternative models for an individual growth curve might correspond to different assumptions about the functional dependence of the response on time (for example, linear versus logistic). In the case of several growth curves for subjects from a relatively homogeneous population, alternative models might be concerned with whether some or all of the parameters defining the growth curves are identical or differ across subjects.

Example 6.3. (Multiple regression). Alternative models in the multiple regression context typically correspond to whether or not various regressor variables can be omitted from the linear regression form or, equivalently, to whether or not various regression coefficients can be set equal to zero.

Hierarchical Models

Given the enormous variety of potential hierarchical models and alternative forms, we shall content ourselves with some general comments on one of the specific cases considered in Chapter 4.

Example 6.4. (Exchangeable normal mean parameters). In Example 4.16, we considered a case where all the means, μ1, ..., μm, say, of m groups of observables with normal parametric models were judged exchangeable, reflecting a symmetric judgement of "similarity" for μ1, ..., μm. However, other symmetry judgements are possible: for example, that m − 1 of the μi's are exchangeable, while the other one is not, with no information, a priori, as to which of the μi's is the odd one out, all being equally likely to be. This would create a model allowing potential "outliers" among the m groups themselves. (This idea is developed further, in a non-hierarchical setting, later in the chapter.)

In the third volume of this work, Bayesian Methods, we shall discuss in detail a number of practical applications of this kind. In the remainder of this chapter, we shall illustrate various of the kinds of decision problems that might be considered when confronted with a range of models. The emphasis will be on somewhat stylised, typically simple, versions of such problems, in order to highlight the conceptual issues; detailed case-studies, involving the substantive complexities of context and the computational complexities of implementation, will be more appropriately presented in the volumes Bayesian Computation and Bayesian Methods.

6.1.2 Perspectives on Model Comparison

To be concrete, let us assume that all the belief models, P1, P2, ..., under consideration for observations x can be described in terms of finite parameter mixture representations, so that the predictive distributions for the alternative models are given by

    p_j(x) = p(x | P_j) = ∫_{Θ_j} p(x | θ_j) p_j(θ_j) dθ_j,  j = 1, 2, ....

Given the specifications of the various densities forming the mixtures, and confronted with such a range of possible models, how should an individual or a group proceed? From the perspective adopted throughout this book,
clearly the answer depends on the perceived decision problem to which the modelling activity is a response. Before we turn to a detailed discussion of decisions concerning model choice or comparison, however, we need to draw attention to important distinctions among three alternative ways in which the possible models might be viewed. For mnemonic convenience, from now on we shall denote the alternative models by {M_i, i ∈ I} (rather than P_i, as in our previous discussion) and the set of these models by M = {M_i, i ∈ I}.

The first alternative, which we shall call the M-closed view, corresponds to believing that one of the models in {M_i, i ∈ I} is the "true" model, without explicit knowledge of which of them it is. There is, of course, some ambiguity as to what should be regarded as a component model (for example, the renormalised mixture of all but one of the M_i could itself be regarded as a model), but this can be resolved pragmatically by taking {M_i, i ∈ I} to be those individual models we are interested in comparing or choosing among. Clearly, in principle, the M-closed view would be appropriate whenever one knew for sure that the real world mechanism involved was one of a specified finite set. One rather artificial situation where this would apply would be that of a computer simulation "inference game", where data are known to have been generated using one of a set of possible simulation programs, each a coded version of a different specified probability model, but it is not known which program was used. Beyond such "controlled" situations, when does it actually make sense to speak of a "true" model, and hence to adopt the M-closed perspective? Continuing the discussion of Section 4.8 on the role and nature of models, it seems to us difficult to accept the M-closed perspective in a literal sense. Nature does not provide us with an exhaustive list of possible mechanisms and a guarantee that one of them is true. Instead, we ourselves choose the lists as part of the process of settling on a predictive specification that we hope will prove "fit for purpose", in the jargon of modern quality assurance. However, there may be situations where one might not feel too uncomfortable in proceeding "as if" one meant it. For example, suppose that a parametric model with a specified parameter value has been extensively adopted and found to be a successful predictive device in a range of applications, and that a new application context arises in which it is felt necessary to reconsider whether to continue with the previously specified parameter value or, in this new context, to incorporate uncertainty about the appropriate value. Provided we feel comfortable with assigning prior weights to these two alternative formulations, we can exploit the M-closed framework, at least approximately.

But if we abandon the M-closed perspective, how else might we approach the very real and important problem of comparing or choosing among alternative models? It seems to us that the approach depends critically on whether or not one has oneself separately formulated a clear belief model. In the former case, the alternative models are presumably being contemplated as a proxy because the actual belief model is too cumbersome to implement, and they will have to be evaluated and compared in the light of the actual beliefs. In the latter case, it would seem intuitively, and we shall see this more formally later, that the alternative models have to battle it out among themselves on some cross-validatory basis.

The second alternative, then, which we shall call the M-completed view, corresponds to an individual acting as if {M_i, i ∈ I} simply constitute a range of specified models currently available for comparison, to be evaluated in the light of the individual's separate actual belief model, which we shall denote by M_t. The proxy models {M_i, i ∈ I} might be adopted for a variety of reasons; typically, they will have been proposed largely because they are attractive from the point of view of tractability of analysis, or of communication of results, compared with the actual belief model M_t. For example, if the actual belief model is based on non-linear functions of many covariates, together with Student probability distribution specifications, the proxy models to be evaluated might be various linear regression models with limited numbers of covariates and normal probability distribution specifications.

The third alternative, which we shall call the M-open view, also acknowledges {M_i, i ∈ I} as simply a range of specified models available for comparison but, in this case, there is no separate overall actual belief specification, perhaps because we lack the time or competence to provide it. Examples of ranges of "proxy" models that are widely used in this spirit include familiar ones based on parametric components: regression models with different choices of regressors; generalised linear models with different choices of covariates, link functions, etc.; contingency table structures with different patterns of independence and dependence assumptions.

We now proceed to give these alternative perspectives a somewhat more formal description. From the M-closed perspective, the overall model specifies beliefs for x of the form

    p(x) = Σ_{i ∈ I} P(M_i) p(x | M_i),

with {P(M_i), i ∈ I} denoting prior weights on the component models, which may reflect either the range of uncertainties within an undecided individual, or the range of different beliefs of a group of individuals. From the M-completed perspective, assigning probabilities {P(M_i), i ∈ I} does not make sense, and the actual overall model specifies beliefs for x of the form p(x) = p_t(x) = p(x | M_t). From the M-open perspective, assigning the probabilities {P(M_i), i ∈ I} again does not make sense and, in addition, there is no separate specification, p(x), of actual beliefs.
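Under the M-closed view, the mixture form of p(x) makes the posterior weighting of models entirely routine. The following minimal sketch of ours uses two "simple hypothesis" models for Bernoulli data, so that each p(x | M_i) is an ordinary likelihood; the particular parameter values are hypothetical.

```python
import numpy as np

def posterior_model_probs(prior, marginal_liks):
    """M-closed bookkeeping: P(M_i | x) proportional to p(x | M_i) P(M_i)."""
    w = np.asarray(prior, dtype=float) * np.asarray(marginal_liks, dtype=float)
    return w / w.sum()

def predictive_mixture(x_new, weights, component_pdfs):
    """The overall M-closed predictive density is the corresponding
    mixture of the within-model predictive densities."""
    return sum(p * pdf(x_new) for p, pdf in zip(weights, component_pdfs))

if __name__ == "__main__":
    # Two simple-hypothesis models for Bernoulli trials:
    # M1: theta = 0.5, M2: theta = 0.8 (a stylised example of ours).
    data = [1, 1, 0, 1, 1, 1, 0, 1]
    def lik(theta):
        s = sum(data)
        return theta ** s * (1 - theta) ** (len(data) - s)
    probs = posterior_model_probs([0.5, 0.5], [lik(0.5), lik(0.8)])
    print(probs)
    # Predictive probability that the next trial is a one, mixing over models:
    pred = predictive_mixture(
        1, probs,
        [lambda x: 0.5 ** x * 0.5 ** (1 - x),
         lambda x: 0.8 ** x * 0.2 ** (1 - x)])
    print(pred)
```

Note how the predictive probability for the next observation lies between the two within-model values, weighted by the posterior model probabilities; this is the mixture structure that the decision analyses of the next subsection exploit.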
6.1.3 Model Comparison as a Decision Problem

We shall now discuss various possible decision problems where the answer to an inference problem involves model choice or comparison among the alternatives in M = {M_i, i ∈ I}. Some of these only make sense from an M-closed perspective; others can be approached from either an M-closed, an M-completed or an M-open perspective. Throughout the following development, the observed data on which decisions are to be based will be denoted by x, the choice of model M_i will be denoted by m_i, i ∈ I, and ω will denote an unknown "state of the world" of interest, with p(ω | x) representing actual beliefs about ω having observed x.

The first decision problem we shall consider involves only the choice of a model from M, either as an end in itself, or as the basis for a subsequent answer to an inference question, but without any subsequent action. This decision structure is shown schematically in Figure 6.1.

Figure 6.1 A decision problem involving model choice only

Whatever the forms of ω and u(m_i, ω), in the general decision problem defined by Figure 6.1, maximising expected utility implies that the optimal model choice m* is given by

    ū(m* | x) = sup_{i ∈ I} ū(m_i | x),  where  ū(m_i | x) = ∫ u(m_i, ω) p(ω | x) dω.

From an M-closed perspective, an example of an obvious ω of interest might be the identity of the "true" model: ω in this case labels the models in M, and the utility, u(m_i, ω), of choosing a particular model then depends on whether a correct choice has been made. It is perhaps not obvious why such a problem would be of interest from an M-open perspective; we shall defer a detailed discussion of this until Section 6.1.6.

In the M-closed case,

    p(ω | x) = Σ_{i ∈ I} p(ω | M_i, x) P(M_i | x),

where

    P(M_i | x) = p(x | M_i) P(M_i) / Σ_{j ∈ I} p(x | M_j) P(M_j)

are the posterior probabilities of the competing models, and p_i(ω | x) = p(ω | M_i, x) is given by standard (posterior or predictive) manipulations conditional on model M_i. We note, in particular, the key role played by the quantities {P(M_i | x), i ∈ I}. In this way, we can, at least in principle, obtain p(ω | x) and evaluate ū(m_i | x), i ∈ I, even though this may require extensive (Monte Carlo) numerical calculations in specific applications.

From the M-completed perspective, nothing can be said in general about the explicit form of p(ω | x) but, provided it can be evaluated, one can compare the models in M on the basis of their expected utilities, even though none of them corresponds to one's own assumption regarding the true model. From the M-open perspective, it turns out, perhaps surprisingly, that essentially the same analysis can be carried out as in the M-completed framework, without actually having specified an alternative "true" model; detailed analysis for the M-open case will be given in Section 6.1.6.

Let us now consider a rather different form of decision problem, which first requires the choice of a model, M_i, from M and then, given M_i, requires an answer, a_j, j ∈ J, to an inference question relating to an unknown "state of the world" ω of interest; for example, we may wish to predict a future observation, or to estimate a parameter common to all the models in M. The resulting decision problem is shown schematically in Figure 6.2.

Figure 6.2 A decision problem involving model choice and subsequent inference

If u(m_i, a_j, ω) denotes the utility resulting from the successive choices m_i (i.e., model M_i) and a_j (i.e., answer to the inference question), when ω is the actual "state of the world", systematic application of the criterion of maximising expected utility establishes that the optimal model choice is that m* for which

    ū(m* | x) = sup_{i ∈ I} ū(m_i | x),

where ū(m_i | x) = ū(m_i, a_i* | x) is the expected utility, given x, of optimal behaviour given model M_i; in other words, a_i* is obtained by maximising

    ū(m_i, a_j | x) = ∫ u(m_i, a_j, ω) p_i(ω | x) dω.

The form p_i(ω | x) in the above is again given by standard (posterior or predictive) manipulation conditional on model M_i, while the form p(ω | x) again represents actual beliefs about ω given x. The explicit form of p(ω | x) as a mixture of the p_i(ω | x) has been given above in the M-closed case; detailed analysis for the M-open case will be given in Section 6.1.6.

From a conceptual perspective, it is important to recognise that different choices of ω and different forms of utility structure will naturally imply different forms of solution to the problem of model choice. In the next two subsections, we shall explore a number of specific cases, in order to underline the general message that coherent comparison of a finite or countable set of alternative models depends on the specification (at least implicitly) of a decision structure, including a utility function. We have also noted above that evaluation of p(ω | x) and {ū(m_i | x), i ∈ I} can in principle be carried out, numerically if necessary.
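When ω labels the true model under the M-closed view, the expected utility above reduces to ū(m_i | x) = Σ_j u(m_i, M_j) P(M_j | x). The hypothetical numbers below, ours entirely, show that with an asymmetric utility structure the optimal choice need not be the model of highest posterior probability, whereas a zero-one utility recovers the modal model.

```python
import numpy as np

def optimal_model_choice(utility, post_probs):
    """Expected utility of choosing model i when omega labels the true model:
    ubar(m_i | x) = sum_j u(m_i, M_j) P(M_j | x); return the maximiser."""
    ubar = np.asarray(utility, dtype=float) @ np.asarray(post_probs, dtype=float)
    return int(np.argmax(ubar)), ubar

if __name__ == "__main__":
    post = [0.45, 0.35, 0.20]   # posterior model probabilities (hypothetical)
    # u[i][j] = utility of choosing M_i when M_j is true.  Choosing the
    # simplest model M_0 when the most complex M_2 is true is heavily
    # penalised; the middle model M_1 is a safe compromise.
    u = [[1.0, 0.4, 0.0],
         [0.6, 1.0, 0.6],
         [0.2, 0.5, 1.0]]
    choice, ubar = optimal_model_choice(u, post)
    print(choice, ubar)                 # the compromise model M_1 is chosen
    # With a zero-one utility the rule reduces to choosing the model of
    # highest posterior probability (here M_0).
    eye_choice, _ = optimal_model_choice(np.eye(3), post)
    print(eye_choice)
```

This makes concrete the remark above that different utility structures imply different solutions to the model choice problem, even for a fixed set of posterior model probabilities.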
∫ u(a, ω) p(ω | x) dω, with p(ω | x) as discussed above. In this case, maximising expected utility leads immediately to the optimal answer a*, given by

ū(a* | x) = sup_a ū(a | x).

In general, although we have omitted the model choice step, model comparison in the light of the data x is still being effected through the presence of {P(Mi | x), i ∈ I}, as follows from the posterior weighted mixture form of p(ω | x): even if explicit choice among the models is not part of the decision problem, solutions to decision problems conditional on x will always implicitly depend on a comparison of the models in the light of the data.

6.1.4  Zero-One Utilities and Bayes Factors

In this section, we confine attention to the M-closed perspective and consider first the problem of choosing a model from M when the "state of the world" of interest is defined to be the "true" model; stated colloquially, the problem is that of choosing the true model, without any subsequent decision. In this case, a natural form of utility function may be

u(mi, ω) = 1 if ω = Mi,   0 if ω ≠ Mi,   i ∈ I.

The expected utility of the decision mi, given x, is hence

ū(mi | x) = P(Mi | x).

The optimal decision is therefore to "choose the model which has the highest posterior probability". Note also that if we entertain a range of possible models for data x, then, assuming a future sample y = (y1, ..., ys) generated from the true model Mt, P(Mt | y) → 1 as s → ∞.
6 Remodelling

Bayes Factors

Less formally, suppose that some form of intuitive measure of pairwise comparison of plausibility is required between any two of the models {Mi, i ∈ I}. In the context of parametric models, M then just corresponds to the specifications {pi(x | θi), pi(θi), i ∈ I}, and any two models Mi, Mj may be usefully compared using the posterior odds ratio

P(Mi | x) / P(Mj | x) = [p(x | Mi) / p(x | Mj)] × [P(Mi) / P(Mj)],

where

p(x | Mi) = ∫ pi(x | θi) pi(θi) dθi.

In words, the above comparison can be described as "posterior odds ratio = integrated likelihood ratio × prior odds ratio", making explicit the key role of the ratio of integrated likelihoods in providing the mechanism by which the data transform relative prior beliefs into relative posterior beliefs. Good (1950) has suggested that the logarithms of the various ratios in the above be called weights of evidence (a term apparently first used in a related context by Peirce, 1878). On this logarithmic scale, apparently due to Turing (see, for example, Good, 1988b), the prior weight of evidence and the log integrated likelihood ratio combine additively to give the posterior weight of evidence. We have already noted the extremely simple forms of predictive model which result when beliefs not only concentrate on a specific parametric family of distributions, but also identify the value of the parameter; in that case, the integrated likelihood ratios reduce to simple ratios of likelihoods. The fundamental importance of this transformation warrants the following definition.

Definition 6.1. (Bayes factor).
Given two hypotheses Hi, Hj corresponding to assumptions of alternative models Mi, Mj for data x, the Bayes factor in favour of Hi (and against Hj) is given by the ratio of posterior to prior odds,

Bij(x) = [P(Hi | x) / P(Hj | x)] ÷ [P(Hi) / P(Hj)] = p(x | Hi) / p(x | Hj).

Intuitively, the Bayes factor provides a measure of whether the data x have increased or decreased the odds on Hi relative to Hj: Bij(x) > 1 signifies that Hi is now relatively more plausible in the light of x, and Bij(x) < 1 that the relative plausibility of Hj has increased. On the logarithmic scale, log Bij(x) corresponds to the integrated likelihood weight of evidence in favour of Hi (and against Hj).
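The odds-ratio algebra in Definition 6.1, and the additivity of the weights of evidence on the log scale, can be sketched as follows (an illustrative fragment, not from the book; the numerical odds are arbitrary):

```python
import math

def posterior_odds(bayes_factor, prior_odds):
    # posterior odds ratio = Bayes factor x prior odds ratio
    return bayes_factor * prior_odds

def weight_of_evidence(ratio):
    # Good's weight of evidence: the logarithm of an odds ratio or Bayes factor
    return math.log(ratio)

# with B12 = 6 and prior odds 1/3, the posterior odds are 2;
# the weights of evidence combine additively on the log scale
post = posterior_odds(6.0, 1.0 / 3.0)
assert abs(post - 2.0) < 1e-12
assert abs(weight_of_evidence(post)
           - (weight_of_evidence(6.0) + weight_of_evidence(1.0 / 3.0))) < 1e-12
```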
Hypothesis Testing

The problem of hypothesis testing has its own conventional terminology, which, within the framework we are adopting, can be described as follows. Two alternative models, M1, M2, are under consideration, and both are special cases of the predictive model
p(x) = ∫ p(x | θ) dQ(θ),
with the same assumed parametric form p(x | θ), θ ∈ Θ, but with different choices of Q. If, for model Mi, Qi assigns probability one to a specific value, θi, say, the model is said to reduce to a simple hypothesis for θ (recalling that the form p(x | θ) is assumed throughout). If, for model Mi, Qi defines a non-degenerate density pi(θ) over Θi ⊆ Θ, the model is said to reduce to a composite hypothesis for θ. If a simple hypothesis is being compared with a composite hypothesis such that Θ2 = Θ − {θ1}, the latter is called a general alternative hypothesis. In the situation where the "state of the world" of interest, ω, is defined to be the true model Mi, we can generalise slightly the zero-one utility structure used earlier by assuming that
u(mi, ω) = −lij,   if ω = Mj,   i = 1, 2,   j = 1, 2,
with l11 = l22 = 0 and l12 > l21 > 0. Intuitively, there is a (possibly asymmetric) loss in choosing the wrong model, and there is no loss in choosing the correct model. Given data x, and using again pi(ω | x) = 1 if ω = Mi and 0 otherwise, the expected utility of mi is then easily seen to be
so that
We thus prefer M2 to M1 if and only if

P(M1 | x) / P(M2 | x) < l12 / l21,
revealing a balancing of the posterior odds against the relative seriousness of the two possible ways of selecting the wrong model. In the symmetric case, l12 = l21, the choice reduces to choosing the a posteriori most likely model, as shown earlier for the zero-one case. The following describes the forms of so-called Bayes tests which arise in comparing models when the latter are defined by parametric hypotheses.
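The balancing just described can be illustrated with a small sketch (hypothetical numbers; the index convention lij = loss of choosing Mi when Mj is true is one consistent reading of the text):

```python
def preferred_model(p_m1, l12, l21):
    # l12: loss of choosing M1 when M2 is true; l21: loss of choosing M2 when M1 is true.
    # Prefer M2 iff the expected loss of m1 exceeds that of m2,
    # i.e. iff l12 * P(M2|x) > l21 * P(M1|x).
    p_m2 = 1.0 - p_m1
    return 2 if l12 * p_m2 > l21 * p_m1 else 1

# symmetric losses reduce to choosing the a posteriori most probable model
assert preferred_model(0.6, 1.0, 1.0) == 1
# a sufficiently asymmetric loss overturns that choice
assert preferred_model(0.6, 4.0, 1.0) == 2
```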
Proposition 6.1. (Forms of Bayes tests). In comparing two models, M1, M2, defined by parametric hypotheses for p(x | θ), with the utility structure given above, M1 is to be preferred to M2 if and only if

B12(x) > l12 P(M2) / l21 P(M1),

where:

B12(x) = p(x | θ1) / p(x | θ2)   (simple versus simple test);

B12(x) = p(x | θ1) / ∫ p(x | θ2) p2(θ2) dθ2   (simple versus composite test);

B12(x) = ∫ p(x | θ1) p1(θ1) dθ1 / ∫ p(x | θ2) p2(θ2) dθ2   (composite versus composite test).
Proof. The results follow directly from the preceding discussion.
The following examples illustrate both general model comparison and a specific instance of hypothesis testing.
Example 6.5. (Geometric versus Poisson). Suppose we wish to compare the two completely specified parametric models, negative binomial (geometric) and Poisson, defined for conditionally independent x1, ..., xn by
M1 : Nb(xi | θ1, 1),   M2 : Pn(xi | θ2),   i = 1, ..., n.
The Bayes factor in this case is given by the simple likelihood ratio

B12(x) = ∏_{i=1}^n Nb(xi | θ1, 1) / ∏_{i=1}^n Pn(xi | θ2) = θ1^n (1 − θ1)^{Σ xi} / [e^{−nθ2} θ2^{Σ xi} / ∏ xi!].
Suppose for illustration that θ1 = 1/3, θ2 = 2 (implying equal mean values E[x] = 2 for both models); then, for example, with n = 2, x1 = x2 = 0, we have B12(x) = e^4/9 ≈ 6.07, indicating an increase in plausibility for M1; whereas with n = 2, x1 = x2 = 2, we have B12(x) = 4e^4/729 ≈ 0.30, indicating a slight increase in plausibility for M2. Suppose now that θ1, θ2 are not known and are assigned the prior distributions Be(θ1 | α1, β1) and Ga(θ2 | α2, β2),
whose forms are given in Section 3.2.2 (where details of Nb(xi | θ1, 1) and Pn(xi | θ2) can also be found). It follows straightforwardly that

p(x | M1) = [Γ(α1 + β1) / Γ(α1) Γ(β1)] · [Γ(α1 + n) Γ(β1 + Σ xi) / Γ(α1 + β1 + n + Σ xi)]

and that

p(x | M2) = [β2^{α2} / Γ(α2)] · [Γ(α2 + Σ xi) / (β2 + n)^{α2 + Σ xi}] · (∏ xi!)^{−1},

so that

B12(x) = p(x | M1) / p(x | M2).
We further note that

E[x | M1] = β1 / (α1 − 1)   and   E[x | M2] = α2 / β2,

so that prior specifications with (α1 − 1) α2 = β1 β2 imply the same means for the two predictive models.
Table 6.1  Dependence of B12(x) on prior-data combinations

                   α1 = 2,  β1 = 2     α1 = 31, β1 = 60     α1 = 2,  β1 = 3
                   α2 = 2,  β2 = 1     α2 = 60, β2 = 30     α2 = 2,  β2 = 2

x1 = x2 = 0             2.70                5.69                 0.80
x1 = x2 = 2             0.29                0.30                 0.49
As an illustration of the way in which the prior specification can affect the inferences, we present in Table 6.1 a selection of values of B12(x) resulting from particular prior-data combinations. In the first two columns, the priors specify the same predictive means for the two models, namely E[x | Mi] = 2, but the priors in the second column are much more informative. In the final column, different predictive means are specified. Column 2 gives Bayes factors close to those obtained above assuming θ1, θ2 known, as might be expected from prior distributions concentrating sharply around the values θ1 = 1/3 and θ2 = 2. However, comparison of the first and third columns for x1 = x2 = 0 makes clear that, with small data sets, seemingly minor changes in the priors for model parameters can lead to changes in direction in the Bayes factor.
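The two fully specified Bayes factors quoted at the start of Example 6.5 can be checked directly (a sketch assuming the reconstructed values θ1 = 1/3, θ2 = 2; the function names are ours, not notation from the book):

```python
import math

def geometric_pmf(x, theta):
    # Nb(x | theta, 1): p(x) = theta * (1 - theta)^x, x = 0, 1, 2, ...
    return theta * (1.0 - theta) ** x

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def bayes_factor_12(data, theta1=1.0 / 3.0, lam2=2.0):
    # simple likelihood ratio of the two completely specified models
    num = math.prod(geometric_pmf(x, theta1) for x in data)
    den = math.prod(poisson_pmf(x, lam2) for x in data)
    return num / den

assert abs(bayes_factor_12([0, 0]) - math.exp(4) / 9) < 1e-9        # ~ 6.07
assert abs(bayes_factor_12([2, 2]) - 4 * math.exp(4) / 729) < 1e-9  # ~ 0.30
```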
The point made at the end of the above example is, of course, a general one. In any model comparison, the Bayes factor will depend on the prior distributions specified for the parameters of each model. That such dependence can be rather striking is well illustrated in the following example.
Example 6.6. (Lindley's paradox). Suppose that for x = (x1, ..., xn) two alternative models M1, M2, with P(Mi) > 0, i = 1, 2, correspond to simple and composite hypotheses about μ in N(x | μ, λ), defined by
M1 : p1(x) = ∏_{i=1}^n N(xi | μ0, λ),   μ0, λ known,

M2 : p2(x) = ∫ ∏_{i=1}^n N(xi | μ, λ) N(μ | μ1, λ1) dμ,   μ1, λ1, λ known.
In more conventional terminology, x1, ..., xn are a random sample from N(x | μ, λ), with precision λ known; the null hypothesis is that μ = μ0 and the alternative hypothesis is that μ ≠ μ0, with uncertainty about μ described by N(μ | μ1, λ1). Since x̄ = n^{−1} Σ xi is a sufficient statistic under both models, we easily see that

B12(x) = N(x̄ | μ0, nλ) / N(x̄ | μ1, λ1 nλ / (λ1 + nλ)).
It is easily checked that, for any fixed x̄, B12(x) → ∞ as λ1 → 0, so that the evidence in favour of M1 becomes overwhelming as the prior precision in M2 becomes vanishingly small, and hence P(M1 | x) → 1. In particular, this is true for x such that |x̄ − μ0| is large enough to cause the "null hypothesis" to be "rejected" at any arbitrary, prespecified level using a conventional significance test! This "paradox" was first discussed in detail by Lindley (1957) and has since occasioned considerable debate; see Smith (1965), Bernardo (1980), Shafer (1982b), Berger and Delampady (1987), Moreno and Cano (1989), Berger and Mortera (1991a) and Robert (1993) for further contributions and references.
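The limiting behaviour in Example 6.6 is easy to check numerically (an illustrative sketch; the function names and the particular value x̄ = 2.6, chosen so that a conventional test would reject μ0 = 0, are ours):

```python
import math

def normal_pdf(x, mean, precision):
    return math.sqrt(precision / (2 * math.pi)) * math.exp(-0.5 * precision * (x - mean) ** 2)

def bayes_factor_12(xbar, n, lam, mu0, mu1, lam1):
    # under M1, xbar ~ N(mu0, n*lam); under M2, xbar has predictive
    # precision lam1 * n * lam / (lam1 + n * lam) about mu1
    prec2 = lam1 * n * lam / (lam1 + n * lam)
    return normal_pdf(xbar, mu0, n * lam) / normal_pdf(xbar, mu1, prec2)

# for fixed data, B12 grows without bound as the prior precision lam1 -> 0
bfs = [bayes_factor_12(2.6, 1, 1.0, 0.0, 0.0, lam1) for lam1 in (1.0, 1e-2, 1e-6)]
assert bfs[0] < bfs[1] < bfs[2]
assert bfs[2] > 1.0
```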
A model comparison procedure which seems to be widely used implicitly in statistical practice, but rarely formalised, is the following. Given the assumption of a particular predictive model, {p(x | θ), p(θ)}, θ ∈ Θ, a posterior density, p(θ | x), is derived and, as we have seen in Section 5.1.5, may be at least partially summarised by identifying, for some 0 < p < 1, a highest posterior density credible region Rp(x), which is typically the smallest region such that

∫_{Rp(x)} p(θ | x) dθ = p.
Intuitively, for large p, Rp(x) contains those values of θ which are most plausible given the model and the data. Conversely, the complement of Rp(x) consists of those values of θ which are rather implausible. Now suppose that, given a specified p and derived Rp(x), one is going to assert that the "true value" of θ (i.e., the value onto which p(θ | y) would concentrate as the size of a future sample y tended to ∞) lies in Rp(x). Defining the decision problem to be the choice of p, so that the possible answers to the inference problem are in A = [0, 1], with the state of the world ω defined to be the true θ, a value ap = p has to be chosen. An appropriate utility function may be

u(ap, θ) = f(p) if θ ∈ Rp(x),   g(1 − p) if θ ∉ Rp(x),
where f and g are decreasing functions defined on [0, 1]. Essentially, such a utility function extends the idea of a zero-one function by reflecting the desire for a "correct" decision, but modified to allow for the fact that choosing p close to one leads to a rather vacuous assertion, whereas a correct assertion with p small is rather impressive. The expected utility of choosing ap = p is easily seen to be
ū(ap) = p f(p) + (1 − p) g(1 − p),
from which the optimal p may be derived for any specific choices of f and g. We note that if f = g, the unique maximum is at p = 0.50, so that it becomes optimal to quote a 50% highest posterior density credible region. If, for example, f(p) = 1 − p, g(1 − p) = [1 − (1 − p)]² = p², the resulting optimal value of p is 1/√3 ≈ 0.58, so that a 58% credible region is appropriate. More exotically, if f(p) = 1 − (2.7)p^k and g(1 − p) = (1 − p)^{−1}, for suitable k, the reader might like to verify that a 95% credible region is optimal.
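The two optimal levels quoted above can be verified by a direct grid search over p (a sketch; `expected_utility` implements ū(ap) = p f(p) + (1 − p) g(1 − p)):

```python
def expected_utility(p, f, g):
    return p * f(p) + (1.0 - p) * g(1.0 - p)

def optimal_p(f, g, grid=100000):
    # brute-force maximiser of the expected utility over p in [0, 1]
    best = max(range(grid + 1), key=lambda k: expected_utility(k / grid, f, g))
    return best / grid

# f = g gives the unique maximum at p = 0.50
assert abs(optimal_p(lambda p: 1 - p, lambda q: 1 - q) - 0.5) < 1e-3
# f(p) = 1 - p, g(1 - p) = p^2 gives p = 1/sqrt(3), about 0.58
assert abs(optimal_p(lambda p: 1 - p, lambda q: (1 - q) ** 2) - 3 ** -0.5) < 1e-3
```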
6.1.5  General Utilities
Continuing for the present with the (M-closed) hypothesis testing framework, we note that the consequences of incorrectly choosing a model may be less serious if the alternative models are "close" in some sense, in which case utilities of the zero-one type, which take no account of such "closeness", may be inappropriate.
One-sided Tests

We shall illustrate this idea, and forms of possibly more reasonable utility functions, by considering the special case of θ ∈ Θ ⊆ ℝ, with parametric form p(x | θ) and models M1, M2 defined by

p1(x | θ) = p(x | θ),   θ ∈ Θ1 = {θ : θ ≤ θ0},
p2(x | θ) = p(x | θ),   θ ∈ Θ2 = {θ : θ > θ0},
for some θ0 ∈ Θ. The models thus correspond to the hypotheses that the parameter is smaller or larger than some specified value θ0. It seems reasonable in such a situation to suppose that if one were to incorrectly choose M2 (θ > θ0) rather than M1 (θ ≤ θ0), in many cases this would be much less serious if the true value of θ were actually θ0 + ε than if it were θ0 + 100ε, say, for ε > 0. Such arguments suggest that, with the state of the world ω now representing the true parameter value θ, we might specify a utility function of the form

u(mi, θ) = 0 if θ ∈ Θi,   −li(θ) if θ ∉ Θi,
for i = 1, 2, where l1, l2 are increasing positive functions of (θ − θ0) and (θ0 − θ), respectively. The expected utility of the decision corresponding to mi (i.e., the choice of Mi) is therefore given by
ū(mi | x) = − ∫_{Θ − Θi} li(θ) p(θ | x) dθ,   i = 1, 2,

where p(θ | x) is the posterior density for θ given x.
The optimal answer to the inference problem is to prefer M1 to M2 if and only if

ū(m1 | x) > ū(m2 | x),

with explicit solutions depending, of course, on the choices of l1, l2, and the form of p(θ | x), as illustrated in the following example.
Example 6.7. (Normal posterior; linear losses). If l1(θ) = θ − θ0, l2(θ) = k(θ0 − θ), with k reflecting the relative seriousness of "overestimating" by choosing model M2, and p(θ | x), given x = (x1, ..., xn), is N(θ | μn, λn), say, then we have

ū(m1 | x) = − ∫_{θ0}^∞ (θ − θ0) N(θ | μn, λn) dθ = − λn^{−1/2} ψ*(λn^{1/2}(μn − θ0))

and

ū(m2 | x) = − k ∫_{−∞}^{θ0} (θ0 − θ) N(θ | μn, λn) dθ = − k λn^{−1/2} ψ*(λn^{1/2}(θ0 − μn)),

where

ψ*(t) = N(t | 0, 1) + t ∫_{−∞}^{t} N(s | 0, 1) ds.

It is therefore optimal to prefer M1 to M2 if and only if

ψ*(λn^{1/2}(μn − θ0)) < k ψ*(λn^{1/2}(θ0 − μn)).
In the symmetric case, k = 1, it is easily seen that this reduces to preferring M1 if and only if μn < θ0, as one might intuitively have expected. For references and further discussion of related topics, see DeGroot (1970, Chapter 11) and Winkler (1972, Chapter 6).
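The reduction to "prefer M1 iff μn < θ0" in the symmetric case can be checked by brute-force numerical integration of the two expected losses (a sketch with our own function names, using a simple midpoint rule rather than the closed form in ψ*):

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def expected_loss(choice, theta0, k, mean, sd, steps=20000):
    # losses: l1(theta) = theta - theta0 when m1 is wrong (theta > theta0),
    #         l2(theta) = k * (theta0 - theta) when m2 is wrong (theta < theta0)
    lo, hi = mean - 8 * sd, mean + 8 * sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        th = lo + (i + 0.5) * h
        if choice == 1 and th > theta0:
            total += (th - theta0) * normal_pdf(th, mean, sd) * h
        elif choice == 2 and th < theta0:
            total += k * (theta0 - th) * normal_pdf(th, mean, sd) * h
    return total

def prefer_m1(theta0, k, mean, sd):
    return expected_loss(1, theta0, k, mean, sd) < expected_loss(2, theta0, k, mean, sd)

# symmetric case k = 1: prefer M1 exactly when the posterior mean lies below theta0
assert prefer_m1(0.0, 1.0, -0.3, 1.0)
assert not prefer_m1(0.0, 1.0, 0.3, 1.0)
```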
Prediction
Moving away now from model comparisons which reduce to hypothesis tests in parametric models, let us consider the problem of model comparison or choice, given data x, in order to make a point prediction for a future observation y. The general decision structure is that given schematically in Figure 6.2, where, assuming real-valued observables, mi corresponds to acting in accordance with model Mi, and aj, j ∈ Ji, denotes the choice, based on Mi, of a prediction, ŷi, for a future observation y. We shall assume a "quadratic loss" utility,
u(mi, ŷi, y) = −(ŷi − y)²,   i ∈ I.
We recall from the analysis given in Section 6.1.3 that the optimal model choice is m*, given by maximising ū(mi | x) over i ∈ I, where ŷi* is the optimal prediction of a future observation y, given data x and assuming model Mi; that is, the value ŷ which minimises

∫ (ŷ − y)² pi(y | x) dy,
where pi(y | x) is the predictive density for y given model Mi. It then follows immediately that

ŷi* = ∫ y pi(y | x) dy = E[y | Mi, x],

the predictive mean, given model Mi, so that

ū(mi | x) = − ∫ (ŷi* − y)² p(y | x) dy.
Completion of the analysis now depends on the specification of the overall actual belief distribution p(y | x) and the computation of the expectation of (ŷi* − y)², i ∈ I, with respect to p(y | x). Again, in the M-completed case there is nothing further to be said explicitly: one simply carries out the necessary evaluations, using the appropriate form of p(y | x), by numerical integration if necessary. In the M-open case, the detailed analysis of the problem of point prediction with quadratic loss will be given in Section 6.1.6. In the M-closed case, we have

p(y | x) = Σ_{j∈I} pj(y | x) P(Mj | x),
and, after some rearrangement, it is easily seen that

ū(mi | x) = − ∫ (ŷi* − y)² p(y | x) dy = − ∫ y² p(y | x) dy + 2 ŷi* E[y | x] − (ŷi*)².

The first term does not depend on i, and the remaining terms can be rearranged in the form

− (ŷi* − ŷ*)² + (ŷ*)²,

where ŷ* is the weighted prediction

ŷ* = Σ_{j∈I} P(Mj | x) ŷj*.
The preferred model Mi is therefore seen to be that for which the resulting prediction, ŷi*, is closest to ŷ*, the posterior weighted average, over models, of the individual model predictions. If k = 2, it is easily checked that the preferred model is simply that with the highest posterior probability. If we wish to make a prediction, but without first choosing a specific model, it is easily seen that the analysis of the problem in terms of the schematic decision problem given in Figure 6.3 of Section 6.1.2 leads directly to ŷ* as the optimal prediction. Clearly, the above analyses go through in an obvious way, with very few modifications, if, instead of prediction, we were to consider point estimation, with quadratic loss, of a parameter common to all the models. More generally, the analysis can be carried out for loss functions other than the quadratic.
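The "closest to the posterior-weighted prediction" rule is straightforward to compute (an illustrative sketch; the probabilities and predictions are arbitrary made-up numbers):

```python
def weighted_prediction(post_probs, preds):
    # ybar* = sum_i P(M_i | x) * yhat_i
    return sum(w * y for w, y in zip(post_probs, preds))

def preferred_model(post_probs, preds):
    # under quadratic loss, prefer the model whose prediction is closest to ybar*
    ystar = weighted_prediction(post_probs, preds)
    return min(range(len(preds)), key=lambda i: (preds[i] - ystar) ** 2)

probs, preds = [0.5, 0.3, 0.2], [1.0, 2.0, 10.0]
assert abs(weighted_prediction(probs, preds) - 3.1) < 1e-12
assert preferred_model(probs, preds) == 1   # the prediction 2.0 is closest to 3.1
# with just two models, the rule reduces to picking the more probable one
assert preferred_model([0.7, 0.3], [0.0, 1.0]) == 0
```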
Reporting Inferences

Generalising beyond the specific problems of point prediction and estimation, let us consider the problem of model comparison or choice in order to report inferences about some unknown state of the world ω. For example, the latter might be a common model parameter, a function of future observables, an indicator function of the future realisation of a specified event, or whatever. A major theme of our development in Chapters 2 and 3 has been that the problem of reporting beliefs about ω is itself a decision problem, where the set of possible answers to the inference problem consists of the class of probability distributions for ω which are compatible with given data. The appropriate utility functions in such problems were seen to be the score functions discussed in Sections 2.7 and 3.4. This general decision problem is thus a special case of that represented by Figure 6.2, where, given data x, mi represents the choice of model Mi, the subsequent answer aj, j ∈ Ji, to the inference problem is some report of beliefs about ω, assuming Mi, and the utility function is defined by
u(mi, aj, ω) = ui(qj(· | x), ω),
for some score function ui and form of belief report, qj(· | x), about ω, corresponding to aj, j ∈ Ji. If pi(· | x) is the form of belief report for ω actually implied by mi, and if ui is a proper scoring rule (see, for example, Definition 3.16), then it follows that the optimal aj, j ∈ Ji, must be ai* = pi(· | x), and that
u(mi, ai*, ω) = ui(pi(· | x), ω),   i ∈ I.
If, moreover, the score function is local (see, for example, Definition 3.15), we have the logarithmic form

u(mi, ai*, ω) = A log pi(ω | x) + B(ω),   i ∈ I,   A > 0,
for A > 0 and B(ω) arbitrary, in accordance with Proposition 3.13. The expected utility of mi is therefore given by

ū(mi | x) = ∫ {A log pi(ω | x) + B(ω)} p(ω | x) dω,

and the preferred model is the Mi for which this is maximised over i ∈ I. Comments about the detailed implementation of the analysis in the M-open case are similar to those made in the previous problem. For M-closed models, we have the more explicit form

ū(mi | x) = A Σ_{j∈I} P(Mj | x) ∫ pj(ω | x) log pi(ω | x) dω + ∫ B(ω) p(ω | x) dω,

which, after straightforward rearrangement, shows that the preferred Mi is given by minimising, over i ∈ I,

Σ_{j∈I} P(Mj | x) ∫ pj(ω | x) log [pj(ω | x) / pi(ω | x)] dω,
the posterior weighted average, over models, of the logarithmic divergence (or discrepancy) between pi(ω | x) and each of the pj(ω | x), j ≠ i ∈ I. If, instead, we were to adopt the (proper) quadratic scoring rule (see, for example, Definition 3.17), we obtain, ignoring irrelevant constants,
ū(mi | x) ∝ 2 ∫ pi(ω | x) p(ω | x) dω − ∫ pi²(ω | x) dω,

so that, after some algebraic rearrangement, in the case of M-closed models the preferred Mi is seen to be that which minimises, over i ∈ I,

Σ_{j∈I} P(Mj | x) ∫ {pj(ω | x) − pi(ω | x)}² dω.
Comparison of the solutions for the logarithmic and quadratic cases reveals that if, for arbitrary f,

d{p, q} = ∫ f(p(ω), q(ω)) dω

defines a discrepancy measure between p and q, both may be characterised as identifying the Mi for which

Σ_{j∈I} P(Mj | x) d{pj(ω | x), pi(ω | x)}

is minimised over i ∈ I, the differences in the two cases corresponding to the form of f (logarithmic or quadratic, respectively).
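For discrete ω the posterior-weighted-divergence criterion can be computed exactly (a sketch; we use the logarithmic form of f, i.e., the Kullback-Leibler divergence, with made-up densities and weights):

```python
import math

def kl_divergence(p, q):
    # logarithmic divergence of q from p: sum_w p(w) * log(p(w) / q(w))
    return sum(pw * math.log(pw / qw) for pw, qw in zip(p, q) if pw > 0)

def preferred_model(post_probs, densities):
    # minimise, over i, sum_j P(M_j | x) * d{p_j, p_i}
    def criterion(i):
        return sum(w * kl_divergence(pj, densities[i])
                   for w, pj in zip(post_probs, densities))
    return min(range(len(densities)), key=criterion)

p1, p2 = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]
# when the posterior weight concentrates on one model, that model is preferred
assert preferred_model([0.1, 0.9], [p1, p2]) == 1
assert preferred_model([0.9, 0.1], [p1, p2]) == 0
```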
Example 6.8. (Point prediction versus predictive beliefs). To illustrate the different potential implications of model comparison on the basis of quadratic loss for point prediction versus model comparison on the basis of logarithmic score for predictive belief distributions, consider the following simple (M-closed) example. Suppose that alternative models M1, M2 for x = (x1, ..., xn) are defined by

Mi : pi(x) = ∫ ∏_{j=1}^n N(xj | μ, λi) N(μ | μ0, λ0) dμ,   i = 1, 2,
with λ1, λ2, μ0, λ0 known: we are thus assuming normal data models with precisions λ1, λ2, respectively, and uncertainty about μ described by N(μ | μ0, λ0) in both cases. Now, given x, consider two decision problems: the first consists in selecting a model and then providing a point prediction, with respect to quadratic loss, for the next observable, xn+1; the second consists in selecting a model and then providing a predictive distribution for xn+1, with respect to a logarithmic score function. For the first problem, straightforward manipulation shows that the predictive distribution for xn+1 assuming model Mi is given by
where
so that, corresponding to the analysis given earlier in this section, model Mi leads to the prediction μn(i), i = 1, 2, and the preferred model is M1 if and only if
To identify these posterior probabilities, we note that, if s² = n^{−1} Σ (xi − x̄)²,
which, for small λ0, is well approximated by
The posterior model probabilities are then given by
where p12 = P(M1)/P(M2). Model M1 is therefore preferred if and only if

log[B12(x) p12] > 0.

In the case of equal prior weights, p12 = 1 and, assuming small λ0, if we write the condition in terms of the model variances σj² = λj^{−1}, j = 1, 2, we prefer M1 when
Noting that the left-hand side is an intuitively reasonable data-based estimate of the model variance, we see that model choice reduces to a simple cut-off rule in terms of this estimate. For the second decision problem, the logarithmic divergence of p(xn+1 | M1, x) from p(xn+1 | M2, x) is given, for small λ0, by
with a corresponding expression for d21. The general analysis given above thus implies that model M1 is preferred if and only if

P(M1 | x) d21 > P(M2 | x) d12,

i.e., if and only if

B12(x) p12 > d12 / d21
(rather than > 1, as in the point prediction case). Note, incidentally, that should it happen that P(M1 | x) = P(M2 | x), model M1 would be preferred if and only if d12 < d21, which happens if and only if λ1 > λ2. Intuitively, all other things being equal, we prefer in this case the model with the smallest predictive variance. To obtain some limited insight into the numerical implications of these results, consider the case where σ1² = 1, σ2² = 25, n = 4, P(M1) = P(M2) = 1/2 and s² = 3, which gives B12(x) = 0.393, so that P(M1 | x) = 0.28, P(M2 | x) ≈ 0.72. Using the point prediction with quadratic loss criterion, we therefore prefer M2. However, d12 = 1.129 and d21 = 10.31, so that if we want to choose a predictive distribution in accordance with the logarithmic score criterion we prefer M1, since (0.28)/(0.72) > (1.129)/(10.31). However, if s² = 4, the reader might like to verify that M2 is preferred under both criteria (B12(x) = 0.058, implying P(M1 | x) = 0.055).
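The posterior model probabilities quoted in Example 6.8 follow from the Bayes factors by elementary odds arithmetic (a sketch):

```python
def posterior_prob_m1(b12, prior_odds=1.0):
    # P(M1 | x) from the Bayes factor B12 and the prior odds P(M1)/P(M2)
    post_odds = b12 * prior_odds
    return post_odds / (1.0 + post_odds)

# the two cases quoted in Example 6.8, with equal prior weights
assert abs(posterior_prob_m1(0.393) - 0.28) < 0.005
assert abs(posterior_prob_m1(0.058) - 0.055) < 0.005
```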
6.1.6  Approximation by Cross-validation

For the general problem of model choice followed by a subsequent answer to an inference problem, the expected utility of the choice Mi, i ∈ I, is

ū(mi | x) = ∫ u(mi, ai*, ω) p(ω | x) dω,

where ai* denotes the optimal subsequent decision given Mi. In the M-closed case, we have seen that the mixture form of p(ω | x) enables an explicit form of general solution to be exhibited; in the M-completed case, we have noted that the solution is in principle available, given appropriate computation. We turn now to the case of model comparison within the M-open framework, so that p(ω | x) is not available. What can be done to compare the values of ū(mi | x), i ∈ I? We shall illustrate a possible approach to this problem by detailed consideration of the special case where ω = y, a future observation, for which a point prediction with respect to quadratic loss, or a predictive distribution with respect to a logarithmic or quadratic score, is required.

First, we note that, in all these cases, the analysis based on Figure 6.2 implies that the optimal choice of model from M is the Mi for which the expected utility is maximised over i ∈ I, and that, ignoring irrelevant terms, the expected utility has the mathematical form

∫ fi(y, x) p(y | x) dy,

for some function fi of y and x, depending on i. For example, for point prediction with quadratic loss we have fi(y, x) = −(ŷi* − y)², and for a predictive distribution with logarithmic score function we have fi(y, x) = log p(y | Mi, x). But, in this M-open perspective, p(y | x) is not specified. Secondly, we note that there are n possible partitions of x = (x1, ..., xn) into {x_{n−1}(j), xj}, where x_{n−1}(j) = x − {xj} denotes x with xj deleted, j = 1, ..., n, and that, if n is reasonably large, and the x's are exchangeable, each such partition effectively provides x_{n−1}(j) as a "proxy" for x and xj as a "proxy" for y.
If we now randomly select k of these n partitions, a standard law of large numbers argument suggests that, for large n,

(1/k) Σ_{j=1}^k fi(xj, x_{n−1}(j))

provides an approximation to the expected utility of mi, so that the expected utilities of Mi, i ∈ I, can be compared on the basis of these quantities. In the case of point prediction with quadratic loss, we minimise, over i ∈ I,

(1/k) Σ_{j=1}^k {E[y | Mi, x_{n−1}(j)] − xj}²,

which is an average measure, using squared distance, of how well Mi performs when, on a leave-one-out-at-a-time basis, it attempts to predict a missing part of the data from an available subset of the data. In the case of a predictive distribution with a logarithmic score, writing pi(y | x) = p(y | Mi, x), we maximise, over i ∈ I,

(1/k) Σ_{j=1}^k log p(xj | Mi, x_{n−1}(j)),

which can be regarded as an average measure based on the logarithm of the integrated likelihood under model Mi.
We recall from Section 6.1.4 the role of Bayes factors based on the versions {pi(x | θi), pi(θi)} of M1, M2; there, the Bayes factor evaluates M1, M2 on the basis of the models' ability to "predict" x given no data (beyond what has been used to specify the pi(θi)). In the logarithmic score case, the criterion above can be rearranged to show that we prefer model M1 if

(1/k) Σ_{j=1}^k log B12(xj, x_{n−1}(j)) > 0,

where B12(xj, x_{n−1}(j)) denotes the Bayes factor for M1 against M2 based on the models' ability to predict one further observable, xj, given the remaining n − 1 observations; in the above, we are thus taking a geometric average of such Bayes factors. One interesting difference is the following (Pericchi, 1993). The former situation puts the emphasis on "fidelity to the observed data", which results in a preference for models under which the data achieve the highest levels of "internal consistency"; the latter puts the emphasis on "future predictive power". Model choice and estimation procedures involving cross-validation (sometimes called predictive sample reuse) have been proposed by several authors, from a mainly non-Bayesian perspective, as a pragmatic device for generating statistical methods without seeming to invoke a "true" sampling model:
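The leave-one-out criterion described above is straightforward to implement for any model supplying a log predictive density p(xj | x_{n−1}(j)) (an illustrative sketch; the two toy "models" and the data are ours, not from the book):

```python
import math

def loo_log_score(data, log_pred):
    # (1/n) * sum_j log p(x_j | x_{-j}): leave-one-out estimate of the
    # expected logarithmic predictive utility of a model
    total = 0.0
    for j in range(len(data)):
        rest = data[:j] + data[j + 1:]
        total += log_pred(data[j], rest)
    return total / len(data)

def model_a(x, rest):
    # predict x with a unit-variance normal centred at the mean of the rest
    m = sum(rest) / len(rest)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - m) ** 2

def model_b(x, rest):
    # a deliberately poor rival: the same density, but centred at 10
    return -0.5 * math.log(2 * math.pi) - 0.5 * (x - 10.0) ** 2

data = [0.1, -0.2, 0.05, 0.15, -0.1]
# the model that actually tracks the data wins the cross-validation comparison
assert loo_log_score(data, model_a) > loo_log_score(data, model_b)
```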
see, for example, Stone (1974) and Geisser (1975) for early accounts and Shao (1993) for a recent perspective. Thus, if under model Mi there are xl which are "surprising" in the light of x_{n−1}(l), the performance measure will penalise Mi, through large squared distance terms or small log-integrated-likelihood values. These kinds of approximate performance measurements for comparing models could obviously be generalised by considering random partitions of x involving leave-several-out-at-a-time techniques. We shall not develop such ideas further here, apart from giving one further interesting illustration, and merely note that the above approximation to the optimal Bayesian procedure leads naturally to a cross-validation process. The above development clearly establishes that such cross-validatory techniques do indeed have an interesting role in a Bayesian decision-theoretic
Example 6.9. (Lindley's paradox revisited). In Example 6.6, we considered the case of two alternative models M1, M2 for x = (x1, ..., xn), corresponding to simple and composite hypotheses about μ in N(x | μ, λ), with λ known; under M1, μ = μ0, and under M2 uncertainty about μ is described by N(μ | μ1, λ1). The analysis given in Example 6.6 was within the M-closed context, with P(Mi) > 0, i = 1, 2, and it was shown that P(M1 | x) → 1, for any fixed x, as λ1 → 0. It follows that, under either zero-one utility, or quadratic loss utility for point prediction (since in this latter case the criterion reduces to the comparison of posterior probabilities when just two models are being compared), M1 would be the preferred model as λ1 → 0. We shall now reconsider the case of quadratic loss for point prediction in the M-open context, in the absence of a specified actual belief model. First, we note that, under M1, the optimal prediction of a future observation y is just μ0, whereas (making appropriate notational changes to results given earlier) under M2 the optimal prediction, given x, is wn x̄ + (1 − wn) μ1, where wn = nλ/(nλ + λ1). Secondly, from the cross-validation approximation analysis given above, based on k random partitions of x into {x_{n−1}(j), xj}, we see that M1 is preferred to M2 if and only if

Σ_j (xj − μ0)² < Σ_j (xj − ŷ(j))²,

where ŷ(j) denotes the corresponding prediction based on x_{n−1}(j). In particular, if λ1 → 0, an approximate analysis shows that M1 is preferred to M2 if and only if

Σ_j (xj − μ0)² < Σ_j (xj − x̄_{n−1}(j))²,

where x̄_{n−1}(j) is the mean of the sample x with xj omitted. Intuitively, M2 will be preferred if the posterior mean on average does better as a predictor than μ0.
setting for approximating expected utilities in decision problems where a set of alternative models are to be compared without wishing to act as if one of the models were "true". This is easily seen (Pericchi, 1993) to be equivalent to rejecting M1 if a Snedecor F random quantity exceeds the value 2; see Leonard and Ord (1976) for a related argument. This result provides a marked contrast to that obtained in Example 6.6 and makes clear that, even given the same data and utility criterion, preferences among models in M may differ radically, depending on whether one is approaching their comparison from an M-closed or M-open perspective.

6.1.7  Covariate Selection

We have already had occasion to remark several times that our emphasis in this volume is on concepts and theory, and that complex case-studies and associated computation will be more appropriately discussed in subsequent volumes. That said, it might be illuminating at this juncture to indicate briefly how the theory we have developed above can be applied in contexts which are much more complicated than those of the simple, stylised examples on which most of our discussion has been based. To this end, we shall consider the important problem of model comparison which arises when we try to identify appropriate covariates for use in practical prediction and classification problems. To fix ideas, consider the following problem.

Some kind of decision is to be made regarding an unknown state of the world ω relating to an individual unit of a population: for example, classifying the, as yet unknown, disease state of a specific patient, or predicting the, as yet unknown, quality level of the output from a particular run of an industrial production process. Possible predictive models are to be based on covariates y1(z), ..., ym(z), which are themselves selected functions of z = (z1, ..., zs), representing all possible observed relevant attributes (discrete or continuous) for the individual population unit: for example, the patient's complete recorded clinical history, or a record of all the input and control parameters of the production run. To aid the modelling process, a data bank (of "training data") is available, consisting of D = {(ωj, zj), j = 1, ..., n}, recording all the attributes and (eventually known) states of the world for n previously observed units of the same population: for example, n previous patients presenting at the same clinic, or n previous runs of the same production process. We shall suppose that the ultimate objective is to provide, for the state of the world ω of the new individual unit, a predictive distribution p(ω | y(z), D),
Here, y denotes a generic element of the set of all possible {yi, i = 1, ..., m} under consideration for defining covariates. Typically, it will include functions mapping z to zi, so that individual attributes themselves are also eligible to be chosen as covariates. The particular forms in Y will depend, of course, on the practical problem under consideration. To simplify the exposition, we shall suppose that identification of the density p(· | y(z), D) is equivalent to the identification of y ∈ Y, where Y denotes the class of all y under consideration. The resulting decision problem is shown schematically in Figure 6.4.

Figure 6.4 Selection of covariates as a decision problem

If p(z, ω | D) represents the predictive distribution for (z, ω) given the "training data" D, and u{p(· | y(z), D), ω} denotes a utility function for using the predictive form p(· | y(z), D) when ω turns out to be the true state of the world, the different possible models, corresponding to the different possible choices of covariates, are then compared on the basis of their expected utilities. The resulting optimal choice will, of course, depend on the form of the utility function. Typically, the latter will not only incorporate a score function component for assessing p(· | y(z), D), but possibly also a cost component, reflecting the different costs associated with the use of different covariates yi, i = 1, ..., m. For example, in the case of disease classification the use of fewer covariates could well mean cheaper and quicker diagnoses; in the case of predicting production quality the use of fewer covariates could cut costs by requiring less on-line measurement and monitoring.
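The covariate-selection logic described above can be sketched numerically. The following is an illustrative toy implementation only: it assumes a linear predictive model, a negative squared-error score in place of a formal proper scoring rule, a fixed cost per covariate, and a greedy forward search; all of these are hypothetical simplifications, not prescriptions of the text.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients via normal equations (tiny Gaussian elimination)."""
    p = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    c = [sum(X[i][a] * y[i] for i in range(len(X))) for a in range(p)]
    for col in range(p):                       # elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k2 in range(col, p):
                A[r][k2] -= f * A[col][k2]
            c[r] -= f * c[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (c[r] - sum(A[r][k2] * beta[k2] for k2 in range(r + 1, p))) / A[r][r]
    return beta

def loo_score(cols, Z, w):
    """Negative leave-one-out squared error for the covariate subset `cols`."""
    total = 0.0
    for i in range(len(Z)):
        X = [[1.0] + [Z[j][c] for c in cols] for j in range(len(Z)) if j != i]
        y = [w[j] for j in range(len(Z)) if j != i]
        b = ols_fit(X, y)
        xi = [1.0] + [Z[i][c] for c in cols]
        pred = sum(bk * xk for bk, xk in zip(b, xi))
        total -= (w[i] - pred) ** 2
    return total

def greedy_select(Z, w, cost_per_covariate, all_cols):
    """Add covariates while the (score - cost) expected utility keeps improving."""
    chosen, best = [], loo_score([], Z, w)
    while True:
        gains = [(loo_score(chosen + [c], Z, w) - cost_per_covariate * (len(chosen) + 1), c)
                 for c in all_cols if c not in chosen]
        if not gains:
            break
        s, c = max(gains)
        if s <= best:
            break   # marginal expected utility no longer justifies an extra covariate
        best, chosen = s, chosen + [c]
    return chosen

random.seed(0)
Z = [[random.gauss(0, 1) for _ in range(3)] for _ in range(60)]
w = [2.0 * z[0] + random.gauss(0, 0.3) for z in Z]   # only covariate 0 matters
selected = greedy_select(Z, w, cost_per_covariate=1.0, all_cols=[0, 1, 2])
print(selected)
```

The cost term plays exactly the role described above: without it, adding further covariates would rarely decrease the cross-validated score by much, and the search would tend to retain irrelevant attributes.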
If we suppose, for simplicity, that the utility function can be decomposed into additive score and cost components, then, for each y ∈ Y, the expected utility of the choice y is given by

ū(y | D) = ∫∫ s{p(· | y(z), D), ω} p(z, ω | D) dω dz − ∫ c(y(z)) p(z | D) dz.

In many cases, it will be natural to use proper score functions, quadratic or logarithmic, for example. If costs are omitted, the optimal model will typically involve a large number of covariates; if, instead, cost functions are used which increase with the number of covariates in the model, a small subset of the latter will typically be optimal. More pragmatically, one could ignore costs, identify the optimal y(i) over all possible choices of one covariate, two covariates, etc., and then select that y(i) for which ū(y(i+1) | D) − ū(y(i) | D) is less than some appropriately predefined small constant, reflecting the marginal expected utility for the incorporation of further covariates.

Given the complexity of problems of this type, in most applications p(z, ω | D) is likely to prove far too complicated for any honest representation, and an M-closed perspective would not usually be appropriate. Instead, we need to perform a comparison of the models in Y from the M-open perspective, and, recalling the discussion of Section 6.1.6, we may use cross-validation techniques to estimate the required expected utilities. There are interesting open problems in the development of these cross-validation techniques, but discussion of these would take us far into the realm of methods and case-studies, and so will be deferred to the second and third volumes of this work.

6.2 MODEL REJECTION

6.2.1 Model Rejection through Model Comparison

In the previous section, we considered model comparison problems arising from the existence of a proposed range of well-defined possible models, M = {Mi, i ∈ I}, where the primary decision consisted in choosing mi, with the implication of subsequently acting as if the corresponding Mi, i ∈ I, were the predictive model. In this section, we shall be concerned with the situation which arises when just one specific well-defined model for x, M0 say, has been proposed initially, and the primary decision corresponds either to the choice m0, which corresponds to subsequently acting as if M0 were the predictive model, or to rejecting M0, with the implication of "doing something else", thus far clearly ill-defined.
For any specific decision problem, this model rejection problem might be represented schematically by Figure 6.5, where ū(m0 | z) denotes the ultimate expected utility of a primary action.

Figure 6.5 Model rejection as a decision problem

We see from Figure 6.5 that we cannot proceed further unless we have some method for arriving at a value of ū(reject m0 | z) to compare with ū(m0 | z). What perspectives might one adopt in relation to this, thus far ill-defined, problem of model rejection? If we are concerned with coherent action in the context of a well-posed decision problem, we are forced, one way or another, to consider alternative models to M0. This might be done, for example, by consideration of actual alternatives to M0, thought (by someone) to be of practical interest; or it might be done by consideration of formal alternatives, generated by selecting, in some sense, a "mathematical neighbourhood" of M0 (which might also, of course, contain alternatives of practical interest). Such a structure may arise, for example, as a consequence of M0 being the only predictive model thus far put forward in a specific decision problem context; or as a consequence of the application of some kind of principle of simplicity or parsimony, as an attempt to "get away with" using M0, instead of using more complicated (but, in principle, unstated) alternatives.

Let us suppose, therefore, that we have embedded M0, in some way, in some larger class of models M = {Mi, i ∈ I}, shown schematically in Figure 6.6. For this redefined problem of model rejection within M, the calculation of ū(mi | z), i ∈ I, proceeds as indicated in Section 6.1: if we adopt the M-closed perspective, evaluations are based on mixture forms involving prior and posterior probabilities of the Mi; if we adopt the M-completed perspective, the calculation is, in principle, well-defined, but may be numerically involved; if we adopt the M-open perspective, we may use a cross-validation procedure to estimate the expected utilities. Here, i ∈ I' = I − {0} indexes the models in M distinct from M0, so that rejecting m0 corresponds to choosing the best of mi, i ∈ I'.
Figure 6.6 Model rejection within M = {M0; Mi, i ∈ I'}

In the case where {Mi, i ∈ I'} consists of actual alternatives to M0, we might regard the redefined model rejection problem as essentially identical to the model comparison problem. However, when M0 has been put forward for reasons of simplicity or parsimony, this would seem to ignore the fact that there is then an implicit assumption that the latter has some "extra utility". Thus, even if ū(reject m0 | z) − ū(m0 | z) were positive, we might still prefer M0 because of the special "simple" status of M0. From this perspective, the redefinition of the problem of model rejection as one of model comparison corresponds to modifying slightly the representation given in Figure 6.6, by replacing ū(m0 | z) by ū(m0 | z) + e(z), where e(z) represents, given z, an implicit (but as yet undefined) extra utility relating to the special status of M0, over and above the expected utility ū(m0 | z). The same argument applies even more forcibly in the case where {Mi, i ∈ I'} consists of formal alternatives to M0, since rejecting m0 may not then lead obviously to an actual alternative model, and the "extra utility" of choosing M0 if at all possible may be greater. (See Dickey and Kadane, 1980, for related discussion.)

The formulation of the model rejection problem given above is rather too general to develop further in any detail. In order, therefore, to provide concrete illustrative analyses, we shall assume, for the remainder of this chapter, that M = {M0, M1}, where, given z, the predictive models for x are defined, for some parametric family {p(· | θ), θ ∈ Θ}, by

Mi: pi(x) = ∫ p(x | θ) pi(θ) dθ, i = 0, 1.

Specifically, M0 will correspond to either: (i) p(x | θ0), for some θ0 ∈ Θ, a simple hypothesis that θ = θ0 (specified by a degenerate prior p0(θ) which concentrates on θ0); or (ii) p(x | φ0, λ), with θ = (φ, λ), a simple hypothesis φ = φ0 on a parameter of interest φ, where λ is a nuisance parameter (specified by a prior p0(θ) = p(λ | φ0) which concentrates on the subspace defined by φ = φ0).
The next three sections consider some detailed model rejection procedures within this parametric framework.

6.2.2 Discrepancy Measures for Model Rejection

Within the parametric framework described at the end of the previous section, model M0 corresponds to a form of parametric restriction (or "null hypothesis") imposed on model M1, within which M0 is embedded. In such situations, it is common practice to consider the decision problem of model rejection to be that of assessing the compatibility of model M0 with the data z, this being calibrated relative to the wider model M1, with beliefs p(θ | M1) = p1(θ), θ ∈ Θ, and overall beliefs p(θ | z) defined either by an M-closed form, with M = {M0, M1}, or by the {M1}-closed form, the latter providing a kind of "adversarial" analysis, since it assigns M0 no special status. We shall focus on this version of the model rejection problem.

Noting that there are only two alternatives in this decision problem, it suffices to specify the (conditional) difference in utilities,

δ(θ) = u(m1, θ) − u(m0, θ), θ ∈ Θ.

We shall refer to δ(θ) as a (utility-based) discrepancy measure between M0 and M1, since the optimal inference will clearly be to reject model M0 in favour of M1 when parameter values with large discrepancy are plausible. In terms of the discrepancy, the optimal action is to reject model M0, say in favour of the larger model M1, if and only if t(z) > c(z), where

t(z) = ∫ δ(θ) p(θ | z) dθ

and c(z) represents, for given data z, the utility premium attached to keeping the simpler model M0. With a considerable reinterpretation of conventional statistical terminology, we might refer to t(·) as a test statistic, leading to the rejection of model M0 if the observed value of the test statistic exceeds a critical value c(z).
How might c(z) be chosen? One possible approach could be to consider, prior to observing x and assuming M0 to be true, a choice of c(·) such that

P(t(x) > c(x) | M0) = α,

thus obtaining c as the (1 − α)th percentage point of the predictive distribution of t(x), conditional on information available prior to observing x and assuming M0 to be true. Under this approach, we would choose c(·) so as to lead M0 to be rejected only with low probability (α, say, for values of α of the order of 0.05 or 0.01) when M0 is in fact true. It is interesting that this choice turns out to lead to commonly used procedures for model rejection, which have typically not been derived or justified previously from the perspective of a decision problem (see, for example, Box, 1980, for a non-decision-theoretic approach). However, this is just one possible approach to selecting c(·), and it has no special theoretical significance (see Appendix B, Section 3, for criticism of the practice of working with a fixed α). Examples will be given in the following two sections.

6.2.3 Zero-one Discrepancies

Suppose that the discrepancy measure introduced in Section 6.2.2 is defined to be zero-one: δ(θ) = 1 if θ ≠ θ0 and δ(θ0) = 0, if θ0 specifies a simple hypothesis; with the obvious modification, δ(θ) = 1 if φ ≠ φ0, if θ0 = (φ0, λ). Assuming the decision problem of model rejection to be defined from the M-closed perspective, we obtain

t(z) = P(M1 | z) = P(M1) p(z | M1) / { P(M0) p(z | M0) + P(M1) p(z | M1) },

where p(z | M0) = ∫ p(z | φ0, λ) p(λ | φ0) dλ and p(z | M1) = ∫ p(z | θ) p1(θ) dθ. It follows from the analysis given in the previous section that M0 should be rejected if and only if, for specified critical value c(z), P(M1 | z) > c(z),
i.e., if the posterior odds on M0 and against M1 are smaller than (1 − c(z))/c(z). In the case where the prior odds are equal, P(M0) = P(M1) = 1/2, the rejection criterion for M0 in terms of the Bayes factor is given by

B01 = p(z | M0) / p(z | M1) < (1 − c(z)) / c(z).

Example 6.11. (Null hypothesis for a binomial parameter). Suppose that z represents r successes in n binomial trials, with M0 defined by the simple hypothesis θ = θ0 for the success probability, and M1 defined by the uniform prior p1(θ) = 1, 0 ≤ θ ≤ 1. Straightforward manipulation shows that

B01 = (n + 1) (n choose r) θ0^r (1 − θ0)^(n−r),

which, applying Stirling's formula in the form

log Γ(n + 1) ≈ (n + 1/2) log n − n + (1/2) log(2π) + (12n)^(−1),

can be approximated, for large n, by

2 log B01 ≈ 2 log(n + 1) − log{2πnθ0(1 − θ0)} − χ²,

where

χ² = (r − nθ0)² / {nθ0(1 − θ0)}

is the usual chi-squared test statistic. By considering 2 log B01, and having decided on a value of c(z) = c(α, n) which calibrates the procedure to reject M0 only with probability α when M0 is true, the rejection procedure is simply defined by comparing the test statistic value, χ², with its tail critical value, χ²α, where χ²α denotes the upper 100α% point of a χ²₁ distribution. Of course, on this particular approach to the choice of c(z), there is no real need to identify c(z) explicitly! The reader might like to verify (perhaps with the aid of Jeffreys, 1939/1961, Chapter 5) that similar results can be obtained for a variety of "null models" in more general contingency tables.
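The exact Bayes factor in Example 6.11 and its chi-squared approximation can be checked numerically. The sketch below assumes the uniform prior under M1, so that p(r | M1) = 1/(n + 1); the particular values of n, θ0 and r are arbitrary illustrations.

```python
import math

def log_bayes_factor_01(r, n, theta0):
    """log B01 for H0: theta = theta0 vs H1: theta ~ Uniform(0, 1).

    Under M1, p(r | M1) = C(n, r) * Beta(r + 1, n - r + 1) = 1 / (n + 1).
    """
    log_p0 = (math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)
              + r * math.log(theta0) + (n - r) * math.log(1 - theta0))
    log_p1 = -math.log(n + 1)
    return log_p0 - log_p1

def chi2_stat(r, n, theta0):
    """The usual chi-squared statistic for the binomial null hypothesis."""
    return (r - n * theta0) ** 2 / (n * theta0 * (1 - theta0))

n, theta0 = 1000, 0.5
for r in (500, 520, 540):
    lb = 2 * log_bayes_factor_01(r, n, theta0)
    x2 = chi2_stat(r, n, theta0)
    # asymptotically, 2*log(B01) + chi2 is a constant depending only on (n, theta0)
    print(r, round(lb, 2), round(x2, 2), round(lb + x2, 2))
```

Rejecting M0 when 2 log B01 falls below a threshold is therefore, for large n, the same as rejecting when χ² exceeds a corresponding critical value, as described in the text.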
6.2.4 General Discrepancies

Given our systematic use throughout this volume of the logarithmic divergence between two distributions, an obviously appealing form of discrepancy measure is given by

δ(θ) = ∫ p(x | θ) log{ p(x | θ) / p(x | θ0) } dx,

where p(x | θ) is the general parametric form of model M1. In the case of a location-scale model, we might also consider a suitably standardised version of this measure. In any case, the general prescription will be to reject M0 if t(z) > c(z), for some appropriate c(z), where

t(z) = ∫ δ(θ) p(θ | z) dθ,

with p(θ | z) derived either from an M-closed model, or from the "adversarial" form corresponding to assuming model M1, as illustrated in Section 6.2.2.

Example 6.12. (Lindley's paradox revisited, again). In Examples 6.9 and 6.10 we considered models for x = (x1, ..., xn) based on N(x | μ, λ), λ known, with the null model specified by μ = μ0, so that θ0 = (μ0, λ) and θ = (μ, λ). Using the logarithmic divergence discrepancy, we obtain

δ(μ) = (n/2) λ (μ − μ0)²,

which is just a multiple (by n/2) of a natural, standardised measure (the non-centrality parameter) suggested by intuition as a discrepancy measure for a location-scale family.
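The general prescription t(z) = ∫ δ(θ) p(θ | z) dθ can be evaluated directly by simulation from the posterior. The sketch below does this for the normal discrepancy just derived, checking the Monte Carlo value against the closed form t(z) = (1 + z²)/2, z = √(nλ)(x̄ − μ0), which follows from the reference posterior N(μ | x̄, nλ) used in the continuation of this example; the data are hypothetical.

```python
import math
import random

def discrepancy_t_mc(x, mu0, lam, n_draws=200000, seed=5):
    """Monte Carlo evaluation of t(z) = E[delta(mu) | z], with
    delta(mu) = (n/2)*lam*(mu - mu0)^2 and reference posterior
    p(mu | z) = N(mu | xbar, n*lam)."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    sd = 1.0 / math.sqrt(n * lam)          # posterior standard deviation of mu
    total = 0.0
    for _ in range(n_draws):
        mu = rng.gauss(xbar, sd)
        total += 0.5 * n * lam * (mu - mu0) ** 2
    return total / n_draws

def discrepancy_t_exact(x, mu0, lam):
    """Closed form: t(z) = (1 + z^2)/2 with z = sqrt(n*lam)*(xbar - mu0)."""
    n = len(x)
    xbar = sum(x) / n
    z = math.sqrt(n * lam) * (xbar - mu0)
    return 0.5 * (1.0 + z ** 2)

random.seed(5)
x = [random.gauss(0.4, 1.0) for _ in range(25)]
t_mc = discrepancy_t_mc(x, mu0=0.0, lam=1.0)
t_ex = discrepancy_t_exact(x, mu0=0.0, lam=1.0)
print(round(t_mc, 2), round(t_ex, 2))
```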
Assuming the reference prior for μ derived from an {M1}-closed perspective, which is easily seen to be uniform, we have the reference posterior p(μ | z) = N(μ | x̄, nλ), and hence it follows that

t(z) = E[δ(μ) | z] = (1/2){ 1 + nλ (x̄ − μ0)² },

so that we see that √(nλ)(x̄ − μ0) is a version of the standard significance test statistic for a normal location null hypothesis. With respect to p(z | θ0), this has an N(0, 1) distribution, and the appropriately calibrated value of c(z) is thus implicitly defined by rejecting M0 if |√(nλ)(x̄ − μ0)| exceeds the upper 100(α/2)% point of an N(0, 1) distribution.

The above analysis is easily generalised to the case of unknown λ. Here, the reference posterior for μ has a Student form, as a special case of Proposition 5.24, and the statistic

t(z) = √n (x̄ − μ0)/s', where ns² = Σj (xj − x̄)² and s'² = ns²/(n − 1),

is a version of the standard significance test statistic for a normal location null hypothesis in the presence of an unknown scale parameter, σ² = λ^(−1). With respect to p(z | θ0), this has a St(t | 0, 1, n − 1) distribution, and the appropriately calibrated value of c(z) is defined by the standard rejection procedure.

The reader can easily extend the above analyses to other stylised test situations: for example, testing the equality of means in two independent normal samples, with known or unknown (equal) precisions. Multivariate normal location cases are also easily dealt with, the logarithmic divergence discrepancy in this case being proportional to the Mahalanobis distance (see Ferrándiz, 1985). Rueda (1992) provides the general expressions for one-dimensional regular exponential family models. We shall not pursue such cases further here, since it seems to us that detailed discussion of model rejection and comparison procedures all too easily becomes artificial
outside the disciplined context of real applications of the kind we shall introduce in the second and third volumes of this work. From the perspective of this volume, we have taken the analyses of this chapter sufficiently far to demonstrate the creative possibilities for model choice and comparison within the disciplined framework of Bayesian decision theory.

6.3 DISCUSSION AND FURTHER REFERENCES

6.3.1 Overview

We have argued that, both from the perspective of a sensitive individual modeller, and also from that of a group of modellers, there are frequently strong reasons for considering a range of possible models. This obviously leads to the problem of model comparison, or model choice, and our approach has been to consider formally a decision problem where the action space is the class of available models. In this setting, we have shown that "natural" Bayesian solutions, such as choosing the model with the highest posterior probability, are obtained as particular cases of the general structure for stylised, appropriately chosen, loss functions. We have also considered the generally ill-posed problem of model rejection, where the primary decision consists in acting as if the proposed model were true without having specific alternatives in mind, and have shown that useful results may be obtained by embedding the proposed model within a larger class, and then using discrepancy measures as loss functions in order to decide whether or not the original simpler model may be retained after all.

There is an extensive Bayesian literature directly related to the issues discussed in this chapter. Some authors adopt a purely inferential approach, by deriving either posterior probabilities, or Bayes factors, for competing models; see, for example, Lindley (1965), Dickey and Lientz (1970), Zellner (1971), Dickey (1971, 1972), Leamer (1978), Smith and Spiegelhalter (1980), Zellner and Siow (1980), Spiegelhalter and Smith (1982), San Martini and Spezzaferri (1984), Zellner (1984), Berger and Delampady (1987), Poskitt (1987), Pettit and Young (1990), Aitkin (1991), Felsenstein (1992), Gómez-Villegas and Gómez (1992), Kass and Vaidyanathan (1992), McCulloch and Rossi (1992) and Lindley (1993). Others openly adopt a decision-theoretic approach; see, for example, Karlin and Rubin (1956), Schlaifer (1961), Raiffa and Schlaifer (1961), Box and Hill (1967), DeGroot (1970), Bernardo (1980, 1982, 1985a), Bernardo and Bayarri (1985), Berger (1985a), Ferrándiz (1985) and Rueda (1992).
6.3.2 Modelling and remodelling

We have already argued that we see Bayesian statistics as a rather formalised procedure for inference and decision making within a well-defined probabilistic structure. Fully specified belief models are an integral part of this structure, but it would be highly unrealistic to expect that, in any particular application, such a belief model will be general enough to pragmatically encompass a defensible description of reality from the very beginning. In practice, we typically first consider simple models, which may have been informally suggested by a combination of exploratory data analysis, graphical analysis and prior experience with similar situations. And even with such a simple model, more formal investigation of its adequacy, and of the consequences of using it, will often be necessary before one is prepared to seriously consider the model as a predictive specification. Such investigations will typically include residual analysis, cluster analysis, identification of outliers and/or influential data, and the behaviour of diagnostic statistics when compared with their predictive distributions. As a consequence of this probing, mainly exploratory, analysis, a class of alternative models will typically emerge. In this chapter we have discussed some of the procedures which may be useful in a formal comparison of such alternative models.

The outcome of this strategy will typically be a more refined model, for which a similar type of analysis may be repeated again. This remodelling process is never fully completed, in that either new data, or an imaginative new idea, may force one to make yet another iteration towards the never attainable "perfect", all-powerful predictive machine. Naturally, a pragmatic combination of time constraints, data limitations, and capacity of imagination will force this sequence of informal exploration and formal analysis to eventually settle on the use of a particular belief model, which hopefully can be defended as a sensible and useful conceptual representation of the problem.

Bayarri and DeGroot (1987, 1992a) provide a Bayesian analysis of selection models, where data are randomly selected from a proper subset of the sample space rather than from the entire population. Other relevant references include Johnson and Geisser (1982, 1983, 1985), Pettit and Smith (1983, 1985), Pettit (1986), Geisser (1987, 1990, 1992, 1993), Chaloner and Brant (1988), McCulloch (1989), Verdinelli and Wasserman (1991), Gelfand, Dey and Chang (1992), Weiss and Cook (1992), Guttman and Peña (1993), Peña and Guttman (1993), and Chaloner (1994).

6.3.3 Critical issues

We shall comment further on six aspects of the general topic of remodelling, under the following subheadings: (i) Model Choice and Model Criticism; (ii) Inference
and Decision; (iii) Overfitting and Cross-validation; (iv) Improper Priors; (v) Scientific Reporting; and (vi) Computer Software.

Model Choice and Model Criticism

We have reviewed several procedures which, under different headings such as model comparison, model choice or model selection, may be used to choose among a class of alternative models, and we have argued that, from a decision-theoretical point of view, the problem of accepting that a particular model is suitable is ill-defined unless an alternative is formally considered. However, partly due to the classical heritage of significance testing, and given the obvious attraction of being able to check the adequacy of a given model without explicit consideration of any alternative, non-decision-theoretic Bayesians have often tried to produce procedures which evaluate the compatibility of the data with specific models.

The use of predictive distributions to check model assumptions was pioneered by Jeffreys (1939/1961). As clearly stated by Box (1980), the posterior distribution of the parameters only permits the estimation of the parameters conditional on the adequacy of the entertained model, while the predictive distribution makes possible criticism of the entertained model in the light of current data. Moreover, the predictive distributions which correspond to different models are comparable among themselves, while, in general, the posteriors are not.

The basic idea consists of defining a set of appropriate diagnostic functions ti = ti(z_{n+1}, ..., z_{n+k}), based on a different sample from the same population, and comparing their actual values with their predictive distributions p_{ti}(· | z_1, ..., z_n). Possible comparisons include checking whether or not the observed ti's belong to appropriate predictive HPD intervals, or determining the predictive probability of observing ti's more "outlying" than those observed. The reader will readily appreciate that common techniques, such as residual analysis, segregation of homogeneous clusters, identification of influential observations, or outlier screening, can all be reformulated as particular implementations of this general framework. However, we see these very useful activities as part of the informal process that necessarily precedes the formulation of a model which we can then seriously entertain as an adequate predictive tool. Additional references include Geisser (1966, 1971, 1985, 1988, 1993), Box and Tiao (1973), Dempster (1975), Geisser and Eddy (1979), Rubin (1984), Bernardo and Bermúdez (1985), Clayton et al. (1986), Hodges (1990, 1992), Gelfand, Dey and Chang (1992) and Girón et al. (1992).
It seems to us inescapable, however, that if a formal decision is to be reached on whether or not to operate with a given model, by explicitly evaluating the degree of compatibility between an observed t and its predictive distribution, then some form of alternative must be considered. Very often, the consequences of entertaining a particular model may usefully be examined in terms of the discrepancies between the prediction provided by the model for the value of a relevant observable vector, t, and its actual, eventually observed, value, via utility functions of the general type u(p_t(·), t). Scoring rules provide natural utility functions to use in this context.

Inference and Decision

Throughout this volume, we have emphasised the advantages of using a formal decision-oriented approach to the stylised statistical problems which represent a large proportion of the theoretical statistical literature. These advantages are specially obvious in model comparison since, by requiring the specification of an appropriate utility function, they make explicit the identification of those aspects of the model which really matter. We have seen, moreover, that the more traditional Bayesian approaches to model comparison, such as determining the posterior probabilities of competing models or computing the relevant Bayes factors, can be obtained as particular cases of the general structure by using appropriately chosen, stylised utility functions. For further discussion of model choice, see Winkler (1980), Klein and Brown (1984), Krasker (1984), Florens and Mouchart (1985), Poirier (1985), Hill (1986, 1990), Skene et al. (1986) and West (1986), and references therein. For recent work, see Gelfand, Dey and Chang (1992).

Overfitting and Cross-Validation

If we are hoping for a positive evaluation of a prediction, it is crucial that the predictive distribution is based on data which do not include the value to be predicted; otherwise, severe overfitting may occur. Pragmatically, although it is sometimes possible to check the predictions of the model under investigation by using a totally different set of data from that used to develop the model, it is far more common to be obliged to do both model construction and model checking with the same data set. A natural solution consists of randomly partitioning the available sample z = {x1, ..., xn} into two subsamples, one of which is used to produce the relevant predictive distributions, and the other to compute the diagnostic functions, the procedure then being repeated as many times as is necessary to reach stable conclusions. This technique is usually known as cross-validation. For a discussion of how cross-validation may be seen as approximating formal Bayes procedures, see Section 6.1.6; see also Peña and Tiao (1992).
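The random-partition scheme just described can be sketched numerically. The sketch below is a minimal, assumption-laden illustration: it takes a normal model with known precision and a uniform reference prior on the mean (so the reference predictive is proper after a single observation), a logarithmic utility, and held-out subsamples of size one; none of these choices is prescribed by the text.

```python
import math
import random

def log_predictive_density(xnew, data, lam):
    """Reference posterior predictive N(xnew | xbar_m, lam*m/(m+1)) for a
    normal model with known precision lam and uniform reference prior on
    the mean; proper as soon as m >= 1 observations have been seen."""
    m = len(data)
    xbar = sum(data) / m
    prec = lam * m / (m + 1.0)             # predictive precision
    return 0.5 * math.log(prec / (2 * math.pi)) - 0.5 * prec * (xnew - xbar) ** 2

def cv_expected_utility(x, lam, k=1, n_partitions=500, seed=3):
    """Monte Carlo cross-validation estimate of the expected log-predictive
    utility of the model: average u{p(t(z_j) | z_[j]), t(z_j)} over random
    partitions, with z_j of size k held out each time."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    total = 0.0
    for _ in range(n_partitions):
        rng.shuffle(idx)
        held_out = idx[:k]                 # z_j, of size k
        rest = idx[k:]                     # z_[j], everything else
        data = [x[i] for i in rest]
        total += sum(log_predictive_density(x[i], data, lam) for i in held_out) / k
    return total / n_partitions

random.seed(3)
x = [random.gauss(1.0, 1.0) for _ in range(40)]
u_hat = cv_expected_utility(x, lam=1.0)
print(round(u_hat, 3))
```

Note that, as remarked below, the improper uniform prior causes no difficulty here: each term of the average conditions on at least one observation, so every predictive density used is proper.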
A possible systematic approach to cross-validation, starting from a sample z = {x1, ..., xn} of size n and a model {p(z | θ), p(θ)}, involves the following steps:

(i) Define a sample size k, where k ≤ n, large enough to evaluate the relevant observable function t = t(x1, ..., xk). The observable function could either be that predictive aspect of the model which is of interest, as described by the utility function, or a diagnostic function, as described in the above approach to model criticism.

(ii) Determine the set of all predictive distributions of the form p(t(z_j) | z_[j]), where z_j is a subsample of z of size k, and z_[j] consists of all the xi's in z which are not in z_j.

(iii) Estimate the expected utility of the model by the average, over such subsamples, of u{p(t(z_j) | z_[j]), t(z_j)}.

Note that the last expression is simply a Monte Carlo approximation to the exact value of the expected utility. We also note that this programme may be carried out with reference distributions, since the corresponding reference (posterior) predictive distributions π(t(z_j) | z_[j]) will be proper even if the reference prior π(θ) is not.

Improper Priors

In the context of analysis predicated on a fixed model, we have seen in Chapter 5 that perfectly proper posterior parametric and predictive inferences can be obtained for improper prior specifications. When it comes to comparing models, however, the use of improper priors is in general much more problematic. We first note that the predictive densities pi(z) typically play a key role in model comparisons, for a range of specific decision problems and perspectives on model choice. But if one or more of the priors pi(θi) is not a proper density, the corresponding pi(z)'s will also be improper, thus precluding,
.. However. in which case it can be argued that the ratio p.6 that the conventional Bayes factor is used to assess the models' ability to "predict Z" from { p l ( z1 6 . . . Indeed. with the reference prior approach some models are implicitly disadvantaged relative to others. An exception to this arises when two mcdels '11. x. equivalently.in a way which may be expected to achieve neutral discrimination between the models. say.(z). does provide a meaningful comparison between the two models. (6. even for improper 1 (nonpathological) priors p. However. is a well known simple example of this behaviour. . with a formal improper specification for the prior component of a model an initial amount of data needs to be passed through the prior to posterior process before the model attains proper status as a predictive tool and can hence be compared with other alternative predictive tools. since. ) . for sufficiently large I ) .. Eaves (1985) and Consonni and Veronese (1992a). Essentially. have common B and improper p ( B ) . proper.(z)/p.. there is an inherent difficulty with these methods when the models compared have different dimensions. ( J ) ) will be proper.6. .. the calculation of posterior probabilities for models in an Mclosed analysis. also. this is due to the fact that the amount of information about the parameters of interest to be expected from the data crucially depends on the prior distribution of the nuisance parameters present in the model. Spiegelhalter and Smith (1982).). no problem arises. discussed earlier in this chapter. p 7 ( O l ) } we see that the latter does run into trouble if p . ( O l ) is im. p . . then. Another possible solution to the problem of comparing two models in the case of improper prior specification for the parameters is to exploit the use of crossvalidation as a Monte Carlo approximation to a Bayes decision rule. Some suggestions along these and similar lines include Bemardo ( I 980). (8.422 6 Remodelling for example. 1957). 
for the problem of predicting a future observation using a logpredictive utility function the (Monte Carlo approximated) criterion for model choice involves the geometric mean of Bayes factors of the form where ( z ~ ( j )~ denotes a partition of 2 = ( $ 1 . p . Lindley's paradox (see. I . ( ~is) not proper. and . recalling from our discussion in Section 6.] Since. As we saw in Section 6.). A possible solution consists in specifying the improper prior probabilities of the modelsor.. Pericchi (19841. weighting the Bayes factors. Bartlett. x. I z l . . A!. technically. the Bayes factor in favour of A!.1.
prefer hi1 if . These include: J. based on partitioning x and averaging over random partitions. we see the explicit role of the geometric average of Bayes factors. ) where B t 2 ( z r .6. . and j = 1.is to take partitions of the form x = [xs(j).. The proposal is now to select randomly k such partitions. Proceeding formally. R. ) . 0. At the time of writing we are aware of work in progress by several researchers who propose forms related to those discussed here. and overcome the problem of the impropriety of p t ( O . Berger and L. de Vos cfair Bayes factors). O’Hagan Cfroctionul Bayes factors) and A. However. and approximate the lefthand side of the criterion inequality by The (Monte Carlo approximated) model choice criterion then becomes. F. Again. if we were to take logpi(x) as the utility of choosing the Mopen perspective prefers lcll if where p ( z ) is not specified. but with the latter “reversing”. (:). The closest we can come to this. . A. ~ wheres(2 l)isthesmallestintegersuchthatbothp~(81~ s ( j )andpz(02 I zs(j)) e ) are proper.1. Again. ( j )z s ( j )denotes the Bayes factor for All against A h based on the versions of A f l . . we can approach the evaluation of the lefthand side as a Monte Carlo calculation.3 Discussion and Further References 423 Mi.. .6. the role of past and future data compared with the form obtained in Section 6. Pericchi (intrinsic Bayes factors). in this contect we want partitions where the proxy for the predictive part resembles data z and the proxy for the conditioning part resembles “no data”. A&. . . in a sense.~ ( j ) ] .
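The Monte Carlo averaging of log Bayes factors over random partitions can be sketched as follows. This is a minimal illustration, not code from the text: both models are Bernoulli models with proper Beta priors (chosen arbitrarily here), so every term is well defined; with improper priors one would first condition each term on a minimal training sample, as described above.

```python
import math
import random

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_pred(xs_new, xs_cond, a, b):
    # log p(xs_new | xs_cond) under a Be(a, b)-Bernoulli model: the
    # posterior is Be(a + r, b + n - r), and the predictive probability
    # of a particular new 0/1 sequence has a closed Beta-ratio form.
    r, n = sum(xs_cond), len(xs_cond)
    s, m = sum(xs_new), len(xs_new)
    return log_beta(a + r + s, b + n - r + m - s) - log_beta(a + r, b + n - r)

def mc_log_bayes_factor(x, k, prior1, prior2, n_splits=200, seed=0):
    # Average, over random size-k subsamples, of the log Bayes factor for
    # model 1 against model 2, each term conditioning on the complement.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_splits):
        idx = set(rng.sample(range(len(x)), k))
        xj = [x[i] for i in idx]
        rest = [x[i] for i in range(len(x)) if i not in idx]
        total += log_pred(xj, rest, *prior1) - log_pred(xj, rest, *prior2)
    return total / n_splits

# Illustrative data (75% successes); model 2's Be(1, 9) prior favours
# small theta, so it should predict these data worse than Be(1, 1).
data = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
lbf = mc_log_bayes_factor(data, k=3, prior1=(1.0, 1.0), prior2=(1.0, 9.0))
```

A positive average log Bayes factor corresponds to the criterion "prefer M1" above.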
From a practical perspective, where models are evaluated in terms of their predictive behaviour, it might be desirable to trade off, in utility terms, "fidelity to the observed data" and "future predictive power". This can be formalised by adopting a utility function combining the two aspects, which, in turn, leads to a criterion of the same general "prefer M1 if" form. Work in progress (by L. Pericchi and A. F. M. Smith) suggests that such a criterion effectively encompasses and extends a number of current criteria. The solution to such problems lies in combining ideas from Chapters 4 and 6.

We conclude by emphasising again that a predictivistic decision-theoretical approach to model comparison bypasses the dimensionality issue, since posterior predictive distributions obtained from models with different dimensions are always directly comparable.

Scientific Reporting

Our whole development has been predicated on the central idea of normative standards for an individual wishing to act coherently in response to uncertainty. On the one hand, personal probabilities are the key element in this process and, at any given moment in the learning cycle, are the encapsulation of the current response to the uncertainties of interest. On the other hand, there has also been a widespread view (see, for example, Fisher, 1956/1973) that it would be somehow subversive to sully the nobler, objective processes of science by allowing subjective beliefs to enter the picture. Thus, while many are willing to concede that, in narrowly focused decision-making, such beliefs are an essential element, beliefs as individual, personal responses are held to have no place in scientific reporting. As Dickey (1973) has remarked, rhetorically:

    But is not personal knowledge, or opinion, like superstition, non-objective and unscientific, and therefore to be avoided in science? Who cares to read about a scientific reporter's opinion as described by his prior and posterior probabilities?

We have already made clear our own general view that objectivity has no meaning in this context apart from that pragmatically endowed by thinking of it as a shorthand for subjective consensus. However, there are clearly practical problems of communication between analysts and audiences which need addressing. Communicating a single opinion ought not to be the purpose of a scientific report; rather, scientific reports should objectively exhibit as much as possible of the inferential content of the data, to let the data speak for themselves by giving the effect of the data on the wide diversity of real prior opinions. To quote Dickey again:

    ... an experimenter can summarise the application of Bayes' theorem to whole ranges of prior distributions, derived to include the opinions of his readers ... the data-specific prior-to-posterior transformation of the collection of all personal probability distributions on the parameters of a realistically rich statistical model.

There is nothing within the world of Bayesian Statistics that prohibits a scientist from performing and reporting a range of "what if?" analyses. And we have seen, from several perspectives, that entertaining and comparing a range of models fits perfectly naturally within the formalism. We believe that this is the way forward, although, it has to be said, there are also some obvious technical challenges in making such a programme routinely implementable in practice, and there is a great deal of work to be done in effecting such a cultural change. We shall return to this general problem in the second and third volumes of this work. For early thoughts on these issues, see Edwards et al. (1963) and Hildreth (1963); for technical expositions, see Dickey (1973), Roberts (1974) and Dickey and Freeman (1975); for a discussion in the context of a public policy debate, see Smith (1978).

Computer Software

Despite the emphasis in Chapters 2 and 3 of this volume on foundational issues, necessary for a complete treatment of Bayesian theory, we are well aware that the majority of practising statisticians are more likely to be influenced by positive, preferably hands-on, experience with applications of methods to concrete problems than they ever will be by philosophical victories attained through the (empirically) bloodless means of axiomatics and stylised counterexamples. We are also well aware that the availability of suitable software is the key to the possibility of obtaining that hands-on experience. Computational power grows apace, as does the sophistication of graphical displays. But what are the appropriate software tools for Bayesian Statistics? What software? For whom? For what kinds of problems and purposes? A number of such issues were reviewed in Smith (1988) but, at the time of writing, many still remain unresolved. Goel (1988) provided a review of Bayesian software in the late 1980s.

Examples of creative use of modern software in Bayesian analysis include: Smith et al. (1985, 1987) and Racine-Poon et al. (1986), who describe the use of the Bayes Four package; Grieve (1987); Racine-Poon (1992), who discusses sample-assisted graphical analysis; Tierney (1990, 1991, 1992), who presents LISP-STAT, an object-oriented environment for statistical computing, and discusses possible uses of graphical animation; Wooff (1992), who describes [B/D], an implementation of the subjectivist analysis of beliefs described by Goldstein (1981, 1988, and references therein); Thomas et al. (1992), who describe BUGS, a program to perform Bayesian inference using Gibbs sampling; Cowell (1992) and Spiegelhalter and Cowell (1992), who describe and apply the probabilistic expert system shell BAIES, building on Lauritzen and Spiegelhalter (1988), on which the commercial expert system builder Ergo(TM) is also based; Korsan (1992), who makes use of the commercial package Mathematica(TM); Albert (1990) and Marriott and Naylor (1993), who discuss the use of MINITAB to teach Bayesian statistics; and Ley and Steel (1992). Further review and detailed illustration will be provided in the volumes Bayesian Computation and Bayesian Methods.
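The "what if?" reporting idea is easy to make concrete for conjugate models. The following sketch (illustrative data and prior grid, assumed for this example only) maps the same binomial data through a range of Beta priors and tabulates the resulting posterior summaries, so a reader can see the effect of the data on a diversity of prior opinions.

```python
# "What if?" report: r successes in n trials, pushed through several
# Beta priors; each prior in the grid is a hypothetical reader's opinion.
def posterior_summary(a, b, r, n):
    # Beta(a, b) prior -> Beta(a + r, b + n - r) posterior; return the
    # posterior mean and variance of theta.
    a_n, b_n = a + r, b + n - r
    mean = a_n / (a_n + b_n)
    var = a_n * b_n / ((a_n + b_n) ** 2 * (a_n + b_n + 1.0))
    return mean, var

r, n = 7, 10
report = {(a, b): posterior_summary(a, b, r, n)
          for (a, b) in [(1.0, 1.0), (0.5, 0.5), (2.0, 2.0), (5.0, 5.0)]}
```

A report of this kind exhibits the data-specific prior-to-posterior transformation rather than a single analyst's opinion.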
Appendix A

Summary of Basic Formulae

Summary

Two sets of tables are provided for reference. The first records the notation, definition, parameter range, variable range, and first two moments of the probability distributions (discrete and continuous, univariate and multivariate) used in this volume. The second records the basic elements of standard Bayesian inference processes for a number of special cases; in particular, it records the appropriate likelihood function, the sufficient statistics, the conjugate prior and corresponding posterior and predictive distributions, and the reference prior and corresponding reference posterior and predictive distributions.

A.1 PROBABILITY DISTRIBUTIONS

The first section of this Appendix consists of a set of tables which record the definition and the first two moments of the most common probability distributions used in this volume.
Univariate Discrete Distributions

Br(x | theta)   Bernoulli
    0 < theta < 1;  x = 0, 1
    p(x) = theta^x (1 - theta)^(1-x)
    E[x] = theta,  V[x] = theta (1 - theta)

Bi(x | theta, n)   Binomial
    0 < theta < 1;  n = 1, 2, ...;  x = 0, 1, ..., n
    p(x) = C(n, x) theta^x (1 - theta)^(n-x)
    E[x] = n theta,  V[x] = n theta (1 - theta)

Bb(x | alpha, beta, n)   Binomial-Beta
    alpha > 0, beta > 0;  n = 1, 2, ...;  x = 0, 1, ..., n
    p(x) = c C(n, x) Gamma(alpha + x) Gamma(beta + n - x),
      c = Gamma(alpha + beta) / {Gamma(alpha) Gamma(beta) Gamma(alpha + beta + n)}
    E[x] = n alpha / (alpha + beta)
    V[x] = n alpha beta (alpha + beta + n) / {(alpha + beta)^2 (alpha + beta + 1)}

Hy(x | N, M, n)   Hypergeometric
    N = 1, 2, ...;  M = 1, 2, ...;  n = 1, ..., N + M
    max(0, n - M) <= x <= min(n, N)
    p(x) = c C(N, x) C(M, n - x),  c = C(N + M, n)^(-1)
    E[x] = n N / (N + M)
    V[x] = n {N / (N + M)} {M / (N + M)} (N + M - n) / (N + M - 1)
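The tabulated moments can be checked numerically. As an illustration (parameter values chosen arbitrarily), the following sketch evaluates the Binomial-Beta pmf via log-gamma functions and confirms that it sums to one with the stated mean and variance.

```python
from math import comb, exp, lgamma

def bb_pmf(x, a, b, n):
    # Binomial-Beta pmf Bb(x | a, b, n), computed on the log scale for
    # numerical stability.
    log_p = (lgamma(a + b) - lgamma(a) - lgamma(b)
             + lgamma(a + x) + lgamma(b + n - x) - lgamma(a + b + n))
    return comb(n, x) * exp(log_p)

a, b, n = 2.0, 3.0, 10
probs = [bb_pmf(x, a, b, n) for x in range(n + 1)]
mean = sum(x * p for x, p in zip(range(n + 1), probs))
var = sum(x * x * p for x, p in zip(range(n + 1), probs)) - mean ** 2
```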
Univariate Discrete Distributions (continued)

Nb(x | theta, r)   Negative-Binomial
    0 < theta < 1;  r = 1, 2, ...;  x = 0, 1, 2, ...
    p(x) = c C(r + x - 1, x) (1 - theta)^x,  c = theta^r
    E[x] = r (1 - theta) / theta,  V[x] = r (1 - theta) / theta^2

Pn(x | lambda)   Poisson
    lambda > 0;  x = 0, 1, 2, ...
    p(x) = c lambda^x / x!,  c = e^(-lambda)
    E[x] = lambda,  V[x] = lambda

Pg(x | alpha, beta, nu)   Poisson-Gamma
    alpha > 0, beta > 0, nu > 0;  x = 0, 1, 2, ...
    p(x) = c {Gamma(alpha + x) / x!} nu^x (beta + nu)^(-(alpha + x)),
      c = beta^alpha / Gamma(alpha)
    E[x] = nu alpha / beta,  V[x] = (nu alpha / beta)(1 + nu / beta)

Nbb(x | alpha, beta, r)   Negative-Binomial-Beta
    alpha > 0, beta > 0;  r = 1, 2, ...;  x = 0, 1, 2, ...
    p(x) = c C(r + x - 1, x) Gamma(alpha + r) Gamma(beta + x) / Gamma(alpha + beta + r + x),
      c = Gamma(alpha + beta) / {Gamma(alpha) Gamma(beta)}
    E[x] = r beta / (alpha - 1)   (alpha > 1)
Univariate Continuous Distributions

Be(x | alpha, beta)   Beta
    alpha > 0, beta > 0;  0 < x < 1
    p(x) = c x^(alpha - 1) (1 - x)^(beta - 1),
      c = Gamma(alpha + beta) / {Gamma(alpha) Gamma(beta)}
    E[x] = alpha / (alpha + beta)
    V[x] = alpha beta / {(alpha + beta)^2 (alpha + beta + 1)}

Un(x | a, b)   Uniform
    b > a;  a < x < b
    p(x) = c = (b - a)^(-1)
    E[x] = (a + b) / 2,  V[x] = (b - a)^2 / 12

Ga(x | alpha, beta)   Gamma
    alpha > 0, beta > 0;  x > 0
    p(x) = c x^(alpha - 1) e^(-beta x),  c = beta^alpha / Gamma(alpha)
    E[x] = alpha / beta,  V[x] = alpha / beta^2

Ex(x | theta)   Exponential
    theta > 0;  x > 0
    p(x) = c e^(-theta x),  c = theta
    E[x] = 1 / theta,  V[x] = 1 / theta^2

Gg(x | alpha, beta, n)   Gamma-Gamma
    alpha > 0, beta > 0, n > 0;  x > 0
    p(x) = c x^(n - 1) / (beta + x)^(alpha + n),
      c = beta^alpha Gamma(alpha + n) / {Gamma(alpha) Gamma(n)}
    E[x] = n beta / (alpha - 1)
    V[x] = n (n + alpha - 1) beta^2 / {(alpha - 1)^2 (alpha - 2)}
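The Gamma-Gamma entry, being less familiar than the others, is worth a numerical check. This sketch (parameter values assumed for illustration) evaluates the density above and verifies by simple midpoint quadrature that it integrates to one with mean n beta / (alpha - 1).

```python
import math

def make_gg_pdf(a, b, n):
    # Gamma-Gamma density Gg(x | a, b, n): the marginal of x ~ Ga(n, lam)
    # when lam ~ Ga(a, b), with the normalising constant precomputed.
    log_c = a * math.log(b) + math.lgamma(a + n) - math.lgamma(a) - math.lgamma(n)
    return lambda x: math.exp(log_c + (n - 1) * math.log(x)
                              - (a + n) * math.log(b + x))

def midpoint(f, lo, hi, m):
    # Composite midpoint rule on [lo, hi] with m subintervals.
    h = (hi - lo) / m
    return h * sum(f(lo + (i + 0.5) * h) for i in range(m))

a, b, n = 5.0, 2.0, 3.0
pdf = make_gg_pdf(a, b, n)
total = midpoint(pdf, 0.0, 400.0, 100000)          # tail beyond 400 is negligible
mean = midpoint(lambda x: x * pdf(x), 0.0, 400.0, 100000)
```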
Univariate Continuous Distributions (continued)

chi^2(x | nu) = chi^2_nu   Chi-squared
    nu > 0;  x > 0
    p(x) = c x^(nu/2 - 1) e^(-x/2),  c = (1/2)^(nu/2) / Gamma(nu/2)
    E[x] = nu,  V[x] = 2 nu

chi^2(x | nu, lambda)   Noncentral Chi-squared
    nu > 0, lambda > 0;  x > 0
    p(x) = sum_{j=0,1,...} Pn(j | lambda/2) chi^2(x | nu + 2j)
    E[x] = nu + lambda,  V[x] = 2 (nu + 2 lambda)

Ig(x | alpha, beta)   Inverted-Gamma
    alpha > 0, beta > 0;  x > 0
    p(x) = c x^(-(alpha + 1)) e^(-beta/x),  c = beta^alpha / Gamma(alpha)
    E[x] = beta / (alpha - 1),  V[x] = beta^2 / {(alpha - 1)^2 (alpha - 2)}

chi^(-2)(x | nu)   Inverted-Chi-squared
    nu > 0;  x > 0
    p(x) = c x^(-(nu/2 + 1)) e^(-1/(2x)),  c = (1/2)^(nu/2) / Gamma(nu/2)
    E[x] = 1 / (nu - 2),  V[x] = 2 / {(nu - 2)^2 (nu - 4)}

Ga^(-1/2)(x | alpha, beta)   Square-root Inverted-Gamma
    alpha > 0, beta > 0;  x > 0
    p(x) = c x^(-(2 alpha + 1)) e^(-beta/x^2),  c = 2 beta^alpha / Gamma(alpha)
    E[x] = Gamma(alpha - 1/2) beta^(1/2) / Gamma(alpha)
    V[x] = beta / (alpha - 1) - (E[x])^2
Univariate Continuous Distributions (continued)

Ip(x | alpha, beta)   Inverted-Pareto
    alpha > 0, beta > 0;  0 < x < beta^(-1)
    p(x) = c x^(alpha - 1),  c = alpha beta^alpha
    E[x] = alpha / {beta (alpha + 1)}
    V[x] = alpha / {beta^2 (alpha + 1)^2 (alpha + 2)}

N(x | mu, lambda)   Normal
    -inf < mu < +inf, lambda > 0;  -inf < x < +inf
    p(x) = c exp{-(lambda/2)(x - mu)^2},  c = lambda^(1/2) (2 pi)^(-1/2)
    E[x] = mu,  V[x] = lambda^(-1)

St(x | mu, lambda, alpha)   Student t
    -inf < mu < +inf, lambda > 0, alpha > 0;  -inf < x < +inf
    p(x) = c {1 + (lambda/alpha)(x - mu)^2}^(-(alpha + 1)/2),
      c = {Gamma((alpha + 1)/2) / Gamma(alpha/2)} {lambda / (alpha pi)}^(1/2)
    E[x] = mu  (alpha > 1),  V[x] = lambda^(-1) alpha / (alpha - 2)  (alpha > 2)
Univariate Continuous Distributions (continued)

Lo(x | alpha, beta)   Logistic
    -inf < alpha < +inf, beta > 0;  -inf < x < +inf
    p(x) = c e^(-(x - alpha)/beta) {1 + e^(-(x - alpha)/beta)}^(-2),  c = beta^(-1)
    E[x] = alpha,  V[x] = beta^2 pi^2 / 3

Multivariate Discrete Distributions

Mu_k(x | theta, n)   Multinomial
    theta = (theta_1, ..., theta_k), 0 < theta_i < 1, sum_i theta_i <= 1;  n = 1, 2, ...
    x = (x_1, ..., x_k),  x_i = 0, 1, 2, ...,  sum_i x_i <= n
    p(x) = c prod_i theta_i^(x_i) (1 - sum_i theta_i)^(n - sum_i x_i),
      c = n! / {(n - sum_i x_i)! prod_i x_i!}
    E[x_i] = n theta_i,  V[x_i] = n theta_i (1 - theta_i),  C[x_i, x_j] = -n theta_i theta_j

Md_k(x | alpha, n)   Multinomial-Dirichlet
    alpha = (alpha_1, ..., alpha_k), alpha_i > 0;  n = 1, 2, ...
    x = (x_1, ..., x_k),  x_i = 0, 1, 2, ...,  sum_i x_i <= n
    E[x_i] = n alpha_i / alpha_0,  alpha_0 = sum_j alpha_j
Multivariate Continuous Distributions

Di_k(x | alpha)   Dirichlet
    alpha = (alpha_1, ..., alpha_{k+1}), alpha_i > 0
    x = (x_1, ..., x_k),  0 < x_i < 1,  sum_i x_i < 1
    p(x) = c prod_{i=1,...,k} x_i^(alpha_i - 1) (1 - sum_i x_i)^(alpha_{k+1} - 1),
      c = Gamma(sum_j alpha_j) / prod_j Gamma(alpha_j)
    E[x_i] = alpha_i / alpha_0,  alpha_0 = sum_j alpha_j

Pa2(x, y | alpha, beta_0, beta_1)   Bilateral Pareto
    alpha > 0, beta_0 < beta_1;  x < beta_0, y > beta_1
    p(x, y) = c (y - x)^(-(alpha + 2)),  c = alpha (alpha + 1)(beta_1 - beta_0)^alpha

N_k(x | mu, lambda)   Multivariate Normal
    mu in R^k, lambda symmetric positive-definite;  x in R^k
    p(x) = c exp{-(1/2)(x - mu)' lambda (x - mu)},  c = |lambda|^(1/2) (2 pi)^(-k/2)
    E[x] = mu,  V[x] = lambda^(-1)

Ng(x, y | mu, lambda, alpha, beta)   Normal-Gamma
    p(x, y) = N(x | mu, lambda y) Ga(y | alpha, beta)
    marginally, x ~ St(x | mu, lambda alpha / beta, 2 alpha)
Multivariate Continuous Distributions (continued)

Ng_k(x, y | mu, lambda, alpha, beta)   Multivariate Normal-Gamma
    mu in R^k, lambda symmetric positive-definite, alpha > 0, beta > 0
    p(x, y) = N_k(x | mu, lambda y) Ga(y | alpha, beta)
    marginally, x ~ St_k(x | mu, lambda alpha / beta, 2 alpha)

St_k(x | mu, lambda, alpha)   Multivariate Student
    mu in R^k, lambda symmetric positive-definite, alpha > 0;  x in R^k
    p(x) = c {1 + alpha^(-1)(x - mu)' lambda (x - mu)}^(-(alpha + k)/2),
      c = {Gamma((alpha + k)/2) / Gamma(alpha/2)} (alpha pi)^(-k/2) |lambda|^(1/2)
    E[x] = mu,  V[x] = lambda^(-1) alpha / (alpha - 2)

Wi_k(x | alpha, beta)   Wishart
    2 alpha > k - 1, beta symmetric non-singular;  x symmetric positive-definite
    p(x) = c |x|^(alpha - (k+1)/2) exp{-tr(beta x)},
      c = |beta|^alpha / Gamma_k(alpha), with Gamma_k the generalised gamma function
    E[x] = alpha beta^(-1)

Nw_k(x, y | mu, lambda, alpha, beta)   Multivariate Normal-Wishart
    mu in R^k, lambda > 0, 2 alpha > k - 1, beta symmetric non-singular
    p(x, y) = N_k(x | mu, lambda y) Wi_k(y | alpha, beta)
A.2 INFERENTIAL PROCESSES

The second section of this Appendix records the basic elements of the Bayesian learning processes for many commonly used statistical models. For each of these models, we provide the following: the sufficient statistic and its sampling distribution; the conjugate family; the conjugate prior predictives for a single observable and for the sufficient statistic; the conjugate posterior and the conjugate posterior predictive for a single observable; and, in a final section, the reference prior and the corresponding reference posterior and posterior predictive for a single observable.

We recall, however, from Section 5.4 that, in multiparameter problems, the reference prior is only defined relative to an ordered parametrisation. In the case of uniparameter models this can always be done. In the multinomial, multivariate normal and linear regression models, however, there are very many different reference priors, corresponding to different inference problems, and specified by different ordered parametrisations; these are not reproduced in this Appendix. In the univariate normal model (Example 5.17), the reference prior for (mu, lambda) happens to be the same as that for (lambda, mu), namely pi(mu, lambda) proportional to lambda^(-1), and we provide the corresponding reference posteriors for mu and lambda, together with the reference predictive distribution for a future observation, in separate sections of the table.

Bernoulli Model

    z = {x_1, ..., x_n},  p(x | theta) = Br(x | theta),  0 < theta < 1
    t(z) = (r, n),  r = sum_i x_i,  p(r | theta) = Bi(r | theta, n)
    Conjugate: p(theta) = Be(theta | alpha, beta)
      p(x) = Bb(x | alpha, beta, 1),  p(r) = Bb(r | alpha, beta, n)
      p(theta | z) = Be(theta | alpha + r, beta + n - r)
      p(x | z) = Bb(x | alpha + r, beta + n - r, 1)
    Reference: pi(theta) = Be(theta | 1/2, 1/2)
      pi(theta | z) = Be(theta | 1/2 + r, 1/2 + n - r)
      pi(x | z) = Bb(x | 1/2 + r, 1/2 + n - r, 1)
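The Bernoulli updating rules above can be sketched in a few lines of code (the data are an arbitrary illustration):

```python
def bernoulli_update(a, b, data):
    # Be(a, b) prior + Bernoulli 0/1 data -> Be(a + r, b + n - r) posterior;
    # the one-step predictive is p(x = 1 | z) = (a + r) / (a + b + n).
    r, n = sum(data), len(data)
    return (a + r, b + n - r), (a + r) / (a + b + n)

# Reference analysis: Be(1/2, 1/2) prior, as in the table above.
(post_a, post_b), p_one = bernoulli_update(0.5, 0.5, [1, 0, 1, 1])
```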
Poisson Model

    z = {x_1, ..., x_n},  p(x | lambda) = Pn(x | lambda),  lambda > 0
    t(z) = (r, n),  r = sum_i x_i,  p(r | lambda) = Pn(r | n lambda)
    Conjugate: p(lambda) = Ga(lambda | alpha, beta)
      p(lambda | z) = Ga(lambda | alpha + r, beta + n)
      p(x | z) = Pg(x | alpha + r, beta + n, 1)
    Reference: pi(lambda) proportional to lambda^(-1/2)
      pi(lambda | z) = Ga(lambda | r + 1/2, n)
      pi(x | z) = Pg(x | r + 1/2, n, 1)

Negative-Binomial Model

    z = {x_1, ..., x_n},  p(x | theta) = Nb(x | theta, r),  0 < theta < 1, r known
    t(z) = (s, n),  s = sum_i x_i,  p(s | theta) = Nb(s | theta, nr)
    Conjugate: p(theta) = Be(theta | alpha, beta)
      p(theta | z) = Be(theta | alpha + nr, beta + s)
      p(x | z) = Nbb(x | alpha + nr, beta + s, r)
    Reference: pi(theta) proportional to theta^(-1) (1 - theta)^(-1/2)
      pi(theta | z) = Be(theta | nr, s + 1/2)
      pi(x | z) = Nbb(x | nr, s + 1/2, r)
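The Poisson table's conjugate update is equally mechanical; this sketch (counts invented for illustration) returns the updated Gamma parameters:

```python
def poisson_update(a, b, counts):
    # Ga(a, b) prior on the Poisson rate + observed counts ->
    # Ga(a + r, b + n), with r the total count and n the sample size.
    return a + sum(counts), b + len(counts)

a_n, b_n = poisson_update(2.0, 1.0, [3, 0, 2, 4, 1])
post_mean = a_n / b_n   # posterior mean of the rate
```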
Exponential Model

    z = {x_1, ..., x_n},  p(x | theta) = Ex(x | theta),  theta > 0
    t(z) = (t, n),  t = sum_i x_i,  p(t | theta) = Ga(t | n, theta)
    Conjugate: p(theta) = Ga(theta | alpha, beta)
      p(x) = Gg(x | alpha, beta, 1),  p(t) = Gg(t | alpha, beta, n)
      p(theta | z) = Ga(theta | alpha + n, beta + t)
      p(x | z) = Gg(x | alpha + n, beta + t, 1)
    Reference: pi(theta) proportional to theta^(-1)
      pi(theta | z) = Ga(theta | n, t)
      pi(x | z) = Gg(x | n, t, 1)

Uniform Model

    z = {x_1, ..., x_n},  p(x | theta) = Un(x | 0, theta),  0 < x_i < theta
    t(z) = (t, n),  t = max{x_1, ..., x_n},  p(t | theta) = n t^(n-1) theta^(-n)
    Conjugate: p(theta) = Pa(theta | alpha, beta)
      p(theta | z) = Pa(theta | alpha + n, max{beta, t})
    Reference: pi(theta) proportional to theta^(-1)
      pi(theta | z) = Pa(theta | n, t)
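A sketch of the exponential model's conjugate update (illustrative data):

```python
def exponential_update(a, b, data):
    # Ga(a, b) prior on the rate theta + exponential observations ->
    # Ga(a + n, b + t), with t the sum of the observations.
    return a + len(data), b + sum(data)

a_n, b_n = exponential_update(2.0, 1.0, [0.5, 1.5, 1.0])
```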
n) . ntI.2 Inferential processes 439 Normul Model (known precision A) Normal Model (known mean p ) .(A) . ( b o ( A1 7r(A I z ) = Ga(A 1 in.A. i t ) J %) = St(s I p .
Normal Model (both parameters unknown)

    z = {x_1, ..., x_n},  p(x | mu, lambda) = N(x | mu, lambda)
    t(z) = (xbar, s),  xbar = sum_i x_i / n,  n s^2 = sum_i (x_i - xbar)^2
    Conjugate: p(mu, lambda) = Ng(mu, lambda | mu_0, n_0, alpha, beta)
      p(mu, lambda | z) = Ng(mu, lambda | mu_n, n_0 + n, alpha + n/2, beta_n),
        mu_n = (n_0 mu_0 + n xbar) / (n_0 + n),
        beta_n = beta + (1/2) n s^2 + (1/2) {n_0 n / (n_0 + n)} (xbar - mu_0)^2
    Reference: pi(mu, lambda) proportional to lambda^(-1)
      pi(mu | z) = St(mu | xbar, (n - 1)/s^2, n - 1)
      pi(lambda | z) = Ga(lambda | (n - 1)/2, n s^2/2)
      pi(x | z) = St(x | xbar, (n - 1)/{(n + 1) s^2}, n - 1)
Multinomial Model

    z = (r_1, ..., r_k),  p(z | theta) = Mu_k(z | theta, n)
    Conjugate: p(theta) = Di_k(theta | alpha_1, ..., alpha_{k+1})
      p(theta | z) = Di_k(theta | alpha_1 + r_1, ..., alpha_k + r_k, alpha_{k+1} + n - sum_i r_i)
      the corresponding predictives are Multinomial-Dirichlet (Md) distributions

Multivariate Normal Model

    z = {x_1, ..., x_n},  x_i in R^k;  mu in R^k, lambda k x k symmetric positive-definite
    p(x_i | mu, lambda) = N_k(x_i | mu, lambda)
    t(z) = (xbar, S),  xbar = sum_i x_i / n,  S = sum_i (x_i - xbar)(x_i - xbar)'
    Conjugate: p(mu, lambda) = Nw_k(mu, lambda | mu_0, n_0, alpha, beta)
      p(mu, lambda | z) = Nw_k(mu, lambda | mu_n, n_0 + n, alpha + n/2, beta_n),
        mu_n = (n_0 mu_0 + n xbar) / (n_0 + n),
        beta_n = beta + (1/2) S + (1/2) {n_0 n / (n_0 + n)} (xbar - mu_0)(xbar - mu_0)'
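The Dirichlet-multinomial update amounts to adding the observed category counts to the prior parameters; a minimal sketch (counts invented for illustration):

```python
def multinomial_update(alpha, counts):
    # Dirichlet(alpha) prior + multinomial counts -> Dirichlet(alpha + counts);
    # the predictive category probabilities are the normalised updated alphas.
    post = [a + c for a, c in zip(alpha, counts)]
    s = sum(post)
    return post, [a / s for a in post]

post, pred = multinomial_update([1.0, 1.0, 1.0], [5, 3, 2])
```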
442 Linear Regression A. = X'y) . Summary o Basic Farmulae f t ( % ) (X'X.
Appendix B

Non-Bayesian Theories

Summary

A summary is given of a number of non-Bayesian statistical approaches and procedures. The main theories reviewed include classical decision theory, frequentism, likelihood, and fiducial inference. These are illustrated, compared and contrasted with Bayesian methods in the stylised contexts of point and interval estimation, hypothesis and significance testing, and prediction. Further issues discussed include: conditional and unconditional inferences; nuisance parameters and marginalisation; and asymptotics and criteria for model choice.

B.1 OVERVIEW

Bayesian statistical theory as presented in this book is self-contained and can be understood and applied without reference to alternative statistical theories. There are, however, two broad reasons why we think it appropriate to give a summary overview of our attitude to other theories. First, many, if not most, readers will have some previous exposure to "classical" statistics, and the material in this Appendix may help them to put the contents of this book into perspective. Secondly, our own experience has been that some element of comparative analysis contributes significantly to an appreciation of the attractions of the Bayesian paradigm in statistics.

We begin by making explicit some of the key differences between Bayesian and non-Bayesian theories.

(i) As we showed in detail in Chapter 2, Bayesian statistics has an axiomatic foundation which guarantees quantitative coherence; in particular, we have argued that the existence of a prior distribution is a mathematical consequence of the foundational axioms. The implications of this fact are so far reaching that sometimes Bayesian statistics is simplistically thought of as statistics with the "optional extra" of a prior distribution. Non-Bayesian statistical theories typically lack foundational support of this kind and essentially consist of a set of recipes which are not necessarily internally consistent.

(ii) Non-Bayesian theories typically use only a parametric model family of the form {p(x | theta), x in X, theta in Theta}, ignoring the prior distribution p(theta). In Chapter 4, we stressed that predictive models, typically derived from combining p(x | theta) and p(theta), are primary.

(iii) The decision theoretical foundations of Bayesian statistics provide a natural framework within which specific problems can easily be structured, with solutions directly tailored to problems. In contrast, most non-Bayesian theories essentially consist of stylised procedures, such as those for point or interval estimation, or hypothesis testing, designed to satisfy or optimise an ad hoc criterion, and often lacking the necessary flexibility to be adaptable to specific problem situations.

(iv) We have argued that, from a Bayesian viewpoint, a decision structure is the natural framework for any formal statistical problem, and have described how a "pure" inference problem may be seen as a particular decision problem. Classical decision theory is only partially relevant to inference, and non-Bayesian inference theories typically ignore the decision aspects of inference problems.

As a preliminary, we recall from Chapter 1 our acknowledgment that Bayesian analysis takes place in a rather formal framework, and that exploratory data analysis and graphical displays are often prerequisite, informal activities.

In Section B.2, we will revise the key ideas of a number of non-Bayesian statistical theories, specifically reviewing Classical Decision Theory, Frequentist Procedures, Likelihood Inference, and Fiducial and Related Theories. In Section B.3, we will follow the typical methodological partition of non-Bayesian textbooks into the topics of Point Estimation, Interval Estimation, Hypothesis Testing and Significance Testing. Within each of those subheadings we will comment on the internal logic, the relevance to actual statistical problems, and the performance of classical procedures relative to their Bayesian counterparts. In Section B.4, we will discuss in detail some key comparative issues: Conditional and Unconditional Inference; Nuisance Parameters and Marginalisation; Approaches to Prediction; and Aspects of Asymptotics and Model Choice Criteria. For readers seeking further comparative discussion at textbook level, we note the books by Barnett (1971/1982), Anderson (1984), Cox and Hinkley (1974), DeGroot (1987), Press (1972/1982), Casella and Berger (1990) and Poirier (1993).

B.2 ALTERNATIVE APPROACHES

B.2.1 Classical Decision Theory

We recall from Chapter 3 the basic structure of a general decision problem, consisting of a set of possible decisions D, a parameter space Omega, a prior distribution p(omega) over Omega, and a utility function u(d(omega)), which we shall denote by u(d, omega) to conform more closely to standard notation in classical decision theory. We established that the existence of both the prior distribution p(omega) and the utility function u(d, omega) is a mathematical consequence of the axioms of quantitative coherence, and that the best decision d* is that which maximises the expected utility

    u(d) = integral of u(d, omega) p(omega) d omega.

We established, furthermore, that if additional information z is obtained which is probabilistically related to omega by p(z | omega), then the best decision d*_z is that which maximises the posterior expected utility

    u(d | z) = integral of u(d, omega) p(omega | z) d omega,
    where  p(omega | z) proportional to p(z | omega) p(omega).

Some authors prefer to use loss functions instead of utilities. A regret function, or decision loss, is easily defined from the utility function (at least in bounded cases) by

    l(d, omega) = sup_{d'} u(d', omega) - u(d, omega),

which quantifies the maximum loss that one may suffer as a consequence of a wrong decision.
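The maximisation of expected utility over a finite decision and parameter space is a one-line computation. A minimal sketch, with toy utilities and prior probabilities assumed purely for illustration:

```python
def bayes_decision(utilities, prior):
    # utilities[d][w] = u(d, w); prior[w] = p(w).  Return the index of the
    # decision maximising expected utility sum_w u(d, w) p(w).
    def expected(d):
        return sum(u * p for u, p in zip(utilities[d], prior))
    return max(range(len(utilities)), key=expected)

u = [[10.0, -5.0],   # a risky decision: good in state 0, bad in state 1
     [2.0, 2.0]]     # a safe decision: the same payoff in both states
d_star = bayes_decision(u, [0.3, 0.7])
```

With the prior weighted towards state 1, the safe decision wins; shifting the prior towards state 0 changes the choice, which is the point of the posterior expected utility calculation.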
Since sup_{d'} u(d', omega) only depends on omega, the expected loss

    l(d) = integral of l(d, omega) p(omega) d omega

is minimised by the same decision d* which maximises u(d) and hence, from a Bayesian point of view, the two formulations are essentially equivalent. The formulation includes, as a special case, the situation with no additional data (the no-data case), where the risk function reduces to the loss function.

In contrast to this Bayesian formulation, the core framework of classical decision theory may be loosely described as decision theory without a prior distribution. A utility function (or a loss function) is accepted, perhaps justified by utility-only axiomatics of the type pioneered by von Neumann and Morgenstern (1944/1953), but a prior distribution for omega is not. Classical decision theory focuses on the way in which additional information x should be used to assist the decision process. Thus, the basic space is not the class of decisions, but the class of decision rules, consisting of functions delta: X -> D which attach a decision delta(x) to each possible data set x. It is then suggested that decision rules should be evaluated in terms of their average loss with respect to the data which might arise. The risk function of a decision rule delta is defined as

    r(delta, omega) = integral of l(delta(x), omega) p(x | omega) dx,

and subsequent comparison of decision rules is based on their risk functions.

Although some of the basic ideas in classical statistical theory were present in the work of Neyman and Pearson (1933), it was Wald (1950) who introduced a systematic decision theory framework, excluding prior distributions as core elements, but including a formulation of standard statistical problems within a decision framework. This work was continued by Girshick and Savage (1951) and by Stein (1956). An excellent textbook introduction is that of Ferguson (1967).

Example B.1. (Estimation of the mean of a normal distribution). Let x = {x_1, ..., x_n} be a random sample from an N(x | mu, 1) distribution, and suppose that we want to select an estimator for mu, so that D = R, under the assumption of a quadratic loss function l(mu~, mu) = (mu~ - mu)^2. Some possible decision rules are:

(i) delta_1(x) = xbar, the sample mean;
(ii) delta_2(x) = x~, the sample median;
(iii) delta_3(x) = mu_0, a fixed value;
(iv) delta_4(x) = (n xbar + n_0 mu_0)/(n + n_0), the posterior mean from an N(mu | mu_0, n_0) prior, centred on mu_0 and with precision n_0.

Using the fact that the variance of the sample median is approximately pi/2n, the corresponding risk functions are easily seen to be

(i) r(delta_1, mu) = 1/n;
(ii) r(delta_2, mu) = pi/2n;
(iii) r(delta_3, mu) = (mu_0 - mu)^2;
(iv) r(delta_4, mu) = (n + n_0)^(-2) {n + n_0^2 (mu_0 - mu)^2};

respectively. Note that (iv) includes both (i) and (iii) as limiting cases, as n_0 -> 0 and n_0 -> infinity respectively. Figure B.1 provides a graphical comparison of the risk functions.

Figure B.1  Risk functions for mu_0 = 0, n = 10 and n_0 = 1, 5

It is easily seen that, whatever the value of mu, delta_2 has larger risk than delta_1; the closer mu_0 is to the true value of mu, the more attractive delta_3 and delta_4 will obviously be; otherwise, the best decision rule markedly depends on mu.

Admissibility

The decision rule delta_2 in Example B.1 can hardly be considered a good decision rule since, for any value of the unknown parameter mu, delta_1 has a smaller risk. This is formalised within classical decision theory by saying that a decision rule delta' is dominated by another decision rule delta if, for all omega,

    r(delta, omega) <= r(delta', omega),

with strict inequality for some omega, and that a decision rule is admissible if there is no other decision rule which dominates it. If one is to choose among decision rules in terms of their risk functions, classical decision theory establishes that one can limit attention to a minimal complete class. A class of decision rules is complete if, for any delta' not in the class, there is a delta in the class which dominates it; a class is minimal complete if it does not contain a complete subclass. However, for guidance on how to choose among admissible decision rules, further concepts and criteria are required.

Bayes Rules

If the existence of a prior distribution p(omega) for the unknown parameter is accepted, classical decision theory focuses on the decision rule which minimises the expected risk (or so-called Bayes risk)

    min over delta of the integral of r(delta, omega) p(omega) d omega,

which it calls a Bayes decision rule. Under appropriate regularity conditions we may reverse the order of integration and, since p(z | omega) p(omega) = p(omega | z) p(z), the Bayes rule may be simply described as the decision rule which maps each data set x to the decision delta*(x) which minimises the corresponding posterior expected loss. Note that this interpretation does not require the evaluation of any risk function.

It is easily shown that any Bayes rule which corresponds to a proper prior distribution is admissible. Indeed, if delta* is the Bayes decision rule which corresponds to p(omega), and delta' were another decision rule such that r(delta', omega) <= r(delta*, omega), with strict inequality on some subset of Omega with positive probability under p(omega), then the expected risk of delta' would be strictly smaller than that of delta*, which would contradict the definition of a Bayes rule as one which minimises the expected risk. Wald (1950) proved the important converse result that, under rather general conditions, any admissible decision rule is a Bayes decision rule with respect to some, possibly improper, prior distribution.
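The four risk functions of Example B.1 can be evaluated directly; the particular values of mu, n, mu_0 and n_0 below are arbitrary illustrations:

```python
import math

def risks(mu, n, mu0, n0):
    # Risk functions of Example B.1 under quadratic loss.
    r1 = 1.0 / n                                            # sample mean
    r2 = math.pi / (2.0 * n)                                # sample median (approx.)
    r3 = (mu0 - mu) ** 2                                    # fixed guess mu0
    r4 = (n + n0 ** 2 * (mu0 - mu) ** 2) / (n + n0) ** 2    # posterior mean
    return r1, r2, r3, r4

r1, r2, r3, r4 = risks(mu=0.5, n=10, mu0=0.0, n0=5.0)
```

Since pi/2 > 1, r1 < r2 for every mu, which is the dominance of the sample mean over the sample median noted in the text.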
A class of decision rules is corviplete if for any 6' not in the class there is a n' in the class which dominates it. /I I ( 6( z) . Wald (1950) proved the important converse result that.s deiisioti ride with respect to some.
.1 the basic structure of a stylised inferenceproblem. i.8 E 0).2 Alternative Approaches 449 There is. A famous example is the inadmissibility of the sample mean of multivariate normal data as an estimator of the population mean. The idea that the minimax rule should be preferred to a rule which has better properties for nearly all plausible w values. The intuitive basis of the minimax principle is that one should guard against the largest possible loss. . under rather general conditions. Thus.. It can be shown. .e. by the parametric model component { p ( z 18). that the minimax rule is the Bayes decision rule which corresponds to the leasr favourable prior distriburion. even as a formal decision criterion. This asserts that one should choose that decision (or decision rule) for which the maximum possible loss (or risk) is minimal.the minimax solution may be reasonable (essentially coinciding with the Bayes solution). and that to derive the Bayes rule does not require computation of the risk function but simply the minimisation of the posterior expected loss. While this may have some value in the context of game theory. and certainly in the finite spaces of real world applications. it gives different answers if applied to losses rather than to regret functions. Nevertheless. see James and Stein (1961). this has been the mainstream view since the early 1960’s. for instance. Indeed. apart from purely mathematical interest. For details. that which gives the highest expected risk. it has no obvious intuitive merit in standard decision problems. make it clear that. Moreover. 1972). with the authoritative monographs by DeGroot (1970) and Berger (1985a) becoming the most widely used decision theory text.g. the minimax criterion seems entirely unreasonable. however. although in specific instancesnamely when prior beliefs happen to be close to the least favourable distribution. it is rather pointless to work in decision theory outside the Bayesian framework. 
Minimax rules The combined facts that admissible rules must be Bayes..B. no guarantee that improper priors lead to admissible decision rules. Lindley. some textbooks continue to propose as a criterion for choosing among decisions (without using a prior distribution)the rather unappealing minimax principle.22 Frequentlst Procedures We recall from Section 5. 8. where probabilistically related to 8 inferences about 8 E 8 are to be drawn from data z. even though it is the Bayes estimator which corresponds to a uniform prior. and it can violate the transitivity of preferences (see e. minimax has very unattractive features. where a player may expect the opponent to try to put him or her in the worst possible situation. but has a slightly higher maximum risk for an extremely unlikely w value seems absurd.
We established that the existence of a prior distribution p(θ) is a mathematical consequence of the axioms of quantitative coherence, and that the required inferential statement about θ given x is simply provided by the full posterior distribution

p(θ | x) = p(x | θ) p(θ) / p(x),  where  p(x) = ∫ p(x | θ) p(θ) dθ.

Frequentist statistical procedures are mainly distinguished by two related features: (i) they regard the information provided by the data x as the sole quantifiable form of relevant probabilistic information; and (ii) they use, as a basis for both the construction and the assessment of statistical procedures, long-run frequency behaviour under hypothetical repetition of similar circumstances. Although some of the ideas probably date back to the early 1800's, most of the basic concepts were brought together in the 1930's from two somewhat different perspectives: that of Fisher, and that of Neyman and Pearson, the work of the latter being critically opposed by Fisher, as reflected in discussions at the time published in the Royal Statistical Society journals. Convenient references are Neyman and Pearson (1967) and Fisher (1990); see, also, Wald (1947) for specific methods for sequential problems.

Frequentist procedures make extensive use of the likelihood function, lik(θ | x) = p(x | θ) (or variants thereof), essentially taking the mathematical form of the sampling distribution of the observed data x and considering it as a function of the unknown parameter θ. If z = z(x) is a one-to-one transformation of x, the likelihood in terms of the sampling distribution of z becomes (in the above variant)

lik(θ | z) = p(z | θ) = p(x | θ) |∂x/∂z|,

which suggests that meaningful likelihood comparisons should be made in the form of ratios rather than, say, differences, in order for such comparisons not to depend on the use of z rather than x.

The basic ideas behind frequentist statistics consist of: (i) selecting a function of the data, t = t(x), called a statistic, which is related to the parameter θ in a convenient way; (ii) deriving the sampling distribution of t, i.e., the conditional distribution p(t | θ); and (iii) measuring the "plausibility" of each possible θ by calibrating the observed value of the statistic t against its expected long-run behaviour given θ. For a specific parameter value θ = θ0, if the observed value of t is well within the area where most of the probability density described by p(t | θ0) lies, then θ0 is claimed to be compatible with the data; otherwise, it is said that either θ0 is not the true value of θ, or a rare event has happened. Such an approach is clearly far removed from the (to a Bayesian rather intuitively obvious) view that relevant inferences about θ should be probability statements about θ given the observed data, rather than probability statements about hypothetical repetitions of the data conditional on the (unknown) θ.
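The point above about comparing likelihoods via ratios rather than differences can be made concrete. In the sketch below, the exponential model, the observed value x = 1.3, the candidate parameter values and the transformation z = log x are all arbitrary illustrative choices; the Jacobian multiplies both likelihoods by the same factor, so ratios are unaffected while differences are not.

```python
import math

theta1, theta2 = 0.5, 2.0     # two candidate parameter values (illustrative)
x = 1.3                       # an observed value (illustrative)
z = math.log(x)               # a one-to-one transformation z = log(x)

def lik_x(theta):
    # exponential model Ex(x | theta) = theta * exp(-theta * x)
    return theta * math.exp(-theta * x)

def lik_z(theta):
    # sampling density of z = log(x): p(z | theta) = p(x | theta) * |dx/dz|
    return theta * math.exp(-theta * math.exp(z)) * math.exp(z)

ratio_x = lik_x(theta1) / lik_x(theta2)
ratio_z = lik_z(theta1) / lik_z(theta2)   # identical: the Jacobian cancels

diff_x = lik_x(theta1) - lik_x(theta2)
diff_z = lik_z(theta1) - lik_z(theta2)    # not identical: depends on the scale used
```

Any other one-to-one smooth transformation would do equally well; only the ratio comparison is invariant.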
Sufficiency

We recall from Section 4.5 that a statistic t is sufficient if p(x | t, θ) = p(x | t), i.e., if the conditional distribution of the data given t is independent of θ (Proposition 4.11), and that a necessary and sufficient condition for t to be sufficient for θ is that the likelihood function may be factorised as lik(θ | x) = p(x | θ) = h(t, θ) g(x). The idea was introduced by Fisher (1922) and developed mathematically by Halmos and Savage (1949) and Bahadur (1954).

The sufficiency principle in classical statistics essentially states that, for any given model p(x | θ) with sufficient statistic t, identical conclusions should be drawn from data x1 and x2 with the same value of t. From a Bayesian viewpoint there is obviously nothing new in this "principle": it is a simple mathematical consequence of Bayes' theorem that, for any prior p(θ), the posterior distribution of θ only depends on x through t, i.e., p(θ | x) = p(θ | t). However, other frequentist developments of the sufficiency concept have little or no interest from a Bayesian perspective. For example, a sufficient statistic t is complete if, for all θ in Θ, ∫ h(t) p(t | θ) dt = 0 implies that h(t) = 0. The property of completeness guarantees the uniqueness of certain frequentist statistical procedures based on t, but otherwise seems inconsequential. The concept of sufficiency in the presence of nuisance parameters is controversial; see, for example, Cano et al. (1988) and references therein.

Yet, even from a "textbook" perspective, there is more to inference than the choice of estimators and their assessment on the basis of sampling distributions. This contrast is highlighted by the following example, taken from Jaynes (1976).

Example B.2. (Cauchy observations). Let x = {x1, x2} consist of two independent observations from a Cauchy distribution, p(x | θ) = St(x | θ, 1, 1). Common sense (supported by translational and permutational symmetry arguments) suggests that θ̂ = (x1 + x2)/2 may be a sensible estimate of θ. However, the sampling distribution of θ̂ is again St(θ̂ | θ, 1, 1), so that, according to a naive frequentist comparison of sampling distributions, it cannot make any difference whether one uses θ̂ or, say, x1 alone to estimate θ.
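The striking claim in Example B.2 is easy to check numerically. For a St(x | θ, 1, 1) distribution the interquartile range is exactly 2, whatever θ; the sketch below (θ = 5 and the number of replications are arbitrary illustrative choices) confirms that the average of two Cauchy observations is no more concentrated than a single observation.

```python
import math
import random

random.seed(0)
theta = 5.0  # arbitrary "true" location, for illustration only

def rcauchy(loc):
    # inverse-CDF sampling: loc + tan(pi * (U - 1/2)) has a Cauchy(loc, 1) distribution
    return loc + math.tan(math.pi * (random.random() - 0.5))

N = 200_000
singles = [rcauchy(theta) for _ in range(N)]
averages = [(rcauchy(theta) + rcauchy(theta)) / 2 for _ in range(N)]

def iqr(xs):
    xs = sorted(xs)
    return xs[3 * len(xs) // 4] - xs[len(xs) // 4]

# A Cauchy(theta, 1) distribution has interquartile range exactly 2; the average
# of two observations should reproduce that value, not shrink it.
iqr_single, iqr_avg = iqr(singles), iqr(averages)
```

Both empirical interquartile ranges come out close to 2, matching the claim that the two sampling distributions are identical.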
Ancillarity

In Section 5.1 we demonstrated how a sufficient statistic t = t(x) may often be partitioned into two component statistics, t(x) = [a(x), s(x)], such that the sampling distribution of a(x) is independent of θ. We defined such an a(x) to be an ancillary statistic and showed that, in the inferential process described by Bayes' theorem, if a is ancillary, then p(x | θ) = p(a) p(s | a, θ), so that p(θ | x) ∝ p(s | a, θ) p(θ); in other words, it suffices to work conditionally on the value of the ancillary statistic.

The conditionality principle in classical statistics states that, whenever there is an ancillary statistic a, the conclusions about the parameter should be drawn as if a were fixed at its observed value. For further information, see Basu (1959). From a Bayesian standpoint this is, again, a trivial consequence of Bayes' theorem; the apparent need for such a principle in frequentist procedures is well illustrated in the following simple example.

Example B.3. (Conditional versus unconditional arguments). A signal comes from one of two sources, θ1 or θ2, and is picked up by one of two receivers, R1 and R2, where the receiver is selected at random, so that the identity of the receiver actually used is an ancillary statistic. Suppose that R2 was the receiver and x = 1 was obtained. The likelihood conditional on the receiver actually used then points towards θ1 as the true value of θ; yet the unconditional likelihood given x = 1, which averages over the receiver that might have been used, points towards θ2 instead. The conflict arises because the latter (unconditional) argument takes undue account of what might have happened (i.e., R1 might have been the receiver) but did not. The difficulty in this case disappears if one works conditionally on the ancillary statistic.

A further example regarding ancillarity is provided by reconsidering Example B.2, where the difficulty again disappears if one works conditionally on the ancillary statistic a = x1 − x2.

These examples serve to underline the obvious appeal of a trivial consequence of Bayes' theorem: namely, that one should always condition inferences on whatever information is available. The conditionality "principle" is just a small (ad hoc) step towards this rather obvious desideratum (which is, in any case, "automatic" in the Bayesian approach).
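The structure of Example B.3 can be sketched as follows. The error rates p(x = 1 | θ, R) below are hypothetical stand-ins (not the values of the original example), chosen only so that the conditional and unconditional likelihood ratios point in opposite directions, as in the example.

```python
# Hypothetical error rates p(x = 1 | theta, receiver); all numbers illustrative.
p_x1 = {("theta1", "R1"): 0.0, ("theta1", "R2"): 0.6,
        ("theta2", "R1"): 0.9, ("theta2", "R2"): 0.3}
p_receiver = {"R1": 0.5, "R2": 0.5}   # the receiver is selected at random

def lik_conditional(theta, receiver, x=1):
    # likelihood conditional on the ancillary statistic: the receiver actually used
    p = p_x1[(theta, receiver)]
    return p if x == 1 else 1.0 - p

def lik_unconditional(theta, x=1):
    # likelihood averaging over the receiver that MIGHT have been used
    return sum(p_receiver[r] * lik_conditional(theta, r, x) for r in p_receiver)

# R2 was the receiver and x = 1 was obtained:
cond_ratio = lik_conditional("theta1", "R2") / lik_conditional("theta2", "R2")
uncond_ratio = lik_unconditional("theta1") / lik_unconditional("theta2")
# cond_ratio > 1 points to theta1; uncond_ratio < 1 points to theta2 instead.
```

The unconditional ratio is pulled the other way entirely by receiver R1, which was never used.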
However, the conditionality "principle" is not necessarily easy to apply, since ancillary statistics are not readily identified, and are not necessarily unique. Basu (1964) noted that if x is uniform on [θ, θ + 1), then the fractional part of x is uniformly distributed on [0, 1), and hence ancillary; but the conditional distribution of x given its fractional part is a useless one-point distribution! Thus, applying the conditionality principle may leave the frequentist statistician in an impasse. See Basu (1992) for further elegant demonstration of the difficulties with ancillary statistics in the frequentist approach.

The Repeated Sampling Principle

A weak version of the repeated sampling principle states that one should not follow statistical procedures which, for some possible value of the parameter, would too frequently give misleading conclusions in hypothetical repetitions. Although this is too vague a formulation on which to base a formal critique, it can be used to criticise specific solutions to concrete problems. A much stronger version of this "principle", whose essence is at the heart of frequentist statistics, states that statistical procedures have to be assessed by their performance in hypothetical repetitions under identical conditions. This implies that (i) measures of uncertainty have to be interpreted as long-run hypothetical frequencies, that (ii) optimality criteria have to be defined in terms of long-run behaviour under hypothetical repetitions and, that (iii) there are no means of assessing any finite-sample realised accuracy of the procedures.

Example B.4. (Confidence versus HPD intervals). Let x = {x1, ..., xn} be a random sample from N(x | μ, 1). It is easily seen that x̄ is a sufficient statistic, whose sampling distribution, N(x̄ | μ, n), with precision n, is a normal distribution centred at the true value of the parameter. Since the sampling distribution of x̄ concentrates around μ, one might expect x̄ to be close to μ on the basis of a large number of hypothetical repetitions of the sample, so that x̄ suggests itself as an estimator of μ. From a frequentist viewpoint, if we define a statistical procedure to consist of producing the interval x̄ ± 1.96/√n whenever a random sample of size n from N(x | μ, 1) is obtained, then, for all μ,

P[x̄ − 1.96/√n ≤ μ ≤ x̄ + 1.96/√n | μ] = 0.95,

so that, in the long run, we are producing an interval which will include the true value of the parameter 95% of the time. Note, however, that this says nothing about the probability that μ belongs to that interval for any given sample. In contrast, the superficially similar statement

P[μ ∈ x̄ ± 1.96/√n | x̄] = 0.95,

which is derived from the reference posterior distribution of μ given x̄, explicitly says that, given x̄, the degree of belief that μ belongs to x̄ ± 1.96/√n is 0.95, and is not concerned at all with hypothetical repetitions of the experiment.
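The frequentist statement in Example B.4 is, precisely, a claim about long-run coverage under repetition, and nothing more; it is easily checked by simulation. In the sketch below, the values μ = 3 and n = 25 are arbitrary illustrative choices.

```python
import math
import random

random.seed(1)
mu, n, reps = 3.0, 25, 20_000           # arbitrary illustrative values
half_width = 1.96 / math.sqrt(n)        # interval xbar +/- 1.96/sqrt(n), unit variance known

covered = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu, 1.0) for _ in range(n)) / n
    if abs(xbar - mu) <= half_width:
        covered += 1

coverage = covered / reps   # long-run frequency; close to 0.95 by construction
```

The 95% refers to this long-run frequency over repetitions, not to any probability statement about μ for the sample actually observed.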
Invariance

If a parametric model p(x | θ) has an appropriate symmetry, it may seem natural to require that inferences respect it. For example, in estimating θ ∈ ℜ from a location model p(x | θ) = h(x − θ), it may be natural to consider the group of translations g(x) = x + a1, a ∈ ℜ, where 1 is a vector of unities, together with g̃(θ) = θ + a. The invariance principle then requires that any estimate t(x) of θ should satisfy t(x + a1) = t(x) + a. A more elaborate form of invariance principle involves transformations of both the sample and the parameter spaces: suppose that, for all the elements g of a group G of transformations, there is a unique transformation g̃ such that g̃(Θ) = Θ and p(x | θ) = p(g(x) | g̃(θ)). The invariance principle would then require the conclusions about θ drawn from the statistic t(g(x)) to be the same as those obtained by applying g̃ to t(x).

Note that the argument only works if there is no reason to believe that some θ values are more likely than others; in the location/translation example, the invariance principle could not be applied if it were known that θ ≥ 0. From a Bayesian point of view, for invariance to be a relevant notion it must be true that the transformation involved also applies to the prior distribution (otherwise, one may have a uniform loss of expected utility from following the invariance principle). Another limitation to the practical usefulness of invariance ideas is the condition that g̃(Θ) = Θ.

A final general comment: frequentist procedures centre their attention on producing inference statements about unobservable parameters. As we shall see in Section B.4, such an approach typically fails to produce a sensible solution to the more fundamental problem of predicting future observations.

B.2.3 Likelihood Inference

We recall from Section 5 the following trivial consequence of Bayes' theorem. Consider two experiments yielding, respectively, data x1 and x2, with model representations involving the same parameter θ ∈ Θ, the same prior, and proportional likelihoods. Then the experiments produce the same conclusions about θ, since they induce the same posterior distribution: if lik(θ | x1) ∝ lik(θ | x2), then p(θ | x1) = p(θ | x2). The likelihood principle states that this should indeed be the case, for the relative support given by the two sets of data to the possible values of θ is precisely the same; thus, both the likelihood principle and the mechanics of Bayes' theorem imply that one should derive the same conclusions about θ from observing x1 as from observing x2. Frequentist procedures typically violate the likelihood principle, since long-run behaviour under hypothetical repetitions depends on the entire distribution {p(x | θ), x ∈ X} and not only on the likelihood.
1973) and Andersen (1970. also. breaks down immediately when there are nuisance parameters. the posterior distribution is.1968. 1973). Both claims make sense from a Bayesian point of view when there are no nuisance purumerers.e. Davison (1986). Butler ( I 986).. With a uniform prior. Section 5. (i) is just a restatement of the likelihood principle and. in that examples exist where. work has focused on the properties ofprojle likelihood and its variants. proportional to the likelihood function.5. of course. Indeed. In recent years. Proponentsof the likelihood approach to inferencego further. The likelihood approach can also conflict with the weak repeated sampling principle. Bickel and Ghosh (1990).BarndorffNielsen ( 1980. 1991 ). but the suggested procedures for doing this seem hard to justify in terms of the likelihood approach. but also as a meaningful relative numerical measure of support for different possible models. hypothetical repetitions result in mostly misleading conclusions. Other references relevant to the interface between likelihood inference and Bayesian statistics include Hartigan (1967).1 for a link with Laplace approximations of posterior densities. it follows from Bayes’ theorem that so that the likelihood ratio satisfies (ii).Birnbaum ( 1962. Frequentist statistics . see Kalbfleish and Sprott (1970.B. Goldstein and Howard (1991) and Royal1 (1992).2 Alternative Approaches 455 As mentioned before. the Bayesian approach automatically obeys the likelihood principle and certainly accepts the likelihood function as a complete summary of the information provided by the data about the parameter of interest. Plante (197 I). However. For further information on the history of likelihood. when common priors are used across models with proportional likelihoods. Barnard er al. or for alternativeparameter values within the same model. 
They essentially argue that (i) the likelihood function conveys all the information provided by a set of data about the relative plausibility of any two different possible values of 8 and (ii) the ratio of the likelihood at two different 8 values may be interpreted as a measure of the strength of evidence in favour of one value relative to the other.1972) and Edwards (1972/1992). Cox and Reid (1987. 1992). for some possible parameter values.1963). For early attempts. The use of “marginal likelihoods” necessarily requires the elimination of nuisance parameters. since it is the factor which modifies prior odds into posterior odds. Cox (1988). See. Fraser and Reid (1989). Pereira and Lindley ( 1987). in that they regard it not only as the sole expression of the relevant information. The basic ideas of this pure likelihood approach were established by Barnard ( 1949. moreover. see Edwards (1974). the attempt to produce inferences solely based on the likelihood function. in their uses of the likelihood function. i. Bjrnstad (1990) and Monahan and Boos (1992). Akaike (1980b). the pure likelihood approach. ( 1962). however. Useful references include: Barnard and Sprott ( 1968).1983.
Frequentist statistics deals with the difficulty of calibrating an observed likelihood function by comparing it with the distribution of the likelihood functions which could have been obtained in hypothetical repetitions. From a Bayesian point of view, the answer obviously depends on the prior distribution; the following example, due to Birnbaum (1969), illustrates how misleading a naive use of the likelihood function alone may be.

Example B.5. (Naive likelihood versus reference analysis). Consider the model p(x | θ), x ∈ {1, 2, ..., 100}, θ ∈ {0, 1, ..., 100}, defined in such a way that, whatever x is observed, the likelihood of θ = 0 is always 1/100th of the likelihood of the only other possible θ value, namely θ = x; this is achieved by setting p(x | θ = 0) = 1/100, x = 1, ..., 100, and p(x = θ | θ) = 1 for θ = 1, ..., 100. Thus, given a single observation x, a naive use of the likelihood function would always declare θ = x to be strongly supported over θ = 0; yet, if θ = 0 is the true value, this conclusion is certain to be drawn whatever x turns out to be. If all θ are judged a priori to have the same probability, then one certainly has p(θ = x | x) = 100 p(θ = 0 | x). However, if, as might well be the case in any real application of such a model, θ = 0 is considered to be special, and is declared to be the parameter of interest, then the reference prior turns out to be p(θ = 0) = 1/2, p(θ = j) = 1/200, j = 1, ..., 100, and a straightforward calculation reveals that the posterior probability of θ = 0 is also 1/2, whatever x is observed: with this prior, one observation from the model provides no information about whether or not θ = 0. Of course, a second observation would, with high probability, reveal the true value of θ.

Finally, we note that, like frequentist procedures, the likelihood approach has difficulties in producing an agreed solution to prediction problems, as we shall discuss further in Section B.4; Bayesian statistics solves the problem by working, not with the likelihood function, but with the posterior distribution, prediction being based on the weighted average of the model with respect to the posterior.

B.2.4 Fiducial and Related Theories

We noted earlier that frequentist approaches are inherently unable to produce probability statements about the parameter of interest conditional on the data, a form of inference summary that seems most intuitively useful. This fact, coupled with the seeming aversion of most statisticians to the use of prior distributions, has led to a number of attempts to produce "posterior" distributions without using priors. We now review some of those proposals.
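The "straightforward calculation" in Example B.5 above can be written out exactly. The sketch below computes the posterior probability of θ = 0 under both the uniform and the reference priors; the observed value x = 7 is arbitrary, since by construction the answer is the same whatever x is observed.

```python
from fractions import Fraction

# Model of Example B.5 (Birnbaum): theta in {0, ..., 100}, x in {1, ..., 100};
# under theta = 0, x is uniform; under theta = j >= 1, x = j with certainty.
def p_x_given_theta(x, theta):
    if theta == 0:
        return Fraction(1, 100)
    return Fraction(1) if x == theta else Fraction(0)

def posterior_theta0(x, prior):
    num = prior[0] * p_x_given_theta(x, 0)
    den = sum(prior[t] * p_x_given_theta(x, t) for t in range(101))
    return num / den

uniform = {t: Fraction(1, 101) for t in range(101)}
# Reference prior when theta = 0 is declared the parameter of interest:
reference = {0: Fraction(1, 2), **{t: Fraction(1, 200) for t in range(1, 101)}}

post_uniform = posterior_theta0(x=7, prior=uniform)
post_reference = posterior_theta0(x=7, prior=reference)
```

Exact rational arithmetic shows the uniform-prior posterior of θ = 0 is 1/101, while the reference-prior posterior is exactly 1/2, equal to its prior probability.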
Fiducial Inference

In a series of papers published in the thirties, Fisher (1930, 1933, 1935, 1939) developed, through a series of examples and without any formal structure or theory, what he termed the fiducial argument. Essentially, he proposed using the distribution function F(t | θ) of a sufficient estimator t ∈ T for θ ∈ Θ in order to make conditional probability statements about θ given t, thus somehow transferring the probability measure from T to Θ. The basic characteristics of the argument may be described as follows. Let p(x | θ), θ ∈ (θ0, θ1), be a one-dimensional parametric model and let t = t(x) be a sufficient statistic for θ. Suppose further that the distribution function of t, F(t | θ), is monotone decreasing in θ, with F(t | θ0) = 1 and F(t | θ1) = 0. Then, G(θ | t) = 1 − F(t | θ) has the mathematical properties of a distribution function over (θ0, θ1) and, hence,

g(θ | t) = −(∂/∂θ) F(t | θ)

has the mathematical structure of a "posterior density" for θ. This is the fiducial distribution of θ, as proposed by Fisher (1930, 1956/1973). The argument is trivially modified if F(t | θ) is monotone increasing in θ, by using G(θ | t) = F(t | θ). However, no formal justification was offered for this controversial "transfer".

Example B.6. (Fiducial and reference distributions). Let x = {x1, ..., xn} be a random sample from an exponential distribution, p(x | θ) = Ex(x | θ) = θ e^{−θx}, with mean θ^{−1}. It is easily verified that t = Σ xj is a sufficient statistic for θ, and that G(θ | t) = F(t | θ) is monotone increasing from 0 to 1 as θ ranges over (0, ∞). Hence, the fiducial distribution of θ is obtained as

g(θ | t) = (∂/∂θ) F(t | θ) = Ga(θ | n, t) ∝ θ^{n−1} e^{−tθ}.

Note that this has the form g(θ | t) ∝ p(x | θ) π(θ), with π(θ) = θ^{−1}. Since π(θ) = θ^{−1} is in this case the reference prior for θ, it follows that, in this example, the fiducial distribution coincides with the reference posterior distribution.

This last example suggests that the fiducial argument might simply be a re-expression of Bayesian inference with some appropriately chosen "non-informative" prior. Lindley (1958) established that this is true if, and only if, the probability model p(x | θ) is such that x and θ may separately be transformed so as to obtain a new parameter which is a location parameter for the transformed variable.
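The coincidence claimed in Example B.6 can be verified numerically: differentiating G(θ | t) = F(t | θ) with respect to θ should reproduce the Ga(θ | n, t) reference posterior density. The sketch below does this by finite differences, using the closed-form (Erlang) distribution function for integer n; the values n = 5 and t = 8 are arbitrary illustrative choices.

```python
import math

# Distribution function of t = sum of n Exponential(theta) observations,
# i.e. t ~ Ga(n, theta) with integer shape n (Erlang closed form).
def F(t, theta, n):
    lam = theta * t
    return 1.0 - math.exp(-lam) * sum(lam**k / math.factorial(k) for k in range(n))

def reference_posterior_pdf(theta, t, n):
    # Ga(theta | n, t): the posterior under the reference prior pi(theta) = 1/theta
    return t**n * theta**(n - 1) * math.exp(-t * theta) / math.factorial(n - 1)

def fiducial_pdf(theta, t, n, h=1e-6):
    # fiducial density: d/d(theta) of G(theta | t) = F(t | theta)
    return (F(t, theta + h, n) - F(t, theta - h, n)) / (2 * h)

n, t = 5, 8.0                      # illustrative sample size and sufficient statistic
thetas = [0.2, 0.6, 1.0, 1.5]
diffs = [abs(fiducial_pdf(th, t, n) - reference_posterior_pdf(th, t, n))
         for th in thetas]
```

The two densities agree to within the accuracy of the finite-difference approximation, as the example asserts.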
However, it is by no means clear (and, in fact, a matter of considerable controversy) how the argument might be extended to multiparameter problems. The Royal Statistical Society discussions following the papers by Fieller (1954) and Wilkinson (1977) serve to illustrate the difficulties. From a modern perspective, the fiducial argument seems now to have at most historical interest, its long survival being mainly due to the perceived stature of its proponent. As Good (1971) puts it, it is (i) because it seemed so unlikely that a man of his stature should persist in the error, and (ii) because, if we do not examine the fiducial argument carefully, it seems almost inconceivable that Fisher should have made the error which he did in fact make (as he modestly says, his 1930 "explanation left a good deal to be desired"), that so many people assumed for so long that the argument was correct: they lacked the daring to question it. See Seidenfeld (1992) for further discussion, and Efron (1993) for a recent suggested modification of the fiducial distribution which may have better Bayesian properties.

Pivotal Inference

Barnard (1980b) has tried to extend this idea into a general approach to inference. His basic idea is to produce statements derived from the distribution of an appropriately chosen pivotal function, possibly conditional on the observed values of an ancillary statistic a(x). Suppose that, for a given model p(x | θ) with sufficient statistic t, it is possible to find some function h(θ, t) which is monotone increasing in θ for fixed t, and in t for fixed θ, and which has a distribution which is independent of θ. Such an h(θ, t) is called a pivotal function, and the fiducial distribution of θ may simply be obtained by reinterpreting the probability distribution of h over T as a probability distribution over Θ. In one-dimensional problems, Fisher's original argument, as described above, is a special case of this formulation, since G(θ | t) is a pivotal function with a uniform distribution over [0, 1]. In multiparameter problems, where the standard fiducial argument fails, partitioning a pivotal function as h(θ, x) = [g(θ, x), a(x)], to identify a possibly uniquely defined ancillary statistic a(x), and using the distribution of g(θ, x) conditional on the observed value of a(x), does, when applicable, produce some interesting results, which are nevertheless far better justified from a Bayesian reference analysis viewpoint. Other relevant references are Brillinger (1962) and Barnard (1963). Nevertheless, the mechanism by which the probability measure is transferred from the sample space to the parameter space remains without foundational justification, and the argument is limited by the availability, by no means obvious, of an appropriate pivotal function for the envisaged problem.
Structural Inference

Yet another attempt at justifying the transfer of the probability measure from the sample space into the parameter space is the structural approach proposed by Fraser (1968, 1972, 1979). Fraser claimed that one often knows more about the relationship between data and parameters than that described by the standard parametric model p(x | θ). He proposes the specification of what he terms a structural model, having two parts: a structural equation, which relates data x and parameter θ to some error variable e, and a probability distribution for e, which is assumed known, and independent of θ. Thus, the observed variable x is seen as a transformation of the error e, the transformation governed by the value of θ. The key idea is then to reverse this relationship, and to interpret θ as a transformation of e governed by the observed x, so that θ in a sense "inherits" the probability distribution of e.

Example B.7. (Structural and reference distributions). Let x = {x1, ..., xn} be a set of independent measurements with unknown location μ and scale σ. If the errors ei have a known distribution p(e), the structural equation is xi = μ + σ ei, i = 1, ..., n. If p(e) is normal, this structural model may be reduced, in terms of the sufficient statistics x̄ and s, to the structural equations x̄ = μ + σ z̄, s = σ sz, where z̄ and sz are the corresponding statistics of the errors, with √n (x̄ − μ)/s ~ St(t | 0, 1, n − 1) and (n − 1) s²/σ² distributed as χ² with n − 1 degrees of freedom. Reversing the probability relationship in the pivotal functions √n (x̄ − μ)/s and (n − 1) s²/σ² leads to structural distributions for μ and σ which, as is often the case, coincide with the corresponding reference posterior distributions.
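The "reversal" step in Example B.7 trades on the fact that √n (x̄ − μ)/s is pivotal: its sampling distribution is the same whatever (μ, σ) may be. The simulation sketch below makes this point by comparing empirical quantiles of the pivot under two very different, arbitrarily chosen, parameter settings.

```python
import math
import random

random.seed(2)

def pivot_samples(mu, sigma, n, reps):
    # sqrt(n) * (xbar - mu) / s over repeated normal samples: a pivotal quantity
    out = []
    for _ in range(reps):
        xs = [random.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
        out.append(math.sqrt(n) * (xbar - mu) / math.sqrt(s2))
    return sorted(out)

n, reps = 10, 40_000
samp_a = pivot_samples(mu=0.0, sigma=1.0, n=n, reps=reps)
samp_b = pivot_samples(mu=-7.3, sigma=4.2, n=n, reps=reps)  # arbitrary other values

# The pivot's distribution should not depend on (mu, sigma) at all:
qs = [0.1, 0.25, 0.5, 0.75, 0.9]
gap = max(abs(samp_a[int(q * reps)] - samp_b[int(q * reps)]) for q in qs)
```

The empirical quantiles agree across the two settings (up to Monte Carlo error), which is precisely what licenses reading the fixed pivotal distribution as a distribution for μ given the data.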
The general formulation of structural inference generalises the affine group structure underlying the last example. Formally, one considers a structural equation x = θe, to be interpreted as the response x generated by some transformation θ ∈ G, in a group of transformations G, operating on a realised error e, with a completely identified error distribution for e. It is then claimed that θ^{−1}(x) has the same probability distribution as e and, hence, that this may be used to provide a structural distribution for θ. In fact, the structural distributions are precisely the posterior distributions obtained by using as priors the right Haar measures associated with the structural group, which, in turn, are special cases of reference posterior distributions (see Villegas, 1977a, 1981, 1990, Dawid, 1983b, and references therein).

Thus, the mechanism by which the probability measure on X is transferred to Θ is certainly well-defined in the presence of the group structure central to Fraser's argument, and, when the structural argument can be applied, it produces answers which are mathematically closely related to Bayesian posterior distributions with "non-informative" priors derived from (group) invariance arguments. However, the group structure is fundamental, and the approach seems to lack general validity and applicability. As Lindley (1969) puts it,

Fraser's argument [is] an improvement upon and an extension of Fisher's in the special case where the group structure is present but [one should be] . . . suspicious of any argument . . . that only works in some situations . . . for inference is surely a whole . . . the Poisson distribution [is] not basically different in character from . . . the normal.

B.3 STYLISED INFERENCE PROBLEMS

B.3.1 Point Estimation

Let {p(x | θ), θ ∈ Θ} be a fully specified parametric family of models, and suppose that it is desired to calculate from the data x a single value θ̂(x) ∈ Θ representing the "best estimate" of the unknown parameter θ. This is the so-called point estimation problem. In many fields of activity, the final answer to an inference problem is required to be a single element θ̂ of Θ, with no explicit recognition of the uncertainty involved: adjusting a control mechanism, or setting a stock level, for example. We recall from Section 5 that, within the Bayesian framework, the problem of point estimation is naturally described as a decision problem where the set of possible answers to the inference problem is the parameter space Θ.
Pragmatically, a point estimate of θ may be motivated as being the simplest possible summary of the inferences to be drawn from x about the value of θ; alternatively, one may genuinely require a point estimate as the solution to a decision problem. More formally, one specifies the loss function l(a, θ) which describes the decision maker's preferences in that context, and chooses as the (Bayes) estimate that value θ*(x) which minimises the posterior expected loss

∫ l(a, θ) p(θ | x) dθ.

We have seen (Propositions 5.2 and 5.9) that intuitively natural solutions, such as the mean, mode or median of the posterior distribution of θ, are particular cases of this formulation for appropriately chosen loss functions. It is worth stressing that this is a constructive criterion, in that it identifies a precise procedure for obtaining the required value.

Classical decision theory ideas can obviously be applied to point estimation viewed as a decision problem. Thus, with respect to any particular loss function, one may define admissible estimates, minimax estimates, etc. The criteria adopted are typically non-constructive, however, and the problems and limitations of classical decision theory that we identified in Section B.1 carry over to particular applications such as point estimation: admissible estimators are essentially Bayes estimators, but classical decision theory provides no foundationally justified procedure for choosing among admissible estimators, with, as we noted, the general minimax principle being unpalatable to most statisticians.

The frequentist approach proceeds by defining possible desiderata of the long-run behaviour of point estimators and, using these desiderata as criteria, proposes methods for obtaining "best" estimators and identifies conditions under which "good behaviour" will result. The likelihood approach proceeds by using the likelihood function to measure the strength with which the possible parameter values are supported by the data; thus, the optimal estimator is naturally taken to be that which maximises the likelihood function. We note that, like the definition of an optimal Bayesian estimator, this is a constructive criterion, in that the very definition of a maximum likelihood estimator (MLE) determines its method of construction.

Fiducial, pivotal and structural inference approaches all produce "posterior" probability distributions for θ. Hence, their "solution" to the problem of point estimation is essentially that suggested by the Bayesian approach: either to offer as an estimator of θ some location measure of the probability distribution of θ, or to obtain that value of θ which minimises some specified loss function with respect to such a distribution.
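The constructive character of the Bayesian criterion is easily illustrated. Given a discrete posterior (the one below is invented purely for illustration), minimising the posterior expected loss over a grid of candidate answers mechanically recovers the mean under quadratic loss, the median under absolute loss, and the mode under zero-one loss.

```python
# An invented discrete posterior p(theta | x), purely for illustration.
thetas = [0.0, 1.0, 2.0, 3.0, 10.0]
post = [0.10, 0.35, 0.25, 0.20, 0.10]

def expected_loss(a, loss):
    return sum(p * loss(a, th) for th, p in zip(thetas, post))

def bayes_estimate(loss, grid):
    # the (Bayes) estimate minimises the posterior expected loss over the answers
    return min(grid, key=lambda a: expected_loss(a, loss))

grid = [i / 100 for i in range(1001)]                    # candidate answers in [0, 10]
est_quadratic = bayes_estimate(lambda a, th: (a - th) ** 2, grid)   # posterior mean
est_absolute = bayes_estimate(lambda a, th: abs(a - th), grid)      # posterior median
est_mode = thetas[post.index(max(post))]                 # zero-one loss: posterior mode

post_mean = sum(p * th for th, p in zip(thetas, post))
```

Note how the long right tail pulls the mean (2.45) away from the median (2.0) and the mode (1.0): the three loss functions genuinely encode different preferences.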
Criteria for Point Estimation

It should be clear from earlier sections that the search for good estimators may safely be limited to those based on sufficient statistics, for then, and only then, is one certain to use all the relevant information about the parameter of interest. However, the following two points introduce a note of caution.

(i) Sufficiency is a global concept: if t(x) is sufficient for θ = (θ1, θ2), it does not follow that a component t1(x) is sufficient for the component parameter θ1. For example, (x̄, s²) is jointly sufficient for (μ, σ) with univariate normal data, but x̄ is not sufficient for μ, nor is s² sufficient for σ².

(ii) Sufficiency is a concept relative to a model; even a small perturbation to the assumed model may destroy sufficiency, and no theory exists which specifies conditions under which this can be guaranteed not to happen. For instance, (x̄, s²) is not sufficient for (μ, σ) if the true model is St(x | μ, λ, ν), or the mixture form 0.999 N(x | μ, λ) + 0.001 N(x | μ, λ/1000), even though these two models are indeed very "close" to N(x | μ, λ).

The bias of an estimator θ̂(x) is defined to be

b(θ) = E[θ̂(x) | θ] − θ,

and its mean squared error (mse) to be

mse(θ̂ | θ) = E[(θ̂(x) − θ)² | θ] = V[θ̂ | θ] + b²(θ).

From a frequentist point of view it is desired that, in the long run, θ̂ should be as close to θ as possible; thus, a frequentist would like an estimator with small mse(θ̂ | θ) for almost all values of θ. An estimator θ̂1 is said to be more efficient than θ̂2 if mse(θ̂1 | θ) ≤ mse(θ̂2 | θ) for all θ, and a concept of relative efficiency is developed in these terms. A simple theory is available if attention is restricted to unbiased estimators, i.e., estimators such that b(θ) = 0 for all θ; if, in addition, quadratic loss is judged to be an appropriate "distance" measure, then we simply have to minimise V[θ̂ | θ] in this unbiased class.

However, although requiring the sampling distribution of θ̂ to be centred at θ may have some intuitive appeal, there are powerful arguments against requiring unbiasedness. Indeed:

(i) In many problems, there are no unbiased estimators. For example, r/n is an unbiased estimator of the parameter θ for a binomial Bi(r | θ, n) distribution, but there is no unbiased estimator of θ^{−1}.

(ii) Even when they exist, unbiased estimators may give nonsensical answers. For instance, the (unique) unbiased estimator of the parameter θ ∈ (0, 1) of a geometric distribution p(x | θ) = θ (1 − θ)^x, x = 0, 1, 2, ..., is θ̂(0) = 1, θ̂(x) = 0 otherwise, hardly a sensible solution!
(iii) The unbiasedness requirement violates the likelihood principle, by making the answer dependent on the sampling mechanism. For example, if one is measuring μ with an instrument which only works for values x ≤ 100, and one obtains x = 50, a valid measurement, then the unbiased estimator of μ from a N(x | μ, 1) observation is x itself, but the unbiased estimator based on the truncated model, which acknowledges that only values x ≤ 100 can be recorded, will be something else; it seems quite inappropriate to make our estimate of μ depend on the fact that we might have obtained an invalid measurement, but did not. Similarly (see Ferguson, 1967), if θ is the mean of a Poisson distribution, p(x | θ) = e^{−θ} θ^x / x!, x = 0, 1, 2, ..., then the only unbiased estimator of e^{−2θ}, a quantity which must lie in (0, 1), is (−1)^x, which equals 1 if x is even and −1 if x is odd (again, hardly sensible), leading, even more ridiculously, to θ̂(x) = −1 as the estimate of a probability (for all odd x)!

(iv) Even from a frequentist perspective, unbiased estimators may well be unappealing if they lead to large mean squared errors, so that an estimator with small bias and small variance may be preferred to one with zero bias but a large variance. For further discussion of the conflict between Bayes and unbiased estimators, see Bickel and Blackwell (1967).

Another frequentist criterion for judging an estimator concerns the asymptotic behaviour of its sampling distribution. If we write explicitly the dependence θ̂n = θ̂n(x1, ..., xn) of the estimator on the sample size, a frequentist would like θ̂n to converge to θ (in some sense) as n increases. An estimator θ̂n is said to be weakly consistent if θ̂n → θ in probability, and strongly consistent if θ̂n → θ with probability one. By Chebychev's inequality, a sufficient condition for the weak consistency of unbiased estimators is that V(θ̂n | θ) → 0 as n → ∞; obviously, a consistent estimator is asymptotically unbiased. For discussion of the consistency of Bayes estimators, see Schwartz (1965), Freedman and Diaconis (1983), de la Horra (1986) and Diaconis and Freedman (1986a, 1986b); for the frequentist properties of Bayes estimators, see, for example, Wald (1939) and Diaconis and Freedman (1983).

"Optimum" Estimators

We have mentioned before that minimising the variance among unbiased estimators is often suggested as a procedure for obtaining "good" estimators. Sometimes, this procedure is even further restricted to linear functions of the data. Thus, provided μ = E(x | θ) and σ² = V(x | θ) exist, x̄ is said to be the best linear unbiased estimator (BLUE) of μ, in the sense that it has the smallest mse among all linear, unbiased estimators. It is easy to demonstrate, however, with appropriate examples, that this is a rather restricted view of optimality, since non-linear estimators may be considerably more efficient.

An "absolute" standard by which unbiased estimators may be judged is provided by the Cramér-Rao inequality. Let g̃ = g̃(x) be an unbiased estimator of g(θ), and define the efficient score function u(x | θ) to be

u(x | θ) = (∂/∂θ) log p(x | θ),

so that E[u(x | θ)] = 0. Then, under suitable regularity conditions,

V[g̃ | θ] ≥ [g′(θ)]² / E[u²(x | θ)],

with equality if, and only if, g̃ is a linear function of the score function, g̃ − g(θ) = k(θ) u(x | θ), where k(θ) does not depend on x; in this case, g̃ is said to be a minimum variance bound (MVB) estimator of g(θ), a result which can be generalised to multidimensional problems. Moreover, if t is sufficient for θ, there is a unique function g(θ) for which a MVB estimator exists. For example, if x = {x1, ..., xn} is a random sample from N(x | θ, λ), then x̄ = Σ xi / n is a MVB estimator for θ, but no MVB estimator exists for σ! One might then ask whether it is at least possible to obtain an unbiased estimator with a variance which is lower than that of any other unbiased estimator for each θ. Under suitable regularity conditions, the existence of such uniformly minimum variance (UMV) estimators can indeed be established; however, the range of situations where "optimal" unbiased estimators, the MVB estimators, can be found is rather limited.

Specifically, Rao (1945) and Blackwell (1947) independently proved that if θ̃(x) is an estimator of θ and t = t(x) is a sufficient statistic for θ, then the conditional expectation of θ̃(x) given the value of the sufficient statistic t,

θ̂(t) = E[θ̃ | t] = ∫ θ̃(x) p(x | t) dx,

is an improved estimator of θ, in the sense that, for every value of θ, mse(θ̂ | θ) ≤ mse(θ̃ | θ). A decision-theoretic consequence of this, the so-called Rao-Blackwell theorem, is that any estimator of θ which is not a function of the sufficient statistic t must be inadmissible.
even if it does not reach the CramerRao lower bound. It follows that a minimum variance bound estimator must be sufficient.
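The variance reduction promised by the Rao-Blackwell theorem is easy to check numerically. The following sketch is an illustrative simulation, not part of the original text (the function name and all numerical settings are assumptions): it starts from the crude unbiased estimator θ̃(x) = x1 of a normal mean and conditions on the sufficient statistic x̄; since E[x1 | x̄] = x̄, the improved estimator is simply the sample mean.

```python
import random
import statistics

def rao_blackwell_demo(theta=2.0, n=10, reps=20000, seed=42):
    """Compare the crude unbiased estimator x1 with its
    Rao-Blackwellised version E[x1 | xbar] = xbar."""
    rng = random.Random(seed)
    crude, improved = [], []
    for _ in range(reps):
        x = [rng.gauss(theta, 1.0) for _ in range(n)]
        crude.append(x[0])            # unbiased, but ignores most of the data
        improved.append(sum(x) / n)   # conditioned on the sufficient statistic
    return statistics.variance(crude), statistics.variance(improved)

v_crude, v_rb = rao_blackwell_demo()
```

Both estimators are unbiased, but the conditioned one has sampling variance close to 1/n rather than 1, illustrating the uniform mse improvement asserted by the theorem.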
However, as a constructive procedure for obtaining estimators this result is of limited value, due to the fact that it is usually very difficult to calculate the required conditional expectation. An alternative route is the following: if θ̃(t) is unbiased and there is a complete sufficient statistic t = t(x), then θ̃(t) is the UMV estimator of θ. For example, r/n is the MVB estimator of the parameter θ of a binomial distribution Bi(r | θ, n); there is no MVB estimator of θ², but the result may be used to show that r(r - 1)/[n(n - 1)] is a UMV estimator of θ².

Both the likelihood and the Bayesian solutions to the point estimation problem automatically define procedures for obtaining the estimators; the frequentist approach does not (except for special cases like the exponential family). Historically, a number of construction methods have been used at various times within the frequentist approach to produce candidate "good estimators", which have then been analysed using the criteria described above.

MLE estimators are not guaranteed to exist or to be unique, but when they do exist they typically have very good asymptotic properties: MLE's can be shown to be consistent (hence asymptotically unbiased, even if biased in small samples), asymptotically fully efficient and asymptotically normal, so that, if n → ∞, the sampling distribution of the MLE converges to the normal N(θ̃ | θ, I(θ)), with mean θ and precision given by I(θ), the information function. In addition to the MLE approach, other methods of construction include minimum chi-squared, least squares and the method of moments. However, these methods do not in themselves guarantee any particular properties for the resulting estimators, which usually have to be investigated case by case. They are typically biased but, under fairly general conditions, they have analogous asymptotic properties to maximum likelihood estimators (i.e., they are consistent, asymptotically fully efficient and asymptotically normal).

Nowadays, partly under the influence of classical decision theory, some frequentist statisticians pragmatically minimise an expected posterior loss to obtain an estimator, whose behaviour they then proceed to study using non-Bayesian criteria. A famous example is the Pitman estimator (Pitman, 1939; see also Robert et al., 1993), which may be obtained as the posterior mean which corresponds to a uniform prior. Bayesian estimators always exist for appropriately chosen loss functions and automatically use all the relevant information in the data; moreover, from a frequentist perspective, they are consistent, asymptotically fully efficient and asymptotically normal. For an extensive treatment of the topic of point estimation, see Lehmann (1959/1983).

B.3.2 Interval Estimation

Let {p(x | θ), θ ∈ Θ} be a fully specified parametric family of models and suppose that it is desired to calculate, from the data x, a region C(x) within which the parameter θ may reasonably be expected to lie.
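The binomial claims above can be verified numerically. The sketch below is illustrative code, not part of the original text (the function name and Monte Carlo settings are assumptions): it checks, to simulation accuracy, that r/n is unbiased for θ and that r(r - 1)/[n(n - 1)] is unbiased for θ².

```python
import random

def simulate_binomial_estimators(theta=0.3, n=20, reps=50000, seed=1):
    """Monte Carlo check that r/n estimates theta and
    r(r-1)/(n(n-1)) estimates theta**2 without bias."""
    rng = random.Random(seed)
    sum1 = sum2 = 0.0
    for _ in range(reps):
        r = sum(rng.random() < theta for _ in range(n))  # draw r ~ Bi(r | theta, n)
        sum1 += r / n
        sum2 += r * (r - 1) / (n * (n - 1))
    return sum1 / reps, sum2 / reps

est_theta, est_theta2 = simulate_binomial_estimators()
```

With θ = 0.3, the two averages settle near 0.3 and 0.09 respectively, consistent with unbiasedness for θ and θ².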
i. as a set of 8 values which may safely be declared to be consistent with the observed data..(1 and. whose elements may be claimed to be supported by the data as "likely" values of the unknown parameter 8. A lol. see e. i. suggest themselves as summaries of the inferential content of the posterior distribution. parameter H may reasonably be expected to lie.u with the corresponding nesting property. It the is crucial however to recognise that the only proper probubiliry interpretatioir of a confidence interval is that.g. credible regions provide a sensible solution to the problem of region estimation. a subset of 8 is associated with each value of x. Given x.5 that. given z.e. Indeed. as in point estimation.. a statistic 8" (2) such that for all 0. One only has the rather dubious 'Transferred assurance" from the longrun definition.erconfiderzceIimit8. Whether or not the pcirticulur @(z) which corresponds to the observed data x is smaller or greater than (9 is enrirely uncertuin. Plante ( 1984. in rhr long run.the specific interval ( .(z) 5 0 18) = 1 . ~ " ( z ) ]is then typically interpreted as a region where.c. then g(@') is an upper confidence limit for g(0). 0 < (1 < 1.(x) issimilarlydefinedasastatistica(x) such that Pr{&. within a Bayesian framework.n. those of the smallest size. for each (L value. when 8 is onedimensional. among such regions.x . The nesting condition is important to avoid inconsistency. the highest posterior density (HPD) regions.e. and such that if a l > (12 then Pl 5 0"2. Region estimates of 8 may be motivated pragmatically as informative simple summaries of the inferences to be drawn from x about the value of 8 or. such that contains the true value of the parameter with (posterior) probability 1 .(I)% credible region C. Note that if y is strictly increasing. I . We recall from Section 5. This is the socalled region estimation problem. Combining a lower limit at confidence level 1 . a proportion 1 . 
hence the more standard reference to the interval estimation problem. more formally. Note that this formulation is equally applicable to prediction problems simply by using the corresponding posterior predictive distribution.( t with an upper limit at confidence level . a 100(1 .(b of the @'(x)values will be larger than 0. the regions obtained are typically intervals. Confidence Limits For 0 < c t < 1 and scalar B E 8 C A. 199I 1. is called an upper confidence limit for (9 with confidence coeflcient 1 . NonBayesiun Theories into 0.466 B..
Combining a lower limit at confidence level 1 - α1 with an upper limit at confidence level 1 - α2, we obtain a two-sided confidence interval [θ̂1(x), θ̂2(x)] at confidence level 1 - α1 - α2. For two-sided confidence intervals, further criteria are needed to fix α1 and α2:

(i) Shortest confidence intervals. For fixed α1 + α2 = α, α1 and α2 may be chosen to minimise the expected interval length E_{x|θ}[θ̂2(x) - θ̂1(x)]. It must be realised, however, that shortest intervals for θ do not generally transform to shortest intervals for functions g(θ).

(ii) Most selective intervals. Alternatively, one could try to choose α1 and α2 to minimise the probability that the interval contains false values of θ. However, such uniformly most accurate intervals are not guaranteed to exist. In practice, a convenient choice is α1 = α2, which produces central confidence intervals based on equal tail-area probabilities.

It is worth noting that, under suitable regularity conditions, the fact that the score function u has a sampling distribution which is asymptotically normal N(u | 0, I(θ)), with mean 0 and precision I(θ), may be used to provide approximate confidence intervals for θ. It can be proved that intervals based on the score function have asymptotically minimum expected length.

However, for a variety of reasons, the concept of a confidence interval is less than satisfactory:

(i) Exact intervals typically do not exist for arbitrary confidence levels when the model is discrete.

(ii) There is no general constructive guidance on which particular statistic to choose in constructing the interval; unless appropriate pivotal quantities can be found, the construction of confidence intervals is by no means immediate.

(iii) There are serious difficulties in incorporating any known restrictions on the parameter space, and no systematic procedure exists for incorporating such knowledge in the construction of confidence intervals.

(iv) In multiparameter situations, the construction of simultaneous confidence intervals is rather controversial. It is less than obvious whether one should use the confidence limits associated with individual intervals, or whether one should think of the problem as that of estimating a region for a single vector parameter, or as one of considering the probability that a number of confidence statements are simultaneously correct. Moreover, the properties of the various alternative procedures are generally less than clear.

(v) Interval estimation in the presence of nuisance parameters is another controversial topic. The properties of the various alternative procedures, typically based on replacing the unknown nuisance parameters by estimates, are generally less than clear.
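The contrast between central and shortest intervals in (i) and (ii) is easy to visualise for a skewed sampling distribution, where the equal tail-area interval is generally longer than the shortest interval of the same coverage. The following sketch is purely illustrative (the Gamma-distributed pivot and all numbers are assumptions, not from the text): it approximates both intervals from Monte Carlo draws.

```python
import random

def central_and_shortest(draws, coverage=0.95):
    """Empirical equal-tail and shortest intervals at the given coverage."""
    xs = sorted(draws)
    n = len(xs)
    k = int(coverage * n)            # number of order statistics spanned
    tail = (n - k) // 2
    central = (xs[tail], xs[n - 1 - tail])
    # shortest interval: slide a window over k consecutive order statistics
    best = min(range(n - k), key=lambda i: xs[i + k] - xs[i])
    shortest = (xs[best], xs[best + k])
    return central, shortest

rng = random.Random(7)
draws = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]
central, shortest = central_and_shortest(draws)
len_central = central[1] - central[0]
len_shortest = shortest[1] - shortest[0]
```

For a Gamma(2, 1) pivot the equal-tail 95% interval is noticeably longer than the shortest one, because the long right tail forces the central interval to extend far to the right.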
(vi) Interval estimation of future observations poses yet another set of difficulties. Unless one is able to find a function of the present and future observations whose sampling distribution does not depend on the parameters (and this is not typically the case), one is again limited to ad hoc approximations based on substituting estimates for parameters. For Bayesian solutions, see Bernardo (1977) and Raftery and Schweder (1993).

As a final point, even in the simplest case where θ is a scalar parameter labelling a continuous model p(x | θ), the concept of a confidence interval is open to what many would regard as a rather devastating criticism: namely, the fact that the confidence limits can turn out to be either vacuous or just plain silly in the light of the observed data. We give two examples.

(i) In the Fieller-Creasy problem, where the parameter of interest is the ratio of two normal means, there are values α < 1 such that, for a subset of possible data with positive probability, the corresponding 1 - α confidence interval is the entire real line. Solemnly quoting the whole real line as a 95% confidence interval for a real parameter is not a good advertisement for statistics.

(ii) If x1 and x2 are two random observations from a uniform distribution on the interval (θ - 0.5, θ + 0.5), and y1 and y2 are, respectively, the smaller and the larger of these two observations, then it is easily established that, for all θ,

Pr{y1 ≤ θ ≤ y2 | θ} = 0.5,

so that (y1, y2) provides a 50% confidence interval. However, if for the observed data it turns out that y2 - y1 ≥ 0.5, then certainly y1 < θ < y2, so that we know for sure that θ belongs to the interval (y1, y2), even though the confidence level of the interval is only 50%.

These examples reflect the inherent difficulty that the frequentist approach to statistics has of being unable to condition on the complete observed data. Conditioning on ancillary statistics, when possible, as discussed in Section B.2, may mitigate this problem; but it certainly does not solve it and, indeed, it may create others. The reader interested in other blatant counterexamples to the (unconditional) frequentist approach to statistics will find references in the literature under the keyword relevant subsets, which refers to subsets of the sample space yielding special information and subverting the "long-run" or "on average" frequentist viewpoint. Two important such references are Robinson (1975) and Jaynes (1976). See, also, Buehler (1959), Basu (1964, 1992), Cornfield (1969), Pierce (1973), Robinson (1979a, 1979b), Casella (1987, 1988), Maatta and Casella (1990) and Goutis and Casella (1991).
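The uniform example in (ii) can be checked directly by simulation. The code below is an illustrative sketch (function name and settings are assumptions): it confirms that (y1, y2) covers θ in about half of all repetitions overall, yet covers it in every repetition belonging to the relevant subset y2 - y1 > 0.5.

```python
import random

def uniform_interval_sim(theta=10.0, reps=40000, seed=3):
    """Coverage of (min, max) of two Un(theta-0.5, theta+0.5) draws,
    overall and conditional on the relevant subset y2 - y1 > 0.5."""
    rng = random.Random(seed)
    cover = wide = wide_cover = 0
    for _ in range(reps):
        a = rng.uniform(theta - 0.5, theta + 0.5)
        b = rng.uniform(theta - 0.5, theta + 0.5)
        y1, y2 = min(a, b), max(a, b)
        hit = y1 <= theta <= y2
        cover += hit
        if y2 - y1 > 0.5:          # here theta lies in (y1, y2) with certainty
            wide += 1
            wide_cover += hit
    return cover / reps, wide_cover / max(wide, 1)

overall, conditional = uniform_interval_sim()
```

The unconditional coverage hovers around 0.5, while the conditional coverage on the wide-interval subset is exactly 1, which is precisely the conflict between the long-run statement and the evidence in the observed data.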
Finally, we should mention that for many of the standard textbook examples of confidence intervals (typically, those which can be derived from univariate continuous pivotal quantities), the quoted intervals are numerically equal to credible regions of the same level obtained from the corresponding reference posterior distributions. This means that, in these cases, the intuitive interpretation that many users (incorrectly, of course!) tend to give to frequentist intervals of confidence 1 - α, namely that, given the data, there is probability 1 - α that the interval contains the true parameter value, would in fact be correct.

A typical example of this situation is provided by the class of intervals for the mean of a normal distribution with unknown precision, derivable from the sampling distribution of the pivotal quantity √n(x̄ - μ)/s. These are both the "best" confidence intervals for μ and the credible intervals which correspond to the reference posterior distribution for μ, π(μ | x) = St(μ | x̄, (n - 1)/s², n - 1), derived in Example 5.17. Yet, Buehler and Feddersen (1963) demonstrated that relevant subsets exist even in this standard case. Indeed, if x = {x1, x2}, then C = (x_min, x_max) is a 50% interval for μ; but, if both observations belong to an appropriately defined subset of the sample space R, then Pr{C contains μ | x ∈ R, μ, σ} ≥ 0.5181. Pierce (1973) has shown that similar situations can occur whenever the confidence interval cannot be interpreted as a credible region corresponding to a posterior distribution with respect to a proper prior. Note that, although this long-term coverage probability is not directly relevant to a Bayesian, the example suggests that special care should be exercised when interpreting posterior distributions obtained from improper priors. Casella et al. (1993) have proposed, for interval estimation, alternative loss functions to the standard linear functions of volume and coverage probability.

B.3.3 Hypothesis Testing

Let {p(x | θ), θ ∈ Θ} be a fully specified parametric family of models, let Θ be partitioned into two disjoint subsets Θ0 and Θ1, and suppose that we wish to decide whether the unknown θ lies in Θ0 or in Θ1. If H0 denotes the hypothesis that θ ∈ Θ0 and H1 the hypothesis that θ ∈ Θ1, we have a decision problem with only two possible answers to the inference problem: a0 ≡ accept H0, or a1 ≡ accept H1, where the choice is to be made on the basis of the observed data x. This is the so-called problem of hypothesis testing. In most such problems the two hypotheses are not symmetrically treated: the working hypothesis H0 is usually called the null hypothesis, while H1 is referred to as the alternative hypothesis. Although the theory can easily be extended to any finite number of alternative hypotheses, we will present our discussion in terms of a single alternative hypothesis.
We recall from Section 6.1 that, within a Bayesian framework, the problem of hypothesis testing, as formulated above, can be appropriately treated using standard decision-theoretic methodology, by specifying a prior distribution and an appropriate utility function, and maximising the corresponding posterior expected utility. We also recall that the solution to the decision problem posed generally depends on whether or not the "true" model is assumed to be included in the family of analysed models. Assuming the stylised M-closed case, where the true model is assumed to belong to the family {p(x | θ), θ ∈ Θ}, with prior probabilities p(Mi) = Pr{θ ∈ Θi}, i = 0, 1, and a utility structure which simply attaches no loss to accepting Hi when θ ∈ Θi and a loss lij to accepting Hi when θ ∈ Θj, j ≠ i, we have seen (Proposition 6.1) that the null hypothesis H0 should be rejected if, and only if, the appropriate (integrated) likelihood ratio, or Bayes factor,

B01(x) = [ ∫_{Θ0} p(x | θ) p(θ) dθ / ∫_{Θ0} p(θ) dθ ] / [ ∫_{Θ1} p(x | θ) p(θ) dθ / ∫_{Θ1} p(θ) dθ ],

is smaller than a cut-off point which depends on the ratio l01/l10 of the losses incurred by accepting a false null and rejecting a true null, respectively, and on the ratio of the prior probabilities of the hypotheses.

From the point of view of classical decision theory, the problem of hypothesis testing is naturally posed in terms of decision rules. A decision rule for this problem (henceforth called a test procedure δ, or simply a test) is specified in terms of a critical region Rδ, defined as the set of x values such that H0 is rejected whenever x ∈ Rδ. The most relevant frequentist aspect of such a procedure is its power function,

pow(θ | δ) = Pr{x ∈ Rδ | θ}, θ ∈ Θ,

which specifies, as a function of θ, the long-run probability that the test rejects the null hypothesis H0. For any θ ∈ Θ0, pow(θ | δ) is the long-run probability of an incorrect rejection of the null hypothesis. Obviously, the ideal power function would be pow(θ | δ) = 0, θ ∈ Θ0, and pow(θ | δ) = 1, θ ∈ Θ1, although one will seldom be able to derive a test procedure with such an ideal power function.

For any test procedure δ one may explicitly consider two types of error: rejecting a true null hypothesis, a so-called error of type 1, and accepting a false null hypothesis, a so-called error of type 2. Let us denote by α(δ | θ) and β(δ | θ) the respective probabilities of these two types of error, i.e.,

α(δ | θ) = Pr{x ∈ Rδ | θ} if θ ∈ Θ0, and 0 otherwise;
β(δ | θ) = Pr{x ∉ Rδ | θ} if θ ∈ Θ1, and 0 otherwise.

The size of any specific test δ is defined to be

α = sup_{θ ∈ Θ0} pow(θ | δ).

Frequentist statisticians often specify an upper bound for the probability of incorrectly rejecting the null hypothesis; thus, to specify a significance level α is to restrict attention to those tests whose size is not larger than α, which is then called the level of significance of the tests to be considered. Either Θ0 or Θ1 may contain just a single value of θ, in which case the corresponding hypothesis is referred to as a simple hypothesis; if Θi contains more than one value of θ, then Hi is referred to as a composite hypothesis.

Testing Simple Hypotheses

When both H0 and H1 are simple hypotheses, α(δ | θ) = α(δ | θ0) = α(δ) and β(δ | θ) = β(δ | θ1) = β(δ). It would obviously be desirable to identify tests which keep both error probabilities as small as possible but, typically, modifying Rδ to reduce one would make the other larger; thus, one usually tries to minimise some function of the two, for example, a linear combination aα(δ | θ) + bβ(δ | θ). In this case, it can be proved that a test which minimises aα(δ) + bβ(δ) should reject H0 if, and only if, the likelihood ratio in favour of the null is smaller than the ratio of the weights given to the two kinds of error. This can be seen as a particular case of the Bayesian solution recalled above, and is closely related to the Neyman-Pearson lemma (Neyman and Pearson, 1933, 1967), which says that a test which minimises β(δ) subject to α(δ) ≤ α must reject H0 if, and only if, the likelihood ratio p(x | H1)/p(x | H0) exceeds an appropriately chosen constant.
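The Neyman-Pearson prescription can be illustrated numerically. The sketch below uses hypothetical numbers of my own choosing (not from the text): it tests H0: x ~ N(0, 1) against H1: x ~ N(1, 1) on a single observation, where the likelihood-ratio test rejects for large x, and checks by simulation that, at equal size, it has a smaller type-2 error than an alternative test of the same size based on |x|.

```python
import random

def error_rates(reject, mu, reps=40000, seed=11):
    """Monte Carlo rejection probability of a test under N(mu, 1)."""
    rng = random.Random(seed)
    return sum(reject(rng.gauss(mu, 1.0)) for _ in range(reps)) / reps

# Both tests have size ~0.05 under H0: mu = 0.
lr_test = lambda x: x > 1.645        # likelihood ratio is increasing in x
other_test = lambda x: abs(x) > 1.96  # same size, but not likelihood-ratio based

alpha_lr = error_rates(lr_test, 0.0)
alpha_other = error_rates(other_test, 0.0)
beta_lr = 1 - error_rates(lr_test, 1.0)      # type-2 error under H1: mu = 1
beta_other = 1 - error_rates(other_test, 1.0)
```

Both tests show size close to 0.05 under H0, but the likelihood-ratio test has the smaller probability of accepting a false null, as the lemma guarantees.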
In spite of the difficulties described above, frequentist statisticians have traditionally defined an optimal test δ to be one which minimises β(δ | θ) for a fixed significance level α0; for simple hypotheses, the Neyman-Pearson lemma shows explicitly how to derive such a test. It has become standard practice among many frequentist statisticians to choose a significance level α0 (often "conventional" quantities such as 0.05 or 0.01) and then to find a test procedure which minimises β(δ) among all tests such that α(δ) ≤ α0 (rather than explicitly minimising some combination of the two probabilities of error), but it should be emphasised that this is not a sensible procedure. Indeed:

(i) With discrete data one cannot attain a fixed specific size α(δ) without recourse to auxiliary, irrelevant randomisation, whereas minimisation of a linear combination of the form aα(δ) + bβ(δ) can always be achieved, in which case no difficulties of this type can arise. For a Bayesian view on randomisation, see Kadane and Seidenfeld (1986).

(ii) More importantly, by fixing α(δ) and minimising β(δ), one may find that the minimising β(δ) is extremely small compared with the fixed α(δ); with large sample sizes, H0 may then be rejected even when p(x | H0) is far larger than p(x | H1). Although this can be avoided by carefully selecting α(δ) as a decreasing function of the sample size, it seems far more natural to minimise a linear combination aα(δ) + bβ(δ) of the two error probabilities. Indeed, it is important to note (see, e.g., Lindley, 1972) that minimising a linear combination of the two types of error is actually the only coherent way of making a choice, in the sense that no other procedure is equivalent to minimising an expected loss. Other strategies for the choice of α(δ) and β(δ) have been proposed; for example, minimising the maximum of the two error probabilities corresponds to the minimax principle.

Composite Alternative Hypotheses

In terms of the power function, attention is restricted to tests δ such that pow(θ | δ) ≤ α0 for all θ ∈ Θ0, and one seeks a test for which pow(θ | δ) is as large as possible in Θ1. A test procedure δ0 is called a uniformly most powerful (UMP) test at level of significance α0 if α(δ0 | θ) ≤ α0, θ ∈ Θ0, and, for any other δ such that α(δ | θ) ≤ α0, pow(θ | δ) ≤ pow(θ | δ0) for all θ ∈ Θ1. A model {p(x | θ), θ ∈ Θ ⊆ ℜ} is said to have a monotone likelihood ratio in the statistic t = t(x) if, for all θ1 < θ2, the ratio p(x | θ2)/p(x | θ1) is an increasing function of t. It can be proved that, when θ is one-dimensional, UMP tests often exist for one-sided alternative hypotheses, with critical regions defined by a constant c such that Pr{t ≥ c | θ0} = α0.
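Point (i) is easy to exhibit concretely. For a binomial model only finitely many sizes are attainable without randomisation; the sketch below (illustrative code, with n = 10 and θ0 = 0.5 as assumed numbers) lists the attainable sizes of the non-randomised one-sided tests and shows that 0.05 is not among them.

```python
from math import comb

def attainable_sizes(n=10, theta0=0.5):
    """Sizes of the non-randomised tests 'reject if r >= c' under Bi(r | theta0, n)."""
    pmf = [comb(n, r) * theta0**r * (1 - theta0)**(n - r) for r in range(n + 1)]
    return [sum(pmf[c:]) for c in range(n + 2)]  # c = n+1 is the never-reject test

sizes = attainable_sizes()
```

Here the two attainable sizes closest to the "conventional" 0.05 are Pr(r ≥ 9) = 11/1024 ≈ 0.011 and Pr(r ≥ 8) = 56/1024 ≈ 0.055; an exact size of 0.05 requires randomising between the two critical regions.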
Specifically, if p(x | θ) has a monotone likelihood ratio in t, then the test δ which rejects H0 if t ≥ c is a UMP test of the hypothesis H0 ≡ θ ≤ θ0 versus the alternative H1 ≡ θ > θ0. However, UMP tests do not generally exist; in particular, they typically do not exist for two-sided alternatives.

Example B.8. (Non-existence of a UMP test). If x = {x1, …, xn} is a random sample from a normal distribution N(x | μ, 1), then the test δ1, defined by the critical region Rδ1 = {x; x̄ - μ0 > 1.282/√n}, is a UMP test for H0 ≡ μ ≤ μ0 versus H1 ≡ μ > μ0 at the α = 0.10 significance level. Similarly, the test δ2, defined by Rδ2 = {x; x̄ - μ0 < -1.282/√n}, is a UMP test, with the same level, for H0 ≡ μ ≥ μ0 versus H1 ≡ μ < μ0. Since these critical regions are different, it follows that there is no UMP test for μ = μ0 versus μ ≠ μ0.

The fact, illustrated in the above example, that UMP tests typically do not exist for two-sided alternatives, suggests that a less demanding criterion must be used if one is to define a "best" test among those with a fixed significance level. Since the power function pow(θ | δ) describes the probability that the test δ rejects the null, it seems desirable that pow(θ | δ) should be smaller in Θ0 than elsewhere. A test δ is called unbiased if, for any pair θ0 ∈ Θ0 and θ1 ∈ Θ1, it is true that pow(θ0 | δ) ≤ pow(θ1 | δ).

Example B.9. (Comparative power of selected tests). If x = {x1, …, xn} is a random sample from a normal distribution N(x | μ, 1), with n = 30, then the test δ4, defined by the critical region Rδ4 = {x; |x̄ - μ0| > 1.645/√n}, is an unbiased test for H0 ≡ μ = μ0 versus H1 ≡ μ ≠ μ0 at the α = 0.10 significance level. Figure B.2 compares the power of this test with those of the one-sided tests defined in Example B.8, of the same level, and with that of a typical non-symmetric test δ3, which has the critical region {x; x̄ < c1 or x̄ > c2} for suitably chosen constants c1 < c2, and which is more cautious about accepting values of μ larger than μ0 than about accepting values of μ smaller than μ0.

Figure B.2 Power of tests for the mean of a normal distribution.

It is clear from Example B.9 that δ3, of the same level, should be preferred to the unbiased test whenever the consequences of the first class of errors are more serious, or whenever values of μ smaller than μ0 are considered to be more likely. We are drawn again to the general comment that, in any decision procedure, prior information and utility preferences should be an integral part of the solution. Clearly, unbiased procedures may only be reasonable in special circumstances.

Yet another approach to defining a "good" test when UMP tests do not exist is to focus attention on local power, by requiring the power function to be maximised in a neighbourhood of the null hypothesis. Under suitable regularity conditions, locally most powerful tests may be derived by using the sampling distribution of the efficient score function, in a process which is closely related to that described in our discussion of interval estimation. The requirement of maximum local power does not, however, say anything about the behaviour of the test in a region of high power and, indeed, locally most powerful tests may be very inappropriate when the true value of θ is far from θ0.

Methodological Discussion

Testing hypotheses using the frequentist methodology described above may be misleading in many respects. In particular:

(i) It should be obvious that the all too frequent practice of simply quoting whether or not a null hypothesis is rejected at a specified significance level α0 ignores a lot of relevant information. If such a test is to be performed, the statistician should report the cut-off point α such that H0 would not be rejected for any level of significance smaller than α. This value is called the tail area or p-value corresponding to the observed value of the statistic. An added advantage of this approach is that there is no need to select beforehand an arbitrary significance level. However, there is a tendency on the part of many users to interpret a p-value as implying that the probability that H0 is true is smaller than the p-value. Not only, of course, is this false within the frequentist framework but, as noted in the case of confidence intervals, there is, in general, no simple form of reinterpretation which would have a Bayesian justification.
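The power comparison behind Figure B.2 can be reproduced exactly, since both power functions are available in closed form. The sketch below is an illustration using the level-0.10 critical values 1.282 (one-sided) and 1.645 (two-sided) with n = 30, as in the examples above; it confirms that the one-sided test is more powerful for μ > μ0 but biased, its power dropping below the significance level for μ < μ0.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power_one_sided(mu, mu0=0.0, n=30, z=1.282):
    """Power of the test rejecting when xbar - mu0 > z/sqrt(n)."""
    d = sqrt(n) * (mu - mu0)
    return 1.0 - Phi(z - d)

def power_two_sided(mu, mu0=0.0, n=30, z=1.645):
    """Power of the test rejecting when |xbar - mu0| > z/sqrt(n)."""
    d = sqrt(n) * (mu - mu0)
    return (1.0 - Phi(z - d)) + Phi(-z - d)
```

At μ = μ0 both tests have size 0.10; to the right of μ0 the one-sided test dominates, while to the left its power falls below 0.10, which is exactly the sense in which it is biased and the two-sided test is not.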
Indeed, p-values cannot generally be interpreted as posterior probabilities, even asymptotically. See Casella and Berger (1987) for an attempted reconciliation in the case of one-sided tests and, for detailed discussions, Berger (1985a), Berger and Delampady (1987) and Berger and Sellke (1987).

(ii) Another statistical "tradition" related to hypothesis testing consists of declaring an observed value statistically significant whenever the corresponding tail area is smaller than a "conventional" value such as 0.05 or 0.01, implying that there exists statistical evidence which is sufficient to reject the null hypothesis. However, since the classical theory of hypothesis testing does not make any use of a utility function, there is no way to assess formally whether or not the true value of the parameter θ, which may well be numerically different from a hypothetical value θ0, is significantly different from θ0 in the sense of implying any practical difference. For example, a vote proportion of 34% for a political party is technically different from a proportion of 34.001%, but under most plausible utility functions the difference has no political significance.

(iii) Finally, the mutual inconsistency of frequentist desiderata often makes it impossible, even in the theory's own terms, to identify the most appropriate procedure. For example, Durbin (1969) showed that, if x is a random sample from N(x | μ, m²λ), with precision determined by a random integer m, then m is ancillary and hence, by the conditionality principle, tests on μ or λ should condition on the observed m; yet, at least asymptotically, unrestricted tests may be uniformly more powerful. See Chernoff (1951) and Stein (1951) for further arguments against standard hypothesis testing.

B.3.4 Significance Testing

In the previous section, we reviewed the problem of hypothesis testing where, given a family {p(x | θ), θ ∈ Θ}, a null hypothesis H0 ≡ {θ ∈ Θ0} is tested against (at least) one specific alternative. In this section we shall review the problem of pure significance tests, where only the null hypothesis H0 ≡ {p(x | θ0), θ0 ∈ Θ0} has been initially proposed, and it is desired to test whether or not the data x are compatible with this hypothesis, without considering specific alternatives. The null hypothesis may be either simple, if it completely specifies a density p(x | θ0), or composite.

We recall from Section 6.2 that, within the Bayesian framework, the problem of significance testing, as formulated above, could be solved by embedding the hypothetical model in some larger class {p(x | θ), θ ∈ Θ}, designed either to contain actual alternatives of practical interest, or formal alternatives generated by selecting a mathematical neighbourhood of H0.
in repeated samples.. scientific support (or fashion). we showed that Ho should be rejected if. or whatever. and for any function E ~ I ( X ) . ( i ) The frequentist theory does not generally offer any guidance on the choice of an appropriate test statistic (the gcnerulised likelihood rurio resf. if it were true. that. While in the Bayesian analysis . the conditional utility difference. a test statistic t = t ( x ) is selected with two requirements in mind. for each 8. Then. of the kind which it is desired to test. a disguised Bayes factor seems to be the only proposal). t ( ~> EO(x) ) where is the expecred posreriur discrepane?. The result of the analysis is typically reported by stating the pvalue and declaring that Ho should be rejected for all significance levels which are smaller than 1'. and rejecting H. conditional on Ho.and the true model. we proposed the logarithmic as a reasonable general discrepancy measure. f would exceed the observed value t (x). p(1 I H O )should be the same for all 8 E ell.apvulue or significance level is calculated as the probability. (i) The sampling distribution of f under the null hypothesis p ( t 1 H o ) must be known and. given the data x.. due to its special status corresponding to simplicity (Occam's razor). In particular. that p is given by so Small values of p are regarded as strong evidence that HO should be rejected. (ii) The larger the value of t the stronger the evidence of the departure from H. Comparison with the Bayesian analogue summarised above prompts the following remarks. This (fully Bayesian) procedure could be described as that of selecting a test statistic t ( x ) which is expected to measure the discrepancy between HI.476 B. if t(z) is larger than some cutoff point E O ( Z ) describing the additional utility of keeping I i . describing the additional utility obtained by retaining HO because of its special status. if HO is composite. From a frequentist point of view. NonBuyesiun Theories describing.
(ii) Even if a function t = t(x) is found which may be regarded as a sensible discrepancy measure, the frequentist statistician needs to determine the unconditional sampling distribution of t under H0; indeed, in the more interesting situation of composite null hypotheses, it is required that p(t | θ) be the same for all θ ∈ Θ0. This may often be very difficult, and is actually impossible when there are nuisance parameters. Moreover, the absence of declared alternatives even precludes the use of the frequentist optimality criteria used in hypothesis testing. Thus, in general, the frequentist statistician must rely on intuition to select t, whereas in the Bayesian analysis t(x) is naturally and constructively derived as an expected measure of discrepancy.

(iii) If a measure of the strength of evidence against H0 is all that is required, the position of the observed value of t with respect to its posterior predictive distribution p(t | x, H0) under the null hypothesis seems a more reasonable, more relevant answer than quoting the realised p-value. The compatibility of t(x) with H0 may be described by quoting the HPD intervals to which it belongs, or may be measured with any proper scoring rule, such as −log p(t(x) | x, H0). Indeed, as illustrated in Figure B.3, t1(x) may readily be accepted as compatible with H0 while t2(x) may not.

Figure B.3 Visualising the compatibility of t(x) with H0

(iv) If a decision on whether or not to reject H0 has to be made, this should certainly take into account the advantages of keeping H0, by defining the cutoff point ε0(x) in terms of utilities.
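The posterior predictive check of remark (iii) can be illustrated with a small simulation. The sketch below assumes a null model N(μ, 1) with a flat reference prior on μ (so that μ | x is N(x̄, 1/n)); these modelling choices are made here for simplicity and are not taken from the text:

```python
import math
import random
import statistics

def predictive_tail_prob(data, t_stat, n_sim=20000, seed=0):
    # Locates t(x) within its posterior predictive distribution under the
    # null model N(mu, 1) with a flat (reference) prior on mu, so that
    # mu | x ~ N(xbar, 1/n).  Returns Pr(t(x_rep) >= t(x) | x, H0).
    rng = random.Random(seed)
    n = len(data)
    xbar = statistics.fmean(data)
    t_obs = t_stat(data)
    exceed = 0
    for _ in range(n_sim):
        mu = rng.gauss(xbar, 1.0 / math.sqrt(n))       # posterior draw of mu
        rep = [rng.gauss(mu, 1.0) for _ in range(n)]   # replicate data set
        exceed += t_stat(rep) >= t_obs
    return exceed / n_sim

# Discrepancy checking the unit-variance assumption of the null model.
t = statistics.pvariance

rng = random.Random(1)
inflated = [rng.gauss(0.0, 5.0) for _ in range(40)]  # variance far above 1
tail = predictive_tail_prob(inflated, t)             # t(x) far in the tail
```

A value of `tail` near 0 or 1 places t(x) in an extreme region of its posterior predictive distribution, signalling incompatibility with H0, exactly the situation of t2(x) in Figure B.3.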
We described in Section 6.2 how a cutoff may actually be chosen to guarantee a specified significance level, but this is only one possible choice, not necessarily the most appropriate in all circumstances.

We should finally point out that most of the criticisms already made about hypothesis testing are equally applicable to significance testing. Thus, p-values or confidence intervals are largely irrelevant once the sample has been observed, since they are concerned with events which might have occurred, but have not: to quote Jeffreys (1939/1961, p. 385), a hypothesis which may be true may be rejected because it has not predicted observable results which have not occurred. Similarly, since confidence intervals can generally be thought of as consisting of those null values which are not rejected under a significance test, the criticisms made of confidence intervals typically apply to significance testing as well.

B.4 COMPARATIVE ISSUES

B.4.1 Conditional and Unconditional Inference

At numerous points of this Appendix we have emphasised the following essential difference between Bayesian and frequentist statistics: Bayesian statistics directly produces statements about the uncertainty of unknown quantities, either parameters or future observations, conditional on known data; frequentist statistics produces probability statements about hypothetical repetitions of the data, conditional on the unknown parameters, and then seeks (indirectly) ways of making these relevant to inferences about the unknown parameters given the observed data. Indeed, the problem at the very heart of the frequentist approach to statistics is that of connecting aggregate, long-run sampling properties under hypothetical repetitions to specific inference problems, conditional on known data. Not only may one dispute the existence of the conceptual "collective" where these hypothetical repetitions might take place, but the relevance of aggregate, long-run properties for specific inference problems seems, at best, only tangential.

It is useful to distinguish between two very different concepts, introduced by Savage (1962): initial precision and final precision. Thus, one might expect that, in the long run, the true mean μ will be included in 95% of the intervals of the form x̄ ± 1.96/√n which might be constructed by repeated sampling from a normal distribution with known unit precision.
Frequentist procedures are designed in terms of their expected behaviour over the sample space, for each value of the unknown parameters; they typically have average characteristics which describe, before the data are collected, the "precision" we may initially expect. However, there is no logical possibility within the frequentist framework of assessing the relevant final precision which derives from the observed sample (which typically will not be repeated in any actual practice); one must rest instead on the rather dubious "transferred" properties of the long-run behaviour of the procedure when making inferences of a totally different type, conditional on the data actually observed. Thus, in the normal example, the 95% coverage statement describes the initial precision of the procedure; a far more pertinent question, however, is the following: given the observed x̄, how close is the unknown μ to the observed x̄?
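The aggregate, initial-precision property is straightforward to verify by simulation. The following sketch (with illustrative values for μ and n) confirms the long-run coverage of the interval x̄ ± 1.96/√n, while saying nothing about any particular observed x̄:

```python
import math
import random

def coverage(mu, n, n_rep=10000, seed=0):
    # Fraction of intervals xbar +/- 1.96/sqrt(n), computed from repeated
    # N(mu, 1) samples, that contain the true mean: a long-run (initial
    # precision) property.  It says nothing about how close mu is to the
    # xbar actually observed in any single sample (the final precision).
    rng = random.Random(seed)
    half = 1.96 / math.sqrt(n)
    hits = 0
    for _ in range(n_rep):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        hits += (xbar - half <= mu <= xbar + half)
    return hits / n_rep

cov = coverage(mu=3.0, n=25)  # close to 0.95, whatever the value of mu
```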
Within the frequentist approach, the need for conditioning on the observed data can be partially met by conditioning on an ancillary statistic. Indeed, we saw in Examples B.2 and B.3 that it is easy to construct examples where totally unconditional procedures produce ludicrous results. However, there remain many problems with conditioning on ancillary statistics: they are not easily identifiable; they are not necessarily unique; and, as pointed out in our discussion of the conditionality principle, conditioning on an ancillary statistic can yield a totally uninformative sampling distribution, and can conflict with other frequentist desiderata, such as the search for maximum power in hypothesis testing. See, for example, Basu (1964, 1992), Berger (1984b) and Cox and Reid (1987).

Example B.1. (Initial and final precision). The following example, taken from Welch (1939), further illustrates the difference between initial and final precision. Let x = {x1, ..., xn} be a random sample from a uniform distribution over ]μ − 1/2, μ + 1/2[. It is easily verified that the midrange μ̂ = (x(1) + x(n))/2 is a very efficient estimator, with a sampling variance of the order of 1/n², rather than the usual 1/n, so that, from large samples, we may expect, on average, very precise estimates of μ. Suppose, however, that we obtain a specific large sample with a small range; this is, admittedly, unlikely, but nevertheless possible. Given the sample, and using a uniform (reference) prior for μ, we can only really claim that μ ∈ ]x(n) − 1/2, x(1) + 1/2[ (since the reference posterior distribution is uniform on that interval), no matter how efficient the estimator μ̂ was expected to be.
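Example B.1 is easy to reproduce numerically. In the sketch below the "small range" sample is constructed directly, since such samples, though possible, are very unlikely to arise by simulation; the true value μ = 2.7 is a hypothetical choice for illustration only:

```python
import random

def posterior_support(x):
    # For a sample from Uniform(mu - 1/2, mu + 1/2) with a uniform
    # (reference) prior, the posterior for mu is uniform on
    # ]max(x) - 1/2, min(x) + 1/2[; the width 1 - (max - min) of this
    # interval is what actually governs the final precision.
    return max(x) - 0.5, min(x) + 0.5

mu = 2.7  # hypothetical true value, for illustration only
rng = random.Random(0)

# Typical large sample: the range is close to 1, so the posterior
# interval is very narrow and the final precision is excellent.
typical = [mu + rng.uniform(-0.5, 0.5) for _ in range(1000)]
lo1, hi1 = posterior_support(typical)

# Unlikely but possible: all observations happen to cluster, so the
# posterior interval is nearly a full unit wide, however efficient the
# midrange estimator was expected to be on average.
clustered = [mu + rng.uniform(-0.01, 0.01) for _ in range(1000)]
lo2, hi2 = posterior_support(clustered)
```

The contrast between the widths hi1 − lo1 and hi2 − lo2 is precisely the gap between initial precision (an average over samples) and final precision (a property of the sample in hand).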
Thus, if the actual data turn out this way, the final precision of our inferences about μ is bound to be rather poor, no matter how precise we initially expected them to be.

B.4.2 Nuisance Parameters and Marginalisation

Most realistic probability models make the sampling distribution dependent not only on the unknown quantity of primary interest, but also on some other parameters, so that the full parameter vector θ can typically be partitioned into θ = {φ, λ}, where φ is the subvector of interest and λ is the complementary subvector of θ, often referred to as the vector of nuisance parameters. We saw in Section 5.1 that, within a Bayesian framework, the presence of nuisance parameters does not pose any formal, theoretical problems; the desired result, namely the (marginal) posterior distribution of the parameter of interest, can simply be written as
p(φ | x) = ∫ p(φ, λ | x) dλ,

where p(φ, λ | x) is the joint posterior distribution of θ = {φ, λ}. In special cases, simplifications become available that are not available in general (Cox and Hinkley, 1974); in general, however, frequentists are forced to use approximations. Using a sequence of normal samples with different means and common variance σ², Neyman and Scott (1948) illustrated