BAYESIAN
THEORY
José M. Bernardo
Professor of Statistics, Universidad de Valencia, Spain
Adrian F. M. Smith
Professor of Statistics, Imperial College of Science, Technology and Medicine, London, UK
JOHN WILEY & SONS, LTD
Chichester · New York · Weinheim · Brisbane · Singapore · Toronto
Copyright © 2000 by John Wiley & Sons, Ltd, Baffins Lane, Chichester, West Sussex PO19 1UD, England
National 01243 779777
International (+44) 1243 779777
e-mail (for orders and customer service enquiries): cs-books@wiley.co.uk
Visit our Home Page on http://www.wiley.co.uk or http://www.wiley.com
First published in hardback 1994 (ISBN 0 471 92416 4)
All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency, 90 Tottenham Court Road, London, UK W1P 9HE, without the permission in writing of the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the publication.
Neither the authors nor John Wiley & Sons Ltd accept any responsibility or liability for loss or damage occasioned to any person or property through using the material, instructions, methods or ideas contained herein, or acting or refraining from acting as a result of such use. The authors and Publisher expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on the authors or Publisher to correct any errors or defects in the software.
Designations used by companies to distinguish their products are often claimed as trademarks. In all instances where John Wiley & Sons is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.
Other Wiley Editorial Offices
John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, USA
Weinheim · Brisbane · Singapore · Toronto
Library of Congress Cataloging-in-Publication Data

Bernardo, José M.
Bayesian theory / José M. Bernardo, Adrian F. M. Smith.
p. cm. (Wiley series in probability and mathematical statistics)
Includes bibliographical references and indexes.
ISBN 0 471 92416 4
1. Bayesian statistical decision theory. I. Smith, Adrian F. M. II. Title. III. Series.
QA279.5.B47 1993
519.5'42 dc20 93-31554 CIP

British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library
ISBN 0 471 49464 X
To Marina and Daniel
Preface
This volume, first published in hardback in 1994, presents an overview of the foundations and key theoretical concepts of Bayesian Statistics. Our original intention had been to produce further volumes on computation and methods. However, these projects have been shelved as tailored Markov chain Monte Carlo methods have emerged and are being refined as the standard Bayesian computational tools. We have taken the opportunity provided by this reissue to make a number of typographical corrections.

The original motivation for this enterprise stemmed from the impact and influence of de Finetti's two-volume Theory of Probability, which one of us helped translate into English from the Italian in the early 1970's. This was widely acknowledged as the definitive exposition of the operationalist, subjectivist approach to uncertainty, and provided further impetus at that time to a growth in activity and interest in Bayesian ideas. From a philosophical, foundational perspective, the de Finetti volumes provide, in the words of the author's dedication to his friend Segre, "a necessary document for clarifying one point of view in its entirety."

From a statistical, methodological perspective, however, the de Finetti volumes end abruptly, with just the barest introduction to the mechanics of Bayesian inference. Some years ago, we decided to try to write a series of books which would take up the story where de Finetti left off, with the grandiose objective of "clarifying in its entirety" the world of Bayesian statistical theory and practice.
It is now clear that this was a hopeless undertaking. The world of Bayesian Statistics has been changing shape and growing in size rapidly and unpredictably, most notably in relation to developments in computational methods and the subsequent opening up of new application horizons. We are greatly relieved that we were too incompetent to finish our books a few years ago! And, of course, these changes and developments continue. There is no static world of Bayesian Statistics to describe in a once-and-for-all way. Moreover, we are dealing with a field of activity where, even among those whose intellectual perspectives fall within the broad paradigm, there are considerable differences of view at the level of detail and nuance of interpretation. This volume on Bayesian Theory attempts to provide a fairly complete and up-to-date overview of what we regard as the key concepts, results and issues. However, it necessarily reflects the prejudices and interests of its authors, as well as the temporal constraints imposed by a publisher whose patience has been sorely tested for far too long. We can but hope that our sins of commission and omission are not too grievous.

Too many colleagues have taught us too many things for it to be practical to list everyone to whom we are beholden. However, Dennis Lindley has played a special role, not least in supervising us as Ph.D. students, and we should like to record our deep gratitude to him. We also shared many enterprises with Morrie DeGroot and continue to miss his warmth and intellectual stimulation. For detailed comments on earlier versions of material in this volume, we are indebted to our colleagues M. J. Bayarri, J. O. Berger, J. de la Horra, P. Diaconis, F. J. Girón, M. A. Gómez-Villegas, D. V. Lindley, M. Mendoza, J. Muñoz, E. Moreno, L. R. Pericchi, A. van der Linde, C. Villegas and M. West.

We are also grateful, in more ways than one, to the State of Valencia. It has provided a beautiful and congenial setting for much of the writing of this book. And, in the person of the Governor, Joan Lerma, it has been wonderfully supportive of the celebrated series of Valencia International Meetings on Bayesian Statistics. During the secondment of one of us as scientific advisor to the Governor, it also provided resources to enable the writing of this book to continue.

This volume has been produced directly in TEX and we are grateful to Maria Dolores Tortajada for all her efforts. Finally, we thank past and present editors at John Wiley & Sons for their support of this project: Jamie Cameron for saying "Go!" and Helen Ramsey for saying "Stop!"
Valencia, Spain
January 26, 2000
J. M. Bernardo A. F. M. Smith
Contents

1. INTRODUCTION 1
1.1. Thomas Bayes 1
1.2. The subjectivist view of probability 2
1.3. Bayesian Statistics in perspective 3
1.4. An overview of Bayesian Theory 5
     1.4.1. Scope 5
     1.4.2. Foundations 5
     1.4.3. Generalisations 6
     1.4.4. Modelling 7
     1.4.5. Inference 7
     1.4.6. Remodelling 8
     1.4.7. Basic formulae 8
     1.4.8. Non-Bayesian theories 9
1.5. A Bayesian reading list 9

2. FOUNDATIONS 13
2.1. Beliefs and actions 13
2.2. Decision problems 16
     2.2.1. Basic elements 16
     2.2.2. Formal representation 18
2.3. Coherence and quantification 23
     2.3.1. Events, options and preferences 23
     2.3.2. Coherent preferences 23
     2.3.3. Quantification 28
2.4. Beliefs and probabilities 33
     2.4.1. Representation of beliefs 33
     2.4.2. Revision of beliefs and Bayes' theorem 38
     2.4.3. Conditional independence 45
     2.4.4. Sequential revision of beliefs 47
2.5. Actions and utilities 49
     2.5.1. Bounded sets of consequences 49
     2.5.2. Bounded decision problems 50
     2.5.3. General decision problems 54
2.6. Sequential decision problems 56
     2.6.1. Complex decision problems 56
     2.6.2. Backward induction 59
     2.6.3. Design of experiments 63
2.7. Inference and information 67
     2.7.1. Reporting beliefs as a decision problem 67
     2.7.2. The utility of a probability distribution 69
     2.7.3. Approximation and discrepancy 75
     2.7.4. Information 77
2.8. Discussion and further references 81
     2.8.1. Operational definitions 81
     2.8.2. Quantitative coherence theories 83
     2.8.3. Related theories 85
     2.8.4. Critical issues 92

3. GENERALISATIONS 105
3.1. Generalised representation of beliefs 105
     3.1.1. Motivation 105
     3.1.2. Countable additivity 106
3.2. Review of probability theory 109
     3.2.1. Random quantities and distributions 109
     3.2.2. Some particular univariate distributions 114
     3.2.3. Convergence and limit theorems 125
     3.2.4. Random vectors, Bayes' theorem 127
     3.2.5. Some particular multivariate distributions 133
3.3. Generalised options and utilities 141
     3.3.1. Motivation and preliminaries 141
     3.3.2. Generalised preferences 145
     3.3.3. The value of information 147
3.4. Generalised information measures 150
     3.4.1. The utility of a general probability distribution 151
     3.4.2. Generalised approximation and discrepancy 154
     3.4.3. Generalised information 157
3.5. Discussion and further references 160
     3.5.1. The role of mathematics 160
     3.5.2. Critical issues 161

4. MODELLING 165
4.1. Statistical models 165
     4.1.1. Beliefs and models 165
4.2. Exchangeability and related concepts 167
     4.2.1. Dependence and independence 167
     4.2.2. Exchangeability and partial exchangeability 168
4.3. Models via exchangeability 172
     4.3.1. The Bernoulli and binomial models 176
     4.3.2. The multinomial model 177
     4.3.3. The general model
4.4. Models via invariance 181
     4.4.1. The normal model 181
     4.4.2. The multivariate normal model 185
     4.4.3. The exponential model 187
     4.4.4. The geometric model 189
4.5. Models via sufficient statistics 190
     4.5.1. Summary statistics 190
     4.5.2. Predictive sufficiency and parametric sufficiency 197
     4.5.3. Sufficiency and the exponential family
     4.5.4. Information measures and the exponential family 207
4.6. Models via partial exchangeability 209
     4.6.1. Models for extended data structures 209
     4.6.2. Several samples 211
     4.6.3. Structured layouts 217
     4.6.4. Covariates 219
     4.6.5. Hierarchical models 222
4.7. Pragmatic aspects 226
     4.7.1. Finite and infinite exchangeability 226
     4.7.2. Parametric and nonparametric models 228
     4.7.3. Model elaboration 229
     4.7.4. Model simplification 233
4.8. Discussion and further references 235
     4.8.1. Representation theorems 235
     4.8.2. Subjectivity and objectivity 236
     4.8.3. Critical issues 237

5. INFERENCE 241
5.1. The Bayesian paradigm 241
     5.1.1. Observables, beliefs and models 242
     5.1.2. The role of Bayes' theorem 243
     5.1.3. Predictive and parametric inference 247
     5.1.4. Sufficiency, ancillarity and stopping rules
     5.1.5. Decisions and inference summaries 255
     5.1.6. Implementation issues 263
5.2. Conjugate analysis 265
     5.2.1. Conjugate families 265
     5.2.2. Canonical conjugate analysis 269
     5.2.3. Approximations with conjugate families 279
5.3. Asymptotic analysis 285
     5.3.1. Discrete asymptotics 286
     5.3.2. Continuous asymptotics 287
     5.3.3. Asymptotics under transformations 295
5.4. Reference analysis 298
     5.4.1. Reference decisions 299
     5.4.2. One-dimensional reference distributions 302
     5.4.3. Restricted reference distributions 316
     5.4.4. Nuisance parameters 320
     5.4.5. Multiparameter problems 333
5.5. Numerical approximations 339
     5.5.1. Laplace approximation 340
     5.5.2. Iterative quadrature 346
     5.5.3. Importance sampling 348
     5.5.4. Sampling-importance-resampling 350
     5.5.5. Markov chain Monte Carlo 353
5.6. Discussion and further references 356
     5.6.1. An historical footnote 356
     5.6.2. Prior ignorance 357
     5.6.3. Robustness 367
     5.6.4. Hierarchical and empirical Bayes 371
     5.6.5. Further methodological developments 373
     5.6.6. Critical issues 374

6. REMODELLING 377
6.1. Model comparison 377
     6.1.1. Ranges of models 377
     6.1.2. Perspectives on model comparison 383
     6.1.3. Model comparison as a decision problem 386
     6.1.4. Zero-one utilities and Bayes factors 389
     6.1.5. General utilities 395
     6.1.6. Approximation by cross-validation 403
     6.1.7. Covariate selection 407
6.2. Model rejection 409
     6.2.1. Model rejection through model comparison 409
     6.2.2. Discrepancy measures for model rejection 412
     6.2.3. Zero-one discrepancies 413
     6.2.4. General discrepancies 415
6.3. Discussion and further references 417
     6.3.1. Overview 417
     6.3.2. Modelling and remodelling 418
     6.3.3. Critical issues 418

A. SUMMARY OF BASIC FORMULAE 427
A.1. Probability distributions 427
A.2. Inferential processes 436

B. NON-BAYESIAN THEORIES 443
B.1. Overview 443
B.2. Alternative approaches 445
     B.2.1. Classical decision theory 445
     B.2.2. Frequentist procedures
     B.2.3. Likelihood inference 454
     B.2.4. Fiducial and related theories 456
B.3. Stylised inference problems 460
     B.3.1. Point estimation 460
     B.3.2. Interval estimation 465
     B.3.3. Hypothesis testing 469
     B.3.4. Significance testing 475
B.4. Comparative issues 478
     B.4.1. Conditional and unconditional inference 478
     B.4.2. Nuisance parameters and marginalisation 479
     B.4.3. Approaches to prediction 482
     B.4.4. Aspects of asymptotics 485
     B.4.5. Model choice criteria 486

REFERENCES 489
SUBJECT INDEX 555
AUTHOR INDEX
Chapter 1

Introduction

Summary

A brief historical introduction to Bayes' theorem and its author is given, as a prelude to a statement of the perspective adopted in this volume regarding Bayesian Statistics. An overview is provided of the material to be covered in successive chapters and appendices, and a Bayesian reading list is provided.

1.1 THOMAS BAYES

According to contemporary journal death notices and the inscription on his tomb in Bunhill Fields cemetery in London, Thomas Bayes died on 7th April, 1761. Definitive records of Bayes' birth do not seem to exist, but, allowing for the calendar reform of 1752 and accepting that he died at the age of 59, it seems likely that he was born in 1701 (an argument attributed to Bellhouse in the Inst. Math. Statist. Bulletin). The inscription on top of the tomb reads:

Rev. Thomas Bayes. Son of the said Joshua and Ann Bayes (59). 7 April 1761. In recognition of Thomas Bayes's important work in probability, the vault was restored in 1969 with contributions received from statisticians throughout the world.

Some background on the life and the work of Bayes may be found in Barnard (1958), Holland (1962), Pearson (1978), Stigler (1986a), Dale (1990, 1991), Earman (1990) and Gillies (1987).

That his name lives on in the characterisation of a modern statistical methodology is a consequence of the publication of An essay towards solving a problem in the doctrine of chances, attributed to Bayes and communicated to the Royal Society after Bayes' death by Richard Price in 1763 (Phil. Trans. Roy. Soc. 53, 370-418). The technical result at the heart of the essay is what we now know as Bayes' theorem. In its simplest form, if H denotes an hypothesis and D denotes data, the theorem states that

P(H | D) = P(D | H) x P(H) / P(D).

With P(H) regarded as a probabilistic statement of belief about H before obtaining data D, the left-hand side P(H | D) becomes a probabilistic statement of belief about H after obtaining data D. Having specified P(D | H) and P(D), the mechanism of the theorem provides a solution to the problem of how to learn from data. Actually, Bayes only stated his result for a uniform prior. According to Stigler (1986b), it was Laplace (1774/1986), apparently unaware of Bayes' work, who stated the theorem in its general (discrete) form.

Like any theorem in probability, Bayes' theorem merely asserts that the left-hand side of the equation must equal the right-hand side, and from a purely formal perspective there is no obvious reason why this essentially trivial probability result should continue to excite interest: at the technical level, Bayes' theorem merely provides a form of "uncertainty accounting". The interest and controversy, of course, lie in the interpretation and assumed scope of the formal inputs to the two sides of the equation, and it is here that past and present commentators part company in their responses to the idea that Bayes' theorem can or should be regarded as a central feature of the statistical learning process. At the heart of the controversy is the issue of the philosophical interpretation of probability (objective or subjective?) and the appropriateness and legitimacy of basing a scientific theory on the latter.

What Thomas Bayes, from the tranquil surroundings of Bunhill Fields, where he lies in peace with Richard Price for company, has made of all the fuss over the last 233 years we shall never know. We would like to think that he is a subjectivist fellow-traveller but, in any case, he is in no position to complain at the liberties we are about to take in his name.
1.2 THE SUBJECTIVIST VIEW OF PROBABILITY

Throughout this work, we shall adopt a wholehearted subjectivist position regarding the interpretation of probability. The definitive account and defence of this position are given in de Finetti's two-volume Theory of Probability (1970/1974, 1970/1975), and the following brief extract from the Preface to that work perfectly encapsulates the essence of the case.

The only relevant thing is uncertainty, the extent of our own knowledge and ignorance. The actual fact of whether or not the events considered are in some sense determined, or known by other people, and so on, is of no consequence.

The numerous, different, opposed attempts to put forward particular points of view which, in the opinion of their supporters, would endow Probability Theory with a 'nobler' status, or a 'more scientific' character, or 'firmer' philosophical or logical foundations, have only served to generate confusion and obscurity, and to provoke well-known polemics and disagreements, even between supporters of essentially the same framework. The main points of view that have been put forward are as follows.

The classical view, based on physical considerations of symmetry, in which one should be obliged to give the same probability to such 'symmetric' cases. But which symmetry? And, in any case, why? The original sentence becomes meaningful if reversed: the symmetry is probabilistically significant, in someone's opinion, if it leads him to assign the same probabilities to such events.

The logical view is similar, but much more superficial and irresponsible inasmuch as it is based on similarities or symmetries which no longer derive from the facts and their actual properties, but merely from the sentences which describe them, and from their formal structure or language.

The frequentist (or statistical) view presupposes that one accepts the classical view, in that it considers an event as a class of individual events, the latter being 'trials' of the former. The individual events not only have to be 'equally probable', but also 'stochastically independent' (these notions when applied to individual events are virtually impossible to define or explain in terms of the frequentist interpretation). In this case, it is straightforward, by means of the subjective approach, to obtain, in a perfectly valid manner, the result aimed at (but unattainable) in the statistical formulation. It suffices to make use of the notion of exchangeability. The result, which acts as a bridge connecting the new approach with the old, has often been referred to by the objectivists as "de Finetti's representation theorem". It follows that all the three proposed definitions of 'objective' probability, although useless per se, turn out to be useful and good as valid auxiliary devices when included as such in the subjectivist theory. (de Finetti, 1970/1974, Preface, pp. xi-xii)

1.3 BAYESIAN STATISTICS IN PERSPECTIVE

The theory and practice of Statistics span a range of diverse activities, which are motivated and characterised by varying degrees of formal intent. Activity in the context of initial data exploration is typically rather informal; activity relating to concepts and theories of evidence and uncertainty is somewhat more formally structured; and activity directed at the mathematical abstraction and rigorous analysis of these structures is intentionally highly formal.

What is the nature and scope of Bayesian Statistics within this spectrum of activity? Bayesian Statistics offers a rationalist theory of personalistic beliefs in contexts of uncertainty, with the central aim of characterising how an individual should act in order to avoid certain kinds of undesirable behavioural inconsistencies. The theory establishes that expected utility maximisation provides the basis for rational decision making and that Bayes' theorem provides the key to the ways in which beliefs should fit together in the light of changing evidence. The goal, in effect, is to establish rules and procedures for individuals concerned with disciplined uncertainty accounting. The theory is not descriptive, in the sense of claiming to model actual behaviour. Rather, it is prescriptive, in the sense of saying "if you wish to avoid the possibility of these undesirable consequences you must act in the following way".

From the very beginning, the development of the theory necessarily presumes a rather formal frame of discourse, within which uncertain events and available actions can be described and axioms of rational behaviour can be stated. But this formalism is preceded and succeeded in the scientific learning cycle by activities which cannot readily be seen as part of the formalism. In any field of application, a prerequisite for arriving at a structured frame of discourse will typically be an informal phase of exploratory data analysis. Also, it can happen that evidence arises which discredits a previously assumed and accepted formal framework and necessitates a rethink. Part of the process of realising that a change is needed can take place within the currently accepted framework using Bayesian ideas, but the process of rethinking is again outside the formalism. Both these phases of initial structuring and subsequent restructuring might well be guided by "Bayesian thinking", by which we mean keeping in mind the objective of creating or recreating a formal framework for uncertainty analysis and decision making, but they are not themselves part of the Bayesian formalism. There is, in effect, often a pragmatic ambiguity about the boundaries of the formal and the informal.

The emphasis in this book is on ideas, and we have sought throughout to keep the level of the mathematical treatment as simple as is compatible with giving what we regard as an honest account. There are, however, sections where the full story would require a greater level of abstraction than we have adopted, and we have drawn attention to this whenever appropriate.
1.4 AN OVERVIEW OF BAYESIAN THEORY

1.4.1 Scope

This volume on Bayesian Theory focuses on the basic concepts and theory of Bayesian Statistics, with chapters covering elementary Foundations, mathematical Generalisations of the foundations, Modelling, Inference and Remodelling. In addition, there are two appendices providing a Summary of Basic Formulae and a review of Non-Bayesian Theories. The emphasis throughout is on general ideas, the Why? of Bayesian Statistics. A systematic study of the methods of analysis for a wide range of commonly encountered model and problem types, the What?, will be provided in the volume Bayesian Methods; a detailed treatment of analytical and numerical techniques for implementing Bayesian procedures, the How?, will be provided in the volume Bayesian Computation.

The selection of topics and the details of approach adopted in this volume necessarily reflect our own preferences and prejudices. There are important topics, such as Game Theory and Group Decision Making, which are omitted simply because a proper treatment seemed to us to involve too much of a digression from our central theme. (For a convenient source of discussion and references at the interface of Decision Theory and Game Theory, see French, 1986.) In most cases, however, the omission of a topic, or its abbreviated treatment in this volume, reflects the fact that a detailed treatment will be given in one or other of the volumes Bayesian Computation and Bayesian Methods. Topics falling into this category include Design of Experiments, Image Analysis, Linear Models, Multivariate Analysis, Nonparametric Inference, Prior Elicitation, Robustness, Sequential Analysis, Survival Analysis and Time Series.

Where we hold strong views, these are, for the most part, rather clearly and forcefully stated, while, hopefully, avoiding too dogmatic a tone. We acknowledge, however, that even colleagues who are committed to the Bayesian paradigm will disagree with at least some points of detail and emphasis in our account. To avoid complicating the main text with too many digressionary asides and references, each of Chapters 2 to 6 concludes with a Discussion and Further References section, in which some of the key issues in the chapter are critically re-examined.

1.4.2 Foundations

In Chapter 2, the concept of rationality is explored in the context of representing beliefs or choosing actions in situations of uncertainty. We introduce a formal framework for decision problems and an axiom system for the foundations of decision theory, which we believe to have considerable intuitive appeal and to be an improvement on the many such systems that have been previously proposed. Here, and throughout this volume, we stress the importance of a decision-oriented framework in providing a disciplined setting for the discussion of issues relating to uncertainty and rationality.

The dual concepts of probability and utility are formally defined and analysed within this decision making context, and the criterion of maximising expected utility is shown to be the only decision criterion which is compatible with the axiom system. The analysis of sequential decision problems is shown to reduce to successive applications of the methodology introduced.

A key feature of our approach is that statistical inference is viewed simply as a particular form of decision problem, specifically, a decision problem where an action corresponds to reporting a probability belief distribution for some unknown quantity of interest. Thus defined, the inference problem can be analysed within the general decision theory framework, rather than requiring a separate "theory of inference". An important special feature of what we shall call a pure inference problem is the form of utility function to be adopted. We establish that the logarithmic utility function, more often referred to as a score function in this context, plays a special role as the natural utility function for describing the preferences of an individual faced with a pure inference problem. Within this framework, measures of the discrepancy between probability distributions and the amount of information contained in a distribution are naturally defined in terms of expected loss and expected increase, respectively, in logarithmic utility. These measures are mathematically closely related to well-known information-theoretic measures pioneered by Shannon (1948) and employed in statistical contexts by Kullback (1959/1968). A resulting characteristic feature of our approach is therefore the systematic appearance of these information-theoretic quantities as key elements in the Bayesian analysis of inference and general decision problems.

1.4.3 Generalisations

In Chapter 3, the ideas and results of Chapter 2 are extended to a much more general mathematical setting. An additional postulate concerning the comparison of a countable collection of events is appended to the axiom system of Chapter 2, and is shown to provide a justification for restricting attention to countably additive probability as the basis for representing beliefs. The elements of mathematical probability theory required in our subsequent development are then reviewed. The notions of actions and utilities, introduced in a simple discrete setting in Chapter 2, are extended in a natural way to provide a very general mathematical framework for our development of decision theory. A further additional mathematical postulate regarding preferences is introduced and, within this more general framework, the criterion of maximising expected utility is shown to be the only decision making criterion compatible with the extended axiom system. In this generalised setting, inference problems are again considered simply as special cases of decision problems, and generalised definitions of score functions and measures of information and discrepancy are given.
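The logarithmic score and its associated discrepancy measure can be given a small numerical reading. The sketch below is our own illustration (the example distributions are arbitrary); the discrepancy shown is the directed divergence of Kullback.

```python
import math

# The logarithmic score as a utility for reporting a discrete distribution,
# and the associated discrepancy (expected loss) between two distributions.
# (Illustrative only; the distributions are arbitrary.)

def log_score(q, i):
    """Utility of having reported distribution q when outcome i occurs."""
    return math.log(q[i])

def kl_divergence(p, q):
    """Expected loss from reporting q when p represents actual beliefs:
    sum over i of p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # actual beliefs
q = [0.9, 0.1]   # some other reported distribution
print(kl_divergence(p, p))      # 0.0: no expected loss in reporting p itself
print(kl_divergence(p, q) > 0)  # True: any other report loses in expectation
```

The fact that the expected loss is zero only when one reports one's actual beliefs is what makes the logarithmic score the natural utility function for a pure inference problem.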
1.4.4 Modelling

In Chapter 4, we examine in detail the role of familiar mathematical forms of statistical models and the possible justifications, from a subjectivist perspective, for their use as representations of actual beliefs about observable random quantities. From this perspective, the role of conventional parametric statistical modelling is problematic, and requires fundamental re-examination. A feature of our approach is an emphasis on the primacy of observables and the notion of a model as a (probabilistic) prediction device for such observables.

The problem is approached by considering simple structural characteristics, such as symmetry with respect to the labelling of individual counts or measurements, a feature common to many individual beliefs about sequences of observables. The key concept here is that of exchangeability, which we motivate, formalise and then use to establish a version of de Finetti's celebrated representation theorem. This demonstrates that judgements of exchangeability lead to general mathematical representations of beliefs that justify and clarify the use and interpretations of such familiar statistical concepts as parameters, random samples, likelihoods and prior distributions.

Going beyond simple exchangeability, we show that beliefs which have certain additional invariance properties, for example, to rotation of the axes of measurements, or translation of the origin, can lead to mathematical representations involving other familiar specific forms of parametric distributions, such as normals and exponentials. A further approach to characterising belief distributions, based on data reduction, is also considered. The concept of a sufficient statistic is introduced and related to representations involving the exponential family of distributions.

Various forms of partial exchangeability judgements about data structures are then discussed in a number of familiar contexts, and links are established with a number of other commonly used statistical models. Structures considered include those of several samples, problems involving covariates, multi-way layouts, and hierarchies.
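The representation theorem just mentioned can be given a concrete computational reading. The following sketch is our own illustration, not the book's: an exchangeable 0-1 sequence is simulated by first drawing a "parameter" from a mixing (prior) distribution, here uniform on [0, 1] as an arbitrary choice, and then sampling conditionally independent Bernoulli observations.

```python
import random

# An exchangeable 0-1 sequence behaves as if a parameter theta were first
# drawn from a mixing distribution and the observations were then
# independent Bernoulli(theta) given theta. (Illustrative sketch only;
# the uniform mixing distribution is an arbitrary choice.)

def exchangeable_sequence(n, rng):
    theta = rng.random()                              # draw the "parameter"
    return [int(rng.random() < theta) for _ in range(n)]

seq = exchangeable_sequence(10, random.Random(0))
print(seq)

# Exchangeability shows up in the fact that the probability of a particular
# sequence depends only on the number k of ones, not on their order; under
# the uniform mixer this probability is 1 / ((n + 1) * C(n, k)).
```

In this reading, the "parameter", the "likelihood" and the "prior" all arise from a judgement of symmetry about the observables, rather than being assumed from the outset.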
1.4.5 Inference

In Chapter 5, the key role of Bayes' theorem in the updating of beliefs about observables in the light of new information is identified and related to conventional mechanisms of predictive and parametric inference. The roles of sufficiency, ancillarity and stopping rules in such inference processes are also examined. Various standard forms of statistical problems, such as point and interval estimation and hypothesis testing, are re-examined within the general Bayesian decision framework and related to formal and informal inference summaries. The mathematical convenience and elegance of conjugate analysis are illustrated in detail, as are the mathematical approximations available under the assumption of the validity of large-sample asymptotic analysis.

A particular feature of this volume is the extended account of so-called reference analysis, which can be viewed as a Bayesian formalisation of the idea of "letting the data speak for themselves". An alternative, closely related idea is that of how to represent "vague beliefs" or "ignorance", in the absence of any specification of an actual belief model. We provide a detailed historical review of attempts that have been made to solve this problem and compare and contrast some of these with the reference analysis approach.

The problems of implementing Bayesian procedures are discussed at length. A brief account is given of recent analytic approximation strategies derived from Laplace-type methods, together with outline accounts of numerical quadrature, importance sampling, sampling-importance-resampling and Markov chain Monte Carlo methods.
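One of the implementation techniques just listed, importance sampling, can be sketched in a few lines. This is our own illustration, not taken from the book: the target is a Beta(3, 2) posterior whose mean is known exactly, and the uniform proposal and the sample size are arbitrary choices.

```python
import math
import random

# Self-normalised importance sampling for a posterior mean, using a
# uniform proposal on (0, 1). (Illustrative sketch only.)

def importance_estimate(log_post, n, rng):
    draws = [rng.random() for _ in range(n)]           # uniform proposal
    weights = [math.exp(log_post(t)) for t in draws]   # w = posterior / proposal
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, draws)) / total

# Unnormalised log posterior for 2 successes and 1 failure under a uniform
# prior, i.e. a Beta(3, 2) density; its exact mean is 3 / 5 = 0.6.
log_post = lambda t: 2.0 * math.log(t) + math.log(1.0 - t)
est = importance_estimate(log_post, 50000, random.Random(1))
print(est)  # close to 0.6
```

Note that the normalising constant of the posterior is never needed, which is precisely what makes weighting schemes of this kind attractive in Bayesian computation.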
a clear distinction is drawn among three rather different perspectives on the comparison of and choice from among a range of competing models.6 Remodelling In Chapter 6. An alternative. signiticancc testing and crossvalidation. summaries of the main univariate and multivariate probability distributions that appear in the . it is argued that. throughout. 1. Our discussion relates and links these ideas with aspects of hypothesis testing. such as prediction. as are the mathematical approximations available under the assumption of the validity of largesample asymptotic analysis.7 Basic Formulae In Appendix A.Various standard forms of statistical problems. some involving model choice followed by a terminal action. A feature of our treatment of this topic is that. A particular feature of this volume is the extended account of socalled reference analysis.
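The computational methods named above can be given a minimal illustrative sketch (not part of the original text's formal development). Assuming, purely for illustration, a posterior density known only up to a normalising constant, importance sampling estimates a posterior expectation as a weighted average of draws from a proposal distribution:

```python
import math
import random

random.seed(0)

# Unnormalised posterior density: a standard normal known only up to a constant.
def unnorm_post(theta):
    return math.exp(-0.5 * theta * theta)

# Proposal: uniform on [-10, 10]; its density is 1/20 on that interval.
PROPOSAL_DENSITY = 1.0 / 20.0

draws = [random.uniform(-10.0, 10.0) for _ in range(200_000)]
weights = [unnorm_post(t) / PROPOSAL_DENSITY for t in draws]

# Self-normalised importance-sampling estimate of E[theta^2] under the posterior,
# which should be close to 1, the variance of a standard normal.
estimate = sum(w * t * t for w, t in zip(weights, draws)) / sum(weights)
print(round(estimate, 2))
```

The self-normalised form is what makes the unknown normalising constant irrelevant: it cancels in the ratio of weighted sums, which is precisely why such methods suit Bayesian computation.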
text, together with summaries of the prior/posterior/predictive forms corresponding to these distributions in the context of conjugate and reference analyses.

1.4.8 Non-Bayesian Theories

In Appendix B, we review what we perceive to be the main alternatives to the Bayesian approach, namely, classical decision theory, frequentist procedures, likelihood theory, and fiducial and related theories. We compare and contrast these alternatives in the context of "stylised" inference problems such as point and interval estimation, hypothesis and significance testing. Through counterexamples and general discussion, we indicate why we find all these alternatives seriously deficient as formal inference theories.

1.5 A BAYESIAN READING LIST

As we have already remarked, this work is necessarily a selective account of Bayesian theory, reflecting our own interests and perspectives. The following is a list, by no means exhaustive, of other Bayesian books whose contents would provide a significant complement to the material in this volume. In those cases where there are several editions, or when the original is not in English, we quote both the original date and the date of the most recent English edition. Thus, Jeffreys (1939/1961) refers to Jeffreys' Theory of Probability, first published in 1939, and to its most recent (3rd) edition, published in 1961; similarly, de Finetti (1970/1974) refers to the original (1970) Italian version of de Finetti's Teoria delle Probabilità, vol. 1, and to its English translation (published in 1974).

Pioneering Bayesian books include Laplace (1812), Keynes (1921/1929), Jeffreys (1939/1961), Good (1950, 1965), Savage (1954/1972, 1962), Schlaifer (1959, 1961), Raiffa and Schlaifer (1961), Mosteller and Wallace (1964/1984), Dubins and Savage (1965/1976), Lindley (1965, 1972) and Box and Tiao (1973). Elementary and intermediate Bayesian textbooks include those of Savage (1968), Schmitt (1969), Tribus (1969), DeGroot (1970), Lavalle (1970), Lindley (1971/1985), Winkler (1972), de Finetti (1970/1974, 1970/1975, 1972), Kleiter (1980), Bernardo (1981b), Daboni and Wedlin (1982), Regazzini (1983), Iversen (1984), Cifarelli and Muliere (1989), Lee (1989), Press (1989), Savchuk (1989), Scozzafava (1989), Wichmann (1990), Florens et al. (1990), Borovcnik (1992) and Berry (1996). More advanced Bayesian monographs include Hartigan (1983), Berger (1985a), O'Hagan (1988a), Robert (1992) and O'Hagan (1994a). Polson and Tiao (1995) is a two-volume collection of classic papers in Bayesian inference.

Special topics have also been examined from a Bayesian point of view; these include Actuarial Science (Klugman, 1992) and Biostatistics (Girelli-Bruni), among the further areas listed below.
Other special topics treated from a Bayesian standpoint include Control Theory (Aoki, 1967; Sawaragi et al., 1967), Decision Analysis (Luce and Raiffa, 1957; Chernoff and Moses, 1959; Grayson, 1960; Hadley, 1967; Raiffa, 1968; Schlaifer, 1969; Halter and Dean, 1971; Lindgren, 1971; Keeney and Raiffa, 1976; Lavalle, 1978; French, 1986; Smith, 1988a), Dynamic Forecasting (Spall, 1988; West and Harrison, 1989; Pole et al., 1994), Economics and Econometrics (Morales, 1971; Zellner, 1971; Richard, 1973; Bauwens, 1984; Boyer and Kihlstrom, 1984; Cyert and DeGroot, 1987), Educational and Psychological Research (Novick and Jackson, 1974; Pollard, 1986), Foundations (Fishburn, 1964; Fellner, 1965; Kyburg and Smokler, 1964/1980; Seidenfeld, 1979; Verbraak, 1990), History (Dale, 1991), Information Theory (Yaglom and Yaglom, 1960/1983; Osteyee and Good, 1974), Law and Forensic Science (DeGroot et al., 1986; Aitken and Stoney, 1991), Linear Models (Lempers, 1971; Leamer, 1978; Broemeling, 1985; Pilz, 1991), Logic and Philosophy of Science (Jeffrey, 1965/1983; Rosenkranz, 1977; Howson and Urbach, 1989), Maximum Entropy (Levine and Tribus, 1978; Jaynes, 1983; Smith and Grandy, 1985; Justice, 1986; Smith and Erickson, 1987; Erickson and Smith, 1988a, 1988b; Skilling, 1989; Fougere, 1990; Grandy and Schick, 1991; Kapur and Kesavan, 1992; Mohammad-Djafari and Demoment, 1993), Multivariate Analysis (Press, 1972/1982), Optimisation (Mockus, 1989), Pattern Recognition (Simon, 1984), Prediction (Aitchison and Dunsmore, 1975; Geisser, 1993), Probability Assessment (Stael von Holstein, 1970; Stael von Holstein and Matheson, 1972; Cooke, 1991), Reliability (Martz and Waller, 1982; Clarotti and Lindley, 1988), Sample Surveys (Rubin, 1987; Ghosh and Pathak, 1992), Social Science (Phillips, 1973) and Spectral Analysis (Bretthorst, 1988).

General discussions of Bayesian Statistics may be found in review papers and encyclopedia articles; among these, we note particularly Lindley (1953, 1970), Anscombe (1961), Savage (1961), Edwards and Tversky (1967), Roberts (1965, 1978), Morris (1974), Box (1985), Zellner (1985) and Marinell and Seeber (1988).

A number of collected works also include a wealth of Bayesian material. Among these, we note the collected papers of Savage (1981), Jaynes (1983), Good (1983), Box (1985) and de Finetti (1993), as well as the many edited volumes, including Meyer and Collier (1970), Godambe and Sprott (1971), Fienberg and Zellner (1974), White and Bowen (1975), Aykaç and Brumat (1977), Parenti (1978), Zellner (1980), Kadane (1984), Goel and Zellner (1986), Viertl (1987), Gardenfors and Sahlin (1988), Gupta and Berger (1988), Geisser et al. (1990), Hinkelmann (1990), Oliver and Smith (1990), Goel and Iyengar (1992), Fearn and O'Hagan (1993), Gatsonis et al. (1993), Freeman and Smith (1994), Berry and Stangl (1996) and, last but not least, the Proceedings of the Valencia International Meetings on Bayesian Statistics (Bernardo et al., 1980, 1985, 1988, 1992, 1996 and 1999).

For discussion of important specific topics, see Edwards et al. (1963), Bartholomew (1963, 1988, 1994), Luce and Suppes (1965), Birnbaum (1968, 1972), de Finetti (1968), Lusted (1968), Cornfield (1969), Good (1976, 1982, 1985, 1986, 1987, 1988a, 1988b), Press (1980a, 1985a), Fishburn (1981, 1986), Dickey (1982), DeGroot (1982), Geisser (1982, 1985, 1986a), Joshi (1983), Dawid (1983a, 1983b, 1986a, 1986b), Smith (1984, 1991), LaMotte (1985), Zellner (1985, 1986, 1988c, 1992), Genest and Zidek (1986), Goldstein (1986c), Pack (1986a, 1986b), Cifarelli (1987), Hodges (1987), Ericson (1988), Bernardo (1989), Trader (1989), Breslow (1990), Ghosh (1991), Lindley (1991), Barlow and Irony (1992), Ferguson et al. (1992), Arnold (1993), Kadane (1993), Berger (1993), Berger (1994) and Hill (1994).
Chapter 2

Foundations

Summary

The concept of rationality is explored in the context of representing beliefs or choosing actions in situations of uncertainty. An axiomatic basis, with intuitive operational appeal, is introduced for the foundations of decision theory. The dual concepts of probability and utility are formally defined and analysed within this context. The criterion of maximising expected utility is shown to be the only decision criterion which is compatible with the axiom system. Statistical inference is viewed as a particular decision problem which may be analysed within the framework of decision theory. The logarithmic score is established as the natural utility function to describe the preferences of an individual faced with a pure inference problem. Within this framework, the concept of discrepancy between probability distributions and the quantification of the amount of information in new data are naturally defined in terms of expected loss and expected increase in utility, respectively. The analysis of sequential decision problems is shown to reduce to successive applications of the methodology introduced.

2.1 BELIEFS AND ACTIONS

We spend a considerable proportion of our lives, both private and professional, in a state of uncertainty. This uncertainty may relate to past situations, where direct
i. For example. Many feelings of uncertainty are rather insubstantial and we neither seek to analyse them.. complete information. So fw as this work is concerned. This might be because we face the direct practical problem of choosing from among a set of possible actions. as being a form of action t o which certain criteria of “rationality” might be directly applied. where each involves a range of uncertain consequences and we are concerned to avoid making an “illogical” choice. there is a sense in which all states of uncertainty may be described in the same way: namely. in principle. interesting decision problems are those for which such perfect information is not available. It might be argued that there are complex situations where we do have complete information and yet still find it difficult to take the best decision. we all regularly encounter uncertain situations in which we at least aspire to behave “rationally” in some sense. In such cases. or that the possible outcomes have no (or only negligible) consequences so far as we are concerned. we might be called upon to sumrnarise our beliefs about the uncertain aspects of the situation.e. not conceptual. it is typically not easy to decide what is the optimal strategy to rebuild a Rubik cube or which is the cheapest diet fulfilling specified nutritional requirements. Here. Our basic concern in this chapter is with exploring the concept of “rationality” in the context of representing beliefs or choosing actions in situations of uncertainty. In other words. Alternatively. To choose the best among a set of actions would. In the first case. be shared by other individuals). We take the view that such problems are purely technical. the choice of a particular mode of representing and communicating our beliefs. or the potential effects are trivial in comparison with the effort involved in carrying out a conscious analysis. and we must take i4ncertoint. 
when we feel that we have no (or only negligible) capacity to influence matters. be immediate if we had perfect information about the consequences to which they would lead.14 2 Foundatioris knowledge or evidence is not available.v into account as a major feature of the problem. the difficulty is trchriicul. nor to order our thoughts and opinions in any kind of responsible way. they result from the large number of possible strategics: . we are concerned that our summary be in a form which will enable a “rational” choice to be made at some future time. And yet it is obvious that we do not attempt to treat all our individual uncertainties with thc same degree of interest or seriousness. an individual feeling of incomplete knowledge in relation to a specified situation (a feeling which may. On the other hand. bearing in mind that others may subsequently use this summary as the basis for choosing an action. Whatever the circumstances. or has been lost or forgotten: or to present and future developments which are not yet completed. More specifically. of course. cven though we have. In this case. however. This typically happens when we feel no actual or practical involvement with the situation in question. we are not motivated to think carefully about our uncertainty either because nothing depends on it. in principle. we might regard the summary itself.
we make precise the notion of “rational” preferences in the form of axioms. . alternatively. In Section 2. we should emphasise that we do not interpret “actions in situations of uncertainty” in a narrow. a statement of beliefs might be regarded as an end in itself. with the decision criterion to be adopted f when we do not have complete information and are thus faced with. at least some. and we shall assume that in the presence of complete information we can. they reduce to the mathematical problem of finding a minimum under certain constraints. . . in which case the choice of the form of statement to be made constitutes an action. instead. actually or potentially. directly “economic” sense. in that they cannot lead us into forms of behavioural inconsistency which we specifically wish to avoid. In Section 2. the emphasis is on the inference rather than the decision aspect of problem. always choose the best alternative. in principle. but would lead to action in suitable circumstances. But in neither case is there any doubt about the decision criterion to be used. It is assumed in our approach to such problems that the notion of “rational belief“ cannot be considered separately from the notion of “rational action”.1 Beliefs and Actions 15 in the second.2. does actually lead to action.just as a lump of arsenic is called poisonous not because it actually has killed or will kill anyone. Either a statement of beliefs in the light of available information is. although formally it can still be considered a decision problem if the inferential statement itself is interpreted as the decision to be taken (Lehmann. In this work we shall not consider these kinds of combinatorial or mathematical programming problems. it is a question of providing a convenient summary of the data . is with the logical process o decision making in sitf uarions o uncertainty. Our concern. To avoid any possible confusion. . 
or trying to facilitate the task of others seeking to decide upon their beliefs in the light of the experimental results. For example.3. I959/ 1986). 1926). In other words.2. . it is not asserted that a belief. . an input into the process of choosing some practical course of action. In such cases. Frequently. . elements of uncertainty. We can therefore explore the notion of “rationality”for both beliefs and actions by concentratingon the latter and asking ourselveswhat kinds of rules should govern preference patterns among sets of alternative actions in order that choices made in accordance with such rules commend themselves to us as “rational”. or. but because it would kill anyone if he ate it (Ramsey. we describe the general structure of problems involving choices under uncertainty and introduce the idea of preferences between options. . within our purview we include the situation of an individual scientist summarising his or her own current beliefs following the results of an experiment.
In Sections 2.4 and 2.5, we prove that, in order to conform with the principles of quantitative coherence, degrees of belief about uncertain events should be described in terms of a (finitely additive) probability measure, relative values of individual possible consequences should be described in terms of a utility function, and the rational choice of an action is to select one which has the maximum expected utility. We describe these as principles of quantitative coherence because they specify the ways in which preferences need to be made quantitatively precise and fit together, or cohere, if "illogical" forms of behaviour are to be avoided. In Section 2.6, we discuss sequential decision problems and show that their analysis reduces to successive applications of the maximum expected utility methodology; in particular, we identify the design of experiments as a particular case of a sequential decision problem. In Section 2.7, we make precise the sense in which choosing a form of a statement of beliefs can be viewed as a special case of a decision problem. This identification of inference as decision provides the fundamental justification for beginning our development of Bayesian Statistics with the discussion of decision theory. Finally, a general review of ideas and references is given in Section 2.8.

2.2 DECISION PROBLEMS

2.2.1 Basic Elements

We shall describe any situation in which choices are to be made among alternative courses of action with uncertain consequences as a decision problem, whose structure is determined by three basic elements:

(i) a set {a_i, i ∈ I} of available actions, one of which is to be selected;
(ii) for each action a_i, a set {E_j, j ∈ J} of uncertain events, describing the uncertain outcomes of taking action a_i;
(iii) corresponding to each set {E_j, j ∈ J}, a set of consequences {c_j, j ∈ J}.

The idea is as follows. Suppose we choose action a_i; then one and only one of the uncertain events E_j occurs and leads to the corresponding consequence c_j. Each set of events {E_j, j ∈ J} forms a partition (an exclusive and exhaustive decomposition) of the total set of possibilities. Naturally, both the set of consequences and the partition which labels them may depend on the particular action considered, so that a more precise notation would be {E_ij, j ∈ J_i} and {c_ij, j ∈ J_i} for each action a_i. However, to simplify notation, we shall omit this dependence, while remarking that it should always be borne in mind; we shall come back to this point in Section 2.8. In practical problems, the labelling sets I and J are typically finite. The decision problem can be represented schematically by means of a decision tree, as shown in Figure 2.1.
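The three basic elements translate directly into a small computational sketch (purely illustrative, and not part of the formal development here): each action maps its event partition to consequences. The probabilities and utilities below are assumptions introduced only for the example, since neither concept is formally defined until later in the chapter; given them, the criterion of maximising expected utility anticipated in the Summary picks the action with the largest probability-weighted value.

```python
# Hypothetical decision problem: each action a_i maps its event partition
# {E_j} to consequences {c_j}.  All names and numbers are illustrative.
problem = {
    "carry_umbrella": {"rain": "dry_but_encumbered", "no_rain": "encumbered"},
    "leave_umbrella": {"rain": "soaked", "no_rain": "unencumbered"},
}
prob = {"rain": 0.3, "no_rain": 0.7}   # degrees of belief P(E_j) over the partition
utility = {                            # relative values u(c_j) of the consequences
    "dry_but_encumbered": 0.8, "encumbered": 0.6,
    "soaked": 0.0, "unencumbered": 1.0,
}

def expected_utility(action):
    # Sum of P(E_j) * u(c_j) over the partition attached to the action.
    return sum(prob[e] * utility[c] for e, c in problem[action].items())

best = max(problem, key=expected_utility)
print(best)
```

Note that only the comparison of the two expected utilities matters for the choice; rescaling all utilities by a common positive affine transformation would leave the selected action unchanged.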
Figure 2.1 Decision tree

The square represents a decision node, where the choice of an action is required; the circle represents an uncertainty node, where the outcome is beyond our control. Following the choice of an action and the occurrence of a particular event, the branch leads us to the corresponding consequence. It is clear, either from our general discussion or from the decision tree representation, that we can formally identify any a_i with the combination of {E_j, j ∈ J} and {c_j, j ∈ J} to which it leads; i.e., to choose a_i is to opt for the uncertain scenario labelled by the pairs (E_j, c_j), j ∈ J. We shall write a_i = {c_j | E_j, j ∈ J} to denote this identification, where the notation c_j | E_j signifies that event E_j leads to consequence c_j.

Preferences about the uncertain scenarios resulting from the choices of actions depend on attitudes to the consequences involved and on assessments of the uncertainties attached to the corresponding events. The latter are clearly subject to change as new information is acquired, and this may well change overall preferences among the various courses of action. It is therefore of considerable importance to bear in mind that a representation such as Figure 2.1 only captures the structure of a decision problem as perceived at a particular point in time. An individual's perception of the state of uncertainty resulting from the choice of any particular a_i is very much dependent on the information currently available. Further information, of a kind which leads to a restriction on what can be regarded as the total set of possibilities, will change the perception of the uncertainties, in that some of the E_j's may become very implausible (or even logically impossible) in the light of the new information, whereas others may become more plausible. In other words, {E_j, j ∈ J} forms a partition of the total set of relevant possibilities as the individual decision-maker now perceives them to be. Of course, most practical problems involve sequential considerations; but, as shown in Section 2.6, these reduce, essentially, to repeated analyses based on the above structure.

The notion of preference is, of course, very familiar in the everyday context of actual or potential choice. Indeed, an individual decision-maker often prefaces
an actual choice (from a menu, an investment portfolio, a range of possible forms of medical treatment, a textbook of statistical methods, etc.) with the phrase "I prefer ..." (caviar, equities, surgery, Bayesian procedures, etc.). To prefer action a1 to action a2 means that if these were the only two options available, a1 would be chosen (conditional, of course, on the information available at the time). In everyday terms, the idea of indifference between two courses of action also has a clear operational meaning: it signifies a willingness to accept an externally determined choice (for example, letting a disinterested third party choose, or tossing a coin).

In order to study decision problems in a precise way, we shall need to reformulate these concepts in a more formal framework. The development which follows, here and in Chapter 3, is largely based on Bernardo, Ferrandiz and Smith (1985).

2.2.2 Formal Representation

When considering a particular, concrete decision problem, we do not usually confine our thoughts to only those outcomes and options explicitly required for the specification of that problem. Typically, we expand our horizons to encompass analogous problems, which we hope will aid us in ordering our thoughts by providing suggestive points of reference or comparison. The collection of uncertain scenarios defined by the original, concrete problem is therefore implicitly embedded in a somewhat wider framework of actual and hypothetical scenarios. We begin by describing this wider frame of discourse within which the comparisons of scenarios are to be carried out.

In addition to representing the structure of a decision problem using the three elements discussed above, we must also be able to represent the idea of preference as applied to the comparison of some or all of the pairs of available options. We shall therefore need to consider a fourth basic element of a decision problem:

(iv) the relation ≤, which expresses the individual decision-maker's preferences between pairs of available actions, so that a1 ≤ a2 signifies that a1 is not preferred to a2.

It is to be understood that the initial specification of any such particular frame of discourse, together with the preferences among options within it, are dependent on the decision-maker's overall state of information at that time. Throughout, we shall denote this initial state of mind by M0.

These four basic elements have been introduced in a rather informal manner; we now give a formal definition of a decision problem. This will be presented in a rather compact form; detailed elaboration is provided in the remarks following the definition.

Definition 2.1. (Decision problem). A decision problem is defined by the elements (E, C, A, ≤), where:
(i) E is an algebra of relevant events;
(ii) C is a set of possible consequences;
(iii) A is a set of options, consisting of functions which map finite partitions of Ω, with events E_j ∈ E, to compatibly-dimensioned, ordered sets of elements of C;
(iv) ≤ is a preference order, taking the form of a binary relation between some of the elements of A.

We now discuss each of these elements in detail.

Within the wider frame of discourse, an individual decision-maker will wish to consider the uncertain events judged to be relevant in the light of the initial state of information M0. The algebra E will consist of what we might call the real-world events (that is, those occurring in the structure of any concrete, actual decision problem that we may wish to consider), together with any other hypothetical events which it may be convenient to bring to mind as an aid to thought. Technically, we are assuming that the class of relevant events has the structure of an algebra. It is natural to require E to be closed under complementation, so that E^c ∈ E. Similarly, if E1 ∈ E and E2 ∈ E are judged to be relevant events, then it may also be of interest to know about their joint occurrence, or whether at least one of them occurs; this means that E1 ∩ E2 and E1 ∪ E2 should also be assumed to belong to E. Repetition of this argument suggests that E should be closed under the operations of arbitrary finite intersections and unions. In particular, these requirements ensure that the certain event Ω and the impossible event ∅ both belong to E. The class E will simply be referred to as the algebra of (relevant) events. It can certainly be argued that this is too rigid an assumption; we shall provide further discussion of this and related issues in Section 2.8.

We denote by C the set of all consequences that the decision-maker wishes to take into account; preferences among such consequences will later be assumed to be independent of the state of information concerning relevant events. The class C will simply be referred to as the set of (possible) consequences.

In our introductory discussion we used the term action to refer to each potential act available as a choice at a decision node. Within the wider frame of discourse, we prefer the term option, since the general, formal framework may include hypothetical scenarios (possibly rather far removed from potential concrete actions). So far as the definition of an option as a function is concerned, we note that this is a rather natural way to view options from a mathematical point of view: an option consists precisely of the linking of a partition {E_j, j ∈ J} of Ω with a corresponding set of consequences {c_j, j ∈ J}, with the interpretation that event E_j leads to consequence c_j. To represent such a mapping we shall adopt the notation {c_j | E_j, j ∈ J}. It follows immediately from the definition of an option that the ordering of the labels within J is irrelevant; thus, for example, the options {c1 | E, c2 | E^c} and {c2 | E^c, c1 | E} are identical. In defining options, the assumption of a finite partition into events of E seems to us to correspond most closely to the structure of practical problems; however, an extension to admit the possibility of infinite partitions has certain mathematical advantages and will be fully discussed, together with other mathematical extensions, in Chapter 3. The class A of options, or potential actions, will simply be referred to as the action space.

We can identify individual consequences as special cases of options by writing c = {c | Ω}. Without introducing further notation, we shall simply regard c as denoting either an element of C, or the element {c | Ω} of A; there will be no danger of any confusion arising from this identification, and which form is used in any particular context is purely a matter of convenience. Sometimes, the interpretation of an option with a rather cumbersome description is clarified by an appropriate reformulation. For example, we shall use the composite function notation: if a1 = {c1 | E, c2 | E^c}, then the option a = {c1 | E ∩ G, c2 | E^c ∩ G, c3 | G^c} may be more compactly written as a = {a1 | G, c3 | G^c}.

In introducing the preference binary relation ≤, we are not assuming that all pairs of options (a1, a2) ∈ A × A can necessarily be related by ≤, in the sense that either a1 ≤ a2 or a2 ≤ a1 (or both). If the relation can be applied, a1 ≤ a2 is to be understood as meaning that a1 is not preferred to a2. From ≤, we can derive a number of other useful binary relations.

Definition 2.2. (Induced binary relations).
(i) a1 ~ a2 if and only if a1 ≤ a2 and a2 ≤ a1;
(ii) a1 < a2 if and only if a1 ≤ a2 and it is not true that a2 ≤ a1;
(iii) a1 ≥ a2 if and only if a2 ≤ a1;
(iv) a1 > a2 if and only if a2 ≤ a1 and it is not true that a1 ≤ a2.

Definition 2.2 is to be understood as referring to any options a1, a2 in A; to simplify the presentation, we shall omit such universal quantifiers when there is no danger of confusion. The induced binary relations are to be interpreted to mean that a1 is equivalent to a2 if and only if a1 ~ a2, and that a1 is strictly preferred to a2 if and only if a1 > a2. Together with the interpretation of ≤, these suffice to describe all cases where pairs of options can be compared.
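The view of an option as a function from a partition to consequences, and the composite notation a = {a1 | G, a2 | Gc}, can be sketched directly in code (an illustrative aside, with hypothetical event labels standing in for the abstract E_j's):

```python
# An option maps each event of a finite partition to a consequence: {c_j | E_j}.
# Event and consequence names below are hypothetical placeholders.
a1 = {"E": "c1", "Ec": "c2"}   # option {c1 | E, c2 | Ec}
a2 = {"F": "c3", "Fc": "c4"}   # option {c3 | F, c4 | Fc}

def composite(opt_if_G, opt_if_not_G):
    """Build {a1 | G, a2 | Gc}: refine the partition by intersecting with G and Gc."""
    option = {}
    for event, consequence in opt_if_G.items():
        option[f"G & {event}"] = consequence      # events compatible with G
    for event, consequence in opt_if_not_G.items():
        option[f"Gc & {event}"] = consequence     # events compatible with Gc
    return option

a = composite(a1, a2)
print(a)
# {'G & E': 'c1', 'G & Ec': 'c2', 'Gc & F': 'c3', 'Gc & Fc': 'c4'}
```

The refined labels form a partition whenever {E, Ec} and {F, Fc} do, which is why the composite object is itself a legitimate option in the sense of Definition 2.1.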
This binary relation will capture the intuitive notion of one event being "more likely" than another. The basic preference relation between options. It is worth stressing once again at this point that ull the order relations over A x A. Since. Thus. We shall proceed similarly with the binary relations N and < introduced in Definition 2.2 Decision Problems 21 write c1 5 02 if and only if { c . Moreover. once again. each individual i s free to express his or her own personal preferences. Continuing the (convenient and harmless) abuse of notation.2. as one would expect. we then suy that E is not more likely than F .  (01 it is always true. are to be understood as personal. that 0 < $2. Strictly speaking. conditional on the initial state of information Mo. 10) 5 {Q I S2) and say that consequence c1 is not preferred to consequence c2. in the sense that. and E > F if and only if E is strictly more likely than F. 5 . for all < c?. we shall further economise on notation and also use the symbol 5 to denote this new uncertainty binary relation between events. Definition 2. given the stute of information described by Moo. To avoid triviality. for a given individual. the collection of all pairs of relewnt events. since 5 is defined over A x A. . In fact.2 to describe uncertainty relations between events. If we compare two dichotomised options. can also be used to define a binary relation on E x &.3. (Uncertainty relation). the force of this argument applies independently of the choice of the particular consequences el and c2. The intuitive content of the definition is clear. involving the same pair of consequences and differing only in terms of their uncertain events. a statement such as E > F is to be interpreted as "this individuul. Clearly. we will prefer the option under which we feel it is "more likely" that the preferred consequence will obtain. Since. given an agreed structure for a decision problem. Thus.2.considers event E to be more likely than event F ". 
provided that our preferences between the latter are assumed independent of any considerations regarding the events E and F. this parsimonious abuse of notation creates no danger of confusion and we shall routinely adopt such usage in order to avoid a proliferation of symbols. in the light of his or her initial state of information Mo. we shall also use the derived binary relations given in Definition 2. we should introduce a new symbol to replace 5 when referring to a preference relation over C x C. E F if and only if E and Fare equally likely. we shall later formally assume that there exist at least two consequences CI and cz such that c1 < ep. there is no danger of confusion. and hence over C x C and & x &.
3 provides such a statement with an operational nieaning since for all 2 E > F is equivalent to an agreement to choose option { ~2 I E. ('1 I F"}. Thus. preference to option (cz I F. Given the assumed Occurrence of a possible event CJ.. The initial state of information. If we do not prefer u ] to 02. .3.. then this preference obviously carries over to any pair of options leading. we have stressed that preferences. all conditional relations reduce to their unconditional counterparts. denoted by (. is a simple translation The definition ofthe conditional uncertainty relation of Definition 2. occurs. has been denoted by Jf. and the additional information provided by G. The conditional uncertainty relation <(. respectively. taking as an arbitrary "origin" the first occasion on which an individual thinks systematically about the problem.\A. the induced binary relations set out in Definition 2. preferences between options will be described by a new binary relation <(. not affected by additional information regarding the uncertain events in f. Conversely. c1 I E " } in . comparison of options which are identical if G" Occurs depends entirely on consideration of what happens if C. when we come. to ( 1 1 or (12 if C: occurs. The induced binary relation between consequences is obviously defined by However. Obviously. in Section 2. and <(. Naturally.22 cI < ~ 2 Foi4ndations Definition 2. c1 ir. initially defined among options but inducing binary relations among consequences and events.(''2 if and only if ('I <_ ~ 2 so that prcfcrences between pure consequences are . we shall need to take into accountfurther irlforniufiori.2 have their obvious counterparts. taking into account both the initial information . induced between events is of fundamental importance. to discuss the desirable properties of 5 and we shall make formal assumptions which imply that. The intuitive content of the definition is clear.3 to a conditional preference setting. given G. it is only when k! < C. 
and defined identically if G^c occurs. The relation between ≤ and ≤_G, with its derived forms ~_G and <_G, will be made precise below. Throughout this section, we have stressed that preferences, initially defined among options but inducing binary relations among consequences and events, are conditional on the current state of information. To complete our discussion of basic ideas and definitions, we need to consider one further important topic: the modification of preferences in the light of new information, obtained by considering the occurrence of real-world events. Subsequently, as one would expect, this relation provides the key to investigating the way in which uncertainties about events should be modified in the light of new information.
2.3 COHERENCE AND QUANTIFICATION

2.3.1 Events,
Options and Preferences

The formal representation of the decision-maker's "wider frame of discourse" includes an algebra of events E, a set of consequences C, and a set of options A, whose generic element has the form {c_j | E_j, j ∈ J}, where {E_j, j ∈ J} is a finite partition of the certain event Ω, with E_j ∈ E and c_j ∈ C, together with the binary relation ≤, representing the notion that one option is not preferred to another. The set A × A is also equipped with a collection of binary relations ≤_G, representing the preference relation on A × A conditional on the assumed occurrence of a possible event G. In addition, all preferences are assumed conditional on the initial state of information M0 alone. We now wish to make precise our assumptions about these elements of the formal representation of a decision problem. Bearing in mind the overall objective of developing a rational approach to choosing among options, our assumptions, presented in the form of a series of axioms, can be viewed as responses to the questions: "what rules should preference relations obey?" and "what events should be included in E?" Each formal axiom will be accompanied by a detailed discussion of the intuitive motivation underlying it.

It is important to recognise that the axioms we shall present are prescriptive, not descriptive: they do not purport to describe the ways in which individuals actually do behave in formulating problems or making choices; neither do they assert, on some presumed "ethical" basis, the ways in which individuals should behave. The axioms simply prescribe constraints which it seems to us imperative to acknowledge in those situations where an individual aspires to choose among alternatives in such a way as to avoid certain forms of behavioural inconsistency.

2.3.2 Coherent Preferences

We shall begin by assuming that problems represented within the formal framework are non-trivial, and that we are able to compare any pair of simple dichotomised options.

Axiom 1. (Comparability of consequences and dichotomised options).
(i) There exist consequences c1, c2 such that c1 < c2.
(ii) For all consequences c1, c2 such that c1 < c2, and events E, F, either {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c}, or {c2 | F, c1 | F^c} ≤ {c2 | E, c1 | E^c}.
Discussion of Axiom 1. Condition (i) is very natural. If all consequences were equivalent, there would not be a decision problem in any real sense, since all choices would certainly lead to precisely equivalent outcomes. Condition (ii) does not therefore assert that we should be able to compare any pair of conceivable options, however bizarre or fantastic. We are trying to capture the essence of what is required for an orderly and systematic approach to comparing alternatives of genuine interest. In most practical problems, there will typically be a high degree of similarity in the form of the consequences (e.g., all monetary), although it is easy to think of examples where this form is complex (e.g., combinations of monetary, health and industrial relations elements). Resource allocation among competing health care programmes involving different target populations and morbidity and mortality rates is one obvious such example. The difficulty of comparing options in such cases does not, of course, obviate the need for such comparisons if we are to aspire to responsible decision making. But there could be no possibility of an orderly and systematic approach if we were unwilling to express preferences among simple dichotomised options and hence (with E = F = Ω) among the consequences themselves. We are not, at this stage, making the direct assumption that all options, however complex, can be compared. We shall now state our assumptions about the ways in which preferences should fit together, or cohere, in terms of the order relation over A × A.

Axiom 2. (Transitivity of preferences).
(i) a ≤ a, for all a;
(ii) if a1 ≤ a2 and a2 ≤ a3, then a1 ≤ a3.

Discussion of Axiom 2. Condition (i) has obvious intuitive support. We note that, from Definition 2.2(i), the assertion of strict preference rules out equivalence between any pair of the options. Condition (ii) requires preferences to be transitive. The intuitive basis for such a requirement is perhaps best illustrated by considering the consequences of intransitive preferences. Suppose, for example, that we found ourselves expressing the preferences a1 < a2, a2 < a3 and a3 < a1 among three options a1, a2 and a3.
We have already noted that, in any given decision problem, C can be defined as simply the set of consequences required for that problem. It would make little sense to assert that an option was strictly preferred to itself, and it would also seem strangely perverse to claim to be unable to compare an option with itself! There are certainly many situations where we find the task of comparing simple options, and even consequences, very difficult. Condition (ii) of Axiom 1 is therefore to be interpreted in the following sense: "If we aspire to make a rational choice between alternative options, then we must at least be willing to express preferences between simple dichotomised options."
Our expressed preferences reveal that we perceive some actual difference in value (no matter how small) between the two options in each case. It is intuitively clear that if we assert strict preference there must be some amount of money (or grains of wheat, or beads, or whatever), however small, having a "value" less than the perceived difference in "value" between the two options. We should therefore be willing to pay this amount to switch from the less preferred to the more preferred option. At this stage, it would be inappropriate to become involved in a formal discussion of terms such as "value" and "price"; our discussion of this axiom is, of course, informal and appeals to directly intuitive considerations. If we consider, for example, the preference a1 < a2, we are implicitly stating that there exists a "price", say x, that we would be willing to pay in order to move from a position of having to accept option a1 to one where we have, instead, the prospect of option a2. Let us now examine the behavioural implications of these expressed preferences. Suppose that we are confronted with the prospect of having to accept option a1. By virtue of the expressed preference a1 < a2 and the above discussion, we are willing to pay x in order to exchange option a1 for option a2.
Let y and z denote the corresponding "prices" for switching from a2 to a3 and from a3 to a1, respectively. Having accepted option a2, we are, by virtue of the preference a2 < a3, willing to pay y in order to exchange a2 for a3. But now, since a3 < a1, we are willing to pay z in order to avoid a3 and have, instead, the prospect of option a1. We would thus have paid x + y + z in order to find ourselves in precisely the same position as we started from! What is more, repeating the argument once again, we could find ourselves arguing through this cycle over and over again. We regard this as inherently inconsistent behaviour, and recall that the purpose of the axioms is to impose rules of coherence on preference orderings that will exclude the possibility of such inconsistencies. Willingness to act on the basis of intransitive preferences is thus seen to be equivalent to a willingness to suffer unnecessarily the certain loss of something to which one attaches positive value. Axiom 2(ii) is therefore to be understood in the following sense: "If we aspire to avoid expressing preferences whose behavioural implications are such as to lead us to the certain loss of something we value, then we must ensure that our preferences fit together in a transitive manner." The following consequences of Axiom 2 are easily established and will prove useful in our subsequent development.

Proposition 2.1. (Transitivity of uncertainties).
(i) E ≤ E;
(ii) E1 ≤ E2 and E2 ≤ E3 imply E1 ≤ E3.

Proof. This is immediate from Definition 2.3 and Axiom 2.
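This money-pump reasoning can be made concrete in a small simulation of our own (the options and prices below are hypothetical, not from the text): with cyclic strict preferences a1 < a2, a2 < a3, a3 < a1 and positive switching "prices" x, y, z, every completed cycle of exchanges returns the decision-maker to the original option at a sure cost of x + y + z.

```python
# Illustrative sketch only: a decision-maker with the intransitive
# strict preferences a1 < a2, a2 < a3, a3 < a1 pays a positive "price"
# at each step to move to the preferred option; a full cycle of trades
# returns them to a1 at a sure loss of x + y + z per cycle.

def money_pump(prices, cycles=1):
    """Simulate repeated trading round the preference cycle.

    prices maps (held, offered) -> amount paid to switch options.
    Returns the option finally held and the total amount paid.
    """
    held, paid = "a1", 0.0
    cycle = [("a1", "a2"), ("a2", "a3"), ("a3", "a1")]
    for _ in range(cycles):
        for current, offered in cycle:
            paid += prices[(current, offered)]
            held = offered
    return held, paid

# Hypothetical positive prices x, y, z for the three exchanges.
prices = {("a1", "a2"): 1.0, ("a2", "a3"): 2.0, ("a3", "a1"): 0.5}
final_option, total_paid = money_pump(prices, cycles=3)
```

After any number of cycles the decision-maker holds a1 again, having paid cycles × (x + y + z) for the privilege; this is precisely the "certain loss of something we value" that the transitivity requirement is designed to exclude.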
Proposition 2.2. (Derived transitive properties).
(i) If a1 ~ a2 and a2 ~ a3, then a1 ~ a3; if E1 ~ E2 and E2 ~ E3, then E1 ~ E3.
(ii) If a1 < a2 and a2 ≤ a3, then a1 < a3; if E1 < E2 and E2 ≤ E3, then E1 < E3.

Proof. To prove (i), let a1 ~ a2 and a2 ~ a3, so that, by Definition 2.2, a1 ≤ a2, a2 ≤ a1, a2 ≤ a3 and a3 ≤ a2. Hence, by Axiom 2(ii), a1 ≤ a3 and a3 ≤ a1, and thus a1 ~ a3. To prove (ii), note that a1 < a2 and a2 ≤ a3 imply, by Axiom 2(ii), a1 ≤ a3; moreover, a3 ≤ a1 would imply a2 ≤ a1, thus contradicting a1 < a2. A similar argument applies to events, using Proposition 2.1 and Definition 2.3.

Axiom 3. (Consistency of preferences).
(i) For all G > ∅, c1 ≤ c2 if and only if c1 ≤_G c2.
(ii) If {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c} for some c1 < c2, then the same preference holds for all c1 < c2.
(iii) If {a1 | G, c | G^c} ≤ {a2 | G, c | G^c} for some c and G > ∅, then a1 ≤_G a2.
Discussion of Axiom 3. Condition (i) formalises the idea that preferences between pure consequences should not be affected by the acquisition of further information regarding the uncertain events in E. Condition (ii) asserts that if we have the preference {c2 | E, c1 | E^c} ≤ {c2 | F, c1 | F^c} for some c1 < c2, then we should have this preference for any c1 < c2: this formalises the intuitive idea that the stated preference should depend only on the "relative likelihood" of E and F, and should not depend on the particular consequences used in constructing the options. Condition (iii) asserts that, given the assumed occurrence of a possible event G, comparison of options which are identical if G^c occurs depends entirely on consideration of what happens if G occurs: if {a1 | G, c | G^c} ≤ {a2 | G, c | G^c}, then a1 should not be preferred to a2, given G. This latter condition is a version of what might be called the sure-thing principle: if two situations are such that, whatever the outcome of the first, there is a preferable corresponding outcome of the second, then the second situation is preferable overall. Conditions (ii) and (iii) ensure that Definitions 2.3 and 2.4 have operational content.

An important implication of Axiom 3 is that preferences between consequences are invariant under changes in the information "origin" regarding events in E.

Proposition 2.3. (Invariance of preferences between consequences). c1 ≤ c2 if and only if there exists G > ∅ such that c1 ≤_G c2.

Proof. If c1 ≤ c2 then, by Axiom 3(i), c1 ≤_G c2 for all G > ∅; conversely, if c1 ≤_G c2 for some G > ∅ then, again by Axiom 3(i), c1 ≤ c2.

Proposition 2.4. c1 ≤_G c2 if and only if, for any option a, {c1 | G, a | G^c} ≤ {c2 | G, a | G^c}.

Proof. This follows from Axiom 3(iii) and the definition of the conditional preference relation (Definition 2.4), taking, in particular, a = c1.
Another important consequence of Axiom 3 is that uncertainty orderings of events respect logical implications, in the sense that if E logically implies F, i.e., if E ⊆ F, then F cannot be considered less likely than E.

Proposition 2.5. (Monotonicity). If E ⊆ F, then E ≤ F.

Proof. For any c1 < c2, since E ⊆ F we have {c2 | E, c1 | E^c} = {c2 | E, c1 | F − E, c1 | F^c} and {c2 | F, c1 | F^c} = {c2 | E, c2 | F − E, c1 | F^c}. These options are defined identically except on F − E = F ∩ E^c, where the second yields c2 rather than c1; hence, by Axiom 3, the first is not preferred to the second, and so, by Definition 2.3, E ≤ F.

Definition 2.5. (Significant events). An event E is significant given G > ∅ if c1 <_G c2 implies that c1 <_G {c2 | E, c1 | E^c} <_G c2. If G = Ω, we shall simply say that E is significant.

Intuitively, significant events given G are those operationally perceived by the decision-maker as "practically possible but not certain", given the information provided by G. Thus, if E is judged to be significant given G and c1 <_G c2, one would strictly prefer the option {c2 | E, c1 | E^c} to c1 for sure, since it provides an additional perceived possibility of obtaining the more desirable consequence c2; similarly, one would strictly prefer c2 for sure to the stated option.

Proposition 2.6. (Characterisation of significant events). An event E is significant given G > ∅ if and only if ∅ < E ∩ G < G; E is significant if and only if ∅ < E < Ω.
We shall mostly work, in what follows, with "significant" events.
Proof. If E is significant given G then, given c1 <_G c2, we have c1 <_G {c2 | E, c1 | E^c} <_G c2, which requires both ∅ < E ∩ G and E ∩ G < G. Conversely, if ∅ < E ∩ G < G then, by Axiom 3(iii) and Definition 2.5, c1 <_G {c2 | E, c1 | E^c} <_G c2 whenever c1 <_G c2. In particular, if G = Ω, then E is significant if and only if ∅ < E < Ω.

The operational essence of "learning from experience" is that a decision-maker's preferences may change in passing from one state of information to a new state, brought about by the acquisition of further information regarding the occurrence of events in E, which leads to changes in assessments of uncertainty. There are, however, too many complex ways in which such changes in assessments can take place for us to be able to capture the idea in a simple form. The very special case in which preferences do not change is, on the other hand, easy to describe in terms of the concepts thus far available to us.
Definition 2.6. (Pairwise independence of events). We say that E and F are (pairwise) independent, denoted by E ⊥ F, if and only if, for all consequences c, c1, c2,
(i) c □ {c2 | E, c1 | E^c} if and only if c □_F {c2 | E, c1 | E^c},
(ii) c □ {c2 | F, c1 | F^c} if and only if c □_E {c2 | F, c1 | F^c},
where □ is any one of the relations <, ~ or >.

We interpret E ⊥ F as "E is independent of F". The definition is given for the simple situation of preferences between pure consequences and dichotomised options. Since, by Proposition 2.3, preferences regarding pure consequences are unaffected by additional information, the condition stated captures, in an operational form, the notion that uncertainty judgements about E are unaffected by the additional information F. An alternative characterisation will be given in Proposition 2.13.

2.3.3 Quantification

The notion of preference between options, formalised by the binary relation ≤, provides a qualitative basis for comparing options and, by extension, for comparing consequences and events. The coherence axioms (Axioms 1 to 3) then provide a minimal set of rules to ensure that qualitative comparisons based on ≤ cannot have intuitively undesirable implications. We shall now argue that this purely qualitative framework is inadequate for serious, systematic comparisons of options. An illuminating analogy can be drawn between ≤ and a number of qualitative relations in common use, both in an everyday setting and in the physical sciences: consider, for example,
the relations "not heavier than", "not longer than" and "not hotter than". It is abundantly clear that these relations cannot suffice, as they stand, as an adequate basis for the physical sciences. Precision, through quantification, is achieved by introducing some form of numerical standard into a context already equipped with a coherent qualitative ordering relation. We need to introduce in each case some form of quantification by setting up a standard unit of measurement, such as the kilogram, the metre, or the centigrade interval, together with an (implicitly) continuous scale, such as arbitrary decimal fractions of a kilogram, a metre, or a centigrade interval. This enables us to assign a numerical value, representing weight, length or temperature, to any given physical or chemical entity. This is achieved by carrying out, implicitly or explicitly, a series of qualitative pairwise comparisons of the feature of interest with appropriately chosen points on the standard scale. For example, in quantifying the length of a stick, we place one end against the origin of a metre scale and then use a series of qualitative comparisons based on "not longer than" (and derived relations, such as "strictly longer than"). If the stick is "not longer than" the scale mark of 2.5 metres, but is "strictly longer than" the scale mark of 2.4 metres, we might lazily report that the stick is "2.45 metres long". If we needed to, we could continue to make qualitative comparisons of this kind with finer subdivisions of the scale, thus extending the number of decimal places in our answer. The example is, of course, a trivial one, but the general point is extremely important. We shall regard it as essential to be able to aspire to some kind of quantitative precision in the context of comparing options. As a first step towards this, we need to introduce some form of standard options,
whose definitions have close links with an easily understood numerical scale, and which will play a role analogous to the standard metre or the standard kilogram. To this end, we make the following assumption about the algebra of events.

Axiom 4. (Existence of standard events). There exists a subalgebra S of E, and a function μ : S → [0, 1], such that:
(i) S1 ≤ S2 if and only if μ(S1) ≤ μ(S2);
(ii) S1 ∩ S2 = ∅ implies that μ(S1 ∪ S2) = μ(S1) + μ(S2);
(iii) for any number α in [0, 1], there is a standard event S such that μ(S) = α;
(iv) S1 ⊥ S2 implies that μ(S1 ∩ S2) = μ(S1)μ(S2);
(v) if E ⊥ S, F ⊥ S and E ⊥ F, then E ~ S implies E ~_F S.

Discussion of Axiom 4. Note that S is required to be an algebra, and thus both ∅ and Ω are standard events. A family of events satisfying conditions (i) and (ii) is easily identified by imagining an idealised roulette wheel of unit circumference. We suppose that no point on the circumference is "favoured" as a resting place for the ball (considered as a point), so that arcs of equal length are judged equally likely resting places. The standard events are those corresponding to the ball landing within specified connected arcs, or finite unions and intersections of such arcs.
Condition (v) encapsulates an obviously desirable consequence of independence: if, in the sense of Definition 2.6, E is independent of F and S, and F is independent of S, then a judgement of equivalence between E and S should not be affected by the occurrence of F. Moreover, we can always think, in principle, of an "independent" play of the wheel, which generates independent events S in S with μ(S) = α for any specified α in (0, 1), independent of any other events of interest, including previous "plays" on the same wheel; conditions (iv) and (v) are then clearly satisfied. In this setting, standard options {c1 | S1, c2 | S1^c} and {c1 | S2, c2 | S2^c} are considered equivalent if and only if μ(S1) = μ(S2), where μ is the function mapping each "arc-event" to its total length. It follows from Proposition 2.4 and Axiom 4(i) that μ(∅) = 0 and μ(Ω) = 1. We will refer to S as a standard family of events and will think of E as the algebra generated by the relevant events in the decision problem together with the elements of S. It is important to emphasise that we do not require the assumption that standard families of events actually, physically, exist, or could be precisely constructed in accordance with conditions (i) to (v). We only require that we can invoke such a set-up as a mental image, so that no serious conceptual distortion is introduced, while many irrelevant technical difficulties are avoided. Our argument for accepting this degree of mathematical idealisation in setting up our formal system is the same as would apply in the physical sciences. Other forms of standard family satisfying (i) to (v) are easily imagined: a roulette wheel of unit circumference could be imagined cut at some point and "unravelled" to form a unit interval, the discussion for the unit interval being virtually identical to that given for the roulette wheel. The underlying image would then be that of a point landing in the unit interval, with an event S such that μ(S) = l denoting a subinterval of length l, or a finite union of such subintervals. The obvious intuitive content of conditions (i) to (v) can clearly be similarly motivated in these cases.
Conditions (i) and (ii) are then intuitively obvious, as is the fact that, for any α ∈ [0, 1], we can construct a standard event S with μ(S) = α. Alternatively, thinking of the circumferences for two "independent" plays as unravelled to form the sides of a unit square, we could imagine a point landing in the unit square, with S denoting a region of area μ(S) and with μ mapping events to the areas they define; condition (iv) is again clearly satisfied. There is, of course, an element of mathematical idealisation involved in thinking about all real μ ∈ [0, 1], rather than, say, some subset of the rationals chosen to reflect the limits of accuracy in any actual procedure for determining arc lengths or areas, such as numbers whose binary expansions consist of zeros from some specified location onwards. The same is true, of course, of all scientific discourse in which measurements are taken, reflecting the inherent limits of accuracy in any physical measurement procedure.
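As a concrete mental image of our own (the text requires only that such a family can be imagined, not constructed), standard events can be coded as finite unions of disjoint half-open arcs [a, b) of the unit-circumference wheel, with μ given by total arc length; additivity and the existence of an event of any measure α then reduce to elementary facts about interval lengths, while the independence condition can be pictured by unravelling two independent plays into the sides of a unit square, where an intersection is a rectangle of area μ(S1)μ(S2).

```python
# Illustrative sketch only: standard events as finite unions of
# disjoint half-open arcs [a, b) of a wheel of unit circumference,
# with mu equal to total arc length. Two "independent plays" are
# modelled as the two sides of a unit square.

def mu(arcs):
    """Total length of a finite union of disjoint arcs [a, b)."""
    return sum(b - a for a, b in arcs)

def standard_event(alpha):
    """An event S with mu(S) = alpha, for any alpha in [0, 1]."""
    return [(0.0, alpha)]

def product_measure(arcs1, arcs2):
    """Measure of the intersection of events from two independent
    plays: the area of the region arcs1 x arcs2 in the unit square."""
    return sum((b - a) * (d - c) for a, b in arcs1 for c, d in arcs2)

S1 = standard_event(0.25)              # mu(S1) = 0.25
S2 = [(0.1, 0.4), (0.6, 0.8)]          # mu(S2) = 0.5
S3 = [(0.5, 0.6)]                      # disjoint from S1
# Additivity: mu(S1 u S3) = mu(S1) + mu(S3) for disjoint arcs.
# Independence: mu(S1 n S2) = mu(S1) * mu(S2) across the two plays.
```

On this model the sketch makes the intended idealisation explicit: arc endpoints range over all reals in [0, 1], not merely over values attainable by any physical measurement.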
those concerning the non-closure of a set of numbers with respect to the operations of interest.

Proposition 2.7. (Collections of disjoint standard events). For any finite collection {α1, …, αn} of real numbers such that αj ≥ 0 and α1 + ⋯ + αn ≤ 1, there exists a corresponding collection {S1, …, Sn} of disjoint standard events such that μ(Sj) = αj, j = 1, …, n.

Proof. By Axiom 4(iii), there exists S1 such that μ(S1) = α1. Suppose inductively that S1, …, Sj, 1 ≤ j ≤ n − 1, are disjoint standard events with μ(Si) = αi, and define Bj = S1 ∪ ⋯ ∪ Sj, so that μ(Bj) = α1 + ⋯ + αj. By Axiom 4(iii, iv), there exists Tj+1, independent of Bj, such that μ(Tj+1) = αj+1 / (1 − μ(Bj)); define Sj+1 = Tj+1 ∩ Bj^c. Then μ(Sj+1) = μ(Tj+1) μ(Bj^c) = {αj+1 / (1 − μ(Bj))} (1 − μ(Bj)) = αj+1, the events S1, …, Sj+1 are disjoint, and the result follows.

Axiom 5. (Precise measurement of preferences and uncertainties).
(i) If c1 ≤ c ≤ c2, there exists a standard event S such that c ~ {c2 | S, c1 | S^c}.
(ii) For each event E, there exists a standard event S such that E ~ S.
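The construction of disjoint standard events above can be mimicked directly on the unit-interval image of the wheel: laying the required lengths end to end yields disjoint events with the prescribed measures. The following sketch uses our own, simpler, interval representation in place of the proof's device of auxiliary independent events.

```python
# Illustrative sketch only: given alpha_1, ..., alpha_n with each
# alpha_j >= 0 and alpha_1 + ... + alpha_n <= 1, realise disjoint
# standard events S_j as consecutive half-open subintervals of [0, 1),
# so that mu(S_j) = alpha_j and the S_j are pairwise disjoint.

def disjoint_standard_events(alphas):
    assert all(a >= 0 for a in alphas) and sum(alphas) <= 1.0
    events, left = [], 0.0
    for a in alphas:
        events.append((left, left + a))    # S_j = [left, left + alpha_j)
        left += a
    return events

events = disjoint_standard_events([0.2, 0.5, 0.1])
lengths = [b - a for a, b in events]       # the measures mu(S_j)
```

Each interval begins where the previous one ends, so disjointness and the prescribed measures hold by construction.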
Discussion of Axiom 5. Our view is that, from the perspective of the foundations of decision-making, the step from the finite to the infinite implicit in making use of real numbers is simply a pragmatic convenience, whereas the step from comparing a finite set of possibilities to comparing an infinite set has more substantive implications; we have emphasised this latter point by postponing infinite extensions of the decision framework until Chapter 3. This argument is not universally accepted, however, and a related discussion of the issue is provided in Section 2.8. In the introduction to this section, we discussed the idea of precision through quantification and pointed out, using analogies with other measurement systems such as weight, length and temperature, that the process is based on successive comparisons with a standard. The essence of Axiom 5(i) is that any given consequence c, such that c1 ≤ c ≤ c2, can be judged equivalent to one of the standard options defined in terms of c1 and c2. Intuitively, as we increase x, the standard option {c2 | S_x, c1 | S_x^c} becomes more and more "attractive", and, as we decrease y, the option {c2 | S_y, c1 | S_y^c} becomes less "attractive". We start with the obvious preferences {c2 | ∅, c1 | Ω} ≤ c ≤ {c2 | Ω, c1 | ∅}, and then begin to explore comparisons of the form {c2 | S_x, c1 | S_x^c} ≤ c ≤ {c2 | S_y, c1 | S_y^c}, gradually increasing x away from 0 and decreasing y away from 1. In this way, we arrive at comparisons with 0 < x < y < 1 and with the difference y − x becoming increasingly small. Any given consequence c, with c1 ≤ c ≤ c2, can therefore be "sandwiched" arbitrarily tightly and, in the limit, judged equivalent to a standard option.
The preceding argument certainly again involves an element of mathematical idealisation. In practice, it may well be that there are situations where imprecision in the context of comparing consequences is too basic and problematic a feature to be adequately dealt with by an approach based on theoretical precision. There might, in fact, be some interval of indifference: we might judge, say, {c2 | S_x, c1 | S_x^c} ≤ c ≤ {c2 | S_y, c1 | S_y^c} for some x and y, but feel unable to express a more precise form of preference. Indeed, several authors have suggested that this imprecision be formally incorporated into the axiom system. We shall return to this issue in Section 2.8. For many applications, however, this would seem to be an unnecessary confusion of the prescriptive and the descriptive. We formulate the theory on the prescriptive assumption that we aspire to exact measurement (exact comparisons in our case), whilst acknowledging that, in practice, we have to make do with the best level of precision currently available (or devote some resources to improving our measuring instruments!). Every physicist or chemist knows that there are inherent limits of accuracy in any given laboratory context but, so far as we know, no one has suggested developing the structures of theoretical physics or chemistry on the assumption that quantities appearing in fundamental equations should be constrained to take values in some subset of the rationals. This is analogous to the situation where a physical measuring instrument has inherent limits, enabling one to conclude that a reading lies within some small range around 3.13, say, but not permitting a more precise statement. In this case, we would typically report the measurement to be 3.13 and proceed as if this were a precise measurement: theoretical precision, tempered with pragmatically acknowledged approximation.
Condition (ii) extends the idea of precise comparison to events: for any event E and for all consequences c1 < c2, the option {c2 | E, c1 | E^c} can be compared precisely with the family of standard options {c2 | S, c1 | S^c} defined by c1 and c2. The underlying idea is similar to that motivating condition (i). Given the intuitive content of the relation "not more likely than", we can begin with the obvious ordering {c2 | ∅, c1 | Ω} ≤ {c2 | E, c1 | E^c} ≤ {c2 | Ω, c1 | ∅}, and then consider refinements of the form {c2 | S_x, c1 | S_x^c} ≤ {c2 | E, c1 | E^c} ≤ {c2 | S_y, c1 | S_y^c}, in terms of the ordering of the events, S_x ≤ E ≤ S_y, with x increasing gradually from 0, y decreasing gradually from 1, and the difference y − x becoming increasingly small. The essence of the axiom is that this "sandwiching" can be refined arbitrarily closely, by an increasing sequence of x's and a decreasing sequence of y's tending to a common limit. The standard family of options is thus assumed to provide a continuous scale against which any event, or consequence, can be precisely compared.
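Computationally, this sandwiching is a bisection search: n qualitative comparisons of E with standard events trap the required value in an interval of length 2^(-n). In the sketch below (our illustration, not the book's), the oracle not_more_likely stands in for the decision-maker's qualitative judgement "E is not more likely than a standard event of measure x".

```python
# Illustrative sketch only: pinning down the measure of an event E by
# successive qualitative comparisons with standard events S_x, as in
# the sandwiching argument. The oracle is a hypothetical stand-in for
# the decision-maker's judgement "E <= S_x".

def measure_by_sandwiching(not_more_likely, tol=1e-6):
    lo, hi = 0.0, 1.0                  # initially S_0 <= E <= S_1
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if not_more_likely(mid):       # E <= S_mid: tighten from above
            hi = mid
        else:                          # S_mid < E: tighten from below
            lo = mid
    return (lo + hi) / 2.0

# Hypothetical oracle for an event whose measure happens to be 0.37.
p_estimate = measure_by_sandwiching(lambda x: 0.37 <= x)
```

Each query halves the interval [lo, hi] containing the sought value, mirroring the increasing sequence of x's and decreasing sequence of y's tending to a common limit.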
The particular standard option to which c is judged equivalent will, of course, depend on c. Note also that Axiom 3 implies that our "attitudes" or "values" regarding consequences are fixed throughout the analysis of any particular decision problem; we have implicitly assumed that they do not depend on any information we might have concerning the occurrence of real-world events. Indeed, if the timescale on which values change were not rather long compared with the timescale within which individual problems are analysed, there would be little hope for rational analysis of any kind.

2.4 BELIEFS AND PROBABILITIES

2.4.1 Representation of Beliefs

It is clear that an individual's preferences among options in any decision problem should depend, at least in part,
on the "degrees of belief" which that individual attaches to the uncertain events forming part of the definitions of the options. The conceptual basis for this numerical measure will be seen to derive from the formal rules governing quantitative, coherent preferences, expressed in axiomatic form in the previous section. The principles of coherence and quantification by comparison with a standard will enable us to give a formal definition of degree of belief, irrespective of the nature of the uncertain events under consideration. This is in vivid contrast to what are sometimes called the classical and frequency approaches to defining numerical measures of uncertainty (see Section 2.8), where the existence of symmetries and the possibility of indefinite replication, respectively, play fundamental roles in defining the concepts for restricted classes of events. Our definition will depend only on the logical notions of quantitative, coherent preferences, although, in evaluating particular cases, our practical evaluations will often make use, at least in part, of perceived symmetries and observed frequencies. We cannot emphasise strongly enough the important distinction between defining a general concept and evaluating a particular case. We begin by establishing some basic results concerning the uncertainty relation between events.

Proposition 2.7. (Complete comparability of events). For any events E1 and E2, either E1 > E2, or E2 > E1, or E1 ~ E2.

Proof. By Axiom 5(ii), there exist standard events S1 and S2 such that E1 ~ S1 and E2 ~ S2; the complete ordering now follows from Axiom 4(i) and Proposition 2.2(ii).
We see from Proposition 2.7 that, although the order relation ≤ between options was not assumed to be complete (i.e., not all pairs of options were assumed to be comparable), the uncertainty relation induced between events turns out, as a consequence of Axiom 5 (the axiom of precise measurement), to be complete. A similar result concerning the comparability of all options will be established later.

Proposition 2.8. (Additivity of uncertainty relations). Given an uncertainty relation ≤, if A ∩ C = B ∩ D = ∅, then A ≤ B and C ≤ D imply A ∪ C ≤ B ∪ D; moreover, if, in addition, A < B or C < D, then A ∪ C < B ∪ D.

Proof. We first show that, for any G with A ∩ G = B ∩ G = ∅, A ≤ B if and only if A ∪ G ≤ B ∪ G; this follows from Definition 2.3 and Axiom 3, since the options defining the comparison of A ∪ G with B ∪ G can be chosen to coincide on G. The general result then follows by applying this equivalence to suitable set-theoretic decompositions of A ∪ C and B ∪ D, using Proposition 2.2; the final, strict, statement follows from essentially the same argument.

We now make the key definition which enables us to move to a quantitative notion of degree of belief, by linking the equivalence of any E ∈ E to some S ∈ S, and exploiting the fact that the nature of the construction of S provides a direct, obvious quantification of the uncertainty regarding S.

Definition 2.8. (Measure of degree of belief). Given an uncertainty relation ≤, the probability P(E) of an event E is the real number μ(S) associated with any standard event S such that E ~ S.

This definition provides a natural, operational extension of the qualitative uncertainty relation encapsulated in Definition 2.3. With our operational definition, the meaning of a probability statement is clear. For instance, the statement P(E) = 0.5 precisely means that E is judged to be equally likely as a standard event of "measure" 0.5: maybe a conceptual perfect coin falling heads, or a computer-generated "random" integer being an odd number.
It should be emphasised that, within the framework we are discussing, probabilities are always personal degrees of belief, in that they are a numerical representation of the decision-maker's personal uncertainty relation ≤ between events. Moreover, probabilities are always conditional on the information currently available. Since probabilities are obviously conditional on the initial state of information M₀, a more precise and revealing notation in Definition 2.6 would have been P(E | M₀). In order to avoid cumbersome notation, we shall stick to the shorter version, but the implicit conditioning on M₀ should always be borne in mind. It makes no sense, within the framework we are discussing, to qualify the word probability with adjectives such as "objective", "correct" or "unconditional".

Proposition 2.8. (Existence and uniqueness). Given an uncertainty relation ≤, there exists a unique probability P(E) associated with each event E.

Proof. Existence follows from Axiom 5(ii). For uniqueness, if E ∼ S₁ and E ∼ S₂ then, by Proposition 2.2(ii), S₁ ∼ S₂ and hence, by Axiom 4(i), p(S₁) = p(S₂). ◁

Definition 2.7. (Compatibility). A function f : ℰ → ℜ is said to be compatible with an order relation ≤ on ℰ × ℰ if, for all events E and F, E ≤ F if, and only if, f(E) ≤ f(F).

Proposition 2.9. (Compatibility of probability and degrees of belief). The probability function P(·) is compatible with the uncertainty relation ≤.

Proof. By Axiom 5(ii), there exist standard events S₁ and S₂ such that E ∼ S₁ and F ∼ S₂. Then, by Proposition 2.2(ii), E ≤ F iff S₁ ≤ S₂ and hence, by Axiom 4(i), iff p(S₁) ≤ p(S₂). The result now follows from Definition 2.6. ◁

The following proposition is of fundamental importance; among other things, it establishes that significant events, events which are "practically possible but not certain", should be assigned probability values in the open interval (0, 1).

Proposition 2.10. (Probability structure of degrees of belief). (i) P(∅) = 0 and P(Ω) = 1; (ii) if E∩F = ∅, then P(E∪F) = P(E) + P(F); (iii) E is significant if, and only if, 0 < P(E) < 1.

Proof. (i) By Axiom 4(iii), there exist standard events S∗ and S* such that p(S∗) = 0 and p(S*) = 1; clearly ∅ ∼ S∗ and Ω ∼ S*, so that P(∅) = 0 and P(Ω) = 1. (ii) Let α = P(E) and β = P(E∪F). By Axiom 4(iii), there exist disjoint standard events S₁ and S₂ such that S₁∩S₂ = ∅, E ∼ S₁, p(S₁) = α and p(S₁∪S₂) = β. If F > S₂ then, by Proposition 2.7, E∪F > S₁∪S₂ and hence P(E∪F) > β, which is impossible; similarly, F < S₂ would imply P(E∪F) < β, which is impossible. Hence F ∼ S₂ and therefore P(F) = β − α, so that P(E∪F) = P(E) + P(F). (iii) E is significant iff ∅ < E < Ω; by Proposition 2.9 and part (i), this holds iff 0 < P(E) < 1. ◁

Proposition 2.11. (Finitely additive structure of degrees of belief). (i) If {Eⱼ, j ∈ J} is a finite collection of disjoint events, then P(∪ⱼ∈J Eⱼ) = Σⱼ∈J P(Eⱼ); (ii) if {Eⱼ, j ∈ J} form a finite partition of Ω, then Σⱼ∈J P(Eⱼ) = 1.

Proof. The first part follows by induction from Proposition 2.10(ii); the second part is a special case of (i), since, if ∪ⱼ Eⱼ = Ω, then Σⱼ P(Eⱼ) = P(Ω) = 1. ◁

Corollary. (i) For any event E, P(Eᶜ) = 1 − P(E); (ii) for any event E, 0 ≤ P(E) ≤ 1.

Proof. Both parts follow from Proposition 2.11 applied to the partition {E, Eᶜ}. ◁

This establishes formally that coherent, quantitative measures of uncertainty about events must take the form of probabilities, therefore justifying the nomenclature adopted in Definition 2.6 for this measure of degree of belief. In short, coherent degrees of belief are probabilities.

Definition 2.8. (Probability distribution). If {Eⱼ, j ∈ J} form a finite partition of Ω, with P(Eⱼ) = pⱼ, j ∈ J, then {pⱼ, j ∈ J} is said to be a probability distribution over the partition.

It will often be convenient for us to use probability terminology without explicit reference to the fact that the mathematical structure is merely serving as a representation of (personal) degrees of belief. The latter fact should, however, be constantly borne in mind.
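The finitely additive structure can be illustrated with a small numerical sketch (the four-event partition and the values pⱼ below are invented for illustration):

```python
# Hypothetical probability distribution {p_j} over a four-event partition of Omega
# (Definition 2.8); the numerical values are invented for illustration.
p = {"E1": 0.1, "E2": 0.2, "E3": 0.3, "E4": 0.4}

# Proposition 2.11(ii): the probabilities of a finite partition sum to P(Omega) = 1.
total = sum(p.values())

# Proposition 2.11(i): additivity over disjoint events, e.g. E = E1 u E2.
p_E = p["E1"] + p["E2"]

# Corollary: P(E^c) = 1 - P(E), with E^c = E3 u E4 here.
p_Ec = p["E3"] + p["E4"]

print(total, p_E, p_Ec)
```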
This terminology will prove useful in later discussions. The idea is that total belief (in Ω, having measure 1) is distributed among the events of the partition {Eⱼ, j ∈ J} according to the relative degrees of belief {pⱼ, j ∈ J}.

Starting from the qualitative ordering ≤ among events, we have derived a quantitative measure, P(·), over ℰ and shown that, expressed in conventional mathematical terminology, it has the form of a finitely additive probability measure. We now establish that this is the only probability measure over ℰ compatible with the uncertainty relation ≤.

Proposition 2.12. (Uniqueness of the probability measure). P is the only probability measure compatible with the uncertainty relation ≤.

Proof. If P′ were another compatible measure then, by Definition 2.7, we would always have P′(E) ≤ P′(F) ⇔ P(E) ≤ P(F), so that there exists a monotonic function f of [0, 1] into itself such that P′(E) = f{P(E)}. By Axiom 4(iii), for any non-negative α, β with α + β ≤ 1, there exist disjoint standard events S₁ and S₂ such that P(S₁) = α and P(S₂) = β. Hence, f(α + β) = P′(S₁∪S₂) = P′(S₁) + P′(S₂) = f(α) + f(β), and so (Eichhorn, 1978) f(α) = kα for all α in [0, 1]. But P′(Ω) = P(Ω) = 1, so that k = 1 and P′(E) = P(E) for all E. ◁

We shall now establish that our operational definition of (pairwise) independence of events is compatible with its more standard, ad hoc, product definition.

Proposition 2.13. (Characterisation of independence). E ⊥ F ⇔ P(E∩F) = P(E)P(F).

Proof. Suppose E ⊥ F. By Axiom 4(iii), there exists a standard event S₁ such that P(S₁) = P(E), E ∼ S₁ and S₁ ⊥ F. Then, for any consequences c₁ < c₂, {c₂ | E∩F, c₁ | (E∩F)ᶜ} ∼ {c₂ | S₁∩F, c₁ | (S₁∩F)ᶜ}, so that E∩F ∼ S₁∩F. Again by Axiom 4(iii), there exists S₂ such that P(S₂) = P(F), F ∼ S₂ and S₁ ⊥ S₂; by an identical argument to the above, S₁∩F ∼ S₁∩S₂. Hence, P(E∩F) = P(S₁∩S₂) = P(S₁)P(S₂) = P(E)P(F).

Conversely, suppose P(E∩F) = P(E)P(F). By Axiom 4(iii), there exists a standard event S such that P(S) = P(F) and E ⊥ S. Then, by the first part of the proof, P(E∩S) = P(E)P(S) = P(E)P(F) = P(E∩F), so that E∩F ∼ E∩S; since S is a standard event independent of E, it follows that {c₂ | E∩F, c₁ | (E∩F)ᶜ} ∼ {c₂ | E∩S, c₁ | (E∩S)ᶜ}, hence establishing that E ⊥ F. A similar argument can obviously be given reversing the roles of E and F. ◁

2.4.2 Revision of Beliefs and Bayes' Theorem

The assumed occurrence of a real-world event will typically modify preferences between options by modifying the degrees of belief attached, by an individual, to the events defining the options. In this section, we use the assumptions of Section 2.3 in order to identify the precise way in which coherent modification of initial beliefs should proceed.

The starting point for analysing order relations between events, given the assumed occurrence of a possible event G, is the conditional uncertainty relation ≤_G. Given the assumed occurrence of G > ∅, the ordering ≤ between acts is replaced by ≤_G. Analogues of Propositions 2.1 and 2.2 are trivially established, and we recall (Proposition 2.3) that preferences between consequences are unaffected by the assumed occurrence of G: c₁ ≤ c₂ iff c₁ ≤_G c₂.

Proposition 2.14. (Properties of conditional beliefs). (i) E ≤_G F if, and only if, E∩G ≤ F∩G. (ii) If there exist c₂ > c₁ such that {c₂ | E, c₁ | Eᶜ} ≤_G {c₂ | F, c₁ | Fᶜ}, then {c₂ | E, c₁ | Eᶜ} ≤_G {c₂ | F, c₁ | Fᶜ} for all c₂ ≥ c₁.
Proof. (i) By Definition 2.4 and Axiom 4(iv), {c₂ | E, c₁ | Eᶜ} ≤_G {c₂ | F, c₁ | Fᶜ} if and only if, for all a, {c₂ | E∩G, c₁ | Eᶜ∩G, a | Gᶜ} ≤ {c₂ | F∩G, c₁ | Fᶜ∩G, a | Gᶜ}. Taking a = c₁, this holds iff {c₂ | E∩G, c₁ | (E∩G)ᶜ} ≤ {c₂ | F∩G, c₁ | (F∩G)ᶜ}, and this is true iff E∩G ≤ F∩G. (ii) With a = c₁, the result follows from Axiom 3(ii) and part (i) of this proposition. ◁

Generalising the idea encapsulated in Definition 2.6, we now define a conditional measure of degree of belief.

Definition 2.9. (Conditional measure of degree of belief). Given a conditional uncertainty relation ≤_G, G > ∅, the conditional probability P(E | G) of an event E given the assumed occurrence of G is the real number p(S) such that E ∼_G S, where S is a standard event independent of G.

Thus, P(E | G) provides a quantitative, operational measure of the uncertainty attached to E given the assumed occurrence of the event G. The intention of the terminology used above is to emphasise the additional conditioning resulting from the occurrence of G. We have, of course, already stressed that all degrees of belief are conditional: the initial state of information, although omitted throughout for notational convenience, is always present as a conditioning factor.

The following fundamental result provides the key to the process of revising beliefs in a coherent manner in the light of new information. It relates the conditional measure of degree of belief P(· | G) to the initial measure of degree of belief P(·).

Proposition 2.15. (Conditional probability). For any G > ∅,

P(E | G) = P(E∩G) / P(G).

Proof. By Axiom 4(iii) and Proposition 2.13, there exists a standard event S ⊥ G such that p(S) = P(E∩G)/P(G). By Proposition 2.13, P(S∩G) = P(S)P(G) = P(E∩G), so that S∩G ∼ E∩G and hence, by Proposition 2.14, E ∼_G S. Thus, by Definition 2.9, P(E | G) = p(S) = P(E∩G)/P(G). ◁

Note that, in our formulation, P(E | G) = P(E∩G)/P(G) is a logical derivation from the axioms, not an ad hoc definition.

Proposition 2.16. (Compatibility of conditional probability and conditional degrees of belief). For all G > ∅, E ≤_G F iff P(E | G) ≤ P(F | G).

Proof. By Proposition 2.14, E ≤_G F iff E∩G ≤ F∩G which, by Proposition 2.9, holds if and only if P(E∩G) ≤ P(F∩G); the result now follows from Proposition 2.15. ◁

We now extend Proposition 2.11 to degrees of belief conditional on the occurrence of significant events.

Proposition 2.17. (Probability structure of conditional degrees of belief). For any G > ∅: (i) P(∅ | G) = 0 ≤ P(E | G) ≤ P(Ω | G) = 1; (ii) if E∩F∩G = ∅, then P(E∪F | G) = P(E | G) + P(F | G); (iii) E is significant given G if, and only if, 0 < P(E | G) < 1.

Proof. (i) By Proposition 2.15, P(∅ | G) = 0 and P(Ω | G) = 1; moreover, since E∩G ≤ G, Proposition 2.10 implies that P(E∩G) ≤ P(G), so that 0 ≤ P(E | G) ≤ 1. (ii) If E∩F∩G = ∅ then, by Propositions 2.10 and 2.15, P(E∪F | G) = [P(E∩G) + P(F∩G)]/P(G) = P(E | G) + P(F | G). (iii) E is significant given G iff ∅ <_G E <_G Ω iff, by Proposition 2.16, 0 < P(E | G) < 1. ◁

Corollary. (Finitely additive structure of conditional degrees of belief). For all G > ∅: (i) if {Eⱼ, j ∈ J} is a finite collection of disjoint events, then P(∪ⱼ Eⱼ | G) = Σⱼ P(Eⱼ | G); (ii) for any event E, P(Eᶜ | G) = 1 − P(E | G).

Proof. This parallels the proof of the Corollary to Proposition 2.11. ◁

Finally, we note that, by Proposition 2.15, P(E | G)P(G) = P(E∩G) = P(G | E)P(E), so that P(E | G) = P(G | E)P(E)/P(G). In fact, this is the simplest version of Bayes' theorem; an extended form is given later in Proposition 2.19.
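Proposition 2.15 and the simplest version of Bayes' theorem can be checked mechanically on a toy example (the four joint probabilities below are invented for illustration):

```python
# Invented joint probabilities for the four cells generated by events E and G.
p_EG, p_EGc, p_EcG, p_EcGc = 0.12, 0.18, 0.28, 0.42

p_E = p_EG + p_EGc   # P(E) = 0.30
p_G = p_EG + p_EcG   # P(G) = 0.40

# Proposition 2.15: P(E|G) = P(E n G) / P(G).
p_E_given_G = p_EG / p_G

# Simplest version of Bayes' theorem: P(E|G) = P(G|E) P(E) / P(G).
p_G_given_E = p_EG / p_E
bayes = p_G_given_E * p_E / p_G

print(p_E_given_G, bayes)
```

Both routes necessarily produce the same number, since each is an algebraic rearrangement of the same joint probability.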
Proposition 2.18. (Uniqueness of the conditional probability measure). For any G > ∅, P(· | G) is the only probability measure compatible with the conditional uncertainty relation ≤_G.

Proof. This parallels the proof of Proposition 2.12. ◁

The following example provides an instructive illustration of the way in which the formalism of conditional probabilities provides a coherent resolution of an otherwise seemingly paradoxical situation.

Example 2.1. (Simpson's paradox). Suppose that the results of a clinical trial involving 800 sick patients are as shown in Table 2.1, where R, Rᶜ denote, respectively, that the patients did or did not recover, and T, Tᶜ denote, respectively, that patients did or did not receive a certain treatment, and where recovery and the receipt of treatment by individuals are represented, in an obvious notation, as events.

Table 2.1  Trial results for all patients

       R     Rᶜ    Total   Recovery rate
T     200   200    400     50%
Tᶜ    160   240    400     40%

Intuitively, it would seem reasonable to specify P(R | T) = 0.5, P(R | Tᶜ) = 0.4 and, were one to base probability judgements on these reported figures, it seems clear that the treatment is beneficial. Suppose now, however, that one became aware of the trial outcomes for male and female patients separately, and that these have the summary forms described in Tables 2.2 and 2.3.

Table 2.2  Trial results for male patients

       R     Rᶜ    Total   Recovery rate
T     180   120    300     60%
Tᶜ     70    30    100     70%

Table 2.3  Trial results for female patients

       R     Rᶜ    Total   Recovery rate
T      20    80    100     20%
Tᶜ     90   210    300     30%

Were one to base probability judgements on the figures reported in Tables 2.2 and 2.3, it would seem reasonable to specify P(R | M∩T) = 0.6, P(R | M∩Tᶜ) = 0.7, P(R | Mᶜ∩T) = 0.2, P(R | Mᶜ∩Tᶜ) = 0.3, where M, Mᶜ denote, respectively, the events that a patient is male or female. The results surely seem paradoxical. Tables 2.2 and 2.3 tell us that the treatment is beneficial neither for males nor for females, but Table 2.1 tells us that overall it is beneficial! How are we to come to a coherent view in the light of this apparently conflicting evidence?

The seeming paradox is easily resolved by an appeal to the logic of probability which, after all, we have just demonstrated to be the prerequisite for the coherent treatment of uncertainty. From Proposition 2.15 and the Corollary to Proposition 2.17, we note that

P(R | T) = P(R | M∩T) P(M | T) + P(R | Mᶜ∩T) P(Mᶜ | T),
P(R | Tᶜ) = P(R | M∩Tᶜ) P(M | Tᶜ) + P(R | Mᶜ∩Tᶜ) P(Mᶜ | Tᶜ),

where P(M | T) = 0.75 and P(M | Tᶜ) = 0.25. To see that these judgements do indeed cohere with those based on Table 2.1, we note that P(R | T) = 0.6 × 0.75 + 0.2 × 0.25 = 0.5 and P(R | Tᶜ) = 0.7 × 0.25 + 0.3 × 0.75 = 0.4. The probability formalism reveals that the seeming paradox has arisen from the confounding of sex with treatment as a consequence of the unbalanced trial design. See Simpson (1951), Blyth (1972, 1973) and Lindley and Novick (1981) for further discussion.

Proposition 2.19. (Bayes' theorem). For any finite partition {Eⱼ, j ∈ J} of Ω, and G > ∅,

P(Eⱼ | G) = P(G | Eⱼ) P(Eⱼ) / Σᵢ∈J P(G | Eᵢ) P(Eᵢ).

Proof. By Proposition 2.15, P(Eⱼ | G)P(G) = P(G∩Eⱼ) = P(G | Eⱼ)P(Eⱼ). The result now follows from the Corollary to Proposition 2.11 when applied to G = ∪ⱼ (G∩Eⱼ). ◁
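The resolution of Example 2.1 is pure arithmetic on the table counts, and can be verified mechanically (a sketch using the counts of Tables 2.1–2.3):

```python
# Counts from Tables 2.2 and 2.3: (recovered, total) for each (sex, treatment) cell.
male   = {"T": (180, 300), "Tc": (70, 100)}
female = {"T": (20, 100),  "Tc": (90, 300)}

def rate(r_n):
    r, n = r_n
    return r / n

# Within each sex the treatment looks harmful...
assert rate(male["T"]) < rate(male["Tc"])      # 0.60 < 0.70
assert rate(female["T"]) < rate(female["Tc"])  # 0.20 < 0.30

# ...yet pooling the sexes reverses the comparison (Simpson's paradox).
pooled_T  = (male["T"][0] + female["T"][0])   / (male["T"][1] + female["T"][1])
pooled_Tc = (male["Tc"][0] + female["Tc"][0]) / (male["Tc"][1] + female["Tc"][1])
assert pooled_T > pooled_Tc                    # 0.50 > 0.40

# The total probability formula reproduces the pooled rate:
# P(R|T) = P(R|M,T) P(M|T) + P(R|F,T) P(F|T), with P(M|T) = 300/400 = 0.75.
p_M_given_T = male["T"][1] / (male["T"][1] + female["T"][1])
p_R_given_T = rate(male["T"]) * p_M_given_T + rate(female["T"]) * (1 - p_M_given_T)
print(pooled_T, p_R_given_T)
```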
Bayes' theorem is a simple mathematical consequence of the fact that quantitative coherence implies that degrees of belief should obey the rules of probability. Since the {Eⱼ, j ∈ J} form a partition, Σⱼ∈J P(Eⱼ | G) = 1, and thus it is always possible to normalise the products P(G | Eⱼ)P(Eⱼ) by dividing by their sum. Bayes' theorem may therefore be written in the form

P(Eⱼ | G) ∝ P(G | Eⱼ) P(Eⱼ),  j ∈ J,

since the missing proportionality constant is [P(G)]⁻¹ = [Σⱼ P(G | Eⱼ)P(Eⱼ)]⁻¹. This form of the theorem is often very useful in applications.

Bayes' theorem acquires a particular significance in the case where the uncertain events {Eⱼ, j ∈ J} correspond to an exclusive and exhaustive set of hypotheses about some aspect of the world (for example, in a medical context, the set of possible diseases from which a patient may be suffering), and the event G corresponds to a relevant piece of evidence, or data (for example, the outcome of a clinical test). If we adopt the more suggestive notation Eⱼ = Hⱼ, j ∈ J, G = D, Proposition 2.19 leads to Bayes' theorem in the form

P(Hⱼ | D) = P(D | Hⱼ) P(Hⱼ) / P(D),  where  P(D) = Σⱼ∈J P(D | Hⱼ) P(Hⱼ),

characterising the way in which initial beliefs about the hypotheses, P(Hⱼ), j ∈ J, are modified by the data, D, into a revised set of beliefs, P(Hⱼ | D), j ∈ J. This process is seen to depend crucially on the specification of the quantities P(D | Hⱼ), j ∈ J, which reflect how beliefs about obtaining the given data, D, vary over the different underlying hypotheses, thus defining the "relative likelihoods" of the latter. From another point of view, it may also be established (Zellner, 1988b) that, under some reasonable desiderata, Bayes' theorem is an optimal information processing system.

The four elements, P(Hⱼ), P(D | Hⱼ), P(Hⱼ | D) and P(D), occur, in various guises, throughout Bayesian statistics, and it is convenient to have a standard terminology available.

Definition 2.10. (Prior, posterior and predictive probabilities). If {Hⱼ, j ∈ J} are exclusive and exhaustive events (hypotheses), then, for any event (data) D:
(i) P(Hⱼ), j ∈ J, are called the prior probabilities of the Hⱼ, j ∈ J;
(ii) P(D | Hⱼ), j ∈ J, are called the likelihoods of the Hⱼ, j ∈ J, given D;
(iii) P(Hⱼ | D), j ∈ J, are called the posterior probabilities of the Hⱼ, j ∈ J;
(iv) P(D) is called the predictive probability of D implied by the likelihoods and the prior probabilities.

It is important to realise that the terms "prior" and "posterior" only have significance given an initial state of information and relative to an additional piece of information. Thus, as usual, we omit explicit notational reference to the initial state of information M₀.
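The four elements of Definition 2.10 can be sketched numerically; the three hypotheses and all probabilities below are invented for illustration:

```python
# Hypothetical prior probabilities over three exclusive, exhaustive hypotheses,
# and likelihoods P(D|H_j) for some observed data D.
prior      = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihood = {"H1": 0.8, "H2": 0.4, "H3": 0.1}

# Predictive probability of the data: P(D) = sum_j P(D|H_j) P(H_j).
p_D = sum(likelihood[h] * prior[h] for h in prior)

# Posterior probabilities by Bayes' theorem; normalising the products
# P(D|H_j) P(H_j) by their sum is exactly division by P(D).
posterior = {h: likelihood[h] * prior[h] / p_D for h in prior}

print(p_D, posterior)
```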
or. Similarly. For simplicity.2 The quantities P(T H I ) and f'(T I H 2 ) represent the !rue positive and true negnriw rates of the clinical test (often referred to as the test sensitivity and tehi spec$ficity. we can investigate the range of underlying prevalence rates lor which the test has worthwhile diagnostic value. . but posterior to conditioning on whatever history led to the state of information described by ill^.) represents rhe prrvtrlrwe rote of the disease in the population to which the patient is assumed to belong. Bayes' theorem often provides a particularly illuminating form of analysis of the various uncertainties involved. P(H. respectively) and the systematic usc of Bayes' theorem then enables us to understand the manner i n which these characteristics of the test combine with the prevalence rate to produce varying degrees of diagnostic discriminatory power. represents beliefs posterior to conditioning on X and D. provides a basis for assessing the compatibility of the data D with our beliefs (see Box. or negative (suggesting the absence of the disease and denoted by D = T ). for a given clinical test of known sensitivity and specificity. whose outcome is either positive (suggesting the presence of the disease and denoted by D = 7').0 0.6 0.logically implied by the likelihoods and the prior probabilities. In particular. representing the presence or absence.. I D). let us consider the situation where a patient may be characteribed as belonging either to state H I . Let us further suppose that I'(ff. 1980). The predictive probability P ( D ) .2.or to state H. (Medical diagnosis). and that further information is available in the form of the result o f a single clinical test. We shall consider this in more detail in Chapter 6. Example 2. I AfO n D ) . In simple problems of medical diagnosih. of a specified disease. but prior to conditioning on any further data which l o may be obtained subsequent to D. respectively. 
P ( H .44 2 Foundations represents beliefs prior to conditioning on data D. more properly. I .
4 Beliefs and Probabilities 45 As an illustration of this process. (198 I ) concluded that P ( T I HI) = 0.as shown in Figure 2. 2. Not surprisingly. Thus. However. there is limited diagnostic value in the test.4. while we are less sure about other assessments and need to approach them indirectly via the relationships implied by coherence. E L F P ( E I F )= P ( E ) . for example. respectively). . let us consider the assessment of the diagnostic value of stress thallium201 scintigraphy. in Proposition 2. Proposition 2.15 arises when E and G are such that P(E 1 G) = P ( E ) . 0. this is directly related to our earlier operational definition of (pairwise) independence. some assessments seem straightforward and we feel comfortable in making them directly. Murray er ul.900. In cases where P ( H l )has very low or very high values (e. for coherence.3 Conditional Independence An important special case of Proposition 2.so that beliefs about E are unchanged by the assumed Occurrence of G. On the basis of a controlled experimental study.875 were reasonable orders of magnitude for the sensitivity and specificity of the test. One further point about the terms prior and posterior is worth emphasising. for example. Insight into the diagnostic value of the test can he obtained by plotting values of P( H I1 T). They merely indicate that. where for D = Tor D = T .20.75. the test may be expected to provide valuable diagnostic information. in clinical situations where there is considerable uncertainty about the presence of coronary heart disease.17 do not involve any such chronological notions. specifications of degrees of belief must satisfy the given relationships. For ull F > 8.2.) 5 0. a technique involving analysis of Gamma camera image data as an indicator of coronary heart disease. It is true that the natural order of assessment does coincide with the “chronological” order in a number of practical applications. 
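A minimal sketch of the analysis in Example 2.2, using the sensitivity and specificity figures quoted from Murray et al. (1981); the particular prevalence values scanned below are illustrative:

```python
# Sensitivity and specificity reported by Murray et al. (1981).
sens = 0.900   # P(T  | H1)
spec = 0.875   # P(Tc | H2)

def posterior_positive(prev):
    """P(H1 | T) by Bayes' theorem, for a given prevalence P(H1) = prev."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def posterior_negative(prev):
    """P(H1 | Tc) by Bayes' theorem, for a given prevalence P(H1) = prev."""
    return (1 - sens) * prev / ((1 - sens) * prev + spec * (1 - prev))

# Overall measure of discriminatory power: P(H1|T) - P(H1|Tc),
# scanned across a range of prevalence rates.
for prev in (0.01, 0.25, 0.50, 0.75, 0.99):
    gap = posterior_positive(prev) - posterior_negative(prev)
    print(prev, round(gap, 3))
```

Running the scan shows the gap largest for intermediate prevalence and small near 0 or 1, matching the qualitative conclusion in the text.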
One further point about the terms prior and posterior is worth emphasising. They are not necessarily to be interpreted in a chronological sense, with the assumption that "prior" beliefs are specified first and then later modified into "posterior" beliefs. Propositions 2.15 and 2.17 do not involve any such chronological notions. They merely indicate that, for coherence, specifications of degrees of belief must satisfy the given relationships. In any given situation, some assessments seem straightforward and we feel comfortable in making them directly, while we are less sure about other assessments and need to approach them indirectly via the relationships implied by coherence. Thus, in Proposition 2.15, one might first specify P(G) and P(E | G) and then use the relationship stated in the theorem to arrive at a coherent specification of P(E∩G). The particular order in which we specify degrees of belief and check their coherence is a pragmatic one. It is true that the natural order of assessment does coincide with the "chronological" order in a number of practical applications, but it is important to realise that this is a pragmatic issue and not a requirement of the theory.

2.4.3 Conditional Independence

An important special case of Proposition 2.15 arises when E and G are such that P(E | G) = P(E), so that beliefs about E are unchanged by the assumed occurrence of G. This is directly related to our earlier operational definition of (pairwise) independence.

Proposition 2.20. For all F > ∅, E ⊥ F ⇔ P(E | F) = P(E).

Proof. E ⊥ F ⇔ P(E∩F) = P(E)P(F) and, by Proposition 2.15, we have P(E∩F) = P(E | F)P(F). ◁

Note that Proposition 2.20 derives from the uncertainty relation ≤, and therefore reflects an inherently personal judgement (although coherence may rule out some events from being judged independent: for example, any E, F such that ∅ ⊂ E ⊂ F ⊂ Ω). There is a sense, however, in which the judgement of independence for large classes of events of interest reflects a rather extreme form of belief, in that scope for learning from experience is very much reduced.

An important consequence of the fact that coherent degrees of belief combine in conformity with the rules of (finitely additive) mathematical probability theory is that the task of specifying degrees of belief for complex combinations of events is often greatly simplified. Instead of being forced into a direct specification, we can attempt to represent the complex event in terms of simpler events, for which we feel more comfortable in specifying degrees of belief. The latter are then recombined, using the probability rules, to obtain the desired specification for the complex event.

In the case of three events, E, F and G, the situation is somewhat more complicated in that, in the general case, we would regard our degree of belief for E as being "independent" of knowledge of F and G if and only if P(E | H) = P(E) for any of the four possible forms of H: F∩G, F∩Gᶜ, Fᶜ∩G, Fᶜ∩Gᶜ; and, of course, similar conditions must hold for the "independence" of F from E and G, and of G from E and F. These considerations motivate the following formal definition, which generalises our earlier definition of pairwise independence and can be shown (see, e.g., Feller, 1950/1968, pp. 125–128) to be necessary and sufficient for encapsulating, from an intuitive point of view, the conditions discussed above.

Definition 2.12. (Mutual independence). Events {Eⱼ, j ∈ J} are said to be mutually independent if, for any I ⊆ J,

P(∩ᵢ∈I Eᵢ) = Πᵢ∈I P(Eᵢ).

Definition 2.12 makes clear that the judgement of independence for a collection of events leads to considerable additional simplification when complex intersections of events are to be considered.
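Definition 2.12 requires the product condition for every subset I ⊆ J, not merely for pairs. A small sketch (marginal probabilities invented) verifies this exhaustively for three events whose joint distribution is built as a product:

```python
from itertools import combinations, product

# Hypothetical marginal probabilities for three events.
marg = {"E1": 0.5, "E2": 0.3, "E3": 0.8}
events = list(marg)

# Joint mass over the 8 atoms, built as a product of marginals, so that
# mutual independence holds by construction.
atoms = {}
for bits in product([True, False], repeat=3):
    m = 1.0
    for e, b in zip(events, bits):
        m *= marg[e] if b else 1 - marg[e]
    atoms[bits] = m

def prob_intersection(I):
    """P of the intersection of the events named in I, read off the joint mass."""
    return sum(m for bits, m in atoms.items()
               if all(bits[events.index(e)] for e in I))

# Definition 2.12: for every subset I, P(inter_{i in I} E_i) = prod_{i in I} P(E_i).
for r in range(1, 4):
    for I in combinations(events, r):
        rhs = 1.0
        for e in I:
            rhs *= marg[e]
        assert abs(prob_intersection(I) - rhs) < 1e-12
print("product condition verified for all subsets")
```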
Thus, for example, in simple Mendelian theory, the genotypes of successive offspring are typically judged to be independent events, given the knowledge of the two genotypes forming the mating. Similarly, in the practical context of sampling, with or without replacement, from large dichotomised populations (of voters, manufactured items, or whatever), successive outcomes (voting intention, marketable quality, or whatever) may very often be judged independent, given exact knowledge of the proportional split in the dichotomised population. In the absence of such knowledge, however, in neither case would the judgement of independence for successive outcomes be intuitively plausible, since earlier outcomes provide information about the unknown population or mating composition and this, in turn, influences judgements about subsequent outcomes. This motivates consideration of the following weaker form of independence judgement.

Definition 2.13. (Conditional independence). The events {Eⱼ, j ∈ J} are said to be conditionally independent given G > ∅ if, for any I ⊆ J,

P(∩ᵢ∈I Eᵢ | G) = Πᵢ∈I P(Eᵢ | G).

Similarly, for any subalgebra 𝒢 of ℰ, the events {Eⱼ, j ∈ J} are said to be conditionally independent given 𝒢 if, and only if, they are conditionally independent given any G > ∅ in 𝒢.

The form of degree of belief judgement encapsulated in Definition 2.13 is one which is utilised in some way or another in a wide variety of practical contexts and statements of scientific theories. Indeed, a detailed discussion of the kinds of circumstances in which it may be reasonable to structure beliefs on the basis of such judgements will be a main topic of Chapter 4. For a detailed analysis of the concept of conditional independence, see Dawid (1979a, 1979b, 1980b). Definitions 2.12 and 2.13 could, of course, have been stated in primitive terms of choices among options; having seen in detail the way in which the latter leads to the standard "product definition", it will be clear that a similar equivalence holds in these more general cases, but that the algebraic manipulations involved are somewhat more tedious.

2.4.4 Sequential Revision of Beliefs

Bayes' theorem characterises the way in which current beliefs about a set of mutually exclusive and exhaustive hypotheses are revised in the light of new data. In practice, we typically receive data in successive stages, so that the process of revising beliefs is sequential. As a simple illustration of this process, let us suppose that data are obtained in two stages, which can be described by real-world events D₁ and D₂. Omitting, for convenience, explicit conditioning on M₀, revision of beliefs on the basis of the first piece of data D₁ is described by

P(Hⱼ | D₁) = P(D₁ | Hⱼ) P(Hⱼ) / P(D₁),  j ∈ J,

where P(D₁) = Σⱼ P(D₁ | Hⱼ)P(Hⱼ). When it comes to the further, subsequent revision of beliefs in the light of D₂, the likelihoods and prior probabilities to be used in Bayes' theorem are now P(D₂ | Hⱼ∩D₁) and P(Hⱼ | D₁), j ∈ J, since all judgements are now conditional on D₁. We thus have

P(Hⱼ | D₁∩D₂) = P(D₂ | Hⱼ∩D₁) P(Hⱼ | D₁) / P(D₂ | D₁),  j ∈ J.

From an intuitive standpoint, we would obviously anticipate that coherent revision of initial beliefs in the light of the combined data should not depend on whether D₁, D₂ were analysed successively or in combination. This is easily verified by substituting the expression for P(Hⱼ | D₁) into the expression for P(Hⱼ | D₁∩D₂), whereupon we obtain

P(Hⱼ | D₁∩D₂) = P(D₂ | Hⱼ∩D₁) P(D₁ | Hⱼ) P(Hⱼ) / P(D₁∩D₂),

the latter being the direct expression for P(Hⱼ | D₁∩D₂) from Bayes' theorem when D₁∩D₂ is treated as a single piece of data.

There is, however, a potential practical difficulty in implementing this process, since there is an implicit need to specify the successively conditioned likelihoods P(Dₖ₊₁ | Hⱼ∩D₁∩...∩Dₖ), a task which, in the absence of simplifying assumptions, may appear to be impossibly complex if k is at all large. One possible form of simplifying assumption is the judgement of conditional independence for D₁, D₂, ..., given any Hⱼ, since, by Definition 2.13, we then only need the evaluations

P(Dₖ₊₁ | Hⱼ∩D₁∩...∩Dₖ) = P(Dₖ₊₁ | Hⱼ),  for all j ∈ J.

Another possibility might be to assume a rather weak form of dependence by making the judgement that a (Markov) property such as

P(Dₖ₊₁ | Hⱼ∩D₁∩...∩Dₖ) = P(Dₖ₊₁ | Hⱼ∩Dₖ)

holds for all k. As we shall see later, these kinds of simplifying structural assumptions play a fundamental role in statistical modelling and analysis.

The generalisation of this sequential revision process to any number of stages proceeds straightforwardly. If we write D⁽ᵏ⁾ = D₁∩D₂∩...∩Dₖ to denote all the data received up to and including stage k, then, for all j ∈ J,

P(Hⱼ | D⁽ᵏ⁺¹⁾) = P(Dₖ₊₁ | Hⱼ∩D⁽ᵏ⁾) P(Hⱼ | D⁽ᵏ⁾) / P(Dₖ₊₁ | D⁽ᵏ⁾),

which provides a recursive algorithm for the revision of beliefs.
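The recursive algorithm, together with the conditional independence judgement P(D₂ | Hⱼ∩D₁) = P(D₂ | Hⱼ), can be checked against single-stage processing of D₁∩D₂ (all numbers invented for illustration):

```python
# Hypothetical two-hypothesis example with data arriving in two stages.
prior = {"H1": 0.4, "H2": 0.6}

# Stage likelihoods, judged conditionally independent given each H_j, so that
# P(D1 n D2 | H_j) = P(D1 | H_j) P(D2 | H_j).
lik1 = {"H1": 0.7, "H2": 0.2}
lik2 = {"H1": 0.5, "H2": 0.9}

def bayes_update(p, lik):
    """One application of Bayes' theorem: posterior ~ likelihood x prior."""
    norm = sum(lik[h] * p[h] for h in p)
    return {h: lik[h] * p[h] / norm for h in p}

# Sequential revision: condition on D1 first, then on D2.
sequential = bayes_update(bayes_update(prior, lik1), lik2)

# Single-stage revision: treat D1 n D2 as one piece of data.
lik12 = {h: lik1[h] * lik2[h] for h in prior}
batch = bayes_update(prior, lik12)

print(sequential, batch)
```

As the text argues, the two routes must agree; the recursion simply reuses each stage's posterior as the next stage's prior.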
In the case of two hypotheses, H1, H2, the judgement of conditional independence for D1, D2, ..., given H1 or H2, enables us to provide an alternative description of the process of revising beliefs by noting that, on this basis, we can summarise the learning process (in "favour" of H1) as follows:

posterior odds = prior odds x likelihood ratio.

With due regard to the relative nature of the terms prior and posterior, this summary applies at every stage of the learning process. In Section 2.6, we shall examine in more detail the key role played by the sequential revision of beliefs in the context of complex, sequential decision problems.

2.5 ACTIONS AND UTILITIES

2.5.1 Bounded Sets of Consequences

At the beginning of Section 2.2, we argued that choices among options are governed, in part, by the relative degrees of belief that an individual attaches to the uncertain events involved in the options. It is equally clear that choices among options should also depend on the relative values that an individual attaches to the consequences flowing from the events. The measurement framework of Axiom 5(i) provides us with a direct, intuitive way of introducing a numerical measure of value for consequences. Before we do this, however, we need to consider a little more closely the nature of the set of consequences C. The following special case provides a useful starting point for our development of a measure of value for consequences.

Definition 2.14. (Extreme consequences). The pair of consequences c_* and c^* are called, respectively, the worst and the best consequences in a decision problem if, for any other consequence c ∈ C, c_* ≤ c ≤ c^*.

It could be argued that all real decision problems actually have extreme consequences; in this connection, we recall that all consequences are to be thought of as relevant consequences in the context of the decision problem. This eliminates pathological, mathematically motivated choices of C, which could be constructed in such a way as to rule out the existence of extreme consequences. On the other hand, for conceptual and mathematical convenience in modelling decision problems involving monetary consequences, C is often taken to be the real line R or, in a no-loss situation with current assets k, the interval [k, ∞). Such C's would not contain both a best and a worst consequence.
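The odds form of Bayes' theorem summarised above is easy to sketch numerically. The following fragment is illustrative only; the hypotheses and the particular numbers are assumptions, not taken from the text.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    # posterior odds (H1 vs H2) = prior odds x likelihood ratio
    return prior_odds * likelihood_ratio

# illustrative numbers: P(H1)/P(H2) = 1/4, and a datum D with
# P(D | H1) = 0.9 and P(D | H2) = 0.3, i.e. likelihood ratio 3
odds = posterior_odds(0.25, 0.9 / 0.3)
prob_h1 = odds / (1 + odds)   # convert odds back to a probability for H1
```

Because the updated odds can serve as prior odds for the next datum, repeated application of `posterior_odds` implements exactly the sequential revision of beliefs discussed in Section 2.6.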
On the other hand, despite the force of the pragmatic argument that extreme consequences always exist, it must be admitted that insisting upon problem formulations which satisfy this assumption can sometimes lead to rather tedious complications of a conceptual or mathematical nature. Consider, for example, a medical decision problem for which the consequences take the form of different numbers of years of remaining life for a patient. Assuming that more value is attached to longer survival, it would appear rather difficult to justify any particular choice of realistic upper bound: to choose a particular c^* would be tantamount to regarding c^* years as a realistic possible survival time, but c^* + 1 years as impossible! In such cases, even though we believe an upper bound to exist, it is attractive to have available the possibility of dealing with sets of consequences not possessing extreme elements, even though such sets clearly do not correspond to concrete, practical problems (and the same is true of many problems involving monetary consequences). For this reason, we shall also deal (in Section 2.5.3) with the situation in which extreme consequences are not assumed to exist. In the next section, we consider the solution to decision problems for which extreme consequences are assumed to exist.

2.5.2 Bounded Decision Problems

Let us consider a decision problem (E, C, A, ≤) for which extreme consequences c_* < c^* are assumed to exist. We shall refer to such decision problems as bounded.

Definition 2.15. (Canonical utility function for consequences). Given a preference relation ≤, the utility u(c) = u(c | c_*, c^*) of a consequence c, relative to the extreme consequences c_* < c^*, is the real number μ(S) associated with any standard event S such that c ~ {c^* | S, c_* | S^c}. The mapping u : C → R is called the utility function.

It is important to note that the definition of utility only involves comparisons among consequences and options constructed with standard events. Since the preference pattern among consequences is unaffected by additional information, we would expect the utility of a consequence to be uniquely defined and to remain unchanged as new information is obtained. This is indeed the case.

Proposition 2.21. (Existence and uniqueness of bounded utilities). For any bounded decision problem (E, C, A, ≤) with extreme consequences c_* < c^*:
(i) for all c, the value u(c | c_*, c^*) exists and is unique;
(ii) the value of u(c | c_*, c^*) is unaffected by the assumed occurrence of an event G > ∅;
(iii) 0 = u(c_* | c_*, c^*) ≤ u(c | c_*, c^*) ≤ u(c^* | c_*, c^*) = 1.
Proof. (i) Existence follows immediately from Axiom 5(i). For uniqueness, suppose that c ~ {c^* | S1, c_* | S1^c} and c ~ {c^* | S2, c_* | S2^c}; then, by transitivity, {c^* | S1, c_* | S1^c} ~ {c^* | S2, c_* | S2^c} and, using the coherence and quantification principles set out in Section 2.3, the result now follows from Axiom 4(i), which gives μ(S1) = μ(S2).
(ii) To establish this, note that if c ~ {c^* | S1, c_* | S1^c} then, for any G > ∅, we may choose a standard event S2, independent of G, such that, given G, c ~ {c^* | S2, c_* | S2^c}; by Definition 2.15 and Axiom 4(i), μ(S2) = μ(S1), and so the utility of c given G is just the original value μ(S1).
(iii) Finally, let c_* ≤ c ≤ c^*. Since c_* ~ {c^* | ∅, c_* | Ω} and c^* ~ {c^* | Ω, c_* | ∅}, we have u(c_* | c_*, c^*) = μ(∅) = 0 and u(c^* | c_*, c^*) = μ(Ω) = 1. It then follows, using Axiom 3(ii), that 0 ≤ u(c | c_*, c^*) ≤ 1 for any c.

It is interesting to note that u(c | c_*, c^*), which we shall often simply denote by u(c), can be given an operational interpretation in terms of degrees of belief. Indeed, if we consider a choice between the fixed consequence c and the option {c^* | E, c_* | E^c}, for some event E, then the utility of c can be thought of as defining a threshold value for the degree of belief in E, in the sense that values greater than u would lead an individual to prefer the uncertain option, whereas values less than u would lead the individual to prefer c for certain. The value u itself corresponds to indifference between the two options and is the degree of belief in the occurrence of the best, rather than worst, consequence. This suggests one possible technique for the experimental elicitation of utilities, a subject which has generated a large literature (with contributions from economists and psychologists, as well as from statisticians). We shall illustrate the ideas in Example 2.3.

In the preceding sections, we have seen how numerical measures can be assigned to two of the elements of a decision problem, in the form of degrees of belief for events and utilities for consequences. It remains now to investigate how an overall numerical measure of value can be attached to an option, whose form depends both on the events of a finite partition of the certain event Ω and on the particular consequences to which these events lead.

Definition 2.16. (Conditional expected utility). For any bounded decision problem with extreme consequences c_* < c^*, any option a = {c_j | E_j, j ∈ J} and any event G > ∅,

ū(a | c_*, c^*, G) = Σ_{j ∈ J} u(c_j | c_*, c^*) P(E_j | G)

is the expected utility of the option a, with respect to the extreme consequences c_* and c^*, given G. If G = Ω, we shall simply write ū(a | c_*, c^*) in place of ū(a | c_*, c^*, Ω).
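Definition 2.16 is straightforward to evaluate once the utilities and degrees of belief have been listed. The sketch below is illustrative only: the partition, the probabilities and the utilities are all invented numbers, and the final line simply ranks two options by their expected utilities.

```python
def expected_utility(option, prob, utility):
    # option: list of (consequence, event) pairs over a finite partition;
    # ubar(a | G) = sum_j u(c_j) P(E_j | G)
    return sum(utility[c] * prob[e] for c, e in option)

# illustrative three-event partition and utilities (all numbers assumed)
prob = {"E1": 0.5, "E2": 0.3, "E3": 0.2}
utility = {"good": 1.0, "fair": 0.6, "bad": 0.0}
a1 = [("good", "E1"), ("bad", "E2"), ("fair", "E3")]
a2 = [("fair", "E1"), ("fair", "E2"), ("bad", "E3")]

best = max((a1, a2), key=lambda a: expected_utility(a, prob, utility))
```

Here ū(a1) = 0.5 + 0 + 0.12 = 0.62 and ū(a2) = 0.3 + 0.18 + 0 = 0.48, so a1 is preferred, in accordance with the decision criterion established in this section.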
Proposition 2.22. (Decision criterion for a bounded decision problem). For any bounded decision problem with extreme consequences c_* < c^*, and any event G > ∅,

a1 ≤ a2 if and only if ū(a1 | c_*, c^*, G) ≤ ū(a2 | c_*, c^*, G).

Proof. Let a_i = {c_ij | E_ij, j ∈ J}, i = 1, 2. By Definition 2.15, for each c_ij there exists a standard event S_ij such that c_ij ~ {c^* | S_ij, c_* | S_ij^c}, with u(c_ij | c_*, c^*) = μ(S_ij). By Axioms 5(ii) and 4(iii), and Proposition 2.13, the option a_i, considered given G, may be written as {c^* | A_i, c_* | A_i^c}, where A_i = ∪_j (E_ij ∩ G ∩ S_ij). Hence, by Propositions 2.14(ii) and 2.15,

a1 ≤ a2 iff P(A1 | G) ≤ P(A2 | G),

and, since the standard events S_ij are independent of the E_ij and of G,

P(A_i | G) = Σ_j μ(S_ij) P(E_ij | G) = Σ_j u(c_ij | c_*, c^*) P(E_ij | G) = ū(a_i | c_*, c^*, G),

so that a1 ≤ a2 iff ū(a1 | c_*, c^*, G) ≤ ū(a2 | c_*, c^*, G).

The result just established is sometimes referred to as the principle of maximising expected utility. In our development, this is clearly not an independent "principle", but rather an implication of our assumptions and definitions. In summary form, the resulting prescription for quantitative, coherent decision-making is: choose the option with the greatest expected utility. Technically, if the utility value of an option is considered as a "random quantity", then ū is simply the expected value of that utility when the probabilities of the events are considered conditional on G; in the language of mathematical probability theory (see Chapter 3), ū(a | c_*, c^*, G) is the expected utility of the option a.

Proposition 2.22 merely establishes a complete ordering of the options considered and does not guarantee the existence of an optimal option for which the expected utility is a maximum. However, in most (if not all) concrete, practical problems the set of options considered will be finite, and so a best option (not necessarily unique) will exist. In more abstract mathematical formulations, the existence of a maximum will depend on analytic features of the set of options and on the utility function u : C → R.
Example 2.3. (Utilities of oil wildcatters). One of the earliest reported systematic attempts at the quantification of utilities in a practical decision-making context was that of Grayson (1960), whose decision-makers were oil wildcatters engaged in exploratory searches for oil and gas. The consequences of drilling decisions and their outcomes are ultimately changes in the wildcatters' monetary assets, and Grayson's work focuses on the assessment of utility functions for this latter quantity. For the purposes of illustration, suppose that we restrict attention to changes in monetary assets ranging, in units of one thousand dollars, from -150 (the worst consequence) to +825 (the best consequence).

Assuming u(-150) = 0 and u(825) = 1, the development given above suggests ways in which we might try to elicit an individual wildcatter's values of u(c) for various c in the range -150 < c < 825. For example, one could ask the wildcatter, for some specified p, which option he or she would prefer out of the following: (i) c for sure; (ii) entry into a venture having outcome 825 with probability p and outcome -150 with probability 1 - p. One could then perform such an interrogation for various p until an "indifference" value p_c emerges. The theory developed above suggests that, for a coherent individual,

u(c) = p_c u(825) + (1 - p_c) u(-150) = p_c.

Repeating this procedure for a range of values of c provides a series of (c, p_c) pairs, from which a "picture" of u(c) over the range of interest can be obtained. An alternative procedure, of course, would be to fix p and, using a series of values of c, perform the interrogation until an indifference value of c is identified; repeating this exercise for a range of values of p then provides a series of (c, p) pairs.

Figure 2.3 William Beard's utility function for changes in monetary assets (utility plotted against thousands of dollars)
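The interrogation procedure of Example 2.3 can be mimicked programmatically. In the sketch below (illustrative code, not from the text), the respondent is simulated by an assumed "true" utility function and is coherent, preferring the gamble exactly when p exceeds the unknown indifference value; bisection on p then recovers u(c) = p_c. The functional form of `true_u` is an arbitrary assumption chosen only to be increasing and risk-averse.

```python
def elicit_indifference(prefers_gamble, tol=1e-6):
    # prefers_gamble(p): True if the subject prefers the venture
    # {825 w.p. p, -150 w.p. 1 - p} to c for sure; bisection on p
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        p = (lo + hi) / 2
        if prefers_gamble(p):
            hi = p       # indifference value lies below p
        else:
            lo = p
    return (lo + hi) / 2

def true_u(c):
    # assumed respondent utility: u(-150) = 0, u(825) = 1, concave in between
    return ((c + 150) / 975) ** 0.5

# a coherent respondent prefers the gamble iff p > u(c)
p_c = elicit_indifference(lambda p: p > true_u(100))
```

The recovered p_c approximates true_u(100), exactly as the identity u(c) = p_c requires.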
In general. In particular. we shall provide an extension of these ideas to more general decision problems where extreme consequences are not assumed to exist.4 0 . 1 Our restricted definition of utility (Definition 2.1.r ): ifcl 5 c 5 c ? and I' { r 2 1 S . ('2 are to define a utility scale by being assigned values 0.3 shows the results obtained by Grayson using procedures of this kind to interrogate oil company executive. Definition 2.22 guarantees that preferences among options are invariant under changes in the origin and scale of the utility measure used. there may be bounded decision problems where the probabilistic interpretation discussed above makes it desirable to work in terms of ~ ~ ~ i n o t ~ i c ~ d utilities. A "picture" of Beard's utility function clearly emerges from the empirical data. ) c. on October 23. is the tnensirre crssociated with the ..e. we cannot then assume that rl 5 I' 5 c2 for all c. H so that the orientation of "best" and "worst" is not changed.then 11 = . 5 t' 5 vw for all c E C. 2.rhen ii = . .1 then I I = 1/.2 1 S. cI . However. S'} . relutiw to the conseqitences < ~2~ is defined to he the r e d nuinher u such tlicrr (f c < ('1 and 1'1 { f.54 2 Founddons Figure 2. the utility function reflects considerable risk aversion. IS) relied on the existence of extreme consequences v. derived by rcferring to the best and worst consequences. Beard. Since the expected utility Ti is a linear combination of values of the utility function. r / ( 1 . in the sense that even quite small asset losses lead to large (negative)changes in utility compared with the (positive) changes associated with asset gains. i f c :> C? and r? { r I S6. such that r .17. where . (General utility function). therefore. The definition is motivated by a desire to maintain the linear features of the utility function obtained in the case where c: +(( the utility 11  ... . i.. 
and we can simply refer to the expected utility of an option without needing to specify the (positive linear) transformation of the utility function which has been used.. and this means that if q .stand(ird ebwt S . .. 1957.!Y'~}. provided we take .sequct~~~es. 1. I PI. . However. Given (I preference relation 5. W. In the next section. . 1 1. respectively..) i . such an origin and scale can be chosen for convenience in any given problem. invariant with respect to transformations of the form f l u ( . Proposition 2.(j I . we shall require negritiw assignments for c and assignments gremer t h m one for (.3 General Decision Problems We begin with a more general definition of the utility of a consequence which preserves the linear combination structure and the invariance discussed above. over the range concerned.r.5. c2 to play the role of c.c ofcr ~~onseq~ience . In the absence of this assumption. c' I . we have to select some reference ~~on. r?. c * . c.I' = / i ( S. .
The following result extends Proposition 2.21 to the general utility function defined above.

Proposition 2.23. (Existence and uniqueness of utilities). For any decision problem and for any pair of consequences c1 < c2:
(i) for all c, the value u(c | c1, c2) exists and is unique;
(ii) the value of u(c | c1, c2) is unaffected by the occurrence of an event G > ∅;
(iii) u(c1 | c1, c2) = 0 and u(c2 | c1, c2) = 1.

Proof. If c1 ≤ c ≤ c2, the argument of Proposition 2.21 applies directly, with c1 and c2 in the roles of c_* and c^*. If c < c1 then, by Axiom 5(ii), there exists a standard event S_x such that c1 ~ {c2 | S_x, c | S_x^c} and, by Definition 2.17, u(c | c1, c2) = -x/(1 - x), where x = μ(S_x). Similarly, if c > c2 there exists a standard event S_y such that c2 ~ {c | S_y, c1 | S_y^c} and, by Definition 2.17, u(c | c1, c2) = 1/y, where y = μ(S_y). In each case, existence, uniqueness and the irrelevance of the assumed occurrence of G > ∅ follow as in the proof of Proposition 2.21, and (iii) is immediate from Definition 2.17.

The following result guarantees that the utilities of consequences are linearly transformed if the pair of consequences chosen as a reference is changed.

Proposition 2.24. (Linearity). For all c1 < c2 and c3 < c4, there exist A > 0 and B such that, for all c,

u(c | c3, c4) = A u(c | c1, c2) + B.

Proof. Suppose first that c3 ≤ c1 < c2 ≤ c4 and c1 ≤ c ≤ c2. Then there exists a standard event S_y such that c ~ {c2 | S_y, c1 | S_y^c} and hence, with y = μ(S_y),

u(c | c1, c2) = y,
u(c | c3, c4) = y u(c2 | c3, c4) + (1 - y) u(c1 | c3, c4),

which is of the required form, with A = u(c2 | c3, c4) - u(c1 | c3, c4) > 0 and B = u(c1 | c3, c4). It can be checked straightforwardly, by considering in turn each of the possible orderings of c relative to c1 and c2, and taking c_* and c^* to be, respectively, the minimum and maximum of {c1, c2, c3, c4, c}, that the same linear relation holds for all c. The general case of arbitrary pairs c1 < c2 and c3 < c4 follows by composing relations of this form: if u(c | c3, c4) = A1 u(c | c1, c2) + B1 and u(c | c3, c4) = A2 u(c | c5, c6) + B2, then u(c | c5, c6) = (A1/A2) u(c | c1, c2) + (B1 - B2)/A2, which is again of the required form.
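The linearity property just established can be checked numerically: rescaling every utility by u ↦ Au + B with A > 0 leaves both the expected utilities related by the same affine map and the ranking of options unchanged. All numbers below are invented for illustration.

```python
def ubar(option, prob, u):
    # expected utility of an option under utility assignment u
    return sum(u[c] * prob[e] for c, e in option)

prob = {"E1": 0.4, "E2": 0.6}
# general utilities: note the values below 0 and above 1, as in Definition 2.17
u = {"c1": 0.0, "c2": 1.0, "c3": -0.5, "c4": 2.0}
a1 = [("c3", "E1"), ("c4", "E2")]
a2 = [("c2", "E1"), ("c1", "E2")]

A, B = 3.0, -7.0               # any A > 0 gives the same preference ordering
v = {c: A * x + B for c, x in u.items()}

same_order = (ubar(a1, prob, u) > ubar(a2, prob, u)) == \
             (ubar(a1, prob, v) > ubar(a2, prob, v))
```

Since ū is linear in u, ū(a | v) = A ū(a | u) + B holds exactly, which is why the ordering of options cannot change.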
Finally, we generalise Proposition 2.22 to unbounded decision problems.

Proposition 2.25. (General decision criterion). For any decision problem, any pair of consequences c1 < c2, and any event G > ∅,

a1 ≤ a2 if and only if ū(a1 | c1, c2, G) ≤ ū(a2 | c1, c2, G).

Proof. Suppose a_i = {c_ij | E_ij, j ∈ J}, i = 1, 2, and let c_*, c^* be such that c_* ≤ c_ij ≤ c^* for all i, j. Then, by Proposition 2.24, there exist A > 0 and B such that u(c | c1, c2) = A u(c | c_*, c^*) + B for all such c, and hence ū(a_i | c1, c2, G) = A ū(a_i | c_*, c^*, G) + B. But, by Proposition 2.22, a1 ≤ a2 iff ū(a1 | c_*, c^*, G) ≤ ū(a2 | c_*, c^*, G), and so the result follows.

An immediate implication of Proposition 2.25 is that all options can be compared among themselves. We recall that we did not directly assume that comparisons could be made between all pairs of options (an assumption which is often criticised as unjustified; see, for example, Fine, 1973, p. 221). Instead, we merely assumed that all consequences could be compared among themselves and with the (very simply structured) standard dichotomised options, and that the latter could be compared among themselves. Starting from the primitive notion of preference, we have shown that quantitative, coherent comparisons of options must proceed as if a utility function had been assigned to consequences and probabilities to events, with the choice of an option made on the basis of maximising expected utility; this induces, in turn, a preference ordering which is necessarily coherent. This completes our elaboration of the axiom system set out in Section 2.3.

If we begin by defining a utility function u : C → R over the consequences, any function can serve as a utility function (subject only to the existence of the expected utility for each option, a problem which does not arise in the case of finite partitions) and the choice is a personal one. In some contexts, however, there are further formal considerations which may delimit the form of function chosen. An important special case is discussed in detail in Section 2.7.

2.6 SEQUENTIAL DECISION PROBLEMS

2.6.1 Complex Decision Problems

Many real decision problems would appear to have a more complex structure than that encapsulated in Definition 2.1, involving successive, interdependent decisions. For instance, in the fields of market research and production engineering, investigators often consider first whether or not to run a pilot study and only then, in the light of information obtained (or on the basis of initial information if the study is not undertaken), decide which of the major options to adopt. Such a two-stage process provides a simple example of a sequential decision problem.
In this section, we shall demonstrate that complex problems of this kind can be solved with the tools already at our disposal, thus substantiating our claim that the principles of quantitative coherence suffice to provide a prescriptive solution to any decision problem. Before explicitly considering sequential problems, we shall review, in a sequential form, some of our earlier developments.

Let A = {a_i, i ∈ I} be the set of alternative actions we are willing to consider. For each a_i, there is a class {E_ij, j ∈ J} of exhaustive and mutually exclusive events which label the possible consequences {c_ij, j ∈ J} which may result from action a_i. Let M0 be our initial state of information and let G > ∅ be additional information obtained subsequently. With this notation, the main result of the previous section (Proposition 2.25) may be restated as follows. For behaviour consistent with the principles of quantitative coherence, action a1 is to be preferred to action a2, given M0 and G, if and only if

Σ_j u(c_1j) P(E_1j | a1, G, M0) > Σ_j u(c_2j) P(E_2j | a2, G, M0),

where u(c_ij) is the value attached to the consequence foreseen if action a_i is taken and the event E_ij occurs, and P(E_ij | a_i, G, M0) is the degree of belief in the occurrence of event E_ij, conditional on action a_i having been taken and on the state of information (G, M0). We recall that the probability measure used to compute the expected utility is taken to be a representation of the decision-maker's degree of belief conditional on the total information available.

By using the extended notation P(E_ij | a_i, G, M0), rather than the more economical P(E_ij | G) used previously, we are merely emphasising the obvious dependence of both the consequences and the events on the action from which they result. Note that, with this notation, we are emphasising that: (i) the actual events considered may depend on the particular action envisaged; (ii) the information available certainly includes the initial information M0 together with G > ∅; and (iii) degrees of belief in the occurrence of events such as E_ij are understood to be conditional on action a_i having been taken, so that the possible influence of the decision-maker on the real world is taken into account.

For any action a_i, it is sometimes convenient to describe the relevant events E_ij, conditional on a_i having been taken, using a more detailed notation. For example, in considering the relevant events which label the consequences of a surgical intervention for cancer, one may first
think of whether the patient will survive the operation and then, conditional on survival, whether or not the tumour will eventually reappear were this particular form of surgery to be performed. It is also usually the case, in practice, that it is easier to elicit the relevant degrees of belief conditionally: a probability of the form P(E_j ∩ F_k | a_i, G, M0) would often be best assessed by combining the separately assessed terms P(F_k | E_j, a_i, G, M0) and P(E_j | a_i, G, M0), since

P(E_j ∩ F_k | a_i, G, M0) = P(F_k | E_j, a_i, G, M0) P(E_j | a_i, G, M0).

Conditional analysis of this kind is usually necessary in order to understand the structure of complicated situations. Consider, for instance, the problem of placing a bet on the result of a race, after which the total amount bet is to be divided up among those correctly guessing the winner. The prize received depends on the bet you place (a_i), the related betting behaviour of other people (E_j), and the outcome of the race (F_k). These situations are most easily described diagrammatically using decision trees, such as that shown in Figure 2.4, with as many successive random nodes as necessary.

It may appear at first sight that this is a decision problem where the utilities involved in an action (the possible prizes to be obtained from a bet) depend on the probabilities of the corresponding uncertain events (the possible winning horses), a possibility not contemplated in our structure. Clearly, if we bet on the favourite we have a higher probability of winning but, if the favourite wins, many people will have guessed correctly and the prize will be small. A closer analysis reveals, however, that this does not represent any formal departure from our previous structure, since the problem can be restated with a single random node where the relevant events are defined in terms of the appropriate intersections E_j ∩ F_k. It is only natural to assume that our degree of belief in the possible outcomes of the race may be influenced by the betting behaviour of other people, so that P(E_j ∩ F_k | a_i, G, M0) would be assessed by combining P(F_k | E_j, a_i, G, M0) and P(E_j | a_i, G, M0). This conditional analysis straightforwardly resolves the initial, apparent complication.
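The conditional assessment strategy just described, P(E ∩ F | a) = P(F | E, a) P(E | a), can be sketched for the betting example. The events and all numbers below are invented for illustration.

```python
# degrees of belief for the race example (all numbers assumed):
# E = how many others back the favourite; F = whether the favourite wins
p_E = {"many": 0.6, "few": 0.4}            # P(E | a)
p_F_given_E = {"many": 0.55, "few": 0.45}  # P(favourite wins | E, a)

# joint degrees of belief over the single combined random node E ∩ F
p_joint = {(e, "win"): p_E[e] * p_F_given_E[e] for e in p_E}
p_joint.update({(e, "lose"): p_E[e] * (1 - p_F_given_E[e]) for e in p_E})
```

The four joint probabilities sum to one, and the problem can now be analysed with a single random node, exactly as the text indicates.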
We now turn to considering sequences of decision problems. We shall consider situations where, after an action has been taken and its consequences observed, a new decision problem arises: for example, when the consequences of a given medical treatment have been observed, a physician has to decide whether to continue the same treatment, to change to an alternative treatment, or to declare the patient cured. In colloquial terms, we typically cannot decide what to do today without thinking first of what we might do tomorrow, which, of course, will typically depend on the possible consequences of today's actions.

If a decision problem involves a succession of decision nodes, the situation may be diagrammatically described by means of a decision tree like that of Figure 2.5, and it is intuitively obvious that the optimal choice at the first decision node depends on the optimal choices at the subsequent decision nodes. In the next section, we consider a technique, backward induction, which makes it possible to solve these problems within the framework we have already established.

2.6.2 Backward Induction

In any actual decision problem, the number of scenarios which may be contemplated at any given time is necessarily finite and, at each node, the possibilities are finite in number. Moreover, bearing in mind that the analysis is only strictly valid under certain fixed general assumptions, and that we cannot seriously expect these to remain valid for an indefinitely long period, we should be able to define a finite horizon, after which no further decisions are envisaged in the particular problem formulation. Consequently, the number of decision nodes to be considered in any given sequential problem will be assumed to be finite.

Figure 2.5 Decision tree with several decision nodes

Let n be the number of decision stages considered and let a^(m) denote an action being considered at the mth stage. Using the notation for composite options
and u( . if GI.1)th stage by maximizing .:’) and It is now apparent that sequential decision problems are a special case of the general framework which we have developed. and the process continued until we reach the rith stage.) is the (generalised) utility function. Formally. we may write where This means that one has to first solve the final (nth) stage. secondstage options may be written in terms of thirdstage options.} from which UY (*(it?choose that option which is preferred on the basis of our pattern of preferences at that stage. occur is that we are confronted with a set of options { ul!’. to occur. The “maximisation” is naturally to be understood in the sense of our conditional preference ordering among the available secondstage options.25 that.2. jE J(”)} is the partition of relevant events which corresponds to the notation “max a.. we have a..k E K(.Indeed. given the Occurrence of EIJ. then one has to solve the ( 1 ) . consisting of “ordinary” options defined in terms of the events and consequences to which they may lead.. all firststage actions may be compactly described in the form where { E I J .60 2 Fiwdurions introduced in Section 2. k E Iil. at each stage r r i .is the relevant information available.} which we would be confronted with were the event E. It follows from Proposition 2. by maximising the appropriate expected utility. Similarly.::?’’‘refers to the musf preferred of the set of options {(if’. the “consequence” of choosing u:li and having E.
. However. n.6 Sequential Decision Problems 61 the expected utility conditional on making the optimal choice at the nth stage. 1 5 I/ I r + I . 5 7’.. or staff appointment in a skill shortage area when a job offer is required immediately after interview. As with the “principle” of maximising expected utility. .r. the inspector’s utility for these consequences. r . in order to select one of them. ofcourse. 1 5 I’ 5 I ) . let r‘. At each stage. Now suppose that t’ < n objects have been inspected and that the relative rank among these of the object under current inspection is . and the knowledge that the 78 objects are being presented in a completely random order.y being. I . where I/. Less romantically.. as a result. we see that this is not required as a further assumed “principle” in our formulation. by Ti. Suppose further that. When should the inspection process be terminated‘? Intuitively. to simplify notation. repent at leisure” warns against settling down too soon: but such hesitations have to be balanced against painful future realisations of missed golden opportunities. 1957). The information available at the r t h stage is G..). receiving. where 1 5 . reThis quirement is usually known as Bellmun ‘s oprimulity principle (Bellman.. of . such as property purchase in a limited seller’s market when a bid is required immediately after inspection. We now consider a famous problem. working backwards progressively. ) = u ( i ) . at any stage of the procedure. but is simply a consequence of the principles of quantitative coherence. = i if the eventual object chosen has rank i out of all YI objects. Suppose that a specitied number of objects I I 1 2 are to be inspected sequentially. if the inspection process goes on too long there is a good chance that the overall preferred object will already have been encountered and passed over. the continuation of the procedure must be identical to the optimal procedure starting at the mth stage with information G. 1.. 
the inspector has the option of either stopping the inspection process. We shall denote by u ( c . I’ + 1). equally likely since the I I objects are inspected in a random order.I/. = (. This kind of dilemma is inherent in a variety of practical problems. all values . .4. if the inspector stops too soon there is a good chance that objects more preferred to those seen so far will remain uninspected. denote the possible consequencesof the inspection process. which is usually referred to in the literature as the “marriage problem” or the “secretary problem”. . given G. one at a time. say. Example 2. r ) . (An optimal sfopping problem). i = 1. again with no backtracking possibilitiesthis stopping problem has been suggested as a model for choosing a mate. = ( . or of continuing the inspection process with the next object.. the information available at the ( r + 1)th stage would be G . is the rank of the next object relative to the I’ + 1 then inspected. a procedure often referred to as gynamic programming. . If we denote the expected utility of stopping. the mth. r . the object currently under inspection. with c.. a2 = continue (where. N o backtracking is permitted and if the inspection process has not terminated before the nth stage the outcome is that the trth object is received.2. n . This process of hcrchwdinduction satisfies the requirement that. until the optimal first stage option has been obtained.r. and so on. r ) and the expected utility . . There are two actions available at the rth stage: n l = stop. i = 1.(. we have dropped the superscript. Potential partners are encountered sequentially: the proverb “marry in haste. r=worst) of the current object among those inspected so far. the only information available to the inspector is the relative rank ( I =best. at any stage r . More exoticallyand assuming a rather egocentric inspection process.
r For illustration. the general development given above establishes . suppose that the inspector's preference ordering corresponds to a"nothing but the best" utility function.SCPII sofur. whose account is based closely on Lindley ( 19613)..62 2 Foundations of acting optimally.(r. that where r Values of QI(. If I. see Freeman ( 1083) itnd Ferpuson (1989).) = . see DeGroot (1970. / I .' is the smallest positive integer for which I T Y 1 i / I .1 I / .r. r ) = 0. ..r > 1. t t . The optimal procedure is then . " ' 1 5 1'. The decision as to whether to stop if . I I : thus. continue until the object under inspection is the best s o far. 1 . . ) .' objects have been inspected: (ii) if the r'th object is the best so far. .* It ii. stop. then htop (\topping in any case if the u t h stage is rcachcd). 1. given C.r.should never be terminated ifthe cwrent ubje1.r = 1 is determined from the equation which is easily verified by induction. It is then easy to show that I' //. .1 This implies that inspection . It' I ) is large. I' = 1. approximation of the sum in the ithove inequality by an integral readily .(. defined by I J ( I ) = I.r ) .r. I.r = 2 . (iii) otherwise. . .0. 1 .r ) . if .r ) > E. . I. 1 I. . . &(.(.by lil. yields the approximation I" z I / / (For further details. For reviews 0 1 I'iirthcr.seen to be: (i) continue if li0(a. Chapter 13). the optimal procedure is detined as follows: ( i ) continue until at least 1. ./' = 2. ) > E.l is not thc best .(I.r..).(. . .r.). (ii) stop if ull(. . related work on this fascinating problem.(.r. i r ( . ) = ii. ) can be found from the final condition and the technique of backwards induction. .z.
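The backwards induction of Example 2.4 is easy to carry out numerically. The sketch below (illustrative code, not from the text) computes ū(x, r) for the "nothing but the best" utility, reading off the first stage at which stopping on a best-so-far object becomes optimal; for large n this threshold is close to n/e, as the analysis indicates.

```python
def secretary_dp(n):
    # Backward induction for the "nothing but the best" stopping problem:
    # u(1) = 1, u(i) = 0 for i > 1. ubar[x] holds the optimal expected
    # utility at the current stage for relative rank x (1-indexed).
    ubar = [0.0] * (n + 2)
    ubar[1] = 1.0        # at stage n the inspector must stop; success iff x = 1
    first_stop = n
    for r in range(n - 1, 0, -1):
        # expected utility of continuing: the next object's relative rank
        # is uniform on {1, ..., r + 1}
        uc = sum(ubar[y] for y in range(1, r + 2)) / (r + 1)
        nxt = [0.0] * (n + 2)
        for x in range(1, r + 1):
            u0 = r / n if x == 1 else 0.0     # expected utility of stopping
            nxt[x] = max(u0, uc)
        if r / n >= uc:
            first_stop = r  # stopping on a best-so-far object is optimal from r on
        ubar = nxt
    return first_stop
```

The threshold returned by the dynamic programme agrees with the harmonic-sum condition Σ_{i=r*}^{n-1} 1/i ≤ 1, and for n = 1000 it lies within 1% of n/e.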
Readers who are suspicious of putting this into practice have the option. Usually. which embraces the topic usually referred to as the problem of experimental design. modifying earlier notation in order to be explicit about the elements involved. 2. which. We want to choose the “best” experiment. we denote by u( u.6. would produce a consequence having utility which. of staying at home and continuing their study of this volume. were event E to occur. we also have available .vperimenrcri design We must first choose an experiment e and. one of which is to be performed in order to provide information for use in a subsequent decision problem. The structure of this problem.6 Decision treefor e. in light of the data obtained. of course. may be diagrammatically described by means of a sequential decision tree such as that shown in Figure 2. take an action o. Sequential decision problems are now further illustrated by considering the important special case of situations involving an initial choice of experimental design. t‘. D. Figure 2.6 Sequential Decision Problems Applied to the problem of “choosing a mate”.6. thereafter ending the search as soon as one encounters someone better than anyone encountered thus far.2. very important example of a sequential problem is provided by the situation where we have available a class of experiments. E). and assuming that potential partners are the encountered uniformly over time between the ages of I6 and 60. above analysis suggests delaying a choice until one is at least 32 years old.3 Design of Experiments A simple..
For each pair $(e, D)$ we can therefore choose the best possible continuation: namely, that action $a^*$ which maximises the expression given above. The expected utility of the pair $(e, D)$ is then

$$\bar{u}(e, D) = \max_a \sum_j u(e, D, a, E_j) \, P(E_j \mid e, D, a).$$

We note that the possible sets of data obtainable may depend on the particular experiment performed, the set of available actions may depend on the results of the experiment performed, and the sets of consequences and labelling events may depend on the particular combination of experiment and action chosen; for notational convenience, we have not made these dependencies explicit. Indeed, in our subsequent development we will use a simplified notation which suppresses them, in order to centre attention on other, more important, aspects of the problem. Moreover, to simplify the notation and without danger of confusion, we shall always use $\bar{u}$ to denote an expected utility, although $\bar{u}(e, D)$, $\bar{u}(e)$, . . . are different functions defined on different spaces.

We have seen, in Section 2.6.2, that to solve a sequential decision problem we start at the last stage and work backwards. Given $\bar{u}(e, D)$, the (prior) expected utility of performing experiment $e$ is

$$\bar{u}(e) = \sum_i \bar{u}(e, D_i) \, P(D_i \mid e),$$

where $P(D_i \mid e)$ denotes the degree of belief attached to the occurrence of the data $D_i$, were $e$ the experiment chosen. We also have available the possibility, denoted by $e_0$ and referred to as the null experiment, of directly choosing an action without performing any experiment; $\bar{u}(e_0)$ is then the expected utility of performing no experiment and choosing that action which maximises the expected utility, given the information available at the stage when the action is to be taken. Naturally, an experiment $e$ is worth performing if and only if $\bar{u}(e) > \bar{u}(e_0)$. We are now in a position to determine the best possible experiment.

Proposition 2.26. (Optimal experimental design). The optimal action is to perform the experiment $e^*$ for which $\bar{u}(e^*) = \max_e \bar{u}(e)$, in the class of available experiments, provided that $\bar{u}(e^*) > \bar{u}(e_0)$; otherwise, the optimal action is to perform no experiment.

Proof. This is immediate from Proposition 2.25. ◁
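As a concrete numerical sketch of this backward-induction recipe (the hypotheses, actions, utilities and likelihoods below are invented for illustration; they are not from the text), we can compute $\bar{u}(e_0)$ and $\bar{u}(e)$ for a single candidate experiment:

```python
# Two hypotheses E1, E2; two terminal actions; one candidate experiment e
# with possible data D1, D2. All numbers are hypothetical.
prior = {"E1": 0.7, "E2": 0.3}                       # P(E_j)
utility = {("a1", "E1"): 100, ("a1", "E2"): 0,       # u(a, E)
           ("a2", "E1"): 40,  ("a2", "E2"): 60}
likelihood = {("D1", "E1"): 0.8, ("D1", "E2"): 0.3,  # P(D | E)
              ("D2", "E1"): 0.2, ("D2", "E2"): 0.7}

def best_expected_utility(beliefs):
    """max_a sum_j u(a, E_j) P(E_j): value of acting optimally under `beliefs`."""
    return max(sum(utility[(a, e)] * p for e, p in beliefs.items())
               for a in ("a1", "a2"))

u_e0 = best_expected_utility(prior)                  # act now: u_bar(e0)

u_e = 0.0                                            # u_bar(e), working backwards
for d in ("D1", "D2"):
    p_d = sum(likelihood[(d, e)] * prior[e] for e in prior)          # P(D | e)
    posterior = {e: likelihood[(d, e)] * prior[e] / p_d for e in prior}
    u_e += best_expected_utility(posterior) * p_d    # u_bar(e, D) P(D | e)

print(u_e0, round(u_e, 2))  # -> 70.0 74.2: here the experiment is worth performing
```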
. i. The opportuniry f loss which would be suffered i action a were taken and event E. E ~) u(aG. a:): where a:.EJ)p(EJ I . ) ~ ( IDe )~. be the optimal action given E. eo.~ ( a . E .E.) = m a x u ( a .. Definition 2. E j ) .eo.E. .eo. D ~ = ) C {. its expected value will provide an upper bound for the increase in utility which additional data about the El's could be expected to provide.19. and let a t . (Expected value of perfect informatton). under appropriate conditions. D. I e.).e.)} P ( E . occurred is l(a. J ~ J e.E. the optimal actions given D. were we to know the particular event EJ which will eventually occur.6 Sequential Decision Problems 65 It is often interesting to determine the value which additional information might have in the context of a given decision problem.(a:. provided by an experiment e is t v(e.18. E. D . this difference will measure. u(ai. the value of perfect information and. (The value of additional infomaation). such that. Let us therefore consider the optimal actions which would be available with perfect information. a.. a Then.the loss suffered by choosing any other action a will be "(a&).eo.... EJ). For a = a.). given EJ. eo. respectively. (i) The expected value of the & u D. for all E J . lGr It is sometimesconvenient to have an upper bound for the expected value u ( e ) of an experiment e. i.eo. D. and with no data. )u(a...E.. Definition 2..) = riiaxu(a. conditional on E. 01 the expected value of perfect information is then given by u*(eo) = JEJ I(aT.2. (ii) the expected value of an experiment e is given by v(e) = C v(e. the optimal action under prior information.)are. e o .. The expected value of the information provided by new data may be computed as the (posterior) expected difference between the utilities which correspond to optimal actions after and before the data have been obtained.e.
In many situations, the utility function $u(e, D, a, E)$ may be thought of as made up of two separate components: one is the (experimental) cost of performing $e$ and obtaining $D$; the other is the (terminal) utility of directly choosing $a$ and then finding that $E$ occurs. Often, the latter component does not actually depend on the preceding $e$ and $D$, so that, assuming additivity of the two components, we may write

$$u(e, D, a, E) = u(a, E) - c(e, D), \qquad c(e, D) \geq 0.$$

Moreover, the probability distributions over the events are often independent of the action taken. When these conditions apply, we can establish a useful upper bound for the expected value of an experiment in terms of the difference between the expected value of perfect information and the expected cost of the experiment itself.

Proposition 2.27. (Additive decomposition). If the utility function has the form

$$u(e, D_i, a, E_j) = u(a, E_j) - c(e, D_i), \qquad c(e, D_i) \geq 0,$$

and the probability distributions are such that $P(E_j \mid e, D_i, a) = P(E_j \mid D_i)$, then, for any available experiment $e$,

$$v(e) \leq v^*(e_0) - \bar{c}(e), \qquad \text{where} \quad \bar{c}(e) = \sum_i c(e, D_i) \, P(D_i \mid e)$$

is the expected cost of $e$.

Proof. Using Definitions 2.18 and 2.19, and the fact that $u(a^*_{D_i}, E_j) \leq \max_a u(a, E_j) = u(a^*_j, E_j)$, we have, for each $D_i$,

$$\sum_j \left\{ u(a^*_{D_i}, E_j) - u(a^*_0, E_j) \right\} P(E_j \mid D_i) \leq \sum_j l(a^*_0, E_j) \, P(E_j \mid D_i);$$

averaging over $P(D_i \mid e)$, noting that $\sum_i P(E_j \mid D_i) P(D_i \mid e) = P(E_j)$, and subtracting the expected cost which the additive decomposition attaches to $e$, we obtain $v(e) \leq v^*(e_0) - \bar{c}(e)$. ◁

It is important to bear in mind that the functions $\bar{u}(\cdot)$ and $v(\cdot)$, the optimal experiment $e^*$ and the bound just obtained all crucially depend on the (prior) probability distributions $\{P(D_i \mid e)\}$ and $\{P(E_j \mid D_i)\}$ involved.
thus establishing the fundamental relevance of these foundational arguments for statistical theory and practice.

2.7 INFERENCE AND INFORMATION

2.7.1 Reporting Beliefs as a Decision Problem

The results on quantitative coherence (Sections 2.2 to 2.5) establish that if we aspire to analyse a given decision problem in accordance with the axioms, we must represent degrees of belief about uncertain events in the form of a finite probability measure over $\mathcal{E}$, and values for consequences in the form of a utility function over $\mathcal{C}$. Options are then to be compared on the basis of expected utility. We have shown that the simple decision problem structure introduced in Section 2.2, and the tools developed in Sections 2.3 to 2.5, suffice for the analysis of complex, sequential problems which, at first sight, appear to go beyond that simple structure. In particular, we have seen that the important problem of experimental design can be analysed within the sequential decision problem framework; later in this chapter, we shall study in more detail the special case of experimental design in situations where data are being collected for the purpose of pure inference. We shall now use this framework to analyse the very special form of decision problem posed by statistical inference.

The probability measure which represents an individual's beliefs is conditional on his or her current state of information. Given the initial state of information described by $M_0$, and further information in the form of the assumed occurrence of a significant event $G$, we previously denoted such a measure by $P(\cdot \mid G)$. We now wish to specialise our discussion somewhat to the case where $G$ can be thought of as a description of the outcome of an investigation (typically a survey, or an experiment) involving the deliberate collection of data (usually, in numerical form). The event $G$ will then be defined directly in terms of the counts or measurements obtained, either as a precise statement, or involving a description of intervals within which the readings lie.
To emphasise the fact that $G$ characterises the actual data collected, we shall denote the event which describes the new information obtained by $D$. An individual's degree of belief measure over $\mathcal{E}$ will then be denoted $P(\cdot \mid D)$, representing the individual's current beliefs in the light of the data obtained (where, again, for notational convenience, we have suppressed the explicit dependence on $M_0$).

By way of preliminary clarification, let us recall from Section 2.1 (quoting Ramsey, 1926) that we distinguished two, possibly distinct, reasons for trying to think rationally about uncertainty. On the one hand, we know that our statements of uncertainty may be used by others in contexts representable within the decision framework. On the other hand, we noted that the inference, or inference statement, may sometimes be regarded as an end in itself, to be judged independently of any "practical" decision problem. It is this case that we wish to consider in more detail in this section. Our main purpose is therefore to demonstrate that the problem of reporting inferences is essentially a special case of a decision problem; in such situations, we thus have a formal justification for the main topic of this book, the study of models and techniques for analysing the way in which beliefs are modified by data. Indeed, many eminent writers have argued that basic problems of reporting scientific inferences do not fall within the framework of decision problems as defined in earlier sections:

. . . a considerable body of doctrine has attempted to explain, or rather to reinterpret, these (significance) tests on the basis of quite a different model, namely as means to making decisions in an acceptance procedure. The differences between these two situations seem to the author many and wide . . . (Fisher, 1956/1973);

Statistical inferences involve the data, a specification of the set of possible populations sampled and a question concerning the true population . . .
Decisions are based on not only the considerations listed for inferences, but also on an assessment of the losses resulting from wrong decisions . . . (Cox, 1958);

similar views are expressed by Lehmann (1959/1986). If views such as these were accepted, they would, of course, undermine our conclusion that problems concerning uncertainty are to be solved by revising degrees of belief in the light of new data in accordance with Bayes' theorem. Starting from the decision problem framework, however, we shall show that, even if an immediate decision problem does not appear to exist, the inference reporting problem can be regarded as falling within the general framework of Sections 2.2 to 2.6. Moreover, $P(\cdot \mid D)$ constitutes a complete encapsulation of the information provided by $D$ and, in conjunction with the specification of a utility function, provides all that is necessary for the calculation of the expected utility of any option and, hence, for the solution of any decision problem defined in terms of the frame of reference adopted. Thus, our conclusion holds.
Formalising the first sentence of the remark of Cox, given above, we note that there is always, at least implicitly, a class of exclusive and exhaustive "hypotheses", say $\{H_j, j \in J\}$, compatible with the information $D$, together with corresponding events $\{E_j, j \in J\}$ having the interpretation $E_j =$ "the hypothesis $H_j$ is true". We shall regard this set of hypotheses as equivalent to a finite partition of the certain event into events $\{E_j, j \in J\}$. The actions available to an individual are the various inference statements that might be made about the events $\{E_j, j \in J\}$; it is natural, therefore, to regard the set of possible inference statements as the class of probability distributions over $\{E_j, j \in J\}$. Corresponding to each inference statement and each $E_j$ there will be a consequence, consisting of the reported statement together with what actually turned out to be the case, the latter constituting the uncertain events corresponding to each action.

Conditional on some relevant data $D$, in addition to the initial information $M_0$, we know that our uncertainty about the $\{E_j, j \in J\}$ should be represented by $\{P(E_j \mid D), j \in J\}$. But there is nothing (so far) in this formulation which leads to the conclusion that the best action is to state one's actual beliefs: we know from our earlier development that options cannot be ordered without an (implicit or explicit) specification of utilities for the consequences. We shall consider this specification and its implications in the following sections.

2.7.2 The Utility of a Probability Distribution

We have argued above that the provision of a statistical inference statement about a set of "hypotheses", conditional on the available data $D$, may be precisely stated as a decision problem. A particular form of utility function for inference statements will be introduced, and it will then be seen that the idea of inference as decision leads to rather natural interpretations of commonly used information measures in terms of expected utility. Mathematical extensions will be discussed in Chapter 3.
In the discussion which follows, we shall only consider the case where the set of "hypotheses" $\{E_j, j \in J\}$ is a finite partition consisting of elements of $\mathcal{E}$. From a strictly realistic viewpoint, a pure inference problem may be described as one in which we seek to learn which of a set of mutually exclusive "hypotheses" ("theories", or "states of nature", or "model parameters") is true, and any such set is, in practice, finite, although it may be mathematically convenient to work as if this were not the case. The inference reporting problem can thus be viewed as one of choosing a probability distribution to serve as an inference statement: the action space $\mathcal{A}$ relates to the class $Q$ of conditional probability distributions over $\{E_j, j \in J\}$, where $P(\cdot \mid D)$ denotes our current degree of belief measure. To complete the basic decision problem framework, we need to acknowledge that each consequence consists of the record of what the individual put forward as an appropriate inference statement, together with what actually turned out to be the case.
.j E . which describes the "value" u.). we shall denote by the probability distribution which describes. so that (Proposition 2. j E . It seems natural t o assume that score functions should he smooth (in the intuitiw sense).. . A sc'orr&funcriori I I jhr probal>ility distrihitriotis q = { q. To avoid triviality.fittii. It then follows from Proposition 2.(q.. The action corresponding to the choice of q isdefinedas{(q. The mathematical condition impoed is ii simple and convenient representation of such sniwthnesb. there is no logical e requirement which forces an individual to report the probability distribution p which describes his or her personal beliefs. that all the EJ'sare significant given D. If this were not so.otitinuoiisly differenriable us n. were EJ to turn out to be the true "state of nature".E. Our next task is to investigate the properties which such a function should possess in order to describe a preference pattern which accords with what a scientific cornmunity ought to demand of an inference statement. . in the structure described so far. j E J } . consists of all pairs ( q .))EJ. all are compatible with the available data. we assume that none of the hypotheses is certain and that. since one would wish small changes in the reported distribution to prtduce only small changes in the obtained score. is. E. Wc complete the specification of this decision problem by inducing the preference ordering through direct specification of a utility function i t ( . The set of consequences C.I } d e j t i d over (I purtition { E. without loss of generality. L<J } to each pair ( q . the individual's ucruul beliefs about the alternative "hypotheses". This special class of utility functions is often referred to as the class of score functions (see also Section 2. we could simply discard any incompatible hypotheses. Thisjim~rioiis . representing the conjunctions of reported beliefs and true hypotheses. 
Throughout this section.I } is ( I nrirpying i which assigns a real tiimil>er i t { q . being true. is assumed to be the probability which.).8) since the functions describe the possible "scores" to be awarded to the individual as a "prize" for his or her "prediction". W emphasise again that. conditional again on the available data D. conditional on the available data D. Definition 2. (Score function). E J )of reporting the probability distribution q as the final inferential summary ofthe investigation. E J ) .5) Cn < EJ n D < D for all j E J.wid to be smooth if it is c. I7(iii) that each of the personal degrees of belief attached by the individual to the conflicting hypotheses given the data must be strictly positive.. in preference to any other probability distribution q in Q.rioii oj'rwli (1.70 2 hiindatiotis where q.20. an individual reports as the probability of EJ f H.
We have characterised the problem faced by an individual reporting his or her beliefs about conflicting "hypotheses" as a problem of choice among probability distributions over $\{E_j, j \in J\}$. This is a well specified problem, whose solution, in accordance with our development based on quantitative coherence, is to report that distribution $q$ which maximises the expected utility

$$\sum_{j \in J} u\{q, E_j\} \, P(E_j \mid D).$$

In order to ensure that a coherent individual is also honest, we need a form of $u(\cdot)$ which guarantees that this expected utility is maximised if, and only if, $q_j = p_j = P(E_j \mid D)$ for each $j$; otherwise, the individual's best policy could be to report something other than his or her true beliefs. It would seem reasonable that, in a scientific inference context, one should require a score function to have this property: whether a scientific report presents the inference of a single scientist, or a range of inferences purporting to represent those that might be made by some community of scientists, we should wish to be reassured that any reported inference could be justified as a genuine current belief. This motivates the following definition.

Definition 2.21. (Proper score function). A score function $u$ is proper if, for each strictly positive probability distribution $p = \{p_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$,

$$\sup_q \sum_{j \in J} u\{q, E_j\} \, p_j$$

is attained if, and only if, $q = p$, where the supremum is taken over the class $Q$ of all probability distributions $q = \{q_j, j \in J\}$ over $\{E_j, j \in J\}$.

Smooth, proper score functions have been successfully used in practice in the following contexts: (i) to determine an appropriate fee to be paid to meteorologists in order to encourage them to report reliable predictions (Murphy and Epstein, 1967); (ii) to score multiple choice examinations so that students are encouraged to assign, over the possible answers, probability distributions which truly describe their beliefs (de Finetti, 1965; Bernardo, 1981b); (iii) to devise general procedures to elicit personal probabilities and expectations (Savage, 1971; de Finetti, 1971); and (iv) to select best subsets of variables for prediction purposes in political or medical contexts (Bernardo and Bermúdez, 1985).

The simplest proper score function is the quadratic function (Brier, 1950; de Finetti, 1962), defined as follows.
Definition 2.22. (Quadratic score function). A quadratic score function for probability distributions $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$ is any function of the form

$$u\{q, E_j\} = A \Big( 2 q_j - \sum_{k} q_k^2 \Big) + B_j, \qquad A > 0.$$

Proposition 2.28. (A quadratic score function is proper).

Proof. We have to maximise, over $q$, the expected score

$$\sum_j u\{q, E_j\} \, p_j = A \Big( 2 \sum_j q_j p_j - \sum_k q_k^2 \Big) + \sum_j B_j p_j.$$

Taking derivatives with respect to the $q_j$'s and equating them to zero, we have the system of equations $2 p_j - 2 q_j = 0$, so that $q_j = p_j$ for all $j$. It is easily checked that this gives a maximum. ◁

Note that in the proof of Proposition 2.28 we did not need to use the condition $\sum_j q_j = 1$: this is a rather special feature of the quadratic score function. An alternative expression for the quadratic score function is given by

$$u\{q, E_j\} = A \Big( 1 - \sum_k \big( q_k - \delta_{jk} \big)^2 \Big) + B_j, \qquad \delta_{jk} = 1 \text{ if } k = j, \; 0 \text{ otherwise},$$

which makes explicit the role of a "penalty" equal to the squared euclidean distance from $q$ to a "perfect" prediction, one which attaches probability one to the event which actually occurs.

A further condition is required for score functions in contexts which we shall refer to as "pure inference problems", where the value of a distribution, if $E_j$ turns out to be true, is only to be assessed in terms of the probability it assigned to the actual outcome.

Definition 2.23. (Local score function). A score function $u$ is local if, for each element $q = \{q_j, j \in J\}$ of the class $Q$ of probability distributions over a partition $\{E_j, j \in J\}$, there exist functions $\{u_j(\cdot), j \in J\}$ such that

$$u\{q, E_j\} = u_j(q_j).$$

It is intuitively clear that the preferences of an individual scientist faced with a pure inference problem should correspond to the ordering induced by a local score function. The reason for this is that, in a "pure" inference problem, we are solely concerned with "the truth"; it is therefore natural that, if $E_j$ turns out to be true, the individual scientist should be assessed (i.e., scored) only on the basis of his or her reported judgement about the plausibility of $E_j$.
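A quick numerical check of Proposition 2.28 (taking $A = 1$, $B_j = 0$; the grid search is our own illustration, not from the text): the expected quadratic score under beliefs $p$ is maximised by reporting $q = p$.

```python
import itertools

p = (0.5, 0.3, 0.2)          # the reporter's actual beliefs

def quadratic_score(q, j):
    """u{q, E_j} = 2 q_j - sum_k q_k^2  (A = 1, B_j = 0)."""
    return 2 * q[j] - sum(x * x for x in q)

def expected_score(q):
    return sum(quadratic_score(q, j) * p[j] for j in range(len(p)))

# Search all reported distributions q on a 0.01 grid of the simplex.
grid = [(a / 100, b / 100, (100 - a - b) / 100)
        for a, b in itertools.product(range(101), repeat=2) if a + b <= 100]
best = max(grid, key=expected_score)
print(best, round(expected_score(best), 4))  # -> (0.5, 0.3, 0.2) 0.38
```

Honest reporting wins: the maximising grid point is exactly $p$, with expected score $\sum_j p_j^2 = 0.38$.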
This can be contrasted with the forms of "score" function which would typically be appropriate in more directly practical contexts. In stock control, for example, probability judgements about demand would usually be assessed in the light of the relative seriousness of under- and over-stocking, rather than by just concentrating on the belief previously attached to what turned out to be the actual level of demand. The situation described by a local score function is, of course, an idealised, limit situation, but one which seems, at least approximately, appropriate in reporting pure scientific research. In addition, later in this section we shall see that certain well-known criteria for choosing among experimental designs are optimal if, and only if, preferences are described by a smooth, proper, local score function. Note also that, in Definition 2.23, the functional form $u_j(\cdot)$ of the dependence of the score on the probability attached to the true $E_j$ is allowed to vary with the particular $E_j$ considered; by permitting different $u_j(\cdot)$'s for each $E_j$, we allow for the possibility that "bad predictions" regarding some "truths" may be judged more harshly than others.

Proposition 2.29. (Characterisation of proper local score functions). If $u$ is a smooth, proper, local score function for probability distributions $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$ which contains more than two elements, then it must be of the form

$$u\{q, E_j\} = A \log q_j + B_j, \qquad A > 0,$$

where the $B_j$'s are arbitrary constants.

Proof. Since $u(\cdot)$ is local and proper, writing $p = \{p_1, p_2, \ldots\}$ and $q = \{q_1, q_2, \ldots\}$, with $\sum_j q_j = 1$, we must have

$$\sup_q \sum_j u\{q, E_j\} \, p_j = \sup_q \sum_j u_j(q_j) \, p_j,$$

with the supremum attained if, and only if, $q = p$. Introducing a Lagrange multiplier $\lambda$ for the constraint $\sum_j q_j = 1$, $q = p$ must therefore be an extremal of

$$F(q) = \sum_j u_j(q_j) \, p_j + \lambda \Big( \sum_j q_j - 1 \Big).$$

For $\{q_j\}$ to make $F$ stationary it is necessary (see, e.g., Jeffreys and Jeffreys, 1946, p. 315) that $p_j u_j'(q_j) + \lambda = 0$ for all $j$, where $u_j'$ stands for the derivative of $u_j$; evaluating at $q = p$,

$$p_j \, u_j'(p_j) = -\lambda = A, \quad \text{say, for all } j.$$

Since the partition contains more than two elements and this must hold for every strictly positive $p$, all the functions $u_j$ satisfy the same functional equation $p \, u_j'(p) = A$, with $A$ not depending on $j$ or $p$; hence

$$u_j(p) = A \log p + B_j.$$

The condition $A > 0$ suffices to guarantee that the extremal found is indeed a maximum. ◁
If the partition $\{E_j, j \in J\}$ only contains two elements, so that the partition is simply $\{H, H^c\}$, then the locality condition is, of course, vacuous, and the score function only depends on the probability $q_1$ attached to $H$; using the indicator function $1_H$ of $H$, we may write

$$u\{(q_1, 1 - q_1), H\} = f(q_1, 1), \qquad u\{(q_1, 1 - q_1), H^c\} = f(q_1, 0),$$

say, where the second argument of $f$ records the value of $1_H$, i.e., whether or not $H$ occurs. For $u\{(q_1, 1 - q_1), \cdot\}$ to be proper we must have

$$\sup_{0 < q_1 < 1} \big\{ p \, f(q_1, 1) + (1 - p) \, f(q_1, 0) \big\} = p \, f(p, 1) + (1 - p) \, f(p, 0), \qquad 0 < p < 1,$$

so that, if the score function is smooth, $f$ must satisfy the functional equation

$$p \, f'(p, 1) + (1 - p) \, f'(p, 0) = 0, \qquad 0 < p < 1,$$

where $f'$ denotes the derivative with respect to the first argument.

Definition 2.24. (Logarithmic score function). A logarithmic score function for strictly positive probability distributions $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$ is any function of the form

$$u\{q, E_j\} = A \log q_j + B_j, \qquad A > 0.$$
In the two-element case, the logarithmic function $f(x, 1) = A \log x + B_1$, $f(x, 0) = A \log(1 - x) + B_2$ is then just one of the many possible solutions (see Good, 1952).

We have assumed that the probability distributions to be considered as options assign strictly positive $q_j$ to each $E_j$. It is worth noting, however, that since we place no (strictly positive) lower bound on the possible $q_j$'s, we have an example of an unbounded decision problem, i.e., a decision problem without extreme consequences; this will need to be borne in mind from a technical point of view, particularly within the mathematical extensions to be considered in Chapter 3. Moreover, we have no problem in calculating the expected utility arising from the logarithmic score function: if $p = \{p_j, j \in J\}$ is the distribution representing actual beliefs, the expected utility of reporting $q$ is

$$\bar{u}(q) = \sum_j \big( A \log q_j + B_j \big) \, p_j,$$

which, because the logarithmic score function is proper, is maximised if, and only if, $q = p$.

Proposition 2.30. (Expected loss in probability reporting). If preferences are described by a logarithmic score function, the expected loss of utility in reporting a probability distribution $q = \{q_j, j \in J\}$ defined over a partition $\{E_j, j \in J\}$, rather than the distribution $p = \{p_j, j \in J\}$ representing actual beliefs, is given by

$$\delta\{q \mid p\} = A \sum_j p_j \log \frac{p_j}{q_j}, \qquad A > 0;$$

moreover, $\delta\{q \mid p\} \geq 0$, with equality if, and only if, $q = p$.

Proof. Using Definition 2.24, the expected utility of reporting $q$ when $p$ is the actual distribution of beliefs is $\bar{u}(q) = \sum_j (A \log q_j + B_j) \, p_j$, and thus

$$\delta\{q \mid p\} = \bar{u}(p) - \bar{u}(q) = A \sum_j p_j \log \frac{p_j}{q_j}.$$

The final statement is a consequence of Proposition 2.29 since, the logarithmic score function being proper, the expected utility of reporting $q$ is maximised if, and only if, $q = p$, so that $\bar{u}(p) \geq \bar{u}(q)$. ◁

2.7.3 Approximation and Discrepancy

We have argued that the optimal solution to an inference reporting problem (either for an individual, or for each of several individuals) is to state the appropriate actual beliefs. However, the precise computation of $p$ may be difficult, and we may choose instead to report an approximation $q$, say, on the grounds that $q$ is "close" to $p$, but much easier to calculate. The justification of such a procedure requires a study of the notion of "closeness" between two distributions.
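The identity in Proposition 2.30 (taking $A = 1$, $B_j = 0$) is easy to check numerically; the distributions below are arbitrary illustrations:

```python
import math

def expected_log_score(q, p):
    """Expected logarithmic score sum_j p_j log q_j  (A = 1, B_j = 0)."""
    return sum(pj * math.log(qj) for pj, qj in zip(p, q))

def delta(q, p):
    """Expected loss of reporting q when p describes actual beliefs."""
    return expected_log_score(p, p) - expected_log_score(q, p)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl = sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))

print(round(delta(q, p), 10) == round(kl, 10))  # -> True: the loss is the KL divergence
print(delta(p, p))                              # -> 0.0: no loss in honest reporting
```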
The quantity $\delta\{q \mid p\}$, which arises here as a difference between two expected utilities, was introduced by Kullback and Leibler (1951) as an ad hoc measure of (directed) divergence between two probability distributions. Combining Propositions 2.29 and 2.30 suggests a natural, general measure of "lack of fit", or discrepancy, between a distribution and an approximation, when preferences are described by a logarithmic score function.

Definition 2.25. (Discrepancy of an approximation). The discrepancy between a strictly positive probability distribution $p = \{p_j, j \in J\}$ over a partition $\{E_j, j \in J\}$ and an approximation $\hat{p} = \{\hat{p}_j, j \in J\}$ is defined by

$$\delta\{\hat{p} \mid p\} = \sum_j p_j \log \frac{p_j}{\hat{p}_j}.$$

By Proposition 2.30, $\delta\{\hat{p} \mid p\} \geq 0$, with equality if, and only if, $\hat{p}_j = p_j$ for all $j$; an immediate direct proof is obtained using the fact that $\log x \leq x - 1$ for all $x > 0$, with equality if, and only if, $x = 1$. It is clear that an individual with preferences approximately described by a proper local score function should beware of approximating a $p_j$ by zero: the "tails" of the distribution are, generally speaking, extremely important in pure inference problems. This is in contrast to many practical decision problems, where the form of the utility function often makes the solution robust with respect to changes in the "tails" of the distribution assumed.

Example 2.5. (Poisson approximation to a binomial distribution). The behaviour of $\delta\{\hat{p} \mid p\}$ is well illustrated by a familiar, elementary example. Consider the binomial distribution

$$p_x = \binom{n}{x} \theta^x (1 - \theta)^{n - x}, \qquad x = 0, 1, \ldots, n,$$

and let

$$\hat{p}_x = e^{-n\theta} \frac{(n\theta)^x}{x!}, \qquad x = 0, 1, 2, \ldots,$$

be its Poisson approximation. We then have the discrepancy $\delta\{\hat{p} \mid p\}$ shown, as a function of $\theta$ and $n$, in Figure 2.7.
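Example 2.5 can be reproduced directly from Definition 2.25 (with logarithms to base 2, as in the book's figure); the code below is our own sketch:

```python
import math

def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam**x / math.factorial(x)

def discrepancy(n, theta):
    """delta{p_hat | p} = sum_x p_x log2(p_x / p_hat_x), in bits."""
    lam = n * theta
    return sum(binom_pmf(x, n, theta)
               * math.log2(binom_pmf(x, n, theta) / poisson_pmf(x, lam))
               for x in range(n + 1))

for n in (1, 2, 10):
    print(n, [round(discrepancy(n, t), 4) for t in (0.05, 0.25, 0.5)])
```

The printed table reproduces the qualitative behaviour of Figure 2.7: the discrepancy falls sharply as $\theta$ decreases and, less dramatically, as $n$ increases.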
Figure 2.7 Discrepancy between a binomial distribution and its Poisson approximation (logarithms to base 2)

It is apparent from Figure 2.7 that $\delta\{\hat{p} \mid p\}$ decreases as either $n$ increases or $\theta$ decreases, or both, and that the second factor is far more important than the first. Clearly, it follows from our previous discussion that it would not be a good idea to reverse the roles and try to approximate a Poisson distribution by a binomial distribution.

Definition 2.25 provides a systematic approach to approximation in pure inference contexts: the best approximation within a given family will be that which minimises the discrepancy. When, as in Figure 2.7, logarithms to base 2 are used, the utility and the discrepancy are measured on the well-known scale of bits of information (or entropy), which can be interpreted in terms of the expected number of yes-no questions required to identify the true event in the partition (see, for example, de Finetti, 1970/1974, or Rényi, 1962/1970).

2.7.4 Information

In Section 2.4, we showed that, for quantitative coherence, any new information $D$ should be incorporated into the analysis by updating beliefs via Bayes' theorem, so that the initial representation of beliefs $P(\cdot)$ is updated to the conditional probability measure $P(\cdot \mid D)$.
In the context of pure inference problems, utility thus assumes a special form, and establishes a link between utility theory and classical information theory.

Proposition 2.31. (Expected utility of data). If preferences are described by a logarithmic score function for the class of probability distributions defined over a partition $\{E_j, j \in J\}$, and if the initial distribution $\{P(E_j), j \in J\}$ is strictly positive, then the expected increase in utility provided by the data $D$ is given by

$$A \sum_j P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)}, \qquad A > 0,$$

where $\{P(E_j \mid D), j \in J\}$ is the conditional probability distribution, given $D$. Moreover, this expected increase in utility is non-negative, and is zero if, and only if, $P(E_j \mid D) = P(E_j)$ for all $j$.

Proof. By Definition 2.24, the utilities of reporting $P(\cdot)$ or $P(\cdot \mid D)$, were $E_j$ known to be true, would be, respectively, $A \log P(E_j) + B_j$ and $A \log P(E_j \mid D) + B_j$. Thus, conditional on $D$, the expected increase in utility provided by the data is

$$\sum_j \Big\{ \big( A \log P(E_j \mid D) + B_j \big) - \big( A \log P(E_j) + B_j \big) \Big\} \, P(E_j \mid D) = A \sum_j P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)},$$

which, by Proposition 2.30, is non-negative, and is zero if, and only if, $P(E_j \mid D) = P(E_j)$ for all $j$. ◁

This motivates Definitions 2.26 and 2.27.

Definition 2.26. (Information from data). The amount of information about $\{E_j, j \in J\}$ provided by the data $D$, when the initial distribution over $\{E_j, j \in J\}$ is $p_0 = \{P(E_j), j \in J\}$, is defined to be

$$I(D \mid p_0) = \sum_j P(E_j \mid D) \log \frac{P(E_j \mid D)}{P(E_j)},$$

where $\{P(E_j \mid D), j \in J\}$ is the conditional probability distribution given the data $D$.
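Definition 2.26 in action, for a hypothetical two-hypothesis Bayes update (all numbers invented for illustration):

```python
import math

p0 = {"H1": 0.5, "H2": 0.5}        # initial beliefs P(E_j)
lik = {"H1": 0.9, "H2": 0.2}       # P(D | E_j) for the observed data D

p_d = sum(lik[h] * p0[h] for h in p0)             # P(D)
post = {h: lik[h] * p0[h] / p_d for h in p0}      # P(E_j | D), by Bayes' theorem

# I(D | p0) in bits: expected log-ratio under the conditional distribution.
info = sum(post[h] * math.log2(post[h] / p0[h]) for h in p0)
print({h: round(post[h], 3) for h in p0}, round(info, 3))
# -> {'H1': 0.818, 'H2': 0.182} 0.316
```

Had the likelihoods been equal, the posterior would have coincided with the prior and the information would have been exactly zero, as Proposition 2.31 requires.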
It follows from Definition 2.26 that the amount of information provided by data D is equal to δ{p_0 | p_D}, the discrepancy which arises if p_0 = {P(E_j), j ∈ J} is considered as an approximation to p_D = {P(E_j | D), j ∈ J}. It should be clear from the preceding discussion that I(D | p_0) measures indirectly the information provided by the data in terms of the changes produced in the probability distribution of interest. The amount of information is thus seen to be a relative measure, which obviously depends on the initial distribution.

Another interesting interpretation of I(D | p_0) arises from the following analysis. Conditional on E_j known to be true, −log P(E_j) and −log P(E_j | D) measure, respectively, how good the initial and the conditional distributions are in "predicting" the "true hypothesis" E_j; I(D | p_0) is simply the expected value of their difference, calculated with respect to p_D.

In the finite case, the entropy of the distribution p = {p_1, ..., p_n}, defined by

    H\{p\} = -\sum_{i} p_i \log p_i ,

has been proposed, and widely accepted, as an absolute measure of uncertainty, with −H{p} acting as a measure of "absolute" information. The recognised fact that its apparently natural extension to the continuous case does not make sense (if only because it is heavily dependent on the particular parametrisation used) should, however, have raised doubts about the universality of this concept. Attempts to define absolute measures of information have systematically failed to produce concepts of lasting value. The fact that, in the finite case, H{p} as a measure of uncertainty (and −H{p} as a measure of "absolute" information) seems to work correctly is explained (from our perspective) by the fact that the problem of extending the entropy concept to continuous distributions is closely related to that of defining an "origin" or "reference" measure of uncertainty in the continuous case, a role unambiguously played by the uniform distribution in the finite case. Thus, in terms of the above discussion, −H{p} may be interpreted, apart from an unimportant additive constant, as the amount of information which is necessary to obtain p = {p_1, ..., p_n} from an initial discrete uniform distribution, as we shall see in detail later (Section 3.2.2). For detailed discussion of H{p} and other proposed entropy measures, see Rényi (1961).

We shall on occasion wish to consider the idea of the amount of information which may be expected from an experiment e, the expectation being calculated before the results of the experiment are actually available.
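The interpretation of −H{p} relative to the uniform distribution may be checked directly. The following Python sketch (the distribution p is merely illustrative) verifies that the information required to obtain p from a uniform distribution over n events is exactly log2 n − H{p}:

```python
from math import log

def entropy_bits(p):
    """H{p} = -sum_i p_i log2 p_i, in bits."""
    return -sum(pi * log(pi, 2) for pi in p if pi > 0)

def divergence_bits(p, q):
    """sum_i p_i log2( p_i / q_i ): the information needed to obtain p from q."""
    return sum(pi * log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

n = 4
uniform = [1.0 / n] * n
p = [0.5, 0.25, 0.125, 0.125]

print(entropy_bits(p))          # 1.75 bits
# -H{p} equals, up to the additive constant log2 n, the information
# needed to move from the discrete uniform distribution to p:
print(divergence_bits(p, uniform), log(n, 2) - entropy_bits(p))
```

The uniform distribution itself maximises the entropy, so the divergence from it is always non-negative.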
Definition 2.27. (Expected information from an experiment). The expected information to be provided by an experiment e about a partition {E_j, j ∈ J}, when the initial distribution over {E_j, j ∈ J} is p_0 = {P(E_j), j ∈ J}, is

    I(e \mid p_0) = \sum_{i} I(D_i \mid p_0)\, P(D_i),

where the possible results of the experiment, {D_i, i ∈ I}, occur with probabilities {P(D_i), i ∈ I}.

Proposition 2.32. An alternative expression for the expected information is

    I(e \mid p_0) = \sum_{i} \sum_{j} P(E_j \cap D_i) \log \frac{P(E_j \cap D_i)}{P(E_j)\, P(D_i)}.

Moreover, I(e | p_0) ≥ 0, with equality if, and only if, P(E_j ∩ D_i) = P(E_j) P(D_i) for all E_j and D_i.

Proof. Since, by Bayes' theorem, P(E_j | D_i) = P(E_j ∩ D_i)/P(D_i), we have

    I(e \mid p_0) = \sum_{i} P(D_i) \sum_{j} P(E_j \mid D_i) \log \frac{P(E_j \mid D_i)}{P(E_j)} = \sum_{i} \sum_{j} P(E_j \cap D_i) \log \frac{P(E_j \cap D_i)}{P(E_j)\, P(D_i)}.

Moreover, by Proposition 2.31, I(D_i | p_0) ≥ 0, with equality if, and only if, P(E_j | D_i) = P(E_j) for all E_j; the result now follows from the fact that I(e | p_0) is a non-negatively weighted average of the I(D_i | p_0).

The expression for I(e | p_0) given by Proposition 2.32 is Shannon's (1948) measure of expected information. We have thus found a natural interpretation of this famous measure of expected information: Shannon's expected information is the expected utility provided by an experiment in a pure inference context, when an individual's preferences are described by a smooth, proper, local score function, corresponding to the initial distribution p_0.
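The equivalence asserted by Proposition 2.32 may be checked numerically. In the following Python sketch, the joint distribution over a 2 × 2 arrangement of events and data is merely illustrative:

```python
from math import log

# a hypothetical joint distribution: joint[i][j] = P(Ej ∩ Di)
joint = [[0.30, 0.10],
         [0.20, 0.40]]
p_d = [sum(row) for row in joint]                        # P(Di)
p_e = [sum(row[j] for row in joint) for j in range(2)]   # P(Ej)

def info_bits(posterior, prior):
    """I(D | p0) in bits, as in Definition 2.26."""
    return sum(q * log(q / p, 2) for q, p in zip(posterior, prior) if q > 0)

# Definition 2.27: average the information I(Di | p0) over the possible data
by_definition = sum(
    p_d[i] * info_bits([joint[i][j] / p_d[i] for j in range(2)], p_e)
    for i in range(2)
)

# Proposition 2.32: Shannon's expression in terms of the joint distribution
shannon = sum(
    joint[i][j] * log(joint[i][j] / (p_e[j] * p_d[i]), 2)
    for i in range(2)
    for j in range(2)
)

print(by_definition, shannon)  # the two expressions agree
```

Here the joint distribution is not the product of its margins, so the computed expected information is strictly positive, in accordance with the final assertion of the proposition.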
In conclusion, we have suggested that the problem of reporting inferences can be viewed as a particular decision problem, and thus should be analysed within the framework of decision theory. We have established that, within this framework, preferences should be described by a logarithmic score function, with a natural characterisation of an individual's utility function when faced with a pure inference problem. We have also seen that, within this framework, discrepancy and amount of information are naturally defined in terms of expected loss of utility and expected increase in utility, respectively, and that maximising expected Shannon information is a particular instance of maximising expected utility. We shall see in Section 3.4 how these results, established here for finite partitions, extend straightforwardly to the continuous case.

2.8    DISCUSSION AND FURTHER REFERENCES

2.8.1    Operational Definitions

In everyday conversation, the way in which we use language is typically rather informal and unselfconscious, and we tolerate each other's ambiguities and vacuities for the most part, occasionally seeking an ad hoc clarification of a particular statement or idea if the context seems to justify the effort required in trying to be a little more precise. (For a detailed account of the ambiguities which plague qualitative probability expressions in English, see Mosteller and Youtz, 1990.)

In the context of scientific and philosophical discourse, however, there is a paramount need for statements which are meaningful and unambiguous. The everyday, tolerant, ad hoc response will therefore no longer suffice. More rigorous habits of thought are required, and we need to be self-consciously aware of the precautions and procedures to be adopted if we are to arrive at statements which make sense. A prerequisite for "making sense" is that the fundamental concepts which provide the substantive content of our statements should themselves be defined in an essentially unambiguous manner. We are thus driven to seek definitions of fundamental notions which can be reduced ultimately to the touchstone of actual or potential personal experience, rather than remaining at the level of mere words or phrases.

This kind of approach to definitions is closely related to the philosophy of pragmatism, as formulated in the second half of the nineteenth century by Peirce, who insisted that clarity in thinking about concepts could only be achieved by concentrating attention on the conceivable practical effects associated with a concept. In Peirce (1878), this point of view was summarised as follows:

    Consider what effects, that might conceivably have practical bearings, we conceive the object of our conception to have. Then, our conception of these effects is the whole of our conception of the object.
In some respects, however, this position is not entirely satisfactory, in that it fails to go far enough in elaborating what is to be understood by the term "practical". This crucial elaboration was provided by Bridgman (1927) in a book entitled The Logic of Modern Physics, where the key idea of an operational definition is introduced and illustrated by considering the concept of "length":

    ... we mean by any concept nothing more than a set of operations: the concept is synonymous with the corresponding set of operations. If the concept is physical, the operations are actual physical measurements ...; or if the concept is mental, the operations are mental operations ...

    ... what do we mean by the length of an object? We evidently know what we mean by length if we can tell what the length of any and every object is, and for the physicist nothing more is required. To find the length of an object, we have to perform certain physical operations. The concept of length is therefore fixed when the operations by which length is measured are fixed: that is, the concept of length involves as much as, and nothing more than, the set of operations by which length is determined.

Throughout this work, we shall seek to adhere to the operational approach to defining concepts in order to arrive at meaningful and unambiguous statements in the context of representing beliefs and taking actions in situations of uncertainty. Indeed, we have stressed this aspect of our thinking in Sections 2.1 to 2.7, where we made the practical, operational idea of preference between options the fundamental starting point and touchstone for all other definitions. We also noted the inevitable element of idealisation, or approximation, implicit in the operational approach to our concepts, and we remarked on this at several points in the preceding sections.

Since many critics of the personalistic Bayesian viewpoint claim to find great difficulty with this feature of the approach, often suggesting that it undermines the entire theory, it is worth noting Bridgman's very explicit recognition that all experience is subject to error, and that all we can do is to take sufficient precautions when specifying sets of operations to ensure that remaining unspecified variations in procedure have negligible effects on the results of interest. This is well illustrated by Bridgman's account of the operational concept of length and its attendant idealisations and approximations:

    ... we take a measuring rod, lay it on the object so that one of its ends coincides with one end of the object, mark on the object the position of the rod, then move the rod along in a straight line extension of its previous position until the first end coincides with the previous position of the second end, repeat this process as often as we can, and call the length the total number of times the rod was applied. This procedure, apparently so simple, is in practice exceedingly complicated, and doubtless a full description of all the precautions that must be taken would fill a large treatise. We must, for example, be sure that the temperature of the rod is the standard temperature at which its length is defined, or else we must make a correction for it.
    Or we must correct for the gravitational distortion of the rod if we measure a vertical length; or we must be sure that the rod is not a magnet or is not subject to electrical forces ... In principle, we must go further and specify all the details by which the rod is moved from one position to the next on the object: its precise path through space and its velocity and acceleration in getting from one position to another. Practically, of course, precautions such as these are not taken, but the justification is in our experience that variations of procedure of this kind are without effect on the final result.

This pragmatic recognition that there are inevitable limitations in any concrete application of a set of operational procedures is precisely the spirit of our discussion of Axioms 4 and 5 in Section 2.3. In practical terms, we have to stop somewhere, even though, in principle, we could indefinitely refine our measurement operations. What matters is to be able to achieve sufficient accuracy to avoid unacceptable distortion in any analysis of interest.

2.8.2    Quantitative Coherence Theories

With these considerations in mind, our purpose here is to provide a brief historical review of the foundational writings which seem to us the most significant. This will serve in part to acknowledge our general intellectual indebtedness and orientation, and in part to explain and further motivate our own particular choice of axiom system.

In a comprehensive review of normative decision theories leading to the expected utility criterion, Fishburn (1981) lists over thirty different axiomatic formulations of the principles of coherence, reflecting a variety of responses to the underlying conflict between axiomatic simplicity and structural flexibility in the representation of decision problems. Fishburn sums up the dilemma as follows:

    On the one hand, we would like our axioms to be simple, intuitively clear, interpretable, and capable of convincing others that they are appealing criteria of coherency and consistency in decision making under uncertainty. In addition, we should like the definitions of the basic concepts of probability and utility to have strong and direct links with practical assessment procedures. On the other hand, we would like our theory to adhere to the loose structures that often arise in realistic decision situations, but to do this it seems essential to invoke strong structural conditions; and if this is done, then we will be faced with fairly complicated axioms that accommodate these loose structures.

The earliest axiomatic approach to the problem of decision making under uncertainty is that of Ramsey (1926), who presented the outline of a formal system. The key postulate in Ramsey's theory is the existence of a so-called ethically neutral event.
Expressed in terms of our notation for options, such an event E has the property that, for any consequences c_1, c_2,

    \{c_1 \mid E, c_2 \mid E^c\} \sim \{c_2 \mid E, c_1 \mid E^c\}.

It is then rather natural to define the degree of belief in such an event to be 1/2 and, from this quantitative basis, it is straightforward to construct an operational measure of utility for consequences. This, in turn, is used to extend the definition of degree of belief to general events by means of an expected utility model.

From a conceptual point of view, Ramsey's theory seems to us, as indeed it has to many other writers, a revolutionary landmark in the history of ideas. From a mathematical point of view, however, the treatment is rather incomplete; no mathematical completion of Ramsey's theory seems to have been published, although a closely related development can be found in Pfanzagl (1967, 1968). For a modern appraisal, see Shafer (1986) and the lively ensuing discussion.

It was not until 1954, with the publication of Savage's (1954) book The Foundations of Statistics, that the first complete formal theory appeared. Savage's major innovation in structuring decision problems is to define what he calls acts (options, in our terminology) as functions from the set of uncertain possible outcomes into the set of consequences. His key coherence assumption is then that of a complete, transitive order relation among acts, and this is used to define qualitative probabilities. These are extended into quantitative probabilities by means of a "continuously divisible" assumption about events. Utilities are subsequently introduced using ideas similar to those of von Neumann and Morgenstern (1944/1953), who had, ten years earlier, presented an axiom system for utility alone, assuming the prior existence of probabilities.

The Savage axiom system is a great historical achievement and provides the first formal justification of the personalistic approach to probability and decision making. Of course, many variations on an axiomatic theme are possible, and other Savage-type axiom systems have been developed since by Stigum (1972), Roberts (1974), Fishburn (1975) and Narens (1976); see, also, Suppes (1960, 1974), Savage (1970) and Hens (1992). Suppes (1956) presented a system which combined elements of Savage's and Ramsey's approaches.

There are, however, two major difficulties with Savage's approach. The first of these difficulties stems from the "continuously divisible" assumption about events, which Savage uses as the basis for proceeding from qualitative to quantitative concepts. Such an assumption imposes severe constraints on the allowable forms of structure for the set of uncertain outcomes: in fact, it even prevents the theory from being directly applicable to situations involving a finite or countably infinite set of possible outcomes. One way of avoiding this embarrassing structural limitation is to introduce a quantitative element into the system by a device like that of Ramsey's ethically neutral event. Such an event is directly defined to have probability 1/2 and thus enables the quantitative ball to get rolling without imposing undue constraints on the structure.
All that is required is that (at least) one such event be included in the representation of the uncertain outcomes. In fact, a generalisation of Ramsey's idea re-emerges in the form of the canonical lotteries introduced by Anscombe and Aumann (1963) for defining degrees of belief, and by Pratt, Raiffa and Schlaifer (1964, 1965) as a basis for simultaneously quantifying personal degrees of belief and utilities in a direct and intuitive manner. The basic idea is essentially that of a standard measuring device, in some sense external to the real-world events and options of interest. It seems to us that this idea ties in perfectly with the kind of operational considerations described above, and the standard events and options that we introduced in Section 2.3 play this fundamental operational role in our own system. Other systems using standard measuring devices (sometimes referred to as external scaling devices) are those of Fishburn (1967b, 1969) and Balch and Fishburn (1974). A theory which, like ours, combines a standard measuring device with a fundamental notion of conditional preference is that of Luce and Krantz (1971).

The second major difficulty with Savage's theory, and one that also exists in many other theories (see Table 1 in Fishburn, 1981), is that the Savage axioms imply the boundedness of utility functions (an implication of which Savage was apparently unaware when he wrote The Foundations of Statistics, but which was subsequently proved by Fishburn). The theory does not therefore justify the use of many mathematically convenient and widely used utility functions: for example, those implicit in forms such as "quadratic loss" and "logarithmic score", already hinted at in our brief discussion of medical and monetary consequences in Section 2.5.

Motivated by considerations of mathematical convenience, we have so far always taken options to be defined by finite partitions. We take the view, however, that it is often conceptually and mathematically convenient to be able to use structural representations going beyond what we perceive to be the essentially finitistic and bounded characteristics of real-world problems. In presenting the basic quantitative coherence axioms, it is important not to confuse the primary definitions and coherence principles with the secondary issues of the precise forms of the various sets involved, and we hope that, within this simple structure, the essence of the quantitative coherence theory, uncomplicated by structural complexities, has already been clearly communicated. For this reason, we shall, in Chapter 3, relax the constraint imposed on the form of the action space. We shall then arrive at a sufficiently general setting for all our subsequent developments and applications.

Related Theories

Our previous discussion centred on complete axiomatic approaches to decision problems, involving a unified development of both probability and utility concepts. In our view, a unified treatment of the two concepts is inescapable if operational considerations are to be taken seriously.
However, there have been a number of attempted developments of probability ideas separate from utility considerations, that is, without explicit use of the utility concept, as well as separate developments of utility ideas presupposing the existence of probabilities. In this section, we shall provide a summary overview of a number of these related theories. For the most part, we shall simply give what seem to us the most important historical references, together with some brief comments, grouped under the following subheadings: (i) Monetary Bets and Degrees of Belief; (ii) Scoring Rules and Degrees of Belief; (iii) Axiomatic Approaches to Degrees of Belief; (iv) Axiomatic Approaches to Utilities; and (v) Information Theories. The first two topics will, however, be treated at greater length, partly because of their close relation with the main concerns of this book, and partly because of their connections with the important practical topic of the assessment of beliefs. In addition, there is a considerable literature on information-theoretic ideas closely related to those of Section 2.7.

Monetary Bets and Degrees of Belief

An elegant demonstration that coherent degrees of belief satisfy the rules of (finitely additive) probability was given by de Finetti (1937/1964). If consequences are assumed to be monetary, de Finetti's approach can be summarised as follows. Given an arbitrary monetary sum m and an uncertain event E, an individual's degree of belief in E is defined to be the number p such that, using the notation for options introduced in Section 2.3, the individual's preferences among options are such that

    \{pm \mid \Omega\} \sim \{m \mid E, 0 \mid E^c\},

for any E and m; that is, the individual is indifferent between the certain receipt of pm and a gamble yielding m if E occurs, and nothing otherwise. In modern economic terminology, probability can thus be considered to be a marginal rate of substitution or, more simply, a kind of "price". This definition is virtually identical to Bayes' own definition of probability (see our later discussion under the heading of Axiomatic Approaches to Degrees of Belief).

Given that an individual has specified his or her degrees of belief for some collection of events by repeated use of the above definition, either it is possible to arrange a form of monetary bet in terms of these events which is such that the individual will certainly lose, a so-called "Dutch book", or such an arrangement is impossible. In the latter case, the individual is said to have specified a coherent set of degrees of belief. It is now straightforward to verify that coherent degrees of belief have the properties of finitely additive probabilities.

To demonstrate that 0 ≤ p ≤ 1, we can argue as follows. An individual who assigns p > 1 is implicitly agreeing to pay a stake larger than m to enter a gamble in which the maximum prize he or she can win is m; an individual who assigns p < 0 is implicitly agreeing to offer a gamble in which he or she will pay out either m or nothing in return for a negative stake, which is equivalent to paying an opponent to enter such a gamble. In either case, a bet can be arranged which will result in a certain loss to the individual, and avoidance of this possibility requires that 0 ≤ p ≤ 1.

To demonstrate the additive property of degrees of belief for exclusive and exhaustive events E_1, E_2, ..., E_n, we proceed as follows. If an individual specifies p_1, p_2, ..., p_n to be his or her degrees of belief in those events, this is an implicit agreement to pay a total stake of p_1 m_1 + p_2 m_2 + ... + p_n m_n in order to enter a gamble resulting in a prize of m_i if E_i occurs, and thus in a "gain", or "net return", of

    g_i = m_i - (p_1 m_1 + \cdots + p_n m_n),

which could, of course, be negative. In order to avoid the possibility of the m_i's being chosen in such a way as to guarantee the negativity of the g_i's, it is necessary that the determinant of the matrix relating the m_i's to the g_i's in this system of linear equations be zero, and this turns out to require that p_1 + p_2 + ... + p_n = 1. Moreover, it is easy to check that this is also a sufficient condition for coherence: it implies \sum_i p_i g_i = 0, and hence the impossibility of all the returns being negative. A more formal argument based on the avoidance of certain losses in betting formulations has been given by Freedman and Purves (1969).

The extension of these ideas to cover the revision of degrees of belief conditional on new information proceeds in a similar manner, except that an individual's degree of belief in an event E conditional on an event F is defined to be the number q such that, given any monetary sum m, we have the equivalence

    \{qm \mid \Omega\} \sim \{m \mid E \cap F, 0 \mid E^c \cap F, qm \mid F^c\}.

The interpretation of this definition is straightforward: having paid a stake of qm, if F occurs we are confronted with a gamble with prizes m if E occurs, and nothing otherwise; if F does not occur, the bet is "called off" and the stake returned.

However, despite the intuitive appeal of this simple and neat approach, it has two major shortcomings from an operational viewpoint. In the first place, it is clear that the definitions cannot be taken seriously in terms of arbitrary monetary sums: the "perceived value" of a stake or a return is not equivalent to its monetary value, and the missing "utility" concept is required in order to overcome the difficulty. This point, which has its earlier origins in the celebrated St. Petersburg paradox (first discussed in terms of utility by Daniel Bernoulli, 1730/1954), was later recognised by de Finetti (see Kyburg and Smokler, 1964/1980, footnote (a)). An ad hoc modification of de Finetti's approach would be to confine attention to "small" stakes (thus, in effect, restricting attention to a range of outcomes over which the "utility" can be taken as approximately linear); the argument, thus modified, has considerable pedagogical and, perhaps, practical use. For further discussion of possible forms of "utility for money", see, for example, Pratt (1964), LaValle (1968), Lindley (1971/1985, Chapter 5) and Hull et al. (1973). Additionally, one may explicitly recognise that some people have a positive utility for gambling (see, for instance, Conlisk, 1993).
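The additivity argument may be illustrated with a small numerical example. In the following Python sketch the degrees of belief are deliberately incoherent (they sum to 1.2), the prizes are all taken equal, and the figures are merely illustrative; the net return is then negative whichever event occurs, which is precisely a "Dutch book":

```python
# hypothetical, incoherent degrees of belief over three exclusive
# and exhaustive events: they sum to 1.2 rather than to 1
p = [0.5, 0.4, 0.3]
m = 100.0  # the same prize is offered on each event

# the individual pays p_j * m to enter each bet; exactly one event
# occurs, so exactly one bet pays out m, and the net return is the
# same whichever event that is
stake = sum(p_j * m for p_j in p)
returns = [m - stake for _ in p]

print(returns)  # a sure loss of 20, whatever happens
```

With coherent beliefs, summing to one, the same construction yields a net return of exactly zero in every case, so no sure loss can be engineered with equal prizes; the determinant argument in the text covers arbitrary choices of the prizes.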
Related arguments have also been used by Cornfield (1969), Heath and Sudderth (1972) and Buehler (1976) to expand on de Finetti's concept of coherent systems of bets. In addition to the problem of "non-linearity in the face of risk", alluded to above, there is also the difficulty that unwanted game-theoretic elements may enter the picture if we base a theory on ideas such as "opponents" choosing the levels of prizes in gambles. For this reason, de Finetti himself later preferred to use an approach based on scoring rules.

Scoring Rules and Degrees of Belief

The scoring rule approach to the definition of degrees of belief, and to the derivation of their properties when constrained to be coherent, is due to de Finetti (1963, 1964), with important subsequent generalisations by Savage (1971) and Lindley (1982a). Using the quadratic scoring rule, the development proceeds as follows. Given an uncertain event E, an individual is asked to select a number, q, say, with the understanding that, if E occurs, he or she is to suffer a penalty (or loss) of L = (1 − q)^2, whereas if E does not occur he or she is to suffer a penalty of L = q^2. Using the indicator function for E, a concept we have already introduced earlier in this chapter, the penalty can be written in the general form

    L = (1_E - q)^2.

The number, q, which the individual chooses when confronted with this penalty is defined to be his or her degree of belief in E.

Suppose now that E_1, E_2, ..., E_n are an exclusive and exhaustive collection of uncertain events for which the individual, using the quadratic scoring rule scheme, has to specify degrees of belief p_1, p_2, ..., p_n, subject now to the penalty

    L = (1_{E_1} - p_1)^2 + (1_{E_2} - p_2)^2 + \cdots + (1_{E_n} - p_n)^2.

Given a specification p_1, p_2, ..., p_n, either it is possible to find an alternative specification p_1^*, p_2^*, ..., p_n^* such that, for any assignment of the value 1 to one of the indicators and 0 to the others, the penalty is strictly reduced, or it is not possible to find such a specification. In the latter case, the individual is said to have specified a coherent set of degrees of belief. The underlying idea in this development is clearly very similar to that of de Finetti's (1937/1964) approach, where the avoidance of a "Dutch book" is the basic criterion of coherence.

A simple geometric argument now establishes that, for coherence, we must have 0 ≤ p_i ≤ 1, for i = 1, ..., n, and p_1 + p_2 + ... + p_n = 1. To see this, note that the n logically compatible assignments of values 1 and 0 to the indicators define n points in R^n. Thinking of p_1, p_2, ..., p_n as defining a further point in R^n, the coherence condition can be reinterpreted as requiring that this latter point cannot be moved in such a way as to simultaneously reduce its distance from all the other n points. This means that p_1, p_2, ..., p_n must define a point in the convex hull of the other n points, which is precisely the stated requirement.

The extension of this approach to cover the revision of degrees of belief conditional on new information proceeds as follows. An individual's degree of belief in an event E conditional on the occurrence of an event F is defined to be the number q which he or she chooses when confronted with a penalty defined by

    L = 1_F (1_E - q)^2,

a formulation which is clearly related to the idea of "called-off" bets used in de Finetti's 1937 approach. The interpretation of this penalty is straightforward: if F occurs, the specification of q proceeds according to the penalty (1_E − q)^2; if F does not occur, there is no penalty.

To derive the constraints on the specifications imposed by coherence, we argue as follows. Suppose that, in addition to the conditional degree of belief q, the numbers p and r are the individual's degrees of belief, specified subject to quadratic penalties, for the events E ∩ F and F, respectively, so that the total penalty is

    L = 1_F (1_E - q)^2 + (1_{E \cap F} - p)^2 + (1_F - r)^2.

If u, v and w are the values which L takes in the cases where E ∩ F, E^c ∩ F and F^c occur, respectively, then p, q and r must satisfy the equations

    u = (1 - q)^2 + (1 - p)^2 + (1 - r)^2
    v = q^2 + p^2 + (1 - r)^2
    w = p^2 + r^2.

If u, v and w defined a point in R^3 where the Jacobian of the transformation defined by the above equations did not vanish, it would be possible to move from that point in a direction which simultaneously reduced the values of u, v and w, whatever the logically compatible outcomes of the events were. Coherence therefore requires that the Jacobian be zero. A simple calculation shows that this reduces to the condition q = p/r, which is, again, Bayes' theorem. De Finetti's "penalty criterion" and related ideas have been critically re-examined by a number of authors; see, for example, Myerson (1979), Regazzini (1983), Gatsonis (1984), Piccinato (1986), Eaton (1992) and Gilio (1992a).

Axiomatic Approaches to Degrees of Belief

Historically, the first recognisably "axiomatic" approach to a theory of degrees of belief was that of Bayes (1763), and the magnitude of his achievement has been clearly recognised in the two centuries following his death by the adoption of the adjective Bayesian as a description of the philosophical and methodological developments which have been inspired, directly or indirectly, by his essay. The idea of probability as "degree of belief" has, of course, received a great deal of distinguished support, including contributions from James Bernoulli (1713/1899), Laplace (1774/1986, 1814/1952), De Morgan (1847) and Borel (1924/1964); so far as we know, however, none of these writers attempted an axiomatic development of the idea.
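Both the unconditional and the called-off quadratic penalties may be examined numerically. The following Python sketch (the probabilities are merely illustrative) locates, by grid search, the specification q which minimises the expected penalty, confirming that the quadratic score is proper, so that it is optimal to report q = P(E), and that the optimal conditional specification is q = p/r, as in the text:

```python
def quadratic_penalty(q, p):
    """Expected value of (1_E - q)^2 when P(E) = p."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def called_off_penalty(q, p_ef, p_f):
    """Expected value of 1_F (1_E - q)^2, with p_ef = P(E ∩ F), p_f = P(F)."""
    return p_ef * (1 - q) ** 2 + (p_f - p_ef) * q ** 2

grid = [i / 1000 for i in range(1001)]

# propriety: the expected penalty is minimised by reporting q = P(E) itself
p = 0.3
best = min(grid, key=lambda q: quadratic_penalty(q, p))
print(best)  # 0.3

# for the called-off penalty the optimum is q = P(E ∩ F)/P(F): Bayes' theorem
p_ef, p_f = 0.12, 0.4
best_conditional = min(grid, key=lambda q: called_off_penalty(q, p_ef, p_f))
print(best_conditional)  # 0.3 = 0.12/0.4
```

The grid search simply mimics the geometric argument of the text: any specification other than the probability itself can be moved so as to reduce the expected penalty.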
Bayes' achievement was recognised in the two centuries following his death by the adoption of the adjective Bayesian as a description of the philosophical and methodological developments which have been inspired, directly or indirectly, by his essay. Following Bayes, the idea of probability as "degree of belief" has received a great deal of distinguished support; see, for example, Laplace (1774/1986, 1814/1952), De Morgan (1847) and Borel (1924/1964). See Good (1965, Chapter 2) for a discussion of the variety of attitudes to probability compatible with a systematic use of the Bayesian paradigm.

By present day standards, Bayes' formulation is, of course, extremely informal, and a more formal, modern approach only began to emerge a century and a half later. Formal axiom systems which wholeheartedly embrace the principle of revising beliefs through systematic use of Bayes' theorem first appeared in a series of papers by Wrinch and Jeffreys (1919, 1921) and are discussed in detail by Jeffreys (1931/1973, 1939/1961), whose profound philosophical and methodological contributions to Bayesian statistics are now widely recognised; see, in particular, the evaluations of his work by Geisser (1980a), by Good (1980a) and by Lindley (1980a), in the volume edited by Zellner (1980). From a foundational perspective, however, the flavour of Jeffreys' approach seems to us to place insufficient emphasis on the inescapably personal nature of degrees of belief, resulting in an overconcentration on "conventional" representations of degrees of belief derived from "logical" rather than operational considerations (despite the fact that Jeffreys was highly motivated by real world applications!). Similar criticisms seem to us to apply to the original and elegant formal development given by Cox (1946, 1961) and Jaynes (1958), who showed that the probability axioms constitute the only consistent extension of ordinary (Aristotelian) logic in which degrees of belief are represented by real numbers. The work of Keynes (1921/1929) and Carnap (1950/1962) deserves particular mention and will be further discussed later in Section 2.8.

There are, of course, many other examples of axiomatic approaches to quantifying uncertainty in some form or another. In the finite case, this includes work by Kraft et al. (1959), Scott (1964), Fishburn (1970, Chapter 4), Domotor and Stelzer (1971), Krantz et al. (1971), Suppes and Zanotti (1976, 1982), Heath and Sudderth (1978), Luce and Narens (1978), French (1982) and Chuaqui and Malitz (1983). Fishburn (1986) provided an authoritative review of the axiomatic foundations of subjective probability, which is followed by a long, stimulating discussion.

Axiomatic Approaches to Utilities

Assuming the prior existence of probabilities, von Neumann and Morgenstern (1944/1953) presented axioms for coherent preferences which led to a justification of utilities as numerical measures of value for consequences and to the optimality criterion of maximising expected utility.
Much of Savage's (1954/1972) system was directly inspired by this seminal work of von Neumann and Morgenstern, and the influence of their ideas extends into a great many of the systems we have mentioned. Other early developments which concentrate on the utility aspects of the decision problem include those of Friedman and Savage (1948, 1952), Marschak (1950), Arrow (1951a), Herstein and Milnor (1953), Edwards (1954) and Debreu (1960). Discussions of the experimental measurement of utility are provided by Edwards (1954), Davidson et al. (1957), Suppes and Walsh (1959), Becker et al. (1963), Becker and McClintock (1967), Savage (1971) and Hull et al. (1973). General accounts of utility are given in the books by Blackwell and Girshick (1954), Luce and Raiffa (1957), Chernoff and Moses (1959) and Fishburn (1970); extensive bibliographies are given in Savage (1954/1972) and Fishburn (1968, 1981). DeGroot (1970, Chapter 7) presents a general axiom system for utilities which imposes rather few mathematical constraints on the underlying decision problem structure. Multiattribute utility theory is discussed by Fishburn (1964) and Keeney and Raiffa (1976). Other discussions of utility theory include Fishburn (1966, 1967a, 1988b) and Machina (1982, 1987). Seminal references are reprinted in Page (1968).

Information Theories

Measures of information are closely related to ideas of uncertainty and probability, and there is a considerable literature exploring the connections between these topics. The logarithmic information measure was proposed independently by Shannon (1948) and Wiener (1948) in the context of communication engineering. Lindley (1956) later suggested its use as a statistical criterion in the design of experiments; see, also, DeGroot (1963). The logarithmic divergency measure was first proposed by Kullback and Leibler (1951) and was subsequently used as the basis for an information-theoretic approach to statistics by Kullback (1959/1968). A formal axiomatic approach to measures of information in the context of uncertainty was provided by Good (1966), who has made numerous contributions to the literature of the foundations of decision making and the evaluation of evidence. Other relevant references on information concepts are Rényi (1964, 1967) and Särndal (1970).

The mathematical results which lead to the characterisation of the logarithmic scoring rule for reporting probability distributions have been available for some considerable time. Logarithmic scores seem to have been first suggested by Good (1952), but he only dealt with dichotomies, for which the uniqueness result is not applicable. The first characterisation of the logarithmic score for a finite distribution was attributed to Gleason by McCarthy (1956). Aczel and Pfanzagl (1966), Arimoto (1970) and Savage (1971) have also given derivations of this form of scoring rule under various regularity conditions; see, also, Schervish et al. (1990). By considering the inference reporting problem as a particular case of a decision problem, we have provided (in Section 2.7) a natural, unifying account of the fundamental and close relationship between information-theoretic ideas and the Bayesian treatment of "pure inference" problems.
Based on work of Bernardo (1979a), this analysis will be extended, in Chapter 3, to cover continuous distributions.

2.8.4 Critical Issues

We shall conclude this chapter by providing a summary overview of our position in relation to some of the objections commonly raised against the foundations of Bayesian statistics. These will be dealt with under the following subheadings: (i) Dynamic Frame of Discourse; (ii) Updating Subjective Probability; (iii) Relevance of an Axiomatic Approach; (iv) Structure of the Set of Relevant Events; (v) Prescriptive Nature of the Axioms; (vi) Precise, Complete, Quantitative Preferences; (vii) Subjectivity of Probability; (viii) Statistical Inference as a Decision Problem; and (ix) Communication and Group Decision Making.

Dynamic Frame of Discourse

As we indicated in Chapter 1, our concern in this volume is with coherent beliefs and actions in relation to a limited set of specified possibilities, defined in the light of our current knowledge and assumptions, and currently assumed necessary and sufficient to reflect key features of interest in the problem under study. That is, we are operating in terms of a fixed frame of discourse. We accept that the mode of reasoning encapsulated within the quantitative coherence theory as presented here is ultimately conditional and thus, as many critics have pointed out, not directly applicable to every phase of the scientific process. In the more general, dynamic, evolving context of the scientific learning and decision process, this activity constitutes only one static phase and has to be viewed as sandwiched between two other vital processes: on the one hand, the creative generation of the set of possibilities to be considered, either potentially or actually; on the other hand, the critical questioning of the adequacy of the currently entertained set of possibilities (see, for example, Box, 1980).

The problem of generating the frame of discourse seems to us to be one which currently lies outside the purview of any "statistical" formalism. Substantive subject-matter inputs would seem to be of primary importance, although some limited formal clarification is actually possible within the Bayesian framework, as we shall see in Chapter 4. Inventing new models or theories offers considerable intellectual excitement and satisfaction in its own right; exploratory data analysis is no doubt a necessary adjunct and, particularly in the context of the possibilities opened up by modern computer graphics, has a complementary, although informal, role to play. We do not accept, however, that alternative formal statistical theories have a convincing, complementary role to play, as Box (1980) appears to suggest.
The problem of criticising the frame of discourse also seems to us to remain essentially unsolved by any "statistical" theory. On the one hand, informal, exploratory diagnostic probing would seem to have a role to play in confirming that specific forms of local elaboration of the frame of discourse should be made. The logical catch here, however, is that such specific diagnostic probing can only stem from the prior realisation that the corresponding specific elaborations might be required. The latter could therefore be incorporated ab initio into the frame of discourse and a fully coherent analysis carried out, subject only to local probing of specific potential elaborations. On the other hand, the issue of assessing adequacy in relation to a total absence of any specific suggested elaborations seems to us to remain an open problem. Indeed, it is not clear that the "problem" as usually posed is well-formulated: in the absence of such "externally" directed revision or extension of the current frame of discourse, it is not clear what questions one should pose in order to arrive at an "internal" assessment of adequacy in the light of the information thus far available. Is the key issue that of "surprise", or is some kind of extension of the notion of a decision problem required in order to give an operational meaning to the concept of "adequacy"? Readers interested in this topic will find in Box (1980), and the ensuing discussion, a range of reactions. Related issues arise in discussions of the general problem of assessing, or "calibrating", the external, empirical performance of an internally coherent individual; see, for example, Dawid (1982a). We shall return to these issues in Chapter 6.

In the case of a "revolution", or even "rebellion", in scientific paradigm (Kuhn, 1962), the issue is resolved for us as statisticians by the consensus of the subject-matter experts, and we simply begin again on the basis of the frame of discourse implicit in the new paradigm.

Overall, our responses to critics who question the relevance of the coherent approach based on a fixed frame of reference can be summarised as follows. So far as the scope and limits of Bayesian theory are concerned: (i) we acknowledge that the mode of reasoning encapsulated within the quantitative coherence theory is ultimately conditional, and thus not directly applicable to every phase of the scientific process; and (ii) our arguments thus far, and those to follow, are an attempt to convince the reader that within this latter context there are compelling reasons for adopting the Bayesian approach to statistical theory and practice. In addition, critics of the Bayesian approach should recognise that: (i) an enormous amount of current theoretical and applied statistical activity is concerned with the analysis of uncertainty in the context of models which are accepted, for the purposes of the analysis, as working frames of discourse; (ii) informal, exploratory techniques are an essential part of the process of generating ideas, and there can be no purely "statistical" theory of model formulation, although the process of passing from such ideas to their mathematical representation can often be subjected to formal analysis; and (iii) we all lack a decent theoretical formulation of, and solution to, the problem of global model criticism in the absence of concrete suggested alternatives.
Updating Subjective Probability

An issue related to the topic just discussed is that of the mechanism for updating subjective probabilities. In Section 2.4 we defined, in terms of a conditional uncertainty relation, the notion of the conditional probability, P(E | G), of an event E given the assumed occurrence of an event G. From this, we derived Bayes' theorem, which establishes that P(E | G) = P(G | E) P(E) / P(G); the prior probability P(E) has thus been updated to the posterior probability P(E | G). The quantitative coherence approach is based on the assumption that, if we actually know for certain that G has occurred, P(E | G) becomes our actual degree of belief in E. However, a number of authors have questioned whether it is justified to identify, in this way, assessments made conditional on the assumed occurrence of G with actual beliefs once G is known. We shall not pursue this issue further, although we acknowledge its interest and potential importance. Detailed discussion and relevant references can be found in Good (1977) and in Diaconis and Zabell (1982), who discuss, in particular, Jeffrey's rule (Jeffrey, 1965/1983), and in Goldstein (1985), who examines the role of temporal coherence.
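The updating mechanism can be illustrated with a small numerical sketch; the particular probability values below are, of course, purely illustrative and not from the text.

```python
# Illustrative check of Bayes' theorem: P(E|G) = P(G|E) P(E) / P(G).
# All numbers are arbitrary; any consistent joint assignment would do.
p_E = 0.3               # prior degree of belief P(E)
p_G_given_E = 0.8       # P(G | E)
p_G_given_notE = 0.2    # P(G | not E)

# Total probability: P(G) = P(G|E) P(E) + P(G|not E) P(not E)
p_G = p_G_given_E * p_E + p_G_given_notE * (1 - p_E)

# Posterior degree of belief, once G is known for certain
p_E_given_G = p_G_given_E * p_E / p_G
print(round(p_E_given_G, 3))   # prints 0.632
```

Here the prior belief 0.3 in E is updated, on learning G, to the posterior 0.24/0.38 ≈ 0.632.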
Relevance of the Axiomatic Approach

Arguments against overconcern with foundational issues come in many forms. At one extreme, we have heard proponents of supposedly "model-free" exploratory methodology proclaim that we can evolve towards "good practice" by simply giving full encouragement to the creative imagination and then "seeing what works". At the other extreme, we have heard Bayesian colleagues argue that the mechanics and flavour of the Bayesian inference process have their own sufficient, direct, intuitive appeal and do not need axiomatic reinforcement. Our objection to both these attitudes is that they each implicitly assume the existence of a commonly agreed notion of what constitutes "desirable statistical practice". This does not seem to us a reasonable assumption at all; an operational definition of the notion is required. Another form of this argument asserts that developments from axiom systems are "pointless" because the conclusions are, tautologically, contained in the premises. Although this is literally true, we simply do not accept that the methodological imperatives which flow from the assumptions of quantitative coherence are in any way "obvious" to someone contemplating the axioms.
Structure of the Set of Relevant Events

But is the structure assumed for the set of relevant events too rigid? In particular, is it reasonable to assume that, in each and every context involving uncertainty, the logical description of the possibilities should be forced into the structure of an algebra (or σ-algebra), in which each event has the same logical status? It seems to us that this may not always be reasonable, and that there is a potential need for further research into the implications of applying appropriate concepts of quantitative coherence to event structures other than simple algebras. This problem has already been considered in relation to the foundations of quantum mechanics, where the notion of "sample space" has been generalised to allow for the simultaneous representation of the outcomes of a set of "related" experiments (see, for example, Randall and Foulis, 1975), and it has been established that there exists a natural extension of the Bayesian paradigm to this more general setting.

Another area where the applicability of the standard paradigm has been questioned is that of so-called "knowledge-based expert systems", which often operate on knowledge representations involving complex and loosely structured spaces of possibilities, including hierarchies and networks. Proponents of such systems have argued that (Bayesian) probabilistic reasoning is incapable of analysing these structures and that novel forms of quantitative representation of uncertainty are required, which include "fuzzy logic", "belief functions" and "confirmation theory" (see Spiegelhalter and Knill-Jones, 1984, and ensuing discussion, for references to these ideas). Such alternative proposals are, for the most part, ad hoc, and the challenge to the probabilistic paradigm seems to us to be elegantly answered by Lauritzen and Spiegelhalter (1988). We shall return to this topic in Chapters 4 and 6.

Finally, another form of query relating to the logical status of events is sometimes raised (see, for example, Barnard, 1980a). This draws attention to the interpretational asymmetry between a statement like "the underlying distribution is normal" and its negation, and raises questions about their implicitly symmetric treatment within the framework given in Section 2.2. Choices of the elements to be included in the set of relevant events are, of course, bound up with general questions of "modelling", and the issue here seems to us to be one concerning sensible modelling strategies. We shall return to this topic later in this section.

Prescriptive Nature of the Axioms

When introducing our formal development, we emphasised that the Bayesian foundational approach is prescriptive and not descriptive: desirable practice requires, at least, coherence, so as to avoid "Dutch-book" inconsistencies.
We are concerned with understanding how we ought to proceed, if we wish to avoid a specified form of behavioural inconsistency; we are not concerned with sociological or psychological description of actual behaviour. Despite this, many critics of the Bayesian approach have somehow taken comfort from the fact that there is empirical evidence, from experiments involving hypothetical gambles, which suggests that people often do not act in conformity with the coherence axioms: see, for example, Allais (1953) and Ellsberg (1961); see, also, Wallsten (1974), Kahneman and Tversky (1979), Savage (1980), Kahneman et al. (1982), Machina (1987), Bordley (1992), Luce (1992) and Yilmaz (1992).

Allais' criticism is based on a study of the actual preferences of individuals in contexts where they are faced with pairs of hypothetical situations, like those described in Figure 2.9, in each of which a choice has to be made between two options; C stands for current assets and the numbers describe thousands of units of a familiar currency. In situation 1, option 1 yields an additional 500 with certainty, whereas option 2 yields an additional 2500, 500 or 0 with probabilities 0.10, 0.89 and 0.01, respectively. In situation 2, option 3 yields an additional 500 with probability 0.11, and otherwise nothing, whereas option 4 yields an additional 2500 with probability 0.10, and otherwise nothing. It has been found (see, for example, Allais and Hagen, 1979) that there are a great many individuals who prefer option 1 to option 2 in the first situation, and at the same time prefer option 4 to option 3 in the second situation.

To examine the coherence of these two revealed preferences, we note that, if they are to correspond to a consistent utility ordering, there must exist a utility function u(.), defined over consequences (in this case, total assets in thousands of monetary units), satisfying the inequalities

u(500 + C) > 0.10 u(2500 + C) + 0.89 u(500 + C) + 0.01 u(C)

and

0.10 u(2500 + C) + 0.90 u(C) > 0.11 u(500 + C) + 0.89 u(C).

But simple rearrangement reveals that these inequalities are logically incompatible for any function u(.); the stated preferences are therefore incoherent.

Savage (1954/1972, Chapter 5) pointed out that a concrete realisation of the options described in the two situations could be achieved by viewing the outcomes as prizes from a lottery involving one hundred numbered tickets, as shown in Table 2.4.

Table 2.4  Savage's reformulation of Allais' example

                ticket number     1        2-11      12-100
situation 1     option 1        500+C     500+C     500+C
                option 2          C      2500+C     500+C
situation 2     option 3        500+C     500+C       C
                option 4          C      2500+C       C

In the case of Allais' example, it is clear that if any of the tickets numbered from 12 to 100 is chosen, it will not matter, in either situation, which of the options is selected. Preferences in both situations should therefore only depend on considerations relating to tickets in the range from 1 to 11; and, for this range of tickets, situations 1 and 2 are identical in structure, so that, when the problem is set out in this form, preferring option 1 to option 2 and at the same time preferring option 4 to option 3 is now seen to be indefensible.

How should one react to this conflict between the compelling intuitive attraction (for many individuals) of the originally stated preferences, and the realisation that they are not in accord with the prescriptive requirements of the formal theory? Allais and his followers would argue that the force of examples of this kind is so powerful that it undermines the whole basis of the axiomatic approach set out in Section 2.3. This seems to us a very peculiar argument. It is as if one were to argue for the abandonment of ordinary logical or arithmetic rules, on the grounds that individuals can often be shown to perform badly at deduction or long division. The conclusion to be drawn is surely the opposite: namely, the more liable people are to make mistakes, the more need there is to have the formal prescription available, both as a reference point, to enable us to discover the kinds of mistakes and distortions to which we are prone in ad hoc reasoning, and also as a suggestive source of improved strategies for thinking about and structuring problems.
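The logical incompatibility of the two stated preferences can be verified mechanically. The sketch below (purely illustrative, with arbitrary utility values) checks that the left-hand sides and right-hand sides of the two expected-utility inequalities always sum to exactly the same quantity, so that both inequalities can never hold at once, whatever the utility function.

```python
# Sketch: for ANY utilities u0 = u(C), u5 = u(500+C), u25 = u(2500+C),
# the two Allais preferences would require
#   u5 > 0.10*u25 + 0.89*u5 + 0.01*u0          (option 1 over option 2)
#   0.10*u25 + 0.90*u0 > 0.11*u5 + 0.89*u0     (option 4 over option 3)
# Adding the two left-hand sides and the two right-hand sides gives the
# same total, so the inequalities are jointly unsatisfiable.
import random

def allais_incoherent(u0, u5, u25):
    lhs = u5 + 0.10 * u25 + 0.90 * u0
    rhs = (0.10 * u25 + 0.89 * u5 + 0.01 * u0) + (0.11 * u5 + 0.89 * u0)
    return abs(lhs - rhs) < 1e-9   # equal up to floating-point rounding

# Try many arbitrary utility assignments: the identity always holds.
random.seed(0)
print(all(allais_incoherent(random.uniform(0, 100), random.uniform(0, 100),
                            random.uniform(0, 100)) for _ in range(1000)))
```

Since the two sides coincide identically, strict "greater than" in both inequalities simultaneously is impossible: this is the "simple rearrangement" referred to above.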
The lesson of Savage's analysis is that, when confronted with complex or tricky problems, we must be prepared to shift our angle of vision in order to view the structure in terms of more concrete and familiar images with which we feel more comfortable. Viewed in this way, Allais' problem takes on the appearance of a decision-theoretic version of an "optical illusion", achieved through the distorting effects of "extreme" consequences, which go far beyond the ranges of our normal experience. Moreover, the perceived incoherence may, in fact, disappear if one takes into account the possibility that the experimental subjects' utility may be a function of more than one attribute: in addition to the monetary consequences specified in the hypothetical gambles, we may need to consider the attribute "avoidance of looking foolish".

Ellsberg's (1961) criticism is of a similar kind to Allais', but the "distorting" elements which are present in his hypothetical gambles stem from the rather vague nature of the uncertainty mechanisms involved, rather than from the extreme nature of the consequences. Here, confusion is engendered by the probabilities rather than the utilities, often as a result of thinking that there is a "right answer" if the problem seems predominantly to do with sorting out "experimentally assigned" probabilities. Even without such refinements, and arguing solely in terms of the gambles themselves, Raiffa (1961) and Roberts (1963) have provided clear and convincing rejoinders to the Ellsberg criticism. The form of argument used is similar to that in Savage's rejoinder to Allais, also making use of the analogy with "optical" and "magical" illusions, and we shall not repeat the details here. Roberts, in particular, presents a lucid and powerful defence of the axioms. For a recent discussion of both the Allais and Ellsberg phenomena, see Kadane (1992).

Precise, Complete, Quantitative Preferences

In our axiomatic development we have not made the a priori assumption that all options can be compared directly using the preference relation. We have, however, assumed, in Axiom 5, that all consequences and certain general forms of dichotomised options can be compared with dichotomised options involving standard events. This latter assumption then turns out to imply a quantitative basis for all preferences, and hence for beliefs and values. The view has been put forward by some writers (e.g., Keynes, 1921/1929, and Koopman, 1940) that not all degrees of belief are quantifiable, or even comparable. Beginning with Jeffreys' review of Keynes' Treatise (see also Jeffreys, 1931/1973), the general response to this view has been that some form of quantification is essential if we are to have an operational, scientifically useful theory. Nevertheless, there has been a widespread feeling that the demand for precise quantification, implicit in "standard" axiom systems, is rather severe and certainly ought to be questioned. Other references, together with a thorough review of the mathematical consequences of these kinds of assumptions, are given by Fine (1973, Chapter 2).
Among the attempts to present formal alternatives to the assumption of precise quantification are those of Good (1950, 1962), Kyburg (1961), Smith (1961), Dempster (1967, 1968), Walley and Fine (1979), Giron and Rios (1980), DeRobertis and Hartigan (1981), Chateauneuf and Jaffray (1984), Walley (1987, 1991) and Nakamura (1993). In essence, the suggestion in relation to probabilities is to replace the usual representation of a degree of belief in terms of a single number by an interval defined by two numbers, to be interpreted as "upper" and "lower" probabilities. An obvious, if often technically involved, solution is to consider simultaneously all probabilities which are compatible with elicited comparisons. Particular ideas, such as Shafer's (1976, 1982a) theory of "belief functions" and Dempster's (1968) generalization of the Bayesian inference mechanism, have been shown to be suspect (see, for example, Wasserman, 1990a, 1990b), but have led on themselves to further generalizations; their operational content has thus far eluded us. In general, so far as decisions are concerned, such theories lead to the identification of a class of "would-be" actions, but provide no operational guidance as to how to choose from among these. We therefore echo our earlier detailed commentary on Axiom 5 in Section 2.3, to the effect that these kinds of proposed extension of the axioms seem to us to be based on a confusion of the descriptive and the prescriptive, and to be largely unnecessary. It is rather as though physicists and surveyors were to feel the need to rethink their practices on the basis of a physical theory incorporating explicit concepts of upper and lower lengths; although, of course, it might well be argued that "measurement" of beliefs and values is not totally analogous to that of physical "length". The question of whether quantification should be precise, or allowed to be imprecise, is certainly an open, debatable one, and we would not wish to be dogmatic about this. In this work, we shall proceed on the basis of a prescriptive theory which assumes precise quantification, i.e., that comparisons with standard options can be successively refined without limit, but then pragmatically acknowledges that, in practice, all this should be taken with a large pinch of salt and a great deal of systematic sensitivity analysis. This and other forms of "robust Bayesian" approaches will be reviewed in Section 5.6. For a related practical discussion, see Hacking (1965).
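The "upper and lower probability" idea just described can be sketched numerically. The elicited constraints below are purely hypothetical: for three exclusive and exhaustive events a, b and c, the interval of values of P(a or b) compatible with the constraints is obtained by optimising over all compatible probability vectors.

```python
# Illustrative sketch of lower/upper probabilities (hypothetical constraints):
# events a, b, c are exclusive and exhaustive, and the elicited comparisons
# are taken to pin down only  P(a) in [0.1, 0.3]  and  P(b) in [0.2, 0.4].
# The "probability" of the event {a, b} is then an interval, not a number.
def bounds_on_union(grid=1000):
    # work in integer thousandths to avoid floating-point edge effects
    lo, hi = grid, 0
    for a in range(100, 301):          # P(a) in [0.100, 0.300]
        for b in range(200, 401):      # P(b) in [0.200, 0.400]
            if a + b <= grid:          # P(c) = 1 - P(a) - P(b) >= 0
                lo = min(lo, a + b)
                hi = max(hi, a + b)
    return lo / grid, hi / grid

print(bounds_on_union())   # prints (0.3, 0.7)
```

The lower probability 0.3 and upper probability 0.7 bracket every single-number assessment compatible with the elicited constraints, which is exactly the "class of would-be assessments" flavour of such theories.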
Subjectivity of Probability

As we stressed in Section 2.2, the notion of preference between options is the primitive operational concept which underlies all our other definitions, and it is to be understood as personal, in the sense that it derives from the response of a particular individual to a decision making situation under uncertainty.
A particular consequence of this is that the concept which emerges is personal degree of belief, defined in Section 2.4 and subsequently shown to combine for compound events in conformity with the properties of a finitely additive probability measure. The "individual" referred to above could, of course, be some kind of group, such as a committee, provided the latter had agreed to "speak with a single voice"; in that case, to the extent that we ignore the processes by which the group arrives at preferences, it can conveniently be regarded as a "person". Further comments on the problem of individuals versus groups will be given later under the heading Communication and Group Decision Making.

This idea that personal (or subjective) probability should be the key to the "scientific" or "rational" treatment of uncertainty has proved decidedly unpalatable to many statisticians and philosophers, although in some application areas, such as actuarial science, it has met with a more favourable reception (see Clarke, 1954). At the very least, it appears to offend directly against the general notion that the methods of science should, above all else, have an "objective" character. Nevertheless, bitter though the subjectivist pill may be, and admittedly difficult to swallow, the alternatives are either inert, or have unpleasant and unexpected side-effects, or, to the extent that they appear successful, are found to contain subjectivist ingredients.

From an historical point of view, there have emerged two alternative kinds of approach to the definition of "probability", both seeking to avoid the subjective degree of belief interpretation. The first of these retains the idea of probability as a measurement of partial belief, but rejects the subjectivist interpretation of the latter, regarding it, instead, as a unique degree of partial logical implication between one statement and another. This logical view was given its first explicit formulation by Keynes (1921/1929) and was later championed by Carnap (1950/1962) and others; it is interesting to note, however, that Keynes seems subsequently to have changed his view and acknowledged the primacy of the subjectivist interpretation (see Good, 1965, Chapter 2). Brown (1993) proposes the related concept of "impersonal" probability. The second approach asserts that the notion of probability should be related in a fundamental way to certain "objective" aspects of physical reality, such as symmetries or frequencies. The symmetry (or classical) view asserts that physical considerations of symmetry lead directly to a primitive notion of "equally likely cases"; the frequency view identifies probability with some notion of limiting relative frequency. The first systematic foundation of the frequentist approach is usually attributed to Venn (1886), with later influential contributions from von Mises (1928) and Reichenbach (1935); in some form or another, it is by far the most widely accepted objectivist position.

The case for the subjectivist approach and against the objectivist alternatives can be summarised as follows. The logical view is entirely lacking in operational content: unique probability values are simply assumed to exist as a measure of the degree of implication between one statement and another, to be intuited, in some undefined way, from the formal structure of the language in which these statements are presented.
From the objectivistic standpoint, any uncertain situation typically possesses many plausible "symmetries": a truly "objective" theory would therefore require a procedure for choosing a particular symmetry and for justifying that choice. The subjectivist view explicitly recognises that regarding a specific symmetry as probabilistically significant is itself, inescapably, an act of personal judgement. Similarly, the frequency view can only attempt to assign a measure of uncertainty to an individual event by embedding it in an infinite class of "similar" events having certain "randomness" properties, a "collective" in von Mises' (1928) terminology, and then identifying "probability" with some notion of limiting relative frequency. But there are obvious difficulties in defining the underlying notions of "similar" and "randomness" without lapsing into some kind of circularity, quite apart from the need for an operational definition of what is meant by "similar". Moreover, an individual event can be embedded in many different "collectives", with no guarantee of the same resulting limiting relative frequencies: a truly "objective" theory would therefore require a procedure for justifying the choice of a particular embedding sequence. The subjectivist view explicitly recognises that any assertion of "similarity" among different, individual events is itself, inescapably, an act of personal judgement.

The identification of probability with frequency or symmetry seems to us to be profoundly misguided. It is of paramount importance to maintain the distinction between the definition of a general concept and the evaluation of a particular case. In the subjectivist approach, the definition derives from logical notions of quantitative coherent preferences; practical evaluations in particular instances often derive from perceived symmetries and observed frequencies, and it is only in this evaluatory process that the latter have a role to play.

The subjectivist point of view outlined above is, of course, not new, and has been expounded at considerable length and over many years by a number of authors. The idea of probability as individual "degree of confidence" in an event whose outcome is uncertain seems to have been first put forward by James Bernoulli (1713/1899). However, it was not until Thomas Bayes' (1763) famous essay that it was explicitly used as a definition: "The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening."

In the subjectivist approach, the requirement of a judgement of "similarity" finds natural expression in the concept of an exchangeable sequence of events. This concept, which we shall discuss at length in Chapter 4, provides, via the celebrated de Finetti representation theorem, an elegant and illuminating explanation, from an entirely subjectivistic perspective, of the fundamental role of symmetries and frequencies in the structuring and evaluation of personal beliefs. It also provides a meaningful operational interpretation of the word "objective" in terms of "intersubjective consensus".
Stutisricul Inference us u Decision Prohlm Sty lised statistical problems have often been approached from adecisiontheoretical viewpoint. . see. 1974).7. 1970/1975). requiring separate discussion. ofcourse. Hacking ( 1975). Fishburn (1964). we have already made clear that. in our view.Fine (1973). We believe that the two contexts. Other interpretations of probability are discussed in Renyi (1955). the supposed dichotomy between inference and decision is illusory. Walley and Fine ( 1979) and Shafer ( 1990). Berger ( I985a) and references therein. and also "cohesivesmalleroup" decision making processcs. A number of later contributions to the field of subjective probability are collected together and discussed in the volume edited by Kyburg and Smokier (i964/1980). and a number of inforrnationtheoretic ideas and their applications were given a unified interpretation within a purely subjectivist Bayesian framework. DeGroot (1970). Section 3). The expected utility of an "experiment" in this context was then seen to be identified with expected information (in the Shannon sense). Good (1959). In Section 2. An exhaustive and profound discussion of all aspects of subjective probability is given in de Finetti's magisterial Tlieory qf Prohubility (1970/1974. We shall deal with these topics in more detail in Chapters 5 and 6. since any report or communication of beliefs following the receipt of information inevitably itself constitutes a form of action. A seemingly powerful argument against the use of the Bayesian paradigm is therefore that it provides an inappropriate basis for the kinds of interpersonal communication and reporting processes which characterise both public debate about beliefs regarding scientific and social issues. which includes important seminal papers by Ramsey (1926) and de Finetti (193711964). the books by Ferguson (1967). 
Many approaches tostatistical inference do not.102 2 Foundutions Not only is this directly expressed in terms of operational comparisons of certain kinds of simple options on the basis of expected values. "public" and "cohesivesmall:Iroup". Coniniutiicution trnd Group Decision M d i i i g The Bayesian approach which has been presented in this chapter is predicated on the primitive notion of individirul preference. and concentrate instead on stylised estimation and hypothesis testing formulations of the problem (see Appendix B. However. we formalised this argument and characterised the utility structure that is typically appropriate for consequences in the special case of a "pure inference" problem. for instance. Barnett ( 1973/1982). de Finetti ( 1978). Kyburg (1961. but the style of Bayes' presentation strongly suggests that these expectations were to be interpreted as personal evaluations. pose rather different problems. assign a primary role to reporting probability distributions.
In the case of the revision and communication of beliefs in the context of general scientific and social debate, we need to distinguish two rather different activities: on the one hand, the prescriptive processes by which we ought individually to revise our beliefs in the light of new information if we aspire to coherence; on the other hand, the pragmatic processes by which we seek to report to and share perceptions with others. The first of these processes leads us inescapably to the conclusion that beliefs should be handled using the Bayesian paradigm; the second reminds us that a "one-off" application of the paradigm to summarise a single individual's revision of beliefs is inappropriate in this context. Indeed, so far as we are aware, no Bayesian statistician has ever argued that the latter would be appropriate. On the broader issue, we feel that criticism of the Bayesian paradigm is largely based on a misunderstanding of the issues involved, and on an oversimplified view of the paradigm itself. The whole basis of the subjectivist philosophy predisposes Bayesians to seek to report a rich range of the possible belief mappings induced by a data set, the range being chosen both to reflect (and even to challenge) the initial beliefs of a range of interested parties. Indeed, one of the most attractive features of the Bayesian approach is its recognition of the legitimacy of the plurality of (coherently constrained) responses to data. Any approach to scientific inference which seeks to legitimise an answer in response to complex uncertainty seems to us a totalitarian parody of a would-be rational human learning process.

We concede that much remains to be done in developing Bayesian reporting technology, and we conjecture that modern interactive computing and graphics will have a major role to play. Some discussion of the Bayesian reporting process may be found in Dickey (1973), Dickey and Freeman (1975) and Smith (1978), together with a review of the connections between this issue and the role of models in facilitating communication and consensus; further discussion is given in Smith (1984). This latter topic will be further considered in Chapter 4. Some of the literature on expert systems is relevant here; see, for instance, Lindley (1987), Spiegelhalter (1987) and Gaul and Schader (1988).

In the "cohesive small-group" context, on the other hand, there may be an imposed need for group belief and decision, and much depends on how the group is structured in relation to such issues as "democracy", "information-sharing", "negotiation" or "competition". It is not yet clear to us whether the analyses of these issues will impinge directly on the broader controversies regarding scientific inference methodology, and so we shall not attempt a detailed review of the considerable literature that is emerging. A variety of problems can be isolated within this framework, depending on whether the emphasis is on combining probabilities, or utilities, or both. Useful introductions to the extensive literature on the amalgamation of beliefs or utilities, together with most of the key references, are provided by Arrow (1951b), Luce and Raiffa (1957), Luce (1959), Stone (1961), Blackwell and Dubins (1962), Fishburn (1964, 1986), Kogan and Wallace (1964), Wilson (1968), Winkler (1968, 1981), Sen (1970), Kranz et al. (1971), Marschak and Radner (1972), Cochrane and Zeleny (1973), DeGroot (1974, 1980), Morris (1974), White and Bowen (1975), White (1976a, 1976b), Press (1978, 1980b, 1985b), Lindley et al. (1979), Roberts (1979), Hogarth (1980), Saaty (1980), DeGroot and Kadane (1980), Berger (1981), Eliashberg and Winkler (1981), French (1981, 1985, 1986), Hylland and Zeckhauser (1981), Weerahandi and Zidek (1981, 1983), Brown and Lindley (1982, 1986), Chankong and Haimes (1982), Edwards and Newman (1982), Goicoechea et al. (1982), Kadane and Larkey (1982, 1983), Raiffa (1982), DeGroot and Feinberg (1982, 1983, 1986), Lindley (1983, 1985, 1986), Bunn (1984), Genest (1984a, 1984b), Yu (1985), De Waal et al. (1986), Genest and Zidek (1986), Arrow and Raynaud (1987), Clemen and Winkler (1987, 1990, 1993), Kim and Roush (1987), Barlow et al. (1988), Bayarri and DeGroot (1988, 1989, 1993), Huseby (1988), Smith (1988b), West (1988, 1992a), Caro et al. (1989), Clemen (1989), Seidenfeld et al. (1989), Nau and McCardle (1990), Rios (1990), DeGroot and Mortera (1991), Kelly (1991), Lindley and Singpurwalla (1991, 1993), Goel et al. (1992), Rios et al. (1992), Normand and Tritchler (1992), Gilardoni and Clayton (1993), Kadane and Seidenfeld (1992) and Keeney (1992). Important seminal papers are reproduced in Gärdenfors and Sahlin (1988). References relating to the Bayesian approach to game theory include Harsanyi (1967), Wilson (1986), Aumann (1987) and Young and Smith (1991). For related discussion in the context of policy analysis, see Hodges (1987). A recent review of related topics, followed by an informative discussion, is provided by Kadane (1993).
BAYESIAN THEORY, by José M. Bernardo and Adrian F. M. Smith. Copyright © 2000 by John Wiley & Sons, Ltd.

Chapter 3
Generalisations

Summary

The ideas and results of Chapter 2 are extended to a much more general mathematical setting. An additional postulate concerning the comparison of a countable collection of events is introduced and is shown to provide a justification for restricting attention to countably additive probability as the basis for representing beliefs. The elements of mathematical probability theory are reviewed. The notions of options and utilities are extended to provide a very general mathematical framework for decision theory. A further additional postulate regarding preferences is introduced, and is shown to justify the criterion of maximising expected utility within this more general framework. In the context of inference problems, generalised definitions of score functions and of measures of information and discrepancy are given.

3.1 GENERALISED REPRESENTATION OF BELIEFS

3.1.1 Motivation

The developments of Chapter 2, based on Axioms 1 to 5, led to the fundamental result that quantitatively coherent degrees of belief for events belonging to the algebra E should fit together in conformity with the mathematical rules of finitely additive probability. From a directly practical and intuitive point of view, there seems no compelling reason to require anything beyond this finitistic framework, a view argued forcefully, and in great detail, by de Finetti (1970/1974, 1970/1975). However, there are many situations where the implied necessity of choosing a particular finitistic representation of a problem can lead to annoying conceptual and mathematical complications, as we remarked in Chapter 2 when discussing bounded sets of consequences. The example given in that context involved the problem of representing the length of remaining life of a medical patient. Most people would accept that there is an implicit upper bound, but find it difficult to justify any particular choice of its value. Similar problems obviously arise in representing other forms of survival time (of equipment, transplanted organs, or whatever), and further difficulties occur in representing the possible outcomes of many other measurement processes, since these are generally regarded as being on a continuous scale. For these reasons, it is certainly attractive, from the point of view of descriptive and mathematical convenience, to consider the possibility of extending our ideas beyond the finite, discrete framework, provided we do not feel that in so doing we are distorting essential features of our beliefs.

In this chapter, we shall provide a formal extension of the quantitative coherence theory to the infinite domain. Our fundamental conclusion about beliefs in this setting will be that quantitatively coherent degrees of belief for events belonging to a σ-algebra E should fit together in conformity with the mathematical rules of countably additive probability. The major mathematical advantage of this generalised framework is that all the standard manipulative tools and results of mathematical probability theory then become available to us, thus establishing an extremely general mathematical setting for the representation and analysis of decision problems; convenient references are, for example, Kingman and Taylor (1966) or Ash (1972). A selection of these tools and results will be reviewed in Section 3.2, and then used in Section 3.3 to develop natural extensions of our finitistic definitions of actions and utilities. The important special case of inference as a decision problem will be considered in Section 3.4. Finally, a discussion of some particular issues is given in Section 3.5.

3.1.2 Countable Additivity

In Definition 2.1 and the subsequent discussion, we assumed that the collection E of events included in the underlying frame of discourse should be closed under the operations of arbitrary finite intersections and unions. As the first step in providing a mathematical extension to the infinite domain, we shall now assume that we allow arbitrary countable intersections and unions in E, so that the latter is taken to be a σ-algebra. Within this extended structure, Axioms 1 to 5 will continue to encapsulate the requirements of quantitative coherence for preferences, provided that only finite combinations of events of E are involved. However, if we wish to deal with countable combinations of events, we shall need an extension of the existing requirements for quantitative coherence. One possible such extension is encapsulated in the following postulate.

Postulate 1. (Monotone continuity). If E_j, j = 1, 2, ..., and F are events such that E_j ⊇ E_{j+1} for all j, ∩_{j=1}^∞ E_j = ∅, and E_j ≥ F for all j, then ∅ ≥ F.

Discussion of Postulate 1. If the relation E_j ≥ F holds for every member of a decreasing sequence of events E_1 ⊇ E_2 ⊇ ..., and if we accept the limit event ∩_j E_j into our frame of discourse, then it would seem very natural in terms of "continuity" that the relation should "carry over", so that ∩_j E_j ≥ F; when ∩_j E_j = ∅, this gives ∅ ≥ F. If, moreover, we admit the possibility, for descriptive and mathematical convenience, of an infinite frame of discourse, then this form of continuity would seem to be a minimal requirement for coherence. The operational justification for considering such a countable sequence of comparisons is, however, certainly open to doubt.

We note first that the condition encapsulated in Postulate 1 carries over to conditional preferences. Since we have already established in Proposition 2.17 that P(· | G) is a finitely additive probability measure, the extension based on the postulate of monotone continuity enables us to establish immediately the following results.

Proposition 3.1. (Continuity at ∅). If E_j, j = 1, 2, ..., are events such that E_j ⊇ E_{j+1}, for all j, and ∩_{j=1}^∞ E_j = ∅, then, for any G > ∅,

  lim_{j→∞} P(E_j | G) = 0.

Proof. By Proposition 2.16, E_j ⊇ E_{j+1} implies that P(E_j | G) ≥ P(E_{j+1} | G) ≥ 0, for all j, and so there exists a number p ≥ 0 such that lim_{j→∞} P(E_j | G) = p. There exists a standard event S such that μ(S) = p and, for all j, P(E_j | G) ≥ p, so that, by Axiom 4(iii) and Proposition 2.14(i), E_j ∩ G ≥ S ∩ G for all j. Since ∩_j (E_j ∩ G) = ∅, Postulate 1 gives ∅ ≥ S ∩ G, which implies that S ~ ∅ and, thus, that p = 0.

Proposition 3.2. (Countably additive structure of degrees of belief). If {E_j, j = 1, 2, ...} are disjoint events in E and G > ∅, then

  P(∪_{j=1}^∞ E_j | G) = Σ_{j=1}^∞ P(E_j | G),

and thus, in this extended setting, P(· | G) is a countably additive probability measure.

Proof. Let F_n = ∪_{j>n} E_j, so that F_n ⊇ F_{n+1} and ∩_n F_n = ∅. Since P(· | G) is finitely additive we have, for any n ≥ 1,

  P(∪_j E_j | G) = Σ_{j=1}^n P(E_j | G) + P(F_n | G).

By Proposition 3.1, lim_{n→∞} P(F_n | G) = 0, and the result follows by taking limits in the last expression.

We shall consider the finite versus countable additivity debate in a little more detail in Section 3.5.2. The debate centres on whether this particular restriction to a subclass of the finitely additive measures should be considered as a necessary feature of quantitative coherence, or whether it is a pragmatic option, outside the quantitative coherence framework encapsulated in Axioms 1 to 5. From a philosophical point of view, we identify strongly with this latter viewpoint. For the present, we simply note that philosophical allegiance to the finitistic framework is in no way incompatible with the systematic adoption and use of countably additive probability measures for the overwhelming majority of applications; in almost all the developments which follow we shall rarely feel discomfited by implicitly working within a countably additive framework.

We have established (in Propositions 2.4 and 2.10) that if E and F are events in E with E ⊆ F, then P(E) ≤ P(F), so that if P(F) = 0 then P(E) = 0. However, not all subsets of an event of probability zero (a so-called null event) will, in general, belong to E, and so we cannot logically even refer to their probabilities, let alone infer that they are zero. In some circumstances it may be desirable, as well as mathematically convenient, to be able to do this. This can be done "automatically" by simply agreeing that E be replaced by the smallest σ-algebra which contains E and all the subsets of the null events of E (the so-called completion of E). The induced probability measure over this completion is unique and has the property that all subsets of null events are themselves null events; it is called a complete probability measure.

Definition 3.1. (Probability space). A probability space is defined by the elements (Ω, F, P), where F is a σ-algebra of subsets of Ω and P is a complete, σ-additive probability measure on F.
3.2 REVIEW OF PROBABILITY THEORY

3.2.1 Random Quantities and Distributions

In the framework we have been discussing, the constituent possibilities and probabilities of any decision problem are encapsulated in the structure of the probability space {Ω, F, P}. Recalling the discussion of Chapter 2, we might think of Ω as the "primitive" collection of all possible outcomes in a situation of interest, for example, that surrounding the birth of an infant, or the state of international commodity markets at a particular time point. Now, we are not really interested in a "complete description" of the outcomes, even if such were possible, but rather in some numerical summary of the outcomes, in the form of counts or measurements. It might be argued that "measurements" are always, in fact, "counts", in a certain abstract sense; however, we shall distinguish the two in the usual pragmatic (fuzzy) way: "counts" will typically mean integer-valued data; "measurements" will typically mean data which we pretend are real-valued.

We move, therefore, from {Ω, F, P} to a more explicitly numerical setting by invoking a mapping

  x : Ω → X ⊆ ℝ,

which associates a real number x(ω) with each elementary outcome ω of Ω (our initial exposition will be in terms of a single-valued x; the vector extension will be made later in Section 3.2). The material which follows (in Section 3.2) will differ in flavour somewhat from our preceding discussion of the general foundations of coherent beliefs and actions (Chapter 2 and Section 3.1), and from our subsequent discussions of generalised decision problems (Section 3.3) and of the link between beliefs about observables and the structure of specific models for representing such beliefs (Chapter 4). Those developments systematically invoke the subjectivist, operationalist philosophy as a basic motivation and guiding principle. Here, we shall concentrate instead on reviewing, from a purely mathematical standpoint, the concepts and results from mathematical probability theory which will provide the technical underpinning of our theory. We do not anticipate encountering situations where these mathematical assumptions lead to conceptual distortions beyond the usual, inevitable element of mathematical idealisation which enters any formal analysis. However, as de Finetti (1970/1974, 1970/1975) has so eloquently warned, one must always be on guard and aware that distortions might occur.

We shall wish to ensure that probabilities are well defined for events of interest in ℝ, and this will constrain the class of functions x which we would wish to use to define the numerical mapping. The standard requirement is that, for each x ∈ ℝ, {ω : x(ω) ≤ x} ∈ F, so that intervals (−∞, x], and hence all forms of interval, receive an induced probability, since the latter can be generated by appropriate countable unions and intersections of the former.

Definition 3.2. (Random quantity). A random quantity x : Ω → X ⊆ ℝ on {Ω, F, P} is a function such that x⁻¹(B) ∈ F for all B ∈ B, where B is the σ-algebra of Borel sets, the smallest σ-algebra containing all intervals of the form (−∞, x], x ∈ ℝ.

Following de Finetti (1970/1974, 1970/1975), we use the term random quantity to signify a numerical entity whose value is uncertain, rather than use the traditional, but potentially confusing, term random variable, which might suggest a restriction to contexts involving repeated "trials" over which the quantity may vary. Notationally, we shall use the same symbol for both a random quantity and its value; thus, x may denote a function, or a particular value of the function. The interpretation will always be clear from the context.

Subsets of Ω are thus mapped into subsets of ℝ, and the probability measure P defined on F will induce a probability measure, P_x say, over appropriate subsets of ℝ, defined in the natural way by

  P_x(B) = P(x⁻¹(B)),  B ∈ B;

P_x is easily seen to be a probability measure. This information can also be encapsulated in a single real-valued function.

Definition 3.3. (Distribution function). The distribution function of a random quantity x : Ω → X ⊆ ℝ on {Ω, F, P} is the function F_x : ℝ → [0, 1] defined by

  F_x(x) = P({ω : x(ω) ≤ x}),  x ∈ ℝ.

If the probability distribution concentrates on a countable set of values, so that X = {x_1, x_2, ...}, x is called a discrete random quantity, and the function p_x : ℝ → [0, 1] defined by

  p_x(x) = P({ω : x(ω) = x})

is called its probability (mass) function; it describes the way in which probability is "distributed" over the possible values x_i ∈ X. The distribution function is then a step function with jumps p_x(x_i) at each x_i. If the probability distribution is such that there exists a real, non-negative (measurable) function p_x such that

  F_x(x) = ∫ from −∞ to x of p_x(t) dt,

then x is called an (absolutely) continuous random quantity and p_x is called its density function. In measure-theoretic terms, functions such as p_x are, of course, both special cases of the Radon–Nikodym derivative with respect to the relevant measure. In general, we might have a mixture of both discrete and continuous elements; no use of singular distributions will be made in this volume (for discussion of such distributions, see, for example, Ash, 1972, Section 2.2).

We shall use the same notation, p(x), to represent both the mass function of a discrete random quantity and the density function of a continuous random quantity, often omitting the suffix x when there is no danger of confusion. In addition, to avoid tedious repetition of phrases like "almost everywhere", we shall, when appropriate, simply state that densities are equal, leaving it to be understood that, in the continuous case, this means "equal, except possibly on a set of measure zero". Moreover, we shall use the notation and results of Lebesgue and Lebesgue–Stieltjes integration theory as and when it suits us. Readers unfamiliar with these concepts need not worry: virtually none of the machinery will be visible, and the meanings of integrals will rarely depend on the niceties of the interpretation adopted.

If x is a random quantity defined on {Ω, F, P} such that x : Ω → X ⊆ ℝ, and if g : ℝ → Y ⊆ ℝ is a function such that (g ∘ x)⁻¹(B) ∈ F for all B ∈ B, then g ∘ x is also a random quantity. We shall typically denote g ∘ x by g(x); whenever we refer to such functions of a random quantity x, it is to be understood that the composite function is indeed a random quantity. Writing y = g(x), the random quantity y induces a probability space {ℝ, B, P_y}, where

  P_y(B) = P_x(g⁻¹(B)) = P((g ∘ x)⁻¹(B)),  B ∈ B,

and F_y and p_y are defined in the obvious way. These forms are easily related to those of F_x and p_x. In particular, if g⁻¹ exists and g is strictly monotonic increasing, we have

  F_y(y) = P_x(g(x) ≤ y) = P_x(x ≤ g⁻¹(y)) = F_x(g⁻¹(y)),

and, if g is monotonic and differentiable, the density of y in the continuous case is given by

  p_y(y) = p_x(g⁻¹(y)) |d g⁻¹(y)/dy|.

Some examples of this relationship are given at the end of Section 3.2.2.
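The change-of-variable relation F_y(y) = F_x(g⁻¹(y)) discussed above is easy to check numerically. The following minimal sketch (our illustration, not from the text) takes x uniform on (0, 1) and the hypothetical transformation g(x) = −log(1 − x), for which g⁻¹(y) = 1 − e^{−y}; the distribution function of y = g(x) is then F_x(g⁻¹(y)) = 1 − e^{−y}, the standard exponential distribution function, and a Monte Carlo estimate should agree with it.

```python
import math
import random

def g(x):
    # strictly increasing transformation of a Uniform(0,1) random quantity
    return -math.log(1.0 - x)

def F_y(y):
    # F_y(y) = F_x(g^{-1}(y)); since F_x(t) = t on (0,1), this is 1 - exp(-y)
    return 1.0 - math.exp(-y)

random.seed(1)
n = 100_000
ys = [g(random.random()) for _ in range(n)]

for y0 in (0.5, 1.0, 2.0):
    empirical = sum(1 for y in ys if y <= y0) / n
    print(y0, round(empirical, 3), round(F_y(y0), 3))
```

With 100,000 draws the empirical and theoretical distribution functions agree to within a few parts in a thousand at each evaluation point.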
Definition 3.4. (Expectation). The expectation E[g(x)] of a function g of a random quantity x is defined by

  E[g(x)] = Σ g(x) p_x(x)  or  E[g(x)] = ∫ g(x) p_x(x) dx,

for the discrete and continuous cases, respectively, assuming, in each case, the right-hand side to exist. To simplify notation, we shall usually omit the range of summation or integration, assuming it to be understood from the context; most sums or integrals over possible values of random quantities will involve the complete set of possible values. To avoid tiresome duplication, we shall also typically use the integral form to represent both the continuous and the discrete cases.

The expectation operator of Definition 3.4 is linear, so that if x_1, x_2 are two random quantities and c_1, c_2 are finite real numbers, then

  E[c_1 x_1 + c_2 x_2] = c_1 E[x_1] + c_2 E[x_2].

If x, y are random quantities with y = g(x), then E[y] = E[g(x)], where the equality is to be interpreted in the sense that if either side exists so does the other, and they are equal.

It is useful to be able to summarise the main features of a probability distribution by quantities defined to encapsulate its location, spread, or shape, often in terms of special cases of Definition 3.4. Such summary quantities include:

(i) E[x], the mean of the distribution of the random quantity x;
(ii) E[x^k], the kth moment;
(iii) V[x] = E[(x − E[x])²] = E[x²] − E²[x], the variance;
(iv) D[x] = V[x]^{1/2}, the standard deviation;
(v) Mo[x], a mode of the distribution of x, such that p_x(Mo[x]) = sup p_x(x);
(vi) Q_α[x], an α-quantile of the distribution of x, such that P(x ≤ Q_α[x]) ≥ α, with Me[x] = Q_{0.5}[x], a median;
(vii) (Q_α[x], Q_{1−α}[x]), an interquantile range.
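The summary quantities (i)–(vii) can be computed directly from a probability mass function via Definition 3.4. A small illustration (ours, not the book's): the distribution of the number of heads in three tosses of a fair coin.

```python
import math

# p.m.f. of the number of heads in three fair coin tosses
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def expect(g):
    # Definition 3.4, discrete case: E[g(x)] = sum over x of g(x) p(x)
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x)                    # (i)   E[x]
var = expect(lambda x: (x - mean) ** 2)       # (iii) V[x]
sd = math.sqrt(var)                           # (iv)  D[x]
modes = [x for x, p in pmf.items()
         if p == max(pmf.values())]           # (v)   modes

def quantile(alpha):
    # (vi) least x with F(x) >= alpha
    cum = 0.0
    for x in sorted(pmf):
        cum += pmf[x]
        if cum >= alpha:
            return x

median = quantile(0.5)
print(mean, var, modes, median)  # 1.5 0.75 [1, 2] 1
```

Note that this distribution is bimodal (both 1 and 2 attain the supremum of the mass function), illustrating why (v) speaks of "a mode" rather than "the mode".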
In the special case of the linear transformation g(x) = cx, for some real constant c, we clearly have

  E[g(x)] = c E[x] = g(E[x]),
  D[g(x)] = (V[g(x)])^{1/2} = (c² V[x])^{1/2} = |c| D[x].

For general transformations g(x), however, E[g(x)] ≠ g(E[x]), and the moments of a transformed random quantity g(x) do not exactly relate in any straightforward manner to those of x. However, for suitably well-behaved g(x), the following result, which we shall illustrate at the end of Section 3.2.2, often provides useful approximations; more refined approximations are easily obtained by including higher order terms.

Proposition 3.3. (Approximate mean and variance). If x is a random quantity with E[x] = μ and V[x] = σ², and y = g(x), then, subject to conditions on the distribution of x and the smoothness of g,

  E[y] ≈ g(μ) + (σ²/2) g″(μ),
  V[y] ≈ [g′(μ)]² σ².

Outline proof. Expanding g(x) in a Taylor series about μ, we obtain

  g(x) = g(μ) + (x − μ) g′(μ) + ½ (x − μ)² g″(μ) + ...,

where we are assuming regularity conditions sufficient to ensure the adequacy of this approximation in what follows. Taking expectations immediately yields the approximate form for E[y]. Subtracting the latter approximation from both sides, squaring, taking expectations and ignoring higher order terms yields the result for V[y].

In Definition 2.12, we introduced the notion of the independence of two events, with subsequent generalisations to mutual independence and to conditional independence (Definition 2.13). These notions can be extended to random quantities in the following way.

Definition 3.5. (Mutual independence). The random quantities x_1, ..., x_n are mutually independent if, for any t_1, ..., t_n in ℝ, the events {ω : x_i(ω) ≤ t_i}, i = 1, ..., n, are mutually independent.

Definition 3.6. (Conditional independence). The random quantities x_1, ..., x_n are conditionally independent given y if, for any t_1, ..., t_n in ℝ, the events {ω : x_i(ω) ≤ t_i}, i = 1, ..., n, are conditionally independent given the event {ω : y(ω) ≤ t}, for all t ∈ ℝ.

We note that for mutually independent random quantities x_1, ..., x_n,

  E[Π x_i] = Π E[x_i]  and  V[Σ x_i] = Σ V[x_i].
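The approximate mean and variance result above can be compared with exact moments obtained by direct summation. A sketch (our own illustration, with an arbitrarily chosen binomial distribution and the transformation g(x) = x²): since g″ is constant, the second-order mean approximation is exact for this g, while the variance approximation is not, which makes the nature of the approximation visible.

```python
import math

# x ~ Binomial(n = 10, theta = 0.3); g(x) = x^2
n, theta = 10, 0.3
pmf = {x: math.comb(n, x) * theta**x * (1 - theta)**(n - x) for x in range(n + 1)}

def expect(g):
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x)                  # = n*theta = 3.0
sigma2 = expect(lambda x: (x - mu) ** 2)  # = n*theta*(1-theta) = 2.1

g = lambda x: x ** 2
dg = lambda x: 2 * x    # g'
d2g = lambda x: 2.0     # g''

approx_mean = g(mu) + 0.5 * sigma2 * d2g(mu)  # g(mu) + (sigma^2/2) g''(mu)
approx_var = dg(mu) ** 2 * sigma2             # [g'(mu)]^2 sigma^2

exact_mean = expect(g)
exact_var = expect(lambda x: (g(x) - exact_mean) ** 2)

print(approx_mean, exact_mean)  # 11.1 and 11.1: exact for quadratic g
print(approx_var, exact_var)    # 75.6 versus about 93.95: an approximation only
```

Including higher order terms of the Taylor expansion, as the text remarks, would close the gap in the variance approximation.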
Conditional independence will play a major role in our later discussion of modelling in Chapter 4.

Many forms of technical manipulation of probability distributions are greatly facilitated by working with some suitable transformation of the original density or distribution function. One of the most useful such transforms is the following.

Definition 3.7. (Characteristic function). The characteristic function of a random quantity x is the function φ_x, mapping ℝ to the complex plane, given by

  φ_x(t) = E[e^{itx}],  t ∈ ℝ.

Among the most important properties of the characteristic function, we note the following. We have φ_x(0) = 1, |φ_x(t)| ≤ 1, and φ_x is a uniformly continuous function of t. Two random quantities have the same distribution if and only if they have the same characteristic function. If x_1, ..., x_k are independent random quantities and s = Σ_{j=1}^k x_j, then

  φ_s(t) = Π_{j=1}^k φ_{x_j}(t).

If E[x^k] < ∞, then

  φ_x(t) = Σ_{j=0}^k (it)^j E[x^j]/j! + o(t^k).

Many similar properties hold for the closely related alternative transforms E[e^{tx}], the moment generating function, and E[t^x], the probability generating function.

3.2.2 Some Particular Univariate Distributions

In this section, we shall review a number of particular univariate distributions which are frequently used in applications, and list some of their properties and characteristics. We shall assume that the reader is familiar with most of this material, and detailed discussion and derivations are therefore not given. The books by Johnson and Kotz (1969, 1970) provide a mass of detail on these and other distributions.

One important initial warning is required! These distributions provide the building blocks for statistical models and are typically defined in terms of "parameters". The role and interpretation of "models" and "parameters" within the general subjectivist, operationalist framework are extremely important issues, which will be discussed at length in Chapter 4. For the present, these "parameters" should simply be regarded as "labels" of the various mathematical functions we shall be considering, although, as we shall see, these "labelling parameters" often relate closely to one or other of the characteristics of the distribution.
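The multiplicative property of characteristic functions for sums of independent random quantities is easy to verify numerically. In this sketch (our illustration; the parameter values are arbitrary), the characteristic function of a single Bernoulli(θ) quantity, (1 − θ) + θe^{it}, is raised to the nth power and compared with a direct evaluation of Definition 3.7 for the binomial sum of n such quantities.

```python
import cmath
import math

theta, n, t = 0.3, 8, 1.7  # illustrative values

# Characteristic function of one Bernoulli(theta) quantity:
# phi(t) = E[exp(itx)] = (1 - theta) + theta * exp(it)
phi_bernoulli = (1 - theta) + theta * cmath.exp(1j * t)

# For the sum of n independent Bernoullis, phi_s(t) is the product of n factors
phi_product = phi_bernoulli ** n

# Direct evaluation of Definition 3.7 for the Binomial(theta, n) distribution
phi_direct = sum(
    math.comb(n, x) * theta**x * (1 - theta)**(n - x) * cmath.exp(1j * t * x)
    for x in range(n + 1)
)

print(abs(phi_product - phi_direct))  # essentially zero: the two agree
```

The same computation with e^{it} replaced by e^{t} or by a real argument t^x would illustrate the analogous properties of the moment and probability generating functions.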
The Binomial Distribution

A discrete random quantity x has a binomial distribution with parameters θ and n (0 < θ < 1, n = 1, 2, ...) if its probability function Bi(x | θ, n) is

  Bi(x | θ, n) = (n choose x) θ^x (1 − θ)^{n−x},  x = 0, 1, ..., n.

The mean and variance are E[x] = nθ and V[x] = nθ(1 − θ). A mode is attained at the greatest integer M[x] which does not exceed (n + 1)θ; if (n + 1)θ is an integer, then both (n + 1)θ and (n + 1)θ − 1 are modes. If n = 1, x is said to have a Bernoulli distribution, with probability function denoted by Br(x | θ). The sum of k independent binomial random quantities with parameters (θ, n_i), i = 1, ..., k, is a binomial random quantity with parameters θ and n_1 + ⋯ + n_k.

The Hypergeometric Distribution

A discrete random quantity x has a hypergeometric distribution with integer parameters N, M and n (n ≤ N + M) if its probability function Hy(x | N, M, n) is

  Hy(x | N, M, n) = c (N choose x)(M choose n − x),  max(0, n − M) ≤ x ≤ min(n, N),

where

  c = [(N + M) choose n]^{-1}.

The mean and variance are given by

  E[x] = nN/(N + M)  and  V[x] = nNM(N + M − n) / {(N + M)²(N + M − 1)},

and a mode is attained at the greatest integer M[x] which does not exceed (n + 1)(N + 1)/(N + M + 2).
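The binomial moment and mode formulas are easy to confirm by direct enumeration of the probability function. The following short Python sketch (ours, purely illustrative; the choice of θ and n is arbitrary) does so:

```python
from math import comb

def binomial_pmf(x, theta, n):
    # Bi(x | theta, n) = C(n, x) * theta^x * (1 - theta)^(n - x)
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

theta, n = 0.3, 10
support = range(n + 1)
mean = sum(x * binomial_pmf(x, theta, n) for x in support)
var = sum((x - mean)**2 * binomial_pmf(x, theta, n) for x in support)
mode = max(support, key=lambda x: binomial_pmf(x, theta, n))

assert abs(mean - n * theta) < 1e-12                 # E[x] = n*theta
assert abs(var - n * theta * (1 - theta)) < 1e-12    # V[x] = n*theta*(1-theta)
assert mode == int((n + 1) * theta)                  # greatest integer <= (n+1)*theta
```

The same enumeration works for the hypergeometric case, replacing the probability function accordingly.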
The Negative-Binomial Distribution

A discrete random quantity x has a negative-binomial distribution with parameters θ and r (0 < θ < 1, r = 1, 2, ...) if its probability function Nb(x | θ, r) is

  Nb(x | θ, r) = c (r + x − 1 choose x)(1 − θ)^x,  x = 0, 1, 2, ...,

where c = θ^r. The mean and variance are E[x] = r(1 − θ)/θ and V[x] = r(1 − θ)/θ². If r(1 − θ) < 1, the mode is M[x] = 0; if r(1 − θ) > 1, the mode M[x] is the least integer not less than [r(1 − θ) − 1]/θ; if r(1 − θ) = 1, there are two modes, at 0 and 1. If r = 1, x is said to have a geometric or Pascal distribution. Moreover, the sum of k independent negative-binomial random quantities with parameters (θ, r_i), i = 1, ..., k, is a negative-binomial random quantity with parameters θ and r_1 + ⋯ + r_k.

The Poisson Distribution

A discrete random quantity x has a Poisson distribution with parameter λ (λ > 0) if its probability function Pn(x | λ) is

  Pn(x | λ) = c λ^x / x!,  x = 0, 1, 2, ...,

where c = e^{−λ}. The mean and variance are E[x] = V[x] = λ. A mode is attained at the greatest integer M[x] which does not exceed λ; if λ is an integer, then both λ and λ − 1 are modes. The sum of k independent Poisson random quantities with parameters λ_i, i = 1, ..., k, is a Poisson random quantity with parameter λ_1 + ⋯ + λ_k.

The Beta Distribution

A continuous random quantity x has a beta distribution with parameters α and β (α > 0, β > 0) if its density function Be(x | α, β) is

  Be(x | α, β) = c x^{α−1}(1 − x)^{β−1},  0 < x < 1,

where c = Γ(α + β)/{Γ(α)Γ(β)}. Integer and half-integer values of the gamma function are easily found from the recursive relation Γ(x + 1) = xΓ(x), together with the values Γ(1) = 1 and Γ(1/2) = √π.
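The recursive relation for the gamma function, together with the two starting values, is easy to confirm numerically; a minimal Python sketch (ours, not part of the original text):

```python
from math import gamma, pi

# starting values: Gamma(1) = 1 and Gamma(1/2) = sqrt(pi)
assert abs(gamma(1.0) - 1.0) < 1e-15
assert abs(gamma(0.5) - pi**0.5) < 1e-12

# recursion Gamma(x + 1) = x * Gamma(x) generates integer and half-integer values
for x in (0.5, 1.0, 2.5, 6.0):
    assert abs(gamma(x + 1) / (x * gamma(x)) - 1) < 1e-12

assert abs(gamma(5.0) - 24.0) < 1e-9                   # Gamma(n) = (n-1)!
assert abs(gamma(2.5) - 0.75 * pi**0.5) < 1e-12        # Gamma(5/2) = (3/2)(1/2)sqrt(pi)
```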
Systematic application of the beta integral gives

  E[x] = α/(α + β)  and  V[x] = αβ / {(α + β)²(α + β + 1)}.

If α > 1 and β > 1, there is a unique mode at (α − 1)/(α + β − 2). If α = β = 1, x is said to have a uniform distribution, Un(x | 0, 1). If x has a Be(x | α, β) density, then y = 1 − x has a Be(y | β, α) density. By considering the transformed random quantity y = a + (b − a)x, the beta distribution can be generalised to any finite interval (a, b). In particular, the uniform distribution Un(y | a, b) on (a, b),

  Un(y | a, b) = (b − a)^{-1},  a < y < b,

has mean E[y] = (a + b)/2 and variance V[y] = (b − a)²/12.

The Binomial-Beta Distribution

A discrete random quantity x has a binomial-beta distribution with parameters α, β and n (α > 0, β > 0, n = 1, 2, ...) if its probability function Bb(x | α, β, n) is

  Bb(x | α, β, n) = c (n choose x) Γ(α + x) Γ(β + n − x),  x = 0, 1, ..., n,

where

  c = Γ(α + β) / {Γ(α)Γ(β)Γ(α + β + n)}.

The distribution is generated by the mixture

  Bb(x | α, β, n) = ∫₀¹ Bi(x | θ, n) Be(θ | α, β) dθ.

The mean and variance are given by

  E[x] = nα/(α + β)  and  V[x] = {nαβ/(α + β)²} (α + β + n)/(α + β + 1).

A mode is attained at the greatest integer M[x] which does not exceed (n + 1)(α − 1)/(α + β − 2). If α = β = 1 we obtain the discrete uniform distribution, assigning mass (n + 1)^{-1} to each possible x.
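The binomial-beta moments can be checked by summing the probability function over its (finite) support; the following Python sketch (ours, with an illustrative choice of parameters) does so:

```python
from math import comb, gamma

def bb_pmf(x, a, b, n):
    # Bb(x | a, b, n) = c * C(n, x) * Gamma(a + x) * Gamma(b + n - x),
    # with c = Gamma(a + b) / (Gamma(a) * Gamma(b) * Gamma(a + b + n))
    c = gamma(a + b) / (gamma(a) * gamma(b) * gamma(a + b + n))
    return c * comb(n, x) * gamma(a + x) * gamma(b + n - x)

a, b, n = 2.0, 3.0, 12
probs = [bb_pmf(x, a, b, n) for x in range(n + 1)]
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean)**2 * p for x, p in enumerate(probs))

assert abs(sum(probs) - 1) < 1e-12
assert abs(mean - n * a / (a + b)) < 1e-9   # E[x] = n*a/(a+b)
expected_var = (n * a * b / (a + b)**2) * (a + b + n) / (a + b + 1)
assert abs(var - expected_var) < 1e-9       # V[x] matches the stated formula
```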
The Negative-Binomial-Beta Distribution

A discrete random quantity x has a negative-binomial-beta distribution with parameters α, β and r (α > 0, β > 0, r = 1, 2, ...) if its probability function Nbb(x | α, β, r) is

  Nbb(x | α, β, r) = c (r + x − 1 choose x) Γ(β + x) / Γ(α + β + r + x),  x = 0, 1, 2, ...,

where

  c = Γ(α + β)Γ(α + r) / {Γ(α)Γ(β)}.

The distribution is generated by the mixture

  Nbb(x | α, β, r) = ∫₀¹ Nb(x | θ, r) Be(θ | α, β) dθ.

The mean is E[x] = rβ/(α − 1), α > 1, and the variance is given by

  V[x] = rβ(α + β − 1)(α + r − 1) / {(α − 1)²(α − 2)},  α > 2.

The Gamma Distribution

A continuous random quantity x has a gamma distribution with parameters α and β (α > 0, β > 0) if its density function Ga(x | α, β) is

  Ga(x | α, β) = c x^{α−1} e^{−βx},  x > 0,

where c = β^α/Γ(α). Systematic application of the gamma integral gives E[x] = α/β and V[x] = α/β². If α > 1, there is a unique mode at (α − 1)/β; if α < 1 there are no modes (the density is unbounded). By considering the transformed random quantities y = a + x and z = a − x, the gamma distribution can be generalised to the ranges (a, ∞) or (−∞, a). Moreover, the sum of k independent gamma random quantities with parameters (α_i, β), i = 1, ..., k, is a gamma random quantity with parameters α_1 + ⋯ + α_k and β.

If α = 1, x is said to have an exponential distribution, Ex(x | β), with parameter β and density

  Ex(x | β) = β e^{−βx},  x > 0.

The mode of an exponential distribution is located at zero. If α = r, an integer, x is said to have an Erlang distribution. If α = ν/2, β = 1/2, x is said to have a (central) chi-squared (χ²) distribution with parameter ν (often referred to as degrees of freedom) and density denoted by χ²(x | ν).
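The gamma moments, and the fact that rescaling a gamma random quantity by 2β produces a χ² random quantity with 2α degrees of freedom, can be illustrated by simulation. A Python sketch of ours (note that `random.gammavariate` is parameterised by shape and scale, so the scale is 1/β; the tolerances are loose Monte Carlo ones):

```python
import random

random.seed(0)
alpha, beta = 2.5, 4.0   # Ga(x | alpha, beta), with beta a rate parameter

draws = [random.gammavariate(alpha, 1 / beta) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean)**2 for d in draws) / len(draws)
assert abs(mean - alpha / beta) < 0.01      # E[x] = alpha/beta
assert abs(var - alpha / beta**2) < 0.01    # V[x] = alpha/beta^2

# if x ~ Ga(alpha, beta), then 2*beta*x has a chi-squared density with nu = 2*alpha
y = [2 * beta * d for d in draws]
ymean = sum(y) / len(y)
yvar = sum((v - ymean)**2 for v in y) / len(y)
assert abs(ymean - 2 * alpha) < 0.05        # chi2(nu) has mean nu
assert abs(yvar - 2 * (2 * alpha)) < 0.3    # chi2(nu) has variance 2*nu
```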
The Inverted-Gamma Distribution

A continuous random quantity x has an inverted-gamma distribution with parameters α and β (α > 0, β > 0) if its density function Ig(x | α, β) is

  Ig(x | α, β) = c x^{−(α+1)} e^{−β/x},  x > 0,

where c = β^α/Γ(α). Systematic application of the gamma integral gives

  E[x] = β/(α − 1),  α > 1,  and  V[x] = β² / {(α − 1)²(α − 2)},  α > 2.

There is a unique mode at β/(α + 1). The term inverted-gamma derives from the easily established fact that if y has a Ga(y | α, β) density, then x = y^{-1} has an Ig(x | α, β) density. If x has an inverted-gamma distribution with α = ν/2, β = 1/2, then x is said to have an inverted-χ² distribution. A continuous random quantity y has a square-root inverted-gamma density, Ga^{(-1/2)}(y | α, β), if x = y^{-2} has a Ga(x | α, β) density.

The Poisson-Gamma Distribution

A discrete random quantity x has a Poisson-gamma distribution with parameters α, β and ν (α > 0, β > 0, ν > 0) if its probability function Pg(x | α, β, ν) is

  Pg(x | α, β, ν) = c {Γ(α + x)/x!} ν^x / (β + ν)^{α+x},  x = 0, 1, 2, ...,

where c = β^α/Γ(α). The distribution is generated by the mixture

  Pg(x | α, β, ν) = ∫₀^∞ Pn(x | νλ) Ga(λ | α, β) dλ.

This compound Poisson distribution is, in fact, a generalisation of the negative binomial distribution Nb(x | α, β/(β + ν)), previously defined only for integer α. The mean is E[x] = να/β and the variance is V[x] = να(β + ν)/β². Moreover, M[x] = 0 if αν < β + ν; if αν = β + ν, there are two modes, at 0 and 1; and if αν > β + ν, there is a mode at the least integer not less than {ν(α − 1) − β}/β.
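Because the ratio of successive Poisson-gamma probabilities has a simple closed form, the probability function can be computed recursively and the stated moments checked numerically; a Python sketch of ours (evaluating Γ(α + x) and x! separately would overflow for large x, which is why the recursion is used):

```python
alpha, beta, nu = 2.5, 1.5, 2.0

# Pg(0 | alpha, beta, nu) = (beta/(beta + nu))^alpha, and
# Pg(x+1)/Pg(x) = (alpha + x)/(x + 1) * nu/(beta + nu)
probs = [(beta / (beta + nu))**alpha]
for x in range(400):   # truncated support; the tail beyond 400 is negligible here
    probs.append(probs[-1] * (alpha + x) / (x + 1) * nu / (beta + nu))

mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean)**2 * p for x, p in enumerate(probs))

assert abs(sum(probs) - 1) < 1e-9
assert abs(mean - nu * alpha / beta) < 1e-6                    # E[x] = nu*alpha/beta
assert abs(var - nu * alpha * (beta + nu) / beta**2) < 1e-6    # V[x] = nu*alpha*(beta+nu)/beta^2
```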
The Gamma-Gamma Distribution

A continuous random quantity x has a gamma-gamma distribution with parameters α, β and n (α > 0, β > 0, n > 0) if its density function Gg(x | α, β, n) is

  Gg(x | α, β, n) = c x^{n−1} / (β + x)^{α+n},  x > 0,

where

  c = β^α Γ(α + n) / {Γ(α)Γ(n)}.

The distribution is generated by the mixture

  Gg(x | α, β, n) = ∫₀^∞ Ga(x | n, λ) Ga(λ | α, β) dλ.

The mean and variance are given by

  E[x] = nβ/(α − 1),  α > 1,  and  V[x] = nβ²(α + n − 1) / {(α − 1)²(α − 2)},  α > 2.

The Pareto Distribution

A continuous random quantity x has a Pareto distribution with parameters α and β (α > 0, β > 0) if its density function Pa(x | α, β) is

  Pa(x | α, β) = c x^{−(α+1)},  x ≥ β,

where c = αβ^α. The mean and variance are given by

  E[x] = αβ/(α − 1),  α > 1,  and  V[x] = αβ² / {(α − 1)²(α − 2)},  α > 2,

and the mode is M[x] = β. The distribution is generated by an exponential-gamma mixture: if θ has a Ga(θ | α, β) density and, given θ, z has an Ex(z | θ) density, then x = β + z has a Pa(x | α, β) density. A continuous random quantity y has an inverted-Pareto density, Ip(y | α, β), if y^{-1} has a Pa(x | α, β) density.
The Normal Distribution

A continuous random quantity x has a normal distribution with parameters μ and λ (μ ∈ ℜ, λ > 0) if its density function N(x | μ, λ) is

  N(x | μ, λ) = c exp{−½ λ(x − μ)²},  x ∈ ℜ,

where c = (λ/2π)^{1/2}. The distribution is symmetrical about x = μ. The mean and mode are E[x] = M[x] = μ and the variance is V[x] = λ^{-1}, so that λ here represents the precision of the distribution. If μ = 0, λ = 1, x is said to have a standard normal distribution, with distribution function Φ given by

  Φ(x) = (2π)^{-1/2} ∫_{−∞}^{x} exp{−½ t²} dt.

If y = λ^{1/2}(x − μ), where x has a normal density N(x | μ, λ), then y has a N(y | 0, 1) (standard) density. In general, if x₁, ..., x_k are independent with N(x_i | μ_i, λ_i) densities, then y = a + b₁x₁ + ⋯ + b_kx_k has a normal density N(y | a + Σᵢ bᵢμᵢ, λ), where λ = (Σᵢ bᵢ²/λᵢ)^{-1}, a weighted harmonic mean of the individual precisions. If x₁, ..., x_ν are mutually independent standard normal random quantities, then z = x₁² + ⋯ + x_ν² has a (central) χ² distribution with parameter ν.

The Noncentral χ² Distribution

A continuous random quantity x has a noncentral χ² distribution with parameters ν (degrees of freedom) and λ (non-centrality) (ν > 0, λ > 0) if its density function χ²(x | ν, λ) is

  χ²(x | ν, λ) = Σ_{i=0}^{∞} Pn(i | λ/2) χ²(x | ν + 2i),

i.e., a mixture of central χ² distributions with Poisson weights. It reduces to a central χ²(ν) when λ = 0. The mean and variance are E[x] = ν + λ and V[x] = 2(ν + 2λ). The distribution is unimodal; the mode occurs at the value M[x] such that χ²(M[x] | ν, λ) = χ²(M[x] | ν − 2, λ). If x₁, ..., x_k are mutually independent normal random quantities with N(x_i | μ_i, λ_i) densities, then z = λ₁x₁² + ⋯ + λ_kx_k² has a noncentral χ² distribution with parameters k and λ = Σᵢ λᵢμᵢ². The sum of k independent noncentral χ² random quantities with parameters (νᵢ, λᵢ) is a noncentral χ² random quantity with parameters ν₁ + ⋯ + ν_k and λ₁ + ⋯ + λ_k.
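The characterisation of the noncentral χ² as a sum of squared normals, and its stated mean and variance, can be illustrated by simulation; a Python sketch of ours (Monte Carlo tolerances are deliberately loose):

```python
import random

random.seed(1)
nu, lam = 3, 2.5   # nu degrees of freedom, noncentrality lam

# sum of nu squared standard normals, one given mean sqrt(lam), so that
# the noncentrality sum(mu_i^2) equals lam
def draw():
    z = random.gauss(lam**0.5, 1.0)**2
    for _ in range(nu - 1):
        z += random.gauss(0.0, 1.0)**2
    return z

draws = [draw() for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean)**2 for d in draws) / len(draws)

assert abs(mean - (nu + lam)) < 0.05         # E[x] = nu + lam
assert abs(var - 2 * (nu + 2 * lam)) < 0.3   # V[x] = 2(nu + 2*lam)
```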
The Logistic Distribution

A continuous random quantity x has a logistic distribution with parameters α and β (α ∈ ℜ, β > 0) if its density function Lo(x | α, β) is

  Lo(x | α, β) = c exp{−(x − α)/β} [1 + exp{−(x − α)/β}]^{-2},

where c = β^{-1}. An alternative expression for the density function is

  Lo(x | α, β) = (4β)^{-1} sech²{(x − α)/2β},

so that the logistic is sometimes called the sech-squared distribution. The logistic distribution is most simply expressed in terms of its distribution function,

  F_x(x) = [1 + exp{−(x − α)/β}]^{-1}.

The distribution is symmetrical about x = α, and has a unique mode M[x] = α. The mean is E[x] = M[x] = α, and the variance is V[x] = β²π²/3.

The Student (t) Distribution

A continuous random quantity x has a Student distribution with parameters μ, λ and α (μ ∈ ℜ, λ > 0, α > 0) if its density St(x | μ, λ, α) is

  St(x | μ, λ, α) = c [1 + (λ/α)(x − μ)²]^{−(α+1)/2},  x ∈ ℜ,

where

  c = {Γ((α + 1)/2) / Γ(α/2)} (λ/απ)^{1/2}.

The distribution is symmetrical about x = μ. The mean and mode are E[x] = M[x] = μ (the mean existing for α > 1), and the variance is V[x] = λ^{-1} α/(α − 2), α > 2. The parameter α is usually referred to as the degrees of freedom of the distribution.
The distribution is generated (Dickey, 1968) by the mixture

  St(x | μ, λ, α) = ∫₀^∞ N(x | μ, λy) Ga(y | α/2, α/2) dy,

and includes the normal distribution as a limiting case, since

  N(x | μ, λ) = lim_{α→∞} St(x | μ, λ, α).

If y = λ^{1/2}(x − μ), where x has a St(x | μ, λ, α) density, then y has a (standard) Student density St(y | 0, 1, α). If α = 1, x is said to have a Cauchy distribution, with density Ca(x | μ, λ) = St(x | μ, λ, 1). If x has a standard normal distribution, y has a χ² distribution with α degrees of freedom, and x and y are mutually independent, then z = x(y/α)^{-1/2} has a standard Student density St(z | 0, 1, α).

The Snedecor (F) Distribution

A continuous random quantity x has a Snedecor, or Fisher, distribution with parameters α and β (degrees of freedom) (α > 0, β > 0) if its density Fs(x | α, β) is

  Fs(x | α, β) = c x^{α/2 − 1} {1 + (α/β)x}^{−(α+β)/2},  x > 0,

where

  c = [Γ((α + β)/2) / {Γ(α/2)Γ(β/2)}] (α/β)^{α/2}.

If β > 2, E[x] = β/(β − 2) and, moreover, if β > 4,

  V[x] = 2β²(α + β − 2) / {α(β − 4)(β − 2)²},

and there is a unique mode at [β/(β + 2)][(α − 2)/α]. If x and y are independent random quantities with central χ² distributions with ν₁ and ν₂ degrees of freedom, respectively, then z = (x/ν₁)/(y/ν₂) has a Snedecor distribution with ν₁ and ν₂ degrees of freedom. Relationships between some of the distributions described above can be established using the techniques described in Section 3.2.1. For a geometrical interpretation of some of these relations, see Bailey (1992).
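Dickey's mixture representation of the Student distribution can be illustrated by simulation: drawing a gamma-distributed precision scale and then a conditionally normal value reproduces the Student mean and variance. A Python sketch of ours (again with loose Monte Carlo tolerances):

```python
import random

random.seed(2)
mu, lam, alpha = 1.0, 4.0, 8.0   # St(x | mu, lam, alpha), alpha > 2 so the variance exists

# mixture: x | y ~ N(mu, lam*y), with y ~ Ga(alpha/2, alpha/2)
def draw():
    y = random.gammavariate(alpha / 2, 2 / alpha)   # shape alpha/2, scale 2/alpha
    precision = lam * y
    return random.gauss(mu, (1 / precision)**0.5)

draws = [draw() for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean)**2 for d in draws) / len(draws)

assert abs(mean - mu) < 0.01                          # E[x] = mu
assert abs(var - alpha / (lam * (alpha - 2))) < 0.01  # V[x] = alpha / (lam*(alpha - 2))
```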
Example 3.1. (Beta, binomial and Snedecor (F) distributions). Suppose that x has a Be(x | α, β) density and let y = βx/{α(1 − x)}, 0 < x < 1. Then, recalling that Be(x | α, β) = c x^{α−1}(1 − x)^{β−1} and noting that x = αy/(β + αy) and that dx/dy = αβ/(β + αy)², we have

  p(y) = Be(αy/(β + αy) | α, β) αβ/(β + αy)²,

so that y has the Snedecor density Fs(y | 2α, 2β). Binomial probabilities may also be obtained from the F distribution using the exact relation between their distribution functions given by Peizer and Pratt (1968). Since F distributions are extensively tabulated, these relationships provide a useful basis for numerical work involving any beta or binomial distribution.

Example 3.2. (Gamma and χ² distributions). Suppose that x has a Ga(x | α, β) density and let y = 2βx. Then, noting that x = y/2β and that dx/dy = 1/2β, we have

  p(y) = (2β)^{-1} Ga(y/2β | α, β),

so that y has a χ²(y | 2α) density. Since χ² distributions are extensively tabulated, this relationship provides a useful basis for numerical work involving any gamma distribution.

Example 3.3. (Approximate moments for transformed random quantities). Suppose that x has a Be(x | α, β) density, but that we are interested in the means and variances of the transformed random quantities y = log{x/(1 − x)} and z = 2 sin^{-1}(√x). Recalling that E[x] = α/(α + β) = μ, say, and that V[x] = μ(1 − μ)/(α + β + 1) = σ², say, and noting the forms of the first and second derivatives of the two transformations,
application of Proposition 3.3 immediately yields the following approximations:

  E[y] ≈ log{μ/(1 − μ)} + σ²(2μ − 1)/{2μ²(1 − μ)²},  V[y] ≈ σ²/{μ(1 − μ)}²,
  E[z] ≈ 2 sin^{-1}(√μ) − σ²(1 − 2μ)/{4(μ(1 − μ))^{3/2}},  V[z] ≈ σ²/{μ(1 − μ)}.

We note, in particular, that if μ ≈ ½ the second (correction) terms in the mean approximations will be small, and that, since σ² = μ(1 − μ)/(α + β + 1), the variance is "stabilised" (i.e., does not depend on μ) under the second transformation.

3.2.3 Convergence and Limit Theorems

Within the countably additive framework for probability which we are currently reviewing, much of the powerful resulting mathematical machinery rests on various notions of limit process. We shall summarise a few of the main ideas and results, beginning with the four most widely used notions of convergence for random quantities.

Definition 3.8. (Convergence). A sequence x₁, x₂, ... of random quantities:

(i) converges in probability to a random quantity x if and only if, for all ε > 0,

  lim_{n→∞} P({ω; |x_n(ω) − x(ω)| > ε}) = 0;

(ii) converges almost surely to a random quantity x if and only if

  P({ω; lim_{n→∞} x_n(ω) = x(ω)}) = 1;

in other words, x_n(ω) tends to x(ω) for all ω except those lying in a set of P-measure zero;

(iii) converges in mean square to a random quantity x if and only if

  lim_{n→∞} E[(x_n − x)²] = 0;

(iv) converges in distribution to a random quantity x if and only if the corresponding distribution functions are such that

  lim_{n→∞} F_n(t) = F(t)

at all continuity points t of F in ℜ; we denote this by F_n → F.
Convergence in mean square implies convergence in probability; almost sure convergence also implies convergence in probability; and convergence in probability implies convergence in distribution. In general, the converse implications are false. Convergence in distribution is completely determined by the distribution functions: the corresponding random quantities need not be defined on the same probability space. Some of the most basic results are the following:

(i) F_n → F if and only if, for every bounded continuous function g, the sequence E[g(x_n)] of expected values with respect to F_n converges to the expected value E[g(x)] with respect to F.

(ii) If φ_n(t) and φ(t) are the corresponding characteristic functions, then F_n → F if φ_n(t) → φ(t) for all t ∈ ℜ, provided that φ(t) is continuous at t = 0.

(iii) (Helly's theorem) Given a sequence {F_n, n = 1, 2, ...} of distribution functions such that, for each ε > 0, there exists an a such that, for all i sufficiently large, F_i(a) − F_i(−a) > 1 − ε, there exists a distribution function F and a subsequence F_{n₁}, F_{n₂}, ... such that F_{n_i} → F.

An important class of limit results, the so-called laws of large numbers, link the limiting behaviour of averages of (independent) random quantities with their expectations. Two important examples are the following:

(i) The weak law of large numbers. If x₁, x₂, ... are independent, identically distributed random quantities with E[x_i] = μ and V[x_i] = σ² < ∞, then the sequence of random quantities

  x̄_n = (x₁ + ⋯ + x_n)/n,  n = 1, 2, ...,

converges in mean square (and hence in probability) to μ.

(ii) The strong law of large numbers. If x₁, x₂, ... are independent, identically distributed random quantities with E[|x_i|] < ∞ and E[x_i] = μ, then x̄_n converges almost surely to μ; the converse also holds.

A further fundamental result is the central limit theorem: if x₁, x₂, ... are independent, identically distributed random quantities with E[x_i] = μ and V[x_i] = σ² < ∞, then the sequence of standardised random quantities

  z_n = (x̄_n − μ)/(σ/√n),  n = 1, 2, ...,

converges in distribution to the standard normal distribution.
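The central limit theorem is easily illustrated by simulation: standardised means of uniform random quantities behave like standard normal draws even for moderate n. A Python sketch of ours (the 95% figure is the standard normal probability of the interval (−1.96, 1.96)):

```python
import random

random.seed(3)
n, reps = 100, 20_000
mu, sigma = 0.5, (1 / 12)**0.5   # mean and standard deviation of Un(0, 1)

# standardised means z_n = sqrt(n) * (xbar - mu) / sigma, approximately N(0, 1)
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    zs.append((xbar - mu) * n**0.5 / sigma)

within = sum(abs(z) <= 1.96 for z in zs) / reps
assert abs(within - 0.95) < 0.01      # P(|z| <= 1.96) ~ 0.95 for a standard normal
assert abs(sum(zs) / reps) < 0.03     # E[z] ~ 0
```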
In addition, for finite random quantities, there is a further class of limit results which characterises in more detail the properties of the distance between the sequence and the limit values; an important example is the law of the iterated logarithm: under the conditions assumed for the central limit theorem,

  lim sup_{n→∞} (2 log log n)^{-1/2} |z_n| = 1,

almost surely. There are enormously wide-ranging variations and generalisations of these results, but we shall rarely need to go beyond the above in our subsequent discussion.

3.2.4 Random Vectors, Bayes' Theorem

A random quantity represents a numerical summary of the potential outcomes in an uncertain situation. However, in general each outcome has many different numerical summaries which may be associated with it. For example, the birth of an infant might be recorded in terms of weight and heart-rate measurements, as well as an encoding (for example, using a 0-1 convention) of its sex; a description of the state of the international commodity market would typically involve a whole complex of price information. It is necessary, therefore, to have available the mathematical apparatus for handling a vector of numerical information.

Generalising our earlier discussion given in Section 3.2.1, we wish to define a mapping which associates a vector x(ω) of k real numbers with each elementary outcome ω of Ω. As in the case of (univariate) random quantities, we shall wish to ensure that the induced probability measure P_x is well-defined for particular subsets of ℜ^k, and this puts mathematical constraints on the form of the function x. We shall take this class of subsets to be the smallest σ-algebra B^k containing all forms of k-dimensional interval (the so-called Borel sets of ℜ^k). This then prompts the following definition.

Definition 3.9. (Random vector). A random vector x on a probability space {Ω, F, P} is a function x : Ω → X ⊆ ℜ^k such that x^{-1}(B) ∈ F, for all B ∈ B^k.

With this definition, we move the focus of attention from the underlying probability space {Ω, F, P} to the context of ℜ^k, B^k and an induced probability measure P_x, defined in the natural way by P_x(B) = P(x^{-1}(B)), for all B ∈ B^k. The possible forms of distribution for x are potentially much more complicated for a random vector than in the case of a single random quantity, in that they
not only describe the uncertainty about each of the individual component random quantities in the vector, but also the dependencies among them.

As in the one-dimensional case, we can distinguish discrete distributions, where x takes only a countable number of possible values and the distribution can be described by the probability (mass) function p_x(x) = P({ω; x(ω) = x}), and (absolutely) continuous distributions, where the distribution may be described by a density function p_x such that

  P_x(B) = ∫_B p_x(x) dx,  B ∈ B^k.

In addition, of course, we could have cases where some of the components are discrete and others are continuous, and mixtures of the two types. In what follows, we shall usually present our discussion using the notation for the continuous case. It will always be clear from the context how to reinterpret things in the discrete (or mixed) cases.

The distribution function of a random vector x is the real-valued function F_x : ℜ^k → [0, 1] defined by

  F_x(x) = F_x(x₁, ..., x_k) = P_x({(y₁, ..., y_k); y₁ ≤ x₁, ..., y_k ≤ x_k}).

The density p_x(x) = p_x(x₁, ..., x_k) of the random vector x is often referred to as the joint density of the random quantities x₁, ..., x_k. If the random vector x is partitioned into x = (y, z), where y = (x₁, ..., x_i) and z = (x_{i+1}, ..., x_k), say, the marginal density for the vector y is given by

  p_y(y) = ∫ p_x(y, z) dz,

or, alternatively, dropping the subscripts without danger of confusion, p(y) = ∫ p(y, z) dz. This operation of passing from a joint to a marginal density occurs so commonly in Bayesian inference contexts (see Chapter 5, in particular) that it is useful to have available a simple alternative notation, emphasising the operation itself, rather than the technical integration required. To denote the marginalisation operation we shall therefore write

  p(y, z) ↦ p(y).

The conditional density for the random vector z, given that y(ω) = y, is defined by
  p_{z|y}(z | y) = p_x(y, z) / p_y(y),

or, again dropping the subscripts for convenience, p(z | y) = p(y, z)/p(y). Exchanging the roles of y and z in the above, it is obvious that

  p_x(x) = p_{z|y}(z | y) p_y(y) = p_{y|z}(y | z) p_z(z).

Proposition 3.4. (Generalised Bayes' theorem).

  p_{y|z}(y | z) = p_{z|y}(z | y) p_y(y) / p_z(z),  where  p_z(z) = ∫ p_{z|y}(z | y) p_y(y) dy.

Proof. This follows from the identity p_{z|y}(z | y) p_y(y) = p_{y|z}(y | z) p_z(z), which immediately yields the result.

We shall almost always use the generic subscript-free notation for densities. It is therefore important to remember that the functional forms of the various marginal and conditional densities will typically differ. It is often convenient to re-express Bayes' theorem in the simple proportionality form

  p_{y|z}(y | z) ∝ p_{z|y}(z | y) p_y(y),

since the right-hand side contains all the information required to reconstruct the normalising constant, should the latter be needed explicitly. In many cases, however, it is not explicitly required, since the "shape" of p_{y|z}(y | z) is all that one needs to know. More generally, we note that if a density function p(x) can be expressed in the form c q(x), where q is a function and c is a constant, not depending on x, then, since the density integrates to one,

  c = [∫ q(x) dx]^{-1}.

Any such q(x) will be referred to as a kernel of the density p(x). The proportionality form of Bayes' theorem then makes it clear that, up to the final stage of calculating the normalising constant, we can always just work with kernels of densities. In fact, this latter observation is often extremely useful for avoiding unnecessary detail when carrying out manipulations involving Bayes' theorem.
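The kernel idea can be made concrete with a simple numerical check (ours, not from the text): for a beta prior and Bernoulli likelihood, the posterior kernel determines the normalising constant, which must agree with the known beta-function value.

```python
from math import gamma

# prior Be(theta | a, b), likelihood theta^s * (1 - theta)^(n - s);
# posterior kernel q(theta) = theta^(a + s - 1) * (1 - theta)^(b + n - s - 1)
a, b, s, n = 2.0, 3.0, 6, 10

def q(theta):
    return theta**(a + s - 1) * (1 - theta)**(b + n - s - 1)

# c = [integral of the kernel]^(-1), computed by the midpoint rule
m = 100_000
integral = sum(q((i + 0.5) / m) for i in range(m)) / m
c = 1 / integral

# the posterior is Be(theta | a + s, b + n - s), so c must equal 1/B(a + s, b + n - s)
beta_fn = gamma(a + s) * gamma(b + n - s) / gamma(a + b + n)
assert abs(c * beta_fn - 1) < 1e-4
```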
For further technical discussion of the generalised Bayes' theorem, see Mouchart (1976), Hartigan (1983, Chapter 3) and Wasserman and Kadane (1990).

As with marginalising, the Bayes' theorem operation of passing from a conditional and a marginal density to the "other" conditional density is also fundamental to Bayesian inference and, again, it is useful to have available an alternative notation. To denote the Bayes' theorem operation, we shall therefore write

  p(z | y), p(y) ↦ p(y | z).

Extending the use of the terms given in Chapter 2, we shall typically interpret densities such as p(x₁, ..., x_j) as describing beliefs for the random quantities x₁, ..., x_j before (i.e., prior to) observing them, and densities such as p(x_{j+1}, ..., x_k | x₁, ..., x_j) as describing beliefs for x_{j+1}, ..., x_k after (i.e., posterior to) observing x₁, ..., x_j. In more explicit terms, and dropping the subscripts on densities, Bayes' theorem can be written in the form

  p(x_{j+1}, ..., x_k | x₁, ..., x_j)
    = p(x₁, ..., x_j | x_{j+1}, ..., x_k) p(x_{j+1}, ..., x_k)
      / ∫ p(x₁, ..., x_j | x_{j+1}, ..., x_k) p(x_{j+1}, ..., x_k) dx_{j+1} ⋯ dx_k.

Manipulations based on this form will underlie the greater part of the ideas and results to be developed in subsequent chapters. Often, manipulation is simplified if independence or conditional independence assumptions can be made. For example, if x₁, ..., x_k were independent, we would have

  p(x₁, ..., x_k) = p(x₁) ⋯ p(x_k),

and, if x₁, ..., x_k were conditionally independent given y,

  p(x₁, ..., x_k | y) = p(x₁ | y) ⋯ p(x_k | y).

If x is a random vector defined on {Ω, F, P}, such that x : Ω → X ⊆ ℜ^k, and if g : ℜ^k → ℜ^h (h ≤ k) is a function such that (g ∘ x)^{-1}(B) ∈ F for all B ∈ B^h, then g ∘ x is also a random vector. We shall typically denote g ∘ x by g(x) and, whenever we refer to such vector functions of a random vector x, it is to be understood that the composite function is indeed a random vector. Writing y = g(x), the random vector y induces a probability space {ℜ^h, B^h, P_y}, where

  P_y(B) = P_x(g^{-1}(B)) = P((g ∘ x)^{-1}(B)),
and distribution and density functions F_y, p_y are defined in the obvious way. These forms are easily related to F_x, p_x. In particular, if g : X → Y, with X ⊆ ℜ^k and Y ⊆ ℜ^k of the same dimension k, is a one-to-one differentiable function with inverse g^{-1}, we have, for each y ∈ Y,

  p_y(y) = p_x(g^{-1}(y)) |J_{g^{-1}}(y)|,

where J_{g^{-1}}(y) is the Jacobian (the determinant of the matrix of partial derivatives) of the transformation g^{-1}. If h < k, we are usually able to define an appropriate z = z(x), of dimension k − h, such that w = f(x) = (g(x), z(x)) is a one-to-one function with inverse f^{-1}, and then proceed in two steps to obtain p_y(y), by first obtaining

  p_w(w) = p_x(f^{-1}(w)) |J_{f^{-1}}(w)|,

and then marginalising to

  p_y(y) = ∫ p_w(y, z) dz.

The expectation concept generalises to the case of random vectors in an obvious way.

Definition 3.10. (Expectation of a random vector). If x, y are random vectors such that y = g(x), g : X → Y ⊆ ℜ^h (h ≤ k), the expectation of y, E[y] = E[g(x)], is a vector whose ith component is E[y_i] = E[g_i(x)], defined by either

  E[g_i(x)] = Σ_x g_i(x) p_x(x)  or  E[g_i(x)] = ∫_X g_i(x) p_x(x) dx,

for the discrete and absolutely continuous cases, respectively, where all the equalities are to be interpreted in the sense that if either side exists, so does the other and they are equal.
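The one-to-one change-of-variables formula p_y(y) = p_x(g^{-1}(y)) |J_{g^{-1}}(y)| can be checked numerically. In the following Python sketch (ours), x is standard normal, g(x) = e^x, and the resulting density for y is verified to integrate to (almost) one:

```python
from math import exp, pi, log

# x ~ N(x | 0, 1); y = g(x) = e^x, inverse g^{-1}(y) = log(y),
# Jacobian |d g^{-1}(y)/dy| = 1/y, hence p_y(y) = p_x(log y) / y
def p_x(x):
    return (2 * pi)**-0.5 * exp(-0.5 * x**2)

def p_y(y):
    return p_x(log(y)) / y

# midpoint rule over (0, 60]; the mass beyond 60 is negligible
m = 600_000
h = 60 / m
total = sum(p_y((i + 0.5) * h) * h for i in range(m))
assert abs(total - 1) < 1e-3
```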
In particular, the forms defined by E[x₁^{n₁} ⋯ x_k^{n_k}] are called the moments of x of order n = n₁ + ⋯ + n_k. Important special cases include the first-order moments, E[x_i], i = 1, ..., k, and the second-order moments, E[x_i²], E[x_i x_j], 1 ≤ i ≠ j ≤ k. The expectation vector with components E[x₁], ..., E[x_k] is also called the mean vector, E[x], of x. As in the case of a single random quantity, central moments are defined by the forms E[(x₁ − E[x₁])^{n₁} ⋯ (x_k − E[x_k])^{n_k}]. The covariance between x_i and x_j is defined by

  C[x_i, x_j] = E[(x_i − E[x_i])(x_j − E[x_j])],

and the correlation by

  R[x_i, x_j] = C[x_i, x_j] / {V[x_i] V[x_j]}^{1/2}.

The Cauchy-Schwarz inequality establishes that |R[x_i, x_j]| ≤ 1. The k × k matrix with (i, j)th element C[x_i, x_j] is called the covariance matrix, V[x], of x. If the components of x are independent random quantities, V[x] reduces to a diagonal matrix with (i, i)th entry given by V[x_i].

As in the case of a single random quantity, exact forms for the moments of an arbitrary transformation y = g(x) are not generally available. We shall not need very general results in this area, but the following will occasionally prove useful.

Proposition 3.5. (Approximate mean and covariance). If x is a random vector in ℜ^k, with E[x] = μ and V[x] = Σ, and y = g(x) is a one-to-one transformation of x such that g^{-1} exists, then, subject to conditions on the distribution of x and on the smoothness of g,

  E[g_i(x)] ≈ g_i(μ) + ½ tr[Σ ∇²g_i(μ)],
  V[g(x)] ≈ J_g(μ) Σ J_g^t(μ),

where tr[·] denotes the trace of a matrix argument, ∇²g_i(μ) is the matrix of second derivatives of g_i evaluated at μ, and J_g(μ) is the Jacobian matrix of g evaluated at μ.

Proof. This follows straightforwardly from a multivariate Taylor expansion; the details are tedious and we omit them here.
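In one dimension, Proposition 3.5 reduces to the familiar "delta method" approximations E[g(x)] ≈ g(μ) + ½g″(μ)σ² and V[g(x)] ≈ g′(μ)²σ². A Python sketch of ours compares these with Monte Carlo estimates for a log-transformed gamma random quantity:

```python
import random
from math import log

random.seed(4)
alpha, beta = 20.0, 2.0                  # Ga(x | alpha, beta): mu = 10, sigma^2 = 5
mu, var = alpha / beta, alpha / beta**2

# Proposition 3.5 with g(x) = log(x): g'(mu) = 1/mu, g''(mu) = -1/mu^2
approx_mean = log(mu) + 0.5 * var * (-1 / mu**2)
approx_var = (1 / mu)**2 * var

draws = [log(random.gammavariate(alpha, 1 / beta)) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean)**2 for d in draws) / len(draws)

assert abs(approx_mean - mc_mean) < 0.01   # second-order mean correction is accurate here
assert abs(approx_var - mc_var) < 0.01     # first-order variance approximation
```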
A note on measure theory. Readers familiar with measure theory will be aware that there are many subtle steps in passing to a density representation of the probability measure P_x; that is to say, a detailed rigorous treatment of densities (Radon-Nikodym derivatives) requires statements about dominating measures and comments on the versions assumed for such densities. Readers unfamiliar with measure theory will already have assumed, correctly, that we shall almost always be dealing with the "standard" versions of probability mass functions and densities (corresponding to counting and Lebesgue dominating measures and "smoothly" defined). Only occasionally, for example in Chapter 4, do we refer to general, non-density, forms.

3.2.5 Some Particular Multivariate Distributions

We conclude our review of probability theory with a selection of the more frequently used multivariate probability distributions. As in Section 3.2.2, no very detailed discussion will be given: see, for example, Wilks (1962), Johnson and Kotz (1969, 1972) and DeGroot (1970) for further information.

The Multinomial Distribution

A discrete random vector x = (x₁, ..., x_k) has a multinomial distribution of dimension k, with parameters θ = (θ₁, ..., θ_k) and n (0 < θ_i < 1, Σ_{i=1}^k θ_i ≤ 1, n = 1, 2, ...) if its probability function Mu_k(x | θ, n) is

  Mu_k(x | θ, n) = c θ₁^{x₁} ⋯ θ_k^{x_k} (1 − Σ_{i=1}^k θ_i)^{n − Σ_{i=1}^k x_i},
    x_i = 0, 1, ..., n,  Σ_{i=1}^k x_i ≤ n,

where

  c = n! / {x₁! ⋯ x_k! (n − Σ_{i=1}^k x_i)!}.

The mean vector and covariance matrix are given by

  E[x_i] = nθ_i,  V[x_i] = nθ_i(1 − θ_i),  C[x_i, x_j] = −nθ_iθ_j,  i ≠ j.

The mode(s) of the distribution is (are) located near E[x], the defining inequalities restricting the possible modes to relatively few points.
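The multinomial marginalisation property, namely that each component has a binomial marginal distribution, can be verified by direct enumeration for a small case; a Python sketch of ours with k = 2:

```python
from math import factorial, comb

def mu2_pmf(x1, x2, t1, t2, n):
    # Mu_2(x | theta, n), with x3 = n - x1 - x2 and theta3 = 1 - t1 - t2
    x3, t3 = n - x1 - x2, 1 - t1 - t2
    c = factorial(n) / (factorial(x1) * factorial(x2) * factorial(x3))
    return c * t1**x1 * t2**x2 * t3**x3

t1, t2, n = 0.2, 0.5, 8

# marginal of x1: summing over x2 should recover the binomial Bi(x1 | t1, n)
for x1 in range(n + 1):
    marginal = sum(mu2_pmf(x1, x2, t1, t2, n) for x2 in range(n - x1 + 1))
    binom = comb(n, x1) * t1**x1 * (1 - t1)**(n - x1)
    assert abs(marginal - binom) < 1e-12
```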
The marginal distribution of x^{(u)} = (x₁, ..., x_u), u < k, is also multinomial, Mu_u(x^{(u)} | θ₁, ..., θ_u, n). If k = 1, Mu₁(x | θ, n) reduces to the binomial density Bi(x | θ, n). In the general case, if y = (y₁, ..., y_t), where each y_j is the sum of the x_i's over a distinct subset of the indices, then y also has a multinomial density, Mu_t(y | φ, n), where each φ_j is the sum of the corresponding θ_i's. The conditional distribution of x^{(u)} given the remaining x_i's is also multinomial; it depends on those x_i's only through their sum s = Σ_{i=u+1}^k x_i, and is the multinomial Mu_u(x^{(u)} | θ₁*, ..., θ_u*, n − s), with θ_i* = θ_i/(1 − θ_{u+1} − ⋯ − θ_k).

If x₁, ..., x_k are k independent Poisson random quantities with densities Pn(x_i | λ_i), then the joint distribution of (x₁, ..., x_{k−1}) given x₁ + ⋯ + x_k = n is multinomial, with parameters θ_i = λ_i/(λ₁ + ⋯ + λ_k) and n. If x is the sum of m independent random vectors having multinomial densities with parameters (θ, n_l), l = 1, ..., m, then x also has a multinomial density, with parameters θ and (n₁ + ⋯ + n_m).

The Dirichlet Distribution

A continuous random vector x = (x₁, ..., x_k) has a Dirichlet distribution of dimension k, with parameters α = (α₁, ..., α_{k+1}) (α_i > 0, i = 1, ..., k + 1) if its probability density Di_k(x | α) is

  Di_k(x | α) = c x₁^{α₁−1} ⋯ x_k^{α_k−1} (1 − Σ_{i=1}^k x_i)^{α_{k+1}−1},
    0 < x_i < 1,  Σ_{i=1}^k x_i < 1,

where

  c = Γ(α₁ + ⋯ + α_{k+1}) / {Γ(α₁) ⋯ Γ(α_{k+1})}.

If k = 1, Di₁(x | α) reduces to the beta density Be(x | α₁, α₂). In the general case, writing α₀ = α₁ + ⋯ + α_{k+1}, the mean vector and covariance matrix are given by

  E[x_i] = α_i/α₀,  V[x_i] = α_i(α₀ − α_i) / {α₀²(α₀ + 1)},
  C[x_i, x_j] = −α_iα_j / {α₀²(α₀ + 1)},  i ≠ j.
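The Dirichlet distribution can be simulated by normalising independent gamma random quantities with a common rate; the sketch below (ours) uses this construction and checks the stated means by Monte Carlo:

```python
import random

random.seed(5)
alphas = [1.0, 2.0, 3.0]   # Di_2(x | alpha): two free components plus alpha_3
a0 = sum(alphas)

# construction from independent gammas: x_i = z_i / (z_1 + ... + z_{k+1})
def draw():
    z = [random.gammavariate(a, 1.0) for a in alphas]
    s = sum(z)
    return [zi / s for zi in z]

draws = [draw() for _ in range(100_000)]
for i, a in enumerate(alphas):
    mean_i = sum(d[i] for d in draws) / len(draws)
    assert abs(mean_i - a / a0) < 0.005   # E[x_i] = alpha_i / alpha_0
```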
If α_i > 1, i = 1, ..., k + 1, there is a mode given by

  M[x_i] = (α_i − 1) / (α₀ − k − 1),  i = 1, ..., k,

where α₀ = Σ_{j=1}^{k+1} α_j. The marginal distribution of x^{(u)} = (x₁, ..., x_u), u < k, is also Dirichlet, Di_u(x^{(u)} | α₁, ..., α_u, α₀ − α₁ − ⋯ − α_u). In particular, the marginal distribution of x_i is the beta density Be(x_i | α_i, α₀ − α_i). The conditional distribution, given x^{(u)}, of

  y = (y₁, ..., y_t),  y_j = x_{u+j} / (1 − x₁ − ⋯ − x_u),  t = k − u,

is the Dirichlet distribution Di_t(y | α_{u+1}, ..., α_{k+1}). Moreover, if z₁, ..., z_{k+1} are independent gamma random quantities with densities Ga(z_i | α_i, β), then x = (x₁, ..., x_k), with x_i = z_i/(z₁ + ⋯ + z_{k+1}), has a Di_k(x | α) density.

The Multinomial-Dirichlet Distribution

A discrete random vector x = (x₁, ..., x_k) has a multinomial-Dirichlet distribution of dimension k, with parameters α = (α₁, ..., α_{k+1}) and n (α_i > 0, n = 1, 2, ...) if its probability function Md_k(x | α, n) is

  Md_k(x | α, n) = c Π_{i=1}^{k+1} α_i^{[x_i]} / x_i!,
    x_i = 0, 1, ..., n,  Σ_{i=1}^k x_i ≤ n,

with x_{k+1} = n − Σ_{i=1}^k x_i, where

  c = n! / α₀^{[n]},  α₀ = Σ_{j=1}^{k+1} α_j,

and α^{[m]} = α(α + 1) ⋯ (α + m − 1) defines the ascending factorial function.
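The ascending-factorial form of the multinomial-Dirichlet probability function can be checked by enumeration for a small case: the probabilities should sum to one, and (as stated below for the general case) each component should have a binomial-beta marginal. A Python sketch of ours with k = 2:

```python
from math import comb, gamma

def ascending(a, m):
    # a^[m] = a(a + 1)...(a + m - 1)
    out = 1.0
    for i in range(m):
        out *= a + i
    return out

def md2_pmf(x1, x2, alphas, n):
    # Md_2(x | alpha, n), with x3 = n - x1 - x2
    x3 = n - x1 - x2
    a0 = sum(alphas)
    c = gamma(n + 1) / ascending(a0, n)
    num = 1.0
    for a, x in zip(alphas, (x1, x2, x3)):
        num *= ascending(a, x) / gamma(x + 1)
    return c * num

alphas, n = (1.5, 2.0, 2.5), 6
total = sum(md2_pmf(x1, x2, alphas, n)
            for x1 in range(n + 1) for x2 in range(n - x1 + 1))
assert abs(total - 1) < 1e-12

# the marginal of x1 is the binomial-beta Bb(x1 | alpha_1, a0 - alpha_1, n)
a1, a0 = alphas[0], sum(alphas)
for x1 in range(n + 1):
    marg = sum(md2_pmf(x1, x2, alphas, n) for x2 in range(n - x1 + 1))
    bb = comb(n, x1) * gamma(a1 + x1) * gamma(a0 - a1 + n - x1) / gamma(a0 + n) \
         * gamma(a0) / (gamma(a1) * gamma(a0 - a1))
    assert abs(marg - bb) < 1e-12
```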
The marginal distribution of each single component x_i is the binomial-beta Bb(x_i | α_i, α_0 − α_i, n), where α_0 = α_1 + ⋯ + α_{k+1}; the conditional distribution of a subset of the components, given the values of the others, is also multinomial-Dirichlet. The mean vector and covariance matrix are given by

  E[x_i] = n α_i / α_0,  V[x_i] = n {α_i(α_0 − α_i) / α_0²} {(α_0 + n) / (α_0 + 1)},  C[x_i, x_j] = −n {α_iα_j / α_0²} {(α_0 + n) / (α_0 + 1)},  i ≠ j.

For an interesting characterisation of this distribution, see Basu and Pereira (1983).

The Normal-Gamma Distribution

A continuous bivariate random vector (x, y) has a normal-gamma distribution, with parameters μ, λ, α and β (μ ∈ ℜ, λ > 0, α > 0, β > 0), if its probability density Ng(x, y | μ, λ, α, β) is

  Ng(x, y | μ, λ, α, β) = N(x | μ, λy) Ga(y | α, β),

where the normal and gamma densities are defined in Section 3.2.2. It is clear from the definition that the conditional density of x given y is N(x | μ, λy) and that the marginal density of y is Ga(y | α, β). Moreover, the marginal density of x is St(x | μ, λα/β, 2α). The shape of a normal-gamma distribution is illustrated in Figure 3.1, where a normal-gamma density is displayed both as a surface and in terms of equal density contours.

The Multivariate Normal Distribution

A continuous random vector x = (x_1, …, x_k) has a multivariate normal distribution of dimension k, with parameters μ = (μ_1, …, μ_k) and λ, where μ ∈ ℜ^k and λ is a k × k symmetric positive-definite matrix, if its probability density N_k(x | μ, λ) is

  N_k(x | μ, λ) = c exp{−½ (x − μ)′λ(x − μ)},  x ∈ ℜ^k,
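The precision-matrix parametrisation of the multivariate normal above can be checked against SciPy, which parametrises the multivariate normal by its covariance matrix, the inverse of λ (a sketch; parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
lam = np.array([[2.0, 0.5],      # precision matrix (symmetric, positive-definite)
                [0.5, 1.0]])
x = np.array([0.3, -1.2])
k = len(mu)

# Density N_k(x | mu, lam) written directly in the precision parametrisation
c = (2 * np.pi) ** (-k / 2) * np.sqrt(np.linalg.det(lam))
density = c * np.exp(-0.5 * (x - mu) @ lam @ (x - mu))

# SciPy uses the covariance matrix, i.e. the inverse of the precision matrix
density_scipy = multivariate_normal(mean=mu, cov=np.linalg.inv(lam)).pdf(x)
```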
where

  c = (2π)^{−k/2} |λ|^{1/2}.

Figure 3.1  A normal-gamma density, displayed both as a surface and in terms of equal density contours.

The mean vector and covariance matrix are given by

  E[x_i] = μ_i,  V[x] = λ^{−1},

so that, with Σ = λ^{−1} of general element σ_ij, V[x_i] = σ_ii and C[x_i, x_j] = σ_ij. The parameter μ therefore labels the mean vector and the parameter λ the precision matrix (the inverse of the covariance matrix). If k = 1, so that λ is a scalar, N_k(x | μ, λ) reduces to the univariate normal density N(x | μ, λ).

In the general case, the marginal density for any subvector of x is (multivariate) normal, with mean vector and covariance matrix given by the corresponding subvector of μ and submatrix of λ^{−1}. Moreover, if x = (x_1, x_2) is a partition of x, and if the corresponding partitions of μ and λ are

  μ = (μ_1, μ_2),  λ = ( λ_11  λ_12 ; λ_21  λ_22 ),

then the conditional density of x_1 given x_2 is also (multivariate) normal, of dimension k_1, with mean vector and precision matrix given, respectively, by

  μ_1 − λ_11^{−1}λ_12(x_2 − μ_2)  and  λ_11.

The random quantity y = (x − μ)′λ(x − μ) has a χ²(y | k) density. If y = Ax, where A is an m × k matrix of real numbers such that AΣA′ is nonsingular, with Σ = λ^{−1}, then y has density N_m(y | Aμ, (AΣA′)^{−1}). We also note that, from the form of the multivariate normal density, we can deduce the integral formula

  ∫ exp{−½ (x − μ)′λ(x − μ)} dx = (2π)^{k/2} |λ|^{−1/2}.

The Wishart Distribution

A symmetric, positive-definite matrix x of random quantities x_ij, i, j = 1, …, k, has a Wishart distribution of dimension k, with parameters α and β (with 2α > k − 1 and β a k × k symmetric, nonsingular matrix), if the density Wi_k(x | α, β) of the k(k + 1)/2 dimensional random vector of the distinct entries of x is

  Wi_k(x | α, β) = c |x|^{α − (k+1)/2} exp{−tr(βx)},

where c = |β|^α / Γ_k(α), Γ_k(·) is the generalised gamma function and tr(·) denotes the trace of a matrix argument. If k = 1, so that β is a scalar, Wi_1(x | α, β) reduces to the gamma density Ga(x | α, β).
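The k = 1 reduction of the Wishart density to a gamma density can be checked numerically (a sketch, assuming SciPy; SciPy's Wishart is parametrised by degrees of freedom 2α and scale matrix (2β)^{−1} under the correspondence with the density above):

```python
from scipy.stats import wishart, gamma

alpha, beta = 3.0, 1.5     # Wi_1(x | alpha, beta) should equal Ga(x | alpha, beta)
x = 2.4

# SciPy parametrisation: df = 2*alpha, scale = 1/(2*beta)
wishart_pdf = wishart(df=2 * alpha, scale=1.0 / (2 * beta)).pdf(x)
gamma_pdf = gamma(a=alpha, scale=1.0 / beta).pdf(x)
```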
The following properties of the Wishart distribution are easily established:

  E[x] = αβ^{−1}  and  E[x^{−1}] = (α − (k + 1)/2)^{−1} β,

if the latter exists. If x_1, …, x_s are independent k × k random matrices, each with a Wishart distribution, with parameters α_1, …, α_s and β, then x_1 + ⋯ + x_s also has a Wishart distribution, with parameters α_1 + ⋯ + α_s and β. If {x_1, …, x_n} is a random sample of size n > 1 from a multivariate normal N_k(x | μ, λ), and

  x̄ = n^{−1} Σ_{i=1}^{n} x_i,  S = Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)′,

then x̄ is N_k(x̄ | μ, nλ), and S is independent of x̄ and has a Wishart distribution Wi_k(S | ½(n − 1), ½λ). Moreover, if y = AxA′, where A is an m × k matrix (m ≤ k) of real numbers such that Aβ^{−1}A′ is nonsingular, then y has a Wishart distribution of dimension m with parameters α and (Aβ^{−1}A′)^{−1}; in particular, if x and β^{−1} conformably partition into

  x = ( x_11  x_12 ; x_21  x_22 ),  β^{−1} = ( β^11  β^12 ; β^21  β^22 ),

where x_11 and β^11 are square h × h matrices (1 ≤ h < k), then x_11 has a Wishart distribution of dimension h with parameters α and (β^11)^{−1}. In addition, from the form of the Wishart density, we can deduce the integral formula

  ∫ |x|^{α − (k+1)/2} exp{−tr(βx)} dx = Γ_k(α) |β|^{−α},

the integration being understood to be with respect to the k(k + 1)/2 distinct elements of the matrix x.

The Multivariate Student Distribution

A continuous random vector x = (x_1, …, x_k) has a multivariate Student distribution of dimension k, with parameters μ, λ and α (μ ∈ ℜ^k, λ a symmetric, positive-definite k × k matrix, α > 0), if its probability density St_k(x | μ, λ, α) is

  St_k(x | μ, λ, α) = c [1 + α^{−1}(x − μ)′λ(x − μ)]^{−(α+k)/2},  x ∈ ℜ^k,

where

  c = {Γ((α + k)/2) / Γ(α/2)} |λ|^{1/2} (απ)^{−k/2}.

If k = 1, St_1(x | μ, λ, α) reduces to the univariate Student density St(x | μ, λ, α). We note that E[x] = μ and V[x] = λ^{−1} α/(α − 2), if the latter exists. Although not exactly equal to the inverse of the covariance matrix, the parameter λ is often referred to as the precision matrix of the distribution.
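The multivariate Student density above can be compared with SciPy's multivariate t, which is parametrised by a shape matrix equal to λ^{−1} rather than by the precision matrix λ (a sketch with arbitrary parameter values; scipy.stats.multivariate_t requires SciPy 1.6 or later):

```python
import numpy as np
from math import gamma as G, pi
from scipy.stats import multivariate_t

mu = np.array([0.5, -1.0])
lam = np.array([[1.2, 0.3],     # precision-style parameter of St_k
                [0.3, 2.0]])
alpha = 5.0                     # degrees of freedom
x = np.array([1.0, 0.0])
k = len(mu)

# St_k(x | mu, lam, alpha) written out from the definition above
c = G((alpha + k) / 2) / (G(alpha / 2) * (alpha * pi) ** (k / 2)) * np.sqrt(np.linalg.det(lam))
q = (x - mu) @ lam @ (x - mu)
density = c * (1 + q / alpha) ** (-(alpha + k) / 2)

# SciPy's shape matrix is the inverse of lam
density_scipy = multivariate_t(loc=mu, shape=np.linalg.inv(lam), df=alpha).pdf(x)
```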
If y = Ax, where A is an m × k matrix (m ≤ k) of real numbers such that Aλ^{−1}A′ is nonsingular, then y has density St_m(y | Aμ, (Aλ^{−1}A′)^{−1}, α). In particular, the marginal density for any subvector of x is (multivariate) Student, with mean vector and inverse of the precision matrix given by the corresponding subvector of μ and submatrix of λ^{−1}. Moreover, if x = (x_1, x_2) is a partition of x, and the corresponding partitions of μ and λ are

  μ = (μ_1, μ_2),  λ = ( λ_11  λ_12 ; λ_21  λ_22 ),

then the conditional density of x_1, given x_2, is also (multivariate) Student, of dimension k_1, with α + k_2 degrees of freedom, and with mean vector and precision matrix given, respectively, by

  μ_1 − λ_11^{−1}λ_12(x_2 − μ_2)

and

  {(α + k_2) / (α + (x_2 − μ_2)′(λ_22 − λ_21λ_11^{−1}λ_12)(x_2 − μ_2))} λ_11.

The random quantity y = k^{−1}(x − μ)′λ(x − μ) has an F(y | k, α) density.

The Multivariate Normal-Gamma Distribution

A continuous random vector x = (x_1, …, x_k) and a random quantity y have a joint multivariate normal-gamma distribution of dimension k, with parameters μ, λ, α and β (μ ∈ ℜ^k, λ a k × k symmetric, positive-definite matrix, α > 0 and β > 0), if the joint probability density of x and y, Ng_k(x, y | μ, λ, α, β), is

  Ng_k(x, y | μ, λ, α, β) = N_k(x | μ, λy) Ga(y | α, β),

where the multivariate normal and gamma densities have already been defined. From the definition, the conditional density of x given y is N_k(x | μ, λy) and the marginal density of y is Ga(y | α, β). Moreover, the marginal density of x is St_k(x | μ, (α/β)λ, 2α).

The Multivariate Normal-Wishart Distribution

A continuous random vector x = (x_1, …, x_k) and a symmetric, positive-definite k × k matrix of random quantities y have a joint normal-Wishart distribution of dimension k, with parameters μ, λ, α and β (μ ∈ ℜ^k, λ > 0, 2α > k − 1 and β a k × k symmetric, nonsingular matrix), if the joint probability density of x and the k(k + 1)/2 distinct elements of y, Nw_k(x, y | μ, λ, α, β), is

  Nw_k(x, y | μ, λ, α, β) = N_k(x | μ, λy) Wi_k(y | α, β),

where the multivariate normal and Wishart densities are as defined above. From the definition, the conditional density of x given y is N_k(x | μ, λy) and the marginal density of y is Wi_k(y | α, β). Moreover, the marginal density of x is St_k(x | μ, λ(α − (k − 1)/2)β^{−1}, 2α − k + 1).
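The Student marginals quoted for the normal-gamma constructions above can be verified numerically in the univariate case by integrating the precision parameter y out of N(x | μ, λy) Ga(y | α, β) (a sketch, assuming SciPy; parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import norm, gamma, t
from scipy.integrate import quad

mu, lam, alpha, beta = 0.5, 2.0, 3.0, 1.5
x = 1.2

# Marginal of x: integrate N(x | mu, lam*y) Ga(y | alpha, beta) over y > 0
integrand = lambda y: (norm.pdf(x, loc=mu, scale=1.0 / np.sqrt(lam * y))
                       * gamma.pdf(y, a=alpha, scale=1.0 / beta))
marginal, _ = quad(integrand, 0, np.inf)

# Claimed closed form: St(x | mu, lam*alpha/beta, 2*alpha), i.e. a Student t
# with 2*alpha degrees of freedom, location mu and precision lam*alpha/beta
precision = lam * alpha / beta
student = t.pdf(x, df=2 * alpha, loc=mu, scale=1.0 / np.sqrt(precision))
```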
The Bilateral Pareto Distribution

A continuous bivariate random vector (x, y) has a bilateral Pareto distribution, with parameters β_0, β_1 and α (β_0 < β_1, α > 0), if its density function Pa_2(x, y | α, β_0, β_1) is

  Pa_2(x, y | α, β_0, β_1) = c (y − x)^{−(α+2)},  x < β_0,  y > β_1,

where c = α(α + 1)(β_1 − β_0)^α. The marginal distributions of u = β_1 − x and v = y − β_0 are both Pareto, Pa(· | α, β_1 − β_0). The mean and variance are given by

  E[x] = (αβ_0 − β_1)/(α − 1),  E[y] = (αβ_1 − β_0)/(α − 1),  if α > 1,

  V[x] = V[y] = α(β_1 − β_0)² / {(α − 1)²(α − 2)},  if α > 2,

and the correlation between x and y is −1/α.

3.3 GENERALISED OPTIONS AND UTILITIES

3.3.1 Motivation and Preliminaries

For reasons of mathematical or descriptive convenience, it is common in statistical decision problems to consider sets of options which consist of part or all of the real line (as in problems of point estimation) or are part of some more general space, going beyond finite, or even countable, frameworks. It is therefore desirable to extend the concepts and results of Chapter 2 to a much more general mathematical setting, first by taking E to be a σ-algebra and then suitably extending the fundamental notion of an option. In the finite case, an option was denoted by a = {c_j | E_j, j ∈ J}, with the straightforward interpretation that, if option a is chosen, c_j is the consequence of the occurrence of the event E_j. The extension of this definition to infinite settings clearly requires some form of constructive limit process, analogous to that used in Lebesgue measure and integration theory in passing from simple (i.e., "step") functions to more general functions. Since the development given in Chapter 2 led to the assessment of options in terms of their expected utilities, the "natural" definition of limit that suggests itself is one based fundamentally on the expected utility idea (Bernardo, Ferrándiz and Smith, 1985). Let us therefore consider a decision problem {E, C, A, ≤}, where uncertainty is described by a probability space {Ω, F, P} and a utility function u : C → ℜ.
In other words, ū(a) = ū(a | Ω) is precisely the expected utility of the simple option a (see Definition 2.16 with G = Ω); we shall return later to the case of a general conditioning event G and corresponding probability measure P(· | G). In all the definitions and propositions in this section, u : C → ℜ, A and {Ω, F, P} are to be understood as fixed background specifications.

Definition 3.11. (Convergence in expected utility). Given a utility function u : C → ℜ, a sequence of functions d_1, d_2, … in D is said to converge in expected utility to a function d in D, written d_i → d, if and only if

(i) u ∘ d_i → u ∘ d almost surely;
(ii) ū(d) = lim_{i→∞} ū(d_i).

Definition 3.12. (Decisions). For a given utility function u : C → ℜ, a function d ∈ D is a decision (generalised option) if and only if there exists a sequence a_1, a_2, … of simple options such that a_i → d. The value of ū(d), for d such that u ∘ d is essentially bounded, is then called the expected utility of the decision d.
Discussion of Definitions 3.11 and 3.12. In abstract mathematical terms, D consists of those functions (soon to be called decisions) d : Ω → C for which u ∘ d = u(d(·)) is a random quantity whose expectation exists. The extension from simple functions to more general functions requires some form of limit process; and since the fundamental coherence result of Proposition 2.25 was that simple options should be compared in terms of their expected utilities, in order for this to carry over smoothly to decisions (generalised options) it is natural to require a constructive definition of the latter in terms of a limit concept directly expressed in terms of expected utilities. However, as it stands, this constructive definition does not provide a straightforward means of checking whether or not, given a specified utility function, a function d ∈ D is or is not a generalised option. We can prove, however, that any d ∈ D such that u ∘ d is essentially bounded (i.e., u ∘ d is bounded except on a subset of Ω of P-measure zero) is a decision. More specifically, we can prove the following.

Proposition 3.6. For a given utility function u : C → ℜ, any function d ∈ D such that u ∘ d is essentially bounded is a decision.
Proof. We prove first that if u ∘ d is essentially bounded above then there exists a sequence of simple acts a_1, a_2, … such that a_i → d and, for all i, ū(a_i) ≥ ū(d). (An exactly parallel proof exists if "above" is replaced by "below" and ≥ by ≤.) We begin by defining the partitions {E_ij, j ∈ J_i}, where

  E_ij = {ω ∈ Ω; j2^{−i} ≤ u ∘ d(ω) < (j + 1)2^{−i}},  j = −2^{2i}, …, 2^{2i} − 1,

together with the two extreme events {ω ∈ Ω; u ∘ d(ω) < −2^i} and {ω ∈ Ω; u ∘ d(ω) ≥ 2^i}. For each i, this establishes a partition of Ω into 2(2^{2i} + 1) events, in such a way that the two extreme events contain outcomes with values of u ∘ d(ω) below −2^i or above 2^i, whereas the other events contain outcomes whose values of u ∘ d(ω) do not differ by more than 2^{−i}. We now define a sequence {a_i} of simple options a_i = {c_ij | E_ij, j ∈ J_i} such that:

(i) if P(E_ij) = 0 then c_ij is an arbitrary element of C;
(ii) if P(E_ij) > 0 then c_ij ∈ d(E_ij) is chosen so that u(c_ij) ≥ ū_ij(d), where ū_ij(d) denotes the conditional expectation of u ∘ d given E_ij.

To see that the c_ij exist and are well defined, note that if we had u(d(ω)) < ū_ij(d) for all ω ∈ E_ij, then, taking expectations over E_ij, we would obtain a contradiction with the definition of ū_ij(d). By construction, u ∘ a_i → u ∘ d almost surely, since, for almost all ω and all ε > 0, there exists i_0 with 2^{−i_0} < ε such that, for all i > i_0, |u ∘ a_i(ω) − u ∘ d(ω)| < ε; moreover, ū(a_i) ≥ ū(d) for all i.
To show that a_i → d, it remains to prove that ū(a_i) → ū(d). Writing δ_i = |ū(a_i) − ū(d)|, we note that, for sufficiently large i (larger than the essential supremum of u ∘ d), the extreme events have probability zero, so that δ_i ≤ 2^{−i}, which converges to zero as i → ∞; since ū(d) < ∞, this establishes the result. ◁

Whether or not u ∘ d is essentially bounded, any decision can be obtained as the limit of "bounding sequences of simple options", in the sense made precise in the following.

Proposition 3.7. (Bounding sequences of simple options). Given a utility function u : C → ℜ and any decision d ∈ D, there exist sequences a_1, a_2, … and a′_1, a′_2, … of simple options such that a_i → d, a′_i → d and, for all i, ū(a_i) ≤ ū(d) ≤ ū(a′_i).

Proof. We shall show that there exists a sequence of simple options a′_1, a′_2, … such that a′_i → d and, for all i, ū(a′_i) ≥ ū(d); an obviously parallel proof, with a decreasing sequence of real numbers in place of the increasing sequence used below, establishes the other inequality. We first note that either u ∘ d is essentially bounded above or it is not. In the former case, the result follows from (the proof of) Proposition 3.6. In the latter case, choose a consequence c ∈ C and an increasing sequence of real numbers x_1, x_2, … such that u(c) < x_1 and x_i → ∞ as i → ∞, and define

  A_i = {ω ∈ Ω; u ∘ d(ω) > x_i},  d_i(ω) = d(ω), if ω ∉ A_i;  d_i(ω) = c, if ω ∈ A_i,

so that P(A_i) → 0 as i → ∞ and each u ∘ d_i is essentially bounded above.
Since d is a decision, there exists a sequence a_1, a_2, … of simple options such that a_i → d and ū(a_i) → ū(d); combining these with the simple options approximating the essentially bounded d_i, and passing, if necessary, to subsequences on which the expected utilities bound ū(d) from below and from above, respectively, the required result follows. ◁

3.3.2 Generalised Preferences

Given the adoption, for mathematical or descriptive convenience, of the extended framework developed in the previous section, it is natural to require that preferences among simple options should "carry over", under the limit process we have introduced, to the corresponding decisions. This is made precise in the following.

Postulate 2. (Extension of the preference relation). Given a utility function u : C → ℜ, for any decisions d_1, d_2 and sequences of simple options {a_i} and {a′_i} such that {a_i} → d_1 and {a′_i} → d_2, we have:

(i) if, for all i > i_0, for some i_0, a′_i ≥ a_i, then d_2 ≥ d_1;
(ii) if, for all i > i_0, for some i_0, a′_i > a_i, and ū(d_2) > ū(d_1), then d_2 > d_1.

The first part of the postulate simply captures the notion of the carry-over of preferences in the limit; the second part of the postulate is an obvious necessary condition for strict preference. Together with our previous axioms, this postulate enables us to establish a very general statement of the identification of quantitative coherence with the principle of maximising expected utility.
Proposition 3.8. Given a utility function u : C → ℜ and any two decisions d_1 and d_2, d_2 ≥ d_1 if and only if ū(d_2) ≥ ū(d_1).

Proof. We first establish that ū(d_2) > ū(d_1) implies that d_2 > d_1. By Proposition 3.7, for k = 1, 2, there exist sequences of simple options a_i^(k) and a′_i^(k) such that a_i^(k) → d_k, a′_i^(k) → d_k and, for all i, ū(a_i^(k)) ≤ ū(d_k) ≤ ū(a′_i^(k)). Writing ε = {ū(d_2) − ū(d_1)}/3, we can choose i_1, i_2 such that, for all i > i_1, ū(a′_i^(1)) ≤ ū(d_1) + ε and, for all i > i_2, ū(a_i^(2)) ≥ ū(d_2) − ε. It follows that, for all i > max{i_1, i_2}, ū(a_i^(2)) > ū(a′_i^(1)) and hence, by Proposition 2.25, a_i^(2) > a′_i^(1); since a′_i^(1) → d_1 and a_i^(2) → d_2, Postulate 2 now implies that d_2 > d_1. To complete the proof, we must show that ū(d_1) = ū(d_2) implies that d_1 ∼ d_2. Since, for all i, ū(a′_i^(2)) ≥ ū(d_2) = ū(d_1) ≥ ū(a_i^(1)), it follows from Proposition 2.25 that a′_i^(2) ≥ a_i^(1) for all i, and hence, by Postulate 2, that d_2 ≥ d_1; interchanging the roles of d_1 and d_2, we similarly obtain d_1 ≥ d_2, and so d_1 ∼ d_2. ◁

Proposition 3.9. (Maximisation of expected utility for decisions). Given any two decisions d_1, d_2 and any G > ∅, d_2 ≥ d_1 given G if and only if ū(d_1 | G) ≤ ū(d_2 | G), where

  ū(d | G) = ∫ u ∘ d(ω) dP(ω | G).

Proof. Throughout the above, the probability measure P(·) can be replaced by P(· | G) without any basic modifications to the proofs and results. If u ∘ d is integrable with respect to P(·), then it is integrable with respect to P(· | G), and it is easily verified that P(G) ū(d | G) = ∫ 1_G(ω) u ∘ d(ω) dP(ω), where 1_G is the indicator function of G. ◁
This establishes in full generality that the decision criterion of maximising the expected utility is the only criterion which is compatible with an intuitive set of quantitative coherence axioms and the natural mathematical extensions encapsulated in Postulates 1 and 2.

3.3.3 The Value of Information

As we saw in Section 2.6, given a general decision problem it is natural, before making a decision, to consider trying to reduce the current uncertainty by obtaining further information by experimentation; whether or not this is sensible obviously depends on the relative costs and benefits of such additional information. We shall now extend the notions related to the value of information, introduced in Section 2.6, to the more general mathematical framework established in this chapter. The decision tree for experimental design, given originally in Chapter 2, now takes the form given in Figure 3.2, where the utility notation is extended to make explicit the possible dependence on the experiment performed, e (or e_0 if no data are collected), and the data obtained, x. We have shown that, given a general decision problem, the optimal action is that which maximises the expected utility; thus, the expected utility from an optimal decision with no additional information is defined by

  ū(e_0) = sup_d ū(d),  where  ū(d) = ∫ u(d(ω)) p(ω) dω,

in which u(d(ω)) describes the current preferences among consequences, and p(ω), the probability density of P with respect to the appropriate dominating measure, describes current beliefs about ω,
where ω ∈ Ω labels the uncertain outcomes associated with the problem.
Figure 3.2  Generalised decision tree for experimental design.

Let d*_x be the optimal decision after experiment e has been performed and data x have been obtained, so that ū(d*_x, e, x) = sup_d ū(d, e, x) is the expected utility from the optimal decision given e and x. The expected utility from the optimal decision following e is then

  ū(e) = ∫ ū(d*_x, e, x) p(x | e) dx,

where p(x | e) describes beliefs about the occurrence of x were e to be performed. The expected value of the information provided by additional data x may be computed as the (posterior) expected difference between the utilities which correspond to optimal decisions after and before the data. Thus, we have the following.

Definition 3.13. (The value of additional information).
(i) The expected value of the information provided by x is

  v(e, x) = ∫ {u(d*_x, e, x, ω) − u(d*, e, x, ω)} p(ω | e, x) dω,

where d* is the optimal decision with no additional information;
(ii) the expected value of the experiment e is

  v(e) = ∫ v(e, x) p(x | e) dx.

Proposition 3.10. (Optimal experimental design). The optimal decision is to perform experiment e* if ū(e*) = max_e ū(e) and ū(e*) > ū(e_0); and to perform no experiment otherwise.

Proof. This follows immediately. ◁

Let us now consider the optimal decisions which would be available to us if we knew the value of ω. Thus, let d*_ω be the optimal decision given ω; i.e., such that, for all d, u(d*_ω, ω) ≥ u(d, ω),
u(d, ω) being the utility of directly taking decision d and finding ω to be the state of the world. For d ≠ d*_ω, given ω, the loss suffered by choosing another decision d would be u(d*_ω, ω) − u(d, ω); this utility difference measures (conditional on ω) the value of perfect information. Its expected value with respect to p(ω) will define, under certain conditions, an upper bound on the increase in utility which additional information about ω could be expected to provide.

Definition 3.14. (Expected value of perfect information). The opportunity loss of choosing d is defined to be

  l(d, ω) = u(d*_ω, ω) − u(d, ω),

and the expected value of perfect information about ω is defined by

  v*(e_0) = ∫ l(d*, ω) p(ω) dω,

where d* is the optimal decision with no additional information.

As we remarked in Section 2.6, in many situations the utility function may often be thought of as made up of two separate components: the experimental cost of performing e and obtaining x, and the utility of the decision itself. Given such an (additive) decomposition, we can establish a useful upper bound for the expected value of an experiment.

Proposition 3.11. (Additive decomposition). If the utility function has the form

  u(d, e, x, ω) = u(d, ω) − c(e, x),  with c(e, x) ≥ 0,

and the probability distributions are such that p(ω | e_0, d) = p(ω | e_0) = p(ω), then, for any available e, v(e) ≤ v*(e_0) − c̄(e), where c̄(e) = ∫ c(e, x) p(x | e) dx is the expected cost of e.

Proof. This closely parallels the proof given, for the finite case, in Chapter 2. ◁

This concludes the mathematical extension of the basic framework and associated axioms. In the next section, we reconsider the important special problem of statistical inference, previously discussed in detail in its finitistic setting in Section 2.7.
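As a concrete illustration of Definition 3.14 (ours, not the book's): with quadratic utility u(d, ω) = −(d − ω)², the optimal decision with no additional data is d* = E[ω], the optimal decision given ω is d*_ω = ω with u(d*_ω, ω) = 0, and the expected value of perfect information reduces to the variance V[ω]. A Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.normal(loc=2.0, scale=1.5, size=200_000)   # draws from p(omega)

u = lambda d, w: -(d - w) ** 2          # quadratic utility

d_star = omega.mean()                   # maximises expected quadratic utility
# Perfect information: knowing omega, the optimal decision is omega itself,
# with utility u(omega, omega) = 0, so the expected opportunity loss is
evpi = np.mean(u(omega, omega) - u(d_star, omega))     # approximately V[omega]
```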
3.4 GENERALISED INFORMATION MEASURES

3.4.1 The General Problem of Reporting Beliefs

In Section 2.7, we argued that the problem of reporting a degree of belief distribution for a (finite) class of exclusive and exhaustive "hypotheses" {H_j, j ∈ J} could be formulated as a decision problem, in which the relevant uncertain events are E_j = "the hypothesis H_j is true", j ∈ J, an option consists of reporting a probability distribution q = {q_j, j ∈ J}, where q_j is the probability which, conditional on the data, an individual reports as the probability of E_j, and the set of consequences C consists of all pairs (q, E_j), representing the possible conjunction of reported beliefs and true hypotheses. We then proceeded to consider a special class of utility functions (score functions) appropriate to this reporting problem and to examine the resulting forms of implied decisions and the links with information theory.

In this section, we shall generalise these concepts and results to the extended framework developed in the previous sections. The first generalisation consists in noting that the set of alternative "hypotheses" now corresponds to the set of possible values of a (possibly continuous) random vector ω ∈ Ω, labelling the "unknown states of the world", so that the relevant uncertain events are E_ω = "the hypothesis ω is true", ω ∈ Ω. Quantitative coherence requires that any particular individual's uncertainty about ω, conditional on some relevant data D and initial state of information M_0, should be represented by a probability distribution P over a σ-algebra of subsets of Ω, the probability measure describing the individual's actual beliefs, conditional on D and M_0, which we shall assume can be described by a density (to be understood as a mass function in the discrete case)

  p(ω | D),  ω ∈ Ω.

We shall take the set of possible inference statements to be the set Q of probability distributions for ω compatible with D, each described by a density q(ω | D), and we denote by D the set of functions d_q, one for each q_ω(· | D) ∈ Q, which map ω to the pair (q_ω(· | D), ω).
3.4.2 The Utility of a General Probability Distribution

In this general setting, the problem of providing an inference statement about the class of exclusive and exhaustive "hypotheses" {E_ω, ω ∈ Ω}, given data D, is a decision problem, where Ω is the set of possible values of the random quantity ω, Q relates to the class of probability densities for ω ∈ Ω compatible with D, the decision space D consists of the d_q's corresponding to choosing to report the q_ω(· | D)'s, and the set of consequences C consists of all pairs (q_ω(· | D), ω), representing the conjunction of reported beliefs and true "states of nature". The solution to the decision problem is then to report the density q_ω(· | D) which maximises the expected utility

  ū(d_q) = ∫ u(q_ω(· | D), ω) p(ω | D) dω,

where u(q_ω(· | D), ω) describes the "value" of reporting the probability density q_ω(· | D) were ω to turn out to be the true "state of nature". Throughout this section, we shall denote an individual's actual belief density by p_ω(· | D), whereas q_ω(· | D) is the density which the individual reports as the basis for describing beliefs about ω, conditional on data D. Without loss of generality, we shall assume that p_ω(· | D) and the q_ω(· | D) ∈ Q are strictly positive probability densities, so that p(ω | D) > 0 and q(ω | D) > 0 for all ω ∈ Ω and all d_q ∈ D.

We shall assume the individual to be coherent and, as in our earlier development in Chapter 2, we shall wish to restrict utility functions for the reporting problem in such a way as to encourage a coherent individual to be honest, in the sense that his or her expected utility is maximised if and only if q_ω(· | D) is chosen such that, for all ω ∈ Ω, q(ω | D) = p(ω | D). For this purpose, and with the same motivation, we generalise the notion of score function introduced in Definition 2.20. The appropriate generalisation of Definition 2.21 is the following.

Definition 3.15. (Score function). A score function for probability densities q_ω(· | D) ∈ Q defined on Ω is a mapping u : Q × Ω → ℜ, subject to the existence of ū(d_q) for all d_q ∈ D. A score function is said to be smooth if it is continuously differentiable as a function of q(ω | D) for each ω ∈ Ω.
Definition 3.16. (Proper score function). A score function u is proper if, for each strictly positive probability density p_ω(· | D),

  sup_{q_ω(·|D) ∈ Q} ∫ u(q_ω(· | D), ω) p(ω | D) dω

is attained if and only if q(ω | D) = p(ω | D) for almost all ω ∈ Ω.

As in the finite case (see Definition 2.22), the simplest proper score function in the general case is the quadratic.

Definition 3.17. (Quadratic score function). A quadratic score function for probability densities q_ω(· | D) ∈ Q defined on Ω is a mapping u : Q × Ω → ℜ of the form

  u(q_ω(· | D), ω) = A{2q(ω | D) − ∫ q²(ω | D) dω} + B(ω),  A > 0,

such that the otherwise arbitrary function B(·) ensures the existence of ū(d_q) for all d_q ∈ D.

Proposition 3.12. (Quadratic score functions are proper). A quadratic score function is proper.

Proof. Given data D, we must choose q_ω(· | D) ∈ Q to maximise

  ∫ [A{2q(ω | D) − ∫ q²(ω | D) dω} + B(ω)] p(ω | D) dω,

subject to ∫ q(ω | D) dω = 1. Rearranging, it is easily seen that this is equivalent to maximising

  −∫ {q(ω | D) − p(ω | D)}² dω,

from which it follows that we require q(ω | D) = p(ω | D) for almost all ω ∈ Ω. We note again (cf. Proposition 2.28) that the constraint ∫ q(ω | D) dω = 1 has not been needed in establishing this result for the quadratic scoring rule. ◁
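The proper-ness of the quadratic score can be illustrated numerically in the simplest (binary) case (a sketch of ours; the grid search over reported probabilities is not part of the book's argument): the expected quadratic score, as a function of the reported probability, peaks exactly at the actual belief.

```python
import numpy as np

p = 0.3                                  # actual belief that omega = 1
qs = np.linspace(0.01, 0.99, 981)        # candidate reported probabilities

def expected_quadratic_score(q, p):
    # u(q, omega) = 2 q(omega) - sum_w q(w)^2, averaged over omega ~ p
    norm2 = q ** 2 + (1 - q) ** 2
    return p * (2 * q - norm2) + (1 - p) * (2 * (1 - q) - norm2)

scores = expected_quadratic_score(qs, p)
best_q = qs[np.argmax(scores)]           # should coincide with the belief p
```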
In fact, as we argued in Section 2.7, for the problem of reporting pure inference statements it is natural to restrict further the class of appropriate utility functions. The following generalises Definition 2.23.

Definition 3.18. (Local score function). A score function is local if, for each q_ω(· | D) ∈ Q, there exist functions u_ω(·), ω ∈ Ω, defined on ℜ⁺, such that

  u(q_ω(· | D), ω) = u_ω(q(ω | D)).

Note that, as in Definition 2.23, the dependence of the score function on the density value q(ω | D) which d_q assigns to ω is allowed to vary with the particular ω in question. Intuitively, this enables us to incorporate the possibility that "bad predictions", for some "true states of nature", may be judged more harshly than others. The next result generalises Proposition 2.29 and characterises the form of a smooth, proper, local score function.

Proposition 3.13. (Characterisation of proper local score functions). If u : Q × Ω → ℜ is a smooth, proper, local score function, subject to the existence of ū(d_q) for all d_q ∈ D, then it must be of the form

  u(q_ω(· | D), ω) = A log q(ω | D) + B(ω),

where A > 0 is an arbitrary constant and B(·) is an arbitrary function of ω.

Proof. Since u is local, we need to maximise, with respect to q(· | D), the expected utility

  F(q_ω(· | D)) = ∫ u_ω(q(ω | D)) p(ω | D) dω,  subject to  ∫ q(ω | D) dω = 1.

For q_ω(· | D) to give a stationary value of F(q_ω(· | D)), it is necessary that

  (∂/∂ε) F(q(ω | D) + ε τ(ω)) |_{ε=0} = 0

for any function τ : Ω → ℜ of sufficiently small norm (see, for example, Jeffreys and Jeffreys, 1946, Chapter 10). This condition reduces to the differential equation

  D_1 u_ω(q(ω | D)) p(ω | D) − A = 0,

where the constant A arises from the constraint,
and D_1 u_ω denotes the first derivative of u_ω. But, since u is proper, the maximum of F(q_ω(· | D)) must be attained at q_ω(· | D) = p_ω(· | D), so that a smooth, proper, local utility function must satisfy the differential equation

  D_1 u_ω(p(ω | D)) p(ω | D) − A = 0,
whose solution is given by
llw
( p ( wI U ) ) = .4 logpfw 1 11) + 13(w).
as stated. This result prompts us to make the following formal definition.
Definition 3.19. (Logarithmic score function). A logarithmic score function for probability densities q_w(· | D) ∈ Q defined on Ω is a mapping u : Q × Ω → ℜ of the form

u{q_w(· | D), w} = A log q(w | D) + B(w),

A > 0, B(·) arbitrary, subject to the existence of ū(d_q) for all d_q ∈ D.
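The sense in which the logarithmic score is proper can be checked directly in a simple discrete setting. The following sketch (purely illustrative; the belief vector p and the constants A and B are arbitrary choices, not from the text) verifies that the expected score is maximised by reporting one's actual beliefs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Actual beliefs p(w | D) over three states of nature (arbitrary illustration).
p = np.array([0.5, 0.3, 0.2])

def expected_log_score(q, p, A=1.0, B=0.0):
    # Expected utility of reporting q when p holds: sum_w p(w) (A log q(w) + B).
    return np.sum(p * (A * np.log(q) + B))

best = expected_log_score(p, p)

# Reporting any other distribution q gives a strictly smaller expected score.
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))
    assert expected_log_score(q, p) <= best + 1e-12
```

Since A > 0 merely rescales, and B(·) merely shifts, the expected score, neither affects which report is optimal, which is why Definition 3.19 leaves them arbitrary.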
For additional discussion of generalised score functions see Good (1969) and Buehler (1971).
3.4.3 Generalised Approximation and Discrepancy
As we remarked in Section 2.7.3, although the optimal solution to an inference problem under the above conditions is to state one's actual beliefs, there may be technical reasons why the computation of this "optimal" density p_w(· | D) is difficult. In such cases, we may need to seek a tractable approximation, q_w(· | D), say, which is in some sense "close" to p_w(· | D), but much easier to specify. As in the previous discussion of this idea, we shall need to examine carefully this notion of "closeness". The next result generalises Proposition 2.30.
Proposition 3.14. (Expected loss in probability reporting). If preferences are described by a logarithmic score function, the expected loss of utility in reporting a probability density q(w | D), rather than the density p(w | D) representing actual beliefs, is given by

A ∫ p(w | D) log {p(w | D)/q(w | D)} dw ≥ 0.
Proof. From Definition 3.19, the expected utility of reporting q_w(· | D) when p_w(· | D) is the actual belief distribution is given by

ū(d_q) = ∫ {A log q(w | D) + B(w)} p(w | D) dw,

so that

ū(d_p) − ū(d_q) = A ∫ p(w | D) log {p(w | D)/q(w | D)} dw.

The final condition follows either from the fact that u is proper, so that ū(d_p) ≥ ū(d_q) with equality if and only if q_w(· | D) = p_w(· | D); or directly from the fact that, for all x > 0, log x ≤ x − 1 with equality if and only if x = 1 (cf. Proposition 2.30). ⊲
As in the finitistic discussion of Section 2.7, the above result suggests a natural, general measure of "lack of fit", or discrepancy, between a distribution and an approximation, when preferences are described by a logarithmic score function.
Definition 3.20. (Discrepancy of an approximation). The discrepancy between a strictly positive probability density p_w(·) and an approximation q̂_w(·), w ∈ Ω, is defined by

δ{q̂_w(·) | p_w(·)} = ∫ p_w(w) log {p_w(w)/q̂_w(w)} dw.
Example 3.4. (General normal approximations). Suppose that p(w) > 0, w ∈ ℜ, is an arbitrary density on the real line, with finite first two moments given by

m = ∫ w p(w) dw,   s² = ∫ (w − m)² p(w) dw,
and that we wish to approximate p(·) by a q̂(·) corresponding to a normal density, N(w | μ, λ), with labelling parameters μ, λ chosen to minimise the discrepancy measure given in Definition 3.20. It is easy to see that, subject to the given constraints, minimising δ{q̂ | p} with respect to μ and λ is equivalent to minimising

− ∫ p(w) log N(w | μ, λ) dw,
and hence to minimising

− ½ log λ + (λ/2) ∫ p(w) (w − μ)² dw.
Invoking the two moment constraints, and writing (w − μ)² = (w − m + m − μ)², this reduces to minimising, with respect to μ and λ, the expression

λs² − log λ + λ(m − μ)².

It follows that the optimal choice is μ = m, λ = 1/s². In other words, for the reporting problem with a logarithmic score function, the best normal approximation to a distribution on the real line (whose mean and variance exist) is the normal distribution having the same mean and variance.
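This moment-matching property is easy to check numerically. The sketch below (an illustration, not part of the original text) takes a Gamma(3, 1) target, whose mean and variance are both 3, and minimises the discrepancy of Definition 3.20 over normal densities by quadrature; the normal is parameterised here by its standard deviation rather than by the precision λ:

```python
import numpy as np
from scipy import stats, optimize
from scipy.integrate import trapezoid

# Hypothetical target density p: Gamma(3, 1), whose mean and variance both equal 3.
grid = np.linspace(1e-6, 30.0, 20001)
target = stats.gamma(a=3.0)
p = target.pdf(grid)
logp = target.logpdf(grid)

def discrepancy(params):
    # delta{N | p} = integral of p(w) log[p(w) / N(w | mu, sigma^2)] dw, by quadrature.
    mu, log_sigma = params
    logq = stats.norm(mu, np.exp(log_sigma)).logpdf(grid)
    return trapezoid(p * (logp - logq), grid)

res = optimize.minimize(discrepancy, x0=[2.0, 0.5])
mu_opt = res.x[0]
var_opt = np.exp(res.x[1]) ** 2

# The minimiser matches the mean and variance of the target, as Example 3.4 asserts.
assert abs(mu_opt - 3.0) < 0.05
assert abs(var_opt - 3.0) < 0.1
```

Parameterising by log σ keeps the search unconstrained; in the book's notation the optimal precision is λ = 1/s², i.e., the inverse of the target variance.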
Example 3.5. (Normal approximations to Student distributions). Suppose that we wish to approximate the density St(x | μ, λ, α), α > 2, by a normal density. We have just shown that the best normal approximation to any distribution is that with the same first two moments (assuming the latter to exist, corresponding here to the restriction α > 2). Thus, recalling from Section 3.2.2 that the mean and precision of St(x | μ, λ, α) are given by μ and λ(α − 2)/α, respectively, it follows that the best normal approximation to St(x | μ, λ, α) is provided by N(x | μ, λ(α − 2)/α). From Definition 3.20, the corresponding discrepancy will be

δ(N | St) = ∫ St(x | μ, λ, α) log [St(x | μ, λ, α) / N(x | μ, λ(α − 2)/α)] dx.
Figure 3.3  Discrepancy between Student and normal densities
This is easily evaluated (see, for example, Bernardo, 1978a) using the fact that the entropy of a Student distribution is given by

H{St(x | μ, λ, α)} = ((α + 1)/2) {ψ((α + 1)/2) − ψ(α/2)} + log {(α/λ)^(1/2) B(α/2, 1/2)},
where ψ(z) = Γ′(z)/Γ(z) denotes the digamma function (see, for example, Abramowitz and Stegun, 1964), from which it follows that δ(N | St) may be written in a form which only depends on the degrees of freedom, α, of the Student distribution. Figure 3.3 shows a plot of δ(N | St) against α. Using Stirling's approximation,

log Γ(z) ≈ (z − ½) log z − z + ½ log(2π),

we obtain, for moderate to large values of α,
δ(N | St) ≈ {α(α − 2)}⁻¹ = O(1/α²),

so that {α(α − 2)}⁻¹ provides a simple, approximate measure of the departure from normality of a Student distribution.
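A numerical check of this behaviour (an illustration using scipy's Student and normal densities, with μ = 0 and λ = 1; not from the text): the discrepancy between St(x | 0, 1, α) and its best normal approximation N(x | 0, (α − 2)/α) is positive and decays rapidly as the degrees of freedom grow:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def discrepancy_normal_student(alpha):
    # delta(N | St) for St(x | 0, 1, alpha): the approximating normal matches
    # the Student mean 0 and precision (alpha - 2)/alpha, i.e. variance alpha/(alpha - 2).
    st = stats.t(df=alpha)
    nm = stats.norm(0.0, np.sqrt(alpha / (alpha - 2.0)))
    val, _ = quad(lambda x: st.pdf(x) * (st.logpdf(x) - nm.logpdf(x)),
                  -np.inf, np.inf)
    return val

kls = [discrepancy_normal_student(a) for a in (5.0, 10.0, 20.0, 40.0)]

# The discrepancy is positive and shrinks quickly with the degrees of freedom.
assert all(k > 0 for k in kls)
assert kls[0] > kls[1] > kls[2] > kls[3]
assert kls[3] < kls[0] / 5.0
```

The rapid decay is consistent with the O(1/α²) behaviour noted above.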
3.4.4 Generalised Information
In Section 2.7.4, we examined, in the finitistic context, the increase in expected utility provided by given data D. We now extend this analysis to the general setting, writing x to denote observed data D.
Proposition 3.15. (Expected utility of data). If preferences are described by a logarithmic score function for the class of probability densities p(w | x) defined on Ω, then the expected increase in utility provided by data x, when the prior probability density is p(w), is given by

A ∫ p(w | x) log {p(w | x)/p(w)} dw,
where p(w | x) is the density of the posterior distribution for w, given x. This expected increase in utility is non-negative, and zero if, and only if, p(w | x) is identical to p(w).

Proof. Using Definition 3.19, the expected increase in utility provided by x is given by

∫ {A log p(w | x) + B(w)} p(w | x) dw − ∫ {A log p(w) + B(w)} p(w | x) dw = A ∫ p(w | x) log {p(w | x)/p(w)} dw,

which, by Proposition 3.14, is non-negative with equality if and only if p(w | x) and p(w) are identical. ⊲
The following natural definition of the amount of information provided by the data extends that given in Definition 2.26.
Definition 3.21. (Information from data). The amount of information about w ∈ Ω provided by data x when the prior density is p(w) is given by

I{x | p_w(·)} = ∫ p(w | x) log {p(w | x)/p(w)} dw,

where p(w | x) is the corresponding posterior density.
As in the finite case, it is interesting to note that the amount of information provided by x is equivalent to the discrepancy measure if the prior is considered as an approximation to the posterior. Alternatively, we see that log p(w) and log p(w | x), respectively, measure how "good", on a logarithmic scale, the prior and the posterior are at "predicting" the "true state of nature" w, so that log p(w | x) − log p(w) is a measure of the usefulness of x were w known to be the true value. Thus, I{x | p_w(·)} is simply the expected value of that utility difference with respect to the posterior density, given x.
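For a conjugate Beta–Bernoulli illustration (not from the text; the prior and data are arbitrary choices), I{x | p_w(·)} is the discrepancy between posterior and prior Beta densities, which is available in closed form via digamma functions:

```python
from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    # Discrepancy of Beta(a2, b2) as an approximation to Beta(a1, b1):
    # integral of Beta(w | a1, b1) log[Beta(w | a1, b1) / Beta(w | a2, b2)] dw.
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

# Uniform Beta(1, 1) prior for the chance w of success in Bernoulli trials.
a0, b0 = 1.0, 1.0

# I{x | p_w} = discrepancy between posterior and prior, for two hypothetical data sets.
info_5 = kl_beta(a0 + 3, b0 + 2, a0, b0)      # 3 successes in 5 trials
info_50 = kl_beta(a0 + 30, b0 + 20, a0, b0)   # 30 successes in 50 trials

# More data yield a more concentrated posterior, hence more information about w.
assert 0 < info_5 < info_50
```

The closed form follows from Definition 3.20 by direct integration, using ∫ Beta(w | a, b) log w dw = ψ(a) − ψ(a + b).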
The functional

∫ p(w) log p(w) dw

has been used (see, e.g., Lindley, 1956, and references therein) as a measure of the 'absolute' information about w contained in the probability density p(w). The increase in utility from observing x is then

∫ p(w | x) log p(w | x) dw − ∫ p(w) log p(w) dw,

instead of our Definition 3.21. However, this expression is not invariant under one-to-one transformations of w, a property which seems to us to be essential. Note, however, that both expressions have the same expectation with respect to the distribution of x. Draper and Guttman (1969) put forward yet another non-invariant definition of information.
Additional references on statistical information concepts are Renyi (1963, 1966, 1967), Goel and DeGroot (1979) and De Waal and Groenewald (1989). More generally, we may wish to step back to the situation before data become available, and consider the idea of the amount of information to be expected from an experiment e. We therefore generalise Definition 2.27.
Definition 3.22. (Expected information from an experiment). The expected information to be provided by an experiment e about w ∈ Ω, when the prior density is p(w), is given by

I{e | p_w(·)} = ∫ I{x | p_w(·)} p(x | e) dx.
The following result, which is a generalisation of Proposition 2.32, provides an alternative expression for I{e | p_w(·)}.
Proposition 3.16. An alternative expression for the expected information is

I{e | p_w(·)} = ∫∫ p(w, x | e) log [ p(w, x | e) / {p(w) p(x | e)} ] dw dx,
where p(w, x | e) = p(w | x, e) p(x | e) and p(w | x, e) is the posterior density for w given data x and prior density p(w). Moreover, I{e | p_w(·)} ≥ 0, with equality if and only if x and w are independent random quantities, so that p(w, x | e) = p(w) p(x | e) for all w and x.
Proof. By Definition 3.22,

I{e | p_w(·)} = ∫ p(x | e) ∫ p(w | x, e) log {p(w | x, e)/p(w)} dw dx,

and the result now follows from the fact that p(w, x | e) = p(w | x, e) p(x | e). Moreover, since, by Proposition 3.14, I{x | p_w(·)} ≥ 0 with equality if and only if p(w | x, e) = p(w), it follows from Definition 3.22 that I{e | p_w(·)} ≥ 0 with equality if and only if, for all w and x, p(w, x | e) = p(w) p(x | e). ⊲
Maximisation of the expected Shannon information was proposed by Lindley (1956) as a "reasonable" ad hoc criterion for choosing among alternative experiments. Fedorov (1972) later proved that certain classical design criteria (in particular, D-optimality) are special cases of this when normal distributions are assumed. We have shown that maximising expected information is just a particular (albeit important) case of the general criterion, implied by quantitative coherence, of maximising expected utility in the case of pure inference problems. See Polson (1992) for a closely related argument.
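Lindley's criterion can be sketched for a simple hypothetical design question (an illustration, not from the text): how much information, in expectation, do n Bernoulli trials provide about their parameter under a uniform prior? The sketch averages the posterior-versus-prior discrepancy over the beta-binomial predictive distribution of the outcomes:

```python
import numpy as np
from scipy.special import betaln, comb, digamma

def kl_beta(a1, b1, a2, b2):
    # Discrepancy between Beta(a1, b1) and Beta(a2, b2), in closed form.
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def expected_information(n, a0=1.0, b0=1.0):
    # I{e | p_w} = sum over outcomes s of p(s | e) I{s | p_w}, where p(s | e) is the
    # beta-binomial predictive probability of s successes in n Bernoulli trials.
    total = 0.0
    for s in range(n + 1):
        log_pred = (np.log(comb(n, s))
                    + betaln(a0 + s, b0 + n - s) - betaln(a0, b0))
        total += np.exp(log_pred) * kl_beta(a0 + s, b0 + n - s, a0, b0)
    return total

# Expected information grows with the size of the experiment.
i2, i10, i50 = expected_information(2), expected_information(10), expected_information(50)
assert 0.0 < i2 < i10 < i50
```

Under this criterion one would prefer the larger experiment; the ties to classical D-optimality mentioned above arise only under normality assumptions.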
It follows from Proposition 2.31 and the last remark that someone who adopts the classical D-optimality criterion of optimal design under standard normality assumptions should, for consistency, have preferences which are described by a logarithmic scoring rule; otherwise, such designs are not optimal with respect to his or her underlying preferences.
There is a considerable literature on the Bayesian design of experiments, which we will not attempt to review here. A detailed discussion will be given in the volume Bayesian Methods. We note that important references include Blackwell (1951, 1953), Lindley (1956), Chernoff (1959), Stone (1959), DeGroot (1962, 1970), Duncan and DeGroot (1976), Bandemer (1977), Smith and Verdinelli (1980), Pilz (1983/1991), Chaloner (1984), Sugden (1985), Mazloum and Meeden (1987), Felsenstein (1988, 1992), DasGupta and Studden (1991), El-Krunz and Studden (1991), Pardo et al. (1991), Mitchell and Morris (1992), Pham-Gia and Turkkan (1992), Verdinelli (1992), Verdinelli and Kadane (1992), Lindley and Deely (1993), Lad and Deely (1994) and Parmigiani and Berry (1994).
3.5 DISCUSSION AND FURTHER REFERENCES

3.5.1 The Role of Mathematics
The translation of any substantive theory into a precise mathematical formalism necessarily involves an element of idealisation. We have already had occasion to remark on aspects of this problem in Chapter 2, in the context of using real numbers rather than subsets of the rationals to represent actual measurements (necessarily "finitised" by inherent accuracy limits of the uncertainty apparatus). Similar remarks are obviously called for in the context of using, for example, probability densities to represent belief distributions for real-valued observables. In some situations, as we shall see in Chapter 4, the adoption of specific forms of density may follow from simple, structural assumptions about the form of the belief distribution. In other situations, however, if we really try to think of such a density as being practically identified by expressions of preference among, say, standard options, we would encounter the obvious operational problem that, implicitly, an infinite number of revealed preferences would be required. Clearly, in such situations the precise mathematical form of a density is likely to have arisen as an approximation to a "rough shape" obtained from some finite elicitation or observation process, and has been chosen, arbitrarily, for reasons of mathematical convenience, from an available mathematical toolkit. Similar remarks apply to the choice, for descriptive or mathematical convenience, of infinite sets to represent consequences or decisions, with the attendant problems of defining appropriate concepts of expected utility. There are obvious dangers, therefore, in accepting too uncritically any orientation, or would-be insightful mathematical analysis, that flows from arbitrary, idealised mathematical inputs into the general quantitative coherence theory. However, given an awareness of the dangers involved, we can still systematically make use of the power and elegance of the (idealised) mathematics by simultaneously asserting, as a central tenet of our approach, a concern with the robustness and sensitivity of the output of an analysis to the form of input assumed (see Section 5.6.3). Of course, we shall later have to make precise the sense in which these terms are to be interpreted and the actual forms of procedures to be adopted. That being
understood, our approach, as with the earlier formalism of Chapter 2, will be to work with the mathematical idealisation, in order to exploit its potential power and insight, while constantly bearing in mind the need for a large pinch of salt and a repertoire of sensitivity diagnostics.

3.5.2 Critical Issues

We shall comment further on four aspects of the general mathematical structure we have developed and will be using throughout the remainder of this volume. These will be dealt with under the following subheadings: (i) Finite versus Countable Additivity; (ii) Measure versus Linear Theory; (iii) Proper versus Improper Probabilities; (iv) Abstract versus Concrete Mathematics.
Finite versus Countable Additivity

In Chapter 2, we developed, from a directly intuitive and operational perspective, a minimal mathematical framework for a theory of quantitative coherence. The role of the mathematics employed in this development was simply that of a tool to capture the essentials of the substantive concepts and theory; within the resulting finitistic framework we then established that uncertainties should be represented in terms of finitely additive probabilities. The generalisations and extensions of the theory given in the present chapter lead, instead, to the mathematical framework of countable additivity, within which we have available the full panoply of analytic tools from mathematical probability theory. The latter is clearly highly desirable from the point of view of mathematical convenience, but it is important to pause and consider whether the development of a more convenient mathematical framework has been achieved at the expense of a distortion of the basic concepts and ideas. First, let us emphasise that, from a philosophical perspective, the monotone continuity postulate introduced in Section 3.1.2 does not have the fundamental status of the axioms presented in Chapter 2. We regard the latter as encapsulating the essence of what is required for a theory of quantitative coherence. The former is an "optional extra" assumption that one might be comfortable with in specific contexts, but should in no way be obliged to accept as a prerequisite for quantitative coherence. Secondly, we note that the effect of accepting that preferences should conform to the monotone continuity postulate is to restrict one's available (in the sense of coherent) belief specifications to a subset of the finitely additive uncertainty measures; namely, those that are also countably additive.
This is, of course, potentially disturbing from a subjectivist perspective, since a key feature of the theory is that the only constraints on belief specifications should be that they are coherent. For some such representations to be ruled out a priori, as a consequence of a postulate adopted purely for mathematical convenience, would indeed be a distortion of the
theory. This is why we regard such a postulate as different from the basic axioms. However, provided one is aware of, and not concerned about, the implicit restriction of the available belief representations, its adoption may be very natural in contexts where one is, in any case, prepared to work in an extended mathematical setting. Throughout this work, we shall, in fact, make systematic use of concepts and tools from mathematical probability theory, without further concern or debate about this issue. However, to underline what we already said in Section 3.1.2, it is important to be on guard and to be aware that distortions might occur. To this end, we draw attention to some key references to which the reader may wish to refer in order to heighten such awareness and to study in detail the issues involved. De Finetti (1970/1974, pp. 116–133, 173–177 and 228–241; 1970/1975, pp. 267–276 and 340–361) provides a wealth of detailed analysis, illustration and comment on the issues surrounding finite versus countable (and other) additivity assumptions, his own analysis being motivated throughout by the guiding principle that

. . . mathematics is an instrument which should conform itself strictly to the exigencies of the field in which it is to be applied. (1970/1974, p. 3)
Further technical and philosophical discussion is given in de Finetti (1972, Chapters 5 and 6); see, also, Stone (1986). Systematic use of finite additivity in decision-related contexts is exemplified in Dubins and Savage (1965/1976), Heath and Sudderth (1978, 1989), Stone (1979b), Hill (1980), Sudderth (1980), Seidenfeld and Schervish (1983), Hill and Lane (1984), Regazzini (1987) and Regazzini and Petris (1993). A discussion of the statistical implications of finitely additive probability is given by Kadane et al. (1986). In Section 2.8.3 we discussed, within a finitistic framework, several "betting" approaches to establishing probability as the only coherent measure of degree of belief. These ideas may be extended to the general case. Dawid and Stone (1972, 1973) introduce the concept of "expectation consistency", and show the necessity of using Bayes' theorem to construct probability distributions corresponding to fair bets made with additional information. Other generalised discussions on coherence of inference in terms of gambling systems include Lane and Sudderth (1983) and Brunk (1991).
Measure versus Linear Theory
Mathematical probability theory can be developed, equivalently, starting either from the usual Kolmogorov axioms for a set function defined over a σ-field of events (see, for example, Ash, 1972), or from axioms for a linear operator defined over a linear space of random quantities (see, for example, Whittle, 1976). The former deals directly with probability measure; the latter with an expectation operator (or a prevision, in de Finetti's terminology).
In our development of a quantitative coherence theory, the axiomatic approach to preferences among options has led us more naturally towards probability measures as the primary probabilistic element, with expectation (prevision) defined subsequently. In the approach to coherence put forward in de Finetti (1972, 1970/1974, 1970/1975), prevision is the primary element, with probability subsequently emerging as a special case for 0–1 random quantities. The case for adopting the linear rather than the measure theory approach is argued at length by de Finetti, there being many points of contact with the argument regarding finite versus countable additivity, particularly the need to avoid, in the mathematical formulation, going beyond those aspects required for the problem in hand. In the specific context of statistical modelling and inference, Goldstein (1981, 1986a, 1986b, 1987a, 1987b, 1988, 1991, 1994) has systematically developed the linear approach advocated by de Finetti, showing that a version of a subjectivist programme for revising beliefs in the light of data can be implemented without recourse to the full probabilistic machinery developed in this chapter. Lad et al. (1990) provide further discussion on the concept of prevision. We view these and related developments with great interest and with no dogmatic opinion concerning the ultimate relative usefulness and acceptance of "linear" versus "probabilistic" Bayesian statistical concepts and methods. That said, the present volume is motivated by our conviction that, currently, there remains a need for a detailed exposition of the Bayesian approach within the, more or less, conventional framework of full probabilistic descriptions.
Proper versus Improper Probabilities

Whether viewed in terms of finite or countable additivity, we have taken probability to be a measure with values in the interval [0, 1]. However, it is possible to adopt axiomatic approaches which allow for infinite (or improper) probabilities: see, for example, Renyi (1955, 1962/1970, Chapter 2, and references therein), who uses conditional arguments to derive proper probabilities from improper distributions, and Hartigan (1983, Chapter 3), who directly provides an axiomatic foundation for improper or, as he terms them, non-unitary, probabilities. We shall not review such axiomatic theories in detail, but note that we shall encounter improper distributions systematically in Section 5.4.

Abstract versus Concrete Mathematics

When probabilistic mathematics is being used as a tool for the representation and analysis of substantive non-mathematical problems, rather than as a direct mathematical concern in its own right, there is always a dilemma regarding the appropriate level of mathematics to be used. Specifically, there are basic decisions to be made about how much measure-theoretical machinery should be invoked. The introduction of too much abstract mathematics can easily make the substantive content seem totally opaque to the very reader at whom it is most aimed. On the other hand, too
little machinery may prove inadequate to provide a complete mathematical treatment, requiring the omission of certain topics, or the provision of just a partial, non-rigorous treatment, with insight and illustration attempted only by concrete examples. Thus far, we have tried to provide a complete, rigorous treatment of the Foundations and Generalisations of the theory of quantitative coherence, within the mathematical framework of Chapters 2 and 3. This chapter essentially defines the upper limit of mathematical machinery we shall be using and, in fact, most of our subsequent development will be much more straightforward. However, it will be the case, for example in Chapter 4, that some results of interest to us require rather more sophisticated mathematical tools than we have made available. Our response to this problem will be to try to make it clear to the reader when this is the case, and to provide references to a complete treatment of such results, together with (hopefully) sufficient concrete discussion and illustration to illuminate the topic. For more sophisticated mathematical treatments of Bayesian theory, the reader is referred to Hartigan (1983) and Florens et al. (1990).
Chapter 4
Modelling
Summary

The relationship between beliefs about observable random quantities and their representation using conventional forms of statistical models is investigated. It is shown that judgements of exchangeability lead to representations that justify and clarify the use and interpretation of such familiar concepts as parameters, random samples, likelihoods and prior distributions. Beliefs which have certain additional invariance properties are shown to lead to representations involving familiar specific forms of parametric distributions, such as normals and exponentials. The concept of a sufficient statistic is introduced and related to representations involving the exponential family of distributions. Various forms of partial exchangeability judgements about data structures involving several samples, structured layouts, covariates and designed experiments are investigated, and links established with a number of other commonly used statistical models.
4.1 STATISTICAL MODELS

4.1.1 Beliefs and Models
The subjectivist, operationalist viewpoint has led us to the conclusion that, if we aspire to quantitative coherence, individual degrees of belief, expressed as probabilities, are inescapably the starting point for descriptions of uncertainty. There can be no theories without theoreticians; no learning without learners; in general, no science without scientists. It follows that learning processes, whatever their particular concerns and fashions at any given point in time, are necessarily reasoning processes which take place in the minds of individuals. To be sure, the object of attention and interest may well be an assumed external, objective reality: but the actuality of the learning process consists in the evolution of individual, subjective beliefs about that reality. However, it is important to emphasise, as in our earlier discussion in Section 2.8, that the primitive and fundamental notions of individual preference and belief will typically provide the starting point for interpersonal communication and reporting processes. In what follows, both here, and more particularly in Chapter 5, we shall therefore often be concerned to identify and examine features of the individual learning process which relate to interpersonal issues, such as the conditions under which an approximate consensus of beliefs might occur in a population of individuals.

In Chapters 2 and 3, we established a very general foundational framework for the study of degrees of belief and their evolution in the light of new information. We now turn to the detailed development of these ideas for the broad class of problems of primary interest to statisticians, namely, those where the events of interest are defined explicitly in terms of random quantities, x₁, x₂, . . . (discrete or continuous, and possibly vector-valued), representing observed or experimental data. In such cases, we shall assume that an individual's degrees of belief for events of interest are derived from the specification of a joint distribution function P(x₁, . . . , xₙ), which we shall typically assume, throughout, to be representable in terms of a joint density function p(x₁, . . . , xₙ) (to be understood as a mass function in the discrete case). We recall that we shall use notation such as P and p in a generic sense, rather than as specifying particular functions, and sometimes refer to implied distribution functions; P may sometimes refer to an underlying probability measure, without systematic reference to measure-theoretic niceties. Such usage avoids notational proliferation, and the context will always ensure that there is no confusion of meaning. Of course, any such specification implicitly defines a number of other degrees of belief specifications of possible interest: for example, for 1 ≤ m < n,

p(x₁, . . . , xₘ) = ∫ p(x₁, . . . , xₙ) dxₘ₊₁ · · · dxₙ

provides the marginal joint density for x₁, . . . , xₘ. Similarly, we may write

p(xₘ₊₁, . . . , xₙ | x₁, . . . , xₘ) = p(x₁, . . . , xₙ) / p(x₁, . . . , xₘ),

which simply indicates that the conditional density for xₘ₊₁, . . . , xₙ, given observed values of x₁, . . . , xₘ, is given by the ratio of the specified joint densities. Within the Bayesian framework, this latter conditional form is the key to "learning from experience".
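These marginalisation and conditioning operations can be sketched concretely for a tiny discrete joint distribution (an illustration only; the numbers are arbitrary):

```python
import numpy as np

# A hypothetical joint mass function p(x1, x2) over {0, 1} x {0, 1}; rows index x1.
joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])
assert np.isclose(joint.sum(), 1.0)

# Marginal mass function for x1: sum the joint over x2.
p_x1 = joint.sum(axis=1)                  # [0.40, 0.60]

# Conditional for x2 given x1: the ratio of joint to marginal, as in the text.
p_x2_given_x1 = joint / p_x1[:, None]
assert np.allclose(p_x2_given_x1.sum(axis=1), 1.0)

# Conditioning on x1 genuinely revises beliefs about x2: learning from experience.
p_x2 = joint.sum(axis=0)                  # [0.45, 0.55]
assert not np.allclose(p_x2_given_x1[0], p_x2)
```

The same two operations, with sums replaced by integrals, are all that the general formulas above require.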
Thus far, our discussion is rather "abstract".

Definition 4.1. (Predictive probability model). A predictive model for a sequence of random quantities x₁, x₂, . . . is a probability measure P, which mathematically specifies the form of the joint belief distribution for any subset of x₁, x₂, . . . .

In actual applications we shall need to choose specific, concrete forms for joint distributions. We shall therefore need to examine rather closely this process of choosing a specific form of probability measure to represent degrees of belief. This is clearly a somewhat daunting task, since direct contemplation and synthesis of the many complex marginal and conditional judgements implicit in such a specification are almost certainly beyond our capacity in all but very simple situations. In some cases, we shall find that we are able to identify general types of belief structure which "pin down", in some sense, the mathematical representation strategy to be adopted. In other cases, this "formal" approach will not take us very far towards solving the representation problem and we shall have to fall back on rather more pragmatic modelling strategies.

At this stage, however, a word of warning is required. In much statistical writing, the starting point for formal analysis is the assumption of a mathematical model form, typically involving "unknown parameters", the main object of the study being to infer something about the values of these parameters. From our perspective, this is all somewhat premature and mysterious! We are seeking to represent degrees of belief about observables: nothing in our previous development justifies or gives any insight into the choice of particular "models", and thus far we have no way of attaching any operational meaning to the "parameters" which appear in conventional models. However, as we shall soon see, the subjectivist, operationalist approach will provide considerable insight into the nature and status of these conventional assumptions.

4.2 EXCHANGEABILITY AND RELATED CONCEPTS

4.2.1 Dependence and Independence

Consider a sequence of random quantities x₁, x₂, . . ., and suppose that a predictive model is assumed which specifies that, for all n, the joint density can be written in
the form

p(x₁, . . . , xₙ) = ∏ᵢ₌₁ⁿ p(xᵢ),

so that the xᵢ are independent random quantities. It then follows straightforwardly that, for any 1 ≤ m < n,

p(xₘ₊₁, . . . , xₙ | x₁, . . . , xₘ) = p(xₘ₊₁, . . . , xₙ),

so that no learning from experience can take place within this sequence of observations. In other words, past data provide us with no additional information about the possible outcomes of future observations in the sequence. A predictive model specifying such an independence structure is clearly inappropriate in contexts where we believe that the successive accumulation of data will provide increasing information about future events. In such cases, the structure of the joint density p(x₁, . . . , xₙ) must encapsulate some form of dependence among the individual random quantities. In general, there are a vast number of possible subjective assumptions about the form such dependencies might take and there can be no all-embracing theoretical discussion. Instead, what we can do is to concentrate on some particular simple forms of judgement about dependence structures which might correspond to actual judgements of individuals in certain situations. There is no suggestion that the structures we are going to discuss in subsequent subsections have any special status, or ought to be adopted in most cases. They simply represent forms of judgement which may often be felt to be appropriate and whose detailed analysis provides illuminating insight into the specification and interpretation of certain classes of predictive models.

4.2.2 Exchangeability and Partial Exchangeability

Suppose that, in thinking about P(x₁ ≤ a₁, . . . , xₙ ≤ aₙ), an individual makes the judgement that the subscripts, the "labels" identifying the individual random quantities, are "uninformative", in the sense that he or she would specify all the marginal distributions for the individual random quantities identically, and similarly for all the marginal joint distributions for all possible pairs, triples, etc., of the random quantities. It is easy to see that this implies that the form of the joint distribution must be such that, for any possible permutation π of the subscripts {1, . . . , n},

p(x₁, . . . , xₙ) = p(x_π(1), . . . , x_π(n)).

We formalise this notion of "symmetry" of beliefs for the individual random quantities as follows.
Definition 4.1. (Finite exchangeability). The random quantities x_1, ..., x_n are said to be (finitely) exchangeable under a probability measure P if the implied joint degree of belief distribution satisfies

P(x_1, \ldots, x_n) = P(x_{\pi(1)}, \ldots, x_{\pi(n)})

for all permutations \pi defined on the set \{1, \ldots, n\}. In terms of the corresponding density or mass function, the condition reduces to

p(x_1, \ldots, x_n) = p(x_{\pi(1)}, \ldots, x_{\pi(n)}).

It is easy to see that this implies that all the marginal distributions for the individual random quantities are identical, and similarly that the marginal joint distributions for all possible pairs, triples, etc., coincide. The notion of exchangeability thus involves a judgement of complete symmetry among all the observables x_1, ..., x_n under consideration.

Example 4.1. (Tossing a thumb tack). Consider a sequence of tosses of a standard metal drawing pin (or thumb tack), and let x_i = 1 if the pin lands point uppermost on the ith toss, x_i = 0 otherwise, i = 1, 2, .... If the tosses are performed in such a way that time order appears to be irrelevant and the conditions of the toss appear to be essentially held constant throughout, it would seem to be the case that most observers would judge the outcomes of the sequence of tosses x_1, x_2, ... to be exchangeable, whatever precise quantitative form their beliefs take. In such cases, the exchangeability assumption captures the essence of the idea of a so-called "random sample". This latter notion is, of course, a meaningless phrase thus far within our framework, since it (implicitly) involves the idea of "conditional independence, given the value of the underlying parameter", and "parameter" is, at this stage, of no direct use to us.

Clearly, in many situations such a judgement of complete symmetry might be too restrictive an assumption.

Example 4.1. (cont.). Suppose that the sequence of tosses of a drawing pin are not all made with the same pin, but that the even and odd numbered tosses are made with different pins: an all-metal one for the odd tosses, a plastic-coated one for the even tosses. Alternatively, suppose that the same pin were used throughout, but that the odd tosses are made by a different person, using a completely different tossing mechanism from that used for the even tosses. In such cases, many individuals would retain an exchangeable form of belief distribution within the sequences of odd and even tosses separately, but might be reluctant to make a judgement of symmetry for the combined sequence of tosses. Thus, even though a partial judgement of symmetry is present, exchangeability of the complete sequence may fail.
Example 4.2. (Laboratory measurements). Suppose that x_1, x_2, ... are real-valued measurements of a physical or chemical property of a given substance, all made on the same sample with the same measurement procedure. Under such conditions, many individuals might judge the complete sequence of measurements to be exchangeable. Suppose, however, that sequences of such measurements are combined from l different laboratories, the substance being identical but the measurement procedures varying from laboratory to laboratory. In this case, judgements of exchangeability for each laboratory sequence separately might be appropriate, whereas most individuals would be very reluctant to make a judgement of exchangeability for the entire sequence of results.

Example 4.3. (Physiological responses). Suppose that {x_1, x_2, ...} are real-valued measurements of a specific physiological response in human subjects, spanning a wide age range, when a particular drug is administered. If the drug is administered at more than one dose level and if there are both male and female subjects, then, within each combination of dose-level, sex and appropriately defined age-group, a judgement of exchangeability might be regarded as reasonable, whereas such a judgement for the combined sequence might not be.

Judgements of the kind suggested in the above examples correspond to forms of partial exchangeability. In general, there are many possible forms of departure from overall judgements of exchangeability to those of partial exchangeability, and so a formal definition of the term does not seem appropriate; it simply signifies that there may be additional "labels" on the random quantities (for example, the identification of the laboratory in Example 4.2, or of the tossing mechanism in Example 4.1), with exchangeable judgements made separately for each group of random quantities having the same additional labels. A detailed discussion of various possible forms of partial exchangeability will be given in Section 4.6.

We shall now return to the simple case of exchangeability and examine in detail the form of representation of p(x_1, ..., x_n) which emerges in various special cases. As a preliminary, we shall generalise our previous definition of exchangeability to allow for "potentially infinite" sequences of random quantities. In practice, it should, at least in principle, always be possible to give an upper bound to the number of observables to be considered. However, specifying an actual upper bound may be somewhat difficult or arbitrary, and so, for mathematical and descriptive purposes, it is convenient to be able to proceed as if we were contemplating an infinite sequence of potential observables. Of course, it will be important to establish that working within the infinite framework does not cause any fundamental conceptual distortion. These and related issues of finite versus infinite exchangeability will be considered in more detail in Section 4.7.
Definition 4.2. (Infinite exchangeability). The infinite sequence of random quantities x_1, x_2, ... is said to be judged (infinitely) exchangeable if every finite subsequence is judged exchangeable in the sense of Definition 4.1.

One might be tempted to wonder whether every finite sequence of exchangeable random quantities could be embedded in, or extended to, an infinitely exchangeable sequence of similarly defined random quantities. However, this is certainly not the case, as the following example shows.

Example 4.4. (Non-extendible exchangeability). Suppose that we define the three random quantities x_1, x_2, x_3, taking only the values 0 and 1, with joint probability function given by

p(x_1 = 0, x_2 = 1, x_3 = 1) = p(x_1 = 1, x_2 = 0, x_3 = 1) = p(x_1 = 1, x_2 = 1, x_3 = 0) = 1/3,

with all other combinations of x_1, x_2, x_3 having probability zero. Clearly, x_1, x_2, x_3 are exchangeable. Suppose now that there were a further 0-1 random quantity x_4 such that x_1, x_2, x_3, x_4 were exchangeable. By exchangeability, the probability of any particular outcome of (x_1, x_2, x_3, x_4) could depend only on the number of 1's it contains, say w_j for an outcome containing j 1's. Marginalising over x_4, a particular outcome of (x_1, x_2, x_3) with k 1's would then have probability w_k + w_{k+1}, so that the specification above requires

w_0 + w_1 = 0, \quad w_1 + w_2 = 0, \quad w_3 + w_4 = 0, \quad w_2 + w_3 = 1/3.

Since the w_j are non-negative, the first three equations force w_0 = w_1 = w_2 = w_3 = w_4 = 0, contradicting w_2 + w_3 = 1/3. It follows that a finitely exchangeable sequence cannot even necessarily be embedded in a larger finitely exchangeable sequence, let alone an infinitely exchangeable sequence.
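The counting argument in Example 4.4 can be checked mechanically. The following Python sketch (ours, not part of the original text) verifies that the stated joint mass function is exchangeable, and that the constraints on any exchangeable four-term extension are infeasible:

```python
from fractions import Fraction
from itertools import permutations, product

# Joint mass function of Example 4.4: each of the three 0-1 patterns with
# exactly two 1's has probability 1/3; everything else has probability 0.
p3 = {pattern: (Fraction(1, 3) if sum(pattern) == 2 else Fraction(0))
      for pattern in product([0, 1], repeat=3)}

# Exchangeability: the mass function is invariant under every relabelling.
for pattern in p3:
    for perm in permutations(range(3)):
        assert p3[tuple(pattern[i] for i in perm)] == p3[pattern]

# An exchangeable extension to (x1, x2, x3, x4) would assign every 4-pattern
# with j ones the same probability w[j]; marginalising over x4, a 3-pattern
# with k ones then has mass w[k] + w[k+1].  The constraints w0+w1 = 0,
# w1+w2 = 0, w3+w4 = 0 plus non-negativity force every w[j] to be zero,
# which contradicts the remaining constraint w2 + w3 = 1/3.
w = [Fraction(0)] * 5                  # the only non-negative solution
assert w[2] + w[3] != Fraction(1, 3)   # ...and it violates w2 + w3 = 1/3
print("No exchangeable extension to four 0-1 quantities exists.")
```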
4.3 MODELS VIA EXCHANGEABILITY

4.3.1 The Bernoulli and Binomial Models

We consider first the case of an infinitely exchangeable sequence of 0-1 random quantities, for which we shall derive a representation result for the joint mass function.

Proposition 4.1. (Representation theorem for 0-1 random quantities). If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random quantities with probability measure P, there exists a distribution function Q such that the joint mass function p(x_1, ..., x_n) has the form

p(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta),

where

Q(\theta) = \lim_{N \to \infty} P(y_N / N \le \theta), \qquad y_N = x_1 + \cdots + x_N,

and \theta = \lim_{N \to \infty} y_N / N.

Proof. (De Finetti, 1930, 1937/1964; see also Barlow, 1991. Here we follow closely the proof given by Heath and Sudderth, 1976.) Consider n < N and, for x_i = 0 or x_i = 1, i = 1, ..., n, let y_n = x_1 + \cdots + x_n. With the summations below taken over the range y_n \le y_N \le N - (n - y_n), we see that, by exchangeability,

p(x_1, \ldots, x_n) = \sum_{y_N} p(x_1, \ldots, x_n \,|\, y_N) \, p(y_N),

where p(x_1, ..., x_n | y_N) corresponds to the hypergeometric distribution of Section 3.2.2: intuitively, given y_N, it is as if we can imagine sampling n items without replacement from an urn of N items containing y_N 1's and N - y_N 0's.
If we now define Q_N(\theta) on \Re to be the step function which is 0 for \theta < 0 and has jumps of p(y_N = N\theta) at \theta = 0, 1/N, 2/N, \ldots, 1, the above sum can be written as an integral with respect to Q_N, and the hypergeometric terms converge, uniformly in \theta, to \theta^{y_n}(1 - \theta)^{n - y_n} as N \to \infty. By Helly's theorem (see, for example, Ash, 1972, Section 8.2), there exists a subsequence Q_{N_1}, Q_{N_2}, \ldots such that \lim_{j \to \infty} Q_{N_j} = Q for some distribution function Q, and the result follows.

In more conventional notation and language, it is as if the x_1, ..., x_n are a random sample from a Bernoulli distribution with parameter \theta (see Section 3.2.2), that is, judged to be independent conditional on a random quantity \theta, so that

p(x_1, \ldots, x_n) = \int_0^1 \prod_{i=1}^{n} p(x_i \,|\, \theta) \, dQ(\theta),

where \theta is itself assigned a probability distribution Q. Moreover, by the strong law of large numbers, \theta = \lim_{n \to \infty} (y_n / n), so that Q may be interpreted as "beliefs about the limiting relative frequency of 1's". Thought of as a function of \theta, the joint sampling distribution \prod_i p(x_i | \theta) will be referred to as the likelihood function.

The assumption of exchangeability for the infinite sequence of 0-1 random quantities x_1, x_2, ... therefore places a strict limitation on the family of probability measures P which can serve as predictive probability models for the sequence: any such P must correspond to the mixture form given in Proposition 4.1, for some choice of prior distribution Q(\theta). The operational content of this prior distribution derives from the fact that it is as if we are assessing beliefs about what we would anticipate observing as the limiting relative frequency from a "very large number" of observations. As we range over all possible choices of this latter distribution, we generate all possible predictive probability models compatible with the assumption of infinite exchangeability for the 0-1 random quantities.
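The mixture form of Proposition 4.1 can be illustrated concretely (a sketch of ours, not part of the original text): taking Q to be the uniform distribution on [0, 1], the integral evaluates exactly to \int_0^1 \theta^s (1-\theta)^{n-s} d\theta = s!(n-s)!/(n+1)!, where s is the number of 1's, so the implied joint mass function depends on x_1, ..., x_n only through s, which is precisely the exchangeability the theorem characterises.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def p_joint(xs):
    # p(x1,...,xn) = ∫_0^1 θ^s (1-θ)^(n-s) dθ = s!(n-s)!/(n+1)!
    # for a uniform (Beta(1,1)) mixing distribution Q on θ.
    n, s = len(xs), sum(xs)
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))

n = 4
# The joint mass function depends only on the number of 1's ...
for xs in product([0, 1], repeat=n):
    assert p_joint(xs) == p_joint(tuple(sorted(xs)))
# ... and it is a genuine probability distribution over {0,1}^n.
assert sum(p_joint(xs) for xs in product([0, 1], repeat=n)) == 1
print("uniform-mixture predictive model is exchangeable")
```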
The interpretation of this representation theorem is of profound significance from the point of view of subjectivist modelling philosophy.
The likelihood is defined in terms of an assumption of conditional independence of the observations given a parameter; the latter, and its associated prior distribution, acquire an operational interpretation in terms of a limiting average of observables (in this case, a limiting frequency). Thus, "at a stroke", we establish a justification for the conventional model building procedure of combining a likelihood and a prior. The formal learning process for models such as this will be developed systematically and generally in Chapter 5. However, even this simple example provides considerable insight into the learning process, in which the key step is a straightforward consequence of the representation theorem.

In many applications involving 0-1 random quantities, we may be more interested in a summary random quantity, such as y_n = x_1 + \cdots + x_n, than in the individual sequences of x_i's.

Corollary 1. Given the conditions of Proposition 4.1, the mass function for y_n = x_1 + \cdots + x_n has the form

p(y_n) = \int_0^1 \mathrm{Bi}(y_n \,|\, \theta, n) \, dQ(\theta),

where

\mathrm{Bi}(y_n \,|\, \theta, n) = \binom{n}{y_n} \theta^{y_n} (1 - \theta)^{n - y_n}.

Proof. This follows immediately from Proposition 4.1 and the fact that there are \binom{n}{y_n} distinct sequences x_1, ..., x_n with x_1 + \cdots + x_n = y_n.

This provides a justification for acting as if we have a binomial likelihood, with a prior distribution Q(\theta) for the binomial parameter \theta.

The representation of p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) is straightforwardly obtained from Proposition 4.1.

Corollary 2. Given the conditions of Proposition 4.1, the conditional probability p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) of x_{n+1}, \ldots, x_{n+m} given x_1, \ldots, x_n has the form

\int_0^1 \prod_{i=n+1}^{n+m} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta \,|\, x_1, \ldots, x_n),
where

dQ(\theta \,|\, x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta)}{\int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta)}.

Proof. Since

p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_{n+m})}{p(x_1, \ldots, x_n)},

the result follows by applying Proposition 4.1 to both p(x_1, \ldots, x_{n+m}) and p(x_1, \ldots, x_n) and rearranging the resulting expression.

We thus see that the basic form of the representation of beliefs does not change. All that has happened, expressed in conventional terminology, is that the prior distribution Q(\theta) for \theta has been revised, via Bayes' theorem, into the posterior distribution Q(\theta \,|\, x_1, \ldots, x_n). The conditional probability function p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) is called the (conditional, or posterior) predictive probability function for x_{n+1}, \ldots, x_{n+m} given x_1, \ldots, x_n. Clearly, Proposition 4.1 and its Corollary 2 also provide the basis for deriving the conditional predictive distribution of any other random quantity defined in terms of the future observations. A particularly important random quantity defined in terms of future observations is the frequency of 1's in a large sample, y_m / m for large m, given x_1, \ldots, x_n: since \theta = \lim_{m \to \infty} y_m / m, a posterior distribution for a parameter is seen to be a limiting case of a posterior (conditional) predictive distribution for an observable.
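As a concrete sketch of the updating in Corollary 2 (ours, not part of the original text): if Q is taken to be a Beta(a, b) distribution, the posterior Q(\theta | x_1, ..., x_n) is Beta(a + s, b + n - s), where s = x_1 + \cdots + x_n, and the one-step-ahead predictive probability is P(x_{n+1} = 1 | x_1, ..., x_n) = (a + s)/(a + b + n). With a uniform prior (a = b = 1) this is Laplace's rule of succession, (s + 1)/(n + 2).

```python
from fractions import Fraction

def posterior_and_predictive(a, b, data):
    # Beta(a, b) prior on θ; data is a 0-1 sequence.
    # Conjugate update: posterior is Beta(a + s, b + n - s),
    # and P(x_{n+1} = 1 | data) = (a + s) / (a + b + n).
    n, s = len(data), sum(data)
    post = (a + s, b + n - s)
    pred = Fraction(a + s, a + b + n)
    return post, pred

data = [1, 0, 1, 1, 0, 1, 1]            # s = 5 ones in n = 7 observations
post, pred = posterior_and_predictive(1, 1, data)
assert post == (6, 3)                   # Beta(6, 3) posterior
assert pred == Fraction(6, 9)           # Laplace rule: (5 + 1)/(7 + 2)
print("posterior Beta%s, predictive %s" % (post, pred))
```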
4.3.2 The Multinomial Model

An alternative way of viewing the 0-1 random quantities discussed in Section 4.3.1 is as defining category membership (given two exclusive and exhaustive categories), in the sense that x_i = 1 signifies that the ith observation belongs to category 1 and x_i = 0 signifies membership of category 2. We can extend this idea in an obvious way by considering k-dimensional random vectors x_i, whose jth component, x_{ij}, takes the value 1 to indicate membership of the jth of k + 1 categories. At most one of the k components can take the value 1; if they all take the value 0, this signifies membership of the (k + 1)th category. In what follows, we shall refer to such x_1, x_2, ... as "0-1 random vectors".

If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random vectors, we can extend Proposition 4.1 in an obvious way.

Proposition 4.2. (Representation theorem for 0-1 random vectors). If x_1, x_2, ... is an infinitely exchangeable sequence of 0-1 random vectors with probability measure P, there exists a distribution function Q such that the joint mass function p(x_1, ..., x_n) has the form

p(x_1, \ldots, x_n) = \int_\Theta \prod_{i=1}^{n} \prod_{j=1}^{k} \theta_j^{x_{ij}} \, \theta_{k+1}^{\,1 - x_{i1} - \cdots - x_{ik}} \, dQ(\theta),

where \theta = (\theta_1, \ldots, \theta_k), \theta_{k+1} = 1 - \theta_1 - \cdots - \theta_k, and \Theta = \{\theta : \theta_j \ge 0, \ \theta_1 + \cdots + \theta_k \le 1\}.

Proof. This is a straightforward, albeit algebraically cumbersome, generalisation of the proof of Proposition 4.1.

As in the previous case, we are often most interested in the summary random vector y_n, whose jth component, y_{nj} = x_{1j} + \cdots + x_{nj}, corresponds to the total number of occurrences of category j in the n observations. We shall give the representation of p(y_n), generalising Corollary 1 to Proposition 4.1, and then comment on the interpretation of these results.
Corollary. Given the conditions of Proposition 4.2, the joint mass function p(y_{n1}, \ldots, y_{nk}) may be represented as

p(y_{n1}, \ldots, y_{nk}) = \int_\Theta \mathrm{Mu}_k(y_n \,|\, \theta, n) \, dQ(\theta),

where, conditional on \theta, y_n has a multinomial distribution with probability function

\mathrm{Mu}_k(y_n \,|\, \theta, n) = \frac{n!}{y_{n1}! \, y_{n2}! \cdots y_{nk}! \, (n - y_{n1} - \cdots - y_{nk})!} \prod_{j=1}^{k} \theta_j^{y_{nj}} \, \theta_{k+1}^{\,n - y_{n1} - \cdots - y_{nk}}.

Proof. This follows immediately from the generalisation of the argument used in proving Corollary 1 to Proposition 4.1.

Thus, we see in Proposition 4.2 that it is as if we have a likelihood corresponding to the joint sampling distribution of a random sample x_1, ..., x_n from a multinomial distribution, together with a prior distribution Q over the multinomial parameter \theta, where the components \theta_j of the latter can be thought of as the limiting relative frequencies of membership of each of the categories: in short, it is as if we assume a multinomial likelihood, with a prior Q(\theta) for \theta.

4.3.3 The General Model

We now consider the case of an infinitely exchangeable sequence of real-valued random quantities x_1, x_2, .... As one might expect, the mathematical technicalities of establishing a representation theorem in the real-valued case are somewhat more complicated than in the 0-1 cases, and a rigorous treatment involves the use of measure-theoretic tools beyond the general mathematical level at which this volume is aimed. For this reason, we shall content ourselves with providing an outline proof of a form of the representation theorem, having no pretence at mathematical rigour but, hopefully, providing some intuitive insight into the result, as well as the key ideas underlying a form of proper proof. In Section 4.8, we shall give detailed references to the literature on representation theorems.

Proposition 4.3. (General representation theorem). If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities with probability measure P, there exists a probability measure Q over \mathcal{F}, the space of all distribution functions on \Re, such that the joint distribution function of x_1, ..., x_n has the form

P(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i) \, dQ(F),

where Q(F) = \lim_{n \to \infty} P(F_n), F_n being the empirical distribution function defined by x_1, ..., x_n, and the limit being understood in an appropriate (weak convergence) sense.
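By analogy with the Beta-Bernoulli case, the multinomial corollary above admits a simple conjugate sketch (ours, not part of the original text; the Dirichlet family is a standard conjugate choice, not derived in this passage): with a Dirichlet(\alpha_1, \ldots, \alpha_{k+1}) prior for \theta, the predictive probability that the next observation falls in category j is (\alpha_j + y_{nj}) / (\sum_l \alpha_l + n).

```python
from fractions import Fraction

def dirichlet_predictive(alpha, counts):
    # alpha: Dirichlet prior parameters over the k+1 categories (assumption);
    # counts: observed category totals y_nj over the same categories.
    # Predictive for the next observation: (alpha_j + y_nj) / (sum(alpha) + n).
    n = sum(counts)
    total = sum(alpha) + n
    return [Fraction(a + y, total) for a, y in zip(alpha, counts)]

alpha = [1, 1, 1]              # uniform Dirichlet over 3 categories
counts = [5, 3, 2]             # n = 10 observations
pred = dirichlet_predictive(alpha, counts)
assert sum(pred) == 1          # a genuine probability distribution
assert pred[0] == Fraction(6, 13)
print("predictive:", pred)
```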
F. . tt x.and hence the random quantity The righthand side tends to zero as . < for all i ..r) ( .r) in probability to some random quantity.. which implies that tends  in probability as .. To see this. it then follows that . 1978/ 19138).) = P ( J . o. .178 Ourline proof. .r) and E ( Z . j .. Since 4 Modelling we have. by exchangeability.rl < . r . r ~ .r). . r l ) n . F(.r.V x . .. in place of and noting that I = I. (See Chow and Teicher. for fixed 11. say. ) ] . by exchangeability.(. 4 r For il: > t t .I*)]. denote positive integers and set  . < .IT. Suppose we now let (1 I . I J ) = I'[(. < .=I I' and ' = I ( n ) = I[(. .l 5 .. writing I.. n ( .. . we have ' Note also that E(I. A straightforward count of the numbers of terms involved in the summations then gives the required result.
However, as N \to \infty, F_N(a_j) tends in probability to F(a_j) for each j, and so, recalling the convergence established above, we see that, as N \to \infty,

P(x_1 \le a_1, \ldots, x_m \le a_m) \to \int_{\mathcal{F}} F(a_1) \cdots F(a_m) \, dQ(F),

where, by exchangeability, Q(F) = \lim_{N \to \infty} P(F_N).

The general form of representation for real-valued exchangeable random quantities is therefore as if we have independent observations x_1, ..., x_n conditional on F, an unknown (i.e., random) distribution function (which plays the role of an infinite-dimensional "parameter" in this case), with a belief distribution Q for F, having the operational interpretation of "what we believe the empirical distribution function would look like for a large sample". The structure of the learning process for a general exchangeable sequence of real-valued random quantities, with the distribution function representation given in Proposition 4.3, cannot easily be described explicitly. In what follows, we shall therefore find it convenient to restrict attention to those cases where a corresponding representation holds in terms of density functions, labelled by a finite-dimensional parameter rather than the infinite-dimensional label F. For ease of reference, we present this representation as a corollary to Proposition 4.3.
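The operational reading of Q, namely beliefs about what the empirical distribution function of a large sample would look like, can be illustrated by simulation (a sketch of ours, not part of the original text): draw F at random from a two-point mixing measure Q, then sample conditionally i.i.d. from the realised F; the empirical distribution function settles on that F and not on the other candidate.

```python
import bisect
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

random.seed(0)
# A two-point mixing measure Q: F is N(0,1) or N(3,1) with equal probability.
mu = random.choice([0.0, 3.0])
xs = sorted(random.gauss(mu, 1.0) for _ in range(5000))   # i.i.d. given F

# Sup-distance between the empirical df and each candidate F, on a grid.
grid = [i / 10.0 for i in range(-40, 71)]
def sup_dist(m):
    return max(abs(bisect.bisect_right(xs, t) / len(xs) - normal_cdf(t, m, 1.0))
               for t in grid)

# The empirical df is close to the realised F and far from the other one.
d_true, d_other = sup_dist(mu), sup_dist(3.0 - mu)
assert d_true < 0.05 and d_other > 0.3
print("realised mean %.1f: sup-distance %.3f" % (mu, d_true))
```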
Corollary 1. If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities admitting a density representation, then, under the conditions of Proposition 4.3, the joint density of x_1, ..., x_n has the form

p(x_1, \ldots, x_n) = \int_\Theta \prod_{i=1}^{n} p(x_i \,|\, \theta) \, dQ(\theta),

with p(\cdot \,|\, \theta) denoting the density function corresponding to the "unknown parameter" \theta \in \Theta, and Q a prior distribution over \Theta.

Corollary 2. If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities admitting a density representation as in Corollary 1, then, assuming the required densities to exist, the conditional (predictive) density of x_{n+1}, \ldots, x_{n+m} given x_1, \ldots, x_n has the form

p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) = \int_\Theta \prod_{i=n+1}^{n+m} p(x_i \,|\, \theta) \, dQ(\theta \,|\, x_1, \ldots, x_n),

where

dQ(\theta \,|\, x_1, \ldots, x_n) = \frac{\prod_{i=1}^{n} p(x_i \,|\, \theta) \, dQ(\theta)}{\int_\Theta \prod_{i=1}^{n} p(x_i \,|\, \theta) \, dQ(\theta)}.

Proof. This follows immediately on writing

p(x_{n+1}, \ldots, x_{n+m} \,|\, x_1, \ldots, x_n) = \frac{p(x_1, \ldots, x_{n+m})}{p(x_1, \ldots, x_n)},

applying the density representation form to both p(x_1, \ldots, x_{n+m}) and p(x_1, \ldots, x_n), and rearranging the resulting expression.

The role of Bayes' theorem in the learning process is now easily identified. The technical discussion in this section has centred on exchangeable sequences of real-valued random quantities x_1, x_2, .... In fact, everything carries over in an obviously analogous manner to the case of exchangeable sequences x_1, x_2, ... of random vectors, with x_i \in \Re^k, k > 1. All that happens is that the distribution functions and densities referred to in Proposition 4.3 and its corollaries become the joint distribution functions and densities for the k components of the x_i. To avoid tedious distinctions between x \in \Re and x \in \Re^k, in subsequent developments we shall often just write x \in X; in cases where the distinction between k = 1 and k > 1 matters, it will be clear from the context what is intended.

In Section 4.8, we shall give detailed references to the literature on representation theorems for exchangeable sequences, including far-reaching generalisations of the 0-1 and real-valued cases. However, even the simple cases we have presented already provide, in effect, a deeply satisfying clarification, from the subjectivist perspective, of such fundamental notions as models, parameters, conditional independence and the relationship between beliefs and limiting frequencies.
In terms of Proposition 4.3, the assumption of exchangeability for the real-valued random quantities x_1, x_2, ... again places (as in the 0-1 case) a limitation on the family of probability measures P which can serve as predictive probability models: the family of coherent predictive probability models is generated by ranging through all possible prior distributions Q(F). In this case, however, underlying the conditional independence structure within the mixture is a random distribution function, so that the "parameter" is, in effect, infinite dimensional. The mathematical form of the required representation is well-defined, but the practical task of translating actual beliefs about real-valued random quantities into the required mathematical form of a measure over a function space seems, to say the least, a somewhat daunting prospect. It is interesting, therefore, to see whether there exist more complex formal structures of belief, imposing further symmetries or structure beyond simple exchangeability, which lead to more specific and "familiar" model representations: that is, situations in which exchangeability leads to a mixture of conditional independence structures defined in terms of a finite-dimensional parameter, so that the more explicit forms given in the corollaries to Proposition 4.3 can be invoked. Given the interpretation of the components of such a parameter as strong law limits of simple sequences of functions of the observations, the specification of Q, and hence of the complete predictive probability model P, then becomes a much less daunting task.

4.4 MODELS VIA INVARIANCE

4.4.1 The Normal Model

Suppose that, in addition to judging an infinite sequence of real-valued random quantities x_1, x_2, ... to be exchangeable, we consider the possibility of further judgements of invariance, perhaps relating to the "geometry" of the space in which a finite subset of observations, x = (x_1, ..., x_n), say, lies. As with exchangeability, there is no claim that such judgements have any a priori special status. They are intended, simply, as possible forms of judgement that might be made, and whose consequences might be interesting to explore. The following definitions describe two such possible judgements of invariance.

Definition 4.3. (Spherical symmetry). A sequence of random quantities x_1, ..., x_n is said to have spherical symmetry under a predictive probability model P if the latter defines the distributions of x = (x_1, ..., x_n) and Ax to be identical, for any (orthogonal) n \times n matrix A such that A^t A = I.
This definition encapsulates a judgement of rotational symmetry, in the sense that, although measurements happened to have been expressed in terms of a particular coordinate system (yielding x_1, ..., x_n), our quantitative beliefs would not change if they had been expressed in a rotated coordinate system. Since rotational invariance fixes "distances" from the origin, this is equivalent to a judgement of identical beliefs for all outcomes of x_1, ..., x_n leading to the same value of x_1^2 + \cdots + x_n^2. The next result states that if we make the judgement of spherical symmetry for every n (which in turn implies a judgement of exchangeability, since permutation is a special case of orthogonal transformation), the general mixture representation given in Proposition 4.3 assumes a much more concrete and familiar form.

Proposition 4.4. (Representation theorem under spherical symmetry). If x_1, x_2, ... is an infinite sequence of real-valued random quantities with probability measure P, and if, for any n, {x_1, ..., x_n} have spherical symmetry, there exists a distribution function Q on \Re^+ such that the joint distribution function of x_1, ..., x_n has the form

P(x_1, \ldots, x_n) = \int_0^\infty \prod_{i=1}^{n} \Phi(\lambda^{1/2} x_i) \, dQ(\lambda),

where \Phi is the standard normal distribution function and \lambda^{-1} = \lim_{n \to \infty} n^{-1}(x_1^2 + \cdots + x_n^2), with Q the corresponding limiting belief distribution for \lambda.

Proof. See, for example, Freedman (1963a) and Kingman (1972); details are omitted here, since the proof of a generalisation of this result will be given in full in Proposition 4.5.
The form of representation obtained in Proposition 4.4 tells us that the judgement of spherical symmetry restricts the set of coherent predictive probability models to those which are generated by acting as if:

(i) the observations are conditionally independent normal random quantities, given a random quantity \lambda (which, as a "labelling parameter", corresponds to the precision, i.e., the reciprocal of the variance);
(ii) \lambda is itself assigned a distribution Q;
(iii) by the strong law of large numbers, \lambda^{-1} = \lim_{n \to \infty} n^{-1}(x_1^2 + \cdots + x_n^2), so that Q may be interpreted as "beliefs about the reciprocal of the limiting mean sum of squares of the observations".
We note first that the judgement of spherical symmetry implicitly attaches a special significance to the origin of the coordinate system. In general, however, if we were to feel able to make a judgement of spherical symmetry, it would typically only be relative to an "origin" defined in terms of the "centre" of the random quantities under consideration. To obtain a justification for the usual normal specification, with "unknown mean and precision", we need to generalise the above discussion slightly. This motivates the following definition.

Definition 4.4. (Centred spherical symmetry). A sequence of random quantities x_1, ..., x_n is said to have centred spherical symmetry if the random quantities x_1 - \bar{x}_n, ..., x_n - \bar{x}_n have spherical symmetry, where \bar{x}_n = n^{-1}(x_1 + \cdots + x_n).

This is equivalent to a judgement of identical beliefs for all outcomes of x_1, ..., x_n leading to the same value of (x_1 - \bar{x}_n)^2 + \cdots + (x_n - \bar{x}_n)^2, since it is equivalent to a judgement of invariance in terms of distance from an origin placed at the centre \bar{x}_n. For related work see Dawid (1977, 1981).

Proposition 4.5. (Representation under centred spherical symmetry). If x_1, x_2, ... is an infinitely exchangeable sequence of real-valued random quantities with probability measure P such that, for any n, {x_1, ..., x_n} have centred spherical symmetry, then there exists a distribution function Q on \Re \times \Re^+ such that the joint distribution function of x_1, ..., x_n has the form

P(x_1, \ldots, x_n) = \int_{\Re \times \Re^+} \prod_{i=1}^{n} \Phi\big(\lambda^{1/2}(x_i - \mu)\big) \, dQ(\mu, \lambda),

where \Phi is the standard normal distribution function,

Q(\mu, \lambda) = \lim_{n \to \infty} P\big[(\bar{x}_n \le \mu) \cap (s_n^{-2} \le \lambda)\big],

with \bar{x}_n = n^{-1}(x_1 + \cdots + x_n), s_n^2 = n^{-1}[(x_1 - \bar{x}_n)^2 + \cdots + (x_n - \bar{x}_n)^2], \mu = \lim_{n \to \infty} \bar{x}_n and \lambda^{-1} = \lim_{n \to \infty} s_n^2.

Proof. (Smith, 1981.) Since the sequence x_1, x_2, ... is exchangeable, by Proposition 4.3 there exists a random distribution function F such that, conditional on F, the random quantities x_1, x_2, ... are independent. There is therefore a random characteristic function, \phi, say, corresponding to F, and the proof consists in showing that the judgement of centred spherical symmetry forces \phi, almost surely, to be the characteristic function of a normal distribution.
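The operational reading of Q(\mu, \lambda) in Proposition 4.5, namely beliefs about the limiting sample mean and reciprocal sample variance, can be illustrated numerically (a sketch of ours, not part of the original text): conditional on a realised pair (\mu, \lambda), the sample mean and the reciprocal mean squared deviation of a long i.i.d. normal sequence recover \mu and \lambda.

```python
import random

random.seed(1)
mu, lam = 2.0, 4.0                    # realised "labelling parameters"
sigma = lam ** -0.5                   # precision λ is the reciprocal variance
xs = [random.gauss(mu, sigma) for _ in range(20000)]

xbar = sum(xs) / len(xs)                          # sample mean  -> μ
s2 = sum((x - xbar) ** 2 for x in xs) / len(xs)   # mean squared deviation -> 1/λ

assert abs(xbar - mu) < 0.05
assert abs(1.0 / s2 - lam) < 0.3
print("sample mean %.3f, reciprocal sample variance %.3f" % (xbar, 1.0 / s2))
```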
Since, for any n, outcomes x_1, ..., x_n leading to the same values of x_1 + \cdots + x_n and x_1^2 + \cdots + x_n^2 are assigned identical beliefs, expectations of products of \phi evaluated at configurations of points having the same sum and the same sum of squares must coincide; for example, for any real u and v,

E\{\phi(u+v)\,\phi(u-v)\,\phi^2(-u)\} = E\{\phi^2(u)\,\phi^2(-u)\,\phi(v)\,\phi(-v)\},

since both configurations have sum zero and sum of squares 4u^2 + 2v^2. A similar argument with configurations of eight points shows that the expectation of the squared modulus of the difference of the two products is zero, so that all four terms arising in its expansion are equal and, almost surely with respect to the probability measure P, the random characteristic function \phi satisfies the functional equation

\phi(u+v)\,\phi(u-v) = \phi^2(u)\,\phi(v)\,\phi(-v), \quad \text{for all real } u, v,

wherever \phi is non-zero. If we now define A(u) = \log[\phi(u)\,\overline{\phi(u)}] and B(u) = \log[\phi(u)/\overline{\phi(u)}], where \overline{\phi(u)} = \phi(-u) denotes the complex conjugate, the functional equation yields

A(u+v) + A(u-v) = 2A(u) + 2A(v), \qquad B(u+v) + B(u-v) = 2B(u),

from which it follows that \log \phi(t) is a quadratic in t (see, for example, Kagan, Linnik and Rao, 1973). Recalling that \phi(0) = 1 and |\phi(t)| \le 1, the constant coefficient must be zero, the linear coefficient purely imaginary and the quadratic coefficient real and non-positive. This establishes that, almost surely with respect to P,

\phi(t) = \exp\big(i\mu t - \tfrac{1}{2}\lambda^{-1} t^2\big)

for some random quantities \mu \in \Re and \lambda \in \Re^+.
Proposition 4.6. (Multivariate representation theorem under centred spherical symmetry). If x1, x2, ... is an infinitely exchangeable sequence of random vectors taking values in ℜ^k, with probability measure P, such that, for any n and c ∈ ℜ^k, the random quantities c'x1, ..., c'xn have centred spherical symmetry, then the structure of evaluations under P of probabilities of events defined by x1, x2, ... is as if the latter were independent, multivariate normally distributed random vectors, conditional on a random mean vector μ and a random precision matrix λ, with a distribution over μ and λ induced by P.

Proof. Defining yj = c'xj, j = 1, ..., n, we see that the random quantities y1, ..., yn have centred spherical symmetry and so, by Proposition 4.5, there exist μ = μ(c) and λ = λ(c) such that, conditional on μ and λ, the random quantities c'x1, ..., c'xn are independent normally distributed random quantities, each with mean μ and precision λ. It follows that, for all t ∈ ℜ and for any c ∈ ℜ^k, the conditional characteristic function of each c'xj has the normal form, so that, conditional on a random mean vector μ and a random precision matrix λ, x1, x2, ... are independent, multivariate normally distributed random vectors, and the result follows.
[Figure 4.1: A1, A2 and B1, B2 are reflections in the 45° line through the origin; C1, C2 are reflections in a (dashed) 45° line not through the origin.]

By exchangeability, the probabilities assigned to A1 and A2, and to B1 and B2, respectively, must be equal. In general, however, the assumption of exchangeability would not imply that events such as C1 and C2 in Figure 4.1 have equal probabilities, even though they are symmetrically placed with respect to a 45° line (but not the one through the origin). It is interesting to ask under what circumstances an individual might judge events such as C1, C2 to have equal probabilities. The answer is suggested by the additional (dashed) lines in the figure. If we added to the assumption of exchangeability the judgement that the "origins" of the xi and xj axes are "irrelevant", so far as probability judgements are concerned, then the probabilities of events such as C1 and C2 would be judged equal. In particular, we note that this implies, for any pair xi, xj, an identity of beliefs for any events in the positive quadrant which are symmetrically placed with respect to the 45° line through the origin. In perhaps more familiar terms, this would be as though, when making judgements about events in the positive quadrant, an individual's judgement exhibited a form of "lack of memory" property with respect to the origin. If such a judgement is assumed to hold for all subsets of n (rather than just two) random quantities, the resulting representation is as follows.

4.4.3 The Exponential Model

Suppose x1, x2, ... is judged to be an infinitely exchangeable sequence of positive real-valued random quantities.
Proposition 4.7. (Continuous representation under origin invariance). If x1, x2, ... is an infinitely exchangeable sequence of positive real-valued random quantities with probability measure P, such that, for all n, and any event A in ℜ⁺ × ... × ℜ⁺,

  P[(x1, ..., xn) ∈ A] = P[(x1, ..., xn) ∈ A + a]

for all a ∈ ℜ × ... × ℜ such that a'1 = 0 and A + a is an event in ℜ⁺ × ... × ℜ⁺, then the joint density for x1, ..., xn has the form

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} θ exp(−θ xi) dQ(θ),

where θ = lim_{n→∞} n/(x1 + ... + xn) and Q is the distribution function for θ induced by P.

Outline proof. (Diaconis and Ylvisaker, 1985). By the general representation theorem, there exists a random distribution function F such that, conditional on F, the x1, x2, ... are independent, so that, for any i ≠ j,

  p(xi, xj | F) = p(xi | F) p(xj | F).

It can be shown that the additional invariance property continues to hold conditional on F. If we now take a = (a2, −a2) and A = {(xi, xj); xi > a1 + a2, xj > 0}, we have

  P[(xi > a1 + a2) ∩ (xj > 0) | F] = P[(xi > a1) ∩ (xj > a2) | F],

so that, by conditional independence,

  P(x > a1 + a2 | F) = P(x > a1 | F) P(x > a2 | F).

But this functional relationship implies that

  P(x > s | F) = e^{−θs}, for some θ,

so that, conditional on F, the density of each xi, which is the derivative of 1 − P(x > s | F) and is certainly positive for all s, is given by θ exp(−θ xi). The rest of the result follows on noting that, by the strong law of large numbers, θ⁻¹ = lim_{n→∞} (x1 + ... + xn)/n.
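The multiplicative survival identity used in the proof, the "lack of memory" property, is easy to check numerically for the exponential survival function P(x > a | θ) = exp(−θa); the rate and the test points below are arbitrary illustrative choices.

```python
import math

theta = 1.7   # arbitrary rate, for illustration only

def surv(a):
    """P(x > a | theta) for the exponential distribution."""
    return math.exp(-theta * a)

# The functional equation P(x > a1 + a2) = P(x > a1) P(x > a2)
for a1, a2 in [(0.3, 1.1), (2.0, 0.5), (4.2, 3.3)]:
    assert math.isclose(surv(a1 + a2), surv(a1) * surv(a2))
```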
Thus, we see that judgements of exchangeability and "lack of memory" for sequences of positive real-valued random quantities constrain the possible predictive probability models for the sequence to be those which are generated by acting as if we have a random sample from an exponential distribution with unknown parameter θ, with a prior distribution Q for the latter. In fact, if Q* denotes the corresponding distribution for φ = θ⁻¹ = lim_{n→∞} (x1 + ... + xn)/n, it may be easier to use the "reparametrised" representation, since Q* is then more directly accessible as "beliefs about the sample mean from a large number of observations".

Recalling the possible motivation given above for the additional invariance assumption on the sequence x1, x2, ..., it is interesting to note the very specific and well-known "lack of memory" property of the exponential distribution, namely

  P(x > a1 + a2 | θ) = P(x > a1 | θ) P(x > a2 | θ),

which appears implicitly in the above proof.

4.4.4 The Geometric Model

Suppose x1, x2, ... is judged to be an infinitely exchangeable sequence of strictly positive integer-valued random quantities. It is easy to see that we could repeat the entire introductory discussion of Section 4.4.3, except that events would now be defined in terms of sets of points on the lattice Z⁺ × ... × Z⁺, rather than as regions in ℜ⁺ × ... × ℜ⁺. This enables us to state the following representation result.

Proposition 4.8. (Discrete representation under origin invariance). If x1, x2, ... is an infinitely exchangeable sequence of positive integer-valued random quantities with probability measure P, such that, for all n and any event A in Z⁺ × ... × Z⁺,

  P[(x1, ..., xn) ∈ A] = P[(x1, ..., xn) ∈ A + a]

for all a ∈ Z × ... × Z such that a'1 = 0 and A + a is an event in Z⁺ × ... × Z⁺, then the joint density for x1, ..., xn has the form

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} θ(1 − θ)^{xi − 1} dQ(θ),

where 0 < θ ≤ 1, θ = lim_{n→∞} n/(x1 + ... + xn), and Q is the distribution function for θ induced by P.

Outline proof. This follows precisely the steps in the proof of Proposition 4.7, except that, for positive integer-valued x, the functional equation

  P(x > a1 + a2 | F) = P(x > a1 | F) P(x > a2 | F)

implies that P(x > s | F) = (1 − θ)^s, for some θ, 0 < θ ≤ 1, so that the probability function is easily seen to be θ(1 − θ)^{x−1}.

Again, recalling the possible motivation for the additional invariance property, it is interesting to note the familiar "lack of memory" property of the geometric distribution,

  P(x > a1 + a2 | θ) = P(x > a1 | θ) P(x > a2 | θ),

so that the coherent predictive probability models must be those which are generated by acting as if we have a random sample from a geometric distribution with unknown parameter θ, with a prior distribution Q for the latter.

4.5 MODELS VIA SUFFICIENT STATISTICS

4.5.1 Summary Statistics

We begin with a formal definition, which enables us to discuss the process of summarising a sequence x1, ..., xm of random quantities. Again, our discussion carries over to the case of random vectors, but for notational simplicity we shall usually talk in terms of random quantities.

Definition 4.6. (Statistic). Given random quantities (vectors) x1, ..., xm, with specified sets of possible values X1, ..., Xm, respectively, a random vector tm = tm(x1, ..., xm), defined on X1 × ... × Xm and taking values in ℜ^{k(m)}, is called a k(m)-dimensional statistic.

Familiar examples of summary statistics are:

  tm = [m, (x1 + ... + xm)/m], the sample size and sample mean (k(m) = 2);
  tm = [m, med{x1, ..., xm}], the sample size and median (k(m) = 2);
  tm = [m, min{x1, ..., xm}, max{x1, ..., xm}], the sample size, minimum and maximum (k(m) = 3).

A trivial case of such a statistic would be tm = (x1, ..., xm) itself, but this clearly does not achieve much by way of summarisation, since k(m) = m.
To achieve data reduction, as with the above examples, we clearly need k(m) < m; moreover, further clarity of interpretation is achieved if k(m) = k, a fixed dimension independent of m. We shall not concern ourselves at this stage with the origin of, or motivation for, any such choice of particular summary statistics. Instead, we shall focus attention on the general questions of whether, and under what circumstances, it is coherent to invoke such a form of data reduction and, if so, what forms of representation for predictive probability models might result. In the next section, we shall examine the formal acceptability and implications of seeking to act as if particular summary statistics have a special status in the context of representing beliefs about a sequence of random vectors. Throughout, we shall assume that beliefs can be represented in terms of density functions.

4.5.2 Predictive Sufficiency and Parametric Sufficiency

As an example of the way in which a summary statistic might be assumed to play a special role in the evolution of beliefs, let us consider the following general situation. Given a sequence of random quantities x1, x2, ..., where xi takes values in Xi, suppose that past observations x1, ..., xn are available and an individual is contemplating, conditional on this given information, beliefs about future observations x_{n+1}, ..., to be described by p(x_{n+1}, ... | x1, ..., xn). The following definition describes one possible way in which assumptions of systematic data reduction might be incorporated into the structure of such conditional beliefs.

Definition 4.7. (Predictive sufficiency). Given a sequence of random quantities x1, x2, ..., with probability measure P, the sequence of statistics t1, t2, ..., with tn = tn(x1, ..., xn) defined on X1 × ... × Xn, is said to be predictive sufficient for the sequence x1, x2, ... if, for all n ≥ 1 and r ≥ 1, with {i1, ..., ir} ∩ {1, ..., n} = ∅,

  p(x_{i1}, ..., x_{ir} | x1, ..., xn) = p(x_{i1}, ..., x_{ir} | tn),

where p(· | ·)
is the conditional density induced by P.

The above definition captures the idea that, given tn, the individual values of x1, ..., xn contribute nothing further to one's evaluation of probabilities of future events defined in terms of as yet unobserved random quantities. Another way of expressing this, as is easily verified from Definition 4.7, is that future observations (x_{n+1}, ...) and past observations (x1, ..., xn) are conditionally independent given tn. From a pragmatic point of view, the assumption of a specified sequence of predictive sufficient statistics will, in general, greatly simplify the process of assessing probabilities of future events conditional on past observations. From a formal point of view, however, we shall need additional
structure if we are to succeed in using this idea to identify specific forms of the general representation of the joint distribution of x1, ..., xn. As with our earlier discussion, we shall assume in what follows that the probability measure P describing our beliefs implies both predictive sufficiency and exchangeability for the infinite sequence x1, x2, .... A mathematically rigorous treatment is beyond the intended level of this book, and so we shall confine ourselves to an informal presentation of the main ideas.

As a particular illustration of what might be achieved, throughout this section we shall assume that the exchangeability assumption leads to a finitely parametrised mixture representation, so that, for any n ≥ 1, the joint density of x1, ..., xn has the form

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} p(xi | θ) dQ(θ),

where all integrals, here and in what follows, are assumed to be over the set of possible values of θ. This latter form makes clear that, for such exchangeable beliefs, the learning process is "transmitted" within the mixture representation by the updating of beliefs about the "unknown parameter" θ, from dQ(θ) to dQ(θ | x1, ..., xn). This suggests another possible way of defining a statistic tn = tn(x1, ..., xn) to be a "sufficient summary" of x1, ..., xn.

Definition 4.8. (Parametric sufficiency). If x1, x2, ... is an infinitely exchangeable sequence of random quantities, where xi takes values in Xi = X, the sequence of statistics t1, t2, ..., with tn defined on X1 × ... × Xn, is said to be parametric sufficient for x1, x2, ... if, for any n ≥ 1,

  dQ(θ | x1, ..., xn) = dQ(θ | tn),

for any dQ(θ) defining an exchangeable predictive probability model via the representation given above.

Definitions 4.7 and 4.8 both seem intuitively compelling as encapsulations of the notion of a statistic being a "sufficient summary". It is perhaps reassuring, therefore, that, within our assumed framework, the two notions are equivalent.

Proposition 4.9. (Equivalence of predictive and parametric sufficiency). Given an infinitely exchangeable sequence of random quantities x1, x2, ..., admitting a finitely parametrised mixture representation, the sequence of statistics t1, t2, ... is predictive sufficient if, and only if, it is parametric sufficient.

Heuristic proof. For any m ≥ 1 and n > m, the representation theorem implies that

  p(x_{m+1}, ..., xn | x1, ..., xm) = ∫ ∏_{i=m+1}^{n} p(xi | θ) dQ(θ | x1, ..., xm),

while, writing A = {(x1, ..., xm); tm(x1, ..., xm) = tm}, the corresponding conditional density given tm can easily be shown to be expressible as

  p(x_{m+1}, ..., xn | tm) = ∫ ∏_{i=m+1}^{n} p(xi | θ) dQ(θ | tm).

It follows that the two conditional densities coincide, for all m and n, if, and only if,

  dQ(θ | x1, ..., xm) = dQ(θ | tm), for all dQ(θ),

which is precisely parametric sufficiency.

To make further progress, we now establish that parametric sufficiency is itself equivalent to certain further conditions on the probability structure.

Proposition 4.10. (Neyman factorisation criterion). The sequence of statistics t1, t2, ... is parametric sufficient for infinitely exchangeable x1, x2, ... if, and only if, for any n ≥ 1, the joint density for x1, ..., xn, given θ, has the form

  p(x1, ..., xn | θ) = hn(tn, θ) g(x1, ..., xn),

for some functions hn ≥ 0, g > 0.
Outline proof. Given such a factorisation, for any dQ(θ) we have

  dQ(θ | x1, ..., xn) = hn(tn, θ) dQ(θ) / ∫ hn(tn, θ) dQ(θ) = dQ(θ | tn),

so that t1, t2, ... is parametric sufficient. Conversely, given parametric sufficiency, we have, for any dQ(θ) with support Θ,

  p(x1, ..., xn | θ) = p(x1, ..., xn) dQ(θ | x1, ..., xn)/dQ(θ) = p(x1, ..., xn) dQ(θ | tn)/dQ(θ).

The right-hand side depends on x1, ..., xn, given θ, only through tn, apart from the factor p(x1, ..., xn), and the required factorisation follows.

Proposition 4.11. (Sufficiency and conditional independence). The sequence t1, t2, ... is parametric sufficient for infinitely exchangeable x1, x2, ... if, and only if, for any n ≥ 1, p(x1, ..., xn | tn, θ) is independent of θ.

Outline proof. For any tn = tn(x1, ..., xn), we have

  p(x1, ..., xn | θ) = p(x1, ..., xn | tn, θ) p(tn | θ).

If p(x1, ..., xn | tn, θ) = g(x1, ..., xn) is independent of θ, for some g > 0, then p(x1, ..., xn | θ) = hn(tn, θ) g(x1, ..., xn), with hn(tn, θ) = p(tn | θ), so that, by Proposition 4.10, t1, t2, ... is parametric sufficient. Conversely, if t1, t2, ... is parametric sufficient, then, by Proposition 4.10, p(x1, ..., xn | θ) = hn(tn, θ) g(x1, ..., xn). Integrating over all values {x1, ..., xn} such that tn(x1, ..., xn) = tn, we obtain

  p(tn | θ) = hn(tn, θ) G(tn), for some G > 0.

Substituting for hn(tn, θ) in the expression for p(x1, ..., xn | θ), we obtain

  p(x1, ..., xn | tn, θ) = g(x1, ..., xn)/G(tn),

which is independent of θ.
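The factorisation criterion has a direct computational payoff: a posterior for θ computed from the full data must agree with one computed from a sufficient statistic alone. The sketch below checks this for 0-1 observations on an arbitrary discrete grid of θ values playing the role of dQ(θ); both the data and the grid are illustrative choices.

```python
# Posterior over a grid of theta values, computed once from the full
# sequence and once from the sufficient statistic t_n = [n, sum(x)].
xs = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1]
n, y = len(xs), sum(xs)

grid = [i / 100 for i in range(1, 100)]   # uniform prior over the grid

def normalise(ws):
    z = sum(ws)
    return [w / z for w in ws]

# likelihood built up from the individual observations
lik_full = [1.0] * len(grid)
for x in xs:
    lik_full = [w * (t if x == 1 else 1 - t) for w, t in zip(lik_full, grid)]

# likelihood computed from (n, y) alone
lik_stat = [t ** y * (1 - t) ** (n - y) for t in grid]

post_full = normalise(lik_full)
post_stat = normalise(lik_stat)
assert all(abs(a - b) < 1e-12 for a, b in zip(post_full, post_stat))
```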
In the approach we have adopted, the definitions and consequences of predictive and parametric sufficiency have been motivated and examined within the general framework of seeking coherent representations of subjective beliefs about sequences of observables. In the context of our subjectivist discussion of beliefs and models, we shall mainly be interested in asking the following questions. When is it coherent to act as if there is a sequence of predictive sufficient statistics associated with an exchangeable sequence of random quantities? What forms of predictive probability model are implied in cases where we can assume a sequence of predictive sufficient statistics?

The notion of parametric sufficiency has so far only been put forward within the context of exchangeable beliefs, where the operational significance of "parameter" typically becomes clear from the relevant representation theorem. However, as the reader familiar with more "conventional" approaches will have already realised, related concepts of "sufficiency" are also central to non-subjectivist theories. In fact, the non-dependence of the density p(x1, ..., xn | tn, θ) on θ, established here in Proposition 4.11 as a consequence of our definitions, was itself put forward as the definition of a "sufficient statistic" by Fisher (1922), and the factorisation given in Proposition 4.10 was established by Neyman (1935) as equivalent to the Fisher definition. From an operational, subjectivist point of view, it seems to us rather mysterious to launch into fundamental definitions about learning processes expressed in terms of conditioning on "parameters" having no status other than as "labels". However, since our representation for exchangeable sequences provides, for us, a justification for regarding the usual (Fisher) definition as equivalent to predictive and parametric sufficiency, we can exploit, from a technical point of view, many of the important mathematical results which have been established using that definition as a starting point.

Aside from these foundational and modelling questions, the results given above also enable us to check the form of the predictive sufficient statistics for any given exchangeable representation. We shall illustrate this possibility with some simple examples before continuing with the general development.

Example 4.5. (Bernoulli model). We recall from Proposition 4.1 that if x1, x2, ... is an infinitely exchangeable sequence of 0-1 random quantities, then we have the general representation

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} θ^{xi}(1 − θ)^{1−xi} dQ(θ),

where θ = lim_{n→∞} yn/n, with yn = x1 + ... + xn. Defining tn = [n, yn] and noting that we can write

  p(x1, ..., xn | θ) = θ^{yn}(1 − θ)^{n−yn} = hn(tn, θ) g(x1, ..., xn),

with hn(tn, θ) = θ^{yn}(1 − θ)^{n−yn} and g(x1, ..., xn) = 1, it follows from Propositions 4.10 and 4.11 that t1, t2, ..., with tn = [n, x1 + ... + xn], defines a sequence of predictive and parametric sufficient statistics for x1, x2, .... This corresponds precisely to the intuitive idea that the sequence length and total number of 1's summarises all the interesting information in any sequence of observed exchangeable 0-1 random quantities. Of course, tn is not unique, since, for example, we could equally well define tn = [n, yn/n].

Example 4.6. (Normal model). We recall from Proposition 4.5 that if x1, x2, ... is an exchangeable sequence of real-valued random quantities with the additional property of centred spherical symmetry, then we have the general representation

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} N(xi | μ, λ) dQ(μ, λ).

In the light of Propositions 4.9 and 4.10, inspection of p(x1, ..., xn | μ, λ) reveals that

  tn = [n, x̄n, s_n^2], with x̄n = n⁻¹(x1 + ... + xn) and s_n^2 = n⁻¹ Σ_{i=1}^{n} (xi − x̄n)^2,

defines a sequence of predictive and parametric sufficient statistics for x1, x2, .... In view of the centring and spherical symmetry conditions, it is perhaps not surprising that the sample size, sample mean and sample mean sum of squares about the mean turn out to be sufficient summaries.

Example 4.7. (Exponential model). We recall from Proposition 4.7 that if x1, x2, ... is an exchangeable sequence of positive real-valued random quantities with an additional "origin invariance" property, then we have the general representation

  p(x1, ..., xn) = ∫ θ^n exp{−θ(x1 + ... + xn)} dQ(θ),

and it is immediate from Propositions 4.10 and 4.11 that tn = [n, x1 + ... + xn] defines a sequence of predictive and parametric sufficient statistics, although, in this example, there is not such an obvious link between the form of invariance assumed and the form of the sufficient statistic.

It is clear from the general definition of a sufficient statistic (parametric or predictive) that tn = (x1, ..., xn) is always a sufficient statistic, but this clearly does not achieve much by way of summarisation. Given our interest in achieving simplification through data reduction, it is equally clear that we should like to focus on sufficient statistics which are, in some sense, minimal. This motivates the following definition.

Definition 4.9. (Minimal sufficient statistic). If x1, x2, ... is an infinitely exchangeable sequence of random quantities, where xi takes values in Xi = X, the sequence of statistics t1, t2, ..., defined on X1 × ... × Xn, is minimal sufficient for x1, x2, ... if, given any other sequence of sufficient statistics s1, s2, ..., there exist functions g1(·), g2(·), ... such that tn = gn(sn).

It is easily seen that the forms of tn identified in Examples 4.5 to 4.7 are minimal sufficient statistics. From now on, references to sufficient statistics should be interpreted as intending minimal sufficient statistics. Finally, since n very often appears as part of the sufficient statistic, we shall sometimes, to avoid tedious repetition, omit explicit mention of n and refer to the "interesting function(s) of x1, ..., xn" as the sufficient statistic.

4.5.3 Sufficiency and the Exponential Family

In the previous section, we identified some further potential structure in the general representation of joint densities for exchangeable random quantities when predictive sufficiency is assumed. We shall now take this process a stage further by examining in detail representations relating to sufficient statistics of fixed dimension. We begin by considering exchangeable beliefs constructed by mixing, with respect to some dQ(θ), over a specified parametric form, where θ is a one-dimensional parameter. An important class of such parametric forms is identified in the following definition.
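In the normal model example above, the likelihood depends on the data only through [n, x̄n, s_n^2]. As an illustrative numerical check (data and parameter values are arbitrary), the sketch below constructs two different samples sharing these three summaries and verifies that they give identical log-likelihoods at several values of (μ, λ).

```python
import math

def loglik(xs, mu, lam):
    # log-likelihood of independent N(x | mu, lam) terms (lam = precision)
    n = len(xs)
    return (n / 2) * math.log(lam / (2 * math.pi)) \
        - (lam / 2) * sum((x - mu) ** 2 for x in xs)

b = 1 / math.sqrt(3)
xs1 = [1.0, 2.0, 3.0]
xs2 = [2 + b, 2 + b, 2 - 2 * b]   # same n, sample mean, and sum of squares about the mean

# identical likelihoods at arbitrary parameter values
for mu, lam in [(0.0, 1.0), (2.5, 0.3), (-1.0, 4.0)]:
    assert math.isclose(loglik(xs1, mu, lam), loglik(xs2, mu, lam))
```

The check works because Σ(xi − μ)^2 = Σ(xi − x̄n)^2 + n(x̄n − μ)^2, which involves the data only through the sufficient summaries.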
Since we have established, in the finite parameter framework, the equivalence of predictive and parametric sufficiency for the case of exchangeable random quantities, and their equivalence with the factorisation criterion of Proposition 4.10, we shall from now on, without risk of confusion, simply use the term sufficient statistic.

Definition 4.10. (One-parameter exponential family). A probability density (or mass function) p(x | θ), labelled by θ ∈ Θ ⊆ ℜ, is said to belong to the one-parameter exponential family if it is of the form

  Ef(x | f, g, h, φ, c, θ) = f(x) g(θ) exp{c φ(θ) h(x)},  x ∈ X,

for some f, g, h, φ and constant c. The family is called regular if X does not depend on θ; otherwise it is called non-regular.

Proposition 4.12. (Sufficient statistics for the one-parameter exponential family). If x1, x2, ... is an exchangeable sequence such that, for some dQ(θ),

  p(x1, ..., xn) = ∫ ∏_{i=1}^{n} Ef(xi | f, g, h, φ, c, θ) dQ(θ),

then tn = [n, h(x1) + ... + h(xn)], n = 1, 2, ..., is a sequence of sufficient statistics.

Proof. This follows immediately from Proposition 4.10 on noting that

  p(x1, ..., xn | θ) = [g(θ)]^n exp{c φ(θ)(h(x1) + ... + h(xn))} ∏_{i=1}^{n} f(xi),

which factors into hn(tn, θ) g(x1, ..., xn), with hn(tn, θ) = [g(θ)]^n exp{c φ(θ)(h(x1) + ... + h(xn))} and g(x1, ..., xn) = f(x1) ··· f(xn).
The following standard univariate probability distributions are particular cases of the (regular) one-parameter exponential family with the appropriate choices of f, g, h and φ.

Bernoulli

  p(x | θ) = Br(x | θ) = θ^x (1 − θ)^{1−x},  x ∈ {0, 1},  θ ∈ [0, 1],

  f(x) = 1,  g(θ) = 1 − θ,  φ(θ) = log[θ/(1 − θ)],  h(x) = x,  c = 1.

Poisson

  p(x | θ) = Pn(x | θ) = θ^x e^{−θ}/x!,  x ∈ {0, 1, 2, ...},  θ ∈ ℜ⁺,

  f(x) = 1/x!,  g(θ) = e^{−θ},  φ(θ) = log θ,  h(x) = x,  c = 1.

We note that the term cφ(θ) appearing in the general Ef(·) form could always be simply written as φ(θ), with φ suitably defined; however, it is often convenient to be able to separate the "interesting" function of θ from the constant which happens to multiply it.

In Definition 4.10, we allowed for the possibility (the non-regular case) that the range, X, of possible values of x might itself depend on the labelling parameter θ. Although we have not yet made a connection between this case and forms of representation arising in the modelling of exchangeable sequences, it will be useful at this stage to note examples of the well-known forms of distribution which are covered by this definition. We shall indicate later how the use of such forms in the modelling process might be given a subjectivist justification.

Uniform

  p(x | θ) = U(x | 0, θ) = θ^{−1},  x ∈ (0, θ),  θ ∈ ℜ⁺,

  f(x) = 1,  g(θ) = θ^{−1},  φ(θ) = 0,  h(x) = 0,  c = 1.
which is labelled by 8 E 0 C Rk. . . otherwise it is called f nonregular.4.13. . (SujJkientstatistics forthe kjwmeterexponentialfamily). . 1 ).1 1. . for some dQ(t?). (kparameter expone& famdy).12 and is a straightforward consequence of Roposition 4. A probabilitydensity (or massfunction) p(x I O ) . 4(8) = f . given the functions Thefamily is called regular i X does not depend on 8.. Proposition4. This is analogous to Proposition 4.then is a sequence of suflcient statistics. 9 etc.xt e X . x2. . is said to belong to the kparameter exponentialfamily i it is of the form f where h = ( h l . the second nonregular) of the Icparameter exponential family with the appropriate choices off. h. ($1 . a The following standard probability distributions are particular cases (the first regular. . .hk). If 51. as indicated.10. and the constants c. . . given regular kparameter Efk(. Proof. . is an exchangeable sequence such that.&) and. x E X...5 Models via Suflcient Statistics 201 Definition4. 4.
... I I . .=c. .. = statistics. C ' I ) . k = 2 and so that t. .Q2). @(e) ( 7 4 .. = (n.d..021) = ) P(X e) = p ( : ~ ( e . In this case. . = 1.min{xl.(T).. . Definition 4. .~p.. (Canonical exponentialfamily). . y Ll. x I f } . I. . n is easily seen to give a sequence of sufficient statistics. .) in Definition 4.max{r.(e). el E R+.. i = i . .12. I . is convenient for some purposes (relating straightforwardly to familiar ver sions of parametric families). 7 ) = N(. = [n.11. . . s E (el. The description of the exponential family forms given in Definitions 4.202 Normal (unknown mean and variunce) Y(x 10) = p(x I P . . . e.2.yk).. i .T I p ..C:=lI. I . . . This motivates the following definition. is a sequence of sufficient 7i Uniform (over the interval [el.o1 E 8. c1 = 1. cp= 1/2. ~ .= (c. The probability density (or massfunctiotr) derivedfrom Efk(.X:llxp] . . called the canonicalform of representcrtion ofthe e.potietiticil~iniil~. ~ } ] .10 and 4.2.= 1.A. but somewhat cumbersome for others. which we give for the general kparameter case. .. .] . sia the trunsjhnutions y = (y . 7 ) 4 Modellitig In this case. e ~ U ( Z el&) = (e..e . and t . ) . Yf=h.
+ is given by Proof. Pitman (1936). so that the distribution of s is as claimed. Sufficiency is immediate from Proposition 4. Examination of the density convolution form for n = 1. under various regularity conditions. (Firsttwomoments of the canonicalexponeniialfamdy). Here. and to identify the distribution of sums of independent Cef random quantities. . where d*) is the nfold convolurion of a. from which the result follows straightforwardly. .4. ..12. (Sumiency in the canonical exponentialfamily). b: +) random quantities. Hipp (1974) and Huzurbazar (1976) establish.y. a + Proposition 4. lfyl.5 Models via Su$Ecienr Staristics 203 Systematic use of this canonical form to clarify the nature of the Bayesian learning process will be presented in Section 5. . we shall use it to examine briefly the nature and interpretation of the function b(+).15. where a(") satisfies 7ib(+) = log 1 a(")(s) exp{t/~'s}ds.14. A consequence is that sufficient statistics of fixed dimension exist. Moreover.2.nb. then is a suficient staristic and has a disrriburion Cef(s I a("). establishes the form of a("). For y in Definition 4.12. a Our discussion thus far has considered the situation where exchangeablebelief distributions are constructed by assuming a mixing over finiteparameter exponential family forms. We see immediately that the characteristic function of s is exp{nb(iu + +) . Proof. plus induction.l are independent Cef(y I a. +).2. It is easy to verify that the characteristic function of y conditional on E(exp{iu'y}I +) = exp{b(iu +) .nb(+)}. E(9 I +) = Vb(+)? V(Y 1111) = V 2 W . Proposition4. Koopman (1936).b ( + ) } . that the exponential family is the only family of distributions for which such sufficient statistics exist. classical results of Darmois (1936).
. Because of exchangeability. If we assume y l .rz. identified the parametric forms that had to appear in the mixture representation. .k < n. the conditional distributions have the above form (for some n defining a Cef(y [ ( I .. . for all 71and k < 11. is modelled by Now consider the form of p(yI.y. .15). in Section 4.. y.}. together with exchangeability.. does this im. suppose for a moment that an exchangeable sequence. we shall consider. . yk 5 s. instead. which imply that the mixing must be over exponential family forms. I y1+. . As a preliminary. which. to be exchangeable and also assume that.y?. whether characterisations can be established via assumptions about ronditiond distrihutions. . . . . = s). The exponential family mixture representation thus implies that. . in the numerator.204 4 Modelling In the second part of this subsection. Here. . ..4. 6.1 1 and 4. + Now suppose we consider the converse.. Previously. we shall consider the question of whether 1 there are structural assumptions about an exchangeable sequence s . with considerable licence in ignoring regularity conditions. this has a representation a a mixture over s + But the latter does not involve because of the suffkiency of y1 + .. { y. = y1+ * . However. +) form). . s. so that + + yI. . motivated by suflciency ideas. . 1990).. (Propositions 4. ) ply that p ( ~ . g r 8 has the corresponding exponential family mixture form? A rigorous mathematical discussion of this question is beyond the scope of this volume (see Diaconis and Freedman. . the result and the “flavour” of a proof are given by the following. we considered particular invariance assumptions. where.
Proposition 4.16. (Representation theorem under sufficiency). If $y_1, y_2, \ldots$ is any exchangeable sequence such that, for all $n \geq 2$ and $k < n$,
$$p(y_1,\ldots,y_k \mid y_1+\cdots+y_n = s) = \prod_{i=1}^{k} a(y_i)\, a^{(n-k)}(s - s_k)\Big/ a^{(n)}(s),$$
where $s_k = y_1 + \cdots + y_k$, then
$$p(y_1,\ldots,y_n) = \int \prod_{i=1}^{n}\mathrm{Cef}(y_i \mid a, b, \psi)\, dQ(\psi)$$
for some $dQ(\psi)$.

Outline proof. We first note that exchangeability implies a mixture representation, mixing over distributions which make the $y_i$ independent, with densities denoted generically by $f$. Independence implies that
$$f(y_1,\ldots,y_k \mid y_1+\cdots+y_n = s) = \prod_{i=1}^{k} f(y_i)\, f^{(n-k)}(s - s_k)\Big/ f^{(n)}(s),$$
where $f^{(j)}(\cdot)$ denotes the $j$-fold convolution of $f$, so that each of the latter distributions must itself satisfy the specified conditional form. Now consider $n = 2$, $k = 1$. Setting $y_2 = s - y_1$, $f$ must satisfy
$$\frac{f(y_1)\, f(y_2)}{f^{(2)}(y_1 + y_2)} = \frac{a(y_1)\, a(y_2)}{a^{(2)}(y_1 + y_2)}.$$
If we now define
$$u(y) = \log\frac{f(y)}{a(y)} - \log\frac{f(0)}{a(0)},$$
and note that $u(0) = 0$, we obtain
$$u(y_1) + u(y_2) = u(y_1 + y_2).$$
This implies that $u(y) = \psi'y$, for some $\psi$, so that
$$f(y) = a(y)\exp\{\psi'y - b(\psi)\},$$
with $b(\psi)$ determined by normalisation; that is, $f(\cdot)$ defines a $\mathrm{Cef}(y \mid a, b, \psi)$ form, and the mixing must therefore be over distributions of this form, for some $dQ(\psi)$. ◁

The following example provides a concrete illustration of the general result.
Example 4.8. (Characterisation of the Poisson model). Suppose that the sequence $y_1, y_2, \ldots$ of non-negative integer-valued random quantities is judged exchangeable, with, for all $n$ and $k < n$, the conditional distribution of $y = (y_1, \ldots, y_k)$ given $y_1 + \cdots + y_n = s$ specified to be the multinomial $\mathrm{Mu}_k(y \mid s, n^{-1}, \ldots, n^{-1})$; that is,
$$p(y_1,\ldots,y_k \mid y_1+\cdots+y_n = s) = \prod_{i=1}^{k}\frac{1}{y_i!}\cdot\frac{(n-k)^{s-s_k}}{(s - s_k)!}\cdot\frac{s!}{n^{s}},$$
which is of the form required in Proposition 4.16, with $a(y) = 1/y!$ and $a^{(n)}(s) = n^{s}/s!$. Noting that the Poisson distribution, which has density
$$\mathrm{Pn}(y \mid \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \ldots,$$
can be written in Cef form as $a(y)\exp\{\psi y - b(\psi)\}$, with $a(y) = 1/y!$, $\psi = \log\lambda$ and $b(\psi) = e^{\psi}$, it follows by Proposition 4.16 that the belief specification for $y_1, y_2, \ldots$ is coherent and implies that
$$p(y_1,\ldots,y_n) = \int_0^{\infty}\prod_{i=1}^{n}\mathrm{Pn}(y_i \mid \lambda)\, dQ(\lambda)$$
for some $dQ(\lambda)$.

As we remarked earlier, the above heuristic analysis and discussion for the 1-parameter regular exponential family has been given without any attempt at rigour. For the full story the reader is referred to Diaconis and Freedman (1990). Other relevant references for the mathematics of exponential families include Barndorff-Nielsen (1978), Morris (1982) and Brown (1985).

We conclude this subsection by considering, briefly and informally, what can be said about characterisations of exchangeable sequences as mixtures of non-regular exponential families. For concreteness, we shall focus on the uniform distribution, $\mathrm{U}(x \mid 0, \theta)$, $0 < x < \theta$, with sufficient statistic $\max\{x_1, \ldots, x_n\}$, given a sample $x_1, \ldots, x_n$. This sufficient statistic is clearly not a summation, as is the case for regular families (and as plays a key role in Proposition 4.16). However, it is straightforward to check that, conditional on $\max\{x_1, \ldots, x_n\} = s$, the observations $x_1, \ldots, x_k$, $k < n$, are approximately independent $\mathrm{U}(x \mid 0, s)$; for an exchangeable sequence constructed by mixing over independent $\mathrm{U}(x \mid 0, \theta)$ distributions, this conditional property will therefore hold.
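The key point of Example 4.8 can be checked numerically: for independent Poisson observations, the conditional distribution of an individual observation given the total $y_1 + \cdots + y_n = s$ is $\mathrm{Bi}(s, 1/n)$, whatever the value of $\lambda$, which is exactly why the conditional form carries no information about the parameter. A small sketch (our own illustration, not from the text):

```python
import math

def poisson(y, lam):
    return lam ** y * math.exp(-lam) / math.factorial(y)

def poisson_sum(t, m, lam):
    # the sum of m independent Pn(lam) quantities is Pn(m * lam)
    return (m * lam) ** t * math.exp(-m * lam) / math.factorial(t)

def cond_first_given_total(s, n, lam):
    # p(y1 | y1 + ... + yn = s) for independent Pn(lam) observations
    denom = poisson_sum(s, n, lam)
    return [poisson(y1, lam) * poisson_sum(s - y1, n - 1, lam) / denom
            for y1 in range(s + 1)]

def binomial(y, s, p):
    return math.comb(s, y) * p ** y * (1 - p) ** (s - y)

s, n = 7, 3
for lam in (0.5, 4.0):          # two very different values of the parameter
    pmf = cond_first_given_total(s, n, lam)
    target = [binomial(y, s, 1.0 / n) for y in range(s + 1)]
    # the conditional distribution is Bi(s, 1/n), whatever lam is
    assert all(abs(a - c) < 1e-12 for a, c in zip(pmf, target))
```

The same conditional pmf is obtained for both values of $\lambda$, illustrating the sufficiency of the total.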
We might wonder, conversely, whether positive exchangeable sequences having this conditional property are necessarily mixtures of independent $\mathrm{U}(x \mid 0, \theta)$ forms. Intuitively, if, as $n \to \infty$, $m_n = \max\{x_1, \ldots, x_n\}$ tends to a finite limit from below, one might expect the result to be true. This is indeed the case, but a general account of the required mathematical results is beyond our intended scope in this volume. The interested reader is referred to Diaconis and Freedman (1984), and to the further references discussed in Section 4.8.

4.5.4 Information Measures and the Exponential Family

Our approach to the exponential family has been through the concept of predictive sufficiency or, equivalently, parametric sufficient statistics. It is interesting to note, however, that exponential family distributions can also be motivated through the concept of the utility of a distribution (cf. Section 3.4), using the derived notions of approximation and discrepancy.

Consider the following problem. We seek to obtain a mathematical representation of a probability density $p(x)$, $x \in X$, which satisfies the $k$ (independent) constraints
$$\int_X h_i(x)\, p(x)\, dx = m_i, \qquad i = 1, \ldots, k,$$
where $m_1, \ldots, m_k$ are specified constants, and which, in addition, is to be approximated as closely as possible by a specified density $f(x)$.

We recall from Definition 3.20 (with a convenient change of notation) that the discrepancy of an approximation $f(x)$ from a probability density $p(x)$ assumed to be true is given by
$$\delta(f \mid p) = \int_X p(x)\log\frac{p(x)}{f(x)}\, dx,$$
where $f$ and $p$ are both assumed to be strictly positive densities over the same range. Note that we are interested in deriving a mathematical representation of the true probability density $p(x)$, not of the (specified) approximation $f(x)$. Thus, we minimise $\delta(f \mid p)$ over $p$, subject to the required constraints on $p$, rather than minimising $\delta(f \mid p)$ over $f$ subject to constraints on $f$. Incorporating the constraints, together with the normalising constraint $\int_X p(x)\, dx = 1$, by means of arbitrary constant multipliers $\theta_1, \ldots, \theta_k$ and $c$, we seek $p$ to minimise
$$F(p) = \int_X p(x)\log\frac{p(x)}{f(x)}\, dx - \sum_{i=1}^{k}\theta_i\int_X h_i(x)\, p(x)\, dx - c\int_X p(x)\, dx.$$
Proposition 4.17. (The exponential family as an approximation). The functional $F(p)$ defined above is minimised by
$$p(x) = \mathrm{Ef}(x \mid f, h, \theta) \propto f(x)\exp\Big\{\sum_{i=1}^{k}\theta_i h_i(x)\Big\},$$
where $f$ and $h(x) = [h_1(x), \ldots, h_k(x)]$ are as given in $F(p)$ and $\theta = (\theta_1, \ldots, \theta_k)$ is determined by the constraints.

Proof. By a standard variational argument (see, for example, Jeffreys and Jeffreys, 1946, Chapter 10), a necessary condition for $p$ to give a stationary value of $F(p)$ is that, for any function $\tau$ of sufficiently small norm,
$$\frac{\partial}{\partial\varepsilon}F\big(p(\cdot) + \varepsilon\tau(\cdot)\big)\Big|_{\varepsilon=0} = 0.$$
This condition reduces to
$$\int_X \tau(x)\Big[\log\frac{p(x)}{f(x)} + 1 - \sum_{i=1}^{k}\theta_i h_i(x) - c\Big]\, dx = 0,$$
from which it follows that
$$p(x) \propto f(x)\exp\Big\{\sum_{i=1}^{k}\theta_i h_i(x)\Big\},$$
as required. (For an alternative proof, see Kullback, 1959/1968, Chapter 3.) ◁

Each distinct $f$ defines a distinct exponential family with density $\mathrm{Ef}(x \mid f, h, \theta)$. If we wish to emphasise this derivation of the family, we shall refer to $\mathrm{Ef}(x \mid f, h, \theta)$ as the exponential family generated by $f$ and $h$. The resulting exponential family form for $p(x)$ was derived on the basis of a given approximation $f(x)$ and a collection of "constraint" functions $h(x) = [h_1(x), \ldots, h_k(x)]$. In general, specification of the sufficient statistic alone does not uniquely identify the form of $f(\cdot)$, and hence of $p(\cdot)$, within the exponential family framework.
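Proposition 4.17 can be illustrated on a finite sample space, where the integrals become sums. In the sketch below (our own; all numbers are illustrative), the approximation $f$ is a $\mathrm{Bi}(10, 1/2)$ density, a single constraint fixes the mean, and the candidate minimiser is the exponentially tilted form $p(x) \propto f(x)\exp\{\theta h(x)\}$; feasible perturbations never decrease the discrepancy. With $f$ constant, the same computation would yield the maximum entropy choice discussed next.

```python
import math

X = list(range(11))
f = [math.comb(10, x) / 2 ** 10 for x in X]   # the approximation: a Bi(10, 1/2) density
h = [float(x) for x in X]                     # single constraint function h(x) = x
m_target = 5.5                                # required value of E_p[h]

def tilt(theta):
    # candidate minimiser: p(x) proportional to f(x) * exp(theta * h(x))
    w = [fx * math.exp(theta * hx) for fx, hx in zip(f, h)]
    z = sum(w)
    return [wx / z for wx in w]

lo, hi = -10.0, 10.0          # the tilted mean is increasing in theta: bisect
for _ in range(200):
    mid = (lo + hi) / 2
    if sum(px * hx for px, hx in zip(tilt(mid), h)) < m_target:
        lo = mid
    else:
        hi = mid
p = tilt((lo + hi) / 2)
assert abs(sum(px * hx for px, hx in zip(p, h)) - m_target) < 1e-9

def discrepancy(q):
    # delta(f | q) = sum q log(q / f)
    return sum(qx * math.log(qx / fx) for qx, fx in zip(q, f) if qx > 0)

# perturbations preserving both constraints: tau supported on {4, 5, 6},
# with sum(tau) = 0 and sum(h * tau) = 4 - 10 + 6 = 0
tau = [0.0] * 4 + [1.0, -2.0, 1.0] + [0.0] * 4
for eps in (1e-3, -1e-3, 5e-3, -5e-3):
    q = [px + eps * tx for px, tx in zip(p, tau)]
    assert all(qx > 0 for qx in q)
    assert discrepancy(q) >= discrepancy(p)   # p is the constrained minimiser
```

Here the tilted binomial is again binomial, with success probability shifted so that the mean constraint holds, which is why the discrepancy cannot be reduced by any feasible perturbation.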
for example.J. the normal distribution with p = E ( z I p . Thus. the exponential distribution with 4l = E(x 14). labelled by a single index. Example 3.1 MODELS VIA PARTIAL EXCHANGEABILITY Models for Extended Data Structures In Section 4. following Definition 3. . .f. since minimising p ( z ) logp(z) dx is equivalent to maximising H ( p ) = . x2).6 4. and we shall need to extend and . and unrelated to other random . A’ = V(x I p . A) (c. which further help to pinpoint the appropriate specification of a parametric family.. .z2. in the sense that f is extremely diffusely spread over X. we have thus far restricted attention to the case of a single sequence of random quantities. together with a prior distribution dQ(8) for 8.5. Returning to the general problem of choosing p to be “as close as possible” to an “approximation” f.22?.4. A limiting form of this would be to consider f(x) = constant..the maximum entropy choice for p ( z ) is EX(LI4).judged to have various kinds of invariance or sufficiency properties. In the next section. in many (if not most) areas of application of statisticalmodelling the situation will be more complicated than this. z1. as a random sample from a parametric family with density p(s 18).6 Models via Partial Exchangeability 209 so that. if X = W and h(z) = x. .20). w ) 4 exp which.the maximum entropy choice for p ( z )turns out to be N(z I p. or when there are several possible ways of making exchangeable or related judgements about sequences. A). A).subject to the k constraints defined by h ( z ) it is interesting . We also briefly mentioned further possible kinds of judgements. of random quantities z1.8 M 4 } Jy exp {EL. . in order to concentrate on the basic conceptual issues. . which arise when several such sequencesof observationsare involved. i = 1. to ask what happens if the approximation f is very “vague”. quantities. The solution is then sx P(X) = { ct. However.52> . we need to specify f(z) = x0’/r(a)order to in identify the family. 
In order to concentrate on the basic conceptual issues, we have thus far restricted attention to the case of a single sequence of random quantities, judged to have various kinds of invariance or sufficiency properties, and unrelated to other random quantities. Clearly, however, in many (if not most) areas of application of statistical modelling the situation will be more complicated than this, and we shall need to extend and adapt the basic form of representation to deal with the perceived complexities of the situation: several such sequences of observations may be involved, or there may be several possible ways of making exchangeable or related judgements about sequences. Among the typical (but by no means exhaustive) kinds of situation we shall wish to consider are the following.

(i) Sequences $x_{i1}, x_{i2}, \ldots$ of random quantities are to be observed in each of $i \in I$ contexts. A modelling framework is required which enables us to learn, in some sense, about differences between some aspect of the responses in the different sequences. For example: we may have sequences of clinical responses to each of $I$ different drugs, or responses to the same drug used on $I$ different subgroups of a population, with $x_{ij}$ denoting the presence or absence of a specific type of disease, or a coding of voting intention, or whatever.

(ii) In each of $i \in I$ contexts, $j \in J$ different treatments are each replicated $k \in K$ times, with $x_{ijk}$ denoting observable responses for each context/treatment/replicate combination. A modelling framework is required which enables us to investigate differences between either contexts, or treatments, or context/treatment combinations. For example: we may have $I$ different irrigation systems for fruit trees and $J$ different tree pruning regimes, with $K$ trees exposed to each irrigation/pruning combination and $x_{ijk}$ denoting the total yield of fruit in a given year; or $I$ different geographical areas, $J$ different age-groups and $K$ individuals in each of the $IJ$ combinations.

(iii) Sequences $x_{i1}, x_{i2}, \ldots$ of random quantities are to be observed in each of $i \in I$ contexts, together with other specified (controlled or observed) quantities $z_i = (z_{i1}, z_{i2}, \ldots)$, where some form of qualitative assumption has been made about a form of relationship between the $x$'s and the $z$'s. A modelling framework is required which enables us to learn about the quantitative form of the relationship, and to quantify beliefs (predictions) about the observable $x$ corresponding to a specified input or control quantity $z$. For example: $x_{ij}$ might denote the status (dead or alive) of the $j$th rat exposed to a toxic substance administered at dose level $z_i$; or $x_{ij}$ might denote the height or weight at time $z_{ij}$ of a plant or animal following some assumed form of "growth curve"; or $x_{ij}$ might denote the output yield on the $j$th run of a chemical process when $k$ inputs are set at the levels $z_i = (z_{i1}, \ldots, z_{ik})$, and the general form of relationship between process output and inputs is either assumed known or well approximated by a specified mathematical form.

(iv) Exchangeable sequences $x_{i1}, x_{i2}, \ldots$ of random quantities are to be observed in each of $i \in I$ contexts, where $I$ is itself a selection from a potentially larger index set.
For example: sequence $i$ may consist of 0-1 (success-failure) outcomes on repeated trials with the $i$th of $I$ similar electronic components, or sequence $i$ may consist of quality measurements of known precision on replicate samples of the $i$th of $I$ chemically similar dyestuffs. In the first case, the sequence of long-run frequencies of failures for each of the components might, a priori, be judged to be exchangeable; in the second case, the sequence of large-sample averages of quality for each of the dyestuffs might, a priori, be judged to be exchangeable. A modelling framework is required which enables us to exploit such further judgements of exchangeability, in order to be able to use information from all the sequences to strengthen, in some sense, the learning process within an individual sequence.

4.6.2 Several Samples

We shall begin our discussion of possible forms of partial exchangeability judgements for several sequences of observables by considering the simple case of $m$ sequences of 0-1 random quantities. In many situations, including that of a comparative clinical trial, joint beliefs about several sequences of 0-1 observables would typically have the property encapsulated in the following definition.

Definition 4.13. (Unrestricted exchangeability for 0-1 sequences). Sequences of 0-1 random quantities $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are said to be unrestrictedly exchangeable if each sequence is infinitely exchangeable and, in addition, for all $n_i \leq N_i$, $i = 1, \ldots, m$,
$$p(x_1(n_1),\ldots,x_m(n_m) \mid y_1(N_1),\ldots,y_m(N_m)) = \prod_{i=1}^{m} p(x_i(n_i) \mid y_i(N_i)),$$
where, here and throughout this section, $x_i(n_i)$ denotes the vector of random quantities $(x_{i1}, \ldots, x_{in_i})$ and $y_i(N_i) = x_{i1} + \cdots + x_{iN_i}$.

Thus, this definition encapsulates the judgement that, in addition to the exchangeability of the individual sequences, given the total number of successes $y_i(N_i)$ in the first $N_i$ observations of the $i$th sequence, only the total for the $i$th sequence is relevant when it comes to beliefs about the outcomes of any subset of $n_i$ of those observations.
For example, given 15 deaths in the first 100 patients receiving Drug 1 ($N_1 = 100$, $y_1(N_1) = 15$) and 20 deaths in the first 80 patients receiving Drug 2 ($N_2 = 80$, $y_2(N_2) = 20$), we would typically judge the latter information to be irrelevant to any assessment of the probability that the first three patients receiving Drug 1 survived and the fourth one died. Of course, the information might well be judged relevant if we were not informed of the value of $y_1(N_1)$. The definition thus encapsulates a kind of "conditional irrelevance" judgement.

Further insight is obtained by noting (from Definition 4.13) that unrestricted exchangeability implies that
$$p(x_{11},\ldots,x_{1n_1},\ldots,x_{m1},\ldots,x_{mn_m}) = p(x_{1\pi_1(1)},\ldots,x_{1\pi_1(n_1)},\ldots,x_{m\pi_m(1)},\ldots,x_{m\pi_m(n_m)})$$
for any unrestricted choice of permutations $\pi_1, \ldots, \pi_m$ of $\{1,\ldots,n_1\}, \ldots, \{1,\ldots,n_m\}$, respectively. For a development starting from this latter condition, see de Finetti (1938).

As an example of a situation where this condition does not apply, suppose that $x_{11}, x_{12}, \ldots$ is an infinitely exchangeable 0-1 sequence and that a second sequence is defined by $x_{2j} = x_{1j}$, $j = 1, 2, \ldots$, certainly an exchangeable sequence (since $x_{11}, x_{12}, \ldots$ is). Taking $m = 2$ and $n_1 = n_2 = N_1 = N_2 = 2$,
$$p(x_{12} = 1 \mid y_1(2) = 1)\, p(x_{22} = 1 \mid y_2(2) = 1) = 1/2 \times 1/2 = 1/4,$$
whereas
$$p(x_{11} = 0,\, x_{12} = 1,\, x_{21} = 1,\, x_{22} = 0 \mid y_1(2) = 1,\, y_2(2) = 1) = 0,$$
since $x_{21} = x_{11}$: here, we only have invariance of the joint distribution under permutations with $\pi_1 = \pi_2$.

We can now establish the following generalisation of Proposition 4.1.

Proposition 4.18. (Representation theorem for several sequences of 0-1 random quantities). If $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are unrestrictedly infinitely exchangeable sequences of 0-1 random quantities with joint probability measure $P$, there exists a distribution function $Q$ on $[0,1]^m$ such that
$$p(x_1(n_1),\ldots,x_m(n_m)) = \int_{[0,1]^m}\prod_{i=1}^{m}\prod_{j=1}^{n_i}\theta_i^{x_{ij}}(1-\theta_i)^{1-x_{ij}}\, dQ(\theta_1,\ldots,\theta_m),$$
where $Q$ is the joint distribution of $(\theta_1, \ldots, \theta_m)$, with
$$\theta_i = \lim_{N_i\to\infty} y_i(N_i)/N_i, \qquad y_i(N_i) = x_{i1} + \cdots + x_{iN_i}.$$

Corollary. Under the conditions of Proposition 4.18,
$$p(y_1(n_1),\ldots,y_m(n_m)) = \int_{[0,1]^m}\prod_{i=1}^{m}\binom{n_i}{y_i(n_i)}\theta_i^{y_i(n_i)}(1-\theta_i)^{n_i-y_i(n_i)}\, dQ(\theta_1,\ldots,\theta_m).$$

Proof. We first note that
$$p(x_1(n_1),\ldots,x_m(n_m)) = \prod_{i=1}^{m}\binom{n_i}{y_i(n_i)}^{-1} p(y_1(n_1),\ldots,y_m(n_m)),$$
so that, to prove the proposition, it suffices to establish the corollary. For any $N_i \geq n_i$, $i = 1, \ldots, m$,
$$p(y_1(n_1),\ldots,y_m(n_m)) = \sum\cdots\sum p(y_1(n_1),\ldots,y_m(n_m) \mid y_1(N_1),\ldots,y_m(N_m))\, p(y_1(N_1),\ldots,y_m(N_m)),$$
where the $i$th of the $m$ summations ranges from $y_i(n_i)$ to $y_i(n_i) + N_i - n_i$. By Definition 4.13 and a straightforward generalisation of the argument given in Proposition 4.1, the first factor in each term factorises as a product of hypergeometric forms, which is equal to
$$\prod_{i=1}^{m}\binom{n_i}{y_i(n_i)}\Big(\frac{y_i(N_i)}{N_i}\Big)^{y_i(n_i)}\Big(1-\frac{y_i(N_i)}{N_i}\Big)^{n_i-y_i(n_i)},$$
uniformly in the $y_i(N_i)$, as $N_1, \ldots, N_m \to \infty$. Defining the function $Q_N(\cdot)$ on $\Re^m$ to be the $m$-dimensional "step" function with "jumps" of $p(y_1(N_1),\ldots,y_m(N_m))$ at the points $(y_1(N_1)/N_1,\ldots,y_m(N_m)/N_m)$, it follows, by the multidimensional version of Helly's theorem (see Section 3.2.3), that there exists a subsequence $Q_{N(j)}$ having a limit $Q$, which is a distribution function on $\Re^m$. The result follows. ◁
Considering, for simplicity, the case $m = 2$, Proposition 4.18 (or its corollary) asserts that if we judge two sequences of 0-1 random quantities to be unrestrictedly exchangeable, we can proceed as if:

(i) the $x_{ij}$ are independent Bernoulli random quantities (or the $y_i(n_i)$ are independent binomial random quantities), conditional on random quantities $\theta_i$, $i = 1, 2$;

(ii) $(\theta_1, \theta_2)$ are assigned a joint probability distribution $Q$;

(iii) by the strong law of large numbers, $\theta_i = \lim_{n\to\infty} y_i(n)/n$, so that $Q$ may be interpreted as "joint beliefs about the limiting relative frequencies of 1's in the two sequences".

The model is completed by the specification of $dQ(\theta_1, \theta_2)$, whose detailed form will, of course, depend on the particular beliefs appropriate to the actual practical application of the model. At a qualitative level, we note the following possibilities:

(a) knowledge of the limiting relative frequency for one of the sequences would not change beliefs about outcomes in the other sequence, so that we have the independent form of prior specification, $dQ(\theta_1, \theta_2) = dQ(\theta_1)\, dQ(\theta_2)$;

(b) the limiting relative frequency for the second sequence will necessarily be greater than that for the first sequence (due, for example, to a known improvement in a drug or an electronic component under test), so that $dQ(\theta_1, \theta_2)$ is zero outside the range $0 \leq \theta_1 < \theta_2 \leq 1$;

(c) there is a real possibility, to which an individual assigns probability $\pi$, say, that the limiting frequencies could turn out to be equal, so that, writing $\theta = \theta_1 = \theta_2$, $dQ(\theta_1, \theta_2)$ has the form
$$\pi\, dQ'(\theta) + (1 - \pi)\, dQ^*(\theta_1, \theta_2),$$
where $dQ'(\theta)$ assigns probability over the set $\theta_1 = \theta_2$ and $dQ^*(\theta_1, \theta_2)$ assigns probability over the range of values of $(\theta_1, \theta_2)$ such that $\theta_1 \neq \theta_2$.

As we shall see in Chapter 5, the general form of representation of beliefs for observables defined in terms of the two sequences, together with detailed specifications of $dQ(\theta_1, \theta_2)$, enables us to explore coherently any desired aspect of the learning process.
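Possibility (a) has a direct computational counterpart: under the independent specification $dQ(\theta_1, \theta_2) = dQ(\theta_1)\, dQ(\theta_2)$, data from the first sequence leave predictive beliefs about the second sequence unchanged. The sketch below (our own illustration; the Beta priors and all numbers are assumptions, not from the text) checks this by brute-force integration over the joint posterior:

```python
import math

def beta_pdf(t, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * t ** (a - 1) * (1 - t) ** (b - 1)

def predictive_seq2(y1, n1, y2, n2, a1=2.0, b1=3.0, a2=1.5, b2=1.5):
    # P(next observation in sequence 2 is a 1 | y1(n1), y2(n2)) under the
    # independent specification dQ(t1, t2) = Beta(t1|a1,b1) Beta(t2|a2,b2),
    # computed by brute-force midpoint integration over the joint posterior
    grid = [(i + 0.5) / 500 for i in range(500)]
    w1 = [beta_pdf(t, a1, b1) * t ** y1 * (1 - t) ** (n1 - y1) for t in grid]
    w2 = [beta_pdf(t, a2, b2) * t ** y2 * (1 - t) ** (n2 - y2) for t in grid]
    num = sum(u * v * t for u in w1 for v, t in zip(w2, grid))
    den = sum(u * v for u in w1 for v in w2)
    return num / den

p_a = predictive_seq2(y1=15, n1=100, y2=20, n2=80)
p_b = predictive_seq2(y1=90, n1=100, y2=20, n2=80)  # very different drug-1 data

assert abs(p_a - p_b) < 1e-6                 # sequence-1 data are irrelevant
assert abs(p_a - 21.5 / 83.0) < 1e-3         # matches the conjugate Beta update
```

Under specifications (b) or (c), by contrast, the $\theta_i$ are dependent a priori and data from either sequence would shift predictive beliefs about both.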
For example, we may have observed that, out of the first $n_1$ and $n_2$ patients receiving drug treatments 1 and 2, respectively, $y_1(n_1)$ and $y_2(n_2)$ survived, and may wish, on the basis of this information, to make judgements about the relative performance of the drugs were they to be used on a large future sequence of patients. This might be done by calculating, for example,
$$P\Big(\lim_{N_2\to\infty}\frac{y_2(N_2)}{N_2} > \lim_{N_1\to\infty}\frac{y_1(N_1)}{N_1}\,\Big|\, y_1(n_1), y_2(n_2)\Big),$$
which, in the language of the conventional paradigm, is the "posterior probability that $\theta_2$ exceeds $\theta_1$, given the data".

Clearly, the discussion and the resulting forms of representation which we have given for the case of unrestrictedly exchangeable sequences of 0-1 random quantities can be extended to more general cases, where, as discussed in previous sections, the parametric families have been identified through consideration of sufficient statistics of fixed dimension. One possible generalisation of Definition 4.13 is the following.

Definition 4.14. (Unrestricted exchangeability for sequences with predictive sufficient statistics). Sequences of random quantities $x_{i1}, x_{i2}, \ldots$, taking values in $X_i$, $i = 1, \ldots, m$, are said to be unrestrictedly infinitely exchangeable if each sequence is infinitely exchangeable and, in addition, for all $n_i \leq N_i$, $i = 1, \ldots, m$,
$$p(x_1(n_1),\ldots,x_m(n_m) \mid t_1,\ldots,t_m) = \prod_{i=1}^{m} p(x_i(n_i) \mid t_i),$$
where $t_i = t_i(x_i(N_i))$, $i = 1, \ldots, m$, are separately predictive sufficient statistics for the individual sequences.

In general, given $m$ unrestrictedly exchangeable sequences of random quantities, we typically arrive at a representation of the form
$$p(x_1(n_1),\ldots,x_m(n_m)) = \int_{\Theta^*}\prod_{i=1}^{m}\prod_{j=1}^{n_i} p_i(x_{ij} \mid \theta_i)\, dQ(\theta_1,\ldots,\theta_m),$$
where $\Theta^* = \Theta_1 \times \cdots \times \Theta_m$ and the parameters correspond to strong law limits of functions of the sufficient statistics. Most often, the fact that the $m$ sequences are being considered together will mean that the random quantities $x_{ij}$ relate to the same form of measurement or counting procedure for all sequences, so that typically we will have $p_i(x \mid \theta_i) = p(x \mid \theta_i)$, the same parametric family for each sequence. The following forms are frequently assumed in applications.
Example 4.9. (Binomial). If $y_i(n_i)$ denotes the number of 1's in the first $n_i$ outcomes of the $i$th of $m$ unrestrictedly exchangeable sequences of 0-1 random quantities, then
$$p(y_1(n_1),\ldots,y_m(n_m)) = \int_{\Theta}\prod_{i=1}^{m}\mathrm{Bi}(y_i(n_i) \mid \theta_i, n_i)\, dQ(\theta),$$
where $\theta = (\theta_1,\ldots,\theta_m)$, $\theta_i = \lim_{n\to\infty} y_i(n)/n$ and $\Theta = [0,1]^m$. This is the form discussed earlier in this section, the case $m = 2$ covering the comparative clinical trial.

Example 4.10. (Multinomial). If $y_i(n_i) = (y_{i1}(n_i),\ldots,y_{ik}(n_i))$ denotes the category membership count (into the first $k$ of $k+1$ exclusive categories) from the first $n_i$ outcomes of the $i$th of $m$ unrestrictedly exchangeable sequences, then
$$p(y_1(n_1),\ldots,y_m(n_m)) = \int_{\Theta}\prod_{i=1}^{m}\mathrm{Mu}_k(y_i(n_i) \mid \theta_i, n_i)\, dQ(\theta),$$
where $\theta_i = (\theta_{i1},\ldots,\theta_{ik})$, with $\theta_{ij} = \lim_{n\to\infty} y_{ij}(n)/n$, $0 \leq \theta_{ij} \leq 1$, $\theta_{i1} + \cdots + \theta_{ik} \leq 1$, and $\theta = (\theta_1,\ldots,\theta_m)$. This model describes beliefs about an $m \times (k+1)$ contingency table of count data, with row totals $n_1, \ldots, n_m$; it generalises the $m \times 2$ contingency table implicit in Example 4.9.

Example 4.11. (Normal). If $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, denote real-valued observations from $m$ unrestrictedly exchangeable sequences, the assumed sufficiency of the sample sum and sum of squares within each sequence might lead to the representation
$$p(x_1(n_1),\ldots,x_m(n_m)) = \int_{\Theta}\prod_{i=1}^{m}\prod_{j=1}^{n_i}\mathrm{N}(x_{ij} \mid \mu_i, \lambda_i)\, dQ(\mu, \lambda),$$
where, with $\bar{x}_i(n) = (x_{i1} + \cdots + x_{in})/n$,
$$\mu_i = \lim_{n\to\infty}\bar{x}_i(n), \qquad \lambda_i^{-1} = \lim_{n\to\infty}\frac{1}{n}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i(n))^2,$$
$\mu = (\mu_1,\ldots,\mu_m)$, $\lambda = (\lambda_1,\ldots,\lambda_m)$, $\theta = (\mu, \lambda)$ and $\Theta = \Re^m \times (\Re^+)^m$. This is the model most often used to describe beliefs about a one-way layout of measurement data. If, as in many applications, the further judgement is made that $\lambda_1 = \cdots = \lambda_m = \lambda$, the representation takes the form
$$p(x_1(n_1),\ldots,x_m(n_m)) = \int\prod_{i=1}^{m}\prod_{j=1}^{n_i}\mathrm{N}(x_{ij} \mid \mu_i, \lambda)\, dQ(\mu, \lambda).$$

As in the case of 0-1 random quantities, we could make analogous remarks concerning the various qualitative forms of specification of the prior distribution $Q$ that might be made in these cases. We shall not pursue this further here, but will comment further on such specifications below.
4.6.3 Structured Layouts

Let us now consider the situation described in (ii) of Section 4.6.1, where the random quantity $x_{ijk}$ is triple-subscripted to indicate that it is the $k$th of $K$ "replicates" of an observable in "context" $i \in I$, subject to "treatment" $j \in J$. In general terms, we have a two-way layout, having $I$ rows and $J$ columns, with $K$ replicates in each of the $IJ$ cells.

In such contexts, it is typically not the case that beliefs about the $x_{ijk}$ would be invariant under permutations of the subscripts $i, j$. For example, if rows represent age-groups, columns correspond to different drug treatments, replicates refer to sequences of patients within each age-group/treatment combination and the $x_{ijk}$ measure death-rates, most individuals would find it unacceptable to make a judgement of complete exchangeability for the random quantities $x_{ijk}$. On the other hand, for the kinds of mechanisms routinely used to allocate patients to treatment groups in clinical trials, many individuals would have exchangeable beliefs about $x_{ij1}, x_{ij2}, \ldots$, for any fixed $i, j$. Technically, such a situation corresponds to the invariance of joint beliefs for the collection of random quantities under some restricted set of permutations of the subscripts, rather than under the unrestricted set of all possible permutations (which would correspond to complete exchangeability). The precise nature of the appropriate set of invariances encapsulating beliefs in a particular application will, of course, depend on the actual perceived partial exchangeabilities in that application. In what follows, we shall simply motivate, using very minimal exchangeability assumptions, a model which is widely used in the context of the two-way layout. There is no suggestion that the particular form discussed has any special status, or ought to be routinely adopted.

Suppose, then, that for fixed $i$ and $j$ we think of $x_{ij1}, x_{ij2}, \ldots$ as a (potentially) infinite sequence of real-valued random quantities, and that the $IJ$ sequences of this kind, with $I$ and $J$ fixed, are judged to be unrestrictedly exchangeable. If further assumptions of centred spherical symmetry or sufficiency for each sequence then lead to the normal form of representation, the $x_{ijk}$ are assumed independently and normally distributed with means $\mu_{ij}$ and variances $(\lambda_{ij})^{-1}$. In many cases, the nature of the observational process leads to the judgement that $\lambda_{ij} = \lambda$, for all $i, j$, so that the random quantities $x_{ijk}$ are conditionally independently distributed with
$$p(x_{ijk} \mid \mu_{ij}, \lambda) = \mathrm{N}(x_{ijk} \mid \mu_{ij}, \lambda).$$

Letting $\mu_{i\cdot}$, $\mu_{\cdot j}$ and $\mu_{\cdot\cdot}$ denote the strong law limits of the row averages, column averages and overall average, respectively, we can always write
$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij},$$
where
$$\mu = \mu_{\cdot\cdot}, \quad \alpha_i = \mu_{i\cdot} - \mu_{\cdot\cdot}, \quad \beta_j = \mu_{\cdot j} - \mu_{\cdot\cdot}, \quad \gamma_{ij} = \mu_{ij} - \mu_{i\cdot} - \mu_{\cdot j} + \mu_{\cdot\cdot}.$$
In conventional terminology, $\mu$ is referred to as the overall mean, $\alpha_i$ as the $i$th row effect, $\beta_j$ as the $j$th column effect and $\gamma_{ij}$ as the $(i,j)$th interaction effect. Collectively, the $\{\alpha_i\}$ and $\{\beta_j\}$ are referred to as the main effects and the $\{\gamma_{ij}\}$ as the interactions. The full model representation is then completed by the specification of a prior distribution $Q$ for $\lambda$ and any $IJ$ linearly independent combinations of the $\mu_{ij}$. Interest in applications often centres on whether or not interactions or main effects are close to zero and, if not, on making inferences about the magnitudes of differences between different row or column effects.

In the above discussion, our exchangeability assumptions were restricted to the sequences $x_{ij1}, x_{ij2}, \ldots$, for fixed $i, j$. It is possible, of course, that further forms of symmetric beliefs might be judged reasonable for certain permutations of the $i, j$ subscripts. We shall return to this possibility in Section 4.6.5, where we shall see that certain further assumptions of invariance lead naturally to the idea of hierarchical representations of beliefs.
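The decomposition $\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ is a pure re-parametrisation of the cell means, and the derived effects automatically satisfy sum-to-zero constraints across each index. A minimal check (our own, with illustrative numbers):

```python
# cell means mu[i][j] for an I = 2 by J = 3 layout (illustrative numbers)
mu = [[4.0, 6.0, 5.0],
      [8.0, 7.0, 9.0]]
I, J = len(mu), len(mu[0])

overall = sum(sum(row) for row in mu) / (I * J)
row_eff = [sum(mu[i]) / J - overall for i in range(I)]
col_eff = [sum(mu[i][j] for i in range(I)) / I - overall for j in range(J)]
inter = [[mu[i][j] - overall - row_eff[i] - col_eff[j] for j in range(J)]
         for i in range(I)]

# the decomposition reproduces every cell mean exactly ...
for i in range(I):
    for j in range(J):
        rebuilt = overall + row_eff[i] + col_eff[j] + inter[i][j]
        assert abs(rebuilt - mu[i][j]) < 1e-12

# ... and the effects satisfy the usual sum-to-zero identifiability constraints
assert abs(sum(row_eff)) < 1e-12
assert abs(sum(col_eff)) < 1e-12
assert all(abs(sum(inter[i][j] for i in range(I))) < 1e-12 for j in range(J))
assert all(abs(sum(inter[i][j] for j in range(J))) < 1e-12 for i in range(I))
```

The sum-to-zero constraints are why only $IJ$ linearly independent combinations of the $\mu_{ij}$ (together with $\lambda$) need a prior specification.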
4.6.4 Covariates

In (iii) of Section 4.6.1, we gave examples of situations where beliefs about sequences of observables $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are functionally dependent, in some sense, on the observed values $z_{i1}, z_{i2}, \ldots$ of a related sequence of (random) quantities. We shall refer to the latter as covariates and, in recognition of this dependency, we shall denote the joint density of $x_1(n_1), \ldots, x_m(n_m)$ by
$$p(x_1(n_1),\ldots,x_m(n_m) \mid z_1,\ldots,z_m),$$
where $z_i = (z_{i1},\ldots,z_{in_i})$, $i = 1, \ldots, m$. The examples which follow illustrate some of the typical forms assumed in applications. Again, there is no suggestion that these particular forms have any special status; they simply illustrate some of the kinds of models which are commonly used.

Example 4.12. (Bioassay). Suppose that at each of $m$ specified dose levels $z_1, \ldots, z_m$, typically measured on a logarithmic scale, of a toxic substance, sequences of 0-1 random quantities $x_{i1}, x_{i2}, \ldots$, $i = 1, \ldots, m$, are to be observed, where $x_{ij} = 1$ if the $j$th animal receiving dose $z_i$ survives, and $x_{ij} = 0$ otherwise. If, for each $i = 1, \ldots, m$, the sequence $x_{i1}, x_{i2}, \ldots$ is judged exchangeable, a straightforward generalisation of the corollary to Proposition 4.18 implies a representation of the form
$$p(x_1(n_1),\ldots,x_m(n_m) \mid z_1,\ldots,z_m) = \int_{\Theta}\prod_{i=1}^{m}\prod_{j=1}^{n_i}\theta_i(z_i)^{x_{ij}}\{1-\theta_i(z_i)\}^{1-x_{ij}}\, dQ(\theta(z)),$$
where $\theta_i(z_i) = \lim_{n\to\infty}\bar{x}_i(n)$, $\theta(z) = (\theta_1(z_1),\ldots,\theta_m(z_m))$ and $\Theta = [0,1]^m$.

In many situations, investigators find it reasonable to assume that $\theta_i(z_i) = G(\phi_1 + \phi_2 z_i)$, with $\phi = (\phi_1, \phi_2) \in \Re^2$, where the functional form $G$ (usually monotone increasing from 0 to 1) is specified, but $\phi$ is a random quantity. Functions of the form $G(\phi_1 + \phi_2 z_i) = \Phi(\phi_1 + \phi_2 z_i)$ (the probit model) and
$$G(\phi_1 + \phi_2 z_i) = \frac{\exp(\phi_1 + \phi_2 z_i)}{1 + \exp(\phi_1 + \phi_2 z_i)}$$
(the logit model) are widely used (see, for example, Hewlett and Plackett, 1979), the latter being the most common. For any specified $G(\cdot)$, if we denote the number of survivors out of the $n_i$ animals observed in the $i$th sequence by $y_i(n_i)$,
‘+ = 1og[0~&/(20~t & ) j / the time at which growth is halfway from the initial to the final level. Suppose that at each of irt specified time points.02). 2 0. . In the probit and logit cases. say i . y(@: 2 . . Suppose further that the kinds of judgements outlined in Example 4.. (.:. . where the functional form y (usually monotone increasing) is specified. about c ’ ~ Q2then acquire an operational meaning as beliefs about the average growthlevels. than in terms of (0. . . . corresponding to the “saturation” growth level reached as s. ) = cil where dQ(& A ) specifying a prior distribution for @ E @ and X E %*. . 5 .!. z. but @ is a random quantity.. for example. + &:. ..Q I /c72. a 02 Example 4. As with Example 4.\. . specification of C) might be facilitated if we reparametrise from q5 to a more suitable ( I . vl = . .. in. where . are to be observed. for example. ~. Beliefs . (Growthcurues). 2. .1 + .. . and ~2 = (ol 02)’. . .12. the joint predictive density representation has the form 022. with t = 1.x.r. .r.r. A.respectively..1 1 are made about the sequences .. that would be observed from a large number of replicate measurements..): = (0. . and at times “. = I)(@).(z))and0: = W x (W)“‘.?.... In the logistic case. ~ + I.x” “0. = . I = 1. at which G(o.. say. is the jth replicate measurement (perhaps on a logarithmic scale) of the size or weight of the subject or object of interest at time 2 .) = 1/2. so that we have the representation wherefiJ(z)= ( p l ( z. + correspondingto the growth level at the “time origin”.(Z) = = X p.C > ~ / O ~ corresponds to the (log) dose..sequences of realvalued random quantities.1) transformation.. = dl I . Experimenters might typically be more accustomed to thinking in terms of ( . Commonly assumed forms include g ( ~2. In practice. Beliefs about L‘nI then correspond to beliefs about the (log)dose level for which the survival frequency in a large series of animals would equal 112. 
the specification of (1 might be facilitated by reparametrising from Q to a more suitable (11) transformation @ = @(@). ’ . A third possible parameter to which investigators could easily relate i n some applications might be c.) = g ( @ ::. z . .. . X 1 ( z . . ) ) In many such situations. the judgement is made that X l ( r ) = . .. (the /ogisric model) and (the straightline model). ( t ) . the socalled LD50 dose...13. 0 ~ c .I.. : :.(:..) . w .220 4 Modellitig with dQ’(0) specifying a prior distribution for 0 E (P.(%I (particularly if measurements are made on a logarithmic scale) and that /I. we might take v ! ..). p . . . For any specified g ( .I.
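The logistic reparametrisation just described can be checked numerically. The sketch below (plain Python; the parameter values are illustrative, not taken from the text) verifies that φ_2 is the growth level at the time origin and that φ_3 is the time at which growth is halfway between the initial and final levels.

```python
import math

def g(psi1, psi2, psi3, z):
    """Logistic growth curve g(psi; z) = psi1 / (1 + psi2 * exp(-psi3 * z))."""
    return psi1 / (1.0 + psi2 * math.exp(-psi3 * z))

def reparametrise(psi1, psi2, psi3):
    """Map (psi1, psi2, psi3) to the operationally meaningful (phi1, phi2, phi3)."""
    phi1 = psi1                          # saturation level as z -> infinity
    phi2 = psi1 / (1.0 + psi2)           # growth level at the time origin z = 0
    phi3 = math.log(2.0 + psi2) / psi3   # time at which growth is halfway from phi2 to phi1
    return phi1, phi2, phi3

psi = (10.0, 4.0, 0.5)                   # hypothetical illustrative values
phi1, phi2, phi3 = reparametrise(*psi)

assert abs(g(*psi, 0.0) - phi2) < 1e-12          # initial level recovered
assert abs(g(*psi, phi3) - 0.5 * (phi1 + phi2)) < 1e-9  # halfway level at time phi3
```

Solving g(ψ; z) = (φ_1 + φ_2)/2 for z gives exp(-ψ_3 z) = 1/(2 + ψ_2), which is where the expression for φ_3 comes from.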
Example 4.14. (Multiple regression). Suppose that, as in Example 4.13, sequences x_{i1}, x_{i2}, ..., i = 1, ..., m, of real-valued random quantities are to be observed, where each x_{ij} is related to certain specified observed quantities z_i = (z_i^(1), ..., z_i^(k)), the values of the regressor variables. Suppose that judgements are made which lead to the belief representation

  p(x_1, ..., x_m) = ∫ ∏_{i=1}^m ∏_{j=1}^{n_i} N(x_{ij} | μ(z_i), λ) dQ(μ(·), λ),

where λ and μ(·) are unknown, but the latter is assumed to be a "smooth" function, adequately approximated by a first-order Taylor expansion,

  μ(z) ≈ μ(z*) + (z - z*) ∇μ(z*),

for some (unspecified) z*. Redefining the covariates, if necessary, so that the constant term is absorbed (taking z^(1) = 1, say), this means that we may write

  μ(z) = θ_1 z^(1) + · · · + θ_k z^(k),

with z = (z^(1), ..., z^(k)) (row vector) and θ = (θ_1, ..., θ_k)' (column vector). Conditional on θ and λ, the joint distribution of x = (x_{11}, ..., x_{1n_1}, ..., x_{m1}, ..., x_{mn_m})' is thus seen to be multivariate normal,

  p(x | θ, λ) = N_n(x | Aθ, λI_n),

where A is an n × k matrix (n = n_1 + · · · + n_m), whose rows consist of z_1 replicated n_1 times, followed by z_2 replicated n_2 times, and so on, and I_n denotes the n × n identity matrix. The unconditional representation can therefore be written as

  p(x) = ∫ N_n(x | Aθ, λI_n) dQ(θ, λ),

for some Q. The form μ = Aθ is called a regression equation; it is conventional to refer to θ as the vector of regression coefficients and to A as the design matrix.
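The construction of the design matrix A, with each covariate row z_i replicated n_i times, can be sketched as follows (plain Python; the covariate values and coefficients are hypothetical illustrations, with the first covariate fixed at 1 so that θ_1 acts as an intercept):

```python
def design_matrix(zs, counts):
    """Stack each covariate row z_i, replicated n_i times, into the n x k matrix A."""
    A = []
    for z, n in zip(zs, counts):
        A.extend([list(z)] * n)
    return A

def matvec(A, theta):
    """Compute the mean vector E(x | A, theta, lambda) = A theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

zs = [(1.0, 0.5), (1.0, 2.0)]        # illustrative covariate rows z_1, z_2
counts = [3, 2]                      # n_1 = 3 replicates of z_1, n_2 = 2 of z_2
A = design_matrix(zs, counts)        # 5 x 2 design matrix
mean = matvec(A, (1.0, 2.0))         # hypothetical theta = (1, 2)
assert mean == [2.0, 2.0, 2.0, 5.0, 5.0]
```

Conditional on θ and λ, the vector `mean` is the mean of the multivariate normal distribution for the n = 5 observations.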
The structure E(x | A, θ, λ) = Aθ, linear in the unknown parameters θ, is said to define a linear model. In the general case we have a multiple regression model, in which beliefs about θ relate to the intercept (θ_1) of the regression equation and the marginal rates of change (θ_2, ..., θ_k) of the expected response with respect to the regressor variables, given the values of the relevant covariates. If k = 2, with z^(1) = 1 and z^(2) = z, we have the simple regression (straight-line) model. Within this general structure we can also represent various special cases such as z^(j) = z^{j-1} (polynomial regression) or z^(j) = sin(jπz/k) (a version of trigonometric regression): in these cases, beliefs about θ will stem from rather different considerations.

Specification of the kinds of structures which we have illustrated in Examples 4.12 to 4.14 essentially reduces to the same process as we have seen in earlier representations of joint predictive densities as integral mixtures. We proceed as if: (i) the random quantities are conditionally independent, given the unknown parameters; (ii) the latter are assigned a prior distribution. In many cases, the likelihood involves familiar probability models, often of exponential family form (as with the binomial, normal and multivariate normal examples seen above), but with at least some of the usual "labelling" parameters replaced by more complex functional forms involving the covariates. From a conceptual point of view, this is all that really needs to be said for the time being. However, when we consider the applications of such models, together with the problems of computation and approximation which arise in implementing the Bayesian learning process, it is often useful to have a more structured taxonomy in mind: for example, linear versus non-linear functional forms, normal versus non-normal distributions, and so on.

4.6.5 Hierarchical Models

In Section 4.6.2, we considered the general situation where several sequences of random quantities x_{i1}, ..., x_{in_i}, i = 1, ..., m, are judged unrestrictedly infinitely exchangeable, leading typically to a joint density representation of the form

  p(x_1, ..., x_m) = ∫ ∏_{i=1}^m ∏_{j=1}^{n_i} p(x_{ij} | θ_i) dQ(θ_1, ..., θ_m).

We remarked at that time that nothing can be said, in general, about the prior specification Q(θ_1, ..., θ_m), since this must reflect whatever beliefs are appropriate for the specific application being modelled. However, it is often the case that additional judgements about relationships among the m sequences lead to interestingly structured forms of Q(θ_1, ..., θ_m).
In Section 4.6.2, we noted some of the possible contexts in which judgements of exchangeability might be appropriate not only for the random quantities within each of m separate sequences of observables, but also between the m strong law limits of appropriately defined statistics for each of the sequences. For example, if the sequences consist of success-failure outcomes on repeated trials with m different (but, to all intents and purposes, "similar") types of component, it might be reasonable to judge the m "long-run success frequencies" to be themselves exchangeable. The following examples illustrate this kind of structured judgement and the forms of hierarchical model which result.

Example 4.15. (Exchangeable binomial parameters). Suppose that we have m unrestrictedly infinitely exchangeable sequences of 0-1 random quantities x_{i1}, ..., x_{in_i}, i = 1, ..., m, so that, with θ_i = lim_{n→∞} (y_i(n)/n), where y_i(n) = x_{i1} + · · · + x_{in}, the statistic y_i(n_i) is sufficient for the ith sequence and

  p(y_1(n_1), ..., y_m(n_m)) = ∫ ∏_{i=1}^m Bi(y_i(n_i) | θ_i, n_i) dQ(θ_1, ..., θ_m).

If the m types of component can be thought of as a selection from a potentially infinite sequence of similar components, it might be reasonable to judge the parameters θ_1, ..., θ_m to be themselves exchangeable. This corresponds to specifying an exchangeable form of prior distribution for the parameters θ_1, ..., θ_m, and we then have (see Section 4.3) the general representation

  Q(θ_1, ..., θ_m) = ∫ ∏_{i=1}^m G(θ_i) dQ(G).

The complete model structure is then seen to have the hierarchical form

  p(y_1(n_1), ..., y_m(n_m) | θ_1, ..., θ_m) = ∏_{i=1}^m Bi(y_i(n_i) | θ_i, n_i),
  Q(θ_1, ..., θ_m | G) = ∏_{i=1}^m G(θ_i),
  Q(G).

In conventional terminology, the first stage of the hierarchy relates data to parameters via binomial distributions; the second stage models the binomial parameters as if they were a random sample from a distribution G; the third, and final, stage specifies beliefs about G.
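The two lower stages of this hierarchy can be sketched by simulation (plain Python; as an illustrative assumption, G is taken to be a Beta(a, b) distribution, anticipating the parametric second stage discussed next, and the hyperparameter values are hypothetical):

```python
import random

random.seed(0)

def sample_hierarchy(m, n, a, b):
    """One draw from a two-stage hierarchy: theta_i ~ Beta(a, b) i.i.d. (playing
    the role of G), then y_i | theta_i ~ Binomial(n, theta_i)."""
    ys = []
    for _ in range(m):
        theta = random.betavariate(a, b)
        y = sum(1 for _ in range(n) if random.random() < theta)
        ys.append(y)
    return ys

# Marginally, E(y_i) = n * a / (a + b); a Monte Carlo check:
m, n, a, b = 10000, 10, 2.0, 3.0
ys = sample_hierarchy(m, n, a, b)
mean = sum(ys) / len(ys)
assert abs(mean - n * a / (a + b)) < 0.2
```

The draws y_1, ..., y_m are exchangeable but not independent: integrating out G induces the dependence that the hierarchical prior is designed to capture.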
The above example is readily generalised to the case of exchangeable parameters for any one-parameter exponential family. In practice, as we shall indicate in the following example, beliefs about G might concentrate on a particular parametric family, with densities g(θ | λ), say, labelled by a hyperparameter λ, so that, assuming the existence of the appropriate densities, the prior specification takes the form

  Q(θ_1, ..., θ_m) = ∫ ∏_{i=1}^m g(θ_i | λ) dQ_λ(λ),

and the complete model has the hierarchical form

  p(y_1, ..., y_m | θ_1, ..., θ_m) = ∏_{i=1}^m p(y_i | θ_i),
  Q(θ_1, ..., θ_m | λ) = ∏_{i=1}^m g(θ_i | λ),
  Q_λ(λ).

As before, the first stage of the hierarchy relates data to parameters; the second stage now models the parameters as if they were a random sample from a parametric family labelled by the hyperparameter; the third, and final, stage specifies beliefs about the hyperparameter. Such beliefs acquire operational significance by identifying the hyperparameter with appropriate strong law limits of observables.

Example 4.16. (Exchangeable normal mean parameters). Suppose that we have m unrestrictedly infinitely exchangeable sequences x_{i1}, x_{i2}, ..., i = 1, ..., m, of real-valued random quantities, for which the joint density has the representation

  p(x_1, ..., x_m) = ∫ ∏_{i=1}^m ∏_{j=1}^{n_i} N(x_{ij} | μ_i, λ) dQ(μ_1, ..., μ_m, λ),

where we recall that μ_i = lim_{n→∞} x̄_i(n), with x̄_i(n) = n^{-1}(x_{i1} + · · · + x_{in}), and λ^{-1} = lim_{n→∞} s_i^2(n), the limiting mean sum of squares of observations about the sample mean. So far as the specification of Q(μ_1, ..., μ_m, λ) is concerned, we first note that in many applications it is helpful to think in terms of the product form

  Q(μ_1, ..., μ_m, λ) = Q_μ(μ_1, ..., μ_m | λ) Q_λ(λ),
for some Q_μ and Q_λ. In some cases, knowledge of the strong law limits of sums of squares about the mean may be judged irrelevant to the assessment of beliefs for the strong law limits of the sample averages: in such cases, Q_μ(μ_1, ..., μ_m | λ) will not depend on λ. In other cases, we might believe, for example, that variation among the limiting sample averages is certainly bigger (or certainly smaller) than within-sequence variation of observations about the sample mean: in such cases, Q_μ(μ_1, ..., μ_m | λ) will involve λ.

If the m sequences can be thought of as a selection from a potentially infinite collection of similar sequences, the limiting sample means may be judged exchangeable, and it would then be natural (see Section 4.3) to consider a further representation of Q_μ as a mixture over distributions for the μ_i. In either case, beliefs about the mixing distribution might concentrate on a particular parametric family, so that, conditional on a hyperparameter φ = (φ_1, φ_2), we take

  Q_μ(μ_1, ..., μ_m | λ, φ) = ∏_{i=1}^m g_μ(μ_i | λ, φ),  with  g_μ(μ_i | λ, φ) = N(μ_i | φ_1, φ_2).

For an explicit example of this, suppose that, given a potentially infinite sequence μ_1, μ_2, ..., the quantities μ̄(m) = m^{-1}(μ_1 + · · · + μ_m) and s_μ^2(m) = m^{-1} Σ_{i=1}^m (μ_i - μ̄(m))^2 (or, more concretely, for very large n_i, the large-sample analogues of x̄(m) and s^2(m)) were judged sufficient for the sequence: it would then be natural (see Section 4.5) to take g_μ to be normal. The complete model then has the hierarchical structure

  p(x | μ_1, ..., μ_m, λ) = ∏_{i=1}^m ∏_{j=1}^{n_i} N(x_{ij} | μ_i, λ),
  Q(μ_1, ..., μ_m | λ, φ) = ∏_{i=1}^m N(μ_i | φ_1, φ_2),
  Q(φ, λ).

From an operational standpoint, the final stage specification of the joint prior distribution for φ_1, φ_2 and λ then reduces to a specification of beliefs about the following limits of observable quantities (for large m and n_1, ..., n_m):

(i) the mean of all the observations from all the sequences (φ_1);
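The two lower stages of this normal hierarchy can be sketched by simulation (plain Python; the precision parametrisation of the text is kept, so that variances are 1/φ_2 and 1/λ, and the hyperparameter values are hypothetical):

```python
import random
import statistics

random.seed(1)

def sample_obs(m, n, phi1, phi2, lam):
    """Two-stage normal hierarchy, precision parametrisation as in the text:
    mu_i ~ N(phi1, 1/phi2), then x_ij | mu_i ~ N(mu_i, 1/lam)."""
    xs = []
    for _ in range(m):
        mu = random.gauss(phi1, (1.0 / phi2) ** 0.5)
        xs.extend(random.gauss(mu, (1.0 / lam) ** 0.5) for _ in range(n))
    return xs

# Marginally Var(x_ij) = 1/phi2 + 1/lam; a Monte Carlo check:
xs = sample_obs(m=4000, n=5, phi1=0.0, phi2=2.0, lam=1.0)
assert abs(statistics.mean(xs)) < 0.1
assert abs(statistics.variance(xs) - (0.5 + 1.0)) < 0.2
```

The decomposition of the marginal variance into 1/φ_2 (between-sequence) and 1/λ (within-sequence) components is exactly what the operational limits (i)-(iii) discussed next are identifying.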
(ii) the mean sum of squares of the individual sequence means about the overall mean (φ_2^{-1});

(iii) the mean (over sequences) of the mean sum of squares of observations within a sequence about the sequence mean (λ^{-1}).

The precise form of specification at this stage will, of course, depend on the particular situation in which the model is being applied.

Hierarchical modelling provides a powerful and flexible approach to the representation of beliefs about observables in extended data structures, and is being increasingly used in statistical modelling and analysis. This section has merely provided a brief introduction to the basic ideas and the way in which such structures arise naturally within a subjectivist, modelling framework. An extensive discussion of hierarchical modelling will be given in the volumes Bayesian Computation and Bayesian Methods, where links will be made with empirical Bayes ideas. In the context of the Bayesian learning process, further brief discussion and a selection of references to the literature on inference for hierarchical models will be given in Section 5.6.

4.7 PRAGMATIC ASPECTS

4.7.1 Finite and Infinite Exchangeability

The de Finetti representation theorem for 0-1 random quantities, and the various extensions we have been considering in this chapter, characterise forms of p(x_1, ..., x_n) for observables x_1, ..., x_n assumed to be part of an infinite exchangeable sequence. However, mathematical representations which correspond to probabilistic mixing over conditionally independent parametric forms do not hold, in general, for finite exchangeable sequences. To see this, consider n = 2 and finitely exchangeable 0-1 random quantities x_1, x_2 such that

  P(x_1 = 0, x_2 = 1) = P(x_1 = 1, x_2 = 0) = 1/2,
  P(x_1 = 0, x_2 = 0) = P(x_1 = 1, x_2 = 1) = 0.

If the de Finetti representation held, we would have, for some Q(θ),

  0 = P(x_1 = 1, x_2 = 1) = ∫ θ^2 dQ(θ),
  0 = P(x_1 = 0, x_2 = 0) = ∫ (1 - θ)^2 dQ(θ),

an impossibility, since the latter would have to assign probability one to both θ = 0 and θ = 1 (Diaconis and Freedman, 1980a).
It appears, therefore, that there is a potential conflict between realistic modelling (acknowledging the necessarily finite nature of actual exchangeability judgements) and the use of conventional mathematical representations (derived on the basis of assumed infinite exchangeability). To discuss this problem, let us call an exchangeable sequence x_1, ..., x_n N-extendible if it is part of the longer exchangeable sequence x_1, ..., x_N. Infinite exchangeability corresponds to the possibly unrealistic assumption of N-extendibility for all N > n. Intuitively, one might feel that if x_1, ..., x_n can be considered as part of a larger, but finite, potential sequence of exchangeable observables, no important distortion will occur in quantifying uncertainties as if the sequence were infinitely exchangeable.

In general, the assumption of infinite exchangeability implies that the probability assigned to an event E defined in terms of (x_1, ..., x_n) ∈ X^n is of the form

  P(E) = ∫ F^n(E) dQ(F),

for some Q. If we denote by P_N(E) the corresponding probability assigned under N-extendibility for a specific N, a possible measure of the "distortion" introduced by assuming infinite exchangeability is given by

  sup_E | P_N(E) - P(E) |,

where the supremum is taken over all events in the appropriate σ-field on X^n. This is made precise by the following.

Proposition 4.19. (Finite approximation of infinite exchangeability). With the preceding notation, if x_1, ..., x_n, with x_i ∈ X, is N-extendible, there exists Q such that

  sup_E | P_N(E) - ∫ F^n(E) dQ(F) | ≤ f(n) n / N,

where f(n) is the number of elements in X, if the latter is finite, and f(n) = (n - 1) otherwise.

Proof. See Diaconis and Freedman (1980a) for a rigorous statement and technical details.

The message is clear and somewhat comforting. Practical judgements of exchangeability for specific observables x_1, ..., x_n are typically of this kind: the x_1, ..., x_n are N-extendible for some N >> n. If a realistic judgement of N-extendibility for large, but finite, N is replaced by the mathematically convenient assumption of infinite exchangeability, the "distortion" should be somewhat negligible. For extensions of Proposition 4.19 to multivariate and linear model structures, see Diaconis et al. (1992). For further discussion, see Diaconis (1977), Jaynes (1986) and Hill (1992).
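A standard concrete illustration of this distortion is sampling without replacement from an urn: n draws from an urn of N balls form an N-extendible exchangeable sequence. The sketch below (plain Python; the urn composition is hypothetical, and the simple point-mass choice of Q at a Bernoulli(R/N) distribution is used rather than the optimal Q of the proposition) computes the distortion exactly and checks it against the bound with f(n) = 2 for the two-element X = {0, 1}.

```python
from math import comb

def seq_prob_urn(N, R, n, k):
    """Probability of one particular 0-1 sequence with k ones among n draws
    without replacement from an urn with R ones among N balls."""
    p = 1.0
    for j in range(k):
        p *= (R - j) / (N - j)
    for j in range(n - k):
        p *= (N - R - j) / (N - k - j)
    return p

def distortion(N, R, n):
    """sup_E |P_N(E) - P(E)| against the i.i.d. Bernoulli(R/N) approximation,
    computed as half the L1 distance between the two joint laws."""
    p = R / N
    total = 0.0
    for k in range(n + 1):
        iid = p ** k * (1 - p) ** (n - k)
        total += comb(n, k) * abs(seq_prob_urn(N, R, n, k) - iid)
    return total / 2.0

# Even this naive Q satisfies the f(n) * n / N bound here, with f(n) = 2.
N, R, n = 100, 40, 5
assert distortion(N, R, n) <= 2 * n / N
```

Increasing N with the same urn proportion shrinks the distortion at the rate the proposition suggests.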
4.7.2 Parametric and Nonparametric Models

In Section 4.3, we saw that the assumption of exchangeability for a sequence x_1, x_2, ... of real-valued random quantities implied a general representation of the joint distribution function of x_1, ..., x_n of the form

  p(x_1, ..., x_n) = ∫ ∏_{i=1}^n F(x_i) dQ(F),

where Q(F) = lim_{n→∞} P(F_n), F_n being the empirical distribution function defined by x_1, ..., x_n, so that Q represents our beliefs about "what the empirical distribution would look like for large n". This implies that we should proceed as if we have a random sample from an unknown distribution function F. As we have seen, the task of assessing and representing such a belief distribution Q over the set F of all possible distribution functions is by no means straightforward, since F is, effectively, an infinite-dimensional parameter. Most of this chapter has therefore been devoted to exploring additional features of beliefs which justify the restriction of F to families of distributions having explicit mathematical forms involving only a finite-dimensional labelling parameter. Conventionally, albeit somewhat paradoxically, representations in the finite-dimensional case are referred to as parametric models, whereas those involving the infinite-dimensional parameter are referred to as nonparametric models! The technical key to Bayesian nonparametric modelling is thus seen to be the specification of appropriate probability measures over function spaces, rather than over finite-dimensional real spaces. For this reason, the Bayesian analysis of nonparametric models requires considerably more mathematical machinery than the corresponding analysis of parametric models. Among important references on this topic, we note Whittle (1958), Hill (1968, 1988, 1992), Dickey (1969), Kimeldorf and Wahba (1970), Good and Gaskins (1971, 1980), Leonard (1973), Ferguson (1973, 1974), Antoniak (1974), Doksum (1974), Susarla and van Ryzin (1976), Ferguson and Phadia (1979), Dalal and Hall (1980), Dykstra and Laud (1981), Padgett and Wei (1981), Rolin (1983), Lo (1984), Thorburn (1986), Kestemont (1987), Berliner and Hill (1988), Wahba (1988), Hjort (1990), Lenk (1991) and Lavine (1992a). In the rest of this volume we will deal exclusively with the parametric case, postponing a treatment of nonparametric problems to the volumes Bayesian Computation and Bayesian Methods.

As we remarked at the end of Section 4.3, the use of specific parametric forms can often be given a formal motivation or justification as the coherent representation of certain forms of belief characterised by invariance or sufficiency properties.
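The empirical distribution function F_n referred to above admits a short sketch (plain Python; the data values are illustrative):

```python
def ecdf(xs):
    """Empirical distribution function F_n from observations x_1, ..., x_n:
    F_n(t) = (number of x_i <= t) / n."""
    n = len(xs)
    def F(t):
        # count of observations <= t (a simple linear scan; fine for small n)
        return sum(1 for x in xs if x <= t) / n
    return F

F = ecdf([2.0, 1.0, 3.0, 2.0])
assert F(0.5) == 0.0 and F(2.0) == 0.75 and F(3.0) == 1.0
```

It is over the limiting behaviour of such step functions that the mixing measure Q in the representation above expresses beliefs.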
From the standpoint of the general representation theorem, such judgements correspond to acting as if one has a Q which concentrates on a subset of F defined in terms of a finite-dimensional labelling parameter. In practice, however, there are often less formal, more pragmatic, reasons for choosing to work with a particular parametric model (as there often are for acting, formally, as if particular forms of summary statistic were sufficient!). Specific parametric models are often suggested by exploratory data analysis (typically involving graphical techniques to identify plausible distributional shapes and forms of relationship with covariates), or by experience (i.e., historical reference to "similar" situations, where a given model seemed "to work"), or by scientific theory (which determines that a specific mathematical relationship "must" hold, in accordance with an assumed "law"). In each case, of course, the choice involves subjective judgements; regarding, for example, such things as the "straightness" of a graphical normal plot, the "similarity" between a current and a previous trial, and the "applicability" of a theory to the situation under study.

4.7.3 Model Elaboration

However, in arriving at a particular parametric model specification, by means of whatever combination of formal and pragmatic judgements have been deemed appropriate, a number of simplifying assumptions will necessarily have been made (either consciously or unconsciously). It would always be prudent, therefore, to "expand one's consciousness" a little in relation to an intended model in order to review the judgements that have been made. Depending on the context, the following kinds of critical questions might be appropriate:

(i) is it reasonable to assume that all the observables form a "homogeneous sample", or might a few of them be "aberrant" in some sense?

(ii) is it reasonable to apply the modelling assumptions to the observables on their original scale of measurement, or should the scale be transformed to logarithms, reciprocals, or whatever?

(iii) when considering temporally or spatially related observables, is it reasonable to have made a particular conditional independence assumption, or should some form of correlation be taken into account?

(iv) if some, but not all, potential covariates have been included in the model, is it reasonable to have excluded the others, or might some of them be important, either individually or in conjunction with covariates already included?

We shall consider each of these possibilities in turn.
.vjhnnurion ekuhorution.? . . x f amight be aberrant. . x. Pettit and Smith ( 1985).the aberrant observation will "outlie". z2. Such models are usually referred to as "outlier" models. Dawid ( 1973).. (? # 0. If aberrant observations are assumed to be such that a sequence of them would have a limiting mean equal to I / .rl.prior belief in the relative inaccuracy of' an aberrant observation as a "predictor" of p is reflected in the weight attached by the prior distribution Q ( ? )t o values of!much smaller than 1. Tran. with specified probability T. . 1990). Smith ( 1983). Arnaiz and RuizRivas ( 1986). sf. a suitable form of elaborated model might be . it is thought wise to allow for the fact that (an unknown) one of . tion 1.I. . since 7 < 1 implies an increased probability that. . 1992). . = 0 ) . beliefs about the sequence . .I)/*.7 E 1') ( . A.rj. there are no aberrant observations. in the observed sample sI.I. . . . but. . 0 < 2 < 1. . defined by 1. Guttman and Peiia (1988) and Peiia and Guttman (1993). Suppose now that judgements about a sequence xl . if'a suitaMe :were idetitijied..I. . .. aberrant observation.! . but a limiting mean square about the mean equal to (7 A)'.p ) ? 111.. De Finetti ( I96 1 ) and Box and Tiao ( 1968) are pioneering Bayesian papers on this topic. This model corresponds to an initial assumption that. of realvalued random quantities are such that it seems reasonable to sup/' i pose that.4 Modelling Outlier elaboration. . ( J ' . Muirhead ( 1986). r l . . Pettit ( 1986. .] = ( ? A ) ~ ' I .E [ ( r . . ) . .Generalisations to cover more than one possible aberrant observation can be constructed in an obviously analogous manner. = log(.\..7 ~ there is precisely one . Since for an aberrant observa. . of realvalued random quantities have led to serious consideration of the model but.. on reflection. which is equally likely to be any one of . More recent literature includes.t.'. 1988b. . 
where p and A denote the corresponding limits for nonaberrant observations. with probability 1. O'Hagan (1979. Freeman ( 1980). Suppose that judgements about a sequence zI. West ( I984 1985).
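The one-outlier elaboration can be sketched as a mixture density (plain Python; μ, λ, γ and π are fixed at illustrative values here rather than integrated over a prior, so this is the bracketed integrand only):

```python
import math

def normal_pdf(x, mu, lam):
    """N(x | mu, lambda), with lambda a precision, as in the text."""
    return math.sqrt(lam / (2 * math.pi)) * math.exp(-0.5 * lam * (x - mu) ** 2)

def outlier_density(xs, mu, lam, gamma, pi_ok):
    """With probability pi_ok no observation is aberrant; with probability
    1 - pi_ok exactly one (equally likely to be any x_j) has the reduced
    precision gamma * lam, gamma < 1."""
    n = len(xs)
    clean = math.prod(normal_pdf(x, mu, lam) for x in xs)
    contaminated = 0.0
    for j in range(n):
        term = normal_pdf(xs[j], mu, gamma * lam)
        for i, x in enumerate(xs):
            if i != j:
                term *= normal_pdf(x, mu, lam)
        contaminated += term / n
    return pi_ok * clean + (1 - pi_ok) * contaminated

# gamma = 1 recovers the unelaborated model exactly.
xs = [0.1, -0.3, 2.5]
base = math.prod(normal_pdf(x, 0.0, 1.0) for x in xs)
assert abs(outlier_density(xs, 0.0, 1.0, 1.0, 0.7) - base) < 1e-12
```

Setting the elaboration parameter γ to 1 collapses the mixture back to the "first thought of" model, anticipating the simplification process of Section 4.7.4.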
. for example. might provide a better scale on which to assume a normal parametric model. Suppose that judgements about z1. for a given y E [ . etc.. 1985).y = 1 / 2 .r. . . 3) = ? h .X . ~ ( ~ 1 7 .and .4.x.r. = n ~ ( rI P ~ X ) . . and conditionalon p and A. Pericchi (1981) and Sweeting (1984. t = 1. again lead to a "first thought of model in which I. the observations correspond to successive timepoints. For detailed developments see Box and Cox (1964).(since. A..+h is given by R(. the elaborated model admits the possibility that trans= formations such as reciprocal. or logarithm.1. .. . r l n . Correlationelaboration. ~ 0. t = 2.x2. I P. the correlation between x. If r includes values such as y = 1. so that . 1). Judgements about the relative plausibilities of these and other possible transformations are then incorporated in &+. square root.7 Pragmatic Aspects 231 would plausibly have the representation It then follows that where 1 n 4 The case y = 1 corresponds to assuming a normal parametric model for the observations on their original scale of measurement. . t r=l but that it is then recognised that there may be a serial correlation structure among zl.) A possible extension of the representation to incorporate such correlation might be to assume that. + h 1 p.
The elaborated model then becomes

  p(x_1, ..., x_n) = ∫_Γ ∫ N_n(x | μ1_n, λA(γ)) dQ(μ, λ) dQ_γ(γ),

where 1_n = (1, ..., 1)' and A(γ) denotes the precision structure implied by the assumed serial correlation pattern (so that A(0) = I_n). The value γ = 0 corresponds to the "first thought of" model, and beliefs about the relative plausibility of this value compared with other possible values of positive or negative correlation are reflected in the specification of Q_γ.

Covariate elaboration. Suppose that the "first thought of" model for the observables x = (x_1, ..., x_n) is the multiple regression model with representation

  p(x) = ∫ N_n(x | Aθ, λI_n) dQ(θ, λ),

as described in Example 4.14 of Section 4.6.4, where A consists of rows containing the values z_i = (z_i^(1), ..., z_i^(k)) of the covariates z^(1), ..., z^(k), with z_i replicated n_i times, i = 1, ..., m. If it is subsequently thought that covariates z^(k+1), ..., z^(k+l) should also have been taken into account, a suitable elaboration might take the form of an extended regression model

  p(x) = ∫ N_n(x | Aθ + Bγ, λI_n) dQ*(θ, λ, γ),

where B consists of rows containing the values of the additional covariates, with the corresponding replications, and γ = (γ_1, ..., γ_l)' denotes the regression coefficients of the additional regressor variables z^(k+1), ..., z^(k+l). The "first thought of" model corresponds to γ = 0, and beliefs about the relative plausibility of this value compared with others are reflected in the specification of Q*.

In all these cases, inference about such a γ, imaginatively chosen to reflect interesting possible forms of departure from the original model, often provides a natural basis for checking on the adequacy of an initially proposed model, as well as learning about the directions in which the model needs extending. Other Bayesian approaches to the problem of covariate selection include Bernardo and Bermúdez (1985), Mitchell and Beauchamp (1988) and George and McCulloch (1993a).
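The reduction of the covariate-elaborated mean Aθ + Bγ to the original mean Aθ when γ = 0 can be sketched as follows (plain Python; matrices and coefficients are hypothetical illustrations):

```python
def extended_mean(A, theta, B, gamma):
    """Mean vector A theta + B gamma of the covariate-elaborated regression;
    gamma = 0 recovers the 'first thought of' model mean A theta."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    return [a + b for a, b in zip(matvec(A, theta), matvec(B, gamma))]

A = [[1.0, 0.5], [1.0, 2.0]]         # original design matrix (illustrative)
B = [[3.0], [1.0]]                   # column for one additional covariate
theta = [1.0, 2.0]

assert extended_mean(A, theta, B, [0.0]) == [2.0, 5.0]   # reduces to A theta
assert extended_mean(A, theta, B, [1.0]) == [5.0, 6.0]
```

Concentrating the posterior for γ near 0 is thus the formal counterpart of judging the additional covariates dispensable.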
4.7.4 Model Simplification

The process of model elaboration, outlined in the previous section, consists in expanding a "first thought of" model to include additional parameters (and possibly covariates), reflecting features of the situation whose omission from the original model formulation is, on reflection, thought to be possibly injudicious. The converse process is also of interest: in reviewing a currently proposed model, we might wonder whether some parameters (or covariates) have been unnecessarily included, in the sense that a simpler form of model might be perfectly adequate. As it stands, this latter consideration is somewhat ill-defined: the "adequacy", or otherwise, of a particular form of belief representation can only be judged in relation to the consequences arising from actions taken on the basis of such beliefs. These and other questions relating to the fundamentally important area of model comparison and model choice will be considered at length in Chapter 6. For the present, it will suffice just to give an indication of some particular forms of model simplification that are routinely considered.

Equality of parameters. In Section 4.6.2, we analysed the situation where several sequences of observables are judged unrestrictedly infinitely exchangeable, leading to a general representation of the form

  p(x_1, ..., x_m) = ∫ ∏_{i=1}^m ∏_{j=1}^{n_i} p(x_{ij} | θ_i) dQ(θ_1, ..., θ_m),

where θ_i ∈ Θ, and the parameter θ_i relating to the ith sequence can typically be interpreted as the limit of a suitable summary statistic for the ith sequence. If, on reflection, the simplifying judgement were made that, in fact, the labelling of the sequences is irrelevant and that any combined collection of observables from any or all of the sequences would be completely exchangeable, we would have the representation

  p(x_1, ..., x_m) = ∫ ∏_{i=1}^m ∏_{j=1}^{n_i} p(x_{ij} | θ) dQ(θ),

where the same parameter θ ∈ Θ now suffices to label the parametric model for each of the sequences. In conventional terminology, the simplified representation is sometimes referred to as the null hypothesis (θ_1 = · · · = θ_m) and the original representation as the alternative hypothesis (θ_i ≠ θ_j, for some i, j). As we saw in Section 4.6 (for the case of two 0-1 sequences), rather than opt for sure for one or other of these representations, we could take a mixture of the two (with weight π, say, on the null representation and 1 - π on the alternative). This form of representation will be considered in more detail in Chapter 6, where it will be shown to provide a possible basis for evaluating the relative plausibility of the "null and alternative hypotheses" in the light of data.
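The mixture weighting of the null and alternative representations can be sketched for two binomial sequences (plain Python; the choice of uniform priors for the success frequencies is an illustrative assumption, made so that the marginal likelihoods have closed Beta-function forms):

```python
from math import comb, lgamma, exp

def beta_fn(a, b):
    """Beta function via log-gamma, for numerical stability."""
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def marginal_binomial(y, n):
    """Marginal probability of y successes in n trials, theta ~ Uniform(0, 1)."""
    return comb(n, y) * beta_fn(y + 1, n - y + 1)

def marginal_shared(y1, n1, y2, n2):
    """Marginal probability when both sequences share one theta ~ Uniform(0, 1)."""
    return comb(n1, y1) * comb(n2, y2) * beta_fn(y1 + y2 + 1, n1 + n2 - y1 - y2 + 1)

def prob_null(y1, n1, y2, n2, pi_null):
    """Posterior weight on the simplified (theta_1 = theta_2) representation."""
    m0 = marginal_shared(y1, n1, y2, n2)
    m1 = marginal_binomial(y1, n1) * marginal_binomial(y2, n2)
    return pi_null * m0 / (pi_null * m0 + (1 - pi_null) * m1)

# Similar observed frequencies favour the simplified representation;
# very different frequencies favour the alternative.
assert prob_null(7, 10, 8, 10, 0.5) > 0.5
assert prob_null(1, 10, 9, 10, 0.5) < 0.5
```

The prior weight π on the null is updated by the ratio of the two marginal likelihoods, which is the mechanism Chapter 6 develops in general.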
Absence of effects. In Section 4.6, we considered the situation of a structured layout with replicate sequences of observations in each of I × J cells, and a possible parametric model representation involving row effects (α_1, ..., α_I), column effects (β_1, ..., β_J) and interaction effects (γ_11, ..., γ_IJ). A commonly considered simplifying assumption is that there are no interaction effects (γ_11 = · · · = γ_IJ = 0), so that large sample means in individual cells are just the additive combination of the corresponding large sample row and column means. Further possible simplifying judgements would be that the row (or column) labelling is irrelevant, so that α_1 = · · · = α_I = 0 (or β_1 = · · · = β_J = 0) and large sample cell means coincide with column (or row) means.

Omission of covariates. Considering, for example, the multiple regression case, described in Example 4.14 of Section 4.6.4 and reconsidered in the previous section on model elaboration, if γ denotes the regression coefficients of the covariates we are considering omitting, then the model corresponding to γ = 0 provides the required simplification. In fact, in all the cases of elaboration which we considered in the previous section, setting the "elaboration parameter" to 0 provides a natural form of simplification of potential interest. Again, we see that here the simplification process is very clearly just the converse of the elaboration process; whether the process of model comparison and choice is seen as one of elaboration or of simplification is then very much a pragmatic issue of whether we begin with a "smaller" model and consider making it "bigger", or vice versa. Conventional terminology would refer to these simplifying judgements as "null hypotheses". In any case, issues of model comparison and choice require a separate detailed and extensive treatment, which we defer until Chapter 6.

4.7.5 Prior Distributions

The operational subjectivist approach to modelling views predictive models as representations of beliefs about observables (including limiting, large-sample functions of observables, conventionally referred to as parameters). Invariance and sufficiency considerations have then been shown to justify a structured approach to predictive models in terms of integral mixtures of parametric models with respect to distributions for the labelling parameters. In familiar terminology, we specify a distribution for the observables conditional on unknown parameters (a sampling distribution, defining a likelihood), together with a distribution for the unknown parameters (a prior distribution). It is the combination of prior and likelihood which defines the overall model. In the mixture representation, the specification of a prior distribution for unknown parameters is therefore an essential
and unavoidable part of the process of representing beliefs about observables and hence of learning from experience. We are, therefore, in fundamental disagreement with approaches to statistical modelling and analysis which proceed only on the basis of the sampling distribution or likelihood and treat the prior distribution as something optional, irrelevant, or even subversive (see Appendix B). That said, it should be readily acknowledged that the process of representing prior beliefs itself involves a number of both conceptual and practical difficulties, and certainly cannot be summarily dealt with in a superficial or glib manner.

From a conceptual point of view, prior beliefs about parameters typically acquire an operational significance and interpretation as beliefs about limiting (large-sample) functions of observables. Care must therefore obviously be taken to ensure that prior specifications respect logical or other constraints pertaining to such limits. Often, as we have repeatedly stressed throughout this chapter, the specification process will be facilitated by suitable "reparametrisation". From a practical point of view, detailed treatment of specific cases is very much a matter of "methods" rather than "theory" and will be dealt with in the third volume of this series. However, a general overview of representation strategies, together with a number of illustrative examples, will be given in the inference context in Chapter 5. In particular, we shall see that the range of creative possibilities opened up by the consideration of mixtures, asymptotics, robustness and sensitivity analysis, as well as novel and flexible forms of inference reporting, provides a rich and illuminating perspective and framework for inference, within which many of the apparent difficulties associated with the precise specification of prior distributions are seen to be of far less significance than is commonly asserted by critics of the Bayesian approach. From the operational, subjectivist perspective, it is meaningless to approach modelling solely in terms of the parametric component, ignoring the prior distribution.

4.8 DISCUSSION AND FURTHER REFERENCES

4.8.1 Representation Theorems

The original representation theorem for exchangeable 0-1 random quantities appears in de Finetti (1930), the concept of exchangeability having been considered earlier by Haag (1924) and also in the early 1930s by Khintchine (1932). Extensions to the case of general exchangeable random quantities appear in de Finetti (1937/1964) and Dynkin (1953), with an abstract analytical version appearing in Hewitt and Savage (1955). Seminal extensions to more complex forms of symmetry (partial exchangeability) can be found in de Finetti (1938) and Freedman
there is an interesting sense. is seen as a ”subjective” optional extra. more complex invariance or sufficiency. For related developments from a reliability perspective. However. In contrast.8. the corresponding parametric form would be the “true” model for the observables. 4 Modelling See Diaconis and Freedman ( 1980b) and Wechsler ( 1993) for overviews and generalisations of the concept of exchangeability. It is often implicit in such discussion that if the “true” parameter were known. the two are actually inseparable in defining a belief model. 1994) and Mendel (1992). linguistic separation into “likelihood” (or “sampling model”) and “prior” components. Instead of viewing these roles as corresponding to an objectivdsubjcctive dichotomy. providing the mechanism whereby the observables are generated. It is as if. Diaconis (1988a) and. Useful reviews are given by Aldous ( 1985). Ressel(l985). In the absence of any general agreement over assumptions of symmetry. The latter then defines ”objective” probabilities for outcomes defined in terms of observables. 1986b). such an approach seeks to make a very clear distinction between the nature of observables and parameters.IS part of ”objective reality”. 1990) and Kuchlerand Lauritzen ( 1989). We have noted how this illuminates. Important progress is made in Diaconis and Freedman (1984. 4. see Barlow and Mendel (1992. given the ”true” parameter. But we have also stressed that. a potential contaminant of the objective statements provided by the parametric mtdel. Lauritzen ( 1982. even from our standpoint. See. 1982b. also.236 ( 1962). Recent and current developments have generated an extensive catalogue of characterisations of distributions via both invariance and sufficiency conditions. all concerned with their belief distributions for the same sequence of observables. partial symmetry. traditional discussion of a statistical model typically refers to the parametric form as “the model”. 
from a rather different perspective.2 Subjectivity and Objectivity Our approach to modelling has been dictated by a subjectivist. on the other hand. invariance or . The conference proceedings edited by Koch and Spizzichino (1982) also provides a wealth of related material and references. and puts into perspective. consider a group of Bayesians. operational concern with individual beliefs about (potential) observables. To this end. we view them in terms of an intersubjective/subjectivedichotomy (following Dawid. the corresponding parametric distribution is seen . The “prior”. from our standpoint. we have seen how mixtures over conditionally independent “parameterlabelled’’ forms arise as typical representations of such beliefs. Clearly. 1987. in which the parametric model and the prior can be seen as having different roles. this view has little in common with the approach we have systematically followed in this volume. these probabilities being determined by the values of the “unknown parameters“. Clearly. Through judgements of symmetry. 1988).
that the key element here is intersubjective agreement or consensus. However. the parametric forms adopted will be the same (the intersubjective component). a basis for separating out. 4. 1990). there are typically . however. However.2 to the fact that shared structural belief assumptions among a group of individuals can imply the adoption of a common form of parametric model. "shorthand" for intersubjective communality of beliefs. Such intersubjective agreementclearly facilitatescommunication within the group and reduces areas of potential disagreement to just that of different prior judgements for the parameter. (iii) Identtj5abiliry and (iv) Robustness Considerations. for example. one must necessarily work with simplified representations.8. However. and Lehmann. so that such a group of Bayesians may eventually come to share very similar beliefs. two components. Nonsubjectivist discussions of the role and nature of models in statistical analysis tend to have a rather different emphasis (see.8. the forms of representation theorems we have been discussing provide. we have drawn attention in Section 4. We emphasise again. As we shall see in Chapter 5 . Within the mixture. 1990. while the priors for the parameter will differ from individual to individual (the subjective component). the fundamental notion of a model is that of a predictive probability specification for observables. as a possibly convenient. and the belief model for the parameters. even if their initial judgements about the parameter were markedly different. if required. Cox. judgements about the parameter will tend more towards a consensus as more data are acquired. The Role and Nature of Models In the approach we have adopted. One might go further and argue that without some element of agreement of this kind there would be great difficulty in obtaining any meaningful form of scientific discussion or possible consensus. In order to think about complex phenomena. 
the individuals are each simply left with their own subjective assessments. but potentially dangerously misleading. (ii) Structurul und Stochastic Assumptions.4. implicit or explicit. perhaps. given some set of common assumptions. Indeed. the parametric model.3 Critical Issues We conclude this chapter on modelling with some further comments concerning (i) The Role and Nature of Models.the results of this chapter imply that the entire group will structure their beliefs using some common form of mixture representation.8 Discussion and Further References 237 sufficiency. in typical cases. while allowing the belief models for the parameters to vary from individual to individual. such discussions often end up with a similar message. about the importanceof models in providing a focused framework to serve as a basis for subsequent identificationof areas of agreementand disagreement. In any given context. We can find no real role for the idea of objectivity except.
a number of different choices of degrees of simplification and idealisation that might be adopted, and these different choices correspond to what Lehmann calls "a reservoir of models". In such discussions, particular emphasis is placed on transparent characterisations or descriptions of the models that would facilitate "the understanding of when a given model is appropriate" (Lehmann, 1990). But appropriate for what? Many authors, including Cox and Lehmann, highlight a distinction between what one might call scientific and technological approaches to models. The essence of the dichotomy is that scientists are assumed to seek explanatory models, which aim at providing insight into and understanding of the "true" mechanisms of the phenomenon under study; whereas technologists are content with empirical models, which are not concerned with the "truth", but simply with providing a reliable basis for practical action in predicting and controlling phenomena of interest. Put very crudely, in terms of our generic notation, explanatory modellers take the form of p(x | θ) very seriously, whereas empirical modellers are simply concerned that p(x) "works". If, in fact, a phenomenon is governed by the specific mechanism p(x | θ) with θ = θ₀, a scientist who discovers this and sets p(x) = p(x | θ₀) will certainly have a p(x) that "works". For an elaboration of the latter view, see Leonard (1980). The approach we have adopted is compatible with either emphasis. Whilst we would not dispute that there are typically real differences in motivation and rhetoric between scientists and technologists, we are personally rather sceptical about taking the science versus technology distinction too seriously. As we have stressed many times, it is observables which provide the touchstone of experience; when comparing rival belief specifications, all other things being equal, we are intuitively more impressed with the one which consistently assigns higher probabilities to the things that actually happen. It seems to us that theories are always ultimately judged by the predictive power they provide. Is there really a meaningful concept of "truth" in this context other than a pragmatic one predicated on p(x)? We shall return to this issue in Chapter 6, but our prejudices are well captured in the adage: "all models are false, but some are useful".

Structural and Stochastic Assumptions

In Section 4.6, we considered several illustrative examples where the key role of the parametric model component p(x | θ) was to specify structured forms of expectations for the observables conditional on the parameters, separate from considerations about the complete form of probability specification to be adopted. We recall two examples.
In the case of observables x_ijk in a two-way layout with replications (Section 4.6.3), with parameters corresponding to overall mean, main effects and interactions, we encountered the form

E(x_{ijk} \mid \mu, \alpha, \beta, \gamma) = \mu + \alpha_i + \beta_j + \gamma_{ij};
in the case of a vector of observables x in a multiple regression context with design matrix A (Section 4.6.4, Example 4.14), we encountered the form

E(\boldsymbol{x} \mid A, \boldsymbol{\theta}) = A\boldsymbol{\theta}.
In both of these cases, fundamental explanatory or predictive structure is captured by the specification of the conditional expectation, and this aspect can in many cases be thought through separately from the choice of a particular specification of the full probability distribution.
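As a concrete illustration of separating the expectation structure from the full distributional specification, the sketch below (all numerical values are invented for the example) evaluates the two conditional-expectation forms just recalled, without committing to any particular error distribution:

```python
# Sketch of the two structured conditional-expectation forms, kept separate
# from any choice of full probability distribution. All parameter values
# below are purely illustrative.

# Two-way layout: E[x_ijk | mu, alpha, beta, gamma] = mu + alpha_i + beta_j + gamma_ij
mu = 10.0
alpha = [1.0, -1.0]                  # row effects (sum to zero)
beta = [0.5, 0.0, -0.5]              # column effects (sum to zero)
gamma = [[0.2, -0.1, -0.1],          # interaction effects
         [-0.2, 0.1, 0.1]]

def expectation_two_way(i, j):
    return mu + alpha[i] + beta[j] + gamma[i][j]

# Multiple regression: E[x | A, theta] = A theta
A = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]   # design matrix
theta = [0.5, 2.0]                          # regression coefficients

def expectation_regression():
    return [sum(a * t for a, t in zip(row, theta)) for row in A]

print(expectation_two_way(0, 1))   # mu + alpha_1 + beta_2 + gamma_12
print(expectation_regression())
```

Either expectation structure could then be combined with a normal, Student or other distributional assumption, which is exactly the point of the separation.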
Identifiability
A parametric model in which an element of the parametrisation is redundant is said to be non-identified. Such models are often introduced at an early stage of model building (particularly in econometrics) in order to include all parameters which may originally be thought to be relevant. Identifiability is a property of the parametric model, but a Bayesian analysis of a non-identified model is always possible if a proper prior on all the parameters is specified. For detailed discussion of this issue, see Morales (1971), Drèze (1974), Kadane (1974), Florens and Mouchart (1986), Hills (1987) and Florens et al. (1990, Section 4.5).
Robustness Considerations

For concreteness, in our earlier discussion of these examples we assumed that the p(x | θ) terms were specified in terms of normal distributions. As we demonstrated earlier in this chapter, under the a priori assumption of appropriate invariances, or on the basis of experience with particular applications, such a specification may well be natural and acceptable. However, in many situations the choice of a specific probability distribution may feel a much less "secure" component of the overall modelling process than the choice of conditional expectation structure. For example, past experience might suggest that departures of observables from assumed expectations resemble a symmetric bell-shaped distribution centred around zero. But a number of families of distributions match these general characteristics, including the normal, Student and logistic families. Faced with a seemingly arbitrary choice, what can be done in a situation like this to obtain further insight and guidance? Does the choice matter? Or are subsequent inferences or predictions robust against such choices?
An exactly analogous problem arises with the choice of mathematical specifications for the prior model component. In robustness considerations, theoretical analysis, sometimes referred to as "what if?" analysis, has an interesting role to play. Using the inference machinery which we shall develop in Chapter 5, the desired insight and guidance can often be obtained by studying mathematically the ways in which the various "arbitrary" choices affect subsequent forms of inferences and predictions. For example, a "what if?" analysis might consider the effect of a single, aberrant, outlying observation on inferences for main effects in a multi-way layout under the alternative assumptions of a normal or Student parametric model distribution. It can be shown that the influence of the aberrant observation is large under the normal assumption, but negligible under the Student assumption, thus providing a potential basis for preferring one or other of the otherwise seemingly arbitrary choices. More detailed analysis of such robustness issues will be given in Section 5.6.3.
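The flavour of such a "what if?" comparison can be sketched numerically. In the illustration below, all choices (the data, the flat prior on the location, the unit scale, and the 4 degrees of freedom for the Student model) are assumptions made purely for the example; the posterior mean of a location parameter is computed on a grid under normal and Student sampling models when one observation is aberrant:

```python
import math

# Sketch of a "what if?" robustness analysis: posterior inference for a
# location parameter mu under normal versus Student-t sampling models,
# with one aberrant observation. Flat prior on mu; scale fixed at 1;
# t degrees of freedom (4) chosen for illustration only.

data = [0.1, -0.3, 0.2, 0.0, -0.1, 0.3, 8.0]   # last value is an outlier

def log_lik_normal(mu):
    return sum(-0.5 * (x - mu)**2 for x in data)

def log_lik_student(mu, nu=4):
    return sum(-(nu + 1) / 2 * math.log(1 + (x - mu)**2 / nu) for x in data)

def posterior_mean(log_lik):
    grid = [i / 100 for i in range(-300, 1001)]   # mu in [-3, 10]
    lw = [log_lik(m) for m in grid]
    mx = max(lw)                                  # stabilise before exp
    w = [math.exp(v - mx) for v in lw]
    return sum(m * v for m, v in zip(grid, w)) / sum(w)

mean_normal = posterior_mean(log_lik_normal)
mean_student = posterior_mean(log_lik_student)
print(round(mean_normal, 2), round(mean_student, 2))
```

Under these assumptions the normal-model posterior mean is dragged substantially towards the outlier, while the Student-model mean stays near the bulk of the data, matching the qualitative claim above.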
Chapter 5
Inference
Summary
The role of Bayes' theorem in the updating of beliefs about observables in the light of new information is identified and related to conventional mechanisms of predictive and parametric inference. The roles of sufficiency, ancillarity and stopping rules in such inference processes are also examined. Forms of common statistical decisions and inference summaries are introduced, and the problems of implementing Bayesian procedures are discussed at length. In particular, conjugate, asymptotic and reference forms of analysis and numerical approximation approaches are detailed.
5.1
5.1.1
THE BAYESIAN PARADIGM
Observables, Beliefs and Models
Our development has focused on the foundational issues which arise when we aspire to formal quantitative coherence in the context of decision making in situations of uncertainty. This development, in combination with an operational approach to the basic concepts, has led us to view the problem of statistical modelling as that of identifying or selecting particular forms of representation of beliefs about observables.
For example, in the case of a sequence x_1, x_2, ... of 0–1 random quantities for which beliefs correspond to a judgement of infinite exchangeability, Proposition 4.1 (de Finetti's theorem) identifies the representation of the joint mass function for x_1, ..., x_n as having the form

p(x_1, \dots, x_n) = \int_0^1 \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} \, dQ(\theta),

for some choice of distribution Q over the interval [0, 1]. More generally, for sequences of real-valued or integer-valued random quantities x_1, x_2, ..., we have seen, in Sections 4.3–4.5, that beliefs which combine judgements of exchangeability with some form of further structure (either in terms of invariance or sufficient statistics) often lead us to work with representations of the form

p(x_1, \dots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, dQ(\boldsymbol{\theta}),

where p(x | θ) denotes a specified form of labelled family of probability distributions and Q is some choice of distribution over ℝ^k. Such representations, and the more complicated forms considered in Section 4.6, exhibit the various ways in which the element of primary significance from the subjectivist, operationalist standpoint, namely the predictive model of beliefs about observables, can be thought of as if constructed from a parametric model together with a prior distribution for the labelling parameter. Our primary concern in this chapter will be with the way in which the updating of beliefs in the light of new information takes place within the framework of such representations.
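The mixture form lends itself to direct simulation. In the minimal sketch below, the Beta(2, 2) mixing distribution Q is purely an illustrative assumption: exchangeable 0–1 sequences are generated by first drawing θ from Q and then making conditionally independent Bernoulli(θ) draws, and the relative frequencies of the three single-success patterns of length three come out approximately equal, as exchangeability requires:

```python
import random
from collections import Counter

# Sketch: simulating an exchangeable 0-1 sequence via its mixture
# representation. The mixing distribution Q = Beta(2, 2) is an
# illustrative assumption.

def exchangeable_sequence(n, rng):
    theta = rng.betavariate(2, 2)   # draw theta from Q
    return tuple(1 if rng.random() < theta else 0 for _ in range(n))

rng = random.Random(0)
draws = [exchangeable_sequence(3, rng) for _ in range(200000)]
counts = Counter(draws)
# The joint distribution depends only on the number of 1s, not their
# positions, so the three one-success patterns are equally frequent.
print(counts[(1, 0, 0)], counts[(0, 1, 0)], counts[(0, 0, 1)])
```

The same scheme works for any proper mixing distribution Q; only the `betavariate` line would change.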
5.1.2
The Role of Bayes' Theorem
In its simplest form, within the formal framework of predictive model belief distributions derived from quantitative coherence considerations, the problem corresponds to identifying the joint conditional density

p(x_{n+1}, \dots, x_{n+m} \mid x_1, \dots, x_n),

for any m ≥ 1, given, for any n ≥ 1, the form of representation of the joint density p(x_1, \dots, x_n). In general, of course, this simply reduces to calculating

p(x_{n+1}, \dots, x_{n+m} \mid x_1, \dots, x_n) = \frac{p(x_1, \dots, x_{n+m})}{p(x_1, \dots, x_n)},
and, in the absence of further structure, there is little more that can be said. However, when the predictive model admits a representation in terms of parametric models and prior distributions, the learning process can be essentially identified, in conventional terminology, with the standard parametric form of Bayes' theorem. Thus, for example, if we consider the general parametric form of representation for an exchangeable sequence, with dQ(θ) having density representation p(θ)dθ, we have

p(x_1, \dots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta},

from which it follows that

p(x_{n+1}, \dots, x_{n+m} \mid x_1, \dots, x_n) = \int \prod_{i=n+1}^{n+m} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid x_1, \dots, x_n) \, d\boldsymbol{\theta},

where

p(\boldsymbol{\theta} \mid x_1, \dots, x_n) = \frac{\prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{\int \prod_{i=1}^{n} p(x_i \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}}.
This latter relationship is just Bayes' theorem, expressing the posterior density for θ, given x_1, ..., x_n, in terms of the parametric model for x_1, ..., x_n given θ, and the prior density for θ. The (conditional, or posterior) predictive model for x_{n+1}, ..., x_{n+m}, given x_1, ..., x_n, is seen to have precisely the same general form of representation as the initial predictive model, except that the corresponding parametric model component is now integrated with respect to the posterior distribution of the parameter, rather than with respect to the prior distribution. We recall from Chapter 4 that, considered as a function of θ,

\operatorname{lik}(\theta \mid x_1, \dots, x_n) = p(x_1, \dots, x_n \mid \theta)

is usually referred to as the likelihood function. A formal definition of such a concept is, however, problematic; for details, see Bayarri et al. (1988) and Bayarri and DeGroot (1992b).
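A minimal numerical sketch of this prior-to-posterior operation, assuming an illustrative Beta(2, 2) prior for the parameter of a 0–1 sequence, checks the grid-normalised product of likelihood and prior against the closed-form conjugate posterior Beta(a + r, b + n − r):

```python
import math

# Sketch of Bayes' theorem for a 0-1 sequence: the posterior p(theta | x)
# is likelihood times prior, normalised by p(x). The Beta(2, 2) prior is
# an illustrative assumption; conjugacy then gives the closed form
# Beta(a + r, b + n - r) against which the numerical answer is checked.

def beta_pdf(t, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * t**(a - 1) * (1 - t)**(b - 1)

def posterior_via_bayes(t, x, a, b, grid=10000):
    r, n = sum(x), len(x)
    lik = lambda u: u**r * (1 - u)**(n - r)
    ts = [(i + 0.5) / grid for i in range(grid)]
    norm = sum(lik(u) * beta_pdf(u, a, b) for u in ts) / grid  # p(x)
    return lik(t) * beta_pdf(t, a, b) / norm

x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # n = 10 observations, r = 7 successes
a, b = 2.0, 2.0
r, n = sum(x), len(x)
for t in (0.3, 0.5, 0.7):
    # numerical posterior agrees with the conjugate Beta(a + r, b + n - r)
    assert abs(posterior_via_bayes(t, x, a, b)
               - beta_pdf(t, a + r, b + n - r)) < 1e-5
```

The grid normalisation stands in for the denominator integral; with a conjugate prior the whole computation collapses to updating the two Beta parameters.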
5.1.3
Predictive and Parametric Inference
Given our operationalist concern with modelling and reporting uncertainty in terms of observables, it is not surprising that Bayes’ theorem, in its role as the key to a coherent learning process for parameters, simply appears as a step within the predictive process of passing from
p(x_1, \dots, x_n) = \int p(x_1, \dots, x_n \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}
to

p(x_{n+1}, \dots, x_{n+m} \mid x_1, \dots, x_n) = \int p(x_{n+1}, \dots, x_{n+m} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid x_1, \dots, x_n) \, d\boldsymbol{\theta},

by means of

p(\boldsymbol{\theta} \mid x_1, \dots, x_n) = \frac{p(x_1, \dots, x_n \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta})}{p(x_1, \dots, x_n)}.

Writing y = {y_1, ..., y_m} = {x_{n+1}, ..., x_{n+m}} to denote future (or, as yet unobserved) quantities and x = {x_1, ..., x_n} to denote the already observed quantities, these relations may be re-expressed more simply as

p(\boldsymbol{y} \mid \boldsymbol{x}) = \int p(\boldsymbol{y} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \boldsymbol{x}) \, d\boldsymbol{\theta}

and

p(\boldsymbol{\theta} \mid \boldsymbol{x}) = p(\boldsymbol{x} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) / p(\boldsymbol{x}).
However, as we noted on many occasions in Chapter 4, if we proceed purely formally, from an operationalist standpoint it is not at all clear, at first sight, how we should interpret "beliefs about parameters", as represented by p(θ) and p(θ | x), or even whether such "beliefs" have any intrinsic interest. We also answered these questions on many occasions in Chapter 4, by noting that, in all the forms of predictive model representations we considered, the parameters had interpretations as strong law limits of (appropriate functions of) observables. Thus, for example, in the case of the infinitely exchangeable 0–1 sequence (Section 4.3.1), beliefs about θ correspond to beliefs about what the long-run frequency of 1's would be in a future sample; in the context of a real-valued exchangeable sequence with centred spherical symmetry (Section 4.4.1), beliefs about μ and σ², respectively, correspond to beliefs about what the large-sample mean, and the large-sample mean sum of squares about the sample mean, would be in a future sample. Inference about parameters is thus seen to be a limiting form of predictive inference about observables. This means that, although the predictive form is primary, and the role of parametric inference is typically that of an intermediate structural step, parametric inference will often itself be the legitimate end-product of a statistical analysis in situations where interest focuses on quantities which could be viewed as large-sample functions of observables. Either way, parametric inference is of considerable importance for statistical analysis in the context of the models we are mainly concerned with in this volume.
When a parametric form is involved simply as an intermediate step in the predictive process, we have seen that p(θ | x_1, ..., x_n), the full joint posterior density for the parameter vector θ, is all that is required. However, if we are concerned with parametric inference per se, we may be interested in only some subset, φ, of the components of θ, or in some transformed subvector of parameters, g(θ). For example, in the case of a real-valued sequence we may only be interested in the large-sample mean and not in the variance; or, in the case of two 0–1 sequences, we may only be interested in the difference in the long-run frequencies. In the case of interest in a subvector of θ, let us suppose that the full parameter vector can be partitioned into θ = {φ, λ}, where φ is the subvector of interest and λ is the complementary subvector of θ, often referred to, in this context, as the vector of nuisance parameters. Since

p(\phi, \lambda \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \phi, \lambda) \, p(\phi, \lambda)}{p(\boldsymbol{x})},

the (marginal) posterior density for φ is given by

p(\phi \mid \boldsymbol{x}) = \int p(\phi, \lambda \mid \boldsymbol{x}) \, d\lambda,

where

p(\boldsymbol{x}) = \int\!\!\int p(\boldsymbol{x} \mid \phi, \lambda) \, p(\phi, \lambda) \, d\phi \, d\lambda,

with all integrals taken over the full range of possible values of the relevant quantities. Expressed in terms of the notation introduced in Section 3.2.4, we have

p(\phi, \lambda \mid \boldsymbol{x}) \propto p(\boldsymbol{x} \mid \phi, \lambda) \, p(\phi, \lambda), \qquad p(\phi \mid \boldsymbol{x}) = \int p(\phi, \lambda \mid \boldsymbol{x}) \, d\lambda.
In some situations, the prior specification p ( 4 , A) may be most easily arrived at through the specification of p(A I @ ) p ( 4 ) .In such cases, we note that we could first calculate the integrated likelihood for 4,
and subsequently proceed without any further need to consider the nuisance parameters. since
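The elimination of nuisance parameters by integration can be sketched numerically. In the following illustration the data, priors and grid limits are all assumptions made for the example: interest is in the location μ of a normal model, and the scale σ is the nuisance parameter, integrated out over a grid:

```python
import math

# Sketch: eliminating a nuisance parameter by marginalisation, on a grid.
# Data are modelled as N(mu, sigma^2); interest is in mu, and sigma plays
# the role of lambda. Illustrative priors: mu ~ N(0, 10^2), sigma flat
# on (0.1, 10).

data = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.3]

def log_lik(mu, sigma):
    ss = sum((x - mu)**2 for x in data)
    return -len(data) * math.log(sigma) - ss / (2 * sigma**2)

def unnorm_joint(mu, sigma):
    log_prior_mu = -mu**2 / (2 * 10**2)
    return math.exp(log_lik(mu, sigma) + log_prior_mu)  # flat prior on sigma

sigmas = [0.1 + i * (10 - 0.1) / 400 for i in range(401)]

def marginal_post(mu):
    # p(mu | x)  proportional to  integral over sigma of the joint
    return sum(unnorm_joint(mu, s) for s in sigmas)

mus = [i / 100 for i in range(-100, 300)]
weights = [marginal_post(m) for m in mus]
post_mean = sum(m * w for m, w in zip(mus, weights)) / sum(weights)
print(round(post_mean, 2))   # posterior mean of mu, near the sample mean
```

With the nuisance parameter integrated out, all subsequent summaries of μ (means, intervals, and so on) are taken from the one-dimensional marginal posterior.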
In the case where interest is focused on a transformed parameter vector, g(θ), we proceed using standard change-of-variable probability techniques as described in Section 3.2.4. Suppose first that ψ = g(θ) is a one-to-one differentiable transformation of θ. It then follows that

p(\psi \mid \boldsymbol{x}) = p_{\theta}\bigl(g^{-1}(\psi) \mid \boldsymbol{x}\bigr) \, \bigl| J_{g^{-1}}(\psi) \bigr|,

where

J_{g^{-1}}(\psi) = \frac{\partial g^{-1}(\psi)}{\partial \psi}

is the Jacobian of the inverse transformation θ = g^{-1}(ψ). Alternatively, by substituting θ = g^{-1}(ψ), we could write p(x | θ) as p(x | ψ), replace p(θ) by p_θ(g^{-1}(ψ)) |J_{g^{-1}}(ψ)|, and obtain

p(\psi \mid \boldsymbol{x}) = p(\boldsymbol{x} \mid \psi) \, p(\psi) / p(\boldsymbol{x})

directly. If ψ = g(θ) has dimension less than θ, we can typically define ψ̄ = (ψ, ω) = h(θ), for some ω such that ψ̄ = h(θ) is a one-to-one differentiable transformation, and then proceed in two steps. We first obtain

p(\psi, \omega \mid \boldsymbol{x}) = p_{\theta}\bigl(h^{-1}(\psi, \omega) \mid \boldsymbol{x}\bigr) \, \bigl| J_{h^{-1}}(\psi, \omega) \bigr|,

and then marginalise to

p(\psi \mid \boldsymbol{x}) = \int p(\psi, \omega \mid \boldsymbol{x}) \, d\omega.
These techniques will be used extensively in later sections of this chapter. In order to keep the presentation of these basic manipulative techniques as simple as possible, we have avoided introducing additional notation for the ranges of possible values of the various parameters. In particular, all integrals have been assumed to be over the full ranges of the possible parameter values. In general, this notational economy will cause no confusion and the parameter ranges will be clear from the context. However, there are situations where specific constraints on parameters are introduced and need to be made explicit in the analysis. In such cases, notation for ranges of parameter values will typically also need to be made explicit. Consider, for example, a parametric model p(x | θ), together with a prior specification p(θ), θ ∈ Θ, for which the posterior density, suppressing explicit use of the range Θ, is given by p(θ | x) ∝ p(x | θ) p(θ).
Now suppose that it is required to specify the posterior subject to the constraint θ ∈ Θ₀ ⊂ Θ, where ∫_{Θ₀} p(θ) dθ > 0. Defining the constrained prior density by

w(\theta) = \frac{p(\theta)}{\int_{\Theta_0} p(\theta) \, d\theta}, \qquad \theta \in \Theta_0,

we obtain, using Bayes' theorem,

w(\theta \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid \theta) \, w(\theta)}{\int_{\Theta_0} p(\boldsymbol{x} \mid \theta) \, w(\theta) \, d\theta}, \qquad \theta \in \Theta_0.

From this, substituting for w(θ) in terms of p(θ) and dividing both numerator and denominator by p(x), we obtain

w(\theta \mid \boldsymbol{x}) = \frac{p(\theta \mid \boldsymbol{x})}{\int_{\Theta_0} p(\theta \mid \boldsymbol{x}) \, d\theta}, \qquad \theta \in \Theta_0,

expressing the constrained posterior in terms of the unconstrained posterior (a result which could, of course, have been obtained by direct, straightforward conditioning). Numerical methods are often necessary to analyse models with constrained parameters; see Gelfand et al. (1992) for the use of Gibbs sampling in this context.
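A short numerical sketch of the constrained-posterior result, using an illustrative Bernoulli example with a uniform prior (so that the unconstrained posterior is a Beta density), renormalises the unconstrained posterior over the constraint region θ ≤ 0.5:

```python
import math

# Sketch: posterior under a parameter constraint, obtained by renormalising
# the unconstrained posterior over the constraint region. Illustrative
# setup: Bernoulli data with a uniform prior, so the unconstrained
# posterior is Beta(r + 1, n - r + 1); the constraint is theta <= 0.5.

def beta_pdf(t, a, b):
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * t**(a - 1) * (1 - t)**(b - 1)

r, n = 3, 10                     # 3 successes in 10 trials
a, b = r + 1, n - r + 1          # unconstrained posterior Beta(4, 8)

grid = [(i + 0.5) / 100000 for i in range(100000)]
mass_in_region = sum(beta_pdf(t, a, b) for t in grid if t <= 0.5) / 100000

def constrained_post(t):
    # w(theta | x) = p(theta | x) / posterior mass of the region
    return beta_pdf(t, a, b) / mass_in_region if t <= 0.5 else 0.0

# The constrained posterior integrates to 1 over theta <= 0.5.
total = sum(constrained_post(t) for t in grid) / 100000
print(round(total, 4))
```

The key point is that no new normalising integral over the likelihood is needed: the unconstrained posterior is simply truncated and rescaled by its mass on Θ₀.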
5.1.4
Sufficiency, Ancillarity and Stopping Rules
The concepts of predictive and parametric sufficient statistics were introduced in Section 4.5.2, and shown to be equivalent within the framework of the kinds of models we are considering in this volume. In particular, it was established that a (minimal) sufficient statistic t(x) for θ, in the context of a parametric model p(x | θ), can be characterised by either of the conditions

p(\boldsymbol{x} \mid t(\boldsymbol{x}), \theta) = p(\boldsymbol{x} \mid t(\boldsymbol{x})) \qquad \text{or} \qquad p(\theta \mid \boldsymbol{x}) = p(\theta \mid t(\boldsymbol{x})), \ \text{for any } p(\theta).

The important implication of the concept is that t(x) serves as a sufficient summary of the complete data x in forming any required revision of beliefs. The resulting data reduction often implies considerable simplification in modelling and analysis. In
many cases, the sufficient statistic t ( z )can itself be partitioned into two component statistics, t ( z )= [a(z). such that, for all 8. 45))
It then follows that. for any choice ol'p(c)).
so that, in the prior to posterior inference process defined by Bayes' theorem. it suffices to use p(s(z) a(z). ) .rather than p ( t ( z 10) as the likelihood function. I 8 ) This further simplification motivates the following definition. Definition 5.1. (Ancillary statiSric). A stutistic. a ( z ) . said to be ancillary. is with respect to 8 in u purumerric nrocidp(x I c ) ) . if p ,( a ( x 18) = p ( a ( x ) ) f i w ) all values of 8.
Example 5.1. (Bernoulli model). In Example 4.5, we saw that for the Bernoulli parametric model

p(x_1, \dots, x_n \mid \theta) = \theta^{r_n} (1 - \theta)^{n - r_n},

which only depends on n and r_n = x_1 + · · · + x_n. Thus, t_n = [n, r_n] provides a minimal sufficient statistic, and one may work in terms of the joint probability function p(n, r_n | θ). If we now write

p(n, r_n \mid \theta) = p(r_n \mid n, \theta) \, p(n \mid \theta),

and make the assumption that, for all n ≥ 1, the mechanism by which the sample size n is arrived at does not depend on θ, so that p(n | θ) = p(n), n ≥ 1, we see that n is ancillary for θ, in the sense of Definition 5.1. It follows that prior to posterior inference for θ can therefore proceed on the basis of

p(\theta \mid \boldsymbol{x}) = p(\theta \mid n, r_n) \propto p(r_n \mid n, \theta) \, p(\theta),

for any choice of p(θ), 0 ≤ θ ≤ 1. From Corollary 4.1, we see that

p(r_n \mid n, \theta) = \mathrm{Bi}(r_n \mid \theta, n),
so that inferences in this case can be made as if we had adopted a binomial parametric model. However, if we write

p(n, r_n \mid \theta) = p(n \mid r_n, \theta) \, p(r_n \mid \theta)

and make the assumption that, for all r_n ≥ 1, termination of sampling is governed by a mechanism for selecting r_n which does not depend on θ, so that p(r_n | θ) = p(r_n), r_n ≥ 1, we see that r_n is ancillary for θ, in the sense of Definition 5.1. It follows that prior to posterior inference for θ can therefore proceed on the basis of

p(\theta \mid \boldsymbol{x}) = p(\theta \mid n, r_n) \propto p(n \mid r_n, \theta) \, p(\theta),

for any choice of p(θ), 0 < θ ≤ 1. It is easily verified that

p(n \mid r_n, \theta) = \mathrm{Nb}(n \mid \theta, r_n)

(see Section 3.2.2), so that inferences in this case can be made as if we had adopted a negative-binomial parametric model. We note, incidentally, that whereas in the binomial case it makes sense to consider p(θ) as specified over 0 ≤ θ ≤ 1, in the negative-binomial case it may only make sense to think of p(θ) as specified over 0 < θ ≤ 1, since p(r_n | θ = 0) = 0 for all r_n ≥ 1. So far as prior to posterior inference for θ is concerned, we note that, for any specified p(θ), and assuming that either p(n | θ) = p(n) or p(r_n | θ) = p(r_n), we obtain

p(\theta \mid x_1, \dots, x_n) = p(\theta \mid n, r_n) \propto \theta^{r_n} (1 - \theta)^{n - r_n} \, p(\theta),

since, considered as functions of θ,

p(r_n \mid n, \theta) \propto p(n \mid r_n, \theta) \propto \theta^{r_n} (1 - \theta)^{n - r_n}.
The last part of the above example illustrates a general fact about the mechanism of parametric Bayesian inference which is trivially obvious; namely, for any specified p(θ), if the likelihood functions p_1(x_1 | θ) and p_2(x_2 | θ) are proportional as functions of θ, the resulting posterior densities for θ are identical. It turns out, as we shall see in Appendix B, that many non-Bayesian inference procedures do not lead to identical inferences when applied to such proportional likelihoods. The assertion that they should, the so-called Likelihood Principle, is therefore a controversial issue among statisticians. In contrast, in the Bayesian inference context described above, this is a straightforward consequence of Bayes' theorem, rather than an imposed "principle". Note, however, that the above remarks are predicated on a specified p(θ). It may be, of course, that knowledge of the particular sampling mechanism employed has implications for the specification of p(θ), as illustrated, for example, by the comment above concerning negative-binomial sampling and the restriction to 0 < θ ≤ 1.
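The binomial versus negative-binomial comparison can be checked directly. In the sketch below (the uniform prior is an illustrative choice), the two likelihoods for the same observed (n, r_n) differ only by a constant factor in θ, so the grid-normalised posteriors coincide:

```python
import math

# Sketch: proportional likelihoods yield identical posteriors. With r
# successes in n Bernoulli trials, the binomial (n fixed) and
# negative-binomial (r fixed) likelihoods differ only by constants not
# involving theta, so any prior gives the same posterior. Checked on a
# grid with a uniform prior.

def binom_lik(theta, n, r):
    return math.comb(n, r) * theta**r * (1 - theta)**(n - r)

def negbinom_lik(theta, n, r):
    # probability that the r-th success occurs on trial n
    return math.comb(n - 1, r - 1) * theta**r * (1 - theta)**(n - r)

n, r = 12, 5
grid = [(i + 0.5) / 1000 for i in range(1000)]

def posterior(lik):
    w = [lik(t, n, r) for t in grid]      # uniform prior: weights = likelihood
    z = sum(w)
    return [v / z for v in w]

post_b = posterior(binom_lik)
post_nb = posterior(negbinom_lik)
# The two normalised posteriors agree pointwise.
assert max(abs(a - b) for a, b in zip(post_b, post_nb)) < 1e-12
```

Replacing the uniform prior by any other p(θ) would leave the agreement intact, since the constant factors cancel in the normalisation.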
Although the likelihood principle is implicit in Bayesian statistics, it was developed as a separate principle by Barnard (1949), and became a focus of interest when Birnbaum (1962) showed that it followed from the widely accepted sufficiency and conditionality principles. Berger and Wolpert (1984/1988) provide an extensive discussion of the likelihood principle and related issues. Other relevant references are Barnard et al. (1962), Fraser (1963), Pratt (1965), Barnard (1967), Hartigan (1967), Birnbaum (1968, 1978), Durbin (1970), Basu (1975), Dawid (1983a), Joshi (1983), Berger (1985b), Hill (1987) and Bayarri et al. (1988). Example 5.1 illustrates the way in which ancillary statistics often arise naturally as a consequence of the way in which data are collected. In general, it is very often the case that the sample size n is fixed in advance and that inferences are automatically made conditional on n, without further reflection. It is, however, perhaps not obvious that inferences can be made conditional on n if the latter has arisen as a result of such familiar imperatives as "stop collecting data when you feel tired", or "when the research budget runs out". The kind of analysis given above makes it intuitively clear that such conditioning is, in fact, valid, provided that the mechanism which has led to n "does not depend on θ". This latter condition may, however, not always be immediately transparent, and the following definition provides one version of a more formal framework for considering sampling mechanisms and their dependence on model parameters.
Definition 5.2. (Stopping rule). A stopping rule, h, for (sequential) sampling from a sequence of observables x_1 ∈ X_1, x_2 ∈ X_2, ..., is a sequence of functions h_n : X_1 × · · · × X_n → [0, 1], such that, if x_(n) = (x_1, ..., x_n) is observed, then sampling is terminated with probability h_n(x_(n)); otherwise, the (n + 1)th observation is made. A stopping rule is proper if the induced probability distribution p_h(n), n = 1, 2, ..., for final sample size guarantees that the latter is finite. The rule is deterministic if h_n(x_(n)) ∈ {0, 1} for all (n, x_(n)); otherwise, it is a randomised stopping rule.
In general, we must regard the data resulting from a sampling mechanism defined by a stopping rule h as consisting of (n, x_(n)), the sample size together with the observed quantities x_1, ..., x_n. A parametric model for these data thus involves a probability density of the form p(n, x_(n) | h, θ), conditioning both on the stopping rule (i.e., sampling mechanism) and on an underlying labelling parameter θ. But, either through unawareness or misapprehension, this is typically ignored and, instead, we act as if the actual observed sample size n had been fixed in advance, in effect assuming that
p(n, x_{(n)} \mid h, \theta) = p(x_{(n)} \mid n, \theta) = p(x_{(n)} \mid \theta),
using the standard notation we have hitherto adopted for fixed n. The important question that now arises is the following: under what circumstances, if any, can
we proceed to make inferences about θ on the basis of this (generally erroneous!) assumption, without considering explicit conditioning on the actual form of h? Let us first consider a simple example.

Example 5.2. ("Biased" stopping rule for a Bernoulli sequence). Suppose that x_1, x_2, ..., given θ, may be regarded as a sequence of independent Bernoulli random quantities with p(x | θ) = Bi(x | θ, 1), for x = 0, 1, and that a sequential sample is to be obtained using the deterministic stopping rule h defined by: h_1(1) = 1, h_1(0) = 0, h_2(x_1, x_2) = 1 for all x_1, x_2. In other words, if there is a success on the first trial, sampling is terminated (resulting in n = 1, x_1 = 1); otherwise, two observations are obtained (resulting in either n = 2, x_1 = 0, x_2 = 0 or n = 2, x_1 = 0, x_2 = 1). At first sight, it might appear essential to take explicit account of h in making inferences about θ, since the sampling procedure seems designed to bias us towards believing in large values of θ. Consider, however, the following detailed analysis:

p(n = 1, x_1 = 1 \mid h, \theta) = p(x_1 = 1 \mid n = 1, h, \theta) \, p(n = 1 \mid h, \theta) = p(x_1 = 1 \mid \theta) = \theta,

and, for x = 0, 1,

p(n = 2, x_1 = 0, x_2 = x \mid h, \theta) = p(x_2 = x \mid x_1 = 0, n = 2, h, \theta) \, p(n = 2, x_1 = 0 \mid h, \theta) = p(x_2 = x \mid \theta) \, p(x_1 = 0 \mid \theta).

Thus, for all (n, x_(n)) having non-zero probability,

p(n, x_{(n)} \mid h, \theta) = p(x_{(n)} \mid \theta),

the latter considered pointwise as functions of θ (i.e., as likelihoods). It then follows trivially from Bayes' theorem that, for any specified p(θ), inferences for θ based on assuming n to have been fixed at its observed value will be identical to those based on a likelihood derived from explicit consideration of h. Consider now a randomised version of this stopping rule, defined by h_1(1) = π, h_1(0) = 0, h_2(x_1, x_2) = 1 for all x_1, x_2.
Thus, for all (n, x1, ..., xn) having nonzero probability, we obtain in this case

p(x1, ..., xn | h, θ) = p(x1 | θ) ··· p(xn | θ),

so that the likelihood is precisely the one which would have been obtained had n been fixed in advance, the two being considered pointwise as functions of θ (i.e., as likelihoods). It then follows trivially from Bayes' theorem that, for any specified p(θ), inferences for θ based on assuming n to have been fixed at its observed value will be identical to those based on a likelihood derived from explicit consideration of h.

Consider now a randomised version of this stopping rule, defined by h1(1) = π, h1(0) = 0, h2(x1, x2) = 1 for all x1, x2. In this case we have

p(n = 1, x1 = 1 | h, θ) = π p(x1 = 1 | θ),
p(n = 2, x1 = 1, x2 = x | h, θ) = (1 - π) p(x1 = 1 | θ) p(x2 = x | θ),
p(n = 2, x1 = 0, x2 = x | h, θ) = p(x1 = 0 | θ) p(x2 = x | θ),
so that, for each possible outcome, the likelihood differs from the a priori fixed sample size likelihood only by a factor not involving θ, and the proportionality of the likelihoods once more implies identical inferences from Bayes' theorem.

The analysis of the preceding example showed, perhaps contrary to intuition, that, although seemingly biasing the analysis towards beliefs in larger values of θ, the stopping rule does not in fact lead to a different likelihood from that of the a priori fixed sample size. The following proposition makes clear that this is true for all stopping rules as defined in Definition 5.1, which we might therefore describe as "likelihood noninformative stopping rules".

Proposition 5.1. (Stopping rules are likelihood noninformative). For any stopping rule h for (sequential) sampling from a sequence of observables x1, x2, ..., with fixed sample size parametric model p(x(n) | n, θ), θ ∈ Θ,

p(x(n) | h, θ) ∝ p(x(n) | n, θ),

for all (n, x(n)) such that p(x(n) | h, θ) ≠ 0.

Proof. This follows straightforwardly on noting that

p(x(n) | h, θ) = [ h_n(x(n)) ∏_{m=1}^{n-1} (1 - h_m(x(m))) ] p(x(n) | n, θ),

and that the term in square brackets does not depend on θ.

Again, it is a trivial consequence of Bayes' theorem that, for any specified prior density p(θ), prior to posterior inference for θ given data (n, x(n)) obtained using a likelihood noninformative stopping rule h can proceed by acting as if x(n) were obtained using a fixed sample size n.
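The proposition can be checked numerically. The following sketch (our illustration, not part of the original text; the function names are ours) compares, for the deterministic stopping rule of the Bernoulli example, the likelihood of each attainable outcome with the corresponding fixed-sample-size Bernoulli likelihood; for this rule the two coincide exactly, with proportionality constant one.

```python
# Sketch: the "stop after one trial on a success, otherwise take two trials"
# rule yields, for every attainable outcome, the same likelihood in theta
# as a fixed sample size would.

def lik_fixed(theta, xs):
    """Bernoulli likelihood for a fixed sample size: prod theta^x (1-theta)^(1-x)."""
    p = 1.0
    for x in xs:
        p *= theta**x * (1 - theta)**(1 - x)
    return p

def lik_stopping(theta, xs):
    """Likelihood of the same data under the deterministic stopping rule.
    The rule itself contributes no theta-dependent factor."""
    if xs == (1,):                       # stopped at n = 1 with a success
        return theta
    x2 = xs[1]                           # otherwise n = 2 with x1 = 0
    return (1 - theta) * theta**x2 * (1 - theta)**(1 - x2)

for theta in [0.1, 0.5, 0.9]:
    for xs in [(1,), (0, 0), (0, 1)]:
        assert abs(lik_fixed(theta, xs) - lik_stopping(theta, xs)) < 1e-12
```

For the randomised version of the rule the two likelihoods differ only by the constant factors π or 1 - π, so posterior inferences are again unchanged.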
However, a notationally precise rendering of Bayes' theorem reveals that knowledge of h might well affect the specification of the prior density! It is for this reason that we use the term "likelihood noninformative" rather than just "noninformative" stopping rules. It cannot be emphasised too often that, although it is often convenient for expository reasons to focus at a given juncture on one or other of the "likelihood" and "prior" components of the model, our discussion in Chapter 4 makes clear their basic inseparability in coherent modelling and analysis of beliefs. This issue is highlighted in the following example.

Example 5.3. ("Biased" stopping rule for a normal mean). Suppose that x1, x2, ... may be regarded as a sequence of independent normal random quantities with p(x | θ) = N(x | θ, 1), θ ∈ ℝ, with precision λ = 1. Suppose further that an investigator has a particular concern with the parameter value θ = 0 and wants to stop sampling if x̄n ever takes on a value that is "unlikely", assuming θ = 0 to be true. If "unlikely" is interpreted as "an event having probability less than or equal to α", a possible stopping rule might be

h_n(x1, ..., xn) = 1 if |x̄n| > k(α) n^{-1/2}, and 0 otherwise,

for suitable k(α) (for example, k = 1.96 for α = 0.05, k = 2.57 for α = 0.01, k = 3.31 for α = 0.001). It can be shown, using the law of the iterated logarithm, that this is a proper stopping rule, so that termination will certainly occur for some finite n. Moreover, for any fixed sample size n, defining x̄n = n^{-1}(x1 + ... + xn), we have

p(x1, ..., xn | h, θ) = ∏_{i=1}^{n} N(xi | θ, 1)

for all (n, x1, ..., xn) for which the left-hand side is nonzero. It follows that h is a likelihood noninformative stopping rule.

Now consider prior to posterior inference for θ, where, for illustration, we assume the prior specification p(θ) = N(θ | μ, λ0), with precision λ0 ≈ 0, to be interpreted as indicating extremely vague prior beliefs about θ.
Since h is a likelihood noninformative stopping rule, and [n, x̄n] are sufficient statistics for the normal parametric model, prior to posterior inference proceeds, by virtue of the sufficiency of (n, x̄n), via

p(θ | x1, ..., xn, h) = p(θ | n, x̄n, h) ∝ p(x̄n | n, θ) p(θ | h),

using the fact that p(x̄n | n, θ) = N(x̄n | θ, n). With the vague prior specification p(θ | h) = N(θ | μ, λ0), λ0 ≈ 0, this yields

p(θ | n, x̄n, h) = N(θ | (n + λ0)^{-1}(n x̄n + λ0 μ), n + λ0) ≈ N(θ | x̄n, n).

One consequence of this vague prior specification is that, having observed (n, x̄n), we are led to the posterior probability statement

P[ θ ∈ ( x̄n - 1.96 n^{-1/2}, x̄n + 1.96 n^{-1/2} ) | n, x̄n, h ] ≈ 0.95.

But the stopping rule h ensures that |x̄n| > k(α) n^{-1/2}, so that, for k(α) ≥ 1.96, the value θ = 0 certainly does not lie in the posterior interval to which someone with initially very vague beliefs would attach a high probability. An investigator knowing θ = 0 to be the true value can therefore, by using this stopping rule, mislead someone who, unaware of the stopping rule, acts as if initially very vague.

However, let us now consider an analysis which takes into account the stopping rule. The nature of h might suggest a prior specification p(θ | h) that recognises θ = 0 as a possibly "special" parameter value, which should be assigned nonzero prior probability (rather than the zero probability resulting from any continuous prior density specification). As an illustration, suppose that we specify

p(θ | h) = π 1_{(θ=0)}(θ) + (1 - π) N(θ | 0, λ0) 1_{(θ≠0)}(θ),

which assigns a "spike" of probability π to the special value θ = 0, and assigns 1 - π times a N(θ | 0, λ0) density to the range θ ≠ 0. Since h is a likelihood noninformative stopping rule and (n, x̄n) are sufficient statistics, the complete posterior p(θ | n, x̄n, h) is given by

p(θ | n, x̄n, h) ∝ N(x̄n | θ, n) p(θ | h).

On the range θ ≠ 0, the right-hand side is easily seen to be proportional to exp{-½ Q(θ)}, where

Q(θ) = (n + λ0) [ θ - (n + λ0)^{-1} n x̄n ]^2.
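The behaviour of the posterior spike can be made concrete with a small numerical sketch (our illustration, not from the text; the values π = 1/2 and λ0 = 1 are assumed for definiteness). With x̄n | θ ~ N(θ, 1/n) in variance notation, the posterior probability of θ = 0 is the ratio of the spike's marginal density to the total marginal density:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x (variance parametrisation)."""
    return math.exp(-(x - mean)**2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_spike(xbar, n, pi=0.5, lam0=1.0):
    """Posterior probability of theta = 0 under the spike-and-slab prior
    pi * delta_0 + (1 - pi) * N(0, 1/lam0), with xbar | theta ~ N(theta, 1/n).
    (Sketch only; pi and lam0 are assumed illustrative values.)"""
    f0 = normal_pdf(xbar, 0.0, 1.0 / n)               # marginal under theta = 0
    f1 = normal_pdf(xbar, 0.0, 1.0 / n + 1.0 / lam0)  # marginal under the slab
    return pi * f0 / (pi * f0 + (1 - pi) * f1)

# The stopping rule halts when |xbar| is about k / sqrt(n); taking k = 2,
# the spike probability grows with the sample size at which stopping occurs:
for n in [10, 1000, 100000]:
    print(n, round(posterior_spike(2.0 / math.sqrt(n), n), 4))
```

This reproduces, qualitatively, the point made below: stopping at a very large n with x̄n ≈ k(α) n^{-1/2} drives the posterior probability of θ = 0 towards one.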
It is then easily verified that the posterior distribution assigns a "spike" π* to θ = 0, and assigns 1 - π* times a N(θ | (n + λ0)^{-1} n x̄n, n + λ0) density to the range θ ≠ 0, where π* is determined by π, λ0, n and x̄n.

For qualitative insight, consider the case where actually θ = 0 and α has been chosen to be very small, so that k(α) is quite large. In such a case, n is likely to be very large and at the stopping point we shall have x̄n ≈ ±k(α) n^{-1/2}. This means that, for large n, the posterior spike π* tends to one, so that knowing the stopping rule and then observing that it results in a large sample size leads to an increasing conviction that θ = 0. On the other hand, if θ is appreciably different from 0, n, and hence π*, will tend to be small and the posterior will be dominated by the N(θ | (n + λ0)^{-1} n x̄n, n + λ0) component. The behaviour of this posterior density, derived from a prior taking account of h, is clearly very different from that of the posterior density based on a vague prior taking no account of the stopping rule.

5.1.5 Decisions and Inference Summaries

In Chapter 2, we made clear that our central concern is the representation and revision of beliefs as the basis for decisions. Either beliefs are to be used directly in the choice of an action, or are to be recorded or reported in some selected form, with the possibility or intention of subsequently guiding the choice of a future action. With slightly revised notation and terminology, we recall from Chapters 2 and 3 the elements and procedures required for coherent, quantitative decision-making. The elements of a decision problem in the inference context are:

(i) a ∈ A, available "answers" to the inference problem;
(ii) ω ∈ Ω, unknown states of the world;
(iii) u : A × Ω → ℝ, a function attaching utilities to each consequence (a, ω) of a decision to summarise inference in the form of an "answer", a, and an ensuing state of the world, ω;
(iv) p(ω), a specification, in the form of a probability distribution, of current beliefs about the possible states of the world.

The optimal choice of answer to an inference problem is an a ∈ A which maximises the expected utility,

∫ u(a, ω) p(ω) dω.

Alternatively, if instead of working with u(a, ω) we work with a so-called loss function,

l(a, ω) = f(ω) - u(a, ω),

where f is an arbitrary, fixed function, the optimal choice of answer is an a ∈ A which minimises the expected loss,

∫ l(a, ω) p(ω) dω.

Throughout, we shall have in mind the context of parametric and predictive inference, where the unknown states of the world are parameters or future data values (observables), and current beliefs, p(ω), typically reduce to one or other of the familiar forms:

p(θ): initial beliefs about a parameter vector, θ;
p(θ | z): beliefs about θ, given data z;
p(φ | z): beliefs about φ = g(θ), given data z;
p(y | z): beliefs about future data y, given data z.

It is clear from the forms of the expected utilities or losses which have to be calculated in order to choose an optimal answer, that, if beliefs about unknown states of the world are to provide an appropriate basis for future decision making, where, as yet, A and u (or l) may be unspecified, we need to report the complete belief distribution p(ω). However, if an immediate application to a particular decision problem, with specified A and u (or l), is all that is required, the optimal answer, maximising the expected utility or minimising the expected loss, may turn out to involve only limited, specific features of the belief distribution, so that these "summaries" of the full distribution suffice for decision-making purposes. In the following headed subsections, we shall illustrate and discuss some of these commonly used forms of summary.
Point Estimates

In cases where ω ∈ Ω corresponds to an unknown quantity, so that Ω is ℝ, or ℝ+, or ℝ^k, or ℝ × ℝ+, etc., and the required answer, a ∈ A, is an estimate of the true value of ω (so that A = Ω), the corresponding decision problem is typically referred to as one of point estimation. If ω = θ or ω = φ, we refer to parametric point estimation; if ω = y, we refer to predictive point estimation. A point estimation problem is thus completely defined once A = Ω and l(a, ω) are specified.

Clearly, distributional summaries such as the mean, median or mode of p(ω) could be reasonable point estimates of a random quantity ω. However, since one is almost certain not to get the answer exactly right in an estimation problem, statisticians typically work directly with the loss function concept, rather than with the utility function, and more formal guidance may be required as to when and why particular functionals of the belief distribution are justified as point estimates. This is provided by the following definition and result.

Definition 5.2. (Bayes estimate). A Bayes estimate of ω with respect to the loss function l(a, ω) and the belief distribution p(ω) is an a ∈ A = Ω which minimises ∫ l(a, ω) p(ω) dω.

Proposition 5.2. (Forms of Bayes estimates).

(i) If A = Ω = ℝ^k, l(a, ω) = (a - ω)^t H (a - ω), and H is symmetric positive definite, then the Bayes estimate satisfies Ha = H E(ω). If H^{-1} exists, a = E(ω), and so the Bayes estimate with respect to quadratic form loss is the mean of p(ω), assuming the mean to exist.

(ii) If A = Ω = ℝ and l(a, ω) = c1 (a - ω) 1_{(ω ≤ a)}(ω) + c2 (ω - a) 1_{(ω > a)}(ω), then the Bayes estimate with respect to linear loss is the quantile such that P(ω ≤ a) = c2/(c1 + c2). If c1 = c2, the right-hand side equals 1/2, and so the Bayes estimate with respect to absolute value loss is a median of p(ω).

(iii) If A = Ω ⊆ ℝ^k and l(a, ω) = 1 - 1_{B_ε(a)}(ω), where B_ε(a) is a ball of radius ε in Ω centred at a, then, as ε → 0, the function to be maximised tends to p(a), and so the Bayes estimate with respect to zero-one loss is a mode of p(ω), assuming a mode to exist.
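Before turning to the proof, the three forms can be illustrated numerically. The following minimal sketch (ours, not the book's) discretises a positively skewed Gamma(2, 1) density on a grid and computes the three Bayes estimates, recovering mode < median < mean:

```python
# Sketch: Bayes estimates under the three standard losses, computed on a
# grid for a positively skewed density (unnormalised Gamma(2, 1) kernel,
# which has mode 1 and mean 2).
import math

grid = [i * 0.01 for i in range(1, 2001)]            # w in (0, 20]
dens = [w * math.exp(-w) for w in grid]              # Gamma(2, 1) kernel
norm = sum(dens)
probs = [d / norm for d in dens]

mean = sum(w * p for w, p in zip(grid, probs))       # quadratic loss -> mean
cum, median = 0.0, None
for w, p in zip(grid, probs):                        # absolute value loss -> median
    cum += p
    if cum >= 0.5:
        median = w
        break
mode = grid[dens.index(max(dens))]                   # zero-one loss -> mode

print(mode, median, mean)    # mode < median < mean for this skewed density
```

The gap between the three estimates (here roughly 1, 1.68 and 2) is exactly the "mean > median > mode" phenomenon for positively skewed densities discussed below.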
Proof. Differentiating ∫ (a - ω)^t H (a - ω) p(ω) dω with respect to a and equating to zero yields

2H ∫ (a - ω) p(ω) dω = 0,

so that Ha = H E(ω). This establishes (i). Since

∫ l(a, ω) p(ω) dω = c1 ∫_{ω ≤ a} (a - ω) p(ω) dω + c2 ∫_{ω > a} (ω - a) p(ω) dω,

differentiating with respect to a and equating to zero yields

c1 ∫_{ω ≤ a} p(ω) dω = c2 ∫_{ω > a} p(ω) dω,

whence, adding c2 ∫_{ω ≤ a} p(ω) dω to each side, (c1 + c2) P(ω ≤ a) = c2, and we obtain (ii). Finally, since

∫ l(a, ω) p(ω) dω = 1 - ∫_{B_ε(a)} p(ω) dω,

this is minimised when ∫_{B_ε(a)} p(ω) dω is maximised, and, letting ε → 0, we have (iii).

Further insight into the nature of case (iii) can be obtained by thinking of a unimodal, continuous p(ω) in one dimension. It is then immediate, by a continuity argument, that a should be chosen such that p(a - ε) = p(a + ε). In the case of a unimodal, symmetric belief distribution, p(ω), for a single random quantity ω, the mean, median and mode coincide. In general, for unimodal, positively skewed, densities we have the relation

mean > median > mode,

and the difference can be substantial if p(ω) is markedly skew. Unless, therefore, there is a very clear need for a point estimate, and a strong rationale for a specific one of the loss functions considered in Proposition 5.2, the provision of a single number to summarise p(ω) may be extremely misleading as a summary of the information available about ω. Of course, such a comment acquires even greater force if p(ω) is multimodal or otherwise "irregular". For further discussion of Bayes estimators, see, for example, DeGroot and Rao (1963, 1966), Sacks (1963), Farrell (1964), Brown (1973), Tiao and Box (1974), Berger and Srinivasan (1978), Berger (1979, 1986), Hwang (1985, 1988), de la Horra (1987, 1988, 1992), Ghosh (1992a, 1992b), Irony (1992) and Spall and Maryak (1992).
Credible Regions

We have emphasised that, from a theoretical perspective, uncertainty about an unknown quantity of interest, ω, needs to be communicated in the form of the full (prior, posterior or predictive) density, p(ω), if formal calculation of expected loss or utility is to be possible for any arbitrary future decision problem. In practice, however, p(ω) may be a somewhat complicated entity and it may be both more convenient, and also sufficient for general orientation regarding the uncertainty about ω, simply to describe regions C ⊆ Ω of given probability under p(ω). Thus, for example, in the case where Ω ⊆ ℝ, the identification of intervals containing 50%, 90%, 95% or 99% of the probability under p(ω) might suffice to give a good idea of the general quantitative messages implicit in p(ω).

Definition 5.4. (Credible region). A region C ⊆ Ω such that

∫_C p(ω) dω = 1 - α

is said to be a 100(1 - α)% credible region for ω, with respect to p(ω). If Ω ⊆ ℝ, connected credible regions will be referred to as credible intervals. If p(ω) is a (prior/posterior/predictive) density, we refer to (prior/posterior/predictive) credible regions.
This is the intuitive basis of popular graphical representations of univariate distributions such as box plots.

Clearly, for given p(ω) and fixed α, there is not a unique credible region, even if we restrict attention to connected regions, as we should normally wish to do for obvious ease of interpretation (at least in cases where p(ω) is unimodal). For given α, the problem of choosing among the subsets C ⊆ Ω such that ∫_C p(ω) dω = 1 - α could be viewed as a decision problem, provided that we are willing to specify a loss function, l(C, ω), reflecting the possible consequences of quoting the 100(1 - α)% credible region C. We now describe the resulting form of credible region when a loss function is used which encapsulates the intuitive idea that, for given α, we would prefer to report a credible region C whose size ||C|| (volume, area, length) is minimised.

Proposition 5.3. (Minimal size credible regions). Let p(ω) be a probability density for ω ∈ Ω, almost everywhere continuous; given α, 0 < α < 1, if A = {C; P(ω ∈ C) = 1 - α} ≠ ∅ and

l(C, ω) = k ||C|| - 1_C(ω),  C ∈ A,  k > 0,

then C is optimal if and only if it has the property that p(ω1) ≥ p(ω2) for all ω1 ∈ C, ω2 ∉ C (except possibly for a subset of Ω of zero probability).
Proof. If C has the stated property and D is any other region belonging to A, then, writing C^c for the complement of C, since

C = (C ∩ D) ∪ (C ∩ D^c),  D = (C ∩ D) ∪ (C^c ∩ D)  and  P(ω ∈ C) = P(ω ∈ D),

we have ∫_{C ∩ D^c} p(ω) dω = ∫_{C^c ∩ D} p(ω) dω, so that

inf_{ω ∈ C} p(ω) ||C ∩ D^c|| ≤ ∫_{C ∩ D^c} p(ω) dω = ∫_{C^c ∩ D} p(ω) dω ≤ sup_{ω ∈ C^c} p(ω) ||C^c ∩ D|| ≤ inf_{ω ∈ C} p(ω) ||C^c ∩ D||,

and hence ||C ∩ D^c|| ≤ ||C^c ∩ D||, so that ||C|| ≤ ||D||, and an optimal C must have minimal size. If C does not have the stated property, there exists A ⊆ C such that, for all ω1 ∈ A, there exists ω2 ∉ C with p(ω2) > p(ω1). Let B ⊆ C^c be such that P(ω ∈ A) = P(ω ∈ B) and p(ω2) > p(ω1) for all ω2 ∈ B and ω1 ∈ A. Define D = (C ∩ A^c) ∪ B. Then D ∈ A and, by a similar argument to that given above, the result follows by showing that ||D|| < ||C||.

The property of Proposition 5.3 is worth emphasising in the form of a definition (Box and Tiao, 1965).

Definition 5.5. (Highest probability density (HPD) regions). A region C ⊆ Ω is said to be a 100(1 - α)% highest probability density region for ω, with respect to p(ω), if

(i) P(ω ∈ C) = 1 - α;
(ii) p(ω1) ≥ p(ω2) for all ω1 ∈ C and ω2 ∉ C, except possibly for a subset of Ω having probability zero.

If p(ω) is a (prior/posterior/predictive) density, we refer to highest (prior/posterior/predictive) density regions.

Clearly, the credible region approach to summarising p(ω) is not particularly useful in the case of discrete Ω, since such regions will only exist for limited choices of α. The above development should therefore be understood as intended for the case of continuous Ω.
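In one dimension, property (ii) suggests a direct numerical recipe: retain points in decreasing order of density until probability 1 - α has been accumulated. A sketch (our own illustration, again using the skewed Gamma(2, 1) kernel; not from the book):

```python
# Sketch: approximate a 95% HPD region on a grid by keeping grid points
# in decreasing order of density until they carry probability 1 - alpha.
import math

def hpd_interval(grid, probs, dens, alpha=0.05):
    order = sorted(range(len(grid)), key=lambda i: dens[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= 1 - alpha:
            break
    pts = [grid[i] for i in kept]
    return min(pts), max(pts)

grid = [i * 0.01 for i in range(1, 2001)]
dens = [w * math.exp(-w) for w in grid]      # unimodal, skewed Gamma(2,1) kernel
norm = sum(dens)
probs = [d / norm for d in dens]

lo, hi = hpd_interval(grid, probs, dens)
print(lo, hi)    # for a unimodal density the retained set is an interval
```

Because the density is unimodal, the retained points form a single interval with (approximately) equal density at its two endpoints, which is exactly the HPD property.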
[Figure 5.1a. ω0 almost as "plausible" as all ω' ∈ C.]
[Figure 5.1b. ω0 much less "plausible" than most ω ∈ C.]

For a number of commonly occurring univariate forms of p(ω), there exist tables which facilitate the identification of HPD intervals for a range of values of α (see, for example, Isaacs et al., 1974, Ferrandiz and Sendra, 1982, and Lindley and Scott, 1985). In general, however, the derivation of an HPD region requires numerical calculation and, particularly if p(ω) does not exhibit markedly skewed behaviour, it may be satisfactory in practice to quote some more simply calculated credible region.
In addition, in the univariate case, conventional statistical tables facilitate the identification of intervals which exclude equiprobable tails of p(ω) for many standard distributions.

In cases where an HPD credible region C is pragmatically acceptable as a crude summary of the density p(ω), a specific value ω0 ∈ Ω will tend to be regarded as somewhat "implausible" if ω0 ∉ C, particularly for small values of α (for example, α = 0.05, 0.01). This, of course, provides no justification for actions such as "rejecting the hypothesis that ω = ω0". If we wish to consider such actions, we must formulate a proper decision problem, specifying alternative actions and the losses consequent on correct and incorrect actions. For the present, it will suffice to note, as illustrated in Figure 5.1, that even the intuitive notion of "implausibility if ω0 ∉ C" depends much more on the complete characterisation of p(ω) than on an either-or assessment based on an HPD region.

Although an appropriately chosen selection of credible regions can serve to give a useful summary of p(ω) when we focus just on the quantity ω, there is a fundamental difficulty which prevents such regions serving, in general, as a proxy for the actual density p(ω). The problem is that of lack of invariance under parameter transformation. Even if ν = g(ω) is a one-to-one transformation, it is easy to see that there is no general relation between HPD regions for ω and ν. Moreover, there is no way of identifying a marginal HPD region for a (possibly transformed) subset of components of ω from knowledge of the joint HPD region. For further discussion of credible regions see, for example, Pratt (1961), Aitchison (1964, 1966), Wright (1986) and DasGupta (1991).

Hypothesis Testing

The basic hypothesis testing problem usually considered by statisticians may be described as a decision problem with elements

A = {a0, a1},  ω = θ ∈ Θ = Θ0 ∪ Θ1,

together with p(ω), where θ is the parameter labelling a parametric model, p(z | θ), where aj, j ∈ {0, 1}, corresponds to rejecting hypothesis H0 (H1), and where the loss function l(aj, θ) reflects the relative seriousness of the four possible consequences of correct and incorrect actions. Inferences about a specific hypothesised value ω0 of a random quantity ω in the absence of alternative hypothesised values are often considered in the general statistical literature under the heading of "significance testing". We shall discuss this further in Chapter 6. Clearly, the main motivation and the principal use of the hypothesis testing framework is in model choice and comparison, an activity which has a somewhat different flavour from decision-making and inference within the context of an accepted model. For this reason, we shall postpone a detailed consideration of the
topic until Chapter 6, where we shall provide a much more general perspective on model choice and criticism. General discussions of Bayesian hypothesis testing are included in Jeffreys (1939/1961), Good (1950, 1965, 1983), Lindley (1957, 1961b, 1965, 1977), Edwards et al. (1963), Pratt (1965), Smith (1965), Farrell (1968), Lempers (1971), Rubin (1971), Zellner (1971), Dickey (1971, 1974, 1977), DeGroot (1973), Leamer (1978), Box (1980), Shafer (1982b), Gilio and Scozzafava (1985), Smith (1986), Berger and Delampady (1987), Berger and Sellke (1987) and Hodges (1990, 1992).

5.1.6 Implementation Issues

Given a likelihood p(z | θ) and prior density p(θ), the starting point for any form of parametric inference summary or decision about θ is the joint posterior density

p(θ | z) = p(z | θ) p(θ) / ∫ p(z | θ) p(θ) dθ,

and the starting point for any predictive inference summary or decision about future observables y is the predictive density

p(y | z) = ∫ p(y | θ) p(θ | z) dθ.

It is clear that to form these posterior and predictive densities there is a technical requirement to perform integrations over the range of θ. Moreover, further summarisation, in order to obtain marginal densities, or marginal moments, of components of θ or y, or of transformations thereof, or expected utilities or losses in explicitly defined decision problems, will necessitate further integrations with respect to components of θ or y. The key problem in implementing the formal Bayes solution to inference reporting or decision problems is therefore seen to be that of evaluating the required integrals. In cases where the likelihood just involves a single parameter, implementation just involves integration in one dimension and is essentially trivial. However, in problems involving a multiparameter likelihood the task of implementation is anything but trivial, since, if θ has k components, two k-dimensional integrals are required just to form p(θ | z) and p(y | z); (k - 1)-dimensional integrals are required to obtain univariate marginal density values or moments; (k - 2)-dimensional integrals are required to obtain bivariate marginal densities, and so on. Clearly, if k is at all large, the problem of implementation will, in general, lead to challenging technical problems, requiring simultaneous analytic or numerical approximation of a number of multidimensional integrals.

The above discussion has assumed a given specification of a likelihood and prior density function. However, as we have seen in Chapter 4, although a specific mathematical form for the likelihood in a given context is very often implied
or suggested by consideration of symmetry, sufficiency or experience, the mathematical specification of prior densities is typically more problematic. Some of the problems involved, such as the pragmatic strategies to be adopted in translating actual beliefs into mathematical form, relate more to practical methodology than to conceptual and theoretical issues and will not be discussed in detail in this volume. However, many of the other problems of specifying prior densities are closely related to the general problems of implementation described above, as exemplified by the following questions:

(i) given that, for any specific beliefs, there is some arbitrariness in the precise choice of the mathematical representation of a prior density, are there choices which enable the integrations required to be carried out straightforwardly and hence permit the tractable implementation of a range of analyses, thus facilitating the kind of interpersonal analysis and scientific reporting referred to in Chapter 4?

(ii) if the information to be provided by the data is known to be far greater than that implicit in an individual's prior beliefs, is there any necessity for a precise mathematical representation of the latter, or can a Bayesian implementation proceed purely on the basis of this qualitative understanding?

(iii) either in the context of interpersonal analysis, or as a special form of actual individual analysis, is there a formal way of representing the beliefs of an individual whose prior information is to be regarded as minimal, relative to the information provided by the data?

(iv) for general forms of likelihood and prior density, are there analytic/numerical techniques available for approximating the integrals required for implementing Bayesian methods?

Question (i) will be answered in Section 5.2, where the concept of a conjugate prior density will be introduced. Question (ii) will be answered in part at the end of Section 5.2 and in more detail in Section 5.3, where an approximate "large sample" Bayesian theory involving asymptotic posterior normality will be presented. Question (iii) will be answered in Section 5.4, where the information-based concept of a reference prior density will be introduced. An extended historical discussion of this celebrated philosophical problem of how to represent "ignorance" will be given in Section 5.6.2. Question (iv) will be answered in Section 5.5, where classical applied analysis techniques such as Laplace's approximation for integrals will be briefly reviewed in the context of implementing Bayesian inference and decision summaries, together with classical numerical analytical techniques such as Gauss-Hermite quadrature, and stochastic simulation techniques such as importance sampling, sampling-importance-resampling and Markov chain Monte Carlo.
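As a one-dimensional illustration of question (iv), consider the following sketch (our own, with an assumed Be(2, 2) prior and Bernoulli data): the normalising constant and posterior mean are computed by elementary midpoint-rule quadrature and can be checked against the conjugate closed form (a + r)/(a + b + n) = 9/14.

```python
# Sketch: in one dimension the integrals defining the posterior can be
# handled by simple quadrature.  Assumed example: 7 successes in 10
# Bernoulli trials with a Be(2, 2) prior.

r, n = 7, 10                        # observed successes / trials
a, b = 2.0, 2.0                     # Be(2, 2) prior hyperparameters

def integrand(theta, extra=0):
    prior = theta**(a - 1) * (1 - theta)**(b - 1)
    lik = theta**r * (1 - theta)**(n - r)
    return theta**extra * lik * prior

m = 20000                           # midpoint rule on (0, 1)
z = sum(integrand((i + 0.5) / m) for i in range(m)) / m        # normalising constant
mean = sum(integrand((i + 0.5) / m, extra=1) for i in range(m)) / m / z

print(mean)                         # conjugate answer: (a + r)/(a + b + n) = 9/14
```

In higher dimensions this direct approach breaks down, which is precisely why the analytic and stochastic techniques listed above become necessary.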
5.2 CONJUGATE ANALYSIS

5.2.1 Conjugate Families

The first issue raised at the end of Section 5.1.6 is that of tractability: for what choices of p(θ) are integrals such as

∫ p(z | θ) p(θ) dθ

easily evaluated analytically? However, in addition to tractability, since any particular mathematical form of p(θ) is acting as a representation of beliefs, either of an actual individual, or as part of a stylised sensitivity study involving a range of prior to posterior analyses, we require that the class of mathematical functions from which p(θ) is to be chosen be both rich in the forms of beliefs it can represent and also facilitate the matching of beliefs to particular members of the class. Tractability can be achieved by noting that, since Bayes' theorem may be expressed in the form

p(θ | x) ∝ p(x | θ) p(θ),

both p(θ | x) and p(θ) can be guaranteed to belong to the same general family of mathematical functions by choosing p(θ) to have the same "structure" as p(x | θ), when the latter is viewed as a function of θ. However, as stated, this is a rather vacuous idea,
since p(θ | x) and p(θ) would always belong to the same "general family" of functions if the latter were suitably defined. To achieve a more meaningful version of the underlying idea, let us first recall (from Section 4.5) that if t = t(x) is a sufficient statistic we have

p(θ | x) = p(θ | t),

so that we can restate our requirement for tractability in terms of p(θ) having the same structure as p(t | θ), when the latter is viewed as a function of θ. However, without further constraint on the nature of the sequence of sufficient statistics, the class of possible functions p(θ) is too large to permit easily interpreted matching of beliefs to particular members of the class. This suggests that it is only in the case of likelihoods admitting sufficient statistics of fixed dimension that we shall be able to identify a family of prior densities which ensures both tractability and ease of interpretation. This motivates the following definition.

Definition 5.6. (Conjugate prior family). The conjugate family of prior densities for θ ∈ Θ, with respect to a likelihood p(x | θ) with sufficient statistic t = t(x) = {n, s(x)} of fixed dimension k independent of that of x, is

{ p(θ | τ), τ = (τ0, τ1, ..., τk) },  where  p(θ | τ) ∝ p(s(x) = (τ1, ..., τk) | n = τ0, θ),

the normalising constant being assumed finite, so that each p(θ | τ) is a proper density.
From Section 4.5 and Definition 5.6, it follows that the likelihoods for which conjugate prior families exist are those corresponding to general exponential family parametric models (Definitions 4.10 and 4.11). The exponential family model is referred to as regular or non-regular, respectively, according as X does not or does depend on θ.

Proposition 5.4. (Conjugate families for regular exponential families). If x = (x1, ..., xn) is a random sample from a regular exponential family distribution such that

p(x | θ) = f(x) g(θ) exp{ Σ_{j=1}^{k} c_j φ_j(θ) h_j(x) },

then the conjugate family for θ has the form

p(θ | τ) = [K(τ)]^{-1} [g(θ)]^{τ0} exp{ Σ_{j=1}^{k} c_j φ_j(θ) τ_j },

where τ = (τ0, τ1, ..., τk) is such that

K(τ) = ∫ [g(θ)]^{τ0} exp{ Σ_{j=1}^{k} c_j φ_j(θ) τ_j } dθ < ∞.

Proof. By Proposition 4.10 (the Neyman factorisation criterion), the sufficient statistics for θ have the form

t(x) = [n, s(x)] = [n, Σ_i h_1(x_i), ..., Σ_i h_k(x_i)],

so that, by Definition 5.6, for any τ = (τ0, ..., τk) such that ∫ p(θ | τ) dθ < ∞, a conjugate prior density has the form

p(θ | τ) = p(s_1(x) = τ_1, ..., s_k(x) = τ_k | n = τ0, θ) / K(τ) ∝ [g(θ)]^{τ0} exp{ Σ_j c_j φ_j(θ) τ_j }.
Example 5.4. (Bernoulli likelihood; beta prior). The Bernoulli likelihood has the form

p(x1, ..., xn | θ) = θ^{r_n} (1 - θ)^{n - r_n} = (1 - θ)^n exp{ log(θ/(1 - θ)) r_n },  r_n = x1 + ... + xn,

so that, by Proposition 5.4, the conjugate prior density for θ is given by

p(θ | τ0, τ1) ∝ (1 - θ)^{τ0} exp{ log(θ/(1 - θ)) τ1 } = θ^{τ1} (1 - θ)^{τ0 - τ1},

assuming the existence of K(τ0, τ1). Writing α = τ1 + 1 and β = τ0 - τ1 + 1, and comparing with the definition of a beta density, we have

p(θ | τ0, τ1) = p(θ | α, β) = Be(θ | α, β) ∝ θ^{α-1} (1 - θ)^{β-1},  0 ≤ θ ≤ 1,  α > 0,  β > 0.

Example 5.5. (Poisson likelihood; gamma prior). The Poisson likelihood has the form

p(x1, ..., xn | θ) = ∏_{i=1}^{n} θ^{x_i} e^{-θ} / x_i! ∝ exp(-nθ) exp{ (log θ) r_n },  r_n = x1 + ... + xn,

so that, by Proposition 5.4, the conjugate prior density for θ is given by

p(θ | τ0, τ1) ∝ exp(-τ0 θ) exp{ (log θ) τ1 } = θ^{τ1} e^{-τ0 θ},

assuming the existence of K(τ0, τ1). Writing α = τ1 + 1 and β = τ0, and comparing with the definition of a gamma density, we have

p(θ | τ0, τ1) = p(θ | α, β) = Ga(θ | α, β) ∝ θ^{α-1} e^{-βθ},  θ > 0,  α > 0,  β > 0.
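The hyperparameter updating implicit in these two examples can be sketched in a few lines (the function names are ours, not the book's): Bernoulli data update Be(α, β) to Be(α + r_n, β + n - r_n), and Poisson data update Ga(α, β) to Ga(α + r_n, β + n).

```python
# Sketch of the conjugate updates for the Bernoulli/beta and Poisson/gamma
# pairs: posterior hyperparameters are the prior hyperparameters plus the
# sufficient statistics.

def beta_update(a, b, xs):
    """Bernoulli data, Be(a, b) prior -> Be(a + r, b + n - r) posterior."""
    r, n = sum(xs), len(xs)
    return a + r, b + n - r

def gamma_update(a, b, xs):
    """Poisson data, Ga(a, b) prior -> Ga(a + sum(xs), b + n) posterior."""
    return a + sum(xs), b + len(xs)

print(beta_update(1, 1, [1, 0, 1, 1]))   # -> (4, 2)
print(gamma_update(2, 1, [3, 0, 2]))     # -> (7, 4)
```

This additive structure is exactly the "closed under sampling" property established in general below.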
5. (Conjugateanalysisfor regular exponentialfamilies).2. The following proposition demonstrates that. Proposition 5.2 Canonical Conjugate Analysis Conjugate prior density families were motivated by considerations of tractability in implementing the Bayesian paradigm.2 Conjugate Analysis 269 5. in the case of regular exponential family likelihoodsand conjugate prior densities.5. a .4: (i) the posterior density for 8 is which proves (ii). the analytic forms of the joint posterior and predictive densities which underlie any form of inference summary or decision making are easily identified. For the exponentid family likelihood and conjugate prior density of Proposition 5.
.5. }(I ..  + +  Example 5.rI t . (continued). and. = .H)’” x ex[> {log ( )H@ I TI (/!)} . combining it directly with the familiar beta prior density. With the Bernoulli likelihood written in its explicit expothe nential family form.5(i) establishes that the conjugate family is closed under sumpling. we could proceed on the basis of the original representation of the Bernoulli likelihood. simply defined.4. T!)... .’(# I Tit. is given by [ I ( @ 12.(z)). = c j ( 1 .5(ii) establishes that a similar.4. additive finitedimensionaltransformation. 71.. 71) x /dz I #)I. and writing r. a concept which seems to be due to G.r.o.. . to a simple. posterior density corresponding to the conjugate prior density. again. ~ ) xp xH’”(j ())‘‘‘f’@I + ’(I 0)‘ I where (k. family of distributions.*j) ( z l O ) p ( f l l o .. to provide some preliminary insights into the prior posterior predictive process described by Proposition 5. .. .. The inference process defined by Bayes’ theorem is therefore reduced from the essentially infinitedimensionalproblem of the transformation of density functions. we shall illustrate the general results by reconsidering Example 5. TI 1 x ( 1 . Alternatively. showing explicitly how the inference process reduces to the updating of the prior to posterior hyperparameters by the addition of the sufficient statistics. Be(H I (I. the process reduces to the updating of the + prior to posterior hyperparameters. Proposition 5.H)” rxp {log () 10 H r. Barnard. where q1( ) = n 11. = o + r. ..5 Inference Proposition 5. . simplifyingclosure property holds for predictive densities. with respect to the corresponding exponential family likelihood. j). \o that p(fllz. T rt.. 11 and r . the inference process being totally defined by the mapping T under which the labelling parameters of the prior density are simply modified by the addition of the values of the sufficient statistic to form the labelling parameter of the posterior distribution. However. A. p(B I GI. (T t..j.. 
The forms arising in the conjugate analysis of a number of standard exponential family forms are summarised in Appendix A.
Clearly, the two notational forms and procedures used in the example are equivalent. Using the standard exponential family form has the advantage of displaying the simple hyperparameter updating by the addition of the sufficient statistics; however, the second form seems much less cumbersome notationally, and is more transparently interpretable and memorable in terms of the beta density. In general, when analysing particular models we shall work in terms of whatever functional representation seems best suited to the task in hand.

Example 5.4. (continued). Instead of working with the original Bernoulli likelihood, we could, of course, work with a likelihood defined in terms of the sufficient statistic (r_n, n). If, for instance, either n or r_n were ancillary, we would use one or other of p(r_n | n, θ) or p(n | r_n, θ) and, in particular, taking the binomial form,

Bi(r_n | θ, n) ∝ θ^{r_n}(1 − θ)^{n − r_n},

the prior to posterior operation defined by Bayes' theorem can, in either case, be simply expressed, in terms of the notation introduced in Section 3.2, as

p(θ | r_n, n, α, β) ∝ Bi(r_n | θ, n) Be(θ | α, β) ∝ Be(θ | α + r_n, β + n − r_n).

The predictive density for future Bernoulli observables, which we denote by y = (y1, ..., ym), is also easily derived. If we were interested in the predictive density for r*_m = y1 + ... + ym, it easily follows that

p(r*_m | z) = ∫ p(r*_m | θ) p(θ | α + r_n, β + n − r_n) dθ,

a result which also could be obtained directly from Proposition 5.5(ii).
Comparison with Section 3.2.2 reveals this predictive density to have the binomial-beta form,

p(r*_m | z) = Bb(r*_m | α + r_n, β + n − r_n, m).

The particular case m = 1 is of some interest: p(r*_1 = 1 | z) is then the predictive probability assigned to a success on the (n + 1)th trial. Using the fact that Γ(t + 1) = tΓ(t) and recalling, from Section 3.2.2, the form of the mean of a beta distribution, this evaluation becomes (α + r_n)/(α + β + n). With respect to quadratic loss, E(θ | n, r_n) = (α + r_n)/(α + β + n) is the optimal estimate of θ given current information, and the above result demonstrates that this should also serve as the evaluation of the probability of a success on the next trial, given r_n observed successes in the first n trials and an initial Be(θ | α, β) belief about the limiting relative frequency of successes. In the case α = β = 1, this evaluation becomes (r_n + 1)/(n + 2), which is the celebrated Laplace's rule of succession (Laplace, 1812), which has served historically to stimulate considerable philosophical debate about the nature of inductive inference. For an elementary, but insightful, account of Bayesian inference for the Bernoulli case, see Lindley and Phillips (1976). We shall consider this problem further in Example 5.16.
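Again as an editorial sketch (not from the original text), the predictive quantities just derived, the posterior-mean success probability, the binomial-beta predictive, and Laplace's rule of succession as the special case α = β = 1, can be checked numerically; the function names are ours.

```python
from math import lgamma, exp

def predictive_success(alpha, beta, r, n):
    """Predictive probability of success on the (n+1)th trial: the posterior mean."""
    return (alpha + r) / (alpha + beta + n)

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def binomial_beta_pmf(k, alpha, beta, m):
    """Bb(k | alpha, beta, m): predictive pmf of k successes in m future trials."""
    log_choose = lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)
    return exp(log_choose + log_beta_fn(alpha + k, beta + m - k) - log_beta_fn(alpha, beta))

laplace = predictive_success(1, 1, r=7, n=10)     # Laplace's rule: (7 + 1)/(10 + 2)
total = sum(binomial_beta_pmf(k, 8.0, 4.0, 5) for k in range(6))   # pmf sums to 1
one_trial = binomial_beta_pmf(1, 2.0, 3.0, 1)     # equals alpha/(alpha + beta) = 0.4
```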
In presenting the basic ideas of conjugate analysis, we used the following notation for the k-parameter exponential family and corresponding prior form:

p(x | θ) = f(x) g(θ) exp{Σ_{j=1}^k c_j φ_j(θ) h_j(x)}

and

p(θ | τ) = [K(τ)]^{−1} [g(θ)]^{τ0} exp{Σ_{j=1}^k c_j φ_j(θ) τ_j},

the latter being defined for τ = (τ0, τ1, ..., τ_k) such that K(τ) < ∞. From a notational perspective (cf. Definition 4.12), we can obtain considerable simplification by defining y_j = h_j(x) and ψ_j = c_j φ_j(θ), so that these forms become

p(y | ψ) = a(y) exp{y'ψ − b(ψ)},  y ∈ Y, ψ ∈ Ψ,

and

p(ψ | n0, y0) = c(n0, y0) exp{n0 y0'ψ − n0 b(ψ)},

for appropriately defined Y, Ψ and real-valued functions a, b and c. We shall refer to these as the canonical (or natural) forms of the exponential family and its conjugate prior family.
In order for p(ψ | n0, y0) to be a proper density, we require n0 > 0. We shall typically assume that Ψ consists of all ψ such that ∫ p(y | ψ) dy = 1, and that b(ψ) is continuously differentiable and strictly convex throughout the interior of Ψ; for continuous exponential families the situation is somewhat more complicated (see Diaconis and Ylvisaker, 1979, for details). The motivation for choosing n0, y0 as notation for the prior hyperparameters is partly clarified by the following proposition, and becomes even clearer in the context of Proposition 5.7.

Proposition 5.6. (Canonical conjugate analysis). If y1, ..., yn are the values of y resulting from a random sample of size n from the canonical exponential family parametric model, p(y | ψ), then the posterior density corresponding to the canonical conjugate form, p(ψ | n0, y0), is given by

p(ψ | y1, ..., yn, n0, y0) = p(ψ | n0 + n, (n0 y0 + n ȳ_n)/(n0 + n)),

where ȳ_n = n^{−1}(y1 + ... + yn).

Example 5.4. (continued). In the case of the Bernoulli parametric model, we have seen earlier that the pairing of the parametric model and conjugate prior can be expressed as

p(x | θ) = (1 − θ) exp{x log(θ/(1 − θ))}

together with the beta prior. The canonical forms in this case are obtained by setting

y = x,  ψ = log(θ/(1 − θ)),  a(y) = 1,  b(ψ) = log(1 + e^ψ).
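The canonical updating of Proposition 5.6 amounts to one line of arithmetic; the following sketch (our notation, not the book's) applies it to the Bernoulli case, where n0 plays the role of α + β and y0 the role of the prior mean α/(α + β).

```python
def canonical_update(n0, y0, ys):
    """(n0, y0) -> (n0 + n, (n0*y0 + n*ybar)/(n0 + n)), as in Proposition 5.6."""
    n = len(ys)
    ybar = sum(ys) / n
    return n0 + n, (n0 * y0 + n * ybar) / (n0 + n)

# n0 = 4, y0 = 0.5 plays the role of a Be(theta | 2, 2) prior; the data mean
# here happens to equal the prior mean, so the weighted average is unchanged.
n1, y1 = canonical_update(4.0, 0.5, [1, 0, 0, 1, 1, 0])
```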
The posterior distribution of the canonical parameter ψ is now immediately given by Proposition 5.6.

Example 5.5. (continued). In the case of the Poisson parametric model, we have seen earlier that the pairing of the parametric model and conjugate form can be expressed as the Poisson likelihood together with its gamma conjugate prior. The canonical forms in this case are obtained by setting y = x, ψ = log λ, a(y) = 1/y! and b(ψ) = e^ψ, and the posterior distribution of the canonical parameter ψ is again immediately given by Proposition 5.6.

Example 5.6. (continued). In the case of the normal parametric model, we have seen earlier that the pairing of the parametric model and conjugate form can be expressed as the normal likelihood together with its normal-gamma conjugate prior. The canonical forms in this case are obtained by an analogous reparametrisation, and the posterior distribution of the canonical parameters ψ = (ψ1, ψ2) is now immediately given by Proposition 5.6.

From a theoretical perspective, the canonical representation often provides valuable unifying insight, as in Proposition 5.6, where the economy of notation makes it straightforward to demonstrate that the learning process just involves a simple weighted average, (n0 y0 + n ȳ_n)/(n0 + n), of prior and sample information. For specific applications, however, the choice of the representation of the parametric model and conjugate prior forms is typically guided by the ease of interpretation of the parametrisations adopted; Example 5.6 above suffices to demonstrate that the canonical forms may be very unappealing. Again using the canonical forms, we can give a more precise characterisation of this weighted average.
Namely, we shall show that the posterior expectation of ∇b(ψ) is a weighted average of a prior estimate and a sample-based estimate.

Proposition 5.7. (Weighted average form of posterior expectation). If y1, ..., yn are the values of y resulting from a random sample of size n from the canonical exponential family parametric model, then

E[∇b(ψ) | n0, y0, y1, ..., yn] = (n0 y0 + n ȳ_n)/(n0 + n).

Proof. By Proposition 5.6, it suffices to prove that E[∇b(ψ) | n0, y0] = y0. But

0 = ∫ ∇p(ψ | n0, y0) dψ = ∫ n0 (y0 − ∇b(ψ)) p(ψ | n0, y0) dψ,

and hence E[∇b(ψ) | n0, y0] = y0. Applying the same argument to the posterior given by Proposition 5.6 establishes the result.

Proposition 5.7 reveals that, in this natural conjugate setting, the posterior expectation of ∇b(ψ), that is, its Bayes estimate with respect to quadratic loss (see Proposition 5.2), is a weighted average of y0 and ȳ_n. The former is the prior estimate of ∇b(ψ); the latter can be viewed as an intuitively "natural" sample-based estimate of ∇b(ψ), since E(y | ψ) = E(ȳ_n | ψ) = ∇b(ψ). For any given prior hyperparameters, as the sample size n becomes large, the weight, n/(n0 + n), attached to ȳ_n tends to one and the sample-based information dominates the posterior.

In this context, we make an important point alluded to in our discussion of "objectivity and subjectivity" in Section 4.8.2: in the stylised setting of a group of individuals agreeing on an exponential family parametric form, but assigning different conjugate priors, a sufficiently large sample will lead to more or less identical posterior beliefs. Statements based on the latter might well, in common parlance, be claimed to be "objective". One should always be aware, however, that this is no more than a conventional way of indicating a subjective consensus, resulting from a large amount of data processed in the light of a central core of shared assumptions.
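Returning to Proposition 5.7, the weighted-average identity can be checked numerically in the Bernoulli/beta case, where ∇b(ψ) = θ, n0 = α + β and y0 = α/(α + β); the crude midpoint integration below, with arbitrary illustrative numbers, is our own sketch.

```python
def beta_mean_numeric(a, b, grid=50000):
    """Mean of Be(theta | a, b) by midpoint integration (no external libraries)."""
    h = 1.0 / grid
    norm = mean = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        w = t ** (a - 1) * (1 - t) ** (b - 1)   # unnormalised beta density
        norm += w
        mean += t * w
    return mean / norm

alpha, beta, n, r = 3.0, 5.0, 20, 12
n0, y0, ybar = alpha + beta, alpha / (alpha + beta), r / n

posterior_mean = beta_mean_numeric(alpha + r, beta + n - r)
weighted = (n0 * y0 + n * ybar) / (n0 + n)   # = 15/28, Proposition 5.7
```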
Proposition 5.7 shows that conjugate priors for exponential family parameters imply that posterior expectations are linear functions of the sufficient statistics. It is interesting to ask whether other forms of prior specification can also lead to linear posterior expectations; or, more generally, whether knowing or constraining posterior moments to be of some simple algebraic form suffices to characterise possible families of prior distributions. These kinds of questions are considered in detail in, for example, Diaconis and Ylvisaker (1979) and Goel and DeGroot (1980). In particular, it can be shown that, for continuous exponential families, under some regularity conditions, linearity of the posterior expectation does imply that the prior must be conjugate.

The weighted average form of posterior mean obtained in Proposition 5.7 makes clear that the prior parameter n0 plays a role analogous to that of the sample size n: the weight n0/(n0 + n) is attached to the prior mean y0, and the weight n/(n0 + n) to the data mean ȳ_n, in the posterior expectation of ∇b(ψ). The choice of an n0 which is large relative to n thus implies that the prior will dominate the data in determining the posterior (see, however, Section 5.6.3 for illustration of why a weighted-average form might not be desirable); conversely, the choice of an n0 which is small relative to n ensures that the form of the posterior is essentially determined by the data. In particular, this suggests that a tractable analysis which "lets the data speak for themselves" can be obtained by letting n0 → 0. The choice n0 = 0, however, typically implies a form of p(ψ | n0, y0) which does not integrate to unity (a so-called improper density) and thus cannot be interpreted as representing an actual prior belief.
The following example illustrates this use of limiting, improper conjugate priors in the context of the Bernoulli parametric model with beta conjugate prior.

Example 5.4. (continued). Using standard rather than canonical forms for the parametric model and prior density, we have seen that if r_n denotes the number of successes in n Bernoulli trials, then the conjugate beta prior density, Be(θ | α, β), leads to a Be(θ | α + r_n, β + n − r_n) posterior for θ, the limiting relative frequency of successes. This posterior has expectation

E(θ | r_n, n) = (α + r_n)/(α + β + n) = w_n (α/(α + β)) + (1 − w_n)(r_n/n),

where w_n = (α + β)/(α + β + n), providing a weighted average between the prior mean for θ and the frequency estimate provided by the data. In this notation, α + β plays the role of n0, analogous to the sample size n and also appearing explicitly in the prior to posterior updating process given in Proposition 5.5, and n0 → 0 corresponds to α → 0, β → 0.
The choice α = β = 0, which implies a Be(θ | r_n, n − r_n) posterior for θ, would correspond to the limiting prior form

p(θ | α = 0, β = 0) ∝ θ^{−1}(1 − θ)^{−1},

which is not a proper density. However, it is certainly convenient to make formal use of Bayes' theorem with this improper form playing the role of a prior, since

p(r_n, n | θ) p(θ | α = 0, β = 0) ∝ θ^{r_n}(1 − θ)^{n − r_n} θ^{−1}(1 − θ)^{−1} ∝ Be(θ | r_n, n − r_n).

Clearly, any choice of α, β small compared with r_n and n − r_n (for example, α = β = 1 for typical values of r_n, n) will lead to an almost identical posterior distribution for θ. As a technique for arriving at an approximation to the posterior distribution, this is unobjectionable. It is important to recognise, however, that this is merely an approximation device and in no way justifies regarding p(θ | α = 0, β = 0) as having any special significance as a representation of "prior ignorance". At an intuitive level, it might be argued that the choice α = β = 1, which implies a uniform prior density for θ, represents "complete ignorance" about θ, and that this should, presumably, entail "complete ignorance" about any function g(θ) of θ. A further problem of interpretation then arises, however: p(θ) uniform implies that p(g(θ)) is not uniform. This makes it clear that ad hoc intuitive notions of "ignorance", or of what constitutes a "noninformative" prior distribution (in some sense), cannot be relied upon. There is a need for a more formal analysis of the concept, and this will be given in Section 5.4.
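The claim that small α, β give almost the same posterior as the improper limit α = β = 0 is easy to check; the numbers below are our own illustration, comparing posterior means and standard deviations.

```python
def beta_mean_sd(a, b):
    """Mean and standard deviation of Be(theta | a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

r, n = 7, 20
m0, s0 = beta_mean_sd(r, n - r)                   # limiting "prior" alpha = beta = 0
m1, s1 = beta_mean_sd(0.01 + r, 0.01 + n - r)     # small but proper alpha = beta = 0.01
```

The two posteriors are practically indistinguishable, as the assertions below confirm.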
Proposition 5.2 established the general forms of Bayes estimates for some commonly used loss functions, and Proposition 5.7 provided further insight into the (posterior mean) form arising from quadratic loss in the case of an exponential family parametric model with conjugate prior. The following development, based closely on Gutiérrez-Peña (1992), provides further insight into how the posterior mode can be justified as a Bayes estimate. We recall the canonical forms of the k-parameter exponential family and its corresponding conjugate prior:

p(y | ψ) = a(y) exp{y'ψ − b(ψ)},  y ∈ Y, ψ ∈ Ψ,

and

p(ψ | n0, y0) = c(n0, y0) exp{n0 y0'ψ − n0 b(ψ)},

for appropriately defined Y, Ψ and real-valued functions a, b and c.
As a final preliminary, recall the logarithmic divergence between two distributions p(x | θ) and p(x | θ0). For the canonical exponential family, the divergence of p(y | ψ0) from p(y | ψ) is

δ(ψ | ψ0) = ∫ p(y | ψ) log{p(y | ψ)/p(y | ψ0)} dy.

Consider p(ψ | n0, y0) and, for s > 0 and t = (t1, ..., tk), define

d(s, t) = log ∫ exp{t'ψ − s b(ψ)} dψ,

so that c(s, t) = exp{−d(s, t)} and, in particular, c(n0, y0) = exp{−d(n0, n0 y0)}; further define d_i(s, t) = ∂d(s, t)/∂t_i, for i = 1, ..., k. We can now establish the following technical results.

Proposition 5.8. (Logarithmic divergence between conjugate distributions). With respect to the canonical form of the k-parameter exponential family and its corresponding conjugate prior:

(i) δ(ψ | ψ0) = b(ψ0) − b(ψ) + (ψ − ψ0)'∇b(ψ);

(ii) E[δ(ψ | ψ0)] = b(ψ0) − E[b(ψ)] + E[ψ'∇b(ψ)] − ψ0' E[∇b(ψ)],

where the expectations are taken with respect to p(ψ | n0, y0), and E[∇b(ψ) | n0, y0] = y0.

Proof. From the definition of logarithmic divergence, and recalling that E(y | ψ) = ∇b(ψ), we see that

δ(ψ | ψ0) = ∫ p(y | ψ){(ψ − ψ0)'y + b(ψ0) − b(ψ)} dy = b(ψ0) − b(ψ) + (ψ − ψ0)'∇b(ψ),

and (i) follows; (ii) then follows on taking expectations. Moreover, differentiation of the identity

∫ exp{t'ψ − s b(ψ)} dψ = exp{d(s, t)}

with respect to t and s, interchanging the order of differentiation and integration, establishes straightforwardly that E[ψ_i] = d_i(s, t) and E[b(ψ)] = −∂d(s, t)/∂s, evaluated at s = n0, t = n0 y0; together with E[∇b(ψ)] = y0 (Proposition 5.7), these provide explicit expressions for the terms appearing in (ii).
This result now enables us to establish easily the main result of interest.

Proposition 5.9. (Conjugate posterior modes as Bayes estimates). With respect to the loss function l(ψ0, ψ) = δ(ψ | ψ0), the Bayes estimate for ψ, derived from independent observations y1, ..., yn from the canonical k-parameter exponential family p(y | ψ) and corresponding conjugate prior p(ψ | n0, y0), is the posterior mode, ψ*, which satisfies

∇b(ψ*) = (n0 + n)^{−1}(n0 y0 + n ȳ_n).

Proof. We note first (see the proof of Proposition 5.6) that the logarithm of the posterior density is given by

(n0 y0 + n ȳ_n)'ψ − (n0 + n) b(ψ) + constant,

from which the claimed estimating equation for the posterior mode, ψ*, is immediately obtained. The result now follows by noting that the same equation arises in the minimisation of (ii) of Proposition 5.8, with n0 + n replacing n0, and n0 y0 + n ȳ_n replacing n0 y0.

For a recent discussion of conjugate priors for exponential families, see Consonni and Veronese (1992b). Diaconis and Ylvisaker (1979) highlight the fact that, in complex problems, conjugate priors may have strong, unsuspected implications; for an example, see Dawid (1988a).

5.2.3 Approximations with Conjugate Families

Our main motivation in considering conjugate priors for exponential families has been to provide tractable prior to posterior (or predictive) analysis. At the same time, we might hope that the conjugate family for a particular parametric model would contain a sufficiently rich range of prior density "shapes" to enable one to approximate reasonably closely any particular actual prior belief function of interest. The next example shows that this might well not be the case. However, it also indicates how, with a suitable extension of the conjugate family idea, we can achieve both tractability and the ability to approximate closely any actual beliefs.

Example 5.7. (The spun coin). Let us consider the repeated spinning, under perceived "identical conditions", of a given coin, about which we have no specific information beyond the general background set out above.
Whereas a tossed coin typically generates equal long-run frequencies of heads and tails, this is not at all the case if a coin is spun on its edge. Experience suggests that these long-run frequencies often turn out for some coins to be in the ratio 2:1 or 1:2, and for other coins even as extreme as 1:4, while some coins do appear to behave symmetrically.
In the light of the information given, suppose we judge the sequence of outcomes to be exchangeable, so that a Bernoulli parametric model,

p(x | θ) = θ^x (1 − θ)^{1−x},  x = 0, 1,

together with a prior density p(θ) for θ, the long-run frequency of heads, completely specifies our belief model. How might we represent this prior density mathematically? We are immediately struck by two things: first, the conjugate family for the Bernoulli parametric model is the beta family (see Example 5.4), which does not contain bimodal densities; secondly, any realistic prior shape will be at least bimodal, and possibly trimodal. It appears, therefore, that an insistence on tractability, in the sense of restricting ourselves to conjugate priors, would preclude an honest prior specification. However, we can easily generate multimodal shapes by considering mixtures of beta densities,

p(θ) = Σ_{i=1}^m π_i Be(θ | α_i, β_i),

with mixing weights π_i > 0, π_1 + ... + π_m = 1, attached to a selection of conjugate densities. Figure 5.2 displays the prior density resulting from the mixture

0.5 Be(θ | 10, 20) + 0.2 Be(θ | 15, 15) + 0.3 Be(θ | 20, 10),

which, among other things, reflects a judgement that about 20% of coins seem to behave symmetrically and most of the rest tend to lead to 2:1 or 1:2 ratios, with somewhat more of the latter than the former.

Suppose now that we observe n outcomes z = (x1, ..., xn), and write r_n = x1 + ... + xn. Considering the general mixture prior form, we easily see from Bayes' theorem that the resulting posterior density
is itself a mixture of m beta components,

p(θ | z) = Σ_{i=1}^m π_i* Be(θ | α_i + r_n, β_i + n − r_n),

where

π_i* = π_i p_i(z) / Σ_{j=1}^m π_j p_j(z),

and p_i(z) denotes the (binomial-beta) marginal likelihood of the data under the ith component. This establishes that the general mixture class of beta densities is closed under sampling with respect to the Bernoulli model. The suggested prior density corresponds to m = 3, π = (0.5, 0.2, 0.3), α = (10, 15, 20), β = (20, 15, 10). In the case considered above, suppose that the spun coin results in 3 heads after 10 spins and 14 heads after 50 spins. Detailed calculation yields: for n = 10, π* = (0.77, 0.16, 0.07), α* = (13, 18, 23), β* = (27, 22, 17); for n = 50, π* = (0.90, 0.09, 0.006), α* = (24, 29, 34), β* = (56, 51, 46); and the resulting posterior densities are shown in Figure 5.2.

Figure 5.2. Prior and posteriors from a three-component beta mixture prior density.
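The updated weights and component parameters quoted above can be reproduced with a few lines of code (an editorial sketch; note that the common binomial coefficient cancels in the weight calculation):

```python
from math import lgamma, log, exp

def log_beta_fn(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def mixture_update(pis, alphas, betas, r, n):
    """Update a mixture of Be(alpha_i, beta_i) priors after r successes in n trials."""
    # log( pi_i * marginal likelihood of (r, n) under component i )
    logw = [log(p) + log_beta_fn(a + r, b + n - r) - log_beta_fn(a, b)
            for p, a, b in zip(pis, alphas, betas)]
    mx = max(logw)                      # stabilise before exponentiating
    w = [exp(v - mx) for v in logw]
    s = sum(w)
    return ([v / s for v in w],
            [a + r for a in alphas],
            [b + n - r for b in betas])

pi_star, a_star, b_star = mixture_update(
    (0.5, 0.2, 0.3), (10, 15, 20), (20, 15, 10), r=3, n=10)
```

For r = 3, n = 10 this reproduces α* = (13, 18, 23), β* = (27, 22, 17), with the weight of the first component growing at the expense of the third, in line with the values quoted above.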
This example demonstrates that, at least in the case of the Bernoulli parametric model and the beta conjugate family, the use of mixtures of conjugate densities both maintains the tractability of the analysis and provides a great deal of flexibility in approximating actual forms of prior belief. In fact, as we show in the following, the same is true for any exponential family model and corresponding conjugate family.

Proposition 5.10. (Mixtures of conjugate priors). Let z = (x1, ..., xn) be a random sample from a regular exponential family distribution, and let

p(θ) = Σ_{i=1}^m π_i p(θ | τ_i),

where, for i = 1, ..., m, the p(θ | τ_i) are elements of the corresponding conjugate family. Then the posterior density p(θ | z) and the predictive density are again mixtures of the corresponding conjugate and predictive forms, with component hyperparameters updated as in Proposition 5.5, and with weights proportional to π_i times the marginal likelihood of z under the ith component.

Proof. The result follows straightforwardly from Bayes' theorem and Proposition 5.5.
It is interesting to ask just how flexible mixtures of conjugate priors are. The answer is that any prior density for an exponential family parameter can be approximated arbitrarily closely by such a mixture, as shown by Dalal and Hall (1983) and Diaconis and Ylvisaker (1985). However, their analyses do not provide a constructive mechanism for building up such a mixture. In practice, therefore, we are left with having to judge when a particular tractable choice, typically a conjugate form, a mixture of conjugate forms, or a limiting conjugate form, is "good enough", in the sense that probability statements based on the resulting posterior will not differ radically from the statements that would have resulted from using a more honest, but difficult to specify or intractable, prior. The following result, in a much more general setting than that of conjugate mixtures, provides some guidance.

Proposition 5.11. (Prior approximation). (Dickey, 1976). Suppose that a belief model is defined by p(z | θ) and p(θ), θ ∈ Θ, and that q(θ) is a non-negative function such that

q(z) = ∫_Θ p(z | θ) q(θ) dθ < ∞,

where, for some Θ0 ⊆ Θ and constants α, β > 0,

(a) |p(θ)/q(θ) − 1| ≤ α, for all θ ∈ Θ0;
(b) p(θ)/q(θ) ≤ β, for all θ ∈ Θ.

Writing q(θ | z) = p(z | θ)q(θ)/q(z), let P = ∫_{Θ0} p(θ | z) dθ and Q = ∫_{Θ0} q(θ | z) dθ. Then:

(i) 1 − P ≤ β(1 − Q)/{(1 − α)Q};
(ii) (1 − α)Q ≤ p(z)/q(z) ≤ (1 + α)/P;
(iii) for all θ ∈ Θ, p(θ | z)/q(θ | z) ≤ β/{(1 − α)Q};
(iv) for all θ ∈ Θ0, (1 − α)P/(1 + α) ≤ p(θ | z)/q(θ | z) ≤ (1 + α)/{(1 − α)Q};
(v) for ε = max{(1 − P), (1 − Q)} and any f: Θ → ℝ such that |f(θ)| ≤ m,

|∫ f(θ) p(θ | z) dθ − ∫ f(θ) q(θ | z) dθ| ≤ m(2ε + γ),

where γ = max{(1 + α)/{(1 − α)Q} − 1, 1 − (1 − α)P/(1 + α)}.
Proof. Condition (a) implies that

(1 − α) q(z) Q ≤ p(z) P = ∫_{Θ0} p(z | θ) p(θ) dθ ≤ (1 + α) q(z) Q,

and (ii) clearly follows from this, on noting that P ≤ 1 and Q ≤ 1. Similarly, (b) implies that

p(z)(1 − P) = ∫_{Θ\Θ0} p(z | θ) p(θ) dθ ≤ β q(z)(1 − Q),

and (i) follows using the lower bound in (ii). For any θ ∈ Θ,

p(θ | z)/q(θ | z) = {p(θ)/q(θ)}{q(z)/p(z)},

so that (iii) follows from (b) and (ii), and (iv) from (a) and (ii). Finally, for any f with |f(θ)| ≤ m, splitting the integrals over Θ0 and its complement, bounding the difference over Θ0 by means of (iv), and bounding each of the two tail contributions by mε, proves (v).

Proposition 5.11 therefore asserts that if a mathematically convenient alternative, q(θ), to the would-be honest prior, p(θ), can be found, giving high posterior probability to a set Θ0 ⊆ Θ within which it provides a good approximation to p(θ), and such that it is nowhere orders of magnitude smaller than p(θ) outside Θ0, then q(θ) may reasonably be used in place of p(θ). More specifically, if Θ0 is a subset of Θ with high probability under q(θ | z), and α is chosen to be small and β not too large, then (i) implies that Θ0 also has high probability under p(θ | z), and (ii), (iv) and (v) establish that the respective predictive and posterior distributions, and also the posterior expectations of bounded functions, are very close. Furthermore, if f is taken to be the indicator function of any subset Θ* ⊆ Θ, (v) provides a bound on the inaccuracy of the posterior probability statement made using q(θ | z) rather than p(θ | z). In the case of Θ = ℝ, a frequently occurring situation is that in which the choice q(θ) = c, for some constant c, provides a convenient approximation; Figure 5.3 illustrates, in stylised form, typical conditions under which this is the case.
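As a numerical illustration of our own (not from the book), bounds (i) and (ii) can be checked for a Bernoulli likelihood with honest prior p(θ) = Be(θ | 3, 3) and the convenient uniform approximation q(θ) = 1, taking Θ0 = [0.2, 0.8]; all integrals are crude midpoint sums.

```python
def grid_integral(f, lo, hi, m=20000):
    """Midpoint-rule integral of f over [lo, hi]."""
    h = (hi - lo) / m
    return sum(f(lo + (i + 0.5) * h) for i in range(m)) * h

r, n = 6, 10
def lik(t): return t ** r * (1 - t) ** (n - r)
def p_prior(t): return 30 * t ** 2 * (1 - t) ** 2   # Be(3, 3) density
def q_prior(t): return 1.0                          # uniform approximation

pz = grid_integral(lambda t: lik(t) * p_prior(t), 0.0, 1.0)
qz = grid_integral(lambda t: lik(t) * q_prior(t), 0.0, 1.0)
P = grid_integral(lambda t: lik(t) * p_prior(t), 0.2, 0.8) / pz
Q = grid_integral(lambda t: lik(t) * q_prior(t), 0.2, 0.8) / qz

# (a): |p/q - 1| <= alpha on Theta_0; (b): p/q <= beta_c on all of Theta.
alpha = max(abs(p_prior(0.2 + 0.6 * i / 1000) - 1.0) for i in range(1001))
beta_c = 30 * 0.5 ** 4                              # max of the Be(3, 3) density
```

With these ingredients the inequalities of Proposition 5.11 can be verified directly.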
Figure 5.3. Typical conditions for precise measurement.

The second of the implementation questions posed at the end of Section 5.1.6 concerned the possibility of avoiding the need for a precise mathematical representation of the prior density in situations where the information provided by the data is far greater than that implicit in the prior. The above analysis goes some way to answering that question. In this situation of "precise measurement" (Savage, 1962), the likelihood is highly peaked relative to p(θ), and the choice q(θ) = c, for an appropriate constant c, which has little curvature in the region of non-negligible likelihood, clearly satisfies the conditions of Proposition 5.11: we obtain the normalised likelihood function as a convenient approximation to the posterior distribution. This answers the question in qualitative terms; the following section provides a more detailed analysis.

5.3 ASYMPTOTIC ANALYSIS

In Chapter 4, we saw that in representations of belief models for observables involving a parametric model p(x | θ) and a prior specification p(θ), the parameter θ acquired an operational meaning as some form of strong law limit of observables. Given observations z = (x1, ..., xn), the posterior distribution, p(θ | z), then describes beliefs about that strong law limit in the light of the information provided by x1, ..., xn.
We now wish to examine various properties of p(θ | z) as the number of observations increases. Intuitively, we would hope that beliefs about θ would become more and more concentrated around the "true" parameter value, the corresponding strong law limit. Under appropriate conditions, we shall see that this is, indeed, the case.

5.3.1 Discrete Asymptotics

We begin by considering the situation where Θ = {θ1, θ2, ...} consists of a countable (possibly finite) set of values, such that the parametric model corresponding to the true parameter, θt, is "distinguishable" from the others, in the sense that the logarithmic divergences, ∫ p(x | θt) log{p(x | θt)/p(x | θi)} dx, are strictly larger than zero for all i ≠ t.

Proposition 5.12. (Discrete asymptotics). Let z = (x1, ..., xn) be observations for which a belief model is defined by the parametric model p(x | θ), where θ ∈ Θ = {θ1, θ2, ...}, and the prior p(θ) = {p1, p2, ...}, p_i > 0, Σ_i p_i = 1. Suppose that θt ∈ Θ is the true value of θ and that, for all i ≠ t,

∫ p(x | θt) log{p(x | θt)/p(x | θi)} dx > 0.

Then

lim_{n→∞} p(θt | z) = 1,  lim_{n→∞} p(θi | z) = 0, for all i ≠ t.

Proof. By Bayes' theorem, and assuming that p(z | θi) = Π_{j=1}^n p(x_j | θi),

p(θi | z) = p_i exp{S_i} / Σ_j p_j exp{S_j},

where S_i = Σ_{j=1}^n log{p(x_j | θi)/p(x_j | θt)}. Conditional on θt, the latter is the sum of n independent identically distributed random quantities and hence, by the strong law of large numbers,

lim_{n→∞} (1/n) S_i = ∫ p(x | θt) log{p(x | θi)/p(x | θt)} dx.

The right-hand side is negative for all i ≠ t, and equals zero for i = t, so that, as n → ∞, S_t → 0 and S_i → −∞ for i ≠ t, which establishes the result.
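Proposition 5.12 is easy to visualise numerically (our own illustration): with Θ = {0.2, 0.5, 0.8} and a data stream whose relative frequency of successes is ½, the posterior piles up on θ = 0.5. A deterministic alternating stream is used so that the behaviour is reproducible; for genuinely random data the strong law delivers the same limit.

```python
from math import log, exp

thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
true_index = 1                      # theta_t = 0.5

xs = [i % 2 for i in range(400)]    # alternating 0/1: relative frequency 1/2

logpost = [log(p) for p in prior]
for x in xs:
    for i, t in enumerate(thetas):
        logpost[i] += log(t) if x == 1 else log(1 - t)

mx = max(logpost)                   # normalise on the log scale for stability
w = [exp(v - mx) for v in logpost]
post = [v / sum(w) for v in w]
```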
An alternative way of expressing the result of Proposition 5.12 is to say that the posterior distribution function for θ ultimately degenerates to a step function with a single (unit) step at θ = θt. In fact, under suitable regularity conditions, this result can be shown to hold for much more general forms of Θ; however, the proofs require considerable measure-theoretic machinery and the reader is referred to Berk (1966, 1970) for details. A particularly interesting result is that if the true θ is not in Θ, the posterior degenerates onto the value in Θ which gives the parametric model closest, in logarithmic divergence, to the true model.

5.3.2 Continuous Asymptotics

Let us now consider what can be said, in the case of general Θ, about the forms of probability statements implied by p(θ | z) for large n. Proceeding heuristically for the moment, without concern for precise regularity conditions, we note that, in the case of a parametric representation of an exchangeable sequence of observables,

p(θ | z) ∝ p(θ) p(z | θ) = exp{log p(θ) + log p(z | θ)}.

If we now expand the two logarithmic terms about their respective maxima, m0 and θ̂n, assumed to be determined by setting ∇ log p(θ) = 0 and ∇ log p(z | θ) = 0, respectively, we obtain, ignoring constants of proportionality,

p(θ | z) ∝ exp{−½(θ − m0)'H0(θ − m0) + R0 − ½(θ − θ̂n)'H(θ̂n)(θ − θ̂n) + Rn},

where R0, Rn denote remainder terms and

H0 = −∇∇' log p(θ) |_{θ=m0},  H(θ̂n) = −∇∇' log p(z | θ) |_{θ=θ̂n}.

Assuming regularity conditions which ensure that R0, Rn are small for large n, we see that
p(θ | z) is approximately proportional to

exp{−½(θ − mn)'Hn(θ − mn)},

with

Hn = H0 + H(θ̂n),  mn = Hn^{−1}(H0 m0 + H(θ̂n) θ̂n),

where m0 (the prior mode) maximises p(θ) and θ̂n (the maximum likelihood estimate) maximises p(z | θ). This heuristic development thus suggests that, for large n, p(θ | z) will tend to resemble a multivariate normal distribution, Nk(θ | mn, Hn), where k is the dimension of θ, whose mean is a matrix weighted average of a prior (modal) estimate and an observation-based (maximum likelihood) estimate, and whose precision matrix is the sum of the prior precision matrix and the observed information matrix. The Hessian matrix H(θ̂n), with (i, j) element −∂² log p(z | θ)/∂θi ∂θj evaluated at θ = θ̂n, measures the local curvature of the log-likelihood function at its maximum, and is often called the observed information matrix. Sharply peaked log-likelihoods imply small posterior uncertainty, and vice-versa. In the case of θ ∈ Θ ⊆ ℝ, the approximate posterior variance is the negative reciprocal of the rate of change of the first derivative of log p(z | θ) in the neighbourhood of its maximum.

Other approximations suggest themselves. For example, for large n the prior precision will tend to be small compared with the precision provided by the data and could be ignored, so that we might approximate p(θ | z) by either Nk(θ | θ̂n, H(θ̂n)) or Nk(θ | θ̂n, n I(θ̂n)), where I(θ), the so-called Fisher (or expected) information matrix, is defined by

I(θ) = ∫ p(x | θ) (−∇∇' log p(x | θ)) dx,

since, by the strong law of large numbers, H(θ̂n) ≈ n I(θ̂n) for large n.
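The quality of the normal approximation N(mode, H^{−1}) is easy to examine in one dimension (our own sketch): for a Be(a, b) posterior, log p(θ | z) = (a − 1) log θ + (b − 1) log(1 − θ) + const, so the mode is (a − 1)/(a + b − 2) and the observed information is (a − 1)/mode² + (b − 1)/(1 − mode)².

```python
from math import lgamma, log, exp, pi, sqrt

a, b = 31.0, 21.0                      # e.g. posterior after 30 successes, 20 failures
mode = (a - 1) / (a + b - 2)           # posterior mode, 0.6
info = (a - 1) / mode ** 2 + (b - 1) / (1 - mode) ** 2   # observed information
sd = 1 / sqrt(info)                    # approximate posterior standard deviation

def beta_pdf(t, a, b):
    return exp(lgamma(a + b) - lgamma(a) - lgamma(b)
               + (a - 1) * log(t) + (b - 1) * log(1 - t))

def normal_pdf(t, m, s):
    return exp(-0.5 * ((t - m) / s) ** 2) / (s * sqrt(2 * pi))

# Relative discrepancy of the two densities at the mode: already small
# at this moderate sample size, and shrinking as n grows.
rel_err = abs(beta_pdf(mode, a, b) - normal_pdf(mode, mode, sd)) / normal_pdf(mode, mode, sd)
```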
There is a large literature on the regularity conditions required to justify mathematically the heuristics presented above. Those who have contributed to the field include: Laplace (1812), Jeffreys (1939/1961, Chapter 4), LeCam (1953, 1956, 1958, 1966, 1970, 1986), Lindley (1961b), Freedman (1963b, 1965), Walker (1969), Chao (1970), Dawid (1970), DeGroot (1970, Chapter 10), Ibragimov and Hasminski (1973), Heyde and Johnstone (1979), Hartigan (1983, Chapter 4), Bermúdez (1985), Chen (1985), Sweeting and Adekola (1987), Fu and Kass (1988), Fraser and McDunnough (1989), Sweeting (1992) and Ghosh et al. (1994). Related work on higher-order expansion approximations, in which the normal appears as a leading term, includes that of Hartigan (1965), Johnson (1967, 1970), Johnson and Ladalla (1979) and Crowder (1988). The account given below is based on Chen (1985).

In what follows, we assume that θ ∈ Θ ⊆ ℝ^k and that {p_n(θ), n = 1, 2, ...} is a sequence of posterior densities for θ, typically of the form p_n(θ) = p(θ | x1, ..., xn), derived from an exchangeable sequence with parametric model p(x | θ) and prior p(θ), although the mathematical development to be given does not require this. We define

L_n(θ) = log p_n(θ),

and assume throughout that, for every n, there is a strict local maximum, m_n, of p_n (or, equivalently, of L_n), satisfying

L′_n(m_n) = 0

and implying the existence and positive-definiteness of

Σ_n = (−L″_n(m_n))^{−1},

where L″_n(m_n) = [∂²L_n(θ)/∂θi ∂θj] |_{θ=m_n}. We do not require any assumption that the m_n themselves converge, nor do we need to insist that m_n be a global maximum of p_n. Defining |θ| = (θ'θ)^{1/2} and

B_δ(θ′) = {θ ∈ Θ; |θ − θ′| < δ},

we shall show that the following three basic conditions are sufficient to ensure a valid normal approximation for p_n in a small neighbourhood of m_n, as n becomes large:

(c1) "Steepness": σ̄_n² → 0 as n → ∞, where σ̄_n² is the largest eigenvalue of Σ_n;

(c2) "Smoothness": for any ε > 0, there exist N and δ > 0 such that, for any n > N and θ ∈ B_δ(m_n), L″_n(θ) exists and satisfies

I − A(ε) ≤ L″_n(θ){L″_n(m_n)}^{−1} ≤ I + A(ε),

where I is the k × k identity matrix and A(ε) is a k × k symmetric positive-semidefinite matrix whose largest eigenvalue tends to zero as ε → 0;

(c3) "Concentration": for any δ > 0, ∫_{B_δ(m_n)} p_n(θ) dθ → 1 as n → ∞.
like the multivariate normal density kernel $\exp\{-\tfrac{1}{2}(\theta - m_n)'\Sigma_n^{-1}(\theta - m_n)\}$. The final condition (c3) ensures that the probability outside any neighbourhood of $m_n$ becomes negligible as $n \to \infty$. We implicitly assume that the limit of $p_n(m_n)\,|\Sigma_n|^{1/2}$ exists as $n \to \infty$, and we shall now establish a bound for that limit.
Proposition 5.13. (Bounded concentration). The conditions (c1), (c2) imply that
$$\lim_{n\to\infty} (2\pi)^{k/2}\,p_n(m_n)\,|\Sigma_n|^{1/2} \le 1,$$
with equality if and only if (c3) holds.

Proof. Given $\varepsilon > 0$, consider $n > N$ and $\delta > 0$ as given in (c2). Then, for any $\theta \in B_\delta(m_n)$, a simple Taylor expansion establishes that
$$L_n(\theta) - L_n(m_n) = \tfrac{1}{2}(\theta - m_n)'\,L_n''(\theta^*)\,(\theta - m_n),$$
for some $\theta^*$ lying between $\theta$ and $m_n$, so that, writing $z = \Sigma_n^{-1/2}(\theta - m_n)$ and using (c2),
$$\exp\big\{-\tfrac{1}{2}z'(I + A(\varepsilon))z\big\} \le \frac{p_n(\theta)}{p_n(m_n)} \le \exp\big\{-\tfrac{1}{2}z'(I - A(\varepsilon))z\big\}.$$
Integrating over $B_\delta(m_n)$, with $d\theta = |\Sigma_n|^{1/2}\,dz$, and noting that, by (c1), the transformed region of integration expands to fill $\Re^k$ as $n \to \infty$, since $\delta/\bar{\sigma}_n \to \infty$, we obtain
$$\int_{B_\delta(m_n)} p_n(\theta)\,d\theta \le p_n(m_n)\,|\Sigma_n|^{1/2}\,(2\pi)^{k/2}\,|I - A(\varepsilon)|^{-1/2},$$
in the limit, and similarly for the lower bound, with $I + A(\varepsilon)$ in place of $I - A(\varepsilon)$. Since $\int_{B_\delta(m_n)} p_n(\theta)\,d\theta \le 1$ for all $n$, the lower bound gives
$$\lim_{n\to\infty} (2\pi)^{k/2}\,p_n(m_n)\,|\Sigma_n|^{1/2}\,|I + A(\varepsilon)|^{-1/2} \le 1,$$
and the required inequality follows from the fact that $|I \pm A(\varepsilon)| \to 1$ as $\varepsilon \to 0$. Moreover, the matching upper bound shows that we have equality if and only if
$$\lim_{n\to\infty} \int_{B_\delta(m_n)} p_n(\theta)\,d\theta = 1,$$
which is condition (c3).
We can now establish the main result, which may colloquially be stated as "$\theta$ has an asymptotic posterior $N_k(\theta\,|\,m_n, \Sigma_n)$ distribution".

Proposition 5.14. (Asymptotic posterior normality). For each $n$, consider $p_n(\cdot)$ as the density function of a random quantity $\theta_n$, and define $\phi_n = \Sigma_n^{-1/2}(\theta_n - m_n)$. Then, given (c1) and (c2), a necessary and sufficient condition for $\phi_n$ to converge in distribution to $\phi$, where
$$p(\phi) = (2\pi)^{-k/2}\exp\big\{-\tfrac{1}{2}\phi'\phi\big\},$$
is that (c3) holds.

Proof. For $a, b \in \Re^k$, write $b \ge a$ to denote that all components of $b - a$ are non-negative. We first note that
$$P_n(a \le \phi_n \le b) = \int_{\{\theta;\ a \,\le\, \Sigma_n^{-1/2}(\theta - m_n) \,\le\, b\}} p_n(\theta)\,d\theta,$$
which, by a similar argument to that used in Proposition 5.13, is bounded above, for any $\varepsilon > 0$ and sufficiently large $n$, by
$$(2\pi)^{k/2}\,p_n(m_n)\,|\Sigma_n|^{1/2} \int_{Z(\varepsilon)} (2\pi)^{-k/2}\exp\big\{-\tfrac{1}{2}z'z\big\}\,dz, \qquad Z(\varepsilon) = \big\{z;\ [I - A(\varepsilon)]^{1/2}a \le z \le [I - A(\varepsilon)]^{1/2}b\big\},$$
and is bounded below by a similar quantity with $+A(\varepsilon)$ in place of $-A(\varepsilon)$. As $\varepsilon \to 0$ we have $Z(\varepsilon) \to Z(0) = \{z;\ a \le z \le b\}$, so that, using Proposition 5.13, given (c1) and (c2),
$$\lim_{n\to\infty} P_n(a \le \phi_n \le b) = \int_{\{a \le \phi \le b\}} p(\phi)\,d\phi$$
if and only if (c3) holds. The result follows from Proposition 5.13.
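The convergence asserted by Proposition 5.14 is easy to visualise numerically. The following sketch (our own illustration; the value of $\theta$, the uniform prior and the stylised data are assumptions) takes the Bernoulli model with a uniform prior, so that the exact posterior is Beta, and checks by simulation that the exact posterior probability of the standardised one-sigma interval approaches the standard normal value as $n$ grows:

```python
import math, random

theta0 = 0.3                            # assumed data-generating value
p_normal = math.erf(1 / math.sqrt(2))   # P(-1 <= Z <= 1) for a standard normal
random.seed(1)

for n in (10, 100, 10000):
    r = round(n * theta0)               # stylised data: r successes in n trials
    a_post, b_post = 1 + r, 1 + n - r   # exact posterior Be(theta | 1+r, 1+n-r)
    m_n = r / n                         # posterior mode
    # observed information -L_n''(m_n); Sigma_n is its reciprocal
    H = (a_post - 1) / m_n**2 + (b_post - 1) / (1 - m_n)**2
    sd = H ** -0.5
    # Monte Carlo estimate of the exact posterior mass of (m_n - sd, m_n + sd)
    draws = (random.betavariate(a_post, b_post) for _ in range(200_000))
    p_exact = sum(m_n - sd < x < m_n + sd for x in draws) / 200_000
    print(n, round(abs(p_exact - p_normal), 3))
```

The discrepancy shrinks as $n$ increases, in line with the proposition.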
Conditions (c1) and (c2) are often relatively easy to check in specific applications, but (c3) may not be so directly accessible. It is useful therefore to have available alternative conditions which, given (c1), (c2), imply (c3). Two such are provided by the following:

(c4) For any $\delta > 0$, there exist an integer $N$ and $d > 0$ such that, for any $n > N$ and $\theta \notin B_\delta(m_n)$,
$$L_n(\theta) - L_n(m_n) \le -d\,\bar{\sigma}_n^{-2},$$
where $\bar{\sigma}_n^2$ is, as in (c1), the largest eigenvalue of $\Sigma_n$.

(c5) As (c4), but with the weaker requirement
$$L_n(\theta) - L_n(m_n) \le -d\,\bar{\sigma}_n^{-2} + G(\theta) - G(m_n),$$
with $G(\theta) = \log g(\theta)$ for some density (or normalisable positive function) $g(\theta)$ over $\Theta$.

Proposition 5.15. (Alternative conditions). Given (c1), (c2), either (c4) or (c5) implies (c3).

Proof. It is straightforward to verify that, given (c4),
$$\int_{\theta \notin B_\delta(m_n)} p_n(\theta)\,d\theta = p_n(m_n)\int_{\theta \notin B_\delta(m_n)} \exp\big\{L_n(\theta) - L_n(m_n)\big\}\,d\theta,$$
so that, since $p_n(m_n)\,|\Sigma_n|^{1/2}$ is bounded (Proposition 5.13), the right-hand side clearly tends to zero as $n \to \infty$; thus the probability outside $B_\delta(m_n)$ becomes negligible, which is (c3). The argument under (c5) is similar, the additional term being controlled by the integrability of $g$.

To understand better the relative ease of checking (c4) or (c5) in applications, we note that, if $p_n(\theta)$ is based on data $z_n$, so that
$$L_n(\theta) = \log p(z_n\,|\,\theta) + \log p(\theta) - \log p(z_n),$$
then the difference $L_n(\theta) - L_n(m_n)$ does not involve the, often intractable, normalising constant $p(z_n)$. Moreover, (c4) does not even require the use of a proper prior for the vector $\theta$. We shall illustrate the use of (c4) for the general case of canonical conjugate analysis for exponential families.
Proposition 5.16. (Asymptotic normality under conjugate analysis). Suppose that $y_1, \ldots, y_n$ are data resulting from a random sample of size $n$ from the canonical exponential family form
$$p(y\,|\,\psi) = a(y)\exp\big\{y'\psi - b(\psi)\big\},$$
with $b(\psi)$ a continuously differentiable and strictly convex function (see Section 5.2), and canonical conjugate prior density
$$p(\psi\,|\,n_0, y_0) \propto \exp\big\{n_0\,y_0'\psi - n_0\,b(\psi)\big\}.$$
For each $n$, consider the posterior density
$$p_n(\psi) \propto \exp\big\{(n_0 y_0 + n\bar{y}_n)'\psi - (n_0 + n)\,b(\psi)\big\}, \qquad \bar{y}_n = n^{-1}\sum_{i=1}^{n} y_i,$$
to be the density function for a random quantity $\psi_n$, and define $\phi_n = \Sigma_n^{-1/2}(\psi_n - m_n)$, where $m_n$ satisfies
$$\nabla b(m_n) = \frac{n_0 y_0 + n\bar{y}_n}{n_0 + n}, \qquad \Sigma_n^{-1} = (n_0 + n)\,\nabla\nabla' b(m_n).$$
Then $\phi_n$ converges in distribution to $N_k(\phi\,|\,0, I)$; colloquially, $\psi$ has an asymptotic posterior $N_k(\psi\,|\,m_n, \Sigma_n)$ distribution.

Proof. We have $L_n(\psi) = \text{const} + (n_0 + n)\,h_n(\psi)$, where
$$h_n(\psi) = \left[\frac{n_0 y_0 + n\bar{y}_n}{n_0 + n}\right]'\psi - b(\psi).$$
By the strict concavity of $h_n(\cdot)$, $p_n(\psi)$ is unimodal with a maximum at $\psi = m_n$ satisfying $\nabla h_n(m_n) = 0$. Condition (c1) follows from the fact that the largest eigenvalue of $\Sigma_n$ is of order $(n_0 + n)^{-1}$, and condition (c2) follows straightforwardly from the fact that $\nabla\nabla' b(\psi)$ has continuous entries. Moreover, for any $\delta > 0$ and $\psi \notin B_\delta(m_n)$, the strict concavity of $h_n$ gives
$$h_n(\psi) - h_n(m_n) \le -c\,\delta, \qquad c = \inf\big\{|\nabla h_n(\phi)|;\ \phi \notin B_\delta(m_n)\big\} > 0,$$
so that
$$L_n(\psi) - L_n(m_n) \le -(n_0 + n)\,c\,\delta.$$
Hence (c4) is satisfied, and the result follows by Propositions 5.13 and 5.14.

Example 5.8. (continued). Suppose that $\text{Be}(\theta\,|\,\alpha_n, \beta_n)$, with $\alpha_n = \alpha + r_n$ and $\beta_n = \beta + n - r_n$, is the posterior distribution of the parameter of a Bernoulli distribution after $n$ trials with $r_n$ successes, derived from a $\text{Be}(\theta\,|\,\alpha, \beta)$ prior. Proceeding directly,
$$L_n(\theta) = \log p(\theta\,|\,z_n) = (\alpha_n - 1)\log\theta + (\beta_n - 1)\log(1 - \theta) + \text{const},$$
so that $L_n'(m_n) = 0$ gives $m_n = (\alpha_n - 1)/(\alpha_n + \beta_n - 2)$, and
$$\Sigma_n = \big\{-L_n''(m_n)\big\}^{-1} = \left\{\frac{\alpha_n - 1}{m_n^2} + \frac{\beta_n - 1}{(1 - m_n)^2}\right\}^{-1} = \frac{m_n(1 - m_n)}{\alpha_n + \beta_n - 2}.$$
Condition (c1) is clearly satisfied, since $\Sigma_n \to 0$; condition (c2) follows from the fact that $L_n''(\theta)$ is a continuous function of $\theta$; and (c4) may be verified with an argument similar to the one used in the proof of Proposition 5.16. Taking $\alpha = \beta = 1$ for illustration, we see that $m_n = r_n/n$, and hence that the asymptotic posterior for $\theta$ is
$$N\left(\theta\ \Big|\ \frac{r_n}{n},\ \frac{n}{m_n(1 - m_n)}\right),$$
the second parameter denoting, as usual, the precision. (As an aside, we note the interesting "duality" between this asymptotic form for $\theta$ given $r_n/n$, and the asymptotic distribution for $r_n/n$ given $\theta$, which, by the central limit theorem, is $N(r_n/n\,|\,\theta, n/\{\theta(1 - \theta)\})$, the latter not depending on the prior parameters. Further reference to this kind of "duality" will be given in Appendix B.)
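As a quick numerical illustration of the example (our own sketch; the hyperparameters and data below are arbitrary assumptions), the exact standard deviation of the $\text{Be}(\theta\,|\,\alpha + r_n, \beta + n - r_n)$ posterior can be compared with the value $\{m_n(1 - m_n)/n\}^{1/2}$ implied by the asymptotic normal form:

```python
import math

alpha, beta_ = 1.0, 1.0             # assumed Be(theta | alpha, beta) prior
n, r = 2000, 700                    # assumed trials and successes
a, b = alpha + r, beta_ + n - r     # exact posterior Be(theta | a, b)

exact_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

m_n = r / n                         # posterior mode under the uniform prior
asymp_sd = math.sqrt(m_n * (1 - m_n) / n)   # from N(theta | m_n, n/(m_n(1-m_n)))

print(exact_sd, asymp_sd)           # the two closely agree for n this large
```

The two standard deviations agree to several decimal places, as the asymptotic theory predicts.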
Asymptotics under Transformations

In Proposition 5.16, the asymptotic posterior normality is given in terms of the canonical parametrisation of the exponential family underlying the conjugate analysis. More generally, we could ask the same question in relation to Proposition 5.14: does the asymptotic posterior normality "carry over", with appropriate transformations of the mean and covariance, to an arbitrary (one-to-one) reparametrisation of the model? A partial answer is provided by the following.

Proposition 5.17. (Asymptotic normality under transformation). With the notation and background of Proposition 5.14, suppose that $\theta$ has an asymptotic $N_k(\theta\,|\,m_n, \Sigma_n)$ distribution, with the additional assumptions that, with respect to a parametric model $p(x\,|\,\theta_0)$, $\bar{\sigma}_n^2 \to 0$ and $m_n \to \theta_0$ in probability, and that, given any $\delta > 0$, there is a constant $c(\delta)$ such that
$$P\big(|\Sigma_n^{-1/2}(\theta_0 - m_n)| \le c(\delta)\big) \ge 1 - \delta$$
for all sufficiently large $n$. Then, if $\nu = g(\theta)$ is a transformation such that, at $\theta = \theta_0$, $J_g(\theta) = \partial g(\theta)/\partial\theta$ is non-singular with continuous entries, $\nu$ has an asymptotic distribution
$$N_k\big(\nu\ \big|\ g(m_n),\ J_g(m_n)\,\Sigma_n\,J_g(m_n)'\big).$$

Proof. This is a generalisation and Bayesian reformulation of classical results presented in Serfling (1980, Section 3.3).

The expression for the asymptotic posterior precision matrix (inverse covariance matrix) given in Proposition 5.17 is often rather cumbersome to work with. A simpler alternative form is given by the following.

Corollary 1. (Asymptotic precision after transformation). If $H_n = \Sigma_n^{-1}$ denotes the asymptotic precision matrix for $\theta$, then the asymptotic precision matrix for $\nu = g(\theta)$ has the form
$$J_{g^{-1}}(\nu_n)'\,H_n\,J_{g^{-1}}(\nu_n), \qquad \nu_n = g(m_n),$$
where
$$J_{g^{-1}}(\nu) = \frac{\partial g^{-1}(\nu)}{\partial \nu}$$
is the Jacobian of the inverse transformation.

Proof. This follows immediately by reversing the roles of $\theta$ and $\nu$ in Proposition 5.17.
For any finite $n$, the adequacy of the normal approximation provided by Proposition 5.17 may be highly dependent on the particular transformation used. Anscombe (1964a, 1964b) analyses the choice of transformations which improve asymptotic normality. A related issue is that of selecting appropriate parametrisations for the various numerical approximation methods (Hills and Smith, 1992, 1993). For further details, see Mendoza (1994).
In many applications, we simply wish to consider one-to-one transformations of a single parameter. The next result provides a convenient summary of the required transformation result in this case.

Corollary 2. (Asymptotic normality after scalar transformation). Suppose that, given the conditions of Proposition 5.14 with scalar $\theta \in \Theta \subseteq \Re$, $\theta$ has an asymptotic posterior distribution of the form $N(\theta\,|\,m_n, v_n)$, with $v_n \to 0$, and that the sequence $m_n$ tends in probability to $\theta_0$ under $p(x\,|\,\theta_0)$. Then, if $\nu = g(\theta)$ is such that $g'(\theta) = dg(\theta)/d\theta$ is continuous and non-zero at $\theta = \theta_0$, the asymptotic posterior distribution for $\nu$ is
$$N\big(\nu\ \big|\ g(m_n),\ [g'(m_n)]^2\,v_n\big).$$

Proof. The conditions ensure that the linearisation of $g$ in a shrinking neighbourhood of $\theta_0$ is asymptotically valid, so that the result follows from Proposition 5.17 with scalar $\theta$.

Example 5.8. (continued). We have seen that the asymptotic posterior distribution of the parameter $\theta$ of a Bernoulli distribution, after $n$ trials with $r_n$ successes, is normal with mean $m_n = r_n/n$ and precision $n/\{m_n(1 - m_n)\}$. Suppose now that we are interested in the asymptotic posterior distribution of the variance-stabilising transformation (recall Example 3.3)
$$\nu = g(\theta) = 2\sin^{-1}\sqrt{\theta}.$$
Straightforward application of Corollary 2, with
$$g'(\theta) = \big\{\theta(1 - \theta)\big\}^{-1/2},$$
leads to the asymptotic distribution $N(\nu\,|\,2\sin^{-1}\sqrt{m_n},\ n)$, the second parameter denoting the precision, whose mean and variance can be compared with the forms given there: the asymptotic precision, $n$, does not depend on $m_n$, which is precisely the sense in which the transformation stabilises the variance.
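The variance-stabilising property is immediate to check numerically. The following sketch (our own; the data values are assumptions) applies the scalar transformation rule of Corollary 2 and confirms that the transformed variance is exactly $1/n$:

```python
import math

n, r = 400, 100                     # assumed Bernoulli data
m_n = r / n
var_theta = m_n * (1 - m_n) / n     # asymptotic posterior variance of theta

g  = lambda t: 2 * math.asin(math.sqrt(t))   # variance-stabilising map
dg = lambda t: 1 / math.sqrt(t * (1 - t))    # its derivative

var_nu = dg(m_n) ** 2 * var_theta   # Corollary 2: [g'(m_n)]^2 * v_n
print(var_nu, 1 / n)                # both equal 1/n: the variance is stabilised
```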
It is clear from the presence of the term $[g'(m_n)]^2$ in the form of the asymptotic precision given in Corollary 2 that things will go wrong if $g'(m_n) \to 0$ as $n \to \infty$; this is dealt with in the result presented by the requirement that $g'(\theta_0) \ne 0$. A concrete illustration (a non-normal asymptotic posterior) of the problems that arise when such a condition is not met is given by the following. Suppose that the asymptotic posterior distribution for a parameter $\theta \in \Re$ is $N(\theta\,|\,\bar{x}_n, n)$, the second parameter denoting the precision, perhaps derived from the model $\{\prod_{i=1}^n N(x_i\,|\,\theta, 1)\}$, and suppose that the actual value of $\theta$ generating the sequence $x_1, x_2, \ldots$ through $N(x_i\,|\,\theta, 1)$ is $\theta = 0$. Now consider the transformation $\nu = g(\theta) = \theta^2$. Here $g'(0) = 0$, and the condition of the corollary is not satisfied. Intuitively, $\bar{x}_n$ is converging in probability to $0$, so that $g(\bar{x}_n) = \bar{x}_n^2$ tends in probability to $0$ through strictly positive values, and it is clear that $\nu$ cannot have an asymptotic normal distribution. In fact, it can be shown that the asymptotic posterior distribution of $n\nu$ is, in this case, a non-central $\chi^2_1$ distribution, with non-centrality parameter $n\bar{x}_n^2$.
One attraction of the availability of the results given in Proposition 5.17 and Corollary 1 is that verification of the conditions for asymptotic posterior normality (as in, for example, Proposition 5.14) may be much more straightforward under one choice of parametrisation of the likelihood than under another. The result given enables us to identify the posterior normal form for any convenient choice of parametrisation, subsequently deriving the form for the parameters of interest by straightforward transformation. An indication of the usefulness of this result is given in the following example (and further applications can be found in Section 5.4).

Example 5.9. (Asymptotic posterior normality for a ratio). Suppose that we have a random sample $x_1, \ldots, x_n$ from the model $\{\prod_{i=1}^n N(x_i\,|\,\theta_1, \lambda_1)\}$ and, independently, another random sample $y_1, \ldots, y_n$ from the model $\{\prod_{i=1}^n N(y_i\,|\,\theta_2, \lambda_2)\}$, with known precisions $\lambda_1, \lambda_2$, together with prior densities $N(\theta_1\,|\,0, \lambda_{01})$ and $N(\theta_2\,|\,0, \lambda_{02})$, where $\lambda_{01} \approx 0$ and $\lambda_{02} \approx 0$, and suppose that we are interested in the posterior distribution of $\phi_1 = \theta_1/\theta_2$, for $\theta_2 \ne 0$.

First, we note that, using standard normal analysis, it is very easily verified that the joint posterior distribution for $\theta = (\theta_1, \theta_2)$ is, for large $n$, approximately
$$N(\theta_1\,|\,\bar{x}_n, n\lambda_1)\,N(\theta_2\,|\,\bar{y}_n, n\lambda_2),$$
corresponding to an initial parametrisation directly in terms of $\theta_1, \theta_2$. Secondly, the asymptotic posterior for $\phi_1$ can be obtained by defining an appropriate $\phi_2$, such that $(\theta_1, \theta_2) \to (\phi_1, \phi_2)$ is a one-to-one transformation, obtaining the distribution of $\phi = (\phi_1, \phi_2) = g(\theta_1, \theta_2)$ using Proposition 5.17, and subsequently marginalising to $\phi_1$. An obvious choice for $\phi_2$ is $\phi_2 = \theta_2$, so that, in the notation of Proposition 5.17, the Jacobian is
$$J_g(\theta) = \begin{pmatrix} \dfrac{\partial \phi_1}{\partial \theta_1} & \dfrac{\partial \phi_1}{\partial \theta_2} \\[6pt] \dfrac{\partial \phi_2}{\partial \theta_1} & \dfrac{\partial \phi_2}{\partial \theta_2} \end{pmatrix} = \begin{pmatrix} \dfrac{1}{\theta_2} & -\dfrac{\theta_1}{\theta_2^2} \\[6pt] 0 & 1 \end{pmatrix}.$$
The determinant of this, $\theta_2^{-1}$, is non-zero for $\theta_2 \ne 0$, and the conditions of Proposition 5.17 are clearly satisfied. It follows that the asymptotic posterior of $\phi = (\phi_1, \phi_2)$ is normal, with covariance matrix $J_g\,\text{diag}\{(n\lambda_1)^{-1}, (n\lambda_2)^{-1}\}\,J_g'$ evaluated at $(\bar{x}_n, \bar{y}_n)$, so that, marginalising, the asymptotic posterior for $\phi_1 = \theta_1/\theta_2$ is
$$N\left(\phi_1\ \Bigg|\ \frac{\bar{x}_n}{\bar{y}_n},\ \left\{\frac{1}{n\lambda_1\bar{y}_n^2} + \frac{\bar{x}_n^2}{n\lambda_2\bar{y}_n^4}\right\}^{-1}\right),$$
the second parameter denoting the precision. Any reader remaining unappreciative of the simplicity of the above analysis may care to examine the form of the likelihood function corresponding to a parametrisation directly in terms of $(\phi_1, \phi_2)$, and to contemplate verifying directly the conditions of Proposition 5.14.
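A Monte Carlo check of the ratio result is straightforward (our own sketch; unit precisions and the values of $n$, $\bar{x}_n$, $\bar{y}_n$ are illustrative assumptions). Sampling from the approximate joint posterior and forming the ratio should reproduce the variance implied by the asymptotic normal form:

```python
import random, statistics

random.seed(2)
n, xbar, ybar = 500, 2.0, 4.0       # assumed values; lambda_1 = lambda_2 = 1
sd = (1 / n) ** 0.5

# Draw from the (approximate) joint posterior and form the ratio phi_1.
phi = [random.gauss(xbar, sd) / random.gauss(ybar, sd) for _ in range(200_000)]

# Variance implied by the asymptotic normal form for phi_1:
var_asymp = (1 / n) * (1 / ybar**2 + xbar**2 / ybar**4)
print(statistics.variance(phi), var_asymp)
```

The sample variance of the simulated ratios matches the asymptotic variance to within Monte Carlo error.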
5.4 REFERENCE ANALYSIS

In the previous section, we have examined situations where data corresponding to large sample sizes come to dominate prior information, leading to inferences which are negligibly dependent on the initial state of information. The third of the questions posed at the end of Section 5.1.6 relates to specifying prior distributions in situations where it is felt that, even for moderate sample sizes, the data should be expected to dominate prior information because of the "vague" nature of the latter.

However, it is as well to make clear straightaway, very much in the operationalist spirit with which we began our discussion of uncertainty in Chapter 2, our own view that "mere words" are an inadequate basis for clarifying such a slippery concept. Put bluntly: data cannot ever speak entirely for themselves; every prior specification has some informative posterior or predictive implications; and "vague" is itself much too vague an idea to be useful. There is no "objective" prior that represents ignorance. The problem of characterising a "non-informative" or "objective" prior distribution, representing "prior ignorance", "vague prior knowledge" and "letting the data speak for themselves", is far more complex than the apparent intuitive immediacy of these words and phrases would suggest. In Section 5.6.2, we shall provide a brief review of the fascinating history of the quest for this "baseline" prior.

On the other hand, we recognise that there is often a pragmatically important need for a form of prior to posterior analysis capturing, in some well-defined sense, the notion of the prior having a minimal effect, relative to the data, on the final inference. Such a reference analysis might be required as an approximation to actual individual beliefs; more typically, it might be required as a limiting "what if?" baseline in considering a range of prior to posterior analyses, or as a default option when there are insufficient resources for detailed elicitation of actual prior knowledge.
From the approach we adopt, it will be clear that the reference prior component of the analysis is simply a mathematical tool. It has considerable pragmatic importance in implementing a reference analysis, whose role and character will be precisely defined, but it is not a privileged, "uniquely non-informative" or "objective" prior. Its main use will be to provide a "conventional" prior, to be used when a default specification having a claim to being non-influential in the sense described above is required. We seek to move away, therefore, from the rather philosophically muddled debates about "prior ignorance" that have all too often confused these issues, and towards well-defined decision-theoretic and information-theoretic procedures. In line with the unified perspective we have tried to adopt throughout this volume, the setting for our development of such a reference analysis will be the general decision-theoretic framework, together with the specific information-theoretic tools that have emerged in earlier chapters as key measures of the discrepancies (or "distances") between belief distributions.
5.4.1 Reference Decisions

Consider a specific form of decision problem with possible decisions $d \in \mathcal{D}$ providing possible answers, $a \in \mathcal{A}$, to an inference problem, with unknown state of the world $\omega = (\omega_1, \omega_2)$, utilities for consequences $(a, \omega)$ given by $u(d(\omega)) = u(a, \omega_1)$, and the availability of an experiment $e$ which consists of obtaining an observation $x$ having parametric model $p(x\,|\,\omega_2)$ and a prior probability density $p(\omega) = p(\omega_1\,|\,\omega_2)\,p(\omega_2)$ for the unknown state of the world. This general structure describes a situation where practical consequences depend directly on the $\omega_1$ component of $\omega$, whereas inference from data $x \in X$ provided by experiment $e$ takes place indirectly, through the $\omega_2$ component of $\omega$, as described by $p(\omega_1\,|\,\omega_2)$.

To avoid subscript proliferation, let us now, without any risk of confusion, indulge in a harmless abuse of notation by writing $\omega_1 = \omega$, $\omega_2 = \theta$. This both simplifies the exposition and has the mnemonic value of suggesting that $\omega$ is the state of the world of ultimate interest (since it occurs in the utility function), whereas $\theta$ is a parameter in the usual sense (since it occurs in the probability model). In general, the relationship between $\omega$ and $\theta$ is that described in their joint distribution $p(\omega, \theta) = p(\omega\,|\,\theta)\,p(\theta)$. Often, $\omega$ is just some function $\omega = \phi(\theta)$ of $\theta$; in that case, the prior density is, of course, simply $p(\theta)$.
Now, for given conditional prior $p(\omega\,|\,\theta)$ and utility function $u(a, \omega)$, let us examine, in utility terms, the influence of the prior $p(\theta)$, relative to the observational information provided by $e$. Using Definition 3.13(ii), the expected (utility) value of the experiment $e$, given the prior $p(\theta)$, is
$$v_u\{e, p(\theta)\} = \int_X p(x)\int_\Omega u(a_x^*, \omega)\,p(\omega\,|\,x)\,d\omega\,dx - \int_\Omega u(a_0^*, \omega)\,p(\omega)\,d\omega,$$
where $a_0^*$ denotes the optimal answer under $p(\omega)$ and $a_x^*$ denotes the optimal answer under $p(\omega\,|\,x)$, and where, assuming $\omega$ to be conditionally independent of $x$ given $\theta$,
$$p(\omega\,|\,x) = \int_\Theta p(\omega\,|\,\theta)\,p(\theta\,|\,x)\,d\theta, \qquad p(x) = \int_\Theta p(x\,|\,\theta)\,p(\theta)\,d\theta.$$
If $e(k)$ denotes the experiment consisting of $k$ independent replications of $e$, that is, yielding observations $\{z_1, \ldots, z_k\}$ with joint parametric model $\prod_{i=1}^{k} p(z_i\,|\,\theta)$, then $v_u\{e(k), p(\theta)\}$, the expected utility value of the experiment $e(k)$, has the same mathematical form, but with $x = (z_1, \ldots, z_k)$ and $p(x\,|\,\theta) = \prod_{i=1}^{k} p(z_i\,|\,\theta)$. Intuitively, at least in suitably regular cases, as $k \to \infty$ we obtain, from $e(\infty)$, perfect (i.e., complete) information about $\theta$, so that, assuming the limit to exist,
$$v_u\{e(\infty), p(\theta)\} = \lim_{k\to\infty} v_u\{e(k), p(\theta)\}$$
is the expected (utility) value of perfect information about $\theta$, given $p(\theta)$. Clearly, the more valuable the information contained in $p(\theta)$, the less will be the expected value of perfect information about $\theta$; conversely, the less valuable the information contained in the prior, the more we would expect to gain from exhaustive experimentation. This, then, suggests a well-defined "thought experiment" procedure for characterising a "minimally valuable prior": choose, from the class of priors which has been identified as compatible with other assumptions about $(\omega, \theta)$, that prior, $\pi(\theta)$ say, which maximises the expected value of perfect information about $\theta$. Such a prior will be called a $u$-reference prior; the posterior distributions,
$$\pi(\omega\,|\,x) = \int_\Theta p(\omega\,|\,\theta)\,\pi(\theta\,|\,x)\,d\theta, \qquad \pi(\theta\,|\,x) \propto p(x\,|\,\theta)\,\pi(\theta),$$
derived from combining $\pi(\theta)$ with actual data $x$, will be called $u$-reference posteriors; and the optimal decision derived from $\pi(\omega\,|\,x)$ and $u(a, \omega)$ will be called a $u$-reference decision.

Example 5.10. (Prediction with quadratic loss). Suppose that beliefs about a sequence of observables $x = (x_1, \ldots, x_n)$ correspond to assuming the latter to be a random sample from an $N(x\,|\,\mu, \lambda)$ parametric model, with known precision $\lambda$, together with a prior for $\mu$ to be selected from the class $\{N(\mu\,|\,\mu_0, \lambda_0),\ \mu_0 \in \Re,\ \lambda_0 > 0\}$. Assuming a quadratic loss function, for which $\mathcal{A} = \Re$, the decision problem is to provide a point estimate for $x_{n+1}$, given $x_1, \ldots, x_n$; here $\omega = x_{n+1}$ and $\theta = \mu$. We shall derive a $u$-reference analysis of this problem. For the purposes of the "thought experiment", let $z_k = (x_1, \ldots, x_k)$ denote the (imagined) outcomes of $k$ replications of the experiment yielding the observables.
It is important to note that the limit above is not taken in order to obtain some form of asymptotic "approximation" to reference distributions.10. ] . . the more valuable the information contained in y ( 8 ) .) perfect (i.{ c( k). r l .. .p. I .. I . the "exact" reference prior is dejned as that which maximises the value ofperfect information about 8. .
Given actual data $x = (x_1, \ldots, x_n)$, the $u$-reference posterior for $\mu$ is then $N(\mu\,|\,\bar{x}_n, n\lambda)$, and the $u$-reference decision is to predict $x_{n+1}$ by the sample mean $\bar{x}_n$.

Example 5.11. (Variance estimation). Suppose that beliefs about $x = \{x_1, \ldots, x_n\}$ correspond to assuming $x$ to be a random sample from $N(x\,|\,0, \lambda)$, together with a gamma prior for $\lambda$ centred on $\lambda_0$, so that $p(\lambda) = \text{Ga}(\lambda\,|\,\alpha, \alpha\lambda_0^{-1})$, $\alpha > 0$. The decision problem is to provide a point estimate for $\omega = \sigma^2 = \lambda^{-1}$, assuming a standardised quadratic loss function, so that
$$u(a, \sigma^2) = -\left(\frac{a - \sigma^2}{\sigma^2}\right)^2,$$
for which $\mathcal{A} = \Re^+$. Let $z_k = \{x_1, \ldots, x_k\}$ denote the outcome of $k$ replications of the experiment. Then
$$p(\lambda\,|\,z_k) = \text{Ga}\left(\lambda\ \Big|\ \alpha + \frac{k}{2},\ \frac{\alpha}{\lambda_0} + \frac{k s_k^2}{2}\right), \qquad k s_k^2 = \sum_{i=1}^{k} x_i^2,$$
and, since minimising the expected standardised quadratic loss is equivalent to minimising $E[(a\lambda - 1)^2\,|\,z_k]$, the optimal estimate is
$$a_k^* = \frac{E[\lambda\,|\,z_k]}{E[\lambda^2\,|\,z_k]}.$$
The expected value of perfect information, $v_u\{e(\infty), p(\lambda)\}$, may be explicitly obtained by standard algebraic manipulations; it is maximised when $\alpha \to 0$, and hence the $u$-reference prior corresponds to the (improper) limit $\alpha = 0$. Given actual data $x = (x_1, \ldots, x_n)$, the $u$-reference posterior for $\lambda$ is therefore $\text{Ga}(\lambda\,|\,n/2, n s^2/2)$, where $n s^2 = \sum x_i^2$, and the $u$-reference decision is to give the estimate
$$\hat{\sigma}^2 = \frac{E[\lambda\,|\,x]}{E[\lambda^2\,|\,x]} = \frac{n s^2}{n + 2}.$$
Thus, the reference estimator of $\sigma^2$ with respect to standardised quadratic loss is not the usual $s^2$, but a slightly smaller multiple of $s^2$. It is of interest to note that, from a frequentist perspective, $\hat{\sigma}^2 = n s^2/(n + 2)$ is the best invariant estimator of $\sigma^2$ and is admissible (cf. Example 45 in Berger, 1985a); indeed, $\hat{\sigma}^2$ dominates $s^2$, or indeed any other multiple of $s^2$, in terms of frequentist risk. Thus, the $u$-reference approach has led to the "correct" multiple of $s^2$.
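The optimal estimate under the reference posterior can be checked numerically. The following sketch (our own, with simulated data and an assumed true $\sigma^2 = 4$) computes $E[\lambda]/E[\lambda^2]$ under $\text{Ga}(\lambda\,|\,n/2, n s^2/2)$ and confirms that it equals $n s^2/(n + 2)$, a smaller multiple of $s^2$ than the usual estimate:

```python
import random

random.seed(3)
n = 30
x = [random.gauss(0.0, 2.0) for _ in range(n)]   # known mean 0, true sigma^2 = 4
ns2 = sum(xi * xi for xi in x)                   # n * s^2

# Moments of the reference posterior Ga(lambda | n/2, n*s2/2)
shape, rate = n / 2, ns2 / 2
e1 = shape / rate                      # E[lambda]
e2 = shape * (shape + 1) / rate**2     # E[lambda^2]

a_opt = e1 / e2                        # optimal estimate of sigma^2
print(a_opt, ns2 / (n + 2))            # identical; both lie below s^2 = ns2/n
```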
Explicit reference decision analysis is also possible when the parameter space $\Theta = \{\theta_1, \ldots, \theta_M\}$ is finite. In this case, the expected value of perfect information about $\theta$ takes a simple closed form, and the $u$-reference prior, for a given utility structure, is that $\pi(\theta)$ which maximises it. For further information, see Bernardo (1981a) and Rabena (1998).

5.4.2 One-dimensional Reference Distributions

In Sections 2.7 and 3.4, we noted that reporting beliefs is itself a decision problem, where the "inference answer" space consists of the class of possible belief distributions that could be reported about the quantity of interest, and the utility function is a proper scoring rule which, in pure inference problems, may be identified with the logarithmic scoring rule.
In discussing reference decisions, we have considered a rather general utility structure where practical interest centred on a quantity $\omega$ related to the $\theta$ of an experiment by a conditional probability specification, $p(\omega\,|\,\theta)$. Here, we shall consider the case where the quantity of interest is $\theta$ itself, with $\theta \in \Theta \subseteq \Re$; more general cases will be considered later. Our development of reference analysis from now on will concentrate on this case, for which we simply denote $v_u\{\cdot\}$ by $v\{\cdot\}$ and replace the term "$u$-reference" by "reference".

If an experiment $e$ consists of an observation $x \in X$ having parametric model $p(x\,|\,\theta)$, with $\omega = \theta$, the "inference answer" space is
$$\mathcal{A} = \left\{q(\cdot);\ q(\theta) > 0,\ \int_\Theta q(\theta)\,d\theta = 1\right\},$$
with the utility function given by the logarithmic scoring rule
$$u\{q(\cdot), \theta\} = A\log q(\theta) + B(\theta), \qquad A > 0.$$
Noting that $u$ is a proper scoring rule, so that the optimal choices of $q(\cdot)$ with respect to $p(\theta)$ and $p(\theta\,|\,x)$ are $p(\theta)$ and $p(\theta\,|\,x)$, respectively, it is easily seen that the amount of information about $\theta$ which $e$ may be expected to provide is
$$I\{e, p(\theta)\} = \int_X p(x)\int_\Theta p(\theta\,|\,x)\log p(\theta\,|\,x)\,d\theta\,dx - \int_\Theta p(\theta)\log p(\theta)\,d\theta.$$
The corresponding expected information from the (hypothetical) experiment $e(k)$, yielding the (imagined) observation $z_k = (x_1, \ldots, x_k)$ with parametric model
$$p(z_k\,|\,\theta) = \prod_{i=1}^{k} p(x_i\,|\,\theta),$$
is given by
$$I\{e(k), p(\theta)\} = \int p(z_k)\int_\Theta p(\theta\,|\,z_k)\log p(\theta\,|\,z_k)\,d\theta\,dz_k - \int_\Theta p(\theta)\log p(\theta)\,d\theta,$$
and so the expected (utility) value of perfect information about $\theta$ is
$$I\{e(\infty), p(\theta)\} = \lim_{k\to\infty} I\{e(k), p(\theta)\},$$
provided that this limit exists.
This quantity measures the missing information about $\theta$ as a function of the prior $p(\theta)$. The reference prior for $\theta$, denoted by $\pi(\theta)$, is thus defined to be that prior which maximises the missing information functional. Unfortunately, $\lim_{k\to\infty} I\{e(k), p(\theta)\}$ is typically infinite (unless $\theta$ can only take a finite range of values) and a direct approach to deriving $\pi(\theta)$ along these lines cannot be implemented. However, a natural way of overcoming this technical difficulty is available: we derive the sequence of priors $\pi_k(\theta)$ which maximise $I\{e(k), p(\theta)\}$, $k = 1, 2, \ldots$, and subsequently take $\pi(\theta)$ to be a suitable limit. This approach will now be developed in detail.

Let $e$ be the experiment which consists of one observation $x$ from $p(x\,|\,\theta)$, $\theta \in \Theta \subseteq \Re$. Suppose that we are interested in reporting inferences about $\theta$ and that no restrictions are imposed on the form of the prior distribution $p(\theta)$. It is easily verified that the amount of information about $\theta$ which $k$ independent replications of $e$ may be expected to provide may be rewritten as
$$I\{e(k), p(\theta)\} = \int_\Theta p(\theta)\log\frac{f_k(\theta)}{p(\theta)}\,d\theta, \qquad f_k(\theta) = \exp\left\{\int p(z_k\,|\,\theta)\log p(\theta\,|\,z_k)\,dz_k\right\},$$
where $z_k = (x_1, \ldots, x_k)$ is a possible outcome from $e(k)$ and $p(\theta\,|\,z_k)$ is the posterior distribution for $\theta$ after $z_k$ has been observed. Moreover, for any prior $p(\theta)$ one must have the constraint $\int_\Theta p(\theta)\,d\theta = 1$ and, therefore, the prior which maximises $I\{e(k), p(\theta)\}$ must be an extremal of the functional
$$F\{p(\cdot)\} = \int_\Theta p(\theta)\log\frac{f_k(\theta)}{p(\theta)}\,d\theta + \lambda\left\{\int_\Theta p(\theta)\,d\theta - 1\right\}.$$
Since the integrand is twice continuously differentiable in $p$, any extremal must satisfy, for any function $\tau$, the condition
$$\left.\frac{\partial}{\partial\varepsilon}F\{p(\cdot) + \varepsilon\tau(\cdot)\}\right|_{\varepsilon = 0} = 0.$$
Given actual data $x$, the reference posterior $\pi(\theta\,|\,x)$ to be reported is simply derived from Bayes' theorem, as $\pi(\theta\,|\,x) \propto p(x\,|\,\theta)\,\pi(\theta)$.
This completes our motivation for Definition 5. under suitable regularity conditions. for all 8 E 8. this only provides an implicit solution for the prior which maximises 1 8 { e ( k ) . and hence that p ( 8 ) cx fk(8). 8. for any function T . p ( 8 ) } . . .7. Thus. .q). the required condition becomes which implies that the desired extremal should satisfy. for each k.. . for large values of posterior distribution p(8 zk) = p ( 8 I zl. may be found to the posterior distribution of say. It follows that. an approximation. p ( 8 ) } . k. p'(8 I q.4 Reference Analysis It follows that. 305 where. a sequence of posterior distributions with the same limiting distributions that would have been obtained from the sequence of posteriors derived from the sequence of priors r k ( 8 ) which maximise I B { e ( k ) .).5. by formal use of Bayes' theorem. after some algebra. which is independent of the prior p ( 0 ) . Note that.0) depends on the prior through the However.since fk. For further information see Bemardo ( I 979b) and ensuing discussion. the sequence of positive functions I will induce.
We should stress that the definitions and "propositions" in this section are by and large heuristic, in the sense that they lack statements of the technical conditions which would make the theory rigorous. The positive functions $\pi(\theta)$ are merely pragmatically convenient tools for the derivation of reference posterior distributions via Bayes' theorem, and it will be clear from later illustrative examples that the forms which arise may have no direct probabilistic interpretation. Although most of the following discussion refers to reference priors, it must be stressed that only reference posterior distributions are directly interpretable in probabilistic terms.

Definition 5.7. (One-dimensional reference distributions). Let $x$ be the result of an experiment $e$ which consists of one observation from $p(x\,|\,\theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, let $z_k = \{x_1, \ldots, x_k\}$ be the result of $k$ independent replications of $e$, and define
$$f_k(\theta) = \exp\left\{\int p(z_k\,|\,\theta)\log p^*(\theta\,|\,z_k)\,dz_k\right\},$$
where $p^*(\theta\,|\,z_k)$ is an asymptotic approximation to the posterior distribution of $\theta$. The reference posterior density of $\theta$ after $x$ has been observed is defined to be the log-divergence limit, $\pi(\theta\,|\,x)$, of
$$\pi_k(\theta\,|\,x) = c_k(x)\,p(x\,|\,\theta)\,f_k(\theta),$$
where the $c_k(x)$'s are the required normalising constants, assuming this limit to exist for almost all $x$. Any positive function $\pi(\theta)$ such that, for some $c(x) > 0$ and for all $\theta \in \Theta$,
$$\pi(\theta\,|\,x) = c(x)\,p(x\,|\,\theta)\,\pi(\theta)$$
will be called a reference prior for $\theta$ relative to the experiment $e$.

It should be clear from the argument which motivates the definition that any asymptotic approximation to the posterior distribution may be used in place of the asymptotic approximation $p^*(\theta\,|\,z_k)$ defined above. The use of convergence in the information sense, rather than just pointwise convergence, which is the natural convergence in this context, is necessary to avoid possibly pathological behaviour; for details, see Berger and Bernardo (1992c). Making the statements and
proofs precise, however, would require a different level of mathematics from that used in this book and, at the time of writing, is still an active area of research. The reader interested in the technicalities involved is referred to Berger and Bernardo (1989, 1992a, 1992b, 1992c) and Berger et al. (1989). So far as the contents of this section are concerned, the reader would be best advised to view the procedure as an "algorithm", which, compared with other proposals discussed in Section 5.6.2, appears to produce appealing solutions in all situations thus far examined.

An explicit form for the reference prior is immediately available from Definition 5.7.

Proposition 5.18. (Explicit form of the reference prior). A reference prior for $\theta$, relative to the experiment which consists of one observation from $p(x\,|\,\theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, is given, under suitable regularity conditions, by any function of the form
$$\pi(\theta) = c\lim_{k\to\infty}\frac{f_k(\theta)}{f_k(\theta_0)}, \qquad c > 0,\ \theta_0 \in \Theta,$$
where
$$f_k(\theta) = \exp\left\{\int p(z_k\,|\,\theta)\log p^*(\theta\,|\,z_k)\,dz_k\right\},$$
with $z_k = \{x_1, \ldots, x_k\}$ a random sample from $p(x\,|\,\theta)$, and $p^*(\theta\,|\,z_k)$ an asymptotic approximation to the posterior distribution of $\theta$.

Proof. Using $\pi(\theta)$ as a formal prior, $\pi_k(\theta\,|\,x) \propto p(x\,|\,\theta)\,f_k(\theta)$, and hence
$$\pi(\theta\,|\,x) = \lim_{k\to\infty}\pi_k(\theta\,|\,x),$$
as required, provided the limit exists and convergence in the information sense is verified.

If the parameter space is finite, it turns out that the reference prior is uniform, independently of the experiment performed.

Proposition 5.19. (Reference prior in the finite case). Let $x$ be the result of one observation from $p(x\,|\,\theta)$, where $\theta \in \Theta = \{\theta_1, \ldots, \theta_M\}$ is finite. Then, any function of the form $\pi(\theta_i) = a$, $a > 0$, $i = 1, \ldots, M$, is a reference prior, and the reference posterior is
$$\pi(\theta_i\,|\,x) = c(x)\,p(x\,|\,\theta_i), \qquad i = 1, \ldots, M,$$
where $c(x)$ is the required normalising constant.

Proof. For any strictly positive prior, $p(\theta_i\,|\,x_1, \ldots, x_k)$ will converge to $1$ if $\theta_i$ is the true value of $\theta$, and will converge to zero otherwise. It follows that the integral in the exponent of $f_k(\theta_i)$ converges to zero, so that $f_k(\theta_i) \to 1$ as $k \to \infty$ and, by Proposition 5.18, the reference prior is constant over $\Theta$.
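The finite-case result can be made concrete numerically: in this case the missing information equals the entropy of the prior, and the uniform prior maximises it. The following sketch (our own illustration; the four-point space and the random search are assumptions) checks that no randomly chosen prior exceeds the entropy of the uniform distribution:

```python
import math, random

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

M = 4
uniform = [1.0 / M] * M

random.seed(4)
def random_prior(M):
    # normalised exponential weights: a Dirichlet(1, ..., 1) draw
    w = [random.expovariate(1.0) for _ in range(M)]
    s = sum(w)
    return [wi / s for wi in w]

best = max(entropy(random_prior(M)) for _ in range(1000))
print(best, entropy(uniform))   # the uniform entropy, log(4), is never exceeded
```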
308  5 Inference

Proof. We have already established (Proposition 5.12) that if $\Theta$ is finite then, for any strictly positive prior, $p(\theta_i \mid z_k)$ will converge to one if $\theta_i$ is the true value of $\theta$, and will converge to zero otherwise. It follows that the integral in the exponent of

$$f_k(\theta_i) = \exp\left\{ \int p(z_k \mid \theta_i) \log p(\theta_i \mid z_k)\, dz_k \right\}$$

will converge to zero as $k \to \infty$, for all $i = 1, \ldots, M$, so that a reference prior is given by $\pi(\theta_i) \propto 1$. The general form of reference prior follows immediately.

The preceding result for the case of a finite parameter space is easily derived from first principles. Indeed, in this case the expected missing information is finite and equals the entropy of the prior,

$$H\{p(\theta)\} = -\sum_{i=1}^{M} p(\theta_i) \log p(\theta_i).$$

This is maximised if and only if the prior is uniform.

The technique encapsulated in Definition 5.7 for identifying the reference prior depends on the asymptotic behaviour of the posterior for the parameter of interest under (imagined) replications of the experiment to be actually performed. Thus far, our derivations have proceeded on the basis of an assumed single observation from a parametric model, $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$. The next proposition establishes that, for experiments involving a sequence of $n \geq 1$ observations which are to be modelled as if they are a random sample, conditional on a parametric model, the reference prior does not depend on the size of the experiment and can thus be derived on the basis of a single observation experiment. Note, however, that for experiments involving more structured designs (for example, in linear models) the situation is much more complicated.

Proposition 5.20. (Independence of sample size). Let $e_n$, $n \geq 1$, be the experiment which consists of the observation of a random sample $x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, and let $\mathcal{P}_n$ denote the class of reference priors for $\theta$ with respect to $e_n$, derived in accordance with Definition 5.7 by considering the sample to be a single observation from $\prod_{i=1}^{n} p(x_i \mid \theta)$. Then $\mathcal{P}_n = \mathcal{P}_1$, for all $n$.

Proof. By Proposition 5.18, $\mathcal{P}_1$ consists of $\pi(\theta)$ of the form

$$\pi(\theta) = c \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)}, \qquad f_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\},$$

with $c > 0$, where $z_k$ denotes a $k$-fold replicate of a single observation from $p(x \mid \theta)$. Now consider $z_{nk} = \{x_1, \ldots, x_{nk}\}$, considered as the result of a $k$-fold independent replicate of $e_n$, so that $\mathcal{P}_n$ consists of $\pi(\theta)$ of the corresponding form. But $z_{nk}$ can equally be considered as an $nk$-fold independent replicate of $e_1$, and so the limiting ratios are clearly identical.

In considering experiments involving random samples from distributions admitting a sufficient statistic of fixed dimension, it is natural to wonder whether the reference priors derived from the distribution of the sufficient statistic are identical to those derived from the joint distribution for the sample. The next proposition guarantees us that this is indeed the case.

Proposition 5.21. (Compatibility with sufficient statistics). Let $e_n$, $n \geq 1$, be the experiment which consists of the observation of a random sample $x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, where, for all $n$, the experiment admits a sufficient statistic $t_n = t(x_1, \ldots, x_n)$. Then, for any $n$, the classes of reference priors derived by considering replications of $(x_1, \ldots, x_n)$ and of $t_n$, respectively, coincide, and are identical to the class obtained by considering replications of $e_1$.

Proof. If $z_k$ denotes a $k$-fold replicate of $(x_1, \ldots, x_n)$ and $y_k$ denotes the corresponding $k$-fold replicate of $t_n$, then, by the definition of a sufficient statistic, for any prior $p(\theta)$ we have $p(\theta \mid z_k) = p(\theta \mid y_k)$. It follows that the corresponding asymptotic posterior distributions are identical, so that $p^*(\theta \mid z_k) = p^*(\theta \mid y_k)$, and hence

$$\exp\left\{ \int p(z_k \mid \theta) \log p^*(\theta \mid z_k)\, dz_k \right\} = \exp\left\{ \int p(y_k \mid \theta) \log p^*(\theta \mid y_k)\, dy_k \right\},$$

so that the classes of limiting ratios coincide. Identity with the class derived from $e_1$ follows from Proposition 5.20.
310  5 Inference

Given a parametric model, $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, we could, of course, reparametrise and work instead with $p(x \mid \phi)$, $x \in X$, $\phi = g(\theta)$, for any one-to-one mapping $g: \Theta \to \Phi$. The question now arises as to whether reference priors for $\theta$ and $\phi$, derived from the parametric models $p(x \mid \theta)$ and $p(x \mid \phi)$, respectively, are consistent, in the sense that their ratio is the required Jacobian element. The next proposition establishes this form of consistency, and can clearly be extended to mappings which are piecewise monotone.

Proposition 5.22. (Invariance under one-to-one transformations). Suppose that $\pi_\theta(\theta)$, $\pi_\phi(\phi)$ are reference priors derived by considering replications of experiments consisting of a single observation from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, and from $p(x \mid \phi)$, $x \in X$, $\phi \in \Phi$, respectively, where $\phi = g(\theta)$ and $g: \Theta \to \Phi$ is a one-to-one monotone mapping. Then, for some $c > 0$ and for all $\phi \in \Phi$:

(i) $\pi_\phi(\phi) = c\, \pi_\theta\left(g^{-1}(\phi)\right)$, if $\Theta$ is discrete;

(ii) $\pi_\phi(\phi) = c\, \pi_\theta\left(g^{-1}(\phi)\right) \left| J_g(\phi) \right|$, otherwise, where $J_g(\phi) = d\,g^{-1}(\phi)/d\phi$ is the Jacobian of the inverse transformation.

Proof. If $\Theta$ is discrete, so is $\Phi$, and the result follows from Proposition 5.19. Otherwise, if $z_k$ denotes a $k$-fold replicate of a single observation from $p(x \mid \theta)$, then, for any proper prior $p(\theta)$, the corresponding prior for $\phi$ is given by $p_\phi(\phi) = p_\theta\left(g^{-1}(\phi)\right) |J_g(\phi)|$, and hence the asymptotic posterior approximations are related by the same Jacobian element,

$$p^*_\phi(\phi \mid z_k) = p^*_\theta\left(g^{-1}(\phi) \mid z_k\right) \left| J_g(\phi) \right|.$$

It follows that

$$f_k^\phi(\phi) = \exp\left\{ \int p(z_k \mid \phi) \log p^*_\phi(\phi \mid z_k)\, dz_k \right\} = \left| J_g(\phi) \right| f_k^\theta\left(g^{-1}(\phi)\right),$$

since the Jacobian element does not depend on $z_k$, and the second result now follows from Proposition 5.18.

The assumed existence of the asymptotic posterior distributions that would result from an imagined $k$-fold replicate of the experiment under consideration clearly plays a key role in the derivation of the reference prior. However, it is important to note that no assumption has thus far been required concerning the form of this asymptotic posterior distribution. As we shall see later, we shall typically consider the case of asymptotic posterior normality, but the following example shows that the technique is by no means restricted to this case.
5.4 Reference Analysis  311

Example 5.12. (Uniform model). Let $e_n$ be the experiment which consists of observing a random sample $x = (x_1, \ldots, x_n)$, $n \geq 1$, from a uniform distribution on $[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$, $\theta \in \Re$. Then

$$t_n = \left( \min\{x_1, \ldots, x_n\},\ \max\{x_1, \ldots, x_n\} \right) = \left( x_{\min}, x_{\max} \right)$$

is a sufficient statistic for $\theta$, and a $k$-fold replicate of $e_n$, with a uniform prior, will result in the posterior uniform distribution

$$p^*\left(\theta \mid t_{kn}\right) = \mathrm{U}\left(\theta \,\Big|\ x_{\max}^{(kn)} - \tfrac{1}{2},\ x_{\min}^{(kn)} + \tfrac{1}{2}\right),$$

the expectation being with respect to the distribution of $t_{kn}$. Noting that the distributions of $u = x_{\max}^{(kn)} - (\theta - \frac{1}{2})$ and $v = x_{\min}^{(kn)} - (\theta - \frac{1}{2})$ are $\mathrm{Be}(u \mid kn, 1)$ and $\mathrm{Be}(v \mid 1, kn)$, respectively, it is easily verified that, for large $k$, the expected value of the logarithm of this posterior density is well approximated by $\log\{(kn + 1)/2\}$, so that

$$f_k(\theta) \simeq \frac{kn + 1}{2},$$

and hence that

$$\pi(\theta) = c \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)} = c \lim_{k \to \infty} \frac{(kn + 1)/2}{(kn + 1)/2} = c.$$

Any reference prior for this problem is therefore a constant and, therefore, given a set of actual data $x = (x_1, \ldots, x_n)$, the reference posterior distribution is

$$\pi(\theta \mid x) \propto p(x \mid \theta) = \mathrm{U}\left(\theta \,\Big|\ x_{\max} - \tfrac{1}{2},\ x_{\min} + \tfrac{1}{2}\right),$$

a uniform distribution over the set of $\theta$ values which remain possible after $x$ has been observed.
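The reference posterior in this example is trivial to compute from data. A minimal sketch (ours, with an illustrative simulated sample; the true value of $\theta$ is guaranteed to lie in the posterior support):

```python
import numpy as np

rng = np.random.default_rng(42)
theta_true = 2.7
x = rng.uniform(theta_true - 0.5, theta_true + 0.5, size=20)

# Constant reference prior => reference posterior uniform over the set of
# theta values still compatible with the data:
lower, upper = x.max() - 0.5, x.min() + 0.5
height = 1.0 / (upper - lower)          # posterior density on [lower, upper]

print(lower, upper)
```

Since every observation lies within half a unit of $\theta$, the interval $[x_{\max} - \frac{1}{2},\, x_{\min} + \frac{1}{2}]$ is never empty and always contains the true value.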
312  5 Inference

Typically, under suitable regularity conditions, the asymptotic posterior distribution $p^*(\theta \mid z_{kn})$, corresponding to an imagined $k$-fold replication of an experiment $e_n$ involving a random sample of $n$ from $p(x \mid \theta)$, will only depend on $z_{kn}$ through an asymptotically sufficient, consistent estimate of $\theta$, a concept which is made precise in the next proposition. In such cases, the reference prior can easily be identified from the form of the asymptotic posterior distribution.

Proposition 5.23. (Explicit form of the reference prior when there is a consistent, asymptotically sufficient estimator). Let $e_n$ be the experiment which consists of the observation of a random sample $x = \{x_1, \ldots, x_n\}$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$, and let $z_{kn}$ be the result of a $k$-fold replicate of $e_n$. If there exists $\hat{\theta}_{kn} = \hat{\theta}_{kn}(z_{kn})$ such that, with probability one, $\hat{\theta}_{kn} \to \theta$ as $k \to \infty$ and, asymptotically,

$$p^*(\theta \mid z_{kn}) = p^*(\theta \mid \hat{\theta}_{kn}),$$

then, for any $c > 0$ and $\theta_0 \in \Theta$, reference priors are defined by

$$\pi(\theta) = c \lim_{k \to \infty} \frac{f_k(\theta)}{f_k(\theta_0)}, \qquad f_k(\theta) = p^*\left(\theta \mid \hat{\theta}_{kn}\right)\Big|_{\hat{\theta}_{kn} = \theta}.$$

Proof. As $k \to \infty$, it follows from the assumptions that

$$f_k(\theta) = \exp\left\{ \int p(z_{kn} \mid \theta) \log p^*(\theta \mid z_{kn})\, dz_{kn} \right\} = \exp\left\{ \int p(\hat{\theta}_{kn} \mid \theta) \log p^*(\theta \mid \hat{\theta}_{kn})\, d\hat{\theta}_{kn} \right\} \simeq p^*\left(\theta \mid \hat{\theta}_{kn}\right)\Big|_{\hat{\theta}_{kn} = \theta},$$

since, by consistency, the distribution $p(\hat{\theta}_{kn} \mid \theta)$ concentrates on $\theta$. The result now follows from Proposition 5.18.
5.4 Reference Analysis  313

Example 5.13. (Deviation from uniformity model). Let $e_n$ be the experiment which consists of obtaining a random sample from $p(x \mid \theta)$, $\theta > 0$, $0 \leq x \leq 1$, where

$$p(x \mid \theta) = \begin{cases} \theta\,(2x)^{\theta - 1}, & 0 < x \leq \frac{1}{2}, \\ \theta\,\{2(1 - x)\}^{\theta - 1}, & \frac{1}{2} \leq x < 1, \end{cases}$$

which defines a one-parameter probability model on $[0, 1]$, finding application (see Bernardo and Bayarri, 1985) in exploring deviations from the standard uniform model on $[0, 1]$ (given by $\theta = 1$). It is easily verified that if $z_k = \{x_1, \ldots, x_{kn}\}$ results from a $k$-fold replicate of $e_n$, then

$$t_k = -\frac{1}{kn} \sum_{i=1}^{kn} \log\left\{ 2 \min(x_i, 1 - x_i) \right\}$$

is a sufficient statistic, with $E[t_k \mid \theta] = \theta^{-1}$, so that $t_k^{-1}$ is a sufficient, consistent estimate of $\theta$. It is also easily shown that, for any prior $p(\theta)$, an asymptotic approximation to the posterior distribution is given by

$$p^*(\theta \mid z_k) = \mathrm{Ga}(\theta \mid kn + 1,\ kn\, t_k),$$

which satisfies the conditions required in Proposition 5.23. From the form of the right-hand side, evaluated at the value $\theta^{-1}$ on which $t_k$ concentrates, we see that, for large $k$,

$$f_k(\theta) = \mathrm{Ga}\left(\theta \mid kn + 1,\ kn\,\theta^{-1}\right) \propto \theta^{-1},$$

so that, for some $c > 0$, the reference prior is $\pi(\theta) = c\,\theta^{-1}$. The reference posterior for $\theta$, having observed actual data $x = (x_1, \ldots, x_n)$ producing the sufficient statistic $t = t(x)$, is therefore

$$\pi(\theta \mid x) \propto p(x \mid \theta)\, \theta^{-1} \propto \theta^{n - 1} \exp\left\{ -n(\theta - 1)t \right\},$$

which is a $\mathrm{Ga}(\theta \mid n, nt)$ distribution.
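A numerical sketch of this example (ours, not the authors'; it assumes SciPy's gamma parametrisation with shape `a` and `scale` equal to one over the rate): simulate data from the $\theta = 1$ (uniform) case, compute the sufficient statistic, and form the $\mathrm{Ga}(\theta \mid n, nt)$ reference posterior.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0.0, 1.0, size=n)      # data from the theta = 1 (uniform) case

# Sufficient statistic t = -(1/n) sum log{2 min(x_i, 1 - x_i)}
t = -np.mean(np.log(2.0 * np.minimum(x, 1.0 - x)))

# Reference posterior pi(theta | x) = Ga(theta | n, n t)
posterior = stats.gamma(a=n, scale=1.0 / (n * t))
print(posterior.mean())                # posterior mean is 1/t, close to 1 here
```

Under the uniform model, $-\log\{2\min(x, 1 - x)\}$ is a standard exponential variate, so $t$ is close to one and the posterior concentrates near $\theta = 1$, as it should.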
314  5 Inference

Under regularity conditions similar to those described in Section 5.2.3, the asymptotic posterior distribution of $\theta$, given a $k$-fold replicate of $e_n$, tends to normality. In such cases, we can obtain a characterisation of the reference prior directly in terms of the parametric model in which $\theta$ appears.

Proposition 5.24. (Reference priors under asymptotic normality). Let $e_n$ be the experiment which consists of the observation of a random sample $x_1, \ldots, x_n$ from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subseteq \Re$. If the asymptotic posterior distribution of $\theta$, given a $k$-fold replicate of $e_n$, is normal with precision $kn\, h(\hat{\theta}_{kn})$, where $\hat{\theta}_{kn}$ is a consistent estimate of $\theta$, then, for some $c > 0$, reference priors have the form

$$\pi(\theta) = c\,\{h(\theta)\}^{1/2}.$$

Proof. Under the stated conditions, $\hat{\theta}_{kn}$ is an asymptotically sufficient, consistent estimate of $\theta$, so that, by Proposition 5.23,

$$f_k(\theta) = p^*\left(\theta \mid \hat{\theta}_{kn}\right)\Big|_{\hat{\theta}_{kn} = \theta} = \left( \frac{kn\, h(\theta)}{2\pi} \right)^{1/2},$$

and hence $\pi(\theta) \propto \{h(\theta)\}^{1/2}$, as required.

Typically, under the conditions where asymptotic posterior normality obtains, we find that

$$h(\theta) = \int p(x \mid \theta) \left( -\frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right) dx,$$

that is, Fisher's information (Fisher, 1925). The reference prior $\pi(\theta) \propto \{h(\theta)\}^{1/2}$ then becomes Jeffreys' (or Perks') prior, and the result of Proposition 5.24 is closely related to the "rules" proposed by Jeffreys (1946; 1939/1961) and by Perks (1947) to derive "non-informative" priors.
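Fisher's information, and hence the Jeffreys-form reference prior, can be checked numerically from the model alone. A small sketch (ours) for a single Bernoulli trial, using central finite differences in place of the analytical second derivative:

```python
import numpy as np

def bernoulli_fisher(theta, eps=1e-5):
    """Fisher information h(theta) = E[-d2/dtheta2 log p(x|theta)] for one
    Bernoulli trial, via a central finite difference in theta."""
    def loglik(x, th):
        return x * np.log(th) + (1 - x) * np.log(1 - th)
    h = 0.0
    for x in (0, 1):
        d2 = (loglik(x, theta + eps) - 2 * loglik(x, theta)
              + loglik(x, theta - eps)) / eps ** 2
        h += -d2 * (theta if x == 1 else 1 - theta)   # weight by p(x | theta)
    return h

theta = 0.3
h = bernoulli_fisher(theta)
print(h ** 0.5)   # reference (Jeffreys) prior is proportional to h(theta)^{1/2}
```

The analytical value is $h(\theta) = \theta^{-1}(1 - \theta)^{-1}$, which the finite-difference estimate reproduces to within the usual cancellation error.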
5.4 Reference Analysis  315

It should be noted, however, that Jeffreys' formula is not necessarily the easiest way of deriving a reference prior. As illustrated in Examples 5.12 and 5.13 above, it is often simpler to apply Proposition 5.18 using an asymptotic approximation to the posterior distribution. See Polson (1992) for a related derivation.

It is important to stress that reference distributions are, by definition, a function of the entire probability model $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$, not only of the observed likelihood. Technically, this is a consequence of the fact that the amount of information which an experiment may be expected to provide is the value of an integral over the entire sample space $X$. We have, therefore, already encountered in Section 5.1.4 the idea that knowledge of the data generating mechanism may influence the prior specification.

Example 5.14. (Binomial and negative binomial data). Consider an experiment which consists of the observation of $n$ Bernoulli trials, with $n$ fixed in advance, so that $x = \{x_1, \ldots, x_n\}$,

$$p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}, \quad x \in \{0, 1\}, \quad 0 \leq \theta \leq 1,$$

for which $h(\theta) = \theta^{-1}(1 - \theta)^{-1}$ and hence, by Proposition 5.24, the reference prior is

$$\pi(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}.$$

If $r = \sum_{i=1}^{n} x_i$, the reference posterior,

$$\pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \propto \theta^{r - 1/2}(1 - \theta)^{n - r - 1/2},$$

is the beta distribution $\mathrm{Be}(\theta \mid r + \frac{1}{2}, n - r + \frac{1}{2})$. Note that $\pi(\theta \mid x)$ is proper, whatever the number of successes $r$. In particular, if $r = 0$, $\pi(\theta \mid x) = \mathrm{Be}(\theta \mid \frac{1}{2}, n + \frac{1}{2})$, from which sensible inference summaries can be made, even though there are no observed successes. (Compare this with the Haldane (1948) prior, which produces an improper posterior until at least one success is observed.)

Consider now, however, an experiment which consists of counting the number $x$ of Bernoulli trials which it is necessary to perform in order to observe a prespecified number of successes, $r \geq 1$. The probability model for this situation is the negative binomial

$$p(x \mid \theta) = \binom{x - 1}{r - 1} \theta^{r} (1 - \theta)^{x - r}, \quad x = r, r + 1, \ldots$$
From the negative binomial model we obtain, by Proposition 5.24, the reference prior

$$\pi(\theta) \propto \theta^{-1}(1 - \theta)^{-1/2},$$

and hence the reference posterior is given by

$$\pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \propto \theta^{r - 1}(1 - \theta)^{x - r - 1/2}, \quad x = r, r + 1, \ldots,$$

which is the beta distribution $\mathrm{Be}(\theta \mid r, x - r + \frac{1}{2})$. Again, we note that this distribution is proper, whatever the number of observations $x$ required to obtain $r$ successes. Note that $r = 0$ is not possible under this model: the use of an inverse binomial sampling design implicitly assumes that $r$ successes will eventually occur for sure, which is not true in direct binomial sampling. This difference in the underlying assumption about $\theta$ is duly reflected in the slight difference which occurs between the respective reference prior distributions. See Geisser (1984) and ensuing discussion for further analysis and discussion of this canonical example.

In reporting results, scientists are typically required to specify not only the data but also the conditions under which the data were obtained (the design of the experiment), so that the data analyst has available the full specification of the probability model $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta$. In order to carry out the reference analysis described in this section, such a full specification is clearly required.

We want to stress, however, that the preceding argument is totally compatible with a full personalistic view of probability. A reference prior is nothing but a (limiting) form of rather specific beliefs: namely, those which maximise the missing information which a particular experiment could possibly be expected to provide. Consequently, different experiments generally define different types of limiting beliefs. To report the corresponding reference posteriors (possibly for a range of possible alternative models) is only part of the general prior-to-posterior mapping which interpersonal or sensitivity considerations would suggest should always be carried out. Reference analysis provides an answer to an important "what if?" question: namely, what can be said about the parameter of interest if prior information were minimal relative to the maximum information which a well-defined, specific experiment could be expected to provide?

5.4.3 Restricted Reference Distributions

When analysing the inferential implications of the result of an experiment for a quantity of interest, $\theta$, where, for simplicity, we continue to assume that $\theta \in \Theta \subset \Re$, it is often interesting, either per se, or on a "what if?" basis, to condition on some assumed features of the prior distribution $p(\theta)$, thus defining a restricted class, $Q$, of priors which consists of those distributions compatible with such conditioning. The concept of a reference posterior may easily be extended to this situation by maximising the missing information which the experiment may possibly be expected to provide within this restricted class of priors.
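The two reference posteriors of Example 5.14 can be placed side by side numerically. A short sketch (ours; SciPy's `beta` takes the two shape parameters directly), for the same observed counts under the two designs:

```python
from scipy import stats

n, r = 10, 3   # 10 Bernoulli trials with 3 successes observed

# Binomial sampling (n fixed in advance): Be(theta | r + 1/2, n - r + 1/2)
binom_post = stats.beta(r + 0.5, n - r + 0.5)

# Inverse binomial sampling (r fixed, x = n trials were needed):
# Be(theta | r, x - r + 1/2)
negbin_post = stats.beta(r, n - r + 0.5)

# Same likelihood kernel, slightly different reference posteriors:
# the sampling mechanism matters for the reference analysis.
print(binom_post.mean(), negbin_post.mean())
```

The posterior means differ slightly, reflecting the different reference priors rather than any difference in the observed likelihood.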
5.4 Reference Analysis  317

Repeating the argument which motivated the definition of (unrestricted) reference distributions, we are led to seek the limit of the sequence of posterior distributions, $\pi_k(\theta \mid x)$, which correspond to the sequence of priors, $\pi_k(\theta)$, obtained by maximising, within $Q$, the amount of information

$$I^\theta\{e(k), p(\theta)\} = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta, \qquad f_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right\},$$

which could be expected from $k$ independent replications $z_k = \{x_1, \ldots, x_k\}$ of the single observation experiment.

Definition 5.8. (Restricted reference distributions). Let $x$ be the result of an experiment $e$ which consists of one observation from $p(x \mid \theta)$, $x \in X$, $\theta \in \Theta \subset \Re$, let $Q$ be a subclass of the class of all prior distributions for $\theta$, let $z_k = \{x_1, \ldots, x_k\}$ be the result of $k$ independent replications of $e$, and define

$$f_k(\theta) = \exp\left\{ \int p(z_k \mid \theta) \log p(\theta \mid z_k)\, dz_k \right\}.$$

Provided it exists, the $Q$-reference posterior distribution of $\theta$, after $x$ has been observed, is defined to be

$$\pi_Q(\theta \mid x) = \lim_{k \to \infty} \pi_k^Q(\theta \mid x), \qquad \pi_k^Q(\theta \mid x) \propto p(x \mid \theta)\, \pi_k^Q(\theta),$$

where $\pi_k^Q(\theta)$ is a prior which minimises, within $Q$, the logarithmic divergence

$$\delta\{p(\theta), f_k(\theta)\} = \int p(\theta) \log \frac{p(\theta)}{f_k(\theta)}\, d\theta.$$

A positive function $\pi_Q(\theta)$ in $Q$ such that

$$\pi_Q(\theta \mid x) \propto p(x \mid \theta)\, \pi_Q(\theta), \quad \text{for almost all } x,$$

is then called a $Q$-reference prior for $\theta$ relative to the experiment $e$.
318  5 Inference

The intuitive content of Definition 5.8 is illuminated by the following result, which essentially establishes that the $Q$-reference prior is the closest prior in $Q$ to the unrestricted reference prior $\pi(\theta)$, in the sense of minimising its logarithmic divergence from $\pi(\theta)$.

Proposition 5.25. (The restricted reference prior as an approximation). Suppose that an unrestricted reference prior $\pi(\theta)$ relative to a given experiment is proper. Then, if it exists, a $Q$-reference prior $\pi_Q(\theta)$ satisfies

$$\int \pi_Q(\theta) \log \frac{\pi_Q(\theta)}{\pi(\theta)}\, d\theta = \min_{p \in Q} \int p(\theta) \log \frac{p(\theta)}{\pi(\theta)}\, d\theta.$$

Proof. It follows from Proposition 5.18 that if $\pi(\theta)$ is proper then, for sufficiently large $k$, $f_k(\theta)$ may be normalised, with $c_k = \int f_k(\theta)\, d\theta < \infty$, and $\pi(\theta) = \lim_{k \to \infty} c_k^{-1} f_k(\theta)$. Moreover, by Definition 5.8, $\pi_k^Q(\theta)$ minimises, within $Q$,

$$\int p(\theta) \log \frac{p(\theta)}{f_k(\theta)}\, d\theta = \int p(\theta) \log \frac{p(\theta)}{c_k^{-1} f_k(\theta)}\, d\theta - \log c_k,$$

and hence also minimises the logarithmic divergence of $p(\theta)$ from the normalised version of $f_k(\theta)$. Taking $k \to \infty$, the result follows by the continuity of the divergence functional.
5.4 Reference Analysis  319

If $\pi(\theta)$ is not proper, it is necessary to apply Definition 5.8 directly in order to characterise $\pi_Q(\theta)$. The following result provides an explicit solution for the rather large class of problems where the conditions which define $Q$ may be expressed as a collection of expected value restrictions.

Proposition 5.26. (Explicit form of restricted reference priors). Let $e$ be an experiment which provides information about $\theta$, and, for given $\{(g_i(\cdot), \beta_i)\}_{i=1}^{m}$, let $Q$ be the class of prior distributions $p(\theta)$ of $\theta$ which satisfy

$$\int g_i(\theta)\, p(\theta)\, d\theta = \beta_i, \quad i = 1, \ldots, m.$$

Let $\pi(\theta)$ be an unrestricted reference prior for $\theta$ relative to $e$. Then, if it exists, a $Q$-reference prior of $\theta$ relative to $e$ is of the form

$$\pi_Q(\theta) \propto \pi(\theta) \exp\left\{ \sum_{i=1}^{m} \lambda_i\, g_i(\theta) \right\},$$

where the $\lambda_i$ are constants determined by the conditions which define $Q$.

Proof. The calculus of variations argument which underlay the derivation of reference priors may be extended to include the additional restrictions imposed by the definition of $Q$, thus leading us to seek an extremal of the functional, corresponding to the assumption of a $k$-fold replicate of $e$,

$$\int p(\theta) \log \frac{f_k(\theta)}{p(\theta)}\, d\theta + \lambda_0 \left\{ \int p(\theta)\, d\theta - 1 \right\} + \sum_{i=1}^{m} \lambda_i \left\{ \int g_i(\theta)\, p(\theta)\, d\theta - \beta_i \right\}.$$

A standard argument now shows that the solution must satisfy

$$\log f_k(\theta) - \log p(\theta) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i\, g_i(\theta) = 0,$$

and hence that

$$p(\theta) \propto f_k(\theta) \exp\left\{ \sum_{i=1}^{m} \lambda_i\, g_i(\theta) \right\}.$$

Taking $k \to \infty$, the result follows.
320  5 Inference

Example 5.15. (Location models). Let $x = \{x_1, \ldots, x_n\}$ be a random sample from a location model $p(x \mid \theta) = h(x - \theta)$, $x \in \Re$, $\theta \in \Re$, and suppose that the prior mean and variance of $\theta$ are restricted to be $E[\theta] = \mu$, $V[\theta] = \sigma^2$. Under suitable regularity conditions, the asymptotic posterior distribution of $\theta$ will be of the form $p^*(\theta \mid x_1, \ldots, x_n) \propto f(\hat{\theta}_n - \theta)$, where $\hat{\theta}_n$ is an asymptotically sufficient, consistent estimator of $\theta$, so that, by Proposition 5.23, the unrestricted reference prior will be $\pi(\theta) \propto f(0)$, which is constant. It now follows from Proposition 5.26 that the restricted reference prior will be of the form

$$\pi_Q(\theta) \propto \exp\left\{ \lambda_1 \theta + \lambda_2 (\theta - \mu)^2 \right\},$$

with $\lambda_1, \lambda_2$ determined by the conditions $\int \theta\, \pi_Q(\theta)\, d\theta = \mu$ and $\int (\theta - \mu)^2\, \pi_Q(\theta)\, d\theta = \sigma^2$. Thus, the restricted reference prior is the normal distribution with the specified mean and variance.

5.4.4 Nuisance Parameters

The development given thus far has assumed that $\theta$ was one-dimensional and that interest was centred on $\theta$, or on a one-to-one transformation of $\theta$. We shall next consider the case where $\theta$ is two-dimensional and interest centres on reporting inferences for a one-dimensional function, $\phi = \phi(\theta)$. Without loss of generality, we may rewrite the vector parameter in the form $\theta = (\phi, \lambda)$, $\phi \in \Phi$, $\lambda \in \Lambda$, where $\phi$ is the parameter of interest and $\lambda$ is a nuisance parameter. The problem is to identify a reference prior for $\theta$, relative to the model $p(x \mid \phi, \lambda)$, when the decision problem is that of reporting marginal inferences for $\phi$, assuming a logarithmic score (utility) function.

To motivate our approach to this problem, recall that $p(\theta) = p(\phi, \lambda)$ can be thought of in terms of the decomposition

$$p(\phi, \lambda) = p(\lambda \mid \phi)\, p(\phi).$$

Suppose, for the moment, that a suitable reference form, $\pi(\lambda \mid \phi)$, for $p(\lambda \mid \phi)$ has been specified, and that only $\pi(\phi)$ remains to be identified.
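Before proceeding, a quick numerical check of Example 5.15: the restricted reference prior there minimises divergence from a uniform unrestricted prior subject to mean and variance constraints, i.e. it is the maximum-entropy distribution with those moments. The sketch below (ours; it relies on SciPy's closed-form differential entropies) confirms that, at a fixed variance, the normal beats two natural competitors:

```python
import numpy as np
from scipy import stats

v = 2.0   # common variance; the mean does not affect differential entropy

normal  = stats.norm(scale=np.sqrt(v)).entropy()
uniform = stats.uniform(scale=np.sqrt(12 * v)).entropy()   # var (b-a)^2/12 = v
laplace = stats.laplace(scale=np.sqrt(v / 2)).entropy()    # var 2 b^2 = v

# Among distributions with fixed variance, the normal has maximal entropy,
# consistent with the normal form of the restricted reference prior.
print(normal, uniform, laplace)
```

This is only a spot check against two families, of course; the general statement is the classical maximum-entropy characterisation of the normal.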
Consider $z_k$ to be the result of a $k$-fold replicate of the experiment which consists in obtaining a single observation from $p(x \mid \phi, \lambda)$, so that

$$p(z_k \mid \phi, \lambda) = \prod_{i=1}^{k} p(x_i \mid \phi, \lambda).$$

By conditioning throughout on $\phi$, we see from Proposition 5.18 that the "conditional reference prior" for $\lambda$ given $\phi$ has the form

$$\pi(\lambda \mid \phi) \propto \lim_{k \to \infty} \frac{f_k(\lambda \mid \phi)}{f_k(\lambda_0 \mid \phi)}, \qquad f_k(\lambda \mid \phi) = \exp\left\{ \int p(z_k \mid \phi, \lambda) \log p^*(\lambda \mid \phi, z_k)\, dz_k \right\},$$

where $p^*(\lambda \mid \phi, z_k)$ is an asymptotic approximation to the conditional posterior for $\lambda$ given $\phi$. Given $\pi(\lambda \mid \phi)$, we may integrate out the nuisance parameter to obtain the one-parameter model

$$p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\, \pi(\lambda \mid \phi)\, d\lambda,$$

and then apply Proposition 5.18 again to obtain the "marginal reference prior" for $\phi$,

$$\pi(\phi) \propto \lim_{k \to \infty} \frac{f_k(\phi)}{f_k(\phi_0)}, \qquad f_k(\phi) = \exp\left\{ \int p(z_k \mid \phi) \log p^*(\phi \mid z_k)\, dz_k \right\},$$

where $p^*(\phi \mid z_k)$ is an asymptotic approximation to the marginal posterior for $\phi$. Given actual data $x$, the marginal reference posterior for $\phi$, corresponding to the reference prior $\pi(\theta) = \pi(\phi, \lambda) = \pi(\phi)\, \pi(\lambda \mid \phi)$ derived from the above procedure, would then be

$$\pi(\phi \mid x) \propto \pi(\phi) \int p(x \mid \phi, \lambda)\, \pi(\lambda \mid \phi)\, d\lambda.$$

This would appear, then, to provide a straightforward approach to deriving reference analysis procedures in the presence of nuisance parameters. However, there is a major difficulty. As we have already seen, reference priors are typically not proper probability densities. This means that the integrated form

$$p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\, \pi(\lambda \mid \phi)\, d\lambda,$$

which plays a key role in the above derivation of $\pi(\phi)$, will typically not be a proper probability model. The above approach will fail in such cases. In general, a more subtle approach is required to overcome this technical problem. However, before turning to the details of such an approach, we present an example, involving finite parameter ranges, where the approach outlined above does produce an interesting solution.
Example 5.16. (Induction). Consider a large, finite, dichotomised population, all of whose elements individually may or may not have a specified property. A random sample is taken without replacement, the sample being large in absolute size, but still relatively small compared with the population size. All the elements sampled turn out to have the specified property. Many commentators have argued that, in view of the large absolute size of the sample, one should be led to believe quite strongly that all elements of the population have the property, irrespective of the fact that the population size is greater still. (See, for example, Wrinch and Jeffreys, 1921; Jeffreys, 1939/1961, pp. 128–132; and Geisser, 1980a.) Let us denote the population size by $N$, the sample size by $n$, the observed number of elements having the property by $x$, and the actual number of elements in the population having the property by $\theta$. The probability model for the sampling mechanism is then the hypergeometric,

$$p(x \mid \theta) = \binom{\theta}{x} \binom{N - \theta}{n - x} \bigg/ \binom{N}{n},$$

for the possible values of $x$. Suppose we considered $\theta$ to be the parameter of interest, and wished to provide a reference analysis. Then, since the set of possible values for $\theta$ is finite, Proposition 5.19 implies that

$$\pi(\theta = r) = \frac{1}{N + 1}, \quad r = 0, 1, \ldots, N,$$

is a reference prior. Straightforward calculation then establishes that, having observed $x = n$, the posterior probability that $\theta = N$ is

$$p(\theta = N \mid x = n) = \frac{n + 1}{N + 1},$$

which is not close to unity when $n$ is large but $n/N$ is small. However, careful consideration of the problem suggests that it is not $\theta$ which is the parameter of interest: rather, it is

$$\phi = \begin{cases} 1 & \text{if } \theta = N, \\ 0 & \text{if } \theta \neq N. \end{cases}$$

To obtain a representation of $\theta$ in the form $(\phi, \lambda)$, let us define the nuisance parameter

$$\lambda = \begin{cases} 1 & \text{if } \theta = N, \\ \theta & \text{if } \theta \neq N. \end{cases}$$
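The contrast between the two analyses of this example can be reproduced by direct enumeration. A sketch (ours, with illustrative population and sample sizes; the "reference" prior below is the one induced by the $(\phi, \lambda)$ parametrisation, as derived in the text):

```python
from math import comb

def prob_all(N, n, prior):
    """Posterior probability that theta = N after drawing n elements without
    replacement and finding that all have the property; prior is a list
    with prior[r] = p(theta = r)."""
    lik = [comb(r, n) / comb(N, n) for r in range(N + 1)]  # P(x = n | theta = r)
    norm = sum(l * p for l, p in zip(lik, prior))
    return lik[N] * prior[N] / norm

N, n = 1000, 50
uniform   = [1.0 / (N + 1)] * (N + 1)   # reference prior for theta itself
reference = [0.5 / N] * N + [0.5]       # prior induced by the (phi, lambda) analysis

print(prob_all(N, n, uniform))     # (n+1)/(N+1): small despite the evidence
print(prob_all(N, n, reference))   # close to (n+1)/(n+2): strong induction
```

Note that `math.comb(r, n)` returns zero when $r < n$, so values of $\theta$ incompatible with an all-positive sample drop out automatically.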
By Proposition 5.19, the reference priors $\pi(\phi)$ and $\pi(\lambda \mid \phi)$ are both uniform over the appropriate ranges, and are given by

$$\pi(\phi = 0) = \pi(\phi = 1) = \tfrac{1}{2}, \qquad \pi(\lambda = 1 \mid \phi = 1) = 1, \qquad \pi(\lambda = r \mid \phi = 0) = \frac{1}{N}, \quad r = 0, 1, \ldots, N - 1.$$

These imply a reference prior for $\theta$ of the form

$$\pi(\theta = N) = \tfrac{1}{2}, \qquad \pi(\theta = r) = \frac{1}{2N}, \quad r = 0, 1, \ldots, N - 1,$$

and straightforward calculation establishes that, having observed $x = n$,

$$p(\theta = N \mid x = n) \simeq \frac{n + 1}{n + 2}, \quad \text{for large } N,$$

which clearly displays the irrelevance of the sampling fraction and the approach to unity for large $n$ (see Bernardo, 1985b, for further discussion).

We return now to the general problem of defining a reference prior for $\theta = (\phi, \lambda)$, $\phi \in \Phi$, $\lambda \in \Lambda$, where $\phi$ is the parameter of interest and $\lambda$ is a nuisance parameter. We recall that the problem arises because, in order to obtain the marginal reference prior $\pi(\phi)$ for the first parameter, we need to work with the integrated model

$$p(z_k \mid \phi) = \int p(z_k \mid \phi, \lambda)\, \pi(\lambda \mid \phi)\, d\lambda.$$

However, this will only be a proper model if the conditional prior $\pi(\lambda \mid \phi)$ for the second parameter is a proper probability density and, typically, this will not be the case. This suggests the following strategy: identify an increasing sequence $\{\Lambda_i\}$ of subsets of $\Lambda$, such that $\bigcup_i \Lambda_i = \Lambda$, on each of which $\pi(\lambda \mid \phi)$ can be normalised to give a conditional reference prior, $\pi_i(\lambda \mid \phi)$, restricted to $\Lambda_i$, which is proper. For each $i$, a proper integrated model can then be obtained, and a marginal reference prior $\pi_i(\phi)$ identified. The required reference prior $\pi(\phi, \lambda)$ is then obtained by taking the limit as $i \to \infty$. The strategy clearly requires a choice of the $\Lambda_i$'s to be made, a sequence which may depend on $\phi$; but, in any specific problem, a "natural" sequence usually suggests itself. We shall refer to the pair $(\phi, \lambda)$ as an ordered parametrisation of the model. We formalise this procedure in the next definition.
324  5 Inference

Definition 5.9. (Reference distributions given a nuisance parameter). Let $x$ be the result of an experiment $e$ which consists of one observation from the probability model $p(x \mid \phi, \lambda)$, $x \in X$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$. The reference posterior, $\pi(\phi \mid x)$, for the parameter of interest $\phi$, relative to the experiment $e$ and to the increasing sequences of subsets of $\Lambda$, $\{\Lambda_i(\phi)\}$, $\phi \in \Phi$, with $\bigcup_i \Lambda_i(\phi) = \Lambda$, is defined to be the result of the following procedure:

(i) applying Definition 5.7 to the model $p(x \mid \phi, \lambda)$, for fixed $\phi$, obtain the conditional reference prior, $\pi(\lambda \mid \phi)$;

(ii) for each $\phi$, normalise $\pi(\lambda \mid \phi)$ within each $\Lambda_i(\phi)$ to obtain a sequence of proper priors, $\pi_i(\lambda \mid \phi)$;

(iii) use these to obtain a sequence of integrated models

$$p_i(x \mid \phi) = \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\, \pi_i(\lambda \mid \phi)\, d\lambda;$$

(iv) use these to derive the sequence of reference priors $\pi_i(\phi)$;

(v) define $\pi(\phi \mid x)$ such that, for almost all $x$,

$$\pi(\phi \mid x) = \lim_{i \to \infty} \pi_i(\phi \mid x), \qquad \pi_i(\phi \mid x) \propto \pi_i(\phi)\, p_i(x \mid \phi).$$

The reference prior, relative to the ordered parametrisation $(\phi, \lambda)$, is any positive function $\pi(\phi, \lambda)$ such that

$$\pi(\phi \mid x) \propto \int_{\Lambda} p(x \mid \phi, \lambda)\, \pi(\phi, \lambda)\, d\lambda.$$

This will typically be simply obtained as the limit of the normalised products $\pi_i(\phi)\, \pi_i(\lambda \mid \phi)$.

Ghosh and Mukerjee (1992) showed that, in effect, the reference prior thus defined maximises the missing information about the parameter of interest, $\phi$, subject to the condition that, for given $\phi$, the missing information about the nuisance parameter, $\lambda$, is maximised.

In a model involving a parameter of interest and a nuisance parameter, the form chosen for the latter is, of course, arbitrary. Thus, $p(x \mid \phi, \lambda)$ can be written alternatively as $p(x \mid \phi, \psi)$, for any $\psi = \psi(\phi, \lambda)$ for which the transformation $(\phi, \lambda) \to (\phi, \psi)$ is one-to-one. Intuitively, we would hope that the reference posterior for $\phi$ derived according to Definition 5.9 would not depend on the particular form chosen for the nuisance parameter. The following proposition establishes that this is the case.

Proposition 5.27. (Invariance with respect to the choice of the nuisance parameter). Let $e$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \lambda)$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$, and let $e'$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \psi)$, $(\phi, \psi) \in \Phi \times \Psi \subseteq \Re \times \Re$, where, for each $\phi$, $\psi = g_\phi(\lambda)$ is one-to-one. Then the reference posteriors for $\phi$, relative to $[e, \{\Lambda_i(\phi)\}]$ and $[e', \{\Psi_i(\phi)\}]$, where $\Psi_i(\phi) = g_\phi\{\Lambda_i(\phi)\}$, are identical.

Proof. By Proposition 5.22, for given $\phi$, the conditional reference priors are related by the appropriate Jacobian element, and, if we normalise $\pi(\lambda \mid \phi)$ over $\Lambda_i(\phi)$ and $\pi(\psi \mid \phi)$ over $\Psi_i(\phi)$, we see that the normalised forms are consistently related by the same Jacobian element. Hence, for each $i$, the integrated models used in steps (iii) and (iv) of Definition 5.9 are identical,

$$p_i(x \mid \phi) = \int_{\Lambda_i(\phi)} p(x \mid \phi, \lambda)\, \pi_i(\lambda \mid \phi)\, d\lambda = \int_{\Psi_i(\phi)} p(x \mid \phi, \psi)\, \pi_i(\psi \mid \phi)\, d\psi,$$

and hence the procedure will lead to identical forms of $\pi(\phi \mid x)$.

Alternatively, we may wish to consider retaining the same form of nuisance parameter, but redefining the parameter of interest to be a one-to-one function of $\phi$. Intuitively, we would hope that the reference posterior for the new parameter of interest would be consistently related to that of $\phi$ by means of the appropriate Jacobian element. The next proposition establishes that this is the case.

Proposition 5.28. (Invariance under one-to-one transformations). Let $e$ be an experiment which consists in obtaining one observation from $p(x \mid \phi, \lambda)$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$, and let $e'$ be an experiment which consists in obtaining one observation from $p(x \mid \gamma, \lambda)$, $\gamma \in \Gamma$, $\lambda \in \Lambda$, where $\gamma = g(\phi)$ is one-to-one. Then, given data $x$, the reference posteriors for $\phi$ and $\gamma$, relative to $[e, \{\Lambda_i(\phi)\}]$ and $[e', \{\Lambda_i(\gamma)\}]$, where $\Lambda_i(\gamma) = \Lambda_i\{g^{-1}(\gamma)\}$, are related by:

(i) $\pi_\gamma(\gamma \mid x) = \pi_\phi\left(g^{-1}(\gamma) \mid x\right)$, if $\Phi$ is discrete;

(ii) $\pi_\gamma(\gamma \mid x) = \pi_\phi\left(g^{-1}(\gamma) \mid x\right) \left| J_g(\gamma) \right|$, otherwise, where $J_g(\gamma) = d\,g^{-1}(\gamma)/d\gamma$.

Proof. If $\Phi$ is discrete, step (i) of Definition 5.9 clearly results in a conditional reference prior $\pi(\lambda \mid \gamma) = \pi\left(\lambda \mid g^{-1}(\gamma)\right)$ and, by Proposition 5.19, the priors $\pi_i(\phi)$ and $\pi_i(\gamma)$ defined by steps (ii)–(iv) are both uniform distributions, so that the result follows immediately. In all other cases, by Proposition 5.22, the priors $\pi_i(\phi)$ and $\pi_i(\gamma)$ defined by steps (ii)–(iv) of Definition 5.9 are related by the claimed Jacobian element, and the result follows straightforwardly.

In Proposition 5.23, we saw that the identification of explicit forms of reference prior can be greatly simplified if the approximate asymptotic posterior distribution is of the form $p^*(\theta \mid z_k) = p^*(\theta \mid \hat{\theta}_k)$, where $\hat{\theta}_k$ is an asymptotically sufficient, consistent estimate of $\theta$. Proposition 5.24 establishes that even greater simplification results when the asymptotic distribution is normal. We shall now extend this to the nuisance parameter case.

Proposition 5.29. (Bivariate reference priors under asymptotic normality). Let $e_n$ be the experiment which consists of the observation of a random sample $x_1, \ldots, x_n$ from $p(x \mid \phi, \lambda)$, $(\phi, \lambda) \in \Phi \times \Lambda \subseteq \Re \times \Re$, and let $\{\Lambda_i(\phi)\}$ be suitably defined sequences of subsets of $\Lambda$. Suppose that the joint asymptotic posterior distribution of $(\phi, \lambda)$, given a $k$-fold replicate of $e_n$, is multivariate normal with precision matrix $kn\, H(\hat{\phi}_{kn}, \hat{\lambda}_{kn})$, where $(\hat{\phi}_{kn}, \hat{\lambda}_{kn})$ is a consistent estimate of $(\phi, \lambda)$, and write

$$H(\phi, \lambda) = \begin{pmatrix} h_{11}(\phi, \lambda) & h_{12}(\phi, \lambda) \\ h_{21}(\phi, \lambda) & h_{22}(\phi, \lambda) \end{pmatrix}.$$
5. the integrand has the form for large k . so that . Then 327 = . Marginally.29 then follows from Proposition 5. The first part of Proposition 5. where. A. derive the form of x t ( $ ) . is the partition of H corresponding to &. 4 v 2 : deJinea reference prior relative to the ordered parametrisation ($. with n. the asymptotic conditional distribution of A is normal with precision knhlZ($kn. j n(A 14) 1.. A).ik. Given 4. {h22(4.2. where ho = (hll .24. we note that To if zk E 2 denotes the result of a kfold replication of e.2. the asymptotic distribution of $ is univariate normal with precision knh.(A 14) denoting the normalised version of x(A 14) over A l ( $ ) .).h12hG1h21).).4 Reference Analysis hi.). i = 1. ik. where dA} .. Proof. (&.
Corollary. Suppose that, under the conditions of Proposition 5.29, the asymptotic precisions factorise into products of separate functions of φ and λ, so that

  h₂₂^{1/2}(φ, λ) = f₂(φ) g₂(λ),  h₀^{1/2}(φ, λ) = f₁(φ) g₁(λ),

and suppose also that the subsets {Λ_i} do not depend on φ. Then a reference prior relative to the ordered parametrisation (φ, λ) is

  π(φ, λ) ∝ f₁(φ) g₂(λ).

Proof. By Proposition 5.29, π(λ | φ) ∝ f₂(φ) g₂(λ) ∝ g₂(λ). It then follows that, writing

  a_i = ∫_{Λ_i} g₂(λ) dλ,  b_i = a_i⁻¹ ∫_{Λ_i} g₂(λ) log g₁(λ) dλ,

which do not depend on φ,

  π_i(φ) ∝ exp { ∫_{Λ_i} a_i⁻¹ g₂(λ) log [f₁(φ) g₁(λ)] dλ } = f₁(φ) exp{b_i},

and the result easily follows.

In many cases, the forms of h₂₂ and h₀ do factorise in this way, and the subsets {Λ_i} do not depend on φ; in such cases, the reference prior takes on a very simple form. Otherwise, we choose a suitable increasing sequence of subsets {Λ_i(φ)} of Λ, over which the conditional reference prior can be normalised, and apply Proposition 5.29 directly.

Example 5.17. (Normal mean and standard deviation). Let e_n be the experiment which consists in the observation of a random sample x = {x₁, ..., x_n} from a normal distribution, with both mean, μ, and standard deviation, σ, unknown. We shall first obtain a reference analysis for μ, taking σ to be the nuisance parameter.
Since the distribution belongs to the exponential family, asymptotic normality obtains and the results of Proposition 5.29 can be applied. We therefore first obtain the Fisher (expected) information matrix, whose elements, we recall, are given as a function of θ = (μ, σ) by

  H(μ, σ) = ( σ⁻²   0   )
            ( 0    2σ⁻² ),

from which it is easily verified that h₀^{1/2}(μ, σ) = σ⁻¹ and h₂₂^{1/2}(μ, σ) = √2 σ⁻¹, which factorise as required; moreover, Λ_i = {σ; e⁻ⁱ ≤ σ ≤ eⁱ}, i = 1, 2, ..., provides a suitable sequence of subsets of Λ = ℜ⁺ not depending on μ, over which π(σ | μ) can be normalised, so that the corollary to Proposition 5.29 can be applied. It follows that

  π(μ, σ) = π(μ) π(σ | μ) ∝ 1 × σ⁻¹

provides a reference prior relative to the ordered parametrisation (μ, σ). The corresponding reference posterior for μ, given x, is

  π(μ | x) ∝ ∫₀^∞ σ⁻ⁿ exp { −n[s² + (μ − x̄)²] / 2σ² } σ⁻¹ dσ
          ∝ [s² + (μ − x̄)²]^{−n/2},

so that π(μ | x) = St(μ | x̄, (n − 1)s⁻², n − 1), where ns² = Σ(xᵢ − x̄)².

If we now reverse the roles of μ and σ, so that the latter is now the parameter of interest and μ is the nuisance parameter, we obtain, by a similar analysis to the above, writing φ = (σ, μ),

  π(σ, μ) = π(σ) π(μ | σ) ∝ σ⁻¹ × 1,

where, for example, Λ_i = {μ; −eⁱ ≤ μ ≤ eⁱ}, i = 1, 2, ..., provides a suitable sequence of subsets of Λ = ℜ not depending on σ, over which π(μ | σ) can be normalised, and the corollary to Proposition 5.29 can again be applied. The corresponding reference posterior for σ, given x, is

  π(σ | x) ∝ π(σ) ∫ p(x | μ, σ) π(μ | σ) dμ ∝ σ⁻ⁿ exp { −ns² / 2σ² },

and, changing the variable to λ = σ⁻², this implies that

  π(λ | x) = Ga(λ | ½(n − 1), ½ns²)  or, alternatively,  π(ns²λ | x) = χ²(ns²λ | n − 1).

One feature of the above example is that the reference prior did not, in fact, depend on which of the parameters was taken to be the parameter of interest. In the following example the form does change when the parameter of interest changes.

Example 5.18. (Standardised normal mean). We consider the same situation as that of Example 5.17, but we now take φ = μ/σ to be the parameter of interest. If σ is taken as the nuisance parameter (by Proposition 5.27, the choice is irrelevant), φ = (φ, σ) = g(μ, σ), with g(μ, σ) = (μ/σ, σ), is clearly a one-to-one transformation, and the Fisher information matrix in the new parametrisation is readily obtained from H(μ, σ) and the Jacobian of the inverse transformation.
Again, asymptotic normality obtains and, for example, the sequence Λ_i = {σ; e⁻ⁱ ≤ σ ≤ eⁱ}, i = 1, 2, ..., provides a reasonable basis for applying the corollary to Proposition 5.29. It is easily seen that this leads to the reference prior, relative to the ordered parametrisation (φ, σ),

  π(φ, σ) ∝ (2 + φ²)^{−1/2} σ⁻¹.

In the (μ, σ) parametrisation this corresponds to

  π(μ, σ) ∝ (2 + μ²/σ²)^{−1/2} σ⁻²,

which is clearly different from the form obtained in Example 5.17. Further discussion of this example will be provided in Example 5.26 of Section 5.6.2.

We conclude this subsection by considering a rather more involved example, where a natural choice of the required Λ_i(φ) subsequence does depend on φ. In this case, we use Proposition 5.29 directly, since its corollary does not apply.

Example 5.19. (Product of normal means). Consider the case where independent random samples x = {x₁, ..., x_n} and y = {y₁, ..., y_m} are to be taken, respectively, from N(x | α, 1) and N(y | β, 1), α > 0, β > 0, so that the complete parametric model is

  p(x, y | α, β) = ∏_{i=1}^{n} N(xᵢ | α, 1) ∏_{j=1}^{m} N(yⱼ | β, 1),

for which, writing θ = (α, β), the Fisher information matrix is easily seen to be

  H(α, β) = ( n  0 )
            ( 0  m ).

Suppose now that we make the one-to-one transformation φ = (φ, λ) = (αβ, α/β) = g(α, β), so that φ = αβ is taken to be the parameter of interest and λ = α/β is taken to be the nuisance parameter. Such a parameter of interest arises, for example, when inference about the area of a rectangle is required from data consisting of measurements of its sides. The Jacobian of the inverse transformation, α = (φλ)^{1/2}, β = (φ/λ)^{1/2}, is readily obtained, and hence, after some manipulation, so are the quantities required by Proposition 5.29. A natural increasing sequence of subsets of the original parameter space, for (α, β), would be the sets

  S_i = {(α, β); 0 < α < i, 0 < β < i},  i = 1, 2, ...,

which transform, in the space of λ ∈ Λ, into a sequence of subsets Λ_i(φ). We note that, unlike in the previous cases we have considered, this sequence does depend on φ. To complete the analysis, it can be shown, after some manipulation, that, for large i, this leads to a reference prior relative to the ordered parametrisation (φ, λ).
In the original parametrisation, the resulting form depends on the sample sizes through the ratio m/n and reduces, in the case n = m, to

  π(α, β) ∝ (α² + β²)^{1/2},

a form originally proposed for this problem in an unpublished 1982 Stanford University technical report by Stein, who showed that it provides approximate agreement between Bayesian credible regions and classical confidence intervals for φ. For a detailed discussion of this example, and of the consequences of choosing a different sequence Λ_i(φ), see Berger and Bernardo (1989).

We note that the preceding example serves to illustrate the fact that reference priors may depend explicitly on the sample sizes defined by the experiment. There is, of course, nothing paradoxical in this, since the underlying notion of a reference analysis is a "minimally informative" prior relative to the actual experiment to be performed.

5.4.5 Multiparameter Problems

The approach to the nuisance parameter case considered above was based on the use of an ordered parametrisation, whose first and second components were (φ, λ), referred to, respectively, as the parameter of interest and the nuisance parameter. The reference prior for the ordered parametrisation (φ, λ) was then constructed by successive conditioning, to give the form π(φ, λ) = π(λ | φ) π(φ). When the model parameter vector θ has more than two components, this successive conditioning idea can obviously be extended by considering θ as an ordered parametrisation, (θ₁, ..., θ_m), and generating, by successive conditioning, a reference prior, relative to this ordered parametrisation, of the form

  π(θ) = π(θ_m | θ₁, ..., θ_{m−1}) × ⋯ × π(θ₂ | θ₁) π(θ₁).

In order to describe the algorithm for producing this successively conditioned form in the standard, regular case, we shall first need to introduce some notation. Assuming the parametric model p(x | θ), θ ∈ Θ, we take Θ = Θ₁ × ⋯ × Θ_m, with θ_i ∈ Θ_i, and the Fisher information matrix H(θ) to be of full rank. We define S(θ) = H⁻¹(θ), denote by S_j(θ) the corresponding upper left j × j submatrix of S(θ), and by h_j(θ) the lower right element of S_j⁻¹(θ). Finally, we define the component vectors θ^[j] = (θ₁, ..., θ_j) and θ^[−j] = (θ_{j+1}, ..., θ_m), and take {Θ^l}, l = 1, 2, ..., to be an increasing sequence of compact subsets of Θ.
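The S_j(θ) and h_j(θ) notation can be made concrete with a short numerical sketch; the code below is our illustration (the function name h_terms is ours, not the book's):

```python
import numpy as np

def h_terms(H):
    # S = H^{-1}; S_j = upper-left j x j submatrix of S;
    # h_j = lower-right element of S_j^{-1}.
    S = np.linalg.inv(H)
    m = H.shape[0]
    return [np.linalg.inv(S[:j, :j])[j - 1, j - 1] for j in range(1, m + 1)]

# When H is block diagonal (mutually orthogonal parameters),
# each h_j reduces to the j-th diagonal element of H:
print(h_terms(np.diag([2.0, 3.0, 5.0])))
```

For m = 2, h₁ works out to h₁₁ − h₁₂h₂₂⁻¹h₂₁, recovering the h₀ of Proposition 5.29, while h_m always equals the last diagonal element of H.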
Proposition 5.30. (Ordered reference priors under asymptotic normality). With the above notation, and under regularity conditions extending those of Proposition 5.29 in an obvious way, a reference prior π(θ), relative to the ordered parametrisation (θ₁, ..., θ_m), is given by

  π(θ) ∝ ∏_{j=1}^{m} w_j(θ),

where the w_j(θ) are defined by the following recursion:

(i) for j = m, w_m(θ) is proportional to h_m^{1/2}(θ), normalised over the relevant compact subset;

(ii) for j = m − 1, m − 2, ..., 1, w_j(θ) is proportional to the exponential of the expectation of log h_j^{1/2}(θ), the expectation being taken with respect to the already defined conditional reference priors for θ_{j+1}, ..., θ_m, given θ₁, ..., θ_j.

Proof. This follows closely the development given in Proposition 5.29, in an obvious way.

The derivation of the ordered reference prior is greatly simplified if the {h_j(θ)} terms in the above depend only on θ^[j], and even greater simplification obtains if H(θ) is block diagonal, particularly if, in this latter case, the jth term can be factored into a product of a function of θ_j and a function not depending on θ_j. For details, see Berger and Bernardo (1992a, 1992b, 1992c).

Corollary. If H(θ) is block diagonal (i.e., θ₁, ..., θ_m are mutually orthogonal), with

  h_j^{1/2}(θ) = f_j(θ_j) g_j(θ),  j = 1, ..., m,

where g_j(θ) does not depend on θ_j, and if the Θ_j^l's do not depend on θ, then

  π(θ) ∝ ∏_{j=1}^{m} f_j(θ_j).

Proof. Clearly, if H(θ) is block diagonal then h_j(θ) is simply the jth diagonal element of H(θ), and the results follow from the recursion of Proposition 5.30.

The question obviously arises as to the appropriate ordering to be adopted in any specific problem. At present, no formal theory exists to guide such a choice, but experience with a wide range of examples suggests that, at least for non-hierarchical models (see Section 4.6.5), where the parameters may have special forms of interrelationship, the best procedure is to order the components of θ on the basis of their inferential interest.

Example 5.20. (Reference analysis for m normal means). Let e_n be an experiment which consists in obtaining {x₁, ..., x_n}, a random sample from the multivariate normal model N_m(x | μ, τ I_m), n ≥ 2, for which the Fisher information matrix is easily seen to be

  H(μ₁, ..., μ_m, τ) = diag(τ, ..., τ, ½mτ⁻²).

It follows from Proposition 5.30 that, since the parameters are all mutually orthogonal, the reference prior relative to the natural parametrisation (μ₁, ..., μ_m, τ) is

  π(μ₁, ..., μ_m, τ) ∝ τ⁻¹,

and, in fact, in this example the result does not depend on the order in which the parametrisation is taken. The reference prior, which corresponds to π(μ₁, ..., μ_m, σ) ∝ σ⁻¹ if we parametrise in terms of σ = τ^{−1/2}, is thus the appropriate reference form if we are interested in any of the individual parameters. The reference posterior for any μ_j is easily shown to be the Student density

  π(μ_j | x₁, ..., x_n) = St(μ_j | x̄_j, nm(n − 1)q⁻¹, m(n − 1)),  q = Σ_{i=1}^{n} Σ_{j=1}^{m} (x_{ij} − x̄_j)²,

which agrees with the standard argument according to which one degree of freedom should be lost by each of the unknown means.
Example 5.21. (Multinomial model). Let x = {r₁, ..., r_m} be an observation from a multinomial distribution (see Section 3.2.2), so that

  p(r₁, ..., r_m | θ₁, ..., θ_m) = [ n! / (r₁! ⋯ r_m! (n − Σrᵢ)!) ] θ₁^{r₁} ⋯ θ_m^{r_m} (1 − Σθᵢ)^{n − Σrᵢ},  0 ≤ Σθᵢ ≤ 1.

Noting the form of H(θ₁, ..., θ_m), we see that the conditional asymptotic precisions used in Proposition 5.29 are easily identified. In this case, the conditional reference priors so derived turn out to be proper, and there is no need to consider subset sequences {Θ^l}. The required reference prior relative to the ordered parametrisation (θ₁, ..., θ_m) is then given by

  π(θ₁, ..., θ_m) = π(θ_m | θ₁, ..., θ_{m−1}) × ⋯ × π(θ₁) ∝ ∏_{j=1}^{m} θ_j^{−1/2} (1 − θ₁ − ⋯ − θ_j)^{−1/2},

and the corresponding reference posterior for θ₁,

  π(θ₁ | r₁, ..., r_m) ∝ ∫ ⋯ ∫ p(r₁, ..., r_m | θ₁, ..., θ_m) π(θ₁, ..., θ_m) dθ₂ ⋯ dθ_m,

is obtained by integrating out the remaining parameters.
After some algebra, it can be shown that this leads to

  π(θ₁ | r₁, ..., r_m) = Be(θ₁ | r₁ + ½, n − r₁ + ½),

which, as one could expect, coincides with the reference posterior which would have been obtained had we initially collapsed the multinomial analysis to a binomial model and then carried out a reference analysis for the latter. Clearly, by symmetry considerations, the above analysis applies to any θ_j, after appropriate changes in labelling, and it is independent of the particular order in which the parameters are taken. For a detailed discussion of this example, see Berger and Bernardo (1992a). Further comments on the ordering of parameters are given in Section 5.6.2.

Example 5.22. (Normal correlation coefficient). Let {x₁, ..., x_n} be a random sample from a bivariate normal distribution, N(x | μ, τ), where

  μ = (μ₁, μ₂),  τ⁻¹ = ( σ₁²     ρσ₁σ₂ )
                       ( ρσ₁σ₂   σ₂²   ).

Suppose that the correlation coefficient ρ is the parameter of interest, and consider the ordered parametrisation {ρ, μ₁, μ₂, σ₁, σ₂}. It is easily seen that the conditional asymptotic precisions required by Proposition 5.30 are readily identified from the corresponding Fisher information matrix.
After some algebra it can be shown that this leads to the reference prior

  π(ρ, μ₁, μ₂, σ₁, σ₂) ∝ (1 − ρ²)⁻¹ σ₁⁻¹ σ₂⁻¹,

whatever ordering of the nuisance parameters μ₁, μ₂, σ₁, σ₂ is taken. The corresponding reference posterior distribution for ρ,

  π(ρ | x₁, ..., x_n) ∝ π(ρ) p(r | ρ)

(whose closed form involves F, the hypergeometric function), only depends on the data through the sample correlation coefficient r, whose sampling distribution only depends on ρ, as one could expect from Fisher's (1915) original analysis. This agrees with Lindley's (1965, p. 219) analysis. For a detailed analysis of this example, see Bayarri (1981); see, also, Hills (1987), Ye and Berger (1991) and Berger and Bernardo (1992b) for derivations of the reference distributions for a variety of other interesting models, including those arising in the context of capture-recapture problems.

Infinite discrete parameter spaces

The infinite discrete case presents special problems, due to the non-existence of an asymptotic theory comparable to that of the continuous case. It is, however, often possible to obtain an approximate reference posterior by embedding the discrete parameter space within a continuous one.

Example 5.23. (Infinite discrete case). Suppose it is of interest to make inferences about an integer θ ∈ {1, 2, ...} on the basis of a random sample z = {x₁, ..., x_n} from the corresponding parametric model p(x | θ). For several plausible "diffuse looking" prior distributions for θ one finds that the corresponding posterior virtually ignores the data. Intuitively, this has to be interpreted as suggesting that such priors actually contain a large amount of information about θ compared with that provided by the data. A more careful approach to providing a "non-informative" prior is clearly required. One possibility would be to embed the discrete space {1, 2, ...} in the continuous space ]0, ∞[ since, for each θ > 0, p(x | θ) is still a probability density for x. Then, using Proposition 5.24, the appropriate reference prior is

  π(θ) ∝ h(θ)^{1/2},

where h(θ) is the relevant Fisher information, and it is easily verified that this prior leads to a posterior in which the data are no longer overwhelmed. If the physical conditions of the problem require the use of discrete θ values, one could always use, for example,

  P(θ = 1 | z) = ∫₀^{3/2} π(θ | z) dθ,  P(θ = j | z) = ∫_{j−1/2}^{j+1/2} π(θ | z) dθ,  j > 1,

as an approximate discrete reference posterior.
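The discretisation step at the end of the example can be sketched numerically. In the fragment below (our illustration, not the book's code), a Gamma density stands in for a hypothetical continuous reference posterior π(θ | z); all function names are ours:

```python
import numpy as np

def cell_prob(pdf, lo, hi, n=2000):
    # trapezoidal integration of pdf over [lo, hi]
    t = np.linspace(lo, hi, n + 1)
    y = pdf(t)
    return (hi - lo) / n * (y.sum() - 0.5 * (y[0] + y[-1]))

def discretise(pdf, j_max=200):
    # P(theta = 1 | z) integrates over (0, 3/2);
    # P(theta = j | z) integrates over (j - 1/2, j + 1/2) for j > 1
    return {j: cell_prob(pdf, 0.0 if j == 1 else j - 0.5,
                         1.5 if j == 1 else j + 0.5)
            for j in range(1, j_max + 1)}

# stand-in continuous posterior: Ga(theta | 4, 1/2), i.e. mean 8;
# normalising constant Gamma(4) * 2^4 = 96
pdf = lambda t: t**3 * np.exp(-t / 2.0) / 96.0
probs = discretise(pdf)
print(sum(probs.values()))  # close to 1
```

The approximate discrete posterior inherits the shape of the continuous one, with most mass on integers near the continuous mode.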
Prediction and Hierarchical Models

Two classes of problems that are not covered by the methods so far discussed are hierarchical models and prediction problems. The difficulty with these problems is that there are unknowns (typically the unknowns of interest) that have specified distributions. For instance, if one wants to predict y based on x when (y, x) has density p(y, x | θ), the unknown of interest is y, but its distribution is conditionally specified. One needs a reference prior for θ, not y. Likewise, in a hierarchical model with, say, μ₁, μ₂, ..., μ_p being N(μᵢ | μ₀, λ), the μᵢ's may be the parameters of interest but a prior is only needed for the hyperparameters μ₀ and λ.

The obvious way to approach such problems is to integrate out the variables with conditionally known distributions (y in the predictive problem and the {μᵢ} in the hierarchical model), and find the reference prior for the remaining parameters based on this marginal model. The difficulty that arises is how to then identify parameters of interest and nuisance parameters to construct the ordering necessary for applying the reference prior method, the real parameters of interest having been integrated out. In future work, we propose to deal with this difficulty by defining the parameter of interest in the reduced model to be the conditional mean of the original parameter of interest. Thus, in the prediction problem, E[y | θ] (which will be either θ or some transformation thereof) will be the parameter of interest, and in the hierarchical model E[μ₁ | μ₀, λ] = μ₀ will be defined to be the parameter of interest. This technique has so far worked well in the examples to which it has been applied, but further study is clearly needed.

5.5 NUMERICAL APPROXIMATIONS

Section 5.3 considered forms of approximation appropriate as the sample size becomes large relative to the amount of information contained in the prior distribution. Section 5.4 considered the problem of approximating a prior specification maximising the expected information to be obtained from the data, for a given likelihood and arbitrary sample size. In this section, we shall consider numerical techniques for implementing Bayesian methods for arbitrary forms of likelihood and prior specification.

We note that the technical problem of evaluating quantities required for Bayesian inference summaries typically reduces to the calculation of a ratio of two integrals. Specifically, given a likelihood p(x | θ) and a prior density p(θ), the starting point for all subsequent inference summaries is the joint posterior density for θ given by

  p(θ | x) = p(x | θ) p(θ) / ∫ p(x | θ) p(θ) dθ.

From this, we may be interested in obtaining univariate marginal posterior densities for the components of θ, bivariate joint marginal posterior densities for pairs of
components of θ, or transformations of θ. Alternatively, we may be interested in marginal posterior densities for functions of components of θ, such as ratios or products, together with summary moments, highest posterior density intervals and regions, and so on. In all these cases, we need to evaluate the denominator in Bayes' theorem in order to obtain the normalising constant of the posterior density; then, in order to obtain marginal (univariate or bivariate) densities, we need to integrate over complementary components of θ; and so on. The technical key to the implementation of the formal solution given by Bayes' theorem is thus the ability to perform a number of integrations. Except in certain rather stylised problems (e.g., exponential families together with conjugate priors), the required integrations will not be feasible analytically and, thus, efficient approximation strategies will be required.

In this section, we shall outline five possible numerical approximation strategies, which will be discussed under the subheadings: Laplace Approximation; Iterative Quadrature; Importance Sampling; Sampling-importance-resampling; Markov Chain Monte Carlo. An exhaustive account of these and other methods will be given in the second volume in this series, Bayesian Computation.

5.5.1 Laplace Approximation

We motivate the approximation by noting that the technical problem of evaluating quantities required for Bayesian inference summaries, for specified likelihood and prior, is typically that of evaluating an integral of the form

  E[g(θ) | x] = ∫ g(θ) p(θ | x) dθ,

where p(θ | x) is derived from a predictive model with an appropriate representation as a mixture of parametric models, and g(θ) is some real-valued function of interest. Often, g(θ) is a first or second moment. Focusing initially on this situation of a required inference summary for g(θ), and assuming g(θ) almost everywhere positive, we note that the posterior expectation of interest can be written in the form

  E[g(θ) | x] = ∫ g(θ) p(x | θ) p(θ) dθ / ∫ p(x | θ) p(θ) dθ,

so that E[g(θ) | x] has the form of a ratio of two integrals.

Let us consider first the case of a single unknown parameter, θ = θ ∈ ℜ, with the vector x = (x₁, ..., x_n) of observations fixed, and define

  −n h(θ) = log p(θ) + log p(x | θ),
  −n h*(θ) = log g(θ) + log p(θ) + log p(x | θ).

Assuming h(·) and h*(·) to be suitably smooth functions, define θ̂, σ, θ*, σ* such that

  −h(θ̂) = sup{−h(θ)},   σ = [h''(θ)]^{−1/2} evaluated at θ = θ̂,
  −h*(θ*) = sup{−h*(θ)},  σ* = [h*''(θ)]^{−1/2} evaluated at θ = θ*.

Then the Laplace approximations for the two integrals defining the numerator and denominator of E[g(θ) | x] are given (see, for example, Jeffreys and Jeffreys, 1946) by

  √(2π) σ* n^{−1/2} exp{−n h*(θ*)}  and  √(2π) σ n^{−1/2} exp{−n h(θ̂)},

and it then follows immediately that the resulting approximation for E[g(θ) | x] has the form

  Ê[g(θ) | x] = (σ*/σ) exp{−n [h*(θ*) − h(θ̂)]}.

Essentially, the approximations consist of retaining quadratic terms in Taylor expansions of h(·) and h*(·), and are thus equivalent to normal-like approximations to the integrands. Tierney and Kadane (1986) have shown that

  E[g(θ) | x] = Ê[g(θ) | x] (1 + O(n⁻²)).

The Laplace approximation approach, exploiting the fact that Bayesian inference summaries typically involve ratios of integrals, is thus seen to provide a potentially very powerful general approximation technique. See, also, Tierney and Kadane (1988), Tierney, Kass and Kadane (1987, 1989a, 1989b, 1991) and Wong and Li (1992) for further underpinning of, and extensions to, this methodology.

Considering now the general case of θ ∈ ℜᵏ, the Laplace approximation to E[g(θ) | x] is completely analogous to the univariate case, with σ and σ* replaced by determinant terms derived from the inverse Hessian matrices of h and h* evaluated at θ̂ and θ*, respectively.

If θ = (φ, λ) and the required inference summary is the marginal posterior density for φ, application of the Laplace approximation approach corresponds to obtaining p̂(φ | x) pointwise, by fixing φ in the numerator and defining g(λ) = 1. Writing

  −n h_φ(λ) = log p(φ, λ) + log p(x | φ, λ),

considered as a function of λ for fixed φ, with λ̂_φ attaining the supremum of −h_φ(λ), it is easily seen that this leads to

  p̂(φ | x) ∝ σ_φ exp{−n h_φ(λ̂_φ)},

where σ_φ is derived, completely analogously to the univariate case, from the second derivative of h_φ at λ̂_φ. The form p̂(φ | x) thus provides (up to proportionality) a pointwise approximation to the ordinates of the marginal posterior density for φ.
Considering this form in more detail, we note that p(x | φ, λ̂_φ), considered as a function of φ, corresponding to the parametric model evaluated at the value λ̂_φ which maximises the log-likelihood over λ for fixed φ, is usually called the profile likelihood for φ, and that −∇² log p(x | φ, λ) is the Hessian of the log-likelihood function. The approximation to the marginal density for φ given by p̂(φ | x) has a form often referred to as the modified profile likelihood (see, for example, Cox and Reid, 1987, for a convenient discussion of this terminology; for further references, see Appendix B). Approximation to Bayesian inference summaries through Laplace approximation is therefore seen to have links with forms of inference summary proposed and derived from a non-Bayesian perspective.

In relation to the above analysis, we note that the Laplace approximation is essentially derived by considering normal approximations to the integrands appearing in the numerator and denominator of the general form E[g(θ) | x]. If the forms concerned are not well approximated by second-order Taylor expansions of the exponent terms of the integrands, which may be the case with small or moderate samples, we may be able to improve substantially on this direct Laplace approximation approach, particularly when components of θ are constrained to ranges other than the real line.

One possible alternative is to attempt to approximate the integrands by forms other than normal, perhaps resembling more the actual posterior shapes, such as gammas or betas. Such an approach has been followed in the one-parameter case by Morris (1988), who develops a general approximation technique based around the Pearson family of densities. These are characterised by parameters m, μ₀ and a quadratic function Q(θ) = q₀ + q₁θ + q₂θ², with the range of θ such that 0 < Q(θ) < ∞, which together specify a density for θ proportional to the exponential of the indefinite integral of m(μ₀ − θ)/Q(θ). It is shown by Morris (1988) that, at least if θ is a scalar parameter, for a given choice of quadratic function Q, an analogue to the Laplace-type approximation of an integral of a unimodal function f(θ) is available, involving m̂ = r''(θ̂)Q(θ̂), where θ̂ maximises r(θ) = log[f(θ)Q(θ)]. Details of the forms of Q and μ₀ for familiar forms of Pearson densities are given in Morris (1988), where it is also shown that the approximation can often be further simplified.

A second alternative is to note that the version of the Laplace approximation proposed by Tierney and Kadane (1986) is not invariant to changes in the (arbitrary) parametrisation chosen when specifying the likelihood and prior density functions. We note, incidentally, that by judicious reparametrisation (of the likelihood, together with the appropriate, Jacobian adjusted, prior density) the Laplace approximation can itself be made more accurate, even in contexts where the original parametrisation does not suggest the plausibility of a normal-type approximation to the integrands. We note, also, that such a strategy is available in multiparameter contexts, whereas the Pearson family approach does not seem so readily generalisable.

To provide a concrete illustration of these alternative analytic approximation approaches, consider the following.

Example 5.24. (Approximating the mean of a beta distribution). Suppose that a posterior beta distribution, Be(θ | s + ½, n − s + ½), has arisen from a Bi(s | θ, n) likelihood, together with a Be(θ | ½, ½) prior (the reference prior). We can, in fact, identify the analytic form of the posterior mean in this case, E[θ | s, n] = (s + ½)/(n + 1), but we shall ignore this for the moment and examine the approximations implied by the techniques discussed above.

First, defining g(θ) = θ, the required integrals are defined in terms of

  −n h(θ) = (s − ½) log θ + (n − s − ½) log(1 − θ),  −n h*(θ) = −n h(θ) + log θ,

and the Tierney-Kadane form of the Laplace approximation gives an estimated posterior mean Ê[θ | x]. If, instead, we reparametrise to δ = sin⁻¹√θ, together with the appropriate Jacobian adjustment, we obtain, by a similar analysis to the above, a second estimate Ê'[θ | x]. Alternatively, if we work via the Pearson family, with Q(θ) = θ(1 − θ) as the "natural" choice for a beta-like posterior, we obtain a third estimate, E_P[θ | x]. By considering the percentage errors of estimation, defined by

  |true − estimated| / true × 100,

we can study the performance of the three estimates for various values of n and s. Here, we simply summarise, in Table 5.1, the results for n = 5, s = 3, which typify the performance of the estimates for small n.

Table 5.1  Approximation of E[θ | x] from Be(θ | s + ½, n − s + ½)
(percentage errors in parentheses)

  True value    Laplace approximations          Pearson approximation
                Ê[θ | x]       Ê'[θ | x]        E_P[θ | x]
  0.583         0.563 (3.5)    0.580 (0.6)      0.585 (0.3)

We see from Table 5.1 that the Pearson approximation, which is, in some sense, preselected to be best, does, in fact, outperform the others. However, it is striking that the Laplace approximation under reparametrisation leads to such a considerable improvement over that based on the original parametrisation, and is a very satisfactory alternative to the "optimal" Pearson form. Details, and further examples, are given in Achcar and Smith (1989).

In general, in cases involving a relatively small number of parameters, it would appear that the Laplace approach, in combination with judicious reparametrisation, can provide excellent approximations to general Bayesian inference summaries, whether in the form of posterior moments or marginal posterior densities. However, in multiparameter contexts there may be numerical problems with the evaluation of local derivatives in cases where analytic forms are unobtainable or too tedious to identify explicitly. In addition, there are awkward complications if the integrands are multimodal. At the time of writing, this area of approximation theory is very much still an active research field and the full potential of this and related methods (see, also, Lindley, 1980b; Leonard et al., 1989) has yet to be clarified.
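The Laplace columns of Table 5.1 can be reproduced numerically. The sketch below is our code (modes by grid search, curvatures by finite differences); the reparametrisation used is the variance-stabilising δ = sin⁻¹√θ, which is our reading of the transformation behind the second column:

```python
import numpy as np

def tk_mean(log_post, log_g, grid):
    # Tierney-Kadane ratio of Laplace approximations
    def fit(f):
        t = grid[int(np.argmax(f(grid)))]
        eps = 1e-5
        return -(f(t + eps) - 2.0 * f(t) + f(t - eps)) / eps**2, f(t)
    d2_den, f_den = fit(log_post)
    d2_num, f_num = fit(lambda t: log_post(t) + log_g(t))
    return np.sqrt(d2_den / d2_num) * np.exp(f_num - f_den)

n, s = 5, 3
a, b = s + 0.5, n - s + 0.5                  # Be(theta | a, b) posterior
exact = a / (a + b)                          # (s + 1/2)/(n + 1)

# (i) original parametrisation
tgrid = np.linspace(0.001, 0.999, 99901)
lp = lambda t: (a - 1.0) * np.log(t) + (b - 1.0) * np.log(1.0 - t)
e_orig = tk_mean(lp, np.log, tgrid)

# (ii) delta = arcsin(sqrt(theta)); Jacobian-adjusted posterior in delta
dgrid = np.linspace(0.01, np.pi / 2.0 - 0.01, 99901)
lp_d = lambda d: (2.0 * a - 1.0) * np.log(np.sin(d)) \
              + (2.0 * b - 1.0) * np.log(np.cos(d))
e_rep = tk_mean(lp_d, lambda d: 2.0 * np.log(np.sin(d)), dgrid)

print(exact, e_orig, e_rep)  # about 0.583, 0.563, 0.580
```

The two approximations match the Laplace entries of the table: roughly a 3.5% error on the original scale, reduced to under 1% after reparametrisation.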
for example. that. ) . (see. the applicability of this approximaparameters defined on (x. even for moderate n (less than 12. is the ith zero of the Hermite polynomial H. using. exp(t . x). for example.(J}/ l t1 .log(h .. then the quadrature rule approximates the integral without error. if h ( t ) is a suitably well behaved function and g(t) = h ( t )( 2 K I T L ) "'cxp { . It turns out that. etc. expressed in informal terms.2 5 Irtference lteratfve Quadrature It is well known that univariate integrals of the type are often well approximated by GaussHermite quadrature rules of the form where 1 . ' then where n ) . to use the above form we must specify 11and cr in the normal component. say).. if f ( t ) is a polynomial of degree at most 211 . = /I + J2a t . log(f ) or log(1 . respectively.346 5. Moreover.3. this is a rather rich class which. 1982). and 2. It follows that GaussHermite rules are likely to prove very efficient for functions which.t ) . In fact. using the same (iterated) .(t). In particular. Naylor and Smith. covers many of the likelihood x prior shapes we typically encounter for Moreover.).. tion is vastly extended by working with suitable transformations of parameters h defined on other ranges such as (0.u) . . ) a. prior information.sitnul~cinc~oits!\ norrntrlising c w i s f r i n f and thc>. closely resemble "polynomial x normal" forms. x) or (0. given reasonable starting values (from any convenient source. we note that if the posterior density i s wellapproximated by the product of a normal and a polynomial of degree at most 21. we typically can successfully iterate the quadrature rule.5. Of course... substituting estimates of the posterior mean and variance obtained using previous values of 'in. for example. This implies.firsfmid . :.1. maximum likelihood estimates. then an rwiluciting /he tipoint GaussHermite rule will prove effective for .secoiid tttoitietirs. = w.
Our discussion so far has been for the one-dimensional case; however, the need for an efficient strategy is most acute in higher dimensions. The “obvious” extension of the above ideas is to use a Cartesian product rule, giving the approximation
$$\int g(t)\,dt \approx \sum_{i_1=1}^{n} \cdots \sum_{i_k=1}^{n} m_{i_1}\cdots m_{i_k}\, g(z_{i_1},\ldots,z_{i_k}),$$
where the grid points and the weights are found by substituting the appropriate iterated estimates of $\mu_j$ and $\sigma_j^2$ corresponding to each marginal component $t_j$. Clearly, if the posterior components are approximately independent, the lattice of integration points formed from the product of the one-dimensional grids will efficiently cover the bulk of the posterior density. However, if high posterior correlations exist, these will lead to many of the lattice points falling in areas of negligible posterior density, thus causing the Cartesian product rule to provide poor estimates of the normalising constant and moments. To overcome this problem, we could first apply individual parameter transformations of the type discussed above and then attempt to transform the resulting parameters, via an appropriate linear transformation, to a new, approximately orthogonal, set of parameters; successive transformations are then based on the estimated covariance matrix from the previous iteration. In practice, it is efficient to begin with a small grid size ($n = 4$ or $n = 5$) and then to gradually increase the grid size until stable answers are obtained both within and between the last two grid sizes used. The following general strategy has proved highly effective for problems involving up to six parameters (see, for example, Naylor and Smith, 1982, 1988; Smith et al., 1985, 1987).
The problem with this “obvious” strategy is that the product form is only efficient if we are able to make an (at least approximate) assumption of posterior independence among the individual components. The general strategy is therefore the following.

(1) Reparametrise individual parameters so that the resulting working parameters all take values on the real line.

(2) Using initial estimates of the joint posterior mean vector and covariance matrix for the working parameters, transform, via an appropriate linear transformation, to a new, centred, scaled, more “orthogonal” set of parameters. At the first step, this linear transformation derives from an initial guess or estimate of the posterior covariance matrix (for example, based on the observed information matrix from a maximum likelihood analysis).

(3) Using the derived initial location and scale estimates for these “orthogonal” parameters, carry out, on suitably dimensioned grids, Cartesian product integration of functions of interest.

(4) Iterate, successively updating the mean and covariance estimates, until stable results are obtained both within and between grids of specified dimension.
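Step (2) can be sketched concretely. In this illustration (ours, not from the text) the estimated posterior covariance is factored by a Cholesky decomposition, and the working parameters are mapped to a centred, scaled, approximately orthogonal set; the numerical values are hypothetical.

```python
import numpy as np

# Estimated posterior mean and covariance for the working parameters
# (hypothetical values for illustration).
mvec = np.array([1.0, -2.0])
C = np.array([[4.0, 1.9],
              [1.9, 1.0]])            # strong posterior correlation

L = np.linalg.cholesky(C)             # C = L L^T

to_phi = lambda theta: np.linalg.solve(L, theta - mvec)   # centred, scaled, "orthogonal"
to_theta = lambda phi: mvec + L @ phi                     # inverse map, used on the integrand

# In the phi parametrisation the (approximate) posterior covariance is the
# identity, so a Cartesian product grid covers the bulk of the density.
rng = np.random.default_rng(0)
theta = rng.multivariate_normal(mvec, C, size=200_000)
phi = np.linalg.solve(L, (theta - mvec).T).T
```

Computing the sample covariance of `phi` confirms that the transformed parameters are, to sampling accuracy, uncorrelated with unit variances.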
For problems involving larger numbers of parameters, say between six and twenty, Cartesian product approaches become computationally prohibitive and alternative approaches to numerical integration are required. One possibility is the use of spherical quadrature rules (Stroud, 1971), derived by transforming from cartesian to spherical polar coordinates and constructing optimal integration formulae based on symmetric configurations over concentric spheres; see Smith (1991). The efficiency of numerical quadrature methods is often very dependent on the particular parametrisation used; for related discussion, see Kass and Slate (1992), Hills and Smith (1992, 1993) and Marriott and Smith (1992). Other relevant references on numerical quadrature include Shaw (1988b), Marriott (1988), Flournoy and Tsutakawa (1991), O'Hagan (1991) and Dellaportas and Wright (1992). The ideas outlined above relate to the use of numerical quadrature formulae to implement Bayesian statistical methods. It is amusing to note that the roles can be reversed and Bayesian statistical methods used to derive optimal numerical quadrature formulae! See, for example, Diaconis (1988b) and O'Hagan (1992).

5.5.3 Importance Sampling

The importance sampling approach to numerical integration is based on the observation that, if $f$ is a function and $g$ is a probability density function,
$$\int f(x)\,dx = \int \frac{f(x)}{g(x)}\,g(x)\,dx,$$
which suggests the “statistical” approach of generating a sample from the distribution function $G$, referred to in this context as the importance sampling distribution, and using the average of the values of the ratio $f/g$ as an unbiased estimator of $\int f(x)\,dx$. However, the variance of such an estimator clearly depends critically on the choice of $G$, it being desirable to choose $g$ to be “similar” to the shape of $f$.
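The basic estimator can be sketched as follows. This is our illustration, not an example from the text: the integrand is a standard normal kernel whose integral, $\sqrt{2\pi}$, is known, and the importance density is a Student-$t$ with 4 degrees of freedom, chosen heavier-tailed than $f$ so that the ratio $f/g$ is bounded.

```python
import numpy as np
from math import gamma, pi, sqrt

rng = np.random.default_rng(1)

# Integrand: an unnormalised posterior kernel; its integral is sqrt(2*pi).
f = lambda x: np.exp(-0.5 * x**2)

# Importance sampling density g: Student-t with 4 degrees of freedom.
nu = 4.0
c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))
g = lambda x: c * (1 + x**2 / nu) ** (-(nu + 1) / 2)

x = rng.standard_t(nu, size=200_000)        # sample from G
estimate = np.mean(f(x) / g(x))             # unbiased estimator of the integral
```

With a heavier-tailed $g$ the ratio is bounded and the estimator has finite variance; with the roles reversed (a lighter-tailed $g$) the variance could be infinite, which is the practical force of the remark about choosing $g$ “similar” to $f$.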
In multiparameter Bayesian contexts, exploitation of this idea requires designing importance sampling distributions which are efficient for the kinds of integrands arising in typical Bayesian applications. Full details of this approach will be given in the volume Bayesian Computation. A considerable amount of work has focused on the use of multivariate normal or Student forms, or modifications thereof,
much of this work motivated by econometric applications; see, in particular, the contributions of Kloek and van Dijk (1978), van Dijk and Kloek (1983, 1985), van Dijk et al. (1987) and Geweke (1988, 1989). An alternative line of development (Shaw, 1988a) proceeds as follows, bearing in mind that we need to have available a flexible set of possible distributional shapes. In the univariate case, one such family defined on $\Re$ is provided by considering the random variable
$$x = a\,h(u) - (1-a)\,h(1-u),$$
where $u$ is uniformly distributed on $(0,1)$, $h : (0,1) \rightarrow \Re$ is a monotone increasing function, and $0 \le a \le 1$ is a constant. The choice $a = 0.5$ leads to symmetric distributions; as $a \rightarrow 0$ or $a \rightarrow 1$ we obtain increasingly skew distributions (to the left or right). The tail behaviour of the distributions is governed by the choice of the function $h$: for example, $h(u) = \log(u)$ leads to a family whose symmetric member is the logistic distribution, and $h(u) = -\tan\{\pi(1-u)/2\}$ leads to a family whose symmetric member is the Cauchy distribution. Moreover, the median is linear in $a$, and the moments of the distributions are polynomials in $a$ (of corresponding order), so that sample information about such quantities provides (for any given choice of $h$) operational guidance on the appropriate choice of $a$. By construction, the distribution function $G$ of such a family is one for which $G^{-1}$ is available explicitly, and, if we work with $y = G(x)$, the required integral is the expected value of $f\{G^{-1}(y)\}/g\{G^{-1}(y)\}$ with respect to a uniform distribution on the interval $(0,1)$.
Thus, rather than actually generating “uniformly distributed” random numbers, we are likely to get a reasonable approximation to the integral by simply taking some equally spaced set of points on $(0,1)$, owing to the periodic nature of the ratio function over this interval. Moreover, if we choose $g$ to be heavier-tailed than $f$, the ratio remains well behaved over the whole interval. The effectiveness of all this depends on choosing a suitable $G$, and it is natural to consider an iterative importance sampling strategy which attempts to learn about an appropriate choice of $G$ for each parameter. To use this family in the multiparameter case, we again employ individual parameter transformations, together with “orthogonalising” transformations, so that all parameters belong to $\Re$ and can be treated “independently”. If $f$ is a function of more than one argument ($k$, say), an exactly parallel argument suggests that the choice of a suitable $g$, followed by the use of a suitably selected “uniform” configuration of points in the $k$-dimensional unit hypercube, will provide an efficient multidimensional integration procedure. Thus, part of this strategy requires the specification of “uniform” configurations of points in the $k$-dimensional unit hypercube. This problem has been extensively studied by number theorists, and systematic experimentation with various suggested forms of “quasi-random” sequences has identified effective forms of configuration for importance sampling purposes; for details, see Shaw (1988a).
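The equally spaced construction can be checked on a univariate example of our own devising: the importance density is the standard logistic, chosen because its inverse distribution function is explicit, and the integrand is again a normal kernel with known integral $\sqrt{2\pi}$.

```python
import numpy as np

# Importance density g: standard logistic, with explicit inverse cdf.
G_inv = lambda u: np.log(u / (1 - u))
g = lambda x: np.exp(-x) / (1 + np.exp(-x)) ** 2

# Integrand whose integral over the real line is sqrt(2*pi).
f = lambda x: np.exp(-0.5 * x**2)

# Equally spaced (midpoint) configuration on (0,1) instead of random draws.
N = 2000
u = (np.arange(N) + 0.5) / N
x = G_inv(u)
estimate = np.mean(f(x) / g(x))
```

Because the ratio function is smooth and vanishes at both ends of $(0,1)$, the deterministic grid of 2000 points is far more accurate here than 2000 random draws would be.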
The general strategy is then the following.

(1) Reparametrise individual parameters so that the resulting working parameters all take values on the real line.

(2) Using initial estimates of the posterior mean vector and covariance matrix for the working parameters, transform to a centred, scaled, more “orthogonal” set of parameters.

(3) In terms of these transformed parameters, set $g(\theta) = g_1(\theta_1)\cdots g_k(\theta_k)$ for “suitable” choices of $g_j$, $j = 1, \ldots, k$.

(4) Use the inverse distribution function transformation to reduce the problem to that of calculating an average over a “suitable” uniform configuration in the $k$-dimensional hypercube.

(5) Use information from this “sample” to learn about skewness, tailweight, etc., and hence choose “better” $g_j$, $j = 1, \ldots, k$, and revise the estimates of the mean vector and covariance matrix.
(6) Iterate until the sample variance of replicate estimates of the integral value is sufficiently small.

Teichroew (1965) provides a historical perspective on simulation techniques. For further advocacy and illustration of the use of (non-Markov-chain) Monte Carlo methods in Bayesian Statistics, see Stewart (1979, 1983, 1985, 1987), Stewart and Davis (1986), Shao (1989, 1990) and Wolpert (1991).

5.5.4 Sampling-importance-resampling

Instead of just using importance sampling to estimate integrals (and hence calculate posterior normalising constants and moments), we can also exploit the idea in order to produce simulated samples from posterior or predictive distributions, shifting the focus in Bayes' theorem from densities to samples. Our account is based on Smith and Gelfand (1992). As a first step, we note the essential duality between a sample and the distribution from which it is generated: clearly, the distribution can generate the sample; conversely, given a sample we can re-create, at least approximately, the distribution (as a histogram, an empirical distribution function, a kernel density estimate, or whatever). We begin by taking a fresh look at Bayes' theorem from this sampling perspective.
To gain insight into the general problem of how a sample from one density may be modified to form a sample from a different density, consider, for mathematical convenience, the univariate case. Suppose that a sample of random quantities has been generated from a density $g(\theta)$, but that what is required is a sample from the density
$$h(\theta) = \frac{f(\theta)}{\int f(\theta)\,d\theta},$$
where only the functional form of $f(\theta)$ is specified. Given $f(\theta)$ and the sample from $g(\theta)$, how can we derive a sample from $h(\theta)$? In cases where there exists an identifiable constant $M > 0$ such that $f(\theta)/g(\theta) \le M$ for all $\theta$, an exact sampling procedure follows immediately from the well-known rejection method for generating random quantities (see, for example, Ripley, 1987, p. 60):

(i) consider a $\theta$ generated from $g(\theta)$;
(ii) generate $u$ from $\text{Un}(u\,|\,0,1)$;
(iii) if $u \le f(\theta)/Mg(\theta)$, accept $\theta$; otherwise repeat (i)-(iii).

Any accepted $\theta$ is then a random quantity from $h(\theta)$. Given a sample of size $N$ from $g(\theta)$, it is immediately verified that the expected sample size from $h(\theta)$ is $M^{-1}N\int f(\theta)\,d\theta$. In cases where the bound $M$ is not readily available, consider the following approximate procedure. Given $\theta_1, \ldots, \theta_N$ from $g(\theta)$, calculate
$$q_i = \frac{w_i}{\sum_{j=1}^{N} w_j}, \qquad w_i = \frac{f(\theta_i)}{g(\theta_i)}.$$
If we now draw $\theta^*$ from the discrete distribution over $\{\theta_1, \ldots, \theta_N\}$ having mass $q_i$ on $\theta_i$, then $\theta^*$ is approximately distributed as a random quantity from $h(\theta)$. To see this, note that, if $P$ describes the actual distribution of $\theta^*$, then, under appropriate regularity conditions, as $N \rightarrow \infty$,
$$P(\theta^* \le a) = \sum_{i=1}^{N} q_i\,1(\theta_i \le a) \rightarrow \frac{\int_{-\infty}^{a} \{f(\theta)/g(\theta)\}\,g(\theta)\,d\theta}{\int_{-\infty}^{\infty} \{f(\theta)/g(\theta)\}\,g(\theta)\,d\theta} = \int_{-\infty}^{a} h(\theta)\,d\theta.$$
This technique is referred to by Rubin (1988) as sampling-importance-resampling (SIR).
With this sampling-importance-resampling procedure in mind, let us return to the prior to posterior sample process defined by Bayes' theorem. In terms of densities, Bayes' theorem defines the inference process as the modification of the prior density $p(\theta)$ to form the posterior density $p(\theta|z)$; shifting to a sampling perspective, this corresponds to the modification of a sample from $p(\theta)$ to form a sample from $p(\theta|z)$ through the medium of the likelihood function $p(z|\theta)$. For fixed $z$, define $f_z(\theta) = p(z|\theta)p(\theta)$, so that
$$p(\theta|z) = \frac{f_z(\theta)}{\int f_z(\theta)\,d\theta}.$$
If $\hat{\theta}$ maximising $p(z|\theta)$ is available, so that $M = p(z|\hat{\theta})$, the rejection procedure given above can be applied to a sample from $p(\theta)$ to obtain a sample from $p(\theta|z)$, by taking $g(\theta) = p(\theta)$ and $f(\theta) = f_z(\theta)$. Bayes' theorem then takes the simple form: for each $\theta$ in the prior sample, accept $\theta$ into the posterior sample with probability
$$\frac{f_z(\theta)}{M\,p(\theta)} = \frac{p(z|\theta)}{p(z|\hat{\theta})}.$$
The likelihood therefore acts in an intuitive way to define the acceptance probability: those $\theta$ with high likelihoods are more likely to be represented in the posterior sample. Alternatively, if $M$ is not readily available, we can use the approximate resampling method, which selects $\theta_i$ into the posterior sample with probability
$$q_i = \frac{p(z|\theta_i)}{\sum_{j=1}^{N} p(z|\theta_j)}.$$
Again, we note that this is proportional to the likelihood, so that the inference process via sampling proceeds in an intuitive way. Since sampling with replacement is not ruled out, the sample size generated in this case can be as large as desired. Clearly, however, the less $h(\theta)$ resembles $g(\theta)$, the larger $N$ will need to be if the distribution of $\theta^*$ is to be a reasonable approximation to $h(\theta)$. The sampling-resampling perspective outlined above opens up the possibility of novel applications of exploratory data analytic and computer graphical techniques in Bayesian statistics: for pedagogical illustration, see Albert (1993); for an illustration of the method in the context of sensitivity analysis and intractable reference analysis, see Stephens and Smith (1992). We shall not pursue these ideas further here, however, since the topic is more properly dealt with in the subsequent volume Bayesian Computation.
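The prior-to-posterior resampling step can be sketched on a conjugate example of our own choosing, so that the answer is checkable: with binomial data and a uniform prior, the exact posterior is $\text{Be}(\theta\,|\,s+1, n-s+1)$, and the SIR sample should reproduce its mean. The data values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Binomial data: s successes in n trials (hypothetical values).
n, s = 20, 14

# Prior sample: theta ~ Un(0, 1).
theta = rng.uniform(size=100_000)

# Resampling probabilities proportional to the likelihood p(z | theta).
like = theta**s * (1 - theta) ** (n - s)
q = like / like.sum()

# Sampling with replacement from the weighted prior sample gives an
# approximate posterior sample; the exact posterior is Be(s+1, n-s+1).
post = rng.choice(theta, size=20_000, replace=True, p=q)

exact_mean = (s + 1) / (n + 2)
```

Those prior draws with high likelihood carry large resampling probabilities and so dominate the posterior sample, which is precisely the intuitive action of the likelihood described above.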
5.5.5 Markov Chain Monte Carlo

The key idea is very simple. Suppose that we wish to generate a sample from a posterior distribution $p(\theta|z)$ for $\theta \in \Theta \subseteq \Re^k$, but cannot do this directly. However, suppose that we can construct a Markov chain with state space $\Theta$, which is straightforward to simulate from, and whose equilibrium distribution is $p(\theta|z)$. If we then run the chain for a long time, simulated values of the chain can be used as a basis for summarising features of the posterior $p(\theta|z)$ of interest. To implement this strategy, we simply need algorithms for constructing chains with specified equilibrium distributions. Under suitable regularity conditions, asymptotic results exist which clarify the sense in which the sample output from a chain with equilibrium distribution $p(\theta|z)$ can be used to mimic a random sample from $p(\theta|z)$, or to estimate the expected value, with respect to $p(\theta|z)$, of a function $g(\theta)$ of interest. Typically, if $\theta^1, \theta^2, \ldots$ is a realisation from an appropriate chain, the available asymptotic results as $t \rightarrow \infty$ include
$$\theta^t \rightarrow \theta \sim p(\theta|z) \ \text{in distribution}, \qquad \frac{1}{t}\sum_{i=1}^{t} g(\theta^i) \rightarrow E\{g(\theta)\,|\,z\} \ \text{almost surely}.$$
For recent accounts and discussion, see, for example, Gelfand and Smith (1990), Casella and George (1992), Gelman and Rubin (1992a, 1992b), Geyer (1992), Raftery and Lewis (1992), Ritter and Tanner (1992), Roberts (1992), Tierney (1992), Besag and Green (1993), Chan (1993), Gilks et al. (1993) and Smith and Roberts (1993); see, also, Tanner and Wong (1987) and Tanner (1991).
Clearly, successive $\theta^t$ will be correlated, so that, if the first of these asymptotic results is to be exploited to mimic a random sample from $p(\theta|z)$, suitable spacings will be required between realisations used to form the sample, or parallel independent runs of the chain might be considered. The second of the asymptotic results implies that ergodic averaging of a function of interest over realisations from a single run of the chain provides a consistent estimator of its expectation. In what follows, we outline two particular forms of Markov chain scheme which have proved particularly convenient for a range of applications in Bayesian statistics.

The Gibbs Sampling Algorithm

Suppose that $\theta$, the vector of unknown quantities appearing in Bayes' theorem, has components $\theta_1, \ldots, \theta_k$, and that our objective is to obtain summary inferences from the joint posterior $p(\theta|z) = p(\theta_1, \ldots, \theta_k|z)$. Except in simple, stylised cases, this will typically lead, unavoidably, to challenging problems of numerical integration. As we have already observed in this section, however,
this apparent need for sophisticated numerical integration technology can often be avoided by recasting the problem as one of iterative sampling of random quantities from appropriate distributions. To this end, we note that the so-called full conditional densities
$$p(\theta_i\,|\,z, \theta_j, j \neq i), \qquad i = 1, \ldots, k,$$
of the individual components, given the data and specified values of all the other components of $\theta$, are typically easily identified, as functions of $\theta_i$, by inspection of the form of $p(\theta|z) \propto p(z|\theta)p(\theta)$ in any given application. Suppose, then, that given an arbitrary set of starting values $\theta_2^{(0)}, \ldots, \theta_k^{(0)}$ for the unknown quantities, we implement the following iterative procedure:

draw $\theta_1^{(1)}$ from $p(\theta_1\,|\,z, \theta_2^{(0)}, \ldots, \theta_k^{(0)})$;
draw $\theta_2^{(1)}$ from $p(\theta_2\,|\,z, \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_k^{(0)})$;
$\vdots$
draw $\theta_k^{(1)}$ from $p(\theta_k\,|\,z, \theta_1^{(1)}, \ldots, \theta_{k-1}^{(1)})$;
draw $\theta_1^{(2)}$ from $p(\theta_1\,|\,z, \theta_2^{(1)}, \ldots, \theta_k^{(1)})$;

and so on. It follows (see, for example, Geman and Geman, 1984; Roberts and Smith, 1994) that $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_k^{(t)})$ is a realisation of a Markov chain, and that $\theta^{(t)}$ tends in distribution, as $t \rightarrow \infty$, to a random vector whose joint density is $p(\theta|z)$; in particular, $\theta_i^{(t)}$ tends in distribution to a random quantity whose density is $p(\theta_i|z)$. Now suppose that the above procedure is continued through $t$ iterations and is independently replicated $m$ times, so that from the current iteration we have $m$ replicates of the sampled vector $\theta^{(t)} = (\theta_1^{(t)}, \ldots, \theta_k^{(t)})$. It follows that, for large $t$, the replicates $(\theta_{i1}^{(t)}, \ldots, \theta_{im}^{(t)})$ are approximately a random sample from $p(\theta_i|z)$.
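The iterative scheme above can be sketched on a stylised two-component example of our own: a bivariate normal "posterior" with zero means, unit variances and known correlation $\rho$, for which both full conditionals are univariate normals, $p(\theta_1\,|\,\theta_2) = N(\rho\theta_2, 1-\rho^2)$ and symmetrically for $\theta_2$. In applications the full conditionals would instead be read off from $p(\theta|z) \propto p(z|\theta)p(\theta)$.

```python
import numpy as np

rng = np.random.default_rng(3)

rho = 0.8
sd = np.sqrt(1 - rho**2)       # conditional standard deviation

T, burn = 30_000, 1_000
draws = np.empty((T, 2))
th1, th2 = 5.0, -5.0           # arbitrary starting values
for t in range(T):
    th1 = rng.normal(rho * th2, sd)   # draw from p(theta1 | theta2)
    th2 = rng.normal(rho * th1, sd)   # draw from p(theta2 | theta1)
    draws[t] = th1, th2
draws = draws[burn:]
```

Despite the deliberately bad starting values, the post-burn-in output reproduces the zero means and the correlation $\rho$ of the target distribution.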
In particular, for large $t$ and by making $m$ suitably large, an estimate $\hat{p}(\theta_i|z)$ of $p(\theta_i|z)$ is easily obtained: either as a kernel density estimate derived from the replicates $\theta_{i1}^{(t)}, \ldots, \theta_{im}^{(t)}$, or from the average
$$\hat{p}(\theta_i|z) = \frac{1}{m}\sum_{l=1}^{m} p(\theta_i\,|\,z, \theta_{jl}^{(t)}, j \neq i),$$
the average being over replicates which have an approximate $p(\theta|z)$ distribution for large $t$. We note that simulation approaches are ideally suited to providing summary inferences (we simply report an appropriate summary of the sample), inferences for arbitrary functions of $\theta_1, \ldots, \theta_k$ (we simply form a sample of the appropriate function from the samples of the $\theta_i$'s), or predictions (for example, in an obvious notation, $\hat{p}(y|z) = m^{-1}\sum_{l=1}^{m} p(y|\theta_l^{(t)})$). So far as sampling from the $p(\theta_i\,|\,z, \theta_j, j \neq i)$ is concerned, either the full conditionals assume familiar forms, in which case computer routines are typically already available, or they are simple arbitrary mathematical forms, in which case general stochastic simulation techniques are available, such as envelope rejection and ratio of uniforms, which can be adapted to the specific forms (see, for example, Devroye, 1986; Ripley, 1987; Gilks and Wild, 1992; Gilks, 1992; Dellaportas and Smith, 1993). The potential of this iterative scheme for routine implementation of Bayesian analysis has been demonstrated in detail for a wide variety of problems: see, for example, Gelfand and Smith (1990), Gelfand et al. (1990), Carlin and Gelfand (1991), Wakefield et al. (1993) and Gilks et al. (1993). We shall not provide a more extensive discussion here, since illustration of the technique in complex situations more properly belongs to the second volume of this work.
The Metropolis-Hastings Algorithm

This algorithm constructs a Markov chain $\theta^1, \theta^2, \ldots$ with state space $\Theta$ and equilibrium distribution $p(\theta|z)$ by defining the transition probability from $\theta^t = \theta$ to the next realised state $\theta^{t+1}$ as follows. Let $q(\theta, \theta')$ denote a (for the moment arbitrary) transition probability function, such that the vector $\theta'$ drawn from $q(\theta, \cdot)$ is considered as a proposed possible value for $\theta^{t+1}$. However, a further randomisation now takes place. With some probability $\alpha(\theta, \theta')$, we actually accept $\theta^{t+1} = \theta'$; otherwise, we reject the value generated from $q(\theta, \cdot)$ and set $\theta^{t+1} = \theta$.
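The scheme can be sketched in the special case of a symmetric random-walk proposal $q$, for which the standard Hastings acceptance probability reduces to $\min\{1, p(\theta'|z)/p(\theta|z)\}$ (the original Metropolis et al., 1953, form). The target here is our own checkable choice, an unnormalised standard normal kernel, specified only up to proportionality.

```python
import numpy as np

rng = np.random.default_rng(4)

# Unnormalised target: knowledge of p(theta | z) up to proportionality
# (likelihood times prior) is all that is needed.
log_f = lambda th: -0.5 * th**2           # a N(0,1) kernel, for checking

T, scale = 50_000, 2.5
draws = np.empty(T)
th = 10.0                                 # arbitrary starting value
for t in range(T):
    prop = th + scale * rng.normal()      # symmetric random-walk proposal q
    # accept with probability min{1, f(prop)/f(th)}; the unknown
    # normalising constant cancels in the ratio
    if np.log(rng.uniform()) < log_f(prop) - log_f(th):
        th = prop
    draws[t] = th
draws = draws[1_000:]
```

Note that only the ratio of target values is ever computed, which is why knowledge of the posterior up to proportionality suffices.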
This construction defines a Markov chain with transition probabilities given by
$$K(\theta, \theta') = q(\theta, \theta')\,\alpha(\theta, \theta') + I(\theta' = \theta)\left[1 - \int q(\theta, \theta'')\,\alpha(\theta, \theta'')\,d\theta''\right],$$
where $I(\cdot)$ is the indicator function. If now we set
$$\alpha(\theta, \theta') = \min\left\{1,\ \frac{p(\theta'|z)\,q(\theta', \theta)}{p(\theta|z)\,q(\theta, \theta')}\right\},$$
it is easy to check that $p(\theta|z)\,K(\theta, \theta') = p(\theta'|z)\,K(\theta', \theta)$, which, provided that the thus far arbitrary $q(\theta, \theta')$ is chosen to be irreducible and aperiodic on a suitable state space, is a sufficient condition for $p(\theta|z)$ to be the equilibrium distribution of the constructed chain. It is important to note that the (equilibrium) distribution of interest, $p(\theta|z)$, only enters $\alpha(\theta, \theta')$ through the ratio $p(\theta'|z)/p(\theta|z)$. This is quite crucial, since it means that knowledge of the distribution up to proportionality (given by the likelihood multiplied by the prior) is sufficient for implementation. This general algorithm is due to Hastings (1970), building on Metropolis et al. (1953); see, also, Peskun (1973), Tierney (1992), Besag and Green (1993), Roberts and Smith (1994) and Smith and Roberts (1993).

5.6 DISCUSSION AND FURTHER REFERENCES

5.6.1 An Historical Footnote

Blackwell (1988) gave a very elegant demonstration of the way in which a simple finite additivity argument can be used to give powerful insight into the relation between frequency and belief probability. The calculation involved has added interest in that, according to Stigler (1982), it might very well have been made by Bayes himself. The argument goes as follows. Suppose that the 0-1 observables $x_1, \ldots, x_{n+1}$ are finitely exchangeable. We observe $z = (x_1, \ldots, x_n)$, and wish to evaluate $P(x_{n+1} = 1\,|\,x_1, \ldots, x_n)$. Writing $s = x_1 + \cdots + x_n$ and $p(t) = P(x_1 + \cdots + x_{n+1} = t)$, this conditional probability, by virtue of exchangeability, is easily seen to be equal to
$$\frac{(s+1)\,p(s+1)}{(s+1)\,p(s+1) + (n+1-s)\,p(s)},$$
which, if $p(s) \approx p(s+1)$, is approximately $(s+1)/(n+2)$.
This can be interpreted as follows. If, before observing $x_1, \ldots, x_n$, we considered $s$ and $s+1$ to be about equally plausible as values for $x_1 + \cdots + x_{n+1}$, then the resulting posterior odds for $x_{n+1} = 1$ will be essentially the frequency odds based on the first $n$ trials. Inverting the argument, we see that if one wants to have this “convergence” of beliefs and frequencies, it is necessary that $p(s) \approx p(s+1)$. But what does this entail? Reverting to an infinite exchangeability assumption, and hence the familiar binomial framework, suppose we require that $p(\theta)$ be chosen such that
$$p(t) = \int_0^1 \binom{n+1}{t}\,\theta^t (1-\theta)^{n+1-t}\,p(\theta)\,d\theta$$
does not depend on $t$. An easy calculation shows that this is satisfied if $p(\theta)$ is taken to be uniform on $(0,1)$, the so-called Bayes (or Bayes-Laplace) Postulate. Stigler (1982) has argued that an argument like the above could have been Bayes' motivation for the adoption of this uniform prior.

5.6.2 Prior Ignorance

To many attracted to the formalism of the Bayesian inferential paradigm, the idea of a noninformative prior distribution, representing “ignorance” and “letting the data speak for themselves”, has proved extremely seductive, often being regarded as synonymous with providing objective inferences. It will be clear from the general subjective perspective we have maintained throughout this volume that we regard this search for “objectivity” to be misguided. However, the considerable body of conceptual and theoretical literature devoted to identifying “appropriate” procedures for formulating prior representations of “ignorance” constitutes a fascinating chapter in the history of Bayesian Statistics. In this section we shall provide an overview of some of the main directions followed in this search for a Bayesian “Holy Grail”. It will also be clear from our detailed development in Section 5.4 that we recognise the rather special nature and role of the concept of a “minimally informative” prior specification, appropriately defined!
According to this principle. or the BayesLaplace postulate (see Section 5. the considerable body of conceptual and theoretical literature devoted to identifying “appropriate” procedures for formulating prior representationsof “ignorance” constitutes a fascinating chapter in the history of Bayesian Statistics. the idea of a noninjormarive prior distribution..4 that we recognise the rather special nature and role of the concept of a “minimally informative” prior specification appropriately defined! In any case.2 Prior Ignorance To many attracted to the formalism of the Bayesian inferential paradigm. .6. see Jaynes (1971) for an interesting Bayesian resolution.l)the socalled Bayes (or BayesLaplace) Postulate. . I). seem . h i . 5.
However, Example 5.16 showed that even in simple, finite, discrete cases care can be required in appropriately defining the unknown quantity of interest. Moreover, in countably infinite, discrete cases the uniform (now improper) prior is known to produce unappealing results. Jeffreys (1939/1961, p. 238) suggested, for the case of the integers, the prior $\pi(n) \propto 1/n$. More recently, Rissanen (1983) used a coding theory argument to motivate the prior
$$\pi(n) \propto \frac{1}{n} \times \frac{1}{\log n} \times \frac{1}{\log\log n} \times \cdots$$
for the case of the integers. However, embedding the discrete problem within a continuous framework and subsequently discretising the resulting reference prior for the continuous case may produce better results. If the space, $\Phi$, of $\phi$ values is a continuum (say, the real line), the principle of insufficient reason has been interpreted as requiring a uniform distribution over $\Phi$.
However, a uniform distribution for $\phi$ implies a non-uniform distribution for any non-linear monotone transformation of $\phi$, and thus the Bayes-Laplace postulate is inconsistent, in the sense that, intuitively, “ignorance about $\phi$” should surely imply “equal ignorance” about a one-to-one transformation of $\phi$. More formally, if some procedure yields $p(\phi)$ as a noninformative prior for $\phi$, and the same procedure yields $p(\xi)$ as a noninformative prior for a one-to-one transformation $\xi = \xi(\phi)$ of $\phi$, consistency would seem to demand that $p(\xi)d\xi = p(\phi)d\phi$; thus, a procedure for obtaining the “ignorance” prior should presumably be invariant under one-to-one reparametrisation. Based on these invariance considerations, Jeffreys (1946) proposed as a noninformative prior, with respect to an experiment $e = \{X, \Phi, p(x|\phi)\}$ involving a parametric model which depends on a single parameter $\phi$, the (often improper) density $\pi(\phi) \propto h(\phi)^{1/2}$, where $h(\phi)$ is Fisher's information function. Jeffreys noted that the logarithmic divergence locally behaves like the square of a distance, determined by a Riemannian metric, whose natural length element is $h(\phi)^{1/2}\,d\phi$, and that natural length elements of Riemannian metrics are invariant to reparametrisation. In an illuminating paper, Kass (1989) elaborated on this geometrical interpretation by arguing that, more generally, natural volume elements generate “uniform” measures on manifolds, in the sense that equal mass is assigned to regions of equal volume, the essential property that makes Lebesgue measure intuitively appealing.
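The invariance property can be checked concretely on an illustration of our own, the Bernoulli model, where $h(\theta) = 1/\{\theta(1-\theta)\}$ and, under the logit transformation $\xi = \log\{\theta/(1-\theta)\}$, $h(\xi) = h(\theta)\,(d\theta/d\xi)^2 = \theta(1-\theta)$; Jeffreys' rule applied in either parametrisation then satisfies $p(\xi)d\xi = p(\theta)d\theta$.

```python
import numpy as np

# Bernoulli model: Fisher information in the two parametrisations.
h_theta = lambda th: 1.0 / (th * (1.0 - th))            # h(theta)
theta_of = lambda xi: 1.0 / (1.0 + np.exp(-xi))         # inverse logit
h_xi = lambda xi: theta_of(xi) * (1.0 - theta_of(xi))   # h(xi) = h(theta) (d theta / d xi)^2

th = np.linspace(0.01, 0.99, 99)
xi = np.log(th / (1 - th))
jac = th * (1 - th)                                     # d theta / d xi

# Jeffreys' rule applied separately in each parametrisation...
p_theta = np.sqrt(h_theta(th))
p_xi = np.sqrt(h_xi(xi))

# ...satisfies the consistency requirement p(xi) d xi = p(theta) d theta.
ok = np.allclose(p_xi, p_theta * jac)
```

A uniform prior for $\theta$, by contrast, fails this check, which is exactly the inconsistency of the Bayes-Laplace postulate noted above.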
In his work, Jeffreys explored the implications of such a noninformative prior for a large number of inference problems. He found that his rule (by definition restricted to a continuous parameter) works well in the one-dimensional case, but can lead to unappealing results (Jeffreys, 1939/1961, p. 182) when one tries to extend it to multiparameter situations. The procedure proposed by Jeffreys was rather ad hoc, in that there are many other procedures (some of which he described) which exhibit the required type of invariance; his intuition as to what is required, however, was rather good. Welch and Peers (1963) and Welch (1965) discussed conditions under which there is formal mathematical equivalence between one-dimensional Bayesian credible regions and corresponding frequentist confidence intervals. They showed that, under suitable regularity conditions which imply asymptotic normality, one-sided intervals asymptotically coincide if the prior used for the Bayesian analysis is Jeffreys' prior. Hartigan (1966b) and Peers (1968) discuss two-sided intervals; Peers (1965) later showed that the argument does not extend to several dimensions. Tibshirani (1989), Mukerjee and Dey (1993) and Nicolau (1993) extend the analysis to the case where there are nuisance parameters. Lindley (1961b) argued that, in practice,
one can always replace a continuous range of $\phi$ by discrete values over a grid whose mesh size, $\delta(\phi)$, say, describes the precision of the measuring process, and that a possible operational interpretation of “ignorance” is a probability distribution which assigns equal probability to all points of this grid; arguing as before, in the continuous case this implies a prior proportional to $\delta(\phi)^{-1}$. To determine $\delta(\phi)$ in the context of an experiment $e = \{X, \Phi, p(x|\phi)\}$, Lindley considered the amount of information that $e$ may be expected to provide about $\phi$. This expected information will be independent of $\phi$ if $\delta(\phi) \propto h(\phi)^{-1/2}$, thus defining an appropriate mesh; under suitable regularity assumptions, this turns out to be equivalent to Jeffreys' rule. Similarly, Perks (1947) used an argument based on the asymptotic size of confidence regions to propose a noninformative prior of the form $\pi(\phi) \propto s(\phi)^{-1}$, where $s(\phi)$ is the asymptotic standard deviation of the maximum likelihood estimate of $\phi$. Under regularity conditions which imply asymptotic normality, this again suggests Jeffreys' prior, $\pi(\phi) \propto h(\phi)^{1/2}$. Akaike (1978a) used a related argument to justify Jeffreys' prior as “locally impartial”. Jeffreys' solution for the one-dimensional continuous case has been widely adopted, and a number of alternative justifications of the procedure have been provided.
Hartigan (1965) reported that the prior density which minimises the bias of the estimator $d$ of $\phi$ depends on the loss function $l(d, \phi)$ adopted; if, in particular, one uses the discrepancy measure between $p(x|\phi)$ and $p(x|d)$ as a natural loss function, this implies that $\pi(\phi) = h(\phi)^{1/2}$, which is, again, Jeffreys' prior. Good (1969) derived Jeffreys' prior as the “least favourable” initial distribution with respect to a logarithmic scoring rule. Since the logarithmic score is proper, and hence is maximised by reporting the true distribution, Jeffreys' prior may technically be described, under suitable regularity conditions and for large samples, as a minimax solution to the problem of scientific reporting when the utility function is the logarithmic score function. Hartigan (1983, Chapter 5) defines a similarity measure for events $E$, $F$ to be $P(E \cap F)/P(E)P(F)$, and shows that Jeffreys' prior ensures, asymptotically, constant similarity for current and future observations. Following Jeffreys (1955), Box and Tiao (1973, Section 1.3) argued for selecting a prior by convention, to be used as a standard of reference. They suggested that the principle of insufficient reason may be sensible in location problems, and proposed as a conventional prior a uniform prior for a model parameter $\theta$ such that, for some functions $g$ and $f$, $p(x|\theta) = g\{\xi(\theta) - f(x)\}$, that is, at least approximately, a location parameter family. Using standard asymptotic theory, they showed that, under suitable regularity conditions and for large samples, this will happen if $\xi(\theta) \propto \int h(\theta)^{1/2}\,d\theta$, which implies, once again, Jeffreys' prior $\pi(\theta) \propto h(\theta)^{1/2}$. Kashyap (1971) provided a similar, more detailed argument: an axiom system is used to justify the use of an information measure as a payoff function, and Jeffreys' prior is shown to be a minimax solution in a two-person zero-sum game,
where the statistician chooses the "noninformative" prior and nature chooses the "true" prior.15).
5.6 Discussion and Further References 361

Unfortunately, although many of the arguments summarised above generalise to the multiparameter continuous case, leading to the so-called multivariate Jeffreys' rule

π(θ) ∝ |H(θ)|^{1/2},

where H(θ) is Fisher's information matrix, the results thus obtained typically have intuitively unappealing implications. An example of this, pointed out by Jeffreys himself (Jeffreys, 1939/1961, p. 182), is provided by the simple location-scale problem, where the multivariate rule leads to π(θ, σ) ∝ σ^{-2}, with θ the location and σ the scale parameter; this conflicts with the use of the widely adopted reference prior π(θ, σ) ∝ σ^{-1} (see Example 5.17 in Section 5.4). See, also, Stein (1962).

Example 5.25. (Univariate normal model). Let {x_1, ..., x_n} be a random sample from N(x | μ, λ), and consider σ = λ^{-1/2}, the (unknown) standard deviation. In the case of known mean, μ = 0 say, the appropriate (univariate) Jeffreys' prior is π(σ) ∝ σ^{-1}, and the posterior distribution of σ is such that [Σ_{i=1}^n x_i²]/σ² is χ²_n. In the case of unknown mean, however, if we used the multivariate Jeffreys' prior π(μ, σ) ∝ σ^{-2}, the posterior distribution of σ would be such that [Σ_{i=1}^n (x_i − x̄)²]/σ² is again χ²_n, in that one does not lose any degrees of freedom even though one has lost the knowledge that μ = 0. This conflicts with the use of the widely adopted reference prior π(μ, σ) ∝ σ^{-1}, relative to which [Σ_{i=1}^n (x_i − x̄)²]/σ² is χ²_{n-1}, and is widely recognised as unacceptable.

At this point, one may wonder just what has become of the intuition motivating the arguments outlined above. The kind of problem exemplified above led Jeffreys to the ad hoc recommendation, widely adopted in the literature, of independent a priori treatment of location and scale parameters, applying his rule separately to each of the two subgroups of parameters, and then multiplying the resulting forms together to arrive at the overall prior specification. For an illustration of this, see Geisser and Cornfield (1963); for an elaboration of the idea, see Zellner (1986a).

Similar problems arise with other intuitively appealing desiderata. For example, the Box and Tiao suggestion of a uniform prior following transformation to a parametrisation ensuring data translation generalises, in the multiparameter setting, to the requirement of uniformity following a transformation which ensures that credible regions are of the same size. Although the implied information limits are mathematically well-defined in one dimension, in higher dimensions the forms obtained may depend on the path followed to obtain the limit. For a recent reconsideration and elaboration of these ideas, see Kass (1990), who extends the analysis by conditioning on an ancillary statistic.
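The sampling-theory fact underlying Example 5.25, namely that the centred sum of squares scaled by the true variance behaves as χ² on n − 1 (not n) degrees of freedom, is easily checked by simulation. The sketch below (illustrative values of our own choosing, not from the text) estimates its mean, which should be close to n − 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma, reps = 10, 5.0, 2.0, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
# centred sum of squares, scaled by the true variance sigma^2
s2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / sigma**2
mean_s2 = s2.mean()   # close to n - 1 = 9, not n = 10: one degree of freedom is lost
```

It is exactly this lost degree of freedom that the multivariate Jeffreys prior fails to deliver in the posterior.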
. in several dimensions. Villegas ( 1969. Other informationbased suggestions are those of Eaton (1 982). Geisser ( 1979) and Torgesen ( 1981 ). Such approaches include the use of limiting forms of conjugate priors. Stone ( 1965)recognised that. Novick ( 1969). The standard prior recommended by group invariance arguments is T ( I L . Geisser (1984). 1981). if a group structure is present. Stone ( 1970).5 Inference are of the same size. The problem. a predictivistic version of the principle of insufficient reason. approaches which do not have this property should be regarded as unsatisfactory. but unappealing results. the right Haar measures coincides with the relevant Jeffreys' prior. Even when the parameter space may be considered as a group of transformations there is no definitive answer. Partially satisfactory results have nevertheless been obtained in multiparameter problems where the parameter space can be considered as a group of transformations of the sample space.26.}be a random sample from a normal distribution "V(z I 11. of course. the corresponding right Haar measure is the only prior for which such a desirable convergence is obtained. . as in Haldane (1948).. lnvariance considerations within such a group suggest the use of relurivefyinvuriant(Hartigan. Further developments involving Haar measures are provided by Zidek ( 1969). Dawid ( 1983b)provides an excellent review of work up to the early 1980's. 1964)priors like the Haar measures. In such situations. Stone and Dawid ( 1 972) show that the posterior . and different forms of informationtheoretical arguments.. producing priors that can have finite support (Berger et ul. 1977). For some undesirable consequences of the f e j Haar measure see Bemardo (1978b). it follows that. such regions can be of the same size but very different in form. Although this gives adequate results if one wants to make inferences about either 11 or 17. a large group of interesting models do not have any group structure. 
it should be possible to approximate the results obtained using a noninformative prior by those obtained using a convenient sequence of proper priors. such as those put forward by Zellner ( 1977. . so that these arguments cannot produce general solutions. Let 2 = { x . in those onedimensional problems for which a group of transformations docs exist. Florens ( 1978. It is reassuring that. 1989). A).x. is that. 1982). Chapter 10) and Piccinato ( 1973. Example 5. Maximising the expected information (asopposed to maximising the expected missing information) gives invariant. This idea was pioneered by Barnard ( 1952). 1991 ). However. whatever their apparent motivating intuition. 1977a. 1977b. I97 I . . the right Haar measures are the obvious choices and yet even these are open to criticism. Spall and Hill ( 1990) and Rodriguez ( 1991 ). If this is conceded. (Standardised mean). O )= o' where X = n'. Novick and Hall ( 1965). . in an appropriate sense. DeGroot ( 1970. Jeffreys' original requirement of invariance under reparametrisation remains perhaps the most intuitively convincing. Chang and Villegas (1986) and Chang and Eaves ( I 990). is quite unsatisfactory if inferences about the it standardised mean d = / L / O are required. He went on to show that.
distribution of φ obtained from such a prior depends on the data only through a statistic t whose sampling distribution, p(t | μ, σ) = p(t | φ), only depends on φ. One would, therefore, expect to be able to "match" the original inferences about φ by the use of p(t | φ) together with some appropriate prior for φ. However, no such prior exists. On the other hand, the reference prior relative to the ordered partition (φ, σ) is (see Example 5.18)

π(φ, σ) = (2 + φ²)^{-1/2} σ^{-1},

and the corresponding posterior distribution for φ takes the form π(φ | z) ∝ π(φ)[p(t | φ)]. We observe that the factor in square brackets is proportional to p(t | φ), and thus the inconsistency disappears. This type of marginalisation paradox, further explored by Dawid, Stone and Zidek (1973), appears in a large number of multivariate problems and makes it difficult to believe that, for any given model, a single prior may be usefully regarded as "universally" noninformative; Jaynes (1980) disagrees. An acceptable general theory for noninformative priors should be able to provide consistent answers to the same inference problem whenever this is posed in different, but equivalent, forms. Although this idea has failed to produce a constructive procedure for deriving priors, it may be used to discard those methods which fail to satisfy this rather intuitive requirement.

Example 5.27. (Correlation coefficient). Let {(x_1, y_1), ..., (x_n, y_n)} be a random sample from a bivariate normal distribution, and suppose that inferences about the correlation coefficient ρ are required. It may be shown that if the prior is of the form π(μ_1, μ_2, σ_1, σ_2, ρ) = π(ρ)(σ_1 σ_2)^{-a}, which includes all proposed "noninformative" priors for this model that we are aware of, then the posterior distribution of ρ is given by π(ρ | x, y) = π(ρ | r),
364 5 Inference

where

r = Σ(x_i − x̄)(y_i − ȳ) / {Σ(x_i − x̄)² Σ(y_i − ȳ)²}^{1/2}

is the sample correlation coefficient. This posterior distribution only depends on the data through r; thus, with this form of prior, r is sufficient. On the other hand, the sampling distribution of r, p(r | μ_1, μ_2, σ_1, σ_2, ρ) = p(r | ρ), only depends on ρ, and its expression involves the hypergeometric function F (see Lindley, 1965, pp. 215–219). Hence, one would expect to be able to match, using this reduced model, the posterior distribution π(ρ | r) given previously. Comparison between π(ρ | r) and p(r | ρ) shows that this is possible if and only if a = 1, with π(ρ) ∝ (1 − ρ²)^{-1}. Thus, to avoid inconsistency, the joint reference prior must be of the form

π(μ_1, μ_2, σ_1, σ_2, ρ) = (1 − ρ²)^{-1} σ_1^{-1} σ_2^{-1},

which is precisely (see Example 5.22) the reference prior relative to the natural order {ρ, μ_1, μ_2, σ_1, σ_2}. Moreover, using the transformations δ = tanh^{-1} ρ and t = tanh^{-1} r, it is easily checked that Jeffreys' multivariate prior, and the "two-step" Jeffreys' multivariate prior which separates the location and scale parameters, correspond to different powers of (1 − ρ²), and hence differ from the reference prior given above. Once again, this example suggests that different noninformative priors may be appropriate depending on the particular function of interest or, more generally, on the ordering of the parameters. Although marginalisation paradoxes disappear when one uses proper priors, to use proper approximations to noninformative priors as an approximate description of "ignorance" does not solve the problem either. For further detailed discussion of this example, see Bayarri (1981).
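The tanh transformations used above underlie the classical variance-stabilising treatment of r: t = tanh^{-1}(r) is approximately normal about tanh^{-1}(ρ) with variance close to 1/(n − 3). A small simulation confirms this standard fact (illustrative values of our own choosing; this is not a computation quoted in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, reps = 30, 0.6, 50_000

cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
xc = z[..., 0] - z[..., 0].mean(axis=1, keepdims=True)
yc = z[..., 1] - z[..., 1].mean(axis=1, keepdims=True)
r = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

t = np.arctanh(r)          # Fisher's z-transform of the sample correlation
target = np.arctanh(rho)   # tanh^{-1}(rho)
```

The near-constant variance on the tanh^{-1} scale is one reason this parametrisation is convenient when comparing candidate priors for ρ.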
5.6 Discussion and Further References 365

Example 5.28. (Stein's paradox). Let z = {x_1, ..., x_n} be a random sample from a multivariate normal distribution N_k(x | μ, I_k), with μ = {μ_1, ..., μ_k}. Let x̄_i be the mean of the n observations from coordinate i, and let t = Σ_{i=1}^k x̄_i². The universally recommended "noninformative" prior for this model is π(μ_1, ..., μ_k) = 1. However, if inferences about φ = Σ_{i=1}^k μ_i² are desired, the use of this prior overwhelms what the data have to say about φ. Indeed, with such a prior, the posterior distribution of nφ is a noncentral χ² distribution with k degrees of freedom and non-centrality parameter nt (see Appendix B), so that

E[φ | z] = t + k/n, V[φ | z] = 2(k + 2nt)/n²,

whereas the sampling distribution of nt is a noncentral χ² distribution with k degrees of freedom and non-centrality parameter nφ, so that E[t | φ] = φ + k/n. Thus, with, say, k = 100, n = 1 and t = 200, we have E[φ | z] ≈ 300, whereas the unbiased estimator based on the sampling distribution gives φ̂ = t − k/n = 100. It may be shown that the reference posterior for φ relative to the ordered partition {φ, ω}, obtained by reparametrising to polar coordinates in the full model, is π(φ | z) ∝ π(φ) p(t | φ), whose mode is close to φ̂ and whose asymptotic form is normal, centred near φ̂; the reference prior may, in turn, be approximated by a proper density in which a parameter λ is taken to be very small. For further details, see Stein (1959), Efron (1973) and Bernardo (1979b).

Naive use of apparently "noninformative" prior distributions can lead to posterior distributions whose corresponding credible regions have untenable coverage probabilities, in the sense that, for some region C, the corresponding posterior probabilities P(C | z) may be completely different from the conditional values P(C | θ) for almost all θ values. Such a phenomenon is often referred to as strong inconsistency (see, for example, Stone, 1976). An illuminating example is provided by the reanalysis by Bernardo (1979b, reply to the discussion) of Stone's (1976) Flatland example. However, by carefully distinguishing between parameters of interest and nuisance parameters, reference analysis avoids this type of inconsistency. For further discussion of strong inconsistency and related topics, see Stone (1976), Bernardo (1979b) and Ferrándiz (1982).
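The moment identity driving Example 5.28, E[t | φ] = φ + k/n, can be checked directly by simulation. In the sketch below (hypothetical numbers of our own choosing, mirroring the example: k = 100, n = 1, with φ fixed at 100), the long-run average of t sits near φ + k/n = 200:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n, reps = 100, 1, 50_000

mu = rng.normal(size=k)
mu *= np.sqrt(100.0 / (mu**2).sum())      # rescale so phi = sum of mu_i^2 = 100
phi = (mu**2).sum()

xbar = rng.normal(mu, 1.0 / np.sqrt(n), size=(reps, k))
t = (xbar**2).sum(axis=1)                 # t = sum of squared coordinate means
mean_t = t.mean()                         # close to phi + k/n = 200
```

So an observed t near 200 in fact corresponds to φ = 100, while the flat-prior posterior expectation E[φ | z] = t + k/n ≈ 300 moves in exactly the wrong direction, which is the essence of the paradox.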
Jaynes (1968) introduced a more general formulation of the problem: he allowed for the existence of a certain amount of initial "objective" information and then tried to determine a prior which reflected this initial information. Jaynes considered the entropy of a distribution,

H{p(φ)} = −∫ p(φ) log p(φ) dφ,

to be the appropriate measure of uncertainty, subject to any "objective" information one might have, and proposed using the prior which maximises this expression. If no such information exists and φ can only take a finite number of values, Jaynes' maximum entropy solution reduces to the Bayes-Laplace postulate. His arguments are quite convincing in the finite case, but cannot be extended to the continuous case, since, if φ is continuous, the non-invariant entropy functional no longer has a sensible interpretation in terms of uncertainty. Jaynes' solution is to introduce a "reference" density π(φ) in order to define an "invariantised" entropy, and to use the prior which maximises this expression; however, π(φ) must itself be a representation of ignorance about φ, so that no progress has been made. Jaynes suggests invariance arguments to select the reference density but, unfortunately, no general procedure is proposed (see, also, Csiszár, 1985).

The quest for noninformative priors could be summarised as follows.

(i) In the finite case, Jaynes' principle of maximising the entropy is convincing.
(ii) In one-dimensional continuous regular problems, Jeffreys' prior is appropriate; if a convenient group of transformations is present, this typically coincides with the relevant (right) Haar measure.
(iii) The infinite discrete case can often be handled by suitably embedding the problem within a continuous framework.
(iv) In continuous multiparameter situations there is no hope for a single, unique, "noninformative prior", appropriate for all the inference problems within a given model. To avoid having the prior dominating the posterior for some function φ of interest, the prior has to depend not only on the model but also on the parameter of interest or, more generally, on some notion of the order of importance of the parameters.

The reference prior theory introduced in Bernardo (1979b), and developed in detail in Section 5.4, avoids most of the problems encountered with other proposals. It reduces to Jaynes' form in the finite case and to Jeffreys' form in one-dimensional regular continuous problems, and it avoids marginalisation paradoxes by insisting that the reference prior be tailored to the parameter of interest. However, subsequent work by Berger and Bernardo (1989) has shown that the heuristic arguments in Bernardo (1979b) can be misleading in complicated situations, thus necessitating more precise definitions; see Berger and Bernardo (1992a, 1992b).

Context-specific "noninformative" Bayesian analyses have also been produced for specific classes of problems, with no attempt to provide a general theory. These include dynamic models (Pole and West, 1989) and finite population survey sampling (Meeden and Vardeman, 1991).
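Jaynes' maximum entropy construction in the finite case can be sketched concretely (our own illustration, not an example from the text): with a mean constraint, the maximising distribution has the exponential form p_i ∝ exp(λ x_i), and λ can be found by bisection since the implied mean is strictly increasing in λ. The die with prescribed mean 4.5 is the classical illustration:

```python
import numpy as np

def maxent_given_mean(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    # maximum entropy distribution on a finite set, subject to a fixed mean:
    # p_i proportional to exp(lam * x_i); solve for lam by bisection
    # (the implied mean is strictly increasing in lam)
    x = np.asarray(values, dtype=float)

    def mean_for(lam):
        w = np.exp(lam * (x - x.max()))   # subtract max for numerical stability
        p = w / w.sum()
        return float((p * x).sum())

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = np.exp(lam * (x - x.max()))
    return w / w.sum()

faces = [1, 2, 3, 4, 5, 6]
p_die = maxent_given_mean(faces, 4.5)     # die constrained to have mean 4.5
p_unif = maxent_given_mean(faces, 3.5)    # unconstraining case: recovers the uniform
```

With no effective constraint the solution is uniform, which is exactly the Bayes-Laplace postulate the text refers to; the continuous analogue has no such invariant solution.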
Subsequently, Berger and Bernardo (1992a, 1992b) showed that the partition into parameters of interest and nuisance parameters may not go far enough, and that reference priors should be viewed relative to a given ordering or, more generally, a given ordered grouping, of the parameters. This approach was described in detail in Section 5.4.

"Noninformative" distributions have sometimes been criticised on the grounds that they are typically improper and may lead, for instance, to inadmissible estimates (see, for example, Stein, 1956). Posteriors obtained from "noninformative" priors need not themselves be admissible, but only arbitrarily close to admissible posteriors; regarded as a "baseline" for admissible inferences, sensible "noninformative" priors may be seen to be, in an appropriate sense, limits of proper priors (Stone, 1963, 1965, 1970; Akaike, 1980a).

A completely different objection to such approaches lies in the fact that, for continuous parameters, noninformative priors depend on the likelihood function. For many subjectivists, the initial density p(φ) is a description of the opinions held about φ, independent of the experiment performed, and this dependence is recognised to be potentially inconsistent with a personal interpretation of probability: why should one's knowledge, or ignorance, of a quantity depend on the experiment being used to determine it? (Lindley, 1972, p. 71.)

"In general we feel that it is sensible to choose a noninformative prior which expresses ignorance relative to information which can be supplied by a particular experiment. If the experiment is changed, then the expression of relative ignorance can be expected to change correspondingly." (Box and Tiao, 1973, p. 46.)

Regarded, as here, as technical devices to produce reference posteriors, we would accept this argument. Posteriors obtained from actual prior opinions could then be compared with those derived from a reference analysis, in order to assess the relative importance of the initial opinions on the final inference. Moreover, priors which reflect knowledge of the experiment can sometimes be genuinely appropriate in Bayesian inference, and may also have a useful role to play (see, for example, the discussion of stopping rules in Section 5.1.4).

In any case, there can be no final word on this topic! For example, recent work by Eaton (1992), Clarke and Wasserman (1993), George and McCulloch (1993b) and Ye (1993) seems to open up new perspectives and directions; in particular, Ye (1993) derives reference priors for sequential experiments.

5.6.3 Robustness

In Chapter 4, we noted that some aspects of model specification, either for the parametric model or the prior distribution components, can seem arbitrary.
As an example, we cited the case of the choice between normal and Student-t distributions as a parametric model component to represent departures of observables from their conditional expected values. In this section, we shall provide some discussion of how insight and guidance into appropriate choices might be obtained.

We begin our discussion with a simple, direct approach to examining the ways in which a posterior density for a parameter depends on the choices of parametric model or prior distribution components. For convenience of exposition, and perhaps because the prior component is often felt to be the less secure element in the model specification, we shall focus the following discussion on the sensitivity of characteristics of p(θ | x) to the choice of p(θ); similar ideas apply to the choice of p(x | θ).

Consider, for simplicity, a single observable x ∈ R having a parametric density p(x | θ), with θ ∈ R having prior density p(θ). The mechanism of Bayes' theorem involves multiplication of the two model components, followed by normalisation by p(x) = ∫ p(x | θ) p(θ) dθ, which is, perhaps, a somewhat "opaque" operation from the point of view of comparing specifications of p(x | θ) or p(θ) on a "what if?" basis. However, suppose we take logarithms in Bayes' theorem and subsequently differentiate with respect to θ. This now results in a linear form,

(∂/∂θ) log p(θ | x) = (∂/∂θ) log p(x | θ) + (∂/∂θ) log p(θ).

The first term on the right-hand side is (apart from a sign change) a quantity known in classical statistics as the efficient score function (see, for example, Cox and Hinkley, 1974); considered as a function of θ, the likelihood is the quantity which transforms the prior into the posterior, and hence opens the way to insight into the effect of a particular choice of p(x | θ) given the form of p(θ). Conversely, examination of the second term on the right-hand side, for given p(x | θ), may provide insight into the implications of the mathematical specification of the prior. See, for example, Ramsey and Novick (1980) and Smith (1983).

In this section, we shall illustrate these ideas by considering the form of the posterior mean for θ when p(x | θ) = N(x | θ, λ), with x̄ denoting the mean of n independent observables from a normal distribution with mean θ and precision λ, and with θ ∈ R having prior density p(θ) of "arbitrary" form. Defining

p(x̄) = ∫ N(x̄ | θ, nλ) p(θ) dθ, s(x̄) = (∂/∂x̄) log p(x̄),
it can be shown (see, for example, Pericchi and Smith, 1992) that

E(θ | x̄) = x̄ + (nλ)^{-1} s(x̄).

Suppose we carry out a "what if?" analysis by asking how the behaviour of the posterior mean depends on the mathematical form adopted for p(θ).

What if we take p(θ) to be normal? With p(θ) = N(θ | μ, λ₀), μ ∈ R, the evaluation of p(x̄) and s(x̄) is possible: the reader can easily verify that in this case p(x̄) will be normal, and hence s(x̄) will be a linear combination of x̄ and the prior mean. The formula given for E(θ | x̄) therefore reproduces the weighted average of sample and prior means that we obtained in Section 5.2.

What if we take p(θ) to be Student-t? With p(θ) = St(θ | μ, λ₀, α), for some α > 0, the exact treatment of p(x̄) and s(x̄) becomes intractable. However, detailed, but tedious, analysis (Pericchi and Smith, 1992) provides an approximation, from which we see that for x̄ − μ very small the posterior mean is approximately linear in x̄, whereas for x̄ − μ very large the posterior mean approaches x̄.

What if we take p(θ) to be double-exponential? After some algebra (see Pericchi and Smith, 1992) it can be shown that, in this case,

E(θ | x̄) = w(x̄)(x̄ + b) + [1 − w(x̄)](x̄ − b),

where w(x̄), 0 ≤ w(x̄) ≤ 1, is a weight function and b is proportional to the scale parameter of the prior.

Examination of the three forms for E(θ | x̄) reveals striking qualitative differences. In the case of the normal, the difference between the posterior mean and x̄ is unbounded in x̄ − μ, the departure of the observed mean from the prior mean. In the case of the Student-t, the posterior mean is approximately linear in x̄ near the prior mean, but approaches x̄ for large departures. In the case of the double-exponential, the posterior mean is bounded between limits equal to x̄ plus or minus a constant.
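These qualitative differences are easy to reproduce numerically. The sketch below (our own illustration, with n = 1, λ = 1, prior location 0 and unit scales assumed throughout, and a Cauchy density standing in for the Student-t case) computes E(θ | x) by brute-force quadrature for the three prior shapes:

```python
import numpy as np

GRID = np.linspace(-200.0, 200.0, 200_001)

def posterior_mean(x, log_prior, lam=1.0):
    # p(theta | x) proportional to N(x | theta, lam) * p(theta);
    # the posterior mean is computed by simple quadrature on a fixed grid
    logw = -0.5 * lam * (x - GRID) ** 2 + log_prior(GRID)
    w = np.exp(logw - logw.max())
    return float((GRID * w).sum() / w.sum())

log_normal  = lambda th: -0.5 * th**2          # N(0, 1) prior, up to a constant
log_student = lambda th: -np.log1p(th**2)      # Cauchy prior (Student-t, alpha = 1)
log_dexp    = lambda th: -np.abs(th)           # double-exponential prior, scale 1

# shift of the posterior mean away from x, for increasingly "outlying" x:
# normal prior -> grows without bound; Cauchy prior -> vanishes;
# double-exponential prior -> levels off at the constant -1
shifts = {x: [posterior_mean(x, lp) - x for lp in (log_normal, log_student, log_dexp)]
          for x in (2.0, 20.0, 100.0)}
```

Tracing the shift as x grows makes the bounded-influence behaviour of the heavier-tailed priors immediately visible.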
Consideration of these qualitative differences might provide guidance regarding an otherwise arbitrary choice if, for example, one knew how one would like the Bayesian learning mechanism to react to an "outlying" x, far from μ. See Jeffreys (1939/1961) for seminal ideas relating to the effect of the tail-weight of the distribution of the parametric model on posterior inferences, together with the seminal papers of Box and Tiao (1962, 1964). Other relevant references include Masreliez (1975), O'Hagan (1979, 1981, 1988b), West (1981), Main (1988), Polson (1991), Gordon and Smith (1993) and O'Hagan and Le (1994). See Smith (1983) and Pericchi et al. (1993) for further discussion and elaboration.

The approach illustrated above is well-suited to probing qualitative differences in the posterior by considering, individually, the effects of a small number of potential alternative choices of model component (parametric model or prior distribution). Suppose, instead, that someone has in mind a specific candidate component specification, p₀ say, but is all too aware that aspects of the specification have involved somewhat arbitrary choices. It is then natural to be concerned about whether posterior conclusions might be highly sensitive to the particular specification p₀, viewed in the context of alternative choices in an appropriately defined neighbourhood of p₀.

In the case of specifying a parametric component, for example an "error" model for differences between observables and their (conditional) expected values, such concern might be motivated by definite knowledge of symmetry and unimodality, but an awareness of the arbitrariness of choosing a conventional distributional form such as normality. Here, a suitable neighbourhood might be formed by taking p₀ to be normal and forming a class of distributions whose tail-weights deviate (lighter and heavier) from normal. In the case of specifying a prior component, such concern might be motivated by the fact that elicitation of prior opinion has only partly determined the specification (for example, by identifying a few quantiles), with considerable remaining arbitrariness in "filling out" the rest of the distribution. Here, a suitable neighbourhood of p₀ might consist of a class of priors all having the specified quantiles, but with other characteristics varying: see, for example, O'Hagan and Berger (1988).

From a mathematical perspective, this formulation of the robustness problem presents some intriguing challenges. How should one formulate interesting neighbourhood classes of distributions? How should one calculate, with respect to such prior classes, bounds on posterior quantities of interest, such as expectations or probabilities? Should neighbourhoods be defined parametrically or nonparametrically? And, if nonparametrically, what measures of distance should be used to define a neighbourhood "close" to p₀? Should the elements, p, of the neighbourhood be those such that the density ratio p/p₀ is bounded in some sense? Or such that the maximum difference in the probability assigned to any event under p and p₀ is bounded? Or such that p can be written as a "contamination" of p₀,

p = (1 − ε)p₀ + εq,

for small ε and q belonging to a suitable class?
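For the ε-contamination class, even the single-contaminant computation is instructive: the posterior under p = (1 − ε)p₀ + εq is the corresponding mixture of the two single-prior posteriors, with mixture weights updated by the marginal likelihoods. A minimal sketch under assumed normal forms (all constants illustrative, not from the text):

```python
import math

def normal_pdf(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def contaminated_posterior_mean(x, eps=0.1, v0=1.0, vq=100.0):
    # x | theta ~ N(theta, 1); base prior p0 = N(0, v0), contaminant q = N(0, vq)
    m0 = normal_pdf(x, 1.0 + v0)     # marginal density of x under p0
    mq = normal_pdf(x, 1.0 + vq)     # marginal density of x under q
    lam = (1.0 - eps) * m0 / ((1.0 - eps) * m0 + eps * mq)   # updated weight on p0
    mean0 = x * v0 / (1.0 + v0)      # posterior mean under p0 alone
    meanq = x * vq / (1.0 + vq)      # posterior mean under q alone
    return lam * mean0 + (1.0 - lam) * meanq
```

For moderate x the answer is close to the p₀ result, but for outlying x the contaminant dominates and the posterior mean moves towards x; robustness analyses of this type trace such quantities, and their bounds, as q ranges over the class.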
At the time of writing, this is an area of intensive research but, as yet, few issues seem to be resolved and we shall not, therefore, attempt a detailed overview. There are excellent reviews by Berger (1984a, 1985a, 1990, 1994), which together provide a wealth of further references. Relevant references include, for instance, Edwards et al. (1963), Hartigan (1969, 1983), Dempster (1973, 1975), Dawid (1973), Hill (1975), Meeden and Isaacson (1977), Rubin (1977, 1988b), Kadane and Chuang (1978), Berger (1980, 1982), DeRobertis and Hartigan (1981), Kadane (1984), Berger and Berliner (1986), Kempthorne (1986), Berger and O'Hagan (1988), Cuevas and Sanz (1988), Pericchi and Nazaret (1988), Polasek and Pötzelberger (1988, 1990), Carlin and Dempster (1989), Delampady (1989), Sivaganesan and Berger (1989, 1993), Wasserman (1989, 1992a, 1992b), Berliner and Goel (1990), Delampady and Berger (1990), Doksum and Lo (1990), Wasserman and Kadane (1990, 1992), Angers and Berger (1991), Berger and Fan (1991), Berger and Mortera (1991b, 1992b), Lavine (1991a, 1991b), Lavine et al. (1991, 1993), Moreno and Cano (1991), Pericchi and Walley (1991), Pötzelberger and Polasek (1991), Ríos (1990, 1992), Sivaganesan (1991), Walley (1991), Berger and Jefferys (1992), Gilio (1992b), Gómez-Villegas and Main (1992), Moreno and Pericchi (1992, 1993), Nau (1992), Sansó and Pericchi (1992), Liseo et al. (1993), Osiewalski and Steel (1993), Bayarri and Berger (1994), de la Horra and Fernández (1994), Delampady and Dey (1994), O'Hagan (1994b), Pericchi and Pérez (1994), Ríos and Martín (1994) and Salinetti (1994).

Finally, in the case of a large data sample, one might wonder whether the data themselves could be used to suggest a suitable form of parametric model component, thus removing the need for detailed specification and hence the arbitrariness of the choice. The so-called Bayesian bootstrap provides such a possible approach; see, in particular, Rubin (1981) and Lo (1987, 1988). However, since it is a heavily computationally based method, we shall defer discussion to the volume Bayesian Computation. The term bootstrap is more familiar to most statisticians as a computationally intensive frequentist data-based simulation method for statistical inference, that is, as a computer-based method for assigning frequentist measures of accuracy to point estimates. For an introduction to the method, and to the related technique of jackknifing, see Efron (1982); for a recent textbook treatment, see Efron and Tibshirani (1993).
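Rubin's (1981) Bayesian bootstrap, mentioned above, treats the observed values as the only possible ones and places a Dirichlet(1, ..., 1) posterior on their probabilities; functionals such as the mean then acquire a posterior distribution by reweighting. A minimal sketch with made-up data (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.array([1.2, 3.4, 0.7, 2.2, 5.1, 2.9, 1.8, 4.0])   # illustrative sample
reps = 20_000

# one Bayesian-bootstrap replicate = one Dirichlet(1,...,1) weight vector
w = rng.dirichlet(np.ones(len(data)), size=reps)
post_means = w @ data          # posterior draws of the population mean
centre = post_means.mean()     # close to the sample mean
spread = post_means.var()      # close to the sample's population variance / (n + 1)
```

No parametric model component is ever specified, which is exactly the appeal, and the limitation, noted in the text.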
5.6.4 Hierarchical and Empirical Bayes

In Section 4.6, we motivated and discussed model structures which take the form of an hierarchy. Expressed in terms of generic densities, a simple version of such an hierarchical model has the form

p(z | θ) = ∏_{i=1}^k p(z_i | θ_i), θ = (θ_1, ..., θ_k),
p(θ | φ),
p(φ).

The basic interpretation is as follows. Observables z_1, ..., z_k are available from k different, but related, sources: for example, k individuals in a homogeneous population, or k clinical trial centres involved in the same study. The first stage of the hierarchy specifies parametric model components for each of the k observables. But because of the "relatedness" of the k observables, the parameters θ_1, ..., θ_k are themselves judged to be exchangeable, and the second and third stages of the hierarchy thus provide a prior for θ of the familiar mixture representation form

p(θ) = ∫ p(θ | φ) p(φ) dφ.

Here, the "hyperparameter" φ typically has an interpretation in terms of characteristics, for example the mean and covariance, of the population (of individuals, or trial centres) from which the k units are drawn. In many applications, it may be of interest to make inferences both about the unit characteristics, θ_1, ..., θ_k, and about the population characteristics, φ. In either case, straightforward probability manipulations involving Bayes' theorem provide the required posterior inferences as follows:

p(θ | z) = ∫ p(θ | φ, z) p(φ | z) dφ,

where

p(θ | φ, z) ∝ p(z | θ) p(θ | φ), p(φ | z) ∝ p(z | φ) p(φ),

and

p(z | φ) = ∫ p(z | θ) p(θ | φ) dθ.

Of course, actual implementation requires the evaluation of the appropriate integrals, and this may be non-trivial in many cases. However, as we shall see in the volumes Bayesian Computation and Bayesian Methods, such models can be implemented in a fully Bayesian way using appropriate computational techniques.
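For one tractable special case (our own illustration: known unit observation variance σ², known population variance τ², and a uniform prior on the population mean φ), the integrals above are available in closed form, with p(z_i | φ) = N(z_i | φ, σ² + τ²) and each E[θ_i | z] shrinking the unit's value towards the overall mean. The sketch checks the closed form against direct numerical integration over φ:

```python
import numpy as np

def shrinkage_means(z, sigma2, tau2):
    # closed form: z_i | th_i ~ N(th_i, sigma2), th_i | phi ~ N(phi, tau2),
    # flat p(phi)  =>  E[th_i | z] = w z_i + (1 - w) zbar, w = tau2/(tau2+sigma2)
    w = tau2 / (tau2 + sigma2)
    return w * z + (1.0 - w) * z.mean()

def shrinkage_means_numeric(z, sigma2, tau2, grid=np.linspace(-50.0, 50.0, 20_001)):
    # p(phi | z) proportional to prod_i N(z_i | phi, sigma2 + tau2), flat prior
    logpost = -0.5 * ((z[:, None] - grid[None, :]) ** 2).sum(axis=0) / (sigma2 + tau2)
    wphi = np.exp(logpost - logpost.max())
    wphi /= wphi.sum()
    # E[th_i | z] is the average of E[th_i | phi, z] against p(phi | z)
    w = tau2 / (tau2 + sigma2)
    e_phi = float((grid * wphi).sum())
    return w * z + (1.0 - w) * e_phi

z = np.array([1.0, 3.0, 5.0, 10.0])
exact = shrinkage_means(z, sigma2=1.0, tau2=2.0)
numeric = shrinkage_means_numeric(z, sigma2=1.0, tau2=2.0)
```

The agreement illustrates the mixture identities displayed above; with unknown variances the integrals lose their closed form, which is where the computational techniques referred to in the text come in.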
5.6 Discussion and Further References 373

A detailed analysis of hierarchical models will be provided in those volumes; here we confine ourselves to some key references. For wholehearted Bayesian approaches to hierarchical models, these include Good (1965), Hill (1969, 1970), Ericson (1969a, 1969b), Lindley (1971), Lindley and Smith (1972), Smith (1973a, 1973b), Goldstein and Smith (1974), Leonard (1975), Mouchart and Simar (1980), Goel and DeGroot (1981), Goel (1983), Dawid (1988b), Berger and Robert (1990), Pérez and Pericchi (1992), Schervish et al. (1992), van der Merwe and van der Merwe (1992), Wolpert and Warren-Hicks (1992) and George et al. (1993, 1994).

A tempting approximation is suggested by the first line of the analysis above. We note that if p(φ | z) were fairly sharply peaked around its mode, φ* say, we would have

p(θ | z) ≈ p(θ | φ*, z).

The form that results can be thought of as if we first use the data to estimate φ, and then use Bayes' theorem for the first two stages of the hierarchy, with φ* as a "plug-in" value. The analysis thus has the flavour of a Bayesian analysis, but with an "empirical" prior based on the data. Such short-cut approximations to a fully Bayesian analysis of hierarchical models have become known as Empirical Bayes methods, following the line of development of Efron and Morris (1972, 1975) and Morris (1983). This is actually slightly confusing, since the term was originally used to describe frequentist estimation of the second-stage distribution: see Robbins (1955, 1964, 1983) and Gilliland et al. (1982). An eclectic account of empirical Bayes methods is given by Maritz and Lwin (1989). The naive approximation outlined above is clearly deficient in that it ignores uncertainty in φ, and much of the development following Morris (1983) has been directed to finding more defensible approximations: see, for example, Deely and Lindley (1981), Kass and Steffey (1989) and Ghosh (1992a).
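A minimal numeric sketch of the plug-in idea (our own example: normal means with unit sampling variance and a N(0, τ²) population, with τ² estimated by matching the marginal second moment):

```python
import numpy as np

rng = np.random.default_rng(3)
k, tau2_true = 500, 1.0

theta = rng.normal(0.0, np.sqrt(tau2_true), size=k)   # unit-level effects
z = rng.normal(theta, 1.0)                            # one observation per unit

# "empirical" plug-in estimate of the hyperparameter:
# marginally z_i ~ N(0, 1 + tau2), so match the observed second moment
tau2_hat = max((z**2).mean() - 1.0, 0.0)
shrink = tau2_hat / (1.0 + tau2_hat)
theta_eb = shrink * z            # plug-in posterior means E[theta_i | z, tau2_hat]

mse_raw = ((z - theta) ** 2).mean()
mse_eb = ((theta_eb - theta) ** 2).mean()   # typically much smaller than mse_raw
```

As the text notes, this ignores the uncertainty in the hyperparameter; with k = 500 units that matters little here, but approximations such as those of Kass and Steffey (1989) are designed to account for it.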
Further areas include: Contingency Tables (Lindley, 1964; Good, 1965, 1967; Leonard, 1975); Control Theory (Aoki, 1967; Sawaragi et al., 1967); Econometrics (Mills, 1992; Steel, 1992); Finite Population Sampling (Basu, 1969, 1971; Ericson, 1969b, 1988; Godambe, 1969; Smouse, 1984; Lo, 1986); Image Analysis (Geman and Geman, 1984; Besag, 1986, 1989; Geman, 1988; Mardia et al., 1992; Grenander and Miller, 1994); Law (Dawid, 1994); Meta-Analysis (DuMouchel and Harris, 1983; Wolpert and Warren-Hicks, 1992; Smith and Gathercole, 1993); Missing Data (Little and Rubin, 1987; Rubin, 1987; Meng and Rubin, 1992); Mixtures (Titterington et al., 1985; Bernardo and Girón, 1988; West, 1992; Robert and Soubiran, 1993; Diebolt and Robert, 1994); Multivariate Analysis (Brown et al., 1994); Quality Assurance (Wetherill and Campling, 1966; Hald, 1968; Booth and Smith, 1976; Irony et al., 1992; Singpurwalla and Soyer, 1992); Splines (Wahba, 1978, 1983, 1988; Gu, 1992; Ansley et al., 1993; Cox, 1993); Stochastic Approximation (Makov, 1988); and Time Series and Forecasting (Meinhold and Singpurwalla, 1983, 1989; West and Migon, 1985; West and Harrison, 1986, 1989; Harrison and West, 1987; Ameen, 1992; Carlin and Polson, 1992; Gamerman and Migon, 1993; McCulloch and Tsay, 1994; Pole et al., 1994; West et al., 1994).

5.6.6 Critical Issues

We conclude this chapter on inference by briefly discussing some further issues under the headings: (i) Model Conditioned Inference; (ii) Prior Elicitation; (iii) Sequential Methods; and (iv) Comparative Inference.

Model Conditioned Inference

We have remarked on several occasions that the Bayesian learning process is predicated on a more or less formal framework. In this chapter, this has translated into model conditioned inference, in the sense that all prior to posterior or predictive inferences have taken place within the closed world of an assumed model structure. If we accept the model, then the mechanics of Bayesian learning, derived ultimately from the requirements of quantitative coherence, provide the appropriate uncertainty accounting and dynamics. It has therefore to be frankly acknowledged and recognised that all such inference is conditional. But what if, as individuals, we acknowledge some insecurity about the model? Or need to communicate with other individuals whose own models differ? Clearly, issues of model criticism, model comparison and, ultimately, model choice are as much a part of the general world of confronting uncertainty as model conditioned thinking. We shall therefore devote Chapter 6 to a systematic exploration of these issues.
Prior Elicitation

We have emphasised, over and over, that our interpretation of a model requires, in conventional parametric representations, both a likelihood and a prior. In accounts of Bayesian Statistics from a theoretical perspective, like that of this volume, discussions of the prior component inevitably focus on stylised forms, such as conjugate or reference specifications, which are amenable to a mathematical treatment, thus enabling general results and insights to be developed. However, there is a danger of losing sight of the fact that, in real applications, prior specifications should be encapsulations of actual beliefs rather than stylised forms. This leads to the problem of how to elicit and encode such beliefs; i.e., how to structure questions to an individual, and how to process the answers, in order to arrive at a formal representation. Much has been written on this topic, which clearly goes beyond the boundaries of statistical formalism and has proved of interest and importance to researchers from a number of other disciplines, including psychology and economics. Some key references are de Finetti (1967), Winkler (1967a, 1967b), Edwards et al. (1968), Hogarth (1975, 1980), who provides a bibliography of early work, Dickey (1980), French (1980), Kadane (1980), Lindley (1982d), Jaynes (1985), Garthwaite and Dickey (1992), Leonard and Hsu (1992) and West and Crosse (1992). Warnings about the problems and difficulties are given in Kahneman et al. (1982) and Goodwin and Wright (1991). From the perspective of applications, the best known protocol seems to be that described by Staël von Holstein and Matheson (1979), the use of which in a large number of case studies has been reviewed by Merkhofer (1987). General discussion in a textbook setting is provided, for example, by Morgan and Henrion (1990). However, despite its importance, the topic has a focus and flavour substantially different from the main technical concerns of this volume, and will be better discussed in the volume Bayesian Methods.

Sequential Methods

In Section 2.6 we gave a brief overview of sequential decision problems but, for most of our developments, we assumed that data were treated globally, even if they were all immediately available. It is obvious, however, that data are often available in sequential form and, moreover, there are often computational advantages in processing data sequentially. There is a large Bayesian literature on sequential analysis and on sequential computation, which we will review in the volumes Bayesian Computation and Bayesian Methods. We shall therefore not attempt here any kind of systematic review of the very extensive literature. Key references include the seminal monograph of Wald (1947), Jackson (1960), Wetherill (1961), and the classic texts of Wetherill (1966) and DeGroot (1970).
Some other references, primarily dealing with the analysis of stopping rules, are Amster (1963), Bartholomew (1967), Basu (1975) and Irony (1993). Berger and Berry (1988) discuss the relevance of stopping rules in statistical inference, and Witmer (1986) reviews multistage decision problems.

Comparative Inference

In this and in other chapters, our main concern has been to provide a self-contained systematic development of Bayesian ideas. However, both for completeness, and for the very obvious reason that there are still some statisticians who do not currently subscribe to the position adopted here, it seems necessary to make some attempt to compare and contrast Bayesian and non-Bayesian approaches. We shall therefore provide, in Appendix B, a condensed critical overview of mainstream non-Bayesian ideas and developments. Any reader for whom our treatment is too condensed should consult Thatcher (1964), Pratt (1965), Barnard (1967), Roberts (1967), Bartholomew (1971), Press (1972/1982), Barnett (1973/1982), Cox and Hinkley (1974), Box (1983), Anderson (1984), DeGroot (1987), Casella and Berger (1987, 1990), Piccinato (1992) and Poirier (1993).
Chapter 6

Remodelling

Summary

It is argued that, whether viewed from the perspective of a sensitive individual modeller or from that of a group of modellers, there are good reasons for systematically entertaining a range of possible belief models. Throughout, a clear distinction is drawn between three rather different perspectives: first, the case where the range of models under consideration is assumed to include the "true" belief model; secondly, the case where the range of models is being considered in order to provide a proxy for a specified, but intractable, actual belief model; finally, the case where the range of models is being considered in the absence of specification of an actual belief model. A variety of decision problems are examined within this framework: some involving model choice only; some involving model choice followed by a terminal action; others involving only a terminal action, such as prediction. Links with hypothesis testing, significance testing and cross-validation are established.

6.1 MODEL COMPARISON

6.1.1 Ranges of Models

We recall from Chapter 4 that our ultimate modelling concern is with predictive beliefs for sequences of observables. More specifically, most of our detailed development has centred on belief models corresponding to judgements of exchangeability or, more generally, various forms of partial exchangeability. In each case, the predictive model typically has a mixture representation in terms of a random sample from a labelled model, together with a prior distribution for the label. For example, we saw that, for an exchangeable real-valued sequence, a predictive belief distribution, P, has the general representation

p(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i) \, dQ(F).

This corresponds to an (as if) assumption of a random sample from the unknown distribution function, F, together with a prior distribution, Q, defined over the space \mathcal{F} of all distribution functions on \Re, the latter being interpretable in terms of a strong law limit of observables. However, the very general nature of this representation precludes it, at least in terms of current limitations on intuition and technique, from providing a practical basis for routine concrete applications. This is why, in Chapter 4, much of our subsequent development was based on formal assumptions of further invariance or sufficiency structure, or pragmatic appeal to historical experience or scientific authority, in order to replace the general representation by mixtures involving finite-parameter families of densities. Inescapably, this passage from the general, but intractable, form to a specific, but tractable, model involves judgements and assumptions going far beyond the simple initial judgement of exchangeability. These further judgements, and hence the models that result from them, are therefore typically much less securely based in terms of individual beliefs, and certainly much less likely to be mutually acceptable in an interpersonal context, than the straightforward symmetry judgement.

Both from the perspective of a sensitive individual modeller and also from that of a group of modellers, there are therefore strong reasons for systematically entertaining a range of possible models (see, for example, Dickey, 1973, and Smith, 1986). Given the assumption of exchangeability, a range of different belief models, P_1, P_2, \ldots, for P, can each be represented in the general form

p_j(x_1, \ldots, x_n) = \int_{\mathcal{F}} \prod_{i=1}^{n} F(x_i) \, dQ_j(F), \quad j = 1, 2, \ldots,

for some Q_1, Q_2, \ldots, the latter encapsulating the particular alternative judgements that characterise the different models. The following stylised examples serve to illustrate some of the kinds of ranges of models that might be entertained in applications involving simple exchangeability judgements.
In general, our subsequent development will be expressed in terms of a possibly infinite sequence of models P_1, P_2, \ldots, although, in practice, we typically only work with a finite range, P_1, \ldots, P_k, for some k \geq 2. The range of models can either be thought of as generated by a single, non-dogmatic individual (seeking to avoid commitment to one specific form), or generated as concrete suggestions by a group of individuals (each committed to one of the forms), or generated purely formally, as an imaginative proxy for models thought likely to correspond to the ranges of judgements which might be made by the eventual readership of inference reports based on the models.

Inference for a Location Parameter. Suppose that observations x_1, \ldots, x_n can be thought of, conditional on \mu, as measurements of \mu with errors e_1, \ldots, e_n, so that x_i = \mu + e_i, i = 1, \ldots, n, with e_1, e_2, \ldots exchangeable. Various beliefs are then possible about the "error distribution": appeal to the central limit theorem (Section 3.2.3) might suggest the assumption of normality; past experience might suggest a substantial proportion of "aberrant" or "outlying" measurements, thus requiring a distribution with heavier tails than normality; alternatively, different past experience might suggest that the experimenter automatically suppresses any observations suspected of being "aberrant", thus requiring the assumption of a distribution with lighter tails than normality. With k = 3, and using density representations throughout, a choice of a range of models to cover these possibilities might be:
p_j(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p_j(x_i \mid \mu, \sigma) \, dQ_j(\mu, \sigma), \quad j = 1, 2, 3,

with p_1(\cdot \mid \mu, \sigma), p_2(\cdot \mid \mu, \sigma) and p_3(\cdot \mid \mu, \sigma) denoting, respectively, normal, double-exponential and uniform parametric models, and with Q_1, Q_2, Q_3 specifying prior beliefs for the location and scale parameters appearing in these families. Here the parametric forms differ across j; the prior specifications Q_j(\mu, \sigma) could differ as well or, in general, might not depend on j. In each case, the interpretations of the parameters, as strong law limits of observable measures of the location and spread of the measurements, are the same.

Normality versus non-Normality. Suppose that N \subset \mathcal{F} is the set of all normal distributions on the real line. Then, given the assumption of exchangeability for a real-valued sequence, an individual dogmatically asserting normality is specifying a Q_1(F) which concentrates with probability one on N and which, within the family, specifies a density p_1(\mu, \sigma) for the two parameters of this family. Conversely, an individual dogmatically asserting non-normality is specifying a Q_2(F) which concentrates with probability one on N^c, where N^c is the set of all distributions other than normal. Bearing in mind the "size" of \mathcal{F}, one cannot but be struck by the monumental dogmatism implicit in Q_1!

Parametric Hypotheses. Suppose that each Q_j(F) corresponds to a belief over \mathcal{F} which assigns probability one to the family with parametric form p(x \mid \theta), but that, within the context of this parametric family, the Q_j, j = 1, \ldots, k, are even more dogmatic, in that they not only all focus on a single parametric family but, in addition, specify \theta_1, \ldots, \theta_k, say, as the values of the parameter. Thus, the rival models have the representations

p_j(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \theta_j), \quad j = 1, \ldots, k.

If k = 2, this is often referred to as a situation of two simple hypotheses. A somewhat different situation arises if k = 2 and Q_1 again focuses on a specific parameter value, \theta_1, but Q_2 simply assigns a prior density p(\theta) over \Theta in p(x \mid \theta), so that

p_1(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \theta_1), \qquad p_2(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta) \, p(\theta) \, d\theta,

corresponding to what are usually referred to as a simple hypothesis and a general alternative. Our purpose here is mainly to point out how choices within the general exchangeable framework correspond to specifications of Q_j in the general representation.
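The location-parameter example above entertains normal, double-exponential and uniform error models. As an informal, purely illustrative aid to thinking about such a range (not part of the formal Bayesian development, which integrates over the parameters rather than maximising them), one might compare how well each family can fit a given sample. The sketch below, with hypothetical data, compares maximised log-likelihoods under the three families; it assumes a sample with some spread, so that no fitted scale parameter is zero.

```python
import math
import statistics

def max_loglik_three_families(x):
    """Maximised log-likelihoods for normal, double-exponential (Laplace)
    and uniform error models; an illustrative sketch only."""
    n = len(x)
    # Normal: MLEs are the sample mean and the (biased) sample variance.
    mu = statistics.fmean(x)
    var = sum((xi - mu) ** 2 for xi in x) / n
    ll_normal = -0.5 * n * math.log(2 * math.pi * var) - 0.5 * n
    # Laplace: MLEs are the median and the mean absolute deviation from it.
    med = statistics.median(x)
    b = sum(abs(xi - med) for xi in x) / n
    ll_laplace = -n * math.log(2 * b) - n
    # Uniform on an unknown interval: the MLE interval is [min(x), max(x)].
    ll_uniform = -n * math.log(max(x) - min(x))
    return {"normal": ll_normal, "laplace": ll_laplace, "uniform": ll_uniform}
```

For data with a marked outlier, the heavier-tailed double-exponential family typically attains the higher maximised log-likelihood, echoing the motivation for entertaining the range of models in the first place.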
In the contexts of judgements of partial rather than full exchangeability, discussed in detail in Chapter 4, the many versions of the former clearly provide considerable scope for positing interesting ranges of models in any given application, expanding somewhat on the earlier discussion of model elaboration and simplification given in Sections 4.6.3 and 4.6.4. In such cases, alternative models are defined by different forms of Q. The examples which follow illustrate just a few of these possibilities.

Several Samples. Consider the situation of m unrestrictedly exchangeable sequences of zero-one random quantities, discussed in detail in Section 4.6.2. If z^{(i)} = (x_{i1}, \ldots, x_{in_i}), i = 1, \ldots, m, the general representation of the joint predictive density for z^{(1)}, \ldots, z^{(m)} is given by

p(z^{(1)}, \ldots, z^{(m)}) = \int \prod_{i=1}^{m} \prod_{k=1}^{n_i} \theta_i^{x_{ik}} (1 - \theta_i)^{1 - x_{ik}} \, dQ(\theta_1, \ldots, \theta_m),

so that, given a basic assumption of unrestricted exchangeability, alternative models are defined by different forms of Q(\theta_1, \ldots, \theta_m) over the limiting frequencies, any further (non-degenerate) relationships among them being defined by the specific form of Q. As a stylised illustration of the possibilities, in the context of m clinical trial treatment groups, we might consider:

Q_1: assigning probability one to \theta_1 = \cdots = \theta_m = \theta, say, so that dQ(\theta_1, \ldots, \theta_m) reduces to dQ_1(\theta);
Q_2: assigning probability one to \theta_1 = \cdots = \theta_{m-1} = \theta, say, so that dQ(\theta_1, \ldots, \theta_m) reduces to dQ_2(\theta, \theta_m);
Q_3: retaining a general, non-degenerate, form dQ(\theta_1, \ldots, \theta_m) = dQ_3(\theta_1, \ldots, \theta_m).

Loosely speaking, Q_1 corresponds to the hypothesis that all treatments have the same effect, the assumed equality of the limiting frequencies of ones in each of the m sequences; Q_2 corresponds to the hypothesis that one of the treatments (possibly a "control") is different from all the other treatments, which themselves have the same effects; Q_3 corresponds to a general hypothesis that all treatments have different effects.

Structured Layouts. In Section 4.6.4, we considered triply subscripted random quantities, x_{ijk}, representing the kth of a number of replicates of an observable in "context" i \in I, subject to "treatment" j \in J, where the predictive model might be thought of as generated via conditionally independent normal parametric models,

N(x_{ijk} \mid \mu + \alpha_i + \beta_j + \gamma_{ij}, \lambda), \quad i \in I, \; j \in J.

As a stylised illustration of alternative modelling possibilities, we might consider:
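For the several-samples setting just described, the competing forms of Q lead to different integrated likelihoods for the observed zero-one data. As a purely illustrative sketch (assuming, hypothetically, uniform Beta(1,1) priors for the distinct limiting frequencies under each Q), the code below computes the log integrated likelihood under Q_1 (a common \theta for all m sequences) and under Q_3 (unrelated \theta_i), using the closed form \int_0^1 \theta^r (1-\theta)^{n-r} d\theta = r!(n-r)!/(n+1)!.

```python
from math import lgamma

def log_betabin_marginal(r, n):
    """log of the integral of theta^r (1-theta)^(n-r) over (0,1),
    i.e. log [ r! (n-r)! / (n+1)! ], computed via log-gamma."""
    return lgamma(r + 1) + lgamma(n - r + 1) - lgamma(n + 2)

def log_marglik_common(counts):
    """Q1: a single common limiting frequency theta; pool all sequences."""
    r = sum(ri for ri, _ in counts)
    n = sum(ni for _, ni in counts)
    return log_betabin_marginal(r, n)

def log_marglik_separate(counts):
    """Q3: unrelated limiting frequencies theta_1, ..., theta_m."""
    return sum(log_betabin_marginal(ri, ni) for ri, ni in counts)

# counts: (number of ones, number of trials) for each sequence (hypothetical)
counts = [(5, 10), (5, 10)]
log_bf_13 = log_marglik_common(counts) - log_marglik_separate(counts)
```

With the two hypothetical sequences above showing identical observed frequencies, log_bf_13 is positive; that is, the data favour the pooled hypothesis Q_1 over the unrestricted Q_3.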
Q_1: specifying \gamma_{ij} = 0 for all i, j, \alpha_i = 0 for all i and \beta_j = 0 for all j, together with a non-degenerate specification for \mu;
Q_2: specifying \gamma_{ij} = 0 for all i, j and \alpha_i = 0 for all i, together with a non-degenerate specification for \{\beta_j\} and \mu;
Q_3: specifying \gamma_{ij} = 0 for all i, j, together with a non-degenerate specification for \{\alpha_i\}, \{\beta_j\} and \mu;
Q_4: a prior distribution Q for \mu and linearly independent combinations of \{\alpha_i\}, \{\beta_j\} and \{\gamma_{ij}\}.

The reader familiar with analysis of variance methods will readily identify these prior specifications with conventional forms of hypotheses regarding absences of interaction and main effects.

Covariates. In Section 4.6, we discussed a variety of models involving covariates, where beliefs about one sequence of observables were structurally dependent on another set of observables. Given the enormous potential variety of such covariate dependent models, it does not seem appropriate to attempt a notationally precise illustration of all possibilities. Instead, we shall simply indicate, in general terms, the kinds of alternative models that might be considered.

Example 6.1. (Bioassay). Alternative models for a single experiment might correspond to different assumptions about the functional dependence of the survival probabilities on the dose (for example, logit versus probit). In the case of several separate experiments, alternative models might assume the same functional form, but differ in whether or not they constrain model parameters, for example, the LD50s, to be equal.

Example 6.2. (Growth curves). Alternative models for an individual growth curve might correspond to different assumptions about the functional dependence of the response on time (for example, linear versus logistic). In the case of several growth curves for subjects from a relatively homogeneous population, alternative models might be concerned with whether some or all of the parameters defining the growth curves are identical or differ across subjects.
Example 6.3. (Multiple regression). Alternative models in the multiple regression context typically correspond to whether or not various regressor variables can be omitted from the linear regression form or, equivalently, to whether or not various regression coefficients can be set equal to zero.

Hierarchical Models. Given the enormous variety of potential hierarchical models and alternative forms, we shall just content ourselves with some general comments for one of the specific cases considered in Section 4.6.5.

Example 6.4. (Exchangeable normal mean parameters). In Example 4.16 of Section 4.6.5, groups of observables with normal parametric models were judged exchangeable, and we considered a case where all the means, \mu_1, \ldots, \mu_m, were judged exchangeable, reflecting a symmetric judgement of "similarity" for \mu_1, \ldots, \mu_m; this latter relationship was modelled as a mixture over a further parametric form. However, other symmetry judgements are possible: for example, that m - 1 of the \mu_i's are exchangeable but the other one is not, where, a priori, each of the means is equally likely to be the odd one out. This would create a model allowing potential "outliers" among the m groups themselves. (See Section 4.6.3 for further development of this idea in a non-hierarchical setting.)

In the remainder of this chapter, we shall illustrate various of the kinds of decision problems that might be considered. The emphasis will be on somewhat stylised, typically simple, versions of such problems, in order to highlight the conceptual issues. Detailed case-studies, involving the substantive complexities of context and the computational complexities of implementation, will be more appropriately presented in the volumes Bayesian Computation and Bayesian Methods, where we shall discuss in detail a number of practical applications of this kind.

6.1.2 Perspectives on Model Comparison

To be concrete, let us assume that all the belief models P_i under consideration for observations z can be described in terms of finite parameter mixture representations; given the specifications of the various densities forming the mixtures, the predictive distributions for the alternative models are described by

p_i(z) = \int p_i(z \mid \theta_i) \, p_i(\theta_i) \, d\theta_i.

Confronted with a range of possible models, how should an individual or a group proceed? From the perspective adopted throughout this book, clearly the answer depends on the perceived decision problem to which the modelling is a response.
For mnemonic convenience, from now on we shall denote the alternative models by \{M_i, i \in I\} (rather than P_i, as in our previous discussions) and the set of these models by M = \{M_i, i \in I\}. There is, of course, some ambiguity as to what should be regarded as a component model (for example, the renormalised mixture of the M_i, i \in I, could itself be regarded as a model), but this can be resolved pragmatically by taking \{M_i, i \in I\} to be those individual models we are interested in comparing or choosing among. Before we turn to a detailed discussion of decisions concerning model choice or comparison among \{M_i, i \in I\}, we need to draw attention to important distinctions among three alternative ways in which these possible models might be viewed.

The first alternative, which we shall call the M-closed view, corresponds to believing that one of the models \{M_i, i \in I\} is "true", without the explicit knowledge of which of them is the true model. From this perspective, the overall model specifies beliefs for z of the form

p(z) = \sum_{i \in I} p(z \mid M_i) P(M_i),

with \{P(M_i), i \in I\} denoting prior weights on the component models, which may reflect either the range of uncertainties within an undecided individual, or the range of different beliefs of a group of individuals.

But when does it actually make sense to speak of a "true" model and hence to adopt the M-closed perspective? Clearly, in principle, this would be appropriate whenever one knew for sure that the real world mechanism involved was one of a specified finite set. One rather artificial situation where this would apply would be that of a computer simulation "inference game", where data are known to have been generated using one of a set of possible simulation programs, each a coded version of a different specified probability model, but it is not known which program was used. Beyond such "controlled" situations, reality is typically not as relatively straightforward as this. Nature does not provide us with an exhaustive list of possible mechanisms and a guarantee that one of them is true. Instead, continuing the discussion in Chapter 4 on the role and nature of models, we ourselves choose the lists as part of the process of settling on a predictive specification that we hope will prove "fit for purpose", in the jargon of modern quality assurance. It seems to us, therefore, difficult to accept the M-closed perspective in a literal sense. However, there may be situations where one might not feel too uncomfortable in proceeding "as if" one meant it. For example, suppose that a parametric model with a specified parameter has been extensively adopted and found to be a successful predictive device in a range of applications. Now suppose that a new application context arises and that it is felt necessary to reconsider whether to continue with the previous specified parameter value or, in this new context, to incorporate uncertainty about the appropriate value. Provided we feel comfortable, in principle, with assigning prior weights to these two alternative formulations, we can exploit the M-closed framework.

But if we abandon the M-closed perspective, how else might we approach the very real and important problem of comparing or choosing among alternative models? It seems to us that the approach depends critically on whether one has oneself separately formulated a clear belief model or not.

The second alternative, which we shall call the M-completed view, corresponds to an individual acting as if \{M_i, i \in I\} simply constitute a range of specified models available for comparison, to be evaluated in the light of the individual's separate actual belief model, which we shall denote by M_t. From this perspective, assigning the probabilities \{P(M_i), i \in I\} does not make sense and the actual overall model specifies beliefs for z of the form p(z) = p_t(z) = p(z \mid M_t). The M-completed perspective will typically have selected the particular proxy models in the light of an actual belief model; the alternative models are presumably being contemplated as a proxy because the actual belief model is too cumbersome to implement. For example, if the actual belief model is based on non-linear functions of many covariates, with Student probability distribution specifications, the proxy models to be evaluated might be various linear regression models with limited numbers of covariates and normal probability distribution specifications. Examples of lists of "proxy" models that are widely used include familiar ones based on parametric components, corresponding to: regression models with different choices of regressors; generalised linear models with different choices of covariates, link functions, etc.; contingency table structures with different patterns of independence and dependence assumptions. Such proxy models might be adopted for a variety of reasons; typically, \{M_i, i \in I\} will have been proposed largely because they are attractive from the point of view of tractability of analysis or communication of results compared with the actual belief model M_t. However, they will still have to be evaluated and compared in the light of these actual beliefs.

The third alternative, which we shall call the M-open view, also acknowledges that \{M_i, i \in I\} simply constitute a range of specified models available for comparison but, in this case, there is no separate overall actual belief specification, p(z), perhaps because we lack the time or competence to provide it. As in the M-completed case, assigning probabilities \{P(M_i), i \in I\} does not make sense, so that the M-open perspective requires comparison of such models in the absence of a separate belief model. In this case, it would seem intuitively (and we shall see this more formally later) that the alternative models have to battle it out among themselves on some cross-validatory basis. We now proceed to give these alternative perspectives a somewhat more formal description.
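The suggestion that, under the M-open view, models must battle it out on some cross-validatory basis can be given a simple computational flavour. The sketch below is an illustration of the general idea only, not a construction from this chapter: the two plug-in predictive rules and the data are hypothetical, and each model is scored by its leave-one-out log predictive density.

```python
import math
import statistics

def normal_plugin_logpred(train, x):
    """Log density of x under a normal fitted to train by maximum likelihood."""
    mu = statistics.fmean(train)
    var = sum((t - mu) ** 2 for t in train) / len(train)
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var

def laplace_plugin_logpred(train, x):
    """Log density of x under a double-exponential fitted to train."""
    med = statistics.median(train)
    b = sum(abs(t - med) for t in train) / len(train)
    return -math.log(2 * b) - abs(x - med) / b

def loo_log_score(logpred, data):
    """Leave-one-out cross-validatory log score of a predictive rule."""
    return sum(
        logpred(data[:i] + data[i + 1:], xi) for i, xi in enumerate(data)
    )

data = [0.0, 0.1, -0.1, 5.0]   # hypothetical sample containing one outlier
scores = {
    "normal": loo_log_score(normal_plugin_logpred, data),
    "laplace": loo_log_score(laplace_plugin_logpred, data),
}
```

Heavier-tailed predictive rules are not penalised as catastrophically when the held-out observation is the outlier, so, for data such as these, the double-exponential rule attains the higher cross-validatory score; no "true" model is invoked anywhere in the comparison.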
6.1.3 Model Comparison as a Decision Problem

We shall now discuss various possible decision problems where the answer to an inference problem involves model choice or comparison among the alternatives in M = \{M_i, i \in I\}. Some of these only make sense from an M-closed perspective; others can be approached from either an M-closed, an M-completed or an M-open perspective. Throughout the following development, observed data on which decisions are to be based will be denoted by z, and the choice of model M_i, i \in I, will be denoted by m_i.

The first decision problem we shall consider involves only the choice of an M_i, i \in I, without any subsequent action, either as an end in itself, or as the basis for a subsequent answer to an inference problem. This decision structure is shown schematically in Figure 6.1, where w is some unknown of interest and u(m_i, w) denotes the utility of choosing model M_i when w is the actual "state of the world".

Figure 6.1 A decision problem involving model choice only

Whatever the forms of w and u(m_i, w), in the general decision problem defined by Figure 6.1, maximising expected utility implies that the optimal model choice m^* is given by

\bar{u}(m^* \mid z) = \sup_{i \in I} \bar{u}(m_i \mid z), \quad \text{where} \quad \bar{u}(m_i \mid z) = \int u(m_i, w) \, p(w \mid z) \, dw,

with p(w \mid z) representing actual beliefs about w having observed z. From an M-closed perspective, an example of an obvious w of interest might be the "true" model: w in this case labels the true model, and the utility of choosing a particular model then depends on whether a correct choice has been made. It is perhaps not obvious why such a problem would be of interest from an M-open perspective; we shall defer a detailed discussion of this until later in the chapter.

In the M-closed case,

p(w \mid z) = \sum_{i \in I} p_i(w \mid z) \, P(M_i \mid z),

where p_i(w \mid z) = p(w \mid M_i, z) and

P(M_i \mid z) = \frac{P(M_i) \, p(z \mid M_i)}{\sum_{j \in I} P(M_j) \, p(z \mid M_j)}

are the posterior probabilities of the models. We note, in particular, the key role played by the quantities \{P(M_i \mid z), i \in I\}. The form p_i(w \mid z) is given by standard (posterior or predictive) manipulations conditional on model M_i being true, so that, within the purview of the M-closed framework, one can, at least in principle, obtain p(w \mid z) and evaluate \bar{u}(m_i \mid z), even though this may require extensive (Monte Carlo) numerical calculations in specific applications.

From the M-completed perspective, nothing can be said in general about the explicit form of p(w \mid z), which has to be derived from the actual belief model M_t. It turns out, however, perhaps surprisingly, that, at least approximately, the same analysis can be carried out in the M-open as in the M-completed framework: one can compare the models in M on the basis of their expected utilities without actually having specified an alternative "true" model, even though none of them corresponds to one's own assumption regarding the true model. Detailed analysis for the M-open case will be given in Section 6.1.6.

Let us now consider a rather different form of decision problem which first requires the choice of a model M_i from M, and then, assuming M_i to be the model, requires an answer a_j, j \in J, to an inference problem relating to an unknown "state of the world" w of interest. For example, we may wish to predict a future observation, or estimate a parameter common to all the models in M. The resulting decision problem is shown schematically in Figure 6.2.

Figure 6.2 A decision problem involving model choice and subsequent inference

If u(m_i, a_j, w) denotes the utility resulting from the successive choices m_i (model M_i) and a_j (answer to the inference question, given M_i), systematic application of the criterion of maximising expected utility establishes that the optimal model choice is that m_*, for which

\bar{u}(m_* \mid z) = \sup_{i \in I} \bar{u}(m_i \mid z),

where \bar{u}(m_i \mid z), the expected utility of optimal behaviour given model M_i, is obtained from maximising, over a_j, j \in J,

\bar{u}(m_i, a_j \mid z) = \int u(m_i, a_j, w) \, p_i(w \mid z) \, dw.

The form p_i(w \mid z) in the above is again given by standard (posterior or predictive) manipulation conditional on model M_i, while the form p(w \mid z) again represents actual beliefs about w given z; the explicit form of p(w \mid z) as a mixture of the p_i(w \mid z) has been given above in the M-closed case, and evaluation of p(w \mid z) and \{\bar{u}(m_i \mid z), i \in I\} can in principle be carried out, numerically if necessary.

From a conceptual perspective, it is worth remarking that, in the above context, it is not necessary to choose among the elements of M in order to provide an answer a_j, j \in J, to an inference problem. If we omit the explicit model choice step, the resulting, different, form of decision problem is that shown schematically in Figure 6.3, for which

\bar{u}(a_j \mid z) = \int u(a_j, w) \, p(w \mid z) \, dw

is the expected utility.

Figure 6.3 A decision problem involving terminal decision only

Before proceeding to further aspects of model choice and comparison, it is important to recognise that different choices of w and different forms of utility structure will naturally imply different forms of solution to the problem of model choice. In the next two subsections, we shall explore a number of specific cases, in order to underline the general message that coherent comparison of a finite or countable set of alternative models depends on the specification (at least implicitly) of a decision structure, including a utility function.
It is then easily seen from the analysis relating to Figure 6.3 that maximising expected utility leads immediately to the optimal answer a*, given by

ū(a* | x) = sup_a ū(a | x), where ū(a | x) = ∫ u(a, ω) p(ω | x) dω,

with p(ω | x) as discussed above. Thus, even if explicit choice among the models is not part of the decision problem, if we entertain a range of possible models for data x, solutions to decision problems conditional on x will always implicitly depend on a comparison of the models in the light of the data. In the particular case of an M-closed perspective, although we have omitted the model choice step, model comparison in the light of the data x is still being effected through the presence of the posterior weights {P(M_i | x), i ∈ I}.

6.1.4 Zero-one Utilities and Bayes Factors

In this section, we confine attention to the M-closed perspective and consider first the problem of choosing a model from M, without any subsequent decision. When the "state of the world" of interest is defined to be the "true" model, the problem, stated colloquially, is that of choosing the true model. In this case, a natural form of utility function may be

u(m_i, ω) = 1 if ω = M_i, 0 if ω ≠ M_i, i ∈ I.

The expected utility of the decision m_i (choosing M_i), given x, is hence

ū(m_i | x) = ∫ u(m_i, ω) p(ω | x) dω = P(M_i | x).

The optimal decision is therefore to "choose the model which has the highest posterior probability". Moreover, it follows from the posterior weighted mixture form of p(ω | x) that, assuming a future sample y = (y_1, ..., y_s) generated by model M_t, P(M_t | y) → 1 as s → ∞.
Bayes Factors

Less formally, suppose that some form of intuitive measure of pairwise comparison of plausibility is required between any two of the models {M_i, i ∈ I}. The above analysis suggests that models M_i, M_j may be usefully compared using the posterior odds ratio

P(M_i | x) / P(M_j | x) = [p(x | M_i) / p(x | M_j)] × [P(M_i) / P(M_j)],

where, in the context of parametric models, the integrated likelihood of M_i is given by

p(x | M_i) = ∫ p_i(x | θ_i) p_i(θ_i) dθ_i.

In words, the above comparison can be described as "posterior odds ratio = integrated likelihood ratio × prior odds ratio", making explicit the key role of the ratio of integrated likelihoods in providing the mechanism by which the data transform relative prior beliefs into relative posterior beliefs. The fundamental importance of this transformation warrants the following definition.

Definition 6.1. (Bayes factor). Given two hypotheses H_1, H_2 corresponding to assumptions of alternative models for data x, the Bayes factor in favour of H_1 (and against H_2) is given by the ratio of posterior to prior odds,

B_12(x) = [P(H_1 | x) / P(H_2 | x)] ÷ [P(H_1) / P(H_2)].

Intuitively, the Bayes factor provides a measure of whether the data x have increased or decreased the odds on H_1 relative to H_2. Thus, B_12(x) > 1 signifies that H_1 is now relatively more plausible in the light of x; B_12(x) < 1 signifies that the relative plausibility of H_2 has increased. In Section 6.1.2, we noted the extremely simple forms of predictive model which result when beliefs not only concentrate on a specific parametric family of distributions, but also identify the value of the parameter: for such fully specified models, which just correspond to the specifications {p_i(x | θ_i)}, the integrated likelihood ratios reduce to simple ratios of likelihoods.

Good (1950) has suggested that the logarithms of the various ratios in the above be called weights of evidence (a term apparently first used in a related context by Peirce, 1878). On this logarithmic scale, apparently due to Turing (see, for example, Good, 1988b), the prior weight of evidence and log B_12(x) combine additively to give the posterior weight of evidence, so that log B_12(x) corresponds to the integrated likelihood weight of evidence in favour of M_1 (and against M_2).
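As a minimal computational sketch of Definition 6.1 (not from the original text; the data and hypothesised parameter values are hypothetical), consider two simple hypotheses about a binomial success probability. With simple hypotheses the integrated likelihoods reduce to ordinary likelihoods, so the Bayes factor is a likelihood ratio, and posterior odds are obtained by multiplying by the prior odds:

```python
from math import comb

# Binomial likelihood for a simple hypothesis theta.
def binom_lik(theta, n, k):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

n, k = 20, 14                   # hypothetical data: 14 successes in 20 trials
lik1 = binom_lik(0.5, n, k)     # H1: theta = 0.5
lik2 = binom_lik(0.7, n, k)     # H2: theta = 0.7

B12 = lik1 / lik2               # Bayes factor in favour of H1 (here < 1)
prior_odds = 1.0                # P(H1)/P(H2)
posterior_odds = B12 * prior_odds
print(B12, posterior_odds)
```

On the logarithmic scale, log B12 is the weight of evidence contributed by the data, and it adds to the prior weight of evidence to give the posterior weight of evidence.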
6.I Model Comparison
391
Hypothesis Testing
The problem of hypothesis testing has its own conventional terminology, which, within the framework we are adopting, can be described as follows. Two alternative models, M_1, M_2, are under consideration and both are special cases of the predictive model

p(x) = ∫ p(x | θ) dQ(θ),

with the same assumed parametric form p(x | θ), θ ∈ Θ, but with different choices of Q. If, for model M_i, Q_i assigns probability one to a specific value, θ_i, say, the model is said to reduce to a simple hypothesis for θ (recalling that the form p(x | θ) is assumed throughout). If, for model M_i, Q_i defines a non-degenerate density p_i(θ) over Θ_i ⊆ Θ, the model is said to reduce to a composite hypothesis for θ. If a simple hypothesis is being compared with a composite hypothesis such that Θ_2 = Θ − {θ_1}, the latter is called a general alternative hypothesis.
In the situation where the "state of the world" of interest, ω, is defined to be the true model M_i, we can generalise slightly the zero-one utility structure used earlier by assuming that

u(m_i, ω) = −l_ij if ω = M_j, i = 1, 2, j = 1, 2,

with l_11 = l_22 = 0 and l_12 > l_21 > 0. Intuitively, there is a (possibly asymmetric) loss in choosing the wrong model, and there is no loss in choosing the correct model. Given data x, the expected utility of m_i is then easily seen to be

ū(m_i | x) = −l_ij P(M_j | x), j ≠ i,

so that we prefer M_2 to M_1 if and only if

P(M_2 | x) / P(M_1 | x) > l_21 / l_12,

revealing a balancing of the posterior odds against the relative seriousness of the two possible ways of selecting the wrong model. In the symmetric case, l_12 = l_21, the choice reduces to choosing the a posteriori most likely model, as shown earlier for the zero-one case. The following describes the forms of so-called Bayes tests which arise in comparing models when the latter are defined by parametric hypotheses.
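The balancing of posterior odds against the asymmetric losses can be sketched as follows (the posterior probabilities and loss values below are hypothetical, chosen only for illustration):

```python
# M2 is preferred to M1 iff P(M2|x)/P(M1|x) > l21/l12, where
# l12 = loss of choosing M1 when M2 is true,
# l21 = loss of choosing M2 when M1 is true.
def prefer_m2(post_m1, post_m2, l12, l21):
    return post_m2 / post_m1 > l21 / l12

# Symmetric losses: reduces to picking the a posteriori more probable model.
print(prefer_m2(0.4, 0.6, 1.0, 1.0))   # True

# Wrongly choosing M2 is five times as serious (l21 = 5*l12):
# the posterior odds 0.6/0.4 no longer clear the threshold, so keep M1.
print(prefer_m2(0.4, 0.6, 1.0, 5.0))   # False
```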
Proposition 6.1. (Forms of Bayes tests). In comparing two models, M_1, M_2, defined by parametric hypotheses for p(x | θ), with the utility structure given above, M_2 is preferred to M_1 if and only if the posterior odds satisfy

P(M_2 | x) / P(M_1 | x) = B_21(x) × [P(M_2) / P(M_1)] > l_21 / l_12,

where the Bayes factor B_21(x) reduces to the simple likelihood ratio p_2(x | θ_2)/p_1(x | θ_1) when both hypotheses are simple (simple versus simple test), and involves the corresponding integrated likelihood ∫ p_i(x | θ_i) p_i(θ_i) dθ_i for each composite hypothesis.

Proof. The results follow directly from the preceding discussion. □
The following examples illustrate both general model comparison and a specific instance of hypothesis testing.
Example 6.5. (Geometric versus Poisson). Suppose we wish to compare two completely specified parametric models, Geometric and Poisson, defined for conditionally independent x_1, ..., x_n by

M_1: Nb(x_i | θ_1, 1), M_2: Pn(x_i | θ_2), i = 1, ..., n.

The Bayes factor in this case is given by the simple likelihood ratio

B_12(x) = Π_i Nb(x_i | θ_1, 1) / Π_i Pn(x_i | θ_2).

Suppose for illustration that θ_1 = 1/3, θ_2 = 2 (implying equal mean values E[x] = 2 for both models); then, for example, with n = 2, x_1 = x_2 = 0, we have B_12(x) = e^4/9 ≈ 6.07, indicating an increase in plausibility for M_1; whereas with n = 2, x_1 = x_2 = 2, we have B_12(x) = 4e^4/729 ≈ 0.30, indicating a slight increase in plausibility for M_2. Suppose now that θ_1, θ_2 are not known and are assigned the prior distributions Be(θ_1 | α_1, β_1) and Ga(θ_2 | α_2, β_2),
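The fully specified comparison can be reproduced directly (a short sketch; the parameter values θ_1 = 1/3, θ_2 = 2 and the two data sets are those of the example):

```python
from math import exp, factorial

def geom_pmf(x, theta):            # Nb(x | theta, 1): geometric on 0, 1, 2, ...
    return theta * (1 - theta)**x

def poisson_pmf(x, theta):         # Pn(x | theta)
    return exp(-theta) * theta**x / factorial(x)

def bayes_factor(xs, theta1=1/3, theta2=2.0):
    num = den = 1.0
    for x in xs:
        num *= geom_pmf(x, theta1)
        den *= poisson_pmf(x, theta2)
    return num / den

print(bayes_factor([0, 0]))   # e^4/9, about 6.07: plausibility shifts to M1
print(bayes_factor([2, 2]))   # 4e^4/729, about 0.30: slight shift to M2
```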
whose forms are given in Section 3.2.2 (where details of Nb(x_i | θ_1, 1) and Pn(x_i | θ_2) can also be found). It follows straightforwardly that the two integrated likelihoods, and hence the Bayes factor B_12(x), are available in closed form. We further note that prior specifications with (α_1 − 1)α_2 = β_1 β_2 imply the same means for the two predictive models.
Table 6.1 Dependence of B_12(x) on prior-data combinations

                    α_1 = 2,  β_1 = 2    α_1 = 31, β_1 = 60    α_1 = 2, β_1 = 3
                    α_2 = 2,  β_2 = 1    α_2 = 60, β_2 = 30    α_2 = 3, β_2 = 2

x_1 = x_2 = 0            2.70                 5.69                  0.80
x_1 = x_2 = 2            0.29                 0.30                  0.49
As an illustration of the way in which the prior specification can affect the inferences, we present in Table 6.1 a selection of values of B_12(x) resulting from particular prior-data combinations. In the first two columns, the priors specify the same predictive means for the two models, namely E[x | M_i] = 2, but the priors in the second column are much more informative. In the final column, different predictive means are specified. Column 2 gives Bayes factors close to those obtained above assuming θ_1, θ_2 known, as might be expected from prior distributions concentrating sharply around the values θ_1 = 1/3 and θ_2 = 2. However, comparison of the first and third columns for x_1 = x_2 = 0 makes clear that, with small data sets, seemingly minor changes in the priors for model parameters can lead to changes in direction of the Bayes factor.
The point made at the end of the above example is, of course, a general one. In any model comparison, the Bayes factor will depend on the prior distributions specified for the parameters of each model. That such dependence can be rather striking is well illustrated in the following example.
Example 6.6. (Lindley's paradox). Suppose that for x = (x_1, ..., x_n) two alternative models M_1, M_2, with P(M_i) > 0, i = 1, 2, correspond to simple and composite hypotheses about μ in N(x_i | μ, λ), defined by

M_1: p_1(x) = Π_{i=1}^{n} N(x_i | μ_0, λ), μ_0, λ known,

M_2: p_2(x) = ∫ Π_{i=1}^{n} N(x_i | μ, λ) N(μ | μ_1, λ_1) dμ, μ_1, λ_1, λ known.

In more conventional terminology, x_1, ..., x_n are a random sample from N(x | μ, λ), with precision λ known; the null hypothesis is that μ = μ_0 and the alternative hypothesis is that μ ≠ μ_0, with uncertainty about μ described by N(μ | μ_1, λ_1). Since x̄ = n^{-1} Σ_i x_i is a sufficient statistic under both models, the Bayes factor B_12(x) is easily written down as a function of x̄. It is easily checked that, for any fixed x̄, B_12(x) → ∞ as λ_1 → 0, so that the evidence in favour of M_1 becomes overwhelming as the prior precision in M_2 gets vanishingly small, and hence P(M_1 | x) → 1. In particular, this is true for x̄ such that |x̄ − μ_0| is large enough to cause the "null hypothesis" to be "rejected" at any arbitrary, prespecified level using a conventional significance test! This "paradox" was first discussed in detail by Lindley (1957) and has since occasioned considerable debate; see Smith (1965), Bernardo (1980), Shafer (1982b), Berger and Delampady (1987), Moreno and Cano (1989), Berger and Mortera (1991a) and Robert (1993) for further contributions and references.
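The limiting behaviour can be checked numerically. The sketch below (not from the original text) uses the standard closed form for B_12(x) in this normal setting, under the additional assumption that the prior mean under M_2 is μ_1 = μ_0; x̄ is held fixed at a value exactly "significant" at the conventional 5% level:

```python
from math import sqrt, exp

# B12 for H0: mu = mu0 versus H1 with prior N(mu | mu0, lam1),
# data precision lam, sample size n, sample mean xbar.
def b12(n, xbar, mu0, lam, lam1):
    shrink = n * lam / (n * lam + lam1)
    return (sqrt((n * lam + lam1) / lam1)
            * exp(-0.5 * shrink * n * lam * (xbar - mu0) ** 2))

n, lam, mu0 = 100, 1.0, 0.0
xbar = 1.96 / sqrt(n * lam)        # |z| = 1.96: "reject H0 at the 5% level"
for lam1 in [1.0, 1e-2, 1e-4, 1e-6]:
    print(lam1, b12(n, xbar, mu0, lam, lam1))
```

The exponential factor stays bounded while the square-root factor diverges, so B_12 grows without limit as λ_1 → 0, despite the "significant" data.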
A model comparison procedure which seems to be widely used implicitly in statistical practice, but rarely formalised, is the following. Given the assumption of a particular predictive model, {p(x | θ), p(θ)}, θ ∈ Θ, a posterior density, p(θ | x), is derived and, as we have seen in Section 5.1.5, may be at least partially summarised by identifying, for some 0 < p < 1, a highest posterior density credible region R_p(x), which is typically the smallest region such that

∫_{R_p(x)} p(θ | x) dθ = p.

Intuitively, for large p, R_p(x) contains those values of θ which are most plausible given the model and the data. Conversely, the complement of R_p(x) consists of those values of θ which are rather implausible.

Now suppose that, given a specified p and derived R_p(x), one is going to assert that the "true value" of θ (i.e., the value onto which p(θ | y) would concentrate as the size of a future sample y tended to ∞) lies in R_p(x). Defining the decision problem to be the choice of p, so that the possible answers to the inference problem are in A = [0, 1], with the state of the world ω defined to be the true θ, a value a_p = p has to be chosen. An appropriate utility function may be

u(a_p, ω) = f(p) if ω ∈ R_p(x), g(1 − p) if ω ∉ R_p(x),

where f and g are decreasing functions defined on [0, 1]. Essentially, such a utility function extends the idea of a zero-one function by reflecting the desire for a "correct" decision, but modified to allow for the fact that choosing p close to one leads to a rather vacuous assertion, whereas a correct assertion with p small is rather impressive. The expected utility of choosing a_p is easily seen to be

ū(a_p) = p f(p) + (1 − p) g(1 − p),

from which the optimal p may be derived for any specific choices of f and g. We note that if f = g, the unique maximum is at p = 0.50, so that it becomes optimal to quote a 50% highest posterior density credible region. If, for example, f(p) = 1 − p and g(1 − p) = [1 − (1 − p)]² = p², the resulting optimal value of p is 1/√3 ≈ 0.58, so that a 58% credible region is appropriate. More exotically, other pairs of functions f and g can be constructed for which, as the reader might like to verify, a 95% credible region turns out to be optimal.
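The 58% figure can be verified by a direct grid search over the expected utility (a short sketch; the grid resolution is an arbitrary choice):

```python
# With f(p) = 1 - p and g(1-p) = (1-(1-p))**2 = p**2, the expected utility is
#   u(a_p) = p*f(p) + (1-p)*g(1-p) = p - p**3,
# which is maximised at p = 1/sqrt(3), roughly 0.58.
def expected_utility(p):
    f = 1 - p          # utility of a correct assertion at coverage p
    g = p ** 2         # utility of an incorrect assertion, i.e. g(1-p)
    return p * f + (1 - p) * g

grid = [i / 10000 for i in range(10001)]
p_opt = max(grid, key=expected_utility)
print(p_opt)           # close to 0.5774
```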
6.1.5 General Utilities

Continuing for the present with the (M-closed) hypothesis testing framework, the consequences of incorrectly choosing a model may be less serious if the alternative models are "close" in some sense, in which case utilities of the zero-one type, which take no account of such "closeness", may be inappropriate.
One-sided Tests
We shall illustrate this idea, and forms of possibly more reasonable utility functions, by considering the special case of θ ∈ Θ ⊆ ℜ, with parametric form p(x | θ) and models M_1, M_2 defined by

M_1: p_1(x | θ) = p(x | θ), θ ∈ Θ_1 = {θ : θ ≤ θ_0},
M_2: p_2(x | θ) = p(x | θ), θ ∈ Θ_2 = {θ : θ > θ_0},

for some θ_0 ∈ Θ. The models thus correspond to the hypotheses that the parameter is smaller or larger than some specified value θ_0. It seems reasonable in such a situation to suppose that if one were to incorrectly choose M_2 (θ > θ_0) rather than M_1 (θ ≤ θ_0), in many cases this would be much less serious if the true value of θ were actually θ_0 + ε than if it were θ_0 + 100ε, say, for ε > 0. Such arguments suggest that, with the state of the world ω now representing the true parameter value θ, we might specify a utility function of the form

u(m_1, θ) = 0 if θ ≤ θ_0, −l_1(θ − θ_0) if θ > θ_0,
u(m_2, θ) = −l_2(θ_0 − θ) if θ ≤ θ_0, 0 if θ > θ_0,

where l_1, l_2 are increasing positive functions of (θ − θ_0) and (θ_0 − θ), respectively. The expected utilities of the decisions m_1, m_2 (i.e., the choices of M_1, M_2) are therefore given by

ū(m_1 | x) = −∫_{θ_0}^{∞} l_1(θ − θ_0) p(θ | x) dθ,
ū(m_2 | x) = −∫_{−∞}^{θ_0} l_2(θ_0 − θ) p(θ | x) dθ,

where p(θ | x) is the posterior density of θ given x. The optimal answer to the inference problem is to prefer M_1 to M_2 if and only if

ū(m_1 | x) > ū(m_2 | x),

with explicit solutions depending, of course, on the choices of l_1, l_2 and the form of p(θ | x), as illustrated in the following example.
Example 6.7. (Normal posterior; linear losses). If l_1(θ − θ_0) = θ − θ_0 and l_2(θ_0 − θ) = k(θ_0 − θ), with k reflecting the relative seriousness of "overestimating" by choosing model M_2, and p(θ | x), given x = (x_1, ..., x_n), is N(θ | μ_n, λ_n), say, then we have

−ū(m_1 | x) = ∫_{θ_0}^{∞} (θ − θ_0) N(θ | μ_n, λ_n) dθ = λ_n^{−1/2} ψ*(λ_n^{1/2}(μ_n − θ_0)),

and, similarly,

−ū(m_2 | x) = k ∫_{−∞}^{θ_0} (θ_0 − θ) N(θ | μ_n, λ_n) dθ = k λ_n^{−1/2} ψ*(λ_n^{1/2}(θ_0 − μ_n)),

where

ψ*(t) = N(t | 0, 1) + t ∫_{−∞}^{t} N(s | 0, 1) ds.

It is therefore optimal to prefer M_1 to M_2 if and only if

ψ*(λ_n^{1/2}(μ_n − θ_0)) < k ψ*(λ_n^{1/2}(θ_0 − μ_n)).

In the symmetric case, k = 1, since ψ* is strictly increasing, it is easily seen that this reduces to preferring M_1 if and only if μ_n < θ_0, as one might intuitively have expected. For references and further discussion of related topics, see DeGroot (1970, Chapter 11) and Winkler (1972, Chapter 6).
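The one-sided decision rule of Example 6.7 can be sketched directly (the posterior parameters in the calls below are hypothetical illustration values):

```python
from math import sqrt, pi, exp, erf

def phi(t):                                    # standard normal density
    return exp(-0.5 * t * t) / sqrt(2 * pi)

def Phi(t):                                    # standard normal cdf via erf
    return 0.5 * (1 + erf(t / sqrt(2)))

def psi(t):                                    # psi*(t) = phi(t) + t*Phi(t)
    return phi(t) + t * Phi(t)

# Prefer M1 (theta <= theta0) iff
#   psi(sqrt(lam_n)*(mu_n - theta0)) < k * psi(sqrt(lam_n)*(theta0 - mu_n)).
def prefer_m1(mu_n, lam_n, theta0, k=1.0):
    t = sqrt(lam_n) * (mu_n - theta0)
    return psi(t) < k * psi(-t)

print(prefer_m1(-0.2, 4.0, 0.0))        # k = 1: reduces to mu_n < theta0
print(prefer_m1(0.1, 4.0, 0.0, k=3.0))  # heavier penalty for choosing M2 wrongly
```

In the second call the posterior mean slightly exceeds θ_0, yet M_1 is still preferred because k = 3 makes wrongly choosing M_2 three times as costly.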
Prediction
Moving away now from model comparisons which reduce to hypothesis tests in parametric models, let us consider the problem of model comparison or choice, given data x, in order to make a point prediction for a future observation y. The general decision structure is that given schematically in Figure 6.2 where, assuming real-valued observables, m_i corresponds to acting in accordance with model M_i, and a_j, j ∈ J, denotes the choice, based on M_i, of a prediction, ŷ_i, for the future observation y. We shall assume a "quadratic loss" utility,

u(m_i, ŷ_i, y) = −(ŷ_i − y)², i ∈ I.

We recall from the analysis given in Section 6.1.3 that the optimal model choice is m*, obtained by maximising ū(m_i | x) over i ∈ I, where ŷ_i* is the optimal prediction of a future observation y, given data x and assuming model M_i; that is, the value ŷ which minimises

∫ (ŷ − y)² p_i(y | x) dy,
where p_i(y | x) is the predictive density for y given model M_i. It then follows immediately that

ŷ_i* = E[y | M_i, x] = ∫ y p_i(y | x) dy,

the predictive mean, given model M_i, so that

ū(m_i | x) = −∫ (ŷ_i* − y)² p(y | x) dy.

Completion of the analysis now depends on the specification of the overall actual belief distribution p(y | x) and the computation of the expectation of (ŷ_i* − y)², i ∈ I, with respect to p(y | x). Again, in the M-completed case there is nothing further to be said explicitly: one simply carries out the necessary evaluations, using the appropriate form of p(y | x), by numerical integration if necessary. In the M-open case, the detailed analysis of the problem of point prediction with quadratic loss will be given in Section 6.1.6. In the M-closed case, we have

p(y | x) = Σ_{j∈I} P(M_j | x) p_j(y | x),

and, after some rearrangement, it is easily seen that

∫ (ŷ_i* − y)² p(y | x) dy = (ŷ_i* − ȳ*)² + terms not depending on i,

where ȳ* is the weighted prediction

ȳ* = Σ_{j∈I} P(M_j | x) ŷ_j*.
The preferred model M_i is therefore seen to be that for which the resulting prediction, ŷ_i*, is closest to ȳ*, the posterior weighted average, over models, of the individual model predictions. In the case of just two models, it is easily checked that the preferred model is simply that with the highest posterior probability. If we wish to make a prediction, but without first choosing a specific model, it is easily seen that the analysis of the problem in terms of the schematic decision problem given in Figure 6.3 of Section 6.1.2 leads directly to ȳ* as the optimal prediction. Clearly, the above analyses go through in an obvious way, with very few modifications, if, instead of prediction, we were to consider point estimation, with quadratic loss, of a parameter common to all the models. More generally, the analysis can be carried out for loss functions other than the quadratic.
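The "closest to the weighted average" rule is easy to sketch; the posterior probabilities and predictive means below are hypothetical, and the three-model case shows that the winner need not be the most probable model:

```python
# Under quadratic loss, the preferred model is the one whose predictive mean
# lies closest to the posterior-weighted average of all predictive means.
def preferred_model(post_probs, pred_means):
    ybar = sum(p * m for p, m in zip(post_probs, pred_means))
    dists = [abs(m - ybar) for m in pred_means]
    return dists.index(min(dists))     # index of the preferred model

post = [0.5, 0.3, 0.2]
preds = [1.0, 2.0, 4.0]
# ybar = 0.5*1 + 0.3*2 + 0.2*4 = 1.9, so the model predicting 2.0 is chosen,
# even though model 0 has the highest posterior probability.
print(preferred_model(post, preds))    # 1
```

With exactly two models the rule does reduce to choosing the more probable one, since each prediction's distance to ȳ* is proportional to the other model's posterior probability.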
Reporting Inferences
Generalising beyond the specific problems of point prediction and estimation, let us consider the problem of model comparison or choice in order to report inferences about some unknown state of the world ω. For example, the latter might be a common model parameter, a function of future observables, an indicator function of the future realisation of a specified event, or whatever. A major theme of our development in Chapters 2 and 3 has been that the problem of reporting beliefs about ω is itself a decision problem, where the set of possible answers to the inference problem consists of the class of probability distributions for ω which are compatible with the given data. The appropriate utility functions in such problems were seen to be the score functions discussed in Sections 2.7 and 3.4. This general decision problem is thus a special case of that represented by Figure 6.2 where, given data x, m_i represents the choice of model M_i, the subsequent answer a_j, j ∈ J_i, to the inference problem is some report of beliefs about ω, assuming M_i, and the utility function is defined by

u(m_i, a_j, ω) = u_i(q_j(· | x), ω),

for some score function u_i and form of belief report, q_j(· | x), about ω, corresponding to a_j, j ∈ J_i. If p_i(· | x) is the form of belief report for ω actually implied by m_i, and if u_i is a proper scoring rule (see, for example, Definition 3.16), then it follows that the optimal a_j, j ∈ J_i, must be a_i* = p_i(· | x), and that

u(m_i, a_i*, ω) = u_i(p_i(· | x), ω), i ∈ I.

If, moreover, the score function is local (see, for example, Definition 3.15), we have the logarithmic form

u(m_i, a_i*, ω) = A log p_i(ω | x) + B(ω), i ∈ I,
for A > 0 and B(ω) arbitrary, in accordance with Proposition 3.13. The expected utility of m_i is therefore given by

ū(m_i | x) = ∫ {A log p_i(ω | x) + B(ω)} p(ω | x) dω,

and the preferred model is the M_i for which this is maximised over i ∈ I. Comments about the detailed implementation of the analysis in the M-open case are similar to those made in the previous problem. For M-closed models, we have the more explicit form

ū(m_i | x) = A Σ_{j∈I} P(M_j | x) ∫ p_j(ω | x) log p_i(ω | x) dω + ∫ B(ω) p(ω | x) dω,

which, after straightforward rearrangement, shows that the preferred M_i is given by minimising, over i ∈ I,

Σ_{j∈I} P(M_j | x) ∫ p_j(ω | x) log {p_j(ω | x) / p_i(ω | x)} dω,

the posterior weighted average, over models, of the logarithmic divergence (or discrepancy) between p_i(ω | x) and each of the p_j(ω | x), j ∈ I.
If, instead, we were to adopt the (proper) quadratic scoring rule (see, for example, Definition 3.17), we obtain, ignoring irrelevant constants,

ū(m_i | x) ∝ Σ_{j∈I} P(M_j | x) { 2 ∫ p_i(ω | x) p_j(ω | x) dω − ∫ p_i²(ω | x) dω },

so that, after some algebraic rearrangement, in the case of M-closed models the preferred M_i is seen to be that which minimises, over i ∈ I,

Σ_{j∈I} P(M_j | x) ∫ { p_j(ω | x) − p_i(ω | x) }² dω.

Comparison of the solutions for the logarithmic and quadratic cases reveals that if, for suitable f,

δ_f{p, q} = ∫ f(p(ω), q(ω)) dω

defines a discrepancy measure between densities p and q, both may be characterised as identifying the M_i for which

Σ_{j∈I} P(M_j | x) δ_f{p_j(ω | x), p_i(ω | x)}

is minimised over i ∈ I, the differences in the two cases corresponding to the form of f (logarithmic or quadratic, respectively).
Example 6.8. (Point prediction versus predictive beliefs). To illustrate the different potential implications of model comparison on the basis of quadratic loss for point prediction versus model comparison on the basis of logarithmic score for predictive belief distributions, consider the following simple (M-closed) example. Suppose that alternative models M_1, M_2 for x = (x_1, ..., x_n) are defined by

M_i: Π_{j=1}^{n} N(x_j | μ, λ_i), μ ~ N(μ | μ_0, λ_0), i = 1, 2,

with λ_1, λ_2, μ_0, λ_0 known: we are thus assuming normal data models with precisions λ_1, λ_2, respectively, and uncertainty about μ described by N(μ | μ_0, λ_0) in both cases. Now, given x, consider two decision problems: the first consists in selecting a model and then providing a point prediction, with respect to quadratic loss, for the next observable, x_{n+1}; the second consists in selecting a model and then providing a predictive distribution for x_{n+1}, with respect to a logarithmic score function. For the first problem, straightforward manipulation shows that the predictive distribution for x_{n+1} assuming model M_i is normal,
with mean μ_n^{(i)}, the posterior mean for μ under M_i, so that, corresponding to the analysis given earlier in this section, model M_i leads to the prediction μ_n^{(i)}, i = 1, 2, and, with two models, the preferred model is M_1 if and only if P(M_1 | x) > P(M_2 | x). To identify these posterior probabilities, we note that, writing s̄² = n^{-1} Σ_j (x_j − x̄)², the Bayes factor B_12(x) is available in closed form and, for small λ_0, is well approximated by a simple function of λ_1, λ_2 and s̄². The posterior model probabilities are then given by the usual posterior odds formula,
where p_12 = P(M_1)/P(M_2). Model M_1 is therefore preferred if and only if

log[B_12(x) p_12] > 0.

In the case of equal prior weights, p_12 = 1 and, assuming small λ_0, if we write the condition in terms of the model variances σ_j² = λ_j^{-1}, j = 1, 2, we prefer M_1 when s̄² lies below a cut-off value determined by σ_1² and σ_2². Noting that s̄² is an intuitively reasonable data-based estimate of the model variance, we see that model choice reduces to a simple cut-off rule in terms of this estimate.

For the second decision problem, the logarithmic divergence of p(x_{n+1} | M_1, x) from p(x_{n+1} | M_2, x) is given, for small λ_0, by a quantity δ_12, with a corresponding expression δ_21 for the divergence in the opposite direction. The general analysis given above thus implies that model M_1 is preferred if and only if

P(M_1 | x) / P(M_2 | x) > δ_12 / δ_21

(rather than > 1, as in the point prediction case). Note, incidentally, that should it happen that P(M_1 | x) = P(M_2 | x), model M_1 would be preferred if and only if δ_12 < δ_21, which happens if and only if λ_1 > λ_2. Intuitively, all other things being equal, we prefer in this case the model with the smallest predictive variance.

To obtain some limited insight into the numerical implications of these results, consider the case where σ_1² = 1, σ_2² = 25, P(M_1) = P(M_2) = 1/2 and s̄² = 3, which gives B_12(x) = 0.393, so that P(M_1 | x) = 0.28 and P(M_2 | x) = 0.72. Using the point prediction with quadratic loss criterion, we therefore prefer M_2. However, δ_12 = 1.129 and δ_21 = 10.39, so that if we want to choose a predictive distribution in accordance with the logarithmic score criterion we prefer M_1, since (0.28)/(0.72) > (1.129)/(10.39). However, for other data values (for example, such that B_12(x) = 0.058, implying P(M_1 | x) = 0.055), the reader might like to verify that M_2 is preferred under both criteria.
6.1.6 Approximation by Cross-validation

For the general problem of model choice followed by a subsequent answer to an inference problem, we have seen that the mixture form of p(ω | x) in the M-closed case enables an explicit form of general solution to be exhibited and that, in the M-completed case, the solution is in principle available, given appropriate computation. We turn now to the case of model comparison within the M-open framework, where an actual belief model, and hence p(ω | x), is not specified. What can then be done to compare the values of ū(m_i | x), i ∈ I?

We shall illustrate a possible approach to this problem by detailed consideration of the special case where ω = y, a future observation, for which a point prediction with respect to quadratic loss, or a predictive distribution with respect to a logarithmic or quadratic score, is required. First, we note that, in all these cases, the expected utility of the choice M_i, followed by the optimal subsequent decision, has the mathematical form

ū(m_i | x) = ∫ f_i(y, x) p(y | x) dy,

for some function f_i of y and x, whose form can be explicitly identified, depending on i ∈ I. For point prediction with quadratic loss, we have f_i(y, x) = −(ŷ_i* − y)²; for a predictive distribution with logarithmic score function we have, ignoring irrelevant terms, f_i(y, x) = log p(y | M_i, x); and with a quadratic score function we have f_i(y, x) = 2p_i(y | x) − ∫ p_i²(y | x) dy. Secondly, we note that there are n possible partitions of x into (x_{n−1}(j), x_j), j = 1, ..., n, where x_{n−1}(j) denotes x with x_j deleted, and that, if the x's are exchangeable and n is reasonably large, each such partition effectively provides x_{n−1}(j) as a "proxy" for x and x_j as a "proxy" for y, in the absence of a specified actual belief model.
It follows that, for computational purposes, if we randomly select k of these n partitions, a standard law of large numbers argument suggests that, for large n, averages over the selected partitions approximate the required expected utilities, so that the models M_i, i ∈ I, can be compared on the basis of the quantities

(1/k) Σ_{j=1}^{k} f_i(x_j, x_{n−1}(j)), i ∈ I.

In the case of point prediction, using squared distance, this implies that we minimise, over i ∈ I,

(1/k) Σ_{j=1}^{k} { x_j − E[y | M_i, x_{n−1}(j)] }²,

which is an average measure, on a leave-one-out-at-a-time basis, of how well M_i predicts a missing part of the data from an available subset of the data. In the case of a predictive distribution with a logarithmic score, writing p_i(y | x) = p(y | M_i, x), we maximise, over i ∈ I,

(1/k) Σ_{j=1}^{k} log p(x_j | M_i, x_{n−1}(j)),

which can be regarded as an average measure, based on the logarithm of the integrated likelihood under model M_i, of how well M_i performs when, on a leave-one-out-at-a-time basis, x is replaced by x_{n−1}(j). In the case of comparing two models, M_1, M_2, this criterion can be given an interesting reformulation in terms of Bayes factors.
Indeed, rearranging the criterion, we see that we prefer model M_1 if

(1/k) Σ_{j=1}^{k} log B_12(x_j, x_{n−1}(j)) > 0,

where B_12(x_j, x_{n−1}(j)) denotes the Bayes factor for M_1 against M_2 based on the versions {p_i(x_j | θ_i), p_i(θ_i | x_{n−1}(j))} of M_1, M_2. Thus, in the above, we are taking a geometric average of Bayes factors which evaluate M_1, M_2 on the basis of the models' ability to predict one further observable, given n − 1 observations. We recall from Section 6.1.4 the role of Bayes factors based on the versions {p_i(x | θ_i), p_i(θ_i)} of M_1, M_2, which evaluate the models on the basis of their ability to "predict" x given no data (beyond what has been used to specify p_i(θ_i)). Although there are clear differences here in formulation (M-open versus M-closed, log-predictive utility versus zero-one loss), it is interesting to note the role played again by the Bayes factor. One interesting difference is the following (Pericchi, 1993): if, under model M_1, there are x_j which are "surprising" in the light of x_{n−1}(j), thus leading to large squared distance terms or small log-integrated-likelihood values, the performance measure will penalise M_1; this results in a preference for models under which the data achieve the highest levels of "internal consistency". The former situation puts the emphasis on "fidelity to the observed data"; the latter puts the emphasis on "future predictive power". These kinds of approximate performance measurements for comparing models could obviously be generalised by considering random partitions of x involving leave-several-out-at-a-time techniques. Model choice and estimation procedures involving cross-validation (sometimes called predictive sample reuse) have been proposed by several authors.
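The leave-one-out criterion under the logarithmic score can be sketched as follows. This is an illustrative sketch only: it assumes two hypothetical normal models with known variances and a vague prior on the common mean, for which the leave-one-out predictive for x_j is approximately N(x_j | mean of the remaining points, σ²(1 + 1/(n−1))); the data are invented:

```python
from math import log, pi

def norm_logpdf(x, mu, var):
    return -0.5 * log(2 * pi * var) - 0.5 * (x - mu) ** 2 / var

# Average leave-one-out log predictive density for a model with data
# variance sigma2 (vague prior on the mean, so the predictive is centred
# at the mean of the held-in points with inflated variance).
def loo_score(xs, sigma2):
    n = len(xs)
    total = 0.0
    for j, xj in enumerate(xs):
        rest = xs[:j] + xs[j + 1:]
        mu = sum(rest) / (n - 1)
        total += norm_logpdf(xj, mu, sigma2 * (1 + 1 / (n - 1)))
    return total / n

data = [0.1, -0.4, 0.3, 0.2, -0.1, 0.5, -0.2, 0.0]
# The data are concentrated near zero, so the tighter model scores better.
print(loo_score(data, sigma2=0.1) > loo_score(data, sigma2=25.0))
```

The difference of the two average scores is, up to the factor 1/k, exactly the logarithm of the geometric average of leave-one-out Bayes factors described above.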
We shall not develop such ideas further here, apart from giving one further interesting illustration below, but merely note that the above approximation to the optimal Bayesian procedure leads naturally to a cross-validation process. Such procedures have been proposed, for the most part, from a mainly non-Bayesian perspective, as a pragmatic device for generating statistical methods without seeming to invoke a "true" sampling model: see, for example, Stone (1974) and Geisser (1975) for early accounts and Shao (1993) for a recent perspective. The above development clearly establishes that such cross-validatory techniques do indeed have an interesting role in a Bayesian decision-theoretic
setting for approximating expected utilities in decision problems where a set of alternative models are to be compared without wishing to act as if one of the models were "true".

Example 6.9. (Lindley's paradox revisited). In Example 6.6, we considered the case of two alternative models, M_1, M_2, corresponding to simple and composite hypotheses about μ in N(x_i | μ, λ) and defined by

M_1: p_1(x) = Π_{i=1}^{n} N(x_i | μ_0, λ), μ_0, λ known,
M_2: p_2(x) = ∫ Π_{i=1}^{n} N(x_i | μ, λ) N(μ | μ_1, λ_1) dμ, μ_1, λ_1, λ known.

The analysis given there was within the M-closed context, with P(M_i) > 0, i = 1, 2, and it was shown that, as λ_1 → 0, P(M_1 | x) → 1 for any fixed x; in particular, as λ_1 → 0, M_1 would be the preferred model under either zero-one utility or quadratic loss utility for point prediction (since, in this latter case, the criterion reduces to the comparison of posterior probabilities when just two models are being compared).

We shall now reconsider the case of quadratic loss for point prediction in the M-open context, in the absence of a specified actual belief model. First, we note that, under M_1, the optimal prediction of a future observation y is just μ_0, whereas (making appropriate notational changes to the results given in Example 5.10) under M_2 the optimal prediction based on x_{n−1}(j) is

ŷ_j* = ω_n x̄_{n−1}(j) + (1 − ω_n) μ_1, ω_n = (n − 1)λ / (λ_1 + (n − 1)λ),

where x̄_{n−1}(j) is the mean of the sample x with x_j omitted. Secondly, from the cross-validation approximation analysis given above, we see that M_1 is preferred to M_2 if and only if, based on k random partitions of x into (x_{n−1}(j), x_j),

Σ_{j=1}^{k} (x_j − μ_0)² < Σ_{j=1}^{k} (x_j − ŷ_j*)².

Intuitively, M_2 will be preferred if the posterior mean on average does better as a predictor than μ_0. In particular, we note that, as λ_1 → 0, ŷ_j* → x̄_{n−1}(j), so that, in marked contrast to the M-closed analysis, M_1 is no longer automatically preferred.
6.1 Model Comparison   407

This is easily seen (Pericchi, 1993) to be equivalent to preferring M2 if and only if the observed value of nλ(x̄ − μ0)², a χ²(1) random quantity under M1, exceeds the value 2; for the case of unknown λ, the corresponding criterion is equivalent to rejecting M1 if a Snedecor F statistic exceeds the value 2. See Leonard and Ord (1976) for a related argument.

This result provides a marked contrast to that obtained in Example 6.6, and makes clear that, even given the same data and utility criterion, preferences among models in M may differ radically, depending on whether one is approaching their comparison from an M-closed or M-open perspective.

6.1.7 Covariate Selection

We have already had occasion to remark several times that our emphasis in this volume is on concepts and theory, and that complex case-studies and associated computation will be more appropriately discussed in subsequent volumes. That said, it might be illuminating at this juncture to indicate briefly how the theory we have developed above can be applied in contexts which are much more complicated than those of the simple, stylised examples on which most of our discussion has been based. To this end, we shall consider the important problem of model comparison which arises when we try to identify appropriate covariates for use in practical prediction and classification problems.

To fix ideas, consider the following problem. Some kind of decision is to be made regarding an unknown state of the world ω relating to an individual unit of a population: for example, classifying the, as yet unknown, disease state of a specific patient, or predicting the, as yet unknown, quality level of the output from a particular run of an industrial production process. To aid the modelling process, a data bank (of "training data") is available, consisting of D = {(ωj, zj), j = 1, ..., n}, recording all the attributes and (eventually known) states of the world for n previously observed units of the same population: for example, n previous patients presenting at the same clinic, or n previous runs of the same production process. Here, z = (z1, ..., zs) represents all possible observed relevant attributes (discrete or continuous) for the individual population unit: for example, the patient's complete recorded clinical history, or a record of all the input and control parameters of the production run.

Possible predictive models are to be based on covariates y1(z), ..., ym(z), which are themselves selected functions of z = (z1, ..., zs), for various choices of m. We shall suppose that the ultimate objective is to provide, for the state of the world ω of the new individual unit, a predictive distribution p(ω | y(z), D).
408   6 Remodelling

Here y denotes a generic element of the set of all possible {yi(·), i = 1, ..., m} under consideration for defining covariates; typically, it will include functions mapping z to zi, i = 1, ..., s, so that individual attributes themselves are also eligible to be chosen as covariates. The particular forms in Y will depend, of course, on the practical problem under consideration, where Y denotes the class of all y under consideration. If ω is discrete, we typically refer to the problem as one of classification; if ω is continuous, we refer to it as one of prediction.

To simplify the exposition, we shall suppose that identification of the density p(· | y(z), D) is equivalent to the identification of y ∈ Y. The resulting decision problem is then shown schematically in Figure 6.4.

Figure 6.4   Selection of covariates as a decision problem

If p(z, ω | D) represents the predictive distribution for (z, ω) given the "training data" D, and u{y(·), ω} denotes a utility function for using the predictive form p(· | y(z), D) when ω turns out to be the true state of the world, the different possible models, corresponding to the different possible choices of covariates, are then compared on the basis of their expected utilities

ū(y(·)) = ∫∫ u{y(·), ω} p(z, ω | D) dω dz.

The resulting optimal choice will, of course, depend on the form of the utility function. Typically, the latter will not only incorporate a score function component for assessing p(· | y(z), D), but possibly also a cost component, reflecting the different costs associated with the use of different covariates y. For example, in the case of predicting production quality, the use of fewer covariates could cut costs by requiring less on-line measurement and monitoring; in the case of disease classification, the use of fewer covariates could well mean cheaper and quicker diagnoses. If we suppose, for simplicity, that the utility function can be decomposed into additive score and cost components, the expected utility separates correspondingly.
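To make the structure of this covariate-selection calculation concrete, the following sketch compares candidate covariate sets by a leave-one-out logarithmic score penalised by a cost per covariate. It is an illustration only: the training bank, the Laplace-smoothed cell predictive and the cost value are all invented for the purpose.

```python
import math

# Hypothetical training bank D of (z, omega) pairs with binary attributes
# z = (z1, z2); z1 carries most of the information about omega, z2 is noise
D = ([((1, 0), 1)] * 18 + [((1, 1), 1)] * 18 + [((1, 0), 0)] * 2 +
     [((1, 1), 0)] * 2 + [((0, 0), 0)] * 18 + [((0, 1), 0)] * 18 +
     [((0, 0), 1)] * 2 + [((0, 1), 1)] * 2)

COST = 0.02      # assumed (illustrative) cost per covariate used

def pred_prob(y, cell, w, data):
    """Laplace-smoothed p(omega = w | y(z) = cell, data)."""
    match = [om for (z, om) in data if tuple(z[i] for i in y) == cell]
    return (sum(1 for om in match if om == w) + 1) / (len(match) + 2)

def expected_utility(y):
    """Leave-one-out average log-score minus the cost of the covariates y."""
    score = 0.0
    for j, (z, om) in enumerate(D):
        rest = D[:j] + D[j + 1:]
        score += math.log(pred_prob(y, tuple(z[i] for i in y), om, rest))
    return score / len(D) - COST * len(y)

candidates = [(), (0,), (1,), (0, 1)]     # candidate covariate index sets
best = max(candidates, key=expected_utility)
```

With these data the single informative attribute is selected: the noise attribute adds cost without improving the score, and using no covariates loses predictive sharpness.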
6.2 Model Rejection   409

With such a decomposition, the expected utility of the choice y is given by

ū(y(·)) = s̄(y(·)) − c̄(y(·)),

the difference between the expected score and the expected cost of using the covariates y. In many cases, it will be natural to use proper score functions, for example, quadratic or logarithmic. If costs are omitted, the optimal model will typically involve a large number of covariates; in fact, if cost functions are used which increase with the number of covariates in the model, a small subset of the latter will typically be optimal, since the expected score component is typically concave, reflecting diminishing marginal expected utility for the incorporation of further covariates. More pragmatically, one could ignore costs, identify the optimal y(i), i = 1, 2, ..., over all possible choices of one covariate, two covariates, etc., and hence select that y(i) for which ū(y(i+1) | D) − ū(y(i) | D) is less than some appropriately predefined small constant.

In most applications, p(z, ω | D) is likely to prove far too complicated for any honest representation, so that an M-closed perspective would not usually be appropriate, and we need to perform a comparison of the models in Y from the M-open perspective. Given the complexity of problems of this type, there are interesting open problems in the development of the cross-validation techniques that might be employed in particular cases, but discussion of these would take us far into the realm of methods and case-studies, and so will be deferred to the second and third volumes of this work.

6.2 MODEL REJECTION

6.2.1 Model Rejection through Model Comparison

In the previous section, we considered model comparison problems arising from the existence of a proposed range of well-defined possible models, M = {Mi, i ∈ I}, where the primary decision consisted in choosing mi, i ∈ I, with the implication of subsequently acting as if the corresponding Mi were the predictive model. In this section, we shall be concerned with the situation which arises when just one specific well-defined model, M0, for observations x, has been proposed initially, and the primary decision corresponds either to the choice m0, which corresponds to subsequently acting as if M0 were the predictive model, or to the choice m0̄ (thus, in the terminology introduced above, rejecting M0), with the implication of "doing something else".
Figure 6.5   Model rejection as a decision problem

This model rejection problem might be represented schematically by Figure 6.5, where ū(mi | z) denotes the ultimate expected utility of a primary action, given z. However, we see from Figure 6.5 that we cannot proceed further unless we have some method for arriving at a value of ū(m0̄ | z). What perspectives might one adopt in relation to this, thus far clearly ill-defined, problem of model rejection? If we are concerned with coherent action in the context of a well-posed decision problem, we are forced to consider alternative models to M0. This might be done by consideration of actual alternatives to M0, thought (by someone) to be of practical interest, particularly where M0 has been put forward, as a consequence of the application of some kind of principle of simplicity or parsimony, as an attempt to "get away with" using M0 instead of using more complicated (but unstated) alternatives. Or it might be done by consideration of formal alternatives, generated by selecting, in some way, a "mathematical neighbourhood" of M0. Such a structure may arise, for example, as a consequence of M0 being the only predictive model thus far put forward in a specific decision problem context (which might also, of course, contain alternatives of practical interest).

One way or another, let us suppose therefore that we have embedded M0, in some way, in some larger class of models M = {Mi, i ∈ I}. For this redefined problem of model rejection within M, the hitherto undefined value of ū(m0̄ | z) becomes

ū(m0̄ | z) = max over i ∈ I′ of ū(mi | z),

where I′ = I − {0} indexes the models in M distinct from M0, to compare with ū(m0 | z). For any specific decision problem, the calculation of the ū(mi | z), i ∈ I, proceeds as indicated in Section 6.1. Thus, if we adopt the M-closed perspective, evaluations are based on mixture forms involving prior and posterior probabilities of the Mi, i ∈ I; if we adopt the M-completed perspective, the calculation is, in principle, well-defined, but may be numerically involved; and if we adopt the M-open perspective, we may use a cross-validation procedure to estimate the expected utilities.
Figure 6.6   Model rejection within M = {Mi, i ∈ I}

Thus, the redefinition of the problem of model rejection as one of model comparison corresponds to modifying slightly the representation given in Figure 6.5, as shown in Figure 6.6, so that rejecting m0 corresponds to choosing the best of the mi, i ∈ I′. From this perspective, we might regard the redefined model rejection problem as essentially identical to the model comparison problem. However, this would seem to ignore the fact that, when M0 has been put forward for reasons of simplicity or parsimony, there is an implicit assumption that the latter has some "extra utility": even if ū(m0̄ | z) − ū(m0 | z) were positive, but not "too large", we might still prefer M0 because of the special "simple" status of M0. The same argument applies even more forcibly in the case where {Mi, i ∈ I′} consists of formal alternatives to M0, since rejecting M0 may not then lead obviously to an actual alternative model, and the "extra utility" of choosing M0 if at all possible may be greater. This might be captured by replacing ū(m0 | z) by ū(m0 | z) + ε(z), where ε(z) represents an implicit (but as yet undefined) extra utility relating to the special status of M0, over and above the expected utility ū(m0 | z). (See Dickey and Kadane, 1980, for related discussion.)

The formulation of the model rejection problem given above is rather too general to develop further in any detail. In order, therefore, to provide concrete illustrative analyses, we shall assume, for the remainder of this chapter, that M = {M0, M1}, where, given some parametric family {p(· | θ), θ ∈ Θ}, predictive models for x are defined. Specifically, M0 will correspond to either:

(i) p(x | θ0), for some θ0 ∈ Θ, a simple hypothesis that θ = θ0 (specified by a degenerate prior p0(θ) which concentrates on θ0); or

(ii) p(x | φ0, λ), a simple hypothesis on a parameter of interest φ, where λ is a nuisance parameter (specified by a prior p0(θ) = p(λ | φ0) which concentrates on the subspace defined by φ = φ0).
The next three sections consider some detailed model rejection procedures within this parametric framework.

6.2.2 Discrepancy Measures for Model Rejection

Within the parametric framework described at the end of the previous section, model M0 corresponds to a form of parametric restriction (or "null hypothesis") imposed on model M1, within which M0 is embedded. In such situations, it is common practice to consider the decision problem of model rejection to be that of assessing the compatibility of model M0 with the data z, this being calibrated relative to the wider model M1. We shall focus on this version of the model rejection problem, with M = {M0, M1} and utilities defined by u(m0, θ), u(m1, θ), θ ∈ Θ. Noting that there are only two alternatives in this decision problem, it suffices to specify the (conditional) difference in utilities,

δ(θ) = u(m1, θ) − u(m0, θ),   θ ∈ Θ.

We shall refer to δ(θ) as a (utility-based) discrepancy measure between M0 and M1 when θ ∈ Θ is the true parameter. In terms of the discrepancy, and overall beliefs, p(θ | z), for given data z, defined either by an M-closed form, or by the {M1}-closed form, p(θ | M1, z), the latter providing a kind of "adversarial" analysis, since it assigns M0 no special status, the optimal action is to reject model M0, say in favour of the larger model M1, if and only if

t(z) = ∫ δ(θ) p(θ | z) dθ > ε(z),

since the optimal inference will clearly be to reject model M0 if and only if the expected discrepancy exceeds ε(z), which represents, as before, the utility premium attached to keeping the simpler model M0. With a considerable reinterpretation of conventional statistical terminology, we might refer to t(·) as a test statistic, leading to the rejection of model M0 if the observed value of the test statistic exceeds a critical value, c(z) = ε(z).
How might c(z) be chosen? One possible approach could be to consider a choice of c(·) which would lead M0 to only be rejected with low probability (α, say), conditional on information available prior to observing x and assuming M0 to be true. Under this approach, we would choose c(·) such that

P{t(x) > c(x) | M0} = α,

thus obtaining the (1 − α)th percentage point of the predictive distribution of t(x), for values of α of the order of 0.05 or 0.01. Of course, this is just one possible approach to selecting c(·) and has no special theoretical significance (see, for example, Box, 1980, for a non-decision-theoretic approach, and Appendix B for criticism of the practice of working with a fixed α). It is interesting that this choice turns out to lead to commonly used procedures for model rejection which have typically not been derived or justified previously from the perspective of a decision problem. Examples will be given in the following two sections.

6.2.3 Zero-one Discrepancies

Suppose that the discrepancy measure introduced in Section 6.2.2 is defined to be

δ(θ) = 1(θ ≠ θ0), if θ0 specifies a simple hypothesis;
δ(θ) = 1(φ ≠ φ0), if θ = (φ, λ) and M0 specifies φ = φ0,

where λ is a nuisance parameter. Assuming the decision problem of model rejection to be defined from the M-closed perspective, with

p(x | M0) = ∫ p(x | φ0, λ) p(λ | M0) dλ,   p(x | M1) = ∫ p(x | θ) p(θ | M1) dθ,

and p(x) = P(M0) p(x | M0) + P(M1) p(x | M1), we obtain

t(z) = ∫ δ(θ) p(θ | z) dθ = P(M1 | x) = P(M1) p(x | M1) / {P(M0) p(x | M0) + P(M1) p(x | M1)}.

It follows from the analysis given in the previous section that M0 is rejected if and only if, for specified critical value c(z), t(z) = P(M1 | x) > c(z).
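The choice of c(·) described above can be implemented directly by simulation. As a sketch, for a normal location null hypothesis with statistic t(x) = nλ(x̄ − μ0)² and λ = 1 assumed known, the critical value is the empirical (1 − α) percentage point of the predictive distribution of t under M0:

```python
import math, random

random.seed(0)
mu0, lam, n, alpha, reps = 0.0, 1.0, 25, 0.05, 20000

# Predictive distribution of the test statistic under M0, by simulation:
# t(x) = n*lam*(xbar - mu0)^2, the squared standardised distance from mu0
tvals = []
for _ in range(reps):
    xbar = sum(random.gauss(mu0, 1 / math.sqrt(lam)) for _ in range(n)) / n
    tvals.append(n * lam * (xbar - mu0) ** 2)
tvals.sort()
c = tvals[int((1 - alpha) * reps)]   # empirical (1 - alpha) percentage point
```

Under M0 the statistic is chi-squared with one degree of freedom, so the simulated critical value should be close to the exact 95% point, 3.841; this is the sense in which the decision-theoretic choice of c(·) reproduces a commonly used rejection procedure.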
Equivalently, M0 is rejected if the posterior odds on M0 and against M1 are smaller than {1 − c(z)}/c(z). In the case where the prior odds are equal, P(M0) = P(M1) = 1/2, the rejection criterion for M0 in terms of the Bayes factor B = p(x | M0)/p(x | M1) is given by

B < {1 − c(z)}/c(z).

On this particular approach to the choice of c(z), having decided that M0 should only be rejected with probability α when M0 is true, the critical value c(z) = c(α, n), for given n, is defined by the equation P{B < (1 − c(α, n))/c(α, n) | M0} = α, which calibrates the procedure to only reject M0 with probability α when M0 is true. Of course, there is no real need to identify it! The rejection procedure is simply defined by comparing the test statistic value with its tail critical value.

Example 6.10. (Null hypothesis for a binomial parameter). Suppose that x represents r successes in n binomial trials, with M1 defined by

M1: p(r | θ) = C(n, r) θ^r (1 − θ)^(n−r),   p(θ | M1) = 1, 0 ≤ θ ≤ 1,

where C(n, r) denotes the binomial coefficient, and P(M0) = P(M1) = 1/2. Straightforward manipulation shows that

B = (n + 1) C(n, r) θ0^r (1 − θ0)^(n−r).

By considering 2 log B, and applying Stirling's formula,

log Γ(n + 1) ≈ (n + 1/2) log n − n + (1/2) log 2π + (12n)^(-1),

this can be approximated, for large n, by

2 log B ≈ log{n / (2π θ0(1 − θ0))} − χ²,   χ² = (r − nθ0)² / {nθ0(1 − θ0)},

where χ² is the usual chi-squared test statistic. Assuming, for purposes of illustration, that M0 is only to be rejected with probability α when M0 is true, the procedure is calibrated by the upper 100α% point of a χ²(1) distribution. The reader might like to verify (perhaps with the aid of Jeffreys, 1939/1961, Chapter 5) that similar results can be obtained for a variety of "null models" in more general contingency tables.
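The large-n approximation in Example 6.10 is easily checked numerically. The sketch below assumes the Bayes factor is taken in favour of M0, with a uniform prior on θ under M1, and compares the exact value of 2 log B with the chi-squared form:

```python
import math

def two_log_B01(r, n, theta0):
    """Exact 2 log B for H0: theta = theta0 against a uniform prior on theta:
    B = (n + 1) * C(n, r) * theta0^r * (1 - theta0)^(n - r)."""
    log_b = (math.log(n + 1) + math.lgamma(n + 1) - math.lgamma(r + 1)
             - math.lgamma(n - r + 1) + r * math.log(theta0)
             + (n - r) * math.log(1 - theta0))
    return 2 * log_b

def approx_two_log_B01(r, n, theta0):
    """Stirling (large-n) approximation in terms of the usual chi-squared
    statistic: 2 log B ~ log{n / (2 pi theta0 (1 - theta0))} - chi2."""
    chi2 = (r - n * theta0) ** 2 / (n * theta0 * (1 - theta0))
    return math.log(n / (2 * math.pi * theta0 * (1 - theta0))) - chi2

exact = two_log_B01(60, 100, 0.5)
approx = approx_two_log_B01(60, 100, 0.5)
```

For n = 100, θ0 = 1/2 and r = 60 the exact and approximate values already agree to a few hundredths, and the approximation improves as n grows.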
6.2.4 General Discrepancies

Given our systematic use throughout this volume of the logarithmic divergence between two distributions, an obviously appealing form of discrepancy measure is given by

δ(θ) = ∫ p(x | θ) log{p(x | θ) / p(x | θ0)} dx,

where p(x | θ) is the general parametric form of model M1. In the case of a location-scale model, θ = (μ, σ), θ0 = (μ0, σ), we might consider the standardised measure

δ(θ) = {(μ − μ0)/σ}².

In any case, the general prescription will be to reject M0 if t(z) > c(z), for some appropriate c(z), where

t(z) = ∫ δ(θ) p(θ | z) dθ,

with p(θ | z) derived either from an M-closed model, or from the "adversarial" form corresponding to assuming model M1.

Example 6.11. (Lindley's paradox revisited, again). λ known. In Examples 6.6 and 6.9 we considered the use of models M1, M2 defined, for x = (x1, ..., xn), by

M1: p1(x) = ∏(i=1..n) N(xi | μ0, λ),   M2: p2(x) = ∏(i=1..n) N(xi | μ, λ).

Using the logarithmic divergence discrepancy, we obtain

δ(μ) = (n/2) λ (μ − μ0)²,

which is just a multiple (by n/2) of a natural, standardised, measure (the non-centrality parameter) suggested by intuition as a discrepancy measure for a location-scale family.
Assuming the reference prior for μ derived from an {M2}-closed perspective, which is easily seen to be uniform, we have the reference posterior p(μ | x) = N(μ | x̄, nλ). It follows that

t(x) = ∫ (n/2) λ (μ − μ0)² N(μ | x̄, nλ) dμ = (1/2){1 + nλ(x̄ − μ0)²},

and hence that the statistic v(x) = √n (x̄ − μ0) √λ is seen to be a version of the standard significance test statistic for a normal location null hypothesis. With respect to p(x | μ0, λ), this has an N(0, 1) distribution, and the appropriately calibrated value of c(x) is thus implicitly defined, by rejecting M1 if |v(x)| exceeds the upper 100(α/2)% point of an N(0, 1) distribution.

The above analysis is easily generalised to the case of unknown λ. Here, the reference posterior for μ has the Student form p(μ | x) = St(μ | x̄, (n − 1)/s², n − 1), where ns² = Σj(xj − x̄)². With S² = ns²/(n − 1), the statistic w(x) = √n (x̄ − μ0)/S is seen to be a version of the standard significance test statistic for a normal location null hypothesis in the presence of an unknown scale parameter, σ² = λ^(-1). With respect to p(x | μ0, λ), this has a Student t distribution with n − 1 degrees of freedom, and the appropriately calibrated value of c(x) is defined by the standard rejection procedure.

The reader can easily extend the above analyses to other stylised test situations: for example, testing the equality of means in two independent normal samples, with known or unknown (equal) precisions. Multivariate normal location cases are also easily dealt with, the logarithmic divergence discrepancy in this case being proportional to the Mahalanobis distance (see Ferrándiz, 1985). Rueda (1992) provides the general expressions for one-dimensional regular exponential family models. We shall not pursue such cases further here, since it seems to us that detailed discussion of model rejection and comparison procedures all too easily becomes artificial outside the disciplined context of real applications of the kind we shall introduce in the second and third volumes of this work.
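The two test statistics derived above can be computed directly. The following is a minimal sketch, with a synthetic sample of location roughly 0.5 and λ = 1 treated as known in the first case:

```python
import math

# Synthetic sample (illustration only): location near 0.5, scale near 1
x = [0.5 + math.sin(j) for j in range(30)]
n = len(x)
xbar = sum(x) / n
mu0, lam = 0.0, 1.0        # null location; lam known in the first case

# Known precision: v(x) = sqrt(n*lam)*(xbar - mu0) is N(0, 1) under the null
v = math.sqrt(n * lam) * (xbar - mu0)
reject_known = abs(v) > 1.96   # upper 100(alpha/2)% point of N(0,1), alpha = 0.05

# Unknown precision: with n*s2 = sum (xj - xbar)^2 and S^2 = n*s2/(n - 1),
# w(x) = sqrt(n)*(xbar - mu0)/S is Student t with n - 1 d.f. under the null
ns2 = sum((xj - xbar) ** 2 for xj in x)
S = math.sqrt(ns2 / (n - 1))
w = math.sqrt(n) * (xbar - mu0) / S
```

With these data both statistics comfortably exceed their 5% critical values, so the simple model M1 is rejected under either analysis.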
6.3 Discussion and Further References   417

From the perspective of this volume, we have taken the analyses of this chapter sufficiently far to demonstrate the creative possibilities for model choice and comparison within the disciplined framework of Bayesian decision theory.

6.3 DISCUSSION AND FURTHER REFERENCES

6.3.1 Overview

We have argued that, both from the perspective of a sensitive individual modeller, and also from that of a group of modellers, there are frequently strong reasons for considering a range of possible models. This obviously leads to the problem of model comparison, or model choice, and our approach has been to consider formally a decision problem where the action space is the class of available models. In this setting, we have shown that "natural" Bayesian solutions, such as choosing the model with the highest posterior probability, are obtained as particular cases of the general structure for stylised, appropriately chosen, loss functions.

We have also considered the generally ill-posed problem of model rejection, where the primary decision consists in acting as if the proposed model were true without having specific alternatives in mind, and have shown that useful results may be obtained by embedding the proposed model within a larger class, and then using discrepancy measures as loss functions in order to decide whether or not the original simpler model may be retained after all.

There is an extensive Bayesian literature directly related to the issues discussed in this chapter. Some authors adopt a purely inferential approach, by deriving either posterior probabilities, or Bayes factors, for competing models; see, for example, Karlin and Rubin (1956), Lindley (1965, 1977), Dickey and Lientz (1970), Zellner (1971), Dickey (1971, 1972), Leamer (1978), Bernardo (1980), Smith and Spiegelhalter (1980), Zellner and Siow (1980), Bernardo (1982, 1985a), Spiegelhalter and Smith (1982), Zellner (1984), Berger and Delampady (1987), Poskitt (1987), Pettit and Young (1990), Aitkin (1991), Gómez-Villegas and Gómez (1992), Kass and Vaidyanathan (1992), McCulloch and Rossi (1992) and Lindley (1993). Others openly adopt a decision-theoretic approach; see, for example, Schlaifer (1961), Raiffa and Schlaifer (1961), Box and Hill (1967), DeGroot (1970), San Martini and Spezzaferri (1984), Berger (1985a), Bernardo and Bayarri (1985), Ferrándiz (1985), Felsenstein (1992) and Rueda (1992).
6.3.2 Modelling and Remodelling

We have already argued that we see Bayesian statistics as a rather formalised procedure for inference and decision making within a well-defined probabilistic structure. Fully specified belief models are an integral part of this structure, but it would be highly unrealistic to expect that in any particular application such a belief model will be general enough to pragmatically encompass a defensible description of reality from the very beginning. In practice, we typically first consider simple models, which may have been informally suggested by a combination of mainly exploratory data analysis, graphical analysis and prior experience with similar situations. And even with such a simple model, more formal investigation of its adequacy, and of the consequences of using it, will often be necessary before one is prepared to seriously consider the model as a predictive specification. Such investigations will typically include residual analysis, identification of outliers and/or influential data, cluster analysis, and the behaviour of diagnostic statistics when compared with their predictive distributions. We see these very useful activities as part of the informal process that necessarily precedes the formulation of a model which we can then seriously entertain as an adequate predictive tool. As a consequence of this probing, a class of alternative models will typically emerge, and in this chapter we have discussed some of the procedures which may be useful in a formal comparison of such alternative models. The outcome of this strategy will typically be a more refined model, for which a similar type of analysis may be repeated again.

This remodelling process is never fully completed, in that either new data, more time, or an imaginative new idea, may force one to make yet another iteration towards the never attainable "perfect", all-powerful predictive machine. Naturally, a pragmatic combination of time constraints, data limitations, and capacity of imagination, will force this sequence of informal exploration and formal analysis to eventually settle on the use of a particular belief model, which hopefully can be defended as a sensible and useful conceptual representation of the problem. We shall not elaborate further on this here.

Some relevant references are Johnson and Geisser (1982, 1983, 1985), Pettit and Smith (1983, 1985), Pettit (1986), Bayarri and DeGroot (1987, 1992a), Geisser (1987, 1990, 1992, 1993), Chaloner and Brant (1988), McCulloch (1989), Verdinelli and Wasserman (1991), Gelfand, Dey and Chang (1992), Weiss and Cook (1992), Guttman and Peña (1993), Peña and Guttman (1993), and Chaloner (1994). Bayarri and DeGroot (1987, 1992a) provide a Bayesian analysis of selection models, where data are randomly selected from a proper subset of the sample space rather than from the entire population.

6.3.3 Critical Issues

We shall comment further on six aspects of the general topic of remodelling, under the following subheadings: (i) Model Choice and Model Criticism; (ii) Inference and Decision; (iii) Overfitting and Cross-validation; (iv) Improper Priors; (v) Scientific Reporting; and (vi) Computer Software.
Model Choice and Model Criticism

We have reviewed several procedures which, under different headings such as model comparison, model choice or model selection, may be used to choose among a class of alternative models. As mentioned before, from a decision-theoretic point of view, the problem of accepting that a particular model is suitable is ill-defined unless an alternative is formally considered. However, partly due to the classical heritage of significance testing, and given the obvious attraction of being able to check the adequacy of a given model without explicit consideration of any alternative, non-decision-theoretic Bayesians have often tried to produce procedures which evaluate the compatibility of the data with specific models.

As clearly stated by Box (1980), while the posterior distribution of the parameters only permits the estimation of the parameters conditional on the adequacy of the entertained model, the predictive distribution makes possible criticism of the entertained model in the light of current data. Moreover, the predictive distributions which correspond to different models are comparable among themselves, while, in general, the posteriors are not.

The use of predictive distributions to check model assumptions was pioneered by Jeffreys (1939/1961). The basic idea consists of defining a set of appropriate diagnostic functions of the data, t = t(z(n+1), ..., z(n+k)), and comparing their actual values in a sample (z(n+1), ..., z(n+k)) from the same population with their predictive distributions p(t | z1, ..., zn), based on a different sample. Possible comparisons include checking whether or not the observed t's belong to appropriate predictive HPD intervals, or determining the predictive probability of observing t's more "outlying" than those observed. The reader will readily appreciate that common techniques such as residual analysis, identification of influential observations, segregation of homogeneous clusters, or outlier screening, can all be reformulated as particular implementations of this general framework. Additional references include Geisser (1966, 1971, 1985, 1987, 1988, 1993), Box and Tiao (1973), Dempster (1975), Geisser and Eddy (1979), Rubin (1984), Bernardo and Bermúdez (1985), Clayton et al. (1986), Hodges (1990, 1992), Gelfand, Dey and Chang (1992) and Girón et al. (1992).

However, if a formal decision is to be reached on whether or not to operate with a given model, it seems to us
inescapable that some form of alternative must be considered, for example, by explicitly evaluating the degree of compatibility between the observed t and its predictive distribution. For further discussion of model choice, see Section 6.1 and, for example, Winkler (1980), Klein and Brown (1984), Krasker (1984), Florens and Mouchart (1985), Poirier (1985), Hill (1986, 1990), Skene et al. (1986) and West (1986).

Inference and Decision

Throughout this volume, we have emphasised the advantages of using a formal decision-oriented approach to the stylised statistical problems which represent a large proportion of the theoretical statistical literature. These advantages are specially obvious in model comparison since, by requiring the specification of an appropriate utility function, they make explicit the identification of those aspects of the model which really matter. Very often, the consequences of entertaining a particular model may usefully be examined in terms of the discrepancies between the prediction provided by the model for the value of a relevant observable vector, and its actual, eventually observed, value. Scoring rules, of the general type u(p(· | z1, ..., zn), t), provide natural utility functions to use in this context. Moreover, we have seen that the more traditional Bayesian approaches to model comparison, such as determining the posterior probabilities of competing models or computing the relevant Bayes factors, can be obtained as particular cases of the general structure by using appropriately chosen, stylised utility functions.

Overfitting and Cross-Validation

If we are hoping for a positive evaluation of a prediction, it is crucial that the predictive distribution is based on data which do not include the value to be predicted; otherwise, severe overfitting may occur. Although it is sometimes possible to check the predictions of the model under investigation by using a totally different set of data from that used to develop the model, it is far more common to be obliged to do both model construction and model checking with the same data set. A natural solution consists of randomly partitioning the available sample z = {x1, ..., xn} into two subsamples, one of which is used to produce the relevant predictive distributions, and the other to compute the diagnostic functions, the procedure then being repeated as many times as is necessary to reach stable conclusions. This technique is usually known as cross-validation. For a discussion of how cross-validation may be seen as approximating formal Bayes procedures, see Peña and Tiao (1992), Gelfand, Dey and Chang (1992), and references therein.
A possible systematic approach to cross-validation, starting from a sample of size n, involves the following steps:

(i) Define a sample size k, where k ≤ n/2 is large enough to evaluate the relevant observable function t = t(x1, ..., xk), and a model {p(x | θ), p(θ)}. The observable function could either be that predictive aspect of the model which is of interest, as described by the utility function, or a diagnostic function, as described in the above approach to model criticism.

(ii) Determine the set of all predictive distributions of the form

p(t(aj) | a[j]),

where aj is a subsample of x of size k and a[j] consists of all the xi's in x which are not in aj.

(iii) Estimate the expected utility of the model by

ū ≈ (1/G) Σ(j=1..G) u{p(t(aj) | a[j]), t(aj)}.

Note that the last expression is simply a Monte Carlo approximation to the exact value of the expected utility. We also note that this programme may be carried out with reference distributions, since the corresponding reference (posterior) predictive distributions π(t(aj) | a[j]) will be proper even if the reference prior π(θ) is not.

Improper Priors

In the context of analysis predicated on a fixed model, we have seen in Chapter 5 that perfectly proper posterior parametric and predictive inferences can be obtained for improper prior specifications. When it comes to comparing models, however, the use of improper priors is in general much more problematic. We first note that the predictive quantities pi(x) typically play a key role in model comparisons for a range of specific decision problems and perspectives on model choice; if all the priors are proper, no problem arises. But if one or more of the pi(θi) is not a proper density, the corresponding pi(x)'s will also be improper. Moreover, with the reference prior approach, some models are implicitly disadvantaged relative to others.
There is, moreover, an inherent difficulty with these methods when the models compared have different dimensions: even for improper (non-pathological) priors, Lindley's paradox, discussed earlier in this chapter, is a well known simple example of this behaviour. Another possible solution to the problem of comparing two models in the case of improper prior specification for the parameters is to exploit the use of cross-validation as a Monte Carlo approximation to a Bayes decision rule. As we saw in Section 6.1, the conventional Bayes factor is used to assess the models' ability to "predict z" from {p_1(z | θ_1), p_1(θ_1)} and {p_2(z | θ_2), p_2(θ_2)}, and the latter comparison is undermined when one of the prior components is improper.
6.3 Discussion and Further References 423

The closest we can come to this, and overcome the problem of the impropriety of the p_i(θ_i), is to take partitions of the form x = [x_s(j), x_{n-s}(j)], where s(j) ≥ 1 is the smallest integer such that both p_1(θ_1 | x_s(j)) and p_2(θ_2 | x_s(j)) are proper. Write B_12(x_{n-s}(j) | x_s(j)) for the Bayes factor for M_1 against M_2 based on the versions of M_1, M_2 conditioned on x_s(j). The proposal is now to select randomly k such partitions, j = 1, ..., k, and to approximate the left-hand side of the criterion inequality by the average of the log B_12(x_{n-s}(j) | x_s(j)). The (Monte Carlo approximated) model choice criterion then becomes: prefer M_1 if this average is positive. Proceeding formally, if we were to take log p_i(x) as the utility of choosing M_i, the M-open perspective prefers M_1 if the corresponding expected utility is the larger, where p(x) is not specified; we can approach the evaluation of the left-hand side as a Monte Carlo calculation. Again, we see the explicit role of the geometric average of Bayes factors, based on partitioning x and averaging over random partitions, but with the latter "reversing", in a sense, the role of past and future data compared with the form obtained in Section 6.1. However, in this context we want partitions where the proxy for the predictive part resembles data z and the proxy for the conditioning part resembles "no data".

At the time of writing we are aware of work in progress by several researchers who propose forms related to those discussed here. These include: J. Berger and L. R. Pericchi (intrinsic Bayes factors), A. O'Hagan (fractional Bayes factors) and A. F. de Vos (fair Bayes factors).
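The geometric-mean-of-partial-Bayes-factors idea can be sketched numerically (this is our own illustration, under assumed models, not the authors' code). Take M_1: N(x | 0, 1) against M_2: N(x | μ, 1) with the improper uniform prior for μ; a minimal training sample of size m = 1 makes the posterior for μ proper, and the partial Bayes factors are averaged on the log scale over random partitions:

```python
import math
import random

def norm_pdf(x, mu, prec):
    # normal density parameterised by precision, as in the book
    return math.sqrt(prec / (2 * math.pi)) * math.exp(-0.5 * prec * (x - mu) ** 2)

def log_pred_m2(rest, train):
    # under the uniform (improper) prior, mu | train ~ N(mean(train), m);
    # the predictive of the remaining data is obtained by quadrature on a mu grid
    m = len(train)
    mbar = sum(train) / m
    grid = [mbar + 6 * (i / 400 - 0.5) for i in range(401)]
    dx = grid[1] - grid[0]
    total = 0.0
    for mu in grid:
        loglik = sum(math.log(norm_pdf(x, mu, 1.0)) for x in rest)
        total += math.exp(loglik) * norm_pdf(mu, mbar, m) * dx
    return math.log(total)

def log_pred_m1(rest):
    return sum(math.log(norm_pdf(x, 0.0, 1.0)) for x in rest)

def intrinsic_log_bf21(z, m, n_splits=20, seed=1):
    # log of the geometric mean of partial Bayes factors B21 over random
    # minimal training samples of size m
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_splits):
        idx = set(rng.sample(range(len(z)), m))
        train = [z[i] for i in idx]
        rest = [z[i] for i in range(len(z)) if i not in idx]
        acc += log_pred_m2(rest, train) - log_pred_m1(rest)
    return acc / n_splits

data = [1.2, 1.8, 1.5, 2.1, 0.9, 1.6, 1.4, 2.0, 1.1, 1.7]
log_bf = intrinsic_log_bf21(data, m=1)
```

For data centred well away from zero, as here, the averaged log partial Bayes factor strongly favours M_2, as one would expect.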
424 6 Remodelling

From a practical perspective, taking log-predictive utility as the criterion leads, in utility terms, to a criterion of the form: prefer M_1 if its (Monte Carlo approximated) expected log-predictive utility is the larger. Work in progress (by L. Pericchi and A. F. M. Smith) suggests that such a criterion effectively encompasses and extends a number of current criteria. In some problems it might further be desirable to trade off, in utility terms, "fidelity to the observed data" and "future predictive power"; this can be formalised by adopting a utility function which combines the two, and the solution to such problems lies in combining ideas from Chapters 4 and 6. We conclude by emphasising again that a predictivistic decision-theoretical approach to model comparison, where models are evaluated in terms of their predictive behaviour, bypasses the dimensionality issue, since posterior predictive distributions obtained from models with different dimensions are always directly comparable. However, there are clearly practical problems of communication between analysts and audiences which need addressing.

On the one hand, while many are willing to concede that, in narrowly focused decision-making, personal probabilities are the key element of the analysis, there has also been a widespread view (see, for example, Fisher, 1956/1973) that it would be somehow subversive to sully the nobler, objective processes of science by allowing subjective beliefs to enter the picture. As Dickey (1973) has remarked, rhetorically: But is not personal knowledge, or opinion, like superstition, non-objective and unscientific, and therefore to be avoided in science? Who cares to read about a scientific reporter's opinion as described by his prior and posterior probabilities? We have already made clear our own general view that objectivity has no meaning in this context apart from that pragmatically endowed by thinking of it as a shorthand for subjective consensus.
Scientific Reporting

Our whole development has been predicated on the central idea of normative standards for an individual wishing to act coherently in response to uncertainty. In narrowly focused decision-making, individual, personal degrees of belief are an essential element: at any given moment in the learning cycle, they are the encapsulation of the current response to the uncertainties of interest. The question remains, however, of the role such individual beliefs should play when conclusions are reported to a wider scientific audience.
6.3 Discussion and Further References 425

We have seen that shared assumptions about structural aspects of beliefs (for example, exchangeability) can lead a group of individuals to have shared assumptions about the parametric model component, while perhaps differing over the prior component specification; and we have seen, from several perspectives, that entertaining and comparing a range of models fits perfectly naturally within the formalism. Scientific reports should objectively exhibit as much as possible of the inferential content of the data. Communicating a single opinion ought not to be the purpose of a scientific report; the purpose is, rather, to let the data speak for themselves by giving the effect of the data on the wide diversity of real prior opinions. By reporting the data-specific prior-to-posterior transformation of the collection of all personal probability distributions on the parameters of a realistically rich statistical model, an experimenter can summarise the application of Bayes' theorem to whole ranges of prior distributions. There is nothing within the world of Bayesian Statistics that prohibits a scientist from performing and reporting a range of "what if?" analyses. We believe that this is the way forward, although, it has to be said, there is a great deal of work to be done in effecting such a cultural change. For early thoughts on these issues, see Smith (1978); for technical expositions, see Dickey (1973), Roberts (1974) and Dickey and Freeman (1975); for a discussion in the context of a public policy debate, see Edwards et al. (1963) and Hildreth (1963). We shall return to this general problem in the second and third volumes of this work.

Computer Software

Despite the emphasis in Chapters 2 and 3 of this volume on foundational issues, necessary for a complete treatment of Bayesian theory, we are well aware that the majority of practising statisticians are more likely to be influenced by positive, preferably hands-on, experience with applications of methods to concrete problems than they ever will be by philosophical victories attained through the (empirically) bloodless means of axiomatics and stylised counterexamples. We are also well aware that the availability of suitable software is the key to the possibility of obtaining that hands-on experience. But what are the appropriate software tools for Bayesian Statistics? What software? For whom? For what kinds of problems and purposes? A number of such issues were reviewed in Smith (1988), and Goel (1988) provided a review of Bayesian software available in the late 1980s. Meanwhile, computational power grows apace, as does the sophistication of graphical displays.
On the other hand, it has to be said that, at the time of writing, many of these issues remain unresolved, and there are some obvious technical challenges in making such a programme routinely implementable in practice. To quote Dickey again, the aim is a report ". . . derived to include the opinions of his readers", one which lets the data speak for themselves by exhibiting their effect on the wide diversity of real prior opinions.
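The "range of real prior opinions" style of report can be sketched in a few lines; the Bernoulli/Beta setting and the numbers below are an assumed toy example, not data from the text:

```python
def posterior_mean(alpha, beta, r, n):
    """Posterior mean of theta under a Be(alpha, beta) prior,
    after observing r successes in n Bernoulli trials."""
    return (alpha + r) / (alpha + beta + n)

r, n = 7, 10  # assumed data: 7 successes in 10 trials

# one data set, a spectrum of prior opinions: the reader locates
# their own prior and reads off the corresponding posterior.
report = {(a, b): round(posterior_mean(a, b, r, n), 3)
          for (a, b) in [(0.5, 0.5), (1, 1), (2, 2), (5, 5), (1, 9)]}
```

Even a sceptical Be(1, 9) prior is pulled substantially towards the data, which is exactly the kind of "what if?" information such a report is meant to convey.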
426 6 Remodelling

Examples of creative use of modern software in Bayesian analysis include: Smith et al. (1985, 1987) and Racine-Poon et al. (1986), who describe the use of the Bayes Four package; Grieve (1987) and Albert (1990), who discuss the use of MINITAB to teach Bayesian statistics; Tierney (1990, 1991, 1992), who presents LISP-STAT, an object oriented environment for statistical computing, and discusses possible uses of graphical animation; Racine-Poon (1992), who discusses sample-assisted graphical analysis; Lauritzen and Spiegelhalter (1988), Cowell (1992) and Spiegelhalter and Cowell (1992), who, respectively, describe and apply the probabilistic expert system shell BAIES, on which the commercial expert system builder Ergo™ is based; Wooff (1992), who describes [B/D], an implementation of subjectivist analysis of beliefs as described by Goldstein (1981, 1988, 1991, and references therein); Thomas et al. (1992), who describe BUGS, a program to perform Bayesian inference using Gibbs sampling; Korsan (1992) and Ley and Steel (1992), who make use of the commercial package Mathematica™; and Marriott and Naylor (1993). Further review and detailed illustration will be provided in the volumes Bayesian Computation and Bayesian Methods.
BAYESIAN THEORY
José M. Bernardo and Adrian F. M. Smith
Copyright © 2000 by John Wiley & Sons, Ltd

Appendix A

Summary of Basic Formulae

Summary

Two sets of tables are provided for reference. The first records the definition, notation, parameter range, variable range, and first two moments of the probability distributions (discrete and continuous, univariate and multivariate) used in this volume. The second records the basic elements of standard Bayesian inference processes for a number of special cases. In particular, it records the appropriate likelihood function, the sufficient statistics, the conjugate prior and corresponding posterior and predictive distributions, and the reference prior and corresponding reference posterior and predictive distributions.

A.1 PROBABILITY DISTRIBUTIONS

The first section of this Appendix consists of a set of tables which record the notation, definition, and first two moments of the most common probability distributions used in this volume.
428 A. Summary of Basic Formulae

Univariate Discrete Distributions

Br(x | θ)  Bernoulli  (p. 115)
0 < θ < 1;  x = 0, 1
p(x) = θ^x (1 − θ)^{1−x}
E[x] = θ,  V[x] = θ(1 − θ)

Bi(x | θ, n)  Binomial  (p. 115)
0 < θ < 1, n = 1, 2, ...;  x = 0, 1, ..., n
p(x) = C(n, x) θ^x (1 − θ)^{n−x}
E[x] = nθ,  V[x] = nθ(1 − θ)

Bb(x | α, β, n)  Binomial-Beta  (p. 115)
α > 0, β > 0, n = 1, 2, ...;  x = 0, 1, ..., n
p(x) = c C(n, x) Γ(α + x) Γ(β + n − x),  c = Γ(α + β) / {Γ(α) Γ(β) Γ(α + β + n)}
E[x] = nα/(α + β),  V[x] = nαβ(α + β + n) / {(α + β)²(α + β + 1)}

Hy(x | N, M, n)  Hypergeometric  (p. 117)
N = 1, 2, ..., M = 1, 2, ..., n = 1, ..., N + M;  max(0, n − M) ≤ x ≤ min(n, N)
p(x) = c C(N, x) C(M, n − x),  c = 1 / C(N + M, n)
E[x] = nN/(N + M),  V[x] = nNM(N + M − n) / {(N + M)²(N + M − 1)}
A.1 Probability Distributions 429

Univariate Discrete Distributions (continued)

Nb(x | θ, r)  Negative-Binomial  (p. 116)
0 < θ < 1, r = 1, 2, ...;  x = 0, 1, 2, ...
p(x) = c θ^r (1 − θ)^x,  c = C(r + x − 1, r − 1)
E[x] = r(1 − θ)/θ,  V[x] = r(1 − θ)/θ²

Nbb(x | α, β, r)  Negative-Binomial-Beta  (p. 118)
α > 0, β > 0, r = 1, 2, ...;  x = 0, 1, 2, ...
p(x) = c C(r + x − 1, r − 1) Γ(β + x) / Γ(α + β + r + x),  c = Γ(α + β) Γ(α + r) / {Γ(α) Γ(β)}
E[x] = rβ/(α − 1),  V[x] = rβ(r + α − 1)(β + α − 1) / {(α − 1)²(α − 2)}

Pn(x | λ)  Poisson  (p. 116)
λ > 0;  x = 0, 1, 2, ...
p(x) = c λ^x / x!,  c = e^{−λ}
E[x] = λ,  V[x] = λ

Pg(x | α, β, n)  Poisson-Gamma  (p. 119)
α > 0, β > 0, n = 1, 2, ...;  x = 0, 1, 2, ...
p(x) = c (n^x / x!) Γ(α + x) / (β + n)^{α+x},  c = β^α / Γ(α)
E[x] = nα/β,  V[x] = (nα/β)(β + n)/β
430 A. Summary of Basic Formulae

Univariate Continuous Distributions

Be(x | α, β)  Beta  (p. 116)
α > 0, β > 0;  0 < x < 1
p(x) = c x^{α−1}(1 − x)^{β−1},  c = Γ(α + β) / {Γ(α) Γ(β)}
E[x] = α/(α + β),  V[x] = αβ / {(α + β)²(α + β + 1)}

Un(x | a, b)  Uniform  (p. 117)
b > a;  a < x < b
p(x) = c,  c = (b − a)^{−1}
E[x] = (a + b)/2,  V[x] = (b − a)²/12

Ga(x | α, β)  Gamma  (p. 118)
α > 0, β > 0;  x > 0
p(x) = c x^{α−1} e^{−βx},  c = β^α / Γ(α)
E[x] = α/β,  V[x] = α/β²

Gg(x | α, β, n)  Gamma-Gamma  (p. 120)
α > 0, β > 0, n = 1, 2, ...;  x > 0
p(x) = c x^{n−1} / (β + x)^{α+n},  c = β^α Γ(α + n) / {Γ(α) Γ(n)}
E[x] = nβ/(α − 1),  V[x] = nβ²(n + α − 1) / {(α − 1)²(α − 2)}

Ex(x | θ)  Exponential  (p. 118)
θ > 0;  x > 0
p(x) = c e^{−θx},  c = θ
E[x] = 1/θ,  V[x] = 1/θ²
A.1 Probability Distributions 431

Univariate Continuous Distributions (continued)

χ²(x | ν) = χ²_ν  Chi-squared  (p. 119)
ν > 0;  x > 0
p(x) = c x^{ν/2−1} e^{−x/2},  c = (1/2)^{ν/2} / Γ(ν/2)
E[x] = ν,  V[x] = 2ν

χ²(x | ν, λ)  Non-central Chi-squared  (p. 120)
ν > 0, λ > 0;  x > 0
E[x] = ν + λ,  V[x] = 2(ν + 2λ)

Ig(x | α, β)  Inverted-Gamma  (p. 119)
α > 0, β > 0;  x > 0
p(x) = c x^{−(α+1)} e^{−β/x},  c = β^α / Γ(α)
E[x] = β/(α − 1),  V[x] = β² / {(α − 1)²(α − 2)}

χ^{−2}(x | ν)  Inverted-Chi-squared  (p. 120)
ν > 0;  x > 0
E[x] = 1/(ν − 2),  V[x] = 2 / {(ν − 2)²(ν − 4)}

Ga^{−1/2}(x | α, β)  Square-root Inverted-Gamma  (p. 121)
α > 0, β > 0;  x > 0
p(x) = c x^{−(2α+1)} e^{−β/x²},  c = 2β^α / Γ(α)
E[x] = Γ(α − 1/2) β^{1/2} / Γ(α),  V[x] = β/(α − 1) − (E[x])²
432 A. Summary of Basic Formulae

Univariate Continuous Distributions (continued)

Pa(x | α, β)  Pareto  (p. 120)
α > 0, β > 0;  x > β
p(x) = c x^{−(α+1)},  c = α β^α
E[x] = αβ/(α − 1),  V[x] = αβ² / {(α − 1)²(α − 2)}

Ip(x | α, β)  Inverted-Pareto  (p. 121)
α > 0, β > 0;  0 < x < β^{−1}
p(x) = c x^{α−1},  c = α β^α
E[x] = α / {β(α + 1)},  V[x] = α / {β²(α + 1)²(α + 2)}

N(x | μ, λ)  Normal  (p. 121)
−∞ < μ < +∞, λ > 0;  −∞ < x < +∞
p(x) = c exp{−λ(x − μ)²/2},  c = λ^{1/2} (2π)^{−1/2}
E[x] = μ,  V[x] = λ^{−1}

St(x | μ, λ, α)  Student t  (p. 122)
−∞ < μ < +∞, λ > 0, α > 0;  −∞ < x < +∞
p(x) = c {1 + (λ/α)(x − μ)²}^{−(α+1)/2},  c = {Γ((α + 1)/2) / Γ(α/2)} (λ/απ)^{1/2}
E[x] = μ (α > 1),  V[x] = λ^{−1} α/(α − 2) (α > 2)
A.1 Probability Distributions 433

Univariate Continuous Distributions (continued)

Lo(x | α, β)  Logistic  (p. 122)
−∞ < α < +∞, β > 0;  −∞ < x < +∞
p(x) = c e^{−(x−α)/β} {1 + e^{−(x−α)/β}}^{−2},  c = β^{−1}
E[x] = α,  V[x] = β²π²/3

Multivariate Discrete Distributions

Mu_k(x | θ, n)  Multinomial  (p. 133)
0 < θ_i < 1, Σ_i θ_i ≤ 1, n = 1, 2, ...;  x = (x_1, ..., x_k), x_i = 0, 1, 2, ..., Σ_i x_i ≤ n
p(x) = c Π_i θ_i^{x_i} (1 − Σ_i θ_i)^{n − Σ_i x_i},  c = n! / {Π_i x_i! (n − Σ_i x_i)!}
E[x_i] = nθ_i,  V[x_i] = nθ_i(1 − θ_i)

Md_k(x | α, n)  Multinomial-Dirichlet  (p. 135)
α_i > 0, n = 1, 2, ...;  x = (x_1, ..., x_k), x_i = 0, 1, 2, ..., Σ_i x_i ≤ n
E[x_i] = nα_i / Σ_j α_j
434 A. Summary of Basic Formulae

Multivariate Continuous Distributions

Di_k(x | α)  Dirichlet  (p. 134)
α = (α_1, ..., α_{k+1}), α_i > 0;  x = (x_1, ..., x_k), 0 < x_i < 1, Σ_i x_i < 1
p(x) = c Π_{i=1}^{k} x_i^{α_i−1} (1 − Σ_i x_i)^{α_{k+1}−1},  c = Γ(Σ_{i=1}^{k+1} α_i) / Π_{i=1}^{k+1} Γ(α_i)
E[x_i] = α_i / Σ_j α_j

N_k(x | μ, λ)  Multivariate Normal  (p. 136)
μ = (μ_1, ..., μ_k) ∈ R^k, λ symmetric positive-definite;  x ∈ R^k
p(x) = c exp{−(x − μ)' λ (x − μ)/2},  c = |λ|^{1/2} (2π)^{−k/2}
E[x] = μ,  V[x] = λ^{−1}

Ng(x, y | μ, λ, α, β)  Normal-Gamma  (p. 136)
p(x, y) = N(x | μ, λy) Ga(y | α, β)

Pa2(x, y | α, β_0, β_1)  Bilateral Pareto  (p. 141)
α > 0, β_0 < β_1;  x < β_0, y > β_1
p(x, y) = c (y − x)^{−(α+2)},  c = α(α + 1)(β_1 − β_0)^α
A.1 Probability Distributions 435

Multivariate Continuous Distributions (continued)

Ng_k(x, y | μ, λ, α, β)  Multivariate Normal-Gamma  (p. 140)
−∞ < μ_i < +∞, λ symmetric positive-definite, α > 0, β > 0;  x ∈ R^k, y > 0
p(x, y) = N_k(x | μ, λy) Ga(y | α, β)

Nw_k(x, y | μ, λ, α, β)  Multivariate Normal-Wishart  (p. 140)
−∞ < μ_i < +∞, λ > 0, 2α > k − 1, β symmetric non-singular;  x ∈ R^k, y symmetric positive-definite
p(x, y) = N_k(x | μ, λy) Wi_k(y | α, β)

Wi_k(x | α, β)  Wishart  (p. 138)
2α > k − 1, β symmetric non-singular;  x symmetric positive-definite
p(x) = c |x|^{α−(k+1)/2} exp{−tr(βx)},  c = |β|^α / {π^{k(k−1)/4} Π_{i=1}^{k} Γ((2α + 1 − i)/2)}
E[x] = α β^{−1}

St_k(x | μ, λ, α)  Multivariate Student  (p. 139)
−∞ < μ_i < +∞, λ symmetric positive-definite, α > 0;  x ∈ R^k
p(x) = c {1 + (x − μ)' λ (x − μ)/α}^{−(α+k)/2},  c = {Γ((α + k)/2) / Γ(α/2)} (απ)^{−k/2} |λ|^{1/2}
E[x] = μ,  V[x] = λ^{−1} α/(α − 2)
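As a quick numerical sanity check of the tabulated moment formulae (our illustration, not part of the tables), one can simulate from two of the univariate entries and compare against the stated E[x] and V[x]; note that Python's random.gammavariate takes a scale, i.e. 1/β, as its second argument, whereas the tables use the rate β:

```python
import random

rng = random.Random(42)
alpha, beta = 3.0, 2.0

# Ga(x | alpha, beta): tabulated E[x] = alpha/beta = 1.5, V[x] = alpha/beta^2 = 0.75
g = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(100_000)]
mean_g = sum(g) / len(g)
var_g = sum((x - mean_g) ** 2 for x in g) / len(g)

# Be(x | alpha, beta): tabulated E[x] = alpha/(alpha+beta) = 0.6
b = [rng.betavariate(alpha, beta) for _ in range(100_000)]
mean_b = sum(b) / len(b)
```

With 100,000 draws the Monte Carlo error is of the order of a few thousandths, well inside the tolerances used below.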
436 A. Summary of Basic Formulae

A.2 INFERENTIAL PROCESSES

The second section of this Appendix records the basic elements of the Bayesian learning processes for many commonly used statistical models. For each of these models, we provide the following: the sufficient statistic and its sampling distribution; the conjugate family; the conjugate prior predictives for a single observable and for the sufficient statistic; the conjugate posterior and the conjugate posterior predictive for a single observable; and the reference prior and the corresponding reference posterior and posterior predictive for a single observable. In the case of uniparameter models this can always be done. We recall, however, from Section 5.4 that, in multiparameter problems, the reference prior is only defined relative to an ordered parametrisation. When clearly defined, we provide, in separate sections of the table, the reference priors corresponding to different inference problems, as specified by different ordered parametrisations. In the univariate normal model (Example 5.17), the reference prior for (μ, λ) happens to be the same as that for (λ, μ), namely π(μ, λ) ∝ λ^{−1}, and we provide the corresponding reference posteriors for μ and λ, together with the reference predictive distribution for a future observation. In the multinomial, multivariate normal and linear regression models there are very many different reference priors, corresponding to different inference problems and specified by different ordered parametrisations; these are not reproduced in this Appendix.

Bernoulli model

z = (x_1, ..., x_n),  r = Σ_i x_i
p(x | θ) = θ^x (1 − θ)^{1−x},  x ∈ {0, 1}
t = r,  p(t | θ) = Bi(r | θ, n)
p(θ) = Be(θ | α, β)
p(x) = Bb(x | α, β, 1),  p(t) = Bb(r | α, β, n)
p(θ | z) = Be(θ | α + r, β + n − r)
p(x | z) = Bb(x | α + r, β + n − r, 1)
π(θ) = Be(θ | 1/2, 1/2)
π(θ | z) = Be(θ | 1/2 + r, 1/2 + n − r)
π(x | z) = Bb(x | 1/2 + r, 1/2 + n − r, 1)
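The Bernoulli entries above translate directly into executable updating rules (a transcription for illustration; the Be(1/2, 1/2) case is the reference prior):

```python
def beta_update(alpha, beta, r, n):
    """Be(alpha, beta) prior -> Be(alpha + r, beta + n - r) posterior,
    after r successes in n Bernoulli trials."""
    return alpha + r, beta + n - r

def predictive_success(alpha, beta):
    """Pr(x = 1) for one future observable under a Be(alpha, beta) law,
    i.e. the mean of the Bb(x | alpha, beta, 1) predictive."""
    return alpha / (alpha + beta)

# reference analysis of r = 7 successes in n = 10 trials
a, b = beta_update(0.5, 0.5, r=7, n=10)   # Be(7.5, 3.5)
p_next = predictive_success(a, b)          # 7.5 / 11
```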
A.2 Inferential Processes 437

Poisson model

z = (x_1, ..., x_n),  r = Σ_i x_i
p(x | λ) = Pn(x | λ),  x = 0, 1, 2, ...
t = r,  p(t | λ) = Pn(r | nλ)
p(λ) = Ga(λ | α, β)
p(x) = Pg(x | α, β, 1),  p(t) = Pg(r | α, β, n)
p(λ | z) = Ga(λ | α + r, β + n)
p(x | z) = Pg(x | α + r, β + n, 1)
π(λ) ∝ λ^{−1/2}
π(λ | z) = Ga(λ | r + 1/2, n)
π(x | z) = Pg(x | r + 1/2, n, 1)

Negative-Binomial model

z = (x_1, ..., x_n),  s = Σ_i x_i
p(x | θ) = Nb(x | θ, r),  0 < θ < 1,  x = 0, 1, 2, ...
p(θ) = Be(θ | α, β)
p(θ | z) = Be(θ | α + nr, β + s)
π(θ) ∝ θ^{−1}(1 − θ)^{−1/2}
π(θ | z) = Be(θ | nr, s + 1/2)
π(x | z) = Nbb(x | nr, s + 1/2, r)
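Similarly for the Poisson entries, where the reference analysis corresponds to the improper limit α = 1/2, β = 0 of the conjugate Ga(λ | α, β) family (again an illustrative transcription):

```python
def gamma_update(alpha, beta, r, n):
    """Ga(alpha, beta) prior for a Poisson rate -> Ga(alpha + r, beta + n),
    with r the total count observed in n observations."""
    return alpha + r, beta + n

# reference posterior Ga(r + 1/2, n) for r = 12 events in n = 5 observations
a, b = gamma_update(0.5, 0.0, r=12, n=5)
post_mean = a / b   # posterior mean of lambda, here 12.5 / 5
```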
438 A. Summary of Basic Formulae

Exponential model

z = (x_1, ..., x_n),  t = Σ_i x_i
p(x | θ) = Ex(x | θ),  θ > 0, x > 0
p(t | θ) = Ga(t | n, θ)
p(θ) = Ga(θ | α, β)
p(x) = Gg(x | α, β, 1),  p(t) = Gg(t | α, β, n)
p(θ | z) = Ga(θ | α + n, β + t)
p(x | z) = Gg(x | α + n, β + t, 1)
π(θ) ∝ θ^{−1}
π(θ | z) = Ga(θ | n, t)
π(x | z) = Gg(x | n, t, 1)

Uniform model

z = (x_1, ..., x_n),  t = max{x_1, ..., x_n}
p(x | θ) = Un(x | 0, θ),  0 < x < θ
p(θ) = Pa(θ | α, β)
p(θ | z) = Pa(θ | α + n, max{β, t})
π(θ) ∝ θ^{−1}
π(θ | z) = Pa(θ | n, t)
A.2 Inferential Processes 439

Normal model (known precision λ)

z = (x_1, ..., x_n),  x̄ = n^{−1} Σ_i x_i
p(x | μ) = N(x | μ, λ)
t = x̄,  p(t | μ) = N(x̄ | μ, nλ)
p(μ) = N(μ | μ_0, λ_0)
p(μ | z) = N(μ | μ_n, λ_0 + nλ),  μ_n = (λ_0 μ_0 + nλ x̄)/(λ_0 + nλ)
π(μ) ∝ constant
π(μ | z) = N(μ | x̄, nλ)
π(x | z) = N(x | x̄, nλ/(n + 1))

Normal model (known mean μ)

z = (x_1, ..., x_n),  t = Σ_i (x_i − μ)²
p(λ) = Ga(λ | α, β)
p(λ | z) = Ga(λ | α + n/2, β + t/2)
π(λ) ∝ λ^{−1}
π(λ | z) = Ga(λ | n/2, t/2)
π(x | z) = St(x | μ, n/t, n)
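The known-precision normal entry is, likewise, a one-line precision-weighted update (our transcription of the table, with assumed illustrative numbers):

```python
def normal_known_precision_update(mu0, n0, lam, xbar, n):
    """N(mu | mu0, n0) prior, N(x | mu, lam) data with known precision lam:
    the posterior is N(mu | mu_n, n0 + n*lam), with mu_n a precision-weighted
    average of the prior mean and the sample mean."""
    prec = n0 + n * lam
    mu_n = (n0 * mu0 + n * lam * xbar) / prec
    return mu_n, prec

# prior N(0, 1), four observations of unit precision with mean 2
mu_n, prec = normal_known_precision_update(mu0=0.0, n0=1.0, lam=1.0, xbar=2.0, n=4)
```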
440 A. Summary of Basic Formulae

Normal model (both parameters unknown)
A.2 Inferential Processes 441

Multinomial model

z = (x_1, ..., x_k),  Σ_i x_i ≤ n
p(z | θ) = Mu_k(z | θ, n)

Multivariate Normal model

z = (x_1, ..., x_n),  x_i ∈ R^k
p(x_i | μ, λ) = N_k(x_i | μ, λ),  μ ∈ R^k,  λ a k × k symmetric positive-definite matrix
442 A. Summary of Basic Formulae

Linear Regression

t(z) = (X'X, X'y)
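The sufficient statistics t(z) = (X'X, X'y) recorded above can be computed and used directly; the small design below is an assumed toy example, and the location (X'X)^{-1} X'y is the posterior mean common to conjugate and reference analyses of the regression coefficients:

```python
# assumed toy design: intercept plus one regressor, y = 1 + 2x exactly
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# sufficient statistics t(z) = (X'X, X'y)
XtX = [[sum(X[k][i] * X[k][j] for k in range(4)) for j in range(2)]
       for i in range(2)]
Xty = [sum(X[k][i] * y[k] for k in range(4)) for i in range(2)]

# posterior location b = (X'X)^{-1} X'y, solving the 2x2 system by hand
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
b = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
     (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]
```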
BAYESIAN THEORY
José M. Bernardo and Adrian F. M. Smith
Copyright © 2000 by John Wiley & Sons, Ltd

Appendix B

Non-Bayesian Theories

Summary

A summary is given of a number of non-Bayesian statistical approaches and procedures. The main theories reviewed include classical decision theory, frequentism, likelihood, and fiducial inference. These are illustrated, compared and contrasted with Bayesian methods in the stylised contexts of point and interval estimation, hypothesis and significance testing, and prediction. Further issues discussed include: conditional and unconditional inference, nuisance parameters and marginalisation, and asymptotics and criteria for model choice.

B.1 OVERVIEW

Bayesian statistical theory as presented in this book is self-contained and can be understood and applied without reference to alternative statistical theories. There are, however, two broad reasons why we think it appropriate to give a summary overview of our attitude to other theories. First, many, if not most, readers will have some previous exposure to "classical" statistics, and the material in this Appendix may help them to put the contents of this book into perspective.
444 B. Non-Bayesian Theories

Secondly, our own experience has been that some element of comparative analysis contributes significantly to an appreciation of the attractions of the Bayesian paradigm in statistics. We begin by making explicit some of the key differences between Bayesian and non-Bayesian theories.

As we showed in detail in Chapters 2 and 3, the existence of a prior distribution is a mathematical consequence of the foundational axioms. Non-Bayesian theories typically use only a parametric model family of the form {p(x | θ), x ∈ X, θ ∈ Θ}, ignoring the prior distribution p(θ). The implications of this fact are so far reaching that sometimes Bayesian statistics is simplistically thought of as statistics with the "optional extra" of a prior distribution. In Chapter 4, however, we stressed that predictive models, typically derived from combining p(x | θ) and p(θ), are primary.

The decision theoretical foundations of Bayesian statistics provide a natural framework within which specific problems can easily be structured, with solutions directly tailored to problems. In contrast, most non-Bayesian theories essentially consist of stylised procedures, such as those for point or interval estimation, or hypothesis testing, designed to satisfy or optimise an ad hoc criterion, and often lacking the necessary flexibility to be adaptable to specific problem situations.

Finally, we have argued that a decision structure is the natural framework for any formal statistical problem, and have described how a "pure" inference problem may be seen as a particular decision problem. Non-Bayesian inference theories typically ignore the decision aspects of inference problems and, from a Bayesian viewpoint, classical decision theory is only partially relevant to inference.
Bayesian statistics has an axiomatic foundation which guarantees quantitative coherence. Non-Bayesian statistical theories typically lack foundational support of this kind and essentially consist of a set of recipes which are not necessarily internally consistent. It is important, therefore, to be clear that in this Appendix we are discussing non-Bayesian formal procedures. As a preliminary, we recall from Chapter 1 our acknowledgment that Bayesian analysis takes place in a rather formal framework, and that exploratory data analysis and graphical displays are often prerequisite, informal activities.

In Section B.2, we will revise the key ideas of a number of non-Bayesian statistical theories, specifically reviewing Classical Decision Theory, Frequentist Procedures, Likelihood Inference, and Fiducial and Related Theories.
B.2 Alternative Approaches 445

In Section B.3, we will follow the typical methodological partition of non-Bayesian textbooks into the topics of Point Estimation, Interval Estimation, Hypothesis Testing and Significance Testing. Within each of those subheadings we will comment on the internal logic, the relevance to actual statistical problems, and the performance of classical procedures relative to their Bayesian counterparts. In Section B.4, we will discuss in detail some key comparative issues: Conditional and Unconditional Inference; Nuisance Parameters and Marginalisation; Approaches to Prediction; Aspects of Asymptotics; and Model Choice Criteria. For readers seeking further comparative discussion at textbook level, we note the books by Barnett (1971/1982), Cox and Hinkley (1974), Anderson (1984), DeGroot (1987), Press (1972/1982), Casella and Berger (1990) and Poirier (1993).

B.2 ALTERNATIVE APPROACHES

B.2.1 Classical Decision Theory

We recall from Section 3.3 the basic structure of a general decision problem, consisting of a set of possible decisions D, a parameter space Ω, a prior distribution p(ω) over Ω, and a utility function u(d(ω)), which we shall denote by u(d, ω) to conform more closely to standard notation in classical decision theory. We established that the existence of both the prior distribution p(ω) and the utility function u(d, ω) is a mathematical consequence of the axioms of quantitative coherence, and that the best decision d* is that which maximises the expected utility

ū(d) = ∫ u(d, ω) p(ω) dω.

A regret function, or decision loss, is easily defined from the utility function (at least in bounded cases) by

l(d, ω) = sup_{d' ∈ D} u(d', ω) − u(d, ω).
If additional information x is obtained which is probabilistically related to ω by p(x | ω), then the best decision d*_x is that which maximises the posterior expected utility

ū(d | x) = ∫ u(d, ω) p(ω | x) dω,  where  p(ω | x) ∝ p(x | ω) p(ω).

Some authors prefer to use loss functions instead of utilities.
446 B. Non-Bayesian Theories

The regret function quantifies the maximum loss that, for each ω, one may suffer as a consequence of a wrong decision. Since sup_{d ∈ D} u(d, ω) only depends on ω, the expected loss

l̄(d) = ∫ l(d, ω) p(ω) dω

is minimised by the same decision d* which maximises ū(d) and hence, from a Bayesian point of view, the two formulations are essentially equivalent.

In contrast to this Bayesian formulation, the core framework of classical decision theory may be loosely described as decision theory without a prior distribution. A utility function (or a loss function) is accepted, perhaps justified by utility-only axiomatics of the type pioneered by von Neumann and Morgenstern (1944/1953), but a prior distribution for ω is not. Although some of the basic ideas in classical statistical theory were present in the work of Neyman and Pearson (1933), it was Wald (1950) who introduced a systematic decision theory framework, excluding prior distributions as core elements, but including a formulation of standard statistical problems within a decision framework. This work was continued by Girshick and Savage (1951) and by Stein (1956). An excellent textbook introduction is that of Ferguson (1967).

Classical decision theory focuses on the way in which additional information x should be used to assist the decision process. Consequently, the basic space is not the class of decisions, but the class of decision rules, consisting of functions δ : X → D which attach a decision δ(x) to each possible data set x. The formulation includes, as a special case, the situation with no additional data (the no-data case), where the risk function reduces to the loss function. It is then suggested that decision rules should be evaluated in terms of their average loss with respect to the data which might arise. Thus, the risk function r(δ, ·) of a decision rule δ is defined as

r(δ, ω) = ∫ l(δ(x), ω) p(x | ω) dx,

and subsequent comparison of decision rules is based on their risk functions.

Example B.1. (Estimation of the mean of a normal distribution). Let z = {x_1, ..., x_n} be a random sample from a N(x | μ, 1) distribution, so that ω = μ, and suppose that we want to select an estimator for μ, under the assumption of a quadratic loss function l(μ̃, μ) = (μ̃ − μ)². Some possible decision rules are

(i) δ_1(z) = x̄, the sample mean;
(ii) δ_2(z) = x̃, the sample median;
(iii) δ_3(z) = μ_0, a fixed value;
(iv) δ_4(z) = (n_0 μ_0 + n x̄)/(n_0 + n), the posterior mean from a N(μ | μ_0, n_0) prior, centred on μ_0 and with precision n_0.

Using the fact that the variance of the sample median is approximately π/2n, the corresponding risk functions are easily seen to be

(i) r(δ_1, μ) = 1/n;
(ii) r(δ_2, μ) = π/2n;
(iii) r(δ_3, μ) = (μ − μ_0)²;
(iv) r(δ_4, μ) = (n + n_0)^{−2} {n + n_0²(μ − μ_0)²}.

Note that (iv) includes both (i) and (iii) as limiting cases when n_0 = 0 and n_0 → ∞, respectively. Figure B.1 provides a graphical comparison of the risk functions.

[Figure B.1  Risk functions for μ_0 = 0, n = 10 and n_0 = 5]

Obviously, the best decision rule markedly depends on μ: the closer μ_0 is to the true value of μ, the more attractive δ_3 and δ_4 will obviously be. However, δ_2 can hardly be considered a good decision rule since, for any value of the unknown parameter μ, it has a larger risk than δ_1.

Admissibility

The decision rule δ_2 in Example B.1 is dominated by δ_1, since, whatever the value of μ, δ_1 has a smaller risk. This is formalised within classical decision theory by saying that a decision rule δ' is dominated by another decision rule δ if, for all ω, r(δ', ω) ≥ r(δ, ω),

448 B. Non-Bayesian Theories

with strict inequality for some ω, and that a decision rule is admissible if there is no other decision rule which dominates it. A class of decision rules is complete if for any δ' not in the class there is a δ in the class which dominates it, and a class is minimal complete if it does not contain a complete subclass. Under rather general conditions, classical decision theory establishes that one can limit attention to a minimal complete class. However, for guidance on how to choose among admissible decision rules, further concepts and criteria are required.

Bayes Rules

If the existence of a prior distribution p(ω) for the unknown parameter is accepted, classical decision theory focuses on the decision rule which minimises the expected risk (or so-called Bayes risk)

min_δ ∫ r(δ, ω) p(ω) dω,

which it calls a Bayes decision rule. Note that, since p(x | ω) p(ω) = p(ω | x) p(x), under appropriate regularity conditions we may reverse the order of integration above to obtain

min_δ ∫∫ l(δ(x), ω) p(x | ω) p(ω) dx dω = min_δ ∫ { ∫ l(δ(x), ω) p(ω | x) dω } p(x) dx,

so that the Bayes rule may be simply described as the decision rule which maps each data set x to the decision δ*(x) which minimises the corresponding posterior expected loss. Note that this interpretation does not require the evaluation of any risk function.

It is easily shown that any Bayes rule which corresponds to a proper prior distribution is admissible. Indeed, if δ* is the Bayes decision rule which corresponds to p(ω) and δ' were another decision rule such that r(δ', ω) ≤ r(δ*, ω), with strict inequality on some subset of Ω with positive probability under p(ω), then one would have

∫ r(δ', ω) p(ω) dω < ∫ r(δ*, ω) p(ω) dω,

which would contradict the definition of a Bayes rule as one which minimises the expected risk. Wald (1950) proved the important converse result that, under rather general conditions, any admissible decision rule is a Bayes decision rule with respect to some, possibly improper, prior distribution. There is, however, no guarantee that improper priors lead to admissible decision rules. A famous example is the inadmissibility of the sample mean of multivariate normal data as an estimator of the population mean, even though it is the Bayes estimator which corresponds to a uniform prior. For details, see James and Stein (1961).

Minimax rules

The combined facts that admissible rules must be Bayes, and that to derive the Bayes rule does not require computation of the risk function but simply the minimisation of the posterior expected loss, make it clear that, apart from purely mathematical interest, it is rather pointless to work in decision theory outside the Bayesian framework. Indeed, this has been the mainstream view since the early 1960's, with the authoritative monographs by DeGroot (1970) and Berger (1985a) becoming the most widely used decision theory texts.

Nevertheless, some textbooks continue to propose as a criterion for choosing among decisions (without using a prior distribution) the rather unappealing minimax principle. This asserts that one should choose that decision (or decision rule) for which the maximum possible loss (or risk) is minimal. The intuitive basis of the minimax principle is that one should guard against the largest possible loss. While this may have some value in the context of game theory, where a player may expect the opponent to try to put him or her in the worst possible situation, it has no obvious intuitive merit in standard decision problems. Moreover, it can be shown that, under rather general conditions, the minimax rule is the Bayes decision rule which corresponds to the least favourable prior distribution, i.e., that which gives the highest expected risk. Thus, although in specific instances, namely when prior beliefs happen to be close to the least favourable distribution, the minimax solution may be reasonable (essentially coinciding with the Bayes solution), in general, and certainly in the finite spaces of real world applications, the minimax criterion seems entirely unreasonable. The idea that the minimax rule should be preferred to a rule which has better properties for nearly all plausible ω values, but has a slightly higher maximum risk for an extremely unlikely ω value, seems absurd. Even as a formal decision criterion, minimax has very unattractive features: it gives different answers if applied to losses rather than to regret functions, and it can violate the transitivity of preferences (see, e.g., Lindley, 1972).

B.2.2 Frequentist Procedures

We recall from Section 5.1 the basic structure of a stylised inference problem, where inferences about θ ∈ Θ are to be drawn from data x, probabilistically related to θ by the parametric model component {p(x | θ), θ ∈ Θ}.
and that a decision rule is udniissible if there is no other decision rule which dominates it. Indeed. since P(ZI W ) P ( W )  o(w I Z ) l J ( Z ) . If one is to choose among decision rules in terms of their risk functions. d )p ( w )~ l w i n i r i = 'I . Note that.
the minimax solution may be reasonable (essentially coinciding with the Bayes solution). This asserts that one should choose that decision (or decision rule) for which the maximum possible loss (or risk) is minimal. but has a slightly higher maximum risk for an extremely unlikely w value seems absurd.2 Alternative Approaches 449 There is. it is rather pointless to work in decision theory outside the Bayesian framework. For details. the minimax criterion seems entirely unreasonable.e. and that to derive the Bayes rule does not require computation of the risk function but simply the minimisation of the posterior expected loss. .22 Frequentlst Procedures We recall from Section 5. Lindley. and it can violate the transitivity of preferences (see e.. that which gives the highest expected risk. Minimax rules The combined facts that admissible rules must be Bayes. apart from purely mathematical interest. by the parametric model component { p ( z 18). see James and Stein (1961).8 E 0). 1972). under rather general conditions. make it clear that. Nevertheless. it has no obvious intuitive merit in standard decision problems. minimax has very unattractive features. The idea that the minimax rule should be preferred to a rule which has better properties for nearly all plausible w values. however.g. although in specific instancesnamely when prior beliefs happen to be close to the least favourable distribution. even as a formal decision criterion. with the authoritative monographs by DeGroot (1970) and Berger (1985a) becoming the most widely used decision theory text.1 the basic structure of a stylised inferenceproblem. It can be shown. that the minimax rule is the Bayes decision rule which corresponds to the leasr favourable prior distriburion. A famous example is the inadmissibility of the sample mean of multivariate normal data as an estimator of the population mean.. no guarantee that improper priors lead to admissible decision rules. Moreover. 
where probabilistically related to 8 inferences about 8 E 8 are to be drawn from data z. While this may have some value in the context of game theory. Indeed. this has been the mainstream view since the early 1960’s. Thus. for instance.B. i.. 8. it gives different answers if applied to losses rather than to regret functions. The intuitive basis of the minimax principle is that one should guard against the largest possible loss. some textbooks continue to propose as a criterion for choosing among decisions (without using a prior distribution)the rather unappealing minimax principle. even though it is the Bayes estimator which corresponds to a uniform prior. where a player may expect the opponent to try to put him or her in the worst possible situation. . and certainly in the finite spaces of real world applications.
tioti lik(8 12) = p ( x 18) (or variants thereof). differences. Although some of the ideas probably date back to the early 18OO's. described by p ( t I e). The basic ideas behind frequentist statistics consist of (i) selecting a function of the data t = t ( x ) . For a specific parameter value 8 = 8. in order for such comparisons not to depend on the use of z rather than 2. as reflected in discussions at the time published in the Royal Statistical Societyjournals. (ii) deriving the sampling distribution of t .. Convenient references are Neyman and Pearson ( 1967)and Fisher (1990).450 B. and (iii) measuring the "plausibility" of each possible 8 by calibrating the observed value of the statistic t against its expected longrun behaviour given 8. say. Frequentist procedures make extensive use of the /ike/ihood~~nc. essentially taking the mathematical form of the sampling distribution of the observed data x and considering it as a function of the unknown parameter 8. If z = x(x) is a onetoone transformation of x. as a basis for both the construction and the assessment of statistical procedures. and that the required inferential statement about 8 given x is simply provided by the full posterior distribution where Frequentist statistical procedures are mainly distinguished by two related features. most of the basic concepts were brought together in the 1930's from two somewhat different perspectives..called a sfurisric. (i) they regard the information provided by the data x as the sole quantifiable form of relevant probabilistic infomation and (ii) they use..the likelihood in terms of the sampling distribution of z becomes (in the above variant) which suggests that meaningful likelihood comparisons should be made in the form of ratios rather than.e. NonBayesian Theories We established that the existence of a prior distribution p ( 8 ) is a mathematical consequence of the axioms of quantitative coherence. being critically opposed by Fisher. 
See, also, Wald (1947) for specific methods for sequential problems. Thus, for instance, if the observed value of t is well within the area where most of the probability density of p(t | θ0) lies, then θ0 is claimed to be compatible with the data; otherwise, it is said that either θ0 is not the true value of θ, or a rare event has happened.
B.2 Alternative Approaches

Clearly, there is more to inference than the choice of estimators and their assessment on the basis of sampling distributions; indeed, the approach just described is far removed from the (to a Bayesian rather intuitively obvious) view that relevant inferences about θ should be probability statements about θ given the observed data, rather than probability statements about hypothetical repetitions of the data conditional on (the unknown) θ.

Sufficiency

We recall from Section 4.5 that a statistic t is sufficient if p(x | t, θ) = p(x | t), i.e., if the conditional distribution of the data given t is independent of θ (Proposition 4.11), and that a necessary and sufficient condition for t to be sufficient for θ is that the likelihood function may be factorised as lik(θ | x) = p(x | θ) = h(t, θ) g(x), in which case, for any prior p(θ), the posterior distribution of θ only depends on x through t, i.e., p(θ | x) = p(θ | t). The sufficiency principle in classical statistics essentially states that, for any given model p(x | θ) with sufficient statistic t, identical conclusions should be drawn from data x1 and x2 with the same value of t. From a Bayesian viewpoint there is obviously nothing new in this "principle": it is a simple mathematical consequence of Bayes' theorem. The idea was introduced by Fisher (1922) and developed mathematically by Halmos and Savage (1949) and Bahadur (1954). Other frequentist developments of the sufficiency concept have little or no interest from a Bayesian perspective. For example, a sufficient statistic t is complete if, for all θ in Θ, ∫ h(t) p(t | θ) dt = 0 implies that h(t) = 0; the property of completeness guarantees the uniqueness of certain frequentist statistical procedures based on t. The concept of sufficiency in the presence of nuisance parameters is controversial; see, for example, Cano et al. (1988) and references therein.

The contrast between the frequentist and the Bayesian views is highlighted by the following example, taken from Jaynes (1976).

Example B.2 (Cauchy observations). Let x = {x1, x2} consist of two independent
observations from a Cauchy distribution, p(x | θ) = St(x | θ, 1, 1). Common sense (supported by translational and permutational symmetry arguments) suggests that θ̂ = (x1 + x2)/2 may be a sensible estimate of θ. However, the sampling distribution of θ̂ is again St(θ̂ | θ, 1, 1), identical to that of a single observation, so that, from a "textbook" frequentist perspective, it cannot make any difference whether one uses x1, x2 or θ̂ to estimate θ.
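The Cauchy phenomenon is easy to verify by simulation. The sketch below (assuming the standard Cauchy parametrisation St(x | θ, 1, 1)) checks that the midpoint of two observations has the same sampling distribution as a single observation:

```python
# Numerical sketch of Example B.2: for two independent Cauchy(theta, 1)
# observations, the midpoint (x1 + x2)/2 has the SAME Cauchy(theta, 1)
# sampling distribution as a single observation, so unconditional
# frequentist theory cannot prefer it.
import math, random

random.seed(1)
N = 200_000
theta = 0.0

def rcauchy():
    # inverse-CDF draw from a standard Cauchy centred at theta
    return theta + math.tan(math.pi * (random.random() - 0.5))

mid = [(rcauchy() + rcauchy()) / 2 for _ in range(N)]

# For a Cauchy(0, 1) variable, P(|X| < 1) = 1/2 exactly,
# so the midpoint should also put half its mass in (-1, 1).
frac = sum(abs(m) < 1 for m in mid) / N
print(round(frac, 3))  # close to 0.5, matching a single observation
```

Conditioning on the ancillary statistic |x1 - x2|/2, by contrast, is what makes the pair of observations genuinely more informative than one.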
Ancillarity

In Section 5.1.4 we demonstrated how a sufficient statistic t = t(x) may often be partitioned into two component statistics, t(x) = [a(x), s(x)], such that the sampling distribution of a(x) is independent of θ. We defined such an a(x) to be an ancillary statistic and showed that, in the inferential process described by Bayes' theorem, it suffices to work conditionally on the value of the ancillary statistic. The conditionality principle in classical statistics states that, whenever there is an ancillary statistic a, the conclusions about the parameter should be drawn as if a were fixed at its observed value. For further information, see Basu (1959).

The difficulty in Example B.2 disappears if one works conditionally on the ancillary statistic x1 - x2. A further example regarding ancillarity, which illustrates the apparent need for such a principle in frequentist procedures, is the following.

Example B.3 (Conditional versus unconditional arguments). A signal comes from one of two sources, θ1 or θ2, and there are two receivers, R1 and R2, the receiver being selected at random with p(R1) = 0.8. Suppose that R2 was the receiver and x = 1 was obtained, and that the conditional likelihood, lik(θi | R2, x = 1), is largest for θ2, suggesting θ2 as the true value of θ. On the other hand, the unconditional likelihood given x = 1, lik(θi | x = 1), obtained by averaging over the receiver which might have been selected, can be largest for θ1, suggesting θ1 instead. The conflict arises because the latter (unconditional) argument takes undue account of what might have happened (i.e., that R1 might have been the receiver) but did not; the difficulty disappears if one works conditionally on the ancillary statistic, the receiver actually used.

These examples serve to underline the obvious appeal of a trivial consequence of Bayes' theorem: namely, that one should always condition inferences on whatever information is available. From a Bayesian viewpoint, the conditionality "principle" is just a small ad hoc step towards this rather obvious desideratum (which is, in any case, "automatic" in the Bayesian approach).
Basu (1964) noted that if x is uniform on [O. explicitly says that given f.2 Alternative Approaches 453 step towards this rather obvious desideratum (which is. 9 6 6 . would too frequently give misleading conclusions in hypothetical repetitions. that (iii) there are no means of assessing any finitesample realised accuracy of the procedures. P [E / I f l. it can be used to criticise specific solutions to concrete problems. I [ and hence ancillary. that (ii) optimality criteria have to be defined in terms of longrun behaviour under hypothetical repetitions and. 1). so that E suggests itself as an estimator of ji. .. in the Ion8 run.9S/fi whenever a random sample of size n from N ( s 1 ji. since ancillary statistics are not readily identified.. dom sample from N ( s I p. It is easily seen that F is a sufficient statistic. 9 6 / 6 1 S] = 0. Example BA. with precision n.95 that p belongs to f f 1 . we are producing an interval which will include the true value of the parameter 95% of the time.the degree ofbeliefis 0. and are not necessarily unique. whose sampling distribution is N ( f I p . } be a ran. Since the sampling distribution of L concentrates around p. Moreover. A much stronger version of this “principle”. one might expect f to be close to 11 on the ba5is of a large number of hypothetical repetitions of the sample. states that statistical procedures have to be assessed by their performance in hypothetical reperitions under identical conditions.!)C/&q . for some possible value of the parameter. .95 which is derived from the reference postenor distribution of p given f. Note that this says nothing about the probability that j i belongs to that interval for any given sample. The Repeated Sampling Principle A weak version of the repeated sampling principle states that one should not follow statistical procedures which. the superficially similar statement P [ p E 2 f 1 . 
This implies that (i) measures of uncertainty have to be interpreted as longrun hypothetical frequencies. In contrast. Let x = {s. the conditionality “principle” is not necessarily easy to apply.B. but the conditional distribution of x given its fractional part is a useless onepoint distribution! See Basu (1992) for further elegant demonstration of the difficulties with ancillary statistics in the frequentist approach. For example. /I] = 0. . in any case. Moreover. 1) is obtained. a normal distribution centred at the true value of the parameter.95 so that. n ) . whose essence is at the heart of frequentist statistics. if we define a statistical procedure to consist of producing the interval Z f 1. Although this is too vague a formulation on which to base a formal critique. applying the conditionality principle may leave the frequentist statistician in an impasse. then the fractional part of I is uniformly distributed on [0. and is not concerned at all with hypothetical repetitions of the experiment. conditional on 11. however. 1+ O [ . “automatic” in the Bayesian approach). (Confidence versus HPD intervals). From a frequentist viewpoint.. x.
. We recall from Section 5 . Note that the argument only works if there is no reason to believe that some 8 values are more likely than others. Thenthein~urianceprincipfewould require the conclusions about 8 drawn from the statistic f(. A more elaborate form of invariance principle involves transformations of both the sample and the parameter spaces. where 1 is a vector of unities.r . From a Bayesian point of view. . g(. Then the experiments produce the same conclusions about 8. Consider two experiments yielding. I the following trivial consequence of Baye\’ theorem.454 Invuriance 6. have the same distribution for every 8 .For example. data z and z arid with model repthe resentation involving the same parameter 8 E 8. The invariance principle then requires that any estimate t ( s )of0 should satisfy t(z + n l ) = t(z) + n. since they induce the same posterior distribution. in estimating I9 E ’R from a location as those drawn from model p ( I 0) = h(. so that I+ Ie)= z ) p ( z I el. with X = 0. for [he relative support givcn by the two sets o f data l o the possible values of 8 is precisely the same.t:) = 3’ + u.4.q(s)) be the same to g ( t ( x ) ) . ~ In this case. 8 2 3 Likelihood Inference . for all the elements of a group 6 of transformations there is a unique transformation g such thaty(O) = 8 a n d p ( z 18) = p(y(z) Ig(8)). and proportional likelihoods. Thus. A final general comment. for invariance to be a relevant notion it must be true that the transformation involved also applies to the prior distribution (otherwise. As we shall see in Section B. Another limitation to the practical usefulness of invariance ideas is the condition that g ( e ) = 8. respectively. ci E ‘31. such an approach typically fails to produce a sensible solution to the more fundamental problem of predictinx furitre ohsenwtions. the invariance principle could not be applied if it were known that B 2 0. 
one may have a uniform loss of expected utility from following the invariance principle). \ince long run behaviour under hypothetical repetitions depends on the entire distribution { p ( z 10). same prior. and g(0) = 0 + a. Frequentist procedures typically violate the likelihood principle. The Iikelikood principle suggests that this should indeed be the case. NonBayesian Theories If a parametric model p ( z 18) is such that two different data sets. Suppose that. in the location/translation example. 21 and x2.z E Y}and not only on the likelihood. Frequentist procedures centre their attention on producing inference statements about irnohscmvrhle parameters.19)it may be natural to consider the group of translations. then both the likelihood principle and the mechanics of Bayes’ theorem imply that one should derive the same conclusions about 8 from observing z l as from observing x?.
1992). Cox and Reid (1987. With a uniform prior. (i) is just a restatement of the likelihood principle and. since it is the factor which modifies prior odds into posterior odds. in that examples exist where. Bickel and Ghosh (1990). it follows from Bayes’ theorem that so that the likelihood ratio satisfies (ii). breaks down immediately when there are nuisance parameters. i.BarndorffNielsen ( 1980. Frequentist statistics . in their uses of the likelihood function. Proponentsof the likelihood approach to inferencego further. 1991 ). for some possible parameter values. For further information on the history of likelihood.e. For early attempts. Cox (1988). moreover. hypothetical repetitions result in mostly misleading conclusions.5. but the suggested procedures for doing this seem hard to justify in terms of the likelihood approach. the Bayesian approach automatically obeys the likelihood principle and certainly accepts the likelihood function as a complete summary of the information provided by the data about the parameter of interest.1983. Fraser and Reid (1989). also. Useful references include: Barnard and Sprott ( 1968). The likelihood approach can also conflict with the weak repeated sampling principle. the attempt to produce inferences solely based on the likelihood function. The basic ideas of this pure likelihood approach were established by Barnard ( 1949. however. 1973) and Andersen (1970. The use of “marginal likelihoods” necessarily requires the elimination of nuisance parameters. 1973). when common priors are used across models with proportional likelihoods.1 for a link with Laplace approximations of posterior densities. However. Butler ( I 986). Section 5. of course.. but also as a meaningful relative numerical measure of support for different possible models. proportional to the likelihood function.Birnbaum ( 1962. Davison (1986). Barnard er al.B. Indeed. See. Pereira and Lindley ( 1987).1963). 
Goldstein and Howard (1991) and Royal1 (1992).2 Alternative Approaches 455 As mentioned before. see Kalbfleish and Sprott (1970. work has focused on the properties ofprojle likelihood and its variants. Bjrnstad (1990) and Monahan and Boos (1992). the posterior distribution is. Plante (197 I). Both claims make sense from a Bayesian point of view when there are no nuisance purumerers.1972) and Edwards (1972/1992). Other references relevant to the interface between likelihood inference and Bayesian statistics include Hartigan (1967). in that they regard it not only as the sole expression of the relevant information. see Edwards (1974). They essentially argue that (i) the likelihood function conveys all the information provided by a set of data about the relative plausibility of any two different possible values of 8 and (ii) the ratio of the likelihood at two different 8 values may be interpreted as a measure of the strength of evidence in favour of one value relative to the other.1968. or for alternativeparameter values within the same model. Akaike (1980b). ( 1962). In recent years. the pure likelihood approach.
reveal the true value of 0. if. Consider the model e { I. . . NonBayesian Theories solves the difficulty by comparing the observed likelihood function with the distribution of the likelihood functions which could have been obtained in hypothetical repetitions. . as we shall discuss further in Section B. coupled with the seeming aversion of most statisticians to the use of prior distributions. say.I' Example B. . r I H ) . . The following example is due to Birnbaum (1969). a form of inference summary that seems most intuitively useful. '1.2. with high probability. .4. a second observation would. H E { O . Of course. . with this prior. and is declared to be the parameterof interesf.5. H = 0 is considered to be special. where. we note that. if H = 0 the likelihood of the true value is always I ilOOfh of the likelihood of the only other possible H value. I . the likelihood approach has difficulties in producing an agreed solution to prediction problems. p ( H = 1 . not with the likelihood function. . 100). like frequentist procedures. 6. 1. Finally.I' = 1. From a Bayesian point of view. . . as might well be the case in any real application of such a model. one observation from the model provides no information. the answer obviously depends on the prior distribution. whatever . for any prior.2. . . . . 1 0 0 . p ( . but with the posterior distribution defined as the weighted average of the likelihood function with respect to the prior.2. .456 B. . 2. 100 and a straightforward calculation reveals that this is cilso the posrerior. Then. . (NuLve fikolihood versus reference onufysis).r is observed. given a single observation. I' = 1 . has led to a number of attempts to produce "posterior" distributions without using priors.r. Bayesian statistics solves the problem by working.3 that frequentist approaches are inherently unable to produce probability statements about the parameter of interest conditional on the data. We now review some of those proposals. . ) = 1/200. . 
for . then the reference prior turns out to be p ( H = 0) = 1j 2 . 100). . This fact. namely H = . Thus. then one certainly has However. If all f) are judged a priori to have the same probability.4 Fiducial and Related Theories We noted in Section B.
The basic characteristics of the argument may be described as follows. hence. However. Let x = {x. Fisher (1930. what he termed thefiduciul argument. 1933. 1939) developed. which is monotone increasing in 8. } be a .F(t 18) has the mathematical properties of a distribution function over (80.with mean 6. with ~ ( 8= 8 I . Example B. The argument is trivially modified if F(t 10) is monotone increasing in 8. Hence. it follows that. Let p ( z I 8 ) . Suppose further that the distribution function o f t . Essentially.I 3 ) = F(% I 0) is monotone increasing from 0 to 1 as 6. random sample from an exponential distribution p(z 18) = Ex(z 18) = Be@”. Then. (Fiducial and reference distributions).B. he proposed using the distribution function F ( t 18) of a sufficient estimator t E T for 8 E 0 in order to make conditional probability statements about 8 given t . G(6. the fiducial distribution of 6.2 Alternative Approaches 457 Fiducial Inference In a series of papers published in the thirties. . with F(t 160)= 1 and F ( t 181) = 0. by using G(O I t) = F ( t 10).81) ? be a onedimensional parametric model and let t = t ( z ) R be a sufficient statistic for 0.Ol) and. as proposed by Fisher ( 1930. and has a distribution function ‘. thus somehow transferring the probability measure from T to 63. through a series of examples and without any formal structure or theory. is obtained as and Note that this has the form f(6. is monotone decreasing in 8. 1%) x p ( z 1 @a(@). Since n(8) = 8’ is ) in this case the reference prior for 0. Lindley (1958) established that this is true if. the fiducial distribution coincides with the reference posterior distribution. It is easily verified that 3 is a sufficient statistic for 8. has the mathematical structure of a “posterior density” for 8.6. and only if. . This is thefiduciul disrriburion of 8. 1935. in this example.. ranges over (0.z.1956/1973). 
This last example suggests that the fiducial argument might simply be a reexpression of Bayesian inference with some appropriately chosen “noninformative” prior. cc). . However. . F(t I 0). G(8 I t) = 1 . 8 E (80. . no formal justification was offered for this controversial “transfer”.
Fisher's original argument. and which has a distribution which only depends on 0 through h(0.458 B. Efron ( 1993) for a recent suggested modification of the fiducial distribution which may have better Bayesian properties. In onedimensional problems. z) conditional on the observed value of (~(z).t). 2) = [g(O. a matter of considerable controversyhow the argument might be extended to multiparameter problems. and that mainly due to the perceived stature of its proponent. . is more or less well defined and often produces reasonable answers. From a modem perspective. Other relevant references are Brillinger ( 1962) and Barnard (1963). and using the distribution of g(O. Then. 11. as described I above. See Seidenfeld (1992) for further discussion. They lacked the daring to question it. As Good ( I97 I ) puts it . as he modestly says. that so many people assumed for so long that the argument was correct. with sufficient statistic t . it seems almost inconceivable that Fisher should have made the error which he did in fact make. and (ii) because. his 1930 'explanation left a good deal to be desired. however. See. in fact. it is possible to find some function h ( 8 . Pivotul Inference Suppose that. . which are nevertheless far better justified from a Bayesian reference analysis viewpoint. Partitioning a pivotal function h(H. h(0. produce some interesting results does in multiparameter problems where the standard fiducial argument fails. since G(f? t ) is a pivotal function with a uniform distribution over [O. t ) which is monotone increasing in 0 for tixed t . is if special case of this formulation. and in t for fixed 0.o ( z ) ] to identify a possibly uniquely defined ancillary statistic a ( z ) . when applicable. NonBuyesian Theories the probability model y(r 10) is such that I and II may separately be transformed so as to obtain a new parameter which is a locution parameter for the transformed variable. the fiducial argument.2). However. However. . 
for a given model p ( z 10). It is because (i) it seemed so unlikely that a man of his stature should persist in the error. if we do not examine the tiducial argument carefully. The Royal Statistical Society discussions following the papers by Fieller (1954) and Wilkinson ( 1977) serve to illustrate the difficulties. possibly conditional on the observed values of an ancillary statistic ( ~ ( z ) . the fiducial argument seems now to have at most historical interest. which is independent of H. His basic idea is to produce statements derived from the distribution of an appropriately chosen pivotal function. it is by no means clearand. Barnard ( I980b) has tried to extend this idea into a general approach to inference. t) is called a pivotalfunction and the tiducial distribution of 8 may simply be obtained by reinterpreting the probability distribution of It over T as a probability distribution over 8.
. and independent of 8.   . 3 = p + uz. (Sfructuml and reference distributions). coincide with the corresponding reference posterior distributions. If p ( e ) is normal. . He proposes the specification of what he terms a structural model. and a probability distribution for e which is assumed known. 1) leads to structural distributions for u and p which.2 Alternative Approaches 459 the mechanism by which the probability measure is transferred from the sample space to the parameter space remains without foundational justification. The key idea is then to reverse this relationship. Fraser claimed that one often knows more about the relationship between data and parameters than that described by the standard parametric model p ( z I 8).}be a set of independent measurements with unknown location p and scale 0. Let x = {xI:. which relates data 2 and parameter 8 to some error variable e. the transformation governed by the value of 8. 1972. .. and the argument is limited to the availabilityby no means obviousof an appropriate pivotal function for the envisaged problem.7.. and fi(f . Structural Inference Yet another attempt at justifying the transfer of the probability measure from the sample space into the parameter space is the structural approach proposed by Fraser ( 1968.B. Thus. as is often the case.x.l)s2/a2 x‘fi. this structural model may be reduced in terms of the sufficient statistics 2 and Sz to the equations s = ns. the observed variable 2 is seen as a transformation of the error e.If the errors have a known distribution p ( e ) .so that 8 in a sense “inherits” the probability distribution.p ) / s St ( t I 0. the svuctural equation is and the error distribution is.1. and to interpret 8 as a transformation of e governed by the observed 2. 1979).71. Example B. and error distributions Reversing the probability relationship in the pivotal functions ( n . having two parts: a structural equation.
NonBuyesiun Theories The general formulation of structural inference generalises the affine group structure underlying the last example. IY83b.. . the final answer is an element f of t3. hence.460 B. operating on a realised error e. . 1977a. . within the Bayesian framework. Let { p ( z 1 O ) . with no explicit recognition o the rcncertuinry inwlved. Here. that only works in some situations. the problem of point estimation is naturally described as a decision problem where the set of possible answers to the inference problem. However. in most examples. We recall from Section 5. to be interpreted as the response 2 generated by some transformation 8 E G‘ in a group of transformations G. 1981. and references therein). In fact. the mechanism by which the probability measure on X is transferred to t3 is certainly welldefined in the presence of the group structure central to Fraser’s argument. Pragmatically. As Lindley (1969) puts it . Formally. . are special cases of reference posterior distributions (sce Villegas.3 STYLISED INFERENCE PROBLEMS 8 3 1 Point Estimation . the normal. . This is the socalled point estimation problem.1. 1990. or setting a stock level.suspiciouh of any argument. . Note that. 6 E 0)be a fully specified parametric family of models and suppose that it is desired to calculate from the data z a single value e(z) E 8 representing the “best estimate” of the unknown parameter 0 . and the Poisson distribution [is] not basically different in character from . Fraser’s argument [is]an improvement upon and an extension of Fisher‘s in the special case where the group structure is present but [one should beJ . a point estimate of 8 may be motivated as being the simplest possible summary of the inferences to be drawn from z about the value of 6: alternatively. for example. the group structure is fundamental and the approach seems to lack general validity and applicability. . 8. Dawid. 
the structural distributions are precisely the posterior distributions obtained by using as priors the right Haar measures associated with the structural group.l ( z )has the same probability distribution e and. and considers a structural equation z = Be. adjusting a control mechanism. A. in this formulation. . for inference is surely B whole.S that. this may be used to provide a structurul distribution for 8. with a completely identified error distribution for e. . . . which in turn. When the structural argument can be applied it produces answers which are mathematically closely related to Bayesian posterior distributions with “noninformative” priors derived from (group) invariance arguments. one may genuinely require a point estimate as the solution to a decision problem. It is then claimed that 8 . is the parameter space 0.
mode or median of the posterior distribution of 8. using these desiderata as criteria.8 ) which describes the decision maker’s preferences in that context. with respect to any particular loss function. Fiducial.9) that intuitively natural solutions. It is worth stressing that this is a constructive criterion. We have seen (Propositions 5.. withas we notedthe general minimax principle being unpalatable to most statisticians. proposes methods for obtaining “best” estimators. but classical decision theory provides no foundationally justified procedure for choosing among admissible estimators. and. . admissible estimators are essentially Bayes estimators. Classical decision theory ideas can obviously be applied to point estimation viewed as a decision problem. Hence.3 Stylised Inference Problems 461 one specifies the loss function l(a. etc. From our perspective. such as the mean. minimax estimates. the optimal estimator is naturally taken to be that which maximises the likelihood function. are particular cases of this formulation for appropriately chosen loss functions.B. one may define admissibleestimates. Hence. The likelihood approach proceeds by using the likelihood function to measure the strength with which the possible parameter values are supported by the data. to obtain that value of 8 which minimises some specified loss function with respect to such a distribution. more formally. either to offer as an estimator of 8 some location measure of the probability distribution of 8 or. in that it identifies a precise procedure for obtaining the required value. pivotal and structural inference approaches all produce “posterior” probability distributions for 8.2 and 5. the problems and limitations of classical decision theory that we identified in Section B. Thus. The criteria adopted are typically nonconstructive. The frequentist approach proceeds by defining possible desiderata of the long run behaviour of point estimators. 
in that the very definition of a maximum likelihood estimator (MLE) determines its method of construction. and chooses as the (Bayes) estimate that value @(z)which minimises the posterior expected loss. and identifies conditions under which “good behaviour” will result.1 carry over to particular applications such as point estimation.2. their “solution” to the problem of point estimation is essentially that suggested by the Bayesian approach. We also note that the definition of an optimal Bayesian estimator is constructive. Thus.
even if 8. is jointly sufficient for ( p . since then we simply have t o minimise I8) in this unbiased class. v . and only then. However.r lOj = N( I . although requiring the sampling distribution of 6 to be centred at 8 may have some intuitive appeal. in the long run. (ii) Sufficiency is a concept rekutive to a model. the (unique) unbiased estimator of the parameter O E (0. Indeed: (i) In many problems. is one certain to use all the relevant information about the parameter of interest. A concept of relative efjfciency is developed in these terms. there ure no unbiased estimators. unbiased estimator of the parameter 8 for a binomial Bi(r 18. for all 8. and no theory exists which specifies conditions under which this can be guaranteed not to happen.2. 1) of a geometric distribution p(.4 and B. } is known. I ..c 10.1).cr) + 0. An estimator 6 . For example. (i) Sufficiency is a global concept. when 8 . . For example ( 2 . data (2. thus. unbiased estimators may give nonsensical answers. The bias of an estimator &z)is defined to be and its mean squared error (mse) to be From a frequentist point of view it is desired that.{ O . the following two points introduce a note of caution.. if. it does (j not follow that &(z) sufficient for a component parameter 8. r / u is an distribution.e.?. cr). with univariate normal s?) ). 1000) or the mixture form 0. for then. is more e&ient than 6. However.462 B. 8 ) .m ) if the true model is St ( r 1 p .(z) is is sufficient for 8. even though these two models are indeed very “close” to N(I I p. if quadratic loss isjudged to be an appropriate “distance” measure. i. For instance.0 ) ’ . thus.999 x N ( s I p. a frequentist would like an estimator (5. (5. (ii) Even when they exist.. For instance.2 that the search for good estimators may safely be limited to those based on suflcient statistics. there are powerful arguments u p i n s t requiring unbiasedness.1. if e s is sufficient for 8 .. 
B.3 Stylised Inference Problems

Criteria for Point Estimation

An estimator θ̂ = θ̂(x) of θ should clearly be as close to θ as possible. A standard frequentist criterion is the mean squared error, mse(θ̂ | θ) = E[(θ̂ − θ)² | θ], and one seeks estimators with small mse(θ̂ | θ) for almost all values of θ, preferring θ̂₁ to θ̂₂ if mse(θ̂₁ | θ) ≤ mse(θ̂₂ | θ) for all θ. A simple theory is available if attention is restricted to unbiased estimators, i.e. estimators such that b(θ̂) = 0, where b(θ̂) denotes the bias. Thus, provided μ = E(x | θ) and σ² = V(x | θ) exist, x̄ is said to be the best linear unbiased estimator (BLUE) of μ, in the sense that it has the smallest mse among all linear, unbiased estimators. This is, however, a rather restricted view of optimality, since non-linear estimators may be considerably more efficient. The theory is also fragile: even a small perturbation to the assumed model may destroy sufficiency. For example, under a contaminated model of the form p(x | μ) = 0.999 N(x | μ, 1) + 0.001 N(x | μ, λ), with λ another precision, x̄ is no longer sufficient for μ, nor is (x̄, s) sufficient for (μ, σ).

Another frequentist criterion for judging an estimator concerns the asymptotic behaviour of its sampling distribution: a frequentist would expect the estimator to converge to θ (in some sense) as n increases. Writing explicitly the dependence of the estimator on the sample size, θ̂ₙ is said to be weakly consistent if θ̂ₙ → θ in probability, and strongly consistent if θ̂ₙ → θ with probability one. By Chebychev's inequality, a sufficient condition for the weak consistency of unbiased estimators is that V(θ̂ₙ | θ) → 0 as n → ∞; a consistent estimator is asymptotically unbiased. For discussion on the consistency of Bayes estimators, see Schwartz (1965) and Diaconis and Freedman (1986a, 1986b); for the frequentist properties of Bayes estimators, see Freedman and Diaconis (1983) and de la Horra (1986).

The insistence on unbiasedness is open to serious objections:

(i) Sensible unbiased estimators may simply not exist; for instance, s² is an unbiased estimator of σ², but there is no unbiased estimator of σ.

(ii) Unbiased estimators may be absurd. If θ is the mean of a Poisson distribution, p(x | θ) = θˣ e^(−θ)/x!, x = 0, 1, 2, …, then the only unbiased estimator of e^(−θ) is δ(x) with δ(0) = 1 and δ(x) = 0 for x = 1, 2, …, hardly a sensible solution! Even more ridiculously, the only unbiased estimator of e^(−2θ) is (−1)ˣ, which is 1 if x is even and −1 if x is odd, leading to the estimate of a quantity which must lie in (0, 1) as −1 (for all odd x)! For further examples see Ferguson (1967).

(iii) The unbiasedness requirement violates the likelihood principle, by making the answer dependent on the sampling mechanism. For example, suppose one is measuring μ with an instrument which only works for values x ≤ 100, so that the model is N(x | μ, σ) restricted to x ≤ 100, and one obtains x = 50, a valid measurement. The unbiased estimator of μ under the truncated model differs from the unbiased estimator under the unrestricted normal model; yet it seems inappropriate to make our estimate of μ dependent on the fact that we might have obtained an invalid measurement, but did not.

(iv) Even from a frequentist perspective, unbiased estimators may well be unappealing if they lead to large mean squared errors. Since the mse decomposes into variance plus squared bias, an estimator with small bias and small variance may be preferred to one with zero bias but a large variance.

For further discussion of the conflict between Bayes and unbiased estimators, see Wald (1939) and Bickel and Blackwell (1967).
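The Poisson pathology in (ii) above is easy to check numerically. The following is an illustrative sketch of ours (not from the original text): it verifies by simulation that (−1)ˣ is indeed unbiased for e^(−2θ), while only ever producing the absurd estimates +1 or −1.

```python
import math
import random

def poisson(rng, lam):
    # Knuth's inversion sampler; adequate for small lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

theta, n_reps = 1.0, 200_000
rng = random.Random(0)
estimates = [(-1) ** poisson(rng, theta) for _ in range(n_reps)]
mc_mean = sum(estimates) / n_reps
target = math.exp(-2 * theta)

# Unbiased: the long-run average of the estimates matches e^(-2*theta) ...
assert abs(mc_mean - target) < 0.01
# ... yet every individual estimate is +1 or -1, absurd for a quantity in (0, 1).
assert set(estimates) == {1, -1}
```

Unbiasedness here is a statement about the long-run average only; it says nothing sensible about any individual estimate.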
"Optimum" Estimators

We have mentioned before that minimising the variance among unbiased estimators is often suggested as a procedure for obtaining "good" estimators. An "absolute" standard by which unbiased estimators may be judged is provided by the Cramér-Rao inequality. Let θ̃ = θ̃(x) be an unbiased estimator of g(θ), and define the efficient score function

u(x, θ) = (∂/∂θ) log p(x | θ).

Then, under suitable regularity conditions, E[u(x, θ) | θ] = 0 and

V(θ̃ | θ) ≥ [g′(θ)]² / E[u²(x, θ) | θ],

with equality if, and only if, θ̃ is a linear function of the score function, θ̃ = g(θ) + k(θ) u(x, θ), where k(θ) does not depend on x, in which case θ̃ is said to be a minimum variance bound (MVB) estimator of g(θ). It follows that a minimum variance bound estimator must be sufficient; moreover, if t is sufficient for θ, there is a unique function g(θ) for which an MVB estimator exists, a result which can be generalised to multidimensional problems. The range of situations where such "optimal" unbiased estimators can be found is therefore rather limited. For example, if x = {x₁, …, xₙ} is a random sample from N(x | μ, λ), then x̄ = Σxᵢ/n is an MVB estimator for μ, but no MVB estimator exists for σ! One might then ask whether it is at least possible to obtain an unbiased estimator with a variance which is lower than that of any other unbiased estimator for each value of θ, even if it does not reach the Cramér-Rao lower bound. Under suitable regularity conditions, the existence of such uniformly minimum variance (UMV) estimators can indeed be established.

We have already stressed that limiting attention to unbiased estimators may not be a good idea in the first place. Even within that restriction, however, a key result is available. Rao (1945) and Blackwell (1947) independently proved that if θ̃(x) is an estimator of θ and t = t(x) is a sufficient statistic for θ, then

θ̂(t) = E[θ̃ | t] = ∫ θ̃(x) p(x | t) dx,

the conditional expectation of θ̃(x) given the value of the sufficient statistic t, is an improved estimator of θ, in the sense that, for every value of θ, mse(θ̂ | θ) ≤ mse(θ̃ | θ). A decision-theoretic consequence of this so-called Rao-Blackwell theorem is that any estimator of θ which is not a function of the sufficient statistic t must be inadmissible.
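The Rao-Blackwell improvement can be seen numerically. A hedged sketch (the Poisson example is our choice, not the book's): the crude unbiased estimator of e^(−θ), namely δ(x) = 1 if x₁ = 0 and 0 otherwise, is conditioned on the sufficient statistic t = Σxᵢ, which gives E[δ | t] = (1 − 1/n)ᵗ; simulation shows the resulting reduction in mean squared error.

```python
import math
import random

def poisson(rng, lam):
    # Knuth's inversion sampler; adequate for small lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def mse_comparison(theta=1.0, n=10, reps=50_000, seed=1):
    rng = random.Random(seed)
    target = math.exp(-theta)
    se_crude = se_rb = 0.0
    for _ in range(reps):
        x = [poisson(rng, theta) for _ in range(n)]
        crude = 1.0 if x[0] == 0 else 0.0   # unbiased, but ignores most of the data
        t = sum(x)                          # sufficient statistic
        rb = (1.0 - 1.0 / n) ** t           # E[crude | t]: Rao-Blackwellised version
        se_crude += (crude - target) ** 2
        se_rb += (rb - target) ** 2
    return se_crude / reps, se_rb / reps

mse_crude, mse_rb = mse_comparison()
assert mse_rb < mse_crude   # conditioning on t reduces the mse
```

The conditional expectation is still unbiased, but its variance is uniformly smaller, exactly as the theorem asserts.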
However, as a constructive procedure for obtaining estimators this result is of limited value, due to the fact that it is usually very difficult to calculate the required conditional expectation. An exception arises when there is a complete sufficient statistic t = t(x): if θ̃(x) is unbiased, then θ̂(t) = E[θ̃ | t] is unbiased and is the UMV estimator of θ. For example, r/n is the MVB estimator of the parameter θ of a binomial distribution Bi(r | θ, n), and there is no MVB estimator of θ²; however, the result may be used to show that r(r − 1)/[n(n − 1)] is a UMV estimator of θ².

Both the likelihood and the Bayesian solutions to the point estimation problem automatically define procedures for obtaining estimators; the frequentist approach does not (except for special cases like the exponential family). Historically, construction methods such as maximum likelihood, minimum chi-squared, least squares and the method of moments have all been used at various times within the frequentist approach to produce candidate "good estimators", which have then been analysed using the criteria described above. These methods do not in themselves guarantee any particular properties for the resulting estimators, which usually have to be investigated case by case.

Maximum likelihood estimators (MLE's) are not guaranteed to exist or to be unique, but when they do exist they typically have very good asymptotic properties. Under fairly general conditions, MLE's can be shown to be consistent (hence asymptotically unbiased, even if biased in small samples), asymptotically fully efficient and asymptotically normal: as n → ∞, the sampling distribution of the MLE converges to the normal N(θ̂ | θ, I(θ)), with mean θ and precision given by the information function I(θ). Least squares and method of moments estimators are typically biased, but have analogous asymptotic properties (i.e. they are consistent and asymptotically normal), although they are not, in general, asymptotically fully efficient.

Nowadays, partly under the influence of classical decision theory, some frequentist statisticians pragmatically minimise an expected posterior loss to obtain an estimator, whose behaviour they then proceed to study using non-Bayesian criteria. A famous example is the Pitman estimator (Pitman, 1939), which may be obtained as the posterior mean which corresponds to a uniform prior; see Robert et al. (1993). For an extensive treatment of the topic of point estimation, see Lehmann (1959/1983).
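The asymptotic normality of maximum likelihood estimators mentioned above can be checked by simulation. A sketch under our own choice of model (exponential sampling with rate θ, so that the MLE is 1/x̄ and the total Fisher information is n/θ²):

```python
import math
import random
import statistics

def mle_sampling_sd(theta=2.0, n=400, reps=5000, seed=2):
    rng = random.Random(seed)
    mles = []
    for _ in range(reps):
        x = [rng.expovariate(theta) for _ in range(n)]
        mles.append(n / sum(x))   # MLE of the exponential rate: 1 / sample mean
    return statistics.stdev(mles)

theta, n = 2.0, 400
observed_sd = mle_sampling_sd(theta, n)
asymptotic_sd = theta / math.sqrt(n)   # (n I(theta))^(-1/2), with I(theta) = 1/theta^2
assert abs(observed_sd - asymptotic_sd) / asymptotic_sd < 0.05
```

The empirical sampling standard deviation of the MLE matches the asymptotic value to within a few per cent at this sample size.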
B.3.2 Interval Estimation

Let {p(x | θ), θ ∈ Θ} be a fully specified parametric family of models, and suppose that it is desired to calculate, from the data x, a region C(x) within which the parameter θ may reasonably be expected to lie. Thus, rather than mapping X into Θ, as in point estimation, a subset of Θ is associated with each value of x. This is the so-called region estimation problem; when θ is one-dimensional, the regions obtained are typically intervals, hence the more standard reference to the interval estimation problem. Region estimates of θ may be motivated pragmatically, as informative simple summaries of the inferences to be drawn from x about the value of θ, or, more formally, as a set of θ values which may safely be declared to be consistent with the observed data, whose elements may be claimed to be supported by the data as "likely" values of the unknown parameter θ.

We recall from Chapter 5 that, within a Bayesian framework, credible regions provide a sensible solution to the problem of region estimation. Given x, a 100(1 − α)% credible region is a region which contains the true value of the parameter with (posterior) probability 1 − α and, among such regions, those of the smallest size, the highest posterior density (HPD) regions, suggest themselves as summaries of the inferential content of the posterior distribution. Note that this formulation is equally applicable to prediction problems, simply by using the corresponding posterior predictive distribution.

Confidence Limits

For 0 < α < 1 and scalar θ ∈ Θ ⊆ ℜ, a statistic θ̄^α(x) such that, for all θ,

Pr{θ ≤ θ̄^α(x) | θ} = 1 − α,

and such that if α₁ > α₂ then θ̄^{α₁}(x) ≤ θ̄^{α₂}(x), is called an upper confidence limit for θ with confidence coefficient 1 − α. A lower confidence limit θ_α(x) is similarly defined as a statistic such that Pr{θ_α(x) ≤ θ | θ} = 1 − α, for each α, with the corresponding nesting property. The nesting condition is important to avoid inconsistency. Note that if g is strictly increasing, then g(θ̄^α(x)) is an upper confidence limit for g(θ).

It is crucial, however, to recognise that the only proper probability interpretation of a confidence limit is that, in the long run, a proportion 1 − α of the θ̄^α(x) values will be larger than θ. Whether or not the particular θ̄^α(x) which corresponds to the observed data x is smaller or greater than θ is entirely uncertain. The specific interval (−∞, θ̄^α(x)] is typically interpreted as a region within which the parameter θ may reasonably be expected to lie but, as in point estimation, one only has the rather dubious "transferred assurance" from the long-run definition; see, e.g., Plante (1984, 1991).
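The long-run reading of a confidence limit is easily illustrated. A minimal sketch of ours: in repeated samples of size n = 25 from N(x | μ, 1), the upper 95% limit x̄ + 1.645/√n exceeds μ in about 95% of the repetitions, though for any particular sample it either does or does not, with no probability attached.

```python
import math
import random

def upper_limit_coverage(mu=0.0, n=25, reps=20_000, seed=3):
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        if mu <= xbar + 1.645 / math.sqrt(n):   # upper 95% confidence limit
            hits += 1
    return hits / reps

coverage = upper_limit_coverage()
# The long-run proportion is close to 0.95; nothing is asserted about any one interval.
assert abs(coverage - 0.95) < 0.01
```

The "95%" is a property of the procedure over hypothetical repetitions, which is precisely the "transferred assurance" criticised in the text.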
Combining a lower limit at confidence level 1 − α₁ with an upper limit at confidence level 1 − α₂, with the corresponding nesting property, we obtain a two-sided confidence interval [θ_{α₁}(x), θ̄^{α₂}(x)] at confidence level 1 − α₁ − α₂. For fixed α₁ + α₂ = α, a convenient choice is α₁ = α₂, which produces central confidence intervals based on equal tail-area probabilities. There are, however, other alternatives.

(i) Shortest confidence intervals. For fixed α₁ + α₂ = α, the values α₁ and α₂ may be chosen to minimise the expected interval length E_{x|θ}[θ̄^{α₂}(x) − θ_{α₁}(x)]. It can be proved that intervals based on the score function, whose sampling distribution is asymptotically normal with mean 0 and variance I(θ), have asymptotically minimum expected length, and the score function may thus be used to provide approximate confidence intervals for θ. It must be realised, however, that shortest intervals for θ do not generally transform to shortest intervals for functions g(θ).

(ii) Most selective intervals. Alternatively, one could try to choose α₁ and α₂ to minimise the probability that the interval contains false values of θ. However, for a variety of reasons, such uniformly most accurate intervals are not guaranteed to exist.

Moreover, the construction of confidence intervals raises a number of general difficulties.

(i) Exact intervals typically do not exist for arbitrary confidence levels when the model is discrete.

(ii) There is no general constructive guidance on which particular statistic to choose in constructing the interval; moreover, the properties of the various alternative procedures are generally less than clear.

(iii) There are serious difficulties in incorporating any known restrictions on the parameter space: no systematic procedure exists for incorporating such knowledge in the construction of confidence intervals.

(iv) In multiparameter situations, the construction of simultaneous confidence intervals is rather controversial. It is less than obvious whether one should use the confidence limits associated with individual intervals, or whether one should think of the problem as that of estimating a region for a single vector parameter, or as one of considering the probability that a number of confidence statements are simultaneously correct.

(v) Interval estimation in the presence of nuisance parameters is another controversial topic. Unless appropriate pivotal quantities can be found, the construction of confidence intervals is by no means immediate, and one is typically limited to ad hoc approximations, based on replacing the unknown nuisance parameters by estimates.
(vi) Interval estimation of future observations poses yet another set of difficulties. Unless one is able to find a function of the present and future observations whose sampling distribution does not depend on the parameters (and this is not typically the case), one is again limited to ad hoc approximations based on substituting estimates for parameters.

As a final point, we should mention that, even in the simplest case where θ is a scalar parameter labelling a continuous model p(x | θ), the concept of a confidence interval is open to what many would regard as a rather devastating criticism: namely, the fact that the confidence limits can turn out to be either vacuous or just plain silly in the light of the observed data. We give two examples.

(i) In the Fieller-Creasy problem, where the parameter of interest is the ratio of two normal means, there is a subset of possible data, with positive probability, for which the corresponding 1 − α confidence interval is the entire real line. Solemnly quoting the whole real line as a 95% confidence interval for a real parameter is not a good advertisement for statistics. For Bayesian solutions, see Bernardo (1977) and Raftery and Schweder (1993).

(ii) If x₁ and x₂ are two random observations from a uniform distribution on the interval (θ − 0.5, θ + 0.5), and y₁ and y₂ are, respectively, the smaller and the larger of these two observations, then it is easily established that, for all θ,

Pr{y₁ ≤ θ ≤ y₂ | θ} = 0.5,

so that (y₁, y₂) provides a 50% confidence interval for θ. But if, for the observed data, it turns out that y₂ − y₁ ≥ 0.5, then certainly y₁ < θ < y₂, so that we know for sure that θ belongs to the interval (y₁, y₂), even though the confidence level of the interval is only 50%.

These examples reflect the inherent difficulty that the frequentist approach to statistics has in being unable to condition on the complete observed data. Conditioning on ancillary statistics, when possible, may mitigate this problem, but it certainly does not solve it and, as discussed earlier in this Appendix, it may create others. The reader interested in other blatant counterexamples to the (unconditional) frequentist approach to statistics will find references in the literature under the keyword relevant subsets, which refers to subsets of the sample space yielding special information and subverting the "long-run" or "on average" frequentist viewpoint. Two important such references are Robinson (1975) and Jaynes (1976); see, also, Buehler (1959), Basu (1964, 1988), Cornfield (1969), Pierce (1973), Robinson (1979a, 1979b), Casella (1987, 1992), Maatta and Casella (1990) and Goutis and Casella (1991).

Finally, we note that for many of the standard textbook examples of confidence intervals (typically, those which can be derived from univariate continuous pivotal quantities), the quoted intervals are numerically equal to credible regions of the same level obtained from the corresponding reference posterior distributions.
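The uniform-distribution example (ii) above is easy to verify by simulation. This sketch (ours) confirms both the 50% long-run coverage and the certainty of coverage on the relevant subset where y₂ − y₁ ≥ 0.5.

```python
import random

def uniform_interval_coverage(theta=10.0, reps=100_000, seed=4):
    rng = random.Random(seed)
    overall = wide = wide_hits = 0
    for _ in range(reps):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        y1, y2 = min(x1, x2), max(x1, x2)
        covered = y1 <= theta <= y2
        overall += covered
        if y2 - y1 >= 0.5:          # the "relevant subset"
            wide += 1
            wide_hits += covered
    return overall / reps, wide_hits / wide

overall_cov, wide_cov = uniform_interval_coverage()
assert abs(overall_cov - 0.5) < 0.01   # nominal 50% coverage in the long run
assert wide_cov == 1.0                 # but certain coverage once y2 - y1 >= 0.5
```

Wide intervals cannot arise unless the two observations straddle θ, so conditioning on the observed width overturns the unconditional confidence statement.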
The intuitive interpretation that many users (incorrectly, of course!) tend to give to frequentist intervals of confidence 1 − α, namely that, given the data, there is probability 1 − α that the interval contains the true parameter value, would therefore be correct in those cases if the interval were described as a reference posterior credible interval. A typical example of this situation is provided by the class of intervals for the mean of a normal distribution with unknown precision, derivable from the sampling distribution of the pivotal quantity √n(x̄ − μ)/s. These are both the "best" confidence intervals for μ and also the credible intervals which correspond to the reference posterior distribution for μ, the Student t distribution π(μ | x) derived in Chapter 5.

However, Buehler and Feddersen (1963) demonstrated that relevant subsets exist even in this standard case. For interval estimation with a sample of size two, x = {x₁, x₂}, C = (x_min, x_max) provides a 50% confidence interval for μ; yet there is a subset R of the sample space, with positive probability, such that if both observations belong to R then

Pr{μ ∈ C | x ∈ R, μ} ≥ 0.5181,

so that, conditionally, the coverage probability exceeds the quoted 50%. Pierce (1973) has shown that similar situations can occur whenever the confidence interval cannot be interpreted as a credible region corresponding to a posterior distribution with respect to a proper prior. Although this long-run coverage probability is not directly relevant to a Bayesian, the example suggests that special care should be exercised when interpreting posterior distributions obtained from improper priors. Casella et al. (1993) have proposed, instead, alternative loss functions to the standard linear functions of volume and coverage probability for interval estimation.

B.3.3 Hypothesis Testing

Let {p(x | θ), θ ∈ Θ} be a fully specified parametric family of models, and let Θ be partitioned into two disjoint subsets Θ₀ and Θ₁. Suppose that we wish to decide whether the unknown θ lies in Θ₀ or in Θ₁, where the choice is to be made on the basis of the observed data x. This is the so-called problem of hypothesis testing. If H₀ denotes the hypothesis that θ ∈ Θ₀, and H₁ the hypothesis that θ ∈ Θ₁, then we have a decision problem with only two possible answers to the inference problem: a₀ ≡ accept H₀, or a₁ ≡ accept H₁. In most such problems, the two hypotheses are not symmetrically treated: the working hypothesis H₀ is usually called the null hypothesis, while H₁ is referred to as the alternative hypothesis. Although the theory can easily be extended to any finite number of alternative hypotheses, we will present our discussion in terms of a single alternative hypothesis.
We recall from Section 6.1 that, within a Bayesian framework, the problem of hypothesis testing, as formulated above, can be appropriately treated using standard decision-theoretic methodology, by specifying a prior distribution and an appropriate utility function, and maximising the corresponding posterior expected utility. We also recall that the solution to the decision problem posed generally depends on whether or not the "true" model is assumed to be included in the family of analysed models. Assuming the stylised M-closed case, where the true model is assumed to belong to the family {p(x | θ), θ ∈ Θ}, and a simple loss structure under which correct decisions incur no loss, while l₀₁ and l₁₀ are, respectively, the losses incurred by accepting a false null and rejecting a true null, we have seen (Proposition 6.1) that the null hypothesis H₀ should be rejected if, and only if, the appropriate (integrated) likelihood ratio, or Bayes factor,

B₀₁(x) = [∫_{Θ₀} p(x | θ) p(θ) dθ / ∫_{Θ₀} p(θ) dθ] / [∫_{Θ₁} p(x | θ) p(θ) dθ / ∫_{Θ₁} p(θ) dθ],

is smaller than a cut-off point which depends on the ratio l₀₁/l₁₀ of the losses incurred, and on the ratio of the prior probabilities of the hypotheses.

From the point of view of classical decision theory, the problem of hypothesis testing is naturally posed in terms of decision rules. A decision rule for this problem (henceforth called a test procedure δ, or simply a test) is specified in terms of a critical region R_δ, defined as the set of x values such that H₀ is rejected whenever x ∈ R_δ. The most relevant frequentist aspect of such a procedure δ is its power function,

pow(θ | δ) = Pr{x ∈ R_δ | θ}, θ ∈ Θ,

which specifies, as a function of θ, the long-run probability that the test rejects the null hypothesis H₀. Obviously, the ideal power function would be pow(θ | δ) = 0 for θ ∈ Θ₀ and pow(θ | δ) = 1 for θ ∈ Θ₁, although, naturally, one will seldom be able to derive a test procedure with such an ideal power function.
For any test procedure δ one may explicitly consider two types of error: rejecting a true null hypothesis, a so-called error of type 1, and accepting a false null hypothesis, a so-called error of type 2. Let us denote by α(δ | θ) and β(δ | θ) the respective probabilities of these two types of error, so that

α(δ | θ) = Pr{x ∈ R_δ | θ} if θ ∈ Θ₀, and α(δ | θ) = 0 otherwise;
β(δ | θ) = Pr{x ∉ R_δ | θ} if θ ∈ Θ₁, and β(δ | θ) = 0 otherwise.

It would obviously be desirable to identify tests which keep both error probabilities as small as possible but, typically, modifying R_δ to reduce one would make the other larger; thus, one usually tries to minimise some function of the two, for example a linear combination aα(δ | θ) + bβ(δ | θ).

Either Θ₀ or Θ₁ may contain just a single value of θ, in which case the corresponding hypothesis is referred to as a simple hypothesis; if Θᵢ contains more than one value of θ, then Hᵢ is referred to as a composite hypothesis.

The size of any specific test δ is defined to be

α = sup_{θ ∈ Θ₀} pow(θ | δ).

Since an incorrect rejection of the null hypothesis is usually regarded as the more serious error, frequentist statisticians often specify an upper bound for its long-run probability, which is then called the level of significance of the tests to be considered; thus, to specify a significance level α is to restrict attention to those tests whose size is not larger than α.

Testing Simple Hypotheses

When both H₀ and H₁ are simple hypotheses, α(δ | θ) = α(δ | θ₀) = α(δ) and β(δ | θ) = β(δ | θ₁) = β(δ). In this case, it can be proved that a test which minimises aα(δ) + bβ(δ) should reject H₀ if, and only if, the likelihood ratio in favour of the null is smaller than the ratio of the weights given to the two kinds of error; this can be seen as a particular case of the Bayesian solution recalled above. The result is closely related to the Neyman-Pearson lemma (Neyman and Pearson, 1933, 1967), which says that a test which minimises β(δ) subject to α(δ) ≤ α must reject H₀ if, and only if,

p(x | θ₀) ≤ k p(x | θ₁),

for some appropriately chosen constant k.
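For two simple normal hypotheses, the link between minimising aα(δ) + bβ(δ) and a likelihood-ratio cut-off can be checked numerically. A sketch with illustrative numbers of our own (H₀: μ = 0 versus H₁: μ = 1, a single observation, a = b = 1): among critical regions of the form {x > c}, the total error α + β is minimised where the likelihood ratio equals one, i.e. at c = (μ₀ + μ₁)/2 = 0.5.

```python
import math

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu0, mu1 = 0.0, 1.0

def total_error(c):
    alpha = 1.0 - Phi(c - mu0)   # P(x > c | H0): type 1 error probability
    beta = Phi(c - mu1)          # P(x <= c | H1): type 2 error probability
    return alpha + beta

# Brute-force search over candidate cut-offs for the critical region {x > c}.
grid = [i / 1000.0 for i in range(-2000, 3000)]
best_c = min(grid, key=total_error)

# The minimiser sits where p(x|H1)/p(x|H0) = 1, i.e. at (mu0 + mu1)/2.
assert abs(best_c - 0.5) < 0.01
```

Unequal weights a ≠ b simply shift the optimal cut-off, exactly as the ratio of losses shifts the Bayes factor threshold in the Bayesian solution.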
The Neyman-Pearson lemma thus shows explicitly how to derive an optimal test of one simple hypothesis against another. It has become standard practice among many frequentist statisticians to choose a significance level α₀ (often a "conventional" quantity such as 0.05 or 0.01) and then to find a test procedure which minimises β(δ) among all tests such that α(δ) ≤ α₀ (rather than explicitly minimising some combination of the two probabilities of error), but it should be emphasised that this is not a sensible procedure. Indeed:

(i) With discrete data one cannot attain a fixed specific size α₀ without recourse to auxiliary, irrelevant randomisation, whereas minimisation of a linear combination of the form aα(δ) + bβ(δ) can always be achieved, in which case no difficulties of this type can arise. For a Bayesian view on randomisation, see Kadane and Seidenfeld (1986).

(ii) More importantly, by fixing α(δ) and minimising β(δ) one may find that, with large sample sizes, the minimising β(δ) is extremely small compared with the fixed α(δ), so that H₀ is rejected even when p(x | H₀) is far larger than p(x | H₁). Although this can be avoided by carefully selecting α(δ) as a decreasing function of the sample size, it seems far more natural to minimise a linear combination aα(δ) + bβ(δ) of the two error probabilities. Indeed, it is important to note (see, e.g., Lindley, 1972) that minimising a linear combination of the two types of error is actually the only coherent way of making a choice, in the sense that no other procedure is equivalent to minimising an expected loss. Other strategies for the choice of α(δ) and β(δ) have been proposed; minimising the maximum of the two error probabilities, for example, corresponds to the minimax principle.

Composite Alternative Hypotheses

In spite of the difficulties described above, frequentist statisticians have traditionally defined an optimal test to be one which minimises β(δ | θ) for a fixed significance level α₀. In terms of the power function, this implies deriving a test δ* such that pow(θ | δ*) ≤ α₀ for all θ ∈ Θ₀, and for which pow(θ | δ*) is as large as possible in Θ₁. A test procedure δ* is called a uniformly most powerful (UMP) test at level of significance α₀ if α(δ* | θ) ≤ α₀ and if, for any other δ such that α(δ | θ) ≤ α₀, pow(θ | δ) ≤ pow(θ | δ*) for all θ ∈ Θ₁.

UMP tests do not generally exist but, when θ is one-dimensional, they often exist for one-sided alternative hypotheses. A model {p(x | θ), θ ∈ Θ ⊆ ℜ} is said to have a monotone likelihood ratio in the statistic t = t(x) if, for all θ₁ < θ₂, the ratio p(x | θ₂)/p(x | θ₁) is an increasing function of t.
It can be proved that if p(x | θ) has a monotone likelihood ratio in t, and c is a constant such that Pr{t ≥ c | θ₀} = α₀, then the test δ which rejects H₀ whenever t ≥ c is a UMP test, at level of significance α₀, of the hypothesis H₀: θ ≤ θ₀ versus the alternative H₁: θ > θ₀.

Example B.8. (Non-existence of a UMP test). If x = {x₁, …, xₙ} is a random sample from a normal distribution N(x | μ, 1), then the test δ₁ defined by the critical region

R_{δ₁} = {x; x̄ − μ₀ > 1.282/√n}

is a UMP test for H₀: μ ≤ μ₀ versus H₁: μ > μ₀ at the 0.10 significance level. Similarly, the test δ₂ defined by

R_{δ₂} = {x; x̄ − μ₀ < −1.282/√n}

is a UMP test for H₀: μ ≥ μ₀ versus H₁: μ < μ₀, with the same level. Since these critical regions are different, it follows that there is no UMP test for μ = μ₀ versus μ ≠ μ₀.

The fact, illustrated in the above example, that UMP tests typically do not exist for two-sided alternatives suggests that a less demanding criterion must be used if one is to define a "best" test among those with a fixed significance level. Since the power function pow(θ | δ) describes the probability that the test δ rejects the null, it seems desirable that, when H₀ is true, pow(θ | δ) should be smaller in Θ₀ than elsewhere. A test δ is called unbiased if, for any pair θ₀ ∈ Θ₀ and θ₁ ∈ Θ₁, it is true that pow(θ₀ | δ) ≤ pow(θ₁ | δ).

Example B.9. (Comparative power of selected tests). If x = {x₁, …, xₙ} is a random sample from a normal distribution N(x | μ, 1), then the test δ₃ defined by the critical region

R_{δ₃} = {x; |x̄ − μ₀| > 1.645/√n}

has size 0.10.
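The power functions in Example B.8 are straightforward to tabulate. The sketch below (our own computation) shows why neither one-sided test can be uniformly most powerful against the two-sided alternative: each has high power on its own side of μ₀ and essentially none on the other.

```python
import math

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, mu0, z10 = 30, 0.0, 1.282   # one-sided 0.10-level cut-off

def power_delta1(mu):
    # delta1 rejects when xbar - mu0 > 1.282 / sqrt(n)
    return 1.0 - Phi(z10 - math.sqrt(n) * (mu - mu0))

def power_delta2(mu):
    # delta2 rejects when xbar - mu0 < -1.282 / sqrt(n)
    return Phi(-z10 - math.sqrt(n) * (mu - mu0))

assert abs(power_delta1(mu0) - 0.10) < 0.005   # size attained at the null
assert abs(power_delta2(mu0) - 0.10) < 0.005
# Each test dominates on its own side of mu0 and is useless on the other:
assert power_delta1(0.5) > 0.9 and power_delta2(0.5) < 0.01
assert power_delta2(-0.5) > 0.9 and power_delta1(-0.5) < 0.01
```

Since no single critical region can dominate on both sides simultaneously, a weaker criterion such as unbiasedness is needed for two-sided problems.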
The test δ₃ is unbiased for H₀: μ = μ₀ versus H₁: μ ≠ μ₀. Figure B.2 compares the power of this test with those defined in Example B.8, of the same level, and with that of a typical non-symmetric test δ₄, also of the same level, which has the critical region

R_{δ₄} = {x; x̄ − μ₀ < c₁ or x̄ − μ₀ > c₂}

for suitably chosen constants c₁ < c₂.

Figure B.2. Power of tests for the mean of a normal distribution.

It seems obvious that δ₄, which is more cautious about accepting values of μ larger than μ₀ than about accepting values of μ smaller than μ₀, should be preferred to the unbiased test whenever the consequences of the first class of errors are more serious. We are drawn again to the general comment that, in any decision procedure, prior information and utility preferences should be an integral part of the solution; unbiased procedures may only be reasonable in special circumstances.

Yet another approach to defining a "good" test when UMP tests do not exist is to focus attention on local power, by requiring the power function to be maximised in a neighbourhood of the null hypothesis. Under suitable regularity conditions, locally most powerful tests may be derived by using the sampling distribution of the efficient score function, in a process which is closely related to that described in our discussion of interval estimation. Clearly, however, the requirement of maximum local power says nothing about the behaviour of the test in a region of high power and, indeed, locally most powerful tests may be very inappropriate when the true value of θ is far from Θ₀.

Methodological Discussion

Testing hypotheses using the frequentist methodology described above may be misleading in many respects. In particular:

(i) It should be obvious that the all too frequent practice of simply quoting whether or not a null hypothesis is rejected at a specified significance level α₀ ignores a lot of relevant information. If such a test is to be performed, the statistician should at least report the cut-off point α such that H₀ would not be rejected for any level of significance smaller than α; this value is called the tail area, or p-value, corresponding to the observed value of the statistic. An added advantage of this approach is that there is no need to select beforehand an arbitrary significance level.

However, there is a tendency on the part of many users to interpret a p-value as implying that the probability that H₀ is true is smaller than the p-value. Not only is this false within the frequentist framework but, as noted in the case of confidence intervals, there is, in general, no simple form of reinterpretation which would have a Bayesian justification: p-values cannot generally be interpreted as posterior probabilities.
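The mismatch between p-values and posterior probabilities is easy to see from one frequentist property: under a simple null with a continuous test statistic, the p-value is uniformly distributed, so its numerical value is not a probability statement about H₀ itself. A toy sketch of ours (two-sided test of μ = 0 from a single N(μ, 1) observation):

```python
import math
import random

def Phi(z):
    # Standard normal cdf via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pvalues_under_null(reps=50_000, seed=5):
    rng = random.Random(seed)
    # Two-sided p-value for H0: mu = 0 from a single N(mu, 1) observation.
    return [2.0 * (1.0 - Phi(abs(rng.gauss(0.0, 1.0)))) for _ in range(reps)]

ps = pvalues_under_null()
mean_p = sum(ps) / len(ps)
small = sum(1 for p in ps if p < 0.05) / len(ps)
assert abs(mean_p - 0.5) < 0.01    # uniform on (0, 1) under H0 ...
assert abs(small - 0.05) < 0.005   # ... so Pr{p < 0.05 | H0} = 0.05,
# a statement about repeated sampling, not about Pr(H0 | observed data).
```

A p-value of 0.04 is exactly as probable under a true H₀ as a p-value of 0.96, which is why reading it as Pr(H₀ | x) has no justification.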
For detailed discussions, see, for example, Berger (1985a), Berger and Delampady (1987) and Berger and Sellke (1987); see Casella and Berger (1987) for an attempted reconciliation in the case of one-sided tests.

(ii) Another statistical "tradition" related to hypothesis testing consists of declaring an observed value statistically significant whenever the corresponding tail area is smaller than a "conventional" value such as 0.05 or 0.01, implying that there exists statistical evidence which is sufficient to reject the null hypothesis. However, since the classical theory of hypothesis testing does not make any use of a utility function, there is no way to assess formally whether or not the true value of the parameter θ, which may well be numerically different from a hypothetical value θ₀, is significantly different from θ₀ in the sense of implying any practical difference. For example, a vote proportion of 34% for a political party is technically different from a proportion of 34.001%, but under most plausible utility functions the difference has no political significance.

(iii) Finally, the mutual inconsistency of frequentist desiderata often makes it impossible, even in the theory's own terms, to identify the most appropriate procedure. For example, Durbin (1969) showed that if x is a random sample from a normal distribution whose precision is determined by a random integer m, then m is ancillary and hence, by the conditionality principle, tests on the mean or the precision should condition on the observed m; yet, unrestricted tests may be uniformly more powerful. See Chernoff (1951) and Stein (1951) for further arguments against standard hypothesis testing.

B.3.4 Significance Testing

In the previous section, we reviewed the problem of hypothesis testing where, given a family {p(x | θ), θ ∈ Θ}, a null hypothesis H₀: θ ∈ Θ₀ is tested against (at least) one specific alternative. In this section we shall review the problem of pure significance tests, where only the null hypothesis H₀ = {p(x | θ), θ ∈ Θ₀} is initially proposed, and it is desired to test whether or not the data x are compatible with this hypothesis, without considering specific alternatives. The null hypothesis may be either simple, if it completely specifies a density p(x | θ₀), or composite.

We recall from Section 6.2 that, within the Bayesian framework, the problem of significance testing, as formulated above, could be solved by embedding the hypothetical model in some larger class {p(x | θ), θ ∈ Θ}, designed either to contain actual alternatives of practical interest, or formal alternatives generated by selecting a mathematical neighbourhood of H₀.
p(1 I H O )should be the same for all 8 E ell. f would exceed the observed value t (x). given the data x. The result of the analysis is typically reported by stating the pvalue and declaring that Ho should be rejected for all significance levels which are smaller than 1'. of the kind which it is desired to test. Comparison with the Bayesian analogue summarised above prompts the following remarks. due to its special status corresponding to simplicity (Occam's razor). NonBuyesiun Theories describing. This (fully Bayesian) procedure could be described as that of selecting a test statistic t ( x ) which is expected to measure the discrepancy between HI. describing the additional utility obtained by retaining HO because of its special status. we proposed the logarithmic as a reasonable general discrepancy measure. if HO is composite.and the true model. Then. for each 8. the conditional utility difference. we showed that Ho should be rejected if. conditional on Ho. if t(z) is larger than some cutoff point E O ( Z ) describing the additional utility of keeping I i . a disguised Bayes factor seems to be the only proposal). (ii) The larger the value of t the stronger the evidence of the departure from H.476 B. While in the Bayesian analysis .. or whatever.apvulue or significance level is calculated as the probability. ( i ) The frequentist theory does not generally offer any guidance on the choice of an appropriate test statistic (the gcnerulised likelihood rurio resf. a test statistic t = t ( x ) is selected with two requirements in mind. if it were true. and rejecting H. scientific support (or fashion). that. in repeated samples.. and for any function E ~ I ( X ) . From a frequentist point of view. that p is given by so Small values of p are regarded as strong evidence that HO should be rejected. (i) The sampling distribution of f under the null hypothesis p ( t 1 H o ) must be known and. In particular. t ( ~> EO(x) ) where is the expecred posreriur discrepane?.
.e. Moreover. in general.2 how this may actually be . The absence of declared alternatives even precludes the use of the frequentist optimality criteria used in hypothesis testing. this should certainly f take into account the advantages o keeping Ho. (iii) If a measure of the strength of evidence against Ho is all that is required. tl(X) tdx) Flgure B 3 Esualising the compatibility o t ( x ) with H f o (iv) If a decision on whether or not to reject Ho has to be made. the compatibility of t ( s )with HI. it is required that p ( t I #) be the same for all 6 in 00. We described in Section 6.B. often. this may be very difficult. (ii) Even if a function t = t ( z ) is found which may be regarded as a sensible discrepancy measure. is simply not the case. t l ( s ) may readily be accepted as compatible with Ho while t 2 ( s ) may not.3 Stylised Inference Problems 477 t(s)is naturally and constructively derived as an expected measure of discrepancy. the frequentist statistician needs to determine the unconditional sampling distribution o f t under Ho. i. more relevant answer than quoting the realised pvalue. in Figure B. rely on intuition to select t.Ho) under the null hypothesis seems a more reasonable. the frequentist statistician must. and actually impossible when there are nuisance parameters. the position of the observed value of t with respect to its posterior predictive distribution p ( t Is.3. may be described by quoting the HPD intervals to which it belongs. in the more interesting situation of composite null hypotheses. Indeed. Thus. which. or may be measured with any proper scoring rule such as A logp(t(z) Is. defining the cutoff point in terms of utility.HI)) +B.
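The frequentist recipe just described can be made concrete with a small numerical sketch (an illustration added here, not part of the original text; the model, the test statistic and all numbers are assumptions). For a random sample x1, ..., xn from N(μ, 1) and H0: μ = 0, the statistic t(x) = √n · x̄ has a known N(0, 1) distribution under H0, so a two-sided p-value is available in closed form.

```python
# Sketch of a pure significance test for H0: mu = 0 with x_1..x_n ~ N(mu, 1),
# using t(x) = sqrt(n) * xbar, which is N(0, 1) under H0.
import math
import random

def phi(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(x):
    n = len(x)
    t = math.sqrt(n) * sum(x) / n
    return 2.0 * (1.0 - phi(abs(t)))   # Pr(|T| > |t(x)| under H0)

random.seed(1)
x0 = [random.gauss(0.0, 1.0) for _ in range(50)]   # data generated under H0
x1 = [random.gauss(0.7, 1.0) for _ in range(50)]   # data generated far from H0

# Small p-values are read as strong evidence against H0.
print(p_value(x0), p_value(x1))
```

Note that the recipe works here only because the sampling distribution of t under H0 is completely known; as remark (ii) above observes, with composite nulls and nuisance parameters this is often not achievable.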
We should finally point out that most of the criticisms already made about hypothesis testing are equally applicable to significance testing. Similarly, criticisms made of confidence intervals typically apply to significance testing since, of course, confidence intervals can generally be thought of as consisting of those null values which are not rejected under a significance test.

B.4 COMPARATIVE ISSUES

B.4.1 Conditional and Unconditional Inference

At numerous different points of this Appendix we have emphasised the following essential difference between Bayesian and frequentist statistics: Bayesian statistics directly produces statements about the uncertainty of unknown quantities, either parameters or future observations, conditional on known data; frequentist statistics produces probability statements about hypothetical repetitions of the data, conditional on the unknown parameters, and then seeks (indirectly) ways of making these relevant to inferences about the unknown parameters given the observed data.

Frequentist procedures are thus designed in terms of their expected behaviour over the sample space: for each value of the unknown parameters, they typically have average characteristics which describe their long-run sampling properties under hypothetical repetitions. Within the frequentist approach, one must therefore rest on the rather dubious "transferred" properties of the long-run behaviour of the procedure, with no logical possibility of assessing the relevance of these properties to the data actually observed. Not only may one dispute the existence of the conceptual "collective" where these hypothetical repetitions might take place, and which one would need to specify, but the relevance of aggregate, long-run properties for specific inference problems seems, at best, only tangential. Indeed, p-values or confidence intervals are largely irrelevant once the sample has been observed, since they are concerned with events which might have occurred, but have not.

It is useful to distinguish between two very different concepts, introduced by Savage (1962): initial precision and final precision. Consider, for example, the intervals of the form x̄ ± 1.96/√n which might be constructed by repeated sampling from a normal distribution with known unit precision, the constant 1.96 being chosen to guarantee a specified significance level. The "precision" we may initially expect, before the data are collected, is described by the fact that, in the long run, the true mean μ will be included in 95% of such intervals. A far more pertinent question, however, is the following: given the observed x̄, how close is the unknown μ to the observed x̄? This is a question of final precision, which derives from the observed sample (and the sampling process typically will not be repeated in any actual practice). The problem at the very heart of the frequentist approach to statistics is that of connecting aggregate, long-run sampling properties under hypothetical repetitions, which refer to initial precision, with the final precision demanded by a specific inference problem; these are inferences of a totally different type. Indeed, within the frequentist approach, a hypothesis which may be true may be rejected because it has not predicted observable results which have not occurred, to quote Jeffreys (1939/1961, p. 385). This seems a remarkable procedure.

Example B.1. (Initial and final precision). The following example, taken from Welch (1939), further illustrates the difference between initial and final precision. Let x = {x1, ..., xn} be a random sample from a uniform distribution over ]μ - 1/2, μ + 1/2[. It is easily verified that the midrange μ̂ = (x(1) + x(n))/2 is a very efficient estimator, with a sampling variance of the order of 1/n², rather than the usual 1/n, so that, on average, we may expect very precise estimates of μ from large samples. Suppose, however, that we obtain a specific large sample with a small range; this is, admittedly, unlikely, but nevertheless possible. Given the sample, we can only really claim that μ ∈ ]x(n) - 1/2, x(1) + 1/2[ and, using a uniform (reference) prior for μ, the reference posterior distribution is uniform on that interval. Thus, if the actual data turn out this way, no matter how efficient the estimator μ̂ was expected to be, the final precision of our inferences about μ is bound to be rather poor.

The need for conditioning on observed data can be partially met in frequentist procedures by conditioning on an ancillary statistic. Indeed, as pointed out in our discussion of the conditionality principle, we saw in Examples B.2 and B.3 that it is easy to construct examples where totally unconditional procedures produce ludicrous results. However, there remain many problems with conditioning on ancillary statistics: they are not necessarily unique; they are not easily identifiable; conditioning on an ancillary statistic can yield a totally uninformative sampling distribution; and it can conflict with other frequentist desiderata, such as the search for maximum power in hypothesis testing. See, for example, Basu (1964, 1992), Berger (1984b) and Cox and Reid (1987).

B.4.2 Nuisance Parameters and Marginalisation

Most realistic probability models make the sampling distribution dependent not only on the unknown quantity of primary interest, but also on some other parameters, often referred to as the vector of nuisance parameters. Thus, the full parameter vector θ can typically be partitioned into θ = {φ, λ}, where φ is the subvector of interest and λ is the complementary subvector of θ. We recall from Section 5.1 that, within a Bayesian framework, the presence of nuisance parameters does not pose any formal, theoretical problems, since the required result, namely the (marginal) posterior distribution of the parameter of interest, can simply be written as

p(φ | x) = ∫ p(φ, λ | x) dλ.
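The marginalisation formula above is easily carried out numerically. The following sketch (an illustration added here, not part of the original text; the normal model, the flat grid prior and all numbers are assumptions) approximates p(μ | x) for normal data with mean μ of interest and standard deviation σ as nuisance parameter, by summing the joint posterior over a grid of σ values.

```python
# Grid approximation to p(mu | x) = Int p(mu, sigma | x) d sigma for
# x_1..x_n ~ N(mu, sigma^2), with a flat (illustrative) prior on the grid.
import math
import random

def log_joint(mu, sigma, x):
    # log p(x | mu, sigma) plus a flat log-prior, up to an additive constant
    ss = sum((xi - mu) ** 2 for xi in x)
    return -len(x) * math.log(sigma) - ss / (2.0 * sigma ** 2)

random.seed(7)
x = [random.gauss(0.5, 1.0) for _ in range(25)]

mus = [-1.0 + 0.02 * k for k in range(151)]      # grid for mu on [-1, 2]
sigmas = [0.4 + 0.05 * k for k in range(60)]     # grid for sigma on [0.4, 3.35]

# joint posterior on the grid, then marginalise sigma by summation
m = max(log_joint(mu, s, x) for mu in mus for s in sigmas)
marg = [sum(math.exp(log_joint(mu, s, x) - m) for s in sigmas) for mu in mus]
total = sum(marg)
post = [w / total for w in marg]                 # normalised p(mu | x) on the grid

mu_mode = mus[post.index(max(post))]
print(mu_mode)
```

The resulting marginal mode sits close to the sample mean, as one would expect for this model; the point of the sketch is only that, once the joint posterior is available, eliminating λ is mere summation or integration.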
Here, the full posterior p(φ, λ | x) ∝ p(x | φ, λ) p(φ, λ) is directly obtained from Bayes' theorem, and the marginalisation itself is a simple exercise in probability calculus.

The situation is very different from a frequentist point of view. Indeed, the problem posed by the presence of nuisance parameters is only satisfactorily solved within a pure frequentist framework in those few cases where the optimality criterion used leads to a procedure which depends on a statistic whose sampling distribution does not depend on the nuisance parameters. Frequentist inferences about the mean of a normal distribution with unknown variance, based on the Student-t statistic, whose sampling distribution does not involve the variance, provide the best known example.

In general, therefore, frequentists are forced to use approximations, typically based on asymptotic theory, when "exact" solutions are not available. Indeed, some statisticians see this as the main motivation for developing asymptotic results: it is "essential to have widely applicable procedures that in some sense provide good approximations when 'exact' solutions are not available ..., the central idea being that when the number n of observations is large and errors of estimation correspondingly small, ... simplifications become available that are not available in general" (Cox and Hinkley, 1974, p. 279, our italics).

A serious difficulty, however, is that these asymptotic techniques for problems with nuisance parameters are of fairly restricted applicability. Moreover, even the domain of "fairly restricted applicability" resulting from reliance on asymptotic methods can be problematic. Neyman and Scott (1948) illustrated such problems by considering models with many nuisance parameters, of the type

p(x | θ, λ1, ..., λn) = ∏_{i=1}^{n} p(xi | θ, λi),

where a new nuisance parameter λi is introduced with each observation. Note that such models are not unrealistic: for example, xi may be a physiological measurement on individual i, which may have a normal distribution with mean λi and common variance σ², the latter being the parameter of interest. Kiefer and Wolfowitz (1956) and Cox (1975) proposed solutions for this type of problem based on treating the λi's as independent observations from some distribution. From a Bayesian viewpoint, this, of course, then becomes a case of hierarchical modelling, as discussed in Section 4.6.

The only general alternative strategy which has been proposed to avoid resorting to asymptotics when exact methods are not available is to use a modified form of likelihood, where the dependence on the nuisance parameters has been reduced or eliminated. An estimated likelihood is obtained by replacing the nuisance parameters by (for example) their maximum likelihood estimates. This procedure does not take account of the uncertainty due to the lack of knowledge about the nuisance parameters, and may therefore be misleading both in the precision and in the location associated with inferences about the parameters of interest.
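The contrast between an estimated likelihood and the profile likelihood discussed in what follows can be sketched for normal data with mean μ of interest and variance σ² as nuisance parameter (an illustration added here, not part of the original text; the model and numbers are assumptions): the estimated likelihood fixes σ² once and for all at its global mle, while the profile likelihood re-maximises over σ² at each fixed value of μ.

```python
# Estimated versus profile log-likelihood for the mean of N(mu, sigma2) data.
import math
import random

def loglik(mu, sigma2, x):
    # normal log-likelihood, up to an additive constant
    ss = sum((xi - mu) ** 2 for xi in x)
    return -0.5 * len(x) * math.log(sigma2) - ss / (2.0 * sigma2)

def estimated_loglik(mu, x):
    # nuisance sigma2 frozen at its global maximum likelihood estimate
    xbar = sum(x) / len(x)
    s2_hat = sum((xi - xbar) ** 2 for xi in x) / len(x)
    return loglik(mu, s2_hat, x)

def profile_loglik(mu, x):
    # nuisance sigma2 re-maximised at each fixed mu: s2(mu) = mean (x_i - mu)^2
    s2_mu = sum((xi - mu) ** 2 for xi in x) / len(x)
    return loglik(mu, s2_mu, x)

random.seed(3)
x = [random.gauss(2.0, 1.5) for _ in range(100)]
xbar = sum(x) / len(x)

# Both versions are maximised at the sample mean, but they penalise values
# of mu away from the maximum differently.
print(estimated_loglik(xbar, x), profile_loglik(xbar, x))
print(estimated_loglik(xbar + 1.0, x), profile_loglik(xbar + 1.0, x))
```

By construction the profile value dominates the estimated value at every μ, since it maximises over the nuisance parameter rather than freezing it; neither version, of course, carries any account of the uncertainty about σ².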
For example, in linear regression with many regressors, substitution of the regression coefficients by their mle's leads to an estimate of the residual variance which is misleadingly precise. Profile likelihood provides a much more refined version of this approach, which often gives answers which closely correspond to Bayesian marginalisation results; the Fieller-Creasy problem, concerning the ratio of normal means, provides a typical example (see Bernardo, 1977). For further discussion and extensive references see, for example, Barndorff-Nielsen (1983, 1991), Cox and Reid (1987), Cox (1988) and Fraser and Reid (1989).

Marginal and conditional likelihoods are based on breaking the likelihood function into two factors, either using invariance arguments or conditioning on sufficient statistics for the nuisance parameters: one factor provides a likelihood function for the parameter of interest, while the other is assumed "to contain no information about the parameter of interest in the absence of knowledge about the nuisance parameters". Key references to this approach are Kalbfleisch and Sprott (1970, 1973) and Andersen (1970, 1973). Another suggestion, closely related to fiducial inference, is the implied likelihood (Efron, 1993).

There are, however, two main problems with this type of approach. (i) They are not general, and can only be applied under rather specific circumstances. (ii) They critically depend on the highly controversial notion of a "function not containing relevant information in the absence of knowledge about the nuisance parameters", for which no operational definition has ever been provided, nor does a consensus seem to exist about this vague information condition.

In the cases where the techniques can be applied, the resulting forms tend to coincide, as one might expect, with the integrated likelihood, obtained by integrating the likelihood with respect to the conditional reference prior distribution π(λ | φ) of the nuisance parameters given the parameter of interest. In both cases, Liseo (1993) shows that reference posterior credible regions have better frequentist coverage properties than those obtained from likelihood methods. For a Bayesian overview of methods for treating nuisance parameters, see Basu (1977), Dawid (1980a), Willing (1988) and Albert (1989).
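The tendency of these forms to coincide can be illustrated numerically (a sketch added here, not part of the original text; the flat grid weights below merely stand in for the conditional reference prior π(λ | φ), and the model and numbers are assumptions): for normal data with mean μ of interest and standard deviation σ as nuisance parameter, the integrated and profile log-likelihoods of μ are maximised at essentially the same point.

```python
# Integrated (grid-averaged) versus profile log-likelihood for a normal mean.
import math
import random

def loglik(mu, sigma, x):
    ss = sum((xi - mu) ** 2 for xi in x)
    return -len(x) * math.log(sigma) - ss / (2.0 * sigma ** 2)

def profile(mu, x):
    # sigma re-maximised at each mu
    s_hat = math.sqrt(sum((xi - mu) ** 2 for xi in x) / len(x))
    return loglik(mu, s_hat, x)

def integrated(mu, x, sigmas):
    # log of a flat-weighted average of the likelihood over a sigma grid,
    # computed stably via log-sum-exp
    m = max(loglik(mu, s, x) for s in sigmas)
    return m + math.log(sum(math.exp(loglik(mu, s, x) - m) for s in sigmas))

random.seed(11)
x = [random.gauss(1.0, 2.0) for _ in range(40)]
sigmas = [0.5 + 0.05 * k for k in range(120)]    # nuisance grid on [0.5, 6.45]
mus = [-1.0 + 0.05 * k for k in range(81)]       # grid for mu on [-1, 3]

mu_prof = max(mus, key=lambda m_: profile(m_, x))
mu_int = max(mus, key=lambda m_: integrated(m_, x, sigmas))
print(mu_prof, mu_int)
```

In this symmetric example both curves peak at (essentially) the sample mean; the approaches differ more sharply in less regular problems, which is precisely where the choice of π(λ | φ) matters.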
B.4.3 Approaches to Prediction

The general problem of statistical prediction may be described as that of inferring the values of unknown, as yet unobserved, data y from currently available data x. From a Bayesian point of view, with its operationalist concern with modelling uncertainty in terms of observables, any valid coherent inferential statement about y given x is contained in the posterior predictive distribution p(y | x), and so no special theory has to be developed. The inferential content of the predictive distribution may be appropriately summarised by location or spread measures, such as the mean and the mode of p(y | x), respectively providing "estimators" of y, or by "interval estimates" of y, such as the class of HPD intervals which may be derived from p(y | x). Moreover, we recall from Chapter 5 that, if one is faced with a decision problem whose utility function u(a, y) involves a future observable y, then p(y | x) becomes the necessary ingredient in determining the optimal action a*, namely that which maximises the appropriate (posterior) expected utility

ū(a | x) = ∫ u(a, y) p(y | x) dy.

From this point of view, Bayes' theorem, in its central role as a coherent learning process about parameters, is just a convenient step in the process of passing from the sampling model ∏_{i=1}^{n} p(xi | θ) to the posterior predictive p(y | x), by means of p(θ | x) ∝ p(x | θ) p(θ).

The range of potential applications of these ideas is extensive.

(i) Density estimation. From data x, usually a random sample {x1, ..., xn}, inference statements are desired about the sampling distribution itself (the original problem considered by Bayes, 1763, in a binomial setting). The "action space" consists of the class of sampling distributions and we recall that, for squared error loss, the optimal estimator of the sampling distribution is the posterior expectation of the sampling distribution, which is precisely the posterior predictive distribution.

(ii) Calibration. Two observations, (x1i, x2i), are made on each of a set of n individuals using two different measuring procedures, and it is desired to estimate the measurement y2 that the second procedure would yield on a new individual, given that the measurement using the first procedure has turned out to be y1. The solution is a simple exercise in probability calculus, leading to the required posterior predictive density p(y2 | y1, (x11, x21), ..., (x1n, x2n)).
(iii) Classification. This is a particular case of the problem of calibration, in which the x2i's (and y2) can only take on a discrete, usually finite, set of values.

(iv) Regulation. In contexts analogous to (ii) and (iii), it is desired to select and fix a value of y1 so that y2 is as close as possible to a prescribed value. The solution is obtained by minimising the expectation of an appropriate loss function with respect to the predictive distribution p(y2 | y1, x). The particular case of optimisation obtains when it is desired to make y2 as large (or small) as possible.

(v) Model comparison. In a setting with alternative models, the latter may be compared in terms of their predictive posterior probabilities (cf. Chapter 6).

(vi) Model criticism. The compatibility of a given model with observed data may be assessed by comparing the realised value of a test statistic with its predictive distribution under that model (cf. Section 6.2).

We should emphasise again that the all too often adopted naive solution to prediction, based on the "plug-in estimate" form p(y | θ̂), effectively replacing the posterior distribution by a degenerate distribution assigning probability one to an estimator of θ, usually the maximum likelihood estimate, is bound to give misleadingly overprecise inference statements about y, since it effectively ignores the uncertainty about θ. The point is illustrated in detail by Aitchison and Dunsmore (1975).

It is important to recall here (see Section 5.1) that, by virtue of the representation theorems, parameters are limiting forms of observables and, hence, inference about parameters may be seen as a limiting form of predictive inference about observables. This point, stressed by de Finetti (1970/1974, 1970/1975), has considerable theoretical importance. Although in practice it is usually convenient to work via parametric models, some authors, such as Cifarelli and Regazzini (1982), have continued this tradition by trying to develop a completely predictive approach which bypasses entirely the use of parametric models.

The books by Aitchison and Dunsmore (1975) and Geisser (1993) contain a wealth of detailed discussion of prediction problems. For further details of the systematic use of predictive ideas, including those involving decision making, the reader is referred, for instance, to Roberts (1963), Geisser (1966), Dunsmore (1966, 1968, 1969), Gelfand and Desu (1968) and Amaral-Turkman and Dunsmore (1985). Applications of predictive ideas to classification, calibration, regulation, optimisation and smoothing are found in, among others, Bernardo (1988), Racine-Poon (1988), Klein and Press (1992), Lavine and West (1992) and Zidek and Weerahandi (1992). See, also, Geisser (1974, 1980b, 1988) and Zellner (1986b).
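The overprecision of the plug-in approach is easily quantified in the simplest normal case (an illustrative sketch added here, not part of the original text; the known variance and flat prior are assumptions): with x1, ..., xn ~ N(μ, σ²), σ known, and a uniform (reference) prior for μ, the posterior predictive for a future y is N(x̄, σ²(1 + 1/n)), while the plug-in rule uses N(x̄, σ²) and so understates the predictive standard deviation by the factor √(1 + 1/n).

```python
# Plug-in versus full posterior predictive spread for a future observation,
# normal model with known sigma and a flat prior on the mean.
import math

def predictive_sd(sigma, n):
    # posterior predictive: var = sigma^2 (sampling) + sigma^2 / n (posterior)
    return sigma * math.sqrt(1.0 + 1.0 / n)

def plugin_sd(sigma, n):
    # plug-in rule: the uncertainty about mu is simply ignored
    return sigma

sigma, n = 1.0, 5
print(plugin_sd(sigma, n), predictive_sd(sigma, n))
```

With n = 5 the plug-in intervals are about 10% too short; the discrepancy vanishes as n grows, which is why the naive solution often passes unnoticed in large samples.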
By comparison, the possibilities for frequentist-based prediction are fairly limited. They are essentially restricted to producing tolerance regions, R(x); that is, regions which, in the long run, will contain a proportion 1 - α of future observations. These are designed to guarantee that, for all parameter values θ ∈ Θ, a proportion p of possible samples x would produce regions R(x) such that Pr[y ∈ R(x) | θ] = 1 - α. If this sounds obscure, we can but agree! Two further comments are in order.

(i) Descriptions of the frequentist approach to prediction are given in Cox (1975), Mathiasen (1979) and Barndorff-Nielsen (1980).

(ii) The difficulties of "transferring" the long-run aggregate properties of confidence intervals into inference statements conditional on the observed data are even more acute in a tolerance region setting, particularly in comparison with the simple idea of an HPD region derived from the predictive distribution p(y | x). Moreover, in order to construct a tolerance region it is essential to find a function of y and x with a sampling distribution which does not involve θ, something which is typically only possible in very simple stylised problems. The method is, therefore, not always applicable and, even when it is, it may be very difficult to implement. Guttman (1970) provides a comparison between frequentist tolerance regions and HPD regions from predictive distributions.

Kalbfleisch (1971) was one of the first to examine likelihood methods for prediction. Essentially, he proposed computing a predictive distribution of the form

p(y | t) = ∫ p(y | θ) f(θ | t) dθ,

with t denoting a sufficient statistic for θ, whenever a fiducial distribution f(θ | t) for θ can be derived from the sampling distribution of t. Of course, the method may lead to inconsistent results when the fiducial distribution is not a Bayes' posterior. In the discussion which follows Kalbfleisch's paper, Lindley points out that if, for a simple one-parameter model, the method is applied both to obtain p(x_{n+1} | x1, ..., xn) directly and to obtain it from the joint predictive p(x_{n+1}, x_{n+2} | x1, ..., xn), one obtains different answers. This is an interesting example of the fact that fiducial distributions do not necessarily have basic coherence properties unless they are equivalent to Bayesian posterior distributions.
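The pivotal construction just described can be sketched in the simplest case (an illustration added here, not part of the original text; the model and numbers are assumptions): for x1, ..., xn and a future y, all N(μ, 1), the function y - x̄ has a N(0, 1 + 1/n) distribution which does not involve μ, so the interval x̄ ± 1.96 √(1 + 1/n) is a frequentist prediction interval with long-run coverage of about 95%; in this particular case it coincides with the HPD interval from the posterior predictive under a uniform prior.

```python
# Monte Carlo check of the long-run coverage of the pivot-based interval
# xbar +/- 1.96 * sqrt(1 + 1/n) for a future observation y, all N(mu, 1).
import math
import random

def covers(mu, n, z=1.96):
    x = [random.gauss(mu, 1.0) for _ in range(n)]
    y = random.gauss(mu, 1.0)          # the future observation to be predicted
    xbar = sum(x) / n
    half = z * math.sqrt(1.0 + 1.0 / n)
    return abs(y - xbar) <= half

random.seed(5)
n_rep = 4000
coverage = sum(covers(mu=10.0, n=8) for _ in range(n_rep)) / n_rep
print(coverage)
```

The estimated coverage is close to 0.95 whatever the value of μ, precisely because the pivot's distribution is free of θ; in problems where no such pivot exists, this construction is simply unavailable.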
A more radical approach to prediction is set out in Dawid (1984), who sets out a theory of prequential analysis in which the basic ingredients are, instead, two sequences: one a string of observations, the other a string of probability forecasts. Links with stochastic complexity (Solomonoff, 1978; Rissanen, 1987; Wallace and Freeman, 1987) are reviewed in Dawid (1992), and theoretical developments requiring an extension of the standard Kolmogorov (1933/1950) framework for probability are pursued in Vovk (1993a). A further overview is provided by Geisser (1993). Seminal contributions to predictive likelihood ideas, some sufficiency-based, others relating to profile likelihood ideas, include those of Hinkley (1979) and Lejeune and Faulkenberry (1982).

We recall from Section 5.3 that, under suitable regularity conditions, when the parameter of interest is continuous, the posterior distribution of θ converges, as the sample size increases, to a normal distribution N(θ | θ̂n, H(θ̂n)), centred at the maximum likelihood estimate θ̂n, with precision matrix H(θ̂n) given by the observed information. The most frequently used asymptotic results in frequentist statistics concern the large-sample behaviour of the maximum likelihood estimate θ̂n itself; in frequentist statistics, asymptotic approximation is often the only way to obtain analytic results.