Second Edition
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data
Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.
QA276.B47 2001
519.5-dc21 00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
©2001, 1977 by Prentice-Hall, Inc. Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION
1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References
2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References
3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References
4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution: Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References
5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References
6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's $\chi^2$ Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References
A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References
B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The $\chi^2$, F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: $O_P$ and $o_P$ Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References
C TABLES
  Table I The Standard Normal Distribution
  Table I' Auxiliary Table of the Standard Normal Distribution
  Table II t Distribution Critical Values
  Table III $\chi^2$ Distribution Critical Values
  Table IV F Distribution Critical Values
INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of $10^{12}$ characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.
The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data.
As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with "infinite" number of parameters.
(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.
(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.
(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time. Others do not and some though theoretically attractive cannot be implemented in a human lifetime.
(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.
There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.
As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.
Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions.
In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed.
Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on $R^d$ as well as an elementary introduction to Hilbert space theory. As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.
Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field.
Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. There is more material on Bayesian models and analysis. Save for these changes of emphasis the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B.
Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis including proofs of convergence of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.
Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.
Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem.
Finally, Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. Robustness from an asymptotic theory point of view appears also. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.
As in the first edition problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.
Chapters 1-4 develop the basic principles and examples of statistics. Nevertheless, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow.
5.4.2 → 5.4.3 → 6.2 → 6.3 → 6.4 → 6.5
                 ↓
                6.6
Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and, in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method.
A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo.
With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.
For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support.
Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976 but has, with the field, taken on a new life.
Peter J. Bickel
bickel@stat.berkeley.edu
Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION
This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.
We feel such an introduction should at least do the following:
(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.
(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffé theorem, the information inequality, and the Gauss-Markoff theorem.
(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.
(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.
Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.
Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.
Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start; or it may be included at the end of an introductory probability course that precedes the statistics course.
A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.
Conventions:
(i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.
(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.
(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.
We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.
We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.
Peter J. Bickel
Kjell Doksum
Berkeley
1976
Chapter 1
STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
1.1 DATA, MODELS, PARAMETERS AND STATISTICS
1.1.1 Data and Models
Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor.
Data can consist of:
(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.
(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals; see Example 1.1.4 and Sections 2.2.1 and 6.1.
(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.
(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.
The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.
A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation. A priori, in the words of George Box (1979), "Models of course, are never true but fortunately it is only necessary that they be useful."
In this book we will study how, starting with tentative models:
(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.
(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.
(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.
(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2, Chapter 1.
(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.
Here are some examples:
(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number $N\theta$ of these elements are defective. It is too expensive to examine all of the items. So to get information about $\theta$, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.
(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population. The population is so large that, for modeling purposes, we approximate the actual process of sampling without replacement by sampling with replacement.
(c) An experimenter makes n independent determinations of the value of a physical constant $\mu$. His or her measurements are subject to random fluctuations (error) and the data can be thought of as $\mu$ plus some random errors.
(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.
We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:
Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers $0, 1, \ldots, n$ corresponding to the number of defective items found. On this space we can define a random variable X given by $X(k) = k$, $k = 0, 1, \ldots, n$. If $N\theta$ is the number of defective items in the population sampled, then by (A.13.6)
$$P[X = k] = \frac{\binom{N\theta}{k}\binom{N - N\theta}{n - k}}{\binom{N}{n}} \tag{1.1.1}$$
if $\max(n - N(1 - \theta), 0) \le k \le \min(N\theta, n)$. Thus, X has an hypergeometric, $H(N\theta, N, n)$ distribution.
The main difference that our model exhibits from the usual probability model is that $N\theta$ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family $\{H(N\theta, N, n)\}$ of probability distributions for X, any one of which could have generated the data actually observed. □
Example 1.1.2. Sample from a Population. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which $N = \infty$, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe $x_1, \ldots, x_n$, which are modeled as realizations of $X_1, \ldots, X_n$ independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such $X_1, \ldots, X_n$ as a random sample from F, and also write that $X_1, \ldots, X_n$ are i.i.d. as X with $X \sim F$, where "$\sim$" stands for "is distributed as." The model is fully described by the set $\mathcal{F}$ of distributions that we specify. The same model also arises naturally in situation (c). Here we can write the n determinations of $\mu$ as
$$X_i = \mu + \epsilon_i, \quad 1 \le i \le n \tag{1.1.2}$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^T$ is the vector of random errors. What should we assume about the distribution of $\epsilon$, which together with $\mu$ completely specifies the joint distribution of $X_1, \ldots, X_n$? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate
(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, $\epsilon_1, \ldots, \epsilon_n$ are independent.
(2) The distribution of the error at one determination is the same as that at another. Thus, $\epsilon_1, \ldots, \epsilon_n$ are identically distributed.

(3) The distribution of $\epsilon$ is independent of $\mu$.

Equivalently $X_1, \ldots, X_n$ are a random sample and, if we let $G$ be the distribution function of $\epsilon_1$ and $F$ that of $X_1$, then

$$F(x) = G(x - \mu) \tag{1.1.3}$$

and the model is alternatively specified by $\mathcal{F}$, the set of $F$'s we postulate, or by $\{(\mu, G) : \mu \in R, G \in \mathcal{G}\}$ where $\mathcal{G}$ is the set of all allowable error distributions that we postulate. Commonly considered $\mathcal{G}$'s are all distributions with center of symmetry 0, or alternatively all distributions with expectation 0. The classical default model is:

(4) The common distribution of the errors is $N(0, \sigma^2)$, where $\sigma^2$ is unknown. That is, the $X_i$ are a sample from a $N(\mu, \sigma^2)$ population or equivalently $\mathcal{F} = \{\Phi\left(\frac{\cdot - \mu}{\sigma}\right) : \mu \in R, \sigma > 0\}$ where $\Phi$ is the standard normal distribution. □

This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for instance, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be $\mu$ and $\sigma$, will have none of this.

Now consider situation (d).

Example 1.1.3. Two-Sample Models. Let $x_1, \ldots, x_m$; $y_1, \ldots, y_n$, respectively, be the responses of $m$ subjects having a given disease given drug A and $n$ other similarly diseased subjects given drug B. By convention, if drug A is a standard or placebo, we refer to the $x$'s as control observations. A placebo is a substance such as water that is expected to have no effect on the disease and is used to correct for the well-documented placebo effect, that is, patients improve even if they only think they are being treated. We let the $y$'s denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. We call the $y$'s treatment observations.

Natural initial assumptions here are:

(1) The $x$'s and $y$'s are realizations of $X_1, \ldots, X_m$ a sample from $F$, and $Y_1, \ldots, Y_n$ a sample from $G$, so that the model is specified by the set of possible $(F, G)$ pairs.

To specify this set more closely the critical constant treatment effect assumption is often made.

(2) Suppose that if treatment A had been administered to a subject response $x$ would have been obtained. Then if treatment B had been administered to the same subject instead of treatment A, response $y = x + \Delta$ would be obtained where $\Delta$ does not depend on $x$. This implies that if $F$ is the distribution of a control, then $G(\cdot) = F(\cdot - \Delta)$. We call this the shift model with parameter $\Delta$.

Often the final simplification is made.

(3) The control responses are normally distributed. Then if $F$ is the $N(\mu, \sigma^2)$ distribution and $G$ is the $N(\mu + \Delta, \sigma^2)$ distribution, we have specified the Gaussian two sample model with equal variances. □
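The constant treatment effect assumption in Example 1.1.3 can be checked on a toy data set. The sketch below is only an illustration: the control responses and the value of the shift are made-up numbers, and the helper names `ecdf` and `shift` are hypothetical. It applies $y = x + \Delta$ to each control response and confirms that the fraction of treated responses at or below $t$ equals the fraction of control responses at or below $t - \Delta$, which is the sample analogue of $G(\cdot) = F(\cdot - \Delta)$.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the function t -> fraction of sample values that are <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def shift(controls, delta):
    """Constant treatment effect: a control response x becomes y = x + delta."""
    return [x + delta for x in controls]


# Hypothetical control responses and treatment effect (not from the text).
controls = [2.0, 3.5, 1.0, 4.0]
delta = 1.5
treated = shift(controls, delta)

F_hat = ecdf(controls)
G_hat = ecdf(treated)

# Sample analogue of G(t) = F(t - delta) on a grid of t values.
for t in [k / 2 for k in range(15)]:
    print(t, G_hat(t), F_hat(t - delta))
```

In the model itself the two samples come from different subjects, so the identity holds for the population distributions $F$ and $G$ rather than exactly for the two observed samples; the sketch shifts the same values only to make the relation visible.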
How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. The advantage of piling on assumptions such as (1)-(4) of Example 1.1.2 is that, if they are true, we know how to combine our measurements to estimate $\mu$ in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.4.1). The danger is that, if they are false, our analyses, though correct for the model written down, may be quite irrelevant to the experiment that was actually performed. As our examples suggest, there is tremendous variation in the degree of knowledge and control we have concerning experiments.

In some applications we often have a tested theoretical model and the danger is small. The number of defectives in the first example clearly has a hypergeometric distribution; the number of $\alpha$ particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed.

In others, we can be reasonably secure about some aspects, but not others. For instance, in Example 1.1.2, we can ensure independence and identical distribution of the observations by using different, equally trained observers with no knowledge of each other's findings. However, we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. This will be done in Sections 3.5.3 and 6.6.

Experiments in medicine and the social sciences often pose particular difficulties. For instance, in comparative experiments such as those of Example 1.1.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. In this situation (and generally) it is important to randomize. That is, we use a random number table or other random mechanism so that the $m$ patients administered drug A are a sample without replacement from the set of $m + n$ available patients. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. All the severely ill patients might, for instance, have been assigned to B. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. Fortunately, the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1.3 when $F$, $G$ are assumed arbitrary. Statistical methods for models of this kind are given in Volume 2.

Using our first three examples for illustrative purposes, we now define the elements of a statistical model. A review of necessary concepts and notation from probability theory is given in the appendices.

We are given a random experiment with sample space $\Omega$. On this sample space we have defined a random vector $X = (X_1, \ldots, X_n)$. When $\omega$ is the outcome of the experiment, $X(\omega)$ is referred to as the observations or data. It is often convenient to identify the random vector $X$ with its realization, the data $X(\omega)$. Since it is only $X$ that we observe, we need only consider its probability distribution. This distribution is assumed to be a member of a family $\mathcal{P}$ of probability distributions on $R^n$. $\mathcal{P}$ is referred to as the model. For instance, in Example 1.1.1, we observe $X$ and the family $\mathcal{P}$ is that of all hypergeometric distributions with sample size $n$ and population size $N$. In Example 1.1.2, if (1)-(4) hold, $\mathcal{P}$ is the
family of all distributions according to which $X_1, \ldots, X_n$ are independent and identically distributed with a common $N(\mu, \sigma^2)$ distribution.

1.1.2 Parametrizations and Parameters

To describe $\mathcal{P}$ we use a parametrization, that is, a map, $\theta \to P_\theta$ from a space of labels, the parameter space $\Theta$, to $\mathcal{P}$; or equivalently write $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Thus, in Example 1.1.1 we take $\theta$ to be the fraction of defectives in the shipment, $\Theta = \{0, \frac{1}{N}, \ldots, 1\}$ and $P_\theta$ the $\mathcal{H}(N\theta, N, n)$ distribution. In Example 1.1.2 with assumptions (1)-(4) we have $\Theta = R \times R^+$ and, if $\theta = (\mu, \sigma^2)$, $P_\theta$ the distribution on $R^n$ with density $\prod_{i=1}^n \frac{1}{\sigma}\varphi\left(\frac{x_i - \mu}{\sigma}\right)$ where $\varphi$ is the standard normal density. If, still in this example, we know we are measuring a positive quantity in this model, we have $\Theta = R^+ \times R^+$. If, on the other hand, we only wish to make assumptions (1)-(3) with $\epsilon$ having expectation 0, we can take $\Theta = \{(\mu, G) : \mu \in R, G \text{ with density } g \text{ such that } \int x g(x)\,dx = 0\}$ and $P_{(\mu, G)}$ has density $\prod_{i=1}^n g(x_i - \mu)$.

When we can take $\Theta$ to be a nice subset of Euclidean space and the maps $\theta \to P_\theta$ are smooth, in senses to be made precise later, models $\mathcal{P}$ are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and $F$, $G$ taken to be arbitrary are called nonparametric. It's important to note that even nonparametric models make substantial assumptions: in Example 1.1.3 that $X_1, \ldots, X_m$ are independent of each other and $Y_1, \ldots, Y_n$; moreover, $X_1, \ldots, X_m$ are identically distributed as are $Y_1, \ldots, Y_n$. The only truly nonparametric but useless model for $X \in R^n$ is to assume that its (joint) distribution can be anything.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of $\theta$ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, $N\theta$, as a parameter and in Example 1.1.2, under assumptions (1)-(4), we may parametrize the model by the first and second moments of the normal distribution of the observations (i.e., by $(\mu, \mu^2 + \sigma^2)$).

What parametrization we choose is usually suggested by the phenomenon we are modeling; $\theta$ is the fraction of defectives, $\mu$ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have $\theta_1 \neq \theta_2$ and yet $P_{\theta_1} = P_{\theta_2}$. Such parametrizations are called unidentifiable. For instance, in (1.1.2) suppose that we permit $G$ to be arbitrary. Then the map sending $\theta = (\mu, G)$ into the distribution of $(X_1, \ldots, X_n)$ remains the same but $\Theta = \{(\mu, G) : \mu \in R, G \text{ has (arbitrary) density } g\}$. Now the parametrization is unidentifiable because, for example, $\mu = 0$ and $N(0, 1)$ errors lead to the same distribution of the observations as $\mu = 1$ and $N(-1, 1)$ errors. The critical problem with such parametrizations is that even with "infinite amounts of data," that is, knowledge of the true $P_\theta$, parts of $\theta$ remain unknowable. Thus, we will need to ensure that our parametrizations are identifiable, that is, $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$.
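The unidentifiable $(\mu, G)$ parametrization described above can be verified numerically. The following sketch uses hypothetical helper names (`normal_pdf`, `density_of_x`) and an arbitrary evaluation grid. It compares the density of a single observation $X = \mu + \epsilon$ under $\mu = 0$ with $N(0, 1)$ errors and under $\mu = 1$ with $N(-1, 1)$ errors, and contrasts this with a pair of labels that does change the distribution.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of the N(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def density_of_x(x, mu, error_mean, error_sd=1.0):
    """Density at x of X = mu + error, where error ~ N(error_mean, error_sd^2)."""
    return normal_pdf(x - mu, error_mean, error_sd)


grid = [k / 4 for k in range(-20, 21)]  # arbitrary evaluation points

# Two different labels (mu, G) that give the same distribution for X.
same_distribution = all(
    math.isclose(density_of_x(x, 0.0, 0.0), density_of_x(x, 1.0, -1.0))
    for x in grid
)

# With the error mean held at 0, changing mu does change the distribution.
differs_when_centered = not math.isclose(
    density_of_x(0.0, 0.0, 0.0), density_of_x(0.0, 1.0, 0.0)
)

print(same_distribution, differs_when_centered)
```

Both flags come out true: only the sum of $\mu$ and the error mean affects the distribution of $X$, which is why restricting $G$ to distributions with expectation 0 restores identifiability.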
Dual to the notion of a parametrization, a map from some $\Theta$ to $\mathcal{P}$, is that of a parameter, formally a map, $\nu$, from $\mathcal{P}$ to another space $\mathcal{N}$. A parameter is a feature $\nu(P)$ of the distribution of $X$. For instance, in Example 1.1.1, the fraction of defectives $\theta$ can be thought of as the mean of $X/n$. In Example 1.1.3 with assumptions (1)-(2) we are interested in $\Delta$, which can be thought of as the difference in the means of the two populations of responses. In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of $X$. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance $\sigma^2$, then $\sigma^2$ is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter $\theta$, which indexes the family $\mathcal{P}$, that is, make $\theta \to P_\theta$ into a parametrization of $\mathcal{P}$. Implicit in this description is the assumption that $\theta$ is a parameter in the sense we have just defined. But given a parametrization $\theta \to P_\theta$, $\theta$ is a parameter if and only if the parametrization is identifiable. Formally, we can define $\theta : \mathcal{P} \to \Theta$ as the inverse of the map $\theta \to P_\theta$, from $\Theta$ to its range $\mathcal{P}$ iff the latter map is 1-1, that is, if $P_{\theta_1} = P_{\theta_2}$ implies $\theta_1 = \theta_2$.

More generally, a function $q : \Theta \to \mathcal{N}$ can be identified with a parameter $\nu(P)$ iff $P_{\theta_1} = P_{\theta_2}$ implies $q(\theta_1) = q(\theta_2)$ and then $\nu(P_\theta) \equiv q(\theta)$.

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest $\mu \equiv \mu(P)$ can be characterized as the mean of $P$, or the median of $P$, or the midpoint of the interquantile range of $P$, or more generally as the center of symmetry of $P$, as long as $\mathcal{P}$ is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error $\epsilon$ to be Gaussian but with arbitrary mean $\Delta$. Then $\mathcal{P}$ is parametrized by $\theta = (\mu, \Delta, \sigma^2)$, where $\sigma^2$ is the variance of $\epsilon$. As we have seen this parametrization is unidentifiable and neither $\mu$ nor $\Delta$ are parameters in the sense we've defined. But $\sigma^2 = \mathrm{Var}(X_1)$ evidently is and so is $\mu + \Delta$.

Sometimes the choice of $\mathcal{P}$ starts by the consideration of a particular parameter. For instance, our interest in studying a population of incomes may precisely be in the mean income. When we sample, say with replacement, and observe $X_1, \ldots, X_n$ independent with common distribution, it is natural to write

$$X_i = \mu + \epsilon_i, \quad 1 \leq i \leq n$$

where $\mu$ denotes the mean income and, thus, $E(\epsilon_i) = 0$. The $(\mu, G)$ parametrization of Example 1.1.2 is now well defined and identifiable by (1.1.3) and $\mathcal{G} = \{G : \int x\,dG(x) = 0\}$.

Similarly, in Example 1.1.3, instead of postulating a constant treatment effect $\Delta$, we can start by making the difference of the means, $\delta = \mu_Y - \mu_X$, the focus of the study. Then $\delta$ is identifiable whenever $\mu_X$ and $\mu_Y$ exist.
1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" $P$ is. The link for us are things we can compute, statistics. Formally, a statistic $T$ is a map from the sample space $\mathcal{X}$ to some space of values $\mathcal{T}$, usually a Euclidean space. Informally, $T(x)$ is what we can compute if we observe $X = x$. Thus, in Example 1.1.1, the fraction defective in the sample, $T(x) = x/n$. In Example 1.1.2 a common estimate of $\mu$ is the statistic $T(X_1, \ldots, X_n) = \bar{X} \equiv \frac{1}{n}\sum_{i=1}^n X_i$, a common estimate of $\sigma^2$ is the statistic

$$s^2 \equiv \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar{X})^2.$$

$\bar{X}$ and $s^2$ are called the sample mean and sample variance. How we use statistics in estimation and other decision procedures is the subject of the next section.

For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function valued statistic $\hat{F}$, called the empirical distribution function, which evaluated at $x \in R$ is

$$\hat{F}(X_1, \ldots, X_n)(x) = \frac{1}{n}\sum_{i=1}^n 1(X_i \leq x)$$

where $(X_1, \ldots, X_n)$ are a sample from a probability $P$ on $R$ and $1(A)$ is the indicator of the event $A$. This statistic takes values in the set of all distribution functions on $R$. It estimates the function valued parameter $F$ defined by its evaluation at $x \in R$, $F(P)(x) = P[X_1 \leq x]$.

Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure.

Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf., for example, Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. Nevertheless, we can draw guidelines from our numbers and cautiously proceed. These issues will be discussed further in Volume 2. In this volume we assume that the model has
been selected prior to the current experiment. This selection is based on experience with previous similar experiments (cf. Lehmann, 1990).

There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. For instance, in situation (d) again, patients may be considered one at a time, sequentially, and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. The experimenter may, for example, assign the drugs alternatively to every other patient in the beginning and then, after a while, assign the drug that seems to be working better to a higher proportion of patients. Moreover, the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other. Thus, the number of patients in the study (the sample size) is random. Problems such as these lie in the fields of sequential analysis and experimental design. They are not covered under our general model and will not be treated in this book. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more information.

Notation. Regular models. When dependence on $\theta$ has to be observed, we shall denote the distribution corresponding to any particular parameter value $\theta$ by $P_\theta$. Expectations calculated under the assumption that $X \sim P_\theta$ will be written $E_\theta$. Distribution functions will be denoted by $F(\cdot, \theta)$, density and frequency functions by $p(\cdot, \theta)$. However, these and other subscripts and arguments will be omitted where no confusion can arise.

It will be convenient to assume(1) from now on that in any parametric model we consider either:

(1) All of the $P_\theta$ are continuous with densities $p(x, \theta)$;

(2) All of the $P_\theta$ are discrete with frequency functions $p(x, \theta)$, and there exists a set $\{x_1, x_2, \ldots\}$ that is independent of $\theta$ such that $\sum_{i=1}^\infty p(x_i, \theta) = 1$ for all $\theta$.

Such models will be called regular parametric models. In the discrete case we will use both the terms frequency function and density for $p(x, \theta)$. See A.10.

1.1.4 Examples, Regression Models

We end this section with two further important examples indicating the wide scope of the notions we have introduced.

In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1.1.3. This is the stage for the following.

Example 1.1.4. Regression Models. We observe $(z_1, Y_1), \ldots, (z_n, Y_n)$ where $Y_1, \ldots, Y_n$ are independent. The distribution of the response $Y_i$ for the $i$th subject or case in the study is postulated to depend on certain characteristics $z_i$ of the $i$th subject. Thus, $z_i$ is a $d$ dimensional vector that gives characteristics such as sex, age, height, weight, and so on of the $i$th subject in a study. For instance, in Example 1.1.3 we could take $z$ to be the treatment label and write our observations as $(A, X_1), \ldots, (A, X_m), (B, Y_1), \ldots, (B, Y_n)$. This is obviously overkill but suppose that, in the study, drugs A and B are given at several
dose levels. Then, $d = 2$ and $z_i^T$ can denote the pair (Treatment Label, Treatment Dose Level) for patient $i$.

In general, $z_i$ is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas $Y_i$ is random and referred to as the response variable or dependent variable in the sense that its distribution depends on $z_i$. If we let $f(y_i \mid z_i)$ denote the density of $Y_i$ for a subject with covariate vector $z_i$, then the model is

(a) $p(y_1, \ldots, y_n) = \prod_{i=1}^n f(y_i \mid z_i)$.

If we let $\mu(z)$ denote the expected value of a response with given covariate vector $z$, then we can write,

(b) $Y_i = \mu(z_i) + \epsilon_i$, $i = 1, \ldots, n$

where $\epsilon_i = Y_i - E(Y_i)$, $i = 1, \ldots, n$. Here $\mu(z)$ is an unknown function from $R^d$ to $R$ that we are interested in. For instance, in Example 1.1.3 with the Gaussian two-sample model $\mu(A) = \mu$, $\mu(B) = \mu + \Delta$. We usually need to postulate more. A common (but often violated) assumption is

(1) The $\epsilon_i$ are identically distributed with distribution $F$. That is, the effect of $z$ on $Y$ is through $\mu(z)$ only. In the two sample models this is implied by the constant treatment effect assumption. See Problem 1.1.8.

On the basis of subject matter knowledge and/or convenience it is usually postulated that

(2) $\mu(z) = g(\beta, z)$ where $g$ is known except for a vector $\beta = (\beta_1, \ldots, \beta_d)^T$ of unknowns. The most common choice of $g$ is the linear form,

(3) $g(\beta, z) = \sum_{j=1}^d \beta_j z_j = z^T\beta$ so that (b) becomes

(b') $Y_i = z_i^T\beta + \epsilon_i$, $1 \leq i \leq n$.

This is the linear model. Often the following final assumption is made:

(4) The distribution $F$ of (1) is $N(0, \sigma^2)$ with $\sigma^2$ unknown. Then we have the classical Gaussian linear model, which we can write in vector matrix form,

(c) $Y \sim N(Z\beta, \sigma^2 J)$

where $Z_{n \times d} = (z_1^T, \ldots, z_n^T)^T$ and $J$ is the $n \times n$ identity.

Clearly, Example 1.1.3(3) is a special case of this model. So is Example 1.1.2 with assumptions (1)-(4). In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. By varying the assumptions we obtain parametric models as with (1), (3) and (4) above, semiparametric as with (1) and (2) with $F$ arbitrary, and nonparametric if we drop (1) and simply treat the $z_i$ as a label of the completely unknown distributions of $Y_i$. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. □
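The remark that the Gaussian two-sample model is a special case of the linear model can be made concrete. The sketch below is one possible coding, chosen here only for illustration: the treatment label is encoded as the covariate vector $z = (1, 1\{\text{label} = B\})$ with $\beta = (\mu, \Delta)$, and the numerical values of $\mu$ and $\Delta$ are made up. It computes the mean vector $Z\beta$ and recovers $\mu(A) = \mu$ and $\mu(B) = \mu + \Delta$.

```python
def design_row(label):
    """Covariate vector z = (1, indicator of treatment B).

    This coding is an assumption of the sketch; the leading 1 carries the
    control mean and the indicator carries the treatment effect.
    """
    return [1.0, 1.0 if label == "B" else 0.0]


def linear_mean(z, beta):
    """mu(z) = z^T beta, the linear form (3)."""
    return sum(zj * bj for zj, bj in zip(z, beta))


# Hypothetical parameter values: control mean mu and shift Delta.
mu, delta = 10.0, 2.5
beta = [mu, delta]

labels = ["A", "A", "B"]                 # m = 2 controls, n = 1 treated
Z = [design_row(label) for label in labels]
mean_responses = [linear_mean(z, beta) for z in Z]

print(Z)
print(mean_responses)
```

Other codings of the label give the same family of mean vectors under a different $\beta$, which connects to the earlier point that a model admits many parametrizations.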
Finally, we give an example in which the responses are dependent.

Example 1.1.5. Measurement Model with Autoregressive Errors. Let $X_1, \ldots, X_n$ be the $n$ determinations of a physical constant $\mu$. Consider the model where

$$X_i = \mu + e_i, \quad i = 1, \ldots, n$$

and assume

$$e_i = \beta e_{i-1} + \epsilon_i, \quad i = 1, \ldots, n, \quad e_0 = 0$$

where $\epsilon_i$ are independent identically distributed with density $f$. Here the errors $e_1, \ldots, e_n$ are dependent as are the $X$'s. In fact we can write

$$X_i = \mu(1 - \beta) + \beta X_{i-1} + \epsilon_i, \quad i = 2, \ldots, n, \quad X_1 = \mu + \epsilon_1.$$

An example would be, say, the elapsed times $X_1, \ldots, X_n$ spent above a fixed high level for a series of $n$ consecutive wave records at a point on the seashore. Let $\mu = E(X_i)$ be the average time for an infinite series of records. It is plausible that $e_i$ depends on $e_{i-1}$ because long waves tend to be followed by long waves. A second example is consecutive measurements $X_i$ of a constant $\mu$ made by the same observer who seeks to compensate for apparent errors. Of course, model (a) assumes much more but it may be a reasonable first approximation in these situations.

To find the density $p(x_1, \ldots, x_n)$, we start by finding the density of $e_1, \ldots, e_n$. Using conditional probability theory and $e_i = \beta e_{i-1} + \epsilon_i$, we have

$$p(e_1, \ldots, e_n) = p(e_1)p(e_2 \mid e_1)p(e_3 \mid e_1, e_2) \cdots p(e_n \mid e_1, \ldots, e_{n-1})$$
$$= p(e_1)p(e_2 \mid e_1)p(e_3 \mid e_2) \cdots p(e_n \mid e_{n-1})$$
$$= f(e_1)f(e_2 - \beta e_1) \cdots f(e_n - \beta e_{n-1}).$$

Because $e_i = X_i - \mu$, the model for $X_1, \ldots, X_n$ is

$$p(x_1, \ldots, x_n) = f(x_1 - \mu)\prod_{j=2}^n f(x_j - \beta x_{j-1} - (1 - \beta)\mu).$$

The default assumption, at best an approximation for the wave example, is that $f$ is the $N(0, \sigma^2)$ density. Then we have what is called the AR(1) Gaussian model

$$p(x_1, \ldots, x_n) = (2\pi)^{-n/2}\sigma^{-n}\exp\left\{-\frac{1}{2\sigma^2}\left[(x_1 - \mu)^2 + \sum_{i=2}^n (x_i - \beta x_{i-1} - (1 - \beta)\mu)^2\right]\right\}.$$

We include this example to illustrate that we need not be limited by independence. However, save for a brief discussion in Volume 2, the conceptual issues of stationarity, ergodicity, and the associated probability theory models and inference for dependent data are beyond the scope of this book. □
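The autoregressive construction of Example 1.1.5 and the product form of its density can be traced in code. This is a minimal sketch with hypothetical function names and made-up values for $\mu$, $\beta$ and the innovations $\epsilon_i$. It builds the errors from $e_i = \beta e_{i-1} + \epsilon_i$ with $e_0 = 0$, recovers the innovations from the $X$'s through $x_j - \beta x_{j-1} - (1 - \beta)\mu$, and evaluates the Gaussian log density as the sum of the corresponding $\log f$ terms.

```python
import math


def ar1_errors(innovations, beta):
    """Errors e_i = beta * e_{i-1} + eps_i, started from e_0 = 0."""
    errors, previous = [], 0.0
    for eps in innovations:
        previous = beta * previous + eps
        errors.append(previous)
    return errors


def recover_innovations(xs, mu, beta):
    """eps_1 = x_1 - mu and eps_j = x_j - beta*x_{j-1} - (1 - beta)*mu for j >= 2."""
    first = [xs[0] - mu]
    rest = [cur - beta * prev - (1.0 - beta) * mu for prev, cur in zip(xs, xs[1:])]
    return first + rest


def ar1_gaussian_logdensity(xs, mu, beta, sigma):
    """log p(x_1, ..., x_n) when f is the N(0, sigma^2) density."""
    def log_f(u):
        return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - u * u / (2.0 * sigma ** 2)
    return sum(log_f(eps) for eps in recover_innovations(xs, mu, beta))


# Hypothetical values chosen so the arithmetic is exact in floating point.
mu, beta = 5.0, 0.5
eps = [1.0, -2.0, 0.5, 0.0]
xs = [mu + e for e in ar1_errors(eps, beta)]
recovered = recover_innovations(xs, mu, beta)

print(xs)
print(recovered)
print(ar1_gaussian_logdensity(xs, mu, beta, sigma=1.0))
```

Setting `beta = 0.0` reduces the log density to that of $n$ independent $N(\mu, \sigma^2)$ observations, the model of Example 1.1.2 under assumptions (1)-(4).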
Summary. In this section we introduced the first basic notions and formalism of mathematical statistics, vector observations $X$ with unknown probability distributions $P$ ranging over models $\mathcal{P}$. The notions of parametrization and identifiability are introduced. The general definition of parameters and statistics is given and the connection between parameters and parametrizations elucidated. This is done in the context of a number of classical examples, the most important of which is the workhorse of statistics, the regression model. We view statistical models as useful tools for learning from the outcomes of experiments and studies. They are useful in understanding how the outcomes can be used to draw inferences that go beyond the particular experiment. Models are approximations to the mechanisms generating the observations. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences.

1.2 BAYESIAN MODELS

Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. There are situations in which most statisticians would agree that more can be said. For instance, in the inspection Example 1.1.1, it is possible that, in the past, we have had many shipments of size $N$ that have subsequently been distributed. If the customers have provided accurate records of the number of defective items that they have found, we can construct a frequency distribution $\{\pi_0, \ldots, \pi_N\}$ for the proportion $\theta$ of defectives in past shipments. That is, $\pi_i$ is the frequency of shipments with $i$ defective items, $i = 0, \ldots, N$. Now it is reasonable to suppose that the value of $\theta$ in the present shipment is the realization of a random variable $\boldsymbol{\theta}$ with distribution given by

$$P\left[\boldsymbol{\theta} = \frac{i}{N}\right] = \pi_i, \quad i = 0, \ldots, N. \tag{1.2.1}$$

Our model is then specified by the joint distribution of the observed number $X$ of defectives in the sample and the random variable $\boldsymbol{\theta}$. We know that, given $\boldsymbol{\theta} = i/N$, $X$ has the hypergeometric distribution $\mathcal{H}(i, N, n)$. Thus,

$$P\left[X = k, \boldsymbol{\theta} = \frac{i}{N}\right] = \pi_i \frac{\binom{i}{k}\binom{N - i}{n - k}}{\binom{N}{n}}. \tag{1.2.2}$$

This is an example of a Bayesian model.

There is a substantial number of statisticians who feel that it is always reasonable, and indeed necessary, to think of the true value of the parameter $\theta$ as being the realization of a random variable $\boldsymbol{\theta}$ with a known distribution. This distribution does not always correspond to an experiment that is physically realizable but rather is thought of as a measure of the beliefs of the experimenter concerning the true value of $\theta$ before he or she takes any data.
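The joint distribution (1.2.2) of the inspection example can be tabulated exactly for a small shipment. The sketch below is illustrative only: the shipment size, sample size and the uniform prior over the number of defectives are made-up values, and the function names are hypothetical. It uses exact rational arithmetic to confirm that the table sums to 1 and that summing over $k$ returns the prior $\pi_i$.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, defectives, N, n):
    """P[X = k] for n draws without replacement from N items, `defectives` of them bad."""
    if k < max(n - (N - defectives), 0) or k > min(defectives, n):
        return Fraction(0)
    return Fraction(comb(defectives, k) * comb(N - defectives, n - k), comb(N, n))


def joint_table(prior, N, n):
    """P[X = k, theta = i/N] = pi_i * hypergeometric pmf, as in (1.2.2)."""
    return {
        (k, i): prior[i] * hypergeom_pmf(k, i, N, n)
        for i in range(N + 1)
        for k in range(n + 1)
    }


# Hypothetical setting: shipments of N = 5 items, samples of n = 2, and past
# shipments spread evenly over 0, 1, ..., 5 defectives.
N, n = 5, 2
prior = [Fraction(1, N + 1)] * (N + 1)
table = joint_table(prior, N, n)

total = sum(table.values())
theta_marginal = [sum(table[(k, i)] for k in range(n + 1)) for i in range(N + 1)]
x_marginal = [sum(table[(k, i)] for i in range(N + 1)) for k in range(n + 1)]

print(total)
print(theta_marginal == prior)
print(x_marginal)
```

Dividing a column of the table by the corresponding entry of `x_marginal` gives the conditional distribution of the number of defectives given the observed count.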
Thus, the resulting statistical inference becomes subjective. The theory of this school is expounded by L. J. Savage (1954), Raiffa and Schlaiffer (1961), Lindley (1965), De Groot (1969), and Berger (1985). An interesting discussion of a variety of points of view on these questions may be found in Savage et al. (1962). There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of $\boldsymbol{\theta}$ has an objective interpretation in terms of frequencies.(1) Our own point of view is that subjective elements including the views of subject matter experts are an essential element in all model building. However, insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). However, by giving $\boldsymbol{\theta}$ a distribution purely as a theoretical tool to which no subjective significance is attached, we can obtain important and useful results and insights. We shall return to the Bayesian framework repeatedly in our discussion.

In this section we shall define and discuss the basic elements of Bayesian models. Suppose that we have a regular parametric model $\{P_\theta : \theta \in \Theta\}$. To get a Bayesian model we introduce a random vector $\boldsymbol{\theta}$, whose range is contained in $\Theta$, with density or frequency function $\pi$. The function $\pi$ represents our belief or information about the parameter $\theta$ before the experiment and is called the prior density or frequency function. We now think of $P_\theta$ as the conditional distribution of $X$ given $\boldsymbol{\theta} = \theta$. The joint distribution of $(\boldsymbol{\theta}, X)$ is that of the outcome of a random experiment in which we first select $\boldsymbol{\theta} = \theta$ according to $\pi$ and then, given $\boldsymbol{\theta} = \theta$, select $X$ according to $P_\theta$. If both $X$ and $\boldsymbol{\theta}$ are continuous or both are discrete, then by (B.1.3), $(\boldsymbol{\theta}, X)$ is appropriately continuous or discrete with density or frequency function,

$$f(\theta, x) = \pi(\theta)p(x, \theta). \tag{1.2.3}$$

Because we now think of $p(x, \theta)$ as a conditional density or frequency function given $\boldsymbol{\theta} = \theta$, we will denote it by $p(x \mid \theta)$ for the remainder of this section.

Equation (1.2.2) is an example of (1.2.3). In the "mixed" cases such as $\boldsymbol{\theta}$ continuous $X$ discrete, the joint distribution is neither continuous nor discrete.

The most important feature of a Bayesian model is the conditional distribution of $\boldsymbol{\theta}$ given $X = x$, which is called the posterior distribution of $\boldsymbol{\theta}$. Before the experiment is performed, the information or belief about the true value of the parameter is described by the prior distribution. After the value $x$ has been obtained for $X$, the information about $\boldsymbol{\theta}$ is described by the posterior distribution.

For a concrete illustration, let us turn again to Example 1.1.1. For instance, suppose that $N = 100$ and that from past experience we believe that each item has probability .1 of being defective independently of the other members of the shipment. This would lead to the prior distribution

$$\pi_i = \binom{100}{i}(0.1)^i(0.9)^{100 - i}, \tag{1.2.4}$$

for $i = 0, 1, \ldots, 100$. Before sampling any items the chance that a given shipment contains
20 or more bad items is by the normal approximation with continuity correction, (A.15.10),

$$P[100\boldsymbol{\theta} \geq 20] = P\left[\frac{100\boldsymbol{\theta} - 10}{\sqrt{100(0.1)(0.9)}} \geq \frac{10}{\sqrt{100(0.1)(0.9)}}\right] \approx 1 - \Phi\left(\frac{9.5}{3}\right) = 0.001. \tag{1.2.5}$$

Now suppose that a sample of 19 has been drawn in which 10 defective items are found. This leads to

$$P[100\boldsymbol{\theta} \geq 20 \mid X = 10] \approx 0.30. \tag{1.2.6}$$

To calculate the posterior probability given in (1.2.6) we argue loosely as follows: If before the drawing each item was defective with probability .1 and good with probability .9 independently of the other items, this will continue to be the case for the items left in the lot after the 19 sample items have been drawn. Therefore, $100\boldsymbol{\theta} - X$, the number of defectives left after the drawing, is independent of $X$ and has a $\mathcal{B}(81, 0.1)$ distribution. Thus,

$$P[100\boldsymbol{\theta} \geq 20 \mid X = 10] = P[100\boldsymbol{\theta} - X \geq 10 \mid X = 10]$$
$$= P\left[\frac{(100\boldsymbol{\theta} - X) - 8.1}{\sqrt{81(0.9)(0.1)}} \geq \frac{1.9}{\sqrt{81(0.9)(0.1)}}\right] \approx 1 - \Phi(0.52) = 0.30. \tag{1.2.7}$$

In general, to calculate the posterior, some variant of Bayes' rule (B.1.4) can be used. Specifically,

(i) The posterior distribution is discrete or continuous according as the prior distribution is discrete or continuous.

(ii) If we denote the corresponding (posterior) frequency function or density by $\pi(\theta \mid x)$, then

$$\pi(\theta \mid x) = \frac{\pi(\theta)p(x \mid \theta)}{\sum_t \pi(t)p(x \mid t)} \quad \text{if } \boldsymbol{\theta} \text{ is discrete},$$
$$\pi(\theta \mid x) = \frac{\pi(\theta)p(x \mid \theta)}{\int_{-\infty}^{\infty} \pi(t)p(x \mid t)\,dt} \quad \text{if } \boldsymbol{\theta} \text{ is continuous}. \tag{1.2.8}$$

In the cases where $\boldsymbol{\theta}$ and $X$ are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of $(\boldsymbol{\theta}, X)$ given by (1.2.3). Here is an example.

Example 1.2.1. Bernoulli Trials. Suppose that $X_1, \ldots, X_n$ are indicators of $n$ Bernoulli trials with probability of success $\theta$ where $0 < \theta < 1$. If we assume that $\boldsymbol{\theta}$ has a priori distribution with density $\pi$, we obtain by (1.2.8) as posterior density of $\boldsymbol{\theta}$,

$$\pi(\theta \mid x_1, \ldots, x_n) = \frac{\pi(\theta)\theta^k(1 - \theta)^{n - k}}{\int_0^1 \pi(t)t^k(1 - t)^{n - k}\,dt} \tag{1.2.9}$$
for $0 < \theta < 1$, $x_i = 0$ or $1$, $i = 1, \ldots, n$, $k = \sum_{i=1}^n x_i$.

Note that the posterior density depends on the data only through the total number of successes, $\sum_{i=1}^n X_i$. We also obtain the same posterior density if $\boldsymbol{\theta}$ has prior density $\pi$ and we only observe $\sum_{i=1}^n X_i$, which has a $B(n, \theta)$ distribution given $\boldsymbol{\theta} = \theta$ (Problem 1.2.9). We can thus write $\pi(\theta \mid k)$ for $\pi(\theta \mid x_1, \ldots, x_n)$, where $k = \sum_{i=1}^n x_i$.

To choose a prior $\pi$, we need a class of distributions that concentrate on the interval $(0, 1)$. One such class is the two-parameter beta family. This class of distributions has the remarkable property that the resulting posterior distributions are again beta distributions. Specifically, upon substituting the $\beta(r, s)$ density (B.2.11) in (1.2.9) we obtain

$$\pi(\theta \mid k) = \frac{\theta^{k+r-1}(1-\theta)^{n-k+s-1}}{c}. \qquad (1.2.10)$$

The proportionality constant $c$, which depends on $k$, $r$, and $s$ only, must (see (B.2.11)) be $B(k + r, n - k + s)$ where $B(\cdot, \cdot)$ is the beta function, and the posterior distribution of $\boldsymbol{\theta}$ given $\sum X_i = k$ is $\beta(k + r, n - k + s)$.

As Figure B.2.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all. For instance, non-U-shaped bimodal distributions are not permitted.

Suppose, for instance, we are interested in the proportion $\theta$ of "geniuses" (IQ > 160) in a particular city. To get information we take a sample of $n$ individuals from the city. If $n$ is small compared to the size of the city, (A.15.13) leads us to assume that the number $X$ of geniuses observed has approximately a $B(n, \theta)$ distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the form of a prior distribution on $\theta$. We may want to assume that $\boldsymbol{\theta}$ has a density with maximum value at 0 such as that drawn with a dotted line in Figure B.2.2. Or else we may think that $\pi(\theta)$ concentrates its mass near a small number, say 0.05. Then we can choose $r$ and $s$ in the $\beta(r, s)$ distribution, so that the mean is $r/(r + s) = 0.05$ and its variance is very small. The result might be a density such as the one marked with a solid line in Figure B.2.2.

If we were interested in some proportion about which we have no information or belief, we might take $\boldsymbol{\theta}$ to be uniformly distributed on $(0, 1)$, which corresponds to using the beta distribution with $r = s = 1$. □

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another bigger conjugate family is that of finite mixtures of beta distributions; see Problem 1.2.16. We return to conjugate families in Section 1.6.

Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions and give Bayes rule. We also by example introduce the notion of a conjugate family of distributions.
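The conjugacy of the beta family to Bernoulli sampling can be checked numerically. The following is a minimal sketch, not part of the text: it evaluates the posterior density once from Bayes' rule (1.2.9), integrating the denominator with a midpoint rule, and once from the closed form $\beta(k + r, n - k + s)$. The function names, the midpoint grid size, and the illustrative values $r = 2$, $s = 3$, $n = 10$, $k = 4$ are all assumptions chosen for the example.

```python
import math


def beta_pdf(theta, r, s):
    """Density of the beta(r, s) distribution at theta in (0, 1)."""
    # log B(r, s) via log-gamma, to avoid overflow for large r, s
    log_b = math.lgamma(r) + math.lgamma(s) - math.lgamma(r + s)
    return math.exp((r - 1) * math.log(theta)
                    + (s - 1) * math.log(1 - theta) - log_b)


def beta_posterior_params(r, s, n, k):
    """Closed-form update: beta(r, s) prior, k successes in n trials."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return r + k, s + n - k


def posterior_by_bayes_rule(theta, r, s, n, k, grid=20000):
    """Posterior density from (1.2.9); denominator by a midpoint rule."""
    h = 1.0 / grid
    denom = 0.0
    for i in range(grid):
        t = (i + 0.5) * h
        denom += beta_pdf(t, r, s) * t ** k * (1 - t) ** (n - k)
    denom *= h
    numer = beta_pdf(theta, r, s) * theta ** k * (1 - theta) ** (n - k)
    return numer / denom


# Illustrative (hypothetical) values: beta(2, 3) prior, 4 successes in 10 trials.
r_post, s_post = beta_posterior_params(2, 3, 10, 4)
numeric = posterior_by_bayes_rule(0.4, 2, 3, 10, 4)
closed_form = beta_pdf(0.4, r_post, s_post)
print(r_post, s_post, abs(numeric - closed_form) < 1e-6)
```

The two evaluations agree to within the quadrature error, which is what (1.2.10) asserts; the closed form is of course the one to use in practice.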
1.3 THE DECISION THEORETIC FRAMEWORK

Given a statistical model, the information we want to draw from data can be put in various forms depending on the purposes of our analysis. We may wish to produce "best guesses" of the values of important parameters, for instance, the fraction defective $\theta$ in Example 1.1.1 or the physical constant $\mu$ in Example 1.1.2. These are estimation problems. In other situations certain $P$ are "special" and we may primarily wish to know whether the data support "specialness" or not. For instance, in Example 1.1.3, $P$'s that correspond to no treatment effect (i.e., placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to permit the marketing of drugs that do no good. If $\mu_0$ is the critical matter density in the universe so that $\mu < \mu_0$ means the universe is expanding forever and $\mu \ge \mu_0$ correspond to an eternal alternation of Big Bangs and expansions, then depending on one's philosophy one could take either $P$'s corresponding to $\mu < \mu_0$ or those corresponding to $\mu \ge \mu_0$ as special. Making determinations of "specialness" corresponds to testing significance. As the second example suggests, there are many problems of this type in which it's unclear which of two disjoint sets of $P$'s, $\mathcal{P}_0$ or $\mathcal{P}_0^c$, is special and the general testing problem is really one of discriminating between $\mathcal{P}_0$ and $\mathcal{P}_0^c$. For instance, in Example 1.1.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments, say, with $\theta < \theta_0$, whereas the receiver does not wish to keep "bad," $\theta \ge \theta_0$, shipments. Thus, the receiver wants to discriminate and may be able to attach monetary costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment." In testing problems we, at a first cut, state which is supported by the data: "specialness" or, as it's usually called, "hypothesis" or "nonspecialness" (or alternative).

We may have other goals as illustrated by the next two examples.

Example 1.3.1. Ranking. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not permitted). Thus, if there are $k$ different brands, there are $k!$ possible rankings or actions, one of which will be announced as more consistent with the data than others. □

Example 1.3.2. Prediction. A very important class of situations arises when, as in Example 1.1.4, we have a vector $z$, such as, say, (age, sex, drug dose)$^T$ that can be used for prediction of a variable of interest $Y$, say a 50-year-old male patient's response to the level of a drug. Intuitively, and as we shall see formally later, a reasonable prediction rule for an unseen $Y$ (response of a new patient) is the function $\mu(z)$, the expected value of $Y$ given $z$. Unfortunately $\mu(z)$ is unknown. However, if we have observations $(z_i, Y_i)$, $1 \le i \le n$, we can try to estimate the function $\mu(\cdot)$. For instance, if we believe $\mu(z) = g(\beta, z)$ we can estimate $\beta$ from our observations $Y_i$ of $g(\beta, z_i)$ and then plug our estimate of $\beta$ into $g$. Note that we really want to estimate the function $\mu(\cdot)$; our results will guide the selection of doses of drug for future patients. □

In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. There are many possible choices of estimates. In Example 1.1.1 do we use the observed fraction of defectives
$X/n$ as our estimate or ignore the data and use historical information on past shipments, or combine them in some way? In Example 1.1.2 to estimate $\mu$ do we use the mean of the measurements, $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, or the median, defined as any value such that half the $X_i$ are at least as large and half no bigger? The same type of question arises in all examples. The answer will depend on the model and, most significantly, on what criteria of performance we use. Intuitively, in estimation we care how far off we are, in testing whether we are right or wrong, in ranking what mistakes we've made, and so on. In any case, whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. That is, we need a priori estimates of how well even the best procedure can do. For instance, in Example 1.1.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that, even if $\Delta$ is large, a large $\sigma^2$ will force a large $m$, $n$ to give us a good chance of correctly deciding that the treatment effect is there. On the other hand, once a study is carried out we would probably want not only to estimate $\Delta$ but also know how reliable our estimate is. Thus, we would want a posteriori estimates of performance. These examples motivate the decision theoretic framework: We need to

(1) clarify the objectives of a study,
(2) point to what the different possible actions are,
(3) provide assessments of risk, accuracy, and reliability of statistical procedures,
(4) provide guidance in the choice of procedures for analyzing outcomes of experiments.

1.3.1 Components of the Decision Theory Framework

As in Section 1.1, we begin with a statistical model with an observation vector $X$ whose distribution $P$ ranges over a set $\mathcal{P}$. We usually take $\mathcal{P}$ to be parametrized, $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

Action space. A new component is an action space $\mathcal{A}$ of actions or decisions or claims that we can contemplate making. Here are action spaces for our examples.

Estimation. If we are estimating a real parameter such as the fraction $\theta$ of defectives, in Example 1.1.1, or $\mu$ in Example 1.1.2, it is natural to take $\mathcal{A} = R$ though smaller spaces may serve equally well, for instance, $\mathcal{A} = \{0, \frac{1}{N}, \ldots, 1\}$ in Example 1.1.1.

Testing. Here only two actions are contemplated: accepting or rejecting the "specialness" of $P$ (or in more usual language the hypothesis $H : P \in \mathcal{P}_0$ in which we identify $\mathcal{P}_0$ with the set of "special" $P$'s). By convention, $\mathcal{A} = \{0, 1\}$ with 1 corresponding to rejection of $H$. Thus, in Example 1.1.3, taking action 1 would mean deciding that $\Delta \ne 0$.

Ranking. Here quite naturally $\mathcal{A} = \{\text{Permutations } (i_1, \ldots, i_k) \text{ of } \{1, \ldots, k\}\}$. Thus, if we have three air conditioners, there are $3! = 6$ possible rankings,

$$\mathcal{A} = \{(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)\}.$$
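As a small illustration of how quickly the ranking action space grows, the sketch below enumerates it for $k$ brands. The function name `ranking_action_space` is a hypothetical helper introduced here, not notation from the text.

```python
from itertools import permutations
from math import factorial


def ranking_action_space(k):
    """All rankings (i_1, ..., i_k) of the brands 1, ..., k, ties excluded."""
    return list(permutations(range(1, k + 1)))


actions = ranking_action_space(3)
print(actions)                       # the six rankings of three brands
print(len(actions) == factorial(3))  # the action space has k! elements
```

Already for ten brands the action space has $10! = 3{,}628{,}800$ elements, which is one reason ranking problems are usually handled through a score for each brand rather than by listing actions.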
Prediction. Here $\mathcal{A}$ is much larger. If $Y$ is real, and $z \in Z$, $\mathcal{A} = \{a : a \text{ is a function from } Z \text{ to } R\}$ with $a(z)$ representing the prediction we would make if the new unobserved $Y$ had covariate value $z$. Evidently $Y$ could itself range over an arbitrary space $\mathcal{Y}$ and then $R$ would be replaced by $\mathcal{Y}$ in the definition of $a(\cdot)$. For instance, if $Y = 0$ or 1 corresponds to, say, "does not respond" and "responds," respectively, and $z = (\text{Treatment}, \text{Sex})^T$, then $a(B, M)$ would be our prediction of response or no response for a male given treatment B.

Loss function. Far more important than the choice of action space is the choice of loss function defined as a function $l : \mathcal{P} \times \mathcal{A} \to R^+$. The interpretation of $l(P, a)$, or $l(\theta, a)$ if $\mathcal{P}$ is parametrized, is the nonnegative loss incurred by the statistician if he or she takes action $a$ and the true "state of Nature," that is, the probability distribution producing the data, is $P$. As we shall see, although loss functions, as the name suggests, sometimes can genuinely be quantified in economic terms, they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.

Estimation. In estimating a real valued parameter $\nu(P)$ or $q(\theta)$ if $\mathcal{P}$ is parametrized the most commonly used loss function is,

Quadratic Loss: $l(P, a) = (\nu(P) - a)^2$ (or $l(\theta, a) = (q(\theta) - a)^2$).

Other choices that are, as we shall see (Section 5.1), less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: $l(P, a) = |\nu(P) - a|$, and truncated quadratic loss: $l(P, a) = \min\{(\nu(P) - a)^2, d^2\}$. Closely related to the latter is what we shall call confidence interval loss, $l(P, a) = 0$, $|\nu(P) - a| < d$, $l(P, a) = 1$ otherwise. This loss expresses the notion that all errors within the limits $\pm d$ are tolerable and outside these limits equally intolerable. Although estimation loss functions are typically symmetric in $\nu$ and $a$, asymmetric loss functions can also be of importance. For instance, $l(P, a) = 1(\nu < a)$, which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1.3.3. If $\nu = (\nu_1, \ldots, \nu_d) = (q_1(\theta), \ldots, q_d(\theta))$ and $a = (a_1, \ldots, a_d)$ are vectors, examples of loss functions are

$$l(\theta, a) = \frac{1}{d}\sum_{j=1}^d (a_j - \nu_j)^2 = \text{squared Euclidean distance}/d$$
$$l(\theta, a) = \frac{1}{d}\sum_{j=1}^d |a_j - \nu_j| = \text{absolute distance}/d$$
$$l(\theta, a) = \max\{|a_j - \nu_j|,\ j = 1, \ldots, d\} = \text{supremum distance}.$$

We can also consider function valued parameters. For instance, in the prediction example 1.3.2, $\mu(\cdot)$ is the parameter of interest. If we use $a(\cdot)$ as a predictor and the new $z$ has marginal distribution $Q$ then it is natural to consider,

$$l(P, a) = \int (\mu(z) - a(z))^2\,dQ(z),$$

the expected squared error if $a$ is used. If, say, $Q$ is the empirical distribution of the $z_j$ in
the training set $(z_1, Y_1), \ldots, (z_n, Y_n)$, this leads to the commonly considered

$$l(P, a) = \frac{1}{n}\sum_{j=1}^n (\mu(z_j) - a(z_j))^2,$$

which is just $n^{-1}$ times the squared Euclidean distance between the prediction vector $(a(z_1), \ldots, a(z_n))^T$ and the vector parameter $(\mu(z_1), \ldots, \mu(z_n))^T$.

Testing. We ask whether the parameter $\theta$ is in the subset $\Theta_0$ or subset $\Theta_1$ of $\Theta$, where $\{\Theta_0, \Theta_1\}$ is a partition of $\Theta$ (or equivalently if $P \in \mathcal{P}_0$ or $P \in \mathcal{P}_1$). If we take action $a$ when the parameter is in $\Theta_a$, we have made the correct decision and the loss is zero. Otherwise, the decision is wrong and the loss is taken to equal one. This 0-1 loss function can be written as

0-1 loss: $l(\theta, a) = 0$ if $\theta \in \Theta_a$ (The decision is correct)
$l(\theta, a) = 1$ otherwise (The decision is wrong).

Of course, other economic loss functions may be appropriate. For instance, in Example 1.1.1 suppose returning a shipment with $\theta < \theta_0$ defectives results in a penalty of $s$ dollars whereas every defective item sold results in an $r$ dollar replacement cost. Then the appropriate loss function is

$$l(\theta, 1) = s \text{ if } \theta < \theta_0, \qquad l(\theta, 1) = 0 \text{ if } \theta \ge \theta_0, \qquad l(\theta, 0) = rN\theta. \qquad (1.3.1)$$

Decision procedures. We next give a representation of the process whereby the statistician uses the data to arrive at a decision. The data is a point $X = x$ in the outcome or sample space $\mathcal{X}$. We define a decision rule or procedure $\delta$ to be any function from the sample space taking its values in $\mathcal{A}$. Using $\delta$ means that if $X = x$ is observed, the statistician takes action $\delta(x)$.

Estimation. For the problem of estimating the constant $\mu$ in the measurement model, we implicitly discussed two estimates or decision rules: $\delta_1(x) = $ sample mean $\bar{x}$ and $\delta_2(x) = \hat{x} = $ sample median.

Testing. In Example 1.1.3 with $X$ and $Y$ distributed as $N(\mu + \Delta, \sigma^2)$ and $N(\mu, \sigma^2)$, respectively, if we are asking whether the treatment effect parameter $\Delta$ is 0 or not, then a reasonable rule is to decide $\Delta = 0$ if our estimate $\bar{x} - \bar{y}$ is close to zero, and to decide $\Delta \ne 0$ if our estimate is not close to zero. Here we mean close to zero relative to the variability in the experiment, that is, relative to the standard deviation $\sigma$. In Section 4.9.3 we will show how to obtain an estimate $\hat{\sigma}$ of $\sigma$ from the data. The decision rule can now be written

$$\delta(x, y) = 0 \text{ if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} < c, \qquad \delta(x, y) = 1 \text{ if } \frac{|\bar{x} - \bar{y}|}{\hat{\sigma}} \ge c \qquad (1.3.2)$$
where $c$ is a positive constant called the critical value. How do we choose $c$? We need the next concept of the decision theoretic framework, the risk or risk function:

The risk function. If $\delta$ is the procedure used, $l$ is the loss function, $\theta$ is the true value of the parameter, and $X = x$ is the outcome of the experiment, then the loss is $l(P, \delta(x))$. We do not know the value of the loss because $P$ is unknown. Moreover, we typically want procedures to have good properties not at just one particular $x$, but for a range of plausible $x$'s. Thus, we turn to the average or mean loss over the sample space. That is, we regard $l(P, \delta(X))$ as a random variable and introduce the risk function

$$R(P, \delta) = E_P[l(P, \delta(X))]$$

as the measure of the performance of the decision rule $\delta(x)$. Thus, for each $\delta$, $R$ maps $\mathcal{P}$ or $\Theta$ to $R^+$. $R(\cdot, \delta)$ is our a priori measure of the performance of $\delta$. We illustrate computation of $R$ and its a priori use in some examples.

Estimation. Suppose $\nu \equiv \nu(P)$ is the real parameter we wish to estimate and $\hat{\nu} \equiv \hat{\nu}(X)$ is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of $\hat{\nu}$ and is given by

$$MSE(\hat{\nu}) = R(P, \hat{\nu}) = E_P(\hat{\nu}(X) - \nu(P))^2 \qquad (1.3.3)$$

where for simplicity dependence on $P$ is suppressed in MSE. The MSE depends on the variance of $\hat{\nu}$ and on what is called the bias of $\hat{\nu}$ where

$$\text{Bias}(\hat{\nu}) = E(\hat{\nu}) - \nu$$

can be thought of as the "long-run average error" of $\hat{\nu}$. A useful result is

Proposition 1.3.1.
$$MSE(\hat{\nu}) = (\text{Bias}\,\hat{\nu})^2 + \text{Var}(\hat{\nu}).$$

Proof. Write the error as

$$(\hat{\nu} - \nu) = [\hat{\nu} - E(\hat{\nu})] + [E(\hat{\nu}) - \nu].$$

If we expand the square of the right-hand side keeping the brackets intact and take the expected value, the cross term will be zero because $E[\hat{\nu} - E(\hat{\nu})] = 0$. The other two terms are $(\text{Bias}\,\hat{\nu})^2$ and $\text{Var}(\hat{\nu})$. (If one side is infinite, so is the other and the result is trivially true.) □

We next illustrate the computation and the a priori and a posteriori use of the risk function.

Example 1.3.3. Estimation of $\mu$ (Continued). Suppose $X_1, \ldots, X_n$ are i.i.d. measurements of $\mu$ with $N(0, \sigma^2)$ errors. If we use the mean $\bar{X}$ as our estimate of $\mu$ and assume quadratic loss, then

$$\text{Bias}(\bar{X}) = 0$$
$$\text{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{\sigma^2}{n},$$
and, by Proposition 1.3.1,

$$MSE(\bar{X}) = R(\mu, \bar{X}) = \frac{\sigma^2}{n}, \qquad (1.3.4)$$

which doesn't depend on $\mu$.

Suppose that the precision of the measuring instrument $\sigma^2$ is known and equal to $\sigma_0^2$ or where realistically it is known to be $\le \sigma_0^2$. Then (1.3.4) can be used for an a priori estimate of the risk of $\bar{X}$. If we want to be guaranteed $MSE(\bar{X}) \le \varepsilon^2$ we can do it by taking at least $n_0 = \sigma_0^2/\varepsilon^2$ measurements. If we have no idea of the value of $\sigma^2$, planning is not possible but having taken $n$ measurements we can then estimate $\sigma^2$ by, for instance, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$, or $n\hat{\sigma}^2/(n - 1)$, an estimate we can justify later. The a posteriori estimate of risk $\hat{\sigma}^2/n$ is, of course, itself subject to random error.

Suppose that instead of quadratic loss we used the more natural(1) absolute value loss. Then

$$R(\mu, \bar{X}) = E|\bar{X} - \mu| = E|\bar{\varepsilon}|$$

where $\varepsilon_i = X_i - \mu$. If, as we assumed, the $\varepsilon_i$ are $N(0, \sigma^2)$ then by (A.13.23), $(\sqrt{n}/\sigma)\bar{\varepsilon} \sim N(0, 1)$ and

$$R(\mu, \bar{X}) = \frac{\sigma}{\sqrt{n}}\int_{-\infty}^{\infty} |t|\,\varphi(t)\,dt = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{2}{\pi}}. \qquad (1.3.5)$$

This harder calculation already suggests why quadratic loss is really favored. If we only assume, as we discussed in Example 1.1.2, that the $\varepsilon_i$ are i.i.d. with mean 0 and variance $\sigma^2(P)$, then for quadratic loss, $R(P, \bar{X}) = \sigma^2(P)/n$ still, but for absolute value loss only approximate, analytic, or numerical and/or Monte Carlo computation, is possible. In fact, computational difficulties arise even with quadratic loss as soon as we think of estimates other than $\bar{X}$. For instance, if $\hat{X} = \text{median}(X_1, \ldots, X_n)$ (and we, in general, write $\hat{a}$ for a median of $\{a_1, \ldots, a_n\}$), $E(\hat{X} - \mu)^2 = E(\hat{\varepsilon}^2)$ can only be evaluated numerically (see Problem 1.3.6), or approximated asymptotically. □

We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1.3.1 is useful for evaluating the performance of competing estimators.

Example 1.3.4. Let $\mu_0$ denote the mean of a certain measurement included in the U.S. census, say, age or income. Next suppose we are interested in the mean $\mu$ of the same measurement for a certain area of the United States. If we have no data for area A, a natural guess for $\mu$ would be $\mu_0$, whereas if we have a random sample of measurements $X_1, X_2, \ldots, X_n$ from area A, we may want to combine $\mu_0$ and $\bar{X} = n^{-1}\sum_{i=1}^n X_i$ into an estimator, for instance,

$$\hat{\mu} = (0.2)\mu_0 + (0.8)\bar{X}.$$

The choice of the weights 0.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy. We shall derive them in Section 1.6 through a
formal Bayesian analysis using a normal prior to illustrate a way of bringing in additional knowledge. Here we compare the performances of $\hat{\mu}$ and $\bar{X}$ as estimators of $\mu$ using MSE. We easily find

$$\text{Bias}(\hat{\mu}) = 0.2\mu_0 + 0.8\mu - \mu = 0.2(\mu_0 - \mu)$$
$$\text{Var}(\hat{\mu}) = (0.8)^2\,\text{Var}(\bar{X}) = (.64)\sigma^2/n$$
$$MSE(\hat{\mu}) = .04(\mu_0 - \mu)^2 + (.64)\sigma^2/n.$$

If $\mu$ is close to $\mu_0$, the risk $R(\mu, \hat{\mu})$ of $\hat{\mu}$ is smaller than the risk $R(\mu, \bar{X}) = \sigma^2/n$ of $\bar{X}$ with the minimum relative risk $\inf\{MSE(\hat{\mu})/MSE(\bar{X});\ \mu \in R\}$ being 0.64 when $\mu = \mu_0$. Figure 1.3.1 gives the graphs of $MSE(\hat{\mu})$ and $MSE(\bar{X})$ as functions of $\mu$. Because we do not know the value of $\mu$, using MSE, neither estimator can be proclaimed as being better than the other. However, if we use as our criteria the maximum (over $\mu$) of the MSE (called the minimax criteria), then $\bar{X}$ is optimal (Example 3.3.4).

[Figure 1.3.1. The mean squared errors of $\bar{X}$ and $\hat{\mu}$. The two MSE curves cross at $\mu = \mu_0 \pm 3\sigma/\sqrt{n}$.]

□

Testing. The test rule (1.3.2) for deciding between $\Delta = 0$ and $\Delta \ne 0$ can only take on the two values 0 and 1; thus, the risk is

$$R(\Delta, \delta) = l(\Delta, 0)P[\delta(X, Y) = 0] + l(\Delta, 1)P[\delta(X, Y) = 1],$$

which in the case of 0-1 loss is

$$R(\Delta, \delta) = P[\delta(X, Y) = 1] \text{ if } \Delta = 0, \qquad R(\Delta, \delta) = P[\delta(X, Y) = 0] \text{ if } \Delta \ne 0.$$

In the general case $\mathcal{X}$ and $\Theta$ denote the outcome and parameter space, respectively, and we are to decide whether $\theta \in \Theta_0$ or $\theta \in \Theta_1$, where $\Theta = \Theta_0 \cup \Theta_1$, $\Theta_0 \cap \Theta_1 = \emptyset$. A test
function is a decision rule $\delta(X)$ that equals 1 on a set $C \subset \mathcal{X}$ called the critical region and equals 0 on the complement of $C$; that is, $\delta(X) = 1[X \in C]$, where 1 denotes the indicator function. If $\delta(X) = 1$ and we decide $\theta \in \Theta_1$ when in fact $\theta \in \Theta_0$, we call the error committed a Type I error, whereas if $\delta(X) = 0$ and we decide $\theta \in \Theta_0$ when in fact $\theta \in \Theta_1$, we call the error a Type II error. Thus, the risk of $\delta(X)$ is

$$R(\theta, \delta) = E(\delta(X)) = P(\delta(X) = 1) \text{ if } \theta \in \Theta_0 \quad (\text{Probability of Type I error})$$
$$R(\theta, \delta) = P(\delta(X) = 0) \text{ if } \theta \in \Theta_1 \quad (\text{Probability of Type II error}). \qquad (1.3.6)$$

Finding good test functions corresponds to finding critical regions with small probabilities of error. In the Neyman-Pearson framework of statistical hypothesis testing, the focus is on first providing a small bound, say .05, on the probability of Type I error, and then trying to minimize the probability of a Type II error. For instance, in the treatments A and B example, we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding $\Delta \ne 0$ when $\Delta = 0$), and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding $\Delta = 0$ when $\Delta \ne 0$).

This is not the only approach to testing. For instance, the loss function (1.3.1) and tests $\delta_k$ of the form, "Reject the shipment if and only if $X \ge k$," in Example 1.1.1 lead to (Problem 1.3.18),

$$R(\theta, \delta_k) = sP_\theta[X \ge k] + rN\theta P_\theta[X < k], \quad \theta < \theta_0$$
$$R(\theta, \delta_k) = rN\theta P_\theta[X < k], \quad \theta \ge \theta_0. \qquad (1.3.7)$$

Confidence Bounds and Intervals

Decision theory enables us to think clearly about an important hybrid of testing and estimation, confidence bounds and intervals (and more generally regions). Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter $\nu$. For instance, an accounting firm examining accounts receivable for a firm on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. If (say) $X$ represents the amount owed in the sample and $\nu$ is the unknown total amount owed, it is natural to seek $\bar{\nu}(X)$ such that

$$P[\bar{\nu}(X) \ge \nu] \ge 1 - \alpha \qquad (1.3.8)$$

for all possible distributions $P$ of $X$. Such a $\bar{\nu}$ is called a $(1 - \alpha)$ upper confidence bound on $\nu$. Here $\alpha$ is small, usually .05 or .01 or less. This corresponds to an a priori bound on the risk of $\alpha$ on $\bar{\nu}(X)$ viewed as a decision procedure with action space $R$ and loss function,

$$l(P, a) = 0, \quad a \ge \nu(P)$$
$$l(P, a) = 1, \quad a < \nu(P),$$
an asymmetric estimation type loss function. The 0-1 nature makes it resemble a testing loss function and, as we shall see in Chapter 4, the connection is close. It is clear, though, that this formulation is inadequate because by taking $\bar{\nu} \equiv \infty$ we can achieve risk $\equiv 0$. What is missing is the fact that, though upper bounding is the primary goal, in fact it is important to get close to the truth; knowing that at most $\infty$ dollars are owed is of no use. The decision theoretic framework accommodates by adding a component reflecting this. For instance

$$l(P, a) = a - \nu(P), \quad a \ge \nu(P)$$
$$l(P, a) = c, \quad a < \nu(P),$$

for some constant $c > 0$. Typically, rather than this Lagrangian form, it is customary to first fix $\alpha$ in (1.3.8) and then see what one can do to control (say) $R(P, \bar{\nu}) = E(\bar{\nu}(X) - \nu(P))^+$, where $x^+ = x1(x \ge 0)$.

The same issue arises when we are interested in a confidence interval $[\underline{\nu}(X), \bar{\nu}(X)]$ for $\nu$ defined by the requirement that

$$P[\underline{\nu}(X) \le \nu(P) \le \bar{\nu}(X)] \ge 1 - \alpha$$

for all $P \in \mathcal{P}$. We shall go into this further in Chapter 4.

We next turn to the final topic of this section, general criteria for selecting "optimal" procedures.

1.3.2 Comparison of Decision Procedures

In this section we introduce a variety of concepts used in the comparison of decision procedures. We shall illustrate some of the relationships between these ideas using the following simple example in which $\Theta$ has two members, $\mathcal{A}$ has three points, and the risk of all possible decision procedures can be computed and plotted. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.

Example 1.3.5. Suppose we have two possible states of nature, which we represent by $\theta_1$ and $\theta_2$. For instance, a component in a piece of equipment either works or does not work; a certain location either contains oil or does not; a patient either has a certain disease or does not, and so on. Suppose that three possible actions, $a_1$, $a_2$, and $a_3$, are available. In the context of the foregoing examples, we could leave the component in, replace it, or repair it; we could drill for oil, sell the location, or sell partial rights; we could operate, administer drugs, or wait and see. Suppose the following loss function is decided on

TABLE 1.3.1. The loss function $l(\theta, a)$

                     (Drill)   (Sell)   (Partial rights)
                       a1        a2           a3
(Oil)     theta1        0        10            5
(No oil)  theta2       12         1            6
Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about $\theta$ resulting in the random variable $X$ with possible values coded as 0, 1, and frequency function $p(x, \theta)$ given by the following table

TABLE 1.3.2. The frequency function $p(x, \theta_i)$; $i = 1, 2$

                     Rock formation x
                       0        1
(Oil)     theta1      0.3      0.7
(No oil)  theta2      0.6      0.4

Thus, $X$ may represent a certain geological formation, and when there is oil, it is known that formation 0 occurs with frequency 0.3 and formation 1 with frequency 0.7, whereas if there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4. We list all possible decision rules in the following table.

TABLE 1.3.3. Possible decision rules $\delta_i(x)$

i        1    2    3    4    5    6    7    8    9
x = 0    a1   a1   a1   a2   a2   a2   a3   a3   a3
x = 1    a1   a2   a3   a1   a2   a3   a1   a2   a3

Here, $\delta_1$ represents "Take action $a_1$ regardless of the value of $X$," $\delta_2$ corresponds to "Take action $a_1$, if $X = 0$; take action $a_2$, if $X = 1$," and so on.

The risk of $\delta$ at $\theta$ is

$$R(\theta, \delta) = E[l(\theta, \delta(X))] = l(\theta, a_1)P[\delta(X) = a_1] + l(\theta, a_2)P[\delta(X) = a_2] + l(\theta, a_3)P[\delta(X) = a_3].$$

For instance,

$$R(\theta_1, \delta_2) = 0(0.3) + 10(0.7) = 7$$
$$R(\theta_2, \delta_2) = 12(0.6) + 1(0.4) = 7.6.$$

If $\Theta$ is finite and has $k$ members, we can represent the whole risk function of a procedure $\delta$ by a point in $k$-dimensional Euclidean space, $(R(\theta_1, \delta), \ldots, R(\theta_k, \delta))$ and if $k = 2$ we can plot the set of all such points obtained by varying $\delta$. The risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$ are given in Table 1.3.4 and graphed in Figure 1.3.2 for $i = 1, \ldots, 9$.

TABLE 1.3.4. Risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$

i                    1     2     3     4     5     6     7     8     9
R(theta1, delta_i)   0     7     3.5   3     10    6.5   1.5   8.5   5
R(theta2, delta_i)   12    7.6   9.6   5.4   1     3     8.4   4.0   6

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection. □
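The risk points of Table 1.3.4 follow mechanically from Tables 1.3.1 to 1.3.3, so they are easy to recompute. The sketch below does this for all nine nonrandomized rules of the oil-drilling example. The dictionary layout and the names `LOSS`, `FREQ`, `RULES`, `risk`, and `risk_points` are choices made for this illustration, not notation from the text.

```python
from itertools import product

# Table 1.3.1: loss l(theta, a)
LOSS = {
    "theta1": {"a1": 0, "a2": 10, "a3": 5},   # oil
    "theta2": {"a1": 12, "a2": 1, "a3": 6},   # no oil
}

# Table 1.3.2: frequency function p(x, theta) of the rock formation x
FREQ = {
    "theta1": {0: 0.3, 1: 0.7},
    "theta2": {0: 0.6, 1: 0.4},
}

ACTIONS = ("a1", "a2", "a3")

# Table 1.3.3: rule i maps x = 0 and x = 1 to actions; product() yields
# (a1, a1), (a1, a2), ..., (a3, a3), the same order as delta_1, ..., delta_9.
RULES = [{0: at_0, 1: at_1} for at_0, at_1 in product(ACTIONS, repeat=2)]


def risk(theta, rule):
    """R(theta, delta) = E[l(theta, delta(X))] when X has frequency p(., theta)."""
    return sum(FREQ[theta][x] * LOSS[theta][rule[x]] for x in (0, 1))


def risk_points():
    """The pairs (R(theta1, delta_i), R(theta2, delta_i)), i = 1, ..., 9."""
    return [(risk("theta1", rule), risk("theta2", rule)) for rule in RULES]


for i, (r1, r2) in enumerate(risk_points(), start=1):
    print(i, round(r1, 2), round(r2, 2))
```

The values are rounded only for display, since floating-point products such as 0.7 times 10 need not be exact.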
[Figure 1.3.2. The risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$.]

1.3.3 Bayes and Minimax Criteria

The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. We say that a procedure $\delta$ improves a procedure $\delta'$ if, and only if,

$$R(\theta, \delta) \le R(\theta, \delta')$$

for all $\theta$ with strict inequality for some $\theta$. It is easy to see that there is typically no rule $\delta$ that improves all others. For instance, in estimating $\theta \in R$ when $X \sim N(\theta, \sigma_0^2)$, if we ignore the data and use the estimate $\hat{\theta} = 0$, we obtain $MSE(\hat{\theta}) = \theta^2$. The absurd rule "$\delta^*(X) = 0$" cannot be improved on at the value $\theta = 0$ because $E_0(\delta^2(X)) = 0$ if and only if $\delta(X) = 0$. Usually, if $\delta$ and $\delta'$ are two rules, neither improves the other. Consider, for instance, $\delta_4$ and $\delta_6$ in our example. Here $R(\theta_1, \delta_4) < R(\theta_1, \delta_6)$ but $R(\theta_2, \delta_4) > R(\theta_2, \delta_6)$. The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. We shall pursue this approach further in Chapter 3. Extensions of unbiasedness ideas may be found in Lehmann (1997, Section 1.5). Symmetry (or invariance) restrictions are discussed in Ferguson (1967).

(2) A second major approach has been to compare risk functions by global criteria rather than on a pointwise basis. We shall discuss the Bayes and minimax criteria.

Bayes: The Bayesian point of view leads to a natural global criterion. Recall that in the Bayesian model $\theta$ is the realization of a random variable or vector $\boldsymbol{\theta}$ and that $P_\theta$ is the conditional distribution of $X$ given $\boldsymbol{\theta} = \theta$. In this framework $R(\theta, \delta)$ is just $E[l(\boldsymbol{\theta}, \delta(X)) \mid \boldsymbol{\theta} = \theta]$, the expected loss, if we use $\delta$ and $\boldsymbol{\theta} = \theta$. If we adopt the Bayesian point of view, we need not stop at this point, but can proceed to calculate what we expect to lose on the average as $\boldsymbol{\theta}$ varies. This quantity which we shall call the Bayes risk of $\delta$ and denote $r(\delta)$ is then, given by

$$r(\delta) = E[R(\boldsymbol{\theta}, \delta)] = E[l(\boldsymbol{\theta}, \delta(X))]. \qquad (1.3.9)$$

The second preceding identity is a consequence of the double expectation theorem (B.1.20) in Appendix B.

To illustrate, suppose that in the oil drilling example an expert thinks the chance of finding oil is .2. Then we treat the parameter as a random variable $\boldsymbol{\theta}$ with possible values $\theta_1$, $\theta_2$ and frequency function

$$\pi(\theta_1) = 0.2, \qquad \pi(\theta_2) = 0.8.$$

The Bayes risk of $\delta$ is, therefore,

$$r(\delta) = 0.2R(\theta_1, \delta) + 0.8R(\theta_2, \delta). \qquad (1.3.10)$$

Table 1.3.5 gives $r(\delta_1), \ldots, r(\delta_9)$ specified by (1.3.10).

TABLE 1.3.5. Bayes and maximum risks of the procedures of Table 1.3.3

i                                        1     2      3      4      5     6     7      8     9
r(delta_i)                               9.6   7.48   8.38   4.92   2.8   3.7   7.02   4.9   5.8
max{R(theta1,delta_i), R(theta2,delta_i)}  12    7.6    9.6    5.4    10    6.5   8.4    8.5   6

In the Bayesian framework $\delta$ is preferable to $\delta'$ if, and only if, it has smaller Bayes risk. If there is a rule $\delta^*$, which attains the minimum Bayes risk, that is, such that

$$r(\delta^*) = \min_\delta r(\delta)$$

then it is called a Bayes rule. From Table 1.3.5 we see that rule $\delta_5$ is the unique Bayes rule for our prior.

The method of computing Bayes procedures by listing all available $\delta$ and their Bayes risk is impracticable in general. We postpone the consideration of posterior analysis, the only reasonable computational method, to Section 3.2.

Note that the Bayes approach leads us to compare procedures on the basis of,

$$r(\delta) = \sum_\theta R(\theta, \delta)\,\pi(\theta),$$

if $\boldsymbol{\theta}$ is discrete with frequency function $\pi(\theta)$, and

$$r(\delta) = \int R(\theta, \delta)\,\pi(\theta)\,d\theta,$$
if $\boldsymbol{\theta}$ is continuous with density $\pi(\theta)$. Such comparisons make sense even if we do not interpret $\pi$ as a prior density or frequency, but only as a weight function for averaging the values of the function $R(\theta, \delta)$. For instance, in Example 1.3.5 we might feel that both values of the risk were equally important. It is then natural to compare procedures using the simple average $\frac{1}{2}[R(\theta_1, \delta) + R(\theta_2, \delta)]$. But this is just Bayes comparison where $\pi$ places equal probability on $\theta_1$ and $\theta_2$.

Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. That is, we prefer $\delta$ to $\delta'$, if and only if,

$$\sup_\theta R(\theta, \delta) < \sup_\theta R(\theta, \delta').$$

A procedure $\delta^*$, which has

$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta),$$

is called minimax (minimizes the maximum risk).

The criterion comes from the general theory of two-person zero sum games of von Neumann.(2) We briefly indicate "the game of decision theory." Nature (Player I) picks a point $\theta \in \Theta$ independently of the statistician (Player II), who picks a decision procedure $\delta$ from $\mathcal{D}$, the set of all decision procedures. Player II then pays Player I, $R(\theta, \delta)$. The maximum risk of $\delta^*$ is the upper pure value of the game.

This criterion of optimality is very conservative. It aims to give maximum protection against the worst that can happen, Nature's choosing a $\theta$, which makes the risk as large as possible. The principle would be compelling, if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used. Of course, Nature's intentions and degree of foreknowledge are not that clear and most statisticians find the minimax principle too conservative to employ as a general rule. Nevertheless, in many cases the principle can lead to very reasonable procedures.

To illustrate computation of the minimax rule we turn to Table 1.3.5. From the listing of $\max(R(\theta_1, \delta), R(\theta_2, \delta))$ we see that $\delta_4$ is minimax with a maximum risk of 5.4.

Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to employ. For instance, suppose that, in Example 1.3.5, we toss a fair coin and use $\delta_4$ if the coin lands heads and $\delta_6$ otherwise. Our expected risk would be,

$$\tfrac{1}{2}R(\theta, \delta_4) + \tfrac{1}{2}R(\theta, \delta_6) = 4.75 \text{ if } \theta = \theta_1, \qquad = 4.20 \text{ if } \theta = \theta_2.$$

The maximum risk 4.75 is strictly less than that of $\delta_4$.

Randomized decision rules: In general, if $\mathcal{D}$ is the class of all decision procedures (nonrandomized), a randomized decision procedure can be thought of as a random experiment whose outcomes are members of $\mathcal{D}$. For simplicity we shall discuss only randomized
procedures that select among a finite set $\delta_1, \ldots, \delta_q$ of nonrandomized procedures. If the randomized procedure $\delta$ selects $\delta_i$ with probability $\lambda_i$, $i = 1, \ldots, q$, $\sum_{i=1}^q \lambda_i = 1$, we then define

$$R(\theta, \delta) = \sum_{i=1}^q \lambda_i R(\theta, \delta_i). \tag{1.3.11}$$

Similarly we can define, given a prior $\pi$ on $\Theta$, the Bayes risk of $\delta$

$$r(\delta) = \sum_{i=1}^q \lambda_i E[R(\theta, \delta_i)]. \tag{1.3.12}$$

A randomized Bayes procedure $\delta^*$ minimizes $r(\delta)$ among all randomized procedures. A randomized minimax procedure minimizes $\max_\theta R(\theta, \delta)$ among all randomized procedures.

We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3.5. We will then indicate how much of what we learn carries over to the general case. As in Example 1.3.5, we represent the risk of any procedure $\delta$ by the vector $(R(\theta_1, \delta), R(\theta_2, \delta))$ and consider the risk set

$$S = \{(R(\theta_1, \delta), R(\theta_2, \delta)) : \delta \in \mathcal{D}^*\}$$

where $\mathcal{D}^*$ is the set of all procedures, including randomized ones. By (1.3.10),

$$S = \Big\{(r_1, r_2) : r_1 = \sum_{i=1}^9 \lambda_i R(\theta_1, \delta_i),\ r_2 = \sum_{i=1}^9 \lambda_i R(\theta_2, \delta_i),\ \lambda_i \ge 0,\ \sum_{i=1}^9 \lambda_i = 1\Big\}.$$

That is, $S$ is the convex hull of the risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$ (Figure 1.3.3).

If $\pi(\theta_1) = \gamma = 1 - \pi(\theta_2)$, $0 \le \gamma \le 1$, then all rules having Bayes risk $c$ correspond to points in $S$ that lie on the line

$$\gamma r_1 + (1 - \gamma) r_2 = c. \tag{1.3.13}$$

As $c$ varies, (1.3.13) defines a family of parallel lines with slope $-\gamma/(1 - \gamma)$. Finding the Bayes rule corresponds to finding the smallest $c$ for which the line (1.3.13) intersects $S$. This is that line with slope $-\gamma/(1 - \gamma)$ that is tangent to $S$ at the lower boundary of $S$. All points of $S$ that are on the tangent are Bayes. Two cases arise:

(1) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. For instance, when $\gamma = 0.2$, this point is $(10, 1)$, which is the risk point of the Bayes rule $\delta_5$ (see Figure 1.3.3).

(2) The tangent is the line connecting two "nonrandomized" risk points $\delta_i$, $\delta_j$. A point $(r_1, r_2)$ on this line can be written

$$r_1 = \lambda R(\theta_1, \delta_i) + (1 - \lambda) R(\theta_1, \delta_j), \qquad r_2 = \lambda R(\theta_2, \delta_i) + (1 - \lambda) R(\theta_2, \delta_j), \tag{1.3.14}$$
[Figure 1.3.3. The convex hull $S$ of the risk points $(R(\theta_1, \delta_i), R(\theta_2, \delta_i))$, $i = 1, \ldots, 9$, plotted with axes $r_1$ and $r_2$. The point where the square $Q(c^*)$ defined by (1.3.16) touches $S$ is the risk point of the minimax rule.]

where $0 \le \lambda \le 1$, and, thus, by (1.3.11) corresponds to the values

$$\delta = \begin{cases} \delta_i & \text{with probability } \lambda \\ \delta_j & \text{with probability } (1 - \lambda), \end{cases} \qquad 0 \le \lambda \le 1. \tag{1.3.15}$$

Each one of these rules, as $\lambda$ ranges from 0 to 1, is Bayes against $\pi$. We can choose two nonrandomized Bayes rules from this class, namely $\delta_i$ (take $\lambda = 1$) and $\delta_j$ (take $\lambda = 0$).

Because changing the prior $\pi$ corresponds to changing the slope $-\gamma/(1 - \gamma)$ of the line given by (1.3.13), the set $B$ of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of $S$ (i.e., all points on the lower boundary of $S$ that have as tangents the $y$ axis or lines with nonpositive slopes).

To locate the risk point of the minimax rule consider the family of squares,

$$Q(c) = \{(r_1, r_2) : 0 \le r_1 \le c,\ 0 \le r_2 \le c\} \tag{1.3.16}$$

whose diagonal is the line $r_1 = r_2$. Let $c^*$ be the smallest $c$ for which $Q(c) \cap S \ne \emptyset$ (i.e., the first square that touches $S$). Then $Q(c^*) \cap S$ is either a point or a horizontal or vertical line segment. See Figure 1.3.3. It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to $Q(c) \cap S$ with $c < c^*$ contradicting the choice of $c^*$. In our example, the first point of contact between the squares and $S$ is the
intersection between $r_1 = r_2$ and the line connecting the two points corresponding to $\delta_4$ and $\delta_6$. Thus, the minimax rule is given by (1.3.14) with $i = 4$, $j = 6$ and $\lambda$ the solution of

$$\lambda R(\theta_1, \delta_4) + (1 - \lambda) R(\theta_1, \delta_6) = \lambda R(\theta_2, \delta_4) + (1 - \lambda) R(\theta_2, \delta_6).$$

From Table 1.3.4, this equation becomes

$$3\lambda + 6.5(1 - \lambda) = 5.4\lambda + 3(1 - \lambda),$$

which yields $\lambda \approx 0.59$.

There is another important concept that we want to discuss in the context of the risk set. A decision rule $\delta$ is said to be inadmissible if there exists another rule $\delta'$ such that $\delta'$ improves $\delta$. Naturally, all rules that are not inadmissible are called admissible. Using Table 1.3.4 we can see, for instance, that $\delta_2$ is inadmissible because $\delta_4$ improves it (i.e., $R(\theta_1, \delta_4) = 3 < 7 = R(\theta_1, \delta_2)$ and $R(\theta_2, \delta_4) = 5.4 < 7.6 = R(\theta_2, \delta_2)$).

To gain some insight into the class of all admissible procedures (randomized and nonrandomized) we again use the risk set. A rule $\delta$ with risk point $(r_1, r_2)$ is admissible, if and only if, there is no other $(x, y)$ in $S$ such that $x \le r_1$ and $y \le r_2$, or equivalently, if and only if, $\{(x, y) : x \le r_1,\ y \le r_2\}$ has only $(r_1, r_2)$ in common with $S$. From the figure it is clear that such points must be on the lower left boundary. In fact, the set of all lower left boundary points of $S$ corresponds to the class of admissible rules and, thus, agrees with the set of risk points of Bayes procedures.

If $\Theta$ is finite, $\Theta = \{\theta_1, \ldots, \theta_k\}$, we can define the risk set in general as

$$S = \{(R(\theta_1, \delta), \ldots, R(\theta_k, \delta)) : \delta \in \mathcal{D}^*\}$$

where $\mathcal{D}^*$ is the set of all randomized decision procedures. The following features exhibited by the risk set by Example 1.3.5 can be shown to hold generally (see Ferguson, 1967, for instance).

(a) For any prior there is always a nonrandomized Bayes procedure, if there is a randomized one.(3) Randomized Bayes procedures are mixtures of nonrandomized ones in the sense of (1.3.14).

(b) The set $B$ of risk points of Bayes procedures consists of risk points on the lower boundary of $S$ whose tangent hyperplanes have normals pointing into the positive quadrant.

(c) If $\Theta$ is finite(4) and minimax procedures exist, they are Bayes procedures.

(d) All admissible procedures are Bayes procedures.

(e) If a Bayes prior has $\pi(\theta_i) > 0$ for all $i$, then any Bayes procedure corresponding to $\pi$ is admissible.

If $\Theta$ is not finite there are typically admissible procedures that are not Bayes. However, under some conditions, all admissible procedures are either Bayes procedures or limits of
Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility.

Other theorems are available characterizing larger but more manageable classes of procedures, which include the admissible rules, at least when procedures with the same risk function are identified. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson, 1967, Section 3.4). We stress that looking at randomized procedures is essential for these conclusions, although it usually turns out that all admissible procedures of interest are indeed nonrandomized. For more information on these topics, we refer to Blackwell and Girshick (1954) and Ferguson (1967).

Summary. We introduce the decision theoretic foundation of statistics including the notions of action space, decision rule, loss function, and risk through various examples including estimation, testing, confidence bounds, ranking, and prediction. The basic bias-variance decomposition of mean square error is presented. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility.

1.4 PREDICTION

The prediction Example 1.3.2 presented important situations in which a vector $z$ of covariates can be used to predict an unseen response $Y$. Here are some further examples of the kind of situation that prompts our study in this section. A college admissions officer has available the College Board scores at entrance and first-year grade point averages of freshman classes for a period of several years. Using this information, he wants to predict the first-year grade point averages of entering freshmen on the basis of their College Board scores. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. A meteorologist wants to estimate the amount of rainfall in the coming spring. A government expert wants to predict the amount of heating oil needed next winter. Similar problems abound in every field. The frame we shall fit them into is the following.

We assume that we know the joint probability distribution of a random vector (or variable) $Z$ and a random variable $Y$. We want to find a function $g$ defined on the range of $Z$ such that $g(Z)$ (the predictor) is "close" to $Y$. In terms of our preceding discussion, $Z$ is the information that we have and $Y$ the quantity to be predicted. For example, in the college admissions situation, $Z$ would be the College Board score of an entering freshman and $Y$ his or her first-year grade point average. The joint distribution of $Z$ and $Y$ can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. Next we must specify what close means. One reasonable measure of "distance" is $(g(Z) - Y)^2$, which is the squared prediction error when $g(Z)$ is used to predict $Y$. Since $Y$ is not known, we turn to the mean squared prediction error (MSPE)

$$\Delta^2(Y, g(Z)) = E[g(Z) - Y]^2$$

or its square root $\sqrt{E(g(Z) - Y)^2}$. The MSPE is the measure traditionally used in the
mathematical theory of prediction whose deeper results (see, for example, Grenander and Rosenblatt, 1957) presuppose it. The method that we employ to prove our elementary theorems does generalize to other measures of distance than $\Delta(Y, g(Z))$ such as the mean absolute error $E(|g(Z) - Y|)$ (Problems 1.4.7-11). Just how widely applicable the notions of this section are will become apparent in Remark 1.4.6 and Section 3.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.

The class $\mathcal{G}$ of possible predictors $g$ may be the nonparametric class $\mathcal{G}_{NP}$ of all $g : R^d \to R$ or it may be some subset of this class. See Remark 1.4.6. In this section we consider $\mathcal{G}_{NP}$ and the class $\mathcal{G}_L$ of linear predictors of the form $a + \sum_{j=1}^d b_j Z_j$.

We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information, or equivalently, in which $Z$ is a constant; see Example 1.3.4. In this situation all predictors are constant and the best one is that number $c_0$ that minimizes $E(Y - c)^2$ as a function of $c$.

Lemma 1.4.1. $E(Y - c)^2$ is either $\infty$ for all $c$ or is minimized uniquely by $c = \mu = E(Y)$. In fact, when $EY^2 < \infty$,

$$E(Y - c)^2 = \operatorname{Var} Y + (c - \mu)^2. \tag{1.4.1}$$

Proof. $EY^2 < \infty$ if and only if $E(Y - c)^2 < \infty$ for all $c$; see Problem 1.4.25. $EY^2 < \infty$ implies that $\mu$ exists, and by expanding

$$Y - c = (Y - \mu) + (\mu - c),$$

(1.4.1) follows because $E(Y - \mu) = 0$ makes the cross product term vanish. We see that $E(Y - c)^2$ has a unique minimum at $c = \mu$ and the lemma follows. □

Now we can solve the problem of finding the best MSPE predictor of $Y$, given a vector $Z$; that is, we can find the $g$ that minimizes $E(Y - g(Z))^2$. By the substitution theorem for conditional expectations (B.1.16), we have

$$E[(Y - g(Z))^2 \mid Z = z] = E[(Y - g(z))^2 \mid Z = z]. \tag{1.4.2}$$

Let

$$\mu(z) = E(Y \mid Z = z).$$

Because $g(z)$ is a constant, Lemma 1.4.1 assures us that

$$E[(Y - g(z))^2 \mid Z = z] = E[(Y - \mu(z))^2 \mid Z = z] + [g(z) - \mu(z)]^2. \tag{1.4.3}$$

If we now take expectations of both sides and employ the double expectation theorem (B.1.20), we can conclude that

Theorem 1.4.1. If $Z$ is any random vector and $Y$ any random variable, then either $E(Y - g(Z))^2 = \infty$ for every function $g$ or

$$E(Y - \mu(Z))^2 \le E(Y - g(Z))^2 \tag{1.4.4}$$
for every $g$ with strict inequality holding unless $g(Z) = \mu(Z)$. That is, $\mu(Z)$ is the unique best MSPE predictor. In fact, when $E(Y^2) < \infty$,

$$E(Y - g(Z))^2 = E(Y - \mu(Z))^2 + E(g(Z) - \mu(Z))^2. \tag{1.4.5}$$

An important special case of (1.4.5) is obtained by taking $g(z) = E(Y)$ for all $z$. Write $\operatorname{Var}(Y \mid z)$ for the variance of the conditional distribution of $Y$ given $Z = z$, that is, $\operatorname{Var}(Y \mid z) = E([Y - E(Y \mid z)]^2 \mid z)$, and recall (B.1.20), then (1.4.5) becomes,

$$\operatorname{Var} Y = E(\operatorname{Var}(Y \mid Z)) + \operatorname{Var}(E(Y \mid Z)), \tag{1.4.6}$$

which is generally valid because if one side is infinite, so is the other.

Property (1.4.6) is linked to a notion that we now define: Two random variables $U$ and $V$ with $E|UV| < \infty$ are said to be uncorrelated if

$$E[V - E(V)][U - E(U)] = 0.$$

Equivalently $U$ and $V$ are uncorrelated if either $EV[U - E(U)] = 0$ or $EU[V - E(V)] = 0$.

Let $\epsilon = Y - \mu(Z)$ denote the random prediction error, then we can write

$$Y = \mu(Z) + \epsilon.$$

Proposition 1.4.1. Suppose that $\operatorname{Var} Y < \infty$, then
(a) $\epsilon$ is uncorrelated with every function of $Z$
(b) $\mu(Z)$ and $\epsilon$ are uncorrelated
(c) $\operatorname{Var}(Y) = \operatorname{Var} \mu(Z) + \operatorname{Var} \epsilon$.

Proof. To show (a), let $h(Z)$ be any function of $Z$, then by the iterated expectation theorem,

$$E\{h(Z)\epsilon\} = E\{E[h(Z)\epsilon \mid Z]\} = E\{h(Z)E[Y - \mu(Z) \mid Z]\} = 0$$

because $E[Y - \mu(Z) \mid Z] = \mu(Z) - \mu(Z) = 0$. Properties (b) and (c) follow from (a). □

Note that Proposition 1.4.1(c) is equivalent to (1.4.6) and that (1.4.5) follows from (a) because (a) implies that the cross product term in the expansion of $E\{[Y - \mu(Z)] + [\mu(Z) - g(Z)]\}^2$ vanishes.

As a consequence of (1.4.6), we can derive the following theorem, which will prove of importance in estimation theory.

Theorem 1.4.2. If $E(|Y|) < \infty$ but $Z$ and $Y$ are otherwise arbitrary, then

$$\operatorname{Var}(E(Y \mid Z)) \le \operatorname{Var} Y. \tag{1.4.7}$$

If $\operatorname{Var} Y < \infty$ strict inequality holds unless

$$Y = E(Y \mid Z) \tag{1.4.8}$$

or equivalently unless $Y$ is a function of $Z$.
Proof. The assertion (1.4.7) follows immediately from (1.4.6). Equality in (1.4.7) can hold if, and only if,

$$E(\operatorname{Var}(Y \mid Z)) = E(Y - E(Y \mid Z))^2 = 0.$$

By (A.11.9) this can hold if, and only if, (1.4.8) is true. □

Example 1.4.1. An assembly line operates either at full, half, or quarter capacity. Within any given month the capacity status does not change. Each day there can be 0, 1, 2, or 3 shutdowns due to mechanical failure. The following table gives the frequency function $p(z, y) = P(Z = z, Y = y)$ of the number of shutdowns $Y$ and the capacity state $Z$ of the line for a randomly chosen day. The row sums of the entries $p_Z(z)$ (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state, whereas the column sums $p_Y(y)$ yield the frequency of 0, 1, 2, or 3 failures among all days. We want to predict the number of failures for a given day knowing the state of the assembly line for the month.

    z \ y       0       1       2       3     p_Z(z)
    1/4       0.10    0.05    0.05    0.05    0.25
    1/2       0.025   0.025   0.10    0.10    0.25
    1         0.025   0.025   0.15    0.30    0.50
    p_Y(y)    0.15    0.10    0.30    0.45    1

We find

$$E(Y \mid Z = 1) = \sum_{i=1}^3 i P[Y = i \mid Z = 1] = 2.45, \qquad E(Y \mid Z = \tfrac{1}{2}) = 2.10, \qquad E(Y \mid Z = \tfrac{1}{4}) = 1.20.$$

These fractional figures are not too meaningful as predictors of the natural number values of $Y$. But this predictor is also the right one, if we are trying to guess, as we reasonably might, the average number of failures per day in a given month. In this case if $Y_i$ represents the number of failures on day $i$ and $Z$ the state of the assembly line, the best predictor is $E\big(30^{-1}\sum_{i=1}^{30} Y_i \mid Z\big) = E(Y \mid Z)$, also.

The MSPE of the best predictor can be calculated in two ways. The first is direct.

$$E(Y - E(Y \mid Z))^2 = \sum_z \sum_{y=0}^3 (y - E(Y \mid Z = z))^2 p(z, y) = 0.885.$$
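The direct calculation in Example 1.4.1 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original text: the dictionary `p` holds the joint frequencies of the example's table as read here, and the helper names (`marginal_z`, `conditional_mean`, `mspe_direct`, `mspe_via_variance`) are illustrative. The last helper checks the same number through the decomposition (1.4.6).

```python
# Joint frequency function p(z, y) of Example 1.4.1 (values as read from the table).
# Keys: capacity state z; list position: number of shutdowns y = 0, 1, 2, 3.
p = {
    0.25: [0.10, 0.05, 0.05, 0.05],
    0.50: [0.025, 0.025, 0.10, 0.10],
    1.00: [0.025, 0.025, 0.15, 0.30],
}

def marginal_z(p):
    """Row sums p_Z(z)."""
    return {z: sum(row) for z, row in p.items()}

def conditional_mean(p):
    """Best MSPE predictor mu(z) = E(Y | Z = z) for each capacity state."""
    pz = marginal_z(p)
    return {z: sum(y * pzy for y, pzy in enumerate(row)) / pz[z]
            for z, row in p.items()}

def mspe_direct(p):
    """E(Y - E(Y | Z))^2 computed by summing over the joint table."""
    mu = conditional_mean(p)
    return sum((y - mu[z]) ** 2 * pzy
               for z, row in p.items() for y, pzy in enumerate(row))

def mspe_via_variance(p):
    """The same quantity as E(Y^2) - E[(E(Y | Z))^2], which follows from (1.4.6)."""
    mu, pz = conditional_mean(p), marginal_z(p)
    ey2 = sum(y * y * pzy for row in p.values() for y, pzy in enumerate(row))
    return ey2 - sum(mu[z] ** 2 * pz[z] for z in p)

if __name__ == "__main__":
    print(conditional_mean(p))
    print(mspe_direct(p), mspe_via_variance(p))
```

With these table entries both routes give about 0.886, which agrees with the 0.885 quoted in the text up to rounding.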
The second way is to use (1.4.6) writing,

$$E(Y - E(Y \mid Z))^2 = \operatorname{Var} Y - \operatorname{Var}(E(Y \mid Z)) = E(Y^2) - E[(E(Y \mid Z))^2] = \sum_y y^2 p_Y(y) - \sum_z [E(Y \mid Z = z)]^2 p_Z(z) = 0.885$$

as before. □

Example 1.4.2. The Bivariate Normal Distribution. Regression toward the mean. If $(Z, Y)$ has a $N(\mu_Z, \mu_Y, \sigma_Z^2, \sigma_Y^2, \rho)$ distribution, Theorem B.4.2 tells us that the conditional distribution of $Y$ given $Z = z$ is $N\big(\mu_Y + \rho(\sigma_Y/\sigma_Z)(z - \mu_Z),\ \sigma_Y^2(1 - \rho^2)\big)$. Therefore, the best predictor of $Y$ using $Z$ is the linear function

$$\mu_0(Z) = \mu_Y + \rho(\sigma_Y/\sigma_Z)(Z - \mu_Z).$$

Because

$$E((Y - E(Y \mid Z = z))^2 \mid Z = z) = \sigma_Y^2(1 - \rho^2) \tag{1.4.9}$$

is independent of $z$, the MSPE of our predictor is given by,

$$E(Y - E(Y \mid Z))^2 = \sigma_Y^2(1 - \rho^2). \tag{1.4.10}$$

The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. If $\rho > 0$, the predictor is a monotone increasing function of $Z$ indicating that large (small) values of $Y$ tend to be associated with large (small) values of $Z$. Similarly, $\rho < 0$ indicates that large values of $Z$ tend to go with small values of $Y$ and we have negative dependence. If $\rho = 0$, the best predictor is just the constant $\mu_Y$ as we would expect in the case of independence. One minus the ratio of the MSPE of the best predictor of $Y$ given $Z$ to $\operatorname{Var} Y$, which is the MSPE of the best constant predictor, can reasonably be thought of as a measure of dependence. The larger this quantity the more dependent $Z$ and $Y$ are.

In the bivariate normal case, this quantity is just $\rho^2$. Thus, for this family of distributions the sign of the correlation coefficient gives the type of dependence between $Z$ and $Y$, whereas its magnitude measures the degree of such dependence.

Because of (1.4.6) we can also write

$$\rho^2 = \frac{\operatorname{Var} \mu_0(Z)}{\operatorname{Var} Y}. \tag{1.4.11}$$

The line $y = \mu_Y + \rho(\sigma_Y/\sigma_Z)(z - \mu_Z)$, which corresponds to the best predictor of $Y$ given $Z$ in the bivariate normal model, is usually called the regression (line) of $Y$ on $Z$.

The term regression was coined by Francis Galton and is based on the following observation. Suppose $Y$ and $Z$ are bivariate normal random variables with the same mean
$\mu$, variance $\sigma^2$, and positive correlation $\rho$. In Galton's case, these were the heights of a randomly selected father ($Z$) and his son ($Y$) from a large human population. Then the predicted height of the son, or the average height of sons whose fathers are the height $Z$, $(1 - \rho)\mu + \rho Z$, is closer to the population mean of heights $\mu$ than is the height of the father. Thus, tall fathers tend to have shorter sons; there is "regression toward the mean." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. The variability of the predicted value about $\mu$ should, consequently, be less than that of the actual heights and indeed $\operatorname{Var}((1 - \rho)\mu + \rho Z) = \rho^2\sigma^2$. Note that in practice, in particular in Galton's studies, the distribution of $(Z, Y)$ is unavailable and the regression line is estimated on the basis of a sample $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ from the population. We shall see how to do this in Chapter 2. □

Example 1.4.3. The Multivariate Normal Distribution. Let $Z = (Z_1, \ldots, Z_d)^T$ be a $d \times 1$ covariate vector with mean $\mu_Z = (\mu_1, \ldots, \mu_d)^T$ and suppose that $(Z^T, Y)^T$ has a $(d + 1)$ multivariate normal, $N_{d+1}(\mu, \Sigma)$, distribution (Section B.6) in which $\mu = (\mu_Z^T, \mu_Y)^T$, $\mu_Y = E(Y)$,

$$\Sigma = \begin{pmatrix} \Sigma_{ZZ} & \Sigma_{ZY} \\ \Sigma_{YZ} & \sigma_{YY} \end{pmatrix},$$

$\Sigma_{ZZ}$ is the $d \times d$ variance-covariance matrix $\operatorname{Var}(Z)$,

$$\Sigma_{ZY} = (\operatorname{Cov}(Z_1, Y), \ldots, \operatorname{Cov}(Z_d, Y))^T = \Sigma_{YZ}^T$$

and $\sigma_{YY} = \operatorname{Var}(Y)$. Theorem B.6.5 states that the conditional distribution of $Y$ given $Z = z$ is $N(\mu_Y + (z - \mu_Z)^T\beta,\ \sigma_{YY|z})$ where $\beta = \Sigma_{ZZ}^{-1}\Sigma_{ZY}$ and $\sigma_{YY|z} = \sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$. Thus, the best predictor $E(Y \mid Z)$ of $Y$ is the linear function

$$\mu_0(Z) = \mu_Y + (Z - \mu_Z)^T\beta \tag{1.4.12}$$

with MSPE

$$E[Y - \mu_0(Z)]^2 = E\{E[(Y - \mu_0(Z))^2 \mid Z]\} = E(\sigma_{YY|z}) = \sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}.$$

The quadratic form $\Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}$ is positive except when the joint normal distribution is degenerate, so the MSPE of $\mu_0(Z)$ is smaller than the MSPE of the constant predictor $\mu_Y$. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with $Y$. This quantity is called the multiple correlation coefficient (MCC), coefficient of determination or population R-squared. We write

$$\text{MCC} = \rho_{ZY}^2 = 1 - \frac{E[Y - \mu_0(Z)]^2}{\operatorname{Var} Y} = \frac{\operatorname{Var} \mu_0(Z)}{\operatorname{Var} Y},$$

where the last identity follows from (1.4.6). By (1.4.11), the MCC equals the square of the usual correlation coefficient $\rho = \sigma_{ZY}/(\sigma_{YY}^{1/2}\sigma_{ZZ}^{1/2})$ when $d = 1$.
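As a numerical sketch of these formulas for $d = 2$, the following Python computes $\beta = \Sigma_{ZZ}^{-1}\Sigma_{ZY}$ and the MCC by inverting the $2 \times 2$ matrix by hand. The function name `mcc_2d` and the covariance values are assumptions chosen for illustration (heights in inches of a girl and her two parents, with the entries of $\Sigma_{ZZ}$ taken here as 7.74, 2.92, 6.67); they should be checked against the printed source before being relied on.

```python
def mcc_2d(var_y, s_zz, s_zy):
    """Return (beta, MCC) for two covariates.

    beta = Sigma_ZZ^{-1} Sigma_ZY and
    MCC  = Sigma_YZ Sigma_ZZ^{-1} Sigma_ZY / Var(Y).
    s_zz is a 2x2 nested tuple, s_zy a pair of covariances with Y.
    """
    (a, b), (c, d) = s_zz
    det = a * d - b * c
    if det == 0:
        raise ValueError("Sigma_ZZ is singular")
    # Closed-form inverse of a 2x2 matrix applied to Sigma_ZY.
    beta = ((d * s_zy[0] - b * s_zy[1]) / det,
            (-c * s_zy[0] + a * s_zy[1]) / det)
    explained = s_zy[0] * beta[0] + s_zy[1] * beta[1]
    return beta, explained / var_y

# Assumed illustrative values (not verified against the printed matrix).
VAR_Y = 6.39
S_ZZ = ((7.74, 2.92), (2.92, 6.67))
S_ZY = (4.07, 2.98)

if __name__ == "__main__":
    beta, mcc = mcc_2d(VAR_Y, S_ZZ, S_ZY)
    # One-covariate case: squared correlation of Y with each parent separately.
    rho2_mother = S_ZY[0] ** 2 / (S_ZZ[0][0] * VAR_Y)
    rho2_father = S_ZY[1] ** 2 / (S_ZZ[1][1] * VAR_Y)
    print(beta, mcc, rho2_mother, rho2_father)
```

Under these assumed inputs the MCC comes out near 0.39, and the single-covariate values near 0.33 and 0.21, so using both covariates reduces the MSPE more than either one alone.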
For example, let $Y$ and $Z = (Z_1, Z_2)^T$ be the heights in inches of a 10-year-old girl and her parents ($Z_1$ = mother's height, $Z_2$ = father's height). Suppose(1) that $(Z^T, Y)^T$ is trivariate normal with $\operatorname{Var}(Y) = 6.39$,

$$\Sigma_{ZZ} = \begin{pmatrix} 7.74 & 2.92 \\ 2.92 & 6.67 \end{pmatrix}, \qquad \Sigma_{ZY} = (4.07, 2.98)^T.$$

Then the strength of association between a girl's height and those of her mother and father, respectively; and parents, are

$$\rho_{Z_1,Y}^2 = .335, \qquad \rho_{Z_2,Y}^2 = .209, \qquad \rho_{ZY}^2 = .393.$$

In words, knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.5%. The percentage reductions knowing the father's and both parents' heights are 20.9% and 39.3%, respectively. In practice, when the distribution of $(Z^T, Y)^T$ is unknown, the linear predictor $\mu_0(Z)$ and its MSPE will be estimated using a sample $(Z_1^T, Y_1)^T, \ldots, (Z_n^T, Y_n)^T$. See Sections 2.1 and 2.2. □

The best linear predictor. The problem of finding the best MSPE predictor is solved by Theorem 1.4.1. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of $Z$ and $Y$ in order to calculate $E(Y \mid Z)$ and that the best predictor may be a complicated function of $Z$. If we are willing to sacrifice absolute excellence, we can avoid both objections by looking for a predictor that is best within a class of simple predictors. The natural class to begin with is that of linear combinations of components of $Z$. We first do the one-dimensional case.

Let us call any random variable of the form $a + bZ$ a linear predictor and any such variable with $a = 0$ a zero intercept linear predictor. What is the best (zero intercept) linear predictor of $Y$ in the sense of minimizing MSPE? The answer is given by:

Theorem 1.4.3. Suppose that $E(Z^2)$ and $E(Y^2)$ are finite and $Z$ and $Y$ are not constant. Then the unique best zero intercept linear predictor is obtained by taking

$$b = b_0 = \frac{E(ZY)}{E(Z^2)},$$

whereas the unique best linear predictor is $\mu_L(Z) = a_1 + b_1 Z$ where

$$b_1 = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Var} Z}, \qquad a_1 = E(Y) - b_1 E(Z).$$

Proof. We expand $\{Y - bZ\}^2 = \{Y - [Z(b - b_0) + Zb_0]\}^2$ to get

$$E(Y - bZ)^2 = E(Y^2) + E(Z^2)(b - b_0)^2 - E(Z^2)b_0^2.$$

Therefore, $E(Y - bZ)^2$ is uniquely minimized by $b = b_0$, and

$$E(Y - b_0 Z)^2 = E(Y^2) - \frac{[E(ZY)]^2}{E(Z^2)}. \tag{1.4.13}$$
To prove the second assertion of the theorem note that by (1.4.1), whatever be $b$, $E(Y - a - bZ)^2$ is uniquely minimized by taking $a = E(Y) - bE(Z)$. Substituting this value of $a$ in $E(Y - a - bZ)^2$ we see that the $b$ we seek minimizes $E[(Y - E(Y)) - b(Z - E(Z))]^2$. We can now apply the result on zero intercept linear predictors to the variables $Z - E(Z)$ and $Y - E(Y)$ to conclude that $b_1$ is the unique minimizing value. □

Remark 1.4.1. From (1.4.13) we obtain the proof of the Cauchy-Schwarz inequality (A.11.17) in the appendix. This is because $E(Y - b_0 Z)^2 \ge 0$ is equivalent to the Cauchy-Schwarz inequality with equality holding if, and only if, $E(Y - b_0 Z)^2 = 0$, which corresponds to $Y = b_0 Z$. We could similarly obtain (A.11.16) directly by calculating $E(Y - a_1 - b_1 Z)^2$. □

Note that if $E(Y \mid Z)$ is of the form $a + bZ$, then $a = a_1$ and $b = b_1$, because, by (1.4.5), if the best predictor is linear, it must coincide with the best linear predictor. This is in accordance with our evaluation of $E(Y \mid Z)$ in Example 1.4.2. In that example nothing is lost by using linear prediction. On the other hand, in Example 1.4.1 the best linear predictor and best predictor differ (see Figure 1.4.1). A loss of about 5% is incurred by using the best linear predictor. That is,

$$\frac{E[Y - \mu_L(Z)]^2}{E[Y - \mu(Z)]^2} = 1.05.$$

Best Multivariate Linear Predictor. Our linear predictor is of the form

$$\mu_l(Z) = a + \sum_{j=1}^d b_j Z_j = a + Z^T b$$

where $Z = (Z_1, \ldots, Z_d)^T$ and $b = (b_1, \ldots, b_d)^T$. Let

$$\beta = \big(E([Z - E(Z)][Z - E(Z)]^T)\big)^{-1} E([Z - E(Z)][Y - E(Y)]) = \Sigma_{ZZ}^{-1}\Sigma_{ZY}.$$

Theorem 1.4.4. If $EY^2$ and $\big(E([Z - E(Z)][Z - E(Z)]^T)\big)^{-1}$ exist, then the unique best linear MSPE predictor is

$$\mu_L(Z) = \mu_Y + (Z - \mu_Z)^T\beta. \tag{1.4.14}$$

Proof. Note that $R(a, b) \equiv E_P[Y - \mu_l(Z)]^2$ depends on the joint distribution $P$ of $X = (Z^T, Y)^T$ only through the expectation $\mu$ and covariance $\Sigma$ of $X$. Let $P_0$ denote
[Figure 1.4.1. The three dots give the best predictor. The line represents the best linear predictor $y = 1.05 + 1.45z$.]

the multivariate normal, $N(\mu, \Sigma)$, distribution and let $R_0(a, b) = E_{P_0}[Y - \mu_l(Z)]^2$. By Example 1.4.3, $R_0(a, b)$ is minimized by (1.4.14). Because $P$ and $P_0$ have the same $\mu$ and $\Sigma$, $R(a, b) = R_0(a, b)$, and $R(a, b)$ is minimized by (1.4.14). □

Remark 1.4.2. We could also have established Theorem 1.4.4 by extending the proof of Theorem 1.4.3 to $d > 1$. However, our new proof shows how second-moment results sometimes can be established by "connecting" them to the normal distribution. A third approach using calculus is given in Problem 1.4.19. □

Remark 1.4.3. In the general, not necessarily normal, case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the squared correlation between $Y$ and the best linear predictor of $Y$; that is,

$$\rho_{ZY}^2 = \operatorname{Corr}^2(Y, \mu_L(Z)).$$

Thus, the MCC gives the strength of the linear relationship between $Z$ and $Y$. See Problem 1.4.17 for an overall measure of the strength of this relationship. □

Remark 1.4.4. Suppose the model for $\mu(Z)$ is linear; that is,

$$\mu(Z) = E(Y \mid Z) = \alpha + Z^T\beta$$

for unknown $\alpha \in R$ and $\beta \in R^d$. We want to express $\alpha$ and $\beta$ in terms of moments of $(Z, Y)$. Set $Z_0 = 1$. By Proposition 1.4.1(a), $\epsilon = Y - \mu(Z)$ and each of $Z_0, \ldots, Z_d$ are uncorrelated; thus,

$$E(Z_j[Y - (\alpha + Z^T\beta)]) = 0, \qquad j = 0, \ldots, d. \tag{1.4.15}$$
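For $d = 1$ the moment equations (1.4.15) reduce to two linear equations in the intercept and slope, and solving them gives the best linear predictor whether or not $E(Y \mid Z)$ is itself linear. The sketch below, which is an illustration rather than part of the original text, applies this to the assembly-line frequencies of Example 1.4.1 (the table entries are restated here as read from that example; the function names are hypothetical).

```python
# Joint frequencies p(z, y) of Example 1.4.1, restated as an assumption.
p = {
    0.25: [0.10, 0.05, 0.05, 0.05],
    0.50: [0.025, 0.025, 0.10, 0.10],
    1.00: [0.025, 0.025, 0.15, 0.30],
}

def moment(p, f):
    """E f(Z, Y) under the joint frequency function p."""
    return sum(f(z, y) * pzy for z, row in p.items() for y, pzy in enumerate(row))

def best_linear_predictor(p):
    """Solve E[Y - a - bZ] = 0 and E[Z(Y - a - bZ)] = 0 for (a, b)."""
    ez = moment(p, lambda z, y: z)
    ey = moment(p, lambda z, y: y)
    ez2 = moment(p, lambda z, y: z * z)
    ezy = moment(p, lambda z, y: z * y)
    b = (ezy - ez * ey) / (ez2 - ez * ez)   # Cov(Z, Y) / Var Z
    a = ey - b * ez
    return a, b

def mspe(p, predictor):
    """Mean squared prediction error of a predictor g(z)."""
    return moment(p, lambda z, y: (y - predictor(z)) ** 2)

if __name__ == "__main__":
    a, b = best_linear_predictor(p)
    linear_mspe = mspe(p, lambda z: a + b * z)
    # Best (nonlinear) predictor: the conditional means of the example.
    cond_mean = {z: sum(y * q for y, q in enumerate(row)) / sum(row)
                 for z, row in p.items()}
    best_mspe = mspe(p, lambda z: cond_mean[z])
    print(a, b, linear_mspe / best_mspe)
```

The solution is close to the line $y = 1.05 + 1.45z$ of Figure 1.4.1, and the MSPE ratio comes out a little above 1.04, in line with the loss of roughly 5% quoted for the best linear predictor.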
Solving (1.4.15) for $\alpha$ and $\beta$ gives (1.4.14) (Problem 1.4.23). Because the multivariate normal model is a linear model, this gives a new derivation of (1.4.12).

Remark 1.4.5. Consider the Bayesian model of Section 1.2 and the Bayes risk (1.3.8) defined by $r(\delta) = E[l(\theta, \delta(X))]$. If we identify $\theta$ with $Y$ and $X$ with $Z$, we see that $r(\delta)$ = MSPE for squared error loss $l(\theta, \delta) = (\theta - \delta)^2$. Thus, the optimal MSPE predictor $E(\theta \mid X)$ is the Bayes procedure for squared error loss. We return to this in Section 3.2. □

Remark 1.4.6. When the class $\mathcal{G}$ of possible predictors $g$ with $E|g(Z)|^2 < \infty$ form a Hilbert space as defined in Section B.10 and there is a $g_0 \in \mathcal{G}$ such that

$$g_0 = \arg\inf\{\Delta(Y, g(Z)) : g \in \mathcal{G}\},$$

then $g_0(Z)$ is called the projection of $Y$ on the space $\mathcal{G}$ of functions of $Z$ and we write $g_0(Z) = \pi(Y \mid \mathcal{G})$. Moreover, $g(Z)$ and $h(Z)$ are said to be orthogonal if at least one has expected value zero and $E[g(Z)h(Z)] = 0$. With these concepts the results of this section are linked to the general Hilbert space results of Section B.10. Using the distance $\Delta$ and projection $\pi$ notation, we can conclude that

$$\mu(Z) = \pi(Y \mid \mathcal{G}_{NP}), \qquad \mu_L(Z) = \pi(Y \mid \mathcal{G}_L) = \pi(\mu(Z) \mid \mathcal{G}_L)$$

$$\Delta^2(Y, \mu_L(Z)) = \Delta^2(\mu_L(Z), \mu(Z)) + \Delta^2(Y, \mu(Z)) \tag{1.4.16}$$

$$Y - \mu(Z) \text{ is orthogonal to } \mu(Z) \text{ and to } \mu_L(Z). \tag{1.4.17}$$

Note that (1.4.16) is the Pythagorean identity. □

Summary. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable $Y$. The notion of mean squared prediction error (MSPE) is introduced, and it is shown that if we want to predict $Y$ on the basis of information contained in a random vector $Z$, the optimal MSPE predictor is the conditional expected value of $Y$ given $Z$. The optimal MSPE predictor in the multivariate normal distribution is presented. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear.

1.5 SUFFICIENCY

Once we have postulated a statistical model, we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation.

We begin by formalizing what we mean by "a reduction of the data" $X \in \mathcal{X}$. Recall that a statistic is any function of the observations generically denoted by $T(X)$ or $T$. The range of $T$ is any space of objects $\mathcal{T}$, usually $R$ or $R^k$, but as we have seen in Section 1.1.3, can also be a set of functions. If $T$ assigns the same value to different sample points, then by recording or taking into account only the value of $T(X)$ we have a reduction of the data. Thus, $T(X) = \bar{X}$ loses information about the $X_i$ as soon as $n > 1$.
Even $T(X_1, \ldots, X_n) = (X_{(1)}, \ldots, X_{(n)})$ loses information about the labels of the $X_i$. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information, in the context of a model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

For instance, suppose that in Example 1.1.1 we had sampled the manufactured items in order, recording at each stage whether the examined item was defective or not. We could then represent the data by a vector $X = (X_1, \ldots, X_n)$ where $X_i = 1$ if the $i$th item sampled is defective and $X_i = 0$ otherwise. The total number of defective items observed, $T = \sum_{i=1}^n X_i$, is a statistic that maps many different values of $(X_1, \ldots, X_n)$ into the same number. However, it is intuitively clear that if we are interested in the proportion $\theta$ of defective items nothing is lost in this situation by recording and using only $T$.

One way of making the notion "a statistic whose use involves no loss of information" precise is the following. A statistic $T(X)$ is called sufficient for $P \in \mathcal{P}$ or the parameter $\theta$ if the conditional distribution of $X$ given $T(X) = t$ does not involve $\theta$. Thus, once the value of a sufficient statistic $T$ is known, the sample $X = (X_1, \ldots, X_n)$ does not contain any further information about $\theta$ or equivalently $P$, given that $\mathcal{P}$ is valid. We give a decision theory interpretation that follows. The most trivial example of a sufficient statistic is $T(X) = X$ because by any interpretation the conditional distribution of $X$ given $T(X) = X$ is point mass at $X$.

Example 1.5.1. A machine produces $n$ items in succession. Each item produced is good with probability $\theta$ and defective with probability $1 - \theta$, where $\theta$ is unknown. Suppose there is no dependence between the quality of the items produced and let $X_i = 1$ if the $i$th item is good and 0 otherwise. Then $X = (X_1, \ldots, X_n)$ is the record of $n$ Bernoulli trials with probability $\theta$. By (A.9.5),

$$P[X_1 = x_1, \ldots, X_n = x_n] = \theta^t(1 - \theta)^{n-t} \tag{1.5.1}$$

where $x_i$ is 0 or 1 and $t = \sum_{i=1}^n x_i$. By Example B.1.1, the conditional distribution of $X$ given $T = \sum_{i=1}^n X_i = t$ does not involve $\theta$. Thus, $T$ is a sufficient statistic for $\theta$. □

Example 1.5.2. Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) $\theta$. Let $X_1$ be the time of arrival of the first customer, $X_2$ the time between the arrival of the first and second customers. By (A.16.4), $X_1$ and $X_2$ are independent and identically distributed exponential random variables with parameter $\theta$. We prove that $T = X_1 + X_2$ is sufficient for $\theta$. Begin by noting that according to Theorem B.2.3, whatever be $\theta$, $X_1/(X_1 + X_2)$ and $X_1 + X_2$ are independent and the first of these statistics has a uniform distribution on $(0, 1)$. Therefore, the conditional distribution of $X_1/(X_1 + X_2)$ given $X_1 + X_2 = t$ is $U(0, 1)$ whatever be $t$. Using our discussion in Section B.1.1 we see that given $X_1 + X_2 = t$, the conditional distribution of $X_1 = [X_1/(X_1 + X_2)](X_1 + X_2)$ and that of $X_1 t/(X_1 + X_2)$ are the same and we can conclude that given $X_1 + X_2 = t$, $X_1$ has a $U(0, t)$ distribution. It follows that, when $X_1 + X_2 = t$, whatever be $\theta$, $(X_1, X_2)$ is conditionally distributed as $(X, Y)$ where $X$ is uniform on $(0, t)$ and $Y = t - X$. Thus, $X_1 + X_2$ is sufficient. □

In both of the foregoing examples considerable reduction has been achieved. Instead of keeping track of several numbers, we need only record one. Although the sufficient statistics we have obtained are "natural," it is important to notice that there are many others
that will do the same job. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one. More generally, if $T_1$ and $T_2$ are any two statistics such that $T_1(x) = T_1(y)$ if and only if $T_2(x) = T_2(y)$, then $T_1$ and $T_2$ provide the same information and achieve the same reduction of the data. Such statistics are called equivalent.

In general, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. It is often referred to as the factorization theorem for sufficient statistics.

Theorem 1.5.1. In a regular model, a statistic $T(X)$ with range $\mathcal{T}$ is sufficient for $\theta$ if, and only if, there exists a function $g(t, \theta)$ defined for $t$ in $\mathcal{T}$ and $\theta$ in $\Theta$ and a function $h$ defined on $\mathcal{X}$ such that
$$p(x, \theta) = g(T(x), \theta) h(x) \tag{1.5.2}$$
for all $x \in \mathcal{X}$, $\theta \in \Theta$.

We shall give the proof in the discrete case. The complete result is established for instance by Lehmann (1997, Section 2.6).

Proof. Let $(x_1, x_2, \ldots)$ be the set of possible realizations of $X$ and let $t_i = T(x_i)$. Then $T$ is discrete and $\sum_{i=1}^\infty P_\theta[T = t_i] = 1$ for every $\theta$. To prove the sufficiency of (1.5.2), we need only show that $P_\theta[X = x_j \mid T = t_i]$ is independent of $\theta$ for every $i$ and $j$. By our definition of conditional probability in the discrete case, it is enough to show that $P_\theta[X = x_j \mid T = t_i]$ is independent of $\theta$ on each of the sets $S_i = \{\theta : P_\theta[T = t_i] > 0\}$, $i = 1, 2, \ldots$. Now, if (1.5.2) holds,
$$P_\theta[T = t_i] = \sum_{\{x : T(x) = t_i\}} p(x, \theta) = g(t_i, \theta) \sum_{\{x : T(x) = t_i\}} h(x). \tag{1.5.3}$$
By (B.1.1) and (1.5.2), for $\theta \in S_i$,
$$P_\theta[X = x_j \mid T = t_i] = \frac{P_\theta[X = x_j, T = t_i]}{P_\theta[T = t_i]} = \begin{cases} \dfrac{p(x_j, \theta)}{P_\theta[T = t_i]} = \dfrac{g(t_i, \theta) h(x_j)}{P_\theta[T = t_i]} & \text{if } T(x_j) = t_i \\ 0 & \text{if } T(x_j) \neq t_i. \end{cases} \tag{1.5.4}$$
Applying (1.5.3) we arrive at,
$$P_\theta[X = x_j \mid T = t_i] = \begin{cases} 0 & \text{if } T(x_j) \neq t_i \\ \dfrac{h(x_j)}{\sum_{\{x : T(x) = t_i\}} h(x)} & \text{if } T(x_j) = t_i. \end{cases} \tag{1.5.5}$$
Therefore, $T$ is sufficient. Conversely, if $T$ is sufficient, let
$$g(t_i, \theta) = P_\theta[T = t_i], \quad h(x) = P[X = x \mid T(X) = t_i]. \tag{1.5.6}$$
Then
$$p(x, \theta) = P_\theta[X = x, T = T(x)] = g(T(x), \theta) h(x). \tag{1.5.7}$$
□

Example 1.5.2 (continued). If $X_1, \ldots, X_n$ are the interarrival times for $n$ customers, then the joint density of $(X_1, \ldots, X_n)$ is given by (see (A.16.4)),
$$p(x_1, \ldots, x_n, \theta) = \theta^n \exp\Big[-\theta \sum_{i=1}^n x_i\Big] \tag{1.5.8}$$
if all the $x_i$ are $> 0$, and $p(x_1, \ldots, x_n, \theta) = 0$ otherwise. We may apply Theorem 1.5.1 to conclude that $T(X_1, \ldots, X_n) = \sum_{i=1}^n X_i$ is sufficient. Take $g(t, \theta) = \theta^n e^{-\theta t}$ if $t > 0$, $\theta > 0$, and $h(x_1, \ldots, x_n) = 1$ if all the $x_i$ are $> 0$, and both functions $= 0$ otherwise. A whole class of distributions, which admits simple sufficient statistics and to which this example belongs, is introduced in the next section. □

Example 1.5.3. Estimating the Size of a Population. Consider a population with $\theta$ members labeled consecutively from 1 to $\theta$. The population is sampled with replacement and $n$ members of the population are observed and their labels $X_1, \ldots, X_n$ are recorded. Common sense indicates that to get information about $\theta$, we need only keep track of $X_{(n)} = \max(X_1, \ldots, X_n)$. In fact, we can show that $X_{(n)}$ is sufficient. The probability distribution of $X$ is given by
$$p(x_1, \ldots, x_n, \theta) = \theta^{-n} \tag{1.5.9}$$
if every $x_i$ is an integer between 1 and $\theta$ and $p(x_1, \ldots, x_n, \theta) = 0$ otherwise. Expression (1.5.9) can be rewritten as
$$p(x_1, \ldots, x_n, \theta) = \theta^{-n} 1\{x_{(n)} \le \theta\}, \tag{1.5.10}$$
where $x_{(n)} = \max(x_1, \ldots, x_n)$. By Theorem 1.5.1, $X_{(n)}$ is a sufficient statistic for $\theta$. □

Example 1.5.4. Let $X_1, \ldots, X_n$ be independent and identically distributed random variables each having a normal distribution with mean $\mu$ and variance $\sigma^2$, both of which are unknown. Let $\theta = (\mu, \sigma^2)$. Then the density of $(X_1, \ldots, X_n)$ is given by
$$p(x_1, \ldots, x_n, \theta) = [2\pi\sigma^2]^{-n/2} \exp\Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\Big\} = [2\pi\sigma^2]^{-n/2} \Big[\exp\Big\{-\frac{n\mu^2}{2\sigma^2}\Big\}\Big] \Big[\exp\Big\{-\frac{1}{2\sigma^2}\Big(\sum_{i=1}^n x_i^2 - 2\mu \sum_{i=1}^n x_i\Big)\Big\}\Big]. \tag{1.5.11}$$
Evidently $p(x_1, \ldots, x_n, \theta)$ is itself a function of $(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2)$ and $\theta$ only and upon applying Theorem 1.5.1 we can conclude that
$$T(X_1, \ldots, X_n) = \Big(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\Big)$$
is sufficient for $\theta$. An equivalent sufficient statistic in this situation that is frequently used is
$$S(X_1, \ldots, X_n) = \Big[(1/n) \sum_{i=1}^n X_i, \; [1/(n-1)] \sum_{i=1}^n (X_i - \bar{X})^2\Big],$$
where $\bar{X} = (1/n) \sum_{i=1}^n X_i$. The first and second components of this vector are called the sample mean and the sample variance, respectively. □

Example 1.5.5. Suppose, as in Example 1.1.4 with $d = 2$, that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, with $\mu_i$ following the linear regression model
$$\mu_i = \beta_1 + \beta_2 z_i, \quad i = 1, \ldots, n,$$
where we assume that the given constants $\{z_i\}$ are not all identical. Then $\theta = (\beta_1, \beta_2, \sigma^2)^T$ is identifiable (Problem 1.1.9) and
$$p(y, \theta) = (2\pi\sigma^2)^{-n/2} \exp\Big\{-\frac{\sum (\beta_1 + \beta_2 z_i)^2}{2\sigma^2}\Big\} \exp\Big\{\frac{-\sum y_i^2 + 2\beta_1 \sum y_i + 2\beta_2 \sum z_i y_i}{2\sigma^2}\Big\}.$$
Thus, $T = (\sum Y_i, \sum Y_i^2, \sum z_i Y_i)$ is sufficient for $\theta$. □

Sufficiency and decision theory

Sufficiency can be given a clear operational interpretation in the decision theoretic setting. Specifically, if $T(X)$ is sufficient, we can, for any decision procedure $\delta(x)$, find a randomized decision rule $\delta^*(T(X))$ depending only on $T(X)$ that does as well as $\delta(X)$ in the sense of having the same risk function; that is,
$$R(\theta, \delta) = R(\theta, \delta^*) \text{ for all } \theta. \tag{1.5.12}$$
By randomized we mean that $\delta^*(T(X))$ can be generated from the value $t$ of $T(X)$ and a random mechanism not depending on $\theta$.

Here is an example.

Example 1.5.6. Suppose $X_1, \ldots, X_n$ are independent identically $N(\theta, 1)$ distributed. Then
$$p(x, \theta) = \exp\{n\theta(\bar{x} - \tfrac{1}{2}\theta)\} (2\pi)^{-n/2} \exp\Big\{-\tfrac{1}{2} \sum_{i=1}^n x_i^2\Big\}$$
where $\bar{x} = (1/n) \sum_{i=1}^n x_i$. By the factorization theorem, $\bar{X}$ is sufficient. Let $\delta(X) = X_1$. Using only $\bar{X}$, we construct a rule $\delta^*(\bar{X})$ with the same risk = mean squared error as $\delta(X)$ as follows: Conditionally,
given $\bar{X} = t$, choose $T^* = \delta^*(\bar{X})$ from the normal $N(t, \frac{n-1}{n})$ distribution. Using Section B.1 and (1.4.6), we find
$$E(T^*) = E[E(T^* \mid \bar{X})] = E(\bar{X}) = \theta = E(X_1)$$
$$\mathrm{Var}(T^*) = E[\mathrm{Var}(T^* \mid \bar{X})] + \mathrm{Var}[E(T^* \mid \bar{X})] = \frac{n-1}{n} + \frac{1}{n} = 1 = \mathrm{Var}(X_1).$$
Thus, $\delta^*(\bar{X})$ and $\delta(X)$ have the same mean squared error. □

The proof of (1.5.12) follows along the lines of the preceding example: Given $T(X) = t$, the distribution of $\delta(X)$ does not depend on $\theta$. Now draw $\delta^*$ randomly from this conditional distribution. This $\delta^*(T(X))$ will have the same risk as $\delta(X)$ because, by the double expectation theorem, with $l$ denoting the loss function,
$$R(\theta, \delta^*) = E\{E[l(\theta, \delta^*(T)) \mid T]\} = E\{E[l(\theta, \delta(X)) \mid T]\} = R(\theta, \delta).$$
□

Sufficiency and Bayes models

There is a natural notion of sufficiency of a statistic $T$ in the Bayesian context where in addition to the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ we postulate a prior distribution $\Pi$ for $\theta$. In Example 1.2.1 (Bernoulli trials) we saw that the posterior distribution given $X = x$ is the same as the posterior distribution given $T(X) = \sum_{i=1}^n X_i = k$, where $k = \sum_{i=1}^n x_i$. In this situation we call $T$ Bayes sufficient.

Definition. $T(X)$ is Bayes sufficient for $\Pi$ if the posterior distribution of $\theta$ given $X = x$ is the same as the posterior (conditional) distribution of $\theta$ given $T(X) = T(x)$ for all $x$.

Equivalently, $\theta$ and $X$ are independent given $T(X)$.

Theorem 1.5.2. (Kolmogorov). If $T(X)$ is sufficient for $\theta$, it is Bayes sufficient for every $\Pi$.

This result and a partial converse is the subject of Problem 1.5.14.

Minimal sufficiency

For any model there are many sufficient statistics: Thus, if $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample, $n \ge 2$, then $T(X) = (\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2)$ and $S(X) = (X_1, \ldots, X_n)$ are both sufficient. But $T(X)$ provides a greater reduction of the data. We define the statistic $T(X)$ to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic $S(X)$, in that, we can find a transformation $r$ such that $T(X) = r(S(X))$.

Example 1.5.1 (continued). In this Bernoulli trials case, $T = \sum_{i=1}^n X_i$ was shown to be sufficient. Let $S(X)$ be any other sufficient statistic. Then by the factorization theorem we can write $p(x, \theta)$ as
$$p(x, \theta) = g(S(x), \theta) h(x)$$
Combining this with (1.5.1), we find
$$\theta^T (1 - \theta)^{n-T} = g(S(x), \theta) h(x) \text{ for all } \theta.$$
For any two fixed $\theta_1$ and $\theta_2$, the ratio of both sides of the foregoing gives
$$(\theta_1/\theta_2)^T [(1 - \theta_1)/(1 - \theta_2)]^{n-T} = g(S(x), \theta_1)/g(S(x), \theta_2).$$
In particular, if we set $\theta_1 = 2/3$ and $\theta_2 = 1/3$, take the log of both sides of this equation and solve for $T$, we find
$$T = r(S(x)) = \{\log[2^n g(S(x), 2/3)/g(S(x), 1/3)]\}/2 \log 2.$$
Thus, $T$ is minimally sufficient. □

The likelihood function

The preceding example shows how we can use $p(x, \theta)$ for different values of $\theta$ and the factorization theorem to establish that a sufficient statistic is minimally sufficient. We define the likelihood function $L$ for a given observed data vector $x$ as
$$L_x(\theta) = p(x, \theta), \quad \theta \in \Theta.$$
Thus, $L_x$ is a map from the sample space $\mathcal{X}$ to the class $\mathcal{T}$ of functions $\{\theta \to p(x, \theta) : x \in \mathcal{X}\}$. It is a statistic whose values are functions; if $X = x$, the statistic $L$ takes on the value $L_x$. In the discrete case, for a given $\theta$, $L_x(\theta)$ gives the probability of observing the point $x$. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around $x$. However, when we think of $L_x(\theta)$ as a function of $\theta$, it gives, for a given observed $x$, the "likelihood" or "plausibility" of various $\theta$. The formula (1.2.8) for the posterior distribution can then be remembered as Posterior $\propto$ (Prior) $\times$ (Likelihood) where the sign $\propto$ denotes proportionality as functions of $\theta$.

Example 1.5.4 (continued). In this $N(\mu, \sigma^2)$ example, the likelihood function (1.5.11) is determined by the two-dimensional sufficient statistic
$$T = (T_1, T_2) = \Big(\sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2\Big).$$
Set $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$, then
$$L_x(\theta) = (2\pi\theta_2)^{-n/2} \exp\Big\{-\frac{n\theta_1^2}{2\theta_2}\Big\} \exp\Big\{-\frac{1}{2\theta_2}(t_2 - 2\theta_1 t_1)\Big\}.$$
Now, as a function of $\theta$, $L_x(\cdot)$ determines $(t_1, t_2)$ because, for example,
$$t_2 = -2 \log L_x(0, 1) - n \log 2\pi,$$
with a similar expression for $t_1$ in terms of $L_x(0, 1)$ and $L_x(1, 1)$ (Problem 1.5.17). Thus, $L$ is a statistic that is equivalent to $(t_1, t_2)$ and, hence, itself sufficient. By arguing as in Example 1.5.1 (continued) we can show that $T$ and, hence, $L$ is minimal sufficient. □

In fact, a statistic closely related to $L$ solves the minimal sufficiency problem in general. Suppose there exists $\theta_0$ such that
$$\{x : p(x, \theta) > 0\} \subset \{x : p(x, \theta_0) > 0\}$$
for all $\theta$. Let $\Lambda_x = L_x / L_x(\theta_0)$. Thus, $\Lambda_x$ is the function valued statistic that at $\theta$ takes on the value $p(x, \theta)/p(x, \theta_0)$, the likelihood ratio of $\theta$ to $\theta_0$. Then $\Lambda_x$ is minimal sufficient. See Problem 1.5.12 for a proof of this theorem of Dynkin, Lehmann, and Scheffé.

The "irrelevant" part of the data

We can always rewrite the original $X$ as $(T(X), S(X))$ where $S(X)$ is a statistic needed to uniquely determine $x$ once we know the sufficient statistic $T(x)$. For instance, if $T(X) = \bar{X}$ we can take $S(X) = (X_1 - \bar{X}, \ldots, X_n - \bar{X})$, the residuals; or if $T(X) = (X_{(1)}, \ldots, X_{(n)})$, the order statistics, $S(X) = (R_1, \ldots, R_n)$, the ranks, where $R_i = \sum_{j=1}^n 1(X_j \le X_i)$. $S(X)$ becomes irrelevant (ancillary) for inference if $T(X)$ is known but only if $\mathcal{P}$ is valid. Thus, in Example 1.5.6, if $\sigma^2 = 1$ is postulated, $\bar{X}$ is sufficient, but if in fact $\sigma^2 \neq 1$ all information about $\sigma^2$ is contained in the residuals. If, as in Example 1.5.4, $\sigma^2$ is assumed unknown, $(\bar{X}, \sum_{i=1}^n (X_i - \bar{X})^2)$ is sufficient, but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding $S(X)$; see Problem 1.5.13. If $\mathcal{P}$ specifies that $X_1, \ldots, X_n$ are a random sample, $(X_{(1)}, \ldots, X_{(n)})$ is sufficient. But the ranks are needed if we want to look for possible dependencies in the observations as in Example 1.1.5.

Summary. Consider an experiment with observation vector $X = (X_1, \ldots, X_n)$. Suppose that $X$ has distribution in the class $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. We say that a statistic $T(X)$ is sufficient for $P \in \mathcal{P}$, or for the parameter $\theta$, if the conditional distribution of $X$ given $T(X) = t$ does not involve $\theta$. Let $p(X, \theta)$ denote the frequency function or density of $X$. The factorization theorem states that $T(X)$ is sufficient for $\theta$ if and only if there exist functions $g(t, \theta)$ and $h(X)$ such that
$$p(X, \theta) = g(T(X), \theta) h(X).$$
We show the following result: If $T(X)$ is sufficient for $\theta$, then for any decision procedure $\delta(X)$, we can find a randomized decision rule $\delta^*(T(X))$ depending only on the value of $t = T(X)$ and not on $\theta$ such that $\delta$ and $\delta^*$ have identical risk functions. We define a statistic $T(X)$ to be Bayes sufficient for a prior $\pi$ if the posterior distribution of $\theta$ given $X = x$ is the same as the posterior distribution of $\theta$ given $T(X) = T(x)$ for all $x$. If $T(X)$ is sufficient for $\theta$, it is Bayes sufficient for $\theta$. A sufficient statistic $T(X)$ is minimally sufficient for $\theta$ if for any other sufficient statistic $S(X)$ we can find a transformation $r$ such that $T(X) = r(S(X))$. The likelihood function is defined for a given data vector of
observations $X$ to be the function of $\theta$ defined by $L_X(\theta) = p(X, \theta)$, $\theta \in \Theta$. If $T(X)$ is sufficient for $\theta$, and if there is a value $\theta_0 \in \Theta$ such that
$$\{x : p(x, \theta) > 0\} \subset \{x : p(x, \theta_0) > 0\}, \quad \theta \in \Theta,$$
then, by the factorization theorem, the likelihood ratio
$$\Lambda_X(\theta) = \frac{L_X(\theta)}{L_X(\theta_0)}$$
depends on $X$ through $T(X)$ only. $\Lambda_X(\theta)$ is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered and they have become important in much of the modern theory of statistics.

Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable $Y$ to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model $\{P_\theta : \theta \in \Theta\}$ is said to be a one-parameter exponential family, if there exist real-valued functions $\eta(\theta)$, $B(\theta)$ on $\Theta$, real-valued functions $T$ and $h$ on $R^q$, such that the density (frequency) functions $p(x, \theta)$ of the $P_\theta$ may be written
$$p(x, \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\} \tag{1.6.1}$$
where $x \in \mathcal{X} \subset R^q$. Note that the functions $\eta$, $B$, and $T$ are not unique.

In a one-parameter exponential family the random variable $T(X)$ is sufficient for $\theta$. This is clear because we need only identify $\exp\{\eta(\theta) T(x) - B(\theta)\}$ with $g(T(x), \theta)$ and $h(x)$ with itself in the factorization theorem. We shall refer to $T$ as a natural sufficient statistic of the family.

Here are some examples.

Example 1.6.1. The Poisson Distribution. Let $P_\theta$ be the Poisson distribution with unknown mean $\theta$. Then, for $x \in \{0, 1, 2, \ldots\}$,
$$p(x, \theta) = \frac{\theta^x e^{-\theta}}{x!} = \frac{1}{x!} \exp\{x \log\theta - \theta\}, \quad \theta > 0. \tag{1.6.2}$$
Therefore, the $P_\theta$ form a one-parameter exponential family with
$$q = 1, \quad \eta(\theta) = \log\theta, \quad B(\theta) = \theta, \quad T(x) = x, \quad h(x) = \frac{1}{x!}. \tag{1.6.3}$$
□

Example 1.6.2. The Binomial Family. Suppose $X$ has a $B(n, \theta)$ distribution, $0 < \theta < 1$. Then, for $x \in \{0, 1, \ldots, n\}$,
$$p(x, \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x} = \binom{n}{x} \exp\Big[x \log\Big(\frac{\theta}{1 - \theta}\Big) + n \log(1 - \theta)\Big]. \tag{1.6.4}$$
Therefore, the family of distributions of $X$ is a one-parameter exponential family with
$$q = 1, \quad \eta(\theta) = \log\Big(\frac{\theta}{1 - \theta}\Big), \quad B(\theta) = -n \log(1 - \theta), \quad T(x) = x, \quad h(x) = \binom{n}{x}. \tag{1.6.5}$$
□

Here is an example where $q = 2$.

Example 1.6.3. Suppose $X = (Z, Y)^T$ where $Y = Z + \theta W$, $\theta > 0$, and $Z$ and $W$ are independent $N(0, 1)$. Then
$$f(x, \theta) = f(z, y, \theta) = f(z) f_\theta(y \mid z) = \varphi(z) \theta^{-1} \varphi((y - z)\theta^{-1})$$
$$= (2\pi\theta)^{-1} \exp\Big\{-\tfrac{1}{2}[z^2 + (y - z)^2 \theta^{-2}]\Big\}$$
$$= (2\pi)^{-1} \exp\Big\{-\tfrac{1}{2} z^2\Big\} \exp\Big\{-\tfrac{1}{2}\theta^{-2}(y - z)^2 - \log\theta\Big\}.$$
This is a one-parameter exponential family distribution with
$$q = 2, \quad \eta(\theta) = -\tfrac{1}{2}\theta^{-2}, \quad B(\theta) = \log\theta, \quad T(x) = (y - z)^2, \quad h(x) = (2\pi)^{-1} \exp\Big\{-\tfrac{1}{2} z^2\Big\}.$$
□

The families of distributions obtained by sampling from one-parameter exponential families are themselves one-parameter exponential families. Specifically, suppose $X_1, \ldots, X_m$ are independent and identically distributed with common distribution $P_\theta$, where the $P_\theta$ form a one-parameter exponential family as in (1.6.1). If $\{P_\theta^{(m)}\}$, $\theta \in \Theta$, is the family of distributions of $X = (X_1, \ldots, X_m)$ considered as a random vector in $R^{mq}$ and $p(x, \theta)$ are the corresponding density (frequency) functions, we have
$$p(x, \theta) = \prod_{i=1}^m h(x_i) \exp[\eta(\theta) T(x_i) - B(\theta)] = \Big[\prod_{i=1}^m h(x_i)\Big] \exp\Big[\eta(\theta) \sum_{i=1}^m T(x_i) - m B(\theta)\Big] \tag{1.6.6}$$
where $x = (x_1, \ldots, x_m)$. Therefore, the $P_\theta^{(m)}$ form a one-parameter exponential family. If we use the superscript $m$ to denote the corresponding $T$, $\eta$, $B$, and $h$, then $q^{(m)} = mq$, and
$$\eta^{(m)}(\theta) = \eta(\theta), \quad T^{(m)}(x) = \sum_{i=1}^m T(x_i), \quad B^{(m)}(\theta) = m B(\theta), \quad h^{(m)}(x) = \prod_{i=1}^m h(x_i). \tag{1.6.7}$$
Note that the natural sufficient statistic $T^{(m)}$ is one-dimensional whatever be $m$. For example, if $X = (X_1, \ldots, X_m)$ is a vector of independent and identically distributed $\mathcal{P}(\theta)$ random variables and $P_\theta^{(m)}$ is the family of distributions of $X$, then the $P_\theta^{(m)}$ form a one-parameter exponential family with natural sufficient statistic $T^{(m)}(X) = \sum_{i=1}^m X_i$.

Some other important examples are summarized in the following table. We leave the proof of these assertions to the reader.

TABLE 1.6.1

Family of distributions                      η(θ)           T(x)
N(μ, σ²)     σ² fixed                        μ/σ²           x
             μ fixed                         −1/(2σ²)       (x − μ)²
Γ(p, λ)      p fixed                         −λ             x
             λ fixed                         (p − 1)        log x
β(r, s)      r fixed                         (s − 1)        log(1 − x)
             s fixed                         (r − 1)        log x

The statistic $T^{(m)}(X_1, \ldots, X_m)$ corresponding to the one-parameter exponential family of distributions of a sample from any of the foregoing is just $\sum_{i=1}^m T(X_i)$.

In our first Example 1.6.1 the sufficient statistic $T^{(m)}(X_1, \ldots, X_m) = \sum_{i=1}^m X_i$ is distributed as $\mathcal{P}(m\theta)$. This family of Poisson distributions is one-parameter exponential whatever be $m$. In the discrete case we can establish the following general result.

Theorem 1.6.1. Let $\{P_\theta\}$ be a one-parameter exponential family of discrete distributions with corresponding functions $T$, $\eta$, $B$, and $h$. Then the family of distributions of the statistic $T(X)$ is a one-parameter exponential family of discrete distributions whose frequency functions may be written
$$h^*(t) \exp\{\eta(\theta) t - B(\theta)\}$$
for suitable $h^*$.
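The Poisson case of this result can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are illustrative, and it assumes that the sample analogue of $h^*$ is the sum of $\prod h(x_i)$ over all samples with the given total. It confirms that the exponential family form reproduces the Poisson frequency function, and that for a sample of size $m$ the statistic $\sum X_i$ has the $\mathcal{P}(m\theta)$ frequency function.

```python
import math
from itertools import product


def poisson_pmf(x, theta):
    """Poisson frequency function theta^x e^{-theta} / x!."""
    return theta**x * math.exp(-theta) / math.factorial(x)


def expfam_pmf(x, theta):
    """Same frequency function written as h(x) exp{eta(theta) T(x) - B(theta)}."""
    eta, B, h = math.log(theta), theta, 1.0 / math.factorial(x)
    return h * math.exp(eta * x - B)


def h_star(t, m):
    """Sum of prod_i h(x_i) over all samples (x_1, ..., x_m) with sum x_i = t."""
    total = 0.0
    for xs in product(range(t + 1), repeat=m):
        if sum(xs) == t:
            total += math.prod(1.0 / math.factorial(x) for x in xs)
    return total


def pmf_of_T(t, theta, m):
    """Frequency function of T = sum X_i: h*(t) exp{eta(theta) t - m B(theta)}."""
    return h_star(t, m) * math.exp(math.log(theta) * t - m * theta)


if __name__ == "__main__":
    print(expfam_pmf(3, 2.5), poisson_pmf(3, 2.5))
    print(pmf_of_T(4, 0.7, 3), poisson_pmf(4, 3 * 0.7))
```

The brute-force enumeration in `h_star` is only practical for small `t` and `m`; by the multinomial theorem it equals $m^t/t!$, which is why the two printed values in the second line agree.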
Proof. By definition,
$$P_\theta[T(x) = t] = \sum_{\{x : T(x) = t\}} p(x, \theta) = \sum_{\{x : T(x) = t\}} h(x) \exp[\eta(\theta) T(x) - B(\theta)] = \exp[\eta(\theta) t - B(\theta)] \Big\{\sum_{\{x : T(x) = t\}} h(x)\Big\}.$$
If we let $h^*(t) = \sum_{\{x : T(x) = t\}} h(x)$, the result follows. □

A similar theorem holds in the continuous case if the distributions of $T(X)$ are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by $\eta$ rather than $\theta$. The exponential family then has the form
$$q(x, \eta) = h(x) \exp[\eta T(x) - A(\eta)], \quad x \in \mathcal{X} \subset R^q \tag{1.6.9}$$
where $A(\eta) = \log \int \cdots \int h(x) \exp[\eta T(x)] dx$ in the continuous case and the integral is replaced by a sum in the discrete case. If $\theta \in \Theta$, then $A(\eta)$ must be finite, if $q$ is definable. Let $\mathcal{E}$ be the collection of all $\eta$ such that $A(\eta)$ is finite. Then as we show in Section 1.6.2, $\mathcal{E}$ is either an interval or all of $R$ and the class of models (1.6.9) with $\eta \in \mathcal{E}$ contains the class of models with $\theta \in \Theta$. The model given by (1.6.9) with $\eta$ ranging over $\mathcal{E}$ is called the canonical one-parameter exponential family generated by $T$ and $h$. $\mathcal{E}$ is called the natural parameter space and $T$ is called the natural sufficient statistic.

Example 1.6.1. (continued). The Poisson family in canonical form is
$$q(x, \eta) = (1/x!) \exp\{\eta x - \exp[\eta]\}, \quad x \in \{0, 1, 2, \ldots\},$$
where $\eta = \log\theta$,
$$\exp\{A(\eta)\} = \sum_{x=0}^\infty (e^{\eta x}/x!) = \sum_{x=0}^\infty (e^\eta)^x / x! = \exp(e^\eta),$$
and $\mathcal{E} = R$. □

Here is a useful result.

Theorem 1.6.2. If $X$ is distributed according to (1.6.9) and $\eta$ is an interior point of $\mathcal{E}$, the moment-generating function of $T(X)$ exists and is given by
$$M(s) = \exp[A(s + \eta) - A(\eta)]$$
for $s$ in some neighborhood of 0.
. 1. and Dannois were led in their investigations to the following family of distributions.2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). and realvalued functions T 1. x> 0. Koopman. Now p(x. Pitman. 0 Here is a typical application of this result.O) = h(x) exp[L 1]j(O)Tj (x) .6.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.4 Suppose X 1> ••• .8 > O. (1. is said to be a kparameter exponential family.8(0)]. nlog0 J..8) = (il(x.6 Exponential Families 53 Moreover. Proof.6.. Example 1. .8) ~ (x/8 2)exp(_x 2/28 2).. 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). E(T(X)) ~ A'(1]). This is known as the Rayleigh distribution.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . Var(T(X)) ~ A"(1]). B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 .12). Therefore. A family of distrib~tioos {PO: 0 E 8}.Section 1. 1 Tk. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic. It is used to model the density of "time until failure" for certain types of equipment. The rest of the theorem follows from the momentenerating property of M(s) (see Section A. is one.A(1])] because the last factor. 2 i=1 i=l n 1 n Here 1] = 1/20 2. More generally. being the integral of a density. We give the proof in the continuous case.10) . We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) . if there exist realvalued functions 171. _ . X n is a sample from a population with density p(x. Direct computation of these moments is more complicated. c R k . exp[A(s + 1]) . ' e k p(x. h on Rq such that the density (frequency) functions of the Po may be written as.<·or . 02 ~ 1/21].6.17k and B of 6. x E X j=1 c Rq.
By Theorem 1.5.1, the vector $T(X) = (T_1(X), \ldots, T_k(X))^T$ is sufficient. It will be referred to as a natural sufficient statistic of the family.

Again, suppose $X = (X_1, \ldots, X_m)$ where the $X_i$ are independent and identically distributed and their common distribution ranges over a $k$-parameter exponential family given by (1.6.10). Then the distributions of $X$ form a $k$-parameter exponential family with natural sufficient statistic
$$T^{(m)}(x) = \Big(\sum_{i=1}^m T_1(X_i), \ldots, \sum_{i=1}^m T_k(X_i)\Big).$$

Example 1.6.5. The Normal Family. Suppose that $P_\theta = N(\mu, \sigma^2)$, $\Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty, \ \sigma^2 > 0\}$. The density of $P_\theta$ may be written as
$$p(x, \theta) = \exp\Big[\frac{\mu}{\sigma^2} x - \frac{x^2}{2\sigma^2} - \frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big)\Big], \tag{1.6.11}$$
which corresponds to a two-parameter exponential family with $q = 1$, $\theta_1 = \mu$, $\theta_2 = \sigma^2$, and
$$\eta_1(\theta) = \frac{\mu}{\sigma^2}, \quad T_1(x) = x, \quad \eta_2(\theta) = -\frac{1}{2\sigma^2}, \quad T_2(x) = x^2, \quad B(\theta) = \frac{1}{2}\Big(\frac{\mu^2}{\sigma^2} + \log(2\pi\sigma^2)\Big), \quad h(x) = 1.$$
If we observe a sample $X = (X_1, \ldots, X_m)$ from a $N(\mu, \sigma^2)$ population, then the preceding discussion leads us to the natural sufficient statistic
$$\Big(\sum_{i=1}^m X_i, \sum_{i=1}^m X_i^2\Big),$$
which we obtained in the previous section (Example 1.5.4). □

Again it will be convenient to consider the "biggest" families, letting the model be indexed by $\eta = (\eta_1, \ldots, \eta_k)^T$ rather than $\theta$. Thus, the canonical $k$-parameter exponential family generated by $T$ and $h$ is
$$q(x, \eta) = h(x) \exp\{T^T(x) \eta - A(\eta)\}, \quad x \in \mathcal{X} \subset R^q$$
where $T(x) = (T_1(x), \ldots, T_k(x))^T$ and, in the continuous case,
$$A(\eta) = \log \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x) \exp\{T^T(x) \eta\} dx.$$
In the discrete case, $A(\eta)$ is defined in the same way except integrals over $R^q$ are replaced by sums. In either case, we define the natural parameter space as
$$\mathcal{E} = \{\eta \in R^k : -\infty < A(\eta) < \infty\}.$$
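As a concrete check of the canonical form for the normal family, the sketch below maps $(\mu, \sigma^2)$ to $(\eta_1, \eta_2)$ and evaluates $q(x, \eta)$. It is a hedged illustration: the function names are invented here, and the expression used for $A(\eta)$ is an assumption obtained by substituting $\mu = -\eta_1/(2\eta_2)$ and $\sigma^2 = -1/(2\eta_2)$ into $B(\theta)$ of (1.6.11).

```python
import math


def to_canonical(mu, sigma2):
    """Map (mu, sigma^2) to the canonical parameters (eta_1, eta_2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def A(eta1, eta2):
    """Log-normalizer, finite only on the natural parameter space eta_2 < 0."""
    if eta2 >= 0:
        raise ValueError("eta_2 must be negative for A(eta) to be finite")
    return 0.5 * (-eta1**2 / (2.0 * eta2) + math.log(math.pi / -eta2))


def q(x, eta1, eta2):
    """Canonical density exp{eta_1 x + eta_2 x^2 - A(eta)} with h(x) = 1."""
    return math.exp(eta1 * x + eta2 * x * x - A(eta1, eta2))


def normal_pdf(x, mu, sigma2):
    """Usual N(mu, sigma^2) density, for comparison."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def natural_sufficient_statistic(xs):
    """(sum x_i, sum x_i^2) for a sample from the normal family."""
    return sum(xs), sum(x * x for x in xs)


if __name__ == "__main__":
    eta1, eta2 = to_canonical(1.2, 2.0)
    print(q(0.3, eta1, eta2), normal_pdf(0.3, 1.2, 2.0))
    print(natural_sufficient_statistic([1.0, 2.0, 3.0]))
```

The `ValueError` branch mirrors the definition of the natural parameter space: for $\eta_2 \ge 0$ the integral defining $A(\eta)$ diverges.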
Example 1.6.5. ($N(\mu, \sigma^2)$ continued). In this example, $k = 2$, $T^T(x) = (x, x^2) = (T_1(x), T_2(x))$, $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $A(\eta) = \frac{1}{2}[(-\eta_1^2/2\eta_2) + \log(\pi/(-\eta_2))]$, $h(x) = 1$ and $\mathcal{E} = R \times R^- = \{(\eta_1, \eta_2) : \eta_1 \in R, \ \eta_2 < 0\}$.

Example 1.6.6. Linear Regression. Suppose as in Examples 1.1.4 and 1.5.5 that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, with $\mu_i = \beta_1 + \beta_2 z_i$, $i = 1, \ldots, n$. From Example 1.5.5, the density of $Y = (Y_1, \ldots, Y_n)^T$ can be put in canonical form with $k = 3$, $T(Y) = (\sum Y_i, \sum z_i Y_i, \sum Y_i^2)^T$, $\eta_1 = \beta_1/\sigma^2$, $\eta_2 = \beta_2/\sigma^2$, $\eta_3 = -1/2\sigma^2$,
$$A(\eta) = \frac{-n}{4\eta_3}\big[\eta_1^2 + \hat{m}_2 \eta_2^2 + 2\bar{z}\eta_1\eta_2\big] + \frac{n}{2} \log(\pi/(-\eta_3)),$$
and $\mathcal{E} = \{(\eta_1, \eta_2, \eta_3) : \eta_1 \in R, \ \eta_2 \in R, \ \eta_3 < 0\}$, where $\hat{m}_2 = n^{-1} \sum z_i^2$ and $\bar{z} = n^{-1} \sum z_i$.

Example 1.6.7. Multinomial Trials. We observe the outcomes of $n$ independent trials where each trial can end up in one of $k$ possible categories. We write the outcome vector as $X = (X_1, \ldots, X_n)^T$ where the $X_i$ are i.i.d. as $X$ and the sample space of each $X_i$ is the $k$ categories $\{1, 2, \ldots, k\}$. Let $T_j(x) = \sum_{i=1}^n 1[X_i = j]$, and $\lambda_j = P(X_i = j)$. Then $p(x, \lambda) = \prod_{j=1}^k \lambda_j^{T_j(x)}$, $\lambda \in \Lambda$, where $\Lambda$ is the simplex $\{\lambda \in R^k : 0 < \lambda_j < 1, \ j = 1, \ldots, k, \ \sum_{j=1}^k \lambda_j = 1\}$. It will often be more convenient to work with unrestricted parameters. In this example, we can achieve this by the reparametrization
$$\lambda_j = e^{\alpha_j} \Big/ \sum_{j=1}^k e^{\alpha_j}, \quad j = 1, \ldots, k, \ \alpha \in R^k.$$
Now we can write the likelihood as
$$q_0(x, \alpha) = \exp\Big\{\sum_{j=1}^k \alpha_j T_j(x) - n \log \sum_{j=1}^k \exp(\alpha_j)\Big\}.$$
This is a $k$-parameter canonical exponential family generated by $T_1, \ldots, T_k$ and $h(x) = \prod_{i=1}^n 1[x_i \in \{1, \ldots, k\}]$ with canonical parameter $\alpha$ and $\mathcal{E} = R^k$. However $\alpha$ is not identifiable because $q_0(x, \alpha + c\mathbf{1}) = q_0(x, \alpha)$ for $\mathbf{1} = (1, \ldots, 1)^T$ and all $c$. This can be remedied by considering
$$\eta_j = \log(\lambda_j/\lambda_k) = \alpha_j - \alpha_k, \quad 1 \le j \le k - 1,$$
and rewriting
$$q(x, \eta) = \exp\Big\{T_{(k-1)}^T(x) \eta - n \log\Big(1 + \sum_{j=1}^{k-1} e^{\eta_j}\Big)\Big\} h(x)$$
where
$T_{(k-1)}(x) = (T_1(x), \ldots, T_{k-1}(x))^T$. Note that $q(x, \eta)$ is a $k - 1$ parameter canonical exponential family generated by $T_{(k-1)}$ and $h(x) = \prod_{i=1}^n 1[x_i \in \{1, \ldots, k\}]$ with canonical parameter $\eta$ and $\mathcal{E} = R^{k-1}$. Moreover, the parameters $\eta_j = \log(P_\eta[X = j]/P_\eta[X = k])$, $1 \le j \le k - 1$, are identifiable. Note that the model for $X$ is unchanged. □

1.6.3 Building Exponential Families

Submodels

A submodel of a $k$-parameter canonical exponential family $\{q(x, \eta); \eta \in \mathcal{E} \subset R^k\}$ is an exponential family defined by
$$p(x, \theta) = q(x, \eta(\theta)) \tag{1.6.12}$$
where $\theta \in \Theta \subset R^l$, $l \le k$, and $\eta$ is a map from $\Theta$ to a subset of $R^k$. Thus, if $X$ is discrete taking on $k$ values as in Example 1.6.7 and $X = (X_1, \ldots, X_n)^T$ where the $X_i$ are i.i.d. as $X$, then all models for $X$ are exponential families because they are submodels of the multinomial trials model.

Affine transformations

If $\mathcal{P}$ is the canonical family generated by $T_{k \times 1}$ and $h$ and $M$ is the affine transformation from $R^k$ to $R^l$ defined by
$$M(T) = M_{l \times k} T + b_{l \times 1},$$
it is easy to see that the family generated by $M(T(X))$ and $h$ is the subfamily of $\mathcal{P}$ corresponding to
$$\Theta = \eta^{-1}(\mathcal{E}) \subset R^l \quad \text{and} \quad \eta(\theta) = M^T \theta.$$
Similarly, if $\Theta \subset R^l$ and $\eta(\theta) = B_{k \times l} \theta \subset R^k$, then the resulting submodel of $\mathcal{P}$ above is a submodel of the exponential family generated by $B^T T(X)$ and $h$. See Problem 1.6.17 for details. Here is an example of affine transformations of $\theta$ and $T$.

Example 1.6.8. Logistic Regression. Let $Y_i$ be independent binomial, $B(n_i, \lambda_i)$, $1 \le i \le n$. If the $\lambda_i$ are unrestricted, $0 < \lambda_i < 1$, $1 \le i \le n$, this, from Example 1.6.2, is an $n$-parameter canonical exponential family with $\mathcal{Y}_i \equiv$ integers from 0 to $n_i$ generated by $T(Y_1, \ldots, Y_n) = Y$, $h(y) = \prod_{i=1}^n \binom{n_i}{y_i} 1(0 \le y_i \le n_i)$. Here $\eta_i = \log\frac{\lambda_i}{1 - \lambda_i}$, $A(\eta) = \sum_{i=1}^n n_i \log(1 + e^{\eta_i})$. However, let $x_1 < \cdots < x_n$ be specified levels and
$$\eta_i(\theta) = \theta_1 + \theta_2 x_i, \quad 1 \le i \le n, \ \theta = (\theta_1, \theta_2)^T \in R^2. \tag{1.6.13}$$
This is a linear transformation $\eta(\theta) = B_{n \times 2} \theta$ corresponding to $B_{n \times 2} = (\mathbf{1}, x)$, where $\mathbf{1}$ is $(1, \ldots, 1)^T$, $x = (x_1, \ldots, x_n)^T$. Set $M = B^T$, then this is the two-parameter canonical exponential family generated by $MY = (\sum_{i=1}^n Y_i, \sum_{i=1}^n x_i Y_i)^T$ and $h$ with
$$A(\theta_1, \theta_2) = \sum_{i=1}^n n_i \log(1 + \exp(\theta_1 + \theta_2 x_i)).$$
This model is sometimes applied in experiments to determine the toxicity of a substance. The $Y_i$ represent the number of animals dying out of $n_i$ when exposed to level $x_i$ of the substance. It is assumed that each animal has a random toxicity threshold $X$ such that death results if and only if a substance level on or above $X$ is applied. Assume also:

(a) No interaction between animals (independence) in relation to drug effects

(b) The distribution of $X$ in the animal population is logistic; that is,
$$P[X \le x] = [1 + \exp\{-(\theta_1 + \theta_2 x)\}]^{-1}, \tag{1.6.14}$$
$\theta_1 \in R$, $\theta_2 > 0$. Then (and only then),
$$\log(P[X \le x]/(1 - P[X \le x])) = \theta_1 + \theta_2 x$$
and (1.6.13) holds. □

Curved exponential families

Exponential families (1.6.12) with the range of $\eta(\theta)$ restricted to a subset of dimension $l$ with $l \le k - 1$, are called curved exponential families provided they do not form a canonical exponential family in the $\theta$ parametrization.

Example 1.6.9. Gaussian with Fixed Signal-to-Noise Ratio. In the normal case with $X_1, \ldots, X_n$ i.i.d. $N(\mu, \sigma^2)$, suppose the ratio $|\mu|/\sigma$, which is called the coefficient of variation or signal-to-noise ratio, is a known constant $\lambda_0 > 0$. Then, with $\theta = \mu$, we can write
$$p(x, \theta) = \exp\Big\{\lambda_0^2 \theta^{-1} T_1 - \tfrac{1}{2}\lambda_0^2 \theta^{-2} T_2 - \frac{n}{2}\big[\lambda_0^2 + \log(2\pi\lambda_0^{-2}\theta^2)\big]\Big\}$$
where $T_1 = \sum_{i=1}^n X_i$, $T_2 = \sum_{i=1}^n X_i^2$, $\eta_1(\theta) = \lambda_0^2 \theta^{-1}$ and $\eta_2(\theta) = -\frac{1}{2}\lambda_0^2 \theta^{-2}$. This is a curved exponential family with $l = 1$. □

In Example 1.6.8, the $\theta$ parametrization has dimension 2, which is less than $k = n$ when $n \ge 3$. However, $p(x, \theta)$ in the $\theta$ parametrization is a canonical exponential family, so it is not a curved family.

Example 1.6.10. Location-Scale Regression. Suppose that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma_i^2)$. If each $\mu_i$ ranges over $R$ and each $\sigma_i^2$ ranges over $(0, \infty)$, this is by Example 1.6.5 a $2n$-parameter canonical exponential family model with $\eta_i = \mu_i/\sigma_i^2$, and $\eta_{n+i} = -1/2\sigma_i^2$, $i = 1, \ldots, n$, generated by
$$T(Y) = (Y_1, \ldots, Y_n, Y_1^2, \ldots, Y_n^2)^T$$
and $h(Y) = 1$. Next suppose that $(\mu_i, \sigma_i^2)$ depend on the value $z_i$ of some covariate, say,
$$\mu_i = \theta_1 + \theta_2 z_i, \quad \sigma_i^2 = \theta_3(\theta_1 + \theta_2 z_i)^2, \quad z_1 < \cdots < z_n$$
for unknown parameters $\theta_1 \in R$, $\theta_2 \in R$, $\theta_3 > 0$ (e.g., Bickel, 1978; Carroll and Ruppert, 1988, Sections 2.1–2.5; and Snedecor and Cochran, 1989, Section 15.10). For $\theta = (\theta_1, \theta_2, \theta_3)$, the map $\eta(\theta)$ is
$$\eta_i(\theta) = \theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-1}, \quad \eta_{n+i}(\theta) = -\tfrac{1}{2}\theta_3^{-1}(\theta_1 + \theta_2 z_i)^{-2}, \quad i = 1, \ldots, n.$$
Because $\sum_{i=1}^n \eta_i(\theta) Y_i + \sum_{i=1}^n \eta_{n+i}(\theta) Y_i^2$ cannot be written in the form $\sum_{j=1}^3 \eta_j^*(\theta) T_j^*(Y)$ for some $\eta_j^*(\theta)$, $T_j^*(Y)$, then $p(y, \theta) = q(y, \eta(\theta))$ as defined in (1.6.12) is not an exponential family model, but a curved exponential family model with $l = 3$. □

Models in which the variance $\mathrm{Var}(Y_i)$ depends on $i$ are called heteroscedastic whereas models in which $\mathrm{Var}(Y_i)$ does not depend on $i$ are called homoscedastic. Thus, Examples 1.6.10 and 1.6.6 are heteroscedastic and homoscedastic models, respectively.

We return to curved exponential family models in Section 2.3.

Supermodels

We have already noted that the exponential family structure is preserved under i.i.d. sampling. Even more is true. Let $Y_j$, $1 \le j \le n$, be independent, $Y_j \in \mathcal{Y}_j \subset R^q$, with an exponential family density
$$q_j(y_j, \theta) = \exp\{T_j^T(y_j) \eta(\theta) - B_j(\theta)\} h_j(y_j), \quad \theta \in \Theta \subset R^k.$$
Then $Y = (Y_1, \ldots, Y_n)^T$ is modeled by the exponential family generated by $T(Y) = \sum_{j=1}^n T_j(Y_j)$ and $\prod_{j=1}^n h_j(y_j)$, with parameter $\eta(\theta)$, and $B(\theta) = \sum_{j=1}^n B_j(\theta)$.

In Example 1.6.8 note that (1.6.13) exhibits $Y_j$ as being distributed according to a two-parameter family generated by $T_j(Y_j) = (Y_j, x_j Y_j)$ and we can apply the supermodel approach to reach the same conclusion as before.

1.6.4 Properties of Exponential Families

Theorem 1.6.1 generalizes directly to $k$-parameter families as does its continuous analogue. We extend the statement of Theorem 1.6.2.

Recall from Section B.5 that for any random vector $T_{k \times 1}$, we define
$$M(s) = E e^{s^T T}$$
as the moment-generating function, and
$$E(T) = (E(T_1), \ldots, E(T_k))^T, \quad \mathrm{Var}(T) = \|\mathrm{Cov}(T_a, T_b)\|_{k \times k}.$$

Theorem 1.6.3. Let $\mathcal{P}$ be a canonical $k$-parameter exponential family generated by $(T, h)$ with corresponding natural parameter space $\mathcal{E}$ and function $A(\eta)$. Then

(a) $\mathcal{E}$ is convex

(b) $A : \mathcal{E} \to R$ is convex

(c) If $\mathcal{E}$ has nonempty interior $\mathcal{E}^0$ in $R^k$ and $\eta_0 \in \mathcal{E}^0$, then $T(X)$ has under $\eta_0$ a moment-generating function $M$ given by
$$M(s) = \exp\{A(\eta_0 + s) - A(\eta_0)\}$$
valid for all $s$ such that $\eta_0 + s \in \mathcal{E}$. Since $\eta_0$ is an interior point this set of $s$ includes a ball about 0.

Corollary 1.6.1. Under the conditions of Theorem 1.6.3
$$E_{\eta_0} T(X) = \dot{A}(\eta_0), \quad \mathrm{Var}_{\eta_0} T(X) = \ddot{A}(\eta_0)$$
where $\dot{A}(\eta_0) = \big(\frac{\partial A}{\partial \eta_1}(\eta_0), \ldots, \frac{\partial A}{\partial \eta_k}(\eta_0)\big)^T$, $\ddot{A}(\eta_0) = \big\|\frac{\partial^2 A}{\partial \eta_a \partial \eta_b}(\eta_0)\big\|$.

The corollary follows immediately from Theorem B.5.1 and Theorem 1.6.3(c).

Proof of Theorem 1.6.3. We prove (b) first. Suppose $\eta_1, \eta_2 \in \mathcal{E}$ and $0 < \alpha < 1$. By the Hölder inequality (B.9.4), for any $u(x)$, $v(x)$, $h(x) \ge 0$, $r, s > 0$ with $\frac{1}{r} + \frac{1}{s} = 1$,
$$\int u(x) v(x) h(x) dx \le \Big(\int u^r(x) h(x) dx\Big)^{1/r} \Big(\int v^s(x) h(x) dx\Big)^{1/s}.$$
Substitute $\frac{1}{r} = \alpha$, $\frac{1}{s} = 1 - \alpha$, $u(x) = \exp(\alpha \eta_1^T T(x))$, $v(x) = \exp((1 - \alpha) \eta_2^T T(x))$ and take logs of both sides to obtain, (with $\infty$ permitted on either side),
$$A(\alpha \eta_1 + (1 - \alpha) \eta_2) \le \alpha A(\eta_1) + (1 - \alpha) A(\eta_2) \tag{1.6.15}$$
which is (b). If $\eta_1, \eta_2 \in \mathcal{E}$ the right-hand side of (1.6.15) is finite. Because $\int \exp(\eta^T T(x)) h(x) dx > 0$ for all $\eta$ we conclude from (1.6.15) that $\alpha \eta_1 + (1 - \alpha) \eta_2 \in \mathcal{E}$ and (a) follows. Finally (c) is proved in exactly the same way as Theorem 1.6.2. □

The formulae of Corollary 1.6.1 give a classical result in Example 1.6.7.
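Corollary 1.6.1 and the convexity inequality (1.6.15) can be illustrated with the binomial family of Example 1.6.8, where $A(\eta) = \sum n_i \log(1 + e^{\eta_i})$. The following sketch is illustrative only: the finite-difference helpers, step sizes, and test points are assumptions. It compares numerical derivatives of $A$ with the familiar binomial moments $n_i\lambda_i$ and $n_i\lambda_i(1 - \lambda_i)$, and evaluates the gap between the two sides of (1.6.15).

```python
import math


def A(eta, n_trials):
    """Log-normalizer A(eta) = sum_i n_i log(1 + exp(eta_i)) for independent binomials."""
    return sum(n * math.log1p(math.exp(e)) for n, e in zip(n_trials, eta))


def grad_A(eta, n_trials, h=1e-5):
    """Central-difference gradient of A; should match the means n_i * lambda_i."""
    grad = []
    for i in range(len(eta)):
        up, down = list(eta), list(eta)
        up[i] += h
        down[i] -= h
        grad.append((A(up, n_trials) - A(down, n_trials)) / (2.0 * h))
    return grad


def hess_diag_A(eta, n_trials, h=1e-4):
    """Diagonal of the Hessian of A; should match n_i * lambda_i * (1 - lambda_i)."""
    base = A(eta, n_trials)
    diag = []
    for i in range(len(eta)):
        up, down = list(eta), list(eta)
        up[i] += h
        down[i] -= h
        diag.append((A(up, n_trials) - 2.0 * base + A(down, n_trials)) / h**2)
    return diag


def convexity_gap(eta_a, eta_b, alpha, n_trials):
    """alpha A(eta_a) + (1 - alpha) A(eta_b) - A(mixture); nonnegative by (1.6.15)."""
    mix = [alpha * a + (1.0 - alpha) * b for a, b in zip(eta_a, eta_b)]
    return (alpha * A(eta_a, n_trials)
            + (1.0 - alpha) * A(eta_b, n_trials)
            - A(mix, n_trials))


if __name__ == "__main__":
    eta, n_trials = [-0.4, 1.1], [3, 7]
    lam = [1.0 / (1.0 + math.exp(-e)) for e in eta]
    print(grad_A(eta, n_trials), [n * l for n, l in zip(n_trials, lam)])
    print(hess_diag_A(eta, n_trials), [n * l * (1 - l) for n, l in zip(n_trials, lam)])
    print(convexity_gap(eta, [2.0, -3.0], 0.3, n_trials))
```

Because the $Y_i$ are independent, the off-diagonal entries of $\ddot{A}$ vanish here, so only the diagonal is computed.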
Example 1.6.7. (continued). Here, using the $\alpha$ parametrization,
$$A(\alpha) = n\log\Big(\sum_{j=1}^k e^{\alpha_j}\Big)$$
and
$$E_\lambda(T_j(X)) = nP_\lambda[X = j] = n\lambda_j = ne^{\alpha_j}\Big/\sum_{l=1}^k e^{\alpha_l}.$$
□

The rank of an exponential family

Evidently every k-parameter exponential family is also k'-dimensional with $k' > k$. However, there is a minimal dimension. An exponential family is of rank k iff the generating statistic $\mathbf{T}$ is k-dimensional and $1, T_1(X), \ldots, T_k(X)$ are linearly independent with positive probability. Formally, $P_\eta[\sum_{j=1}^k a_jT_j(X) = a_{k+1}] < 1$ unless all $a_j$ are 0.

Note that $P_\theta(A) = 0$ or $P_\theta(A) < 1$ for some $\theta$ iff the corresponding statement holds for all $\theta$ because $0 < p(x, \theta_1)/p(x, \theta_2) < \infty$ for all $x$, $\theta_1$, $\theta_2$ such that $h(x) > 0$.

Going back to Example 1.6.7 we can see that the multinomial family is of rank at most $k - 1$. It is intuitively clear that $k - 1$ is in fact its rank and this is seen in Theorem 1.6.4 that follows. Similarly, in Example 1.6.8, if $n = 1$ and $\eta_1(\theta) = \theta_1 + \theta_2x_1$, we are writing the one-parameter binomial family corresponding to $Y_1$ as a two-parameter family with generating statistic $(Y_1, x_1Y_1)$. But the rank of the family is 1 and $\theta_1$ and $\theta_2$ are not identifiable. However, if we consider $\mathbf{Y}$ with $n \ge 2$ and $x_1 < x_n$ the family as we have seen remains of rank $\le 2$ and is in fact of rank 2. Our discussion suggests a link between rank and identifiability of the $\eta$ parameterization. We establish the connection and other fundamental relationships in Theorem 1.6.4.

Theorem 1.6.4. Suppose $\mathcal{P} = \{q(x, \eta) : \eta \in \mathcal{E}\}$ is a canonical exponential family generated by $(\mathbf{T}_{k \times 1}, h)$ with natural parameter space $\mathcal{E}$ such that $\mathcal{E}$ is open. Then the following are equivalent.
(i) $\mathcal{P}$ is of rank k.
(ii) $\eta$ is a parameter (identifiable).
(iii) $\mathrm{Var}_\eta(\mathbf{T})$ is positive definite.
(iv) $\eta \to \dot{A}(\eta)$ is 1-1 on $\mathcal{E}$.
(v) $A$ is strictly convex on $\mathcal{E}$.

Note that, by Theorem 1.6.3, because $\mathcal{E}$ is open, $\dot{A}$ is defined on all of $\mathcal{E}$.

Proof. We give a detailed proof for $k = 1$. The proof for $k > 1$ is then sketched with details left to a problem. Let $\sim(\cdot)$ denote "$(\cdot)$ is false." Then

$\sim$(i) $\Leftrightarrow$ $P_\eta[a_1T = a_2] = 1$ for $a_1 \ne 0$. This is equivalent to $\mathrm{Var}_\eta(T) = 0$ $\Leftrightarrow$ $\sim$(iii).

$\sim$(ii) $\Leftrightarrow$ There exist $\eta_1 \ne \eta_2$ such that $P_{\eta_1} = P_{\eta_2}$. Equivalently
$$\exp\{\eta_1T(x) - A(\eta_1)\}h(x) = \exp\{\eta_2T(x) - A(\eta_2)\}h(x).$$
Taking logs we obtain $(\eta_1 - \eta_2)T(X) = A(\eta_1) - A(\eta_2)$ with probability 1 $\equiv$ $\sim$(i). We, thus, have (i) $\equiv$ (ii) $\equiv$ (iii). Now (iii) $\Rightarrow$ $A''(\eta) > 0$ by Theorem 1.6.2 and, hence, $A'(\eta)$ is strictly monotone increasing and 1-1. Conversely, $A''(\eta_0) = 0$ for some $\eta_0$ implies that $T \equiv c$, with probability 1, for all $\eta$, by our remarks in the discussion of rank, which implies that $A''(\eta) = 0$ for all $\eta$ and, hence, $A'$ is constant. Thus, (iii) $\equiv$ (iv) and the same discussion shows that (iii) $\equiv$ (v).

Proof of the general case sketched

I. $\sim$(i) $\equiv$ $\sim$(iii)
$\sim$(i) $\equiv$ $P_\eta[\mathbf{a}^T\mathbf{T} = c] = 1$ for some $\mathbf{a} \ne \mathbf{0}$, all $\eta$.
$\sim$(iii) $\equiv$ $\mathbf{a}^T\mathrm{Var}_\eta(\mathbf{T})\mathbf{a} = \mathrm{Var}_\eta(\mathbf{a}^T\mathbf{T}) = 0$ for some $\mathbf{a} \ne \mathbf{0}$, all $\eta$ $\equiv$ $\sim$(i).

II. $\sim$(ii) $\equiv$ $\sim$(i)
$\sim$(ii) $\equiv$ $P_{\eta_1} = P_{\eta_0}$ for some $\eta_1 \ne \eta_0$. Let
$$\mathcal{Q} = \{P_{\eta_0 + c(\eta_1 - \eta_0)} : \eta_0 + c(\eta_1 - \eta_0) \in \mathcal{E}\}.$$
$\mathcal{Q}$ is the exponential family (one-parameter) generated by $(\eta_1 - \eta_0)^T\mathbf{T}$. Apply the case $k = 1$ to $\mathcal{Q}$ to get $\sim$(ii) $\equiv$ $\sim$(i).

III. (iv) $\equiv$ (v) $\equiv$ (iii)
Properties (iv) and (v) are equivalent to the statements holding for every $\mathcal{Q}$ defined as previously for arbitrary $\eta_0$, $\eta_1$. □

Corollary 1.6.2. Suppose that the conditions of Theorem 1.6.4 hold and $\mathcal{P}$ is of rank k. Then
(a) $\mathcal{P}$ may be uniquely parametrized by $\mu(\eta) \equiv E_\eta\mathbf{T}(X)$ where $\mu$ ranges over $\dot{A}(\mathcal{E})$,
(b) $\log q(x, \eta)$ is a strictly concave function of $\eta$ on $\mathcal{E}$.

Proof. This is just a restatement of (iv) and (v) of the theorem. □
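The link between rank, identifiability and positive definiteness in Theorem 1.6.4 can be illustrated on the multinomial family of Example 1.6.7 in the $\alpha$ parametrization, where $A(\alpha) = n\log(\sum_j e^{\alpha_j})$. The sketch below is only an illustration under assumptions made here: the values of `alpha` and `n` are arbitrary, the helper names are hypothetical, and the closed form $n(\mathrm{diag}(\lambda) - \lambda\lambda^T)$ for the Hessian is the standard multinomial covariance rather than a formula quoted from the text. It shows that the Hessian annihilates the vector of ones, so (iii) fails and the k-dimensional $\alpha$ is not identifiable, while the leading $(k-1) \times (k-1)$ block is positive definite.

```python
import math

def softmax(alpha):
    # lambda_j = exp(alpha_j) / sum_l exp(alpha_l), computed stably.
    m = max(alpha)
    w = [math.exp(a - m) for a in alpha]
    total = sum(w)
    return [x / total for x in w]

def A(alpha, n):
    # A(alpha) = n * log(sum_j exp(alpha_j)), computed stably.
    m = max(alpha)
    return n * (m + math.log(sum(math.exp(a - m) for a in alpha)))

def hessian_A(alpha, n):
    # Hessian of A: n * (diag(lambda) - lambda lambda^T).
    lam = softmax(alpha)
    k = len(alpha)
    return [[n * ((lam[i] if i == j else 0.0) - lam[i] * lam[j])
             for j in range(k)] for i in range(k)]

alpha = [0.2, -1.0, 0.7]  # arbitrary illustrative values, k = 3
n = 10
lam = softmax(alpha)
H = hessian_A(alpha, n)

# H applied to (1, ..., 1): every entry vanishes, so H is not positive definite.
ones_image = [sum(row) for row in H]

# Shifting every alpha_j by the same constant leaves the distribution unchanged.
c = 2.5
shifted = [a + c for a in alpha]
lam_shifted = softmax(shifted)
A_shift = A(shifted, n) - A(alpha, n)  # equals n * c

# Leading 2 x 2 block: positive diagonal entry and positive determinant.
det2 = H[0][0] * H[1][1] - H[0][1] * H[1][0]

print(ones_image)
print(lam, lam_shifted)
print(det2, n * n * lam[0] * lam[1] * lam[2])
```

The positive determinant of the leading block is consistent with the family having rank $k - 1$ once one coordinate is dropped.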
The relation in (a) is sometimes evident and the $\mu$ parametrization is close to the initial parametrization of classical $\mathcal{P}$. Thus, the $\mathcal{B}(n, \theta)$ family is parametrized by $E(X)$, where $X$ is the Bernoulli trial, the $N(\mu, \sigma_0^2)$ family by $E(X)$. For $\{N(\mu, \sigma^2)\}$, $E(X, X^2) = (\mu, \sigma^2 + \mu^2)$, which is obviously a 1-1 function of $(\mu, \sigma^2)$. However, the relation in (a) may be far from obvious (see Problem 1.6.21). The corollary will prove very important in estimation theory. See Section 2.3. We close the present discussion of exponential families with the following example.

Example 1.6.11. The p Variate Gaussian Family. An important exponential family is based on the multivariate Gaussian distributions of Section B.6. Recall that $\mathbf{Y}_{p \times 1}$ has a p variate Gaussian distribution, $N_p(\mu, \Sigma)$, with mean $\mu_{p \times 1}$ and positive definite variance covariance matrix $\Sigma_{p \times p}$, iff its density is
$$f(\mathbf{Y}, \mu, \Sigma) = |\det(\Sigma)|^{-1/2}(2\pi)^{-p/2}\exp\{-\tfrac{1}{2}(\mathbf{Y} - \mu)^T\Sigma^{-1}(\mathbf{Y} - \mu)\}. \qquad (1.6.16)$$
Rewriting the exponent we obtain
$$\log f(\mathbf{Y}, \mu, \Sigma) = -\tfrac{1}{2}\mathbf{Y}^T\Sigma^{-1}\mathbf{Y} + (\Sigma^{-1}\mu)^T\mathbf{Y} - \tfrac{1}{2}(\log|\det(\Sigma)| + \mu^T\Sigma^{-1}\mu) - \tfrac{p}{2}\log 2\pi. \qquad (1.6.17)$$
The first two terms on the right in (1.6.17) can be rewritten
$$-\Big(\sum_{1 \le i < j \le p}\sigma^{ij}Y_iY_j + \tfrac{1}{2}\sum_{i=1}^p\sigma^{ii}Y_i^2\Big) + \sum_{i=1}^p\Big(\sum_{j=1}^p\sigma^{ij}\mu_j\Big)Y_i$$
where $\Sigma^{-1} \equiv \|\sigma^{ij}\|$, revealing that this is a $k = p(p + 3)/2$ parameter exponential family with statistics $(Y_1, \ldots, Y_p, \{Y_iY_j\}_{1 \le i \le j \le p})$, $h(\mathbf{Y}) \equiv 1$, $\theta = (\mu, \Sigma)$, $B(\theta) = \tfrac{1}{2}(\log|\det(\Sigma)| + \mu^T\Sigma^{-1}\mu)$. By our supermodel discussion, if $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ are iid $N_p(\mu, \Sigma)$, then $\mathbf{X} \equiv (\mathbf{Y}_1, \ldots, \mathbf{Y}_n)^T$ follows the $k = p(p + 3)/2$ parameter exponential family with $\mathbf{T} = (\sum_i\mathbf{Y}_i, \sum_i\mathbf{Y}_i\mathbf{Y}_i^T)$, where we identify the second element of $\mathbf{T}$, which is a $p \times p$ symmetric matrix, with its distinct $p(p + 1)/2$ entries. It may be shown (Problem 1.6.29) that $\mathbf{T}$ (and $h \equiv 1$) generate this family and that the rank of the family is indeed $p(p + 3)/2$, generalizing Example 1.6.5, and that $\mathcal{E}$ is open, so that Theorem 1.6.4 applies. □

1.6.5 Conjugate Families of Prior Distributions

In Section 1.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. This is a special case of conjugate families of priors, families to which the posterior after sampling also belongs.

Suppose $X_1, \ldots, X_n$ is a sample from the k-parameter exponential family (1.6.10), and, as we always do in the Bayesian context, write $p(x \mid \theta)$ for $p(x, \theta)$. Then
$$p(\mathbf{x} \mid \theta) = \Big[\prod_{i=1}^n h(x_i)\Big]\exp\Big\{\sum_{j=1}^k\eta_j(\theta)\sum_{i=1}^n T_j(x_i) - nB(\theta)\Big\}, \qquad (1.6.18)$$
where $\theta \in \Theta$, which is k-dimensional. A conjugate exponential family is obtained from (1.6.18) by letting $n$ and $t_j = \sum_{i=1}^n T_j(x_i)$, $j = 1, \ldots, k$, be "parameters" and treating $\theta$ as the variable of interest. That is, let $\mathbf{t} = (t_1, \ldots, t_{k+1})^T$ and
$$\omega(\mathbf{t}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\Big\{\sum_{j=1}^k t_j\eta_j(\theta) - t_{k+1}B(\theta)\Big\}d\theta_1\cdots d\theta_k \qquad (1.6.19)$$
$$\Omega = \{(t_1, \ldots, t_{k+1}) : 0 < \omega(t_1, \ldots, t_{k+1}) < \infty\}$$
with integrals replaced by sums in the discrete case. We assume that $\Omega$ is nonempty (see Problem 1.6.36), then

Proposition 1.6.1. The (k + 1)-parameter exponential family given by
$$\pi_{\mathbf{t}}(\theta) = \exp\Big\{\sum_{j=1}^k\eta_j(\theta)t_j - t_{k+1}B(\theta) - \log\omega(\mathbf{t})\Big\} \qquad (1.6.20)$$
where $\mathbf{t} = (t_1, \ldots, t_{k+1}) \in \Omega$, is a conjugate prior to $p(\mathbf{x} \mid \theta)$ given by (1.6.18).

Proof. If $p(\mathbf{x} \mid \theta)$ is given by (1.6.18) and $\pi$ by (1.6.20), then
$$\pi(\theta \mid \mathbf{x}) \propto p(\mathbf{x} \mid \theta)\pi_{\mathbf{t}}(\theta) \propto \exp\Big\{\sum_{j=1}^k\eta_j(\theta)\Big(\sum_{i=1}^n T_j(x_i) + t_j\Big) - (t_{k+1} + n)B(\theta)\Big\} \propto \pi_{\mathbf{s}}(\theta), \qquad (1.6.21)$$
where
$$\mathbf{s} = (s_1, \ldots, s_{k+1})^T = \Big(t_1 + \sum_{i=1}^n T_1(x_i), \ldots, t_k + \sum_{i=1}^n T_k(x_i), t_{k+1} + n\Big)^T$$
and $\propto$ indicates that the two sides are proportional functions of $\theta$. Because two probability densities that are proportional must be equal, $\pi(\theta \mid \mathbf{x})$ is the member of the exponential family (1.6.20) given by the last expression in (1.6.21) and our assertion follows. □

Remark 1.6.1. Note that (1.6.21) is an updating formula in the sense that as data $x_1, \ldots, x_n$ become available, the parameter $\mathbf{t}$ of the prior distribution is updated to $\mathbf{s} = (\mathbf{t} + \mathbf{a})$, where $\mathbf{a} = (\sum_{i=1}^n T_1(x_i), \ldots, \sum_{i=1}^n T_k(x_i), n)^T$. □

It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.

Example 1.6.12. Suppose $X_1, \ldots, X_n$ is a $N(\theta, \sigma_0^2)$ sample, where $\sigma_0^2$ is known and $\theta$ is unknown. To choose a prior distribution for $\theta$, we consider the conjugate family of the model defined by (1.6.20). For $n = 1$
$$p(x \mid \theta) \propto \exp\Big\{\frac{\theta x}{\sigma_0^2} - \frac{\theta^2}{2\sigma_0^2}\Big\}. \qquad (1.6.22)$$
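This is a one-parameter exponential family with
$$T_1(x) = x, \quad \eta_1(\theta) = \frac{\theta}{\sigma_0^2}, \quad B(\theta) = \frac{\theta^2}{2\sigma_0^2}.$$
The conjugate two-parameter exponential family given by (1.6.20) has density
$$\pi_{\mathbf{t}}(\theta) = \exp\Big\{\frac{\theta}{\sigma_0^2}t_1 - \frac{\theta^2}{2\sigma_0^2}t_2 - \log\omega(t_1, t_2)\Big\}. \qquad (1.6.23)$$
Upon completing the square, we obtain
$$\pi_{\mathbf{t}}(\theta) \propto \exp\Big\{-\frac{t_2}{2\sigma_0^2}\Big(\theta - \frac{t_1}{t_2}\Big)^2\Big\}. \qquad (1.6.24)$$
Thus, $\pi_{\mathbf{t}}(\theta)$ is defined only for $t_2 > 0$ and all $t_1$ and is the $N(t_1/t_2, \sigma_0^2/t_2)$ density. Our conjugate family, therefore, consists of all $N(\eta_0, \tau_0^2)$ distributions where $\eta_0$ varies freely and $\tau_0^2$ is positive.

If we start with a $N(\eta_0, \tau_0^2)$ prior density, we must have in the $(t_1, t_2)$ parametrization
$$t_2 = \frac{\sigma_0^2}{\tau_0^2}, \quad t_1 = \frac{\eta_0\sigma_0^2}{\tau_0^2}. \qquad (1.6.25)$$
By (1.6.21), if we observe $\sum X_i = s$, the posterior has a density (1.6.23) with
$$t_2(n) = \frac{\sigma_0^2}{\tau_0^2} + n, \quad t_1(s) = \frac{\eta_0\sigma_0^2}{\tau_0^2} + s.$$
Using (1.6.24), we find that $\pi(\theta \mid \mathbf{x})$ is a normal density with mean
$$\mu(s, n) = \frac{t_1(s)}{t_2(n)} = \Big(\frac{\sigma_0^2}{\tau_0^2} + n\Big)^{-1}\Big[s + \frac{\eta_0\sigma_0^2}{\tau_0^2}\Big] \qquad (1.6.26)$$
and variance
$$\tau_0^2(n) = \frac{\sigma_0^2}{t_2(n)} = \Big(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\Big)^{-1}. \qquad (1.6.27)$$
Note that we can rewrite (1.6.26) intuitively as
$$\mu(s, n) = w_1\eta_0 + w_2\bar{x} \qquad (1.6.28)$$
where $w_1 = \tau_0^2(n)/\tau_0^2$, $w_2 = n\tau_0^2(n)/\sigma_0^2$ so that $w_2 = 1 - w_1$. □

These formulae can be generalized to the case $\mathbf{X}_i$ i.i.d. $N_p(\theta, \Sigma_0)$, $1 \le i \le n$, $\Sigma_0$ known, $\theta \sim N_p(\eta_0, \tau_0^2I)$ where $\eta_0$ varies over $R^p$, $\tau_0^2$ is scalar with $\tau_0 > 0$ and $I$ is the $p \times p$ identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the $N_p(\lambda, \Gamma)$ family with $\lambda \in R^p$ and $\Gamma$ symmetric positive definite is a conjugate family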
to $N_p(\theta, \Sigma_0)$, but a richer one than we've defined in (1.6.20) except for $p = 1$ because $N_p(\lambda, \Gamma)$ is a $p(p + 3)/2$ rather than a $p + 1$ parameter family. In fact, the conditions of Proposition 1.6.1 are often too restrictive. In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Discussion

Note that the uniform $\mathcal{U}(\{1, 2, \ldots, \theta\})$ model of Example 1.5.3 is not covered by this theory. The natural sufficient statistic $\max(X_1, \ldots, X_n)$, which is one-dimensional whatever be the sample size, is not of the form $\sum_{i=1}^n T(X_i)$. In fact, the family of distributions in this example and the family $\mathcal{U}(0, \theta)$ are not exponential. Despite the existence of classes of examples such as these, starting with Koopman, Pitman, and Darmois, a theory has been built up that indicates that under suitable regularity conditions families of distributions, which admit k-dimensional sufficient statistics for all sample sizes, must be k-parameter exponential families. Some interesting results and a survey of the literature may be found in Brown (1986). Problem 1.6.10 is a special result of this type.

Summary. $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^k$, is a k-parameter exponential family of distributions if there are real-valued functions $\eta_1, \ldots, \eta_k$ and $B$ on $\Theta$, and real-valued functions $T_1, \ldots, T_k$, $h$ on $R^q$ such that the density (frequency) function of $P_\theta$ can be written as
$$p(x, \theta) = h(x)\exp\Big[\sum_{j=1}^k\eta_j(\theta)T_j(x) - B(\theta)\Big], \quad x \in \mathcal{X} \subset R^q. \qquad (1.6.29)$$
$\mathbf{T}(X) = (T_1(X), \ldots, T_k(X))$ is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by $\mathbf{T}$ and $h$ is
$$q(x, \eta) = h(x)\exp\{\mathbf{T}^T(x)\eta - A(\eta)\}$$
where
$$A(\eta) = \log\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}h(x)\exp\{\mathbf{T}^T(x)\eta\}\,dx$$
in the continuous case, with integrals replaced by sums in the discrete case. The set
$$\mathcal{E} = \{\eta \in R^k : -\infty < A(\eta) < \infty\}$$
is called the natural parameter space. The set $\mathcal{E}$ is convex, the map $A : \mathcal{E} \to R$ is convex. If $\mathcal{E}$ has a nonempty interior $\mathcal{E}^0$ in $R^k$ and $\eta_0 \in \mathcal{E}^0$, then $\mathbf{T}(X)$ has for $X \sim P_{\eta_0}$ the moment-generating function
$$\psi(\mathbf{s}) = \exp\{A(\eta_0 + \mathbf{s}) - A(\eta_0)\}$$
for all $\mathbf{s}$ such that $\eta_0 + \mathbf{s}$ is in $\mathcal{E}$. Moreover $E_{\eta_0}[\mathbf{T}(X)] = \dot{A}(\eta_0)$ and $\mathrm{Var}_{\eta_0}[\mathbf{T}(X)] = \ddot{A}(\eta_0)$ where $\dot{A}$ and $\ddot{A}$ denote the gradient and Hessian of $A$.
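Looking back at the normal conjugate family of Example 1.6.12, the updating rule can be checked numerically. The sketch below is an illustration under assumptions made here: the function names and the data values are hypothetical, and the code simply transcribes the $(t_1, t_2)$ parametrization with $t_2 = \sigma_0^2/\tau_0^2$, $t_1 = \eta_0\sigma_0^2/\tau_0^2$, updated to $t_1 + s$ and $t_2 + n$. It verifies that the posterior mean is the weighted average $w_1\eta_0 + w_2\bar{x}$ and that updating in two batches gives the same posterior as updating once.

```python
def normal_posterior(xs, sigma0_sq, eta0, tau0_sq):
    # Posterior of theta for a N(theta, sigma0_sq) sample with a N(eta0, tau0_sq) prior,
    # computed through the (t1, t2) parametrization of the conjugate family.
    n = len(xs)
    s = sum(xs)
    t1 = eta0 * sigma0_sq / tau0_sq + s  # updated first coordinate
    t2 = sigma0_sq / tau0_sq + n         # updated second coordinate
    return t1 / t2, sigma0_sq / t2       # posterior mean and variance

def posterior_weights(n, sigma0_sq, tau0_sq):
    # Weights w1 (on the prior mean) and w2 (on the sample mean).
    post_var = 1.0 / (1.0 / tau0_sq + n / sigma0_sq)
    return post_var / tau0_sq, n * post_var / sigma0_sq

xs = [1.2, 0.7, 2.1, 1.5]  # made-up observations
sigma0_sq, eta0, tau0_sq = 1.0, 0.0, 4.0

post_mean, post_var = normal_posterior(xs, sigma0_sq, eta0, tau0_sq)
w1, w2 = posterior_weights(len(xs), sigma0_sq, tau0_sq)
xbar = sum(xs) / len(xs)
weighted = w1 * eta0 + w2 * xbar

# Two-stage updating: the posterior after the first batch serves as prior for the second.
m1, v1 = normal_posterior(xs[:2], sigma0_sq, eta0, tau0_sq)
m2, v2 = normal_posterior(xs[2:], sigma0_sq, m1, v1)

print(post_mean, weighted, m2)
print(post_var, v2)
```

The agreement between the one-stage and two-stage posteriors reflects the additive form of the update $\mathbf{s} = \mathbf{t} + \mathbf{a}$.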
An exponential family is said to be of rank k if $\mathbf{T}$ is k-dimensional and $1, T_1, \ldots, T_k$ are linearly independent with positive $P_\theta$ probability for some $\theta \in \Theta$. If $\mathcal{P}$ is a canonical exponential family with $\mathcal{E}$ open, then the following are equivalent:
(i) $\mathcal{P}$ is of rank k,
(ii) $\eta$ is identifiable,
(iii) $\mathrm{Var}_\eta(\mathbf{T})$ is positive definite,
(iv) the map $\eta \to \dot{A}(\eta)$ is 1-1 on $\mathcal{E}$,
(v) $A$ is strictly convex on $\mathcal{E}$.

A family $\mathcal{F}$ of prior distributions for a parameter vector $\theta$ is called a conjugate family of priors to $p(\mathbf{x} \mid \theta)$ if the posterior distribution of $\theta$ given $\mathbf{x}$ is a member of $\mathcal{F}$. The (k + 1)-parameter exponential family
$$\pi_{\mathbf{t}}(\theta) = \exp\Big\{\sum_{j=1}^k\eta_j(\theta)t_j - B(\theta)t_{k+1} - \log\omega\Big\}$$
where
$$\omega = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\Big\{\sum_{j=1}^k\eta_j(\theta)t_j - t_{k+1}B(\theta)\Big\}d\theta,$$
and
$$\mathbf{t} = (t_1, \ldots, t_{k+1}) \in \Omega = \{(t_1, \ldots, t_{k+1}) \in R^{k+1} : 0 < \omega < \infty\},$$
is conjugate to the exponential family $p(\mathbf{x} \mid \theta)$ defined in (1.6.29).

1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean $\mu$ and variance $\sigma^2$. He wishes to use his observations to obtain some information about $\mu$ and $\sigma^2$ but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain n independent determinations of a physical constant $\mu$. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
(c) In part (b) suppose that the amount of bias is positive but unknown. Can you perceive any difficulties in making statements about $\mu$ for this model?

(d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean $\lambda$. Once laid, each egg has an unknown chance p of hatching and the hatching of one egg is independent of the hatching of the others. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest.

2. Are the following parametrizations identifiable? (Prove or disprove.)

(a) The parametrization of Problem 1.1.1(c).

(b) The parametrization of Problem 1.1.1(d).

(c) The parametrization of Problem 1.1.1(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.

3. Which of the following parametrizations are identifiable? (Prove or disprove.)

(a) $X_1, \ldots, X_p$ are independent with $X_i \sim N(\alpha_i + \nu, \sigma^2)$. $\theta = (\alpha_1, \ldots, \alpha_p, \nu, \sigma^2)$ and $P_\theta$ is the distribution of $\mathbf{X} = (X_1, \ldots, X_p)$.

(b) Same as (a) with $\alpha = (\alpha_1, \ldots, \alpha_p)$ restricted to $\{(a_1, \ldots, a_p) : \sum_{i=1}^p a_i = 0\}$.

(c) X and Y are independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, $\theta = (\mu_1, \mu_2)$ and we observe $Y - X$.

(d) $X_{ij}$, $i = 1, \ldots, p$; $j = 1, \ldots, b$ are independent with $X_{ij} \sim N(\mu_{ij}, \sigma^2)$ where $\mu_{ij} = \nu + \alpha_i + \lambda_j$, $\theta = (\alpha_1, \ldots, \alpha_p, \lambda_1, \ldots, \lambda_b, \nu, \sigma^2)$ and $P_\theta$ is the distribution of $X_{11}, \ldots, X_{pb}$.

(e) Same as (d) with $(\alpha_1, \ldots, \alpha_p)$ and $(\lambda_1, \ldots, \lambda_b)$ restricted to the sets where $\sum_{i=1}^p\alpha_i = 0$ and $\sum_{j=1}^b\lambda_j = 0$.

4. (a) Let U be any random variable and V be any other nonnegative random variable. Show that
$$F_{U+V}(t) \le F_U(t) \text{ for every } t.$$
(If $F_X$ and $F_Y$ are distribution functions such that $F_X(t) \le F_Y(t)$ for every t, then X is said to be stochastically larger than Y.)

(b) As in Problem 1.1.1 describe formally the following model. Two groups of $n_1$ and $n_2$ individuals, respectively, are sampled at random from a very large population. Each
member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. Each member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour. It is known that the drug either has no effect or lowers blood pressure, but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.

5. The number n of graduate students entering a certain department is recorded. In each of k subsequent years the number of students graduating and of students dropping out is recorded. Let $N_i$ be the number dropping out and $M_i$ the number graduating during year i, $i = 1, \ldots, k$. The following model is proposed.
$$P_\theta[N_1 = n_1, M_1 = m_1, \ldots, N_k = n_k, M_k = m_k] = \frac{n!}{n_1!\cdots n_k!\,m_1!\cdots m_k!\,r!}\,\mu_1^{n_1}\cdots\mu_k^{n_k}\nu_1^{m_1}\cdots\nu_k^{m_k}\rho^r$$
where
$$\mu_1 + \cdots + \mu_k + \nu_1 + \cdots + \nu_k + \rho = 1, \quad 0 < \mu_i < 1, \quad 0 < \nu_i < 1, \quad 1 \le i \le k,$$
$$n_1 + \cdots + n_k + m_1 + \cdots + m_k + r = n$$
and $\theta = (\mu_1, \ldots, \mu_k, \nu_1, \ldots, \nu_k)$ is unknown.

(a) What are the assumptions underlying this model?

(b) $\theta$ is very difficult to estimate here if k is large. The simplification $\mu_i = \pi(1 - \mu)^{i-1}\mu$, $\nu_i = (1 - \pi)(1 - \nu)^{i-1}\nu$ for $i = 1, \ldots, k$ is proposed where $0 < \pi < 1$, $0 < \mu < 1$, $0 < \nu < 1$ are unknown. What assumptions underlie the simplification?

6. Which of the following models are regular? (Prove or disprove.)

(a) $P_\theta$ is the distribution of X when X is uniform on $(0, \theta)$, $\Theta = (0, \infty)$.

(b) $P_\theta$ is the distribution of X when X is uniform on $\{0, 1, 2, \ldots, \theta\}$, $\Theta = \{1, 2, \ldots\}$.

(c) Suppose $X \sim N(\mu, \sigma^2)$. Let $Y = 1$ if $X \le 1$ and $Y = X$ if $X > 1$. $\theta = (\mu, \sigma^2)$ and $P_\theta$ is the distribution of Y.

(d) Suppose the possible control responses in an experiment are $0.1, 0.2, \ldots, 0.9$ and they occur with frequencies $p(0.1), p(0.2), \ldots, p(0.9)$. Suppose the effect of a treatment is to increase the control response by a fixed amount $\theta$. Let $P_\theta$ be the distribution of a treatment response.

7. Show that $Y - c$ has the same distribution as $-Y + c$, if and only if, the density or frequency function p of Y satisfies $p(c + t) = p(c - t)$ for all t. Both Y and p are said to be symmetric about c.
Hint: If $Y - c$ has the same distribution as $-Y + c$, then $P(Y \le t + c) = P(-Y \le t - c) = P(Y \ge c - t) = 1 - P(Y < c - t)$.

8. Consider the two sample models of Examples 1.1.3(2) and 1.1.4(1).
(a) Show that if $Y \sim X + \delta(X)$, $\delta(x) = 2\mu + \Delta - 2x$ and $X \sim N(\mu, \sigma^2)$, then $G(\cdot) = F(\cdot - \Delta)$. That is, the two cases $\delta(x) \equiv \Delta$ and $\delta(x) = 2\mu + \Delta - 2x$ yield the same distribution for the data $(X_1, \ldots, X_n)$, $(Y_1, \ldots, Y_n)$. Therefore, $G(\cdot) = F(\cdot - \Delta)$ does not imply the constant treatment effect assumption.

(b) In part (a), suppose X has a distribution F that is not necessarily normal. For what type of F is it possible to have $G(\cdot) = F(\cdot - \Delta)$ for both $\delta(x) \equiv \Delta$ and $\delta(x) = 2\mu + \Delta - 2x$?

(c) Suppose that $Y \sim X + \delta(X)$ where $X \sim N(\mu, \sigma^2)$ and $\delta(x)$ is continuous. Show that if we assume that $\delta(x) + x$ is strictly increasing, then $G(\cdot) = F(\cdot - \Delta)$ implies that $\delta(x) \equiv \Delta$.

9. Collinearity: Suppose
$$Y_i = \sum_{j=1}^p z_{ij}\beta_j + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \text{ independent}, \quad 1 \le i \le n.$$
Let $\mathbf{z}_j \equiv (z_{1j}, \ldots, z_{nj})^T$.

(a) Show that $(\beta_1, \ldots, \beta_p)$ are identifiable iff $\mathbf{z}_1, \ldots, \mathbf{z}_p$ are not collinear (linearly independent).

(b) Deduce that $(\beta_1, \ldots, \beta_p)$ are not identifiable if $n < p$, that is, if the number of parameters is larger than the number of observations.

10. Let $\mathbf{X} = (\min(T, C), I(T \le C))$ where T, C are independent,
$$P[T = j] = p(j), \quad j = 0, \ldots, N,$$
$$P[C = j] = r(j), \quad j = 0, \ldots, N$$
and $(p, r)$ vary freely over $\mathcal{F} = \{(p, r) : p(j) > 0, r(j) > 0, 0 \le j \le N, \sum_{j=0}^N p(j) = 1, \sum_{j=0}^N r(j) = 1\}$ and N is known. Suppose $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are observed i.i.d. according to the distribution of $\mathbf{X}$. Show that $\{p(j) : j = 0, \ldots, N\}$, $\{r(j) : j = 0, \ldots, N\}$ are identifiable.
Hint: Consider "hazard rates" for $Y \equiv \min(T, C)$,
$$P[Y = j, Y = T \mid Y \ge j].$$

11. The Scale Model. Positive random variables X and Y satisfy a scale model with parameter $\delta > 0$ if $P(Y \le t) = P(\delta X \le t)$ for all $t > 0$, or equivalently, $G(t) = F(t/\delta)$, $\delta > 0$, $t > 0$.

(a) Show that in this case, $\log X$ and $\log Y$ satisfy a shift model with parameter $\log\delta$.

(b) Show that if X and Y satisfy a shift model with parameter $\Delta$, then $e^X$ and $e^Y$ satisfy a scale model with parameter $e^\Delta$.

(c) Suppose a scale model holds for X, Y. Let $c > 0$ be a constant. Does $X' = X^c$, $Y' = Y^c$ satisfy a scale model? Does $\log X'$, $\log Y'$ satisfy a shift model?

12. The Lehmann Two-Sample Model. In Example 1.1.3 let $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ denote the survival times of two groups of patients receiving treatments A and B. $S_X(t) =$
Also note that P(X > t) ~ Sort). t > 0. Goals. A proportional hazard model. .ho(t) is called the Cox proportional hazard model..2) is equivalent to Sy (t I z) = sf" (t). .l3. Show that hy(t) = c. (c) Under the assumptions of (b) above. h(t I Zi) = f(t I Zi)jSy(t I Zi).. 13. Tk > t all occur.. then Yi* = g({3. Then h.. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t).1) Equivalently.. Sy(t) = Sg..) Show that (1.) Show that Sy(t) = S'.l3. 12.G(t). Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). z) zT. (c) Suppose that T and Y have densities fort) andg(t).2) and that Fo(t) = P(T < t) is known and strictly increasing.12.(t). I (b) Assume (1. Hint: By Problem 8.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 .ll) with scale parameter 5.h. t > 0. show that there is an increasing function Q'(t) such that ifYt = Q'(Y.. F(t) and Sy(t) ~ P(Y > t) = 1 . Show that if So is continuous. Specify the distribution of €i.1.{a(t).. . I) distribution.7. hy(t) = c.z)}. = = (.). t > 0. 7. Zj) + cj for some appropriate €j. . Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. Survival beyond time t is modeled to occur if the events T} > t.7.I ~I .7. Moreover. respectively. I (3p)T of unknowns.(t) if and only if Sy(t) = Sg. . are called the survival functions. = exp{g(.. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. thus. " T k are unobservable and i. z)) (1. we have the Lehmann model Sy(t) = si..d.  .00).l3. I' 70 StCitistical Models. where TI".(t). So (T) has a U(O. Hint: See Problem 1. 
For treatments A and E. k = a and b.(t) with C. (b) By extending (bjn) from the rationals to 5 E (0. (1. .2. and Performance CriteriCl Chapter 1 t Ii .. 1 P(X > t) =1 (. as T with survival function So. log So(T) has an exponential distribution. Set C. ~ a5.i. The most common choice of 9 is the linear form g({3.
Find the median v and the mean J1 for the values of () where the mean exists.p(t)).v arbitrarily large. X n are independent with frequency fnnction p(x posterior1r(() I Xl. 1) distribution. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. Show that both .(x/c)e. In Example 1. (a) Suppose Zl' Z.2) distribution with .5) or mean I' = xdF(x) = Fl(u)du. . Here is an example in which the mean is extreme and the median is not. Consider a parameter space consisting of two points fh and ()2.2 with assumptions (t){4). Examples are the distribution of income and the distribution of wealth. Problems for Section 1.4 1 0." . and for this reason... . what parameters are? Hint: Ca) PIXI < tl = if>(.Section 1..000 is the minimum monthly salary for state workers..d. Find the . have a N(O. and Zl and Z{ are independent random variables.11 and 1. where () > 0 and c = 2.i. .X n ).G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R.2 1. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. Yl .1. 'I/1(±oo) = ±oo. J1 may be very much pulled in the direction of the longer tail of the density.L ..6 Let 1f be the prior frequency function of 0 defined by 1f(I1. and suppose that for given (). (b) Suppose X" . 15. wbere the model {(F. J1 and v are regarded as centers of the distribution F. (b) Suppose Zl and Z. 1f(O2 ) = ~.p and ~ are identifi· able..p and C.12. still identifiable? If not. ) = ~. the parameter of interest can be characterl ized as the median v = F.7 Problems and Complements 71 Hint: See Problems 1. . Let Xl. have a N(O.. G. 14.2 0.. '1/1' > 0. Are.2 unknown.O) 1 .1..d. Generally.Xm be i. Observe that it depends only on l:~ I Xi· I 0).1. the median is preferred in this case. Yn be i.i. When F is not symmetric.8 0.l (0. Ca) Find the posterior frequency function 1f(0 I x). . 
Show how to choose () to make J. F. x> c x<c 0. . Merging Opinions.
0 < x < O. (3) Suppose 8 has prior frequency.8). .'" 1rajI Show that J + a. . . " J = [2:. Assume X I'V p(x) = l:.. 7l" or 11"1..[X ~ k] ~ (1.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. Goals. (b) Find the posterior density of 8 when 71"( OJ = 302 . Let X I. 'IT . is used in the fonnula for p(x)? 2. " ••. X n be distributed as where XI. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . (3) Find the posterior density of 8 when 71"( 0) ~ 1.25. 4'2'4 (b) Relative to (a). ) ~ . . . . Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. does it matter which prior.0 < 0 < 1.. 0 (c) Find E(81 x) for the two priors in (a) and (b)... .O)kO. for given (J = B. . 2. k = 0. Unllormon {'13} . l3(r.75. 1r(J)= 'u . 71"1 (0 2 ) = .. c(a).. 1. For this convergence.. .. 7r (d) Give the values of P(B n = 2 and 100.2. where a> 1 and c(a) .2. X n are independent with the same distribution as X. This is called the geometric distribution (9(0)). .)=1.. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . . the outcome X has density p(x OJ ~ (2x/O'). what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta. Compare these B's for n = 2 and 100.. Let 71" denote a prior density for 8. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. 4. ~ = (h I L~l Xi ~ = . "X n ) = c(n 'n+a m) ' J=m.72 Statistical Models.=l1r(B i )p(x I 8i ).0 < B < 1.. ':.'" .. I XI . 3. }. Suppose that for given 8 = 0. 3. Find the posteriordensityof8 given Xl = Xl. (J. Show that the probability of this set tends to zero as n t 00. . . Then P.m+I. (d) Suppose Xl.Xn are natural numbers between 1 and Band e= {I.. Consider an experiment in which. 
< 0 < 1.Xn = X n when 1r(B) = 1.
.. . then !3( a.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. are independent standard exponential. XI = XI. is the standard normal distribution function and n _ T 2 x + n+r+s' a J.. Xn) + 5. In Example 1.8) that if in Example 1.. ." . 11'0) distribution. I Xn+k) given . 9.··. S.. If a and b are integers...f) = [~.. 6.• X n = Xn.J:n). Va' W t . where VI. Xn) = Xl = m for all 1 as n + 00 whatever be a. ZI.n.1. . .2.. Show in Example 1.. n. where E~ 1 Xi = k.1. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N .) = n+r+s Hint: Let!3( a. X n = Xn is that of (Y. Interpret this result. (b) Suppose that max(xl.1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. . Show rigorously using (1...2. s). . } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . D = NO has a B(N...'" .. b) denote the posterior distribution. Show that ?T( m I Xl. f3(r..b> 1.c(b... a regular model and integrable as a function of O..2. Show that a conjugate family of distributions for the Poisson family is the gamma family.0 E Let 9 have prior density 1r. Suppose Xl. .Section 1."'tF·t'.. 11'0) distribution. 7. (a) Show that the family of priors E. where {i E A and N E {l. 10. Next use the central limit theorem and Slutsky's theorem. b) is the distribution of (aV loW)[1 + (aV IbW)]t. Assume that A = {x : p( x I 8) > O} does not involve O. Let (XI. .7 Problems .2. X n+ Il ."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl. f(x I f). . . . .Xn is a sample with Xi '"'' p(x I (}).(1.1= n+r+s _ .and Complements 73 wherern = max(xl. Justify the following approximation to the posterior distribution where q.. 
X n+ k ) be a sample from a population with density f(x I 0). Show that the conditional distribution of (6. W. .
0 > O. = 1.O the posterior density 7t( 0 Ix). D(a)... the predictive distribution is the marginal distribution of X n + l . .. . Find I 12.. v > 0. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}.d.. T6) densities. . Find the posterior distribution 7f(() I x) and show that if>' is an integer.) pix I 0) ~ ({" .. N(I". Suppose p(x I 0) is the density of i.i.Xn + l arei. given (x. Letp(x I OJ =exp{(xO)}. Here HinJ: Given I" and ". Xl . 9= (OJ. . and () = a. (b) Discuss the behavior of the two predictive distributions as 15. i).J u. Q'j > 0.) u/' 0 < cx) L". 13. .d. I(x I (J). . 0> 0 a otherwise..9). "5) and N(Oo. 'V .O.d. . ./If (Ito. has density r("" a) IT • fa(u) ~ n' r(. j=1 '\' • = 1. < 1. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8.. 82 I fL. Note that. where Xi known. 52) of J1. The posterior predictive distribution is the conditional distribution of X n +l given Xl. 1 X n .~vO}. LO.Xn.~X) '" tnI.. 11. . given x. . 0 > O. . Next use Bayes rule..6 ""' 1r. x n ). .. Let N = (N l . (a) If f and 7t are the N(O. X 1.. . . x> 0. X and 5' are independent with X ~ N(I". posterior predictive distribution. vB has a XX distribution. unconditionally.. 14. 0<0.2 is (called) the precision of the distribution of Xi. j=1 • . .2) and we formally put 7t(I". 01 )=1 j=l Uj < 1. . .. (c) Find the posterior distribution of 0". (N. In a Bayesian model whereX l ..~l . The Dirichlet distribution is a conjugate prior for the multinomial. Q'r)T...i. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. The Dirichlet distribution.") ~ .. . X n .. . Show that if Xl.. . •• • . ."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x.)T.J t • I X.i... <0<x and let7t(O) = 2exp{20}. and Performance Criteria Chapter 1 + nand ({... L.1 < j < T. (. Goals.74 where N' ~ N Statistical Models. a = (a1. compute the predictive and n + 00. 8 2 ) is such that vn(p. N. 
O(t + v) has a X~+n distribution. X n are i.\ > 0. (12).) be multinomial M(n.
.. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. Suppose that in Example 1. Find the Bayes rule for case 2.3.3. . 1C(02) l' = 0. (b) Find the minimax rule among {b 1 . 0. 8 = 0 or 8 > 0 for some parameter 8.. Problems for Section 1. (c) Find the minimax rule among the randomized rules. or 0 > a be penoted by I. 159 for the preceding case (a)..3. = 11'. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St.. Find the Bayes rule when (i) 3. (c) Find the minimax rule among bt. I b"9 }. Suppose the possible states of nature are (Jl. . a) is given by 0. respectively and suppose .5. 1957) °< °= a 0..1.. 1C(O. a2.5 and (ii) l' = 0. the possible actions are al. (J) given by O\x a (I . Let the actions corresponding to deciding whether the loss function is given by (from Lehmann.159 be the decision rules of Table 1. . .3 IN ~ n) is V( a + n).7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a). and the loss function l((J.Section 1.3. . (J2.. 1.q) and let when at.5.1. (b)p=lq=. Compute and plot the risk points (a)p=q= .1.1.5. (d) Suppose that 0 has prior 1C(OIl (a). (d) Suppose 0 has prior 1C(01) ~ 1'. then the posteriof7r(0 where n = (nt l · · · .3.) = 0. n r ). The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O..p) (I .3. I. = 0. 159 of Table 1. See Example 1. a3. .
n = 1 and . 11 .. E.=l iii = N. 0 be chosen to make iiI unbiased? I (b) Neyman allocation. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s. Stratified sampling.) c<f>( y'n(r . ... Weassurne that the s samples from different strata are independent.s. I I I. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. Show that the strata sample sizes that minimize MSE(!i2) are given by (1.andastratumsamplemeanXj. We want to estimate the mean J1. < 00.I L LXij.g. b<f>(y'ns) + b<I>(y'nr). i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. I . 0<0 0=0 0 >0 R(O.d. 1) distribution function.. 1 < j < S. 1 < j < S. (a) Compute the biases. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. J". random variables Xlj"". Within the jth stratum we have a sample of i. < J' < s.j = I. 1 (l)r=s=l.8. and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J".Xnjj. Goals..7.) r=2s=1.76 StatistiCCII Models. iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. c<I>(y'n(sO))+b<I>(y'n(rO)).(X) 0 b b+c 0 c b+c b 0 where b and c are positive. variances. (.0)). .i. geographic locations or age groups).. Assume that 0 < a. and <I> is the N(O.0)) +b<f>( y'n(s . are known (estimates will be used in a latet chapter). Suppose X is aN(B. (b) Plot the risk function when b = c = 1. . . and MSEs of iiI and 'j1z. are known.j = 1. How should nj. where <f> = 1 <I>..3) .
.15. Suppose that X I.5(0 + I) and S ~ B(n.3) is N. ali X '"'' F. and = 1.i.. .3. I).. Hint: Use Problem 1.I 2:.3.=1 Pj(O'j where a. (c) Same as (b) except when n ~ 1..= 2:.3..2. ... p = .7 Problems and Complements 77 Hint: You may use a Lagrange multiplier. < ° P < 1.5.4. Suppose that n is odd.0 . . Each value has probability .p). We want to estimate "the" median lJ of F.b.) with nk given by (1.5..=1 pp7J" aV. > k) (ii) F is uniform. . The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = .. 5. Use a numerical integration package.20.. b = '.2. Next note that the distribution of X involves Bernoulli and multinomial trials.40.7. n ~ 1. set () = 0 without loss of generality. ~ ~ ~ . (d) Find EIX . (iii) F is normal. Let XI.45.3. X n be a sample from a population with values 0.13. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). X n are i. and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). l X n .. U(O..bl when n = compare it to EIX . Hint: See Problem B. a = .2.bl. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X).bl for the situation in (i). N(o. plot RR for p = .Section 1. ~ a) = P(X = c) = P. . . (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i.d. where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~. Let X and X denote the sample mean and median. Also find EIX .. Let Xb and X b denote the sample mean and the sample median of the sample XI b.. Hint: By Problem 1. 1).0 + '.0.5. ~ ~  ~ 6.1.5..2p. > 0. and that n is odd. .2. that X is the median of the sample. . [f the parameters of interest are the population mean and median of Xi . ° = 15.9. P(X ~ b) = 1 . (b) Evaluate RR when n = 1. respectively.'. .0 ~ + 2'.25. .2'. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c. . ~ 7. (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. 75.5.b. Hint: See Problem B.
[X . (a) Show that s:? = (n . . e 8 ~ (. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i.(o(X)).X)2 has a X~I distribution. (a)r = 8 =1 (b)r= ~s =1.Xn be a sample from a population with variance a 2 . ..J. and only if. Show that the value of c that minimizes M S E(c/. I I 78 Statistical Models.10) + (. .. I (b) Show that if we use the 0 1 loss function in testing. then this definition coincides with the definition of an unbiased estimate of e.X)' = ([Xi .1)1 E~ 1 (Xi .0) = E.0): 6 E eo}.X)2. . You may use the fact that E(X i .  . satisfies > sup{{J(6.3(a) with b = c = 1 and n = I. Let Xl. (a) Show that if 6 is real and 1(6. (b) Suppose Xi _ N(I'.3) that . 6' E e. .2 ~~ I (Xi ..0(X))) < E..1'])2. I .4. If the true 6 is 60 . 10.1'] . A decision rule 15 is said to be unbiased if E. ~ 2(n _ 1)1. a) = (6 .3. = c L~ 1 (Xi . A person in charge of ordering equipment needs to estimate and uses I . Which one of the rules is the better one from the Bayes point of view? 11. defined by (J(6.. 9.(1(6'.2)(. and Performance Criteria Chapter 1 :! .) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B.L)4 = 3(12. ' . 0 < a 2 < 00. 0) (J(6'.X)2 keeping the square brackets intact. being lefthanded). Goals. In Problem 1.a)2. for all 6' E 8 1. then a test function is unbiased in this sense if. i1 .. then expand (Xi .3. Find MSE(6) and MSE(P)..(1(6.X)2 is an unbiased estimator of u 2 • Hint: Write (Xi .for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100. . the powerfunction. Let () denote the proportion of people working in a company who have a certain characteristic (e.0(X))) for all 6. .'). II. 10% have the characteristic.g. It is known that in the state where the company is located.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. I . (i) Show that MSE(S2) (ii) Let 0'6 c~ . 8.
The interpretation of <p is the following. In Example 1.a )R(B.3. 0. Your answer should Mw If n.3. use the test J u . procedures.1. If X ~ x and <p(x) = 0 we decide 90. Suppose that Po. Show that the procedure O(X) au is admissible. In Example 1.1.1) and let Ok be the decision rule "reject the shipment iff X > k. o Suppose that U . Consider the following randomized test J: Observe U. and then JT. there is a randomized procedure 03 such that R(B.ao) ~ O.4.Polo(X) ~ 01 = Eo(<p(X)). and possible actions at and a2.) + (1 .7 Problems and Complements 79 12. Po[o(X) ~ 11 ~ 1 . B = . 16. consider the estimator < MSE(X). s = r = I.3. (h) If N ~ 10.3. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1.. If U = u.3. Show that J agrees with rp in the sense that. find the set of I' where MSE(ji) depend on n.1. Consider a decision problem with the possible states of nature 81 and 82 .1'0 I· 17.4. For Example 1.UfO. Show that if J 1 and J 2 are two randomized. a) is . if <p(x) = 1." (a) Show that the risk is given by (1. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. (b) find the minimum relative risk of P... and k o ~ 3. Suppose the loss function feB. plot R(O. by 1 if<p(X) if <p(X) >u < u. consider the loss function (1. 15.~ is unbiased 13. . (c) Same as (b) except k = 2. Ok) as a function of B. b.. z > 0. Furthersuppose that l(Bo.) for all B. but if 0 < <p(x) < 1. In Problem 1.wo to X. 0.3. (a) find the value of Wo that minimizes M SE(Mw). we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise.Section 1.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known. show that if c ::. Suppose that the set of decision procedures is finite. 1) and is independent of X. given 0 < a < 1. . 14. 0 < u < 1. wc dccide El. Compare 02 and 03' 19.7). then. Define the nonrandomized test Ju . Convexity ofthe risk set.' and 0 = II' . 03) = aR(B. 18.
7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. An urn contains four red and four black balls. !. and the ratio of its MSPE to that of the best and best linear predictors. (a) Find the best predictor of Y given Z. and the best zero intercept linear predictor. 7. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y.8 0.l. its MSPE. Let X be a random variable with probability function p(x ! B) O\x 0. 3.) the Bayes decision rule? Problems for Section 1. 6. Give the minimax rule among the randomized decision rules. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly. . 2. !. the best linear predictor. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z). Goals. U2 be independent standard normal random variables and set Z Y = U I. and Performance Criteria Chapter 1 B\a 0.) = 0. al a2 0 3 2 I 0. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.(0. (c) Suppose 0 has the prior distribution defined by .9. In Problem B.(0. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. Give an example in which Z can be used to predict Y perfectly.Is Z of any value in predicting Y? = Ur + U?.6 (a) Compute and plot the risk points of the nonrandomized decision rules.2 0. (b) Give and plot the risk set S. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.. 4.1 calculate explicitly the best zero intercept linear predictor.4. What is 1. Give the minimax rule among the nonrandomized decision rules. In Example 1.1. 0. 0 0.4 = 0. Let U I . Four balls are drawn at random without replacement...4 I 0. 5.80 Statistical Models. 
(b) Compute the MSPEs of the predictors in (a).
Define Z = Z2 and Y = Zl + Z l Z2. Show that if (Z. (a) Show that E(IY .cl) = oQ[lc . Y) has a bivariate normal distribution. 12. a 2. 0. p) distribution and Z". 10.col < eo. (b) Show directly that I" minimizes E(IY . where (ZI .t. that is. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13.el) as a function of c. Suppose that Z has a density p.2 ) distrihution. Y" be the corresponding genetic and environmental components Z = ZJ + Z". (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. Z". and otherwise. If Y and Z are any two random variables.PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. exhibit a best predictor of Y given Z for mean absolute prediction error.Section 1.YI > s. Y" are N(v. Suppose that Z has a density p. Y)I < Ipl. Let ZI. that is. ° 14. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor.  = ElY cl + (c  co){P[Y > co] . which is symmetric about c. z > 0. Let Y have a N(I". (b) Suppose (Z. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. ICor(Z. y l ).1"1/0] where Q(t) = 2['I'(t) + t<I>(t)].L.z) for 11. Suppose that Z. p( z) is nonincreasing for z > c. Yare measurements on such a variable for a randomly selected father and son. p( c all z. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Y'.7 Problems and Complements 81 Hint: If c ElY . Let Zl and Zz be independent and have exponential distributions with density Ae\z. a 2. Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . Show that c is a median of Z. Y = Y' + Y". Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» . 9. 7 2 ) variables independent of each other and of (ZI. 
which is symmetric about c and which is unimodal. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Y') have a N(p" J. + z) = p(c .
Goals.3. and Var( Z.S from infection until detection. and Doksurn and Samarov./L£(Z)) ~ maxgEL Corr'(Y. and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. y> O.g(Z)) where £ is the set of 17.15. >. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. on estimation of 1]~y. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z./L)/<J and Y ~ /3Yo/<J. I. mvanant un d ersuchh. and is diagnosed with a certain disease. 82 Statisflcal Models. a blood cell or viral load measurement) is obtained. Show that Var(/L(Z))/Var(Y) = Corr'(Y.) ~ 1/>''.) = E( Z. Let /L(z) = E(Y I Z ~ z). We are interested in the time Yo = t . 1995.g./LdZ) be the linear prediction error. Consider a subject who walks into a clinic today. then'fJh(Z)Y =1JZY.4. Predicting the past from the present.: (0 The best linear MSPE predictor of Y based on Z = z. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. 2 'T ' . . .4. Show that p~y linear predictors..) ~ 1/>. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y. > 0.exp{ >./L(Z)) = max Corr'(Y. hat1s.) = Var( Z. j3 > 0. Show that. .1] . in the linear model of Remark 1. 1).(y) = >. 1905. (c) Let '£ = Y . g(Z)) 9 where g(Z) stands for any predictor. . Let S be the unknown date in the past when the sUbject was infected. IS. 16.) (a) Show that 1]~y > piy. and Performance Criteria Chapter 1 .I • ! . At the same time t a diagnostic indicator Zo of the severity of the disease (e.ElY . that is.4./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). at time t. IS. It will be convenient to rescale the problem by introducing Z = (Zo .4. Hint: See Problem 1. Hint: Recall that E( Z. ~~y = 1 . Yo > O. . 
One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y.y}. where piy is the population multiple correlation coefficient of Remark 1.. = Corr'(Y. (See Pearson. . .
6) when r = s. s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. (a) Show that ifCov[r(Y).I exp { . Cov[r(Y).z). 7r(Y I z) ~ (27r) . 1). and Nonnand and Doksum. (c) The optimal predictor of Y given X = x. Z2 and W are independent with finite variances. s(Y)] < 00. 19. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.s(Y)] Covlr(Y).. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. x = 1.z) .4.4. Write Cov[r(Y). then = E{Cov[r(Y).4.g(Zo)l· Hint: See Problems 1.. This density is called the truncated (at zero) normal.>.9.. see Berman. 2000). 20. Y2) using (a). where X is the number of spots showing when a fair die is rolled. . 6. N(z .Section 1. Hint: Use Bayes rule.>. Let Y be the number of heads showing when X fair coins are tossed. (b) Show that (a) is equivalent to (1. density. Establish 1. b). (In practice. (b) The MSPE of the optimal predictor of Y based on X. . Find Corr(Y}. . c.~ [y  (z . Find (a) The mean and variance of Y. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). solving for (a. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo .4. where Z}.(>' .>')1 2 } Y> ° where c = 4>(z . all the unknowns. including the "prior" 71". 1990. wbere Y j .7 and 1. s(Y» given Z = z.. Let Y be a vector and let r(Y) and s(Y) be real valued. (c) Show that if Z is real. need to be estimated from cohort studies. b) equal to zero. 21. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo.). and checking convexity. I Z]' Z}.. + b2 Z 2 + W.14 by setting the derivatives of R(a.7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density .
5 1. and (ii) w(y. 0 > 0. Zz and ~V have the same variance (T2. show that the MSPE of the optimal predictor is . n > 2. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. z) = I.l .. .e)' = Y' . 0 > 0. This is known as the Weibull density.p~y).84 Statistical Models. .15) yields (1.e' < (Y . (a) p(x. Oax"l exp( Ox"). a > O. we say that there is a 50% overlap between Y 1 and Yz. In this case what is Corr(Y.x > 0. . and Performance Criteria Chapter 1 (e) In the preceding model (d). suppose that Zl and Z.6(r. Let Xi = 1 if the ith item drawn is bad. s). In Example 1.2eY + e' < 2(Y' + e'). if b1 = b2 and Zl. z) and c is the constant that makes Po a density. Y z )? (f) In model (d). Find Po(Z) when (i) w(y. 3.z). Assume that Eow (Y. Find the 22. Goals. > a.g(z») is called weighted squared prediction error.4.4. optimal predictor of Yz given (Yi. . 1 2 y' .0> O.~(1 . . a > O..14). 1. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8. Problems for Section 1.g(z)]'lw(y. and suppose that Z has the beta. z) = z(1 .z) be a positive realvalued function.z). Zz).. This is thebeta. Y ~ B(n. (a) Let w(y.6(O..Xn is a sample from a population with one of the following densities. z) ~ ep(y.density. 0) = Ox'.z)lw(y.. and = 0 otherwise. where Po(Y. 23. Show that EY' < 00 if and only if E(Y . 25. < 00 for all e.') and W ~ N(po. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. 0 (b) p(x. population where e > O.3. Then [y . . z "5). g(Z)) < for some 9 and that Po is a density.. p(e). Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z). 2. are N(p. density. 00 (b) Suppose that given Z = z. 24.4. . \ (b) Establish the same result using the factorization theorem. Let Xl. Verify that solving (1. x  . 1 X n be a sample from a Poisson. 1).e)' Hint: Whatever be Y and c. Suppose Xl. 0) = < X < 1. (c) p(x. 0 < z < I. 0) = Oa' Ix('H).z) = 6w (y. .
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for $\theta$, $a$ fixed.

4. (a) Show that $T_1$ and $T_2$ are equivalent statistics if, and only if, we can write $T_2 = H(T_1)$ for some 1-1 transformation $H$ of the range of $T_1$ into the range of $T_2$. Which of the following statistics are equivalent? (Prove or disprove.)
(b) $\prod_{i=1}^n x_i$ and $\sum_{i=1}^n \log x_i$, $x_i > 0$
(c) $\sum_{i=1}^n x_i$ and $\sum_{i=1}^n \log x_i$, $x_i > 0$
(d) $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^2\right)$ and $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n (x_i - \bar{x})^2\right)$
(e) $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n x_i^3\right)$ and $\left(\sum_{i=1}^n x_i, \sum_{i=1}^n (x_i - \bar{x})^3\right)$.
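As a hedged numerical companion to Problem 4, the sketch below illustrates what "equivalent" means for pair (d) and why pair (c) fails. The function names are illustrative and not from the text, and the code only illustrates the claims on small samples; it does not prove them.

```python
def raw_stats(xs):
    """Return (sum x_i, sum x_i^2)."""
    return sum(xs), sum(x * x for x in xs)


def centered_stats(xs):
    """Return (sum x_i, sum (x_i - xbar)^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum(xs), sum((x - xbar) ** 2 for x in xs)


def raw_to_centered(s1, s2, n):
    """Map (sum x, sum x^2) to (sum x, sum (x - xbar)^2) for sample size n."""
    return s1, s2 - s1 * s1 / n


def centered_to_raw(s1, ss, n):
    """Inverse map, so the transformation in (d) is 1-1 for fixed n."""
    return s1, ss + s1 * s1 / n


# Pair (c): two samples with the same sum but different products,
# so sum(log x_i) cannot be written as a function of sum(x_i).
sample_a = [1.0, 4.0]   # sum 5, product 4
sample_b = [2.0, 3.0]   # sum 5, product 6
```

The forward and inverse maps for (d) are explicit, which is the kind of function $H$ that part (a) asks for.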
5. Let $\theta = (\theta_1, \theta_2)$ be a bivariate parameter. Suppose that $T_1(X)$ is sufficient for $\theta_1$ whenever $\theta_2$ is fixed and known, whereas $T_2(X)$ is sufficient for $\theta_2$ whenever $\theta_1$ is fixed and known. Assume that $\theta_1, \theta_2$ vary independently, $\theta_1 \in \Theta_1$, $\theta_2 \in \Theta_2$, and that the set $S = \{x : p(x, \theta) > 0\}$ does not depend on $\theta$.
(a) Show that if $T_1$ and $T_2$ do not depend on $\theta_2$ and $\theta_1$, respectively, then $(T_1(X), T_2(X))$ is sufficient for $\theta$.
(b) Exhibit an example in which $(T_1(X), T_2(X))$ is sufficient for $\theta$, $T_1(X)$ is sufficient for $\theta_1$ whenever $\theta_2$ is fixed and known, but $T_2(X)$ is not sufficient for $\theta_2$ when $\theta_1$ is fixed and known.

6. Let $X$ take on the specified values $v_1, \ldots, v_k$ with probabilities $\theta_1, \ldots, \theta_k$, respectively. Suppose that $X_1, \ldots, X_n$ are independently and identically distributed as $X$. Suppose that $\theta = (\theta_1, \ldots, \theta_k)$ is unknown and may range over the set $\Theta = \{(\theta_1, \ldots, \theta_k) : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^k \theta_i = 1\}$. Let $N_j$ be the number of $X_i$ which equal $v_j$.
(a) What is the distribution of $(N_1, \ldots, N_k)$?
(b) Show that $N = (N_1, \ldots, N_{k-1})$ is sufficient for $\theta$.

7. Let $X_1, \ldots, X_n$ be a sample from a population with density $p(x, \theta)$ given by

$p(x, \theta) = \frac{1}{\sigma} \exp\left\{-\frac{x - \mu}{\sigma}\right\}$ if $x \ge \mu$
$\phantom{p(x, \theta)} = 0$ otherwise.

Here $\theta = (\mu, \sigma)$ with $-\infty < \mu < \infty$, $\sigma > 0$.
(a) Show that $\min(X_1, \ldots, X_n)$ is sufficient for $\mu$ when $\sigma$ is fixed.
(b) Find a one-dimensional sufficient statistic for $\sigma$ when $\mu$ is fixed.
(c) Exhibit a two-dimensional sufficient statistic for $\theta$.

8. Let $X_1, \ldots, X_n$ be a sample from some continuous distribution $F$ with density $f$, which is unknown. Treating $f$ as a parameter, show that the order statistics $X_{(1)}, \ldots, X_{(n)}$ (cf. Problem B.2.8) are sufficient for $f$.
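9. Let $X_1, \ldots, X_n$ be a sample from a population with density

$f_\theta(x) = a(\theta) h(x)$ if $\theta_1 \le x \le \theta_2$
$\phantom{f_\theta(x)} = 0$ otherwise

where $h(x) > 0$, $\theta = (\theta_1, \theta_2)$ with $-\infty < \theta_1 < \theta_2 < \infty$, and $a(\theta) = \left[\int_{\theta_1}^{\theta_2} h(x)\,dx\right]^{-1}$ is assumed to exist. Find a two-dimensional sufficient statistic for this problem and apply your result to the $U[\theta_1, \theta_2]$ family of distributions.

10. Suppose $X_1, \ldots, X_n$ are i.i.d. with density $f(x, \theta) = \frac{1}{2} e^{-|x - \theta|}$. Show that $(X_{(1)}, \ldots, X_{(n)})$, the order statistics, are minimal sufficient.
Hint: $\frac{\partial}{\partial \theta} \log L_X(\theta) = \sum_{i=1}^n \mathrm{sgn}(X_i - \theta)$, $\theta \notin \{X_1, \ldots, X_n\}$, which determines $X_{(1)}, \ldots, X_{(n)}$.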
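The hint to Problem 10 can be checked numerically. The sketch below is only an illustration, with made-up data and illustrative function names. It compares the sign-sum formula for the derivative of the log likelihood with a central finite difference at points $\theta$ that are not observations.

```python
import math


def log_lik(theta, xs):
    """Log likelihood for the double exponential density (1/2) exp(-|x - theta|)."""
    return -len(xs) * math.log(2.0) - sum(abs(x - theta) for x in xs)


def score(theta, xs):
    """Sum of sgn(x_i - theta); valid when theta is not one of the x_i."""
    return sum((x > theta) - (x < theta) for x in xs)


def numeric_derivative(theta, xs, h=1e-6):
    """Central finite-difference approximation to d/dtheta log_lik."""
    return (log_lik(theta + h, xs) - log_lik(theta - h, xs)) / (2.0 * h)
```

The score is a step function of $\theta$ that drops by 2 at each observation, which is why knowing it for all $\theta$ pins down the order statistics.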
11. Let $X_1, X_2, \ldots, X_n$ be a sample from the uniform, $U(0, \theta)$, distribution. Show that $X_{(n)} = \max\{X_i : 1 \le i \le n\}$ is minimal sufficient for $\theta$.
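A small sketch may help with Problem 11. Assuming nonnegative observations, the $U(0, \theta)$ likelihood is $\theta^{-n}$ when $\theta \ge \max_i x_i$ and $0$ otherwise, so two samples of the same size with the same maximum give the same likelihood function. The helper name below is illustrative, and this shows sufficiency only, not minimality.

```python
def uniform_likelihood(theta, xs):
    """Likelihood of a U(0, theta) sample: theta^(-n) if all x_i lie in [0, theta]."""
    if theta <= 0 or min(xs) < 0 or max(xs) > theta:
        return 0.0
    return theta ** (-len(xs))


# Two hypothetical samples of size 3 sharing the same maximum 0.9.
sample_a = [0.2, 0.9, 0.5]
sample_b = [0.9, 0.1, 0.3]
```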
12. Dynkin, Lehmann, Scheffé's Theorem. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ where $P_\theta$ is discrete, concentrated on $\mathcal{X} = \{x_1, x_2, \ldots\}$. Let $p(x, \theta) = P_\theta[X = x] = L_x(\theta) > 0$ on $\mathcal{X}$. Show that $\frac{L_X(\cdot)}{L_X(\theta_0)}$ is minimal sufficient.
Hint: Apply the factorization theorem.
13. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a population with continuous distribution function $F(x)$. If $F(x)$ is $N(\mu, \sigma^2)$, $T(X) = (\bar{X}, \hat\sigma^2)$, where $\hat\sigma^2 = n^{-1} \sum (X_i - \bar{X})^2$, is sufficient, and $S(X) = (X'_{(1)}, \ldots, X'_{(n)})$, where $X'_{(i)} = (X_{(i)} - \bar{X})/\hat\sigma$, is "irrelevant" (ancillary) for $(\mu, \sigma^2)$. However, $S(X)$ is exactly what is needed to estimate the "shape" of $F(x)$ when $F(x)$ is unknown. The shape of $F$ is represented by the equivalence class $\mathcal{F} = \{F((\cdot - a)/b) : b > 0, a \in R\}$. Thus a distribution $G$ has the same shape as $F$ iff $G \in \mathcal{F}$. For instance, one "estimator" of this shape is the scaled empirical distribution function

$\hat{F}_s(x) = j/n$, $x'_{(j)} \le x < x'_{(j+1)}$, $j = 1, \ldots, n-1$
$\phantom{\hat{F}_s(x)} = 0$, $x < x'_{(1)}$
$\phantom{\hat{F}_s(x)} = 1$, $x \ge x'_{(n)}$.

Show that for fixed $x$, $\hat{F}_s((x - \bar{x})/\hat\sigma)$ converges in probability to $F(x)$. Here we are using $F$ to represent $\mathcal{F}$ because every member of $\mathcal{F}$ can be obtained from $F$.
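The scaled empirical distribution function of Problem 13 is easy to compute. The sketch below assumes the standardization uses the sample mean and the divisor-$n$ standard deviation; the function names are illustrative. It also illustrates that the standardized order statistics do not change under a shift and positive rescaling of the data.

```python
import math


def standardized_order_stats(xs):
    """Order statistics after centering at xbar and scaling by sigma-hat (divisor n)."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return [(x - xbar) / sigma_hat for x in sorted(xs)]


def scaled_edf(t, xs):
    """Proportion of standardized observations that are <= t."""
    z = standardized_order_stats(xs)
    return sum(1 for v in z if v <= t) / len(z)
```

For example, `scaled_edf(0.0, [2.0, 4.0, 6.0, 8.0])` counts the two standardized values below zero and returns 0.5, and the same value results for the transformed data `[10 * x + 3 for x in xs]`.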
14. Kolmogorov's Theorem. We are given a regular model with $\Theta$ finite.

(a) Suppose that a statistic $T(X)$ has the property that for any prior distribution on $\theta$, the posterior distribution of $\theta$ depends on $x$ only through $T(x)$. Show that $T(X)$ is sufficient.

(b) Conversely, show that if $T(X)$ is sufficient, then, for any prior distribution, the posterior distribution depends on $x$ only through $T(x)$.
Hint: Apply the factorization theorem.
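Part (b) of Problem 14 can be illustrated on a toy case. The sketch below assumes a Bernoulli model with a three-point parameter set and a uniform prior, both chosen here only for illustration. It computes exact posteriors with rational arithmetic and shows that two samples with the same value of $T(x) = \sum x_i$ give the same posterior.

```python
from fractions import Fraction


def posterior(xs, thetas, prior):
    """Exact posterior over a finite parameter set for i.i.d. Bernoulli(theta) data."""
    weights = []
    for theta, prior_mass in zip(thetas, prior):
        likelihood = Fraction(1)
        for x in xs:
            likelihood *= theta if x == 1 else 1 - theta
        weights.append(prior_mass * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical finite parameter set and prior, not from the text.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3
```

Changing the order of the observations while keeping their sum leaves the output unchanged, as the factorization theorem predicts.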
15. Let $X_1, \ldots, X_n$ be a sample from $f(x - \theta)$, $\theta \in R$. Show that the order statistics are minimal sufficient when $f$ is the Cauchy density $f(t) = 1/[\pi(1 + t^2)]$.

16. Let $X_1, \ldots, X_m$; $Y_1, \ldots, Y_n$ be independently distributed according to $N(\mu, \sigma^2)$ and $N(\eta, \tau^2)$, respectively. Find minimal sufficient statistics for the following three cases:

(i) $\mu, \eta, \sigma, \tau$ are arbitrary: $-\infty < \mu, \eta < \infty$, $0 < \sigma, \tau$.

(ii) $\sigma = \tau$ and $\mu, \eta, \sigma$ are arbitrary.

(iii) $\mu = \eta$ and $\mu, \sigma, \tau$ are arbitrary.
17. In Example 1.5.4, express $t_1$ as a function of $L_x(0, 1)$ and $L_x(1, 1)$.

Problems for Section 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose $X_1, \ldots, X_n$ is as in Problem 1.5.3. In each of the cases (a), (b), and (c), show that the distribution of $X$ forms a one-parameter exponential family. Identify $\eta$, $B$, $T$, and $h$.

3. Let $X$ be the number of failures before the first success in a sequence of Bernoulli trials with probability of success $\theta$. Then $P_\theta[X = k] = (1 - \theta)^k \theta$, $k = 0, 1, 2, \ldots$ This is called the geometric distribution ($\mathcal{G}(\theta)$).
(a) Show that the family of geometric distributions is a one-parameter exponential family with $T(x) = x$.
(b) Deduce from Theorem 1.6.1 that if $X_1, \ldots, X_n$ is a sample from $\mathcal{G}(\theta)$, then the distributions of $\sum_{i=1}^n X_i$ form a one-parameter exponential family.
(c) Show that $\sum_{i=1}^n X_i$ in part (b) has a negative binomial distribution with parameters $(n, \theta)$ defined by

$P_\theta\left[\sum_{i=1}^n X_i = k\right] = \binom{n + k - 1}{k} (1 - \theta)^k \theta^n$, $k = 0, 1, 2, \ldots$

(The negative binomial distribution is that of the number of failures before the $n$th success in a sequence of Bernoulli trials with probability of success $\theta$.)
Hint: By Theorem 1.6.1, $P_\theta\left[\sum_{i=1}^n X_i = k\right] = c_k (1 - \theta)^k \theta^n$, $0 < \theta < 1$. If

$\sum_{k=0}^\infty c_k \omega^k = \frac{1}{(1 - \omega)^n}$, $0 < \omega < 1$, then $c_k = \frac{1}{k!} \frac{d^k}{d\omega^k} (1 - \omega)^{-n} \Big|_{\omega = 0}$.
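The claim in Problem 3(c) can be sanity-checked with exact arithmetic. The sketch below, with illustrative function names, convolves $n$ geometric frequency functions and compares the result with the negative binomial formula. It is a numerical check for particular $n$ and $\theta$, not a proof.

```python
from fractions import Fraction
from math import comb


def geometric_pmf(k, theta):
    """P[X = k] = (1 - theta)^k * theta, failures before the first success."""
    return (1 - theta) ** k * theta


def sum_pmf_by_convolution(n, theta, kmax):
    """Exact pmf of X_1 + ... + X_n on 0..kmax by repeated convolution."""
    pmf = [Fraction(1)] + [Fraction(0)] * kmax  # the empty sum equals 0
    for _ in range(n):
        new = [Fraction(0)] * (kmax + 1)
        for i, p in enumerate(pmf):
            if p == 0:
                continue
            for j in range(kmax + 1 - i):
                new[i + j] += p * geometric_pmf(j, theta)
        pmf = new
    return pmf


def negative_binomial_pmf(k, n, theta):
    """C(n + k - 1, k) * (1 - theta)^k * theta^n."""
    return comb(n + k - 1, k) * (1 - theta) ** k * theta ** n
```

Truncating at `kmax` does not affect the values returned for $k \le$ `kmax`, because every summand is nonnegative.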
4. Which of the following families of distributions are exponential families? (Prove or disprove.)
(a) The $U(0, \theta)$ family
(b) $p(x, \theta) = \{\exp[-2 \log \theta + \log(2x)]\} 1[x \in (0, \theta)]$
(c) $p(x, \theta) = \frac{1}{9}$, $x \in \{0.1 + \theta, \ldots, 0.9 + \theta\}$
(d) The $N(\theta, \theta^2)$ family, $\theta > 0$
(e) $p(x, \theta) = 2(x + \theta)/(1 + 2\theta)$, $0 < x < 1$, $\theta > 0$
(f) $p(x, \theta)$ is the conditional frequency function of a binomial, $B(n, \theta)$, variable $X$, given that $X > 0$.
5. Show that the following families of distributions are two-parameter exponential families and identify the functions $\eta$, $B$, $T$, and $h$.
(a) The beta family.
(b) The gamma family.

6. Let $X$ have the Dirichlet distribution, $D(\alpha)$, of Problem 1.2.15. Show that the distributions of $X$ form an $r$-parameter exponential family and identify $\eta$, $B$, $T$, and $h$.

7. Let $X = ((X_1, Y_1), \ldots, (X_n, Y_n))$ be a sample from a bivariate normal population. Show that the distributions of $X$ form a five-parameter exponential family and identify $\eta$, $B$, $T$, and $h$.

8. Show that the family of distributions of Example 1.5.3 is not a one-parameter exponential family.
Hint: If it were, there would be a set $A$ such that $p(x, \theta) > 0$ on $A$ for all $\theta$.
9. Prove the analogue of Theorem 1.6.1 for discrete $k$-parameter exponential families.

10. Suppose that $f(x, \theta)$ is a positive density on the real line, which is continuous in $x$ for each $\theta$ and such that if $(X_1, X_2)$ is a sample of size 2 from $f(\cdot, \theta)$, then $X_1 + X_2$ is sufficient for $\theta$. Show that $f(\cdot, \theta)$ corresponds to a one-parameter exponential family of distributions with $T(x) = x$.
Hint: There exist functions $g(t, \theta)$, $h(x_1, x_2)$ such that $\log f(x_1, \theta) + \log f(x_2, \theta) = g(x_1 + x_2, \theta) + h(x_1, x_2)$. Fix $\theta_0$ and let $r(x, \theta) = \log f(x, \theta) - \log f(x, \theta_0)$, $q(x, \theta) = g(x, \theta) - g(x, \theta_0)$. Then, $q(x_1 + x_2, \theta) = r(x_1, \theta) + r(x_2, \theta)$, and hence, $[r(x_1, \theta) - r(0, \theta)] + [r(x_2, \theta) - r(0, \theta)] = r(x_1 + x_2, \theta) - r(0, \theta)$.

11. Use Theorems 1.6.2 and 1.6.3 to obtain moment-generating functions for the sufficient statistics when sampling from the following distributions.
(a) normal, $\theta = (\mu, \sigma^2)$
(b) gamma, $\Gamma(p, \lambda)$, $\theta = \lambda$, $p$ fixed
(c) binomial
(d) Poisson
(e) negative binomial (see Problem 1.6.3)
(f) gamma, $\Gamma(p, \lambda)$, $\theta = (p, \lambda)$.
 
12. Show directly using the definition of the rank of an exponential family that the multinomial distribution, $M(n; \theta_1, \ldots, \theta_k)$, $0 < \theta_j < 1$, $1 \le j \le k$, $\sum_{j=1}^k \theta_j = 1$, is of rank $k - 1$.

13. Show that in Theorem 1.6.3, the condition that $\mathcal{E}$ has nonempty interior is equivalent to the condition that $\mathcal{E}$ is not contained in any $(k - 1)$-dimensional hyperplane.

14. Construct an exponential family of rank $k$ for which $\mathcal{E}$ is not open and $\dot{A}$ is not defined on all of $\mathcal{E}$. Show that if $k = 1$ and $\mathcal{E}^0 \ne \emptyset$ and $\dot{A}$, $\ddot{A}$ are defined on all of $\mathcal{E}$, then Theorem 1.6.3 continues to hold.

15. Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ where $P_\theta$ is discrete and concentrated on $\mathcal{X} = \{x_1, x_2, \ldots\}$, and let $p(x, \theta) = P_\theta[X = x]$. Show that if $\mathcal{P}$ is a (discrete) canonical exponential family generated by $(T, h)$ and $\mathcal{E}^0 \ne \emptyset$, then $T$ is minimal sufficient.
Hint: $\frac{\partial}{\partial \eta_j} \log L_X(\eta) = T_j(X) - E_\eta T_j(X)$. Use Problem 1.5.12.
16. Life testing. Let $X_1, \ldots, X_n$ be independently distributed with exponential density $(2\theta)^{-1} e^{-x/2\theta}$ for $x \ge 0$, and let the ordered $X$'s be denoted by $Y_1 \le Y_2 \le \cdots \le Y_n$. It is assumed that $Y_1$ becomes available first, then $Y_2$, and so on, and that observation is continued until $Y_r$ has been observed. This might arise, for example, in life testing where each $X$ measures the length of life of, say, an electron tube, and $n$ tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where $n$ is the number of atoms, and observation is continued until $r$ $\alpha$-particles have been emitted. Show that

(i) The joint distribution of $Y_1, \ldots, Y_r$ is an exponential family with density

$\frac{1}{(2\theta)^r} \frac{n!}{(n - r)!} \exp\left[-\frac{\sum_{i=1}^r y_i + (n - r) y_r}{2\theta}\right]$, $0 \le y_1 \le \cdots \le y_r$.

(ii) The distribution of $\left[\sum_{i=1}^r Y_i + (n - r) Y_r\right]/\theta$ is $\chi^2$ with $2r$ degrees of freedom.

(iii) Let $Y_1, Y_2, \ldots$ denote the time required until the first, second, ... event occurs in a Poisson process with parameter $1/2\theta'$ (see A.16). Then $Z_1 = Y_1/\theta'$, $Z_2 = (Y_2 - Y_1)/\theta'$, $Z_3 = (Y_3 - Y_2)/\theta'$, ... are independently distributed as $\chi^2$ with 2 degrees of freedom, and the joint density of $Y_1, \ldots, Y_r$ is an exponential family with density

$\frac{1}{(2\theta')^r} \exp\left(-\frac{y_r}{2\theta'}\right)$, $0 \le y_1 \le \cdots \le y_r$.

The distribution of $Y_r/\theta'$ is again $\chi^2$ with $2r$ degrees of freedom.

(iv) The same model arises in the application to life testing if the number $n$ of tubes is held constant by replacing each burned-out tube with a new one, and if $Y_1$ denotes the time at which the first tube burns out, $Y_2$ the time at which the second tube burns out, and so on, measured from some fixed time.

[(ii): The random variables $Z_i = (n - i + 1)(Y_i - Y_{i-1})/\theta$ $(i = 1, \ldots, r)$ are independently distributed as $\chi^2$ with 2 degrees of freedom, and $\left[\sum_{i=1}^r Y_i + (n - r) Y_r\right]/\theta = \sum_{i=1}^r Z_i$.]
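The algebraic identity in the bracketed hint to Problem 16(ii) can be checked directly. The sketch below drops the common factor $1/\theta$ and takes $Y_0 = 0$, which is an assumption about the convention in the hint. The names are illustrative, and only the identity is checked, not the distributional claims.

```python
def total_time_on_test(ys, n):
    """sum_{i<=r} y_i + (n - r) * y_r for the first r ordered failure times out of n."""
    r = len(ys)
    return sum(ys) + (n - r) * ys[-1]


def weighted_spacings(ys, n):
    """(n - i + 1) * (y_i - y_{i-1}) for i = 1..r, with y_0 = 0."""
    spacings = []
    previous = 0.0
    for i, y in enumerate(ys, start=1):
        spacings.append((n - i + 1) * (y - previous))
        previous = y
    return spacings
```

With $n = 5$ and observed times $1.0, 2.5, 4.0$, both quantities equal $15.5$.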
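17. Suppose that $(T_{k \times 1}, h)$ generate a canonical exponential family $\mathcal{P}$ with parameter $\eta_{k \times 1}$ and $\mathcal{E} = R^k$. Let

$\mathcal{Q} = \{Q_\theta : Q_\theta = P_\eta \text{ with } \eta = B_{k \times l}\, \theta_{l \times 1} + c_{k \times 1}\}$, $l \le k$.

(a) Show that $\mathcal{Q}$ is the exponential family generated by $\Pi_L T$ and $h \exp\{c^T T\}$, where $\Pi_L$ is the projection matrix of $T$ onto $L = \{\eta : \eta = B\theta + c\}$.
(b) Show that if $\mathcal{P}$ has full rank $k$ and $B$ is of rank $l$, then $\mathcal{Q}$ has full rank $l$.
Hint: If $B$ is of rank $l$, you may assume ...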
18. Suppose $Y_1, \ldots, Y_n$ are independent with $Y_i \sim N(\beta_1 + \beta_2 z_i, \sigma^2)$, where $z_1, \ldots, z_n$ are covariate values not all equal. (See Example 1.6.6.) Show that the family has rank 3. Give the mean vector and the variance matrix of $T$.
19. Logistic Regression. We observe $(z_1, Y_1), \ldots, (z_n, Y_n)$ where the $Y_1, \ldots, Y_n$ are independent, $Y_i \sim B(n_i, \lambda_i)$. The success probability $\lambda_i$ depends on the characteristics $z_i$ of the $i$th subject, for example, on the covariate vector $z_i = (\text{age}, \text{height}, \text{blood pressure})^T$. The function $l(u) = \log[u/(1 - u)]$ is called the logit function. In the logistic linear regression model it is assumed that $l(\lambda_i) = z_i^T \beta$ where $\beta = (\beta_1, \ldots, \beta_d)^T$ and $z_i$ is $d \times 1$. Show that $Y = (Y_1, \ldots, Y_n)^T$ follow an exponential model with rank $d$ iff $z_1, \ldots, z_d$ are not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9).

20. (a) In part II of the proof of Theorem 1.6.4, fill in the details of the arguments that $\mathcal{Q}$ is generated by $(\eta_1 - \eta_0)^T T$ and that $\sim$(ii) $\equiv$ $\sim$(i).
(b) Fill in the details of part III of the proof of Theorem 1.6.4.

21. Find $\mu(\eta) = E_\eta T(X)$ for the gamma, $\Gamma(\alpha, \lambda)$, distribution, where $\theta = (\alpha, \lambda)$.
22. Let $X_1, \ldots, X_n$ be a sample from the $k$-parameter exponential family distribution (1.6.10). Let $T = \left(\sum_{i=1}^n T_1(X_i), \ldots, \sum_{i=1}^n T_k(X_i)\right)$ and let

$S = \{(\eta_1(\theta), \ldots, \eta_k(\theta)) : \theta \in \Theta\}$.

Show that if $S$ contains a subset of $k + 1$ vectors $v_0, \ldots, v_k$ so that $v_i - v_0$, $1 \le i \le k$, are not collinear (linearly independent), then $T$ is minimally sufficient for $\theta$.
23. Using (1.6.20), find a conjugate family of distributions for the gamma and beta families.
(a) With one parameter fixed.
(b) With both parameters free.
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter $\theta = (\theta_1, \theta_2)$ where $\theta_1 = E_\theta(X)$, $\theta_2 = 1/(\mathrm{Var}_\theta X)$ (cf. Problem 1.2.12).

25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with $\sigma^2$ known. Find a conjugate family of prior distributions for $(\beta_1, \beta_2)^T$.

26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15.

27. Let $\mathcal{P}$ denote the canonical exponential family generated by $T$ and $h$. For any $\eta_0 \in \mathcal{E}$, set $h_0(x) = q(x, \eta_0)$ where $q$ is given by (1.6.9). Show that $\mathcal{P}$ is also the canonical exponential family generated by $T$ and $h_0$.
28. Exponential families are maximum entropy distributions. The entropy $h(f)$ of a random variable $X$ with density $f$ is defined by
$$h(f) = E(-\log f(X)) = -\int_{-\infty}^{\infty} [\log f(x)] f(x)\,dx.$$
This quantity arises naturally in information theory; see Section 2.2.2 and Cover and Thomas (1991). Let $S = \{x : f(x) > 0\}$.
(a) Show that the canonical $k$-parameter exponential family density
$$f(x, \eta) = \exp\left\{\eta_0 + \sum_{j=1}^k \eta_j r_j(x) - A(\eta)\right\}, \quad x \in S$$
maximizes $h(f)$ subject to the constraints
$$f(x) \ge 0, \quad \int_S f(x)\,dx = 1, \quad \int_S f(x) r_j(x)\,dx = a_j, \quad 1 \le j \le k,$$
where $\eta_0, \ldots, \eta_k$ are chosen so that $f$ satisfies the constraints. Hint: You may use Lagrange multipliers. Maximize the integrand.
(b) Find the maximum entropy densities when $r_j(x) = x^j$ and (i) $S = (0, \infty)$, $k = 1$, $a_1 > 0$; (ii) $S = R$, $k = 2$, $a_1 \in R$, $a_2 > 0$; (iii) $S = R$, $k = 3$, $a_1 \in R$, $a_2 > 0$, $a_3 \in R$.

29. As in Example 1.6.11, suppose that $Y_1, \ldots, Y_n$ are i.i.d. $N_p(\mu, \Sigma)$ where $\mu$ varies freely in $R^p$ and $\Sigma$ ranges freely over the class of all $p \times p$ symmetric positive definite matrices. Show that the distribution of $Y = (Y_1, \ldots, Y_n)$ is the $p(p + 3)/2$ canonical exponential family generated by $h = 1$ and the $p(p + 3)/2$ statistics
$$T_j = \sum_{i=1}^n Y_{ij}, \quad 1 \le j \le p; \qquad T_{jl} = \sum_{i=1}^n Y_{ij} Y_{il}, \quad 1 \le j \le l \le p$$
where $Y_i = (Y_{i1}, \ldots, Y_{ip})$. Show that $\mathcal{E}$ is open and that this family is of rank $p(p + 3)/2$.
Hint: Without loss of generality, take $n = 1$. We want to show that $h = 1$ and the $m = p(p + 3)/2$ statistics $T_j(Y) = Y_j$, $1 \le j \le p$, and $T_{jl}(Y) = Y_j Y_l$, $1 \le j \le l \le p$, generate $N_p(\mu, \Sigma)$. As $\Sigma$ ranges over all $p \times p$ symmetric positive definite matrices, so does $\Sigma^{-1}$. Next establish that for symmetric matrices $M$,
$$\int \exp\{-u^T M u\}\,du < \infty \text{ iff } M \text{ is positive definite}$$
by using the spectral decomposition (see B.10.1.2)
$$M = \sum_{j=1}^p \lambda_j e_j e_j^T \text{ for } e_1, \ldots, e_p \text{ orthogonal}, \ \lambda_j \in R.$$
To show that the family has full rank $m$, use induction on $p$ to show that if $Z_1, \ldots, Z_p$ are i.i.d. $N(0, 1)$ and if $B_{p \times p} = (b_{jl})$ is symmetric, then
$$P\left(\sum_{j=1}^p a_j Z_j + \sum_{j,l} b_{jl} Z_j Z_l = c\right) = P(a^T Z + Z^T B Z = c) = 0$$
unless $a = 0$, $B = 0$, $c = 0$. Next recall (Appendix B.6) that since $Y \sim N_p(\mu, \Sigma)$, then $Y = SZ$ for some nonsingular $p \times p$ matrix $S$.
30. Show that if $X_1, \ldots, X_n$ are i.i.d. $N_p(\theta, \Sigma_0)$ given $\theta$ where $\Sigma_0$ is known, then the $N_p(\lambda, \Gamma)$ family is conjugate to $N_p(\theta, \Sigma_0)$, where $\lambda$ varies freely in $R^p$ and $\Gamma$ ranges over all $p \times p$ symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let $\{(\mu_j, \tau_j) : 1 \le j \le k\}$ be a given collection of pairs with $\mu_j \in R$, $\tau_j > 0$. Let $(\boldsymbol{\mu}, \boldsymbol{\tau})$ be a random pair with $\lambda_j = P((\boldsymbol{\mu}, \boldsymbol{\tau}) = (\mu_j, \tau_j))$, $0 < \lambda_j < 1$, $\sum_{j=1}^k \lambda_j = 1$. Let $\theta$ be a random variable whose conditional distribution given $(\boldsymbol{\mu}, \boldsymbol{\tau}) = (\mu_j, \tau_j)$ is normal, $N(\mu_j, \tau_j^2)$. Consider the model $X = \theta + \epsilon$, where $\theta$ and $\epsilon$ are independent and $\epsilon \sim N(0, \sigma_0^2)$, $\sigma_0^2$ known. Note that $\theta$ has the prior density
$$\pi(\theta) = \sum_{j=1}^k \lambda_j \varphi_{\tau_j}(\theta - \mu_j) \tag{1.7.4}$$
where $\varphi_\tau$ denotes the $N(0, \tau^2)$ density. Also note that $(X \mid \theta)$ has the $N(\theta, \sigma_0^2)$ distribution.
(a) Find the posterior
$$\pi(\theta \mid x) = \sum_{j=1}^k P((\boldsymbol{\mu}, \boldsymbol{\tau}) = (\mu_j, \tau_j) \mid x)\, \pi(\theta \mid (\mu_j, \tau_j), x)$$
and write it in the form
$$\sum_{j=1}^k \lambda_j(x)\, \varphi_{\tau_j(x)}(\theta - \mu_j(x))$$
for appropriate $\lambda_j(x)$, $\tau_j(x)$ and $\mu_j(x)$. This shows that (1.7.4) defines a conjugate prior for the $N(\theta, \sigma_0^2)$ distribution.
(b) Let $X_i = \theta + \epsilon_i$, $1 \le i \le n$, where $\theta$ is as previously and $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma_0^2)$. Find the posterior $\pi(\theta \mid x_1, \ldots, x_n)$, and show that it belongs to class (1.7.4). Hint: Consider the sufficient statistic for $p(x \mid \theta)$.
32. A Hierarchical Binomial-Beta Model. Let $\{(r_j, s_j) : 1 \le j \le k\}$ be a given collection of pairs with $r_j > 0$, $s_j > 0$, let $(R, S)$ be a random pair with $P(R = r_j, S = s_j) = \lambda_j$, $0 < \lambda_j < 1$, $\sum_{j=1}^k \lambda_j = 1$, and let $\theta$ be a random variable whose conditional density $\pi(\theta, r, s)$ given $R = r$, $S = s$ is beta, $\beta(r, s)$. Consider the model in which $(X \mid \theta)$ has the binomial, $B(n, \theta)$, distribution. Note that $\theta$ has the prior density
$$\pi(\theta) = \sum_{j=1}^k \lambda_j \pi(\theta, r_j, s_j). \tag{1.7.5}$$
Find the posterior
$$\pi(\theta \mid x) = \sum_{j=1}^k P(R = r_j, S = s_j \mid x)\, \pi(\theta \mid (r_j, s_j), x)$$
and show that it can be written in the form $\sum_j \lambda_j(x) \pi(\theta, r_j(x), s_j(x))$ for appropriate $\lambda_j(x)$, $r_j(x)$ and $s_j(x)$. This shows that (1.7.5) defines a class of conjugate priors for the $B(n, \theta)$ distribution.
33. Let $p(x, \eta)$ be a one parameter canonical exponential family generated by $T(x) = x$ and $h(x)$, $x \in \mathcal{X} \subset R$, and let $\psi(x)$ be a nonconstant, nondecreasing function. Show that $E_\eta \psi(X)$ is strictly increasing in $\eta$.
Hint:
$$\mathrm{Cov}_\eta(\psi(X), X) = \tfrac{1}{2} E\{(X - X')[\psi(X) - \psi(X')]\}$$
where $X$ and $X'$ are independent identically distributed as $X$ (see A.11.12).
34. Let $(X_1, \ldots, X_n)$ be a stationary Markov chain with two states 0 and 1. That is,
$$P[X_i = \epsilon_i \mid X_1 = \epsilon_1, \ldots, X_{i-1} = \epsilon_{i-1}] = P[X_i = \epsilon_i \mid X_{i-1} = \epsilon_{i-1}] = p_{\epsilon_{i-1}\epsilon_i}$$
where
$$\begin{pmatrix} p_{00} & p_{01} \\ p_{10} & p_{11} \end{pmatrix}$$
is the matrix of transition probabilities. Suppose further that
(i) $p_{00} = p_{11} = p$, so that $p_{10} = p_{01} = 1 - p$.
(ii) $P[X_1 = 0] = P[X_1 = 1] = \tfrac{1}{2}$.
(a) Show that if $0 < p < 1$ is unknown this is a full rank, one-parameter exponential family with $T = N_{00} + N_{11}$ where $N_{ij} \equiv$ the number of transitions from $i$ to $j$. For example, 01011 has $N_{01} = 2$, $N_{11} = 1$, $N_{00} = 0$, $N_{10} = 1$.
(b) Show that $E(T) = (n - 1)p$ (by the method of indicators or otherwise).
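The transition counts in part (a) are easy to tabulate mechanically. The following is a minimal Python sketch, not part of the problem itself; the function names `transition_counts` and `sufficient_statistic` are our own illustrative choices. It reproduces the counts quoted for the sequence 01011.

```python
from collections import Counter

def transition_counts(chain):
    """Count N_ij, the number of transitions from state i to state j."""
    # zip pairs each state with its successor: (x_1, x_2), (x_2, x_3), ...
    return Counter(zip(chain, chain[1:]))

def sufficient_statistic(chain):
    """T = N_00 + N_11, the number of transitions that stay in the same state."""
    counts = transition_counts(chain)
    return counts[(0, 0)] + counts[(1, 1)]

chain = [0, 1, 0, 1, 1]  # the sequence 01011 from part (a)
counts = transition_counts(chain)
t_stat = sufficient_statistic(chain)
print(counts[(0, 1)], counts[(1, 1)], counts[(0, 0)], counts[(1, 0)], t_stat)
```

A chain of length $n$ has $n - 1$ transitions in all, which is the count that appears in part (b).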
35. A Conjugate Prior for the Two-Sample Problem. Suppose that $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ are independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ samples, respectively. Consider the prior $\pi$ for which for some $r > 0$, $k > 0$, $r\sigma^{-2}$ has a $\chi^2_k$ distribution and given $\sigma^2$, $\mu_1$ and $\mu_2$ are independent with $N(\xi_1, \sigma^2/k_1)$ and $N(\xi_2, \sigma^2/k_2)$ distributions, respectively, where $\xi_j \in R$, $k_j > 0$, $j = 1, 2$. Show that $\pi$ is a conjugate prior.
36. The inverse Gaussian density, $IG(\mu, \lambda)$, is
$$f(x, \mu, \lambda) = [\lambda/2\pi]^{1/2} x^{-3/2} \exp\{-\lambda(x - \mu)^2/2\mu^2 x\}, \quad x > 0, \ \mu > 0, \ \lambda > 0.$$
(a) Show that this is an exponential family generated by $T(X) = -\tfrac{1}{2}(X, X^{-1})^T$ and $h(x) = (2\pi)^{-1/2} x^{-3/2}$.
(b) Show that the canonical parameters $\eta_1, \eta_2$ are given by $\eta_1 = \mu^{-2}\lambda$, $\eta_2 = \lambda$, and that $A(\eta_1, \eta_2) = -\left[\tfrac{1}{2}\log(\eta_2) + \sqrt{\eta_1 \eta_2}\right]$, $\mathcal{E} = [0, \infty) \times (0, \infty)$.
(c) Find the moment-generating function of $T$ and show that $E(X) = \mu$, $\mathrm{Var}(X) = \mu^3\lambda^{-1}$, $E(X^{-1}) = \mu^{-1} + \lambda^{-1}$, $\mathrm{Var}(X^{-1}) = (\lambda\mu)^{-1} + 2\lambda^{-2}$.
(d) Suppose $\mu = \mu_0$ is known. Show that the gamma family, $\Gamma(\alpha, \beta)$, is a conjugate prior.
(e) Suppose that $\lambda = \lambda_0$ is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to $\mu$. That is, $\Omega$ defined in (1.6.19) is empty.
(f) Suppose that $\mu$ and $\lambda$ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, $\Omega$ defined in (1.6.19) is empty.

37. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim N_p(\theta, \Sigma_0)$ where $\Sigma_0$ is known. Show that the conjugate prior generated by (1.6.20) is the $N_p(\eta_0, \tau_0^2 I)$ family, where $\eta_0$ varies freely in $R^p$, $\tau_0^2 > 0$ and $I$ is the $p \times p$ identity matrix.
38. Let $X_i = (Z_i, Y_i)^T$ be i.i.d. as $X = (Z, Y)^T$, $1 \le i \le n$, where $X$ has the density of Example 1.6.3. Write the density of $X_1, \ldots, X_n$ as a canonical exponential family and identify $T$, $h$, $A$, and $\mathcal{E}$. Find the expected value and variance of the sufficient statistic.

39. Suppose that $Y_1, \ldots, Y_n$ are independent, $Y_i \sim N(\mu_i, \sigma^2)$, $n \ge 4$.
(a) Write the distribution of $Y_1, \ldots, Y_n$ in canonical exponential family form. Identify $T$, $h$, $\eta$, $A$, and $\mathcal{E}$.
(b) Next suppose that $\mu_i$ depends on the value $z_i$ of some covariate and consider the submodel defined by the map $\eta : (\theta_1, \theta_2, \theta_3)^T \to (\mu^T, \sigma^2)^T$ where $\eta$ is determined by
$$\mu_i = \exp\{\theta_1 + \theta_2 z_i\}, \quad z_1 < z_2 < \cdots < z_n; \qquad \sigma^2 = \theta_3$$
where $\theta_1 \in R$, $\theta_2 \in R$, $\theta_3 > 0$. This model is sometimes used when $\mu_i$ is restricted to be positive. Show that $p(y, \theta)$ as given by (1.6.12) is a curved exponential family model with $l = 3$.
40. Suppose $Y_1, \ldots, Y_n$ are independent exponentially, $\mathcal{E}(\lambda_i)$, distributed survival times, $n \ge 3$.
(a) Write the distribution of $Y_1, \ldots, Y_n$ in canonical exponential family form. Identify $T$, $h$, $\eta$, $A$, and $\mathcal{E}$.
(b) Recall that $\mu_i = E(Y_i) = \lambda_i^{-1}$. Suppose $\mu_i$ depends on the value $z_i$ of a covariate. Because $\mu_i > 0$, $\mu_i$ is sometimes modeled as
$$\mu_i = \exp\{\theta_1 + \theta_2 z_i\}, \quad i = 1, \ldots, n$$
where not all the $z$'s are equal. Show that $p(y, \theta)$ as given by (1.6.12) is a curved exponential family model with $l = 2$.
1.8 NOTES

Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the $P_\theta$ are all dominated by a $\sigma$ finite measure $\mu$ and that $p(x, \theta)$ denotes $dP_\theta/d\mu$, the Radon Nikodym derivative.

Notes for Section 1.3
(1) More natural in the sense of measuring the Euclidean distance between the estimate $\hat\theta$ and the "truth" $\theta$. Squared error gives much more weight to those $\hat\theta$ that are far away from $\theta$ than those close to $\theta$.
(2) We define the lower boundary of a convex set simply to be the set of all boundary points $r$ such that the set lies completely on or above any tangent to the set at $r$.

Note for Section 1.4
(1) Source: Hodges, Jr., J. L., D. Kretch, and R. S. Crutchfield. Statlab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.

Notes for Section 1.6
(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).
(2) The restriction that $x \in R^q$ and that these families be discrete or continuous is artificial. In general if $\mu$ is a $\sigma$ finite measure on the sample space $\mathcal{X}$, $p(x, \theta)$ as given by (1.6.1) can be taken to be the density of $X$ with respect to $\mu$; see Lehmann (1997), for instance. This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7
(1) $u^T M u > 0$ for all $p \times 1$ vectors $u \ne 0$.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.
BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).
BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist. 6, 266-291 (1978).
BLACKWELL, D. AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.
BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A 143, 383-430 (1979).
BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes-Monograph Series, Hayward, 1986.
CARROLL, R. J. AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.
COVER, T. M. AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.
DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.
DOKSUM, K. A. AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist. 23, 1443-1473 (1995).
FERGUSON, T. S., Mathematical Statistics New York: Academic Press, 1967.
FEYNMAN, R. P., The Feynman Lectures on Physics, v. 1, R. P. Feynman, R. B. Leighton, and M. Sands, Eds., Ch. 40 Statistical Mechanics of Physics Reading, MA: Addison-Wesley, 1963.
GRENANDER, U. AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.
HODGES, JR., J. L., D. KRETCH AND R. S. CRUTCHFIELD, Statlab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.
KENDALL, M. G. AND A. STUART, The Advanced Theory of Statistics, Vols. II, III New York: Hafner Publishing Co., 1961, 1966.
LEHMANN, E. L., "A Theory of Some Multiple Decision Problems, I and II," Ann. Math. Statist. 22, 1-25, 547-572 (1957).
LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science 5, 160-168 (1990).
LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.
LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.
MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.
NORMAND, S-L. AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. E. Ahmed and N. Reid. New York: Springer, Lecture Notes in Statistics, 2000.
PEARSON, K., "On the General Theory of Skew Correlation and Nonlinear Regression," Proc. Roy. Soc. London 71, 303 (1905). (Draper's Research Memoirs, Dulau & Co., Biometrics Series II.)
RAIFFA, H. AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Harvard University, Boston, 1961.
SAVAGE, L. J., The Foundations of Statistics, J. Wiley & Sons, New York, 1954.
SAVAGE, L. J. ET AL., The Foundation of Statistical Inference London: Methuen & Co., 1962.
SNEDECOR, G. W. AND W. G. COCHRAN, Statistical Methods, 8th Ed. Ames, IA: Iowa State University Press, 1989.
WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2

METHODS OF ESTIMATION

2.1 BASIC HEURISTICS OF ESTIMATION

2.1.1 Minimum Contrast Estimates; Estimating Equations

Our basic framework is as before, $X \in \mathcal{X}$, $X \sim P \in \mathcal{P}$, usually parametrized as $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. In this parametric case, how do we select reasonable estimates for $\theta$ itself? That is, how do we find a function $\hat\theta(X)$ of the vector observation $X$ that in some sense "is close" to the unknown $\theta$? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function
$$\rho : \mathcal{X} \times \Theta \to R$$
and define
$$D(\theta_0, \theta) \equiv E_{\theta_0}\rho(X, \theta).$$
As a function of $\theta$, $D(\theta_0, \theta)$ measures the (population) discrepancy between $\theta$ and the true value $\theta_0$ of the parameter. In order for $\rho$ to be a contrast function we require that $D(\theta_0, \theta)$ is uniquely minimized for $\theta = \theta_0$. That is, if $P_{\theta_0}$ were true and we knew $D(\theta_0, \theta)$ as a function of $\theta$, we could obtain $\theta_0$ as the minimizer. Of course, we don't know the truth so this is inoperable, but $\rho(X, \theta)$ is an estimate of $D(\theta_0, \theta)$, though only in a very weak sense (unbiasedness). So it is natural to consider $\hat\theta(X)$ minimizing $\rho(X, \theta)$. This is the most general form of the minimum contrast estimate we shall consider in the next section.

Now suppose $\Theta$ is Euclidean $\subset R^d$, the true $\theta_0$ is an interior point of $\Theta$, and $\theta \to D(\theta_0, \theta)$ is smooth. Then we expect
$$\nabla_\theta D(\theta_0, \theta)\big|_{\theta = \theta_0} = 0 \tag{2.1.1}$$
where $\nabla$ denotes the gradient. Arguing heuristically again we are led to estimates $\hat\theta$ that solve
$$\nabla_\theta \rho(X, \theta) = 0. \tag{2.1.2}$$
The equations (2.1.2) define a special form of estimating equations.
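As a small numerical illustration of the heuristic, consider the squared error contrast $\rho(X, \theta) = \sum_i (X_i - \theta)^2$ for a real parameter $\theta$. This choice of contrast, the made-up data, and the grid search are assumptions made here for illustration only. For this contrast the estimating equation (2.1.2) reduces to $\sum_i (X_i - \theta) = 0$, whose solution is the sample mean, and a brute-force minimization agrees with it.

```python
def contrast(xs, theta):
    """rho(X, theta): sum of squared deviations (an illustrative choice of contrast)."""
    return sum((x - theta) ** 2 for x in xs)

def minimum_contrast_estimate(xs, grid):
    """Return the grid point that minimizes the contrast."""
    return min(grid, key=lambda theta: contrast(xs, theta))

xs = [1.0, 2.0, 4.0, 5.0]                # made-up observations
grid = [i / 100 for i in range(0, 601)]  # candidate theta values 0.00, 0.01, ..., 6.00
theta_hat = minimum_contrast_estimate(xs, grid)
sample_mean = sum(xs) / len(xs)          # root of the estimating equation
print(theta_hat, sample_mean)
```

A grid search is used only to keep the sketch self-contained; in practice one would solve the estimating equation directly or use a numerical optimizer.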
More generally, suppose we are given a function $\Psi : \mathcal{X} \times R^d \to R^d$, $\Psi \equiv (\psi_1, \ldots, \psi_d)^T$ and define
$$V(\theta_0, \theta) = E_{\theta_0}\Psi(X, \theta). \tag{2.1.3}$$
Suppose $V(\theta_0, \theta) = 0$ has $\theta_0$ as its unique solution for all $\theta_0 \in \Theta$. Then we say $\hat\theta$ solving
$$\Psi(X, \hat\theta) = 0 \tag{2.1.4}$$
is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.

Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with $\mu(z) = g(\beta, z)$, $\beta \in R^d$, where the function $g$ is known. Here the data are $X = \{(z_i, Y_i) : 1 \le i \le n\}$ where $Y_1, \ldots, Y_n$ are independent. A natural(1) function $\rho(X, \beta)$ to consider is the squared Euclidean distance between the vector $Y$ of observed $Y_i$ and the vector expectation of $Y$, $\mu(z) = (g(\beta, z_1), \ldots, g(\beta, z_n))^T$. That is, we take
$$\rho(X, \beta) = |Y - \mu|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2. \tag{2.1.5}$$
Strictly speaking $P$ is not fully defined here and this is a point we shall explore later. But, for convenience, suppose we postulate that the $\epsilon_i$ of Example 1.1.4 are i.i.d. $N(0, \sigma_0^2)$. Then $\beta$ parametrizes the model and we can compute (see Problem 2.1.16),
$$D(\beta_0, \beta) = E_{\beta_0}\rho(X, \beta) = n\sigma_0^2 + \sum_{i=1}^n [g(\beta_0, z_i) - g(\beta, z_i)]^2, \tag{2.1.6}$$
which is indeed minimized at $\beta = \beta_0$ and uniquely so if and only if the parametrization is identifiable. An estimate $\hat\beta$ that minimizes $\rho(X, \beta)$ exists if $g(\beta, z)$ is continuous and
$$\lim\{|g(\beta, z)| : |\beta| \to \infty\} = \infty$$
(Problem 2.1.10). The estimate $\hat\beta$ is called the least squares estimate.

If, further, $g(\beta, z)$ is differentiable in $\beta$, then $\hat\beta$ satisfies the equation (2.1.2) or equivalently the system of estimating equations,
$$\sum_{i=1}^n \frac{\partial g}{\partial \beta_j}(\hat\beta, z_i) Y_i = \sum_{i=1}^n \frac{\partial g}{\partial \beta_j}(\hat\beta, z_i) g(\hat\beta, z_i), \quad 1 \le j \le d. \tag{2.1.7}$$
In the important linear case,
$$g(\beta, z_i) = \sum_{j=1}^d z_{ij}\beta_j \text{ and } z_i = (z_{i1}, \ldots, z_{id})^T$$
the system becomes
$$\sum_{i=1}^n z_{ij} Y_i = \sum_{k=1}^d \left(\sum_{i=1}^n z_{ij} z_{ik}\right)\hat\beta_k, \quad 1 \le j \le d, \tag{2.1.8}$$
the normal equations. These equations are commonly written in matrix form
$$Z_D^T Y = Z_D^T Z_D \hat\beta \tag{2.1.9}$$
where $Z_D \equiv \|z_{ij}\|_{n \times d}$ is the design matrix. Least squares, thus, provides a first example of both minimum contrast and estimating equation methods.

We return to the remark that this estimating method is well defined even if the $\epsilon_i$ are not i.i.d. $N(0, \sigma_0^2)$. In fact, once defined we have a method of computing a statistic $\hat\beta$ from the data $X = \{(z_i, Y_i), 1 \le i \le n\}$, which can be judged on its merits whatever the true $P$ governing $X$ is. This very important example is pursued further in Section 2.2 and Chapter 6. □

Here is another basic estimating equation example.

Example 2.1.2. Method of Moments (MOM). Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P_\theta$, $\theta \in R^d$ and $\theta$ is identifiable. Suppose that $\mu_1(\theta), \ldots, \mu_d(\theta)$ are the first $d$ moments of the population we are sampling from. Thus, we assume the existence of
$$\mu_j(\theta) = \mu_j = E_\theta(X^j), \quad 1 \le j \le d.$$
Define the $j$th sample moment $\hat\mu_j$ by
$$\hat\mu_j = \frac{1}{n}\sum_{i=1}^n X_i^j, \quad 1 \le j \le d.$$
To apply the method of moments to the problem of estimating $\theta$, we need to be able to express $\theta$ as a continuous function $g$ of the first $d$ moments. Thus, suppose
$$\theta \to (\mu_1(\theta), \ldots, \mu_d(\theta))$$
is 1-1 from $R^d$ to $R^d$. The method of moments prescribes that we estimate $\theta$ by the solution of
$$\hat\mu_j = \mu_j(\hat\theta), \quad 1 \le j \le d$$
if it exists. The motivation of this simplest estimating equation example is the law of large numbers: For $X \sim P_\theta$, $\hat\mu_j$ converges in probability to $\mu_j(\theta)$.

More generally, if we want to estimate a $R^k$-valued function $q(\theta)$ of $\theta$, we obtain a MOM estimate of $q(\theta)$ by expressing $q(\theta)$ as a function of any of the first $d$ moments $\mu_1, \ldots, \mu_d$ of $X$, say $q(\theta) = h(\mu_1, \ldots, \mu_d)$, $d \ge k$, and then using $h(\hat\mu_1, \ldots, \hat\mu_d)$ as the estimate of $q(\theta)$.
For instance, consider a study in which the survival time $X$ is modeled to have a gamma distribution, $\Gamma(\alpha, \lambda)$, with density
$$[\lambda^\alpha/\Gamma(\alpha)]x^{\alpha - 1}\exp\{-\lambda x\}, \quad x > 0; \ \alpha > 0, \ \lambda > 0.$$
In this case $\theta = (\alpha, \lambda)$, $\mu_1 = E(X) = \alpha/\lambda$, and $\mu_2 = E(X^2) = \alpha(1 + \alpha)/\lambda^2$. Solving for $\theta$ gives
$$\alpha = (\mu_1/\sigma)^2, \quad \hat\alpha = (\bar X/\hat\sigma)^2;$$
$$\lambda = \mu_1/\sigma^2, \quad \hat\lambda = \bar X/\hat\sigma^2$$
where $\sigma^2 = \mu_2 - \mu_1^2$ and $\hat\sigma^2 = n^{-1}\sum X_i^2 - \bar X^2$. In this example, the method of moment estimator is not unique. We can, for instance, express $\theta$ as a function of $\mu_1$ and $\mu_3 = E(X^3)$ and obtain a method of moment estimator based on $\hat\mu_1$ and $\hat\mu_3$ (Problem 2.1.11). □

Algorithmic issues

We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed form. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently used when computation of
$$M(X, \cdot) \equiv D\Psi(X, \cdot) \equiv \left\|\frac{\partial \psi_i}{\partial \theta_j}(X, \cdot)\right\|_{d \times d}$$
is quick and $M$ is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with $\hat\theta_0$, then setting
$$\hat\theta_{j+1} = \hat\theta_j - [M(X, \hat\theta_j)]^{-1}\Psi(X, \hat\theta_j). \tag{2.1.10}$$
This algorithm and others will be discussed more extensively in Section 2.4 and in Chapter 6, in particular Problem 6.6.10.

2.1.2 The Plug-In and Extension Principles

We can view the method of moments as an example of what we call the plug-in (or substitution) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.

Example 2.1.3. Frequency Plug-in(2) and Extension. Suppose we observe multinomial trials in which the values $v_1, \ldots, v_k$ of the population being sampled are known, but their respective probabilities $p_1, \ldots, p_k$ are completely unknown. If we let $X_1, \ldots, X_n$ be i.i.d. as $X$ and
$$N_i \equiv \text{number of indices } j \text{ such that } X_j = v_i,$$
then the natural estimate of $p_i = P[X = v_i]$ suggested by the law of large numbers is $N_i/n$, the proportion of sample values equal to $v_i$. As an illustration consider a population of men whose occupations fall in one of five different job categories, 1, 2, 3, 4, or 5. Here $k = 5$, $v_i = i$, $i = 1, \ldots, 5$, $p_i$ is the proportion of men in the population in the $i$th job category and $N_i/n$ is the sample proportion in this category. Here is some job category data (Mosteller, 1968).
Job Category i     1      2      3      4      5
N_i               23     84    289    217     95     n = sum of N_i = 708
p̂_i             0.03   0.12   0.41   0.31   0.13    sum of p̂_i = 1

for Danish men whose fathers were in category 3, together with the estimates $\hat p_i = N_i/n$.

Next consider the more general problem of estimating a continuous function $q(p_1, \ldots, p_k)$ of the population proportions. The frequency plug-in principle simply proposes to replace the unknown population frequencies $p_1, \ldots, p_k$ by the observable sample frequencies $N_1/n, \ldots, N_k/n$. That is, use
$$T(X_1, \ldots, X_n) = q\left(\frac{N_1}{n}, \ldots, \frac{N_k}{n}\right) \tag{2.1.11}$$
to estimate $q(p_1, \ldots, p_k)$. For instance, suppose that in the previous job category table, categories 4 and 5 correspond to blue-collar jobs, whereas categories 2 and 3 correspond to white-collar jobs. We would be interested in estimating
$$q(p_1, \ldots, p_5) = (p_4 + p_5) - (p_2 + p_3),$$
the difference in the proportions of blue-collar and white-collar workers. If we use the frequency substitution principle, the estimate is
$$T(X_1, \ldots, X_n) = \left(\frac{N_4}{n} + \frac{N_5}{n}\right) - \left(\frac{N_2}{n} + \frac{N_3}{n}\right),$$
which in our case is $0.44 - 0.53 = -0.09$.

Equivalently, let $P$ denote $p = (p_1, \ldots, p_k)$ with $p_i = P[X = v_i]$, $1 \le i \le k$, and think of this model as $\mathcal{P} = \{\text{all probability distributions } P \text{ on } \{v_1, \ldots, v_k\}\}$. Then $q(p)$ can be identified with a parameter $\nu : \mathcal{P} \to R$, that is, $\nu(P) = (p_4 + p_5) - (p_2 + p_3)$, and the frequency plug-in principle simply says to replace $P = (p_1, \ldots, p_k)$ in $\nu(P)$ by $\hat P = (N_1/n, \ldots, N_k/n)$, the multinomial empirical distribution of $X_1, \ldots, X_n$. □

Now suppose that the proportions $p_1, \ldots, p_k$ do not vary freely but are continuous functions of some $d$-dimensional parameter $\theta = (\theta_1, \ldots, \theta_d)$ and that we want to estimate a component of $\theta$ or more generally a function $q(\theta)$. Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type.

Example 2.1.4. Hardy-Weinberg Equilibrium. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies are given by the so-called Hardy-Weinberg proportions
$$p_1 = \theta^2, \quad p_2 = 2\theta(1 - \theta), \quad p_3 = (1 - \theta)^2, \quad 0 < \theta < 1. \tag{2.1.12}$$
If $N_i$ is the number of individuals of type $i$ in the sample of size $n$, then $(N_1, N_2, N_3)$ has a multinomial distribution with parameters $(n, p_1, p_2, p_3)$ given by (2.1.12). Suppose
we want to estimate $\theta$, the frequency of one of the alleles. Because $\theta = \sqrt{p_1}$, we can use the principle we have introduced and estimate by $\sqrt{N_1/n}$. Note, however, that we can also write $\theta = 1 - \sqrt{p_3}$ and, thus, $1 - \sqrt{N_3/n}$ is also a plausible estimate of $\theta$. □

In general, suppose that we want to estimate a continuous $R^l$-valued function $q$ of $\theta$. If $p_1, \ldots, p_k$ are continuous functions of $\theta$, we can usually express $q(\theta)$ as a continuous function of $p_1, \ldots, p_k$, that is,
$$q(\theta) = h(p_1(\theta), \ldots, p_k(\theta)), \tag{2.1.13}$$
with $h$ defined and continuous on
$$\mathcal{P} = \left\{(p_1, \ldots, p_k) : p_i \ge 0, \ \sum_{i=1}^k p_i = 1\right\}.$$
Given $h$ we can apply the extension principle to estimate $q(\theta)$ as
$$T(X_1, \ldots, X_n) = h\left(\frac{N_1}{n}, \ldots, \frac{N_k}{n}\right). \tag{2.1.14}$$
As we saw in the Hardy-Weinberg case, the representation (2.1.13) and estimate (2.1.14) are not unique. We shall consider in Chapters 3 (Example 3.4.4) and 5 how to choose among such estimates.

We can think of the extension principle alternatively as follows. Let
$$\mathcal{P}_0 = \{P_\theta = (p_1(\theta), \ldots, p_k(\theta)) : \theta \in \Theta\}$$
be a submodel of $\mathcal{P}$. Now $q(\theta)$ can be identified if $\theta$ is identifiable by a parameter $\nu : \mathcal{P}_0 \to R$ given by $\nu(P_\theta) = q(\theta)$. Then (2.1.13) defines an extension of $\nu$ from $\mathcal{P}_0$ to $\mathcal{P}$ via $\bar\nu : \mathcal{P} \to R$ where $\bar\nu(P) \equiv h(p)$ and $\bar\nu(P) = \nu(P)$ for $P \in \mathcal{P}_0$.

The plug-in and extension principles can be abstractly stated as follows:

Plug-in principle. If we have an estimate $\hat P$ of $P \in \mathcal{P}$ such that $\hat P \in \mathcal{P}$ and $\nu : \mathcal{P} \to T$ is a parameter, then $\nu(\hat P)$ is the plug-in estimate of $\nu$. In particular, in the i.i.d. case if $\mathcal{P}$ is the space of all distributions of $X$ and $X_1, \ldots, X_n$ are i.i.d. as $X \sim P$, the empirical distribution $\hat P$ of $X$ given by
$$\hat P[X \in A] = \frac{1}{n}\sum_{i=1}^n 1(X_i \in A) \tag{2.1.15}$$
is, by the law of large numbers, a natural estimate of $P$ and $\nu(\hat P)$ is a plug-in estimate of $\nu(P)$ in this nonparametric context. For instance, if $X$ is real and $F(x) = P(X \le x)$ is the distribution function (d.f.), consider $\nu_\alpha(P) = \tfrac{1}{2}[F^{-1}(\alpha) + F_U^{-1}(\alpha)]$, where $\alpha \in (0, 1)$ and
$$F^{-1}(\alpha) = \inf\{x : F(x) \ge \alpha\}, \quad F_U^{-1}(\alpha) = \sup\{x : F(x) \le \alpha\}, \tag{2.1.16}$$
then $\nu_\alpha(P)$ is the $\alpha$th population quantile $x_\alpha$. Here $x_{1/2} = \nu_{1/2}(P)$ is called the population median. A natural estimate is the $\alpha$th sample quantile
$$\hat x_\alpha = \tfrac{1}{2}[\hat F^{-1}(\alpha) + \hat F_U^{-1}(\alpha)], \tag{2.1.17}$$
where $\hat F$ is the empirical d.f. Here $\hat x_{1/2}$ is called the sample median.

For a second example, if $X$ is real and $\mathcal{P}$ is the class of distributions with $E|X|^j < \infty$, then the plug-in estimate of the $j$th moment $\nu(P) = \mu_j = E(X^j)$ in this nonparametric context is the $j$th sample moment $\nu(\hat P) = \int x^j\,d\hat F(x) = n^{-1}\sum_{i=1}^n X_i^j$.

Extension principle. Suppose $\mathcal{P}_0$ is a submodel of $\mathcal{P}$ and $\hat P$ is an element of $\mathcal{P}$ but not necessarily $\mathcal{P}_0$ and suppose $\nu : \mathcal{P}_0 \to T$ is a parameter. If $\bar\nu : \mathcal{P} \to T$ is an extension of $\nu$ in the sense that $\bar\nu(P) = \nu(P)$ on $\mathcal{P}_0$, then $\bar\nu(\hat P)$ is an extension (and plug-in) estimate of $\nu(P)$.

With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plug-in estimates for multinomial trials because
$$\mu_j(\theta) = \sum_{i=1}^k v_i^j p_i(\theta) = h(p(\theta)) = \nu(P_\theta)$$
where
$$h(p) = \sum_{i=1}^k v_i^j p_i = \bar\nu(P), \qquad \hat\mu_j = \frac{1}{n}\sum_{l=1}^n X_l^j = \sum_{i=1}^k v_i^j \frac{N_i}{n} = h\left(\frac{N}{n}\right) = \bar\nu(\hat P)$$
and $\hat P$ is the empirical distribution. This reasoning extends to the general i.i.d. case (Problem 2.1.12) and to more general method of moment estimates (Problem 2.1.13). As stated, these principles are general. However, they are mainly applied in the i.i.d. case, but see Problem 2.1.14.

Remark 2.1.1. The plug-in and extension principles are used when $P_\theta$, $\nu$, and $\bar\nu$ are continuous. For instance, in the multinomial examples 2.1.3 and 2.1.4, $P_\theta$ as given by the Hardy-Weinberg $p(\theta)$, is a continuous map from $\Theta = [0, 1]$ to $\mathcal{P}$, $\nu(P_\theta) = q(\theta) = h(p(\theta))$ is a continuous map from $\Theta$ to $R$ and $\bar\nu(P) = h(p)$ is a continuous map from $\mathcal{P}$ to $R$.

Remark 2.1.2. The plug-in and extension principles must be calibrated with the target parameter. For instance, let $\mathcal{P}_0$ be the class of distributions of $X = \theta + \epsilon$ where $\theta \in R$ and the distribution of $\epsilon$ ranges over the class of symmetric distributions with mean zero. Let $\nu(P)$ be the mean of $X$ and let $\mathcal{P}$ be the class of distributions of $X = \theta + \epsilon$ where $\theta \in R$ and the distribution of $\epsilon$ ranges over the class of distributions with mean zero. In this case both $\bar\nu_1(P) = E_P(X)$ and $\bar\nu_2(P) = $ "median of $P$" satisfy $\bar\nu(P) = \nu(P)$, $P \in \mathcal{P}_0$, but only $\bar\nu_1(\hat P) = \bar X$ is a sensible estimate of $\nu(P)$, $P \notin \mathcal{P}_0$, because when $P$ is not symmetric, the sample median $\bar\nu_2(\hat P)$ does not converge in probability to $E_P(X)$.
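The frequency plug-in rule (2.1.11) can be sketched in a few lines of Python. The helper names below are our own, the job category counts are those of Example 2.1.3, and the genotype counts used for the Hardy-Weinberg estimates are invented purely for illustration.

```python
from math import sqrt

def frequency_plug_in(counts, q):
    """Estimate q(p_1, ..., p_k) by q(N_1/n, ..., N_k/n), as in (2.1.11)."""
    n = sum(counts)
    p_hat = [c / n for c in counts]
    return q(p_hat)

def blue_minus_white(p):
    """(p_4 + p_5) - (p_2 + p_3), written with 0-based list indices."""
    return (p[3] + p[4]) - (p[1] + p[2])

# Job category counts N_1, ..., N_5 from Example 2.1.3 (n = 708).
job_counts = [23, 84, 289, 217, 95]
diff = frequency_plug_in(job_counts, blue_minus_white)

# Hardy-Weinberg model: two different plug-in estimates of theta.
hw_counts = [30, 50, 20]  # hypothetical genotype counts (N_1, N_2, N_3)
theta_from_p1 = frequency_plug_in(hw_counts, lambda p: sqrt(p[0]))
theta_from_p3 = frequency_plug_in(hw_counts, lambda p: 1 - sqrt(p[2]))
print(round(diff, 2), theta_from_p1, theta_from_p3)
```

The two Hardy-Weinberg estimates generally differ on the same sample, which is the non-uniqueness noted after (2.1.14).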
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.

Example 2.1.5. Suppose that $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of $\mu$ and $\sigma^2$ are $\bar X$ and $\hat\sigma^2$. □

Example 2.1.6. Suppose $X_1, \ldots, X_n$ are the indicators of a set of Bernoulli trials with probability of success $\theta$. Because $\mu_1(\theta) = \theta$ the method of moments leads to the natural estimate of $\theta$, $\bar X$, the frequency of successes. To estimate the population variance $\theta(1 - \theta)$ we are led by the first moment to the estimate, $\bar X(1 - \bar X)$. Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1). □

Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3 where $X_1, \ldots, X_n$ are i.i.d. $U\{1, 2, \ldots, \theta\}$, we find $\mu = E_\theta(X_i) = \tfrac{1}{2}(\theta + 1)$. Thus, $\theta = 2\mu - 1$ and $2\bar X - 1$ is a method of moments estimate of $\theta$. This is clearly a foolish estimate if $X_{(n)} = \max X_i > 2\bar X - 1$ because in this model $\theta$ is always at least as large as $X_{(n)}$. □

As we have seen, there are often several method of moments estimates for the same $q(\theta)$. For example, if we are sampling from a Poisson population with parameter $\theta$, then $\theta$ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because $p_0 = P(X = 0) = \exp\{-\theta\}$, a frequency plug-in estimate of $\theta$ is $-\log \hat p_0$, where $\hat p_0$ is $n^{-1}[\#X_i = 0]$. We will make a selection among such procedures in Chapter 3.

Remark 2.1.3. What are the good points of the method of moments and frequency plug-in?
(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.
(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.

It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, as we shall see in Section 2.4, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.

Discussion. When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For instance, as we shall see in Chapter 3, estimation of $\theta$ real with quadratic loss and Bayes priors lead to procedures that are data weighted averages of $\theta$ values rather than minimizers of functions $\rho(X, \theta)$. Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, the plug-in principle is justified, and there are best extensions.
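Two of the method of moments estimates discussed in this section, the gamma estimates of Example 2.1.2 and the population size estimate of Example 2.1.7, can be sketched as follows. The function names and the small data sets are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def gamma_mom(xs):
    """Method of moments estimates (alpha_hat, lambda_hat) for a gamma sample."""
    n = len(xs)
    mean = sum(xs) / n
    # sigma_hat^2 = second sample moment minus squared first sample moment
    var_hat = sum(x * x for x in xs) / n - mean ** 2
    return mean ** 2 / var_hat, mean / var_hat

def uniform_size_mom(xs):
    """MOM estimate 2 * X_bar - 1 of theta for the uniform distribution on {1, ..., theta}."""
    return 2 * sum(xs) / len(xs) - 1

alpha_hat, lambda_hat = gamma_mom([1.0, 2.0, 3.0, 4.0])  # made-up survival times

serials = [1, 1, 1, 10]                 # made-up sample from U{1, ..., theta}
theta_mom = uniform_size_mom(serials)   # falls below the observed maximum
print(alpha_hat, lambda_hat, theta_mom, max(serials))
```

In the second data set the estimate $2\bar X - 1$ is smaller than the largest observation, which is exactly the "foolish" behavior pointed out in Example 2.1.7.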
Summary. We consider principles that suggest how we can use the outcome $X$ of an experiment to estimate unknown parameters.

For the model $\{P_\theta : \theta \in \Theta\}$ a contrast $\rho$ is a function from $\mathcal{X} \times \Theta$ to $R$ such that the discrepancy

$$D(\theta_0, \theta) = E_{\theta_0}\rho(X, \theta), \quad \theta \in \Theta \subset R^d$$

is uniquely minimized at the true value $\theta = \theta_0$ of the parameter. A minimum contrast estimator is a minimizer of $\rho(X, \theta)$, and the contrast estimating equations are $\nabla_\theta \rho(X, \theta) = 0$.

For data $\{(z_i, Y_i) : 1 \le i \le n\}$ with $Y_i$ independent and $E(Y_i) = g(\beta, z_i)$, $1 \le i \le n$, where $g$ is a known function and $\beta \in R^d$ is a vector of unknown regression coefficients, a least squares estimate of $\beta$ is a minimizer of

$$\rho(X, \beta) = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2.$$

For this contrast, when $g(\beta, z) = z^T\beta$, the associated estimating equations are called the normal equations and are given by $Z_D^T Y = Z_D^T Z_D \beta$, where $Z_D = \|z_{ij}\|_{n \times d}$ is called the design matrix.

Suppose $X \sim P$. The plug-in estimate (PIE) for a vector parameter $\nu = \nu(P)$ is obtained by setting $\hat\nu = \nu(\hat P)$ where $\hat P$ is an estimate of $P$. When $\hat P$ is the empirical probability distribution $\hat P_E$ defined by $\hat P_E(A) = n^{-1}\sum_{i=1}^n 1[X_i \in A]$, then $\hat\nu$ is called the empirical PIE. If $P = P_\theta$, $\theta \in \Theta$, is parametric and a vector $q(\theta)$ is to be estimated, we find a parameter $\nu$ such that $\nu(P_\theta) = q(\theta)$ and call $\nu(\hat P)$ a plug-in estimator of $q(\theta)$. Method of moment estimates are empirical PIEs based on $\nu(P) = (\mu_1, \ldots, \mu_d)^T$ where $\mu_j = E(X^j)$, $1 \le j \le d$. In the multinomial case the frequency plug-in estimators are empirical PIEs based on $\nu(P) = (p_1, \ldots, p_k)$, where $p_j$ is the probability of the $j$th category, $1 \le j \le k$.

Let $\mathcal{P}_0$ and $\mathcal{P}$ be two statistical models for $X$ with $\mathcal{P}_0 \subset \mathcal{P}$. An extension $\bar\nu$ of $\nu$ from $\mathcal{P}_0$ to $\mathcal{P}$ is a parameter satisfying $\bar\nu(P) = \nu(P)$, $P \in \mathcal{P}_0$. If $\hat P$ is an estimate of $P$ with $\hat P \in \mathcal{P}$, $\bar\nu(\hat P)$ is called the extension plug-in estimate of $\nu(P)$. The general principles are shown to be related to each other.

2.2 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS

2.2.1 Least Squares and Weighted Least Squares

Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6.
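As a worked step, the normal equations quoted in the summary can be recovered by differentiating the least squares contrast in the linear case. The sketch below writes $z_i^T$ for the $i$th row of $Z_D$ (a notational assumption made here for compactness) and uses only the fact that the contrast is a differentiable quadratic in $\beta$.

```latex
\begin{align*}
\rho(X,\beta) &= \sum_{i=1}^n \bigl(Y_i - z_i^T\beta\bigr)^2
              = (Y - Z_D\beta)^T (Y - Z_D\beta),\\
\nabla_\beta\,\rho(X,\beta) &= -2\, Z_D^T (Y - Z_D\beta),\\
\nabla_\beta\,\rho(X,\beta) = 0 &\iff Z_D^T Y = Z_D^T Z_D\,\beta .
\end{align*}
```

When $Z_D^T Z_D$ is invertible, this linear system has exactly one solution.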
2. z.f3d and is often applied in situations in which specification of the model beyond (2.1. " . is modeled as a sample from a joint distribution. i=l (2. as in (a) of Example 1.. E«. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj.).»)2. j.2.m::a. . and 13 ~ (g(l3. 1 < i < n. (2. (Zi.. I) are i. I<i<n. then 13 has an interpretation as a parameter on p.C=ha:::p::.) i.2.::. fj) because (2.z.) = 0.g(l3.108_~ ~_ _~ ~ ~cMc.4 with <j simply defined as Y. .Y) such thatE(Y I Z = z) = g(j3. j. .6) is difficult Sometimes z can be viewed as the realization of a population variable Z. Zn»)T is 1. E«. " (2.. 13 = I3(P) is the miniml2er of E(Y .z.d.o=nC..) = E(Y.). we can compute still Dp(l3o. The estimates continue to be reasonable under the GaussMarkov assumptions.. .i.2. . The least squares method of estimation applies only to the parameters {31' . • .2). .1 we considered the nonlinear (and linear) Gaussian model Po given by Y. That is.:1. 1 < j < n.) = g(l3. The contrast p(X. 13 E R d }.1.z.::'hc. the model is semiparametric with {3.. as (Z.1.g(l3. Y) ~ PEP = {Alljnintdistributions of(Z. which satisfies (but is not fully specified by) (2.4) (2. Yi). where Ci = g(l3.z).zn)f is 11.2.E:::'::'... Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ..2.d.4}{2.. . z.) + L[g(l3 o.2.:0::d=':Of. I' .) . that is.2.6) Var( €i) = u 2 COV(fi..g(l3. NCO.) =0.5) (2. (3)  E p p(X.g(l3. . .2.13) n n i=1  L Varp«.E(Y. This follows "I . < i < n..::2 : In Example 2. . If we consider this model. .2. 1 < i < j < n 1 > 0\ . . I Zj = Zj).:. Z could be educationallevel and Y income. 13) = LIY. Note that the joint distribution H of (C}. This is frequently the case for studies in the social and biological sciences. = 0. For instance.3) continues to be valid. a 2 and H unknown. aJ) and (3 ranges over R d or an open subset.En) is any distribution satisfying the GaussMarkov assumptions.g(l3. z. that is. 
1_' .3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3.=.) + <" I < i < n n (2.(z. Z)f.2) Are least squares estimates still reasonable? For P in the semiparametric model P.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. .zt}.i. (Z" Y.
(2.7).2. j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. which.1. E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2. I)/. The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). in conjnnction with (2. J for Zo an interior point of the domain.. is called the linear (multiple) regression modeL For the data {(Zi.3 and 6.1. i = 1.8) d f30 = («(3" . We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.5.2.4.9) . n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix. we are in a situation in which it is plausible to assume that (Zi.. . E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj. As we noted in Example 2. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z .2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1.. and Volnme n.Section 2. where P is the empirical distribution assigning mass n. Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous.1 the most commonly used 9 in these models is g({3. . . a further modeling step is often taken and it is assumed that {Zll .1). = py .2.4. We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!. 2:).L{3jPj.4).7) (2.. In that case.6).2. See also Problem 2.2. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. see Rnppert and Wand (1994). Zd. .. {3d). z) = zT (3.zo). (2) If as we discussed earlier. and Seber and Wild (1989).2.41. j=1 (2.5) and (2.2.. Fan and Gijbels (1996). (2. as we have seen in Section lA.I to each of the n pairs (Zil Yi). Sections 6.2. 
For nonlinear cases we can use numerical methods to solve the estimating equations (2.1. Yi).
Furthermore, $\epsilon = Y - \mu(Z)$ is independent of $Z$ and has a $N(0, \sigma^2)$ distribution where

$$\sigma^2 = \sigma_{YY} - \Sigma_{YZ}\Sigma_{ZZ}^{-1}\Sigma_{ZY}.$$

Therefore, given $Z_i = z_i$, $1 \le i \le n$, we have a Gaussian linear regression model for $Y_i$, $1 \le i \le n$.

Estimation of $\beta$ in the linear regression model. We have already argued in Example 2.1.1 that, if the parametrization $\beta \to Z_D\beta$ is identifiable, the least squares estimate, $\hat\beta$, exists and is unique and satisfies the normal equations (2.1.8). The parametrization is identifiable if and only if $Z_D$ is of rank $d$ or equivalently if $Z_D^T Z_D$ is of full rank $d$; see Problem 2.2.25. In that case, necessarily, the solution of the normal equations can be given "explicitly" by

$$\hat\beta = [Z_D^T Z_D]^{-1} Z_D^T Y. \tag{2.2.10}$$

Here are some examples.

Example 2.2.1. In the measurement model in which $Y_i$ is the determination of a constant $\beta_1$, $d = 1$, $g(z, \beta_1) = \beta_1$, $\frac{\partial}{\partial \beta_1} g(z, \beta_1) = 1$ and the normal equation is $\sum_{i=1}^n (y_i - \beta_1) = 0$, whose solution is $\hat\beta_1 = (1/n)\sum_{i=1}^n y_i = \bar y$.

Example 2.2.2. We want to find out how increasing the amount $z$ of a certain chemical or fertilizer in the soil increases the amount $y$ of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between $z$ and $y$ can be approximated well by a linear equation $y = \beta_1 + \beta_2 z$ provided $z$ is restricted to a reasonably small interval. If we run several experiments with the same $z$ using plants and soils that are as nearly identical as possible, we will find that the values of $y$ will not be the same. For this reason, we assume that for a given $z$, $Y$ is random with a distribution $P(y \mid z)$.

Following are the results of an experiment to which a regression model can be applied (Snedecor and Cochran, 1967, p. 139). Nine samples of soil were treated with different amounts $Z$ of phosphorus. $Y$ is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil.

z_i    1    4    5    9   11   13   23   23   28
y_i   64   71   54   81   76   93   77   95  109

The points $(z_i, y_i)$ and an estimate of the line $\beta_1 + \beta_2 z$ are plotted in Figure 2.2.1.

We want to estimate $\beta_1$ and $\beta_2$. The normal equations are

$$\sum_{i=1}^n (y_i - \beta_1 - \beta_2 z_i) = 0, \quad \sum_{i=1}^n z_i(y_i - \beta_1 - \beta_2 z_i) = 0. \tag{2.2.11}$$

When the $z_i$'s are not all equal, we get the solutions

$$\hat\beta_2 = \frac{\sum_{i=1}^n (z_i - \bar z)(y_i - \bar y)}{\sum_{i=1}^n (z_i - \bar z)^2} \tag{2.2.12}$$

and

$$\hat\beta_1 = \bar y - \hat\beta_2 \bar z \tag{2.2.13}$$

where $\bar z = (1/n)\sum_{i=1}^n z_i$, and $\bar y = (1/n)\sum_{i=1}^n y_i$.

The line $y = \hat\beta_1 + \hat\beta_2 z$ is known as the sample regression line or line of best fit of $y_1, \ldots, y_n$ on $z_1, \ldots, z_n$. Geometrically, if we measure the distance between a point $(z_i, y_i)$ and a line $y = a + bz$ vertically by $d_i = |y_i - (a + bz_i)|$, then the regression line minimizes the sum of the squared distances to the $n$ points $(z_1, y_1), \ldots, (z_n, y_n)$. The vertical distances $\hat\epsilon_i = [y_i - (\hat\beta_1 + \hat\beta_2 z_i)]$ are called the residuals of the fit, $i = 1, \ldots, n$. The line $y = \hat\beta_1 + \hat\beta_2 z$ is an estimate of the best linear MSPE predictor $a_1 + b_1 z$ of Theorem 1.4.3. This connection to prediction explains the use of vertical distances in regression. The regression line for the phosphorus data is given in Figure 2.2.1. Here $\hat\beta_1 = 61.58$ and $\hat\beta_2 = 1.42$.

[Figure 2.2.1. Scatter plot $\{(z_i, y_i);\ i = 1, \ldots, n\}$ and sample regression line for the phosphorus data. $\hat\epsilon_6$ is the residual for $(z_6, y_6)$.]

Remark 2.2.1. The linear regression model is considerably more general than appears at first sight. For instance, suppose we select $p$ real-valued functions of $z$, $g_1, \ldots, g_p$, $p \ge d$, and postulate that $\mu(z)$ is a linear combination of $g_1(z), \ldots, g_p(z)$, that is

$$\mu(z) = \sum_{j=1}^p \theta_j g_j(z).$$

Then we are still dealing with a linear regression model because we can define $w_{p \times 1} = (g_1(z), \ldots, g_p(z))^T$ as our covariate and consider the linear model

$$Y_i = \sum_{j=1}^p \theta_j w_{ij} + \epsilon_i, \quad 1 \le i \le n$$

where $w_{ij} = g_j(z_i)$. For instance, if $d = 1$ and we take $g_j(z) = z^j$, $0 \le j \le 2$, we arrive at quadratic regression, $Y_i = \theta_0 + \theta_1 z_i + \theta_2 z_i^2 + \epsilon_i$; see Problem 2.2.24 for more on polynomial regression.

Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. We return to this in Volume II.

Weighted least squares. In Example 2.2.2 and many similar situations it may not be reasonable to assume that the variances of the errors $\epsilon_i$ are the same for all levels $z_i$ of the covariate variable. However, we may be able to characterize the dependence of $\mathrm{Var}(\epsilon_i)$ on $z_i$ at least up to a multiplicative constant. That is, we can write

$$\mathrm{Var}(\epsilon_i) = w_i\sigma^2 \tag{2.2.14}$$

where $\sigma^2$ is unknown as before, but the $w_i$ are known weights. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic). The method of least squares may not be appropriate because (2.2.5) fails. Note that the variables

$$\tilde Y_i = \frac{Y_i}{\sqrt{w_i}} = \frac{g(\beta, z_i)}{\sqrt{w_i}} + \frac{\epsilon_i}{\sqrt{w_i}}, \quad 1 \le i \le n,$$

are sufficient for the $Y_i$ and that $\mathrm{Var}(\tilde\epsilon_i) = \mathrm{Var}(\epsilon_i/\sqrt{w_i}) = w_i\sigma^2/w_i = \sigma^2$. Thus, if we set $\tilde g(\beta, z_i) = g(\beta, z_i)/\sqrt{w_i}$, $\tilde\epsilon_i = \epsilon_i/\sqrt{w_i}$, then

$$\tilde Y_i = \tilde g(\beta, z_i) + \tilde\epsilon_i, \quad 1 \le i \le n \tag{2.2.15}$$

and the $\tilde Y_i$ satisfy the assumption (2.2.5). The weighted least squares estimate of $\beta$ is now the value $\hat\beta$, which for given $\tilde y_i = y_i/\sqrt{w_i}$ minimizes

$$\sum_{i=1}^n [\tilde y_i - \tilde g(\beta, z_i)]^2 = \sum_{i=1}^n \frac{1}{w_i}[y_i - g(\beta, z_i)]^2 \tag{2.2.16}$$

as a function of $\beta$.

Example 2.2.3. Weighted Linear Regression. Consider the case in which $d = 2$, $z_{i1} = 1$, $z_{i2} = z_i$, and $g(\beta, z_i) = \beta_1 + \beta_2 z_i$, $i = 1, \ldots, n$. We need to find the values $\hat\beta_1$ and $\hat\beta_2$ of $\beta_1$ and $\beta_2$ that minimize

$$\sum_{i=1}^n v_i[y_i - (\beta_1 + \beta_2 z_i)]^2 \tag{2.2.17}$$

where $v_i = 1/w_i$. This problem may be solved by setting up analogues to the normal equations (2.1.8). We can also use the results on prediction in Section 1.4 as follows. Let $(Z^*, Y^*)$ denote a pair of discrete random variables with possible values $(z_1, y_1), \ldots, (z_n, y_n)$ and probability distribution given by

$$P[(Z^*, Y^*) = (z_i, y_i)] = u_i, \quad i = 1, \ldots, n$$

where

$$u_i = v_i \Big/ \sum_{i=1}^n v_i, \quad i = 1, \ldots, n.$$

If $\mu_l(Z^*) = \beta_1 + \beta_2 Z^*$ denotes a linear predictor of $Y^*$ based on $Z^*$, then its MSPE is given by

$$E[Y^* - \mu_l(Z^*)]^2 = \sum_{i=1}^n u_i[y_i - (\beta_1 + \beta_2 z_i)]^2.$$

It follows that the problem of minimizing (2.2.17) is equivalent to finding the best linear MSPE predictor of $Y^*$. Thus, using Theorem 1.4.3,

$$\hat\beta_2 = \frac{\mathrm{Cov}(Z^*, Y^*)}{\mathrm{Var}(Z^*)} = \frac{\sum_{i=1}^n u_i z_i y_i - (\sum_{i=1}^n u_i y_i)(\sum_{i=1}^n u_i z_i)}{\sum_{i=1}^n u_i z_i^2 - (\sum_{i=1}^n u_i z_i)^2} \tag{2.2.18}$$

and

$$\hat\beta_1 = E(Y^*) - \hat\beta_2 E(Z^*) = \sum_{i=1}^n u_i y_i - \hat\beta_2 \sum_{i=1}^n u_i z_i.$$

This computation suggests, as we make precise in Problem 2.2.26, that weighted least squares estimates are also plug-in estimates.

Next consider finding the $\hat\beta$ that minimizes (2.2.16) for $g(\beta, z_i) = z_i^T\beta$ and for general $d$. By following the steps of Example 2.1.1 leading to (2.1.7), we find (Problem 2.2.27) that $\hat\beta$ satisfy the weighted least squares normal equations

$$Z_D^T W^{-1} Y = (Z_D^T W^{-1} Z_D)\hat\beta \tag{2.2.19}$$

where $W = \mathrm{diag}(w_1, \ldots, w_n)$ and $Z_D = \|z_{ij}\|_{n \times d}$ is the design matrix. When $Z_D$ has rank $d$ and $w_i > 0$, $1 \le i \le n$, we can write

$$\hat\beta = (Z_D^T W^{-1} Z_D)^{-1} Z_D^T W^{-1} Y. \tag{2.2.20}$$

Remark 2.2.2. More generally, we may allow for correlation between the errors $\{\epsilon_i\}$. That is, suppose $\mathrm{Var}(\epsilon) = \sigma^2 W$ for some invertible matrix $W_{n \times n}$. Then it can be shown (Problem 2.2.28) that the model $Y = Z_D\beta + \epsilon$ can be transformed to one satisfying (2.2.1) and (2.2.4)-(2.2.6). Moreover, when $g(\beta, z) = z^T\beta$, the $\hat\beta$ minimizing the least squares contrast in this transformed model is given by (2.2.19) and (2.2.20).
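The prediction argument behind (2.2.18) translates directly into a few lines of code. The following is a minimal sketch, not taken from the text: the function name `weighted_line_fit` and the use of plain Python lists are assumptions made for illustration. With equal weights it reduces to ordinary least squares, so it can be checked against the phosphorus-data values 61.58 and 1.42 reported for Example 2.2.2.

```python
def weighted_line_fit(z, y, w=None):
    """Minimize sum_i (1/w_i) * (y_i - b1 - b2*z_i)**2 over (b1, b2).

    Uses the discrete-distribution device of (2.2.18): put mass
    u_i = v_i / sum(v) on (z_i, y_i) with v_i = 1/w_i, then
    b2 = Cov(Z*, Y*) / Var(Z*) and b1 = E(Y*) - b2 * E(Z*).
    """
    n = len(z)
    if w is None:
        w = [1.0] * n                      # equal variances: ordinary least squares
    v = [1.0 / wi for wi in w]             # v_i = 1 / w_i
    total = sum(v)
    u = [vi / total for vi in v]           # probabilities u_i
    ez = sum(ui * zi for ui, zi in zip(u, z))
    ey = sum(ui * yi for ui, yi in zip(u, y))
    cov = sum(ui * zi * yi for ui, zi, yi in zip(u, z, y)) - ez * ey
    var = sum(ui * zi * zi for ui, zi in zip(u, z)) - ez * ez
    if var == 0:
        raise ValueError("all z_i are equal; the slope is not identifiable")
    b2 = cov / var
    b1 = ey - b2 * ez
    return b1, b2


# Phosphorus data of Example 2.2.2 (equal weights).
z = [1, 4, 5, 9, 11, 13, 23, 23, 28]
y = [64, 71, 54, 81, 76, 93, 77, 95, 109]
b1, b2 = weighted_line_fit(z, y)
print(round(b1, 2), round(b2, 2))  # prints 61.58 1.42
```

Passing a list `w` of known variance multipliers, as in (2.2.14), gives the weighted fit; observations with larger `w[i]` then receive proportionally less mass `u[i]`.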
which again enables us to analyze the behavior of $\hat\theta$ using known properties of sums of independent random variables. Evidently, there may be solutions of (2.2.27) that are not maxima or only local maxima, and as we have seen in Example 2.2.5, situations with $\theta$ well defined but (2.2.27) doesn't make sense. Nevertheless, the dual point of view of (2.2.22) and (2.2.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section.

Here are two simple examples with $\theta$ real.

Example 2.2.6. Consider a population with three kinds of individuals labeled 1, 2, and 3 and occurring in the Hardy-Weinberg proportions

$$p(1, \theta) = \theta^2, \quad p(2, \theta) = 2\theta(1 - \theta), \quad p(3, \theta) = (1 - \theta)^2$$

where $0 < \theta < 1$ (see Example 2.1.4). If we observe a sample of three individuals and obtain $x_1 = 1$, $x_2 = 2$, $x_3 = 1$, then

$$L_x(\theta) = p(1, \theta)p(2, \theta)p(1, \theta) = 2\theta^5(1 - \theta).$$

The likelihood equation is

$$\frac{\partial}{\partial\theta} l_x(\theta) = \frac{5}{\theta} - \frac{1}{1 - \theta} = 0,$$

which has the unique solution $\hat\theta = \frac{5}{6}$. Because

$$\frac{\partial^2}{\partial\theta^2} l_x(\theta) = -\frac{5}{\theta^2} - \frac{1}{(1 - \theta)^2} < 0$$

for all $\theta \in (0, 1)$, $\frac{5}{6}$ maximizes $L_x(\theta)$. In general, let $n_1$, $n_2$, and $n_3$ denote the number of $\{x_1, \ldots, x_n\}$ equal to 1, 2 and 3, respectively. Then the same calculation shows that if $2n_1 + n_2$ and $n_2 + 2n_3$ are both positive, the maximum likelihood estimate exists and is given by

$$\hat\theta(x) = \frac{2n_1 + n_2}{2n}. \tag{2.2.28}$$

If $2n_1 + n_2$ is zero, the likelihood is $(1 - \theta)^{2n}$, which is maximized by $\theta = 0$, so the MLE does not exist because $\Theta = (0, 1)$. Similarly, the MLE does not exist if $n_2 + 2n_3 = 0$.

Example 2.2.7. Let $X$ denote the number of customers arriving at a service counter during $n$ hours. If we make the usual simplifying assumption that the arrivals form a Poisson process, then $X$ has a Poisson distribution with parameter $n\lambda$, where $\lambda$ represents the expected number of arrivals in an hour or, equivalently, the rate of arrival. In practice, $\lambda$ is an unknown positive constant and we wish to estimate $\lambda$ using $X$. Here $X$ takes on values $\{0, 1, 2, \ldots\}$ with probabilities

$$p(x, \lambda) = \frac{e^{-\lambda n}(\lambda n)^x}{x!}, \quad x = 0, 1, \ldots \tag{2.2.29}$$

The likelihood equation is

$$\frac{\partial}{\partial\lambda} l_x(\lambda) = \frac{x}{\lambda} - n = 0,$$

which has the unique solution $\hat\lambda = x/n$. If $x$ is positive, this estimate is the MLE of $\lambda$ (see (2.2.30)). If $x = 0$ the MLE does not exist. However, the maximum is approached as $\lambda \downarrow 0$.

To apply the likelihood equation successfully we need to know when a solution is an MLE. A sufficient condition, familiar from calculus, is that $l$ be concave in $\theta$. If $l$ is twice differentiable, this is well known to be equivalent to

$$\frac{\partial^2}{\partial\theta^2} l_x(\theta) \le 0, \tag{2.2.30}$$

for all $\theta$. This is the condition we applied in Example 2.2.6. A similar condition applies for vector parameters.

Example 2.2.8. Multinomial Trials. As in Example 1.6.7, consider an experiment with $n$ i.i.d. trials in which each trial can produce a result in one of $k$ categories. Let $X_i = j$ if the $i$th trial produces a result in the $j$th category, let $\theta_j = P(X_i = j)$ be the probability of the $j$th category, and let $N_j = \sum_{i=1}^n 1[X_i = j]$ be the number of observations in the $j$th category. We assume that $n \ge k - 1$. Then, for an experiment in which we observe $n_j = \sum_{i=1}^n 1[x_i = j]$,

$$p(x, \theta) = \prod_{j=1}^k \theta_j^{n_j}, \quad \theta \in \Theta = \Big\{\theta : \theta_j \ge 0,\ \sum_{j=1}^k \theta_j = 1\Big\},$$

and

$$l_x(\theta) = \sum_{j=1}^k n_j \log\theta_j, \quad \theta \in \Theta. \tag{2.2.31}$$

To obtain the MLE $\hat\theta$ we consider $l$ as a function of $\theta_1, \ldots, \theta_{k-1}$ with

$$\theta_k = 1 - \sum_{j=1}^{k-1} \theta_j. \tag{2.2.32}$$

We first consider the case with all the $n_j$ positive. Then $p(x, \theta) = 0$ if any of the $\theta_j$ are zero; thus, the MLE must have all $\hat\theta_j > 0$, and must satisfy the likelihood equations

$$\frac{\partial}{\partial\theta_j} l_x(\theta) = \frac{\partial}{\partial\theta_j}\sum_{l=1}^k n_l \log\theta_l = \sum_{l=1}^k \frac{n_l}{\theta_l}\frac{\partial\theta_l}{\partial\theta_j} = 0, \quad j = 1, \ldots, k - 1.$$

By (2.2.32), $\partial\theta_k/\partial\theta_j = -1$, and the equation becomes $\hat\theta_k/\hat\theta_j = n_k/n_j$. Next use (2.2.32) to find

$$\hat\theta_j = \frac{n_j}{n}, \quad j = 1, \ldots, k.$$
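The closed forms (2.2.28) and $\hat\theta_j = n_j/n$ are easy to check numerically. The sketch below is illustrative only: the function names `hardy_weinberg_mle` and `multinomial_mle` are hypothetical, category labels are assumed to be the integers $1, \ldots, k$, and exact rational arithmetic is used so the results can be compared with the fractions in the text.

```python
from collections import Counter
from fractions import Fraction


def hardy_weinberg_mle(xs):
    """MLE (2.2.28) of theta from labels in {1, 2, 3}; None if it does not exist."""
    counts = Counter(xs)
    n1, n2, n3 = counts[1], counts[2], counts[3]
    n = n1 + n2 + n3
    # On Theta = (0, 1) the MLE exists only if both sums are positive.
    if 2 * n1 + n2 == 0 or n2 + 2 * n3 == 0:
        return None
    return Fraction(2 * n1 + n2, 2 * n)


def multinomial_mle(xs, k):
    """Category frequencies n_j / n for labels in {1, ..., k}."""
    counts = Counter(xs)
    n = len(xs)
    return [Fraction(counts[j], n) for j in range(1, k + 1)]


print(hardy_weinberg_mle([1, 2, 1]))  # prints 5/6, as in Example 2.2.6
```

Returning `None` rather than a boundary value mirrors the statement that the MLE does not exist when $2n_1 + n_2 = 0$ or $n_2 + 2n_3 = 0$.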
i=l g«(3. least squares estimates are maximum likelihood for the particular model Po. then <r <k I. we can consider minimizing L:i. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) . Summary.. are known functions and {3 is a parameter to be estimated from the independent observations YI . ((g((3. where E(V. More generally. Zi). The 0 < OJ < 1.2. OJ > ~ 0 case.1). Using the concavity argument.X)2 (Problem 2.(Xi .1. wi(5). zi)1 . Thus.30.2 ) with J1. As we have IIV.. we check the concavity of lx(O): let 1 1 <j < k ...2. 'ffi f. Zi)J[l:} .28. z.Zi)]2. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. Suppose the model Po of Example 2. (2. ~ g((3. see Problem 2. . In Section 2.34) g«(3.2 both unknown.2. version of this example will be considered in the exponential family case in Section 2. Next suppose that nj = 0 for some j. o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood. 1 Yn .33) It follows that in this nj > 0. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. i = 1.' .) = gi«(3).. and 0. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3. X n are Li. N(lll 0.2. these estimates viewed as an algorithm applied to the set of data X make sense much more generally. Suppose that Xl. 0.202 L. . is still the unique MLE of fJ. j = 1. YnJT... n. .lV. 1 < j < k.IV.. where Var(Yi) does not depend on i. k..11(a)).2.... .. gi. zn) 03W.d.1.  seen and shall see more in Section 6.g«(3.1 holds and X = (Y" .120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). .1 L:~ . . Then (} with OJ = njjn.6. Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O).9.). . This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . Example 2.3. 1 < i < n. 
as maximum likelihood estimates for f3 when Y is distributed as lV. See Problem 2.2.. ~ g«(3.gi«(3) 1 .. . g«(3. Then Ix «(3) log IT ~'P (V.2. .1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.
e e I"V e De= {(a.3. b). See Problem 2. or 8 1nk diverges with 181nk I .Section 2. Suppose we are given a function 1: continuous. are given.. In the case of independent response variables Yi that are modeled to have a N(9i({3). . m. oo]P.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly.2 and other exponential family properties also playa role. 8). m).1 ). a8 is the set of points outside of 8 that can be obtained as limits of points in e.4 and Corollaries 1. (m. as k .zid.I ) all tend to De as m ~ 00.6. in the N(O" O ) case.bE {a. including all points with ±oo as a coordinate. (}"2) distribution.5. For instance. Concavity also plays a crucial role in the analysis of algorithms in the next section. These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e.2. 2 e Lemma 2. = RxR+ and ee e. In Section 2. m. . (m. it is shown that the MLEs coincide with the LSEs.3. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. .3.6.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.oo}}. We start with a useful general framework and lemma.3. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.1 and 1. (a. In general. if X N(B I . b). which are appropriate when Var(Yj) depends on i or the Y's are correlated. Properties that derive solely fTOm concavity are given in Propositon 2. That is. 
In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d.1) lim{l(li) : Ii ~ ~ De} = 00.1 only. Suppose 8 c RP is an open set.a < b< oo} U {(a.00. Let &e = be the boundary of where denotes the closure of in [00. z=1=1 ~ 2.1.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. though the results of Theorems 1.00. Proof. we define 8 1n .6. (a. for a sequence {8 m } of points from open. (h). (m.b) .a= ±oo.6.. Suppose also that e t R where e c RP is open and 1 is (2. where II denotes the Euclidean norm.. b) : aER. For instance.1. Formally. Extensions to weighted least squares.3.3 and 1.
and is a solution to the equation (2. then lx(11 m ) . .2) then the MLE Tj exists.3.3) has i 1 !. then Ix lx(O. then the MLE8(x) exists and is unique. = . /fCT is the convex suppon ofthe distribution ofT (X). ! " . Then. We show that if {11 m } has no subsequence converging to a point in E. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. open C RP. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.12. We give the proof for the continuous case. '10) for some reference '10 E [ (see Problem 1.to. ' L n . Suppose P is the canonical exponential/amity generated by (T.3. I I.3. From (B. then the MLE doesn't exist and (2.)) 0 Themem 2.3. no solution. By Lemma 2.1.).6.3. Without loss of generality we can suppose h(x) = pix. Furthennore. U m = Jl~::rr' Am = .9) we know that II lx(lI) is continuous on e. Write 11 m = Am U m . is open.3. II E j.1. II) is strictly concave and 1.3. are distinct maximizers.3) I I' (b) Conversely. is unique.3. Proofof Theorem 2.1.(11) ~ 00 as densities p(x.8 and 2. E. (ii) The family is of rank k. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1.  (hO! + 0.. if lx('1) logp(x.3. Suppose the cOlulitions o/Theorem 2.1.3.27). we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) . thus.1. lI(x) =  exists. Suppose X ~ (PII . then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. if to doesn't satisfy (2. Existence and Uniqueness o/the MLE ij.)) > ~ (fx(Od +lx (0. a contradiction. Applications of this theorem are given in Problems 2.. II).00.122 Methods of Estimation Chapter 2 Proposition 2. with corresponding logp(x. h) and that (i) The natural parameter space. Let x be the observed data vector and set to = T(x). i .2).1 hold. " We. We can now prove the following. Corollary 2.1. which implies existence of 17 by Lemma 2. " .3. .3..3. .'1 . and 0. [ffunher {x (II) e = e e t 88. If 0. 
~ Proof.'1) with T(x) = 0.
this is the exponential family generated by T(X) = 1 Xi. Write Eo for E110 and Po for P11o . Po[uTT(X) > 61 > O.3.3. Suppose the conditions ofTheorem 2. (1"2 > O. fOT all 1]. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0. . for every d i= 0. It is unique and satisfies x (2. . 0 ° '* '* PrOD/a/Corollary 2. Por n = 1. Nonexistence: if (2. As we observed in Example 1. Theorem 2. By (B. which is impossible. existence of MLEs when T has a continuous ca. 0 (L:7 L:7 CJ. 0 Example 2.Xn are i. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows.se density is a general phenomenon. Then AU ¢ £ by assumption.2. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. I Xl) and 1. u mk t u.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. thus.5. N(p" ( 2 ). In fact. CT = R )( R+ FOT n > 2. I' E R.2) and COTollary 2. T(X) has a density and.3. um~. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. The Gaussian Model. lIu m ll 00.3. Then because for some 6 > 0.1.4. Then.6.3.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. contradicting the assumption that the family is of rank: k.3.3. The equivalence of (2.k Lx( 11m. So we have Case 2: Amk t A.d.9. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists.i. Case 1: Amk t 111]=11. Evidently. that is.Section 2.3). Suppose X"". t u.3. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0.* P'IlcTT = 0] = I.1 follow. iff.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2..2) fails. CT = C~ and the MLE always exists.1.6.J = 00.3. So. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist. So In either case limm.3) by TheoTem 1.
When method of moments and frequency substitution estimates are not unique.3. P > 0. > O. 0 = If T is discrete MLEs need not exist.Xn are i. Thasadensity.3..5) have a unique solution with probability 1. the maximum likelihood principle in many cases selects the «best" estimate among them. Example 2.4 we will see that 83 is.. This is a rank 2 canonical exponential family generated by T   = (L log Xi.)e>"XxPI. A nontrivial application of Theorem 2. Suppose Xl. I <t< k .= r(jJ) A log X (2. I < j < k.7. then and the result follows from Corollary 2. but only Os is a MLE. by Problem 2..1.>.3). For instance.3.. in a certain sense. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.3. = = L L i i bJ _ I . It is easy to see that ifn > 2.2 that (2. How to find such nonexplicit solutions is discussed in Section 2. The statistic of rank k .2(b» r' log .I. To see this note that Tj > 0. The likelihood equations are equivalent to (problem 2.n. The TwoParameter Gamma Family.ln.3.3 we know that E. Because the resulting value of t is possible if 0 < tjO < n.4) and (2.I which generates the family is T(k_l) = (Tl>"" T. We conclude from Theorem 2.1 in the second.13 and the next example). if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .3. using (2..1 that in this caseMLEs of"'j = 10g(AjIAk)._t)T.6.(X) = r~.i. x > 0.3.2) holds. where Tj(X) L~ I I(Xi = j).. Thus. o Remark 2.4. I < j < k iff 0 < Tj < n..2. the best estimate of 8. The boundary of a convex set necessarily has volume 0 (Problem 2.3. exist iff all Tj > O. I <j < k.2 follows. From Theorem 1.3. we see that (2. Here is an example.1.3. and one of the two sums is nonempty because c oF 0.124 Methods of Estimation Chapter 2 Proof.3. with density 9p.3. where 0 < Aj PIX = j] < I. 
We follow the notation of Example 1.1.3.d.5) ~=X A where log X ~ L:~ t1ogXi.I and verify using Theorem 2. if T has a continuous case density PT(t).. In Example 3. I < j < k. Multinomial Trials. in the HardyWeinberg examples 2..6.4 and 2. . with i'.9).).T(X) = A(.2.1. They are determined by Aj = Tj In.4) (2.. lit ~ . Thus. . liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2.2(a).1.6.4.1). h(x) = XI.3.3. >. I < j < k. Example 2..3. LXi).3. We assume n > k . thus.
1 we can obtain a contradiction to (2. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to .. c(8) is closed in [ and T(x) = to satisfies (2.2.3.3.A(c(ii)) ~ = O.6. However. our parameter set is open. Remark 2.exist and are unique. 1)T. In some applications. (2.. Consider the exponential family k p(x.3. mxk e.2) so that the MLE ij in P exists. The remaining case T k = a gives a contradiction if c = (1.6) on c( II) = ~~.1.3.. n > kl.~l Aj = I}.3. . The following result can be useful. Similarly. m < k .3..k.1 directly D (Problem 2. (B). When P is not an exponential family both existence and unicity of MLEs become more problematic. . the following corollary to Theorem 2. . e open C R=.3) does not have a closedform solution as in Example 1.2) by taking Cj = 1(i = j). lfP above satisfies the condition of Theorem 2.B) ~ h(x)exp LCj(B)Tj(x) . Let CO denote the interior of the range of (c.1] it does exist ~nd is unique. If the equations ~ have a solution B(x) E Co.3.1. Unfortunately strict concavity of Ix is not inherited by curved exponential families.1 is useful. Ck (B)) T and let x be the observed data.3. if 201 + 0.3.3.Section 2.13). be a cUlVed exponential family p(x.1).3 can be applied to determine existence in cases for which (2. for example.2..10). In Example 2.B(B) j=l .3.1. the MLE does not exist if 8 ~ (0. . 1 < i < k . when we put the multinomial in canonical exponential family form. h). if any TJ = 0 or n. Let Q = {PII : II E e).1 and Haberman (1974).3.6. L. and unicity can be losttake c not onetoone for instance. Corollary 2.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand. (II) . x E X.I.2. then it is the unique MLE ofB.3. whereas if B = [0.1.. 0 The argument of Example 2. the bivariate normal case (Problem 2.3. BEe.3. Alternatively we can appeal to Corollary 2.3. . note that in the HardyWeinberg Example 2. 
.8see Problem 2.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0.lI) ~ exp{cT (II)T(x) .A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2. Then Theorem 2. . = 0. 0 < j < k .7) Note that c(lI) E c(8) and is in general not ij. the MLEs ofA"j = 1. Here [ is the natural paramerer space of the exponential fantily P generated by (T.
. Ii± ~ ~[). m m m f?.. . = ~ryf.it..7) becomes = 0.2 .. m} is a 2nparameter canonical exponential family with 'TJi = /tila... 9) is a curved exponential family of the form (2...5 = . l}m. simplifies to Jl2 + A6XIl. where 0l N(tLj 1 0'1).3.1/'1') T .6.g( it' .3.3.d. Jl > 0.. L: Xi and I.6.oJ).) : ry.?' m)T " "... n.5 and 1. 11. N'(fl. Because It > 0.2~~2 .\()'.5.6. . . we can conclude that an MLE Ii always exists and satisfies (2. .it. = ( ~Yll'''' .n(it' + )"5it'))T which with Ji2 = n. Suppose that }jI •. af = f)3(8 1 + 82 z i )2.ry.3. I .. . < O}.ry. . Using Examples 1. .) : ry.9. generated by h(Y) = 1 and "'> I T(Y) '. corresponding to 1]1 = //x' 1]2 = . ..~Y'1.2.6) with " • . < O}.10. Gaussian with Fixed Signal to Noise. with I.Zn are given constants. = 1 " = 2 n( ryd'1' .n. Next suppose.". 1 n. ry. This is a curved exponential ¥.!0b''. < Zn where Zt.5X' +41i2]' < 0... LocationScale Regression.gx±).3. Zl < .3 )(t../2'1' . We find CI Example 2. . E R. j = 1. which implies i"i+ > 0.i..: Note that 11+11_ = '6M2 solution we seek is ii+.126 The proof is sketched in Problem 2.I LX. tLi = (}1 + 82 z i . 'TJn+i = 1/2a. Now p(y.3.4. .10. l = 1. t. we see that the distribution of {01 : j = 1. that ..'' . . Equation (2. .3.6..Xn are i.~Yn. Methods of Estimation Chapter 2 family with (It) = C2 (It) = . i = 1.nit. C(O) and from Example 1.11. .ryl > O. As a consequence of Theorems 2. ). Evidently c(8) = {(lh.3...A6iL2 = 0 Ii. = L: xl.< O. suppose Xl.7) if n > 2...2 and 2. which is closed in E = {( ryt> ry.'. i Example 2. As in Example 1.. as in Example 1.3)T.6.0'2) with Jl/ a = AO > a known.3. the o q. '1. are n independent random samples. . '() A 1/ Thus.
Here. order d3 operations to invert. entrust ourselves. even in the context of canonical multiparameter exponential families. Ixd large enough. E R.02 E R. x o1d = xo. f i strictly. L 10).Section 2.1. which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2. ~ 2.1). In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with £ open (Theorem 2. is isolated and shown to apply to a broader class of models. such as the twoparameter gamma. .3. Given tolerance € > a for IXfinal .). there exists unique x*€(a. the basic property making Theorem 2. strict concavity. > OJ. These results lead to a necessary condition for existence of the MLE in curved exponential families but without a guarantee of unicity or sufficiency. then.7). However.x*l: Find Xo < x" f(xo) < 0 < f(x') by taking Ixol.1 and Corollary 2. It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis.3. Then c(8) is closed in £ and we can conclude that for m satisfies (2. Let £ be the canonical parameter set for this full model and let e ~ (II: II.O. if implemented as usual. even in the classical regression model with design matrix ZD of full rank.3. Initialize x61d ~ XI. Given f continuous on (a.4. L 10) for {3 is easy to write down symbolically but not easy to evaluate if d is at all large because inversion of Z'bZD requires on the order of nd 2 operations to evaluate each of d(d + 1)/2 tenns with n operations to get Z'bZD and then. is the bisection algOrithm to find x*. by the intermediate value theorem. we will discuss three algorithms of a type used in different statistical contexts both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all.3. > 2. then the full 2nparameter model satisfies the conditions of Theorem 2. 2. 
MLEs may not be given explicitly by fonnulae but only implicitly as the solutions of systems of nonlinear equations. b).4 ALGORITHMIC ISSUES As we have seen. f( a+) < 0 < f (b.1 The Method of Bisection The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in kparameter exponential families.1 work. d. The packages that produce least squares estimates do not in fact use fonnula (2. in pseudocode.3. at various times. the fonnula (2. We begin with the bisection and coordinate ascent methods. In fact.3.1. Finally.4 Algorithmic Issues 127 If m > 2. in this section. an MLE (J of (J exists and () 0 ~ ~ Summary. b) such that f(x*) = O.
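(1) If $|x^+_{\text{old}} - x^-_{\text{old}}| < 2\varepsilon$, $x_{\text{final}} = \tfrac{1}{2}(x^+_{\text{old}} + x^-_{\text{old}})$ and return $x_{\text{final}}$.
(2) Else, $x_{\text{new}} = \tfrac{1}{2}(x^+_{\text{old}} + x^-_{\text{old}})$.
(3) If $f(x_{\text{new}}) = 0$, $x_{\text{final}} = x_{\text{new}}$.
(4) If $f(x_{\text{new}}) < 0$, $x^-_{\text{old}} = x_{\text{new}}$.
(5) If $f(x_{\text{new}}) > 0$, $x^+_{\text{old}} = x_{\text{new}}$.
Go to (1).
End

Lemma 2.4.1. The bisection algorithm stops at a solution $x_{\text{final}}$ such that

$$|x_{\text{final}} - x^*| \le \varepsilon.$$

Proof. If $x_m$ is the $m$th iterate of $x_{\text{new}}$,

(1) $|x_m - x_{m-1}| \le \tfrac{1}{2}|x_{m-1} - x_{m-2}| \le \cdots \le \dfrac{1}{2^{m-1}}|x_1 - x_0|$.

Moreover, by the intermediate value theorem,

(2) $x_m \le x^* \le x_{m+1}$ for all $m$.

Therefore,

(3) $|x_{m+1} - x^*| \le 2^{-m}|x_1 - x_0|$

and $x_m \to x^*$ as $m \to \infty$. Moreover, for $m = \log_2(|x_1 - x_0|/\varepsilon)$, $|x_{m+1} - x^*| \le \varepsilon$. □

If desired one could evidently also arrange it so that, in addition, $|f(x_{\text{final}})| \le \varepsilon$.

From this lemma we can deduce the following.

Theorem 2.4.1. Let $p(x \mid \eta)$ be a one-parameter canonical exponential family generated by $(T, h)$, satisfying the conditions of Theorem 2.3.1 and $T = t_0 \in C_T^0$, the interior $(a, b)$ of the convex support of $p_T$. Then, the MLE $\hat\eta$, which exists and is unique by Theorem 2.3.1, may be found (to tolerance $\varepsilon$) by the method of bisection applied to

$$f(\eta) \equiv E_\eta T(X) - t_0.$$

Proof. By Theorem 1.6.4, $f'(\eta) = \text{Var}_\eta T(X) > 0$ for all $\eta$, so that $f$ is strictly increasing and continuous and necessarily, because $\hat\eta$ exists, $f(a+) < 0 < f(b-)$. □

Example 2.4.1. The Shape Parameter Gamma Family. Let $X_1, \ldots, X_n$ be i.i.d. $\Gamma(\theta, 1)$,

$$p(x, \theta) = \Gamma^{-1}(\theta)\, x^{\theta - 1} e^{-x}, \quad x > 0,\ \theta > 0. \tag{2.4.1}$$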
Because $T(X) = \sum_{i=1}^n \log X_i$ has a density for all $n$ the MLE always exists. It solves the equation

$$\frac{\Gamma'(\theta)}{\Gamma(\theta)} = \frac{T(X)}{n},$$

which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty. The function $\Gamma(\theta) = \int_0^\infty x^{\theta - 1} e^{-x}\,dx$ needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB. In fact, bisection itself is a defined function in some packages. □

2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical $k$-parameter exponential family,

$$E_\eta(T(X)) = \dot{A}(\eta) = t_0$$

when the MLE $\hat\eta \equiv \hat\eta(t_0)$ exists. Here is the algorithm, which is slow, but as we shall see, always converges to $\hat\eta$.

The case $k = 1$: see Theorem 2.4.1.

The general case: Initialize

$$\hat\eta^0 = (\hat\eta_1^0, \ldots, \hat\eta_k^0).$$

Solve

for $\hat\eta_1^1$: $\dfrac{\partial}{\partial \eta_1} A(\eta_1, \hat\eta_2^0, \ldots, \hat\eta_k^0) = t_1$

for $\hat\eta_2^1$: $\dfrac{\partial}{\partial \eta_2} A(\hat\eta_1^1, \eta_2, \hat\eta_3^0, \ldots, \hat\eta_k^0) = t_2$

$\vdots$

for $\hat\eta_k^1$: $\dfrac{\partial}{\partial \eta_k} A(\hat\eta_1^1, \hat\eta_2^1, \ldots, \eta_k) = t_k$.

Set

$$\hat\eta^{01} \equiv (\hat\eta_1^1, \hat\eta_2^0, \ldots, \hat\eta_k^0), \quad \hat\eta^{02} \equiv (\hat\eta_1^1, \hat\eta_2^1, \hat\eta_3^0, \ldots, \hat\eta_k^0), \quad \text{and so on,}$$

and finally

$$\hat\eta^{0k} \equiv \hat\eta^{(1)} = (\hat\eta_1^1, \ldots, \hat\eta_k^1).$$

Repeat, getting $\hat\eta^{(r)}$, $r \ge 1$, eventually.

Notes:

(1) In practice, we would again set a tolerance to be, say $\varepsilon$, for each of the $\hat\eta^{jl}$, $1 \le l \le k$, in cycle $j$ and stop possibly in midcycle as soon as

$$|\hat\eta^{jl} - \hat\eta^{j(l-1)}| \le \varepsilon.$$
(2) Notice that $\dfrac{\partial A}{\partial \eta_l}(\hat\eta_1^j, \ldots, \hat\eta_{l-1}^j, \eta_l, \hat\eta_{l+1}^{j-1}, \ldots)$ is the expectation of $T_l(X)$ in the one-parameter exponential family model with all parameters save $\eta_l$ assumed known. Thus, the algorithm may be viewed as successive fitting of one-parameter families. We pursue this discussion next.

Theorem 2.4.2. If $\hat\eta^{(r)}$ are as above, (i), (ii) of Theorem 2.3.1 hold and $t_0 \in C_T^0$, then

$$\hat\eta^{(r)} \to \hat\eta \quad \text{as } r \to \infty.$$

Proof. We give a series of steps. Let $l(\eta) = t_0^T \eta - A(\eta) + \log h(x)$, the log likelihood.

(1) $l(\hat\eta^{ij})$ is increasing in $j$ for $i$ fixed and in $i$. If $1 \le j \le k$, $\hat\eta^{ij}$ and $\hat\eta^{i(j+1)}$ differ in only one coordinate for which $\hat\eta^{i(j+1)}$ maximizes $l$. Therefore, $\lim_{i,j} l(\hat\eta^{ij}) = \lambda$ (say) exists and is $> -\infty$.

(2) The sequence $(\hat\eta^{i1}, \ldots, \hat\eta^{ik})$ has a convergent subsequence in $\bar{\mathcal{E}} \times \cdots \times \bar{\mathcal{E}}$,

$$(\hat\eta^{i_n 1}, \ldots, \hat\eta^{i_n k}) \to (\hat\eta^1, \ldots, \hat\eta^k).$$

But $\hat\eta^j \in \mathcal{E}$, $1 \le j \le k$. Else $\lim_i l(\hat\eta^{ij}) = -\infty$ for some $j$.

(3) $l(\hat\eta^j) = \lambda$ for all $j$ because the sequence of likelihoods is monotone.

(4) $\dfrac{\partial l}{\partial \eta_j}(\hat\eta^j) = 0$ because $\dfrac{\partial l}{\partial \eta_j}(\hat\eta^{i_n j}) = 0$ for all $n$.

(5) Because $\hat\eta^1$, $\hat\eta^2$ differ only in the second coordinate, (3) and (4) imply $\hat\eta^1 = \hat\eta^2$. Continuing, $\hat\eta^1 = \cdots = \hat\eta^k$. Here we use the strict concavity of $l$.

(6) By (4) and (5), $\dot{A}(\hat\eta^1) = t_0$. Hence, $\hat\eta^1$ is the unique MLE.

To complete the proof notice that if $\hat\eta^{(r_k)}$ is any subsequence of $\hat\eta^{(r)}$ that converges to $\hat\eta^*$ (say) then, by (1), $l(\hat\eta^*) = \lambda$. Because $l(\hat\eta^1) = \lambda$ and the MLE is unique, $\hat\eta^* = \hat\eta^1 = \hat\eta$. By a standard argument it follows that $\hat\eta^{(r)} \to \hat\eta$. □

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of Example 2.3.2. For $n \ge 2$ we know the MLE exists. We can initialize with the method of moments estimate from Example 2.3.2, $\hat\lambda^{(0)} = \bar{X}/\hat\sigma^2$, $\hat{p}^{(0)} = \bar{X}^2/\hat\sigma^2$. We now use bisection to get $\hat{p}^{(1)}$ solving

$$\frac{\Gamma'(\hat{p}^{(1)})}{\Gamma(\hat{p}^{(1)})} = \overline{\log X} + \log \hat\lambda^{(0)}$$

and then $\hat\lambda^{(1)} = \hat{p}^{(1)}/\bar{X}$, $\hat\eta^1 = (\hat{p}^{(1)}, -\hat\lambda^{(1)})$. Continuing in this way we can get arbitrarily close to $\hat\eta$. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1 because the equation leading to $\hat\lambda_{\text{new}}$ given $\hat{p}_{\text{old}}$, (2.3.5), is computationally explicit and simple. Whenever we can obtain such steps in algorithms, they result in substantial savings of time. □

It is natural to ask what happens if, in fact, the MLE $\hat\eta$ doesn't exist; that is, $t_0 \notin C_T^0$. Fortunately in these cases the algorithm, as it should, refuses to converge (in $\eta$ space!); see Problem 2.4.2.

We note some important generalizations. Consider a point we noted in Example 2.4.2: For some coordinates $l$, $\hat\eta_l^j$ can be explicit. Suppose that this is true for each $l$. Then each step of the iteration both within cycles and from cycle to cycle is quick. Suppose that we can write $\eta^T = (\eta_1^T, \ldots, \eta_r^T)$ where $\eta_j$ has dimension $d_j$ and $\sum_{j=1}^r d_j = k$ and the problem of obtaining $\hat\eta_l(t_0, \eta_j;\ j \ne l)$ can be solved in closed form. The case we have just discussed has $d_1 = \cdots = d_r = 1$, $r = k$. Then it is easy to see that Theorem 2.4.2 has a generalization with cycles of length $r$, each of whose members can be evaluated easily. A special case of this is the famous Deming-Stephan proportional fitting of contingency tables algorithm; see Bishop, Fienberg, and Holland (1975), for instance, and Problems 2.4.9-2.4.10.

Next consider the setting of Proposition 2.3.1 in which $l_x(\theta)$, the log likelihood for $\theta \in \Theta$ open $\subset R^p$, is strictly concave. If $\hat\theta(x)$ exists and $l_x$ is differentiable, the method extends straightforwardly. Solve

$$\frac{\partial l_x}{\partial \theta_j}(\theta_1^1, \ldots, \theta_{j-1}^1, \theta_j, \theta_{j+1}^0, \ldots, \theta_p^0) = 0$$

by the method of bisection in $\theta_j$ to get $\theta_j^1$ for $j = 1, \ldots, p$, iterate and proceed. Figure 2.4.1 illustrates the process. See also Problem 2.4.7.

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, which we now sketch.

Figure 2.4.1. The coordinate ascent algorithm. The graph shows log likelihood contours, that is, values of $(\theta_1, \theta_2)^T$ where the log likelihood is constant. At each stage with one coordinate fixed, find that member of the family of contours to which the vertical (or horizontal) line is tangent. Change other coordinates accordingly.
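The bisection pseudocode of Section 2.4.1 and the coordinate ascent cycle of Example 2.4.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended implementation: the function names (`bisect_root`, `digamma`, `gamma_mle_coordinate_ascent`) are hypothetical, $\Gamma'/\Gamma$ is approximated by a standard recurrence plus asymptotic series rather than taken from a package, and the bracketing by repeated halving and doubling is an added convenience that the text does not specify.

```python
import math

def bisect_root(f, x_lo, x_hi, eps=1e-10):
    """Bisection for a strictly increasing f with f(x_lo) < 0 < f(x_hi)."""
    if not (f(x_lo) < 0 < f(x_hi)):
        raise ValueError("need f(x_lo) < 0 < f(x_hi)")
    while abs(x_hi - x_lo) >= 2 * eps:       # step (1): stop when bracket is short
        x_new = 0.5 * (x_lo + x_hi)           # step (2): midpoint
        fx = f(x_new)
        if fx == 0:                           # step (3): exact root found
            return x_new
        if fx < 0:                            # step (4): root lies to the right
            x_lo = x_new
        else:                                 # step (5): root lies to the left
            x_hi = x_new
    return 0.5 * (x_lo + x_hi)

def digamma(x):
    """Approximate Gamma'(x)/Gamma(x), x > 0 (assumed helper, not from the text)."""
    acc = 0.0
    while x < 10.0:                           # recurrence psi(x) = psi(x+1) - 1/x
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)                      # asymptotic series for large x
    return acc + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def gamma_mle_coordinate_ascent(xs, eps=1e-10, max_cycles=1000):
    """Coordinate ascent for the gamma(p, lambda) MLE, started at method of moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    lam, p = mean_x / var, mean_x ** 2 / var  # method of moments start
    for _ in range(max_cycles):
        target = mean_log + math.log(lam)
        f = lambda t: digamma(t) - target     # increasing in t
        lo = hi = p
        while f(lo) >= 0:                     # widen bracket downward
            lo /= 2
        while f(hi) <= 0:                     # widen bracket upward
            hi *= 2
        p_new = bisect_root(f, lo, hi, eps * 0.1)   # update p with lambda fixed
        lam_new = p_new / mean_x                     # explicit update of lambda
        done = abs(p_new - p) <= eps and abs(lam_new - lam) <= eps
        p, lam = p_new, lam_new
        if done:
            break
    return p, lam
```

A call such as `gamma_mle_coordinate_ascent([0.5, 1.2, 2.3, 0.8, 3.1])` returns a pair `(p, lam)`. The stopping rule mirrors Note (1): it bounds the change per cycle, not the distance to the MLE, so the tolerance should be chosen conservatively when the contours are elongated.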
2.4.3 The Newton-Raphson Algorithm

An algorithm that, in general, can be shown to be faster than coordinate ascent, when it converges, is the Newton-Raphson method. This method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge. Here is the method: If $\hat\eta_{\text{old}}$ is the current value of the algorithm, then

$$\hat\eta_{\text{new}} = \hat\eta_{\text{old}} - \ddot{A}^{-1}(\hat\eta_{\text{old}})\,(\dot{A}(\hat\eta_{\text{old}}) - t_0). \tag{2.4.2}$$

The rationale here is simple. If $\hat\eta_{\text{old}}$ is close to the root $\hat\eta$ of $\dot{A}(\hat\eta) = t_0$, then by expanding $\dot{A}(\hat\eta)$ around $\hat\eta_{\text{old}}$, we obtain

$$t_0 - \dot{A}(\hat\eta_{\text{old}}) = \dot{A}(\hat\eta) - \dot{A}(\hat\eta_{\text{old}}) \approx \ddot{A}(\hat\eta_{\text{old}})(\hat\eta - \hat\eta_{\text{old}}).$$

$\hat\eta_{\text{new}}$ is the solution for $\hat\eta$ to the approximation equation given by the right- and left-hand sides. If $\hat\eta_{\text{old}}$ is close enough to $\hat\eta$, this method is known to converge to $\hat\eta$ at a faster rate than coordinate ascent; see Dahlquist, Björk, and Anderson (1974). A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in Problem 2.4.7.

Newton's method also extends to the framework of Proposition 2.3.1. In this case, if $l(\theta)$ denotes the log likelihood, the argument that led to (2.4.2) gives

$$\hat\theta_{\text{new}} = \hat\theta_{\text{old}} - \ddot{l}^{-1}(\hat\theta_{\text{old}})\,\dot{l}(\hat\theta_{\text{old}}). \tag{2.4.3}$$

Example 2.4.3. Let $X_1, \ldots, X_n$ be a sample from the logistic distribution with d.f.

$$F(x, \theta) = [1 + \exp\{-(x - \theta)\}]^{-1}.$$

The density is

$$f(x, \theta) = \frac{\exp\{-(x - \theta)\}}{[1 + \exp\{-(x - \theta)\}]^2}.$$

We find

$$\dot{l}(\theta) = n - 2\sum_{i=1}^n \exp\{-(X_i - \theta)\}F(X_i, \theta),$$

$$\ddot{l}(\theta) = -2\sum_{i=1}^n f(X_i, \theta) < 0.$$

The Newton-Raphson method can be implemented by taking $\hat\theta_{\text{old}} = \hat\theta_{\text{MOM}} = \bar{X}$. □

The Newton-Raphson algorithm has the property that for large $n$, $\hat\eta_{\text{new}}$ after only one step behaves approximately like the MLE. We return to this property in Problem 6.6.10.

When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum. A one-dimensional problem in which such difficulties arise is given in Problem 2.4.13. Many examples and important issues and methods are discussed, for instance, in Chapter 6 of Dahlquist, Björk, and Anderson (1974).

2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, $X \sim P_\theta$ with density $p(x, \theta)$, $\theta \in \Theta \subset R^d$. Their log likelihood $l_{p,x}(\theta)$ is "easy" to maximize. Say there is a closed-form MLE or at least $l_{p,x}(\theta)$ is concave in $\theta$. Unfortunately, we observe $S \equiv S(X) \sim Q_\theta$ with density $q(s, \theta)$ where $l_{q,s}(\theta) = \log q(s, \theta)$ is difficult to maximize; the function is not concave, difficult to compute, and so on. A fruitful way of thinking of such problems is in terms of $S$ as representing part of $X$, the rest of $X$ is "missing" and its "reconstruction" is part of the process of estimating $\theta$ by maximum likelihood. The algorithm was formalized with many examples in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). We give a few examples of situations of the foregoing type in which it is used, and its main properties. For detailed discussion we refer to Little and Rubin (1987) and McLachlan and Krishnan (1997). A prototypical example follows; a second important example, a mixture of Gaussians, is given in Example 2.4.5.

Example 2.4.4. Lumped Hardy-Weinberg Data. As in Example 2.2.6, let $X_i$, $i = 1, \ldots, n$, be a sample from a population in Hardy-Weinberg equilibrium for a two-allele locus, $X_i = (\epsilon_{i1}, \epsilon_{i2}, \epsilon_{i3})$, where

$$P_\theta[X = (1, 0, 0)] = \theta^2, \quad P_\theta[X = (0, 1, 0)] = 2\theta(1 - \theta), \quad P_\theta[X = (0, 0, 1)] = (1 - \theta)^2, \quad 0 < \theta < 1.$$

What is observed, however, is not $X$ but $S$ where

$$S_i = X_i, \quad 1 \le i \le m; \qquad S_i = (\epsilon_{i1} + \epsilon_{i2},\ \epsilon_{i3}), \quad m + 1 \le i \le n. \tag{2.4.4}$$

Evidently, $S = S(X)$ where $S(X)$ is given by (2.4.4). This could happen if, for some individuals, the homozygotes of one type ($\epsilon_{i1} = 1$) could not be distinguished from the heterozygotes ($\epsilon_{i2} = 1$). The log likelihood of $S$ now is

$$l_{q,s}(\theta) = \sum_{i=1}^m [2\epsilon_{i1}\log\theta + \epsilon_{i2}\log 2\theta(1 - \theta) + 2\epsilon_{i3}\log(1 - \theta)] + \sum_{i=m+1}^n [(\epsilon_{i1} + \epsilon_{i2})\log(1 - (1 - \theta)^2) + 2\epsilon_{i3}\log(1 - \theta)], \tag{2.4.5}$$

a function that is of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy-Weinberg canonical exponential family example. If we suppose (say) that observations $S_1, \ldots, S_m$ are not $X_i$ but $(\epsilon_{i1}, \epsilon_{i2} + \epsilon_{i3})$, then explicit solution is in general not possible. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists in both cases. □
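Because $1 - (1 - \theta)^2 = \theta(2 - \theta)$, the lumped log likelihood (2.4.5) is concave in $\theta$, so the Newton-Raphson update (2.4.3) applies to it directly. The following Python sketch illustrates this under stated assumptions: the function and argument names are hypothetical, the data enter only through counts, the derivative formulas in the comments were obtained by differentiating (2.4.5) and are not quoted from the text, and the step-halving safeguard that keeps iterates inside $(0, 1)$ is an addition for numerical safety.

```python
def lumped_hw_derivs(theta, a, c, big_m):
    """First and second derivatives in theta of the log likelihood (2.4.5).

    a     = 2*N1m + N2m                  (coefficient of log theta, fully observed part)
    c     = N2m + 2*N3m + 2*N3_lumped    (coefficient of log(1 - theta))
    big_m = number of lumped individuals with eps_i1 + eps_i2 = 1
    Uses log(1 - (1 - theta)^2) = log(theta) + log(2 - theta).
    """
    d1 = (a + big_m) / theta - c / (1 - theta) - big_m / (2 - theta)
    d2 = -(a + big_m) / theta**2 - c / (1 - theta)**2 - big_m / (2 - theta)**2
    return d1, d2

def lumped_hw_mle(n1m, n2m, n3m, m_lumped, n3_lumped,
                  theta0=0.5, tol=1e-12, max_iter=100):
    """Newton-Raphson (2.4.3) for theta from lumped Hardy-Weinberg counts.

    Assumes at least one positive count on each side so that the maximum
    is interior; this is a sketch, not a robust routine.
    """
    a = 2 * n1m + n2m
    c = n2m + 2 * n3m + 2 * n3_lumped
    theta = theta0
    for _ in range(max_iter):
        d1, d2 = lumped_hw_derivs(theta, a, c, m_lumped)
        step = -d1 / d2                      # Newton step; d2 < 0 by concavity
        while not (0 < theta + step < 1):    # added safeguard: stay inside (0, 1)
            step /= 2
        theta += step
        if abs(step) <= tol:
            break
    return theta
```

For instance, `lumped_hw_mle(2, 3, 1, 3, 1)` treats six fully observed individuals with genotype counts (2, 3, 1) and four lumped individuals, three of which fall in the merged category. Setting the first derivative to zero and clearing denominators gives a quadratic in $\theta$, which is one way to see why an explicit solution is still available in this case.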
Example 2.4.5. Mixture of Gaussians. Suppose $S_1, \ldots, S_n$ is a sample from a population $P$ whose density is modeled as a mixture of two Gaussian densities, $p(s, \theta) = (1 - \lambda)\varphi_{\sigma_1}(s - \mu_1) + \lambda\varphi_{\sigma_2}(s - \mu_2)$ where $\theta = (\lambda, (\mu_i, \sigma_i), i = 1, 2)$ and $0 < \lambda < 1$, $\sigma_1, \sigma_2 > 0$, $\mu_1, \mu_2 \in R$ and $\varphi_\sigma(s) = \frac{1}{\sigma}\varphi\left(\frac{s}{\sigma}\right)$. It is not obvious that this falls under our scheme but let

$$X_i = (\Delta_i, S_i), \quad 1 \le i \le n, \tag{2.4.6}$$

where $\Delta_i$ are independent identically distributed with $P_\theta[\Delta_i = 1] = \lambda = 1 - P_\theta[\Delta_i = 0]$. Suppose that given $\Delta = (\Delta_1, \ldots, \Delta_n)$, the $S_i$ are independent with

$$\mathcal{L}_\theta(S_i \mid \Delta) = \mathcal{L}_\theta(S_i \mid \Delta_i) = N(\Delta_i\mu_1 + (1 - \Delta_i)\mu_2,\ \Delta_i\sigma_1^2 + (1 - \Delta_i)\sigma_2^2).$$

That is, $\Delta_i$ tells us whether to sample from $N(\mu_1, \sigma_1^2)$ or $N(\mu_2, \sigma_2^2)$. It is easy to see (Problem 2.4.11), that under $\theta$, $S$ has the marginal distribution given previously. Thus, we can think of $S$ as $S(X)$ where $X$ is given by (2.4.6).

This five-parameter model is very rich, permitting up to two modes and scales. The log likelihood similarly can have a number of local maxima and can tend to $\infty$ as $\theta$ tends to the boundary of the parameter space (Problem 2.4.12). Although MLEs do not exist in these models, a local maximum close to the true $\theta_0$ turns out to be a good "proxy" for the nonexistent MLE. The EM algorithm can lead to such a local maximum. □

The EM Algorithm. Here is the algorithm. Let

$$J(\theta \mid \theta_0) \equiv E_{\theta_0}\left(\log\frac{p(X, \theta)}{p(X, \theta_0)} \,\Big|\, S(X) = s\right) \tag{2.4.7}$$

where we suppress dependence on $s$.

Initialize with $\theta_{\text{old}} = \theta_0$.

The first (E) step of the algorithm is to compute $J(\theta \mid \theta_{\text{old}})$ for as many values of $\theta$ as needed. If this is difficult, the EM algorithm is probably not suitable.

The second (M) step is to maximize $J(\theta \mid \theta_{\text{old}})$ as a function of $\theta$. Again, if this step is difficult, EM is not particularly appropriate.

Then we set $\theta_{\text{new}} = \arg\max J(\theta \mid \theta_{\text{old}})$, reset $\theta_{\text{old}} = \theta_{\text{new}}$ and repeat the process.

As we shall see, in important situations, including the examples we have given, the M step is easy and the E step doable.

The rationale behind the algorithm lies in the following formulas, which we give for $\theta$ real and which can be justified easily in the case that $\mathcal{X}$ is finite (Problem 2.4.12),

$$\frac{q(s, \theta)}{q(s, \theta_0)} = E_{\theta_0}\left(\frac{p(X, \theta)}{p(X, \theta_0)} \,\Big|\, S(X) = s\right) \tag{2.4.8}$$

and

$$\frac{\partial}{\partial\theta}\log q(s, \theta)\Big|_{\theta = \theta_0} = E_{\theta_0}\left(\frac{\partial}{\partial\theta}\log p(X, \theta) \,\Big|\, S(X) = s\right)\Big|_{\theta = \theta_0} \tag{2.4.9}$$

for all $\theta_0$ (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating and exchanging $E_{\theta_0}$ and differentiation with respect
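to $\theta$ at $\theta_0$. Because, formally,

$$\frac{\partial J(\theta \mid \theta_0)}{\partial\theta} = E_{\theta_0}\left(\frac{\partial}{\partial\theta}\log p(X, \theta) \,\Big|\, S(X) = s\right) \tag{2.4.10}$$

and, hence,

$$\frac{\partial J(\theta \mid \theta_0)}{\partial\theta}\Big|_{\theta_0} = \frac{\partial}{\partial\theta}\log q(s, \theta_0) \tag{2.4.11}$$

it follows that a fixed point $\tilde\theta$ of the algorithm satisfies the likelihood equation,

$$\frac{\partial}{\partial\theta}\log q(s, \tilde\theta) = 0. \tag{2.4.12}$$

The main reason the algorithm behaves well follows.

Lemma 2.4.1. If $\theta_{\text{new}}$, $\theta_{\text{old}}$ are as defined earlier and $S(X) = s$,

$$q(s, \theta_{\text{new}}) \ge q(s, \theta_{\text{old}}). \tag{2.4.13}$$

Equality holds in (2.4.13) iff the conditional distribution of $X$ given $S(X) = s$ is the same for $\theta_{\text{new}}$ as for $\theta_{\text{old}}$ and $\theta_{\text{old}}$ maximizes $J(\theta \mid \theta_{\text{old}})$.

Proof. We give the proof in the discrete case. However, the result holds whenever the quantities in $J(\theta \mid \theta_0)$ can be defined in a reasonable fashion. In the discrete case we appeal to the product rule. For $x \in \mathcal{X}$, $S(x) = s$,

$$p(x, \theta) = q(s, \theta)\, r(x \mid s, \theta) \tag{2.4.14}$$

where $r(\cdot \mid \cdot, \theta)$ is the conditional frequency function of $X$ given $S(X) = s$. Then

$$J(\theta \mid \theta_0) = \log\frac{q(s, \theta)}{q(s, \theta_0)} + E_{\theta_0}\left\{\log\frac{r(X \mid s, \theta)}{r(X \mid s, \theta_0)} \,\Big|\, S(X) = s\right\}. \tag{2.4.15}$$

If $\theta_0 = \theta_{\text{old}}$, $\theta = \theta_{\text{new}}$,

$$\log\frac{q(s, \theta_{\text{new}})}{q(s, \theta_{\text{old}})} = J(\theta_{\text{new}} \mid \theta_{\text{old}}) - E_{\theta_{\text{old}}}\left\{\log\frac{r(X \mid s, \theta_{\text{new}})}{r(X \mid s, \theta_{\text{old}})} \,\Big|\, S(X) = s\right\}. \tag{2.4.16}$$

Now, $J(\theta_{\text{new}} \mid \theta_{\text{old}}) \ge J(\theta_{\text{old}} \mid \theta_{\text{old}}) = 0$ by definition of $\theta_{\text{new}}$. On the other hand,

$$-E_{\theta_{\text{old}}}\left\{\log\frac{r(X \mid s, \theta_{\text{new}})}{r(X \mid s, \theta_{\text{old}})} \,\Big|\, S(X) = s\right\} \ge 0 \tag{2.4.17}$$

by Shannon's inequality, Lemma 2.2.1. □

The most important and revealing special case of this lemma follows.

Theorem 2.4.3. Suppose $\{P_\theta : \theta \in \Theta\}$ is a canonical exponential family generated by $(T, h)$ satisfying the conditions of Theorem 2.3.1. Let $S(X)$ be any statistic, then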
(a) The EM algorithm consists of the alternation

$$\dot{A}(\theta_{\text{new}}) = E_{\theta_{\text{old}}}(T(X) \mid S(X) = s) \tag{2.4.18}$$

$$\theta_{\text{old}} = \theta_{\text{new}}. \tag{2.4.19}$$

If a solution of (2.4.18) exists it is necessarily unique.

(b) If the sequence of iterates $\{\hat\theta_m\}$ so obtained is bounded and the equation

$$\dot{A}(\theta) = E_\theta(T(X) \mid S(X) = s) \tag{2.4.20}$$

has a unique solution, then it converges to a limit $\hat\theta^*$, which is necessarily a local maximum of $q(s, \theta)$.

Proof. In this case,

$$J(\theta \mid \theta_0) = E_{\theta_0}\{(\theta - \theta_0)^T T(X) - (A(\theta) - A(\theta_0)) \mid S(X) = s\} = (\theta - \theta_0)^T E_{\theta_0}(T(X) \mid S(X) = s) - (A(\theta) - A(\theta_0)). \tag{2.4.21}$$

Part (a) follows. Part (b) is more difficult. A proof due to Wu (1983) is sketched in Problem 2.4.16. □

Example 2.4.4 (continued). $X$ is distributed according to the exponential family

$$p(x, \theta) = \exp\{\eta(2N_{1n}(x) + N_{2n}(x)) - A(\eta)\}h(x) \tag{2.4.22}$$

where

$$\eta = \log\left(\frac{\theta}{1 - \theta}\right), \quad h(x) = 2^{N_{2n}(x)}, \quad A(\eta) = 2n\log(1 + e^\eta)$$

and $N_{jn} = \sum_{i=1}^n \epsilon_{ij}(x_i)$, $1 \le j \le 3$. Now,

$$A'(\eta) = 2n\theta \tag{2.4.23}$$

$$E_\theta(2N_{1n} + N_{2n} \mid S) = 2N_{1m} + N_{2m} + E_\theta\left(\sum_{i=m+1}^n (2\epsilon_{i1} + \epsilon_{i2}) \,\Big|\, \epsilon_{i1} + \epsilon_{i2},\ m + 1 \le i \le n\right). \tag{2.4.24}$$

Under the assumption that the process that causes lumping is independent of the values of the $\epsilon_{ij}$,

$$P_\theta[\epsilon_{ij} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 0] = 0, \quad 1 \le j \le 2,$$

$$P_\theta[\epsilon_{i1} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 1] = \frac{\theta^2}{\theta^2 + 2\theta(1 - \theta)} = \frac{\theta^2}{1 - (1 - \theta)^2} = 1 - P_\theta[\epsilon_{i2} = 1 \mid \epsilon_{i1} + \epsilon_{i2} = 1].$$

Thus, we see, after some simplification, that

$$E_\theta(2N_{1n} + N_{2n} \mid S) = 2N_{1m} + N_{2m} + \frac{2}{2 - \theta}M_n \tag{2.4.25}$$
where

$$M_n = \sum_{i=m+1}^n (\epsilon_{i1} + \epsilon_{i2}).$$

Thus, the EM iteration is

$$\hat\theta_{\text{new}} = \frac{2N_{1m} + N_{2m}}{2n} + \frac{M_n}{n}(2 - \hat\theta_{\text{old}})^{-1}. \tag{2.4.26}$$

It may be shown directly (Problem 2.4.12) that if $2N_{1m} + N_{2m} > 0$ and $M_n > 0$, then $\hat\theta_m$ converges to the unique root of

$$\theta^2 - \left(2 + \frac{2N_{1m} + N_{2m}}{2n}\right)\theta + \frac{2N_{1m} + N_{2m} + M_n}{n} = 0$$

in $(0, 1)$, which is indeed the MLE when $S$ is observed. □

Example 2.4.6. Let $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ be i.i.d. as $(Z, Y)$, where $(Z, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Suppose that some of the $Z_i$ and some of the $Y_i$ are missing as follows: For $1 \le i \le n_1$ we observe both $Z_i$ and $Y_i$, for $n_1 + 1 \le i \le n_2$, we observe only $Z_i$, and for $n_2 + 1 \le i \le n$, we observe only $Y_i$. In this case a set of sufficient statistics is

$$T_1 = \bar{Z}, \quad T_2 = \bar{Y}, \quad T_3 = n^{-1}\sum_{i=1}^n Z_i^2, \quad T_4 = n^{-1}\sum_{i=1}^n Y_i^2, \quad T_5 = n^{-1}\sum_{i=1}^n Z_i Y_i.$$

The observed data are

$$S = \{(Z_i, Y_i) : 1 \le i \le n_1\} \cup \{Z_i : n_1 + 1 \le i \le n_2\} \cup \{Y_i : n_2 + 1 \le i \le n\}.$$

To compute $E_\theta(T \mid S = s)$, where $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, we note that for the cases with $Z_i$ and/or $Y_i$ observed, the conditional expected values equal their observed values. For other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4), to conclude

$$E_\theta(Y_i \mid Z_i) = \mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1$$

$$E_\theta(Y_i^2 \mid Z_i) = [\mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1]^2 + (1 - \rho^2)\sigma_2^2$$

$$E_\theta(Z_i Y_i \mid Z_i) = [\mu_2 + \rho\sigma_2(Z_i - \mu_1)/\sigma_1]Z_i$$

with the corresponding $Z$ on $Y$ regression equations when conditioning on $Y_i$ (Problem 2.4.1). This completes the E-step. For the M-step, compute (Problem 2.4.1)

$$\dot{A}(\theta) = E_\theta T = (\mu_1,\ \mu_2,\ \sigma_1^2 + \mu_1^2,\ \sigma_2^2 + \mu_2^2,\ \sigma_1\sigma_2\rho + \mu_1\mu_2).$$

We take $\hat\theta_{\text{old}} = \hat\theta_{\text{MOM}}$, where $\hat\theta_{\text{MOM}}$ is the method of moment estimates $(\hat\mu_1, \hat\mu_2, \hat\sigma_1^2, \hat\sigma_2^2, r)$ (Problem 2.1.8) of $\theta$ based on the observed data, and find (Problem 2.4.1) that the M-step produces

$$\hat\mu_{1,\text{new}} = T_1(\hat\theta_{\text{old}}), \quad \hat\mu_{2,\text{new}} = T_2(\hat\theta_{\text{old}}), \quad \hat\sigma_{1,\text{new}}^2 = T_3(\hat\theta_{\text{old}}) - \hat{T}_1^2, \quad \hat\sigma_{2,\text{new}}^2 = T_4(\hat\theta_{\text{old}}) - \hat{T}_2^2,$$

$$\hat\rho_{\text{new}} = \frac{T_5(\hat\theta_{\text{old}}) - \hat{T}_1\hat{T}_2}{\{[T_3(\hat\theta_{\text{old}}) - \hat{T}_1^2][T_4(\hat\theta_{\text{old}}) - \hat{T}_2^2]\}^{1/2}} \tag{2.4.27}$$

where $T_j(\theta)$ denotes $T_j$ with missing values replaced by the values computed in the E-step and $\hat{T}_j = T_j(\hat\theta_{\text{old}})$, $j = 1, 2$. Now the process is repeated with $\hat\theta_{\text{MOM}}$ replaced by $\hat\theta_{\text{new}}$. □

Because the E-step, in the context of Example 2.4.6, involves imputing missing values, the EM algorithm is often called multiple imputation.

Remark 2.4.1. Note that if $S(X) = X$, then $J(\theta \mid \theta_0)$ is $\log[p(X, \theta)/p(X, \theta_0)]$, which as a function of $\theta$ is maximized where the contrast $-\log p(X, \theta)$ is minimized. Also note that, in general, $-E_{\theta_0}[J(\theta \mid \theta_0)]$ is the Kullback-Leibler divergence (2.2.23).

Summary. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with $\mathcal{E}$ open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in $k$-parameter canonical exponential families with $\mathcal{E}$ open when it exists. Important variants of and alternatives to this algorithm, including the Newton-Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.

2.5 PROBLEMS AND COMPLEMENTS

Problems for Section 2.1

1. Consider a population made up of three different types of individuals occurring in the Hardy-Weinberg proportions $\theta^2$, $2\theta(1 - \theta)$ and $(1 - \theta)^2$, respectively, where $0 < \theta < 1$.

(a) Show that $T_3 = N_1/n + N_2/2n$ is a frequency substitution estimate of $\theta$.

(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio $\theta/(1 - \theta)$?

(c) Suppose $X$ takes the values $-1, 0, 1$ with respective probabilities $p_1, p_2, p_3$ given by the Hardy-Weinberg proportions. By considering the first moment of $X$, show that $T_3$ is a method of moment estimate of $\theta$.

2. Consider $n$ systems with failure times $X_1, \ldots, X_n$ assumed to be independent and identically distributed with exponential, $\mathcal{E}(\lambda)$, distributions.

(a) Find the method of moments estimate of $\lambda$ based on the first moment.

(b) Find the method of moments estimate of $\lambda$ based on the second moment.

(c) Combine your answers to (a) and (b) to get a method of moment estimate of $\lambda$ based on the first two moments.

(d) Find the method of moments estimate of the probability $P(X_1 \ge 1)$ that one system will last at least a month.

3. Suppose that i.i.d. $X_1, \ldots, X_n$ have a beta, $\beta(\alpha_1, \alpha_2)$ distribution. Find the method of moments estimates of $\alpha = (\alpha_1, \alpha_2)$ based on the first two moments.
Hint: See Problem B.2.5.

4. Let $X_1, \ldots, X_n$ be a sample from a population with distribution function $F$ and frequency function or density $p$. The empirical distribution function $\hat{F}$ is defined by $\hat{F}(x) = [\text{No. of } X_i \le x]/n$. If $q(\theta)$ can be written in the form $q(\theta) = s(F)$ for some function $s$ of $F$ we define the empirical substitution principle estimate of $q(\theta)$ to be $s(\hat{F})$.

(a) Show that in the finite discrete case, empirical substitution estimates coincide with frequency substitution estimates.

Hint: Express $F$ in terms of $p$ and $\hat{F}$ in terms of

$$\hat{p}(x) = \frac{\text{No. of } X_i = x}{n}.$$

(b) Show that in the continuous case $X \sim \hat{F}$ means that $X = X_i$ with probability $1/n$.

(c) Show that the empirical substitution estimate of the $j$th moment $\mu_j$ is the $j$th sample moment $\hat\mu_j$.

Hint: Write $m_j = \int_{-\infty}^{\infty} x^j\,dF(x)$ or $m_j = E_F(X^j)$ where $X \sim F$.

(d) For $t_1 < \cdots < t_k$, find the joint frequency function of $\hat{F}(t_1), \ldots, \hat{F}(t_k)$.

Hint: Consider $(N_1, \ldots, N_{k+1})$ where $N_1 = n\hat{F}(t_1)$, $N_2 = n(\hat{F}(t_2) - \hat{F}(t_1))$, $\ldots$, $N_{k+1} = n(1 - \hat{F}(t_k))$.

5. Let $X_{(1)} \le \cdots \le X_{(n)}$ be the order statistics of a sample $X_1, \ldots, X_n$. (See Problem B.2.8.) There is a one-to-one correspondence between the empirical distribution function $\hat{F}$ and the order statistics in the sense that, given the order statistics we may construct $\hat{F}$ and given $\hat{F}$, we know the order statistics. Give the details of this correspondence.

6. The $j$th cumulant $\hat{c}_j$ of the empirical distribution function is called the $j$th sample cumulant and is a method of moments estimate of the cumulant $c_j$. Give the first three sample cumulants. See A.12.

7. Let $X_1, \ldots, X_n$ be the indicators of $n$ Bernoulli trials with probability of success $\theta$.

(a) Show that $\bar{X}$ is a method of moments estimate of $\theta$.

(b) Exhibit method of moments estimates for $\text{Var}_\theta \bar{X} = \theta(1 - \theta)/n$ first using only the first moment and then using only the second moment of the population. Show that these estimates coincide.

(c) Argue that in this case all frequency substitution estimates of $q(\theta)$ must agree with $q(\bar{X})$.

8. Let $(Z_1, Y_1), (Z_2, Y_2), \ldots, (Z_n, Y_n)$ be a set of independent and identically distributed random vectors with common distribution function $F$. The natural estimate of $F(s, t)$ is the bivariate empirical distribution function $\hat{F}(s, t)$, which we define by

$$\hat{F}(s, t) = \frac{\text{Number of vectors } (Z_i, Y_i) \text{ such that } Z_i \le s \text{ and } Y_i \le t}{n}.$$
. 11. with (J identifiable. I) = ". . (See Problem 2. Suppose X = (X"".) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. the result follows. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX. ~ i . find the method of moments estimate based on i 12.ZY. (J E R d.j2. There exists a compact set K such that for f3 in the complement of K. X n be LLd. the sample covariance. p(X..Yn . ..4.Y) = Z)(Yk . z) is continuous in {3 and that Ig({3. In Example 2.2 with X iiI and /is.1.1. . f(a. . suppose that g({3. {3) is continuous on K."ZkYk . z) I tends to 00 as \. .17. I .Xn ) where the X. Y are the sample means of the sample correlation coefficient is given by Z11 •..IU9) that 1 < T < L ... Zn and Y11 . In Example 2. . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . the sampIe correlation. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. and so on. {3) > c.81 tends to 00. ... as the corresponding characteristics of the distribution F. Show that the least squares estimate exists.L. Hint: See Problem B.). 0). . .. = #. 10. Vi). . 1". Hint: Set c = p(X.!'r«(J)) I . " . are independent N"(O.j) is given by ~ The sample covariance is given by n l n L (Zk k=l .. (b) Construct an estimate of a using the estimate of part (a) and the equation a . L I. 9. .. k=l n  n . Since p(X.140 Methods of Estimation Chapter 2 (a) Show that F(. .' where Z. : J • .) Note that it follows from (A. (b) Define the sample product moment of order (i.2).".2.. Let X". Suppose X has possible values VI. Show that the sample product moment of order (i. as X ~ P(J. (a) Find an estimate of a 2 based on the second mOment. >.• . j). respectively.
rep. (a) Use E(X.l Lgj(Xi ). > 0. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.36. to give a method of moments estimate of (72.0) (ii) Beta.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill .10).1) (iii) Raleigh. . p fixed (v) Inverse Gaussian. .. 1 < j < k.1. = h(PI(O).. . A). ~ (X.Pr(O)) . When the data are not i.0 > 0 14. where U.O) ~ (x/O')exp(x'/20'). X n are i. Hint: Use Corollary 1.. 0).d. 1'(1. (b) Suppose P ~ po and I' = b are fixed. In the fOllowing cases.p:::'.:::nt:::' ~__ 141 for some R k valued function h. in estimate.6. (c) If J. Suppose that X has possible values VI. j i=l = 1..  1/' Po) / . 13.Section 2. fir) can be written as a frequency plugin estimate.. General method of moment estimates(!)..:::m:::. r..d. .6.. can you give a method of moments estimate of {3? .i. . Suppose X 1.6. Show that the method of moments estimate q = h(j1!.. .• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.. . . .. as X rv P(J. Vk and that q(O) for some Rkvalued function h. 1'(0.5. Let g. Consider the Gaussian AR(I) model of Example 1. See Problem 1. .9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X))..p(x.1.. Let 911 . Use E(U[).l and (72 are fixed. A).x (iv) Gamma.. with (J E R d and (J identifiable.5 Problems and Com". find the method of moments estimates (i) Beta.) to give a method of moments estimate of p. IG(p.(X) ~ Tj(X). 0 ~ (p.i. . Iii = n.
.. Establish (2..... b. . .S 1 . _ (Z)' ' a. the method of moments l by replacing mjkrs by mjkrs.. we define the empirical or sample moment to be j ~ ... ... . The reading Yi . P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ.)] + [g(l3o..) ..g(I3. For a vector X = (Xl.ZY ~ ..... . t n on the position of the object. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z. Show that <j < i ! .8.z."" X iq ). of observations.1".J is' _ n n..)] = [Y. SF. I.142 Methods of Estimation Chapter 2 15.. 5 SF 28. .  t=1 liB = (0 1 . ' l X q ). 1 q. respectively.q. II. 1". 83 6 IF 28. i = 1. For independent identically distributed Xi = (XiI.1 " Xir Xk J... Multivariate method a/moments..)]. 17. n 1  Problems for Sectinn 2. = n 1 L:Z..bIZ. 1 . and F. n. I I . I .~b. Readings Y1 .• l Yn are taken at times t 1..... let the moments be mjkrs = E(XtX~). .. k > 0. Y) and B = (ai. . and F at one locus resulting in six genotypes labeled SS. .. I.. ()2.. HardyWeinberg with six genotypes. FF.. = Y .... >. k> Q mjkrs L... and (J3.. ..6).z.. Let X = (Z... 1 I . ..  1.. S = 1. Y) and (ai. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28. + '2P4 + 2PS . 11P2 + "2P4 + '2P6 .g(I3... and IF. 16. Hint: [y.z.. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S.... Let 8" 8" and 83 denote the probabilities of S.. I .4. ). where OJ = 1. .! . 1 6 and let Pl ~ N j / n.1... . bI) are as in Theorem 1...•• Om) can be expressed as a function of the moments. I • ...2 1. SI.. j > 0.83 8t Let N j be the number of plants of genotype j in a sample of n independent plants.. An Object of unit mass is placed in a force field of unknown constant intensity 8.._ ..g(l3o..z. where (Z..Y.3. Q.
(a) Let Y I . . if we consider the distribution assigning mass l/n to each of the points (zJ.t of the best (MSPE) predictor of Yn + I ? 4. 2 10.. all lie on aline..4.. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi. Y.. 9. Find the least squares estimate of 0:. (J2) variables. The regression line minimizes the sum of the squared vertical distances from the points (Zl. the range {g(zr. 13). Find the MLE of 8. . Let X I.. and 13 ranges over R d 8. . Zn and that the linear regression model holds.4)(2. Yn). .We suppose the oand be uncorrelated with constant variance. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. X n denote a sample from a population with one of the following densities Or frequency functions. (exponential density) (b) f(x. Hint: Write the lines in the fonn (z.5) provided that 9 is differentiable with respect to (3" 1 < i < d.B.. Find the line that minimizes the sum of the squared perpendicular distance to the same points. .. i = 1. to have mean 2. . .Yn). i + 1. Find the least squares estimates of ()l and B . x > 0.2.n. 7. x > c. (Pareto density) . nl and Yi = 82 + ICi. (a) f(x. 3. ~p ~(yj}) ~ . nl + n2. B > O. g(zn. (zn. . Show that the least squares estimate is always defined and satisfies the equations (2. B) = Be'x. B.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l .2.. c cOnstant> 0. What is the least squares estimate based on YI . 13). .." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.6) under the restrictions BJ > 0.<) ~ . i = 1. Find the least squares estimates for the model Yi = 8 1 (2. A new observation Ynl. . Yn have been taken at times Zl.. B > O.z. ..y.(zn. Yl). . . . Hint: The quantity to be minimized is I:7J (y. in fact. Show that the fonnulae of Example 2..BJ . 
Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants....3. Suppose Yi ICl".. Suppose that observations YI ..13 E R d } is closed.4. < O. . " 7 5.1. ..Section 2. where €nlln2 are independent N(O.2... . . . Find the LSE of 8. .2 may be derived from Theorem 1. .). .Yi). B) = Bc'x('+JJ.)' 1+ B5 6.I is to be taken at time Zn+l.. . _.
(c) $f(x, \theta) = c\theta^{c} x^{-(c+1)}$, $x \ge \theta$; $c$ constant $> 0$; $\theta > 0$. (Pareto density)

(d) $f(x, \theta) = \sqrt{\theta}\, x^{\sqrt{\theta} - 1}$, $0 \le x \le 1$, $\theta > 0$. (beta, $\beta(\sqrt{\theta}, 1)$, density)

(e) $f(x, \theta) = (x/\theta^{2}) \exp\{-x^{2}/2\theta^{2}\}$, $x > 0$; $\theta > 0$. (Rayleigh density)

(f) $f(x, \theta) = \theta c x^{c-1} \exp\{-\theta x^{c}\}$, $x \ge 0$; $c$ constant $> 0$; $\theta > 0$. (Weibull density)

11. Suppose that $X_1, \ldots, X_n$, $n \ge 2$, is a sample from a $N(\mu, \sigma^2)$ distribution.

(a) Show that if $\mu$ and $\sigma^2$ are unknown, $\mu \in R$, $\sigma^2 > 0$, then the unique MLEs are $\hat\mu = \bar X$ and $\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} (X_i - \bar X)^2$.

(b) Suppose $\mu$ and $\sigma^2$ are both known to be nonnegative but otherwise unspecified. Find maximum likelihood estimates of $\mu$ and $\sigma^2$.

12. Let $X_1, \ldots, X_n$, $n \ge 2$, be independently and identically distributed with density

$$f(x, \theta) = \frac{1}{\sigma} \exp\{-(x - \mu)/\sigma\}, \quad x \ge \mu,$$

where $\theta = (\mu, \sigma^2)$, $-\infty < \mu < \infty$, $\sigma^2 > 0$.

(a) Find maximum likelihood estimates of $\mu$ and $\sigma^2$.

(b) Find the maximum likelihood estimate of $P_\theta[X_1 \ge t]$ for $t > \mu$.

Hint: You may use Problem 2.2.16(b).

13. Let $X_1, \ldots, X_n$ be a sample from a $U[\theta - \frac{1}{2}, \theta + \frac{1}{2}]$ distribution. Show that any $T$ such that $X_{(n)} - \frac{1}{2} \le T \le X_{(1)} + \frac{1}{2}$ is a maximum likelihood estimate of $\theta$. (We write $U[a, b]$ to make $p(a) = p(b) = (b - a)^{-1}$ rather than 0.)

14. If $n = 1$ in Example 2.1.5 show that no maximum likelihood estimate of $\theta = (\mu, \sigma^2)$ exists.

15. Suppose that $T(X)$ is sufficient for $\theta$ and that $\hat\theta(X)$ is an MLE of $\theta$. Show that $\hat\theta$ depends on $X$ through $T(X)$ only provided that $\hat\theta$ is unique.

Hint: Use the factorization theorem (Theorem 1.5.1).

16. (a) Let $X \sim P_\theta$, $\theta \in \Theta$, and let $\hat\theta$ denote the MLE of $\theta$. Suppose that $h$ is a one-to-one function from $\Theta$ onto $h(\Theta)$. Define $\eta = h(\theta)$ and let $f(x, \eta)$ denote the density or frequency function of $X$ in terms of $\eta$ (i.e., reparametrize the model using $\eta$). Show that the MLE of $\eta$ is $h(\hat\theta)$ (i.e., MLEs are unaffected by reparametrization; they are equivariant under one-to-one transformations).

(b) Let $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^{p}$, $p \ge 1$, be a family of models for $X \in \mathcal{X} \subset R^{d}$. Let $q$ be a map from $\Theta$ onto $\Omega$, $\Omega \subset R^{k}$, $1 \le k \le p$. Show that if $\hat\theta$ is a MLE of $\theta$, then $q(\hat\theta)$ is an MLE of $\omega = q(\theta)$.

Hint: Let $\Theta(\omega) = \{\theta \in \Theta : q(\theta) = \omega\}$; then $\{\Theta(\omega) : \omega \in \Omega\}$ is a partition of $\Theta$, and $\hat\theta$ belongs to only one member of this partition, say $\Theta(\hat\omega)$. Because $q$ is onto $\Omega$, for each $\omega \in \Omega$ there is $\theta \in \Theta$ such that $\omega = q(\theta)$. Thus, the MLE of $\omega$ is by definition

$$\hat\omega_{MLE} = \arg\sup_{\omega \in \Omega} \sup\{L_X(\theta) : \theta \in \Theta(\omega)\}.$$
Now show that $\hat\omega_{MLE} = \hat\omega = q(\hat\theta)$.

17. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time $X$ to failure of an item is

$$P_\theta[X = k] = \theta^{k-1}(1 - \theta), \quad k = 1, 2, \ldots$$

where $0 < \theta < 1$. Suppose that we only record the time of failure, if failure occurs on or before time $r$, and otherwise just note that the item has lived at least $(r + 1)$ periods. Thus, we observe $Y_1, \ldots, Y_n$ which are independent, identically distributed, and have common frequency function,

$$f(k, \theta) = \theta^{k-1}(1 - \theta), \quad k = 1, \ldots, r,$$

$$f(r + 1, \theta) = 1 - P_\theta[X \le r] = 1 - \sum_{k=1}^{r} \theta^{k-1}(1 - \theta) = \theta^{r}.$$

(We denote by "$r + 1$" survival for at least $(r + 1)$ periods.) Let $M$ = number of indices $i$ such that $Y_i = r + 1$. Show that the maximum likelihood estimate of $\theta$ based on $Y_1, \ldots, Y_n$ is

$$\hat\theta(Y) = \frac{\sum_{i=1}^{n} Y_i - n}{\sum_{i=1}^{n} Y_i - M}.$$

18. Derive maximum likelihood estimates in the following models.

(a) The observations are indicators of Bernoulli trials with probability of success $\theta$. We want to estimate $\theta$ and $\mathrm{Var}_\theta X_1 = \theta(1 - \theta)$.

(b) The observations are $X_1$ = the number of failures before the first success, $X_2$ = the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success $\theta$. We want to estimate $\theta$.

19. Let $X_1, \ldots, X_n$ be independently distributed with $X_i$ having a $N(\theta_i, 1)$ distribution, $1 \le i \le n$.

(a) Find maximum likelihood estimates of the $\theta_i$ under the assumption that these quantities vary freely.

(b) Solve the problem of part (a) for $n = 2$ when it is known that $\theta_1 \le \theta_2$. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).

20. In the "life testing" problem 1.6.16(i), find the MLE of $\theta$.

21. (Kiefer-Wolfowitz) Suppose $(X_1, \ldots, X_n)$ is a sample from a population with density

$$f(x, \theta) = \frac{9}{10\sigma}\, \varphi\!\left(\frac{x - \mu}{\sigma}\right) + \frac{1}{10}\, \varphi(x - \mu)$$

where $\varphi$ is the standard normal density and $\theta = (\mu, \sigma^2) \in \Theta = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ 0 < \sigma^2 < \infty\}$. Show that maximum likelihood estimates do not exist, but
X)' + L(Y. distribution. . (1') populations..a 2 ) = Sllp~. Assume that Xi =I=.6). 23. J2 J2 o o o I 3.I' fl.1. " Yn be two independent samples from N(J. Xl.5 66... Suppose Y.5 I' I I .' wherej E J andJisasubsetof {UI .'.a 2 ) and NU". N. the following data were obtained (from S.0 5.Xm and YI . respectively.0"2) if.tl.8 0. x) as a function of b.ji. Y.2 4. Show that the MLE of = (fl.< J.Y)' i=1 j=1 /(rn + n).146 Methods of Estimation Chapter 2 that snp. . TABLE 2.t.9 3.2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3.2.z!.4 o o o o o o o o o o o o o Life 20.jp): 0 <j. (1') is If = (X. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise. . where [t] is the largest integer that is < t. Hint: Coosider the mtio L(b + 1.6. 7 . 1 <k< p}.5 86. . . Suppose X has a hypergeometric.2..p(x.Xj for i =F j and that n > 2. Tool life data Peed 1 Speed 1 1 Life 54..0 2. and only if.. II I I :' 24.5 0.x n .0 I . and assume that zt'·· . Ii equals one of the numbers 22.8 2.(Zi) + 'i.1 2. 1i(b.2 3. Polynomial Regression.8 14. n). . Let Xl. Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer.up(X.4)(2.0 11. 1 1 1 o o . 1985).. where'i satisfy (2. Weisberg.0 0.. . (1') where e n ii' = L(Xi . .J. Set zl ~ . = fl. i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).2 4. x)/ L(b.8 3.
1 (see (B.6) with g({3. of Example 2.y)!(z.l be a square root matrix of W. .Y = Z D{3+€ satisfy the linear regression mooel (2. Consider the model Y = Z Df3 + € where € has covariance matrix (12 W. the second model provides a better approximation. (2.z. Show that (3.3. Y')/Var Z' and pI(P) = E(Y*) 11. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. . That is. = (cutting speed . y).5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life. Consider the model (2.Y') have density v(z.. l/Var(Y I Z = z). This will be discussed in Volume II. y) > 0 be a weight funciton such that E(v(Z. in Problem 2. Let (Z. E{v(Z. z.2. Y)Z') and E(v(Z.2. However. let v(z.1.4)(2. (2.1)..2.(P) and p..19). Y)[Y . Y) have joint probability P with joint density ! (z.2. z) ~ Z D{3.6)). this has to be balanced against greater variability in the estimated coefficients. Y)Y') are finite.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models. Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.1).8 and let v(z.ZD is of rank d.6) with g({3.y)/c where c = I Iv(z. (a) Show tha!. 28.2. Show that the follow ZDf3 is identifiable. (b) Let P be the empirical probability . (c) Zj. The best linear weighted mean squared prediction error predictor PI (P) + p.6.1.(b l + b. Being larger.2.(P)E(Z·). weighted least squares estimates are plugin estimates. Derive the weighted least squares nonnal equations (2.2. (12 unknown. ZI ~ (feed rate . Let W. .. 25.(P) coincide with PI and 13.(P)Z of Y is defined as the minimizer of t = zT {3.y)dzdy. (a) The parameterization f3 (b) ZD is of rank d.. Both of these models are approximations to the true mechanism generating the data. .y)!(z. Show that p.y) defined . . (a) Let (Z'.Z)]'). Set Y = Wly. Two models are contemplated (a) Y ~ 130 + pIZI + p.Section 2. 26.900)/300.5) for (a) and (b).13)/6. 
Use these estimated coefficients to compute the values of the contrast function (2.4)(2. ZD = Wz ZD and € = WZ€. ~.2.  27.(P) = Cov(Z'. z) ing are equivalent.
.476 3..054 1.301 2.114 1. 2. i = 1. . where 1'.~=l Z. Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore.5. Then = n2 = .300 6. (See Problem 2. Yn are independent with Yi unifonnly distributed on [/1i . which vanishes if 8 j = 0 for any j = q + 1.20). . ~ ~ ~Show that the MLE of OJ is 0 with OJ = nj In..870 30. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay.379 2.480 5.971 0. with mean zero and variance a 2 • The ei are called moving average errors. . .397 4.689 4.. .038 3. 1'. Let ei = (€i + {HI )/2.391 0.8.858 3. Show that the MLE of . = (Y  29.).834 3..) ~ ~ (I' + 1'.1. nq+1 > p(x.229 4. That is. k.nk > 0. in this model the optimal " MSPE predictor of the future Yi+ 1 given the past YI .968 2.058 3.1..856 2. _'.075 4..155 4..968 9.453 1.723 1.) (c) Find a matrix A such that enxl = Anx(n+l)ECn+l)xl' (d) Find the covariance matrix W of e. Hint: Suppose without loss of generality that ni 0. .788 2. where El" .+l I Y .148 ~ Methods of Estimation Chapter 2 (b) Show that if Z  D has rank d.6) is given by (2.393 3.511 3.182 0.611 4.064 5.ZD.046 2.100 3..ZD. . .131 5..019 6.274 5. .908 1. In the multinomial Example 2. 31.jf3j for given covariate values {z.196 2.097 1. Suppose YI .860 5.6)T(y .457 2.6) T 1 (Y .2.665 2.071 0. i = 1.392 4.. .093 5..918 2..€n+l are i. Chon) shonld be read row by row.455 9. different from Y? I I TABLE 2.091 3.020 8.a. n. Consider the model Yi = 11 + ei. . (a) Show that E(1'.093 1. .. . <J > 0.564 1.6) W. j = 1..ZD.d.716 7. = L.703 4. 0) = II j=q+l k 0. k. Is ji.2. Use a weighted least squares c~mputer routine to compute the weighted least squares estimate Ii of /1.599 0.921 2. ' I (f) The following data give the elapsed times YI .j}.i. 4 (b) Show that Y is a multivariate method of moments estimate of p.053 4. then the  f3 that minimizes ZD. \ (e) Find the weighted least squares estimate of p. n..669 7. /1i + a}. 
= nq = 0..(Y .676 5.582 2.17.360 1. .916 6..156 5.'. suppose some of the nj are zero.081 2.249 1. J. I Yi is (/1 + Vi).958 10. .039 9.666 3... The data (courtesy S.
Yn are independent with Vi having the Laplace density 1 2"exp {[Y. be the Cauchy density. x E R. Let g(x) ~ 1/". ~ f3I./Jr ~ ~ and iii. .32(b).l1. Let B = arg max Lx (8) be "the" MLE..O) Hint: See Problem 2. 8 E R. .. An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass . be i.6. Suppose YI . If n is even. Find the MLE of A and give its mean and variance.Section 2.:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x..Jitl. as (Z.i. + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x .. .5 Problems and Complements 149 (f3I' . .8).{3p. Let x. 35.. the sample median yis defined as ~ [Y(. Give the MLE when .17). These least absolute deviation estimates (LADEs). y)T where Y = Z + v"XW.fin are called (b) If n is odd...d. .I'...). Let X... . XHL i& at each point x'.& at each Xi. Y( n) denotes YI . be the observations and set II = ~ (Xl . Hint: Use Problem 1.. Yn ordered from smallest to largest. a)T is obtained by finding 131. . let Xl and X. .. Z and W are independent N(O.4. ~ 33. and x.J. where Jii = L:J=l Zij{3j.x. Show that XHL is a plugin estimate of BHL. Hint: See Example 1.  Illi < 1. (a) Show Ihat if Ill[ < 1.3. with density g(x ./3p that minimizes the maximum absolute value conrrast function maxi IYi . the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l). where Iii = ~ L:j:=1 ZtjPj· 32. a) is obtained by finding (31.. " . . be i. (See (2..tLil and then setting a = max t IYi . = I' for each i.(1 + x'j. (a) Show that the MLE of ((31.i...>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. .. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj).Ili I and then setting (j = n~I L~ I IYi .d.28Id(F. i < j.{:i. . Show that the sample median ii is the minimizer of L~ 1 IYi . .I/a}.) Suppose 1'. .7 with Y having the empirical distribution F. (3p. 
.d.(3p that minimizes the least absolute deviation contrast function L~ I IYi .2... . F)(x) *F denotes convolution. 1). then the MLE exists and is unique. .. A > 0.) + Y(r+l)] where r = ~n. 34.1. = L Ix. . .
1 n + m.B )'/ (7'. X.+~l Vi and 8 2 = E. ryo) and p' (x. Let Xl and X2 be the observed values of Xl and X z and write j. 1985). Find the MLEs of Al and A2 hased on 5.B I ) (K'(ryo. (a.+~l Wj have P(rnAl) and P(mA2) distributions. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). j = n + I.i.150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. N(B. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x . 51. Show that the entropy of p(x. = I ! I!: Ii. Define ry = h(B) and let p' (x. . n. then B is not unique..0). B ) and p( X. Suppose h is a II function from 8 Onto = h(8). . Also assume that S. where Al +. Let Xi denote the number of hits at a certain Web site on day i. . .b)... Let 9 be a probability density on R satisfying the following three conditions: I. then the MLE is not unique. and 5. and positive everywhere.B) g(x + t> .Xn are i. Let (XI.f)) in the likelihood equation. where x E Rand (} E R. 37.. 9 is twice continuously differentiable everywhere except perhaps at O. Ii I. B) is and that the KullbackLiebler divergence between p(x. 3. Show that (a) The likelihood is symmetric about ~ x. Sl. BI ) (p' (x. and 8 2 are independent. ~ (d) Use (c) to show that if t> E (a. 'I i 39. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. o Ii !n 'I . . i = 1..B)g(x.d. B) denote their joint density. b). Suppose X I. . 2..h(y .j'.B)g(i . .. symmetric about 0.h(y) .. P(nA). .t> . B) and pix. B) = g(xB). 38. ~ (XI + x. ryl Show that o n ».ryt» denote the KullbackLeibler divergence between p( x. . If we write h = logg. distribution. 36. a < b. > h(y) . that 8 1 E. Assume that 5 = L:~ I Xi has a Poisson. Bo) is tn( B. Let K(Bo. h I (ry») denote the density or frequency function of X for the ry parametrization. .. 9 is continuous. The likelihood function is given by Lx(B) g(xI . (7') and let p(x.\2 = A. ~ Let B ~ arg max Lx (B) be "the" MLE. 
(b) Either () = x or () is not unique.x2)/2. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . ry) = p(x. Let Vj and W j denote the number of hits of type 1 and 2 on day j.B).)/2 and t> = (XI . such that for every y E (a. . Assume :1. . then h"(y) > 0 for some nonzero y.) be a random samplefrom the distribution with density f(x... Let X ~ P" B E 8.
)/o]}" (b) Suppose we have observations (t I .. I' E R. .olog{1 + exp[[3(t.1. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism. Let Xl. T.) Show that statistics. 41..Section 2. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. > OJ and T. where 0 (a) Show that a solution to this equation is of the form y (a. and /1. a. and 1'. exp{x/O. n where A = (a.}.7) for a. Give the least squares where El" •. .. 1'. . a get. For the ca.1.O) ~ a {l+exp[[3(tJ1. Aj) + Ei. i = 1. 1 X n satisfy the autoregressive model of Example 1.. . An example of a neural net model is Vi = L j=l p h(Zij. (. + C.~t square estimating equations (2. 1959. j 1 1 cxp{ x/Oj}...0). < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case. . . [3. = log a .I[X.2. yd 1 ••• .c n are uncorrelated with mean zero and variance (72.5 Problems and Complements 151 40. a > 0.7) for estimating a.. Suppose Xl. X n be a sample from the generalized Laplace distribution with density 1 O + 0. 1 En are uncorrelated with mean 0 and variance estimating equations (2.j ~ x <0 1.~e p = 1. .  J1. x > 0. n.I[X.. 0). . .1'. 1).p. . 0 > O.5. (tn. Yn). where OJ > O. [3.)/o)} (72. give the lea. = LX. on a population of a large number of organisms. j = 1. . o + 0. and fj. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. [3. . [3. i = 1. 1'). A) = g(z. /3. 6. n > 4. . = L X. Seber and Wild. Variation in the population is modeled on the log scale by using the model logY. h(z.. .1. and g(t.. 42. [3 > 0.
.0 . Hint: I . + C1 > 0).i.d. find the covariance matrix W of the vector € = (€1. Yn) is not a sequence of 1's followed by all O's or the reverse..y. = OJ. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p. 0 for Xi < _£I... Hint: Let C = 1(0). < .. n i=l n i=l n i=1 n i=l Cl LYi + C. Suppose Y I .. . + 8. Consider the HardyWeinberg model with the six genotypes given in Problem 2.1. • II ~.5). (X. 3.3. I . t·.) 1 II(Z'i /1)2 (b) If j3 is known.)... .'" £n)T of autoregression errors.15. '12 = A and where r denotes the gamma function.. .) Then find the weighted least square estimate of f. 1 Xl <i< n. n > 2..02): 01 > 0. Is this also the MLE of /1? Problems for Section 2. S. Prove Lenama 2...3.(8 . _ C2' 2. ..152 Methods of Estimation Chapter 2 (aJ If /' is known. = 1] = p(x"a. = L(CI + C.fi) = 1log p pry. gamma. C2' y' 1 1 for (a) Show that the density of X = (Xl. . show that the MLE of fi is jj =  2. Let = {(01.l. Under what conditions on (Xl 1 . Give details of the proof or Corollary 2.0.4) and (2. < I} and let 03 = 1. a. write Xi = j if the ith plant has genotype j. Xn· Ip (x.3.l  L: ft)(Xi /. This set K will have a point where the max is attained.x.J. > 0. .3 1.p).. .. Let Xl. = If C2 > 0.. fi) = a + fix. X n be i. r(A. < Show that the MLE of a. In a sample of n independent plants. (b) Show that the likelihood equations are equivalent to (2. . .. . . (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. fi exists iff (Yl .)I(c..1 > _£l. Yn are independent PlY. the bound is sharp and is attained only if Yi x.3. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4. There exists a compact set K c e such that 1(8) < c for all () not in K.x. I < j < 6.. L x.::.Xi)Y' < L(C1 + c. + 0.l. .
9.. if n > 2. In the heterogenous regression Example 1. and assume forw wll > 0 so that w is strictly convex.'~ve a unique solution (.6. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm. [(Ai)... Let Y I ..en.lkIl :Ij >0.n) are the vertices of the :Ij <n}.0 < ZI < ... (3) Show that. . Use Corollary 2.l E R.40.. " (a) Show that if Cl: > ~ 1. .0). .. then it must contain a sphere and the center of the sphere is an interior point by (B. See also Problem 1.10 with that the MLE exists and is unique. and Zi is the income of the person whose duration time is ti. Prove Theorem 2.0.d. Show that the boundary of a convex C set in Rk has volume 0.Section 2.ill T exists and is unique. distribution where Iti = E(Y. . < Zn Zn. . with density.1 to show that in the multinomial Example 2. show 7. 12. . .6. . (0. < Show that the MLE of (a.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. . Let Xl. Let XI. Zj < . Hint: The kpoints (0. Then {'Ij} has a subsequence that converges to a point '10 E f.0).fl.3.t). • It) w' ( Xi It) .. .8) .. MLEs of ryj exist iffallTj > 0.. convex set {(Ij. w( ±oo) = 00.O.3. > 3. C! > 0.. .Xn E RP be i.1 <j < ac k1.9. e E RP. ~fo (X u I. 1 <j< k1.5 Problems and Complements 153 n 6. ji(i) + ji.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. ...1). 0 < Zl < . Hint: If BC has positive volume.3. .Ld. < Zn.. 10.. _. 0:0 = 1. (b) Show that if a = 1.3. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . a(i) + a.}. the MLE 8 exists and is unique. fe(x) wherec.' . Yn denote the duration times of n independent visits to a Web site.. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 .3. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. 8.z=7 11.1 (a) ~ ~ c(a) exp{ Ix . (O.. X n be Ll... J.n. Suppose Y has an exponential. the MLE 8 exists but is not unique if n is even. . n > 2. .) ~ Ail = exp{a + ilz.
. b) ~ I w( aXi . .4 1. . Golnb and Van Loan. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model i: y ~ ZDJ3 + <..8.1)".. W is strictly convex and give the likelihood equations for f. for example.2. (I't') IfEPD 8a2 a2 D > 0 an d > 0.6.d.2). O'~ + J1~.. b) and lim(a. Apply Corollary 2. O'?.J1. E" . Hint: (b) Because (XI. N(O. Yn ) be a sample from a N(/11. 8aob :1 13. respectively. .:. and p = [~(Xi . I (Xi .3. EM for bivariate data.b)_(ao.2)/M'''2] iI I'. P coincide with the method of moments estimates of Problem 2. b) = x if either ao = 0 or 00 or bo = ±oo. b = : and consider varying a. Show that if T is minimal and t: is open and the MLE doesn't exist. 2.J1. En i. then the coordinate ascent algorithm doesn't converge to a member of t:. .) has a density you may assume that > 0. 7if)F 8 08 D 802 vb2 2 2 > (a'D)2 ' then D'IS strictIy convex. Note: You may use without proof (see Appendix B.:. verify the Mstep by showing that E(Zi I Yi).bo) D(a..9). ~J ' .3.i.) Hint: (a) Thefunction D( a. /12. ag.)(Yi . (See Example 2. rank(ZD) = k.J1.l and CT. {t2. il' i.) I . Chapter 10. EeT = (J11. p) population... = (lin) L:. O'~ + J1i. L:.' ! (a) In the bivariate nonnal Example 2. b successively.2 are assumed to be known are = (lin) L:. . .n log a is strictly convex in (a.'2). (b) Reparametrize by a = . 1985. show that the estimates of J11.4. "i .. complete the Estep by finding E(Zl I l'i) and E(ZiYi I Yd· (b) In Example 2..J1. and p when J1..b) . (Check that you are describing the GaussSeidel iterative method for solving a system of linear equations.1.1 and J1.. .154 Methods of Estimation Chapter 2 (c) Show that for the logistic distribution Fo(x) [1 + exp{ X}]I. 0'1 Problems for Section 2. IPl < 1. ar 1 a~.4. O'~. Y. pal0'2 + J11J. 3. Let (X10 Yd.2)2. J12. (b) If n > 5 and J11 and /12 are unknown. (Xn .6. it is unique. provided that n > 3.(Yi . (i) If a strictly convex function has a minimum. 
> 0. (a) Show that the MLEs of a'f.4. See.
and that the (b) Use (2. B E [0. Suppose that in a family of n members in which one has the disease (and.I ~ + (1  B)n] .1) x Rwhere P. the model often used for X is that it has the conditional distribution of a B( n. Because it is known that X > 1. as the first approximation to the maximum likelihood .. B) variable. B= (A. I' >.? (b) Give as explicitly as possible the E.B)[I. .Section 2. Hint: Use Bayes rule.[lt ~ 1] = A ~ 1 . Yd : 1 < i family with T afi I a? known.5 Problems and Complements 155 4.. i ~ 1'. 1 and (a) Show that X .Ii). ~2 = log C\) + (b) Deduce that T is minimal sufficient.[lt ~ OJ. (a) Justify the following crude estimates of Jt and A. Y1 '" N(fLl a}).. 1 < i < n. Do you see any problems with >. 1]. it is desired to estimate the proportion 8 that has the genetic trait. (c) Give explicitly the maximum likelihood estimates of Jt and 5. Y (~ A 2:7 I(Yi  Y)'  O"~)/ (0".(I.P.n. IX > 1) = \ X ) 1 (1 6)" .O"~).(1 .[1 .B)n[n . is distributed according to an exponential ryl < n} = i£(1 I) uzuy. X is the number of members who have the trait. 8new . and given II = j. Y'i). Let (Ii. thus.4. Suppose the Ii in Problem 4 are not observed._x(1. 6...{(Ii. j = 0.. For families in which one member has the disease.(1 _ B)n]{x nB . 2 (uJ L }iIi + k 2: Y.B)"]'[(1 .and Msteps of the EM algorithm for this problem.2B)x + nB'] _. .3) to show that the NewtonRaphson algorithm gives ~ _ 81 ~ 8 ~ _ ~ _ B(I.B)n} _ nB'(I. given X>l. I n ) 8"(10)"" (a) Show that P(X = x MLE exists and is unique. when they exist. be independent and identically distributed according to P6. where 8 = 80l d and 8 1 estimate of 8. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. also the trait). 2:Ji) . 1') E (0.x = 1.
~ 0. N++ c N a +c N+ bc Pabc = . 7. = 1.i..e. n N++ c ++c  .X3 be independent observations from the Cauchy distribution about f(x. 1 < b < B.1) + C(A + B2) = C(A + B1) . . * maximizes = iJ(A') t. Let Xl. Let Xl. b1 C. 8. (c) Show that the MLEs exist iff 0 < N a+c . PIU = a. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between p and a. V. N +bc given by < N ++c for all a. . 1 <c < C and La. . b1 c and then are .c)} and "+" indicates summation over the ind~x. c]' Show that this holds iff PIU ~ a. . where X = (U. Show that the sequence defined by this algorithm converges to the MLE if it exists.Na+c. (a) Suppose for aU a. N . (h) Show that the family of distributions obtained by letting 1'..4.X 2 .riJ(A) .9)')1 Suppose X. Let and iJnew where ). W = c] 1 < a < A.156 (c) [f n = 5. (1) 10gPabc = /lac = Pabe. V = b. hence. X Methods of Estimation Chapter 2 = 2. Consider the following algorithm under the conditions of Theorem 2.1 generated by N++c.2 noting that the sequence of iterates {rymJ is bounded and. Define TjD as before.2. Hint: Apply the argument of the proof of Theorem 2.1 (1 + (x e. W).4.9) = ".d. the sequence (11ml ijm+l) has a convergent subse quence. v < 00.A(iJ(A)).b.c Pabc = l. (a) Deduce that depending on where bisection is started the sequence of iterates may converge to one or the other of the local maxima (b) Make a similar study of the NewtonRaphson method in this case. find (}l of (b) above using (} =   x/n as a preliminary estimate.v ~ b I W = c] = P[U = a I W = c]P[V = b I W ~ i.N+bc where N abc = #{i : Xi = (a. X n be i. v vary freely is an exponential family of rank (C .b. X 3 = a. iff U and V are independent given W. + Vbc where 00 < /l. 9. . X.
c. (b) Give explicitly the E.a>!b".1)(B ..I)(C . c" and "a. Initialize: pd: O) = N a ++ N_jo/>+ N++ c abc nnn Pabc d2) dl) Nab+ Pabc dO) n n Pab+ d1) dl) dO) Pabc d3) N a + c Pabc Pa+c N+bc Pabc d 2) Pabc n P+bc d2)· Reinitialize with ~t~.8).(A + B + C).3 + (A .a) 1 2 .4.1) + (A .4. (a) Show that S in Example 2. (b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. (a) Show that this is an exponential family of rank A + B + C .Section 2. but now (2) logPabc = /lac + Vbc + 'Yab where J1. c" parameters.(x) :1:::: 1(S(x) ~ = s). f vary freely. Justify formula (2. Show that the algorithm converges to the MLE if it exists and di· verges otheIWise. Let f.=_ L ellalblp~~~'CI a'. 12.:.I) ~ AB + AC + BC .b'.c' obtained by fixing the "b. v. Hint: Note that because {p~~~} belongs to the model so do all subsequent iterates and that ~~~ is the MLE for the exponential family Pabc = ellauplO) _=.5 Problems and Complements 157 Hint: (b) Consider N a+ c . Nl+co (c) The model implies Pabc = P+bcPa±c/P++c and use the likelihood equations.I)(C . N±bc . 10.N++c/A.5 has the specified mixtnre of Gaussian distribution. 11.[X = x I SeX) = s] = 13.and Msteps of the EM algorithm in this case. Hint: P. = fo(x  9) where fo(x) 3'P(x) + 3'P(x .I) + (B .N++ c / B. Suppose X is as in Problem 9..
. Show for 11 = 1 that bisection may lead to a local maximum of the likelihood.. and independent of El. ! . .4. R. (2) The frequency plugin estimates are sometimes called Fisher consistent. the probability that Vi is missing may depend on Zi. That is.4. N(ftl.158 Methods of Estimation Chapter 2 and r. 14. Suppose that for 1 < i < m we observe both Zi and Yi and for m + 1 < i < n. ~ 16. given Zi. Establish part (b) of Theorem 2. For example. a 2 . Noles for Section 2.4. consider the model I I: ! I I = J31 + J32 Z i + Ei I are i. Ii .p is the N(O} '1) density. ~ . Bm+d} has a subsequence converging to (0" JJ*) and.. That is. I Z. 18. If /12 = 1. in Example 2. Complete the E. 17. {32).6. 2 ). This condition is called missing at random. • .. Limitations a/the EM Algorithm.n}. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component X j of the data vector X is missing given the rest of the data vector is not a function of X j . Establish the last claim in part (2) of the proof of Theorem 2. . thus. (J*) and necessarily g. this assumption may not be satisfied. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986). the process determining whether X j is missing is independent of X j . In Example 2. if a is sufficiently large. Zl. Zn are ii.6 . NOTES . For instance.5.3.2.i. Bm + I)} has a subsequence converging to (B* . EM and Regression. the "missingness" of Vi is independent of Yi.d.. suppose Yi is missing iff Vi < 2.{Xj }.) underpredicts Y.and Msteps of the EM algorithm for estimating (ftl.6. = {(Zil Yi) Yi :i = 1.5. where E) . but not on Yi. En a an 2.3 for the actual MLE in that example. find the probability that E(Y.. For X . Hint: Use the canonical nature of the family and openness of E. N(O. Verify the fonnula given in Example 2.~ go. . . we observe only}li.4. . 15. These considerations lead essentially to maximum likelihood estimates. A. {31 1 a~."" En. 
I I' . al = a2 = 1 and p = 0. necessarily 0* is the global maximizer.." . . Hint: Show that {(19 m. Then using the Estep to impute values for the missing Y's would greatly unclerpredict the actual V's because all the V's in the imputation would have Y < 2.. Hint: Show that {( 8m . suppose all subjects with Yi > 2 drop out of the study. If Vi represents the seriousness of a disease.d. given X . Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected.4.1 (1) "Natural" now was not so natural in the eighteenth century when the least squares principle was introduced by Legendre and Gauss.
Notes for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

(2) For further properties of Kullback-Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.3

(1) Recall that in an exponential family, for any $A$, $P[T(X) \in A] = 0$ for all or for no $P \in \mathcal{P}$.

Note for Section 2.5

(1) In the econometrics literature (e.g., Appendix A.2, Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrasts estimates are often called generalized method of moment estimates.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.

BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).

BISHOP, Y. M. M., S. E. FIENBERG, AND P. W. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.

CAMPBELL, J. Y., A. W. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DEMPSTER, A., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).

DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).

EISENHART, C., "The Meaning of Least in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).

FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.

FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher 1950) New York: J. Wiley and Sons, 1922.

GOLUB, G. H., AND C. F. VAN LOAN, Matrix Computations Baltimore: Johns Hopkins University Press, 1985.

HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.

KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Trans. Inform. Theory, IT2, 102-108 (1956).

LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.

MCLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.

MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1-28 (1968).

RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290-300 (1959).

RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346-1370 (1994).

SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.

SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379-423, 623-656 (1948).

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.

STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.

WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95-103 (1983).
6) = EI(6. However. OUf examples are primarily estimation of a real parameter.Chapter 3 MEASURES OF PERFORMANCE. ) < R(B. In Section 3. We also discuss other desiderata that strongly compete with decision theoretic optimality. by introducing a Bayes prior density (say) 1r for comparison becomes unambiguous by considering the scalar Bayes risk. after similarly discussing testing and confidence bounds.1) 161 . We think of R(.6) = E o{(O. 6) as measuring a priori the performance of 6 for this model.1 INTRODUCTION Here we develop the theme of Section 1.6 . However. R(·. which is how to appraise and select among decision procedures. Strict comparison of 6. action space A.6(X).2 and 3.4. NOTIONS OF OPTIMALITY. actual implementation is limited.3 we show how the important Bayes and minimax criteria can in principle be implemented. 6) : e . 6 2 ) for all 0 or vice versa. and 62 on the basis of the risks alone is not well defined unless R(O. in particular computational simplicity and robustness. in the context of estimation. loss function l(O. AND OPTIMAL PROCEDURES 3. the relation of the two major decision theoretic principles to the 000decision theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. (3.2 BAYES PROCEDURES Recall from Section 1. in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.3.3 that if we specify a parametric model P = {Po: E 8}. o r(1T.2. 3. we study. In Sections 3.+ R+ by ° R(O.6(X)). We return to these themes in Chapter 6. then for data X '" Po and any decision procedure J randomized or not we can define its risk function.a).6) =ER(O.
where $(\theta, X)$ is given the joint distribution specified by (1.2.3). Recall also that we can define

$$R(\pi) = \inf\{r(\pi, \delta) : \delta \in \mathcal{D}\}, \tag{3.2.2}$$

the Bayes risk of the problem, and that in Section 1.3 we showed how, in an example, we could identify the Bayes rules $\delta^*_\pi$ such that

$$r(\pi, \delta^*_\pi) = R(\pi). \tag{3.2.3}$$

In this section we shall show systematically how to construct Bayes rules. This exercise is interesting and important even if we do not view $\pi$ as reflecting an implicitly believed in prior distribution on $\theta$. After all, if $\pi$ is a density and $\Theta \subset R$,

$$r(\pi, \delta) = \int R(\theta, \delta)\,\pi(\theta)\,d\theta \tag{3.2.4}$$

and $\pi$ may express that we care more about the values of the risk in some rather than other regions of $\Theta$. For testing problems the hypothesis is often treated as more important than the alternative. We may have vague prior notions such as "$|\theta| > 5$ is physically implausible" if, for instance, $\theta$ denotes mean height of people in meters. If $\pi$ is then thought of as a weight function roughly reflecting our knowledge, it is plausible that $\delta^*_\pi$, if computable, will behave reasonably even if our knowledge is only roughly right. Clearly, $\pi(\theta) \equiv c$ plays a special role ("equal weight") though (Problem 3.2.4) the parametrization plays a crucial role here. It is in fact clear that prior and loss function cannot be separated out clearly either. Thus, considering $l_1(\theta, a)$ and $\pi_1(\theta)$ is equivalent to considering $l_2(\theta, a) = \pi_1(\theta)\, l_1(\theta, a)$ and $\pi_2(\theta) \equiv 1$. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in Problem 3.2.5, and instead turn to construction of Bayes procedures.

We first consider the problem of estimating $q(\theta)$ with quadratic loss, $l(\theta, a) = (q(\theta) - a)^2$, using a nonrandomized decision rule $\delta$. Suppose $\theta$ is a random variable (or vector) with (prior) frequency function or density $\pi(\theta)$. Our problem is to find the function $\delta$ of $X$ that minimizes $r(\pi, \delta) = E(q(\theta) - \delta(X))^2$. This is just the problem of finding the best mean squared prediction error (MSPE) predictor of $q(\theta)$ given $X$ (see Remark 1.4.5). Using our results on MSPE prediction, we find that either $r(\pi, \delta) = \infty$ for all $\delta$ or the Bayes rule $\delta^*$ is given by

$$\delta^*(X) = E[q(\theta) \mid X]. \tag{3.2.5}$$

This procedure is called the Bayes estimate for squared error loss. In view of formulae (1.2.8) for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case with $\theta$ real valued and prior density $\pi$,

$$\delta^*(x) = \frac{\int_{-\infty}^{\infty} q(\theta)\, p(x \mid \theta)\, \pi(\theta)\, d\theta}{\int_{-\infty}^{\infty} p(x \mid \theta)\, \pi(\theta)\, d\theta}. \tag{3.2.6}$$

In the discrete case, as usual, we just need to replace the integrals by sums. Here is an example.
Example 3.2.1. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. Suppose that we want to estimate the mean $\theta$ of a normal distribution with known variance $\sigma^2$ on the basis of a sample $X_1, \ldots, X_n$. If we choose the conjugate prior $N(\eta_0, \tau^2)$ as in Example 1.6.12, we obtain the posterior distribution

$$N\left( \eta_0 \left[\frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\right] + \bar{x} \left[\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\right],\ \frac{\sigma^2}{n}\left(1 + \frac{\sigma^2}{n\tau^2}\right)^{-1} \right).$$

The Bayes estimate is just the mean of the posterior distribution

$$\delta^*(X) = \eta_0 \left[\frac{1/\tau^2}{n/\sigma^2 + 1/\tau^2}\right] + \bar{X} \left[\frac{n/\sigma^2}{n/\sigma^2 + 1/\tau^2}\right]. \tag{3.2.7}$$

Its Bayes risk (the MSPE of the predictor) is just

$$r(\pi, \delta^*) = E(\theta - E(\theta \mid X))^2 = E[E((\theta - E(\theta \mid X))^2 \mid X)] = E\left[\frac{\sigma^2}{n} \Big/ \left(1 + \frac{\sigma^2}{n\tau^2}\right)\right] = \frac{1}{n/\sigma^2 + 1/\tau^2}.$$

No finite choice of $\eta_0$ and $\tau^2$ will lead to $\bar{X}$ as a Bayes estimate. But $\bar{X}$ is the limit of such estimates as prior knowledge becomes "vague" ($\tau \to \infty$ with $\eta_0$ fixed). In fact, $\bar{X}$ is the estimate that (3.2.6) yields if we substitute the prior "density" $\pi(\theta) \equiv 1$ (Problem 3.2.1). Such priors with $\int \pi(\theta)\,d\theta = \infty$ or $\sum \pi(\theta) = \infty$ are called improper. The resulting Bayes procedures are also called improper.

Formula (3.2.7) reveals the Bayes estimate in the proper case to be a weighted average

$$w\eta_0 + (1 - w)\bar{X}$$

of the estimate to be used when there are no observations, that is, $\eta_0$, and $\bar{X}$, with weights inversely proportional to the Bayes risks of these two estimates. Because the Bayes risk of $\bar{X}$, $\sigma^2/n$, tends to 0 as $n \to \infty$, the Bayes estimate corresponding to the prior density $N(\eta_0, \tau^2)$ differs little from $\bar{X}$ for $n$ large. In fact, $\bar{X}$ is approximately a Bayes estimate for any one of these prior distributions in the sense that $[r(\pi, \bar{X}) - r(\pi, \delta^*)]/r(\pi, \delta^*) \to 0$ as $n \to \infty$. For more on this, see Section 5.5. □

We now turn to the problem of finding Bayes rules for general action spaces $\mathcal{A}$ and loss functions $l$. To begin with we consider only nonrandomized rules. If we look at the proof of Theorem 1.4.1, we see that the key idea is to consider what we should do given $X = x$. Thus, $E(Y \mid X)$ is the best predictor because $E(Y \mid X = x)$ minimizes the conditional MSPE $E((Y - a)^2 \mid X = x)$ as a function of the action $a$. Applying the same idea in the general Bayes decision problem, we form the posterior risk

$$r(a \mid x) = E(l(\theta, a) \mid X = x).$$

This quantity $r(a \mid x)$ is what we expect to lose if $X = x$ and we use action $a$. Intuitively, we should, for each $x$, take that action $a = \delta^*(x)$ that makes $r(a \mid x)$ as small as possible. This action need not exist nor be unique if it does exist. However, we have the following.

Proposition 3.2.1. Suppose that there exists a function $\delta^*(x)$ such that

$$r(\delta^*(x) \mid x) = \inf\{r(a \mid x) : a \in \mathcal{A}\}. \tag{3.2.8}$$

Then $\delta^*$ is a Bayes rule.

Proof. As in the proof of Theorem 1.4.1, we obtain for any $\delta$

$$r(\pi, \delta) = E[l(\theta, \delta(X))] = E[E(l(\theta, \delta(X)) \mid X)]. \tag{3.2.9}$$

But, by (3.2.8),

$$E[l(\theta, \delta(X)) \mid X = x] = r(\delta(x) \mid x) \ge r(\delta^*(x) \mid x) = E[l(\theta, \delta^*(X)) \mid X = x].$$

Therefore,

$$E[l(\theta, \delta(X)) \mid X] \ge E[l(\theta, \delta^*(X)) \mid X],$$

and the result follows from (3.2.9). □

As a first illustration, consider the oil-drilling example (Example 1.3.5) with prior $\pi(\theta_1) = 0.2$, $\pi(\theta_2) = 0.8$. Suppose we observe $x = 0$. Then the posterior distribution of $\theta$ is by (1.2.8)

$$\pi(\theta_1 \mid X = 0) = \tfrac{1}{9}, \quad \pi(\theta_2 \mid X = 0) = \tfrac{8}{9}.$$

Thus, the posterior risks of the actions $a_1$, $a_2$, and $a_3$ are

$$r(a_1 \mid 0) = \tfrac{1}{9}\, l(\theta_1, a_1) + \tfrac{8}{9}\, l(\theta_2, a_1) = 10.67,$$
$$r(a_2 \mid 0) = 2, \quad r(a_3 \mid 0) = 5.89.$$

Therefore, $a_2$ has the smallest posterior risk and, if $\delta^*$ is the Bayes rule, $\delta^*(0) = a_2$. Similarly,

$$r(a_1 \mid 1) = 8.35, \quad r(a_2 \mid 1) = 3.74, \quad r(a_3 \mid 1) = 5.70$$

and we conclude that $\delta^*(1) = a_2$. Therefore, $\delta^* = \delta_5$ as we found previously. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all competing procedures.

More generally consider the following class of situations.

Example 3.2.2. Bayes Procedures When $\Theta$ and $\mathcal{A}$ Are Finite. Let $\Theta = \{\theta_0, \ldots, \theta_p\}$, $\mathcal{A} = \{a_0, \ldots, a_q\}$, let $w_{ij} \ge 0$ be given constants, and let the loss incurred when $\theta_i$ is true and action $a_j$ is taken be given by

$$l(\theta_i, a_j) = w_{ij}.$$
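Let $\pi(\theta)$ be a prior distribution assigning mass $\pi_i$ to $\theta_i$, so that $\pi_i \ge 0$, $i = 0, \ldots, p$, and $\sum_{i=0}^{p} \pi_i = 1$. Suppose, moreover, that $X$ has density or frequency function $p(x \mid \theta)$ for each $\theta$. Then, by (1.2.8), the posterior probabilities are

$$P[\theta = \theta_i \mid X = x] = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_j \pi_j\, p(x \mid \theta_j)}$$

and, thus,

$$r(a_j \mid x) = \frac{\sum_i w_{ij}\, \pi_i\, p(x \mid \theta_i)}{\sum_i \pi_i\, p(x \mid \theta_i)}. \tag{3.2.10}$$

The optimal action $\delta^*(x)$ has

$$r(\delta^*(x) \mid x) = \min_{0 \le j \le q} r(a_j \mid x).$$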
Here are two interesting specializations.

(a) Classification: Suppose that $p = q$, we identify $a_j$ with $\theta_j$, $j = 0, \ldots, p$, and let

$$w_{ij} = 1,\ i \ne j; \qquad w_{ii} = 0.$$
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
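$$r(\theta_i \mid x) = P[\theta \ne \theta_i \mid X = x]$$

and minimizing $r(\theta_i \mid x)$ is equivalent to the reasonable procedure of maximizing the posterior probability,

$$P[\theta = \theta_i \mid X = x] = \frac{\pi_i\, p(x \mid \theta_i)}{\sum_j \pi_j\, p(x \mid \theta_j)}.$$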
(b) Testing: Suppose $p = q = 1$, $\pi_0 = \pi$, $\pi_1 = 1 - \pi$, $0 < \pi < 1$, $a_0$ corresponds to deciding $\theta = \theta_0$ and $a_1$ to deciding $\theta = \theta_1$. This is a special case of the testing formulation of Section 1.3 with $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$. The Bayes rule is then to

decide $\theta = \theta_1$ if $(1 - \pi)\, p(x \mid \theta_1) > \pi\, p(x \mid \theta_0)$,
decide $\theta = \theta_0$ if $(1 - \pi)\, p(x \mid \theta_1) < \pi\, p(x \mid \theta_0)$,

and decide either $a_0$ or $a_1$ if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing between $a_0$ and $a_1$ if equality occurs. As we let $\pi$ vary between zero and one, we obtain what is called the class of Neyman-Pearson tests, which provides the solution to the problem of minimizing $P$(type II error) given $P$(type I error) $\le \alpha$. This is treated further in Chapter 4. □
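The recipe of Proposition 3.2.1 for finite $\Theta$ and $\mathcal{A}$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the likelihood and loss tables below are assumed values for the oil-drilling example (Example 1.3.5 is not reproduced in this section); they were chosen because they give back the posterior risks 10.67, 2, 5.89 and 8.35, 3.74, 5.70 quoted above.

```python
def posterior(prior, likelihood, x):
    """Posterior probabilities P[theta = theta_i | X = x], as in (1.2.8)."""
    joint = [p * row[x] for p, row in zip(prior, likelihood)]
    total = sum(joint)
    return [value / total for value in joint]


def posterior_risks(prior, likelihood, loss, x):
    """Posterior risk r(a_j | x) of every action a_j, as in (3.2.10)."""
    post = posterior(prior, likelihood, x)
    n_actions = len(loss[0])
    return [sum(post[i] * loss[i][j] for i in range(len(post)))
            for j in range(n_actions)]


def bayes_action(prior, likelihood, loss, x):
    """Index j of an action minimizing r(a_j | x); ties go to the smallest j."""
    risks = posterior_risks(prior, likelihood, loss, x)
    return min(range(len(risks)), key=lambda j: risks[j])


# Assumed inputs (not given in this section): rows are theta_1, theta_2.
PRIOR = [0.2, 0.8]                      # pi(theta_1), pi(theta_2)
LIKELIHOOD = [[0.3, 0.7], [0.6, 0.4]]   # LIKELIHOOD[i][x] = p(x | theta_i)
LOSS = [[0, 10, 5], [12, 1, 6]]         # LOSS[i][j] = l(theta_i, a_j)

for x in (0, 1):
    risks = posterior_risks(PRIOR, LIKELIHOOD, LOSS, x)
    best = bayes_action(PRIOR, LIKELIHOOD, LOSS, x)
    print("x =", x, [round(r, 2) for r in risks], "-> a%d" % (best + 1))
```

The same three functions cover the classification case (a) by passing a loss table with zeros on the diagonal and ones elsewhere, in which case the selected action is the posterior mode.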
To complete our illustration of the utility of Proposition 3.2.1, we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation of the Probability of Success in n Bernoulli Trials. Suppose that we wish to estimate $\theta$ using $X_1, \ldots, X_n$, the indicators of $n$ Bernoulli trials with probability of success $\theta$. We shall consider the loss function $l$ given by

$$l(\theta, a) = \frac{(\theta - a)^2}{\theta(1 - \theta)}, \quad 0 < \theta < 1,\ a \text{ real}. \tag{3.2.11}$$

This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for $\theta$ close to zero, this $l(\theta, a)$ is close to the relative squared error $(\theta - a)^2/\theta$. It makes $\bar{X}$ have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5.

By sufficiency we need only consider the number of successes, $S$. Suppose now that we have a prior distribution. Then, expanding $(\theta - a)^2 = \theta^2 - 2a\theta + a^2$, if all terms on the right-hand side are finite,

$$r(a \mid k) = E\left[\frac{(\theta - a)^2}{\theta(1 - \theta)} \,\Big|\, S = k\right] = E\left[\frac{\theta}{1 - \theta} \,\Big|\, S = k\right] - 2a\, E\left[\frac{1}{1 - \theta} \,\Big|\, S = k\right] + a^2 E\left[\frac{1}{\theta(1 - \theta)} \,\Big|\, S = k\right]. \tag{3.2.12}$$

Minimizing this parabola in $a$, we find our Bayes procedure is given by

$$\delta^*(k) = \frac{E(1/(1 - \theta) \mid S = k)}{E(1/\theta(1 - \theta) \mid S = k)} \tag{3.2.13}$$

provided the denominator is not zero. For convenience let us now take as prior density the density $b_{r,s}(\theta)$ of the beta distribution $\beta(r, s)$. In Example 1.2.1 we showed that this leads to a $\beta(k + r, n + s - k)$ posterior distribution for $\theta$ if $S = k$. If $1 \le k \le n - 1$ and $n \ge 2$, then all quantities in (3.2.12) and (3.2.13) are finite, and

$$\delta^*(k) = \frac{\int_0^1 (1/(1 - \theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta}{\int_0^1 (1/\theta(1 - \theta))\, b_{k+r,\,n-k+s}(\theta)\, d\theta} = \frac{B(k + r,\, n - k + s - 1)}{B(k + r - 1,\, n - k + s - 1)} = \frac{k + r - 1}{n + s + r - 2}, \tag{3.2.14}$$

where we are using the notation B.2.11 of Appendix B. If $k = 0$, it is easy to see that $a = 0$ is the only $a$ that makes $r(a \mid k) < \infty$. Thus, $\delta^*(0) = 0$. Similarly, we get $\delta^*(n) = 1$. If we assume a uniform prior density ($r = s = 1$), we see that the Bayes procedure is the usual estimate, $\bar{X}$. This is not the case for quadratic loss (see Problem 3.2.2). □
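The ratio of beta functions in (3.2.14) is easy to check numerically. The sketch below is an illustration with hypothetical function names; it uses the standard identity $E[\theta^{-i}(1-\theta)^{-j}] = B(a - i, b - j)/B(a, b)$ for $\theta \sim \beta(a, b)$ with $a > i$, $b > j$, and compares the result with the closed form $(k + r - 1)/(n + r + s - 2)$ and with the posterior mean $(k + r)/(n + r + s)$, which is the Bayes estimate under ordinary quadratic loss.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def bayes_estimate_weighted_loss(k, n, r=1.0, s=1.0):
    """Evaluate (3.2.13) for a beta(r, s) prior via ratios of beta functions.

    The posterior is beta(a, b) with a = k + r, b = n - k + s; both posterior
    expectations are finite only when a > 1 and b > 1.
    """
    a, b = k + r, n - k + s
    if a <= 1 or b <= 1:
        raise ValueError("posterior expectations in (3.2.13) are not finite")
    numerator = math.exp(log_beta(a, b - 1) - log_beta(a, b))        # E[1/(1-theta)]
    denominator = math.exp(log_beta(a - 1, b - 1) - log_beta(a, b))  # E[1/(theta(1-theta))]
    return numerator / denominator


def closed_form(k, n, r=1.0, s=1.0):
    """Right-hand side of (3.2.14)."""
    return (k + r - 1) / (n + r + s - 2)


def posterior_mean(k, n, r=1.0, s=1.0):
    """Bayes estimate under ordinary quadratic loss, for comparison."""
    return (k + r) / (n + r + s)


k, n = 3, 10
print(bayes_estimate_weighted_loss(k, n), closed_form(k, n), posterior_mean(k, n))
```

With the uniform prior the first two values agree with $k/n$, while the posterior mean is pulled toward $1/2$.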
"Real" computation of Bayes procedures
The closed forms of (3.2.6) and (3.2.10) make the computation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that $\theta = (\theta_1, \ldots, \theta_p)$ has a hierarchically defined prior density,

$$\pi(\theta_1, \theta_2, \ldots, \theta_p) = \pi_1(\theta_1)\, \pi_2(\theta_2 \mid \theta_1) \cdots \pi_p(\theta_p \mid \theta_{p-1}). \tag{3.2.15}$$
Here is an example.

Example 3.2.4. The random effects model we shall study in Volume II has

$$X_{ij} = \mu + \Delta_i + \epsilon_{ij}, \quad 1 \le i \le I,\ 1 \le j \le J, \tag{3.2.16}$$

where the $\epsilon_{ij}$ are i.i.d. $N(0, \sigma_e^2)$ and $\mu$ and the vector $\Delta = (\Delta_1, \ldots, \Delta_I)$ are independent of $\{\epsilon_{ij} : 1 \le i \le I,\ 1 \le j \le J\}$ with $\Delta_1, \ldots, \Delta_I$ i.i.d. $N(0, \sigma_\Delta^2)$ and $\mu \sim N(\mu_0, \sigma_\mu^2)$. Here the $X_{ij}$ can be thought of as measurements on individual $i$ and $\Delta_i$ is an "individual" effect. If we now put a prior distribution on $(\mu, \sigma_e^2, \sigma_\Delta^2)$ making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by $\theta = (\mu, \sigma_e^2, \sigma_\Delta^2, \Delta_1, \ldots, \Delta_I)$ with the $X_{ij} \mid \theta$ independent $N(\mu + \Delta_i, \sigma_e^2)$. Then $p(x \mid \theta) = \prod_{i,j} \varphi_{\sigma_e}(x_{ij} - \mu - \Delta_i)$ and

$$\pi(\theta) = \pi_1(\mu)\, \pi_2(\sigma_e^2)\, \pi_3(\sigma_\Delta^2) \prod_{i=1}^{I} \varphi_{\sigma_\Delta}(\Delta_i) \tag{3.2.17}$$

where $\varphi_\sigma$ denotes the $N(0, \sigma^2)$ density.

In such a context a loss function frequently will single out some single coordinate $\theta_s$ (e.g., $\Delta_1$ in (3.2.17)) and to compute $r(a \mid x)$ we will need the posterior distribution of $\Delta_1 \mid X$. But this is obtainable from the posterior distribution of $\theta$ given $X = x$ only by integrating out $\theta_j$, $j \ne s$, and if $p$ is large this is intractable. In recent years so-called Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable and the use of Bayesian methods has spread. We return to the topic in Volume II. □
Linear Bayes estimates
When the problem of computing $r(\pi, \delta)$ and $\delta_\pi$ is daunting, an alternative is to consider a class $\tilde{\mathcal{D}}$ of procedures for which $r(\pi, \delta)$ is easy to compute and then to look for $\tilde{\delta}_\pi \in \tilde{\mathcal{D}}$ that minimizes $r(\pi, \delta)$ for $\delta \in \tilde{\mathcal{D}}$. An example is linear Bayes estimates where, in the case of squared error loss $[q(\theta) - a]^2$, the problem is equivalent to minimizing the mean squared prediction error among functions of the form $a + \sum_j b_j X_j$. If in (1.4.14) we identify $q(\theta)$ with $Y$ and $X$ with $Z$, the solution is

$$\tilde{\delta}(X) = E q(\theta) + [X - E(X)]^T \beta$$

where $\beta$ is as defined in Section 1.4. For example, if in the model (3.2.16), (3.2.17) we set $q(\theta) = \Delta_1$, we can find the linear Bayes estimate of $\Delta_1$ by using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of $\Delta_1$ is

$$\delta_L(X) = E(\Delta_1) + [X - \mu]^T \beta \tag{3.2.18}$$

where $E(\Delta_1) = 0$, $X = (X_{11}, \ldots, X_{1J})^T$, $\mu = E(X)$ and $\beta = \Sigma_{XX}^{-1} \Sigma_{X\Delta_1}$. For the given model
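$$\mathrm{Var}(X_{1j}) = E\,\mathrm{Var}(X_{1j} \mid \theta) + \mathrm{Var}\, E(X_{1j} \mid \theta) = E(\sigma_e^2) + \sigma_\Delta^2 + \sigma_\mu^2,$$

$$\mathrm{Cov}(X_{1j}, X_{1k}) = E\,\mathrm{Cov}(X_{1j}, X_{1k} \mid \theta) + \mathrm{Cov}(E(X_{1j} \mid \theta), E(X_{1k} \mid \theta)) = 0 + \mathrm{Cov}(\mu + \Delta_1, \mu + \Delta_1) = \sigma_\mu^2 + \sigma_\Delta^2, \quad j \ne k,$$

$$\mathrm{Cov}(\Delta_1, X_{1j}) = E\,\mathrm{Cov}(X_{1j}, \Delta_1 \mid \theta) + \mathrm{Cov}(E(X_{1j} \mid \theta), E(\Delta_1 \mid \theta)) = 0 + \sigma_\Delta^2 = \sigma_\Delta^2.$$

From these calculations we find $\beta$ and $\delta_L(X)$. We leave the details to Problem 3.2.10.

Linear Bayes procedures are useful in actuarial science, for example, Bühlmann (1970) and Norberg (1986).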
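As a hedged illustration of how the covariance calculations above determine the coefficient vector $\beta = \Sigma_{XX}^{-1}\Sigma_{X\Delta_1}$ in the linear Bayes estimate of $\Delta_1$, note that $\Sigma_{XX}$ has $E(\sigma_e^2) + \sigma_\mu^2 + \sigma_\Delta^2$ on the diagonal and $\sigma_\mu^2 + \sigma_\Delta^2$ elsewhere, while every entry of $\Sigma_{X\Delta_1}$ equals $\sigma_\Delta^2$. By symmetry all coordinates of $\beta$ are then equal. The sketch below (hypothetical function names and made-up numerical values; it assumes $E(X_{1j}) = \mu_0$, the prior mean of $\mu$) computes the common coefficient and verifies the normal equations $\Sigma_{XX}\beta = \Sigma_{X\Delta_1}$ directly.

```python
def linear_bayes_weight(J, e_sigma_e2, sigma_mu2, sigma_delta2):
    """Common coordinate of beta solving Sigma_XX beta = Sigma_XDelta."""
    return sigma_delta2 / (e_sigma_e2 + J * (sigma_mu2 + sigma_delta2))


def normal_equation_residuals(J, e_sigma_e2, sigma_mu2, sigma_delta2):
    """Rows of Sigma_XX beta - Sigma_XDelta; all should be zero."""
    weight = linear_bayes_weight(J, e_sigma_e2, sigma_mu2, sigma_delta2)
    off_diagonal = sigma_mu2 + sigma_delta2      # Cov(X_1j, X_1k), j != k
    diagonal = e_sigma_e2 + off_diagonal         # Var(X_1j)
    row_sum = diagonal * weight + (J - 1) * off_diagonal * weight
    return [row_sum - sigma_delta2 for _ in range(J)]


def linear_bayes_delta1(x_row, mu0, e_sigma_e2, sigma_mu2, sigma_delta2):
    """Linear Bayes estimate of Delta_1 from the first row (X_11, ..., X_1J)."""
    weight = linear_bayes_weight(len(x_row), e_sigma_e2, sigma_mu2, sigma_delta2)
    return weight * sum(x - mu0 for x in x_row)


# Hypothetical values, chosen only to exercise the formulas.
x_row = [5.2, 4.8, 5.5, 5.1]
print(linear_bayes_delta1(x_row, mu0=5.0, e_sigma_e2=1.0,
                          sigma_mu2=0.5, sigma_delta2=2.0))
print(normal_equation_residuals(4, 1.0, 0.5, 2.0))
```

When $\sigma_\mu^2 = 0$, so that $\mu$ is effectively known, the estimate reduces to the familiar shrinkage of the row mean deviation by the factor $J\sigma_\Delta^2/(E(\sigma_e^2) + J\sigma_\Delta^2)$.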
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier, the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior $\pi(\theta) \equiv c$. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than computing modes, and that again accounts in part for the popularity of maximum likelihood.

An important property of the MLE is equivariance: An estimating method $M$ producing the estimate $\hat{\theta}_M$ is said to be equivariant with respect to reparametrization if for every one-to-one function $h$ from $\Theta$ to $\Omega = h(\Theta)$, the estimate of $\omega \equiv h(\theta)$ is $\hat{\omega}_M = h(\hat{\theta}_M)$; that is, $\widehat{(h(\theta))}_M = h(\hat{\theta}_M)$. In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss, then the Bayes procedure $\hat{\theta}_B = E(\theta \mid X)$ is not equivariant for nonlinear transformations because

$$E(h(\theta) \mid X) \ne h(E(\theta \mid X))$$

for nonlinear $h$ (e.g., Problem 3.2.3).

The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is

$$r_\Theta(a \mid x) = \sum_{\theta \in \Theta} [\theta - a]^2\, \pi(\theta \mid x). \tag{3.2.19}$$

If we set $\omega = h(\theta)$ for $h$ one-to-one onto $\Omega = h(\Theta)$, then $\omega$ has prior $\lambda(\omega) \equiv \pi(h^{-1}(\omega))$ and in the $\omega$ parametrization, the posterior Bayes risk is

$$r_\Omega(a \mid x) = \sum_{\omega \in \Omega} [\omega - a]^2\, \lambda(\omega \mid x) = \sum_{\theta \in \Theta} [h(\theta) - a]^2\, \pi(\theta \mid x). \tag{3.2.20}$$

Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, $r_\Omega(a \mid x) \ne r_\Theta(h^{-1}(a) \mid x)$.
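A two-point posterior is enough to see (3.2.19) and (3.2.20) disagree. The sketch below is an illustration only: the posterior masses and the transformation $h(\theta) = \theta^2$ are hypothetical choices, not taken from the text.

```python
def h(theta):
    """A nonlinear, one-to-one reparametrization on (0, 1)."""
    return theta ** 2


# Hypothetical discrete posterior pi(theta | x) on two points.
POSTERIOR = {0.2: 0.5, 0.8: 0.5}


def posterior_expectation(g):
    """E[g(theta) | X = x] under POSTERIOR."""
    return sum(g(theta) * mass for theta, mass in POSTERIOR.items())


def risk_theta(a):
    """r_Theta(a | x) of (3.2.19)."""
    return posterior_expectation(lambda theta: (theta - a) ** 2)


def risk_omega(a):
    """r_Omega(a | x) of (3.2.20), written as a sum over theta."""
    return posterior_expectation(lambda theta: (h(theta) - a) ** 2)


theta_bayes = posterior_expectation(lambda theta: theta)   # minimizes risk_theta
omega_bayes = posterior_expectation(h)                      # minimizes risk_omega

print("h(Bayes estimate of theta):", h(theta_bayes))
print("Bayes estimate of h(theta):", omega_bayes)
print("posterior risk of each in the omega parametrization:",
      risk_omega(h(theta_bayes)), risk_omega(omega_bayes))
```

Here plugging the posterior mean of $\theta$ into $h$ gives roughly 0.25, whereas the posterior mean of $h(\theta)$ is roughly 0.34, and the latter has the smaller posterior risk in the $\omega$ parametrization.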
Loss functions of the form $l(\theta, a) = Q(P_\theta, P_a)$ are necessarily equivariant. The Kullback-Leibler divergence $K(\theta, a)$, $\theta, a \in \Theta$, is an example of such a loss function. It satisfies $K_\Omega(\omega, a) = K_\Theta(\theta, h^{-1}(a))$; thus, with this loss function,

$$r_\Omega(a \mid x) = r_\Theta(h^{-1}(a) \mid x).$$

See Problem 2.2.38. In the discrete case using $K$ means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.10)). In the $N(\theta, \sigma_0^2)$ case the KL (Kullback-Leibler) loss $K(\theta, a)$ is $\frac{1}{2} n \sigma_0^{-2} (a - \theta)^2$ (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families

$$K(\eta, a) = \sum_{j=1}^{k} [\eta_j - a_j]\, E_\eta T_j + A(a) - A(\eta).$$

Moreover, if we can find the KL loss Bayes estimate $\hat{\eta}_{BKL}$ of the canonical parameter $\eta$ and if $\eta \equiv c(\theta) : \Theta \to \mathcal{E}$ is one-to-one, then the KL loss Bayes estimate of $\theta$ in the general exponential family is $\hat{\theta}_{BKL} = c^{-1}(\hat{\eta}_{BKL})$.

For instance, in Example 3.2.1 where $\mu$ is the mean of a normal distribution and the prior is normal, we found the squared error Bayes estimate $\hat{\mu}_B = w\eta_0 + (1 - w)\bar{X}$, where $\eta_0$ is the prior mean and $w$ is a weight. Because the KL loss is equivalent to squared error for the canonical parameter $\mu$, if $\omega = h(\mu)$, then $\hat{\omega}_{BKL} = h(\hat{\mu}_{BKL})$, where $\hat{\mu}_{BKL} = w\eta_0 + (1 - w)\bar{X}$.

Bayes procedures based on the Kullback-Leibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume II.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1969; Lindley, 1965; Savage, 1954) who argue on normative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior $\pi$. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally, as we consider all possible priors, that the class $\mathcal{D}_0$ of Bayes procedures and their limits is complete in the sense that for any $\delta \in \mathcal{D}$ there is a $\delta_0 \in \mathcal{D}_0$ such that $R(\theta, \delta_0) \le R(\theta, \delta)$ for all $\theta$.

Summary. We show how Bayes procedures can be obtained for certain problems by computing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures requires sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
3.3 MINIMAX PROCEDURES

In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worst-case analysis; the true $\theta$ is one that is as "hard" as possible. That is, $\delta_1$ is better than $\delta_2$ from a minimax point of view if $\sup_\theta R(\theta, \delta_1) < \sup_\theta R(\theta, \delta_2)$ and $\delta^*$ is said to be minimax if

$$\sup_\theta R(\theta, \delta^*) = \inf_\delta \sup_\theta R(\theta, \delta).$$
Here $\theta$ and $\delta$ are taken to range over $\Theta$ and $\mathcal{D}$ = {all possible decision procedures (possibly randomized)} while $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. It is fruitful to consider proper subclasses of $\mathcal{D}$ and subsets of $\mathcal{P}$, but we postpone this discussion.

The nature of this criterion and its relation to Bayesian optimality is clarified by considering a so-called zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set $\mathcal{D}$ of all randomized decision procedures whereas Nature has at her disposal all prior distributions $\pi$ on $\Theta$. For the basic game, S picks $\delta$ without N's knowledge, N picks $\pi$ without S's knowledge and then all is revealed and S pays N

$$r(\pi, \delta) = \int R(\theta, \delta)\, d\pi(\theta)$$

where the notation $\int R(\theta, \delta)\, d\pi(\theta)$ stands for $\int R(\theta, \delta)\, \pi(\theta)\, d\theta$ in the continuous case and $\sum_j R(\theta_j, \delta)\, \pi(\theta_j)$ in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice $\delta$ of S before picking $\pi$ and S knows the rules of the game. Then N naturally picks $\pi_\delta$ such that

$$r(\pi_\delta, \delta) = \sup_\pi r(\pi, \delta), \tag{3.3.1}$$

that is, $\pi_\delta$ is least favorable against $\delta$. Knowing the rules of the game S naturally picks $\delta^*$ such that

$$r(\pi_{\delta^*}, \delta^*) = \sup_\pi r(\pi, \delta^*) = \inf_\delta \sup_\pi r(\pi, \delta). \tag{3.3.2}$$

We claim that $\delta^*$ is minimax. To see this we note first that

$$r(\pi, \delta) = \int R(\theta, \delta)\, d\pi(\theta) \le \sup_\theta R(\theta, \delta)$$

for all $\pi$, $\delta$. On the other hand, if $R(\theta_\delta, \delta) = \sup_\theta R(\theta, \delta)$, then if $\pi_\delta$ is point mass at $\theta_\delta$, $r(\pi_\delta, \delta) = R(\theta_\delta, \delta)$ and we conclude that

$$\sup_\pi r(\pi, \delta) = \sup_\theta R(\theta, \delta) \tag{3.3.3}$$

and our claim follows.

II: S is told the choice $\pi$ of N before picking $\delta$ and N knows the rules of the game. Then S naturally picks $\delta_\pi$ such that

$$r(\pi, \delta_\pi) = \inf_\delta r(\pi, \delta).$$

That is, $\delta_\pi$ is a Bayes procedure for $\pi$. Then N should pick $\pi^*$ such that

$$r(\pi^*, \delta_{\pi^*}) = \sup_\pi r(\pi, \delta_\pi) = \sup_\pi \inf_\delta r(\pi, \delta). \tag{3.3.4}$$

For obvious reasons, $\pi^*$ is called a least favorable (to S) prior distribution. As we shall see by example, although the right-hand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be unique.

The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both $\Theta$ and $\mathcal{D}$ are finite, then:

(a)

$$\underline{v} = \sup_\pi \inf_\delta r(\pi, \delta), \qquad \bar{v} = \inf_\delta \sup_\pi r(\pi, \delta)$$

are both assumed by (say) $\pi^*$ (least favorable), $\delta^*$ minimax, respectively. Further,

$$\underline{v} = r(\pi^*, \delta^*) = \bar{v}. \tag{3.3.5}$$

$\underline{v}$ and $\bar{v}$ are called the lower and upper values of the basic game. When $\underline{v} = \bar{v} = v$ (say), $v$ is called the value of the game.
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification and testing when $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$ (Example 3.2.2) but is too restrictive in its assumption for the great majority of inference problems. A generalization due to Wald and Karlin (see Karlin (1959)) states that the conclusions of the theorem remain valid if $\Theta$ and $\mathcal{D}$ are compact subsets of Euclidean spaces. There are more far-reaching generalizations but, as we shall see later, without some form of compactness of $\Theta$ and/or $\mathcal{D}$, although equality of $\underline{v}$ and $\bar{v}$ holds quite generally, existence of least favorable priors and/or minimax procedures may fail.

The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on $\Theta$ and $\mathcal{D}$ and are easy to prove.
Proposition 3.3.1. Suppose $\delta^{**}$, $\pi^{**}$ can be found such that

$$\delta^{**} = \delta_{\pi^{**}}, \qquad \pi^{**} = \pi_{\delta^{**}}, \tag{3.3.6}$$

that is, $\delta^{**}$ is Bayes against $\pi^{**}$ and $\pi^{**}$ is least favorable against $\delta^{**}$. Then $\underline{v} = \bar{v} = r(\pi^{**}, \delta^{**})$. That is, $\pi^{**}$ is least favorable and $\delta^{**}$ is minimax.

To utilize this result we need a characterization of $\pi_\delta$. This is given by

Proposition 3.3.2. $\pi_\delta$ is least favorable against $\delta$ iff

$$\pi_\delta\{\theta : R(\theta, \delta) = \sup_{\theta'} R(\theta', \delta)\} = 1. \tag{3.3.7}$$

That is, $\pi_\delta$ assigns probability only to points $\theta$ at which the function $R(\cdot, \delta)$ is maximal.

Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that $\pi_\delta$ may not be unique. In particular, if $R(\theta, \delta) \equiv$ constant, that is, the rule has constant risk, then all $\pi$ are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
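Proof of Proposition 3.3.1. Note first that we always have

$$\underline{v} \le \bar{v} \tag{3.3.8}$$

because, trivially,

$$\inf_\delta r(\pi, \delta) \le r(\pi, \delta') \tag{3.3.9}$$

for all $\pi$, $\delta'$. Hence,

$$\underline{v} = \sup_\pi \inf_\delta r(\pi, \delta) \le \sup_\pi r(\pi, \delta') \tag{3.3.10}$$

for all $\delta'$ and $\underline{v} \le \inf_{\delta'} \sup_\pi r(\pi, \delta') = \bar{v}$. On the other hand, by hypothesis,

$$\underline{v} \ge \inf_\delta r(\pi^{**}, \delta) = r(\pi^{**}, \delta^{**}) = \sup_\pi r(\pi, \delta^{**}) \ge \bar{v}. \tag{3.3.11}$$

Combining (3.3.8) and (3.3.11) we conclude that

$$\underline{v} = \inf_\delta r(\pi^{**}, \delta) = r(\pi^{**}, \delta^{**}) = \sup_\pi r(\pi, \delta^{**}) = \bar{v} \tag{3.3.12}$$

as advertised. □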
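Proof of Proposition 3.3.2. $\pi$ is least favorable for $\delta$ iff

$$E_\pi R(\theta, \delta) = \int R(\theta, \delta)\, d\pi(\theta) = \sup_\pi r(\pi, \delta). \tag{3.3.13}$$

But by (3.3.3),

$$\sup_\pi r(\pi, \delta) = \sup_\theta R(\theta, \delta). \tag{3.3.14}$$

Because $E_\pi R(\theta, \delta) \le \sup_\theta R(\theta, \delta)$, with equality exactly when $\pi$ puts all its mass on points where $R(\cdot, \delta)$ attains its supremum, (3.3.13) is possible iff (3.3.7) holds. □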
Putting the two propositions together we have the following.

Theorem 3.3.2. Suppose $\delta^*$ has $\sup_\theta R(\theta, \delta^*) = r < \infty$. If there exists a prior $\pi^*$ such that $\delta^*$ is Bayes for $\pi^*$ and $\pi^*\{\theta : R(\theta, \delta^*) = r\} = 1$, then $\delta^*$ is minimax.
Example 3.3.1. Minimax Estimation in the Binomial Case. Suppose $S$ has a $B(n, \theta)$ distribution and $\bar{X} = S/n$, as in Example 3.2.3. Let $l(\theta, a) = (\theta - a)^2/\theta(1 - \theta)$, $0 < \theta < 1$. For this loss function,

$$R(\theta, \bar{X}) = \frac{E(\bar{X} - \theta)^2}{\theta(1 - \theta)} = \frac{\theta(1 - \theta)}{n\theta(1 - \theta)} = \frac{1}{n},$$

and $\bar{X}$ does have constant risk. Moreover, we have seen in Example 3.2.3 that $\bar{X}$ is Bayes when $\theta$ is $U(0, 1)$. By Theorem 3.3.2 we conclude that $\bar{X}$ is minimax and, by Proposition 3.3.2, the uniform distribution least favorable.

For the usual quadratic loss neither of these assertions holds. The minimax estimate is

$$\delta^*(S) = \frac{S + \frac{1}{2}\sqrt{n}}{n + \sqrt{n}} = \frac{\sqrt{n}}{\sqrt{n} + 1}\,\bar{X} + \frac{1}{\sqrt{n} + 1} \cdot \frac{1}{2}.$$

This estimate does have constant risk and is Bayes against a $\beta(\sqrt{n}/2, \sqrt{n}/2)$ prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as $n \to \infty$ of the ratio of the risks of $\delta^*$ and $\bar{X}$ is $> 1$ for every $\theta \ne \frac{1}{2}$. At $\theta = \frac{1}{2}$ the ratio tends to 1. Details are left to Problem 3.3.4. □
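The two constant-risk claims in Example 3.3.1 can be checked by summing over the binomial distribution exactly. The sketch below uses hypothetical function names; the value $1/(4(1 + \sqrt{n})^2)$ for the constant quadratic risk of $\delta^*$ is not stated in the text and is obtained here from the usual variance plus squared bias decomposition.

```python
import math


def binomial_pmf(s, n, theta):
    """P[S = s] for S ~ B(n, theta)."""
    return math.comb(n, s) * theta ** s * (1 - theta) ** (n - s)


def quadratic_risk(estimator, n, theta):
    """E_theta (estimator(S, n) - theta)^2, computed by exact summation."""
    return sum(binomial_pmf(s, n, theta) * (estimator(s, n) - theta) ** 2
               for s in range(n + 1))


def sample_mean(s, n):
    return s / n


def minimax_estimate(s, n):
    """delta*(S) = (S + sqrt(n)/2) / (n + sqrt(n)) from Example 3.3.1."""
    root = math.sqrt(n)
    return (s + root / 2) / (n + root)


n = 25
for theta in (0.1, 0.3, 0.5):
    risk_mean = quadratic_risk(sample_mean, n, theta)
    risk_minimax = quadratic_risk(minimax_estimate, n, theta)
    weighted = risk_mean / (theta * (1 - theta))   # risk under loss (3.2.11)
    print(theta, risk_mean, risk_minimax, weighted)
```

For $n = 25$ the quadratic risk of $\delta^*$ is the same at every $\theta$, the weighted-loss risk of $\bar{X}$ is $1/n$ at every $\theta$, and the quadratic risk of $\bar{X}$ falls below that of $\delta^*$ once $\theta$ is far enough from $\frac{1}{2}$, which is the sense in which the minimax estimate is unsatisfactory.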
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity $v > 0$ for $n$ seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending or not. The mean voltage per second of the signal for each of the $n$ seconds is recorded. Denote the mean voltage of the signal received through the $i$th second less expected mean voltage due to noise by $X_i$. We assume that the $X_i$ are independently and identically distributed as $N(\mu, \sigma^2)$ where $\mu = v$ if the satellite functions, and $\mu = 0$ otherwise. The variance $\sigma^2$ of the "noise" is assumed known. Our problem is to decide whether "$\mu = 0$" or "$\mu = v$." We view this as a decision problem with $0 - 1$ loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk?

A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability $\pi$ to 0 and $1 - \pi$ to $v$, use $0 - 1$ loss, and set $L(x, 0, v) = p(x \mid v)/p(x \mid 0)$, then the Bayes test decides $\mu = v$ if

$$L(x, 0, v) = \exp\left\{\frac{v}{\sigma^2} \sum x_i - \frac{nv^2}{2\sigma^2}\right\} \ge \frac{\pi}{1 - \pi}$$

and decides $\mu = 0$ if

$$L(x, 0, v) < \frac{\pi}{1 - \pi}.$$

This test is equivalent to deciding $\mu = v$ (Problem 3.3.1) if, and only if,

$$T = \frac{1}{\sigma\sqrt{n}} \sum X_i \ge t,$$

where

$$t = \frac{\sigma}{v\sqrt{n}} \left[\log\frac{\pi}{1 - \pi} + \frac{nv^2}{2\sigma^2}\right].$$

If we call this test $\delta_\pi$,

$$R(0, \delta_\pi) = 1 - \Phi(t) = \Phi(-t),$$
$$R(v, \delta_\pi) = \Phi\left(t - \frac{v\sqrt{n}}{\sigma}\right).$$

To get a minimax test we must have $R(0, \delta_\pi) = R(v, \delta_\pi)$, which is equivalent to

$$-t = t - \frac{v\sqrt{n}}{\sigma}$$

or

$$t = \frac{v\sqrt{n}}{2\sigma}.$$

Because this value of $t$ corresponds to $\pi = \frac{1}{2}$, the intuitive test, which decides $\mu = v$ if and only if $T \ge \frac{1}{2}[E_0(T) + E_v(T)]$, is indeed minimax. □
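The equalization of the two error probabilities in Example 3.3.2 can be seen numerically. The following sketch is illustrative only: the function names and the numerical values of $v$, $\sigma$, and $n$ are hypothetical, and $\Phi$ is computed from the error function in the standard library.

```python
import math


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_threshold(pi, v, sigma, n):
    """Cutoff t of the Bayes test for prior mass pi on mu = 0."""
    return (sigma / (v * math.sqrt(n))) * (
        math.log(pi / (1.0 - pi)) + n * v ** 2 / (2.0 * sigma ** 2))


def error_probabilities(t, v, sigma, n):
    """(R(0, delta), R(v, delta)) for the test that decides mu = v iff T >= t."""
    shift = v * math.sqrt(n) / sigma        # E_v(T); recall E_0(T) = 0
    return 1.0 - Phi(t), Phi(t - shift)


v, sigma, n = 1.0, 2.0, 16                  # hypothetical signal, noise, duration
for pi in (0.2, 0.5, 0.8):
    t = bayes_threshold(pi, v, sigma, n)
    risk_0, risk_v = error_probabilities(t, v, sigma, n)
    print(pi, round(t, 4), round(risk_0, 4), round(risk_v, 4),
          "max:", round(max(risk_0, risk_v), 4))
```

Only $\pi = \frac{1}{2}$ makes the two risks equal, and its maximum risk is the smallest of the three priors tried; the cutoff in that case is $v\sqrt{n}/2\sigma$, halfway between $E_0(T)$ and $E_v(T)$.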
If $\Theta$ is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem 3.3.2.

Theorem 3.3.3. Let $\delta^*$ be a rule such that $\sup_\theta R(\theta, \delta^*) = r < \infty$, let $\{\pi_k\}$ denote a sequence of prior distributions such that $\pi_k\{\theta : R(\theta, \delta^*) = r\} = 1$, and let $r_k = \inf_\delta r(\pi_k, \delta)$, where $r(\pi_k, \delta)$ denotes the Bayes risk with respect to $\pi_k$. If

$$r_k \to r \text{ as } k \to \infty, \tag{3.3.15}$$

then $\delta^*$ is minimax.

Proof. Because $r(\pi_k, \delta^*) = r$,

$$\sup_\theta R(\theta, \delta^*) = r_k + o(1)$$

where $o(1) \to 0$ as $k \to \infty$. But by (3.3.13) for any competitor $\delta$

$$\sup_\theta R(\theta, \delta) \ge E_{\pi_k}(R(\theta, \delta)) \ge r_k = \sup_\theta R(\theta, \delta^*) - o(1). \tag{3.3.16}$$

If we let $k \to \infty$ the left-hand side of (3.3.16) is unchanged, whereas the right tends to $\sup_\theta R(\theta, \delta^*)$. □
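Example 3.3.3. Normal Mean. We now show that $\bar{X}$ is minimax in Example 3.2.1. Identify $\pi_k$ with the $N(\eta_0, \tau^2)$ prior where $k = \tau^2$. Then

$$R(\theta, \bar{X}) = \frac{\sigma^2}{n} \text{ for all } \theta,$$

whereas the Bayes risk of the Bayes rule of Example 3.2.1 is

$$\inf_\delta r(\pi_k, \delta) = \frac{\tau^2}{(\sigma^2/n) + \tau^2} \cdot \frac{\sigma^2}{n} = \frac{\sigma^2}{n} - \frac{1}{(\sigma^2/n) + \tau^2} \left(\frac{\sigma^2}{n}\right)^2.$$

Because $(\sigma^2/n)/((\sigma^2/n) + \tau^2) \to 0$ as $\tau^2 \to \infty$, we can conclude that $\bar{X}$ is minimax. □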
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose $X_1, \ldots, X_n$ are i.i.d. $F \in \mathcal{F}$, where

$$\mathcal{F} = \{F : \mathrm{Var}_F(X_1) \le M\}.$$

Then $\bar{X}$ is minimax for estimating $\theta(F) \equiv E_F(X_1)$ with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let $\pi_k$ be a prior distribution on $\mathcal{F}$ constructed as follows:

(i) $\pi_k\{F : \mathrm{Var}_F(X_1) \ne M\} = 0$.

(ii) $\pi_k\{F : F \ne N(\mu, M) \text{ for some } \mu\} = 0$.

(iii) $F$ is chosen by first choosing $\mu = \theta(F)$ from a $N(0, k)$ distribution and then taking $F = N(\theta(F), M)$.

Evidently, the Bayes risk is now the same as in Example 3.3.3 with $\sigma^2 = M$. Because, evidently,

$$\max_{\mathcal{F}} R(F, \bar{X}) = \max_{\mathcal{F}} \frac{\mathrm{Var}_F(X_1)}{n} = \frac{M}{n},$$

Theorem 3.3.3 applies and the result follows. □

Minimax procedures and symmetry
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" $\theta$. There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and Casella (1998), for instance. We shall discuss this approach somewhat, by example, in Chapter 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading.

Summary. We introduce the minimax principle in the context of the theory of games. Using this framework we connect minimaxity and Bayes methods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
More specifically, we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule $\delta$ and N selects a prior $\pi$. The lower (upper) value $\underline{v}$ ($\bar{v}$) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called least favorable. When $\underline{v} = \bar{v}$, the game is said to have a value $v$. Von Neumann's Theorem states that if $\Theta$ and $\mathcal{D}$ are both finite, then the game of S versus N has a value $v$, there is a least favorable prior $\pi^*$ and a minimax rule $\delta^*$ such that $\delta^*$ is the Bayes rule for $\pi^*$ and $\pi^*$ maximizes the Bayes risk of $\delta^*$ over all priors. Moreover, $v$ equals the Bayes risk of the Bayes rule $\delta^*$ for the prior $\pi^*$. We show that Bayes rules with constant risk, or more generally with constant risk over the support of some prior, are minimax. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that $\bar{X}$ is a minimax rule for squared error loss in the $N(\theta, \sigma_0^2)$ model.

3.4 UNBIASED ESTIMATION AND RISK INEQUALITIES

3.4.1 Unbiased Estimation, Survey Sampling

In the previous two sections we have considered two decision theoretic optimality principles, Bayes and minimaxity, for which it is possible to characterize and, in many cases, compute procedures (in particular estimates) that are best in the class of all procedures, $\mathcal{D}$, according to these criteria. An alternative approach is to specify a proper subclass of procedures, $\mathcal{D}_0 \subset \mathcal{D}$, on other grounds, computational ease, symmetry, and so on, and then see if within $\mathcal{D}_0$ we can find $\delta^* \in \mathcal{D}_0$ that is best according to the "gold standard," $R(\theta, \delta) \ge R(\theta, \delta^*)$ for all $\theta$, all $\delta \in \mathcal{D}_0$. Obviously, we can also take this point of view with humbler aims, for example, looking for the procedure $\delta^*_\pi \in \mathcal{D}_0$ that minimizes the Bayes risk with respect to a prior $\pi$ among all $\delta \in \mathcal{D}_0$. This approach has early on been applied to parametric families $\mathcal{D}_0$. When $\mathcal{D}_0$ is the class of linear procedures and $l$ is quadratic loss, the solution is given in Section 3.2.

In the non-Bayesian framework, if $Y$ is postulated as following a linear regression model with $E(Y) = z^T\beta$ as in Section 2.2.1, then in estimating a linear function of the $\beta_j$, it is natural to consider the computationally simple class of linear estimates, $\delta(Y) = \sum_{i=1}^{n} d_i Y_i$. This approach coupled with the principle of unbiasedness we now introduce leads to the famous Gauss-Markov theorem proved in Section 6.6.

We introduced, in Section 1.3, the notion of bias of an estimate $\delta(X)$ of a parameter $q(\theta)$ in a model $\mathcal{P} \equiv \{P_\theta : \theta \in \Theta\}$ as

$$\mathrm{Bias}_\theta(\delta) \equiv E_\theta\, \delta(X) - q(\theta).$$

An estimate such that $\mathrm{Bias}_\theta(\delta) \equiv 0$ is called unbiased. This notion has intuitive appeal, ruling out, for instance, estimates that ignore the data, such as $\delta(X) \equiv q(\theta_0)$, which can't be beat for $\theta = \theta_0$ but can obviously be arbitrarily terrible. The most famous unbiased estimates are the familiar estimates of $\mu$ and $\sigma^2$ when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$
given by (see Example 1.3.3 and Problem 1.3.8)

$$\hat{\mu} = \bar{X} \tag{3.4.1}$$

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2. \tag{3.4.2}$$

Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate $\delta^*(X)$ of $q(\theta)$ that has minimum MSE among all unbiased estimates for all $\theta$, UMVU (uniformly minimum variance unbiased). As we shall see shortly for $\bar{X}$ and in Volume 2 for $s^2$, these are both UMVU.

Unbiased estimates play a particularly important role in survey sampling.

Example 3.4.1. Unbiased Estimates in Survey Sampling. Suppose we wish to sample from a finite population, for instance, a census unit, to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census. Write $x_1, \ldots, x_N$ for the unknown current family incomes and correspondingly $u_1, \ldots, u_N$ for the known last census incomes. We ignore difficulties such as families moving. We let $X_1, \ldots, X_n$ denote the incomes of a sample of $n$ families drawn at random without replacement. This leads to the model with $\mathbf{x} = (x_1, \ldots, x_N)$ as parameter

$$P_{\mathbf{x}}[X_1 = a_1, \ldots, X_n = a_n] = \frac{1}{\binom{N}{n}} \text{ if } \{a_1, \ldots, a_n\} \subset \{x_1, \ldots, x_N\}, \quad = 0 \text{ otherwise.} \tag{3.4.3}$$

We want to estimate the parameter $\bar{x} = \frac{1}{N} \sum_{j=1}^N x_j$. It is easy to see that the natural estimate $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$ is unbiased (Problem 3.4.14) and has

$$\text{MSE}(\bar{X}) = \text{Var}_{\mathbf{x}}(\bar{X}) = \frac{\sigma_x^2}{n} \left(1 - \frac{n-1}{N-1}\right) \tag{3.4.4}$$

where

$$\sigma_x^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2. \tag{3.4.5}$$

This method of sampling does not use the information contained in $u_1, \ldots, u_N$. One way to do this, reflecting the probable correlation between $(u_1, \ldots, u_N)$ and $(x_1, \ldots, x_N)$, is to estimate by a regression estimate

$$\bar{X}_R \equiv \bar{X} - b(\bar{U} - \bar{u}) \tag{3.4.6}$$

where $b$ is a prespecified positive constant, $U_i$ is the last census income corresponding to $X_i$, and $\bar{u} = \frac{1}{N} \sum_{i=1}^N u_i$, $\bar{U} = \frac{1}{n} \sum_{i=1}^n U_i$. Clearly for each $b$, $\bar{X}_R$ is also unbiased. If
the correlation of $U_i$ and $X_i$ is positive and $b < 2\,\text{Cov}(\bar{U}, \bar{X})/\text{Var}(\bar{U})$, this will be a better estimate than $\bar{X}$ and the best choice of $b$ is $b_{opt} \equiv \text{Cov}(\bar{U}, \bar{X})/\text{Var}(\bar{U})$ (Problem 3.4.19). The value of $b_{opt}$ is unknown but can be estimated by

$$\hat{b}_{opt} = \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(U_i - \bar{U})}{\frac{1}{N} \sum_{j=1}^N (u_j - \bar{u})^2}.$$

The resulting estimate is no longer unbiased but behaves well for large samples; see Problem 5.3.11.

An alternative approach to using the $u_j$ is to not sample all units with the same probability. Specifically let $0 \le \pi_1, \ldots, \pi_N \le 1$ with $\sum_{j=1}^N \pi_j = n$. For each unit $1, \ldots, N$ toss a coin with probability $\pi_j$ of landing heads and select $x_j$ if the coin lands heads. The result is a sample $S = \{X_1, \ldots, X_M\}$ of random size $M$ such that $E(M) = n$ (Problem 3.4.15). If the $\pi_j$ are not all equal, $\bar{X}$ is not unbiased but the following estimate known as the Horvitz-Thompson estimate is:

$$\bar{X}_{HT} \equiv \frac{1}{N} \sum_{i=1}^M \frac{X_i}{\pi_{J_i}} \tag{3.4.7}$$

where $J_i$ is defined by $X_i = x_{J_i}$. To see this write

$$\bar{X}_{HT} = \frac{1}{N} \sum_{j=1}^N \frac{x_j}{\pi_j} 1(x_j \in S).$$

Because $\pi_j = P[x_j \in S]$ by construction unbiasedness follows. A natural choice of $\pi_j$ is $n u_j / \sum_{k=1}^N u_k$, that is, inclusion probabilities proportional to the last census incomes. This makes it more likely for big incomes to be included and is intuitively desirable. It is possible to avoid the undesirable random sample size of these schemes and yet have specified $\pi_j$. The Horvitz-Thompson estimate then stays unbiased. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. □

Discussion. Unbiasedness is also used in stratified sampling theory (see Problem 1.3.4). However, outside of sampling, the unbiasedness principle has largely fallen out of favor for a number of reasons.

(i) Typically unbiased estimates do not exist; see Bickel and Lehmann (1969) and Problem 3.4.18, for instance.

(ii) Bayes estimates are necessarily biased (see Problem 3.4.20) and minimax estimates often are.

(iii) Unbiased estimates do not obey the attractive equivariance property. If $\hat{\theta}$ is unbiased for $\theta$, $q(\hat{\theta})$ is biased for $q(\theta)$ unless $q$ is linear. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.
That is. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. (I) The set A ~ {x : p(x. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x.OJdx] = J T(xJ%oP(x. B) is a density.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. For instance. e e. e. suppose I holds. and p is the number of coefficients in f3. Simpler assumptions can be fonnulated using Lebesgue integration theory. Note that in particular (3. good estimates in large samples ~ 1 ~ are approximately unbiased.   . We make two regularity assumptions on the family {Pe : BEe}.4. For all x E A.p) where i = (Y .8) is assumed to hold if T(xJ = I for all x.Section 3. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n . The lower bound is interesting in its own right. unbiased estimates are still in favor when it comes to estimating residual variances. 167. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00.4. The arguments will be based on asymptotic versions of the important inequalities in the next subsection. We expect that jBiase(Bn)I/Vart (B n ) . We suppose throughout that we have a regular parametric model and further that is an open subset of the line. has some decision theoretic applications. For instance. 3. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later.8) whenever the righthand side of (3. which can be used to show that an estimate is UMVU.O E 81iJO log p( x. OJ exists and is finite. p.. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large.2.9.8) is finite.OJdx (3. 
See Problem 3.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless. B) for II to hold. B)dx. Assumption II is practically useless as written. f3 is the least squares estimate.O) > O} docs not depend on O. %0 [j T(X)P(x. Finally. B)dx.( lTD < 00 J .4. From this point on we will suppose p(x. Then 11 holdS provided that for all T such that E.4. Some classical conditiolls may be found in Apostol (1974). in the linear regression model Y = ZDf3 + e of Section 2. for integration over Rq. and we can interchange differentiation and integration in p(x. as we shall see in Chapters 5 and 6. and appears in the asymptotic optimality theory of Section 5.4.ZDf3). What is needed are simple sufficient conditions on p(x.4.
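for all $\theta$, the integrals

$$\int T(x) \left[ \frac{\partial}{\partial\theta} p(x, \theta) \right] dx \quad \text{and} \quad \int \left| T(x) \left[ \frac{\partial}{\partial\theta} p(x, \theta) \right] \right| dx$$

are continuous functions(3) of $\theta$. It is not hard to check (using Laplace transform theory) that a one-parameter exponential family quite generally satisfies Assumptions I and II.

Proposition 3.4.1. If $p(x, \theta) = h(x) \exp\{\eta(\theta) T(x) - B(\theta)\}$ is an exponential family and $\eta(\theta)$ has a nonvanishing continuous derivative on $\Theta$, then I and II hold.

For instance, suppose $X_1, \ldots, X_n$ is a sample from a $N(\theta, \sigma^2)$ population, where $\sigma^2$ is known. Then (see Table 1.6.1) $\eta(\theta) = \theta/\sigma^2$ and I and II are satisfied. Similarly, I and II are satisfied for samples from gamma and beta distributions with one parameter fixed.

If I holds it is possible to define an important characteristic of the family $\{P_\theta\}$, the Fisher information number, which is denoted by $I(\theta)$ and given by

$$I(\theta) = E_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right)^2 = \int \left( \frac{\partial}{\partial\theta} \log p(x, \theta) \right)^2 p(x, \theta)\,dx. \tag{3.4.9}$$

Note that $0 \le I(\theta) \le \infty$.

Lemma 3.4.1. Suppose that I and II hold and that

$$E_\theta \left| \frac{\partial}{\partial\theta} \log p(X, \theta) \right| < \infty.$$

Then

$$E_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right) = 0 \tag{3.4.10}$$

and, thus,

$$I(\theta) = \text{Var}_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right). \tag{3.4.11}$$

Proof.

$$E_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right) = \int \left\{ \left[ \frac{\partial}{\partial\theta} p(x, \theta) \right] \Big/ p(x, \theta) \right\} p(x, \theta)\,dx = \int \frac{\partial}{\partial\theta} p(x, \theta)\,dx = \frac{\partial}{\partial\theta} \int p(x, \theta)\,dx = 0. \quad □$$

Example 3.4.2. Suppose $X_1, \ldots, X_n$ is a sample from a Poisson $P(\theta)$ population. Then

$$\frac{\partial}{\partial\theta} \log p(x, \theta) = \frac{\sum_{i=1}^n x_i}{\theta} - n \quad \text{and} \quad I(\theta) = \text{Var}\left( \frac{\sum_{i=1}^n X_i}{\theta} \right) = \frac{1}{\theta^2}\, n\theta = \frac{n}{\theta}. \quad □$$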
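The Poisson computation in Example 3.4.2 can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the original text: it truncates the Poisson sum at a hypothetical `cutoff`, computes the mean and variance of the one-observation score $x/\theta - 1$, and multiplies by $n$ on the grounds that the scores of independent observations have variances that add. All function names are illustrative.

```python
from math import exp


def poisson_pmf_table(theta, cutoff=200):
    """Poisson(theta) probabilities p(0), ..., p(cutoff-1), built recursively."""
    probs = [exp(-theta)]
    for x in range(1, cutoff):
        probs.append(probs[-1] * theta / x)
    return probs


def score_moments(theta, cutoff=200):
    """Mean and variance of the one-observation score x/theta - 1.
    The truncation at `cutoff` is an assumption; the neglected tail
    mass is negligible for moderate theta."""
    probs = poisson_pmf_table(theta, cutoff)
    mean = sum(p * (x / theta - 1) for x, p in enumerate(probs))
    var = sum(p * (x / theta - 1 - mean) ** 2 for x, p in enumerate(probs))
    return mean, var


def fisher_information(theta, n):
    """Information in an i.i.d. sample of size n: n times the score variance."""
    return n * score_moments(theta)[1]


theta, n = 2.5, 10
score_mean, score_var = score_moments(theta)
info = fisher_information(theta, n)
```

With these values the score mean should be numerically zero, in line with (3.4.10), and `info` should agree with $n/\theta$ up to floating-point error.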
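Here is the main result of this section.

Theorem 3.4.1. (Information Inequality). Let $T(X)$ be any statistic such that $\text{Var}_\theta(T(X)) < \infty$ for all $\theta$. Denote $E_\theta(T(X))$ by $\psi(\theta)$. Suppose that I and II hold and $0 < I(\theta) < \infty$. Then for all $\theta$, $\psi(\theta)$ is differentiable and

$$\text{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}. \tag{3.4.12}$$

Proof. Using I and II we obtain,

$$\psi'(\theta) = \int T(x) \frac{\partial}{\partial\theta} p(x, \theta)\,dx = \int T(x) \left( \frac{\partial}{\partial\theta} \log p(x, \theta) \right) p(x, \theta)\,dx. \tag{3.4.13}$$

By (A.11.14) and Lemma 3.4.1,

$$\psi'(\theta) = \text{Cov}_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta),\, T(X) \right). \tag{3.4.14}$$

Now let us apply the correlation (Cauchy-Schwarz) inequality (A.11.16) to the random variables $\frac{\partial}{\partial\theta} \log p(X, \theta)$ and $T(X)$. We get

$$|\psi'(\theta)| \le \sqrt{\text{Var}_\theta(T(X))\, \text{Var}_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right)}. \tag{3.4.15}$$

The theorem follows because, by Lemma 3.4.1, $\text{Var}_\theta \left( \frac{\partial}{\partial\theta} \log p(X, \theta) \right) = I(\theta)$. □

The lower bound given in the information inequality depends on $T(X)$ through $\psi(\theta)$. If we consider the class of unbiased estimates of $q(\theta) = \theta$, we obtain a universal lower bound given by the following.

Corollary 3.4.1. Suppose the conditions of Theorem 3.4.1 hold and $T$ is an unbiased estimate of $\theta$. Then

$$\text{Var}_\theta(T(X)) \ge \frac{1}{I(\theta)}. \tag{3.4.16}$$

The number $1/I(\theta)$ is often referred to as the information or Cramér-Rao lower bound for the variance of an unbiased estimate of $\psi(\theta)$.

Here's another important special case.

Proposition 3.4.2. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a population with density $f(x, \theta)$, $\theta \in \Theta$, and that the conditions of Theorem 3.4.1 hold. Let $I_1(\theta) = E_\theta \left( \frac{\partial}{\partial\theta} \log f(X_1, \theta) \right)^2$, then

$$I(\theta) = n I_1(\theta) \quad \text{and} \quad \text{Var}_\theta(T(X)) \ge \frac{[\psi'(\theta)]^2}{n I_1(\theta)}. \tag{3.4.17}$$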
Proof. This is a consequence of Lemma 3.4.1 and

$$I(\theta) = \text{Var}\left[ \frac{\partial}{\partial\theta} \log p(X, \theta) \right] = \text{Var}\left[ \sum_{i=1}^n \frac{\partial}{\partial\theta} \log f(X_i, \theta) \right] = \sum_{i=1}^n \text{Var}\left[ \frac{\partial}{\partial\theta} \log f(X_i, \theta) \right] = n I_1(\theta). \quad □$$

$I_1(\theta)$ is often referred to as the information contained in one observation. We have just shown that the information $I(\theta)$ in a sample of size $n$ is $n I_1(\theta)$.

Next we note how we can apply the information inequality to the problem of unbiased estimation. If the family $\{P_\theta\}$ satisfies I and II and if there exists an unbiased estimate $T^*$ of $\psi(\theta)$ such that $\text{Var}_\theta[T^*(X)] = [\psi'(\theta)]^2/I(\theta)$ for all $\theta \in \Theta$, then $T^*$ is UMVU as an estimate of $\psi$.

Example 3.4.2. (Continued). For a sample from a $P(\theta)$ distribution, the MLE is $\hat{\theta} = \bar{X}$. Because $\bar{X}$ is unbiased and $\text{Var}(\bar{X}) = \theta/n$, $\bar{X}$ is UMVU.

Example 3.4.3. Suppose $X_1, \ldots, X_n$ is a sample from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. As we previously remarked, the conditions of the information inequality are satisfied. By Corollary 3.4.1 we see that the conclusion that $\bar{X}$ is UMVU follows if

$$\text{Var}(\bar{X}) = \frac{1}{n I_1(\theta)}. \tag{3.4.18}$$

Now $\text{Var}(\bar{X}) = \sigma^2/n$, whereas if $\varphi$ denotes the $N(0, 1)$ density, then

$$I_1(\theta) = E\left[ \frac{\partial}{\partial\theta} \log \frac{1}{\sigma} \varphi\left( \frac{X_1 - \theta}{\sigma} \right) \right]^2 = E\left[ \frac{X_1 - \theta}{\sigma^2} \right]^2 = \frac{1}{\sigma^2},$$

and (3.4.18) follows. Note that because $\bar{X}$ is UMVU whatever may be $\sigma^2$, we have in fact proved that $\bar{X}$ is UMVU even if $\sigma^2$ is unknown. □

We can similarly show (Problem 3.4.1) that if $X_1, \ldots, X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar{X}$ is a UMVU estimate of $\theta$. These are situations in which $X$ follows a one-parameter exponential family. This is no accident.

Theorem 3.4.2. Suppose that the family $\{P_\theta : \theta \in \Theta\}$ satisfies assumptions I and II and there exists an unbiased estimate $T^*$ of $\psi(\theta)$, which achieves the lower bound of Theorem 3.4.1 for every $\theta$. Then $\{P_\theta\}$ is a one-parameter exponential family with density or frequency function of the form

$$p(x, \theta) = h(x) \exp[\eta(\theta) T^*(x) - B(\theta)]. \tag{3.4.19}$$

Conversely, if $\{P_\theta\}$ is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic $T(X)$ and $\eta(\theta)$ has a continuous nonvanishing derivative on $\Theta$, then $T(X)$ achieves the information inequality bound and is a UMVU estimate of $E_\theta(T(X))$.
Proof. We start with the first assertion. Our argument is essentially that of Wijsman (1973). By (3.4.14) and the conditions for equality in the correlation inequality (A.11.16) we know that $T^*$ achieves the lower bound for all $\theta$ if, and only if, there exist functions $a_1(\theta)$ and $a_2(\theta)$ such that

$$\frac{\partial}{\partial\theta} \log p(X, \theta) = a_1(\theta) T^*(X) + a_2(\theta) \tag{3.4.20}$$

with $P_\theta$ probability 1 for each $\theta$. From this equality of random variables we shall show that $P_\theta[X \in A^*] = 1$ for all $\theta$ where

$$A^* = \left\{ x : \frac{\partial}{\partial\theta} \log p(x, \theta) = a_1(\theta) T^*(x) + a_2(\theta) \text{ for all } \theta \in \Theta \right\}. \tag{3.4.21}$$

Upon integrating both sides of (3.4.20) with respect to $\theta$ we get (3.4.19).

The passage from (3.4.20) to (3.4.19) is highly technical. However, it is necessary. Here is the argument. If $A_\theta$ denotes the set of $x$ for which (3.4.20) holds, then (3.4.20) guarantees $P_\theta(A_\theta) = 1$ and assumption I guarantees $P_{\theta'}(A_\theta) = 1$ for all $\theta'$ (Problem 3.4.6). Let $\theta_1, \theta_2, \ldots$ be a denumerable dense subset of $\Theta$. Note that if $A^{**} = \cap_m A_{\theta_m}$, then $P_{\theta'}(A^{**}) = 1$ for all $\theta'$. Suppose without loss of generality that $T^*(x_1) \ne T^*(x_2)$ for $x_1, x_2 \in A^{**}$. By solving for $a_1, a_2$ in

$$\frac{\partial}{\partial\theta} \log p(x_j, \theta) = a_1(\theta) T^*(x_j) + a_2(\theta) \tag{3.4.22}$$

for $j = 1, 2$, we see that $a_1, a_2$ are linear combinations of $\partial \log p(x_j, \theta)/\partial\theta$, $j = 1, 2$ and, hence, continuous in $\theta$. But now if $x$ is such that

$$\frac{\partial}{\partial\theta} \log p(x, \theta) = a_1(\theta) T^*(x) + a_2(\theta) \tag{3.4.23}$$

for all $\theta_1, \theta_2, \ldots$ and both sides are continuous in $\theta$, then (3.4.23) must hold for all $\theta$. Thus, $A^{**} = A^*$ and the result follows.

Conversely in the exponential family case (1.6.1) we assume without loss of generality (Problem 3.4.3) that we have the canonical case with $\eta(\theta) = \theta$ and $B(\theta) = A(\theta) = \log \int h(x) \exp\{\theta T(x)\}\,dx$. Then

$$\frac{\partial}{\partial\theta} \log p(x, \theta) = T(x) - A'(\theta) \tag{3.4.24}$$

so that

$$I(\theta) = \text{Var}_\theta(T(X) - A'(\theta)) = \text{Var}_\theta T(X) = A''(\theta). \tag{3.4.25}$$

But $\psi(\theta) = A'(\theta)$ and, thus, the information bound is $[A''(\theta)]^2/A''(\theta) = A''(\theta) = \text{Var}_\theta(T(X))$ so that $T(X)$ achieves the information bound as an estimate of $E_\theta T(X)$. □

Example 3.4.4. In the Hardy-Weinberg model of Examples 2.1.4 and 2.2.6,

$$p(x, \theta) = 2^{n_2} \exp\{(2n_1 + n_2) \log\theta + (2n_3 + n_2) \log(1 - \theta)\}$$
$$= 2^{n_2} \exp\{(2n_1 + n_2)[\log\theta - \log(1 - \theta)] + 2n \log(1 - \theta)\}$$
where we have used the identity $(2n_1 + n_2) + (2n_3 + n_2) = 2n$. Because this is an exponential family, Theorem 3.4.2 implies that $T = (2N_1 + N_2)/2n$ is UMVU for estimating

$$E(T) = (2n)^{-1}[2n\theta^2 + 2n\theta(1 - \theta)] = \theta.$$

This $T$ coincides with the MLE $\hat{\theta}$ of Example 2.2.6. The variance of $\hat{\theta}$ can be computed directly using the moments of the multinomial distribution of $(N_1, N_2, N_3)$, or by transforming $p(x, \theta)$ to canonical form by setting $\eta = \log[\theta/(1 - \theta)]$ and then using Theorem 1.6.2. A third method would be to use $\text{Var}(\hat{\theta}) = 1/I(\theta)$ and formula (3.4.25). We find (Problem 3.4.7) $\text{Var}(\hat{\theta}) = \theta(1 - \theta)/2n$. □

Note that by differentiating (3.4.24), we have

$$\frac{\partial^2}{\partial\theta^2} \log p(X, \theta) = -A''(\theta).$$

By (3.4.25) we obtain

$$I(\theta) = -E_\theta \frac{\partial^2}{\partial\theta^2} \log p(X, \theta). \tag{3.4.26}$$

It turns out that this identity also holds outside exponential families.

Proposition 3.4.3. Suppose $p(\cdot, \theta)$ satisfies in addition to I and II: $p(\cdot, \theta)$ is twice differentiable and interchange between integration and differentiation is permitted. Then (3.4.26) holds.

Proof. We need only check that

$$\frac{\partial^2}{\partial\theta^2} \log p(x, \theta) = \frac{1}{p(x, \theta)} \frac{\partial^2}{\partial\theta^2} p(x, \theta) - \left( \frac{\partial}{\partial\theta} \log p(x, \theta) \right)^2 \tag{3.4.27}$$

and integrate both sides with respect to $p(x, \theta)$. □

Example 3.4.2. (Continued). For a sample from a $P(\theta)$ distribution

$$E_\theta \left( -\frac{\partial^2}{\partial\theta^2} \log p(X, \theta) \right) = \theta^{-2} E\left( \sum_{i=1}^n X_i \right) = \frac{n}{\theta},$$

which equals $I(\theta)$. □

Discussion. It often happens, for instance, in the $U(0, \theta)$ example, that I and II fail to hold, although UMVU estimates exist. See Volume II. Even worse, as Theorem 3.4.2 suggests, in many situations, assumptions I and II are satisfied and UMVU estimates of $\psi(\theta)$ exist, but the variance of the best estimate is not equal to the bound $[\psi'(\theta)]^2/I(\theta)$. Sharpenings of the information inequality are available but don't help in general.

Extensions to models in which $\theta$ is multidimensional are considered next.

The multiparameter case

We will extend the information lower bound to the case of several parameters, $\theta = (\theta_1, \ldots, \theta_d)$. In particular, we will find a lower bound on the variance of an estimator $\hat{\theta}_1 = T$
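of $\theta_1$ when the parameters $\theta_2, \ldots, \theta_d$ are unknown.

Let $p(x, \theta)$ denote the density or frequency function of $X$ where $X \in \mathcal{X} \subset R^q$. We assume that $\Theta$ is an open subset of $R^d$ and that $\{p(x, \theta) : \theta \in \Theta\}$ is a regular parametric model with conditions I and II satisfied when differentiation is with respect to $\theta_j$, $j = 1, \ldots, d$.

The (Fisher) information matrix is defined as

$$I_{d \times d}(\theta) = (I_{jk}(\theta))_{1 \le j \le d,\ 1 \le k \le d}, \tag{3.4.28}$$

where

$$I_{jk}(\theta) = E_\theta \left( \frac{\partial}{\partial\theta_j} \log p(X, \theta) \frac{\partial}{\partial\theta_k} \log p(X, \theta) \right). \tag{3.4.29}$$

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a)
$$E_\theta \left( \frac{\partial}{\partial\theta_j} \log p(X, \theta) \right) = 0 \tag{3.4.30}$$

$$I_{jk}(\theta) = \text{Cov}_\theta \left( \frac{\partial}{\partial\theta_j} \log p(X, \theta),\ \frac{\partial}{\partial\theta_k} \log p(X, \theta) \right). \tag{3.4.31}$$

That is, $E_\theta(\nabla_\theta \log p(X, \theta)) = 0$ and $I(\theta) = \text{Var}_\theta(\nabla_\theta \log p(X, \theta))$.

(b) If $X_1, \ldots, X_n$ are i.i.d. as $X$, then $\mathbf{X} = (X_1, \ldots, X_n)^T$ has information matrix $n I_1(\theta)$ where $I_1$ is the information matrix of $X$.

(c) If, in addition, $p(\cdot, \theta)$ is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

$$I(\theta) = -\left\| E_\theta \left( \frac{\partial^2}{\partial\theta_j \partial\theta_k} \log p(X, \theta) \right) \right\|, \quad 1 \le j \le d,\ 1 \le k \le d. \tag{3.4.32}$$

Proof. The arguments follow the $d = 1$ case and are left to the problems.

Example 3.4.5. Suppose $X \sim N(\mu, \sigma^2)$, $\theta = (\mu, \sigma^2)$. Then

$$\log p(x, \theta) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log \sigma^2 - \frac{1}{2\sigma^2}(x - \mu)^2$$

$$I_{11}(\theta) = -E\left[ \frac{\partial^2}{\partial\mu^2} \log p(x, \theta) \right] = E[\sigma^{-2}] = \sigma^{-2}$$

$$I_{12}(\theta) = -E\left[ \frac{\partial}{\partial\sigma^2} \frac{\partial}{\partial\mu} \log p(x, \theta) \right] = \sigma^{-4} E(x - \mu) = 0 = I_{21}(\theta)$$

$$I_{22}(\theta) = -E\left[ \frac{\partial^2}{(\partial\sigma^2)^2} \log p(x, \theta) \right] = \sigma^{-4}/2.$$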
Thus, in this case

$$I(\theta) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & \sigma^{-4}/2 \end{pmatrix}. \tag{3.4.33}$$
□

Example 3.4.6. Canonical $k$-Parameter Exponential Family. Suppose

$$p(x, \theta) = \exp\left\{ \sum_{j=1}^k T_j(x) \theta_j - A(\theta) \right\} h(x) \tag{3.4.34}$$

$\theta \in \Theta$ open. The conditions I, II are easily checked and because

$$\nabla_\theta \log p(x, \theta) = T(X) - \dot{A}(\theta),$$

then

$$I(\theta) = \text{Var}_\theta T(X).$$

By (3.4.30) and Corollary 1.6.1,

$$I(\theta) = \text{Var}_\theta T(X) = \ddot{A}(\theta). \tag{3.4.35}$$
□

Next suppose $\hat{\theta}_1 = T$ is an estimate of $\theta_1$ with $\theta_2, \ldots, \theta_d$ assumed unknown. Let $\psi(\theta) = E_\theta T(X)$ and let $\dot{\psi}(\theta) = \nabla\psi(\theta)$ be the $d \times 1$ vector of partial derivatives. Then

Theorem 3.4.3. Assume the conditions of the opening paragraph hold and suppose that the matrix $I(\theta)$ is nonsingular. Then for all $\theta$, $\dot{\psi}(\theta)$ exists and

$$\text{Var}_\theta(T(X)) \ge \dot{\psi}^T(\theta) I^{-1}(\theta) \dot{\psi}(\theta). \tag{3.4.36}$$

Proof. We will use the prediction inequality $\text{Var}(Y) \ge \text{Var}(\mu_L(Z))$, where $\mu_L(Z)$ denotes the optimal MSPE linear predictor of $Y$; that is,

$$\mu_L(Z) = \mu_Y + (Z - \mu_Z)^T \Sigma_{ZZ}^{-1} \Sigma_{ZY}. \tag{3.4.37}$$

Now set $Y = T(X)$, $Z = \nabla_\theta \log p(X, \theta)$. Then

$$\text{Var}_\theta(T(X)) \ge \Sigma_{ZY}^T I^{-1}(\theta) \Sigma_{ZY} \tag{3.4.38}$$

where $\Sigma_{ZY} = E_\theta(T \nabla_\theta \log p(X, \theta)) = \nabla_\theta E_\theta(T(X))$ and the last equality follows from the argument in (3.4.13). □

Here are some consequences of this result.

Example 3.4.6. (continued). UMVU Estimates in Canonical Exponential Families. Suppose the conditions of Example 3.4.6 hold. We claim that each of $T_j(X)$ is a UMVU
) = Var(Tj(X»..A(0) j ne ej = (1 + E7 eel _ eOj ) .Section 3. .0). a'i X and Aj = P(X = j). = nE(T. .4. (1 + E7 / eel) = n>'j(1..6 with Xl.Xn)T. we let j = 1.. we transformed the multinomial model M(n.j. To see our claim note that in our case (3.0. n T.)/1£.41) because .. k. Thus.A(O)} whereTT(x) = (T!(x).(X) = I)IXi = j]. .4. AI. j=1 X = (XI. .. . (3. . hence. .. This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi. kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'. j. without loss of generality.>'j)/n.(1.40) J We claim that in this case .. In the multinomial Example 1. . X n i...i. .7. But because Nj/n is unbiased and has 0 variance >'j(1 .4 8'A rl(O)= ( 8080 t )1 kx k .p(OJrI(O).6.. then Nj/n is UMVU for >'j. .A.. the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. i =j:.3.>'. . Multinomial Trials.T(O) = ~:~ I (3.p(0) is the first row of 1(0) and.4.A(O) is just VarOT! (X).4.. j = 1.. We have already computed in Proposition 3. are known. .(X)) 82 80. Example 3. " Ok_il T. . . is >.39) where. But f.4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X). ..4.Tk_I(X))..O) = exp{TT(x)O .d.p(0)11(0) ~ (1. 0 = (0 1 . by Theorem 3.4.. .. Ak) to the canonical form p(x.
Example 3.4.8. The Normal Case. If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ then $\bar{X}$ is UMVU for $\mu$ and $\frac{1}{n} \sum X_i^2$ is UMVU for $\mu^2 + \sigma^2$. But it does not follow that $\frac{1}{n-1} \sum (X_i - \bar{X})^2$ is UMVU for $\sigma^2$. These and other examples and the implications of Theorem 3.4.3 are explored in the problems. □

Here is an important extension of Theorem 3.4.3 whose proof is left to Problem 3.4.21.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and $T(X)$ is a $d$-dimensional statistic. Let

$$\psi(\theta) = E_\theta(T(X))_{d \times 1} = (\psi_1(\theta), \ldots, \psi_d(\theta))^T$$

and $\dot{\psi}(\theta) = \left( \frac{\partial\psi_i}{\partial\theta_j}(\theta) \right)_{d \times d}$. Then

$$\text{Var}_\theta T(X) \ge \dot{\psi}(\theta) I^{-1}(\theta) \dot{\psi}^T(\theta) \tag{3.4.42}$$

where $A \ge B$ means $a^T(A - B)a \ge 0$ for all $a_{d \times 1}$.

Note that both sides of (3.4.42) are $d \times d$ matrices. Also note that

$$\hat{\theta} \text{ unbiased} \Rightarrow \text{Var}_\theta \hat{\theta} \ge I^{-1}(\theta).$$

In Chapters 5 and 6 we show that in smoothly parametrized models, reasonable estimates are asymptotically unbiased. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal.

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that in a canonical exponential family, $T(X)$ is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates.

3.5 NONDECISION THEORETIC CRITERIA

In practice, even if the loss function and model are well specified, features other than the risk function are also of importance in selection of a procedure. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist,
Björk, and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.

Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, a method of moments estimate of $(\lambda, p)$ in Example 2.3.2 is given by

$$\hat{\lambda} = \frac{\bar{X}}{\hat{\sigma}^2}, \quad \hat{p} = \frac{\bar{X}^2}{\hat{\sigma}^2}$$

where $\hat{\sigma}^2$ is the empirical variance (Problem 2.2.11). It is clearly easier to compute than the MLE. Of course, with ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.

On the other hand, consider the Gaussian linear model of Example 2.1.1. Then least squares estimates are given in closed form by equation (2.1.10). The closed form here is deceptive because inversion of a $d \times d$ matrix takes on the order of $d^3$ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve equation (2.1.9) by, say, Gaussian elimination for the particular $Z_D^T Y$.

Faster versus slower algorithms

Consider estimation of the MLE $\hat{\theta}$ in a general canonical exponential family as in Section 2.3. It may be shown that, in the algorithm we discuss in Section 2.4.2, if we seek to take enough steps $J$ so that $|\hat{\theta}^{(J)} - \hat{\theta}| \le \varepsilon < 1$ then $J$ is of the order of $\log\frac{1}{\varepsilon}$ (Problem 3.5.1). On the other hand, at least if started close enough to $\hat{\theta}$, the Newton-Raphson method in which the $j$th iterate is

$$\hat{\theta}^{(j)} = \hat{\theta}^{(j-1)} + \ddot{A}^{-1}(\hat{\theta}^{(j-1)}) \left( T(X) - \dot{A}(\hat{\theta}^{(j-1)}) \right)$$

takes on the order of $\log\log\frac{1}{\varepsilon}$ steps (Problem 3.5.2). The improvement in speed may however be spurious since $\ddot{A}^{-1}$ is costly to compute if $d$ is large, though the same trick as in computing least squares estimates can be used.

The interplay between estimated variance and computation

As we have seen in special cases in Examples 3.4.3 and 3.4.4, estimates of parameters based on samples of size $n$ have standard deviations of order $n^{-1/2}$. It follows that striving for numerical accuracy of order smaller than $n^{-1/2}$ is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved.

3.5.2 Interpretability

Suppose that in the normal $N(\mu, \sigma^2)$ Example 2.1.5 we are interested in the parameter $\mu/\sigma$. This parameter, the signal-to-noise ratio, for this population of measurements has a clear interpretation. Its maximum likelihood estimate $\bar{X}/\hat{\sigma}$ continues to have the same intuitive interpretation as an estimate of $\mu/\sigma$ even if the data are a sample from a distribution with
mean $\mu$ and variance $\sigma^2$ other than the normal. On the other hand, suppose we initially postulate a model in which the data are a sample from a gamma, $\Gamma(p, \lambda)$, distribution. Then $E(X)/\sqrt{\text{Var}(X)} = (p/\lambda)(p/\lambda^2)^{-1/2} = p^{1/2}$. We can now use the MLE $\hat{p}^{1/2}$, which as we shall see later (Section 5.4) is for $n$ large a more precise estimate than $\bar{X}/\hat{\sigma}$ if this model is correct. However, the form of this estimate is complex and if the model is incorrect it no longer is an appropriate estimate of $E(X)/[\text{Var}(X)]^{1/2}$. We return to this in Section 5.5.

3.5.3 Robustness

Finally, we turn to robustness.

This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. However, what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). We consider three situations.

(a) The problem dictates the parameter. For instance, the Hardy-Weinberg parameter $\theta$ has a clear biological interpretation and is the parameter for the experiment described in Example 2.1.4. Similarly, economists often work with median housing prices, that is, the parameter $\nu$ that has half of the population prices on either side (formally, $\nu$ is any value such that $P(X \le \nu) \ge \frac{1}{2}$, $P(X \ge \nu) \ge \frac{1}{2}$). Alternatively, they may be interested in total consumption of a commodity such as coffee, say $\theta = N\mu$, where $N$ is the population size and $\mu$ is the expected consumption of a randomly drawn individual.

(b) We imagine that the random variable $X^*$ produced by the random experiment we are interested in has a distribution that follows a "true" parametric model with an interpretable parameter $\theta$, but we do not necessarily observe $X^*$. The actual observation $X$ is $X^*$ contaminated with "gross errors"; see the following discussion. But $\theta$ is still the target in which we are interested.

(c) We have a qualitative idea of what the parameter is, but there are several parameters that satisfy this qualitative notion. This idea has been developed by Bickel and Lehmann (1975a, 1975b, 1976) and Doksum (1975), among others. For instance, we may be interested in the center of a population, and both the mean $\mu$ and median $\nu$ qualify. See Problem 3.5.13.

We will consider situations (b) and (c).

Gross error models

Most measurement and recording processes are subject to gross errors, anomalous values that arise because of human error (often in recording) or instrument malfunction. To be a bit formal, suppose that if $n$ measurements $X^* = (X_1^*, \ldots, X_n^*)$ could be taken without gross errors then $P^* \in \mathcal{P}^*$ would be an adequate approximation to the distribution of $X^*$ (i.e., we could suppose $X^* \sim P^* \in \mathcal{P}^*$). However, if gross errors occur, we observe not $X^*$ but $X = (X_1, \ldots, X_n)$ where most of the $X_i = X_i^*$, but there are a few
wild values. Now suppose we want to estimate $\theta(P^*)$ and use $\hat{\theta}(X_1, \ldots, X_n)$ knowing that $\hat{\theta}(X_1^*, \ldots, X_n^*)$ is a good estimate. Informally $\hat{\theta}(X_1, \ldots, X_n)$ will continue to be a good or at least reasonable estimate if its value is not greatly affected by the $X_i \ne X_i^*$, the gross errors. Again informally we shall call such procedures robust. Formal definitions require model specification, specification of the gross error mechanism, and definitions of insensitivity to gross errors. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. However, two notions, the sensitivity curve and the breakdown point, make sense for fixed $n$. The breakdown point will be discussed in Volume II. We next define and examine the sensitivity curve in the context of the Gaussian location model, Example 1.1.2, and then more generally.

Consider the one-sample symmetric location model $\mathcal{P}$ defined by

$$X_i = \mu + \varepsilon_i, \quad i = 1, \ldots, n, \tag{3.5.1}$$

where the errors are independent, identically distributed, and symmetric about 0 with common density $f$ and d.f. $F$. If the error distribution is normal, $\bar{X}$ is the best estimate in a variety of senses.

In our new formulation it is the $X_i^*$ that obey (3.5.1). A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the $\varepsilon_i$ still i.i.d. but with common distribution function $F$ and density $f$ of the form

$$f(x) = (1 - \lambda) \frac{1}{\sigma} \varphi\left( \frac{x}{\sigma} \right) + \lambda h(x). \tag{3.5.2}$$

Here $h$ is the density of the gross errors and $\lambda$ is the probability of making a gross error. This corresponds to, for example, $X_i = X_i^*$ with probability $1 - \lambda$ and $X_i = Y_i$ with probability $\lambda$, where $Y_i$ has density $h(y - \mu)$ and $(X_i^*, Y_i)$ are i.i.d. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of $X^*$. Further assumptions that are commonly made are that $h$ has a particular form, for example, $h = \frac{1}{K\sigma} \varphi\left( \frac{x}{K\sigma} \right)$ where $K \gg 1$ or more generally that $h$ is an unknown density symmetric about 0. Then the gross error model is semiparametric, $\mathcal{P}_\varepsilon \equiv \{f(\cdot - \mu) : f$ satisfies (3.5.2) for some $h$ such that $h(x) = h(-x)$ for all $x\}$. $P_{(\mu, f)} \in \mathcal{P}_\varepsilon$ iff $X_1, \ldots, X_n$ are i.i.d. with common density $f(x - \mu)$, where $f$ satisfies (3.5.2). The advantage of this formulation is that $\mu$ remains identifiable. That is, it is the center of symmetry of $P_{(\mu, f)}$ for all such $P$. Unfortunately, the assumption that $h$ is itself symmetric about 0 seems patently untenable for gross errors. However, if we drop the symmetry assumption, we encounter one of the basic difficulties in formulating robustness in situation (b). Without $h$ symmetric the quantity $\mu$ is not a parameter, so it is unclear what we are estimating. That is, it is possible to have $P_{(\mu_1, f_1)} = P_{(\mu_2, f_2)}$ for $\mu_1 \ne \mu_2$ (Problem 3.5.18). Is $\mu_1$ or $\mu_2$ our goal? On the other hand, in situation (c), we do not need the symmetry assumption. We return to these issues in Chapter 6.

The sensitivity curve

At this point we ask: Suppose that an estimate $T(X_1, \ldots, X_n) = \theta(\hat{F})$, where $\hat{F}$ is the empirical d.f., is appropriate for the symmetric location model, $\mathcal{P}$, in particular, has the plug-in property, $\theta(P_{(\mu, f)}) = \mu$ for all $P_{(\mu, f)} \in \mathcal{P}$. How sensitive is it to the presence of gross errors among $X_1, \ldots, X_n$? An interesting way of studying this due to Tukey (1972) and Hampel (1974) is the sensitivity curve defined as follows for plug-in estimates (which are well defined for all sample sizes $n$).
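To make the gross error model (3.5.2) concrete, the following sketch evaluates the contaminated density for the special case in which $h$ is the $N(0, K^2\sigma^2)$ density, and computes the variance of the errors. It is an illustration under assumptions, not part of the original text: the values $\lambda = 0.05$ and $K = 10$, the integration range, and the function names are hypothetical. Because both mixture components have mean 0, the error variance is $(1 - \lambda)\sigma^2 + \lambda K^2\sigma^2$, and the variance of $\bar{X}$ is this quantity divided by $n$.

```python
from math import exp, pi, sqrt


def normal_density(x, sd):
    """Density of N(0, sd^2) at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))


def contaminated_density(x, lam, sigma, K):
    """f(x) = (1 - lam) * N(0, sigma^2) density + lam * N(0, (K sigma)^2) density,
    i.e. (3.5.2) with the assumed choice h = N(0, K^2 sigma^2) density."""
    return (1 - lam) * normal_density(x, sigma) + lam * normal_density(x, K * sigma)


def contaminated_variance(lam, sigma, K):
    """Variance of the mixture; both components are centered at 0."""
    return (1 - lam) * sigma ** 2 + lam * (K * sigma) ** 2


def integrate(g, lo, hi, steps=40_000):
    """Plain midpoint rule, adequate for these smooth integrands."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (i + 0.5) * width) for i in range(steps))


lam, sigma, K = 0.05, 1.0, 10.0   # assumed illustrative values
total_mass = integrate(lambda x: contaminated_density(x, lam, sigma, K), -100, 100)
numeric_var = integrate(lambda x: x * x * contaminated_density(x, lam, sigma, K), -100, 100)
inflation = contaminated_variance(lam, sigma, K) / sigma ** 2
```

Under these assumed values, a 5% chance of a gross error multiplies the error variance, and hence the variance of the sample mean, by `inflation`, which is one way to see why the mean is a poor choice when gross errors are possible.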
We start by defining the sensitivity curve for general plug-in estimates. Suppose that $X \sim P$ and that $\theta = \theta(P)$ is a parameter. The empirical plug-in estimate of $\theta$ is $\hat{\theta} = \theta(\hat{P})$ where $\hat{P}$ is the empirical probability distribution. See (2.1.16), (2.1.17). The sensitivity curve of $\hat{\theta}$ is defined as

$$SC(x; \hat{\theta}) = n[\hat{\theta}(x_1, \ldots, x_{n-1}, x) - \hat{\theta}(x_1, \ldots, x_{n-1})],$$

where $x_1, \ldots, x_{n-1}$ represents an observed sample of size $n - 1$ from $P$ and $x$ represents an observation that (potentially) comes from a distribution different from $P$. We are interested in the shape of the sensitivity curve, not its location. In our examples we shall, therefore, shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas. Often this is done by fixing $x_1, \ldots, x_{n-1}$ as an "ideal" sample of size $n - 1$ for which the estimator $\hat{\theta}$ gives us the right value of the parameter and then we see what the introduction of a potentially deviant $n$th observation $x$ does to the value of $\hat{\theta}$.

We return to the location problem with $\theta$ equal to the mean $\mu = E(X)$. Because the estimators we consider are location invariant, that is, $\hat{\theta}(X_1, \ldots, X_n) - \mu = \hat{\theta}(X_1 - \mu, \ldots, X_n - \mu)$, and because $E(X_j - \mu) = 0$, we take $\mu = 0$ without loss of generality. Now fix $x_1, \ldots, x_{n-1}$ so that their mean has the ideal value zero. This is equivalent to shifting the SC vertically to make its value at $x = 0$ equal to zero. See Problem 3.5.14. Then

$$SC(x; \bar{X}) = n\left( \frac{x_1 + \cdots + x_{n-1} + x}{n} \right) = x.$$

Thus, the sample mean is arbitrarily sensitive to gross error: a large gross error can throw the mean off entirely. Are there estimates that are less sensitive?

A classical estimate of location based on the order statistics is the sample median $\tilde{X}$ defined by

$$\tilde{X} = X_{(k+1)} \text{ if } n = 2k + 1, \qquad \tilde{X} = \tfrac{1}{2}(X_{(k)} + X_{(k+1)}) \text{ if } n = 2k,$$

where $X_{(1)}, \ldots, X_{(n)}$ are the order statistics, that is, $X_1, \ldots, X_n$ ordered from smallest to largest. See Section 2.1.2 and Problem 2.2.32.

The sample median can be motivated as an estimate of location on various grounds.

(i) It is the empirical plug-in estimate of the population median $\nu$ (Problem 3.5.4), and it splits the sample into two equal halves.
9. x is an empirical (iii) The sample median is the MLE when we assume the common density f(x) of the errors {cd in (3.5. < X(nl) are the ordered XI.5.5.1) is the Laplace (double exponential) density f(x) = 1 exp{l x l/7}.1. XnI.. v coincides with fL and plugin estimate of fL.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when X is near Ji..5 Nondecision Theoretic Criteria 193 (ii) In the symmetric location model (3. its perfonnance at the nonnal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X. n = 2k + 1 is odd and the median of Xl..1). The sensitivity curve in Figure 3.5. x) nx(k) nx = _nx(k+l) for for < x(k) for x(k) < x x <x(k+I) nx(k+l) x> x(k+l) where xCI) < . we obtain .Xn_l = (X(k) + X(ktl)/2 = 0. . A class of estimates providing such intermediate behavior and including both the mean and . SC(x. The sensitivity curve of the median is as follows: If.. SC(x) SC(x) x x Figure 3. 27 a density having substantially heavier tails than the normaL See Problems 2. The sensitivity curves of the mean and median.2.5. . Although the median behaves well when gross errors are expected.32 and 3. .. say..Section 3.
5).2. .5.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution. Intuitively we expect that if there are no gross errors..8) than the Gaussian density. 4e'x'. Note that if Q = 0.1 again.2[nn]/n)I.1. Xu = X. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5. by <Q < 4. the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3. See Andrews. If [na] = [(n . ... Figure 3. There has also been some research into procedures for which a is chosen using the observations. and Hogg (1974). Hampel. f = <p.. Xa  X. whereas as Q i ~. infinitely better in the case of the Cauchysee Problem 5. Bickel.4. then the trimmed means for a > 0 and even the median can be much better than the mean. The estimates can be justified on plugin grounds (see Problem 3. The range 0. the Laplace density.2[nnl where [na] is the largest integer < nO' and X(1) < .5. The sensitivity curve of the trimmed mean. (The middle portion is the line y = x(1 . i I . Haber.) SC(x) X x(n[naJ) . + X(n[nuj) n . . . Rogers. . We define the a (3." see Jaeckel (1971). However.5... we throw out the "outer" [na] observations on either side and take the average of the rest.194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century.3) Xu = X([no]+l) + . Which a should we choose in the trimmed mean? There seems to be no simple answer. the sensitivity curve calculation points to an equally intuitive conclusion. f(x) = or even more strikingly the Cauchy. < X (n) are the ordered observations. For more sophisticated arguments see Huber (1981). That is. f(x) = 1/11"(1 + x 2 ).10 < a < 0. suppose we take as our data the differences in Table 3.l)Q] and the trimmed mean of Xl. Huber (1972).5. l Xnl is zero. for example. the mean is better than any trimmed mean with a > 0 including the median.2. 
Xa. Let 0 trimmed mean. For instance. If f is symmetric about 0 but has "heavier tails" (see Problem 3. that is. 4.. For a discussion of these and other forms of "adaptation.1.5. . and Tukey (1972).5.4. I I ! .
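The sensitivity curves of the mean, median, and trimmed mean can be checked numerically. The following is a minimal sketch, not part of the original text: the function names are hypothetical, and the four-point "ideal" sample is the one suggested in the problems for this section (expected values of four $N(0, 1)$ order statistics), so that $n = 5$ and, for $\alpha = 1/4$, $[n\alpha] = [(n-1)\alpha] = 1$.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def median(xs):
    # Sample median: middle order statistic for odd n,
    # average of the two middle order statistics for even n.
    s = sorted(xs)
    n = len(s)
    k = n // 2
    return s[k] if n % 2 == 1 else 0.5 * (s[k - 1] + s[k])


def trimmed_mean(xs, alpha):
    # alpha trimmed mean as in (3.5.3): drop the [n*alpha] smallest and
    # the [n*alpha] largest observations, then average the rest.
    n = len(xs)
    m = math.floor(n * alpha)
    kept = sorted(xs)[m:n - m]
    return sum(kept) / len(kept)


def sensitivity_curve(estimator, ideal, x):
    # SC(x) = n * [estimate on (ideal sample plus x) - estimate on ideal sample],
    # where the ideal sample has size n - 1.
    n = len(ideal) + 1
    return n * (estimator(ideal + [x]) - estimator(ideal))


# Assumed ideal sample of size n - 1 = 4; its mean, median and
# 1/4 trimmed mean are all zero.
ideal = [-1.03, -0.30, 0.30, 1.03]

for x in (-100.0, -1.0, 0.1, 1.0, 100.0):
    print(
        x,
        sensitivity_curve(mean, ideal, x),
        sensitivity_curve(median, ideal, x),
        sensitivity_curve(lambda xs: trimmed_mean(xs, 0.25), ideal, x),
    )
```

Under these assumptions the mean's curve grows without bound in $x$, while the median's and the trimmed mean's curves flatten once $x$ leaves the central part of the ideal sample, in line with Figures 3.5.1 and 3.5.2.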
Gross errors or outlying data points affect estimates in a variety of situations. We next consider two estimates of the spread in the population as well as estimates of quantiles; other examples will be given in the problems. If we are interested in the spread of the values in a population, then the variance $\sigma^2$ or standard deviation $\sigma$ is typically used. A fairly common quick and simple alternative is the IQR (interquartile range) defined as $\tau = x_{.75} - x_{.25}$, where $x_\alpha$ has $100\alpha$ percent of the values in the population on its left (formally, $x_\alpha$ is any value such that $P(X \le x_\alpha) \ge \alpha$, $P(X \ge x_\alpha) \ge 1 - \alpha$). $x_\alpha$ is called an $\alpha$th quantile and $x_{.75}$ and $x_{.25}$ are called the upper and lower quartiles. The IQR is often calibrated so that it equals $\sigma$ in the $N(\mu, \sigma^2)$ model. Because $\tau = 2 \times (.674)\sigma$, the scale measure used is $0.742(x_{.75} - x_{.25})$.

Example 3.5.1. Spread. Let $\theta(P) = \mathrm{Var}(X) = \sigma^2$ denote the variance in a population and let $X_1, \ldots, X_n$ denote a sample from that population. Then $\hat\sigma_n^2 = n^{-1}\sum_{i=1}^n (X_i - \bar X)^2$ is the empirical plug-in estimate of $\sigma^2$. To simplify our expression we shift the horizontal axis so that $\sum_{i=1}^{n-1} x_i = 0$. Write $\bar x_n = n^{-1}\sum_{i=1}^n x_i = n^{-1}x$, then

$$\begin{aligned} SC(x; \hat\sigma^2) &= n(\hat\sigma_n^2 - \hat\sigma_{n-1}^2) \\ &= \sum_{i=1}^{n-1}(x_i - n^{-1}x)^2 + (x - n^{-1}x)^2 - n\hat\sigma_{n-1}^2 \\ &= \frac{n-1}{n}\,x^2 - \hat\sigma_{n-1}^2 \cong x^2 - \hat\sigma_{n-1}^2 \end{aligned} \qquad (3.5.4)$$

where the approximation is valid for $x$ fixed, $n \to \infty$ (Problem 3.5.10). It is clear that $\hat\sigma_n^2$ is very sensitive to large outlying $|x|$ values.

Example 3.5.2. Quantiles and the IQR. Let $\theta(P) = x_\alpha$ denote an $\alpha$th quantile of the distribution of $X$, $0 < \alpha < 1$, and let $\hat x_\alpha$ denote the $\alpha$th sample quantile (see 2.1.16). If $n\alpha$ is an integer, say $k$, the $\alpha$th sample quantile is $\hat x_\alpha = \frac{1}{2}[x_{(k)} + x_{(k+1)}]$, and at sample size $n - 1$, $\hat x_\alpha = x_{(k)}$, where $x_{(1)} \le \cdots \le x_{(n-1)}$ are the ordered $x_1, \ldots, x_{n-1}$; thus, for $2 \le k \le n - 2$,

$$SC(x; \hat x_\alpha) = \begin{cases} \frac{n}{2}[x_{(k-1)} - x_{(k)}], & x \le x_{(k-1)} \\ \frac{n}{2}[x - x_{(k)}], & x_{(k-1)} \le x \le x_{(k+1)} \\ \frac{n}{2}[x_{(k+1)} - x_{(k)}], & x \ge x_{(k+1)}. \end{cases} \qquad (3.5.5)$$

Clearly, $\hat x_\alpha$ is not sensitive to outlying $x$'s.

Next consider the sample IQR $\hat\tau = \hat x_{.75} - \hat x_{.25}$. Then we can write

$$SC(x; \hat\tau) = SC(x; \hat x_{.75}) - SC(x; \hat x_{.25})$$

and the sample IQR is robust with respect to outlying gross errors $x$.

Remark 3.5.1. The sensitivity of the parameter $\theta(F)$ to $x$ can be measured by the influence function, which is defined by

$$IF(x; \theta, F) = \lim_{\epsilon \downarrow 0} IF_\epsilon(x; \theta, F)$$

where

$$IF_\epsilon(x; \theta, F) = \epsilon^{-1}[\theta((1 - \epsilon)F + \epsilon\Delta_x) - \theta(F)]$$

and $\Delta_x$ is the distribution function of point mass at $x$ ($\Delta_x(t) = 1[t \ge x]$). It is easy to see that (Problem 3.5.15)

$$SC(x; \hat\theta) = IF_{1/n}(x; \theta, \hat F_{n-1})$$

where $\hat F_{n-1}$ denotes the empirical distribution based on $x_1, \ldots, x_{n-1}$. We will return to the influence function in Volume II. It plays an important role in functional expansions of estimates.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively and a number of procedures proposed and implemented. Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to have been overcome lately. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseeuw, and Stahel (1986).

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability and computability. Most of the section focuses on robustness, discussing the difficult issues of identifiability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean, trimmed mean, median, and other procedures.
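The contrast between the variance and the IQR can also be seen numerically. The sketch below is an illustration under stated assumptions rather than the text's own computation: the function names are hypothetical, the seven-point ideal sample is made up so that its sum is zero, and the sample quantile convention (average $x_{(k)}$ and $x_{(k+1)}$ when $n\alpha = k$ is an integer, otherwise take $x_{([n\alpha]+1)}$) is assumed to match the one used in Example 3.5.2.

```python
import math


def plug_in_variance(xs):
    # Empirical plug-in estimate of the variance (divisor n, not n - 1).
    n = len(xs)
    m = sum(xs) / n
    return sum((v - m) ** 2 for v in xs) / n


def sample_quantile(xs, alpha):
    # Assumed convention: if n*alpha is an integer k, average the k-th and
    # (k+1)-th order statistics; otherwise take order statistic [n*alpha] + 1.
    s = sorted(xs)
    n = len(s)
    k = n * alpha
    if abs(k - round(k)) < 1e-9:
        k = int(round(k))
        return 0.5 * (s[k - 1] + s[k])
    return s[math.floor(k)]


def iqr(xs):
    return sample_quantile(xs, 0.75) - sample_quantile(xs, 0.25)


def sensitivity_curve(estimator, ideal, x):
    # SC(x) = n * [estimate on (ideal sample plus x) - estimate on ideal sample].
    n = len(ideal) + 1
    return n * (estimator(ideal + [x]) - estimator(ideal))


# Hypothetical ideal sample of size n - 1 = 7 with sum zero, so n = 8 and
# n * 0.25 = 2, n * 0.75 = 6 are integers.
ideal = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

for x in (0.0, 10.0, 100.0, 1000.0):
    print(
        x,
        sensitivity_curve(plug_in_variance, ideal, x),
        sensitivity_curve(iqr, ideal, x),
    )
```

With the sum of the ideal sample equal to zero, the variance curve equals $\frac{n-1}{n}x^2 - \hat\sigma_{n-1}^2$, which grows quadratically in $x$, whereas the IQR curve stops changing once $x$ lies beyond the neighbors of the two sample quartiles.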
2. (a) Show that the joint density of X and 0 is f(x. In Problem 3. Hint: See Problem 1.O)P. (3(r.0 E e. 6.6 PROBLEMS AND COMPLEMENTS Problems for Section 3. x) where c(x) =. .2.. the Bayes rule does not change. respectively. the parameter A = 01(1 .. change the loss function to 1(0. (X I 0 = 0) ~ p(x I 0). ~ ~ 4. Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.1.2. That is. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin. Find the Bayes risk r(7f. 71') = R(7f) /r( 71'. density. J) ofJ(x) = X in Example 3. . under what condition on S does the Bayes rule exist and what is the Bayes rule? 5. a(O) > 0.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite. 0) = (0 . In some studies (see Section 6.0)' and that the prinf7r(O) is the beta. s). . lf we put a (improper) uniform prior on A. where OB is the Bayes estimate of O. Show that (h) Let 1(0.4. which is called the odds ratio (for success).3. Consider the relative risk e( J. (72) sample and 71" is the improper prior 71"(0) = 1.2 with unifonn prior on the probabilility of success O. is preferred to 0. E = R. 0) is the quadratic loss (0 .Section 3.0) and give the Bayes estimate of q(O).3). 0) the Bayes rule is where fo(x. 3. Show that if Xl.Xn be the indicators of n Bernoulli trials with success probability O. then the improper Bayes rule for squared error loss is 6"* (x) = X. give the MLE of the Bernoulli variance q(0) = 0(1 . Suppose 1(0.2.6 Problems and Complements 197 3.. (c) In Example 3.2 preceeding. we found that (S + 1)/(n + 2) is the Bayes rule. Let X I. J). = (0 o)'lw(O) for some weight function w(O) > 0.2 o e 1.0)/0(0).a)' jBQ(1.4.. 1 X n is a N(B. Check whether q(OB) = E(q(IJ) I x). In the Bernoulli Problem 3. if 71' and 1 are changed to 0(0)71'(0) and 1(0.0). 
Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior.2.r 7f(0)p(x I O)dO.O) = p(x I 0)71'(0) = c(x)7f(0 . Suppose IJ ~ 71'(0). Compute the limit of e( J. 71') as .24. where R( 71') is the Bayes risk.
by definition.a)' / nj~. There are two possible actions: 0'5 a a with losses I( B. Measures of Performance Chapter 3 (b) n . 1). then E(O)) = aj/ao... Note that ).9j ) = aiO:'j/a5(aa + 1). bioequivalent. 7. . Find the Bayes decision rule. .15. B.• N.2(d)(i) and (ii). I j l .I(d).exp {  2~' B' } .. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x).. . A regulatory " ! I i· .Xu are Li.\(B) = r . i: agency specifies a number f > a such that if f) E (E.. (b) Problem 1.\(B) = I(B. Xl. and Cov{8j .(0) should be negative when 0 E (E.B. Assume that given f). defined in Problem 1.) with loss function I( B. E) and positive when f) 1998) is 1. E).. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule. Hint: If 0 ~ D(a). a = (a" . Cr are given constants. where E(X) = O. do not derive them.d.3. c2 > 0 I. Suppose tbat N lo .E). (e) a 2 + 00.19(c). .O)  I(B. where CI. One such function (Lindley. • 9.  I . . a) = [q(B)a]'. B). a) = Lj~. a r )T.ft B be the difference in mean effect of the generic and namebrand drugs. I ~' . 76) distribution. 0) and I( B.3. I) = difference in loss of acceptance and rejection of bioequivalence.) (b) When the loss function is I(B. . .=l CjUj .. (. (Bj aj)2. . a) = (q(B) . For the following problems. B . Var(Oj) = aj(ao . (Use these results.. E).Xn ) we want to decide whether or not 0 E (e. and that 9 is random with aN(rlO. Set a{::} Bioequivalent 1 {::} Not Bioequivalent . (a) Problem 1.. (E. (a) If I(B. then the generic and brandname drugs are.aj)/a5(ao + 1). equivalent to a namebrand drug. " X n of differences in the effect of generic and namebrand effects fora certain drug. and that 0 has the Dirichlet distribution D( a). Let q( 0) = L:. (c) Problem 1. Bioequivalence trials are used to test whether a generic drug is. . . .. N{O.O'5). where is known.. . 00. (c) We want to estimate the vector (B" . where ao = L:J=l Qj. B = (B" . 
given 0 = B are multinomial M(n. Suppose we have a sample Xl. Let 0 = ftc . compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. to a close approximation.2. 8..3.)T. On the basis of X = (X I.198 (a) T + 00...
Yo) to be a saddle point is that. representing x = (Xl.2.16) and (3.2. 0.Yp). Any two functions with difference "\(8) are possible loss functions at a = 0 and I. yo) is a saddle point of 9 if g(xo. (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3.. and 9 is twice differentiable.3 1. (Xv.y = (YI.17).6." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). v) > ?r/(l ?r) is equivalent to T > t.Yo) Xi {}g {}g Yj = 0.Yo) = {} (Xo.2.. Yo) is in the interior of S x T.. Yo) = inf g(xo. S T Suppose S and T are subsets of Rm.\(±€. .) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l((). In Example 3. find (a) the linear Bayes estimate of ~l. . 0) and l(()..1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. . .1) < (T6(n) + c'){log(rg(~)+. respectively.3.6 Problems and Complements 199 where 0 < r < 1. Yo) = sup g(x.. 1) are not constant. A point (xo. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3..6.) + ~}" . Con 10... (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00"). RP.1.Xm). S x T ~ R. Note that .Section 3. (a) Show that a necessary condition for (xo. {} (Xo. For the model defined by (3. Suppose 9 . .2 show that L(x. (b) the linear Bayes estimate of fl. Discuss the preceding decision rule for this "prior. Hint: See Example 3. 2. + = 0 and it 00"). y).
i. fJ( . 0») equals 1 when B = ~. I}. • " = '!.a)'.Xn ) . Let X" .a..j)=Wij>O. . 5. N(I'.. the conclusion of von Neumann's theorem holds.d< p.:.2.0•• ). B ) and suppose that Lx (B o. (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is. Hint: See Problem 3.. and show that this limit ( . &* is minimax. . (b) Show that limn_ooIR(O.=1 CijXiYj with XE 8 m .1(0.n12).i) =0. BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. n: I>l is minimax. Suppose e = {Bo. A = {O. Bd. a) ~ (0 .andg(x.Yo) > 0 8 8 8 X a 8 Xb 0. the test rule 1511" given by o. iij.thesimplex.b < m. d) = (a) Show that if I' is known to be 0 (!. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = .n). and 1 . B1 ) Ip(X.n)/(n + . and o'(S) = (S + 1 2 . Yo .2. 2 J'(X1 .)w". L:.. • . 1 <j. X n be i. f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta.0). y ESp..(X) = 1 if Lx (B o.n12. Suppose i I(Bi.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo. . I j 3. B1 ) >  (l. . Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. Let S ~ 8(n...<7') and 1(<7'. B ) ~ p(X.. o(S) = X = Sin." 1 Xi ~ l}.c.. I • > 1 for 0 i ~. f._ I)'. 4. (b) Suppose Sm = (x: Xi > 0.d.l.= 1 .y) ~ L~l 12. Yc Yd foraHI < i.o•• ) =R(B1 . Let Lx (Bo.1 < i < m. o')IR(O.. I(Bi. Thus. prior. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo. and that the mndel is regnlar.j=O. i..PIB = Bo]. .
P. X k ) = (R l . . ik) is smaller than that of (i'l"'" i~).a < b" = 'lb. that both the MLE and the estimate 8' = (n . .\ . .) . . .l . an d R a < R b.\) distribution and 1(. Rk) where Rj is the rank of Xi.k. prior. M (n.d) where qj ~ 1 . Xl· (e) Show that if I' is unknown. . < im. . Let Xi beindependentN(I'i.1.X)' is best among all rules of the form oc(X) = c L(Xi . . 0') = (1. See Volume II. .\.. Permutations of I.j.j. 00. ik is an arbitrary unknown permutation of 1. f(k. and iI. .....» = Show that the minimax rule is to take L l. ... jl~ < .. I' (X"". = tj. Show that X has a Poisson (. . j 7. .. . Remark: Stein (1956) has shown that if k > 3..k} I«i). 8. . ftk) = (Il?l'"'' J1. . " a.. then is minimax for the loss function r: l(p. PI.tn I(i.o) for alII'.. . 0 1. Let Xl. dk)T.. . 1 = t j=l (di . o'(X) is also minimax and R(I'.I). 1 < j < k. respectively. > jm). Xk)T. i=l then o(X) = X is minimax.. .j. . a) = (. .Section 3_6 Problems and Complements 201 (b) If I' ~ 0.. 1 <i< k. Jlk. that is.. o(X 1. d = (d). d) = L(d.. For instance... N k ) has a multinomial.. . Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l . 1). . < /12 is a known set of values. where (Ill. See Problem 1. (c) Use (B.Pi)' Pjqj <j < k. ..2. b. .. Show that if = (I'I''''. Then X is nurnmax. ik). .29).tj. / .?. . .3.\. Let k . h were t·. . Hint: (a) Consider a gamma prior on 0 = 1/<7'.1)1 L(Xi ... Hint: Consider the gamma.X)' and.). (jl. Write X  • "i)' 1(1'. distribution..12. .Pi. LetA = {(iI.X)' are inadmissible. . o(X) ~ n~l L(Xi .. .I'kf. .. J.~~n X < Pi < < R(I'. ttl . hence. X k be independent with means f. Show that if (N).a? 1. 9..)... show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible. b = ttl. X is no longer unique minimax. 6. .... .
LB= j 01 1. Let K(po.2. . q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q... 13. Suppose that given 6 ~ B..Xo)T has a multinomial.... Let the loss " function be the KullbackLeibler divergence lp(B. (b) How does the risk of these estimates compare to that of X? 12.=1 Show that the Bayes estimate is (Xi .. X n be independent N(I'.. Suppose that given 6 = B = (B . M(n.. a) is the posterior mean E(6 I X). . B j > 0. . B)..202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI. . 14... . I: • . Pk... 1 ..15. . Define OiflXI v'n < d < v'n d v'n d d X ..3. J K(po. ... ! p(x) = J po(x)1r(B)dB.. 1)..a) and let the prior be the uniform prior 1r(B" . distribution.. See also Problem 3... For a given x we want to estimate the proportion F(x) of the population to the left of x..d with density defined in Problem 1. .8. d X+ifX 11. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . 10. distribution. + l)j(n + k)...1r) = Show that the marginal density of X. with unknown distribution F. n) be i._.i.2. BO l) ~ (k .4.I)!. X = (X"". . Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. ... of Xi < X • 1 + 1 o.. Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B.1'»2 of these estimates is bounded for all nand 1'. X has a binomial.BO)T... ~ I I . I: I ir'z .q)1r(B)dB. Let Xi(i = 1. Hint: Consider the risk function of o~ No.d.B).i f X> . B(n. See Problem 3. Let X" . ..
7f) and that the minimum is I(J. .4. 0) and q(x. We shall say a loss function is convex.1 E: 1 (Xi  J.Section 3.). Suppose X I..6 Problems and Complements 203 minimizes k( q. .X is called the mutual information between 0 and X. Show that X is an UMVU estimate of O." A density proportional to VJp(O) is called Jeffrey's prior. . X n are Ll. . J).k(p.alai) < al(O.. . . (b) Equivariance of the Fisher Information Bound. . N(I". respectively.O)!. N(I"o. a < a < 1. a. then E(g(X) > g(E(X».~). Hint: k(q. 15.4 1. Let X . .aao + (1 . . Give the Bayes rules for squared error in these three cases. Equivariance. K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality.. X n be the indicators of n Bernoulli trials with success probability B.. Show that (a) (b) cr5 = n. It is often improper. Show that if 1(0. S. ... Let X I. ry). 0) and B(n. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). a) is convex and J'(X) = E( J(X) I I(X». p( x. then That is. (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. Jeffrey's "Prior.4.2) with I' .f [E.a )I( 0.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible.1"0 known. and O~ (1 .'? /1 as in (3. 0) cases. that is. ao) + (1 . Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable. J') < R( (J. (ry»).1 . if I(O. .. the Fisher information lower bound is equivariant.d. Jeffrey's priors are proportional to 1. a" (J. p(X) IO. 0) with 0 E e c R. 0. Reparametrize the model by setting ry = h(O) and let q(x. 2. Show that B q(ry) = Bp(h. ry) = p(x. Problems for Section 3. Show that in theN(O. for any ao. Fisher infonnation is not equivariant under increasing transformations of the parameter.x =. {log PB(X)}] K(O)d(J.4. K) . Let A = 3. Prove Proposition 3. 4. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. then R(O. 
Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. R. h1(ry)) denote the model in the new parametrization.12) for the two parametrizations p(x.
find the bias o f ao' 6. {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J.13 E R. compute Var(O) using each ofthe three methods indicated. In Example 3. then = 1 forallO.XN }.. Let X" .4.ZDf3)T(y . X n ) is a sample drawn without replacement from an unknown finite population {Xl. . Show that if (Xl. . Hint: Use the integral approximation to sums. 9..5(b). ( 2).11(9) as n ~ give the limit of n times the lower bound on the variances of and (J. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic.Ii . Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1).8. . I I . then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . Does X achieve the infonnation . • 13.. . ..4.1 and variance 0. . .. Suppose Yl . .. 00.3.p) is an unbiased estimate of (72 in the linear regression model of Section 2. Suppose (J is UMVU for estimating fJ. X n be a sample from the beta.Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate.o. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0.{x : p(x. Hint: Consider T(X) = X in Theorem 3. Show that 8 = (Y .2 (0 > 0) that satisfy the conditions of the information inequality. i = 1. Show that a density that minimizes the Fisher infonnation over F is f(x. .. 0) = Oc"I(x > 0). Show that assumption I implies that if A . B(O. distribution..' .4. . n.\ = a + bB is UMVU for estimating . Let F denote the class of densities with mean 0. Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji.2. Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered. 2 ~ ~ 10. . 
Show that .P.. . . Y n in twoparameter canonical exponential form and (b) Let 0 = (a. Ct. 204 Hint: See Problem 3. 0) for any set E.ZDf3)/(n . 7. . a ~ i 12.\ = a + bOo 11. Let a and b be constants. . 14. and . Establish the claims of Example 3..4. (a) Find the MLE of 1/0. 8. =f. a ~ (c) Suppose that Zi = log[i/(n + 1)1.. 1). P.1. For instance. Find lim n.
20. UN are as in Example 3.~ for all k. then E( M) ~ L 1r j=1 N J ~ n. B(n. Show that X k given by (3... even for sampling without replacement in each stratum. k = 1.) Suppose the Uj can be relabeled into strata {xkd. B(n.4. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. that ~ is a Bayes estimate for O..p). .511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ . Let X have a binomial. 7.Section 3.4.. 1 < i < h.. 18.6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U. 8). _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk.4. 16. Show that if M is the expected sample size. ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.. k=l K ~1rkXk. .~).. (b) Deduce that if p. distribution.} and X =K  1". Let 7fk = ~ and suppose 7fk = 1 < k < K.X)/Var(U). 17. X is not II Bayes estimate for any prior 1r. K. Suppose UI. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. Define ~ {Xkl..4. G 19. Suppose X is distributed accordihg to {p. . if and only if.6 Problems and Complements 205 (b) The variance of X is given by (3. (c) Explain how it is possible if Po is binomial. XK.3. = N(O.. : 0 E for (J such that E((J2) < 00.Xkl. More generally only polynomials of degree n in p are unbiasedly estimable.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. . = 1. P[o(X) = 91 = 1.4). (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. Show that is not unbiasedly estimable.". :/i. .l.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. (See also Problem 1. 
15. . . Stratified Sampling.• . . 'L~ 1 h = N.
75' and the IQR./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]. Show that the a trimmed mean XCII.5) to plot the sensitivity curve of the 1.4.206 Hint: Given E(ii(X) 19) ~ 9. give and plot the sensitivity curves of the lower quartile X. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) ."" F (ij) Vat (:0 logp(X. . give and plot the sensitivity curve of the median.5.25 and net =k 3. 5. with probability I for each B. . An estimate J(X) is said to be shift or translation equivariant if.25. If a = 0. 1 2. use (3. the upper quartile X.. (iii) 2X is unbiased for . B) is differentiable fat anB > x. is an empirical plugin estimate of ~ Here case. 22. If a = 0. and we can thus define moments of 8/8B log p(x. however.l)a is an integer.. B). Regularity Conditions are Needed for the Information Inequality.9)' 21. If n IQR. Let X ~ U(O.4. 4.4. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6. T . 1 X n1 C. Note that logp(x. Show that.. that is. B) be the uniform distribution on (0. Hint: It is equivalent to show that.3. = 2k is even. for all X!. • Problems for Section 35 I is an integer. Yet show eand has finite variance.25 and (n . Show that the sample median X is an empirical plugin estimate of the population median v. Var(a T 6) > aT (¢(9)II(9). B). for all adx 1. Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3. B») ~ °and the information bound is infinite. Prove Theorem 3.
03.. X.6 Problems and Complements 207 It is antisymmetric if for all Xl. .. .. . . k t 0 to the median. I)order statistics). and xiflxl < k kifx > k kifx<k. 7... X are unbiased estimates of the center of symmetry of a symmetric distribution.1. <i< n Show that (a) k = 00 corresponds to X. The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj).30. plot the sensitivity curves of the mean. xH L is translation equivariant and antisymmetric.) ~ 8. One reasonable choice for k is k = 1.. J is an unbiased estimate of 11. (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1. . Xu.6. X a arc translation equivariant and antisymmetric.JL) where JL is unknown and Xi .  XI/0. . The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00.5. (See Problem 3.1.5 and for (j is. For x > . then (i. X n is a sample from a population with dJ. In .03 (these are expected values of four N(O. .e.(a) Show that X. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified. trimmed mean with a = 1/4. . is symmetrically distributed about O. (b) Suppose Xl.. and the HodgesLehmann estimate.. (b) Show that   . (7= moo 1 IX.67. '& is an estimate of scale.Section 3.30. F(x . Its properties are similar to those of the trimmed mean. median. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. i < j.).3. Deduce that X.
Let X be a random variable with continuous distribution function F.75 . 11. (d) Xk is translation equivariant and antisymmetric (see Problem 3..Xn arei. In what follows adjust f and 9 to have v = 0 and 'T = 1.7(a). P(IXI > 3) and P(IXI > 4) for the nonnal. we will use the IQR scale parameter T = X. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00. = O. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr. •• . The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. and Cauchy distributions.O)/ao) where fo(x) for Ixl for Ixl <k > k. thenXk is the MLEofB when Xl. . with density fo((..5.x}2. (b) Find thetml probabilities P(IXI > 2). 9. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. (e) If k < 00.• 13.i.. ii. thus. Location Parameters. The (student) tratio is defined as 1= v'n(x I'o)/s. Show that SC(x. [. . Let 1'0 = 0 and choose the ideal sample Xl. (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10. Let JJo be a hypothesized mean for a certain population.d. plot the sensitivity curve of (a) iTn ." . Xk) is a finite constant. 00. 00. Laplace.X. . where S2 = (n .) has heavier tails than f() if g(x) is above f(x) for Ixllarge.6). This problem may be done on the computer. then limlxj>00 SC(x.208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0.1)1 L:~ 1(Xi .25. and "" "'" (b) the Iratio of Problem 3.. In the case of the Cauchy density. For the ideal sample of Problem 3.11.) are two densities with medians v zero and identical scale parameters 7. .5. If f(·) and g(.. we say that g(. Use a fixed known 0"0 in place of Ci. 1 Xnl to have sample mean zero.~ !~ t . . Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00. the standard deviation does not exist. Suppose L:~i Xi ~ . n is fixed. .5. and is fixed. X 12.
if $\theta_{a+bX} = a + b\theta_X$. It is antisymmetric if $\theta_X = -\theta_{-X}$. Let $Y$ denote a random variable with continuous distribution function $G$. $X$ is said to be stochastically smaller than $Y$ if $F(t) = P(X \le t) \ge P(Y \le t) = G(t)$ for all $t \in R$. In this case we write $X \le^{st} Y$. $\theta$ is said to be order preserving if $X \le^{st} Y \Rightarrow \theta_X \le \theta_Y$. If $\theta$ is scale and shift equivariant, antisymmetric, and order preserving, it is called a location parameter.

(a) Show that if $F$ is symmetric about $c$ and $\theta$ is a location parameter, then $\theta(F) = c$.

(b) Show that the mean $\mu$, median $\nu$, and trimmed population mean $\mu_\alpha$ (see Problem 3.5.5) are location parameters.

(c) Let $\mu^{(k)}$ be the solution to the equation $E\left(\psi_k\left(\frac{X - \mu}{\tau}\right)\right) = 0$, where $\tau$ is the median of the distribution of $|X - \nu|/0.67$ and $\psi_k$ is defined in Problem 3.5.8. Show that $\mu^{(k)}$ is a location parameter.

(d) For $0 < \alpha < 1$, let $\nu_\alpha = \nu_\alpha(F) = \frac{1}{2}(x_\alpha + x_{1-\alpha})$, $\underline{\nu}(F) = \inf\{\nu_\alpha(F) : 0 < \alpha \le 1/2\}$ and $\bar{\nu}(F) = \sup\{\nu_\alpha(F) : 0 < \alpha \le 1/2\}$. Show that $\nu_\alpha$ is a location parameter and show that any location parameter $\theta(F)$ satisfies $\underline{\nu}(F) \le \theta(F) \le \bar{\nu}(F)$.
Hint: For the second part, let $H(x)$ be the distribution function whose inverse is $H^{-1}(\alpha) = \frac{1}{2}[x_\alpha - x_{1-\alpha}]$, $0 < \alpha < 1$, and note that $H(x - \bar{\nu}(F)) \le F(x) \le H(x - \underline{\nu}(F))$. Also note that $H(x)$ is symmetric about zero.

(e) Show that if the support $S(F) = \{x : 0 < F(x) < 1\}$ of $F$ is a finite interval, then $\underline{\nu}(F)$ and $\bar{\nu}(F)$ are location parameters. ($[\underline{\nu}(F), \bar{\nu}(F)]$ is the location parameter set in the sense that for any continuous $F$ the value $\theta(F)$ of any location parameter must be in $[\underline{\nu}(F), \bar{\nu}(F)]$ and, if $F$ is also strictly increasing, any point in $[\underline{\nu}(F), \bar{\nu}(F)]$ is the value of some location parameter.)

14. An estimate $\hat\theta_n$ is said to be shift and scale equivariant if for all $x_1, \ldots, x_n$, $a$, $b > 0$,
$$\hat\theta_n(a + bx_1, \ldots, a + bx_n) = a + b\hat\theta_n(x_1, \ldots, x_n).$$

(a) Show that the sample mean, sample median, and sample trimmed mean are shift and scale equivariant.

(b) Write the SC as $SC(x, \hat\theta, \mathbf{x}_{n-1})$ to show its dependence on $\mathbf{x}_{n-1} = (x_1, \ldots, x_{n-1})$. Show that if $\hat\theta$ is shift and scale equivariant, then for $a \in R$, $b > 0$, $c \in R$, $d > 0$,
$$SC(a + bx, c + d\hat\theta, a + b\mathbf{x}_{n-1}) = bd\,SC(x, \hat\theta, \mathbf{x}_{n-1}).$$
That is, the SC is shift invariant and scale equivariant.

15. In Remark 3.5.1:

(a) Show that $SC(x, \hat\theta) = IF_{1/n}(x; \theta, \hat F_{n-1})$.

(b) In the following cases, compare $SC(x; \theta, F) \equiv \lim_{n \to \infty} SC(x, \hat\theta)$ and $IF(x; \theta, F)$.

(i) $\theta(F) = \mu_F = \int x\,dF(x)$.
(ii) $\theta(F) = \sigma_F^2 = \int (x - \mu_F)^2\,dF(x)$.
(iii) $\theta(F) = x_\alpha$. Assume that $F$ is strictly increasing.

(c) Does $n^{1/2}[SC(x, \hat\theta) - IF(x, \theta, F)] \to 0$ in probability in the cases (i), (ii), and (iii) preceding?

16. In the gross error model (3.5.2), show that

(a) If $h$ is a density that is symmetric about zero, then $\mu$ is identifiable.

(b) If no assumptions are made about $h$, then $\mu$ is not identifiable.

17. Show that in the bisection method, in order to be certain that the $J$th iterate $\hat\theta_J$ is within $\epsilon$ of the desired $\hat\theta$ such that $\psi(\hat\theta) = 0$, we in general must take on the order of $\log\frac{1}{\epsilon}$ steps. This is, consequently, also true of the method of coordinate ascent.

18. Let $d = 1$ and suppose that $\psi$ is twice continuously differentiable, $\psi' > 0$, and we seek the unique solution $\hat\theta$ of $\psi(\theta) = 0$. The Newton-Raphson method in this case is
$$\hat\theta^{(j)} = \hat\theta^{(j-1)} - \frac{\psi(\hat\theta^{(j-1)})}{\psi'(\hat\theta^{(j-1)})}.$$

(a) Show by example that for suitable $\psi$ and $|\hat\theta^{(0)} - \hat\theta|$ large enough, $\{\hat\theta^{(j)}\}$ do not converge.

(b) Show that there exists $C < \infty$, $\delta > 0$ (depending on $\psi$) such that if $|\hat\theta^{(0)} - \hat\theta| \le \delta$, then $|\hat\theta^{(j)} - \hat\theta| \le C|\hat\theta^{(j-1)} - \hat\theta|^2$.

Hint: (a) Try $\psi(x) = A \log x$ with $A > 1$.
(b) $|\hat\theta^{(j)} - \hat\theta| = \left|\hat\theta^{(j-1)} - \hat\theta - \frac{1}{\psi'(\hat\theta^{(j-1)})}\left(\psi(\hat\theta^{(j-1)}) - \psi(\hat\theta)\right)\right|$.

3.7 NOTES

Note for Section 3.3

(1) A technical problem is to give the class $\mathcal{S}$ of subsets of $\mathcal{F}$ for which we can assign probability (the measurable sets). We define $\mathcal{S}$ as the $\sigma$-field generated by $S_{A,B} = \{F \in \mathcal{F} : P_F(A) \in B\}$, $A, B \in \mathcal{B}$, where $\mathcal{B}$ is the class of Borel sets.

Notes for Section 3.4

(1) The result of Theorem 3.4.1 is commonly known as the Cramér-Rao inequality. Because priority of discovery is now given to the French mathematician M. Fréchet, we shall
follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement.

(2) Note that this inequality is true but uninteresting if $I(\theta) = \infty$ (and $\psi'(\theta)$ is finite) or if $\mathrm{Var}_\theta(T(X)) = \infty$.

(3) The continuity of the first integral ensures that
$$\frac{\partial}{\partial\theta}\left[\int_{\theta_0}^{\theta}\int_{-\infty}^{\infty} T(x)\,\frac{\partial}{\partial\lambda}p(x, \lambda)\,dx\,d\lambda\right] = \int_{-\infty}^{\infty} T(x)\left[\frac{\partial}{\partial\theta}p(x, \theta)\right]dx$$
for all $\theta$ whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in
$$\int_{\theta_0}^{\theta}\int_{-\infty}^{\infty} T(x)\left[\frac{\partial}{\partial\lambda}p(x, \lambda)\right]dx\,d\lambda.$$

(4) The finiteness of $\mathrm{Var}_\theta(T(X))$ and $I(\theta)$ imply that $\psi'(\theta)$ is finite by the covariance interpretation given in (3.4.8).

3.8 REFERENCES

ANDREWS, D. F., P. J. BICKEL, F. R. HAMPEL, P. J. HUBER, W. H. ROGERS, AND J. W. TUKEY, Robust Estimates of Location: Survey and Advances Princeton, NJ: Princeton University Press, 1972.

APOSTOL, T. M., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.

BERNARDO, J. M., AND A. F. M. SMITH, Bayesian Theory New York: Wiley, 1994.

BICKEL, P., AND E. LEHMANN, "Unbiased Estimation in Convex Families," Ann. Math. Statist., 40, 1523-1535 (1969).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. I. Introduction," Ann. Statist., 3, 1038-1044 (1975a).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. II. Location," Ann. Statist., 3, 1045-1069 (1975b).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. III. Dispersion," Ann. Statist., 4, 1139-1158 (1976).

BÜHLMANN, H., Mathematical Methods in Risk Theory Heidelberg: Springer Verlag, 1970.

DAHLQUIST, G., A. BJÖRK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1969.

DOKSUM, K. A., "Measures of Location and Asymmetry," Scand. J. of Statist., 2, 11-22 (1975).

DOWE, D. L., R. A. BAXTER, J. J. OLIVER, AND C. S. WALLACE, Point Estimation Using the Kullback-Leibler Loss Function and MML, in Proceedings of the Second Pacific Asian Conference on Knowledge Discovery and Data Mining Melbourne: Springer-Verlag, 1998.

HAMPEL, F., "The Influence Curve and Its Role in Robust Estimation," J. Amer. Statist. Assoc., 69, 383-393 (1974).

HAMPEL, F., E. RONCHETTI, P. ROUSSEEUW, AND W. STAHEL, Robust Statistics: The Approach Based on Influence Functions New York: J. Wiley & Sons, 1986.

HANSEN, M. H., AND B. YU, "Model Selection and the Principle of Minimum Description Length," J. Amer. Statist. Assoc., (2000).

HOGG, R., "Adaptive Robust Procedures," J. Amer. Statist. Assoc., 69, 909-927 (1974).

HUBER, P., Robust Statistics New York: Wiley, 1981.

HUBER, P., "Robust Statistics: A Review," Ann. Math. Statist., 43, 1041-1067 (1972).

JAECKEL, L. A., "Robust Estimates of Location," Ann. Math. Statist., 42, 1020-1034 (1971).

JEFFREYS, H., Theory of Probability, 2nd ed. London: Oxford University Press, 1948.

KARLIN, S., Mathematical Methods and Theory in Games, Programming, and Economics Reading, MA: Addison-Wesley, 1959.

LEHMANN, E. L., Testing Statistical Hypotheses New York: Springer, 1986.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference, London: Cambridge University Press, 1965.

LINDLEY, D. V., "Decision Analysis and Bioequivalence Trials," Statistical Science, 13, 136-141 (1998).

NORBERG, R., "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification," Scand. Actuarial J., 204-222 (1986).

RISSANEN, J., "Stochastic Complexity (With Discussions)," J. Royal Statist. Soc. B, 49, 223-239 (1987).

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SHIBATA, R., "Bootstrap Estimate of Kullback-Leibler Information for Model Selection," Statistica Sinica, 7, 375-394 (1997).

STEIN, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution," Proc. Third Berkeley Symposium on Math. Statist. and Probability, 1, University of California Press, 197-206 (1956).

TUKEY, J. W., Exploratory Data Analysis Reading, MA: Addison-Wesley, 1972.

WALLACE, C. S., AND P. R. FREEMAN, "Estimation and Inference by Compact Coding (With Discussions)," J. Royal Statist. Soc. B, 49, 240-251 (1987).

WIJSMAN, R. A., "On the Attainment of the Cramér-Rao Lower Bound," Ann. Statist., 1, 538-542 (1973).
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether $P \in \mathcal{P}_0$ or $\mathcal{P}_1$ or, parametrically, whether $\theta \in \Theta_0$ or $\Theta_1$ if $\mathcal{P}_j = \{P_\theta : \theta \in \Theta_j\}$, where $\mathcal{P}_0, \mathcal{P}_1$ or $\Theta_0, \Theta_1$ are a partition of the model $\mathcal{P}$ or, respectively, the parameter space $\Theta$.

This framework is natural if, as is often the case, we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other.

As we have seen, in examples such as 1.1.3 the questions are sometimes simple and the type of data to be gathered under our control. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data $X$ in $\mathcal{X} \subset R^q$, modeled by us as having distribution $P_\theta$, $\theta \in \Theta$, where $\Theta$ is partitioned into $\{\Theta_0, \Theta_1\}$ with $\Theta_0$ and $\Theta_1$ corresponding, respectively, to answering "no" or "yes" to the preceding questions.

Usually, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what $\Theta_0$ and $\Theta_1$ correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated $N_{m1}, N_{f1}$, the numbers of admitted male and female applicants, and the corresponding numbers $N_{m0}, N_{f0}$ of denied applicants. If $n$ is the total number of applicants, it might be tempting to model $(N_{m1}, N_{m0}, N_{f1}, N_{f0})$ by a multinomial, $\mathcal{M}(n, p_{m1}, p_{m0}, p_{f1}, p_{f0})$, distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
hypothesis of no sex bias correspond to? Again it is natural to translate this into
$$P[\text{Admit} \mid \text{Male}] = \frac{p_{m1}}{p_{m1} + p_{m0}} = P[\text{Admit} \mid \text{Female}] = \frac{p_{f1}}{p_{f1} + p_{f0}}.$$
But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability
$$\frac{p_{m1}}{p_{m1} + p_{m0}} = \frac{p_{f1}}{p_{f1} + p_{f0}}.$$
In fact, as is discussed in a paper by Bickel, Hammel, and O'Connell (1975), admissions are performed at the departmental level and rates of admission differ significantly from department to department. If departments "use different coins," then the data are naturally decomposed into $N = (N_{m1d}, N_{m0d}, N_{f1d}, N_{f0d},\ d = 1, \ldots, D)$, where $N_{m1d}$ is the number of male admits to department $d$, and so on. Our multinomial assumption now becomes $N \sim \mathcal{M}(p_{m1d}, p_{m0d}, p_{f1d}, p_{f0d},\ d = 1, \ldots, D)$. In these terms the hypothesis of "no bias" can now be translated into:
$$H : \frac{p_{m1d}}{p_{m1d} + p_{m0d}} = \frac{p_{f1d}}{p_{f1d} + p_{f0d}}$$
for $d = 1, \ldots, D$. This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate,
$$\frac{p_{m1} + p_{f1}}{p_{m1} + p_{f1} + p_{m0} + p_{f0}}.$$
In fact, the same data can lead to opposite conclusions regarding these hypotheses, a phenomenon called Simpson's paradox. The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis. □

Example 4.1.2. Mendel's Peas. In one of his famous experiments laying the foundation of the quantitative theory of genetics, Mendel crossed peas heterozygous for a trait with two alleles, one of which was dominant. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). In a modern formulation, if there were $n$ dominant offspring (seeds), the natural model is to assume, if the inheritance ratio can be arbitrary, that $N_{AA}$, the number of homozygous dominants, has a binomial $(n, p)$ distribution. The hypothesis of dominant inheritance corresponds to $H : p = \frac{1}{3}$ with the alternative $K : p \ne \frac{1}{3}$. It was noted by Fisher as reported in Jeffreys (1961) that in this experiment the observed fraction $\frac{m}{n}$ was much closer to $\frac{1}{3}$ than might be expected under the hypothesis that $N_{AA}$ has a binomial, $\mathcal{B}(n, \frac{1}{3})$, distribution,
$$P\left[\left|\frac{N_{AA}}{n} - \frac{1}{3}\right| \le \left|\frac{m}{n} - \frac{1}{3}\right|\right] = 7 \times 10^{-5}.$$
Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. That is, either $N_{AA}$ cannot really be thought of as stochastic or any stochastic
model needs to permit distributions other than $\mathcal{B}(n, \frac{1}{3})$, for instance, $(1 - \epsilon)\delta_{n/3} + \epsilon\mathcal{B}(n, p)$, where $1 - \epsilon$ is the probability that the assistant fudged the data and $\delta_{n/3}$ is point mass at $\frac{n}{3}$. □

What the second of these examples suggests is often the case. The set of distributions corresponding to one answer, say $\Theta_0$, is better defined than the alternative answer $\Theta_1$. That a treatment has no effect is easier to specify than what its effect is; see, for instance, our discussion of constant treatment effect in Example 1.1.3. In science generally a theory typically closely specifies the type of distribution $P$ of the data $X$ as, say, $P = P_\theta$, $\theta \in \Theta_0$. If the theory is false, it's not clear what $P$ should be as in the preceding Mendel example. These considerations lead to the asymmetric formulation that saying $P \in \mathcal{P}_0$ ($\theta \in \Theta_0$) corresponds to acceptance of the hypothesis $H : P \in \mathcal{P}_0$ and $P \in \mathcal{P}_1$ corresponds to rejection sometimes written as $K : P \in \mathcal{P}_1$.(1)

As we have stated earlier, acceptance and rejection can be thought of as actions $a = 0$ or 1, and we are then led to the natural 0-1 loss $l(\theta, a) = 0$ if $\theta \in \Theta_a$ and 1 otherwise. Moreover, recall that a decision procedure in the case of a test is described by a test function $\delta : x \to \{0, 1\}$ or critical region $C \equiv \{x : \delta(x) = 1\}$, the set of points for which we reject. It is convenient to distinguish between two structural possibilities for $\Theta_0$ and $\Theta_1$: If $\Theta_0$ consists of only one point, we call $\Theta_0$ and $H$ simple. When $\Theta_0$ contains more than one point, $\Theta_0$ and $H$ are called composite. The same conventions apply to $\Theta_1$ and $K$.

We illustrate these ideas in the following example.

Example 4.1.3. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. Suppose that we know from past experience that a fixed proportion $\theta_0 = 0.3$ recover from the disease with the old drug. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. To investigate this question we would have to perform a random experiment. Most simply we would sample $n$ patients, administer the new drug, and then base our decision on the observed sample $X = (X_1, \ldots, X_n)$, where $X_i$ is 1 if the $i$th patient recovers and 0 otherwise. Thus, suppose we observe $S = \sum X_i$, the number of recoveries among the $n$ randomly selected patients who have been administered the new drug.(2) If we let $\theta$ be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite, then $S$ has a $\mathcal{B}(n, \theta)$ distribution. If we suppose the new drug is at least as effective as the old, then $\Theta = [\theta_0, 1]$, where $\theta_0$ is the probability of recovery using the old drug. Now $\Theta_0 = \{\theta_0\}$ and $H$ is simple; $\Theta_1$ is the interval $(\theta_0, 1]$ and $K$ is composite. In situations such as this one we shall simplify notation and write $H : \theta = \theta_0$, $K : \theta > \theta_0$. If we allow for the possibility that the new drug is less effective than the old, then $\Theta_0 = [0, \theta_0]$ and $\Theta_0$ is composite. It will turn out that in most cases the solution to testing problems with $\Theta_0$ simple also solves the composite $\Theta_0$ problem. See Remark 4.1.

In this example with $\Theta_0 = \{\theta_0\}$ it is reasonable to reject $H$ if $S$ is "much" larger than what would be expected by chance if $H$ is true and the value of $\theta$ is $\theta_0$. Thus, we reject $H$ if $S$ exceeds or equals some integer, say $k$, and accept $H$ otherwise. That is, in the
terminology of Section 1.3, our critical region $C$ is $\{X : S \ge k\}$ and the test function or rule is $\delta_k(X) = 1\{S \ge k\}$ with
$$P_I = \text{probability of type I error} = P_{\theta_0}(S \ge k)$$
$$P_{II} = \text{probability of type II error} = P_\theta(S < k),\ \theta > \theta_0.$$
The constant $k$ that determines the critical region is called the critical value. □

In most problems it turns out that the tests that arise naturally have the kind of structure we have just described. There is a statistic $T$ that "tends" to be small, if $H$ is true, and large, if $H$ is false. We call $T$ a test statistic. (Other authors consider test statistics $T$ that tend to be small, when $H$ is false. $-T$ would then be a test statistic in our sense.) We select a number $c$ and our test is to calculate $T(x)$ and then reject $H$ if $T(x) \ge c$ and accept $H$ otherwise. The value $c$ that completes our specification is referred to as the critical value of the test. Note that a test statistic generates a family of possible tests as $c$ varies. We will discuss the fundamental issue of how to choose $T$ in Sections 4.2, 4.3, and later chapters. We now turn to the prevalent point of view on how to choose $c$.

The Neyman-Pearson Framework

The Neyman-Pearson approach rests on the idea that, of the two errors, one can be thought of as more important. By convention this is chosen to be the type I error and that in turn determines what we call $H$ and what we call $K$. Given this position, how reasonable is this point of view? In the medical setting of Example 4.1.3 this asymmetry appears reasonable. It has also been argued that, generally in science, announcing that a new phenomenon has been observed when in fact nothing has happened (the so-called null hypothesis) is more serious than missing something new that has in fact occurred. We do not find this persuasive, but if this view is accepted, it again reasonably leads to a Neyman-Pearson formulation.

As we noted in Examples 4.1.1 and 4.1.2, asymmetry is often also imposed because one of $\Theta_0$, $\Theta_1$, is much better defined than its complement and/or the distribution of statistics $T$ under $\Theta_0$ is easy to compute. In that case rejecting the hypothesis at level $\alpha$ is interpreted as a measure of the weight of evidence we attach to the falsity of $H$. For instance, testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. To determine significant values of these statistics a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random (i.e., matches at one position are independent of matches at other positions) and the probability of a match is $\frac{1}{2}$, then the probability of exceeding the threshold (type I) error is smaller than $\alpha$. No one really believes that $H$ is true and possible types of alternatives are vaguely known at best, but computation under $H$ is easy. The Neyman-Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then, as we shall see in Sections 4.2 and 4.3, suggesting what test statistics it is best to use.
There is an important class of situations in which the Neyman-Pearson framework is inappropriate, such as the quality control Example 1.1.1. Indeed, it is too limited in any situation in which, even though there are just two actions, we can attach, even nominally, numbers to the two losses that are not equal and/or depend on $\theta$. See Problem 3.2.9. Finally, in the Bayesian framework with a prior distribution on the parameter, the approach of Example 3.2.2(b) is the one to take in all cases with $\Theta_0$ and $\Theta_1$ simple.

Here are the elements of the Neyman-Pearson story. Begin by specifying a small number $\alpha > 0$ such that probabilities of type I error greater than $\alpha$ are undesirable. Then restrict attention to tests that in fact have the probability of rejection less than or equal to $\alpha$ for all $\theta \in \Theta_0$. Such tests are said to have level (of significance) $\alpha$, and we speak of rejecting $H$ at level $\alpha$. The values $\alpha = 0.01$ and $0.05$ are commonly used in practice. Because a test of level $\alpha$ is also of level $\alpha' > \alpha$, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if we have a test statistic $T$ and use critical value $c$, our test has size $\alpha(c)$ given by
$$\alpha(c) = \sup\{P_\theta[T(X) \ge c] : \theta \in \Theta_0\}. \tag{4.1.1}$$
Now $\alpha(c)$ is nonincreasing in $c$ and typically $\alpha(c) \uparrow 1$ as $c \downarrow -\infty$ and $\alpha(c) \downarrow 0$ as $c \uparrow \infty$. In that case, if $0 < \alpha < 1$, there exists a unique smallest $c$ for which $\alpha(c) \le \alpha$. This is the critical value we shall use, if our test statistic is $T$ and we want level $\alpha$. It is referred to as the level $\alpha$ critical value. In Example 4.1.3 with $\delta(X) = 1\{S \ge k\}$, $\theta_0 = 0.3$ and $n = 10$, we find from binomial tables the level 0.05 critical value 6 and the test has size $\alpha(6) = P_{\theta_0}(S \ge 6) = 0.0473$.

Once the level or critical value is fixed, the probabilities of type II error as $\theta$ ranges over $\Theta_1$ are determined. By convention $1 - P[\text{type II error}]$ is usually considered. Specifically,

Definition 4.1.1. The power of a test against the alternative $\theta$ is the probability of rejecting $H$ when $\theta$ is true.

Thus, the power is 1 minus the probability of type II error. It can be thought of as the probability that the test will "detect" that the alternative $\theta$ holds. The power is a function of $\theta$ on $\Theta_1$. If $\Theta_0$ is composite as well, then the probability of type I error is also a function of $\theta$. Both the power and the probability of type I error are contained in the power function, which is defined for all $\theta \in \Theta$ by
$$\beta(\theta) = \beta(\theta, \delta) = P_\theta[\text{Rejection}] = P_\theta[\delta(X) = 1] = P_\theta[T(X) \ge c].$$
If $\theta \in \Theta_0$, $\beta(\theta, \delta)$ is just the probability of type I error, whereas if $\theta \in \Theta_1$, $\beta(\theta, \delta)$ is the power against $\theta$.

Example 4.1.3 (continued). Here
$$\beta(\theta, \delta_k) = P(S \ge k) = \sum_{j=k}^{n} \binom{n}{j}\theta^j(1 - \theta)^{n-j}.$$
A plot of this function for $n = 10$, $\theta_0 = 0.3$, $k = 6$ is given in Figure 4.1.1.
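The size and power calculations for the binomial test can be checked directly from the power function $\beta(\theta, \delta_k) = P_\theta(S \ge k)$. The following is a minimal sketch using only the standard library; the function names `power` and `critical_value` are illustrative and not part of the text.

```python
from math import comb


def power(theta: float, n: int, k: int) -> float:
    """Power function beta(theta, delta_k) = P_theta(S >= k), S ~ B(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(k, n + 1))


def critical_value(n: int, theta0: float, alpha: float) -> int:
    """Smallest k whose size P_theta0(S >= k) is at most alpha."""
    # k = n + 1 always qualifies (the test never rejects), so the loop returns.
    for k in range(n + 2):
        if power(theta0, n, k) <= alpha:
            return k
    raise ValueError("alpha must be nonnegative")


k = critical_value(10, 0.3, 0.05)   # level 0.05 critical value
size = power(0.3, 10, k)            # size of the resulting test
power_at_half = power(0.5, 10, k)   # power against theta = 0.5
print(k, round(size, 4), round(power_at_half, 4))
```

Evaluating `power` on a grid of $\theta$ values in $[0, 1]$ gives the points needed to draw the power curve for $n = 10$, $k = 6$.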
Figure 4.1.1. Power function of the level 0.05 one-sided test $\delta_k$ of $H : \theta = 0.3$ versus $K : \theta > 0.3$ for the $\mathcal{B}(10, \theta)$ family of distributions. The power is plotted as a function of $\theta$, $k = 6$ and the size is 0.0473.

Note that in this example the power at $\theta = \theta_1 > 0.3$ is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to $\theta_1 > 0.3$. When $\theta_1$ is 0.5, a 67% improvement, this probability is only .3770. What is needed to improve on this situation is a larger sample size $n$. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. We return to this question in Section 4.3. □

Remark 4.1. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). It follows that the level and size of the test are unchanged if instead of $\Theta_0 = \{\theta_0\}$ we used $\Theta_0 = [0, \theta_0]$. That is,
$$\alpha(k) = \sup\{P_\theta[T(X) \ge k] : \theta \in \Theta_0\} = P_{\theta_0}[T(X) \ge k].$$

Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that $X = (X_1, \ldots, X_n)$ is a sample from a $\mathcal{N}(\mu, \sigma^2)$ population with $\sigma^2$ known. (The $\sigma^2$ unknown case is treated in Section 4.5.) We want to test $H : \mu \le 0$ versus $K : \mu > 0$. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of $n$ randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. Let $X_i$ be the difference between the time slept after administration of the drug and time slept without administration of the drug by the $i$th patient. If we assume $X_1, \ldots, X_n$ are normally distributed with mean $\mu$ and variance $\sigma^2$, then the drug effect is measured by $\mu$ and $H$ is the hypothesis that the drug has no effect or is detrimental, whereas $K$ is the alternative that it has some positive effect.
Because $\bar X$ tends to be larger under $K$ than under $H$, it is natural to reject $H$ for large values of $\bar X$. It is convenient to replace $\bar X$ by the test statistic $T(X) = \sqrt{n}\bar X/\sigma$, which generates the same family of critical regions. The power function of the test with critical value $c$ is
$$\beta(\mu) = P_\mu[T(X) \ge c] = P_\mu\left[\sqrt{n}\,\frac{(\bar X - \mu)}{\sigma} \ge c - \frac{\sqrt{n}\mu}{\sigma}\right] = 1 - \Phi\left(c - \frac{\sqrt{n}\mu}{\sigma}\right) = \Phi\left(-c + \frac{\sqrt{n}\mu}{\sigma}\right) \tag{4.1.2}$$
because $\Phi(z) = 1 - \Phi(-z)$. Because $\beta(\mu)$ is increasing,
$$\alpha(c) = \sup\{\beta(\mu) : \mu \le 0\} = \beta(0) = \Phi(-c).$$
The smallest $c$ for which $\Phi(-c) \le \alpha$ is obtained by setting $\Phi(-c) = \alpha$ or
$$c = -z(\alpha)$$
where $-z(\alpha) = z(1 - \alpha)$ is the $(1 - \alpha)$ quantile of the $\mathcal{N}(0, 1)$ distribution. □

The Heuristics of Test Construction

When hypotheses are expressed in terms of an estimable parameter $H : \theta \in \Theta_0 \subset R^p$, and we have available a good estimate $\hat\theta$ of $\theta$, it is clear that a reasonable test statistic is $d(\hat\theta, \Theta_0)$, where $d$ is the Euclidean (or some equivalent) distance and $d(x, S) \equiv \inf\{d(x, y) : y \in S\}$. This minimum distance principle is essentially what underlies Examples 4.1.2 and 4.1.3. In Example 4.1.2, $p = P[AA]$, $\frac{N_{AA}}{n}$ is the MLE of $p$ and $d\left(\frac{N_{AA}}{n}, \Theta_0\right) = \left|\frac{N_{AA}}{n} - \frac{1}{3}\right|$. In Example 4.1.3, $\frac{S}{n}$ estimates $\theta$ and $d\left(\frac{S}{n}, \theta_0\right) = \left(\frac{S}{n} - \theta_0\right)_+$ where $y_+ = y1(y \ge 0)$. Rejecting for large values of this statistic is equivalent to rejecting for large values of $S$.

Given a test statistic $T(X)$ we need to determine critical values and eventually the power of the resulting tests. The task of finding a critical value is greatly simplified if $\mathcal{L}_\theta(T(X))$ doesn't depend on $\theta$ for $\theta \in \Theta_0$. This occurs if $\Theta_0$ is simple as in Example 4.1.3. But it occurs also in more interesting situations such as testing $\mu = \mu_0$ versus $\mu \ne \mu_0$ if we have $\mathcal{N}(\mu, \sigma^2)$ observations with both parameters unknown (the $t$ tests of Example 4.5.1 and Example 4.1.5). In all of these cases, $\mathcal{L}_0$, the common distribution of $T(X)$ under $\theta \in \Theta_0$, has a closed form and is tabled. However, in any case, critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. That is, if we generate i.i.d. $T(X^{(1)}), \ldots, T(X^{(B)})$ from $\mathcal{L}_0$, then the test that rejects iff $T(X) > T_{((B+1)(1-\alpha))}$, where $T_{(1)} \le \cdots \le T_{(B+1)}$ are the ordered $T(X), T(X^{(1)}), \ldots, T(X^{(B)})$, has level $\alpha$ if $\mathcal{L}_0$ is continuous and $(B + 1)(1 - \alpha)$ is an integer (Problem 4.1.9).

The key feature of situations in which $\mathcal{L}_\theta(T_n) = \mathcal{L}_0$ for $\theta \in \Theta_0$ is usually invariance under the action of a group of transformations. See Lehmann (1997) and Volume II for discussions of this property.

Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.
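The Monte Carlo recipe for critical values described above can be sketched as follows. This is a minimal illustration under stated assumptions: the observed statistic is pooled with the $B$ simulated ones, the $(B+1)(1-\alpha)$th order statistic of the pooled values serves as the critical value, and the null simulator used in the demo (the statistic $\sqrt{n}\bar X$ for $\mathcal{N}(0, 1)$ data) is a hypothetical stand-in for whatever $\mathcal{L}_0$ applies. The function names are illustrative.

```python
import random
from math import sqrt


def mc_critical_value(t_obs, sims, alpha):
    """(B+1)(1-alpha)-th order statistic of the pooled values T(X), T(X^(1)), ..., T(X^(B))."""
    pooled = sorted(list(sims) + [t_obs])
    m = len(pooled) * (1 - alpha)
    if abs(m - round(m)) > 1e-9:
        raise ValueError("(B + 1)(1 - alpha) must be an integer")
    return pooled[round(m) - 1]  # order statistics are 1-indexed


def mc_test(t_obs, sims, alpha):
    """Reject H iff the observed statistic strictly exceeds the Monte Carlo critical value."""
    return t_obs > mc_critical_value(t_obs, sims, alpha)


def simulate_null(n, B, rng):
    """Hypothetical null simulator: T = sqrt(n) * sample mean of n N(0, 1) draws."""
    return [sqrt(n) * sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(B)]


rng = random.Random(0)
sims = simulate_null(n=25, B=999, rng=rng)   # (B + 1)(1 - alpha) = 950 for alpha = 0.05
print(mc_test(3.0, sims, 0.05))
```

Choosing $B$ so that $(B + 1)(1 - \alpha)$ is an integer (for example $B = 999$ with $\alpha = 0.05$) is what makes the level exact when $\mathcal{L}_0$ is continuous; the sketch raises an error otherwise rather than rounding silently.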
Example 4.1.5. Goodness of Fit Tests. Let $X_1, \ldots, X_n$ be i.i.d. as $X \sim F$, where $F$ is continuous. Consider the problem of testing $H : F = F_0$ versus $K : F \ne F_0$. Let $\hat F$ denote the empirical distribution and consider the sup distance between the hypothesis $F_0$ and the plug-in estimate of $F$, the empirical distribution function $\hat F$, as a test statistic
$$D_n = \sup_x |\hat F(x) - F_0(x)|.$$
It can be shown (Problem 4.1.7) that $D_n$, which is called the Kolmogorov statistic, can be written as
$$D_n = \max_{i=1,\ldots,n} \max\left\{\frac{i}{n} - F_0(x_{(i)}),\ F_0(x_{(i)}) - \frac{(i - 1)}{n}\right\} \tag{4.1.3}$$
where $x_{(1)} < \cdots < x_{(n)}$ is the ordered observed sample, that is, the order statistics. This statistic has the following distribution-free property:

Proposition 4.1.1. The distribution of $D_n$ under $H$ is the same for all continuous $F_0$. In particular, $P_{F_0}(D_n \le d) = P_U(D_n \le d)$, where $U$ denotes the $\mathcal{U}(0, 1)$ distribution.

Proof. Set $U_i = F_0(X_i)$, then by the probability integral transform (a problem in Appendix B), $U_i \sim \mathcal{U}(0, 1)$. Also
$$\hat F(x) = n^{-1}\sum 1\{X_i \le x\} = n^{-1}\sum 1\{F_0(X_i) \le F_0(x)\} = n^{-1}\sum 1\{U_i \le F_0(x)\} = \hat U(F_0(x))$$
where $\hat U$ denotes the empirical distribution function of $U_1, \ldots, U_n$. As $x$ ranges over $R$, $u = F_0(x)$ ranges over $(0, 1)$, thus,
$$D_n = \sup_{0 < u < 1} |\hat U(u) - u|$$
and the result follows. □

Note that the hypothesis here is simple so that for any one of these hypotheses $F = F_0$, the distribution can be simulated (or exhibited in closed form). What is remarkable is that it is independent of which $F_0$ we consider. This is again a consequence of invariance properties (Lehmann, 1997).

The distribution of $D_n$ has been thoroughly studied for finite and large $n$. In particular, for $n > 80$, and $h_n(t) = t/(\sqrt{n} + 0.12 + 0.11/\sqrt{n})$ close approximations to the size $\alpha$ critical values $k_\alpha$ are $h_n(1.628)$, $h_n(1.358)$, and $h_n(1.224)$ for $\alpha = .01$, $.05$, and $.10$ respectively. □

Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose $X_1, \ldots, X_n$ are i.i.d. $F$ and the hypothesis is $H : F = \Phi\left(\frac{\cdot - \mu}{\sigma}\right)$ for some $\mu$, $\sigma$, which is evidently composite. We can proceed as in Example 4.1.5 rewriting $H : F(\mu + \sigma x) = \Phi(x)$ for all $x$ where $\mu = E_F(X_1)$, $\sigma^2 = \mathrm{Var}_F(X_1)$. The natural estimate of the parameter $F(\mu + \sigma x)$ is
$\hat F(\bar X + \hat\sigma x)$ where $\bar X$ and $\hat\sigma^2$ are the MLEs of $\mu$ and $\sigma^2$. Applying the sup distance again, we obtain the statistic
$$T_n = \sup_x |\hat F(\bar X + \hat\sigma x) - \Phi(x)| = \sup_x |\hat G(x) - \Phi(x)|$$
where $\hat G$ is the empirical distribution of $(\Delta_1, \ldots, \Delta_n)$ with $\Delta_i \equiv (X_i - \bar X)/\hat\sigma$. But, under $H$, the joint distribution of $(\Delta_1, \ldots, \Delta_n)$ doesn't depend on $\mu$, $\sigma^2$ and is that of
$$(Z_i - \bar Z)\Big/\left(\frac{1}{n}\sum_{i=1}^{n}(Z_i - \bar Z)^2\right)^{1/2},\quad 1 \le i \le n,$$
where $Z_1, \ldots, Z_n$ are i.i.d. $\mathcal{N}(0, 1)$. (See Section B.3.2.) Thus, $T_n$ has the same distribution $\mathcal{L}_0$ under $H$, whatever be $\mu$ and $\sigma^2$, and the critical value may be obtained by simulating i.i.d. observations $Z_i$, $1 \le i \le n$, from $\mathcal{N}(0, 1)$, then computing the $T_n$ corresponding to those $Z_i$. We do this $B$ times independently, thereby obtaining $T_{n1}, \ldots, T_{nB}$. Now the Monte Carlo critical value is the $[(B + 1)(1 - \alpha) + 1]$th order statistic among $T_n, T_{n1}, \ldots, T_{nB}$. □

The p-Value: The Test Statistic as Evidence

Different individuals faced with the same testing problem may have different criteria of size. Experimenter I may be satisfied to reject the hypothesis $H$ using a test with size $\alpha = 0.05$, whereas experimenter II insists on using $\alpha = 0.01$. It is then possible that experimenter I rejects the hypothesis $H$, whereas experimenter II accepts $H$ on the basis of the same outcome $x$ of an experiment. If the two experimenters can agree on a common test statistic $T$, this difficulty may be overcome by reporting the outcome of the experiment in terms of the observed size or p-value or significance probability of the test. This quantity is a statistic that is defined as the smallest level of significance $\alpha$ at which an experimenter using $T$ would reject on the basis of the observed outcome $x$. That is, if the experimenter's critical value corresponds to a test of size less than the p-value, $H$ is not rejected; otherwise, $H$ is rejected.

Consider, for instance, Example 4.1.4. If we observe $X = x = (x_1, \ldots, x_n)$, we would reject $H$ if, and only if, $\alpha$ satisfies
$$T(x) = \sqrt{n}\,\frac{\bar x}{\sigma} \ge -z(\alpha)$$
or upon applying $\Phi$ to both sides if, and only if,
$$\alpha \ge \Phi(-T(x)).$$
Therefore, if $X = x$, the p-value is
$$\Phi(-T(x)) = \Phi\left(\frac{-\sqrt{n}\bar x}{\sigma}\right). \tag{4.1.4}$$
Considered as a statistic the p-value is $\Phi(-\sqrt{n}\bar X/\sigma)$.
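The p-value for the one-sided normal test with known variance, $\Phi(-\sqrt{n}\bar x/\sigma)$, is easy to compute with the error function. The sketch below uses only the standard library; the function names and the sample numbers ($n = 16$, $\sigma = 2$, $\bar x = 0.98$) are hypothetical and chosen so that $T(x) = 1.96$.

```python
from math import erf, sqrt


def Phi(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def p_value(xbar: float, n: int, sigma: float) -> float:
    """p-value Phi(-sqrt(n) * xbar / sigma) for H: mu <= 0 versus K: mu > 0, sigma known."""
    return Phi(-sqrt(n) * xbar / sigma)


def reject(xbar: float, n: int, sigma: float, alpha: float) -> bool:
    """Level alpha test: reject iff alpha is at least the p-value."""
    return alpha >= p_value(xbar, n, sigma)


p = p_value(xbar=0.98, n=16, sigma=2.0)   # T(x) = 4 * 0.98 / 2 = 1.96
print(round(p, 4), reject(0.98, 16, 2.0, 0.05), reject(0.98, 16, 2.0, 0.01))
```

With these numbers an experimenter working at level 0.05 rejects while one working at level 0.01 does not, which is the disagreement the p-value is meant to make transparent.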
In general, let $X$ be a $q$ dimensional random vector. We will show that we can express the p-value simply in terms of the function $\alpha(\cdot)$ defined in (4.1.1). Suppose that we observe $X = x$. Then if we use critical value $c$, we would reject $H$ if, and only if, $T(x) \ge c$. Thus, the largest critical value $c$ for which we would reject is $c = T(x)$. But the size of a test with critical value $c$ is just $\alpha(c)$ and $\alpha(c)$ is decreasing in $c$. Thus, the smallest $\alpha$ for which we would reject corresponds to the largest $c$ for which we would reject and is just $\alpha(T(x))$. We have proved the following.

Proposition 4.1.1. The p-value is $\alpha(T(X))$.

This is in agreement with (4.1.4). Similarly in Example 4.1.3, $\alpha(s) = P_{\theta_0}(S \ge s)$ and the p-value is $\alpha(s)$ where $s$ is the observed value of $S$. The normal approximation is used for the p-value also. Thus, for $\min\{n\theta_0, n(1-\theta_0)\} \ge 5$,

$$P_{\theta_0}(S \ge s) \approx 1 - \Phi\left(\frac{s - \frac{1}{2} - n\theta_0}{[n\theta_0(1-\theta_0)]^{1/2}}\right). \qquad (4.1.5)$$

The p-value is used extensively in situations of the type we described earlier, when $H$ is well defined, but $K$ is not, so that type II error considerations are unclear. In this context, to quote Fisher (1958), "The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. 80).

The p-value can be thought of as a standardized version of our original statistic; that is, $\alpha(T)$ is on the unit interval and when $H$ is simple and $T$ has a continuous distribution, $\alpha(T)$ has a uniform, $U(0,1)$, distribution (Problem 4.1.5).

It is possible to use p-values to combine the evidence relating to a given hypothesis $H$ provided by several different independent experiments producing different kinds of data. For example, if $r$ experimenters use continuous test statistics $T_1, \ldots, T_r$ to produce p-values $\alpha(T_1), \ldots, \alpha(T_r)$, then if $H$ is simple Fisher (1958) proposed using

$$\hat T = -2 \sum_{j=1}^{r} \log \alpha(T_j) \qquad (4.1.6)$$

to test $H$. The statistic $\hat T$ has a chi-square distribution with $2r$ degrees of freedom (Problem 4.1.6). Thus, $H$ is rejected if $\hat T \ge x_{1-\alpha}$ where $x_{1-\alpha}$ is the $(1-\alpha)$th quantile of the $\chi^2_{2r}$ distribution. Various methods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). More generally, these kinds of issues are currently being discussed under the rubric of data-fusion and meta-analysis (e.g., see Hedges and Olkin, 1985).
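Fisher's combination rule is easy to try numerically. The sketch below is only an illustration: the function name `fisher_combined_pvalue` is made up for this example, and it relies on the standard closed form for the upper tail of a chi-square distribution with an even number of degrees of freedom, so no statistical library is needed.

```python
import math

def fisher_combined_pvalue(p_values):
    """Combine independent p-values via Fisher's statistic -2 * sum(log p_j).

    Returns (statistic, combined p-value). Assumes the p-values are
    independent and uniform on (0, 1) under a simple hypothesis H.
    """
    r = len(p_values)
    if r == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    t_hat = -2.0 * sum(math.log(p) for p in p_values)
    # Chi-square with 2r degrees of freedom has the closed-form upper tail
    # P(chi2 > t) = exp(-t/2) * sum_{i=0}^{r-1} (t/2)^i / i!
    half = t_hat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, r):
        term *= half / i
        total += term
    return t_hat, math.exp(-half) * total
```

With a single p-value the combined p-value is that p-value itself, and two experiments that each report 0.05 combine to a value below 0.05, which is the sense in which the evidence accumulates.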
The preceding paragraph gives an example in which the hypothesis specifies a distribution completely; that is, under $H$, $\alpha(T_j)$ has a $U(0,1)$ distribution. This is an instance of testing goodness of fit; that is, we test whether the distribution of $X$ is different from a specified $F_0$.

Summary. We introduce the basic concepts and terminology of testing statistical hypotheses and give the Neyman-Pearson framework. In particular, we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter $\theta$ belongs to $\Theta_0$ or $\Theta_1$, where $\Theta_0$ and $\Theta_1$ are disjoint subsets of the parameter space $\Theta$. We introduce the basic concepts of simple and composite hypotheses, (null) hypothesis $H$ and alternative (hypothesis) $K$, test functions, critical regions, test statistics, type I error, type II error, significance level, size, power, power function, and p-value. In the Neyman-Pearson framework, we specify a small number $\alpha$ and construct tests that have at most probability (significance level) $\alpha$ of rejecting $H$ (deciding $K$) when $H$ is true; then, subject to this restriction, we try to maximize the probability (power) of rejecting $H$ when $K$ is true.

4.2 CHOOSING A TEST STATISTIC: THE NEYMAN-PEARSON LEMMA

We have seen how a hypothesis-testing problem is defined and how performance of a given test $\delta$, or equivalently, a given test statistic $T$, is measured in the Neyman-Pearson theory. Typically a test statistic is not given but must be chosen on the basis of its performance. In Sections 3.2 and 3.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. In this section we will consider the problem of finding the level $\alpha$ test that has the highest possible power. Such a test and the corresponding test statistic are called most powerful (MP).

We start with the problem of testing a simple hypothesis $H : \theta = \theta_0$ versus a simple alternative $K : \theta = \theta_1$. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by

$$L(x, \theta_0, \theta_1) = \frac{p(x, \theta_1)}{p(x, \theta_0)}$$

where $p(x, \theta)$ is the density or frequency function of the random vector $X$. The statistic $L$ takes on the value $\infty$ when $p(x, \theta_1) > 0$, $p(x, \theta_0) = 0$; and, by convention, equals 0 when both numerator and denominator vanish.

The statistic $L$ is reasonable for testing $H$ versus $K$ with large values of $L$ favoring $K$ over $H$. For instance, in the binomial example (4.1.3),

$$L(x, \theta_0, \theta_1) = (\theta_1/\theta_0)^S [(1-\theta_1)/(1-\theta_0)]^{n-S} = [\theta_1(1-\theta_0)/\theta_0(1-\theta_1)]^S [(1-\theta_1)/(1-\theta_0)]^n, \qquad (4.2.1)$$

which is large when $S = \sum X_i$ is large, and $S$ tends to be large when $K : \theta = \theta_1 > \theta_0$ is true.
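The claim that the binomial likelihood ratio is large when $S$ is large can be checked directly. This is a minimal sketch; `binomial_lr` is an illustrative name, and the values $n = 10$, $\theta_0 = 0.3$, $\theta_1 = 0.5$ are arbitrary choices for the demonstration.

```python
def binomial_lr(s, n, theta0, theta1):
    """Simple likelihood ratio p(x, theta1) / p(x, theta0) for s successes
    in n Bernoulli trials; the binomial coefficient cancels in the ratio."""
    return (theta1 / theta0) ** s * ((1 - theta1) / (1 - theta0)) ** (n - s)

# Illustrative values: the ratio should increase with s because theta1 > theta0.
ratios = [binomial_lr(s, 10, 0.3, 0.5) for s in range(11)]
is_increasing = all(a < b for a, b in zip(ratios, ratios[1:]))
```

Because the ratio is strictly increasing in $s$, rejecting for large $L$ and rejecting for large $S$ define the same family of critical regions.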
We call $\varphi_k$ a likelihood ratio or Neyman-Pearson (NP) test (function) if for some $0 \le k \le \infty$ we can write the test function $\varphi_k$ as

$$\varphi_k(x) = \begin{cases} 1 & \text{if } L(x, \theta_0, \theta_1) > k \\ 0 & \text{if } L(x, \theta_0, \theta_1) < k \end{cases}$$

with $\varphi_k(x)$ any value in $(0, 1)$ if equality occurs. Note (Section 3.2) that $\varphi_k$ is a Bayes rule with $k = \pi/(1-\pi)$, where $\pi$ denotes the prior probability of $\{\theta_0\}$. We show that in addition to being Bayes optimal, $\varphi_k$ is MP for level $E_{\theta_0}\varphi_k(X)$.

Because we want results valid for all possible test sizes $\alpha$ in $[0, 1]$, we consider randomized tests $\varphi$, which are tests that may take values in $(0, 1)$. If $0 < \varphi(x) < 1$ for the observation vector $x$, the interpretation is that we toss a coin with probability of heads $\varphi(x)$ and reject $H$ iff the coin shows heads. (See also Section 1.3.) For instance, if we want size $\alpha = .05$ in Example 4.1.3 with $n = 10$ and $\theta_0 = 0.3$, we choose $\varphi(x) = 0$ if $S < 5$, $\varphi(x) = 1$ if $S > 5$, and $\varphi(x) = [0.05 - P(S > 5)]/P(S = 5) = .0262$ if $S = 5$. Such randomized tests are not used in practice. They are only used to show that with randomization, likelihood ratio tests are unbeatable no matter what the size $\alpha$ is.

Theorem 4.2.1. (Neyman-Pearson Lemma).
(a) If $\alpha > 0$ and $\varphi_k$ is a size $\alpha$ likelihood ratio test, then $\varphi_k$ is MP in the class of level $\alpha$ tests.
(b) For each $0 \le \alpha \le 1$ there exists an MP size $\alpha$ likelihood ratio test provided that randomization is permitted, $0 < \varphi(x) < 1$, for some $x$.
(c) If $\varphi$ is an MP level $\alpha$ test, then it must be a level $\alpha$ likelihood ratio test; that is, there exists $k$ such that

$$P_\theta[\varphi(X) \ne \varphi_k(X),\ L(X, \theta_0, \theta_1) \ne k] = 0 \qquad (4.2.2)$$

for $\theta = \theta_0$ and $\theta = \theta_1$.

Proof. (a) Let $E_i$ denote $E_{\theta_i}$, $i = 0, 1$, and suppose $\varphi$ is a level $\alpha$ test, then

$$E_0\varphi_k(X) = \alpha, \quad E_0\varphi(X) \le \alpha. \qquad (4.2.3)$$

We want to show $E_1[\varphi_k(X) - \varphi(X)] \ge 0$. To this end consider

$$E_1[\varphi_k(X) - \varphi(X)] - kE_0[\varphi_k(X) - \varphi(X)] = E_0\left\{[\varphi_k(X) - \varphi(X)]\left[\frac{p(X, \theta_1)}{p(X, \theta_0)} - k\right]\right\} + E_1\{[\varphi_k(X) - \varphi(X)]\,1\{p(X, \theta_0) = 0\}\} = I + II \text{ (say)},$$

where $I = E_0\{\varphi_k(X)[L(X, \theta_0, \theta_1) - k] - \varphi(X)[L(X, \theta_0, \theta_1) - k]\}$.

Because $L(x, \theta_0, \theta_1) - k$ is $\le 0$ or $\ge 0$ according as $\varphi_k(x)$ is 0 or in $(0, 1]$, and because $0 \le \varphi(x) \le 1$, then $I \ge 0$. Note that $\alpha > 0$ implies $k < \infty$ and, thus, $\varphi_k(x) = 1$ if $p(x, \theta_0) = 0$. It follows that $II \ge 0$. Finally, using (4.2.3), we have shown that

$$E_1[\varphi_k(X) - \varphi(X)] \ge kE_0[\varphi_k(X) - \varphi(X)] \ge 0. \qquad (4.2.4)$$
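The randomization constant in the binomial illustration can be recomputed from exact binomial probabilities. The sketch below is one way to do it; the names `binom_pmf` and `randomized_binomial_test` are hypothetical, and the search simply walks down from $k = n$ until adding the next point mass would push the upper tail past $\alpha$.

```python
from math import comb

def binom_pmf(s, n, theta):
    """P(S = s) for S ~ B(n, theta)."""
    return comb(n, s) * theta**s * (1 - theta) ** (n - s)

def randomized_binomial_test(n, theta0, alpha):
    """Return (k, gamma) for the size-alpha test that rejects when S > k
    and rejects with probability gamma when S == k."""
    upper_tail = 0.0  # holds P(S > k) under theta0 as k decreases
    for k in range(n, -1, -1):
        p_k = binom_pmf(k, n, theta0)
        if upper_tail + p_k > alpha:
            return k, (alpha - upper_tail) / p_k
        upper_tail += p_k
    return -1, 1.0  # alpha at least 1: always reject

k, gamma = randomized_binomial_test(10, 0.3, 0.05)
```

For $n = 10$, $\theta_0 = 0.3$, $\alpha = .05$ this gives $k = 5$ and a randomization probability of about 0.026, in line with the value quoted in the text (small differences in the last digits presumably come from rounded table entries).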
Consider Example 3.Section 4. I x) = (1 1T)p(X.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0. tben. 00 . Let Pi denote Po" i = 0.pk is MP size a. define a.2. Remark 4. 60 . 00 .k] on the set {x : L(x.5) If 0. 0. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level. 2 (7 T(X) ~.2. we conc1udefrom (4. = 'Pk· Moreover.1.Oo.. 'P(X) > a with equality iffp(" 60 ) = P(·. 0. k = 0 makes E. 00 ) (11T)L(x.3.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x. 0. = v. Here is an example illustrating calculation of the most powerful level a test <{Jk.) + 1T' (4. Therefore.O. 00 . OJ Corollary 4. We found nv2} V n L(X. 0. It follows that (4.2. then 'Pk is MP size a. Now 'Pk is MP size a. also be easily argued from this Bayes property of 'Pk (Problem 4. denotes the Bayes procedure of Exarnple 3. 'Pk(X) =a .7.2. See Problem 4.) < k. that is. 00 .2.10).2..2) holds forO = 0. OJ.) + 1Tp(X. 00 .2 where X = (XI. 0.) .I. If not. OJ! > k] > If Po[L(X.) > then to have equality in (4. 0. 0. The same argument works for x E {x : p(x. Because Po[L(X.) = k] = 0. X n ) is a sample of n N(j.L. 00 .2(b). 00 .Oo. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it.O.O.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v.. Example 4. . { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions.) (1 . then there exists k < 00 such that Po[L(X. 00 .(X.fit. If a = 1. 0. OJ! > k] < a and Po[T.2..1. where v is a known signal. Next consider 0 < Q: < 1.) = k j.2.. for 0 < a < 1.1T)L(x.2.v) + nv ] 2(72 x .5) thatthis 0.Po[L(X.v)=exp 2LX'2 ..) (1 . OJ! = ooJ ~ 0.OI)' Proof.2. when 1T = k/(k + 1). 0./f 'P is an MP level a test.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x.1. Part (a) of the lemma can. Then the posterior probability of (). decides 0. k = 00 makes 'Pk MP size a. 
or 00 according as 1T(01 Ix) is larger than or smaller than 1/2.0.fit [ logL(X. is 1T(O. 00 ) > and 0 = 00 . 0.) > k] Po[L(X. . then E9.1T)p(x. (c) Let x E {x: p(x. 'Pk(X) = 1 and !.
is also optimal for this problem. But $T$ is the test statistic we proposed in Example 4.1.4. From our discussion there we know that for any specified $\alpha$, the test that rejects if, and only if,

$$T \ge z(1 - \alpha) \qquad (4.2.6)$$

has probability of type I error $\alpha$.

The power of this test is, by (4.1.2), $\Phi(z(\alpha) + (v\sqrt{n}/\sigma))$. By the Neyman-Pearson lemma this is the largest power available with a level $\alpha$ test. Thus, if we want the probability of detecting a signal $v$ to be at least a preassigned value $\beta$ (say, .90 or .95), then we solve $\Phi(z(\alpha) + (v\sqrt{n}/\sigma)) = \beta$ for $n$ and find that we need to take $n = (\sigma/v)^2[z(1-\alpha) + z(\beta)]^2$. This is the smallest possible $n$ for any size $\alpha$ test. □

An interesting feature of the preceding example is that the test defined by (4.2.6) that is MP for a specified signal $v$ does not depend on $v$: The same test maximizes the power for all possible signals $v > 0$. Such a test is called uniformly most powerful (UMP).

We will discuss the phenomenon further in the next section. The following important example illustrates, among other things, that the UMP test phenomenon is largely a feature of one-dimensional parameter problems.

Example 4.2.2. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Function. Suppose $X \sim N(\mu_j, \Sigma_j)$, $\theta_j = (\mu_j, \Sigma_j)$, $j = 0, 1$. The likelihood ratio test for $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ is based on

$$L(\theta_0, \theta_1) = \frac{\det^{1/2}(\Sigma_0)\exp\{-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1)\}}{\det^{1/2}(\Sigma_1)\exp\{-\frac{1}{2}(x - \mu_0)^T\Sigma_0^{-1}(x - \mu_0)\}}.$$

Rejecting $H$ for $L$ large is equivalent to rejecting for

$$Q \equiv (X - \mu_0)^T\Sigma_0^{-1}(X - \mu_0) - (X - \mu_1)^T\Sigma_1^{-1}(X - \mu_1)$$

large. Particularly important is the case $\Sigma_0 = \Sigma_1$ when "$Q$ large" is equivalent to "$F \equiv (\mu_1 - \mu_0)^T\Sigma_0^{-1}X$ large." The function $F$ is known as the Fisher discriminant function. It is used in a classification context in which $\theta_0$, $\theta_1$ correspond to two known populations and we desire to classify a new observation $X$ as belonging to one or the other. We return to this in Volume II.

Note that in general the test statistic $L$ depends intrinsically on $\mu_0$, $\mu_1$. However if, say, $\mu_1 = \mu_0 + \lambda\Delta_0$, $\lambda > 0$ and $\Sigma_1 = \Sigma_0$, then, if $\mu_0$, $\Delta_0$, $\Sigma_0$ are known, a UMP (for all $\lambda$) test exists and is given by: Reject if

$$\Delta_0^T\Sigma_0^{-1}(X - \mu_0) \ge c \qquad (4.2.7)$$

where $c = z(1 - \alpha)[\Delta_0^T\Sigma_0^{-1}\Delta_0]^{1/2}$ (Problem 4.2.8). If $\Delta_0 = (1, 0, \ldots, 0)^T$ and $\Sigma_0 = I$, then this test rule is to reject $H$ if $X_1$ is large; however, if $\Sigma_0 \ne \Sigma_1$, this is no longer the case (Problem 4.2.9). In this example we have assumed that $\theta_0$ and $\theta_1$ for the two populations are known. If this is not the case, they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population covariances. □
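The sample-size rule $n = (\sigma/v)^2[z(1-\alpha) + z(\beta)]^2$ for the Gaussian signal-detection example can be evaluated with the standard library alone. The following is a rough sketch; `power` and `sample_size` are illustrative names, and the signal-to-noise ratio $v/\sigma = 0.5$ used below is an assumed value, not one from the text.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power(n, v, sigma, alpha):
    """Power Phi(z(alpha) + v*sqrt(n)/sigma) of the level-alpha test that
    rejects for large sqrt(n) * Xbar / sigma."""
    return Z.cdf(Z.inv_cdf(alpha) + v * sqrt(n) / sigma)

def sample_size(v, sigma, alpha, beta):
    """Smallest integer n with n >= (sigma/v)^2 * [z(1-alpha) + z(beta)]^2."""
    return ceil((sigma / v) ** 2 * (Z.inv_cdf(1 - alpha) + Z.inv_cdf(beta)) ** 2)

# Assumed inputs: v/sigma = 0.5, alpha = .05, target power beta = .90
n_needed = sample_size(0.5, 1.0, 0.05, 0.90)
```

Checking `power(n_needed, ...)` against `power(n_needed - 1, ...)` confirms that the returned $n$ is the first sample size at which the target power is reached.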
Summary. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis $H : \theta = \theta_0$ versus the simple alternative $K : \theta = \theta_1$. The Neyman-Pearson lemma, which states that the size $\alpha$ SLR test is uniquely most powerful (MP) in the class of level $\alpha$ tests, is established. We note the connection of the MP test to the Bayes procedure of Section 3.2 for deciding between $\theta_0$ and $\theta_1$. Two examples in which the MP test does not depend on $\theta_1$ are given. Such tests are said to be UMP (uniformly most powerful).

4.3 UNIFORMLY MOST POWERFUL TESTS AND MONOTONE LIKELIHOOD RATIO MODELS

We saw in the two Gaussian examples of Section 4.2 that UMP tests for one-dimensional parameter problems exist. This phenomenon is not restricted to the Gaussian case as the next example illustrates. Before we give the example, here is the general definition of UMP:

Definition 4.3.1. A level $\alpha$ test $\varphi^*$ is uniformly most powerful (UMP) for $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$ if

$$\beta(\theta, \varphi^*) \ge \beta(\theta, \varphi) \text{ for all } \theta \in \Theta_1, \qquad (4.3.1)$$

for any other level $\alpha$ test $\varphi$.

Example 4.3.1. Testing for a Multinomial Vector. Suppose that $(N_1, \ldots, N_k)$ has a multinomial $M(n, \theta_1, \ldots, \theta_k)$ distribution with frequency function,

$$p(n_1, \ldots, n_k, \theta) = \frac{n!}{n_1! \cdots n_k!}\,\theta_1^{n_1} \cdots \theta_k^{n_k}$$

where $n_1, \ldots, n_k$ are integers summing to $n$. With such data we often want to test a simple hypothesis $H : \theta_1 = \theta_{10}, \ldots, \theta_k = \theta_{k0}$. For instance, if a match in a genetic breeding experiment can result in $k$ types, $n$ offspring are observed, and $N_i$ is the number of offspring of type $i$, then $(N_1, \ldots, N_k) \sim M(n, \theta_1, \ldots, \theta_k)$. The simple hypothesis would correspond to the theory that the expected proportion of offspring of types $1, \ldots, k$ are given by $\theta_{10}, \ldots, \theta_{k0}$. Usually the alternative to $H$ is composite. However, there is sometimes a simple alternative theory $K : \theta_1 = \theta_{11}, \ldots, \theta_k = \theta_{k1}$. In this case, the likelihood ratio $L$ is

$$L = \prod_{i=1}^{k}\left(\frac{\theta_{i1}}{\theta_{i0}}\right)^{N_i}.$$

Here is an interesting special case: Suppose $\theta_{j0} > 0$ for all $j$, $0 < \epsilon < 1$ and for some fixed integer $l$ with $1 \le l \le k$

$$\theta_{l1} = \epsilon\theta_{l0};\quad \theta_{j1} = \rho\theta_{j0},\ j \ne l, \qquad (4.3.2)$$

where $\rho = (1 - \theta_{l0})^{-1}(1 - \epsilon\theta_{l0})$.
That is, under the alternative, type $l$ is less frequent than under $H$ and the conditional probabilities of the other types given that type $l$ has not occurred are the same under $K$ as they are under $H$. Then $L = \rho^{n - N_l}\epsilon^{N_l} = \rho^n(\epsilon/\rho)^{N_l}$. Because $\epsilon < 1$ implies that $\rho > \epsilon$, we conclude that the MP test rejects $H$, if and only if, $N_l \le c$. Critical values for level $\alpha$ are easily determined because $N_l \sim B(n, \theta_{l0})$ under $H$. Moreover, for $\alpha = P(N_l \le c)$, this test is UMP for testing $H$ versus $K : \theta \in \Theta_1 = \{\theta : \theta$ is of the form (4.3.2) with $0 < \epsilon < 1\}$. Note that because $l$ can be any of the integers $1, \ldots, k$, we get radically different best tests depending on which $\theta_i$ we assume to be $\theta_{l0}$ under $H$. □

Typically the MP test of $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ depends on $\theta_1$ and the test is not UMP. However, we have seen three models where, in the case of a real parameter, there is a statistic $T$ such that the test with critical region $\{x : T(x) \ge c\}$ is UMP. This is part of a general phenomenon we now describe.

Definition 4.3.2. The family of models $\{P_\theta : \theta \in \Theta\}$ with $\Theta \subset R$ is said to be a monotone likelihood ratio (MLR) family if for $\theta_1 < \theta_2$ the distributions $P_{\theta_1}$ and $P_{\theta_2}$ are distinct and the ratio $p(x, \theta_2)/p(x, \theta_1)$ is an increasing function of $T(x)$.

Example 4.3.2 (Example 4.1.3 continued). In this i.i.d. Bernoulli case, set $s = \sum_{i=1}^n x_i$, then

$$p(x, \theta) = \theta^s(1 - \theta)^{n-s} = (1 - \theta)^n[\theta/(1 - \theta)]^s$$

and the model is by (4.2.1) MLR in $s$.

Example 4.3.3. Consider the one-parameter exponential family model

$$p(x, \theta) = h(x)\exp\{\eta(\theta)T(x) - B(\theta)\}.$$

If $\eta(\theta)$ is strictly increasing in $\theta \in \Theta$, then this family is MLR. Example 4.2.1 is of this form with $T(x) = \sqrt{n}\,\bar x/\sigma$ and $\eta(\mu) = \sqrt{n}\,\mu/\sigma$, where $\sigma$ is known.

Define the Neyman-Pearson (NP) test function

$$\delta_t(x) = \begin{cases} 1 & \text{if } T(x) > t \\ 0 & \text{if } T(x) < t \end{cases} \qquad (4.3.3)$$

with $\delta_t(x)$ any value in $(0, 1)$ if $T(x) = t$. Consider the problem of testing $H : \theta = \theta_0$ versus $K : \theta = \theta_1$ with $\theta_0 < \theta_1$. If $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$, then $L(x, \theta_0, \theta_1) = h(T(x))$ for some increasing function $h$. Thus, $\delta_t$ equals the likelihood ratio test $\varphi_{h(t)}$ and is MP. Because $\delta_t$ does not depend on $\theta_1$, it is UMP at level $\alpha = E_{\theta_0}\delta_t(X)$ for testing $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, in fact.

Theorem 4.3.1. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$.
(1) For each $t \in (0, \infty)$, the power function $\beta(\theta) = E_\theta\delta_t(X)$ is increasing in $\theta$.
(2) If $E_{\theta_0}\delta_t(X) = \alpha > 0$, then $\delta_t$ is UMP level $\alpha$ for testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.
Proof. (1) follows from $\delta_t = \varphi_{h(t)}$ and Corollary 4.2.1 by noting that for any $\theta_1 < \theta_2$, $\delta_t$ is MP at level $E_{\theta_1}\delta_t(X)$ for testing $H : \theta = \theta_1$ versus $K : \theta = \theta_2$. To show (2), recall that we have seen that $\delta_t$ maximizes the power for testing $H : \theta = \theta_0$ versus $K : \theta > \theta_0$ among the class of tests with level $\alpha = E_{\theta_0}\delta_t(X)$. If $\theta < \theta_0$, then by (1), $E_\theta\delta_t(X) \le \alpha$ and $\delta_t$ is of level $\alpha$ for $H : \theta \le \theta_0$. Because the class of tests with level $\alpha$ for $H : \theta \le \theta_0$ is contained in the class of tests with level $\alpha$ for $H : \theta = \theta_0$, and because $\delta_t$ maximizes the power over this larger class, $\delta_t$ is UMP for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$. □

The following useful result follows immediately.

Corollary 4.3.1. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$. If the distribution function $F_0$ of $T(X)$ under $X \sim P_{\theta_0}$ is continuous and if $t(1 - \alpha)$ is a solution of $F_0(t) = 1 - \alpha$, then the test that rejects $H$ if and only if $T(x) \ge t(1 - \alpha)$ is UMP level $\alpha$ for testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.

Example 4.3.4. Testing Precision. Suppose $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population, where $\mu$ is a known standard, and we are interested in the precision $\sigma^{-1}$ of the measurements $X_1, \ldots, X_n$. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test $H : \sigma \ge \sigma_0$ versus $K : \sigma < \sigma_0$, where $\sigma_0^{-1}$ represents the minimum tolerable precision. Let $S = \sum_{i=1}^n (X_i - \mu)^2$, then

$$p(x, \theta) = \exp\left\{-\frac{1}{2\sigma^2}S - \frac{n}{2}\log(2\pi\sigma^2)\right\}.$$

This is a one-parameter exponential family and is MLR in $T = -S$. The UMP level $\alpha$ test rejects $H$ if and only if $S \le s(\alpha)$ where $s(\alpha)$ is such that $P_{\sigma_0}(S \le s(\alpha)) = \alpha$. If we write

$$\frac{S}{\sigma_0^2} = \sum_{i=1}^n\left(\frac{X_i - \mu}{\sigma_0}\right)^2$$

we see that $S/\sigma_0^2$ has a $\chi^2_n$ distribution. Thus, the critical constant $s(\alpha)$ is $\sigma_0^2 x_n(\alpha)$, where $x_n(\alpha)$ is the $\alpha$th quantile of the $\chi^2_n$ distribution. □

Example 4.3.5. Quality Control. Suppose that, as in Example 1.1.1, $X$ is the observed number of defectives in a sample of $n$ chosen at random without replacement from a lot of $N$ items containing $b$ defectives, where $b = N\theta$. If the inspector making the test considers lots with $b_0 = N\theta_0$ defectives or more unsatisfactory, she formulates the hypothesis $H$ as $\theta \ge \theta_0$, the alternative $K$ as $\theta < \theta_0$, and specifies an $\alpha$ such that the probability of rejecting $H$ (keeping a bad lot) is at most $\alpha$. If $\alpha$ is a value taken on by the distribution of $X$, we now show that the test $\delta^*$ which rejects $H$ if, and only if, $X \le h(\alpha)$, where $h(\alpha)$ is the $\alpha$th quantile of the hypergeometric, $\mathcal{H}(N\theta_0, N, n)$, distribution, is UMP level $\alpha$. For simplicity suppose that $b_0 \ge n$, $N - b_0 \ge n$. Then, if $N\theta_1 = b_1 < b_0$ and $0 \le x \le b_1$, (1.1.1) yields

$$L(x, \theta_0, \theta_1) = \frac{b_1(b_1 - 1)\cdots(b_1 - x + 1)(N - b_1)\cdots(N - b_1 - n + x + 1)}{b_0(b_0 - 1)\cdots(b_0 - x + 1)(N - b_0)\cdots(N - b_0 - n + x + 1)}.$$
Note that $L(x, \theta_0, \theta_1) = 0$ for $b_1 < x \le n$. Thus, for $0 \le x \le b_1 - 1$,

$$\frac{L(x + 1, \theta_0, \theta_1)}{L(x, \theta_0, \theta_1)} = \frac{b_1 - x}{b_0 - x}\cdot\frac{(N - n + 1) - (b_0 - x)}{(N - n + 1) - (b_1 - x)} < 1.$$

Therefore, $L$ is decreasing in $x$ and the hypergeometric model is an MLR family in $T(x) = -x$. It follows that $\delta^*$ is UMP level $\alpha$. The critical values for the hypergeometric distribution are available on statistical calculators and software. □

Power and Sample Size

In the Neyman-Pearson framework we choose the test whose size is small. That is, we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis $H$ is small. On the other hand, we would also like large power $\beta(\theta)$ when $\theta \in \Theta_1$; that is, we want the probability of correctly detecting an alternative $K$ to be large. However, as seen in Figure 4.1.1 and formula (4.1.2), this is, in general, not possible for all parameters in the alternative $\Theta_1$. In both these cases, $H$ and $K$ are of the form $H : \theta \le \theta_0$ and $K : \theta > \theta_0$, and the powers are continuous increasing functions with $\lim_{\theta \downarrow \theta_0}\beta(\theta) = \alpha$. By Corollary 4.3.1, this is a general phenomenon in MLR family models with $p(x, \theta)$ continuous in $\theta$.

This continuity of the power shows that not too much significance can be attached to acceptance of $H$, if all points in the alternative are of equal significance: We can find $\theta > \theta_0$ sufficiently close to $\theta_0$ so that $\beta(\theta)$ is arbitrarily close to $\beta(\theta_0) = \alpha$. For such $\theta$ the probability of falsely accepting $H$ is almost $1 - \alpha$.

This is not serious in practice if we have an indifference region. This is a subset of the alternative on which we are willing to tolerate low power. In our normal example 4.1.4 we might be uninterested in values of $\mu$ in $(0, \Delta)$ for some small $\Delta > 0$ because such improvements are negligible. Thus, $(0, \Delta)$ would be our indifference region. Off the indifference region, we want guaranteed power as well as an upper bound on the probability of type I error. In our example this means that in addition to the indifference region and level $\alpha$, we specify $\beta$ close to 1 and would like to have $\beta(\mu) \ge \beta$ for all $\mu \ge \Delta$. This is possible for arbitrary $\beta < 1$ only by making the sample size $n$ large enough. In Example 4.1.4 because $\beta(\mu)$ is increasing, the appropriate $n$ is obtained by solving

$$\beta(\Delta) = \Phi(z(\alpha) + \sqrt{n}\,\Delta/\sigma) = \beta$$

for sample size $n$. This equation is equivalent to

$$z(\alpha) + \sqrt{n}\,\Delta/\sigma = z(\beta)$$

whose solution is

$$n = (\Delta/\sigma)^{-2}[z(1 - \alpha) + z(\beta)]^2.$$

Note that a small signal-to-noise ratio $\Delta/\sigma$ will require a large sample size $n$.
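For the quality-control test of Example 4.3.5, the lower critical value of the hypergeometric distribution can be found by accumulating exact probabilities. This is a small sketch under stated assumptions: the function names are hypothetical, and instead of requiring $\alpha$ to be a value taken on by the distribution, it returns the largest $c$ whose attained size does not exceed $\alpha$.

```python
from math import comb

def hypergeom_pmf(x, N, b, n):
    """P(X = x) when n items are drawn without replacement from a lot of
    N items containing b defectives."""
    if x < max(0, n - (N - b)) or x > min(n, b):
        return 0.0
    return comb(b, x) * comb(N - b, n - x) / comb(N, n)

def lower_critical_value(N, b0, n, alpha):
    """Largest c with P(X <= c) <= alpha when the lot has b0 defectives,
    together with that attained size; c = -1 means the test never rejects."""
    c, size = -1, 0.0
    for x in range(n + 1):
        p = hypergeom_pmf(x, N, b0, n)
        if size + p > alpha:
            break
        c, size = x, size + p
    return c, size

# Tiny hand-checkable case: N = 10, b0 = 5, n = 2, so P(X = 0) = 10/45.
c, size = lower_critical_value(10, 5, 2, 0.25)
```

The test then rejects $H$ (accepts the lot) when the observed number of defectives is at most `c`; when the attained size falls short of $\alpha$, the test is conservative unless randomization is used.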
Dual to the problem of not having enough power is that of having too much. It is natural to associate statistical significance with practical significance so that a very low p-value is interpreted as evidence that the alternative that holds is physically significant, that is, far from the hypothesis. Formula (4.1.2) shows that, if $n$ is very large and/or $\sigma$ is small, we can have very great power for alternatives very close to 0. This problem arises particularly in goodness-of-fit tests (see Example 4.1.5), when we test the hypothesis that a very large sample comes from a particular distribution. Such hypotheses are often rejected even though for practical purposes "the fit is good enough." The reason is that $n$ is so large that unimportant small discrepancies are picked up. There are various ways of dealing with this problem. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is $\alpha$. In Example 4.1.4 this would mean rejecting $H$ if, and only if,

$$\sqrt{n}\,\frac{\bar X}{\sigma} \ge z(1 - \alpha) + \sqrt{n}\,\frac{\Delta}{\sigma}.$$

As a further example and precursor to Section 5.4.4, we next show how to find the sample size that will "approximately" achieve desired power $\beta$ for the size $\alpha$ test in the binomial example.

Example 4.3.6 (Example 4.1.3 continued). Our discussion uses the classical normal approximation to the binomial distribution. First, to achieve approximate size $\alpha$, we solve $\beta(\theta_0) = P_{\theta_0}(S \ge s) = \alpha$ for $s$ using (4.1.4) and find the approximate critical value

$$s_0 = n\theta_0 + \frac{1}{2} + z(1 - \alpha)[n\theta_0(1 - \theta_0)]^{1/2}.$$

Again using the normal approximation, we find

$$\beta(\theta) = P_\theta(S \ge s_0) = \Phi\left(\frac{n\theta + \frac{1}{2} - s_0}{[n\theta(1 - \theta)]^{1/2}}\right).$$

Now consider the indifference region $(\theta_0, \theta_1)$, where $\theta_1 = \theta_0 + \Delta$, $\Delta > 0$. We solve $\beta(\theta_1) = \beta$ for $n$ and find the approximate solution

$$n = (\theta_1 - \theta_0)^{-2}\left\{z(1 - \alpha)[\theta_0(1 - \theta_0)]^{1/2} + z(\beta)[\theta_1(1 - \theta_1)]^{1/2}\right\}^2.$$

For instance, if $\alpha = .05$, $\beta = .90$, $\theta_0 = 0.3$, and $\theta_1 = 0.35$, we need

$$n = (0.05)^{-2}\left\{1.645 \times [0.3(0.7)]^{1/2} + 1.282 \times [0.35(0.65)]^{1/2}\right\}^2 = 745.6.$$

Thus, the size .05 binomial test of $H : \theta = 0.3$ requires approximately 746 observations to have probability .90 of detecting the 17% increase in $\theta$ from 0.3 to 0.35. □

Our discussion can be generalized. Suppose $\theta$ is a vector. Often there is a function $q(\theta)$ such that $H$ and $K$ can be formulated as $H : q(\theta) \le q_0$ and $K : q(\theta) > q_0$. Now let
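The normal-approximation sample size for the binomial test follows from setting $n\theta_1 + \frac{1}{2} - s_0 = z(\beta)[n\theta_1(1-\theta_1)]^{1/2}$ with $s_0 = n\theta_0 + \frac{1}{2} + z(1-\alpha)[n\theta_0(1-\theta_0)]^{1/2}$ and solving for $n$. The sketch below evaluates the resulting expression; `binomial_sample_size` is an illustrative name and the inputs are the ones used in the binomial example.

```python
from math import ceil, sqrt
from statistics import NormalDist

def binomial_sample_size(theta0, theta1, alpha, beta):
    """Approximate n so that the size-alpha test of theta0 has power beta
    at theta1, using the normal approximation to the binomial."""
    z = NormalDist().inv_cdf
    root = (z(1 - alpha) * sqrt(theta0 * (1 - theta0))
            + z(beta) * sqrt(theta1 * (1 - theta1)))
    return (root / (theta1 - theta0)) ** 2

n_approx = binomial_sample_size(0.3, 0.35, 0.05, 0.90)
n_required = ceil(n_approx)
```

With $\alpha = .05$, $\beta = .90$, $\theta_0 = 0.3$, $\theta_1 = 0.35$ this evaluates to a little over 745, so 746 observations after rounding up. The square roots matter: dropping them from the same inputs would give a figure of roughly 162, which is far too small for the stated power.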
$q_1 > q_0$ be a value such that we want to have power $\beta(\theta)$ at least $\beta$ when $q(\theta) \ge q_1$. The set $\{\theta : q_0 < q(\theta) < q_1\}$ is our indifference region. For each $n$ suppose we have a level $\alpha$ test for $H$ versus $K$ based on a suitable test statistic $T$. Suppose that $\beta(\theta)$ depends on $\theta$ only through $q(\theta)$ and is a continuous increasing function of $q(\theta)$, and also increases to 1 for fixed $\theta \in \Theta_1$ as $n \to \infty$. To achieve level $\alpha$ and power at least $\beta$, first let $c_0$ be the smallest number $c$ such that

$$P_{\theta_0}[T \ge c] \le \alpha.$$

Then let $n$ be the smallest integer such that

$$P_{\theta_1}[T \ge c_0] \ge \beta$$

where $\theta_0$ is such that $q(\theta_0) = q_0$ and $\theta_1$ is such that $q(\theta_1) = q_1$. This procedure can be applied, for instance, to the F test of the linear model in Section 6.1 by taking $q(\theta)$ equal to the noncentrality parameter governing the distribution of the statistic under the alternative.

Implicit in this calculation is the assumption that $P_{\theta_1}[T \ge c_0]$ is an increasing function of $n$.

We have seen in Example 4.1.5 that a particular test statistic can have a fixed distribution $\mathcal{L}_0$ under the hypothesis. It may also happen that the distribution of $T_n$ as $\theta$ ranges over $\Theta_1$ is determined by a one-dimensional parameter $\lambda(\theta)$ so that $\Theta_0 = \{\theta : \lambda(\theta) = 0\}$ and $\Theta_1 = \{\theta : \lambda(\theta) > 0\}$ and $\mathcal{L}_\theta(T_n) = \mathcal{L}_{\lambda(\theta)}(T_n)$ for all $\theta$. The theory we have developed demonstrates that if $\mathcal{L}_\lambda(T_n)$ is an MLR family, then rejecting for large values of $T_n$ is UMP among all tests based on $T_n$. Reducing the problem to choosing among such tests comes from invariance considerations that we do not enter into until Volume II. However, we illustrate what can happen with a simple example.

Example 4.3.7. Testing Precision Continued. Suppose that in the Gaussian model of Example 4.3.4, $\mu$ is unknown. Then the MLE of $\sigma^2$ is $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ as in Example 2.2.9. Although $H : \sigma = \sigma_0$ is now composite, the distribution of $T_n = n\hat\sigma^2/\sigma_0^2$ is $\chi^2_{n-1}$, independent of $\mu$. Thus, the critical value for testing $H : \sigma = \sigma_0$ versus $K : \sigma < \sigma_0$ and rejecting $H$ if $T_n$ is small, is the $\alpha$ percentile of $\chi^2_{n-1}$. It is evident from the argument of Example 4.3.3 that this test is UMP for $H : \sigma \ge \sigma_0$ versus $K : \sigma < \sigma_0$ among all tests depending on $\hat\sigma^2$ only. □

Complete Families of Tests

The Neyman-Pearson framework is based on using the 0-1 loss function. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions $l(\theta, a)$, $a \in A = \{0, 1\}$, $\theta \in \Theta$, that are not 0-1. For instance, for $\Theta_1 = (\theta_0, \infty)$, we may consider $l(\theta, 0) = (\theta - \theta_0)$, $\theta \in \Theta_1$. In general, when testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$, a reasonable class of loss functions are those that satisfy

$$l(\theta, 1) - l(\theta, 0) > 0 \text{ for } \theta < \theta_0$$
$$l(\theta, 1) - l(\theta, 0) < 0 \text{ for } \theta > \theta_0. \qquad (4.3.4)$$
The class $\mathcal{D}$ of decision procedures is said to be complete$^{(1),(2)}$ if for any decision rule $\varphi$ there exists $\delta \in \mathcal{D}$ such that

$$R(\theta, \delta) \le R(\theta, \varphi) \text{ for all } \theta \in \Theta. \qquad (4.3.5)$$

That is, if the model is correct and loss function is appropriate, then any procedure not in the complete class can be matched or improved at all $\theta$ by one in the complete class. Thus, it isn't worthwhile to look outside of complete classes. In the following the decision procedures are test functions.

Theorem 4.3.2. Suppose $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R$, is an MLR family in $T(x)$ and suppose the loss function $l(\theta, a)$ satisfies (4.3.4), then the class of tests of the form (4.3.3) with $E\delta_t(X) = \alpha$, $0 \le \alpha \le 1$, is complete.

Proof. The risk function of any test rule $\varphi$ is

$$R(\theta, \varphi) = E_\theta\{\varphi(X)l(\theta, 1) + [1 - \varphi(X)]l(\theta, 0)\} = E_\theta\{l(\theta, 0) + [l(\theta, 1) - l(\theta, 0)]\varphi(X)\}.$$

Let $\delta_t(X)$ be such that, for some $\theta_0$, $E_{\theta_0}\delta_t(X) = E_{\theta_0}\varphi(X) > 0$. If $E_\theta\varphi(X) \equiv 0$ for all $\theta$ then $\delta_\infty(X)$ clearly satisfies (4.3.5). Now $\delta_t$ is UMP for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$ by Theorem 4.3.1 and, hence,

$$R(\theta, \delta_t) - R(\theta, \varphi) = (l(\theta, 1) - l(\theta, 0))(E_\theta(\delta_t(X)) - E_\theta(\varphi(X))) \le 0 \text{ for } \theta > \theta_0. \qquad (4.3.6)$$

But $1 - \delta_t$ is similarly UMP for $H : \theta \ge \theta_0$ versus $K : \theta < \theta_0$ (Problem 4.3.12) and, hence, $E_\theta(1 - \delta_t(X)) = 1 - E_\theta\delta_t(X) \ge 1 - E_\theta\varphi(X)$ for $\theta < \theta_0$. Thus, (4.3.5) holds for all $\theta$. □

Summary. We consider models $\{P_\theta : \theta \in \Theta\}$ for which there exist tests that are most powerful for every $\theta$ in a composite alternative $\Theta_1$ (UMP tests). For $\theta$ real, a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing $\theta_0$ versus $\theta_1$ is an increasing function of a statistic $T(x)$ for every $\theta_0 < \theta_1$. For MLR models, the test that rejects $H : \theta \le \theta_0$ for large values of $T(x)$ is UMP for $K : \theta > \theta_0$. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from $H$. We also show how, when UMP tests do not exist, locally most powerful (LMP) tests in some cases can be found. Finally, we show that for MLR models, the class of NP tests is complete in the sense that for loss functions other than the 0-1 loss function, the risk of any procedure can be matched or improved by an NP test.

4.4 CONFIDENCE BOUNDS, INTERVALS, AND REGIONS

We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter $\theta$ is a
member of a specified set $\Theta_0$. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability $1 - \alpha$. As an illustration consider Example 4.1.4 where $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Suppose that $\mu$ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome $X = (X_1, \ldots, X_n)$ to establish a lower bound $\underline{\mu}(X)$ for $\mu$ with a prescribed probability $(1 - \alpha)$ of being correct. In the non-Bayesian framework, $\mu$ is a constant, and we look for a statistic $\underline{\mu}(X)$ that satisfies $P(\underline{\mu}(X) \le \mu) = 1 - \alpha$ with $1 - \alpha$ equal to .95 or some other desired level of confidence. In our example this is achieved by writing

$$P\left(\frac{\sqrt{n}(\bar X - \mu)}{\sigma} \le z(1 - \alpha)\right) = 1 - \alpha.$$

By solving the inequality inside the probability for $\mu$, we find

$$P(\bar X - \sigma z(1 - \alpha)/\sqrt{n} \le \mu) = 1 - \alpha$$

and

$$\underline{\mu}(X) = \bar X - \sigma z(1 - \alpha)/\sqrt{n}$$

is a lower bound with $P(\underline{\mu}(X) \le \mu) = 1 - \alpha$. We say that $\underline{\mu}(X)$ is a lower confidence bound with confidence level $1 - \alpha$.

Similarly, as in (1.3.8), we may be interested in an upper bound on a parameter. In the $N(\mu, \sigma^2)$ example this means finding a statistic $\bar\mu(X)$ such that $P(\bar\mu(X) \ge \mu) = 1 - \alpha$; and a solution is

$$\bar\mu(X) = \bar X + \sigma z(1 - \alpha)/\sqrt{n}.$$

Here $\bar\mu(X)$ is called an upper level $(1 - \alpha)$ confidence bound for $\mu$.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find $a$ such that the probability that the interval $[\bar X - a, \bar X + a]$ contains $\mu$ is $1 - \alpha$. We find such an interval by noting

$$P\left(\frac{\sqrt{n}\,|\bar X - \mu|}{\sigma} \le z(1 - \tfrac{1}{2}\alpha)\right) = 1 - \alpha$$

and solving the inequality inside the probability for $\mu$. This gives

$$P(\mu^-(X) \le \mu \le \mu^+(X)) = 1 - \alpha$$

where

$$\mu^\pm(X) = \bar X \pm \sigma z(1 - \tfrac{1}{2}\alpha)/\sqrt{n}.$$

We say that $[\mu^-(X), \mu^+(X)]$ is a level $(1 - \alpha)$ confidence interval for $\mu$.

In general, if $\nu = \nu(P)$, $P \in \mathcal{P}$, is a parameter, and $X \sim P$, $X \in R^q$, it may not be possible for a bound or interval to achieve exactly probability $(1 - \alpha)$ for a prescribed $(1 - \alpha)$ such as .95. In this case, we settle for a probability at least $(1 - \alpha)$. That is,
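The known-variance bounds and interval for a normal mean translate directly into a few lines of code. This is a minimal sketch with an illustrative function name; the data, $\sigma = 2$, and $\alpha = .05$ in the example call are assumptions chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist, fmean

def normal_mean_bounds(xs, sigma, alpha):
    """Level (1 - alpha) lower bound, upper bound, and two-sided interval
    for the mean of N(mu, sigma^2) data when sigma is known."""
    z = NormalDist().inv_cdf
    xbar = fmean(xs)
    se = sigma / sqrt(len(xs))
    lower = xbar - se * z(1 - alpha)          # one-sided lower bound
    upper = xbar + se * z(1 - alpha)          # one-sided upper bound
    half = se * z(1 - alpha / 2)              # two-sided half-width
    return lower, upper, (xbar - half, xbar + half)

# Assumed data: 16 observations averaging 1.0, with sigma = 2 and alpha = .05
lower, upper, interval = normal_mean_bounds([1.0] * 16, 2.0, 0.05)
```

Note that the one-sided bounds use $z(1 - \alpha)$ while the interval uses $z(1 - \frac{1}{2}\alpha)$, so the interval endpoints lie outside the two one-sided bounds at the same level.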
Definition 4.4.1. A statistic $\underline{\nu}(X)$ is called a level $(1 - \alpha)$ lower confidence bound for $\nu$ if for every $P \in \mathcal{P}$,

$$P[\underline{\nu}(X) \le \nu] \ge 1 - \alpha.$$

Similarly, $\bar\nu(X)$ is called a level $(1 - \alpha)$ upper confidence bound for $\nu$ if for every $P \in \mathcal{P}$,

$$P[\bar\nu(X) \ge \nu] \ge 1 - \alpha.$$

Moreover, the random interval $[\underline{\nu}(X), \bar\nu(X)]$ formed by a pair of statistics $\underline{\nu}(X)$, $\bar\nu(X)$ is a level $(1 - \alpha)$ or a $100(1 - \alpha)\%$ confidence interval for $\nu$ if, for all $P \in \mathcal{P}$,

$$P[\underline{\nu}(X) \le \nu \le \bar\nu(X)] \ge 1 - \alpha.$$

The quantities on the left are called the probabilities of coverage and $(1 - \alpha)$ is called a confidence level.

For a given bound or interval, the confidence level is clearly not unique because any number $(1 - \alpha') \le (1 - \alpha)$ will be a confidence level if $(1 - \alpha)$ is. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just

$$\inf\{P[\underline{\nu}(X) \le \nu \le \bar\nu(X)],\ P \in \mathcal{P}\}$$

(i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed the probability of coverage is independent of $P$ and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let $X_1, \ldots, X_n$ be a sample from a $N(\mu, \sigma^2)$ population, and assume initially that $\sigma^2$ is known. In the preceding discussion we used the fact that $Z(\mu) = \sqrt{n}(\bar X - \mu)/\sigma$ has a $N(0, 1)$ distribution to obtain a confidence interval for $\mu$ by solving $-z(1 - \frac{1}{2}\alpha) \le Z(\mu) \le z(1 - \frac{1}{2}\alpha)$ for $\mu$. In this process $Z(\mu)$ is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots. Now we turn to the $\sigma^2$ unknown case and propose the pivot $T(\mu)$ obtained by replacing $\sigma$ in $Z(\mu)$ by its estimate $s$, where

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2.$$

That is, we will need the distribution of

$$T(\mu) = \frac{\sqrt{n}(\bar X - \mu)}{s}.$$

Now $Z(\mu) = \sqrt{n}(\bar X - \mu)/\sigma$ has a $N(0, 1)$ distribution and is, by Theorem B.3.3, independent of $V = (n - 1)s^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution. We conclude from the definition of the (Student) t distribution in Section B.3.1 that $Z(\mu)/\sqrt{V/(n - 1)} = T(\mu)$
.. or very heavytailed distributions such as the Cauchy.355s/31 is the desired level 0.4. " ..i. then P(X("I) < V(<7 2) < x(l.l (I .355s/3. thus. See Figure 5.. For the usual values of a..!a) and tndl .. . N (p" a 2 ).1. Suppose that X 1.stn_ 1 (I .d. we can reasonably replace t n . It turns out that the distribution of the pivot T(J. Example 4.3. Testing and Confidence Regions Chapter 4 1 .4. if we let Xn l(p) denote the pth quantile of the X~1 distribution. . Confidence Intervals and Bounds for the Vartance of a Normal Distribution. the confidence coefficient of (4.l) < t n_ 1 (1.1 distribution and can be used as a pivot. Hence. 0 . To calculate the coefficients t n.. X n are i. we use a calculator. Solving the inequality inside the probability for Jt.a).1) in nonGaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5.a) in the limit as n + 00. The properties of confidence intervals such as (4.. ~.l) is fairly close to the Tn . • " ~. (4. .st n _ 1 (I ..st n_l (1 ~.2) f.a) /. Then (72.236 has the t distribution 7.12). .~a)/.7.1.2. ..)/ v'n].l)s2/X(1.. By solving the inequality inside the probability for a 2 we find that [(n . Up to this point.a).1. ..~a)) = 1.Q). for very skew distributions such as the X2 with few degrees of freedom.3.n is. t n .fii are natural lower and upper confidence bounds with confidence coefficients (1 .1 distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal.Q. . and if al + a2 = a. V( <7 2 ) = (n .~a) = 3. Let tk (p) denote the pth quantile of the P (t n.4.1 )s2/<7 2 has a X. .nJ = 1" X ± sc/..l < X + st n_ 1 (1. • r.. the interval will have probability (1 . distribution.1 (I . we find P [X . we have assumed that Xl. i '1 !. computer software. we enter Table II to find that the probability that a 'TnI variable exceeds 3.11 whatever be Il and Tj.7 (see Problem 8.355 and IX .fii. For instance.a. 
By Theorem B ..~a) < T(J.355 is .1 (p) by the standard normal quantile z(p) for n > 120.. Thus. X . If we assume a 2 < 00. we see that as n + 00 the Tn_l distribution converges in law to the standard normal distribution.fii and X + stn. In this case the interval (4.99 confidence intervaL From the results of Section B.1) has confidence coefficient close to 1 .a.1. .01.1) can be much larger than 1 .4.) /.."2)' (n .005.l)s2/X("I)1 is a confidence interval with confidence coefficient (1 .1) The shortest level (I . . X + 3.)/.~a)/. . . or Tables I and II.Xn is a sample from a N(p" a 2 ) population.l (I .) confidence interval of the type [X .1 (1 . (4."2)) = 1".3. On the other hand. .n < J. X +stn _ 1 (1 Similarly. if n = 9 and a = 0.
The length of this interval is random. There is a unique choice of $\alpha_1$ and $\alpha_2$, which uniformly minimizes expected length among all intervals of this type. It may be shown that for $n$ large, taking $\alpha_1 = \alpha_2 = \tfrac12\alpha$ is not far from optimal (Tate and Klett, 1959).

The pivot $V(\sigma^2)$ similarly yields the respective lower and upper confidence bounds $(n-1)s^2/x_{n-1}(1-\alpha)$ and $(n-1)s^2/x_{n-1}(\alpha)$.

In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for $\sigma^2$ do not have confidence coefficient $1-\alpha$ even in the limit as $n \to \infty$. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown. In Problem 4.4.16 we give an interval with correct limiting coverage probability. □

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If $X_1, \ldots, X_n$ are the indicators of $n$ Bernoulli trials with probability of success $\theta$, then $\bar{X}$ is the MLE of $\theta$. There is no natural "exact" pivot based on $\bar{X}$ and $\theta$. However, by the De Moivre-Laplace theorem, $\sqrt{n}(\bar{X} - \theta)/\sqrt{\theta(1-\theta)}$ has approximately a $N(0, 1)$ distribution. If we use this function as an "approximate" pivot and let $\approx$ denote "approximate equality," we can write
$$P\left[\left|\frac{\sqrt{n}(\bar{X} - \theta)}{\sqrt{\theta(1-\theta)}}\right| \le z(1-\tfrac12\alpha)\right] \approx 1-\alpha.$$
Let $k_\alpha = z(1-\tfrac12\alpha)$ and observe that this is equivalent to
$$P\left[(\bar{X} - \theta)^2 \le \frac{k_\alpha^2}{n}\theta(1-\theta)\right] = P[g(\theta, \bar{X}) \le 0] \approx 1-\alpha$$
where
$$g(\theta, \bar{X}) = \left(1 + \frac{k_\alpha^2}{n}\right)\theta^2 - \left(2\bar{X} + \frac{k_\alpha^2}{n}\right)\theta + \bar{X}^2.$$
For fixed $0 \le \bar{X} \le 1$, $g(\theta, \bar{X})$ is a quadratic polynomial with two real roots. In terms of $S = n\bar{X}$, they are(1)
$$\begin{aligned}
\underline{\theta}(X) &= \left\{S + \frac{k_\alpha^2}{2} - k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\}\Big/(n + k_\alpha^2),\\
\bar{\theta}(X) &= \left\{S + \frac{k_\alpha^2}{2} + k_\alpha\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\}\Big/(n + k_\alpha^2).
\end{aligned} \tag{4.4.3}$$
Because the coefficient of $\theta^2$ in $g(\theta, \bar{X})$ is greater than zero,
$$[\theta : g(\theta, \bar{X}) \le 0] = [\underline{\theta}(X) \le \theta \le \bar{\theta}(X)], \tag{4.4.4}$$
so that $[\underline{\theta}(X), \bar{\theta}(X)]$ is an approximate level $(1-\alpha)$ confidence interval for $\theta$. We can similarly show that the endpoints of the level $(1-2\alpha)$ interval are approximate upper and lower level $(1-\alpha)$ confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of $n\theta$, $n(1-\theta)$ is at least 6. For small $n$, it is better to use the exact level $(1-\alpha)$ procedure developed in Section 4.5. A discussion is given in Brown, Cai, and Das Gupta (2000).

Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion $\theta$ of a population that will buy a product. He draws a sample of $n$ potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that (4.4.4) has length 0.02 and is a confidence interval with confidence coefficient approximately 0.95. To see this, note that the length, say $l$, of the interval is
$$l = 2k_\alpha\left\{\sqrt{[S(n-S)/n] + k_\alpha^2/4}\right\}(n + k_\alpha^2)^{-1}.$$
Now use the fact that
$$S(n-S)/n = \tfrac14 n - n^{-1}(S - \tfrac12 n)^2 \le \tfrac14 n \tag{4.4.5}$$
to conclude that
$$l \le k_\alpha/\sqrt{n + k_\alpha^2}. \tag{4.4.6}$$
Thus, to bound $l$ above by $l_0 = 0.02$, we choose
$$n = \left(\frac{k_\alpha}{l_0}\right)^2 - k_\alpha^2.$$
In this case, $1-\tfrac12\alpha = 0.975$, $k_\alpha = z(1-\tfrac12\alpha) = 1.96$, and we can achieve the desired length 0.02 by choosing $n$ so that
$$n = \left(\frac{1.96}{0.02}\right)^2 - (1.96)^2 = 9{,}600.16, \quad \text{or } n = 9{,}601.$$
This formula for the sample size is very crude because (4.4.5) is used and it is only good when $\theta$ is near 1/2. Better results can be obtained if one has upper or lower bounds on $\theta$ such as $\theta \le \theta_0 < \tfrac12$, $\theta \ge \theta_1 > \tfrac12$. See Problem 4.4.4.

Another approximate pivot for this example is $\sqrt{n}(\bar{X} - \theta)/\sqrt{\bar{X}(1-\bar{X})}$. This leads to the simple interval
$$\bar{X} \pm z(1-\tfrac12\alpha)\sqrt{\bar{X}(1-\bar{X})}/\sqrt{n}. \tag{4.4.7}$$
See Brown, Cai, and Das Gupta (2000) for a discussion. □
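The computations in Example 4.4.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `score_interval` and `sample_size` and the data (50 successes in 100 trials) are hypothetical, and $k_\alpha$ is rounded to 1.96 as in the text.

```python
import math

def score_interval(s, n, k):
    """Roots (4.4.3) of g(theta, xbar) = 0, written in terms of S = n * xbar."""
    center = s + k * k / 2.0
    half = k * math.sqrt(s * (n - s) / n + k * k / 4.0)
    denom = n + k * k
    return (center - half) / denom, (center + half) / denom

def sample_size(k, l0):
    """Smallest integer n with k / sqrt(n + k^2) <= l0, using the bound (4.4.6)."""
    return math.ceil((k / l0) ** 2 - k * k)

k = 1.96  # z(0.975), rounded as in the text
lower, upper = score_interval(50, 100, k)  # hypothetical data: S = 50, n = 100
n_needed = sample_size(k, 0.02)
print(lower, upper, n_needed)
```

With $S = n/2$ the interval attains the bound (4.4.6) exactly, which is the worst case the crude sample size formula is designed for.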
Confidence Regions for Functions of Parameters

We can define confidence regions for a function $q(\theta)$ as random subsets of the range of $q$ that cover the true value of $q(\theta)$ with probability at least $(1-\alpha)$.

Example 4.4.4. Let $X_1, \ldots, X_n$ denote the number of hours a sample of internet subscribers spend per week on the Internet. Suppose $X_1, \ldots, X_n$ is modeled as a sample from an exponential, $\mathcal{E}(\theta^{-1})$, distribution, and suppose we want a confidence interval for the population proportion $P(X \ge x)$ of subscribers that spend at least $x$ hours per week on the Internet. Here $q(\theta) = 1 - F(x) = \exp\{-x/\theta\}$. By Problem B.3.4, $2n\bar{X}/\theta$ has a chi-square, $\chi^2_{2n}$, distribution. By using $2n\bar{X}/\theta$ as a pivot we find the $(1-\alpha)$ confidence interval
$$2n\bar{X}/x(1-\tfrac12\alpha) \le \theta \le 2n\bar{X}/x(\tfrac12\alpha)$$
where $x(\beta)$ denotes the $\beta$th quantile of the $\chi^2_{2n}$ distribution. Let $\underline{\theta}$ and $\bar{\theta}$ denote the lower and upper boundaries of this interval, then
$$\exp\{-x/\underline{\theta}\} \le q(\theta) \le \exp\{-x/\bar{\theta}\}$$
is a confidence interval for $q(\theta)$ with confidence coefficient $(1-\alpha)$. □

Note that if $C(X)$ is a level $(1-\alpha)$ confidence region for $\theta$, then $q(C(X)) = \{q(\theta) : \theta \in C(X)\}$ is a level $(1-\alpha)$ confidence region for $q(\theta)$. If $q$ is not 1-1, this technique is typically wasteful. That is, we can find confidence regions for $q(\theta)$ entirely contained in $q(C(X))$ with confidence level $(1-\alpha)$. For instance, we will later give confidence regions $C(X)$ for pairs $\theta = (\theta_1, \theta_2)^T$. In this case, if $q(\theta) = \theta_1$, $q(C(X))$ is larger than the confidence set obtained by focusing on $\theta_1$ alone.

Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions $q(\theta)$ to $r$-dimensional vectors $q(\theta) = (q_1(\theta), \ldots, q_r(\theta))$. Suppose $\underline{q}_j(X)$ and $\bar{q}_j(X)$ are real-valued. Then the $r$-dimensional random rectangle
$$I(X) = \{q(\theta) : \underline{q}_j(X) \le q_j(\theta) \le \bar{q}_j(X),\ j = 1, \ldots, r\}$$
is said to be a level $(1-\alpha)$ confidence region, if the probability that it covers the unknown but fixed true $(q_1(\theta), \ldots, q_r(\theta))$ is at least $(1-\alpha)$. We write this as
$$P[q(\theta) \in I(X)] \ge 1-\alpha.$$
Note that if $I_j(X) = [\underline{T}_j, \bar{T}_j]$ is a level $(1-\alpha_j)$ confidence interval for $q_j(\theta)$ and if the pairs $(\underline{T}_1, \bar{T}_1), \ldots, (\underline{T}_r, \bar{T}_r)$ are independent, then the rectangle $I(X) = I_1(X) \times \cdots \times I_r(X)$ has level
$$\prod_{j=1}^{r}(1-\alpha_j). \tag{4.4.8}$$
Thus, an $r$-dimensional confidence rectangle is in this case automatically obtained from the one-dimensional intervals. Moreover, if we choose $\alpha_j = 1 - (1-\alpha)^{1/r}$, then $I(X)$ has confidence level $1-\alpha$.

An approach that works even if the $I_j$ are not independent is to use Bonferroni's inequality (A.2.7). According to this inequality,
$$P[q(\theta) \in I(X)] \ge 1 - \sum_{j=1}^{r} P[q_j(\theta) \notin I_j(X)] \ge 1 - \sum_{j=1}^{r}\alpha_j.$$
Thus, if we choose $\alpha_j = \alpha/r$, $j = 1, \ldots, r$, then $I(X)$ has confidence level $(1-\alpha)$.

Example 4.4.5. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose $X_1, \ldots, X_n$ is a $N(\mu, \sigma^2)$ sample and we want a confidence rectangle for $(\mu, \sigma^2)$. From Example 4.4.1,
$$I_1(X) = \bar{X} \pm s\,t_{n-1}(1-\tfrac14\alpha)/\sqrt{n}$$
is a confidence interval for $\mu$ with confidence coefficient $(1-\tfrac12\alpha)$. From Example 4.4.2,
$$I_2(X) = \left[\frac{(n-1)s^2}{x_{n-1}(1-\tfrac14\alpha)},\ \frac{(n-1)s^2}{x_{n-1}(\tfrac14\alpha)}\right]$$
is a reasonable confidence interval for $\sigma^2$ with confidence coefficient $(1-\tfrac12\alpha)$. Thus, $I_1(X) \times I_2(X)$ is a level $(1-\alpha)$ confidence rectangle for $(\mu, \sigma^2)$. It is possible to show that the exact confidence coefficient is $(1-\tfrac12\alpha)^2$. See Problem 4.4.15. □

The method of pivots can also be applied to $\infty$-dimensional parameters such as $F$.

Example 4.4.6. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim P$, and we are interested in the distribution function $F(t) = P(X \le t)$; that is, $\nu(P) = F(\cdot)$. We assume that $F$ is continuous, in which case (Proposition 4.1.1) the distribution of
$$D_n(F) = \sup_{t \in R}|\hat{F}(t) - F(t)|$$
does not depend on $F$ and is known (Example 4.1.5). That is, $D_n(F)$ is a pivot. Let $d_\alpha$ be chosen such that $P_F(D_n(F) \le d_\alpha) = 1-\alpha$. Then by solving $D_n(F) \le d_\alpha$ for $F$, we find that a simultaneous in $t$ size $1-\alpha$ confidence region $C(x)(\cdot)$ is the confidence band which, for each $t \in R$, consists of the interval
$$C(x)(t) = \left(\max\{0, \hat{F}(t) - d_\alpha\},\ \min\{1, \hat{F}(t) + d_\alpha\}\right).$$
We have shown
$$P(C(X)(t) \ni F(t) \text{ for all } t \in R) = 1-\alpha$$
for all $P \in \mathcal{P}$ = set of $P$ with $P(-\infty, t]$ continuous in $t$. □
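As a rough illustration of the band in Example 4.4.6, the following Python sketch computes $C(x)(t)$ at a given $t$. The text takes $d_\alpha$ from the exact distribution of $D_n(F)$; here, as a swapped-in assumption, the half-width comes from the Dvoretzky-Kiefer-Wolfowitz inequality, $P(D_n(F) > d) \le 2\exp(-2nd^2)$, which gives a conservative $d$ with coverage at least $1-\alpha$ rather than exactly $1-\alpha$. The function names and the data are hypothetical.

```python
import math

def dkw_half_width(n, alpha):
    """Conservative d with P(D_n(F) <= d) >= 1 - alpha (DKW inequality, not the exact d_alpha)."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def ecdf(sample, t):
    """Empirical distribution function F_hat(t) = #{x_i <= t} / n."""
    return sum(1 for x in sample if x <= t) / len(sample)

def band(sample, t, d):
    """Interval (max{0, F_hat(t) - d}, min{1, F_hat(t) + d}) of the confidence band at t."""
    f_hat = ecdf(sample, t)
    return max(0.0, f_hat - d), min(1.0, f_hat + d)

sample = [0.8, 1.9, 2.4, 3.1, 4.7, 5.0, 6.3, 7.7, 8.2, 9.5]  # hypothetical data
d = dkw_half_width(len(sample), 0.05)
low, high = band(sample, 5.0, d)
print(d, low, high)
```

For $n = 10$ the band is wide, which reflects how little a small sample says about the whole of $F$ simultaneously.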
We can apply the notions studied in Examples 4.4.4 and 4.4.5 to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X$ and that $X$ has a density $f(t) = F'(t)$, which is zero for $t < 0$ and nonzero for $t > 0$. By integration by parts, if $\mu = \mu(F) = \int_0^\infty t f(t)\,dt$ exists, then
$$\mu = \int_0^\infty [1 - F(t)]\,dt.$$
Let $\hat{F}^-(t)$ and $\hat{F}^+(t)$ be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then a $(1-\alpha)$ lower confidence bound for $\mu$ is $\underline{\mu}$ given by
$$\underline{\mu} = \int_0^\infty [1 - \hat{F}^+(t)]\,dt \tag{4.4.9}$$
because for $C(X)$ as in Example 4.4.6, $\underline{\mu} = \inf\{\mu(F) : F \in C(X)\} = \mu(\hat{F}^+)$ and $\sup\{\mu(F) : F \in C(X)\} = \mu(\hat{F}^-) = \infty$; see Problem 4.4.19.

Intervals for the case $F$ supported on an interval (see Problem 4.4.18) arise in accounting practice (see Bickel, 1992) where such bounds are discussed and shown to be asymptotically strictly conservative. □

Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and more generally confidence regions. In a parametric model $\{P_\theta : \theta \in \Theta\}$, a level $1-\alpha$ confidence region for a parameter $q(\theta)$ is a set $C(x)$ depending only on the data $x$ such that the probability under $P_\theta$ that $C(X)$ covers $q(\theta)$ is at least $1-\alpha$ for all $\theta \in \Theta$. For a nonparametric class $\mathcal{P} = \{P\}$ and parameter $\nu = \nu(P)$, we similarly require $P(C(X) \ni \nu) \ge 1-\alpha$ for all $P \in \mathcal{P}$. We derive the (Student) t interval for $\mu$ in the $N(\mu, \sigma^2)$ model with $\sigma^2$ unknown, and we derive an exact confidence interval for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence interval for the distribution function $F(t)$ and the mean of a positive variable $X$.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least $1-\alpha$. Acceptance regions of statistical tests are, for a given hypothesis $H$, subsets of the sample space with probability of accepting $H$ at least $1-\alpha$ when $H$ is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses.

We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value $\mu_0$ for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant $n$ times obtaining measurements $X_1, \ldots, X_n$. Knowledge of his instruments leads him to assume that the $X_i$ are independent and identically distributed normal random variables with mean $\mu$ and variance $\sigma^2$. If any value of $\mu$ other than $\mu_0$ is a possible alternative, then it is reasonable to formulate the problem as that of testing $H : \mu = \mu_0$ versus $K : \mu \ne \mu_0$.

We can base a size $\alpha$ test on the level $(1-\alpha)$ confidence interval (4.4.1) we constructed for $\mu$ as follows. We accept $H$, if and only if, the postulated value $\mu_0$ is a member of the level $(1-\alpha)$ confidence interval
$$\left[\bar{X} - s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt{n},\ \bar{X} + s\,t_{n-1}(1-\tfrac12\alpha)/\sqrt{n}\right]. \tag{4.5.1}$$
If we let $T = \sqrt{n}(\bar{X} - \mu_0)/s$, then our test accepts $H$, if and only if, $-t_{n-1}(1-\tfrac12\alpha) \le T \le t_{n-1}(1-\tfrac12\alpha)$. Because $P_{\mu_0}[|T| = t_{n-1}(1-\tfrac12\alpha)] = 0$ the test is equivalently characterized by rejecting $H$ when $|T| \ge t_{n-1}(1-\tfrac12\alpha)$. This test is called two-sided because it rejects for both large and small values of the statistic $T$. In contrast to the tests of Example 4.1.4, it has power against parameter values on either side of $\mu_0$.

Because the same interval (4.5.1) is used for every $\mu_0$ we see that we have, in fact, generated a family of level $\alpha$ tests $\{\delta(X, \mu)\}$ where
$$\delta(X, \mu) = \begin{cases} 1 & \text{if } \sqrt{n}\,|\bar{X} - \mu|/s \ge t_{n-1}(1-\tfrac12\alpha)\\ 0 & \text{otherwise.}\end{cases} \tag{4.5.2}$$
These tests correspond to different hypotheses, $\delta(X, \mu_0)$ being of size $\alpha$ only for the hypothesis $H : \mu = \mu_0$.

Conversely, by starting with the test (4.5.2) we obtain the confidence interval (4.5.1) by finding the set of $\mu$ where $\delta(X, \mu) = 0$.

We achieve a similar effect, generating a family of level $\alpha$ tests, if we start out with (say) the level $(1-\alpha)$ LCB $\bar{X} - t_{n-1}(1-\alpha)s/\sqrt{n}$ and define $\delta^*(X, \mu)$ to equal 1 if, and only if, $\bar{X} - t_{n-1}(1-\alpha)s/\sqrt{n} \ge \mu$. Evidently,
$$P_{\mu_0}[\delta^*(X, \mu_0) = 1] = P_{\mu_0}\left[\sqrt{n}(\bar{X} - \mu_0)/s \ge t_{n-1}(1-\alpha)\right] = \alpha.$$
□

These are examples of a general phenomenon. Consider the general framework where the random vector $X$ takes values in the sample space $\mathcal{X} \subset R^q$ and $X$ has distribution $P \in \mathcal{P}$. Let $\nu = \nu(P)$ be a parameter that takes values in the set $\mathcal{N}$. For instance, in Example 4.4.1, $\mu = \mu(P)$ takes values in $\mathcal{N} = (-\infty, \infty)$, in Example 4.4.2, $\sigma^2 = \sigma^2(P)$ takes values in $\mathcal{N} = (0, \infty)$, and in Example 4.4.5, $(\mu, \sigma^2)$ takes values in $\mathcal{N} = (-\infty, \infty) \times (0, \infty)$. For a function space example, consider $\nu(P) = F$, as in Example 4.4.6, where $F$ is the distribution function of $X_i$. Here an example of $\mathcal{N}$ is the class of all continuous distribution functions. Let $S = S(X)$ be a map from $\mathcal{X}$ to subsets of $\mathcal{N}$, then $S$ is a $(1-\alpha)$ confidence region for $\nu$ if the probability that $S(X)$ contains $\nu$ is at least $(1-\alpha)$, that is
$$P[\nu \in S(X)] \ge 1-\alpha, \text{ all } P \in \mathcal{P}.$$
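The duality illustrated in Example 4.5.1 can be checked numerically. The sketch below is an illustration under assumptions: the function names and the data are hypothetical, and the quantile $t_8(0.995) = 3.355$ is the Table II value quoted in Example 4.4.1 (so $n = 9$, $\alpha = 0.01$). It verifies, over a grid of postulated values $\mu_0$, that the test (4.5.2) accepts exactly when $\mu_0$ lies in the interval (4.5.1).

```python
import math
import statistics

def t_interval(xs, tq):
    """Interval (4.5.1): xbar -/+ s * tq / sqrt(n), with tq = t_{n-1}(1 - alpha/2)."""
    n = len(xs)
    xbar = statistics.mean(xs)
    half = statistics.stdev(xs) * tq / math.sqrt(n)
    return xbar - half, xbar + half

def two_sided_test(xs, mu0, tq):
    """Test (4.5.2): returns 1 (reject H: mu = mu0) when sqrt(n)|xbar - mu0|/s >= tq."""
    n = len(xs)
    t = math.sqrt(n) * (statistics.mean(xs) - mu0) / statistics.stdev(xs)
    return 1 if abs(t) >= tq else 0

xs = [9.8, 10.4, 10.1, 9.6, 10.9, 10.2, 9.9, 10.5, 10.0]  # hypothetical measurements, n = 9
tq = 3.355  # t_8(0.995), the value quoted in Example 4.4.1
lo, hi = t_interval(xs, tq)
grid = [9.0 + 0.01 * i for i in range(201)]  # candidate values of mu0
agree = all((two_sided_test(xs, m, tq) == 0) == (lo < m < hi) for m in grid)
print(lo, hi, agree)
```

The set of grid points where the test accepts is the intersection of the grid with the interval, which is the content of the duality for this model.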
Next consider the testing framework where we test the hypothesis $H = H_{\nu_0} : \nu = \nu_0$ for some specified value $\nu_0$. Suppose we have a test $\delta(X, \nu_0)$ with level $\alpha$. Then the acceptance region
$$A(\nu_0) = \{x : \delta(x, \nu_0) = 0\}$$
is a subset of $\mathcal{X}$ with probability at least $1-\alpha$. For some specified $\nu_0$, $H$ may be accepted, for other specified $\nu_0$, $H$ may be rejected. Consider the set of $\nu_0$ for which $H_{\nu_0}$ is accepted; this is a random set contained in $\mathcal{N}$ with probability at least $1-\alpha$ of containing the true value of $\nu(P)$ whatever be $P$. Conversely, if $S(X)$ is a level $1-\alpha$ confidence region for $\nu$, then the test that accepts $H_{\nu_0}$ if and only if $\nu_0$ is in $S(X)$, is a level $\alpha$ test for $H_{\nu_0}$.

Formally, let $\mathcal{P}_{\nu_0} = \{P : \nu(P) = \nu_0\}$, $\nu_0 \in \mathcal{N}$. We have the following.

Duality Theorem. Let $S(X) = \{\nu_0 \in \mathcal{N} : X \in A(\nu_0)\}$, then
$$P[X \in A(\nu_0)] \ge 1-\alpha \text{ for all } P \in \mathcal{P}_{\nu_0}$$
if and only if $S(X)$ is a $1-\alpha$ confidence region for $\nu$.

We next apply the duality theorem to MLR families:

Theorem 4.5.1. Suppose $X \sim P_\theta$ where $\{P_\theta : \theta \in \Theta\}$ is MLR in $T = T(X)$ and suppose that the distribution function $F_\theta(t)$ of $T$ under $P_\theta$ is continuous in each of the variables $t$ and $\theta$ when the other is fixed. If the equation $F_\theta(t) = 1-\alpha$ has a solution $\underline{\theta}_\alpha(t)$ in $\Theta$, then $\underline{\theta}_\alpha(T)$ is a lower confidence bound for $\theta$ with confidence coefficient $1-\alpha$. Similarly, any solution $\bar{\theta}_\alpha(T)$ of $F_\theta(T) = \alpha$ is an upper confidence bound for $\theta$ with coefficient $(1-\alpha)$. Moreover, if $\alpha_1 + \alpha_2 < 1$, then $[\underline{\theta}_{\alpha_1}, \bar{\theta}_{\alpha_2}]$ is a confidence interval for $\theta$ with confidence coefficient $1 - (\alpha_1 + \alpha_2)$.

Proof. By Corollary 4.3.1, the acceptance region of the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$ can be written
$$A(\theta_0) = \{x : T(x) \le t_{\theta_0}(1-\alpha)\}$$
where $t_{\theta_0}(1-\alpha)$ is the $1-\alpha$ quantile of $F_{\theta_0}$. By the duality theorem, if
$$S(t) = \{\theta \in \Theta : t \le t_\theta(1-\alpha)\},$$
then $S(T)$ is a $1-\alpha$ confidence region for $\theta$. By applying $F_\theta$ to both sides of $t \le t_\theta(1-\alpha)$, we find
$$S(t) = \{\theta \in \Theta : F_\theta(t) \le 1-\alpha\}.$$
By Theorem 4.3.1, the power function $P_\theta(T \ge t) = 1 - F_\theta(t)$ for a test with critical constant $t$ is increasing in $\theta$. That is, $F_\theta(t)$ is decreasing in $\theta$. It follows that $F_\theta(t) \le 1-\alpha$ iff $\theta \ge \underline{\theta}_\alpha(t)$ and $S(t) = [\underline{\theta}_\alpha, \infty)$. The proofs for the upper confidence bound and interval follow by the same type of argument. □

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families: Let $t$ denote the observed value $t = T(x)$ of $T(X)$ for the datum $x$, let $\alpha(t, \theta_0)$ denote the p-value for the UMP size $\alpha$ test of $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, and let
$$A^*(\theta) = T(A(\theta)) = \{T(x) : x \in A(\theta)\}.$$

Corollary 4.5.1. Under the conditions of Theorem 4.5.1,
$$\begin{aligned}
A^*(\theta) &= \{t : \alpha(t, \theta) \ge \alpha\} = (-\infty, t_\theta(1-\alpha)],\\
S(t) &= \{\theta : \alpha(t, \theta) \ge \alpha\} = [\underline{\theta}_\alpha(t), \infty).
\end{aligned}$$

Proof. The p-value is
$$\alpha(t, \theta) = P_\theta(T \ge t) = 1 - F_\theta(t).$$
We have seen in the proof of Theorem 4.5.1 that $1 - F_\theta(t)$ is increasing in $\theta$. Because $F_\theta(t)$ is a distribution function, $1 - F_\theta(t)$ is decreasing in $t$. The result follows. □

In general, let $\alpha(t, \nu_0)$ denote the p-value of a test $\delta(T, \nu_0) = 1[T \ge c]$ of $H : \nu = \nu_0$ based on a statistic $T = T(X)$ with observed value $t = T(x)$. Then the set
$$C = \{(t, \nu) : \alpha(t, \nu) \ge \alpha\} = \{(t, \nu) : \delta(t, \nu) = 0\}$$
gives the pairs $(t, \nu)$ where, for the given $t$, $\nu$ will be accepted; and for the given $\nu$, $t$ is in the acceptance region. We call $C$ the set of compatible $(t, \theta)$ points. In the $(t, \theta)$ plane, vertical sections of $C$ are the confidence regions $S(t)$ whereas horizontal sections are the acceptance regions $A^*(\nu) = \{t : \delta(t, \nu) = 0\}$. We illustrate these ideas using the example of testing $H : \mu = \mu_0$ when $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. Let $T = \bar{X}$, then
$$C = \{(t, \mu) : |t - \mu| \le \sigma z(1-\tfrac12\alpha)/\sqrt{n}\}.$$
Figure 4.5.1 shows the set $C$, a confidence region $S(t_0)$, and an acceptance set $A^*(\mu_0)$ for this example.

Figure 4.5.1. The shaded region is the compatibility set $C$ for the two-sided test of $H_{\mu_0} : \mu = \mu_0$ in the normal model. $S(t_0)$ is a confidence interval for $\mu$ for a given value $t_0$ of $T$, whereas $A^*(\mu_0)$ is the acceptance region for $H_{\mu_0}$.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let $X_1, \ldots, X_n$ be the indicators of $n$ binomial trials with probability of success $\theta$. For $\alpha \in (0, 1)$, we seek reasonable exact level $(1-\alpha)$ upper and lower confidence bounds and confidence intervals for $\theta$. To find a lower confidence bound for $\theta$ our preceding discussion leads us to consider level $\alpha$ tests for $H : \theta \le \theta_0$, $\theta_0 \in (0, 1)$. We shall use some of the results derived in Example 4.1.3. Let $k(\theta_0, \alpha)$ denote the critical constant of a level $\alpha$ test of $H$. The corresponding level $(1-\alpha)$ confidence region is given by
$$C(X_1, \ldots, X_n) = \{\theta : S \le k(\theta, \alpha) - 1\},$$
where $S = \sum_{i=1}^{n} X_i$.

To analyze the structure of the region we need to examine $k(\theta, \alpha)$. We claim that

(i) $k(\theta, \alpha)$ is nondecreasing in $\theta$.
(ii) $k(\theta, \alpha) \to k(\theta_0, \alpha)$ if $\theta \uparrow \theta_0$.
(iii) $k(\theta, \alpha)$ increases by exactly 1 at its points of discontinuity.
(iv) $k(0, \alpha) = 1$ and $k(1, \alpha) = n + 1$.

To prove (i) note that it was shown in Theorem 4.3.1(i) that $P_\theta[S \ge j]$ is nondecreasing in $\theta$ for fixed $j$. Clearly, it is also nonincreasing in $j$ for fixed $\theta$. Therefore, $\theta_1 < \theta_2$ and $k(\theta_1, \alpha) > k(\theta_2, \alpha)$ would imply that
$$\alpha \ge P_{\theta_2}[S \ge k(\theta_2, \alpha)] \ge P_{\theta_2}[S \ge k(\theta_1, \alpha) - 1] \ge P_{\theta_1}[S \ge k(\theta_1, \alpha) - 1] > \alpha,$$
a contradiction.

The assertion (ii) is a consequence of the following remarks. If $\theta_0$ is a discontinuity point of $k(\theta, \alpha)$, let $j$ be the limit of $k(\theta, \alpha)$ as $\theta \uparrow \theta_0$. Then $P_\theta[S \ge j] \le \alpha$ for all $\theta < \theta_0$ and, hence, $P_{\theta_0}[S \ge j] \le \alpha$. On the other hand, if $\theta > \theta_0$, $P_\theta[S \ge j] > \alpha$. Therefore, $P_{\theta_0}[S \ge j] = \alpha$ and $j = k(\theta_0, \alpha)$. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define
$$\underline{\theta}(S) = \inf\{\theta : k(\theta, \alpha) = S + 1\},$$
then
$$C(X) = \begin{cases} (\underline{\theta}(S), 1] & \text{if } S > 0\\ [0, 1] & \text{if } S = 0\end{cases}$$
and $\underline{\theta}(S)$ is the desired level $(1-\alpha)$ LCB for $\theta$.(2) Figure 4.5.2 portrays the situation.

Figure 4.5.2. Plot of $k(\theta, 0.16)$ for $n = 2$.

From our discussion, when $S > 0$, then $k(\underline{\theta}(S), \alpha) = S$ and, therefore, we find $\underline{\theta}(S)$ as the unique solution of the equation,
$$\sum_{r=S}^{n}\binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$
When $S = 0$, $\underline{\theta}(S) = 0$. Similarly, we define
$$\bar{\theta}(S) = \sup\{\theta : j(\theta, \alpha) = S - 1\}$$
where $j(\theta, \alpha)$ is given by,
$$\sum_{r=0}^{j(\theta,\alpha)}\binom{n}{r}\theta^r(1-\theta)^{n-r} \le \alpha < \sum_{r=0}^{j(\theta,\alpha)+1}\binom{n}{r}\theta^r(1-\theta)^{n-r}.$$
Then $\bar{\theta}(S)$ is a level $(1-\alpha)$ UCB for $\theta$ and when $S < n$, $\bar{\theta}(S)$ is the unique solution of
$$\sum_{r=0}^{S}\binom{n}{r}\theta^r(1-\theta)^{n-r} = \alpha.$$
When $S = n$, $\bar{\theta}(S) = 1$. Putting the bounds $\underline{\theta}(S)$, $\bar{\theta}(S)$ together we get the confidence interval $[\underline{\theta}(S), \bar{\theta}(S)]$ of level $(1-2\alpha)$. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations.

As might be expected, if $n$ is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.
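The two defining equations of Example 4.5.2 can be solved numerically. The following Python sketch uses plain bisection and is only one possible way to do this, not the algorithm of any particular package; the function names are hypothetical. It relies on the monotonicity established above: $P_\theta[S \ge s]$ is increasing in $\theta$ and $P_\theta[S \le s]$ is decreasing in $\theta$.

```python
import math

def upper_tail(theta, s, n):
    """P_theta[S >= s] for S ~ B(n, theta)."""
    return sum(math.comb(n, r) * theta**r * (1 - theta)**(n - r) for r in range(s, n + 1))

def bisect(f, target, increasing, tol=1e-12):
    """Solve f(theta) = target on (0, 1) for a monotone f by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid  # the root lies to the right of mid
        else:
            hi = mid  # the root lies to the left of mid
    return (lo + hi) / 2

def lower_bound(s, n, alpha):
    """Level (1 - alpha) LCB: solution of P_theta[S >= s] = alpha, or 0 when s = 0."""
    if s == 0:
        return 0.0
    return bisect(lambda th: upper_tail(th, s, n), alpha, increasing=True)

def upper_bound(s, n, alpha):
    """Level (1 - alpha) UCB: solution of P_theta[S <= s] = alpha, or 1 when s = n."""
    if s == n:
        return 1.0
    return bisect(lambda th: 1.0 - upper_tail(th, s + 1, n), alpha, increasing=False)

print(lower_bound(2, 2, 0.16), upper_bound(0, 2, 0.16))
```

For $n = 2$ and $\alpha = 0.16$, the setting of Figure 4.5.2, the lower bound at $S = 2$ solves $\theta^2 = 0.16$ and the upper bound at $S = 0$ solves $(1-\theta)^2 = 0.16$.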
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if $H : \theta = \theta_0$ is rejected in favor of $H : \theta \ne \theta_0$, we usually want to know whether $H : \theta > \theta_0$ or $H : \theta < \theta_0$.

For instance, suppose $\theta$ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test $H : \theta = 0$ versus $K : \theta \ne 0$. If $H$ is rejected, it is natural to carry the comparison of A and B further by asking whether $\theta < 0$ or $\theta > 0$. If we decide $\theta < 0$, then we select A as the better treatment, and vice versa.

The problem of deciding whether $\theta = \theta_0$, $\theta < \theta_0$, or $\theta > \theta_0$ is an example of a three-decision problem and is a special case of the decision problems in Section 1.4, and 3.1-3.3. Here we consider the simple solution suggested by the level $(1-\alpha)$ confidence interval $I$:

1. Make no judgment as to whether $\theta < \theta_0$ or $\theta > \theta_0$ if $I$ contains $\theta_0$;
2. Decide $\theta < \theta_0$ if $I$ is entirely to the left of $\theta_0$; and
3. Decide $\theta > \theta_0$ if $I$ is entirely to the right of $\theta_0$. (4.5.3)

Example 4.5.3. Suppose $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ with $\sigma^2$ known. In Section 4.4 we considered the level $(1-\alpha)$ confidence interval $\bar{X} \pm \sigma z(1-\tfrac12\alpha)/\sqrt{n}$ for $\mu$. Using this interval and (4.5.3) we obtain the following three decision rule based on $T = \sqrt{n}(\bar{X} - \mu_0)/\sigma$:

Do not reject $H : \mu = \mu_0$ if $|T| \le z(1-\tfrac12\alpha)$.
Decide $\mu < \mu_0$ if $T < -z(1-\tfrac12\alpha)$.
Decide $\mu > \mu_0$ if $T > z(1-\tfrac12\alpha)$.

Thus, the two-sided test can be regarded as the first step in the decision procedure where if $H$ is not rejected, we make no claims of significance, but if $H$ is rejected, we decide whether this is because $\mu$ is smaller or larger than $\mu_0$. For this three-decision rule, the probability of falsely claiming significance of either $\mu < \mu_0$ or $\mu > \mu_0$ is bounded above by $\tfrac12\alpha$. To see this consider first the case $\mu \ge \mu_0$. Then the wrong decision "$\mu < \mu_0$" is made when $T < -z(1-\tfrac12\alpha)$. This event has probability
$$P[T < -z(1-\tfrac12\alpha)] = \Phi\left(-z(1-\tfrac12\alpha) - \frac{\sqrt{n}(\mu - \mu_0)}{\sigma}\right) \le \Phi(-z(1-\tfrac12\alpha)) = \tfrac12\alpha.$$
Similarly, when $\mu \le \mu_0$, the probability of the wrong decision is at most $\tfrac12\alpha$. Therefore, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the $\alpha$ of the parent test or confidence interval. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions. □

Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If $\delta(x, \nu_0)$ is a level $\alpha$ test of $H : \nu = \nu_0$, then the set $S(x)$ of $\nu_0$ where $\delta(x, \nu_0) = 0$ is a level $(1-\alpha)$ confidence region for $\nu_0$. If $S(x)$ is a level $(1-\alpha)$ confidence region for $\nu$, then the test that accepts $H : \nu = \nu_0$ when $\nu_0 \in S(x)$ is a level $\alpha$ test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter $\theta$ is $\theta_0$, less than $\theta_0$, or larger than $\theta_0$, where $\theta_0$ is a specified value.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. We next show that for a certain notion of accuracy of confidence bounds, which is connected to the power of the associated one-sided tests, optimality of the tests translates into accuracy of the bounds.

If $\underline{\theta}$ and $\underline{\theta}^*$ are two competing level $(1-\alpha)$ lower confidence bounds for $\theta$, they are both very likely to fall below the true $\theta$. But we also want the bounds to be close to $\theta$. Thus, we say that the bound with the smaller probability of being far below $\theta$ is more accurate. Formally, for $X \in \mathcal{X} \subset R^q$, the following is true.

Definition 4.6.1. A level $(1-\alpha)$ LCB $\underline{\theta}^*$ of $\theta$ is said to be more accurate than a competing level $(1-\alpha)$ LCB $\underline{\theta}$ if, and only if, for any fixed $\theta$ and all $\theta' < \theta$,
$$P_\theta[\underline{\theta}^*(X) \le \theta'] \le P_\theta[\underline{\theta}(X) \le \theta']. \tag{4.6.1}$$
Similarly, a level $(1-\alpha)$ UCB $\bar{\theta}^*$ is more accurate than a competitor $\bar{\theta}$ if, and only if, for any fixed $\theta$ and all $\theta' > \theta$,
$$P_\theta[\bar{\theta}^*(X) \ge \theta'] \le P_\theta[\bar{\theta}(X) \ge \theta']. \tag{4.6.2}$$

Lower confidence bounds $\underline{\theta}^*$ satisfying (4.6.1) for all competitors are called uniformly most accurate as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that $\underline{\theta}^*$ is a uniformly most accurate level $(1-\alpha)$ LCB for $\theta$ if and only if $-\underline{\theta}^*$ is a uniformly most accurate level $(1-\alpha)$ UCB for $-\theta$.

Example 4.6.1 (Examples 3.3.2 and 4.2.1 continued). Suppose $X = (X_1, \ldots, X_n)$ is a sample of $N(\mu, \sigma^2)$ random variables with $\sigma^2$ known. A level $\alpha$ test of $H : \mu = \mu_0$ vs $K : \mu > \mu_0$ rejects $H$ when $\sqrt{n}(\bar{X} - \mu_0)/\sigma \ge z(1-\alpha)$. The dual lower confidence bound is $\underline{\mu}_1(X) = \bar{X} - z(1-\alpha)\sigma/\sqrt{n}$. Using Problem 4.5.6, we find that a competing lower confidence bound is $\underline{\mu}_2(X) = X_{(k)}$, where $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ denotes the ordered $X_1, \ldots, X_n$ and $k$ is defined by $P(S \ge k) = 1-\alpha$ for a binomial, $B(n, \tfrac12)$, random variable $S$. Which lower bound is more accurate? It does turn out that $\underline{\mu}_1(X)$ is more accurate than $\underline{\mu}_2(X)$ and is, in fact, uniformly most accurate in the $N(\mu, \sigma^2)$ model. This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of power functions.
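The two competing lower bounds of Example 4.6.1 are easy to compute side by side. The Python sketch below is illustrative only. Because $S$ is discrete, $P(S \ge k) = 1-\alpha$ usually has no exact solution, so the code makes the assumption of taking the largest $k$ with $P(S \ge k) \ge 1-\alpha$, which keeps the level at least $(1-\alpha)$. The function names, the data, and the value $\sigma = 1$ are hypothetical.

```python
import math
from statistics import NormalDist, mean

def binom_upper_tail(k, n):
    """P(S >= k) for S ~ B(n, 1/2)."""
    return sum(math.comb(n, r) for r in range(k, n + 1)) / 2**n

def order_stat_index(n, alpha):
    """Largest k with P(S >= k) >= 1 - alpha (assumed convention); 0 if no such k >= 1."""
    k = 0
    while k + 1 <= n and binom_upper_tail(k + 1, n) >= 1 - alpha:
        k += 1
    return k

def lcb_normal(xs, sigma, alpha):
    """mu_1(X) = xbar - z(1 - alpha) * sigma / sqrt(n), sigma known."""
    z = NormalDist().inv_cdf(1 - alpha)
    return mean(xs) - z * sigma / math.sqrt(len(xs))

def lcb_order(xs, alpha):
    """mu_2(X) = X_(k), the k-th smallest observation."""
    k = order_stat_index(len(xs), alpha)
    if k == 0:
        raise ValueError("n is too small for an order-statistic bound at this level")
    return sorted(xs)[k - 1]

xs = [0.3, -1.2, 0.8, 1.5, -0.4, 0.1, 2.0, -0.7, 0.9, 0.5]  # hypothetical sample, n = 10
b1 = lcb_normal(xs, 1.0, 0.05)  # sigma = 1 is assumed known
b2 = lcb_order(xs, 0.05)
print(order_stat_index(len(xs), 0.05), b1, b2)
```

A single sample cannot show which bound is more accurate in the sense of (4.6.1); that is a statement about the distributions of the two bounds.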
Theorem 4.6.1. Let $\underline{\theta}^*$ be a level $(1-\alpha)$ LCB for $\theta$, a real parameter, such that for each $\theta_0$ the associated test whose critical function $\delta^*(x, \theta_0)$ is given by
$$\delta^*(x, \theta_0) = \begin{cases} 1 & \text{if } \underline{\theta}^*(x) > \theta_0\\ 0 & \text{otherwise}\end{cases}$$
is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Then $\underline{\theta}^*$ is uniformly most accurate at level $(1-\alpha)$.

Proof. Let $\underline{\theta}$ be a competing level $(1-\alpha)$ LCB. Define $\delta(x, \theta_0)$ by
$$\delta(x, \theta_0) = 0 \text{ if, and only if, } \underline{\theta}(x) \le \theta_0.$$
Then $\delta(X, \theta_0)$ is a level $\alpha$ test for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$. Because $\delta^*(X, \theta_0)$ is UMP level $\alpha$ for $H : \theta = \theta_0$ versus $K : \theta > \theta_0$, for $\theta_1 > \theta_0$ we must have
$$E_{\theta_1}(\delta(X, \theta_0)) \le E_{\theta_1}(\delta^*(X, \theta_0))$$
or
$$P_{\theta_1}[\underline{\theta}(X) > \theta_0] \le P_{\theta_1}[\underline{\theta}^*(X) > \theta_0].$$
Identify $\theta_0$ with $\theta'$ and $\theta_1$ with $\theta$ in the statement of Definition 4.6.1 and the result follows. □

If we apply the result and Example 4.2.1 to Example 4.6.1, we find that $\bar{X} - z(1-\alpha)\sigma/\sqrt{n}$ is uniformly most accurate. However, $X_{(k)}$ does have the advantage that we don't have to know $\sigma$ or even the shape of the density $f$ of $X_i$ to apply it. Also, the robustness considerations of Section 3.5 favor $X_{(k)}$ (see Example 3.5.2).

Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see Problem 4.6.7 for the proof), they have the smallest expected "distance" to $\theta$:

Corollary 4.6.1. Suppose $\underline{\theta}^*(X)$ is a UMA level $(1-\alpha)$ lower confidence bound for $\theta$. Let $\underline{\theta}(X)$ be any other $(1-\alpha)$ lower confidence bound, then
$$E_\theta\{(\theta - \underline{\theta}^*(X))^+\} \le E_\theta\{(\theta - \underline{\theta}(X))^+\}$$
for all $\theta$ where $a^+ = a$, if $a \ge 0$, and 0 otherwise.

We can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define $\underline{q}^*$ to be a uniformly most accurate level $(1-\alpha)$ LCB for $q(\theta)$ if, and only if, for any other level $(1-\alpha)$ LCB $\underline{q}$,
$$P_\theta[\underline{q}^* \le q(\theta')] \le P_\theta[\underline{q} \le q(\theta')]$$
whenever $q(\theta') < q(\theta)$. Most accurate upper confidence bounds are defined similarly.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let $X_1, \ldots, X_n$ be the times to failure of $n$ pieces of equipment where we assume that the $X_i$ are independent $\mathcal{E}(\lambda)$ variables. We want a uniformly most accurate level $(1-\alpha)$ upper confidence bound $\bar{q}^*$ for $q(\lambda) = 1 - e^{-\lambda t_0}$, the probability of early failure of a piece of equipment.
Neyman defines unbiased confidence intervals of level (1 .a)/ 2Ao i=1 n (4. because q is strictly increasing in A. However. the UMP test accepts H if LXi < X2n(1.a) UCB for the probability of early failure. it follows that q( >. a uniformly most accurate level (1 . there exist level (1 .a) intervals for which members with uniformly smallest expected length exist.a) confidence intervals that have uniformly minimum expected lengili among all level (1.a) quantile of the X§n distribution. there does not exist a member of the class of level (1 .1 have this property. *) is a uniformly most accurate level (1 .8. pp. In particular. in the case of lower bounds.a) is the (1. Summary.a) UCB'\* for A. 374376).a) by the property that Pe[T.a) in the sense that. ~ q(()') ~ t] for every ().6. . Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes.. These topics are discussed in Lehmann (1997). we can restrict attention to certain reasonable subclasses of level (1 . 0 Discussion We have only considered confidence bounds. The situation wiili confidence intervals is more complicated. Of course. the lengili t .. in general. If we turn to the expected length Ee(t . Pratt (1961) showed that in many of the classical problems of estimation .\ *) where>.a) UCB for A and. Thus. There are. That is. they are less likely ilian oilier level (1 . the confidence region corresponding to this test is (0..3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1.5.a) iliat has uniformly minimum length among all such intervals.) as a measure of precision.a) unbiased confidence intervals. the situation is still unsatisfactory because. * is by Theorem 4. By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 . 
By Problem 4.a) lower confidence bounds to fall below any value ()' below the true B. as in the estimation problem. 1962.a).. ~ q(B) ~ t] ~ Pe[T. the intervals developed in Example 4. subject to ilie requirement that the confidence level is (1 .a) intervals iliat has minimum expected length for all B.250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1.6.T. Therefore. some large sample results in iliis direction (see Wilks.T. Considerations of accuracy lead us to ask that. B'. the interval must be at least as likely to cover the true value of q( B) as any other value.1. To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO. however. the confidence interval be as short as possible.6. l . is random and it can be shown that in most situations there is no confidence interval of level (1 .
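The bound of Example 4.6.2 can be computed directly. The following is a minimal sketch, not part of the original text: the function names and the failure times are hypothetical, and the $\chi^2_{2n}$ quantile is found by bisection on the closed-form CDF that holds for even degrees of freedom, so that only the Python standard library is needed.

```python
import math

def chi2_cdf_even(x, n):
    """CDF at x of the chi-square distribution with 2n degrees of freedom.

    For even degrees of freedom the CDF has the closed form
    1 - exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!.
    """
    if x <= 0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, n, tol=1e-10):
    """p-th quantile of the chi-square distribution with 2n df, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, n) < p:   # expand until the quantile is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def early_failure_ucb(times, t0, alpha):
    """Return (lambda*, q(lambda*)) for exponential failure times.

    lambda* = chi_{2n}(1 - alpha) / (2 * sum of times) is the upper bound for
    the rate, and q(lambda) = 1 - exp(-lambda * t0) is the early-failure
    probability, which is increasing in lambda.
    """
    n = len(times)
    lam_star = chi2_quantile_even(1.0 - alpha, n) / (2.0 * sum(times))
    return lam_star, 1.0 - math.exp(-lam_star * t0)

# Hypothetical failure times (illustration only, not data from the text).
lam_star, q_star = early_failure_ucb([2.1, 0.7, 3.4, 1.2, 5.0], t0=0.5, alpha=0.05)
print(lam_star, q_star)
```

Because $q$ is increasing, the bound for the early-failure probability is obtained simply by plugging $\lambda^*$ into $q$; no separate inversion is required.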
4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals where the data $X \in \mathcal{X} \subset R^q$ are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a $100(1-\alpha)\%$ confidence interval is that if we repeated an experiment indefinitely, each time computing a $100(1-\alpha)\%$ confidence interval, then $100(1-\alpha)\%$ of the intervals would contain the true unknown parameter value.

In the Bayesian formulation of Sections 1.2 and 1.6.3, what are called level $(1-\alpha)$ credible bounds and intervals are subsets of the parameter space which are given probability at least $(1-\alpha)$ by the posterior distribution of the parameter given the data. Suppose that, given $\theta$, $X$ has distribution $P_\theta$, $\theta \in \Theta \subset R$, and that $\theta$ has the prior probability distribution $\Pi$.

Definition 4.7.1. Let $\Pi(\cdot \mid x)$ denote the posterior probability distribution of $\theta$ given $X = x$, then $\underline{\theta}$ and $\bar{\theta}$ are level $(1-\alpha)$ lower and upper credible bounds for $\theta$ if they respectively satisfy

$$\Pi(\underline{\theta} \le \theta \mid x) \ge 1 - \alpha, \qquad \Pi(\theta \le \bar{\theta} \mid x) \ge 1 - \alpha.$$

Turning to Bayesian credible intervals and regions, it is natural to consider the collection of $\theta$ that is "most likely" under the distribution $\Pi(\theta \mid x)$. Thus,

Definition 4.7.2. Let $\pi(\cdot \mid x)$ denote the density of $\theta$ given $X = x$, then

$$C_k = \{\theta : \pi(\theta \mid x) \ge k\}$$

is called a level $(1-\alpha)$ credible region for $\theta$ if $\Pi(C_k \mid x) \ge 1 - \alpha$.

If $\pi(\theta \mid x)$ is unimodal, then $C_k$ will be an interval of the form $[\underline{\theta}, \bar{\theta}]$. We next give such an example.

Example 4.7.1. Suppose that given $\mu$, $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma_0^2)$ with $\sigma_0^2$ known, and that $\mu \sim N(\mu_0, \tau_0^2)$, with $\mu_0$ and $\tau_0^2$ known. Then, from Example 1.1.12, the posterior distribution of $\mu$ given $X_1, \ldots, X_n$ is $N(\hat{\mu}_B, \hat{\sigma}_B^2)$, with

$$\hat{\mu}_B = \frac{\frac{n}{\sigma_0^2}\bar{x} + \frac{1}{\tau_0^2}\mu_0}{\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}}, \qquad \hat{\sigma}_B^2 = \frac{1}{\frac{n}{\sigma_0^2} + \frac{1}{\tau_0^2}}.$$

It follows that the level $1-\alpha$ lower and upper credible bounds for $\mu$ are

$$\underline{\mu} = \hat{\mu}_B - z_{1-\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}, \qquad \bar{\mu} = \hat{\mu}_B + z_{1-\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}$$
while the level $(1-\alpha)$ credible interval is $[\mu^-, \mu^+]$ with

$$\mu^{\pm} = \hat{\mu}_B \pm z_{1-\frac{1}{2}\alpha}\frac{\sigma_0}{\sqrt{n}\left(1 + \frac{\sigma_0^2}{n\tau_0^2}\right)^{1/2}}.$$

Compared to the frequentist interval $\bar{X} \pm z_{1-\frac{1}{2}\alpha}\sigma_0/\sqrt{n}$, the center $\hat{\mu}_B$ of the Bayesian interval is pulled in the direction of $\mu_0$, where $\mu_0$ is a prior guess of the value of $\mu$. See Example 1.3.4 for sources of such prior guesses. Note that as $\tau_0 \to \infty$, the Bayesian interval tends to the frequentist interval; however, the interpretations of the intervals are different. □

Example 4.7.2. Suppose that given $\sigma^2$, $X_1, \ldots, X_n$ are i.i.d. $N(\mu_0, \sigma^2)$ where $\mu_0$ is known. Let $\lambda = \sigma^{-2}$ and suppose $\lambda$ has the gamma $\Gamma(\tfrac{1}{2}a, \tfrac{1}{2}b)$ density

$$\pi(\lambda) \propto \lambda^{\frac{1}{2}(a-2)}\exp\{-\tfrac{1}{2}b\lambda\}, \quad \lambda > 0,$$

where $a > 0$, $b > 0$ are known parameters. Then, by Problem 1.2.12, given $x_1, \ldots, x_n$, $(t + b)\lambda$ has a $\chi^2_{a+n}$ distribution, where $t = \sum(x_i - \mu_0)^2$. Let $x_{a+n}(\alpha)$ denote the $\alpha$th quantile of the $\chi^2_{a+n}$ distribution. Then $\underline{\lambda} = x_{a+n}(\alpha)/(t + b)$ is a level $(1-\alpha)$ lower credible bound for $\lambda$ and

$$\bar{\sigma}_B^2 = (t + b)/x_{a+n}(\alpha)$$

is a level $(1-\alpha)$ upper credible bound for $\sigma^2$. Compared to the frequentist bound $(n-1)s^2/x_{n-1}(\alpha)$ of Example 4.4.2, $\bar{\sigma}_B^2$ is shifted in the direction of the reciprocal $b/a$ of the mean of $\pi(\lambda)$.

We shall analyze Bayesian credible regions further in Chapter 5. □

Summary. In the Bayesian framework we define bounds and intervals, called level $(1-\alpha)$ credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least $(1-\alpha)$ by the posterior distribution of the parameter $\theta$ given the data $x$. In the case of a normal prior $\pi(\theta)$ and normal model $p(x \mid \theta)$, the level $(1-\alpha)$ credible interval is similar to the frequentist interval except it is pulled in the direction $\mu_0$ of the prior mean and it is a little narrower. However, the interpretations are different: In the frequentist confidence interval, the probability of coverage is computed with the data $X$ random and $\theta$ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with $X = x$ fixed and $\theta$ random with probability distribution $\Pi(\theta \mid X = x)$.

4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable $Y$. In addition to point prediction of $Y$, it is desirable to give an interval $[\underline{Y}, \bar{Y}]$ that contains the unknown value $Y$ with prescribed probability $(1-\alpha)$. For instance, a doctor administering a treatment with delayed effect will give patients a time interval $[\underline{T}, \bar{T}]$ in which the treatment is likely to take effect. Similarly, we may want an interval for the
future GPA of a student or a future value of a portfolio. We define a level $(1-\alpha)$ prediction interval as an interval $[\underline{Y}, \bar{Y}]$ based on data $X$ such that $P(\underline{Y} \le Y \le \bar{Y}) \ge 1 - \alpha$. The problem of finding prediction intervals is similar to finding confidence intervals using a pivot:

Example 4.8.1. The (Student) t Prediction Interval. As in Example 4.4.1, let $X_1, \ldots, X_n$ be i.i.d. as $X \sim N(\mu, \sigma^2)$. We want a prediction interval for $Y = X_{n+1}$, which is assumed to be also $N(\mu, \sigma^2)$ and independent of $X_1, \ldots, X_n$. Let $\hat{Y} = \hat{Y}(X)$ denote a predictor based on $X = (X_1, \ldots, X_n)$. Then $\hat{Y}$ and $Y$ are independent and the mean squared prediction error (MSPE) of $\hat{Y}$ is

$$MSPE(\hat{Y}) = E(\hat{Y} - Y)^2 = E([\hat{Y} - \mu] - [Y - \mu])^2 = E[\hat{Y} - \mu]^2 + \sigma^2.$$

Note that $\hat{Y}$ can be regarded as both a predictor of $Y$ and as an estimate of $\mu$, and when we do so, $MSPE(\hat{Y}) = MSE(\hat{Y}) + \sigma^2$, where MSE denotes the estimation theory mean squared error. It follows that, in this case, the optimal estimator when it exists is also the optimal predictor. In Example 3.4.8, we found that in the class of unbiased estimators, $\bar{X}$ is the optimal estimator. We define a predictor $Y^*$ to be prediction unbiased for $Y$ if $E(Y^* - Y) = 0$, and can conclude that in the class of prediction unbiased predictors, the optimal MSPE predictor is $\hat{Y} = \bar{X}$.

We next use the prediction error $\hat{Y} - Y$ to construct a pivot that can be used to give a prediction interval. Note that

$$\hat{Y} - Y = \bar{X} - X_{n+1} \sim N(0, [n^{-1} + 1]\sigma^2).$$

Moreover, $s^2 = (n-1)^{-1}\sum_{1}^{n}(X_i - \bar{X})^2$ is independent of $\bar{X}$ by Theorem B.3.3 and independent of $X_{n+1}$ by assumption. It follows that

$$Z_p(Y) = \frac{\hat{Y} - Y}{\sqrt{n^{-1} + 1}\,\sigma}$$

has a $N(0, 1)$ distribution and is independent of $V = (n-1)s^2/\sigma^2$, which has a $\chi^2_{n-1}$ distribution. Thus, by the definition of the (Student) t distribution in Section B.3,

$$T_p(Y) \equiv \frac{Z_p(Y)}{\sqrt{V/(n-1)}} = \frac{\hat{Y} - Y}{\sqrt{n^{-1} + 1}\,s}$$

has the t distribution, $\mathcal{T}_{n-1}$. By solving $-t_{n-1}(1 - \tfrac{1}{2}\alpha) \le T_p(Y) \le t_{n-1}(1 - \tfrac{1}{2}\alpha)$ for $Y$, we find the $(1-\alpha)$ prediction interval

$$Y = \bar{X} \pm \sqrt{n^{-1} + 1}\; s\, t_{n-1}(1 - \tfrac{1}{2}\alpha). \qquad (4.8.1)$$

Note that $T_p(Y)$ acts as a prediction interval pivot in the same way that $T(\mu)$ acts as a confidence interval pivot in Example 4.4.1. Also note that the prediction interval is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of
Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate $n^{-\frac{1}{2}}$, whereas the width of the prediction interval tends to $2\sigma z(1 - \tfrac{1}{2}\alpha)$. Moreover, the confidence level of (4.4.1) is approximately correct for large $n$ even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not $(1-\alpha)$ in the limit as $n \to \infty$ for samples from non-Gaussian distributions. □

We next give a prediction interval that is valid from samples from any population with a continuous distribution.

Example 4.8.2. Suppose $X_1, \ldots, X_n$ are i.i.d. as $X \sim F$, where $F$ is a continuous distribution function with positive density $f$ on $(a, b)$, $-\infty \le a < b \le \infty$. Let $X_{(1)} < \cdots < X_{(n)}$ denote the order statistics of $X_1, \ldots, X_n$. We want a prediction interval for $Y = X_{n+1} \sim F$, where $X_{n+1}$ is independent of the data $X_1, \ldots, X_n$. Set $U_i = F(X_i)$, $i = 1, \ldots, n+1$, then, by Problem B.2.12, $U_1, \ldots, U_{n+1}$ are i.i.d. uniform, $\mathcal{U}(0, 1)$. Let $U_{(1)} < \cdots < U_{(n)}$ be $U_1, \ldots, U_n$ ordered, then

$$P(X_{(j)} \le X_{n+1} \le X_{(k)}) = P(U_{(j)} \le U_{n+1} \le U_{(k)})$$
$$= \int P(u \le U_{n+1} \le v \mid U_{(j)} = u, U_{(k)} = v)\,dH(u, v)$$
$$= \int (v - u)\,dH(u, v) = E(U_{(k)}) - E(U_{(j)})$$

where $H$ is the joint distribution of $U_{(j)}$ and $U_{(k)}$. By Problem B.2.9, $E(U_{(i)}) = i/(n+1)$; thus,

$$P(X_{(j)} \le X_{n+1} \le X_{(k)}) = \frac{k - j}{n + 1}. \qquad (4.8.2)$$

It follows that $[X_{(j)}, X_{(k)}]$ with $k = n + 1 - j$ is a level $(n + 1 - 2j)/(n + 1)$ prediction interval for $X_{n+1}$. This interval is a distribution-free prediction interval. See Problem 4.8.5 for a simpler proof of (4.8.2). □

Bayesian Predictive Distributions

Suppose that $\boldsymbol{\theta}$ is random with $\boldsymbol{\theta} \sim \pi$ and that given $\boldsymbol{\theta} = \theta$, $X_1, \ldots, X_{n+1}$ are i.i.d. $p(X \mid \theta)$. Here $X_1, \ldots, X_n$ are observable and $X_{n+1}$ is to be predicted. The posterior predictive distribution $Q(\cdot \mid x)$ of $X_{n+1}$ is defined as the conditional distribution of $X_{n+1}$ given $x = (x_1, \ldots, x_n)$; that is, $Q(\cdot \mid x)$ has in the continuous case density

$$q(x_{n+1} \mid x) = \frac{\int_\Theta \prod_{i=1}^{n+1} p(x_i \mid \theta)\,\pi(\theta)\,d\theta}{\int_\Theta \prod_{i=1}^{n} p(x_i \mid \theta)\,\pi(\theta)\,d\theta},$$

with a sum replacing the integral in the discrete case. Now $[\underline{Y}_B, \bar{Y}_B]$ is said to be a level $(1-\alpha)$ Bayesian prediction interval for $Y = X_{n+1}$ if

$$Q(\underline{Y}_B \le Y \le \bar{Y}_B \mid x) \ge 1 - \alpha.$$
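The distribution-free interval of Example 4.8.2 is easy to check numerically. The sketch below is an illustration under stated assumptions rather than part of the original text: the function names are hypothetical, and the exponential population is chosen only to show that normality plays no role.

```python
import random

def distribution_free_interval(sample, j):
    """Return (X_(j), X_(k), coverage) with k = n + 1 - j.

    By (4.8.2) the interval [X_(j), X_(k)] contains a new observation
    X_{n+1} with probability (k - j) / (n + 1) for any continuous F.
    """
    n = len(sample)
    k = n + 1 - j
    if not (1 <= j < k <= n):
        raise ValueError("need 1 <= j and 2j < n + 1")
    ordered = sorted(sample)
    return ordered[j - 1], ordered[k - 1], (k - j) / (n + 1)

def simulated_coverage(n, j, reps, rng):
    """Monte Carlo estimate of the coverage for an exponential population."""
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lower, upper, _ = distribution_free_interval(sample, j)
        if lower <= rng.expovariate(1.0) <= upper:
            hits += 1
    return hits / reps

# With n = 19 and j = 1 the interval [X_(1), X_(19)] has level 18/20 = 0.9.
rng = random.Random(0)
print(simulated_coverage(n=19, j=1, reps=20000, rng=rng))
```

The simulated frequency should be close to the nominal level $(n + 1 - 2j)/(n + 1)$; replacing the exponential draws by any other continuous distribution should leave it essentially unchanged.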
Example 4.8.3. Consider Example 3.2.1 where $(X_i \mid \theta) \sim N(\theta, \sigma_0^2)$, $\sigma_0^2$ known, and $\pi(\theta)$ is $N(\eta_0, \tau^2)$, $\tau^2$ known. A sufficient statistic based on the observables $X_1, \ldots, X_n$ is $T = \bar{X} = n^{-1}\sum_{i=1}^{n} X_i$, and it is enough to derive the marginal distribution of $Y = X_{n+1}$ from the joint distribution of $\bar{X}$, $X_{n+1}$ and $\boldsymbol{\theta}$, where $\bar{X}$ and $X_{n+1}$ are independent. Note that

$$E[(X_{n+1} - \boldsymbol{\theta})\boldsymbol{\theta}] = E\{E[(X_{n+1} - \boldsymbol{\theta})\boldsymbol{\theta} \mid \boldsymbol{\theta} = \theta]\} = 0.$$

Thus, $X_{n+1} - \boldsymbol{\theta}$ and $\boldsymbol{\theta}$ are uncorrelated and, by Theorem B.4.1, independent.

To obtain the predictive distribution, note that given $\bar{X} = t$, $X_{n+1} - \boldsymbol{\theta}$ and $\boldsymbol{\theta}$ are still uncorrelated and independent. Thus, if we let $\mathcal{L}$ denote "distribution of," then

$$\mathcal{L}\{X_{n+1} \mid \bar{X} = t\} = \mathcal{L}\{(X_{n+1} - \boldsymbol{\theta}) + \boldsymbol{\theta} \mid \bar{X} = t\} = N(\hat{\mu}_B, \sigma_0^2 + \hat{\sigma}_B^2)$$

where, from Example 4.7.1,

$$\hat{\sigma}_B^2 = \frac{1}{\frac{n}{\sigma_0^2} + \frac{1}{\tau^2}}, \qquad \hat{\mu}_B = (\hat{\sigma}_B^2/\tau^2)\eta_0 + (n\hat{\sigma}_B^2/\sigma_0^2)\bar{x}.$$

It follows that a level $(1-\alpha)$ Bayesian prediction interval for $Y$ is $[Y_B^-, Y_B^+]$ with

$$Y_B^{\pm} = \hat{\mu}_B \pm z(1 - \tfrac{1}{2}\alpha)\sqrt{\sigma_0^2 + \hat{\sigma}_B^2}. \qquad (4.8.3)$$

To consider the frequentist properties of the Bayesian prediction interval (4.8.3) we compute its probability limit under the assumption that $X_1, \ldots, X_n$ are i.i.d. $N(\theta, \sigma_0^2)$. Because $\hat{\sigma}_B^2 \to 0$, $(n\hat{\sigma}_B^2/\sigma_0^2) \to 1$, and $\bar{X} \stackrel{P}{\to} \theta$ as $n \to \infty$, we find that the interval (4.8.3) converges in probability to $\theta \pm z(1 - \tfrac{1}{2}\alpha)\sigma_0$ as $n \to \infty$. This is the same as the probability limit of the frequentist interval (4.8.1). □

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least $(1-\alpha)$. In the case of a normal sample of size $n + 1$ with only $n$ variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size $n + 1$ from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in
the case in which $\theta$ is one-dimensional, optimal procedures may not exist. For instance, if $X_1, \ldots, X_n$ is a sample from a $N(\mu, \sigma^2)$ population with $\sigma^2$ known, there is no UMP test for testing $H: \mu = \mu_0$ vs $K: \mu \ne \mu_0$. To see this, note that it follows from Example 4.2.1 that if $\mu_1 > \mu_0$, the MP level $\alpha$ test $\delta_\alpha(X)$ rejects $H$ for $T > z(1-\alpha)$, where $T = \sqrt{n}(\bar{X} - \mu_0)/\sigma$. On the other hand, if $\mu_1 < \mu_0$, the MP level $\alpha$ test $\varphi_\alpha(X)$ rejects $H$ if $T \le z(\alpha)$. Because $\delta_\alpha(x) \ne \varphi_\alpha(x)$, by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of $H: \mu = \mu_0$ vs $K: \mu \ne \mu_0$.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6. We start with a generalization of the Neyman-Pearson statistic $p(x, \theta_1)/p(x, \theta_0)$. Suppose that $X = (X_1, \ldots, X_n)$ has density or frequency function $p(x, \theta)$ and we wish to test $H: \theta \in \Theta_0$ vs $K: \theta \in \Theta_1$. The test statistic we want to consider is the likelihood ratio given by

$$L(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta_1\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}.$$

Tests that reject $H$ for large values of $L(x)$ are called likelihood ratio tests.

To see that this is a plausible statistic, recall from Section 2.2.2 that we think of the likelihood function $L(\theta, x) = p(x, \theta)$ as a measure of how well $\theta$ "explains" the given sample $x = (x_1, \ldots, x_n)$. So, if $\sup\{p(x, \theta) : \theta \in \Theta_1\}$ is large compared to $\sup\{p(x, \theta) : \theta \in \Theta_0\}$, then the observed sample is best explained by some $\theta \in \Theta_1$, and conversely. Also note that $L(x)$ coincides with the optimal test statistic $p(x, \theta_1)/p(x, \theta_0)$ when $\Theta_0 = \{\theta_0\}$, $\Theta_1 = \{\theta_1\}$. In particular cases, and for large samples, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6.

In the cases we shall consider, $p(x, \theta)$ is a continuous function of $\theta$ and $\Theta_0$ is of smaller dimension than $\Theta = \Theta_0 \cup \Theta_1$ so that the likelihood ratio equals the test statistic

$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}} \qquad (4.9.1)$$

whose computation is often simple. Note that in general

$$\lambda(x) = \max(L(x), 1).$$

We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE $\hat{\theta}$ of $\theta$.
2. Calculate the MLE $\hat{\theta}_0$ of $\theta$ where $\theta$ may vary only over $\Theta_0$.
3. Form $\lambda(x) = p(x, \hat{\theta})/p(x, \hat{\theta}_0)$.
4. Find a function $h$ that is strictly increasing on the range of $\lambda$ such that $h(\lambda(X))$ has a simple form and a tabled distribution under $H$. Because $h(\lambda(X))$ is equivalent to $\lambda(X)$, we specify the size $\alpha$ likelihood ratio test through the test statistic $h(\lambda(X))$ and its $(1-\alpha)$th quantile obtained from the table.
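The four steps above can be sketched for the simplest case mentioned in this section, a $N(\mu, \sigma^2)$ sample with $\sigma^2$ known and $H: \mu = \mu_0$. This is an illustrative sketch only: the function names and the data are hypothetical, and the choice $h(\lambda) = \sqrt{2\log\lambda}$, which equals $|T|$ with $T = \sqrt{n}(\bar{x} - \mu_0)/\sigma$, is one convenient monotone transformation among many.

```python
import math
from statistics import NormalDist, fmean

def log_likelihood(xs, mu, sigma):
    """Log of the N(mu, sigma^2) likelihood of the sample xs."""
    n = len(xs)
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2))

def lr_test_known_sigma(xs, mu0, sigma, alpha):
    """Two-sided likelihood ratio test of H: mu = mu0 with sigma known."""
    # Step 1: unrestricted MLE of mu.
    mu_hat = fmean(xs)
    # Step 2: under H the parameter set is the single point mu0.
    # Step 3: log of lambda(x) = p(x, mu_hat) / p(x, mu0).
    log_lam = log_likelihood(xs, mu_hat, sigma) - log_likelihood(xs, mu0, sigma)
    # Step 4: h(lambda) = sqrt(2 log lambda) = |T|, which is the absolute
    # value of a standard normal variable under H.
    stat = math.sqrt(max(2.0 * log_lam, 0.0))
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return stat, crit, stat >= crit

# Hypothetical data, for illustration only.
stat, crit, reject = lr_test_known_sigma([1.0, 2.0, 3.0], mu0=0.0, sigma=1.0, alpha=0.05)
print(stat, crit, reject)
```

Working with $\log\lambda$ rather than $\lambda$ avoids overflow for large samples, and because $h$ is strictly increasing the rejection region is unchanged.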
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size $\alpha$ likelihood ratio tests of the point hypothesis $H: \theta = \theta_0$ and obtain the level $(1-\alpha)$ confidence region

$$C(x) = \{\theta : p(x, \theta) \ge [c(\theta)]^{-1}\sup_\theta p(x, \theta)\} \qquad (4.9.2)$$

where $\sup_\theta$ denotes sup over $\theta \in \Theta$ and the critical constant $c(\theta)$ satisfies

$$P_{\theta_0}\left[\frac{\sup_\theta p(X, \theta)}{p(X, \theta_0)} \ge c(\theta_0)\right] = \alpha.$$

It is often approximately true (see Chapter 6) that $c(\theta)$ is independent of $\theta$. In that case, $C(x)$ is just the set of all $\theta$ whose likelihood is on or above some fixed value dependent on the data. An example is discussed in Section 4.9.2.

This section includes situations in which $\theta = (\theta_1, \theta_2)$ where $\theta_1$ is the parameter of interest and $\theta_2$ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form $H: \theta_1 = \theta_{10}$, which are composite because $\theta_2$ can vary freely. The family of such level $\alpha$ likelihood ratio tests obtained by varying $\theta_{10}$ can also be inverted and yield confidence regions for $\theta_1$. To see how the process works we refer to the specific examples in Sections 4.9.2-4.9.5.

4.9.2 Tests for the Mean of a Normal Distribution: Matched Pair Experiments

Suppose $X_1, \ldots, X_n$ form a sample from a $N(\mu, \sigma^2)$ population in which both $\mu$ and $\sigma^2$ are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Here are some examples. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. We are interested in expected differences in responses due to the treatment effect. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the $i$th pair one patient is picked at random (i.e., with probability $\tfrac{1}{2}$) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair.

Studies in which subjects serve as their own control can also be thought of as matched pair experiments. That is, we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.

Let $X_i$ denote the difference between the treated and control responses for the $i$th pair. If the treatment and placebo have the same effect, the difference $X_i$ has a distribution that is
symmetric about zero. To this we have added the normality assumption. The test we derive will still have desirable properties in an approximate sense to be discussed in Chapter 5 if the normality assumption is not satisfied. Let $\mu = E(X_1)$ denote the mean difference between the response of the treated and control subjects. We think of $\mu$ as representing the treatment effect. Our null hypothesis of no treatment effect is then $H: \mu = 0$. However, for the purpose of referring to the duality between testing and confidence procedures, we test $H: \mu = \mu_0$, where we think of $\mu_0$ as an established standard for an old treatment.

Two-Sided Tests

We begin by considering $K: \mu \ne \mu_0$. This corresponds to the alternative "The treatment has some effect, good or bad." However, as discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect.

Form of the Two-Sided Tests

Let $\theta = (\mu, \sigma^2)$, $\Theta_0 = \{(\mu, \sigma^2) : \mu = \mu_0\}$. Under our assumptions,

$$p(x, \theta) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}.$$

The problem of finding the supremum of $p(x, \theta)$ was solved in Example 3.3.6. We found that

$$\sup\{p(x, \theta) : \theta \in \Theta\} = p(x, \hat{\theta}),$$

where

$$\hat{\theta} = (\bar{x}, \hat{\sigma}^2) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i,\; \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)$$

is the maximum likelihood estimate of $\theta$.

Finding $\sup\{p(x, \theta) : \theta \in \Theta_0\}$ boils down to finding the maximum likelihood estimate $\hat{\sigma}_0^2$ of $\sigma^2$ when $\mu = \mu_0$ is known and then evaluating $p(x, \theta)$ at $(\mu_0, \hat{\sigma}_0^2)$. The likelihood equation is

$$\frac{\partial}{\partial\sigma^2}\log p(x, \theta) = \frac{1}{2}\left[\frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu_0)^2 - \frac{n}{\sigma^2}\right] = 0,$$

which has the immediate solution

$$\hat{\sigma}_0^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_0)^2.$$
By Theorem 2.3.1, $\hat{\sigma}_0^2$ gives the maximum of $p(x, \theta)$ for $\theta \in \Theta_0$. The test statistic $\lambda(x)$ is equivalent to $\log\lambda(x)$, which thus equals

$$\log\lambda(x) = \log p(x, \hat{\theta}) - \log p(x, (\mu_0, \hat{\sigma}_0^2))$$
$$= \left\{-\frac{n}{2}[(\log 2\pi) + (\log\hat{\sigma}^2)] - \frac{n}{2}\right\} - \left\{-\frac{n}{2}[(\log 2\pi) + (\log\hat{\sigma}_0^2)] - \frac{n}{2}\right\}$$
$$= \frac{n}{2}\log(\hat{\sigma}_0^2/\hat{\sigma}^2).$$

Our test rule, therefore, rejects $H$ for large values of $(\hat{\sigma}_0^2/\hat{\sigma}^2)$. To simplify the rule further we use the following equation, which can be established by expanding both sides.

$$\hat{\sigma}_0^2 = \hat{\sigma}^2 + (\bar{x} - \mu_0)^2.$$

Therefore,

$$(\hat{\sigma}_0^2/\hat{\sigma}^2) = 1 + (\bar{x} - \mu_0)^2/\hat{\sigma}^2.$$

Because $s^2 = (n-1)^{-1}\sum(x_i - \bar{x})^2 = n\hat{\sigma}^2/(n-1)$, $\hat{\sigma}_0^2/\hat{\sigma}^2$ is a monotone increasing function of $|T_n|$ where

$$T_n = \frac{\sqrt{n}(\bar{x} - \mu_0)}{s}.$$

Therefore, the likelihood ratio tests reject for large values of $|T_n|$. Because $T_n$ has a $\mathcal{T}_{n-1}$ distribution under $H$ (see Example 4.4.1), the size $\alpha$ critical value is $t_{n-1}(1 - \tfrac{1}{2}\alpha)$ and we can use calculators or software that gives quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose $n = 25$ and we want $\alpha = 0.05$. Then we would reject $H$ if, and only if, $|T_n| \ge 2.064$.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement. Thus, the testing problem $H: \mu \le \mu_0$ versus $K: \mu > \mu_0$ (with $\mu_0 = 0$) is suggested. The statistic $T_n$ is equivalent to the likelihood ratio statistic $\lambda$ for this problem. A proof is sketched in Problem 4.9.2. In Problem 4.9.11 we argue that $P_\delta[T_n \ge t]$ is increasing in $\delta$, where $\delta = (\mu - \mu_0)/\sigma$. Therefore, the test that rejects $H$ for

$$T_n \ge t_{n-1}(1-\alpha),$$

is of size $\alpha$ for $H: \mu \le \mu_0$. Similarly, the size $\alpha$ likelihood ratio test for $H: \mu \ge \mu_0$ versus $K: \mu < \mu_0$ rejects $H$ if, and only if,

$$T_n \le t_{n-1}(\alpha).$$
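The reduction from the likelihood ratio to $T_n$ can be verified numerically. The following sketch uses hypothetical data and hypothetical function names; it computes both variance estimates, the log likelihood ratio $\tfrac{n}{2}\log(\hat{\sigma}_0^2/\hat{\sigma}^2)$, and $T_n$, and the identity $\hat{\sigma}_0^2/\hat{\sigma}^2 = 1 + T_n^2/(n-1)$ follows from $s^2 = n\hat{\sigma}^2/(n-1)$.

```python
import math
from statistics import fmean

def one_sample_lr(xs, mu0):
    """Return (T_n, log lambda, sigma0_hat^2 / sigma_hat^2) for H: mu = mu0."""
    n = len(xs)
    xbar = fmean(xs)
    sigma2_hat = sum((x - xbar) ** 2 for x in xs) / n   # unrestricted MLE
    sigma2_0 = sum((x - mu0) ** 2 for x in xs) / n      # MLE under H
    s = math.sqrt(n * sigma2_hat / (n - 1))             # usual sample s.d.
    t_n = math.sqrt(n) * (xbar - mu0) / s
    ratio = sigma2_0 / sigma2_hat
    return t_n, 0.5 * n * math.log(ratio), ratio

# Hypothetical differences, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
t_n, log_lam, ratio = one_sample_lr(xs, mu0=0.0)
# The ratio is an increasing function of |T_n|:
print(ratio, 1.0 + t_n ** 2 / (len(xs) - 1))
```

Since $\log\lambda$ depends on the data only through $|T_n|$, rejecting for large $\lambda$ and rejecting for large $|T_n|$ define the same test.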
Power Functions

To discuss the power of these tests, we need to introduce the noncentral t distribution with $k$ degrees of freedom and noncentrality parameter $\delta$. This distribution, denoted by $\mathcal{T}_{k,\delta}$, is by definition the distribution of $Z/\sqrt{V/k}$ where $Z$ and $V$ are independent and have $N(\delta, 1)$ and $\chi^2_k$ distributions, respectively. The density of $Z/\sqrt{V/k}$ is given in Problem 4.9.12. To derive the distribution of $T_n$, note that from Section B.3, we know that $\sqrt{n}(\bar{X} - \mu)/\sigma$ and $(n-1)s^2/\sigma^2$ are independent and that $(n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution. Because $E[\sqrt{n}(\bar{X} - \mu_0)/\sigma] = \sqrt{n}(\mu - \mu_0)/\sigma$ and $\mathrm{Var}(\sqrt{n}(\bar{X} - \mu_0)/\sigma) = 1$, $\sqrt{n}(\bar{X} - \mu_0)/\sigma$ has a $N(\delta, 1)$ distribution, with $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$. Thus, the ratio

$$\frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{\sqrt{\dfrac{(n-1)s^2/\sigma^2}{n-1}}} = T_n$$

has a $\mathcal{T}_{n-1,\delta}$ distribution, and the power can be obtained from computer software or tables of the noncentral t distribution. Note that the distribution of $T_n$ depends on $\theta = (\mu, \sigma^2)$ only through $\delta$.

The power functions of the one-sided tests are monotone in $\delta$ (Problem 4.9.11) just as the power functions of the corresponding tests of Example 4.2.1 are monotone in $\sqrt{n}\mu/\sigma$.

We can control both probabilities of error by selecting the sample size $n$ large provided we consider alternatives of the form $|\delta| \ge \delta_1 > 0$ in the two-sided case and $\delta \ge \delta_1$ or $\delta \le -\delta_1$ in the one-sided cases. Computer software will compute $n$.

If we consider alternatives of the form $(\mu - \mu_0) \ge \Delta$, say, we can no longer control both probabilities of error by choosing the sample size. The reason is that, whatever be $n$, by making $\sigma$ sufficiently large we can force the noncentrality parameter $\delta = \sqrt{n}(\mu - \mu_0)/\sigma$ as close to 0 as we please and, thus, bring the power arbitrarily close to $\alpha$. We have met similar difficulties (Problem 4.4.7) when discussing confidence intervals. A Stein solution, in which we estimate $\sigma$ for a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with $|\mu - \mu_0| \ge \Delta$, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

$$C(X) = \left\{\mu : \left|\sqrt{n}(\bar{X} - \mu)/s\right| \le t_{n-1}(1 - \tfrac{1}{2}\alpha)\right\}.$$

We recognize $C(X)$ as the confidence interval of Example 4.4.1. Similarly the one-sided tests lead to the lower and upper confidence bounds of Example 4.4.2.
Section 4.. . The 0. blood pressure). it is shown that = over 8 by the maximum likelihood estimate e. 'Yn2 are independent samples from N (JlI.1.'" .3 5 0. Let e = (Jlt.6 10 2.3.1 1.2.6 0..4 3 0. .2 2 1. p. weight.Xn1 could be blood pressure measurements on a sample of patients given a placebo..Y ) is n2 where n = nl + n2. 1958. it is usually assumed that X I. and ITnl = 4.5 1.0 7 3.0 4. . Jl2.3 4 1..9.6 4. In the control versus treatment example.1 0. while YI .4 4. 121) giving the difference B . 'Yn2 are the measurements on a sample given the drug.0 6 3. this is the problem of determining whether the treatment has any effect. .Xnl and YI ..8 1. respectively. height..1 1.3). and so forth. e . A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.6. ... For quantitative measurements such as blood pressure.58.7 1. Yn2 . The preceding assumptions were discussed in Example 1.7 5.g. volume.9.9 1. YI . ( 2) and N (Jl2. . Patient i A B BA 1 0.84]. 8 2 = 1.06. As in Section 4.2 0.Xn1 and YI .995) = 3.32.4 1.99 confidence interval for the mean difference Jl between treatments is [0. . consider the following data due to Cushny and Peebles (see Fisher.) 4..513.2 1.. .25. Then Xl. length.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations We often want to compare two populations with distribution F and G on the basis of two independent samples X I. .4 1.. Tests We first consider the problem of testing H : Jl1 = J:L2 versus H : Jli =F Jl2.Xn1 ..8 2.6 0.1 0. one from each popUlation. we conclude at the 1% level of significance that the two drugs are significantly different. For instance. This is a matched pair experiment with each subject serving as its own control. Because tg(0.0 3. .8 9 0. (See also (4. then x = 1. . temperature.A in sleep gained using drugs A and B on 10 patients. ( 2) populations. 
Then 8 0 = {e : Jli = Jl2} and 9 1 = {e : Jli =F Jl2}' The log of the likelihood of (X.9 likelihood Ratio Procedures 261 Data Example As an illustration of these procedures..4 If we denote the difference as x's. ( 2)..2 the likelihood function and its log are maximized In Problem 4. suppose we wanted to test the effect of a certain drug on some biological variable (e. Y) = (Xl.9. .5.8 8 0.. It suggests that not only are the drugs different but in fact B is better than A because no hypothesis Jl = Jl' < 0 is accepted at this level.
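where

$$\tilde{\sigma}^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\right].$$

When $\mu_1 = \mu_2 = \mu$, our model reduces to the one-sample model of Section 4.9.2. Thus, the maximum of $p$ over $\Theta_0$ is obtained for $\theta = (\hat{\mu}, \hat{\mu}, \tilde{\sigma}_0^2)$, where

$$\hat{\mu} = \frac{1}{n}\left[\sum_{i=1}^{n_1} X_i + \sum_{j=1}^{n_2} Y_j\right]$$

and

$$\tilde{\sigma}_0^2 = \frac{1}{n}\left[\sum_{i=1}^{n_1}(X_i - \hat{\mu})^2 + \sum_{j=1}^{n_2}(Y_j - \hat{\mu})^2\right].$$

If we use the identities

$$\frac{1}{n}\sum_{i=1}^{n_1}(X_i - \hat{\mu})^2 = \frac{1}{n}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \frac{n_1}{n}(\bar{X} - \hat{\mu})^2$$
$$\frac{1}{n}\sum_{j=1}^{n_2}(Y_j - \hat{\mu})^2 = \frac{1}{n}\sum_{j=1}^{n_2}(Y_j - \bar{Y})^2 + \frac{n_2}{n}(\bar{Y} - \hat{\mu})^2$$

obtained by writing $[X_i - \hat{\mu}]^2 = [(X_i - \bar{X}) + (\bar{X} - \hat{\mu})]^2$ and expanding, we find that the log likelihood ratio statistic

$$\log\lambda(x, y) = \frac{n}{2}\log(\tilde{\sigma}_0^2/\tilde{\sigma}^2)$$

is equivalent to the test statistic $|T|$ where

$$T = \sqrt{\frac{n_1 n_2}{n}}\left(\frac{\bar{Y} - \bar{X}}{s}\right)$$

and

$$s^2 = \frac{n\tilde{\sigma}^2}{n-2} = \frac{1}{n-2}\left[\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2\right].$$

To complete specification of the size $\alpha$ likelihood ratio test we show that $T$ has a $\mathcal{T}_{n-2}$ distribution when $\mu_1 = \mu_2$. By Theorem B.3.3

$$\frac{1}{\sigma}\bar{X}, \quad \frac{1}{\sigma}\bar{Y}, \quad \frac{1}{\sigma^2}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 \quad\text{and}\quad \frac{1}{\sigma^2}\sum_{j=1}^{n_2}(Y_j - \bar{Y})^2$$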
are independent and distributed as $N(\mu_1/\sigma, 1/n_1)$, $N(\mu_2/\sigma, 1/n_2)$, $\chi^2_{n_1-1}$, $\chi^2_{n_2-1}$, respectively. We conclude from this remark and the additive property of the $\chi^2$ distribution that $(n-2)s^2/\sigma^2$ has a $\chi^2_{n-2}$ distribution, and that $\sqrt{n_1 n_2/n}\,(\bar{Y} - \bar{X})/\sigma$ has a $N(0, 1)$ distribution and is independent of $(n-2)s^2/\sigma^2$. Therefore, by definition, $T \sim \mathcal{T}_{n-2}$ under $H$ and the resulting two-sided, two-sample t test rejects if, and only if,

$$|T| \ge t_{n-2}(1 - \tfrac{1}{2}\alpha).$$

As usual, corresponding to the two-sided test, there are two one-sided tests with critical regions,

$$T \ge t_{n-2}(1-\alpha)$$

for $H: \mu_2 \le \mu_1$ and

$$T \le t_{n-2}(\alpha)$$

for $H: \mu_1 \le \mu_2$. We can show that these tests are likelihood ratio tests for these hypotheses. It is also true that these procedures are of size $\alpha$ for their respective hypotheses. As in the one-sample case, this follows from the fact that, if $\mu_1 \ne \mu_2$, $T$ has a noncentral t distribution with noncentrality parameter,

$$\delta_2 = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\mu_2 - \mu_1}{\sigma}.$$

Confidence Intervals

To obtain confidence intervals for $\mu_2 - \mu_1$ we naturally look at likelihood ratio tests for the family of testing problems $H: \mu_2 - \mu_1 = \Delta$ versus $K: \mu_2 - \mu_1 \ne \Delta$. As for the special case $\Delta = 0$, we find a simple equivalent statistic $|T(\Delta)|$ where

$$T(\Delta) = \sqrt{\frac{n_1 n_2}{n}}\,\frac{\bar{Y} - \bar{X} - \Delta}{s}.$$

If $\mu_2 - \mu_1 = \Delta$, $T(\Delta)$ has a $\mathcal{T}_{n-2}$ distribution and inversion of the tests leads to the interval

$$\bar{Y} - \bar{X} \pm t_{n-2}(1 - \tfrac{1}{2}\alpha)\,s\sqrt{n/n_1 n_2} \qquad (4.9.3)$$

for $\mu_2 - \mu_1$. As in the one-sample case, one-sided tests lead to the upper and lower endpoints of the interval as $1 - \tfrac{1}{2}\alpha$ likelihood confidence bounds.
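A short sketch of the two-sample statistic $T(\Delta)$ and the interval (4.9.3) follows. The six log-permeability values are those of the Hald data example of this section and 2.776 is the $t_4(0.975)$ quantile quoted there; the function names are hypothetical, and the critical value is supplied by the caller because the Python standard library has no t quantile function.

```python
import math
from statistics import fmean

def pooled_sd(xs, ys):
    """s, where s^2 is the pooled variance estimate with n - 2 df."""
    xbar, ybar = fmean(xs), fmean(ys)
    ss = sum((x - xbar) ** 2 for x in xs) + sum((y - ybar) ** 2 for y in ys)
    return math.sqrt(ss / (len(xs) + len(ys) - 2))

def two_sample_t(xs, ys, delta=0.0):
    """T(delta) = sqrt(n1 n2 / n) (ybar - xbar - delta) / s."""
    n1, n2 = len(xs), len(ys)
    scale = math.sqrt(n1 * n2 / (n1 + n2))
    return scale * (fmean(ys) - fmean(xs) - delta) / pooled_sd(xs, ys)

def difference_interval(xs, ys, t_crit):
    """Center and half-width of the interval (4.9.3) for mu2 - mu1."""
    n1, n2 = len(xs), len(ys)
    half = t_crit * pooled_sd(xs, ys) * math.sqrt((n1 + n2) / (n1 * n2))
    return fmean(ys) - fmean(xs), half

x = [1.845, 1.790, 2.042]   # machine 1
y = [1.583, 1.627, 1.282]   # machine 2
center, half = difference_interval(x, y, t_crit=2.776)
print(round(two_sample_t(x, y), 3), round(center, 3), round(half, 3))
```

Calling `two_sample_t` with a nonzero `delta` tests $H: \mu_2 - \mu_1 = \Delta$; the values of $\Delta$ that are accepted are exactly those inside the interval.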
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472)

x (machine 1)    1.845    1.790    2.042
y (machine 2)    1.583    1.627    1.282

We test the hypothesis $H$ of no difference in expected log permeability. $H$ is rejected if $|T| \ge t_{n-2}(1 - \tfrac{1}{2}\alpha)$. Here $\bar{y} - \bar{x} = -0.395$, $s^2 = 0.0264$, and $T = -2.977$. Because $t_4(0.975) = 2.776$, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is

$$-0.395 \pm 0.368.$$

On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. Again we can show that the selection procedure based on the level $(1-\alpha)$ confidence interval has probability at most $\tfrac{1}{2}\alpha$ of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the $X$'s and $Y$'s have different variances. For instance, a treatment that increases mean response may increase the variance of the responses. If normality holds, we are led to a model where $X_1, \ldots, X_{n_1}$; $Y_1, \ldots, Y_{n_2}$ are two independent $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ samples, respectively. As a first step we may still want to compare mean responses for the $X$ and $Y$ populations. This is the Behrens-Fisher problem.

Suppose first that $\sigma_1^2$ and $\sigma_2^2$ are known. The log likelihood, except for an additive constant, is

$$-\frac{1}{2\sigma_1^2}\sum_{i=1}^{n_1}(x_i - \mu_1)^2 - \frac{1}{2\sigma_2^2}\sum_{j=1}^{n_2}(y_j - \mu_2)^2.$$

The MLEs of $\mu_1$ and $\mu_2$ are, thus, $\hat{\mu}_1 = \bar{x}$ and $\hat{\mu}_2 = \bar{y}$ for $(\mu_1, \mu_2) \in R \times R$. When $\mu_1 = \mu_2 = \mu$, setting the derivative of the log likelihood equal to zero yields the MLE
1:::. It follows that "\(x..) I 8 D depends on at I a~ for fixed nl. 1) distribution. j1x = = nl . it Thus.t1 = J. An unbiased estimate is J...3. 2 8y 8~ . For large nl.t2 :::. by Slutsky's theorem and the central limit theorem.)18D has approximately a standard normal distribution (Problem 5.. DiaD has aN(I:::.n2)' Similarly. = J. j1. and more generally (D .X)2 + nl(j1. that is Yare X) + Var(Y) = nl + n2 . n2.)18D to generate confidence procedures.x)2 i=1 n2 i=1 n2 L(Yi . n2. BecauseaJy is unknown.1:::.X and aJy is the variance of D. = at I a~. IDI/8D as a test statistic for H : J.28). 8D=+' nl n2 It is natural to try and use D I 8 D as a test statistic for the onesided hypothesis H : J..n2)' It follows that the likelihood ratio test is equivalent to the statistic IDl/aD. where I:::.fi)2 we obtain \(x. For small and moderate nI.fi? j=1 + n2(j1. (D 1:::.t1.t1. n2 an approximation to the distribution of (D .t2 must be estimated.n2(fi .9 Likelihood Ratio Procedures J 265 where. where D = Y .x)/(nl + . Unfortunately the distribution of (D 1:::..y)/(nl + .t2.Section 4.) 18 D due to Welch (1949) works well. J.fi = n2(x .j1)2 j=1 = L(Yi .. y) Next we compute = exp {2~~ eM  x)2 + 2:'~ (Ii  y?}.y) By writing nl L(Xi .
Let $c = s_1^2/n_1 s_D^2$. Then Welch's approximation is $\mathcal{T}_k$ where

$$k = \left[\frac{c^2}{n_1 - 1} + \frac{(1 - c)^2}{n_2 - 1}\right]^{-1}.$$

When $k$ is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software.

The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens-Fisher problem. Wang (1971) has shown the approximation to be very good for $\alpha = 0.05$ and $\alpha = 0.01$, the maximum error in size being bounded by 0.003.

The LR procedure derived in Section 4.9.3, which works well if the variances are equal or $n_1 = n_2$, can unfortunately be very misleading if $\sigma_1^2 \ne \sigma_2^2$ and $n_1 \ne n_2$. See Figure 5.3.3 and Problem 5.3.28.

Note that Welch's solution works whether the variances are equal or not.

4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If $n$ subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics $(X, Y)$ have a joint bivariate normal distribution, $N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, with $\sigma_1^2 > 0$, $\sigma_2^2 > 0$.

Testing Independence, Confidence Intervals for $\rho$

The question "Are two random variables $X$ and $Y$ independent?" arises in many statistical studies. Some familiar examples are: $X$ = test score on English exam, $Y$ = test score on mathematics exam; $X$ = percentage of fat in diet, $Y$ = cholesterol level in blood; $X$ = weight, $Y$ = blood pressure; $X$ = average cigarette consumption per day in grams, $Y$ = age at death. If we have a sample as before and assume the bivariate normal model for $(X, Y)$, our problem becomes that of testing $H: \rho = 0$.

Two-Sided Tests
The unrestricted maximum likelihood estimate $\hat\theta$ was given in Problem 2.3.13 as $(\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, \hat\rho)$, where

$$\hat\sigma_1^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2, \qquad \hat\sigma_2^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar y)^2, \qquad \hat\rho = \left[\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)\right] \Big/ n\hat\sigma_1\hat\sigma_2.$$

$\hat\rho$ is called the sample correlation coefficient and satisfies $-1 \le \hat\rho \le 1$ (Problem 2.1.8). When $\rho = 0$ we have two independent samples, and $\hat\theta_0$ can be obtained by separately maximizing the likelihood of $X_1, \ldots, X_n$ and that of $Y_1, \ldots, Y_n$. We have $\hat\theta_0 = (\bar x, \bar y, \hat\sigma_1^2, \hat\sigma_2^2, 0)$ and the log of the likelihood ratio statistic becomes

$$\log\lambda(x) = -\frac{n}{2}\log(1 - \hat\rho^2). \qquad (4.9.4)$$

Thus, $\log\lambda(x)$ is an increasing function of $\hat\rho^2$, and the likelihood ratio tests reject $H$ for large values of $|\hat\rho|$.

To obtain critical values we need the distribution of $\hat\rho$ or an equivalent statistic under $H$. Now, $\hat\rho$ is unchanged when each $(X_i, Y_i)$ is replaced by $(U_i, V_i)$, where $U_i = (X_i - \mu_1)/\sigma_1$, $V_i = (Y_i - \mu_2)/\sigma_2$. Because $(U_1, V_1), \ldots, (U_n, V_n)$ is a sample from the $N(0, 0, 1, 1, \rho)$ distribution, the distribution of $\hat\rho$ depends on $\rho$ only. If $\rho = 0$, then by Problem B.4.8,

$$T_n = \frac{\sqrt{n - 2}\,\hat\rho}{\sqrt{1 - \hat\rho^2}} \qquad (4.9.5)$$

has a $\mathcal{T}_{n-2}$ distribution. Because $|T_n|$ is an increasing function of $|\hat\rho|$, the two-sided likelihood ratio tests can be based on $|T_n|$ and the critical values obtained from Table II.

When $\rho = 0$, the distribution of $\hat\rho$ is available on computer packages. There is no simple form for the distribution of $\hat\rho$ (or $T_n$) when $\rho \ne 0$. A normal approximation is available. See Example 5.3.6.

Qualitatively, for any $\alpha$, the power function of the LR test is symmetric about $\rho = 0$ and increases continuously from $\alpha$ to 1 as $\rho$ goes from 0 to 1. Therefore, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
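The two statistics discussed above, Welch's approximate $t$ statistic $D/s_D$ with its degrees of freedom $k$ and the statistic $T_n$ of (4.9.5) for testing $\rho = 0$, can be sketched numerically as follows. This is only an illustrative sketch: the function names (`welch_statistic`, `correlation_t`, `t_from_rho`) are hypothetical, only the Python standard library is assumed, and critical values or $p$-values would still have to be taken from $t$ tables or statistical software.

```python
import math
from statistics import mean, variance


def welch_statistic(x, y):
    """Return (D / s_D, k) for two samples x and y (hypothetical helper).

    D = ybar - xbar, s_D^2 = s1^2/n1 + s2^2/n2, c = s1^2 / (n1 * s_D^2),
    and k = [c^2/(n1-1) + (1-c)^2/(n2-1)]^(-1) is Welch's degrees of freedom.
    """
    n1, n2 = len(x), len(y)
    s1_sq, s2_sq = variance(x), variance(y)  # unbiased sample variances
    s_d_sq = s1_sq / n1 + s2_sq / n2
    d = mean(y) - mean(x)
    c = (s1_sq / n1) / s_d_sq
    k = 1.0 / (c ** 2 / (n1 - 1) + (1 - c) ** 2 / (n2 - 1))
    return d / math.sqrt(s_d_sq), k


def t_from_rho(rho_hat, n):
    """T_n = sqrt(n - 2) * rho_hat / sqrt(1 - rho_hat^2), as in (4.9.5)."""
    return math.sqrt(n - 2) * rho_hat / math.sqrt(1 - rho_hat ** 2)


def correlation_t(x, y):
    """Return (rho_hat, T_n) for paired observations (hypothetical helper)."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    # Same as [sum (x_i - xbar)(y_i - ybar)] / (n * sigma1_hat * sigma2_hat).
    rho_hat = sxy / math.sqrt(sxx * syy)
    return rho_hat, t_from_rho(rho_hat, n)
```

With equal sample sizes and equal sample variances, $c = 1/2$ and $k$ reduces to $n_1 + n_2 - 2$, the degrees of freedom of the ordinary two-sample $t$ test. For $T_n$, taking $\hat\rho = 0.18$ with $n = 9$ (the value of $n$ is an assumption here, not stated in the text) gives `t_from_rho(0.18, 9)` of roughly 0.48.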
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test $H : \rho = 0$ versus $K : \rho > 0$ or $H : \rho \le 0$ versus $K : \rho > 0$. It can be shown that $\hat\rho$ is equivalent to the likelihood ratio statistic for testing $H : \rho \le 0$ versus $K : \rho > 0$ and similarly that $-\hat\rho$ corresponds to the likelihood ratio statistic for $H : \rho \ge 0$ versus $K : \rho < 0$. We can show that $P_\rho[\hat\rho \ge c]$ is an increasing function of $\rho$ for fixed $c$ (Problem 4.9.15). Therefore, we obtain size $\alpha$ tests for each of these hypotheses by setting the critical value so that the probability of type I error is $\alpha$ when $\rho = 0$. The power functions of these tests are monotone.

Confidence Bounds and Intervals

Usually testing independence is not enough and we want bounds and intervals for $\rho$ giving us an indication of what departure from independence is present. To obtain lower confidence bounds, we can start by constructing size $\alpha$ likelihood ratio tests of $H : \rho = \rho_0$ versus $K : \rho > \rho_0$. These tests can be shown to be of the form "Accept if, and only if, $\hat\rho \le c(\rho_0)$" where $P_{\rho_0}[\hat\rho \le c(\rho_0)] = 1 - \alpha$. We obtain $c(\rho)$ either from computer software or by the approximation of Chapter 5. Because $c$ can be shown to be monotone increasing in $\rho$, inversion of this family of tests leads to level $1 - \alpha$ lower confidence bounds. We can similarly obtain $1 - \alpha$ upper confidence bounds and, by putting two level $1 - \frac{1}{2}\alpha$ bounds together, we obtain a commonly used confidence interval for $\rho$. These intervals do not correspond to the inversion of the size $\alpha$ LR tests of $H : \rho = \rho_0$ versus $K : \rho \ne \rho_0$ but rather of the "equal-tailed" test that rejects if, and only if, $\hat\rho \ge d(\rho_0)$ or $\hat\rho \le c(\rho_0)$, where $P_{\rho_0}[\hat\rho \ge d(\rho_0)] = P_{\rho_0}[\hat\rho \le c(\rho_0)] = \frac{1}{2}\alpha$. However, for large $n$ the equal-tails and LR confidence intervals approximately coincide with each other.

Data Example

As an illustration, consider the following bivariate sample of weights $x_i$ of young rats at a certain age and the weight increase $y_i$ during the following week. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis $H : \rho = 0$.

[Table: initial weights $x_i$ and weight increases $y_i$; the values are not recoverable from the source apart from the fragment "370.7 18."]

Here $\hat\rho = 0.18$ and $T_n = 0.48$. Thus, by using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the $p$-value is bigger than 0.75.

Summary. The likelihood ratio test statistic $\lambda$ is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:
(1) Matched pair experiments in which differences are modeled as $N(\mu, \sigma^2)$ and we test the hypothesis that the mean difference $\mu$ is zero. The likelihood ratio test is equivalent to the one-sample (Student) $t$ test.

(2) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) $t$ test.

(3) Two-sample experiments in which two independent samples are modeled as coming from $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$ populations, respectively. When $\sigma_1^2$ and $\sigma_2^2$ are known, the likelihood ratio test is equivalent to the test based on $|\bar Y - \bar X|$. When $\sigma_1^2$ and $\sigma_2^2$ are unknown, we use $(\bar Y - \bar X)/s_D$, where $s_D$ is an estimate of the standard deviation of $D = \bar Y - \bar X$. Approximate critical values are obtained using Welch's $t$ distribution approximation.

(4) Bivariate sampling experiments in which we have two measurements $X$ and $Y$ on each case in a sample of $n$ cases. We test the hypothesis that $X$ and $Y$ are independent and find that the likelihood ratio test is equivalent to the test based on $|\hat\rho|$, where $\hat\rho$ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a $t$ statistic with $n - 2$ degrees of freedom.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that $X_1, \ldots, X_n$ are independently and identically distributed according to the uniform distribution $U(0, \theta)$. Let $M_n = \max(X_1, \ldots, X_n)$ and let

$$\delta_c(X) = 1 \text{ if } M_n \ge c, \qquad = 0 \text{ otherwise.}$$

(a) Compute the power function of $\delta_c$ and show that it is a monotone increasing function of $\theta$.

(b) In testing $H : \theta \le \frac{1}{2}$ versus $K : \theta > \frac{1}{2}$, what choice of $c$ would make $\delta_c$ have size exactly 0.05?

(c) Draw a rough graph of the power function of $\delta_c$ specified in (b) when $n = 20$.

(d) How large should $n$ be so that the $\delta_c$ specified in (b) has power 0.98 for $\theta = \frac{3}{4}$?

(e) If in a sample of size $n = 20$, $M_n = 0.48$, what is the $p$-value?

2. Let $X_1, \ldots, X_n$ denote the times in days to failure of $n$ similar pieces of equipment. Assume the model where $X = (X_1, \ldots, X_n)$ is an $\mathcal{E}(\lambda)$ sample. Consider the hypothesis $H$ that the mean life $1/\lambda = \mu \le \mu_0$.
(a) Use the result of Problem B.3.4 to show that the test with critical region

$$[\bar X \ge \mu_0 x(1 - \alpha)/2n],$$

where $x(1 - \alpha)$ is the $(1 - \alpha)$th quantile of the $\chi^2_{2n}$ distribution, is a size $\alpha$ test.

(b) Give an expression of the power in terms of the $\chi^2_{2n}$ distribution.

(c) Use the central limit theorem to show that $\Phi[(\mu_0 z(\alpha)/\mu) + \sqrt{n}(\mu - \mu_0)/\mu]$ is an approximation to the power of the test in part (a). Draw a graph of the approximate power function.
Hint: Approximate the critical region by $[\bar X \ge \mu_0(1 + z(1 - \alpha)/\sqrt{n})]$.

(d) The following are days until failure of air monitors at a nuclear plant. If $\mu_0 = 25$, give a normal approximation to the significance probability.

Days until failure: 3 150 40 34 32 37 34 2 31 6 5 14 150 27 4 6 27 10 30 37

Is $H$ rejected at level $\alpha = 0.05$?

3. Let $X_1, \ldots, X_n$ be a $\mathcal{P}(\theta)$ sample.

(a) Use the MLE $\bar X$ of $\theta$ to construct a level $\alpha$ test for $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$.

(b) Show that the power function of your test is increasing in $\theta$.

(c) Give an approximate expression for the critical value if $n$ is large and $\theta$ not too close to 0 or $\infty$. (Use the central limit theorem.)

4. Let $X_1, \ldots, X_n$ be a sample from a population with the Rayleigh density

$$f(x, \theta) = (x/\theta^2)\exp\{-x^2/2\theta^2\}, \quad x > 0, \ \theta > 0.$$

(a) Construct a test of $H : \theta = 1$ versus $K : \theta > 1$ with approximate size $\alpha$ using a complete sufficient statistic for this model.
Hint: Use the central limit theorem for the critical value.

(b) Check that your test statistic has greater expected value under $K$ than under $H$.

5. Show that if $H$ is simple and the test statistic $T$ has a continuous distribution, then the $p$-value $\alpha(T)$ has a uniform, $U(0, 1)$, distribution.
Hint: See Problem B.2.12.

6. Suppose that $T_1, \ldots, T_r$ are independent test statistics for the same simple $H$ and that each $T_j$ has a continuous distribution, $j = 1, \ldots, r$. Let $\alpha(T_j)$ denote the $p$-value for $T_j$, $j = 1, \ldots, r$. Show that, under $H$,

$$\tilde T = -2\sum_{j=1}^r \log\alpha(T_j)$$

has a $\chi^2_{2r}$ distribution.
Hint: See Problem B.3.4.

7. Establish (4.1.3). Assume that $F_0$ and $F$ are continuous.
. (a) For each of these statistics show that the distribution under H does not depend on F o.Fo(x)IOdFo(x) . then T. is chosen 00 so that 00 x 2 dF(x) = 1. That is.10..) Next let T(I).10 Problems and Complements 271 8.Section 4. Fo(x)1 > kol x ~ Hint: D n > !F(x) .. In Example 4..Fo• generate a U(O.. (a) Show that the statistic Tn of Example 4.X(B) from F o on the computer and computing TU) = T(XU)). 10. let . .) Evaluate the bound Pp(lF(x) . 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/.Fo(x)IO x ~ ~ sup.o J J x .. (In practice these can be obtained by drawing B independent samples X(I). Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T. . T(B) ordered.12(b). ~ ll.. T(X(B)) is a sample of size B + 1 from La.Fo(x)1 > k o ) for a ~ 0. Let T(l).. = (Xi . (b) Use part (a) to conclude that LN(". . Use the fact that T(X) is equally likely to be any particular order statistic. (This is the logistic distribution with mean zero and variance 1. = ~ I.Fo(x)IOdF(x). 9. B.p(F(x)IF(x) ..5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4. . .u. Define the statistics S¢.5. 1. .p(Fo(x))lF(x) .6 is invariant under location and scale.(X') = Tn(X).Fo(x) 10 U¢.. and let a > O..o V¢. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.)(Tn ) = LN(O. . . .. T(B) be B independent Monte Carlo simulated values of T.5. Here.1.p(Fo(x))IF(x) . . b > 0.l) (Tn).1.. (b) When 'Ij.00). Express the Cramervon Mises statistic as a sum. Vw. . then the power of the Kolmogorov test tends to 1 as n .Fo(x)1 foteach. j = 1.o T¢. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x). .J(u) = 1 and 0: = 2. ..T(B+l) denote T. 1) to (0. if X.T(X(l))..o: is called the Cramervon Mises statistic. with distribution function F and consider H : F = Fo. .a)/b. Hint: If H is true T(X). to get X with distribution . 
X n be d.1. 80 and x = 0. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). n (c) Show that if F and Fo are continuous and FiFo. 00. .p(u) be a function from (0.5.p(F(x»!F(x) .2. T(l). and 1. Let X I.. (b) Suppose Fo isN(O.d.T.o sup.
Whichever system you buy during the year. I X n from this population. 12. = /10 versus K : J. (a) Show that L(x.1../"0' Show that EPV(O) = if! ( . let N I • N 2 • and N 3 denote the number of Xj equal to I. If each time you test.. . which system is cheaper on the basis of a year's operation? 1 I Ii I il 2. i • (a) Show that if the test has level a. and 3 occuring in the HardyWeinberg proportions f(I.to on the basis of the N(p" . Show that EPV(O) = P(To > T). and only if. 00 .. the other $105 • One second of transmission on either system COsts $103 each. 3. respectively.. f(3.2.') sample Xl. the other has v/aQ = 1..nO /. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). Suppose that T has a continuous distribution Fe. I 1 (b) Show that if c > 0 and a E (0. One has signalMtonoise ratio v/ao = 2.) I 1 . where" is known.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale.a)) = inf{t: Fo(t) > u}. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . (See Problem 4. 2. 2N1 + N.0) = 20(1. i i (b) Define the expected pvalue as EPV(O) = EeU. Consider Examples 3. you intend to test the satellite 100 times. (For a recent review of expected p values see Sackrowitz and SamuelCahn. I). ! . then the test that rejects H if.O) = 0'. which is independent ofT.) . Consider a population with three kinds of individuals labeled 1. .1.05. Let T = X/"o and 0 = /" . Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. You want to buy one of two systems. 2. Problems for Section 4.2 I i i 1. j 1 i :J . 01 ) is an increasing function of 2N1 + N.1) satisfy Pe. Let 0 < 00 < 0 1 < 1. the power is (3(0) = P(U where FO 1 (u)  < a) = 1.10. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . Let To denote a random variable with distribution F o. The first system costs $106 . 
Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. For a sample Xl.. and 3.L > J..2 and 4. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. 5 about 14% of the time. then the pvalue is U =I . take 80 = O. the UMP test is of the form I {T > c).0). where if! denotes the standard normal distribution. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0.0) = (1. Without loss of generality. [2N1 + N.2..0)'.Fo(T). X n . whereas the other a I ~ a .. ! J (c) Suppose that for each a E (0. .3. 1999.)... ! .Fe(F"l(l. f(2. > cJ = a. Expected pvalues.
2. For 0 < a < I. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po. 1. . 0. (b) Find the test that is best in this sense for Example 4.<l.6) where all parameters are known. and only if. Y) known to be distributed either (as in population 0) according to N(O. [L(X.2.2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed. find the MP test for testing 10. +. A newly discovered skull has cranial measurements (X.4 and recall (Proposition B.1. 6. 7. M(n..1. (X. [L(X. .Bd > cJ then the likelihood ratio test with critical value c is best in this sense.~:":2:~~=C:="::===='. Show that if randomization is pennitted.1. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size.0. Bk). Find a statistic T( X. Prove Corollary 4. B" .2... prove Theorem 4. +akNk has approx. then a.. Y) belongs to population 1 if T > c. PdT < cJ is as small as possible...=  Section 4... Nk) ~ 4.2. Bo.. Y) and a critical value c such that if we use the classification rule.6) or (as in population 1) according to N(I. Hint: The MP test has power at least that of the test with test function J(x) 8. if . of and ~o '" I.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4.2.N. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2. In Example 4. . (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel..1.. H : (J = (Jo versus K : (J = (J.2..0. with probability .o = (1.2. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. L . two 5's are obtained. In Examle 4. 5.10 Problems and Complements 273 four numbers are equally likely to occur (i.17). .7).. then the maximum of the two probabilities ofmisclassification porT > cJ. 9.e.. Bd > cJ = I .2. 
find an approximation to the critical value of the MP level a test for this problem. .imately a N(np" na 2 ) distribution. I.2. derive the UMP test defined by (4.2. I..4. and to population 0 ifT < c.0.Pe. A fonnulation of goodness of tests specifies that a test is best if the max. Hint: Use Problem 4. .. Bo.0196 test rejects if. Upon being asked to play.. (c) Using the fact that if(N" .imum probability of error (of either type) is as small as possible. = a..
01. Use the normal approximation to the critical value and the probability of rejection. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00. > 1/>'0. hence. i 1 .01 lest to have power at least 0. (a) Show that L~ K: 1/>.. . • • " 5..i .. x > O. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! .<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function. the probability of your deciding to stay open is < 0. show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function. A) = c AcxC1e>'x . Let Xl... Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. the probability of your deciding to close is also < 0.) 3.01. . but if the arrival rate is > 15. . . Consider the foregoing situation of Problem 4.3.• n. . 274 Problems for Section 4. I i . I i . a model often used (see Barlow and Proschan. Hint: Show that Xf . 1965) is the one where Xl.Xn is a sample from a truncated binomial distribution with • I.0)"'/[1.Xn be the times in months until failure of n similar pieces of equipment. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. Show that if X I.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . . Here c is a known positive constant and A > 0 is the parameter of interest.<» is the (1 . .1 achieves this? (Use the normal approximation. In Example 4.3 Testing and Confidence Regions Chapter 4 1.4. How many days must you observe to ensure that the UMP test of Problem 4.<»/2>'0 where X2n(1. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. 
You want to ensure that if the arrival rate is < 10.95 at the alternative value 1/>'1 = 15. Find the sample size needed for a level 0.3..&(>')... • I i • 4.1. (c) Suppose 1/>'0 ~ 12. Suppose that if 8 < 8 0 it is not worth keeping the counter open. ]f the equipment is subject to wear.3.(IOn x = 1•. 0) = ( : ) 0'(1.Xn is a sample from a Weibull distribution with density f(x. that the Xi are a sample from a Poisson distribution with parameter 8. r p(x.. . i' i . the expected number of arrivals per day. .
(i) Show that if we model the distribution of Y as C(max{X I . Find the UMP test.8 > 80 . . Suppose that each Xi has the Pareto density f(x.d. ~ > O. where Xl_ a isthe (1 .. 6.B) = Fi!(x).1."" X n denote the incomes of n persons chosen at random from a certain population. (a) Lehmann Alte~p. suppose that Fo(x) has a nonzero density on some interval (a. 0 < B < 1. .5.O) = cBBx(l+BI. then e. Show how to find critical values.Fo(y)  }). X N }).Section 4. Let Xl.. imagine a sequence X I.) It follows that Fisher's method for cQmbining pvalues (see 4. against F(u)=u B...•• of Li. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo. we derived the model G(y. survival times with distribution Fa.1. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . . In Problem 1..O<B<1. .XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0. b). X 2.1. .. For the purpose of modeling. (Ii) Show that if we model the distribotion of Y as C(min{X I . Y n be the Ll. random variable. and let YI . In the goodnessoffit Example 4.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . A > O. (b) Find the optimal test statistic for testing H : JL = JLo versus K . (a) Express mean income JL in terms of e.6. X 2 . which is independent of XI. 7.6. .tive. 1 ' Y > 0.10 Problems and Complements 275 then 2. (See Problem 4. (b) NabeyaMiura Alternative.Q)th quantile of the X~n distribution.[1. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b). .d. > po..12. 00 < a < b < 00. .). Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. A > O. P(). Let N be a zerotruncated Poisson.  ...~) = 1. > 1. survival times of a sample of patients receiving an experimental treatment. x> c where 8 > 1 and c > O.1.6) is UMP for testing that the pvalues are uniformly distributed. 
then P(Y < y) = 1 e  .2 to find the mean and variance of the optimal test statistic.Fo(y)l". X N e>.. y > 0. Assume that Fo has ~ density fo. Hint: Use the results of Theorem 1.. and consider the alternative with distribution function F(x. 8.6.
. Show that under the assumptions of Theorem 4.exp( x). e e at l · 6. J J • 9.d.. eo versus K : B = fh where I i I : 11. n.1 and 01 loss.4 1. 0. .6t is UMP for testing H : 8 80 versus K : 8 < 80 . .. Let Xi (f)/2)t~ + €i. 1 . x > 0.' " 2. Using a pivot based on 1 (Xi .3.01. I Hint: Consider the class of all Bayes tests of H : () = ...52.O)d. 1 X n be a sample from a normal population with unknown mean J. B> O.2. Problem 2. .(O) 6 . 0.(O) L':x. where the €i are independent normal random variables with mean 0 and known variance . ..).276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.(O)/ 16' I • 00 p(x. Show that the UMP test is based on the statistic L~ I Fo(Y..i. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. F(x) = 1 . .  1 j . j 1 To see whether the new treatment is beneficial. Oo)d.0 e6 1 0~0. 1 • .L and unknown variance (T2. L(x. = . Problems for Section 4.'? = 2. 1 X n be i. we test H : {} < 0 versus K : {} > O.' (cf. the denominator decreasing.. Let Xl. Show that under the assumptions of Theorem 4. 10. We want to test whether F is exponential.{Od varies between 0 and 1.1).2. Oo)d.{Oo} = 1 . Let Xl. with distribution function F(x). or Weibull. 'i . /.. I .. Assume that F o has a density !o(Y). Show that under the assumptions of Theorem 4..3.X)2.' L(x. 0) e9Fo (Y) Fo(Y).0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi . .. every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.0) UeB for '. . F(x) = 1 . The numerator is an increasing function of T(x).2 the class of all Bayes tests is complete. Show that the test is not UMP.(O) «) L > 1 i The lefthand side equals f6". x > 0. 12.0". 00 p(x. 0 = 0. n announce as your level (1 . i = 1... > Er • . Hint: A Bayes test rejects (accepts) H if J .exp( _x B).j "'" . (a) Show how to construct level (1 ... What would you .X)' = 16. O)d.3.
z(1 .n is obtained by taking at = Q2 = a/2 (assume 172 known). then [q(X).l.4.c.~a)/ .SOt no l (1.01 [X .(X) = X . Compare yonr result to the n needed for (4. then the shortest level n (1 . although N is random.] . X!)/~i 1tl ofB. minI ii.1.(2) " .~a) /d]2 .n .a) interval of the fonn [ X . [0.~a)!. 0. i = 1. ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0. distribution. Show that if Xl..4.1)] if 8 given by (4.. I")/so. of the form Ji.7).a) confidence interval for B.1..1. with X . with N being the smallest integer greater than no and greater than or equal to [Sot". there is a shorter interval 7.1] if 8 > 0..IN(X  = I:[' lX. Use calculus. 5.Xn be as in Problem 4.(al + (2)) confidence interval for q(8). .al).no further observations. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2. .q(X)] is a level (1 . (a) Justify the interval [ii.1. .d. Suppose that in Example 4.a.) LCB and q(X) is a level (1.3).Section 4.3 we know that 8 < O. 6. < 0..3).X are ij. . .4. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4. t. Let Xl.. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. 0. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .1) based on n = N observations has length at most l for some preassigned length l = 2d.. ... has a 1.4.IN] .1 Show that.4. < a..2. but we may otherwise choose the t! freely. Stein's (1945) twostage procedure is the following. It follows that (1 . Then take N .(2) UCB for q(8). Show that if q(X) is a level (1 .. what values should we use for the t! so as to make our interval as short as possible for given a? 3. N(Ji" 17 2) and al + a2 < a.a./N. find a fixed length level (b) ItO < t i < I.IN. n.) Hint: Use (A..10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 .02. (Define the interval arbitrarily if q > q. 
X + z(1 . . X+SOt no l (1. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji. where 8. Suppose we want to select a sample size N such that the interval (4.
(1  ~a) / 2. d = i II: .!a)/ 4. Let XI. Let 8 ~ B(n. I 1 j Hint: [O(X) < OJ = [. exhibit a fixed length level (1 . Hence. .001. 11. (c) If (12. and. . T 2 I I' " I'. X has a N(J1..a) confidence interval for sin 1 (v'e). I . find ML estimates of 11.fN(X .. 1947. (12 2 9.jn is an approximate level (b) If n = 100 and X = 0.jn(X .) I ~i . Y n2 be two independent samples from N(IJ. in order to have a level (1 . is _ independent of X no ' Because N depends only on sno' given N = k. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. Show that the endpoints of the approximate level (1 . v. and the fundamental monograph of Wald.) (lr!d observations are necessary to achieve the aim of part (a). Let 8 ~ B(n. ..3) are indeed approximate level (1 .) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi. (a) If all parameters are unknown. if (J' is large.a) interval defined by (4. (12) and N(1/. . X n1 and Yll . (J'2 jk) distribution.a) confidence interval for (1/1'). i (b) What would be the minimum sample size in part (a) if (). ±. (b) Exhibit a level (1 .1') has a N(o.O). .l4. By Theorem B.  10. (a) Show that X interval for 8. Suppose that it is known that 0 < ~. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 .1.... Show that these two quadruples are each sufficient. (a) Show that in Problem 4.. Show that ~(). we may very likely be forced to take a prohibitively large number of observations.a:) confidence interval of length at most 2d wheri 0'2 is known.~a)].3.: " are known.3. 0. Hint: Set up an inequality for the length and solve for n.') populations.4.O)]l < z (1.051 (c) Suppose that n (12 = 5. 0) and X = 81n.jn is an approximate level (1  • 12.95 confidence interval for (J. (The sticky point of this approach is that we have no control over N. 1"2.~ a) upper and lower bounds. (a) Use (A. " 1 a) confidence . 
> z2 (1  is not known exactly.18) to show that sin 1( v'X)±z (1 . 0") distribution and is independent of sn" 8. The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook.. = 0..0)/[0(1. use the result in part (a) to compute an approximate level 0./3z (1.4.. Indicate what tables you wdtdd need to calculate the interval. .6. .0') for Jt of length at most 2d. 1986. 4().a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a). but we are sure that (12 < (It. it is necessary to take at least Z2 (1 (12/ d 2 observations. (12. sn. respectively.
(In practice.' Now use the central limit theorem. Show that the confidence coefficient of the rectangle of Example 4.4.5 is (1 _ ~a) 2.4/0.9 confidence region for (It. See Problem 5.027 13. then BytheproofofTheoremB. Xl = Xn_l na). but assume that J1.J1. 15.) Hint: (n .3.)'.~Q). (c) Suppose Xi has a X~ distribution.4 and the fact that X~ = r (k. (d) A level 0.30.n} and use this distribution to find an approximate 1 .Section 4.4.<. Find the limit of the distribution ofn. Z. In Example 4.9 confidence interval for {L. .2 is known as the kurtosis coefficient.9 confidence interval for cr. K.". 10. 0).(n .4 ) .' 1 (Xi .J1.)' .I). If S "' 8(64.3.n(X .1) + V2( n .Q confidence interval for cr 2 .7).1) t z{1 . and 10 000. V(o') can be written as a sum of squares of n 1 independentN(O. Slutsky's theorem.4.2.t ([en .J1. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.3. XI 16."') . known) confidence intervals of part (b) when k = I. Assuming that the sample is from a /V({L. Hint: Use Problem 8.3.ia).3).10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.1)8'/0') . Now use the law of large numbers. 14.1. 4) are independent. Hint: By B.9 confidence interval for {L x= + cr.4.1 is known.)4 < 00 and that'" = Var[(X.4. Compare them to the approximate interval given in part (a).2. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution. (b) A level 0.".J1. . it equals O.I) + V2( n1) t z( "1) and x(1 .) can be approximated by x(" 1) "" (n .4 = E(Xi . (a) Show that x( "d and x{1.3. (n _1)8'/0' ~ X~l = r(~(nl). cr 2 ) population. Hint: Let t = tn~l (1 . find (a) A level 0. give a 95% confidence interval for the true proportion of cures 8 using (al (4. 100. See A. is replaced by its MOM estimate.' = 2:.IUI.D andn(X{L)2/ cr 2 '" = r (4.). and X2 = X n _ I (1 . (c) A level 0.1 and.8). = 3. Compute the (K. and (b) (4.2. In the case where Xi is normal. 
Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed. K. 1) random variables. . and the central limit theorem as given in Appendix A.)/01' ~ (J1. Now the result follows from Theorem 8. ~).
95.4. I' . .a) confidence interval for F(x).F(x)]dx.l). (a) Show that I" • j j j = fa F(x)dx + f.r fixed.9) and the upper boundary I" = 00.1.u) confidence interval for 1".F(x)1 Typical choices of a and bare . U(O. distribution function. . Show that for F continuous where U denotes the uniform. as X and that X has density f(t) = F' (t). (b)ForO<o<b< I. That is. we want a level (1 . Consider Example 4.1.i. i 1 . It follows that the binomial confidence intervals for B in Example 4. 1'1(0) < X < 1'.1 (b) V1'(x)[l. Suppose Xl.05 and . find a level (I .. Assume that f(t) > 0 ifft E (a.4.F(x)l.d. In Example 4. .I I 280 Testing and Confidence Regions Chapter 4 17. 19..F(x)1 )F(x)[1 . )F(x)[l.11 .4.1'(x)] Show that for F continuous.7. 18. . verify the lower bounary I" given by (4. X n are i. define . " • j .F(x)1 is the approximate pivot given in Example 4. I (b) Using Example 4.3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.4. !i .6 with . • i. (a) For 0 < a < b < I. In this case nF(l') = #[X.4.~u) by the value U a determined by Pu(An(U) < u) = 1. define An(F) = sup {vnl1'(X) .3 for deriving a confidence interval for B = F(x).6. (c) For testing H o : F = Fo with F o continuous. b) for some 00 < a < 0 < b < 00.F(x)1 .I Bn(F) = sup vnl1'(x) .. pl(O) < X <Fl(b)}.u. < xl has a binomial distribotion and ~ vnl1'(x) .
(d) Show that [Mn . .(2) together. Show that if 8(X) is a level (1 . (15 = 1. Hint: If 0> 00 . .al) and (1 .13 show that the power (3(0 1 .. X2) = 1 if and only if Xl + Xi > c. (a) Deduce from Problem 4..a) LCB corresponding to Oc of parr (a). 5. ~ 01). = 1 versns K : Il. What value of c givessizea? (b) Using Problems B.5.) denotes the o.th quantile of the . (e) Similarly derive the level (1 ..4 and 8. N( O . (b) Derive the level (1. How small must an alternative before the size 0. .. a (b) Show that the test with acceptance region [f(~a) for testing H : Il. respectively. Let Xl. X 2 be independent N( 01.· (a) If 1(0.12 and B.Section 4.1 has size a for H : 0 (}"2 be < 00 . < XI? < f(1. = 0.0.2 are level 0. and consider the prob2 + 8~ > 0 when (}"2 is known. with confidence coefficient 1 .a) UCB for 0. 1/ n J is the shortest such confidence interval. = 1 rejected at level a 0. test given in part (a) has power 0.2).4.Xn1 and YI .3. 3.. Experience has shown that the exponential assumption is warranted.3. Or .~a) I Xl is a confidence interval for Il. .1"2nl.. . and let Il. Yf (1 .2n2 distribution. Hint: Use the results of Problems B.3.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. of l.2 = VCBs of Example 4.. for H : (12 > (15.a. M n /0. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . Give a 90% confidence interval for the ratio ~ of mean life times. Let Xl.5 1. "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function. respectively. O ) is an increasing 2 function of + B~.\.05. (e) Suppose that n = 16. (a) Find c such that Oc of Problem 4. [8(X) < 0] :J [8(X) < 00 ]. 
then the test that accepts.5.90? 4.3.1. if and only if 8(X) > 00 .1 O? 2..2 that the tests of H : .2). is of level a for testing H : 0 > 00 .. show that [Y f( ~a) I X.10 Problems and Complements 281 Problems for Section 4.) samples. Yn2 be independent exponential E( 8) and £(. .
of Example 4.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2). X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . . Thus. L:J ~ ( . k . og are independentN (0. but I(t) = I( t) for all t. .Xn be a sample from a population with density f (t . 6. X. T) where 1] is a parameter of interest and T is a nuisance parameter. 'TJo) of the composite hypothesis H : ry = ryo. ry) = O}. .2). . ). N( 020g. .~n + ~z(l . we have a location parameter family. l 7. I . 1 Ct ~r (b) Find the family aftests corresponding to the level (1. . (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F.[XU) (f) Suppose that a = 2(n') < OJ and P. Suppose () = ('T}. (b) The sign test of H versus K is given by. Let C(X) = (ry : o(X.  j 1 j . >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large..1 when (72 is unknown. (a) Shnw that C(X) is a level (l .00 . 0. Let X l1 . 0.4. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?.a) confidence region for the parameter ry and con versely that any level (1 . Hint: (c) X. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. (c) Show that Ok(o) (X.. We are given for each possible value 1]0 of1] a level 0: test o(X.0:) confidence interval for 11.a)y'n. O.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. lif (tllXi > OJ) ootherwise. . Show that PIX(k) < 0 < X(nk+l)] ~ .0:) LeB for 8 whatever be f satisfying our conditions.2). O?. respectively.IX(j) < 0 < X(k)] do not depend on 1 nr I.a.8) where () and fare unknown. and 1 is continuous and positive. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . (e) Show directly that P. Show that the conclusions of (aHf) still hold.
Let 1] denote a parameter of interest. p.5.3 and let [O(S). 12. 1) and q(O) the ray (X .a.[q(X) < 0 2 ] = 1. laifO>O <I>(z(1 . Po) is a size a test of H : p = Po. Note that the region is not necessarily an interval or ray. 0) ranges from a to a value no smaller than 1 . ' + P2 l' z(1 1 . 8( S)] be the exact level (1 . 00) is = 0'. Then the level (1 .1) is unbiased.a) . 9. 1).a) = (b) Show that 0 if X < z(1 . Thus. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.1.a) confidence interval [ry(x).z(l. Let X ~ N(O. Show that the Student t interval (4.a). Show that as 0 ranges from O(S) to 8(S).pYI < (1 1 otherwise. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X .5. 10.20) if 0 < 0 and.10 Problems and Complements 283 8. 1). That is. if 00 < O(S) (S is inconsistent with H : 0 = 00 ). 11. alll/.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.2. Hint: You may use the result of Problem 4.a))2 if X > z(1 . This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.5.2a) confidence interval for 0 of Example 4. Establish (iii) and (iv) of Example 4.O ~ J(X.7. 7)).z(1 . Let P (p.7. or).a). that suP. . hence. the quantity ~ = 8(8) .2 a) (a) Show that J(X. and let () = (ry.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H. (b) Describe the confidence region obtained by inverting the family (J(X.Y. p)} as in Problem 4.p) = = 0 ifiX . Y.5. Yare independent and X ~ N(v.Section 4. Let a(S. Y. Suppose X. let T denote a nuisance parameter. a( S. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.5. Y ~ N(r/.2. Define = vl1J.
13. Confidence Regions for Quantiles. Let $X_1, \ldots, X_n$ be a sample from a population with continuous distribution $F$. Let $x_p = \frac{1}{2}[F^{-1}(p) + F_U^{-1}(p)]$, $0 < p < 1$, be the $p$th quantile of $F$. (See Section 3.5.) Suppose that $p$ is specified. Thus, $100x_{.95}$ could be the 95th percentile of the salaries in a certain profession, or $100x_{.05}$ could be the fifth percentile of the duration time for a certain disease.

(a) Show that testing $H : x_p \le 0$ versus $K : x_p > 0$ is equivalent to testing $H' : P(X \ge 0) \le (1 - p)$ versus $K' : P(X \ge 0) > (1 - p)$.

(b) The quantile sign test $\delta_k$ of $H$ versus $K$ has critical region $\{x : \sum_{i=1}^n 1[X_i \ge 0] \ge k\}$. Determine the smallest value $k = k(\alpha)$ such that $\delta_{k(\alpha)}$ has level $\alpha$ for $H$ and show that for $n$ large, $k(\alpha) \approx h(\alpha)$, where
$$h(\alpha) \approx n(1 - p) + z_{1-\alpha}\sqrt{np(1 - p)}.$$

(c) Let $x^*$ be a specified number with $0 < F(x^*) < 1$. Show that $\delta_k(X_1 - x^*, \ldots, X_n - x^*)$ is a level $\alpha$ test for testing $H : x_p \le x^*$ versus $K : x_p > x^*$.

(d) Deduce that $X_{(n-k(\alpha)+1)}$ ($X_{(j)}$ is the $j$th order statistic of the sample) is a level $(1 - \alpha)$ LCB for $x_p$ whatever be $F$ satisfying our conditions. That is, it is distribution free.

(e) Let $S$ denote a $B(n, p)$ variable and choose $k$ and $l$ such that
$$1 - \alpha = P(k \le S \le n - l + 1) = \sum_{j=k}^{n-l+1} \binom{n}{j} p^j (1 - p)^{n-j}.$$
Show that $P(X_{(k)} \le x_p \le X_{(n-l)}) = 1 - \alpha$. That is, $(X_{(k)}, X_{(n-l)})$ is a level $(1 - \alpha)$ confidence interval for $x_p$ whatever be $F$ satisfying our conditions.

(f) Show that $k$ and $l$ in part (e) can be approximated by $h(\frac{1}{2}\alpha)$ and $h(1 - \frac{1}{2}\alpha)$ where $h(\alpha)$ is given in part (b).

(g) Let $\hat F(x)$ denote the empirical distribution. Show that the interval in parts (e) and (f) can be derived from the pivot
$$T(x_p) = \frac{\sqrt{n}\,[\hat F(x_p) - F(x_p)]}{\sqrt{F(x_p)[1 - F(x_p)]}}.$$
Hint: Note that $F(x_p) = p$. Construct the interval using $\hat F^{-1}$ and $\hat F_U^{-1}$.

14. Simultaneous Confidence Regions for Quantiles. In Problem 13 preceding we gave a distribution-free confidence interval for the $p$th quantile $x_p$ for $p$ fixed. Suppose we want a distribution-free confidence region for $x_p$ valid for all $0 < p < 1$. We can proceed as follows. Let $F$, $\hat F^-(x)$, and $\hat F^+(x)$ be as in Examples 4.4.6 and 4.4.7. Then
$$P(\hat F^-(x) \le F(x) \le \hat F^+(x) \text{ for all } x \in (a, b)) = 1 - \alpha.$$

(a) Show that this statement is equivalent to
$$P(x_p^- \le x_p \le x_p^+ \text{ for all } p \in (0, 1)) = 1 - \alpha,$$
. Suppose that X has the continuous distribution F. F(x) > pl. Note the similarity to the interval in Problem 4.r" ~ inf{x: a < x < b.1...x. X I.Section 4.17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p . then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ). Give a distributionfree level (1 .) < p} and. where p ~ F(x). .u are the empirical distributions of U and 1 .. .p. VF(p) = ![x p + xlpl. Express the band in terms of critical values for An(F) and the order statistics. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. I). . The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R. L i==I n I IFx (Xi) . F f (.(x)] < Fx(x)] = nF1_u(Fx(x)).U with ~ U(O. ~ ~ (a) Consider the test statistic x . (b) Express x p and .X n . F~x) has the same distribution Show that if F x is continuous and H holds. That is. LetFx and F x be the empirical distributions based on the i.. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X.T: a <.r < b.1.d.xp]: 0 < l' < I}.z. (c) Show how the statistic An(F) of Problem 4. that is. Hint: Let t.13(g) preceding.4..j < F_x(x)] See also Example 4.. Suppose X denotes the difference between responses after a subject has been given treatments A and B.(x) = F ~(Fx(x» .. We will write Fx for F when we need to distinguish it from the distribution F_ x of X. .Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}.10 Problems and Complements 285 where x" ~ SUp{. .i.. where A is a placebo.5. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. 15.X I. the desired confidence region is the band consisting of the collection of intervals {[.r p in terms of the critical value of the Kolmogorov statistic and the order statistics.. where F u and F t . then L i==l n IIXi < X + t. X n and .
.3. . Fx. To test the hypothesis H that the two treatments are equally effective. I I I 1 1 D(Fx.x (': + A(x)).Fy ) = maxIFy(t) 'ER "" ..2A(x) ~ HI(F(x)) l 1 + vF(F(x)).Fy ): 0 < p < 1}. Give a distributionfree level (1. Hint: Define H by H. for ~. = 1 i 16..5.. nFy(t) = 2.17. i . treatment A (placebo) responses and let Y1 . It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H. < F y I (Fx(x))] i=I n = L: l[Fy (Y.x = 2VF(F(x)). where F u and F v are independent U(O.. 1 Yn be i.d. the probability is (1 .F x l (l . we get a distributionfree level (1 . i=I n 1 i t .. It follows that if c. Properties of this and other bands are given by Doksum. Let 8(. The result now follows from the properties of B(·)..".1.) < Fy(x p )] = nFv (Fx (t)) under H.:~ 1 1 [Fy(Y. (b) Consider the parameter tJ p ( F x.x p . .l (p) ~ ~[Fxl(p) .a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .'" 1 X n be Li.) : :F VF ~ inf vF(p).. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y .) is a location parameter} of all location parameter values at F. and by solving D( Fx. Then H is symmetric about zero.U ).) = D(Fu . where :F is the class of distribution functions with finite support. .. Hint: Let A(x) = FyI (Fx(x)) . then D(Fx . ~ ~ ~ F~x.) ~ < F x (")] ~ ~ = nFu(Fx(x)).(x) ~ R.:~ I1IFx(X. Fx(t)l· .p)] = ~[Fxl(p) + F~l:(p)]. he a location parameter as defined in Problem 3. As in Example 1. where x p and YP are the pth quan tiles of F x and Fy.F~x. i I f F. (p). where do:. "" = Yp . let Xl.Jt: 0 < p < 1) for the curve (5 p (Fx.d.a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(. Also note that x = H.:~ II[Fx(X..) < do. vt(p)) is the band in part (b). F y ).286 Testing and Confidence Regions ~ Chapter 4 '1. .) < Fx(t)] = nFu(Fx(t))..F1 _u). Show that for given F E F. <aJ Show that if H holds.) < Fx(x)) = nFv(Fx(x)). 
Moreover..a) simultaneous confidence band 15. Fenstad and Aaberge (1977). treatment B responses. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ .t::. is the nth quantile of the distribution of D(Fu l F 1 .i. .1) empirical distributions.. vt = O<p<l sup vt(p) O<p<l where [V. nFx(x) we set ~ 2.(F(x)) . Hint: nFx(t) = 2.. F y ) .x. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R. then  nFy(x + A(X)) = L: I[Y. Let t (c) A Distdbution and ParameterFree Confidence Interval. F y ) has the same distribntion as D(Fu . then D(Fx..
2. T n n = (22:7 I X. .) ~ + t. Exhibit the UMA level (1 . if I' = 1I>'. 6.a) UCB for O. 3. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('. and 8 are shift parameters. B(. Let 6 = minO<1'<16p(Fx.6 1.Xn is a sample from a r (p. O(Fx. F y ) E F x F. ~ D(Fu •F v ). then Y' = Y. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. Fy ) .).L. x' 7 R.6+] contains the shift parameter set {O(Fx. then by solving D(F x.0:) confidence bound forO.) I i=l L tt .6. 0 < p < 1..) < do: for Ll.Section 4.2. F x ) ~ a and YI > Y.(.F y ' (p) = 6. Show that if 0(·. moreover X + 6 < Y' < X + 6.Fy). is called a shift parameter if O(Fx. F y ) and b _maxo<p<l 6p(Fx ..)I L ti . Fy ).0:) simultaneouS coofidence band for t. t Hint: Set Y' = X + t.. we find a distributionfree level (1 . t. F y ).4. (b) Consider the unbiased estimate of O. . (d) Show that E(Y) .). .. :F x :F O(Fx. F y ) < O(Fx . 6+ maxO<p<1 6+(p). ~) distribution. nFx ("') = nFu(Fx(x)). Show that n p is known and 0 O' ~ (2 2~A X. Fy. tt Hint: Both 0 and 8* are nonnally distributed.2uynz(1 i=! i=l all L i=l n ti is also a level (1. where is nnknown.E(X).(x)) . F y ) is in [6. F y. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976). It follows that if we set F\~•.. Now apply the axioms.    Problems for Section 4.T.0: UCB for J.('. thea It a unifonnly most accurate level 1 . Fv ).) . Show that for the model of Problem 4. Show that B = (2 L X.·) is a shift parameter.(X). F y.) = F (p) . where :F is the class of distri butions with finite support. F y ) > O(Fx " Fy).(x) = Fy(x thea D(Fx. ) is a shift parameter} of the values of all shift parameters at (Fx . (a) Consider the model of Problem 4..(P).10 Problems and Complements 287 Moreover. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval. then B( Fx . Let do denote a size a critical value for D(Fu . Q. 
the probability is (1 .(Fx .2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. = 22:7 I Xf/X2n( a) is .3.a) that the interval [8. Fx +a) = O(Fx a. Show that for given (Fx .) 12:7 I if.6].. ...4.= minO<p<l 6. Suppose X I. Let6.
(1: dU) pes.B') < E.1 are well defined and strictly increasing..\ =. e> 0. B). j .12 so that F. /3( T. Construct unifoffilly most accurate level 1 . uniform. V be random variables with d. . 1.B).2. I'.Xn are Ll. establish that the UMP test has acceptance region (4.W < B' < OJ for allB' l' B. ·'"1 1 . .3. (a) Show that if 8 has a beta. . and that B has the = se' (t. Pa( c. Suppose that given.1. . satisfying the conditions of Problem 8. .aj LCB such that I .288 Testing and Confidence Regions Chapter 4 4.2 and B.( 0' Hint: E.\ is distributed as V /050.W < BI = 7.Bj has the F distribution F 2r . s > 0. Suppose that B' is a uniformly most accurate level (I ... .E(V) are finite.C. [B. then )" = sB(r(1 . Xl.[B < u < O]du. 0']. s) distribution with T and s integers.2.B')+. OJ. Prove Corollary 4. '" J. 2. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t. 3. f3( T.. (0 ..i.6.6.' with "> ~. and for ().) and that. t > e. distribution and that B has beta. ~. p.6.d. 0). i . n = 1.6.X. Hint: By Problem B.l2(b).O] are two level (1 . F1(tjdt.l. In Example 4.7 1 . (b) Suppose that given B = B. where Show that if (B.6 for c fixed. U = (8 . Establish the following result due to Pratt (1961).6 to V = (B . where s = 80+ 2n and W ~ X.\.f:s F..B) = J== J<>'= p( s. • .d.1.a) upper and lower credible bounds for A. U(O. Suppose [B'. Poisson.(O . 5. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for.3.. then E. s). Suppose that given B Pareto.a. 8. s).2" Hint: See Sections 8. .3 and 4. J • 6. respectively.3. Show that if F(x) < G(x) for all x and E(U).. Hint: Apply Problem 4. (B" 0') have joint densities.B).3). Let U. P(>.B)+.2. then E(U) > E(V). • • • = t) is distributed as W( s. Hint: Use Examples 4. Problems for Section 4. X has a binomial. g.B(n. . t)dsdt = J<>'= P. G corresponding to densities j. 
> 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I .a) confidence intervals such that Po[8' < B' < 0'] < p. . 1I"(t) Xl. density = B. t) is the joint density of (8. .4."" X n are i.4.0 upper and lower confidence bounds for J1 in the model of Problem 4. distribution with r and s positive integers. E(U) = .
r 1x. (c) Give a level (1 . 1 r. = So I (m + n  2). y) is proportional to p(8)p(x 1/11. = PI ~ /12 and'T is 7f(Ll. . Hint: p( 0 I x. r( TnI (c) Set s2 + n.x)1f(/1. respectively... (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll. y).rlm) density and 7f(/12 1r.x)' + E(Yj y)'.Xn is observable and X n + 1 is to be predicted..y) is 1f(r 180)1f(/11 1 r. 1f(/11 1r.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.y.d.10 Problems and Complements 289 (a)LetM = Sf max{Xj.. 'T).0:) prediction interval for Xnt 1. (d) Compare the level (1. y) is obtained by integrating out Tin 7f(Ll.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2.Xn ). .0:) confidence interval for B. ... Suppose that given 8 = (ttl' J. (a) Let So = E(x. /11 and /12 are independent in the posterior distribution p(O X. Show thatthe posterior distribution 7f( t 1 x. . 'P) is the N(x .1 )) distribution. as X. y) of is (Student) t with m + n ..s') withe' = max{c. Show fonnaUy that the posteriof?r(O 1 x. (b) Show that given r. Here Xl.i. Let Xl..1. and Y1.2. y) is aN(y. Hint: 7f(Ll 1x.m} and + n.a) upper and lower credible bounds for O.r 1x. T) = (111.. (b) Find level (1 . r) where 7f(Ll I x .Xn + 1 be i.x) is aN(x. .y) = 7f(r I sO)7f(LlI x .Section 4. Problems for Section 4.2 degrees of freedom. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. . r > O. .y. Xl.. . (a) Give a level (1 ..Xm.112.0"5). Show that (8 = S IM ~ Tn) ~ Pa(c'. where (75 is known. .r). Suppose 8 has the improper priof?r(O) = l/r.y. N(J.8 1. . y) and that the joint density of.. In particular consider the credible bounds as n + 00. r In) density..l..r)p(y 1/12. ".1.. 4.
8. 1 X n are i. Suppose XI. Present the results in a table and a graph.9 1. distribution. )B(r+x+y.. . . Let Xl.(0 I x)dO...3) by doing a frequentist computation of the probability of coverage. A level (1 . • . . as X .5. give level 3. . Un + 1 ordered. .'" .2n distribution. (3(r. Give a level (1 .12.9.2) hy using the observation that Un + l is equally likely to be any of the values U(I). and a = ..Xn are observable and X n + 1 is to be predicted.11. give level (I .2. N(Jl.d.d.e. has a B(m. Establish (4. 0). Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X .8. where Xl. X n such that P(Y < Y) > 1.' .·) denotes the beta function. Find the probability that the Bayesian interval covers the true mean j.. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 . ! " • Suppose Xl..s+n. (b) If F is N(p" bounds for X n + 1 .8.• . Suppose that given (J = 0.Xn + 1 be i.9.. which is not observable. X is a binomial. and that (J has a beta.. 0'5)' Take 0'5 ~ 7 2 = I.. b). .15. B( n. In Example 4.a) lower and upper prediction bounds for X n + 1 .. 4..a).) Hint: First show that ..i. i! Problems for Section 4.8. .05. CJ2) with (J2 unknown. . 5.s+nx+my)/B(r+x. I x) is sometimes called the Polya q(y I x) = J p(y I 0).Xn are observable and we want to predict Xn+l.d. :[ = . let U(l) < . q(ylx)= ( . suppose Xl...10.10.290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. 0 > 0. . ~o = 10. . That is.t for M = 5.. s). 1 2. random variable. " u(n+l). . Suppose that Y.'.8).i..0'5) with 0'5 known.a) prediction interval for X n + l . Let X have a binomial.5. (This q(y distribution.i. as X where X has the exponential distribution F(x I 0) =I . X n+ l are i. 0) distribution given (J = 8. x > 0.nl. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. . 
Then the level of the frequentist interval is 95%.a (P(Y < Y) > I . B(n.Q) distribution free lower and upper prediction bounds for X n + 1 • < 00.. F. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I ..x . < u(n+l) denote Ul . distribution. 00 < a < b (1 ..x ) where B(·. n = 100. (a) If F is N(Jl.0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl.
Hint: Show that for $x < \frac{1}{2}n$, $\lambda(x)$ is an increasing function of $-(2x - n)$ and $\lambda(x) = \lambda(n - x)$.

In Problems 2-4, let $X_1, \ldots, X_n$ be a $N(\mu, \sigma^2)$ sample with both $\mu$ and $\sigma^2$ unknown.

2. In testing $H : \mu \le \mu_0$ versus $K : \mu > \mu_0$ show that the one-sided, one-sample $t$ test is the likelihood ratio test (for $\alpha \le \frac{1}{2}$).
Hint: Note that $\hat\mu_0 = \bar X$ if $\bar X \le \mu_0$ and $= \mu_0$ otherwise. Thus, $\log \lambda(x) = 0$ if $T_n \le 0$ and $= (n/2)\log(1 + T_n^2/(n - 1))$ for $T_n > 0$, where $T_n$ is the $t$ statistic.

3. One-Sided Tests for Scale. We want to test $H : \sigma^2 \le \sigma_0^2$ versus $K : \sigma^2 > \sigma_0^2$. Show that

(a) Likelihood ratio tests are of the form: Reject if, and only if,
$$\frac{\hat\sigma^2}{\sigma_0^2} = \frac{1}{n\sigma_0^2}\sum_{i=1}^n (X_i - \bar X)^2 \ge c.$$
Hint: $\log \lambda(x) = 0$ if $\hat\sigma^2/\sigma_0^2 \le 1$ and $= (n/2)[\hat\sigma^2/\sigma_0^2 - 1 - \log(\hat\sigma^2/\sigma_0^2)]$ otherwise.

(b) To obtain size $\alpha$ for $H$ we should take $c = x_{n-1}(1 - \alpha)/n$.
Hint: Recall Theorem B.3.3.

(c) These tests coincide with the tests obtained by inverting the family of level $(1 - \alpha)$ lower confidence bounds for $\sigma^2$.

4. Two-Sided Tests for Scale. We want to test $H : \sigma = \sigma_0$ versus $K : \sigma \ne \sigma_0$.

(a) Show that the size $\alpha$ likelihood ratio test accepts if, and only if,
$$c_1 \le \frac{1}{\sigma_0^2}\sum_{i=1}^n (X_i - \bar X)^2 \le c_2$$
where $c_1$ and $c_2$ satisfy,
(i) $F(c_2) - F(c_1) = 1 - \alpha$, where $F$ is the d.f. of the $\chi^2_{n-1}$ distribution.
(ii) $c_1 - c_2 = n \log c_1/c_2$.

(b) Use the normal approximation to check that
$$c_{1n} = n - \sqrt{2n}\,z(1 - \tfrac{1}{2}\alpha), \qquad c_{2n} = n + \sqrt{2n}\,z(1 - \tfrac{1}{2}\alpha)$$
approximately satisfy (i) and also (ii) in the sense that the ratio
$$\frac{c_{1n} - c_{2n}}{n \log c_{1n}/c_{2n}} \to 1 \text{ as } n \to \infty.$$

(c) Deduce that the critical values of the commonly used equal-tailed test, $x_{n-1}(\frac{1}{2}\alpha)$, $x_{n-1}(1 - \frac{1}{2}\alpha)$ also approximately satisfy (i) and (ii) of part (a).
5. The following blood pressures were obtained in a sample of size $n = 5$ from a certain population: 124, 110, 114, 100, 190. Assume the one-sample normal model.

(a) Using the size $\alpha = 0.05$ one-sample $t$ test, can we conclude that the mean blood pressure in the population is significantly larger than 100?

(b) Compute a level 0.95 confidence interval for $\sigma^2$ corresponding to inversion of the equal-tailed tests of Problem 4.9.4.

(c) Compute a level 0.90 confidence interval for the mean blood pressure $\mu$.

6. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ samples, respectively.

(a) Show that the MLE of $\theta = (\mu_1, \mu_2, \sigma^2)$ is $(\bar X, \bar Y, \tilde\sigma^2)$, where $\tilde\sigma^2$ is as defined in Section 4.9.3.

(b) Consider the problem of testing $H : \mu_1 \le \mu_2$ versus $K : \mu_1 > \mu_2$. Assume $\alpha \le \frac{1}{2}$. Show that the likelihood ratio statistic is equivalent to the two-sample $t$ statistic $T$.

(c) Using the normal approximation $\Phi(z(\alpha) + \sqrt{n_1 n_2/n}\,(\mu_1 - \mu_2)/\sigma)$ to the power, find the sample size $n$ needed for the level 0.01 test to have power 0.95 when $n_1 = n_2 = \frac{1}{2}n$ and $(\mu_1 - \mu_2)/\sigma = \frac{1}{2}$.

7. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. The control measurements ($x$'s) correspond to 0 pounds of mulch per acre, whereas the treatment measurements ($y$'s) correspond to 500 pounds of mulch per acre. Forage production is also measured in pounds per acre.

x    794   1800    576    411    897
y   2012   2477   3498   2092   1808

Assume the two-sample normal model with equal variances.

(a) Find a level 0.95 confidence interval for $\mu_2 - \mu_1$.

(b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use $\alpha = 0.05$.

(c) Find a level 0.90 confidence interval for $\sigma$ by using the pivot $s^2/\sigma^2$.

8. Suppose $X$ has density $p(x, \theta)$, $\theta \in \Theta$, and that $T$ is sufficient for $\theta$. Show that $\lambda(X, \Theta_0, \Theta_1)$ depends on $X$ only through $T$.

9. The normally distributed random variables $X_1, \ldots, X_n$ are said to be serially correlated or to follow an autoregressive model if we can write
$$X_i = \theta X_{i-1} + \epsilon_i, \quad i = 1, \ldots, n,$$
where $X_0 = 0$ and $\epsilon_1, \ldots, \epsilon_n$ are independent $N(0, \sigma^2)$ random variables.
0) by the following table. Xo = o. (b) P. Then. The power functions of one.IITI > tl is an increasing function of 161. I). Then use py. = 0.. . samples from N(J1l. Fix 0 < a < and <>/[2(1 . (T~). P. . (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl.[Z > tVV/k1 is increasing in 6. get the joint distribution of Yl = ZI vVlk and Y2 = V..<» (1 . Consider the following model.OXi_d 2} i= 1 < Xi < 00. X powerful whatever be B. Yl. for each v > 0. Define the frequency functions p(x.O) for 00 = (Z.6. II.2. (Tn. 12. (a) P.and twosided t tests. 7k. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~.IIZI > tjvlk] is increasing in [61. (Yl) = f PY"Y.. (An example due to C. has density !k . with all parameters assumed unknown. Hint: Let Z and V be independent and have N(o. Show that the noncentral t distribution. distribution. 1 {= xj(kl)ellx+(tVx/k'I'ldx. 13.(t) = .6. The F Test for Equality of Scale. Y2 )dY2. / L:: i 10. . X n1 . Condition on V and apply the double expectation theorem.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if.. respectively.<»1 < e < <>. .10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl.[T > tl is an increasing function of 6.n. XZ distributions respectively. Show that. has level a and is strictly more 11. Yn2 be two independent N(P. . ..2) In exp{ (I/Z(72) 2)Xi . J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. 7k.. i = 1. Let Xl. From the joint distribution of Z and V. P.. (Yl.Section 4.. and only if... .'" p(X. Let e consist of the point I and the interval [0. Stein). Suppose that T has a noncentral t.
.9. YI ).1)]E(Y..7 to conclude that tribution as 8 12 /81 8 2 .8. show that given Uz = Uz . . (11. . .1 . and using the arguments of Problems B. · . distribution and that critical values can be .'.nl1 ar is of the fonn: Reject if.4. . . I ' LX. 1. then P[p > c] is an increasing function of p for fixed c. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model. (Un. .24) implies that this is also the unconditional distribution.. note that . that T is an increasing function of R. and use Probl~1ll4.96 279 2. 1 .. x y 254 2. Consider the problem of testing H : p = 0 versus K : p i O.61 310 2. p) distribution. . Un). .1. Because this conditional distribution does not depend on (U21 •.X)' (b) Show that (aUa~)F has an F nz obtained from the :F table. L Yj'. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . where l _ . The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study.Y)'/E(X.a/2) or F < f( n/2). . . and only if.. where f( t) is the tth quantile of the F nz . using (4..1' Sf = L U...71 240 2. ! . i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. • I " and (U" V.1)/(n. Hint: Use the transfonnations and Problem B. si = L v. 15. > C. !I " .7 and B.19 315 2.9.62 284 2.68 250 2.8. (1~. the continuous versiop of (B.V.64 298 2.4. where a.'. p) distribution. Let (XI.J J . . . Vn ) is a sample from a N(O. Finally. F ~ [(nl . 1.12 337 1. .5) that 2 log '(X) V has a distribution. i=2 i=2 n n S12 =L i=2 n U. 0. <. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . Yn) be a sampl~ from a bivariateN(O. .0 has the same dis r j . a?. Let R = S12/SIS" T = 2R/V1 R'. (Xn .~ I i " 294 Testing and Confidence Regions Chapter 4 .0 has the same distribution as R.. 
(d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. . V. Argue as in Proplem 4. r> (c) Justify the twosided F test: Reject H if.37 384 2.nt 1 distribution.~ I .4) and (4. T has a noncentral Tnz distribution with noncentrality parameter p.10. . (a) Show that the LR test of H : af = a~ versus K : ai > and only if.' ! . Sbow. 14.4. F > /(1 .). Un = Un. . . vn 1 l i 16.4. • .4. 0.9.94 Using the likelihood ratio test for the bivariate nonnal model. i 0 in xi !:. p) distribution.1I(a). .
17. Consider the bioequivalence example in Problem 3.2.9.

(a) Find the level $\alpha$ LR test for testing $H : \theta \in [-\epsilon, \epsilon]$ versus $K : \theta \notin [-\epsilon, \epsilon]$.

(b) Compare your solution to the Bayesian solution based on a continuous loss function given in Problem 3.2.9. Consider the cases $\eta_0 = 0$, $\tau_0^2 \to \infty$, and $\eta_0 = 0$, $n \to \infty$.

4.11 NOTES

Notes for Section 4.1

(1) The point of view usually taken in science is that of Karl Popper [1968]. Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding. Rejection is more definitive.

(2) We ignore at this time some real-life inadequacies of this experiment such as the placebo effect (see Example 1.1.3).

(3) A good approximation (Durbin, 1973; Stephens, 1974) to the critical value is $c_n(t) = t/(\sqrt{n} - 0.01 + 0.85/\sqrt{n})$ where $t = 1.035$, $0.895$ and $t = 0.819$ for $\alpha = 0.01$, $0.05$ and $0.10$, respectively.

Notes for Section 4.3

(1) Such a class is sometimes called essentially complete. The term complete is then reserved for the class where strict inequality in (4.3.3) holds for some $\theta$ if $\varphi \notin \mathcal{D}$.

(2) The theory of complete and essentially complete families is developed in Wald (1950), see also Ferguson (1967). Essentially, if the parameter space is compact and loss functions are bounded, the class of Bayes procedures is complete. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.

Notes for Section 4.4

(1) If the continuity correction discussed in Section A.15 is used here, $S$ in $\bar\theta(X)$ would be replaced by $S + \frac{1}{2}$ and $S$ in $\underline\theta(X)$ is replaced by $S - \frac{1}{2}$.

(2) In using $\underline\theta(S)$ as a confidence bound we are using the region $[\underline\theta(S), 1]$. Because the region contains $C(X)$, it also has confidence level $(1 - \alpha)$.

REFERENCES

BARLOW, R. AND F. PROSCHAN, Mathematical Theory of Reliability New York: J. Wiley & Sons, 1965.

BICKEL, P., E. HAMMEL, AND J. W. O'CONNELL, "Is there a sex bias in graduate admissions?" Science, 187, 398-404 (1975).

BOX, G. E. P., Apology for Ecumenism in Statistics and Scientific Inference, Data Analysis and Robustness, G. E. P. Box, T. Leonard, and C. F. Wu, Editors New York: Academic Press, 1983.
BROWN, L. D., T. CAI, AND A. DAS GUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).

DOKSUM, K. A. AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).

DOKSUM, K. A., G. FENSTAD, AND R. AABERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).

DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).

FERGUSON, T., Mathematical Statistics. A Decision Theoretic Approach New York: Academic Press, 1967.

FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.

HALD, A., Statistical Theory with Engineering Applications New York: J. Wiley & Sons, 1952.

HEDGES, L. V. AND I. OLKIN, Statistical Methods for Meta-Analysis Orlando, FL: Academic Press, 1985.

JEFFREYS, H., The Theory of Probability Oxford: Oxford University Press, 1961.

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

POPPER, K. R., Conjectures and Refutations; the Growth of Scientific Knowledge, 3rd ed. New York: Harper and Row, 1968.

PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).

SACKROWITZ, H. AND E. SAMUEL-CAHN, "P values as random variables-Expected P values," The American Statistician, 53, 326-331 (1999).

STEPHENS, M., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).

STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).

TATE, R. F. AND G. W. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).

VAN ZWET, W. R. AND J. OOSTERHOFF, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).

WALD, A., Sequential Analysis New York: Wiley, 1947.

WALD, A., Statistical Decision Functions New York: Wiley, 1950.

WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).

WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).

WETHERILL, G. B. AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.

WILKS, S. S., Mathematical Statistics New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific $P$ by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample $X_1, \ldots, X_n$ from a distribution $F$, our setting for this section and most of this chapter. If we want to estimate $\mu(F) \equiv E_F X_1$ and use $\bar X$ we can write,

$$MSE_F(\bar X) = \frac{\sigma^2(F)}{n}. \qquad (5.1.1)$$

This is a highly informative formula, telling us exactly how the MSE behaves as a function of $n$, and calculable for any $F$ and all $n$ by a single one-dimensional integration. However, consider $\mathrm{med}(X_1, \ldots, X_n)$ as an estimate of the population median $\nu(F)$. If $n$ is odd, $\nu(F) = F^{-1}(\tfrac12)$, and $F$ has density $f$ we can write

$$MSE_F(\mathrm{med}(X_1, \ldots, X_n)) = \int_{-\infty}^{\infty} \left(x - F^{-1}(\tfrac12)\right)^2 g_n(x)\,dx \qquad (5.1.2)$$

where, from Problem (B.2.13), if $n = 2k + 1$,

$$g_n(x) = n \binom{2k}{k} F^k(x)(1 - F(x))^k f(x). \qquad (5.1.3)$$

Evaluation here requires only evaluation of $F$ and a one-dimensional integration, but a different one for each $n$ (Problem 5.1.1). Worse, the qualitative behavior of the risk as a function of $n$ and simple parameters of $F$ is not discernible easily from (5.1.2) and (5.1.3). To go one step further, consider evaluation of the power function of the one-sided $t$ test of Chapter 4. If $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ we have seen in Section 4.9.2 that $\sqrt{n}\,\bar X / S$ has a noncentral $t$ distribution with parameter $\mu/\sigma$ and $n - 1$ degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions
(Problem 5.1.2) and its qualitative properties are reasonably transparent. But suppose $F$ is not Gaussian. It seems impossible to determine explicitly what happens to the power function because the distribution of $\sqrt{n}\,\bar X / S$ requires the joint distribution of $(\bar X, S)$ and in general this is only representable as an $n$-dimensional integral;

$$P\left[\sqrt{n}\,\frac{\bar X}{S} \le t\right] = \int_A \prod_{i=1}^n f(x_i)\,dx_i$$

where

$$A = \left\{ (x_1, \ldots, x_n) : \sum_{i=1}^n x_i \le \frac{t}{\sqrt{n-1}} \left( n \sum_{i=1}^n x_i^2 - \Big(\sum_{i=1}^n x_i\Big)^2 \right)^{1/2} \right\}.$$

There are two complementary approaches to these difficulties. The first, which we explore further in later chapters, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw $B$ independent "samples" of size $n$, $\{X_{1j}, \ldots, X_{nj}\}$, $1 \le j \le B$, from $F$ using a random number generator and an explicit form for $F$. Approximately evaluate $R_n(F)$ by

$$\hat R_B = \frac{1}{B} \sum_{j=1}^B l(F, \delta(X_{1j}, \ldots, X_{nj})). \qquad (5.1.4)$$

By the law of large numbers, as $B \to \infty$, $\hat R_B \stackrel{P}{\to} R_n(F)$. Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate $R_n(F)$ arbitrarily closely. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asymptotics briefly in Example 5.3.3.

The other, which occupies us for most of this chapter, is to approximate the risk function under study,

$$R_n(F) \equiv E_F\, l(F, \delta(X_1, \ldots, X_n)),$$

by a qualitatively simpler to understand and easier to compute function, $\tilde R_n(F)$.

Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing $n$ i.i.d. observations $X_1, \ldots, X_n$ as $n \to \infty$. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now.

Asymptotics, in this context, always refers to a sequence of statistics

$$\{T_n(X_1, \ldots, X_n)\}_{n \ge 1},$$

for instance the sequence of means $\{\bar X_n\}_{n \ge 1}$, where $\bar X_n \equiv \frac{1}{n}\sum_{i=1}^n X_i$, or the sequence of medians, or it refers to the sequence of their distributions

$$\{\mathcal{L}_F(T_n(X_1, \ldots, X_n))\}_{n \ge 1}.$$

Asymptotic statements are always statements about the sequence. The classical examples are $\bar X_n \stackrel{P}{\to} E_F(X_1)$ or

$$\mathcal{L}_F\left(\sqrt{n}(\bar X_n - E_F(X_1))\right) \to N(0, \mathrm{Var}_F(X_1)).$$
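The Monte Carlo recipe (5.1.4) described above can be sketched in a few lines. This is only an illustration under assumptions that are not in the text: $F$ is taken to be $N(0, 1)$, the loss is squared error, and the helper name `monte_carlo_risk` and the values of $n$, $B$, and the seed are hypothetical choices.

```python
import random
import statistics

def monte_carlo_risk(draw, estimator, target, n, B, seed=0):
    """Approximate R_n(F) = E_F (estimator(X_1..X_n) - target)^2 as in (5.1.4)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        sample = [draw(rng) for _ in range(n)]       # one "sample" of size n from F
        total += (estimator(sample) - target) ** 2   # squared-error loss for this sample
    return total / B                                 # average loss over the B samples

# Assumed setting: F = N(0, 1), so the mean and the median of F are both 0.
def draw_normal(rng):
    return rng.gauss(0.0, 1.0)

n, B = 25, 4000
risk_mean = monte_carlo_risk(draw_normal, statistics.fmean, 0.0, n, B)
risk_median = monte_carlo_risk(draw_normal, statistics.median, 0.0, n, B)

print(risk_mean, 1.0 / n)   # Monte Carlo value next to the exact MSE sigma^2/n of (5.1.1)
print(risk_median)          # Monte Carlo approximation to the integral (5.1.2)
```

For the mean, the Monte Carlo value can be compared with the exact formula (5.1.1); for the median there is no equally simple closed form, which is the situation in which such a simulation is most useful.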
In theory these limits say nothing about any particular $T_n(X_1, \ldots, X_n)$ but in practice we act as if they do because the $T_n(X_1, \ldots, X_n)$ we consider are closely related as functions of $n$ so that we expect the limit to approximate $T_n(X_1, \ldots, X_n)$ or $\mathcal{L}_F(T_n(X_1, \ldots, X_n))$ (in an appropriate sense). For instance, the weak law of large numbers tells us that, if $E_F|X_1| < \infty$, then

$$\bar X_n \stackrel{P}{\to} \mu \equiv E_F(X_1). \qquad (5.1.5)$$

That is, (see A.14.1)

$$P_F[|\bar X_n - \mu| \ge \epsilon] \to 0 \qquad (5.1.6)$$

for all $\epsilon > 0$. We interpret this as saying that, for $n$ sufficiently large, $\bar X_n$ is approximately equal to its expectation. The trouble is that for any specified degree of approximation, say, $\epsilon = .01$, (5.1.6) does not tell us how large $n$ has to be for the chance of the approximation not holding to this degree (the left-hand side of (5.1.6)) to fall, say, below .01. Is $n \ge 100$ enough or does it have to be $n \ge 100{,}000$? Similarly, the central limit theorem tells us that if $E_F X_1^2 < \infty$, $\mu$ is as above and $\sigma^2 \equiv \mathrm{Var}_F(X_1)$, then

$$P_F\left[\sqrt{n}\,\frac{(\bar X_n - \mu)}{\sigma} \le z\right] \to \Phi(z) \qquad (5.1.7)$$

where $\Phi$ is the standard normal d.f. As an approximation, this reads

$$P_F[\bar X_n \le x] \approx \Phi\left(\sqrt{n}\,\frac{(x - \mu)}{\sigma}\right). \qquad (5.1.8)$$

Again we are faced with the questions of how good the approximation is for given $n$, $x$, and $P_F$. What we in principle prefer are bounds, which are available in the classical situations of (5.1.6) and (5.1.7). Thus, by Chebychev's inequality, if $E_F X_1^2 < \infty$,

$$P_F[|\bar X_n - \mu| \ge \epsilon] \le \frac{\sigma^2}{n\epsilon^2}. \qquad (5.1.9)$$

As a bound this is typically far too conservative. For instance, if $|X_1| \le 1$, the much more delicate Hoeffding bound (B.9.6) gives

$$P_F[|\bar X_n - \mu| \ge \epsilon] \le 2\exp\left\{-\tfrac12 n\epsilon^2\right\}. \qquad (5.1.10)$$

Because $|X_1| \le 1$ implies that $\sigma^2 \le 1$ with $\sigma^2 = 1$ possible (Problem 5.1.3), the right-hand side of (5.1.9) when $\sigma^2$ is unknown becomes $1/n\epsilon^2$. For $\epsilon = .1$, $n = 400$, (5.1.9) is .25 whereas (5.1.10) is .14.

Further qualitative features of these bounds and relations to approximation (5.1.8) are given in Problem 5.1.4. Similarly, the celebrated Berry-Esseen bound (A.15.11) states that if $E_F|X_1|^3 < \infty$,

$$\sup_x \left| P_F\left[\sqrt{n}\,\frac{(\bar X_n - \mu)}{\sigma} \le x\right] - \Phi(x) \right| \le \frac{C\, E_F|X_1|^3}{\sigma^3 n^{1/2}} \qquad (5.1.11)$$
where $C$ is a universal constant known to be $\le 33/4$. Although giving us some idea of how much (5.1.8) differs from the truth, (5.1.11) is again much too conservative generally.(1) The approximation (5.1.8) is typically much better than (5.1.11) suggests.

Bounds for the goodness of approximations have been available for $\bar X_n$ and its distribution to a much greater extent than for nonlinear statistics such as the median. Yet, as we have seen, even here they are not a very reliable guide. Practically one proceeds as follows:

(a) Asymptotic approximations are derived.

(b) Their validity for the given $n$ and $T_n$ for some plausible values of $F$ is tested by numerical integration if possible or Monte Carlo computation.

If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown $F$ generating the data may not be as good.

Asymptotics has another important function beyond suggesting numerical approximations for specific $n$ and $F$. If they are simple, asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate. For instance, (5.1.7) says that the behavior of the distribution of $\bar X_n$ is for large $n$ governed (approximately) only by $\mu$ and $\sigma^2$ in a precise way, although the actual distribution depends on $P_F$ in a complicated way. It suggests that qualitatively the risk of $\bar X_n$ as an estimate of $\mu$, for any loss function of the form $l(F, d) = \lambda(|\mu - d|)$ where $\lambda(0) = 0$, $\lambda'(0) > 0$, behaves like $\lambda'(0)(\sigma/\sqrt{n})\sqrt{2/\pi}$ (Problem 5.1.5) and quite generally that risk increases with $\sigma$ and decreases with $n$, which is reasonable.

As we shall see, quite generally, good estimates $\hat\theta_n$ of parameters $\theta(F)$ will behave like $\bar X_n$ does in relation to $\mu$. The estimates $\hat\theta_n$ will be consistent, $\hat\theta_n \stackrel{P}{\to} \theta(F)$ for all $F$ in the model, and asymptotically normal,

$$\mathcal{L}_F\left(\frac{\sqrt{n}[\hat\theta_n - \theta(F)]}{\sigma(\theta, F)}\right) \to N(0, 1) \qquad (5.1.12)$$

where $\sigma(\theta, F)$ typically is the standard deviation (SD) of $\sqrt{n}\,\hat\theta_n$ or an approximation to this SD. Consistency will be pursued in Section 5.2 and asymptotic normality via the delta method in Section 5.3. The qualitative implications of results such as these are very important when we consider comparisons between competing procedures. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo.

We now turn to specifics. Section 5.2 deals with consistency of various estimates including maximum likelihood. The arguments apply to vector-valued estimates of Euclidean parameters. In particular, consistency is proved for the estimates of canonical parameters in exponential families. Section 5.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and includes as an application asymptotic normality of the maximum likelihood estimate for one-parameter exponential families. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE $\hat\eta$ of the canonical parameter $\eta$
in exponential families among other results. Section 5.4 deals with optimality results for likelihood-based procedures in one-dimensional parameter models. Finally in Section 5.5 we examine the asymptotic behavior of Bayes procedures. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A.14, A.15, and B.7. We will recall relevant definitions from that appendix as we need them, but we shall use results we need from A.14, A.15, and B.7 without further discussion.

Summary. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to $\infty$. In practice, asymptotics are methods of approximating risks, distributions, and other statistical quantities that are not realistically computable in closed form, by quantities that can be so computed. Most asymptotic theory we consider leads to approximations that in the i.i.d. case become increasingly valid as the sample size increases. We also introduce Monte Carlo methods and discuss the interaction of asymptotics, Monte Carlo, and probability bounds.

5.2 CONSISTENCY

5.2.1 Plug-In Estimates and MLEs in Exponential Family Models

Suppose that we have a sample $X_1, \ldots, X_n$ from $P_\theta$ where $\theta \in \Theta$ and want to estimate a real or vector $q(\theta)$. The least we can ask of our estimate $\hat q_n(X_1, \ldots, X_n)$ is that as $n \to \infty$, $\hat q_n \stackrel{P_\theta}{\to} q(\theta)$ for all $\theta$. That is, in accordance with (A.14.1) and (B.7.1), for all $\theta \in \Theta$, $\epsilon > 0$,

$$P_\theta[|\hat q_n - q(\theta)| \ge \epsilon] \to 0 \qquad (5.2.1)$$

where $|\cdot|$ denotes Euclidean distance. A stronger requirement is

$$\sup_\theta \left\{ P_\theta[|\hat q_n - q(\theta)| \ge \epsilon] : \theta \in \Theta \right\} \to 0. \qquad (5.2.2)$$

Bounds $b(n, \epsilon)$ for $\sup_\theta P_\theta[|\hat q_n - q(\theta)| \ge \epsilon]$ that yield (5.2.2) are preferable and we shall indicate some of qualitative interest when we can. But, with all the caveats of Section 5.1, (5.2.1), which is called consistency of $\hat q_n$ and can be thought of as 0'th order asymptotics, remains central to all asymptotic theory. The stronger statement (5.2.2) is called uniform consistency. If $\Theta$ is replaced by a smaller set $K$, we talk of uniform consistency over $K$.

Example 5.2.1. Means. The simplest example of consistency is that of the mean. If $X_1, \ldots, X_n$ are i.i.d. $P$ where $P$ is unknown but $E_P|X_1| < \infty$ then, by the WLLN,

$$\bar X \stackrel{P}{\to} \mu(P) \equiv E(X_1)$$

and $\mu(\hat P) = \bar X$, where $\hat P$ is the empirical distribution, is a consistent estimate of $\mu(P)$. For $\mathcal{P}$ this large it is not uniformly consistent. (See Problem 5.2.2.) However, if, for
instance, $\mathcal{P} = \{P : E_P X_1^2 \le M < \infty\}$, then $\bar X$ is uniformly consistent over $\mathcal{P}$ because by Chebyshev's inequality, for all $P \in \mathcal{P}$,

$$P[|\bar X - \mu(P)| \ge \epsilon] \le \frac{\mathrm{Var}(\bar X)}{\epsilon^2} \le \frac{M}{n\epsilon^2}. \qquad \Box$$

Example 5.2.2. Binomial Variance. Let $X_1, \ldots, X_n$ be the indicators of binomial trials with $P[X_1 = 1] = p$. Then $N = \sum X_i$ has a $B(n, p)$ distribution, $0 \le p \le 1$, and $\hat p = \bar X = N/n$ is a uniformly consistent estimate of $p$. But further, consider the plug-in estimate $\hat p(1 - \hat p)/n$ of the variance of $\hat p$, which is $\frac{1}{n} q(\hat p)$, where $q(p) = p(1 - p)$. Evidently, by A.14.6, $q(\hat p)$ is consistent. Other moments of $X_1$ can be consistently estimated in the same way. $\Box$

To some extent the plug-in method was justified by consistency considerations and it is not surprising that consistency holds quite generally for frequency plug-in estimates.

Theorem 5.2.1. Suppose that $\mathcal{P} = \mathcal{S} = \{(p_1, \ldots, p_k) : 0 \le p_j \le 1, 1 \le j \le k, \sum_{j=1}^k p_j = 1\}$, the $k$-dimensional simplex, where $p_j = P[X_1 = x_j]$, $1 \le j \le k$, and $\{x_1, \ldots, x_k\}$ is the range of $X_1$. Let $N_j \equiv \sum_{i=1}^n 1(X_i = x_j)$ and $\hat p_j \equiv N_j/n$, $\hat p_n \equiv (\hat p_1, \ldots, \hat p_k) \in \mathcal{S}$ be the empirical distribution. Suppose that $q : \mathcal{S} \to R^p$ is continuous. Then $\hat q_n \equiv q(\hat p_n)$ is a uniformly consistent estimate of $q(p)$.

Proof. By the weak law of large numbers for all $p$, $\delta > 0$,

$$P_p[|\hat p_n - p| \ge \delta] \to 0.$$

Because $q$ is continuous and $\mathcal{S}$ is compact, it is uniformly continuous on $\mathcal{S}$. Thus, for every $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that $p, p' \in \mathcal{S}$, $|p' - p| \le \delta(\epsilon)$, implies $|q(p') - q(p)| \le \epsilon$. Then

$$P_p[|\hat q_n - q| \ge \epsilon] \le P_p[|\hat p_n - p| \ge \delta(\epsilon)].$$

But, $\sup\{P_p[|\hat p_n - p| \ge \delta] : p \in \mathcal{S}\} \le k/4n\delta^2$ (Problem 5.2.1) and the result follows. $\Box$

In fact, in this case, we can go further. Suppose the modulus of continuity of $q$, $\omega(q, \delta)$, is defined by

$$\omega(q, \delta) = \sup\{|q(p) - q(p')| : |p - p'| \le \delta\}. \qquad (5.2.3)$$

Evidently, $\omega(q, \cdot)$ is increasing in $\delta$ and has the range $[a, b)$ say. If $q$ is continuous $\omega(q, \delta) \downarrow 0$ as $\delta \downarrow 0$. Let $\omega^{-1} : [a, b] \to R^+$ be defined as the inverse of $\omega$,

$$\omega^{-1}(\epsilon) = \inf\{\delta : \omega(q, \delta) \ge \epsilon\}. \qquad (5.2.4)$$

It easily follows (Problem 5.2.3) that

$$\sup\{P[|\hat q_n - q(p)| \ge \epsilon] : p \in \mathcal{P}\} \le k n^{-1}[\omega^{-1}(\epsilon)]^{-2}/4. \qquad (5.2.5)$$

A simple and important result for the case in which $X_1, \ldots, X_n$ are i.i.d. with $X_i \in \mathcal{X}$ is the following:
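Proposition 5.2.1. Let $g \equiv (g_1, \ldots, g_d)$ map $\mathcal{X}$ onto $\mathcal{Y} \subset R^d$. Suppose $E_\theta|g_j(X_1)| < \infty$, $1 \le j \le d$, for all $\theta$; let $m_j(\theta) \equiv E_\theta g_j(X_1)$, $1 \le j \le d$, and let $q(\theta) = h(m(\theta))$, where $h : \mathcal{Y} \to R^p$. Then, if $h$ is continuous,

$$\hat q \equiv h(\bar g) \equiv h\left(\frac{1}{n}\sum_{i=1}^n g(X_i)\right)$$

is a consistent estimate of $q(\theta)$. More generally if $\nu(P) = h(E_P g(X_1))$ and $\mathcal{P} = \{P : E_P|g(X_1)| < \infty\}$, then $\nu(\hat P) \equiv h(\bar g)$, where $\hat P$ is the empirical distribution, is consistent for $\nu(P)$.

Proof. We need only apply the general weak law of large numbers (for vectors) to conclude that

$$\frac{1}{n}\sum_{i=1}^n g(X_i) \stackrel{P}{\to} E_P g(X_1) \qquad (5.2.6)$$

if $E_P|g(X_1)| < \infty$. For consistency of $h(\bar g)$ apply Proposition B.7.1: $U_n \stackrel{P}{\to} U$ implies that $h(U_n) \stackrel{P}{\to} h(U)$ for all continuous $h$. $\Box$

Example 5.2.3. Variances and Correlations. Let $X_i = (U_i, V_i)$, $1 \le i \le n$, be i.i.d. $N_2(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, $\sigma_i^2 > 0$, $|\rho| < 1$. Let $g(u, v) = (u, v, u^2, v^2, uv)$ so that $\sum_{i=1}^n g(U_i, V_i)$ is the statistic generating this 5-parameter exponential family. If we let $\theta \equiv (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then

$$m(\theta) = (\mu_1, \mu_2, \sigma_1^2 + \mu_1^2, \sigma_2^2 + \mu_2^2, \rho\sigma_1\sigma_2 + \mu_1\mu_2).$$

If $h = m^{-1}$, then

$$h(m_1, \ldots, m_5) = \left(m_1, m_2, m_3 - m_1^2, m_4 - m_2^2, (m_5 - m_1 m_2)(m_3 - m_1^2)^{-1/2}(m_4 - m_2^2)^{-1/2}\right),$$

which is well defined and continuous at all points of the range of $m$. We may, thus, conclude by Proposition 5.2.1 that the empirical means, variances, and correlation coefficient are all consistent. Questions of uniform consistency and consistency when $\mathcal{P} = \{$Distributions such that $EU_1^2 < \infty$, $EV_1^2 < \infty$, $\mathrm{Var}(U_1) > 0$, $\mathrm{Var}(V_1) > 0$, $|\mathrm{Corr}(U_1, V_1)| < 1\}$ are discussed in Problem 5.2.4. $\Box$

Here is a general consequence of Proposition 5.2.1 and Theorem 2.3.1.

Theorem 5.2.2. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$. Let $\eta$, $\mathcal{E}$ and $A(\cdot)$ correspond to $\mathcal{P}$ as in Section 1.6. Suppose $\mathcal{E}$ is open. Then, if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$,

(i) $P_\eta[\text{The MLE } \hat\eta \text{ exists}] \to 1$.

(ii) $\hat\eta$ is consistent.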
Proof. Recall from Corollary 2.3.1 to Theorem 2.3.1 that $\hat\eta(X_1, \ldots, X_n)$ exists iff $\frac{1}{n}\sum_{i=1}^n T(X_i) = \bar T_n$ belongs to the interior $C_T^\circ$ of the convex support of the distribution of $\bar T_n$. Note that, if $\eta_0$ is true, $E_{\eta_0}(T(X_1))$ must by Theorem 2.3.1 belong to the interior of the convex support because the equation $\dot A(\eta) = t_0$, where $t_0 = \dot A(\eta_0) = E_{\eta_0} T(X_1)$, is solved by $\eta_0$. By definition of the interior of the convex support there exists a ball $S_\delta \equiv \{t : |t - E_{\eta_0} T(X_1)| < \delta\} \subset C_T^\circ$. By the law of large numbers,

$$P_{\eta_0}\left[\frac{1}{n}\sum_{i=1}^n T(X_i) \in C_T^\circ\right] \to 1. \qquad (5.2.7)$$

But $\hat\eta$, which solves

$$\dot A(\eta) = \frac{1}{n}\sum_{i=1}^n T(X_i),$$

exists iff the event in (5.2.7) occurs and (i) follows. We showed in Theorem 2.3.1 that on $C_T^\circ$ the map $\eta \to \dot A(\eta)$ is 1-1 and continuous on $\mathcal{E}$. By a classical result, see, for example, Rudin (1987), the inverse $\dot A^{-1} : \dot A(\mathcal{E}) \to \mathcal{E}$ is continuous on $S_\delta$ and the result follows from Proposition 5.2.1. $\Box$

5.2.2 Consistency of Minimum Contrast Estimates

The argument of the previous subsection in which a minimum contrast estimate, the MLE, is a continuous function of a mean of i.i.d. vectors evidently used exponential family properties. A more general argument is given in the following simple theorem whose conditions are hard to check. Let $X_1, \ldots, X_n$ be i.i.d. $P_\theta$, $\theta \in \Theta \subset R^d$. Let $\hat\theta$ be a minimum contrast estimate that minimizes

$$\rho_n(X, \theta) = \frac{1}{n}\sum_{i=1}^n \rho(X_i, \theta)$$

where, as usual, $D(\theta_0, \theta) \equiv E_{\theta_0}\rho(X_1, \theta)$ is uniquely minimized at $\theta_0$ for all $\theta_0 \in \Theta$.

Theorem 5.2.3. Suppose

$$\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n [\rho(X_i, \theta) - D(\theta_0, \theta)]\right| : \theta \in \Theta\right\} \stackrel{P_{\theta_0}}{\to} 0 \qquad (5.2.8)$$

and

$$\inf\{D(\theta_0, \theta) : |\theta - \theta_0| \ge \epsilon\} > D(\theta_0, \theta_0) \text{ for every } \epsilon > 0. \qquad (5.2.9)$$

Then $\hat\theta$ is consistent.
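Proof. Note that,

$$P_{\theta_0}[|\hat\theta - \theta_0| \ge \epsilon] \le P_{\theta_0}\left[\inf\left\{\frac{1}{n}\sum_{i=1}^n [\rho(X_i, \theta) - \rho(X_i, \theta_0)] : |\theta - \theta_0| \ge \epsilon\right\} \le 0\right]. \qquad (5.2.10)$$

By hypothesis, for all $\delta > 0$,

$$P_{\theta_0}\left[\inf\left\{\frac{1}{n}\sum_{i=1}^n (\rho(X_i, \theta) - \rho(X_i, \theta_0)) : |\theta - \theta_0| \ge \epsilon\right\} - \inf\{D(\theta_0, \theta) - D(\theta_0, \theta_0) : |\theta - \theta_0| \ge \epsilon\} < -\delta\right] \to 0 \qquad (5.2.11)$$

because the event in (5.2.11) implies that

$$\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n [\rho(X_i, \theta) - D(\theta_0, \theta)]\right| : \theta \in \Theta\right\} > \frac{\delta}{2}, \qquad (5.2.12)$$

which has probability tending to 0 by (5.2.8). But for $\epsilon > 0$ let

$$\delta = \tfrac14 \inf\{D(\theta_0, \theta) - D(\theta_0, \theta_0) : |\theta - \theta_0| \ge \epsilon\}.$$

Then (5.2.11) implies that the right-hand side of (5.2.10) tends to 0. $\Box$

A simple and important special case is given by the following.

Corollary 5.2.1. If $\Theta$ is finite, $\Theta = \{\theta_1, \ldots, \theta_d\}$, $E_{\theta_0}|\log p(X_1, \theta)| < \infty$ and the parameterization is identifiable, then, if $\hat\theta$ is the MLE, $P_{\theta_j}[\hat\theta \ne \theta_j] \to 0$ for all $j$.

Proof. Note that for some $\epsilon > 0$,

$$P_{\theta_0}[\hat\theta \ne \theta_0] = P_{\theta_0}[|\hat\theta - \theta_0| \ge \epsilon]. \qquad (5.2.13)$$

By Shannon's Lemma 2.2.1 we need only check that (5.2.8) and (5.2.9) hold for $\rho(x, \theta) = -\log p(x, \theta)$. But because $\Theta$ is finite, (5.2.8) follows from the WLLN and

$$P_{\theta_0}\left[\max\left\{\left|\frac{1}{n}\sum_{i=1}^n (\rho(X_i, \theta_j) - D(\theta_0, \theta_j))\right| : 1 \le j \le d\right\} \ge \epsilon\right] \le d \max_j P_{\theta_0}\left[\left|\frac{1}{n}\sum_{i=1}^n (\rho(X_i, \theta_j) - D(\theta_0, \theta_j))\right| \ge \epsilon\right] \to 0,$$

and (5.2.9) follows from Shannon's lemma. $\Box$

Condition (5.2.8) can often fail; see Problem 5.2.5. An alternative condition that is readily seen to work more widely is the replacement of (5.2.8) by

(i) For all compact $K \subset \Theta$,

$$\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n (\rho(X_i, \theta) - D(\theta_0, \theta))\right| : \theta \in K\right\} \stackrel{P_{\theta_0}}{\to} 0.$$

(ii) For some compact $K \subset \Theta$,

$$P_{\theta_0}\left[\inf\left\{\frac{1}{n}\sum_{i=1}^n (\rho(X_i, \theta) - \rho(X_i, \theta_0)) : \theta \in K^c\right\} > 0\right] \to 1. \qquad (5.2.14)$$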
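The conclusion of Corollary 5.2.1 can be checked exactly in a small case. The sketch below assumes a Bernoulli($\theta$) model with the three-point parameter set $\{0.3, 0.5, 0.7\}$ and true value $\theta_0 = 0.5$; the model, the parameter set, and the function names are illustrative choices, not taken from the text. The contrast is $\rho(x, \theta) = -\log p(x, \theta)$, and $P_{\theta_0}[\hat\theta \ne \theta_0]$ is computed by summing the binomial probabilities of the outcomes on which the MLE misses $\theta_0$.

```python
import math

def mle_finite(k, n, thetas):
    """MLE over a finite parameter set, given k successes in n Bernoulli trials."""
    def neg_loglik(theta):
        # n * rho_n(X, theta) with rho(x, theta) = -log p(x, theta)
        return -(k * math.log(theta) + (n - k) * math.log(1.0 - theta))
    return min(thetas, key=neg_loglik)

def error_probability(theta0, n, thetas):
    """Exact P_theta0[MLE != theta0], summing the binomial pmf over k."""
    total = 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * theta0 ** k * (1.0 - theta0) ** (n - k)
        if mle_finite(k, n, thetas) != theta0:
            total += pmf
    return total

thetas = [0.3, 0.5, 0.7]   # hypothetical finite parameter set
errors = [error_probability(0.5, n, thetas) for n in (10, 40, 160)]
print(errors)              # the error probabilities shrink as n grows
```

Because the computation is exact rather than simulated, the decrease in the error probability with $n$ is not subject to Monte Carlo noise.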
.i. consistency of the MLE may fail if the number of parameters tends to infinity. Unfortunately checking conditions such as (5.14) is in general difficult. > 2. m Mi) and assume (b) IIh(~)lloo j 1 .1. =suPx 1k<~)(x)1 <M < .AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued. Sufficient conditions are explored in the problems. then = h(ll) + L (j)( ) j=l h . ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. 5.Il E(X Il).306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems. Unifonn consistency for l' requires more. I ~. X valued and for the moment take X = R. Let h : R ~ R. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families.Xn be i. Summary.d.33. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean.3.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. We introduce the minimal property we require of any estimate (strictly speak ing.1) where • . . If fin is an estimate of B(P).) = <7 2 Eh(X) We have the following. in Section 5. let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn.. see Problem 5..B(P)I > Ej : PEP} t 0 for all € > O. If (i) and (ii) hold.3. and assume (i) (a) h is m times differentiable on R.2.3.2. . We denote the jth derivative of h by 00 .3. D . VariX. As usual let Xl. ~ 5.1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders.. When the observations are independent but not identically distributed. .3 FIRST. + Rm (5. 
we require that On ~ B(P) as n ~ 00. We then sketch the extension to functions of vector means. ~l Theorem 5. that sup{PIiBn . sequence of estimates) consistency.8) and (5. J.
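As a numerical sanity check of the expansion (5.3.1), consider a case where the remainder vanishes. The choices here are illustrative and not from the text: $X_i$ are Bernoulli($p$) and $h(t) = t^3$, so that $h^{(4)} \equiv 0$ and $R_4 = 0$ when $m = 4$. The sketch uses the standard facts $E(\bar X - \mu) = 0$, $E(\bar X - \mu)^2 = \sigma^2/n$, and $E(\bar X - \mu)^3 = \mu_3/n^2$, where $\mu_3$ is the third central moment of $X_1$; the function names are hypothetical.

```python
import math

def exact_mean_of_h(h, n, p):
    """E h(X-bar) for n Bernoulli(p) trials, summing h(k/n) over the binomial pmf."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k) * h(k / n)
               for k in range(n + 1))

def expansion_cubic(n, p):
    """Right-hand side of (5.3.1) for h(t) = t^3 and m = 4, where R_4 = 0."""
    mu = p
    var = p * (1 - p)                    # sigma^2
    mu3 = p * (1 - p) * (1 - 2 * p)      # third central moment of X_1
    # h(mu) = mu^3, h''(mu)/2! = 3*mu, h'''(mu)/3! = 1; the j = 1 term is zero.
    return mu ** 3 + 3 * mu * var / n + mu3 / n ** 2

n, p = 20, 0.3
exact = exact_mean_of_h(lambda t: t ** 3, n, p)
approx = expansion_cubic(n, p)
print(exact, approx)   # the two values agree up to floating-point rounding
```

For a general smooth $h$ the two sides would differ by the remainder $R_m$, which is the quantity the theorem bounds.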
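The proof is an immediate consequence of Taylor's expansion,

$$h(\bar X) = h(\mu) + \sum_{k=1}^{m-1} \frac{h^{(k)}(\mu)}{k!}(\bar X - \mu)^k + \frac{h^{(m)}(\bar X^*)}{m!}(\bar X - \mu)^m \qquad (5.3.2)$$

where $|\bar X^* - \mu| \le |\bar X - \mu|$, and the following lemma.

Lemma 5.3.1. If $E|X_1|^j < \infty$, $j \ge 2$, then there are constants $C_j > 0$ and $D_j > 0$ such that

$$E|\bar X - \mu|^j \le C_j E|X_1|^j n^{-j/2} \qquad (5.3.3)$$

$$|E(\bar X - \mu)^j| \le D_j E|X_1|^j n^{-(j+1)/2}, \quad j \text{ odd}. \qquad (5.3.4)$$

Note that for $j$ even, $E|\bar X - \mu|^j = E(\bar X - \mu)^j$.

Proof. We give the proof of (5.3.4) for all $j$ and (5.3.3) for $j$ even. The more difficult argument needed for (5.3.3) and $j$ odd is given in Problem 5.3.2.

Let $\mu = E(X_1) = 0$, then

(a) $E(\bar X^j) = n^{-j} E\left(\sum_{i=1}^n X_i\right)^j = n^{-j} \sum_{1 \le i_1, \ldots, i_j \le n} E(X_{i_1} \cdots X_{i_j})$

But $E(X_{i_1} \cdots X_{i_j}) = 0$ unless each integer that appears among $\{i_1, \ldots, i_j\}$ appears at least twice. Moreover,

(b) $\sup_{i_1, \ldots, i_j} |E(X_{i_1} \cdots X_{i_j})| = E|X_1|^j$

by Problem 5.3.5, so the number $d$ of nonzero terms in (a) is

(c) $\displaystyle \sum_{r=1}^{[j/2]} \binom{n}{r} \sum \left\{ \frac{j!}{i_1! \cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2 \text{ all } k \right\}$

where $[t]$ denotes the greatest integer $\le t$. The expression in (c) is, for $j \le n/2$, bounded by

(d) $\displaystyle \frac{C_j}{[j/2]!}\, n(n-1) \cdots (n - [j/2] + 1)$

where

$$C_j = [j/2] \max_{1 \le r \le [j/2]} \left\{ \sum \left\{ \frac{j!}{i_1! \cdots i_r!} : i_1 + \cdots + i_r = j,\ i_k \ge 2,\ 1 \le k \le r \right\} \right\}.$$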
But

(e) $n^{-j}\, n(n-1) \cdots (n - [j/2] + 1) \le n^{[j/2] - j}$

and (c), (d), and (e) applied to (a) imply (5.3.4) for $j$ odd and (5.3.3) for $j$ even, if $\mu = 0$. In general by considering $X_i - \mu$ as our basic variables we obtain the lemma but with $E|X_1|^j$ replaced by $E|X_1 - \mu|^j$. By Problem 5.3.6, $E|X_1 - \mu|^j \le 2^j E|X_1|^j$ and the lemma follows. $\Box$

The two most important corollaries of Theorem 5.3.1, respectively, give approximations to the bias of $h(\bar X)$ as an estimate of $h(\mu)$ and its variance and MSE.

Corollary 5.3.1.

(a) If $E|X_1|^3 < \infty$ and $\|h^{(3)}\|_\infty < \infty$, then

$$Eh(\bar X) = h(\mu) + \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-3/2}). \qquad (5.3.5)$$

(b) If $E(X_1^4) < \infty$ and $\|h^{(4)}\|_\infty < \infty$, then $O(n^{-3/2})$ in (5.3.5) can be replaced by $O(n^{-2})$.

Proof. For (5.3.5) apply Theorem 5.3.1 with $m = 3$. Because $E(\bar X - \mu)^2 = \sigma^2/n$, (5.3.5) follows. If the conditions of (b) hold, apply Theorem 5.3.1 with $m = 4$. Then $R_m = O(n^{-2})$ and also $E(\bar X - \mu)^3 = O(n^{-2})$ by (5.3.4). $\Box$

Corollary 5.3.2. If

(a) $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $E|X_1|^3 < \infty$, then

$$\mathrm{Var}\, h(\bar X) = \frac{\sigma^2 [h^{(1)}(\mu)]^2}{n} + O(n^{-3/2}). \qquad (5.3.6)$$

(b) If $\|h^{(j)}\|_\infty < \infty$, $1 \le j \le 3$, and $EX_1^4 < \infty$, then $O(n^{-3/2})$ in (5.3.6) can be replaced by $O(n^{-2})$.

Proof. (a) Write

$$Eh^2(\bar X) = h^2(\mu) + 2h(\mu)h^{(1)}(\mu)E(\bar X - \mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}]^2(\mu)\}E(\bar X - \mu)^2 + \tfrac16 E[h^2]^{(3)}(\bar X^*)(\bar X - \mu)^3$$

$$= h^2(\mu) + \{h^{(2)}(\mu)h(\mu) + [h^{(1)}]^2(\mu)\}\frac{\sigma^2}{n} + O(n^{-3/2}).$$

(b) Next, using Corollary 5.3.1,

$$[Eh(\bar X)]^2 = \left(h(\mu) + \frac{h^{(2)}(\mu)\sigma^2}{2n} + O(n^{-3/2})\right)^2 = h^2(\mu) + h(\mu)h^{(2)}(\mu)\frac{\sigma^2}{n} + O(n^{-3/2}).$$

Subtracting (b) from (a) we get (5.3.6). To get part (b) we need to expand $Eh^2(\bar X)$ to four terms and similarly apply the appropriate form of (5.3.5). $\Box$

Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.3.1 with bounds on the remainders.

Note an important qualitative feature revealed by these approximations. If $h(\bar X)$ is viewed, as we normally would, as the plug-in estimate of the parameter $h(\mu)$ then, for large $n$, the bias of $h(\bar X)$ defined by $Eh(\bar X) - h(\mu)$ is $O(n^{-1})$, which is negligible compared to the standard deviation of $h(\bar X)$, which is $O(n^{-1/2})$ unless $h^{(1)}(\mu) = 0$. A qualitatively simple explanation of this important phenomenon will be given in Theorem 5.3.3.

Example 5.3.1. If $X_1, \ldots, X_n$ are i.i.d. $\mathcal{E}(\lambda)$ the MLE of $\lambda$ is $\bar X^{-1}$. If the $X_i$ represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours, then we may be interested in the warranty failure probability

$$P_\lambda[X_1 \le 2] = 1 - e^{-2\lambda}. \qquad (5.3.7)$$

If $h(t) = 1 - \exp(-2/t)$, then $h(\bar X)$ is the MLE of $1 - \exp(-2\lambda) = h(\mu)$, where $\mu = E_\lambda X_1 = 1/\lambda$.

We can use the two corollaries to compute asymptotic approximations to the means and variance of $h(\bar X)$. Thus, by Corollary 5.3.1,

$$\mathrm{Bias}_\lambda(h(\bar X)) \equiv E_\lambda(h(\bar X) - h(\mu)) = \frac{h^{(2)}(\mu)}{2}\frac{\sigma^2}{n} + O(n^{-2}) = 2e^{-2\lambda}\lambda(1 - \lambda)/n + O(n^{-2}) \qquad (5.3.8)$$

because $h^{(2)}(t) = 4(t^{-3} - t^{-4})\exp(-2/t)$, $\sigma^2 = 1/\lambda^2$, and, by Corollary 5.3.2 (Problem 5.3.1),

$$\mathrm{Var}_\lambda\, h(\bar X) = 4\lambda^2 e^{-4\lambda}/n + O(n^{-2}). \qquad (5.3.9) \quad \Box$$

Further expansion can be done to increase precision of the approximation to $\mathrm{Var}\, h(\bar X)$ for large $n$. Thus, by expanding $Eh^2(\bar X)$ and $Eh(\bar X)$ to six terms we obtain the approximation

$$\mathrm{Var}(h(\bar X)) = \frac{1}{n}[h^{(1)}(\mu)]^2\sigma^2 + \frac{1}{n^2}\left\{h^{(1)}(\mu)h^{(2)}(\mu)\mu_3 + \tfrac12 [h^{(2)}(\mu)]^2\sigma^4\right\} + R_n' \qquad (5.3.10)$$

with $R_n'$ tending to zero at the rate $1/n^3$. Here $\mu_k$ denotes the $k$th central moment of $X_i$ and we have used the facts that (see Problem 5.3.3)

$$E(\bar X - \mu)^3 = \frac{\mu_3}{n^2}, \qquad E(\bar X - \mu)^4 = \frac{\mu_4}{n^3} + \frac{3(n-1)\sigma^4}{n^3}. \qquad (5.3.11)$$

Example 5.3.2. Bias and Variance of the MLE of the Binomial Variance. We will compare $E(h(\bar X))$ and $\mathrm{Var}\, h(\bar X)$ with their approximations, when $h(t) = t(1 - t)$ and $X_i \sim$
9d(Xi )f.j . • .p). . 1 and in this case (5. (5. D " ~ I .p) .5) yields E(h(X))  ~ p(1 .3.e x ) : i l + .amh i.{2(1.pl. 1 < j < { Xl . the error of approximation is 1 1 + . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi). asSume that h has continuous partial derivatives of order up to m.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2.p)(I..[Var(X) + (E(X»2] I nI ~ p(1 .2p)' . .310 Asymptotic Approximations Chapter 5 B(l. and will illustrate how accurate (5. . M2) ~ 2. a Xd d} ..2(1 .6p(l.3..10) is in a situation in which the approximation can be checked. + id = m. (5. Next compute Varh(X)=p(I.E(X') ~ p . " ! .2p)')} + R~. R'n p(1 ~ p) [(I . n n Because MII(t) = I . • ~ O(n..2p). First calculate Eh(X) = E(X) . .p) . II'.3.!c[2p(1 n p) .p) {(I _ 2p)2 n Thus. .P) {(1_2 P)2+ 2P(I..p(1 . ~ Varh(X) = (1.p)J n p(1 ~ p) [I .10) yields . .P )} (n_l)2 n nl n Because 1'3 = p(l.2t.p)'} + R~ p(l .3.2p)p(l.5) is exact as it should be.p) + . 0 < i. Let h : R d t R.2p)2p(1 ..p) n . < m.' . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i.2p(1 .3.p) = p ( l .p)] n I " .2. Theorem 5.3 ).2p) n n +21"(1 .p(1 . I .p)(I. D 1 i I .
3.8. (5.) + 2 g:'.2 ).) Cov(Y".5).3. is to m = 3.3.I' < 00.6).) )' Var(Y. Lemma 5. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.. as for the case d = 1.3..~ (J1. if EIY.»)'var(Y'2)] +O(n') Approximations (5.14) + (g:.11.) gx~ (J1. The proof is outlined in Problem 5.) + a.(J1.3..'.3. 1 < j < d where Y iJ I = 9j(Xi ).) Var(Y. B.".hUll)~.3.3. We get.~E(X.12) Var h(Y) ~ [(:. (5.3.. Y12 ) (5. The most interesting application.3 / 2 ) in (5.1. Similarly.3.2. and (5.. + ~ ~:l (J1. EI Yl13 < 00 Eh(Y) h(J1.)} + O(n (5. Y 12 ) 3/2).15) Then . Theorem 5.3. 5.))) ~ N(O.2 The Delta Method for In law Approximations = As usual we begin with d !.h(!. E xl < 00 and h is differentiable at (5.3.1 Then.13).) Cov(Y".Section 5.).3.x. for d = 2.3). Suppose thot X = R.4.) Var(Y. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.C( v'n(h(X) .14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . then O(n. (J1.13) Moreover.) + ~ {~~:~ (J1.3. and the appropriate generalization of Lemma 5. h : R ~ R. (7'(h)) where and (7' = VariX. 1. = EY 1 . by (5. The results in the next subsection go much further and "explain" the fonn of the approximations we already have.. then Eh(Y) This is a consequence of Taylor's expansion in d variables. (J1. under appropriate conditions (Problem 5.3.3. and J. 0/ The result follows from the more generally usefullernma.12) can be replaced by O(n. .and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.3 First.).
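(i) $a_n(U_n - u) \stackrel{\mathcal{L}}{\to} V$ for some constant $u$.

(ii) $g : R \to R$ is differentiable at $u$ with derivative $g^{(1)}(u)$.

Then

$$a_n(g(U_n) - g(u)) \stackrel{\mathcal{L}}{\to} g^{(1)}(u)V. \qquad (5.3.16)$$

Proof. By definition of the derivative, for every $\epsilon > 0$ there exists a $\delta > 0$ such that

(a) $|v - u| \le \delta \Rightarrow |g(v) - g(u) - g^{(1)}(u)(v - u)| \le \epsilon|v - u|$

Note that (i) $\Rightarrow$

(b) $a_n(U_n - u) = O_P(1)$

$\Rightarrow$

(c) $U_n - u = O_P(a_n^{-1}) = o_P(1)$.

Using (c), for every $\delta > 0$

(d) $P[|U_n - u| \le \delta] \to 1$

and, hence, from (a), for every $\epsilon > 0$,

(e) $P[|g(U_n) - g(u) - g^{(1)}(u)(U_n - u)| \le \epsilon|U_n - u|] \to 1$.

But (e) implies

(f) $a_n[g(U_n) - g(u) - g^{(1)}(u)(U_n - u)] = o_P(a_n(U_n - u)) = o_P(1)$

from (b). Therefore,

(g) $a_n[g(U_n) - g(u)] = g^{(1)}(u)a_n(U_n - u) + o_P(1)$.

But, by hypothesis, $a_n(U_n - u) \stackrel{\mathcal{L}}{\to} V$ and the result follows. $\Box$

The theorem follows from the central limit theorem letting $U_n = \bar X$, $a_n = n^{1/2}$, $u = \mu$, $V \sim N(0, \sigma^2)$. $\Box$

Note that (5.3.15) "explains" Lemma 5.3.1. Formally we expect that if $V_n \stackrel{\mathcal{L}}{\to} V$, then $EV_n^j \to EV^j$ (although this need not be true, see Problems 5.3.32 and B.7.8). Consider $V_n = \sqrt{n}(\bar X - \mu) \stackrel{\mathcal{L}}{\to} V \sim N(0, \sigma^2)$. Thus, we expect

$$E[\sqrt{n}(\bar X - \mu)]^j \to \sigma^j EZ^j \qquad (5.3.17)$$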
where $Z \sim N(0, 1)$. But if $j$ is even, $EZ^j > 0$, else $EZ^j = 0$. Then (5.3.17) yields

$$E(\bar X - \mu)^j = O(\sigma^j EZ^j n^{-j/2}) = O(n^{-j/2}), \quad j \text{ even}$$

$$= o(n^{-j/2}), \quad j \text{ odd}.$$

Example 5.3.3. "t" Statistics.

(a) The One-Sample Case. Let $X_1, \ldots, X_n$ be i.i.d. $F \in \mathcal{F}$ where $E_F(X_1) = \mu$, $\mathrm{Var}_F(X_1) = \sigma^2 < \infty$. A statistic for testing the hypothesis $H : \mu = 0$ versus $K : \mu > 0$ is

$$T_n = \sqrt{n}\,\frac{\bar X}{s}$$

where

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X)^2.$$

If $\mathcal{F} = \{$Gaussian distributions$\}$, we can obtain the critical value $t_{n-1}(1 - \alpha)$ for $T_n$ from the $\mathcal{T}_{n-1}$ distribution. In general we claim that if $F \in \mathcal{F}$ and $H$ is true, then

$$T_n \stackrel{\mathcal{L}}{\to} N(0, 1). \qquad (5.3.18)$$

In particular this implies not only that $t_{n-1}(1 - \alpha) \to z_{1-\alpha}$ but that the $t_{n-1}(1 - \alpha)$ critical value (or $z_{1-\alpha}$) is approximately correct if $H$ is true and $F$ is not Gaussian. For the proof note that by the central limit theorem,

$$U_n \equiv \sqrt{n}\,\frac{\bar X}{\sigma} \stackrel{\mathcal{L}}{\to} N(0, 1)$$

and

$$s_n^2 = \frac{n}{n - 1}\left(\frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2\right) \stackrel{P}{\to} \sigma^2$$

by Proposition 5.2.1 and Slutsky's theorem. Now Slutsky's theorem yields (5.3.18) because $T_n = U_n/(s_n/\sigma) = g(U_n, s_n/\sigma)$, where $g(u, v) = u/v$.

(b) The Two-Sample Case. Let $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ be two independent samples with $\mu_1 = E(X_1)$, $\sigma_1^2 = \mathrm{Var}(X_1)$, $\mu_2 = E(Y_1)$ and $\sigma_2^2 = \mathrm{Var}(Y_1)$. Consider testing $H : \mu_1 = \mu_2$ versus $K : \mu_2 > \mu_1$. In Example 4.9.3 we saw that the two sample $t$ statistic

$$S_n = \sqrt{\frac{n_1 n_2}{n}}\left(\frac{\bar Y - \bar X}{s}\right), \quad n = n_1 + n_2,$$

has a $\mathcal{T}_{n-2}$ distribution under $H$ when the $X$'s and $Y$'s are normal with $\sigma_1^2 = \sigma_2^2$. Using the central limit theorem, Slutsky's theorem, and the foregoing arguments, we find (Problem 5.3.28) that if $n_1/n \to \lambda$, $0 < \lambda < 1$, then

$$S_n \stackrel{\mathcal{L}}{\to} N\left(0, \frac{(1 - \lambda)\sigma_1^2 + \lambda\sigma_2^2}{\lambda\sigma_1^2 + (1 - \lambda)\sigma_2^2}\right).$$
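To see what this limit means for the level of the two-sample test, one can plug the limiting variance into the normal tail. The sketch below is illustrative: it uses the normal critical value $z_{1-\alpha}$ in place of $t_{n-2}(1 - \alpha)$ (the two agree in the limit), and the function names and numerical inputs are assumptions made for the example.

```python
from statistics import NormalDist

def limiting_variance(lam, var1, var2):
    """Variance of the limiting normal law of S_n when n1/n -> lam."""
    return ((1 - lam) * var1 + lam * var2) / (lam * var1 + (1 - lam) * var2)

def asymptotic_level(alpha, lam, var1, var2):
    """Limit of P_H[S_n >= z_{1-alpha}] when S_n -> N(0, tau^2) in law."""
    z = NormalDist().inv_cdf(1 - alpha)           # nominal critical value
    tau = limiting_variance(lam, var1, var2) ** 0.5
    return 1 - NormalDist().cdf(z / tau)          # P[tau * Z >= z]

alpha = 0.05
print(asymptotic_level(alpha, 0.5, 1.0, 9.0))     # equal sample sizes: level stays at alpha
print(asymptotic_level(alpha, 1 / 3, 1.0, 1.0))   # equal variances: level stays at alpha
print(asymptotic_level(alpha, 1 / 3, 1.0, 9.0))   # larger sample has larger variance: below alpha
print(asymptotic_level(alpha, 2 / 3, 1.0, 9.0))   # smaller sample has larger variance: above alpha
```

The limiting variance equals 1 exactly when $\lambda = 1/2$ or $\sigma_1^2 = \sigma_2^2$, which is why the nominal level is recovered in the first two calls and not in the last two.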
It follows that if $n_1 = n_2$ or $\sigma_1^2 = \sigma_2^2$, then the critical value $t_{n-2}(1 - \alpha)$ for $S_n$ is approximately correct if $H$ is true and the $X$'s and $Y$'s are not normal.

Monte Carlo Simulation

As mentioned in Section 5.1, approximations based on asymptotic results should be checked by Monte Carlo simulations. We illustrate such simulations for the preceding $t$ tests by generating data from the $\chi_d^2$ distribution $M$ times independently, each time computing the value of the $t$ statistics and then giving the proportion of times out of $M$ that the $t$ statistics exceed the critical values from the $t$ table. Here we use the $\chi_d^2$ distribution because for small to moderate $d$ it is quite different from the normal distribution. Other distributions should also be tried. Figure 5.3.1 shows that for the one-sample $t$ test, when $\alpha = 0.05$, the asymptotic result gives a good approximation when $n \ge 10^{1.5} \approx 32$, and the true distribution $F$ is $\chi_d^2$ with $d \ge 10$. The $\chi_2^2$ distribution is extremely skew, and in this case the $t_{n-1}(0.95)$ approximation is only good for $n \ge 10^{2.5} \approx 316$.

[Figure 5.3.1 (plot not reproduced). Panel title: One sample, 10000 simulations, chi-square data. Horizontal axis: log10 sample size.]

Figure 5.3.1. Each plotted point represents the results of 10,000 one-sample $t$ tests using $\chi_d^2$ data, where $d$ is either 2, 10, 20, or 50, as indicated in the plot. The simulations are repeated for different sample sizes and the observed significance levels are plotted.

For the two-sample $t$ tests, Figure 5.3.2 shows that when $\sigma_1^2 = \sigma_2^2$ and $n_1 = n_2$, the $t_{n-2}(1 - \alpha)$ critical value is a very good approximation even for small $n$ and for $X, Y \sim \chi_2^2$.
The two-sample approximation is accurate when $n_1 = n_2$ and the $X$'s and $Y$'s have the same distribution because, in this case, $\bar Y - \bar X = \frac{1}{n_1}\sum_{i=1}^{n_1}(Y_i - X_i)$, and the $Y_i - X_i$ have a symmetric distribution. Other Monte Carlo runs (not shown) with $\sigma_1^2 \ne \sigma_2^2$ show that as long as $n_1 = n_2$, the $t_{n-2}(0.95)$ approximation is good for $n_1 \ge 100$, even when the $X$'s and $Y$'s have different $\chi^2_2$ distributions, scaled to have the same means, and $\sigma_2^2 = 12\sigma_1^2$. Moreover, the $t_{n-2}(1-\alpha)$ approximation is good when $n_1 \ne n_2$ and $\sigma_1^2 = \sigma_2^2$. However, as we see from the limiting law of $S_n$ and Figure 5.3.3, when both $n_1 \ne n_2$ and $\sigma_1^2 \ne \sigma_2^2$, then the two-sample t tests with critical region $1\{S_n \ge t_{n-2}(1-\alpha)\}$ do not have approximate level $\alpha$. In this case Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.

Figure 5.3.2. Two sample: 10000 simulations; chi-square data; equal variances. Observed significance level plotted against log10 sample size. Each plotted point represents the results of 10,000 two-sample t tests. For each simulation the two samples are the same size (the size indicated on the x-axis), $\sigma_1^2 = \sigma_2^2$, and the data are $\chi^2_d$ where $d$ is one of 2, 10, or 50.

Next, in the one-sample situation, let $h(\bar X)$ be an estimate of $h(\mu)$ where $h$ is continuously differentiable at $\mu$, $h^{(1)}(\mu) \ne 0$. By Theorem 5.3.2,

$$\sqrt{n}\,[h(\bar X) - h(\mu)] \xrightarrow{\mathcal{L}} N(0, \sigma^2 [h^{(1)}(\mu)]^2).$$

To test the hypothesis $H : h(\mu) = h_0$ versus $K : h(\mu) > h_0$ the natural test statistic is

$$T_n = \frac{\sqrt{n}\,[h(\bar X) - h_0]}{s\,|h^{(1)}(\bar X)|}.$$
Combining Theorem 5.3.2 and Slutsky's theorem, we see that here, too, if $H$ is true, $T_n \xrightarrow{\mathcal{L}} N(0, 1)$ so that $z_{1-\alpha}$ is the asymptotic critical value.

Figure 5.3.3. Two sample: 10000 simulations; Gaussian data; unequal variances; second sample two times bigger. Observed significance level plotted against log10 of the smaller sample size. Each plotted point represents the results of 10,000 two-sample t tests. For each simulation the two samples differ in size: the second sample is two times the size of the first. The x-axis denotes the size of the smaller of the two samples. The data in the first sample are $N(0, 1)$ and in the second they are $N(0, \sigma^2)$ where $\sigma^2$ takes on the values 1, 3, 6, and 9, as indicated in the plot.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean $\bar X$ will be approximately normally distributed with variance $\sigma^2/n$ depending on the parameters indexing the family considered. We have seen that smooth transformations $h(\bar X)$ are also approximately normally distributed. It turns out to be useful to know transformations $h$, called variance stabilizing, such that $\mathrm{Var}\, h(\bar X)$ is approximately independent of the parameters indexing the family we are considering. From (5.3.6) and
(5.3.13) we see that a first approximation to the variance of $h(\bar X)$ is $\sigma^2 [h^{(1)}(\mu)]^2 / n$. Thus, finding a variance stabilizing transformation is equivalent to finding a function $h$ such that

$$\sigma^2 [h^{(1)}(\mu)]^2 \equiv c > 0 \tag{5.3.19}$$

for all $\mu$ and $\sigma$ appropriate to our family. Such a function can usually be found if $\sigma$ depends only on $\mu$, which varies freely. In this case (5.3.19) is an ordinary differential equation.

As an example, suppose that $X_1, \ldots, X_n$ is a sample from a $\mathcal{P}(\lambda)$ family. In this case $\sigma^2 = \lambda$ and $\mathrm{Var}(\bar X) = \lambda/n$. To have $\mathrm{Var}\, h(\bar X)$ approximately constant in $\lambda$, $h$ must satisfy the differential equation $[h^{(1)}(\lambda)]^2 \lambda = c > 0$ for some arbitrary $c > 0$. If we require that $h$ is increasing, this leads to $h^{(1)}(\lambda) = \sqrt{c}/\sqrt{\lambda}$, $\lambda > 0$, which has as its solution $h(\lambda) = 2\sqrt{c\lambda} + d$, where $d$ is arbitrary. Thus, $h(t) = \sqrt{t}$ is a variance stabilizing transformation of $\bar X$ for the Poisson family of distributions. Substituting in (5.3.6) we find $\mathrm{Var}(\bar X^{1/2}) \approx 1/(4n)$ and $\sqrt{n}\,(\bar X^{1/2} - \lambda^{1/2})$ has approximately a $N(0, 1/4)$ distribution.

One application of variance stabilizing transformations, by their definition, is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. Thus, in the preceding $\mathcal{P}(\lambda)$ case,

$$\sqrt{\bar X} \pm \frac{z(1 - \frac{1}{2}\alpha)}{2\sqrt{n}}$$

is an approximate $1 - \alpha$ confidence interval for $\sqrt{\lambda}$. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. Major examples are the generalized linear models of Section 6.5. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Some further examples of variance stabilizing transformations are given in the problems.

The notion of such transformations can be extended to the following situation. Suppose $\hat\gamma_n(X_1, \ldots, X_n)$ is an estimate of a real parameter $\gamma$ indexing a family of distributions from which $X_1, \ldots, X_n$ are an i.i.d. sample. Suppose further that

$$\mathcal{L}_\gamma(\sqrt{n}\,(\hat\gamma_n - \gamma)) \to N(0, \sigma^2(\gamma)).$$

Then again, a variance stabilizing transformation $h$ is such that

$$\sqrt{n}\,(h(\hat\gamma_n) - h(\gamma)) \xrightarrow{\mathcal{L}} N(0, c)$$

for all $\gamma$. See Example 5.3.6. Also closely related but different are so-called normalizing transformations. See Problems 5.3.15 and 5.3.16.

Edgeworth Approximations

The normal approximation to the distribution of $\bar X$ utilizes only the first two moments of $\bar X$. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on
the normal approximation by utilizing the third and fourth moments. Let $F_n$ denote the distribution of $T_n = \sqrt{n}\,(\bar X - \mu)/\sigma$ and let $\gamma_{1n}$ and $\gamma_{2n}$ denote the coefficient of skewness and kurtosis of $T_n$. Then under some conditions,$^{(1)}$

$$F_n(x) = \Phi(x) - \varphi(x)\left[\frac{1}{6}\gamma_{1n} H_2(x) + \frac{1}{24}\gamma_{2n} H_3(x) + \frac{1}{72}\gamma_{1n}^2 H_5(x)\right] + r_n \tag{5.3.20}$$

where $r_n$ tends to zero at a rate faster than $1/n$ and $H_2$, $H_3$, and $H_5$ are Hermite polynomials defined by

$$H_2(x) = x^2 - 1, \quad H_3(x) = x^3 - 3x, \quad H_5(x) = x^5 - 10x^3 + 15x. \tag{5.3.21}$$

The expansion (5.3.20) is called the Edgeworth expansion for $F_n$.

Example 5.3.5. Edgeworth Approximations to the $\chi^2$ Distribution. Suppose $V \sim \chi^2_n$. According to Theorem B.3.1, $V$ has the same distribution as $\sum_{i=1}^n X_i^2$, where the $X_i$ are independent and $X_i \sim N(0, 1)$, $i = 1, \ldots, n$. It follows from the central limit theorem that $T_n = (\sum_{i=1}^n X_i^2 - n)/\sqrt{2n} = (V - n)/\sqrt{2n}$ has approximately a $N(0, 1)$ distribution. To improve on this approximation, we need only compute $\gamma_{1n}$ and $\gamma_{2n}$. We can use Problem B.2.4 to compute

$$\gamma_{1n} = \frac{E(V - n)^3}{(2n)^{3/2}} = \frac{2\sqrt{2}}{\sqrt{n}}, \qquad \gamma_{2n} = \frac{E(V - n)^4}{(2n)^2} - 3 = \frac{12}{n}.$$

Therefore,

$$F_n(x) = \Phi(x) - \varphi(x)\left[\frac{\sqrt{2}}{3\sqrt{n}}(x^2 - 1) + \frac{1}{2n}(x^3 - 3x) + \frac{1}{9n}(x^5 - 10x^3 + 15x)\right] + r_n.$$

Table 5.3.1 gives this approximation together with the exact distribution and the normal approximation when $n = 10$.

TABLE 5.3.1. Edgeworth$^{(2)}$ and normal approximations EA and NA to the $\chi^2_{10}$ distribution, $P(T_n \le x)$, where $T_n$ is a standardized random variable.
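The comparison in this example can be reproduced numerically. The sketch below is an illustration, not the computation used for Table 5.3.1: it evaluates the Edgeworth approximation with the remainder $r_n$ dropped, the plain normal approximation, and the exact $\chi^2_n$ distribution function. The exact value uses the closed form for even degrees of freedom, $P(V \le v) = 1 - e^{-v/2}\sum_{j=0}^{n/2-1}(v/2)^j/j!$, which is an assumption of this sketch (valid for even $n$ only), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def chi2_cdf_even(v, n):
    """Exact P(V <= v) for V ~ chi-square(n), n even, via the Poisson-sum closed form."""
    if v <= 0:
        return 0.0
    half = v / 2
    term, total = 1.0, 1.0
    for j in range(1, n // 2):
        term *= half / j  # builds (v/2)^j / j!
        total += term
    return 1 - math.exp(-half) * total


def edgeworth_cdf(x, n):
    """Edgeworth approximation to P(T_n <= x), T_n = (V - n)/sqrt(2n), remainder dropped."""
    g1 = 2 * math.sqrt(2) / math.sqrt(n)  # skewness of T_n
    g2 = 12 / n                            # kurtosis of T_n
    h2 = x**2 - 1
    h3 = x**3 - 3 * x
    h5 = x**5 - 10 * x**3 + 15 * x
    return STD.cdf(x) - STD.pdf(x) * (g1 * h2 / 6 + g2 * h3 / 24 + g1**2 * h5 / 72)


def compare(x, n=10):
    """Return (exact, Edgeworth, normal) values of P(T_n <= x)."""
    v = n + x * math.sqrt(2 * n)
    return chi2_cdf_even(v, n), edgeworth_cdf(x, n), STD.cdf(x)


if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0, 2.0):
        print(x, compare(x))
```

At $x = 0$ and $n = 10$ the normal approximation returns exactly 0.5, while the skewness term moves the Edgeworth value much closer to the exact probability.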
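The Multivariate Case

Lemma 5.3.2 extends to the $d$-variate case.

Lemma 5.3.3. Suppose $\{U_n\}$ are $d$-dimensional random vectors and that for some sequence of constants $\{a_n\}$ with $a_n \to \infty$ as $n \to \infty$,

(i) $a_n(U_n - u) \xrightarrow{\mathcal{L}} V_{d \times 1}$ for some $d \times 1$ vector of constants $u$.

(ii) $g : R^d \to R^p$ has a differential $g^{(1)}_{p \times d}(u)$ at $u$.

Then

$$a_n[g(U_n) - g(u)] \xrightarrow{\mathcal{L}} g^{(1)}(u)V.$$

Proof. The proof follows from the arguments of the proof of Lemma 5.3.2.

Example 5.3.6. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be i.i.d. as $(X, Y)$ where $0 < EX^4 < \infty$, $0 < EY^4 < \infty$. Let $\rho^2 = \mathrm{Cov}^2(X, Y)/\sigma_1^2\sigma_2^2$ where $\sigma_1^2 = \mathrm{Var}\, X$, $\sigma_2^2 = \mathrm{Var}\, Y$; and let $r^2 = \hat C^2/\hat\sigma_1^2\hat\sigma_2^2$ where

$$\hat C = n^{-1}\sum (X_i - \bar X)(Y_i - \bar Y), \quad \hat\sigma_1^2 = n^{-1}\sum (X_i - \bar X)^2, \quad \hat\sigma_2^2 = n^{-1}\sum (Y_i - \bar Y)^2.$$

Recall from Section 4.9.5 that in the bivariate normal case the sample correlation coefficient $r$ is the MLE of the population correlation coefficient $\rho$ and that the likelihood ratio test of $H : \rho = 0$ is based on $|r|$. We can write $r^2 = g(\hat C, \hat\sigma_1^2, \hat\sigma_2^2) : R^3 \to R$, where $g(u_1, u_2, u_3) = u_1^2/u_2u_3$. Because of the location and scale invariance of $\rho$ and $r$, we can use the transformations $\tilde X_i = (X_i - \mu_1)/\sigma_1$ and $\tilde Y_j = (Y_j - \mu_2)/\sigma_2$ to conclude that without loss of generality we may assume $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, $\rho = E(XY)$. Using the central limit and Slutsky's theorems, we can show (Problem 5.3.9) that $\sqrt{n}(\hat C - \rho)$, $\sqrt{n}(\hat\sigma_1^2 - 1)$ and $\sqrt{n}(\hat\sigma_2^2 - 1)$ jointly have the same asymptotic distribution as $\sqrt{n}(U_n - u)$ where

$$U_n = (n^{-1}\textstyle\sum X_iY_i,\; n^{-1}\sum X_i^2,\; n^{-1}\sum Y_i^2)$$

and $u = (\rho, 1, 1)$. Let $\tau^2_{k,j} = \mathrm{Var}(\tilde X^k \tilde Y^j)$ and $\lambda_{k,j,m,l} = \mathrm{Cov}(\tilde X^k \tilde Y^j, \tilde X^m \tilde Y^l)$, then by the central limit theorem

$$\sqrt{n}(U_n - u) \xrightarrow{\mathcal{L}} N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \tau^2_{1,1} & \lambda_{1,1,2,0} & \lambda_{1,1,0,2} \\ \lambda_{1,1,2,0} & \tau^2_{2,0} & \lambda_{2,0,0,2} \\ \lambda_{1,1,0,2} & \lambda_{2,0,0,2} & \tau^2_{0,2} \end{pmatrix}. \tag{5.3.22}$$

Next we compute

$$g^{(1)}(u) = (2u_1/u_2u_3,\; -u_1^2/u_2^2u_3,\; -u_1^2/u_2u_3^2) = (2\rho, -\rho^2, -\rho^2).$$

It follows from Lemma 5.3.3 and (B.5.6) that $\sqrt{n}(r^2 - \rho^2)$ is asymptotically normal, $N(0, \sigma_0^2)$, with

$$\sigma_0^2 = g^{(1)}(u)\Sigma[g^{(1)}(u)]^T = 4\rho^2\tau^2_{1,1} + \rho^4\tau^2_{2,0} + \rho^4\tau^2_{0,2} + 2\{-2\rho^3\lambda_{1,1,2,0} - 2\rho^3\lambda_{1,1,0,2} + \rho^4\lambda_{2,0,0,2}\}.$$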
When $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$, then $\sigma_0^2 = 4\rho^2(1 - \rho^2)^2$, and (Problem 5.3.9)

$$\sqrt{n}(r - \rho) \xrightarrow{\mathcal{L}} N(0, (1 - \rho^2)^2).$$

Referring to (5.3.19), we see (Problem 5.3.10) that in the bivariate normal case a variance stabilizing transformation $h(r)$ with $\sqrt{n}[h(r) - h(\rho)] \xrightarrow{\mathcal{L}} N(0, 1)$ is achieved by choosing

$$h(\rho) = \frac{1}{2}\log\left(\frac{1 + \rho}{1 - \rho}\right).$$

The approximation based on this transformation, which is called Fisher's z, has been studied extensively and it has been shown (e.g., David, 1938) that $\mathcal{L}(\sqrt{n - 3}\,(h(r) - h(\rho)))$ is closely approximated by the $N(0, 1)$ distribution, that is,

$$P(r \le c) \approx \Phi(\sqrt{n - 3}\,[h(c) - h(\rho)]), \quad c \in (-1, 1).$$

This expression provides approximations to the critical value of tests of $H : \rho = 0$, it gives approximations to the power of these tests, and it provides the approximate $100(1 - \alpha)\%$ confidence interval of fixed length,

$$\rho = \tanh\left\{h(r) \pm z\left(1 - \tfrac{1}{2}\alpha\right)\big/\sqrt{n - 3}\right\}$$

where tanh is the hyperbolic tangent.

Here is an extension of Theorem 5.3.2.

Theorem 5.3.4. Suppose $Y_1, \ldots, Y_n$ are independent identically distributed $d$ vectors with $E|Y_1|^2 < \infty$, $EY_1 = m$, $\mathrm{Var}\, Y_1 = \Sigma$ and $h : \mathcal{O} \to R^p$ where $\mathcal{O}$ is an open subset of $R^d$, $h = (h_1, \ldots, h_p)$ and $h$ has a total differential $h^{(1)}(m) = \left\|\frac{\partial h_i}{\partial x_j}(m)\right\|_{p \times d}$. Then

$$h(\bar Y) = h(m) + h^{(1)}(m)(\bar Y - m) + o_p(n^{-1/2}) \tag{5.3.23}$$

$$\sqrt{n}[h(\bar Y) - h(m)] \xrightarrow{\mathcal{L}} N(0, h^{(1)}(m)\Sigma[h^{(1)}(m)]^T). \tag{5.3.24}$$

Proof. Argue as before using B.8.5,

(a) $h(y) = h(m) + h^{(1)}(m)(y - m) + o(|y - m|)$

and

(b) $\sqrt{n}(\bar Y - m) \xrightarrow{\mathcal{L}} N(0, \Sigma)$

so that
(c) $\sqrt{n}(h(\bar Y) - h(m)) = \sqrt{n}\,h^{(1)}(m)(\bar Y - m) + o_p(1)$.

Example 5.3.7. $\chi^2_1$ and Normal Approximation to the Distribution of F Statistics. Suppose that $X_1, \ldots, X_n$ is a sample from a $N(0, 1)$ distribution. Then according to Corollary B.3.1, the F statistic

$$T_{k,m} = \frac{(1/k)\sum_{i=1}^k X_i^2}{(1/m)\sum_{i=k+1}^{k+m} X_i^2} \tag{5.3.25}$$

has an $\mathcal{F}_{k,m}$ distribution, where $k + m = n$. Suppose that $n \ge 60$ so that Table IV cannot be used for the distribution of $T_{k,m}$. When $k$ is fixed and $m$ (or equivalently $n = k + m$) is large, we can use Slutsky's theorem (A.14.9) to find an approximation to the distribution of $T_{k,m}$. To show this, we first note that $(1/m)\sum_{i=k+1}^{k+m} X_i^2$ is the average of $m$ independent $\chi^2_1$ random variables. By Theorem B.3.1, the mean of a $\chi^2_1$ variable is $E(Z^2)$, where $Z \sim N(0, 1)$. But $E(Z^2) = \mathrm{Var}(Z) = 1$. Now the weak law of large numbers (A.15.7) implies that as $m \to \infty$,

$$\frac{1}{m}\sum_{i=k+1}^{k+m} X_i^2 \xrightarrow{P} 1.$$

Using the (b) part of Slutsky's theorem, we conclude that for fixed $k$,

$$T_{k,m} \xrightarrow{\mathcal{L}} \frac{1}{k}\sum_{i=1}^k X_i^2$$

as $m \to \infty$. By Theorem B.3.1, $\sum_{i=1}^k X_i^2$ has a $\chi^2_k$ distribution. Thus, when the number of degrees of freedom in the denominator is large, the $\mathcal{F}_{k,m}$ distribution can be approximated by the distribution of $V/k$, where $V \sim \chi^2_k$. To get an idea of the accuracy of this approximation, check the entries of Table IV against the last row. This row, which is labeled $m = \infty$, gives the quantiles of the distribution of $V/k$. For instance, if $k = 5$ and $m = 60$, then $P[T_{5,60} \ge 2.37] = P[(V/k) \ge 2.21] = 0.05$, and the respective upper 0.05 quantiles are 2.37 for the $\mathcal{F}_{5,60}$ distribution and 2.21 for the distribution of $V/k$. See also Figure B.3.1 in which the density of $V/k$, when $k = 10$, is given as the $\mathcal{F}_{10,\infty}$ density.

Next we turn to the normal approximation to the distribution of $T_{k,m}$. Suppose for simplicity that $k = m$ and $k \to \infty$. We write $T_k$ for $T_{k,k}$. The case $m = \lambda k$ for some $\lambda > 0$ is left to the problems. We do not require the $X_i$ to be normal, only that they be i.i.d. with $EX_1 = 0$, $EX_1^2 > 0$ and $EX_1^4 < \infty$. Then, if $\sigma^2 = \mathrm{Var}(X_1)$, we can write,

$$T_k = \frac{(1/k)\sum_{i=1}^k Y_{i1}}{(1/k)\sum_{i=1}^k Y_{i2}} \tag{5.3.26}$$
where $Y_{i1} = X_i^2/\sigma^2$ and $Y_{i2} = X_{k+i}^2/\sigma^2$, $i = 1, \ldots, k$. Equivalently $T_k = h(\bar Y)$ where $Y_i = (Y_{i1}, Y_{i2})^T$, $E(Y_i) = (1, 1)^T$ and $h(u, v) = u/v$. By Theorem 5.3.4,

$$\sqrt{k}(T_k - 1) \xrightarrow{\mathcal{L}} N(0, h^{(1)}(\mathbf{1})\Sigma[h^{(1)}(\mathbf{1})]^T) \tag{5.3.27}$$

where $\mathbf{1} = (1, 1)^T$, $h^{(1)}(u, v) = (1/v, -u/v^2)^T$ and $\Sigma = \mathrm{Var}(Y_{11})J$, where $J$ is the $2 \times 2$ identity. We conclude that

$$\sqrt{k}(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 2\mathrm{Var}(Y_{11})).$$

In particular if $X_1 \sim N(0, \sigma^2)$, as $k \to \infty$, $\sqrt{k}(T_k - 1) \xrightarrow{\mathcal{L}} N(0, 4)$. In general, when $\min\{k, m\} \to \infty$ and $X_i \sim N(0, \sigma^2)$, the distribution of $\sqrt{\frac{mk}{m+k}}(T_{k,m} - 1)$ can be approximated by a $N(0, 2)$ distribution. Thus (Problem 5.3.7),

$$P[T_{k,m} \le t] = P\left[\sqrt{\tfrac{mk}{m+k}}(T_{k,m} - 1) \le \sqrt{\tfrac{mk}{m+k}}(t - 1)\right] \approx \Phi\left(\sqrt{\tfrac{mk}{m+k}}\,(t - 1)/\sqrt{2}\right). \tag{5.3.28}$$

An interesting and important point (noted by Box, 1953) is that unlike the t test, the F test for equality of variances (Problem 5.3.8(a)) does not have robustness of level. Specifically, if $\mathrm{Var}(X_1^2) \ne 2\sigma^4$, the upper $\mathcal{F}_{k,m}$ critical value $f_{k,m}(1 - \alpha)$, which by (5.3.28) satisfies

$$z_{1-\alpha} \approx \sqrt{\frac{mk}{m+k}}\,(f_{k,m}(1 - \alpha) - 1)/\sqrt{2}$$

or

$$f_{k,m}(1 - \alpha) \approx 1 + \sqrt{\frac{2(m + k)}{mk}}\,z_{1-\alpha},$$

is asymptotically incorrect. In general (Problem 5.3.8(c)) one has to use the critical value

$$c_{k,m} = 1 + \sqrt{\frac{\kappa(m + k)}{mk}}\,z_{1-\alpha} \tag{5.3.29}$$

where $\kappa = \mathrm{Var}[(X_1 - \mu_1)/\sigma_1]^2$, $\mu_1 = E(X_1)$, and $\sigma_1^2 = \mathrm{Var}(X_1)$. When $\kappa$ is unknown, it can be estimated by the method of moments (Problem 5.3.8(d)).

5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families

Our final application of the $\delta$-method follows.

Theorem 5.3.5. Suppose $\mathcal{P}$ is a canonical exponential family of rank $d$ generated by $T$ with $\mathcal{E}$ open. Then if $X_1, \ldots, X_n$ are a sample from $P_\eta \in \mathcal{P}$ and $\hat\eta$ is defined as the MLE if it exists and equal to $c$ (some fixed value) otherwise,
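(i) $\hat\eta = \eta + \frac{1}{n}\sum_{i=1}^n \ddot A^{-1}(\eta)(T(X_i) - \dot A(\eta)) + o_{P_\eta}(n^{-1/2})$

(ii) $\mathcal{L}_\eta(\sqrt{n}(\hat\eta - \eta)) \to N_d(0, \ddot A^{-1}(\eta))$.

Proof. The result is a consequence of Theorems 5.2.2 and 5.3.4. We showed in the proof of Theorem 5.2.2 that, if $\bar T \equiv \frac{1}{n}\sum_{i=1}^n T(X_i)$, $P_\eta[\bar T \in \dot A(\mathcal{E})] \to 1$ and, hence, $P_\eta[\hat\eta = \dot A^{-1}(\bar T)] \to 1$. Identify $h$ in Theorem 5.3.4 with $\dot A^{-1}$ and $m$ with $\dot A(\eta)$. Note that by B.8.14, if $t = \dot A(\eta)$,

$$D\dot A^{-1}(t) = [D\dot A(\eta)]^{-1}. \tag{5.3.30}$$

But $D\dot A = \ddot A$ by definition and, thus, in our case,

$$h^{(1)}(m) = \ddot A^{-1}(\eta). \tag{5.3.31}$$

Hence, (i) follows from (5.3.23). For (ii) simply note that, in our case, by Corollary 1.6.1, $\Sigma = \mathrm{Var}(T(X_1)) = \ddot A(\eta)$ and, therefore,

$$h^{(1)}(m)\Sigma[h^{(1)}(m)]^T = \ddot A^{-1}\ddot A\ddot A^{-1}(\eta) = \ddot A^{-1}(\eta). \tag{5.3.32}$$

Hence, (ii) follows from (5.3.24).

Remark 5.3.1. Recall that $\ddot A(\eta) = \mathrm{Var}_\eta(T) = I(\eta)$ is the Fisher information. Thus, the asymptotic variance matrix $I^{-1}(\eta)$ of $\sqrt{n}(\hat\eta - \eta)$ equals the lower bound (3.4.38) on the variance matrix of $\sqrt{n}(\tilde\eta - \eta)$ for any unbiased estimator $\tilde\eta$. This is an "asymptotic efficiency" property of the MLE we return to in Section 6.2.1.

Example 5.3.8. Let $X_1, \ldots, X_n$ be i.i.d. as $X$ with $X \sim N(\mu, \sigma^2)$. Then $T_1 = \bar X$ and $T_2 = n^{-1}\sum X_i^2$ are sufficient statistics in the canonical model. Now

$$\sqrt{n}\,[T_1 - \mu,\; T_2 - (\mu^2 + \sigma^2)] \xrightarrow{\mathcal{L}} N(\mathbf{0}, I(\eta)) \tag{5.3.33}$$

where, by Example 2.3.4,

$$I(\eta) = \ddot A(\eta) = \frac{1}{2\eta_2^2}\begin{pmatrix} -\eta_2 & \eta_1 \\ \eta_1 & 1 - \eta_1^2/\eta_2 \end{pmatrix}.$$

Here $\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/2\sigma^2$, $\hat\eta_1 = \bar X/\hat\sigma^2$, and $\hat\eta_2 = -1/2\hat\sigma^2$ where $\hat\sigma^2 = T_2 - (T_1)^2$. By Theorem 5.3.5,

$$\sqrt{n}\,(\hat\eta_1 - \eta_1,\; \hat\eta_2 - \eta_2) \xrightarrow{\mathcal{L}} N(\mathbf{0}, I^{-1}(\eta)).$$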
Because $\bar X = T_1$ and $\hat\sigma^2 = T_2 - (T_1)^2$, we can use (5.3.33) and Theorem 5.3.4 to find (Problem 5.3.26)

$$\sqrt{n}\,(\bar X - \mu,\; \hat\sigma^2 - \sigma^2) \xrightarrow{\mathcal{L}} N(\mathbf{0}, \Sigma_0)$$

where $\Sigma_0 = \mathrm{diag}(\sigma^2, 2\sigma^4)$.

Summary. Consistency is 0th-order asymptotics. First-order asymptotics provides approximations to the difference between a quantity tending to a limit and the limit, for instance, the difference between a consistent estimate and the parameter it estimates. Second-order asymptotics provides approximations to the difference between the error and its first-order approximation, and so on. We begin in Section 5.3.1 by studying approximations to moments and central moments of estimates. Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. These "$\delta$ method" approximations based on Taylor's formula and elementary results about moments of means of i.i.d. variables are explained in terms of similar stochastic approximations to $h(\bar Y) - h(\mu)$ where $Y_1, \ldots, Y_n$ are i.i.d. as $Y$, $EY = \mu$, and $h$ is smooth. These stochastic approximations lead to Gaussian approximations to the laws of important statistics. The moment and in law approximations lead to the definition of variance stabilizing transformations for classical one-dimensional exponential families. Higher-order approximations to distributions (Edgeworth series) are discussed briefly. Finally, stochastic approximations in the case of vector statistics and parameters are developed, which lead to a result on the asymptotic normality of the MLE in multiparameter exponential families.

5.4 ASYMPTOTIC THEORY IN ONE DIMENSION

In this section we define and study asymptotic optimality for estimation, testing, and confidence bounds, under i.i.d. sampling, when we are dealing with one-dimensional smooth parametric models. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families.

5.4.1 Estimation: The Multinomial Case

Following Fisher (1958),$^{(1)}$ we develop the theory first for the case that $X_1, \ldots, X_n$ are i.i.d. taking values $\{x_0, \ldots, x_k\}$ only so that $P$ is defined by $p \equiv (p_0, \ldots, p_k)$ where

$$p_j \equiv P[X_1 = x_j], \quad 0 \le j \le k \tag{5.4.1}$$

and $p \in \mathcal{S}$, the $(k+1)$-dimensional simplex (see Example 1.6.7). Thus, $N = (N_0, \ldots, N_k)$ where $N_j \equiv \sum_{i=1}^n 1(X_i = x_j)$ is sufficient. We consider one-dimensional parametric submodels of $\mathcal{S}$ defined by $\mathcal{P} = \{(p(x_0, \theta), \ldots, p(x_k, \theta)) : \theta \in \Theta\}$, $\Theta$ open $\subset R$ (e.g., see Example 2.1.4 and Problem 2.1.15). We focus first on estimation of $\theta$. Assume

A: $\theta \to p(x_j, \theta)$, $0 < p_j < 1$, is twice differentiable for $0 \le j \le k$.
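Note that A implies that

$$l(X_1, \theta) \equiv \log p(X_1, \theta) = \sum_{j=0}^k \log p(x_j, \theta)\,1(X_1 = x_j) \tag{5.4.2}$$

is twice differentiable and $\frac{\partial l}{\partial\theta}(X_1, \theta)$ is a well-defined, bounded random variable

$$\frac{\partial l}{\partial\theta}(X_1, \theta) = \sum_{j=0}^k \left(\frac{\partial p}{\partial\theta}(x_j, \theta)\right)\frac{1}{p(x_j, \theta)}\,1(X_1 = x_j). \tag{5.4.3}$$

Furthermore (Section 3.4.2),

$$E_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\right) = 0 \tag{5.4.4}$$

and $\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta)$ is similarly bounded and well defined with

$$I(\theta) \equiv \mathrm{Var}_\theta\left(\frac{\partial l}{\partial\theta}(X_1, \theta)\right) = -E_\theta\,\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta). \tag{5.4.5}$$

As usual we call $I(\theta)$ the Fisher information.

Next suppose we are given a plug-in estimator $h\left(\frac{N}{n}\right)$ (see (2.1.11)) of $\theta$ where $h : \mathcal{S} \to R$ satisfies

$$h(p(\theta)) = \theta \text{ for all } \theta \in \Theta \tag{5.4.6}$$

where $p(\theta) = (p(x_0, \theta), \ldots, p(x_k, \theta))^T$. Many such $h$ exist if $k > 1$. Consider Example 2.1.4, for instance. Assume

H: $h$ is differentiable.

Then we have the following theorem.

Theorem 5.4.1. Under H, for all $\theta$,

$$\mathcal{L}_\theta\left(\sqrt{n}\left(h\left(\tfrac{N}{n}\right) - \theta\right)\right) \to N(0, \sigma^2(\theta, h)) \tag{5.4.7}$$

where $\sigma^2(\theta, h)$ is given by (5.4.11). Moreover, if A also holds,

$$\sigma^2(\theta, h) \ge I^{-1}(\theta) \tag{5.4.8}$$

with equality if and only if,

$$\frac{\partial h}{\partial p_j}(p(\theta)) = I^{-1}(\theta)\,\frac{\partial l}{\partial\theta}(x_j, \theta), \quad 0 \le j \le k. \tag{5.4.9}$$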
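Proof. Apply Theorem 5.3.2 noting that

$$\sqrt{n}\left(h\left(\tfrac{N}{n}\right) - h(p(\theta))\right) = \sqrt{n}\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\left(\frac{N_j}{n} - p(x_j, \theta)\right) + o_p(1).$$

Note that, using the definition of $N_j$,

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\left(\frac{N_j}{n} - p(x_j, \theta)\right) = n^{-1}\sum_{i=1}^n\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,(1(X_i = x_j) - p(x_j, \theta)). \tag{5.4.10}$$

Thus, by (5.4.10), not only is $\sqrt{n}\left\{h\left(\tfrac{N}{n}\right) - h(p(\theta))\right\}$ asymptotically normal with mean 0, but also its asymptotic variance is

$$\sigma^2(\theta, h) = \sum_{j=0}^k \left(\frac{\partial h}{\partial p_j}(p(\theta))\right)^2 p(x_j, \theta) - \left[\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,p(x_j, \theta)\right]^2. \tag{5.4.11}$$

Note that by differentiating (5.4.6), we obtain

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,\frac{\partial p}{\partial\theta}(x_j, \theta) = 1 \tag{5.4.12}$$

or equivalently, by noting $\frac{\partial p}{\partial\theta}(x_j, \theta) = \left[\frac{\partial l}{\partial\theta}(x_j, \theta)\right]p(x_j, \theta)$,

$$\mathrm{Cov}_\theta\left(\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,1(X_1 = x_j),\; \frac{\partial l}{\partial\theta}(X_1, \theta)\right) = 1. \tag{5.4.13}$$

By (5.4.13), using the correlation inequality (A.11.16) as in the proof of the information inequality (3.4.12), we obtain

$$1 \le \sigma^2(\theta, h)\,\mathrm{Var}_\theta\,\frac{\partial l}{\partial\theta}(X_1, \theta) = \sigma^2(\theta, h)\,I(\theta) \tag{5.4.14}$$

with equality iff,

$$\sum_{j=0}^k \frac{\partial h}{\partial p_j}(p(\theta))\,(1(X_1 = x_j) - p(x_j, \theta)) = a(\theta)\,\frac{\partial l}{\partial\theta}(X_1, \theta) + b(\theta) \tag{5.4.15}$$

for some $a(\theta) \ne 0$ and some $b(\theta)$ with probability 1. Taking expectations we get $b(\theta) = 0$. Noting that the covariance of the right- and left-hand sides is $a(\theta)$, while their common variance is $a^2(\theta)I(\theta) = \sigma^2(\theta, h)$, we see that equality in (5.4.8) gives

$$a^2(\theta)\,I^2(\theta) = 1, \tag{5.4.16}$$

which implies (5.4.9).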
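The content of Theorem 5.4.1 can be checked numerically on a small multinomial submodel. The sketch below takes the Hardy-Weinberg proportions $p(\theta) = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$ as an assumed illustration and compares the asymptotic variance (5.4.11) of two plug-in estimators, $h(p) = p_0 + p_1/2$ and $h(p) = \sqrt{p_0}$, with the bound $I^{-1}(\theta)$ in (5.4.8). The function names are hypothetical; both choices of $h$ satisfy $h(p(\theta)) = \theta$.

```python
def hardy_weinberg(theta):
    """Cell probabilities p(theta) of the assumed three-cell submodel."""
    return (theta**2, 2 * theta * (1 - theta), (1 - theta) ** 2)


def asymptotic_variance(grad, p):
    """sigma^2(theta, h) as in (5.4.11), given grad_j = dh/dp_j evaluated at p(theta)."""
    second_moment = sum(g * g * pj for g, pj in zip(grad, p))
    first_moment = sum(g * pj for g, pj in zip(grad, p))
    return second_moment - first_moment**2


def fisher_information(theta):
    """I(theta) = sum_j (dp_j/dtheta)^2 / p_j for the assumed submodel."""
    p = hardy_weinberg(theta)
    dp = (2 * theta, 2 - 4 * theta, -2 * (1 - theta))
    return sum(d * d / pj for d, pj in zip(dp, p))


theta = 0.3
p = hardy_weinberg(theta)
bound = 1 / fisher_information(theta)

# h(p) = p0 + p1/2 has gradient (1, 1/2, 0); it attains the bound.
var_freq = asymptotic_variance((1.0, 0.5, 0.0), p)
# h(p) = sqrt(p0) has gradient (1 / (2 * sqrt(p0)), 0, 0); it does not.
var_sqrt = asymptotic_variance((1 / (2 * p[0] ** 0.5), 0.0, 0.0), p)

print(bound, var_freq, var_sqrt)
```

In this model $I(\theta) = 2/(\theta(1-\theta))$, so the bound is $\theta(1-\theta)/2$. The first estimator attains it, while the second has asymptotic variance $(1-\theta^2)/4$, which is strictly larger for $0 < \theta < 1$.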
We shall see in Section 5.4.3 that the information bound (5.4.8) is, if it exists and under regularity conditions, achieved by $\hat\theta = \hat h\left(\frac{N}{n}\right)$, the MLE of $\theta$ where $\hat h$ is defined implicitly by: $\hat h(p)$ is the value of $\theta$, which

(i) maximizes $\sum_{j=0}^k N_j \log p(x_j, \theta)$

and

(ii) solves $\sum_{j=0}^k N_j \frac{\partial l}{\partial\theta}(x_j, \theta) = 0$.

Example 5.4.1. One-Parameter Discrete Exponential Families. Suppose $p(x, \theta) = \exp\{\theta T(x) - A(\theta)\}h(x)$ where $h(x) = 1(x \in \{x_0, \ldots, x_k\})$, $\theta \in \Theta$, is a canonical one-parameter exponential family (supported on $\{x_0, \ldots, x_k\}$) and $\Theta$ is open. Then Theorem 5.3.5 applies to the MLE $\hat\theta$ and

$$\mathcal{L}_\theta(\sqrt{n}(\hat\theta - \theta)) \to N\left(0, \frac{1}{I(\theta)}\right) \tag{5.4.17}$$

with the asymptotic variance achieving the information bound $I^{-1}(\theta)$. Note that because $\bar T = n^{-1}\sum_{i=1}^n T(X_i) = \sum_{j=0}^k T(x_j)\frac{N_j}{n}$, then, by (2.3.3),

$$\hat\theta = [\dot A]^{-1}(\bar T), \tag{5.4.18}$$

and

$$\hat h(p) = [\dot A]^{-1}\left(\sum_{j=0}^k T(x_j)\,p_j\right). \tag{5.4.19}$$

The binomial $(n, p)$ and Hardy-Weinberg models can both be put into this framework with canonical parameters such as $\theta = \log\left(\frac{p}{1-p}\right)$ in the first case.

Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena. In the next two subsections we consider some more general situations.

5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates

We begin with an asymptotic normality theorem for minimum contrast estimates. As in Theorem 5.2.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.

Suppose i.i.d. $X_1, \ldots, X_n$ are tentatively modeled to be distributed according to $P_\theta$, $\theta \in \Theta$ open $\subset R$ and corresponding density/frequency functions $p(\cdot, \theta)$. Write $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$. Let $\rho : \mathcal{X} \times \Theta \to R$ where

$$D(\theta, \theta_0) = E_{\theta_0}(\rho(X_1, \theta) - \rho(X_1, \theta_0))$$
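is uniquely minimized at $\theta_0$. Let $\bar\theta_n$ be the minimum contrast estimate

$$\bar\theta_n = \arg\min_\theta \frac{1}{n}\sum_{i=1}^n \rho(X_i, \theta).$$

Suppose

A0: $\psi = \frac{\partial\rho}{\partial\theta}$ is well defined.

Then

$$\frac{1}{n}\sum_{i=1}^n \psi(X_i, \bar\theta_n) = 0. \tag{5.4.20}$$

In what follows we let $P$, rather than $P_\theta$, denote the distribution of $X_i$. This is because, as pointed out later in Remark 5.4.3, under regularity conditions the properties developed in this section are valid for $P \notin \{P_\theta : \theta \in \Theta\}$. We need only that $\theta(P)$ is a parameter as defined in Section 1.1. As we saw in Section 2.1, parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. Suppose

A1: The parameter $\theta(P)$ given by the solution of

$$\int \psi(x, \theta)\,dP(x) = 0 \tag{5.4.21}$$

is well defined on $\mathcal{P}$. That is, $\int |\psi(x, \theta)|\,dP(x) < \infty$, $\theta \in \Theta$, $P \in \mathcal{P}$ and $\theta(P)$ is the unique solution of (5.4.21) and, hence, $\theta(P_\theta) = \theta$.

A2: $E_P\psi^2(X_1, \theta(P)) < \infty$ for all $P \in \mathcal{P}$.

A3: $\psi(\cdot, \theta)$ is differentiable, $\frac{\partial\psi}{\partial\theta}(X_1, \theta)$ has a finite expectation and $E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P)) \ne 0$.

A4: $\sup_t\left\{\left|\frac{1}{n}\sum_{i=1}^n\left(\frac{\partial\psi}{\partial\theta}(X_i, t) - \frac{\partial\psi}{\partial\theta}(X_i, \theta(P))\right)\right| : |t - \theta(P)| \le \epsilon_n\right\} \xrightarrow{P} 0$ if $\epsilon_n \to 0$.

A5: $\bar\theta_n \xrightarrow{P} \theta(P)$. That is, $\bar\theta_n$ is consistent on $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$.

Theorem 5.4.2. Under A0-A5,

$$\bar\theta_n = \theta(P) + \frac{1}{n}\sum_{i=1}^n \tilde\psi(X_i, P) + o_p(n^{-1/2}) \tag{5.4.22}$$

where

$$\tilde\psi(x, P) = \psi(x, \theta(P)) \Big/ \left(-E_P\frac{\partial\psi}{\partial\theta}(X_1, \theta(P))\right). \tag{5.4.23}$$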
4 Asymptotic Theory in One Dimension 329 Hence.O(P)) ~ .29) .26) and A3 and the WLLN to conclude that (5.O(P)I· Apply AS and A4 to conclude that (5. n.4. 1 1jJ(Xi .425)(5.L 1jJ(X" 8(P)) = Op(n.O(P))) p (5. . Let On = O(P) where P denotes the empirical probability.p) = E p 1jJ2(Xl ..p) < 00 by AI..4.l   2::7 1 iJ1jJ.4. en) around 8(P). O(P)).4.24) follows from the central limit theorem and Slutsky's theorem. Next we show that (5.Section 5.4.21).20).22) follows by a Taylor expansion of the equations (5.4.' " iJ (Xi . where (T2(1jJ. we obtain.1j2 ). and A3._ .O(P)) 2' (E '!W(X.O(P)) Ep iJO (Xl.20) and (5.O(P)) = n.On)(8n .25) where 18~  8(P)1 < IiJn .lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1. 8(P)) + op(l) = n ~ 1jJ(Xi . By expanding n.4.27) we get.. A2.4. (iJ1jJ ) In (On .4. applied to (5.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ. But by the central limit theorem and AI. using (5.8(P)) I n n n~ t=1 n~ t=1 e (5.4. (5.28) .22) because .4.4. t=1 1 n (5.27) Combining (5.Ti(en .' " 1jJ(Xi .4.24) proof Claim (5.
d. Identity (5.4. Conditions AO.'. {xo 1 ••• 1 Xk}. B) where p('. 0) l' P = {Po: or more generally.. 1 •• I ! .(x) (5.8).! .4. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po..4.O)?jJ(X I .=0 ?jJ(x. . Z~ (Xl. O(P)) ) and (5. This extension will be pursued in Volume2. we see that A6 corresponds to (5. =0 (5.2 may hold even if ?jJ is not differentiable provided that .4.e. that 1/J = ~ for some p).4.22) follows from the foregoing and (5. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. 0) = 6 (x) 0. B» and a suitable replacement for A3. e) is as usual a density or frequency function. • Theorem 5. 0) = logp(x.30) is formally obtained by differentiating the equation (5. 0 • .30) suggests that ifP is regular the conclusion of Theorem 5.29).4.Eo (XI. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I . O(P)) + op (I k n n ?jJ(Xi . A2. A4 is found. O)dp. essentially due to Cramer (1946). and A3 are readily checkable whereas we have given conditions for AS in Section 5. for Mestimates.4. (2) O(P) solves Ep?jJ(XI. 0) is replaced by Covo(?jJ(X" 0).O).31) for all O. Our arguments apply even if Xl. 0) Covo (:~(Xt.4.4.4.1.4. . I o Remark 5. Remark 5. en j• . P but P E a}. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. written as J i J 2::.: 1. and we define h(p) = 6(xj )Pj.30) Note that (5.O) = O. This is in fact truesee Problem 5. n I _ . If further Xl takes on a finite set of values.21).O(P) Ik = n n ?jJ(Xi .i. for A4 and A6.3. Solutions to (5.1.2 is valid with O(P) as in (I) or (2).28) we tinally obtain On .4.4.12). . Our arguments apply to Mestimates.?jJ(XI'O)). Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i. A6: Suppose P = Po so that O(P) = 0.20) are called M~estimates. and that the model P is regular and letl(x. O)p(x.4.2. We conclude by stating some sufficient conditions. it is easy to see that A6 is the same as (3. AI.4..4.2.4. 
Remark 5. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . X n are i. I 'iiIf 1 I • I I .
then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (. = en en = &21 Ee e02 (Xl.O).Section 5.eo (Xl.0'1 < J(O)} 00.4. AOA6. In this case is the MLE and we obtain an identity of Fisher's..3. then the MLE On (5. < M(Xl.32) where 1(8) is t~e Fisher information introduq. w.8) is a continuous function of8 for all x. 0') is defined for all x. 10' . 0) g~ (x.:d in Section 3.4.O) = l(x.4. 0) (5.O) = l(x. 5.O) 10gp(x.4.1). We also indicate by example in the problems that some conditions are needed (Problem 5.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.4. . Pe) > 1(0) 1 (5. 0) and .O) and P ~ ~ Pe.p.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI.33) so that (5. 0) obeys AOA6.4. where EeM(Xl.. where iL(X) is the dominating measure for P(x) defined in (A.4. Theorem 5. satisfies If AOA6 apply to p(x.34) Furthermore.35) with equality iff. We can now state the basic result on asymptotic normality and e~ciency of the MLE.4. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl. "dP(x) = p(x)diL(x).20) occurs when p(x. That is. lO. 0) < A6': ~t (x.4) but A4' and A6' are not necessary (Problem 5. s) diL(X )ds < 00 for some J ~ J(O) > 0..4. 0) . B)." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.p(x.01 < J( 0) and J:+: JI'Ii:' (x. 0') .4.p = a(0) g~ for some a 01 o.
(5. }. ..4. 0 because nIl' .4.4.1/4 I (5.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5. Then X is the MLE of B and it is trivial to calculate [( B) _ 1.Xn be i. see Lehmann and Casella. N(B.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise.37) . 1 I I. (5.nB) ..4.35) is equivalent to (5.36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!.<1>( _n '/4 . thus.A'(B). 1). j 1 " Note that Theorem 5... 5. we know that all likelihood ratio tests for simple BI e e eo.(X B)... B) is an MLR family in T(X).4. B) = 0.4.38) Therefore. 0 e en e.4.. 442.B).4. claim (5. . .'(B) = I = 1(~I' B I' 0. cross multiplication shows that (5. We discuss this further in Volume II..i .1 once we identify 'IjJ(x.nBI < nl/'J <I>(n l/ ' . For this estimate superefficiency implies poor behavior of at values close to 0..4 Testing ! I i . for some Bo E is known as superefficiency.34) follow directly by Theorem 5. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n. {X 1.. .4. 1. By (5..1/4 X if!XI > n. The optimality part of Theorem 5.35)..ne . B) with J T(x) .. . 1 . .36) Because Eo 'lj. Let Z ~ N(O.". PolBn = Xl .4. . . 1.4.n(Bn .nB). 00. Then .4. . . Consider the following competitor to X: B n I 0 if IXI < n..4. . If B = 0. We next compule the limiting distribution of . Therefore.2. However.4. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. I I Example 5. and.3 is not valid without some conditions on the esti· mates being considered.1).30) and (5. Hodges's Example.3 generalizes Example 5..33) and (5.4. PIIZ + . PoliXI < n1/'1 . 0 1 .2..• •. and Polen = OJ . p.l'1 Let X" . I..4.'(0) = 0 < l(~)' The phenomenon (5. . . PoliXI < n 1/4 ] .39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo).i. if B I' 0. 1998.439) where ..d. 
The major testing problem if $\theta$ is one-dimensional is $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$. If $p(\cdot, \theta)$ is a one-parameter exponential family in $\theta$ generated by $T(X)$, the optimal tests, as well as the likelihood ratio test for $H$ versus $K$, are of the form "Reject $H$ for $T(X)$ large" with the critical value specified by making the probability of type I error $\alpha$ at $\theta_0$. This test can also be interpreted as a test of $H : \lambda \le \lambda_0$ versus $K : \lambda > \lambda_0$, where $\lambda = \dot{A}(\theta)$ because $\dot{A}$ is strictly increasing. The test is then precisely, "Reject $H$ for large values of the MLE $T(X)$ of $\lambda$." It seems natural in general to study the behavior of the test, "Reject $H$ if $\hat\theta_n \ge c(\alpha, \theta_0)$" where $P_{\theta_0}[\hat\theta_n \ge c(\alpha, \theta_0)] = \alpha$ and $\hat\theta_n$ is the MLE of $\theta$. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. $X_1, \dots, X_n$ distributed according to $P_\theta$, $\theta \in (a, b)$, $a < \theta_0 < b$, derive an optimality property, and then directly and through problems exhibit other tests with the same behavior. Let $c_n(\alpha, \theta_0)$ denote the critical value of the test using the MLE $\hat\theta_n$ based on $n$ observations.

Theorem 5.4.4. Suppose the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is such that the conditions of Theorem 5.4.2 apply to $\psi = \partial l/\partial\theta$ and $\hat\theta_n$, the MLE. That is,

$$\mathcal{L}_\theta(\sqrt{n}(\hat\theta_n - \theta)) \to N(0, I^{-1}(\theta)) \tag{5.4.40}$$

where $I(\theta) > 0$ for all $\theta$. Then

$$c_n(\alpha, \theta_0) = \theta_0 + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2}) \tag{5.4.41}$$

where $z_{1-\alpha}$ is the $1 - \alpha$ quantile of the $N(0, 1)$ distribution.

Suppose (A4') holds as well as (A6) and $I(\theta) < \infty$ for all $\theta$. Then

$$\text{If } \theta > \theta_0, \quad P_\theta[\hat\theta_n > c_n(\alpha, \theta_0)] \to 1. \tag{5.4.42}$$

$$\text{If } \theta < \theta_0, \quad P_\theta[\hat\theta_n > c_n(\alpha, \theta_0)] \to 0. \tag{5.4.43}$$

Property (5.4.42) is sometimes called consistency of the test against a fixed alternative.

Proof. The proof is straightforward:

$$P_{\theta_0}[\sqrt{nI(\theta_0)}(\hat\theta_n - \theta_0) \ge z] \to 1 - \Phi(z)$$

by (5.4.40). Thus,

$$P_{\theta_0}[\hat\theta_n \ge \theta_0 + z_{1-\alpha}/\sqrt{nI(\theta_0)}] = P_{\theta_0}[\sqrt{nI(\theta_0)}(\hat\theta_n - \theta_0) \ge z_{1-\alpha}] \to \alpha. \tag{5.4.44}$$

But Polya's theorem (A.14.22) guarantees that

$$\sup_z \left| P_{\theta_0}[\sqrt{nI(\theta_0)}(\hat\theta_n - \theta_0) \ge z] - (1 - \Phi(z)) \right| \to 0, \tag{5.4.45}$$

which implies that $\sqrt{nI(\theta_0)}(c_n(\alpha, \theta_0) - \theta_0) - z_{1-\alpha} \to 0$, and (5.4.41) follows. On the other hand,

$$P_\theta[\hat\theta_n \ge c_n(\alpha, \theta_0)] = P_\theta[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge \sqrt{nI(\theta)}(c_n(\alpha, \theta_0) - \theta)]. \tag{5.4.46}$$

By (5.4.41),

$$\sqrt{nI(\theta)}(c_n(\alpha, \theta_0) - \theta) = \sqrt{nI(\theta)}\left(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2})\right) = \sqrt{nI(\theta)}(\theta_0 - \theta) + O(1),$$

which tends to $-\infty$ if $\theta > \theta_0$ and to $\infty$ if $\theta < \theta_0$. Claims (5.4.42) and (5.4.43) follow. $\Box$

Theorem 5.4.4 tells us that the test under discussion is consistent and that for $n$ large the power function of the test rises steeply to $\alpha$ from the left at $\theta_0$ and continues rising steeply to 1 to the right of $\theta_0$. Optimality claims rest on a more refined analysis involving a reparametrization from $\theta$ to $\gamma \equiv \sqrt{n}(\theta - \theta_0)$.

Theorem 5.4.5. Suppose the conditions of Theorem 5.4.2 and (5.4.40) hold uniformly for $\theta$ in a neighborhood of $\theta_0$. That is, assume

$$\sup\left\{ \left| P_\theta[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge z] - (1 - \Phi(z)) \right| : |\theta - \theta_0| \le \epsilon(\theta_0) \right\} \to 0, \tag{5.4.47}$$

for some $\epsilon(\theta_0) > 0$. Let $Q_\gamma \equiv P_\theta$, $\gamma = \sqrt{n}(\theta - \theta_0)$, then

$$Q_\gamma[\hat\theta_n \ge c_n(\alpha, \theta_0)] \to 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \tag{5.4.48}$$

uniformly in $\gamma$. Furthermore, if $\varphi_n(X_1, \dots, X_n)$ is any sequence of (possibly randomized) critical (test) functions such that

$$E_{\theta_0}\varphi_n(X_1, \dots, X_n) \to \alpha, \tag{5.4.49}$$

then

$$\begin{aligned}
\limsup_n E_{\theta_0 + \gamma/\sqrt{n}}\,\varphi_n(X_1, \dots, X_n) &\le 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \quad \text{if } \gamma > 0, \\
\liminf_n E_{\theta_0 + \gamma/\sqrt{n}}\,\varphi_n(X_1, \dots, X_n) &\ge 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) \quad \text{if } \gamma < 0.
\end{aligned} \tag{5.4.50}$$

Note that (5.4.48) and (5.4.50) can be interpreted as saying that among all tests that are asymptotically level $\alpha$ (obey (5.4.49)) the test based on rejecting for large values of $\hat\theta_n$ is asymptotically uniformly most powerful (obey (5.4.50)) and has asymptotically smallest probability of type I error for $\theta < \theta_0$. In fact, these statements can only be interpreted as valid in a small neighborhood of $\theta_0$ because $\gamma$ fixed means $\theta \to \theta_0$. On the other hand, if $\sqrt{n}(\theta - \theta_0)$ tends to zero, then by (5.4.50), the power of tests with asymptotic level $\alpha$ tends to $\alpha$. If $\sqrt{n}(\theta - \theta_0)$ tends to infinity, the power of the test based on $\hat\theta_n$ tends to 1 by (5.4.48). In either case, the test based on $\hat\theta_n$ is still asymptotically MP.

Proof. Write

$$\begin{aligned}
P_\theta[\hat\theta_n \ge c_n(\alpha, \theta_0)] &= P_\theta[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge \sqrt{nI(\theta)}(c_n(\alpha, \theta_0) - \theta)] \\
&= P_\theta\left[\sqrt{nI(\theta)}(\hat\theta_n - \theta) \ge \sqrt{nI(\theta)}\left(\theta_0 - \theta + z_{1-\alpha}/\sqrt{nI(\theta_0)} + o(n^{-1/2})\right)\right].
\end{aligned} \tag{5.4.51}$$

If $\gamma = \sqrt{n}(\theta - \theta_0)$ is fixed, $I(\theta) = I(\theta_0 + \gamma/\sqrt{n}) \to I(\theta_0)$ because our uniformity assumption implies that $\theta \to I(\theta)$ is continuous (Problem 5.4.7). Thus,

$$\begin{aligned}
Q_\gamma[\hat\theta_n \ge c_n(\alpha, \theta_0)] &= 1 - \Phi\left(z_{1-\alpha}(1 + o(1)) + \sqrt{n(I(\theta_0) + o(1))}\,(\theta_0 - \theta) + o(1)\right) \\
&= 1 - \Phi(z_{1-\alpha} - \gamma\sqrt{I(\theta_0)}) + o(1)
\end{aligned} \tag{5.4.52}$$

and (5.4.48) follows.

To prove (5.4.50) note that by the Neyman-Pearson lemma, if $\gamma > 0$,

$$\begin{aligned}
E_{\theta_0 + \gamma/\sqrt{n}}\,\varphi_n(X_1, \dots, X_n) \le{}& P_{\theta_0 + \gamma/\sqrt{n}}\left[\sum_{i=1}^n \log\frac{p(X_i, \theta_0 + \gamma/\sqrt{n})}{p(X_i, \theta_0)} > d_n(\alpha, \theta_0)\right] \\
&+ \epsilon_n P_{\theta_0 + \gamma/\sqrt{n}}\left[\sum_{i=1}^n \log\frac{p(X_i, \theta_0 + \gamma/\sqrt{n})}{p(X_i, \theta_0)} = d_n(\alpha, \theta_0)\right]
\end{aligned} \tag{5.4.53}$$

where $p(x, \theta)$ denotes the density of $X_i$ and $d_n$, $\epsilon_n$ are uniquely chosen so that the right-hand side of (5.4.53) is $\alpha$ if $\gamma$ is 0.

Further Taylor expansion and probabilistic arguments of the type we have used show that the right-hand side of (5.4.53) tends to the right-hand side of (5.4.50) for all $\gamma$. The details are in Problem 5.4.5. $\Box$

The asymptotic results we have just established do not establish that the test that rejects for large values of $\hat\theta_n$ is necessarily good for all alternatives for any $n$.

The test $1[\hat\theta_n \ge c_n(\alpha, \theta_0)]$ of Theorems 5.4.4 and 5.4.5 in the future will be referred to as a Wald test. There are two other types of test that have the same asymptotic behavior. These are the likelihood ratio test and the score or Rao test.

It is easy to see that the likelihood ratio test for testing $H : \theta \le \theta_0$ versus $K : \theta > \theta_0$ is of the form

$$\text{"Reject if } \sum_{i=1}^n \log[p(X_i, \hat\theta_n)/p(X_i, \theta_0)]\,1(\hat\theta_n > \theta_0) \ge k_n(\theta_0, \alpha)\text{."}$$

It may be shown (Problem 5.4.8) that, for $\alpha \le \frac{1}{2}$, $k_n(\theta_0, \alpha) = z_{1-\alpha}^2/2 + o(1)$ and that if $\delta_{Wn}(X_1, \dots, X_n)$ is the critical function of the Wald test and $\delta_{Ln}(X_1, \dots, X_n)$ is the critical function of the LR test then, for all $\gamma$,

$$P_{\theta_0 + \gamma/\sqrt{n}}[\delta_{Ln}(X_1, \dots, X_n) = \delta_{Wn}(X_1, \dots, X_n)] \to 1. \tag{5.4.54}$$

Assertion (5.4.54) establishes that the test $\delta_{Ln}$ yields equality in (5.4.50) and, hence, is asymptotically most powerful as well.

Finally, note that the Neyman-Pearson LR test for $H : \theta = \theta_0$ versus $K : \theta = \theta_0 + \epsilon$, $\epsilon > 0$, rejects for large values of

$$\frac{1}{\epsilon}\left[\log p_n(X_1, \dots, X_n, \theta_0 + \epsilon) - \log p_n(X_1, \dots, X_n, \theta_0)\right]$$

where $p_n(X_1, \dots, X_n, \theta)$ is the joint density of $X_1, \dots, X_n$. For $\epsilon$ small, $n$ fixed, this is approximately the same as rejecting for large values of $\frac{\partial}{\partial\theta_0}\log p_n(X_1, \dots, X_n, \theta_0)$.

The preceding argument doesn't depend on the fact that $X_1, \dots, X_n$ are i.i.d. with common density or frequency function $p(x, \theta)$, and the test that rejects $H$ for large values of $\frac{\partial}{\partial\theta_0}\log p_n(X_1, \dots, X_n, \theta_0)$ is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming

$$\text{"Reject } H \text{ iff } \sum_{i=1}^n \frac{\partial}{\partial\theta_0}\log p(X_i, \theta_0) \ge T_n(\alpha, \theta_0)\text{."} \tag{5.4.55}$$

It is easy to see (Problem 5.4.15) that

$$T_n(\alpha, \theta_0) = z_{1-\alpha}\sqrt{nI(\theta_0)} + o(n^{1/2})$$

and that again if $\delta_{Rn}(X_1, \dots, X_n)$ is the critical function of the Rao test then

$$P_{\theta_0 + \gamma/\sqrt{n}}[\delta_{Rn}(X_1, \dots, X_n) = \delta_{Wn}(X_1, \dots, X_n)] \to 1, \tag{5.4.56}$$

(Problem 5.4.8) and the Rao test is asymptotically optimal.

Note that for all these tests and the confidence bounds of Section 5.4.5, $I(\theta_0)$, which may require numerical integration, can be replaced by $-n^{-1}\frac{d^2}{d\theta^2}l_n(\hat\theta_n)$ (Problem 5.4.10).
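
To make the asymptotic agreement of the Wald, likelihood ratio, and Rao tests concrete, here is a minimal Python sketch. It assumes, purely for illustration, i.i.d. exponential observations with rate $\theta$, so that $l(x, \theta) = \log\theta - \theta x$, $\hat\theta_n = 1/\bar X$, and $I(\theta) = 1/\theta^2$. The function names and the choice of model are our assumptions, not part of the text. Each statistic is standardized so that it is compared with $z_{1-\alpha}$; the LR statistic is put on that scale by taking a signed square root of twice the log likelihood ratio.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_statistics(xbar, n, theta0):
    """Standardized Wald, signed-root LR, and score statistics.

    Assumed model (illustrative only): X_1..X_n i.i.d. exponential with rate
    theta, so the MLE is 1/xbar and the Fisher information is 1/theta**2.
    """
    theta_hat = 1.0 / xbar
    # Wald: sqrt(n I(theta0)) (theta_hat - theta0)
    wald = math.sqrt(n) * (theta_hat - theta0) / theta0
    # Score: sum of d/dtheta log p(X_i, theta0), divided by sqrt(n I(theta0))
    score = math.sqrt(n) * (1.0 - theta0 * xbar)
    # Log likelihood ratio: sum of log p(X_i, theta_hat) / p(X_i, theta0)
    log_lr = n * (math.log(theta_hat / theta0) - 1.0 + theta0 * xbar)
    signed_root_lr = math.copysign(
        math.sqrt(max(2.0 * log_lr, 0.0)), theta_hat - theta0
    )
    return wald, signed_root_lr, score


def asymptotic_power(alpha, gamma, theta0):
    """Limit in (5.4.48) at theta0 + gamma/sqrt(n); sqrt(I(theta0)) = 1/theta0."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z - gamma / theta0)


if __name__ == "__main__":
    wald, lr, score = one_sided_statistics(xbar=0.49, n=10_000, theta0=2.0)
    print(wald, lr, score)  # three nearby values, each compared with z_{1-alpha}
    print(asymptotic_power(alpha=0.05, gamma=4.0, theta0=2.0))
```

For alternatives of the form $\theta_0 + \gamma/\sqrt{n}$ the three statistics differ by terms that vanish as $n$ grows, which is the content of (5.4.54) and (5.4.56) in this special case.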
5.4.5
Confidence Bounds

We define an asymptotic level $1 - \alpha$ lower confidence bound (LCB) $\underline{\theta}_n$ by the requirement that

$$P_\theta[\underline{\theta}_n \le \theta] \to 1 - \alpha \tag{5.4.57}$$

for all $\theta$ and similarly define asymptotic level $1 - \alpha$ UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:

(i) By using a natural pivot.

(ii) By inverting the testing regions derived in Section 5.4.4.

Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)–(A6), (A4'), and $I(\theta)$ finite for all $\theta$, it follows (Problem 5.4.9) that

$$\mathcal{L}_\theta\left(\sqrt{nI(\hat\theta_n)}(\hat\theta_n - \theta)\right) \to N(0, 1) \tag{5.4.58}$$

for all $\theta$ and, hence, an asymptotic level $1 - \alpha$ lower confidence bound is given by

$$\theta_n^* = \hat\theta_n - z_{1-\alpha}/\sqrt{nI(\hat\theta_n)}. \tag{5.4.59}$$

Turning to method (ii), inversion of $\delta_{Wn}$ gives formally

$$\theta_{n1}^* = \inf\{\theta : c_n(\alpha, \theta) \ge \hat\theta_n\} \tag{5.4.60}$$

or if we use the approximation $\tilde{c}_n(\alpha, \theta) = \theta + z_{1-\alpha}/\sqrt{nI(\theta)}$, (5.4.41),

$$\theta_{n2}^* = \inf\{\theta : \tilde{c}_n(\alpha, \theta) \ge \hat\theta_n\}. \tag{5.4.61}$$

In fact neither $\theta_{n1}^*$ nor $\theta_{n2}^*$ properly inverts the tests unless $c_n(\alpha, \theta)$ and $\tilde{c}_n(\alpha, \theta)$ are increasing in $\theta$. The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2.

If it applies and can be computed, $\theta_{n1}^*$ is preferable because this bound is not only approximately but genuinely level $1 - \alpha$. But computationally it is often hard to implement because $c_n(\alpha, \theta)$ needs, in general, to be computed by simulation for a grid of $\theta$ values. Typically, (5.4.59) or some equivalent alternatives (Problem 5.4.10) are preferred but can be quite inadequate (Problem 5.4.11).

These bounds $\theta_n^*$, $\theta_{n1}^*$, $\theta_{n2}^*$, are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
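
The difference between the pivot bound (5.4.59) and the inverted bound (5.4.61) can be seen numerically. The sketch below assumes, as an illustration only, Bernoulli observations, for which $\hat\theta_n = \bar X$ and $I(\theta) = 1/[\theta(1 - \theta)]$; the function names are ours. In this model the infimum in (5.4.61) is the smaller root of the quadratic $(\hat\theta_n - \theta)^2 = z_{1-\alpha}^2\,\theta(1 - \theta)/n$, which is what `lcb_inverted` computes in closed form.

```python
import math
from statistics import NormalDist


def c_tilde(theta, n, alpha):
    """Approximate critical value theta + z_{1-alpha} / sqrt(n I(theta))."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return theta + z * math.sqrt(theta * (1.0 - theta) / n)


def lcb_pivot(successes, n, alpha):
    """Bound of the form (5.4.59): plug the MLE into the information."""
    theta_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - alpha)
    return theta_hat - z * math.sqrt(theta_hat * (1.0 - theta_hat) / n)


def lcb_inverted(successes, n, alpha):
    """Bound of the form (5.4.61): smallest theta with c_tilde >= theta_hat.

    Assumes 0 < successes < n so that the MLE is interior.
    """
    theta_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - alpha)
    centre = theta_hat + z * z / (2.0 * n)
    half_width = z * math.sqrt(
        theta_hat * (1.0 - theta_hat) / n + z * z / (4.0 * n * n)
    )
    return (centre - half_width) / (1.0 + z * z / n)


if __name__ == "__main__":
    for n in (20, 200, 2000):
        s = int(0.3 * n)
        print(n, lcb_pivot(s, n, 0.05), lcb_inverted(s, n, 0.05))
```

The two bounds differ for small $n$ and approach one another as $n$ grows, in line with the asymptotic equivalence noted above. The genuinely level $1 - \alpha$ bound $\theta_{n1}^*$ is not computed here because it requires the exact critical values $c_n(\alpha, \theta)$.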
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of $\theta$ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary $P$. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as $n \to \infty$ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular model in which $\Theta$ is open $\subset R$ or $\Theta = \{\theta_1, \dots, \theta_k\}$ finite, and $\theta$ is identifiable. Most of the questions we address and answer are under the assumption that $\boldsymbol{\theta} = \theta$, an arbitrary specified value, or in frequentist terms, that $\theta$ is true.

Consistency

The first natural question is whether the Bayes posterior distribution as $n \to \infty$ concentrates all mass more and more tightly around $\theta$. Intuitively this means that the data that are coming from $P_\theta$ eventually wipe out any prior belief that parameter values not close to $\theta$ are likely. Formalizing this statement about the posterior distribution, $\Pi(\cdot \mid X_1, \dots, X_n)$, which is a function-valued statistic, is somewhat subtle in general. But for $\Theta = \{\theta_1, \dots, \theta_k\}$ it is straightforward. Let

$$\pi(\theta \mid X_1, \dots, X_n) \equiv P[\boldsymbol{\theta} = \theta \mid X_1, \dots, X_n]. \tag{5.5.1}$$

Then we say that $\Pi(\cdot \mid X_1, \dots, X_n)$ is consistent iff for all $\theta \in \Theta$,

$$P_\theta[\,|\pi(\theta \mid X_1, \dots, X_n) - 1| \ge \epsilon\,] \to 0 \tag{5.5.2}$$

for all $\epsilon > 0$. There is a slightly stronger definition: $\Pi(\cdot \mid X_1, \dots, X_n)$ is a.s. consistent iff for all $\theta \in \Theta$,

$$\pi(\theta \mid X_1, \dots, X_n) \to 1 \text{ a.s. } P_\theta. \tag{5.5.3}$$

General a.s. consistency is not hard to formulate:

$$\Pi(\cdot \mid X_1, \dots, X_n) \Rightarrow \delta_{\{\theta\}} \text{ a.s. } P_\theta \tag{5.5.4}$$

where $\Rightarrow$ denotes convergence in law and $\delta_{\{\theta\}}$ is point mass at $\theta$. There is a completely satisfactory result for $\Theta$ finite.

Theorem 5.5.1. Let $\pi_j \equiv P[\boldsymbol{\theta} = \theta_j]$, $j = 1, \dots, k$ denote the prior distribution of $\boldsymbol{\theta}$. Then $\Pi(\cdot \mid X_1, \dots, X_n)$ is consistent (a.s. consistent) iff $\pi_j > 0$ for $j = 1, \dots, k$.

Proof. Let $p(\cdot, \theta)$ denote the frequency or density function of $X$. The necessity of the condition is immediate because $\pi_j = 0$ for some $j$ implies that $\pi(\theta_j \mid X_1, \dots, X_n) = 0$ for all $X_1, \dots, X_n$ because, by (1.2.8),

$$\pi(\theta_j \mid X_1, \dots, X_n) = P[\boldsymbol{\theta} = \theta_j \mid X_1, \dots, X_n] = \frac{\pi_j \prod_{i=1}^n p(X_i, \theta_j)}{\sum_{a=1}^k \pi_a \prod_{i=1}^n p(X_i, \theta_a)}. \tag{5.5.5}$$

Intuitively, no amount of data can convince a Bayesian who has decided a priori that $\theta_j$ is impossible.

On the other hand, suppose all $\pi_j$ are positive. If the true $\theta$ is $\theta_j$ or equivalently $\boldsymbol{\theta} = \theta_j$, then

$$\log\frac{\pi(\theta_a \mid X_1, \dots, X_n)}{\pi(\theta_j \mid X_1, \dots, X_n)} = n\left(\frac{1}{n}\log\frac{\pi_a}{\pi_j} + \frac{1}{n}\sum_{i=1}^n \log\frac{p(X_i, \theta_a)}{p(X_i, \theta_j)}\right).$$

By the weak (respectively strong) LLN, under $P_{\theta_j}$,

$$\frac{1}{n}\sum_{i=1}^n \log\frac{p(X_i, \theta_a)}{p(X_i, \theta_j)} \to E_{\theta_j}\left(\log\frac{p(X_1, \theta_a)}{p(X_1, \theta_j)}\right)$$

in probability (respectively a.s.). But $E_{\theta_j}\left(\log\frac{p(X_1, \theta_a)}{p(X_1, \theta_j)}\right) < 0$, by Shannon's inequality, if $\theta_a \ne \theta_j$. Therefore,

$$\log\frac{\pi(\theta_a \mid X_1, \dots, X_n)}{\pi(\theta_j \mid X_1, \dots, X_n)} \to -\infty,$$

in the appropriate sense, and the theorem follows. $\Box$

Remark 5.5.1. We have proved more than is stated. Namely, that for each $\theta \in \Theta$, $P_\theta[\boldsymbol{\theta} \ne \theta \mid X_1, \dots, X_n] \to 0$ exponentially.
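
A small numerical illustration of Theorem 5.5.1 may help. The Python sketch below evaluates the posterior (5.5.5) on a finite parameter set for Bernoulli data, working with logarithms to avoid underflow. The Bernoulli model, the three-point parameter set, and the function name are assumptions made only for this illustration.

```python
import math


def posterior_finite(prior, thetas, successes, n):
    """Posterior probabilities (5.5.5) on a finite set of Bernoulli parameters.

    prior[j] is pi_j, thetas[j] is theta_j in (0, 1); the data enter only
    through the number of successes in n trials. Assumes some pi_j > 0.
    """
    log_weights = []
    for pi_j, theta_j in zip(prior, thetas):
        if pi_j == 0.0:
            # A parameter value with zero prior mass keeps zero posterior mass.
            log_weights.append(-math.inf)
        else:
            log_weights.append(
                math.log(pi_j)
                + successes * math.log(theta_j)
                + (n - successes) * math.log(1.0 - theta_j)
            )
    peak = max(log_weights)
    weights = [math.exp(v - peak) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    thetas = [0.3, 0.5, 0.7]
    uniform = [1 / 3, 1 / 3, 1 / 3]
    # Data with a 70% success rate, as if generated under theta = 0.7.
    for n in (10, 100, 1000):
        print(n, posterior_finite(uniform, thetas, int(0.7 * n), n))
    # A prior that rules out 0.7 can never recover it.
    print(posterior_finite([0.5, 0.5, 0.0], thetas, 700, 1000))
```

With all prior weights positive the posterior mass on the value matching the data approaches 1 rapidly, in the spirit of Remark 5.5.1; with a zero prior weight the corresponding posterior weight stays at 0 regardless of the data.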
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution

Under conditions A0–A6 for $\rho(x, \theta) = l(x, \theta) \equiv \log p(x, \theta)$, we showed in Section 5.4 that if $\hat\theta$ is the MLE,

$$\mathcal{L}_\theta(\sqrt{n}(\hat\theta - \theta)) \to N(0, I^{-1}(\theta)). \tag{5.5.6}$$

Consider $\mathcal{L}(\sqrt{n}(\boldsymbol{\theta} - \hat\theta) \mid X_1, \dots, X_n)$, the posterior probability distribution of $\sqrt{n}(\boldsymbol{\theta} - \hat\theta(X_1, \dots, X_n))$, where we emphasize that $\hat\theta$ depends only on the data and is a constant given $X_1, \dots, X_n$. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in $P_\theta$ probability by convergence a.s. $P_\theta$. We also add,

A7: For all $\theta$, and all $\delta > 0$ there exists $\epsilon(\delta, \theta) > 0$ such that

$$P_\theta\left[\sup\left\{\frac{1}{n}\sum_{i=1}^n [l(X_i, \theta') - l(X_i, \theta)] : |\theta' - \theta| \ge \delta\right\} \le -\epsilon(\delta, \theta)\right] \to 1.$$

A8: The prior distribution has a density $\pi(\cdot)$ on $\Theta$ such that $\pi(\cdot)$ is continuous and positive at all $\theta$.

Remarkably,

Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

$$\mathcal{L}(\sqrt{n}(\boldsymbol{\theta} - \hat\theta) \mid X_1, \dots, X_n) \to N(0, I^{-1}(\theta)) \tag{5.5.7}$$

a.s. under $P_\theta$ for all $\theta$.

We can rewrite (5.5.7) more usefully as

$$\sup_x \left| P[\sqrt{n}(\boldsymbol{\theta} - \hat\theta) \le x \mid X_1, \dots, X_n] - \Phi(x\sqrt{I(\theta)}) \right| \to 0 \tag{5.5.8}$$

for all $\theta$ a.s. $P_\theta$ and, of course, the statement holds for our usual and weaker convergence in $P_\theta$ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

$$\sup_x \left| P[\sqrt{n}(\boldsymbol{\theta} - \hat\theta) \le x \mid X_1, \dots, X_n] - \Phi(x\sqrt{I(\hat\theta)}) \right| \to 0 \tag{5.5.9}$$

a.s. $P_\theta$ for all $\theta$.

Remarks

(1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in $P_\theta$ probability if A4 and A5 are used rather than their strong forms; see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of $\hat\theta$ in a regular model.

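Proof. We compute the posterior density of $\sqrt{n}(\boldsymbol{\theta} - \hat\theta)$ as

$$q_n(t) = c_n^{-1}\,\pi\!\left(\hat\theta + \frac{t}{\sqrt{n}}\right)\prod_{i=1}^n p\!\left(X_i, \hat\theta + \frac{t}{\sqrt{n}}\right) \tag{5.5.10}$$

where $c_n = c_n(X_1, \dots, X_n)$ is given by

$$c_n(X_1, \dots, X_n) = \int_{-\infty}^{\infty} \pi\!\left(\hat\theta + \frac{s}{\sqrt{n}}\right)\prod_{i=1}^n p\!\left(X_i, \hat\theta + \frac{s}{\sqrt{n}}\right) ds.$$

Divide top and bottom of (5.5.10) by $\prod_{i=1}^n p(X_i, \hat\theta)$ to obtain

$$q_n(t) = d_n^{-1}\,\pi\!\left(\hat\theta + \frac{t}{\sqrt{n}}\right)\exp\left\{\sum_{i=1}^n \left[l\!\left(X_i, \hat\theta + \frac{t}{\sqrt{n}}\right) - l(X_i, \hat\theta)\right]\right\} \tag{5.5.11}$$

where $l(x, \theta) = \log p(x, \theta)$ and

$$d_n = \int_{-\infty}^{\infty} \pi\!\left(\hat\theta + \frac{s}{\sqrt{n}}\right)\exp\left\{\sum_{i=1}^n \left[l\!\left(X_i, \hat\theta + \frac{s}{\sqrt{n}}\right) - l(X_i, \hat\theta)\right]\right\} ds.$$

We claim that

$$P_\theta\left[d_n q_n(t) \to \pi(\theta)\exp\left\{-\frac{t^2 I(\theta)}{2}\right\} \text{ for all } t\right] = 1 \tag{5.5.12}$$

for all $\theta$. To establish this note that

(a) $\sup\left\{\left|\pi\!\left(\hat\theta + \frac{t}{\sqrt{n}}\right) - \pi(\theta)\right| : |t| \le M\right\} \to 0$ a.s. for all $M$ because $\hat\theta$ is a.s. consistent and $\pi$ is continuous.

(b) Expanding,

$$\sum_{i=1}^n \left[l\!\left(X_i, \hat\theta + \frac{t}{\sqrt{n}}\right) - l(X_i, \hat\theta)\right] = \frac{t^2}{2}\,\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i, \theta^*(t)) \tag{5.5.13}$$

where $|\hat\theta - \theta^*(t)| \le |t|/\sqrt{n}$. We use $\sum_{i=1}^n \frac{\partial l}{\partial\theta}(X_i, \hat\theta) = 0$ here. By A4(a.s.), A5(a.s.),

$$\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i, \theta^*(t)) - \frac{1}{n}\sum_{i=1}^n \frac{\partial^2 l}{\partial\theta^2}(X_i, \theta)\right| : |t| \le M\right\} \to 0,$$

for all $M$, a.s. $P_\theta$. Using (5.5.13), the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),

$$P_\theta\left[d_n q_n(t) \to \pi(\theta)\exp\left\{E_\theta\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta)\,\frac{t^2}{2}\right\} \text{ for all } t\right] = 1. \tag{5.5.14}$$

Using A6 we obtain (5.5.12).

Now consider

$$\begin{aligned}
d_n &= \int_{-\infty}^{\infty} \pi\!\left(\hat\theta + \frac{s}{\sqrt{n}}\right)\exp\left\{\sum_{i=1}^n \left[l\!\left(X_i, \hat\theta + \frac{s}{\sqrt{n}}\right) - l(X_i, \hat\theta)\right]\right\} ds \\
&= \int_{|s| \le \delta\sqrt{n}} d_n q_n(s)\,ds + \sqrt{n}\int \pi(t)\exp\left\{\sum_{i=1}^n (l(X_i, t) - l(X_i, \hat\theta))\right\} 1(|t - \hat\theta| > \delta)\,dt.
\end{aligned} \tag{5.5.15}$$

By A8 and A7,

$$P_\theta\left[\sup\left\{\exp\left\{\sum_{i=1}^n (l(X_i, t) - l(X_i, \hat\theta))\right\} : |t - \hat\theta| > \delta\right\} \le e^{-n\epsilon(\delta, \theta)}\right] \to 1 \tag{5.5.16}$$

for all $\delta$ so that the second term in (5.5.15) is bounded by $\sqrt{n}\,e^{-n\epsilon(\delta, \theta)} \to 0$ a.s. $P_\theta$ for all $\delta > 0$. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), there exists $\delta(\theta) > 0$ such that

$$P_\theta\left[d_n q_n(t) \le 2\pi(\theta)\exp\left\{\frac{1}{4}E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}(X_1, \theta)\right)t^2\right\} \text{ for all } |t| \le \delta(\theta)\sqrt{n}\right] \to 1. \tag{5.5.17}$$

By (5.5.15) and (5.5.16), for all $\delta > 0$,

$$P_\theta\left[d_n - \int_{|s| \le \delta\sqrt{n}} d_n q_n(s)\,ds \to 0\right] = 1. \tag{5.5.18}$$

Finally, apply the dominated convergence theorem, Theorem B.7.5, to $d_n q_n(s)\,1(|s| \le \delta(\theta)\sqrt{n})$, using (5.5.14) and (5.5.17) to conclude that, a.s. $P_\theta$,

$$d_n \to \pi(\theta)\int_{-\infty}^{\infty}\exp\left\{-\frac{s^2 I(\theta)}{2}\right\} ds = \frac{\pi(\theta)\sqrt{2\pi}}{\sqrt{I(\theta)}}. \tag{5.5.19}$$

Hence, a.s. $P_\theta$,

$$q_n(t) \to \sqrt{I(\theta)}\,\varphi(t\sqrt{I(\theta)})$$

where $\varphi$ is the standard Gaussian density and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. $\Box$

Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a $N(\theta, \sigma^2)$ distribution with $\sigma^2$ known and we put a $N(\eta, \tau^2)$ prior on $\theta$. Then the posterior distribution of $\boldsymbol{\theta}$ is $N\!\left(\omega_{1n}\eta + \omega_{2n}\bar X,\ \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right)$ where

$$\omega_{1n} = \frac{\sigma^2}{n\tau^2 + \sigma^2}, \quad \omega_{2n} = 1 - \omega_{1n}. \tag{5.5.20}$$

Evidently, as $n \to \infty$, $\omega_{1n} \to 0$, $\bar X \to \theta$, a.s., if $\boldsymbol{\theta} = \theta$, and $\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \to 0$. That is, the posterior distribution has mean approximately $\theta$ and variance approximately 0, for $n$ large, or equivalently the posterior is close to point mass at $\theta$ as we expect from Theorem 5.5.1. Because $\hat\theta = \bar X$, $\sqrt{n}(\boldsymbol{\theta} - \hat\theta)$ has posterior distribution

$$N\!\left(\sqrt{n}\,\omega_{1n}(\eta - \bar X),\ n\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1}\right).$$

Now, $\sqrt{n}\,\omega_{1n} = O(n^{-1/2}) = o(1)$ and $n\left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \to \sigma^2 = I^{-1}(\theta)$ and we have directly established the conclusion of Theorem 5.5.2. $\Box$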
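
The limits in Example 5.5.1 are easy to check numerically. The following Python sketch evaluates the posterior mean and variance from (5.5.20) and the centering and variance of the posterior of $\sqrt{n}(\boldsymbol{\theta} - \bar X)$. The function names and the particular numbers ($\sigma^2 = 4$, $\eta = 0$, $\tau^2 = 1$, $\bar X = 1.5$) are illustrative assumptions only.

```python
import math


def normal_posterior(xbar, n, sigma2, eta, tau2):
    """Posterior mean and variance of theta for N(theta, sigma2) data
    with a N(eta, tau2) prior, using the weights in (5.5.20)."""
    w1 = sigma2 / (n * tau2 + sigma2)
    w2 = 1.0 - w1
    mean = w1 * eta + w2 * xbar
    variance = 1.0 / (n / sigma2 + 1.0 / tau2)
    return mean, variance


def scaled_posterior(xbar, n, sigma2, eta, tau2):
    """Mean and variance of the posterior of sqrt(n) (theta - xbar)."""
    mean, variance = normal_posterior(xbar, n, sigma2, eta, tau2)
    return math.sqrt(n) * (mean - xbar), n * variance


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        centre, spread = scaled_posterior(1.5, n, 4.0, 0.0, 1.0)
        # centre shrinks to 0 and spread approaches sigma2 = 1 / I(theta)
        print(n, centre, spread)
```

The centering term is of order $n^{-1/2}$ and the scaled variance approaches $\sigma^2$, which is the statement of Theorem 5.5.2 in this conjugate case.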
Example 5.5.2. Posterior Behavior in the Binomial-Beta Model. (Example 3.2.3 continued). If we observe $S_n$ with a binomial, $\mathcal{B}(n, \theta)$, distribution, or equivalently we observe $X_1, \dots, X_n$ i.i.d. Bernoulli $(1, \theta)$ and put a beta, $\beta(r, s)$ prior on $\theta$, then, as in Example 3.2.3, $\boldsymbol{\theta}$ has posterior $\beta(S_n + r, n + s - S_n)$. We have shown in Problem 5.3.20 that if $U_{a,b}$ has a $\beta(a, b)$ distribution, then as $a \to \infty$, $b \to \infty$,

$$\left[\frac{(a + b)^3}{ab}\right]^{1/2}\left(U_{a,b} - \frac{a}{a + b}\right) \xrightarrow{\mathcal{L}} N(0, 1). \tag{5.5.21}$$

If $0 < \theta < 1$ is true, $S_n/n \to \theta$ a.s. so that $S_n + r \to \infty$, $n + s - S_n \to \infty$ a.s. $P_\theta$. By identifying $a$ with $S_n + r$ and $b$ with $n + s - S_n$ we conclude after some algebra that because $\hat\theta = \bar X$,

$$\sqrt{n}(\boldsymbol{\theta} - \bar X) \xrightarrow{\mathcal{L}} N(0, \theta(1 - \theta))$$

a.s. $P_\theta$, as claimed by Theorem 5.5.2. $\Box$
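
The convergence in Example 5.5.2 can also be examined numerically through the sup distance appearing in (5.5.9). The sketch below assumes a uniform prior ($r = s = 1$), evaluates the posterior on a grid by a plain Riemann sum rather than through a beta distribution routine, and compares its distribution function with the $N(\hat\theta, \hat\theta(1 - \hat\theta)/n)$ limit. The grid size, the function name, and the choice of prior are illustrative assumptions, and the supremum is taken only over grid points.

```python
import math
from statistics import NormalDist


def sup_distance_to_normal(successes, n, grid_size=20001):
    """Approximate sup_x |P[sqrt(n)(theta - theta_hat) <= x | data]
    - Phi(x sqrt(I(theta_hat)))| for Bernoulli data and a uniform prior.

    Assumes 0 < successes < n. The posterior is proportional to
    theta^successes (1 - theta)^(n - successes) on (0, 1).
    """
    theta_hat = successes / n
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(1, grid_size - 1)]  # interior points only
    log_post = [
        successes * math.log(t) + (n - successes) * math.log(1.0 - t)
        for t in grid
    ]
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    limit = NormalDist(theta_hat, math.sqrt(theta_hat * (1.0 - theta_hat) / n))
    running, worst = 0.0, 0.0
    for t, w in zip(grid, weights):
        running += w / total  # Riemann-sum posterior distribution function
        worst = max(worst, abs(running - limit.cdf(t)))
    return worst


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, sup_distance_to_normal(int(0.3 * n), n))
```

The distance decreases as $n$ grows with $S_n/n$ held fixed, which is what Corollary 5.5.1 predicts; the numerical values themselves depend on the grid and should be read as approximations.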
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.

(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.

Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let $\hat\theta$ be the MLE of $\theta$ and let $\theta^*$ be the median of the posterior distribution of $\boldsymbol{\theta}$. Then

(i)
$$\sqrt{n}(\theta^* - \hat\theta) \to 0 \tag{5.5.22}$$

a.s. $P_\theta$ for all $\theta$. Consequently,

$$\theta^* = \theta + \frac{1}{n}\sum_{i=1}^n I^{-1}(\theta)\frac{\partial l}{\partial\theta}(X_i, \theta) + o_{P_\theta}(n^{-1/2}) \tag{5.5.23}$$

and

$$\mathcal{L}_\theta(\sqrt{n}(\theta^* - \theta)) \to N(0, I^{-1}(\theta)). \tag{5.5.24}$$

(ii)
$$E(\sqrt{n}(|\boldsymbol{\theta} - \hat\theta| - |\boldsymbol{\theta}|) \mid X_1, \dots, X_n) = \min_d E(\sqrt{n}(|\boldsymbol{\theta} - d| - |\boldsymbol{\theta}|) \mid X_1, \dots, X_n) + o_P(1). \tag{5.5.25}$$

Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions $l_n(\theta, d) = \sqrt{n}(|\theta - d| - |\theta|)$. But the Bayes estimates for $l_n$ and for $l(\theta, d) = |\theta - d|$ must agree whenever $E(|\boldsymbol{\theta}| \mid X_1, \dots, X_n) < \infty$. (Note that if $E(|\boldsymbol{\theta}| \mid X_1, \dots, X_n) = \infty$, then the posterior Bayes risk under $l$ is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.

Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22)

$$\sup_x \left| P[\sqrt{n}(\boldsymbol{\theta} - \hat\theta) \le x \mid X_1, \dots, X_n] - \Phi(x\sqrt{I(\theta)}) \right| \to 0 \text{ a.s. } P_\theta. \tag{5.5.26}$$

But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of $\sqrt{n}(\boldsymbol{\theta} - \hat\theta)$ tends to 0, the median of $N(0, I^{-1}(\theta))$, a.s. $P_\theta$. But the median of the posterior of $\sqrt{n}(\boldsymbol{\theta} - \hat\theta)$ is $\sqrt{n}(\theta^* - \hat\theta)$, and (5.5.22) follows. To prove (5.5.24) note that

$$\left|\,|\boldsymbol{\theta} - \hat\theta| - |\boldsymbol{\theta} - \theta^*|\,\right| \le |\hat\theta - \theta^*|$$

and, hence, that

$$E(\sqrt{n}(|\boldsymbol{\theta} - \hat\theta| - |\boldsymbol{\theta} - \theta^*|) \mid X_1, \dots, X_n) \le \sqrt{n}\,|\hat\theta - \theta^*| \to 0 \tag{5.5.27}$$

a.s. $P_\theta$, for all $\theta$. Because a.s. convergence $P_\theta$ for all $\theta$ implies a.s. convergence $P$ (B.?), claim (5.5.24) follows and, hence,

$$E(\sqrt{n}(|\boldsymbol{\theta} - \hat\theta| - |\boldsymbol{\theta}|) \mid X_1, \dots, X_n) = E(\sqrt{n}(|\boldsymbol{\theta} - \theta^*| - |\boldsymbol{\theta}|) \mid X_1, \dots, X_n) + o_P(1). \tag{5.5.28}$$

Because by Problem 1.4.7 and Proposition 3.2.1, $\theta^*$ is the Bayes estimate for $l_n(\theta, d)$, (5.5.25) and the theorem follows. $\Box$

Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
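
Claim (5.5.22) can be illustrated in the binomial setting of Example 5.5.2. The Python sketch below assumes a uniform prior, locates the posterior median on a grid, and reports $\sqrt{n}\,|\theta^* - \hat\theta|$. The grid-based median, the function names, and the sample sizes are assumptions made for illustration; the median is accurate only to within the grid spacing.

```python
import math


def posterior_median(successes, n, grid_size=20001):
    """Grid approximation to the posterior median of theta for Bernoulli
    data with a uniform prior (posterior proportional to
    theta^successes (1 - theta)^(n - successes)). Assumes 0 < successes < n."""
    step = 1.0 / (grid_size - 1)
    grid = [i * step for i in range(1, grid_size - 1)]
    log_post = [
        successes * math.log(t) + (n - successes) * math.log(1.0 - t)
        for t in grid
    ]
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]
    half = 0.5 * sum(weights)
    running = 0.0
    for t, w in zip(grid, weights):
        running += w
        if running >= half:
            return t  # first grid point with at least half the posterior mass
    return grid[-1]


def scaled_median_gap(successes, n):
    """sqrt(n) |theta* - theta_hat| with theta_hat the MLE successes / n."""
    theta_hat = successes / n
    return math.sqrt(n) * abs(posterior_median(successes, n) - theta_hat)


if __name__ == "__main__":
    for n in (20, 200, 2000):
        print(n, scaled_median_gap(int(0.3 * n), n))
```

The scaled gap shrinks as $n$ increases, so the posterior median and the MLE agree to order $n^{-1/2}$ in this example, which is what part (i) of Theorem 5.5.3 asserts.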
Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on $\hat\theta$ agree with Bayesian procedures to first order.

Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

$$C_n(X_1, \dots, X_n) = \{\theta : \pi(\theta \mid X_1, \dots, X_n) \ge c_n\},$$

where $c_n$ is chosen so that $\pi(C_n \mid X_1, \dots, X_n) = 1 - \alpha$, be the Bayes credible region defined in Section 4.7. Let $I_n(\gamma)$ be the asymptotically level $1 - \gamma$ optimal interval based on $\hat\theta$, given by

$$I_n(\gamma) = [\hat\theta - d_n(\gamma),\ \hat\theta + d_n(\gamma)]$$

where $d_n(\gamma) = z\!\left(1 - \frac{\gamma}{2}\right)\Big/\sqrt{nI(\hat\theta)}$. Then, for every $\epsilon > 0$, $\theta$,

$$P_\theta[I_n(\alpha + \epsilon) \subset C_n(X_1, \dots, X_n) \subset I_n(\alpha - \epsilon)] \to 1. \tag{5.5.29}$$

The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of $\sqrt{n}(\boldsymbol{\theta} - \hat\theta)$ converges to the $N(0, I^{-1}(\theta))$ density uniformly over compact neighborhoods of 0 for each fixed $\theta$, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than $n^{-1/2}$ do involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher $n^{-1}$ order (see Schervish, 1995).
Testing

Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of $\theta_0$ given $X_1, \dots, X_n$ if $H$ is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying one point $\theta_0$ we consider indifference regions where $H$ specifies $[\theta_0, \theta_0 + \Delta)$ or $(\theta_0 - \Delta, \theta_0 + \Delta)$, then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.

Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established the so-called Bernstein-von Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the so-called Bernstein-von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose $X_1, \dots, X_n$ are i.i.d. as $X \sim F$, where $F$ has median $F^{-1}\!\left(\frac{1}{2}\right)$ and a continuous case density.

(a) Show that, if $n = 2k + 1$,

$$E_F\,\text{med}(X_1, \dots, X_n) = n\binom{2k}{k}\int_0^1 F^{-1}(t)\,t^k(1 - t)^k\,dt,$$

$$E_F\,\text{med}^2(X_1, \dots, X_n) = n\binom{2k}{k}\int_0^1 [F^{-1}(t)]^2\,t^k(1 - t)^k\,dt.$$

(b) Suppose $F$ is uniform, $\mathcal{U}(0, 1)$. Find the MSE of the sample median for $n = 1$, 3, and 5.
2. Suppose $Z \sim N(\mu, 1)$ and $V$ is independent of $Z$ with distribution $\chi_m^2$. Then $T \equiv Z\big/\left(\frac{V}{m}\right)^{1/2}$ is said to have a noncentral $t$ distribution with noncentrality $\mu$ and $m$ degrees of freedom. See Section 4.9.2.

(a) Show that

$$P[T \le t] = 2m\int_0^\infty \Phi(tw - \mu)\,f_m(mw^2)\,w\,dw$$

where $f_m(w)$ is the $\chi_m^2$ density, and $\Phi$ is the normal distribution function.

(b) If $X_1, \dots, X_n$ are i.i.d. $N(\mu, \sigma^2)$ show that $\sqrt{n}\,\bar X\Big/\left(\frac{1}{n - 1}\sum(X_i - \bar X)^2\right)^{1/2}$ has a noncentral $t$ distribution with noncentrality parameter $\sqrt{n}\mu/\sigma$ and $n - 1$ degrees of freedom.

(c) Show that $T^2$ in (a) has a noncentral $\mathcal{F}_{1,m}$ distribution with noncentrality parameter $\mu^2$. Deduce that the density of $T$ is

$$p(t) = 2\sum_{i=0}^\infty P[R = i] \cdot f_{2i+1}(t^2)\,[\varphi(t - \mu)1(t > 0) + \varphi(t + \mu)1(t < 0)]$$

where $R$ is given in Problem B.3.12.

Hint: Condition on $|T|$.
3. Show that if $P[|X| \le 1] = 1$, then $\text{Var}(X) \le 1$ with equality iff $X = \pm 1$ with probability $\frac{1}{2}$.

Hint: $\text{Var}(X) \le EX^2$.

4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of $n$ and $\epsilon$ through $\sqrt{n}\epsilon$.

(a) Show that the ratio of the Hoeffding function $h(\sqrt{n}\epsilon)$ to the Chebychev function $c(\sqrt{n}\epsilon)$ tends to 0 as $\sqrt{n}\epsilon \to \infty$ so that $h(\cdot)$ is arbitrarily better than $c(\cdot)$ in the tails.

(b) Show that the normal approximation $2\Phi\!\left(\frac{\sqrt{n}\epsilon}{\sigma}\right) - 1$ gives lower results than $h$ in the tails if $P[|X| \le 1] = 1$ because, if $\sigma^2 \le 1$, $1 - \Phi(t) \sim \varphi(t)/t$ as $t \to \infty$.

Note: Hoeffding (1963) exhibits better bounds for known $\sigma^2$.

5. Suppose $\lambda : R \to R$ has $\lambda(0) = 0$, is bounded, and has a bounded second derivative $\lambda''$. Show that if $X_1, \dots, X_n$ are i.i.d., $EX_1 = \mu$ and $\text{Var}\,X_1 = \sigma^2 < \infty$, then

$$E\lambda(|\bar X - \mu|) = \lambda'(0)\frac{\sigma}{\sqrt{n}}\sqrt{\frac{2}{\pi}} + O\!\left(\frac{1}{n}\right) \text{ as } n \to \infty.$$

Hint: $\sqrt{n}E(\lambda(|\bar X - \mu|) - \lambda(0)) = E\lambda'(0)\sqrt{n}|\bar X - \mu| + E\left(\frac{\lambda''}{2}(\tilde X - \mu)(\bar X - \mu)^2\right)$ where $|\tilde X - \mu| \le |\bar X - \mu|$. The last term is $\le \sup_x |\lambda''(x)|\sigma^2/n$ and the first tends to $\lambda'(0)\sigma\int_{-\infty}^{\infty}|z|\varphi(z)\,dz$ by Remark B.7.1(2).

1. Using the notation of Theorem 5.2.1, show that
"
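2. Let $X_1, \dots, X_n$ be i.i.d. $N(\mu, \sigma^2)$. Show that for all $n \ge 1$, all $\epsilon > 0$,

$$\sup_\sigma P_{(\mu, \sigma)}[|\bar X - \mu| \ge \epsilon] = 1.$$

Hint: Let $\sigma \to \infty$.

3. Establish (5.2.5).

Hint: $|\hat q_n - q(p)| \ge \epsilon \Rightarrow |\hat p_n - p| \ge \omega^{-1}(\epsilon)$.

4. Let $(U_i, V_i)$, $1 \le i \le n$, be i.i.d. $\sim P \in \mathcal{P}$.

(a) Let $\gamma(P) = P[U_1 > 0, V_1 > 0]$. Show that if $P = N(0, 0, 1, 1, \rho)$, then

$$\rho = \sin 2\pi\left(\gamma(P) - \frac{1}{4}\right).$$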
o)} =0 where S( 0) is the 0 ball about Therefore. eol > A} > 0 for some A < 00. n Lt= ". eo) : e' E S(O. 05) where ao is known.lO)2)) .p(X. then is a consistent estimate of p. N (/J. {p(X" 0) . Show that the maximum contrast estimate 8 is consistent.e)1 : e' E 5'(e. ". (1 (J. Hint: sup 1..\} c U s(ejJ(e j=l T J) {~ t n . t e. (a) Show that condition (5.. (Ii) Show that condition (5. A]. . V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00. .p(X. Eo. inf{p(X. Suppose Xl. sup{lp(X.l)2 _ + tTl = o:J.~ .i. and < > 0 there is o{0) > 0 such that e. inf{p(X.(ii) holds. e) is continuous.e'l < .n 1 (XiJ.e')p(X. where A is an arbitrary positive and finite constant.p(X. .•. Hint: K can be taken as [A. (c) Suppose p{P) is defined generally as Covp (U.d. eo) : Ie  : Ie . e') . . e E Rand (i) For some «eo) >0 Eo.p(Xi . eo)} : e E K n {I : 1 IB . Or of sphere centers such that K Now inf n {e: Ie . for each e eo.eol > .8) fails even in this simplest case in which X ~ /J is clear. Hint: From continuity of p.sup{lp(X. VarpUVarpV > O}.lJ. 7..2. Prove that (5.Xn are i.Section 5. 6. i (e))} > <. by the basic property of maximum contrast estimates. (i).14)(i).eol > .2. e)1 (ii) Eo. 5.14)(i) add (ii) suffice for consistency. (Wald) Suppose e ~ p( X. By compactness there is a finite number 8 1 . 5_0  lim Eo. .2 rII.\} . e') .(eo)} < 00. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent. and the dominated convergence theorem. e) .2.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution.
Problems for Section 5.d. . .X~ are i. J (iii) Condition on IXi  X. . .3. ~ . Compact sets J( can be taken of the form {II'I < A. . for some constants M j ... 1 + .k / 2 .L.I'lj < EIX .d. Hint: Taylor expand and note that if i . Then EIX . en are constants. < Mjn~ E (. with the same distribution as Xl.l m < Cm k=I n.1'.2. .3. < < 17 < 1/<. i . Let X~ be i.xW) < I'lj· 3. . Extend the result of Problem 7 to the case () E RP.9) in the exponential model of Example 5.. = 1.Xn but independent of them.O(BJll}. li . The condition of Problem 7(ii) can also fail. L:~ 1(Xi . n. Hint: See part (a) of the proof of Lemma 5. Establish Theorem 5... 2.) 2] .348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi.O'l n i=l p(X"Oo)}: B' E S(Bj.. For r fixed apply the law of large numbers.. . (72).... L:!Xi .i. II I .i."d' k=l d < mm+1 L d ElY. • • (i) Suppose Xf.3.X. < > OJ. and let X' = n1EX.)I' < MjE [L:~ 1 (Xi .. Show that the log likelihood tends to 00 as a + 0 and the condition fails.11). . 4. .1. I 10. + id = m E II (Y.3) for j odd as follows: ..)2]' < t Mjn~ E [. .. 8. then by Jensen's inequality.3 I.. Establish (5.3. P > 1. .X'i j . 1/(J.X.[.3. . and if CI. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. . Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K.d.3. and take the values ±1 with probability ~. i . Establish (5. Establish (5.1. (ii) If ti are ij. 9.
1 < j < m. .m with K. . . . Let XI. where s1 = (n.. Show that under the assumptions of part (c).andsupposetheX'sandY's are independent.m) Then.i.1) +E{IX. 1)'2:7' . (b) Show that when F and G are nonnal as in part (a).. replaced by its method of moments estimate. km K..pli ~ E{IX. Show that if EIXlii < 00. Show that ~ sup{ IE( X.(l'i .Fk. theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. R valued with EX 1 = O. 8..m 00.i. FandYi""'Yn2 bei. s~ ~ (n2 .28.Xn1 bei.6 Problems and Complements 349 Suppose ad > 0.. ~ iLl' I IXli > liLl}P(lXd > 11./LI = E(Xt}. . (d) Let Ck. X. 6.a as k t 00.d. ~ iLli < 2 i EIX.i. i. I < liLl}P(IXd < liLl)· 7.a~d < [max(al"" .d.Tn) + 1 .l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck.3. ~ iLli I IX. . Establish 5. G... PH(st! s~ < Ck.d.r'j". n} EIX.\ > 0 and = 1 + JI«k+m) ZIa. under H : Var(Xtl ~ Var(Y. j = 1. j > 2.Section 5. 2 = Var[ ( ) /0'1 ]2 .I. )=1 5.aD. ~ xl". ..(X. P(sll s~ < ~ 1~ a as k ~ 00. L~=1 i j = m then m a~l. Show that if m = .I) Hint: By the iterated expectation theorem EIX. LetX1"".1)'2:7' ."" X n be i. then EIX.). .ad)]m < m L aj j=1 m <mmILaj.m distribution with k = nl .\k for some . then (sVaf)/(s~/a~) has an .c> . respectively.1 and 'In = n2 ...m be Ck. (a) Show that if F and G are N(iL"af) and N(iL2.) I : i" . if a < EXr < 00. 0'1 = Var(Xd· Xl /LI Ck.
(I_A)a2). ' .xd.i. Var(€.lIl) + 1 . if p i' 0. • . i = 1. jn((C .+.xd. (UN.N. then jn(r . n. TN i... . be qk. i (b) Suppose Po is such that X I = bUi+t:i. Y) ~ N(I'I. then P(sUs~ < qk.. iik. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3.I))T has the same asymptotic distribution as n~ [n. 7 (1 .m (depending on ~'l ~ Var[ (X I .p.1. . .1JT. and if EX.1 · /.p) ~ N(O. 1). .XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters. "1 = "2 = I. Wnte (l 1pP . . < 00 (in the supennodel).p) ~ N(O.m (0 Let 9.XN} or {(uI.. "i. • 10.iL) • . = = 1. In survey sampling the modelbased approach postulates that the population {Xl. Et:i = 0. (I _ P')').1.a as k . = (1. jn(r .l EXiYi . • op(n').\ 00. n.4.d.m with KI and K2 replaced by their method of moment estimates. . Show that under the assumptions of part (e).) =a 2 < 00 and Var(U) Hint: (a) X .I' ill "rI' and "2 = Var[( X.2< 7'."..350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g.00.i. .p). .+xn n when T." ._ J . Consider as estimates e = ') (I X = x. = ("t .4. (cj Show that if p = 0. 4p' (I .Il2.3. .I." •. .. from {t I.. 0 < . suppose in the context of Example 3. Without loss of generality. to estimate Ii 1 < j < n. (b) If (X..3..6. (iJl. Po.~) (X . p). (a) If 1'1 = 1'2 ~ 0.\ < 1. N where the t:i are i. . . then jn(r' . if ~ then .2" ( 11. as N 2 t Show that.d. . () E such that T i = ti where ti = (ui. Hint: Use the centra] limit theorem and Slutsky's theorem. .l EY? . JLl = JL2 = 0. ~ ] . .. In particular. H En.t . (b) Use the delta method for the multivariate case and note (b opt  b)(U .. i = I. . t N }. (iJi .I).1'2)1"2)' such that PH(SI!S~ < Qk.. Instead assume that 0 < Var(Y?) < x. Without loss of generality. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.a: as k + 00. ..1 that we use Til.p') ~ N(O..". . Tin' which we have sampled at random ~ E~ 1 Xi. Show that jn(XRx) ~N(0. 
Under the assumptions of part (c). . if 0 < EX~ < 00 and 0 < Eyl8 < 00. ! . suppose i j = j.1 EX. (a) jn(X . there exists T I . In Example 5.. + I+p" ' ) I I I II.XC) wHere XC = N~n Et n+l Xi.p')') and.Ii > 0. . ...A») where 7 2 = Var(XIl. show that i' ~ . use a normal approximation to tind an approximate critical value qk.6.m) + 1 .x) ~ N(O. . that is. .
Suppose X I I • • • l X n is a sample from a peA) distrihution. each with HardyWeinberg frequency function f given by .Section 5. . It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. Here x q denotes the qth quantile of the distribution. (a) Use this fact and Problem 5. n = 5.3 < 00. (a) Show that the only transformations h that make E[h(X) . . This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). (h) Deduce fonnula (5. EU 13. (h) Use (a) to justify the approximation 17..90.14 to explain the numerical results of Problem 5.: . Suppose X 1l .. W). !) distribution. is known as Fisher's approximation.6) to explain why..3. X.s.. Hint: If U is independent of (V..12). Suppose XI..14)..3. .. (a) Suppose that Ely. 1931) is found to be excellent Use (5. X n are independent. 15.3. E(WV) < 00. Normalizing Transformation for the Poisson Distribution.n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.3.99. X~..i'b)(Yc .y'ri has approximately aN (0. then E(UVW) = O.i'c) I < Mn'.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. .10.3.. and Hint: Use (5. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x . Show that IE(Y. 16. The following approximation to the distribution of Sn (due to Wilson and Hilferty.3' Justify fonnally j1" variance (72. (b) Let Sn .6 Problems and Complements 351 12.13(c).25.. 14. X n is a sample from a population with mean third central moment j1. (a) Show that if n is large. Let Sn have a X~ distribution.i'a)(Yb .. . X = XO.. = 0.
x: 0, 1, 2
f(x): $\theta^2$, $2\theta(1-\theta)$, $(1-\theta)^2$

where $0 < \theta < 1$.
(a) Find an approximation to $P[\bar{X} \le t]$ in terms of $\theta$ and $t$.
(b) Find an approximation to $P[\sqrt{\bar{X}} \le t]$ in terms of $\theta$ and $t$.
(c) What is the approximate distribution of $\sqrt{n}(\bar{X} - \mu) + \bar{X}^2$, where $\mu = E(X_1)$?

18. Variance Stabilizing Transformation for the Binomial Distribution. Let $X_1, \ldots, X_n$ be the indicators of $n$ binomial trials with probability of success $\theta$. Show that the only variance stabilizing transformation $h$ such that $h(0) = 0$, $h(1) = 1$, and $h'(t) \ge 0$ for all $t$, is given by $h(t) = (2/\pi)\sin^{-1}(\sqrt{t})$.

19. Justify formally the following expressions for the moments of $h(\bar{X}, \bar{Y})$ where $(X_1, Y_1), \ldots, (X_n, Y_n)$ is a sample from a bivariate population with $E(X) = \mu_1$, $E(Y) = \mu_2$, $\mathrm{Var}(X) = \sigma_1^2$, $\mathrm{Var}(Y) = \sigma_2^2$, $\mathrm{Cov}(X, Y) = \rho\sigma_1\sigma_2$.
(a) $E(h(\bar{X}, \bar{Y})) = h(\mu_1, \mu_2) + O(n^{-1})$.
(b) $\mathrm{Var}(h(\bar{X}, \bar{Y})) \cong \frac{1}{n}\{[h_1(\mu_1, \mu_2)]^2\sigma_1^2 + 2h_1(\mu_1, \mu_2)h_2(\mu_1, \mu_2)\rho\sigma_1\sigma_2 + [h_2(\mu_1, \mu_2)]^2\sigma_2^2\} + O(n^{-2})$
where $h_1(x, y) = \frac{\partial}{\partial x}h(x, y)$, $h_2(x, y) = \frac{\partial}{\partial y}h(x, y)$.
Hint: $h(\bar{X}, \bar{Y}) - h(\mu_1, \mu_2) = h_1(\mu_1, \mu_2)(\bar{X} - \mu_1) + h_2(\mu_1, \mu_2)(\bar{Y} - \mu_2) + O_p(n^{-1})$.

20. Let $B_{m,n}$ have a beta distribution with parameters $m$ and $n$, which are integers. Show that if $m$ and $n$ are both tending to $\infty$ in such a way that $m/(m+n) \to \alpha$, $0 < \alpha < 1$, then
$$P\left[\sqrt{m+n}\,\frac{B_{m,n} - m/(m+n)}{\sqrt{\alpha(1-\alpha)}} \le x\right] \to \Phi(x).$$
Hint: Use $B_{m,n} = (m\bar{X}/n\bar{Y})[1 + (m\bar{X}/n\bar{Y})]^{-1}$ where $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ are independent standard exponentials.

21. Show directly using Problem B.2.5 that under the conditions of the previous problem, if $m/(m+n) - \alpha$ tends to zero at the rate $1/(m+n)^2$, then
$$E(B_{m,n}) = \frac{m}{m+n}, \quad \mathrm{Var}\,B_{m,n} = \frac{\alpha(1-\alpha)}{m+n} + R_{m,n}$$
where $R_{m,n}$ tends to zero at the rate $1/(m+n)^2$.
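22. Let $S_n \sim \chi^2_n$. Use Stirling's approximation and Problem B.2.4 to give a direct justification of
$$E(\sqrt{S_n}) = \sqrt{n} + R_n$$
where $R_n/\sqrt{n} \to 0$ as $n \to \infty$. Recall Stirling's approximation:
$$\Gamma(p+1)\Big/\left(\sqrt{2\pi}\,e^{-p}p^{p+\frac{1}{2}}\right) \to 1 \text{ as } p \to \infty.$$
(It may be shown but is not required that $|\sqrt{n}R_n|$ is bounded.)

23. Suppose that $X_1, \ldots, X_n$ is a sample from a population and that $h$ is a real-valued function of $\bar{X}$ whose derivatives of order $k$ are denoted by $h^{(k)}$, $k \ge 1$. Suppose $|h^{(4)}(x)| \le M$ for all $x$ and some constant $M$ and suppose that $\mu_4$ is finite. Show that
$$Eh(\bar{X}) = h(\mu) + \tfrac{1}{2}h^{(2)}(\mu)\frac{\sigma^2}{n} + R_n$$
where $|R_n| \le |h^{(3)}(\mu)||\mu_3|/6n^2 + M(\mu_4 + 3\sigma^4)/24n^2$.
Hint: For some $x^*$ between $\mu$ and $x$,
$$h(x) - h(\mu) = h^{(1)}(\mu)(x-\mu) + \tfrac{1}{2}h^{(2)}(\mu)(x-\mu)^2 + \tfrac{1}{6}h^{(3)}(\mu)(x-\mu)^3 + \tfrac{1}{24}h^{(4)}(x^*)(x-\mu)^4.$$
Therefore,
$$\left|Eh(\bar{X}) - h(\mu) - \tfrac{1}{2}h^{(2)}(\mu)\frac{\sigma^2}{n}\right| \le \tfrac{1}{6}|h^{(3)}(\mu)||E(\bar{X}-\mu)^3| + \tfrac{M}{24}E(\bar{X}-\mu)^4.$$

24. Let $X_1, \ldots, X_n$ be a sample from a population with mean $\mu$ and variance $\sigma^2 < \infty$. Suppose $h$ has a second derivative $h^{(2)}$ continuous at $\mu$ and that $h^{(1)}(\mu) = 0$.
(a) Show that $\sqrt{n}[h(\bar{X}) - h(\mu)] \stackrel{P}{\to} 0$ while $n[h(\bar{X}) - h(\mu)]$ is asymptotically distributed as $\tfrac{1}{2}h^{(2)}(\mu)\sigma^2 V$ where $V \sim \chi^2_1$.
(b) Use part (a) to show that when $\mu = \tfrac{1}{2}$,
$$n[\bar{X}(1-\bar{X}) - \mu(1-\mu)] \stackrel{\mathcal{L}}{\to} -\sigma^2 V$$
with $V \sim \chi^2_1$. Give an approximation to the distribution of $\bar{X}(1-\bar{X})$ in terms of the $\chi^2_1$ distribution function when $\mu = \tfrac{1}{2}$.

25. Let $X_1, \ldots, X_n$ be a sample from a population with $\mu = E(X)$, $\sigma^2 = \mathrm{Var}(X) < \infty$, and let $T = \bar{X}^2$ be an estimate of $\mu^2$.
(a) When $\mu \ne 0$, find the asymptotic distribution of $\sqrt{n}(T - \mu^2)$ using the delta method.
(b) When $\mu = 0$, find the asymptotic distribution of $nT$ using $P(nT \le t) = P(-\sqrt{t} \le \sqrt{n}\bar{X} \le \sqrt{t})$. Compare your answer to the answer in part (a).
(c) Find the limiting laws of $\sqrt{n}(\bar{X} - \mu)^2$ and $n(\bar{X} - \mu)^2$.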
26. Show that if $X_1, \ldots, X_n$ are i.i.d. $N(\mu, \sigma^2)$, then
$$\sqrt{n}(\bar{X} - \mu, \hat{\sigma}^2 - \sigma^2) \stackrel{\mathcal{L}}{\to} N(0, \Sigma_0)$$
where $\Sigma_0 = \mathrm{diag}(\sigma^2, 2\sigma^4)$.
Hint: Use (5.3.33) and Theorem 5.3.4.

27. Suppose $(X_1, Y_1), \ldots, (X_n, Y_n)$ are $n$ sets of control and treatment responses in a matched pair experiment. Assume that the observations have a common $N(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$ distribution. We want to obtain confidence intervals on $\mu_2 - \mu_1 = \Delta$. Suppose that instead of using the one-sample $t$ intervals based on the differences $Y_i - X_i$ we treat $X_1, \ldots, X_n, Y_1, \ldots, Y_n$ as separate samples and use the two-sample $t$ intervals (4.9.3). What happens? Analysis for fixed $n$ is difficult because $T(\Delta)$ no longer has a $\mathcal{T}_{2n-2}$ distribution. Let $n \to \infty$ and
(a) Show that $P[T(\Delta) \le t] \to \Phi\left(t\left[1 - \frac{2\rho\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}\right]^{-1/2}\right)$.
(b) Deduce that if $\rho > 0$ and $I_n$ is given by (4.9.3), then $\lim_n P[\Delta \in I_n] > 1 - \alpha$.
(c) Show that if $|I_n|$ is the length of the interval $I_n$,
$$\sqrt{n}|I_n| \to 2\sqrt{\sigma_1^2 + \sigma_2^2}\,z(1 - \tfrac{1}{2}\alpha) > 2\sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2}\,z(1 - \tfrac{1}{2}\alpha)$$
where the right-hand side is the limit of $\sqrt{n}$ times the length of the one-sample $t$ interval based on the differences.
Hint: (a), (c) Apply Slutsky's theorem.

28. Suppose $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ are as in Section 4.9.3 independent samples with $\mu_1 = E(X_1)$, $\mu_2 = E(Y_1)$, $\sigma_1^2 = \mathrm{Var}(X_1)$, and $\sigma_2^2 = \mathrm{Var}(Y_1)$. We want to study the behavior of the two-sample pivot $T(\Delta)$ of Example 4.9.3, if $n_1, n_2 \to \infty$, so that $n_1/n \to \lambda$, $0 < \lambda < 1$.
(a) Show that $P[T(\Delta) \le t] \to \Phi\left(t\left[\frac{\lambda\sigma_1^2 + (1-\lambda)\sigma_2^2}{(1-\lambda)\sigma_1^2 + \lambda\sigma_2^2}\right]^{1/2}\right)$.
(b) Deduce that if $\lambda = \tfrac{1}{2}$ or $\sigma_1 = \sigma_2$, the intervals (4.9.3) have correct asymptotic probability of coverage.
(c) Show that if $\sigma_2^2 > \sigma_1^2$ and $\lambda > 1 - \lambda$, the interval (4.9.3) has asymptotic probability of coverage $< 1 - \alpha$, whereas the situation is reversed if the sample size inequalities and variance inequalities agree.
(d) Make a comparison of the asymptotic length of (4.9.3) and the intervals based on the pivot $|D - \Delta|/s_D$ where $D$ and $s_D$ are as in Section 4.9.4.

29. Let $T = (D - \Delta)/s_D$ where $D$, $\Delta$, and $s_D$ are as defined in Section 4.9.4. Suppose that $E(X_1^4) < \infty$ and $E(Y_1^4) < \infty$.
(a) Show that $T$ has asymptotically a standard normal distribution as $n_1 \to \infty$ and $n_2 \to \infty$.
6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. Tn .I)sz/(). . where I..i. but Var(Tn ) + 00. Show that if 0 < EX 8 < 00. 33.1).I) + y'i«n . .I)z(a). (a) SupposeE(X 4 ) < 00. Let X ij (i = I.8Z = (n . 30.I k and k only.£.:~l IXjl.16.. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l .xdf and Ixl is Euclidean distance. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. I< = Var[(X 1')/()'jZ. Hint: See Problems B.3.z = Var(X).p. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31. .1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" . and let Va ~ (n . has asymptotic level a. Vn = (n .Z). Show that as n t 00. EIY.3.a)) and evaluate the approximations when F is 7. (). x = (XI. where tk(1. Plot your results. vectors and EIY 1 1k < 00. as X ~ F and let I' = E(X). T and where X is unifonn.a).4....9 and 4.9. X..z has a X~I distribution when F is theN(fJl (72) distribution. Hint: If Ixll ~ L. Carry out a Monte Carlo study such as the one that led to Figure 5. k) be independent with Xi} ~ N(l'i.X)" Then by Theorem B.1. .3. I is the Euclidean norm. . Show that k ~ ex:: as 111 ~ 00 and 112 t 00. then lim infn Var(Tn ) > Var(T).. then for all integers k: where C depends on d. .3 by showing that if Y 1. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X.z are Iii = k.d. 32. X n be i. U( 1.d." .. Let XI. . . (c) Let R be the method of moment estimaie of K. Y nERd are Li. Let .4. . j = I.~ t (Xi . (a) Show that the MLEs of I'i and ().. then P(Vn < va<) t a as 11 t 00.£._I' Find approximations to P(Yn and P{Vn < XnI (I .1)1 L.3 using the Welch test based on T rather than the twosample t test based on Sn.Section 5. (). (d) Find or write a computer program that carries out the Welch test.3. Generalize Lemma 5.a) is the critical value using the Welch approximation.
. . random variables distributed according to PEP.On) < 0). Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .. . ...p(X. . ... I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .. . 1/).f.)). N(O.p(x) (b) Suppose lbat for all PEP.nCOn . . I .. • 1 = sgn(x). Show lbat if I(x) ~ (d) Assume part (c) and A6.. . \ . Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique.. I(X.0) ~ N Hint: P( .0) and 7'(0) . . O(P) is lbe unique solution of Ep..On) . (b) Show that if k is fixed and p ~ 00. 00 (iii) I!/!I(x) < M < for all x. (Use .0) = O. Show that On is consistent for B(P) over P.O(P)) > 0 > Ep!/!(X. ! Problems for Section 5. Let Xl.. .n(On . 1 X n..p(X. under the conditions of (c).p( 00) as 0 ~ 00. aliI/ > O(P) is finite.p(X. is continuous and lbat 1(0) = F'(O) exists. Let i !' X denote the sample median.p(Xi . . .L~.....356 Asymptotic Approximations Chapter 5 .n(X .) Hint: Show that Ep1/J(X1  8) is nonincreasing in B.. . _ . Set (. (ii).p(X.4 1. N(O) = Cov(. ... . [N(O))' .0))  7'(0)) 0.n7(0) ~[. Show that _ £ ( .. . I ! 1 i . 1948). (e) Suppose lbat the d. That is the MLE jl' is not consistent (Neyman and Scott.d.. then < t) = P(O < On) = P (. (c) Give a consistent estimate of (]'2.(0) Varp!/!(X. (a) Show that (i). N(O.. Let On = B(P). .n for t E R.0) !:.. .0). F(x) of X. then 8' !.1)cr'/k. Ep.p(X. (k . Use the bounded convergence lbeorem applied to . I i .0). .. 1) for every sequence {On} wilb On 1 n = 0 + t/..Xn be i. 1/4/'(0)). where P is the empirical distribution of Xl.0) !. n 1 F' (x) exists.p(X. and (iii) imply that O(P) defined (not uniquely) by Ep.\(On)] !:. (c) Assume the conditions in (a) and (b).. . . Show that. Assmne lbat '\'(0) < 0 exists and lbat .i..
b)dl"(x) J 1J.(x.0. This is compatible with I(B) = 00.' .(x .05.. the asymptotic 2 .20 and note that X is more efficient than X for these gross error cases..0. ( 2 ).'" . 0) = . 4. 82 ) Pis N(I".7.(x.I / 2 . Hint: teJ1J. x E R.15 and 0. 0 «< 0.B) < xl = 1. Let XI."' c . denotes the N(O.O)p(x. U(O.B)) ~ [(I/O).(x) + <'P. evaluate the efficiency for € = . Show that A6' implies A6.(x). 0» density.O)p(x. J = 1.a)dl"(x).y.O).O) is defined with Pe probability I but < x. oj). (a) Show that g~ (x. Conclude that (b) Show that if 0 = max(XI.4.Section 5..O)dl"(x)) = J 1J. 2.(x.d.10.5  and '1'.0)p(x. Show that   ep(X.9)dl"(x) = J te(1J. X) = 1r/2. 'T = 4. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 .~ for 0 > x and is undefined for 0 g~(X.a)p(x.exp(x/O).5) where f.. not O! Hint: Peln(B . X) as defined in (f).jii(Oj .Xn ) is the MLE. lJ :o(1J.(x) = (I .  = a?/o. then ep(X.b)p(x.X) =0. Hint: Apply A4 and the dominated convergence theorem B.6 Problems and Complements 357 (0 For two estImates 0..i.(x. X n be i. . Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.. Show that assumption A4' of this section coupled with AGA3 implies assumption A4.c)'P.(1:e )n ~ 1. and O WIth . If (] = I. Find the efficiency ep(X.O))dl"(x) ifforalloo < a < b< 00.0) (see Section 3.    . ~"'... 1979) which you may assume.(x. Show that if (g) Suppose Xl has the gross error density f. Thus. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. . then 'ce(n(B . 3. .0) ~ N(O. 0 > 0.2. . (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2).
.·_t1og vn n j = p( x"'o+.in) 1 is replaced by the likelihood ratio statistic 7.) P"o (X . .In) + .in. Apply the dominated convergence theorem (B. 1(80 )."log .7. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".in) ~ . I .22). and A6 hold for. 8 ) i . .In ~ "( n & &8 10gp(Xi.g. i I £"+7. Show that 0 ~ 1(0) is continuous. 8) is continuous and I if En sup {:. 80 +.< ~log n p(X.358 5.(X. . under F oo ' and conclude that .O') .18 .. (8) Show that in Theorem 5.80 + p(Xi .00) . O.80 p(X" 80) +. Suppose A4'. I .5 continue to hold if . 8)p(x..i (x. ) n en] ..Oo) p(Xi .4. ~ L. Show that the conclusions of PrQple~ 5. . ~log (b) Show that n p(Xi . 0 ~N ~ (.in) .. 0 for any sequence {en} by using (b) and Polyii's theorem (A.p ~ 1(0) < 00. I I j 6.4.4. j. 0).80 + p(X ..[ L:. A2.0'1 < En}'!' 0 .i'''i~l p(Xi. 8 ) 0 .5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.50)..i."('1(8) .14. &l/&O so that E.4) to g. i.i(X. I 1 .~ (X. ) ! . I (c) Prove (5.O) 1(0) and Hint: () _ g.
4. gives B Compare with the exact bound of Example 4. ~ 1 L i=I n 8'[ ~ 8B' (Xi.4.Section 5.58).54) and (5.4. the bound (5. (Ii) Suppose the conditions of Theorem 5.2.59).61).6 Problems and Complements 359 8.:vell . Then (5. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .2. which is just (4. B' nj for j = 1.5. (a) Show that under assumptions (AO)(A6) for all Band (A4').7). . and give the behavior of (5. for all f n2 nI n2 . which agrees with (4.4. (a) Show that. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ). 10.B lower confidence bounds. setting a lower confidence bound for binomial fJ. Hint: Use Problems 5. Let B be two asymptotic l. Hint.4.4. ~.5 and 5.59).7. (a) Establish (5.3. at all e and (A4').4. 11.4. B). (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. (Ii) Compare the bounds in Ca) with the bound (5.Q" is asymptotically at least as good as B if. Compare Theorem 4. = X.59) coincide.4. A. 9.3). hence.61) for X ~ 0 and L 12. We say that B >0 nI Show that 8~I and.a.4. all the 8~i are at least as good as any competitors.6.2.6 and Slutsky's theorem. Let B~ be as in (5.4. f is a is an asymptotic lower confidence bound for fJ. Establish (5.4.4.57) can be strengthened to: For each B E e. Let [~11. Consider Example 4. B' _n + op (n 1/') 13. Hint: Use Problem 5.4. (Ii) Deduce that g.4.56).4.14. if X = 0 or 1.4.5 hold.
21J2 2x A} . = O. 1.riX) = T(l + nr 2)1/2rp (I+~~. (b) Suppose that I' ). the pvalue fJ .i. where Jl is known. 1 15.10) and (3.'" XI! are ij. 1) distribution. 0 given Xl.5 ! 1. I' has aN(O. Show that (J l'.•. N(Jt. By Problem 4. Let X 1> . X n i. Establish (5. inverse Gaussian with parameters J..\ > o.2.t = 0 versus K : J..)l/2 Hint: Use Examples 3. I . (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. Consider the Bayes test when J1.55).)) and that when I' = LJ. That is. ! (a) Show that the posterior probability of (OJ is where m n (. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0.I). (e) Find the Rao score test for testing H : .I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox").4. .01 ..11). Hint: By (3. the test statistic is a sum of i. . ! I • = '\0 versus K : . .. (d) Find the Wald test for testing H : A = AO versus K : A < AO. 7f(1' 7" 0) = 1 .1 and 3. =f:.L and A. the evidence I .6.1. Consider the problem of testing H : I' E [0.. 00.1. 1 .A 7" 0. (a) Show that the test that rejects H for large values of v'n(X .=2[1  <1>( v'nIXI)] has a I U(O.) has pvalue p = <1>(v'n(X . > ~.i.4.\ l .2.LJ. Show that (J/(J l'.5. X n be i. variables with mean zero and variance 1(0). X > 0..1.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. " " I I .d. Consider testing H : j. . T') distribution. if H is false. A exp .2. That is.d. 2.\ < Ao. 1.4. LJ.d.d. Suppose that Xl. is a given number. . 1).] versus K . N(I'.... Problems for Section 5.. each Xi has density ( 27rx3 A ) 1/2 {AX + J. where .. (c) Suppose that I' = 8 > O. phas a U(O. Jl > OJ . : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . J1.i.LJ.360 1 Asymptotic Approximations Chapter 5 14. Now use the central limit theorem. 1) distribution.
Hint: In view of Theorem 5.5. (e) Show that when Jl ~ ~. 10 . 0' (i» n L 802 i=l < In n ~sup {82 8021(X i . Extablish (5.) sufficiently large.0 3.645 and p = 0.(Xi.046 . By Theorem 5.5.052 100 .~~(:. By (5.S. yn(E(9 [ X) ..Oo»)+log7f(O+ J. yn(anX   ~)/ va. Show that the posterior probability of H is where an = n/(n + I). Suppose that in addition to the conditions of Theorem 5.2.s. Apply the argnment used for Theorem 5.14).. all . i ~.2 and the continuity of 7f( 0).2. J:J').1 is not in effect.l has a N(O.01 < 0 } continuThen 00.5.. 4.I).0'): iO' ..i Itlqn(t)dt < J:J'). Hint: By (5.050 logdnqn(t) = ~ {I(O). L M M tqn(t)dt ~ 0 a. 1) and p ~ L U(O. P.2 it is equivalent to show that J 02 7f (0)dO < a.05. [0.:.1.O') : 100'1 < o} 5. ~l when ynX = n 1'>=0. oyn.i It Iexp {iI(O)'.. Fe.8) ~ 0 a. (Lindley's "paradox" of Problem 5.6 Problems and Complements 361 (b) Suppose that J.5. (e) Verify the following table giving posterior probabilities of 1. ~ E.042 .029 .034 ..5.5. Hint: 1 n [PI .1 I'> = 1. ifltl < ApplytheSLLN and 0 ous at 0 = O..5.s.17).13) and the SLLN.058 20 .17).' " .Section 5. Establish (5.. ~ L N(O.054 50 .4.(Xi..sup{ :.} dt < .for M(.)}.O'(t)). 1) prior..I(Xi. for all M < 00.( Xt.) (d) Compute plimn~oo p/pfor fJ.
~ 0 a. . (a) Show that sup{lqn (t) . . Show tbat (5.8) and (5.5 " I i . I i. . • Notes for Section 5.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d.Pn where Pn a or 1. .2 we replace the assumptions A4(a. jo I I . " .S. The sets en (c) {t . .s. . (O)<p(tI.5. I: . Suppose that in Theorem 5. For n large these do not correspond to distributions one typically faces.d) for some c(d). I : ItI < M} O.) by A4 and A5.1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . qn (t) > c} are monotone increasing in c. Finally. von Misessee Stigler (1986) and Le Cam and Yang (1990).". J I . convergence replaced by convergence in Po probability. (I) If the rightband side is negative for some x. . • . Notes for Section 5.!. . L ..1 . 5.5. See Bhattacharya and Ranga Rao (1976) for further discussion. 'i roc J.I(Xi}»} 7r(t)dt. for all 6.. .5.29).s. A. .s. Bernstein and R. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.f. (1) This famous result appears in Laplace's work and was rediscovered by S.1. dn Finally.) and A5(a. .7 NOTES Notes for Section 5. . (0» (b) Deduce (5.I Notes for Section 5.16) noting that vlnen. all d and c(d) / in d. i=l }()+O(fJ) I Apply (5. A proof was given by Cramer (1946).4 (1) This result was first stated by R. Fisher (1925).5. Fn(x) is taken to be O. ~ II• '. 362 f Asymptotic Approximations Chapter 5 > O. .9) hold with a.5. ! ' (2) Computed by Winston Cbow. 7.) ~ 0 and I jtl7r(t)dt < co.5.3 ).
5.8 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer-Verlag, 1985.
BHATTACHARYA, R. N. AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New York: Wiley, 1976.
BILLINGSLEY, P., Probability and Measure New York: Wiley, 1979.
BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).
CRAMER, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.
DAVID, F. N., Tables of the Correlation Coefficient, Cambridge University Press, 1938; reprinted in Biometrika Tables for Statisticians (1966), Vol. I, 3rd ed., H. O. Hartley and E. S. Pearson, Editors. Cambridge: Cambridge University Press.
FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996.
FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).
FISHER, R. A., Statistical Inference and Scientific Method, Vth Berkeley Symposium, 1958.
HAMMERSLEY, J. M. AND D. C. HANDSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.
HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).
HUBER, P. J., The Behavior of the Maximum Likelihood Estimator Under Non-Standard Conditions, Proc. Vth Berk. Symp. Math. Statist. Prob., Vol. 1. Berkeley, CA: University of California Press, 1967.
LE CAM, L. AND G. L. YANG, Asymptotics in Statistics, Some Basic Concepts New York: Springer, 1990.
LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.
LEHMANN, E. L. AND G. CASELLA, Theory of Point Estimation New York: Springer-Verlag, 1998.
NEYMAN, J. AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).
RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.
RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.
SCHERVISH, M., Theory of Statistics New York: Springer, 1995.
SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons, 1980.
STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.
WILSON, E. B. AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., 17, p. 684 (1931).
Chapter 6

INFERENCE IN THE MULTIPARAMETER CASE

6.1 INFERENCE FOR GAUSSIAN LINEAR MODELS

Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for $d$-dimensional models $\{P_\theta : \theta \in \Theta\}$, $\Theta \subset R^d$. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3), multiple regression models (Examples 1.1.4, 1.4.3, 2.1.1) and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.5, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular $d$-dimensional parametric models and shall illustrate our results with a number of important examples.

This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models. There is, however, an important aspect of practical situations that is not touched by the approximation, the fact that $d$, the number of parameters, and $n$, the number of observations, are often both large and commensurate or nearly so. The inequalities of Vapnik-Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.
Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as $\theta$ refer to both column and row vectors.

6.1.1 The Classical Gaussian Linear Model

Many of the examples considered in the earlier chapters fit the framework in which the $i$th measurement $Y_i$ among $n$ independent observations has a distribution that depends on known constants $z_{i1}, \ldots, z_{ip}$. In the classical Gaussian (normal) linear model this dependence takes the form
$$Y_i = \sum_{j=1}^{p} z_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.1}$$
where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. In vector and matrix notation, we write
$$Y_i = z_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.2}$$
and
$$Y = Z\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 J) \tag{6.1.3}$$
where $z_i = (z_{i1}, \ldots, z_{ip})^T$, $Z = (z_{ij})_{n \times p}$, and $J$ is the $n \times n$ identity matrix.

Here $Y_i$ is called the response variable, the $z_{ij}$ are called the design values, and $Z$ is called the design matrix.

In this section we will derive exact statistical procedures under the assumptions of the model (6.1.3). These are among the most commonly used statistical techniques. In Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. It turns out that these techniques are sensible and useful outside the narrow framework of model (6.1.3).

Here is Example 1.1.2(4) in this framework.

Example 6.1.1. The One-Sample Location Problem. We have $n$ independent measurements $Y_1, \ldots, Y_n$ from a population with mean $\beta_1 = E(Y)$. The model is
$$Y_i = \beta_1 + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.4}$$
where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. Here $p = 1$ and $Z_{n \times 1} = (1, \ldots, 1)^T$. □

The regression framework of Examples 1.1.4 and 2.1.1 is also of the form (6.1.3):

Example 6.1.2. Regression. We consider experiments in which $n$ cases are sampled from a population, and for each case, say the $i$th case, we have a response $Y_i$ and a set of $p - 1$ covariate measurements denoted by $z_{i2}, \ldots, z_{ip}$. We are interested in relating the mean of the response to the covariate values. The normal linear regression model is
$$Y_i = \beta_1 + \sum_{j=2}^{p} z_{ij}\beta_j + \epsilon_i, \quad i = 1, \ldots, n \tag{6.1.5}$$
where $\beta_1$ is called the regression intercept and $\beta_2, \ldots, \beta_p$ are called the regression coefficients. If we set $z_{i1} = 1$, $i = 1, \ldots, n$, then the notation (6.1.2) and (6.1.3) applies.

We treat the covariate values $z_{ij}$ as fixed (nonrandom). In this case, (6.1.5) is called the fixed design normal linear regression model. The random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of $Y$ given a set of observed covariate values. □

Example 6.1.3. The $p$-Sample Problem or One-Way Layout. In Example 1.1.3 and Section 4.9.3 we considered experiments involving the comparisons of two population means when we had available two independent samples, one from each population. Two-sample models apply when the design values represent a qualitative factor taking on only two values. Frequently, we are interested in qualitative factors taking on several values. If we are comparing pollution levels, we want to do so for a variety of locations; we often have more than two competing drugs to compare, and so on.

To fix ideas suppose we are interested in comparing the performance of $p \ge 2$ treatments on a population and that we administer only one treatment to each subject and a sample of $n_k$ subjects get treatment $k$, $1 \le k \le p$, $n_1 + \cdots + n_p = n$. If the control and treatment responses are independent and normally distributed with the same variance $\sigma^2$, we arrive at the one-way layout or $p$-sample model,
$$Y_{kl} = \beta_k + \epsilon_{kl}, \quad 1 \le l \le n_k, \ 1 \le k \le p \tag{6.1.6}$$
where $Y_{kl}$ is the response of the $l$th subject in the group obtaining the $k$th treatment, $\beta_k$ is the mean response to the $k$th treatment, and the $\epsilon_{kl}$ are independent $N(0, \sigma^2)$ random variables.

To see that this is a linear model we relabel the observations as $Y_1, \ldots, Y_n$, where $Y_1, \ldots, Y_{n_1}$ correspond to the group receiving the first treatment, $Y_{n_1+1}, \ldots, Y_{n_1+n_2}$ to that getting the second, and so on. Then for $1 \le j \le p$, if $n_0 = 0$, the design matrix has elements:
$$z_{ij} = \begin{cases} 1 & \text{if } \sum_{k=1}^{j-1} n_k + 1 \le i \le \sum_{k=1}^{j} n_k \\ 0 & \text{otherwise} \end{cases}$$
and
$$Z = \begin{pmatrix} 1_1 & 0 & \cdots & 0 \\ 0 & 1_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_p \end{pmatrix}$$
where $1_j$ is a column vector of $n_j$ ones and the $0$ in the "row" whose $j$th member is $1_j$ is a column vector of $n_j$ zeros. The model (6.1.6) is an example of what is often called analysis of variance models. Generally, this terminology is commonly used when the design values are qualitative.

The model (6.1.6) is often reparametrized by introducing $\alpha = p^{-1}\sum_{k=1}^{p}\beta_k$ and $\delta_k = \beta_k - \alpha$ because then $\delta_k$ represents the difference between the $k$th and average treatment
effects, $k = 1, \ldots, p$. In terms of the new parameter $\beta^* = (\alpha, \delta_1, \ldots, \delta_p)^T$, the linear model is
$$Y = Z^*\beta^* + \epsilon, \quad \epsilon \sim N(0, \sigma^2 J)$$
where $Z^*_{n \times (p+1)} = (1, Z)$ and $1_{n \times 1}$ is the vector with $n$ ones. Note that $Z^*$ is of rank $p$ and that $\beta^*$ is not identifiable for $\beta^* \in R^{p+1}$. However, $\beta^*$ is identifiable in the $p$-dimensional linear subspace $\{\beta^* \in R^{p+1} : \sum_{k=1}^{p}\delta_k = 0\}$ of $R^{p+1}$ obtained by adding the linear restriction $\sum_{k=1}^{p}\delta_k = 0$ forced by the definition of the $\delta_k$'s. This type of linear model with the number of columns $d$ of the design matrix larger than its rank $r$, and with the parameters identifiable only once $d - r$ additional linear restrictions have been specified, is common in analysis of variance models. □

Even if $\beta$ is not a parameter (is unidentifiable), the vector of means $\mu = (\mu_1, \ldots, \mu_n)^T$ of $Y$ always is. It is given by
$$\mu = Z\beta = \sum_{j=1}^{p}\beta_j c_j$$
where the $c_j$ are the columns of the design matrix,
$$c_j = (z_{1j}, \ldots, z_{nj})^T, \quad j = 1, \ldots, p.$$
The parameter set for $\beta$ is $R^p$ and the parameter set for $\mu$ is
$$\omega = \{\mu = Z\beta;\ \beta \in R^p\}.$$
Note that $\omega$ is the linear space spanned by the columns $c_j$, $j = 1, \ldots, p$, of the design matrix. Let $r$ denote the number of linearly independent $c_j$, $j = 1, \ldots, p$, then $r$ is the rank of $Z$ and $\omega$ has dimension $r$. It follows that the parametrization $(\beta, \sigma^2)$ is identifiable if and only if $r = p$ (Problem 6.1.17). We assume that $n \ge r$.

The Canonical Form of the Gaussian Linear Model

The linear model can be analyzed easily using some geometry. Because $\dim \omega = r$, there exists (e.g., by the Gram-Schmidt process) (see Section B.3.2), an orthonormal basis $v_1, \ldots, v_n$ for $R^n$ such that $v_1, \ldots, v_r$ span $\omega$. Recall that orthonormal means $v_i^T v_j = 0$ for $i \ne j$ and $v_i^T v_i = 1$. When $v_i^T v_j = 0$, we call $v_i$ and $v_j$ orthogonal. Note that any $t \in R^n$ can be written
$$t = \sum_{i=1}^{n}(v_i^T t)v_i \tag{6.1.7}$$
and that
$$t \in \omega \iff t = \sum_{i=1}^{r}(v_i^T t)v_i \iff v_i^T t = 0, \ i = r+1, \ldots, n.$$
We now introduce the canonical variables and means
$$U_i = v_i^T Y, \quad \eta_i = E(U_i) = v_i^T\mu, \quad i = 1, \ldots, n.$$

Theorem 6.1.1. The $U_i$ are independent and $U_i \sim N(\eta_i, \sigma^2)$, $i = 1, \ldots, n$, where
$$\eta_i = 0, \quad i = r+1, \ldots, n,$$
while $(\eta_1, \ldots, \eta_r)^T$ varies freely over $R^r$.

Proof. Let $A_{n \times n}$ be the orthogonal matrix with rows $v_1^T, \ldots, v_n^T$. Then we can write $U = AY$, $\eta = A\mu$, and by Theorem B.3.2, $U_1, \ldots, U_n$ are independent normal with variance $\sigma^2$ and $E(U_i) = v_i^T\mu = 0$ for $i = r+1, \ldots, n$ because $\mu \in \omega$. □

Note that
$$Y = A^{-1}U. \tag{6.1.8}$$
So, observing $U$ and $Y$ is the same thing. $\mu$ and $\eta$ are equivalently related,
$$\mu = A^{-1}\eta, \tag{6.1.9}$$
whereas
$$\mathrm{Var}(Y) = \mathrm{Var}(U) = \sigma^2 J_{n \times n}. \tag{6.1.10}$$
It will be convenient to obtain our statistical procedures for the canonical variables $U$, which are sufficient for $(\mu, \sigma^2)$ using the parametrization $(\eta, \sigma^2)^T$, and then translate them to procedures for $\mu$, $\beta$, and $\sigma^2$ based on $Y$ using (6.1.8)-(6.1.10).

6.1.2 Estimation

We start by considering the log likelihood $\ell(\eta, u)$ based on $U$
$$\ell(\eta, u) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(u_i - \eta_i)^2 - \frac{n}{2}\log(2\pi\sigma^2) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}u_i^2 + \frac{1}{\sigma^2}\sum_{i=1}^{r}\eta_i u_i - \sum_{i=1}^{r}\frac{\eta_i^2}{2\sigma^2} - \frac{n}{2}\log(2\pi\sigma^2). \tag{6.1.11}$$
We first consider the $\sigma^2$ known case, which is the guide to asymptotic inference in general.

Theorem 6.1.2. In the canonical form of the Gaussian linear model with $\sigma^2$ known
(i) $T = (U_1, \ldots, U_r)^T$ is sufficient for $\eta$.
(ii) $U_1, \ldots, U_r$ is the MLE of $\eta_1, \ldots, \eta_r$.
(iii) $U_i$ is the UMVU estimate of $\eta_i$, $i = 1, \ldots, r$.
(iv) If $c_1, \ldots, c_r$ are constants, then the MLE of $\alpha = \sum_{i=1}^{r}c_i\eta_i$ is $\hat{\alpha} = \sum_{i=1}^{r}c_i U_i$. $\hat{\alpha}$ is also UMVU for $\alpha$.
(v) The MLE of $\mu$ is $\hat{\mu} = \sum_{i=1}^{r}v_i U_i$ and $\hat{\mu}_i$ is UMVU for $\mu_i$, $i = 1, \ldots, n$. Moreover, $U_i = v_i^T\hat{\mu}$, $i = 1, \ldots, r$, making $\hat{\mu}$ and $U$ equivalent.

Proof. (i) By observation, (6.1.11) is an exponential family with sufficient statistic $T$.
(ii) By (6.1.11), the log likelihood is a function of $\eta_1, \ldots, \eta_r$ only through $\sum_{i=1}^{r}(u_i - \eta_i)^2$, and this sum is minimized by setting $\hat{\eta}_i = U_i$. That is, $U_1, \ldots, U_r$ are the MLEs of $\eta_1, \ldots, \eta_r$. (We could also apply Theorem 2.3.1.)
(iii) By Theorem 3.4.4 and Example 3.4.6, $U_i$ is UMVU for $E(U_i) = \eta_i$, $i = 1, \ldots, r$.
(iv) By the invariance of the MLE (Section 2.2.2), the MLE of $q(\theta) = \sum_{i=1}^{r}c_i\eta_i$ is $q(\hat{\theta}) = \sum_{i=1}^{r}c_i U_i$. If all the $c$'s are zero, $\hat{\alpha}$ is UMVU. Assume that at least one $c$ is different from zero. By Problem 3.4.10, we can assume without loss of generality that $\sum_{i=1}^{r}c_i^2 = 1$. By Gram-Schmidt orthogonalization, there exists an orthonormal basis $v_1, \ldots, v_n$ of $R^n$ with $v_1 = c = (c_1, \ldots, c_r, 0, \ldots, 0)^T \in R^n$. Let $W_i = v_i^T U$, $\xi_i = v_i^T\eta$, $i = 1, \ldots, n$, then $W \sim N(\xi, \sigma^2 J)$ by Theorem B.3.2, where $J$ is the $n \times n$ identity matrix. The distribution of $W$ is an exponential family, $W_1 = \hat{\alpha}$ is sufficient for $\xi_1 = \alpha$, and is UMVU for its expectation $E(W_1) = \alpha$.
(v) Follows from (iv). □

Next we consider the case in which $\sigma^2$ is unknown and assume $n \ge r + 1$.

Theorem 6.1.3. In the canonical Gaussian linear model with $\sigma^2$ unknown,
(i) $T = (U_1, \ldots, U_r, \sum_{i=r+1}^{n}U_i^2)^T$ is sufficient for $(\eta_1, \ldots, \eta_r, \sigma^2)^T$.
(ii) The MLE of $\sigma^2$ is $n^{-1}\sum_{i=r+1}^{n}U_i^2$.
(iii) $s^2 \equiv (n - r)^{-1}\sum_{i=r+1}^{n}U_i^2$ is an unbiased estimator of $\sigma^2$.
(iv) The conclusions of Theorem 6.1.2 (ii), ..., (v) are still valid.

Proof. (i) By observation, $(U_1, \ldots, U_r, \sum_{i=1}^{n}U_i^2)^T$ is sufficient. But because $\sum_{i=1}^{n}U_i^2 = \sum_{i=1}^{r}U_i^2 + \sum_{i=r+1}^{n}U_i^2$, this statistic is equivalent to $T$ and (i) follows. To show (ii), recall that the maximum of (6.1.11) has $\hat{\eta}_i = U_i$, $i = 1, \ldots, r$. That is, we need to maximize
$$-\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=r+1}^{n}U_i^2$$
as a function of $\sigma^2$. The maximizer is easily seen to be $n^{-1}\sum_{i=r+1}^{n}U_i^2$ (Problem 6.1.1). (iii) is clear because $EU_i^2 = \sigma^2$, $i \ge r+1$. To show (iv), apply Theorem 3.4.3 and Example 3.4.6 to the canonical exponential family obtained from (6.1.11) by setting $T_j = U_j$, $\theta_j = \eta_j/\sigma^2$, $j = 1, \ldots, r$, $T_{r+1} = \sum_{i=1}^{n}U_i^2$ and $\theta_{r+1} = -1/2\sigma^2$. □

Projections

We next express $\hat{\mu}$ in terms of $Y$, obtain the MLE $\hat{\beta}$ of $\beta$, and give a geometric interpretation of $\hat{\mu}$, $\hat{\beta}$, and $s^2$. To this end, define the norm $|t|$ of a vector $t \in R^n$ by $|t|^2 = \sum_{i=1}^{n}t_i^2$.
Definition 6.1.1. The projection $y_0 = \pi(y \mid \omega)$ of a point $y \in R^n$ on $\omega$ is the point
$$y_0 = \arg\min\{|y - t|^2 : t \in \omega\}.$$

The maximum likelihood estimate $\hat\beta$ of $\beta$ maximizes
$$\log p(y, \beta, \sigma) = -\frac{1}{2\sigma^2}|y - Z\beta|^2 - \frac{n}{2}\log(2\pi\sigma^2)$$
or, equivalently,
$$\hat\beta = \arg\min\{|y - Z\beta|^2 : \beta \in R^p\}.$$
That is, the MLE of $\beta$ equals the least squares estimate (LSE) of $\beta$ defined in Example 2.1.1 and Section 2.2.1. We have

Theorem 6.1.4. In the Gaussian linear model

(i) $\hat\mu$ is the unique projection of $Y$ on $\omega$ and is given by
$$\hat\mu = Z\hat\beta. \qquad (6.1.12)$$

(ii) $\hat\mu$ is orthogonal to $Y - \hat\mu$.

(iii)
$$s^2 = |Y - \hat\mu|^2/(n - r). \qquad (6.1.13)$$

(iv) If $p = r$, then $\beta$ is identifiable, $\beta = (Z^TZ)^{-1}Z^T\mu$, the MLE = LSE of $\beta$ is unique and given by
$$\hat\beta = (Z^TZ)^{-1}Z^T\hat\mu = (Z^TZ)^{-1}Z^TY. \qquad (6.1.14)$$

(v) $\hat\beta_j$ is the UMVU estimate of $\beta_j$, $j = 1, \ldots, p$, and $\hat\mu_i$ is the UMVU estimate of $\mu_i$, $i = 1, \ldots, n$.

Proof. (i) is clear because $Z\beta$, $\beta \in R^p$, spans $\omega$. (ii) and (iii) are also clear from Theorem 6.1.3 because $\hat\mu = \sum_{i=1}^r v_iU_i$ and $Y - \hat\mu = \sum_{j=r+1}^n v_jU_j$. To show (iv), note that $\mu = Z\beta$ and (6.1.12) imply $Z^T\mu = Z^TZ\beta$ and $Z^T\hat\mu = Z^TZ\hat\beta$ and, because $Z$ has full rank, $Z^TZ$ is nonsingular, and $\beta = (Z^TZ)^{-1}Z^T\mu$, $\hat\beta = (Z^TZ)^{-1}Z^T\hat\mu$. To show $\hat\beta = (Z^TZ)^{-1}Z^TY$, note that the space $\omega^\perp$ of vectors $s$ orthogonal to $\omega$ can be written as
$$\omega^\perp = \{s \in R^n : s^T(Z\beta) = 0 \text{ for all } \beta \in R^p\}.$$
It follows that $\beta^T(Z^Ts) = 0$ for all $\beta \in R^p$, which implies $Z^Ts = 0$ for all $s \in \omega^\perp$. Thus, $Z^T(Y - \hat\mu) = 0$ and the second equality in (6.1.14) follows.

$\hat\beta_j$ and $\hat\mu_i$ are UMVU because, by (6.1.9), any linear combination of $Y$'s is also a linear combination of $U$'s, and by Theorems 6.1.2(iv) and 6.1.3(iv), any linear combination of $U$'s is a UMVU estimate of its expectation. □
Note that in Example 2.1.1 we give an alternative derivation of (6.1.14) and the normal equations $(Z^TZ)\beta = Z^TY$.

The estimate $\hat\mu = Z\hat\beta$ of $\mu$ is called the fitted value and $\hat\epsilon = Y - \hat\mu$ is called the residual from this fit. The goodness of the fit is measured by the residual sum of squares (RSS) $|Y - \hat\mu|^2 = \sum_{i=1}^n \hat\epsilon_i^2$. Example 2.2.2 illustrates this terminology in the context of Example 6.1.2 with $p = 2$. There the points $\hat\mu_i = \hat\beta_1 + \hat\beta_2 z_i$, $i = 1, \ldots, n$, lie on the regression line fitted to the data $\{(z_i, y_i),\ i = 1, \ldots, n\}$; moreover, the residuals $\hat\epsilon_i = [y_i - (\hat\beta_1 + \hat\beta_2 z_i)]$ are the vertical distances from the points to the fitted line.

Suppose we are given a value of the covariate $z$ at which a value $Y$ following the linear model (6.1.3) is to be taken. By Theorem 1.4.1, the best MSPE predictor of $Y$ if $\beta$ is known as well as $z$ is $E(Y) = z^T\beta$ and its best (UMVU) estimate not knowing $\beta$ is $\hat Y = z^T\hat\beta$. Taking $z = z_i$, $1 \le i \le n$, we obtain $\hat\mu_i = z_i^T\hat\beta$, $1 \le i \le n$. In this method of "prediction" of $Y_i$, it is common to write $\hat Y_i$ for $\hat\mu_i$, the $i$th component of the fitted value $\hat\mu$. That is, $\hat Y = \hat\mu$. Note that by (6.1.12) and (6.1.14), when $p = r$,
$$\hat Y = HY$$
where
$$H = Z(Z^TZ)^{-1}Z^T.$$
The matrix $H$ is the projection matrix mapping $R^n$ into $\omega$; see also Section B.10. In statistics it is also called the hat matrix because it "puts the hat on $Y$." As a projection matrix $H$ is necessarily symmetric and idempotent,
$$H^T = H, \quad H^2 = H.$$
It follows from this and (B.5.3) that if $J = J_{n \times n}$ is the identity matrix, then
$$\mathrm{Var}(\hat Y) = H(\sigma^2 J)H^T = \sigma^2 H. \qquad (6.1.15)$$
Next note that the residuals can be written as
$$\hat\epsilon = Y - \hat Y = (J - H)Y.$$
The residuals are the projection of $Y$ on the orthocomplement of $\omega$ and
$$\mathrm{Var}(\hat\epsilon) = \sigma^2(J - H). \qquad (6.1.16)$$
We can now conclude the following.

Corollary 6.1.1. In the Gaussian linear model

(i) the fitted values $\hat Y = \hat\mu$ and the residual $\hat\epsilon$ are independent,

(ii) $\hat Y \sim N(\mu, \sigma^2 H)$,

(iii) $\hat\epsilon \sim N(0, \sigma^2(J - H))$, and

(iv) if $p = r$, $\hat\beta \sim N(\beta, \sigma^2(Z^TZ)^{-1})$.
Proof. $(\hat Y, \hat\epsilon)$ is a linear transformation of $U$ and, hence, joint Gaussian. The independence follows from the identification of $\hat\mu$ and $\hat\epsilon$ in terms of the $U_i$ in the theorem. $\mathrm{Var}(\hat\beta) = \sigma^2(Z^TZ)^{-1}$ follows from (B.5.3). □

We now return to our examples.

Example 6.1.1. One Sample (continued). Here $\mu = \beta_1$ and $\hat\mu = \hat\beta_1 = \bar Y$. Moreover, the unbiased estimator $s^2$ of $\sigma^2$ is $\sum_{i=1}^n (Y_i - \bar Y)^2/(n - 1)$, which we have seen before in Problem 1.3.8. □

Example 6.1.2. Regression (continued). If the design matrix $Z$ has rank $p$, then the MLE = LSE estimate is $\hat\beta = (Z^TZ)^{-1}Z^TY$ as seen before in Example 2.1.1 and Section 2.2.1. We now see that the MLE of $\mu$ is $\hat\mu = Z\hat\beta$ and that $\hat\beta_j$ and $\hat\mu_i$ are UMVU for $\beta_j$ and $\mu_i$ respectively, $j = 1, \ldots, p$, $i = 1, \ldots, n$. The variances of $\hat\beta$, $\hat\mu = \hat Y$, and $\hat\epsilon = Y - \hat Y$ are given in Corollary 6.1.1. In the Gaussian case $\hat\beta$, $\hat Y$, and $\hat\epsilon$ are normally distributed with $\hat Y$ and $\hat\epsilon$ independent. The error variance $\sigma^2 = \mathrm{Var}(\epsilon_1)$ can be unbiasedly estimated by $s^2 = (n - p)^{-1}|Y - \hat Y|^2$. □

Example 6.1.3. The One-Way Layout (continued). In this example the normal equations $(Z^TZ)\beta = Z^TY$ become
$$n_k\beta_k = \sum_{l=1}^{n_k} Y_{kl}, \quad k = 1, \ldots, p.$$
At this point we introduce an important notational convention in statistics. If $\{c_{ijk\ldots}\}$ is a multiple-indexed sequence of numbers or variables, then replacement of a subscript by a dot indicates that we are considering the average over that subscript. Thus,
$$Y_{k\cdot} = \frac{1}{n_k}\sum_{l=1}^{n_k} Y_{kl}, \quad Y_{\cdot\cdot} = \frac{1}{n}\sum_{k=1}^p\sum_{l=1}^{n_k} Y_{kl}$$
where $n = n_1 + \cdots + n_p$ and we can write the least squares estimates as
$$\hat\beta_k = Y_{k\cdot}, \quad k = 1, \ldots, p.$$
By Theorem 6.1.3, in the Gaussian model, the UMVU estimate of the average effect of all the treatments, $\alpha = \beta_\cdot$, is
$$\hat\alpha = \frac{1}{p}\sum_{k=1}^p Y_{k\cdot} \quad (\text{not } Y_{\cdot\cdot} \text{ in general})$$
and the UMVU estimate of the incremental effect $\delta_k = \beta_k - \alpha$ of the $k$th treatment is
$$\hat\delta_k = Y_{k\cdot} - \hat\alpha, \quad k = 1, \ldots, p. \quad □$$
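The projection formulas for the regression example can be checked numerically. The following is a minimal sketch in plain Python, assuming a design with an intercept and one covariate (so $p = r = 2$); the data values and the helper names `transpose`, `matmul`, and `inverse_2x2` are hypothetical and serve only as illustration. It forms $H = Z(Z^TZ)^{-1}Z^T$, the fitted values $\hat Y = HY$, the residuals $(J - H)Y$, and $s^2$.

```python
# Sketch only: hat matrix, fitted values and residuals for a two-column design.
# The data are made up; helper names are illustrative, not from any library.

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

z = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical covariate values
y = [1.1, 1.9, 3.2, 3.8, 5.1]          # hypothetical responses
Z = [[1.0, zi] for zi in z]            # n x 2 design matrix of rank p = 2

Zt = transpose(Z)
ZtZ_inv = inverse_2x2(matmul(Zt, Z))
H = matmul(matmul(Z, ZtZ_inv), Zt)     # n x n hat matrix

y_col = [[v] for v in y]
beta_hat = [row[0] for row in matmul(matmul(ZtZ_inv, Zt), y_col)]  # (Z^T Z)^{-1} Z^T Y
fitted = [row[0] for row in matmul(H, y_col)]                      # Y_hat = H Y
residuals = [yi - fi for yi, fi in zip(y, fitted)]                 # (J - H) Y

n, p = len(y), 2
s2 = sum(e * e for e in residuals) / (n - p)   # unbiased estimate of sigma^2
```

For larger designs one would normally use a linear algebra library rather than an explicit inverse; the explicit form is kept here only because it mirrors the formulas in the text.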
Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors $\epsilon_1, \ldots, \epsilon_n$ in (6.1.1) have the Laplace distribution with density
$$\frac{1}{2\sigma}\exp\left\{-\frac{1}{\sigma}|t|\right\}$$
and the estimates of $\beta$ and $\mu$ are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance $\sum_{i=1}^n |y_i - z_i^T\beta|$. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs; see Stigler (1986). The LSEs are preferred because of ease of computation and their geometric properties. However, the LADEs are obtained fairly quickly by modern computing methods; see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). For more on LADEs, see Problems 1.4.7 and 2.2.31.

6.1.3 Tests and Confidence Intervals

The most important hypothesis-testing questions in the context of a linear model correspond to restriction of the vector of means $\mu$ to a linear subspace of the space $\omega$, which together with $\sigma^2$ specifies the model. For instance, in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider, in the context of Example 6.1.2, a regression equation of the form
$$\text{mean response} = \beta_1 + \beta_2 z_{i2} + \beta_3 z_{i3}, \qquad (6.1.17)$$
where $z_{i2}$ is the dose level of the drug given the $i$th patient, $z_{i3}$ is the age of the $i$th patient, and the matrix $\|z_{ij}\|_{n \times 3}$ with $z_{i1} = 1$ has rank 3. Now we would test $H : \beta_2 = 0$ versus $K : \beta_2 \ne 0$. Thus, under $H$, $\{\mu : \mu_i = \beta_1 + \beta_3 z_{i3},\ i = 1, \ldots, n\}$ is a two-dimensional linear subspace of the full model's three-dimensional linear subspace of $R^n$ given by (6.1.17).

Next consider the $p$-sample model of Example 1.1.3 with $\beta_k$ representing the mean response for the $k$th population. The first inferential question is typically "Are the means equal or not?" Thus we test $H : \beta_1 = \cdots = \beta_p = \beta$ for some $\beta \in R$ versus $K$: "the $\beta$'s are not all equal." Now, under $H$, the mean vector is an element of the space $\{\mu : \mu_i = \beta \in R,\ i = 1, \ldots, n\}$, which is a one-dimensional subspace of $R^n$, whereas for the full model $\mu$ is in a $p$-dimensional subspace of $R^n$.

In general, we let $\omega$ correspond to the full model with dimension $r$ and let $\omega_0$ be a $q$-dimensional linear subspace over which $\mu$ can range under the null hypothesis $H$; $0 \le q < r$.

We first consider the $\sigma^2$ known case and consider the likelihood ratio statistic
$$\lambda(y) = \frac{\sup\{p(y, \mu) : \mu \in \omega\}}{\sup\{p(y, \mu) : \mu \in \omega_0\}}$$
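for testing $H : \mu \in \omega_0$ versus $K : \mu \in \omega - \omega_0$. Because
$$p(y, \mu) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}|y - \mu|^2\right\}, \qquad (6.1.18)$$
we have, by Theorem 6.1.4,
$$\lambda(Y) = \exp\left\{-\frac{1}{2\sigma^2}\left(|Y - \hat\mu|^2 - |Y - \hat\mu_0|^2\right)\right\}$$
where $\hat\mu$ and $\hat\mu_0$ are the projections of $Y$ on $\omega$ and $\omega_0$, respectively.

But if we let $A_{n \times n}$ be an orthogonal matrix with rows $v_1^T, \ldots, v_n^T$ such that $v_1, \ldots, v_q$ span $\omega_0$ and $v_1, \ldots, v_r$ span $\omega$ and set
$$U = AY, \quad \eta = A\mu, \qquad (6.1.19)$$
then, by Theorem 6.1.2(v),
$$\lambda(Y) = \exp\left\{\frac{1}{2\sigma^2}\sum_{i=q+1}^r U_i^2\right\} = \exp\left\{\frac{1}{2\sigma^2}|\hat\mu - \hat\mu_0|^2\right\}. \qquad (6.1.20)$$
It follows that
$$2\log\lambda(Y) = \sum_{i=q+1}^r (U_i/\sigma)^2.$$
Note that $(U_i/\sigma)$ has a $N(\theta_i, 1)$ distribution with $\theta_i = \eta_i/\sigma$. In this case the distribution of $\sum_{i=q+1}^r (U_i/\sigma)^2$ is called a chi-square distribution with $r - q$ degrees of freedom and noncentrality parameter $\theta^2 = |\theta|^2 = \sum_{i=q+1}^r \theta_i^2$, where $\theta = (\theta_{q+1}, \ldots, \theta_r)^T$ (see Problem B.3.12). We write $\chi^2_{r-q}(\theta^2)$ for this distribution. We have shown the following.

Proposition 6.1.1. In the Gaussian linear model with $\sigma^2$ known, $2\log\lambda(Y)$ has a $\chi^2_{r-q}(\theta^2)$ distribution with
$$\theta^2 = \sigma^{-2}\sum_{i=q+1}^r \eta_i^2 = \sigma^{-2}|\mu - \mu_0|^2 \qquad (6.1.21)$$
where $\mu_0$ is the projection of $\mu$ on $\omega_0$. In particular, when $H$ holds, $2\log\lambda(Y) \sim \chi^2_{r-q}$.

Proof. We only need to establish the second equality in (6.1.21). Write $\eta = A\mu$ where $A$ is as defined in (6.1.19); then
$$\sum_{i=q+1}^r \eta_i^2 = |\mu - \mu_0|^2. \quad □$$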
Next consider the case in which $\sigma^2$ is unknown. We know from Problem 6.1.1 that the MLEs of $\sigma^2$ for $\mu \in \omega$ and $\mu \in \omega_0$ are
$$\hat\sigma^2 = \frac{1}{n}|Y - \hat\mu|^2 \quad \text{and} \quad \hat\sigma_0^2 = \frac{1}{n}|Y - \hat\mu_0|^2,$$
respectively. Substituting $\hat\mu$, $\hat\mu_0$, $\hat\sigma^2$, and $\hat\sigma_0^2$ into the likelihood ratio statistic, we obtain
$$\lambda(y) = \frac{\max\{p(y, \mu, \sigma^2) : \mu \in \omega\}}{\max\{p(y, \mu, \sigma^2) : \mu \in \omega_0\}} = \frac{p(y, \hat\mu, \hat\sigma^2)}{p(y, \hat\mu_0, \hat\sigma_0^2)} = \left\{\frac{|y - \hat\mu_0|^2}{|y - \hat\mu|^2}\right\}^{n/2}$$
where $p(y, \mu, \sigma^2)$ denotes the right-hand side of (6.1.18).

The resulting test is intuitive. It consists of rejecting $H$ when the fit, as measured by the residual sum of squares under the model specified by $H$, is poor compared to the fit under the general model.

For the purpose of finding critical values it is more convenient to work with a statistic equivalent to $\lambda(Y)$,
$$T = \frac{n - r}{r - q}\cdot\frac{|Y - \hat\mu_0|^2 - |Y - \hat\mu|^2}{|Y - \hat\mu|^2} = \frac{(r - q)^{-1}|\hat\mu - \hat\mu_0|^2}{(n - r)^{-1}|Y - \hat\mu|^2}. \qquad (6.1.22)$$
Because $T = (n - r)(r - q)^{-1}\{[\lambda(Y)]^{2/n} - 1\}$, $T$ is an increasing function of $\lambda(Y)$ and the two test statistics are equivalent. $T$ is called the F statistic for the general linear hypothesis.

We have seen in Proposition 6.1.1 that $\sigma^{-2}|\hat\mu - \hat\mu_0|^2$ has a $\chi^2_{r-q}(\theta^2)$ distribution with $\theta^2 = \sigma^{-2}|\mu - \mu_0|^2$. By the canonical representation (6.1.19), we can write $\sigma^{-2}|Y - \hat\mu|^2 = \sum_{i=r+1}^n (U_i/\sigma)^2$, which has a $\chi^2_{n-r}$ distribution and is independent of $\sigma^{-2}|\hat\mu - \hat\mu_0|^2 = \sum_{i=q+1}^r (U_i/\sigma)^2$. Thus, $T$ has the representation
$$T = \frac{(\text{noncentral } \chi^2_{r-q} \text{ variable})/\text{df}}{(\text{central } \chi^2_{n-r} \text{ variable})/\text{df}}$$
with the numerator and denominator independent. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter $\theta^2$ and $r - q$ and $n - r$ degrees of freedom (see Problem B.3.14). We write $\mathcal{F}_{k,m}(\theta^2)$ for this distribution where $k = r - q$ and $m = n - r$. We have shown the following.

Proposition 6.1.2. In the Gaussian linear model the F statistic defined by (6.1.22), which is equivalent to the likelihood ratio statistic for $H : \mu \in \omega_0$ versus $K : \mu \in \omega - \omega_0$, has the noncentral F distribution $\mathcal{F}_{r-q,n-r}(\theta^2)$ where $\theta^2 = \sigma^{-2}|\mu - \mu_0|^2$. In particular, when $H$ holds, $T$ has the (central) $\mathcal{F}_{r-q,n-r}$ distribution.

Remark 6.1.2. In Proposition 6.1.1 suppose the assumption "$\sigma^2$ is known" is replaced by "$\sigma^2$ is the same under $H$ and $K$ and estimated by the MLE $\hat\sigma^2$ for $\mu \in \omega$." In this case, it can be shown (Problem 6.1.5) that if we introduce the variance equal likelihood ratio statistic
$$\tilde\lambda(y) = \frac{\max\{p(y, \mu, \hat\sigma^2) : \mu \in \omega\}}{\max\{p(y, \mu, \hat\sigma^2) : \mu \in \omega_0\}} \qquad (6.1.23)$$
then $\tilde\lambda(Y)$ equals the likelihood ratio statistic for the $\sigma^2$ known case with $\sigma^2$ replaced by $\hat\sigma^2$. It follows that
$$2\log\tilde\lambda(Y) = \frac{r - q}{(n - r)/n}\,T = \frac{\text{noncentral } \chi^2_{r-q}}{\text{central } \chi^2_{n-r}/n} \qquad (6.1.24)$$
where $T$ is the F statistic (6.1.22).

Remark 6.1.3. The canonical representation (6.1.19) made it possible to recognize the identity
$$|Y - \hat\mu_0|^2 = |Y - \hat\mu|^2 + |\hat\mu - \hat\mu_0|^2, \qquad (6.1.25)$$
which we exploited in the preceding derivations. This is the Pythagorean identity. See Figure 6.1.1 and Section B.10.

[Figure 6.1.1. The projections $\hat\mu$ and $\hat\mu_0$ of $Y$ on $\omega$ and $\omega_0$, and the Pythagorean identity.]

We next return to our examples.

Example 6.1.1. One Sample (continued). We test $H : \beta_1 = \mu_0$ versus $K : \beta_1 \ne \mu_0$. In this case $\omega_0 = \{\mu_0\}$, $q = 0$, $r = 1$ and, because $|\hat\mu - \hat\mu_0|^2 = n(\bar Y - \mu_0)^2$,
$$T = \frac{n(\bar Y - \mu_0)^2}{(n - 1)^{-1}\sum_{i=1}^n (Y_i - \bar Y)^2},$$
which we recognize as $t^2$, where $t$ is the one-sample Student $t$ statistic of Section 4.9.2. □
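The Pythagorean identity (6.1.25) and the F statistic (6.1.22) are easy to check numerically in the one-sample case. The sketch below is plain Python with a made-up sample and a hypothetical null value `mu0`; it computes the two residual sums of squares, forms $T$ directly from (6.1.22) with $r = 1$, $q = 0$, and compares it with the square of the usual one-sample $t$ statistic $\sqrt{n}(\bar Y - \mu_0)/s$.

```python
import math

# Sketch only: one-sample F statistic from (6.1.22) versus the Student t statistic.
y = [4.9, 5.6, 5.1, 6.0, 5.4, 4.7]   # hypothetical sample
mu0 = 5.0                            # hypothetical null value of the mean
n = len(y)
ybar = sum(y) / n

rss_full = sum((yi - ybar) ** 2 for yi in y)   # |Y - mu_hat|^2, fit under omega
rss_null = sum((yi - mu0) ** 2 for yi in y)    # |Y - mu_hat_0|^2, fit under omega_0
between = n * (ybar - mu0) ** 2                # |mu_hat - mu_hat_0|^2

r, q = 1, 0
T = (n - r) / (r - q) * between / rss_full     # F statistic (6.1.22)

s = math.sqrt(rss_full / (n - 1))              # sample standard deviation
t = math.sqrt(n) * (ybar - mu0) / s            # one-sample t statistic
```

In this sketch `rss_null` equals `rss_full + between` up to rounding, which is (6.1.25), and `T` agrees with `t ** 2`.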
Using (6. Under H all the observations have the same mean so that.. . age...g. .y)2 k .g. I3I) where {32 is a (p .Way Layout (continued)...1.q)1f3nZfZ2 .ito 1 are the residual sums of squares under the full model and H.22) we obtain the F statistic for the hypothesis H in the oneway layout .1.{3p are Y1.22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i .  I 1. .1. respectively.. .np distribution.. anddfF = np and dfH = nq are the corresponding degrees of freedom.1..)2= Lnk(Yk. 7) iW l.)2' . (6.1 L~=.. Without loss of generality we ask whether the last p . ..q covariates does not affect the mean response. 02 simplifies to (J2(p .n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6.)2. Recall that the least squares estimates of {31.1. •. • 1 F = (RSSlf .1.Q)f3f(ZfZ2)f32. respectivelY. '1 Yp. I • I b .. P nk (Y.3.26) 0 versus K : f3 2 i' O.q covariates in multiple regression have an effect after fitting the first q. Regression (continued).}f32..27) 1 . However. which only depends on the second set of variables and coefficients. 0 I j Example 6. As we indicated earlier. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62.1.1. I • ! . economic status) coefficients.: I :I T _ n ~ p L~l nk(Y .dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY .q) x 1 vector of main (e. P . we want to test H : {31 = .vy. = {3p.. Z2) where Z1 is n x q and Z2 is 11 X (p .n_p(fP) distribution with noncentrality parameter (Problem 6. treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.ZfZl(ZiZl)'ziz. To formulate this question. The One. 1 02 = (J2(p . . j ! j • litPol2 = LL(Yk _y. We consider the possibility that a subset of p . In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.q). Under the alternative F has a noncentral Fp_q. . and we partition {3 as f3T = (f3f.Yk. 
In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6.26) and H.2. _y.' • ito = Thus.P . The F test rejects H if F is large when compared to the ath quantile of the Fpq.1..=11=1 k=1 •• Substituting in (6. L~"(Yk' .. we partition the design matrix Z by writing it as Z = (ZI. Now the linear model can be written as i i I I ! We test H : f32 (6.RSSF)/(dflf .
When $H$ holds, $T$ has a $\mathcal{F}_{p-1,n-p}$ distribution. If the $\beta_i$ are not all equal, $T$ has a noncentral $\mathcal{F}_{p-1,n-p}$ distribution with noncentrality parameter
$$\theta^2 = \sigma^{-2}\sum_{k=1}^p n_k(\beta_k - \bar\beta)^2 \qquad (6.1.28)$$
where $\bar\beta = n^{-1}\sum_{i=1}^p n_i\beta_i$. To derive $\theta^2$, compute $\sigma^{-2}|\mu - \mu_0|^2$ for the vector $\mu = (\beta_1, \ldots, \beta_1, \beta_2, \ldots, \beta_2, \ldots, \beta_p, \ldots, \beta_p)^T$ and its projection $\mu_0 = (\bar\beta, \ldots, \bar\beta)^T$.

There is an interesting way of looking at the pieces of information summarized by the F statistic. The sum of squares in the numerator,
$$SS_B = \sum_{k=1}^p n_k(Y_{k\cdot} - Y_{\cdot\cdot})^2,$$
is a measure of variation between the $p$ samples $Y_{11}, \ldots, Y_{1n_1}, \ldots, Y_{p1}, \ldots, Y_{pn_p}$. The sum of squares in the denominator,
$$SS_W = \sum_{k=1}^p\sum_{l=1}^{n_k}(Y_{kl} - Y_{k\cdot})^2,$$
measures variation within the samples. If we define the total sum of squares as
$$SS_T = \sum_{k=1}^p\sum_{l=1}^{n_k}(Y_{kl} - Y_{\cdot\cdot})^2,$$
which measures the variability of the pooled samples, then by the Pythagorean identity (6.1.25)
$$SS_T = SS_B + SS_W. \qquad (6.1.29)$$
Thus, we have a decomposition of the variability of the whole set of data, $SS_T$, the total sum of squares, into two constituent components, $SS_B$, the between groups (or treatment) sum of squares and $SS_W$, the within groups (or residual) sum of squares. $SS_T/\sigma^2$ is a (noncentral) $\chi^2$ variable with $(n - 1)$ degrees of freedom and noncentrality parameter $\theta^2$. Because $SS_B/\sigma^2$ and $SS_W/\sigma^2$ are independent $\chi^2$ variables with $(p - 1)$ and $(n - p)$ degrees of freedom, respectively, we see that the decomposition (6.1.29) can also be viewed stochastically, identifying $\theta^2$ and $(p - 1)$ degrees of freedom as "coming" from $SS_B/\sigma^2$ and the remaining $(n - p)$ of the $(n - 1)$ degrees of freedom of $SS_T/\sigma^2$ as "coming" from $SS_W/\sigma^2$.

This information as well as $SS_B/(p - 1)$ and $SS_W/(n - p)$, the unbiased estimates of $\theta^2$ and $\sigma^2$, and the F statistic, which is their ratio, are often summarized in what is known as an analysis of variance (ANOVA) table. See Tables 6.1.1 and 6.1.3.

As an illustration, consider the following data(1) giving blood cholesterol levels of men in three different socioeconomic groups labeled I, II, and III with I being the "high" end. We assume the one-way layout is valid. Note that this implies the possibly unrealistic
TABLE 6.1.1. ANOVA table for the one-way layout

                   Sum of squares                                                        d.f.     Mean squares             F value
Between samples    $SS_B = \sum_{k=1}^p n_k(Y_{k\cdot} - Y_{\cdot\cdot})^2$              $p-1$    $MS_B = SS_B/(p-1)$      $MS_B/MS_W$
Within samples     $SS_W = \sum_{k=1}^p\sum_{l=1}^{n_k}(Y_{kl} - Y_{k\cdot})^2$          $n-p$    $MS_W = SS_W/(n-p)$
Total              $SS_T = \sum_{k=1}^p\sum_{l=1}^{n_k}(Y_{kl} - Y_{\cdot\cdot})^2$      $n-1$
TABLE 6.1.2. Blood cholesterol levels

I      403  311  269  336  259
II     312  222  302  420  420  386  353  210  286  290
III    403  244  353  235  319  260
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here $p = 3$, $n_1 = 5$, $n_2 = 10$, $n_3 = 6$, $n = 21$, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

                  SS          d.f.    MS        F-value
Between groups    1202.5      2       601.2     0.126
Within groups     85,750.5    18      4763.9
Total             86,953.0    20
From $\mathcal{F}$ tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □

Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares $SS_T$ into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis of variance. They can be formulated in any linear model including regression models. See Scheffé (1959, pp. 42-45) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of information we call ANOVA tables.
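The ANOVA computation for the cholesterol example can be sketched in a few lines of plain Python. The assignment of the 21 observations to groups I, II, and III below is an assumption based on the layout of Table 6.1.2 (with $n_1 = 5$, $n_2 = 10$, $n_3 = 6$), and the variable names are illustrative. Computed this way, the total sum of squares and the F value agree with Table 6.1.3, while the between and within sums of squares may differ from the printed entries by about one unit.

```python
# Sketch only: one-way ANOVA sums of squares and F statistic for Table 6.1.2.
# Group membership is read off the table layout and is an assumption.
groups = {
    "I":   [403, 311, 269, 336, 259],
    "II":  [312, 222, 302, 420, 420, 386, 353, 210, 286, 290],
    "III": [403, 244, 353, 235, 319, 260],
}

n = sum(len(v) for v in groups.values())          # total sample size
p = len(groups)                                   # number of groups
grand_mean = sum(sum(v) for v in groups.values()) / n

# Between-groups, within-groups and total sums of squares
ss_b = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())
ss_w = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)
ss_t = sum((x - grand_mean) ** 2 for v in groups.values() for x in v)

ms_b = ss_b / (p - 1)        # between-groups mean square
ms_w = ss_w / (n - p)        # within-groups mean square
f_value = ms_b / ms_w        # F statistic with (p - 1, n - p) degrees of freedom

print(f"SS_B={ss_b:.1f}  SS_W={ss_w:.1f}  SS_T={ss_t:.1f}  F={f_value:.3f}")
```

The p-value still has to be read from $\mathcal{F}$ tables or obtained from a statistics library; it is deliberately left out of this sketch.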
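Confidence Intervals and Regions

We next use our distributional results and the method of pivots to find confidence intervals for $\mu_i$, $1 \le i \le n$, $\beta_j$, $1 \le j \le p$, and in general, any linear combination
$$\psi = \psi(\mu) = \sum_{i=1}^n a_i\mu_i = a^T\mu$$
of the $\mu$'s. If we set $\hat\psi = \sum_{i=1}^n a_i\hat\mu_i = a^T\hat\mu$ and
$$\sigma^2(\hat\psi) = \mathrm{Var}(\hat\psi) = a^T\mathrm{Var}(\hat\mu)a = \sigma^2 a^THa,$$
where $H$ is the hat matrix, then $(\hat\psi - \psi)/\sigma(\hat\psi)$ has a $N(0, 1)$ distribution. Moreover,
$$(n - r)s^2/\sigma^2 = |Y - \hat\mu|^2/\sigma^2 = \sum_{i=r+1}^n (U_i/\sigma)^2$$
has a $\chi^2_{n-r}$ distribution and is independent of $\hat\psi$. Let
$$\hat\sigma(\hat\psi) = (s^2 a^THa)^{1/2}$$
be an estimate of the standard deviation $\sigma(\hat\psi)$ of $\hat\psi$. This estimated standard deviation is called the standard error of $\hat\psi$. By referring to the definition of the $t$ distribution, we find that the pivot
$$T(\psi) = \frac{(\hat\psi - \psi)/\sigma(\hat\psi)}{(s^2/\sigma^2)^{1/2}} = \frac{\hat\psi - \psi}{\hat\sigma(\hat\psi)}$$
has a $\mathcal{T}_{n-r}$ distribution. Let $t_{n-r}(1 - \frac12\alpha)$ denote the $1 - \frac12\alpha$ quantile of the $\mathcal{T}_{n-r}$ distribution; then by solving $|T(\psi)| \le t_{n-r}(1 - \frac12\alpha)$ for $\psi$, we find that
$$\psi = \hat\psi \pm t_{n-r}(1 - \tfrac12\alpha)\,\hat\sigma(\hat\psi)$$
is, in the Gaussian linear model, a $100(1 - \alpha)\%$ confidence interval for $\psi$.

Example 6.1.1. One Sample (continued). Consider $\psi = \mu$. We obtain the interval
$$\mu = \bar Y \pm t_{n-1}(1 - \tfrac12\alpha)\,s/\sqrt{n},$$
which is the same as the interval of Example 4.4.1 and Section 4.9.2.

Example 6.1.2. Regression (continued). Assume that $p = r$. First consider $\psi = \beta_j$ for some specified regression coefficient $\beta_j$. The $100(1 - \alpha)\%$ confidence interval for $\beta_j$ is
$$\beta_j = \hat\beta_j \pm t_{n-p}(1 - \tfrac12\alpha)\,s\{[(Z^TZ)^{-1}]_{jj}\}^{1/2}$$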
where $[(Z^TZ)^{-1}]_{jj}$ is the $j$th diagonal element of $(Z^TZ)^{-1}$. Computer software computes $(Z^TZ)^{-1}$ and labels $s\{[(Z^TZ)^{-1}]_{jj}\}^{1/2}$ as the standard error of the (estimated) $j$th regression coefficient. Next consider $\psi = \mu_i$ = mean response for the $i$th case, $1 \le i \le n$. The level $(1 - \alpha)$ confidence interval is
$$\mu_i = \hat\mu_i \pm t_{n-p}(1 - \tfrac12\alpha)\,s\sqrt{h_{ii}}$$
where $h_{ii}$ is the $i$th diagonal element of the hat matrix $H$. Here $s\sqrt{h_{ii}}$ is called the standard error of the (estimated) mean of the $i$th case. Next consider the special case in which $p = 2$ and
$$Y_i = \beta_1 + \beta_2 z_{i2} + \epsilon_i, \quad i = 1, \ldots, n.$$
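If we use the identity
$$\sum_{i=1}^n (z_{i2} - z_{\cdot 2})(Y_i - \bar Y) = \sum_{i=1}^n (z_{i2} - z_{\cdot 2})Y_i,$$
we obtain from Example 2.2.2 that
$$\hat\beta_2 = \frac{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})Y_i}{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}. \qquad (6.1.30)$$
Because $\mathrm{Var}(Y_i) = \sigma^2$, we obtain
$$\mathrm{Var}(\hat\beta_2) = \sigma^2\Big/\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2,$$
and the $100(1 - \alpha)\%$ confidence interval for $\beta_2$ has the form
$$\beta_2 = \hat\beta_2 \pm t_{n-p}(1 - \tfrac12\alpha)\,s\Big/\sqrt{\textstyle\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}.$$
The confidence interval for $\beta_1$ is given in Problem 6.1.10.

Similarly, in the $p = 2$ case, it is straightforward (Problem 6.1.10) to compute
$$h_{ii} = \frac{1}{n} + \frac{(z_{i2} - z_{\cdot 2})^2}{\sum_{i=1}^n (z_{i2} - z_{\cdot 2})^2}$$
and the confidence interval for the mean response $\mu_i$ of the $i$th case has a simple explicit form. □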
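The $p = 2$ formulas above translate directly into code. The following is a minimal sketch in plain Python; the function name `simple_regression_intervals`, the data, and the way the $t$ quantile is supplied (looked up from tables by the caller rather than computed) are all assumptions made for illustration. It evaluates (6.1.30), the standard error of $\hat\beta_2$, the leverages $h_{ii}$, and the resulting intervals.

```python
import math

def simple_regression_intervals(z, y, t_quantile):
    """Sketch: intervals for beta_2 and for each mean response mu_i when p = 2.

    t_quantile is the 1 - alpha/2 quantile of the t distribution with n - 2
    degrees of freedom, supplied by the caller (for example from tables).
    """
    n = len(y)
    zbar = sum(z) / n
    ybar = sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    beta2 = sum((zi - zbar) * yi for zi, yi in zip(z, y)) / szz      # (6.1.30)
    beta1 = ybar - beta2 * zbar
    fitted = [beta1 + beta2 * zi for zi in z]
    s = math.sqrt(sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2))
    se_beta2 = s / math.sqrt(szz)                                    # standard error of beta2_hat
    beta2_ci = (beta2 - t_quantile * se_beta2, beta2 + t_quantile * se_beta2)
    h = [1.0 / n + (zi - zbar) ** 2 / szz for zi in z]               # leverages h_ii
    mean_cis = [(fi - t_quantile * s * math.sqrt(hi),
                 fi + t_quantile * s * math.sqrt(hi))
                for fi, hi in zip(fitted, h)]
    return beta2, beta2_ci, h, mean_cis

z = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical covariate values
y = [1.1, 1.9, 3.2, 3.8, 5.1]      # hypothetical responses
# 3.182 is the tabled 0.975 quantile of the t distribution with 3 degrees of freedom
beta2, beta2_ci, h, mean_cis = simple_regression_intervals(z, y, t_quantile=3.182)
```

Passing the quantile in as an argument keeps the sketch free of any statistics library; the leverages returned in `h` sum to $p = 2$, the trace of the hat matrix.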
Example 6.1.3. One-Way Layout (continued). We consider $\psi = \beta_k$, $1 \le k \le p$. Because $\hat\beta_k = Y_{k\cdot} \sim N(\beta_k, \sigma^2/n_k)$, we find the $100(1 - \alpha)\%$ confidence interval
$$\beta_k = \hat\beta_k \pm t_{n-p}(1 - \tfrac12\alpha)\,s/\sqrt{n_k}$$
where $s^2 = SS_W/(n - p)$. The intervals for $\mu = \beta_\cdot$ and the incremental effect $\delta_k = \beta_k - \mu$ are given in Problem 6.1.11. □
Joint Confidence Regions
We have seen how to find confidence intervals for each individual $\beta_j$, $1 \le j \le p$. We next consider the problem of finding a confidence region $C$ in $R^p$ that covers the vector $\beta$ with prescribed probability $(1 - \alpha)$. This can be done by inverting the likelihood ratio test or equivalently the F test. That is, we let $C$ be the collection of $\beta_0$ that is accepted when the level $(1 - \alpha)$ F test is used to test $H : \beta = \beta_0$. Under $H$, $\mu = \mu_0 = Z\beta_0$, and the numerator of the F statistic (6.1.22) is based on
$$|\hat\mu - \mu_0|^2 = |Z\hat\beta - Z\beta_0|^2 = (\hat\beta - \beta_0)^T(Z^TZ)(\hat\beta - \beta_0).$$
Thus, using (6.1.22), the simultaneous confidence region for $\beta$ is the ellipse
$$C = \left\{\beta_0 : \frac{(\hat\beta - \beta_0)^T(Z^TZ)(\hat\beta - \beta_0)}{rs^2} \le f_{r,n-r}(1 - \alpha)\right\} \qquad (6.1.31)$$
where $f_{r,n-r}(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\mathcal{F}_{r,n-r}$ distribution. The quantile is $1 - \alpha$ rather than $1 - \frac12\alpha$ because the F test rejects only for large values of the statistic.
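Checking whether a candidate $\beta_0$ lies in the joint region amounts to evaluating one quadratic form. The sketch below is plain Python; the function name `in_joint_region` and all numerical inputs are hypothetical, and the F quantile is supplied by the caller (here the tabled 0.95 quantile of the F distribution with 2 and 3 degrees of freedom), so no statistics library is assumed.

```python
def in_joint_region(beta0, beta_hat, ztz, s2, f_quantile):
    """Sketch: is beta0 inside the ellipse
    (beta_hat - beta0)^T (Z^T Z) (beta_hat - beta0) / (r s^2) <= f_quantile ?
    """
    d = [bh - b0 for bh, b0 in zip(beta_hat, beta0)]
    r = len(d)
    quad = sum(d[i] * ztz[i][j] * d[j] for i in range(r) for j in range(r))
    return quad / (r * s2) <= f_quantile

# Hypothetical inputs: n = 5 cases, intercept plus one covariate z = 1, ..., 5
ztz = [[5.0, 15.0], [15.0, 55.0]]   # Z^T Z for that design
beta_hat = [0.05, 0.99]             # illustrative least squares estimate
s2 = 0.107 / 3                      # illustrative residual mean square, n - p = 3
f_95 = 9.55                         # tabled 0.95 quantile of F with (2, 3) d.f.

inside = in_joint_region([0.0, 1.0], beta_hat, ztz, s2, f_95)
outside = in_joint_region([10.0, 10.0], beta_hat, ztz, s2, f_95)
```

Because $Z^TZ$ is generally not diagonal, the region is a tilted ellipse, so a point can lie inside every individual coordinate interval and still fall outside the joint region, or conversely.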
Example 6.1.2. Regression (continued). We consider the case $p = r$ and as in (6.1.26) write $Z = (Z_1, Z_2)$ and $\beta^T = (\beta_1^T, \beta_2^T)$, where $\beta_2$ is a vector of main effect coefficients and $\beta_1$ is a vector of "nuisance" coefficients. Similarly, we partition $\hat\beta$ as $\hat\beta^T = (\hat\beta_1^T, \hat\beta_2^T)$ where $\hat\beta_1$ is $q \times 1$ and $\hat\beta_2$ is $(p - q) \times 1$. By Corollary 6.1.1, $\sigma^2(Z^TZ)^{-1}$ is the variance-covariance matrix of $\hat\beta$. It follows that if we let $S$ denote the lower right $(p - q) \times (p - q)$ corner of $(Z^TZ)^{-1}$, then $\sigma^2 S$ is the variance-covariance matrix of $\hat\beta_2$. Thus, a joint $100(1 - \alpha)\%$ confidence region for $\beta_2$ is the $p - q$ dimensional ellipse
$$C = \left\{\beta_{02} : \frac{(\hat\beta_2 - \beta_{02})^TS^{-1}(\hat\beta_2 - \beta_{02})}{(p - q)s^2} \le f_{p-q,n-p}(1 - \alpha)\right\}. \quad □$$
Summary. We consider the classical Gaussian linear model in which the response $Y_i$ for the $i$th case in an experiment is expressed as a linear combination $\mu_i = \sum_{j=1}^p \beta_j z_{ij}$ of covariates plus an error $\epsilon_i$, where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients $\{\beta_j\}$, the response means $\{\mu_i\}$, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for one-dimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: $X_1, \ldots, X_n$ are i.i.d. $P$ where $P \in Q$, a model containing $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ such that

(i) $\Theta$ open $\subset R^p$.

(ii) Densities of $P_\theta$ are $p(\cdot, \theta)$, $\theta \in \Theta$.
The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. $\Psi = (\psi_1, \ldots, \psi_p)^T$ where $\psi_j = \partial\rho/\partial\theta_j$ is well defined and
$$\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \hat\theta_n) = 0. \qquad (6.2.1)$$
A solution to (6.2.1) is called an estimating equation estimate or an M-estimate.
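A1. The parameter $\theta(P)$ given by the solution of (the nonlinear system of $p$ equations in $p$ unknowns):
$$\int \Psi(x, \theta)\,dP(x) = 0 \qquad (6.2.2)$$
is well defined on $Q$ so that $\theta(P)$ is the unique solution of (6.2.2). Necessarily $\theta(P_\theta) = \theta$ because $Q \supset \mathcal{P}$.

A2. $E_P|\Psi(X_1, \theta(P))|^2 < \infty$ where $|\cdot|$ is the Euclidean norm.

A3. $\psi_i(\cdot, \theta)$, $1 \le i \le p$, have first-order partials with respect to all coordinates and, using the notation of Section B.8,
$$E_P|D\Psi(X_1, \theta)| < \infty$$
where
$$E_PD\Psi(X_1, \theta) = \left\|E_P\frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta)\right\|_{p \times p}$$
is nonsingular.

A4. $\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n (D\Psi(X_i, t) - D\Psi(X_i, \theta(P)))\right| : |t - \theta(P)| \le \epsilon_n\right\} \overset{P}{\to} 0$ if $\epsilon_n \to 0$.

A5. $\hat\theta_n \overset{P}{\to} \theta(P)$ for all $P \in Q$.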
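Theorem 6.2.1. Under A0-A5 of this section
$$\hat\theta_n = \theta(P) + \frac{1}{n}\sum_{i=1}^n \tilde\Psi(X_i, \theta(P)) + o_p(n^{-1/2}) \qquad (6.2.3)$$
where
$$\tilde\Psi(x, \theta(P)) = -[E_PD\Psi(X_1, \theta(P))]^{-1}\Psi(x, \theta(P)). \qquad (6.2.4)$$
Hence,
$$\mathcal{L}_P(\sqrt{n}(\hat\theta_n - \theta(P))) \to N_p(0, \Sigma(\Psi, P)) \qquad (6.2.5)$$
where
$$\Sigma(\Psi, P) = J(\theta, P)\,E\Psi\Psi^T(X_1, \theta(P))\,J^T(\theta, P) \qquad (6.2.6)$$
and
$$J^{-1}(\theta, P) = -E_PD\Psi(X_1, \theta(P)) = \left\|-E_P\frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta(P))\right\|.$$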
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
$$-\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \theta(P)) = \frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*)(\hat\theta_n - \theta(P)). \qquad (6.2.7)$$
Note that the left-hand side of (6.2.7) is a $p \times 1$ vector, the right is the product of a $p \times p$ matrix and a $p \times 1$ vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular $p \times p$ matrices, when viewed as vectors, is an open subset of $R^{p^2}$, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, $\frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*)$ is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of $\hat\theta_n$ is motivated by $\mathcal{P}$, the behavior in (6.2.3) is guaranteed for $P \in Q$, which can include $P \notin \mathcal{P}$. In fact, typically $Q$ is essentially the set of $P$'s for which $\theta(P)$ can be defined uniquely by (6.2.2).

We can again extend the assumptions of Section 5.4.2 to:

A6. If $l(\cdot, \theta)$ is differentiable
$$E_\theta D\Psi(X_1, \theta) = -E_\theta\Psi(X_1, \theta)Dl(X_1, \theta) = -\mathrm{Cov}_\theta(\Psi(X_1, \theta), Dl(X_1, \theta)) \qquad (6.2.8)$$
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A6' extend to the multivariate case readily. Note that consistency of $\hat\theta_n$ is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate $\theta_n^*$ will find a solution $\hat\theta_n$ of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
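The root-finding remark can be illustrated with Newton's method applied to a small estimating equation. The sketch below is plain Python and rests on assumptions chosen only for illustration: $p = 2$, $\theta = (\mu, v)$, and $\Psi(x, \theta) = (x - \mu,\ (x - \mu)^2 - v)^T$, whose solution is the sample mean and the divisor-$n$ sample variance. The function names and the data are hypothetical, and the sample median with a crude scale value is used as the starting point.

```python
import statistics

def psi_bar(theta, xs):
    """Average of Psi(x, theta) = (x - mu, (x - mu)^2 - v) over the sample."""
    mu, v = theta
    n = len(xs)
    return [sum(x - mu for x in xs) / n,
            sum((x - mu) ** 2 - v for x in xs) / n]

def dpsi_bar(theta, xs):
    """Average Jacobian D Psi: rows index psi_i, columns index theta_j."""
    mu, _ = theta
    n = len(xs)
    return [[-1.0, 0.0],
            [-2.0 * sum(x - mu for x in xs) / n, -1.0]]

def newton_m_estimate(xs, start, steps=10):
    """Newton iteration theta <- theta - [D Psi_bar]^{-1} Psi_bar for p = 2."""
    theta = list(start)
    for _ in range(steps):
        g = psi_bar(theta, xs)
        (a, b), (c, d) = dpsi_bar(theta, xs)
        det = a * d - b * c
        # Solve the 2 x 2 system J * delta = -g explicitly
        delta = [-(d * g[0] - b * g[1]) / det,
                 -(-c * g[0] + a * g[1]) / det]
        theta = [theta[0] + delta[0], theta[1] + delta[1]]
    return theta

xs = [2.1, 3.4, 1.9, 4.2, 2.8, 3.7, 3.0]              # hypothetical sample
theta_hat = newton_m_estimate(xs, start=[statistics.median(xs), 1.0])
```

For this particular $\Psi$ the iteration settles after two steps; in general Newton's method needs a starting value close enough to the root, which is the role the consistent estimate $\theta_n^*$ plays in the text.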
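6.2.2
Asymptotic Normality and Efficiency of the MLE

If we take $\rho(x, \theta) = -l(x, \theta) \equiv -\log p(x, \theta)$, and $\Psi(x, \theta)$ obeys A0-A6, then (6.2.8) becomes
$$-E_\theta D^2l(X_1, \theta) = E_\theta Dl(X_1, \theta)D^Tl(X_1, \theta) = \mathrm{Var}_\theta Dl(X_1, \theta) \qquad (6.2.9)$$
where $D^2l = \|\partial^2 l/\partial\theta_i\partial\theta_j\|$ and the common value is the Fisher information matrix $I(\theta)$ introduced in Section 3.4. If $\rho : \Theta \to R$, $\Theta \subset R^d$, is a scalar function, the matrix $\left\|\frac{\partial^2\rho}{\partial\theta_i\partial\theta_j}(\theta)\right\|$ is known as the Hessian or curvature matrix of the surface $\rho$. Thus, (6.2.9) states that the expected value of the Hessian of $l$ is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.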
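Theorem 6.2.2. If A0-A6 hold for $\rho(x, \theta) \equiv -\log p(x, \theta)$, then the MLE $\hat\theta_n$ satisfies
$$\hat\theta_n = \theta + \frac{1}{n}\sum_{i=1}^n I^{-1}(\theta)Dl(X_i, \theta) + o_p(n^{-1/2}) \qquad (6.2.10)$$
so that
$$\mathcal{L}(\sqrt{n}(\hat\theta_n - \theta)) \to N(0, I^{-1}(\theta)). \qquad (6.2.11)$$
If $\bar\theta_n$ is a minimum contrast estimate with $\rho$ and $\psi$ satisfying A0-A6 and corresponding asymptotic variance matrix $\Sigma(\Psi, P_\theta)$, then
$$\Sigma(\Psi, P_\theta) \ge I^{-1}(\theta) \qquad (6.2.12)$$
in the sense of Theorem 3.4.4 with equality in (6.2.12) for $\theta = \theta_0$ iff, under $\theta_0$,
$$\bar\theta_n = \hat\theta_n + o_p(n^{-1/2}). \qquad (6.2.13)$$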
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8)
$$\Sigma(\Psi, P_\theta) = \mathrm{Cov}_\theta^{-1}(U, V)\,\mathrm{Var}_\theta(U)\,\mathrm{Cov}_\theta^{-1}(V, U) \qquad (6.2.14)$$
where $U \equiv \Psi(X_1, \theta)$, $V \equiv Dl(X_1, \theta)$. But by (B.10.8), for any $U$, $V$ with $\mathrm{Var}(U^T, V^T)^T$ nonsingular
$$\mathrm{Var}(V) \ge \mathrm{Cov}(V, U)\,\mathrm{Var}^{-1}(U)\,\mathrm{Cov}(U, V). \qquad (6.2.15)$$
Taking inverses of both sides yields
$$I^{-1}(\theta) = \mathrm{Var}_\theta^{-1}(V) \le \Sigma(\Psi, \theta). \qquad (6.2.16)$$
Equality holds in (6.2.15) by (B.10.2.3) iff for some $b = b(\theta)$
$$U = b + \mathrm{Cov}(U, V)\,\mathrm{Var}^{-1}(V)\,V \qquad (6.2.17)$$
with probability 1. This means in view of $E_\theta\Psi = E_\theta Dl = 0$ that
$$\Psi(X_1, \theta) = b(\theta)Dl(X_1, \theta).$$
In the case of identity in (6.2.16) we must have
$$-[E_\theta D\Psi(X_1, \theta)]^{-1}\Psi(X_1, \theta) = I^{-1}(\theta)Dl(X_1, \theta). \qquad (6.2.18)$$
Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
We see that, by the theorem, the MLE is efficient in the sense that for any $a_{p \times 1}$, $a^T\hat\theta_n$ has asymptotic bias $o(n^{-1/2})$ and asymptotic variance $n^{-1}a^TI^{-1}(\theta)a$, which is no larger than that of any competing minimum contrast estimate. Further, any competitor $\bar\theta_n$ such that $a^T\bar\theta_n$ has the same asymptotic behavior as $a^T\hat\theta_n$ for all $a$ in fact agrees with $\hat\theta_n$ to order $n^{-1/2}$.

A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6 on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let $X_i = (Z_i^T, Y_i)^T$, $1 \le i \le n$, be i.i.d. as $X = (Z^T, Y)^T$ where $Z$ is a $p \times 1$ vector of explanatory variables and $Y$ is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(i)
$$Y = \alpha + Z^T\beta + \epsilon \qquad (6.2.19)$$
where $\epsilon$ is distributed as $N(0, \sigma^2)$ independent of $Z$ and $E(Z) = 0$. That is, given $Z$, $Y$ has a $N(\alpha + Z^T\beta, \sigma^2)$ distribution.

(ii) The distribution $H_0$ of $Z$ is known with density $h_0$ and $E(ZZ^T)$ is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Prohlem 6.2.6) that the MLE of [3 is given by (with probability 1) [3 = [Z(n}Z(n}]
T
~
lT
Zen} Y.
(6.2.20)
Here Zen) is the n x p matrix IIZij ~ Z.j II where z.j = ~ 1 Zij. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Zen) = (Zl, .. , 1 Zn)T is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of a and ()'2 are
p
2:7
Ci
=Y
 ~ Zj{3j, (j
J=l
"

2
.
1 ~2 = IY  (Ci + Z(n)[3)1 .
n
(6.2.21 )
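Note that although given $Z_1, \ldots, Z_n$, $\hat\beta$ is Gaussian, this is not true of the marginal distribution of $\hat\beta$. It is not hard to show that A0–A6 hold in this case because if $H_0$ has density $h_0$ and if $\theta$ denotes $(\alpha, \beta^T, \sigma^2)^T$, then, writing $\epsilon = Y - (\alpha + Z^T\beta)$,
$$l(X, \theta) = -\frac{1}{2\sigma^2}[Y - (\alpha + Z^T\beta)]^2 - \frac{1}{2}(\log\sigma^2 + \log 2\pi) + \log h_0(Z)$$
$$Dl(X, \theta) = \left(\frac{\epsilon}{\sigma^2},\ \frac{Z^T\epsilon}{\sigma^2},\ \frac{1}{2\sigma^4}(\epsilon^2 - \sigma^2)\right)^T \tag{6.2.22}$$
and
$$I(\theta) = \begin{pmatrix} \sigma^{-2} & 0 & 0 \\ 0 & \sigma^{-2}E(ZZ^T) & 0 \\ 0 & 0 & \dfrac{1}{2\sigma^4} \end{pmatrix} \tag{6.2.23}$$
so that by Theorem 6.2.2
$$\mathcal{L}\big(\sqrt{n}(\hat\alpha - \alpha,\ \hat\beta - \beta,\ \hat\sigma^2 - \sigma^2)\big) \to N\big(0,\ \mathrm{diag}(\sigma^2,\ \sigma^2[E(ZZ^T)]^{-1},\ 2\sigma^4)\big). \tag{6.2.24}$$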
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of $H_0$ known plays no role in the limiting result for $\hat\alpha, \hat\beta, \hat\sigma^2$. Of course, these will only be the MLEs if $H_0$ depends only on parameters other than $(\alpha, \beta, \sigma^2)$. In this case we can estimate $E(ZZ^T)$ by $\frac{1}{n}\sum_{i=1}^n Z_i Z_i^T$ and give approximate confidence intervals for $\beta_j$, $j = 1, \ldots, p$.

An interesting feature of (6.2.23) is that because $I(\theta)$ is a block diagonal matrix so is $I^{-1}(\theta)$ and, consequently, $\hat\beta$ and $\hat\sigma^2$ are asymptotically independent. In the classical linear model of Section 6.1 where we perform inference conditionally given $Z_i = z_i$, $1 \le i \le n$, we have noted this is exactly true.

This is an example of the phenomenon of adaptation. If we knew $\sigma^2$, the MLE would still be $\hat\beta$ and its asymptotic variance optimal for this model. If we knew $\alpha$ and $\beta$, $\hat\sigma^2$ would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, $\hat\sigma^2$ would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model $\mathcal{P} = \{P_{(\theta,\eta)} : \theta \in \Theta, \eta \in \mathcal{E}\}$ we say we can estimate $\theta$ adaptively at $\eta_0$ if the asymptotic variance of the MLE $\hat\theta$ (or more generally, an efficient estimate of $\theta$) in the pair $(\hat\theta, \hat\eta)$ is the same as that of $\hat\theta(\eta_0)$, the efficient estimate for $\mathcal{P}_{\eta_0} = \{P_{(\theta,\eta_0)} : \theta \in \Theta\}$. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating $\beta_1$ in the presence of $\alpha, (\beta_2, \ldots, \beta_p)$ with

(i) $\alpha, \beta_2, \ldots, \beta_p$ known.

(ii) $\beta$ arbitrary.

In case (i), we take, without loss of generality, $\alpha = \beta_2 = \cdots = \beta_p = 0$. Let $Z_i = (Z_{i1}, \ldots, Z_{ip})^T$, then the efficient estimate in case (i) is
$$\hat\beta_1^0 = \frac{\sum_{i=1}^n Z_{i1}Y_i}{\sum_{i=1}^n Z_{i1}^2} \tag{6.2.25}$$
with asymptotic variance $\sigma^2[EZ_1^2]^{-1}$. On the other hand, $\hat\beta_1$ is the first coordinate of $\hat\beta$ given by (6.2.20). Its asymptotic variance is the (1,1) element of $\sigma^2[EZZ^T]^{-1}$, which is strictly bigger than $\sigma^2[EZ_1^2]^{-1}$ unless $[EZZ^T]^{-1}$ is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate $\beta_1$ adaptively if $\beta_2, \ldots, \beta_p$ are regarded as nuisance parameters. What is happening can be seen by a representation of $[\tilde Z_{(n)}^T\tilde Z_{(n)}]^{-1}\tilde Z_{(n)}^T Y$ and $I^{11}(\theta)$ where $I^{-1}(\theta) \equiv \|I^{ij}(\theta)\|$. We claim that
$$\hat\beta_1 = \frac{\sum_{i=1}^n (Z_{i1} - \hat Z_i^{(1)})Y_i}{\sum_{i=1}^n (Z_{i1} - \hat Z_i^{(1)})^2} \tag{6.2.26}$$
where $\hat Z^{(1)}$ is the regression of $(Z_{11}, \ldots, Z_{1n})^T$ on the linear space spanned by $(Z_{j1}, \ldots, Z_{jn})^T$, $2 \le j \le p$. Similarly,
$$I^{11}(\theta) = \sigma^2 / E\big(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1})\big)^2 \tag{6.2.27}$$
where $\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1})$ is the projection of $Z_{11}$ on the linear span of $Z_{21}, \ldots, Z_{p1}$ (Problem 6.2.11). Thus, $\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}) = \sum_{j=2}^p a_j^* Z_{j1}$ where $(a_2^*, \ldots, a_p^*)$ minimizes $E(Z_{11} - \sum_{j=2}^p a_j Z_{j1})^2$ over $(a_2, \ldots, a_p) \in R^{p-1}$ (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing $\beta_2, \ldots, \beta_p$ when the variables $Z_2, \ldots, Z_p$ are in any way correlated with $Z_1$ and the price is measured by
$$\frac{[E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2]^{-1}}{[E(Z_{11}^2)]^{-1}} = \left(1 - \frac{E(\Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2}{E(Z_{11}^2)}\right)^{-1}. \tag{6.2.28}$$
In the extreme case of perfect collinearity the price is $\infty$ as it should be because $\beta_1$ then becomes unidentifiable. Thus, adaptation corresponds to the case where $(Z_2, \ldots, Z_p)$ have no value in predicting $Z_1$ linearly (see Section 1.4). Correspondingly in the Gaussian linear model (6.1.3) conditional on the $Z_i$, $i = 1, \ldots, n$, $\hat\beta_1$ is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if $E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \ldots, Z_{p1}))^2 = 0$. □
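The price in (6.2.28) is easy to evaluate in the smallest nontrivial case. The following is a minimal sketch for $p = 2$, assuming centered covariates whose second moments are supplied by the user; the function names and the moment arguments `m11`, `m12`, `m22` are illustrative and not part of the text. It computes the ratio once from the (1,1) element of $[E(ZZ^T)]^{-1}$ and once from the projection form on the right of (6.2.28).

```python
def price_from_inverse(m11, m12, m22):
    """Left side of (6.2.28) for p = 2.

    m11 = E(Z1^2), m12 = E(Z1 Z2), m22 = E(Z2^2) are assumed inputs.
    Returns the (1,1) element of [E(Z Z^T)]^{-1} divided by [E(Z1^2)]^{-1}.
    """
    det = m11 * m22 - m12 ** 2
    if det <= 0.0:
        return float("inf")  # perfect collinearity: beta_1 is unidentifiable
    inv11 = m22 / det        # (1,1) element of the inverse of a 2 x 2 matrix
    return inv11 * m11


def price_from_projection(m11, m12, m22):
    """Right side of (6.2.28) for p = 2.

    The projection of Z1 on Z2 is a* Z2 with a* = m12 / m22, so
    E[Pi(Z1 | Z2)^2] = m12^2 / m22.
    """
    proj_sq = m12 ** 2 / m22
    if proj_sq >= m11:
        return float("inf")
    return 1.0 / (1.0 - proj_sq / m11)


if __name__ == "__main__":
    # Unit variances and correlation rho: the price is 1 / (1 - rho^2).
    for rho in (0.0, 0.5, 0.9):
        print(rho, price_from_inverse(1.0, rho, 1.0), price_from_projection(1.0, rho, 1.0))
```

With unit variances both functions reduce to $1/(1 - \rho^2)$, so the price is 1 (adaptation) exactly when the correlation is zero and grows without bound as $|\rho| \to 1$.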

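Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the $\epsilon_i$ in (6.2.19) are i.i.d. but not necessarily Gaussian with density $\frac{1}{\sigma}f_0\left(\frac{x}{\sigma}\right)$, for instance,
$$f_0(x) = \frac{e^{-x}}{(1 + e^{-x})^2},$$
the logistic density. Such error densities have the often more realistic, heavier tails$^{(1)}$ than the Gaussian density. The estimates $\hat\beta_0, \hat\sigma_0$ now solve
$$\sum_{i=1}^n Z_{ij}\,\psi\!\left(\hat\sigma_0^{-1}\Big[Y_i - \sum_{k=1}^p Z_{ik}\hat\beta_{k0}\Big]\right) = 0, \quad 1 \le j \le p,$$
and
$$\sum_{i=1}^n \chi\!\left(\hat\sigma_0^{-1}\Big[Y_i - \sum_{k=1}^p Z_{ik}\hat\beta_{k0}\Big]\right) = 0$$
where $\psi = -\dfrac{f_0'}{f_0}$, $\chi(y) = -\left(y\dfrac{f_0'}{f_0}(y) + 1\right)$, $\hat\beta_0 \equiv (\hat\beta_{10}, \ldots, \hat\beta_{p0})^T$. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) $\log f_0$ is strictly concave, i.e., $\dfrac{f_0'}{f_0}$ is strictly decreasing.

(ii) $(\log f_0)''$ exists and is bounded.

Then, if further $f_0$ is symmetric about 0,
$$I(\theta) = \sigma^{-2}I(\beta^T, 1) = \sigma^{-2}\begin{pmatrix} c_1 E(ZZ^T) & 0 \\ 0 & c_2 \end{pmatrix} \tag{6.2.29}$$
where $c_1 = \int \left(\dfrac{f_0'}{f_0}(x)\right)^2 f_0(x)\,dx$, $c_2 = \int \left(x\dfrac{f_0'}{f_0}(x) + 1\right)^2 f_0(x)\,dx$. Thus, $\hat\beta_0, \hat\sigma_0$ are optimal estimates of $\beta$ and $\sigma$ in the sense of Theorem 6.2.2 if $f_0$ is true.

Now suppose $f_0$ generating the estimates $\hat\beta_0$ and $\hat\sigma_0^2$ is symmetric and satisfies (i) and (ii) but the true error distribution has density $f$ possibly different from $f_0$. Under suitable conditions we can apply Theorem 6.2.1 with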
$$\Psi(Z, Y, \beta, \sigma) = (\psi_1, \ldots, \psi_p, \psi_{p+1})^T(Z, Y, \beta, \sigma)$$
where
$$\psi_j(z, y, \beta, \sigma) = \frac{z_j}{\sigma}\,\psi\!\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right), \quad 1 \le j \le p,$$
$$\psi_{p+1}(z, y, \beta, \sigma) = \frac{1}{\sigma}\,\chi\!\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right) \tag{6.2.30}$$
to conclude that
$$\mathcal{L}_P\big(\sqrt{n}(\hat\beta_0 - \beta_0)\big) \to N(0, \Sigma(\Psi, P)), \qquad \mathcal{L}_P\big(\sqrt{n}(\hat\sigma_0 - \sigma_0)\big) \to N(0, \sigma^2(P))$$
where $\beta_0, \sigma_0$ solve
$$\int \Psi(z, y, \beta_0, \sigma_0)\,dP = 0 \tag{6.2.31}$$
and $\Sigma(\Psi, P)$ is as in (6.2.6). What is the relation between $\beta_0, \sigma_0$ and $\beta, \sigma$ given in the Gaussian model (6.2.19)? If $f_0$ is symmetric about 0 and the solution of (6.2.31) is unique, then $\beta_0 = \beta$. But $\sigma_0 = c(f_0)\sigma$ for some $c(f_0)$ typically different from one. Thus, $\hat\beta_0$ can be used for estimating $\beta$ although if the true distribution of the $\epsilon_i$ is $N(0, \sigma^2)$ it should perform less well than $\hat\beta$. On the other hand, $\hat\sigma_0$ is an estimate of $\sigma$ only if normalized by a constant depending on $f_0$. (See Problem 6.2.5.) These are issues of robustness, that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded $\Psi = (\psi_1, \ldots, \psi_p)^T$ to estimate $\beta$ even though it is suboptimal when $\epsilon \sim N(0, \sigma^2)$, and to use a suitably normalized version of $\hat\sigma_0$ for the same purpose. One effective choice of $\psi_j$ is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
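For the logistic density of Example 6.2.2 the function $\psi = -f_0'/f_0$ has a closed form: since $\log f_0(x) = -x - 2\log(1 + e^{-x})$, differentiation gives $\psi(x) = 1 - 2/(1 + e^x) = \tanh(x/2)$, which is bounded, in contrast to the Gaussian case where $\psi(x) = x$. The sketch below checks this identity numerically; the function names and the finite-difference step are illustrative choices, not part of the text.

```python
import math


def logistic_density(x):
    """f0(x) = e^{-x} / (1 + e^{-x})^2, written symmetrically to avoid overflow."""
    e = math.exp(-abs(x))  # valid because f0 is symmetric about 0
    return e / (1.0 + e) ** 2


def psi_logistic(x):
    """Closed form of psi = -f0'/f0 for the logistic density."""
    return math.tanh(x / 2.0)


def psi_numeric(x, h=1e-5):
    """Central-difference approximation to -(d/dx) log f0(x)."""
    upper = math.log(logistic_density(x + h))
    lower = math.log(logistic_density(x - h))
    return -(upper - lower) / (2.0 * h)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, psi_logistic(x), psi_numeric(x))
```

Because $|\tanh| \le 1$, a single wild residual contributes at most a bounded amount to the estimating equations for $\beta$, which is the robustness property discussed above.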
Testing and Confidence Bounds

There are three principal approaches to testing hypotheses in multiparameter models, the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in performance and computationally, for fixed $n$. Confidence regions that parallel the tests will also be developed in Section 6.3.

Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters such as $H : \theta_1 \le 0$ versus $K : \theta_1 > 0$ where $\theta_2, \ldots, \theta_p$ vary freely.

6.2.3 The Posterior Distribution in the Multiparameter Case

The asymptotic theory of the posterior distribution parallels that in the one-dimensional case exactly. We simply make $\theta$ a vector, and interpret $|\cdot|$ as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain

Theorem 6.2.3. If the multivariate versions of A0–A3, A4(a.s.), A5(a.s.) and A6–A8 hold then, if $\hat\theta$ denotes the MLE,
$$\mathcal{L}\big(\sqrt{n}(\theta - \hat\theta) \mid X_1, \ldots, X_n\big) \to N(0, I^{-1}(\theta)) \tag{6.2.32}$$
a.s. under $P_\theta$ for all $\theta$.

The consequences of Theorem 6.2.3 are the same as those of Theorem 5.5.2, the equivalence of Bayesian and frequentist optimality asymptotically. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations.

A new major issue that arises is computation. Although it is easy to write down the posterior density of $\theta$, $\pi(\theta)\prod_{i=1}^n p(X_i, \theta)$, up to the proportionality constant $\int_\Theta \pi(t)\prod_{i=1}^n p(X_i, t)\,dt$, the latter can pose a formidable problem if $p > 2$, say. The problem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say $(\theta_1, \theta_2)$, because we then need to integrate out $(\theta_3, \ldots, \theta_p)$. The asymptotic theory we have developed permits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method). We have implicitly done this in the calculations leading up to (5.5.19). This approach is refined in Kass, Kadane, and Tierney (1989). However, typically there is an attempt at "exact" calculation. A class of Monte Carlo based methods derived from statistical physics loosely called Markov chain Monte Carlo has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.

Summary. We defined minimum contrast (MC) and M estimates in the case of $p$-dimensional parameters and established their convergence in law to a normal distribution. When the estimating equations defining the M estimates coincide with the likelihood equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MC or M estimate if we know the correct model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and use the MLE for this model. We use an example to introduce the concept of adaptation in which an estimate $\hat\theta$ is called adaptive for a model $\{P_{\theta,\eta} : \theta \in \Theta, \eta \in \mathcal{E}\}$ if the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$ has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of $\theta$ in the family of models $\{P_{\theta,\eta_0} : \theta \in \Theta\}$, $\eta_0$ specified. In linear regression, adaptive estimation of $\beta_1$ is possible iff $Z_1$ is uncorrelated with every linear function of $Z_2, \ldots, Z_p$. Another example deals with M estimates based on estimating equations generated by linear models with non-Gaussian error distribution. Finally we show that in the Bayesian framework where given $\theta$, $X_1, \ldots, X_n$ are i.i.d. $P_\theta$, if $\hat\theta$ denotes the MLE for $P_\theta$, then the posterior distribution of $\sqrt{n}(\theta - \hat\theta)$ converges a.s. under $P_\theta$ to the $N(0, I^{-1}(\theta))$ distribution.

6.3 LARGE SAMPLE TESTS AND CONFIDENCE REGIONS

In Section 6.1 we developed exact tests and confidence regions that are appropriate in regression and analysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, we need methods for situations in which, as in the linear model, covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to be appropriate approximations. In these cases exact methods are typically not available, and we turn to asymptotic approximations to construct tests, confidence regions, and other methods of inference. We present three procedures that are used frequently: likelihood ratio, Wald and Rao large sample tests, and confidence procedures. These were treated for $\theta$ real in Section 5.4.4. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4.4 to vector-valued parameters.

6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic

In Sections 4.9 and 6.1 we considered the likelihood ratio test statistic,
$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}$$
for testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$, $\Theta_1 = \Theta - \Theta_0$, and showed that in several statistical models involving normal distributions, $\lambda(x)$ simplified and produced intuitive tests whose critical values can be obtained from the Student $t$ and $\mathcal{F}$ distributions.

However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically.
In such cases we can turn to an approximation to the distribution of $\lambda(X)$ based on asymptotic theory, which is usually referred to as Wilks's theorem or approximation. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations. Here is an example in which Wilks's approximation to $\mathcal{L}(\lambda(X))$ is useful:

Example 6.3.1. Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. as $X$ where $X$ has the gamma, $\Gamma(\alpha, \beta)$, distribution with density
$$p(x; \theta) = \beta^\alpha x^{\alpha - 1}\exp\{-\beta x\}/\Gamma(\alpha); \quad x > 0;\ \alpha > 0,\ \beta > 0.$$
In Example 2.3.2 we showed that the MLE, $\hat\theta = (\hat\alpha, \hat\beta)$, exists and in Example 2.4.2 we showed how to find $\hat\theta$ as a nonexplicit solution of likelihood equations. Thus, the numerator of $\lambda(x)$ is available as $p(x, \hat\theta) = \prod_{i=1}^n p(x_i, \hat\theta)$. Suppose we want to test $H : \alpha = 1$ (exponential distribution) versus $K : \alpha \ne 1$. The MLE of $\beta$ under $H$ is readily seen from (2.3.5) to be $\hat\beta_0 = 1/\bar x$ and $p(x; 1, \hat\beta_0)$ is the denominator of the likelihood ratio statistic. It remains to find the critical value. This is not available analytically. □

The approximation we shall give is based on the result "$2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_d$" for degrees of freedom $d$ to be specified later. We next give an example that can be viewed as the limiting situation for which the approximation is exact:

Example 6.3.2. The Gaussian Linear Model with Known Variance. Let $Y_1, \ldots, Y_n$ be independent with $Y_i \sim N(\mu_i, \sigma_0^2)$ where $\sigma_0$ is known. As in Section 6.1.3 we test whether $\mu = (\mu_1, \ldots, \mu_n)^T$ is a member of a $q$-dimensional linear subspace $\omega_0$ of $R^n$, versus the alternative that $\mu \in \omega - \omega_0$ where $\omega$ is an $r$-dimensional linear subspace of $R^n$ and $\omega \supset \omega_0$; and we transform to canonical form by setting
$$U = A_{n \times n}Y$$
where $A_{n \times n}$ is an orthogonal matrix with rows $v_1^T, \ldots, v_n^T$, such that $v_1, \ldots, v_q$ span $\omega_0$ and $v_1, \ldots, v_r$ span $\omega$. Set $\theta_i = \eta_i/\sigma_0$, $i = 1, \ldots, r$, where $\eta_i = E(U_i)$, and $X_i = U_i/\sigma_0$, $i = 1, \ldots, n$. Then $X_i \sim N(\theta_i, 1)$, $i = 1, \ldots, r$ and $X_i \sim N(0, 1)$, $i = r + 1, \ldots, n$. Moreover, the hypothesis $H$ is equivalent to $H : \theta_{q+1} = \cdots = \theta_r = 0$. Using Section 6.1.3, we conclude that under $H$,
$$2\log\lambda(Y) = \sum_{i=q+1}^r X_i^2 \sim \chi^2_{r-q}.$$
Wilks's theorem states that, under regularity conditions, when testing whether a parameter vector is restricted to an open subset of $R^q$ or $R^r$, $q < r$, the $\chi^2_{r-q}$ distribution is an approximation to $\mathcal{L}(2\log\lambda(Y))$. In this $\sigma^2$ known example, Wilks's approximation is exact. □

We illustrate the remarkable fact that $\chi^2_{r-q}$ holds as an approximation to the null distribution of $2\log\lambda$ quite generally when the hypothesis is a nice $q$-dimensional submanifold of an $r$-dimensional parameter space with the following.

Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If $Y_i$ are as in Example 6.3.2 but $\sigma^2$ is unknown then $\theta = (\mu, \sigma^2)$ ranges over an $r + 1$-dimensional manifold whereas under $H$, $\theta$ ranges over a $q + 1$-dimensional manifold. In Section 6.1.3, we derived
$$2\log\lambda(Y) = n\log\left(1 + \frac{\sum_{i=q+1}^r X_i^2}{\sum_{i=r+1}^n X_i^2}\right).$$
Apply Example 5.3.7 to $V_n = \sum_{i=q+1}^r X_i^2 \big/ n^{-1}\sum_{i=r+1}^n X_i^2$ and conclude that $V_n \xrightarrow{\mathcal{L}} \chi^2_{r-q}$. Finally apply Lemma 5.3.2 with $g(t) = \log(1 + t)$, $a_n = n$, $c = 0$ and conclude that $2\log\lambda(Y) \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ also in the $\sigma^2$ unknown case. Note that for $\lambda(Y)$ defined in Remark 6.1.2, $2\log\lambda(Y) = V_n \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ as well. □

Consider the general i.i.d. case with $X_1, \ldots, X_n$ a sample from $p(x, \theta)$, where $x \in \mathcal{X} \subset R^s$, and $\theta \in \Theta \subset R^r$. Write the log likelihood as
$$l_n(\theta) = \sum_{i=1}^n \log p(X_i, \theta).$$
We first consider the simple hypothesis $H : \theta = \theta_0$.

Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under $H : \theta = \theta_0$,
$$2\log\lambda(X) = 2[l_n(\hat\theta_n) - l_n(\theta_0)] \xrightarrow{\mathcal{L}} \chi^2_r.$$

Proof. Because $\hat\theta_n$ solves the likelihood equation $D_\theta l_n(\theta) = 0$, where $D_\theta$ is the derivative with respect to $\theta$, an expansion of $l_n(\theta)$ about $\hat\theta_n$ evaluated at $\theta = \theta_0$ gives
$$2[l_n(\hat\theta_n) - l_n(\theta_0)] = n(\hat\theta_n - \theta_0)^T I_n(\theta_n^*)(\hat\theta_n - \theta_0) \tag{6.3.1}$$
for some $\theta_n^*$ with $|\theta_n^* - \hat\theta_n| \le |\hat\theta_n - \theta_0|$. Here
$$I_n(\theta) = \left\| -\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta_k\,\partial\theta_j}\log p(X_i, \theta) \right\|_{r \times r}.$$
By Theorem 6.2.2, $\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta_0))$, where $I_{r \times r}(\theta)$ is the Fisher information matrix. Because
$$|\theta_n^* - \theta_0| \le |\theta_n^* - \hat\theta_n| + |\hat\theta_n - \theta_0| \le 2|\hat\theta_n - \theta_0|,$$
we can conclude arguing from A3 and A4 that
$$I_n(\theta_n^*) \xrightarrow{P} E I_n(\theta_0) = I(\theta_0).$$
Hence,
$$2[l_n(\hat\theta_n) - l_n(\theta_0)] \xrightarrow{\mathcal{L}} V^T I(\theta_0)V, \quad V \sim N(0, I^{-1}(\theta_0)). \tag{6.3.2}$$
The result follows because, by Corollary B.6.2, $V^T I(\theta_0)V \sim \chi^2_r$. □

As a consequence of the theorem, the test that rejects $H : \theta = \theta_0$ when
$$2\log\lambda(X) \ge x_r(1 - \alpha),$$
where $x_r(1 - \alpha)$ is the $1 - \alpha$ quantile of the $\chi^2_r$ distribution, has approximately level $\alpha$, and
$$\{\theta_0 : 2[l_n(\hat\theta_n) - l_n(\theta_0)] \le x_r(1 - \alpha)\} \tag{6.3.3}$$
is a confidence region for $\theta$ with approximate coverage probability $1 - \alpha$.

Next we turn to the more general hypothesis $H : \theta \in \Theta_0$, where $\Theta$ is open and $\Theta_0$ is the set of $\theta \in \Theta$ with $\theta_j = \theta_{0,j}$, $j = q + 1, \ldots, r$, and $\{\theta_{0,j}\}$ are specified values. Examples 6.3.1 and 6.3.2 illustrate such $\Theta_0$. We set $d = r - q$, $\theta^T = (\theta^{(1)}, \theta^{(2)})$, $\theta^{(1)} = (\theta_1, \ldots, \theta_q)^T$, $\theta^{(2)} = (\theta_{q+1}, \ldots, \theta_r)^T$, $\theta_0^{(2)} = (\theta_{0,q+1}, \ldots, \theta_{0,r})^T$.

Theorem 6.3.2. Suppose that the assumptions of Theorem 6.2.2 hold for $p(x, \theta)$, $\theta \in \Theta$. Let $\mathcal{P}_0$ be the model $\{P_\theta : \theta \in \Theta_0\}$ with corresponding parametrization $\theta^{(1)} = (\theta_1, \ldots, \theta_q)$. Suppose that $\hat\theta_0^{(1)}$ is the MLE of $\theta^{(1)}$ under $H$ and that $\hat\theta_0^{(1)}$ satisfies A6 for $\mathcal{P}_0$. Let $\hat\theta_{0,n}^T = (\hat\theta_0^{(1)}, \theta_0^{(2)})$. Then under $H : \theta \in \Theta_0$,
$$2\log\lambda(X) \equiv 2[l_n(\hat\theta_n) - l_n(\hat\theta_{0,n})] \xrightarrow{\mathcal{L}} \chi^2_{r-q}.$$

Proof. Let $\theta_0 \in \Theta_0$ and write
$$2\log\lambda(X) = 2[l_n(\hat\theta_n) - l_n(\theta_0)] - 2[l_n(\hat\theta_{0,n}) - l_n(\theta_0)]. \tag{6.3.4}$$
It is easy to see that A0–A6 for $\mathcal{P}$ imply A0–A5 for $\mathcal{P}_0$. By (6.2.10) and (6.3.1) applied to $\hat\theta_n$ and the corresponding argument applied to $\hat\theta_0^{(1)}$, $\hat\theta_{0,n}$ and (6.3.4),
$$2\log\lambda(X) = S^T(\theta_0)I^{-1}(\theta_0)S(\theta_0) - S_1^T(\theta_0)I_0^{-1}(\theta_0)S_1(\theta_0) + o_p(1) \tag{6.3.5}$$
where
$$S(\theta_0) = n^{-1/2}\sum_{i=1}^n Dl(X_i, \theta_0)$$
and $S = (S_1, S_2)^T$ where $S_1$ is the first $q$ coordinates of $S$. Furthermore,
$$I_0(\theta_0) = \mathrm{Var}_{\theta_0}S_1(\theta_0).$$
Make a change of parameter, for given true $\theta_0$ in $\Theta_0$,
$$\eta = M(\theta - \theta_0)$$
where, dropping the dependence on $\theta_0$,
$$M = P I^{1/2} \tag{6.3.6}$$
and $P$ is an orthogonal matrix such that, if $\Delta_0 \equiv \{\theta - \theta_0 : \theta \in \Theta_0\}$,
$$M\Delta_0 = \{\eta : \eta_{q+1} = \cdots = \eta_r = 0,\ \eta \in M\Theta\}.$$
Such $P$ exists by the argument given in Example 6.2.1 because $I^{1/2}\Delta_0$ is the intersection of a $q$-dimensional linear subspace of $R^r$ with $I^{1/2}\{\theta - \theta_0 : \theta \in \Theta\}$. Now write $D_\theta$ for differentiation with respect to $\theta$ and $D_\eta$ for differentiation with respect to $\eta$. Note that, by definition, $\lambda$ is invariant under reparametrization
$$\lambda(X) = \gamma(X) \tag{6.3.7}$$
where
$$\gamma(X) = \frac{\sup_\eta\{p(x, \theta_0 + M^{-1}\eta)\}}{\sup\{p(x, \theta_0 + M^{-1}\eta) : \theta_0 + M^{-1}\eta \in \Theta_0\}}$$
and from (B.8.13)
$$D_\eta l(x, \theta_0 + M^{-1}\eta) = [M^{-1}]^T D_\theta l(x, \theta_0 + M^{-1}\eta). \tag{6.3.8}$$
We deduce from (6.3.6) and (6.3.8) that if
$$T(\eta) \equiv n^{-1/2}\sum_{i=1}^n D_\eta l(X_i, \theta_0 + M^{-1}\eta),$$
then
$$\mathrm{Var}\,T(0) = P I^{-1/2}\, I\, I^{-1/2}P^T = J. \tag{6.3.9}$$
Moreover, because in terms of $\eta$, $H$ is $\{\eta \in M\Theta : \eta_{q+1} = \cdots = \eta_r = 0\}$, then by applying (6.3.5) to $\gamma(X)$ we obtain,
$$2\log\gamma(X) = T^T(0)T(0) - T_1^T(0)T_1(0) + o_p(1) = \sum_{i=1}^r T_i^2(0) - \sum_{i=1}^q T_i^2(0) + o_p(1) = \sum_{i=q+1}^r T_i^2(0) + o_p(1), \tag{6.3.10}$$
which has a limiting $\chi^2_{r-q}$ distribution by Slutsky's theorem because $T(0)$ has a limiting $N_r(0, J)$ distribution by (6.3.9). The result follows from (6.3.7). □

Note that this argument is simply an asymptotic version of the one given in Example 6.3.2.

Thus, under the conditions of Theorem 6.3.2, rejecting if $2\log\lambda(X) \ge x_{r-q}(1 - \alpha)$ is an asymptotically level $\alpha$ test of $H : \theta \in \Theta_0$. Of equal importance is that we obtain an asymptotic confidence region for $(\theta_{q+1}, \ldots, \theta_r)$, a piece of $\theta$, with $(\theta_1, \ldots, \theta_q)$ acting as nuisance parameters. This asymptotic level $1 - \alpha$ confidence region is
$$\{(\theta_{q+1}, \ldots, \theta_r) : 2[l_n(\hat\theta_n) - l_n(\hat\theta_{0,1}, \ldots, \hat\theta_{0,q}, \theta_{q+1}, \ldots, \theta_r)] \le x_{r-q}(1 - \alpha)\} \tag{6.3.11}$$
where $\hat\theta_{0,1}, \ldots, \hat\theta_{0,q}$ are the MLEs, themselves depending on $\theta_{q+1}, \ldots, \theta_r$, of $\theta_1, \ldots, \theta_q$ assuming that $\theta_{q+1}, \ldots, \theta_r$ are known.

More complicated linear hypotheses such as $H : \theta - \theta_0 \in \omega_0$ where $\omega_0$ is a linear space of dimension $q$ are also covered. We only need note that if $\omega_0$ is a linear space spanned by an orthogonal basis $v_1, \ldots, v_q$ and $v_{q+1}, \ldots, v_r$ are orthogonal to $\omega_0$ and $v_1, \ldots, v_r$ span $R^r$ then,
$$\omega_0 = \{\theta : \theta^T v_j = 0,\ q + 1 \le j \le r\}. \tag{6.3.12}$$
The extension of Theorem 6.3.2 to this situation is easy and given in Problem 6.3.2.

The formulation of Theorem 6.3.2 is still inadequate for most applications. It can be extended as follows. Suppose $H$ is specified by: There exist $d$ functions, $g_j : \Theta \to R$, $q + 1 \le j \le r$, written as a vector $g$, such that $Dg(\theta)$ exists and is of rank $r - q$ at all $\theta \in \Theta$. Define $H : \theta \in \Theta_0$ with
$$\Theta_0 = \{\theta \in \Theta : g(\theta) = 0\}. \tag{6.3.13}$$
Evidently, Theorem 6.3.2 falls under this schema with $g_j(\theta) = \theta_j - \theta_{0,j}$, $q + 1 \le j \le r$. Examples such as testing for independence in contingency tables, which require the following general theorem, will appear in the next section.

Theorem 6.3.3. Suppose the assumptions of Theorem 6.3.2 and the previously stated conditions on $g$ hold. Suppose the MLE $\hat\theta_{0,n}$ under $H$ is consistent for all $\theta \in \Theta_0$. Then, if $\lambda(X)$ is the likelihood ratio statistic for $H : \theta \in \Theta_0$ given in (6.3.13), $2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_{r-q}$ under $H$.

The proof is sketched in Problems (6.3.2)–(6.3.3). The essential idea is that, if $\theta_0$ is true, $\lambda(X)$ behaves asymptotically like a test for $H : \theta \in \Theta_{00}$ where
$$\Theta_{00} = \{\theta \in \Theta : Dg(\theta_0)(\theta - \theta_0) = 0\} \tag{6.3.14}$$
a hypothesis of the form (6.3.13).

Wilks's theorem depends critically on the fact that not only is $\Theta$ open but that if $\Theta_0$ is given in (6.3.13) then the set $\{(\theta_1, \ldots, \theta_q)^T : \theta \in \Theta\}$ is open in $R^q$. We need both properties because we need to analyze both the numerator and denominator of $\lambda(X)$. As an example of what can go wrong, let $(X_{i1}, X_{i2})$ be i.i.d. $N(\theta_1, \theta_2, J)$, where $J$ is the $2 \times 2$ identity matrix and $\Theta_0 = \{\theta : \theta_1 + \theta_2 \le 1\}$. If $\theta_1 + \theta_2 = 1$,
$$\hat\theta_0 = \left(\frac{\bar X_{\cdot 1} - \bar X_{\cdot 2}}{2} + \frac{1}{2},\ \frac{1}{2} - \frac{\bar X_{\cdot 1} - \bar X_{\cdot 2}}{2}\right)$$
and $2\log\lambda(X) \xrightarrow{\mathcal{L}} \chi^2_1$ but if $\theta_1 + \theta_2 < 1$ clearly $2\log\lambda(X) = o_p(1)$. Here the dimension of $\Theta_0$ and $\Theta$ is the same but the boundary of $\Theta_0$ has lower dimension. More sophisticated examples are given in Problems 6.3.5 and 6.3.6.
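The gamma test of Example 6.3.1 is a case of Theorem 6.3.2 with $r = 2$, $q = 1$, so $2\log\lambda$ is referred to $\chi^2_1$. The following is a minimal sketch of that computation using only the Python standard library. It assumes the profile form of the likelihood, in which for fixed $\alpha$ the maximizing $\beta$ is $\alpha/\bar x$, and it locates $\hat\alpha$ by a golden-section search rather than by the likelihood equations of Example 2.4.2; the function names, search bounds, and data are hypothetical.

```python
import math

CHI2_1_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 df


def profile_loglik(alpha, xs):
    """Gamma log likelihood at (alpha, beta) with beta = alpha / mean(xs)."""
    n = len(xs)
    xbar = sum(xs) / n
    mean_log = sum(math.log(x) for x in xs) / n
    beta = alpha / xbar
    return n * (alpha * math.log(beta) - math.lgamma(alpha)
                + (alpha - 1.0) * mean_log - beta * xbar)


def gamma_lr_test(xs, lo=1e-3, hi=1e3, iters=200):
    """Return (alpha_hat, 2 log lambda) for H: alpha = 1 versus K: alpha != 1."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)      # search on the log scale of alpha
    for _ in range(iters):
        c = b - g * (b - a)
        d = a + g * (b - a)
        if profile_loglik(math.exp(c), xs) < profile_loglik(math.exp(d), xs):
            a = c                           # maximizer lies to the right of c
        else:
            b = d                           # maximizer lies to the left of d
    alpha_hat = math.exp((a + b) / 2.0)
    # Under H the MLE of beta is 1 / mean(xs), i.e. the profile at alpha = 1.
    stat = 2.0 * (profile_loglik(alpha_hat, xs) - profile_loglik(1.0, xs))
    return alpha_hat, max(stat, 0.0)


if __name__ == "__main__":
    data = [0.8, 1.1, 0.9, 1.0, 1.2, 0.95, 1.05, 1.0]  # hypothetical sample
    alpha_hat, stat = gamma_lr_test(data)
    print(alpha_hat, stat, stat >= CHI2_1_95)
```

The search relies on the profile log likelihood being unimodal in $\alpha$, which holds because $\log\alpha$ minus the digamma function is decreasing. For samples as small as the hypothetical one here the $\chi^2_1$ reference distribution is only a rough guide.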
6.3.2 Wald's and Rao's Large Sample Tests

The Wald Test

Suppose that the assumptions of Theorem 6.2.2 hold. Then
$$\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{\mathcal{L}} N(0, I^{-1}(\theta)) \text{ as } n \to \infty. \tag{6.3.15}$$
Because $I(\theta)$ is continuous in $\theta$ (Problem 6.3.10), it follows from Proposition B.7.1(a) that
$$I(\hat\theta_n) \xrightarrow{P} I(\theta) \text{ as } n \to \infty. \tag{6.3.16}$$
By Slutsky's theorem B.7.2, (6.3.15) and (6.3.16),
$$n(\hat\theta_n - \theta)^T I(\hat\theta_n)(\hat\theta_n - \theta) \xrightarrow{\mathcal{L}} V^T I(\theta)V, \quad V \sim N_r(0, I^{-1}(\theta))$$
where, according to Corollary B.6.2, $V^T I(\theta)V \sim \chi^2_r$. It follows that the Wald test that rejects $H : \theta = \theta_0$ in favor of $K : \theta \ne \theta_0$ when
$$W_n(\theta_0) = n(\hat\theta_n - \theta_0)^T I(\theta_0)(\hat\theta_n - \theta_0) \ge x_r(1 - \alpha)$$
has asymptotic level $\alpha$. More generally $I(\theta_0)$ can be replaced by any consistent estimate of $I(\theta_0)$, in particular $-\frac{1}{n}D^2 l_n(\theta_0)$ or $I(\hat\theta_n)$ or $-\frac{1}{n}D^2 l_n(\hat\theta_n)$. The last Hessian choice is favored because it is usually computed automatically with the MLE. It and $I(\hat\theta_n)$ also have the advantage that the confidence region one generates $\{\theta : W_n(\theta) \le x_r(1 - \alpha)\}$ is an ellipsoid in $R^r$ easily interpretable and computable; see (6.1.31).

For the more general hypothesis $H : \theta \in \Theta_0$ we write the MLE for $\theta \in \Theta$ as $\hat\theta_n = (\hat\theta_n^{(1)}, \hat\theta_n^{(2)})$ where $\hat\theta_n^{(1)} = (\hat\theta_1, \ldots, \hat\theta_q)$ and $\hat\theta_n^{(2)} = (\hat\theta_{q+1}, \ldots, \hat\theta_r)$ and define the Wald statistic as
$$W_n(\theta_0^{(2)}) = n(\hat\theta_n^{(2)} - \theta_0^{(2)})^T[I^{22}(\hat\theta_n)]^{-1}(\hat\theta_n^{(2)} - \theta_0^{(2)}) \tag{6.3.17}$$
where $I^{22}(\theta)$ is the lower diagonal block of $I^{-1}(\theta)$ written as
$$I^{-1}(\theta) = \begin{pmatrix} I^{11}(\theta) & I^{12}(\theta) \\ I^{21}(\theta) & I^{22}(\theta) \end{pmatrix}$$
with diagonal blocks of dimension $q \times q$ and $d \times d$, respectively. More generally, $I^{22}(\hat\theta_n)$ is replaceable by any consistent estimate of $I^{22}(\theta)$, for instance, the lower diagonal block of the inverse of $-\frac{1}{n}D^2 l_n(\hat\theta_n)$, the Hessian (Problem 6.3.9).

Theorem 6.3.4. Under the conditions of Theorem 6.2.2, if $H$ is true,
$$W_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_{r-q}. \tag{6.3.18}$$

Proof. $I(\theta)$ continuous implies that $I^{-1}(\theta)$ is continuous and, hence, $I^{22}$ is continuous. But by Theorem 6.2.2, $\sqrt{n}(\hat\theta_n^{(2)} - \theta_0^{(2)}) \xrightarrow{\mathcal{L}} N_d(0, I^{22}(\theta_0))$ if $\theta_0 \in \Theta_0$ holds. Slutsky's theorem completes the proof. □
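The effect of the choice between $I(\theta_0)$ and $I(\hat\theta_n)$ in the Wald statistic can be seen in the simplest case $r = 1$. The sketch below uses Bernoulli trials, for which the Fisher information per trial is $1/(p(1 - p))$; the model, the function names, and the counts are chosen here for illustration and do not come from the text.

```python
import math

CHI2_1_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 df


def bernoulli_info(p):
    """Fisher information of a single Bernoulli(p) trial, 0 < p < 1."""
    return 1.0 / (p * (1.0 - p))


def wald_statistics(successes, n, p0):
    """Wald statistic for H: p = p0 using I(p0) and using I(p_hat)."""
    p_hat = successes / n
    w_null = n * (p_hat - p0) ** 2 * bernoulli_info(p0)
    w_mle = n * (p_hat - p0) ** 2 * bernoulli_info(p_hat)
    return w_null, w_mle


def wald_interval(successes, n, crit=CHI2_1_95):
    """Region {p : n (p_hat - p)^2 I(p_hat) <= crit}; requires 0 < p_hat < 1."""
    p_hat = successes / n
    half_width = math.sqrt(crit / (n * bernoulli_info(p_hat)))
    return p_hat - half_width, p_hat + half_width


if __name__ == "__main__":
    print(wald_statistics(60, 100, 0.5))
    print(wald_interval(60, 100))
```

With 60 successes in 100 trials and $p_0 = 0.5$ the two versions are 4.0 and about 4.17, so both reject at the 0.05 level here, but the two can fall on opposite sides of the critical value in borderline cases. Using $I(\hat\theta_n)$ is what makes the region in `wald_interval` an interval centered at the MLE, the one-dimensional analogue of the ellipsoid.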
The Wald test, which rejects iff $W_n(\theta_0^{(2)}) \ge x_{r-q}(1 - \alpha)$, is, therefore, asymptotically level $\alpha$. What is not as evident is that, under $H$,
$$W_n(\theta_0^{(2)}) = 2\log\lambda(X) + o_p(1) \tag{6.3.19}$$
where $\lambda(X)$ is the LR statistic for $H : \theta \in \Theta_0$. The argument is sketched in Problem 6.3.9. Thus, the two tests are equivalent asymptotically.

The Wald test leads to the Wald confidence regions for $(\theta_{q+1}, \ldots, \theta_r)^T$ given by $\{\theta^{(2)} : W_n(\theta^{(2)}) \le x_{r-q}(1 - \alpha)\}$. These regions are ellipsoids in $R^d$. Although, as (6.3.19) indicates, the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large $n$, in practice they can be very different.

The Rao Score Test

For the simple hypothesis $H : \theta = \theta_0$, Rao's score test is based on the observation that, by the central limit theorem,
$$\sqrt{n}\,\psi_n(\theta_0) \xrightarrow{\mathcal{L}} N(0, I(\theta_0)) \tag{6.3.20}$$
where $\psi_n = n^{-1}Dl_n(\theta_0)$ is the likelihood score vector. It follows from this and Corollary B.6.2 that under $H$, as $n \to \infty$,
$$R_n(\theta_0) = n\,\psi_n^T(\theta_0)I^{-1}(\theta_0)\psi_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r.$$
The test that rejects $H$ when $R_n(\theta_0) \ge x_r(1 - \alpha)$ is called the Rao score test. This test has the advantage that it can be carried out without computing the MLE, and the convergence $R_n(\theta_0) \xrightarrow{\mathcal{L}} \chi^2_r$ requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.

The extension of the Rao test to $H : \theta \in \Theta_0$ runs as follows. Let
$$\Psi_n(\theta) = n^{-1}D_2 l_n(\theta)$$
where $D_1 l_n$ represents the $q \times 1$ gradient with respect to the first $q$ coordinates and $D_2 l_n$ the $d \times 1$ gradient with respect to the last $d$. The Rao test is based on the statistic
$$R_n(\theta_0^{(2)}) \equiv n\,\Psi_n^T(\hat\theta_{0,n})\hat\Sigma^{-1}\Psi_n(\hat\theta_{0,n})$$
where $\hat\Sigma$ is a consistent estimate of $\Sigma(\theta_0)$, the asymptotic variance of $\sqrt{n}\,\Psi_n(\hat\theta_{0,n})$ under $H$. It can be shown that (Problem 6.3.8)
$$\Sigma(\theta_0) = I_{22}(\theta_0) - I_{21}(\theta_0)I_{11}^{-1}(\theta_0)I_{12}(\theta_0) \tag{6.3.21}$$
where $I_{11}$ is the upper left $q \times q$ block of the $r \times r$ information matrix $I(\theta_0)$, $I_{12}$ is the upper right $q \times d$ block, and so on. Furthermore (Problem 6.3.9), under A0–A6 and consistency of $\hat\theta_{0,n}$ under $H$, a consistent estimate of $\Sigma(\theta_0)$ is
$$n^{-1}\big[-D_2^2 l_n(\hat\theta_{0,n}) + D_{21}l_n(\hat\theta_{0,n})[D_1^2 l_n(\hat\theta_{0,n})]^{-1}D_{12}l_n(\hat\theta_{0,n})\big] \tag{6.3.22}$$
where $D_2^2$ is the $d \times d$ matrix of second partials of $l_n$ with respect to $\theta^{(2)}$, $D_{21}$ the $d \times q$ matrix of mixed second partials with respect to $\theta^{(2)}, \theta^{(1)}$, and so on.

Theorem 6.3.5. Under $H : \theta \in \Theta_0$ and the conditions A0–A5 of Theorem 6.2.2 but with A6 required only for $\mathcal{P}_0$,
$$R_n(\theta_0^{(2)}) \xrightarrow{\mathcal{L}} \chi^2_d.$$

The Rao large sample critical and confidence regions are
$$\{R_n(\theta_0^{(2)}) \ge x_d(1 - \alpha)\} \quad\text{and}\quad \{\theta^{(2)} : R_n(\theta^{(2)}) < x_d(1 - \alpha)\}.$$
The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under $H$. On the other hand, it shares the disadvantage of the Wald test that matrices need to be computed and inverted.

Power Behavior of the LR, Rao, and Wald Tests

It is possible as in the one-dimensional case to derive the asymptotic power for these tests for alternatives of the form $\theta_n = \theta_0 + \frac{\Delta}{\sqrt{n}}$ where $\theta_0 \in \Theta_0$. The analysis for $\Theta_0 = \{\theta_0\}$ is relatively easy. For instance, for the Wald test
$$n(\hat\theta_n - \theta_0)^T I(\hat\theta_n)(\hat\theta_n - \theta_0) = \big(\sqrt{n}(\hat\theta_n - \theta_n) + \Delta\big)^T I(\hat\theta_n)\big(\sqrt{n}(\hat\theta_n - \theta_n) + \Delta\big) \xrightarrow{\mathcal{L}} \chi^2_r\big(\Delta^T I(\theta_0)\Delta\big)$$
where $\chi^2_m(\gamma^2)$ is the noncentral chi square distribution with $m$ degrees of freedom and noncentrality parameter $\gamma^2$.

It may be shown that the equivalence (6.3.19) holds under $\theta_n$ and that the power behavior is unaffected and applies to all three tests.

Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score tests; see Rao (1973) for more on this.

Summary. We considered the problem of testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta - \Theta_0$ where $\Theta$ is an open subset of $R^r$ and $\Theta_0$ is the collection of $\theta \in \Theta$ with the last $r - q$ coordinates $\theta^{(2)}$ specified. We established Wilks's theorem, which states that if $\lambda(X)$ is the LR statistic, then, under regularity conditions, $2\log\lambda(X)$ has an asymptotic $\chi^2_{r-q}$ distribution under $H$. We also considered a quadratic form, called the Wald statistic, which measures the distance between the hypothesized value of $\theta^{(2)}$ and its MLE, and showed that this quadratic form has limiting distribution $\chi^2_{r-q}$. Finally, we introduced the Rao score test, which is based on a quadratic form in the gradient of the log likelihood. The asymptotic distribution of this quadratic form is also $\chi^2_{r-q}$.

6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA

In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.
E~. the Wald statistic is klkl Wn(Oo) = n LfB. j = 1.d. For i. k Wn(Oo) = L(N. and 2.+ ] 1 1 0. . where N.3. . j = 1.6.i. ()j.00. Pearson's X2 Test As in Examples 1. + n LL(Bi . j=1 To find the Wald test. trials in which Xi = j if the ith trial prcx:luces a result in the jth category. j=l . we consider the parameter 9 = (()l. k1...8 we found the MLE 0.00k)2lOOk.. j=l i=1 The second term on the right is kl 2 n Thus. 2.)2InOo.4. 6. .) /OOk = n(Bk . # j.Section 6.4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM). .2. consider i. Ok if i if i = j.1. with ()Ok =  1 1 E7 : kl j=l ()OJ. In Example 2." .2. Let ()j = P(Xi = j) be the probability of the jth category.OOi)(B..)/OOk.8. L(9.) > Xkl(1.. . Thus.2.OO. we need the information matrix I = we find using (2. = L:~ 1 1{Xi = j}. . we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution. k.4. It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN. In.1 GoodnessofFit in a Multinomial Model.33) and (3.nOo.7.5.3. j = 1.0).lnOo. treated in more detail in Section 6. or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory. I ij [ .. .. Ok Thus.]2100. log(N.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj. k ..32) thaI IIIij II. . ~ N. j=1 Do. Because ()k = 1 . .
]1(9) = ~ = Var(N).8. . It is easily remembered as 2 = SUM (Observed .13. ~ = Ilaijll(kl)X(kl) with Thus.1) of Pearson's X2 will reappear in other multinomial applications in this section.4. The general form (6.2).11). The second term on the right is n [ j=1 (!. n ( L kl ( j=1 ~ ~) !.. by A. we write ~ ~  8j 80j .~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k .2.4. the Rao statistic is . Then..1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).2) . .402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem.4..80k)..L . 80i 80j 80k (6.. . 80j 80k 1 and expand the square keeping the square brackets intact. j=1 .~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6.. because kl L(Oj . '" . where N = (N1 . ~ .. we could invert] or note that by (6.2. note that from Example 2.4.1S.Nk_d T and.L .. 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .. with To find ]1.Expected)2 Expected (6. .80j ) = (Ok . To derive the Rao test.
the first term on the right of (6.4.2) becomes

$$n \Big\{ \sum_{j=1}^{k-1} \frac{(\hat\theta_j - \theta_{0j})^2}{\theta_{0j}} + \frac{2}{\theta_{0k}} (\hat\theta_k - \theta_{0k})^2 + \frac{1 - \theta_{0k}}{\theta_{0k}^2} (\hat\theta_k - \theta_{0k})^2 \Big\} = n \Big\{ \sum_{j=1}^{k} \frac{(\hat\theta_j - \theta_{0j})^2}{\theta_{0j}} + \frac{(\hat\theta_k - \theta_{0k})^2}{\theta_{0k}^2} \Big\}.$$

It follows that the Rao statistic equals Pearson's $\chi^2$.

Example 6.4.1. Testing a Genetic Theory. In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green; and (4) wrinkled green. If we assume the seeds are produced independently, we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence $\theta_1, \theta_2, \theta_3, \theta_4$. Mendel's theory predicted that $\theta_1 = 9/16$, $\theta_2 = \theta_3 = 3/16$, $\theta_4 = 1/16$, and we want to test whether the distribution of types in the $n = 556$ trials he performed (seeds he observed) is consistent with his theory. Mendel observed $n_1 = 315$, $n_2 = 101$, $n_3 = 108$, $n_4 = 32$. Then, $n\theta_{10} = 312.75$, $n\theta_{20} = n\theta_{30} = 104.25$, $n\theta_{40} = 34.75$, $k = 4$, and

$$\chi^2 = \frac{(2.25)^2}{312.75} + \frac{(3.25)^2}{104.25} + \frac{(3.75)^2}{104.25} + \frac{(2.75)^2}{34.75} = 0.47,$$

which has a p-value of 0.9 when referred to a $\chi^2_3$ table. There is insufficient evidence to reject Mendel's hypothesis. For comparison $2 \log \lambda = 0.48$ in this case. However, this value may be too small! See Note 1.

6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose $\mathbf{N} = (N_1, \dots, N_k)^T$ has a multinomial, $\mathcal{M}(n, \theta)$, distribution. We will investigate how to test $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a composite "smooth" subset of the $(k-1)$-dimensional parameter space

$$\Theta = \Big\{ \theta : \theta_i \ge 0,\ 1 \le i \le k,\ \sum_{i=1}^{k} \theta_i = 1 \Big\}.$$

For example, in the Hardy-Weinberg model (Example 2.1.4),

$$\Theta_0 = \{ (\eta^2,\ 2\eta(1-\eta),\ (1-\eta)^2) : 0 \le \eta \le 1 \},$$

which is a one-dimensional curve in the two-dimensional parameter space $\Theta$. Here testing the adequacy of the Hardy-Weinberg model means testing $H : \theta \in \Theta_0$ versus $K : \theta \in$
$\Theta_1$ where $\Theta_1 = \Theta - \Theta_0$. Other examples, which will be pursued further later in this section, involve restrictions on the $\theta_i$ obtained by specifying independence assumptions on classifications of cases into different categories.

We suppose that we can describe $\Theta_0$ parametrically as

$$\Theta_0 = \{ (\theta_1(\eta), \dots, \theta_k(\eta))^T : \eta \in \mathcal{E} \},$$

where $\eta = (\eta_1, \dots, \eta_q)^T$, $\mathcal{E}$ is a subset of $q$-dimensional space, and the map $\eta \to (\theta_1(\eta), \dots, \theta_k(\eta))^T$ takes $\mathcal{E}$ into $\Theta_0$. To avoid trivialities we assume $q < k - 1$.

Consider the likelihood ratio test for $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$. Let $p(n_1, \dots, n_k, \theta)$ denote the frequency function of $\mathbf{N}$. Maximizing $p(n_1, \dots, n_k, \theta)$ for $\theta \in \Theta_0$ is the same as maximizing $p(n_1, \dots, n_k, \theta(\eta))$ for $\eta \in \mathcal{E}$. If a maximizing value, $\hat\eta = (\hat\eta_1, \dots, \hat\eta_q)$, exists, the log likelihood ratio is given by

$$\log \lambda(n_1, \dots, n_k) = \sum_{i=1}^{k} n_i [\log(n_i/n) - \log \theta_i(\hat\eta)].$$

If $\eta \to \theta(\eta)$ is differentiable in each coordinate, $\mathcal{E}$ is open, and $\hat\eta$ exists, then it must solve the likelihood equation for the model, $\{ p(\cdot, \theta(\eta)) : \eta \in \mathcal{E} \}$. That is, $\hat\eta$ satisfies

$$\frac{\partial}{\partial \eta_j} \log p(n_1, \dots, n_k, \theta(\eta)) = 0, \quad 1 \le j \le q,$$

or

$$\sum_{i=1}^{k} \frac{n_i}{\theta_i(\eta)} \frac{\partial}{\partial \eta_j} \theta_i(\eta) = 0, \quad 1 \le j \le q. \qquad (6.4.3)$$

If $\mathcal{E}$ is not open, sometimes the closure of $\mathcal{E}$ will contain a solution of (6.4.3).

To apply the results of Section 6.3 and conclude that, under $H$, $2 \log \lambda$ approximately has a $\chi^2_{r-q}$ distribution for large $n$, we define $\theta'_j = g_j(\theta)$, $j = 1, \dots, r$, where $g_j$ is chosen so that $H$ becomes equivalent to "$(\theta'_1, \dots, \theta'_q)^T$ ranges over an open subset of $R^q$ and $\theta'_j = \theta_{0j}$, $j = q+1, \dots, r$ for specified $\theta_{0j}$." For instance, to test the Hardy-Weinberg model we set $\theta'_1 = \theta_1$, $\theta'_2 = \theta_2 - 2\sqrt{\theta_1}(1 - \sqrt{\theta_1})$ and test $H : \theta'_2 = 0$. Then we can conclude from Theorem 6.3.3 that $2 \log \lambda$ approximately has a $\chi^2_1$ distribution under $H$.

The Rao statistic is also invariant under reparametrization and, thus, approximately $\chi^2_{r-q}$. Moreover, we obtain the Rao statistic for the composite multinomial hypothesis by replacing $\theta_{0j}$ in (6.4.2) by $\theta_j(\hat\eta)$. The algebra showing $R_n(\theta_0) = \chi^2$ in Section 6.4.1 now leads to the Rao statistic

$$R_n(\theta(\hat\eta)) = \sum_{j=1}^{k} \frac{[N_j - n\theta_j(\hat\eta)]^2}{n\theta_j(\hat\eta)} = \chi^2$$

where the right-hand side is Pearson's $\chi^2$ as defined in general by (6.4.1).

The Wald statistic is only asymptotically invariant under reparametrization. However, the Wald statistic based on the parametrization $\theta(\eta)$, obtained by replacing $\theta_{0j}$ by $\theta_j(\hat\eta)$, is, by the algebra of Section 6.4.1, also equal to Pearson's $\chi^2$.
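The statistics of Sections 6.4.1 and 6.4.2 are easy to compute directly. The following is a minimal Python sketch, not part of the original text: the function names are hypothetical, the first data set is Mendel's counts from Example 6.4.1, and the Hardy-Weinberg counts are invented purely for illustration. For the composite hypothesis the fitted probabilities $\theta(\hat\eta)$ are plugged in where $\theta_0$ appears, using the Hardy-Weinberg MLE $\hat\eta = (2n_1 + n_2)/2n$.

```python
import math

def pearson_chi2(counts, probs):
    """Pearson's chi-square: sum of (observed - expected)^2 / expected."""
    n = sum(counts)
    return sum((nj - n * pj) ** 2 / (n * pj) for nj, pj in zip(counts, probs))

def lr_statistic(counts, probs):
    """2 log lambda = 2 * sum N_j log(N_j / (n theta_j)); empty cells contribute 0."""
    n = sum(counts)
    return 2.0 * sum(nj * math.log(nj / (n * pj))
                     for nj, pj in zip(counts, probs) if nj > 0)

# Simple hypothesis: Mendel's data and theory from Example 6.4.1.
mendel_counts = [315, 101, 108, 32]
mendel_theta0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]
chi2_mendel = pearson_chi2(mendel_counts, mendel_theta0)
lr_mendel = lr_statistic(mendel_counts, mendel_theta0)

# Composite hypothesis: Hardy-Weinberg curve with hypothetical counts (n1, n2, n3).
hw_counts = [30, 50, 20]
n_hw = sum(hw_counts)
eta_hat = (2 * hw_counts[0] + hw_counts[1]) / (2 * n_hw)
hw_fitted = [eta_hat ** 2, 2 * eta_hat * (1 - eta_hat), (1 - eta_hat) ** 2]
chi2_hw = pearson_chi2(hw_counts, hw_fitted)  # compare with chi-square, k - 1 - q = 1 df

print(round(chi2_mendel, 2), round(lr_mendel, 2), round(chi2_hw, 4))
# prints 0.47 0.48 0.0102
```

The same `pearson_chi2` function serves both cases; only the probabilities passed in change, which mirrors the replacement of $\theta_{0j}$ by $\theta_j(\hat\eta)$ described above.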
Example 6.4.2. Hardy-Weinberg. We found in Example 2.2.6 that $\hat\eta = (2n_1 + n_2)/2n$. Thus, $H$ is rejected if $\chi^2 \ge x_1(1-\alpha)$ with

$$\theta(\hat\eta) = \Big( \Big( \frac{2n_1 + n_2}{2n} \Big)^2,\ \frac{(2n_1 + n_2)(2n_3 + n_2)}{2n^2},\ \Big( \frac{2n_3 + n_2}{2n} \Big)^2 \Big)^T.$$

Example 6.4.3. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary; green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green. If $N_i$ is the number of offspring of type $i$ among a total of $n$ offspring, then $(N_1, \dots, N_4)$ has a $\mathcal{M}(n, \theta_1, \dots, \theta_4)$ distribution. A linkage model (Fisher, 1958, p. 301) specifies that

$$\theta_1 = \tfrac14(2 + \eta), \quad \theta_2 = \theta_3 = \tfrac14(1 - \eta), \quad \theta_4 = \tfrac14\eta,$$

where $\eta$ is an unknown number between 0 and 1. To test the validity of the linkage model we would take $\Theta_0 = \{ (\tfrac14(2+\eta), \tfrac14(1-\eta), \tfrac14(1-\eta), \tfrac14\eta) : 0 \le \eta \le 1 \}$, a "one-dimensional curve" of the three-dimensional parameter space $\Theta$.

The likelihood equation (6.4.3) becomes

$$\frac{n_1}{2 + \eta} - \frac{n_2 + n_3}{1 - \eta} + \frac{n_4}{\eta} = 0, \qquad (6.4.4)$$

which reduces to a quadratic equation in $\hat\eta$. The only root of this equation in $[0, 1]$ is the desired estimate (see Problem 6.4.1). Because $q = 1$, $k = 4$, we obtain critical values from the $\chi^2_2$ tables.

Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease; is or is not a smoker; is male or female; and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic $A$ and $\bar{A}$ and of the second $B$ and $\bar{B}$. Then a randomly selected individual from the population can be one of four types $AB$, $A\bar{B}$, $\bar{A}B$, $\bar{A}\bar{B}$. Denote the probabilities of these types by $\theta_{11}, \theta_{12}, \theta_{21}, \theta_{22}$, respectively. Independent classification then means that the events [being an $A$] and [being a $B$] are independent or, in terms of the $\theta_{ij}$,

$$\theta_{ij} = (\theta_{i1} + \theta_{i2})(\theta_{1j} + \theta_{2j}).$$

To study the relation between the two characteristics we take a random sample of size $n$ from the population. The results are assembled in what is called a $2 \times 2$ contingency table such as the one shown.
$$\begin{array}{c|cc} & B & \bar{B} \\ \hline A & N_{11} & N_{12} \\ \bar{A} & N_{21} & N_{22} \end{array}$$

The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. Thus, for example, $N_{12}$ is the number of sampled individuals who fall in category $A$ of the first characteristic and category $\bar{B}$ of the second characteristic. Then, if $\mathbf{N} = (N_{11}, N_{12}, N_{21}, N_{22})^T$, we have $\mathbf{N} \sim \mathcal{M}(n, \theta_{11}, \theta_{12}, \theta_{21}, \theta_{22})$. We test the hypothesis $H : \theta \in \Theta_0$ versus $K : \theta \notin \Theta_0$, where $\Theta_0$ is a two-dimensional subset of $\Theta$ given by

$$\Theta_0 = \{ (\eta_1\eta_2,\ \eta_1(1 - \eta_2),\ \eta_2(1 - \eta_1),\ (1 - \eta_1)(1 - \eta_2)) : 0 \le \eta_1 \le 1,\ 0 \le \eta_2 \le 1 \}.$$

Here we have relabeled $\theta_{11} + \theta_{12}$, $\theta_{11} + \theta_{21}$ as $\eta_1$, $\eta_2$ to indicate that these are parameters, which vary freely.

For $\theta \in \Theta_0$, the likelihood equations (6.4.3) become

$$\frac{n_{11} + n_{12}}{\hat\eta_1} = \frac{n_{21} + n_{22}}{1 - \hat\eta_1}, \qquad \frac{n_{11} + n_{21}}{\hat\eta_2} = \frac{n_{12} + n_{22}}{1 - \hat\eta_2}, \qquad (6.4.5)$$

whose solutions are

$$\hat\eta_1 = (n_{11} + n_{12})/n, \qquad \hat\eta_2 = (n_{11} + n_{21})/n, \qquad (6.4.6)$$

the proportions of individuals of type $A$ and type $B$, respectively. These solutions are the maximum likelihood estimates. Pearson's statistic is then easily seen to be

$$\chi^2 = n \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j}, \qquad (6.4.7)$$

where $R_i = N_{i1} + N_{i2}$ is the $i$th row sum and $C_j = N_{1j} + N_{2j}$ is the $j$th column sum.

By our theory if $H$ is true, because $k = 4$, $q = 2$, $\chi^2$ has approximately a $\chi^2_1$ distribution. This suggests that $\chi^2$ may be written as the square of a single (approximately) standard normal variable. In fact (Problem 6.4.2), the $(N_{ij} - R_iC_j/n)$ are all the same in absolute value and

$$\chi^2 = Z^2,$$

where

$$Z = \Big( N_{11} - \frac{R_1C_1}{n} \Big) \Big[ \sum_{i=1}^{2} \sum_{j=1}^{2} \Big( \frac{R_iC_j}{n} \Big)^{-1} \Big]^{1/2}.$$
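The identity $\chi^2 = Z^2$ can be checked numerically. The sketch below is an illustration only: the function name `two_by_two` and the counts are hypothetical, and the formulas are the Pearson statistic (6.4.7) and the expression for $Z$ given above.

```python
import math

def two_by_two(n11, n12, n21, n22):
    """Return (chi2, z) for a 2x2 table, following (6.4.7) and the formula for Z."""
    n = n11 + n12 + n21 + n22
    rows = [n11 + n12, n21 + n22]   # R_1, R_2
    cols = [n11 + n21, n12 + n22]   # C_1, C_2
    table = [[n11, n12], [n21, n22]]
    chi2 = n * sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j])
                   for i in range(2) for j in range(2))
    # Sum over cells of the reciprocal expected counts (R_i C_j / n)^(-1).
    inv_sum = sum(n / (rows[i] * cols[j]) for i in range(2) for j in range(2))
    z = (n11 - rows[0] * cols[0] / n) * math.sqrt(inv_sum)
    return chi2, z

chi2, z = two_by_two(30, 20, 10, 40)   # hypothetical counts
print(round(chi2, 4), round(z, 4), round(z * z, 4))
```

The sign of `z` is the sign of $N_{11} - R_1C_1/n$, so unlike $\chi^2$ it records the direction of the departure from independence.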
An important alternative form for $Z$ is given by

$$Z = \Big( \frac{N_{11}}{C_1} - \frac{N_{12}}{C_2} \Big) \sqrt{\frac{C_1C_2\,n}{R_1R_2}}. \qquad (6.4.8)$$

Thus,

$$Z = \sqrt{n}\, [\hat{P}(A \mid B) - \hat{P}(A \mid \bar{B})] \Big[ \frac{\hat{P}(B)\hat{P}(\bar{B})}{\hat{P}(A)\hat{P}(\bar{A})} \Big]^{1/2}$$

where $\hat{P}$ is the empirical distribution and where we use $A$, $B$, $\bar{A}$, $\bar{B}$ to denote the event that a randomly selected individual has characteristic $A$, $B$, $\bar{A}$, $\bar{B}$. Thus, if $\chi^2$ measures deviations from independence, $Z$ indicates what directions these deviations take. Positive values of $Z$ indicate that $A$ and $B$ are positively associated (i.e., that $A$ is more likely to occur in the presence of $B$ than it would in the presence of $\bar{B}$). It may be shown (Problem 6.4.3) that if $A$ and $B$ are independent, that is, $P(A \mid B) = P(A \mid \bar{B})$, then $Z$ is approximately distributed as $\mathcal{N}(0, 1)$. Therefore, it is reasonable to use the test that rejects if, and only if,

$$Z \ge z(1 - \alpha)$$

as a level $\alpha$ one-sided test of $H : P(A \mid B) = P(A \mid \bar{B})$ (or $P(A \mid B) \le P(A \mid \bar{B})$) versus $K : P(A \mid B) > P(A \mid \bar{B})$. The $\chi^2$ test is equivalent to rejecting (two-sidedly) if, and only if,

$$|Z| \ge z(1 - \tfrac{\alpha}{2}).$$

Next we consider contingency tables for two nonnumerical characteristics having $a$ and $b$ states, respectively, $a, b \ge 2$ (e.g., eye color, hair color). If we take a sample of size $n$ from a population and classify them according to each characteristic we obtain a vector $N_{ij}$, $i = 1, \dots, a$, $j = 1, \dots, b$, where $N_{ij}$ is the number of individuals of type $i$ for characteristic 1 and $j$ for characteristic 2. If $\theta_{ij} = P$[A randomly selected individual is of type $i$ for 1 and $j$ for 2], then

$$\{ N_{ij} : 1 \le i \le a,\ 1 \le j \le b \} \sim \mathcal{M}(n, \theta_{ij} : 1 \le i \le a,\ 1 \le j \le b).$$

The hypothesis that the characteristics are assigned independently becomes $H : \theta_{ij} = \eta_{i1}\eta_{j2}$ for $1 \le i \le a$, $1 \le j \le b$, where the $\eta_{i1}$, $\eta_{j2}$ are nonnegative and $\sum_{i=1}^{a} \eta_{i1} = \sum_{j=1}^{b} \eta_{j2} = 1$.

The $N_{ij}$ can be arranged in an $a \times b$ contingency table,

$$\begin{array}{c|cccc|c} & 1 & 2 & \cdots & b & \\ \hline 1 & N_{11} & N_{12} & \cdots & N_{1b} & R_1 \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ a & N_{a1} & \cdot & \cdots & N_{ab} & R_a \\ \hline & C_1 & C_2 & \cdots & C_b & n \end{array}$$
with row and column sums as indicated. Maximum likelihood and dimensionality calculations similar to those for the $2 \times 2$ table show that Pearson's $\chi^2$ for the hypothesis of independence is given by

$$\chi^2 = n \sum_{i=1}^{a} \sum_{j=1}^{b} \frac{(N_{ij} - R_iC_j/n)^2}{R_iC_j}, \qquad (6.4.9)$$

which has approximately a $\chi^2_{(a-1)(b-1)}$ distribution under $H$. The argument is left to the problems as are some numerical applications.

6.4.3 Logistic Regression for Binary Responses

In Section 6.1 we considered linear models that are appropriate for analyzing continuous responses $\{Y_i\}$ that are, perhaps after a transformation, approximately normally distributed and whose means are modeled as $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j = \mathbf{z}_i^T\beta$ for known constants $\{z_{ij}\}$ and unknown parameters $\beta_1, \dots, \beta_p$. In this section we will consider Bernoulli responses $Y$ that can only take on the values 0 and 1. Examples are (1) medical trials where at the end of the trial the patient has either recovered ($Y = 1$) or has not recovered ($Y = 0$), (2) election polls where a voter either supports a proposition ($Y = 1$) or does not ($Y = 0$), or (3) market research where a potential customer either desires a new product ($Y = 1$) or does not ($Y = 0$). As is typical, we call $Y = 1$ a "success" and $Y = 0$ a "failure."

We assume that the distribution of the response $Y$ depends on the known covariate vector $\mathbf{z}^T$. In this section we assume that the data are grouped or replicated so that for each fixed $i$, we observe the number of successes $X_i = \sum_{j=1}^{m_i} Y_{ij}$ where $Y_{ij}$ is the response on the $j$th of the $m_i$ trials in block $i$, $1 \le i \le k$. Thus, we observe independent $X_1, \dots, X_k$ with $X_i$ binomial, $\mathcal{B}(m_i, \pi_i)$, where $\pi_i = \pi(\mathbf{z}_i)$ is the probability of success for a case with covariate vector $\mathbf{z}_i$. Next we choose a parametric model for $\pi(\mathbf{z})$ that will generate useful procedures for analyzing experiments with binary responses. Because $\pi(\mathbf{z})$ varies between 0 and 1, a simple linear representation $\mathbf{z}^T\beta$ for $\pi(\cdot)$ over the whole range of $\mathbf{z}$ is impossible. Instead we turn to the logistic transform $g(\pi)$, usually called the logit, which we introduced in Example 1.6.8 as the canonical parameter

$$\eta = g(\pi) = \log[\pi/(1 - \pi)]. \qquad (6.4.10)$$

Other transforms, such as the probit $g_1(\pi) = \Phi^{-1}(\pi)$, where $\Phi$ is the $\mathcal{N}(0, 1)$ d.f., and the log-log transform $g_2(\pi) = \log[-\log(1 - \pi)]$ are also used in practice.

The log likelihood of $\pi = (\pi_1, \dots, \pi_k)^T$ based on $\mathbf{X} = (X_1, \dots, X_k)^T$ is

$$\sum_{i=1}^{k} \Big[ X_i \log\Big( \frac{\pi_i}{1 - \pi_i} \Big) + m_i \log(1 - \pi_i) \Big] + \sum_{i=1}^{k} \log\binom{m_i}{X_i}. \qquad (6.4.11)$$

When we use the logit transform $g(\pi)$, we obtain what is called the logistic linear regression model where

$$\eta_i = \log[\pi_i/(1 - \pi_i)] = \mathbf{z}_i^T\beta.$$
The special case $p = 2$, $\mathbf{z}_i = (1, z_i)^T$ is the logistic regression model of Problem 2.3.1.

The log likelihood $l(\pi(\beta)) \equiv l_N(\beta)$ of $\beta = (\beta_1, \dots, \beta_p)^T$ is, if $N = \sum_{i=1}^{k} m_i$,

$$l_N(\beta) = \sum_{j=1}^{p} \beta_jT_j - \sum_{i=1}^{k} m_i \log(1 + \exp\{\mathbf{z}_i^T\beta\}) + \sum_{i=1}^{k} \log\binom{m_i}{X_i} \qquad (6.4.12)$$

where $T_j = \sum_{i=1}^{k} z_{ij}X_i$ and we make the dependence on $N$ explicit. Note that $l_N(\beta)$ is the log likelihood of a $p$-parameter canonical exponential model with parameter vector $\beta$ and sufficient statistic $\mathbf{T} = (T_1, \dots, T_p)^T$. It follows that the MLE of $\beta$ solves $E_\beta(T_j) = T_j$, $j = 1, \dots, p$, or $E_\beta(\mathbf{Z}^T\mathbf{X}) = \mathbf{Z}^T\mathbf{X}$, where $\mathbf{Z} = \|z_{ij}\|_{k \times p}$ is the design matrix. Thus, Theorem 2.3.1 applies and we can conclude that if $0 < X_i < m_i$ and $\mathbf{Z}$ has rank $p$, the solution to this equation exists and gives the unique MLE $\hat\beta$ of $\beta$. The condition is sufficient but not necessary for existence (see Problem 2.3.1). We let $\mu_i = E(X_i) = m_i\pi_i$. Then $E(T_j) = \sum_{i=1}^{k} z_{ij}\mu_i$, and the likelihood equations are just

$$\mathbf{Z}^T(\mathbf{X} - \mu) = 0. \qquad (6.4.13)$$

By Theorem 2.3.1 and by Proposition 3.4.4 the Fisher information matrix is

$$I(\beta) = \mathbf{Z}^T\mathbf{W}\mathbf{Z} \qquad (6.4.14)$$

where $\mathbf{W} = \mathrm{diag}\{ m_i\pi_i(1 - \pi_i) \}_{k \times k}$. The coordinate ascent iterative procedure of Section 2.4.2 can be used to compute the MLE of $\beta$.

Alternatively, with a good initial value the Newton-Raphson algorithm can be employed. Although, unlike coordinate ascent, Newton-Raphson need not converge, we can guarantee convergence with probability tending to 1 as $N \to \infty$ as follows. As the initial estimate use

$$\hat\beta_0 = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\mathbf{V} \qquad (6.4.15)$$

where $\mathbf{V} = (V_1, \dots, V_k)^T$ with

$$V_i = \log\Big( \frac{X_i + \frac12}{m_i - X_i + \frac12} \Big), \qquad (6.4.16)$$

the empirical logistic transform. Because $\beta = (\mathbf{Z}^T\mathbf{Z})^{-1}\mathbf{Z}^T\eta$ and $\eta_i = \log[\pi_i(1 - \pi_i)^{-1}]$, $\hat\beta_0$ is a plug-in estimate of $\beta$ where $\pi_i$ and $(1 - \pi_i)$ in $\eta_i$ have been replaced by

$$\pi_i^* = \frac{X_i}{m_i} + \frac{1}{2m_i}, \qquad (1 - \pi_i)^* = 1 - \frac{X_i}{m_i} + \frac{1}{2m_i}.$$

Here the adjustment $1/2m_i$ is used to avoid $\log 0$ and $\log \infty$. Similarly, in (6.4.14), $\mathbf{W}$ is estimated using

$$\hat{\mathbf{W}}_0 = \mathrm{diag}\{ m_i\pi_i^*(1 - \pi_i)^* \}.$$

Using the $\delta$-method, it follows by Theorem 5.3.3 that if $m_i \to \infty$, $\pi_i > 0$ for $1 \le i \le k$,

$$m_i^{1/2} \Big( V_i - \log\Big( \frac{\pi_i}{1 - \pi_i} \Big) \Big) \xrightarrow{\mathcal{L}} \mathcal{N}(0, [\pi_i(1 - \pi_i)]^{-1}).$$
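The procedure just described, Newton-Raphson on the likelihood equations (6.4.13) started from the empirical logistic transform (6.4.15) and (6.4.16), can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: it treats only the special case $p = 2$ with $\mathbf{z}_i = (1, z_i)^T$, the function names are hypothetical, and the grouped data are invented.

```python
import math

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def score_and_info(beta, z, m, x):
    """Score Z^T (X - mu) and information Z^T W Z for rows (1, z_i)."""
    s = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for zi, mi, xi in zip(z, m, x):
        pi = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * zi)))
        row = (1.0, zi)
        w = mi * pi * (1.0 - pi)            # diagonal entry of W
        for a in range(2):
            s[a] += row[a] * (xi - mi * pi)
            for b in range(2):
                info[a][b] += row[a] * w * row[b]
    return s, info

def empirical_logit_start(z, m, x):
    """Initial estimate (Z^T Z)^{-1} Z^T V with V_i the empirical logit."""
    v = [math.log((xi + 0.5) / (mi - xi + 0.5)) for mi, xi in zip(m, x)]
    ztz = [[float(len(z)), sum(z)], [sum(z), sum(zi * zi for zi in z)]]
    ztv = [sum(v), sum(zi * vi for zi, vi in zip(z, v))]
    return solve2(ztz, ztv)

def fit_logistic(z, m, x, tol=1e-12, max_iter=50):
    """Newton-Raphson iterations beta <- beta + (Z^T W Z)^{-1} Z^T (X - mu)."""
    beta = empirical_logit_start(z, m, x)
    for _ in range(max_iter):
        s, info = score_and_info(beta, z, m, x)
        step = solve2(info, s)
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return beta

# Hypothetical grouped data: covariate z_i, trials m_i, successes x_i.
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
m = [50, 50, 50, 50, 50]
x = [6, 13, 25, 36, 44]
beta_hat = fit_logistic(z, m, x)
final_score, _ = score_and_info(beta_hat, z, m, x)
```

At convergence `final_score` is numerically zero, which is exactly the statement $\mathbf{Z}^T(\mathbf{X} - \hat\mu) = 0$ of (6.4.13).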
Because $\mathbf{Z}$ has rank $p$, it follows (Problem 6.4.14) that $\hat\beta_0$ is consistent.

To get expressions for the MLEs of $\pi$ and $\mu$, recall from Example 1.6.8 that the inverse of the logit transform $g$ is the logistic distribution function

$$g^{-1}(y) = (1 + e^{-y})^{-1}.$$

Thus, the MLE of $\pi_i$ is $\hat\pi_i = g^{-1}\big( \sum_{j=1}^{p} z_{ij}\hat\beta_j \big)$.

Testing

In analogy with Section 6.1, linear subhypotheses are important. As in Section 6.1, we let $\omega = \{ \eta : \eta_i = \mathbf{z}_i^T\beta,\ \beta \in R^p \}$ and let $r$ be the dimension of $\omega$. We want to contrast $\omega$ to the case where there are no restrictions on $\eta$; that is, we set $\Omega = R^k$ and consider $\eta \in \Omega$. In this case the likelihood is a product of independent binomial densities, and the MLEs of $\pi_i$ and $\mu_i$ are $X_i/m_i$ and $X_i$. The LR statistic $2 \log \lambda$ for testing $H : \eta \in \omega$ versus $K : \eta \in \Omega - \omega$ is denoted by $D(\mathbf{X}, \hat\mu)$, where $\hat\mu$ is the MLE of $\mu$ for $\eta \in \omega$. Thus, from (6.4.11) and (6.4.12),

$$D(\mathbf{X}, \hat\mu) = 2 \sum_{i=1}^{k} [X_i \log(X_i/\hat\mu_i) + X'_i \log(X'_i/\hat\mu'_i)] \qquad (6.4.17)$$

where $X'_i = m_i - X_i$ and $\hat\mu'_i = m_i - \hat\mu_i$. $D(\mathbf{X}, \hat\mu)$ measures the distance between the fit $\hat\mu$ based on the model $\omega$ and the data $\mathbf{X}$. By the multivariate delta method, Theorem 5.3.4, $D(\mathbf{X}, \hat\mu)$ has asymptotically a $\chi^2_{k-r}$ distribution for $\eta \in \omega$ as $m_i \to \infty$, $i = 1, \dots, k < \infty$ (see Problem 6.4.13).

As in Section 6.1, linear subhypotheses are important. If $\omega_0$ is a $q$-dimensional linear subspace of $\omega$ with $q < r$, then we can form the LR statistic for $H : \eta \in \omega_0$ versus $K : \eta \in \omega - \omega_0$,

$$2 \log \lambda = 2 \sum_{i=1}^{k} \Big[ X_i \log\Big( \frac{\hat\mu_i}{\hat\mu_{0i}} \Big) + X'_i \log\Big( \frac{\hat\mu'_i}{\hat\mu'_{0i}} \Big) \Big] \qquad (6.4.18)$$

where $\hat\mu_0$ is the MLE of $\mu$ under $H$ and $\hat\mu'_{0i} = m_i - \hat\mu_{0i}$. In the present case, by Problem 6.4.13, $2 \log \lambda$ has an asymptotic $\chi^2_{r-q}$ distribution as $m_i \to \infty$, $i = 1, \dots, k$. Here is a special case.

Example 6.4.4. The Binomial One-Way Layout. Suppose that $k$ treatments are to be tested for their effectiveness by assigning the $i$th treatment to a sample of $m_i$ patients and recording the number $X_i$ of patients that recover. The samples are collected independently and we observe $X_1, \dots, X_k$ independent with $X_i \sim \mathcal{B}(m_i, \pi_i)$. For a second example, suppose we want to compare $k$ different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. We obtain $k$ independent samples, one from each location, and for the $i$th location count the number $X_i$ among $m_i$ that has the given attribute.
This model corresponds to the one-way layout of Section 6.1, and as in that section, an important hypothesis is that the populations are homogeneous. Thus, we test $H : \pi_1 = \pi_2 = \cdots = \pi_k = \pi$, $\pi \in (0, 1)$, versus the alternative that the $\pi$'s are not all equal. Under $H$ the log likelihood in canonical exponential form is

$$\beta T - N \log(1 + \exp\{\beta\}) + \sum_{i=1}^{k} \log\binom{m_i}{X_i}$$

where $T = \sum_{i=1}^{k} X_i$, $N = \sum_{i=1}^{k} m_i$, and $\beta = \log[\pi/(1 - \pi)]$. It follows from Theorem 2.3.1 that if $0 < T < N$ the MLE exists and is the solution of (6.4.13), where $\mu = (m_1\pi, \dots, m_k\pi)^T$. Using $\mathbf{Z}$ as given in the one-way layout in Example 6.1.3, we find that the MLE of $\pi$ under $H$ is $\hat\pi = T/N$. The LR statistic is given by (6.4.18) with $\hat\mu_{0i} = m_i\hat\pi$. The Pearson statistic

$$\chi^2 = \sum_{i=1}^{k} \frac{(X_i - m_i\hat\pi)^2}{m_i\hat\pi(1 - \hat\pi)}$$

is a Wald statistic and the $\chi^2$ test is equivalent asymptotically to the LR test (Problem 6.4.15).

Summary. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data. We found that for testing the hypothesis that a multinomial parameter equals a specified value, the Wald and Rao statistics take a form called "Pearson's $\chi^2$," which equals the sum of standardized squared distances between observed frequencies and expected frequencies under $H$. When the hypothesis is that the multinomial parameter is in a $q$-dimensional subset of the $(k-1)$-dimensional parameter space $\Theta$, the Rao statistic is again of the Pearson $\chi^2$ form. In the special case of testing independence of multinomial frequencies representing classifications in a two-way contingency table, the Pearson statistic is shown to have a simple intuitive form. Finally, we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. We derive the likelihood equations, discuss algorithms for computing MLEs, and give the LR test. In the special case of testing equality of $k$ binomial parameters, we give explicitly the MLEs and $\chi^2$ test.

6.5 GENERALIZED LINEAR MODELS

In Sections 6.1 and 6.4.3 we considered experiments in which the mean $\mu_i$ of a response $Y_i$ is expressed as a function of a linear combination

$$\xi_i \equiv \mathbf{z}_i^T\beta = \sum_{j=1}^{p} z_{ij}\beta_j$$

of covariate values. In particular, in the case of a Gaussian response, $\mu_i = \xi_i$. In Section 6.4.3, if $\mu_i = EY_{ij}$, then $\mu_i = \pi_i = g^{-1}(\xi_i)$, where $g^{-1}(y)$ is the logistic distribution
function. More generally, McCullagh and Nelder (1983, 1989) synthesized a number of previous generalizations of the linear model, most importantly the log linear model developed by Goodman and Haberman. See Haberman (1974).

The generalized linear model with dispersion depending only on the mean

The data consist of an observation $(\mathbf{Z}, \mathbf{Y})$ where $\mathbf{Y} = (Y_1, \dots, Y_n)^T$ is $n \times 1$ and $\mathbf{Z}^T_{p \times n} = (\mathbf{z}_1, \dots, \mathbf{z}_n)$ with $\mathbf{z}_i = (z_{i1}, \dots, z_{ip})^T$ nonrandom, and $\mathbf{Y}$ has density $p(\mathbf{y}, \eta)$ given by

$$p(\mathbf{y}, \eta) = \exp\{ \eta^T\mathbf{y} - A(\eta) \} h(\mathbf{y}) \qquad (6.5.1)$$

where $\eta$ is not in $\mathcal{E}$, the natural parameter space of the $n$-parameter canonical exponential family (6.5.1), but in a subset of $\mathcal{E}$ obtained by restricting $\eta_i$ to be of the form

$$\eta_i = h(\mathbf{z}_i, \beta), \quad \beta \in R^p,$$

where $h$ is a known function. As we know from Corollary 1.6.1, the mean $\mu$ of $\mathbf{Y}$ is related to $\eta$ via $\mu = \dot{A}(\eta)$. Typically, $A(\eta) = \sum_{i=1}^{n} A_0(\eta_i)$ for some $A_0$, in which case $\mu_i = A'_0(\eta_i)$. We assume that there is a one-to-one transform $\mathbf{g}(\mu)$ of $\mu$, called the link function, such that

$$\mathbf{g}(\mu) = \sum_{j=1}^{p} \beta_j\mathbf{Z}^{(j)} = \mathbf{Z}\beta$$

where $\mathbf{Z}^{(j)} = (z_{1j}, \dots, z_{nj})^T$ is the $j$th column vector of $\mathbf{Z}$. Note that if $\dot{A}$ is one-one, $\mu$ determines $\eta$ and thereby $\mathrm{Var}(\mathbf{Y}) = \ddot{A}(\eta)$. Typically, $\mathbf{g}(\mu)$ is of the form $(g(\mu_1), \dots, g(\mu_n))^T$, in which case $g$ is also called the link function.

Canonical links

The most important case corresponds to the link being canonical; that is, $\mathbf{g} = \dot{A}^{-1}$ or

$$\eta = \sum_{j=1}^{p} \beta_j\mathbf{Z}^{(j)} = \mathbf{Z}\beta.$$

In this case, the GLM is the canonical subfamily of the original exponential family generated by $\mathbf{Z}^T\mathbf{Y}$, which is $p \times 1$.

Special cases are:

(i) The linear model with known variance.

These $Y_i$ are independent Gaussian with known variance 1 and means $\mu_i = \sum_{j=1}^{p} z_{ij}\beta_j$. The model is GLM with canonical $\mathbf{g}(\mu) = \mu$, the identity.

(ii) Log linear models.
Suppose $(Y_1, \dots, Y_p)^T$ is $\mathcal{M}(n, \theta_1, \dots, \theta_p)$, $\theta_j > 0$, $1 \le j \le p$, $\sum_{j=1}^{p} \theta_j = 1$. Then, as seen in Example 1.6.7, $\eta_j = \log \theta_j$, $1 \le j \le p$, are canonical parameters. If we take $\mathbf{g}(\mu) = (\log \mu_1, \dots, \log \mu_p)^T$, the link is canonical. The models we obtain are called log linear; see Haberman (1974) for an extensive treatment. Suppose, for example, that $\mathbf{Y} = \|Y_{ij}\|_{1 \le i \le a,\ 1 \le j \le b}$, so that $Y_{ij}$ is the indicator of, say, classification $i$ on characteristic 1 and $j$ on characteristic 2. Then $\theta = \|\theta_{ij}\|$, $1 \le i \le a$, $1 \le j \le b$, and the log linear model corresponding to

$$\log \theta_{ij} = \beta_i + \beta_j,$$

where $\beta_i$, $\beta_j$ are free (unidentifiable) parameters, is that of independence

$$\theta_{ij} = \theta_{i+}\theta_{+j}$$

where $\theta_{i+} = \sum_{j=1}^{b} \theta_{ij}$, $\theta_{+j} = \sum_{i=1}^{a} \theta_{ij}$.

The log linear label is also attached to models obtained by taking the $Y_i$ independent Bernoulli $(\theta_i)$, $0 < \theta_i < 1$, with canonical link $g(\theta) = \log[\theta(1 - \theta)^{-1}]$. This is just the logistic linear model of Section 6.4.3. See Haberman (1974) for a further discussion.

Algorithms

If the link is canonical, by Theorem 2.3.1, if maximum likelihood estimates $\hat\beta$ exist they necessarily uniquely satisfy the equation

$$\mathbf{Z}^T\mathbf{Y} = \mathbf{Z}^TE_{\hat\beta}\mathbf{Y} = \mathbf{Z}^T\dot{A}(\mathbf{Z}\hat\beta)$$

or

$$\mathbf{Z}^T(\mathbf{Y} - \mu(\hat\beta)) = 0. \qquad (6.5.2)$$

It is interesting to note that (6.5.2) can be interpreted geometrically in somewhat the same way as in the Gaussian linear model: the "residual" vector $\mathbf{Y} - \mu(\hat\beta)$ is orthogonal to the column space of $\mathbf{Z}$. But, in general, $\mu(\hat\beta)$ is not a member of that space.

The coordinate ascent algorithm can be used to solve (6.5.2) (or ascertain that no solution exists). With a good starting point $\hat\beta_0$ one can achieve faster convergence with the Newton-Raphson algorithm of Section 2.4.3. In this case, that procedure is just

$$\hat\beta_{m+1} = \hat\beta_m + (\mathbf{Z}^T\mathbf{W}_m\mathbf{Z})^{-1}\mathbf{Z}^T(\mathbf{Y} - \mu(\hat\beta_m)) \qquad (6.5.3)$$

where

$$\mathbf{W}_m = \ddot{A}(\mathbf{Z}\hat\beta_m).$$

In this situation and more generally even for noncanonical links, Newton-Raphson coincides with Fisher's method of scoring described in Problem 6.5.1. If $\hat\beta_0 \xrightarrow{P} \beta_0$, the
true value of $\beta$, as $n \to \infty$, then with probability tending to 1 the algorithm converges to the MLE if it exists.

In this context, the algorithm is also called iterated weighted least squares. This name stems from the following interpretation. Let $\hat\Delta_{m+1} \equiv \hat\beta_{m+1} - \hat\beta_m$, which satisfies the equation

$$\hat\Delta_{m+1} = (\mathbf{Z}^T\mathbf{W}_m\mathbf{Z})^{-1}\mathbf{Z}^T(\mathbf{Y} - \mu(\hat\beta_m)). \qquad (6.5.4)$$

That is, the correction $\hat\Delta_{m+1}$ is given by the weighted least squares formula (2.2.20) when the data are the residuals from the fit at stage $m$, the variance covariance matrix is $\mathbf{W}_m$, and the regression is on the columns of $\mathbf{W}_m\mathbf{Z}$ (Problem 6.5.2).

Testing in GLM

Testing hypotheses in GLM is done via the LR statistic. As in the linear model we can define the biggest possible GLM $M$ of the form (6.5.1) for which $p = n$. In that case the MLE of $\mu$ is $\hat\mu_M = (Y_1, \dots, Y_n)^T$ (assume that $\mathbf{Y}$ is in the interior of the convex support of $\{ \mathbf{y} : p(\mathbf{y}, \eta) > 0 \}$). Write $\eta(\cdot)$ for $\dot{A}^{-1}$. We can think of the test statistic

$$2 \log \lambda = 2[l(\mathbf{Y}, \eta(\mathbf{Y})) - l(\mathbf{Y}, \eta(\mu_0))]$$

for the hypothesis that $\mu = \mu_0$ within $M$ as a "measure" of (squared) distance between $\mathbf{Y}$ and $\mu_0$. This quantity, called the deviance between $\mathbf{Y}$ and $\mu_0$,

$$D(\mathbf{Y}, \mu_0) = 2\{ [\eta(\mathbf{Y}) - \eta(\mu_0)]^T\mathbf{Y} - [A(\eta(\mathbf{Y})) - A(\eta(\mu_0))] \}, \qquad (6.5.5)$$

is always $\ge 0$. For the Gaussian linear model with known variance $\sigma_0^2$ (Problem 6.5.4),

$$D(\mathbf{Y}, \mu_0) = |\mathbf{Y} - \mu_0|^2/\sigma_0^2.$$

The LR statistic for $H : \mu \in \omega_0$ is just

$$D(\mathbf{Y}, \hat\mu_0) \equiv \inf\{ D(\mathbf{Y}, \mu) : \mu \in \omega_0 \}$$

where $\hat\mu_0$ is the MLE of $\mu$ in $\omega_0$. The LR statistic for $H : \mu \in \omega_0$ versus $K : \mu \in \omega_1 - \omega_0$ with $\omega_1 \supset \omega_0$ is

$$D(\mathbf{Y}, \hat\mu_0) - D(\mathbf{Y}, \hat\mu_1)$$

where $\hat\mu_1$ is the MLE under $\omega_1$. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6.1. If $\omega_0 \subset \omega_1$ we can write

$$D(\mathbf{Y}, \hat\mu_0) = D(\mathbf{Y}, \hat\mu_1) + \Delta(\hat\mu_0, \hat\mu_1), \qquad (6.5.6)$$

a decomposition of the deviance between $\mathbf{Y}$ and $\hat\mu_0$ as the sum of two nonnegative components, the deviance of $\mathbf{Y}$ to $\hat\mu_1$ and $\Delta(\hat\mu_0, \hat\mu_1) \equiv D(\mathbf{Y}, \hat\mu_0) - D(\mathbf{Y}, \hat\mu_1)$, each of which can be thought of as a squared distance between their arguments. Unfortunately $\Delta \ne D$ generally except in the Gaussian case.
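To make the deviance concrete, here is a small Python sketch, offered as an illustration rather than as part of the text. It assumes the common case $A(\eta) = \sum_i A_0(\eta_i)$ and evaluates $2\{[\eta(\mathbf{Y}) - \eta(\mu_0)]^T\mathbf{Y} - [A(\eta(\mathbf{Y})) - A(\eta(\mu_0))]\}$ coordinate by coordinate. The function name, the data, and the choice of the Poisson family as a second example are all assumptions made here.

```python
import math

def deviance(y, mu0, eta, A0):
    """Deviance between y and mu0 when A(eta) = sum_i A0(eta_i).

    `eta` maps a mean to its natural parameter (the inverse of A0') and
    `A0` is the one-coordinate cumulant function.
    """
    total = 0.0
    for yi, mi in zip(y, mu0):
        total += (eta(yi) - eta(mi)) * yi - (A0(eta(yi)) - A0(eta(mi)))
    return 2.0 * total

y = [3.0, 1.0, 4.0, 2.0]      # hypothetical responses
mu0 = [2.5, 2.5, 2.5, 2.5]    # hypothetical hypothesized means

# Gaussian with variance 1: eta(mu) = mu and A0(eta) = eta^2 / 2.
d_gauss = deviance(y, mu0, eta=lambda mu: mu, A0=lambda e: 0.5 * e * e)

# Poisson (illustrative choice): eta(mu) = log(mu) and A0(eta) = exp(eta).
d_pois = deviance(y, mu0, eta=math.log, A0=math.exp)

print(d_gauss, round(d_pois, 4))
```

In the Gaussian case the result is the residual sum of squares, in line with the closed form for known variance; for the Poisson choice it reduces to $2\sum_i [y_i \log(y_i/\mu_{0i}) - (y_i - \mu_{0i})]$.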
if we take Zi as having marginal density qo. which. and can conclude that the statistic of (6.5. and 6.. . .. Asymptotic theory for estimates and tests If (ZI' Y 1 ). fil) is thought of as being asymptotically X. the theory of Sections 6. then (ZI' Y1 ). Thus. .3. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical. (3))Z. = f3d = 0. This can be made precise for stochastic GLMs obtained by conditioning on ZI. . For instance. (Zn.5.q. f3p ). Ao(Z.5. . . (Zn' Y n ) has density (6.7) More details are discussed in what follows.. . 1. we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0.10) is asymptotically X~ under H. ••• .5..3.. 6. we obtain If we wish to test hypotheses such as H : 131 = . can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates. is consistent with probability 1.. in order to obtain approximate confidence procedures. Similar conclusions r '''''''.. (3) term. ..5.5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families.0.. f3d+l.2.Section 6 .1 . there are easy conditions under which conditions of Theorems 6. (3) is (Yi and so. d < p. if we assume the covariates in logistic regression with canonical link to be stochastic. . .8) This is not unconditionally an exponential family in view of the Ao (z.2. so that the MLE {3 is unique. which we temporarily assume known. then ll(fio. y . However.3 hold (Problem 6.3)... Zn in the sample (ZI' Y 1 ).9) What is ]1 ((3)? The efficient score function a~i logp(z . asymptotically exists. .2. and (6. Y n ) from the family with density (6.2 and 6.
follow for the Wald and Rao statistics. Note that these tests can be carried out without knowing the density $q_0$ of $\mathbf{Z}_1$.

These conclusions remain valid for the usual situation in which the $\mathbf{Z}_i$ are not random, but their proof depends on asymptotic theory for independent nonidentically distributed variables, which we postpone to Volume II.

The generalized linear model

The GLMs considered so far force the variance of the response to be a function of its mean. An additional "dispersion" parameter can be introduced in some exponential family models by making the function $h$ in (6.5.1) depend on an additional scalar parameter $\tau$. It is customary to write the model as, for $c(\tau) > 0$,

$$p(\mathbf{y}, \eta, \tau) = \exp\{ c^{-1}(\tau)(\eta^T\mathbf{y} - A(\eta)) \} h(\mathbf{y}, \tau). \qquad (6.5.11)$$

Because $\int p(\mathbf{y}, \eta, \tau)\,d\mathbf{y} = 1$, then

$$A(\eta)/c(\tau) = \log \int \exp\{ c^{-1}(\tau)\eta^T\mathbf{y} \} h(\mathbf{y}, \tau)\,d\mathbf{y}. \qquad (6.5.12)$$

The left-hand side of (6.5.12) is of product form $A(\eta)[1/c(\tau)]$ whereas the right-hand side cannot always be put in this form. However, when it can, then it is easy to see that

$$E(\mathbf{Y}) = \dot{A}(\eta) \qquad (6.5.13)$$

$$\mathrm{Var}(\mathbf{Y}) = c(\tau)\ddot{A}(\eta) \qquad (6.5.14)$$

so that the variance can be written as the product of a function of the mean and a general dispersion parameter.

Important special cases are the $\mathcal{N}(\mu, \sigma^2)$ and gamma $(p, \lambda)$ families. For further discussion of this generalization see McCullagh and Nelder (1983, 1989).

General link functions

Links other than the canonical one can be of interest. For instance, if in the binary data regression model of Section 6.4.3 we take $g(\mu) = \Phi^{-1}(\mu)$ so that $\pi_i = \Phi(\mathbf{z}_i^T\beta)$, we obtain the so-called probit model. Cox (1970) considers the variance stabilizing transformation

$$g(\mu) = \sin^{-1}(\sqrt{\mu}), \quad 0 \le \mu \le 1,$$

which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. As he points out, the results of analyses with these various transformations over the range $.1 \le \mu \le .9$ are rather similar. From the point of view of analysis for fixed $n$, noncanonical links can cause numerical problems because the models are now curved rather than canonical exponential families. Questions of existence of MLEs and convergence of algorithms all become more difficult, and so canonical links tend to be preferred.
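One rough way to see why analyses with these links tend to agree over $.1 \le \mu \le .9$ is to check how close the transforms are to linear functions of one another on that range. The sketch below is an illustration added here, not Cox's argument: it computes the correlation between the logit, probit, and arcsine square-root transforms on an evenly spaced grid, and the grid and helper name are arbitrary choices.

```python
import math
from statistics import NormalDist

def corr(u, v):
    """Pearson correlation of two equal-length sequences."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

grid = [0.1 + 0.01 * i for i in range(81)]            # mu from 0.10 to 0.90
logit = [math.log(mu / (1.0 - mu)) for mu in grid]
probit = [NormalDist().inv_cdf(mu) for mu in grid]    # standard normal quantile
arcsine = [math.asin(math.sqrt(mu)) for mu in grid]

r_logit_probit = corr(logit, probit)
r_logit_arcsine = corr(logit, arcsine)
print(round(r_logit_probit, 4), round(r_logit_arcsine, 4))
```

Correlations close to 1 mean that, on this range, a model linear on one scale is nearly linear on the others after rescaling the coefficients; the agreement deteriorates as $\mu$ approaches 0 or 1.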
Summary. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector $\mathbf{Y}$ of responses can be written as a function, called the link function, of a linear predictor of the form $\sum \beta_j\mathbf{Z}^{(j)}$, where the $\mathbf{Z}^{(j)}$ are observable covariate vectors and $\beta$ is a vector of regression coefficients. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. We discussed algorithms for computing MLEs of $\beta$. In the random design case, we use the asymptotic results of the previous sections to develop large sample estimation results, confidence procedures, and tests.

6.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS

As we most recently indicated in Example 6.2.2, the distributional and implicit structural assumptions of parametric models are often suspect. In Example 6.2.2, we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. We found that if we assume the error distribution $f_0$, which is symmetric about 0, the resulting MLEs for $\beta$ optimal under $f_0$ continue to estimate $\beta$ as defined by (6.2.19) even if the true errors are $\mathcal{N}(0, \sigma^2)$ or, in fact, had any distribution symmetric around 0 (Problem 6.2.5). That is, roughly speaking, if we consider the semiparametric model, $\mathcal{P}_1 = \{ P : (\mathbf{Z}^T, Y)^T \sim P$ given by (6.2.19) with $\varepsilon_i$ i.i.d. with density $f$ for some $f$ symmetric about 0$\}$, then the LSE $\hat\beta$ of $\beta$ is, under further mild conditions on $P$, still a consistent asymptotically normal estimate of $\beta$ and, in fact, so is any estimate solving the equations based on (6.2.31) with $f_0$ symmetric about 0. There is another semiparametric model $\mathcal{P}_2 = \{ P : (\mathbf{Z}^T, Y)^T \sim P$ that satisfies $E_P(Y \mid \mathbf{Z}) = \mathbf{Z}^T\beta$, $E_P\mathbf{Z}^T\mathbf{Z}$ nonsingular, $E_PY^2 < \infty \}$. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II. Furthermore, if $\mathcal{P}_3$ is the nonparametric model where we assume only that $(\mathbf{Z}, Y)$ has a joint distribution and if we are interested in estimating the best linear predictor $\mu_L(\mathbf{Z})$ of $Y$ given $\mathbf{Z}$, the right thing to do if one assumes the $(\mathbf{Z}_i, Y_i)$ are i.i.d. is "act as if the model were the one given in Example 6.1.2." Of course, for estimating $\mu_L(\mathbf{Z})$ in a submodel of $\mathcal{P}_3$ with

$$Y = \mathbf{Z}^T\beta + \sigma_0(\mathbf{Z})\varepsilon \qquad (6.6.1)$$

where $\varepsilon$ is independent of $\mathbf{Z}$ but $\sigma_0(\mathbf{Z})$ is not constant and $\sigma_0$ is assumed known, the LSE is not the best estimate of $\beta$. These issues, whose further discussion we postpone to Volume II, have a fixed covariate exact counterpart, the Gauss-Markov theorem, as discussed below. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6.6.8 but otherwise also postponed to Volume II.
Robustness in Estimation

We drop the assumption that the errors $\epsilon_1, \ldots, \epsilon_n$ in the linear model (6.1.3) are normal. Instead we assume the Gauss-Markov linear model where

$$Y = Z\beta + \epsilon, \quad E(\epsilon) = 0, \quad \mathrm{Var}(\epsilon) = \sigma^2 I \tag{6.6.2}$$

where $Z$ is an $n \times p$ matrix of constants, $\beta$ is $p \times 1$, and $Y$, $\epsilon$ are $n \times 1$. The optimality of the estimates $\hat\mu_i$, $\hat\beta_i$ of Section 6.1.1 and the LSE in general still holds when they are compared to other linear estimates.

Theorem 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form $\alpha = \sum_{i=1}^n a_i \mu_i$ for some constants $a_1, \ldots, a_n$, the estimate $\hat\alpha = \sum_{i=1}^n a_i \hat\mu_i$ has uniformly minimum variance among all unbiased estimates linear in $Y_1, \ldots, Y_n$.

Proof. Because $E(\epsilon_i) = 0$, $E(\hat\alpha) = \sum_{i=1}^n a_i \mu_i = \alpha$, and $\hat\alpha$ is unbiased. Moreover, $\mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(\epsilon_i, \epsilon_j)$. Let $\tilde\alpha = \sum_{i=1}^n c_i Y_i$ stand for any estimate linear in $Y_1, \ldots, Y_n$. By (B.5.6),

$$\mathrm{Var}_{GM}(\tilde\alpha) = \sum_{i=1}^n c_i^2 \mathrm{Var}(\epsilon_i) + 2\sum_{i<j} c_i c_j \mathrm{Cov}(\epsilon_i, \epsilon_j) = \sigma^2 \sum_{i=1}^n c_i^2$$

where $\mathrm{Var}_{GM}$ refers to the variance computed under the Gauss-Markov assumptions. This computation uses only the first two moments of the errors, so it shows that $\mathrm{Var}_{GM}(\tilde\alpha) = \mathrm{Var}_G(\tilde\alpha)$, where $\mathrm{Var}_G(\tilde\alpha)$ stands for the variance computed under the Gaussian assumption that $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By Theorems 6.1.2(iv) and 6.1.3(iv), $\mathrm{Var}_G(\hat\alpha) \le \mathrm{Var}_G(\tilde\alpha)$ for all unbiased $\tilde\alpha$. Because $\mathrm{Var}_G = \mathrm{Var}_{GM}$ for all linear estimators, the result follows. □

Note that the preceding result and proof are similar to Theorem 1.4.4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case. In fact, in Example 6.1.2, our current $\hat\mu$ coincides with the empirical plug-in estimate of the optimal linear predictor (1.4.14). See Problem 6.1.3.

Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:

Proposition 6.6.1. If we replace the Gaussian assumptions on the errors $\epsilon_1, \ldots, \epsilon_n$ with the Gauss-Markov assumptions, the conclusions (1) and (2) of Theorem 6.1.4 are still valid; moreover, $\hat\beta$ and $\hat\mu$ are still unbiased and $\mathrm{Var}(\hat\mu) = \sigma^2 H$, $\mathrm{Var}(\hat\epsilon) = \sigma^2 (I - H)$, and if $p = r$, $\mathrm{Var}(\hat\beta) = \sigma^2 (Z^T Z)^{-1}$.

Example 6.6.1. One Sample (continued). In Example 6.1.1, $\mu = \beta_1$, and $\hat\mu = \hat\beta_1 = \bar Y$. The Gauss-Markov theorem shows that $\bar Y$, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with $EY_i^2 < \infty$.
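The one-sample case of Theorem 6.6.1 can be checked by direct arithmetic. The following is a minimal sketch, not part of the original text: it assumes uncorrelated observations with common variance `sigma2`, and the helper names are illustrative. A linear estimate $\sum c_i Y_i$ is unbiased for the mean when $\sum c_i = 1$, and its variance $\sigma^2 \sum c_i^2$ is smallest for equal weights, that is, for $\bar Y$.

```python
def variance_of_linear_estimate(weights, sigma2=1.0):
    """Variance of sum_i c_i * Y_i for uncorrelated Y_i with common variance sigma2."""
    return sigma2 * sum(c * c for c in weights)


def is_unbiased_for_mean(weights, tol=1e-12):
    """A linear estimate sum_i c_i * Y_i is unbiased for the common mean iff sum_i c_i = 1."""
    return abs(sum(weights) - 1.0) < tol


n = 5
equal_weights = [1.0 / n] * n                # the sample mean
unequal_weights = [0.4, 0.3, 0.1, 0.1, 0.1]  # another unbiased linear estimate (hypothetical)

var_equal = variance_of_linear_estimate(equal_weights)
var_unequal = variance_of_linear_estimate(unequal_weights)
```

Any other choice of weights summing to one can be substituted for `unequal_weights`; by the Cauchy-Schwarz inequality its variance cannot fall below `sigma2 / n`.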
However, for $n$ large, $\bar Y$ has a larger variance (Problem 6.6.5) than the nonlinear estimate $\hat Y$ = sample median when the density of $Y$ is the Laplace density

$$\tfrac{1}{2}\lambda \exp\{-\lambda |y - \mu|\}, \quad y \in R, \ \mu \in R, \ \lambda > 0.$$

For this density and all symmetric densities, $\hat Y$ is unbiased (Problem 3.4.12). □

Remark 6.6.1. As seen in Example 6.6.1, a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates. Another weakness is that it only applies to the homoscedastic case where $\mathrm{Var}(Y_i)$ is the same for all $i$. Suppose we have a heteroscedastic version of the linear model where $E(\epsilon) = 0$ and the errors are uncorrelated, but $\mathrm{Var}(\epsilon_i) = \sigma_i^2$ depends on $i$. If the $\sigma_i^2$ are known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are UMVU in the class of linear estimates. However, when the $\sigma_i^2$ are unknown, our estimates are not optimal even in the class of linear estimates.

Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of t tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The $\delta$-method implies that if the observations $X_i$ come from a distribution $P$, which does not belong to the model, the asymptotic behavior of the LR, Wald and Rao tests depends critically on the asymptotic behavior of the underlying MLEs $\hat\theta$ and $\hat\theta_0$.

From the theory developed in Section 6.2 we know that if $\psi(\cdot, \theta) = Dl(\cdot, \theta)$ and $P$ is true we expect that $\hat\theta_n \to_P \theta(P)$, which is the unique solution of

$$\int \psi(x, \theta)\, dP(x) = 0 \tag{6.6.3}$$

and

$$\sqrt{n}(\hat\theta - \theta(P)) \to_{\mathcal L} N(0, \Sigma(\psi, P)) \tag{6.6.4}$$

with $\Sigma$ given by (6.2.6). Thus, for instance, consider $H : \theta = \theta_0$ and the Wald test statistics

$$T_W = n(\hat\theta - \theta_0)^T I(\theta_0)(\hat\theta - \theta_0).$$

If $\theta(P) \ne \theta_0$ evidently we have $T_W \to_P \infty$. But if $\theta(P) = \theta_0$, then $T_W \to_{\mathcal L} V^T I(\theta_0) V$ where $V \sim N(0, \Sigma(\psi, P))$. Because $\Sigma(\psi, P) \ne I^{-1}(\theta_0)$ in general, the asymptotic distribution of $T_W$ is not $\chi^2_r$. This observation holds for the LR and Rao tests as well, but see Problem 6.6.4. There is an important special case where all is well: the linear model we have discussed in Example 6.2.2.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose $(Z_1, Y_1), \ldots, (Z_n, Y_n)$ are i.i.d. as $(Z, Y)$ where $Z$ is a $(p \times 1)$ vector of random covariates and we model the relationship between $Z$ and $Y$ as

$$Y = Z^T \beta + \epsilon \tag{6.6.5}$$
with the distribution $P$ of $(Z, Y)$ such that $\epsilon$ and $Z$ are independent, $E_P \epsilon = 0$, $E_P \epsilon^2 < \infty$, and we consider, say, $H : \beta = 0$. Then, when $\epsilon$ is $N(0, \sigma^2)$ (see Examples 6.2.1 and 6.2.2), the LR, Wald, and Rao tests all are equivalent to the F test: Reject if

$$T_n \equiv \hat\beta^T Z_{(n)}^T Z_{(n)} \hat\beta / s^2 \ge p\, f_{p,n-p}(1 - \alpha) \tag{6.6.6}$$

where $\hat\beta$ is the LSE, $Z_{(n)} = (Z_1, \ldots, Z_n)^T$ and

$$s^2 = \frac{1}{n - p} \sum_{i=1}^n (Y_i - \hat Y_i)^2 = \frac{1}{n - p} |Y - Z_{(n)} \hat\beta|^2.$$

Now, even if $\epsilon$ is not Gaussian, it is still true that if $\psi$ is given by (6.2.30) then $\beta(P)$ specified by (6.6.3) equals $\beta$ in (6.6.5) and $\sigma^2(P) = \mathrm{Var}_P(\epsilon)$. Thus, it is still true by Theorem 6.2.1 that

$$\sqrt{n}(\hat\beta - \beta) \to_{\mathcal L} N(0, E^{-1}(ZZ^T)\, \sigma^2(P)) \tag{6.6.7}$$

and

$$s^2 \to_P \sigma^2(P). \tag{6.6.8}$$

Moreover, by the law of large numbers,

$$n^{-1} Z_{(n)}^T Z_{(n)} = \frac{1}{n} \sum_{i=1}^n Z_i Z_i^T \to_P E(ZZ^T) \tag{6.6.9}$$

so that the confidence procedures based on approximating $\mathrm{Var}(\hat\beta)$ by $s^2 (Z_{(n)}^T Z_{(n)})^{-1}$ are still asymptotically of correct level. It follows by Slutsky's theorem that, under $H : \beta = 0$,

$$T_n \to_{\mathcal L} W^T [\mathrm{Var}(W)]^{-1} W$$

where $W_{p \times 1}$ is Gaussian with mean 0 and, hence, the limiting distribution of $T_n$ is $\chi^2_p$. Because $p\, f_{p,n-p}(1 - \alpha) \to x_p(1 - \alpha)$ by Example 5.3.7, the test (6.6.6) has asymptotic level $\alpha$ even if the errors are not Gaussian.

This kind of robustness holds for $H : \beta_{q+1} = \beta_{0,q+1}, \ldots, \beta_p = \beta_{0,p}$ or more generally $\beta \in \mathcal{L}_0 + \beta_0$, a $q$-dimensional affine subspace of $R^p$. It is intimately linked to the fact that even though the parametric model on which the test was based is false,

(i) the set of $P$ satisfying the hypothesis remains the same and

(ii) the (asymptotic) variance of $\hat\beta$ is the same as under the Gaussian model.
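The level robustness claimed in Example 6.6.2 can be probed by simulation. The sketch below is illustrative only and makes several assumptions not in the text: $p = 1$ with no intercept, standard normal covariates, centered exponential (skewed, non-Gaussian) errors independent of the covariates, and the $\chi^2_1$ critical value 3.841 in place of the exact F quantile.

```python
import random

CHI2_1_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 degree of freedom


def wald_statistic(z, y):
    """T_n = beta_hat^2 * sum(z_i^2) / s^2 for the model y = z * beta + eps with p = 1."""
    szz = sum(zi * zi for zi in z)
    beta_hat = sum(zi * yi for zi, yi in zip(z, y)) / szz
    rss = sum((yi - zi * beta_hat) ** 2 for zi, yi in zip(z, y))
    s2 = rss / (len(y) - 1)  # n - p with p = 1
    return beta_hat ** 2 * szz / s2


def rejection_rate(n=200, reps=1000, seed=0):
    """Monte Carlo estimate of the level of the test under H: beta = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Under H the response is pure error: mean zero, variance one, skewed.
        y = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        if wald_statistic(z, y) >= CHI2_1_95:
            rejections += 1
    return rejections / reps


rate = rejection_rate()
```

Under these assumptions the estimated rejection rate should be close to the nominal 0.05, up to Monte Carlo error of roughly 0.007 for 1000 replications.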
If the first of these conditions fails, then it's not clear what $H : \theta = \theta_0$ means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Two-Sample Scale Problem. Suppose our model is that $X = (X^{(1)}, X^{(2)})$ where $X^{(1)}$, $X^{(2)}$ are independent $\mathcal{E}(1/\theta_1)$, $\mathcal{E}(1/\theta_2)$ respectively, the lifetimes of paired pieces of equipment. If we take $H : \theta_1 = \theta_2$ the standard Wald test (6.3.17) is

$$\text{``Reject iff } n \frac{(\hat\theta_1 - \hat\theta_2)^2}{\hat\sigma^2} \ge x_1(1 - \alpha)\text{''} \tag{6.6.10}$$

where

$$\hat\theta_j = \frac{1}{n} \sum_{i=1}^n X_i^{(j)} \tag{6.6.11}$$

and

$$\hat\sigma = \frac{1}{\sqrt{2}\, n} \sum_{i=1}^n (X_i^{(1)} + X_i^{(2)}). \tag{6.6.12}$$

The conditions of Theorem 6.3.2 clearly hold. However suppose now that $X^{(1)}/\theta_1$ and $X^{(2)}/\theta_2$ are identically distributed but not exponential. Then $H$ is still meaningful: $X^{(1)}$ and $X^{(2)}$ are identically distributed. But the test (6.6.10) does not have asymptotic level $\alpha$ in general. To see this, note that, under $H$,

$$\sqrt{n}(\hat\theta_1 - \hat\theta_2) \to_{\mathcal L} N(0, 2\mathrm{Var}_P X^{(1)}) \tag{6.6.13}$$

but

$$\hat\sigma \to_P \sqrt{2}\, E_P(X^{(1)}) \ne \sqrt{2 \mathrm{Var}_P X^{(1)}}$$

in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. Simply replace $\hat\sigma^2$ by

$$\tilde\sigma^2 = \frac{1}{n} \sum_{i=1}^n \left\{ \left( X_i^{(1)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2} \right)^2 + \left( X_i^{(2)} - \frac{\bar X^{(1)} + \bar X^{(2)}}{2} \right)^2 \right\}. \tag{6.6.14}$$

□

Example 6.6.4. The Linear Model with Stochastic Covariates with $\epsilon$ and $Z$ Dependent. Suppose $E(\epsilon \mid Z) = 0$ so that the parameters $\beta$ of (6.6.5) are still meaningful but $\epsilon$ and $Z$ are dependent. For simplicity, let the distribution of $\epsilon$ given $Z = z$ be that of $\sigma(z)\epsilon'$ where $\epsilon'$ is independent of $Z$. That is, we assume variances are heteroscedastic. Suppose without loss of generality that $\mathrm{Var}(\epsilon') = 1$. Then (6.6.7) fails and in fact by Theorem 6.2.1

$$\sqrt{n}(\hat\beta - \beta) \to_{\mathcal L} N(0, Q) \tag{6.6.15}$$

where $Q = E^{-1}(ZZ^T)\, E(\sigma^2(Z) ZZ^T)\, E^{-1}(ZZ^T)$ (Problem 6.6.6).
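For a single covariate ($p = 1$, no intercept) the matrix $Q$ of (6.6.15) reduces to the scalar $E(\sigma^2(Z) Z^2) / [E(Z^2)]^2$. The following minimal sketch, with illustrative function names and a tiny made-up data set, contrasts the plug-in estimate of this quantity, which replaces $\sigma^2(Z_i)$ by squared residuals, with the homoscedastic estimate $s^2 / (n^{-1}\sum Z_i^2)$ suggested by (6.6.7). Both are estimates of the asymptotic variance of $\sqrt{n}(\hat\beta - \beta)$; the plug-in form is one common choice, not necessarily the one intended in the text.

```python
def lse_slope(z, y):
    """Least squares slope for the no-intercept model y = z * beta + eps."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)


def variance_estimates(z, y):
    """Return (homoscedastic, heteroscedasticity-robust) estimates of the
    asymptotic variance of sqrt(n) * (beta_hat - beta) when p = 1."""
    n = len(z)
    beta_hat = lse_slope(z, y)
    resid = [yi - zi * beta_hat for zi, yi in zip(z, y)]
    m2 = sum(zi * zi for zi in z) / n                      # estimates E(Z^2)
    s2 = sum(r * r for r in resid) / (n - 1)               # estimates sigma^2 if errors are homoscedastic
    naive = s2 / m2
    meat = sum(zi * zi * r * r for zi, r in zip(z, resid)) / n  # estimates E(sigma^2(Z) Z^2)
    robust = meat / (m2 * m2)
    return naive, robust


# Hypothetical data, for illustration only.
naive_var, robust_var = variance_estimates([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

When $\sigma^2(Z)$ is constant the two estimates target the same limit; when it is not, only the second one remains consistent for $Q$.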
In general, the test (6.6.6) does not have the correct level. Specialize to the case $Z = (1, I_1 - \lambda_1, \ldots, I_d - \lambda_d)^T$ where $(I_1, \ldots, I_d)$ has a multinomial $(\lambda_1, \ldots, \lambda_d)$ distribution, $0 < \lambda_j < 1$, $1 \le j \le d$. This is the stochastic version of the $d$-sample model of Example 6.1.3. It is easy to see that our tests of $H : \beta_2 = \cdots = \beta_d = 0$ fail to have correct asymptotic levels in general unless $\sigma^2(Z)$ is constant or $\lambda_1 = \cdots = \lambda_d = 1/d$ (Problem 6.6.6). A solution is to replace $Z_{(n)}^T Z_{(n)} / s^2$ in (6.6.6) by $\hat Q^{-1}$, where $\hat Q$ is a consistent estimate of $Q$. For $d = 2$ above this is just the asymptotic solution of the Behrens-Fisher problem discussed in Section 4.9.4, the two-sample problem with unequal sample sizes and variances. □

To summarize: If hypotheses remain meaningful when the model is false then, in general, LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically. The simplest solution at least for Wald tests is to use as an estimate of $[\mathrm{Var}\, \sqrt{n}\hat\theta]^{-1}$ not $I(\hat\theta)$ or $-\frac{1}{n} D^2 l_n(\hat\theta)$ but the so-called "sandwich estimate" (Huber, 1967),

$$\left[ \frac{1}{n} D^2 l_n(\hat\theta) \right]^T \left\{ \frac{1}{n} \sum_{i=1}^n [Dl(X_i, \hat\theta)][Dl(X_i, \hat\theta)]^T \right\}^{-1} \left[ \frac{1}{n} D^2 l_n(\hat\theta) \right]. \tag{6.6.16}$$

Summary. We considered the behavior of estimates and tests when the model that generated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. $N(0, \sigma^2)$ assumption on the errors is replaced by the assumption that the errors have mean zero, identical variances, and are uncorrelated, provided we restrict the class of estimates to linear functions of $Y_1, \ldots, Y_n$. In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. $N(0, \sigma^2)$ are still reasonable when the true error distribution is not Gaussian. In particular, the confidence procedures derived from the normal model of Section 6.1 are still approximately valid as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis $H$ or the variance of the MLE is not preserved when going to the wider model, then the MLE and LR procedures for a specific model will fail asymptotically in the wider model. In this case, the methods need adjustment, and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.

6.7 PROBLEMS AND COMPLEMENTS

Problems for Section 6.1

1. Show that in the canonical exponential model (6.1.11) with both $\eta$ and $\sigma^2$ unknown, (i) the MLE does not exist if $n = r$, (ii) if $n \ge r + 1$, then the MLE $(\hat\eta_1, \ldots, \hat\eta_r, \hat\sigma^2)^T$ of $(\eta_1, \ldots, \eta_r, \sigma^2)^T$ is $(U_1, \ldots, U_r, n^{-1} \sum_{i=r+1}^n U_i^2)^T$. In particular, $\hat\sigma^2 = n^{-1} |Y - \hat\mu|^2$.

2. For the canonical linear Gaussian model with $\sigma^2$ unknown, use Theorem 3.4.3 to compute the information lower bound on the variance of an unbiased estimator of $\sigma^2$. Compare this bound to the variance of $s^2$.
Let 1. i = 1.Section 6.1] is a known constant and are i. {Ly = /31.'''' Zip)T.4 • 3.29) for the noncentrality parameter 82 in the oneway layout.. where IJ.22... i eo = 0 (the €i are called moving average errors. .2 ). J =~ i=O L)) c i (1. fO 0. i = 1. . Show that in the regression example with p = r = 2.29). i = 1. Derive the formula (6. and (3 and {Lz are as defined in (1.2 coincides with the likelihood ratio statistic A(Y) for the 0. of 7. 9. Show that >:(Y) defined in Remark 6. 0."" (Z~. where a.n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0..a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . Let Yi denote the response of a subject at time i.4.. . n. Consider the model (see Example 1. N(O..1. then B the MLE of (). the 100(1 . .l+c (_c)j+l)/~ (1.2 replaced by 0: 2 • 6. Suppose that Yi satisfies the following model Yi ei = () + fi.i. = (Zi2.1.. Find the MLE B (). c E [0. see Problem 2. (Zi..13. 5.2 ). . Show that Ii of Example 6.. Yl). and the 1.n. (e) Show that Var(B) < Var(Y) unless c = O. . .5) Yi () + ei. . is (c) Show that Y and Bare unbiased.. zi = (Zi21"" Zip)T.1.1.2.. . i = 1. Derive the formula (6. n.2 known case with 0. . . 0..14). t'V (b) Show that if ei N(O. .d. .1.(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of (). (d) Show that Var(O) ~ Var(Y). 8.28) for the noncentrality parameter ()2 in the regression example. Var(Uf) = 20.Li = {Ly+(zi {Lz)f3. 1 {LLn)T. Here the empirical plugin estimate is based on U. fn where ei = ceil + fi.2 .2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. 4.. 1 n f1.d..7 Problems and Complements 423 Hint: By A. Yn ) where Z.
(b) Find a level (1 a) prediction interval for Y (i. 'lj.2.":=::. We want to predict the value of a 14. .Z. (b) Find confidence intervals for 'lj. . Assume the linear regression model with p future observation Y to be taken at the pont z.a). (c) If n is fixed and divisible by 2(p . ...1.~ L~=1 nk = (p~:)2 + Lk#i . ~ 1 .. where n is even. .a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z. (a) Find a level (1 . Y ::. Note that Y is independent of Yl.p (1 . (d) Give the 100(1 .1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::. Consider a covariate x. l(Yh .2) L(Zi2 . T2 1 (1/ 2n) L.a)% confidence intervals for a and 6k • 12. Yn . Yn ). Yn ) such that P[t..:1 Yi + (3/ 2n) L~= ~ n+ 1 }i..Z.2). (n .p (~a) (n .p)S2 /x n .. ••.. (a) Show that level (1 . n2 = . (Zi2 Z.. .• 424 10. then Var(8k ) is minimized by choosing ni = n/2.k)' C (b) If n is fixed and divisible by p. a 2 ::. = 2(Y .p)S2 /x n . Yn be a sample from a population with mean f.2)2 a and 8k in the oneway layout Var(8k ) .2)(Zj2 . Often a treatment that is beneficial in small doses is harmful in large doses. 15. .. Show that for the estimates (a) Var(a) = +==_=:.{3i are given by = ~(f. In the oneway layout..a) confidence intervals for linear functions of the form {3j .1 and variance a 2 .I). The following model is useful in such situations. ::... then Var( a) is minimized by choosing ni = n / p. ~(~ + t33) t31' r = 2. which is the amount . 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. Let Y 1 . Consider the three estimates T1 = and T3 Y. then the hat matrix H = (h ij ) is given by 1 n 11.• statistics HYl.e . = np n/2(p 1). Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6.
which is yield or production.5 Zi2 * + (33zi2 where Zil = Xi  X. r :::. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . E :::. : {L E R.2.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL.0289. xC' = 0 => x = 0 for any rvector x.or dose of a treatment. ( 2 ) density. ( 2 ).• . Hint: Because C' is of rank r..7 Probtems and Lornol.r2 )l 1 2 7 > O. i = 1. n are independent N(O. Let 8.77 167.1) + 2<t'CJL.0'2) is the N(Il'.2 1. 8) = 2.r2) (x). fli) where fi1.emEmts 425 .Section 6. 653). Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. compute confidence intervals for (31.. ( 2 ). . (32. 1 2 where <t'CJL. (a) For the following data (from Hald.I::llogxi' You may use X = 0. then the r x r matrix C'C is of rank r and. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. Yi) and (Xi. fi3 and level 0. 1952... hence. . Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. 16. Assume the model = (31 + (32 X i + (33 log Xi + Ei. (b) Show that the MLE of ({L.0'2)(x ) + E<t'(JL. 8). But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. and Q = P. In the Gaussian linear model show that the parametrization ({3. and p be as in Problem 6. . . ( 2 ) logp(x. (b) Plot (Xi. a 2 > O}.95 Yi = e131 e132xi Xf3. 17. P = {N({L. a 2 <7 2 . 7 2 ) does not exist so that A6 doesn't hold. Check AO. •. Show that if C is an n x r matrix of rank r. Y (yield) 3.A6 when 8 = ({L. ( 2 )T is identifiable if and only if r = p. . nonsingular. En Xi. Problems for Section 6. fi2. and a response variable Y. (33. n. p(x. p. P.
 I .2. n[z(~)~en)jl)' X.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. In Example 6. In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H..  (3)T)T has a (". show that ([EZZ T jl )(1. given Z(n).24)..20).1. T 2 ) based on the first two I (d) Deduce from Problem 6.2.1 show thatMLEs of (3.2. (6. 8. Hint: (a). 6. H E H}. (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic.1. (e) Show that 0:2 is unconditionally independent of (ji.2.2.)Z(n)(fj . i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.1) EZ .I3). '".i > 1.2. (P("H) : e E e. Z i ~O. . Hint: !x(x) = !YIZ(Y)!z(z). X ._p distribution.2. Show that if we identify X = (z(n).2.426 Inference in the Multiparameter Case Chapter 6 I . H abstract. In Example 6. (e) Construct a method of moment estimate moments which are ~ consistent.2. I I .2. Fill in the details of the proof of Theorem 6.IY . Establish (6. . The MLE of based on (X. iT 2 ) are conditional MLEs.10 that the estimate Raphson estimates from On is efficient. 3. I . In Example 6..2 hold if (i) and (ii) hold. (I) Combine (aHe) to establish (6. that (d) (fj  (3)TZ'f. (a) In Example 6.'.2.(3) = op(n. (c) Apply Slutsky's theorem to conclude that and. b . 5.. /1. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). t) is then called a conditional MLE. Y).. and .(b) The MLE minimizes .". e Euclidean.2 are as given in (6.1/ 2 ). ell of (} ~ = (Ji.2..Pe"H). . T(X) = Zen).21).. (fj multivariate nannal distribution with mean 0 and variance. . . show that the assumptions of Theorem 6.Zen) (31 2 e ~ 7.2. hence. then ({3.fii(ji .
L'l'(Xi'O~) n.2. <).0 0 ).'2:7 1 'l'(Xi. Hint: n .l exists and uniquely ~ .(XiI') _O. (ii) gn(Oo) ~ g(Oo).l + aCi where fl.0 0 1 < <} ~ 0. ao (}2 = (b) Write 8 1 uniquely solves 1 8.l real be independent identically distributed Y. hence..2.Oo) i=cl (1 .Dg(O)j . p is strictly convex.O) has auniqueOin S(Oo.1) starting at 0. .7 Problems and Complements 427 9. a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and.•. ] t=l 1 n . Y. Let Y1 •.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi. (iii) Dg(0 0 ) is nonsingular. t=l 1 tt Show that On satisfies (6.1/ 2 ). the <ball about 0 0 .LD. (iv) Dg(O) is continuous at 0 0 . (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.3).. Examples are f Gaussian. and f(x) = e.. E R. R d are such that: (i) sup{jDgn(O) . Him: You may use a uniform version of the inverse function theorem: If gn : Rd j. p i=l J.x (1 + C X ) .1' _ On = O~  [ . (logistic). 8~ = 1 80 + Op(n.' . = J. (a) Show that if solves (Y ao is assumed known a unique MLE for L. that is.LD. a' .Section 6. . Suppose AGA4 hold and 8~ is vn consistent. 10 . . (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1.p(Xi'O~) + op(1) n i=l n ) (O~ ..p(Xi'O~) n. Show that if BB a unique MLE for B1 exists and 10.
Hint: Write n J • • • i I • DY...(j) l. converges to the unique root On described in (b) and that On satisfies 1 1 • j. hence. Problems for Section 6. 1 Zw Find the asymptotic likelihood ratio.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1. Similarly compute the infonnation matrix when the model is written as Y. I '" LJ i=l ~(1) Y. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}.3. f. .3. 2 ° 2. .i. a 2 ).O E e. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj. i=l • Z... {Vj} are orthonormal.. i2.3 i 1. Show that if 3 0 = {'I E 3 : '1j = 0. . Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n .. < Zn 1 for given covariate values ZI. q(. . ~. and ~ .. Wald. P I .2..Zi )13.27).' " 13. Inference in the Multiparameter Case Chapter 6 > O. .O).ip range freely and Ei are i.~_. 'I) = p(. Suppose that Wo is given by (6. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3.3). Thus. N(O. and Rao tests for testing H : Ih. LJ j=2 Differentiate with respect to {31./3)' = L i=1 1 CjZiU) n p . = versus K : O oF 0.. . . . Y. iteration of the NewtonRaphson algorithm starting at 0. •.3. that Theorem 6. P(Ai).l hold for p(x.. Establish (6. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X.2 holds for eo as given in (6..3. II. •. 0 < ZI < .12) and the assumptions of Theorem (6. .26) and (6.d.12).(Z" . .II(Zil I Zi2. Suppose responses YI .2.6). for "II sufficiently large. CjIJtlZ. log Ai = ElI + (hZi.2. i .. • (6. 'I) : 'I E 3 0 } and.428 then. = 131 (Zil . . > 0 such that gil are 1 . there exists a j and their image contains a ball S(g(Oo). Zi.(Z"  ~(') z. then it converges to that solution. 1 Yn are independent Poisson variables with Yi . t• where {31.. minimizing I:~ 1 (Ii 2 i. Zip)) + L j=2 p "YjZij + €i J ..
q('. 0 ») is nK(Oo. Let e = {B o . Show that under H.3.7 Problems and C:com"p"":c'm"'. independent. respectively. Oil + Jria(00 .l.. (Adjoin to 9q+l.3 is valid.n) satisfy the conditions of Problem 6.'1) and Tin '1(lJ n) and.Section 6.IJ) where q(. . {IJ E S(lJo): ~j(lJ) = 0. Consider testing H : 8 1 = 82 = 0 versus K : 0. pXd .o log (X 0) PI. < 00.(X1 . hence. OJ}.Xn ) be the likelihood ratio statistic. j = 1. Vi). ..d. Suppose that Bo E (:)0 and the conditions of Theorem 6.) where k(Bo. Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. q + 1 <j< rl. Xi. . B).. . N(IJ" 1).. There exists an open ball about 1J0. Deduce that Theorem 6.3..n 7J(Bo. N(Oz. 'I) S(lJo)} then. Suppose 8)" > 0.. Testing Simple versus Simple. IJ. aTB. 1 5.3.X. Oil and p(X"Oo) = E..3 hold.aq are orthogonal to the linear span of ~(1J0) .'" .n:c"' 4c:2=9 3. . with Xi and Y. . 210g. .) O.2. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0).2..'1 8 0 = and. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo). Let (Xi.d. even if ~ b = 0. (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ .. q + 1 <j < r S(lJo) .. X n Li.. (ii) E" (a) Let ). IJ. .gr..'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio. \arO whereal. Consider testing H : (j = 00 versus K : B = B1 . be ii.. = ~ 4. =  p(·. p . . Assume that Pel of Pea' and that for some b > 0.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('.. > 0 or IJz > O.) 1(Xi . 1 < i < n. 1). with density p(.(I(X i .(X" .
= (T20 = 1 andZ 1= X 1. .0. > 0.i. I .\(X). 7. under H. 0. Exhibit the null distribution of 2 log . < . Show that 2log. 1) with probability ~ and U with the same distribution. XI and X~ but with probabilities ~ . .19) holds. Let Bi l 82 > 0 and H be as above.6.\(Xi. are as above with the same hypothesis but = {(£It.3. Hint: By sufficiency reduce to n = 1.3. mixture of point mass at 0. Then xi t. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. ' · H In!: Consl'defa1O . which is a mixture of point mass at O. Wright. < cIJ"O...1. (iii) Reparametrize as in Theorem 6.. (0.3. ii. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0.) if 0. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson.6.  0.2 for 210g.\( Xi.  OJ. Hint: ~ (i) Show that liOn) can be replaced by 1(0). Show that (6. be i. (b) If 0. 2 e( y'n(ii.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. 2 log >'( Xi.) under the model and show that (a) If 0.. In the model of Problem 5(a) compute the MLE (0. (d) Relate the result of (b) to the result of Problem 4(a).li : 1 <i< n) has a null distribution.. and Dykstra. Yi : l<i<n). = 0.0. = v'1~c2' 0 <.=0 where U ~ N(O.d. (h) : 0 < 0. for instance.1.0. > OJ. Po) distribution and (Xi.)) ~ N(O. /:1. (b) Suppose Xil Yi e 4 (e) Let (X" Y. 0. ±' < i < n) is distributed as a respectively. > 0... we test the efficacy of a treatment on the basis of two correlated responses per individual. = a + BB where . 1 < i < n. Sucb restrictions are natural if. O > 0. and ~ where sin. • i .2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.0. Z 2 = poX1Y v' .0"10' O"~o. Po . j ~ ~ 6.) have an N. li).~. 1988).0). =0. Yi : 1 and X~ with probabilities ~.
P(A))P(B)(1 .. (b) (6.4. (a) Show that the correlation of Xl and YI is p = peA n B) .3.4. 3.7 Problems and Complements ~(1 ) 431 8. Show that under A2. Exhibit the two solutions of (6.3. 1. 10.22) is a consistent estimate of ~l(lIo).8) by Z ~ .4) explicitly and find the one that corresponds to the maximizer of the likelihood. 9.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5.1(0 0 ).3. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6.8) (e) Derive the alternative form (6. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6. 0 < P( A) < 1.nf.6 is related to Z of (6. 2.4 ~ 1(11) is continuous.4. hence. (e) Conclude that if A and B are independent.2 to lin .4. A3. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.Section 6.P(A)P(B) JP(A)(1 .3. Hint: Write and apply Theorem 6.10. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born. Hint: Argue as in Problem 5. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero. then Z has a limitingN(011) distribution. A611 Problems for Section 6.21).0 < PCB) < 1.8) for Z.2. .
S. TJj2 are given by TJil ~ = . .. B!c1b!. Consider the hypothesis H : Oij = TJil TJj2 for all i. n a2 ) . It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. j. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI.2 is 1t(Ci... Suppose in Problem 6. (b) Deduce that Pearson's X2 is given by (6. TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. . i = 1. (22 ) as in the contingency table. 8 21 ..54). 1 : X 2 (b) How would you. in principle.4. . . Fisher's Exact Test From the result of Problem 6. i = 1. ..4 deduce that jf j(o:) (depending on chosen so that Tl. =T 1. 811 \ 12 .. ... n R.. N ll and N 21 are independent 8( r.~l. TJj2. C i = Ci.. R 2 = T2 = n .. j = 1. .)).~~. 6. . N 22 ) rv M (u. (a) Show that then P[N'j niji i = 1. are the multinomial coefficients.4. b I Ri ( = Ti.~. ( rl.ra) nab .. j = 1. . .2. TJj2 ~ = Cj n where R. 432 Inference in the Multiparameter Case Chapter 6 i 4. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 .9) and has approximately a X1al)(bl) distribution under H. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent. a . I (c) Show that under independence the conditional distribution of N ii given R. 8l! / (8 1l + 812 )). (a) Show that the maximum likelihood estimates of TJil... ( where ( .b1only.C.4. nal ) n12.. = Lj N'j. . n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. Let N ij be the entries of an let 1Jil = E~=l (}ij. n. C j = L' N'j. CI..u. Ti) (the hypergeometric distribution).1 II . This is known as Fisher's exact test. 8(r2' 82 I/ (8" + 8. . nab) A ) = B.D. (a) Let (NIl.6 that H is true. Hint: (a) Consider the likelihood as a function of TJil.. 7.TI.~~. .. Cj = Cj] : ( nll.1. N 12 • N 21 ..1 .
(h) Construct an experiment and three events for which (i) and (ii) hold. (a) If A.12 I . 2:f .. where Pp~ [2:f . (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. The following table gives the number of applicants to the graduate program of a small department of the University of California. N 22 is conditionally distributed 1t(r2.(rl + cI). and petfonn the same test on the resulting table. Then combine the two tables into one. if and only if. Establish (6.. classified by sex and admission status. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. ~i = {3.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). Give pvalues for the three cases. B INDEPENDENT) (iii) PeA (C is the complement of C. Test in both cases whether the events [being a man] and [being admitted] are independent. 9. there is a UMP level a test.) Show that (i) and (ii) imply (iii).1""". + IhZi.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. n.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A. Zi not all equal. Would you accept Or reject the hypothesis of independence at the 0.ZiNi > k] = a. and that under H. and that we wish to test H : Ih < f3E versus K : Ih > f3E. . (b). Suppose that we know that {3. consider the assertions. for suitable a.(rl + cd or N 22 < ql + n .BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. C2).5? Admit Deny Men Women 1 19 . 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n . C are three events.ziNi > k.. = 0 in the logistic model.7 Problems and Complements 433 8. B.Section 6. 11. but (iii) does not. which rejects. 10.0. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'.4.4. if A and C are independent or B and C are independent. Show that.14).
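The effect that the admissions problem above is designed to exhibit can be seen by computing Pearson's $\chi^2$ for each department's $2 \times 2$ table and for the combined table. This is a rough sketch rather than a worked solution: the counts are as read from the two departmental tables (rows taken to be Men and Women, columns Admit and Deny; the marginal totals 315 and 387 agree with this reading), and 3.841 is the approximate 0.05 critical value for one degree of freedom.

```python
def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


CHI2_1_95 = 3.841  # approximate 0.95 quantile of chi-square with 1 degree of freedom

# (men admitted, men denied, women admitted, women denied), as read from the source tables
dept_1 = (235, 35, 38, 7)
dept_2 = (122, 93, 103, 69)
combined = tuple(x + y for x, y in zip(dept_1, dept_2))

chi2_dept_1 = pearson_chi2_2x2(*dept_1)
chi2_dept_2 = pearson_chi2_2x2(*dept_2)
chi2_combined = pearson_chi2_2x2(*combined)
```

With these counts neither departmental statistic exceeds the critical value while the combined one does, which is the phenomenon of part (a): independence within each department does not imply independence in the pooled table.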
" lOk vary freely. a5 j J j 1 J I . which is valid for the i..Ld....... .5. but under K may be either multinomial with 0 #..18) tends 2 to Xrq' Hint: (Xi .'Ir(. k and a 2 is unknown.2. for example. X k be independent Xi '" N (Oi. ~ 8 m + [1(8 m )Dl(8 m ).8) and.3.. Suppose the ::i in Problem 6.. i Problems for Section 6. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI.".5. Show that the likelihO<Xt ratio test of H : O} = 010 .Oio)("Cooked data"). Suppose that (Z" Yj).. 010. . then 130 as defined by (6.. but Var. a 2 = 0"5 is of the form: Reject if (1/0".4.5 construct an exact test (level independent of (31).i.LJ2 Z i ))..(Ni ) < nOiO(1 . in the logistic regression model.20) for the regression described after (6. This is an approximation (for large k. " Ok = 8kO. .) ~ B(m. . . 13. 434 • Inference in the Multiparameter Case Chapter 6 f • 12.4. with (Xi I Z. . . 1 . . Show that. mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. Verify that (6. . Xi) are i.3) l:::~ I (Xi . i 16. Nk) ~ M(n.. n) and simplification of a model under which (N1. a 2 ) where either a 2 = (known) and 01 . = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown.00 or have Eo(Nd : . .z(kJl) ~I 1 .OiO)2 > k 2 or < k}.. asymptotically N(O.f.15) is consistent. ..4. .. or Oi = Bio (known) i = 1.4) is as claimed formula (2. 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.)llmi~i(1 'lri).d... case. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 . Compute the Rao test statistic for H : (32 case.4. Let Xl. Zi and so that (Zi. I. a under H. I). · . ..l 1 .. (a) Compute the Rao test for H : (32 Problem 6. (Zn.i.4. if the design matrix has rank p.: nOiO.. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. 3.5 1. I < i < k are independent. . 
Given an initial value ()o define iterates  Om+l . Y n ) have density as in (6. . .11.11 are obtained as realization of i. • I 1 • • (a)P[Z.5. 2. Tn. .1 14. E {z(lJ.d.. . Use this to imitate the argument of Theorem 6.4). .4.OkO) under H.3. . 15.. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973).
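Fisher's method of scoring, $\hat\theta_{m+1} = \hat\theta_m + I^{-1}(\hat\theta_m) Dl(\hat\theta_m)$, can be sketched for a one-parameter Poisson regression with canonical (log) link, $\log \lambda_i = \beta z_i$. The model, the data, and the function names below are assumptions made for illustration. In this canonical-link case the observed and expected information coincide, so the iteration is also Newton-Raphson.

```python
import math


def poisson_score(beta, z, y):
    """Derivative of the Poisson log likelihood with log lambda_i = beta * z_i."""
    return sum(zi * (yi - math.exp(beta * zi)) for zi, yi in zip(z, y))


def poisson_information(beta, z):
    """Fisher information for beta; equals minus the second derivative here."""
    return sum(zi * zi * math.exp(beta * zi) for zi in z)


def fisher_scoring(z, y, beta0=0.0, tol=1e-12, max_iter=100):
    """Iterate beta <- beta + score / information until the step is negligible."""
    beta = beta0
    for _ in range(max_iter):
        step = poisson_score(beta, z, y) / poisson_information(beta, z)
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical covariates and counts, for illustration only.
z_values = [0.0, 1.0, 2.0, 3.0]
y_counts = [1, 2, 4, 9]
beta_hat = fisher_scoring(z_values, y_counts)
```

The returned value solves the likelihood equation $\sum z_i (y_i - e^{\beta z_i}) = 0$ to numerical precision; a different starting value changes the number of iterations but not the limit, because the log likelihood is strictly concave in $\beta$ here.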
T)exp { C(T) where T is known. . Let YI. . .. .. Wn). h(y. . T. Set ~ = g(.1 = 0 V fJ~ . Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . Give 0.9).. Find the canonical link function and show that when g is the canonical link.b(Oi)} p(y. 00 d" ~ a(3j (b) Show that the Fisher information is Z].= 1 .7 Problems and Complements 435 (b) The linear span of {ZII). (e) Suppose that Y.).ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j.. Show that when 9 is the canonical link. (c) Suppose (Z" Y I ). the resuit of (c) coincides with (6. . Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known). d". (d) Gaussian GLM.i. b(9). J JrJ 4. T). .p.. ~ .. and v(.) ~ 1/11("i)( d~i/ d". and b' and 9 are monotone. 1'0) = jy 1'01 2 /<7.. and v(. give the asymptotic distribution of y'n({3 . Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1. Y follow the model p(y. Show that.) and v(. (a) Show that the likelihood equations are ~ i=I L..9).Oi)~h(y.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. . ... h(y. the deviance is 5.5). ~ N("" <7. 05. <. (Zn. . under appropriate conditions. C(T). Show that for the Gaussian linear model with known variance D(y.y . T). C(T).)z'j dC.. . .Section 6. g("i) = zT {3. Y) and that given Z = z. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.. distribution. j = 1..5.). In the random design case. .. b(9). k. Yn ) are i. (y. L. Assume that there exist functions h(y. as (Z. J .5.5.O(z)) where O(z) solves 1/(0) = gI(zT {3). ball about "k=l A" zlil in RP .) = ~ Var(Y)/c(T) b"(O). your result coincides with (6. b(B).) . then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .. g(p.. Give 0...) and C(T) such that the model for Yi can be written as O..F. T.d. has the Poisson. Suppose Y. Wi = w(". P(I"). 9 = (b')l...(3).(. k. contains an open L.T).
then it is. Wn Wn + op(l) + op(I). Consider the linear model of Example 6.). 2. . then the Rao test does not in general have the correct asymptotic level.O' 2 (p») = 1/4f(v(p». W n . the infonnation bound and asymptotic variance of Vri(X 1').Q+1>'" . Show that. . Suppose ADA6 are valid. 0 < Vae f < 00.. P. .6. I: .6. Show that 0: 2 given in (6.2. O' 2(p)/0'2 = 2/1r.1.2. in fact. then under H.i. l 1 . then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. (a) Show that if f is symmetric about 1'.6.7. hence.d. . that is. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold.14) is a consistent estimate of2 VarpX(l) in Example 6. . Wald. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6.3 is as given in (6. • I 3.1 under this model and verifying the fonnula given.1) ~ . but if f~(x) = ~ exp Ix 1'1.3 and. then v( P) . . Establish (6..6.2 and the hypothesis (3q+l = (3o. By Problem 5. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0.6.6 1. let 1r = s(z. I . n . and Rao tests are still asymptotically equivalent in the sense that if 2 log An. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. I 4. Show that the LR.15) by verifying the condition of Theorem 6. set) = 1.p under the sole assumption that E€ = 0. Note: 2 log An. (6. /10) is estimated by 1(80 ). . j. replacing &2 by 172 in (6. t E R. f}o) is used.10).6. then O' 2(p) > 0'2 = Varp(X. 7. = 1" (b) Show that if f is N(I'. the unique median of p.10) creates a valid level u test.3.Xn are i. and R n are the corresponding test statistics.s(t).j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6.. 
W n and R n are computed under the assumption of the Gaussian linear model with a 2 known.(3p ~ (3o. . Suppose Xl. " .3) then f}(P) = f}o. Show that the standard Wald test forthe problem of Example 6. 0'2). I "j 5.4. ! I I ! I . if VarpDl(X. but that if the estimate ~ L:~ dDIllDljT (Xi.6. if P has a positive density v(P). Apply Theorem 6. i 6. then 0'2(P) < 0'2.3. In the hinary data regression model of Section 6.1.6.
the model with 13d+l = .{3lp) p and {3(P) ~ (/31...p don't matter. f3p.d. i .. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i. .i .. then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular. 0. 1973. _.9. Let f3(p) be the LSE under this assumption and Y. i = 1.2 is known. (Model Selection) Consider the classical Gaussian linear model (6.2 is an unbiased estimate of EPE(P).2 (b) Show that (1 + ~D + .2. where Xln) ~ {(Y. 1 n.1. Zi are ddimensional vectors for covariate (factor) values. then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution.(b) continue to hold if we assume the GaussMarkov model. 1 n. hence.. A natural goal to entertain is to obtain new values Yi"..Section 6. . Gaussian with mean zero J1i = Z T {3 and variance 0'2. . Show that if the correct model has Jri given by s as above and {3 = {3o.. . Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows.. that Zj is bounded with probability 1 and let ih(X ln )).I EL(Y. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n. where ti are i.1) Yi = J1i + ti.lpI)2 be the residual sum of squares. (a) Show that EPE(p) ~ .. Suppose that ...lp)2 Here Yi" l ' •• 1 Y.7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models.. at Zll'" ..1. 1 V.. be the MLE for the logit model. 8. Zi) : 1 <i< n}. /3P+I = '" = /3d ~ O.. .(p) the corresponding fitted value. O)T and deduce that (c) EPE(p) = RSS(p) + ~. . Zi. . . (b) Suppose that Zi are realizations of U. y~p) and.9. Let RSS(p) = 2JY. i = 1. L:~ 1 (P. .. But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1. for instance).Zn and evaluate the performance of Yjep) ..' i=l .p. ~ ~ (d) Show that (a).jP)2 where ILl ) = z. 
Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d .d. .. Hint: Apply Theorem 6.
(e) Suppose $p = 2$ and $\mu(\mathbf{z}) = \beta_1 z_{i1} + \beta_2 z_{i2}$. Evaluate EPE for (i) $\eta_i = \beta_1 z_{i1}$ and (ii) $\eta_i = \beta_1 z_{i1} + \beta_2 z_{i2}$. Give values of $\beta_1$, $\beta_2$ and $\{z_{i1}, z_{i2}\}$ such that the EPE in case (i) is smaller than in case (ii) and vice versa. Use $\sigma^2 = 1$ and $n = 10$.

Hint: (a) Note that ... Derive the result for the canonical model.

(b) The result depends only on the mean and covariance structure of the $Y_i$, $Y_i^*$, $\hat\mu_i^{(p)}$, $\epsilon_i$, $i = 1, \ldots, n$.

6.8 NOTES

Note for Section 6.1

(1) From the L. A. Heart Study after Dixon and Massey (1969).

Note for Section 6.2

(1) See Problem 3.5.9 for a discussion of densities with heavy tails.

Note for Section 6.4

(1) R. A. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. To guard against such situations he argued that the test should be used in a two-tailed fashion and that we should reject $H$ both for large and for small values of $\chi^2$. Of course, this makes no sense for the model we discussed in this section, but it is reasonable, if we consider alternatives to $H$, which are not multinomial. For instance, we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. LR test statistics for enlarged models of this type do indeed reject $H$ for data corresponding to small values of $\chi^2$ as well as large ones (Problem 6.4.16). The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6.6.

6.9 REFERENCES

COX, D. R., The Analysis of Binary Data London: Methuen, 1970.

DIXON, W. AND F. MASSEY, Introduction to Statistical Analysis, 3rd ed. New York: McGraw-Hill, 1969.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F., An Introduction to Linear Statistical Models, Vol. I New York: McGraw-Hill, 1961.

HABERMAN, S., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications New York: Wiley, 1952.

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob. 1, Univ. of California Press, 221-233 (1967).

KASS, R., J. KADANE AND L. TIERNEY, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

KOENKER, R. AND V. D'OREY, "Computing regression quantiles," J. Roy. Statist. Soc. Ser. C, 36, 383-393 (1987).

LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, 11, 475-558. Gauthier-Villars, Paris) (1789).

MALLOWS, C., "Some comments on $C_p$," Technometrics, 15, 661-675 (1973).

McCULLAGH, P. AND J. A. NELDER, Generalized Linear Models London: Chapman and Hall, New York, 1983; second edition, 1989.

PORTNOY, S. AND R. KOENKER, "The Gaussian Hare and the Laplacian Tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics. Therefore, we include some proofs as well in these sections.

In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that to be called an experiment such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize. This long-term relative frequency $n_A/n$, where $n_A$ is the number of times the possible outcome $A$ occurs in $n$ repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment."

Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. For a discussion of this approach and further references the reader may wish to consult Savage (1954), Raiffa and Schlaiffer (1961), Savage (1962), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model.

In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols $\cup$, $\cap$, $^c$, $-$, $\subset$ for union, intersection, complementation, set theoretic difference, and inclusion as is usual in elementary set theory.

A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by $\Omega$. Its complement, the null set or impossible event, is denoted by $\emptyset$.

A.1.2 A sample point is any member of $\Omega$ and is typically denoted by $\omega$.

A.1.3 Subsets of $\Omega$ are called events. We denote events by $A$, $B$, and so on or by a description of their members, as we shall see subsequently. The relation between the experiment and the model is given by the correspondence "$A$ occurs if and only if the actual outcome of the experiment is a member of $A$." The set operations we have mentioned have interpretations also. For example, the relation $A \subset B$ between sets considered as events means that the occurrence of $A$ implies the occurrence of $B$. If $\omega \in \Omega$, $\{\omega\}$ is called an elementary event. If $A$ contains more than one point, it is called a composite event.

A.1.4 We will let $\mathcal{A}$ denote a class of subsets of $\Omega$ to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability $P$ to every subset of $\Omega$. However, $\mathcal{A}$ is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; and Loève, 1977). A probability distribution or measure is a nonnegative function $P$ on $\mathcal{A}$ having the following properties:

(i) $P(\Omega) = 1$.

(ii) If $A_1, A_2, \ldots$ are pairwise disjoint sets in $\mathcal{A}$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i\Big) = \sum_{i=1}^{\infty} P(A_i).$$

Recall that $\bigcup_{i=1}^{\infty} A_i$ is just the collection of points that are in any one of the sets $A_i$ and that two sets are disjoint if they have no points in common.
A.1.5 The three objects $\Omega$, $\mathcal{A}$, and $P$ together describe a random experiment mathematically. We shall refer to the triple $(\Omega, \mathcal{A}, P)$ either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of $\mathcal{A}$.

References
Gnedenko (1967) Chapter 1, Sections 1-3, 6-8
Grimmett and Stirzaker (1992) Sections 1.1-1.3
Hoel, Port, and Stone (1971) Sections 1.1, 1.2
Parzen (1960) Chapter 1, Sections 1-5
Pitman (1993) Sections 1.2 and 1.3

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of $P$.

A.2.1 If $A \subset B$, then $P(B - A) = P(B) - P(A)$.

A.2.2 $P(A^c) = 1 - P(A)$, $P(\emptyset) = 0$.

A.2.3 If $A \subset B$, $P(B) \ge P(A)$.

A.2.4 $0 \le P(A) \le 1$.

A.2.5 $P\big(\bigcup_{n=1}^{\infty} A_n\big) \le \sum_{n=1}^{\infty} P(A_n)$.

A.2.6 If $A_1 \subset A_2 \subset \cdots \subset A_n \subset \cdots$, then $P\big(\bigcup_{n=1}^{\infty} A_n\big) = \lim_{n \to \infty} P(A_n)$.

A.2.7 $P\big(\bigcap_{i=1}^{k} A_i\big) \ge 1 - \sum_{i=1}^{k} P(A_i^c)$ (Bonferroni's inequality).

References
Gnedenko (1967) Chapter 1, Section 8
Grimmett and Stirzaker (1992) Section 1.3
Hoel, Port, and Stone (1971) Section 1.3
Parzen (1960) Chapter 1, Sections 4-5
Pitman (1993) Section 1.3

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if $\Omega$ is finite or countably infinite and every subset of $\Omega$ is assigned a probability. That is, we can write $\Omega = \{\omega_1, \omega_2, \ldots\}$ and $\mathcal{A}$ is the collection of subsets of $\Omega$. In this case, by axiom (ii) of (A.1.4), we have for any event $A$,
$$P(A) = \sum_{\omega_i \in A} P(\{\omega_i\}). \qquad \text{(A.3.2)}$$
An important special case arises when $\Omega$ has a finite number of elements, say $N$, all of which are equally likely. Then $P(\{\omega\}) = 1/N$ for every $\omega \in \Omega$, and
$$P(A) = \frac{\text{Number of elements in } A}{N}. \qquad \text{(A.3.3)}$$

A.3.4 Suppose that $\omega_1, \ldots, \omega_N$ are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if $N$ is small by putting the "names" of the $\omega_i$ in a hopper, shaking well, and drawing. For large $N$, a random number table or computer can be used.

References
Gnedenko (1967) Chapter 1, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.1

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event $B$ such that $P(B) > 0$ and any other event $A$, we define the conditional probability of $A$ given $B$, which we write $P(A \mid B)$, by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \qquad \text{(A.4.1)}$$

If $P(A)$ corresponds to the frequency with which $A$ occurs in a large number of repetitions of the experiment, then $P(A \mid B)$ corresponds to the frequency of occurrence of $A$ relative to the class of trials in which $B$ does occur. From a heuristic point of view $P(A \mid B)$ is the chance we would assign to the event $A$ if we were told that $B$ has occurred.

If $A_1, A_2, \ldots$ are (pairwise) disjoint events and $P(B) > 0$, then
$$P\Big(\bigcup_{i=1}^{\infty} A_i \,\Big|\, B\Big) = \sum_{i=1}^{\infty} P(A_i \mid B). \qquad \text{(A.4.2)}$$

In fact, for fixed $B$ as before, the function $P(\cdot \mid B)$ is a probability measure on $(\Omega, \mathcal{A})$ which is referred to as the conditional probability measure given $B$.

Transposition of the denominator in (A.4.1) gives the multiplication rule,
$$P(A \cap B) = P(B) P(A \mid B). \qquad \text{(A.4.3)}$$

If $B_1, B_2, \ldots, B_n$ are (pairwise) disjoint events of positive probability whose union is $\Omega$, the identity $A = \bigcup_{j=1}^{n} (A \cap B_j)$, (A.1.4)(ii) and (A.4.3) yield
$$P(A) = \sum_{j=1}^{n} P(A \mid B_j) P(B_j). \qquad \text{(A.4.4)}$$

If $P(A)$ is positive, we can combine (A.4.1), (A.4.3), and (A.4.4) and obtain Bayes rule
$$P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) P(B_j)}. \qquad \text{(A.4.5)}$$

The conditional probability of $A$ given $B_1, \ldots, B_n$ is written $P(A \mid B_1, \ldots, B_n)$ and defined by
$$P(A \mid B_1, \ldots, B_n) = P(A \mid B_1 \cap \cdots \cap B_n) \qquad \text{(A.4.6)}$$
for any events $A, B_1, \ldots, B_n$ such that $P(B_1 \cap \cdots \cap B_n) > 0$.

Simple algebra leads to the multiplication rule,
$$P(B_1 \cap \cdots \cap B_n) = P(B_1) P(B_2 \mid B_1) P(B_3 \mid B_1, B_2) \cdots P(B_n \mid B_1, \ldots, B_{n-1}) \qquad \text{(A.4.7)}$$
whenever $P(B_1 \cap \cdots \cap B_{n-1}) > 0$.

Two events $A$ and $B$ are said to be independent if
$$P(A \cap B) = P(A) P(B). \qquad \text{(A.4.8)}$$

If $P(B) > 0$, the relation (A.4.8) may be written
$$P(A \mid B) = P(A). \qquad \text{(A.4.9)}$$

In other words, $A$ and $B$ are independent if knowledge of $B$ does not affect the probability of $A$.

The events $A_1, \ldots, A_n$ are said to be independent if
$$P(A_{i_1} \cap \cdots \cap A_{i_k}) = \prod_{j=1}^{k} P(A_{i_j}) \qquad \text{(A.4.10)}$$
for any subset $\{i_1, \ldots, i_k\}$ of the integers $\{1, \ldots, n\}$. If all the $P(A_i)$ are positive, relation (A.4.10) is equivalent to requiring that
$$P(A_j \mid A_{i_1}, \ldots, A_{i_k}) = P(A_j) \qquad \text{(A.4.11)}$$
for any $j$ and $\{i_1, \ldots, i_k\}$ such that $j \notin \{i_1, \ldots, i_k\}$.

References
Gnedenko (1967) Chapter 1, Sections 9
Grimmett and Stirzaker (1992) Section 1.4
Hoel, Port, and Stone (1971) Sections 1.4, 1.5
Parzen (1960) Chapter 2, Section 4; Chapter 3, Sections 1, 4
Pitman (1993) Section 1.4
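The formula (A.4.4) and Bayes rule (A.4.5) can be illustrated with a short computation. The following is a minimal sketch, not part of the original text: the partition $B_1, B_2, B_3$, its probabilities, and the conditional probabilities $P(A \mid B_j)$ are hypothetical numbers chosen only for illustration, and the function names are our own.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(A) = sum_j P(A | B_j) P(B_j), as in (A.4.4)."""
    return sum(l * p for l, p in zip(likelihoods, priors))


def bayes_posterior(priors, likelihoods):
    """Return [P(B_1 | A), ..., P(B_n | A)] using Bayes rule (A.4.5)."""
    p_a = total_probability(priors, likelihoods)
    return [l * p / p_a for l, p in zip(likelihoods, priors)]


# Hypothetical disjoint events B_1, B_2, B_3 whose union is the sample space.
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]               # P(B_j)
likelihoods = [Fraction(1, 100), Fraction(2, 100), Fraction(5, 100)]     # P(A | B_j)

p_a = total_probability(priors, likelihoods)
posterior = bayes_posterior(priors, likelihoods)

print("P(A) =", p_a)
print("P(B_j | A) =", posterior)
```

Exact rational arithmetic is used so that the posterior probabilities can be seen to add up to 1, as they must because $P(\cdot \mid A)$ is itself a probability measure.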
A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6 where examples of compound experiments are given.

A.5.1 Recall that if $A_1, \ldots, A_n$ are events, the Cartesian product $A_1 \times \cdots \times A_n$ of $A_1, \ldots, A_n$ is by definition $\{(\omega_1, \ldots, \omega_n) : \omega_i \in A_i, 1 \le i \le n\}$. If we are given $n$ experiments (probability models) $\mathcal{E}_1, \ldots, \mathcal{E}_n$ with respective sample spaces $\Omega_1, \ldots, \Omega_n$, then the sample space $\Omega$ of the $n$ stage compound experiment is by definition $\Omega_1 \times \cdots \times \Omega_n$. The ($n$ stage) compound experiment consists in performing component experiments $\mathcal{E}_1, \ldots, \mathcal{E}_n$ and recording all $n$ outcomes. The interpretation of the sample space $\Omega$ is that $(\omega_1, \ldots, \omega_n)$ is a sample point in $\Omega$ if and only if $\omega_1$ is the outcome of $\mathcal{E}_1$, $\omega_2$ is the outcome of $\mathcal{E}_2$ and so on. To say that $\mathcal{E}_i$ has had outcome $\omega_i^0 \in \Omega_i$ corresponds to the occurrence of the compound event (in $\Omega$) given by $\Omega_1 \times \cdots \times \Omega_{i-1} \times \{\omega_i^0\} \times \Omega_{i+1} \times \cdots \times \Omega_n = \{(\omega_1, \ldots, \omega_n) \in \Omega : \omega_i = \omega_i^0\}$. More generally, if $A_i \in \mathcal{A}_i$, the sigma field corresponding to $\mathcal{E}_i$, then $A_i$ corresponds to $\Omega_1 \times \cdots \times \Omega_{i-1} \times A_i \times \Omega_{i+1} \times \cdots \times \Omega_n$ in the compound experiment. If we want to make the $\mathcal{E}_i$ independent, then intuitively we should have all classes of events $A_1, \ldots, A_n$ with $A_i \in \mathcal{A}_i$, independent. This makes sense in the compound experiment. If $P$ is the probability measure defined on the sigma field $\mathcal{A}$ of the compound experiment, that is, the subsets $A$ of $\Omega$ to which we can assign probability(1), we should have
$$P([A_1 \times \Omega_2 \times \cdots \times \Omega_n] \cap [\Omega_1 \times A_2 \times \cdots \times \Omega_n] \cap \cdots) = P(A_1 \times \cdots \times A_n)$$
$$= P(A_1 \times \Omega_2 \times \cdots \times \Omega_n) P(\Omega_1 \times A_2 \times \cdots \times \Omega_n) \cdots P(\Omega_1 \times \cdots \times \Omega_{n-1} \times A_n). \qquad \text{(A.5.2)}$$

If we are given probabilities $P_1$ on $(\Omega_1, \mathcal{A}_1)$, $P_2$ on $(\Omega_2, \mathcal{A}_2)$, ..., $P_n$ on $(\Omega_n, \mathcal{A}_n)$, then (A.5.2) defines $P$ for $A_1 \times \cdots \times A_n$ by
$$P(A_1 \times \cdots \times A_n) = P_1(A_1) \cdots P_n(A_n). \qquad \text{(A.5.3)}$$

It may be shown (Billingsley, 1995; Chung, 1974; Loève, 1977) that if $P$ is defined by (A.5.3) for events $A_1 \times \cdots \times A_n$, it can be uniquely extended to the sigma field $\mathcal{A}$ specified in note (1) at the end of this appendix. We shall speak of independent experiments $\mathcal{E}_1, \ldots, \mathcal{E}_n$ if the $n$ stage compound experiment has its probability structure specified by (A.5.3). In the discrete case (A.5.3) holds provided that
$$P(\{(\omega_1, \ldots, \omega_n)\}) = P_1(\{\omega_1\}) \cdots P_n(\{\omega_n\}) \quad \text{for all } \omega_i \in \Omega_i, \ 1 \le i \le n. \qquad \text{(A.5.4)}$$
. .5.6. we refer to such an experiment as a multinomial trial with probabilities PI. . If we repeat such an experiment n times independently. Sampling With and Without Replacement 447 Specifying P when the £1 are dependent is more complicated.6..1 Suppose that we have an experiment with only two possible outcomes.. References Grimmett and Stirzaker ( (992) Sections (.w n )}) = P(£l has outcome wd P(£2 hasoutcomew21 £1 has outcome WI)'" P(£T' has outcome W I £1 has outcome WI. we shall refer to such an experiment as a Bernoulli trial with probability of success p.6.£"_1 has outcome wnd.k)!' The fonnula (A.2) where k(w) is the number of S's appearing in w.3) is known as the binomial probability. (A.. which we shall denote by 5 (success) and F (failure).6.6 BERNOULLI AND MULTINOMIAL TRIALS. we say we have performed n Bernoulli trials with success probability p. n The probability structure is determined by these conditional probabilities and conversely. the following.: n )}) for each (WI. i = 1" . if an experiment has q possible outcomes WI. the compound experiment is called n multinomial trials with probabilities PI. any point wEn is an ndimensional vector of S's and F's and.6.·. . Other examples will appear naturally in what follows.'" . Ifweassign P( {5}) = p.. .. SAMPLING WITH AND WITHOUT REPLACEMENT A.. If Ak is the event [exactly k S's occur].SP({(w] . If the experiment is perfonned n times independently...6.4 More generally. A.: n ) with Wi E 0 1. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). By the multiplication rule (A.S) .5 Parzen (1960) Chapter 3 A.7) we have.5..Section A 6 Bernoulli and Multinomial Trials. . . then (A... /I. and Stone (1971) Section 1.. ill the discrete case.wq and P( {Wi}) = Pi..Pq' If fl is the sample space of this experiment and W E fl. 1.· IPq. J. .4.. .u.6 Hoel. Port. A. If o is the sample space of the compound experiment. then (A. 
In the discrete case we know P once we have specified P( {(wI .3) where n ) ( k = n! kl(n . . .
where $k_i(\omega)$ = number of times $\omega_i$ appears in the sequence $\omega$. If $A_{k_1, \ldots, k_q}$ is the event (exactly $k_1$ $\omega_1$'s are observed, exactly $k_2$ $\omega_2$'s are observed, ..., exactly $k_q$ $\omega_q$'s are observed), then
$$P(A_{k_1, \ldots, k_q}) = \frac{n!}{k_1! \cdots k_q!}\, p_1^{k_1} \cdots p_q^{k_q} \qquad \text{(A.6.6)}$$
where the $k_i$ are natural numbers adding up to $n$.

A.6.7 If we perform an experiment given by $(\Omega, \mathcal{A}, P)$ independently $n$ times, we shall sometimes refer to the outcome of the compound experiment as a sample of size $n$ from the population given by $(\Omega, \mathcal{A}, P)$. When $\Omega$ is finite the term, with replacement is added to distinguish this situation from that described in (A.6.8) as follows.

A.6.8 If we have a finite population of cases $\Omega = \{\omega_1, \ldots, \omega_N\}$ and we select cases $\omega_i$ successively at random $n$ times without replacement, the component experiments are not independent and, for any outcome $a = (\omega_{i_1}, \ldots, \omega_{i_n})$ of the compound experiment,
$$P(\{a\}) = \frac{1}{(N)_n} \qquad \text{(A.6.9)}$$
where
$$(N)_n = \frac{N!}{(N-n)!}.$$

If the case drawn is replaced before the next drawing, we are sampling with replacement, and the component experiments are independent and $P(\{a\}) = 1/N^n$. If $Np$ of the members of $\Omega$ have a "special" characteristic $S$ and $N(1-p)$ have the opposite characteristic $F$ and $A_k$ = (exactly $k$ "special" individuals are obtained in the sample), then
$$P(A_k) = \binom{n}{k} \frac{(Np)_k\, (N(1-p))_{n-k}}{(N)_n} = \frac{\binom{Np}{k} \binom{N(1-p)}{n-k}}{\binom{N}{n}} \qquad \text{(A.6.10)}$$
for $\max(0, n - N(1-p)) \le k \le \min(n, Np)$, and $P(A_k) = 0$ otherwise. The formula (A.6.10) is known as the hypergeometric probability.

References
Gnedenko (1967) Chapter 2, Section 11
Hoel, Port, and Stone (1971) Section 2.4
Parzen (1960) Chapter 3, Sections 1-4
Pitman (1993) Section 2.1

A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space.
.8 are usually called absolutely continuous.7.. dx n • is called a density function. An important special case of (A.. for practical purposes. (A7..I) because the study of this model and that of the model that has = {Xl. .b k ) = {(XI"". A.S) is by definition r JR' 1A(X)P(x)dx where 1A(x) ~ 1 if x E A. } are equivalent.(ak.7.bk ) are k open intervals..3. xd'. . which we denote by f3k. we shall call the set (aJ. It may be shown that a function P so defined satisfies (AlA). and 0 otherwise. We will write R for R1 and f3 for f31.5) A. x A..Xn .S) is given by (A. where ( )' denotes transpose. A. Geometrically.2 The Borelfield in R k . Integrals should be interpreted in the sense of Lebesgue. Thefrequency function p of a discrete distribution is defined on Rk by n pix) = P({x»). . JR' where dx denotes dX1 . Any subset of R k we might conceivably be interested in turns out to be a member of f3k. However. any nonnegative function pon R k vanishing except on a sequence {Xl.Section A.bd x '" (ak.EA pix. 1 <i <k}anopenkrectallgle. . That is.6 A nonnegative function p on R k • which is integrable and which has r p(x)dx = 1. This definition is consistent with (A..7.7.7...7. Riemann integrals are adequate. A. X n . Recall that the integral on the right of (A.). . P defined by A. . } of vectors and that satisfies L:~ I P(Xi) = 1 defines a unique discrete probability distribution by the relation P(A) = L x. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely..7.3 A discrete (probability) distribution on R k is a probability measure P such that L:~ I P( {Xi}) = 1 for some sequence of points {xd in R k . (A7A) Conversely.. .7..bd.S) for some density function P and all events A. ..9) .7 Probabilities on Euclidean Space 449 We shall use the notation R k of k~dimensional Euclidean space and denote members of Rk by symbols such as x or (Xl.Xk) :ai <Xi <bi. 
is defined to be the smallest sigma field having all open k rectangles as members. only an Xi can occur as an outcome of the experiment...1If (al.7 A continuous probability distn'bution on Rk is a probability P that is defined by the relation P(A) = L p(x)d x =1 (A7.7. P(A) is the volume of the "cylinder" with base A and height p(x) at x. .
(A.1. 5.7.) F is defined by F(Xl' .1. 5. 22 Hoel. F is continuous at x if and only if P( {x}) (A.15) I. be thought of as measuring approximately how much more Or less likely we are to obtain an outcome in a neighborhood of XQ then one in a neighborhood of Xl_ A.16) defines a unique P on the real line.13HA. P Xl (A.h..7. Sections 21. x. We always have F(x)F(xO)(2) =P({x}). thus.:0 and Xl are in R.2 parzen (1960) Chapter 4.7.7.14) . . ! I . x. x n j X =? F(x n ) ~ (A..17) = O. F is a function of a real variable characterized by the following properties: > I .f. . Thus.J x .xo + h]) '" 2hp(xo) and P([ h Xl 1 + h]) + Xl p(xo) hi) '" ( ).1 and 4. When k = 1. then P = Q. POlt.4.h.13) x <y =? F(x) < F(y) (Monotone) F(x) (Continuous from the right) (A. . • J .11 The distribution function (dJ.]).5 i • . r . (A. References Gnedenko (1967) Chapter 4.7.) = P( ( 00..2..7. then by the mean value theorem paxo .16) I It may be shown that any function F satisfying (A. and h is close to 0.450 A Review of Basic Probability Theory Appendix A It turns out that a continuous probability distribution determines the density that generates it "uniquely. (A.7. . For instance."(l) Although in a continuous model P( {x}) = 0 for every x. the density function has an operational interpretation close to thal of the frequency function. x (00. if p is a continuous density on R..lO) The ratio p(xo)jp(xl) can. 3. defines P in the sense that if P and Q are two probabilities with the same d. and Stone (1971) Sections 3. Sections 14.7.7 Pitman (1993) Sections 3.7. limx~oo F(x) limx~_oo =1 F(x) = O.x.1'.7. 4. Xo P([xo .12) The dJ.
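The operational interpretation of a density in (A.7.10) can be checked numerically. The sketch below is an illustration only, under assumptions of our own: we take $p$ to be the standard normal density, compute interval probabilities from its d.f. via the error function in Python's standard library, and choose $x_0 = 0$, $x_1 = 1$, $h = 10^{-3}$ arbitrarily.

```python
import math


def normal_density(x):
    """Standard normal density p(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_df(x):
    """Standard normal distribution function F(x), written with erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def interval_probability(center, h):
    """P([center - h, center + h]) = F(center + h) - F(center - h)."""
    return normal_df(center + h) - normal_df(center - h)


# Hypothetical choices of the two points and the half-width.
x0, x1, h = 0.0, 1.0, 1e-3

exact = interval_probability(x0, h)
approx = 2.0 * h * normal_density(x0)          # left part of (A.7.10)

ratio_of_probabilities = interval_probability(x0, h) / interval_probability(x1, h)
ratio_of_densities = normal_density(x0) / normal_density(x1)   # right part of (A.7.10)

print(exact, approx)
print(ratio_of_probabilities, ratio_of_densities)
```

For this density the ratio $p(x_0)/p(x_1)$ equals $e^{1/2}$, so an outcome near 0 is roughly 1.65 times as likely as one near 1, and the ratio of the two small-interval probabilities agrees with this closely.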
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the time to breakdown and length of repair time for a randomly chosen machine, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, and so on. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable $X$ is a function from $\Omega$ to $R$ such that the set $\{\omega : X(\omega) \in B\} = X^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}$.(1)

A.8.2 A random vector $\mathbf{X} = (X_1, \ldots, X_k)^T$ is a $k$-tuple of random variables, or equivalently a function from $\Omega$ to $R^k$ such that the set $\{\omega : \mathbf{X}(\omega) \in B\} = \mathbf{X}^{-1}(B)$ is in $\mathcal{A}$ for every $B \in \mathcal{B}^k$.(1) For $k = 1$ random vectors are just random variables. The event $\mathbf{X}^{-1}(B)$ will usually be written $[\mathbf{X} \in B]$ and $P([\mathbf{X} \in B])$ will be written $P[\mathbf{X} \in B]$.

The probability distribution of a random vector $\mathbf{X}$ is, by definition, the probability measure $P_{\mathbf{X}}$ in the model $(R^k, \mathcal{B}^k, P_{\mathbf{X}})$ given by

$$P_{\mathbf{X}}(B) = P[\mathbf{X} \in B]. \tag{A.8.3}$$

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. Similarly, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. The subscript $X$ or $\mathbf{X}$ will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to unless the reference is clear from the context, in which case they will be omitted.

The probability of any event that is expressible purely in terms of $\mathbf{X}$ can be calculated if we know only the probability distribution of $\mathbf{X}$. In the discrete case this means we need only know the frequency function and in the continuous case the density. Thus, from (A.7.5) and (A.7.8),

$$P[\mathbf{X} \in A] = \begin{cases} \sum_{\mathbf{x} \in A} p(\mathbf{x}), & \text{if } \mathbf{X} \text{ is discrete} \\ \int_A p(\mathbf{x})\,d\mathbf{x}, & \text{if } \mathbf{X} \text{ is continuous.} \end{cases} \tag{A.8.5}$$

When we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined.

The study of real- or vector-valued functions of a random vector $\mathbf{X}$ is central in the theory of probability and of statistics. Here is the formal definition of such transformations. Let $\mathbf{g}$ be any function from $R^k$ to $R^m$, $k, m \geq 1$, such that(2) $\mathbf{g}^{-1}(B) = \{\mathbf{y} \in R^k : \mathbf{g}(\mathbf{y}) \in B\} \in \mathcal{B}^k$ for every $B \in \mathcal{B}^m$. Then the random transformation $\mathbf{g}(\mathbf{X})$ is defined by

$$\mathbf{g}(\mathbf{X})(\omega) = \mathbf{g}(\mathbf{X}(\omega)). \tag{A.8.6}$$

An example of a transformation often used in statistics is $\mathbf{g} = (g_1, g_2)^T$ with $g_1(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} X_i = \bar{X}$ and $g_2(\mathbf{X}) = k^{-1} \sum_{i=1}^{k} (X_i - \bar{X})^2$. Another common example is $\mathbf{g}(\mathbf{X}) = (\min\{X_i\}, \max\{X_i\})^T$.

The probability distribution of $\mathbf{g}(\mathbf{X})$ is completely determined by that of $\mathbf{X}$ through

$$P[\mathbf{g}(\mathbf{X}) \in B] = P[\mathbf{X} \in \mathbf{g}^{-1}(B)]. \tag{A.8.7}$$

If $\mathbf{X}$ is discrete with frequency function $p_{\mathbf{X}}$, then $\mathbf{g}(\mathbf{X})$ is discrete and has frequency function

$$p_{\mathbf{g}(\mathbf{X})}(\mathbf{t}) = \sum_{\{\mathbf{x} : \mathbf{g}(\mathbf{x}) = \mathbf{t}\}} p_{\mathbf{X}}(\mathbf{x}). \tag{A.8.8}$$

Suppose that $X$ is continuous with density $p_X$ and $g$ is real-valued and one-to-one(3) on an open set $S$ such that $P[X \in S] = 1$. Furthermore, assume that the derivative $g'$ of $g$ exists and does not vanish on $S$. Then $g(X)$ is continuous with density given by

$$p_{g(X)}(t) = \frac{p_X(g^{-1}(t))}{|g'(g^{-1}(t))|} \tag{A.8.9}$$

for $t \in g(S)$, and 0 otherwise. This is called the change of variable formula. If $g(X) = \sigma X + \mu$, $\sigma \neq 0$, and $X$ is continuous, then

$$p_{g(X)}(t) = \frac{1}{|\sigma|}\, p_X\!\left(\frac{t - \mu}{\sigma}\right). \tag{A.8.10}$$

From (A.8.8) it follows that if $(X, Y)^T$ is a discrete random vector with frequency function $p_{(X,Y)}$, then the frequency function of $X$, known as the marginal frequency function, is given by(4)

$$p_X(x) = \sum_{y} p_{(X,Y)}(x, y). \tag{A.8.11}$$

Similarly, if $(X, Y)^T$ is continuous with density $p_{(X,Y)}$, it may be shown (as a consequence of (A.8.7) and (A.7.8)) that $X$ has a marginal density function given by(5)

$$p_X(x) = \int_{-\infty}^{\infty} p_{(X,Y)}(x, y)\,dy. \tag{A.8.12}$$

These notions generalize to the case $\mathbf{Z} = (\mathbf{X}, \mathbf{Y})$, a random vector obtained by putting two random vectors together. The (marginal) frequency or density of $\mathbf{X}$ is found as in (A.8.11) and (A.8.12) by summing or integrating out over $\mathbf{y}$ in $p_{(\mathbf{X},\mathbf{Y})}(\mathbf{x}, \mathbf{y})$.

Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa.
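The formulas (A.8.8), (A.8.10), and (A.8.11) can be checked numerically on small examples. The following Python sketch is only an illustration and is not part of the text's development: the function names `pushforward`, `marginal_first`, and `affine_density`, the two-dice frequency function, and the uniform density are all assumptions introduced here.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def pushforward(p_x, g):
    """Frequency function of g(X) as in (A.8.8): sum p_X(x) over {x : g(x) = t}."""
    p_g = defaultdict(Fraction)  # Fraction() is exactly 0
    for x, prob in p_x.items():
        p_g[g(x)] += prob
    return dict(p_g)


def marginal_first(p_xy):
    """Marginal frequency function of X as in (A.8.11): sum p(x, y) over y."""
    return pushforward(p_xy, lambda xy: xy[0])


def affine_density(p_x, sigma, mu):
    """Density of sigma * X + mu as in (A.8.10); sigma must be nonzero."""
    if sigma == 0:
        raise ValueError("sigma must be nonzero")
    return lambda t: p_x((t - mu) / sigma) / abs(sigma)


# Hypothetical discrete example: (X, Y) is a pair of independent fair dice.
dice = {(i, j): Fraction(1, 36) for i, j in product(range(1, 7), repeat=2)}

p_sum = pushforward(dice, lambda xy: xy[0] + xy[1])          # g(x, y) = x + y
p_minmax = pushforward(dice, lambda xy: (min(xy), max(xy)))  # g = (min, max)
p_first = marginal_first(dice)                               # marginal of X

# Hypothetical continuous example: X uniform on (0, 1), so -2X + 3 lives on (1, 3).
def uniform01(x):
    return 1.0 if 0.0 < x < 1.0 else 0.0


p_scaled = affine_density(uniform01, sigma=-2.0, mu=3.0)
```

Exact rational arithmetic via `Fraction` is used so that the discrete frequency functions sum to exactly 1. Treating the marginal as the special case $g(x, y) = x$ of `pushforward` mirrors the remark that (A.8.11) follows from (A.8.8).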
. . . Nevertheless, . . . A.15 and B.7 . . .

The random variables $X_1, \ldots, X_n$ are said to be (mutually) independent if and only if for any sets $A_1, \ldots, A_n$ in $\mathcal{B}$, the events $[X_1 \in A_1], \ldots, [X_n \in A_n]$ are independent.

. . . if $\mathbf{X}$ and $\mathbf{Y}$ are independent, so are $\mathbf{g}(\mathbf{X})$ and $\mathbf{h}(\mathbf{Y})$. . . . if $(X_1,$ . . . $(X_1, X_1 X_2)$ and . . ., and so on.

. . . Then the random variables $X_1, \ldots, X_n$ are independent if . . . (A.9.4) . . .

A.9.7 The preceding equivalences are valid for random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ . . .

References Gnedenko (1967) Chapter 4, . . . Hoel, . . . Pitman (1993) Section