https://www.scribd.com/doc/65669825/MathStats
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
p. cm.
Includes bibliographical references and index.
ISBN 0-13-850363-X (v. 1)
1. Mathematical statistics. I. Doksum, Kjell A. II. Title.
QA276.B47 2001
519.5 dc21 00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
© 2001, 1977 by Prentice-Hall, Inc., Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS

PREFACE TO THE SECOND EDITION: VOLUME I
PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
  1.1 Data, Models, Parameters, and Statistics
    1.1.1 Data and Models
    1.1.2 Parametrizations and Parameters
    1.1.3 Statistics as Functions on the Sample Space
    1.1.4 Examples, Regression Models
  1.2 Bayesian Models
  1.3 The Decision Theoretic Framework
    1.3.1 Components of the Decision Theory Framework
    1.3.2 Comparison of Decision Procedures
    1.3.3 Bayes and Minimax Criteria
  1.4 Prediction
  1.5 Sufficiency
  1.6 Exponential Families
    1.6.1 The One-Parameter Case
    1.6.2 The Multiparameter Case
    1.6.3 Building Exponential Families
    1.6.4 Properties of Exponential Families
    1.6.5 Conjugate Families of Prior Distributions
  1.7 Problems and Complements
  1.8 Notes
  1.9 References

2 METHODS OF ESTIMATION
  2.1 Basic Heuristics of Estimation
    2.1.1 Minimum Contrast Estimates; Estimating Equations
    2.1.2 The Plug-In and Extension Principles
  2.2 Minimum Contrast Estimates and Estimating Equations
    2.2.1 Least Squares and Weighted Least Squares
    2.2.2 Maximum Likelihood
  2.3 Maximum Likelihood in Multiparameter Exponential Families
  *2.4 Algorithmic Issues
    2.4.1 The Method of Bisection
    2.4.2 Coordinate Ascent
    2.4.3 The Newton-Raphson Algorithm
    2.4.4 The EM (Expectation/Maximization) Algorithm
  2.5 Problems and Complements
  2.6 Notes
  2.7 References

3 MEASURES OF PERFORMANCE
  3.1 Introduction
  3.2 Bayes Procedures
  3.3 Minimax Procedures
  *3.4 Unbiased Estimation and Risk Inequalities
    3.4.1 Unbiased Estimation, Survey Sampling
    3.4.2 The Information Inequality
  *3.5 Nondecision Theoretic Criteria
    3.5.1 Computation
    3.5.2 Interpretability
    3.5.3 Robustness
  3.6 Problems and Complements
  3.7 Notes
  3.8 References

4 TESTING AND CONFIDENCE REGIONS
  4.1 Introduction
  4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
  4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
  4.4 Confidence Bounds, Intervals, and Regions
  4.5 The Duality Between Confidence Regions and Tests
  *4.6 Uniformly Most Accurate Confidence Bounds
  *4.7 Frequentist and Bayesian Formulations
  4.8 Prediction Intervals
  4.9 Likelihood Ratio Procedures
    4.9.1 Introduction
    4.9.2 Tests for the Mean of a Normal Distribution, Matched Pair Experiments
    4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
    4.9.4 The Two-Sample Problem with Unequal Variances
    4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
  4.10 Problems and Complements
  4.11 Notes
  4.12 References

5 ASYMPTOTIC APPROXIMATIONS
  5.1 Introduction: The Meaning and Uses of Asymptotics
  5.2 Consistency
    5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
    5.2.2 Consistency of Minimum Contrast Estimates
  5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
    5.3.1 The Delta Method for Moments
    5.3.2 The Delta Method for In Law Approximations
    5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
  5.4 Asymptotic Theory in One Dimension
    5.4.1 Estimation: The Multinomial Case
    *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
    *5.4.3 Asymptotic Normality and Efficiency of the MLE
    *5.4.4 Testing
    *5.4.5 Confidence Bounds
  5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
  5.6 Problems and Complements
  5.7 Notes
  5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
  6.1 Inference for Gaussian Linear Models
    6.1.1 The Classical Gaussian Linear Model
    6.1.2 Estimation
    6.1.3 Tests and Confidence Intervals
  *6.2 Asymptotic Estimation Theory in p Dimensions
    6.2.1 Estimating Equations
    6.2.2 Asymptotic Normality and Efficiency of the MLE
    6.2.3 The Posterior Distribution in the Multiparameter Case
  *6.3 Large Sample Tests and Confidence Regions
    6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
    6.3.2 Wald's and Rao's Large Sample Tests
  *6.4 Large Sample Methods for Discrete Data
    6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
    6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
    6.4.3 Logistic Regression for Binary Responses
  *6.5 Generalized Linear Models
  *6.6 Robustness Properties and Semiparametric Models
  6.7 Problems and Complements
  6.8 Notes
  6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
  A.1 The Basic Model
  A.2 Elementary Properties of Probability Models
  A.3 Discrete Probability Models
  A.4 Conditional Probability and Independence
  A.5 Compound Experiments
  A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
  A.7 Probabilities on Euclidean Space
  A.8 Random Variables and Vectors: Transformations
  A.9 Independence of Random Variables and Vectors
  A.10 The Expectation of a Random Variable
  A.11 Moments
  A.12 Moment and Cumulant Generating Functions
  A.13 Some Classical Discrete and Continuous Distributions
  A.14 Modes of Convergence of Random Variables and Limit Theorems
  A.15 Further Limit Theorems and Inequalities
  A.16 Poisson Process
  A.17 Notes
  A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
  B.1 Conditioning by a Random Variable or Vector
    B.1.1 The Discrete Case
    B.1.2 Conditional Expectation for Discrete Variables
    B.1.3 Properties of Conditional Expected Values
    B.1.4 Continuous Variables
    B.1.5 Comments on the General Case
  B.2 Distribution Theory for Transformations of Random Vectors
    B.2.1 The Basic Framework
    B.2.2 The Gamma and Beta Distributions
  B.3 Distribution Theory for Samples from a Normal Population
    B.3.1 The χ², F, and t Distributions
    B.3.2 Orthogonal Transformations
  B.4 The Bivariate Normal Distribution
  B.5 Moments of Random Vectors and Matrices
    B.5.1 Basic Properties of Expectations
    B.5.2 Properties of Variance
  B.6 The Multivariate Normal Distribution
    B.6.1 Definition and Density
    B.6.2 Basic Properties. Conditional Distributions
  B.7 Convergence for Random Vectors: O_P and o_P Notation
  B.8 Multivariate Calculus
  B.9 Convexity and Inequalities
  B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
    B.10.1 Symmetric Matrices
    B.10.2 Order on Symmetric Matrices
    B.10.3 Elementary Hilbert Space Theory
  B.11 Problems and Complements
  B.12 Notes
  B.13 References

C TABLES
  Table I The Standard Normal Distribution
  Table I′ Auxiliary Table of the Standard Normal Distribution
  Table II t Distribution Critical Values
  Table III χ² Distribution Critical Values
  Table IV F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared, statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data: terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable.

The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data.

As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book, in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious.

(5) The study of the interplay between numerical and statistical considerations. Despite advances in computing speed, some methods run quickly in real time; others do not, and some, though theoretically attractive, cannot be implemented in a human lifetime.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition. Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed.

Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 of the first edition now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities, as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on R^d as well as an elementary introduction to Hilbert space theory. As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis. Appendix B is as self-contained as possible, with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions is the change in subject matter necessitated by the current areas of importance in the field.

Specifically, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. We also, from the start, include examples that are important in applications, such as regression experiments. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B.

Chapters 1-4 develop the basic principles and examples of statistics. Save for these changes of emphasis, the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. There is more material on Bayesian models and analysis.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families, and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem. Robustness from an asymptotic theory point of view appears also.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept, with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Although we believe the material of Chapters 5 and 6 has now become fundamental, we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. Nevertheless, there is clearly much that could be omitted at a first reading that we also star. There are clear dependencies between starred sections that follow:

5.4.2 ⇒ 6.2 ⇒ 6.3 ⇒ 6.4 ⇒ 6.5; 6.3 ⇒ 6.6

Volume II is expected to be forthcoming in 2003. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The topic presently in Chapter 8, density estimation, will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill, and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless, appeared gratifyingly ended in 1976, but has, with the field, taken on a new life.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffe theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start; or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum
Berkeley 1976
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Second Edition
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS, AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor. Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals (see Example 1.1.4 and Sections 2.2.1 and 6.1).

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, a priori, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation. In the words of George Box (1979), "Models of course, are never true but fortunately it is only necessary that they be useful."

In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.6), X has an hypergeometric, H(Nθ, N, n), distribution:

P(X = k) = C(Nθ, k) C(N(1 − θ), n − k) / C(N, n), if max(n − N(1 − θ), 0) ≤ k ≤ min(Nθ, n),   (1.1.1)

where C(a, b) denotes the binomial coefficient "a choose b". The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed.

Example 1.1.2. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which N = ∞, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe x1, ..., xn, which are modeled as realizations of X1, ..., Xn independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such X1, ..., Xn as a random sample from F, and also write that X1, ..., Xn are i.i.d. as X with X ~ F, where "~" stands for "is distributed as." The model is fully described by the set F of distributions that we specify.

The same model also arises naturally in situation (c). Here we can write the n determinations of μ as

X_i = μ + ε_i, 1 ≤ i ≤ n,   (1.1.2)

where ε = (ε1, ..., εn)' is the vector of random errors. What should we assume about the distribution of ε, which together with μ completely specifies the joint distribution of X1, ..., Xn? Of course, that depends on how the experiment is carried out. Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, ε1, ..., εn are independent.

(2) The distribution of the error at one determination is the same as that at another. Thus, ε1, ..., εn are identically distributed.

(3) The distribution of ε is independent of μ.

The model is then specified by {(μ, G) : μ ∈ R, G ∈ G}, where G is the set of all allowable error distributions that we postulate. Commonly considered G's are all distributions with center of symmetry 0. Often the final simplification is made: the common distribution of the errors is Gaussian. Thus, the Xi are a sample from a N(μ, σ²) distribution. This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for example, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be μ, will have none of this.

Now consider situation (d), which leads to:

Example 1.1.3. Two-Sample Models. We let the x's denote the responses of subjects given drug A, a standard drug or placebo, and the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. Natural initial assumptions here are:

(1) The x's and y's are realizations of X1, ..., Xm, a random sample from F, and Y1, ..., Yn, a sample from G, respectively, so that the model is specified by the set of allowable (F, G) pairs.

(2) Suppose that if treatment A had been administered to a subject, response X would have been obtained. Then if treatment B had been administered to the same subject instead of treatment A, response X + Δ would have been obtained. This implies that if F is the distribution of a control, then G(x) = F(x − Δ).

(3) The control responses are normally distributed. Then if F is the N(μ, σ²) distribution and G is the N(μ + Δ, σ²) distribution, we have specified the Gaussian two-sample model with equal variances.
or alternatively all distributions with expectation O. By convention. Then if F is the N(I". . The classical def~ult model is: (4) The common distribution of the errors is N(o.3) and the model is alternatively specified by F. 0'2). for instance. .1. so that the model is specified by the set of possible (F. . where 0'2 is unknown.t and cr. if we let G be the distribution function of f 1 and F that of Xl. Equivalently Xl. we refer to the x's as control observations. Let Xl. response y = X + ~ would be obtained where ~ does not depend on x. cr > O} where tP is the standard normal distribution. YI. We call the y's treatment observations.. We call this the shift model with parameter ~. (72) population or equivalently F = {tP (':Ji:) : Jl E R.. 1 X Tn .. be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. . 1 X Tn a sample from F.. A placebo is a substance such as water tpat is expected to have no effect on the disease and is used to correct for the welldocumented placebo effect. To specify this set more closely the critical constant treatment effect assumption is often made. tbe set of F's we postulate.
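The constant-treatment-effect (shift) assumption G(y) = F(y − Δ) can be checked symbolically in the Gaussian case: if F is the N(μ, σ²) distribution function, the treatment distribution is forced to be N(μ + Δ, σ²). A small sketch (Python; the numeric values μ, σ, Δ are illustrative assumptions) verifies the identity on a grid:

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma):
    """N(mu, sigma^2) distribution function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Shift model: if responses satisfy Y = X + delta, the treatment cdf G
# satisfies G(y) = F(y - delta).  With F = N(mu, sigma^2) this forces
# G = N(mu + delta, sigma^2): the Gaussian two-sample model with equal variances.
mu, sigma, delta = 1.0, 2.0, 0.5
F = lambda y: norm_cdf(y, mu, sigma)
G = lambda y: norm_cdf(y, mu + delta, sigma)
check = all(abs(G(y) - F(y - delta)) < 1e-12 for y in [-3.0, 0.0, 1.7, 4.0])
```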
It is often convenient to identify the random vector X with its realization. and Statistics 5 How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. For instance. equally trained observers with no knowledge of each other's findings. The number of defectives in the first example clearly has a hypergeometric distribution.6.4. For instance. A review of necessary concepts and notation from probability theory are given in the appendices. Models. The advantage of piling on assumptions such as (I )(4) of Example 1. In Example 1. This will be done in Sections 3. the number of a particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. That is.5. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise.'" . we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated.1.3 and 6. P is referred to as the model. in Example 1.Xn ). As our examples suggest. However. have been assigned to B. The danger is that.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease.2 is that. if they are false. Statistical methods for models of this kind are given in Volume 2. we can be reasonably secure about some aspects.1. P is the . there is tremendous variation in the degree of knowledge and control we have concerning experiments. in comparative experiments such as those of Example 1. we need only consider its probability distribution.2. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. 
we can ensure independence and identical distribution of the observations by using different.1. if (1)(4) hold.1.2.3 when F. For instance. though correct for the model written down. When w is the outcome of the experiment. if they are true. may be quite irrelevant to the experiment that was actually performed. we observe X and the family P is that of all bypergeometric distributions with sample size n and population size N.1.Section 1. in Example 1. In others. we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. Experiments in medicine and the social sciences often pose particular difficulties. the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1). In some applications we often have a tested theoretical model and the danger is small. In this situation (and generally) it is important to randomize. our analyses. We are given a random experiment with sample space O. Fortunately. we now define the elements of a statistical model. Since it is only X that we observe. Using our first three examples for illustrative purposes. On this sample space we have defined a random vector X = (Xl. ~(w) is referred to as the observations or data. Parameters. G are assumed arbitrary.1. we know how to combine our measurements to estimate JL in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4. for instance. This distribution is assumed to be a member of a family P of probability distributions on Rn.1.1 Data. All the severely ill patients might. but not others. the data X(w).
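The randomization device described here — making the m patients who receive drug A a sample without replacement from the m + n available patients — can be sketched in a few lines (Python; the helper name, the seed, and the patient labels are our illustrative assumptions, standing in for the random number table):

```python
import random

def randomize(patients, m, seed=None):
    """Assign m of the patients to drug A uniformly at random (a sample
    without replacement); the rest receive drug B."""
    rng = random.Random(seed)
    group_a = set(rng.sample(patients, m))
    group_b = [p for p in patients if p not in group_a]
    return sorted(group_a), group_b

# 10 available patients, 4 to receive drug A
a, b = randomize(list(range(10)), m=4, seed=0)
```

Because every subset of size m is equally likely, observed differences between the groups cannot be systematically attributed to unconscious bias in the assignment.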
under assumptions (1)(4). Thus.1 we can use the number of defectives in the population.2 ) distribution. knowledge of the true Pe. we know we are measuring a positive quantity in this model.. as a parameter and in Example 1.2)). such that we can have (}1 f. and Performance Criteria Chapter 1 family of all distributions according to which Xl.e.. What parametrization we choose is usually suggested by the phenomenon we are modeling. The only truly nonparametric but useless model for X E R n is to assume that its (joint) distribution can be anything. in Example 1. on the other hand. Models such as that of Example 1. NO. 1 • • • .2 with assumptions (1)(4) we have = R x R+ and. still in this example.. 1) errors.. G) : I' E R. parts of fJ remain unknowable. we may parametrize the model by the first and second moments of the normal distribution of the observations (i. ... that is. tt is the unknown constant being measured.6 Statistical Models. we only wish to make assumptions (1}(3) with t:...t2 + k e e e J . if () = (Il. . .3 with only (I) holding and F. . .1. the parameter space e.. we have = R+ x R+ .X l . or equivalently write P = {Pe : BEe}. For instance. X n ) remains the same but = {(I'. .1. In Example 1.. 02 ). in Example = {O. if) (X'. () t Po from a space of labels.1.1. that is. we will need to ensure that our parametrizations are identifiable. as we shall see later. () is the fraction of defectives.2) suppose that we pennit G to be arbitrary.. .1 we take () to be the fraction of defectives in the shipment. n) distribution. models P are called parametric. 02 and yet Pel = Pe 2 • Such parametrizations are called unidentifiable. 1) errors lead parametrization is unidentifiable because.. N. . .1.G taken to be arbitrary are called nonparametric. If.2. . Goals. Of even greater concern is the possibility that the parametrization is not onetoone. that is. Pe the H(NB.Xrn are identically distributed as are Y1 . Yn . ." that is. to P. 
X rn are independent of each other and Yl . We may take any onetoone function of 0 as a new parameter.I} and 1.Xn are independent and identically distributed with a common N (IL. For instance.l. Then the map sending B = (I'. Yn . e j .1. I 1.2 Parametrizations and Parameters i e To describe P we USe a parametrization. Pe the distribution on Rfl with density implicitly taken n~ 1 . 0.G) has density n~ 1 g(Xi 1'). having expectation 0. lh i O => POl f. Ghas(arbitrary)densityg}. When we can take to be a nice subset of Euclidean space and the maps 0 + Po are smooth. . Now the and N(O. in senses to be made precise later. moreover. tt = to the same distribution of the observations a~ tt = 1 and N (~ 1.1.1. G with density 9 such that xg(x )dx = O} and p(". Pe 2 • 2 e ° . the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. j. However.JL) where cp is the standard normal density. . for eXample.. Thus. Note that there are many ways of choosing a parametrization in these and all other problems. . The critical problem with such parametrizations is that ev~n with "infinite amounts of data.2 with assumptions (1)(3) are called semiparametric. Finally.G) : I' E R. by (tt.3 that Xl. in (l. If.. models such as that of Example 1. a map. It's important to note that even nonparametric models make substantial assumptionsin Example 1. G) into the distribution of (Xl. we can take = {(I'.
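The unidentifiability of the (μ, G) parametrization can be exhibited numerically: with X = μ + ε, the value (μ = 0, ε ~ N(1, 1)) and the value (μ = 1, ε ~ N(0, 1)) induce the identical N(1, 1) law for X. A sketch (Python; the specific numbers are illustrative) compares the two induced densities pointwise:

```python
from math import exp, pi, sqrt

def norm_pdf(x, mean):
    # normal density with the given mean and standard deviation 1
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2 * pi)

# X = mu + eps.  theta1 = (mu=0, eps ~ N(1,1)) and theta2 = (mu=1, eps ~ N(0,1))
# induce the same N(1,1) law for X, so theta -> P_theta is not one-to-one:
# the parametrization is unidentifiable.
density_theta1 = lambda x: norm_pdf(x, 0 + 1)   # mu = 0, E[eps] = 1
density_theta2 = lambda x: norm_pdf(x, 1 + 0)   # mu = 1, E[eps] = 0
same = all(abs(density_theta1(x) - density_theta2(x)) < 1e-15
           for x in [-2.0, 0.0, 0.3, 5.0])
```

No amount of data can separate θ₁ from θ₂, which is exactly why the additional centering assumption E(εᵢ) = 0 (or a symmetry assumption) is imposed.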
3. then 0'2 is a nuisance parameter. from to its range P iff the latter map is 11. that is. or the midpoint of the interquantile range of P. which can be thought of as the difference in the means of the two populations of responses. thus.1.3) and 9 = (G : xdG(x) = J O}. A parameter is a feature v(P) of the distribution of X.1. a function q : + N can be identified with a parameter v( P) iff Po.. ) evidently is and so is I' + C>. For instance. which indexes the family P. formally a map. Then "is identifiable whenever flx and flY exist. that is. instead of postulating a constant treatment effect ~. make 8 + Po into a parametrization of P.1. (2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable).. say with replacement.1.2 is now well defined and identifiable by (1. is that of a parameter.1.1. More generally.Section 1." = f1Y . We usually try to combine parameters of interest and nuisance parameters into a single grand parameter (). G) parametrization of Example 1. But given a parametrization (J + Pe. (J is a parameter if and only if the parametrization is identifiable. Similarly.2 again in which we assume the error € to be Gaussian but with arbitrary mean~. But = Var(X. or more generally as the center of symmetry of P. the fraction of defectives () can be thought of as the mean of X/no In Example 1. E(€i) = O. v. where 0'2 is the variance of €.3 with assumptions (1H2) we are interested in~. our interest in studying a population of incomes may precisely be in the mean income.2 where fl denotes the mean income and.1. as long as P is the set of all Gaussian distributions.2 with assumptions (1)(4) the parameter of interest fl. ~ Po.flx. Sometimes the choice of P starts by the consideration of a particular parameter. Formally. The (fl. in Example 1. For instance. In addition to the parameters of interest. in Example 1. and observe Xl. we can start by making the difference of the means. For instance. 
Implicit in this description is the assumption that () is a parameter in the sense we have just defined. it is natural to write e e e . we can define (J : P + as the inverse of the map 8 + Pe.. Models. the focus of the study.1. As we have seen this parametrization is unidentifiable and neither f1 nor ~ arc parameters in the sense we've defined. Then P is parametrized by 8 = (fl1~' 0'2). X n independent with common distribution. For instance. Parameters. and Statistics 7 Dual to the notion of a parametrization. . Here are two points to note: (1) A parameter can have many representations. When we sample. or the median of P. which correspond to other unknown features of the distribution of X. implies q(BJl = q(B2 ) and then v(Po) q(B). in Example 1. from P to another space N. if the errors are normally distributed with unknown variance 0'2.1. implies 8 1 = 82 . if POl = Pe'). consider Example 1.1 Data. .2. a map from some e to P. . For instance. in Example 1. there are also usually nuisance parameters.M(P) can be characterized as the mean of P.
Databased model selection can make it difficult to ascenain or even assign a meaning to the accuracy of estimates or the probability of reaChing correct conclusions. Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. and Performance Criteria Chapter 1 1. Thus.• 1 X n ) = X ~ L~ I Xi. T(x) = x/no In Example 1.8 Statistical Models. This statistic takes values in the set of all distribution functions on R. in Example 1. consider situation (d) listed at the beginning of this section. but the true values of parameters are secrets of nature. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient. Our aim is to use the data inductively. . which now depends on the data. for example. For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. which evaluated at ~ X and 8 2 are called the sample mean and sample variance. It estimates the function valued parameter F defined by its evaluation at x E R. These issues will be discussed further in Volume 2. is the statistic T(X 11' . For instance. X n ) are a sample from a probability P on R and I(A) is the indicator of the event A. x E Ris F(X " ~ I .3 Statistics as Functions on the Sample Space I j • . Xn)(x) = n L I(X n i < x) i=l where (X" . can be related to model formulation as we saw earlier..1. The link for us are things we can compute. Informally. F(P)(x) = PIX.2 a cOmmon estimate of J1. Models and parametrizations are creations of the statistician. For instance.. a statistic T is a map from the sample space X to some space of values T. we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. however. .1. Deciding which statistics are important is closely connected to deciding which parameters are important and.. 
Next this model. this difference depends on the patient in a complex manner (the effect of each drug is complex). usually a Euclidean space. Nevertheless.2 is the statistic I i i = i i 8 2 = n1 "(Xi . statistics. Formally. a common estimate of 0.X)" L i=l I n How we use statistics in estimation and other decision procedures is the subject of the next section. T( x) is what we can compute if we observe X = x. hence. is used to decide what estimate of the measure of difference should be employed (ct. the fraction defective in the sample. Mandel.1. 1964). called the empirical distribution function. to narrow down in useful ways our ideas of what the "true" P is.. < xJ. then our attention naturally focuses on estimating this constant If. a statistic we shall study extensively in Chapter 2 is the function valued statistic F.1. Goals. In this volume we assume that the model has . we can draw guidelines from our numbers and cautiously proceed.
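The three statistics singled out here — the sample mean, the sample variance, and the function-valued empirical distribution function F̂(x) = n⁻¹ Σ 1(Xᵢ ≤ x) — are simple to compute. The following sketch (Python; the tiny data set is illustrative, and we follow the text's divisor-n convention for S²) evaluates all three:

```python
def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # divisor-n convention, matching the text's definition of S^2
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / len(xs)

def empirical_cdf(xs):
    """F_hat(x) = (1/n) * #{i : X_i <= x}, a function-valued statistic."""
    n = len(xs)
    return lambda x: sum(xi <= x for xi in xs) / n

data = [2.0, 1.0, 3.0, 2.0]
F_hat = empirical_cdf(data)
```

Note that `empirical_cdf` returns a function: the statistic takes values in the set of distribution functions on R, not in a Euclidean space.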
weight. the number of patients in the study (the sample size) is random. height. and so On of the ith subject in a study. Y n ) where Y11 ••• . We observe (ZI' Y. density and frequency functions by p(. patients may be considered one at a time. Models. ). Notation. When dependence on 8 has to be observed. Regression Models We end this section with two further important examples indicating the wide scope of the notions we have introduced. and the decision of which drug to administer for a given patient may be made using the knowledge of what happened to the previous patients. 0).. 0). There are also situations in which selection of what data will be observed depends on the experimenter and on his or her methods of reaching a conclusion. They are not covered under our general model and will not be treated in this book. In most studies we are interested in studying relations between responses and several other variables not just treatment or control as in Example 1. Regular models. .. Zi is a d dimensional vector that gives characteristics such as sex.. Yn are independent.X" . It will be convenient to assume(1) from now on that in any parametric model we consider either: (I) All of the P. Moreover. The experimenter may.1 Data.. age. Distribution functions will be denoted by F(·. are continuous with densities p(x. For instance. in Example 1. Example 1. Yn ). and Statistics 9 been selected prior to the current experiment. See A. We refer the reader to Wetherill and Glazebrook (1986) and Kendall and Stuart (1966) for more infonnation.. This is the Stage for the following. in the study. The distribution of the response Vi for the ith subject or case in the study is postulated to depend on certain characteristics Zi of the ith subject. we shall denote the distribution corresponding to any particular parameter value 8 by Po. Thus. . X m ). However. (B. .3.4 Examples. 'L':' Such models will be called regular parametric models. Thus. 
the statistical procedure can be designed so that the experimenter stops experimenting as soon as he or she has significant evidence to the effect that one drug is better than the other.. This selection is based on experience with previous similar experiments (cf. drugs A and B are given at several . Xl). Parameters. (A. Problems such as these lie in the fields of sequential analysis and experimental deSign. Expectations calculated under the assumption that X rv P e will be written Eo. 1990).1. Y. after a while. . assign the drugs alternatively to every other patient in the beginning and then.1. and there exists a set {XI. assign the drug that seems to be working better to a higher proportion of patients.4. in situation (d) again. ) that is independent of 0 such that 1 P(Xi' 0) = 1 for all O. (zn. for example. This is obviously overkill but suppose that.3 we could take z to be the treatment label and write onr observations as (A. 0). In the discrete case we will use both the tennsfrequency jimction and density for p(x.1. (B. these and other subscripts and arguments will be omitted where no confusion can arise. 8). Lehmann. Regression Models.1..lO.). For instance.Section 1. 0). (2) All of the P e are discrete with frequency functions p(x. sequentially. 1.
Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems.1...3(3) is a special case of this model.. Then. I'(B) = I' + fl. Here J. That is. Goals.10 Statistical Models.L(z) is an unknown function from R d to R that we are interested in. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3. If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi. Example 1. In the two sample models this is implied by the constant treatment effect assumption. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi. Often the following final assumption is made: (4) The distributiou F of (I) is N(O. (3) g((3. then we can write. . and Performance Criteria Chapter 1 dose levels.1.treat the Zi as a label of the completely unknown distributions of Yi..L(z) denote the expected value of a response with given covariate vector z. See Problem 1.. .. (3) and (4) above.E(Yi ). which we can write in vector matrix form. 0 . By varying the assumptions we obtain parametric models as with (I). . (c) whereZ nxa = (zf. Treatment Dose Level) for patient i.~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model. (b) where Ei = Yi . z) where 9 is known except for a vector (3 = ((31..3 with the Gaussian twosample model I'(A) ~ 1'.. and nonparametric if we drop (1) and simply . i=l n If we let J. Then we have the classical Gaussian linear model.1. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. .2 uuknown. the effect of z on Y is through f.2) with . The most common choice of 9 is the linear fOlltl. Clearly.(z) only. semiparametric as with (I) and (2) with F arbitrary. 
then the model is zT (a) P(YI. For instance. . . A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F.2 with assumptions (1)(4). .. [n general. .Z~)T and Jisthen x n identity.. We usually ueed to postulate more.z) = L.. i = 1.1.1.8. I3d) T of unknowns. in Example 1. d = 2 and can denote the pair (Treatment Label. . So is Example 1. n.Yn) ~ II f(Yi I Zi).
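For the linear form μ(z) = zᵀβ, the classical least-squares estimate of β solves the normal equations ZᵀZβ = ZᵀY. A minimal sketch for the two-parameter case (Python; the function name, the dose covariates, and the noiseless responses are our illustrative assumptions) recovers β exactly from exactly linear data:

```python
def fit_linear_model(Z, Y):
    """Least-squares estimate of beta in Y = Z beta + eps, two-parameter
    case, via the normal equations Z'Z beta = Z'Y (solved in closed form)."""
    a = sum(z[0] * z[0] for z in Z); b = sum(z[0] * z[1] for z in Z)
    d = sum(z[1] * z[1] for z in Z)
    u = sum(z[0] * y for z, y in zip(Z, Y))
    v = sum(z[1] * y for z, y in zip(Z, Y))
    det = a * d - b * b
    return ((d * u - b * v) / det, (a * v - b * u) / det)

# covariate vectors z_i = (1, dose); responses generated as y = 2 + 3 * dose
Z = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
Y = [2.0, 5.0, 8.0, 11.0]
beta = fit_linear_model(Z, Y)
```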
n. say.. .t+ei. eo = a Here the errors el. . Xi ~ 1.{3Xj1 j=2 n (1..(3cd··· f(e n Because €i =  (3e n _1). . i ~ 2. Let j. Parameters. ... 0'2) density. . .. model (a) assumes much more but it may be a reasonable first approximation in these situations. ...(3)I'). and the associated probability theory models and inference for dependent data are beyond the scope of this book.. A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors.(1 . and Statistics 11 Finally. i = 1. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence. However. .. X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore.. . p(e n I end f(edf(c2 .. . en' Using conditional probability theory and ei = {3ei1 + €i. . we give an example in which the responses are dependent. . It is plausible that ei depends on Cil because long waves tend to be followed by long waves. at best an approximation for the wave example.xn ) ~ f(x1 I') II f(xj .n ei = {3ei1 + €i.1 Data. Xl = I' + '1.. save for a brief discussion in Volume 2. .. Models. we have p(edp(C21 e1)p(e31 e"e2)". is that N(O.1. Measurement Model with Autoregressive Errors. C n where €i are independent identically distributed with density are dependent as are the X's.t. An example would be.n. To find the density p(Xt... X n be the n determinations of a physical constant J.cn _d p(edp(e2 I edp(e3 I e2) .. . The default assumption. i = l. the conceptual issues of stationarity.Section 1. X n is P(X1. In fact we can write f.Il. ergodicity. Let XI. . ' . Consider the model where Xi and assume = J.p(e" I e1... . .5. 0 . . . Xi .. the elapsed times X I. . . .t = E(Xi ) be the average time for an infinite series of records. x n ).. Example 1. Of course.(3) + (3X'l + 'i. we start by finding the density of CI.. the model for X I...
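The AR(1) error recursion above is mechanical to implement. The sketch below (Python; μ, β, and the innovation sequence are illustrative assumptions) generates Xᵢ = μ + eᵢ with e₀ = 0 and eᵢ = βeᵢ₋₁ + εᵢ, and checks the equivalent one-step form Xᵢ = μ(1 − β) + βXᵢ₋₁ + εᵢ used in the density factorization:

```python
def ar1_measurements(mu, beta, innovations):
    """X_i = mu + e_i with e_0 = 0 and e_i = beta * e_{i-1} + eps_i."""
    xs, e = [], 0.0
    for eps in innovations:
        e = beta * e + eps
        xs.append(mu + e)
    return xs

mu, beta = 10.0, 0.5
eps = [1.0, -0.5, 0.25, 0.0]
xs = ar1_measurements(mu, beta, eps)

# equivalently X_i = mu*(1 - beta) + beta*X_{i-1} + eps_i for i >= 2,
# which is what makes the conditional-density factorization tractable
ok = all(abs(xs[i] - (mu * (1 - beta) + beta * xs[i - 1] + eps[i])) < 1e-12
         for i in range(1, len(xs)))
```

The X's are dependent (each carries a fraction β of the previous error), so the joint density does not factor into a product of identical marginals as in the i.i.d. one-sample model.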
. the most important of which is the workhorse of statistics. N. .1) Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable 9. to think of the true value of the parameter (J as being the realization of a random variable (} with a known distribution."" 1l"N} for the proportion (J of defectives in past shipments. in the inspection Example 1. The general definition of parameters and statistics is given and the connection between parameters and pararnetrizations elucidated. in the past. In this section we introduced the first basic notions and formalism of mathematical statistics. vector observations X with unknown probability distributions P ranging over models P. given 9 = i/N.2.. we have had many shipments of size N that have subsequently been distributed. 1.2. Models are approximations to the mechanisms generating the observations. This is done in the context of a number of classical examples. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences. They are useful in understanding how the outcomes can he used to draw inferences that go beyond the particular experiment. We view statistical models as useful tools for learning from the outcomes of experiments and studies.I 12 Statistical Models. N. and Performance Criteria Chapter 1 Summary..2) This is an example of a Bayesian model. and indeed necessary.. . We know that. n). Thus. the regression model. There is a substantial number of statisticians who feel that it is always reasonable. we can construct a freq uency distribution {1l"O. Now it is reasonable to suppose that the value of (J in the present shipment is the realization of a random variable 9 with distribution given by P[O = N] I = IT" i = 0. .1. a . If the customers have provided accurate records of the number of defective items that they have found. 
.2 BAYESIAN MODELS Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data.1. This distribution does not always corresp:md to an experiment that is physically realizable but rather is thought of as measure of the beliefs of the experimenter concerning the true value of (J before he or she takes any data. ([. Goals. it is possible that. There are situations in which most statisticians would agree that more can be said For instance. N. X has the hypergeometric distribution 'H( i. PIX = k. 1l"i is the frequency of shipments with i defective items. The notions of parametrization and identifiability are introduced. 0 = N I I ([. That is. i = 0.
There is an even greater range of viewpoints in the statistical community, from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of θ has an objective interpretation in terms of frequencies. Our own point of view(1) is that subjective elements, including the views of subject matter experts, are an essential element in all model building. However, insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later). However, by giving θ a distribution purely as a theoretical tool to which no subjective significance is attached, we can obtain important and useful results and insights. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. (1962), Savage (1954), Raiffa and Schlaiffer (1961), Lindley (1965), De Groot (1969), and Berger (1985).

In this section we shall define and discuss the basic elements of Bayesian models. We shall return to the Bayesian framework repeatedly in our discussion.

Suppose that we have a regular parametric model {Pθ : θ ∈ Θ}. To get a Bayesian model we introduce a random vector θ, whose range is contained in Θ, with density or frequency function π. The function π represents our belief or information about the parameter θ before the experiment and is called the prior density or frequency function. We now think of Pθ as the conditional distribution of X given θ = θ. The joint distribution of (θ, X) is that of the outcome of a random experiment in which we first select θ = θ according to π and then, given θ = θ, select X according to Pθ. If both X and θ are continuous or both are discrete, then (θ, X) is appropriately continuous or discrete with density or frequency function

f(θ, x) = π(θ)p(x, θ). (1.2.3)

Equation (1.2.2) is an example of (1.2.3). In the "mixed" cases such as θ continuous, X discrete, the joint distribution is neither continuous nor discrete. Because we now think of p(x, θ) as a conditional density or frequency function given θ = θ, we will denote it by p(x | θ) for the remainder of this section.

The most important feature of a Bayesian model is the conditional distribution of θ given X = x, which is called the posterior distribution of θ. Before the experiment is performed, the information or belief about the true value of the parameter is described by the prior distribution. After the value x has been obtained for X, the information about θ is described by the posterior distribution.

For a concrete illustration, let us turn again to Example 1.1.1. Suppose that N = 100 and that from past experience we believe that each item has probability .1 of being defective independently of the other members of the shipment. This would lead to the prior distribution

π_i = C(100, i)(0.1)^i (0.9)^(100−i) (1.2.4)

for i = 0, 1, ..., 100. Before sampling any items the chance that a given shipment contains
.Xn are indicators of n Bernoulli trials with probability of success () where 0 < 8 < 1.1) '" I . is iudependeut of X and has a B(81. Example 1. 1.1100(0.8) as posterior density of 0. 1008 .30.1. X) given by (1.X.6) we argue loosely as follows: If be fore the drawing each item was defective with probability . 1r(t)p(x I t) if 8 is discrete.10). . some variant of Bayes' rule (B.9) > .j .2. to calculate the posterior. Bernoulli Trials.:.5) 5 ) = 0.k dt (1.X > 10 I X = lOJ P [(1000 .7) In general.. .1) (1.2. If we assume that 8 has a priori distribution with deusity 1r.8) roo 1r(9)p(x 19) 1r(t)p(x I t)dt if 8 is continuous. I 20 or more bad items is by the normal approximation with continuity correction. Goals. This leads to P[1000 > 20 I X = lOJ '" 0.10 '" .9 ] .1 > . . Therefore.6) To calculate the posterior probability given iu (1. 1r(9)9k (1 _ 9)nk 1r(Blx" . (1. P[IOOO > 201 = 10] P [ . (A 15. .3).2.1 and good with probability . ( 1.30. this will continue to be the case for the items left in the lot after the 19 sample items have been drawn.. we obtain by (1. (ii) If we denote the corresponding (posterior) frequency function or density by 1r(9 I x).8.. (i) The posterior distribution is discrete or continuous according as the prior distri bution is discrete or continuous..181(0. the number of defectives left after the drawing.2.<1>(0.xn )= Jo'1r(t)t k (lt)n.9)(0. Specifically.9) I .2. Here is an example..181(0..1)(0.2.2. 14 Statistkal Models.1100(0. P[1000 > 20 I X = 10] P[IOOO .X) .1)(0.. Suppose that Xl. .9) I . 0.2. In the cases where 8 and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (0. and Performance Criteria Chapter 1 I .52) 0.1.2.J C3 (1.001..<I> 1000 .. theu 1r(91 x) 1r(9)p(x I 9) 2.9)(0.1) distribution.. Thus. .9 independently of the other items.II ~. Now suppose that a sample of 19 has been drawn in which 10 defective items are found.4) can be used.
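The normal-approximation answer of about 0.30 can be cross-checked by computing the posterior exactly on the discrete grid of possible defective counts. The sketch below (Python; the setup N = 100, n = 19, x = 10 and the binomial prior of (1.2.4) are from the example, while the variable names are ours) applies Bayes rule directly:

```python
from math import comb

N, n, x = 100, 19, 10          # shipment size, sample size, observed defectives

def hyper(k, D):               # likelihood P[X = k | D defectives in shipment]
    if k > D or n - k > N - D:
        return 0.0
    return comb(D, k) * comb(N - D, n - k) / comb(N, n)

# prior (1.2.4): the number of defectives D = N*theta ~ Binomial(100, 0.1)
prior = [comb(N, D) * 0.1 ** D * 0.9 ** (N - D) for D in range(N + 1)]

# Bayes rule on the grid: posterior(D) proportional to prior(D) * P[X = x | D]
joint = [prior[D] * hyper(x, D) for D in range(N + 1)]
norm = sum(joint)
posterior = [j / norm for j in joint]

p_ge_20 = sum(posterior[20:])  # P[100*theta >= 20 | X = 10], close to 0.30
```

Values D < 10 get posterior probability zero (they are incompatible with having already seen 10 defectives), and the exact tail probability agrees with the text's normal approximation to within rounding.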
Note that the posterior density depends on the data only through the total number of successes, Σ_{i=1}^n x_i. We also obtain the same posterior density if θ has prior density π and we only observe Σ_{i=1}^n X_i, which has a B(n, θ) distribution given θ = θ (Problem 1.2.9).

Suppose, for instance, we are interested in the proportion θ of "geniuses" (IQ ≥ 160) in a particular city. To get information we take a sample of n individuals from the city, recording X_i = 0 or 1, i = 1, ..., n, according as the ith individual sampled is or is not a genius. If n is small compared to the size of the city, the approximation arguments of Appendix A lead us to assume that the number X of geniuses observed has approximately a B(n, θ) distribution. Now we may either have some information about the proportion of geniuses in similar cities of the country, or we may merely have prejudices that we are willing to express in the form of a prior distribution on θ. To choose a prior π, we need a class of distributions that concentrate on the interval (0, 1). One such class is the two-parameter beta family. As Figure B.2 indicates, the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions, though by no means all; for instance, non-U-shaped bimodal distributions are not permitted.

If we were interested in a proportion about which we have no information or belief, we might take θ to be uniformly distributed on (0, 1), which corresponds to using the beta distribution with r = s = 1. We may instead want to assume that θ has a density with maximum value at 0, such as that drawn with a dotted line in Figure B.2. Or else we may think that π(θ) concentrates its mass near a small number, say 0.05; then we can choose r and s in the β(r, s) distribution so that the mean is r/(r + s) = 0.05 and the variance is very small. The result might be a density such as the one marked with a solid line in Figure B.2.

Suppose, then, that the prior is β(r, s). Upon substituting the β(r, s) density (B.2.11) in (1.2.9) we obtain, for 0 < θ < 1,

π(θ | x1, ..., xn) = (1/c) θ^(k+r−1) (1 − θ)^(n−k+s−1).   (1.2.10)

The proportionality constant c, which depends on k, r, and s only, must be B(k + r, n − k + s), where B(·, ·) is the beta function; that is, the posterior distribution of θ given Σ X_i = k is β(k + r, n − k + s).

A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to the same family: here the resulting posterior distributions are again beta distributions. Such families are called conjugate. Evidently the beta family is conjugate to the binomial. Another, bigger, conjugate family is that of finite mixtures of beta distributions; see Problem 1.2.16. We return to conjugate families in Section 1.6.

Summary. We present an elementary discussion of Bayesian models, introduce the notions of prior and posterior distributions, and give Bayes rule. We also introduce, by example, the notion of a conjugate family of distributions.
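As a small illustration of the conjugate update above, here is a sketch (Python). The β(1, 19) prior, whose mean is 0.05, echoes the genius discussion; the counts k and n are invented for illustration.

```python
def beta_binomial_update(r, s, k, n):
    """beta(r, s) prior plus k successes in n Bernoulli trials gives a
    beta(k + r, n - k + s) posterior: the beta family is conjugate
    to the binomial."""
    return k + r, n - k + s

r, s = 1, 19                       # prior mean r/(r + s) = 0.05
post_r, post_s = beta_binomial_update(r, s, k=2, n=100)   # hypothetical sample
post_mean = post_r / (post_r + post_s)    # posterior mean (k + r)/(n + r + s)
```

With 2 successes in 100 trials the posterior is β(3, 117), with mean 0.025: the prior guess 0.05 is pulled toward the observed frequency 0.02.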
1.3 THE DECISION THEORETIC FRAMEWORK

The information we want to draw from data can be put in various forms depending on the purposes of our analysis. We may wish to produce "best guesses" of the values of important parameters, for instance, the fraction defective θ in Example 1.1.1 or the physical constant μ in Example 1.1.2. These are estimation problems.

In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not. For instance, in Example 1.1.3, P's that correspond to no treatment effect (i.e., placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to permit the marketing of drugs that do no good. In Example 1.1.1, P's with θ < θ0 are special: the receiver does not wish to keep "bad" shipments, θ > θ0, whereas a contractual agreement between shipper and receiver may penalize the return of "good" shipments; the receiver wants to discriminate and may be able to attach monetary costs to making a mistake of either type, "keeping the bad shipment" or "returning a good shipment." In Example 1.1.2, if μ0 is the critical matter density in the universe, so that μ < μ0 means the universe is expanding forever and μ > μ0 corresponds to an eternal alternation of Big Bangs and expansions, then depending on one's philosophy one could take either the P's corresponding to μ < μ0 or those corresponding to μ > μ0 as special. Making determinations of "specialness" corresponds to testing significance. In testing problems we, at a first cut, state which is supported by the data: "specialness" (or, as it's usually called, the "hypothesis") or "nonspecialness" (the alternative). However, there are many problems of this type in which it is unclear which of two disjoint sets of P's, P0 or P1, is special, and the general testing problem is really one of discriminating between P0 and P1. We may have other goals, as illustrated by the next two examples.

Ranking. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not permitted). Thus, if there are k different brands, there are k! possible rankings or actions, one of which will be announced as more consistent with the data than the others.

Prediction. A very important class of situations arises when, as in Example 1.1.4, we have a vector z, such as, say, (age, sex, drug dose)^T, that can be used for prediction of a variable of interest Y, say a 50-year-old male patient's response to the level of a drug. Intuitively, and as we shall see formally later, a reasonable prediction rule for an unseen Y (the response of a new patient) is the function μ(z), the expected value of Y given z. Unfortunately μ(·) is unknown. However, if we have observations (z_i, Y_i), 1 ≤ i ≤ n, we can try to estimate the function μ(·). For instance, if we believe μ(z) = g(β, z), we can estimate β from our observations Y_i of g(β, z_i) and then plug our estimate of β into g. Note that we really want to estimate the function μ(·); our results will guide the selection of doses of drug for future patients.

In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function. There are many possible choices of estimates. For instance, in Example 1.1.1 do we use the observed fraction of defectives X/n as our estimate, or do we ignore the data and use historical information on past shipments? In Example 1.1.2, to estimate μ do we use the mean of the measurements, X̄ = (1/n) Σ_{i=1}^n X_i, or the median, defined as any value such that half the X_i are at least as large and half no bigger? The same type of question arises in all examples. Intuitively, in estimation we care how far off we are, in testing whether we are right or wrong, in ranking what mistakes we have made, and so on. In any case, whatever our choice of procedure, we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we are doing. In designing a study to compare treatments A and B, we need a priori estimates of how well even the best procedure can do; for instance, we need to determine sample sizes that will be large enough to enable us to detect differences that matter. Thus, in Example 1.1.3, even with the simplest Gaussian model, it is intuitively clear, and will be made precise later, that a large σ² will force large m, n to give us a good chance of correctly deciding that the treatment effect is there. On the other hand, once a study is carried out we would probably want not only to estimate Δ but also to know how reliable our estimate is; thus, we would want a posteriori estimates of performance. The answer will depend on the model and, most significantly, on what criteria of performance we use. These examples motivate the decision theoretic framework: we need to

(1) clarify the objectives of a study,
(2) point to what the different possible actions are,
(3) provide assessments of risk, accuracy, and reliability of statistical procedures,
(4) provide guidance in the choice of procedures for analyzing outcomes of experiments.

1.3.1 Components of the Decision Theory Framework

As in Section 1.1, we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. We usually take P to be parametrized, P = {P_θ : θ ∈ Θ}.

Action space. A new component is an action space A of actions or decisions or claims that we can contemplate making. Here are action spaces for our examples.

1. Estimation. If we are estimating a real parameter, such as the fraction θ of defectives in Example 1.1.1 or μ in Example 1.1.2, it is natural to take A = R, though smaller spaces may serve equally well, for instance, A = {0, 1/n, ..., 1} in Example 1.1.1.

2. Testing. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or, in more usual language, the hypothesis H : P ∈ P0, in which we identify P0 with the set of "special" P's). By convention, A = {0, 1}, with 1 corresponding to rejection of H. Thus, in Example 1.1.3, taking action 1 would mean deciding that Δ ≠ 0.

3. Ranking. Here quite naturally A = {permutations (i1, ..., ik) of {1, ..., k}}. Thus, if we have three air conditioners, there are 3! = 6 possible rankings:

A = {(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)}.
4. Prediction. Here A is much larger: A = {a : a is a function from Z to R}, with a(z) representing the prediction we would make if the new unobserved Y had covariate value z. Evidently Y could itself range over an arbitrary space Y, and then R would be replaced by Y in the definition of a(·). For instance, if Y = 0 or 1 corresponds to, say, "does not respond" and "responds," respectively, and z = (Treatment, Sex)^T, then a(B, M) would be our prediction of response or no response for a male given treatment B.

Loss function. Far more important than the choice of action space is the choice of loss function, defined as a function l : P × A → R+. The interpretation of l(P, a) is as the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature," that is, the probability distribution producing the data, is P. We write l(θ, a) for l(P, a) if P is parametrized. As we shall see, although loss functions, as the name suggests, sometimes can genuinely be quantified in economic terms, they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.

Estimation. In estimating a real-valued parameter ν(P), or q(θ) if P is parametrized, the most commonly used loss function is

quadratic loss: l(P, a) = (ν(P) − a)² (or l(θ, a) = (q(θ) − a)²).

Other choices that are, as we shall see (Chapter 5), less computationally convenient but perhaps more realistically penalize large errors less are

absolute value loss: l(P, a) = |ν(P) − a|,

and truncated quadratic loss: l(P, a) = min{(ν(P) − a)², d²}.

Closely related to the latter is what we shall call confidence interval loss: l(P, a) = 0 if |ν(P) − a| < d, and l(P, a) = 1 otherwise. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. Although estimation loss functions are typically symmetric in ν and a, asymmetric loss functions can also be of importance. For instance, l(P, a) = 1(ν(P) < a), which penalizes only overestimation and by the same amount, arises naturally in connection with confidence bounds, as discussed later in this section.

If ν = (ν1, ..., νd) = (q1(θ), ..., qd(θ)) and a = (a1, ..., ad) are vectors, examples of loss functions are

l(θ, a) = (1/d) Σ_{j=1}^d (a_j − ν_j)²  (squared Euclidean distance divided by d),
l(θ, a) = (1/d) Σ_{j=1}^d |a_j − ν_j|  (absolute distance divided by d),
l(θ, a) = max{|a_j − ν_j| : 1 ≤ j ≤ d}  (supremum distance).

We can also consider function-valued parameters. For instance, in the prediction example, μ(·) is the parameter of interest. If we use a(·) as a predictor and the new z has marginal distribution Q, then it is natural to consider the expected squared error if a is used,

l(P, a) = ∫ (μ(z) − a(z))² dQ(z).

If, say, Q is the empirical distribution of the z_j in the training set (z1, Y1), ..., (zn, Yn), this leads to the commonly considered

l(P, a) = (1/n) Σ_{j=1}^n (μ(z_j) − a(z_j))²,

which is just n^(−1) times the squared Euclidean distance between the prediction vector (a(z1), ..., a(zn))^T and the vector parameter (μ(z1), ..., μ(zn))^T.

Testing. We ask whether the parameter θ is in the subset Θ0 or the subset Θ1 of Θ, where {Θ0, Θ1} is a partition of Θ (or equivalently whether P ∈ P0 or P ∈ P1). If we take action a when the parameter is in Θ_a, we have made the correct decision and the loss is zero; otherwise, the decision is wrong and the loss is taken to equal one. This 0-1 loss function can be written as

l(θ, a) = 0 if θ ∈ Θ_a (the decision is correct),
l(θ, a) = 1 otherwise (the decision is wrong).   (1.3.1)

Of course, other economic loss functions may be appropriate. For instance, in Example 1.1.1, suppose returning a shipment with θ < θ0 defectives results in a penalty of s dollars, whereas every defective item sold results in an r dollar replacement cost. Then, with action 1 denoting return of the shipment and action 0 denoting keeping it, the appropriate loss function is

l(θ, 1) = s if θ < θ0, and 0 if θ ≥ θ0;
l(θ, 0) = rNθ.

Decision procedures. We next give a representation of the process whereby the statistician uses the data to arrive at a decision. The data is a point X = x in the outcome or sample space X. We define a decision rule or procedure δ to be any function from the sample space taking its values in A. Using δ means that if X = x is observed, the statistician takes action δ(x). For the problem of estimating the constant μ in the measurement model, we implicitly discussed two estimates or decision rules: δ1(x) = the sample mean x̄ and δ2(x) = the sample median.

For instance, in Example 1.1.3, with X and Y distributed as N(μ + Δ, σ²) and N(μ, σ²), respectively, if we are asking whether the treatment effect parameter Δ is 0 or not, then a reasonable rule is to decide Δ = 0 if our estimate x̄ − ȳ is close to zero, and to decide Δ ≠ 0 if our estimate is not close to zero. Here we mean close to zero relative to the variability in the experiment, that is, relative to the standard deviation σ. In Section 4.3 we will show how to obtain an estimate σ̂ of σ from the data. The decision rule can now be written

δ(x, y) = 0 if |x̄ − ȳ|/σ̂ ≤ c, and 1 if |x̄ − ȳ|/σ̂ > c,   (1.3.2)

where c is a positive constant called the critical value.
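The rule (1.3.2) is a one-line function of the data. Here is a sketch (Python; the data values, the estimate σ̂, and the critical value c are invented for illustration):

```python
def treatment_effect_test(x, y, sigma_hat, c):
    """Decision rule (1.3.2): return 1 (decide Delta != 0) when the
    standardized difference of sample means exceeds the critical value c."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    return 1 if abs(xbar - ybar) / sigma_hat > c else 0

# Invented data: the first pair of samples shows no visible treatment
# effect, the second pair shows a large one.
keep_null = treatment_effect_test([5.1, 4.9, 5.0], [5.0, 5.2, 4.8],
                                  sigma_hat=1.0, c=2.0)    # -> 0
reject_null = treatment_effect_test([9.0, 9.2, 8.8], [5.0, 5.2, 4.8],
                                    sigma_hat=1.0, c=2.0)  # -> 1
```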
How do we choose the critical value c? We need the next concept of the decision theoretic framework, the risk or risk function.

The risk function. If d is the procedure used, l is the loss function, P is the true distribution, and X = x is the outcome of the experiment, then the loss is l(P, d(x)). We do not know the value of the loss because P is unknown. Moreover, we typically want procedures to have good properties not at just one particular x, but for a range of plausible x's. Thus, we regard l(P, δ(X)) as a random variable and introduce the risk function

R(P, δ) = E_P[l(P, δ(X))]

as the measure of the performance of the decision rule δ(x). That is, R maps P (or Θ) to R+. R(·, d) is our a priori measure of the performance of d. We illustrate the computation of R and its a priori use in some examples.

Estimation. Suppose ν = ν(P) is the real parameter we wish to estimate and ν̂(X) is our estimator (our decision rule). If we use quadratic loss, our risk function is called the mean squared error (MSE) of ν̂ and is given by

MSE(ν̂) = R(P, ν̂) = E_P(ν̂(X) − ν(P))²,   (1.3.3)

where for simplicity dependence on P is suppressed in MSE. The MSE depends on the variance of ν̂ and on what is called the bias of ν̂,

Bias(ν̂) = E(ν̂) − ν,

which can be thought of as the "long-run average error" of ν̂. A useful result is

Proposition 1.3.1. MSE(ν̂) = (Bias ν̂)² + Var(ν̂).

Proof. Write the error as ν̂ − ν = [ν̂ − E(ν̂)] + [E(ν̂) − ν]. If we expand the square of the right-hand side keeping the brackets intact and take the expected value, the cross term will be zero because E[ν̂ − E(ν̂)] = 0. The other two terms are (Bias ν̂)² and Var(ν̂). (If one side is infinite, so is the other and the result is trivially true.)

Estimation of μ. Suppose X1, ..., Xn are i.i.d. measurements of μ with N(0, σ²) errors, and assume quadratic loss. If we use the mean X̄ as our estimate of μ, then

Bias(X̄) = 0,  Var(X̄) = (1/n²) Σ_{i=1}^n Var(X_i) = σ²/n.
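Proposition 1.3.1 and these formulas can be checked by simulation. The following sketch (Python; the sample size, number of replications, and seed are arbitrary choices of ours) estimates the MSE, bias, and variance of X̄ under Gaussian errors.

```python
import random

random.seed(0)
mu, sigma, n, reps = 1.0, 2.0, 25, 20000

def mc_mse_bias_var(estimator):
    """Monte Carlo estimates of the MSE, bias, and variance of an
    estimator of mu under the N(mu, sigma^2) measurement model."""
    vals = [estimator([random.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]
    m = sum(vals) / reps
    bias = m - mu
    var = sum((v - m) ** 2 for v in vals) / reps
    mse = sum((v - mu) ** 2 for v in vals) / reps
    return mse, bias, var

mse, bias, var = mc_mse_bias_var(lambda x: sum(x) / len(x))   # X-bar
# MSE = Bias^2 + Var holds exactly for the simulated values, and the
# estimated MSE should be close to sigma^2 / n = 0.16.
```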
Thus, by Proposition 1.3.1,

MSE(X̄) = R(μ, X̄) = σ²/n,   (1.3.4)

which doesn't depend on μ. Suppose that the precision of the measuring instrument, σ², is known and equal to σ0², or realistically that it is known that σ² ≤ σ0². Then (1.3.4) can be used for an a priori estimate of the risk of X̄: if we want to be guaranteed MSE(X̄) ≤ ε², we can do it by taking at least n0 = σ0²/ε² measurements. If we only assume, as we discussed in Example 1.1.2, that the ε_i are i.i.d. with mean 0 and variance σ²(P), then R(P, X̄) = σ²(P)/n still, but a priori planning is not possible. Having taken n measurements, however, we can then estimate σ², for instance by s² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄)²; the a posteriori estimate of risk s²/n is, of course, itself subject to random error.

Suppose that instead of quadratic loss we used the more natural(1) absolute value loss. Then, if the ε_i are N(0, σ²),

R(μ, X̄) = E|X̄ − μ| = (σ/√n) ∫ |t| φ(t) dt = σ √(2/(πn)),   (1.3.5)

which can again be used for a priori planning, though now only approximately if the ε_i are merely i.i.d. with variance σ². In fact, computational difficulties arise even with quadratic loss as soon as we think of estimates other than X̄. If X̂ = median(X1, ..., Xn), where median{a1, ..., an} denotes any median of a1, ..., an, then R(μ, X̂) = E|X̂ − μ| = E|median(ε1, ..., εn)| can only be evaluated numerically, by Monte Carlo, or approximated asymptotically. This harder calculation already suggests why quadratic loss is really favored.

We next illustrate the computation and the a priori and a posteriori use of the risk function in another example. Let μ0 denote the mean of a certain measurement, say age or income, included in the U.S. census. Next suppose we are interested in the mean μ of the same measurement for a certain area, say area A, of the United States. If we have no data for area A, a natural guess for μ would be μ0, whereas if we have a random sample of measurements X1, ..., Xn from area A, we may want to combine μ0 and X̄ = n^(−1) Σ_{i=1}^n X_i into an estimator, for instance,

μ̂ = (0.2)μ0 + (0.8)X̄.

The choice of the weights 0.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy. We shall derive such weights later through a formal Bayesian analysis using a normal prior, to illustrate a way of bringing in additional knowledge.
Here we compare the performances of μ̂ and X̄ as estimators of μ using MSE. We easily find

Bias(μ̂) = 0.2μ0 + 0.8μ − μ = 0.2(μ0 − μ),
Var(μ̂) = (0.8)² Var(X̄) = 0.64σ²/n,
MSE(μ̂) = 0.04(μ0 − μ)² + 0.64σ²/n.

Figure 1.3.1 gives the graphs of MSE(μ̂) and MSE(X̄) as functions of μ. The two MSE curves cross at μ = μ0 ± 3σ/√n. If μ is close to μ0, the risk R(μ, μ̂) of μ̂ is smaller than the risk R(μ, X̄) = σ²/n of X̄, with the minimum relative risk inf{MSE(μ̂)/MSE(X̄) : μ ∈ R} being 0.64, attained when μ = μ0. Because we do not know the value of μ, neither estimator can be proclaimed as being better than the other using MSE. However, if we use as our criterion the maximum (over μ) of the MSE (the minimax criterion), then X̄ is optimal, as will be seen in Chapter 3.

[Figure 1.3.1. The mean squared errors of X̄ and μ̂ as functions of μ; the curves cross at μ = μ0 ± 3σ/√n.]

Testing. The test rule (1.3.2) for deciding between Δ = 0 and Δ ≠ 0 can only take on the two values 0 and 1, so the risk is

R(Δ, δ) = l(Δ, 1)P[δ(X, Y) = 1] + l(Δ, 0)P[δ(X, Y) = 0],

which in the case of 0-1 loss is

R(Δ, δ) = P[δ(X, Y) = 1] if Δ = 0,  and  P[δ(X, Y) = 0] if Δ ≠ 0.   (1.3.6)

In the general case, X and Θ denote the outcome and parameter space, and we are to decide whether θ ∈ Θ0 or θ ∈ Θ1, where Θ = Θ0 ∪ Θ1 and Θ0 ∩ Θ1 = ∅.
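The MSE comparison of μ̂ and X̄ can be verified directly from the formulas above. A sketch (Python; the particular values of σ, n, and μ0 are arbitrary illustrative choices):

```python
from math import sqrt

sigma, n, mu0 = 1.0, 25, 0.0   # arbitrary illustrative values

def mse_xbar():
    """MSE of the sample mean, from (1.3.4): sigma^2 / n for every mu."""
    return sigma ** 2 / n

def mse_mu_hat(mu):
    """MSE of mu_hat = 0.2*mu0 + 0.8*X-bar:
    Bias = 0.2(mu0 - mu), Var = 0.64 sigma^2 / n."""
    return 0.04 * (mu0 - mu) ** 2 + 0.64 * sigma ** 2 / n

# The curves cross where 0.04(mu0 - mu)^2 = 0.36 sigma^2/n,
# i.e. at mu = mu0 +/- 3 sigma / sqrt(n).
crossing = mu0 + 3 * sigma / sqrt(n)
```

At the crossing point the two MSEs agree, and at μ = μ0 the relative risk of μ̂ is exactly 0.64.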
A test function is a decision rule δ(X) that equals 1 on a set C ⊂ X called the critical region and equals 0 on the complement of C; that is, δ(X) = 1[X ∈ C], where 1 denotes the indicator function. If δ(X) = 1 and we decide θ ∈ Θ1 when in fact θ ∈ Θ0, we call the error committed a Type I error, whereas if δ(X) = 0 and we decide θ ∈ Θ0 when in fact θ ∈ Θ1, we call the error a Type II error. The risk of δ(X) is

R(θ, δ) = E(δ(X)) = P(δ(X) = 1) = probability of Type I error, if θ ∈ Θ0,
R(θ, δ) = P(δ(X) = 0) = probability of Type II error, if θ ∈ Θ1.   (1.3.7)

Finding good test functions corresponds to finding critical regions with small probabilities of error. In the Neyman-Pearson framework of statistical hypothesis testing, the focus is on first providing a small bound, say .05 or .01, on the probability of a Type I error, and then trying to minimize the probability of a Type II error. For instance, in the treatments A and B example, we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding Δ ≠ 0 when Δ = 0), and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Δ = 0 when Δ ≠ 0). This is not the only approach to testing. For instance, the loss function of Example 1.1.1 and tests δk of the form "Reject the shipment if and only if X > k" lead to (Problem 1.3.18) the risk

R(θ, δk) = sPθ[X > k] + rNθPθ[X ≤ k] if θ < θ0,
R(θ, δk) = rNθPθ[X ≤ k] if θ ≥ θ0.

Confidence Bounds and Intervals

Decision theory enables us to think clearly about an important hybrid of testing and estimation: confidence bounds and intervals (and, more generally, regions). Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter ν. For instance, an accounting firm examining accounts receivable for a firm on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. If (say) X represents the amount owed in the sample and ν is the unknown total amount owed, it is natural to seek ν̄(X) such that

P[ν̄(X) ≥ ν] ≥ 1 − α   (1.3.8)

for all possible distributions P of X. Here α is small, usually .05 or .01 or less. Such a ν̄ is called a (1 − α) upper confidence bound on ν. This corresponds to an a priori bound of α on the risk of ν̄(X) viewed as a decision procedure with action space R and loss function

l(P, a) = 1 if a < ν(P), and 0 if a ≥ ν(P),

for all P ∈ P.
The 0-1 nature of this loss function makes it resemble a testing loss function and, as we shall see in Chapter 4, the connection between confidence procedures and tests is close. It is clear, though, that this formulation is inadequate, because by taking ν̄ ≡ ∞ we can achieve risk 0. What is missing is the fact that, though upper bounding is the primary goal, in fact it is important to get close to the truth; knowing that at most ∞ dollars are owed is of no use. The decision theoretic framework accommodates this by adding a component to the loss function reflecting how far the bound is above ν(P). For instance, for some constant c > 0,

l(P, a) = 1(a < ν(P)) + c(a − ν(P))+,

where x+ = x1(x > 0), an asymmetric estimation type loss function. Typically, rather than this Lagrangian form, it is customary to first fix α in (1.3.8) and then see what one can do to control (say) R(P, ν̄) = E(ν̄(X) − ν(P))+. The same issues arise when we are interested in a confidence interval [ν̲(X), ν̄(X)] for ν, defined by the requirement that

P[ν̲(X) ≤ ν(P) ≤ ν̄(X)] ≥ 1 − α for all P ∈ P.

We shall go into this further in Chapter 4. We next turn to the final topic of this section, general criteria for selecting "optimal" procedures.

1.3.2 Comparison of Decision Procedures

In this section we introduce a variety of concepts used in the comparison of decision procedures. We shall illustrate some of the relationships between these ideas using the following simple example, in which Θ has two members, A has three points, and the risk of all possible decision procedures can be computed and plotted. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model.

Example 1.3.5. Suppose we have two possible states of nature, which we represent by θ1 and θ2. For instance, a certain location either contains oil or does not, a patient either has a certain disease or does not, a component in a piece of equipment either works or does not work, and so on. Suppose that three possible actions, a1, a2, and a3, are available. In the context of the foregoing examples, we could drill for oil, sell the location, or sell partial rights to it; we could operate, administer drugs, or wait and see; we could leave the component in, replace it, or repair it, and so on. Suppose the following loss function is decided on:

TABLE 1.3.1. The loss function l(θ, a)

                 (Drill) a1   (Sell) a2   (Partial rights) a3
(Oil) θ1             0           10              5
(No oil) θ2         12            1              6

Thus, if there is oil and we drill, the loss is zero, whereas if there is no oil and we drill, the loss is 12, and so on. Next, an experiment is conducted to obtain information about θ, resulting in the random variable X with possible values coded as 0, 1, and frequency function p(x, θ) given by the following table:

TABLE 1.3.2. The frequency function p(x, θ)

                 Rock formation x
                   0       1
(Oil) θ1          0.3     0.7
(No oil) θ2       0.6     0.4

Here X may represent a certain geological formation: when there is oil, it is known that formations 0 and 1 occur with frequencies 0.3 and 0.7, whereas when there is no oil, formations 0 and 1 occur with frequencies 0.6 and 0.4.

We list all possible decision rules in the following table:

TABLE 1.3.3. Possible decision rules δi(x)

    i      1    2    3    4    5    6    7    8    9
x = 0:    a1   a1   a1   a2   a2   a2   a3   a3   a3
x = 1:    a1   a2   a3   a1   a2   a3   a1   a2   a3

Here δ1 represents "Take action a1 regardless of the value of X," δ2 corresponds to "Take action a1 if X = 0, take action a2 if X = 1," and so on. The risk of δ at θ is

R(θ, δ) = E[l(θ, δ(X))] = l(θ, a1)P[δ(X) = a1] + l(θ, a2)P[δ(X) = a2] + l(θ, a3)P[δ(X) = a3].

For instance,

R(θ1, δ2) = 0(0.3) + 10(0.7) = 7,   R(θ2, δ2) = 12(0.6) + 1(0.4) = 7.6.

The risk points (R(θ1, δi), R(θ2, δi)) are given in Table 1.3.4 and graphed in Figure 1.3.2.

TABLE 1.3.4. Risk points (R(θ1, δi), R(θ2, δi))

    i           1     2     3     4     5     6     7     8     9
R(θ1, δi)       0     7    3.5    3    10    6.5   1.5   8.5    5
R(θ2, δi)      12    7.6   9.6   5.4    1     3    8.4    4     6

[Figure 1.3.2. The risk points (R(θ1, δi), R(θ2, δi)), i = 1, ..., 9.]

It remains to pick out the rules that are "good" or "best." Criteria for doing this will be introduced in the next subsection.

1.3.3 Bayes and Minimax Criteria

The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing. We say that a procedure δ improves a procedure δ′ if, and only if, R(θ, δ) ≤ R(θ, δ′) for all θ, with strict inequality for some θ. It is easy to see that there is typically no rule δ that improves all others. Consider, for instance, δ4 and δ6 in our example: R(θ1, δ4) < R(θ1, δ6) but R(θ2, δ4) > R(θ2, δ6), so neither improves the other. Usually, if δ and δ′ are two rules, neither improves the other. As another instance, consider estimating θ ∈ R when X ~ N(θ, σ²): if we ignore the data and use the estimate θ̂ = 0, we obtain MSE(θ̂) = θ². The absurd rule "δ*(X) = 0" cannot be improved on at the value θ = 0, because E0(δ(X))² = 0 if and only if δ(X) = 0. The problem of selecting good decision procedures has been attacked in a variety of ways.

(1) Narrow classes of procedures have been proposed using criteria such as considerations of symmetry, unbiasedness (for estimates and tests), or level of significance (for tests). Researchers have then sought procedures that improve all others within the class. Extensions of unbiasedness ideas may be found in Lehmann (1997). Symmetry (or invariance) restrictions are discussed in Ferguson (1967). We shall pursue this approach further in Chapter 3.

(2) A second major approach has been to compare risk functions by global criteria rather than on a pointwise basis. We shall discuss the Bayes and minimax criteria.

Bayes: The Bayesian point of view leads to a natural global criterion. If we adopt the Bayesian point of view, we treat the parameter as a random variable θ with prior density or frequency function π. In this framework R(θ, δ) is just E[l(θ, δ(X)) | θ = θ], the expected loss if we use δ and θ = θ. We need not stop at this point, but can proceed to calculate what we expect to lose on the average as θ varies. This quantity, which we shall call the Bayes risk of δ and denote r(δ), is then

r(δ) = E[R(θ, δ)] = E[l(θ, δ(X))].   (1.3.9)

The second identity is a consequence of the double expectation theorem (B.1.20) in Appendix B. If θ is discrete with frequency function π(θ), then

r(δ) = Σ_θ R(θ, δ)π(θ),

and if θ is continuous with density π(θ), then

r(δ) = ∫ R(θ, δ)π(θ) dθ.

In the Bayesian framework δ is preferable to δ′ if, and only if, it has smaller Bayes risk. If there is a rule δ* that attains the minimum Bayes risk, that is,

r(δ*) = min_δ r(δ),

then it is called a Bayes rule. To illustrate, suppose that in the oil drilling example an expert thinks the chance of finding oil is 0.2. Then we treat the parameter as a random variable θ with possible values θ1, θ2 and frequency function π(θ1) = 0.2, π(θ2) = 0.8. The Bayes risk of δ is, therefore,

r(δ) = 0.2R(θ1, δ) + 0.8R(θ2, δ).   (1.3.10)

Table 1.3.5 gives r(δ1), ..., r(δ9) specified by (1.3.10), together with the maximum risks max(R(θ1, δi), R(θ2, δi)).

TABLE 1.3.5. Bayes and maximum risks of the procedures of Table 1.3.3

    i          1      2      3      4      5     6      7      8     9
r(δi)         9.6   7.48   8.38   4.92   2.8   3.7   7.02   4.9   5.8
max risk      12    7.6    9.6    5.4    10    6.5    8.4    8.5    6

From Table 1.3.5 we see that rule δ5 is the unique Bayes rule for our prior: it has the smallest Bayes risk, r(δ5) = 2.8. Note that the Bayes approach leads us to compare procedures on the basis of their Bayes risks. The method of computing Bayes procedures by listing all available δ and their Bayes risks is impracticable in general; we postpone the consideration of posterior analysis, the only reasonable computational method, to Section 3.2.
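The numbers in Tables 1.3.4 and 1.3.5 can be reproduced mechanically. Here is a sketch (Python; the dictionary keys are our own labels) that computes the risk points, the Bayes risks for the prior π(θ1) = 0.2, and the maximum risks, recovering δ5 as the Bayes rule and δ4 as the rule with smallest maximum risk.

```python
from itertools import product

# Loss l(theta, a) from Table 1.3.1 and frequencies p(x | theta) from Table 1.3.2.
loss = {"oil": {"a1": 0, "a2": 10, "a3": 5},
        "dry": {"a1": 12, "a2": 1, "a3": 6}}
p = {"oil": [0.3, 0.7], "dry": [0.6, 0.4]}

def risk(theta, rule):
    """R(theta, delta) = sum over x of l(theta, delta(x)) * p(x | theta)."""
    return sum(loss[theta][rule[x]] * p[theta][x] for x in (0, 1))

# The nine nonrandomized rules delta_1, ..., delta_9 as pairs (delta(0), delta(1)).
rules = list(product(["a1", "a2", "a3"], repeat=2))

risk_points = [(risk("oil", d), risk("dry", d)) for d in rules]   # Table 1.3.4
bayes_risks = [0.2 * r1 + 0.8 * r2 for r1, r2 in risk_points]     # (1.3.10)
max_risks = [max(r1, r2) for r1, r2 in risk_points]

bayes_rule = 1 + min(range(9), key=bayes_risks.__getitem__)       # delta_5
minimax_rule = 1 + min(range(9), key=max_risks.__getitem__)       # delta_4
```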
I Randomized decision rules: In general. in Example 1. but only ac.20 if 0 = O . who picks a decision procedure point 8 E J from V. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. For instance. The criterion comes from the general theory of twoperson zero sum games of von Neumann. For simplicity we shall discuss only randomized . Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. J) A procedure 0*. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. J)) we see that J 4 is minimax with a maximum risk of 5. From the listing ofmax(R(O" J). To illustrate computation of the minimax rule we tum to Table 1. = infsupR(O.5 we might feel that both values ofthe risk were equally important. which makes the risk as large as possible. 2 The maximum risk 4.28 Statistical Models. But this is just Bayes comparison where 1f places equal probability on f}r and ()2.75 if 0 = 0. if V is the class of all decision procedures (nonrandomized).4. the sel of all decision procedures. e 4. Nature's choosing a 8. 8)].3. Our expected risk would be. sup R(O. .3. Of course. R(02.3. we prefer 0" to 0"'.4. J'). The maximum risk of 0* is the upper pure value of the game. This criterion of optimality is very conservative. The principle would be compelling. Player II then pays Player I. a weight function for averaging the values of the function R( B.(2) We briefly indicate "the game of decision theory. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used." Nature (Player I) picks a independently of the statistician (Player II). Goals. and Performance Criteria Chapter 1 if (J is continuous with density rr(8). This is. < sup R(O. 4. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. 
Nevertheless l in many cases the principle can lead to very reasonable procedures.J). if and only if. suppose that. For instance. we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. 6). is called minimax (minimizes the maximum risk). R(O.5. . J). Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. in Example 1. It aims to give maximum protection against the worst that can happen.J') . which has . .75 is strictly less than that of 04. supR(O.
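To make the minimax comparison, and the gain from randomizing between two rules, concrete, here is a small numerical sketch. The risk values below are hypothetical stand-ins, not the entries of the text's table.

```python
# Minimax comparison over a finite set of rules, plus a fair-coin mixture.
# Risk values are hypothetical illustrations, not the text's table.
risks = {
    "d4": {"theta1": 3.0, "theta2": 5.4},
    "d6": {"theta1": 6.0, "theta2": 3.5},
}

def max_risk(r):
    return max(r.values())

# Nonrandomized minimax: the rule with the smallest maximum risk.
minimax_rule = min(risks, key=lambda d: max_risk(risks[d]))

# A fair-coin mixture of d4 and d6 has risk equal to the average of their risks,
# and its maximum risk is smaller than that of either pure rule here.
mixed = {t: 0.5 * risks["d4"][t] + 0.5 * risks["d6"][t] for t in ("theta1", "theta2")}
```

With these numbers, d4 is the nonrandomized minimax rule (maximum risk 5.4), while the mixture's maximum risk drops to 4.5 — the phenomenon described above.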
We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1.3.5; we will then indicate how much of what we learn carries over to the general case. For simplicity we shall discuss only randomized procedures that select among a finite set δ1, ..., δq of nonrandomized procedures. If the randomized procedure δ selects δi with probability λi, λi ≥ 0, Σ_{i=1}^q λi = 1, we define

R(θ, δ) = Σ_{i=1}^q λi R(θ, δi).  (1.3.11)

As in Example 1.3.5, we represent the risk of any procedure δ by the vector (R(θ1, δ), R(θ2, δ)) and consider the risk set

S = {(R(θ1, δ), R(θ2, δ)) : δ ∈ D*},

where D* is the set of all procedures, including randomized ones. By (1.3.11), S is the convex hull of the risk points (R(θ1, δi), R(θ2, δi)), i = 1, ..., q; that is,

S = {(r1, r2) : r1 = Σ λi R(θ1, δi), r2 = Σ λi R(θ2, δi), λi ≥ 0, Σ λi = 1}.

Similarly, given a prior π on Θ with π(θ1) = γ, we can define the Bayes risk of δ,

r(δ) = Σ_{i=1}^q λi [γ R(θ1, δi) + (1 − γ) R(θ2, δi)].  (1.3.12)

A randomized Bayes procedure minimizes r(δ) among all randomized procedures; a randomized minimax procedure minimizes max_θ R(θ, δ) among all randomized procedures.

For 0 < γ < 1, all rules having Bayes risk c correspond to points in S that lie on the line

γ r1 + (1 − γ) r2 = c.  (1.3.13)

As c varies, (1.3.13) defines a family of parallel lines with slope −γ/(1 − γ). Finding the Bayes rule corresponds to finding the smallest c for which the line (1.3.13) intersects S; this is the line with slope −γ/(1 − γ) that is tangent to S at the lower boundary of S. All points of S that are on the tangent are Bayes. Two cases arise: (1) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule, which is then the risk point of the Bayes rule. (2) The tangent is the line connecting two "nonrandomized" risk points, those of δi and δj, say. A point (r1, r2) on this line can be written

(λ R(θ1, δi) + (1 − λ) R(θ1, δj), λ R(θ2, δi) + (1 − λ) R(θ2, δj)), 0 ≤ λ ≤ 1,  (1.3.14)

and is, by (1.3.11), the risk point of the rule that selects δi with probability λ and δj with probability 1 − λ.
[Figure 1.3.2. The convex hull S of the risk points (R(θ1, δi), R(θ2, δi)), together with the family of squares Q(c); Q(c*) is the first square to touch S.]

In our example, each of the rules in (1.3.14), that is, δi with probability λ and δj with probability 1 − λ, 0 ≤ λ ≤ 1,  (1.3.15)
is Bayes against π. In particular, we can choose two nonrandomized Bayes rules from this class, namely δi (take λ = 1) and δj (take λ = 0). Because changing the prior π corresponds to changing the slope −γ/(1 − γ) of the line given by (1.3.13) as γ ranges from 0 to 1, the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (i.e., all points on the lower boundary of S that have as tangents the y-axis or lines with nonpositive slopes).

To locate the risk point of the minimax rule, consider the family of squares

Q(c) = {(r1, r2) : 0 ≤ r1 ≤ c, 0 ≤ r2 ≤ c},  (1.3.16)

whose diagonal lies on the line r1 = r2. Let c* be the smallest c for which Q(c) ∩ S ≠ ∅ (i.e., the first square that touches S). Then Q(c*) ∩ S is either a point or a horizontal or vertical line segment. The point (or segment) where the square Q(c*) touches S is the set of risk points of minimax rules, because any point with smaller maximum risk would belong to Q(c) ∩ S with c < c*, contradicting the choice of c*. See Figure 1.3.2.
There is another important concept that we want to discuss in the context of the risk set: admissibility. A decision rule δ is said to be inadmissible if there exists another rule δ' such that δ' improves δ; all rules that are not inadmissible are called admissible. Using the table of risks of Example 1.3.5 we can see, for instance, that δ2 is inadmissible because δ4 improves it (i.e., R(θ1, δ4) = 3 < 7 = R(θ1, δ2) and R(θ2, δ4) = 5.4 < 7.6 = R(θ2, δ2)).

To gain some insight into the class of all admissible procedures (randomized and nonrandomized) we again use the risk set. A rule δ with risk point (r1, r2) is admissible if, and only if, there is no (x, y) in S, other than (r1, r2) itself, such that x ≤ r1 and y ≤ r2; or equivalently, if, and only if, the set {(x, y) : x ≤ r1, y ≤ r2} has only (r1, r2) in common with S. From the figure it is clear that such points must be on the lower left boundary. Thus, the set of all lower left boundary points of S corresponds to the class of admissible rules and, thus, agrees with the set B of risk points of Bayes procedures.

The geometry also yields the minimax rule: its risk point is the intersection between the line r1 = r2 and the line connecting the two points corresponding to δ4 and δ6. Thus, the minimax rule is given by (1.3.14) with i = 4, j = 6, and λ the solution of

λ R(θ1, δ4) + (1 − λ) R(θ1, δ6) = λ R(θ2, δ4) + (1 − λ) R(θ2, δ6),

which, with the risks of the table, yields λ ≈ 0.59.

If Θ is finite, Θ = {θ1, ..., θk}, we can define the risk set in general as

S = {(R(θ1, δ), ..., R(θk, δ)) : δ ∈ D*},

where D* is the set of all randomized decision procedures. The following features exhibited by the risk set of Example 1.3.5 can be shown to hold generally (see Ferguson, 1967, for instance):

(a) For any prior, if there is a randomized Bayes procedure, there is always a nonrandomized Bayes procedure. Naturally, randomized Bayes procedures are mixtures of nonrandomized ones in the sense of (1.3.14).
(b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant.
(c) If Θ is finite(4) and minimax procedures exist, they are Bayes procedures.
(d) Under some conditions, all admissible procedures are Bayes procedures. If Θ is not finite, there are typically admissible procedures that are not Bayes; in fact, under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures.
(e) If a Bayes prior has π(θi) > 0 for all i, then any Bayes procedure corresponding to π is admissible.
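The two computations just described — screening risk points for inadmissibility, and finding the mixing weight λ that equalizes the two risks of a mixture — can be sketched as follows. The risk points are hypothetical, not those of the text's table.

```python
# Hypothetical risk points (R(theta1, d), R(theta2, d)) for four rules.
points = {"d1": (7.0, 7.6), "d2": (3.0, 5.4), "d3": (6.6, 3.5), "d4": (5.0, 5.0)}

def improves(p, q):
    # p improves q: no worse in both coordinates and different (hence better in one).
    return p[0] <= q[0] and p[1] <= q[1] and p != q

inadmissible = {d for d, q in points.items()
                if any(improves(p, q) for p in points.values())}

# Mixture lam*d2 + (1-lam)*d3 has equal risks under theta1 and theta2 when
#   lam*3.0 + (1-lam)*6.6 = lam*5.4 + (1-lam)*3.5,
# i.e. the mixture's risk point lies on the diagonal r1 = r2.
lam = (6.6 - 3.5) / ((6.6 - 3.5) + (5.4 - 3.0))
r_equal = lam * 3.0 + (1 - lam) * 6.6
```

Here only d1 is inadmissible (d2, d3, d4 are Pareto-optimal in the risk set), and the equalizing weight works out to λ = 3.1/5.5 ≈ 0.56 for these made-up risks.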
These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility. We stress that looking at randomized procedures is essential for these conclusions, although it usually turns out that all admissible procedures of interest are indeed nonrandomized. Other theorems are available characterizing larger but more manageable classes of procedures that include the admissible rules, at least when procedures with the same risk function are identified. An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Section 1.5). For more information on these topics we refer to Blackwell and Girshick (1954) and Ferguson (1967).

Summary. We introduce the decision theoretic foundation of statistics, including the notions of action space, decision rule, loss function, and risk, through various examples involving estimation, testing, confidence bounds, ranking, and prediction. The basic bias-variance decomposition of mean squared error is presented. The basic global comparison criteria, Bayes and minimax, are presented, as well as a discussion of optimality by restriction and notions of admissibility.

1.4 PREDICTION

The prediction example of Section 1.2 presented important situations in which a vector z of covariates can be used to predict an unseen response Y. Here are some further examples of the kind of situation that prompts our study in this section. A college admissions officer has available the College Board scores at entrance and first-year grade point averages of freshman classes for a period of several years. Using this information, he wants to predict the first-year grade point averages of entering freshmen on the basis of their College Board scores. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio. A government expert wants to predict the amount of heating oil needed next winter. A meteorologist wants to estimate the amount of rainfall in the coming spring. Similar problems abound in every field.

The frame we shall fit them into is the following. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. Z is the information that we have and Y the quantity to be predicted. For example, in the college admissions situation, Z would be the College Board score of an entering freshman and Y his or her first-year grade point average. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. We want to find a function g defined on the range of Z such that g(Z) (the predictor) is "close" to Y. Next we must specify what "close" means. One reasonable measure of distance is (g(Z) − Y)², the squared prediction error when g(Z) is used to predict Y. Because Y is not known, we turn to the mean squared prediction error (MSPE)

Δ²(Y, g(Z)) = E[g(Z) − Y]²,

or its square root √(E(g(Z) − Y)²). The MSPE is the measure traditionally used in the mathematical theory of prediction.
The deeper results of that theory (see, for example, Grenander and Rosenblatt, 1957) presuppose the MSPE. The method that we employ to prove our elementary theorems does, however, generalize to measures of distance other than Δ²(Y, g(Z)), such as the mean absolute error E|g(Z) − Y| (see the problems). Just how widely applicable the notions of this section are will become apparent in Remark 1.4.5 and Section 3.2, where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.

The class G of possible predictors g may be the nonparametric class G_NP of all g : R^d → R, or it may be restricted to some subset of this class. In this section we consider G_NP and the class G_L of linear predictors of the form a + Σ_{j=1}^d bj Zj.

We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information, or equivalently, in which Z is a constant. In this situation all predictors are constant, and the best one is the number c0 that minimizes E(Y − c)² as a function of c.

Lemma 1.4.1. E(Y − c)² is either infinite for all c or is minimized uniquely by c = μ, where μ = E(Y).

Proof. Write Y − c = (Y − μ) + (μ − c). Because E(Y − μ) = 0, the cross-product term vanishes when we expand, and

E(Y − c)² = Var Y + (c − μ)².  (1.4.1)

Thus, E(Y − c)² < ∞ for some c if, and only if, EY² < ∞, and in that case (1.4.1) shows that E(Y − c)² has a unique minimum at c = μ.
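Lemma 1.4.1 says that the best constant predictor is the mean. A minimal numerical check, with a small made-up sample standing in for the distribution of Y:

```python
# By (1.4.1), E(Y - c)^2 = Var Y + (c - mu)^2, so the best constant is mu = E(Y).
ys = [0.0, 1.0, 1.0, 2.0, 3.0, 5.0]
mu = sum(ys) / len(ys)

def mse(c):
    return sum((y - c) ** 2 for y in ys) / len(ys)

# Search a grid of candidate constants; the minimizer is the sample mean.
best_c = min((round(k * 0.01, 2) for k in range(0, 601)), key=mse)
```

Any other constant pays an extra penalty of exactly (c − μ)², which is what the grid search confirms.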
Now we can solve the problem of finding the best MSPE predictor of Y given a vector Z. Let

μ(z) = E(Y | Z = z).  (1.4.2)

By the substitution theorem for conditional expectations (B.1.16),

E[(Y − g(Z))² | Z = z] = E[(Y − g(z))² | Z = z].

Because g(z) is a constant, Lemma 1.4.1, applied to the conditional distribution of Y given Z = z, assures us that

E[(Y − g(z))² | Z = z] = E[(Y − μ(z))² | Z = z] + [g(z) − μ(z)]².  (1.4.3)

If we now take expectations of both sides and employ the double expectation theorem (B.1.20), we can conclude:

Theorem 1.4.1. If Z is any random vector and Y any random variable, then either E(Y − g(Z))² = ∞ for every function g, or

E(Y − μ(Z))² ≤ E(Y − g(Z))²  (1.4.4)

for every g, with strict inequality holding unless g(Z) = μ(Z). That is, when EY² < ∞, μ(Z) is the unique best MSPE predictor. In fact,

E(Y − g(Z))² = E(Y − μ(Z))² + E(g(Z) − μ(Z))².  (1.4.5)

An important special case of (1.4.5) is obtained by taking g(z) = E(Y) for all z. Writing Var(Y | z) for the variance of the conditional distribution of Y given Z = z, that is, Var(Y | z) = E([Y − E(Y | z)]² | z), (1.4.5) then becomes, when E(Y²) < ∞,

Var Y = E(Var(Y | Z)) + Var(E(Y | Z)),  (1.4.6)

which is generally valid because if one side is infinite, so is the other.

Property (1.4.6) is linked to a notion that we now define. Two random variables U and V with E|UV| < ∞ are said to be uncorrelated if E[(V − E(V))(U − E(U))] = 0; equivalently, U and V are uncorrelated if either E{V[U − E(U)]} = 0 or E{U[V − E(V)]} = 0. If E(|Y|) < ∞ but Z and Y are otherwise arbitrary, let ε = Y − μ(Z) denote the random prediction error, so that Y = μ(Z) + ε.

Proposition 1.4.1. Suppose that E(Y²) < ∞. Then
(a) ε is uncorrelated with every function of Z;
(b) μ(Z) and ε are uncorrelated;
(c) Var Y = Var μ(Z) + Var ε.

Proof. To show (a), let h(Z) be any function of Z and recall (B.1.20). By the iterated expectation theorem,

E{h(Z)ε} = E{E[h(Z)ε | Z]} = E{h(Z) E[Y − μ(Z) | Z]} = 0,

because E[Y − μ(Z) | Z] = μ(Z) − μ(Z) = 0. Properties (b) and (c) follow from (a).

Note that Proposition 1.4.1(c) is equivalent to (1.4.6). As a consequence of (1.4.6) we can also derive the following theorem, which will prove of importance in estimation theory.

Theorem 1.4.2. Suppose that Var Y < ∞. Then

Var(E(Y | Z)) ≤ Var Y,  (1.4.7)

with strict inequality holding unless

Y = E(Y | Z),  (1.4.8)

or equivalently, unless Y is a function of Z.
Proof. (1.4.7) follows immediately from (1.4.6), because E(Var(Y | Z)) = E(Y − E(Y | Z))² ≥ 0. Equality in (1.4.7) can hold if, and only if,

E(Var(Y | Z)) = E(Y − E(Y | Z))² = 0,

and by (A.11.9) this can hold if, and only if, (1.4.8) is true.

Example 1.4.1. An assembly line operates either at full, half, or quarter capacity. Within any given month the capacity status does not change. Each day there can be 0, 1, 2, or 3 shutdowns due to mechanical failure. We want to predict the number of failures for a given day knowing the state of the assembly line for the month. The following table gives the joint frequency function p(z, y) = P(Z = z, Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day. The row sums pZ(z) (given at the end of each row) represent the frequencies with which the assembly line is in the respective capacity states; the column sums pY(y) yield the frequencies of 0, 1, 2, or 3 failures among all days.

z\y      0       1       2       3      pZ(z)
1       0.05    0.05    0.10    0.30    0.50
1/2     0.025   0.025   0.10    0.10    0.25
1/4     0.10    0.05    0.05    0.05    0.25
pY(y)   0.175   0.125   0.25    0.45    1

We find E(Y | Z = z) = Σ_{i=0}^3 i P[Y = i | Z = z] for each of the three capacity states. These fractional figures are not too meaningful as predictors of the natural-number values of Y. But this predictor is also the right one if we are trying to guess, as we reasonably might, the total number of failures in a month: if Yi represents the number of failures on day i and Z the state of the assembly line for the month, the best predictor of Σ_{i=1}^{30} Yi is E(Σ_{i=1}^{30} Yi | Z) = 30 E(Y | Z).

The MSPE of the best predictor can be calculated in two ways. The first is direct:

E(Y − E(Y | Z))² = Σ_z Σ_{y=0}^3 (y − E(Y | Z = z))² p(z, y).
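Both the conditional means E(Y | Z = z) and the MSPE of the best predictor can be read off any joint frequency table, and the decomposition (1.4.6) provides a cross-check. The sketch below uses small hypothetical entries, not the example's values.

```python
# Best predictor mu(z) = E(Y | Z = z) from a joint frequency table p(z, y),
# with the identity Var Y = E(Var(Y|Z)) + Var(E(Y|Z)) checked both ways.
# Hypothetical 2x2 table (not the text's values).
p = {0: {0: 0.3, 1: 0.1}, 1: {0: 0.2, 1: 0.4}}

pz = {z: sum(p[z].values()) for z in p}
mu = {z: sum(y * p[z][y] for y in p[z]) / pz[z] for z in p}

# Method 1: MSPE of the best predictor, computed directly.
mspe_direct = sum((y - mu[z]) ** 2 * p[z][y] for z in p for y in p[z])

# Method 2: Var Y - Var(E(Y|Z)), using (1.4.6).
ey = sum(y * p[z][y] for z in p for y in p[z])
var_y = sum((y - ey) ** 2 * p[z][y] for z in p for y in p[z])
var_mu = sum((mu[z] - ey) ** 2 * pz[z] for z in p)
mspe_decomp = var_y - var_mu
```

The two methods agree exactly, mirroring the "two ways" computation in the example.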
2.9) I . E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) . which corresponds to the best predictor of Y given Z in the bivariate normal model. Theorem B.6) we can also write Var /lO(Z) p = VarY . p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence.4. If p = 0. If (Z.y as we would expect in the case of independence. can reasonably be thought of as a measure of dependence.4. (1.885 as before.4.4. p) distribution. The Bivariate Normal Distribution. the best predictor of Y using Z is the linear function Example 1.J. the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z. Therefore. whereas its magnitude measures the degree of such dependence. = /lY + p(uyjuz)(Z ~ /lZ).4. . and Performance Criteria Chapter 1 The second way is to use (1./lz).36 Statistical Models. Suppose Y and Z are bivariate normal random variables with the same mean . is usually called the regression (line) of Y on Z. Goals. which is the MSPE of the best constant predictor. Similarly.11) I The line y = /lY + p(uy juz)(z ~ /lz).E(Y I Z p') (1. the MSPE of our predictor is given by. (1 . Regression toward the mean.LY. u~. O"~.10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z . Because of (1.p')).4. One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y. The term regression was coined by Francis Galton and is based on the following observation.L[E(Y I Z y . this quantity is just rl'. Thus. a'k. If p > O.6) writing. Y) has a N(Jlz l f. E((Y . for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y. = z)]'pz(z) 0. . the best predictor is just the constant f. E(Y .4. 2 I . " " is independent of z.E(Y I Z))' (1. 
Because E((Y − μ0(Z))² | Z = z) = σY²(1 − ρ²) is independent of z, the MSPE of the best predictor is

E(Y − μ0(Z))² = σY²(1 − ρ²).  (1.4.10)

Regression toward the mean. The term regression was coined by Francis Galton and is based on the following observation. Suppose Y and Z are bivariate normal random variables with the same mean μ, variance σ², and positive correlation ρ. In Galton's studies, Z and Y were the heights of a randomly selected father and his son from a large human population. The predicted height of the son, or the average height of sons whose fathers have the height Z, is

μ0(Z) = (1 − ρ)μ + ρZ,

which is closer to the population mean of heights μ than is the height of the father. Thus, tall fathers tend to have shorter sons: there is "regression toward the mean." This is compensated for by "progression" toward the mean among the sons of shorter fathers, and there is no paradox. The variability of the predicted value about μ should, consequently, be less than that of the actual heights, and indeed Var((1 − ρ)μ + ρZ) = ρ²σ² ≤ σ² = Var(Z).
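A quick Monte Carlo check of these two facts — Var μ0(Z) = ρ²σ² and MSPE = σ²(1 − ρ²) — under assumed values μ = 68, σ = 3, ρ = 0.5:

```python
import random

# Regression toward the mean: with equal means/variances and correlation rho,
# E(Y | Z) = (1 - rho)*mu + rho*Z, whose variance is rho^2 * sigma^2 < sigma^2.
random.seed(0)
mu, sigma, rho = 68.0, 3.0, 0.5   # assumed illustrative values
n = 200_000

sq_pred_dev = 0.0   # accumulates (prediction - mu)^2
sq_pred_err = 0.0   # accumulates (Y - prediction)^2
for _ in range(n):
    z = random.gauss(mu, sigma)
    pred = (1 - rho) * mu + rho * z
    # Conditional law of Y given Z = z is N(pred, sigma^2 * (1 - rho^2)).
    y = random.gauss(pred, sigma * (1 - rho ** 2) ** 0.5)
    sq_pred_dev += (pred - mu) ** 2
    sq_pred_err += (y - pred) ** 2

var_pred = sq_pred_dev / n   # close to rho^2 * sigma^2 = 2.25
mspe = sq_pred_err / n       # close to sigma^2 * (1 - rho^2) = 6.75
```

The predicted heights are visibly less variable than the fathers' heights (2.25 versus σ² = 9), which is the "regression" Galton observed.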
In practice, the distribution of (Z, Y) is unavailable, and the regression line is estimated on the basis of a sample (Z1, Y1), ..., (Zn, Yn) from the population. We shall see how to do this in Chapter 2.

Example 1.4.3. The Multivariate Normal Distribution. Let Z = (Z1, ..., Zd)^T be a d × 1 covariate vector with mean μZ = (μ1, ..., μd)^T and suppose that (Z^T, Y)^T has a (d + 1)-dimensional multivariate normal, N_{d+1}(μ, Σ), distribution (Section B.6), in which μ = (μZ^T, μY)^T, μY = E(Y), ΣZZ = Var(Z) is the d × d variance-covariance matrix, ΣZY = (Cov(Z1, Y), ..., Cov(Zd, Y))^T = ΣYZ^T, and σYY = Var(Y). Theorem B.6.5 states that the conditional distribution of Y given Z = z is N(μY + (z − μZ)^T β, σYY·Z), where β = ΣZZ^{-1} ΣZY and σYY·Z = σYY − ΣYZ ΣZZ^{-1} ΣZY. Thus, the best predictor E(Y | Z) of Y is the linear function

μ0(Z) = μY + (Z − μZ)^T β  (1.4.12)

with MSPE

E[Y − μ0(Z)]² = E{E([Y − μ0(Z)]² | Z)} = E(σYY·Z) = σYY − ΣYZ ΣZZ^{-1} ΣZY.

The quadratic form ΣYZ ΣZZ^{-1} ΣZY is positive except when the joint normal distribution is degenerate, so the MSPE of μ0(Z) is smaller than the MSPE of the constant predictor μY. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y. This quantity is called the multiple correlation coefficient (MCC), coefficient of determination, or population R-squared. We write

MCC = ρ²_{ZY} = 1 − E[Y − μ0(Z)]² / Var Y = Var μ0(Z) / Var Y.

When d = 1, the MCC equals the square of the usual correlation coefficient ρ = σZY / (σZ σY).
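With ΣZZ, ΣZY, and σYY in hand, β and the MSPE of μ0(Z) are a direct matrix computation. A sketch for d = 2, with hypothetical covariances (not values from the text):

```python
# Best linear predictor coefficients in the multivariate normal model:
# beta = Sigma_ZZ^{-1} Sigma_ZY, MSPE = sigma_YY - Sigma_YZ beta, d = 2.
szz = [[4.0, 1.0], [1.0, 9.0]]   # hypothetical Var(Z)
szy = [2.0, 3.0]                 # hypothetical Cov(Z, Y)
syy = 6.0                        # hypothetical Var(Y)

# Invert the 2x2 matrix by hand.
det = szz[0][0] * szz[1][1] - szz[0][1] * szz[1][0]
inv = [[szz[1][1] / det, -szz[0][1] / det],
       [-szz[1][0] / det, szz[0][0] / det]]

beta = [inv[0][0] * szy[0] + inv[0][1] * szy[1],
        inv[1][0] * szy[0] + inv[1][1] * szy[1]]
mspe = syy - (szy[0] * beta[0] + szy[1] * beta[1])
mcc = 1 - mspe / syy   # population R-squared
```

Because the quadratic form ΣYZ ΣZZ^{-1} ΣZY is positive here, the MSPE (30/7 ≈ 4.29) is strictly below Var Y = 6, and the MCC is 2/7.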
For example, let Y and Z = (Z1, Z2)^T be the heights in inches of a 10-year-old girl and her parents (Z1 = mother's height, Z2 = father's height), and suppose(1) that (Z^T, Y)^T is trivariate normal with known means and covariances. Then the strengths of association between a girl's height and those of her mother and father, respectively, are ρ²_{Z1,Y} = .335 and ρ²_{Z2,Y} = .209, while the multiple correlation coefficient using both parents is ρ²_{ZY} = .393. In words, knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.5%; the percentage reductions knowing the father's height and both parents' heights are 20.9% and 39.3%, respectively. In practice, the linear predictor μ0(Z) and its MSPE will be estimated using a sample (Z1^T, Y1)^T, ..., (Zn^T, Yn)^T. See Sections 2.1 and 2.2.

The best linear predictor. The problem of finding the best MSPE predictor is solved by Theorem 1.4.1, but two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calculate E(Y | Z), and that the best predictor may be a complicated function of Z. If we are willing to sacrifice absolute excellence, we can avoid both objections by looking for a predictor that is best within a class of simple predictors. The natural class to begin with is that of linear combinations of components of Z. We first do the one-dimensional case.

Let us call any random variable of the form a + bZ a linear predictor, and any such variable with a = 0 a zero intercept linear predictor. Suppose that E(Z²) and E(Y²) are finite and Z and Y are not constant. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by:

Theorem 1.4.3. The unique best zero intercept linear predictor is b0 Z, where

b0 = E(ZY) / E(Z²);

the unique best linear predictor is μL(Z) = a1 + b1 Z, where b1 = Cov(Z, Y) / Var Z and a1 = E(Y) − b1 E(Z).

Proof. We expand

E(Y − bZ)² = E{(Y − b0 Z) − Z(b − b0)}² = E(Y − b0 Z)² + E(Z²)(b − b0)²,

where the cross-product term vanishes because E[Z(Y − b0 Z)] = E(ZY) − b0 E(Z²) = 0 by the definition of b0. Therefore, E(Y − bZ)² is uniquely minimized by b = b0, and

E(Y − b0 Z)² = E(Y²) − [E(ZY)]² / E(Z²).  (1.4.13)
To prove the second assertion of the theorem, note that by (1.4.1), whatever be b, E(Y − a − bZ)² = E[(Y − bZ) − a]² is uniquely minimized by taking

a = E(Y − bZ) = E(Y) − b E(Z).

Substituting this value of a into E(Y − a − bZ)², we see that the b we seek minimizes

E[(Y − E(Y)) − b(Z − E(Z))]².

We can now apply the result on zero intercept linear predictors to the variables Z − E(Z) and Y − E(Y) to conclude that

b1 = E([Z − E(Z)][Y − E(Y)]) / E([Z − E(Z)]²) = Cov(Z, Y) / Var Z

is the unique minimizing value of b, and therefore a = a1 = E(Y) − b1 E(Z) and b = b1 uniquely minimize E(Y − a − bZ)².

Remark 1.4.1. Note that if E(Y | Z) is of the form a + bZ, then a = a1 and b = b1; that is, if the best predictor is linear, it must coincide with the best linear predictor. This is in accordance with our evaluation of E(Y | Z) in Example 1.4.2: in that example nothing is lost by using linear prediction. On the other hand, in Example 1.4.1 the best linear predictor and the best predictor differ (see Figure 1.4.1); there

E[Y − μL(Z)]² = 1.05 E[Y − μ(Z)]²,

so a loss of about 5% is incurred by using the best linear predictor.

Remark 1.4.2. From (1.4.13) we obtain a proof of the Cauchy-Schwarz inequality (A.11.17) in the appendix: E(Y − b0 Z)² ≥ 0 is equivalent to the Cauchy-Schwarz inequality, with equality holding if, and only if, E(Y − b0 Z)² = 0, which corresponds to Y = b0 Z. We could similarly obtain (A.11.16).

Theorem 1.4.4. Best Multivariate Linear Predictor. Our linear predictor is now of the form

μL(Z) = a + Σ_{j=1}^d bj Zj = a + Z^T b.

If EY² and (E([Z − E(Z)][Z − E(Z)]^T))^{-1} exist, then the unique best MSPE linear predictor is

μL(Z) = μY + (Z − μZ)^T β,  (1.4.14)

where β = (E([Z − E(Z)][Z − E(Z)]^T))^{-1} E([Z − E(Z)][Y − E(Y)]) = ΣZZ^{-1} ΣZY.

Proof. The risk R(a, b) = E_P[Y − (a + Z^T b)]² depends on the joint distribution P of X = (Z^T, Y)^T only through the expectation μ and covariance Σ of X. Let P0 denote the N(μ, Σ) distribution, and let R0(a, b) = E_{P0}[Y − (a + Z^T b)]².
[Figure 1.4.1. The three dots give the best predictor μ(z) of Example 1.4.1; the line represents the best linear predictor y = 1.05 + 1.45z.]

By Example 1.4.3, R0(a, b) is minimized by (1.4.14). Because P and P0 have the same μ and Σ, we have R(a, b) = R0(a, b), and R(a, b) is therefore minimized by (1.4.14) as well.

Remark 1.4.3. We could also have established Theorem 1.4.4 by extending the proof of Theorem 1.4.3 to d > 1; a third approach, using calculus, is given in Problem 1.4.19. However, our proof shows how second-moment results sometimes can be established by "connecting" them to the normal distribution.

Remark 1.4.4. Suppose the model for μ(Z) is linear; that is, μ(Z) = E(Y | Z) = α + Z^T β for unknown α ∈ R and β ∈ R^d. We want to express α and β in terms of moments of (Z, Y). Set Z0 = 1. By Proposition 1.4.1(a), Y − μ(Z) is uncorrelated with each of Z0, Z1, ..., Zd; that is,

E(Zj [Y − (α + Z^T β)]) = 0, j = 0, 1, ..., d.  (1.4.15)

In the general, not necessarily normal, case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the squared correlation between Y and the best linear predictor of Y, ρ²_{ZY} = Corr²(Y, μL(Z)). Thus, the MCC gives the strength of the linear relationship between Z and Y. See Problem 1.4.17 for an overall measure of the strength of this relationship.
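Theorem 1.4.3's coefficients, b1 = Cov(Z, Y)/Var Z and a1 = E(Y) − b1 E(Z), can be checked numerically. Here a small made-up sample stands in for the joint distribution, and we verify that perturbing (a1, b1) only increases the MSPE.

```python
# Best linear predictor a1 + b1*Z with b1 = Cov(Z, Y)/Var(Z), a1 = E(Y) - b1*E(Z).
zs = [1.0, 2.0, 3.0, 4.0]   # made-up sample standing in for the distribution
ys = [1.0, 3.0, 2.0, 5.0]
n = len(zs)

ez = sum(zs) / n
ey = sum(ys) / n
cov = sum((z - ez) * (y - ey) for z, y in zip(zs, ys)) / n
var = sum((z - ez) ** 2 for z in zs) / n
b1 = cov / var
a1 = ey - b1 * ez

def mspe(a, b):
    # Mean squared prediction error of the linear predictor a + b*Z.
    return sum((y - a - b * z) ** 2 for z, y in zip(zs, ys)) / n
```

Any perturbation of (a1, b1), in either coordinate or both, can only increase mspe — the uniqueness claim of the theorem in miniature.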
Solving (1.4.15) for α and β gives (1.4.14) (Problem 1.4.23). Thus, when the model for μ(Z) is linear, this provides a second derivation of (1.4.14).

Remark 1.4.5. Consider the Bayesian model of Section 1.2 and the Bayes risk (1.3.8) defined by r(δ) = E[l(θ, δ(X))]. If we identify θ with Y and X with Z, we see that r(δ) = MSPE for the squared error loss l(θ, δ) = (θ − δ)². Thus, the optimal MSPE predictor E(θ | X) is the Bayes procedure for squared error loss. We return to this in Section 3.2.

Remark 1.4.6. When the class G of possible predictors g, with E|g(Z)|² < ∞, forms a Hilbert space as defined in Section B.10, the results of this section can be stated in projection notation. Two predictors g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] = 0. If there is a g0 ∈ G such that

g0 = arg inf{Δ²(Y, g(Z)) : g ∈ G},

then g0(Z) is called the projection of Y on the space G of functions of Z, and we write g0(Z) = π(Y | G). Thus, μ(Z) = π(Y | G_NP) and μL(Z) = π(Y | G_L). Moreover, Y − μ(Z) is orthogonal to μ(Z) and to μL(Z), and

Δ²(Y, μL(Z)) = Δ²(Y, μ(Z)) + Δ²(μ(Z), μL(Z)),  (1.4.16)

π(Y | G_L) = π(μ(Z) | G_L).  (1.4.17)

Note that (1.4.16) is the Pythagorean identity, and (1.4.17) gives yet another derivation of (1.4.14). With these concepts, the results of this section are linked to the general Hilbert space results of Section B.10.

Summary. We consider situations in which the goal is to predict the (perhaps future) value of a random variable Y. The notion of mean squared prediction error (MSPE) is introduced, and it is shown that if we want to predict Y on the basis of information contained in a random vector Z, the optimal MSPE predictor is the conditional expected value of Y given Z. The optimal MSPE predictor in the multivariate normal distribution is presented; it is linear, and it is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear.

1.5 SUFFICIENCY

Once we have postulated a statistical model, we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation. We begin by formalizing what we mean by "a reduction of the data" X ∈ X. Recall that a statistic is any function of the observations, generically denoted by T(X) or T. The range of T is any space of objects T, usually R or R^k, but, as we have seen in Section 1.1, it can also be a set of functions. If T assigns the same value to different sample points, then by recording or taking into account only the value of T(X) we have a reduction of the data. For instance, with X = (X1, ..., Xn), the statistic T(X) = X̄ loses information about the individual Xi as soon as n > 1.
Even T(X1, ..., Xn) = (X(1), ..., X(n)), the vector of order statistics, loses information, namely the labels of the Xi. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information, in the context of a model P = {Pθ : θ ∈ Θ}.

A statistic T(X) is called sufficient for P ∈ P, or for the parameter θ, if the conditional distribution of X given T(X) = t does not involve θ. Thus, once the value of a sufficient statistic T is known, the sample X = (X1, ..., Xn) does not contain any further information about θ, or equivalently P, given that P is valid. The most trivial example of a sufficient statistic is T(X) = X, because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. We give a decision theory interpretation of sufficiency later.

Example 1.5.1. A machine produces n items in succession. Each item produced is good with probability θ and defective with probability 1 − θ, where θ is unknown. Suppose there is no dependence between the quality of the items produced, and let Xi = 1 if the ith item is good and 0 otherwise. Then X = (X1, ..., Xn) is the record of n Bernoulli trials with probability θ, and

Pθ[X1 = x1, ..., Xn = xn] = θ^t (1 − θ)^{n−t},  (1.5.1)

where each xi is 0 or 1 and t = Σ_{i=1}^n xi.

Now suppose we had sampled the manufactured items in order, recording at each stage whether the examined item was defective or not, and representing the data by the vector X with Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. The total number of defective items observed, T = Σ_{i=1}^n Xi, is a statistic that maps many different values of (X1, ..., Xn) into the same number. However, it is intuitively clear that if we are interested in the proportion θ of defective items, nothing is lost in this situation by recording and using only T. Indeed, by the discussion in Section B.1, the conditional distribution of X given T = t places equal probability on each arrangement of t ones among the n coordinates and so does not involve θ. Therefore, T is sufficient for θ.

Example 1.5.2. Suppose that the arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) θ. Let X1 be the time of arrival of the first customer and X2 the time between the arrivals of the first and second customers. Then X1 and X2 are independent and identically distributed exponential random variables with parameter θ. We prove that T = X1 + X2 is sufficient for θ.

Begin by noting that, whatever be θ, X1/(X1 + X2) and X1 + X2 are independent, and the first of these statistics has a uniform distribution on (0, 1). Therefore, the conditional distribution of X1/(X1 + X2) given X1 + X2 = t is U(0, 1), whatever be t. It follows that the conditional distribution of X1 = [X1/(X1 + X2)](X1 + X2) given X1 + X2 = t is U(0, t). Thus, given X1 + X2 = t, (X1, X2) is conditionally distributed as (X, Y), where X is uniform on (0, t) and Y = t − X. Because this conditional distribution does not depend on θ, the conditional distributions of (X1, X2) given X1 + X2 = t are the same for all θ, and we can conclude that T is sufficient.
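The sufficiency claim in Example 1.5.1 can be checked by simulation: conditionally on T = t, every arrangement of t ones should be equally likely — probability 1/C(n, t) each — no matter what θ is. A minimal sketch:

```python
import random

# Sufficiency of T = sum(X) for Bernoulli(theta): conditionally on T = t,
# every 0/1 arrangement with t ones is equally likely, whatever theta is.
random.seed(1)
n, t = 4, 2

def conditional_freqs(theta, trials=60_000):
    # Rejection sampling: keep only samples with T = t, tally arrangements.
    counts = {}
    hits = 0
    while hits < trials:
        x = tuple(1 if random.random() < theta else 0 for _ in range(n))
        if sum(x) == t:
            counts[x] = counts.get(x, 0) + 1
            hits += 1
    return {x: c / trials for x, c in counts.items()}

f_low = conditional_freqs(0.3)
f_high = conditional_freqs(0.7)
# Both should be close to uniform over the C(4, 2) = 6 arrangements.
```

The two conditional frequency tables are (up to simulation noise) identical and uniform, even though θ = 0.3 and θ = 0.7 give very different unconditional laws — exactly what "the conditional distribution does not involve θ" means.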
In a regular model, checking sufficiency directly is difficult because we need to compute the conditional distribution. Fortunately, a simple necessary and sufficient criterion for a statistic to be sufficient is available. It is often referred to as the factorization theorem for sufficient statistics. This result was proved in various forms by Fisher, Neyman, and Halmos and Savage. The complete result is established, for instance, by Lehmann (1997, Section 2.6).

Theorem 1.5.1. In a regular model, a statistic T(X) with range T is sufficient for θ if, and only if, there exists a function g(t, θ) defined for t in T and θ in Θ, and a function h defined on X, such that

  p(x, θ) = g(T(x), θ)h(x) for all x ∈ X, θ ∈ Θ.   (1.5.2)

More generally, if T₁ and T₂ are any two statistics such that T₁(x) = T₁(y) if and only if T₂(x) = T₂(y), then T₁ and T₂ provide the same information and achieve the same reduction of the data. Such statistics are called equivalent. Being told that the number of successes in five trials is three is the same as knowing that the difference between the number of successes and the number of failures is one.

Proof. We shall give the proof in the discrete case. Let (x₁, x₂, ...) be the set of possible realizations of X and let tᵢ = T(xᵢ). Then T is discrete and Σᵢ P_θ[T = tᵢ] = 1 for every θ. To prove the sufficiency of (1.5.2), we need only show that P_θ[X = xⱼ | T = tᵢ] is independent of θ on each of the sets Sᵢ = {θ : P_θ[T = tᵢ] > 0}, i = 1, 2, .... Now, if (1.5.2) holds,

  P_θ[T = tᵢ] = Σ_{x : T(x) = tᵢ} p(x, θ) = g(tᵢ, θ) Σ_{x : T(x) = tᵢ} h(x).   (1.5.3)

By our definition of conditional probability in the discrete case, for θ ∈ Sᵢ,

  P_θ[X = xⱼ | T = tᵢ] = P_θ[X = xⱼ, T = tᵢ] / P_θ[T = tᵢ].   (1.5.4)

By (1.5.2),

  P_θ[X = xⱼ, T = tᵢ] = g(tᵢ, θ)h(xⱼ) if T(xⱼ) = tᵢ, and 0 if T(xⱼ) ≠ tᵢ.

Applying (1.5.3), we arrive at

  P_θ[X = xⱼ | T = tᵢ] = h(xⱼ) / Σ_{x : T(x) = tᵢ} h(x) if T(xⱼ) = tᵢ, and 0 if T(xⱼ) ≠ tᵢ,   (1.5.5)

which is independent of θ on each Sᵢ.
Conversely, if T is sufficient, let g(tᵢ, θ) = P_θ[T = tᵢ] and h(x) = P[X = x | T(X) = tᵢ], which does not depend on θ. Then

  p(x, θ) = P_θ[X = x] = P_θ[X = x, T = T(x)] = g(T(x), θ)h(x).   (1.5.6)  □

Example 1.5.2 (continued). If X₁, ..., Xₙ are the interarrival times for n customers, then the joint density of (X₁, ..., Xₙ) is

  p(x₁, ..., xₙ, θ) = θⁿ exp[−θ Σⁿᵢ₌₁ xᵢ]   (1.5.7)

if all the xᵢ are > 0, and p(x₁, ..., xₙ, θ) = 0 otherwise. Take g(t, θ) = θⁿe^{−θt} for t > 0, and h(x₁, ..., xₙ) = 1 if all the xᵢ are > 0, both functions = 0 otherwise. We may apply Theorem 1.5.1 to conclude that T(X₁, ..., Xₙ) = Σⁿᵢ₌₁ Xᵢ is sufficient. A whole class of distributions, which admits simple sufficient statistics and to which this example belongs, is introduced in the next section. □

Example 1.5.3. Estimating the Size of a Population. Consider a population with θ members labeled consecutively from 1 to θ. The population is sampled with replacement and n members of the population are observed; their labels X₁, ..., Xₙ are recorded. Common sense indicates that to get information about θ, we need only keep track of X₍ₙ₎ = max(X₁, ..., Xₙ). In fact, we can show that X₍ₙ₎ is sufficient. The probability distribution of X is given by

  p(x₁, ..., xₙ, θ) = θ⁻ⁿ   (1.5.9)

if every xᵢ is an integer between 1 and θ, and p(x₁, ..., xₙ, θ) = 0 otherwise. Expression (1.5.9) can be rewritten as

  p(x₁, ..., xₙ, θ) = θ⁻ⁿ 1{x₍ₙ₎ ≤ θ},

where x₍ₙ₎ = max(x₁, ..., xₙ). By Theorem 1.5.1, X₍ₙ₎ is a sufficient statistic for θ. □

Example 1.5.4. Let X₁, ..., Xₙ be independent and identically distributed random variables, each having a normal distribution with mean μ and variance σ², both of which are unknown. Let θ = (μ, σ²). Then the density of (X₁, ..., Xₙ) is given by

  p(x₁, ..., xₙ, θ) = [2πσ²]^{−n/2} exp{−(1/2σ²) Σⁿᵢ₌₁ (xᵢ − μ)²}
   = [exp{−nμ²/2σ²}][2πσ²]^{−n/2} exp{−(1/2σ²)(Σⁿᵢ₌₁ xᵢ² − 2μ Σⁿᵢ₌₁ xᵢ)}.   (1.5.10)
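Example 1.5.3 can also be checked numerically (a sketch, not from the text): conditionally on the observed maximum X₍ₙ₎ = m, every sample with that maximum is equally likely, with probability 1/(mⁿ − (m − 1)ⁿ), for every θ ≥ m. The values of n, m, and θ below are illustrative.

```python
from itertools import product

def pmf(x, theta):
    # sampling with replacement from labels 1..theta: P[X = x] = theta^(-n)
    return theta ** (-len(x)) if max(x) <= theta else 0.0

n, m = 3, 4                      # condition on the observed maximum X_(n) = m
count = m ** n - (m - 1) ** n    # number of samples with maximum exactly m
for theta in (4, 7, 10):         # any theta >= m is consistent with the data
    p_max = sum(pmf(x, theta) for x in product(range(1, theta + 1), repeat=n)
                if max(x) == m)
    for x in product(range(1, theta + 1), repeat=n):
        if max(x) == m:
            cond = pmf(x, theta) / p_max
            assert abs(cond - 1 / count) < 1e-12  # uniform, free of theta
```

Since the conditional distribution given X₍ₙ₎ does not involve θ, the check agrees with the conclusion of the factorization argument.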
Evidently p(x₁, ..., xₙ, θ) is itself a function of (Σⁿᵢ₌₁ xᵢ, Σⁿᵢ₌₁ xᵢ²) and θ only. Upon applying Theorem 1.5.1 we can conclude that

  T(X₁, ..., Xₙ) = (Σⁿᵢ₌₁ Xᵢ, Σⁿᵢ₌₁ Xᵢ²)

is sufficient for θ. An equivalent sufficient statistic in this situation that is frequently used is

  S(X₁, ..., Xₙ) = ((1/n) Σⁿᵢ₌₁ Xᵢ, [1/(n − 1)] Σⁿᵢ₌₁ (Xᵢ − X̄)²),

where X̄ = (1/n) Σⁿᵢ₌₁ Xᵢ. The first and second components of this vector are called the sample mean and the sample variance, respectively. □

Example 1.5.5. Suppose, as in Example 1.1.4 with d = 2, that Y₁, ..., Yₙ are independent, Yᵢ ~ N(μᵢ, σ²), with the μᵢ following the linear regression model μᵢ = β₁ + β₂zᵢ, i = 1, ..., n, where we assume that the given constants {zᵢ} are not all identical. Then θ = (β₁, β₂, σ²)ᵀ is identifiable and

  p(y, θ) = (2πσ²)^{−n/2} exp{−(1/2σ²) Σⁿᵢ₌₁ (yᵢ − β₁ − β₂zᵢ)²}
   = (2πσ²)^{−n/2} exp{−(nβ₁² + 2β₁β₂ Σᵢ zᵢ + β₂² Σᵢ zᵢ²)/2σ²} exp{−(1/2σ²)(Σᵢ yᵢ² − 2β₁ Σᵢ yᵢ − 2β₂ Σᵢ zᵢyᵢ)}.

By the factorization theorem, T = (Σⁿᵢ₌₁ Yᵢ, Σⁿᵢ₌₁ Yᵢ², Σⁿᵢ₌₁ zᵢYᵢ) is sufficient for θ. □

Sufficiency and decision theory

Sufficiency can be given a clear operational interpretation in the decision theoretic setting. Specifically, if T(X) is sufficient, we can, for any decision procedure δ(X), find a randomized decision rule δ′(T(X)) depending only on T(X) that does as well as δ(X) in the sense of having the same risk function; that is,

  R(θ, δ) = R(θ, δ′) for all θ.   (1.5.12)

By randomized we mean that δ′(T(X)) can be generated from the value t of T(X) and a random mechanism not depending on θ. Here is an example.

Example 1.5.6. Suppose X₁, ..., Xₙ are independent, identically N(μ, 1) distributed, so that X̄ = (1/n) Σⁿᵢ₌₁ Xᵢ is sufficient. Let δ(X) = X₁. Using only X̄, we construct a randomized rule δ′(X̄) with the same risk (mean squared error) as δ(X) as follows: conditionally, given X̄ = t, choose T′ = δ′(X̄) from the N(t, (n − 1)/n) distribution, the conditional distribution of X₁ given X̄ = t. Then, by the double expectation theorem (Section B.1),

  E(T′) = E[E(T′ | X̄)] = E(X̄) = μ = E(X₁),
  Var(T′) = E[Var(T′ | X̄)] + Var[E(T′ | X̄)] = (n − 1)/n + 1/n = 1 = Var(X₁).

Thus, δ′(X̄) and δ(X) have the same mean squared error. □

The proof of (1.5.12) follows along the lines of the preceding example: given T(X) = t, the distribution of δ(X) does not depend on θ. Now draw δ′ randomly from this conditional distribution. This δ′(T(X)) will have the same risk as δ(X) because, by the double expectation theorem,

  R(θ, δ′) = E{E[ℓ(θ, δ′(T)) | T]} = E{E[ℓ(θ, δ(X)) | T]} = R(θ, δ).

Sufficiency and Bayes models

There is a natural notion of sufficiency of a statistic T in the Bayesian context, where, in addition to the model P = {P_θ : θ ∈ Θ}, we postulate a prior distribution π for θ. T(X) is Bayes sufficient for π if the posterior distribution of θ given X = x is the same as the posterior (conditional) distribution of θ given T(X) = T(x), for all x.

Theorem 1.5.2. (Kolmogorov). If T(X) is sufficient for θ, it is Bayes sufficient for every π.

This result and a partial converse are the subject of Problem 1.14.

Minimal sufficiency

For any model there are many sufficient statistics. Thus, if X₁, ..., Xₙ is a N(μ, σ²) sample, n ≥ 2, then T(X) = (Σⁿᵢ₌₁ Xᵢ, Σⁿᵢ₌₁ Xᵢ²) and S(X) = (X₁, ..., Xₙ) are both sufficient. But T(X) provides a greater reduction of the data. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X), in that we can find a transformation r such that T(X) = r(S(X)).
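The randomized-rule construction above can be illustrated by simulation (a sketch with illustrative parameter values, not part of the text): drawing δ′ from N(X̄, (n − 1)/n) reproduces the mean squared error of δ(X) = X₁ to Monte Carlo accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, reps = 5, 2.0, 200_000

x = rng.normal(mu, 1.0, size=(reps, n))
delta = x[:, 0]                      # the rule that uses X1 only
xbar = x.mean(axis=1)                # sufficient statistic T = X-bar
# randomized rule: draw delta' from the conditional law of X1 given T,
# which is N(T, (n - 1)/n) and does not depend on mu
delta_prime = rng.normal(xbar, np.sqrt((n - 1) / n))

mse = ((delta - mu) ** 2).mean()
mse_prime = ((delta_prime - mu) ** 2).mean()
# both risks are Var(X1) = 1, up to simulation error
assert abs(mse - 1.0) < 0.05 and abs(mse_prime - 1.0) < 0.05
```

The key design point is that the extra randomness in δ′ is generated by a mechanism that does not depend on μ, exactly as the definition of a randomized rule requires.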
Example 1.5.1 (continued). In this Bernoulli trials case, T = Σⁿᵢ₌₁ Xᵢ was shown to be sufficient. With a β(r, s) prior, we saw that the posterior distribution given X = x is β(k + r, n − k + s), where k = Σⁿᵢ₌₁ xᵢ — the same as the posterior distribution given T(X) = Σⁿᵢ₌₁ Xᵢ = k. In this situation, T is Bayes sufficient. By Theorem 1.5.2, since T is sufficient, it is Bayes sufficient for every π. □

To see that T = Σⁿᵢ₌₁ Xᵢ is minimally sufficient here, let S(X) be any other sufficient statistic. Then, by the factorization theorem, we can write p(x, θ) = g(S(x), θ)h(x), that is,

  θᵀ(1 − θ)^{n−T} = g(S(x), θ)h(x) for all θ.

For the two fixed values θ₁ = 2/3 and θ₂ = 1/3, the ratio of both sides of the foregoing gives

  2^{2T−n} = g(S(x), 2/3)/g(S(x), 1/3).

Taking logs and solving for T, we find

  T = r(S(x)) = log[2ⁿ g(S(x), 2/3)/g(S(x), 1/3)] / (2 log 2).

Thus, T is a transformation of every other sufficient statistic; T is minimally sufficient.

The likelihood function

We define the likelihood function L for a given observed data vector x as

  Lx(θ) = p(x, θ),  θ ∈ Θ.

Thus, Lx is a map from the sample space X to the class T of functions {θ ↦ p(x, θ) : x ∈ X}. It is a statistic whose values are functions: if X = x, the statistic L takes on the value Lx. For a given θ, Lx(θ) gives the probability of observing the point x; in the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x. However, when we think of Lx(θ) as a function of θ, it gives, for a given observed x, the "likelihood" or "plausibility" of various θ. The formula (1.2.8) for the posterior distribution can then be remembered as

  Posterior ∝ (Prior) × (Likelihood),

where the sign ∝ denotes proportionality as functions of θ.

Example 1.5.4 (continued). The preceding Bernoulli argument shows how we can use p(x, θ) for different values of θ, together with the factorization theorem, to establish that a sufficient statistic is minimally sufficient. In this N(μ, σ²) example, with θ = (μ, σ²),

  Lx(θ) = (2πσ²)^{−n/2} exp{−(1/2σ²)(t₂ − 2μt₁ + nμ²)},

where (t₁, t₂) = (Σⁿᵢ₌₁ xᵢ, Σⁿᵢ₌₁ xᵢ²). Thus, for a given observed x, the likelihood function is determined by the two-dimensional sufficient statistic T = (t₁, t₂). Conversely, Lx determines (t₁, t₂): for example, if we set (μ, σ²) = (0, 1) and take the log of both sides of this equation, we can solve for t₂,

  t₂ = −2 log Lx(0, 1) − n log 2π.
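The fact that the likelihood ratio depends on x only through T(x) can be checked directly in the Bernoulli case (a sketch with illustrative values, not from the text): two samples with the same value of T = Σxᵢ yield identical ratios Lx(θ₁)/Lx(θ₂) for every pair (θ₁, θ₂).

```python
def lik(x, theta):
    # Bernoulli likelihood: theta^t (1 - theta)^(n - t) with t = sum(x)
    t = sum(x)
    return theta ** t * (1 - theta) ** (len(x) - t)

x1 = (1, 1, 0, 1, 0)  # two samples with the same T = 3
x2 = (0, 1, 1, 0, 1)
for th1, th2 in [(0.3, 0.7), (0.5, 0.9)]:
    r1 = lik(x1, th1) / lik(x1, th2)
    r2 = lik(x2, th1) / lik(x2, th2)
    assert abs(r1 - r2) < 1e-12  # the ratio depends on x only through T(x)
```

This is the computational face of the minimal sufficiency argument: likelihood ratios at fixed parameter pairs recover T.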
A similar expression gives t₁ in terms of Lx(0, 1) and Lx(1, 1) (Problem 1.5.17). Thus, Lx determines (t₁, t₂); hence, L is a statistic equivalent to (t₁, t₂) and is itself minimally sufficient. □

In fact, a statistic closely related to L solves the minimal sufficiency problem in general. Suppose there exists θ₀ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ₀) > 0} for all θ. Let Λx be the function-valued statistic that at θ takes on the value

  Λx(θ) = Lx(θ)/Lx(θ₀),

the likelihood ratio of θ to θ₀. Then Λx is minimal sufficient. See Problem 1.5.12 for a proof of this theorem of Dynkin, Lehmann, and Scheffé.

The "irrelevant" part of the data

We can always rewrite the original X as (T(X), S(X)), where S(X) is a statistic needed to uniquely determine x once we know the sufficient statistic T(x). For instance, if T(X) = X̄, we can take S(X) = (X₁ − X̄, ..., Xₙ − X̄), the residuals; if T(X) = (X₍₁₎, ..., X₍ₙ₎), the order statistics, we can take S(X) = (R₁, ..., Rₙ), the ranks, where Rᵢ = Σⱼ₌₁ⁿ 1(Xⱼ ≤ Xᵢ). The "irrelevant" part S(X) becomes ancillary for inference once T(X) is known — but only if P is valid. Thus, if P specifies that X₁, ..., Xₙ are a random sample from a N(μ, 1) population, that is, if σ² = 1 is postulated, then X̄ is sufficient; but if in fact σ² ≠ 1, all information about σ² is contained in the residuals. If, as in Example 1.5.4, σ² is assumed unknown, then T(X) = (Σⁿᵢ₌₁ Xᵢ, Σⁿᵢ₌₁ Xᵢ²) is sufficient. If P specifies only that X₁, ..., Xₙ are a random sample, then (X₍₁₎, ..., X₍ₙ₎) is sufficient; but the ranks are needed if we want to look for possible dependencies in the observations, as in Example 1.1.5, and if in fact the common distribution of the observations is not Gaussian, all the information needed to estimate this distribution is contained in the corresponding S(X) — see Problem 1.5.13.

Summary. Consider an experiment with observation vector X = (X₁, ..., Xₙ), where X has distribution in the class P = {P_θ : θ ∈ Θ}. We say that a statistic T(X) is sufficient for P ∈ P, or for the parameter θ, if the conditional distribution of X given T(X) = t does not involve θ. The factorization theorem states that T(X) is sufficient for θ if and only if there exist functions g(t, θ) and h(x) such that

  p(x, θ) = g(T(x), θ)h(x).

If T(X) is sufficient for θ, then for any decision procedure δ(X) we can find a randomized decision rule δ′(T(X)), depending only on the value of t = T(X) and not on θ, such that δ and δ′ have identical risk functions. We define a statistic T(X) to be Bayes sufficient for a prior π if the posterior distribution of θ given X = x is the same as the posterior distribution of θ given T(X) = T(x) for all x; if T(X) is sufficient for θ, it is Bayes sufficient for every π. A sufficient statistic T(X) is minimally sufficient for θ if for any other sufficient statistic S(X) we can find a transformation r such that T(X) = r(S(X)). The likelihood function is defined for a given data vector x of observations to be the function of θ given by Lx(θ) = p(x, θ).
By the factorization theorem, if T(X) is sufficient for θ, then the likelihood ratio

  Λx(θ) = Lx(θ)/Lx(θ₀)

depends on x through T(x) only. If, in addition, there is a value θ₀ ∈ Θ such that {x : p(x, θ) > 0} ⊂ {x : p(x, θ₀) > 0} for all θ, then Λx is a minimally sufficient statistic.

1.6 EXPONENTIAL FAMILIES

The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman, Pitman, and Darmois through investigations of this property(1). Subsequently, many other common features of these families were discovered, and they have become important in much of the modern theory of statistics. Probability models with these common features include normal, binomial, Poisson, gamma, beta, and multinomial regression models used to relate a response variable Y to a set of predictor variables. More generally, these families form the basis for an important class of models called generalized linear models. We return to these models in Chapter 2. They will reappear in several connections in this book.

1.6.1 The One-Parameter Case

The family of distributions of a model {P_θ : θ ∈ Θ}, Θ ⊂ R, is said to be a one-parameter exponential family if there exist real-valued functions η(θ) and B(θ) on Θ, and real-valued functions T and h on R^q, such that the density (frequency) functions p(x, θ) of the P_θ may be written

  p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)},   (1.6.1)

where x ∈ X ⊂ R^q. Note that the functions η, B, and T are not unique. In a one-parameter exponential family the random variable T(X) is sufficient for θ. This is clear because we need only identify exp{η(θ)T(x) − B(θ)} with g(T(x), θ), and h(x) with itself, in the factorization theorem. We shall refer to T as a natural sufficient statistic of the family. Here are some examples.

Example 1.6.1. The Poisson Distribution. Let P_θ be the Poisson distribution with unknown mean θ. Then, for x ∈ {0, 1, 2, ...},

  p(x, θ) = θˣe^{−θ}/x! = (1/x!) exp{x log θ − θ},  θ > 0.   (1.6.2)

Therefore, the P_θ form a one-parameter exponential family with

  q = 1, η(θ) = log θ, B(θ) = θ, T(x) = x, h(x) = 1/x!. □
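The Poisson factorization in (1.6.2) can be verified term by term (a numeric sketch; the parameter values are illustrative): the exponential-family form h(x) exp{η(θ)T(x) − B(θ)} with h(x) = 1/x!, η(θ) = log θ, T(x) = x, B(θ) = θ reproduces the Poisson frequency function exactly.

```python
from math import exp, log, factorial

def poisson_pmf(x, theta):
    return theta ** x * exp(-theta) / factorial(x)

def expfam_form(x, theta):
    # h(x) exp{eta(theta) T(x) - B(theta)} with
    # h = 1/x!, eta = log theta, T(x) = x, B(theta) = theta
    return (1 / factorial(x)) * exp(log(theta) * x - theta)

for theta in (0.5, 2.0, 7.3):
    for x in range(20):
        assert abs(poisson_pmf(x, theta) - expfam_form(x, theta)) < 1e-12
```

The same pattern of check works for any of the families in this section once η, B, T, and h are identified.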
Example 1.6.2. The Binomial Family. Suppose X has a B(n, θ) distribution, 0 < θ < 1. Then, for x ∈ {0, 1, ..., n},

  p(x, θ) = (n choose x) θˣ(1 − θ)^{n−x} = (n choose x) exp[x log(θ/(1 − θ)) + n log(1 − θ)].   (1.6.3)

Therefore, the family of distributions of X is a one-parameter exponential family with

  q = 1, η(θ) = log(θ/(1 − θ)), B(θ) = −n log(1 − θ), T(x) = x, h(x) = (n choose x). □

Here is an example where q = 2.

Example 1.6.3. Suppose X = (Z, Y)ᵀ, where Y = Z + θW, θ > 0, and Z and W are independent N(0, 1). Then

  f(x, θ) = f(z)f(y | z) = φ(z)θ⁻¹φ((y − z)θ⁻¹)
   = (2π)⁻¹θ⁻¹ exp{−½[z² + (y − z)²θ⁻²]}
   = [(2π)⁻¹ exp{−½z²}] exp{−½θ⁻²(y − z)² − log θ}.   (1.6.4)

This is a one-parameter exponential family distribution with

  q = 2, η(θ) = −½θ⁻², B(θ) = log θ, T(x) = (y − z)², h(x) = (2π)⁻¹ exp{−½z²}. □

The families of distributions obtained by sampling from one-parameter exponential families are themselves one-parameter exponential families. Specifically, suppose X₁, ..., X_m are independent and identically distributed with common distribution P_θ, θ ∈ Θ, where the P_θ form a one-parameter exponential family as in (1.6.1). If {P_θ^(m)} is the family of distributions of X = (X₁, ..., X_m), considered as a random vector in R^{mq}, and p(x, θ) are the corresponding density (frequency) functions, then

  p(x, θ) = [∏ᵢ₌₁^m h(xᵢ)] exp[η(θ) Σᵢ₌₁^m T(xᵢ) − mB(θ)],   (1.6.5)

where x = (x₁, ..., x_m). Therefore, the P_θ^(m) form a one-parameter exponential family. If we use the superscript m to denote the corresponding functions, then

  q^(m) = mq, η^(m) = η, B^(m)(θ) = mB(θ),   (1.6.6)
  T^(m)(x) = Σᵢ₌₁^m T(xᵢ), h^(m)(x) = ∏ᵢ₌₁^m h(xᵢ).   (1.6.7)

Note that the natural sufficient statistic T^(m) is one-dimensional whatever be m. In our first Example 1.6.1, if X = (X₁, ..., X_m) is a vector of independent and identically distributed P(θ) random variables, the sufficient statistic T^(m)(X₁, ..., X_m) = Σᵢ₌₁^m Xᵢ is distributed as P(mθ); this family of Poisson distributions is one-parameter exponential whatever be m. Some other important examples are summarized in the following table.

TABLE 1.6.1

  Family of distributions     η(θ)        T(x)
  N(μ, σ²), σ² fixed         μ/σ²        x
  N(μ, σ²), μ fixed          −1/2σ²      (x − μ)²
  Γ(p, λ), p fixed           −λ          x
  Γ(p, λ), λ fixed           p − 1       log x
  β(r, s), r fixed           s − 1       log(1 − x)
  β(r, s), s fixed           r − 1       log x

The statistic T^(m)(X₁, ..., X_m) corresponding to the one-parameter exponential family of distributions of a sample from any of the foregoing is just Σᵢ₌₁^m T(Xᵢ). We leave the proof of these assertions to the reader.

In the discrete case we can establish the following general result.

Theorem 1.6.1. Let {P_θ} be a one-parameter exponential family of discrete distributions with corresponding functions T, η, B, and h. Then the family of distributions of the statistic T(X) is a one-parameter exponential family of discrete distributions whose frequency functions may be written

  h*(t) exp{η(θ)t − B(θ)}   (1.6.8)

for suitable h*.
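Equations (1.6.5)–(1.6.7) can be illustrated numerically for a Poisson sample (a sketch with illustrative values, not from the text): the joint frequency function equals h^(m)(x) exp{η(θ)T^(m)(x) − mB(θ)}.

```python
from math import exp, log, factorial

def pmf(x, theta):
    # Poisson frequency function
    return theta ** x * exp(-theta) / factorial(x)

theta, sample = 1.7, (3, 0, 2, 5)
m = len(sample)

joint = 1.0
for xi in sample:
    joint *= pmf(xi, theta)

h_m = 1.0
for xi in sample:
    h_m /= factorial(xi)          # h^(m)(x) = prod 1/x_i!
t_m = sum(sample)                 # natural sufficient statistic T^(m)

assert abs(joint - h_m * exp(log(theta) * t_m - m * theta)) < 1e-12
```

Only t_m = Σxᵢ and m enter the θ-dependent factor, which is why the sample sum is the natural sufficient statistic whatever be m.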
Proof. By definition,

  P_θ[T(X) = t] = Σ_{x : T(x) = t} p(x, θ) = Σ_{x : T(x) = t} h(x) exp[η(θ)T(x) − B(θ)]
   = exp[η(θ)t − B(θ)] Σ_{x : T(x) = t} h(x).

If we let h*(t) = Σ_{x : T(x) = t} h(x), the result follows. □

A similar theorem holds in the continuous case if the distributions of T(X) are themselves continuous.

Canonical exponential families. We obtain an important and useful reparametrization of the exponential family (1.6.1) by letting the model be indexed by η rather than θ. The exponential family then has the form

  q(x, η) = h(x) exp[ηT(x) − A(η)],  x ∈ X ⊂ R^q,   (1.6.9)

where A(η) = log ∫···∫ h(x) exp[ηT(x)] dx in the continuous case; in the discrete case the integral is replaced by a sum. If θ ∈ Θ and η = η(θ), then A(η) must be finite, if q is to define a density (frequency) function. Let Ɛ be the collection of all η such that A(η) is finite. Then, as we show in Section 1.6.2, Ɛ is either an interval or all of R, and the class of models (1.6.9) with η ∈ Ɛ contains the class of models with θ ∈ Θ. The model given by (1.6.9), with η ranging over Ɛ, is called the canonical one-parameter exponential family generated by T and h. Ɛ is called the natural parameter space and T is called the natural sufficient statistic.

Example 1.6.1 (continued). The Poisson family in canonical form is

  q(x, η) = (1/x!) exp{ηx − exp[η]},  x ∈ {0, 1, 2, ...},

where η = log θ. Here

  exp{A(η)} = Σ_{x=0}^∞ e^{ηx}/x! = Σ_{x=0}^∞ (e^η)ˣ/x! = exp(e^η),

so A(η) = e^η and Ɛ = R. □

Here is a useful result.

Theorem 1.6.2. If X is distributed according to (1.6.9) and η is an interior point of Ɛ, the moment-generating function of T(X) exists and is given by

  M(s) = exp[A(s + η) − A(η)]

for s in some neighborhood of 0. Moreover,

  E(T(X)) = A′(η),  Var(T(X)) = A″(η).

Proof. We give the proof in the continuous case. We compute

  M(s) = E(exp(sT(X))) = ∫···∫ exp[(s + η)T(x) − A(η)] h(x) dx
   = {exp[A(s + η) − A(η)]} ∫···∫ h(x) exp[(s + η)T(x) − A(s + η)] dx
   = exp[A(s + η) − A(η)],

because the last factor, being the integral of a density, is one. The rest of the theorem follows from the moment-generating property of M(s) (see Section A.12). □

Here is a typical application of this result.

Example 1.6.4. Suppose X₁, ..., Xₙ is a sample from a population with density

  p(x, θ) = (x/θ²) exp(−x²/2θ²),  x > 0, θ > 0.

This is known as the Rayleigh distribution. It is used to model the density of "time until failure" for certain types of equipment. Now

  p(x₁, ..., xₙ, θ) = [∏ᵢ₌₁ⁿ (xᵢ/θ²)] exp(−Σᵢ₌₁ⁿ xᵢ²/2θ²)
   = (∏ᵢ₌₁ⁿ xᵢ) exp[−(1/2θ²) Σᵢ₌₁ⁿ xᵢ² − n log θ²].

Here η = −1/2θ², θ² = −1/2η, B(θ) = n log θ², and A(η) = −n log(−2η). By Theorem 1.6.2, the natural sufficient statistic Σᵢ₌₁ⁿ Xᵢ² has mean −n/η = 2nθ² and variance n/η² = 4nθ⁴. Direct computation of these moments is more complicated. □

1.6.2 The Multiparameter Case

Our discussion of the "natural form" suggests that one-parameter exponential families are naturally indexed by a one-dimensional real parameter η and admit a one-dimensional sufficient statistic T(x). More generally, Koopman, Pitman, and Darmois were led in their investigations to the following family of distributions, which are naturally indexed by a k-dimensional parameter and admit a k-dimensional sufficient statistic. A family of distributions {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is said to be a k-parameter exponential family if there exist real-valued functions η₁, ..., η_k and B of θ, and real-valued functions T₁, ..., T_k, h on R^q, such that the density (frequency) functions of the P_θ may be written as

  p(x, θ) = h(x) exp[Σⱼ₌₁^k ηⱼ(θ)Tⱼ(x) − B(θ)],  x ∈ X ⊂ R^q.   (1.6.10)
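The Rayleigh moments in Example 1.6.4 follow from differentiating A(η) = −n log(−2η), and the identities E T = A′(η), Var T = A″(η) of Theorem 1.6.2 can be checked with finite differences (a deterministic sketch; n and θ below are illustrative).

```python
from math import log

def A(eta, n):
    # log-normalizer of the Rayleigh sample in canonical form (eta < 0):
    # A(eta) = -n * log(-2 * eta), with eta = -1/(2 theta^2)
    return -n * log(-2 * eta)

n, theta = 10, 1.5
eta = -1 / (2 * theta ** 2)
h = 1e-5
mean_T = (A(eta + h, n) - A(eta - h, n)) / (2 * h)                 # ~ A'(eta)
var_T = (A(eta + h, n) - 2 * A(eta, n) + A(eta - h, n)) / h ** 2   # ~ A''(eta)
assert abs(mean_T - 2 * n * theta ** 2) < 1e-3   # E sum X_i^2 = 2 n theta^2
assert abs(var_T - 4 * n * theta ** 4) < 1e-2    # Var sum X_i^2 = 4 n theta^4
```

As the text notes, obtaining these moments directly from the density would be more complicated; here they fall out of two derivatives of A.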
By Theorem 1.5.1, the vector T(X) = (T₁(X), ..., T_k(X))ᵀ is sufficient. It will be referred to as a natural sufficient statistic of the family. If we observe a sample X = (X₁, ..., X_m), where the Xᵢ are independent and identically distributed and their common distribution ranges over a k-parameter exponential family given by (1.6.10), then the distributions of X form a k-parameter exponential family with natural sufficient statistic

  T^(m)(x) = (Σᵢ₌₁^m T₁(Xᵢ), ..., Σᵢ₌₁^m T_k(Xᵢ)).

Again, it will be convenient to consider the "biggest" families, letting the model be indexed by η = (η₁, ..., η_k)ᵀ rather than θ. Thus, the canonical k-parameter exponential family generated by T and h is

  q(x, η) = h(x) exp{Tᵀ(x)η − A(η)},  x ∈ X ⊂ R^q,

where T(x) = (T₁(x), ..., T_k(x))ᵀ and, in the continuous case,

  A(η) = log ∫···∫ h(x) exp{Tᵀ(x)η} dx.

In the discrete case, A(η) is defined in the same way except that integrals over R^q are replaced by sums. In either case, we define the natural parameter space as

  Ɛ = {η ∈ R^k : −∞ < A(η) < ∞}.

Example 1.6.5. The Normal Family. Suppose P_θ = N(μ, σ²), θ = (μ, σ²) ∈ Θ = {(μ, σ²) : −∞ < μ < ∞, σ² > 0}. The density of P_θ may be written as

  p(x, θ) = exp{(μ/σ²)x − (1/2σ²)x² − (μ²/2σ² + ½ log(2πσ²))},   (1.6.11)

which corresponds to a two-parameter exponential family with

  q = 1, θ₁ = μ, θ₂ = σ², η₁(θ) = μ/σ², η₂(θ) = −1/2σ²,
  T₁(x) = x, T₂(x) = x², B(θ) = μ²/2σ² + ½ log(2πσ²), h(x) = 1.

If we observe a sample X = (X₁, ..., X_m) from a N(μ, σ²) population, the preceding discussion leads us to the natural sufficient statistic

  (Σᵢ₌₁^m Xᵢ, Σᵢ₌₁^m Xᵢ²),

which we obtained in the previous section (Example 1.5.4). □

Example 1.6.5 (N(μ, σ²) continued). In canonical form, η₁ = μ/σ² and η₂ = −1/2σ², so that μ = −η₁/2η₂ and σ² = −1/2η₂. Then

  A(η) = μ²/2σ² + ½ log(2πσ²) = −η₁²/4η₂ + ½ log(−π/η₂),

h(x) = 1, and Ɛ = {(η₁, η₂) : η₁ ∈ R, η₂ < 0}. □

Example 1.6.6. Linear Regression. Suppose, as in Example 1.5.5, that Y₁, ..., Yₙ are independent, Yᵢ ~ N(μᵢ, σ²), with μᵢ = β₁ + β₂zᵢ. The density of Y = (Y₁, ..., Yₙ)ᵀ can be put in canonical form with k = 3,

  T(Y) = (Σᵢ₌₁ⁿ Yᵢ, Σᵢ₌₁ⁿ Yᵢ², Σᵢ₌₁ⁿ zᵢYᵢ)ᵀ, η₁ = β₁/σ², η₂ = −1/2σ², η₃ = β₂/σ²,

h(y) = 1, and Ɛ = {(η₁, η₂, η₃) : η₁ ∈ R, η₃ ∈ R, η₂ < 0}. □

Example 1.6.7. Multinomial Trials. We observe the outcomes of n independent trials, where each trial can end up in one of k possible categories, k ≥ 2. We write the outcome vector as X = (X₁, ..., Xₙ)ᵀ, where the Xᵢ are i.i.d. as X and the sample space of each Xᵢ is the k categories {1, 2, ..., k}. Let λⱼ = P(Xᵢ = j), 1 ≤ j ≤ k, and let Tⱼ(x) = Σᵢ₌₁ⁿ 1[xᵢ = j]. Then we can write the likelihood as

  q₀(x, λ) = ∏ⱼ₌₁^k λⱼ^{Tⱼ(x)} = exp{Σⱼ₌₁^k Tⱼ(x) log λⱼ},  λ ∈ Λ,

where Λ is the simplex {λ ∈ R^k : 0 < λⱼ < 1, 1 ≤ j ≤ k, Σⱼ₌₁^k λⱼ = 1}. This is a k-parameter exponential family generated by T₁, ..., T_k and h(x) = ∏ᵢ₌₁ⁿ 1[xᵢ ∈ {1, ..., k}], with θⱼ = log λⱼ. However, θ = (θ₁, ..., θ_k)ᵀ is not identifiable, because q₀(x, θ + c1) = q₀(x, θ) for all c ∈ R, where 1 = (1, ..., 1)ᵀ. This is remedied by reparametrizing, as follows.
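The canonical normal form in Example 1.6.5 can be verified pointwise (a numeric sketch; the values of μ, σ², and x are illustrative): with η₁ = μ/σ², η₂ = −1/2σ², and A(η) = −η₁²/4η₂ + ½ log(−π/η₂), the canonical density reproduces the N(μ, σ²) density.

```python
from math import exp, log, pi, sqrt

def normal_pdf(x, mu, sig2):
    return exp(-(x - mu) ** 2 / (2 * sig2)) / sqrt(2 * pi * sig2)

def canonical(x, eta1, eta2):
    # h(x) = 1, T(x) = (x, x^2),
    # A(eta) = -eta1^2/(4 eta2) + (1/2) log(-pi/eta2), valid for eta2 < 0
    A = -eta1 ** 2 / (4 * eta2) + 0.5 * log(-pi / eta2)
    return exp(eta1 * x + eta2 * x ** 2 - A)

mu, sig2 = 1.3, 0.7
eta1, eta2 = mu / sig2, -1 / (2 * sig2)
for x in (-2.0, 0.0, 0.5, 3.1):
    assert abs(normal_pdf(x, mu, sig2) - canonical(x, eta1, eta2)) < 1e-12
```

Completing the square in the exponent, η₁x + η₂x² − A(η) = −(x − μ)²/2σ² − ½ log(2πσ²), which is why the two expressions agree exactly.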
Set

  ηⱼ = log(λⱼ/λ_k) = log(P[X = j]/P[X = k]),  1 ≤ j ≤ k − 1,

and rewrite

  q(x, η) = exp{Σⱼ₌₁^{k−1} ηⱼTⱼ(x) − n log(1 + Σⱼ₌₁^{k−1} e^{ηⱼ})}.

Note that the model for X is unchanged. This is a (k − 1)-parameter canonical exponential family generated by T₁, ..., T_{k−1} and h(x) = ∏ᵢ₌₁ⁿ 1[xᵢ ∈ {1, ..., k}], with canonical parameter η = (η₁, ..., η_{k−1})ᵀ and Ɛ = R^{k−1}. Moreover, the parameters ηⱼ = log(P_η[X = j]/P_η[X = k]), 1 ≤ j ≤ k − 1, are identifiable. If the λⱼ are unrestricted, 0 < λⱼ < 1, 1 ≤ j ≤ k, we can recover them by the reparametrization

  λⱼ = e^{ηⱼ} / Σₗ₌₁^k e^{ηₗ},  with η_k = 0. □

1.6.3 Building Exponential Families

Submodels. A submodel of a k-parameter canonical exponential family {q(x, η) : η ∈ Ɛ ⊂ R^k} is an exponential family defined by

  p(x, θ) = q(x, η(θ)),   (1.6.12)

where θ ∈ Θ ⊂ R^l, l ≤ k, and η is a map from Θ to a subset of R^k. Thus, if X is discrete and takes on k values as in Example 1.6.7, then all models for X are exponential families, because they are submodels of the multinomial trials model.

Affine transformations. Consider affine transformations from R^k to R^l defined by

  M(T) = M_{l×k}T + b_{l×1}.

If P is the canonical family generated by T_{k×1} and h, it is easy to see that the family generated by M(T(X)) and h is the subfamily of P corresponding to

  η(θ) = Mᵀθ, θ ∈ R^l.

Similarly, if Θ ⊂ R^l and η(θ) = B_{k×l}θ, then the resulting submodel of P above is a submodel of the exponential family generated by BᵀT(X) and h. See Problem 1.6.17 for details.

Here is an example of affine transformations of θ and T.

Example 1.6.8. Logistic Regression. Let Yᵢ, 1 ≤ i ≤ n, be independent binomial, B(nᵢ, λᵢ), where the λᵢ are probabilities of success. Let x₁ < ··· < xₙ be specified levels and

  ηᵢ = log(λᵢ/(1 − λᵢ)) = θ₁ + θ₂xᵢ,  1 ≤ i ≤ n.   (1.6.13)

By Example 1.6.2, with Yᵢ taking integer values from 0 to nᵢ, the distribution of Y = (Y₁, ..., Yₙ)ᵀ is an n-parameter canonical exponential family generated by Y₁, ..., Yₙ and h(y) = ∏ᵢ₌₁ⁿ (nᵢ choose yᵢ) 1(0 ≤ yᵢ ≤ nᵢ), with A(η) = Σᵢ₌₁ⁿ nᵢ log(1 + e^{ηᵢ}). The submodel (1.6.13) is the linear transformation η(θ) = B_{n×2}θ corresponding to

  B_{n×2} = (1, x),

where 1 is (1, ..., 1)ᵀ and x = (x₁, ..., xₙ)ᵀ. Setting M = Bᵀ, this is the two-parameter canonical exponential family generated by BᵀY = (Σᵢ₌₁ⁿ Yᵢ, Σᵢ₌₁ⁿ xᵢYᵢ)ᵀ and h.

This model is sometimes applied in experiments to determine the toxicity of a substance. The Yᵢ represent the number of animals dying out of nᵢ when exposed to level xᵢ of the substance. It is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied. Assume also: (a) no interaction between animals (independence) in relation to drug effects; (b) the distribution of X in the animal population is logistic, that is,

  P[X ≤ x] = [1 + exp{−(θ₁ + θ₂x)}]⁻¹,  θ₂ > 0,   (1.6.14)

so that log(P[X ≤ x]/(1 − P[X ≤ x])) = θ₁ + θ₂x. □

Curved exponential families. Exponential families (1.6.12) with the range of η(θ) restricted to a subset of dimension l, with l < k, are called curved exponential families, provided they do not form a canonical exponential family in the θ parametrization.

Example 1.6.9. Gaussian with Fixed Signal-to-Noise Ratio. With X₁, ..., Xₙ i.i.d. N(μ, σ²), suppose the ratio |μ|/σ, which is called the coefficient of variation or signal-to-noise ratio, is a known constant λ₀ > 0. Then, with θ = μ, we can write

  η₁(θ) = λ₀²θ⁻¹, η₂(θ) = −½λ₀²θ⁻²,

where T₁ = Σᵢ₌₁ⁿ Xᵢ, T₂ = Σᵢ₌₁ⁿ Xᵢ², and θ ranges over (0, ∞). This is a curved exponential family with l = 1. □
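The logistic-regression reduction in Example 1.6.8 can be checked numerically (a sketch with illustrative data, not from the text): the product of binomial likelihoods equals h(y) exp{θ₁ΣYᵢ + θ₂ΣxᵢYᵢ − Σnᵢ log(1 + e^{θ₁+θ₂xᵢ})}.

```python
from math import comb, exp, log

def logistic_joint(y, n, x, th1, th2):
    # product of B(n_i, lambda_i) frequency functions with logit lambda_i = th1 + th2 x_i
    p = 1.0
    for yi, ni, xi in zip(y, n, x):
        lam = 1 / (1 + exp(-(th1 + th2 * xi)))
        p *= comb(ni, yi) * lam ** yi * (1 - lam) ** (ni - yi)
    return p

def expfam(y, n, x, th1, th2):
    # canonical form generated by T(y) = (sum y_i, sum x_i y_i)
    h = 1.0
    for yi, ni in zip(y, n):
        h *= comb(ni, yi)
    T1 = sum(y)
    T2 = sum(yi * xi for yi, xi in zip(y, x))
    A = sum(ni * log(1 + exp(th1 + th2 * xi)) for ni, xi in zip(n, x))
    return h * exp(th1 * T1 + th2 * T2 - A)

y, n, x = (2, 5, 7), (4, 8, 9), (0.0, 1.0, 2.5)
for th1, th2 in [(-1.0, 0.8), (0.3, -0.5)]:
    assert abs(logistic_joint(y, n, x, th1, th2) - expfam(y, n, x, th1, th2)) < 1e-10
```

The θ-dependent factor involves y only through (ΣYᵢ, ΣxᵢYᵢ), which is why this two-dimensional statistic is sufficient for (θ₁, θ₂).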
In Example 1.6.8, by contrast, the map η(θ) = Bθ is linear, so the submodel is a canonical exponential family in the θ parametrization, which has dimension 2; hence, it is not a curved family.

Example 1.6.10. Location-Scale Regression. Suppose that Y₁, ..., Yₙ are independent, Yᵢ ~ N(μᵢ, σᵢ²). By Example 1.6.5 this is, in general, a 2n-parameter canonical exponential family model with ηᵢ = μᵢ/σᵢ² and η_{n+i} = −1/2σᵢ², i = 1, ..., n, generated by T(Y) = (Y₁, ..., Yₙ, Y₁², ..., Yₙ²)ᵀ and h(y) = 1. Next suppose that (μᵢ, σᵢ²) depend on the value xᵢ of some covariate, say

  μᵢ = θ₁ + θ₂xᵢ,  σᵢ² = θ₃(θ₁ + θ₂xᵢ)²,

for unknown parameters θ₁ ∈ R, θ₂ ∈ R, θ₃ > 0 (e.g., Bickel, 1978; Carroll and Ruppert, 1988; Snedecor and Cochran, 1989). Because Σᵢ₌₁ⁿ ηᵢ(θ)Yᵢ + Σᵢ₌₁ⁿ η_{n+i}(θ)Yᵢ² cannot be written in the form Σⱼ ηⱼ*(θ)Tⱼ*(Y) for some ηⱼ* and Tⱼ* with j ranging over fewer than 2n indices, (1.6.12) is not an exponential family model in θ, but a curved exponential family model with l = 3. □

Models in which the variance Var(Yᵢ) depends on i are called heteroscedastic, whereas models in which Var(Yᵢ) does not depend on i are called homoscedastic; Examples 1.6.10 and 1.6.6 are heteroscedastic and homoscedastic models, respectively. We return to curved exponential family models in Chapter 2.

Supermodels

We have already noted that the exponential family structure is preserved under i.i.d. sampling. Even more is true. Let Yⱼ, 1 ≤ j ≤ n, be independent, Yⱼ ∈ Yⱼ ⊂ R^q, with exponential family densities

  pⱼ(yⱼ, θ) = hⱼ(yⱼ) exp{Σₗ₌₁^k ηₗ(θ)Tₗⱼ(yⱼ) − Bⱼ(θ)}.

Then Y = (Y₁, ..., Yₙ)ᵀ is modeled by the exponential family generated by T(Y) = Σⱼ₌₁ⁿ Tⱼ(Yⱼ) and ∏ⱼ₌₁ⁿ hⱼ(yⱼ), with parameter η(θ) and B(θ) = Σⱼ₌₁ⁿ Bⱼ(θ). In Example 1.6.8, note that (1.6.13) exhibits Yⱼ, 1 ≤ j ≤ n, as being distributed according to a two-parameter family generated by Tⱼ(Yⱼ) = (Yⱼ, xⱼYⱼ) and hⱼ(yⱼ), and we can apply the supermodel approach to reach the same conclusion as before.

1.6.4 Properties of Exponential Families

Theorem 1.6.1 generalizes directly to k-parameter families, as does its continuous analogue. Recall from Section B.5 that, for any random vector T_{k×1}, we define the moment-generating function

  M(s) = E e^{sᵀT}.
6. .r(T) = IICov(7~.6.5.. Let P be a canonicaL kparameter exponential famiLy generated by (T. h(x) > 0.3.A('70) = II 8".6.T(x)).6. Suppose '70' '71 E t: and 0 < a < 1. v(x).6. ~ = 1 .1.A('7o)} vaLidfor aLL s such that 110 + 5 E E.15) is finite. u(x) ~ exp(a'7.a)'72) < aA('7l) + (1 . Under the conditions ofTheorem 1. Corollary 1. 8". T. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.2.15) that a'7. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. ('70). We prove (b) first.a)'72 E is proved in exactly the same way as Theorem 1.a. Proof of Theorem 1. Theorem 1.Section 1. A(U'71 Which is (b).15) t: the righthand side of (1. + (1 .3. t: and (a) follows.6. A where A( '70) = (8". for any u(x).6. If '7" '72 E + (1 . ('70) II· The corollary follows immediately from Theorem B.9. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1.4).6.6 Expbnential Families 59 V.r'7o T(X) .6.6.8". then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) .3(c).1 give a classical result in Example 1. 8A 8A T" = A('7o) f/J.3 V. Substitute ~ ~ a. By the Holder inequality (B. vex) = exp«1 . S > 0 with ~ + ~ = 1.1 and Theorem 1..6. 1O)llkxk. (with 00 pennitted on either side).6. ('70)) . . Since 110 is an interior point this set of5 includes a baLL about O.a)'7rT(x)) and take logs of both sides to obtain.a)A('72) Because (1. h) with corresponding natural parameter space E and function A(11). Fiually (c) 0 The formulae of Corollary 1. .
Example 1.6.7 (continued). Here, using the α parametrization,

A(α) = n log(Σ_{j=1}^k e^{α_j})

and, if n = 1,

E_λ(T_j(X)) = P_λ[X = j] = λ_j = e^{α_j} / Σ_{l=1}^k e^{α_l}. □

The rank of an exponential family

Evidently every k-parameter exponential family is also k'-dimensional with k' > k. However, there is a minimal dimension. An exponential family is of rank k iff the generating statistic T is k-dimensional and T_1(X), ..., T_k(X) are linearly independent with positive probability. Formally, P_η[Σ_{j=1}^k a_j T_j(X) = a_{k+1}] < 1 unless all a_j are 0. Note that P_θ(A) = 0 or P_θ(A) < 1 for some θ iff the corresponding statement holds for all θ, because 0 < p(x, θ_1)/p(x, θ_2) < ∞ for all x such that h(x) > 0.

Going back to Example 1.6.8, if n = 1 we are writing the one-parameter binomial family corresponding to Y_1 as a two-parameter family with generating statistic (Y_1, x_1 Y_1). But the rank of the family is 1, and θ_1 and θ_2 are not identifiable. However, if we consider Y with n ≥ 2 and x_1 < x_n, the family, as we have seen, remains of rank at most 2 and is in fact of rank 2. Similarly, in Example 1.6.7 we can see that the multinomial family is of rank at most k − 1. It is intuitively clear that k − 1 is in fact its rank, and this is seen in Theorem 1.6.4 that follows.

Our discussion suggests a link between rank and identifiability of the η parametrization. We establish the connection and other fundamental relationships in Theorem 1.6.4.

Theorem 1.6.4. Suppose P = {q(x, η) : η ∈ E} is a canonical exponential family generated by (T_{k×1}, h) with natural parameter space E such that E is open. Then the following are equivalent.

(i) P is of rank k.
(ii) η is a parameter (identifiable).
(iii) Var_η(T) is positive definite.
(iv) The map η ↦ Ȧ(η) is 1-1 on E.
(v) A is strictly convex on E.

Note that, by Theorem 1.6.2, A is defined on all of E.

Proof. We give a detailed proof for k = 1; the proof for k > 1 is then sketched, with details left to a problem. Let ~(·) denote "(·) is false."

I. ~(i) ⇒ P_η[a_1 T = a_2] = 1 for some a_1 ≠ 0, all η. This is equivalent to Var_η(T) = 0 for all η ⇒ ~(iii). Conversely, ~(iii) ⇒ a^T Var_η(T)a = Var_η(a^T T) = 0 for some a ≠ 0 ⇒ a^T T is constant with probability 1 ⇒ ~(i). Thus, ~(i) ⟺ ~(iii) by our remarks in the discussion of rank. Next, ~(ii) ⟺ there exist η_1 ≠ η_2 such that P_{η_1} = P_{η_2}. Equivalently,

exp{η_1 T(x) − A(η_1)}h(x) = exp{η_2 T(x) − A(η_2)}h(x)

with probability 1. Taking logs we obtain (η_1 − η_2)T(x) = A(η_1) − A(η_2) with probability 1 ⇒ ~(i); conversely, ~(i) ⇒ ~(ii). Now (iii) ⇒ A''(η) > 0 for all η by Theorem 1.6.2 and, hence, A'(η) is strictly monotone increasing and 1-1, which gives (iv) and (v). Conversely, A''(η_0) = 0 for some η_0 implies that T is constant with probability 1, which implies that A''(η) = 0 for all η and, thus, A' is constant ⇒ ~(iv) and ~(v). We have (i) ⟺ (ii) ⟺ (iii) ⟺ (iv) ⟺ (v).

II. The case k > 1 is sketched. Suppose ~(ii), i.e., P_{η_1} = P_{η_0} for some η_1 ≠ η_0. Let Q = {P_{η_0 + c(η_1 − η_0)} : (η_0 + c(η_1 − η_0)) ∈ E}. Q is the (one-parameter) exponential family generated by (η_1 − η_0)^T T; note that, because E is open, Q is defined for a set of c that includes an interval about 0. Apply the case k = 1 to Q to get ~(ii) ⇒ ~(i).

III. Properties (iv) and (v) are equivalent to the corresponding statements holding for every Q defined as previously, for arbitrary η_0, η_1. □

Corollary 1.6.2. Suppose that the conditions of Theorem 1.6.4 hold and P is of rank k. Then

(a) P may be uniquely parametrized by μ(η) ≡ E_η T(X), where μ ranges over Ȧ(E);
(b) log q(x, η) is a strictly concave function of η on E.

Proof. This is just a restatement of (iv) and (v) of the theorem. □
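The role of a positive definite Var_η(T) can be illustrated with the multinomial family discussed above. For one trial, the covariance of the full indicator vector T is Cov(T_i, T_j) = λ_i(1{i = j} − λ_j). The sketch below (ours, with made-up cell probabilities) shows that this matrix is singular while its leading 2×2 block is not, matching the claim that the k-category multinomial family has rank k − 1 rather than k.

```python
# Check (ours) that the covariance of the full multinomial indicator vector is
# singular: Cov(T_i, T_j) = lam_i * (1{i==j} - lam_j) for a single trial.
lam = [0.2, 0.3, 0.5]          # illustrative cell probabilities
cov = [[lam[i] * ((1.0 if i == j else 0.0) - lam[j]) for j in range(3)]
       for i in range(3)]

def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

det2 = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]  # leading 2x2 minor
print(det3(cov), det2)  # det3 is numerically zero; det2 is positive
```

Dropping one redundant coordinate of T restores a positive definite covariance, which is the rank-(k − 1) parametrization used in the text.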
The relation in (a) is sometimes evident, and the μ parametrization is close to the initial parametrization of classical P. Thus, the B(n, θ) family is parametrized by E(X), where X is the Bernoulli trial, and we identify the N(μ, σ²) family by E(X, X²) = (μ, σ² + μ²), which is obviously a 1-1 function of (μ, σ²). However, the relation in (a) may be far from obvious (see Problem 1.6.21). The corollary will prove very important in estimation theory; see Section 2.3. We close the present discussion of exponential families with the following example.

Example 1.6.11. The p-Variate Gaussian Family. An important exponential family is based on the multivariate Gaussian distributions of Section B.6. Recall that Y_{p×1} has a p-variate Gaussian distribution, N_p(μ, Σ), with mean μ_{p×1} and positive definite variance-covariance matrix Σ_{p×p}, iff its density is

f(Y, μ, Σ) = |det(Σ)|^{-1/2} (2π)^{-p/2} exp{−(1/2)(Y − μ)^T Σ^{-1}(Y − μ)}.  (1.6.16)

Rewriting the exponent, we obtain

log f(Y, μ, Σ) = −(1/2) Y^T Σ^{-1} Y + (Σ^{-1}μ)^T Y − (1/2)(log|det(Σ)| + μ^T Σ^{-1} μ + p log 2π).  (1.6.17)

The first two terms on the right in (1.6.17) can be rewritten

−(Σ_{1≤i<j≤p} σ^{ij} Y_i Y_j + (1/2) Σ_{i=1}^p σ^{ii} Y_i²) + Σ_{i=1}^p (Σ_{j=1}^p σ^{ij} μ_j) Y_i,

where Σ^{-1} = ‖σ^{ij}‖, a p × p symmetric matrix with p(p+1)/2 distinct entries, revealing that this is a k = p(p+3)/2 parameter exponential family with statistics (Y_1, ..., Y_p, {Y_i Y_j}_{1≤i≤j≤p}), θ = (μ, Σ), h(Y) = 1, and B(θ) = (1/2)(log|det(Σ)| + μ^T Σ^{-1} μ). It may be shown (Problem 1.6.29) that T (and h) generate this family and that the rank of the family is indeed p(p+3)/2, so that Theorem 1.6.4 applies. If Y_1, ..., Y_n are i.i.d. N_p(μ, Σ), then by our supermodel discussion Y = (Y_1, ..., Y_n)^T follows the k = p(p+3)/2 parameter exponential family with T = (Σ_i Y_i, Σ_i Y_i Y_i^T), generalizing Example 1.6.3. □

1.6.5 Conjugate Families of Prior Distributions

In Section 1.2 we considered beta prior distributions for the probability of success in n Bernoulli trials. This is a special case of conjugate families of priors, families to which the posterior after sampling also belongs. Suppose X_1, ..., X_n is a sample from the k-parameter exponential family (1.6.10) and, as we always do in the Bayesian context, write p(x | θ) for p(x, θ). Then

p(x | θ) = [Π_{i=1}^n h(x_i)] exp{Σ_{j=1}^k η_j(θ) Σ_{i=1}^n T_j(x_i) − nB(θ)},  (1.6.18)
where θ ∈ Θ. To choose a prior distribution for θ, we consider the conjugate family of the model defined by (1.6.18). A conjugate exponential family is obtained from (1.6.18) by letting t_1, ..., t_{k+1} be "parameters" and treating θ as the variable of interest. Let t = (t_1, ..., t_{k+1})^T and

w(t) = ∫···∫ exp{Σ_{j=1}^k η_j(θ)t_j − t_{k+1}B(θ)} dθ_1 ··· dθ_k,  (1.6.19)

Ω = {(t_1, ..., t_{k+1}) ∈ R^{k+1} : 0 < w(t) < ∞},

with integrals replaced by sums in the discrete case. We assume that Ω is nonempty (see Problem 1.6.36). The (k+1)-parameter exponential family given by

π_t(θ) = exp{Σ_{j=1}^k η_j(θ)t_j − t_{k+1}B(θ) − log w(t)},  (1.6.20)

where t = (t_1, ..., t_{k+1}) ∈ Ω, is a conjugate prior to p(x | θ) given by (1.6.18).

Proposition 1.6.1. If p(x | θ) is given by (1.6.18) and π by (1.6.20), then π(θ | x) = π_s(θ), where s = t + a and a = (Σ_{i=1}^n T_1(x_i), ..., Σ_{i=1}^n T_k(x_i), n)^T.

Proof. π(θ | x) ∝ p(x | θ)π(θ)

∝ exp{Σ_{j=1}^k η_j(θ)(Σ_{i=1}^n T_j(x_i) + t_j) − (t_{k+1} + n)B(θ)},  (1.6.21)

where s = (s_1, ..., s_{k+1})^T = (t_1 + Σ_{i=1}^n T_1(x_i), ..., t_k + Σ_{i=1}^n T_k(x_i), t_{k+1} + n)^T and ∝ indicates that the two sides are proportional functions of θ. Because two probability densities that are proportional must be equal, π(θ | x) is the member of the exponential family (1.6.20) given by the last expression in (1.6.21), and our assertion follows. □

Remark 1.6.1. Note that (1.6.21) is an updating formula in the sense that as data x_1, ..., x_n become available, the parameter t of the prior distribution is updated to s = t + a. It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way.

Example 1.6.12. Suppose X_1, ..., X_n is a N(θ, σ_0²) sample, where σ_0² is known and θ is unknown. For n = 1,

p(x | θ) ∝ exp{(θ/σ_0²)x − θ²/(2σ_0²)}.  (1.6.22)

This is a one-parameter exponential family with η(θ) = θ/σ_0² and B(θ) = θ²/(2σ_0²). The conjugate two-parameter exponential family given by (1.6.20) has density

π_t(θ) ∝ exp{(t_1/σ_0²)θ − (t_2/(2σ_0²))θ²}.  (1.6.23)

Upon completing the square, we find that π_t(θ) is defined only for t_2 > 0 and all t_1, and is the N(t_1/t_2, σ_0²/t_2) density. Our conjugate family, therefore, consists of all N(η_0, τ_0²) distributions, where η_0 varies freely and τ_0² is positive. If we start with a N(η_0, τ_0²) prior density and observe Σ x_i = s, we obtain by (1.6.21) the posterior density π_s(θ) with

t_1(s) = η_0 σ_0²/τ_0² + s, t_2(s) = σ_0²/τ_0² + n.  (1.6.24)

Thus, the posterior π(θ | x) is a normal density with mean

μ(s, n) = t_1(s)/t_2(s) = (σ_0²/τ_0² + n)^{-1}[s + η_0 σ_0²/τ_0²]  (1.6.25)

and variance

τ²(n) = σ_0²/t_2(s) = τ_0²/(1 + nτ_0²/σ_0²).  (1.6.26)

Note that we can rewrite (1.6.25) intuitively as

μ(s, n) = w_n η_0 + (1 − w_n)x̄,  (1.6.27)

where

w_n = τ²(n)/τ_0², 1 − w_n = nτ²(n)/σ_0².  (1.6.28)

These formulae can be generalized to the case X_i i.i.d. N_p(θ, Σ_0), 1 ≤ i ≤ n, with Σ_0 known and θ ~ N_p(η_0, τ_0² I), where η_0 varies over R^p, τ_0² is scalar with τ_0 > 0, and I is the p × p identity matrix (Problem 1.6.37). Moreover, it can be shown (Problem 1.6.30) that the N_p(λ_0, Γ) family with λ_0 ∈ R^p and Γ symmetric positive definite is a conjugate family,
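The normal-normal conjugate update above (prior N(η_0, τ_0²), i.i.d. N(θ, σ_0²) data) is easy to sketch in code. The function name, argument names, and numerical values below are ours, chosen only for illustration.

```python
import random

# Sketch (ours) of the conjugate update: prior N(eta0, tau0sq) for theta,
# i.i.d. N(theta, sig0sq) observations. Returns the posterior mean and
# variance via the updated "parameters" t1 + sum(x_i) and t2 + n.
def normal_posterior(xs, sig0sq, eta0, tau0sq):
    n, s = len(xs), sum(xs)
    t1 = eta0 * sig0sq / tau0sq + s     # updated t1
    t2 = sig0sq / tau0sq + n            # updated t2
    return t1 / t2, sig0sq / t2         # posterior mean and variance

random.seed(1)
xs = [random.gauss(2.0, 1.0) for _ in range(50)]   # made-up data
mean, var = normal_posterior(xs, sig0sq=1.0, eta0=0.0, tau0sq=4.0)
```

As the formulas predict, the posterior mean is a weighted average w_n η_0 + (1 − w_n) x̄ with w_n = τ²(n)/τ_0², so with 50 observations it sits close to the sample mean.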
but a richer one than we've defined in (1.6.20) except for p = 1, because N_p(λ_0, Γ) is a p(p+3)/2 rather than a p + 1 parameter family.

Discussion. Note that the uniform U(0, θ) model is not covered by this theory. The natural sufficient statistic max(X_1, ..., X_n), which is one-dimensional whatever be the sample size, is not of the form Σ_{i=1}^n T(X_i). In fact, the family of distributions in this example, U(0, θ), is not exponential. Despite the existence of classes of examples such as these, a theory has been built up, starting with Koopman, Pitman, and Darmois, that indicates that under suitable regularity conditions families of distributions which admit k-dimensional sufficient statistics for all sample sizes must be k-parameter exponential families. Problem 1.6.10 is a special result of this type. Some interesting results and a survey of the literature may be found in Brown (1986).

In the one-dimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. It is easy to see that one can construct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1.6.20). See Problems 1.6.31 and 1.6.32.

Summary. {P_θ : θ ∈ Θ}, Θ ⊂ R^k, is a k-parameter exponential family of distributions if there are real-valued functions η_1, ..., η_k and B on Θ, and real-valued functions T_1, ..., T_k, h on R^q, such that the density (frequency) function of P_θ can be written as

p(x, θ) = h(x) exp[Σ_{j=1}^k η_j(θ)T_j(x) − B(θ)], x ∈ X ⊂ R^q.  (1.6.29)

T(X) = (T_1(X), ..., T_k(X)) is called the natural sufficient statistic of the family. The canonical k-parameter exponential family generated by T and h is

q(x, η) = h(x) exp{T^T(x)η − A(η)},

where A(η) = log ∫···∫ h(x) exp{T^T(x)η} dx in the continuous case, with integrals replaced by sums in the discrete case. The set E = {η : A(η) < ∞} is called the natural parameter space. The set E is convex, and the map A: E → R is convex. If E has a nonempty interior in R^k and η_0 ∈ E, then T(X) has for X ~ P_{η_0} the moment-generating function

M(s) = exp{A(η_0 + s) − A(η_0)}

for all s such that η_0 + s is in E. Moreover, E_{η_0}[T(X)] = Ȧ(η_0) and Var_{η_0}[T(X)] = Ä(η_0), where Ȧ and Ä denote the gradient and Hessian of A.
An exponential family is said to be of rank k if T is k-dimensional and T_1, ..., T_k are linearly independent with positive P_θ probability for some θ ∈ Θ. If P is a canonical exponential family with E open, then the following are equivalent: (i) P is of rank k; (ii) η is identifiable; (iii) Var_η(T) is positive definite; (iv) the map η ↦ Ȧ(η) is 1-1 on E; (v) A is strictly convex on E.

A family F of prior distributions for a parameter vector θ is called a conjugate family of priors to p(x | θ) if the posterior distribution of θ given x is a member of F. The (k+1)-parameter exponential family

π_t(θ) = exp{Σ_{j=1}^k η_j(θ)t_j − B(θ)t_{k+1} − log w},

where

w = ∫···∫ exp{Σ_{j=1}^k η_j(θ)t_j − B(θ)t_{k+1}} dθ

and t = (t_1, ..., t_{k+1}) ∈ Ω = {(t_1, ..., t_{k+1}) ∈ R^{k+1} : 0 < w < ∞}, is conjugate to the exponential family p(x | θ) defined in (1.6.29).

1.7 PROBLEMS AND COMPLEMENTS

Problems for Section 1.1

1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric.

(a) A geologist measures the diameters of a large number n of pebbles in an old stream bed. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean μ and variance σ². He wishes to use his observations to obtain some information about μ and σ² but has in advance no knowledge of the magnitudes of the two parameters.

(b) A measuring instrument is being used to obtain n independent determinations of a physical constant μ. Suppose that the measuring instrument is known to be biased to the positive side by 0.1 units. Assume that the errors are otherwise identically distributed normal random variables with known variance.
. Ab. 1'2) and we observe Y X. i=l (c) X and Y are independent N (I' I . .2) and N (1'2. .1(c). 2.. and Po is the distribution of X (b) Same as (a) with Q = (X ll ..Section 1. . 0. are sampled at random from a very large population.Xpb . i = 1..··. ... each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others. . . (e) The parametrization of Problem l.ap ) : Lai = O}.for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case.. v. = (al. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. . . Two groups of nl and n2 individuals.p. B = (1'1..Xp ).. .. Show that Fu+v(t) < Fu(t) for every t.) (a) The parametrization of Problem 1. (d) Xi. ~ N(l'ij.7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. Which of the following parametrizations are identifiable? (Prove or disprove.) (a) Xl. Can you perceive any difficulties in making statements about f1.. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest.. respectively.. ..2). e (e) Same as (d) with (Qt.1.t..1 describe formaHy the foHowing model...1. = o.. (a) Let U be any random variable and V be any other nonnegative random variable. then X is (b) As in Problem 1. Once laid. . . j = 1.1(d)."" a p . . 3..Xp are independent with X~ ruN(ai + v.. . Q p ) {(al. . Q p ) and (A I . .) < Fy(t) for every t. Are the following parametrizations identifiable? (Prove or disprove.. restricted to p = (all' .. 4. Each . .2) where fLij = v + ai + Aj.2 ). (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. bare independeht with Xi.1. (b) The parametrization of Problem 1. 0.2 ) and Po is the distribution of Xll. At. .
3(2) and 1.. ..' I .. . Suppose the effect of a treatment is to increase the control response by a fixed amount (J. 5.. 8. . .c bas the same distribution as Y + c.c) = PrY > c .. }.. nk·mI .2. Consider the two sample models of Examples 1. . Mk.c has the same distribution as Y + c.1.k + VI + . Let Po be the distribution of a treatment response.9 and they oecur with frequencies p(0.. + Vk + P = 1. k.:". • ... 0 < J. In each of k subsequent years the number of students graduating and of students dropping out is recorded. + mk + r = n 1. mk·r. . and Performance Criteria Chapter 1 ":.··· .0. J. Both Y and p are said to be symmetric about c. V k mk P r where J. 7. M k = mk] I! n. e = = 1 if X < 1 and Y {J.9).) (a) p.t) = 1 ..P(Y < c . Goals.. 2. Show that Y . I..4(1).I nl .1). I ! fl...p(0. (0). P8[N I . • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour.1. ..t) fj>r all t.1.0'2) (d) Suppose the possible control responses in an experiment are 0. It is known that the drug either has no effect or lowers blood pressure. . the density or frequency function p of Y satisfies p( c + t) ~ p( c .2). is the distribution of X e = (0. i = 1. .. Vi = (1 11")(1 ~ v)iI V for i = 1.p(0. is the distribution of X when X is unifonn on (0.0... . 0'2). The simplification J. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large. Let Y and Po is the distribution of Y. but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.k is proposed where 0 < 71" < I.0). . 68 Statistical Models. Hint: IfY ..· . 0 < Vi < 1. 0 = (1'..... ml .. Let N i be the number dropping out and M i the number graduating during year i.. . . (e) Suppose X ~ N(I'.Li < + ..Lk VI n".. then P(Y < t + c) = P( Y < t . VI. 
The number n of graduate students entering a certain department is recorded. Vk) is unknown. o < J1 < 1. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour... . Which of the following models are regular? (Prove or disprove. whenXis unifonnon {O. I nl. I . + fl. 0 < V < 1 are unknown.2. + nk + ml + . if and only if.Li = 71"(1 M)iIJ. =X if X > 1.Nk = nk. .O}.... 1 < i < k and (J = (J. What assumptions underlie the simplification? 6.  ...LI ni + .t). = n11MI = mI.. (b)P..L.LI. The following model is proposed.
."" Yn ). N} are identifiable. j = 0..Xn are observed i. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. C(t} = F(tjo). log y' satisfy a shift model? = Xc. . N}. e(j) > 0.2x yield the same distribution for the data (Xl. . That is. C). eY and (c) Suppose a scale model holds for X. C are independent.tl) implies that o(x) tl. C(·) = F(· .d.. The Lelunann 1\voSample Model.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X).i.'" .. according to the distribution of X. o(x) ~ 21' + tl . The Scale Model. e(j). For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1.. (b) In part (a). . 1 < i < n. eX o. . .Znj?' (a) Show that ({31. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. . In Example 1. C). . Show that if we assume that o(x) + x is strictly increasing. Does X' Y' = yc satisfy a scale model? Does log X'.zp... t > O.. (]"2) independent.. . 0 'Lf 'Lf op(j) = I. . (Yl. the two cases o(x) . 10.. Sbow that {p(j) : j = 0. Y = T I Y > j]. . e) : p(j) > 0. 0 < j < N. I(T < C)) = j] PIC = j] PIT where T. Y. Hint: Consider "hazard rates" for Y min(T. fJp) are not identifiable if n pardffieters is larger than the number of observations. . and (I'... ci '" N(O.. ."').tl) does not imply the constant treatment effect assumption. . . p(j). > 0. . .tl and o(x) ~ 21' + tl. Suppose X I."') and o(x) is continuous. (b) Deduce that (th. . N. Therefore. . I} and N is known.3 let Xl. Let c > 0 be a constant.N = 0. X m and Yi.. then satisfy a scale model with parameter e. o (a) Show that in this case. are not collinear (linearly independent). Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. then C(·) ~ F(. . Yn denote the survival times of two groups of patients receiving treatments A and B.6. e) vary freely over F ~ {(I'..Section 1. 
Sx(t) = P(X > t) = 1 − F(t) and Sy(t) = P(Y > t) = 1 − G(t), t > 0, are called the survival functions.

(a) Survival beyond time t is modeled to occur if the events T_1 > t, ..., T_k > t all occur, where T_1, ..., T_k are unobservable and i.i.d. as T with survival function S_0, with k = a and b for treatments A and B, respectively. Show that Sy(t) = S_0^k(t), t > 0.

(b) By extending the exponent from the rationals to Δ ∈ (0, ∞), we have the Lehmann model

Sy(t) = Sx^Δ(t), t > 0.  (1.7.1)

(c) Suppose that T and Y have densities f_0(t) and g(t). Then h_0(t) = f_0(t)/S_0(t) and h_Y(t) = g(t)/Sy(t) are called the hazard rates of T and Y. Show that h_Y(t) = Δh_0(t), t > 0, if and only if Sy(t) = S_0^Δ(t): a proportional hazard model. Note that P(T > t) = S_0(t).

(d) Show that if S_0 is continuous, then X' = −log S_0(X) and Y' = −log S_0(Y) follow an exponential scale model (see Problem 1.1.11) with scale parameter Δ. Hint: S_0(T) has a U(0, 1) distribution; thus, −log S_0(T) has an exponential distribution.

13. The Cox proportional hazard model. Let T denote a survival time with density f_0(t) and hazard rate h_0(t) = f_0(t)/P(T > t). Let f(t | z_i) denote the density of the survival time Y_i of a patient with covariate vector z_i, and define the regression survival and hazard functions of Y_i as

Sy(t | z_i) = ∫_t^∞ f(y | z_i) dy, h(t | z_i) = f(t | z_i)/Sy(t | z_i).

The Cox proportional hazard model is defined as

h(t | z) = h_0(t) exp{g(β, z)},  (1.7.2)

where h_0(t) is called the baseline hazard function and g is known except for a vector β = (β_1, ..., β_p)^T of unknowns. The most common choice of g is the linear form g(β, z) = z^T β.

(a) Show that (1.7.2) is equivalent to Sy(t | z) = S_0^{Δ(z)}(t) with Δ(z) = exp{g(β, z)}.

(b) Assume (1.7.2) and that F_0(t) = P(T ≤ t) is known and strictly increasing. Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on h_0(t). Hint: See Problem 1.1.12.

(c) Under the assumptions of (b) above, show that there is an increasing function Q*(t) such that if Y_i* = Q*(Y_i), then Y_i* = g(β, z_i) + ε_i for some appropriate ε_i. Specify the distribution of ε_i. Hint: By Problem 1.1.12, −log S_0(T) has an exponential distribution.
Section 1.. .O) 1 .2 with assumptions (t){4). Show how to choose () to make J.000 is the minimum monthly salary for state workers. the parameter of interest can be characterl ized as the median v = F. Examples are the distribution of income and the distribution of wealth. In Example 1. Ca) Find the posterior frequency function 1f(0 I x). . Let Xl.11 and 1. G.2 1. . .G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R.1. 15.12.2) distribution with ..7 Problems and Complements 71 Hint: See Problems 1.. When F is not symmetric. Yn be i. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x." .6 Let 1f be the prior frequency function of 0 defined by 1f(I1. F.d. and for this reason.4 1 0. J1 and v are regarded as centers of the distribution F.Xm be i. where () > 0 and c = 2. the median is preferred in this case. 'I/1(±oo) = ±oo. X n are independent with frequency fnnction p(x posterior1r(() I Xl. 14.8 0.2 unknown. wbere the model {(F. Consider a parameter space consisting of two points fh and ()2. x> c x<c 0. '1/1' > 0.1.p(t))...L . Find the . Problems for Section 1.1. Generally. have a N(O. 1) distribution. still identifiable? If not.(x/c)e.i.i.2 0. and suppose that for given (). ) = ~.5) or mean I' = xdF(x) = Fl(u)du.p and ~ are identifi· able. Here is an example in which the mean is extreme and the median is not. and Zl and Z{ are independent random variables.. what parameters are? Hint: Ca) PIXI < tl = if>(. .. 1f(O2 ) = ~.d. Are. (b) Suppose X" . have a N(O. (b) Suppose Zl and Z. (a) Suppose Zl' Z..p and C. Merging Opinions. Yl .v arbitrarily large. Find the median v and the mean J1 for the values of () where the mean exists. . Show that both . an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0. Observe that it depends only on l:~ I Xi· I 0). 
J1 may be very much pulled in the direction of the longer tail of the density.l (0.X n )..
4. 4'2'4 (b) Relative to (a). .'" . (b) Find the posterior density of 8 when 71"( OJ = 302 . 7l" or 11"1.75. Unllormon {'13} .)=1. .=l1r(B i )p(x I 8i ). . 0 (c) Find E(81 x) for the two priors in (a) and (b).'" 1rajI Show that J + a. (3) Find the posterior density of 8 when 71"( 0) ~ 1. where a> 1 and c(a) .. " J = [2:. X n be distributed as where XI. 7r (d) Give the values of P(B n = 2 and 100. 3. .0 < 0 < 1.m+I. Goals... Then P. " ••. c(a). for given (J = B. k = 0. l3(r. Show that the probability of this set tends to zero as n t 00.8).2. (J. . IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is .0 < B < 1.Xn are natural numbers between 1 and Band e= {I. 'IT . 2.[X ~ k] ~ (1. .. Compare these B's for n = 2 and 100. < 0 < 1. "X n ) = c(n 'n+a m) ' J=m. Assume X I'V p(x) = l:.. (3) Suppose 8 has prior frequency. does it matter which prior. ) ~ . Consider an experiment in which.O)kO. .. . Suppose that for given 8 = 0.. .72 Statistical Models. }.. 71"1 (0 2 ) = . ':.Xn = X n when 1r(B) = 1. . 1. and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . (d) Suppose Xl. Find the posteriordensityof8 given Xl = Xl.. ~ = (h I L~l Xi ~ = . the outcome X has density p(x OJ ~ (2x/O'). Let X I. what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. Let 71" denote a prior density for 8.. . 3.. is used in the fonnula for p(x)? 2. This is called the geometric distribution (9(0)).. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O. For this convergence. I XI . .2...25. 0 < x < O. 1r(J)= 'u . 
X n are independent with the same distribution as X. .
X n+ k ) be a sample from a population with density f(x I 0). S.. Let (XI. Show in Example 1. Show rigorously using (1.. (b) Suppose that max(xl. where E~ 1 Xi = k.f) = [~..7 Problems . .1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •.b> 1. W. Assume that A = {x : p( x I 8) > O} does not involve O. . } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . .J:n).. 11'0) distribution. 9. a regular model and integrable as a function of O. . I Xn+k) given . (a) Show that the family of priors E... Va' W t . Show that a conjugate family of distributions for the Poisson family is the gamma family.." .8) that if in Example 1. X n+ Il . where VI.2. .. f(x I f).1= n+r+s _ . are independent standard exponential. is the standard normal distribution function and n _ T 2 x + n+r+s' a J. If a and b are integers. Xn) + 5. 7.1. then !3( a. In Example 1...• X n = Xn. f3(r. Suppose Xl.. 6."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl.2.0 E Let 9 have prior density 1r. . 11'0) distribution.. .) = n+r+s Hint: Let!3( a. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N ."'tF·t'.. Justify the following approximation to the posterior distribution where q. where {i E A and N E {l. s). Xn) = Xl = m for all 1 as n + 00 whatever be a.. ZI. b) denote the posterior distribution..2..• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. D = NO has a B(N.Xn is a sample with Xi '"'' p(x I (}). ..2. . X n = Xn is that of (Y. b) is the distribution of (aV loW)[1 + (aV IbW)]t. . 10.··. .and Complements 73 wherern = max(xl.'" .(1.. Interpret this result.1. .. XI = XI. Show that ?T( m I Xl.1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.Section 1. .. 
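The gamma/Poisson conjugacy asserted above can be verified numerically: with a Γ(a, b) prior (density ∝ θ^{a−1} e^{−bθ}) and an i.i.d. Poisson(θ) sample, the posterior should be Γ(a + Σx_i, b + n). The sketch below is our check with made-up data; it confirms that prior × likelihood and the claimed posterior differ only by a constant in θ.

```python
import math

# Check (ours) of gamma/Poisson conjugacy: the log of prior * likelihood minus
# the log of the claimed Gamma(a + sum x, b + n) posterior must not depend on
# theta, i.e., the two densities are proportional.
def log_gamma_pdf(theta, a, b):
    return a * math.log(b) - math.lgamma(a) + (a - 1) * math.log(theta) - b * theta

def log_poisson_lik(theta, xs):
    return sum(x * math.log(theta) - theta - math.lgamma(x + 1) for x in xs)

xs = [3, 1, 4, 2, 2]            # made-up Poisson counts
a, b = 2.0, 1.0                 # made-up prior parameters
a_post, b_post = a + sum(xs), b + len(xs)

t1, t2 = 0.8, 2.5               # two arbitrary theta values
diff1 = (log_gamma_pdf(t1, a, b) + log_poisson_lik(t1, xs)) - log_gamma_pdf(t1, a_post, b_post)
diff2 = (log_gamma_pdf(t2, a, b) + log_poisson_lik(t2, xs)) - log_gamma_pdf(t2, a_post, b_post)
print(diff1, diff2)             # equal: the difference is constant in theta
```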
Show that the conditional distribution of (6. n. Next use the central limit theorem and Slutsky's theorem..n.c(b.
.)T. In a Bayesian model whereX l .. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { . .. compute the predictive and n + 00. 0 > O. N(I". . Letp(x I OJ =exp{(xO)}.d.~vO}. •• • . X 1.) pix I 0) ~ ({" .2) and we formally put 7t(I".. "5) and N(Oo.. has density r("" a) IT • fa(u) ~ n' r(. 9= (OJ. 01 )=1 j=l Uj < 1..J u. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I.. 8 2 ) is such that vn(p.6 ""' 1r. 1 X n . Find the posterior distribution 7f(() I x) and show that if>' is an integer. the predictive distribution is the marginal distribution of X n + l . Find I 12.. Next use Bayes rule.d. I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. i). 13. Q'j > 0. . 11. X n are i.. Show that if Xl.d. X and 5' are independent with X ~ N(I". (12). j=1 '\' • = 1.O. 0<0. .i.. .9). given (x. I(x I (J). 52) of J1. .. 'V . <0<x and let7t(O) = 2exp{20}.i. Here HinJ: Given I" and ". 14. ."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. and () = a..J t • I X.2 is (called) the precision of the distribution of Xi.~X) '" tnI. vB has a XX distribution. . 0> 0 a otherwise. Xl . The Dirichlet distribution. posterior predictive distribution. .. a = (a1. Let N = (N l . 0 > O. . unconditionally. (N. . (. Note that.O the posterior density 7t( 0 Ix). Goals. . (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. .. . T6) densities. (b) Discuss the behavior of the two predictive distributions as 15.) be multinomial M(n../If (Ito. The posterior predictive distribution is the conditional distribution of X n +l given Xl.) u/' 0 < cx) L".") ~ . X n . given x. and Performance Criteria Chapter 1 + nand ({. . Suppose p(x I 0) is the density of i... N. O(t + v) has a X~+n distribution.i. .74 where N' ~ N Statistical Models.~l ..\ > 0. x> 0. . . 82 I fL. . x n ). D(a). (c) Find the posterior distribution of 0".... j=1 • .. v > 0. The Dirichlet distribution is a conjugate prior for the multinomial.Xn.1 < j < T. LO. 
.Xn + l arei. where Xi known. (a) If f and 7t are the N(O... = 1. L.. Q'r)T. < 1.
. (b)p=lq=. a2. . O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x. 1. a) is given by 0... The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. Problems for Section 1. . .3. Find the Bayes rule for case 2. 8 = 0 or 8 > 0 for some parameter 8. (c) Find the minimax rule among the randomized rules.1. Compute and plot the risk points (a)p=q= . 0.3.5 and (ii) l' = 0. Find the Bayes rule when (i) 3... 1957) °< °= a 0..7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a).3. then the posteriof7r(0 where n = (nt l · · · .1. = 11'. respectively and suppose . a3. (d) Suppose that 0 has prior 1C(OIl (a).5. or 0 > a be penoted by I.p) (I .5. the possible actions are al. (J2. 159 of Table 1.1.3.5. 1C(O. Suppose the possible states of nature are (Jl.3 IN ~ n) is V( a + n).q) and let when at. and the loss function l((J.) = 0. (J) given by O\x a (I .3.Section 1. n r ). (c) Find the minimax rule among bt. (d) Suppose 0 has prior 1C(01) ~ 1'. 159 for the preceding case (a).3.159 be the decision rules of Table 1. . I b"9 }.1. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St. . Let the actions corresponding to deciding whether the loss function is given by (from Lehmann. (b) Find the minimax rule among {b 1 . . = 0. I.. See Example 1. 1C(02) l' = 0. . Suppose that in Example 1..
θ\a      a₁     a₂     a₃
θ < 0    0      c     b + c
θ = 0    b      0      b
θ > 0   b + c   c      0

where b and c are positive. Suppose X = (X₁, ..., Xₙ) is a N(θ, 1) sample and consider the decision rule

δ(X) = a₁ if X̄ < r; a₂ if r ≤ X̄ ≤ s; a₃ if X̄ > s.

(a) Show that the risk function is given by

R(θ, δ) = cΦ̄(√n(r − θ)) + bΦ̄(√n(s − θ)),   θ < 0
        = bΦ(√n r) + bΦ̄(√n s),             θ = 0
        = cΦ(√n(s − θ)) + bΦ(√n(r − θ)),   θ > 0

where Φ̄ = 1 − Φ and Φ is the N(0, 1) distribution function.
(b) Plot the risk function when b = c = 1, n = 1, and (i) r = s = 1, (ii) r = −s = −1. For what values of θ does the procedure with r = s = 1 have smaller risk than the procedure with r = −s = −1?

4. Stratified sampling. We want to estimate the mean μ = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e.g., geographic locations or age groups). Within the jth stratum we have a sample of i.i.d. random variables X₁ⱼ, ..., X_{nⱼ j}, j = 1, ..., s, and a stratum sample mean X̄ⱼ, j = 1, ..., s. We assume that the s samples from different strata are independent. Suppose that the jth stratum has 100pⱼ% of the population and that the jth stratum population mean and variance are μⱼ and σⱼ², where we assume that pⱼ and σⱼ², 1 ≤ j ≤ s, are known (estimates will be used in a later chapter). Let N = Σⱼ nⱼ and consider the two estimators

μ̂₁ = N⁻¹ Σⱼ₌₁ˢ Σᵢ₌₁^{nⱼ} X_{ij},   μ̂₂ = Σⱼ₌₁ˢ pⱼ X̄ⱼ.

(a) Compute the biases, variances, and MSEs of μ̂₁ and μ̂₂. How should nⱼ, 1 ≤ j ≤ s, be chosen to make μ̂₁ unbiased?
(b) Neyman allocation. Show that the strata sample sizes that minimize MSE(μ̂₂) are given by

nₖ = N pₖσₖ / Σⱼ₌₁ˢ pⱼσⱼ,  1 ≤ k ≤ s.   (1.7.3)
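Neyman allocation can be verified directly: since μ̂₂ is unbiased, MSE(μ̂₂) = Σⱼ pⱼ²σⱼ²/nⱼ, and the allocation proportional to pⱼσⱼ beats, for example, proportional allocation nⱼ = pⱼN. A sketch with made-up strata (the pⱼ and σⱼ values are hypothetical):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])      # hypothetical stratum weights
sigma = np.array([1.0, 4.0, 2.0])  # hypothetical stratum SDs
N = 1000

def mse(n):
    # Var(sum_j p_j * Xbar_j) = sum_j p_j^2 sigma_j^2 / n_j
    return np.sum(p**2 * sigma**2 / n)

n_neyman = N * p * sigma / np.sum(p * sigma)  # allocation (1.7.3)
n_prop = N * p                                # proportional allocation

print(mse(n_neyman) <= mse(n_prop))  # True: Neyman allocation is optimal
```

Part (c) of the problem quantifies the gap: the difference of the two MSEs is N⁻¹Σⱼ pⱼ(σⱼ − σ̄)², which vanishes only when all stratum SDs are equal.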
Hint: You may use a Lagrange multiplier.
(c) Show that MSE(μ̂₁) with nₖ = pₖN, minus MSE(μ̂₂) with nₖ given by (1.7.3), is N⁻¹ Σⱼ₌₁ˢ pⱼ(σⱼ − σ̄)², where σ̄ = Σⱼ₌₁ˢ pⱼσⱼ.

5. Suppose that n is odd and that X₁, ..., Xₙ is a sample from a population with distribution F. Let X̄ and X̃ denote the sample mean and the sample median, respectively. We want to estimate "the" median ν of F, where ν is defined as a value satisfying P(X ≤ ν) ≥ 1/2 and P(X ≥ ν) ≥ 1/2; recall that X̃ is the median of the sample.
(a) Find the MSE of X̃ when
(i) F is discrete with P(X = a) = P(X = c) = p, P(X = b) = 1 − 2p, a < b < c, 0 < p < 1/2. The answer is MSE(X̃) = [(a − b)² + (c − b)²]P(S ≥ k), where k = .5(n + 1) and S ~ B(n, p). Hint: Note that the distribution of X̃ involves Bernoulli and multinomial trials.
(ii) F is uniform, U(0, 1). Hint: See Problem B.2.9.
(iii) F is normal, N(0, 1). Use a numerical integration package.
(b) Compute the relative risk RR = MSE(X̃)/MSE(X̄) in question (i) when b = 0, a = −c, and n = 15; plot RR for p = .1, .2, .3, .4, .45.
(c) Same as (b) except when n = 1.
(d) Find E|X̃ − b| when n = 1 and compare it to E|X̄ − b| for the situation in (i).
(e) Compute the relative risks MSE(X̃)/MSE(X̄) in questions (ii) and (iii).

6. Let X₁, ..., Xₙ be a sample from a population with values 0, 1, and 2, where each value has probability 1/3, and suppose that n is odd. Let X̄ and X̃ denote the sample mean and the sample median, respectively.
(a) Find MSE(X̃) and the relative risk RR = MSE(X̃)/MSE(X̄). Hint: Use Problem 1.3.5; the distribution of X̃ involves Bernoulli and multinomial trials.
(b) Evaluate RR when n = 1.

7. Suppose X₁, ..., Xₙ are i.i.d. as X ~ F. If the parameters of interest are the population mean and median of Xᵢ − b, let X̄_b and X̃_b denote the sample mean and the sample median of the sample X₁ − b, ..., Xₙ − b. Show that MSE(X̄_b) and MSE(X̃_b) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). Hint: Set θ = 0 without loss of generality.
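For case (iii) of Problem 5 a quick Monte Carlo comparison already shows the qualitative answer: for normal data the sample mean beats the sample median, with relative risk approaching π/2 as n grows. A sketch, assuming NumPy (sample sizes and seeds are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 15, 20000                      # n odd, as the problem assumes
x = rng.standard_normal((reps, n))       # N(0, 1) population, true median 0
mse_mean = np.mean(np.mean(x, axis=1) ** 2)
mse_median = np.mean(np.median(x, axis=1) ** 2)
print(mse_mean < mse_median)             # True for the normal
print(mse_median / mse_mean)             # roughly pi/2 for large n
```

For the discrete three-point population of case (i) the same simulation scheme applies, with the exact answer available from the B(n, p) formula given in the problem.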
8. Let X₁, ..., Xₙ be a sample from a population with variance σ², 0 < σ² < ∞.
(a) Show that s² = (n − 1)⁻¹ Σᵢ₌₁ⁿ (Xᵢ − X̄)² is an unbiased estimator of σ². Hint: Write (Xᵢ − X̄)² = ([Xᵢ − μ] − [X̄ − μ])², then expand (Xᵢ − X̄)² keeping the square brackets intact.
(b) Suppose Xᵢ ~ N(μ, σ²).
(i) Show that MSE(s²) = 2(n − 1)⁻¹σ⁴.
(ii) Let σ̂²_c = c Σᵢ₌₁ⁿ (Xᵢ − X̄)². Show that the value of c that minimizes MSE(σ̂²_c) is (n + 1)⁻¹.
Hint for question (b)(i): Recall (Theorem B.3.3) that Σᵢ(Xᵢ − X̄)²/σ² has a χ²_{n−1} distribution. You may use the fact that E(Xᵢ − μ)⁴ = 3σ⁴.

9. In Problem 1.3.3(a) with b = c = 1 and n = 1, suppose θ is discrete with frequency function π(−1) = π(0) = π(1) = 1/3. Compute the Bayes risk of δ_{r,s} when
(a) r = s = 1,
(b) r = −s = −1.
Which one of the rules is the better one from the Bayes point of view?

10. Let θ denote the proportion of people working in a company who have a certain characteristic (e.g., being left-handed). It is known that in the state where the company is located, 10% have the characteristic. A person in charge of ordering equipment needs to estimate θ and uses

θ̃ = (.2)(.10) + (.8)p̂

where p̂ = X/n is the proportion with the characteristic in a sample of size n from the company. Find MSE(θ̃) and MSE(p̂). If the true θ is θ₀, for what θ₀ is MSE(θ̃)/MSE(p̂) < 1? Give the answer for n = 25 and n = 100.

11. A decision rule δ is said to be unbiased if

E_θ(l(θ, δ(X))) ≤ E_θ(l(θ′, δ(X))) for all θ, θ′ ∈ Θ.

(a) Show that if θ is real and l(θ, a) = (θ − a)², then this definition coincides with the definition of an unbiased estimate of θ.
(b) Show that if we use the 0-1 loss function in testing, then a test function is unbiased in this sense if, and only if, the power function, defined by β(θ, δ) = E_θ(δ(X)), satisfies

β(θ′, δ) ≥ sup{β(θ, δ) : θ ∈ Θ₀} for all θ′ ∈ Θ₁.
12. In Example 1.1.1, consider the loss function (1.3.1) and let δₖ be the decision rule "reject the shipment iff X ≥ k."
(a) Derive the risk function R(θ, δₖ).
(b) If N = 10 and k₀ = 3, plot R(θ, δₖ) as a function of θ.
(c) Same as (b) except k = 2.

13. Consider the estimator μ̂_w = wμ₀ + (1 − w)X̄. If n, σ², and Δ = |μ − μ₀| are known,
(a) find the value w₀ of w that minimizes MSE(μ̂_w);
(b) find the minimum relative risk of μ̂_{w₀} to X̄.

14. Find the set of μ where MSE(μ̂_w) < MSE(X̄). Your answer should depend on n, σ², and Δ = |μ − μ₀|.

15. Suppose that P_{θ₀}(B) = 0 for some event B implies that P_θ(B) = 0 for all θ ∈ Θ. Further suppose that l(θ₀, a₀) = 0. Show that the procedure δ(X) ≡ a₀ is admissible.

16. A (behavioral) randomized test of a hypothesis H is defined as any statistic φ(X) such that 0 ≤ φ(X) ≤ 1. The interpretation of φ is the following. If X = x and φ(x) = 0, we decide Θ₀; if φ(x) = 1, we decide Θ₁; but if 0 < φ(x) < 1, we perform a Bernoulli trial with probability φ(x) of success and decide Θ₁ if we obtain a success and decide Θ₀ otherwise.

17. Suppose that U ~ U(0, 1) and is independent of X. Given 0 < u < 1, define the nonrandomized test δ_u by

δ_u(X) = 1 if φ(X) ≥ u; = 0 if φ(X) < u.

If U = u, use the test δ_u. Show that δ_U agrees with φ in the sense that, unconditionally,

P_θ[δ_U(X) = 1] = 1 − P_θ[δ_U(X) = 0] = E_θ(φ(X)).

18. Convexity of the risk set. Suppose that the set of decision procedures is finite. Show that if δ₁ and δ₂ are two randomized procedures, then, given 0 < α < 1, there is a randomized procedure δ₃ such that

R(θ, δ₃) = αR(θ, δ₁) + (1 − α)R(θ, δ₂) for all θ.

Compare δ₂ and δ₃.
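The auxiliary-uniform construction of the nonrandomized test δ_u can be checked by simulation: averaging the indicator 1{φ(X) ≥ U} over U recovers E φ(X) exactly. A sketch, with an arbitrary made-up test function φ:

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 200000
x = rng.standard_normal(reps)
phi = 1 / (1 + np.exp(-x))        # an arbitrary test function, 0 < phi < 1
u = rng.uniform(size=reps)        # U ~ U(0, 1), independent of X
delta = (phi >= u).astype(float)  # delta_u(X) = 1 iff phi(X) >= u
print(abs(delta.mean() - phi.mean()))  # ~0
```

The key identity is E[1{U ≤ φ(X)} | X] = φ(X), so the unconditional rejection probability of δ_U equals the expected value of the randomized test.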
19. Consider a decision problem with the possible states of nature θ₁ and θ₂ and possible actions a₁ and a₂. Suppose the loss function l(θ, a) is given by

θ\a   a₁   a₂
θ₁    0    3
θ₂    2    1

Let X be a random variable with probability function p(x | θ) given by

θ\x    0     1
θ₁    0.8   0.2
θ₂    0.4   0.6

(a) Compute and plot the risk points of the nonrandomized decision rules.
(b) Give and plot the risk set S.
(c) Suppose θ has the prior distribution defined by π(θ₁) = 0.1, π(θ₂) = 0.9. What is the Bayes decision rule?
(d) Give the minimax rule among the nonrandomized decision rules.
(e) Give the minimax rule among the randomized decision rules.

Problems for Section 1.4

1. In Example 1.4.1 calculate explicitly the best zero intercept linear predictor, its MSPE, and the ratio of its MSPE to that of the best and best linear predictors.

2. An urn contains four red and four black balls. Four balls are drawn at random without replacement. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn.
(a) Find the best predictor of Y given Z, the best linear predictor, and the best zero intercept linear predictor.
(b) Compute the MSPEs of the predictors in (a).

3. In Problem B.1.7 find the best predictors of Y given X and of X given Y and calculate their MSPEs.

4. Let U₁, U₂ be independent standard normal random variables and set Z = U₁² + U₂², Y = U₁. What is the best predictor of Y given Z? Is Z of any value in predicting Y?

5. Give an example in which Z can be used to predict Y perfectly, but Y is of no value in predicting Z in the sense that Var(Z | Y) = Var(Z).

6. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor of Y given Z predicts Y perfectly.

7. Let Y be any random variable and let R(c) = E(|Y − c|) be the mean absolute prediction error. Show that either R(c) = ∞ for all c or R(c) is minimized by taking c to be any number such that P[Y ≥ c] ≥ 1/2 and P[Y ≤ c] ≥ 1/2. A number satisfying these restrictions is called a median of (the distribution of) Y. The midpoint of the interval of such c is called the conventionally defined median, or simply just the median.
Hint: If c < c₀,

E|Y − c₀| = E|Y − c| + (c − c₀){P[Y > c] − P[Y ≤ c]} + 2E[(c₀ − Y)1(c < Y < c₀)].
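The finite decision problem above can be worked through by brute force: with two observations values and two actions there are only four nonrandomized rules, and their risk points and Bayes risks can be tabulated directly. A sketch using the loss, sampling, and prior values as reconstructed here (treat them as illustrative):

```python
import numpy as np
from itertools import product

loss = {('t1', 'a1'): 0, ('t1', 'a2'): 3,
        ('t2', 'a1'): 2, ('t2', 'a2'): 1}
p = {'t1': [0.8, 0.2], 't2': [0.4, 0.6]}   # p(x=0|theta), p(x=1|theta)
prior = {'t1': 0.1, 't2': 0.9}

for rule in product(['a1', 'a2'], repeat=2):   # action for x=0 and x=1
    risk = {t: sum(p[t][x] * loss[(t, rule[x])] for x in (0, 1))
            for t in ('t1', 't2')}
    bayes = sum(prior[t] * risk[t] for t in risk)
    print(rule, risk, round(bayes, 3))
```

The rule with the smallest printed Bayes risk is the Bayes rule among nonrandomized procedures; plotting the four risk points and taking their convex hull gives the risk set S of part (b).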
8. Let Y have a N(μ, σ²) distribution.
(a) Show that E(|Y − c|) = σQ(|c − μ|/σ), where Q(t) = 2[φ(t) + tΦ(t)] − t and φ, Φ are the N(0, 1) density and distribution function.
(b) Show directly that μ minimizes E(|Y − c|) as a function of c.

9. Suppose that Z has a density p, which is symmetric about c and which is unimodal; that is, p(z) is nonincreasing for z ≥ c.
(a) Show that P[|Z − t| ≤ s] is maximized as a function of t for each s > 0 by t = c. Hint: p(c + z) = p(c − z) for all z.
(b) Suppose that (Z, Y) has a bivariate normal distribution and that if we observe Z = z and predict μ(z) for Y, our loss is 1 unit if |μ(z) − Y| > s, and 0 otherwise. Show that the predictor that minimizes our expected loss is again the best MSPE predictor.

10. If Y and Z are any two random variables, exhibit a best predictor of Y given Z for mean absolute prediction error.

11. Suppose that Z has a density p, which is symmetric about c. Show that c is a median of Z.

12. Show that if (Z, Y) has a bivariate normal distribution, the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error.

13. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. Suppose that Z, Y are measurements on such a variable for a randomly selected father and son, respectively. Let Z′, Y′ be the corresponding genetic components and Z″, Y″ the environmental components, so that Z = Z′ + Z″, Y = Y′ + Y″, where (Z′, Y′) has a N(μ, μ, σ², σ², ρ) distribution and Z″, Y″ are N(ν, τ²) variables independent of each other and of (Z′, Y′).
(a) Show that the relation between Z and Y is weaker than that between Z′ and Y′; that is, |Corr(Z, Y)| < |ρ|.
(b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z′ to predict Y′.

14. Let Z₁ and Z₂ be independent and have exponential distributions with density λe^{−λz}, z > 0. Define Z = Z₂ and Y = Z₁ + Z₁Z₂. Find
(a) the best MSPE predictor E(Y | Z = z) of Y given Z = z;
(b) E(E(Y | Z));
(c) Var(E(Y | Z));
(d) Var(Y | Z = z);
(e) E(Var(Y | Z));
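The closed form in Problem 8(a) is easy to check against simulation, using Q(t) = 2[φ(t) + tΦ(t)] − t as reconstructed here. A sketch assuming SciPy for φ and Φ (the μ, σ, c values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, c = 1.0, 2.0, 2.5
t = abs(c - mu) / sigma
Q = 2 * (norm.pdf(t) + t * norm.cdf(t)) - t   # Q(t) = 2[phi(t) + t*Phi(t)] - t

rng = np.random.default_rng(3)
y = rng.normal(mu, sigma, 400000)
print(abs(np.mean(np.abs(y - c)) - sigma * Q))  # ~0
```

Since Q is increasing in t ≥ 0, the formula also makes Problem 8(b) transparent: E|Y − c| is minimized at c = μ.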
(f) the best linear MSPE predictor of Y based on Z = z.
Hint: Recall that E(Z₁) = 1/λ and Var(Z₁) = 1/λ².

15. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio η²_{ZY}; that is,

η²_{ZY} = 1 − E[Y − μ(Z)]²/Var(Y) = Var(μ(Z))/Var(Y),

where μ(z) = E(Y | Z = z). (See Pearson, 1905, and Doksum and Samarov, 1995, on estimation of η²_{ZY}.)
(a) Show that η²_{ZY} ≥ ρ²_{ZY}, where ρ²_{ZY} is the population multiple correlation coefficient of Remark 1.4.3.
(b) Show that if Z is one-dimensional and h is a 1-1 increasing transformation of Z, then η²_{h(Z)Y} = η²_{ZY}; that is, η²_{ZY} is invariant under such h.

16. Show that

Var(μ(Z))/Var(Y) = Corr²(Y, μ(Z)) = max_g Corr²(Y, g(Z)),

where g(Z) stands for any predictor. Hint: See Problem 1.4.15.

17. Let μ_L(z) denote the best linear MSPE predictor.
(a) Show that ρ²_{ZY} = Corr²(Y, μ_L(Z)) = max_{g∈L} Corr²(Y, g(Z)), where L is the set of linear predictors.
(b) Let ε_L = Y − μ_L(Z) be the linear prediction error. Show that ε_L is uncorrelated with μ_L(Z).
(c) Show that, in the linear model of Remark 1.4.4, η²_{ZY} = ρ²_{ZY}.

18. Predicting the past from the present. Consider a subject who walks into a clinic today, at time t, and is diagnosed with a certain disease. Let S be the unknown date in the past when the subject was infected. We are interested in the time Y₀ = t − S from infection until detection. At the same time t a diagnostic indicator Z₀ of the severity of the disease (e.g., a blood cell or viral load measurement) is obtained. Assume that the conditional density of Z₀ (the present) given Y₀ = y₀ (the past) is N(μ + βy₀, σ²), where μ and σ² are the mean and variance of the severity indicator Z₀ in the population of people without the disease. Here βy₀, β > 0, gives the mean increase of Z₀ for infected subjects over the time period y₀. It will be convenient to rescale the problem by introducing Z = (Z₀ − μ)/σ and Y = βY₀/σ.
(a) Show that the conditional density f(z | y) of Z given Y = y is the N(y, 1) density.
(b) Suppose that Y has the exponential density

π(y) = λe^{−λy}, y > 0, λ > 0.
Show that the conditional distribution of Y (the past) given Z = z (the present) has density

π(y | z) = (2π)^{−1/2} c⁻¹ exp{−(1/2)[y − (z − λ)]²}, y > 0,

where c = Φ(z − λ). This density is called the truncated (at zero) normal, N(z − λ, 1), density. (In practice, all the unknowns, including the "prior" π, need to be estimated from cohort studies; see Berman, 1990, and Normand and Doksum, 2000.)
(c) Find the conditional density π₀(y₀ | z₀) of Y₀ given Z₀ = z₀. Hint: Use Bayes rule.
(d) Find the best predictor of Y₀ given Z₀ = z₀ using mean absolute prediction error E|Y₀ − g(Z₀)|. Hint: See Problems 1.4.7 and 1.4.9.
(e) Show that the best MSPE predictor of Y given Z = z is

E(Y | Z = z) = (z − λ) + φ(z − λ)/Φ(z − λ).

19. Let Y be the number of heads showing when X fair coins are tossed, where X is the number of spots showing when a fair die is rolled. Find
(a) the mean and variance of Y;
(b) the MSPE of the optimal predictor of Y based on X;
(c) the optimal predictor of Y given X = x, x = 1, ..., 6.

20. Write Cov[r(Y), s(Y) | z] for the covariance between r(Y) and s(Y) in the conditional distribution of (r(Y), s(Y)) given Z = z.
(a) Show that if Cov[r(Y), s(Y)] < ∞, then

Cov[r(Y), s(Y)] = E{Cov[r(Y), s(Y) | Z]} + Cov{E[r(Y) | Z], E[s(Y) | Z]}.

(b) Show that (a) is equivalent to (1.4.6) when r = s.
(c) Show that if Z is real, Cov[r(Y), Z] = Cov{E[r(Y) | Z], Z}.
(d) Suppose Y₁ = a₁ + b₁Z₁ + W and Y₂ = a₂ + b₂Z₂ + W, where Y₁ and Y₂ are responses of subjects 1 and 2 with common influence W and separate influences Z₁ and Z₂, and Z₁, Z₂ and W are independent with finite variances. Find Corr(Y₁, Y₂) using (a).
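The conditioning identities above can be checked by simulation in the concrete model of Problem 14 (Z = Z₂, Y = Z₁ + Z₁Z₂ with Z₁, Z₂ independent exponentials): given Z = z, Y = Z₁(1 + z), so E(Y | Z) = (1 + Z)/λ and Var(Y | Z) = (1 + Z)²/λ², and the r = s case of the covariance identity is the variance decomposition. A sketch assuming NumPy:

```python
import numpy as np

lam = 2.0
rng = np.random.default_rng(4)
z1 = rng.exponential(1 / lam, 500000)
z2 = rng.exponential(1 / lam, 500000)
y = z1 + z1 * z2                     # Y = Z1 + Z1*Z2, Z = Z2

cond_mean = (1 + z2) / lam           # E(Y | Z) evaluated at the sampled Z
cond_var = (1 + z2) ** 2 / lam**2    # Var(Y | Z) = (1 + z)^2 Var(Z1)
lhs = y.var()
rhs = cond_var.mean() + cond_mean.var()
print(abs(lhs - rhs) / lhs)          # small: Var(Y) = E Var(Y|Z) + Var E(Y|Z)
```

The same run gives numerical answers to parts (b), (c) and (e) of Problem 14 for comparison with the closed forms derived via E(Z₁) = 1/λ and Var(Z₁) = 1/λ².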
(e) In the preceding model (d), if b₁ = b₂ and Z₁, Z₂ and W have the same variance σ², we say that there is a 50% overlap between Y₁ and Y₂. In this case what is Corr(Y₁, Y₂)?
(f) In model (d), suppose that Z₁ and Z₂ are N(μ, σ²) and W ~ N(μ₀, σ₀²). Find the optimal predictor of Y₂ given Y₁.

21. Establish (1.4.14) by setting the derivatives of R(a, b) equal to zero, solving for (a, b), and checking convexity.

22. (a) Let w(y, z) be a positive real-valued function. Then [y − g(z)]²/w(y, z) is called weighted squared prediction error. Show that the mean weighted squared prediction error is minimized by g₀(Z) = E₀(Y | Z), where E₀ is computed under p₀(y, z) = cp(y, z)/w(y, z) and c is the constant that makes p₀ a density. Assume that E₀w(Y, g(Z)) < ∞ for some g and that p₀ is a density.
(b) Suppose that given Z = z, Y ~ B(n, z), 0 < z < 1, and suppose that Z has the beta, β(r, s), density. Find g₀(Z) when (i) w(y, z) = 1, and (ii) w(y, z) = z(1 − z).

23. Show that EY² < ∞ if and only if E(Y − c)² < ∞ for all c.
Hint: Whatever be Y and c, (1/2)Y² − c² ≤ (Y − c)² = Y² − 2cY + c² ≤ 2(Y² + c²).

24. In Example 1.4.3, show that the MSPE of the optimal predictor is σ²_Y(1 − ρ²_{ZY}).

25. Verify that solving (1.4.15) yields (1.4.14).

Problems for Section 1.5

1. Let X₁, ..., Xₙ be a sample from a Poisson, P(θ), population where θ > 0.
(a) Show directly that Σᵢ₌₁ⁿ Xᵢ is sufficient for θ.
(b) Establish the same result using the factorization theorem.

2. Let n items be drawn in order without replacement from a shipment of N items of which Nθ are bad, where 0 < θ < 1. Let Xᵢ = 1 if the ith item drawn is bad, and Xᵢ = 0 otherwise. Show that Σᵢ₌₁ⁿ Xᵢ is sufficient for θ directly and by the factorization theorem.

3. Suppose X₁, ..., Xₙ is a sample from a population with one of the following densities:
(a) p(x, θ) = θx^{θ−1}, 0 < x < 1, θ > 0. This is the beta, β(θ, 1), density.
(b) p(x, θ) = θax^{a−1}exp(−θxᵃ), x > 0, θ > 0, a > 0. This is known as the Weibull density.
(c) p(x, θ) = θaᶿx^{−(θ+1)}, x > a, θ > 0, a > 0.
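Sufficiency of ΣXᵢ in the Poisson problem has a concrete numerical fingerprint: two samples with the same total have likelihood functions that are proportional, so the ratio is free of θ. A sketch assuming SciPy (the two data vectors are made up, chosen only to share the same sum):

```python
import numpy as np
from scipy.stats import poisson

x1 = np.array([0, 2, 5, 1])   # total 8
x2 = np.array([4, 4, 0, 0])   # same total 8
thetas = np.linspace(0.5, 5, 10)
L1 = np.array([poisson.pmf(x1, t).prod() for t in thetas])
L2 = np.array([poisson.pmf(x2, t).prod() for t in thetas])
ratio = L1 / L2
print(ratio.std() / ratio.mean())  # ~0: ratio does not depend on theta
```

This is the factorization theorem in action: p(x, θ) = g(Σxᵢ, θ)h(x) with h(x) = 1/Πxᵢ!, so the θ-dependence enters only through the total.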
This is known as the Pareto density. In each case, find a real-valued sufficient statistic for θ, a fixed.

4. (a) Show that T₁ and T₂ are equivalent statistics if, and only if, we can write T₂ = H(T₁) for some 1-1 transformation H of the range of T₁ into the range of T₂. Which of the following statistics are equivalent? (Prove or disprove.)
(b) Πᵢ₌₁ⁿ xᵢ and Σᵢ₌₁ⁿ log xᵢ, xᵢ > 0
(c) Σᵢ₌₁ⁿ xᵢ and Σᵢ₌₁ⁿ log xᵢ, xᵢ > 0
(d) (Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ xᵢ²) and (Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ (xᵢ − x̄)²)
(e) (Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ (xᵢ − x̄)²) and (Σᵢ₌₁ⁿ xᵢ, Σᵢ₌₁ⁿ (xᵢ − x̄)³)
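Equivalence in Problem 4 is just a 1-1 recoding, which can be made concrete: for positive data, Σ log xᵢ = log(Π xᵢ), and (Σxᵢ, Σxᵢ²) determines (Σxᵢ, Σ(xᵢ − x̄)²) via the identity Σ(xᵢ − x̄)² = Σxᵢ² − n x̄². A sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.1, 3.0, size=20)
n = x.size

# (b): the product and the sum of logs are 1-1 functions of each other.
assert np.isclose(np.log(np.prod(x)), np.sum(np.log(x)))

# (d): (sum, sum of squares) <-> (sum, centered sum of squares).
s1, s2 = x.sum(), (x**2).sum()
assert np.isclose(s2 - n * (s1 / n) ** 2, ((x - x.mean()) ** 2).sum())
print("both recodings check out")
```

Checks like this do not prove equivalence, of course, but they make it easy to see which direction of the recoding, if any, might fail.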
5. Let θ = (θ₁, θ₂) be a bivariate parameter. Suppose that T₁(X) is sufficient for θ₁ whenever θ₂ is fixed and known, whereas T₂(X) is sufficient for θ₂ whenever θ₁ is fixed and known. Assume that θ₁, θ₂ vary independently, θ₁ ∈ Θ₁, θ₂ ∈ Θ₂, and that the set S = {x : p(x, θ) > 0} does not depend on θ.
(a) Show that if T₁ and T₂ do not depend on θ₂ and θ₁, respectively, then (T₁(X), T₂(X)) is sufficient for θ.
(b) Exhibit an example in which (T₁(X), T₂(X)) is sufficient for θ and T₁(X) is sufficient for θ₁ whenever θ₂ is fixed and known, but T₂(X) is not sufficient for θ₂ when θ₁ is fixed and known.

6. Let X take on the specified values v₁, ..., v_k with probabilities θ₁, ..., θ_k, respectively. Suppose that X₁, ..., Xₙ are independently and identically distributed as X. Suppose that θ = (θ₁, ..., θ_k) is unknown and may range over the set Θ = {(θ₁, ..., θ_k) : θᵢ ≥ 0, 1 ≤ i ≤ k, Σᵢ₌₁ᵏ θᵢ = 1}. Let Nⱼ be the number of Xᵢ which equal vⱼ.
(a) What is the distribution of (N₁, ..., N_k)?
(b) Show that N = (N₁, ..., N_{k−1}) is sufficient for θ.

7. Let X₁, ..., Xₙ be a sample from a population with density p(x, θ) given by

p(x, θ) = σ⁻¹ exp{−(x − μ)/σ} if x ≥ μ, and 0 otherwise.

Here θ = (μ, σ) with −∞ < μ < ∞, σ > 0.
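For the shifted exponential density of Problem 7, the likelihood depends on the data only through (min xᵢ, Σxᵢ), since Π σ⁻¹e^{−(xᵢ−μ)/σ}1(xᵢ ≥ μ) = σ⁻ⁿ exp{−(Σxᵢ − nμ)/σ}1(x₍₁₎ ≥ μ). A quick numerical illustration, with two made-up samples sharing the same minimum and sum:

```python
import numpy as np

def lik(x, m, s):
    # shifted exponential likelihood: (1/s) exp(-(x - m)/s), x >= m
    if x.min() < m:
        return 0.0
    return np.prod(np.exp(-(x - m) / s) / s)

x1 = np.array([1.0, 2.0, 4.0])   # min 1, sum 7
x2 = np.array([1.0, 2.5, 3.5])   # same min and sum
for m, s in [(0.5, 1.0), (0.9, 2.0), (1.0, 0.7)]:
    assert np.isclose(lik(x1, m, s), lik(x2, m, s))
print("likelihood matches on (min, sum)")
```

This is the two-dimensional sufficient statistic asked for in part (c); parts (a) and (b) are the one-parameter specializations.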
(a) Show that min(X₁, ..., Xₙ) is sufficient for μ when σ is fixed.
(b) Find a one-dimensional sufficient statistic for σ when μ is fixed.
(c) Exhibit a two-dimensional sufficient statistic for θ.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential families are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by

h(f) = E(−log f(X)) = −∫_S [log f(x)] f(x) dx.

This quantity arises naturally in information theory; see Section 2.2.2 and Cover and Thomas (1991). Let S = {x : f(x) > 0}.

(a) Show that the canonical k-parameter exponential family density

f(x, η) = exp{η_0 + Σ_{j=1}^k η_j r_j(x) − A(η)}, x ∈ S,

maximizes h(f) subject to the constraints

f(x) ≥ 0, ∫_S f(x) dx = 1, ∫_S f(x) r_j(x) dx = a_j, 1 ≤ j ≤ k,

where η_0, ..., η_k are chosen so that f satisfies the constraints. Hint: You may use Lagrange multipliers. Maximize the integrand.

(b) Find the maximum entropy densities when r_j(x) = x^j and (i) S = (0, ∞), k = 1, a_1 > 0; (ii) S = R, k = 2, a_1 ∈ R, a_2 > 0; (iii) S = R, k = 3, a_1 ∈ R, a_2 > 0, a_3 ∈ R.

29. As in Example 1.6.11, suppose that Y_1, ..., Y_n are i.i.d. N_p(μ, Σ) where μ varies freely in R^p and Σ ranges freely over the class of all p × p symmetric positive definite matrices. Show that the distribution of Y = (Y_1, ..., Y_n) is the p(p + 3)/2 canonical exponential family generated by h = 1 and the p(p + 3)/2 statistics
T_j = Σ_{i=1}^n Y_{ij}, 1 ≤ j ≤ p; T_{jl} = Σ_{i=1}^n Y_{ij} Y_{il}, 1 ≤ j ≤ l ≤ p,

where Y_i = (Y_{i1}, ..., Y_{ip}). Show that ℰ is open and that this family is of rank p(p + 3)/2.

Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3)/2 statistics T_j(Y) = Y_j, 1 ≤ j ≤ p, and T_{jl}(Y) = Y_j Y_l, 1 ≤ j ≤ l ≤ p,
generate N_p(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ⁻¹. Next establish that for symmetric matrices M,

∫ exp{−u^T M u} du < ∞ iff M is positive definite

by using the spectral decomposition (see B.10.1.2)

M = Σ_{j=1}^p λ_j e_j e_j^T for e_1, ..., e_p orthogonal, λ_j ∈ R.

To show that the family has full rank m, use induction on p to show that if Z_1, ..., Z_p are i.i.d. N(0, 1) and if B_{p×p} = (b_{jl}) is symmetric, then

P(Σ_{j=1}^p a_j Z_j + Σ_{j,l} b_{jl} Z_j Z_l = c) = P(a^T Z + Z^T B Z = c) = 0

unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ~ N_p(μ, Σ), then Y = SZ for some nonsingular p × p matrix S.
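The maximum entropy property of Problem 28 can be checked numerically. Among densities on R with mean 0 and variance 1 (the case r_j(x) = x^j, k = 2, of part (b)(ii)), the N(0, 1) density should have the largest entropy. The sketch below compares it with a variance-1 logistic density by crude midpoint integration; the integration limits and grid size are arbitrary choices:

```python
import math

def entropy(f, lo=-10.0, hi=10.0, n=40001):
    # h(f) = -integral of f(x) log f(x) dx, midpoint rule
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        fx = f(x)
        if fx > 0:
            total -= fx * math.log(fx) * dx
    return total

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

s = math.sqrt(3) / math.pi   # logistic scale chosen to give variance 1
def logistic_pdf(x):
    e = math.exp(-x / s)
    return e / (s * (1 + e) ** 2)

h_norm = entropy(normal_pdf)   # should be (1/2) log(2 pi e) = 1.4189...
h_logi = entropy(logistic_pdf)
```

Both densities have mean 0 and variance 1, but h_norm exceeds h_logi, as the maximum entropy characterization predicts.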
30. Show that if X_1, ..., X_n are i.i.d. N_p(θ, Σ_0) given θ where Σ_0 is known, then the N_p(λ, Γ) family is conjugate to N_p(θ, Σ_0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions; A Hierarchical Bayesian Normal Model. Let {(μ_j, τ_j) : 1 ≤ j ≤ k} be a given collection of pairs with μ_j ∈ R, τ_j > 0. Let (μ, τ) be a random pair with λ_j = P((μ, τ) = (μ_j, τ_j)), 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1. Let θ be a random variable whose conditional distribution given (μ, τ) = (μ_j, τ_j) is normal, N(μ_j, τ_j²). Consider the model X = θ + ε, where θ and ε are independent and ε ~ N(0, σ_0²), σ_0² known. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λ_j φ_{τ_j}(θ − μ_j)    (1.7.4)

where φ_τ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ_0²) distribution.

(a) Find the posterior π(θ | x) and write it in the form

π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μ_j, τ_j) | x) π(θ | (μ_j, τ_j), x) = Σ_{j=1}^k λ_j(x) φ_{τ_j(x)}(θ − μ_j(x))

for appropriate λ_j(x), τ_j(x), and μ_j(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ_0²) distribution.

(b) Let X_i = θ + ε_i, 1 ≤ i ≤ n, where θ is as previously and ε_1, ..., ε_n are i.i.d. N(0, σ_0²). Find the posterior π(θ | x_1, ..., x_n), and show that it belongs to class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
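Problem 31(a) reduces to the usual normal-normal conjugate update applied component by component, with the mixture weights re-weighted by each component's marginal density of x. A sketch, with arbitrary illustrative values for (λ_j, μ_j, τ_j) and σ_0:

```python
import math

def norm_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-z * z / 2) / (sd * math.sqrt(2 * math.pi))

def posterior_mixture(x, lams, mus, taus, sigma0):
    # Each component is updated as in the standard normal-normal conjugate pair;
    # the weight of component j is proportional to lambda_j times m_j(x),
    # the N(mu_j, tau_j^2 + sigma0^2) marginal density of x.
    new = []
    for lam, mu, tau in zip(lams, mus, taus):
        marg = norm_pdf(x, mu, math.sqrt(tau ** 2 + sigma0 ** 2))
        prec = 1 / tau ** 2 + 1 / sigma0 ** 2
        tau_x = math.sqrt(1 / prec)                       # tau_j(x)
        mu_x = (mu / tau ** 2 + x / sigma0 ** 2) / prec   # mu_j(x)
        new.append((lam * marg, mu_x, tau_x))
    total = sum(w for w, _, _ in new)
    return [(w / total, m, t) for w, m, t in new]

post = posterior_mixture(1.0, [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5], 1.0)
```

Each posterior component mean μ_j(x) is pulled from the prior mean μ_j toward the observation x, and the updated weights λ_j(x) still sum to one, so the posterior stays in class (1.7.4).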
32. A Hierarchical Binomial-Beta Model. Let {(r_j, s_j) : 1 ≤ j ≤ k} be a given collection of pairs with r_j > 0, s_j > 0, let (R, S) be a random pair with P(R = r_j, S = s_j) = λ_j, 0 < λ_j < 1, Σ_{j=1}^k λ_j = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λ_j π(θ, r_j, s_j).    (1.7.5)

Find the posterior

π(θ | x) = Σ_{j=1}^k P(R = r_j, S = s_j | x) π(θ | (r_j, s_j), x)

and show that it can be written in the form Σ_j λ_j(x) π(θ, r_j(x), s_j(x)) for appropriate λ_j(x), r_j(x), and s_j(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ 𝒳 ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η. Hint:

Cov_η(ψ(X), X) = (1/2) E{(X − X′)[ψ(X) − ψ(X′)]}

where X and X′ are independent, identically distributed as X (see A.11.12).
34. Let (X_1, ..., X_n) be a stationary Markov chain with two states 0 and 1. That is,

P[X_i = ε_i | X_1 = ε_1, ..., X_{i−1} = ε_{i−1}] = P[X_i = ε_i | X_{i−1} = ε_{i−1}] = p_{ε_{i−1} ε_i}

where

(p_{00}  p_{01})
(p_{10}  p_{11})

is the matrix of transition probabilities. Suppose further that

(i) p_{00} = p_{11} = p, so that p_{10} = p_{01} = 1 − p;

(ii) P[X_1 = 0] = P[X_1 = 1] = 1/2.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N_{00} + N_{11}, where N_{ij} is the number of transitions from i to j. For example, 01011 has N_{01} = 2, N_{11} = 1, N_{00} = 0, N_{10} = 1.

(b) Show that E(T) = (n − 1)p (by the method of indicators or otherwise).
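The transition counts N_{ij} and the statistic T = N_{00} + N_{11} of Problem 34(a) are simple to compute; a sketch using the example sequence 01011 from the problem:

```python
def transition_counts(xs):
    # n[i][j] = number of transitions from state i to state j
    n = [[0, 0], [0, 0]]
    for a, b in zip(xs, xs[1:]):
        n[a][b] += 1
    return n

# the example sequence 01011 from part (a)
N = transition_counts([0, 1, 0, 1, 1])
T = N[0][0] + N[1][1]   # the sufficient statistic T = N00 + N11
```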
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X_1, ..., X_n and Y_1, ..., Y_n are independent N(μ_1, σ²) and N(μ_2, σ²) samples, respectively. Consider the prior π for which for some r > 0, k > 0, rσ⁻² has a χ²_k distribution and, given σ², μ_1 and μ_2 are independent with N(ξ_1, σ²/k_1) and N(ξ_2, σ²/k_2) distributions, respectively, where ξ_j ∈ R, k_j > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is

f(x, μ, λ) = [λ/2π]^{1/2} x^{−3/2} exp{−λ(x − μ)²/2μ²x}, x > 0, μ > 0, λ > 0.

(a) Show that this is an exponential family generated by T(X) = −(1/2)(X, X⁻¹)^T and h(x) = (2π)^{−1/2} x^{−3/2}.

(b) Show that the canonical parameters η_1, η_2 are given by η_1 = μ⁻²λ, η_2 = λ, and that A(η_1, η_2) = −[(1/2) log(η_2) + √(η_1 η_2)], ℰ = [0, ∞) × (0, ∞).

(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ⁻¹, E(X⁻¹) = μ⁻¹ + λ⁻¹, Var(X⁻¹) = (λμ)⁻¹ + 2λ⁻².

(d) Suppose μ = μ_0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.

(e) Suppose that λ = λ_0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.

(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X_1, ..., X_n be i.i.d. as X ~ N_p(θ, Σ_0) where Σ_0 is known. Show that the conjugate prior generated by (1.6.20) is the N_p(η_0, τ_0² I) family, where η_0 varies freely in R^p, τ_0² > 0, and I is the p × p identity matrix.
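The moment formulas in Problem 36(c) can be checked by numerical integration of the inverse Gaussian density; a sketch with arbitrary choices of μ and λ and of the integration grid:

```python
import math

def ig_pdf(x, mu, lam):
    # f(x, mu, lambda) = sqrt(lambda / (2 pi x^3)) exp{-lambda (x - mu)^2 / (2 mu^2 x)}
    return math.sqrt(lam / (2 * math.pi * x ** 3)) * \
        math.exp(-lam * (x - mu) ** 2 / (2 * mu ** 2 * x))

def integrate(g, lo, hi, n=100000):
    # crude midpoint rule
    dx = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dx) for i in range(n)) * dx

mu, lam = 2.0, 3.0
mass = integrate(lambda x: ig_pdf(x, mu, lam), 1e-6, 60.0)           # should be 1
mean = integrate(lambda x: x * ig_pdf(x, mu, lam), 1e-6, 60.0)       # should be mu
var = integrate(lambda x: (x - mu) ** 2 * ig_pdf(x, mu, lam), 1e-6, 60.0)  # mu^3 / lambda
```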
38. Let X_i = (Z_i, Y_i)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X_1, ..., X_n as a canonical exponential family and identify T, h, A, and ℰ. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y_1, ..., Y_n are independent, Y_i ~ N(μ_i, σ²), n ≥ 4.

(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and ℰ.

(b) Next suppose that μ_i depends on the value z_i of some covariate and consider the submodel defined by the map η : (θ_1, θ_2, θ_3)^T → (μ^T, σ²)^T where η is determined by

μ_i = exp{θ_1 + θ_2 z_i}, z_1 < z_2 < ... < z_n; σ² = θ_3
where θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0. This model is sometimes used when μ_i is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
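The reparametrization in Problem 39(b) is easy to inspect in code; a sketch (the parameter and covariate values are arbitrary) showing that the map forces every μ_i to be positive while σ² is carried along unchanged:

```python
import math

def curved_map(theta1, theta2, theta3, zs):
    # the map (theta1, theta2, theta3) -> (mu_1, ..., mu_n, sigma^2) of Problem 39(b)
    mus = [math.exp(theta1 + theta2 * z) for z in zs]
    return mus, theta3

mus, sigma2 = curved_map(0.1, 0.5, 2.0, [0.0, 1.0, 2.0, 3.0])
```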
40. Suppose Y_1, ..., Y_n are independent exponentially, ℰ(λ_i), distributed survival times, n ≥ 3.

(a) Write the distribution of Y_1, ..., Y_n in canonical exponential family form. Identify T, h, η, A, and ℰ.

(b) Recall that μ_i = E(Y_i) = λ_i⁻¹. Suppose μ_i depends on the value z_i of a covariate. Because μ_i > 0, μ_i is sometimes modeled as

μ_i = exp{θ_1 + θ_2 z_i}, i = 1, ..., n

where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
1.8
NOTES
Note for Section 1.1
(1) For the measure theoretically minded we can assume more generally that the P_θ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dP_θ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3

(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.

(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4

(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, StatLab: An Empirical Introduction to Statistics. New York: McGraw-Hill, 1975.
Notes for Section 1.6

(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).

(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general if μ is a σ-finite measure on the sample space 𝒳, p(x, θ) as given by (1.6.1)
can be taken to be the density of X with respect to μ; see Lehmann (1997). This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7

(1) u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis New York: Springer, 1985.

BERMAN, S., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).

BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).

BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.

BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc., A 143, 383-430 (1980).

BROWN, L., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes - Monograph Series, Hayward, CA, 1986.

CARROLL, R. J., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.

DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).

FERGUSON, T. S., Mathematical Statistics: A Decision Theoretic Approach New York: Academic Press, 1967.

FEYNMAN, R. P., R. B. LEIGHTON, AND M. SANDS, The Feynman Lectures on Physics, v. 1, Ch. 40, "Statistical Mechanics of Physics" Reading, MA: Addison-Wesley, 1963.

GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.

HODGES, JR., J. L., D. KRECH, AND R. S. CRUTCHFIELD, StatLab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.

KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I and II New York: Hafner Publishing Co., 1961, 1966.

LEHMANN, E. L., "A Theory of Some Multiple Decision Problems," Ann. Math. Statist., 547-572 (1957).

LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.

MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.

NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Lecture Notes in Statistics, Editors: S. Ahmed and N. Reid New York: Springer, 2000.

PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometric Series II, London: Dulan & Co.)

RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory, Division of Research, Graduate School of Business Administration, Harvard University, Boston, 1961.

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SAVAGE, L. J., ET AL., The Foundations of Statistical Inference London: Methuen & Co., 1962.

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.

WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
The equations (2. We consider a function that we shall call a contrast function  p:Xx8>R and define D(Oo.O). So it is natural to consider 8(X) minimizing p(X. X """ PEP. In this parametric case.1 2. how do we find a function 8(X) of the vector observation X that in some sense "is close" to the unknown 81 The fundamental heuristic is typically the following. Now suppose e is Euclidean C R d . Estimating Equations Our basic framework is as before.Chapter 2 METHODS OF ESTIMATION 2. we could obtain 8 0 as the minimizer.O) where V denotes the gradient. p(X. (2.1. but in a very weak sense (unbiasedness). we don't know the truth so this is inoperable.1) Arguing heuristically again we are led to estimates B that solve \lOp(X.1. how do we select reasonable estimates for 8 itself? That is.2) define a special form of estimating equations. the true 8 0 is an interior point of e.8) is an estimate of D(8 0 . This is the most general fonn of the minimum contrast estimate we shall consider in the next section. 8).8).8) is smooth.1 BASIC HEURISTICS OF ESTIMATION Minimum Contrast Estimates. 6) ~ o. and 8 Jo D( 8 0 .1. Then we expect  \lOD(Oo. X E X. DC8 0l 9) measures the (population) discrepancy between 8 and the true value 8 0 of the parameter. Of course. usually parametrized as P = {PO: 8 E e}.1. In order for p to be a contrast function we require that DC 8 0l 9) is uniquely minimized for 8 = (Jo.2) 99 .O) EOoP(X. As a function of 8. =0 (2. if PO o were true and we knew DC 8 0 .8) as a function of 8. That is.
More generally, suppose we are given a function ψ : 𝒳 × R^d → R^d, ψ = (ψ_1, ..., ψ_d)^T, and define

V(θ_0, θ) = E_{θ_0} ψ(X, θ).    (2.1.3)

Suppose V(θ_0, θ) = 0 has θ_0 as its unique solution for all θ_0 ∈ Θ. Then we say θ̂ solving

ψ(X, θ̂) = 0    (2.1.4)

is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.

Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with μ(z) = g(β, z), β ∈ R^d, where the function g is known. Here the data are X = {(z_i, Y_i) : 1 ≤ i ≤ n} where Y_1, ..., Y_n are independent. A natural(1) function ρ(X, β) to consider is the squared Euclidean distance between the vector Y of observed Y_i and the vector expectation of Y, μ(z) = (g(β, z_1), ..., g(β, z_n))^T. That is, we take

ρ(X, β) = |Y − μ|² = Σ_{i=1}^n [Y_i − g(β, z_i)]².    (2.1.5)

Strictly speaking, ρ is not fully defined here and this is a point we shall explore later. But, suppose we postulate that the ε_i of Example 1.1.4 are i.i.d. N(0, σ_0²). Then β parametrizes the model and we can compute (see Problem 2.1.16)

D(β_0, β) = E_{β_0} ρ(X, β) = nσ_0² + Σ_{i=1}^n [g(β_0, z_i) − g(β, z_i)]²,    (2.1.6)

which is indeed minimized at β = β_0, and uniquely so if and only if the parametrization is identifiable. An estimate β̂ that minimizes ρ(X, β) exists if g(β, z) is continuous and

lim{|g(β, z)| : |β| → ∞} = ∞

(Problem 2.1.10). The estimate β̂ is called the least squares estimate. If, further, g(β, z) is differentiable in β, then β̂ satisfies the equation (2.1.2) or equivalently the system of estimating equations

Σ_{i=1}^n (∂g/∂β_j)(β̂, z_i) Y_i = Σ_{i=1}^n (∂g/∂β_j)(β̂, z_i) g(β̂, z_i), 1 ≤ j ≤ d.    (2.1.7)

In the important linear case,

g(β, z) = Σ_{j=1}^d z_j β_j and z_i = (z_{i1}, ..., z_{id})^T,

the system becomes

Σ_{i=1}^n z_{ij} Y_i = Σ_{l=1}^d [Σ_{i=1}^n z_{ij} z_{il}] β_l, 1 ≤ j ≤ d.    (2.1.8)

These equations are commonly written in matrix form

Z_D^T Y = Z_D^T Z_D β    (2.1.9)

where Z_D = ||z_{ij}||_{n×d} is the design matrix. These equations are also known as the normal equations. We return to the remark that this estimating method is well defined even if the ε_i are not i.i.d. N(0, σ_0²). Once defined we have a method of computing a statistic β̂ from the data X = {(z_i, Y_i) : 1 ≤ i ≤ n}, which can be judged on its merits whatever the true P governing X is. This very important example is pursued further in Section 2.2 and Chapter 6. □

Least squares, thus, provides a first example of both minimum contrast and estimating equation methods. Here is another basic estimating equation example.

Example 2.1.2. Method of Moments (MOM). Suppose X_1, ..., X_n are i.i.d. as X ~ P_θ, θ ∈ R^d, and θ is identifiable. Suppose that μ_1(θ), ..., μ_d(θ) are the first d moments of the population we are sampling from. Thus, we assume the existence of

μ_j = μ_j(θ) = E_θ(X^j), 1 ≤ j ≤ d.

Define the jth sample moment μ̂_j by

μ̂_j = (1/n) Σ_{i=1}^n X_i^j, 1 ≤ j ≤ d.

The method of moments prescribes that we estimate θ by the solution of

μ̂_j = μ_j(θ̂), 1 ≤ j ≤ d,

if it exists. The motivation of this simplest estimating equation example is the law of large numbers: for X ~ P_θ, μ̂_j converges in probability to μ_j(θ). More generally, if we want to estimate a R^k-valued function q(θ) of θ, we obtain a MOM estimate of q(θ) by expressing q(θ) as a function of any of the first d moments μ_1, ..., μ_d of X, d ≥ k, say q(θ) = h(μ_1, ..., μ_d), and then using h(μ̂_1, ..., μ̂_d) as the estimate of q(θ).

For instance, consider a study in which the survival time X is modeled to have a gamma distribution, Γ(α, λ), with density

[λ^α / Γ(α)] x^{α−1} exp{−λx}, x > 0, α > 0, λ > 0.

In this case θ = (α, λ), μ_1 = E(X) = α/λ, and μ_2 = E(X²) = α(1 + α)/λ². Solving for θ gives

α = (μ_1/σ)², λ = μ_1/σ²,

where σ² = μ_2 − μ_1² = Var(X). The method of moments estimates are

α̂ = (X̄/σ̂)², λ̂ = X̄/σ̂²,

where σ̂² = n⁻¹ Σ X_i² − X̄². In this example, the method of moment estimator is not unique. We can, for instance, express θ as a function of μ_1 and μ_3 = E(X³) and obtain a method of moment estimator based on μ̂_1 and μ̂_3 (Problem 2.1.11). □
Algorithmic issues. We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed form. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently used when computation of M(X, ·) ≡ D_θ ψ(X, ·)_{d×d} is quick and M is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with θ̂_0, then setting

θ̂_{j+1} = θ̂_j − M⁻¹(X, θ̂_j) ψ(X, θ̂_j).    (2.1.10)

This algorithm and others will be discussed more extensively in Section 2.4 and in Chapter 6.
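The scalar case of the Newton-Raphson iteration (2.1.10) can be sketched in a few lines. Here the estimating equation ψ(θ) = e^θ − X̄ = 0 (matching a log-linked mean to a made-up sample mean) is solved iteratively; the starting point and tolerance are arbitrary choices:

```python
import math

def newton_raphson(psi, dpsi, theta0, tol=1e-10, max_iter=100):
    # theta_{j+1} = theta_j - psi(theta_j) / dpsi(theta_j), the scalar case of (2.1.10)
    theta = theta0
    for _ in range(max_iter):
        step = psi(theta) / dpsi(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# made-up estimating equation: exp(theta) matched to a sample mean xbar
xbar = 2.5
root = newton_raphson(lambda t: math.exp(t) - xbar, lambda t: math.exp(t), 0.0)
```

For this equation the exact solution is θ = log X̄, and the iteration converges to it quadratically from a reasonable start.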
2.1.2 The Plug-In and Extension Principles

We can view the method of moments as an example of what we call the plug-in (or substitution) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.

Example 2.1.3. Frequency Plug-In(2) and Extension. Suppose we observe multinomial trials in which the values v_1, ..., v_k of the population being sampled are known, but their respective probabilities p_1, ..., p_k are completely unknown. If we let X_1, ..., X_n be i.i.d. as X and N_i = number of indices j such that X_j = v_i, then the natural estimate of p_i = P[X = v_i] suggested by the law of large numbers is N_i/n, the proportion of sample values equal to v_i. As an illustration consider a population of men whose occupations fall in one of five different job categories, 1, 2, 3, 4, or 5. Here k = 5, v_i = i, p_i is the proportion of men in the population in the ith job category, and N_i/n is the sample proportion in this category. Here is some job category data (Mosteller, 1968):

Job Category:   1      2      3      4      5
N_i:           23     84    289    217     95     n = Σ N_i = 708
p̂_i:         0.03   0.12   0.41   0.31   0.13

for Danish men whose fathers were in category 3, together with the estimates p̂_i = N_i/n.

Next consider the more general problem of estimating a continuous function q(p_1, ..., p_k) of the population proportions. The frequency plug-in principle simply proposes to replace the unknown population frequencies p_1, ..., p_k by the observable sample frequencies N_1/n, ..., N_k/n. That is, use

q(N_1/n, ..., N_k/n)    (2.1.11)

to estimate q(p_1, ..., p_k). For instance, suppose that in the previous job category table, categories 4 and 5 correspond to blue-collar jobs, whereas categories 2 and 3 correspond to white-collar jobs. We would be interested in estimating

q(p_1, ..., p_5) = (p_4 + p_5) − (p_2 + p_3),

the difference in the proportions of blue-collar and white-collar workers. If we use the frequency substitution principle, the estimate is

T(X_1, ..., X_n) = (N_4/n + N_5/n) − (N_2/n + N_3/n),

which in our case is 0.44 − 0.53 = −0.09. Equivalently, let P denote p = (p_1, ..., p_k) with p_i = P[X = v_i], 1 ≤ i ≤ k; that is, we can think of this model as 𝒫 = {all probability distributions P on {v_1, ..., v_k}}. Then q(p) can be identified with a parameter ν : 𝒫 → R, ν(P) = (p_4 + p_5) − (p_2 + p_3), and the frequency plug-in principle simply says to replace P = (p_1, ..., p_k) in ν(P) by P̂ = (N_1/n, ..., N_k/n), the multinomial empirical distribution of X_1, ..., X_n.

Now suppose that the proportions p_1, ..., p_k do not vary freely but are continuous functions of some d-dimensional parameter θ = (θ_1, ..., θ_d) and that we want to estimate a component of θ or more generally a function q(θ). Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type.

Example 2.1.4. Hardy-Weinberg Equilibrium. Consider a sample from a population in genetic equilibrium with respect to a single gene with two alleles. If we assume the three different genotypes are identifiable, we are led to suppose that there are three types of individuals whose frequencies are given by the so-called Hardy-Weinberg proportions

p_1 = θ², p_2 = 2θ(1 − θ), p_3 = (1 − θ)², 0 < θ < 1.    (2.1.12)

If N_i is the number of individuals of type i in the sample of size n, then (N_1, N_2, N_3) has a multinomial distribution with parameters (n, p_1, p_2, p_3) given by (2.1.12). Suppose we want to estimate θ, the frequency of one of the alleles. Because θ = √p_1, we can use the principle we have introduced and estimate by √(N_1/n). Note, however, that we can also write θ = 1 − √p_3 and, thus, 1 − √(N_3/n) is also a plausible estimate of θ.
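The two plug-in estimates of θ in the Hardy-Weinberg example can be compared directly. The genotype counts below are constructed to sit exactly on the Hardy-Weinberg proportions with θ = 0.4, so the two estimates agree; on real data they would generally differ:

```python
import math

# hypothetical genotype counts (N1, N2, N3) for a sample of size n,
# chosen to match p1 = 0.16, p2 = 0.48, p3 = 0.36 (theta = 0.4) exactly
N1, N2, N3 = 160, 480, 360
n = N1 + N2 + N3

theta_1 = math.sqrt(N1 / n)        # plug-in via theta = sqrt(p1)
theta_2 = 1 - math.sqrt(N3 / n)    # plug-in via theta = 1 - sqrt(p3)
```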
In general, suppose that we want to estimate a continuous R^l-valued function q of θ. Now q(θ) can be identified, if θ is identifiable, by a parameter ν : 𝒫_0 → R given by ν(P_θ) = q(θ). If p_1, ..., p_k are continuous functions of θ, we can usually express q(θ) as a continuous function of p_1, ..., p_k, that is,

q(θ) = h(p_1(θ), ..., p_k(θ))    (2.1.13)

with h defined and continuous on 𝒫. Given h we can apply the extension principle to estimate q(θ) as

T(X_1, ..., X_n) = h(N_1/n, ..., N_k/n).    (2.1.14)

As we saw in the Hardy-Weinberg case, h and the representation (2.1.14) are not unique. We shall consider in Chapters 3 (Example 3.4.4) and 5 how to choose among such estimates. We can think of the extension principle alternatively as follows. Let 𝒫_0 be a submodel of 𝒫 and define ν̄(P) = h(p), so that ν̄(P) = ν(P) for P ∈ 𝒫_0. Then (2.1.13) defines an extension of ν from 𝒫_0 to 𝒫 via ν̄.
The plug-in and extension principles can be abstractly stated as follows:

Plug-in principle. If we have an estimate P̂ of P ∈ 𝒫 such that P̂ ∈ 𝒫 and ν : 𝒫 → T is a parameter, then ν(P̂) is the plug-in estimate of ν. In particular, in the i.i.d. case, if 𝒫 is the space of all distributions of X and X_1, ..., X_n are i.i.d. as X ~ P, the empirical distribution P̂ of X given by

P̂[X ∈ A] = (1/n) Σ_{i=1}^n 1(X_i ∈ A)    (2.1.15)

is, by the law of large numbers, a natural estimate of P, and ν(P̂) is a plug-in estimate of ν(P) in this nonparametric context. For instance, if X is real and F(x) = P(X ≤ x) is the distribution function (d.f.), consider

ν_α(P) = (1/2)[F⁻¹(α) + F_U⁻¹(α)],    (2.1.16)

where α ∈ (0, 1) and

F⁻¹(α) = inf{x : F(x) ≥ α}, F_U⁻¹(α) = sup{x : F(x) ≤ α}.

Then ν_α(P) is called the αth population quantile x_α. Here x_{1/2} is the population median.
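Replacing F by the empirical d.f. F̂ in (2.1.16) gives the plug-in quantile; a minimal sketch for the case α = 1/2, where the estimate averages the two middle order statistics when n is even:

```python
import math

def sample_quantile(xs, alpha):
    # plug-in alpha-quantile of the empirical d.f. F-hat:
    # (F^{-1}(alpha) + F_U^{-1}(alpha)) / 2 with
    # F^{-1}(alpha) = inf{x : F-hat(x) >= alpha},
    # F_U^{-1}(alpha) = sup{x : F-hat(x) <= alpha}
    s = sorted(xs)
    n = len(s)
    k = n * alpha
    lo = s[math.ceil(k) - 1]
    if abs(k - round(k)) < 1e-12 and int(round(k)) < n:
        hi = s[int(round(k))]   # F-hat equals alpha on a whole interval
    else:
        hi = lo
    return (lo + hi) / 2

med_odd = sample_quantile([3, 1, 2], 0.5)       # middle order statistic
med_even = sample_quantile([4, 1, 3, 2], 0.5)   # average of the two middle values
```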
A natural estimate of ν_α(P) is the αth sample quantile

x̂_α = ν_α(P̂),    (2.1.17)

where F̂ is the empirical d.f. Here x̂_{1/2} is called the sample median.

Extension principle. Suppose 𝒫_0 is a submodel of 𝒫 and P̂ is an element of 𝒫 but not necessarily 𝒫_0, and suppose ν : 𝒫_0 → T is a parameter. If ν̄ : 𝒫 → T is an extension of ν in the sense that ν̄(P) = ν(P) on 𝒫_0, then ν̄(P̂) is an extension (and plug-in) estimate of ν(P). The plug-in and extension principles are used when P_θ, ν, and ν̄ are continuous. For instance, in the multinomial Examples 2.1.3 and 2.1.4, P_θ as given by the Hardy-Weinberg p(θ) is a continuous map from Θ = [0, 1] to 𝒫, ν(P_θ) = q(θ) = h(p(θ)) is a continuous map from Θ to R, and ν̄(P) = h(p) is a continuous map from 𝒫 to R.

With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plug-in estimates for multinomial trials, because

μ_j(θ) = Σ_{i=1}^k v_i^j p_i(θ) = h(p(θ)) = ν(P_θ), μ̂_j = Σ_{i=1}^k v_i^j (N_i/n) = (1/n) Σ_{i=1}^n X_i^j = ν̄(P̂),

where P̂ is the empirical distribution. This reasoning extends to the general i.i.d. case (Problem 2.1.12) and to more general method of moment estimates (Problem 2.1.13). For instance, if X is real and 𝒫 is the class of distributions with E|X|^j < ∞, then the plug-in estimate of the jth moment ν(P) = μ_j = E(X^j) in this nonparametric context is the jth sample moment ν(P̂) = ∫ x^j dF̂(x) = n⁻¹ Σ_{i=1}^n X_i^j.

Remark 2.1.1. The plug-in and extension principles must be calibrated with the target parameter. For instance, let ν(P) be the mean of X and let 𝒫_0 be the class of distributions of X = θ + ε where θ ∈ R and the distribution of ε ranges over the class of symmetric distributions with mean zero. In this case both ν_1(P) = E_P(X) and ν_2(P) = "median of P" satisfy ν(P) = ν_i(P) for P ∈ 𝒫_0. However, if 𝒫 is the class of distributions of X = θ + ε where the distribution of ε ranges over the class of all distributions with mean zero, then, because the sample median ν_2(P̂) does not converge in probability to E_P(X) when P is not symmetric, only ν_1(P̂) = X̄ is a sensible estimate of ν(P) for P ∈ 𝒫, P ∉ 𝒫_0.

Remark 2.1.2. As stated, these principles are general. However, they are mainly applied in the i.i.d. case; but see Problem 2.1.14.
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.

Example 2.1.5. Suppose X_1, ..., X_n are the indicators of a set of Bernoulli trials with probability of success θ. Because μ_1(θ) = θ, the method of moments leads to the natural estimate of θ, X̄, the frequency of successes. To estimate the population variance θ(1 − θ) we are led by the first moment to the estimate X̄(1 − X̄). Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1). □

Example 2.1.6. Suppose X_1, ..., X_n is a N(μ, σ²) sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of μ and σ² are X̄ and σ̂². □

Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3, where X_1, ..., X_n are i.i.d. U{1, 2, ..., θ}, we find μ = E_θ(X_1) = (1/2)(θ + 1), θ = 2μ − 1, and 2X̄ − 1 is a method of moments estimate of θ. This is clearly a foolish estimate if X_(n) = max X_i > 2X̄ − 1 because in this model θ is always at least as large as X_(n). □

As we have seen, there are often several method of moments estimates for the same q(θ). For example, if we are sampling from a Poisson population with parameter θ, then θ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because p_0 = P[X = 0] = exp{−θ}, a frequency plug-in estimate of θ is −log p̂_0, where p̂_0 is n⁻¹ Σ_i 1[X_i = 0].

Discussion. What are the good points of the method of moments and frequency plug-in?

(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.

(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.

When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For example, estimation of θ real with quadratic loss and Bayes priors leads to procedures that are data weighted averages of θ values rather than minimizers of functions ρ(X, θ). Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, and there are best extensions. It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.
P.lj = E( XJ\ 1 < j < d. when g({3. is parametric and a vector q( 8) is to be estimated. I < i < n. where Pj is the probability of the jth category. 0).(3) ~ L[li . and the contrast estimating equations are V Op(X. 6). Let Po and P be two statistical models for X with Po c P. Zi). 0 E e. A minimum contrast estimator is a minimizer of p(X. where 9 is a known function and {3 E Rd is a vector of unknown regression coefficients. D(P) is called the extensionplug·in estimate of v(P)..Pk). Suppose X .. 1 < j < k..Section 2.1 L:~ 1 I[X i E AI. . If P is an estimate of P with PEP.ld)T where f. li) : I < i < n} with li independent and E(Yi ) = g((3.lJ) ~ E'o"p(X. then is called the empirical PIE. .1 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS Least Squares and Weighted Least Squares Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement.) = q(O) and call v(P) a plugin estimator of q(6). For the model {Pe : () E e} a contrast p is a function from X x H to R such that the discrepancy D(lJo. If P = PO..Zi)j2. When P is the empirical probability distribution P E defined by Pe(A) ~ n. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. In the multinomial case the frequency plugin estimators are empirical PIEs based on v(P) = (Ph . the associated estimating equations are called the normal equations and are given by Z1 Y = ZtZD{3.. P E Po.. ~ ~ ~ ~ ~ v we find a parameter v such that v(p. f. An extension ii of v from Po to P is a parameter satisfying v(P) = v(P). For data {(Zi.g((3.. In this section we shall introduce the approach and give a few examples leaving detailed development to Chapter 6. . The plugin estimate (PIE) for a vector parameter v = v(P) is obtained by setting fJ = v( P) where P is an estimate of P. ~ ~ ~ ~ 2.2 Minimum Contrast Estimates and Estimating Equations 107 Summary. For this contrast.. 
a least squares estimate of {3 is a minimizer of p(X. 0 E e c Rd is uniquely minimized at the true value 8 = 8 0 of the parameter. The general principles are shown to be related to each other.O) = O.. Method of moment estimates are empirical PIEs based on v(P) = (f.2 2. z) = ZT{3.ll. It is of great importance in many areas of statistics such as the analysis of variance and regression theory.2. . where ZD = IIziJ·llnxd is called the design matrix.
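For the linear case g(β, z) = zᵀβ, the least squares estimate solves the normal equations Z_Dᵀ Z_D β = Z_Dᵀ Y directly. The following is a minimal numpy sketch (the design matrix and response values here are made-up illustrative data, not from the text):

```python
import numpy as np

# Hypothetical design matrix Z_D (n x d) and response vector Y.
ZD = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 7.0], [1.0, 8.0]])
Y = np.array([3.0, 6.0, 9.0, 10.0])

# Solve the normal equations  Z_D' Z_D beta = Z_D' Y.
beta = np.linalg.solve(ZD.T @ ZD, ZD.T @ Y)

# Fitted values and residuals of the least squares fit.
fitted = ZD @ beta
residuals = Y - fitted

# By the normal equations, the residuals are orthogonal to the columns of Z_D.
print(np.allclose(ZD.T @ residuals, 0))  # True
```

The orthogonality check at the end is exactly the statement of the normal equations: Z_Dᵀ(Y − Z_D β̂) = 0.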
) = E(Y.) ..2. 13 E R d }. . .zt}.. I) are i.5) (2.1 we considered the nonlinear (and linear) Gaussian model Po given by Y.g(l3. . .1. is modeled as a sample from a joint distribution.g(l3.z. The least squares method of estimation applies only to the parameters {31' . j. then 13 has an interpretation as a parameter on p. aJ) and (3 ranges over R d or an open subset. we can compute still Dp(l3o.Y) such thatE(Y I Z = z) = g(j3.2. • . Zn»)T is 1.g(l3.13) n n i=1  L Varp«. (Zi.2).. I<i<n. This follows "I .g(l3. j.i. E«. and 13 ~ (g(l3.C=ha:::p::.4}{2. that is. where Ci = g(l3. Suppose that we enlarge Po to P where we retain the independence of the Yi but only require lJ.»)2.2. I Zj = Zj).. as (Z. For instance.) + L[g(l3 o. 13) = LIY. a 2 and H unknown. Z could be educationallevel and Y income.. Yi).E(Y. 1_' . " ..:..::2 : In Example 2. . the model is semiparametric with {3.zn)f is 11. This is frequently the case for studies in the social and biological sciences. .3) continues to be valid.)..1.g(l3. The contrast p(X. . that is.z).:0::d=':Of.2. .z.3) which is again minimized as a function of 13 by 13 = 130 and uniquely so if the map 13 ~ (g(j3. E«. which satisfies (but is not fully specified by) (2.:1.4 with <j simply defined as Y. 1 < i < j < n 1 > 0\ .En) is any distribution satisfying the GaussMarkov assumptions.2. That is. Y) ~ PEP = {Alljnintdistributions of(Z. or Z could be height and Y log weight Then we can write the conditional model of Y given Zj = Zj.=..108_~ ~_ _~ ~ ~cMc.m::a.f3d and is often applied in situations in which specification of the model beyond (2. < i < n.) = g(l3. I' . .) + <" I < i < n n (2.) i. " (2.2) Are least squares estimates still reasonable? For P in the semiparametric model P.1.2.2.). Z)f.4) (2.6) is difficult Sometimes z can be viewed as the realization of a population variable Z.) = 0. . (2. The estimates continue to be reasonable under the GaussMarkov assumptions.2.. 13 = I3(P) is the miniml2er of E(Y . 
Note that the joint distribution H of (C}. 1 < j < n.)]' i=l ~ led to the least squares estimates (LSEs) {3 of {3. z. .d. z.2.::'hc. .z. as in (a) of Example 1. z.i.(z. (Z" Y.. fj) because (2.. .::. = 0. i=l (2.6) Var( €i) = u 2 COV(fi.E:::'::'.o=nC. (3)  E p p(X. 1 < i < n. If we consider this model.d. NCO.) =0.2.
1. i = 1.3 and 6..7) (2.. J for Zo an interior point of the domain.L{3jPj.1). We can then treat p(zo)  nnknown (30 and identify (zo) with (3j to give an approximate (d 1 and Zj as before and linear model with Zo = t!.5. . n} we write this model in matrix fonn as Y = ZD(3 + €    where Zv = IIZij 11 is the design matrix.2.2. Zd. We continue our discussion for this important special case for which explicit fonnulae and theory have been derived.2 Minimum Contrast Estimates and Estimating Equations 109 from Theorem 1. As we noted in Example 2.8) d f30 = («(3" . a further modeling step is often taken and it is assumed that {Zll . is called the linear (multiple) regression modeL For the data {(Zi.2.2.1. (2. and Volnme n. y)T has a nondegenerate multivariate Gaussian distribution Nd+1(Il. E(Y I Z = z) can be written as d p(z) where = (30 + L(3jZj j=1 (2.Section 2. see Rnppert and Wand (1994).4.2.2. in conjnnction with (2.6). For nonlinear cases we can use numerical methods to solve the estimating equations (2.5) and (2.4. where P is the empirical distribution assigning mass n. we can approximate p(z) by p(z) ~ p(zo) +L j=1 d a: a (zo)(z . we are in a situation in which it is plausible to assume that (Zi. In that case. In this case we recognize the LSE {3 as simply being the usual plugin estimate (3(P). z) = zT (3.zo).4). Yi) are a sample from a (d + 1)dimensional distribution and the covariates that are the coordinates of Z are continuous. j=1 (2. 2:).9) . {3d). (2) If as we discussed earlier. Yi). as we have seen in Section lA. which. .2. . See also Problem 2. ...2. Fan and Gijbels (1996).41..7). I)/. Sections 6. (2.1.. and Seber and Wild (1989). j=O This type of approximation is the basis for nonlinear regression analysis based on local polynomials. E1=1 g~ (zo)zo as an + 1) dimensional d p(z) = L(3jZj. = py .1 the most commonly used 9 in these models is g({3.I to each of the n pairs (Zil Yi). 
The linear model is often the default model for a number of reasons: (I) If the range of the z's is relatively small and p(z) is smooth.
i I . . the solution of the normal equations can be given "explicitly" by (2. For this reason. g(z. L 1 that.ecor and Cochran. The parametrization is identifiable if and only ifZ D is of rank d or equivalently if Z'bZD is affulI rank d.2.12) . For certain chemicals and plants. 1 64 4 71 5 54 9 81 11 76 13 23 77 93 23 95 28 109 The points (Zi' Yi) and an estimate of the line 131 + 132z are plotted in Figure 2. Y is the amount of phosphorus found in com plants grown for 38 days in the different samples of soil. Zi. {3. Following are the results of an experiment to which a regression model can be applied (Sned.11) When the Zi'S are not all equal.1. The nonnal equations are n i=l n ~(Yi .2.(3zz.) ~ 0. In the measurement model in which Yi is the detennination of a constant Ih. g(z. If we run several experiments with the same z using plants and soils that are as nearly identical as possible.2. We have already argued in Examp}e 2.) = il" a~. Y is random with a distribution P(y I z).) = O.il.1.25.2.) = 1 and the normal equation is L:~ 1(Yi .1 '. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. Example 2.no Furthennore. . we assume that for a given z. 1 < i < n. LZi(Yi . see Problem 2. Nine samples of soil were treated with different amounts Z of phosphorus.2. we will find that the values of y will not be the same.il2Zi) = 0. we get the solutions (2. Zi 1'. il. the least squares estimate.10) Here are some examples.ill . whose solution is .2. 139). Estimation of {3 in the linear regression model.1.il.2. exists and is unique and satisfies the Donnal equations (2. necessarily.8). 0 Ii. I I. Example 2. il. il. we have a Gaussian linear regression model for Yi. d = 1. if the parametrization {3 ) ZD{3 is identifiable. ~ = (lin) L:~ 1Yi = ii. is independent of Z and has a N(O. 1967.I 'i. ( (J2 2 ) distribution where = ayy Therefore. 
< Methods of Estimation Chapter 2 =y  I'(Z) Eyz:Ez~Ezy. p. We want to estimate f31 and {32. given Zi = 1 < i < n.2. the relationship between z and y can be approximated well by a linear equation y = 131 + (32Z provided z is restricted to a reasonably small interval. i=l (2. . In that case.
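The closed-form solution (2.2.12)–(2.2.13) can be checked numerically on the phosphorus data above (pairs (zᵢ, yᵢ) as read from the table; the fit reproduces the reported coefficients β̂₁ = 61.58, β̂₂ = 1.42):

```python
import numpy as np

# Phosphorus data (Snedecor and Cochran, 1967): z = phosphorus in soil,
# y = phosphorus in plants, as read from the table in the example.
z = np.array([1, 4, 5, 9, 11, 13, 23, 23, 28], dtype=float)
y = np.array([64, 71, 54, 81, 76, 93, 77, 95, 109], dtype=float)

# Normal equations solution:
#   beta2 = sum (z_i - zbar)(y_i - ybar) / sum (z_i - zbar)^2
#   beta1 = ybar - beta2 * zbar
zbar, ybar = z.mean(), y.mean()
beta2 = np.sum((z - zbar) * (y - ybar)) / np.sum((z - zbar) ** 2)
beta1 = ybar - beta2 * zbar

print(round(beta1, 2), round(beta2, 2))  # 61.58 1.42
```

The resulting line y = 61.58 + 1.42 z is the sample regression line plotted in Figure 2.2.1.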
.1. . 0 ~ ~ ~ ~ ~ Remark 2. . .Yn on ZI. (zn. 9p( z). Here {31 = 61. The line Y = /31 + (32Z is an estimate of the best linear MSPE predictor Ul + b1Z of Theorem 1.3. and fi = (lin) I:~ 1 Yi· The line y = /31 + f32z is known as the sample regression line or line of best fit of Yl. For instance..13) where Z ~ (lin) I:~ 1 Zi.42. Zn.2.(Z). €6 is the residual for (Z6. .. &atter plot {(Zi.(a + bzi)l.Section 2.132 z ~ ~ (2.2..9p.4. . . The linear regression model is considerably more general than appears at first sight. P > d and postulate that 1'( z) is a linear combination of 91 (z).. if we measure the distance between a point (Zi. This connection to prediction explains the use of vertical distance! in regression. 91... Yn). then the regression line minimizes the sum of the squared distances to the n points (ZI' Yl). and 131 = fi . Yi). .. i = 1.. Y6).. i = 1. .2 Minimum Contrast Estimates and Estimating Equations 111 y 50 • o 10 20 x 30 Figure 22.1.2. .. suppose we select p realvalued functions of z.58 and 132 = 1. . n} and sample regression line for the phosphorus data. The regression line for the phosphorus data is given in Figure 2.(/31 + (32Zi)] are called the residuals of the fit. j=1 Then we are still dealing with a linear regression model because we can define WpXI . n. Geometrically. . The vertical distances €i = fYi .Yi) and a line Y ~ a + bz vertically by di = Iy. that is p I'(z) ~ 2:). .9.1. ... .. .
0 < j < 2. gp( z)) T as our covariate and consider the linear model Yi = where L j=l p ()jWij + Ci.2.17) . Whether any linear model is appropriate in particular situations is a delicate matter partially explorable through further analysis of the data and knowledge of the subject matter. . .II 112 Methods of Estimation Chapter 2 (91 (z)...'l are sufficient for the Yi and that Var(Ed.16) as a function of {3.". i = 1.8..)]2 i=l i=l ~ n (2..g(. The weighted least squares estimate of {3 is now the value f3. iI L Vi[Yi i=l n ({3.2..2.2.Zi)]2 = L ~.. + /3zzi)J2 (2.filii minimizes ~  I i i L[1Ii . . we arrive at quadratic regression. we may be able to characterize the dependence of Var( ci) on Zi at least up to a multiplicative constant. Zi) = {3l + fhZi. For instance. Consider the case in which d = 2... .. Zi) + Ei . I and the Y i satisfy the assumption (2.. < n. Weighted least squares. • I' "0 r . 1 <i < n Yi n 1 . 1 n. Thus.g«(3. if we setg«(3.. Example 2. we can write (2..= g({3..2. Such models are called heteroscedastic (as opposed to the equal variance models that are homoscedastic).. and g({3. Zi) = (2.2.3. We return to this in Volume II. then = WW2/Wi = . which for given Yi = yd..15) . and (32 that minimize ~ ~ "I. Yi = 80 + 8 1Zi + 82 + Cisee Problem 2. We need to find the values {3l and /32 of (3. . [Yi . if d = 1 and we take gj (z) = zj.2 and many similar situations it may not be reasonable to assume that the variances of the errors Ci are the same for all levels Zi of the covariate variable. .5) fails.2. That is.2.z.2. However. .zd + ti. In Example 2. Zi2 = Zi.14) where (J2 is unknown as before.wi) g((3.wi  _ g((3.. _ Yi .. Zil = 1.wi.24 for more on polynomial regression. I . fi = <d. Weighted Linear Regression.wi 1 < .... . 1<i <n Wij = 9j(Zi). but the Wi are known weights. z.= y.wi..wi . Note that the variables . Zi)/.2.5). The method of least squares may not be appropriate because (2.
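Quadratic regression, yᵢ = θ₀ + θ₁zᵢ + θ₂zᵢ² + εᵢ, illustrates how the choice gⱼ(z) = zʲ turns a polynomial mean function into an ordinary linear model in θ. A sketch with simulated (hypothetical) data:

```python
import numpy as np

# Simulated data from a quadratic mean function (hypothetical values).
rng = np.random.default_rng(0)
z = np.linspace(0, 2, 25)
y = 1.0 + 2.0 * z - 0.5 * z**2 + 0.05 * rng.standard_normal(25)

# Design matrix with columns w_ij = g_j(z_i) = z_i^j, j = 0, 1, 2:
# quadratic regression is an ordinary linear model in theta.
W = np.column_stack([z**j for j in range(3)])
theta = np.linalg.lstsq(W, y, rcond=None)[0]
print(theta)  # close to (1.0, 2.0, -0.5)
```

The same device works for any p real-valued functions g₁(z), ..., g_p(z) of the covariate.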
.Section 2. Var(Z") and  _ L:~l UiZiYi .. .26.B that minimizes (2.2. .1 leading to (2. when g(.2.4.4 as follows.. suppose Var(€) = a 2W for some invertible matrix W nxn. z) = zT. When ZD has rank d and Wi > 0. .2.2. (3 and for general d.~ UiZi· n n i"~l I" n n F"l This computation suggests..6). .B.16) for g((3. we may allow for correlation between the errors {Ei}. (Zn.n.(Li~] Ui Z i)2 n 2 n (2. .18) ~ ~ ~I" (31 = E(Y') . the .~ UiYi .2.Y. as we make precise in Problem 2.2. Y*) V.. i i=l = 1. using Theorem 1.2.3.2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. . Let (Z*.2..13 minimizing the least squares contrast in this transformed model is given by (2.1. . Y*) denote a pair of discrete random variables with possible values (z" Yl). . By following the steps of Exarnple 2.28) that the model Y = ZD. Yn) and probability distribution given by PI(Z".2.(32E(Z') ~ .1) and (2.)] ~Ui.. that weighted least squares estimates are also plugin estimates.B+€ can be transformed to one satisfying (2. More generally.1. Moreover. Then it can be shown (Problem 2.8).19) and (2. we can write Remark 2.. i ~ I.2. 0 Next consider finding the.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi .20).n where Ui n = vi/Lvi. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .2. .4H2.B. We can also use the results on prediction in Section 1. we find (Problem 2. Thus. ~ ..27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI. 1 < i < n.(3.Y") ~(Zi. a__ Cov(Z"'.1 7) is equivalent to finding the best linear MSPE predictor of y . That is. Zi) = z.2...2. .7). wn ) and ZD = IIZijllnxd is the design matrix.2.
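With known weights wᵢ (so vᵢ = 1/wᵢ in the weighted contrast), the weighted least squares estimate solves Z_Dᵀ V Z_D β = Z_Dᵀ V Y with V = diag(v₁, ..., vₙ); equivalently, one transforms to a homoscedastic model and applies ordinary least squares. A sketch with made-up data and weights:

```python
import numpy as np

# Hypothetical heteroscedastic setup: Var(eps_i) = w_i * sigma^2 with
# known weights w_i, so v_i = 1 / w_i in the weighted contrast.
ZD = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 8.0]])
Y = np.array([2.1, 3.9, 8.2, 15.8])
w = np.array([1.0, 1.0, 4.0, 16.0])
V = np.diag(1.0 / w)

# Weighted least squares normal equations: (Z' V Z) beta = Z' V Y.
beta_w = np.linalg.solve(ZD.T @ V @ ZD, ZD.T @ V @ Y)

# Equivalent route: divide each row by sqrt(w_i), which makes the errors
# homoscedastic, then apply ordinary least squares.
ZT = ZD / np.sqrt(w)[:, None]
YT = Y / np.sqrt(w)
beta_t = np.linalg.lstsq(ZT, YT, rcond=None)[0]
print(np.allclose(beta_w, beta_t))  # True
```

The agreement of the two routes is the content of the transformation argument in the text: weighted least squares in the original model is ordinary least squares in the transformed one.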
OJ = (1  0)' where 0 < () < 1 (see Example 2.5.2. Example 2. A is an unknown positive constant and we wish to estimate A using X. which has the unique solution B = ~.(1.2.x=O. (2. .2. Nevertheless. . Consider a popUlation with three kinds of individuals labeled 1..28) If 2n.OJ'". x n } equal to 1. ~ maximizes Lx(B).0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.lx(0) = 5 1 0' . + n. .27) doesn't make sense. x. X2 = 2. the likelihood is {1 . Because . 0 e Example 2. represents the expected number of arrivals in an hour or. is zero. 1).. Evidently. then X has a Poisson distribution with parameter nA. 0)p(2.>.4). then I • Lx(O) ~ p(l. and 3 and occurring in the HardyWeinberg proportions p(I.O) = 0'.27) that are not maxima or only local maxima.29) '(j . Similarly. p(3. equivalently. 2. . 1.2. situations with f) well defined but (2.27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section.2.22) and (2. respectively. X3 = 1. If we observe a sample of three individuals and obtain Xl = 1. Here are two simple examples with (j real. the rate of arrival.. the MLE does not exist if 112 + 2n3 = O. I). the dual point of view of (2. where ). there may be solutions of (2..0). let nJ.2.ny l.118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables.2. 0) = 20'(1. Let X denote the number of customers arriving at a service counter during n hours. .2. 2n (2. 0) ~ 20(1 . which is maximized by 0 = 0.2. n2..l.0)' < ° for all B E (0. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive. 8' 80. O)p(l.) = ·j1 e'"p. p(2... p(X. In general.2. . } with probabilities. Here X takes on values {O.7. and n3 denote the number of {Xl.6. + n. If we make the usual simplifying assumption that the arrivals form a Poisson process. ]n practice.1. 
so the MLE does not exist because = (0. and as we have seen in Example 2. the maximum likelihood estimate exists and is given by 8(x) = 2n. 2 and 3.
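The closed-form Hardy–Weinberg MLE θ̂ = (2n₁ + n₂)/2n derived above can be checked against a direct maximization of the log likelihood n₁ log θ² + n₂ log 2θ(1 − θ) + n₃ log(1 − θ)² over a fine grid (the counts used here are hypothetical):

```python
import math

# Hypothetical Hardy-Weinberg counts for categories 1, 2, 3.
n1, n2, n3 = 10, 12, 8
n = n1 + n2 + n3

# Closed-form MLE: theta_hat = (2 n1 + n2) / (2 n).
theta_hat = (2 * n1 + n2) / (2 * n)

# Log likelihood under the Hardy-Weinberg proportions
# p1 = theta^2, p2 = 2 theta (1 - theta), p3 = (1 - theta)^2.
def loglik(t):
    return (n1 * math.log(t * t) + n2 * math.log(2 * t * (1 - t))
            + n3 * math.log((1 - t) ** 2))

# A crude grid search over (0, 1) agrees with the closed form.
grid = [i / 10000 for i in range(1, 10000)]
theta_grid = max(grid, key=loglik)
print(theta_hat, theta_grid)  # both close to 0.5333
```

Note that when 2n₁ + n₂ = 0 or n₂ + 2n₃ = 0 the likelihood is maximized only at the boundary, matching the nonexistence discussion in the text.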
j = 1. let B = P(X. If x = 0 the MLE does not exist. Example 2.\ I 0.6.2. ~ n . 8' (2... the maximum is approached as .d.2.. ~ ~ . A sufficient condition. familiar from calculus.30)). Multinomial Trials.2. .32) to find I.2. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category.32). this is well known to be equivalent to ~ e. J = I.. trials in which each trial can produce a result in one of k categories.32) We first consider the case with all the nj positive. is that l be concave in If l is twice differentiable. L. = jl. .30) for all This is the condition we applied in Example 2. k.2.logB. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. (see (2. .LBj j=1 • (2.Section 2. the MLE must have all OJ > 0. Then.. However. 31=1 By (2. consider an experiment with n Li.ekl with kl Ok ~ 1. p(x.8B " 1=13 k k =0. j=d (2. 80... 8B. and k j=1 k Ix(fJ) = LnjlogB j . = j) be the probability J of the jth category.8. fJ) = TI7~1 B7'. As in Example 1. thus. 8) = 0 if any of the 8j are zero. this estimate is the MLE of ). 8Bk/8Bj (2.o. 0 To apply the likelihood equation successfully we need to know when a solution is an MLE.2. . . = x/no If x is positive.. . Then p(x.31 ) To obtain the MLE (J we consider l as a function of 8 1 . e. for an experiment in which we observe nj ~ 2:7 1 I[X.2. A similar condition applies for vector parameters. = L.lx(B) <0. fJE6= {fJ:Bj >O'LOj ~ I}..2.n...2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution). Let Xi = j if the ith trial produces a result in the jth category.1. We assume that n > k ~ 1. and the equation becomes (BkIB j ) n· OJ = .6.kl.7.
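The multinomial MLE θ̂ⱼ = nⱼ/n is just the vector of observed frequencies; a tiny sketch (the sample below is hypothetical) also confirms that no competing probability vector has higher likelihood:

```python
import math
from collections import Counter

# Hypothetical multinomial sample with k = 3 categories.
x = [1, 2, 2, 3, 1, 2, 1, 1, 3, 2]
n = len(x)
counts = Counter(x)

# MLE: theta_hat_j = n_j / n, the observed frequency of category j.
theta_hat = {j: counts[j] / n for j in (1, 2, 3)}
print(theta_hat)  # {1: 0.4, 2: 0.4, 3: 0.2}

def loglik(theta):
    return sum(counts[j] * math.log(theta[j]) for j in (1, 2, 3))

# An arbitrary competing probability vector cannot beat the MLE.
alt = {1: 0.5, 2: 0.3, 3: 0.2}
print(loglik(theta_hat) >= loglik(alt))  # True
```

The strict concavity argument in the text guarantees this comparison comes out in the MLE's favor for every alternative with all coordinates positive.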
we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. Suppose that Xl. .28. . This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • . Then Ix «(3) log IT ~'P (V.. we can consider minimizing L:i.. X n are Li.3. As we have IIV.) = gi«(3). Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. ((g((3.1 L:~ .34) g«(3. YnJT. n.. Zi)J[l:} .1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.6. Using the concavity argument. wi(5). See Problem 2.(Xi . Summary.1 holds and X = (Y" . Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O). o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2.33) It follows that in this nj > 0.11(a))... z.. Example 2. (2. .Zi)]2. Zi).120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). . 0..2. Suppose the model Po of Example 2.202 L...gi«(3) 1 . then <r <k I.d.). Then (} with OJ = njjn.. i=l g«(3. The 0 < OJ < 1. ~ g((3.2.1. g«(3.g«(3. k. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3. More generally.. Thus. 1 < i < n.2.2. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) . see Problem 2. where E(V.X)2 (Problem 2. zn) 03W. i = 1. 'ffi f. .' . In Section 2..30. . 1 < j < k..1). these estimates viewed as an algorithm applied to the set of data X make sense much more generally. we check the concavity of lx(O): let 1 1 <j < k . ~ g«(3. gi.2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood.1.9. is still the unique MLE of fJ. N(lll 0. version of this example will be considered in the exponential family case in Section 2. . j = 1.2 ) with J1..IV.2. OJ > ~ 0 case. . .. are known functions and {3 is a parameter to be estimated from the independent observations YI .lV. where Var(Yi) does not depend on i. and 0.2. 
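The Gaussian MLEs above, μ̂ = X̄ and σ̂² = n⁻¹Σ(Xᵢ − X̄)² (note the divisor n rather than n − 1), can be verified on a small hypothetical sample by checking that the log likelihood does not increase under nearby parameter values:

```python
import math

# Hypothetical i.i.d. N(mu, sigma^2) sample.
x = [2.1, 1.7, 3.0, 2.4, 1.8]
n = len(x)

mu_hat = sum(x) / n
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n  # divisor n, not n - 1

# Gaussian log likelihood as a function of (mu, sigma^2).
def loglik(mu, s2):
    return (-0.5 * n * math.log(2 * math.pi * s2)
            - sum((xi - mu) ** 2 for xi in x) / (2 * s2))

print(loglik(mu_hat, sigma2_hat) >= loglik(mu_hat + 0.1, sigma2_hat))  # True
print(loglik(mu_hat, sigma2_hat) >= loglik(mu_hat, sigma2_hat * 1.2))  # True
```

Replacing the divisor n by n − 1 gives the usual unbiased variance estimate, which is not the MLE.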
zi)1 .2. Next suppose that nj = 0 for some j. least squares estimates are maximum likelihood for the particular model Po. as maximum likelihood estimates for f3 when Y is distributed as lV. 1 Yn .2 both unknown.  seen and shall see more in Section 6.
a < b< oo} U {(a.. = RxR+ and ee e.1 and 1.3. This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI. m.3 and 1.3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly.6. Extensions to weighted least squares. including all points with ±oo as a coordinate.5. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. Suppose also that e t R where e c RP is open and 1 is (2. In Section 2.1 only. Proof. We start with a useful general framework and lemma. For instance.4 and Corollaries 1. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.1) lim{l(li) : Ii ~ ~ De} = 00. m).00.6.3. though the results of Theorems 1.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil. Suppose we are given a function 1: continuous. Formally. or 8 1nk diverges with 181nk I .2 and other exponential family properties also playa role.3. (m.3. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d.Section 2. In the case of independent response variables Yi that are modeled to have a N(9i({3). (m. (}"2) distribution. in the N(O" O ) case. e e I"V e De= {(a. are given. Properties that derive solely fTOm concavity are given in Propositon 2.. m. (a.. if X N(B I .2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. . That is. (a. as k . it is shown that the MLEs coincide with the LSEs. (m. .zid. b). oo]P.1. For instance. . These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. 
which are appropriate when Var(Yj) depends on i or the Y's are correlated.2.bE {a.a= ±oo.3.1. In general.1 ). z=1=1 ~ 2.I ) all tend to De as m ~ 00. b) : aER. b).oo}}. Let &e = be the boundary of where denotes the closure of in [00. (h).ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e. for a sequence {8 m } of points from open. Suppose 8 c RP is an open set.00. See Problem 2. we define 8 1n . where II denotes the Euclidean norm.6.6. Concavity also plays a crucial role in the analysis of algorithms in the next section. 2 e Lemma 2. 8). a8 is the set of points outside of 8 that can be obtained as limits of points in e.b) .
Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. (ii) The family is of rank k.122 Methods of Estimation Chapter 2 Proposition 2.3. Then.'1 .. Suppose P is the canonical exponential/amity generated by (T. We can now prove the following. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT. Furthennore. lI(x) =  exists. ' L n .(11) ~ 00 as densities p(x.3. which implies existence of 17 by Lemma 2. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) .1. Existence and Uniqueness o/the MLE ij. i . Write 11 m = Am U m . If 0. Suppose the cOlulitions o/Theorem 2. a contradiction.  (hO! + 0. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. then the MLE8(x) exists and is unique.1 hold.1.. '10) for some reference '10 E [ (see Problem 1. Applications of this theorem are given in Problems 2.3) has i 1 !.3. From (B.3. with corresponding logp(x. no solution. thus. open C RP. We show that if {11 m } has no subsequence converging to a point in E. . II E j.9) we know that II lx(lI) is continuous on e. Let x be the observed data vector and set to = T(x).to.3.3) I I' (b) Conversely. " We. I I. and 0. /fCT is the convex suppon ofthe distribution ofT (X).1.8 and 2.1.)) 0 Themem 2.3. We give the proof for the continuous case. . are distinct maximizers.. if lx('1) logp(x. Proofof Theorem 2.3. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2.6.2). Suppose X ~ (PII .)) > ~ (fx(Od +lx (0. .3. and is a solution to the equation (2. is open. Without loss of generality we can suppose h(x) = pix. ! " .3.1.). is unique.'1) with T(x) = 0.1.2) then the MLE Tj exists. Corollary 2. if to doesn't satisfy (2. U m = Jl~::rr' Am = .3. then the MLE doesn't exist and (2. II) is strictly concave and 1. = . By Lemma 2. then Ix lx(O.12. [ffunher {x (II) e = e e t 88. II). h) and that (i) The natural parameter space. then lx(11 m ) . E. 
" .3.3.3.27).00. ~ Proof.
So In either case limm. T(X) has a density and.3.3. u mk t u. this is the exponential family generated by T(X) = 1 Xi. um~. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. contradicting the assumption that the family is of rank: k.3. The equivalence of (2. t u.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2.3. . fOT all 1].1. Then.4.d. CT = R )( R+ FOT n > 2. CT = C~ and the MLE always exists.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1.k Lx( 11m.5.3. Theorem 2.9. Nonexistence: if (2. Suppose X"".3). 0 Example 2.3. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0. 0 ° '* '* PrOD/a/Corollary 2. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets.6.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it.J = 00.* P'IlcTT = 0] = I.Xn are i. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist. As we observed in Example 1. lIu m ll 00. (1"2 > O. Por n = 1. which is impossible. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows..3.3. It is unique and satisfies x (2.6.i. that is. Po[uTT(X) > 61 > O. 0 (L:7 L:7 CJ. Then because for some 6 > 0. So. Suppose the conditions ofTheorem 2. By (B. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0. existence of MLEs when T has a continuous ca.3) by TheoTem 1.1.1 follow. Write Eo for E110 and Po for P11o . iff. N(p" ( 2 ). I Xl) and 1.se density is a general phenomenon.2) and COTollary 2.2) fails. Then AU ¢ £ by assumption. In fact. . I' E R. So we have Case 2: Amk t A.2.Section 2. Case 1: Amk t 111]=11. thus.3. Evidently. The Gaussian Model. for every d i= 0.
3.. I < j < k. Thasadensity.1 in the second. o Remark 2. h(x) = XI. Suppose Xl. A nontrivial application of Theorem 2.(X) = r~.2(a).2 follows. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2.3. I < j < k.1.)e>"XxPI. The statistic of rank k . if T has a continuous case density PT(t).T(X) = A(.2. They are determined by Aj = Tj In. For instance.Xn are i. We assume n > k ..2.I..2(b» r' log . liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2. Thus. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .5) ~=X A where log X ~ L:~ t1ogXi.13 and the next example). Example 2. we see that (2.1 that in this caseMLEs of"'j = 10g(AjIAk). The likelihood equations are equivalent to (problem 2. where 0 < Aj PIX = j] < I. This is a rank 2 canonical exponential family generated by T   = (L log Xi.ln.I which generates the family is T(k_l) = (Tl>"" T. It is easy to see that ifn > 2. Multinomial Trials. with i'. where Tj(X) L~ I I(Xi = j)..124 Methods of Estimation Chapter 2 Proof. The boundary of a convex set necessarily has volume 0 (Problem 2. with density 9p.4) (2.5) have a unique solution with probability 1.3.i.1.3.). and one of the two sums is nonempty because c oF 0.3.3 we know that E. lit ~ .3.3.3. In Example 3. exist iff all Tj > O.d. Example 2. Because the resulting value of t is possible if 0 < tjO < n. From Theorem 1.3. Here is an example. the maximum likelihood principle in many cases selects the «best" estimate among them. Thus.9). in the HardyWeinberg examples 2.n.. = = L L i i bJ _ I . I < j < k.3. using (2.3.6.I and verify using Theorem 2. >.7.3.4) and (2. then and the result follows from Corollary 2. LXi). I < j < k iff 0 < Tj < n.3.. . thus. the best estimate of 8.2) holds. I <j < k._t)T.4. 0 = If T is discrete MLEs need not exist. 
How to find such nonexplicit solutions is discussed in Section 2. x > 0.4 and 2.6.3.3. by Problem 2. To see this note that Tj > 0.3).2 that (2.>.. When method of moments and frequency substitution estimates are not unique. but only Os is a MLE.... I <t< k . We conclude from Theorem 2. > O.3. We follow the notation of Example 1. in a certain sense.= r(jJ) A log X (2.1. The TwoParameter Gamma Family. P > 0.4 we will see that 83 is.1. .4.1.6.3.1)..
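For the two-parameter gamma family, the likelihood equations reduce to λ̂ = p̂/X̄ together with a single equation in p̂, log p̂ − ψ(p̂) = log X̄ − mean(log Xᵢ), where ψ = Γ′/Γ. The following sketch solves that equation by bisection; it uses a finite-difference ψ built from `math.lgamma` (an illustrative shortcut, not a production digamma), and it assumes the observations are positive and not all equal:

```python
import math

def digamma(p, h=1e-6):
    # Finite-difference approximation to Gamma'(p)/Gamma(p) via math.lgamma;
    # adequate for this illustration, not a production digamma.
    return (math.lgamma(p + h) - math.lgamma(p - h)) / (2 * h)

def gamma_mle(xs):
    # Solves the likelihood equations: lambda_hat = p_hat / xbar and
    # log p_hat - digamma(p_hat) = log xbar - mean(log x).
    n = len(xs)
    xbar = sum(xs) / n
    s = math.log(xbar) - sum(math.log(x) for x in xs) / n  # > 0 by Jensen
    lo, hi = 1e-4, 1.0
    while math.log(hi) - digamma(hi) > s:  # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):  # log p - digamma(p) is strictly decreasing in p
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > s:
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    return p, p / xbar

p_hat, lam_hat = gamma_mle([0.5, 1.2, 0.9, 2.3, 1.7, 0.8])
print(p_hat, lam_hat)
```

This is precisely the kind of nonexplicit solution the next section's algorithms address: the shape equation has no closed form, but monotonicity makes bisection reliable.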
the MLE does not exist if 8 ~ (0. 0 < j < k .6) on c( II) = ~~. note that in the HardyWeinberg Example 2. Corollary 2. Let Q = {PII : II E e).I.3.13).1.3.3. . and unicity can be losttake c not onetoone for instance. x E X. h).3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand. If the equations ~ have a solution B(x) E Co..1 and Haberman (1974). for example.3. Alternatively we can appeal to Corollary 2.k. e open C R=.2) so that the MLE ij in P exists. m < k . .A(c(ii)) ~ = O.1 is useful. when we put the multinomial in canonical exponential family form.1. if 201 + 0. .2.2.3 can be applied to determine existence in cases for which (2. lfP above satisfies the condition of Theorem 2. if any TJ = 0 or n. In some applications.3. . our parameter set is open.Section 2..3. When P is not an exponential family both existence and unicity of MLEs become more problematic.~l Aj = I}. then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to .10).1 we can obtain a contradiction to (2.exist and are unique.8see Problem 2. Here [ is the natural paramerer space of the exponential fantily P generated by (T. whereas if B = [0.1.3. .3.2. the following corollary to Theorem 2.7) Note that c(lI) E c(8) and is in general not ij. Similarly. Let CO denote the interior of the range of (c. mxk e. Ck (B)) T and let x be the observed data. n > kl.. .3. BEe. Consider the exponential family k p(x. Unfortunately strict concavity of Ix is not inherited by curved exponential families. 0 The argument of Example 2.3. The following result can be useful.3.1] it does exist ~nd is unique. Then Theorem 2. . Remark 2. In Example 2.3) does not have a closedform solution as in Example 1. = 0..1 directly D (Problem 2. c(8) is closed in [ and T(x) = to satisfies (2.6.3. 1)T.lI) ~ exp{cT (II)T(x) .1.3.3.3.A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2..8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0. 
L.B) ~ h(x)exp LCj(B)Tj(x) . (II) . (2.2) by taking Cj = 1(i = j). the bivariate normal case (Problem 2.. the MLEs ofA"j = 1.6. (B).3. The remaining case T k = a gives a contradiction if c = (1.1). then it is the unique MLE ofB. However.B(B) j=l . 1 < i < k . be a cUlVed exponential family p(x.
it. .6..4.126 The proof is sketched in Problem 2..?' m)T " ". generated by h(Y) = 1 and "'> I T(Y) '.d. < O}. . L: Xi and I. m} is a 2nparameter canonical exponential family with 'TJi = /tila./2'1' . LocationScale Regression. . Methods of Estimation Chapter 2 family with (It) = C2 (It) = .5 and 1.7) if n > 2.. which implies i"i+ > 0. 9) is a curved exponential family of the form (2. As a consequence of Theorems 2.i. Gaussian with Fixed Signal to Noise. Evidently c(8) = {(lh.3.. Using Examples 1. i Example 2.\()'. Jl > 0.. Next suppose.5X' +41i2]' < 0.n. Ii± ~ ~[)..ry.. t. I .'. m m m f?.3)T.. . we can conclude that an MLE Ii always exists and satisfies (2. i = 1.10.~Y'1. < Zn where Zt.6.A6iL2 = 0 Ii..) : ry.. = ( ~Yll'''' .!0b''.2 . af = f)3(8 1 + 82 z i )2..3.. with I.it.g( it' .3..1/'1') T .0'2) with Jl/ a = AO > a known. .< O.2 and 2.2~~2 .Zn are given constants.6.. C(O) and from Example 1. ).6) with " • . This is a curved exponential ¥. as in Example 1. . l = 1.nit.) : ry. = L: xl.Xn are i. . simplifies to Jl2 + A6XIl. Because It > 0.. Equation (2. which is closed in E = {( ryt> ry. are n independent random samples. j = 1. where 0l N(tLj 1 0'1).oJ). We find CI Example 2.5 = . .6. '1. = 1 " = 2 n( ryd'1' ..3 )(t. .3. 'TJn+i = 1/2a. corresponding to 1]1 = //x' 1]2 = .~Yn... E R.3. we see that the distribution of {01 : j = 1. n. .6.: Note that 11+11_ = '6M2 solution we seek is ii+. .. that . Suppose that }jI •.'' .7) becomes = 0.3. 11. l}m..ry. suppose Xl.10.9. tLi = (}1 + 82 z i . = ~ryf.5. the o q.. ry. '() A 1/ Thus.gx±).". .11.3. Zl < .n(it' + )"5it'))T which with Ji2 = n. N'(fl. As in Example 1. 1 n.2.I LX.ryl > O. .3. Now p(y..3.. < O}. . .
If m ≥ 2, then the full 2n-parameter model satisfies the conditions of Theorem 2.3.1. Let E be the canonical parameter set for this full model and let Θ = {θ : θ_1 ∈ R, θ_2 ∈ R, θ_3 > 0}. Then c(Θ) is closed in E and we can conclude that for m ≥ 2 an MLE θ̂ of θ exists and satisfies (2.3.7). □

Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with E open (Theorem 2.3.1 and Corollary 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families, such as the two-parameter gamma, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix Z_D of full rank, the formula (2.1.10) for β̂ is easy to write down symbolically but not easy to evaluate if d is at all large, because inversion of Z_D^T Z_D requires on the order of nd^2 operations to evaluate each of the d(d+1)/2 terms, with n operations to get Z_D^T Z_D, and then order d^3 operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section, we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous on (a, b), f strictly increasing, with f(a+) < 0 < f(b-), then, by the intermediate value theorem, there exists a unique x* ∈ (a, b) such that f(x*) = 0. Given tolerance ε > 0 for |x_final - x*|: find x_0 < x* < x_1 with f(x_0) < 0 < f(x_1) by taking |x_0|, |x_1| large enough. Here, in pseudocode, is the bisection algorithm to find x*.
Initialize x_old^0 = x_0, x_old^1 = x_1.

(1) If |x_old^1 - x_old^0| < 2ε, set x_final = (1/2)(x_old^0 + x_old^1) and return x_final.
(2) Else, set x_new = (1/2)(x_old^0 + x_old^1).
(3) If f(x_new) = 0, set x_final = x_new and return x_final.
(4) If f(x_new) < 0, set x_old^0 = x_new. Go to (1).
(5) If f(x_new) > 0, set x_old^1 = x_new. Go to (1).

End

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final - x*| < ε.

Proof. If x_m is the mth iterate of x_new, then x* always lies between x_old^0 and x_old^1, and the length of that bracket is halved at each step. Therefore |x_m - x*| ≤ 2^{-m}|x_1 - x_0|, which is smaller than ε for m ≥ log_2(|x_1 - x_0|/ε). □

If desired, one could evidently also arrange it so that, in addition, |f(x_final)| < ε.

From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | η) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and let t_0 ∈ C_T^0, the interior (a, b) of the convex support of P_T. Then the MLE η̂, which exists and is unique by Theorem 2.3.1, may be found (to tolerance ε) by the method of bisection applied to

    f(η) = E_η T(X) - t_0.

Proof. By the results of Section 1.6, f'(η) = Var_η T(X) > 0 for all η, so that f is strictly increasing and continuous; because η̂ exists, f takes values on both sides of 0, and the result follows from Lemma 2.4.1. □

Example 2.4.1. The Shape Parameter Gamma Family. Let X_1, ..., X_n be i.i.d. with the gamma shape density x^{θ-1} e^{-x} / Γ(θ), x > 0, θ > 0.
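The steps above can be sketched in runnable form. This is an illustrative implementation of the pseudocode, not the authors' code; the function name and the toy equation solved at the end are our own.

```python
import math

def bisection(f, x0, x1, tol=1e-8):
    """Steps (1)-(5) above: f continuous and strictly increasing with
    f(x0) < 0 < f(x1); returns x_final with |x_final - x*| < tol."""
    if not (f(x0) < 0 < f(x1)):
        raise ValueError("need f(x0) < 0 < f(x1)")
    while abs(x1 - x0) >= 2 * tol:          # step (1)
        x_new = 0.5 * (x0 + x1)             # step (2)
        f_new = f(x_new)
        if f_new == 0:                      # step (3)
            return x_new
        if f_new < 0:                       # step (4): root lies in (x_new, x1)
            x0 = x_new
        else:                               # step (5): root lies in (x0, x_new)
            x1 = x_new
    return 0.5 * (x0 + x1)

# Toy MLE-style equation: solve e^eta - 2 = 0 (a strictly increasing f),
# whose root is log 2.
root = bisection(lambda e: math.exp(e) - 2.0, -10.0, 10.0)
```

In a one-parameter canonical exponential family one would take f(η) = A'(η) - t_0, as in Theorem 2.4.1.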
Because T(X) = Σ_{i=1}^n log X_i has a density for all n, the MLE always exists. It solves the equation

    Γ'(θ)/Γ(θ) = T(X)/n,

which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty. The function Γ(θ) = ∫_0^∞ x^{θ-1} e^{-x} dx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB; indeed, bisection itself is a defined function in some packages. □

2.4.2 Coordinate Ascent

The problem we consider is to solve numerically, for a canonical k-parameter exponential family,

    E_η(T(X)) = Ȧ(η) = t_0,

when the MLE η̂ = η̂(t_0) exists. Here is the algorithm, which is slow, but, as we shall see, eventually always converges to η̂.

The case k = 1: see Theorem 2.4.1.

The general case: Initialize η̂^0 = (η̂_1^0, ..., η̂_k^0). Solve for η̂_1^1:

    (∂A/∂η_1)(η̂_1^1, η̂_2^0, ..., η̂_k^0) = t_1,

then for η̂_2^1: (∂A/∂η_2)(η̂_1^1, η̂_2^1, η̂_3^0, ..., η̂_k^0) = t_2, and so on, finally solving (∂A/∂η_k)(η̂_1^1, ..., η̂_{k-1}^1, η̂_k^1) = t_k to complete the first cycle, η̂^{(1)} = (η̂_1^1, ..., η̂_k^1). Repeat, getting η̂^{(r)}, r ≥ 1.

Notes: (1) In practice, we would again set a tolerance, say ε, for each of the η̂'s in cycle j and stop, possibly in mid-cycle, as soon as all the changes are smaller than ε.
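To make the cycle structure concrete, here is a minimal coordinate ascent sketch (our own, with hypothetical names) on a toy strictly concave objective where each one-dimensional maximization has a closed form. In the exponential-family setting each inner step would instead solve ∂A/∂η_j = t_j, for example by bisection.

```python
def coordinate_ascent(argmax_coord, x0, tol=1e-10, max_cycles=500):
    """Cycle through coordinates, maximizing over one coordinate at a
    time with the others held fixed; stop when a full cycle moves less
    than tol in every coordinate."""
    x = list(x0)
    for _ in range(max_cycles):
        x_prev = x[:]
        for j in range(len(x)):
            x[j] = argmax_coord(j, x)
        if max(abs(a - b) for a, b in zip(x, x_prev)) < tol:
            break
    return x

# Toy strictly concave objective l(x1, x2) = -x1^2 - x2^2 - x1*x2 + x1 + x2.
# Setting dl/dx_j = 0 with the other coordinate fixed gives the update below.
def argmax_coord(j, x):
    other = x[1 - j]
    return (1.0 - other) / 2.0   # solves -2*x_j - other + 1 = 0

xhat = coordinate_ascent(argmax_coord, [0.0, 0.0])
```

For this objective the maximizer is (1/3, 1/3), and the iterates converge to it geometrically, mirroring the behavior described in Theorem 2.4.2 below.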
Theorem 2.4.2. Suppose the conditions (i) and (ii) of Theorem 2.3.1 hold and t_0 ∈ C_T^0. Then, with η̂^{(r)} as above,

    η̂^{(r)} → η̂ as r → ∞.

Proof. We give a series of steps. Write l(η) = t_0^T η - A(η) + log h(x) for the log likelihood, and let η̂^{ij} denote the jth intermediate iterate within cycle i, so that η̂^{ij} and η̂^{i(j+1)} differ in only one coordinate, for which η̂^{i(j+1)} maximizes l.

(1) l(η̂^{ij}) increases in j for i fixed and in i.
(2) The sequence (η̂^{j1}, ..., η̂^{jk}) has a convergent subsequence, with limits η̄^1, ..., η̄^k.
(3) λ = lim_j l(η̂^{jk}) exists and is > -∞; moreover l(η̄^j) = λ for all j because the sequence of likelihoods is monotone.
(4) (∂l/∂η_1)(η̄^1) = 0, because ∂A/∂η_1 is the expectation of T_1(X) in the one-parameter exponential family model with all parameters save η_1 assumed known.
(5) Because η̄^1 and η̄^2 differ only in the second coordinate, and similarly down the chain, (3) and (4) imply η̄^1 = ... = η̄^k. Here we use the strict concavity of l.
(6) By (4) and (5), Ȧ(η̄^1) = t_0. Thus, η̄^1 is the unique MLE.

To complete the proof, notice that if η̂^{(r_k)} is any subsequence of η̂^{(r)} that converges to η̄' (say), then, by (1), l(η̄') = λ. Because l(η̄^1) = λ and the MLE is unique, η̄' = η̄^1 = η̂. By a standard argument it follows that η̂^{(r)} → η̂ as r → ∞. □

It is natural to ask what happens if t_0 ∉ C_T^0, that is, if the MLE η̂ doesn't exist. Fortunately, in these cases the algorithm, as it should, refuses to converge in η space: for some coordinates l, η̂_l^j tends to infinity (see the problems). We pursue this discussion next.

Example 2.4.2. The Two-Parameter Gamma Family (continued). We use the notation of the two-parameter gamma examples of Sections 2.1 and 2.3. For n ≥ 2 we know the MLE exists. We can initialize with the method of moments estimate. Each cycle has two steps: we first use bisection to get p̂^{(1)} solving

    Γ'(p̂^{(1)})/Γ(p̂^{(1)}) = n^{-1} Σ log X_i + log λ̂^{(0)},

and then set λ̂^{(1)} = p̂^{(1)}/X̄, and so on. This two-dimensional problem is essentially no harder than the one-dimensional problem of Example 2.4.1, because the equation leading to λ_new given p_old is computationally explicit and simple. □

We note some important generalizations. Suppose that we can write η^T = (η_1^T, ..., η_r^T), where η_j has dimension d_j and Σ_{j=1}^r d_j = k, and that the problem of obtaining η̂_l(t_0, η̂_j, j ≠ l), the maximizer over the block η_l with the other blocks held fixed, can be solved in closed form. Then each step of the iteration, both within cycles and from cycle to cycle, is quick; whenever we can obtain such steps in algorithms, they result in substantial savings of time. In this form the algorithm may be viewed as successive fitting of lower-dimensional, in particular one-parameter, families.
The case we have just discussed has d_1 = ... = d_r = 1, r = k. A special case of this is the famous Deming-Stephan proportional fitting of contingency tables algorithm; see Bishop, Feinberg, and Holland (1975) and the problems.

Next consider the setting of Proposition 2.3.1, in which l_x(θ), the log likelihood for θ ∈ Θ open ⊂ R^p, is strictly concave. If θ̂(x) exists and l_x is differentiable, the method extends straightforwardly: at each stage, with the other coordinates fixed, solve

    (∂l_x/∂θ_j)(θ̂_1, ..., θ̂_{j-1}, θ_j, θ_{j+1}^0, ..., θ_p^0) = 0

by the method of bisection in θ_j to get θ̂_j for j = 1, ..., p; iterate and proceed. Then it is easy to see that Theorem 2.4.2 has a generalization with cycles of length r. Figure 2.4.1 illustrates the process.

[Figure 2.4.1: contour plot omitted.] Figure 2.4.1. The coordinate ascent algorithm. The graph shows log likelihood contours, that is, values of (θ_1, θ_2)^T where the log likelihood is constant. At each stage with one coordinate fixed, find that member of the family of contours to which the vertical (or horizontal) line is tangent. Change other coordinates accordingly.

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, which we now sketch.
2.4.3 The Newton-Raphson Algorithm

An algorithm that, in general, can be shown to be faster than coordinate ascent when it converges is the Newton-Raphson method. It requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge. Here is the method: If η̂_old is the current value of the algorithm, then

    η̂_new = η̂_old - Ä^{-1}(η̂_old)(Ȧ(η̂_old) - t_0).   (2.4.2)

The rationale here is simple. If η̂_old is close to the root η̂ of Ȧ(η̂) = t_0, then by expanding Ȧ(η̂) around η̂_old we obtain

    t_0 = Ȧ(η̂) ≈ Ȧ(η̂_old) + Ä(η̂_old)(η̂ - η̂_old),   (2.4.3)

and η̂_new is the solution for η of the approximate equation given by the right- and left-hand sides. If η̂_old is close enough to η̂, this method is known to converge to η̂ at a faster rate than coordinate ascent; see Dahlquist, Björck, and Anderson (1974). A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in the problems.

Example 2.4.3. Let X_1, ..., X_n be a sample from the logistic distribution with d.f.

    F(x, θ) = [1 + exp{-(x - θ)}]^{-1}.

The density is

    f(x, θ) = exp{-(x - θ)} / [1 + exp{-(x - θ)}]^2.

We find

    l'(θ) = n - 2 Σ_{i=1}^n exp{-(X_i - θ)} F(X_i, θ),
    l''(θ) = -2 Σ_{i=1}^n f(X_i, θ) < 0.

The Newton-Raphson iteration θ_new = θ_old - l'(θ_old)/l''(θ_old) can be implemented by taking θ_old = X̄ initially. □

The Newton-Raphson algorithm has the property that for large n, θ̂_new after only one step behaves approximately like the MLE; we return to this property in the problems of Chapter 6. When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum; a one-dimensional problem in which such difficulties arise is given in the problems. Newton's method also extends to the framework of Proposition 2.3.1: in general, if l(θ) denotes the log likelihood, the iteration replaces Ȧ - t_0 and Ä in (2.4.2) by the gradient and Hessian of -l.
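For Example 2.4.3 the iteration can be written out directly. The following sketch is our own (with θ initialized at X̄ as suggested above); it uses l'(θ) = Σ(2F(X_i, θ) - 1) and l''(θ) = -2 Σ f(X_i, θ), which are algebraically equivalent to the expressions in the example since exp{-(x - θ)}F(x, θ) = 1 - F(x, θ).

```python
import math

def logistic_mle_newton(x, theta0=None, tol=1e-10, max_iter=100):
    """Newton-Raphson for the location MLE of a logistic sample:
    theta_new = theta_old - l'(theta_old)/l''(theta_old)."""
    theta = sum(x) / len(x) if theta0 is None else theta0
    for _ in range(max_iter):
        F = [1.0 / (1.0 + math.exp(-(xi - theta))) for xi in x]
        score = sum(2.0 * Fi - 1.0 for Fi in F)          # l'(theta)
        hess = -2.0 * sum(Fi * (1.0 - Fi) for Fi in F)   # l''(theta) < 0
        step = score / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta
```

Because l is strictly concave here, the iteration is well behaved; by symmetry of the logistic density, a symmetric sample has its MLE at the center of symmetry.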
2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, X ~ P_θ with density p(x, θ), θ ∈ Θ ⊂ R^d. Their log likelihood l_{p,x}(θ) = log p(x, θ) is "easy" to maximize: say there is a closed-form MLE, or at least l_{p,x}(θ) is concave in θ. Unfortunately, what we observe is not X but S = S(X) ~ Q_θ, with density q(s, θ), and l_{q,s}(θ) = log q(s, θ) is difficult to maximize; the function is not concave, is difficult to compute, and so on. A fruitful way of thinking of such problems is in terms of S as representing part of X, the rest of X being "missing"; its "reconstruction" is part of the process of estimating θ by maximum likelihood. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists.

The algorithm was formalized with many examples in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). Many examples and important issues and methods are discussed in Little and Rubin (1987) and McLachlan and Krishnan (1997), to which we refer for detailed discussion. We give a few examples of situations of the foregoing type in which it is used. A prototypical example follows.

Example 2.4.4. Lumped Hardy-Weinberg Data. As in our earlier Hardy-Weinberg examples, let X_i = (ε_{i1}, ε_{i2}, ε_{i3}), 1 ≤ i ≤ n, be a sample from a population in Hardy-Weinberg equilibrium for a two-allele locus, with

    P_θ[X_i = (1,0,0)] = θ^2,  P_θ[X_i = (0,1,0)] = 2θ(1-θ),  P_θ[X_i = (0,0,1)] = (1-θ)^2,  0 < θ < 1.

What is observed, however, is not X but S = S(X), where

    S_i = X_i, 1 ≤ i ≤ m;  S_i = (ε_{i1} + ε_{i2}, ε_{i3}), m+1 ≤ i ≤ n.   (2.4.4)

This could happen if, for some individuals, the homozygotes of one type (ε_{i1} = 1) could not be distinguished from the heterozygotes (ε_{i2} = 1). The log likelihood of S now is

    l_{q,s}(θ) = Σ_{i=1}^m [ 2ε_{i1} log θ + ε_{i2} log 2θ(1-θ) + 2ε_{i3} log(1-θ) ]
               + Σ_{i=m+1}^n [ (ε_{i1}+ε_{i2}) log(1-(1-θ)^2) + 2ε_{i3} log(1-θ) ],   (2.4.5)

a function that is of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy-Weinberg canonical exponential family example. □

Here is another important example.
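Anticipating the EM iteration worked out for this example later in the section, here is a numerical sketch (our own, with hypothetical function and argument names). The E-step uses P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ/(2-θ), which follows from Bayes rule, so each lumped individual with ε_{i1} + ε_{i2} = 1 contributes 1 + θ/(2-θ) = 2/(2-θ) to E_θ(2N_{1n} + N_{2n} | S); the M-step is the complete-data MLE θ = (2N_{1n} + N_{2n})/2n.

```python
def em_lumped_hw(n1m, n2m, m_lumped, n, theta0=0.5, n_iter=200):
    """EM for lumped Hardy-Weinberg data (illustrative sketch).
    n1m, n2m: counts of (1,0,0) and (0,1,0) among the fully observed X_i;
    m_lumped: number of lumped observations with eps_i1 + eps_i2 = 1;
    n: total sample size."""
    theta = theta0
    for _ in range(n_iter):
        # E-step: expected complete-data statistic E_theta(2 N_1n + N_2n | S)
        t = 2 * n1m + n2m + 2.0 * m_lumped / (2.0 - theta)
        # M-step: complete-data MLE
        theta = t / (2.0 * n)
    return theta

theta_hat = em_lumped_hw(n1m=2, n2m=3, m_lumped=3, n=10)
```

The fixed point solves 2nθ(2-θ) = (2N_{1m}+N_{2m})(2-θ) + 2M, a quadratic with a unique root in (0, 1) for these counts.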
Example 2.4.5. Mixture of Gaussians. Suppose S_1, ..., S_n is a sample from a population P whose density is modeled as a mixture of two Gaussian densities,

    p(s, θ) = (1 - λ) φ_{σ_1}(s - μ_1) + λ φ_{σ_2}(s - μ_2),

where θ = (λ, μ_1, μ_2, σ_1^2, σ_2^2), 0 < λ < 1, σ_1, σ_2 > 0, μ_1, μ_2 ∈ R, and φ_σ(s) = σ^{-1} φ(s/σ). This five-parameter model is very rich, permitting up to two modes and scales. It is not obvious that this falls under our scheme, but suppose that given Δ = (Δ_1, ..., Δ_n), the S_i are independent with

    L(S_i | Δ) = N( Δ_i μ_1 + (1 - Δ_i) μ_2, Δ_i σ_1^2 + (1 - Δ_i) σ_2^2 ),   (2.4.6)

where the Δ_i are independent identically distributed Bernoulli indicators with P_θ[Δ_i = 1] = 1 - λ = 1 - P_θ[Δ_i = 0]. It is easy to see (see the problems) that under θ, S = (S_1, ..., S_n) then has the marginal distribution given previously. Thus, we can think of S as S(X) where X includes the Δ's, and Δ_i tells us whether to sample from N(μ_1, σ_1^2) or N(μ_2, σ_2^2). The log likelihood of S can have a number of local maxima and can tend to ∞ as θ tends to the boundary of the parameter space (see the problems). Although MLEs do not exist in these models, a local maximum close to the true θ_0 turns out to be a good "proxy" for the nonexistent MLE, and the EM algorithm can lead to such a local maximum. □

Here is the algorithm. Let

    J(θ | θ_0) = E_{θ_0}( log [ p(X, θ)/p(X, θ_0) ] | S(X) = s ),   (2.4.7)

where we suppress dependence on s. Initialize with θ_old = θ_0. The first (E) step of the algorithm is to compute J(θ | θ_old) for as many values of θ as needed; if this is difficult, the EM algorithm is probably not suitable. The second (M) step is to maximize J(θ | θ_old) as a function of θ; again, if this step is difficult, EM is not particularly appropriate. We then set θ_new = arg max J(θ | θ_old), reset θ_old = θ_new, and repeat the process. As we shall see, in important situations, including the examples we have given, the M step is easy and the E step doable.

The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (see the problems):

    q(s, θ)/q(s, θ_0) = E_{θ_0}( p(X, θ)/p(X, θ_0) | S(X) = s )   (2.4.8)

and

    (∂/∂θ) log q(s, θ) |_{θ=θ_0} = E_{θ_0}( (∂/∂θ) log p(X, θ) |_{θ=θ_0} | S(X) = s )   (2.4.9)

for all θ_0 (under suitable regularity conditions). Note that (2.4.9) follows from (2.4.8) by taking logs in (2.4.8), differentiating with respect to θ, exchanging E_{θ_0} and differentiation, and evaluating at θ = θ_0.
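For the Gaussian mixture of Example 2.4.5, the E-step computes the posterior probability that each S_i came from the first component, and the M-step maximizes the resulting weighted complete-data likelihood. The sketch below is a hypothetical illustration (with lam taken as the weight of the (mu1, sd1) component, a convention choice); as noted above, it converges to a local maximum and can degenerate if a fitted variance collapses toward zero.

```python
import math

def phi(x, mu, sd):
    """Normal density with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def em_mixture(s, lam, mu1, mu2, sd1, sd2, n_iter=100):
    for _ in range(n_iter):
        # E-step: posterior probability that S_i came from component 1
        g = [lam * phi(si, mu1, sd1) /
             (lam * phi(si, mu1, sd1) + (1.0 - lam) * phi(si, mu2, sd2))
             for si in s]
        # M-step: weighted complete-data MLEs
        w1 = sum(g)
        w2 = len(s) - w1
        lam = w1 / len(s)
        mu1 = sum(gi * si for gi, si in zip(g, s)) / w1
        mu2 = sum((1 - gi) * si for gi, si in zip(g, s)) / w2
        sd1 = math.sqrt(sum(gi * (si - mu1) ** 2 for gi, si in zip(g, s)) / w1)
        sd2 = math.sqrt(sum((1 - gi) * (si - mu2) ** 2 for gi, si in zip(g, s)) / w2)
    return lam, mu1, mu2, sd1, sd2

# Well-separated toy data: one cluster near -5, one near +5.
s = [-5.1, -4.9, -5.0, 4.9, 5.1, 5.0]
lam, mu1, mu2, sd1, sd2 = em_mixture(s, 0.5, -1.0, 1.0, 1.0, 1.0)
```

With well-separated clusters the posterior weights quickly become essentially 0 or 1 and the iteration settles at the within-cluster means.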
We give the proof in the discrete case; the result holds whenever the quantities in J(θ | θ_0) can be defined in a reasonable fashion. In the discrete case we appeal to the product rule: for x ∈ X with S(x) = s,

    p(x, θ) = q(s, θ) r(x | s, θ),   (2.4.12)

where r(· | s, θ) is the conditional frequency function of X given S(X) = s. Then

    J(θ | θ_0) = log [ q(s, θ)/q(s, θ_0) ] + E_{θ_0}{ log [ r(X | s, θ)/r(X | s, θ_0) ] | S(X) = s }.   (2.4.13)

From (2.4.9), formally,

    (∂J(θ | θ_0)/∂θ) |_{θ=θ_0} = (∂/∂θ) log q(s, θ) |_{θ=θ_0},

and, hence, a fixed point θ̄ of the algorithm satisfies the likelihood equation (∂/∂θ) log q(s, θ̄) = 0. The main reason the algorithm behaves well follows.

Lemma 2.4.2. If θ_new, θ_old are as defined earlier and S(X) = s, then

    q(s, θ_new) ≥ q(s, θ_old).   (2.4.15)

Equality holds in (2.4.15) iff the conditional distribution of X given S(X) = s is the same for θ_new as for θ_old and θ_new maximizes J(θ | θ_old).

Proof. J(θ_new | θ_old) ≥ J(θ_old | θ_old) = 0 by definition of θ_new. On the other hand, by (2.4.13),

    log [ q(s, θ_new)/q(s, θ_old) ] = J(θ_new | θ_old) - E_{θ_old}{ log [ r(X | s, θ_new)/r(X | s, θ_old) ] | S(X) = s } ≥ J(θ_new | θ_old),

because E_{θ_old}{ log [ r(X | s, θ_new)/r(X | s, θ_old) ] | S(X) = s } ≤ 0 by Shannon's inequality. □

The most important and revealing special case of this lemma follows.

Theorem 2.4.3. Suppose {P_θ : θ ∈ Θ} is a canonical exponential family generated by (T, h) satisfying the conditions of Theorem 2.3.1, and let S(X) be any statistic. Then

(a) The EM algorithm consists of the alternation

    Ȧ(θ_new) = E_{θ_old}(T(X) | S(X) = s),   (2.4.18)
    θ_old = θ_new.   (2.4.19)

If a solution of (2.4.18) exists, it is necessarily unique.

(b) If the sequence of iterates {θ̂_m} so obtained is bounded and the equation

    Ȧ(θ) = E_θ(T(X) | S(X) = s)   (2.4.20)

has a unique solution, then it converges to a limit θ*, which is necessarily a local maximum of q(s, θ).

Proof. In this case,

    J(θ | θ_0) = E_{θ_0}{ (θ - θ_0)^T T(X) - (A(θ) - A(θ_0)) | S(X) = s }
               = (θ - θ_0)^T E_{θ_0}(T(X) | S(X) = s) - (A(θ) - A(θ_0)).

Part (a) follows. Part (b) is more difficult; a proof due to Wu (1983) is sketched in the problems. □

Example 2.4.4 (continued). X is distributed according to the exponential family

    p(x, θ) = exp{ η (2N_{1n}(x) + N_{2n}(x)) - A(η) } h(x),

where η = log[θ/(1-θ)], A(η) = 2n log(1 + e^η), h(x) = 2^{N_{2n}(x)}, and N_{jn} = Σ_{i=1}^n ε_{ij}(X_i), 1 ≤ j ≤ 3. Note that Ȧ(η) = 2nθ. Under the assumption that the process that causes lumping is independent of the values of the ε_{ij}, we compute, for m + 1 ≤ i ≤ n,

    P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ^2 / [θ^2 + 2θ(1-θ)] = θ/(2-θ),
    P_θ[ε_{i2} = 1 | ε_{i1} + ε_{i2} = 1] = 2(1-θ)/(2-θ),

and, after some simplification,

    E_θ(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + E_θ( Σ_{i=m+1}^n (2ε_{i1} + ε_{i2}) | S ),

where N_{jm} = Σ_{i=1}^m ε_{ij}(X_i).
Writing

    M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}),

the EM iteration (2.4.18) becomes

    θ_new = [ 2N_{1m} + N_{2m} + 2M_n/(2 - θ_old) ] / 2n.   (2.4.26)

It may be shown directly (see the problems) that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique root in (0, 1) of the fixed-point equation

    2nθ(2 - θ) = (2N_{1m} + N_{2m})(2 - θ) + 2M_n,

which is indeed the MLE when S is observed. □

Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y), where (Z, Y) ~ N(μ_1, μ_2, σ_1^2, σ_2^2, ρ) and θ = (μ_1, μ_2, σ_1^2, σ_2^2, ρ). Suppose that some of the Z_i and some of the Y_i are missing as follows: for 1 ≤ i ≤ n_1 we observe both Z_i and Y_i; for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i; and for n_2 + 1 ≤ i ≤ n we observe only Y_i. The observed data are

    S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

In this case a set of sufficient statistics for the complete-data model is

    T_1 = Z̄,  T_2 = Ȳ,  T_3 = n^{-1} Σ Z_i^2,  T_4 = n^{-1} Σ Y_i^2,  T_5 = n^{-1} Σ Z_i Y_i.

To compute E_θ(T | S = s), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

    E_θ(Y_i | Z_i) = μ_2 + ρσ_2(Z_i - μ_1)/σ_1,
    E_θ(Y_i^2 | Z_i) = [μ_2 + ρσ_2(Z_i - μ_1)/σ_1]^2 + (1 - ρ^2)σ_2^2,
    E_θ(Z_i Y_i | Z_i) = Z_i [μ_2 + ρσ_2(Z_i - μ_1)/σ_1],

with the corresponding Z on Y regression equations when conditioning on Y_i (see the problems). This completes the E-step. For the M-step, one finds (see the problems) that it produces

    μ_{1,new} = T_1(θ_old),  μ_{2,new} = T_2(θ_old),
    σ_{1,new}^2 = T_3(θ_old) - T_1^2(θ_old),  σ_{2,new}^2 = T_4(θ_old) - T_2^2(θ_old),
    ρ_new = [ T_5(θ_old) - T_1(θ_old) T_2(θ_old) ] / { [T_3(θ_old) - T_1^2(θ_old)][T_4(θ_old) - T_2^2(θ_old)] }^{1/2},   (2.4.27)

where we take θ_old = θ̂_MOM, the method of moments estimate (μ̃_1, μ̃_2, σ̃_1^2, σ̃_2^2, ρ̃) of θ based on the observed data.
Here T_j(θ) denotes T_j with the missing values replaced by the conditional expected values computed in the E-step, evaluated at θ = θ_old. Now the process is repeated with θ_MOM replaced by θ_new. □

Remark 2.4.1. Note that if S(X) = X, then J(θ | θ_0) is log[p(X, θ)/p(X, θ_0)], which as a function of θ is maximized where the contrast -log p(X, θ) is minimized. Also note that, in general, E_{θ_0}[J(θ | θ_0)] is, up to sign, the Kullback-Leibler divergence (2.2.23). Because the E-step involves imputing missing values, the EM algorithm is often called multiple imputation.

Summary. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with E open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in k-parameter canonical exponential families with E open when they exist. Important variants of and alternatives to this algorithm, including the Newton-Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally, in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.

2.5 PROBLEMS AND COMPLEMENTS

Problems for Section 2.1

1. Consider a population made up of three different types of individuals occurring in the Hardy-Weinberg proportions θ^2, 2θ(1-θ), and (1-θ)^2, respectively, where 0 < θ < 1.
(a) Show that T_3 = N_1/n + N_2/2n is a frequency substitution estimate of θ.
(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio θ/(1-θ)?
(c) Suppose X takes the values -1, 0, 1 with respective probabilities p_1, p_2, p_3 given by the Hardy-Weinberg proportions. By considering the first moment of X, show that T_3 is a method of moment estimate of θ.

2. Consider n systems with failure times X_1, ..., X_n assumed to be independent and identically distributed with exponential, E(λ), distributions.
(a) Find the method of moments estimate of λ based on the first moment.
(b) Find the method of moments estimate of λ based on the second moment.
(c) Combine your answers to (a) and (b) to get a method of moment estimate of λ based on the first two moments.
(d) Find the method of moments estimate of the probability P(X_1 ≥ 1) that one system will last at least a month.

3. Suppose that i.i.d. X_1, ..., X_n have a beta, β(α_1, α_2), distribution. Find the method of moments estimates of α = (α_1, α_2) based on the first two moments.
. Let X(l) < .2. Y2 ). . N 2 = n(F(t2J . of Xi < xl/n.. (a) Show that X is a method of moments estimate of 8. Yn ) be a set of independent and identically distributed random vectors with common distribution function F..Section 2.12. Show that these estimates coincide. (See Problem B. ..F(t. . 4. X 1l be the indicators of n Bernoulli trials with probability of success 8. .. (Zn. Let Xl.findthejointfrequencyfunctionofF(tl). 5.8)ln first using only the first moment and then using only the second moment of the population. .)).5 Problems and Complements 139 Hint: See Problem 8.. which we define by ~ F ~( s.2.Xn be a sample from a population with distribution function F and frequency function or density p.. . . . 6. .. < ~ ~ Nk+1 = n(l  F(tk)).t ) = Number of vectors (Zi.F(tk). The natural estimate of F(s. of Xi n ~ =X. Hint: Consi~r (N I . The empirical distribution function F is defined by F(x) = [No. ... Yi) such that Zi < sand Yi < t n .. t).. 8.. YI ). Nk+I) where N I ~ nF(t l ). The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants.. we know the order statistics. .5. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X).... Give the details of this correspondence. Let (ZI. < X(n) be the order statistics of a sample Xl. . ~  7. . empirical substitution estimates coincides with frequency substitution estimates. If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). given the order statistics we may construct F and given P. < . (Z2. Let X I. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. . X n .. 
~ Hint: Express F in terms of p and F in terms of P ~() X = No. See A.8. = Xi with probability lin.) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that. . (b) Exhibit method of moments estimates for VaroX = 8(1 . t) is the bivariate empirical distribution function FCs. (d) FortI tk. ~ ~ ~ (a) Show that in the finite discrete case.
1..2).17. X n be LLd. k=l n  n . as the corresponding characteristics of the distribution F. f(a.L. There exists a compact set K such that for f3 in the complement of K. respectively. . as X ~ P(J. p(X. (c) Use the empirical substitution principle to COnstruct an estimate of cr using the relation E(IX. >. . Hint: See Problem B. suppose that g({3. " .81 tends to 00. the result follows. The All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. z) I tends to 00 as \. In Example 2. (a) Find an estimate of a 2 based on the second mOment.IU9) that 1 < T < L .. are independent N"(O. . Show that the least squares estimate exists.Yn . {3) > c. 9. In Example 2. . Vk and that q(9) can be written as ec q«(J) = h(!'I«(J)"" . .) is the distribution function of a probability P on R2 assigning mass lin to each point (Zi. . . = #.j) is given by ~ The sample covariance is given by n l n L (Zk k=l . 0).. Show that the sample product moment of order (i.• .).2 with X iiI and /is. Suppose X has possible values VI. I) = ". (b) Construct an estimate of a using the estimate of part (a) and the equation a . .j2. Let X". . 10.ZY. and so on.. j). z) is continuous in {3 and that Ig({3. Hint: Set c = p(X. .Xn ) where the X. Since p(X. L I. with (J identifiable..Y) = Z)(Yk . (J E R d. .. ~ i ..!'r«(J)) I . the sampIe correlation. find the method of moments estimate based on i 12.140 Methods of Estimation Chapter 2 (a) Show that F(. . (See Problem 2. . Suppose X = (X"". 1". the sample covariance.) Note that it follows from (A."ZkYk . : J • . Y are the sample means of the sample correlation coefficient is given by Z11 •. (b) Define the sample product moment of order (i. Vi).4.2.1. Zn and Y11 .. 11. I .". {3) is continuous on K.' where Z....
.O) ~ (x/O')exp(x'/20'). General method of moment estimates(!).(X) ~ Tj(X). . 13. X n are i.1.1. j i=l = 1. (a) Use E(X. Iii = n... . Show that the method of moments estimate q = h(j1!. to give a method of moments estimate of (72.1) (iii) Raleigh.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . (b) Suppose P ~ po and I' = b are fixed. ~ (X. Use E(U[)..p:::'.Section 2. . See Problem 1. in estimate.. rep. A). r..36..5..6. Consider the Gaussian AR(I) model of Example 1. > 0.. Suppose X 1... When the data are not i. can you give a method of moments estimate of {3? . .) to give a method of moments estimate of p. Let g.p(x.:::m:::..i.• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments.l and (72 are fixed.5 Problems and Com". with (J E R d and (J identifiable.. 1'(0.l Lgj(Xi ).d. IG(p. Let 911 . .i.d. p fixed (v) Inverse Gaussian. Vk and that q(O) for some Rkvalued function h.10). = h(PI(O). Hint: Use Corollary 1. where U.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)). . 1 < j < k.0 > 0 14.:::nt:::' ~__ 141 for some R k valued function h. 0 ~ (p.. A). 0). .6. .  1/' Po) / . (c) If J.. In the fOllowing cases.x (iv) Gamma. . 1'(1. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1. Suppose that X has possible values VI.0) (ii) Beta.6.Pr(O)) . . find the method of moments estimates (i) Beta. fir) can be written as a frequency plugin estimate. as X rv P(J.
. II. + '2P4 + 2PS .. b... I I .142 Methods of Estimation Chapter 2 15. Let X = (Z. where (Z.• l Yn are taken at times t 1. of observations.. ..z. 1 q.. I.J is' _ n n. ' l X q )... where OJ = 1. . 17..ZY ~ .... .2 1. .... n. I ... let the moments be mjkrs = E(XtX~).. = Y .. Q. respectively. The reading Yi ..Y. t n on the position of the object. . k > 0.. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S.! . SF.... = n 1 L:Z. An Object of unit mass is placed in a force field of unknown constant intensity 8. . 11P2 + "2P4 + '2P6 . I • . bI) are as in Theorem 1. .1..z. For independent identically distributed Xi = (XiI. ."" X iq ).)] + [g(l3o. and IF.. Establish (2... ()2.g(l3o. >.1 " Xir Xk J..8. we define the empirical or sample moment to be j ~ ... FF.g(I3.. k> Q mjkrs L.. 5 SF 28. ).•• Om) can be expressed as a function of the moments.83 8t Let N j be the number of plants of genotype j in a sample of n independent plants.. Y) and B = (ai...)] = [Y. . j > 0.... and F. Hint: [y. and F at one locus resulting in six genotypes labeled SS. . . . i = 1.. P3 + "2PS + '2P6 PI are frequency plugin estimates of OJ. 1 6 and let Pl ~ N j / n...S 1 . _ (Z)' ' a. SI..bIZ. S = 1.. The HardyWeinberg model specifies that the six genotypes have probabilities L7=1 I Genotype Genotype Probability SS 2 II 8~ 3 FF 8j 4 SI 28. the method of moments l by replacing mjkrs by mjkrs..) .  1. Show that <j < i ! . I . .z. Show that method of moments estimators of the parameters b1 and al in the best linear predictor are estimate ~ 8 of B is obtained L:Z...3.  t=1 liB = (0 1 ..q. For a vector X = (Xl.. 83 6 IF 28.1".4.)]... and (J3. 1 ..z.. Readings Y1 . 1 I ..~b.. Multivariate method a/moments. I..._ . Let 8" 8" and 83 denote the probabilities of S... Y) and (ai. 16. HardyWeinberg with six genotypes..6).g(I3. n 1  Problems for Sectinn 2.. 1"..
What is the least squares estimate based on YI . Let X I.2. . 9. . B) = Bc'x('+JJ.Section 2. . (a) f(x.BJ ..We suppose the oand be uncorrelated with constant variance.. .t of the best (MSPE) predictor of Yn + I ? 4. and 13 ranges over R d 8. Suppose that observations YI . (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1. to have mean 2. c cOnstant> 0. (exponential density) (b) f(x. all lie on aline.. ~p ~(yj}) ~ .. B > O..(zn.. Zn and that the linear regression model holds. B) = Be'x. ..I is to be taken at time Zn+l. (Pareto density) .5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l . . .2. 13). Hint: The quantity to be minimized is I:7J (y. .n.. Hint: Write the lines in the fonn (z. Show that the fonnulae of Example 2. . Find the least squares estimates of ()l and B . Yn). B > O.. < O. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi... X n denote a sample from a population with one of the following densities Or frequency functions. if we consider the distribution assigning mass l/n to each of the points (zJ. . i = 1. x > c. _.. Show that the least squares estimate is always defined and satisfies the equations (2.4. Find the least squares estimates for the model Yi = 8 1 (2.Yi). 2 10. The regression line minimizes the sum of the squared vertical distances from the points (Zl. Yn have been taken at times Zl. . 3.3. Find the MLE of 8. Find the line that minimizes the sum of the squared perpendicular distance to the same points.. . A new observation Ynl.1. . Yl).. Y.Yn).4. the range {g(zr. ..2 may be derived from Theorem 1." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi.4)(2. x > 0. (a) Let Y I . (zn. " 7 5. nl and Yi = 82 + ICi.6) under the restrictions BJ > 0.. .5) provided that 9 is differentiable with respect to (3" 1 < i < d.)' 1+ B5 6. B.13 E R d } is closed. . where €nlln2 are independent N(O. 
Find the LSE of 8. . 7. Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants. nl + n2. ..<) ~ .y. i + 1... . in fact.. . 13).B.2.). .. g(zn. Suppose Yi ICl".z. i = 1. . Find the least squares estimate of 0:. (J2) variables. .
.I")/O") . (beta. If n exists. 0 > O. I I I (b) LetP = {PO: 0 E e}. then the unique MLEs are (b) Suppose J. .. the MLE of w is by definition ~ W. 0 <x < I. Hint: Let e(w) = {O E e : q(O) = w}. 0) = (x/0 2) exp{ _x 2/20 2 }. . (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".  e .. ry) denote the density or frequency function of X in terms of T} (Le.. bl to make pia) ~ p(b) = (b . .0"2 (a) Find maximum likelihood estimates of J.144 Methods of Estimation Chapter 2 (c) f(". (72 Ii (a) Show that if J1 and 0. c constant> 0. be a family of models for X E X C Rd. (We write U[a.16(b).II ". Suppose that Xl. = X and 8 2 ~ n. x > 1". X n • n > 2. (Rayleigh density) (f) f(x. 0 E e and let 0 denote the MLE of O.2. n > 2.5.t and a 2 . (Pareto density) (d) fix. 0 > O.a)l rather than 0.2 are unknown. . then {e(w) : wEn} is a partition of e. they are equivariant 16. 0> O. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O. and o belongs to only one member of this partition. MLEs are unaffected by reparametrization. x > 0. Let X I. e c Let q be a map from e onto n. . WEn . J1 E R.. be independently and identically distributed with density f(x.1. reparametrize the model using 1]).:r > 0. 0) = Ocx c . < I" < 00. > O. I i' 13. Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. .r(c+<).X)" > 0.P > I.O) where 0 ~ (1".0"2).5 show that no maximum likelihood estimate of e = (1".) ! ! !' ! 14. (Wei bull density) 11.  under onetoone transformations).0"2) 15. Hint: Use the factorization theorem (Theorem 1. . say e(w). 00 = I 0" exp{ (x . q(9) is an MLE of w = q(O). 0) = cO".. x> 0. density) (e) fix. ncR'.1). Define ry = h(8) and let f(x.. then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. Hint: You may use Problem 2. is a sample from a N(/l' ( 2 ) distribution.l L:~ 1 (Xi .Pe. I). Show that if 0 is a MLE of 0. I < k < p.t and a 2 are both known to be nonnegative but otherwise unspecified. X n be a sample from a U[0 0 + I distribution. 
Thus. (a) Let X .l exp{ OX C } . . . 12. X n .e. iJ( VB.: I I \ . Pi od maximum likelihood estimates of J1 and a 2 . ~ I in Example 2. I • :: I Suppose that h is a onetoone function from e onto h(e). c constant> 0. Because q is onto n.. Show that the MLE of ry is h(O) (i. for each wEn there is 0 E El such that w ~ q( 0). Show that depends on X through T(X) only provided that 0 is unique. 0) = VBXv'O'. 0 > O. Let X" .
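For the Rayleigh density f(x, θ) = (x/θ²) exp{-x²/2θ²} above, setting the score to zero gives the closed form θ̂ = (Σ xi²/2n)^(1/2). The following sketch (illustrative code, not part of the text) checks numerically that this closed form does maximize the log-likelihood:

```python
import math
import random

def rayleigh_loglik(theta, xs):
    # log-likelihood for f(x, theta) = (x/theta^2) exp{-x^2 / (2 theta^2)}
    return sum(math.log(x) - 2 * math.log(theta) - x * x / (2 * theta * theta)
               for x in xs)

def rayleigh_mle(xs):
    # solving d/dtheta log L = 0 gives theta_hat = sqrt(sum(x_i^2) / (2n))
    return math.sqrt(sum(x * x for x in xs) / (2 * len(xs)))

random.seed(0)
xs = [abs(random.gauss(0, 2)) + 0.1 for _ in range(200)]  # arbitrary positive data
theta_hat = rayleigh_mle(xs)
# the closed form should beat nearby values of theta
for t in (0.9 * theta_hat, 1.1 * theta_hat):
    assert rayleigh_loglik(theta_hat, xs) >= rayleigh_loglik(t, xs)
```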
Section 2.5 Problems and Complements

17. Derive maximum likelihood estimates in the following models.
(a) The observations are indicators of Bernoulli trials with probability of success θ. We want to estimate θ and Varθ(X1) = θ(1 - θ).
(b) The observations are X1 = the number of failures before the first success, X2 = the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success θ. We want to estimate θ.

18. (Kiefer–Wolfowitz) Suppose (X1, ..., Xn) is a sample from a population with density
f(x, θ) = (9/(10σ)) φ((x - μ)/σ) + (1/10) φ(x - μ),
where φ is the standard normal density and θ = (μ, σ²) ∈ Θ = {(μ, σ²) : -∞ < μ < ∞, 0 < σ² < ∞}. Show that maximum likelihood estimates do not exist.

19. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time X to failure of an item is
Pθ[X = k] = θ^(k-1) (1 - θ), k = 1, 2, ...,
where 0 < θ < 1. Suppose that we only record the time of failure if failure occurs on or before time r, and otherwise just note that the item has lived at least (r + 1) periods. Thus, we observe Y1, ..., Yn, which are independent, identically distributed, and have common frequency function
f(k, θ) = θ^(k-1) (1 - θ), k = 1, ..., r,
f(r + 1, θ) = 1 - Pθ[X ≤ r] = 1 - Σ_{k=1}^{r} θ^(k-1) (1 - θ) = θ^r.
(We denote by "r + 1" survival for at least (r + 1) periods.) Let M = number of indices i such that Yi = r + 1. Show that the maximum likelihood estimate of θ based on Y1, ..., Yn is
θ̂(Y) = (Σ Yi - n) / (Σ Yi - M).

20. Let X1, ..., Xn be independently distributed with Xi having a N(θi, 1) distribution, 1 ≤ i ≤ n.
(a) Find maximum likelihood estimates of the θi under the assumption that these quantities vary freely.
(b) Solve the problem of part (a) for n = 2 when it is known that θ1 ≤ θ2. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).

21. In the "life testing" problem 1.6.16(i), find the MLE of θ.
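For the censored geometric model above, the log-likelihood is (ΣYi - n) log θ + (n - M) log(1 - θ), which is maximized at θ̂ = (ΣYi - n)/(ΣYi - M). A small simulation sketch (illustrative code, not from the text) verifying the derivation:

```python
import math
import random

def censored_geom_loglik(theta, ys, r):
    # uncensored y in 1..r contributes (y-1) log(theta) + log(1-theta);
    # a censored observation y = r+1 contributes r log(theta)
    ll = 0.0
    for y in ys:
        if y == r + 1:
            ll += r * math.log(theta)
        else:
            ll += (y - 1) * math.log(theta) + math.log(1 - theta)
    return ll

def censored_geom_mle(ys, r):
    n, s = len(ys), sum(ys)
    m = sum(1 for y in ys if y == r + 1)
    return (s - n) / (s - m)

random.seed(1)
r, theta = 5, 0.7
ys = []
for _ in range(500):
    k = 1
    while random.random() < theta:  # P[X = k] = theta^(k-1)(1-theta)
        k += 1
    ys.append(min(k, r + 1))        # censor at r+1
th = censored_geom_mle(ys, r)
assert 0 < th < 1
# the closed form beats nearby values, as concavity of the log-likelihood guarantees
for t in (th - 0.02, th + 0.02):
    assert censored_geom_loglik(th, ys, r) >= censored_geom_loglik(t, ys, r)
```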
22. Suppose X has a hypergeometric, H(b, N, n), distribution. Show that the maximum likelihood estimate of b for N and n fixed is given by
b̂(X) = [X(N + 1)/n] if X(N + 1)/n is not an integer,
and b̂(X) = X(N + 1)/n or X(N + 1)/n - 1 otherwise, where [t] is the largest integer that is ≤ t.
Hint: Consider the ratio L(b + 1, x)/L(b, x) as a function of b.

23. Let X1, ..., Xm and Y1, ..., Yn be two independent samples from N(μ1, σ²) and N(μ2, σ²) populations, respectively. Show that the MLE of θ = (μ1, μ2, σ²) is θ̂ = (X̄, Ȳ, σ̂²), where
σ̂² = [Σ_{i=1}^{m} (Xi - X̄)² + Σ_{j=1}^{n} (Yj - Ȳ)²] / (m + n).

24. Polynomial Regression. Suppose Yi = μ(zi) + εi, where the εi satisfy (2.2.4)–(2.2.6). Set μ(z) = Σ_{j∈J} βj z^j, where j ∈ J and J is a subset of {(j1, ..., jp) : 0 ≤ jk ≤ p, 1 ≤ k ≤ p}. In an experiment to study tool life (in minutes) of steel-cutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousandths of an inch per revolution), the following data were obtained (from S. Weisberg, 1985).

TABLE 2. Tool life data (columns: Feed, Speed, Life).
The researchers analyzed these data using Y = log tool life and the covariates
z1 = (feed rate - 13)/6, z2 = (cutting speed - 900)/300.
Two models are contemplated:
(a) Y = β0 + β1 z1 + β2 z2 + ε;
(b) Y = α0 + α1 z1 + α2 z2 + α3 z1² + α4 z2² + α5 z1 z2 + ε.
Use a least squares computer package to compute estimates of the coefficients (β's and α's) in the two models. Use these estimated coefficients to compute the values of the contrast function (2.1.5) for (a) and (b). Both of these models are approximations to the true mechanism generating the data. Being larger, the second model provides a better approximation. However, this has to be balanced against greater variability in the estimated coefficients. This will be discussed in Volume II.

25. Let ZD = ||zij|| (n × d) be a design matrix and let W (n × n) be a known symmetric invertible matrix. Consider the model Y = ZD β + ε, where ε has covariance matrix σ² W, σ² unknown. Derive the weighted least squares normal equations (2.2.19).

26. Let W^(-1/2) be a square root matrix of W^(-1) (see (B.6.6)). Set Ỹ = W^(-1/2) Y, Z̃D = W^(-1/2) ZD, and ε̃ = W^(-1/2) ε.
(a) Show that Ỹ = Z̃D β + ε̃ satisfies the linear regression model (2.2.1)–(2.2.6) with g(β, z) = zᵀβ.
(b) Show that if Z̃D is of rank d, then the β̂ that minimizes (Y - ZD β)ᵀ W⁻¹ (Y - ZD β) is given by (2.2.20).

27. Let (Z, Y) have joint probability P with joint density f(z, y), and let v(z, y) ≥ 0 be a weight function such that E(v(Z, Y)Z²) and E(v(Z, Y)Y²) are finite. The best linear weighted mean squared prediction error predictor μ1(P) + μ2(P)Z of Y is defined as the minimizer of
E{v(Z, Y)[Y - (b1 + b2 Z)]²}.
(a) Let (Z*, Y*) have density v(z, y) f(z, y)/c, where c = ∫∫ v(z, y) f(z, y) dz dy. Show that μ2(P) = Cov(Z*, Y*)/Var Z* and μ1(P) = E(Y*) - μ2(P) E(Z*).
(b) Let P̂ be the empirical probability and let v(z, y) = 1/Var(Y | Z = z). Show that μ1(P̂) and μ2(P̂) coincide with the weighted least squares estimates β̂1 and β̂2. That is, weighted least squares estimates are plug-in estimates.

28. Let ZD be the design matrix of Problem 25. Show that the following are equivalent:
(a) The parameterization β → ZD β is identifiable.
(b) ZD is of rank d.
(c) ZDᵀ ZD is of rank d.
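The weighted least squares normal equations above take the form ZᵀW⁻¹Z β = ZᵀW⁻¹Y. For the simple-regression case with a diagonal W (weights wi on the variances), the 2×2 system can be solved directly. A minimal sketch (illustrative, not from the text):

```python
def weighted_ls(z, y, w):
    """Weighted least squares for y_i = b0 + b1 z_i + e_i with Var(e_i) proportional to w_i.
    Solves the 2x2 normal equations Z' W^{-1} Z beta = Z' W^{-1} Y directly."""
    s00 = sum(1 / wi for wi in w)
    s01 = sum(zi / wi for zi, wi in zip(z, w))
    s11 = sum(zi * zi / wi for zi, wi in zip(z, w))
    t0 = sum(yi / wi for yi, wi in zip(y, w))
    t1 = sum(zi * yi / wi for zi, yi, wi in zip(z, y, w))
    det = s00 * s11 - s01 * s01
    return ((s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det)

# data lying exactly on y = 1 + 2z is recovered whatever the weights
z = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = weighted_ls(z, y, [1.0, 2.0, 0.5, 4.0])
assert abs(b0 - 1.0) < 1e-9 and abs(b1 - 2.0) < 1e-9
```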
29. Consider the model Yi = μ + ei, i = 1, ..., n, where ei = (εi + εi+1)/2 and ε1, ..., εn+1 are i.i.d. with mean zero and variance σ². The ei are called moving average errors.
(a) Show that E(Yi) = μ, i = 1, ..., n.
(b) Show that Ȳ is a multivariate method of moments estimate of μ.
(c) Find a matrix A such that e(n×1) = A(n×(n+1)) ε((n+1)×1).
(d) Find the covariance matrix W of e.
(e) Find the weighted least squares estimate of μ.
(f) The following data give the elapsed times Y1, ..., Yn spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore. Use a weighted least squares computer routine to compute the weighted least squares estimate μ̂ of μ. Is μ̂ different from Ȳ?

TABLE 2. Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay. The data (courtesy S. Chon) should be read row by row.

30. In the multinomial example, suppose some of the nj are zero. Show that the MLE of θj is θ̂j = nj/n, j = 1, ..., k. Hint: Suppose without loss of generality that n1 = n2 = ... = nq = 0 and nq+1 > 0, ..., nk > 0. Then
p(x, θ) = Π_{j=q+1}^{k} θj^{nj},
which vanishes if θj = 0 for any j = q + 1, ..., k.

31. Suppose Y1, ..., Yn are independent with Yi uniformly distributed on [μi - σ, μi + σ], σ > 0, where μi = Σ_{j=1}^{p} zij βj for given covariate values {zij}. Show that the MLE of (β1, ..., βp, σ)ᵀ is obtained by finding β̂1, ..., β̂p that minimize the maximum absolute value contrast function max_i |Yi - μi| and then setting σ̂ = max_i |Yi - μ̂i|, where μ̂i = Σ_j zij β̂j.
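The moving-average errors ei = (εi + εi+1)/2 above give e = Aε with A[i][i] = A[i][i+1] = 1/2, so W = σ²AAᵀ is tridiagonal with Var(ei) = σ²/2 and Cov(ei, ei+1) = σ²/4. This structure can be checked mechanically (illustrative sketch, not from the text):

```python
def ma_cov(n, sigma2=1.0):
    # e_i = (eps_i + eps_{i+1})/2  =>  e = A eps with A[i][i] = A[i][i+1] = 1/2
    A = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        A[i][i] = A[i][i + 1] = 0.5
    # covariance of e: W = sigma^2 * A A^T
    W = [[sigma2 * sum(A[i][k] * A[j][k] for k in range(n + 1))
          for j in range(n)] for i in range(n)]
    return W

W = ma_cov(4)
assert all(abs(W[i][i] - 0.5) < 1e-12 for i in range(4))   # Var(e_i) = sigma^2 / 2
assert abs(W[0][1] - 0.25) < 1e-12                         # Cov(e_i, e_{i+1}) = sigma^2 / 4
assert abs(W[0][2]) < 1e-12                                # lags >= 2 are uncorrelated
```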
32. Suppose Y1, ..., Yn are independent with Yi having the Laplace density
(1/2σ) exp{-|yi - μi|/σ}, σ > 0,
where μi = Σ_{j=1}^{p} zij βj for given covariate values {zij}.
(a) Show that the MLE of (β1, ..., βp, σ)ᵀ is obtained by finding β̂1, ..., β̂p that minimize the least absolute deviation contrast function Σ_{i=1}^{n} |Yi - μi| and then setting σ̂ = n⁻¹ Σ_{i=1}^{n} |Yi - μ̂i|, where μ̂i = Σ_j zij β̂j. These β̂1, ..., β̂p are called least absolute deviation estimates (LADEs).
(b) If n is odd, the sample median ỹ is defined as Y(k), where k = ½(n + 1) and Y(1), ..., Y(n) denotes Y1, ..., Yn ordered from smallest to largest. If n is even, the sample median ỹ is defined as ½[Y(r) + Y(r+1)], where r = ½n. (See (2.1.17).) Suppose μi = μ for each i. Show that the sample median ỹ is the minimizer of Σ_{i=1}^{n} |Yi - μ|. Hint: Work with Y having the empirical distribution F̂.

33. The Hodges–Lehmann (location) estimate x̂HL is defined to be the median of the ½n(n + 1) pairwise averages ½(xi + xj), i ≤ j. An asymptotically equivalent procedure x̃HL is to take the median of the distribution placing mass 2/n² at each point ½(xi + xj), i < j, and mass 1/n² at each xi.
(a) Show that the Hodges–Lehmann estimate is the minimizer of the contrast function
ρ(x, θ) = Σ_{i≤j} |xi + xj - 2θ|.
(b) Define θHL to be the minimizer of
∫ |x - 2θ| d(F̂ * F̂)(x),
where * denotes convolution. Show that x̃HL is a plug-in estimate of θHL.

34. Let X1, ..., Xn be i.i.d. as (Z, Y)ᵀ, where Y = Z + √λ W, λ > 0, and Z and W are independent N(0, 1). Find the MLE of λ and give its mean and variance.

35. Let g(x) = 1/[π(1 + x²)], x ∈ R, be the Cauchy density, and let X1 and X2 be i.i.d. with density g(x - θ), θ ∈ R. Let x1 and x2 be the observations and set Δ = ½(x1 - x2). Let θ̂ = arg max Lx(θ) be "the" MLE.
(a) Show that if |Δ| ≤ 1, then the MLE exists and is unique; give the MLE in this case.
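The Hodges–Lehmann (location) estimate above is easy to compute directly, and one can verify numerically that it minimizes the contrast Σ_{i≤j} |xi + xj - 2θ|. Illustrative sketch (not from the text):

```python
import statistics

def hodges_lehmann(xs):
    # median of the n(n+1)/2 pairwise averages (x_i + x_j)/2, i <= j
    n = len(xs)
    avgs = [(xs[i] + xs[j]) / 2 for i in range(n) for j in range(i, n)]
    return statistics.median(avgs)

def hl_contrast(theta, xs):
    n = len(xs)
    return sum(abs(xs[i] + xs[j] - 2 * theta) for i in range(n) for j in range(i, n))

xs = [1.1, 0.3, 2.8, -0.5, 1.7, 9.0]
t = hodges_lehmann(xs)
# the estimate is a minimizer of the contrast function
for d in (-0.3, -0.05, 0.05, 0.3):
    assert hl_contrast(t, xs) <= hl_contrast(t + d, xs) + 1e-12
```

Note the robustness: the outlier 9.0 barely moves the estimate, unlike the sample mean.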
(b) Show that if |Δ| > 1, then the MLE is not unique. Find the values of θ that maximize the likelihood Lx(θ) when |Δ| > 1. Hint: Factor out (x̄ - θ) in the likelihood equation, where x̄ = ½(x1 + x2).

36. Problem 35 can be generalized as follows (Dharmadhikari and Joag-Dev, 1985). Let g be a probability density on R satisfying the following three conditions:
1. g is continuous, symmetric about 0, and positive everywhere.
2. g is twice continuously differentiable everywhere except perhaps at 0.
3. If we write h = -log g, then h''(y) > 0 for some nonzero y.
Let (X1, X2) be a random sample from the distribution with density f(x, θ) = g(x - θ), where x ∈ R and θ ∈ R. Let x1 and x2 be the observed values of X1 and X2 and write x̄ = ½(x1 + x2) and Δ = ½(x1 - x2). The likelihood function is given by
Lx(θ) = g(x1 - θ) g(x2 - θ) = g(x̄ + Δ - θ) g(x̄ - Δ - θ).
Let θ̂ = arg max Lx(θ) be "the" MLE. Show that
(a) The likelihood is symmetric about x̄.
(b) Either θ̂ = x̄ or θ̂ is not unique.
(c) There is an interval (a, b), a < b, such that for every y ∈ (a, b) there exists a δ > 0 such that
h(y + δ) - h(y) > h(y) - h(y - δ).
(d) Use (c) to show that if Δ ∈ (a, b), then θ̂ is not unique.

37. Let Xi denote the number of hits at a certain Web site on day i, i = 1, ..., n. Assume that S = Σ_{i=1}^{n} Xi has a Poisson, P(nλ), distribution. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). Let Vj and Wj denote the number of hits of type 1 and 2 on day j, j = n + 1, ..., n + m. Assume that S1 = Σ_{j=n+1}^{n+m} Vj and S2 = Σ_{j=n+1}^{n+m} Wj have P(mλ1) and P(mλ2) distributions, where λ1 + λ2 = λ. Also assume that S, S1, and S2 are independent. Find the MLEs of λ1 and λ2 based on S, S1, and S2.

38. Let X ~ Pθ, θ ∈ Θ. Suppose h is a one-to-one function from Θ onto Ω = h(Θ). Define η = h(θ) and let p*(x, η) = p(x, h⁻¹(η)) denote the density or frequency function of X for the η parametrization. Let K(θ0, θ1) (respectively K*(η0, η1)) denote the Kullback–Leibler divergence between p(x, θ0) and p(x, θ1) (respectively p*(x, η0) and p*(x, η1)). Show that
K*(η0, η1) = K(h⁻¹(η0), h⁻¹(η1)).

39. Let X1, ..., Xn be i.i.d. N(θ, σ²) and let p(x, θ) denote their joint density. Show that the entropy of p(x, θ) is ½n[1 + log(2πσ²)] and that the Kullback–Leibler divergence between p(x, θ) and p(x, θ0) is n(θ - θ0)²/2σ².
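The closed form n(θ - θ0)²/2σ² for the Kullback–Leibler divergence between two N(θ, σ²) joint densities can be checked against a direct numerical integration of ∫ p log(p/q) for n = 1. A sketch (illustrative, not from the text):

```python
import math

def kl_closed_form(theta, theta0, sigma2, n=1):
    # K(theta, theta0) = n (theta - theta0)^2 / (2 sigma^2)
    return n * (theta - theta0) ** 2 / (2 * sigma2)

def kl_numeric(theta, theta0, sigma2):
    # midpoint-rule integration of p log(p/q) for a single observation
    s = math.sqrt(sigma2)
    def pdf(x, m):
        return math.exp(-(x - m) ** 2 / (2 * sigma2)) / (s * math.sqrt(2 * math.pi))
    lo, hi, steps = theta - 10 * s, theta + 10 * s, 20000
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        p, q = pdf(x, theta), pdf(x, theta0)
        total += p * math.log(p / q) * h
    return total

assert abs(kl_numeric(1.0, 0.2, 2.0) - kl_closed_form(1.0, 0.2, 2.0)) < 1e-5
```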
40. Let X1, ..., Xn be a sample from the generalized Laplace distribution with density
f(x, θ1, θ2) = [1/(θ1 + θ2)] exp{-x/θ1}, x > 0,
f(x, θ1, θ2) = [1/(θ1 + θ2)] exp{x/θ2}, x < 0,
where θj > 0, j = 1, 2.
(a) Show that T1 = Σ Xi 1[Xi > 0] and T2 = -Σ Xi 1[Xi < 0] are sufficient statistics.
(b) Find the maximum likelihood estimates of θ1 and θ2 in terms of T1 and T2. Carefully check the "T1 = 0 or T2 = 0" case.

41. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards, 1959; Seber and Wild, 1989)
(1/y)(dy/dt) = β[1 - (y/α)^(1/δ)], y > 0, α > 0, β > 0, δ > 0.
(a) Show that a solution to this equation is of the form y = g(t; θ), where θ = (α, β, μ, δ), μ ∈ R, and
g(t; θ) = α / {1 + exp[-β(t - μ)/δ]}^δ.
(b) Suppose we have observations (t1, y1), ..., (tn, yn), n ≥ 4, on a population of a large number of organisms. Variation in the population is modeled on the log scale by using the model
log Yi = log α - δ log{1 + exp[-β(ti - μ)/δ]} + εi, i = 1, ..., n,
where ε1, ..., εn are uncorrelated with mean 0 and variance σ². Give the least squares estimating equations (2.1.7) for estimating α, β, μ, and δ.
(c) Let Yi denote the response of the ith organism in a sample and let zij denote the level of the jth covariate (stimulus) for the ith organism. An example of a neural net model is
Yi = Σ_{j=1}^{p} h(zij; λj) + εi, i = 1, ..., n,
where h(z; λ) = g(z; λ), λ = (α, β, μ), δ = 1, and ε1, ..., εn are uncorrelated with mean zero and variance σ². For the case p = 1, give the least squares estimating equations (2.1.7) for estimating α, β, and μ.

42. Suppose X1, ..., Xn satisfy the autoregressive model of Example 1.1.5.
(a) If μ is known, show that the MLE of β is
β̂ = Σ_{i=2}^{n} (Xi-1 - μ)(Xi - μ) / Σ_{i=2}^{n} (Xi-1 - μ)².
(b) If β is known, find the MLE of μ.

Problems for Section 2.3

1. Consider the Hardy–Weinberg model with the six genotypes given in the problems for Section 2.1. In a sample of n independent plants, write xi = j if the ith plant has genotype j, 1 ≤ j ≤ 6. Under what conditions on (x1, ..., xn) does the MLE exist? What is the MLE? Is it unique? Let Θ = {(θ1, θ2) : θ1 > 0, θ2 > 0, θ1 + θ2 < 1} and let θ3 = 1 - θ1 - θ2.

2. Let X1, ..., Xn be i.i.d. gamma, Γ(λ, p).
(a) Show that the density of X = (X1, ..., Xn)ᵀ can be written as the rank 2 canonical exponential family generated by T = (Σ log Xi, Σ Xi) and h(x) = Π xi⁻¹, with η1 = p and η2 = -λ, where Γ denotes the gamma function.
(b) Show that the likelihood equations are equivalent to (2.3.4) and (2.3.5).

3. Suppose Y1, ..., Yn are independent with P[Yi = 1] = p(xi, α, β) = 1 - P[Yi = 0], where
log [p/(1 - p)](x, α, β) = α + βx, x1 < ... < xn.
Show that the MLE of (α, β) exists iff (Y1, ..., Yn) is not a sequence of 1's followed by all 0's or the reverse. Hint:
c1 Σ Yi + c2 Σ xi Yi = Σ (c1 + c2 xi) Yi ≤ Σ (c1 + c2 xi) 1[c1 + c2 xi > 0].
If c2 > 0, the bound is sharp and is attained only if Yi = 0 for xi < -c1/c2 and Yi = 1 for xi > -c1/c2.

4. Give details of the proof of Corollary 2.3.1.

5. Prove Lemma 2.3.1. Hint: Let c = l(θ̂). There exists a compact set K ⊂ Θ such that l(θ) < c for all θ not in K. This set K will have a point where the max is attained.
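The autoregressive estimate β̂ = Σ(Xi-1 - μ)(Xi - μ) / Σ(Xi-1 - μ)² is the minimizer of the conditional sum of squares Σ[Xi - μ - β(Xi-1 - μ)]². A simulation sketch (illustrative, not from the text):

```python
import random

def ar_beta_hat(xs, mu):
    num = sum((xs[i - 1] - mu) * (xs[i] - mu) for i in range(1, len(xs)))
    den = sum((xs[i - 1] - mu) ** 2 for i in range(1, len(xs)))
    return num / den

def css(beta, xs, mu):
    # conditional sum of squares whose minimizer is ar_beta_hat
    return sum((xs[i] - mu - beta * (xs[i - 1] - mu)) ** 2 for i in range(1, len(xs)))

random.seed(2)
mu, beta = 0.5, 0.6
xs = [mu]
for _ in range(400):
    xs.append(mu + beta * (xs[-1] - mu) + random.gauss(0, 1))
b = ar_beta_hat(xs, mu)
assert abs(b - beta) < 0.15    # roughly consistent on a path of length 400
assert css(b, xs, mu) <= min(css(b - 0.01, xs, mu), css(b + 0.01, xs, mu))
```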
6. Let X1, ..., Xn ∈ Rᵖ be i.i.d. with density
fθ(x) = c(α) exp{-|x - θ|^α}, θ ∈ Rᵖ, α ≥ 1,
where c⁻¹(α) = ∫ exp{-|x|^α} dx and |·| is the Euclidean norm.
(a) Show that if α > 1, the MLE θ̂ exists and is unique.
(b) Show that if α = 1, the MLE θ̂ exists but is not unique if n is even.

7. Show that the boundary of a convex set C in Rᵏ has volume 0. Hint: If the boundary has positive volume, then it must contain a sphere, and the center of the sphere is an interior point by (B.9.1).

8. Prove Theorem 2.3.1.

9. Use Corollary 2.3.1 to show that in the multinomial Example 2.3.3, MLEs of ηj exist iff all Tj > 0, 1 ≤ j ≤ k - 1. Hint: The k points (0, ..., 0), (n, 0, ..., 0), ..., (0, ..., 0, n) are the vertices of the convex set
{(t1, ..., tk-1) : tj ≥ 0, 1 ≤ j ≤ k - 1, Σ_{j=1}^{k-1} tj ≤ n}.

10. Let Y1, ..., Yn, n ≥ 2, denote the duration times of n independent visits to a Web site. Suppose Yi has an exponential, E(λi), distribution, where
μi = E(Yi) = λi⁻¹ = exp{α + βzi}, 0 < z1 < ... < zn,
and zi is the income of the person whose duration time is Yi. Show that the MLE of (α, β)ᵀ exists and is unique.

11. In the heterogeneous regression example of Section 1.6, with n ≥ 3 and 0 < z1 < ... < zn, show that the MLE exists and is unique.

12. Let X1, ..., Xn be i.i.d. with density (1/σ) f0((x - μ)/σ), μ ∈ R, σ > 0, where f0(x) = exp{-w(x)}, w(±∞) = ∞; assume w'' > 0, so that w is strictly convex.
(a) Show that, for n ≥ 2, the likelihood equations
Σ_{i=1}^{n} w'((Xi - μ)/σ) = 0,
Σ_{i=1}^{n} [((Xi - μ)/σ) w'((Xi - μ)/σ) - 1] = 0
have a unique solution (μ̂, σ̂).
(b) Reparametrize by a = 1/σ, b = μ/σ and consider varying a, b successively.
(c) Show that for the logistic distribution F0(x) = [1 + exp{-x}]⁻¹, w is strictly convex; give the likelihood equations for μ and σ.
Hint: (a), (b) The function D(a, b) = Σ w(a Xi - b) - n log a is strictly convex in (a, b), and D(a, b) → ∞ as (a, b) → (a0, b0) if either a0 = 0 or ∞ or b0 = ±∞.
Note: You may use without proof (see Appendix B.9):
(i) If a strictly convex function has a minimum, it is unique.
(ii) If ∂²D/∂a² > 0, ∂²D/∂b² > 0, and (∂²D/∂a²)(∂²D/∂b²) > (∂²D/∂a∂b)², then D is strictly convex.

13. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model
Y = ZD β + ε, ε1, ..., εn i.i.d. N(0, σ²), rank(ZD) = k.
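The separation phenomenon behind the logistic-regression existence condition can be seen numerically: with perfectly separated data, the log-likelihood increases monotonically along a direction and its supremum (zero) is never attained, so no MLE exists. Illustrative sketch (not from the text):

```python
import math

def logistic_loglik(a, b, xs, ys):
    ll = 0.0
    for x, y in zip(xs, ys):
        eta = a + b * x
        ll += y * eta - math.log(1 + math.exp(eta))
    return ll

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]   # perfectly separated at x = 0: a sequence of 0's followed by 1's
vals = [logistic_loglik(0.0, t, xs, ys) for t in (1.0, 5.0, 25.0)]
# the likelihood keeps increasing along b -> infinity, approaching but never reaching log 1 = 0
assert vals[0] < vals[1] < vals[2] < 0
```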
(Check that you are describing the Gauss–Seidel iterative method for solving a system of linear equations. See, for example, Golub and Van Loan, 1985, Chapter 10.)

Problems for Section 2.4

1. EM for bivariate data.
(a) In the bivariate normal example of Section 2.4, complete the E-step by finding E(Zi | Yi) and E(Zi Yi | Yi).
(b) In the same example, verify the M-step.

2. Let (X1, Y1), ..., (Xn, Yn) be a sample from a N(μ1, μ2, σ1², σ2², ρ) population.
(a) Show that the MLEs of σ1², σ2², and ρ when μ1 and μ2 are assumed to be known are
σ̃1² = (1/n) Σ (Xi - μ1)², σ̃2² = (1/n) Σ (Yi - μ2)², and ρ̃ = [(1/n) Σ (Xi - μ1)(Yi - μ2)] / σ̃1σ̃2,
respectively.
(b) If n ≥ 5 and μ1 and μ2 are unknown, show that the estimates of μ1, μ2, σ1², σ2², and ρ coincide with the method of moments estimates of Problem 2.1.8.
Hint: (b) Because (X1, Y1) has a density, you may assume σ̃1² > 0, σ̃2² > 0, |ρ̃| < 1, and use
EθT = (μ1, μ2, σ1² + μ1², σ2² + μ2², ρσ1σ2 + μ1μ2).

3. Show that if T is minimal and E is open and the MLE doesn't exist, then the coordinate ascent algorithm doesn't converge to a member of E.
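The coordinate ascent problem for the Gaussian linear model amounts to Gauss–Seidel on the normal equations: for least squares, each coordinate update has a closed form. A minimal sketch (illustrative code, not the book's), compared against the direct solution of the 2×2 normal equations:

```python
def coord_descent_ls(Z, y, sweeps=200):
    # cyclic coordinate minimization of ||y - Z beta||^2 (Gauss-Seidel on Z'Z beta = Z'y)
    n, k = len(Z), len(Z[0])
    beta = [0.0] * k
    for _ in range(sweeps):
        for j in range(k):
            # residual with coordinate j removed, then a one-dimensional least squares step
            r = [y[i] - sum(Z[i][m] * beta[m] for m in range(k) if m != j) for i in range(n)]
            beta[j] = (sum(Z[i][j] * r[i] for i in range(n))
                       / sum(Z[i][j] ** 2 for i in range(n)))
    return beta

Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.9, 3.1, 5.0, 7.0]
b = coord_descent_ls(Z, y)
# direct solution of the 2x2 normal equations for comparison
s00, s01, s11 = 4.0, 6.0, 14.0
t0, t1 = sum(y), sum(Z[i][1] * y[i] for i in range(4))
det = s00 * s11 - s01 * s01
exact = [(s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det]
assert all(abs(b[j] - exact[j]) < 1e-8 for j in range(2))
```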
4. Let (Ii, Yi), 1 ≤ i ≤ n, be independent and identically distributed according to Pθ, θ = (λ, μ) ∈ (0, 1) × R, where
Pθ[I1 = 1] = λ = 1 - Pθ[I1 = 0],
and, given I1 = j, Y1 ~ N(μj, σj²), j = 0, 1, with σ0² and σ1² known.
(a) Show that X = {(Ii, Yi) : 1 ≤ i ≤ n} is distributed according to an exponential family; exhibit its sufficient statistic T.
(b) Deduce that T is minimal sufficient.
(c) Give explicitly the maximum likelihood estimates of μ and λ, when they exist.

5. Suppose the Ii in Problem 4 are not observed, so that only Y1, ..., Yn are available.
(a) Justify crude estimates μ̃ and λ̃ of μ and λ based on Ȳ and (1/n) Σ (Yi - Ȳ)². Do you see any problems with λ̃?
(b) Give as explicitly as possible the E- and M-steps of the EM algorithm for this problem, using μ̃ and λ̃ as the first approximation to the maximum likelihood estimate.

6. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. For families in which one member has the disease, it is desired to estimate the proportion θ that has the genetic trait. Suppose that in a family of n members in which one has the disease (and, thus, also the trait), X is the number of members who have the trait. Because it is known that X ≥ 1, the model often used for X is that it has the conditional distribution of a B(n, θ) variable, θ ∈ [0, 1], given X ≥ 1.
(a) Show that
Pθ(X = x | X ≥ 1) = (n choose x) θˣ (1 - θ)ⁿ⁻ˣ / [1 - (1 - θ)ⁿ], x = 1, ..., n,
and that the MLE exists and is unique. Hint: Use Bayes rule.
(b) Use (2.4.3) to obtain the Newton–Raphson updating formula for θ̂1 in terms of a preliminary estimate θ̂.
(c) If n = 5, x = 2, find θ̂1 of (b) above using θ̂ = x/n as a preliminary estimate.

7. Let X1, X2, X3 be independent observations from the Cauchy distribution about θ,
f(x, θ) = (1/π)[1 + (x - θ)²]⁻¹.
Suppose x1 = 0, x2 = 1, x3 = a. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between 1 and a.
(a) Deduce that depending on where bisection is started, the sequence of iterates may converge to one or the other of the local maxima.
(b) Make a similar study of the Newton–Raphson method in this case.
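For the truncated binomial family P(X = x | X ≥ 1) above, the Newton–Raphson iteration can be sketched numerically. This is an illustration only (it uses a finite-difference second derivative for simplicity rather than the book's formula):

```python
def trunc_binom_score(theta, x, n):
    # d/d(theta) of log P(X = x | X >= 1) for the truncated binomial model
    return (x / theta - (n - x) / (1 - theta)
            - n * (1 - theta) ** (n - 1) / (1 - (1 - theta) ** n))

def trunc_binom_newton(x, n, start=None, tol=1e-10):
    t = start if start is not None else x / n          # crude preliminary estimate
    for _ in range(100):
        h = 1e-6
        d = (trunc_binom_score(t + h, x, n) - trunc_binom_score(t - h, x, n)) / (2 * h)
        t_new = min(max(t - trunc_binom_score(t, x, n) / d, 1e-8), 1 - 1e-8)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

theta_hat = trunc_binom_newton(2, 5)
assert 0 < theta_hat < 1
assert abs(trunc_binom_score(theta_hat, 2, 5)) < 1e-6  # stationary point of the log-likelihood
```

Starting from θ̂ = x/n = 0.4, the iteration moves quickly to the root near 0.356.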
8. Consider the following algorithm under the conditions of Theorem 2.4.2. Define η̂(λ) as before and set
η̂new = η̂(λ*),
where λ* maximizes η̂(λ)ᵀ t0 - A(η̂(λ)). Show that the sequence defined by this algorithm converges to the MLE if it exists.
Hint: Apply the argument of the proof of Theorem 2.4.2, noting that the sequence of iterates {η̂m} is bounded and, hence, the sequence (η̂m, η̂m+1) has a convergent subsequence.

9. Let X1, ..., Xn be i.i.d. as X = (U, V, W), where U ∈ {1, ..., A}, V ∈ {1, ..., B}, W ∈ {1, ..., C}, and let
p_abc = P[U = a, V = b, W = c], 1 ≤ a ≤ A, 1 ≤ b ≤ B, 1 ≤ c ≤ C.
(a) Suppose that for all a, b, c,
(1) log p_abc = μ_ac + ν_bc,
where -∞ < μ, ν < ∞ and Σ_{a,b,c} p_abc = 1. Show that this holds iff
P[U = a, V = b | W = c] = P[U = a | W = c] P[V = b | W = c],
i.e., iff U and V are independent given W.
(b) Show that the family of distributions obtained by letting μ, ν vary freely is an exponential family of rank (C - 1) + C(A + B - 2) = C(A + B - 1) - 1, generated by N_++c, N_a+c, N_+bc, where N_abc = #{i : Xi = (a, b, c)} and "+" indicates summation over the index.
(c) Show that the MLEs exist iff 0 < N_a+c, N_+bc < N_++c for all a, b, c, and then are given by
p̂_abc = N_a+c N_+bc / (n N_++c).
Hint: (b) Consider N_a+c - N_++c/A, N_+bc - N_++c/B, N_++c. (c) The model implies p_abc = p_a+c p_+bc / p_++c; use the likelihood equations.

10. Suppose X is as in Problem 9, but now
(2) log p_abc = μ_ac + ν_bc + γ_ab,
where the μ, ν, γ vary freely.
(a) Show that this is an exponential family of rank
(A - 1) + (B - 1) + (C - 1) + (A - 1)(B - 1) + (A - 1)(C - 1) + (B - 1)(C - 1) = AB + AC + BC - (A + B + C).
(b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model.
Initialize: p̂(0)_abc = (N_a++ / n)(N_+b+ / n)(N_++c / n);
p̂(1)_abc = p̂(0)_abc N_ab+ / (n p̂(0)_ab+),
p̂(2)_abc = p̂(1)_abc N_a+c / (n p̂(1)_a+c),
p̂(3)_abc = p̂(2)_abc N_+bc / (n p̂(2)_+bc).
Reinitialize with p̂(3)_abc. Show that the algorithm converges to the MLE if it exists and diverges otherwise.
Hint: Note that because {p̂(0)_abc} belongs to the model, so do all subsequent iterates, and that p̂(1) is the MLE for the exponential family
p_abc = e^{γ_ab} p̂(0)_abc / Σ_{a',b'} e^{γ_a'b'} p̂(0)_a'b'c,
obtained by fixing the "b, c" and "a, c" parameters.

11. Let f(x, θ) = f0(x - θ), where f0(x) = (1/3)φ(x) + (2/3)φ(x - a) and φ is the N(0, 1) density. Show for n = 1 that, if a is sufficiently large, bisection may lead to a local maximum of the likelihood.

12. Justify formula (2.4.8). Hint: Pθ[X = x | S(X) = s] = p(x, θ) 1{S(x) = s} / Pθ[S(X) = s].

13. (a) Show that S in Example 2.4.5 has the specified mixture of Gaussian distribution.
(b) Give explicitly the E- and M-steps of the EM algorithm in this case.
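The "proportional fitting" cycle above is iterative proportional fitting (IPF): each step rescales the current table so one fitted two-way margin matches the observed one. A compact sketch for a 2×2×2 table (illustrative; for simplicity it initializes from the uniform table rather than the product of one-way margins):

```python
def ipf(N, iters=100):
    """Iterative proportional fitting of a 3-way table N[a][b][c] to its two-way margins."""
    A, B, C = len(N), len(N[0]), len(N[0][0])
    n = sum(N[a][b][c] for a in range(A) for b in range(B) for c in range(C))
    p = [[[1.0 / (A * B * C) for _ in range(C)] for _ in range(B)] for _ in range(A)]
    for _ in range(iters):
        for a in range(A):                      # match the "ab" margins
            for b in range(B):
                m = sum(p[a][b][c] for c in range(C))
                t = sum(N[a][b][c] for c in range(C)) / n
                for c in range(C):
                    p[a][b][c] *= t / m
        for a in range(A):                      # match the "ac" margins
            for c in range(C):
                m = sum(p[a][b][c] for b in range(B))
                t = sum(N[a][b][c] for b in range(B)) / n
                for b in range(B):
                    p[a][b][c] *= t / m
        for b in range(B):                      # match the "bc" margins
            for c in range(C):
                m = sum(p[a][b][c] for a in range(A))
                t = sum(N[a][b][c] for a in range(A)) / n
                for a in range(A):
                    p[a][b][c] *= t / m
    return p

N = [[[10, 5], [3, 8]], [[6, 7], [9, 2]]]
p = ipf(N)
n = 50
# at convergence every fitted two-way margin matches the observed one
for a in range(2):
    for b in range(2):
        assert abs(sum(p[a][b]) - sum(N[a][b]) / n) < 1e-6
```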
14. Verify the formula given in Example 2.4.3 for the actual MLE in that example.

15. Establish part (b) of Theorem 2.4.2. Hint: Show that {(θ̂m, θ̂m+1)} has a subsequence converging to (θ*, θ*) and, thus, necessarily θ* is the global maximizer.

16. Establish the last claim in part (2) of the proof of Theorem 2.4.2. Hint: Use the canonical nature of the family and the openness of E.

17. Limitations of the EM Algorithm. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component Xj of the data vector X is missing, given the rest of the data vector, is not a function of Xj. That is, for X = {Xj}, the process determining whether Xj is missing is independent of Xj, given X - {Xj}. This condition is called missing at random. For example, the probability that Yi is missing may depend on Zi, but not on Yi. If Yi represents the seriousness of a disease, this assumption may not be satisfied. For instance, suppose all subjects with Yi ≥ 2 drop out of the study; that is, suppose Yi is missing iff Yi ≥ 2. Then using the E-step to impute values for the missing Y's would greatly underpredict the actual Y's, because all the Y's in the imputation would have Y < 2.

18. EM and Regression. For X = {(Zi, Yi) : i = 1, ..., n}, consider the model
Yi = β1 + β2 Zi + εi,
where ε1, ..., εn are i.i.d. N(0, σ²), and Z1, ..., Zn are i.i.d. N(μ1, σ1²), independent of ε1, ..., εn. Suppose that for 1 ≤ i ≤ m we observe both Zi and Yi and for m + 1 ≤ i ≤ n we observe only Yi. Complete the E- and M-steps of the EM algorithm for estimating (μ1, σ1², β1, β2, σ²).

19. If μ2 = 1, σ1 = σ2 = 1, and ρ = 0.5 in the bivariate normal example of Section 2.4, find the probability that E(Yi | Zi) underpredicts Yi.

NOTES

Notes for Section 2.1
(1) "Natural" now was not so natural in the eighteenth century, when the least squares principle was introduced by Legendre and Gauss. For a fascinating account of the beginnings of estimation in the context of astronomy see Stigler (1986).
(2) The frequency plug-in estimates are sometimes called Fisher consistent. R. A. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. These considerations lead essentially to maximum likelihood estimates.
"Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm. GIlBELS. A. THOMAS. 1985. S. and MacKinlay. DEMPSTER. Note for Section 2. Fisher 1950) New York: J. Statist. M. R. E. HOLLAND. AND C. J. PETRIE. AND K. W. MA: MIT Press. FEINBERG. Statist. BRUNK. BISHOP. Campbell. VAN LOAN. BJORK.. 1972. 199200 (1985). G.g...2 (I) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).. EiSENHART. 1997.7 REFERENCES BARLOW.. 1.1. (2) For further properties of KullbackLeibler divergence. FAN. A. 164171 (1970). 1975. 138 (1977). A. "Y. Sciences. 1974. G. AND N. Matrix Computations Baltimore: John Hopkins University Press. AND N. Numerical Analysis New York: Prentice Hall. E. "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains:' Am!.. H. Appendix A. B. see Cover and Thomas (1991). FISHER. DAHLQUIST. AND A. DHARMADHlKARI... "The Meaning of Least in Least Squares. "Examples of Nonunique Maximum Likelihood Estimators.. a multivariate version of minimum contrasts estimates are often called generalized method of moment estimates. M.. t 996. Acad." reprinted in Contributions to MatheltUltical Statistics (by R. M. G. NJ: Princeton University Press. Roy. A. 2433 (1964). SOULES. 1991. M.Section 2.5 (1) In the econometrics literature (e. 39.. F. The Econometrics ofFinancial Markets Princeton.2. 2. J. A. "On the Mathematical Foundations of Theoretical Statistics.3 (I) Recall that in an exponential family. 41. ANOH.• AND I. C. Local Polynomial Modelling and Its Applications London: Chapman and Hall. HABERMAN. W. MACKINLAY. AND P. D." Journal Wash. Soc. 1922. Statistical Inference Under Order Restrictions New York: Wiley. BAUM. for any A.7 References 159 Notes for Section 2. c. RUBIN. S.loAGDEV. B." J. T. GoLUB. AND D. Elements oflnfonnation Theory New York: Wiley." The American Statistician. E. La. AND J. Math. 
The Analysis ofFrequency Data Chicago: University of Chicago Press. Wiley and Sons. 1997). CAMPBELL. P[T(X) E Al o for all or for no PEP. Note for Section 2. La. LAIRD. S. M.. WEISS. 0. . A. 54. BARTHOLOMEW. T. "Y. Discrete Multivariate Analysis: Theory and Practice Cambridge. R. L. M. BREMNER. COVER. ANDERSON.1974. J..
WEISBERG. 1989.. 102108 (1956). Theory. 11. Applied linear Regression. 10. AND M. COCHRAN. "On the Shannon Theory of Information Transmission in the Case of Continuous Signals. J. C. J. 1997. B. Statistical Methods.379243." J. The History ofStatistics Cambridge. S. G. Ames. MOSTELLER. W. Exp. 2nd ed." Ann. SEBER.623656 (1948).95103 (1983) .. AND D. Wiley. I 1 . "Multivariate Locally Weighted Least Squares Regression. "A Flexible Growth Function for Empirical Use. Amer. • . G. MACLACHLAN. 1986." Ann. J. 290300 ( 1959). P."/RE Trans! Inform. R. . In. G.. SHANNON.160 Methods of Estimation Chapter 2 KOLMOGOROV. AND T.. IA: Iowa State University Press. F. SNEDECOR. WU. WAND. Journal.. S. "Association and Estimation in Contingency Tables.1. Nonlinear Regression New York: Wiley.. New York: Wiley." 1. i I. 63. A. 6th ed. RICHARDS... 22.. Statist. STIGLER. The EM Algorithm and Extensions New York: Wiley. Statisr. 1985. E.. 1%7. F. Assoc. A. Statist. C. AND W." Bell System Tech. WILD. N.. E. J. Botany. I • I . Statistical Anal. KR[SHNAN. MA: Harvard University Press. LIlTLE.. "On the Convergence Properties of the EM Algorithm. RUPPERT. "A Mathematical Theory of Communication. 128 (1968).. RUBIN. 13461370 (1994).·sis with Missing Data New York: J. A. D. E... AND c. 27. 1987.
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. Our examples are primarily estimation of a real parameter. In Section 3.4, we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision-theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model P = {Pθ : θ ∈ Θ}, action space A, and loss function l(θ, a), then for data X ~ Pθ and any decision procedure δ, randomized or not, we can define its risk function R(·, δ) : Θ → R⁺ by

R(θ, δ) = Eθ l(θ, δ(X)).  (3.2.1)

We think of R(·, δ) as measuring a priori the performance of δ for this model. Strict comparison of δ1 and δ2 on the basis of the risks alone is not well defined unless R(θ, δ1) ≤ R(θ, δ2) for all θ, or vice versa. However, by introducing a Bayes prior density (say) π for θ, comparison becomes unambiguous by considering the scalar Bayes risk,

r(π, δ) = E R(θ, δ) = E l(θ, δ(X)).  (3.2.2)
In the continuous case, with θ real valued and π a density on Θ ⊂ R,

    r(π, δ) = ∫ R(θ, δ) π(θ) dθ,    (3.2.2)

the Bayes risk of the problem. Recall also that we can define a Bayes rule δ* as one such that

    r(π, δ*) = inf{r(π, δ) : δ ∈ D},    (3.2.3)

and that in Section 1.3 we showed how in an example we could identify the Bayes rule. In this section we shall show systematically how to construct Bayes rules. This exercise is interesting and important even if we do not view π as reflecting an implicitly believed in prior distribution on θ: if π is thought of as a weight function roughly reflecting our knowledge, it is plausible that δ*, if computable, will behave reasonably even if our knowledge is only roughly right. It is in fact clear that prior and loss function cannot be separated out clearly either: considering the pair l(θ, a) and π(θ) is equivalent to considering the pair l(θ, a)π(θ) and π₂(θ) ≡ 1. Thus, π(θ) ≡ 1 plays a special role ("equal weight"), though the parametrization plays a crucial role here. Moreover, π may express that we care more about the values of the risk in some rather than other regions of Θ. We may have vague prior notions such as "|θ| > 5 is physically implausible" if, for instance, θ denotes the mean height of people in meters. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further here, and instead turn to construction of Bayes procedures.

Suppose θ is a random variable (or vector) with (prior) frequency function or density π(θ). Our problem is to find the function δ of X that minimizes r(π, δ). We first consider the problem of estimating q(θ) with quadratic loss, l(θ, a) = (q(θ) − a)², using a nonrandomized decision rule δ. This is just the problem of finding the best mean squared prediction error (MSPE) predictor of q(θ) given X (see Remark 1.4.5): E(Y | X) is the best predictor because E(Y | X = x) minimizes the conditional MSPE E((Y − a)² | X = x). Using our results on MSPE prediction, we find that either r(π, δ) = ∞ for all δ or the Bayes rule δ* is given by

    δ*(X) = E[q(θ) | X].    (3.2.5)

This procedure is called the Bayes estimate for squared error loss. In view of formulae (1.2.8) for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case,

    δ*(x) = ∫ q(θ) p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ.    (3.2.6)

In the discrete case, as usual, we just need to replace the integrals by sums. Here is an example.
Example 3.2.1. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. Suppose that we want to estimate the mean θ of a normal distribution with known variance σ² on the basis of a sample X₁, ..., X_n. If we choose the conjugate prior N(η₀, τ²), we obtain the posterior distribution, and the Bayes estimate is just the mean of the posterior distribution,

    δ*(X) = η₀ [ (1/τ²) / (n/σ² + 1/τ²) ] + X̄ [ (n/σ²) / (n/σ² + 1/τ²) ].    (3.2.7)

Its Bayes risk (the MSPE of the predictor) is just

    r(π, δ*) = E(θ − E(θ | X))² = E[E((θ − E(θ | X))² | X)] = E[ τ² / (1 + nτ²/σ²) ] = 1 / (n/σ² + 1/τ²).

Formula (3.2.7) reveals the Bayes estimate in the proper case to be a weighted average w η₀ + (1 − w) X̄ of η₀, the estimate to be used when there are no observations, and X̄, with weights inversely proportional to the Bayes risks of these two estimates. No finite choice of η₀ and τ² will lead to X̄ as a Bayes estimate. But X̄ is the limit of such estimates as prior knowledge becomes "vague" (τ → ∞ with η₀ fixed). In fact, if we substitute the prior "density" π(θ) ≡ 1 in (3.2.6), X̄ is the estimate that (3.2.6) yields. Such priors with ∫ π(θ) dθ = ∞ or Σ_θ π(θ) = ∞ are called improper. The resulting Bayes procedures are also called improper. Because the Bayes risk of X̄, σ²/n, tends to 0 as n → ∞, the Bayes estimate corresponding to the prior density N(η₀, τ²) differs little from X̄ for n large. In fact, X̄ is approximately a Bayes estimate for any one of these prior distributions in the sense that [r(π, X̄) − r(π, δ*)]/r(π, δ*) → 0 as n → ∞. For more on this, see Section 5.5. □

We now turn to the problem of finding Bayes rules for general action spaces A and loss functions l. To begin with we consider only nonrandomized rules. If we look at the proof of Theorem 1.4.1, we see that the key idea is to consider what we should do given X = x. Applying the same idea in the general Bayes decision problem, we form, for each x, the posterior risk

    r(a | x) = E(l(θ, a) | X = x)

as a function of the action a. This quantity r(a | x) is what we expect to lose if X = x and we use action a. Intuitively, we should, for each x, take that action a = δ*(x) that makes r(a | x) as small as possible. This action need not exist nor be unique if it does exist.
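The conjugate update in this example is easy to check numerically. The following is a minimal sketch of (3.2.7); the data, prior mean, and variances are made-up illustrations:

```python
import numpy as np

def normal_posterior(x, sigma2, eta0, tau2):
    """Posterior mean and variance of theta for X_1,...,X_n i.i.d. N(theta, sigma2)
    with prior theta ~ N(eta0, tau2); the mean is the Bayes estimate (3.2.7)."""
    n = len(x)
    precision = n / sigma2 + 1.0 / tau2
    w = (1.0 / tau2) / precision                 # weight on the prior mean eta0
    mean = w * eta0 + (1.0 - w) * np.mean(x)     # weighted average of eta0 and X-bar
    return mean, 1.0 / precision                 # posterior variance = Bayes risk

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=50)
m, v = normal_posterior(x, sigma2=1.0, eta0=0.0, tau2=100.0)
# With tau2 large (a "vague" prior) the Bayes estimate is close to X-bar.
```

The weights on η₀ and X̄ are inversely proportional to 1/τ² and n/σ², matching the discussion following (3.2.7).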
Proposition 3.2.1. Suppose that there exists a function δ*(x) such that

    r(δ*(x) | x) = inf{r(a | x) : a ∈ A}.    (3.2.8)

Then δ* is a Bayes rule.

Proof. As in the proof of Theorem 1.4.1,

    r(π, δ) = E[l(θ, δ(X))] = E[E(l(θ, δ(X)) | X)].    (3.2.9)

But, for any δ,

    E[l(θ, δ(X)) | X = x] = r(δ(x) | x) ≥ r(δ*(x) | x) = E[l(θ, δ*(X)) | X = x].

Therefore, E[l(θ, δ(X)) | X] ≥ E[l(θ, δ*(X)) | X], and the result follows from (3.2.9). □

The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all competing procedures.

As a first illustration, consider the oil-drilling example (Example 1.3.5) with prior π(θ₁) = 0.2, π(θ₂) = 0.8. Suppose we observe x = 0. Then the posterior distribution of θ is given by (1.2.8), and from it we compute the posterior risks r(a₁ | 0), r(a₂ | 0), and r(a₃ | 0) of the actions a₁, a₂, and a₃. Of these, a₂ has the smallest posterior risk and, therefore, δ*(0) = a₂. Comparing the posterior risks r(aⱼ | 1) in the same way yields δ*(1), and we find δ* = δ₅ as we found previously.

Example 3.2.2. Bayes Procedures When Θ and A Are Finite. More generally, consider the following class of situations. Let Θ = {θ₀, ..., θ_p}, A = {a₀, ..., a_q}, let w_ij ≥ 0 be given constants, and let the loss incurred when θ_i is true and action a_j is taken be given by

    l(θ_i, a_j) = w_ij.
Let π(θ) be a prior distribution assigning mass π_i to θ_i, so that π_i ≥ 0, i = 0, ..., p, and Σᵢ₌₀ᵖ π_i = 1. Suppose, moreover, that X has density or frequency function p(x | θ) for each θ. Then, by (1.2.8), the posterior probabilities are

    π(θ_i | x) = π_i p(x | θ_i) / Σⱼ π_j p(x | θ_j)

and, thus,

    r(a_j | x) = Σᵢ w_ij π_i p(x | θ_i) / Σᵢ π_i p(x | θ_i).    (3.2.10)

The optimal action δ*(x) has

    r(δ*(x) | x) = min{r(a_j | x) : 0 ≤ j ≤ q}.
Here are two interesting specializations.

(a) Classification: Suppose that p = q, we identify a_j with θ_j, j = 0, ..., p, and let

    w_ij = 1 if i ≠ j, w_ii = 0.

This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,

    r(θ_i | x) = P[θ ≠ θ_i | X = x],

and minimizing r(θ_i | x) is equivalent to the reasonable procedure of maximizing the posterior probability,

    P[θ = θ_i | X = x] = π_i p(x | θ_i) / Σⱼ π_j p(x | θ_j).

(b) Testing: Suppose p = q = 1, π₀ = π, π₁ = 1 − π, 0 < π < 1, a₀ corresponds to deciding θ = θ₀ and a₁ to deciding θ = θ₁. This is a special case of the testing formulation of Section 1.3 with Θ₀ = {θ₀} and Θ₁ = {θ₁}. The Bayes rule is then to

    decide θ = θ₁ if (1 − π) p(x | θ₁) > π p(x | θ₀),
    decide θ = θ₀ if (1 − π) p(x | θ₁) < π p(x | θ₀),

and decide either a₀ or a₁ if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing between a₀ and a₁ if equality occurs. As we let π vary between zero and one, we obtain what is called the class of Neyman-Pearson tests, which provides the solution to the problem of minimizing P(type II error) given P(type I error) ≤ α. This is treated further in Chapter 4. □
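For finite Θ and A, the recipe of (3.2.10) amounts to a few lines of linear algebra. A minimal sketch, in which the prior, likelihood values, and loss matrix are illustrative:

```python
import numpy as np

def bayes_action(prior, likelihood, loss):
    """prior[i] = pi_i, likelihood[i] = p(x | theta_i), loss[i, j] = w_ij.
    Returns the index j minimizing the posterior risk r(a_j | x) of (3.2.10)."""
    post = prior * likelihood
    post = post / post.sum()          # posterior pi(theta_i | x), as in (1.2.8)
    risks = post @ loss               # r(a_j | x) for each action a_j
    return int(np.argmin(risks)), risks

# Testing specialization (b) with 0-1 loss: decide theta_1 iff
# (1 - pi) p(x | theta_1) > pi p(x | theta_0).
prior = np.array([0.6, 0.4])          # pi, 1 - pi
lik = np.array([0.2, 0.7])            # p(x | theta_0), p(x | theta_1)
loss01 = np.array([[0.0, 1.0],
                   [1.0, 0.0]])       # w_ij = 1 if i != j, w_ii = 0
j_star, risks = bayes_action(prior, lik, loss01)
# Here (1 - pi) p(x | theta_1) = 0.28 > 0.12 = pi p(x | theta_0), so j_star = 1.
```

With the 0-1 loss matrix, the posterior risk of each action is one minus the posterior probability of the corresponding state, so minimizing posterior risk is exactly the maximum posterior probability rule of specialization (a).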
To complete our illustration of the utility of Proposition 3.2.1, we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation of the Probability of Success in n Bernoulli Trials. Suppose that we wish to estimate θ using X₁, ..., X_n, the indicators of n Bernoulli trials with probability of success θ. We shall consider the loss function l given by

    l(θ, a) = (θ − a)² / θ(1 − θ), 0 < θ < 1, a real.    (3.2.11)

This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for θ close to zero, this l(θ, a) is close to the relative squared error (θ − a)²/θ. It makes X̄ have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5.

By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if all terms on the right-hand side are finite,

    r(a | k) = E(θ/(1 − θ) | S = k) − 2a E(1/(1 − θ) | S = k) + a² E(1/θ(1 − θ) | S = k).    (3.2.12)

Minimizing this parabola in a, we find our Bayes procedure is given by

    δ*(k) = E(1/(1 − θ) | S = k) / E(1/θ(1 − θ) | S = k),    (3.2.13)

provided the denominator is not zero. For convenience let us now take as prior density the density b_{r,s}(θ) of the beta distribution β(r, s). In Example 1.2.1 we showed that this leads to a β(k + r, n − k + s) posterior distribution for θ if S = k. If 1 ≤ k ≤ n − 1 and n ≥ 2, then all quantities in (3.2.12) and (3.2.13) are finite, and

    δ*(k) = ∫₀¹ (1/(1 − θ)) b_{k+r, n−k+s}(θ) dθ / ∫₀¹ (1/θ(1 − θ)) b_{k+r, n−k+s}(θ) dθ
          = B(k + r, n − k + s − 1) / B(k + r − 1, n − k + s − 1)
          = (k + r − 1) / (n + s + r − 2),    (3.2.14)

where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = 0 is the only a that makes r(a | k) < ∞. Thus, δ*(0) = 0. Similarly, we get δ*(n) = 1. If we assume a uniform prior density (r = s = 1), we see that the Bayes procedure is the usual estimate, X̄. This is not the case for quadratic loss (see Problem 3.2.2). □
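The closed form (3.2.14) can be checked directly against the ratio of beta functions. A sketch, with arbitrary illustrative values of k, n, r, s:

```python
import math

def bayes_estimate(k, n, r, s):
    """Bayes estimate (3.2.14) for S = k successes under the loss
    (theta - a)^2 / theta(1 - theta) with a beta(r, s) prior."""
    if k == 0:
        return 0.0
    if k == n:
        return 1.0
    return (k + r - 1) / (n + s + r - 2)

def B(a, b):
    # Beta function via the gamma function, cf. B.2.11 of Appendix B
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

k, n, r, s = 3, 10, 2, 2
ratio = B(k + r, n - k + s - 1) / B(k + r - 1, n - k + s - 1)
# ratio equals (k + r - 1)/(n + s + r - 2); with r = s = 1 the rule is k/n = X-bar.
```

The last identity follows because B(a, b)/B(a − 1, b) = (a − 1)/(a + b − 1), which is exactly the cancellation used in (3.2.14).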
"Real" computation of Bayes procedures
The closed forms of (3.2.6) and (3.2.10) make the computation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that θ = (θ₁, ..., θ_p) has a hierarchically defined prior density,

    π(θ₁, θ₂, ..., θ_p) = π₁(θ₁) π₂(θ₂ | θ₁) ⋯ π_p(θ_p | θ_{p−1}).    (3.2.15)
Here is an example.

Example 3.2.4. The random effects model we shall study in Volume II has

    X_ij = μ + Δ_i + ε_ij, 1 ≤ i ≤ I, 1 ≤ j ≤ J,    (3.2.16)

where the ε_ij are i.i.d. N(0, σ²_e) and μ and the vector Δ = (Δ₁, ..., Δ_I) are independent of {ε_ij : 1 ≤ i ≤ I, 1 ≤ j ≤ J}, with Δ₁, ..., Δ_I i.i.d. N(0, σ²_Δ) and μ ~ N(μ₀, σ²_μ). Here the X_ij can be thought of as measurements on individual i and Δ_i is an "individual" effect. If we now put a prior distribution on (μ, σ²_e, σ²_Δ) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by θ = (μ, σ²_e, σ²_Δ, Δ₁, ..., Δ_I) with the X_ij | θ independent N(μ + Δ_i, σ²_e). Then p(x | θ) = ∏_{i,j} φ_{σ_e}(x_ij − μ − Δ_i) and

    π(θ) = π₁(μ) π₂(σ²_e) π₃(σ²_Δ) ∏ᵢ₌₁ᴵ φ_{σ_Δ}(Δ_i),    (3.2.17)

where φ_σ denotes the N(0, σ²) density. In such a context a loss function frequently will single out some single coordinate θ_s (e.g., Δ₁ in (3.2.17)), and to compute r(a | x) we will need the posterior distribution of θ_s | X. But this is obtainable from the posterior distribution of θ given X = x only by integrating out θ_j, j ≠ s, and if p is large this is intractable. In recent years so-called Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable and the use of Bayesian methods has spread. We return to the topic in Volume II. □
Linear Bayes estimates
When the problem of computing r(π, δ) and δ_π is daunting, an alternative is to consider a class D₀ of procedures for which r(π, δ) is easy to compute and then to look for the rule in D₀ that minimizes r(π, δ) over D₀. An example is linear Bayes estimates where, in the case of squared error loss [q(θ) − a]², the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + Σⱼ bⱼXⱼ. If in (1.4.14) we identify q(θ) with Y and X with Z, the solution is

    δ̂(X) = E q(θ) + [X − E(X)]ᵀ β,

where β is as defined in Section 1.4. For example, if in the model (3.2.16), (3.2.17) we set q(θ) = Δ₁, we can find the linear Bayes estimate of Δ₁ by using (1.4.6) and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of Δ₁ is

    Δ̂₁ = E(Δ₁) + [X − E(X)]ᵀ β,    (3.2.18)

where E(Δ₁) = 0, X = (X₁₁, ..., X₁J)ᵀ, μ = E(X), and β = Σ⁻¹_XX Σ_XΔ₁. For the given model,

    Var(X₁ⱼ) = E[Var(X₁ⱼ | θ)] + Var[E(X₁ⱼ | θ)] = E(σ²_e) + σ²_μ + σ²_Δ,

    Cov(X₁ⱼ, X₁ₖ) = E[Cov(X₁ⱼ, X₁ₖ | θ)] + Cov(E(X₁ⱼ | θ), E(X₁ₖ | θ))
                  = 0 + Cov(μ + Δ₁, μ + Δ₁) = σ²_μ + σ²_Δ, j ≠ k,

    Cov(Δ₁, X₁ⱼ) = E[Cov(X₁ⱼ, Δ₁ | θ)] + Cov(E(X₁ⱼ | θ), E(Δ₁ | θ)) = 0 + σ²_Δ = σ²_Δ.

From these calculations we find β and δ_L(X). We leave the details to Problem 3.2.10. Linear Bayes procedures are useful in actuarial science; see, for example, Bühlmann (1970) and Norberg (1986).
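These moment calculations determine β numerically. A sketch under the assumption that the variance components are known constants; the values of σ²_μ, σ²_Δ, σ²_e, the mean of X, and the data are all made up for illustration:

```python
import numpy as np

J = 4                                   # number of measurements on individual 1
s2_mu, s2_Delta, s2_e = 1.0, 2.0, 0.5   # assumed (illustrative) variance components

# Cov(X) for X = (X_11,...,X_1J): diagonal s2_e + s2_mu + s2_Delta,
# off-diagonal s2_mu + s2_Delta, as computed above.
Sigma_XX = np.full((J, J), s2_mu + s2_Delta) + s2_e * np.eye(J)
Sigma_XD = np.full(J, s2_Delta)         # Cov(Delta_1, X_1j) = s2_Delta
beta = np.linalg.solve(Sigma_XX, Sigma_XD)

mu_X = 2.0                              # E(X_1j), assumed known here
x = np.array([3.1, 2.7, 3.4, 2.9])
delta1_hat = (x - mu_X) @ beta          # (3.2.18) with E(Delta_1) = 0
```

By symmetry all coordinates of β are equal, so the linear Bayes estimate shrinks the average deviation of the X₁ⱼ from their mean, the same structure as in credibility formulas of actuarial science.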
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier, the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is the (usually improper) prior π(θ) ≡ c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than modes, and that again accounts in part for the popularity of maximum likelihood.

An important property of the MLE is equivariance: An estimating method M producing the estimate θ̂_M is said to be equivariant with respect to reparametrization if for every one-to-one function h from Θ to Ω = h(Θ), the estimate of ω = h(θ) is ω̂_M = h(θ̂_M); that is, (h(θ))̂_M = h(θ̂_M). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss, then the Bayes procedure θ̂_B = E(θ | X) is not equivariant for nonlinear transformations because

    E(h(θ) | X) ≠ h(E(θ | X))

for nonlinear h (e.g., Problem 3.2.3).

The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is

    r_Θ(a | x) = Σ_{θ∈Θ} [θ − a]² π(θ | x).    (3.2.19)

If we set ω = h(θ) for h one-to-one onto Ω = h(Θ), then ω has prior λ(ω) = π(h⁻¹(ω)) and, in the ω parametrization, the posterior Bayes risk is

    r_Ω(a | x) = Σ_{ω∈Ω} [ω − a]² λ(ω | x) = Σ_{θ∈Θ} [h(θ) − a]² π(θ | x).    (3.2.20)

Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r_Ω(a | x) ≠ r_Θ(h⁻¹(a) | x).
Loss functions of the form l(θ, a) = Q(P_θ, P_a) are necessarily equivariant. The Kullback-Leibler divergence K(θ, a), θ, a ∈ Θ, is an example of such a loss function. It satisfies K_Ω(ω, a) = K_Θ(θ, h⁻¹(a)); thus, with this loss function,

    r_Ω(a | x) = r_Θ(h⁻¹(a) | x).

See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.10)). In the N(θ, σ₀²) case the KL (Kullback-Leibler) loss K(θ, a) is (n/2σ₀²)(a − θ)² (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families,

    K(η, a) = Σⱼ₌₁ᵏ [ηⱼ − aⱼ] E_η Tⱼ + A(a) − A(η).

Moreover, if we can find the KL loss Bayes estimate η̂_BKL of the canonical parameter η and if η = c(θ) is one-to-one on Θ, then the KL loss Bayes estimate of θ in the general exponential family is θ̂_BKL = c⁻¹(η̂_BKL). For instance, in Example 3.2.1 where μ is the mean of a normal distribution and the prior is normal, we found the squared error Bayes estimate μ̂_B = wη₀ + (1 − w)X̄, where η₀ is the prior mean and w is a weight. Because the KL loss is equivalent to squared error for the canonical parameter μ, then if ω = h(μ), ω̂_BKL = h(μ̂_BKL), where μ̂_BKL = wη₀ + (1 − w)X̄.

Bayes procedures based on the Kullback-Leibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997); Dowe, Baxter, Oliver, and Wallace (1998); and Hansen and Yu (2000). We will return to this in Volume II.
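The non-equivariance of the squared error Bayes procedure discussed above is easy to see numerically: the posterior mean of h(θ) is not h applied to the posterior mean. A sketch with a made-up discrete prior and likelihood, taking h(θ) = θ²:

```python
import numpy as np

theta = np.array([0.1, 0.5, 0.9])     # support of a made-up discrete prior
prior = np.array([0.3, 0.4, 0.3])
lik = theta ** 2 * (1 - theta)        # p(x | theta) for some fixed observed x
post = prior * lik
post = post / post.sum()              # posterior pi(theta | x)

mean_theta = post @ theta             # squared error Bayes estimate of theta
mean_sq = post @ theta ** 2           # squared error Bayes estimate of h(theta)
# mean_sq exceeds mean_theta**2 by the posterior variance, so
# E(h(theta) | X) != h(E(theta | X)) whenever the posterior is not degenerate.
```

By contrast, the MLE of θ² is simply the square of the MLE of θ, which is the equivariance property the Bayes posterior mean lacks.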
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1969; Lindley, 1965; Savage, 1954) who argue on normative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior π. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally, as we consider all possible priors, that the class D₀ of Bayes procedures and their limits is complete in the sense that for any δ ∈ D there is a δ₀ ∈ D₀ such that R(θ, δ₀) ≤ R(θ, δ) for all θ.

Summary. We show how Bayes procedures can be obtained for certain problems by computing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures requires sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worst-case analysis; the true θ is one that is as "hard" as possible. That is, δ₁ is better than δ₂ from a minimax point of view if sup_θ R(θ, δ₁) < sup_θ R(θ, δ₂), and δ* is said to be minimax if

    sup_θ R(θ, δ*) = inf_δ sup_θ R(θ, δ).
Here θ and δ are taken to range over Θ and D = {all possible decision procedures (possibly randomized)} while P = {P_θ : θ ∈ Θ}. It is fruitful to consider proper subclasses of D and subsets of P, but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a so-called zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set D of all randomized decision procedures, whereas Nature has at her disposal all prior distributions π on Θ. For the basic game, S picks δ without N's knowledge, N picks π without S's knowledge, and then all is revealed and S pays N

    r(π, δ) = ∫ R(θ, δ) dπ(θ),

where the notation ∫ R(θ, δ) dπ(θ) stands for ∫ R(θ, δ) π(θ) dθ in the continuous case and Σⱼ R(θⱼ, δ) π(θⱼ) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice δ of S before picking π, and S knows the rules of the game. Then N naturally picks π_δ such that

    r(π_δ, δ) = sup_π r(π, δ),    (3.3.1)

that is, π_δ is least favorable against δ. Knowing the rules of the game, S naturally picks δ* such that

    r(π_δ*, δ*) = sup_π r(π, δ*) = inf_δ sup_π r(π, δ).    (3.3.2)

We claim that δ* is minimax. To see this we note first that r(π, δ) ≤ sup_θ R(θ, δ) for all π, δ. On the other hand, if R(θ₁, δ) = sup_θ R(θ, δ), then with π₁ point mass at θ₁, r(π₁, δ) = R(θ₁, δ), and we conclude that

    sup_π r(π, δ) = sup_θ R(θ, δ)    (3.3.3)
and our claim follows.

II: S is told the choice π of N before picking δ, and N knows the rules of the game. Then S naturally picks δ_π such that r(π, δ_π) = inf_δ r(π, δ); that is, δ_π is a Bayes procedure for π. Then N should pick π* such that

    r(π*, δ_π*) = sup_π r(π, δ_π).    (3.3.4)
For obvious reasons, π* is called a least favorable (to S) prior distribution. As we shall see by example, although the right-hand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be unique. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both Θ and D are finite, then:

(a) v = sup_π inf_δ r(π, δ) and v̄ = inf_δ sup_π r(π, δ) are both assumed by (say) π* (least favorable) and δ* (minimax), respectively. Further,

    v = r(π*, δ*) = v̄.    (3.3.5)

v and v̄ are called the lower and upper values of the basic game. When v = v̄, their common value v is called the value of the game.
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification and testing when Θ₀ = {θ₀} and Θ₁ = {θ₁} (Example 3.2.2) but is too restrictive in its assumption for the great majority of inference problems. A generalization due to Wald and Karlin (see Karlin, 1959) states that the conclusions of the theorem remain valid if Θ and D are compact subsets of Euclidean spaces. There are more far-reaching generalizations but, as we shall see later, without some form of compactness of Θ and/or D, although equality of v and v̄ holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on Θ and D and are easy to prove.

Proposition 3.3.1. Suppose δ**, π** can be found such that

    δ** = δ_π**, π** = π_δ**,    (3.3.6)
that is, δ** is Bayes against π** and π** is least favorable against δ**. Then v = v̄ = r(π**, δ**). That is, π** is least favorable and δ** is minimax.

To utilize this result we need a characterization of π_δ. This is given by

Proposition 3.3.2. π_δ is least favorable against δ iff

    π_δ{θ : R(θ, δ) = sup_θ' R(θ', δ)} = 1.    (3.3.7)

That is, π_δ assigns probability only to points θ at which the function R(·, δ) is maximal.

Thus, combining Propositions 3.3.1 and 3.3.2, we have a simple criterion: "A Bayes rule with constant risk is minimax." Note that π_δ may not be unique. In particular, if R(θ, δ) is constant, the rule has constant risk, and then all π are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have

    v ≤ v̄    (3.3.8)

because, trivially,

    inf_δ r(π, δ) ≤ r(π, δ')    (3.3.9)

for all π, δ'. Hence,

    v = sup_π inf_δ r(π, δ) ≤ sup_π r(π, δ')    (3.3.10)

for all δ', and v ≤ inf_δ' sup_π r(π, δ') = v̄. On the other hand, by hypothesis,

    v ≥ inf_δ r(π**, δ) = r(π**, δ**) = sup_π r(π, δ**) ≥ v̄.    (3.3.11)

Combining (3.3.8) and (3.3.11) we conclude that

    v = inf_δ r(π**, δ) = r(π**, δ**) = sup_π r(π, δ**) = v̄,    (3.3.12)

as advertised. □
Proof of Proposition 3.3.2. π is least favorable for δ iff

    E_π R(θ, δ) = ∫ R(θ, δ) dπ(θ) = sup_π' r(π', δ).    (3.3.13)

But by (3.3.3),

    sup_π r(π, δ) = sup_θ R(θ, δ).    (3.3.14)

Because E_π R(θ, δ) ≤ sup_θ R(θ, δ), (3.3.13) is possible iff (3.3.7) holds. □

Putting the two propositions together we have the following.
Theorem 3.3.2. Suppose δ* has sup_θ R(θ, δ*) = r < ∞. If there exists a prior π* such that δ* is Bayes for π* and π*{θ : R(θ, δ*) = r} = 1, then δ* is minimax.
Example 3.3.1. Minimax Estimation in the Binomial Case. Suppose S has a B(n, θ) distribution and X̄ = S/n, as in Example 3.2.3. Let l(θ, a) = (θ − a)²/θ(1 − θ), 0 < θ < 1. For this loss function,

    R(θ, X̄) = E(X̄ − θ)² / θ(1 − θ) = [θ(1 − θ)/n] / θ(1 − θ) = 1/n,

and X̄ does have constant risk. Moreover, we have seen in Example 3.2.3 that X̄ is Bayes when θ is U(0, 1). By Theorem 3.3.2 we conclude that X̄ is minimax and, by Proposition 3.3.2, the uniform distribution least favorable.

For the usual quadratic loss neither of these assertions holds. The minimax estimate is

    δ*(S) = (S + √n/2) / (n + √n) = [√n/(√n + 1)] X̄ + [1/(√n + 1)] (1/2).

This estimate does have constant risk and is Bayes against a β(√n/2, √n/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n → ∞ of the ratio of the risks of δ* and X̄ is > 1 for every θ ≠ ½. At θ = ½ the ratio tends to 1. Details are left to Problem 3.3.4. □
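The constant-risk property of δ* under quadratic loss, and the comparison with X̄, can be sketched numerically as follows (the value of n is an arbitrary illustration):

```python
import numpy as np

def risk_xbar(theta, n):
    # MSE of X-bar = S/n under quadratic loss
    return theta * (1 - theta) / n

def risk_minimax(theta, n):
    # MSE of delta*(S) = (S + sqrt(n)/2)/(n + sqrt(n)): variance + bias^2
    rn = np.sqrt(n)
    return (n * theta * (1 - theta) + n * (0.5 - theta) ** 2) / (n + rn) ** 2

n = 100
thetas = np.linspace(0.01, 0.99, 99)
r_star = risk_minimax(thetas, n)
# r_star is constant in theta, equal to 1/(4 (sqrt(n) + 1)^2), while the
# risk of X-bar is smaller than r_star away from theta = 1/2.
```

The algebra behind the constant: nθ(1 − θ) + n(1/2 − θ)² = n/4, so the risk of δ* is n/4 divided by (n + √n)², i.e., 1/(4(√n + 1)²) for every θ.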
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space, the signals received on Earth vary randomly whether the satellite is sending or not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by X_i. We assume that the X_i are independently and identically distributed as N(μ, σ²), where μ = v if the satellite functions and μ = 0 otherwise. The variance σ² of the "noise" is assumed known. Our problem is to decide whether "μ = 0" or "μ = v." We view this as a decision problem with 0 − 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk?

A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability π to 0 and 1 − π to v, use 0 − 1 loss, and set L(x, 0, v) = p(x | v)/p(x | 0), then the Bayes test decides μ = v if

    L(x, 0, v) = exp{ (v/σ²) Σᵢ xᵢ − nv²/2σ² } > π/(1 − π)

and decides μ = 0 if L(x, 0, v) < π/(1 − π).
This test is equivalent to deciding μ = v (Problem 3.3.1) if, and only if,

    T = (1/σ√n) Σᵢ₌₁ⁿ Xᵢ ≥ t,

where

    t = (σ/v√n) [ log(π/(1 − π)) + nv²/2σ² ].

If we call this test δ_π,

    R(0, δ_π) = 1 − Φ(t), R(v, δ_π) = Φ(t − v√n/σ).

To get a minimax test we must have R(0, δ_π) = R(v, δ_π), which is equivalent to

    −t = t − v√n/σ

or

    t = v√n/2σ.

Because this value of t corresponds to π = ½, the intuitive test, which decides μ = v if and only if T ≥ ½[E₀(T) + E_v(T)], is indeed minimax. □
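The equalization of the two error probabilities at t = v√n/2σ can be verified directly; the values of v, σ, and n below are illustrative:

```python
import math

def error_probs(v, sigma, n):
    """Error probabilities of the test deciding mu = v iff
    T = sum(X_i)/(sigma sqrt(n)) >= t, at the minimax threshold
    t = v sqrt(n)/(2 sigma)."""
    t = v * math.sqrt(n) / (2 * sigma)
    Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))   # N(0,1) cdf
    type_I = 1 - Phi(t)                          # R(0, delta) = P_0(T >= t)
    type_II = Phi(t - v * math.sqrt(n) / sigma)  # R(v, delta) = P_v(T < t)
    return type_I, type_II

a, b = error_probs(v=0.5, sigma=1.0, n=25)
# a == b: at t = v sqrt(n)/(2 sigma) the two risks are equalized,
# which is exactly the constant-risk condition behind minimaxity.
```

With v = 0.5, σ = 1, n = 25, the threshold is t = 1.25 and both error probabilities equal 1 − Φ(1.25), by the symmetry −t = t − v√n/σ.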
If Θ is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem 3.3.2.

Theorem 3.3.3. Let δ* be a rule such that sup_θ R(θ, δ*) = r < ∞, let {π_k} denote a sequence of prior distributions such that π_k{θ : R(θ, δ*) = r} = 1, and let r_k = inf_δ r(π_k, δ), where r(π_k, δ) denotes the Bayes risk wrt π_k. If

    r_k → r as k → ∞,    (3.3.15)

then δ* is minimax.

Proof. Because r(π_k, δ*) = r,

    sup_θ R(θ, δ*) = r_k + o(1),

where o(1) → 0 as k → ∞. But by (3.3.13), for any competitor δ,

    sup_θ R(θ, δ) ≥ E_{π_k}(R(θ, δ)) ≥ r_k = sup_θ R(θ, δ*) − o(1).    (3.3.16)

If we let k → ∞, the left-hand side of (3.3.16) is unchanged, whereas the right tends to sup_θ R(θ, δ*). □
Example 3.3.3. Normal Mean. We now show that X̄ is minimax in Example 3.2.1. Identify π_k with the N(η₀, τ²) prior where k = τ². Then r(π_k, X̄) = σ²/n, whereas the Bayes risk of the Bayes rule of Example 3.2.1 is

    inf_δ r_k(δ) = [ τ² / ((σ²/n) + τ²) ] (σ²/n).

Because

    (σ²/n) − inf_δ r_k(δ) = (σ²/n)² / ((σ²/n) + τ²) → 0 as τ² → ∞,

we can conclude that X̄ is minimax. □
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose X₁, ..., X_n are i.i.d. F ∈ F = {F : Var_F(X₁) ≤ M}. Then X̄ is minimax for estimating θ(F) = E_F(X₁) with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let π_k be a prior distribution on F constructed as follows:(1)

(i) π_k{F : Var_F(X₁) ≠ M} = 0;

(ii) π_k{F : F ≠ N(μ, M) for some μ} = 0;

(iii) F is chosen by first choosing μ = θ(F) from a N(0, k) distribution and then taking F = N(θ(F), M).

Evidently, the Bayes risk is now the same as in Example 3.3.3 with σ² = M. Because, evidently,

    max_F R(F, X̄) = max_F Var_F(X₁)/n = M/n,

Theorem 3.3.3 applies and the result follows. □

Minimax procedures and symmetry
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" θ. There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and Casella (1998), for instance. We shall discuss this approach somewhat, by example, in Chapter 4 and Volume II, but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading.

Summary. We introduce the minimax principle in the context of the theory of games. Using this framework we connect minimaxity and Bayes methods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples. In particular, we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule δ and N selects a prior π. Von Neumann's theorem states that if Θ and D are both finite, then the game of S versus N has a value v: there is a least favorable prior π* and a minimax rule δ* such that δ* is the Bayes rule for π* and π* maximizes the Bayes risk of δ* over all priors; moreover, v equals the Bayes risk of the Bayes rule δ* for the prior π*. The lower (upper) value of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk; when these agree, the game is said to have a value v. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called least favorable. We show that Bayes rules with constant risk, or more generally with constant risk over the support of some prior, are minimax. This result is extended to rules that are limits of Bayes rules with constant risk, and we use it to show that X̄ is a minimax rule for squared error loss in the N(θ, σ₀²) model.

3.4 UNBIASED ESTIMATION AND RISK INEQUALITIES

3.4.1 Unbiased Estimation, Survey Sampling

In the previous two sections we have considered two decision theoretic optimality principles, Bayes and minimaxity, for which it is possible to characterize and, in many cases, compute procedures (in particular estimates) that are best in the class of all procedures, D, according to these criteria. An alternative approach is to specify a proper subclass of procedures, D₀ ⊂ D, on other grounds, for example, computational ease, symmetry, and so on, and then see if within D₀ we can find δ* ∈ D₀ that is best according to the "gold standard," R(θ, δ*) ≤ R(θ, δ) for all θ, all δ ∈ D₀. Obviously, we can also take this point of view with humbler aims, for example, looking for the procedure δ ∈ D₀ that minimizes the Bayes risk with respect to a prior π among all δ ∈ D₀. This approach has early on been applied to parametric families D₀. When D₀ is the class of linear procedures and l is quadratic loss, the solution is given in Section 3.2. More specifically, if Y is postulated as following a linear regression model with E(Y) = zᵀβ as in Section 2.2.1, then in estimating a linear function of the βⱼ it is natural to consider the computationally simple class of linear estimates, S(Y) = Σᵢ dᵢYᵢ. This approach coupled with the principle of unbiasedness we now introduce leads to the famous Gauss-Markov theorem proved in Section 6.6.

We introduced, in Section 1.3, the notion of bias of an estimate δ(X) of a parameter q(θ) in a model P = {P_θ : θ ∈ Θ} as

    Bias_θ(δ) = E_θ δ(X) − q(θ).    (3.4.1)

An estimate such that Bias_θ(δ) ≡ 0 is called unbiased. This notion has intuitive appeal, ruling out, for instance, estimates that ignore the data, such as δ(X) = q(θ₀), which can't be beat for θ = θ₀ but can obviously be arbitrarily terrible. The most famous unbiased estimates are the familiar estimates of μ and σ² when X₁, ..., X_n are i.i.d. N(μ, σ²),
As we shall see shortly for X̄, and in Volume 2 for s², these are both UMVU. Because for unbiased estimates mean square error and variance coincide, we call an unbiased estimate δ*(X) of q(θ) that has minimum MSE among all unbiased estimates, for all θ, UMVU (uniformly minimum variance unbiased).

Unbiased estimates play a particularly important role in survey sampling.

Example 3.4.1. Unbiased Estimates in Survey Sampling. Suppose we wish to sample from a finite population, for instance, a census unit, to determine the average value of a variable, say monthly family income, during a time between two censuses, and suppose that we have available a list of families in the unit with family incomes at the last census. We ignore difficulties such as families moving. We write x₁, ..., x_N for the unknown current family incomes and correspondingly u₁, ..., u_N for the known last census incomes. We want to estimate the parameter x̄ = (1/N) Σⱼ₌₁ᴺ xⱼ. We let X₁, ..., Xₙ denote the incomes of a sample of n families drawn at random without replacement. This leads to the model with x = (x₁, ..., x_N) as parameter,

    P(X₁ = a₁, ..., Xₙ = aₙ) = 1/[N(N−1)···(N−n+1)]  if {a₁, ..., aₙ} ⊂ {x₁, ..., x_N}     (3.4.3)
                              = 0 otherwise.

It is easy to see that the natural estimate X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ is unbiased (Problem 3.4.14) and has

    Var(X̄) = (σ²/n)(N − n)/(N − 1),                       (3.4.4)

where

    σ² = (1/N) Σᵢ₌₁ᴺ (xᵢ − x̄)².                           (3.4.5)

This method of sampling does not use the information contained in u₁, ..., u_N. One way to do so, reflecting the probable correlation between (u₁, ..., u_N) and (x₁, ..., x_N), is to estimate by a regression estimate

    X̄_R = X̄ − b(Ū − ū),                                  (3.4.6)

where b is a prespecified positive constant, Uᵢ is the last census income corresponding to Xᵢ, Ū = (1/n) Σᵢ₌₁ⁿ Uᵢ, and ū = (1/N) Σⱼ₌₁ᴺ uⱼ. Clearly, for each b, X̄_R is also unbiased.
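The unbiasedness of X̄ and the variance formula (3.4.4) under sampling without replacement are easy to confirm numerically. The following sketch (the five population incomes are hypothetical, not from the text) draws repeated simple random samples and compares the Monte Carlo mean and variance of X̄ with the formulas:

```python
import random

rng = random.Random(3)
x = [10.0, 20.0, 30.0, 40.0, 100.0]   # hypothetical population incomes
N, n = len(x), 2
xbar = sum(x) / N                      # population mean: 40
sigma2 = sum((xi - xbar) ** 2 for xi in x) / N

def sample_mean():
    # simple random sampling without replacement, as in the model (3.4.3)
    s = rng.sample(x, n)
    return sum(s) / n

reps = 200_000
draws = [sample_mean() for _ in range(reps)]
m = sum(draws) / reps
v = sum((d - m) ** 2 for d in draws) / reps
formula = (sigma2 / n) * (N - n) / (N - 1)
print(m, xbar)       # Monte Carlo mean of X-bar is close to the population mean
print(v, formula)    # Monte Carlo variance is close to (sigma^2/n)(N-n)/(N-1)
```

The factor (N − n)/(N − 1) is the finite-population correction; setting n close to N drives the variance toward zero, as the simulation will also show.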
If the correlation of Uᵢ and Xᵢ is positive and b < 2 Cov(U, X)/Var(U), this will be a better estimate than X̄, and the best choice of b is b_opt = Cov(U, X)/Var(U) (see the problems). The value of b_opt is unknown but can be estimated by its sample analogue; the resulting estimate is no longer unbiased but behaves well for large samples (see the problems of Chapter 5). Unbiasedness is also used in stratified sampling theory (see the problems).

An alternative approach to using the uⱼ is to not sample all units with the same probability. Specifically, let 0 < π₁, ..., π_N < 1 with Σⱼ₌₁ᴺ πⱼ = n. For each unit j, 1 ≤ j ≤ N, toss a coin with probability πⱼ of landing heads and select xⱼ if the coin lands heads. The result is a sample S = {X₁, ..., X_M} of random size M such that E(M) = n (Problem 3.4.15). A natural choice of πⱼ is proportional to uⱼ. This makes it more likely for big incomes to be included and is intuitively desirable, given the probable correlation between last census and current incomes. If the πⱼ are not all equal, X̄ is not unbiased, but the following estimate, known as the Horvitz-Thompson estimate, is:

    X̄_HT = (1/N) Σᵢ₌₁ᴹ Xᵢ/π_{Jᵢ},                          (3.4.7)

where Jᵢ is defined by Xᵢ = x_{Jᵢ}. To see this, write

    X̄_HT = (1/N) Σⱼ₌₁ᴺ (xⱼ/πⱼ) 1(xⱼ ∈ S).

Because πⱼ = P[xⱼ ∈ S] by construction, unbiasedness follows. It is possible to avoid the undesirable random sample size of these schemes and yet have specified πⱼ; the Horvitz-Thompson estimate then stays unbiased. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. □

Discussion. However, outside of sampling, the unbiasedness principle has largely fallen out of favor for a number of reasons:

(i) Typically unbiased estimates do not exist; see Bickel and Lehmann (1969) and the problems.

(ii) Bayes estimates are necessarily biased (see the problems), and minimax estimates often are.

(iii) Unbiased estimates do not obey the attractive equivariance property: if θ̂ is unbiased for θ, then q(θ̂) is biased for q(θ) unless q is linear. They also necessarily in general differ from maximum likelihood estimates, except in an important special case we develop later.
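The unbiasedness argument for the Horvitz-Thompson estimate (3.4.7) can be illustrated by simulation. In this sketch the incomes and inclusion probabilities are hypothetical (chosen so that large incomes are more likely to be sampled); repeated coin-tossing samples give an average of X̄_HT close to the population mean:

```python
import random

# Hypothetical finite population of 5 current incomes x_j, with inclusion
# probabilities pi_j chosen larger for units expected to have big incomes.
x = [10.0, 20.0, 30.0, 40.0, 100.0]
pi = [0.2, 0.3, 0.4, 0.5, 0.8]
N = len(x)
xbar = sum(x) / N                 # target parameter: 40

rng = random.Random(0)

def horvitz_thompson():
    """One draw: include unit j with probability pi_j (independent coin tosses),
    then form (1/N) * sum of x_j / pi_j over the included units."""
    return sum(xj / pj for xj, pj in zip(x, pi) if rng.random() < pj) / N

reps = 200_000
est_mean = sum(horvitz_thompson() for _ in range(reps)) / reps
print(xbar, est_mean)   # the Monte Carlo average is close to xbar
```

Note that individual draws of X̄_HT can be far from x̄ (the sample size M is random, and small πⱼ inflate included terms); only the average over draws is pinned down by unbiasedness.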
Nevertheless, unbiased estimates are still in favor when it comes to estimating residual variances. For instance, in the linear regression model Y = Z_D β + ε of Section 2.2, the variance σ² = Var(εᵢ) is estimated by the unbiased estimate s² = ε̂ᵀε̂/(n − p), where ε̂ = Y − Z_D β̂, β̂ is the least squares estimate, and p is the number of coefficients in β. This preference of s² over the MLE σ̂² = ε̂ᵀε̂/n is in accord with optimal behavior when both the number of observations and the number of parameters are large (see the problems). Moreover, as we shall see in Chapters 5 and 6, in smoothly parametrized models good estimates in large samples are approximately unbiased: we expect that

    |Bias_θ(θ̂ₙ)| / [Var_θ(θ̂ₙ)]^{1/2} → 0,  or equivalently  Var_θ(θ̂ₙ)/MSE_θ(θ̂ₙ) → 1 as n → ∞.

In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. The arguments will be based on asymptotic versions of the important inequalities in the next subsection.

3.4.2 The Information Inequality

The one-parameter case

We will develop a lower bound for the variance of a statistic, which can be used to show that an estimate is UMVU. The lower bound is interesting in its own right, has some decision theoretic applications, and appears in the asymptotic optimality theory of Section 5.4. We suppose throughout that we have a regular parametric model and, further, that Θ is an open subset of the line. From this point on we will suppose p(x, θ) is a density. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous-case theorems given later. We make two regularity assumptions on the family {P_θ : θ ∈ Θ}.

(I) The set A = {x : p(x, θ) > 0} does not depend on θ. For all x ∈ A, θ ∈ Θ, (∂/∂θ) log p(x, θ) exists and is finite.

(II) If T is any statistic such that E_θ(|T|) < ∞ for all θ ∈ Θ, then the operations of integration and differentiation by θ can be interchanged in ∫ T(x)p(x, θ) dx. That is,

    (∂/∂θ)[∫ T(x)p(x, θ) dx] = ∫ T(x)(∂/∂θ)p(x, θ) dx      (3.4.8)

whenever the right-hand side of (3.4.8) is finite.

Note that in particular (3.4.8) is assumed to hold if T(x) = 1 for all x, so that we can interchange differentiation and integration in ∫ p(x, θ) dx.

Assumption II is practically useless as written. What is needed are simple sufficient conditions on p(x, θ) for II to hold. Some classical conditions may be found in Apostol (1974), p. 167. Simpler assumptions can be formulated using Lebesgue integration theory, for integration over R^q. For instance, suppose I holds. Then II holds provided that, for all T such that E_θ(|T|) < ∞ for all θ, the integrals

    ∫ T(x)(∂/∂θ)p(x, θ) dx  and  ∫ |T(x)| |(∂/∂θ)p(x, θ)| dx

are continuous functions of θ.
If I holds, it is possible to define an important characteristic of the family {P_θ}, the Fisher information number, which is denoted by I(θ) and given by

    I(θ) = E_θ[(∂/∂θ) log p(X, θ)]² = ∫ [(∂/∂θ) log p(x, θ)]² p(x, θ) dx.      (3.4.9)

Note that 0 ≤ I(θ) ≤ ∞.

Lemma 3.4.1. Suppose that I and II hold and that E_θ|(∂/∂θ) log p(X, θ)| < ∞. Then

    E_θ[(∂/∂θ) log p(X, θ)] = 0.      (3.4.10)

Proof.

    E_θ[(∂/∂θ) log p(X, θ)] = ∫ {[(∂/∂θ)p(x, θ)]/p(x, θ)} p(x, θ) dx = ∫ (∂/∂θ)p(x, θ) dx = (∂/∂θ) ∫ p(x, θ) dx = 0.  □

Proposition 3.4.1. If I and II hold, then

    I(θ) = Var[(∂/∂θ) log p(X, θ)].      (3.4.11)

It is not hard to check (using Laplace transform theory) that a one-parameter exponential family quite generally satisfies Assumptions I and II: if

    p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)}

is an exponential family and η(θ) has a nonvanishing continuous derivative on Θ, then I and II hold. For instance, suppose X₁, ..., Xₙ is a sample from a N(θ, σ²) population, where σ² is known. Then (see Table 1.6.1) η(θ) = θ/σ², and I and II are satisfied. Similarly, I and II are satisfied for samples from gamma and beta distributions with one parameter fixed.

Example 3.4.2. Suppose X₁, ..., Xₙ is a sample from a Poisson P(θ) population. Then

    (∂/∂θ) log p(X, θ) = (1/θ) Σᵢ₌₁ⁿ Xᵢ − n

and

    I(θ) = Var((1/θ) Σᵢ₌₁ⁿ Xᵢ) = nθ/θ² = n/θ.  □
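The Poisson calculation can be double-checked by Monte Carlo (a sketch, not part of the text): simulate the score (1/θ)ΣXᵢ − n for repeated samples and verify that its mean is near 0, as Lemma 3.4.1 requires, and its variance is near I(θ) = n/θ:

```python
import math
import random

rng = random.Random(1)

def rpois(lam):
    # Knuth's multiplication method for Poisson draws; adequate for small lam
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

theta, n = 3.0, 10

def score():
    # (d/dtheta) log p(X, theta) for an i.i.d. Poisson(theta) sample of size n
    s = sum(rpois(theta) for _ in range(n))
    return s / theta - n

reps = 100_000
draws = [score() for _ in range(reps)]
mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
print(mean, var, n / theta)   # mean near 0; variance near I(theta) = n/theta
```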
Here is the main result of this section.

Theorem 3.4.1. (Information Inequality). Let T(X) be any statistic such that Var_θ(T(X)) < ∞ for all θ. Denote E_θ(T(X)) by ψ(θ). Suppose that I and II hold and 0 < I(θ) < ∞. Then, for all θ, ψ(θ) is differentiable and

    Var_θ(T(X)) ≥ [ψ′(θ)]²/I(θ).      (3.4.12)

Proof. By (3.4.8),

    ψ′(θ) = ∫ T(x)(∂/∂θ)p(x, θ) dx = ∫ T(x)[(∂/∂θ) log p(x, θ)] p(x, θ) dx.      (3.4.13)

Using I and II we obtain, by Lemma 3.4.1,

    ψ′(θ) = Cov((∂/∂θ) log p(X, θ), T(X)).      (3.4.14)

Now let us apply the correlation (Cauchy-Schwarz) inequality to the random variables (∂/∂θ) log p(X, θ) and T(X). We get

    |ψ′(θ)| ≤ [Var_θ(T(X))]^{1/2} [Var((∂/∂θ) log p(X, θ))]^{1/2}.      (3.4.15)

The theorem follows because, by Proposition 3.4.1, Var((∂/∂θ) log p(X, θ)) = I(θ).  □

If we consider the class of unbiased estimates of q(θ) = θ, we obtain a universal lower bound given by the following.

Corollary 3.4.1. Suppose the conditions of Theorem 3.4.1 hold and T is an unbiased estimate of θ. Then

    Var_θ(T(X)) ≥ 1/I(θ).      (3.4.16)

The number [ψ′(θ)]²/I(θ) is often referred to as the information or Cramér-Rao lower bound for the variance of an unbiased estimate of ψ(θ).

If X = (X₁, ..., Xₙ) is a sample from a population with density f(x, θ), θ ∈ Θ, and the conditions of Theorem 3.4.1 hold, then

    I(θ) = nI₁(θ),      (3.4.17)

where

    I₁(θ) = E[(∂/∂θ) log f(X₁, θ)]².
I₁(θ) is often referred to as the information contained in one observation; we have just shown that the information in a sample of size n is nI₁(θ).

Example 3.4.3. Suppose X₁, ..., Xₙ is a sample from a normal distribution with unknown mean θ and known variance σ². Then the conditions of the information inequality are satisfied and

    I(θ) = nI₁(θ) = n/σ².

Because X̄ is unbiased and Var(X̄) = σ²/n, by Corollary 3.4.1 we see that X̄ is UMVU. Note that because X̄ is UMVU whatever may be σ², we have in fact proved that X̄ is UMVU even if σ² is unknown.  □

We can similarly show (Problem 3.4.1) that if X₁, ..., Xₙ are the indicators of n Bernoulli trials with probability of success θ, then X̄ is a UMVU estimate of θ. This is no accident: these are situations in which X follows a one-parameter exponential family. Next we note how we can apply the information inequality to the problem of unbiased estimation.

Theorem 3.4.2. Suppose that the family {P_θ : θ ∈ Θ} satisfies assumptions I and II and there exists an unbiased estimate T* of ψ(θ) such that Var_θ[T*(X)] = [ψ′(θ)]²/I(θ) for all θ ∈ Θ. Then T* is UMVU as an estimate of ψ(θ), and {P_θ} is a one-parameter exponential family with density or frequency function of the form

    p(x, θ) = h(x) exp[η(θ)T*(x) − B(θ)].      (3.4.18)

Conversely,
if {P_θ} is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic T(X) and η(θ) has a continuous nonvanishing derivative on Θ, then T(X) achieves the information inequality bound and is a UMVU estimate of E_θ(T(X)).      (3.4.19)

Proof. We start with the first assertion. By (3.4.14) and the conditions for equality in the correlation inequality, we know that T* achieves the lower bound for all θ if, and only if, there exist functions a₁(θ) and a₂(θ) such that

    (∂/∂θ) log p(X, θ) = a₁(θ)T*(X) + a₂(θ)      (3.4.20)

with P_θ probability 1 for each θ. From this equality of random variables we shall show that P_θ[X ∈ A**] = 1 for all θ, where

    A** = {x : (∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ) for all θ ∈ Θ}.      (3.4.21)

If A_θ denotes the set of x for which (3.4.20) holds, then (3.4.20) guarantees P_θ(A_θ) = 1, and assumption I guarantees P_θ′(A_θ) = 1 for all θ′. Let θ₁, θ₂, ... be a denumerable dense subset of Θ, and set A* = ∩ⱼ A_θⱼ, so that P_θ(A*) = 1 for all θ. Now take x₁, x₂ ∈ A* with, without loss of generality, T*(x₁) ≠ T*(x₂). By solving for a₁, a₂ in

    (∂/∂θ) log p(xⱼ, θ) = a₁(θ)T*(xⱼ) + a₂(θ),  j = 1, 2,      (3.4.22)

we see that a₁ and a₂ are linear combinations of (∂/∂θ) log p(xⱼ, θ), j = 1, 2, and, hence, continuous in θ. Because both sides of (3.4.20) are continuous in θ and agree for the dense set θ₁, θ₂, ..., equality must hold for all θ; thus A** = A*, and the claim follows. Upon integrating both sides of (3.4.20) with respect to θ, we obtain log p(x, θ) as an expression of the form η(θ)T*(x) − B(θ) + c(x), which is the exponential family form (3.4.18) with h(x) = exp{c(x)}. The passage from (3.4.20) to (3.4.18) is highly technical; our argument is essentially that of Wijsman (1973).

Conversely, in the exponential family case (1.6.1), we assume without loss of generality that we have the canonical case with η(θ) = θ and

    B(θ) = A(θ) = log ∫ h(x) exp{θT(x)} dx.      (3.4.23)

Then

    (∂/∂θ) log p(x, θ) = T(x) − A′(θ),      (3.4.24)

so that (3.4.20) holds with a₁(θ) = 1 and a₂(θ) = −A′(θ). By (3.4.11),

    I(θ) = Var_θ(T(X) − A′(θ)) = Var_θ(T(X)) = A″(θ),      (3.4.25)

and ψ(θ) = E_θT(X) = A′(θ), so ψ′(θ) = A″(θ). Hence, the information bound is [A″(θ)]²/A″(θ) = A″(θ) = Var_θ(T(X)), so that T(X) achieves the information bound as an estimate of E_θT(X).  □

Example 3.4.4. In the Hardy-Weinberg model of Examples 2.1.4 and 2.2.6,

    p(x, θ) = 2^{n₂} exp{(2n₁ + n₂) log θ + (2n₃ + n₂) log(1 − θ)}
            = 2^{n₂} exp{(2n₁ + n₂)[log θ − log(1 − θ)] + 2n log(1 − θ)},

using (2n₁ + n₂) + (2n₃ + n₂) = 2n. This is a one-parameter exponential family with natural sufficient statistic 2N₁ + N₂ and η(θ) = log[θ/(1 − θ)], which has a continuous nonvanishing derivative on (0, 1). Theorem 3.4.2 therefore implies that T = (2N₁ + N₂)/2n is UMVU for estimating

    E(T) = (2n)⁻¹[2nθ² + 2nθ(1 − θ)] = θ.

This T coincides with the MLE θ̂ of Example 2.2.6.  □
Note that by differentiating (3.4.24) we obtain (∂²/∂θ²) log p(x, θ) = −A″(θ), so that, by (3.4.25), I(θ) = −E_θ[(∂²/∂θ²) log p(X, θ)] in the canonical exponential family case. It turns out that this identity also holds outside exponential families.

Proposition 3.4.2. Suppose p(·, θ) satisfies, in addition to I and II, that p(·, θ) is twice differentiable and interchange between integration and differentiation is permitted. Then

    I(θ) = −E_θ[(∂²/∂θ²) log p(X, θ)].      (3.4.26)

Proof. We need only check that

    (∂²/∂θ²) log p(x, θ) = [(∂²/∂θ²)p(x, θ)]/p(x, θ) − [(∂/∂θ) log p(x, θ)]²      (3.4.27)

and integrate both sides with respect to p(x, θ): the first term on the right integrates to (∂²/∂θ²) ∫ p(x, θ) dx = 0, while the second has expectation −I(θ).  □

Example 3.4.2. (Continued). For a sample from a P(θ) distribution,

    E_θ[−(∂²/∂θ²) log p(X, θ)] = E_θ[(1/θ²) Σᵢ₌₁ⁿ Xᵢ] = n/θ,

which equals I(θ).  □

Example 3.4.4. (Continued). The variance of θ̂ = (2N₁ + N₂)/2n can be computed directly using the moments of the multinomial distribution of (N₁, N₂, N₃), or by transforming p(x, θ) to canonical form by setting η = log[θ/(1 − θ)] and then using the theory of canonical exponential families. A third method would be to use Var(θ̂) = 1/I(θ) and formula (3.4.26). We find (see the problems)

    Var(θ̂) = θ(1 − θ)/2n.  □
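The identity (3.4.26) can be checked in closed form for a single Bernoulli(θ) trial, where both sides equal 1/[θ(1 − θ)]. This sketch computes both expectations exactly by summing over the two outcomes x ∈ {0, 1}:

```python
# log p(x, theta) = x*log(theta) + (1-x)*log(1-theta) for one Bernoulli trial.

def info_from_score(theta):
    # E[(d/dtheta log p)^2], summed exactly over x in {0, 1}
    return sum(p * (x / theta - (1 - x) / (1 - theta)) ** 2
               for x, p in [(0, 1 - theta), (1, theta)])

def info_from_curvature(theta):
    # -E[(d^2/dtheta^2) log p], summed exactly over x in {0, 1}
    return sum(p * (x / theta ** 2 + (1 - x) / (1 - theta) ** 2)
               for x, p in [(0, 1 - theta), (1, theta)])

for theta in [0.1, 0.3, 0.5, 0.9]:
    print(theta, info_from_score(theta), info_from_curvature(theta))
```

Algebraically, E[score²] = θ·(1/θ²) + (1 − θ)·(1/(1 − θ)²) = 1/θ + 1/(1 − θ) = 1/[θ(1 − θ)], and the curvature computation reduces to the same expression, as the printed values confirm.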
Discussion. It often happens that, although assumptions I and II are satisfied and UMVU estimates of ψ(θ) exist, the variance of the best estimate is not equal to the bound [ψ′(θ)]²/I(θ). Even worse, in many situations I and II fail to hold, for instance, in the U(0, θ) example. Sharpenings of the information inequality are available but don't help in general. See Volume II.

Extensions to models in which θ is multidimensional are considered next.

The multiparameter case
We will extend the information lower bound to the case of several parameters, θ = (θ₁, ..., θ_d). In particular, we will find a lower bound on the variance of an estimator of θ₁ when the parameters θ₂, ..., θ_d are unknown. Let p(x, θ) denote the density or frequency function of X, where X ∈ X ⊂ R^q. We assume that Θ is an open subset of R^d and that {p(x, θ) : θ ∈ Θ} is a regular parametric model with conditions I and II satisfied when differentiation is with respect to θⱼ, 1 ≤ j ≤ d. The (Fisher) information matrix is defined as

    I(θ) = (I_jk(θ))_{d×d},      (3.4.28)

where

    I_jk(θ) = E_θ[(∂/∂θⱼ) log p(X, θ) · (∂/∂θ_k) log p(X, θ)],  1 ≤ j, k ≤ d.      (3.4.29)

Proposition 3.4.3. Under the conditions in the opening paragraph:

(a)

    I_jk(θ) = Cov_θ((∂/∂θⱼ) log p(X, θ), (∂/∂θ_k) log p(X, θ)).      (3.4.30)

(b) If X₁, ..., Xₙ are i.i.d. as X, then (X₁, ..., Xₙ)ᵀ has information matrix nI₁(θ), where I₁ is the information matrix of X.

(c) If, in addition, p(·, θ) is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

    I_jk(θ) = −E_θ[(∂²/∂θⱼ∂θ_k) log p(X, θ)].      (3.4.31)

The arguments follow the d = 1 case and are left to the problems.

Example 3.4.5. Suppose X ~ N(μ, σ²), θ = (μ, σ²). Then

    log p(x, θ) = −(1/2) log(2π) − (1/2) log σ² − (x − μ)²/2σ².

Thus, using Proposition 3.4.3(c),

    I₁₁(θ) = −E[(∂²/∂μ²) log p(x, θ)] = σ⁻²,
    I₁₂(θ) = −E[(∂²/∂σ²∂μ) log p(x, θ)] = E[σ⁻⁴(x − μ)] = 0,
    I₂₂(θ) = −E[(∂²/∂(σ²)²) log p(x, θ)] = σ⁻⁴/2.
Thus, in this case,

    I(θ) = diag(σ⁻², σ⁻⁴/2).      (3.4.33)  □

Example 3.4.6. Canonical k-Parameter Exponential Family. Suppose

    p(x, θ) = exp{Σⱼ₌₁ᵏ Tⱼ(x)θⱼ − A(θ)} h(x),      (3.4.34)

θ ∈ Θ open. Conditions I and II are easily checked, and because ∇_θ log p(x, θ) = T(x) − Ȧ(θ), Proposition 3.4.3 gives

    I(θ) = Var_θ T(X) = Ä(θ).      (3.4.35)  □

Next suppose θ̂₁ = T is an estimate of θ₁, with θ₂, ..., θ_d assumed unknown. Let ψ(θ) = E_θT(X), and let ψ̇(θ) be the d × 1 vector of partial derivatives of ψ. Then:

Theorem 3.4.3. Assume that the conditions of the opening paragraph hold and suppose that the matrix I(θ) is nonsingular. Then, for all θ, ψ̇(θ) exists and

    Var_θ(T(X)) ≥ ψ̇ᵀ(θ) I⁻¹(θ) ψ̇(θ).      (3.4.36)

Proof. We will use the prediction inequality Var(Y) ≥ Var(μ_L(Z)), where μ_L(Z) denotes the optimal MSPE linear predictor of Y; that is,

    μ_L(Z) = μ_Y + (Z − μ_Z)ᵀ Σ_ZZ⁻¹ Σ_ZY.      (3.4.37)

Now set

    Y = T(X),  Z = ∇_θ log p(X, θ).

Then

    Var_θ(T(X)) ≥ Σ_ZYᵀ I⁻¹(θ) Σ_ZY,      (3.4.38)

where Σ_ZY = E_θ(T(X) ∇_θ log p(X, θ)) = ∇_θ E_θ(T(X)) = ψ̇(θ); the last equality follows from the argument in (3.4.13).  □

Here are some consequences of this result.

Example 3.4.6. (Continued). UMVU Estimates in Canonical Exponential Families. We claim that each Tⱼ(X) is a UMVU estimate of E_θTⱼ(X). This is a different claim than that Tⱼ(X) is UMVU for E_θTⱼ(X) when the θᵢ, i ≠ j, are known. To see our claim, take j = 1 without loss of generality and note that in our case

    ψ(θ) = E_θT₁(X) = (∂/∂θ₁)A(θ),      (3.4.39)

so that ψ̇(θ) is the first row of Ä(θ) = I(θ) and, hence,

    ψ̇ᵀ(θ) I⁻¹(θ) = (1, 0, ..., 0).      (3.4.40)

Thus the lower bound is

    ψ̇ᵀ(θ) I⁻¹(θ) ψ̇(θ) = (∂²/∂θ₁²)A(θ) = Var_θ(T₁(X)),      (3.4.41)

and T₁(X) attains it.  □

Example 3.4.7. Multinomial Trials. In the multinomial example of Section 1.6, we transformed the multinomial model M(n, λ₁, ..., λ_k) to the canonical form

    p(x, θ) = exp{Tᵀ(x)θ − A(θ)},

where Tᵀ(x) = (T₁(x), ..., T_{k−1}(x)),

    Tⱼ(X) = Σᵢ₌₁ⁿ 1[Xᵢ = j],  θⱼ = log(λⱼ/λ_k),  j = 1, ..., k − 1,

X = (X₁, ..., Xₙ)ᵀ with X₁, ..., Xₙ i.i.d. as X, λⱼ = P(X = j), and

    A(θ) = n log(1 + Σₗ₌₁^{k−1} e^{θₗ}).

Note that

    (∂/∂θⱼ)A(θ) = n e^{θⱼ}/(1 + Σₗ₌₁^{k−1} e^{θₗ}) = nλⱼ

and

    (∂²/∂θⱼ²)A(θ) = nλⱼ(1 − λⱼ).

Thus, the lower bound on the variance of an unbiased estimator of ψⱼ(θ) = E_θ(n⁻¹Tⱼ(X)) = λⱼ is λⱼ(1 − λⱼ)/n. But Nⱼ/n is unbiased and has variance λⱼ(1 − λⱼ)/n; hence, Nⱼ/n is UMVU for λⱼ.  □
Example 3.4.8. The Normal Case. If X₁, ..., Xₙ are i.i.d. N(μ, σ²), then X̄ is UMVU for μ, and (1/n) Σ Xᵢ² is UMVU for μ² + σ². But it does not follow that n⁻¹ Σ(Xᵢ − X̄)² is UMVU for σ². These and other examples and the implications of Theorem 3.4.4 are explored in the problems.  □

Here is an important extension of Theorem 3.4.3, whose proof is left to Problem 3.4.21.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and T is a d-dimensional statistic. Let ψ(θ) = E_θ(T(X))_{d×1} and ψ̇(θ) = (ψ̇₁(θ), ..., ψ̇_d(θ))ᵀ. Then

    Var_θ(T(X)) ≥ ψ̇(θ) I⁻¹(θ) ψ̇ᵀ(θ),      (3.4.42)

where A ≥ B means aᵀ(A − B)a ≥ 0 for all a_{d×1}. Note that both sides of (3.4.42) are d × d matrices. Also note that

    θ̂ unbiased ⟹ Var_θ(θ̂) ≥ I⁻¹(θ).

In Chapters 5 and 6 we show that, in smoothly parametrized models, reasonable estimates are asymptotically unbiased. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates.

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that, in a canonical exponential family, T(X) is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. We establish analogues of the information inequality and use them to show that, under suitable conditions, the MLE is asymptotically optimal.

3.5 NONDECISION THEORETIC CRITERIA

In practice, features other than the risk function are also of importance in the selection of a procedure, even if the loss function and model are well specified. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist, Björck,
and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.

Closed form versus iteratively computed estimates

At one level, closed form is clearly preferable. For instance, consider the Gaussian linear model of Section 2.2: least squares estimates are given in closed form by the solution of the normal equations. The closed form here is deceptive, however, because inversion of a d × d matrix takes on the order of d³ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve the normal equations by, say, Gaussian elimination for the particular ZᵀY.

Faster versus slower algorithms

Consider estimation of the MLE θ̂ in a general canonical exponential family as in Section 2.3. In the algorithm we discussed in Section 2.4, if we seek to take enough steps J so that |θ̂^{(J)} − θ̂| ≤ ε, with ε < 1, then J is of the order of log(1/ε) (see the problems). By contrast, it may be shown that the Newton-Raphson method, in which the jth iterate is

    θ̂^{(j)} = θ̂^{(j−1)} + Ä⁻¹(θ̂^{(j−1)})(T(X) − Ȧ(θ̂^{(j−1)})),

takes on the order of log log(1/ε) steps, at least if started close enough to θ̂ (see the problems). The improvement in speed may, however, be spurious, since Ä⁻¹ is costly to compute if d is large, though the same trick as in computing least squares estimates can be used.

The interplay between estimated variance and computation

As we have seen in special cases in Section 3.4, estimates of parameters based on samples of size n have standard deviations of order n^{−1/2}. It follows that striving for numerical accuracy of order smaller than n^{−1/2} is wasteful. Unfortunately, it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved. Of course, with ever faster computers a difference at this level may seem irrelevant, but it reappears when the data sets are big and the number of parameters large.

3.5.2 Interpretability

Suppose that in the normal N(μ, σ²) model we are interested in the parameter μ/σ, the signal-to-noise ratio. This parameter has a clear interpretation for the population of measurements. Its maximum likelihood estimate X̄/σ̂ continues to have the same intuitive interpretation as an estimate of μ/σ even if the data are a sample from a distribution with
mean μ and variance σ² other than the normal. On the other hand, suppose we initially postulate a model in which the data are a sample from a gamma, Γ(p, λ), distribution. Then

    E(X)/[Var(X)]^{1/2} = (p/λ)/(p/λ²)^{1/2} = √p.

We can now use the MLE of √p, which, as we shall see later (Section 5.4), is for n large a more precise estimate than X̄/σ̂ if this model is correct. However, the form of this estimate is complex, and if the model is incorrect it no longer is an appropriate estimate of E(X)/[Var(X)]^{1/2}. We return to this in Chapter 5.

To be a bit formal, what "reasonable" means is connected to the choice of the parameter we are estimating (or testing hypotheses about). We consider three situations:

(a) The problem dictates the parameter. For instance, economists often work with median housing prices, or they may be interested in the total consumption of a commodity such as coffee, say θ = Nμ, where N is the population size and μ is the expected consumption of a randomly drawn individual.

(b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a "true" parametric model with an interpretable parameter θ, but we do not necessarily observe X*. The actual observation X is X* contaminated with "gross errors"; see the following discussion. For instance, the Hardy-Weinberg parameter θ has a clear biological interpretation and is the parameter for the experiment described in Example 2.1.4.

(c) We have a qualitative idea of what the parameter is, but there are several parameters that satisfy this qualitative notion. For instance, we may be interested in the center of a population, and both the mean μ and the median ν qualify; ν is any value such that P(X ≤ ν) ≥ 1/2 and P(X ≥ ν) ≥ 1/2.

This idea has been developed by Bickel and Lehmann (1975a, 1975b, 1976) and Doksum (1975), among others. We will consider situations (b) and (c).

3.5.3 Robustness

Finally, we turn to robustness. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately.

Gross error models

Most measurement and recording processes are subject to gross errors, anomalous values that arise because of human error (often in recording) or instrument malfunction. To be specific, suppose that if n measurements X* = (X₁*, ..., Xₙ*) could be taken without gross errors, then a distribution P* would be an adequate approximation to the distribution of X* (i.e., we could suppose X* ~ P*, θ = θ(P*), P* ∈ P*). However, we observe not X* but X = (X₁, ..., Xₙ), where most of the Xᵢ = Xᵢ*, but there are a few
wild values. Informally, θ̂(X₁, ..., Xₙ) will continue to be a good, or at least reasonable, estimate if its value is not greatly affected by the Xᵢ ≠ Xᵢ*, the gross errors. Again informally, we shall call such procedures robust. Formal definitions require model specification, specification of the gross error mechanism, and definitions of insensitivity to gross errors.

Consider the one-sample symmetric location model P defined by

    Xᵢ = μ + εᵢ,  i = 1, ..., n,      (3.5.1)

where the errors εᵢ are independent, identically distributed, and symmetric about 0 with common density f and d.f. F. If the error distribution is normal, X̄ is the best estimate in a variety of senses. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the εᵢ still i.i.d., but with common distribution function F and density f of the form

    f(x) = (1 − λ)(1/σ)φ(x/σ) + λh(x).      (3.5.2)

Here h is the density of the gross errors, and λ is the probability of making a gross error. This corresponds to, for example, Xᵢ = Xᵢ* with probability 1 − λ and Xᵢ = Yᵢ with probability λ, where the Yᵢ are i.i.d. with density h. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X*. Further assumptions that are commonly made are that h has a particular form, for example, h(x) = (1/Kσ)φ(x/Kσ) with K ≫ 1, or, more generally, that h is an unknown density symmetric about 0. That is, we consider the gross error model

    P = {P_{(μ,f)} : f satisfies (3.5.2) for some h such that h(x) = h(−x) for all x}.

The gross error model is then semiparametric. The advantage of this formulation is that μ remains identifiable: it is the center of symmetry of P_{(μ,f)} for all such P. However, here we encounter one of the basic difficulties in formulating robustness in situation (b): the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. Unfortunately, if we drop the symmetry assumption, μ is not a parameter; it is possible to have P_{(μ₁,f₁)} = P_{(μ₂,f₂)} for μ₁ ≠ μ₂ (see the problems), so it is unclear what we are estimating: is μ₁ or μ₂ our goal? On the other hand, in situation (c), we do not need the symmetry assumption. We return to these issues in Chapter 6. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6. However, two notions, the sensitivity curve and the breakdown point, make sense for fixed n. We next define and examine the sensitivity curve, first in the context of the Gaussian location model and then more generally. The breakdown point will be discussed in Volume II.
The sensitivity curve

How sensitive is an estimate to the presence of gross errors among X₁, ..., Xₙ? An interesting way of studying this, due to Tukey (1972) and Hampel (1974), is the sensitivity curve, defined as follows for plug-in estimates (which are well defined for all sample sizes n). Suppose that X ~ P and that θ = θ(P) is a parameter. The empirical plug-in estimate of θ is θ̂ = θ(P̂), where P̂ is the empirical probability distribution; see Section 2.1. The sensitivity curve of θ̂ is defined as

    SC(x; θ̂) = n[θ̂(x₁, ..., x_{n−1}, x) − θ̂(x₁, ..., x_{n−1})],

where x₁, ..., x_{n−1} represents an observed sample of size n − 1 from P, and x represents an observation that (potentially) comes from a distribution different from P.
Often the shift is done by fixing x_1, ..., x_{n−1} so that their mean has the ideal value zero, θ̂(x_1, ..., x_{n−1}) = 0. Then, writing

X̄ = (x_1 + ... + x_{n−1} + x)/n = x/n,

we find SC(x; X̄) = x. Thus, the sample mean is arbitrarily sensitive to gross error: a large gross error can throw the mean off entirely. Are there estimates that are less sensitive? A classical estimate of location based on the order statistics is the sample median X̃ defined by

X̃ = X_{(k+1)} if n = 2k + 1
X̃ = (1/2)(X_{(k)} + X_{(k+1)}) if n = 2k,

where X_{(1)}, ..., X_{(n)} are the order statistics, that is, X_1, ..., X_n ordered from smallest to largest.
Section 3.5 Nondecision Theoretic Criteria

The sample median can be motivated as an estimate of location on various grounds.

(i) It is the empirical plug-in estimate of the population median ν (Problem 3.5.4), and it splits the sample into two equal halves.

(ii) In the symmetric location model (3.5.1), ν coincides with μ, and the sample median is thus an empirical plug-in estimate of μ.

(iii) The sample median is the MLE when we assume the common density f(x) of the errors {e_i} in (3.5.1) is the Laplace (double exponential) density

f(x) = (1/(2τ)) exp{−|x|/τ},

a density having substantially heavier tails than the normal. See Problems 2.32 and 3.5.8.

The sensitivity curve of the median is as follows: if, say, n = 2k + 1 is odd and the median of x_1, ..., x_{n−1} is (x_{(k)} + x_{(k+1)})/2 = 0, then

SC(x; X̃) = n x_{(k)} for x < x_{(k)}
SC(x; X̃) = n x for x_{(k)} ≤ x ≤ x_{(k+1)}
SC(x; X̃) = n x_{(k+1)} for x > x_{(k+1)},

where x_{(1)} < ... < x_{(n−1)} are the ordered x_1, ..., x_{n−1}.

Figure 3.5.1. The sensitivity curves of the mean and median.

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X̄. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when x is near μ.
A class of estimates providing such intermediate behavior, and including both the mean and the median, has been known since the eighteenth century. Let 0 ≤ α < 1/2. We define the α trimmed mean X̄_α by

X̄_α = (X_{([nα]+1)} + ... + X_{(n−[nα])}) / (n − 2[nα]),   (3.5.3)

where [nα] is the largest integer ≤ nα and X_{(1)} < ... < X_{(n)} are the ordered observations. That is, we throw out the "outer" [nα] observations on either side and take the average of the rest. The estimates can be justified on plug-in grounds (see Problem 3.5.5). Note that if α = 0, X̄_α = X̄, whereas as α ↑ 1/2, X̄_α tends to the sample median X̃. If [nα] = [(n − 1)α] and the trimmed mean of x_1, ..., x_{n−1} is zero, the sensitivity curve of an α trimmed mean is sketched in Figure 3.5.2. (The middle portion is the line y = x(1 − 2[nα]/n)^{−1}.)

Figure 3.5.2. The sensitivity curve of the trimmed mean.

Intuitively we expect that if there are no gross errors, that is, f = φ, the mean is better than any trimmed mean with α > 0, including the median. If f is symmetric about 0 but has "heavier tails" (see Problem 3.5.8) than the Gaussian density, for example, the Laplace density, f(x) = (1/4)e^{−|x|/2}, or, even more strikingly, the Cauchy, f(x) = 1/[π(1 + x²)], then the trimmed means for α > 0 and even the median can be much better than the mean, infinitely better in the case of the Cauchy (see the problems of Chapter 5). Here too the sensitivity curve calculation points to an equally intuitive conclusion. Which α should we choose in the trimmed mean? There seems to be no simple answer. The range 0.10 ≤ α ≤ 0.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the normal distribution.
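The trimming operation defined above is easy to express computationally. In this sketch (Python; the function name and data are illustrative, not from the text), a single gross error dominates the untrimmed mean but is discarded by a 25% trimmed mean:

```python
import numpy as np

def trimmed_mean(x, alpha):
    """alpha-trimmed mean: drop the [n*alpha] smallest and the [n*alpha]
    largest observations and average the rest (0 <= alpha < 1/2)."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * alpha)            # [n*alpha]
    return x[k:len(x) - k].mean()

x = [1.0, 2.0, 3.0, 4.0, 100.0]        # one gross error
full_mean = trimmed_mean(x, 0.0)       # 22.0: ruined by the outlier
quarter_trim = trimmed_mean(x, 0.25)   # 3.0: averages the middle three values
```

As α approaches 1/2, the same function returns the middle order statistic, i.e. the sample median, matching the limiting behavior noted in the text.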
There has also been some research into procedures for which α is chosen using the observations. For a discussion of these and other forms of "adaptation," see Jaeckel (1971), Huber (1972), and Hogg (1974). For more sophisticated arguments see Huber (1981). See also Andrews, Bickel, Hampel, Huber, Rogers, and Tukey (1972). For instance, suppose we take as our data the differences in Table 3.1 again; the comparisons among the mean, the trimmed means, and the median sketched above can be verified in terms of asymptotic variances (MSEs); see the problems of Chapter 5.
Gross errors or outlying data points affect estimates in a variety of situations; other examples will be given in the problems.

Example 3.5.1. Spread. Let θ(P) = Var(X) = σ² denote the variance in a population and let X_1, ..., X_n denote a sample from that population. Then σ̂² = n^{−1} Σ_{i=1}^n (X_i − X̄)² is the empirical plug-in estimate of σ². To simplify our expression we shift the horizontal axis so that Σ_{i=1}^{n−1} x_i = 0. Writing x̄_n = n^{−1} Σ_{i=1}^n x_i = n^{−1} x, we obtain

SC(x; σ̂²) ≈ x² − σ̂²,   (3.5.4)

where the approximation is valid for x fixed, n → ∞ (Problem 3.5.10). It is clear that σ̂² is very sensitive to large outlying |x| values.

If we are interested in the spread of the values in a population, the variance σ² or standard deviation σ is typically used. A fairly common quick and simple alternative is the IQR (interquartile range) defined as τ = x_{.75} − x_{.25}, where x_{.75} and x_{.25} are the upper and lower quartiles. The IQR is often calibrated so that it equals σ in the N(μ, σ²) model. Because τ = 2(.674)σ there, the scale measure used is .742(x_{.75} − x_{.25}). We next consider estimates of quantiles and of this measure of spread in the population.

Example 3.5.2. Quantiles and the IQR. Let θ(P) = x_α denote an αth quantile of the distribution of X, 0 < α < 1, where x_α has 100α percent of the values in the population on its left (formally, x_α is any value such that P(X ≤ x_α) ≥ α and P(X ≥ x_α) ≥ 1 − α). Let x̂_α denote the αth sample quantile (see 2.1.16). If nα is an integer, say k, the αth sample quantile is x̂_α = (1/2)[x_{(k)} + x_{(k+1)}], and at sample size n − 1, x̂_α = x_{(k)}, where x_{(1)} < ... < x_{(n−1)} are the ordered x_1, ..., x_{n−1}.
) . and computability.75 X. and other procedures. Discussion. Ronchetti.. Most of the section focuses on robustness.. r: ~i II i . Rousseuw.Xnl.2.• . which is defined by IF(x. median.SC(x. . The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean.5.1 denotes the empirical distribution based on Xl. 1'75) .5. We discuss briefly nondecision theoretic considerations for selecting procedures including interpretability. The sensitivity of the parameter B(F) to x can be measured by the influence function. in particular the breakdown point. 1'0) ~[Xlkl) _ xlk)] x < XlkI) 2 '  2 2 1 Ix _ xlkl] XlkI) < x < xlk+11 '  (3.. ~ ~ ~ Next consider the sample lQR T=X. Summary. Other aspects of robustness. An exposition of this point of view and some of the earlier procedures proposed is in Hampel. i) = SC(x. o Remark 3. . 0. discussing the difficult issues of identifiability..6.25· Then we can write SC(x. = <1[0((1  <)F + <t.: : where F n . Unfortunately these procedures tend to be extremely demanding computationally. although this difficulty appears to be being overcome lately.5) 1 [xlk+11 _ xlkl] x> xlk+1) ' Clearly.(x. ~' is the distribution function of point mass at x (.O. xa: is not sensitive to outlying x's. Ii I~ .196 thus.O(F)] = . trimmed mean. SC(x. have been studied extensively and a number of procedures proposed and implemented. . 1'. We will return to the influence function in Volume II.25) and the sample IQR is robust with respect to outlying gross errors x. F) and~:z. x (t) that (Problem 3.O. It plays an important role in functional expansions of estimates. It is easy to see ~ ! '..15) I[t < xl). for 2 Measures of Performance Chapter 3 < k < n .1. ~ .F) dO I.(x.F) where = limIF. and Stabel (1983).5.
3.6 PROBLEMS AND COMPLEMENTS

Problems for Section 3.2

1. Let X_1, ..., X_n be the indicators of n Bernoulli trials with success probability θ. Suppose l(θ, a) is the quadratic loss (θ − a)² and that the prior π(θ) is the beta, β(r, s), density.

(a) Find the Bayes estimate θ̂_B of θ and write it as a weighted average wθ_0 + (1 − w)X̄ of the mean θ_0 of the prior and the sample mean X̄ = S/n. Show that θ̂_B = (S + 1)/(n + 2) for the uniform prior.

(b) Give the MLE of the Bernoulli variance q(θ) = θ(1 − θ) and give the Bayes estimate of q(θ). Check whether q(θ̂_B) = E(q(θ) | x), where θ̂_B is the Bayes estimate of θ.

2. In the Bernoulli problem preceding, with uniform prior on the probability of success θ, we found that (S + 1)/(n + 2) is the Bayes rule. Change the loss function to l(θ, a) = (θ − a)²/θ(1 − θ). Under what condition on S does the Bayes rule exist, and what is the Bayes rule?

3. In some studies (see Chapter 6), the parameter λ = θ/(1 − θ),
which is called the odds ratio (for success), is preferred to θ itself. If we put an (improper) uniform prior on λ, under what condition on S does the Bayes rule exist, and what is the Bayes rule?

4. Find the Bayes risk r(π, δ) of δ(x) = X̄ in Example 3.2.1. Consider the relative risk e(δ, π) = R(π)/r(π, δ), where R(π) is the Bayes risk. Compute the limit of e(δ, π) as

(a) τ² → ∞; (b) n → ∞; (c) σ² → ∞.

5. Suppose θ ~ π(θ) and (X | θ = θ) ~ p(x | θ), θ ∈ Θ = R.

(a) Show that the joint density of X and θ is f(x, θ) = p(x | θ)π(θ) = c(x)π(θ | x), where c(x) = ∫ π(θ)p(x | θ)dθ.

(b) Let l(θ, a) = (θ − a)²/w(θ) for some weight function w(θ) > 0. Show that the Bayes rule is the posterior mean computed under the modified joint density

f_0(x, θ) = p(x | θ)[π(θ)/w(θ)]/c, where c = ∫∫ p(x | θ)[π(θ)/w(θ)]dθ dx

is assumed to be finite. That is, if π and l are changed to a(θ)π(θ) and l(θ, a)/a(θ), respectively, a(θ) > 0, then the Bayes rule does not change.

6. Show that if X_1, ..., X_n is a N(θ, σ²) sample with σ² known, and π is the improper prior π(θ) ≡ 1 on Θ = R, then the improper Bayes rule for squared error loss is δ*(x) = X̄.
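The Bayes estimate (S + 1)/(n + 2) appearing in these problems can be verified numerically by integrating the posterior under the uniform prior. A sketch (Python; the function name and grid size are illustrative, not from the text):

```python
import numpy as np

def bayes_estimate(s, n, grid=200_000):
    """Posterior mean of theta given S = s successes in n trials under the
    uniform prior on (0, 1), computed by numerical integration on a grid."""
    theta = np.linspace(0.0, 1.0, grid)
    post = theta**s * (1.0 - theta)**(n - s)   # likelihood times flat prior
    return float((theta * post).sum() / post.sum())

# Agrees with the closed form (s + 1)/(n + 2) from the conjugate beta analysis.
est = bayes_estimate(3, 10)
```

The same numerical device can be used to explore the weighted loss of the later parts, where the effective prior is π(θ)/w(θ).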
7. For the following problems, compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = 0. (Use the results of the problems referred to; do not derive them.)

(a) Problem 1.3.1(d);

(b) Problem 1.3.2(d)(i) and (ii);

(c) Problem 1.3.19(c).

8. Suppose that N_1, ..., N_r, given θ = θ, are multinomial M(n, θ), θ = (θ_1, ..., θ_r)^T, and that θ has the Dirichlet distribution D(α), α = (α_1, ..., α_r)^T, defined in Problem 1.2.15. Let q(θ) = Σ_{j=1}^r c_j θ_j, where c_1, ..., c_r are given constants.

(a) If l(θ, a) = (q(θ) − a)², find the Bayes decision rule δ* and the minimum conditional Bayes risk r(δ*(x) | x).

(b) When the loss function is l(θ, a) = (q(θ) − a)²/Π_{j=1}^r θ_j, find necessary and sufficient conditions under which the Bayes risk is finite and, under these conditions, find the Bayes rule.

(c) We want to estimate the vector (θ_1, ..., θ_r) with loss function l(θ, a) = Σ_{j=1}^r (θ_j − a_j)². Find the Bayes decision rule.

Hint: If θ ~ D(α), then E(θ_j) = α_j/α_0, Var(θ_j) = α_j(α_0 − α_j)/[α_0²(α_0 + 1)], and Cov(θ_i, θ_j) = −α_i α_j/[α_0²(α_0 + 1)], where α_0 = Σ_{j=1}^r α_j.

9. Bioequivalence trials are used to test whether a generic drug is, to a close approximation, equivalent to a name-brand drug. Suppose we have a sample X_1, ..., X_n of differences in the effect of generic and name-brand versions of a certain drug. Let θ = μ_G − μ_B be the difference in mean effect of the generic and name-brand drugs. Assume that given θ, X_1, ..., X_n are i.i.d. N(θ, σ_0²), where σ_0² is known, and that θ is random with a N(η_0, τ_0²) distribution. There are two possible actions,

a = 0 ⇔ Bioequivalent, a = 1 ⇔ Not Bioequivalent,

with losses l(θ, 0) and l(θ, 1). A regulatory agency specifies a number ε > 0 such that if θ ∈ (−ε, ε),
then the generic and brand-name drugs are, by definition, bioequivalent. On the basis of X = (X_1, ..., X_n)^T we want to decide whether or not θ ∈ (−ε, ε). Set

λ(θ) = l(θ, 0) − l(θ, 1) = difference in loss of acceptance and rejection of bioequivalence.

Note that λ(θ) should be negative when θ ∈ (−ε, ε) and positive when θ ∉ (−ε, ε); any two loss functions with difference λ(θ) are possible loss functions at a = 0 and 1. One such function (Lindley, 1998) is

λ(θ) = r − exp{−θ²/2c²}, 0 < r < 1, c² > 0.
(a) Show that the Bayes rule is equivalent to

"Accept bioequivalence if E(λ(θ) | X = x) < 0"   (3.6.1)

and show that (3.6.1) is equivalent to

"Accept bioequivalence if [E(θ | x)]² < (τ_0²(n) + c²){log[c²/(τ_0²(n) + c²)] − 2 log r},"

where τ_0²(n) denotes the posterior variance of θ.

Hint: λ(±ε) = 0 implies that r satisfies log r = −ε²/(2c²).

(b) It is proposed that the preceding prior is "uninformative" if it has τ_0² large ("τ_0² → ∞"). Discuss the preceding decision rule for this "prior."

(c) Discuss the behavior of the preceding decision rule for large n ("n → ∞"). Consider the general case (a) and the specific case (b).

10. For the model defined by (3.2.16) and (3.2.17), find

(a) the linear Bayes estimate of Δ_1;

(b) the linear Bayes estimate of μ;

(c) Is the assumption that the Δ's are normal needed in (a) and (b)?

Hint: See Example 3.2.2.

Problems for Section 3.3

1. In Example 3.3.2, show that L(x, θ_0, θ_1) ≥ π/(1 − π) is equivalent to T ≥ t.

2. Suppose g : S × T → R, where S and T are subsets of R^m and R^p, respectively, x = (x_1, ..., x_m) ∈ S, y = (y_1, ..., y_p) ∈ T, and g is twice differentiable. A point (x_0, y_0) is a saddle point of g if
g(x_0, y_0) = sup_S g(x, y_0) = inf_T g(x_0, y).

(a) Show that a necessary condition for (x_0, y_0) in the interior of S × T to be a saddle point is that

(∂g/∂x_i)(x_0, y_0) = 0, 1 ≤ i ≤ m; (∂g/∂y_j)(x_0, y_0) = 0, 1 ≤ j ≤ p,

and that the Hessian of g in x at (x_0, y_0) is negative semidefinite while the Hessian in y is positive semidefinite; in particular, ∂²g/∂x_a² ≤ 0, 1 ≤ a ≤ m, and ∂²g/∂y_c² ≥ 0, 1 ≤ c ≤ p.

(b) Suppose g(x, y) = Σ_{i=1}^m Σ_{j=1}^p c_ij x_i y_j with x ∈ S_m, y ∈ S_p, where S_m = {x ∈ R^m : x_i ≥ 0, Σ_{i=1}^m x_i = 1} is the simplex. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for this g.

3. Suppose Θ = {θ_0, θ_1}, A = {0, 1}, l(θ_i, j) = w_ij > 0 for i ≠ j, and l(θ_i, i) = 0, i, j = 0, 1. Let L_x(θ_0, θ_1) = p(X, θ_1)/p(X, θ_0), and suppose that L_x(θ_0, θ_1) has a continuous distribution under both P_{θ_0} and P_{θ_1}. Show that

(a) For every 0 < π < 1, the test rule δ_π given by

δ_π(X) = 1 if L_x(θ_0, θ_1) ≥ (1 − π)w_01/(π w_10), δ_π(X) = 0 otherwise,

is Bayes against a prior such that P[θ = θ_1] = π = 1 − P[θ = θ_0].

(b) There exists 0 < π* < 1 such that the prior with π = π* is least favorable against δ_{π*}; that is, δ_{π*} is minimax.

Hint: Show that there exists a (unique) π* such that R(θ_0, δ_{π*}) = R(θ_1, δ_{π*}).
4. Let S ~ B(n, θ), l(θ, a) = (θ − a)², and

δ*(S) = (S + √n/2)/(n + √n).

(a) Show that δ* has constant risk and is Bayes for the beta, β(√n/2, √n/2), prior. Thus, δ* is minimax.

(b) Let δ(S) = X̄ = S/n. Show that lim_{n→∞} R(θ, δ*)/R(θ, δ) > 1 for θ ≠ 1/2, and that this limit equals 1 when θ = 1/2.

5. Let X_1, ..., X_n be i.i.d. N(μ, σ²) and l(σ², d) = (d/σ² − 1)².

(a) Show that if μ is known to be 0, then δ*(X) = Σ_{i=1}^n X_i²/(n + 2) is minimax.

(b) If μ = 0, show that δ* is uniformly best among all rules of the form δ_c(X) = c Σ X_i². Conclude that the MLE is inadmissible.

(c) Show that if μ is unknown, δ(X) = Σ(X_i − X̄)²/(n + 1) is best among all rules of the form c Σ(X_i − X̄)² and, hence, that both the MLE and the estimate s² = (n − 1)^{−1} Σ(X_i − X̄)² are inadmissible.

Hint: (a) Consider a gamma prior on θ = 1/σ².

6. Suppose X has a Poisson, P(λ), distribution and l(λ, d) = (d − λ)²/λ. Show that δ(X) = X has constant risk and is minimax.

Hint: Consider gamma priors and pass to a limit.

7. Show that if (N_1, ..., N_k) has a multinomial, M(n, p_1, ..., p_k), distribution, then δ(N) = (N_1/n, ..., N_k/n) is minimax for the loss function

l(p, d) = Σ_{j=1}^k (d_j − p_j)²/(p_j q_j), where q_j = 1 − p_j, 1 ≤ j ≤ k.

Hint: Consider Dirichlet priors on (p_1, ..., p_{k−1}). See Problem 1.2.15.

Remark: For squared error loss l(μ, d) = Σ_{j=1}^k (d_j − μ_j)², X = (X_1, ..., X_k) with X_i independent N(μ_i, 1) is minimax; Stein (1956) has shown that if k ≥ 3 it is nevertheless inadmissible. See Volume II.
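The constant-risk property of the minimax rule δ*(S) = (S + √n/2)/(n + √n) for the binomial problem above can be checked directly: its squared-error risk works out to 1/[4(1 + √n)²] at every θ. A numerical sketch (Python; names are illustrative, not from the text):

```python
from math import comb, sqrt

def risk(delta, n, theta):
    """Squared-error risk E_theta[(delta(S) - theta)^2] with S ~ B(n, theta)."""
    return sum(comb(n, s) * theta**s * (1.0 - theta)**(n - s)
               * (delta(s) - theta)**2 for s in range(n + 1))

n = 10
minimax_rule = lambda s: (s + sqrt(n) / 2.0) / (n + sqrt(n))
risks = [risk(minimax_rule, n, th) for th in (0.1, 0.3, 0.5, 0.9)]
constant = 1.0 / (4.0 * (1.0 + sqrt(n))**2)   # common value of the risk
```

By contrast, the risk of X̄ = S/n is θ(1 − θ)/n, which varies with θ and beats the minimax rule away from θ = 1/2, in line with part (b).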
8. Let X_i be independent N(μ_i, 1), 1 ≤ i ≤ k, where (μ_1, ..., μ_k) = (μ⁰_{i_1}, ..., μ⁰_{i_k}), μ⁰_1 < μ⁰_2 < ... < μ⁰_k is a known set of values, and (i_1, ..., i_k) is an arbitrary unknown permutation of 1, ..., k. Let

l(μ, d) = Σ_{j=1}^k (d_j − μ_j)², d = (d_1, ..., d_k)^T.

Show that the minimax rule is to take δ(X_1, ..., X_k) = (μ⁰_{R_1}, ..., μ⁰_{R_k}), where R_j is the rank of X_j, that is, R_j = Σ_{i=1}^k 1(X_i ≤ X_j).

Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that a permutation assigning the larger μ⁰'s to the larger observations has smaller posterior risk than one that does not.

9. Let X_i, i = 1, ..., n, be i.i.d. with unknown distribution F. For a given x we want to estimate the proportion F(x) = P(X_i ≤ x) of the population to the left of x. Show that

δ = [√n/(1 + √n)] · (No. of X_i ≤ x)/n + 1/[2(1 + √n)]

is minimax for estimating F(x) with squared error loss.

Hint: Consider the risk function of δ and use the binomial minimax problem above.

10. Let X_1, ..., X_n be independent N(μ, 1). Define

δ_d(X) = 0 if |X̄| ≤ d/√n; δ_d(X) = X̄ − d/√n if X̄ > d/√n; δ_d(X) = X̄ + d/√n if X̄ < −d/√n.

(a) Show that the risk (for squared error loss) E(√n(δ_d(X) − μ))² of these estimates is bounded for all n and μ.

(b) How does the risk of these estimates compare to that of X̄?

11. Suppose that given θ = θ, X has a binomial, B(n, θ), distribution. Let the loss function be the Kullback-Leibler divergence l_p(θ, a) and let the prior be uniform on (0, 1). Show that the Bayes estimate is (X + 1)/(n + 2).

12. Suppose that given θ = (θ_1, ..., θ_k), X = (X_1, ..., X_k)^T has a multinomial, M(n, θ_1, ..., θ_k), distribution. Let the loss function be the Kullback-Leibler divergence l_p(θ, a), a_j > 0, Σ_{j=1}^k a_j = 1, and let the prior be the uniform prior π(θ_1, ..., θ_{k−1}) = (k − 1)!. Show that the Bayes estimate is δ_j = (X_j + 1)/(n + k), 1 ≤ j ≤ k.

13. Let K(p_θ, q) denote the KLD (Kullback-Leibler divergence) between the densities p_θ and q and define the Bayes KLD between P = {P_θ : θ ∈ Θ} and q as

k(q, π) = ∫ K(p_θ, q)π(θ)dθ.

Show that the marginal density of X,

p(x) = ∫ p_θ(x)π(θ)dθ,
minimizes k(q, π), and that the minimum value k(p, π) is called the mutual information between θ and X.

Hint: k(q, π) − k(p, π) = ∫ [E_θ {log(p(X)/q(X))}] π(θ)dθ ≥ 0 by Jensen's inequality.

14. Show that the Bayes estimate of θ for the Kullback-Leibler loss function l_p(θ, a) is the posterior mean E(θ | X).

Problems for Section 3.4

1. Let X_1, ..., X_n be the indicators of n Bernoulli trials with success probability θ. Show that X̄ is an UMVU estimate of θ.

2. Let X_1, ..., X_n be i.i.d. N(μ_0, σ²) with μ_0 known. Show that

(a) σ̂_0² = n^{−1} Σ_{i=1}^n (X_i − μ_0)² is a UMVU estimate of σ²;

(b) σ̂_0² is nevertheless inadmissible.

3. Prove Proposition 3.4.1.

4. Equivariance. Suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from Θ onto h(Θ). Reparametrize the model by setting η = h(θ) and let q(x, η) = p(x, h^{−1}(η)) denote the model in the new parametrization.

(a) Show that if I_p(θ) and I_q(η) denote the Fisher information in the two parametrizations, then

I_q(η) = I_p(h^{−1}(η)) / [h'(h^{−1}(η))]².

(b) Equivariance of the Fisher Information Bound. Let B_p(θ) and B_q(η) denote the information inequality lower bound (ψ')²/I, as in (3.4.12), for the two parametrizations p(x, θ) and q(x, η), respectively. Show that B_q(η) = B_p(h^{−1}(η)); that is, the Fisher information lower bound is equivariant.

5. Jeffrey's "Prior." A density proportional to √I_p(θ) is called Jeffrey's prior. It is often improper. Show that in the N(θ, σ_0²), P(θ) (Poisson), and B(n, θ) cases, Jeffrey's priors are proportional to 1, θ^{−1/2}, and [θ(1 − θ)]^{−1/2}, respectively. Give the Bayes rules for squared error loss in these three cases.
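The contrast drawn in the equivariance problem above (the information changes under reparametrization while the bound does not) can be seen numerically in the Bernoulli model with the logit map. A sketch (Python; the closed-form expressions used are the standard ones, stated here as assumptions rather than taken from the text):

```python
def I_p(theta):
    # Fisher information for a single Bernoulli(theta) observation
    return 1.0 / (theta * (1.0 - theta))

def h_prime(theta):
    # derivative of the logit map h(theta) = log(theta/(1-theta))
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3
# Information in the eta = h(theta) parametrization: I_p / (h')^2
I_q = I_p(theta) / h_prime(theta) ** 2          # equals theta*(1-theta), not I_p

# The information bound for eta itself is 1/I_q; mapping the original bound
# through h gives (h')^2 / I_p, and the two agree (equivariance of the bound).
bound_eta = 1.0 / I_q
bound_via_p = h_prime(theta) ** 2 / I_p(theta)
```

Here I_q differs from I_p (0.21 versus about 4.76 at θ = 0.3), yet the two routes to the lower bound coincide exactly.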
Remark: Fisher information is not equivariant under increasing transformations of the parameter, but the information bound is. We shall say a loss function l(θ, a) is convex if
l(θ, αa_0 + (1 − α)a_1) ≤ α l(θ, a_0) + (1 − α) l(θ, a_1) for all θ, a_0, a_1 and 0 ≤ α ≤ 1.

6. Suppose that there is an unbiased estimate δ of q(θ) and that T(X) is sufficient. Show that if l(θ, a) is convex and δ'(X) = E(δ(X) | T(X)), then R(θ, δ') ≤ R(θ, δ).

Hint: Use Jensen's inequality: if g is a convex function and X is a random variable, then E(g(X)) ≥ g(E(X)).

7. Suppose θ̂ is UMVU for estimating θ. Let a and b be constants. Show that λ̂ = a + bθ̂ is UMVU for estimating λ = a + bθ.

8. Let X_1, ..., X_n be a sample from the beta, β(θ, 1), distribution.
(a) Find the MLE of 1/θ. Is it unbiased? Does it achieve the information inequality lower bound?

(b) Show that X̄ is an unbiased estimate of θ/(θ + 1). Does X̄ achieve the information inequality lower bound?

9. Suppose Y_1, ..., Y_n are independent Poisson random variables with E(Y_i) = μ_i, where μ_i = exp{α + βz_i} depends on the levels z_i of a covariate. For instance, z_i could be the level of a drug given to the ith patient with an infectious disease and Y_i could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered.

(a) Write the model for Y_1, ..., Y_n in two-parameter canonical exponential form and give the sufficient statistic.

(b) Let θ = (α, β)^T. Compute I(θ) for the model in (a) and then find the lower bound on the variances of unbiased estimators α̂ and β̂ of α and β.

(c) Suppose that z_i = log[i/(n + 1)], i = 1, ..., n. Find lim n^{−1}I(θ) as n → ∞ and give the limit of n times the lower bound on the variances of α̂ and β̂.

Hint: Use the integral approximation to sums.

10. Establish the claims of Example 3.4.3.

11. Show that assumption I implies that if A = {x : p(x, θ) > 0} doesn't depend on θ, then, for any set E, P_θ(E) = 1 for some θ if and only if P_θ(E) = 1 for all θ.

12. In Example 3.4.1, compute Var(θ̂) using each of the three methods indicated.

13. Show that s² = (Y − Z_D β̂)^T (Y − Z_D β̂)/(n − p) is an unbiased estimate of σ² in the linear regression model of Section 2.2.

14. Show that if (X_1, ..., X_n) is a sample drawn without replacement from an unknown finite population {x_1, ..., x_N}, then

(a) X̄ is an unbiased estimate of x̄ = N^{−1} Σ_{i=1}^N x_i;
(b) the variance of X̄ is given by (3.4.4).

15. Stratified Sampling. (See also Problem 1.3.4.) Suppose the population {x_1, ..., x_N} can be relabeled into K strata {x_{k,i} : 1 ≤ i ≤ I_k}, k = 1, ..., K, with Σ_{k=1}^K I_k = N, and let π_k = I_k/N.

(a) Take samples with replacement of size m_k = nπ_k from stratum k, k = 1, ..., K, and form the corresponding sample averages X̄_1, ..., X̄_K. Define X̄_s = Σ_{k=1}^K π_k X̄_k. Show that X̄_s is unbiased and that, if X̄ is the mean of a simple random sample of size n without replacement from the population, then

Var X̄_s ≤ Var X̄,

with equality if and only if x̄_k = I_k^{−1} Σ_{i=1}^{I_k} x_{k,i} doesn't depend on k, for all k such that π_k > 0.

(b) Show that the inequality in (a) continues to hold even for sampling without replacement in each stratum.

16. Show that X̄_b given by (3.4.6) is (a) unbiased and (b) has smaller variance than X̄ if 0 < b < 2 Cov(Ū, X̄)/Var(Ū).
17. Suppose U_1, ..., U_N are population values and U_j is retained in the sample independently of all other U_j's with probability π_j, where Σ_{j=1}^N π_j = n. Show that if M is the sample size, then E(M) = Σ_{j=1}^N π_j = n.

18. Suppose the sampling scheme given in the preceding problem is employed with π_j = n/N. Show that the resulting unbiased Horvitz-Thompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population.

19. Suppose X is distributed according to {P_θ : θ ∈ Θ ⊂ R} and π is a prior distribution for θ such that E(θ²) < ∞.

(a) Show that δ(X) is both an unbiased estimate of θ and the Bayes estimate with respect to quadratic loss, if and only if, P[δ(X) = θ] = 1.

(b) Deduce that X̄ is not a Bayes estimate for any prior π.

(c) Explain how it is possible, if P_θ is binomial, B(n, θ), that X̄ = S/n is nevertheless a Bayes estimate under the weighted quadratic loss (θ − a)²/θ(1 − θ).

20. Let X have a binomial, B(n, p), distribution. Show that 1/p is not unbiasedly estimable. More generally, only polynomials of degree at most n in p are unbiasedly estimable.
Hint: Given E(δ̃(X) | θ) = θ and E(θ | X) = δ̃(X), compute E[(δ̃(X) − θ)²].

21. Prove the multiparameter information inequality.

Hint: Note that ψ̇(θ)a = ∇E_θ(a^T θ̂) and apply Theorem 3.4.1. It is equivalent to show that, for all a (d × 1),

Var(a^T θ̂) ≥ [ψ̇^T(θ)a]^T I^{−1}(θ) [ψ̇^T(θ)a].

22. Regularity Conditions are Needed for the Information Inequality. Let p(x, θ) = θ^{−1} 1(0 < x < θ) be the uniform, U(0, θ), density. Then log p(x, θ) is differentiable for all θ > x, and we can thus define moments of (∂/∂θ) log p(x, θ). Show that, however,

(i) E((∂/∂θ) log p(X, θ)) = −1/θ ≠ 0;

(ii) Var((∂/∂θ) log p(X, θ)) = 0, so the information bound is infinite;

(iii) yet 2X is unbiased for θ and has finite variance.

Problems for Section 3.5

1. If nα = k is an integer, 0 < α < 1, use (3.5.5) to give and plot the sensitivity curves of the lower quartile x̂_{.25}, the upper quartile x̂_{.75}, and the IQR.

2. If α = 0.25 and n = 2k is even, give and plot the sensitivity curve of the median.

3. Show that the sample median X̃ is an empirical plug-in estimate of the population median ν.

4. Show that the α trimmed mean X̄_α is an empirical plug-in estimate of

μ_α = (1 − 2α)^{−1} ∫_{x_α}^{x_{1−α}} x dF(x).

Here ∫ x dF(x) denotes ∫ xp(x)dx in the continuous case and Σ_x xp(x) in the discrete case.

5. An estimate δ(X) is said to be shift or translation equivariant if, for all x_1, ..., x_n and all c,

δ(x_1 + c, ..., x_n + c) = δ(x_1, ..., x_n) + c.

It is antisymmetric if, for all x_1, ..., x_n,

δ(−x_1, ..., −x_n) = −δ(x_1, ..., x_n).

(a) Show that X̄, X̃, and X̄_α are translation equivariant and antisymmetric.

(b) Suppose X_1, ..., X_n is a sample from a population with d.f. F(x − μ), where μ is unknown and X_i − μ is symmetrically distributed about 0. Show that if δ is translation equivariant and antisymmetric and E_μ(δ(X)) exists and is finite, then δ is an unbiased estimate of μ; that is, X̄, X̃, and X̄_α are unbiased estimates of the center of symmetry of a symmetric distribution.

6. The Hodges-Lehmann (location) estimate x̂_HL is defined to be the median of the n(n + 1)/2 pairwise averages (x_i + x_j)/2, i ≤ j. Its properties are similar to those of the trimmed mean, and it has the advantage that there is no trimming proportion α that needs to be subjectively specified. Show that x̂_HL is translation equivariant and antisymmetric.

7. The Huber estimate X̂_k is defined implicitly as the solution of the equation

Σ_{i=1}^n ψ_k((X_i − X̂_k)/σ̂) = 0,

where 0 < k < ∞, σ̂ = median{|X_i − X̃|, 1 ≤ i ≤ n}/0.67, X̃ is the sample median, and

ψ_k(x) = x if |x| ≤ k; ψ_k(x) = k if x > k; ψ_k(x) = −k if x < −k.

One reasonable choice for k is k = 1.5. Show that

(a) k = ∞ corresponds to X̄ and k → 0 to the median;

(b) if σ̂ is replaced by a known σ_0, then X̂_k is the MLE of θ when X_1, ..., X_n are i.i.d. with density f_0((x − θ)/σ_0), where f_0(x) is proportional to exp{−x²/2} for |x| ≤ k and to exp{−k|x| + k²/2} for |x| > k;

(c) X̂_k exists and is unique when k > 0;

(d) X̂_k is translation equivariant and antisymmetric;

(e) if k < ∞, then lim_{|x|→∞} SC(x; X̂_k) is a finite constant. (One calibration connects k and the gross error proportion ε through 2φ(k)/k − 2Φ(−k) = ε/(1 − ε).)

8. Suppose n = 5 and the "ideal" ordered sample of size n − 1 = 4 is −1.03, −0.30, 0.30, 1.03 (these are expected values of four N(0, 1) order statistics). Plot the sensitivity curves of the mean, median, trimmed mean with α = 1/4, and the Hodges-Lehmann estimate. This problem may be done on the computer.
5) are location parameters.  vl/O. ~ ~ 8n (a +bxI.a +bxn ) ~ a + b8n (XI. Show that J1. it is called a location parameter. compare SC(x.). then for a E R.xn ).Xn_l) to show its dependence on Xnl (Xl. X is said to be stochastically smaller than Y if = G(t) for all t E R. SC(a + bx. 8) and ~ IF(x. .(F) = !(xa + XI").Section 3. b > 0.. .. any point in [v(F). the ~ = bdSC(x. . median v. ~ = d> 0. c E R.. if F is also strictly increasing. and ordcr preserving. ~ ~ 15.5. let v" = v. ~ se is shift invariant and scale equivariant. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F). . An estimate ()n is said to be shift and scale equivariant if for all xl. F(t) ~ P(X < t) > P(Y < t) antisymmctric.. In Remark 3...5.. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v. 8._. In this case we write X " Y. xnd.. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y.1: (a) Show that SC(x. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). · .]. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval. and trimmed population mean JiOl (see Problem 3. x n _. ..:.O. Let Y denote a random variable with continuous distribution function G.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :. where T is the median of the distributioo of IX location parameter._. and note thatH(xi/(F» < F(x) < H(xv(F». Also note that H(x) is symmetric about zero.. i/(F)] and.(k) is a (d) For 0 < " < 1. a + bx.. 8. n (b) In the following cases. and sample trimmed mean are shift and scale equivariant.) 14. let H(x) be the distribution function whose inverse is H'(a) = ![xax. b > 0.) That is. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :.. 8. 0 < " < 1. a. .xn . (a) Show that if F is symmetric about c and () is a location parameter.8..t )) = 0. then ()(F) = c. c + dO. sample median. 
Hint: For the second part. ([v(F). 8) = IF~ (x. F) lim n _ oo SC(x.F).(F): 0 <" < 1/2}. ~ (a) Show that the sample mean. (jx. . If 0 is scale and shift equivariant. (b) Show that the mean Ji.67 and tPk is defined in Problem 3. Show that if () is shift and location equivariant. Fn _ I ). ~ (b) Write the Be as SC(x. then v(F) and i/( F) are location parameters.5.t.8. i/(F)] is the value of some location parameter. .
Assume that F is strictly increasing.

(i) θ(F) = μ = ∫ x dF(x), (ii) θ(F) = ν, (iii) θ(F) = μα.

(c) Does n[SC(x; θ̂) − IF(x; θ̂, F)] → 0 in the cases (i), (ii), and (iii) preceding?

16. In the gross error model of Section 3.5, show that

(a) If h is a density that is symmetric about zero, then μ is identifiable.

(b) If no assumptions are made about h, then μ is not identifiable.

17. Show that in the bisection method, in order to be certain that the Jth iterate θ̂J is within ε of the desired θ̂ such that ψ(θ̂) = 0, we in general must take on the order of log(1/ε) steps. This is, consequently, also true of the method of coordinate ascent.

18. Let d = 1 and suppose that ψ is twice continuously differentiable, ψ' > 0, and we seek the unique solution θ̂ of ψ(θ) = 0. The Newton-Raphson method in this case is

θ̂(j+1) = θ̂(j) − ψ(θ̂(j))/ψ'(θ̂(j)).

(a) Show by example that for suitable ψ and θ̂(0), the iterates {θ̂(j)} do not converge.

(b) Show that there exist C < ∞, δ > 0 (depending on ψ) such that if |θ̂(0) − θ̂| < δ, then |θ̂(j+1) − θ̂| ≤ C|θ̂(j) − θ̂|².

Hint: (a) Try ψ(x) = A log x with A > 1.

3.7 NOTES

Note for Section 3.3

(1) A technical problem is to give the class S of subsets of F for which we can assign probability (the measurable sets). We define S as the σ-field generated by SA,B = {F ∈ F : PF(A) ∈ B}, A, B ∈ B, where B is the class of Borel sets.

Notes for Section 3.4

(1) The result of Theorem 3.4.1 is commonly known as the Cramér-Rao inequality. Because priority of discovery is now given to the French mathematician M. Fréchet, we shall
follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement.

(2) Note that this inequality is true but uninteresting if I(θ) = ∞ (and ψ'(θ) is finite) or if Varθ(T(X)) = ∞.

(3) The continuity of the first integral ensures that

∂/∂θ ∫ T(x) p(x, θ) dx = ∫ T(x) [∂p(x, θ)/∂θ] dx

for all θ, whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in

∫∫ T(x) [∂p(x, λ)/∂λ] dx dλ.

(4) The finiteness of Varθ(T(X)) and I(θ) imply that ψ'(θ) is finite by the covariance interpretation given in (3.4.8).

3.8 REFERENCES

ANDREWS, D., P. BICKEL, F. HAMPEL, P. HUBER, W. ROGERS, AND J. TUKEY, Robust Estimates of Location: Survey and Advances Princeton, NJ: Princeton University Press, 1972.

APOSTOL, T., Mathematical Analysis, 2nd ed. Reading, MA: Addison-Wesley, 1974.

BERGER, J., Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer, 1985.

BERNARDO, J., AND A. SMITH, Bayesian Theory New York: Wiley, 1994.

BICKEL, P., AND E. LEHMANN, "Unbiased Estimation in Convex Families," Ann. Math. Statist., 40, 1523-1535 (1969).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. I. Introduction," Ann. Statist., 3, 1038-1044 (1975a).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. II. Location," Ann. Statist., 3, 1045-1069 (1975b).

BICKEL, P., AND E. LEHMANN, "Descriptive Statistics for Nonparametric Models. III. Dispersion," Ann. Statist., 4, 1139-1158 (1976).

BÜHLMANN, H., Mathematical Methods in Risk Theory Heidelberg: Springer-Verlag, 1970.

DAHLQUIST, G., A. BJÖRK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DE GROOT, M., Optimal Statistical Decisions New York: McGraw-Hill, 1970.

DOKSUM, K., "Measures of Location and Asymmetry," Scand. J. of Statist., 2, 11-22 (1975).

DOWE, D., R. BAXTER, J. OLIVER, AND C. WALLACE, "Point Estimation Using the Kullback-Leibler Loss Function and MML," in Proceedings of the Second Pacific Asian Conference on Knowledge Discovery and Data Mining Melbourne: Springer-Verlag, 1998.

HAMPEL, F., "The Influence Curve and Its Role in Robust Estimation," J. Amer. Statist. Assoc., 69, 383-393 (1974).

HAMPEL, F., E. RONCHETTI, P. ROUSSEEUW, AND W. STAHEL, Robust Statistics: The Approach Based on Influence Functions New York: J. Wiley & Sons, 1986.

HANSEN, M., AND B. YU, "Model Selection and the Principle of Minimum Description Length," J. Amer. Statist. Assoc. (2000).

HOGG, R., "Adaptive Robust Procedures," J. Amer. Statist. Assoc., 69, 909-927 (1974).

HUBER, P., "Robust Statistics: A Review," Ann. Math. Statist., 43, 1041-1067 (1972).

HUBER, P., Robust Statistics New York: Wiley, 1981.

JAECKEL, L., "Robust Estimates of Location," Ann. Math. Statist., 42, 1020-1034 (1971).

JEFFREYS, H., Theory of Probability London: Oxford University Press, 1948.

KARLIN, S., Mathematical Methods and Theory in Games, Programming, and Economics Reading, MA: Addison-Wesley, 1959.

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.

LINDLEY, D. V., "Decision Analysis and Bioequivalence Trials," Statistical Science, 13, 136-141 (1998).

NORBERG, R., "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification," Scand. Actuarial J., 204-222 (1986).

RISSANEN, J., "Stochastic Complexity (With Discussions)," J. Royal Statist. Soc. B, 49, 223-239 (1987).

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SHIBATA, R., "Bootstrap Estimate of Kullback-Leibler Information for Model Selection," Statistica Sinica, 7, 375-394 (1997).

STEIN, C., "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution," Proc. Third Berkeley Symposium on Math. Statist. and Probability, 1, 197-206 (1956), University of California Press.

TUKEY, J., Exploratory Data Analysis Reading, MA: Addison-Wesley, 1977.

WALLACE, C., AND P. FREEMAN, "Estimation and Inference by Compact Coding (With Discussions)," J. Royal Statist. Soc. B, 49, 240-251 (1987).

WIJSMAN, R., "On the Attainment of the Cramér-Rao Lower Bound," Ann. Statist., 1, 538-542 (1973).
Chapter 4

TESTING AND CONFIDENCE REGIONS: BASIC THEORY

4.1 INTRODUCTION

In Sections 1.3, 3.2, and 3.3 we defined the testing problem abstractly, treating it as a decision theory problem in which we are to decide whether P ∈ P0 or P ∈ P1 or, parametrically, whether θ ∈ Θ0 or θ ∈ Θ1 if Pj = {Pθ : θ ∈ Θj}, where P0, P1 are a partition of the model P or, respectively, Θ0, Θ1 are a partition of the parameter space Θ. This framework is natural if, as is often the case, we are trying to get a yes or no answer to important questions in science, medicine, public policy, and indeed most human activities, and we have data providing some evidence one way or the other. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial, perform a survey, or more generally construct an experiment that yields data X in X ⊂ R^q, modeled by us as having distribution Pθ, θ ∈ Θ, where Θ is partitioned into {Θ0, Θ1}, with Θ0 and Θ1 corresponding, respectively, to answering "no" or "yes" to the preceding questions.

In examples such as 1.1.3 the questions are sometimes simple and the type of data to be gathered is under our control. Usually, the situation is less simple. The design of the experiment may not be under our control, what is an appropriate stochastic model for the data may be questionable, and what Θ0 and Θ1 correspond to in terms of the stochastic model may be unclear. Here are two examples that illustrate these issues.

Example 4.1.1. Sex Bias in Graduate Admissions at Berkeley. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. They initially tabulated Nm1, Nf1, the numbers of admitted male and female applicants, and the corresponding numbers Nm0, Nf0 of denied applicants. If n is the total number of applicants, it might be tempting to model (Nm1, Nm0, Nf1, Nf0) by a multinomial, M(n, pm1, pm0, pf1, pf0), distribution. But this model is suspect because in fact we are looking at the population of all applicants here, not a sample. Accepting this model provisionally, what does the
hypothesis of no sex bias correspond to? Again it is natural to translate this into

P[Admit | Male] = pm1/(pm1 + pm0) = P[Admit | Female] = pf1/(pf1 + pf0).

But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability

(pm1 + pf1)/(pm1 + pf1 + pm0 + pf0).

In fact, as is discussed in a paper by Bickel, Hammel, and O'Connell (1975), admissions are performed at the departmental level and rates of admission differ significantly from department to department. If departments "use different coins," then the data are naturally decomposed into N = (Nm1d, Nm0d, Nf1d, Nf0d, d = 1, ..., D), where Nm1d is the number of male admits to department d, and so on. Our multinomial assumption now becomes N ~ M(pm1d, pm0d, pf1d, pf0d, d = 1, ..., D). In these terms the hypothesis of "no bias" can now be translated into:

H : pm1d/(pm1d + pm0d) = pf1d/(pf1d + pf0d), for d = 1, ..., D.

This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. In fact, the same data can lead to opposite conclusions regarding these hypotheses, a phenomenon called Simpson's paradox. The example illustrates both the difficulty of specifying a stochastic model and of translating the question one wants to answer into a statistical hypothesis.

Example 4.1.2. Mendel's Peas. In one of his famous experiments laying the foundation of the quantitative theory of genetics, Mendel crossed peas heterozygous for a trait with two alleles, one of which was dominant. The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). In a modern formulation, if there were n dominant offspring (seeds), the natural model, if the inheritance ratio can be arbitrary, is to assume that NAA, the number of homozygous dominants, has a binomial, B(n, p), distribution. The hypothesis of dominant inheritance corresponds to H : p = 1/3 with the alternative K : p ≠ 1/3. It was noted by Fisher, as reported in Jeffreys (1961), that in this experiment the observed fraction m/n was much closer to 1/3 than might be expected under the hypothesis that NAA has a B(n, 1/3) distribution: the probability of agreement as close as that observed is only about 7 × 10⁻⁵. Fisher conjectured that rather than believing that such a very extraordinary event occurred, it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. That is, either NAA cannot really be thought of as stochastic or any stochastic model needs to permit distributions other than B(n, p), for instance, (1 − ε)δ(n/3) + εB(n, p), where 1 − ε is the probability that the assistant fudged the data and δ(n/3) is point mass at n/3.

What the second of these examples suggests is often the case. The set of distributions corresponding to one answer, say Θ0, is better defined than the alternative answer Θ1. That a treatment has no effect is easier to specify than what its effect is; see, for instance, our discussion of constant treatment effect in Example 1.1.3. In science generally a theory typically closely specifies the type of distribution P of the data X as, for instance, P = Pθ, θ ∈ Θ0. If the theory is false, it's not clear what P should be, as in the preceding Mendel example. These considerations lead to the asymmetric formulation that saying P ∈ P0 (θ ∈ Θ0) corresponds to acceptance of the hypothesis H : P ∈ P0, and P ∈ P1 corresponds to rejection, sometimes written as K : P ∈ P1.(1) As we have stated earlier, recall that a decision procedure in the case of a test is described by a test function δ : X → {0, 1} or critical region C = {x : δ(x) = 1}, the set of points for which we reject. Thus, acceptance and rejection can be thought of as actions a = 0 or 1, and we are then led to the natural 0-1 loss l(θ, a) = 0 if θ ∈ Θa and 1 otherwise.

It is convenient to distinguish between two structural possibilities for Θ0 and Θ1: If Θ0 consists of only one point, we call Θ0 and H simple. When Θ0 contains more than one point, Θ0 and H are called composite. The same conventions apply to Θ1 and K. We illustrate these ideas in the following example.

Example 4.1.3. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. Suppose that we know from past experience that a fixed proportion θ0 = 0.3 recover from the disease with the old drug. What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. To investigate this question we would have to perform a random experiment. Most simply we would sample n patients, administer the new drug, and then base our decision on the observed sample X = (X1, ..., Xn), where Xi is 1 if the ith patient recovers and 0 otherwise. Thus, suppose we observe S = ΣXi, the number of recoveries among the n randomly selected patients who have been administered the new drug.(2) If we let θ be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite, then S has a B(n, θ) distribution and our hypothesis is H : θ = θ0, where θ0 is the probability of recovery using the old drug. Now Θ0 = {θ0} and H is simple. If we suppose the new drug is at least as effective as the old, then Θ1 is the interval (θ0, 1] and K is composite. If we allow for the possibility that the new drug is less effective than the old, then Θ0 = [0, θ0] and Θ0 is composite. It will turn out that in most cases the solution to testing problems with Θ0 simple also solves the composite Θ0 problem; in situations such as this one we shall simplify notation and write H : θ = θ0, K : θ > θ0.

In this example with Θ0 = {θ0} it is reasonable to reject H if S is "much" larger than what would be expected by chance if H is true and the value of θ is θ0. That is, we reject H if S exceeds or equals some integer, say k, and accept H otherwise.
In most problems it turns out that the tests that arise naturally have the following kind of structure. There is a statistic T that "tends" to be small, if H is true, and large, if H is false. We call T a test statistic. (Other authors consider test statistics T that tend to be small, if H is false; −T would then be a test statistic in our sense.) We select a number c and our test is to calculate T(x) and then reject H if T(x) ≥ c and accept H otherwise. The value c that completes our specification is referred to as the critical value of the test. Note that a test statistic generates a family of possible tests as c varies. We will discuss the fundamental issue of how to choose T in Sections 4.2, 4.3, and later chapters. We now turn to the prevalent point of view on how to choose c.

The Neyman Pearson Framework

The Neyman Pearson approach rests on the idea that, of the two errors, one can be thought of as more important. By convention this is chosen to be the type I error, and that in turn determines what we call H and what we call K. Given this position, how reasonable is this point of view? In the medical setting of Example 4.1.3 this asymmetry appears reasonable. As we noted in Examples 4.1.1 and 4.1.2, asymmetry is often also imposed because one of Θ0, Θ1 is much better defined than its complement and/or the distribution of statistics T under Θ0 is easy to compute. For instance, in genomics, testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. To determine significant values of these statistics a (more complicated) version of the following is done. Thresholds (critical values) are set so that if the matches occur at random (i.e., matches at one position are independent of matches at other positions, each occurring with a fixed probability), then the probability of exceeding the threshold (type I error) is smaller than α. No one really believes that H is true, and possible types of alternatives are vaguely known at best. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then, as we shall see in Sections 4.3 and later chapters, suggesting what test statistics it is best to use. It has also been argued that, generally in science, announcing that a new phenomenon has been observed when in fact nothing has happened (the so-called null hypothesis) is more serious than missing something new that has in fact occurred. We do not find this persuasive, but if this view is accepted, it again reasonably leads to a Neyman Pearson formulation.

Returning to Example 4.1.3, our critical region C is {X : S ≥ k} and the test function or rule is δk(X) = 1{S ≥ k}, with

PI = probability of type I error = Pθ0(S ≥ k),

PII = probability of type II error = Pθ(S < k), θ > θ0.

The constant k that determines the critical region is called the critical value.
There is an important class of situations in which the Neyman Pearson framework is inappropriate, such as the quality control Example 1.1.1. Indeed, it is too limited in any situation in which, even though there are just two actions, we can attach, even nominally, numbers to the two losses that are not equal and/or depend on θ, as in the Bayesian framework with a prior distribution on the parameter. In that case, the approach of Example 3.2.2(b) is the one to take in all cases with Θ0 and Θ1 simple.

Here are the elements of the Neyman Pearson story. Begin by specifying a small number α > 0 such that probabilities of type I error greater than α are undesirable. Then restrict attention to tests that in fact have the probability of rejection less than or equal to α for all θ ∈ Θ0. Such tests are said to have level (of significance) α, and we speak of rejecting H at level α. The values α = 0.01 and 0.05 are commonly used in practice. Because a test of level α is also of level α' > α, it is convenient to give a name to the smallest level of significance of a test. This quantity is called the size of the test and is the maximum probability of type I error. That is, if we have a test statistic T and use critical value c, our test has size α(c) given by

α(c) = sup{Pθ[T(X) ≥ c] : θ ∈ Θ0}.   (4.1.1)

Now α(c) is nonincreasing in c and typically α(c) ↑ 1 as c ↓ −∞ and α(c) ↓ 0 as c ↑ ∞. Thus, if 0 < α < 1, there exists a unique smallest c for which α(c) ≤ α. This is the critical value we shall use, if our test statistic is T and we want level α. It is referred to as the level α critical value.

Once the level or critical value is fixed, the probabilities of type II error as θ ranges over Θ1 are determined. By convention 1 − P[type II error] is usually considered. The power of a test against the alternative θ is the probability of rejecting H when θ is true; it can be thought of as the probability that the test will "detect" that the alternative θ holds. The power is a function of θ on Θ1. Both the power and the probability of type I error are contained in the power function.

Definition 4.1.1. The power function of a test with critical value c is defined for all θ ∈ Θ by

β(θ) = β(θ, δ) = Pθ[Rejection] = Pθ[δ(X) = 1] = Pθ[T(X) ≥ c].

If θ ∈ Θ0, β(θ, δ) is just the probability of type I error, whereas if θ ∈ Θ1, β(θ, δ) is the power against θ.

Example 4.1.3 (continued). If Θ0 = {0.3} and n = 10, we find from binomial tables the level 0.05 critical value 6, and the test δ6 has size α(6) = Pθ0(S ≥ 6) = 0.0473. Here

β(θ, δk) = Pθ(S ≥ k) = Σ(j = k to n) (n choose j) θ^j (1 − θ)^(n−j).

A plot of this function for n = 10, θ0 = 0.3, k = 6 is given in Figure 4.1.1.

[Figure 4.1.1. Power function of the level 0.05 one-sided test δk of H : θ = 0.3 versus K : θ > 0.3 for the B(10, θ) family of distributions. Here k = 6 and the size is 0.0473. The power β(θ, δk) is plotted as a function of θ.]

Note that in this example the power at θ = θ1 > 0.3 is the probability that the level 0.05 test will detect an improvement of the recovery rate from 0.3 to θ1 > 0.3. When θ1 is 0.5, a 67% improvement, this probability is only 0.3770. What is needed to improve on this situation is a larger sample size n; we return to this question in Section 4.3. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives. From Figure 4.1.1 it appears that the power function is increasing (a proof will be given in Section 4.3). It follows that the level and size of the test are unchanged if instead of Θ0 = {θ0} we used Θ0 = [0, θ0]. That is,

α(k) = sup{Pθ[T(X) ≥ k] : θ ∈ Θ0} = Pθ0[T(X) ≥ k].
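The power function behind Figure 4.1.1 can be evaluated directly from the binomial sum; this short sketch (illustrative, not part of the text) checks the value 0.3770 at θ1 = 0.5 for n = 10, k = 6.

```python
from math import comb

def power(theta, n=10, k=6):
    """beta(theta, delta_k) = P_theta(S >= k), the power function of the
    test that rejects H: theta = 0.3 when S >= k."""
    return sum(comb(n, j) * theta ** j * (1 - theta) ** (n - j)
               for j in range(k, n + 1))

beta_at_half = power(0.5)   # probability of detecting theta = 0.5
```

The computed power is monotone in θ, consistent with the increasing shape of the plotted curve.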
Example 4.1.4. One-Sided Tests for the Mean of a Normal Distribution with Known Variance. Suppose that X = (X1, ..., Xn) is a sample from a N(μ, σ²) population with σ² known, and we want to test H : μ ≤ 0 versus K : μ > 0. (The σ² unknown case is treated in Section 4.5.) This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. For instance, suppose we want to see if a drug induces sleep. We might, for each of a group of n randomly selected patients, record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. Let Xi be the difference between the time slept after administration of the drug and time slept without administration of the drug by the ith patient. If we assume X1, ..., Xn are normally distributed with mean μ and variance σ², then the drug effect is measured by μ, and H is the hypothesis that the drug has no effect or is detrimental, whereas K is the alternative that it has some positive effect.

Because X̄ tends to be larger under K than under H, it is natural to reject H for large values of X̄. It is convenient to replace X̄ by the test statistic T(X) = √n X̄/σ, which generates the same family of critical regions. The power function of the test with critical value c is

β(μ) = Pμ[√n X̄/σ ≥ c] = Pμ[√n(X̄ − μ)/σ ≥ c − √n μ/σ] = 1 − Φ(c − √n μ/σ) = Φ(−c + √n μ/σ)   (4.1.2)

because 1 − Φ(z) = Φ(−z). Because β(μ) is increasing,

α(c) = sup{β(μ) : μ ≤ 0} = β(0) = 1 − Φ(c).

The smallest c for which α(c) ≤ α is obtained by setting 1 − Φ(c) = α, that is, c = z(1 − α), where z(1 − α) is the (1 − α) quantile of the N(0, 1) distribution.

The Heuristics of Test Construction

When hypotheses are expressed in terms of an estimable parameter H : θ ∈ Θ0 ⊂ R^p, and we have available a good estimate θ̂ of θ, it is clear that a reasonable test statistic is d(θ̂, Θ0), where d is the Euclidean (or some equivalent) distance and d(x, S) = inf{d(x, y) : y ∈ S}. In Example 4.1.2, p = P[AA], NAA/n is the MLE of p, and d(NAA/n, Θ0) = |NAA/n − 1/3|. In Example 4.1.4, θ̂ = X̄ estimates θ, and d(θ̂, Θ0) = (θ̂ − θ0)+, where y+ = y1(y > 0); rejecting for large values of this statistic is equivalent to rejecting for large values of X̄. This minimum distance principle is essentially what underlies Examples 4.1.5 and 4.1.6.

The task of finding a critical value is greatly simplified if Lθ(T(X)), the distribution of T(X) under θ, does not depend on θ for θ ∈ Θ0. This occurs if Θ0 is simple as in Example 4.1.3. But it occurs also in more interesting situations, such as testing μ = μ0 versus μ ≠ μ0 if we have N(μ, σ²) observations with both parameters unknown (the t tests of Section 4.5). The key feature of situations in which Lθ(T(X)) = L0 for all θ ∈ Θ0 is usually invariance under the action of a group of transformations; see Lehmann (1997) and Volume II for discussions of this property. In all of these cases, critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods. That is, if we generate i.i.d. T(X(1)), ..., T(X(B)) from L0, then the test that rejects iff T(X) > T((B+1)(1−α)), where T(1) ≤ ... ≤ T(B+1) are the ordered T(X), T(X(1)), ..., T(X(B)), has level α if L0 is continuous and (B + 1)(1 − α) is an integer (Problem 4.1.9).

Given a test statistic T(X), we need to determine critical values and eventually the power of the resulting tests.
Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward.

Example 4.1.5. Goodness of Fit Tests. Let X1, ..., Xn be i.i.d. as X ~ F, where F is continuous, and consider the problem of testing H : F = F0 versus K : F ≠ F0. Let F̂ denote the empirical distribution function and consider the sup distance between the hypothesis F0 and the plug-in estimate of F as a test statistic:

Dn = sup over x of |F̂(x) − F0(x)|,

which is called the Kolmogorov statistic. It can be shown (Problem 4.1.7) that Dn can be written as

Dn = max over 1 ≤ i ≤ n of max{ i/n − F0(x(i)), F0(x(i)) − (i − 1)/n },

where x(1) < ... < x(n) is the ordered observed sample, that is, the order statistics. This statistic has the following distribution-free property:

Proposition 4.1.1. The distribution of Dn under H is the same for all continuous F0. In particular, PF0(Dn ≤ d) = PU(Dn ≤ d), where U denotes the U(0, 1) distribution.

Proof. Set Ui = F0(Xi), 1 ≤ i ≤ n; then U1, ..., Un are i.i.d. U(0, 1). Moreover,

F̂(x) = n⁻¹ Σ 1{Xi ≤ x} = n⁻¹ Σ 1{F0(Xi) ≤ F0(x)} = Û(F0(x)),

where Û is the empirical distribution function of U1, ..., Un. As x ranges over R, u = F0(x) ranges over (0, 1); thus,

Dn = sup over 0 < u < 1 of |Û(u) − u|,

and the result follows.

The distribution of Dn has been thoroughly studied for finite and large n. In particular, for n ≥ 80 and hn(t) = t/(√n + 0.12 + 0.11/√n), close approximations to the size α critical values kα are hn(1.628), hn(1.358), and hn(1.224) for α = .01, .05, and .10, respectively. What is remarkable is that these critical values do not depend on which F0 we consider; this is again a consequence of invariance properties (Lehmann, 1997).
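The order-statistic form of Dn makes the Kolmogorov statistic one line to compute. In the sketch below (the data points and the uniform null F0(x) = x are illustrative choices, not from the text) the statistic is evaluated without any library support.

```python
def kolmogorov_D(xs, F0):
    """Kolmogorov statistic D_n = sup_x |Fhat(x) - F0(x)|, computed from
    the order statistics as max_i max(i/n - F0(x_(i)), F0(x_(i)) - (i-1)/n)."""
    n = len(xs)
    s = sorted(xs)
    return max(max(i / n - F0(s[i - 1]), F0(s[i - 1]) - (i - 1) / n)
               for i in range(1, n + 1))

# testing H: F = U(0, 1), i.e. F0(x) = x on [0, 1]
D = kolmogorov_D([0.1, 0.2, 0.6, 0.8], lambda x: x)
```

Any continuous F0 can be plugged in for the second argument; by Proposition 4.1.1 the null distribution of the resulting statistic is the same in every case.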
Example 4.1.6. Goodness of Fit to the Gaussian Family. Suppose X1, ..., Xn are i.i.d. as X ~ F and the hypothesis is H : F = Φ((· − μ)/σ) for some μ, σ², which is evidently composite. We can proceed as in Example 4.1.5, rewriting H as F(μ + σx) = Φ(x) for all x, where μ = EF(X1) and σ² = VarF(X1). The natural estimate of the parameter F(μ + σx) is F̂(X̄ + σ̂x), where X̄ and σ̂² are the MLEs of μ and σ². Applying the sup distance again, we obtain the statistic

Tn = sup over x of |F̂(X̄ + σ̂x) − Φ(x)| = sup over x of |Ĝ(x) − Φ(x)|,

where Ĝ is the empirical distribution of (Δ1, ..., Δn) with Δi = (Xi − X̄)/σ̂, 1 ≤ i ≤ n. Under H we can write Xi = μ + σZi, where Z1, ..., Zn are i.i.d. N(0, 1), so the joint distribution of (Δ1, ..., Δn) doesn't depend on μ, σ² and is that of ((Zi − Z̄)/σ̃, 1 ≤ i ≤ n) with σ̃² = n⁻¹ Σ(Zi − Z̄)². Therefore, under H, Tn has the same distribution L0 whatever be μ and σ², and the critical value may be obtained by simulating i.i.d. observations Z1, ..., Zn from N(0, 1) and then computing the Tn corresponding to those Zi, 1 ≤ i ≤ n. We do this B times independently, thereby obtaining Tn1, ..., TnB. The Monte Carlo critical value is then the [(B + 1)(1 − α)]th order statistic among Tn1, ..., TnB.

The p-Value: The Test Statistic as Evidence

Different individuals faced with the same testing problem may have different criteria of size. Experimenter I may be satisfied to reject the hypothesis H using a test with size α = 0.05, whereas experimenter II insists on using α = 0.01. It is then possible that experimenter I rejects the hypothesis H, whereas experimenter II accepts H on the basis of the same outcome x of an experiment. If the two experimenters can agree on a common test statistic T, this difficulty may be overcome by reporting the outcome of the experiment in terms of the observed size or p-value or significance probability of the test. This quantity is a statistic that is defined as the smallest level of significance α at which an experimenter using T would reject on the basis of the observed outcome x. That is, if the experimenter's critical value corresponds to a test of size less than the p-value, H is not rejected; otherwise, H is rejected.

Consider, for instance, Example 4.1.4. If we observe X = x = (x1, ..., xn), we would reject H if, and only if, √n x̄/σ ≥ z(1 − α), or, upon applying Φ to both sides, if, and only if, α ≥ Φ(−√n x̄/σ). Thus, the p-value is

Φ(−√n x̄/σ) = Φ(−T(x)).   (4.1.4)

Considered as a statistic, the p-value is Φ(−√n X̄/σ).
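The Monte Carlo recipe just described, simulating B independent copies of the statistic under H and taking the [(B + 1)(1 − α)]th order statistic as the critical value, can be sketched generically. The toy statistic (|Z̄| for twenty standard normals), the seed, and B = 999 below are illustrative choices, not from the text.

```python
import random

def monte_carlo_critical_value(simulate_T, B, alpha, rng):
    """Level-alpha Monte Carlo critical value: the ((B + 1)(1 - alpha))th
    order statistic of B simulated copies of T under H.  The level is
    exact when (B + 1)(1 - alpha) is an integer and the null law of T
    is continuous."""
    sims = sorted(simulate_T(rng) for _ in range(B))
    m = round((B + 1) * (1 - alpha))   # 1-based position in the ordered sims
    return sims[m - 1]

rng = random.Random(0)
stat = lambda r: abs(sum(r.gauss(0, 1) for _ in range(20)) / 20)
c = monte_carlo_critical_value(stat, 999, 0.05, rng)
```

For this toy statistic c should land near 1.96/√20 ≈ 0.44, the exact 95% point of |Z̄|; for the Gaussian goodness-of-fit statistic Tn one would replace `stat` by a function that draws Z1, ..., Zn and returns Tn.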
The p-value can be thought of as a standardized version of our original statistic. We will show that we can express the p-value simply in terms of the function α(·) defined earlier. Suppose that we observe X = x. The largest critical value c for which we would reject H is c = T(x); that is, if we use critical value c, we reject H if, and only if, T(x) > c. But the size of a test with critical value c is just α(c), and α(c) is decreasing in c. Thus, the smallest α for which we would reject corresponds to the largest c for which we would reject and is just α(T(x)). We have proved the following.

Proposition 4.1.1. The p-value is α(T(X)).

For instance, in Example 4.1.3, the p-value is α(s), where s is the observed value of S = Σ Xi. The normal approximation is used for the p-value also: for min{nθ0, n(1 − θ0)} > 5,

    α(s) = P_θ0(S ≥ s) ≈ 1 − Φ((s − ½ − nθ0)/[nθ0(1 − θ0)]^(1/2)).

The p-value is used extensively in situations of the type we described earlier, when H is well defined but K is not, so that type II error considerations are unclear. In this context, to quote Fisher (1958), "The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p. 80).

Note that α(T) takes values in the unit interval and that, when H is simple and T has a continuous distribution, α(T) has a uniform U(0, 1) distribution (Problem 4.1.4).

It is possible to use p-values to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. For example, if r experimenters use continuous test statistics T1, ..., Tr to produce p-values α(T1), ..., α(Tr), then, if H is simple, Fisher (1958) proposed using

    T = −2 Σ_{j=1}^r log α(Tj)    (4.1.6)

to test H. The statistic T has a chi-square distribution with 2r degrees of freedom (Problem 4.1.6). Thus, H is rejected if T ≥ x_{1−α}, where x_{1−α} is the (1 − α)th quantile of the χ²_{2r} distribution. Various methods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967); these kinds of issues are currently being discussed under the rubric of data fusion and meta-analysis (e.g., see Hedges and Olkin, 1985).
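Fisher's combining method (4.1.6) can be sketched in code as follows. The p-values below are made-up illustrative numbers; because 2r is even, the χ²_{2r} survival function has a closed form and no statistics library is needed.

```python
import math

def fisher_combined(pvalues):
    """Fisher's combining statistic T = -2 * sum(log p_j); under a simple H
    with continuous statistics, T ~ chi-square with 2r degrees of freedom."""
    return -2.0 * sum(math.log(p) for p in pvalues)

def chi2_sf_even(x, df):
    """P(chi2_df > x) for even df, using the closed form
    exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!."""
    term, total = 1.0, 0.0
    for j in range(df // 2):
        if j > 0:
            term *= (x / 2.0) / j
        total += term
    return math.exp(-x / 2.0) * total

# Three independent experiments reporting p-values for the same H (illustrative):
pvals = [0.08, 0.12, 0.20]
T = fisher_combined(pvals)
combined_p = chi2_sf_even(T, 2 * len(pvals))   # overall p-value for H
```

Note that none of the individual p-values is below .05, yet the combined evidence can be; this is the sense in which independent experiments bearing on the same H reinforce each other.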
Summary. We consider experiments in which important questions about phenomena can be turned into questions about whether a parameter θ belongs to Θ0 or Θ1, where Θ0 and Θ1 are disjoint subsets of the parameter space Θ. We introduce the basic concepts and terminology of testing statistical hypotheses: (null) hypothesis H and alternative (hypothesis) K, simple and composite hypotheses, test statistics, test functions, critical regions, type I error, type II error, significance level, size, power, power function, and p-value. In the Neyman-Pearson framework, we specify a small number α and construct tests that have at most probability (significance level) α of rejecting H (deciding K) when H is true, and, subject to this restriction, we try to maximize the probability (power) of rejecting H when K is true.

4.2 CHOOSING A TEST STATISTIC: THE NEYMAN-PEARSON LEMMA

We have seen how a hypothesis-testing problem is defined and how the performance of a given test δ, that is, of a given test statistic T, is measured in the Neyman-Pearson theory. Typically a test statistic is not given but must be chosen on the basis of its performance. In this section we consider the problem of finding the level α test that has, subject to the level constraint, the highest possible power. Such a test and the corresponding test statistic are called most powerful (MP).

The preceding section gave an example in which the hypothesis specifies a distribution completely: we test whether the distribution of X is different from a specified F0. This is an instance of testing goodness of fit. Here H is simple, whereas K typically is not.

We start with the problem of testing a simple hypothesis H : θ = θ0 versus a simple alternative K : θ = θ1. In Sections 3.2 and 3.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. In particular, the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by

    L(x, θ0, θ1) = p(x, θ1)/p(x, θ0),

where p(x, θ) is the density or frequency function of the random vector X. The statistic L takes on the value ∞ when p(x, θ1) > 0 and p(x, θ0) = 0 and, by convention, equals 0 when both numerator and denominator vanish. The statistic L is reasonable for testing H versus K, with large values of L favoring K over H. For example, in the binomial Example 4.1.3,

    L(x, θ0, θ1) = (θ1/θ0)^S [(1 − θ1)/(1 − θ0)]^(n−S),

which is large when S = Σ Xi is large, and S tends to be large when K : θ = θ1 > θ0 is true.
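The binomial likelihood ratio just displayed might be sketched in code as follows (the particular n and θ values are illustrative only); it depends on the data only through S and is monotone increasing in S when θ1 > θ0.

```python
def binomial_lr(s, n, theta0, theta1):
    """Simple likelihood ratio L(x, theta0, theta1) = p(x, theta1)/p(x, theta0)
    for n Bernoulli trials with s successes; the binomial coefficients cancel."""
    return (theta1 / theta0) ** s * ((1 - theta1) / (1 - theta0)) ** (n - s)

# With theta1 > theta0, L is strictly increasing in s, so "L large" = "S large":
n, th0, th1 = 10, 0.3, 0.5
ratios = [binomial_lr(s, n, th0, th1) for s in range(n + 1)]
```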
We call φ_k a likelihood ratio or Neyman-Pearson (NP) test (function) if for some 0 ≤ k ≤ ∞ we can write the test function φ_k as

    φ_k(x) = 1 if L(x, θ0, θ1) > k,
    φ_k(x) = 0 if L(x, θ0, θ1) < k,    (4.2.1)

with φ_k(x) any value in (0, 1) if equality occurs. Because we want results valid for all possible test sizes α in [0, 1], we consider randomized tests φ, which are tests that may take values in (0, 1). If 0 < φ(x) < 1 for the observation vector x, the interpretation is that we toss a coin with probability of heads φ(x) and reject H iff the coin shows heads. Such randomized tests are not used in practice; they are only needed to show that, with randomization, likelihood ratio tests are unbeatable no matter what the size α is. For instance, if we want size α = .05 in Example 4.1.3 with n = 10 and θ0 = 0.3, we choose φ(x) = 1 if S > 5, φ(x) = 0 if S < 5, and φ(x) = [.05 − P(S > 5)]/P(S = 5) = .0262 if S = 5.

Theorem 4.2.1. (Neyman-Pearson Lemma).
(a) If α > 0 and φ_k is a size α likelihood ratio test, then φ_k is MP in the class of level α tests.
(b) For each 0 < α < 1 there exists an MP size α likelihood ratio test provided that randomization is permitted.
(c) If φ is an MP level α test, then it must be a level α likelihood ratio test; that is, there exists k such that

    P_θ[φ(X) ≠ φ_k(X), L(X, θ0, θ1) ≠ k] = 0 for θ = θ0 and θ = θ1.    (4.2.2)

Proof. (a) Let E_i denote E_θi, i = 0, 1, and suppose φ is a level α test, so that E_0 φ(X) ≤ α = E_0 φ_k(X). We want to show E_1[φ_k(X) − φ(X)] ≥ 0. To this end consider

    E_1[φ_k(X) − φ(X)] − k E_0[φ_k(X) − φ(X)]
      = E_1{[φ_k(X) − φ(X)] 1[p(X, θ0) = 0]} + E_0{[φ_k(X) − φ(X)][L(X, θ0, θ1) − k]}
      = I + II (say).    (4.2.3)

Because L(x, θ0, θ1) = ∞ when p(x, θ0) = 0, we have φ_k(x) = 1 on the set {x : p(x, θ0) = 0}, and because 0 ≤ φ(x) ≤ 1, it follows that I ≥ 0. In II, [L(x, θ0, θ1) − k] is < 0 or > 0 according as φ_k(x) is 0 or 1; thus, [φ_k(x) − φ(x)][L(x, θ0, θ1) − k] ≥ 0, and II ≥ 0. We have shown that

    E_1[φ_k(X) − φ(X)] ≥ k E_0[φ_k(X) − φ(X)] = k[α − E_0 φ(X)] ≥ 0.    (4.2.4)

Note that α > 0 implies k < ∞; thus, φ_k is MP for level α.

(b) If α = 0, k = ∞ makes φ_k MP size 0. Next consider 0 < α < 1. Then there exists k < ∞ such that P_0[L(X, θ0, θ1) > k] ≤ α and P_0[L(X, θ0, θ1) ≥ k] ≥ α. If P_0[L(X, θ0, θ1) = k] = 0, the test (4.2.1) has size α. If not, set

    φ_k(x) = (α − P_0[L(X, θ0, θ1) > k])/P_0[L(X, θ0, θ1) = k] on the set {x : L(x, θ0, θ1) = k}.

Then E_0 φ_k(X) = α, and φ_k is MP size α by (a). If α = 1, k = 0 makes φ_k MP size 1.

(c) Let x ∈ {x : p(x, θ0) > 0, L(x, θ0, θ1) ≠ k}. If φ is an MP level α test, then to have equality in (4.2.4) we need to have φ(x) = φ_k(x) = 1 when L(x, θ0, θ1) > k and φ(x) = φ_k(x) = 0 when L(x, θ0, θ1) < k. The same argument works for x in {x : p(x, θ0) = 0}. It follows that (4.2.2) holds for θ = θ0 and θ = θ1. □

Corollary 4.2.1. If φ is an MP level α test, then E_θ1 φ(X) ≥ α, with equality if and only if p(·, θ0) = p(·, θ1).

It follows from the Neyman-Pearson lemma that an MP test has power at least as large as its level. See Problem 4.2.7.

Remark 4.2.1. Note (Section 3.2) that φ_k is a Bayes rule with k = π/(1 − π), where π denotes the prior probability of {θ0}. The posterior probability of θ1 is

    π(θ1 | x) = (1 − π)p(x, θ1)/[(1 − π)p(x, θ1) + π p(x, θ0)] = (1 − π)L/[(1 − π)L + π],

where L = L(x, θ0, θ1); the Bayes procedure decides θ1 or θ0 according as π(θ1 | x) is larger or smaller than 1/2, that is, according as L is larger or smaller than π/(1 − π) = k. Part (a) of the lemma can, therefore, also be easily argued from this Bayes property of φ_k (Problem 4.2.10).

Here is an example illustrating calculation of the most powerful level α test φ_k.

Example 4.2.1. Suppose X = (X1, ..., Xn), where X1, ..., Xn is a sample of n N(μ, σ²) random variables with σ² known, and we test H : μ = 0 versus K : μ = v, where v > 0 is a known signal. We found

    L(X, 0, v) = exp{(v/σ²) Σ_{i=1}^n Xi − nv²/(2σ²)}.

Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. Rejecting H for L large is therefore equivalent to rejecting for

    T(X) = √n X̄/σ = (σ/(v√n))[log L(X, 0, v) + nv²/(2σ²)]

large.
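The size and power claims for the test of Example 4.2.1 can be checked numerically. The following Monte Carlo sketch (the choices of n, v, σ, and the replication count are illustrative, not from the text) estimates the type I error of the test that rejects when T = √n X̄/σ ≥ z(1 − α) and compares its power with Φ(z(α) + v√n/σ).

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)
n, sigma, v, alpha = 25, 1.0, 0.5, 0.05
z = NormalDist().inv_cdf(1 - alpha)                 # critical value z(1 - alpha)

def rejects(mu):
    """One simulated experiment: reject H iff sqrt(n) * Xbar / sigma >= z."""
    xbar = fmean(random.gauss(mu, sigma) for _ in range(n))
    return math.sqrt(n) * xbar / sigma >= z

reps = 20000
size = sum(rejects(0.0) for _ in range(reps)) / reps     # should be near alpha
power = sum(rejects(v) for _ in range(reps)) / reps
theory = NormalDist().cdf(-z + v * math.sqrt(n) / sigma) # Phi(z(alpha) + v*sqrt(n)/sigma)
```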
But T is the test statistic we proposed in Example 4.1.4. From our discussion there we know that, for any specified α, the test that rejects if, and only if,

    T ≥ z(1 − α)    (4.2.6)

has probability of type I error α. The power of this test is Φ(z(α) + v√n/σ). By the Neyman-Pearson lemma this is the largest power available with a level α test. Thus, if we want the probability of detecting a signal v to be at least a preassigned value β (say, .90 or .95), then we solve Φ(z(α) + v√n/σ) = β for n and find that we need to take

    n = (σ/v)²[z(1 − α) + z(β)]².

This is the smallest possible n for any size α test. □

An interesting feature of the preceding example is that the test defined by (4.2.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > 0. Such a test is called uniformly most powerful (UMP). We will discuss the phenomenon further in the next section. The following important example illustrates, among other things, that the UMP test phenomenon is largely a feature of one-dimensional parameter problems.

Example 4.2.2. Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Function. Suppose X ~ N(μ_j, Σ_j) under θ_j = (μ_j, Σ_j), j = 0, 1. The likelihood ratio test for H : θ = θ0 versus K : θ = θ1 is based on the ratio of the two multivariate normal densities. Note that in general this statistic depends intrinsically on μ0, μ1, Σ0, Σ1. Particularly important is the case Σ0 = Σ1, in which rejecting for the likelihood ratio large is equivalent to rejecting for

    F = (μ1 − μ0)' Σ0^{−1} X

large. The function F is known as the Fisher discriminant function. It is used in a classification context in which θ0, θ1 correspond to two known populations and we desire to classify a new observation X as belonging to one or the other. In this example we have assumed that θ0 and θ1, for the two populations, are known. If this is not the case, they are estimated with their empirical versions, with sample means estimating population means and sample covariances estimating population covariances.

If μ1 = μ0 + λΔ0 with λ > 0 and Σ1 = Σ0, then a UMP (for all λ > 0) test exists and is given by: Reject if

    F ≥ c,    (4.2.7)

where c = z(1 − α)[Δ0' Σ0^{−1} Δ0]^(1/2) (Problem 4.2.8). For instance, if Δ0 = (1, 0)^T and Σ0 = I, then this test rule is to reject H if X1 is large. If Σ1 ≠ Σ0, however, this is no longer the case (Problem 4.2.9). We return to this in Volume II. □
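The sample-size formula n = (σ/v)²[z(1 − α) + z(β)]² above can be sketched as follows; the numeric inputs in the usage line are illustrative.

```python
import math
from statistics import NormalDist

def min_sample_size(sigma, v, alpha, beta):
    """Smallest n giving power >= beta at signal v for the size-alpha test (4.2.6):
    (sigma/v)^2 * (z(1-alpha) + z(beta))^2, rounded up to an integer."""
    z = NormalDist().inv_cdf
    return math.ceil((sigma / v) ** 2 * (z(1 - alpha) + z(beta)) ** 2)

# E.g., detecting a signal of half a standard deviation with power .90 at level .05:
n_needed = min_sample_size(sigma=1.0, v=0.5, alpha=0.05, beta=0.90)
```

By the monotonicity of the power in n, the achieved power Φ(z(α) + v√n/σ) at the returned n is then at least β.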
Summary. We introduce the simple likelihood ratio statistic and the simple likelihood ratio (SLR) test for testing the simple hypothesis H : θ = θ0 versus the simple alternative K : θ = θ1. The Neyman-Pearson lemma, which states that the size α SLR test is uniquely most powerful (MP) in the class of level α tests, is established. We note the connection of the MP test to the Bayes procedure of Section 3.2 for deciding between θ0 and θ1. Two examples in which the MP test does not depend on θ1 are given.

4.3 UNIFORMLY MOST POWERFUL TESTS AND MONOTONE LIKELIHOOD RATIO MODELS

We saw in the two Gaussian examples of Section 4.2 that UMP tests for one-dimensional parameter problems exist; the MP test did not depend on the particular alternative. Such tests are said to be UMP (uniformly most powerful). Here is the general definition of UMP:

Definition 4.3.1. A level α test φ* is uniformly most powerful (UMP) for H : θ ∈ Θ0 versus K : θ ∈ Θ1 if

    β(θ, φ*) ≥ β(θ, φ) for all θ ∈ Θ1,    (4.3.1)

for any other level α test φ.

This phenomenon is not restricted to the Gaussian case, as the next example illustrates.

Example 4.3.1. Testing for a Multinomial Vector. Suppose that (N1, ..., Nk) has a multinomial M(n, θ1, ..., θk) distribution with frequency function

    p(n1, ..., nk, θ) = [n!/(n1! ··· nk!)] θ1^{n1} ··· θk^{nk},

where n1, ..., nk are integers summing to n. For instance, if a match in a genetic breeding experiment can result in k types, n offspring are observed, and Ni is the number of offspring of type i, then (N1, ..., Nk) ~ M(n, θ1, ..., θk). The simple hypothesis H : θ1 = θ10, ..., θk = θk0 would correspond to the theory that the expected proportions of offspring of types 1, ..., k are given by θ10, ..., θk0. Usually the alternative to H is composite. However, there is sometimes a simple alternative theory K : θ1 = θ11, ..., θk = θk1. In this case, the likelihood ratio L is

    L = Π_{i=1}^k (θi1/θi0)^{Ni}.

Here is an interesting special case: Suppose θj0 > 0 for all j and that, for some 0 < ε < 1 and some fixed integer l with 1 ≤ l ≤ k,

    θl1 = ε θl0 and θj1 = ρ θj0 for j ≠ l, where ρ = (1 − ε θl0)/(1 − θl0).    (4.3.2)
That is, under the alternative, type l is less frequent than under H, and the conditional probabilities of the other types, given that type l has not occurred, are the same under K as they are under H. In this case,

    L = ρ^(n − N_l) ε^(N_l) = ρ^n (ε/ρ)^(N_l),

where ρ = (1 − ε θl0)/(1 − θl0) as in (4.3.2). Because ε < 1 implies that ρ > ε, L is decreasing in N_l, and we conclude that the MP test rejects H if, and only if, N_l ≤ c. Critical values for level α are easily determined because N_l ~ B(n, θl0) under H. Because c does not depend on ε, this test is UMP for testing H versus K : θ ∈ Θ1 = {θ : θ satisfies (4.3.2) with 0 < ε < 1}. Note that because l can be any of the integers 1, ..., k, we get radically different best tests depending on which alternative we consider. □

Typically the MP test of H : θ = θ0 versus K : θ = θ1 depends on θ1, and the test is not UMP. However, we have seen three models where, in the case of a real parameter, there is a statistic T such that the test with critical region {x : T(x) ≥ c} is UMP. This is part of a general phenomenon we now describe.

Definition 4.3.2. The family of models {P_θ : θ ∈ Θ} with Θ ⊂ R is said to be a monotone likelihood ratio (MLR) family if for θ1 < θ2 the distributions P_θ1 and P_θ2 are distinct and the ratio p(x, θ2)/p(x, θ1) is an increasing function of T(x).

Example 4.3.2 (Example 4.1.3 continued). In this i.i.d. Bernoulli case, set s = Σ_{i=1}^n xi; then

    p(x, θ) = θ^s (1 − θ)^(n−s) = (1 − θ)^n [θ/(1 − θ)]^s,

and for θ1 < θ2 the ratio p(x, θ2)/p(x, θ1) is an increasing function of s, so this family is MLR in s. □

Consider the one-parameter exponential family model

    p(x, θ) = h(x) exp{η(θ)T(x) − B(θ)}.

If η(θ) is strictly increasing in θ ∈ Θ, then this family is MLR in T(x). Example 4.2.1 is of this form with T(x) = √n x̄/σ and η(μ) = √n μ/σ, where σ is known. The B(n, θ) model of Example 4.3.2 is of this form as well, with T(x) = s and η(θ) = log[θ/(1 − θ)].

If {P_θ : θ ∈ Θ} is an MLR family in T(x), then for θ1 < θ2, L(x, θ1, θ2) = h(T(x)) for some increasing function h. Define the test function

    δ_t(x) = 1 if T(x) > t, δ_t(x) = 0 if T(x) < t,    (4.3.3)

with δ_t(x) any value in (0, 1) if T(x) = t. For each t, δ_t equals the likelihood ratio test φ_{h(t)} and is MP for testing θ1 versus θ2.

Theorem 4.3.1. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(x).
(1) For each t ∈ (0, ∞), the power function β(θ) = E_θ δ_t(X) is increasing in θ.
(2) If E_θ0 δ_t(X) = α > 0, then δ_t is UMP level α for testing H : θ ≤ θ0 versus K : θ > θ0.
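For the binomial case of Example 4.3.2, Theorem 4.3.1 says the UMP level α test of H : θ ≤ θ0 rejects for large S. The following sketch finds the critical value, together with the randomization probability needed to attain size α exactly (as in the randomized test discussed in Section 4.2); only the standard library is used.

```python
from math import comb

def binom_sf(k, n, p):
    """Exact upper tail P(S >= k) for S ~ B(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

def ump_binomial(n, theta0, alpha):
    """Smallest c with P_theta0(S >= c) <= alpha, plus the probability gamma of
    rejecting when S == c - 1 that makes the randomized test have size alpha."""
    c = next(k for k in range(n + 2) if binom_sf(k, n, theta0) <= alpha)
    pmf_at_boundary = comb(n, c - 1) * theta0 ** (c - 1) * (1 - theta0) ** (n - c + 1)
    gamma = (alpha - binom_sf(c, n, theta0)) / pmf_at_boundary
    return c, gamma

c, gamma = ump_binomial(n=10, theta0=0.3, alpha=0.05)
# Reject if S >= c; if S == c - 1, reject with probability gamma.
```

For n = 10, θ0 = 0.3, α = .05 this recovers the randomized test of Section 4.2: reject if S > 5, and toss a coin with success probability γ ≈ .026 if S = 5.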
Proof. (1) follows from the fact that δ_t equals the MP test φ_{h(t)} of θ1 versus θ2 for any θ1 < θ2, together with Corollary 4.2.1: an MP test has power at least as large as its level. To show (2), recall that δ_t maximizes the power for testing H : θ = θ0 versus K : θ > θ0 among the class of tests with level α = E_θ0 δ_t(X). By (1), E_θ δ_t(X) ≤ α for θ ≤ θ0, so δ_t is of level α for H : θ ≤ θ0. Because the class of tests with level α for H : θ ≤ θ0 is contained in the class of tests with level α for H : θ = θ0, and because δ_t maximizes the power over this larger class, δ_t is UMP for H : θ ≤ θ0 versus K : θ > θ0. □

The following useful result follows immediately.

Corollary 4.3.1. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(x). If the distribution function F0 of T(X) under X ~ P_θ0 is continuous and if t(1 − α) is a solution of F0(t) = 1 − α, then the test that rejects H if, and only if, T(x) ≥ t(1 − α) is UMP level α for testing H : θ ≤ θ0 versus K : θ > θ0.

Example 4.3.3. Quality Control. Suppose that, as in Example 1.1.1, X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives, where b = Nθ. If the inspector making the test considers lots with b0 = Nθ0 defectives or more unsatisfactory, she formulates the hypothesis H as θ ≥ θ0, the alternative K as θ < θ0, and specifies an α such that the probability of rejecting H (keeping a bad lot) is at most α. For simplicity suppose that b0 ≥ n and N − b0 ≥ n. Then, if Nθ1 = b1 < b0 and 0 ≤ x ≤ b1, the hypergeometric model yields

    L(x, θ0, θ1) = { [b1]_x [N − b1]_{n−x} } / { [b0]_x [N − b0]_{n−x} },

where [m]_j = m(m − 1) ··· (m − j + 1) denotes the falling factorial.

Example 4.3.4. Testing Precision. Suppose X1, ..., Xn is a sample from a N(μ, σ²) population, where μ is a known standard, and we are interested in the precision σ^{−1} of the measurements X1, ..., Xn. For instance, we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. Because the most serious error is to judge the precision adequate when it is not, we test H : σ ≥ σ0 versus K : σ < σ0, where σ0 represents the minimum tolerable precision. Let S = Σ_{i=1}^n (Xi − μ)². If we write

    p(x, σ) = exp{−S/(2σ²) − (n/2) log(2πσ²)},

we see that this is a one-parameter exponential family and is MLR in T = −S. The UMP level α test rejects H if, and only if, S ≤ s(α), where s(α) is such that P_σ0(S ≤ s(α)) = α. Because S/σ0² has a χ²_n distribution under σ = σ0, the critical constant s(α) is σ0² x_n(α), where x_n(α) is the αth quantile of the χ²_n distribution. □
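For Example 4.3.4 the critical constant s(α) = σ0² x_n(α) needs a chi-square quantile. A minimal stdlib-only sketch, assuming an even number of observations (so the χ²_n CDF has a closed form) and illustrative n, σ0, α values:

```python
import math

def chi2_cdf_even(x, df):
    """P(chi2_df <= x) for even df: 1 - exp(-x/2) * sum_{j<df/2} (x/2)^j / j!."""
    term, tail = 1.0, 0.0
    for j in range(df // 2):
        if j > 0:
            term *= (x / 2.0) / j
        tail += term
    return 1.0 - math.exp(-x / 2.0) * tail

def chi2_quantile_even(p, df):
    """p-th quantile x_df(p) of the chi2_df distribution (even df) by bisection."""
    lo, hi = 0.0, 10.0 * df + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Reject H: sigma >= sigma0 when S = sum (X_i - mu)^2 <= s_crit:
n, sigma0, alpha = 20, 2.0, 0.05
s_crit = sigma0 ** 2 * chi2_quantile_even(alpha, n)
```

For odd df one would instead use a numerical incomplete-gamma routine or a statistics package; the closed form above relies on df/2 being an integer.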
Note that, for b1 < x ≤ n, L(x, θ0, θ1) = 0, and that for 0 ≤ x < b1,

    L(x + 1, θ0, θ1)/L(x, θ0, θ1) = [(b1 − x)(N − b0 − n + x + 1)] / [(b0 − x)(N − b1 − n + x + 1)] < 1.

Thus, L is decreasing in x, and the hypergeometric model is an MLR family in T(x) = x. Applying Theorem 4.3.1 with the inequalities reversed, the UMP level α test rejects H if, and only if, X ≤ h(α), where h(α) is the αth quantile of the hypergeometric distribution of X when b = b0. The critical values for the hypergeometric distribution are available on statistical calculators and software. □

Power and Sample Size

In the Neyman-Pearson framework we choose the test whose size is small; that is, we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. On the other hand, we would also like large power β(θ) when θ ∈ Θ1; that is, we want the probability of correctly detecting an alternative K to be large. However, this is, in general, not possible for all parameters in the alternative Θ1. Typically H and K are of the form H : θ ≤ θ0 and K : θ > θ0, and the power function β(θ) is a continuous increasing function with lim_{θ↓θ0} β(θ) = β(θ0) = α; this is a general phenomenon in MLR family models with p(x, θ) continuous in θ. We can therefore find θ > θ0 sufficiently close to θ0 so that β(θ) is arbitrarily close to β(θ0) = α, and for such θ the probability of falsely accepting H is almost 1 − α. This continuity of the power shows that not too much significance can be attached to acceptance of H, if all points in the alternative are of equal significance.

This is not serious in practice if we have an indifference region. This is a subset of the alternative on which we are willing to tolerate low power. In the normal Example 4.2.1 we might be uninterested in values of μ in (0, Δ) for some small Δ > 0, because such improvements are negligible; thus, (0, Δ) would be our indifference region. Off the indifference region, that is, for μ ≥ Δ, we want guaranteed power as well as an upper bound on the probability of type I error: we specify β close to 1 and would like to have β(μ) ≥ β for all μ ≥ Δ. Because β(·) is increasing, it is enough to have β(Δ) ≥ β, and this is possible for arbitrary β < 1 only by making the sample size n large enough. In Example 4.2.1,

    β(Δ) = Φ(z(α) + √n Δ/σ)

for sample size n, and the appropriate n is obtained by solving β(Δ) = β. This equation is equivalent to

    z(α) + √n Δ/σ = z(β),

whose solution is

    n = (σ/Δ)²[z(1 − α) + z(β)]².

Note that a small signal-to-noise ratio Δ/σ will require a large sample size n.
Dual to the problem of not having enough power is that of having too much: if n is very large and/or α is small, we can have very great power for alternatives very close to θ0. This problem arises particularly in goodness-of-fit tests (see Example 4.1.5), when we test the hypothesis that a very large sample comes from a particular distribution. Such hypotheses are often rejected even though for practical purposes "the fit is good enough." The reason is that n is so large that unimportant small discrepancies are picked up. It is natural to associate statistical significance with practical significance, so that a very low p-value is interpreted as evidence that the alternative that holds is far from the hypothesis, that is, physically significant. There are various ways of dealing with this problem. They often reduce to adjusting the critical value so that the probability of rejection for a parameter value at the boundary of some indifference region is α.

As a further example and precursor to Section 5.4, we next show how to find the sample size that will "approximately" achieve desired power β for the size α test in the binomial example.

Example 4.3.5 (Example 4.1.3 continued). Our discussion uses the classical normal approximation to the binomial distribution. First, to achieve approximate size α, we solve β(θ0) = P_θ0(S ≥ s) = α for s using (4.1.4) and find the approximate critical value

    s0 ≈ nθ0 + 1/2 + z(1 − α)[nθ0(1 − θ0)]^(1/2),

so that H is rejected if, and only if, S ≥ s0. Again using the normal approximation, we find

    β(θ) = P_θ(S ≥ s0) ≈ 1 − Φ((s0 − 1/2 − nθ)/[nθ(1 − θ)]^(1/2)).

Now consider the indifference region (θ0, θ1), where θ1 = θ0 + Δ, Δ > 0. We solve β(θ1) = β for n. With α = .05, β = .90, θ0 = 0.30, and θ1 = 0.35, this yields approximately n = 163; that is, the size .05 binomial test of H : θ = 0.3 requires approximately 163 observations to have probability .90 of detecting the 17% increase in θ from 0.3 to 0.35. The power achievable (exactly, using the S-PLUS package) for the level .05 test with θ1 = .35 and n = 163 is 0.86. □

Our discussion can be generalized. Suppose θ is a vector. Often there is a function q(θ) such that H and K can be formulated as H : q(θ) ≤ q0 and K : q(θ) > q0.
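The normal-approximation power calculation just described can be sketched as follows (the parameter values in the usage line are placeholders, and exact binomial computations will differ somewhat from this approximation):

```python
import math
from statistics import NormalDist

def approx_power(n, theta0, theta, alpha):
    """Power at theta of the approximately size-alpha test of H: theta <= theta0,
    which rejects when S >= s0 with
    s0 = n*theta0 + 1/2 + z(1-alpha) * sqrt(n*theta0*(1-theta0))."""
    nd = NormalDist()
    s0 = n * theta0 + 0.5 + nd.inv_cdf(1 - alpha) * math.sqrt(n * theta0 * (1 - theta0))
    return 1.0 - nd.cdf((s0 - 0.5 - n * theta) / math.sqrt(n * theta * (1 - theta)))

# Power against theta = 0.35 rises with n when theta0 = 0.30 and alpha = .05:
curve = [approx_power(n, 0.30, 0.35, 0.05) for n in (50, 200, 800)]
```

At θ = θ0 the expression reduces to 1 − Φ(z(1 − α)) = α, which is a quick consistency check on the approximation.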
Now let q1 > q0 be a value such that we want to have power β(θ) at least β when q(θ) ≥ q1. The set {θ : q0 < q(θ) < q1} is our indifference region. For each n suppose we have a level α test for H versus K based on a suitable test statistic Tn. Suppose that β(θ) depends on θ only through q(θ), is a continuous increasing function of q(θ), and also increases to 1 for fixed θ ∈ Θ1 as n → ∞. To achieve level α and power at least β, first let c0 be the smallest number c such that

    P_θ0[Tn ≥ c] ≤ α.

Then let n be the smallest integer such that P_θ1[Tn ≥ c0] ≥ β, where θ0 is such that q(θ0) = q0 and θ1 is such that q(θ1) = q1. Implicit in this calculation is the assumption that P_θ1[Tn ≥ c0] is an increasing function of n.

Example 4.3.6. Testing Precision Continued. Suppose that in the Gaussian model of Example 4.3.4 the mean μ is unknown. Then the MLE of σ² is σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)², as in Chapter 2, and the distribution of Tn = nσ̂²/σ0² is χ²_{n−1}, independent of μ. Although H : σ = σ0 is now composite, the distribution of Tn is fixed under H, and the critical value for testing H : σ ≥ σ0 versus K : σ < σ0 and rejecting H if Tn is small is the α percentile of χ²_{n−1}. It is evident from the argument of Example 4.3.4 that this test is UMP for H : σ ≥ σ0 versus K : σ < σ0 among all tests depending on σ̂² only. Reducing the problem to choosing among such tests comes from invariance considerations that we do not enter into until Volume II. □

We have seen in Example 4.3.6 that a particular test statistic can have a fixed distribution L0 under the hypothesis. It may also happen that the distribution of Tn, as θ ranges over Θ, is determined by a one-dimensional parameter λ(θ), so that Θ0 = {θ : λ(θ) ≤ 0}, Θ1 = {θ : λ(θ) > 0}, and L_θ(Tn) = L_{λ(θ)}(Tn) for all θ. The theory we have developed demonstrates that if {L_λ(Tn)} is an MLR family, then rejecting for large values of Tn is UMP among all tests based on Tn. This procedure can be applied, for instance, to the F test of the linear model in Section 6.1, by taking q(θ) equal to the noncentrality parameter governing the distribution of the statistic under the alternative.

Complete Families of Tests

The Neyman-Pearson framework is based on using the 0-1 loss function. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(θ, a), a ∈ A = {0, 1}, that are not 0-1. For instance, when testing H : θ ≤ θ0 versus K : θ > θ0, a reasonable class of loss functions are those that satisfy

    l(θ, 1) − l(θ, 0) > 0 for θ < θ0,
    l(θ, 1) − l(θ, 0) < 0 for θ > θ0.    (4.3.4)

That is, the wrong decision costs more than the right one.
The class D of decision procedures is said to be complete if for any decision rule φ not in D there exists δ ∈ D such that

    R(θ, δ) ≤ R(θ, φ) for all θ ∈ Θ.    (4.3.5)

Thus, if the model is correct and the loss function is appropriate, then any procedure not in the complete class can be matched or improved at all θ by one in the complete class, and it isn't worthwhile to look outside of complete classes. In the following, the decision procedures are test functions.

Theorem 4.3.2. Suppose {P_θ : θ ∈ Θ}, Θ ⊂ R, is an MLR family in T(x) and suppose the loss function l(θ, a) satisfies (4.3.4). Then the class of tests of the form (4.3.3), with E_θ0 δ_t(X) = α for some θ0 and 0 ≤ α ≤ 1, is complete.

Proof. The risk function of any test rule φ is

    R(θ, φ) = E_θ{φ(X) l(θ, 1) + [1 − φ(X)] l(θ, 0)}
            = l(θ, 0) + [l(θ, 1) − l(θ, 0)] E_θ φ(X).

If E_θ φ(X) = 0 for all θ, then the test that never rejects clearly satisfies (4.3.5). Otherwise, let δ_t(X) be such that, for some θ0, E_θ0 δ_t(X) = E_θ0 φ(X) = α > 0. Then

    R(θ, δ_t) − R(θ, φ) = [l(θ, 1) − l(θ, 0)][E_θ δ_t(X) − E_θ φ(X)].

Now δ_t is UMP for H : θ ≤ θ0 versus K : θ > θ0 by Theorem 4.3.1 and, hence, E_θ δ_t(X) ≥ E_θ φ(X) for θ > θ0. Similarly, 1 − δ_t is UMP for H : θ ≥ θ0 versus K : θ < θ0 (Problem 4.3.12), so that E_θ[1 − δ_t(X)] ≥ E_θ[1 − φ(X)], that is, E_θ δ_t(X) ≤ E_θ φ(X), for θ < θ0. Combining these inequalities with (4.3.4), we conclude that R(θ, δ_t) − R(θ, φ) ≤ 0 for all θ; that is, (4.3.5) holds. □

Thus, the class of NP tests is complete in the sense that, for loss functions other than the 0-1 loss function that satisfy (4.3.4), the risk of any procedure can be matched or improved by an NP test.

Summary. We consider models {P_θ : θ ∈ Θ} for which there exist tests that are most powerful for every θ in a composite alternative Θ1 (UMP tests). For θ real, a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing θ0 versus θ1 is an increasing function of a statistic T(x) for every θ0 < θ1. We show that for MLR models, the test that rejects H : θ ≤ θ0 for large values of T(x) is UMP for K : θ > θ0. We show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H. We also note that, when UMP tests do not exist, locally most powerful (LMP) tests can in some cases be found. Finally, we show that for MLR models, the class of NP tests is complete.

4.4 CONFIDENCE BOUNDS, INTERVALS, AND REGIONS

We have in Chapter 2 considered the problem of obtaining precise estimates of parameters, and we have in this chapter treated the problem of deciding whether the parameter θ is a member of a specified set Θ0.
In many situations where we want an indication of the accuracy of an estimator, we want lower and/or upper bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α. Suppose, for instance, that μ represents the mean increase in sleep among patients administered a drug. Then we can use the experimental outcome X = (X1, ..., Xn) to establish a lower bound μ(X) for μ with a prescribed probability (1 − α) of being correct. As an illustration, consider Example 4.1.4, where X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known, and we look for a statistic μ(X) that satisfies

    P(μ(X) ≤ μ) = 1 − α

for a prescribed (1 − α) such as .95. We find such a bound by noting that √n(X̄ − μ)/σ has a N(0, 1) distribution, so that

    P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α.

Solving the inequality inside the probability for μ, we find

    P(X̄ − σ z(1 − α)/√n ≤ μ) = 1 − α,

and a solution is μ(X) = X̄ − σ z(1 − α)/√n; μ(X) is a lower confidence bound with confidence level 1 − α. Similarly, we may be interested in an upper bound on a parameter; in the N(μ, σ²) example this means finding a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α, and a solution is μ̄(X) = X̄ + σ z(1 − α)/√n, called an upper level (1 − α) confidence bound for μ.

Finally, in many situations we want both lower and upper bounds; that is, we want to find a constant a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. In our example this is achieved by writing

    P(X̄ − σ z(1 − α/2)/√n ≤ μ ≤ X̄ + σ z(1 − α/2)/√n) = 1 − α.

We say that the interval X̄ ± σ z(1 − α/2)/√n is a level (1 − α) confidence interval for μ.

In the non-Bayesian framework, μ is a constant; it is the interval that is random, and the probability statements refer to its random endpoints. More generally, let ν = ν(P) be a parameter of interest, where X ~ P, P ∈ P, X ∈ R^q. It may not be possible for a bound or interval to achieve exactly probability (1 − α); in that case we settle for a probability at least (1 − α).
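The lower bound, upper bound, and two-sided interval just derived can be sketched as follows (the sample values in the last line are illustrative; `NormalDist.inv_cdf` supplies the quantiles z(p)):

```python
import math
from statistics import NormalDist

def normal_mean_bounds(xbar, sigma, n, alpha):
    """Level (1 - alpha) lower bound, upper bound, and two-sided confidence
    interval for mu when X1, ..., Xn are N(mu, sigma^2) with sigma known."""
    z = NormalDist().inv_cdf
    one_sided = sigma * z(1 - alpha) / math.sqrt(n)
    two_sided = sigma * z(1 - alpha / 2) / math.sqrt(n)
    return (xbar - one_sided,                  # lower confidence bound
            xbar + one_sided,                  # upper confidence bound
            (xbar - two_sided, xbar + two_sided))

lower, upper, interval = normal_mean_bounds(xbar=1.24, sigma=1.0, n=10, alpha=0.05)
```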
P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α for all P ∈ P.

Definition 4.4.1. A random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is called a level (1 − α) or 100(1 − α)% confidence interval for ν if, for all P ∈ P, P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α. Similarly, a statistic ν̲(X) is called a level (1 − α) lower confidence bound for ν if for every P ∈ P, P[ν̲(X) ≤ ν] ≥ 1 − α; and ν̄(X) is a level (1 − α) upper confidence bound for ν if for every P ∈ P, P[ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique because any number (1 − α′) ≤ (1 − α) will also be a confidence level. In order to avoid this ambiguity it is convenient to define the confidence coefficient to be the largest possible confidence level (i.e., the minimum probability of coverage). Note that in the case of intervals this is just inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P}. For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Example 4.4.1. The (Student) t Interval and Bounds. Let X₁, …, Xₙ be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − ½α) ≤ Z(μ) ≤ z(1 − ½α) for μ; in this process Z(μ) is called a pivot. Now we turn to the case in which σ² is unknown and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

s² = (1/(n − 1)) Σᵢ₌₁ⁿ (Xᵢ − X̄)².

By Theorem B.3.3, Z(μ) has a N(0, 1) distribution and is independent of V = (n − 1)s²/σ², which has a χ²ₙ₋₁ distribution. We conclude from the definition of the (Student) t distribution in Section B.3 that

T(μ) = Z(μ)/√(V/(n − 1)) = √n(X̄ − μ)/s.
The statistic T(μ) = √n(X̄ − μ)/s has the t distribution Tₙ₋₁ and can be used as a pivot. Let tₖ(p) denote the pth quantile of the Tₖ distribution. Then

P(−tₙ₋₁(1 − ½α) ≤ T(μ) ≤ tₙ₋₁(1 − ½α)) = 1 − α.

Solving the inequality inside the probability for μ, we find

P[X̄ − s tₙ₋₁(1 − ½α)/√n ≤ μ ≤ X̄ + s tₙ₋₁(1 − ½α)/√n] = 1 − α    (4.4.1)

whatever be μ and σ². The shortest level (1 − α) confidence interval of the type X̄ ± sc/√n is, thus, given by (4.4.1). Similarly, X̄ − s tₙ₋₁(1 − α)/√n and X̄ + s tₙ₋₁(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α). To calculate the coefficients tₙ₋₁(1 − ½α) and tₙ₋₁(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T₈ variable exceeds 3.355 is 0.005. Hence, t₈(1 − ½α) = 3.355 and [X̄ − 3.355 s/3, X̄ + 3.355 s/3] is the desired level 0.99 confidence interval. From the results of Section B.7, we see that as n → ∞ the Tₙ₋₁ distribution converges in law to the standard normal distribution; thus, we can reasonably replace tₙ₋₁(p) by the standard normal quantile z(p) for n > 120.

Up to this point, we have assumed that X₁, …, Xₙ are i.i.d. N(μ, σ²). If we drop the normality assumption and assume only that the Xᵢ are i.i.d. with σ² < ∞, it turns out that the distribution of the pivot T(μ) is fairly close to the Tₙ₋₁ distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal, so that (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can differ markedly from 1 − α. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5.

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose X₁, …, Xₙ is a sample from a N(μ, σ²) population. Then V(σ²) = (n − 1)s²/σ² has a χ²ₙ₋₁ distribution and can be used as a pivot. If we let xₙ₋₁(p) denote the pth quantile of the χ²ₙ₋₁ distribution, and if α₁ + α₂ = α, then

P(xₙ₋₁(α₁) ≤ V(σ²) ≤ xₙ₋₁(1 − α₂)) = 1 − α.

By solving the inequality inside the probability for σ², we find that

[(n − 1)s²/xₙ₋₁(1 − α₂), (n − 1)s²/xₙ₋₁(α₁)]

is a confidence interval with confidence coefficient (1 − α).
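The t and chi-square pivots are easy to put to work numerically. The sketch below (an illustration, not part of the text) computes both intervals using only the Python standard library; it assumes n − 1 is even, so that the χ²ₙ₋₁ CDF reduces to a finite Poisson sum, and it takes the t quantile as an input (e.g., t₈(0.995) = 3.355 from Table II).

```python
import math

def chi2_cdf_even_df(x, df):
    # For even df = 2m, P(chi2_df <= x) = 1 - sum_{j<m} exp(-x/2) (x/2)^j / j!
    m = df // 2
    lam = x / 2.0
    return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(m))

def chi2_quantile_even_df(p, df):
    # Invert the CDF above by bisection (CDF is increasing in x).
    lo, hi = 0.0, 50.0 * df + 100.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def normal_mean_and_variance_intervals(xs, t_quantile, alpha=0.01):
    """Level (1 - alpha) t interval (4.4.1) for mu and chi-square interval
    for sigma^2; requires n - 1 even for the chi-square helper."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    s = math.sqrt(s2)
    half = s * t_quantile / math.sqrt(n)          # s * t_{n-1}(1-a/2)/sqrt(n)
    mu_int = (xbar - half, xbar + half)
    q_lo = chi2_quantile_even_df(alpha / 2, n - 1)
    q_hi = chi2_quantile_even_df(1 - alpha / 2, n - 1)
    var_int = ((n - 1) * s2 / q_hi, (n - 1) * s2 / q_lo)
    return mu_int, var_int
```

For n = 9 and α = 0.01, passing t_quantile = 3.355 reproduces the level 0.99 interval of Example 4.4.1.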
The pivot V(σ²) similarly yields the respective level (1 − α) lower and upper confidence bounds (n − 1)s²/xₙ₋₁(1 − α) and (n − 1)s²/xₙ₋₁(α). There is a unique choice of α₁ and α₂ that uniformly minimizes expected length among all intervals of this type; it may be shown that for n large, taking α₁ = α₂ = ½α is not far from optimal (Tate and Klett, 1959). In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution. In Problem 4.4.16 we give an interval with correct limiting coverage probability.

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X₁, …, Xₙ are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution
by the De Moivre–Laplace theorem. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

P[−z(1 − ½α) ≤ √n(X̄ − θ)/√(θ(1 − θ)) ≤ z(1 − ½α)] ≈ 1 − α.

Let k_α = z(1 − ½α). Squaring the inner inequality shows that the event is {g(θ, X̄) ≤ 0}, where

g(θ, X̄) = (1 + k²_α/n)θ² − (2X̄ + k²_α/n)θ + X̄²,

so that P[g(θ, X̄) ≤ 0] ≈ 1 − α = P[θ̲(X) ≤ θ ≤ θ̄(X)]. Because the coefficient of θ² in g(θ, X̄) is greater than zero, g(θ, X̄) is, for fixed 0 < X̄ < 1, a quadratic polynomial in θ with two real roots. In terms of S = nX̄, they are

θ̲(X) = {S + ½k²_α − k_α √[S(n − S)/n + ¼k²_α]} / (n + k²_α)    (4.4.3)

θ̄(X) = {S + ½k²_α + k_α √[S(n − S)/n + ¼k²_α]} / (n + k²_α)    (4.4.4)
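The roots (4.4.3) and (4.4.4) can be computed directly; a minimal sketch (illustrative code, function name our own), using the standard normal quantile from the Python standard library:

```python
import math
from statistics import NormalDist

def approx_binomial_interval(s, n, alpha=0.05):
    """Roots (4.4.3)-(4.4.4) of g(theta, X-bar) <= 0, with k = z(1 - alpha/2)."""
    k = NormalDist().inv_cdf(1 - alpha / 2)
    center = s + k * k / 2
    half = k * math.sqrt(s * (n - s) / n + k * k / 4)
    denom = n + k * k
    return (center - half) / denom, (center + half) / denom
```

Note that for S = 0 the lower root is 0 and for S = n the upper root is 1, so the interval never leaves [0, 1], unlike the naive X̄ ± k_α √(X̄(1 − X̄)/n).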
so that [θ̲(X), θ̄(X)] is an approximate level (1 − α) confidence interval for θ. We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels, if the smaller of nθ and n(1 − θ) is at least 6. For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5. Another approximate pivot for this example is √n(X̄ − θ)/√(X̄(1 − X̄)); see Brown, Cai, and DasGupta (2000) for a discussion.

Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that the interval has a prescribed length. To see this, note that the length of the interval is

l = 2k_α √[S(n − S)/n + ¼k²_α] / (n + k²_α).

Now use the fact that S(n − S)/n ≤ ¼n to bound l above by

2k_α √(¼n + ¼k²_α)/(n + k²_α) = k_α(n + k²_α)^{−1/2} ≤ k_α/√n.

Thus, we can achieve length at most l₀ by choosing n so that k_α/√n = l₀, that is, n = (k_α/l₀)². For instance, if 1 − α = 0.95 and l₀ = 0.02, we choose

n = (1.96/0.02)² = 9604.

This formula for the sample size is very crude because the bound S(n − S)/n ≤ ¼n is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ₀ < ½ or θ ≥ θ₁ > ½. See Problem 4.4.16.
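The crude recipe n = (k_α/l₀)² is a one-liner; a small sketch (our own helper, not from the text):

```python
import math
from statistics import NormalDist

def sample_size_for_length(target_length, alpha=0.05):
    """Smallest n with k_alpha / sqrt(n) <= target_length, where
    k_alpha = z(1 - alpha/2); uses the crude bound S(n - S)/n <= n/4."""
    k = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((k / target_length) ** 2)
```

With α = 0.05 and target length 0.02 this returns 9604, matching the market-researcher computation above.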
Confidence Regions for Functions of Parameters

We can define confidence regions for a function q(θ) as random subsets of the range of q that cover the true value of q(θ) with probability at least (1 − α). Note that if C(X) is a level (1 − α) confidence region for θ, then q(C(X)) = {q(θ) : θ ∈ C(X)} is a level (1 − α) confidence region for q(θ). If q is not 1–1, this technique is typically wasteful: we can find confidence regions for q(θ) entirely contained in q(C(X)) with confidence level (1 − α). For instance, if q(θ) = θ₁, where θ = (θ₁, θ₂) is a pair for which we will later give confidence regions C(X), the confidence set obtained by focusing on θ₁ alone is smaller than q(C(X)).

Example 4.4.5. Let X₁, …, Xₙ denote the number of hours a sample of Internet subscribers spend per week on the Internet, and suppose X₁, …, Xₙ is modeled as a sample from an exponential distribution with mean θ. Suppose we want a confidence interval for the population proportion q(θ) = P(X ≥ x) of subscribers who spend at least x hours per week on the Internet. Here q(θ) = 1 − F(x) = exp{−x/θ}. Because each 2Xᵢ/θ has a χ²₂ distribution, 2nX̄/θ has a chi-square, χ²₂ₙ, distribution. By using 2nX̄/θ as a pivot we find the (1 − α) confidence interval

2nX̄/x(1 − ½α) ≤ θ ≤ 2nX̄/x(½α),

where x(β) denotes the βth quantile of the χ²₂ₙ distribution. Let θ̲ and θ̄ denote the lower and upper boundaries of this interval. Because q(θ) = exp{−x/θ} is strictly increasing in θ,

exp{−x/θ̲} ≤ q(θ) ≤ exp{−x/θ̄}

is a confidence interval for q(θ) with confidence coefficient (1 − α).

Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions q(θ) to r-dimensional vectors q(θ) = (q₁(θ), …, q_r(θ)). Suppose q̲ⱼ(X) and q̄ⱼ(X) are real-valued statistics. Then the r-dimensional random rectangle

I(X) = {q(θ) : q̲ⱼ(X) ≤ qⱼ(θ) ≤ q̄ⱼ(X), j = 1, …, r}

is said to be a level (1 − α) confidence region for q(θ) if the probability that it covers the unknown but fixed true (q₁(θ), …, q_r(θ)) is at least (1 − α). We write this as P[q(θ) ∈ I(X)] ≥ 1 − α. Note that if Iⱼ(X) = [q̲ⱼ(X), q̄ⱼ(X)] is a level (1 − αⱼ) confidence interval for qⱼ(θ) and if the pairs (q̲₁, q̄₁), …, (q̲_r, q̄_r) are independent, then the rectangle I(X) = I₁(X) × ⋯ × I_r(X) has level

∏ⱼ₌₁ʳ (1 − αⱼ).    (4.4.8)

In particular, if we choose 1 − αⱼ = (1 − α)^{1/r} for each j, then I(X) has confidence level 1 − α.
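The pivot 2nX̄/θ ∼ χ²₂ₙ and the monotone map q(θ) = e^{−x/θ} of Example 4.4.5 can be sketched as follows (illustrative code; the χ²₂ₙ quantile is computed from the Poisson-sum form of the even-degrees-of-freedom chi-square CDF plus bisection, all standard library):

```python
import math

def chi2_cdf_2m(x, m):
    # P(chi-square with 2m df <= x) via the Poisson-sum identity.
    lam = x / 2.0
    return 1.0 - sum(math.exp(-lam) * lam ** j / math.factorial(j)
                     for j in range(m))

def chi2_quantile_2m(p, m):
    lo, hi = 0.0, 20.0 * (m + 10)
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_2m(mid, m) < p:
            lo = mid
        else:
            hi = mid
    return lo

def exp_mean_interval(xs, alpha=0.05):
    # Invert the pivot 2 n Xbar / theta ~ chi-square(2n) for theta.
    n = len(xs)
    t = 2.0 * sum(xs)                       # 2 n Xbar
    lo = t / chi2_quantile_2m(1 - alpha / 2, n)
    hi = t / chi2_quantile_2m(alpha / 2, n)
    return lo, hi

def survival_interval(xs, x0, alpha=0.05):
    # Interval for q(theta) = exp(-x0/theta), increasing in theta.
    lo, hi = exp_mean_interval(xs, alpha)
    return math.exp(-x0 / lo), math.exp(-x0 / hi)
```

Because q is increasing, the endpoints of the θ interval map directly to the endpoints of the q(θ) interval, exactly as in the text.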
Thus, an r-dimensional confidence rectangle can in this case be obtained automatically from the one-dimensional intervals. An approach that works even if the Iⱼ are not independent is to use Bonferroni's inequality (A.2.7). According to this inequality,

P[q(θ) ∈ I(X)] ≥ 1 − Σⱼ₌₁ʳ P[qⱼ(θ) ∉ Iⱼ(X)] ≥ 1 − Σⱼ₌₁ʳ αⱼ.

Thus, if we choose αⱼ = α/r for each j, then I(X) has confidence level 1 − α.

Example 4.4.6. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose X₁, …, Xₙ is a N(μ, σ²) sample and we want a confidence rectangle for (μ, σ²). From Example 4.4.1,

I₁(X) = X̄ ± s tₙ₋₁(1 − ¼α)/√n

is a confidence interval for μ with confidence coefficient (1 − ½α). From Example 4.4.2,

I₂(X) = [(n − 1)s²/xₙ₋₁(1 − ¼α), (n − 1)s²/xₙ₋₁(¼α)]

is a reasonable confidence interval for σ² with confidence coefficient (1 − ½α). Thus, I₁(X) × I₂(X) is a level (1 − α) confidence rectangle for (μ, σ²). It is possible to show that the exact confidence coefficient is (1 − ½α)². See Problem 4.4.15.

The method of pivots can also be applied to ∞-dimensional parameters such as F.

Example 4.4.7. Suppose X₁, …, Xₙ are i.i.d. as X ∼ P, and we are interested in the distribution function F(t) = P(X ≤ t); that is, ν(P) = F(·). We assume that F is continuous, in which case the distribution of

Dₙ(F) = sup_{t∈R} |F̂(t) − F(t)|

does not depend on F and is known. That is, Dₙ(F) is a pivot. Let d_α be chosen such that P_F(Dₙ(F) ≤ d_α) = 1 − α. Then, by solving Dₙ(F) ≤ d_α for F, we find that a simultaneous-in-t size 1 − α confidence region C(x)(·) is the confidence band which, for each t, consists of the interval

C(x)(t) = (max{0, F̂(t) − d_α}, min{1, F̂(t) + d_α}).

We have shown that

P(F(t) ∈ C(X)(t) for all t ∈ R) = 1 − α for all P ∈ P,

where P is the set of P with P(−∞, t] continuous in t.
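A sketch of the band of Example 4.4.7 (illustrative code; because the exact d_α requires tables of the distribution of Dₙ, we substitute the classical large-sample approximation d_α ≈ √(ln(2/α)/(2n)), which is our assumption, not the text's exact constant, and is adequate only for moderate-to-large n):

```python
import math

def ks_band(xs, alpha=0.05):
    """Simultaneous confidence band for F: at each order statistic x_(i),
    the band is (max(0, Fhat - d), min(1, Fhat + d))."""
    n = len(xs)
    d = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))   # asymptotic d_alpha
    band = []
    for i, x in enumerate(sorted(xs), start=1):
        fhat = i / n                                    # empirical CDF at x_(i)
        band.append((x, max(0.0, fhat - d), min(1.0, fhat + d)))
    return band
```

For α = 0.05 the half-width is about 1.358/√n, so with n = 100 the band has half-width roughly 0.136 at every t.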
We can apply the notions studied in Examples 4.4.6 and 4.4.7 to give confidence regions for scalar or vector parameters in nonparametric models.

Example 4.4.8. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose X₁, …, Xₙ are i.i.d. as X and that X has a density f(t) = F′(t), which is zero for t < 0 and nonzero for t > 0. Suppose the mean μ = μ(F) = ∫₀^∞ t f(t) dt exists; then, by integration by parts,

μ(F) = ∫₀^∞ [1 − F(t)] dt.

Let F⁻(t) and F⁺(t) be the lower and upper simultaneous confidence boundaries of the preceding example. Then

μ̲ = inf{μ(F) : F ∈ C(X)} = μ(F⁺)

is a level (1 − α) lower confidence bound for μ, whereas sup{μ(F) : F ∈ C(X)} = μ(F⁻) = ∞ (see Problem 4.4.19). Such bounds arise in accounting practice (see Bickel, 1992, where they are discussed and shown to be asymptotically strictly conservative). Intervals for the case of F supported on a bounded interval are considered in Problem 4.4.18.

Summary. In a parametric model {P_θ : θ ∈ Θ}, a level 1 − α confidence region for a parameter q(θ) is a set C(x), depending only on the data x, such that the probability under P_θ that C(X) covers q(θ) is at least 1 − α for all θ ∈ Θ. For a nonparametric class P = {P} and parameter ν = ν(P), we similarly require P(C(X) ∋ ν) ≥ 1 − α for all P ∈ P. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and, more generally, confidence regions. We derive the (Student) t interval for μ in the N(μ, σ²) model with σ² unknown. In a nonparametric setting we derive a simultaneous confidence band for the distribution function F(t) and a bound for the mean of a positive variable X.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 − α. Acceptance regions of statistical tests are, for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 − α when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example.

Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value μ₀ for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times, obtaining measurements X₁, …, Xₙ. Knowledge of his instruments leads him to assume that the Xᵢ are independent and identically distributed normal random variables with mean μ and variance σ². If any value of μ other than μ₀ is a possible alternative, then it is reasonable to formulate the problem as that of testing H : μ = μ₀ versus K : μ ≠ μ₀. We can base a size α test on the level (1 − α) confidence interval (4.4.1) we constructed for μ as follows. We accept H if, and only if, the postulated value μ₀ is a member of the interval

[X̄ − s tₙ₋₁(1 − ½α)/√n, X̄ + s tₙ₋₁(1 − ½α)/√n].

If we let T = √n(X̄ − μ₀)/s, then our test accepts H if, and only if, −tₙ₋₁(1 − ½α) ≤ T ≤ tₙ₋₁(1 − ½α). Because P_{μ₀}[|T| = tₙ₋₁(1 − ½α)] = 0, the test is equivalently characterized by rejecting H when

|T| > tₙ₋₁(1 − ½α).    (4.5.1)

This test is called two-sided because it rejects for both large and small values of the statistic T. In contrast to one-sided tests, it has power against parameter values on either side of μ₀. Because the same interval (4.4.1) is used for every μ₀, we see that we have, in fact, generated a family of level α tests {δ(X, μ)}, where

δ(X, μ) = 1 if √n|X̄ − μ|/s > tₙ₋₁(1 − ½α), and 0 otherwise.    (4.5.2)

These tests correspond to different hypotheses, δ(X, μ₀) being of size α only for the hypothesis H : μ = μ₀. Conversely, if we start out with the family (4.5.2), we can obtain the confidence interval (4.4.1) by finding, for the observed X, the set of μ where δ(X, μ) = 0. We achieve a similar effect, generating a family of level α tests, by starting with the level (1 − α) LCB X̄ − tₙ₋₁(1 − α)s/√n and defining δ′(X, μ) to equal 1 if, and only if, X̄ − tₙ₋₁(1 − α)s/√n > μ. These are examples of a general phenomenon.

Consider the general framework where the random vector X takes values in the sample space X ⊂ R^q and X has distribution P ∈ P. Let ν = ν(P) be a parameter that takes values in the set N. For instance, in Example 4.5.1, μ = μ(P) takes values in N = (−∞, ∞); σ²(P) takes values in N = (0, ∞); and (μ, σ²)(P) takes values in N = (−∞, ∞) × (0, ∞). For a function space example, consider ν(P) = F, where F is the distribution function of Xᵢ; here an example of N is the class of all continuous distribution functions. Let S = S(X) be a map from X to subsets of N. Then S is a (1 − α) confidence region for ν if the probability that S(X) contains ν is at least (1 − α), that is,

P[ν ∈ S(X)] ≥ 1 − α for all P ∈ P.
Next consider the testing framework where we test the hypothesis H = H_{ν₀} : ν = ν₀ for some specified value ν₀. Formally, let P_{ν₀} = {P : ν(P) = ν₀}. Suppose we have a level α test δ(X, ν₀) of H_{ν₀}. Then the acceptance region

A(ν₀) = {x : δ(x, ν₀) = 0}

is a subset of X with probability at least 1 − α under each P ∈ P_{ν₀}. Consider the set of ν₀ for which H_{ν₀} is accepted,

S(X) = {ν₀ ∈ N : X ∈ A(ν₀)}.

This is a random set contained in N with probability at least 1 − α of containing the true value of ν(P), whatever be P. That is, S(X) is a level 1 − α confidence region for ν. Conversely, if S(X) is a level 1 − α confidence region for ν, then the test that accepts H_{ν₀} if and only if ν₀ is in S(X) is a level α test for H_{ν₀}. We have the following.

Duality Theorem. Let S(X) = {ν₀ ∈ N : X ∈ A(ν₀)}. Then P[X ∈ A(ν₀)] ≥ 1 − α for all P ∈ P_{ν₀} and all ν₀ if and only if S(X) is a 1 − α confidence region for ν.

We next apply the duality theorem to MLR families. Suppose X ∼ P_θ, where {P_θ : θ ∈ Θ} is MLR in T = T(X), and suppose that the distribution function F_θ(t) of T under P_θ is continuous in each of the variables t and θ when the other is fixed. The acceptance region of the UMP size α test of H : θ = θ₀ versus K : θ > θ₀ can be written

A(θ₀) = {x : T(x) ≤ t_{θ₀}(1 − α)},

where t_{θ₀}(1 − α) is the 1 − α quantile of F_{θ₀}. We have the following theorem.

Theorem 4.5.1. Suppose X ∼ P_θ, where {P_θ : θ ∈ Θ} is MLR in T = T(X), and suppose that F_θ(t) is continuous in each of t and θ when the other is fixed. If the equation F_θ(t) = 1 − α has a solution θ̲_α(t) in Θ, then θ̲_α(t) is a lower confidence bound for θ with confidence coefficient 1 − α. Similarly, any solution θ̄_α(t) of F_θ(t) = α is an upper confidence bound for θ with confidence coefficient 1 − α. Moreover, if α₁ + α₂ < 1, then [θ̲_{α₁}(t), θ̄_{α₂}(t)] is a confidence interval for θ with confidence coefficient 1 − (α₁ + α₂).
Proof. By the MLR property, the power function P_θ(T > t) = 1 − F_θ(t) of a test with critical constant t is increasing in θ; it follows that F_θ(t) is decreasing in θ. By the duality theorem, S(t) = {θ ∈ Θ : t ≤ t_θ(1 − α)} defines a 1 − α confidence region for θ. By applying F_θ to both sides of t ≤ t_θ(1 − α), we find

S(t) = {θ ∈ Θ : F_θ(t) ≤ 1 − α}.

Because F_θ(t) is decreasing in θ, F_θ(t) ≤ 1 − α if and only if θ ≥ θ̲_α(t), so that S(t) = [θ̲_α(t), ∞) and θ̲_α(t) is a lower confidence bound with confidence coefficient 1 − α. The proofs for the upper confidence bound and the interval follow by the same type of argument. □

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let t denote the observed value t = T(x) of T(X) for the datum x.
For a specified ν₀, let α(t, ν₀) denote the p-value of a test δ(T, ν₀) = 1{T > c} of H : ν = ν₀ based on a statistic T = T(X) with observed value t = T(x). Then the set

C = {(t, ν) : α(t, ν) ≥ α} = {(t, ν) : δ(t, ν) = 0}

gives the pairs (t, ν) such that, for the given t, the hypothesized value ν will be accepted. We call C the set of compatible (t, ν) points. In the (t, ν) plane, vertical sections of C are the confidence regions S(t), whereas horizontal sections are the acceptance regions A*(ν) = {t : δ(t, ν) = 0}, where A*(ν) = T(A(ν)) = {T(x) : x ∈ A(ν)}.

Corollary 4.5.1. Under the conditions of Theorem 4.5.1, if

α(t, θ₀) = P_{θ₀}(T ≥ t) = 1 − F_{θ₀}(t)

denotes the p-value of the UMP size α test of H : θ = θ₀ versus K : θ > θ₀, then

S(t) = {θ ∈ Θ : α(t, θ) > α}.

Proof. We have seen in the proof of Theorem 4.5.1 that 1 − F_θ(t) is increasing in θ; because F_θ is a distribution function, 1 − F_θ(t) is also decreasing in t. The result follows. □

Figure 4.5.1 illustrates these ideas for the example of testing H : μ = μ₀ when X₁, …, Xₙ are i.i.d. N(μ, σ²) with σ² known. With T = X̄, the compatibility set of the two-sided test is

C = {(t, μ) : |t − μ| ≤ σ z(1 − ½α)/√n}.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X₁, …, Xₙ be the indicators of n binomial trials with probability of success θ. Because n is fixed, we seek reasonable exact level (1 − α) upper and lower confidence bounds and confidence intervals for θ. To find a lower confidence bound for θ, our preceding discussion leads us to consider level α tests for H : θ ≤ θ₀. Let k(θ₀, α) denote the critical constant of the UMP level α test, which rejects H when S ≥ k(θ₀, α), where S = Σᵢ₌₁ⁿ Xᵢ. The corresponding level (1 − α) confidence region is given by

C(X₁, …, Xₙ) = {θ : S < k(θ, α)}.

To analyze the structure of this region we need to examine k(θ, α). We claim that

(i) k(θ, α) is nondecreasing in θ;
(ii) k(θ, α) increases by exactly 1 at its points of discontinuity;
(iii) at a discontinuity point θ₀, k(θ, α) → k(θ₀, α) as θ ↑ θ₀;
(iv) k(0, α) = 1 and k(1, α) = n + 1.
Figure 4.5.1. The shaded region is the compatibility set C for the two-sided test of H_{μ₀} : μ = μ₀ in the normal model. S(t₀) is a confidence interval for μ for a given value t₀ of T, whereas A*(μ₀) is the acceptance region for H_{μ₀}.

To prove (i), note that P_θ[S ≥ j] is nondecreasing in θ for fixed j; clearly, it is also nonincreasing in j for fixed θ. Therefore, θ₁ < θ₂ together with k(θ₁, α) > k(θ₂, α) would imply that

P_{θ₂}[S ≥ k(θ₂, α)] ≥ P_{θ₂}[S ≥ k(θ₁, α) − 1] ≥ P_{θ₁}[S ≥ k(θ₁, α) − 1] > α,

a contradiction, because k(θ₂, α) is chosen so that P_{θ₂}[S ≥ k(θ₂, α)] ≤ α. The assertion (ii) is a consequence of the following remarks. If θ₀ is a discontinuity point of k(θ, α), let j be the limit of k(θ, α) as θ ↑ θ₀. Then P_θ[S ≥ j] ≤ α for all θ < θ₀ and, hence, by continuity, P_{θ₀}[S ≥ j] ≤ α; on the other hand, P_{θ₀}[S ≥ j − 1] > α. Therefore, j = k(θ₀, α). The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define

θ̲(S) = inf{θ : k(θ, α) = S + 1},

then

C(X) = (θ̲(S), 1] if S > 0, and C(X) = [0, 1] if S = 0,
and θ̲(S) is the desired level (1 − α) LCB for θ. When S > 0, θ̲(S) is the unique solution of the equation

Σ_{r=S}^{n} C(n, r) θ^r (1 − θ)^{n−r} = α,

where C(n, r) denotes the binomial coefficient; when S = 0, θ̲(S) = 0. Similarly, if we define

θ̄(S) = sup{θ : j(θ, α) = S − 1},

where j(θ, α) is the analogous critical constant for testing against alternatives below θ₀, then θ̄(S) is a level (1 − α) UCB for θ. When S = n, θ̄(S) = 1, and when S < n, we find θ̄(S) as the unique solution of

Σ_{r=0}^{S} C(n, r) θ^r (1 − θ)^{n−r} = α.

Putting the bounds θ̲(S), θ̄(S) together we get the confidence interval [θ̲(S), θ̄(S)] of level (1 − 2α). These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3.

Figure 4.5.2. Plot of k(θ, α) for n = 2, α = 0.16.
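The two tail equations above can be solved numerically by bisection, using the monotonicity of the binomial tail probabilities in θ (an illustrative sketch, standard library only):

```python
import math

def binom_cdf(s, n, theta):
    # P_theta(S <= s); decreasing in theta for fixed s < n.
    return sum(math.comb(n, r) * theta ** r * (1 - theta) ** (n - r)
               for r in range(s + 1))

def binom_upper_bound(s, n, alpha):
    """theta-bar(S): root of P_theta(S' <= s) = alpha; equals 1 if s == n."""
    if s == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if binom_cdf(s, n, mid) > alpha:   # root lies to the right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def binom_lower_bound(s, n, alpha):
    """theta-underbar(S): root of P_theta(S' >= s) = alpha; 0 if s == 0."""
    if s == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - binom_cdf(s - 1, n, mid) < alpha:   # tail too small: go right
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For n = 2 and α = 0.16 (the case of Figure 4.5.2), S = 0 gives θ̄(0) as the root of (1 − θ)² = 0.16, that is, θ̄(0) = 0.6; S = 2 gives θ̲(2) as the root of θ² = 0.16, that is, 0.4. The pair [binom_lower_bound(s, n, α), binom_upper_bound(s, n, α)] is then the level (1 − 2α) interval.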
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H : θ = θ₀ is rejected in favor of K : θ ≠ θ₀, we usually want to know whether θ > θ₀ or θ < θ₀. The problem of deciding whether θ = θ₀, θ < θ₀, or θ > θ₀ is an example of a three-decision problem and is a special case of the decision problems in Section 1.4. Here we consider the simple solution suggested by the level (1 − α) confidence interval I:

1. Make no judgment as to whether θ < θ₀ or θ > θ₀ if I contains θ₀.
2. Decide θ < θ₀ if I is entirely to the left of θ₀.
3. Decide θ > θ₀ if I is entirely to the right of θ₀.    (4.5.3)

For instance, suppose X₁, …, Xₙ are i.i.d. N(μ, σ²) with σ² known. In Section 4.4 we considered the level (1 − α) confidence interval X̄ ± σ z(1 − ½α)/√n for μ. Using this interval and (4.5.3), we obtain the following three-decision rule based on T = √n(X̄ − μ₀)/σ:

Do not reject H : μ = μ₀ if |T| < z(1 − ½α).
Decide μ < μ₀ if T ≤ −z(1 − ½α).
Decide μ > μ₀ if T ≥ z(1 − ½α).

For this three-decision rule, the probability of falsely claiming significance of either μ < μ₀ or μ > μ₀ is bounded above by ½α. To see this, consider first the case μ ≥ μ₀. The wrong decision "μ < μ₀" is then made when T ≤ −z(1 − ½α); this event has probability

P[√n(X̄ − μ)/σ ≤ −z(1 − ½α) − √n(μ − μ₀)/σ] ≤ Φ(−z(1 − ½α)) = ½α.

Similarly, when μ ≤ μ₀, the probability of the wrong decision "μ > μ₀" is at most ½α. Thus, the two-sided test can be regarded as the first step in a decision procedure: if H is not rejected, we make no claims of significance, but if H is rejected, we decide whether this is because θ is smaller or larger than θ₀.

For example, suppose θ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : θ = 0 versus K : θ ≠ 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether θ < 0 or θ > 0. If we decide θ < 0, then we select A as the better treatment, and vice versa.
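The three-decision rule based on T is immediate to implement (illustrative sketch; names are our own):

```python
import math
from statistics import NormalDist

def three_decision(xbar, mu0, sigma, n, alpha=0.05):
    """Rule (4.5.3) for the normal mean with sigma known: each wrong
    directional claim has probability at most alpha/2."""
    t = math.sqrt(n) * (xbar - mu0) / sigma
    c = NormalDist().inv_cdf(1 - alpha / 2)
    if t >= c:
        return "mu > mu0"
    if t <= -c:
        return "mu < mu0"
    return "no judgment"
```

In the blood-pressure comparison, "mu < mu0" with μ₀ = 0 corresponds to selecting treatment A, and "mu > mu0" to selecting B.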
By using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the α of the parent test or confidence interval. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions.

Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If δ(x, ν₀) is a level α test of H : ν = ν₀, then the set S(x) of ν₀ where δ(x, ν₀) = 0 is a level (1 − α) confidence region for ν. We apply this duality to MLR families, and we give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter θ equals θ₀, is less than θ₀, or is larger than θ₀.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. We want the probability of coverage to be at least (1 − α), but we also want the bounds to be close to θ. If θ̲ and θ̲* are two competing level (1 − α) lower confidence bounds for θ, they are both very likely to fall below the true θ; we say that the bound with the smaller probability of being far below θ is more accurate. Formally, the following is true.

Definition 4.6.1. A level (1 − α) LCB θ̲* of θ is said to be more accurate than a competing level (1 − α) LCB θ̲ if, and only if, for any fixed θ and all θ′ < θ,

P_θ[θ̲*(X) ≤ θ′] ≤ P_θ[θ̲(X) ≤ θ′].    (4.6.1)

Similarly, a level (1 − α) UCB θ̄* is more accurate than a competitor θ̄ if, and only if, for any fixed θ and all θ′ > θ,

P_θ[θ̄*(X) ≥ θ′] ≤ P_θ[θ̄(X) ≥ θ′].    (4.6.2)

Lower confidence bounds θ̲* satisfying (4.6.1) for all competitors are called uniformly most accurate, as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that θ̲* is a uniformly most accurate level (1 − α) LCB for θ if, and only if, −θ̲* is a uniformly most accurate level (1 − α) UCB for −θ.

Example 4.6.1. Suppose X = (X₁, …, Xₙ) is a sample of N(μ, σ²) random variables with σ² known. A level α test of H : μ = μ₀ versus K : μ > μ₀ rejects H when √n(X̄ − μ₀)/σ ≥ z(1 − α); the dual lower confidence bound is μ̲₁(X) = X̄ − z(1 − α)σ/√n. A competing lower confidence bound is μ̲₂(X) = X₍ₖ₎, where X₍₁₎ ≤ X₍₂₎ ≤ ⋯ ≤ X₍ₙ₎ denotes the ordered X₁, …, Xₙ and k is defined by P(S ≥ k) = 1 − α for a binomial, B(n, ½), random variable S. Which lower bound is more accurate? It does turn out that μ̲₁(X) is more accurate than μ̲₂(X) and is, in fact, uniformly most accurate in the N(μ, σ²) model. This is a consequence of the following theorem, which reveals that (4.6.1) is nothing more than a comparison of power functions.
Theorem 4.6.1. Suppose θ̲*(X) is a level (1 − α) lower confidence bound such that, for each θ₀, the associated test whose critical function δ*(x, θ₀) is given by

δ*(x, θ₀) = 1 if θ̲*(x) > θ₀, and 0 otherwise,

is UMP level α for H : θ = θ₀ versus K : θ > θ₀. Then θ̲* is uniformly most accurate at level (1 − α).

Proof. Let θ̲(X) be any other level (1 − α) lower confidence bound. Define δ(x, θ₀) = 1 if θ̲(x) > θ₀, and 0 otherwise. Because δ(X, θ₀) = 0 if, and only if, θ̲(X) ≤ θ₀, δ(X, θ₀) is a level α test for H : θ = θ₀ versus K : θ > θ₀. Because δ*(X, θ₀) is UMP level α, for θ₁ > θ₀ we must have

E_{θ₁}(δ(X, θ₀)) ≤ E_{θ₁}(δ*(X, θ₀)),

that is, P_{θ₁}[θ̲*(X) ≤ θ₀] ≤ P_{θ₁}[θ̲(X) ≤ θ₀]. Identify θ₀ with θ′ and θ₁ with θ in the statement of Definition 4.6.1 and the result follows. □

If we apply the theorem to Example 4.6.1, we find that μ̲₁(X) = X̄ − z(1 − α)σ/√n is uniformly most accurate. However, X₍ₖ₎ does have the advantage that we don't have to know σ, or even the shape of the density f of Xᵢ, to apply it; the robustness considerations of Section 3.5 favor X₍ₖ₎.

Most accurate upper confidence bounds are defined similarly. We can also extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q̲* to be a uniformly most accurate level (1 − α) LCB for q(θ) if, and only if, for any other level (1 − α) lower confidence bound q̲,

P_θ[q̲* ≤ q(θ′)] ≤ P_θ[q̲ ≤ q(θ′)] whenever q(θ′) < q(θ).

Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see the problems for the proof), they have the smallest expected "distance" to θ:

Corollary 4.6.1. Suppose θ̲*(X) is a UMA level (1 − α) lower confidence bound for θ, and let θ̲(X) be any other level (1 − α) LCB. Then, with a⁺ = a if a > 0 and a⁺ = 0 otherwise,

E_θ(θ − θ̲*(X))⁺ ≤ E_θ(θ − θ̲(X))⁺ for all θ.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let X₁, …, Xₙ be the times to failure of n pieces of equipment, where we assume that the Xᵢ are independent E(λ) variables. We want a uniformly most accurate level (1 − α) upper confidence bound q̄* for q(λ) = 1 − e^{−λt₀}, the probability of early failure (failure before time t₀) of a piece of equipment.
a uniformly most accurate level (1 . 1962.6. subject to ilie requirement that the confidence level is (1 .a) in the sense that. in general. In particular. Of course. Neyman defines unbiased confidence intervals of level (1 . By Problem 4.a) UCB for the probability of early failure.) as a measure of precision. *) is a uniformly most accurate level (1 . is random and it can be shown that in most situations there is no confidence interval of level (1 . Thus.3) or equivalently if A X2n(1a) o < 2"'~ 1 X 2 L_n= where X2n(1. If we turn to the expected length Ee(t . the UMP test accepts H if LXi < X2n(1. Considerations of accuracy lead us to ask that. it follows that q( >.T.a) iliat has uniformly minimum length among all such intervals. the interval must be at least as likely to cover the true value of q( B) as any other value. These topics are discussed in Lehmann (1997). 374376).a) quantile of the X§n distribution.250 Testing and Confidence Regions Chapter 4 We begin by finding a uniformly most accurate level (1.6. The situation wiili confidence intervals is more complicated. 0 Discussion We have only considered confidence bounds. * is by Theorem 4. By using the duality between onesided tests and confidence bounds we show that confidence bounds based on UMP level a tests are uniformly most accurate (UMA) level (1 .\ *) where>. the intervals developed in Example 4.1.a) by the property that Pe[T. some large sample results in iliis direction (see Wilks.a) UCB'\* for A. To find'\* we invert the family of UMP level a tests of H : A ~ AO versus K : A < AO. ~ q(B) ~ t] ~ Pe[T. the confidence region corresponding to this test is (0.a) intervals iliat has minimum expected length for all B.a) unbiased confidence intervals. there exist level (1 . in the case of lower bounds. there does not exist a member of the class of level (1 . the situation is still unsatisfactory because. the confidence interval be as short as possible. B'. . they are less likely ilian oilier level (1 .. 
pp.a)/ 2Ao i=1 n (4..a) lower confidence bounds to fall below any value ()' below the true B. however.a) UCB for A and.6.a) is the (1. That is.T. because q is strictly increasing in A. we can restrict attention to certain reasonable subclasses of level (1 .8. l . Confidence intervals obtained from twosided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes.5. Pratt (1961) showed that in many of the classical problems of estimation .a) confidence intervals that have uniformly minimum expected lengili among all level (1. as in the estimation problem.1 have this property. However.. Therefore. the lengili t .. There are. ~ q(()') ~ t] for every ().a). Summary.a) intervals for which members with uniformly smallest expected length exist.
4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals, where the data X ∈ 𝒳 ⊂ R^q are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a 100(1 − α)% confidence interval is that if we repeated an experiment indefinitely, each time computing a 100(1 − α)% confidence interval, then 100(1 − α)% of the intervals would contain the true unknown parameter value.

In the Bayesian formulation introduced in Chapter 1, θ is random with prior probability distribution Π and, given θ = θ, X has distribution P_θ, θ ∈ Θ ⊂ R. Level (1 − α) credible bounds and intervals are subsets of the parameter space which are given probability at least (1 − α) by the posterior distribution of the parameter given the data. Let Π(·|x) denote the posterior probability distribution of θ given X = x, and let π(·|x) denote its density.

Definition 4.7.1. θ̲ and θ̄ are level (1 − α) lower and upper credible bounds for θ if they respectively satisfy Π(θ̲ ≤ θ | x) ≥ 1 − α and Π(θ ≤ θ̄ | x) ≥ 1 − α.

Turning to Bayesian credible intervals and regions, it is natural to consider the collection of θ that is "most likely" under the distribution Π(·|x).

Definition 4.7.2. C_k = {θ : π(θ|x) ≥ k} is called a level (1 − α) credible region for θ if Π(C_k | x) ≥ 1 − α. If π(·|x) is unimodal, then C_k will be an interval of the form [θ̲, θ̄].

Example 4.7.1. Suppose that, given μ, X₁, …, X_n are i.i.d. N(μ, σ₀²), with σ₀² known, and that μ ~ N(μ₀, τ₀²), with μ₀ and τ₀² known. Then, from the conjugate normal analysis of Chapter 1, the posterior distribution of μ given X₁, …, X_n is N(μ_B, σ_B²), with

μ_B = (n x̄/σ₀² + μ₀/τ₀²) / (n/σ₀² + 1/τ₀²),  σ_B² = (n/σ₀² + 1/τ₀²)^{-1}.

It follows that the level 1 − α lower and upper credible bounds for μ are

μ̲ = μ_B − z_{1−α} (σ₀/√n)(1 + σ₀²/nτ₀²)^{-1/2},  μ̄ = μ_B + z_{1−α} (σ₀/√n)(1 + σ₀²/nτ₀²)^{-1/2},
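A minimal numerical sketch of the posterior calculations in Example 4.7.1 (function and variable names are ours, not the text's):

```python
import math
from statistics import NormalDist

def normal_posterior(x, sigma0, mu0, tau0):
    """Posterior N(mu_B, sigma_B^2) of mu for i.i.d. N(mu, sigma0^2) data
    with a N(mu0, tau0^2) prior on mu."""
    n = len(x)
    xbar = sum(x) / n
    w = n / sigma0 ** 2 + 1 / tau0 ** 2          # posterior precision
    mu_b = (n * xbar / sigma0 ** 2 + mu0 / tau0 ** 2) / w
    return mu_b, math.sqrt(1 / w)

def credible_interval(x, sigma0, mu0, tau0, alpha):
    """Level (1 - alpha) credible interval [mu-, mu+] for mu."""
    mu_b, sigma_b = normal_posterior(x, sigma0, mu0, tau0)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mu_b - z * sigma_b, mu_b + z * sigma_b
```

As τ₀ → ∞ the prior flattens, μ_B → x̄ and σ_B → σ₀/√n, so the interval approaches the frequentist interval x̄ ± z(1 − α/2)σ₀/√n, in line with the comparison made in the text.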
and the level (1 − α) credible interval is [μ⁻, μ⁺] with

μ± = μ_B ± z(1 − α/2) (σ₀/√n)(1 + σ₀²/nτ₀²)^{-1/2}.

Compared to the frequentist interval x̄ ± z(1 − α/2)σ₀/√n, the credible interval is pulled in the direction μ₀ of the prior mean, and it is a little narrower. Note that as τ₀ → ∞, the Bayesian interval tends to the frequentist interval. However, the interpretations of the intervals are different: in the frequentist confidence interval, the probability of coverage is computed with the data X random and θ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with X = x fixed and θ random with probability distribution Π(θ | X = x). ☐

Example 4.7.2. Suppose that, given θ = σ², X₁, …, X_n are i.i.d. N(μ₀, σ²), where μ₀ is known; think of μ₀ as a prior guess of the value of μ (see Chapter 1 for sources of such prior guesses). Let λ = σ⁻² and suppose λ has the gamma Γ(½a, ½b) density, where a > 0, b > 0 are known parameters. Then (t + b)λ has a χ²_{a+n} distribution, where t = Σ(X_i − μ₀)². Let x_{a+n}(α) denote the αth quantile of the χ²_{a+n} distribution. Then λ̲ = x_{a+n}(α)/(t + b) is a level (1 − α) lower credible bound for λ, and (t + b)/x_{a+n}(α) is a level (1 − α) upper credible bound for σ². Compared to the corresponding frequentist bound (n − 1)s²/x_{n−1}(α), the Bayesian bound is shifted in the direction of b/a, the reciprocal of the mean a/b of the prior on λ. We shall analyze Bayesian credible regions further in Chapter 5. ☐

Summary. In the Bayesian framework we define bounds and intervals, called level (1 − α) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least (1 − α) by the posterior distribution of the parameter θ given the data x. In the case of a normal prior π(θ) and normal model p(x | θ), the level (1 − α) credible interval is similar to the frequentist interval except that it is pulled in the direction μ₀ of the prior mean, and it is a little narrower. However, the interpretations are different: in the frequentist confidence interval, the probability of coverage is computed with the data X random and θ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with X = x fixed and θ random with probability distribution Π(θ | X = x).

4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable Y. In addition to point prediction of Y, it is desirable to give an interval [Y̲, Ȳ] that contains the unknown value Y with prescribed probability (1 − α). For instance, a doctor administering a treatment with delayed effect will give patients a time interval [t̲, t̄] in which the treatment is likely to take effect. Similarly, we may want an interval for the
future GPA of a student or a future value of a portfolio. The problem of finding prediction intervals is similar to that of finding confidence intervals using a pivot.

Let Ŷ = Ŷ(X) denote a predictor based on X = (X₁, …, X_n). We define a level (1 − α) prediction interval as an interval [Y̲, Ȳ] based on data X such that P(Y̲ ≤ Y ≤ Ȳ) ≥ 1 − α. We define a predictor Ŷ* to be prediction unbiased for Y if E(Ŷ* − Y) = 0. The mean squared prediction error (MSPE) of a predictor Ŷ is E(Ŷ − Y)². Note that Ŷ can be regarded both as a predictor of Y and as an estimate of μ, and when we do so,

MSPE(Ŷ) = MSE(Ŷ) + σ²,

where MSE denotes the estimation theory mean squared error. In Chapter 3 we found that in the class of unbiased estimators, X̄ is the optimal estimator; it can be shown that in the class of prediction unbiased predictors, the optimal MSPE predictor is Ŷ = X̄. In fact, the optimal estimator, when it exists, is also the optimal predictor.

Example 4.8.1. The (Student) t Prediction Interval. Let X₁, …, X_n be i.i.d. as X ~ N(μ, σ²). We want a prediction interval for Y = X_{n+1}, which is assumed to be also N(μ, σ²) and independent of X₁, …, X_n. Then Ŷ = X̄ and Y are independent, and we use the prediction error Ŷ − Y to construct a pivot. Because Ŷ − Y = X̄ − X_{n+1} ~ N(0, [n⁻¹ + 1]σ²),

Z_p(Y) = (Ŷ − Y) / σ√(1 + n⁻¹)

has a N(0, 1) distribution. Moreover, X̄ − X_{n+1} is independent of V = (n − 1)s²/σ², where s² = (n − 1)⁻¹Σ(X_i − X̄)², because s² is independent of X̄ by Theorem B.3 and independent of X_{n+1} by assumption. Because V has a χ²_{n−1} distribution, it follows by the definition of the (Student) t distribution in Section B.3 that

T_p(Y) = (Ŷ − Y) / s√(1 + n⁻¹) = Z_p(Y) / √(V/(n − 1))

has the t distribution T_{n−1}. Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in the t confidence interval. By solving |T_p(Y)| ≤ t_{n−1}(1 − ½α) for Y, we find the level (1 − α) prediction interval

Y = X̄ ± s√(1 + n⁻¹) t_{n−1}(1 − ½α).   (4.8.1)

Note that the prediction interval is much wider than the confidence interval for μ. In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval tends to zero in probability at the rate n^{−1/2}, whereas the width of the prediction interval tends to 2σz(1 − ½α) as n → ∞. Moreover, the confidence level of the confidence interval is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 − α) in the limit as n → ∞ for samples from non-Gaussian distributions. ☐
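A sketch of the interval (4.8.1) in Python. The standard library lacks Student t quantiles, so the function below takes t_{n−1}(1 − ½α) as an argument; for large n the normal quantile is a reasonable stand-in (used in the illustration, which is ours, not the text's).

```python
import math
from statistics import NormalDist

def t_prediction_interval(x, tq):
    """Level (1 - alpha) prediction interval xbar +/- tq * s * sqrt(1 + 1/n),
    where tq should be the t_{n-1}(1 - alpha/2) quantile."""
    n = len(x)
    xbar = sum(x) / n
    s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    half = tq * s * math.sqrt(1 + 1 / n)
    return xbar - half, xbar + half

# Illustrative data; normal quantile used as a large-n approximation to t
x = [4.1, 5.2, 3.8, 4.9, 5.0, 4.4, 4.7, 5.3, 4.0, 4.6]
z = NormalDist().inv_cdf(0.975)
lo, hi = t_prediction_interval(x, z)
```

The factor √(1 + 1/n), versus √(1/n) for the confidence interval, is what keeps the prediction interval from shrinking as n grows.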
We next give a prediction interval that is valid for samples from any population with a continuous distribution.

Example 4.8.2. A Distribution-Free Prediction Interval. Suppose X₁, …, X_{n+1} are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), −∞ ≤ a < b ≤ ∞. Here X₁, …, X_n are observable and X_{n+1} is to be predicted; we want a prediction interval for Y = X_{n+1} ~ F, where X_{n+1} is independent of the data X₁, …, X_n. Let X_(1) < ⋯ < X_(n) denote the order statistics of X₁, …, X_n. Set U_i = F(X_i), i = 1, …, n + 1; then U₁, …, U_{n+1} are i.i.d. uniform, U(0, 1). Let U_(1) < ⋯ < U_(n) be U₁, …, U_n ordered. By a standard result on uniform order statistics, E(U_(i)) = i/(n + 1); thus, for j < k,

P(U_(j) ≤ U_{n+1} ≤ U_(k)) = ∫∫ P(u ≤ U_{n+1} ≤ v | U_(j) = u, U_(k) = v) dH(u, v)
= ∫∫ (v − u) dH(u, v) = E(U_(k)) − E(U_(j)) = (k − j)/(n + 1),

where H is the joint distribution of U_(j) and U_(k). It follows that

P(X_(j) ≤ X_{n+1} ≤ X_(k)) = (k − j)/(n + 1).   (4.8.2)

Thus [X_(j), X_(k)] with k = n + 1 − j is a level 1 − α = (n + 1 − 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval. A simpler proof of (4.8.2) is given in the problems. ☐

Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that, given θ = θ, X₁, …, X_{n+1} are i.i.d. p(x | θ). The posterior predictive distribution Q(· | x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x₁, …, x_n); in the continuous case it has density

q(x_{n+1} | x) = ∫ p(x_{n+1} | θ) π(θ | x) dθ,

with a sum replacing the integral in the discrete case. Now [Y̲_B, Ȳ_B] is said to be a level (1 − α) Bayesian prediction interval for Y = X_{n+1} if

Q(Y̲_B ≤ Y ≤ Ȳ_B | x) ≥ 1 − α.
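The distribution-free coverage claim (4.8.2) can be checked by simulation (a sketch; the exponential choice of F is arbitrary, since the result holds for any continuous F):

```python
import random

def covers(n, j, k, trials=20000, rng=None):
    """Monte Carlo estimate of P(X_(j) <= X_{n+1} <= X_(k)) for an
    i.i.d. sample from a continuous F (here exponential)."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        x = sorted(rng.expovariate(1.0) for _ in range(n))
        y = rng.expovariate(1.0)               # the unobserved X_{n+1}
        hits += x[j - 1] <= y <= x[k - 1]      # 1-based order statistics
    return hits / trials

# For n = 9, j = 1, k = 9: (k - j)/(n + 1) = 8/10 = 0.8
print(covers(9, 1, 9))
```

The estimate should agree with (k − j)/(n + 1) up to Monte Carlo error, whatever continuous F is used in place of the exponential.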
Example 4.8.3. The Bayesian Prediction Interval for the Normal Model. Consider the normal model of Chapter 3, where (X_i | θ) ~ N(θ, σ₀²), σ₀² known, and π(θ) is N(η₀, τ²), τ² known. A sufficient statistic based on the observables X₁, …, X_n is T = X̄ = n⁻¹Σ_{i=1}^n X_i, and it is enough to derive the conditional distribution of Y = X_{n+1} given X̄. To obtain the predictive distribution, note that given X̄ = t, the posterior distribution of θ is N(μ_B, σ_B²), where

μ_B = (σ_B²/τ²)η₀ + (nσ_B²/σ₀²)t,  σ_B² = (σ₀²/n) / [(σ₀²/nτ²) + 1].

Note that E[(X_{n+1} − θ)θ] = E{E((X_{n+1} − θ)θ | θ)} = 0; thus, X_{n+1} − θ and θ are uncorrelated and, being jointly normal, independent. It follows that

ℒ{X_{n+1} | X̄ = t} = ℒ{(X_{n+1} − θ) + θ | X̄ = t} = N(μ_B, σ₀² + σ_B²),

and a level (1 − α) Bayesian prediction interval for Y is [Y_B⁻, Y_B⁺] with

Y_B± = μ_B ± z(1 − ½α) √(σ₀² + σ_B²).   (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3), we compute its probability limit under the assumption that X₁, …, X_n are i.i.d. N(θ, σ₀²). Because σ_B² → 0 and X̄ → θ as n → ∞, the interval (4.8.3) converges in probability to θ ± z(1 − ½α)σ₀. This is the same as the probability limit of the frequentist interval (4.8.1). The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983). ☐

Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 − α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the Student t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures.
However, optimal procedures may not exist, even in the case in which θ is one-dimensional. For instance, if X₁, …, X_n is a sample from a N(μ, σ²) population with σ² known, there is no UMP test for testing H : μ = μ₀ versus K : μ ≠ μ₀. To see this, note that if μ₁ > μ₀, the MP level α test φ_α(X) rejects H for T ≥ z(1 − α), where T = √n(X̄ − μ₀)/σ. On the other hand, if μ₁ < μ₀, the MP level α test δ_α(X) rejects H if T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test there can be no UMP test of H : μ = μ₀ versus K : μ ≠ μ₀. In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6.

Suppose that X = (X₁, …, X_n) has density or frequency function p(x, θ) and that we wish to test H : θ ∈ Θ₀ versus K : θ ∈ Θ₁. We start with a generalization of the Neyman–Pearson statistic p(x, θ₁)/p(x, θ₀). Recall from Chapter 2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x₁, …, x_n). So, if sup{p(x, θ) : θ ∈ Θ₁} is large compared to sup{p(x, θ) : θ ∈ Θ₀}, then the observed sample is best explained by some θ ∈ Θ₁, and conversely. The test statistic we want to consider is the likelihood ratio given by

L(x) = sup{p(x, θ) : θ ∈ Θ₁} / sup{p(x, θ) : θ ∈ Θ₀}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. Note that L(x) coincides with the optimal test statistic p(x, θ₁)/p(x, θ₀) when Θ₀ = {θ₀} and Θ₁ = {θ₁}. In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ₀ is of smaller dimension than Θ = Θ₀ ∪ Θ₁, so that the likelihood ratio equals the test statistic

λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ₀},   (4.9.1)

whose computation is often simple; note that in general λ(x) = max(L(x), 1). In particular cases, likelihood ratio tests have weak optimality properties to be discussed in Chapters 5 and 6. We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

1. Calculate the MLE θ̂ of θ.
2. Calculate the MLE θ̂₀ of θ, where θ may vary only over Θ₀.
3. Form λ(x) = p(x, θ̂)/p(x, θ̂₀).
4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H. Because h(λ(X)) is equivalent to λ(X), we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 − α)th quantile obtained from the table.
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions. For instance, we can invert the family of size α likelihood ratio tests of the point hypothesis H : θ = θ₀ and obtain the level (1 − α) confidence region

C(x) = {θ : p(x, θ) ≥ [c(θ)]⁻¹ sup_θ p(x, θ)},   (4.9.2)

where sup_θ denotes the sup over θ ∈ Θ and the critical constant c(θ₀) satisfies

P_{θ₀}[ sup_θ p(X, θ) / p(X, θ₀) > c(θ₀) ] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data.

This section includes situations in which θ = (θ₁, θ₂), where θ₁ is the parameter of interest and θ₂ is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H : θ₁ = θ₁₀, which are composite because θ₂ can vary freely. The family of such level α likelihood ratio tests obtained by varying θ₁₀ can also be inverted to yield confidence regions for θ₁. To see how the process works we refer to the specific examples in the subsections that follow.

4.9.2 Tests for the Mean of a Normal Distribution—Matched Pair Experiments

Suppose X₁, …, X_n form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. In order to reduce differences due to the extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors. We can regard twins as being matched pairs. After the matching, the experiment proceeds as follows: in the ith pair one patient is picked at random (i.e., with probability ½) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Studies in which subjects serve as their own control can also be thought of as matched pair experiments: we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on. Let X_i denote the difference between the treated and control responses for the ith pair. We are interested in expected differences in responses due to the treatment effect. If the treatment and placebo have the same effect, the difference X_i has a distribution that is
symmetric about zero. To this we have added the normality assumption. The test we derive will still have desirable properties, in an approximate sense to be discussed in Chapter 5, if the normality assumption is not satisfied.

Let μ = E(X₁) denote the mean difference between the response of the treated and control subjects. We think of μ as representing the treatment effect. Our null hypothesis of no treatment effect is then H : μ = 0. However, for the purpose of referring to the duality between testing and confidence procedures, we test H : μ = μ₀, where we think of μ₀ as an established standard for an old treatment.

Two-Sided Tests

We begin by considering K : μ ≠ μ₀. This corresponds to the alternative "the treatment has some effect, good or bad." The test can also be modified into a three-decision rule that decides whether there is a significant positive or negative effect.

Form of the Two-Sided Tests

Let θ = (μ, σ²). Here Θ₀ = {(μ, σ²) : μ = μ₀}. Under our assumptions, finding sup{p(x, θ) : θ ∈ Θ} was solved in Chapter 3: sup{p(x, θ) : θ ∈ Θ} = p(x, θ̂), where

θ̂ = (x̄, σ̂²),  σ̂² = n⁻¹Σ_{i=1}^n (x_i − x̄)²,

is the maximum likelihood estimate of θ. Finding sup{p(x, θ) : θ ∈ Θ₀} boils down to finding the maximum likelihood estimate σ̂₀² of σ² when μ = μ₀ is known and then evaluating p(x, θ) at (μ₀, σ̂₀²). The likelihood equation is

(∂/∂σ²) log p(x, θ) = ½[σ⁻⁴ Σ_{i=1}^n (x_i − μ₀)² − nσ⁻²] = 0,

which has the immediate solution

σ̂₀² = n⁻¹Σ_{i=1}^n (x_i − μ₀)².

Therefore,

log λ(x) = log p(x, (x̄, σ̂²)) − log p(x, (μ₀, σ̂₀²))
= {−(n/2)[(log 2π) + (log σ̂²)] − n/2} − {−(n/2)[(log 2π) + (log σ̂₀²)] − n/2}
= (n/2) log(σ̂₀²/σ̂²).

Our test rule, therefore, rejects H for large values of σ̂₀²/σ̂². To simplify the rule further we use the following identity, which can be established by expanding both sides:

σ̂₀²/σ̂² = 1 + (x̄ − μ₀)²/σ̂².

Because σ̂₀²/σ̂² is a monotone increasing function of |T_n|, where

T_n = √n(x̄ − μ₀)/s,  (n − 1)s² = Σ(x_i − x̄)² = nσ̂²,

the likelihood ratio tests reject for large values of |T_n|. The statistic T_n is equivalent to the likelihood ratio statistic λ for this problem. Because T_n has a T_{n−1} distribution under H, the size α critical value is t_{n−1}(1 − ½α); to find the critical value we can use calculators or software that give quantiles of the t distribution, or Table III. For instance, suppose n = 25 and we want α = 0.05. Then we would reject H if, and only if, |T_n| ≥ 2.064.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement; thus, the testing problem H : μ ≤ μ₀ versus K : μ > μ₀ (with μ₀ = 0) is suggested. The test that rejects H for T_n ≥ t_{n−1}(1 − α) is of size α for H : μ ≤ μ₀: in the problems we argue that P_θ[T_n ≥ t] is increasing in δ, where δ = √n(μ − μ₀)/σ. Similarly, the size α likelihood ratio test for H : μ ≥ μ₀ versus K : μ < μ₀ rejects H if, and only if, T_n ≤ −t_{n−1}(1 − α).
Power Functions and Confidence Intervals

To discuss the power of these tests, we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter δ, denoted by T_{k,δ}. This distribution is, by definition, the distribution of Z/√(V/k), where Z and V are independent and have N(δ, 1) and χ²_k distributions, respectively; its density is given in the problems. To derive the distribution of T_n, note that from Section B.3 we know that √n(X̄ − μ)/σ and (n − 1)s²/σ² are independent and that (n − 1)s²/σ² has a χ²_{n−1} distribution. Because E[√n(X̄ − μ₀)/σ] = √n(μ − μ₀)/σ, the statistic √n(X̄ − μ₀)/σ has a N(δ, 1) distribution with δ = √n(μ − μ₀)/σ, and

T_n = [√n(X̄ − μ₀)/σ] / √[(n − 1)s²/σ²(n − 1)]

has a T_{n−1,δ} distribution. Note that the distribution of T_n depends on θ = (μ, σ²) only through δ. The power functions of the one-sided tests are monotone in δ, just as the power functions of the corresponding tests of the known-σ case are monotone in √nμ/σ, and the power can be obtained from computer software or tables of the noncentral t distribution. Computer software will also compute the sample size n needed to achieve a prescribed power.

If we consider alternatives of the form (μ − μ₀) ≥ Δ, we can no longer control both probabilities of error by choosing the sample size: by making σ sufficiently large we can force the noncentrality parameter δ = √n(μ − μ₀)/σ as close to 0 as we please and, thus, bring the power arbitrarily close to α, whatever be n. We have met similar difficulties when discussing confidence intervals. We can control both probabilities of error by selecting the sample size n large, provided we consider alternatives of the form |δ| ≥ δ₁ > 0 in the two-sided case, and δ ≥ δ₁ or δ ≤ −δ₁ in the one-sided cases. A Stein solution, in which we estimate σ from a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with |μ − μ₀| ≥ Δ, is possible (Lehmann, 1997, p. 260, Problem 17). With this solution, however, we may be required to take more observations than we can afford on the second stage.

Likelihood Confidence Regions

If we invert the two-sided tests, we obtain the confidence region

C(X) = {μ : |√n(X̄ − μ)/s| ≤ t_{n−1}(1 − ½α)}.

We recognize C(X) as the t confidence interval derived earlier. Similarly, the one-sided tests lead to the corresponding lower and upper confidence bounds.
Data Example

As an illustration of these procedures, consider the following data due to Cushny and Peebles (see Fisher, 1958, p. 121) giving the hours of sleep gained using drugs A and B on 10 patients, together with the difference B − A. This is a matched pair experiment with each subject serving as its own control.

Patient i   1     2     3     4     5     6     7     8     9     10
A           0.7  -1.6  -0.2  -1.2  -0.1   3.4   3.7   0.8   0.0   2.0
B           1.9   0.8   1.1   0.1  -0.1   4.4   5.5   1.6   4.6   3.4
B - A       1.2   2.4   1.3   1.3   0.0   1.0   1.8   0.8   4.6   1.4

If we denote the differences as x's, then x̄ = 1.58, s² = 1.513, and |T_n| = 4.06. Because t₉(0.995) = 3.25, we conclude at the 1% level of significance that the two drugs are significantly different. The 0.99 confidence interval for the mean difference μ between treatments is [0.32, 2.84]. It suggests that not only are the drugs different, but in fact B is better than A, because no hypothesis μ = μ′ < 0 is accepted at this level. (See also (4.9.3).)

4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations

We often want to compare two populations with distributions F and G on the basis of two independent samples X₁, …, X_{n₁} and Y₁, …, Y_{n₂}, one from each population. For quantitative measurements such as blood pressure, height, weight, length, volume, temperature, and so forth, it is usually assumed that X₁, …, X_{n₁} and Y₁, …, Y_{n₂} are independent samples from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. The preceding assumptions were discussed in Chapter 1; a discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6. In the control versus treatment example, suppose we wanted to test the effect of a certain drug on some biological variable (e.g., blood pressure). Then X₁, …, X_{n₁} could be blood pressure measurements on a sample of patients given a placebo, while Y₁, …, Y_{n₂} are the measurements on a sample given the drug.

Tests

We first consider the problem of testing H : μ₁ = μ₂ versus K : μ₁ ≠ μ₂. As in Section 4.9.2, this is the problem of determining whether the treatment has any effect. Let θ = (μ₁, μ₂, σ²). Then Θ₀ = {θ : μ₁ = μ₂} and Θ₁ = {θ : μ₁ ≠ μ₂}. The log of the likelihood of (X, Y) = (X₁, …, X_{n₁}, Y₁, …, Y_{n₂}) is

−(n/2) log(2πσ²) − (1/2σ²)[Σ_{i=1}^{n₁}(x_i − μ₁)² + Σ_{j=1}^{n₂}(y_j − μ₂)²],

where n = n₁ + n₂. In the problems it is shown that the likelihood function and its log are maximized over Θ by the maximum likelihood estimate θ̂ = (x̄, ȳ, σ̂²), where

σ̂² = n⁻¹[Σ_{i=1}^{n₁}(x_i − x̄)² + Σ_{j=1}^{n₂}(y_j − ȳ)²].
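Returning to the matched pair data example above, the computations can be verified numerically (a sketch using only the standard library; the quantile t₉(0.995) = 3.25 is taken from the text):

```python
import math

# Differences B - A from the Cushny-Peebles sleep data
x = [1.2, 2.4, 1.3, 1.3, 0.0, 1.0, 1.8, 0.8, 4.6, 1.4]

n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
t_n = math.sqrt(n) * xbar / math.sqrt(s2)     # test statistic for H: mu = 0

# 0.99 confidence interval, using t_9(0.995) = 3.25
half = 3.25 * math.sqrt(s2 / n)
print(round(xbar, 2), round(t_n, 2), (round(xbar - half, 2), round(xbar + half, 2)))
# prints: 1.58 4.06 (0.32, 2.84)
```

This reproduces x̄ = 1.58, |T_n| = 4.06, and the interval [0.32, 2.84] quoted in the text.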
When μ₁ = μ₂ = μ, our model reduces to the one-sample model of Section 4.9.2, and the maximum of p over Θ₀ is obtained for θ = (μ̂, μ̂, σ̂₀²), where

μ̂ = (n₁x̄ + n₂ȳ)/n,  σ̂₀² = n⁻¹[Σ_{i=1}^{n₁}(x_i − μ̂)² + Σ_{j=1}^{n₂}(y_j − μ̂)²].

If we use the identities

Σ_{i=1}^{n₁}(x_i − μ̂)² = Σ_{i=1}^{n₁}(x_i − x̄)² + n₁(x̄ − μ̂)²,
Σ_{j=1}^{n₂}(y_j − μ̂)² = Σ_{j=1}^{n₂}(y_j − ȳ)² + n₂(ȳ − μ̂)²,

obtained by writing [x_i − μ̂]² = [(x_i − x̄) + (x̄ − μ̂)]² and expanding, together with

x̄ − μ̂ = n₂(x̄ − ȳ)/n,  ȳ − μ̂ = n₁(ȳ − x̄)/n,

we find that the log likelihood ratio statistic (n/2) log(σ̂₀²/σ̂²) is equivalent to the test statistic |T|, where

T = √(n₁n₂/n) (Ȳ − X̄)/s,  s² = [Σ_{i=1}^{n₁}(X_i − X̄)² + Σ_{j=1}^{n₂}(Y_j − Ȳ)²]/(n − 2).

To complete specification of the size α likelihood ratio test, we show that T has a T_{n−2} distribution when μ₁ = μ₂. As usual, X̄/σ and Ȳ/σ are independent and distributed as N(μ₁/σ, 1/n₁) and N(μ₂/σ, 1/n₂), while Σ(X_i − X̄)²/σ² and Σ(Y_j − Ȳ)²/σ² are independent of them and distributed as χ²_{n₁−1} and χ²_{n₂−1}, respectively. We conclude from this remark and the additive property of the χ² distribution that (n − 2)s²/σ² has a χ²_{n−2} distribution. Moreover, under H, √(n₁n₂/n)(Ȳ − X̄)/σ has a N(0, 1) distribution and is independent of (n − 2)s²/σ². Therefore, by definition, T ~ T_{n−2} under H, and the resulting two-sided, two-sample t test rejects if, and only if,

|T| ≥ t_{n−2}(1 − ½α).

Similarly, there are two one-sided tests, with critical regions T ≥ t_{n−2}(1 − α) for H : μ₂ ≤ μ₁, and T ≤ −t_{n−2}(1 − α) for H : μ₁ ≤ μ₂. We can show that these tests are likelihood ratio tests for these hypotheses, and it is also true that these procedures are of size α for their respective hypotheses. If μ₁ ≠ μ₂, T has a noncentral t distribution with noncentrality parameter √(n₁n₂/n)(μ₂ − μ₁)/σ.

Confidence Intervals

To obtain confidence intervals for μ₂ − μ₁, we naturally look at likelihood ratio tests for the family of testing problems H : μ₂ − μ₁ = Δ versus K : μ₂ − μ₁ ≠ Δ. As for the special case Δ = 0, we find a simple equivalent statistic |T(Δ)|, where

T(Δ) = √(n₁n₂/n)(Ȳ − X̄ − Δ)/s.

If μ₂ − μ₁ = Δ, then T(Δ) has a T_{n−2} distribution, and inversion of the tests leads to the interval

Ȳ − X̄ ± s√(n/n₁n₂) t_{n−2}(1 − ½α)   (4.9.3)

for μ₂ − μ₁. As in the one-sample case, one-sided tests lead to the upper and lower endpoints of the interval as level 1 − ½α likelihood confidence bounds.
Data Example

As an illustration, consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines. From past experience, it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. The results in terms of logarithms were (from Hald, 1952, p. 472):

x (machine 1)   1.845   1.790   2.042
y (machine 2)   1.583   1.627   1.282

We test the hypothesis H of no difference in expected log permeability. Here x̄ − ȳ = 0.395, s² = 0.0264, and T = 2.977. Because t₄(0.975) = 2.776, H is rejected if |T| ≥ 2.776; thus, we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. The level 0.95 confidence interval for the difference in mean log permeability is 0.395 ± 0.368. On the basis of the results of this experiment, we would select machine 2 as producing the smaller permeability and, thus, the more waterproof material. If normality holds, we can show that the selection procedure based on the level (1 − α) confidence interval has probability at most ½α of making the wrong selection.

4.9.4 The Two-Sample Problem with Unequal Variances

In two-sample problems of the kind mentioned in the introduction to Section 4.9.3, it may happen that the X's and Y's have different variances. For instance, a treatment that increases mean response may also increase the variance of the responses. As a first step we may still want to compare mean responses for the X and Y populations. Thus, we are led to a model where X₁, …, X_{n₁} and Y₁, …, Y_{n₂} are two independent N(μ₁, σ₁²) and N(μ₂, σ₂²) samples, respectively, with σ₁² > 0, σ₂² > 0. This is the Behrens–Fisher problem.

Suppose first that σ₁² and σ₂² are known. The log likelihood, except for an additive constant, is

−(1/2σ₁²)Σ_{i=1}^{n₁}(x_i − μ₁)² − (1/2σ₂²)Σ_{j=1}^{n₂}(y_j − μ₂)²,  (μ₁, μ₂) ∈ R × R.

The MLEs of μ₁ and μ₂ are, thus, μ̂₁ = x̄ and μ̂₂ = ȳ. When μ₁ = μ₂ = μ, setting the derivative of the log likelihood equal to zero yields the MLE
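The two-sample calculations in the data example above can be verified numerically (a sketch; the quantile t₄(0.975) = 2.776 is taken from the text):

```python
import math

# Log permeability from the Hald example: machine 1 (x) and machine 2 (y)
x = [1.845, 1.790, 2.042]
y = [1.583, 1.627, 1.282]

n1, n2 = len(x), len(y)
n = n1 + n2
xbar, ybar = sum(x) / n1, sum(y) / n2
# Pooled variance estimate s^2 on n - 2 degrees of freedom
pooled = (sum((v - xbar) ** 2 for v in x)
          + sum((v - ybar) ** 2 for v in y)) / (n - 2)
t = math.sqrt(n1 * n2 / n) * (xbar - ybar) / math.sqrt(pooled)

# Half-width of the 0.95 confidence interval (4.9.3), with t_4(0.975) = 2.776
half = 2.776 * math.sqrt(pooled * n / (n1 * n2))
print(round(xbar - ybar, 3), round(pooled, 4), round(t, 3), round(half, 3))
# prints: 0.395 0.0264 2.977 0.368
```

This reproduces x̄ − ȳ = 0.395, s² = 0.0264, T = 2.977, and the interval 0.395 ± 0.368 quoted in the text.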
μ̂ = (n₁x̄/σ₁² + n₂ȳ/σ₂²) / (n₁/σ₁² + n₂/σ₂²).

Next we compute

λ(x, y) = exp{ (n₁/2σ₁²)(μ̂ − x̄)² + (n₂/2σ₂²)(μ̂ − ȳ)² }.

By writing μ̂ − x̄ and μ̂ − ȳ as multiples of ȳ − x̄, it follows that the likelihood ratio test is equivalent to the statistic |D|/σ_D, where D = Ȳ − X̄ and σ_D² is the variance of D, that is,

σ_D² = Var(X̄) + Var(Ȳ) = σ₁²/n₁ + σ₂²/n₂.

D/σ_D has a N(Δ/σ_D, 1) distribution, where Δ = μ₂ − μ₁. It is natural to try to use D/σ_D as a test statistic for the one-sided hypothesis H : μ₂ ≤ μ₁, and |D|/σ_D as a test statistic for H : μ₁ = μ₂. Because σ_D² is unknown when σ₁² and σ₂² are unknown, it must be estimated. An unbiased estimate is

s_D² = s₁²/n₁ + s₂²/n₂.

For large n₁, n₂, (D − Δ)/s_D has approximately a standard normal distribution, by Slutsky's theorem and the central limit theorem, and more generally (D − Δ)/s_D can be used to generate confidence procedures. Unfortunately, the exact distribution of (D − Δ)/s_D depends on σ₁²/σ₂² for fixed n₁, n₂. For small and moderate n₁, n₂, an approximation to the distribution of (D − Δ)/s_D due to Welch (1949) works well.

Let c = s₁²/n₁s_D². Then Welch's approximation is T_k, where

k = [ c²/(n₁ − 1) + (1 − c)²/(n₂ − 1) ]⁻¹.

When k is not an integer, the critical value is obtained by linear interpolation in the t tables or using computer software. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the Behrens–Fisher problem. Wang (1971) has shown the approximation to be very good for α = 0.05 and α = 0.01, the maximum error in size being bounded by 0.003. Note that Welch's solution works whether the variances are equal or not. The LR procedure derived in Section 4.9.3, which works well if the variances are equal or n₁ = n₂, can unfortunately be very misleading if σ₁² ≠ σ₂² and n₁ ≠ n₂; see Chapter 5.

4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions

If n subjects (persons, machines, fields, mice, etc.) are sampled from a population and two numerical characteristics are measured on each case, then we end up with a bivariate random sample (X₁, Y₁), …, (X_n, Y_n). Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X, Y) have a joint bivariate normal distribution, N(μ₁, μ₂, σ₁², σ₂², ρ), with σ₁² > 0, σ₂² > 0.

Testing Independence, Confidence Intervals for ρ

The question "Are two random variables X and Y independent?" arises in many statistical studies. Some familiar examples are: X = test score on English exam, Y = test score on mathematics exam; X = average cigarette consumption per day in grams, Y = age at death; X = percentage of fat in diet, Y = cholesterol level in blood; X = weight, Y = blood pressure. If we have a sample as before and assume the bivariate normal model for (X, Y), our problem becomes that of testing H : ρ = 0.
The unrestricted maximum likelihood estimate is θ̂ = (x̄, ȳ, σ̂₁², σ̂₂², ρ̂), where

σ̂₁² = (1/n) Σ_{i=1}^n (xᵢ − x̄)²,  σ̂₂² = (1/n) Σ_{i=1}^n (yᵢ − ȳ)²,

ρ̂ = [ Σ_{i=1}^n (xᵢ − x̄)(yᵢ − ȳ) ] / (n σ̂₁ σ̂₂).

Here ρ̂ is called the sample correlation coefficient and satisfies −1 ≤ ρ̂ ≤ 1 (Problem 2.1.8). When ρ = 0 we have two independent samples, and θ̂₀ can be obtained by separately maximizing the likelihood of X₁, ..., Xₙ and that of Y₁, ..., Yₙ. We have θ̂₀ = (x̄, ȳ, σ̂₁², σ̂₂², 0), and the log of the likelihood ratio statistic becomes

log λ(x) = −(n/2) log(1 − ρ̂²).   (4.9.4)

Thus, log λ(x) is an increasing function of ρ̂², and the likelihood ratio tests reject H for large values of |ρ̂|.

To obtain critical values we need the distribution of ρ̂ or an equivalent statistic under H. Because (U₁, V₁), ..., (Uₙ, Vₙ) is a sample from the N(0, 0, 1, 1, ρ) distribution, where Uᵢ = (Xᵢ − μ₁)/σ₁ and Vᵢ = (Yᵢ − μ₂)/σ₂, the distribution of ρ̂ depends on ρ only. If ρ = 0, then

Tₙ = √(n − 2) ρ̂ / √(1 − ρ̂²)   (4.9.5)

has a T_{n−2} distribution. Because |Tₙ| is an increasing function of |ρ̂|, the two-sided likelihood ratio tests can be based on |Tₙ| and the critical values obtained from Table II. There is no simple form for the distribution of ρ̂ (or Tₙ) when ρ ≠ 0; however, the distribution of ρ̂ is available on computer packages, and a normal approximation is available (see Chapter 5). Qualitatively, the power function of the LR test is symmetric about ρ = 0 and increases continuously from α to 1 as |ρ| goes from 0 to 1. Therefore, for any α, if we specify indifference regions, we can control probabilities of type II error by increasing the sample size.
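The statistics ρ̂ and Tₙ of this section can be sketched in a few lines. This is an illustrative sketch with our own function names, not code from the text:

```python
import math

def sample_correlation(x, y):
    """rho_hat = sum((x_i - xbar)(y_i - ybar)) / (sqrt(sum sq_x) * sqrt(sum sq_y))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_n(x, y):
    """T_n = sqrt(n - 2) * rho_hat / sqrt(1 - rho_hat^2); under H: rho = 0 in
    the bivariate normal model this has a t distribution with n - 2 df."""
    n, r = len(x), sample_correlation(x, y)
    return math.sqrt(n - 2) * r / math.sqrt(1 - r ** 2)
```

Since |Tₙ| is increasing in |ρ̂|, either statistic can be referred to the t tables to run the two-sided test.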
One-Sided Tests

In many cases, only one-sided alternatives are of interest. For instance, if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood, we would test H : ρ = 0 versus K : ρ > 0 or H : ρ ≤ 0 versus K : ρ > 0. It can be shown that ρ̂ is equivalent to the likelihood ratio statistic for testing H : ρ ≤ 0 versus K : ρ > 0, and similarly that −ρ̂ corresponds to the likelihood ratio statistic for H : ρ ≥ 0 versus K : ρ < 0. Because the distribution of ρ̂ depends only on ρ, we obtain size α tests for each of these hypotheses by setting the critical value so that the probability of type I error is α when ρ = 0. The power functions of these tests are monotone.

Confidence Bounds and Intervals

Usually testing independence is not enough and we want bounds and intervals for ρ giving us an indication of what departure from independence is present. We can show that P_ρ[ρ̂ ≥ c] is an increasing function of ρ for fixed c (Problem 4.15). To obtain lower confidence bounds, we can start by constructing size α likelihood ratio tests of H : ρ = ρ₀ versus K : ρ > ρ₀. These tests can be shown to be of the form "Accept if, and only if, ρ̂ ≤ c(ρ₀)," where P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = 1 − α. We obtain c(ρ) either from computer software or by the approximation of Chapter 5. Because c can be shown to be monotone increasing in ρ, inversion of this family of tests leads to level 1 − α lower confidence bounds. We can similarly obtain level 1 − α upper confidence bounds and, by putting two level 1 − ½α bounds together, we obtain a commonly used confidence interval for ρ.

Data Example

As an illustration, consider the following bivariate sample of weights xᵢ of young rats at a certain age and the weight increase yᵢ during the following week. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis H : ρ = 0. [The table of (xᵢ, yᵢ) values is not legible in this reproduction.]
These intervals do not correspond to the inversion of the size α LR tests of H : ρ = ρ₀ versus K : ρ ≠ ρ₀, but rather of the "equal-tailed" test that rejects if, and only if, ρ̂ ≥ d(ρ₀) or ρ̂ ≤ c(ρ₀), where P_{ρ₀}[ρ̂ ≥ d(ρ₀)] = P_{ρ₀}[ρ̂ ≤ c(ρ₀)] = ½α. However, for large n the equal-tailed and LR confidence intervals approximately coincide with each other.

For the rat-weight data we find ρ̂ = 0.18; using the two-sided test and referring to the tables, we find that there is no evidence of correlation: the p-value is bigger than 0.1.

Summary. We find the likelihood ratio tests and associated confidence procedures for four classical normal models:

(1) Matched pair experiments, in which differences are modeled as N(μ, σ²) and we test the hypothesis that the mean difference μ is zero. The likelihood ratio test is equivalent to the one-sample (Student) t test.

(2) Two-sample experiments, in which two independent samples are modeled as coming from N(μ₁, σ²) and N(μ₂, σ²) populations, respectively. We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the two-sample (Student) t test.

(3) Two-sample experiments, in which two independent samples are modeled as coming from N(μ₁, σ₁²) and N(μ₂, σ₂²) populations, respectively. When σ₁² and σ₂² are known, the likelihood ratio test is equivalent to the test based on |Ȳ − X̄|/σ_D, where σ_D is the standard deviation of D = Ȳ − X̄. When σ₁² and σ₂² are unknown, we use |Ȳ − X̄|/s_D, where s_D is an estimate of σ_D; approximate critical values are obtained using Welch's t distribution approximation.

(4) Bivariate sampling experiments, in which we have two measurements X and Y on each case in a sample of n cases. We test the hypothesis that X and Y are independent and find that the likelihood ratio test is equivalent to the test based on |ρ̂|, where ρ̂ is the sample correlation coefficient. We also find that the likelihood ratio statistic is equivalent to a t statistic with n − 2 degrees of freedom.

4.10 PROBLEMS AND COMPLEMENTS

Problems for Section 4.1

1. Suppose that X₁, ..., Xₙ are independently and identically distributed according to the uniform distribution U(0, θ). Let Mₙ = max(X₁, ..., Xₙ) and let

δ_c = 1 if Mₙ ≥ c, δ_c = 0 otherwise.

(a) Compute the power function of δ_c and show that it is a monotone increasing function of θ.

(b) In testing H : θ ≤ ½ versus K : θ > ½, what choice of c would make δ_c have size exactly 0.05?

(c) Draw a rough graph of the power function of the δ_c specified in (b) when n = 20.

(d) How large should n be so that the δ_c specified in (b) has power 0.98 at a specified alternative θ > ½?

(e) If in a sample of size n = 20 the observed value of Mₙ is m, what is the p-value?

2. Let X₁, ..., Xₙ denote the times in days to failure of n similar pieces of equipment. Assume the model where X = (X₁, ..., Xₙ) is an E(λ) sample, and consider the hypothesis H that the mean life 1/λ = μ satisfies μ ≤ μ₀.
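The power computation in Problem 1 can be checked numerically. In the U(0, θ) model, P_θ(Mₙ < c) = (c/θ)ⁿ for c ≤ θ, so the power of δ_c is β(θ) = 1 − (c/θ)ⁿ for θ ≥ c and 0 otherwise; the size-0.05 choice in part (b) solves 1 − (2c)ⁿ = 0.05. A minimal sketch (our own naming, not from the text):

```python
def power(theta, c, n):
    """beta(theta) = P_theta(M_n >= c) = 1 - (c/theta)^n for theta >= c, else 0."""
    return 0.0 if theta < c else 1.0 - (c / theta) ** n

# Size exactly 0.05 at theta = 1/2: solve 1 - (c / 0.5)^n = 0.95... i.e.
# (2c)^n = 0.95, so c = 0.5 * 0.95**(1/n).
n = 20
c = 0.5 * 0.95 ** (1.0 / n)
```

Evaluating `power` on a grid of θ values gives the rough graph asked for in part (c).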
(a) Use the result of Problem B.3.4 to show that the test with critical region [X̄ ≥ μ₀ x(1 − α)/2n], where x(1 − α) is the (1 − α)th quantile of the χ²_{2n} distribution, is a size α test.

(b) Give an expression for the power in terms of the χ²_{2n} distribution function.

(c) Use the central limit theorem to show that Φ[μ₀ z(α)/μ + √n(μ − μ₀)/μ] is an approximation to the power of the test in part (a). Draw a graph of the approximate power function.
Hint: Approximate the critical region by [X̄ ≥ μ₀(1 + z(1 − α)/√n)].

(d) The following are days until failure of air monitors at a nuclear plant. If μ₀ = 25, is H rejected at level α = 0.05? Give a normal approximation to the significance probability.

Days until failure: 31 50 40 34 32 37 34 23 16 51 41 50 27 46 27 10 30 37

3. Let X₁, ..., Xₙ be a P(θ) sample.

(a) Use the MLE X̄ of θ to construct a level α test for H : θ ≤ θ₀ versus K : θ > θ₀.

(b) Show that the power function of your test is increasing in θ.

(c) Give an approximate expression for the critical value if n is large and θ not too close to 0 or ∞.
Hint: Use the central limit theorem.

4. Let X₁, ..., Xₙ be a sample from a population with the Rayleigh density

f(x, θ) = (x/θ²) exp{−x²/2θ²}, x > 0, θ > 0.

(a) Construct a test of H : θ = 1 versus K : θ > 1 with approximate size α using a complete sufficient statistic for this model.
Hint: Use the central limit theorem for the critical value.

(b) Check that your test statistic has greater expected value under K than under H.

5. Establish (4.1.3). Assume that F₀ and F are continuous.

6. Show that if H is simple and the test statistic T has a continuous distribution, then the p-value α(T) has a uniform, U(0, 1), distribution.

7. Suppose that T₁, ..., T_r are independent test statistics for the same simple H and that each T_j has a continuous distribution. Let α(T_j) denote the p-value for T_j. Show that

T = −2 Σ_{j=1}^r log α(T_j)

has a χ²_{2r} distribution under H.
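Fisher's method of combining p-values, as in the last problem above, is easy to check by simulation: under H each p-value is U(0, 1), so −2 log α(T_j) is χ²₂ and the sum over r statistics is χ²_{2r}, with mean 2r. A quick Monte Carlo sanity check (illustrative sketch, not from the text):

```python
import math
import random

random.seed(0)
# Under H each p-value alpha(T_j) ~ U(0,1); -2*log alpha(T_j) ~ chi^2_2, and
# the sum over r independent statistics ~ chi^2_{2r}, whose mean is 2r.
r, reps = 5, 20000
fisher = [-2.0 * sum(math.log(random.random()) for _ in range(r))
          for _ in range(reps)]
mean = sum(fisher) / reps   # should be close to 2r = 10
```

The empirical mean settling near 2r (and the empirical variance near 4r) is consistent with the χ²_{2r} claim.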
d.5. 10. . then the power of the Kolmogorov test tends to 1 as n .T(B+l) denote T.. = ~ I..) Evaluate the bound Pp(lF(x) ... Express the Cramervon Mises statistic as a sum.a)/b.. 9.2. Let X I.5. . b > 0. and let a > O. . That is. . B... .p(Fo(x))lF(x) . 80 and x = 0.. . Hint: If H is true T(X). Use the fact that T(X) is equally likely to be any particular order statistic.Fo(x)1 foteach..p(F(x)IF(x) . let .l) (Tn). 1) to (0. to get X with distribution .1.5.Section 4. (b) Suppose Fo isN(O. (b) Use part (a) to conclude that LN(". (b) When 'Ij. Let T(l).Fo(x)IOdFo(x) . Vw.1. ~ ll.u. = (Xi . T(l). (a) Show that the statistic Tn of Example 4.p(Fo(x))IF(x) . In Example 4. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.o sup. .o T¢.T.o J J x . X n be d. with distribution function F and consider H : F = Fo. . .p(u) be a function from (0.)(Tn ) = LN(O. then T.10.X(B) from F o on the computer and computing TU) = T(XU)). is chosen 00 so that 00 x 2 dF(x) = 1.12(b).. n (c) Show that if F and Fo are continuous and FiFo.T(X(l))..Fo(x) 10 U¢.o: is called the Cramervon Mises statistic. Define the statistics S¢. j = 1. T(X(B)) is a sample of size B + 1 from La.00).(X') = Tn(X).J(u) = 1 and 0: = 2... (This is the logistic distribution with mean zero and variance 1. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x). and 1. Here. . Fo(x)1 > kol x ~ Hint: D n > !F(x) . .6 is invariant under location and scale.10 Problems and Complements 271 8. 1. .5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). .. 00. 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/.Fo(x)1 > k o ) for a ~ 0. (In practice these can be obtained by drawing B independent samples X(I). Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T. . 
if X.p(F(x»!F(x) .Fo(x)IOdF(x).Fo• generate a U(O.Fo(x)IO x ~ ~ sup.o V¢.. T(B) be B independent Monte Carlo simulated values of T. T(B) ordered. (a) For each of these statistics show that the distribution under H does not depend on F o. . .1.) Next let T(I).
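The Monte Carlo test in the problem above — reject H iff the observed T exceeds the (B + 1 − m)th order statistic of B simulated null values, giving level α = m/(B + 1) when T is continuous under H — can be sketched as follows (our own function names; an illustrative sketch, not code from the text):

```python
import random

random.seed(1)

def monte_carlo_test(t_obs, simulate_t, B, alpha):
    """Reject H iff t_obs > T_(B+1-m), the (B+1-m)-th order statistic of B
    simulated null values, where m = alpha*(B+1); the level is m/(B+1)."""
    m = int(alpha * (B + 1))
    sims = sorted(simulate_t() for _ in range(B))
    return t_obs > sims[B - m]   # sims[B - m] is the (B+1-m)-th smallest value

# Example: T ~ N(0,1) under H, observed t = 3.5, B = 999, alpha = 0.05
reject = monte_carlo_test(3.5, lambda: random.gauss(0.0, 1.0), 999, 0.05)
```

With B = 999 and α = 0.05 this compares t against the 950th of 999 simulated values, so the level is exactly 50/1000 = 0.05.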
the UMP test is of the form I {T > c).2..0) = 20(1. I 1 (b) Show that if c > 0 and a E (0. One has signalMtonoise ratio v/ao = 2. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t). Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . 3.. f(2. (a) Show that L(x.3. You want to buy one of two systems.. and 3 occuring in the HardyWeinberg proportions f(I. take 80 = O. The first system costs $106 . j 1 i :J . = /10 versus K : J. [2N1 + N.1) satisfy Pe.. then the pvalue is U =I . then the test that rejects H if.1. . Show that EPV(O) = P(To > T).. the other has v/aQ = 1. Let 0 < 00 < 0 1 < 1. respectively. f(3. the other $105 • One second of transmission on either system COsts $103 each.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale. 5 about 14% of the time.') sample Xl. (See Problem 4.Fe(F"l(l.05.2 I i i 1. X n . I X n from this population.. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . and only if. Let To denote a random variable with distribution F o. (For a recent review of expected p values see Sackrowitz and SamuelCahn.10. ! J (c) Suppose that for each a E (0. If each time you test. Consider a population with three kinds of individuals labeled 1. ! . . i i (b) Define the expected pvalue as EPV(O) = EeU.. 2.L > J. ! .. whereas the other a I ~ a .a)) = inf{t: Fo(t) > u}../"0' Show that EPV(O) = if! ( .Fo(T). 00 ..2 and 4. Consider Examples 3. 01 ) is an increasing function of 2N1 + N.to on the basis of the N(p" .nO /.O) = 0'. I). you intend to test the satellite 100 times.1. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. let N I • N 2 • and N 3 denote the number of Xj equal to I. which is independent ofT. Suppose that T has a continuous distribution Fe. Let T = X/"o and 0 = /" .. Without loss of generality.) . .0)'. 12. 
Problems for Section 4. 2. 2N1 + N. where if! denotes the standard normal distribution.). and 3.2.0) = (1. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. i • (a) Show that if the test has level a.0). Whichever system you buy during the year.) I 1 . > cJ = a. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. 1999. Expected pvalues. Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2. where" is known. For a sample Xl. the power is (3(0) = P(U where FO 1 (u)  < a) = 1.
. I. Bd > cJ = I . and to population 0 ifT < c. then a.<l. two 5's are obtained.4 and recall (Proposition B. derive the UMP test defined by (4. Bo. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. +akNk has approx.7).1.Bd > cJ then the likelihood ratio test with critical value c is best in this sense.17).1. 0. prove Theorem 4. Find a statistic T( X.2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed. (c) Using the fact that if(N" .2. I.. 5.2.. Y) belongs to population 1 if T > c.=  Section 4. (X. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel. B" .2. Bk). where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2.. 7. . [L(X. and only if. M(n.0..10 Problems and Complements 273 four numbers are equally likely to occur (i. with probability ..o = (1. = a. In Example 4.. . 1.0196 test rejects if.2.2. . Show that if randomization is pennitted..imum probability of error (of either type) is as small as possible.Pe. Nk) ~ 4. [L(X.2.6) where all parameters are known..2..imately a N(np" na 2 ) distribution. In Examle 4.. L .2. Bo. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po.2. A fonnulation of goodness of tests specifies that a test is best if the max..1. Hint: Use Problem 4.0..4. find an approximation to the critical value of the MP level a test for this problem. Y) known to be distributed either (as in population 0) according to N(O.N. .e.1.. H : (J = (Jo versus K : (J = (J. 9. Y) and a critical value c such that if we use the classification rule.~:":2:~~=C:="::===='. find the MP test for testing 10. Upon being asked to play.. For 0 < a < I.0. . PdT < cJ is as small as possible.2. 6. 
Hint: The MP test has power at least that of the test with test function J(x) 8.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4. A newly discovered skull has cranial measurements (X. Prove Corollary 4.. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times. of and ~o '" I. (b) Find the test that is best in this sense for Example 4. +..6) or (as in population 1) according to N(I.. then the maximum of the two probabilities ofmisclassification porT > cJ. . if .
the probability of your deciding to close is also < 0. 0) = ( : ) 0'(1. . but if the arrival rate is > 15. > 1/>'0. .3.<»/2>'0 where X2n(1.3.3 Testing and Confidence Regions Chapter 4 1. Find the sample size needed for a level 0..4..3. x > O.. r p(x. Consider the foregoing situation of Problem 4.1.95 at the alternative value 1/>'1 = 15.1 achieves this? (Use the normal approximation. ]f the equipment is subject to wear. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00. You want to ensure that if the arrival rate is < 10. I i .. Show that if X I. the probability of your deciding to stay open is < 0.Xn is a sample from a Weibull distribution with density f(x. In Example 4.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.01.. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2..) 3. .(IOn x = 1•. a model often used (see Barlow and Proschan. • I i • 4.0)"'/[1.i . i' i . .<» is the (1 . Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days. 1965) is the one where Xl.. . A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . (c) Suppose 1/>'0 ~ 12. Hint: Show that Xf . • • " 5.&(>'). Let Xl. A) = c AcxC1e>'x . Suppose that if 8 < 8 0 it is not worth keeping the counter open.. that the Xi are a sample from a Poisson distribution with parameter 8.• n.Xn is a sample from a truncated binomial distribution with • I.01. Here c is a known positive constant and A > 0 is the parameter of interest. I i . . 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! . 
show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function. .. (a) Show that L~ K: 1/>. 274 Problems for Section 4.01 lest to have power at least 0. Use the normal approximation to the critical value and the probability of rejection. i 1 . the expected number of arrivals per day. hence.Xn be the times in months until failure of n similar pieces of equipment. How many days must you observe to ensure that the UMP test of Problem 4. . .
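The χ²_{2n} power calculations asked for in the exponential-lifetime problems above can be done numerically without tables: for even degrees of freedom 2n the χ² distribution function has the closed form G_{2n}(t) = 1 − e^{−t/2} Σ_{k<n} (t/2)^k/k!. A sketch under our own naming (not code from the text), for the test that rejects when ΣXᵢ ≥ x_{2n}(1 − α)/(2λ₀):

```python
import math

def chi2_cdf_even(t, df):
    """Chi^2 CDF for even df = 2n: 1 - exp(-t/2) * sum_{k<n} (t/2)^k / k!."""
    n = df // 2
    return 1.0 - math.exp(-t / 2) * sum((t / 2) ** k / math.factorial(k)
                                        for k in range(n))

def chi2_quantile_even(p, df):
    """Invert chi2_cdf_even by bisection."""
    lo, hi = 0.0, 500.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power_exponential(lam, lam0, n, alpha):
    """Power of the test rejecting when Sum X_i >= x_{2n}(1-alpha)/(2*lam0);
    under lambda, 2*lam*Sum X_i ~ chi^2_{2n}."""
    crit = chi2_quantile_even(1 - alpha, 2 * n)
    return 1.0 - chi2_cdf_even((lam / lam0) * crit, 2 * n)
```

At λ = λ₀ the power equals α, and it increases as the true mean 1/λ grows, matching the monotone power function of the UMP test.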
"" X n denote the incomes of n persons chosen at random from a certain population.8 > 80 .  . JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).Q)th quantile of the X~n distribution. 0 < B < 1.:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ . . . and let YI ..10 Problems and Complements 275 then 2. where Xl_ a isthe (1 . .1.6) is UMP for testing that the pvalues are uniformly distributed. X N }).d. Assume that Fo has ~ density fo.). 00 < a < b < 00. survival times with distribution Fa. X 2. 8. Suppose that each Xi has the Pareto density f(x. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a. Show how to find critical values.6. P().2 to find the mean and variance of the optimal test statistic. suppose that Fo(x) has a nonzero density on some interval (a.) It follows that Fisher's method for cQmbining pvalues (see 4.6.5. X 2 . x> c where 8 > 1 and c > O.. and consider the alternative with distribution function F(x. (Ii) Show that if we model the distribotion of Y as C(min{X I . which is independent of XI. Let Xl. (b) Find the optimal test statistic for testing H : JL = JLo versus K . 7.Fo(y)l"..12. b).B) = Fi!(x).d.1..O) = cBBx(l+BI.. . 6. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo. Hint: Use the results of Theorem 1. then P(Y < y) = 1 e  . (i) Show that if we model the distribution of Y as C(max{X I . against F(u)=u B. (b) NabeyaMiura Alternative. (a) Lehmann Alte~p.tive. X N e>. > 1.. ~ > O. .1. (a) Express mean income JL in terms of e. survival times of a sample of patients receiving an experimental treatment. A > O.Fo(y)  }).Section 4. random variable.6.. Find the UMP test. > po..•• of Li. To test whether the new treatment is beneficial we test H : ~ < 1 versus K : . y > 0. 1 ' Y > 0. In Problem 1.. . imagine a sequence X I. (See Problem 4. 
In the goodnessoffit Example 4. A > O. Y n be the Ll.O<B<1.~) = 1. .1.[1. For the purpose of modeling.. then e.. we derived the model G(y.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0. Let N be a zerotruncated Poisson. .
1 . Show that under the assumptions of Theorem 4.. x > 0. 'i .O)d..exp( x). Problem 2. B> O. n.j "'" .. . every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t. Oo)d. 0.3. We want to test whether F is exponential. .... Using a pivot based on 1 (Xi . The numerator is an increasing function of T(x).01.2 the class of all Bayes tests is complete. Assume that F o has a density !o(Y).). with distribution function F(x). where the €i are independent normal random variables with mean 0 and known variance . I Hint: Consider the class of all Bayes tests of H : () = . /. Show that the UMP test is based on the statistic L~ I Fo(Y.. 10. Problems for Section 4.(O)/ 16' I • 00 p(x. > Er • . 12. L(x. Show that the test is not UMP.4 1.(O) 6 .. . Show that under the assumptions of Theorem 4. O)d.L and unknown variance (T2. Let Xi (f)/2)t~ + €i. eo versus K : B = fh where I i I : 11.. Hint: A Bayes test rejects (accepts) H if J .d.exp( _x B). F(x) = 1 . 0.'? = 2.2.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y....0". ... 00 p(x.3. .{Od varies between 0 and 1. i = 1.1 and 01 loss.52.(O) L':x.. F(x) = 1 . 0) e9Fo (Y) Fo(Y).X)2. Let Xl. What would you . e e at l · 6. or Weibull.3. J J • 9. 0 = 0. . 1 X n be i. x > 0.(O) «) L > 1 i The lefthand side equals f6". Show that under the assumptions of Theorem 4.6t is UMP for testing H : 8 80 versus K : 8 < 80 . .' " 2. j 1 To see whether the new treatment is beneficial.' L(x.{Oo} = 1 . (a) Show how to construct level (1 . 1 • .i. Let Xl. Oo)d. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1.' (cf.2.0 e6 1 0~0. 1 X n be a sample from a normal population with unknown mean J. n announce as your level (1 .  1 j .0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi . we test H : {} < 0 versus K : {} > O.X)' = 16. the denominator decreasing.1).0) UeB for '.. = . I .
< a.01 [X . Suppose that in Example 4. I")/so. Use calculus. where 8. N(Ji" 17 2) and al + a2 < a.IN] .q(X)] is a level (1 . Show that if Xl.1 Show that.. Compare yonr result to the n needed for (4.3 we know that 8 < O.) Hint: Use (A. of the form Ji.c. then [q(X).1] if 8 > 0..n is obtained by taking at = Q2 = a/2 (assume 172 known). ../N. X!)/~i 1tl ofB. .4. .3). ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0. 0.3).Section 4.1)] if 8 given by (4. < 0.. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 .10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 . then the shortest level n (1 .~a)/ .1. Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2. but we may otherwise choose the t! freely. distribution. find a fixed length level (b) ItO < t i < I.d. .a) interval of the fonn [ X .z(1 . Show that if q(X) is a level (1 .2..1..no further observations.~a)!. 0. X + z(1 . has a 1.7). (Define the interval arbitrarily if q > q.. . i = 1...4.(al + (2)) confidence interval for q(8). Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a. although N is random. X+SOt no l (1. t. [0. .SOt no l (1.02. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4.IN(X  = I:[' lX.~a) /d]2 .. minI ii..4. 6.a. with X .IN.4.4.. Stein's (1945) twostage procedure is the following.a) confidence interval for B. (a) Justify the interval [ii. 5. It follows that (1 .1.] .X are ij. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji.(2) " .(X) = X .1) based on n = N observations has length at most l for some preassigned length l = 2d.l.(2) UCB for q(8). n. Suppose we want to select a sample size N such that the interval (4.n . there is a shorter interval 7.al).1. Then take N . Let Xl. . 
what values should we use for the t! so as to make our interval as short as possible for given a? 3.) LCB and q(X) is a level (1.a. with N being the smallest integer greater than no and greater than or equal to [Sot".Xn be as in Problem 4.
1947. .a) confidence interval for (1/1'). Hence. (a) If all parameters are unknown. Show that these two quadruples are each sufficient. we may very likely be forced to take a prohibitively large number of observations. d = i II: . X n1 and Yll .0)/[0(1. (c) If (12. 11.) I ~i . exhibit a fixed length level (1 . (J'2 jk) distribution. . . and the fundamental monograph of Wald. " 1 a) confidence . .95 confidence interval for (J..18) to show that sin 1( v'X)±z (1 . it is necessary to take at least Z2 (1 (12/ d 2 observations. (a) Show that X interval for 8. Let 8 ~ B(n.~a)].3.~ a) upper and lower bounds.O). 1"2.. Let 8 ~ B(n. I 1 j Hint: [O(X) < OJ = [... find ML estimates of 11. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. if (J' is large.  10..l4. 0) and X = 81n. X has a N(J1. Y n2 be two independent samples from N(IJ. respectively.jn(X ./3z (1.) (lr!d observations are necessary to achieve the aim of part (a). in order to have a level (1 . 1986.6. (a) Use (A.O)]l < z (1. 4().a:) confidence interval of length at most 2d wheri 0'2 is known.a) interval defined by (4.3) are indeed approximate level (1 . T 2 I I' " I'.0') for Jt of length at most 2d.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi. I . (a) Show that in Problem 4. and. = 0.') populations. v. (12 2 9.. i (b) What would be the minimum sample size in part (a) if ().. 0. (1  ~a) / 2. (b) Exhibit a level (1 .a) confidence interval for sin 1 (v'e). . Suppose that it is known that 0 < ~. 0") distribution and is independent of sn" 8. Show that ~(). The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook. sn. is _ independent of X no ' Because N depends only on sno' given N = k.fN(X .. ±.!a)/ 4.jn is an approximate level (1  • 12.3. 
(12) and N(1/.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a).051 (c) Suppose that n (12 = 5. 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 . Hint: Set up an inequality for the length and solve for n. use the result in part (a) to compute an approximate level 0. . Show that the endpoints of the approximate level (1 . . By Theorem B. Indicate what tables you wdtdd need to calculate the interval..001.jn is an approximate level (b) If n = 100 and X = 0. Let XI. but we are sure that (12 < (It. (The sticky point of this approach is that we have no control over N.: " are known. (12.4.1.1') has a N(o. > z2 (1  is not known exactly..4.
1) t z{1 . (a) Show that x( "d and x{1. (c) A level 0.7). give a 95% confidence interval for the true proportion of cures 8 using (al (4.~Q).J1.I).2. . 14. find (a) A level 0.<. = 3.2.10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.3).Q confidence interval for cr 2 . Assuming that the sample is from a /V({L.). If S "' 8(64.D andn(X{L)2/ cr 2 '" = r (4.)/01' ~ (J1. K.9 confidence interval for cr.9 confidence interval for {L. Compare them to the approximate interval given in part (a).30.4. Hint: By B. (In practice.) Hint: (n . 0). (b) A level 0. .4. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution.ia). Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.IUI.n} and use this distribution to find an approximate 1 . (n _1)8'/0' ~ X~l = r(~(nl).4 and the fact that X~ = r (k. 1) random variables. and 10 000.J1. K. .4.3.3. and X2 = X n _ I (1 . Hint: Let t = tn~l (1 . 100.1 is known. Show that the confidence coefficient of the rectangle of Example 4. and the central limit theorem as given in Appendix A.4. Now use the law of large numbers.027 13. In the case where Xi is normal. XI 16. then BytheproofofTheoremB. V(o') can be written as a sum of squares of n 1 independentN(O.".J1.5 is (1 _ ~a) 2. 15.) can be approximated by x(" 1) "" (n ."') .9 confidence interval for {L x= + cr. but assume that J1.)4 < 00 and that'" = Var[(X. and (b) (4.)' . Xl = Xn_l na).2 is known as the kurtosis coefficient.' Now use the central limit theorem. Z. cr 2 ) population. is replaced by its MOM estimate.2. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.' = 2:.J1. Find the limit of the distribution ofn.4/0.1) + V2( n . it equals O. ~).4 = E(Xi .1)8'/0') . See Problem 5. Slutsky's theorem.3.1 and.I) + V2( n1) t z( "1) and x(1 .t ([en .1.Section 4.n(X . known) confidence intervals of part (b) when k = I. In Example 4. 
(c) Suppose Xi has a X~ distribution.' 1 (Xi . Hint: Use Problem 8. 4) are independent.3. Now the result follows from Theorem 8.(n . See A.".4 ) . Compute the (K.)'. 10.8).3. (d) A level 0.4.9 confidence region for (It.
Assume that f(t) > 0 ifft E (a. )F(x)[l. .4.I I 280 Testing and Confidence Regions Chapter 4 17. 18. Consider Example 4. as X and that X has density f(t) = F' (t). (c) For testing H o : F = Fo with F o continuous.u) confidence interval for 1". " • j .4. X n are i. we want a level (1 .3 for deriving a confidence interval for B = F(x). It follows that the binomial confidence intervals for B in Example 4.F(x)1 )F(x)[1 ..i.F(x)]dx.d. define .7. verify the lower bounary I" given by (4. U(O.1.F(x)1 . distribution function. (a) For 0 < a < b < I.1 (b) V1'(x)[l.1. I' .u.05 and .4. < xl has a binomial distribotion and ~ vnl1'(x) . . i 1 .a) confidence interval for F(x).95.6.F(x)l.4. 19. find a level (I .1'(x)] Show that for F continuous.6 with .F(x)1 is the approximate pivot given in Example 4. In this case nF(l') = #[X. !i .11 . b) for some 00 < a < 0 < b < 00.r fixed. Show that for F continuous where U denotes the uniform. • i. In Example 4. Suppose Xl. define An(F) = sup {vnl1'(X) . (a) Show that I" • j j j = fa F(x)dx + f..9) and the upper boundary I" = 00.4..I Bn(F) = sup vnl1'(x) .3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I . 1'1(0) < X < 1'. That is. (b)ForO<o<b< I. . pl(O) < X <Fl(b)}.F(x)1 Typical choices of a and bare . . I (b) Using Example 4.l). indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.~u) by the value U a determined by Pu(An(U) < u) = 1.
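The pointwise confidence interval for F(x) used in the problems above rests on the fact that nF̂(x) has a binomial(n, F(x)) distribution, so an approximate level 1 − α interval is F̂(x) ± z(1 − ½α)√(F̂(x)(1 − F̂(x))/n). A minimal sketch (our own function name, not from the text):

```python
import math

def f_hat_interval(x, data, z=1.96):
    """Approximate 95% interval for F(x): Fhat(x) +- z*sqrt(Fhat*(1-Fhat)/n),
    using that n*Fhat(x) ~ binomial(n, F(x)); clipped to [0, 1]."""
    n = len(data)
    fhat = sum(1 for d in data if d <= x) / n
    half = z * math.sqrt(fhat * (1 - fhat) / n)
    return max(0.0, fhat - half), min(1.0, fhat + half)

lo, hi = f_hat_interval(50, list(range(1, 101)))
```

Replacing z(1 − ½α) by the Monte Carlo critical value for the Kolmogorov-type statistic, as the problem indicates, turns these pointwise intervals into simultaneous bands.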
"6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function.1 has size a for H : 0 (}"2 be < 00 .4. Let Xl. if and only if 8(X) > 00 . How small must an alternative before the size 0.Xn1 and YI .3.) samples.0. Give a 90% confidence interval for the ratio ~ of mean life times. Show that if 8(X) is a level (1 .10 Problems and Complements 281 Problems for Section 4.. and let Il. X2) = 1 if and only if Xl + Xi > c.1 O? 2.a. O ) is an increasing 2 function of + B~. X 2 be independent N( 01. Yf (1 .2).3. N( O .a) UCB for 0.13 show that the power (3(0 1 . Hint: Use the results of Problems B.. then the test that accepts.. (b) Derive the level (1.5 1.3.~a) I Xl is a confidence interval for Il.2 that the tests of H : . M n /0. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il. 3. .90? 4. test given in part (a) has power 0. What value of c givessizea? (b) Using Problems B. Experience has shown that the exponential assumption is warranted.. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 . respectively.. .5.1"2nl. Or . 5. (15 = 1.2n2 distribution.~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. Let Xl.5.th quantile of the . show that [Y f( ~a) I X.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . a (b) Show that the test with acceptance region [f(~a) for testing H : Il. and consider the prob2 + 8~ > 0 when (}"2 is known.\. is of level a for testing H : 0 > 00 . with confidence coefficient 1 . ~ 01). . Yn2 be independent exponential E( 8) and £(.. Hint: If 0> 00 . = 1 rejected at level a 0.. of l. [8(X) < 0] :J [8(X) < 00 ].2 are level 0.05.· (a) If 1(0. (a) Find c such that Oc of Problem 4.2).) denotes the o. = 0. (a) Deduce from Problem 4.1. (d) Show that [Mn . 
(e) Similarly derive the level (1 .2 = VCBs of Example 4.. < XI? < f(1.Section 4.4 and 8. for H : (12 > (15. .a) LCB corresponding to Oc of parr (a). respectively.3.. = 1 versns K : Il. 1/ n J is the shortest such confidence interval.al) and (1 .(2) together. (e) Suppose that n = 16. . .12 and B.
Let X l1 . Let C(X) = (ry : o(X. we have a location parameter family.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2).0:) LeB for 8 whatever be f satisfying our conditions.00 . Suppose () = ('T}. Thus. . Show that PIX(k) < 0 < X(nk+l)] ~ . but I(t) = I( t) for all t. >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large.  j 1 j . (e) Show directly that P.8) where () and fare unknown. X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 . (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. 0. lif (tllXi > OJ) ootherwise. T) where 1] is a parameter of interest and T is a nuisance parameter. ry) = O}.~n + ~z(l . I . ). respectively. (b) The sign test of H versus K is given by.2).2). (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?.a) confidence region for the parameter ry and con versely that any level (1 . Hint: (c) X. . N( 020g.. We are given for each possible value 1]0 of1] a level 0: test o(X. O.. 1 Ct ~r (b) Find the family aftests corresponding to the level (1. . . 'TJo) of the composite hypothesis H : ry = ryo. . . Show that the conclusions of (aHf) still hold. og are independentN (0. and 1 is continuous and positive.IX(j) < 0 < X(k)] do not depend on 1 nr I. l 7.4. (a) Shnw that C(X) is a level (l . (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing . O?. . of Example 4.Xn be a sample from a population with density f (t .a)y'n. L:J ~ ( .1 when (72 is unknown. (c) Show that Ok(o) (X.[XU) (f) Suppose that a = 2(n') < OJ and P. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. 6.a.0:) confidence interval for 11. 0.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. X. k .
Let a(S.z(1 . . a( S. Yare independent and X ~ N(v.p) = = 0 ifiX . 1) and q(O) the ray (X . This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line.Section 4.a))2 if X > z(1 .O ~ J(X.20) if 0 < 0 and. 11.5. Let P (p.2.5. 1). Thus. Y ~ N(r/. Po) is a size a test of H : p = Po. Hint: You may use the result of Problem 4. let T denote a nuisance parameter. 1). (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X . p. ' + P2 l' z(1 1 .a) = (b) Show that 0 if X < z(1 . laifO>O <I>(z(1 .5.pYI < (1 1 otherwise.[q(X) < 0 2 ] = 1.3 and let [O(S).5. Y.a) confidence interval [ry(x). Then the level (1 . That is. 00) is = 0'. hence. if 00 < O(S) (S is inconsistent with H : 0 = 00 ). ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.7. Show that as 0 ranges from O(S) to 8(S).2 a) (a) Show that J(X.1) is unbiased. Y.2a) confidence interval for 0 of Example 4.a). alll/. Let X ~ N(O.2. or). that suP.00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H.7. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'. 0) ranges from a to a value no smaller than 1 .a. p)} as in Problem 4.a) .Y. 7)).5. Suppose X. Establish (iii) and (iv) of Example 4. Let 1] denote a parameter of interest.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4. (b) Describe the confidence region obtained by inverting the family (J(X. Show that the Student t interval (4. 12.10 Problems and Complements 283 8.z(l. and let () = (ry. Note that the region is not necessarily an interval or ray. 10. the quantity ~ = 8(8) .a). 9.1. 8( S)] be the exact level (1 . Define = vl1J.
Simultaneous Confidence Regions for Quantiles. That is. be the pth quantile of F. Let x p = ~ IFI (p) + Fi! I (P)]. (X(k)l X(n_I») is a level (1 . where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl. Show that Jk(X I x'. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 .p) versus K' : P(X > 0) > (1 . (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I . . k+ ' i . Show that P(X(k) < x p < X(nl)) = 1.7.6 and 4.1» . lOOx. 1 .284 Testing and Confidence Regions Chapter 4 13. (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b). Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  . . Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. I • (c) Let x· be a specified number with 0 < F(x') < 1.a) confidence interval for x p whatever be F satisfying our conditions. That is. (See Section 3. Confidence Regions for QlIalltiles. = ynIP(xp) . Let F..a) LeB for x p whatever be f satisfying our conditions. (e) Let S denote a B(n.95 could be the 95th percentile of the salaries in a certain profession l or lOOx . 0 < p < 1. P(xp < x p < xp for all p E (0. Then   P(P(x) < F(x) < P+(x)) for all x E (a. . 14.h(a).. and F+(x) be as in Examples 4.. We can proceed as follows. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}.1 and Fu 1.p)"j.. " II [I . il.p) variable and choose k and I such that 1 . Construct the interval using F. (g) Let F(x) denote the empirical distribution.05 could be the fifth percentile of the duration time for a certain disease. ! . . vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p.. iI . Let Xl. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. .p). X n be a sample from a population with continuous distribution F. F(x).F(x p)]. L.a. . it is distribution free.4. 
Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. Thus.a = P(k < S < n I + 1) = pi(1. k(a) .a.b) (a) Show that this statement is equivalent to = 1.) Suppose that p is specified.4..5. X n x*) is a level a test for testing H : x p < x* versus K : x p > x*.
X I. where F u and F t .j < F_x(x)] See also Example 4.p. where A is a placebo. (b) Express x p and .(x)] < Fx(x)] = nF1_u(Fx(x)). n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X. The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R.r" ~ inf{x: a < x < b. X n and .) < p} and.i.1. .4.. .T: a <. That is. 15... Note the similarity to the interval in Problem 4. . Hint: Let t.z. VF(p) = ![x p + xlpl. We will write Fx for F when we need to distinguish it from the distribution F_ x of X..5. LetFx and F x be the empirical distributions based on the i. X I. F(x) > pl.1.10 Problems and Complements 285 where x" ~ SUp{. Express the band in terms of critical values for An(F) and the order statistics.Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. (c) Show how the statistic An(F) of Problem 4.X n .(x) = F ~(Fx(x» . (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X. then L i==l n IIXi < X + t. L i==I n I IFx (Xi) .U with ~ U(O.x.u are the empirical distributions of U and 1 .xp]: 0 < l' < I}. ..17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p ... Suppose that X has the continuous distribution F.Section 4. . F~x) has the same distribution Show that if F x is continuous and H holds. Give a distributionfree level (1 . ~ ~ (a) Consider the test statistic x .r p in terms of the critical value of the Kolmogorov statistic and the order statistics. that is.r < b. . F f (. where p ~ F(x). I)..d.13(g) preceding.. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. Suppose X denotes the difference between responses after a subject has been given treatments A and B. the desired confidence region is the band consisting of the collection of intervals {[. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U ).
a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .1. ..Fy ): 0 < p < 1}.l (p) ~ ~[Fxl(p) . where :F is the class of distribution functions with finite support. <aJ Show that if H holds.) < Fy(x p )] = nFv (Fx (t)) under H. Also note that x = H. . where F u and F v are independent U(O. Then H is symmetric about zero.a) simultaneous confidence band 15.F x l (l .x p . It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H..) < Fx(x)) = nFv(Fx(x)).(F(x)) . It follows that if c.d. nFx(x) we set ~ 2.2A(x) ~ HI(F(x)) l 1 + vF(F(x)). F y ) has the same distribntion as D(Fu . .a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(..(x) ~ R.x (': + A(x)).F~x. < F y I (Fx(x))] i=I n = L: l[Fy (Y. Show that for given F E F. i .Jt: 0 < p < 1) for the curve (5 p (Fx.. I I I 1 1 D(Fx.) < do.) = D(Fu .1) empirical distributions. F y ) . i I f F.Fy ) = maxIFy(t) 'ER "" .) is a location parameter} of all location parameter values at F.286 Testing and Confidence Regions ~ Chapter 4 '1. To test the hypothesis H that the two treatments are equally effective. . he a location parameter as defined in Problem 3.p)] = ~[Fxl(p) + F~l:(p)]. 1 Yn be i.. = 1 i 16.".x = 2VF(F(x)).t::. let Xl.'" 1 X n be Li. (b) Consider the parameter tJ p ( F x..) : :F VF ~ inf vF(p).i. where do:. for ~. then D(Fx .5. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y . Let 8(. and by solving D( Fx. is the nth quantile of the distribution of D(Fu l F 1 . nFy(t) = 2.. Give a distributionfree level (1. i=I n 1 i t .. F y )..3..17.F1 _u).d.) ~ < F x (")] ~ ~ = nFu(Fx(x)). then D(Fx. Hint: Define H by H.:~ I1IFx(X. treatment B responses..U ). ~ ~ ~ F~x. Fx. Properties of this and other bands are given by Doksum. vt = O<p<l sup vt(p) O<p<l where [V. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ . As in Example 1. 
treatment A (placebo) responses and let Y1 . (p). Hint: nFx(t) = 2.. the probability is (1 . Fenstad and Aaberge (1977).:~ 1 1 [Fy(Y.) < Fx(t)] = nFu(Fx(t)). Moreover. where x p and YP are the pth quan tiles of F x and Fy. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R. vt(p)) is the band in part (b). then  nFy(x + A(X)) = L: I[Y. "" = Yp .x. we get a distributionfree level (1 .. Hint: Let A(x) = FyI (Fx(x)) ... . Fx(t)l· . The result now follows from the properties of B(·).:~ II[Fx(X. Let t (c) A Distdbution and ParameterFree Confidence Interval.
then by solving D(F x. . Fy.). 0 < p < 1. nFx ("') = nFu(Fx(x)). Show that for given (Fx .    Problems for Section 4.6].2.) = F (p) . Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).6+] contains the shift parameter set {O(Fx.4. we find a distributionfree level (1 . t. Let6.Xn is a sample from a r (p. F y. Q. F y ) > O(Fx " Fy). (a) Consider the model of Problem 4. F y ) E F x F. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T. O(Fx.).(. thea It a unifonnly most accurate level 1 . ~) distribution.(X). then Y' = Y. B(. Exhibit the UMA level (1 .) ~ + t. F y ) and b _maxo<p<l 6p(Fx . ~ ~ = (e) A Distribution and ParameterFree Confidence Interval.·) is a shift parameter. F x ) ~ a and YI > Y.Section 4. tt Hint: Both 0 and 8* are nonnally distributed.. ~ D(Fu •F v ).a) UCB for O. Now apply the axioms.) I i=l L tt . T n n = (22:7 I X.4.('. ) is a shift parameter} of the values of all shift parameters at (Fx .6. Let 6 = minO<1'<16p(Fx. Fx +a) = O(Fx a.0:) confidence bound forO. 3. It follows that if we set F\~•.E(X).. if I' = 1I>'.0: UCB for J.) < do: for Ll.. F y ) is in [6. Show that if 0(·.. (d) Show that E(Y) ..) . = 22:7 I Xf/X2n( a) is . and 8 are shift parameters.L. (b) Consider the unbiased estimate of O. F y.= minO<p<l 6..10 Problems and Complements 287 Moreover.6 1.. 6. Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('.0:) simultaneouS coofidence band for t. Show that n p is known and 0 O' ~ (2 2~A X. where :F is the class of distri butions with finite support.(P). . moreover X + 6 < Y' < X + 6. is called a shift parameter if O(Fx. x' 7 R.T..2uynz(1 i=! i=l all L i=l n ti is also a level (1.) 12:7 I if. t Hint: Set Y' = X + t.(Fx . Let do denote a size a critical value for D(Fu . where is nnknown. . 6+ maxO<p<1 6+(p). Suppose X I.F y ' (p) = 6. Show that for the model of Problem 4..(x) = Fy(x thea D(Fx.a) that the interval [8. 
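The two-sample statistic D(F̂_X, F̂_Y) and its Monte Carlo critical value d_α used in the preceding band problems can be sketched as follows. This is an illustration, not the book's code: NumPy and the function names are my additions, and under H the statistic is simulated from independent uniform samples, using the distribution-free property D(F̂_X, F̂_Y) ~ D(F̂_U, F̂_V) noted above.

```python
import numpy as np

def ks_two_sample(x, y):
    """D(F_X_hat, F_Y_hat) = max over t of |F_Y_hat(t) - F_X_hat(t)|,
    evaluated at the pooled sample points, where the difference can jump."""
    pooled = np.sort(np.concatenate([x, y]))
    fx = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
    fy = np.searchsorted(np.sort(y), pooled, side="right") / len(y)
    return np.max(np.abs(fy - fx))

def ks_critical_value(n1, n2, alpha=0.05, B=2000, rng=None):
    """d_alpha by Monte Carlo: under H, with F continuous, the statistic has
    the same law as D(F_U_hat, F_V_hat) for independent uniform samples."""
    rng = np.random.default_rng(rng)
    sims = [ks_two_sample(rng.uniform(size=n1), rng.uniform(size=n2))
            for _ in range(B)]
    return float(np.quantile(sims, 1.0 - alpha))
```

Rejecting H when ks_two_sample(x, y) > ks_critical_value(len(x), len(y), alpha) then gives an approximately level α distribution-free test of F_X = F_Y.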
:F x :F O(Fx. the probability is (1 . Show that B = (2 L X. Fy ).2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8. Fv ).)I L ti . then B( Fx .(x)) .Fy). F y ) < O(Fx .3. F y ). 2. . Fy ) .
X has a binomial. F1(tjdt. s). Hint: Use Examples 4.. g. Establish the following result due to Pratt (1961).' with "> ~. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for.2 and B.f:s F. e> 0.B). Xl.C. Suppose that B' is a uniformly most accurate level (I . OJ. ..[B < u < O]du.d.6.3.B') < E. 8. . establish that the UMP test has acceptance region (4. satisfying the conditions of Problem 8. Suppose that given B Pareto.aj LCB such that I . 1. 3. where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t.B).2. 0']. . E(U) = . ·'"1 1 . In Example 4.Xn are Ll.\. .a) confidence intervals such that Po[8' < B' < 0'] < p. .6 for c fixed.B(n.O] are two level (1 . (B" 0') have joint densities. density = B.6 to V = (B . 1I"(t) Xl. Hint: By Problem B. Hint: Apply Problem 4. distribution and that B has beta. (a) Show that if 8 has a beta. 5. J • 6.Bj has the F distribution F 2r . . Let U.B)+.X. • • • = t) is distributed as W( s. j .. . t > e.6.1.. (0 .l. I'.(O . distribution with r and s positive integers.i.d.1 are well defined and strictly increasing. where s = 80+ 2n and W ~ X.) and that.l2(b). s). t) is the joint density of (8.E(V) are finite. U(O. where Show that if (B. .. p.3 and 4. Suppose that given. (b) Suppose that given B = B. i . Problems for Section 4.3. .2. U = (8 .2. /3( T. B). then E(U) > E(V). Prove Corollary 4. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I . Construct unifoffilly most accurate level 1 . and that B has the = se' (t. then )" = sB(r(1 . V be random variables with d.288 Testing and Confidence Regions Chapter 4 4. s) distribution with T and s integers.12 so that F.( 0' Hint: E. n = 1.a.B')+..6. (1: dU) pes.4. . Pa( c.\ =. 0).3). G corresponding to densities j.B) = J== J<>'= p( s. uniform. [B.1.0 upper and lower confidence bounds for J1 in the model of Problem 4. '" J. t)dsdt = J<>'= P. and for (). 
then E.2" Hint: See Sections 8..W < BI = 7.3.a) upper and lower credible bounds for A.\ is distributed as V /050. respectively. • .. 2. Show that if F(x) < G(x) for all x and E(U).W < B' < OJ for allB' l' B.6. Suppose [B'. ~. Poisson."" X n are i. f3( T. s > 0.4.7 1 . P(>.
/11 and /12 are independent in the posterior distribution p(O X. r In) density. . (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll.1..2 degrees of freedom.Xm. Let Xl. . T) = (111. (d) Compare the level (1.y) is 1f(r 180)1f(/11 1 r.y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. .....r)p(y 1/12.y.r 1x. ". Yn are two independent N (111 \ 'T) and N(112 1 'T) samples.112.l.. Hint: p( 0 I x.8 1. and Y1.. y) and that the joint density of. 1 r..2. Suppose that given 8 = (ttl' J. Xl.. where (75 is known. .1 )) distribution. y) of is (Student) t with m + n . Suppose 8 has the improper priof?r(O) = l/r. . Here Xl. as X. 'P) is the N(x .Xn is observable and X n + 1 is to be predicted. (c) Give a level (1 . .x) is aN(x. .1..d. y) is aN(y. = So I (m + n  2).x)' + E(Yj y)'..0:) prediction interval for Xnt 1.s') withe' = max{c. 'T).m} and + n. .y) = 7f(r I sO)7f(LlI x . r 1x.rlm) density and 7f(/12 1r. (a) Let So = E(x. (b) Show that given r.i. (a) Give a level (1 . .Section 4. Hint: 7f(Ll 1x.0"5). Show that (8 = S IM ~ Tn) ~ Pa(c'..Xn + 1 be i. Show fonnaUy that the posteriof?r(O 1 x. 4.0:) confidence interval for B. = PI ~ /12 and'T is 7f(Ll.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B. r > O. y) is obtained by integrating out Tin 7f(Ll.y. . r) where 7f(Ll I x . Problems for Section 4. y)... respectively.r). y) is proportional to p(8)p(x 1/11. 1f(/11 1r.y.a) upper and lower credible bounds for O. In particular consider the credible bounds as n + 00.. N(J. r( TnI (c) Set s2 + n.10 Problems and Complements 289 (a)LetM = Sf max{Xj. (b) Find level (1 .Xn ).x)1f(/1. Show thatthe posterior distribution 7f( t 1 x.
.. (This q(y distribution. . Present the results in a table and a graph.nl..Xn + 1 be i.2. In Example 4. < u(n+l) denote Ul . give level 3.e. .i. which is not observable. and that (J has a beta. give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. .. 1 2.15. . x > 0.9. X n+ l are i. . . CJ2) with (J2 unknown. (a) If F is N(Jl. B(n.a (P(Y < Y) > I .Xn are observable and we want to predict Xn+l. Let X have a binomial.. distribution. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X . i! Problems for Section 4. 00 < a < b (1 . Suppose that given (J = 0.(0 I x)dO. 0). B( n...i.. Suppose that Y.8. That is.8. 0) distribution given (J = 8. has a B(m. )B(r+x+y.12. Un + 1 ordered.x ) where B(·.2) hy using the observation that Un + l is equally likely to be any of the values U(I).s+n. . " u(n+l). X is a binomial. .0'5) with 0'5 known. (3(r. . 1 X n are i.3) by doing a frequentist computation of the probability of coverage..'. and a = .Q) distribution free lower and upper prediction bounds for X n + 1 • < 00.2n distribution. where Xl.d..8. random variable.·) denotes the beta function.0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl.s+nx+my)/B(r+x. :[ = .. s). (b) If F is N(p" bounds for X n + 1 .' . as X . let U(l) < .5. A level (1 . 0 > 0.9 1. Suppose XI. b). ! " • Suppose Xl.10.11.t for M = 5.a) prediction interval for X n + l .8). give level (I . suppose Xl. • . X n such that P(Y < Y) > 1.5. n = 100. F. Let Xl. Establish (4.. ~o = 10.9.... Find the probability that the Bayesian interval covers the true mean j. ..a). . 5.Xn are observable and X n + 1 is to be predicted. Give a level (1 .290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4..d.05. 
N(Jl.a) lower and upper prediction bounds for X n + 1 .. distribution.) Hint: First show that .i.10. Then the level of the frequentist interval is 95%.d.. 0'5)' Take 0'5 ~ 7 2 = I. 4. I x) is sometimes called the Polya q(y I x) = J p(y I 0).• .x . as X where X has the exponential distribution F(x I 0) =I .'" . q(ylx)= ( . .8. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .
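The frequentist computation requested above, the probability that the Bayesian interval covers the true mean, can also be approximated by simulation. This is a sketch under the conjugate model X_i ~ N(μ, σ_0²) i.i.d. with prior μ ~ N(η_0, τ²); the simulation route, the function names, and the choice of defaults are my own, not the text's intended analytic calculation, and it assumes NumPy plus the standard-library NormalDist.

```python
import numpy as np
from statistics import NormalDist

def credible_interval(x, sigma0=1.0, eta0=0.0, tau=1.0, alpha=0.05):
    """Level (1 - alpha) posterior interval for mu in the conjugate model
    X_i ~ N(mu, sigma0^2) i.i.d., prior mu ~ N(eta0, tau^2)."""
    n = len(x)
    prec = n / sigma0**2 + 1.0 / tau**2               # posterior precision
    center = (n * np.mean(x) / sigma0**2 + eta0 / tau**2) / prec
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) / np.sqrt(prec)
    return center - half, center + half

def coverage(mu_true, n=100, B=2000, rng=None, **kw):
    """Frequentist probability that the credible interval covers mu_true
    (data generated with standard deviation 1, matching sigma0 = 1)."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(B):
        lo, hi = credible_interval(rng.normal(mu_true, 1.0, size=n), **kw)
        hits += (lo <= mu_true <= hi)
    return hits / B
```

Tabulating coverage(mu) over a grid of true means shows the coverage near the nominal level when μ is close to the prior mean η_0 and dropping as μ moves away, which is the qualitative behavior the problem asks you to display in a table and a graph.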
( x) ~ 0. log . (i) F(c. (c) Deduce that the critical values of the commonly used equaltailed test. (b) Use the nonnal approximatioh to check that CIn C'n n .Section 4. 2.. Hint: log .I)) for Tn > 0.) .(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy.log (ii' / (5)1 otherwise.3.Q).~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . 0' 4.3.(nx).10 Problems and Complements 291 is an increasing function of (2x . Xn_1 (~a). · .Xn be a N(/L. where Tn is the t statistic.. XnI (1 .~Q) also approximately satisfy (i) and (ii) of part (a). I . 0'2) sample with both JL and 0'2 unknown. OneSided Tests for Scale./(n . ~2 . (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12. let X 1l .F(CI) ~ 1.='7"nlog c l n /C2n CIn .(x) = . = aD nO' 1". = Xnl (1 . >.(x) Hint: Show that for x < !n.(x) = 0.X) aD· t=l n > C. Show that (a) Likelihood ratio tests are of the form: Reject if. 2 2 L. if ii' /175 < 1 and = .Q. 0'2 < O'~ versus K : 0'2 > O'~.(Xi .~Q) n + V2nz(l. Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise.f..C2n !' 1 as n !' 00. TwoSided Tests for Scale. where F is the d.V2nz(1 .. and only if.. We want to test H . JL > Po show that the onesided. (ii) CI ~ C2 = n logcl/C2. of the X~I distribution. onesample t test is the likelihood ratio test (fo~ Q <S ~). ° 3.. In testing H : It < 110 versus K .n) and >. Thus. and only if.... n Cj I L. In Problems 24. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B. We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. . if Tn < and = (n/2) log(1 + T.
2' (12) samples. 8. .90 confidence interval for a by using the pivot 8 2 ja z .. (a) Using the size 0: = 0. . 190.05. 0'2). find the sample size n needed for the level 0.L. a 2 ) random variables. Forage production is also measured in pounds per acre. 7. = 0 and €l. . (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power.. (b) Consider the problem aftesting H : Jl. ... €n are independent N(O.. (a) Find a level 0.Xn1 and YI .. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0. . 110.1'2. where a' is as defined in < ~. .O).. and that T is sufficient for O. (a) Show that the MLE of 0 ~ (1'1. . The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. The nonnally distributed random variables Xl.9. respectively..95 confidence interval for p. (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0..1 < J. I Yn2 be two independent N(J. The control measurements (x's) correspond to 0 pounds of mulch per acre. 100. 6. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0..Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ .z . . Y.4.01 test to have power 0. I'· (c) Find a level 0.9. Suppose X has density p(x.. 114.95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~.tl > {t2. .l.L2 versus K : j. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall. • Xi = (JXi where X o I 1 + €i. x y vi I I . . n. (X. 0 E e.90 confidence interval for the mean bloC<! pressure J. Let Xl. eo.P. Show that A(X. 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances.05 onesample t test.3. .lIl (7 2) and N(Jl. 9. 
Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T. whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre.292 Testing and Confidence Regions Chapter 4 5. e 1 ) depends on X only throngh T. Assume the onesample normal model.95 confidence interval for equaltailed tests of Problem 4.. (7 2) is Section 4. i = 1.
XZ distributions respectively. Suppose that T has a noncentral t. Consider the following model.6..2.. P. Condition on V and apply the double expectation theorem. 1 {= xj(kl)ellx+(tVx/k'I'ldx. . I). The power functions of one.2) In exp{ (I/Z(72) 2)Xi . Yl. 13.O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if. (T~).. 12. (Yl) = f PY"Y. distribution. respectively.IITI > tl is an increasing function of 161. i = 1. Define the frequency functions p(x. Y2 )dY2. and only if. . X powerful whatever be B. The F Test for Equality of Scale.'" p(X. / L:: i 10.[T > tl is an increasing function of 6. From the joint distribution of Z and V.<» (1 .[Z > tVV/k1 is increasing in 6. . with all parameters assumed unknown. Stein). Then use py.IIZI > tjvlk] is increasing in [61.. Show that the noncentral t distribution.<»1 < e < <>. = 0. .. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl.10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. Then.O) for 00 = (Z. Let Xl.6... get the joint distribution of Yl = ZI vVlk and Y2 = V. Let e consist of the point I and the interval [0.OXi_d 2} i= 1 < Xi < 00. (b) P. for each v > 0.n. . (An example due to C.. 7k. has level a and is strictly more 11.Section 4. Hint: Let Z and V be independent and have N(o.. Show that. . Fix 0 < a < and <>/[2(1 . P. has density !k .and twosided t tests. Yn2 be two independent N(P. . samples from N(J1l. (a) P.(t) = . Xo = o. 0) by the following table. II. 7k. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. X n1 . (Tn. (Yl..
I ' LX. The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. . 0.J J .9. . i=2 i=2 n n S12 =L i=2 n U.4. p) distribution.V.9.'.0 has the same dis r j . (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4. Because this conditional distribution does not depend on (U21 •. p) distribution. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. (Un. !I " .. and using the arguments of Problems B.71 240 2. then P[p > c] is an increasing function of p for fixed c.62 284 2. (11.37 384 2.4) and (4.1' Sf = L U. L Yj'. .4. .4.4.64 298 2.1 . . ..68 250 2. • .. . and only if. where f( t) is the tth quantile of the F nz . > C. (a) Show that the LR test of H : af = a~ versus K : ai > and only if.10. the continuous versiop of (B. i 0 in xi !:.' ! .7 and B. 0. distribution and that critical values can be ..1)]E(Y. .61 310 2.1)/(n.. that T is an increasing function of R. 14. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model..9.1. 1 . .nl1 ar is of the fonn: Reject if. vn 1 l i 16. using (4. x y 254 2. 15. 1. . show that given Uz = Uz .X)' (b) Show that (aUa~)F has an F nz obtained from the :F table. . .a/2) or F < f( n/2).'. note that . T has a noncentral Tnz distribution with noncentrality parameter p.1I(a).5) that 2 log '(X) V has a distribution. Finally.94 Using the likelihood ratio test for the bivariate nonnal model.24) implies that this is also the unconditional distribution. Sbow.. ! . can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . Hint: Use the transfonnations and Problem B. · . p) distribution.Y)'/E(X. a?. and use Probl~1ll4. Let R = S12/SIS" T = 2R/V1 R'.~ I . Un = Un. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. Un). (1~. 
r> (c) Justify the twosided F test: Reject H if.12 337 1. .19 315 2. . V. . Argue as in Proplem 4. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where .96 279 2. Consider the problem of testing H : p = 0 versus K : p i O.8. Yn) be a sampl~ from a bivariateN(O.~ I i " 294 Testing and Confidence Regions Chapter 4 . Let (XI. 1. F ~ [(nl . .nt 1 distribution. . <. (Xn .8.). where a. . • I " and (U" V. .7 to conclude that tribution as 8 12 /81 8 2 . si = L v. . where l _ .. F > /(1 .0 has the same distribution as R. Vn ) is a sample from a N(O. YI ).4.
Notes for Section 4. 1983. . 1974) to the critical value is <. BICKEL. (2) The theory of complete and essentially complete families is developed in Wald (1950).. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0.(t) = tl(.~.035.3). AND J. W.11. G.Section 4.1 (1) The point of view usually taken in science is that of Karl Popper [1968].15 is used here. Mathematical Theory of Reliability New York: J.[ii . Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding.398404 (975). T. Because the region contains C(X). i] versus K : 8 't [i. if the parameter space is compact and loss functions are bounded.10. G.851. 00. Data Analysis and Robustness. Apology for Ecumenism in Statistics and Scientific lnjerence. Box.3) holds for some 8 if <p 't V. and C. The term complete is then reserved for the class where strict inequality in (4. Essentially.. HAMMEL. REFERENCES BARLOW. O'CONNELL. Leonard.. Consider the bioequivalence example in Problem 3. the class of Bayes procedures is complete.01.11 Notes 295 17. AND F.3 (1) Such a class is sometimes called essentially complete. Tl . (3) A good approximation (Durbin. E.0.2. 4. E. and r.. R.9.I\ E.). (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1. respectively. P.[ii) where t = 1. . i]. Wiley & Sons. 00. Consider the cases Ii.2.. "Is there a sex bias in graduate admissions?" Science. Editors New York: Academic Press.12 1965.11 NOTES Notes for Section 4. 187. Nntes for Section 4. 1973.01 + 0. (2) In using 8(5) as a confidence hound we are using the region [8(5). T6 .1.05 and 0.o = 0. F. it also has confidence level (1 . PROSCHAN. Rejection is more definitive. see also Ferguson (1967). S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 .. Wu. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.. P. 
!.895 and t = 0.3.0.9. 4. Stephens. (a) Find the level" LR test for testing H : 8 E given in Problem 3.4 (I) If the continuity correction discussed in Section A.819 for" = 0. Box..
BROWN, L., T. CAI, AND A. DAS GUPTA, "Interval estimation for a binomial proportion," The American Statistician, 54 (2000).

DOKSUM, K., G. FENSTAD, AND R. AARBERGE, "Plots and tests for symmetry," Biometrika, 64, 473-487 (1977).

DOKSUM, K., AND G. SIEVERS, "Plotting with confidence: Graphical comparisons of two populations," Biometrika, 63, 421-434 (1976).

DURBIN, J., "Distribution theory for tests based on the sample distribution function," Regional Conference Series in Applied Math., 9, SIAM, Philadelphia, Pennsylvania (1973).

FERGUSON, T., Mathematical Statistics, A Decision Theoretic Approach. New York: Academic Press, 1967.

FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner Publishing Company, 1958.

HALD, A., Statistical Theory with Engineering Applications. New York: J. Wiley & Sons, 1952.

HEDGES, L., AND I. OLKIN, Statistical Methods for Meta-Analysis. Orlando, FL: Academic Press, 1985.

JEFFREYS, H., The Theory of Probability. Oxford: Oxford University Press, 1961.

LEHMANN, E., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

OSTERHOFF, J., AND W. VAN ZWET, "On the combination of independent test statistics," Ann. Math. Statist., 38, 659-680 (1967).

POPPER, K., Conjectures and Refutations; the Growth of Scientific Knowledge. New York: Harper and Row, 1968.

PRATT, J., "Length of confidence intervals," J. Amer. Statist. Assoc., 56, 549-567 (1961).

SACKROWITZ, H., AND E. SAMUEL-CAHN, "P values as random variables - Expected P values," The American Statistician, 53, 326-331 (1999).

STEIN, C., "A two-sample test for a linear hypothesis whose power is independent of the variance," Ann. Math. Statist., 16, 243-258 (1945).

STEPHENS, M., "EDF statistics for goodness of fit," J. Amer. Statist. Assoc., 69, 730-737 (1974).

TATE, R., AND G. KLETT, "Optimal confidence intervals for the variance of a normal distribution," J. Amer. Statist. Assoc., 54, 674-682 (1959).

WALD, A., Sequential Analysis. New York: Wiley, 1947.

WALD, A., Statistical Decision Functions. New York: Wiley, 1950.

WANG, Y., "Probabilities of the type I errors of the Welch tests," J. Amer. Statist. Assoc., 66, 605-608 (1971).

WELCH, B., "Further notes on Mrs. Aspin's tables," Biometrika, 36, 243-246 (1949).

WETHERILL, G., AND K. GLAZEBROOK, Sequential Methods in Statistics. New York: Chapman and Hall, 1986.

WILKS, S., Mathematical Statistics. New York: J. Wiley & Sons, 1962.
Chapter 5

ASYMPTOTIC APPROXIMATIONS

5.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS

Despite the many simple examples we have dealt with, closed form computation of risks in terms of known functions or simple integrals is the exception rather than the rule. Even if the risk is computable for a specific P by numerical integration in one dimension, the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain. Worse, computation even at a single point may involve high-dimensional integrals. In particular, consider a sample X1, ..., Xn from a distribution F, our setting for this section and most of this chapter. If we want to estimate μ(F) = E_F X1 and use X̄, we can write

MSE_F(X̄) = Var_F(X1)/n.   (5.1.1)

This is a highly informative formula, telling us exactly how the MSE behaves as a function of n, and calculable for any F and all n by a single one-dimensional integration. However, consider med(X1, ..., Xn) as an estimate of the population median ν(F). If n is odd, ν(F) = F⁻¹(1/2), and, if n = 2k + 1 and F has density f, we can write

MSE_F(med(X1, ..., Xn)) = ∫ (x − ν(F))² g_n(x) dx   (5.1.2)

where the density of the median is

g_n(x) = n (2k choose k) Fᵏ(x)(1 − F(x))ᵏ f(x).   (5.1.3)

Evaluation here requires only evaluation of F and a one-dimensional integration, but a different one for each n (Problem 5.1.1). Worse, the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5.1.2) and (5.1.3).

To go one step further, consider evaluation of the power function of the one-sided t test of Chapter 4. If X1, ..., Xn are i.i.d. N(μ, σ²) we have seen in Section 4.9.2 that √n X̄/S has a noncentral t distribution with parameter μ/σ and n − 1 degrees of freedom. This distribution may be evaluated by a two-dimensional integral using classical functions, and its qualitative properties are reasonably transparent. But suppose F is not Gaussian. It seems impossible to determine explicitly what happens to the power function, because the distribution of √n X̄/S requires the joint distribution of (X̄, S), and in general this is only representable as an n-dimensional integral,

P_F[√n X̄/S ≥ t] = ∫_A ∏ f(x_i) dx_i,

where

A = { (x1, ..., xn) : Σ x_i ≥ t [ (n/(n−1)) ( Σ x_i² − (1/n)(Σ x_i)² ) ]^{1/2} }.

There are two complementary approaches to these difficulties. The first, which we explore further in later chapters, is to approximate the risk function under study, R_n(F), by a qualitatively simpler to understand and easier to compute function. The other, which occupies us for most of this chapter, is to use the Monte Carlo method. In its simplest form, Monte Carlo is described as follows. Draw B independent "samples" of size n, {X_{1j}, ..., X_{nj}}, 1 ≤ j ≤ B, from F using a random number generator and an explicit form for F. Approximately evaluate R_n(F) by

R̄_B = (1/B) Σ_{j=1}^B l(F, T_n(X_{1j}, ..., X_{nj})).   (5.1.4)

By the law of large numbers, as B → ∞, R̄_B →p R_n(F). Thus, save for the possibility of a very unlikely event, just as in numerical integration, we can approximate R_n(F) arbitrarily closely. We now turn to a detailed discussion of asymptotic approximations, but will return to describe Monte Carlo and show how it complements asymptotics briefly in Example 5.3.3.

Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or, more specifically, of distributions of statistics, based on observing n i.i.d. observations X1, ..., Xn as n → ∞. Asymptotics, in this context, always refers to a sequence of statistics {T_n(X1, ..., Xn)}_{n≥1}, for instance the sequence of means {X̄_n}_{n≥1}, where X̄_n = (1/n) Σ_{i=1}^n X_i, or the sequence of medians, or it refers to the sequence of their distributions. Asymptotic statements are always statements about the sequence. The classical examples are

X̄_n →p E_F(X1) or L_F(√n(X̄_n − E_F(X1))) → N(0, Var_F(X1))

as n → ∞. We shall see later that the scope of asymptotics is much greater, but for the time being let's stick to this case as we have until now.
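The Monte Carlo recipe (5.1.4) is easy to sketch in code. The following minimal illustration is ours, not the book's: it approximates the mean squared error of the sample median when F is the standard exponential distribution, whose population median is ν(F) = log 2, using squared-error loss; the choices of n, B, and F are arbitrary.

```python
import math
import random

random.seed(1)

def monte_carlo_mse(n, B, draw, estimator, target):
    # R-bar_B = (1/B) sum_j l(F, T_n(sample_j)) with squared-error loss
    total = 0.0
    for _ in range(B):
        sample = [draw() for _ in range(n)]
        total += (estimator(sample) - target) ** 2
    return total / B

def sample_median(xs):
    ys = sorted(xs)
    n = len(ys)
    return ys[n // 2] if n % 2 else 0.5 * (ys[n // 2 - 1] + ys[n // 2])

# F = Exp(1): population median is log 2
draw = lambda: random.expovariate(1.0)
nu = math.log(2)

risk_25 = monte_carlo_mse(25, 2000, draw, sample_median, nu)
risk_100 = monte_carlo_mse(100, 2000, draw, sample_median, nu)
print(risk_25, risk_100)
```

As the text notes, each (n, F) pair requires its own simulation; the decrease of the estimated risk from n = 25 to n = 100 is the kind of qualitative behavior the asymptotic theory below makes explicit.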
Section 5. 7). n = 400. As an approximation. We interpret this as saying that.1. PF[IX n _  . then (5. (see A.2 .1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5.(Xn .1. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· .6) and (5.omes 11m'.VarF(Xd. (5.1.9) when . 1'1 > €] < 2exp {~n€'}. Thus. X n ) but in practice we act as if they do because the T n (Xl. the weak law of large numbers tells us that.15. below .l1) states that if EFIXd' < 00.25 whereas (5. say.Xn ) or £F(Tn (Xl.9) is .9) As a bound this is typically far too conservative. DOD? Similarly.1. X n )) (in an appropriate sense). Xn is approximately equal to its expectation.1. Is n > 100 enough or does it have to be n > 100.' 1'1 > €] < .' = 1 possible (Problem 5.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl. if EFIXd < 00.1. and P F • What we in principle prefer are bounds.6) gives IX11 < 1.1.10) < 1 with .14. For instance.11) . which are available in the classical situations of (5. . For € = ..1.6) for all £ > O....6) to fall. 1') < z] ~ ij>(z) (5.1. the much PFlIx n  Because IXd < 1 implies that .. the right sup PF x [r.1.1. the central limit theorem tells us that if EFIXII < 00. . ' .8) are given in Problem 5. then P [vn:(X F n . (5.1. (5.7) where 41 is the standard nonnal d.1.l) (5. by Chebychev's inequality.1.f. For instance.5) That is. .. this reads (5.10) is .1.14.1.. Similarly.. the celebrated BerryEsseen bound (A.2 hand side of (5.3).1.01.2 is unknown be.6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5.1. £ = . if EFXf < 00. The trouble is that for any specified degree of approximation. say.... for n sufficiently large. if more delicate Hoeffding bound (B.9. . .. x.4.' m (5. 
Further qualitative features of these bounds and relations to approximation (5./' is as above and .8) Again we are faced with the questions of how good the approximation is for given n.01.
consistency is proved for the estimates of canonical parameters in exponential families.di) where '(0) = 0. The estimates B will be consistent. behaves like . Section 5.8) is typically much betler than (5. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. • c F (y'n[Bn .2 deals with consistency of various estimates including maximum likelihood. It suggests that qualitatively the risk of X n as an estimate of Ji.1. (5. If they are simple. . "(0) > 0.I '1.1. (5. The arguments apply to vectorvalued estimates of Euclidean parameters. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD. • model. 1 i .12) where (T(O. . asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate.1. As we shall see.2 and asymptotic normality via the delta method in Section 5. and asymptotically normal. Yet.11) is again much too consctvative generally. d) = '(II' . (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. In particular.11) suggests.5) and quite generally that risk increases with (1 and decreases with n. although the actual djstribution depends on Pp in a complicated way.7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j.F) > N(O 1) . Asymptotics has another important function beyond suggesting numerical approximations for specific nand F.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families.1.(l) The approximation (5.' (0)( (1 / y'n)( v'27i') (Problem 5. B ~ O(F). For instance.1. Consistency will be pursued in Section 5. 
Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. I . good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. : i .1. The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. We now turn to specifics. as we have seen. quite generally. . Practically one proceeds as follows: (a) Asymptotic approximations are derived. (5.e(F)]) (1(e. even here they are not a very reliable guide. As we mentioned. Section 5. for all F in the n n . for any loss function of the form I(F. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4. Although giving us some idea of how much (5.t and 0"2 in a precise way.8) differs from the truth.3. which is reasonable.1. i .
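The contrast between (5.1.9) and (5.1.10) is easy to tabulate. The sketch below is our illustration, not the book's; the values of ε and n are arbitrary choices. It evaluates both bounds for variables with |X1| ≤ 1 (so σ² ≤ 1): Chebyshev decays like 1/n while Hoeffding decays exponentially.

```python
import math

# Bounds on P[|X-bar_n - mu| >= eps] for |X_1| <= 1, sigma^2 <= 1.
def chebyshev_bound(n, eps, sigma2=1.0):
    # right-hand side of (5.1.9), capped at the trivial bound 1
    return min(1.0, sigma2 / (n * eps * eps))

def hoeffding_bound(n, eps):
    # right-hand side of (5.1.10), capped at 1
    return min(1.0, 2.0 * math.exp(-n * eps * eps / 2.0))

rows = [(n, chebyshev_bound(n, 0.1), hoeffding_bound(n, 0.1))
        for n in (400, 2000, 10000)]
for n, c, h in rows:
    print(n, round(c, 4), round(h, 6))
```

At n = 400 the two bounds are comparable, but by n = 10,000 the exponential bound is smaller by many orders of magnitude, which is the qualitative point of (5.1.10).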
in exponential families among other results. Section 5.4 deals with optimality results for likelihood-based procedures in one-dimensional parameter models. Finally, in Section 5.5 we examine the asymptotic behavior of Bayes procedures. We also introduce Monte Carlo methods and discuss the interaction of asymptotics, Monte Carlo, and probability bounds.

Summary. Asymptotics are methods of approximating risks, distributions, and other statistical quantities that are not realistically computable in closed form, by quantities that can be so computed. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to ∞. Most asymptotic theory we consider leads to approximations that in the i.i.d. case become increasingly valid as the sample size increases.

5.2 CONSISTENCY

5.2.1 Plug-In Estimates and MLEs in Exponential Family Models

Suppose that we have a sample X1, ..., Xn from P_θ, where θ ∈ Θ, and want to estimate a real or vector q(θ). The least we can ask of our estimate q̂_n(X1, ..., Xn) is that, as n → ∞,

q̂_n →p q(θ) for all θ ∈ Θ,   (5.2.1)

which is called consistency of q̂_n and can be thought of as 0'th order asymptotics. A stronger requirement is

sup{ P_θ[|q̂_n − q(θ)| ≥ ε] : θ ∈ Θ } → 0 for all ε > 0,   (5.2.2)

which is called uniform consistency. If Θ is replaced by a smaller set K, we talk of uniform consistency over K. Bounds b(n, ε) for sup_θ P_θ[|q̂_n − q(θ)| ≥ ε] that yield (5.2.2) are preferable, and we shall indicate some of qualitative interest when we can. However, with all the caveats of Section 5.1, most asymptotic theory is stated in terms of (5.2.1) and convergence in law. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A.14, A.15, and B.7; here | · | denotes Euclidean distance. We will recall relevant definitions from that appendix as we need them.

The simplest example of consistency is that of the mean.

Example 5.2.1. Means. If X1, ..., Xn are i.i.d. P, where P is unknown but E_P|X1| < ∞, then, by the WLLN,

X̄ →p μ(P) = E_P(X1),

and, in accordance with (A.14.1) and (B.7.1), μ̂ = μ(P̂) = X̄, where P̂ is the empirical distribution, is a consistent estimate of μ(P). For P this large it is not uniformly consistent.
However, if, for instance, P = {P : E_P X1² ≤ M < ∞}, then X̄ is uniformly consistent over P, because, by Chebyshev's inequality,

P_P[|X̄ − μ(P)| ≥ ε] ≤ M/(nε²).

Other moments of X1 can be consistently estimated in the same way. ◻

Example 5.2.2. Binomial Variance. Let X1, ..., Xn be the indicators of binomial trials with P[X1 = 1] = p, 0 ≤ p ≤ 1. Then N = Σ X_i has a B(n, p) distribution, and p̂ = X̄ = N/n is a uniformly consistent estimate of p. Next consider the plug-in estimate p̂(1 − p̂)/n of the variance of p̂, which is q(p̂)/n for q(p) = p(1 − p). Its uniform consistency follows from the next theorem. ◻

To some extent the plug-in method was justified by consistency considerations, and it is not surprising that consistency holds quite generally for frequency plug-in estimates. Suppose that P = S = {(p1, ..., pk) : 0 ≤ p_j ≤ 1, Σ_{j=1}^k p_j = 1}, the k-dimensional simplex, where p_j = P[X1 = x_j], 1 ≤ j ≤ k, and {x1, ..., xk} is the range of X1. Let N_j = Σ_{i=1}^n 1(X_i = x_j) and p̂_j = N_j/n, 1 ≤ j ≤ k, and let p̂_n = (p̂1, ..., p̂k) ∈ S be the empirical distribution.

Theorem 5.2.1. Suppose that q : S → R^p is continuous. Then q̂_n = q(p̂_n) is a uniformly consistent estimate of q(p).

Proof. Because q is continuous and S is compact, it is uniformly continuous on S. Thus, for every ε > 0, there exists δ(ε) > 0 such that p, p′ ∈ S, |p′ − p| < δ(ε), implies |q(p′) − q(p)| < ε. Then

P_p[|q̂_n − q(p)| ≥ ε] ≤ P_p[|p̂_n − p| ≥ δ(ε)].

But further, by Chebyshev's inequality,

sup{ P_p[|p̂_n − p| ≥ δ] : p ∈ S } ≤ k/(4nδ²)

(see the problems), and the result follows. ◻

In fact, we can go further. Let w(q, ·), the modulus of continuity of q, be defined by

w(q, δ) = sup{ |q(p) − q(p′)| : |p − p′| ≤ δ }.   (5.2.3)

Evidently, w(q, δ) ↓ 0 as δ ↓ 0, and w(q, ·) is increasing in δ with range [a, b], say. Let w⁻¹ : [a, b] → R⁺ be defined as the inverse of w,

w⁻¹(ε) = inf{ δ : w(q, δ) ≥ ε }.   (5.2.4)

It easily follows (see the problems) that

P_p[|q̂_n − q(p)| ≥ ε] ≤ P_p[|p̂_n − p| ≥ w⁻¹(ε)] ≤ k/(4n[w⁻¹(ε)]²).   (5.2.5)
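The uniform consistency in Example 5.2.2 can be checked by simulation. In this sketch (ours, not the text's) we approximate sup_p P_p[|q(p̂) − q(p)| ≥ ε] over a grid of p for q(p) = p(1 − p); the grid, ε, and replication counts are arbitrary choices.

```python
import random

random.seed(2)

def worst_case_error_prob(n, eps, reps=400):
    # approximate sup over a grid of p of P_p[|q(p-hat) - q(p)| >= eps],
    # where q(p) = p(1-p) and p-hat is the binomial proportion
    worst = 0.0
    for p in [i / 10 for i in range(11)]:
        exceed = 0
        for _ in range(reps):
            phat = sum(random.random() < p for _ in range(n)) / n
            if abs(phat * (1 - phat) - p * (1 - p)) >= eps:
                exceed += 1
        worst = max(worst, exceed / reps)
    return worst

w_small = worst_case_error_prob(20, 0.1)
w_large = worst_case_error_prob(500, 0.1)
print(w_small, w_large)
```

The worst-case error probability shrinks with n for every p simultaneously, which is the content of uniform consistency; the k/(4nδ²) bound above is a conservative envelope for what the simulation shows.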
More generally, we have the following consequence of the weak law of large numbers.

Proposition 5.2.1. Let g = (g1, ..., gd) map X onto Y ⊂ R^d, suppose E_θ|g_j(X1)| < ∞, 1 ≤ j ≤ d, for all θ, and let m_j(θ) = E_θ g_j(X1), m = (m1, ..., md). Let q(θ) = h(m(θ)), where h : Y → R^p is continuous. Then

q̂_n = h(ḡ), where ḡ = (1/n) Σ_{i=1}^n g(X_i),

is a consistent estimate of q(θ).

Proof. We need only apply the general weak law of large numbers (for vectors) to conclude that

ḡ →p m(θ),   (5.2.6)

and then use the fact that D_n →p D implies h(D_n) →p h(D) for all h continuous at D. ◻

More generally, if ν(P) = h(E_P g(X1)) and P = {P : E_P|g(X1)| < ∞}, then, if h is continuous, ν̂ = h(ḡ) = ν(P̂), where P̂ is the empirical distribution, is consistent for ν(P).

Example 5.2.3. Variances and Correlations. Let X_i = (U_i, V_i), 1 ≤ i ≤ n, be i.i.d. N₂(μ1, μ2, σ1², σ2², ρ), with σ1² > 0, σ2² > 0, |ρ| < 1. Let g(u, v) = (u, v, u², v², uv), so that (1/n) Σ_{i=1}^n g(U_i, V_i) is the statistic generating this 5-parameter exponential family. The empirical means, variances, and correlation coefficient are continuous functions h of ḡ; we may, thus, conclude by Proposition 5.2.1 that they are all consistent. Questions of uniform consistency, and of consistency when P = { distributions such that EU1² < ∞, EV1² < ∞, Var(U1) > 0, Var(V1) > 0, |Corr(U1, V1)| ≤ 1 }, are discussed in the problems. ◻

Here is a general consequence of Proposition 5.2.1.

Theorem 5.2.2. Suppose P is a canonical exponential family of rank d generated by T, with η, E, and A(·) corresponding to P as in Section 1.6, and suppose E is open. Let X1, ..., Xn be a sample from P_η ∈ P. Then

(i) P_η[the MLE η̂ exists] → 1;

(ii) η̂ is consistent.

Proof. Recall from Theorem 2.3.1 that η̂(X1, ..., Xn), the MLE, exists iff T̄_n = (1/n) Σ_{i=1}^n T(X_i) belongs to the interior C_T of the convex support of the distribution of T̄_n. If η₀ is true, t₀ = Ȧ(η₀) = E_{η₀} T(X1) must belong to the interior of the convex support, because the equation Ȧ(η) = t₀ is solved by η₀. By definition of the interior of the convex support, there exists a ball S_δ = {t : |t − t₀| < δ} ⊂ C_T. By the law of large numbers,

(1/n) Σ_{i=1}^n T(X_i) →p t₀,   (5.2.7)

hence P_{η₀}[(1/n) Σ_{i=1}^n T(X_i) ∈ C_T] → 1. Because η̂ exists iff the event in (5.2.7) occurs, (i) follows. Moreover, when it exists, η̂ solves

Ȧ(η) = (1/n) Σ_{i=1}^n T(X_i).

On E the map η → Ȧ(η) is 1-1 and continuous, and, by a classical result (see, for example, Rudin, 1987), the inverse Ȧ⁻¹ : Ȧ(E) → E is continuous in a neighborhood of t₀. The consistency (ii) now follows from (5.2.7) and Proposition 5.2.1. ◻

The argument of the previous paragraphs, in which a minimum contrast estimate, the MLE, is a continuous function of a mean of i.i.d. vectors, evidently used exponential family properties. A more general argument is given in the following simple theorem, whose conditions are hard to check.

5.2.2 Consistency of Minimum Contrast Estimates

Let X1, ..., Xn be i.i.d. P_θ, θ ∈ Θ ⊂ R^d, and let θ̂ be a minimum contrast estimate that minimizes

ρ_n(X, θ) = (1/n) Σ_{i=1}^n ρ(X_i, θ),

where, as usual, D(θ₀, θ) = E_{θ₀} ρ(X1, θ) is uniquely minimized at θ₀ for all θ₀ ∈ Θ.

Theorem 5.2.3. Suppose

sup{ |ρ_n(X, θ) − D(θ₀, θ)| : θ ∈ Θ } →p 0 under P_{θ₀}   (5.2.8)

and

inf{ D(θ₀, θ) : |θ − θ₀| ≥ ε } > D(θ₀, θ₀) for every ε > 0.   (5.2.9)

Then θ̂ is consistent.
lJ o): IIJ IJol > oj.IJ)1 < 00 and the parameterization is identifiable.11) implies that the righthand side of (5.inf{D(IJo.D(IJo. IJ)I : IJ K } PIJ !l o."'[p(X i .2.2. 0) . IIJ .2. ifIJ is the MLE.1.2 Consistency 305 proof Note that.8) and (5.8) can often failsee Problem 5.8). {IJ" . [I ~ L:~ 1 (p(Xi . .12) which has probability tending to 0 by (5.2. IJj ) .[max{[ ~ L:~ 1 (p(Xi . PIJ. But for ( > 0 let o ~ ~ inf{D(IJ.IJor > <} < 0] because the event in (5.D(IJ o. Coronary 5. /fe e =   Proof Note that for some < > 0. [inf c e.10) By hypothesis.IJd ). 0 Condition (5. IJ) . for all 0 > 0. EIJollogp(XI..D(IJo. (5. and (5.13) By Shannon's Lemma 2.D(IJo.2.1 we need only check that (5. PIJ. _ I) 1 n PIJ o [16 . IJo)) .2.5. Ie ¥ OJ] ~ Ojoroll j. IJ) n ~ i=l p(X"IJ o)] .IJ) p(Xi.L(p(Xi. IJ)II ' IJ n i=l E e} > .L[p(Xi .2. (5.D( IJ o.2. A simple and important special case is given by the following. But because e is finite. is finite.2.8) by (i) For ail compact K sup In { c e. IIJ . IJ j )) : 1 <j < d} > <] < d maxi PIJ.[IJ ¥ OJ] = PIJ.2.11) sup{l. E n ~ [p(Xi .2. o Then (5.IJO)) ' IIJ IJol > <} n i=l .2.Section 5. IJ).IJol > E] < PIJ [inf{ . IJ) .2.2.IIIJ  IJol > c] (5.2.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1. then.11) implies that 1 n ~ 0 (5.2.14) (ii) For some compact K PIJ.8) follows from the WLLN and PIJ.9) follows from Shannon's lemma.IJol > <} < 0] (5. . IJ) = logp(x.10) tends to O. IJj))1 > <I : 1 < j < d} ~ O. An alternative condition that is readily seen to work more widely is the replacement of (5. IJ j ) . 2 0 (5..2.9) hold for p(x.IJ o)  D(IJo. 1 n PIJo[inf{. IJ) .2.
We shall see examples in which this modification works in the problems. A general approach due to Wald, and a similar approach for consistency of generalized estimating equation solutions, are left to the problems. Unfortunately, checking conditions such as (5.2.8) and (5.2.14) is in general difficult. When the observations are independent but not identically distributed, consistency of the MLE may fail if the number of parameters tends to infinity; see the problems.

Summary. We introduce the minimal property we require of any estimate (strictly speaking, sequence of estimates): consistency. If θ̂_n is an estimate of θ(P), we require that θ̂_n →p θ(P) as n → ∞. Uniform consistency over P requires more, namely that sup{ P[|θ̂_n − θ(P)| ≥ ε] : P ∈ P } → 0 for all ε > 0. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers, and we derive consistency of the MLE in canonical multiparameter exponential families. We conclude by studying consistency of the MLE and, more generally, minimum contrast estimates in the cases Θ finite and Θ Euclidean. Sufficient conditions are explored in the problems.

5.3 FIRST- AND HIGHER-ORDER ASYMPTOTICS: THE DELTA METHOD WITH APPLICATIONS

We have argued, in Section 5.1, that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk.

5.3.1 The Delta Method for Moments

We begin this section by deriving approximations to moments of smooth functions of scalar means, and even provide crude bounds on the remainders. We then sketch the extension to functions of vector means. As usual, let X1, ..., Xn be i.i.d. X-valued; for the moment take X = R. Let h : R → R, let ||g||_∞ = sup{ |g(t)| : t ∈ R } denote the sup norm, and write h^{(j)} for the jth derivative of h, j ≥ 1. Let m ≥ 2 and assume

(i) (a) h is m times differentiable on R, and (b) ||h^{(m)}||_∞ = sup_x |h^{(m)}(x)| ≤ M < ∞;

(ii) E|X1|^m < ∞.

Let E(X1) = μ and Var(X1) = σ². We have the following.
. for j < n/2.3) (5.t" = t1 .1 < k < r}}.. .4) Note that for j eveo.3. Let I' = E(X. ..3.4) for all j and (5. The expression in (c) is. then there are constants C j > 0 and D j > 0 such (5. tI.3. (n . bounded by (d) < t... 'Zr J ..3. and the following lemma. X ij ) least twice. then (a) But E(Xi .1'1..[jj2] + 1) where Cj = 1 <r.3) and j odd is given in Problem 5.1) . ..3... Moreover. Lemma 5. (5.2... +'r=j J . EIX I'li = E(X _1')' Proof. . (b) a unless each integer that appears among {i I .Section 5. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r.) ~ 0..3) for j even. 21.i j appears at by Problem 5.. j > 2.3.3.5. . rl l:: .5[jJ2J max {l:: { .and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion. . + i r = j. :i1 + . ik > 2. We give the proof of (5. . If EjX Ilj < 00. .1.. tr i/o>2 all k where tl .2) where IX' that 1'1 < IX . In!. Xij)1 = Elxdj il.ij } • sup IE(Xi . _ . . C [~]! n(n .. t r 1 and [t] denotes the greatest integer n. . The more difficult argument needed for (5.3. .. .3.. .3 First.'1 +.'" .
5) apply Theorem 5. 0 1 I .J.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 .3. if I' = O.4).3. 0 The two most important corollaries of Theorem 5.5) (b) if E(Xt) < G(n.3. 00 and 11hC 41 11= < then G(n.l) and its variance and MSE.1 with m = 4. In general by considering Xi .l) + 00 2n I' + G(n3f2).1. 1 iln .6) n ..ll j < 2j EIXd j and the lemma follows. i • r' Var h(X) = 02[h(11(J.. • .1') + {h(2)(J.1')3 = G(n 2) by (5.1.3. and EXt < replaced by G(n').l) + {h(21(J.1')2 = 0 2 In.308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . give approximations to the bias of h(X) as an estimate of h(J. then I • I.1.2 ). Then Rm = G(n 2) and also E(X . EIX1 .3.l)h(J. < 3 and EIXd 3 < 00.1 with m = 3. respectively. Because E(X .3.l) + [h(1)]2(J. I 1 .5) can be replaced by Proof For (5. if 1 1 <j (a) Ilh(J)II= < 00. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00.3. .3. and (e) applied to (a) imply (5.5) follows.6) can be 1 Eh 2 (X) = h'(J. then G(n.l)h(1)(J. I (b) Ifllh(j11l= < 00. (5. Proof (a) Write 00..3) for j even. .3.3f') in (5.1'11.llJ' + G(n3f2) (5.. using Corollary 5. then h(')( ) 2 0 = h(J. I .3.3. (d).3..3. . I .l)h(J.+ O(n. (5.l)}E(X .3f2 ).6. n (b) Next.l) + [h(llj'(1')}':.l) 6 + 2h(J.. I I I Corollary 5.l)E(X .4) for j odd and (5. 1 < j < 3. By Problem 5.3f2 ) in (5. (n li/2] + 1) < nIJf'jj and (c). If the conditions of (b) hold.3.3. Corollary 5.3. apply Theorem 5.2.
We will compare E(h(X)) and Var h(X) with their approximations.3. and. which is neglible compared to the standard deviation of h(X).4 ) exp( 2/t).and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5.Section 5.1. .3. Bias. If X [. the MLE of ' is X .1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.3.3 First.2 ) 2e. (5. Example 5. then heX) is the MLE of 1 . .3(1_ ')In + G(n.1) = 11.1 with bounds on the remainders.(h(X) .= E"X I ~ 11 A.11) Example 5. Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5.3..h(/1)) h(2~(II):: + O(n. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX).t) and X. as the plugin estimate of the parameter h(Jl) then.l / Z) unless h(l)(/1) = O.i. T!Jus.3.t.6).3.(h(X)) E. by Corollary 5. when hit) = t(1 . {h(l) (/1)h(ZI U')/13 (5. where /1.2. Note an important qualitative feature revealed by these approximations. X n are i.d..3.3. then we may be interested in the warranty failure probability (5.i.5). for large n. Bias and Variance of the MLE of the Binomial VanOance. which is G(n. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5.3.2 (Problem (5. (1Z 5.3. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5. as we nonnally would.3.3.3.7) If h(t) ~ 1 .exp( 2') c(') = h(/1)...8) because h(ZI (t) = 4(r· 3 .3.1. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5.Z.3.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 .h(/1) is G(n.  ./1)3 = ~~. 
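Corollary 5.3.1 can be checked in a small case where the approximation is exact. In the sketch below (our example, not the text's), h(t) = t³ and the X_i are Uniform(0, 1); since h^{(3)} is constant and μ₃ = 0 for a symmetric distribution, Eh(X̄) = h(μ) + h^{(2)}(μ)σ²/2n holds exactly, and we verify it from the exact moments of X̄.

```python
# Check of Corollary 5.3.1 in a case where it is exact (our example, not the
# text's): h(t) = t^3, X_i ~ Uniform(0,1). Here mu = 1/2, sigma^2 = 1/12,
# mu_3 = 0, so Eh(X-bar) should equal h(mu) + h''(mu) sigma^2 / (2n) exactly.
def exact_mean_cubed(n):
    ex, ex2, ex3 = 1 / 2, 1 / 3, 1 / 4   # first three moments of Uniform(0,1)
    # E(X-bar^3) expanded over index patterns (all distinct / pair / triple)
    s = n * ex3 + 3 * n * (n - 1) * ex2 * ex + n * (n - 1) * (n - 2) * ex ** 3
    return s / n ** 3

def delta_approx(n):
    mu, sigma2 = 1 / 2, 1 / 12
    return mu ** 3 + (6 * mu) * sigma2 / (2 * n)   # h''(t) = 6t

errors = [abs(exact_mean_cubed(n) - delta_approx(n)) for n in (2, 10, 100)]
print(errors)
```

The index-pattern expansion in `exact_mean_cubed` is exactly the counting argument of Lemma 5.3.1 for j = 3.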
If heX) is viewed.Z) (5.4) E(X .9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n.3.exp( 2It). Thus. by Corollary 5.Z ".l ). the bias of h(X) defined by Eh(X) .
n n Because MII(t) = I . .[Var(X) + (E(X»2] I nI ~ p(1 . Next compute Varh(X)=p(I..p(1 .3 ). (5.p) . (5.p) + .p)] n I " .e x ) : i l + . II'. R'n p(1 ~ p) [(I .2t.310 Asymptotic Approximations Chapter 5 B(l.P) {(1_2 P)2+ 2P(I. 1 and in this case (5.2p)')} + R~.3.3. .p) .. 0 < i.p) = p ( l .!c[2p(1 n p) .pl..j .10) yields .amh i. Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).2p) n n +21"(1 . .p)(I.5) yields E(h(X))  ~ p(1 . < m.2(1 . • . First calculate Eh(X) = E(X) .9d(Xi )f.2p)p(l. Theorem 5.5) is exact as it should be. ..p)'} + R~ p(l .2p)2p(1 .3. I .2.10) is in a situation in which the approximation can be checked. the error of approximation is 1 1 + .6p(l.E(X') ~ p .P )} (n_l)2 n nl n Because 1'3 = p(l. M2) ~ 2. 1 < j < { Xl .p)J n p(1 ~ p) [I .p) n . a Xd d} . D " ~ I .p(1 . " ! .{2(1.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. asSume that h has continuous partial derivatives of order up to m.2p). Let h : R d t R..' .p) {(I _ 2p)2 n Thus. ~ Varh(X) = (1.p).3.2p)' . + id = m.3.2p(1 . . and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i. D 1 i I .. . and will illustrate how accurate (5. • ~ O(n.p)(I.
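The exact bias computation in Example 5.3.2 is easy to reproduce numerically: summing over the B(n, p) distribution of N = nX̄ recovers E h(X̄) − p(1 − p) = −p(1 − p)/n to rounding error. This sketch is ours; the values of n and p are arbitrary choices.

```python
from math import comb

# Exact check of Example 5.3.2: for X-bar = N/n with N ~ B(n, p),
# E[h(X-bar)] - p(1-p) should equal exactly -p(1-p)/n for h(t) = t(1-t).
def exact_bias(n, p):
    e_h = sum(comb(n, k) * p**k * (1 - p)**(n - k) * (k / n) * (1 - k / n)
              for k in range(n + 1))
    return e_h - p * (1 - p)

n, p = 25, 0.3
bias = exact_bias(n, p)
predicted = -p * (1 - p) / n
print(bias, predicted)
```

This is the rare case in which the delta-method bias approximation (5.3.5) is exact, because h^{(3)} vanishes identically.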
The generalization of this approach to approximation of moments for functions of vector means is formally the same but computationally not much used for d larger than 2. Suppose g : X → R^d and let Y_i = g(X_i) = (g1(X_i), ..., gd(X_i))ᵀ, 1 ≤ i ≤ n, with Ȳ = (1/n) Σ_{i=1}^n Y_i and μ = E(Y1). Let h : R^d → R, assume that h has continuous partial derivatives of order up to m, and that

(i) ||D^m h||_∞ < ∞, where D^m h(x) is the array (tensor)

{ (∂^m / ∂x1^{i1} ··· ∂xd^{id}) h(x1, ..., xd) : 0 ≤ i_k ≤ m, i1 + ··· + id = m };

(ii) E|Y_{1j}|^m < ∞, 1 ≤ j ≤ d, where Y_{1j} = g_j(X1).

Theorem 5.3.2. Under (i) and (ii) with m = 2,

Eh(Ȳ) = h(μ) + O(n^{−1}).   (5.3.12)

If m = 3, then, for d = 2,

Eh(Ȳ) = h(μ) + (1/2n)[ (∂²h/∂y1²)(μ) Var(Y11) + 2(∂²h/∂y1∂y2)(μ) Cov(Y11, Y12) + (∂²h/∂y2²)(μ) Var(Y12) ] + O(n^{−3/2})   (5.3.13)

and

Var h(Ȳ) = (1/n)[ ((∂h/∂y1)(μ))² Var(Y11) + 2(∂h/∂y1)(μ)(∂h/∂y2)(μ) Cov(Y11, Y12) + ((∂h/∂y2)(μ))² Var(Y12) ] + O(n^{−3/2}).   (5.3.14)

Moreover, under appropriate conditions, O(n^{−3/2}) in (5.3.13) and (5.3.14) can be replaced by O(n^{−2}). The proof is a consequence of Taylor's expansion in d variables and the appropriate generalization of Lemma 5.3.1; it is outlined in the problems. The most interesting application, as for the case d = 1, is to m = 3.

Approximations (5.3.12)–(5.3.14) do not help us to approximate risks for loss functions other than quadratic (or some power of |d − h(μ)|). The results in the next subsection go much further and "explain" the form of the approximations we already have.

5.3.2 The Delta Method for In Law Approximations

As usual we begin with d = 1.

Theorem 5.3.3. Suppose that X = R, E X1² < ∞, and h is differentiable at μ = E(X1). Then

L(√n(h(X̄) − h(μ))) → N(0, σ²[h^{(1)}(μ)]²)   (5.3.15)

where σ² = Var(X1).

The result follows from the more generally useful lemma:

Lemma 5.3.2. Suppose {U_n} are real random variables and that, for a sequence {a_n} of constants with a_n → ∞ as n → ∞,

(i) a_n(U_n − u) →L V for some constant u, and

(ii) g : R → R is differentiable at u with derivative g^{(1)}(u).

Then

a_n(g(U_n) − g(u)) →L g^{(1)}(u) V.   (5.3.16)

Proof. By definition of the derivative, for every ε > 0 there exists δ > 0 such that

(a) |v − u| < δ implies |g(v) − g(u) − g^{(1)}(u)(v − u)| ≤ ε|v − u|.

Note that (i) implies

(b) U_n − u = a_n^{−1} · a_n(U_n − u) →p 0,

because a_n^{−1} → 0 and a_n(U_n − u) →L V; hence, for every δ > 0,

(c) P[|U_n − u| < δ] → 1.

Using (a) and (c), for every ε > 0,

(d) P[ |a_n(g(U_n) − g(u)) − g^{(1)}(u) a_n(U_n − u)| ≤ ε a_n|U_n − u| ] → 1.

But a_n|U_n − u| →L |V|, so the right-hand side of the inequality in (d) can be made stochastically arbitrarily small by choice of ε. Therefore,

(e) a_n(g(U_n) − g(u)) = g^{(1)}(u) a_n(U_n − u) + o_p(1),

and the result follows from (i) and Slutsky's theorem. ◻

The theorem follows from the central limit theorem by letting U_n = X̄, a_n = n^{1/2}, u = μ, V ~ N(0, σ²).

Note that Theorem 5.3.3 "explains" Lemma 5.3.1. Formally we expect that if V_n →L V, then EV_n^j → EV^j (although this need not be true; see the problems and Appendix B.7). Thus, with V_n = √n(X̄ − μ) →L V ~ N(0, σ²), we expect

E(X̄ − μ)^j = n^{−j/2} E V_n^j ≈ n^{−j/2} σ^j E(Z^j)   (5.3.17)
then the critical value t.1 shows that for the onesample t test. i 2(1 approximately correct if H is true and the X's and Y's are not normal. Other distributions should also be tried...I = u~ and nl = n2. oL.. and in this case the t n .1. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5...02 ..5 .. . Figure 5. when 0: = 0. and the true distribution F is X~ with d > 10.. i I .3.05.1 (0. .'.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. " . 32.5 3 Log10 sample size j i Figure 5.3..1. .5 . 10. .c:'::c~:____:_J 0.. the For the twosample t tests. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently.95) approximation is only good for n > 10 2 . where d is either 2.5 2 2. 20.2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X. Chisquare data I 0... approximations based on asymptotic results should be checked by Monte Carlo simulations.. . Figure 5. or 50. Y . One sample: 10000 Simulations. The simulations are repeated for different sample sizes and the observed significance levels are plotted. Each plotted point represents the results of 10. 316.000 onesample t tests using X~ data.. Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table. The X~ distribution is extremely skew. X~. the asymptotic result gives a good approximation when n > 10 1.5 1 1. as indicated in the plot.3. ur ....2 or of a~.
Figure 5.3.2. Chi-square data, 10,000 simulations, equal variances. Each plotted point represents the results of 10,000 two-sample t tests. For each simulation the two samples are the same size (the size indicated on the x-axis), and the data are chi-square_d where d is one of 2, 10, 20, or 50.

For the two-sample t tests, Figure 5.3.2 shows that the t_{n-2}(1 - alpha) critical value is a very good approximation even for small n, and even when the X's and Y's have different chi-square_d distributions, scaled to have the same means. This is because, when n1 = n2 = n, we have Ybar - Xbar = n^(-1) Sum_i (Yi - Xi) as in the one-sample situation, and the differences Yi - Xi have a symmetric distribution. Moreover, the t_{n-2}(1 - alpha) approximation is good when n1 != n2 and sigma1^2 = sigma2^2, although in that case the t_{n-2}(0.95) approximation is good only for n1 >= 100. However, when both n1 != n2 and sigma1^2 != sigma2^2, the two-sample t tests with critical region {Sn >= t_{n-2}(1 - alpha)} do not have approximate level alpha, as we see from the limiting law of Sn and Figure 5.3.4. In this case Monte Carlo studies have shown that the test in Section 4.9.4 based on Welch's approximation works well.

Next, let h(Xbar) be an estimate of h(mu), where h is continuously differentiable at mu with h'(mu) != 0. By Theorem 5.3.3,

    sqrt(n) [h(Xbar) - h(mu)] -> N(0, sigma^2 [h'(mu)]^2) in law.

To test the hypothesis H : h(mu) = h0 versus K : h(mu) > h0 the natural test statistic is

    Tn = sqrt(n) [h(Xbar) - h0] / (s h'(Xbar)).
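In code, the statistic Tn just described is a one-liner once h and h' are supplied. The sketch below is ours; the choice h = log and the tiny sample are purely illustrative.

```python
import math

def delta_method_t(xs, h, h_prime, h0):
    """Tn = sqrt(n) * (h(xbar) - h0) / (s * h'(xbar)),
    the natural statistic for H: h(mu) = h0 when h'(mu) != 0."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    return math.sqrt(n) * (h(xbar) - h0) / (s * h_prime(xbar))

xs = [1.0, 2.0, 3.0, 4.0]
# illustrative test of H: log(mu) = log(2) against K: log(mu) > log(2)
t = delta_method_t(xs, math.log, lambda u: 1.0 / u, math.log(2.0))
print(round(t, 3))  # -> 0.864
```

For large n, t would be compared with z_{1-alpha}; here the sample is far too small for the approximation to be trusted, and the computation is shown only to fix the formula.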
Figure 5.3.4. Two sample: 10,000 simulations, Gaussian data, unequal variances. Each plotted point represents the results of 10,000 two-sample t tests. The data in the first sample are N(0, 1) and in the second they are N(0, sigma^2), where sigma^2 takes on the values 1, ..., and 9, as indicated in the plot. For each simulation the two samples differ in size: the second sample is two times the size of the first. The x-axis denotes (on a log10 scale) the size of the smaller of the two samples.

Combining Theorem 5.3.3 and Slutsky's theorem, we see that here, too, Tn -> N(0, 1) in law if H is true, so that z_{1-alpha} is the asymptotic critical value.

Variance Stabilizing Transformations

Example 5.3.4. In Appendices A and B we encounter several important families of distributions, such as the binomial, Poisson, gamma, and beta, which are indexed by one or more parameters. If we take a sample from a member of one of these families, then the sample mean Xbar will be approximately normally distributed with variance sigma^2/n depending on the parameters indexing the family considered. We have seen that smooth transformations h(Xbar) are also approximately normally distributed. It turns out to be useful to know transformations h, called variance stabilizing, such that Var h(Xbar) is approximately independent of the parameters indexing the family we are considering. From (5.3.6) and
(5.3.13) we see that a first approximation to the variance of h(Xbar) is sigma^2 [h'(mu)]^2 / n. Thus, finding a variance stabilizing transformation is equivalent to finding a function h such that sigma^2 [h'(mu)]^2 is constant for all mu and sigma appropriate to our family. Such a function can usually be found if sigma depends only on mu, which varies freely.

Suppose, for instance, that X1, ..., Xn is a sample from a P(lambda) family, lambda > 0. In this case sigma^2 = lambda and Var(Xbar) = lambda/n. To have Var h(Xbar) approximately constant in lambda, h must satisfy the differential equation

    [h'(lambda)]^2 lambda = c > 0  for some arbitrary c > 0,

that is, h'(lambda) = sqrt(c/lambda), which has as its solution h(lambda) = 2 sqrt(c lambda) + d, where d is arbitrary. Thus, if we require that h be increasing, h(t) = sqrt(t) is a variance stabilizing transformation of Xbar for the Poisson family of distributions. Substituting in (5.3.6) we find Var(sqrt(Xbar)) approximately 1/4n, so that sqrt(n)(sqrt(Xbar) - sqrt(lambda)) has approximately a N(0, 1/4) distribution, and

    sqrt(Xbar) +/- z(1 - alpha/2) / (2 sqrt(n))

is an approximate 1 - alpha confidence interval for sqrt(lambda).

The notion of such transformations can be extended to the following situation. Suppose gamma_hat_n(X1, ..., Xn) is an estimate of a real parameter gamma indexing a family of distributions from which X1, ..., Xn are an i.i.d. sample, and suppose that sqrt(n)(gamma_hat_n - gamma) -> N(0, sigma^2(gamma)) in law. Then again, a variance stabilizing transformation h is such that

    sqrt(n)(h(gamma_hat_n) - h(gamma)) -> N(0, c) in law  (5.3.19)

for all gamma. Some further examples of variance stabilizing transformations are given in the problems; see Problems 5.3.15 and 5.3.16.

One application of variance stabilizing transformations, by their definition, is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. A second application occurs for models where the families of distributions for which variance stabilizing transformations exist are used as building blocks of larger models. Major examples are the generalized linear models of Section 6.5. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Also closely related, but different, are so-called normalizing transformations; see Example 5.3.6.
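The stabilization claim Var(sqrt(Xbar)) approximately 1/(4n), independent of lambda, is easy to check by simulation. The sketch below is ours; Poisson variates are generated by the elementary product-of-uniforms method, which is adequate for modest lambda.

```python
import math
import random

def poisson(lam, rng):
    """Poisson variate via the product-of-uniforms method (fine for modest lam)."""
    target = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > target:
        prod *= rng.random()
        k += 1
    return k

def var_of_sqrt_mean(lam, n, reps, rng):
    """Monte Carlo variance of sqrt(Xbar) for a P(lam) sample of size n."""
    vals = [math.sqrt(sum(poisson(lam, rng) for _ in range(n)) / n)
            for _ in range(reps)]
    mean = sum(vals) / reps
    return sum((v - mean) ** 2 for v in vals) / (reps - 1)

rng = random.Random(2)
n = 100  # the stabilized variance 1/(4n) = 0.0025, whatever lambda is
variances = [var_of_sqrt_mean(lam, n, 1000, rng) for lam in (2.0, 4.0, 8.0)]
print([round(v, 4) for v in variances])  # each entry should be near 0.0025
```

Repeating with sqrt replaced by the identity would show variances lambda/n that change by a factor of four across these lambda values, which is exactly what the transformation removes.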
Edgeworth Approximations

The normal approximation to the distribution of Xbar utilizes only the first two moments of Xbar. Under general conditions (Bhattacharya and Rao, 1976, p. 538) one can improve on
the normal approximation by utilizing the third and fourth moments. Let Fn denote the distribution of Tn = sqrt(n)(Xbar - mu)/sigma, where Tn is a standardized random variable, and let gamma_1n and gamma_2n denote the coefficients of skewness and kurtosis of Tn. Then under some conditions,(1)

    Fn(x) = Phi(x) - phi(x) [ (1/6) gamma_1n H2(x) + (1/24) gamma_2n H3(x) + (1/72) gamma_1n^2 H5(x) ] + rn  (5.3.20)

where rn tends to zero at a rate faster than 1/n and H2, H3, and H5 are Hermite polynomials defined by

    H2(x) = x^2 - 1,  H3(x) = x^3 - 3x,  H5(x) = x^5 - 10x^3 + 15x.

The expansion (5.3.20) is called the Edgeworth expansion for Fn.

Example 5.3.5. Edgeworth Approximations to the Chi-Square Distribution. Suppose V ~ chi-square_n. Then V has the same distribution as Sum_{i=1}^n Xi^2, where the Xi are independent and Xi ~ N(0,1), i = 1, ..., n. It follows from the central limit theorem that Tn = (Sum_{i=1}^n Xi^2 - n)/sqrt(2n) = (V - n)/sqrt(2n) has approximately a N(0, 1) distribution. To improve on this approximation, we need only compute gamma_1n and gamma_2n. Table 5.3.1 gives the resulting Edgeworth approximation together with the exact distribution and the normal approximation when n = 10.

Table 5.3.1. Edgeworth (EA) and normal (NA) approximations to the exact chi-square_10 distribution, evaluated at a range of quantiles. [The tabulated values are not recoverable from this scan.]

We can use Problem B.2.4 to compute
"J 1 '~ • 'YIn = E(Vn? (2n)1 E(V . .3006 0.4999 0.0005 0 0." • [3 n .9999 1.1.40 0. ii.0397 0.
Let central limit theorem T'f. (X n .d. Let (X" Y. (ii) g.n1EY/) ° ~ ai ~ and u = (p.0 .l = Cov(XkyJ.. and Yj = (Y .J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0. Lemma 5.3. ._p2). Using the central limit and Slutsky's theorems.2rJ >"..Section 5.1. (i) un(U n ./U2U3.1) and vn(i7~ . Ui/U~U3.22) 2 g(l)(u) = (2U.i. Then Proof. >'1.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5. xmyl).3 First. a~ = Var Y.5.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl.1.ar. and let r 2 = C2 /(j3<. 4f.2. R d !' RP has a differential g~~d(U) at u. aiD. Y) where 0 < EX 4 < 00.3.ij~ where ar Recall from Section 4.3 and (B.1) jointly have the same asymptotic distribution as vn(U n . Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl.u) ~ N(O..0.0. E Next we compute = T11 .u) ~~ V dx1 forsome d xl vector ofconstants u.2 >'1.3.2. vn(i7f .2.2 T20 .0 TO . Y)/(T~ai where = Var X.1. Let p2 = Cov 2(X.1. where g(UI > U2. We can write r 2 = g(C.9.0 >'1. UillL2U~) = (2p. The proof follows from the arguments of the proof of Lemma 5. Because of the location and scale invariance of p and r..9) that vn(C .p).2.n lEX. P = E(XY).I.0. 1).m. Example 5. 0< Ey4 < 00.0. then by the 2 vn(U .2.o}.3. . = ai = 1. It follows from Lemma 5.u) where Un = (n1EXiYi.0 2 (5. Yn ) be i.0. .3. = Var(Xkyj) and Ak.2.0.J11)/a. 1.1.6) that vn(r 2 N(O. U3) = Ui/U2U3.j.2. we can show (Problem 5.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al.3.).0 >'1. with  r?) is asymptotically nonnal. E).a~) : n 3 !' R. _p2.2 extends to the dvariate case. >'2. we can .2 + p'A2. as (X.6.= (Xi . .3.2 >'2.3.
g.320 When (X. j f· ~. (1 _ p2)2).. c E (1. Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'. . Suppose Y 1.. EY 1 = rn. ': I = Ilgki(rn)11 pxd. then u5 .3.1). and (Prohlem 5.h= (h i .p). .3.Y) ~ N(/li.19). it gives approximations to the power of these tests.1'2. (5. Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd.3.rn) N(o.: ! + h(i)(rn)(y .h(p)J !:.p) !:. N(O. .9) Refening to (5. E) = .fii[h(r) . (5.~a)/vn3} where tanh is the hyperbolic tangent. 2 1.5 (a) ! . David. N(O. .1) is achieved hy choosing h(P)=!log(1+ P ).P The approximation based on this transfonnation.fii(r .3. per < c) '" <I>(vn .a)% confidence interval of fixed length. .u 2. Theorem 5. Then ~.3.3[h(c) .mil £ ~ hey) = h(rn) and (b) so that . I Y n are independent identically distributed d vectors with ElY Ii 2 < 00.fii(Y . and it provides the approximate 100(1 . that is. 1938) that £(vn . . . .3. Argue as before using B.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . o Here is an extension of Theorem 5.4.8.rn) + o(ly . which is called Fisher's z. .24) I Proof. we see (Problem 5. .hp )andhhasatotaldifferentialh(1)(rn) f • .l) distribution. ! f This expression provides approximations to the critical value of tests of H : p = O.h(p)]).UI. ..h(p)) is closely approximated by the NCO.3.3(h(r) .23) • . . p=tanh{h(r)±z (1. has been studied extensively and it has been shown (e.3.
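The confidence interval rho = tanh(h(r) +/- z(1 - alpha/2)/sqrt(n - 3)) based on Fisher's z is a one-line computation. The sketch below is ours; r = 0.5 and n = 100 are purely illustrative inputs, and bivariate normality of the data is assumed.

```python
import math

def fisher_z_interval(r, n, z=1.96):
    """Approximate confidence interval for rho via Fisher's z = atanh(r).
    Assumes bivariate normal data and n > 3; z = 1.96 gives roughly 95%."""
    h = math.atanh(r)          # h(r) = (1/2) log((1 + r)/(1 - r))
    half = z / math.sqrt(n - 3)
    return math.tanh(h - half), math.tanh(h + half)

lo, hi = fisher_z_interval(0.5, 100)
print(round(lo, 3), round(hi, 3))  # -> 0.337 0.634
```

Note that the interval has fixed length on the z scale but not on the rho scale; it is the transformed parameter h(rho) that gets a fixed-length interval, as the variance stabilizing discussion above leads one to expect.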
3. when the number of degrees of freedom in the denominator is large. then P[T5 .37] = P[(vlk) < 2.7) implies that as m t 00.m distribution. the:F statistic T _ k. 00. check the entries of Table IV against the last row.1. 1 Xl has a x% distribution.3. when k = 10. . The case m = )'k for some ). 14.Section 5.9) to find an approximation to the distribution of Tk.. We do not require the Xi to be normal. we can write.3 First. xi and Normal Approximation to the Distribution of F Statistics. the mean ofaxi variable is E(Z2). Now the weak law of large numbers (A.:: . When k is fixed and m (or equivalently n = k + m) is large.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) .1. gives the quantiles of the distribution of V/k.25) has an :Fk.x%.= density. .05 and the respective 0. L::+.0 < 2. i=k+1 Using the (b) part of Slutsky's theorem.3. Suppose that Xl. By Theorem B.211 = 0.. EX.3.3. where k + m = n. We write T k for Tk. is given as the :FIO .15. Thus. (5.k. only that they be i.3.' = Var(X.i.h(m)) = y'nh(l)(m)(Y  m) + op(I). See also Figure B.l) distribution..37 for the :F5 . By Theorem B. if k = 5 and m = 60..:l ~ m k+m L X. But E(Z') = Var(Z) = 1.. we can use Slutsky's theorem (A.m distribution can be approximated by the distribution of Vlk.). Then according to Corollary 8. which is labeled m = 00.m' To show this. we conclude that for fixed k.m. > 0 is left to the problems. This row.d. I). Next we turn to the normal approximation to the distribution of Tk. For instance. D Example 5. where V .m.1. as m t 00. we first note that (11m) Xl is the average ofm independent xi random variables. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk. To get an idea of the accuracy of this approximation. Suppose for simplicity that k = m and k . > 0 and EXt < 00.k+m L::l Xl 2 (5.~1.1 in which the density of Vlk. the :Fk.05 quantiles are 2.0 distribution and 2.3.Xn is a sample from a N(O.7.21 for the distribution of Vlk. if . 
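Such accuracy checks can also be made without tables by simulating the F statistic directly. The sketch below is ours: the only table value used is chi-square_5(0.95) = 11.07, so the m = infinity approximation to the 0.95 quantile of F_{5,60} is 11.07/5, about 2.21, to be compared with the exact value 2.37 quoted above.

```python
import math
import random

def chisq(d, rng):
    """A chi-square_d variate as a sum of d squared N(0,1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))

def f_quantile_mc(k, m, p, reps, rng):
    """Monte Carlo p-quantile of the F_{k,m} distribution,
    built from its definition (V/k)/(W/m) with V ~ chi2_k, W ~ chi2_m."""
    draws = sorted((chisq(k, rng) / k) / (chisq(m, rng) / m)
                   for _ in range(reps))
    return draws[int(p * reps)]

rng = random.Random(3)
q_f = f_quantile_mc(5, 60, 0.95, reps=4000, rng=rng)
q_approx = 11.07 / 5  # chi-square_5(0.95)/5: the "m = infinity" row of the tables
print(round(q_f, 2), round(q_approx, 2))
# the exact F_{5,60}(0.95) quantile is 2.37; the V/k approximation gives 2.21
```

The gap between the simulated quantile and the V/k value is the price of treating m = 60 as infinite, which is what the last row of Table IV implicitly does.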
with EX l = 0.26) l.i=k+l Xi ".. where Z ~ N(O. Then.m  (11k) (11m) L.
1'.)/ad 2 .3.3.3. (5.29) is unknown. ~ Ck. v) identity.8(c» one has to use the critical value . Thus (Problem 5.. as k ~ 00. a').3.28) .k(tl)] '" 'fI() :'.k (t 1 (5..m 1) <) :'. .3.»)T and ~ = Var(Yll)J.' !k .)T. . ._o l or ~. where J is the 2 x 2 v'n(Tk 1) ""N(0.1)/12 j 2(m + k) mk Z. 4). Specifically.. when Xi ~ N(o. In general. v'n(Tk . . ! i l _ . k.5.3.m critical value fk. 2) distribution.8(a» does not have robustness of level. In general (Problem 53.4. the F test for equality of variances (Problem 5.m(1 . By Theorem 5.k(T>. ~ N (0. I)T and h(u. 1)T.3.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows. Equivalently Tk = hey) where Y i ~ (li" li. P[Tk.m • < tl P[) :'.a) '" 1 + is asymptotically incorrect.'. :f. Then if Xl.). . when rnin{k. _. I' Theorem 5. :.). We conclude that xl jer}.3.m(l.m(1. X n are a sample from PTJ E P and ij is defined as the MLE 1 . the distribution of'. j m"': k (/k.27) where 1 = (1.8(d)). Suppose P is a canonical exponential family of rank d generated by T with [open.a). = E(X. if it exists and equal to c (some fixed value) otherwise.7).3. • • 1)/12) . v) ~ ~.m 1) can be ap proximated by a N(O. = (~.2Var(Yll))' In particular if X. i. 0 ! j i · 1 where K = Var[(X. a').o) k) K (5.. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. l (: An interesting and important point (noted by Box. 1'. When it can be eSlimated by the method of moments (Problem 5. which by (5.m = (1+ jK(:: z.m} t 00.3. the upper h. i = 1. and a~ = Var(X.I.28) satisfies Zto '" 1 I I:. 1953) is that unlike the t test.k(Tk. i. 1 5. !1 . E(Y i ) ~ (1. ifVar(Xf) of 2a'. h(i)(u.1) "" N(O.a) .
O. X n be i. and.23). (ii) follows from (5.24).3.3. if T ~ I:7 I T(X.2 that. therefore. and 1). thus. 0').". T/2 By Theorem 5. iiI = X/O".4. PT/[T E A(E)] ~ 1 and.38) on the variance matrix of y'n(ij .. .1.'l 1 (ii) LT/(Vii(iiT/)).3. (i) follows from (5.5.8.N(o..In(ij .3 First.I(T/) (5.EX.l4. Example 5.T. are sufficient statistics in the canonical model.3. = 1/20'.3. where. Now Vii11.1.33) 71' 711 Here 711 = 1'/0'.71) for any nnbiased estimator ij. = A by definition and.3.4 with AI and m with A(T/). Ih our case.2 where 0" = T.8. Then T 1 = X and 1 T2 = n. as X witb X ~ Nil'.A 1(71))· Proof.30) But D A . = 1/2. by Corollary 1. Let Xl>' .) .Section 5. (I" +0')] . Note that by B.. I'. The result is a consequence. Thus.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.3.3. For (ii) simply note that.2. = (5.Nd(O.4.A(T/)) + oPT/ (n .11) eqnals the lower bound (3.d. hence. o Remark 5.1.32) Hence.6. (5.3.2.3. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information.  (TIl' .31) .2 and 5.4. Thus.). (5. We showed in the proof ofTheorem 5.3. by Example 2. in our case.i.2. PT/[ii = AI(T)I ~ L Identify h in Theorem 5. of Theorems 5.ift = A(T/).3. This is an "asymptotic efficiency" property of the MLE we return to in Section 6. the asymptotic variance matrix /1(11) of .
: CPo. stochastic approximations in the case of vector statistics and parameters are developed.4.Pk) where (5.L~ I l(Xi = Xj) is sufficient. We focus first on estimation of O. .. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll. .s where Eo = diag(a 2 )2a 4 ).h(JL) where Y 1. Xk} only so that P is defined by p .. . Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. . and PES.P(Xk' 0)) : 0 E 8}. . testing. 0). These stochastic approximations lead to Gaussian approximations to the laws of important statistics.d.d.6. I . we can use (5. N = (No.4 and Problem 2.3. under Li."...d.33) and Theorem 5.15). . variables are explained in tenns of similar stochastic approximations to h(Y) .(T. Consistency is Othorder asymptotics. and il' = T..7).26) vn(X /". 0).i.4 to find (Problem 5.L. and confidence bounds.g. Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. 8 open C R (e. il' . Y n are i. and so on.1) i . Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal.3.)'. . when we are dealing with onedimensional smooth parametric models. < 1. 5. We consider onedimensional parametric submodels of S defined by P = {(p(xo. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families.324 Asymptotic Approximations Chapter 5 Because X = T.. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. the difference between a consistent estimate and the parameter it estimates. Nk) where N j .. the (k+ I)dimensional simplex (see Example 1... Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. as Y. 0 . .1. I I I· < P. Assume A : 0 ~ pix. and h is smooth. .') N(O. 
taking values {xo. 5. i In this section we define and study asymptotic optimality for estimation. Summary. is twice differentiable for 0 < j < k. . EY = J. .1. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families. Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i.. We begin in Section 5. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families.i. .d. i 7 •• . I i I .3. . Finally. sampling. .1 Estimation: The Multinomial Case . . Thus. Eo) .3.. .4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! . 0. see Example 2.4.1 by studying approximations to moments and central moments of estimates. . for instance. J J .
4.8) . Theorem 5. 0 <J (5.l (Xl.4. Moreover.11).Section 5.4.8)1(X I =Xj) )=00 (5.3) Furthermore (See'ion 3. (5. h) with eqUality if and only if. Next suppose we are given a plugin estimator h (r. Many such h exist if k 2. h) is given by (5.8) ~ I)ogp(Xj.4.4. < k.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5. 88 (Xl> 8) and =0 (5.. Assume H : h is differentiable. bounded random variable (5.1.5) As usual we call 1(8) the Fisher infonnation.:) (see (2.2(0. Consider Example where p(8) ~ (p(xo. fJl E.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I . (5.4.6) 1. 8) is similarly bounded and well defined with (5.Jor all 8.p(Xk.2) is twice differentiable and g~ (X I) 8) is a well~defined.8).4. Then we have the following theorem..9) .2)..2 (8. .4) g.4..4. Under H.4.4. for instance. 8))T. if A also holds.4. .8) logp(X I .7) where .1. > rl(O) M a(p(8)) P. pC') =I 1 m .0). (0) 80(Xj.1.
10).6). but also its asymptotic variance is 0'(8. by (5. I h 2 (5.9). • • "'i.8) gives a'(8)I'(8) = 1.12) [t.II. whil.4.4.8) with equality iff.10) Thus. we obtain &l 1 <0 . &8(X.p(x). I (x j. 0)] p(Xj.11) • (&h (p(O)) ) ' p(xj. using the definition of N j .')  h(p(8)) } asymptotically normal with mean 0.4. 2: • j=O &l a:(p(8))(I(XI = Xj) .4.4. we s~~ iliat equality in (5.I n. ~ir common variance is a'(8)I(8) = 0' (0. (5.15) i • ..2 noting that N) vn (h (. h) = I . 0) = • &h fJp (5.p(xj. : h • &h &Pj (p(8)) (N .p(xj.4. by noting ~(Xj. not only is vn{h (..3.4.h)Var. which implies (5.326 Asymptotic Approximations Chapter 5 Proof.4.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.14) .12).4.8) ) ~ . using the correlation inequality (A. 8).8) ) +op(I). By (5.8)). we obtain 2: a:(p(8)) &0(xj.l6) as in the proof of the information inequality (3.. for some a(8) i' 0 and some b(8) with prob~bility 1.8)&Pj 2: • &h a:(p(0))p(x.h)I8) (5.13).4. ( = 0 '(8. (5.4. h(p(8)) ) ~ vn Note that. h k &h &pj(p(8)) (N p(xj. kh &h &Pj (p(8))(I(Xi = Xj) . (5...4.8) j=O 'PJ Note that by differentiating (5.4.13) I I · . (8./: . h).16) o . Taking expectations we get b(8) = O. Apply Theorem 5.8) = 1 j=o PJ or equivalently. Noting that the covariance of the right' anp lefthand sides is a(8).
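In the simplest case (k = 1, i.e., binomial sampling) the asymptotic variance of a plug-in estimate h(p_hat) given by the theorem reduces to the delta-method value [h'(p)]^2 p(1 - p)/n, and this is easy to probe numerically. The sketch below is ours; the functional h(p) = p(1 - p) is an arbitrary illustrative choice, not one taken from the text.

```python
import random

def h(p):
    """Illustrative smooth functional of the Bernoulli parameter."""
    return p * (1 - p)

def h_prime(p):
    return 1 - 2 * p

def scaled_var(p, n, reps, rng):
    """Monte Carlo variance of sqrt(n) * (h(phat) - h(p)) for Bernoulli(p) data."""
    vals = []
    for _ in range(reps):
        phat = sum(rng.random() < p for _ in range(n)) / n
        vals.append(n ** 0.5 * (h(phat) - h(p)))
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / (reps - 1)

rng = random.Random(5)
p, n = 0.3, 400
mc = scaled_var(p, n, 2000, rng)
asym = h_prime(p) ** 2 * p * (1 - p)  # delta-method value (1-2p)^2 p(1-p)
print(round(mc, 4), round(asym, 4))  # the two should be close
```

Since h(p_hat) is here the plug-in estimate of h(p) and h is differentiable, the simulated variance matching the delta-method value is an instance of the bound being attained, as the theorem asserts.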
0 (5. Then Theo(5. As in Theorem 5.8) is. 0).3. Let p: X x ~ R where e D(O.. ..xd). if it exists and under regularity conditions. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B). Suppose p(x.. is a canonical oneparameter exponential family (supported on {xo..d... Suppose i.4.4. OneParameter Discrete Exponential Families.5 applies to the MLE 0 and ~ e is open.19) The binomial (n. achieved by if = It (r. then. Example 5. by ( 2.Oo) = E.i..(vn(B .4.0)=0. . : E Ell. . the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.3) L. Xk}) and rem 5. .8) = exp{OT(x) .A(O)}h(x) where h(x) = l(x E {xo.2. 5. Write P = {P..3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check.3 that the information bound (5. 0) and (ii) solvesL::7~ONJgi(Xj.2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates.). Xl. . which (i) maximizes L::7~o Nj logp(xj. p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case.4. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena.4.::7 We shall see in Section 5.1.I L::i~1 T(Xi) = " k T(xJ )". 0 E e."j~O (5. E open C R and corresponding density/frequency functions p('. .Xn are tentatively modeled to be distributed according to Po. In the next two subsections we consider some more general situations.17) L.4.Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:.3.4.o(p(X"O)  p(X"lJo )) .0)) ~N (0.. Note that because n N T = n ..18) and k h(p) = [A]I LT(xj)pj )".
# o.L . L. PEP and O(P) is the unique solution of(5. 0) is differentiable.328 Asymptotic Approximations Chapter 5 . rather than Pe. That is. O(P))) ..~(Xi.' '. < co for all PEP..p E p 80 (X" O(P») A3: . We need only that O(P) is a parameter as defined in Section 1. O)ldP(x) < co. Suppose AI: The parameter O(P) given hy the solution of .!:.1/ 2) n '. . under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}. A4: sup. .1.20) In what follows we let p.p(X" On) n = O. n i=l _ 1 n Suppose AO: . parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for.. ~ (X" 0) has a finite expectation and i ! . f 1 .LP(Xi.O(P)) +op(n. O(P» / ( Ep ~~ (XI. i' On where = O(P) +. { ~ L~ I (~(Xi.=1 n ! (5. Let On be the minimum contrast estimate On ~ argmin .4.22) J i I .p2(X.t) .O). j.21) i J A2: Ep. as pointed out later in Remark 5. Theorem 5. hence. O(Pe ) = O. As we saw in Section 2.4. O)dP(x) = 0 (5.1.0(P)) l.21) and.4.4.p(. o if En ~ O.p(x.p(x.p(Xi. Under AOA5.4. . ~i  i .23) I. 1 n i=l _ .pix. ~I AS: On £.p = Then Uis well defined. . J is well defined on P. 0 E e.3.2.O(p)))I: It . On is consistent on P = {Pe : 0 E e}. (5.O(P)I < En} £. (5. That is.4. denote the distribution of Xi_ This is because. O(P). 8. P) = .4.. .p(x. • is uniquely minimized at Bo.
Next we show that (5.27) Combining (5. By expanding n..4.O(P)) Ep iJO (Xl.O(P)I· Apply AS and A4 to conclude that (5. Let On = O(P) where P denotes the empirical probability.8(P)) I n n n~ t=1 n~ t=1 e (5.4. en) around 8(P). (5.O(P))) p (5.' " 1jJ(Xi .1j2 ). applied to (5.28) .4.20) and (5.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.24) follows from the central limit theorem and Slutsky's theorem.425)(5. where (T2(1jJ.p) = E p 1jJ2(Xl .4.4.O(P)) ~ ..' " iJ (Xi .22) follows by a Taylor expansion of the equations (5. t=1 1 n (5.4. .On)(8n .lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1.24) proof Claim (5.4.O(P)) = n.27) we get.26) and A3 and the WLLN to conclude that (5.l   2::7 1 iJ1jJ..22) because .4. But by the central limit theorem and AI.29) . 1 1jJ(Xi .Section 5.25) where 18~  8(P)1 < IiJn .4.20). A2. O(P)).4 Asymptotic Theory in One Dimension 329 Hence. and A3.O(P)) 2' (E '!W(X. 8(P)) + op(l) = n ~ 1jJ(Xi ._ .4. n.4.p) < 00 by AI.4.L 1jJ(X" 8(P)) = Op(n. using (5.21). (iJ1jJ ) In (On . we obtain.Ti(en .4.
Suppose lis differen· liable and assume that ! • &1 Eo &O(X I .O) = O.O)?jJ(X I . This is in fact truesee Problem 5.4.O(P) Ik = n n ?jJ(Xi . AI.. . O)p(x. it is easy to see that A6 is the same as (3. Our arguments apply to Mestimates. (2) O(P) solves Ep?jJ(XI. O)dp. Our arguments apply even if Xl. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.29)..2 may hold even if ?jJ is not differentiable provided that . we see that A6 corresponds to (5.1.d.1. Conditions AO. P but P E a}. and we define h(p) = 6(xj )Pj. n I _ .30) is formally obtained by differentiating the equation (5.22) follows from the foregoing and (5. 0) is replaced by Covo(?jJ(X" 0). Z~ (Xl. A4 is found.i. written as J i J 2::.4.! . B» and a suitable replacement for A3.2. • Theorem 5.. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I .12).8). 0 • .4.4.4. en j• . for A4 and A6. Remark 5. and that the model P is regular and letl(x. A6: Suppose P = Po so that O(P) = 0. We conclude by stating some sufficient conditions. . essentially due to Cramer (1946).30) Note that (5.4. Identity (5. Solutions to (5.2.Eo (XI.4. Remark 5.28) we tinally obtain On .4. .4. If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x.=0 ?jJ(x. A2.4. I 'iiIf 1 I • I I . =0 (5.: 1. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5. {xo 1 ••• 1 Xk}. that 1/J = ~ for some p). 0) = logp(x.?jJ(XI'O)).e. 0) Covo (:~(Xt. This extension will be pursued in Volume2.4. for Mestimates. B) where p('. O(P)) ) and (5.4.21).O).4.'. X n are i.4. I o Remark 5.4.3.4. If further Xl takes on a finite set of values. O(P)) + op (I k n n ?jJ(Xi .(x) (5. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po.4. e) is as usual a density or frequency function. 1 •• I ! . and A3 are readily checkable whereas we have given conditions for AS in Section 5.20) are called M~estimates. 0) = 6 (x) 0.2 is valid with O(P) as in (I) or (2). 
0) l' P = {Po: or more generally.31) for all O.30) suggests that ifP is regular the conclusion of Theorem 5.
O). satisfies If AOA6 apply to p(x. lO. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.eo (Xl. AOA6.4.O) and P ~ ~ Pe.:d in Section 3.Section 5. s) diL(X )ds < 00 for some J ~ J(O) > 0. That is.4.34) Furthermore. 0') .20) occurs when p(x. 10' . Theorem 5.8) is a continuous function of8 for all x.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. then the MLE On (5.O) = l(x.. Pe) > 1(0) 1 (5. < M(Xl..01 < J( 0) and J:+: JI'Ii:' (x. 0) (5.O) 10gp(x. 0) obeys AOA6. where iL(X) is the dominating measure for P(x) defined in (A. 0') is defined for all x. "dP(x) = p(x)diL(x).1). .4. We also indicate by example in the problems that some conditions are needed (Problem 5.p(x. 0) g~ (x. In this case is the MLE and we obtain an identity of Fisher's.4.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.3.4.32) where 1(8) is t~e Fisher information introduq. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.4. 0) and .35) with equality iff.4) but A4' and A6' are not necessary (Problem 5.33) so that (5.p.p = a(0) g~ for some a 01 o. 0) < A6': ~t (x.4.0'1 < J(O)} 00. where EeM(Xl. B)..4. 0) .4." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.O) = l(x. 5. w.4. We can now state the basic result on asymptotic normality and e~ciency of the MLE. = en en = &21 Ee e02 (Xl.
N(B. j 1 " Note that Theorem 5.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. Let Z ~ N(O.<1>( _n '/4 . 1)..4.1). PoliXI < n 1/4 ] .A'(B). and Polen = OJ .B). The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.2..4.439) where . I I Example 5.4. if B I' 0. PIIZ + . .36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. The optimality part of Theorem 5. 0 e en e. see Lehmann and Casella.2.38) Therefore.30) and (5..d.4. If B = 0.34) follow directly by Theorem 5. 5.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). 00. .4. Then X is the MLE of B and it is trivial to calculate [( B) _ 1.4. 1.35) is equivalent to (5. {X 1.r 332 Asymptotic Approximations Chapter 5 Proof Claims (5.Xn be i..4. However.i .4.nBI < nl/'J <I>(n l/ ' . We discuss this further in Volume II. for some Bo E is known as superefficiency.4.nB) .4. thus. 1 . 442.n(Bn .i.. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n. . 1998.ne . ..l'1 Let X" .4.nB).1/4 X if!XI > n.4. B) with J T(x) .. 1 I I.36) Because Eo 'lj.'(0) = 0 < l(~)' The phenomenon (5. Consider the following competitor to X: B n I 0 if IXI < n.35). for higherdimensional the phenomenon becomes more disturbing and has important practical consequences.4. I. Therefore. .". .4. ...3 is not valid without some conditions on the esti· mates being considered. 1 . . (5. claim (5.4 Testing ! I i .• •.4.1/4 I (5. 1.'(B) = I = 1(~I' B I' 0. PoliXI < n1/'1 .1 once we identify 'IjJ(x. Then . }.3 generalizes Example 5. Hodges's Example. B) is an MLR family in T(X). we know that all likelihood ratio tests for simple BI e e eo.. . . and. . We next compule the limiting distribution of ... p.. . B) = 0. By (5. PolBn = Xl .37) . 0 because nIl' ..(X B).4.33) and (5. ... 0 1 .. (5. For this estimate superefficiency implies poor behavior of at values close to 0. 
cross multiplication shows that (5..
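Hodges's estimator described above is simple to simulate. The sketch below (ours, not from the text) draws Xbar directly from its N(theta, 1/n) law and estimates P_theta(theta_hat_n = 0), which tends to 1 at theta = 0 (the source of the superefficiency) and to 0 at theta != 0.

```python
import math
import random

def hodges(xbar, n):
    """Hodges's estimator: 0 if |xbar| <= n**(-1/4), else xbar."""
    return 0.0 if abs(xbar) <= n ** (-0.25) else xbar

def prob_zero(theta, n, reps, rng):
    """Monte Carlo P_theta(Hodges estimate == 0); X_i ~ N(theta, 1),
    so Xbar is drawn directly from N(theta, 1/n)."""
    hits = 0
    for _ in range(reps):
        xbar = rng.gauss(theta, 1.0 / math.sqrt(n))
        if hodges(xbar, n) == 0.0:
            hits += 1
    return hits / reps

rng = random.Random(4)
p0 = prob_zero(0.0, 100, 2000, rng)  # close to 1: superefficiency at theta = 0
p1 = prob_zero(1.0, 100, 2000, rng)  # close to 0: estimator behaves like Xbar
print(p0, p1)
```

Repeating with theta close to 0 but nonzero (of order n^(-1/2)) would exhibit the poor behavior near 0 that the text warns about: the estimator then gets "trapped" at 0 with substantial probability while the true parameter is not 0.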
versus simple θ₂ > θ₁, as well as the likelihood ratio test for H versus K, are of the form "Reject H for T(X) large," with the critical value specified by making the probability of type I error α at θ₀. If p(·, θ) is a one-parameter exponential family in θ generated by T(X), this test can also be interpreted as a test of H : λ ≤ λ₀ versus K : λ > λ₀, where λ = λ(θ), because λ is strictly increasing; that is, "Reject H for large values of the MLE of λ." It seems natural in general to study the behavior of the test

"Reject H if θ̂ₙ > cₙ(α, θ₀),"

where P_{θ₀}[θ̂ₙ > cₙ(α, θ₀)] = α and θ̂ₙ is the MLE. We will use asymptotic theory to study the behavior of this test when we observe i.i.d. X₁, …, Xₙ distributed according to P_θ, θ ∈ (a, b), derive an optimality property, and then directly and through problems exhibit other tests with the same behavior.

Theorem 5.4.4. Suppose the model P = {P_θ : θ ∈ Θ}, Θ = (a, b), a < θ₀ < b, is such that the conditions of Theorem 5.4.2 apply to ρ = −l and θ̂ₙ. Suppose (A4′) holds as well as (A6) and I(θ) < ∞ for all θ. Let cₙ(α, θ₀) denote the critical value of the test using the MLE θ̂ₙ based on n observations. Then

cₙ(α, θ₀) = θ₀ + z_{1−α}/√(nI(θ₀)) + o(n^{−1/2}), (5.4.41)

where z_{1−α} is the 1 − α quantile of the N(0, 1) distribution. If θ > θ₀,

P_θ[θ̂ₙ > cₙ(α, θ₀)] → 1, (5.4.42)

and if θ < θ₀,

P_θ[θ̂ₙ > cₙ(α, θ₀)] → 0. (5.4.43)

Property (5.4.42) is sometimes called consistency of the test against a fixed alternative.

Proof. By (5.4.40), L_θ(√n(θ̂ₙ − θ)) → N(0, I^{−1}(θ)), where I(θ) > 0 for all θ. Thus,

P_{θ₀}[√(nI(θ₀))(θ̂ₙ − θ₀) > z] → 1 − Φ(z). (5.4.44)

But Polya's theorem (A.14.22) guarantees that

sup_z |P_{θ₀}[√(nI(θ₀))(θ̂ₙ − θ₀) > z] − (1 − Φ(z))| → 0, (5.4.45)

which implies that √(nI(θ₀))(cₙ(α, θ₀) − θ₀) → z_{1−α} and, hence,

P_{θ₀}[θ̂ₙ > θ₀ + z_{1−α}/√(nI(θ₀))] = P_{θ₀}[√(nI(θ₀))(θ̂ₙ − θ₀) > z_{1−α}] → α. (5.4.46)

Thus, (5.4.41) follows. Next write

P_θ[θ̂ₙ > cₙ(α, θ₀)] = P_θ[√(nI(θ))(θ̂ₙ − θ) > √(nI(θ))(cₙ(α, θ₀) − θ)].
By (5.4.41), √(nI(θ))(cₙ(α, θ₀) − θ) = √(nI(θ))(θ₀ − θ) + z_{1−α}√(I(θ)/I(θ₀)) + o(1), which tends to −∞ if θ > θ₀ and to +∞ if θ < θ₀. Claims (5.4.42) and (5.4.43) follow. □

Theorem 5.4.4 tells us that the test under discussion is consistent and that, for n large, the power function of the test rises steeply to α from the left at θ₀ and continues rising steeply to 1 to the right of θ₀. Optimality claims rest on a more refined analysis involving a reparametrization from θ to γ = √n(θ − θ₀).

Theorem 5.4.5. Suppose the conditions of Theorem 5.4.4 hold. Furthermore, assume that

sup{ |P_θ[√(nI(θ))(θ̂ₙ − θ) ≤ z] − Φ(z)| : |θ − θ₀| ≤ ε(θ₀) } → 0 (5.4.47)

for some ε(θ₀) > 0; that is, (5.4.40) holds uniformly for θ in a neighborhood of θ₀. Let Q_γ = P_{θ₀+γ/√n}. Then, if γ = √n(θ − θ₀) is fixed,

Q_γ[θ̂ₙ > cₙ(α, θ₀)] → 1 − Φ(z_{1−α} − γ√I(θ₀)), (5.4.48)

uniformly in γ on compact sets, while the power tends to 1 if √n(θ − θ₀) → ∞ and to 0 if √n(θ − θ₀) → −∞. Moreover, if φₙ(X₁, …, Xₙ) is any sequence of (possibly randomized) critical (test) functions such that

lim sup_n E_{θ₀} φₙ(X₁, …, Xₙ) ≤ α, (5.4.49)

then

lim sup_n E_{θ₀+γ/√n} φₙ(X₁, …, Xₙ) ≤ 1 − Φ(z_{1−α} − γ√I(θ₀)) if γ > 0,
lim inf_n E_{θ₀+γ/√n} φₙ(X₁, …, Xₙ) ≥ 1 − Φ(z_{1−α} − γ√I(θ₀)) if γ < 0. (5.4.50)

Note that (5.4.50) can be interpreted as saying that, among all tests that are asymptotically level α (obey (5.4.49)), the test based on rejecting for large values of θ̂ₙ is asymptotically uniformly most powerful (achieves equality in (5.4.50)) and has asymptotically smallest probability of type I error for θ < θ₀. In either case, these statements can only be interpreted as valid in a small neighborhood of θ₀, because γ fixed means θ → θ₀.

Proof. Write

P_θ[θ̂ₙ > cₙ(α, θ₀)] = P_θ[√(nI(θ))(θ̂ₙ − θ) > √(nI(θ))(cₙ(α, θ₀) − θ₀) − √(nI(θ))(θ − θ₀)].
By (5.4.41), √(nI(θ))(cₙ(α, θ₀) − θ₀) = z_{1−α}√(I(θ)/I(θ₀)) + o(1), while √(nI(θ))(θ − θ₀) = γ√I(θ) + o(1). Moreover, I(θ) = I(θ₀ + γ/√n) → I(θ₀) because our uniformity assumption implies that θ ↦ I(θ) is continuous (Problem 5.4.7). Applying (5.4.47), claim (5.4.48) follows.

To prove (5.4.50) note that, by the Neyman–Pearson lemma, the most powerful test of the simple hypothesis H : θ = θ₀ versus the simple alternative K : θ = θ₀ + γ/√n, γ > 0, rejects for large values of

Σᵢ₌₁ⁿ log[ p(Xᵢ, θ₀ + γ/√n) / p(Xᵢ, θ₀) ], (5.4.53)

where the critical value dₙ = dₙ(α, θ₀) and the randomization constant εₙ are uniquely chosen so that the probability of rejection under θ₀ is exactly α. Further Taylor expansion and probabilistic arguments of the type we have used show that the power of this Neyman–Pearson test tends to the right-hand side of (5.4.50); hence no sequence of tests obeying (5.4.49) can asymptotically exceed that power and, because the test based on θ̂ₙ attains it, that test is asymptotically most powerful as well. The details are deferred to the problems.

The asymptotic results we have just established do not establish that the test that rejects for large values of θ̂ₙ is necessarily good for all alternatives for any n. The test "Reject H if θ̂ₙ > cₙ(α, θ₀)" of Theorems 5.4.4 and 5.4.5 will in the future be referred to as a Wald test. There are two other types of test that have the same asymptotic behavior: the likelihood ratio test and the score or Rao test.

It is easy to see that the likelihood ratio test for testing H : θ ≤ θ₀ versus K : θ > θ₀ is of the form

"Reject if Σᵢ₌₁ⁿ log[p(Xᵢ, θ̂ₙ)/p(Xᵢ, θ₀)] 1(θ̂ₙ > θ₀) > kₙ(θ₀, α)."

It may be shown (Problem 5.4.8) that kₙ(θ₀, α) = z²_{1−α}/2 + o(1) and that, if δ_{Wn}(X₁, …, Xₙ) is the critical function of the Wald test and δ_{Ln}(X₁, …, Xₙ) is the critical function of the LR test, then, for all γ,

P_{θ₀+γ/√n}[δ_{Ln}(X₁, …, Xₙ) = δ_{Wn}(X₁, …, Xₙ)] → 1. (5.4.54)

Assertion (5.4.54) establishes that the test δ_{Ln} yields equality in (5.4.50) for all γ and, hence, is asymptotically most powerful as well. Finally, for ε > 0 small and n fixed, the Neyman–Pearson LR test for H : θ = θ₀ versus K : θ = θ₀ + ε rejects for large values of

ε^{−1}[ log pₙ(X₁, …, Xₙ, θ₀ + ε) − log pₙ(X₁, …, Xₙ, θ₀) ],
where pₙ(x₁, …, xₙ, θ) is the joint density of X₁, …, Xₙ. For ε small, n fixed, this is approximately the same as rejecting for large values of (∂/∂θ₀) log pₙ(X₁, …, Xₙ, θ₀).

The preceding argument doesn't depend on the fact that X₁, …, Xₙ are i.i.d. with common density or frequency function p(x, θ), and the test that rejects H for large values of (∂/∂θ₀) log pₙ(X₁, …, Xₙ, θ₀) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming

"Reject H iff Σᵢ₌₁ⁿ (∂/∂θ₀) log p(Xᵢ, θ₀) > rₙ(α, θ₀)." (5.4.55)

It is easy to see (Problem 5.4.15) that

rₙ(α, θ₀) = z_{1−α}√(nI(θ₀)) + o(n^{1/2})

and that again, if δ_{Rn}(X₁, …, Xₙ) is the critical function of the Rao test, then

P_{θ₀+γ/√n}[δ_{Rn}(X₁, …, Xₙ) = δ_{Wn}(X₁, …, Xₙ)] → 1 (5.4.56)

(Problem 5.4.8), and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(θ₀), which may require numerical integration, can be replaced by −n^{−1}(d²/dθ²)lₙ(θ̂ₙ) (Problem 5.4.10).
5.4.5 Confidence Bounds

We define an asymptotic level 1 − α lower confidence bound (LCB) θ̄ₙ by the requirement that

lim inf_n P_θ[θ̄ₙ ≤ θ] ≥ 1 − α (5.4.57)

for all θ, and similarly define asymptotic level 1 − α UCBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:

(i) By using a natural pivot.

(ii) By inverting the testing regions derived in Section 5.4.4.

Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (A0)–(A6), (A4′), and I(θ) finite for all θ, it follows (Problem 5.4.9) that

L_θ( √(nI(θ̂ₙ)) (θ̂ₙ − θ) ) → N(0, 1) (5.4.58)

for all θ and, hence, an asymptotic level 1 − α lower confidence bound is given by

θ̄ₙ* = θ̂ₙ − z_{1−α}/√(nI(θ̂ₙ)). (5.4.59)

Turning to method (ii), inversion of δ_{Wn} gives formally

θ*ₙ₁ = inf{θ : cₙ(α, θ) > θ̂ₙ} (5.4.60)

or, if we use the approximation c̃ₙ(α, θ) = θ + z_{1−α}/√(nI(θ)) of (5.4.41),

θ*ₙ₂ = inf{θ : c̃ₙ(α, θ) > θ̂ₙ}. (5.4.61)
In fact neither θ*ₙ₁ nor θ*ₙ₂ properly inverts the tests unless cₙ(α, θ) and c̃ₙ(α, θ) are increasing in θ. The three bounds are different, as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, θ*ₙ₁ is preferable because this bound is not only approximately but genuinely level 1 − α. But computationally it is often hard to implement because cₙ(α, θ) needs, in general, to be computed by simulation for a grid of θ values. Typically, (5.4.59) or some equivalent alternatives (Problem 5.4.10) are preferred but can be quite inadequate (Problem 5.4.11). These bounds θ̄ₙ*, θ*ₙ₁, θ*ₙ₂ are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5.4.12 and 5.4.13).
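As a quick numerical check of the pivot bound (5.4.59) (our sketch, not from the text), the simulation below estimates the coverage probability of the asymptotic lower confidence bound in the Bernoulli(θ) model, where I(θ) = 1/[θ(1 − θ)]. The settings θ = 0.3, n = 400, and α = 0.05 are arbitrary choices.

```python
import math
import random

Z95 = 1.6449  # z_{0.95}

def lcb(xs):
    """Asymptotic level-0.95 lower confidence bound (5.4.59):
    theta_hat - z_{1-alpha}/sqrt(n I(theta_hat)), Bernoulli model."""
    n = len(xs)
    th = sum(xs) / n
    return th - Z95 * math.sqrt(th * (1.0 - th) / n)

rng = random.Random(2)
theta, n, reps, cover = 0.3, 400, 2000, 0
for _ in range(reps):
    xs = [1 if rng.random() < theta else 0 for _ in range(n)]
    if lcb(xs) <= theta:
        cover += 1
print(cover / reps)  # should be near the nominal 0.95
```

At moderate n the realized coverage is close to, but not exactly, 1 − α, which is the sense in which (5.4.59) is only asymptotically level 1 − α.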
Summary. We have defined asymptotic optimality for estimates in one-parameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of θ in a one-dimensional subfamily of the multinomial distributions, showed that the MLE formally achieves this bound, and made the latter result sharp in the context of one-parameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M-estimates, generalizations of the MLE, along the lines of Huber (1967). The asymptotic formulae we derived are applied to the MLE both under the model that led to it and under an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980).
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n → ∞ in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4: i.i.d. observations from a regular model in which Θ is open ⊂ R or Θ = {θ₁, …, θ_k} is finite, and θ is identifiable. Most of the questions we address and answer are under the assumption that the random parameter θ equals an arbitrary specified value θ or, in frequentist terms, that θ is true.
Consistency

The first natural question is whether the Bayes posterior distribution concentrates all mass more and more tightly around θ as n → ∞. Intuitively this means that the data coming from P_θ eventually wipe out any prior belief that parameter values not close to θ are likely. Formalizing this statement about the posterior distribution, Π(· | X₁, …, Xₙ), which is a function-valued statistic, is somewhat subtle in general. But for Θ = {θ₁, …, θ_k} it is
straightforward. Let
π(θᵢ | X₁, …, Xₙ) = P[θ = θᵢ | X₁, …, Xₙ]. (5.5.1)

Then we say that Π(· | X₁, …, Xₙ) is consistent iff for all θ ∈ Θ,

P_θ[ |π(θ | X₁, …, Xₙ) − 1| ≥ ε ] → 0 for all ε > 0. (5.5.2)

There is a slightly stronger definition: Π(· | X₁, …, Xₙ) is a.s. consistent iff for all θ ∈ Θ,

π(θ | X₁, …, Xₙ) → 1 a.s. P_θ. (5.5.3)

General a.s. consistency is not hard to formulate:

Π(· | X₁, …, Xₙ) ⇒ δ{θ} a.s. P_θ, (5.5.4)

where ⇒ denotes convergence in law and δ{θ} is point mass at θ. There is a completely satisfactory result for Θ finite.
Theorem 5.5.1. Let πⱼ = P[θ = θⱼ], j = 1, …, k, denote the prior distribution of θ. Then Π(· | X₁, …, Xₙ) is consistent (a.s. consistent) iff πⱼ > 0 for j = 1, …, k.
Proof. Let p(·, θ) denote the frequency or density function of X. The necessity of the condition is immediate because πⱼ = 0 for some j implies that π(θⱼ | X₁, …, Xₙ) = 0 for all X₁, …, Xₙ because, by (1.2.8),

π(θⱼ | X₁, …, Xₙ) = P[θ = θⱼ | X₁, …, Xₙ] = πⱼ Πᵢ₌₁ⁿ p(Xᵢ, θⱼ) / Σₐ₌₁ᵏ πₐ Πᵢ₌₁ⁿ p(Xᵢ, θₐ). (5.5.5)

Intuitively, no amount of data can convince a Bayesian who has decided a priori that θⱼ is impossible. On the other hand, suppose all πⱼ are positive. If the true parameter is θⱼ, or equivalently θ = θⱼ, then, for a ≠ j,

log[ π(θₐ | X₁, …, Xₙ) / π(θⱼ | X₁, …, Xₙ) ] = n( (1/n) log(πₐ/πⱼ) + (1/n) Σᵢ₌₁ⁿ log[p(Xᵢ, θₐ)/p(Xᵢ, θⱼ)] ).

By the weak (respectively strong) LLN, under P_{θⱼ},

(1/n) Σᵢ₌₁ⁿ log[ p(Xᵢ, θₐ)/p(Xᵢ, θⱼ) ] → E_{θⱼ}( log[ p(X₁, θₐ)/p(X₁, θⱼ) ] )

in probability (respectively a.s.). But E_{θⱼ}( log[p(X₁, θₐ)/p(X₁, θⱼ)] ) < 0, by Shannon's inequality, if θₐ ≠ θⱼ. Therefore,

log[ π(θₐ | X₁, …, Xₙ) / π(θⱼ | X₁, …, Xₙ) ] → −∞

in the appropriate sense, and the theorem follows. □
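A minimal numerical illustration of this consistency result (our sketch, not from the text): for i.i.d. Bernoulli data and a uniform prior on the two-point parameter set {0.3, 0.7}, the posterior mass piles up on the true value. The true value 0.7, the seed, and the sample size are arbitrary choices.

```python
import math
import random

# Posterior over a two-point parameter space for i.i.d. Bernoulli(theta)
# data generated from the true value theta = 0.7.
thetas = [0.3, 0.7]
logpost = [math.log(0.5), math.log(0.5)]  # uniform prior, in log scale
rng = random.Random(3)
true_theta = 0.7
for _ in range(200):
    x = 1 if rng.random() < true_theta else 0
    for j, t in enumerate(thetas):
        logpost[j] += math.log(t if x == 1 else 1.0 - t)
m = max(logpost)                      # normalize stably in log scale
w = [math.exp(v - m) for v in logpost]
post = [v / sum(w) for v in w]
print(post)  # almost all mass on the true value 0.7
```

The log-posterior ratio drifts at rate n times the Kullback–Leibler divergence, which is exactly the exponential decay noted in Remark 5.5.1.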
Remark 5.5.1. We have proved more than is stated. Namely, that for each θ ∈ Θ, P_θ[θ ≠ θ | X₁, …, Xₙ] → 0 exponentially.

As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:

Asymptotic normality of the posterior distribution

Under conditions A0–A6 for ρ(x, θ) = −l(x, θ) = −log p(x, θ), we showed in Section 5.4 that, if θ̂ is the MLE,

L_θ(√n(θ̂ − θ)) → N(0, I^{−1}(θ)). (5.5.6)

Consider L(√n(θ − θ̂) | X₁, …, Xₙ), the posterior probability distribution of √n(θ − θ̂(X₁, …, Xₙ)), where we emphasize that θ̂ depends only on the data and is a constant given X₁, …, Xₙ. For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in P_θ probability by convergence a.s. P_θ. We also add:

A7: For all θ and all δ > 0 there exists ε(δ, θ) > 0 such that

P_θ[ sup{ (1/n) Σᵢ₌₁ⁿ [l(Xᵢ, θ′) − l(Xᵢ, θ)] : |θ′ − θ| ≥ δ } ≤ −ε(δ, θ) ] → 1.

A8: The prior distribution has a density π(·) on Θ such that π(·) is continuous and positive at all θ.

Remarkably,

Theorem 5.5.2 ("Bernstein/von Mises"). If conditions A0–A3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold, then

L(√n(θ − θ̂) | X₁, …, Xₙ) → N(0, I^{−1}(θ)) (5.5.7)

a.s. under P_θ for all θ.

We can rewrite (5.5.7) more usefully as

sup_x |P[√n(θ − θ̂) ≤ x | X₁, …, Xₙ] − Φ(x√I(θ))| → 0 (5.5.8)

for all θ a.s. P_θ and, of course, the statement holds for our usual and weaker convergence in P_θ probability also. From this restatement we obtain the important corollary.

Corollary 5.5.1. Under the conditions of Theorem 5.5.2,

sup_x |P[√n(θ − θ̂) ≤ x | X₁, …, Xₙ] − Φ(x√I(θ̂))| → 0 (5.5.9)

a.s. P_θ for all θ.
Remarks

(1) Statements (5.5.4) and (5.5.7)–(5.5.9) are, in fact, frequentist statements about the asymptotic behavior of certain function-valued statistics.

(2) Claims (5.5.8) and (5.5.9) hold with a.s. convergence replaced by convergence in P_θ probability if A4 and A5 are used rather than their strong forms; see Problem 5.5.7.

(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and identifiability guarantees consistency of θ̂ in a regular model.
Proof. We compute the posterior density of t = √n(θ − θ̂) as

qₙ(t) = dₙ^{−1} π(θ̂ + t/√n) Πᵢ₌₁ⁿ p(Xᵢ, θ̂ + t/√n), (5.5.10)

where dₙ = dₙ(X₁, …, Xₙ) is the corresponding normalizing constant. Divide top and bottom of (5.5.10) by Πᵢ₌₁ⁿ p(Xᵢ, θ̂) to obtain

qₙ(t) = dₙ^{−1} π(θ̂ + t/√n) exp{ Σᵢ₌₁ⁿ [l(Xᵢ, θ̂ + t/√n) − l(Xᵢ, θ̂)] }, (5.5.11)

where l(x, θ) = log p(x, θ) and dₙ is renormalized accordingly. We claim that

P_θ[ dₙqₙ(t) → π(θ) exp{−t²I(θ)/2} for all t ] = 1 (5.5.12)

for all θ. To establish this note that

(a) sup{ |π(θ̂ + t/√n) − π(θ)| : |t| ≤ M } → 0 a.s. for all M because θ̂ is a.s. consistent and π is continuous.

(b) Expanding,

Σᵢ₌₁ⁿ [l(Xᵢ, θ̂ + t/√n) − l(Xᵢ, θ̂)] = (t²/2n) Σᵢ₌₁ⁿ (∂²l/∂θ²)(Xᵢ, θ*(t)), (5.5.13)

where |θ*(t) − θ̂| ≤ |t|/√n. We use Σᵢ₌₁ⁿ (∂l/∂θ)(Xᵢ, θ̂) = 0 here. By A4(a.s.) and A5(a.s.),

sup{ |(1/n) Σᵢ₌₁ⁿ (∂²l/∂θ²)(Xᵢ, θ*(t)) − (1/n) Σᵢ₌₁ⁿ (∂²l/∂θ²)(Xᵢ, θ)| : |t| ≤ M } → 0

for all M, a.s. P_θ. Using (5.5.13), the strong law of large numbers (SLLN), and A8, we obtain (Problem 5.5.3)

P_θ[ dₙqₙ(t) → π(θ) exp{ E_θ((∂²l/∂θ²)(X₁, θ)) t²/2 } for all t ] = 1. (5.5.14)

Using A6 we obtain (5.5.12).
Now consider

dₙ = ∫_{|s|<δ√n} dₙqₙ(s) ds + √n ∫ π(t) exp{ Σᵢ₌₁ⁿ [l(Xᵢ, t) − l(Xᵢ, θ̂)] } 1(|t − θ̂| ≥ δ) dt. (5.5.15)

By A8 and A7,

P_θ[ sup{ exp{ Σᵢ₌₁ⁿ (l(Xᵢ, t) − l(Xᵢ, θ̂)) } : |t − θ̂| ≥ δ } ≤ e^{−nε(δ,θ)} ] → 1 (5.5.16)

for all δ, so that the second term in (5.5.15) is bounded by √n e^{−nε(δ,θ)} → 0 a.s. P_θ for all δ > 0. Finally note that (Problem 5.5.4), by arguing as for (5.5.14), there exists δ(θ) > 0 such that

P_θ[ dₙqₙ(t) ≤ 2π(θ) exp{ E_θ((∂²l/∂θ²)(X₁, θ)) t²/4 } for all |t| ≤ δ(θ)√n ] → 1. (5.5.17)

By (5.5.15) and (5.5.16), for all δ > 0,

P_θ[ dₙ − ∫_{|s|<δ√n} dₙqₙ(s) ds → 0 ] = 1. (5.5.18)

Finally, apply the dominated convergence theorem, Theorem B.7.5, to dₙqₙ(s)1(|s| ≤ δ(θ)√n), using (5.5.14) and (5.5.17), to conclude that, a.s. P_θ,

dₙ → π(θ) ∫_{−∞}^{∞} exp{−s²I(θ)/2} ds = π(θ)√(2π)/√I(θ). (5.5.19)

Hence, a.s. P_θ,

qₙ(t) → √I(θ) φ(t√I(θ)),

where φ is the standard Gaussian density, and the theorem follows from Scheffé's Theorem B.7.6 and Proposition B.7.2. □

Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N(θ, σ²) distribution with σ² known and we put a N(η, τ²) prior on θ. Then the posterior
distribution of θ is N(w₁ₙη + w₂ₙX̄, (n/σ² + 1/τ²)^{−1}), where

w₁ₙ = σ²/(nτ² + σ²), w₂ₙ = 1 − w₁ₙ. (5.5.20)

Evidently, as n → ∞, w₁ₙ → 0, X̄ → θ a.s. if θ = θ, and (n/σ² + 1/τ²)^{−1} → 0. That is, the posterior distribution has mean approximately θ̂ and variance approximately 0 for n large or, equivalently, the posterior is close to point mass at θ̂, as we expect from Theorem 5.5.1. Because θ̂ = X̄, √n(θ − θ̂) has posterior distribution

N( √n w₁ₙ(η − X̄), n(n/σ² + 1/τ²)^{−1} ).

Now, √n w₁ₙ = O(n^{−1/2}) = o(1) and n(n/σ² + 1/τ²)^{−1} → σ² = I^{−1}(θ), and we have directly established the conclusion of Theorem 5.5.2. □
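The weights in (5.5.20) can be checked numerically (a sketch, not from the text; the values σ² = 1 and τ² = 4 are arbitrary choices): w₁ₙ → 0 and n times the posterior variance tends to σ² = 1/I(θ), exactly as the example asserts.

```python
# Posterior weights and variance in the N(theta, sigma^2) model with a
# N(eta, tau^2) prior: w1n -> 0 and n * Var -> sigma^2 = 1/I(theta).
sigma2, tau2 = 1.0, 4.0
for n in (10, 100, 10000):
    w1n = sigma2 / (n * tau2 + sigma2)          # weight on the prior mean eta
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)  # posterior variance of theta
    print(n, round(w1n, 6), round(n * post_var, 4))
```

Already at n = 100 the prior weight is below one percent, which is the quantitative content of "the data wipe out the prior."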
Example 5.5.2. Posterior Behavior in the Binomial–Beta Model. (Example 3.2.3 continued). If we observe Sₙ with a binomial, B(n, θ), distribution or, equivalently, we observe X₁, …, Xₙ i.i.d. Bernoulli, B(1, θ), and put a beta, β(r, s), prior on θ, then, as in Example 3.2.3, θ has posterior β(Sₙ + r, n + s − Sₙ). We have shown in Problem 5.3.20 that if U_{a,b} has a β(a, b) distribution, then, as a → ∞, b → ∞,

L( [(a + b)³/(ab)]^{1/2} ( U_{a,b} − a/(a + b) ) ) → N(0, 1). (5.5.21)

If 0 < θ < 1 is true, Sₙ/n → θ a.s., so that Sₙ + r → ∞ and n + s − Sₙ → ∞ a.s. P_θ. By identifying a with Sₙ + r and b with n + s − Sₙ, we conclude after some algebra that, because θ̂ = X̄,

L( √n(θ − X̄) | X₁, …, Xₙ ) → N(0, θ(1 − θ))

a.s. P_θ, as claimed by Theorem 5.5.2. □
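The scaling in this example can be checked directly from the exact beta posterior moments (a sketch, not from the text; the prior parameters r = 2, s = 3 and samples with X̄ = 0.3 are arbitrary choices): n times the posterior variance approaches θ(1 − θ).

```python
# Beta(S_n + r, n + s - S_n) posterior moments in the binomial-beta model:
# on the sqrt(n) scale the posterior variance approaches theta(1 - theta),
# matching the N(0, theta(1 - theta)) limit claimed in the example.
r, s = 2.0, 3.0
for n in (20, 200, 20000):
    Sn = 0.3 * n                       # a sample with xbar = 0.3
    a, b = Sn + r, n + s - Sn
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    print(n, round(mean, 4), round(n * var, 4))  # n * var -> 0.21 = 0.3 * 0.7
```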
Bayesian optimality of optimal frequentist procedures and frequentist optimality of Bayesian procedures

Theorem 5.5.2 has two surprising consequences.

(a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions.

As an example of this phenomenon consider the following.

Theorem 5.5.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let θ̂ be the MLE of θ and let θ* be the median of the posterior distribution of θ. Then

(i)

√n(θ* − θ̂) → 0 (5.5.22)

a.s. P_θ for all θ. Consequently,

θ* = θ + n^{−1} Σᵢ₌₁ⁿ I^{−1}(θ)(∂l/∂θ)(Xᵢ, θ) + o_{P_θ}(n^{−1/2}) (5.5.23)

and L_θ(√n(θ* − θ)) → N(0, I^{−1}(θ)).

(ii)

E( √n(|θ − θ̂| − |θ − θ*|) | X₁, …, Xₙ ) → 0 a.s. P_θ for all θ, (5.5.24)

and

E( √n(|θ − θ̂| − |θ|) | X₁, …, Xₙ ) = min_d E( √n(|θ − d| − |θ|) | X₁, …, Xₙ ) + o_p(1). (5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions lₙ(θ, d) = √n(|θ − d| − |θ|). But the Bayes estimates for lₙ and for l(θ, d) = |θ − d| must agree whenever E(|θ| | X₁, …, Xₙ) < ∞. (Note that if E(|θ| | X₁, …, Xₙ) = ∞, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.14.22),

sup_x |P[√n(θ − θ̂) ≤ x | X₁, …, Xₙ] − Φ(x√I(θ))| → 0 a.s. P_θ. (5.5.26)

But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.11). Thus, any median of the posterior distribution of √n(θ − θ̂) tends to 0, the median of N(0, I^{−1}(θ)), a.s. P_θ. But the median of the posterior of √n(θ − θ̂) is √n(θ* − θ̂), and (5.5.22) follows. To prove (5.5.24) note that

| √n(|θ − θ̂| − |θ − θ*|) | ≤ √n|θ̂ − θ*|

and, hence, that

| E( √n(|θ − θ̂| − |θ − θ*|) | X₁, …, Xₙ ) | ≤ √n|θ̂ − θ*| → 0 (5.5.27)

a.s. P_θ, for all θ. Because a.s. convergence P_θ for all θ implies a.s. convergence P (B.7), claim (5.5.24) follows and, hence,

E( √n(|θ − θ̂| − |θ|) | X₁, …, Xₙ ) = E( √n(|θ − θ*| − |θ|) | X₁, …, Xₙ ) + o_p(1). (5.5.28)
Because by Problem 1.4.7 and Proposition 3.2.1, θ* is the Bayes estimate for lₙ(θ, d), (5.5.25) and the theorem follow. □

Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions

There is another result illustrating that the frequentist inferential procedures based on θ̂ agree with Bayesian procedures to first order.

Theorem 5.5.4. Suppose the conditions of Theorem 5.5.2 are satisfied. Let

Cₙ = {θ : π(θ | X₁, …, Xₙ) ≥ cₙ},

where cₙ is chosen so that Π(Cₙ | X₁, …, Xₙ) = 1 − α, be the Bayes credible region defined in Section 4.7. Let Iₙ(γ) be the asymptotically level 1 − γ optimal interval based on θ̂,

Iₙ(γ) = [θ̂ − dₙ(γ), θ̂ + dₙ(γ)],

where dₙ(γ) = z(1 − γ/2)/√(nI(θ̂)). Then, for every ε > 0,

P_θ[ Iₙ(α + ε) ⊂ Cₙ(X₁, …, Xₙ) ⊂ Iₙ(α − ε) ] → 1. (5.5.29)
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of √n(θ − θ̂) converges to the N(0, I^{−1}(θ)) density uniformly over compact neighborhoods of 0 for each fixed θ, is sketched in Problem 5.5.6. The message of the theorem should be clear: Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis, both in this case and in estimation, reveals that any approximations to Bayes procedures on a scale finer than n^{−1/2} do involve the prior. A particular choice, the Jeffreys prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n^{−1} order (see Schervish, 1995).
Testing
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of θ₀ given X₁, …, Xₙ if H is false is of a different magnitude than the p-value for the same data. For more on this so-called Lindley paradox see Berger (1985) and Schervish (1995). However, if instead of considering hypotheses specifying the one point θ₀ we consider indifference regions where H specifies an interval such as (θ₀ − Δ, θ₀ + Δ), then Bayes and frequentist testing procedures agree in the limit. See Problem 5.5.2.

Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a priori possible. Second, we established
the so-called Bernstein–von Mises theorem, actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the Bernstein–von Mises theorem and frequentist confidence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose X₁, …, Xₙ are i.i.d. as X ∼ F, where F has median F^{−1}(1/2) and a continuous density.
(a) Show that, if n = 2k + 1,

E_F med(X₁, …, Xₙ) = n C(2k, k) ∫₀¹ F^{−1}(t) tᵏ(1 − t)ᵏ dt,

E_F med²(X₁, …, Xₙ) = n C(2k, k) ∫₀¹ [F^{−1}(t)]² tᵏ(1 − t)ᵏ dt,

where C(2k, k) denotes the binomial coefficient "2k choose k."

(b) Suppose F is uniform, U(0, 1). Find the MSE of the sample median for n = 1, 3, and 5.
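For part (b), the (k + 1)st order statistic of n = 2k + 1 uniforms is Beta(k + 1, k + 1), so the MSE about the true median 1/2 is its variance, 1/(4(2k + 3)): 1/12, 1/20, 1/28 for n = 1, 3, 5. The following Monte Carlo sketch (not from the text; the rep count and seed are arbitrary) checks this.

```python
import random
import statistics

def mc_mse_median(n, reps=100000, seed=4):
    """Monte Carlo MSE of the sample median about 1/2 for U(0,1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        m = statistics.median(rng.random() for _ in range(n))
        total += (m - 0.5) ** 2
    return total / reps

# For n = 2k+1 the median is Beta(k+1, k+1), with variance 1/(4(2k+3)):
for n, exact in ((1, 1 / 12), (3, 1 / 20), (5, 1 / 28)):
    print(n, round(mc_mse_median(n), 4), round(exact, 4))
```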
2. Suppose Z ∼ N(μ, 1) and V is independent of Z with distribution χ²ₘ. Then T = Z/(V/m)^{1/2} is said to have a noncentral t distribution with noncentrality μ and m degrees of freedom. See Section 4.9.2.

(a) Show that

P[T ≤ t] = ∫₀^∞ Φ(t√(w/m) − μ) f_m(w) dw,

where f_m(w) is the χ²ₘ density and Φ is the normal distribution function.

(b) If X₁, …, Xₙ are i.i.d. N(μ, σ²), show that √n X̄ / [(n − 1)^{−1} Σ(Xᵢ − X̄)²]^{1/2} has a noncentral t distribution with noncentrality parameter √n μ/σ and n − 1 degrees of freedom.

(c) Show that T² in (a) has a noncentral F₁,ₘ distribution with noncentrality parameter μ². Deduce that the density of T is

p(t) = 2 Σᵢ₌₀^∞ P[R = i] h_{2i+1}(t²) [φ(t − μ)1(t > 0) + φ(t + μ)1(t < 0)],

where R is given in Problem B.3.12.
Hint: Condition on |T|.

3. Show that if P[|X| ≤ 1] = 1, then Var(X) ≤ 1 with equality iff X = ±1 with probability 1/2.

Hint: Var(X) ≤ EX².
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and ε through √n ε.

(a) Show that the ratio of the Hoeffding function h(√n ε) to the Chebychev function c(√n ε) tends to 0 as √n ε → ∞, so that h(·) is arbitrarily better than c(·) in the tails.

(b) Show that the normal approximation 2Φ(√n ε/σ) − 1 gives lower results than h in the tails if P[|X| ≤ 1] = 1 because, if σ² < 1, 1 − Φ(t) ∼ φ(t)/t as t → ∞.

Note: Hoeffding (1963) exhibits better bounds for known σ².
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5.2

1. Using the notation of Theorem 5.2.1, show that
2. Let X₁, …, Xₙ be i.i.d. N(μ, σ²). Show that for all n ≥ 1, all ε > 0,

sup_{(μ,σ)} P_{(μ,σ)}[ |X̄ − μ| ≥ ε ] = 1.

Hint: Let σ → ∞.
3. Establish (5.2.5). Hint: |q(P̂ₙ) − q(p)| ≥ ε ⟹ |P̂ₙ − p| ≥ ω(ε).
4. Let (Uᵢ, Vᵢ), 1 ≤ i ≤ n, be i.i.d. ∼ P ∈ P.

(a) Let γ(P) = P[U₁ > 0, V₁ > 0]. Show that if P = N(0, 0, 1, 1, ρ), then

ρ = sin 2π( γ(P) − 1/4 ).
(b) Deduce that if P is the bivariate normal distribution, then ρ̃ = sin 2π(γ̂ − 1/4), where γ̂ is the proportion of pairs (Uᵢ, Vᵢ) with both coordinates positive, is a consistent estimate of ρ.

(c) Suppose ρ(P) is defined generally as Cov_P(U, V)/√(Var_P U Var_P V) for P ∈ P = {P : E_P U² + E_P V² < ∞, Var_P U Var_P V > 0}. Show that the sample correlation coefficient continues to be a consistent estimate of ρ(P), but ρ̃ is no longer consistent.

5. Suppose X₁, …, Xₙ are i.i.d. N(μ, σ₀²), where σ₀ is known.

(a) Show that condition (5.2.14)(i) holds.

(b) Show that condition (5.2.8) fails even in this simplest case, in which the consistency of X̄ for μ is clear.

6. Prove that conditions (5.2.14)(i) and (ii) suffice for consistency.

7. (Wald) Suppose θ ↦ ρ(X, θ) is continuous, θ ∈ R, and

(i) for some ε(θ₀) > 0,

E_{θ₀} sup{ |ρ(X, θ′) − ρ(X, θ)| : θ′ ∈ S(θ, ε(θ₀)) } < ∞,

where S(θ, ε) is the ε ball about θ;

(ii) E_{θ₀} inf{ ρ(X, θ) − ρ(X, θ₀) : |θ − θ₀| ≥ A } > 0 for some A < ∞.

Show that the maximum contrast estimate θ̂ is consistent.

Hint: From continuity of ρ, (i), and the dominated convergence theorem, for each θ ≠ θ₀ and ε > 0 there is δ(θ) > 0 such that E_{θ₀} inf{ ρ(X, θ′) − ρ(X, θ₀) : θ′ ∈ S(θ, δ(θ)) } > 0. By compactness there is a finite number θ₁, …, θᵣ of sphere centers such that

K ∩ {θ : λ ≤ |θ − θ₀| ≤ A} ⊂ ∪ⱼ₌₁ʳ S(θⱼ, δ(θⱼ)).

Now

inf{ n^{−1} Σᵢ₌₁ⁿ [ρ(Xᵢ, θ) − ρ(Xᵢ, θ₀)] : θ ∈ K, |θ − θ₀| ≥ λ } ≥ min_{1≤j≤r} { n^{−1} Σᵢ₌₁ⁿ inf{ ρ(Xᵢ, θ′) − ρ(Xᵢ, θ₀) : θ′ ∈ S(θⱼ, δ(θⱼ)) } },

and the right-hand side is eventually positive a.s. by the law of large numbers; conclude by the basic property of maximum contrast estimates. Hint: K can be taken as [−A, A].
8. Extend the result of Problem 7 to the case θ ∈ Rᵖ, p > 1.

9. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistency on K.

10. The condition of Problem 7(ii) can also fail. Let X₁, …, Xₙ be i.i.d. N(μ, σ²), both parameters unknown. Show that the log likelihood tends to ∞ as σ → 0 and the condition fails. Hint: Compact sets K can be taken of the form {|μ| ≤ A, ε ≤ σ ≤ 1/ε}, ε > 0.

Problems for Section 5.3

1. Establish (5.3.3) for j odd as follows:

(i) Suppose X₁′, …, Xₙ′ are i.i.d. with the same distribution as X₁, …, Xₙ but independent of them, and let X̄′ = n^{−1} Σᵢ Xᵢ′. Then, by Jensen's inequality, E|X̄ − μ|ʲ ≤ E|X̄ − X̄′|ʲ.

(ii) If the εᵢ are i.i.d., take the values ±1 with probability 1/2, and c₁, …, cₙ are constants, then E|Σᵢ₌₁ⁿ cᵢεᵢ|ʲ ≤ Mⱼ (Σᵢ₌₁ⁿ cᵢ²)^{j/2} for some constants Mⱼ.

(iii) Condition on |Xᵢ − Xᵢ′|, i = 1, …, n, in (i) and apply (ii) to get

(iv) E|Σᵢ₌₁ⁿ (Xᵢ − Xᵢ′)|ʲ ≤ Mⱼ E[Σᵢ₌₁ⁿ (Xᵢ − Xᵢ′)²]^{j/2}, and conclude (5.3.3).

2. Establish (5.3.9) in the exponential model of Example 5.3.1.

3. Establish (5.3.11). Hint: See part (a) of the proof of Lemma 5.3.1.

4. Establish Theorem 5.3.1. Hint: Taylor expand and note that if i₁ + ⋯ + i_d = m, then, for random variables Y₁, …, Y_d,

E| Πₖ₌₁^d Yₖ^{iₖ} | ≤ m^{m+1} Σₖ₌₁^d E|Yₖ|^m.
5. Show that if E|X₁|ʲ < ∞, j ≥ 2, then E|Xᵢ − μ|ʲ < ∞, where μ = E(X₁).

Hint: By the iterated expectation theorem,

E|Xᵢ − μ|ʲ = E{|Xᵢ − μ|ʲ | |Xᵢ| > |μ|} P(|Xᵢ| > |μ|) + E{|Xᵢ − μ|ʲ | |Xᵢ| ≤ |μ|} P(|Xᵢ| ≤ |μ|).

6. Show that

sup{ |E[(Xᵢ₁ − μ) ⋯ (Xᵢⱼ − μ)]| : i₁, …, iⱼ ∈ {1, …, n} } ≤ E|X₁ − μ|ʲ.

7. Let X₁, …, X_{n₁} be i.i.d. F and Y₁, …, Y_{n₂} be i.i.d. G, and suppose the X's and Y's are independent.

(a) Show that if F and G are N(μ₁, σ₁²) and N(μ₂, σ₂²), respectively, then (s₁²/σ₁²)/(s₂²/σ₂²) has an F_{k,m} distribution with k = n₁ − 1 and m = n₂ − 1, where s₁² = (n₁ − 1)^{−1} Σᵢ(Xᵢ − X̄)² and s₂² = (n₂ − 1)^{−1} Σᵢ(Yᵢ − Ȳ)². It follows that the LR test of H : σ₁² = σ₂² versus K : σ₁² ≠ σ₂² is based on the statistic s₁²/s₂².

(b) Show that when F and G are normal as in part (a), then, under H : σ₁² = σ₂²,

P_H( s₁²/s₂² ≤ c_{k,m} ) → 1 − α as k → ∞,

if m = λk for some λ > 0 and c_{k,m} = 1 + √(κ(k + m)/(km)) z_{1−α}, where κ = Var[((X₁ − μ₁)/σ₁)²], μ₁ = E(X₁), σ₁² = Var(X₁).

(c) Now suppose that F and G are not necessarily normal, but that 0 < Var(X₁²) < ∞ and, under H, Var(X₁) = Var(Y₁). Show that P(s₁²/s₂² ≤ c_{k,m}) → 1 − α as k → ∞.

(d) Let ĉ_{k,m} be c_{k,m} with κ replaced by its method of moments estimate. Show that, under the assumptions of part (c), P(s₁²/s₂² ≤ ĉ_{k,m}) → 1 − α as k → ∞.
H En.I))T has the same asymptotic distribution as n~ [n.p) ~ N(O.m (depending on ~'l ~ Var[ (X I .l EXiYi . then jn(r' . .I).lIl) + 1 . . from {t I.) =a 2 < 00 and Var(U) Hint: (a) X . "1 = "2 = I. Under the assumptions of part (c).a as k . Without loss of generality.A») where 7 2 = Var(XIl.4.i. then jn(r . t N }. N where the t:i are i.l EY? . Instead assume that 0 < Var(Y?) < x. i = I.2" ( 11. (a) jn(X .".~) (X .. .p) ~ N(O. 0 < ..1 EX. 1).3. = = 1.. if p i' 0. + I+p" ' ) I I I II.i.+xn n when T.XN} or {(uI. and if EX. () E such that T i = ti where ti = (ui.\ 00. 4p' (I . Var(€.. be qk. n. . In particular. p).m (0 Let 9. . JLl = JL2 = 0. . • . • 10.. < 00 (in the supennodel).XC) wHere XC = N~n Et n+l Xi. Show that jn(XRx) ~N(0. as N 2 t Show that. jn(r .p. In Example 5. show that i' ~ .. suppose i j = j." . if 0 < EX~ < 00 and 0 < Eyl8 < 00. Y) ~ N(I'I. = (1. = ("t .\ < 1. Wnte (l 1pP .6. (iJl.a: as k + 00. n.Il2.1 · /. ..4...3.. Po.1 that we use Til. . Consider as estimates e = ') (I X = x.d. .I.xd.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters..x) ~ N(O.. .1. Show that under the assumptions of part (e). . . . (b) Use the delta method for the multivariate case and note (b opt  b)(U .d.iL) • .6. iik. if ~ then .p).p') ~ N(O. suppose in the context of Example 3.N.m with KI and K2 replaced by their method of moment estimates.1'2)1"2)' such that PH(SI!S~ < Qk..". (iJi .350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g. 7 (1 . . that is. ! . Hint: Use the centra] limit theorem and Slutsky's theorem.. • op(n')." •. i = 1. .. .2< 7'. Without loss of generality. use a normal approximation to tind an approximate critical value qk. . . then P(sUs~ < qk. (cj Show that if p = 0.1. ~ ] . . Tin' which we have sampled at random ~ E~ 1 Xi. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3. ._ J . to estimate Ii 1 < j < n.m) + 1 .(I_A)a2). jn((C .+.I' ill "rI' and "2 = Var[( X. (UN. Et:i = 0. 
9. Consider the sample correlation coefficient r based on (X₁, Y₁), …, (Xₙ, Yₙ).

(a) If μ₁ = μ₂ = 0 and σ₁ = σ₂ = 1, show that √n(r − ρ) has the same asymptotic distribution as

√n[ n^{−1} Σ XᵢYᵢ − ρ − (ρ/2)(n^{−1} Σ Xᵢ² − 1) − (ρ/2)(n^{−1} Σ Yᵢ² − 1) ].

(b) If (X, Y) ∼ N(μ₁, μ₂, σ₁², σ₂², ρ), show that √n(r − ρ) → N(0, (1 − ρ²)²). Hint: Use the central limit theorem, Slutsky's theorem, and the delta method for the multivariate case.

(c) Show that if ρ = 0, then √n r → N(0, 1), and, if ρ ≠ 0, then √n(r² − ρ²) → N(0, 4ρ²(1 − ρ²)²).

(d) Show that (1/2) log[(1 + ρ)/(1 − ρ)] is the variance stabilizing transformation for the correlation coefficient in Example 5.3.6.

10. In survey sampling the model-based approach postulates that the population {x₁, …, x_N} or {(u₁, x₁), …, (u_N, x_N)} we are interested in is itself a sample from a superpopulation that is known up to parameters; that is, there exist T₁, …, T_N i.i.d. P_θ, θ ∈ Θ, such that Tᵢ = tᵢ, where tᵢ = (uᵢ, xᵢ), i = 1, …, N. Suppose we use T_{i₁}, …, T_{iₙ}, which we have sampled at random from {t₁, …, t_N}, to estimate x̄ = N^{−1} Σᵢ₌₁ᴺ xᵢ. Consider as estimates

(i) X̄ = n^{−1} Σₖ₌₁ⁿ X_{iₖ};

(ii) the regression estimate X̄_R = X̄ + b_opt(ū − Ū), where Ū is the sample mean of the u's and ū is their population mean.

(a) Show that √n(X̄ − x̄) → N(0, (1 − λ)τ²), where τ² = Var(X₁), if n/N → λ, 0 < λ < 1.

(b) Suppose P_θ is such that Xᵢ = bUᵢ + εᵢ, where the εᵢ are i.i.d., Eεᵢ = 0, Var(εᵢ) = σ² < ∞, and the εᵢ are independent of the Uᵢ. Show that √n(X̄_R − x̄) → N(0, (1 − λ)σ²).

Hint: (a) Without loss of generality write X̄ − x̄ in terms of the sampled and the unsampled observations. (b) Use the delta method for the multivariate case and note that (b̂_opt − b)(Ū − ū) = o_P(n^{−1/2}).
.. X = XO. (b) Let Sn .3.. 15.. n = 5. (h) Deduce fonnula (5. = 0.. EU 13.99. is known as Fisher's approximation.3..3 < 00. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x . then E(UVW) = O.E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d.. Suppose X 1l . 14.i'c) I < Mn'.. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl). Here x q denotes the qth quantile of the distribution. each with HardyWeinberg frequency function f given by . .y'ri has approximately aN (0. W).10. (a) Suppose that Ely. . Suppose X I I • • • l X n is a sample from a peA) distrihution.14 to explain the numerical results of Problem 5. X. X n are independent.. 1931) is found to be excellent Use (5.25.i'b)(Yc .3' Justify fonnally j1" variance (72. 16..90. (a) Use this fact and Problem 5. and Hint: Use (5..3. (h) Use (a) to justify the approximation 17. Hint: If U is independent of (V. (a) Show that if n is large..3.6) to explain why. Suppose XI. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes. !) distribution..6 Problems and Complements 351 12.i'a)(Yb . The following approximation to the distribution of Sn (due to Wilson and Hilferty. E(WV) < 00.13(c).: .12). X n is a sample from a population with mean third central moment j1.3.Section 5. . Let Sn have a X~ distribution.14).n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO. Normalizing Transformation for the Poisson Distribution. .s. Show that IE(Y. . (a) Show that the only transformations h that make E[h(X) . X~.
a) Hmt: Use Bm.a tends to zero at the rate I/(m + n)2. Show that the only variance stabilizing transformation h such that h(O) = 0. (e) What is the approximate distribution of Vr'( X . . Y) where (X" Yi). 0< a < I. and h'(t) > for all t. V)) '" . 1'2)h2(1'1.I'll + h2(1'1. Y" . Var(X) = 171. Let then have a beta distribution with parameters m and n. 1'2)]2C7n + O(n 2) i . [vm +  n (Bm. Y) . 172 + [h 2(1'l. h 2 (x.. Bm.n P . Justify fonnally the following expressions for the moments of h(X. (1'1. Show directly using Problem B. 19. . ° ..h(l'l. . if m/(m + n) . Cov(X. Yn are < < . Let X I. Y) ~ p!7IC72.n  m/(m + n») < x] 1 > 'li(x). (b) Find an approximation to P[JX' < t] in terms of 0 and t. where I' = (a) Find an approximation to P[X < E(X. i 21. v'a(l. Var(Y) = C7~. ..)? 18. E(Y) = 1'2..n tends to zero at the rate I/(m + n)2. .y). 1'2) Hint: h(X. is given by h(t) = (2/1r) sin' (Yt).5 that under the conditions of the previous problem. where I I h. Var Bmnm + n +Rmn = 'm+n ' ' where Rm.(x. ..• • • ~{[hl (1".a) E(Bmn ) = . (1'1. X n be the indicators of n binomial trials with probability of success B.2.. h(l) = I.. (a) (b) Var(h(X.y). Yn) is a sample from a bivariate population with E(X) ~ 1'1. 1'2)(X . I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. (Xn .y) = a axh(x. • j b _ . + (mX/nY)] where Xl.Xm ... Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. .1') + X 2 .1'2)]2 177 n +2h. then I m a(l.n = (mX/nY)i1 independent standard eXJX>nentials. 1'2) = h. 1'2)pC7. 1'2)(Y  + O( n ').y) = a ayh(x. which are integers. Variance Stabilizing Transfo111U1tion for the Binomial Distribution. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t. 20.
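The binomial variance-stabilizing transformation stated here, h(t) = (2/π) sin⁻¹(√t) with h(0) = 0 and h(1) = 1, satisfies h′(θ)² θ(1 − θ) = 1/π², so by the delta method Var(h(X̄)) ≈ 1/(π²n) for every θ. The following sketch (illustration only; n, the replication count, and the seed are arbitrary) verifies this numerically:

```python
import math, random

def h(t):
    # arcsine transformation scaled so that h(0) = 0 and h(1) = 1
    return (2 / math.pi) * math.asin(math.sqrt(t))

random.seed(1)
n, reps = 200, 4000
target = 1 / (math.pi ** 2 * n)   # delta-method variance of h(Xbar): free of theta
ratios = {}
for theta in (0.3, 0.5, 0.7):
    vals = []
    for _ in range(reps):
        xbar = sum(random.random() < theta for _ in range(n)) / n
        vals.append(h(xbar))
    m = sum(vals) / reps
    v = sum((u - m) ** 2 for u in vals) / (reps - 1)
    ratios[theta] = v / target
print({t: round(rt, 2) for t, rt in ratios.items()})   # all near 1
```

The simulated variance of h(X̄) matches 1/(π²n) to within sampling error for each θ, while the variance of X̄ itself, θ(1 − θ)/n, would vary by a factor of more than 1.3 across these θ values.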
22. Suppose that X1, ..., Xn is a sample from a population with mean μ and variance σ² < ∞. Suppose h has a second derivative h⁽²⁾ continuous at μ and that h⁽¹⁾(μ) = 0.

(a) Show that √n[h(X̄) − h(μ)] →P 0, while n[h(X̄) − h(μ)] is asymptotically distributed as ½ h⁽²⁾(μ) σ²V, where V ~ χ²₁.

(b) Use part (a) to find the asymptotic distribution of n[X̄(1 − X̄) − ¼] when μ = ½, and give an approximation to the distribution of X̄(1 − X̄) in terms of the χ²₁ distribution function when μ = ½.

(c) Find the limiting laws of √n(X̄ − μ)² and n(X̄ − μ)².

23. Let X1, ..., Xn be a sample from a population with μ = E(X), σ² = Var(X) < ∞, and let T = X̄² be an estimate of μ².

(a) When μ ≠ 0, find the asymptotic distribution of √n(T − μ²) using the delta method.

(b) When μ = 0, find the asymptotic distribution of nT using P(nT ≤ t) = P(√n|X̄| ≤ √t). Compare your answer to the answer in part (a).

24. Let Sn ~ χ²n. Use Stirling's approximation and Problem B.2.4 to give a direct justification of

E(√Sn) = √n + Rn, where Rn/√n → 0 as n → ∞.

(It may be shown, but is not required, that |√n Rn| is bounded.) Recall Stirling's approximation.

25. Suppose that X1, ..., Xn is a sample from a population and that h is a real-valued function of X̄ whose derivatives of order k are denoted by h⁽ᵏ⁾, k ≥ 1. Suppose |h⁽⁴⁾(x)| ≤ M for all x and some constant M, and suppose that μ₄ = E(X1 − μ)⁴ is finite. Show that

Eh(X̄) = h(μ) + ½ h⁽²⁾(μ) σ²/n + Rn

where |Rn| ≤ |h⁽³⁾(μ)| |μ₃|/6n² + M(μ₄ + 3σ⁴)/24n².
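The two regimes of Problem 23 can be seen numerically. In the sketch below (illustration only; σ = 1 and the simulation sizes and seed are arbitrary choices), √n(X̄² − μ²) has variance close to the delta-method value 4μ²σ² when μ ≠ 0, while at μ = 0 the correct scaling is n, with nX̄²/σ² approximately χ²₁:

```python
import math, random

random.seed(3)
n, reps = 100, 4000

def xbar(mu):
    # mean of an N(mu, 1) sample of size n
    return sum(random.gauss(mu, 1) for _ in range(n)) / n

# Case mu != 0: sqrt(n)(Xbar^2 - mu^2) is approximately N(0, 4 mu^2 sigma^2)
mu = 1.0
vals = [math.sqrt(n) * (xbar(mu) ** 2 - mu ** 2) for _ in range(reps)]
m = sum(vals) / reps
v = sum((u - m) ** 2 for u in vals) / (reps - 1)
print("mu=1: simulated variance", round(v, 2), "vs delta-method value", 4 * mu ** 2)

# Case mu = 0: n * Xbar^2 / sigma^2 is approximately chi-square_1 (mean 1)
vals0 = [n * xbar(0.0) ** 2 for _ in range(reps)]
print("mu=0: mean of n*Xbar^2 =", round(sum(vals0) / reps, 2), "(chi^2_1 mean is 1)")
```

The √n rate fails at μ = 0 because the derivative of t ↦ t² vanishes there, exactly the situation of Problem 22.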
.\ul + (1 ~ or al . . Yn2 are as in Section 4.o.. Suppose that instead of using the onesample t intervals based on the differences Vi .JTi times the length of the onesample t interval based on the differences.9. . b l .~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .a.t. . . .t. . 'I :I ! t. n2 + 00.3) have correct asymptotic (c) Show that if a~ > al and .9.Jli = ~. and a~ = Var(Y1 ).(~':~fJ'). Suppose Xl.•. that E(Xt) < 00 and E(Y.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I ..') < 00. 1 Xn. .33) and Theorem 5. Let n _ 00 and (a) Show that P[T(t.9. E In] > 1 .\)al + . p) distribution.\a'4)I'). (c) Apply Slutsky's theorem.) of Example 4. if n}. .Xi we treat Xl. (d) Make a comparison of the asymptotic length of (4. We want to obtain confidence intervals on J1. and SD are as defined in Section 4. then limn PIt.\ probability of coverage. Hint: Use (5.3). We want to study the behavior of the twosample pivot T(Li. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . 0 < .9. < tJ ~ <P(t[(. Eo) where Eo = diag(a'.3) and the intervals based on the pivot ID . . .)/SD where D. 29.(j ' a ) N(O. Suppose nl + 00 . N(Pl (}2). Hint: (a)..a.\ < 1.3.4..9.9. . = = az.3).4. 27.. n2 + 00.) (b) Deduce that if. Let T = (D . I i t I .. a~. (c) Show that if IInl is the length of the interval In. Suppose (Xl.1/ S D where D and S D are as in Section 4.3 independent samples with III = E(XIl..3) has asymptotic probability of coverage < 1 .\)a'4)/(1 . (a) Show that P[T(t. 28. .3. Yd.9. whereas the situation is reversed if the sample size inequalities and variance inequalities agree.9.\. the intervals (4.t2.d. so that ntln ~ . Show that if Xl" ../". . I . (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment.354 26.4. /"' = E(YIl.2a 4 ).) < (b) Deduce that if p tl ~ <P (t [1. cyi . What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution. Viillnl ~ 2vul + a~z(l .9.}. 
XII are ij. . 1 X n1 and Y1 . .2 . " Yn as separate samples and use the twosample t intervals (4.\ > 1 . al = Var(XIl. . . the interval (4. Y1 . 2pawz)z(1  > 0 and In is given by (4. Assume that the observations have a common N(jtl.\.3. .
d. where tk(1. Hint: See Problems B.6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4. U( 1. .£.3. (d) Find or write a computer program that carries out the Welch test. as X ~ F and let I' = E(X)..9 and 4. Y nERd are Li. . Carry out a Monte Carlo study such as the one that led to Figure 5. Generalize Lemma 5.4. X n be i. EIY." . Let . has asymptotic level a. Tn . Let XI.~ t (Xi . then lim infn Var(Tn ) > Var(T). Let X ij (i = I. (a) SupposeE(X 4 ) < 00. x = (XI. vectors and EIY 1 1k < 00... then for all integers k: where C depends on d.Z). and let Va ~ (n . I< = Var[(X 1')/()'jZ. j = I. . .3.8Z = (n . (c) Let R be the method of moment estimaie of K.a).. then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31.4. then P(Vn < va<) t a as 11 t 00. I is the Euclidean norm.a) is the critical value using the Welch approximation.:~l IXjl. where I. 32._I' Find approximations to P(Yn and P{Vn < XnI (I .£.3 by showing that if Y 1.X)" Then by Theorem B. 30.z has a X~I distribution when F is theN(fJl (72) distribution..a)) and evaluate the approximations when F is 7. . (). k) be independent with Xi} ~ N(l'i.9. Plot your results. Hint: If Ixll ~ L.3.Section 5..1).I)z(a). . Show that if 0 < EX 8 < 00. . Show that as n t 00.i. T and where X is unifonn.z are Iii = k.I k and k only.d. (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l ..1)1 L. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. but Var(Tn ) + 00.I) + y'i«n .1.3. X.3 using the Welch test based on T rather than the twosample t test based on Sn.z = Var(X). (). . 33. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X.xdf and Ixl is Euclidean distance. .. Vn = (n . Show that k ~ ex:: as 111 ~ 00 and 112 t 00.p. (a) Show that the MLEs of I'i and ().I)sz/()..1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" .16.
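Part (d) asks for a program carrying out the Welch test and a Monte Carlo study in the spirit of Figure 5.3.3. A minimal self-contained sketch is below. To avoid a t-table it uses the normal critical value 1.96 in place of the Welch critical value t_k(1 − α/2), so the Welch rejection rate is only approximately the nominal 0.05; the sample sizes, variances, and seed are arbitrary choices, picked so that the larger variance sits in the smaller sample:

```python
import math, random

def two_sample_stats(n1, n2, s1, s2):
    # one simulated data set under H; returns (pooled-t, Welch) statistics
    xs = [random.gauss(0, s1) for _ in range(n1)]
    ys = [random.gauss(0, s2) for _ in range(n2)]
    mx, my = sum(xs) / n1, sum(ys) / n2
    vx = sum((x - mx) ** 2 for x in xs) / (n1 - 1)
    vy = sum((y - my) ** 2 for y in ys) / (n2 - 1)
    pooled = ((n1 - 1) * vx + (n2 - 1) * vy) / (n1 + n2 - 2)
    t_pool = (mx - my) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    t_welch = (mx - my) / math.sqrt(vx / n1 + vy / n2)
    return t_pool, t_welch

random.seed(4)
reps, crit = 4000, 1.96            # normal cutoff, nominal level 0.05
n1, n2, s1, s2 = 10, 40, 3.0, 1.0  # large variance in the SMALL sample
rej_pool = rej_welch = 0
for _ in range(reps):
    tp, tw = two_sample_stats(n1, n2, s1, s2)
    rej_pool += abs(tp) > crit
    rej_welch += abs(tw) > crit
print("pooled-t level:", rej_pool / reps, " Welch level:", rej_welch / reps)
```

With these choices the pooled two-sample t test's actual level is grossly inflated above 0.05, while the Welch statistic stays much closer to nominal — the phenomenon the Monte Carlo study is meant to exhibit.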
. 1/4/'(0)).L~. .f.p(X. I .. where P is the empirical distribution of Xl. Use the bounded convergence lbeorem applied to . Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) .p(X. (c) Assume the conditions in (a) and (b). . .. (a) Show that (i). F(x) of X. . 1/).)). .....0). . aliI/ > O(P) is finite. and (iii) imply that O(P) defined (not uniquely) by Ep. (e) Suppose lbat the d. [N(O))' . then < t) = P(O < On) = P (.n(X .0) !:.. (c) Give a consistent estimate of (]'2. (b) Show that if k is fixed and p ~ 00. N(O.0) and 7'(0) ..) Hint: Show that Ep1/J(X1  8) is nonincreasing in B.0) ~ N Hint: P( .p(x) (b) Suppose lbat for all PEP. 00 (iii) I!/!I(x) < M < for all x. ... n 1 F' (x) exists.. . Let On = B(P).Xn be i. Let i !' X denote the sample median.. . \ . . . (ii). (Use . N(O) = Cov(..n(On .. Ep. • 1 = sgn(x).p(Xi ..O(P)) > 0 > Ep!/!(X. Show lbat if I(x) ~ (d) Assume part (c) and A6. ! Problems for Section 5.0) !. 1 X n. That is the MLE jl' is not consistent (Neyman and Scott. under the conditions of (c). .(0) Varp!/!(X.p(X.nCOn .\(On)] !:.0). Assmne lbat '\'(0) < 0 exists and lbat .. ..4 1. is continuous and lbat 1(0) = F'(O) exists. .356 Asymptotic Approximations Chapter 5 .0))  7'(0)) 0. 1948)..n7(0) ~[.. 1) for every sequence {On} wilb On 1 n = 0 + t/..On) ..p( 00) as 0 ~ 00.. Show that _ £ ( . Let Xl. ..p(X. N(O. . _ . I(X. Set (.. I ! 1 i .i. O(P) is lbe unique solution of Ep. .. . I i . Show that.On) < 0).p(X.1)cr'/k. Show that On is consistent for B(P) over P.. then 8' !. random variables distributed according to PEP.p(X.n for t E R.. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique.. . . I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .0) = O. (k .d.
( 2 ). and O WIth .b)p(x. then 'ce(n(B .Section 5.9)dl"(x) = J te(1J.(x).6 Problems and Complements 357 (0 For two estImates 0. . not only does asymptotic nonnality not hold but 8 converges B faster than at rate n. U(O. Hint: Apply A4 and the dominated convergence theorem B. Show that A6' implies A6.(x. X n be i.. 'T = 4.c)'P. X) = 1r/2..5) where f.Xn ) is the MLE.(x.O)p(x. 82 ) Pis N(I".O).B) < xl = 1.a)p(x. ~"'. 0 > 0.b)dl"(x) J 1J.  = a?/o. 0 «< 0.7.05. 3.i.10. .. X) as defined in (f).    .(x .(x) + <'P. 0) = ."' c . Find the efficiency ep(X. " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 . lJ :o(1J. (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). Hint: teJ1J.d.y.I / 2 . 1979) which you may assume.. .'" .. the asymptotic 2 . Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.(1:e )n ~ 1.(x. Thus. Let XI. not O! Hint: Peln(B .(x. x E R. 4.B)) ~ [(I/O).' . Show that assumption A4' of this section coupled with AGA3 implies assumption A4. 0» density.~ for 0 > x and is undefined for 0 g~(X. denotes the N(O. If (] = I.20 and note that X is more efficient than X for these gross error cases. evaluate the efficiency for € = . .(x. then ep(X. Show that   ep(X. Conclude that (b) Show that if 0 = max(XI. (a) Show that g~ (x. Show that if (g) Suppose Xl has the gross error density f. This is compatible with I(B) = 00.a)dl"(x). 2.jii(Oj .5  and '1'.2..4.O)dl"(x)) = J 1J.(x) = (I .O)p(x.O))dl"(x) ifforalloo < a < b< 00..0.0) ~ N(O.0)p(x. J = 1.X) =0.0) (see Section 3..O) is defined with Pe probability I but < x. oj).15 and 0.0.exp(x/O).
..Oo) p(Xi .358 5.00) . O. A2. 80 +.·_t1og vn n j = p( x"'o+.in) ~ ."log .22).5 continue to hold if . j. under F oo ' and conclude that . 0 ~N ~ (. . 0 for any sequence {en} by using (b) and Polyii's theorem (A.4.i.) P"o (X .in) 1 is replaced by the likelihood ratio statistic 7. 1(80 ).O) 1(0) and Hint: () _ g.In ~ "( n & &8 10gp(Xi. ~log (b) Show that n p(Xi . 8)p(x. .in. ~ L.4) to g.. I 1 . Show that 0 ~ 1(0) is continuous. I I j 6.4. I .In) + .g. 0).. ) ! .14.. Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".i(X.in) . Apply the dominated convergence theorem (B. ) n en] .0'1 < En}'!' 0 .< ~log n p(X. I (c) Prove (5. Show that the conclusions of PrQple~ 5.p ~ 1(0) < 00. . (8) Show that in Theorem 5.i'''i~l p(Xi.(X. .18 . 8 ) 0 ."('1(8) .80 + p(X . i. i I £"+7.i (x.. Suppose A4'.7.5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.~ (X.[ L:.O') . 8 ) i .80 + p(Xi . I .. 8) is continuous and I if En sup {:. and A6 hold for.50). &l/&O so that E.80 p(X" 80) +.4.
61).2.59).4. Hint: Use Problem 5.Q" is asymptotically at least as good as B if.59). Consider Example 4. Hint.4.56). We say that B >0 nI Show that 8~I and.6 and Slutsky's theorem. B' _n + op (n 1/') 13.54) and (5. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).59) coincide. Let B be two asymptotic l.14.4.B lower confidence bounds.4.5.4. the bound (5.4. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 . setting a lower confidence bound for binomial fJ. 10. = X.3).2. B' nj for j = 1.4.3.58).2. (a) Show that. which agrees with (4.4.4. Then (5. (Ii) Suppose the conditions of Theorem 5.:vell . gives B Compare with the exact bound of Example 4. Hint: Use Problems 5.4. ~ 1 L i=I n 8'[ ~ 8B' (Xi.4.6. 9. all the 8~i are at least as good as any competitors. at all e and (A4').6 Problems and Complements 359 8.57) can be strengthened to: For each B E e. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5. 11. Let [~11. f is a is an asymptotic lower confidence bound for fJ.61) for X ~ 0 and L 12. hence.5 hold. (a) Establish (5. ~.4.4. Let B~ be as in (5.7.4.4. Establish (5.4.4.5 and 5. (a) Show that under assumptions (AO)(A6) for all Band (A4'). A. which is just (4.a. . (Ii) Deduce that g.Section 5. Compare Theorem 4. (Ii) Compare the bounds in Ca) with the bound (5. for all f n2 nI n2 .7). and give the behavior of (5. B). if X = 0 or 1.
riX) = T(l + nr 2)1/2rp (I+~~.\ l . 0 given Xl. N(I'.2.d. = O.11). (a) Show that the test that rejects H for large values of v'n(X .01 .•. .I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). That is.=2[1  <1>( v'nIXI)] has a I U(O. ! I • = '\0 versus K : .. (b) Suppose that I' ). X > 0.. X n be i. I . the pvalue fJ .5.4. Consider the Bayes test when J1. By Problem 4. 7f(1' 7" 0) = 1 . 1.i. variables with mean zero and variance 1(0). each Xi has density ( 27rx3 A ) 1/2 {AX + J.A 7" 0.'" XI! are ij. . is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. 1 .I). (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation.2. 00. > ~.. (c) Suppose that I' = 8 > O.LJ.. N(Jt..L and A.1 and 3.1. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : ..d. (e) Find the Rao score test for testing H : .6.)l/2 Hint: Use Examples 3. Show that (J/(J l'. . is a given number. phas a U(O. 1. 1) distribution. where . Let X 1> .. (d) Find the Wald test for testing H : A = AO versus K : A < AO.. ! (a) Show that the posterior probability of (OJ is where m n (.d.] versus K .t = 0 versus K : J. the evidence I . Consider the problem of testing H : I' E [0.1.. T') distribution..10) and (3.\ > o.2. 2. 1). Suppose that Xl. 1) distribution..\ < Ao. Show that (J l'.4.LJ. Hint: By (3. the test statistic is a sum of i.4. . " " I I . . I' has aN(O. Consider testing H : j.360 1 Asymptotic Approximations Chapter 5 14.) has pvalue p = <1>(v'n(X . 1 15. Problems for Section 5. X n i. if H is false.1. That is. inverse Gaussian with parameters J. LJ. Now use the central limit theorem.i. A exp . J1.\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO.d.55). Jl > OJ .21J2 2x A} .5 ! 1. where Jl is known. Establish (5.i.)) and that when I' = LJ. =f:.
4.5.058 20 . By (5. (e) Verify the following table giving posterior probabilities of 1. for all M < 00.. Hint: By (5.5.)}.S..2. Hint: 1 n [PI ..( Xt.Oo»)+log7f(O+ J. L M M tqn(t)dt ~ 0 a.2.645 and p = 0. By Theorem 5. yn(E(9 [ X) .5. all .s. Suppose that in addition to the conditions of Theorem 5. yn(anX   ~)/ va. Hint: In view of Theorem 5..5.:.l has a N(O. 1) and p ~ L U(O.i Itlqn(t)dt < J:J').029 . i ~. 10 .13) and the SLLN.5.8) ~ 0 a.14).4.) (d) Compute plimn~oo p/pfor fJ. J:J').2 it is equivalent to show that J 02 7f (0)dO < a.s.I).5. Establish (5.046 . ifltl < ApplytheSLLN and 0 ous at 0 = O.2 and the continuity of 7f( 0).17).034 .1 is not in effect. Apply the argnment used for Theorem 5. ~ L N(O. 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .042 ..} dt < .5. ~l when ynX = n 1'>=0.~~(:.sup{ :. ~ E. Extablish (5.05. oyn.01 < 0 } continuThen 00. [0.) sufficiently large. Show that the posterior probability of H is where an = n/(n + I)..1 I'> = 1.Section 5.(Xi.. Fe.O') : 100'1 < o} 5. (e) Show that when Jl ~ ~.i It Iexp {iI(O)'..0 3.' " . 1) prior.054 50 ..I(Xi. (Lindley's "paradox" of Problem 5. P.(Xi.0'): iO' .17).1.for M(.O'(t)).6 Problems and Complements 361 (b) Suppose that J.052 100 .050 logdnqn(t) = ~ {I(O)..
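Lindley's "paradox" in Problem 2 can be computed directly. In the sketch below (my own illustration), the prior puts mass ½ on μ = 0 and, given μ ≠ 0, takes μ ~ N(0, τ²) with τ = 1 — both are assumptions made here for concreteness — and σ = 1 is known. Then √n X̄ ~ N(0, 1) under H and N(0, 1 + n) marginally under the alternative, which gives the posterior odds below. Holding t = √n X̄ fixed at 1.645 (a two-sided p-value near 0.10), the posterior probability of H : μ = 0 tends to 1 as n grows, while the p-value never changes:

```python
import math

def post_prob_null(t, n, prior_null=0.5):
    # P(mu = 0 | sqrt(n)*Xbar = t), with mu ~ N(0,1) given mu != 0 and sigma = 1
    odds = ((1 - prior_null) / prior_null) * (1 + n) ** -0.5 \
           * math.exp(t * t / 2 * n / (n + 1))   # odds in favor of mu != 0
    return 1 / (1 + odds)

t = 1.645   # fixed as n grows; two-sided p-value stays near 0.10
for n in (10, 100, 10000, 10 ** 6):
    print(n, round(post_prob_null(t, n), 3))
```

So the evidence against H as measured by the p-value can be far stronger than the evidence measured by the posterior probability of H, exactly as the problem states.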
(1) This famous result appears in Laplace's work and was rediscovered by S. J I .) and A5(a.16) noting that vlnen. all d and c(d) / in d. . Bernstein and R. Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. Finally. . Show tbat (5. jo I I . A proof was given by Cramer (1946). • Notes for Section 5.Pn where Pn a or 1. dn Finally. .5.7 NOTES Notes for Section 5. Notes for Section 5.1. 7. for all 6.I(Xi}»} 7r(t)dt. qn (t) > c} are monotone increasing in c.S. See Bhattacharya and Ranga Rao (1976) for further discussion. . I: . 5. .I Notes for Section 5. I : ItI < M} O. von Misessee Stigler (1986) and Le Cam and Yang (1990). (a) Show that sup{lqn (t) . . ! ' (2) Computed by Winston Cbow.f.5.s.". 'i roc J.5 " I i .!.5. .2 we replace the assumptions A4(a.s.9) hold with a. 362 f Asymptotic Approximations Chapter 5 > O. Fn(x) is taken to be O. . .s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) .d) for some c(d). (I) If the rightband side is negative for some x.) by A4 and A5. • . ~ II• '.) ~ 0 and I jtl7r(t)dt < co.s. (0» (b) Deduce (5.5. A.3 ).29). ~ 0 a. . .. . Suppose that in Theorem 5.4 (1) This result was first stated by R. convergence replaced by convergence in Po probability. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5. Fisher (1925). L . The sets en (c) {t .1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . (O)<p(tI. I i. " .8) and (5.5. For n large these do not correspond to distributions one typically faces.. i=l }()+O(fJ) I Apply (5. .5.1 .
5.8 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer-Verlag, 1985.

BHATTACHARYA, R. N., AND R. RANGA RAO, Normal Approximation and Asymptotic Expansions New York: J. Wiley & Sons, 1976.

BILLINGSLEY, P., Probability and Measure New York: J. Wiley & Sons, 1979.

BOX, G. E. P., "Non-normality and tests on variances," Biometrika, 40, 318-324 (1953).

CRAMÉR, H., Mathematical Methods of Statistics Princeton, NJ: Princeton University Press, 1946.

DAVID, F. N., Tables of the Correlation Coefficient Cambridge: Cambridge University Press, 1938; reprinted in Biometrika Tables for Statisticians (1966), H. O. Hartley and E. S. Pearson, Editors.

FERGUSON, T. S., A Course in Large Sample Theory New York: Chapman and Hall, 1996.

FISHER, R. A., "Theory of statistical estimation," Proc. Camb. Phil. Soc., 22, 700-725 (1925).

FISHER, R. A., Statistical Inference and Scientific Method, 1958.

HAMMERSLEY, J. M., AND D. C. HANDSCOMB, Monte Carlo Methods London: Methuen & Co., 1964.

HOEFFDING, W., "Probability inequalities for sums of bounded random variables," J. Amer. Statist. Assoc., 58, 13-30 (1963).

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Vth Berkeley Symposium, Vol. I Berkeley, CA: University of California Press, 1967.

LE CAM, L., AND G. YANG, Asymptotics in Statistics: Some Basic Concepts New York: Springer, 1990.

LEHMANN, E. L., Elements of Large-Sample Theory New York: Springer-Verlag, 1999.

LEHMANN, E. L., AND G. CASELLA, Theory of Point Estimation New York: Springer-Verlag, 1998.

NEYMAN, J., AND E. L. SCOTT, "Consistent estimates based on partially consistent observations," Econometrica, 16, 1-32 (1948).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

RUDIN, W., Mathematical Analysis, 3rd ed. New York: McGraw Hill, 1987.

SCHERVISH, M., Theory of Statistics New York: Springer, 1995.

SERFLING, R. J., Approximation Theorems of Mathematical Statistics New York: J. Wiley & Sons, 1980.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge, MA: Harvard University Press, 1986.

WILSON, E. B., AND M. M. HILFERTY, "The distribution of chi square," Proc. Nat. Acad. Sci., U.S.A., 17, 684 (1931).
Chapter 6

INFERENCE IN THE MULTIPARAMETER CASE

Most modern statistical questions involve large data sets, the modeling of whose stochastic structure involves complex models governed by several, often many, real parameters and frequently even more semi- or nonparametric models. In this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates, tests, and confidence regions in regular one-dimensional parametric models for d-dimensional models {P_θ : θ ∈ Θ}, Θ ⊂ R^d. We have presented several such models already, for instance, the multinomial (Examples 1.6.7, 2.3.3) and multiple regression models (Examples 1.1.4, 2.1.1, 2.2.2), and more generally have studied the theory of multiparameter exponential families (Sections 1.6.2, 2.2, 2.3). However, with the exception of Theorems 5.2.2 and 5.3.3, in which we looked at asymptotic theory for the MLE in multiparameter exponential families, we have not considered asymptotic inference, testing, confidence regions, and prediction in such situations.

We begin our study with a thorough analysis of the Gaussian linear model with known variance, in which exact calculations are possible. We shall show how the exact behavior of likelihood procedures in this model corresponds to the limiting behavior of such procedures in the unknown variance case and, more generally, in large samples from regular d-dimensional parametric models, and we shall illustrate our results with a number of important examples.

This chapter is a lead-in to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non- and semiparametric models. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for function-valued statistics, the properties of nonparametric MLEs, curve estimates, the bootstrap, and efficiency in semiparametric models.

There is, however, an important aspect of practical situations that is not touched by the approximation: the fact that d, the number of parameters, and n, the number of observations, are often both large and commensurate or nearly so. The inequalities of Vapnik-Chervonenkis, Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II.

6.1 INFERENCE FOR GAUSSIAN LINEAR MODELS

In this section we will derive exact statistical procedures under the assumptions of the model (6.1.3). These are among the most commonly used statistical techniques. In Section 6.6 we will investigate the sensitivity of these procedures to the assumptions of the model. It turns out that these techniques are sensible and useful outside the narrow framework of model (6.1.3).

6.1.1 The Classical Gaussian Linear Model

Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants zi1, ..., zip. In the classical Gaussian (normal) linear model this dependence takes the form

    Yi = Σ_{j=1}^p zij βj + εi,  i = 1, ..., n                                (6.1.1)

where ε1, ..., εn are i.i.d. N(0, σ²). In vector and matrix notation, we write

    Y = Zβ + ε,  ε ~ N(0, σ²J)                                               (6.1.2)

where Y = (Y1, ..., Yn)^T, Z = (zij)_{n×p}, and J is the n × n identity matrix, or, equivalently,

    Yi = zi^T β + εi,  i = 1, ..., n                                          (6.1.3)

where zi = (zi1, ..., zip)^T. Here Yi is called the response variable, the zij are called the design values, and Z is called the design matrix. We treat the design values zij as fixed (nonrandom).

Notational Convention: In this chapter we will, when there is no ambiguity, let expressions such as β refer to both column and row vectors.

Example 6.1.1. The One-Sample Location Problem. Here is Example 1.1.2(4) in this framework. We have n independent measurements Y1, ..., Yn from a population with mean β1 = E(Y1). The model is

    Yi = β1 + εi,  i = 1, ..., n                                              (6.1.4)

Here p = 1 and Z_{n×1} = (1, ..., 1)^T.
Example 6.1.2. Regression. The regression framework of Examples 1.1.4 and 2.2.1 is also of the form (6.1.3): We consider experiments in which n cases are sampled from a population and, for each case, say the ith case, we have a response Yi and a set of p − 1 covariate measurements denoted by zi2, ..., zip. We are interested in relating the mean of the response to the covariate values. The normal linear regression model is

    Yi = β1 + Σ_{j=2}^p zij βj + εi,  i = 1, ..., n                           (6.1.5)

where β1 is called the regression intercept and β2, ..., βp are called the regression coefficients. We treat the covariate values zij as fixed (nonrandom); the model (6.1.5) is called the fixed design normal linear regression model. The random design Gaussian linear regression model is given in Example 1.4.3. We can think of the fixed design model as a conditional version of the random design model, with the inference developed for the conditional distribution of Y given a set of observed covariate values.

Example 6.1.3. The p-Sample Problem or One-Way Layout. In Example 1.1.3 and Section 4.9.3 we considered experiments involving the comparison of two population means when we had available two independent samples, one from each population. Two-sample models apply when the design values represent a qualitative factor taking on only two values. Frequently, we are interested in qualitative factors taking on several values: if we are comparing pollution levels, we want to do so for a variety of locations; we often have more than two competing drugs to compare; and so on.

To fix ideas suppose we are interested in comparing the performance of p ≥ 2 treatments on a population and that we administer only one treatment to each subject, a sample of nk subjects getting treatment k, 1 ≤ k ≤ p, n1 + ... + np = n. If the treatment responses are independent and normally distributed with the same variance σ², we arrive at the one-way layout or p-sample model

    Ykl = βk + εkl,  l = 1, ..., nk,  k = 1, ..., p                           (6.1.6)

where Ykl is the response of the lth subject in the group obtaining the kth treatment, βk is the mean response to the kth treatment, and the εkl are independent N(0, σ²) random variables. The model (6.1.6) is an example of what is often called analysis of variance models. Generally, this terminology is commonly used when the design values are qualitative.

To see that this is a linear model we relabel the observations as Y1, ..., Yn, where Y1, ..., Yn1 correspond to the group receiving the first treatment, Yn1+1, ..., Yn1+n2 to that getting the second, and so on. Then for 1 ≤ j ≤ p, the design matrix has elements

    zij = 1 if Σ_{k=1}^{j−1} nk + 1 ≤ i ≤ Σ_{k=1}^{j} nk,  zij = 0 otherwise,

that is,

    Z = | 1_{n1}   0      ...   0      |
        |  0      1_{n2}  ...   0      |
        |  .       .            .      |
        |  0       0      ...  1_{np}  |

where 1_j is a column vector of nj ones and the 0 in the "row" whose jth member is 1_j is a column vector of nj zeros.

The model (6.1.6) is often reparametrized by introducing α = p^{−1} Σ_{k=1}^p βk and δk = βk − α, because then δk represents the difference between the kth and the average treatment effects. In terms of the new parameter β* = (α, δ1, ..., δp)^T, the linear model is

    Y = Z*β* + ε,  ε ~ N(0, σ²J)

where Z*_{n×(p+1)} = (1, Z) and 1_{n×1} is the vector with n ones. Note that Z* is of rank p and that β* is not identifiable for β* ∈ R^{p+1}. However, β* is identifiable in the p-dimensional linear subspace {β* ∈ R^{p+1} : Σ_{k=1}^p δk = 0} of R^{p+1} obtained by adding the linear restriction Σ_{k=1}^p δk = 0 forced by the definition of the δk's. This type of linear model, with the number of columns d of the design matrix larger than its rank r, and with the parameters identifiable only once d − r additional linear restrictions have been specified, is common in analysis of variance models.

The Canonical Form of the Gaussian Linear Model

The linear model can be analyzed easily using some geometry. The parameter set for β is R^p and the parameter set for μ is ω = {μ = Zβ : β ∈ R^p}. Note that ω is the linear space spanned by the columns

    cj = (z1j, ..., znj)^T,  j = 1, ..., p,

of the design matrix; that is,

    μ = Zβ = Σ_{j=1}^p βj cj.

Let r denote the number of linearly independent cj; then r is the rank of Z and ω has dimension r. It follows that the parametrization (β, σ²) is identifiable if and only if r = p (Problem 6.1.17). Even if β is not a parameter (is unidentifiable), the vector of means μ of Y always is.

We assume that n > r. Because dim ω = r, there exists (e.g., by the Gram-Schmidt process; see Appendix B) an orthonormal basis v1, ..., vn for R^n such that v1, ..., vr span ω. Recall that orthonormal means vi^T vj = 0 for i ≠ j and vi^T vi = 1. When vi^T vj = 0, we call vi and vj orthogonal. Note that any t ∈ R^n can be written

    t = Σ_{i=1}^n (vi^T t) vi

and that

    t ∈ ω  ⟺  t = Σ_{i=1}^r (vi^T t) vi  ⟺  vi^T t = 0, i = r + 1, ..., n.

We now introduce the canonical variables and means

    Ui = vi^T Y,  ηi = E(Ui) = vi^T μ,  i = 1, ..., n                         (6.1.7)

Note that ηi = vi^T μ = 0 for i = r + 1, ..., n because μ ∈ ω.
then r is the rank of Z and w has dimension r. . r i' . 1 JLn)T I' where the Cj = Z(3 = L j=l P .
• .1.3.. .. while (1]1.1.• .. U. which are sufficient for (IJ.. and then translate them to procedures for . (3.".2<7' L.). Theorem 6. 2 0.. i i = 1.'J nxn . Let A nxn be the orthogonal matrix with rows vi.2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6.8)(6. . In the canonical/orm ofthe Gaussian linear model with 0. ais (v) The MLE of. .1.. .. Then we can write U = AY.. We start by considering the log likelihood £('1.1.Ur istheMLEof1]I. 7J = AIJ. Proof. . 2 ' " 'Ii i=l ~2(T2 6.)T is sufJicientfor'l. Note that Y~AIU. Un are independent nonnal with 0 variance 0.L = 0 for i = r + 1. . .. is Ui = making vTit.8) So... •.10) It will be convenient to obtain our statistical procedures for the canonical variables U. i it and U equivalent. and by Theorem B. .2 Estimation 0. i (iv) = 1. n. . .. 20.' based on Y using (6.1. . = L. n.2 known (i) T = (UI •. ... . it = E.. . IJ.2 and E(Ui ) = vi J. whereas (6... (T2)T. 1]r)T varies/reely over Rr. .11) _ n log(21r"'). .'Ii) . Theorem 6. (6.(Ui . .2.. and . . (iii) U i is the UMVU estimate of1]i. then the MLE of 0: Ci1]i is a = E~ I CiUi. .9) Var(Y) ~ Var(U) = .10).u) 1 ~ 2 n 2 . = 1. Moreover. 0. '" ~ A 1'1.2"log(21r<7 ) t=1 n 1 n _ _ ' " u.2 ) using the parametrization (11.9 N(rli.. The U1 are independent and Ui 11i = 0.1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::. v~.2. n because p. n.. where = r + 1....2 We first consider the known case.Cr are constants.1.1. E w.1. observing U and Y is the same thing.1. also UMVU for CY..Section 6. (6. 1 If Cl. u) based on U £('I. . U1 .•• (ii) U 1 .1]r. ..and 7J are equivalently related.1. 1 ViUi and Mi is UMVU for Mi. which is the guide to asymptotic inference in general.. r.
'  .. ~i = vi'TJ.11). .1. Proof. .. Let Wi = vTu.3 and Example (iii) is clear because 3. To show (ii).. .'7.1. By Problem 3. Assume that at least one c is different from zero. T r + 1 = L~ I and ()r+l = 1/20. (6. . ( 2 )T. 2 2 ()j = 1'0/0. To show (iv).2.1). define the norm It I of a vector tERn by  ! It!' = I:~ .ti· 1 . this statistic is equivalent to T and (i) follows. . i = 1.N(~. By GramSchmidt orthogonalization.2. . (iv) The conclusions a/Theorem 6. 1lr only through L~' 1 CUi . . l Cr . j = 1. apply Theorem 3. .11) is a function of 1}1. 2:7 r+l un Tis sufficientfor (1]1.3. The maximizer is easily seen to be n 1 L~ r+l (Problem 6.. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul.(J2J) by Theorem B.1 L~ r+l U. n " .3..4.2 . . (iii) 8 2 =(n .1.3. {3. .11) is an exponential family with sufficient statistic T.. . .. To this end.. (v) Follows from (iv).10. I ..1. 370 Inference in the Multiparameter Case Chapter 6 I iI .. Q is UMVU. n.O)T E R n ..(v) are still valid. = 1.1. where J is the n x n identity matrix. wecan assume without loss of generality that 2:::'=1 c. and give a geometric interpretation of ii.2 . (i) By observation. i > r+ 1... WI = Q is sufficient for 6 = Ct. .11) has fJi 1 2 = Ui . (i) T = (Ul1" . (6.. . . In the canonical Gaussian linear model with a 2 unknown. "7r because.• EUl U? Ul Projections We next express j1 in terms of Y. (ii) U1.2).1]r.11) by setting T j = Uj ..) (iii) By Theorem 3. Theorem 6.4 and Example 3. obtain the MLE fJ of fl.4. the MLE of q(9) = I:~ I c. and is UMVU for its expectation E(W. The distribution of W is an exponential family...4. 2(J i=r+l ~ U. (ii) The MLE ofa 2 is n.r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2. (We could also apply Theorem 2.. I.4. r. I i I .Ui · If all thee's are zero. That is..) = '7" i = 1. ..4. (iv) By the invariance of the MLE (Section 2. by observation.1. 1 Ur . (U I .6. is q(8) = L~ 1 <. Proof. = 0. 1 U1" are the MLEs of 1}1. I i > .1. By (6. . . and .. 
we need to maximize n (log 27f(J2) 2 II "..6 to the canonical exponential family obtained from (6. . " V n of R n with VI = C = (c" .. recall that the maximum of (6. r. then W . 0. there exists an orthonormal basis VI) . .1... 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. " as a function of 0.2.Ur'L~ 1 Ul)T is sufficient..2 (ii). .1.) = a."7i)2 and is minimized by setting '(Ii = Ui. Ui is UMVU for E(U.
That is.2(iv) and 6. /3. equivalently...1. note that the space w. any linear combination of U's is a UMVU estimate of its expectation. .' and by Theorems 6. _. i=l.ti.p. fj = arg min{Iy . by (6.3 because Ii = l:~ 1 ViUi and Y .Z/3 I' .3(iv).9).1.J) of a point y Yo =argmin{lyW: tEw}.Z/3I' : /3 E W}. fj = (ZTZ)lZTji.3 of.1. .3 E RP. f3j and Jii are UMVU because.1. ~ The maximum likelihood estimate . To show (iv). then /3 is identifiable.1.2..1.1. IY . note that 6. and /3 = (ZTZll ZT 1".. . The projection Yo = 7l"(Y I L. = Z/3 and (6.14) (v) f3j is the UMVU estimate of (3j...3.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6.4.1.L. Proof.1.L of vectors s orthogonal to w can be written as ~ 2:7 w. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2.jil' /(n r) (6.1 and Section 2.1.Section 6. .Ii = .13) (iv) lfp = r. (ii) and (iii) are also clear from Theorem r+l VjUj .2<T' IY . which implies ZT s = 0 for all s E w. To show f3 = (ZTZ)lZTy. It follows that /3T (ZT s) = 0 for all/3 E RP. ZT (Y . We have Theorem 6.n log(21f<T .ji) = 0 and the second equality in (6.L = {s ERn: ST(Z/3) = Oforall/3 E RP}.1. Thus. ZTZ is nonsingular..1. because Z has full rank.n. ) 2 or. <T) = .12) ii.1.. 0 ~ . spans w.' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6. and  Jii is the UMVU estimate of J.. (i) is clear because Z.3 maximizes 1 log p(y. j = 1.... In the Gaussian linear model (i) jl is the unique projection oiY on L<. the MLE = LSE of /3 is unique and given by (6.1... any linear combination ofY's is also a linear combination of U's.14) follows.12) implies ZT. = Z T Z/3 and ZTji = ZTZfj and. /3 = (ZTZ)l ZT.
1. the ith component of the fitted value iJ.)I are the vertical distances from the points to the fitted line. H T 2 =H. € ~ N(o. n lie on the regression line fitted to the data {('" y. .ill 2 = 1 q. That is.. (12(J .({3t + {3.H)). .14) and the normal equations (Z T Z)f3 = Z~Y." As a projection matrix H is necessarily symmetric and idempotent. H I I It follows from this and (B. By Theorem 1. w~ obtain Pi = Zi(3. 1 · . whenp = r. moreover. (j ~ N(f3.3) is to be taken. .' € = Y . = [y.).12) and (6. Suppose we are given a value of the covariate z at which a value Y following the linear model (6.I" (12H).. =H . • ~ ~ Y=HY where i• 1 .16) J I I CoroUary 6. I. Y = il. (6. the residuals €.2 illustrates this tenninology in the context of Example 6.1. 1 < i < n.1.1. n}.1. then (6. Example 2. 1 ~ ~ ! I ~ ~ ~ i . it is commOn to write Yi for 'j1i. ~ ~ ·. Var(€) = (12(J . I .'.1. . . We can now conclude the following. and I.2.1. 1 < i < n.5.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2.3) that if J = J nxn is the identity matrix. L7 1 . In statistics it is also called the hat matrix because it "puts the hat on Y. .H). i .1.. i = 1.1. i = 1. (iv) ifp = r. . In this method of "prediction" of Yi. The residuals are the projection of Y on the orthocomplement of wand ·i . In the Gaussian linear model (i) the fitted values Y = iJ. The estimate fi = Z{3 of It is called thefitted value and € = y . Note that by (6. see also Section RIO.14). the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. • .ii is called the residual from this fit.. The goodness of the fit is measured by the residual sum of squares (RSS) IY .15) Next note that the residuals can be written as ~ I .2 with P = 2. ..H)Y. (ii) (iii) y ~ N(J.1.1 we give an alternative derivation of (6.4. There the points Pi = {31 + fhzi. Taking z = Zi. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w.Y = (J . 
(12 (ZTz) 1 ).. and the residual € are independent.1.
.. (n .p. then replacement of a subscript by a dot indicates that we are considering the average over that subscript.. k=I. o .ii. k = 1.. Regression (continued).3). 'E) is a linear transformation of U and..8 and (3. In the Gaussian case. Thus.1 and Section 2.Ill'. The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY .Y are given in Corollary 6.. which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1.8 and that {3j and f1i are UMVU for {3j and J1.8) = (J2(Z T Z)1 follows from (B. 0 Example 6. .1). . k = 1.p. and € are nonnally distributed with Y and € independent. .a of the kth treatment is ~ Pk = Yk.1. joint Gaussian. and€"= Y .. In this example the nonnal equations (Z T Z)(3 = ZY become n.2)..i respectively. . is th~ UMVU estimate of the average effect of all a 1 p = Yk .8. .1.8. hence.. Y. One Sample (continued).3. If the design matrix Z has rank P. (not Y. } is a multipleindexed sequence of numbers or variables.81 and i1 = {31 Y.1..2.Section 6... in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k . a = {3.j = 1.3. i = 1. . the treatments. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem.. The OneWay Layout (continued).1 ~ Inference for Gaussian Linear Models 373 Proof (Y.5.1.1. in the Gaussian model. + n p and we can write the least squares estimates as {3k=Yk .. . .1. By Theorem 6.no The variances of . then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2.1.ii = Y. nk(3k = LYkl. where n = nl + . . 0 ~ We now return to our examples..p. 0 ~ ~ ~ ~ ~ ~ ~ Example 6. . Moreover. If {Cijk _. Var(.1.2. 1=1 At this point we introduce an important notational convention in statistics.1. Example 6.p..3. Here Ji = .4. We now see that the MLE of It is il = Z.
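The projection facts above (the normal equations and the orthogonality of the residual to ω) can be sketched numerically. The tiny p = 2 design below is made up for illustration and solves (ZᵀZ)β = ZᵀY directly via the explicit 2 × 2 inverse:

```python
# Least squares sketch: beta_hat = (Z^T Z)^{-1} Z^T Y for a toy design.
# The data are illustrative only, not taken from the text's examples.
Z = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]   # columns: intercept, covariate
Y = [1.0, 2.0, 4.0]

# Entries of Z^T Z and Z^T Y.
a = sum(z[0] * z[0] for z in Z)
b = sum(z[0] * z[1] for z in Z)
c = sum(z[1] * z[1] for z in Z)
t0 = sum(z[0] * y for z, y in zip(Z, Y))
t1 = sum(z[1] * y for z, y in zip(Z, Y))

det = a * c - b * b                        # nonsingular: Z has full rank
beta = ((c * t0 - b * t1) / det, (a * t1 - b * t0) / det)

# Fitted values = projection of Y on the column space of Z; residuals.
fitted = [beta[0] * z[0] + beta[1] * z[1] for z in Z]
resid = [y - m for y, m in zip(Y, fitted)]

# The residual vector is orthogonal to every column of Z.
print(beta)                                      # (0.8333..., 1.5)
print(sum(r * z[1] for r, z in zip(resid, Z)))   # ~0
```

With both inner products against the columns of Z returning 0 (up to rounding), the fitted vector is indeed the projection π(Y | ω).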
Remark 6.1.1. An alternative approach to the MLEs for the normal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors ε₁, …, εₙ in (6.1.1) have the Laplace distribution. The estimates of β and μ are then least absolute deviation estimates (LADEs), obtained by minimizing the absolute deviation distance Σ_{i=1}ⁿ |Y_i − z_iᵀβ|. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEs; see Stigler (1986). The LSEs are preferred because of ease of computation and their geometric properties. However, the LADEs are obtained fairly quickly by modern computing methods; see Koenker and D'Orey (1987) and Portnoy and Koenker (1997). For more on LADEs, see Problems 1.4.7 and 2.2.31.

6.1.3 Tests and Confidence Intervals

The most important hypothesis-testing questions in the context of a linear model correspond to restriction of the vector of means μ to a linear subspace of the space ω. For instance, in a study to investigate whether a drug affects the mean of a response such as blood pressure, we may consider, in the context of Example 6.1.2, a regression equation of the form

  mean response = β₁ + β₂z_{i2} + β₃z_{i3},   (6.1.17)

where z_{i2} is the dose level of the drug given the ith patient, z_{i3} is the age of the ith patient, and the matrix ‖z_{ij}‖(n×3) with z_{i1} = 1 has rank 3. Now we would test H : β₂ = 0 versus K : β₂ ≠ 0. Under H, the mean vector is an element of the space {μ : μ_i = β₁ + β₃z_{i3}, i = 1, …, n}, which is a two-dimensional linear subspace of the full model's three-dimensional linear subspace of Rⁿ given by (6.1.17).

Next consider the p-sample model of Example 6.1.3 with β_k representing the mean response for the kth population. The first inferential question is typically "Are the means equal or not?" Thus we test H : β₁ = … = β_p = β for some β ∈ R versus K : "the β's are not all equal." Under H, μ ranges over {μ : μ_i = β ∈ R, i = 1, …, n}, which is a one-dimensional subspace of Rⁿ, whereas for the full model μ is in a p-dimensional subspace of Rⁿ.

In general, we let ω correspond to the full model with dimension r and let ω₀ be a q-dimensional linear subspace over which μ can range under the null hypothesis H, q < r. We first consider the σ²-known case and the likelihood ratio statistic

  λ(Y) = sup{p(Y, μ) : μ ∈ ω} / sup{p(Y, μ) : μ ∈ ω₀}

for testing H : μ ∈ ω₀ versus K : μ ∈ ω − ω₀. Because

  λ(Y) = exp{(1/2σ²)[|Y − μ̂₀|² − |Y − μ̂|²]},   (6.1.18)

where μ̂ and μ̂₀ are the projections of Y on ω and ω₀, respectively, we can evaluate λ(Y) in canonical form. Let A(n×n) be an orthogonal matrix with rows v₁ᵀ, …, vₙᵀ such that v₁, …, v_q span ω₀ and v₁, …, v_r span ω, and set

  U = AY, η = Aμ.   (6.1.19)

Then

  λ(Y) = exp{(1/2σ²) Σ_{i=q+1}ʳ U_i²}.   (6.1.20)

It follows that 2 log λ(Y) = Σ_{i=q+1}ʳ (U_i/σ)². Note that U_i/σ has a N(θ_i, 1) distribution with θ_i = η_i/σ. The distribution of Σ_{i=q+1}ʳ (U_i/σ)² is called a noncentral chi-square distribution with r − q degrees of freedom and noncentrality parameter θ² = |θ|² = Σ_{i=q+1}ʳ θ_i², where θ = (θ_{q+1}, …, θ_r)ᵀ (see Problem B.3.12). We write χ²_{r−q}(θ²) for this distribution. We have shown the following.

Proposition 6.1.1. In the Gaussian linear model with σ² known, 2 log λ(Y) has a χ²_{r−q}(θ²) distribution with

  θ² = σ⁻² Σ_{i=q+1}ʳ η_i² = σ⁻² |μ − μ₀|²,   (6.1.21)

where μ₀ is the projection of μ on ω₀. In particular, when H holds, 2 log λ(Y) ~ χ²_{r−q}.

Proof. We only need to establish the second equality in (6.1.21). With η = Aμ as in (6.1.19), Σ_{i=q+1}ʳ η_i² = |μ − μ₀|². □

Next consider the case in which σ² is unknown. In this case the likelihood ratio statistic for H : μ ∈ ω₀ versus K : μ ∈ ω − ω₀ is

  λ(Y) = sup{p(Y, μ, σ²) : μ ∈ ω} / sup{p(Y, μ, σ²) : μ ∈ ω₀}.

We have seen in Proposition 6.1.1 and Theorem 6.1.4 that the MLEs of σ² for μ ∈ ω and μ ∈ ω₀ are σ̂² = n⁻¹|Y − μ̂|² and σ̂₀² = n⁻¹|Y − μ̂₀|², respectively. Substituting μ̂, μ̂₀, σ̂², and σ̂₀² into the likelihood ratio statistic, we obtain λ(Y) = {|Y − μ̂₀|²/|Y − μ̂|²}^{n/2}. For the purpose of finding critical values it is more convenient to work with a statistic equivalent to λ(Y),

  T = [(r − q)⁻¹(|Y − μ̂₀|² − |Y − μ̂|²)] / [(n − r)⁻¹|Y − μ̂|²] = [(n − r)/(r − q)]{[λ(Y)]^{2/n} − 1}.   (6.1.22)

T is an increasing function of λ(Y), and the two test statistics are equivalent. T is called the F statistic for the general linear hypothesis. The resulting test is intuitive: it consists of rejecting H when the fit, as measured by the residual sum of squares under the model specified by H, is poor compared to the fit under the general model.

By the canonical representation (6.1.19),

  |Y − μ̂₀|² − |Y − μ̂|² = Σ_{i=q+1}ʳ U_i², |Y − μ̂|² = Σ_{i=r+1}ⁿ U_i².   (6.1.23)

We know from Proposition 6.1.1 that σ⁻²(|Y − μ̂₀|² − |Y − μ̂|²) has a χ²_{r−q}(θ²) distribution with θ² = σ⁻²|μ − μ₀|², whereas σ⁻²|Y − μ̂|² = Σ_{i=r+1}ⁿ (U_i/σ)² has a χ²_{n−r} distribution and is independent of it. Thus, T has the representation

  T = [(noncentral χ²_{r−q} variable)/df] / [(central χ²_{n−r} variable)/df]

with the numerator and denominator independent. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter θ² and r − q and n − r degrees of freedom (see Problem B.3.14). We write F_{k,m}(θ²) for this distribution, where k = r − q and m = n − r. We have shown the following.

Proposition 6.1.2. In the Gaussian linear model the F statistic defined by (6.1.22) has the noncentral F distribution F_{r−q,n−r}(θ²), where θ² = σ⁻²|μ − μ₀|². In particular, when H holds, T has the (central) F_{r−q,n−r} distribution.

Remark 6.1.2. In Proposition 6.1.1 suppose the assumption "σ² is known" is replaced by "σ² is the same under H and K and estimated by the MLE σ̂² for μ ∈ ω." If we introduce the variance-equal likelihood ratio statistic

  λ̃(Y) = max{p(Y, μ, σ̂²) : μ ∈ ω} / max{p(Y, μ, σ̂²) : μ ∈ ω₀},   (6.1.24)

it can be shown (Problem 6.1.5) that λ̃(Y) equals the likelihood ratio statistic for the σ²-known case with σ² replaced by σ̂², so that T is equivalent to λ̃(Y) as well.

Remark 6.1.3. The canonical representation (6.1.19) made it possible to recognize the identity

  |Y − μ̂₀|² = |Y − μ̂|² + |μ̂ − μ̂₀|²,   (6.1.25)

which we exploited in the preceding derivations. This is the Pythagorean identity; see Section B.10.

[Figure 6.1.1. The projections μ̂ and μ̂₀ of Y on ω and ω₀, respectively, and the Pythagorean identity.]

We next return to our examples.

Example 6.1.1. One Sample (continued). We test H : β₁ = μ₀ versus K : β₁ ≠ μ₀. In this case ω₀ = {μ₀1}, q = 0, r = 1, and

  T = (n − 1) n(Ȳ − μ₀)² / Σ_{i=1}ⁿ (Y_i − Ȳ)² = n(Ȳ − μ₀)²/s²,

which we recognize as t², where t is the one-sample Student t statistic of Section 4.9.2. □

Example 6.1.2. Regression (continued). We consider the possibility that a subset of p − q covariates does not affect the mean response. Without loss of generality we ask whether the last p − q covariates in multiple regression have an effect after fitting the first q. To formulate this question, we partition the design matrix Z by writing it as Z = (Z₁, Z₂), where Z₁ is n × q and Z₂ is n × (p − q), and we partition β as βᵀ = (β₁ᵀ, β₂ᵀ), where β₂ is a (p − q) × 1 vector of main (e.g., treatment) effect coefficients and β₁ is a q × 1 vector of "nuisance" (e.g., age, economic status) coefficients. Now the linear model can be written as

  Y = Z₁β₁ + Z₂β₂ + ε.   (6.1.26)

We test H : β₂ = 0 versus K : β₂ ≠ 0. In this case β̂ = (ZᵀZ)⁻¹ZᵀY and β̂₀ = (Z₁ᵀZ₁)⁻¹Z₁ᵀY are the MLEs under the full model (6.1.26) and H, respectively. Using (6.1.22) we can write the F statistic version of the likelihood ratio test in the intuitive form

  F = [(RSS_H − RSS_F)/(df_H − df_F)] / [RSS_F/df_F],

where RSS_F = |Y − μ̂|² and RSS_H = |Y − μ̂₀|² are the residual sums of squares under the full model and H, respectively, and df_F = n − p and df_H = n − q are the corresponding degrees of freedom. The F test rejects H if F is large when compared to the (1 − α)th quantile of the F_{p−q,n−p} distribution. Under the alternative, F has a noncentral F_{p−q,n−p}(θ²) distribution with noncentrality parameter (Problem 6.1.7)

  θ² = σ⁻² β₂ᵀ[Z₂ᵀZ₂ − Z₂ᵀZ₁(Z₁ᵀZ₁)⁻¹Z₁ᵀZ₂]β₂.   (6.1.27)

In the special case that Z₁ᵀZ₂ = 0, so that the variables in Z₁ are orthogonal to the variables in Z₂, θ² simplifies to σ⁻²β₂ᵀ(Z₂ᵀZ₂)β₂, which only depends on the second set of variables and coefficients. However, in general θ² depends on the sample correlations between the variables in Z₁ and those in Z₂. This issue is discussed further in Example 6.2.1. □

Example 6.1.3. The One-Way Layout (continued). We test H : β₁ = … = β_p. Under H all the observations have the same mean, so that μ̂₀ = Ȳ··1 and

  |μ̂ − μ̂₀|² = Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Ȳ_k· − Ȳ··)² = Σ_{k=1}ᵖ n_k(Ȳ_k· − Ȳ··)².

Substituting in (6.1.22) we obtain the F statistic for the hypothesis H in the one-way layout:

  T = [(n − p)/(p − 1)] Σ_{k=1}ᵖ n_k(Ȳ_k· − Ȳ··)² / Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Y_{kl} − Ȳ_k·)².

When H holds, T has a F_{p−1,n−p} distribution. If the β_k are not all equal, T has a noncentral F_{p−1,n−p} distribution with noncentrality parameter

  θ² = σ⁻² Σ_{k=1}ᵖ n_k(β_k − β̄)²,   (6.1.28)

where β̄ = n⁻¹ Σ_{k=1}ᵖ n_k β_k. To derive θ², compute σ⁻²|μ − μ₀|² for the vector μ = (β₁, …, β₁, …, β_p, …, β_p)ᵀ and its projection μ₀ = (β̄, …, β̄)ᵀ.

The sum of squares in the numerator, SS_B = Σ_{k=1}ᵖ n_k(Ȳ_k· − Ȳ··)², is a measure of variation between the p samples Y₁₁, …, Y_{1n₁}; …; Y_{p1}, …, Y_{pnₚ}. The sum of squares in the denominator, SS_W = Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Y_{kl} − Ȳ_k·)², measures variation within the samples. SS_B and SS_W are called the between groups (or treatment) sum of squares and the within groups (or residual) sum of squares, respectively. If we define the total sum of squares as SS_T = Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Y_{kl} − Ȳ··)², which measures the variability of the pooled samples, then by the Pythagorean identity (6.1.25)

  SS_T = SS_B + SS_W.   (6.1.29)

Thus, we have a decomposition of the variability of the whole set of data, SS_T, into two constituent components, SS_B and SS_W; the F statistic, up to normalization by degrees of freedom, is their ratio. This information, as well as SS_B/(p − 1) and SS_W/(n − p), is often summarized in what is known as an analysis of variance (ANOVA) table. See Tables 6.1.1 and 6.1.3.

The decomposition (6.1.29) can also be viewed stochastically. Because SS_B/σ² and SS_W/σ² are independent χ² variables with (p − 1) and (n − p) degrees of freedom, respectively, SS_T/σ² is a (noncentral) χ² variable with (n − 1) degrees of freedom and noncentrality parameter θ²; we can think of (p − 1) of the (n − 1) degrees of freedom of SS_T/σ² as "coming" from SS_B/σ² and the remaining (n − p) as coming from SS_W/σ².

As an illustration, consider the following data(1) giving blood cholesterol levels of men in three different socioeconomic groups labeled I, II, and III, with I being the "high" end. We assume the one-way layout is valid. Note that this implies the possibly unrealistic
TABLE 6.1.1. ANOVA table for the one-way layout

  Source           Sum of squares                                     d.f.    Mean squares           F-value
  Between samples  SS_B = Σ_{k=1}ᵖ n_k(Ȳ_k· − Ȳ··)²                   p − 1   MS_B = SS_B/(p − 1)    MS_B/MS_W
  Within samples   SS_W = Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Y_{kl} − Ȳ_k·)²     n − p   MS_W = SS_W/(n − p)
  Total            SS_T = Σ_{k=1}ᵖ Σ_{l=1}^{n_k} (Y_{kl} − Ȳ··)²      n − 1
TABLE 6.1.2. Blood cholesterol levels

  I    403  311  269  336  259
  II   312  222  302  420  420  386  353  210  286  290
  III  403  244  353  235  319  260
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here p = 3, n₁ = 5, n₂ = 10, n₃ = 6, n = 21, and we compute
TABLE 6.1.3. ANOVA table for the cholesterol data

  Source          SS         d.f.   MS       F-value
  Between groups   1,202.5    2      601.2   0.126
  Within groups   85,750.5   18    4,763.9
  Total           86,953.0   20
From F tables, we find that the p-value corresponding to the F-value 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. □

Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares SS_T into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis of variance. They can be formulated in any linear model including regression models. See Scheffé (1959, pp. 42–45) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of information we call ANOVA tables.
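The ANOVA computations above can be sketched in a few lines of standard-library Python. The function below is a plain restatement of the SS_B, SS_W, and F formulas of Table 6.1.1; the cholesterol data are transcribed from Table 6.1.2, where the assignment of the two column-wrapped values to group II is an assumption from the scan, so small differences from Table 6.1.3 are possible:

```python
# One-way ANOVA sketch: between/within sums of squares and the F statistic.
def one_way_anova(groups):
    p = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_b = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_w = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    ms_b = ss_b / (p - 1)            # between-samples mean square
    ms_w = ss_w / (n - p)            # within-samples mean square
    return ss_b, ss_w, ms_b / ms_w   # SS_B, SS_W, F

# Cholesterol data as transcribed from Table 6.1.2 (grouping partly assumed).
chol = [[403, 311, 269, 336, 259],
        [312, 222, 302, 420, 420, 386, 353, 210, 286, 290],
        [403, 244, 353, 235, 319, 260]]
ss_b, ss_w, f = one_way_anova(chol)
print(round(ss_b, 1), round(ss_w, 1), round(f, 3))
```

By the Pythagorean identity (6.1.29), SS_B + SS_W reproduces the total sum of squares, which is a useful internal check on any such computation.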
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
}" = (3j ± t n p (','" s{ [ (ZT Z) 11 j j ' 1 )
382
Inference in the Multiparameter Case
Chapter 6
where [(ZTZ)~l]j) is the jthdiagonal element of (ZTZ)~ '. Computersoftware computes (ZTZ)~I and labels S{[(ZTZ)~lI]j); as the standard errOr of the (estimated) jth regression coefficient. Next consider ljJ = j.ti = mean response for the ith case, 1 < i < n. The level (1  0:) confidence interval is
J.Li =
Jii ± t n _ p (1 
!a) sJh:
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Yi = 131 + 132Zi2 + Eil i = 1 ", n.
"
If we use the identity
n
~)Zi2  Z2)(l'i  Y)
i=1
~)Zi2  Z.2)l'i,
We obtain from Example 2.2.2 that
ih ~
Because Var(Yi) = a 2 , we obtain
~ Var(.6,)
L~l (Zi2  z.,)l'i
L~ 1 (Zi2  Z.2)2 .
(6.1.30)
= (J I L)Zi' i=l
''''
n
Z.2) ,
,
and the 100(1  a)% confidence interval for .6, has the form
732 ± t n p (I  ~a) sl J2:(Zi2  Z.2)'. The confidence interval for 131 is given in Problem 6.1.10. .62
=
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute
i.
h ii
I ,
•
;
_ 1

(Zi2  z.,)2
",n
+ n
L....i=l Zi2 
(
Z·2
)'
,
0
•
I
and the confidence interval for th~ mean response J1.i of the ith case has a simple explicit
fonn.
,
1
< p. Because
•
,
I
= 7J. ± t n  p (1 ~a) slj'nj; where 8' = SSwl(np). The intervals for I' = .6. and the incremental effect Ok =.6k1'
.6k
are given in Problem 6.1.11. 0
j
j
"I
I
i
,
I
I
Do
l
Section 6.2
Asymptotic Estimation Theory in p Dimensions
383
Joint Confidence Regions
We have seen how to find confidence intervals for each individual (3j, 1 < j < p. We next consider the problem of finding a confidence region C in RP that covers the vector /3 with prescribed probability (1  0:). This can be done by inverting the likelihood ratio test or equivalently the F test That is, we let C be the collection of f3 0 that is accepted when the level (1  a) F test is used to test H : (3 ~ (30' Under H, /' = /'0 = Z(3o; and the numerator of the F statistic (6.1.22) is based on
Iii  /'01'
C "
=
Izi3 
Z(3ol' =
(13  (30)T(Z T Z)(i3 (30)
(30)
Thus, using (6.1.22), the simultaneous confidence region for
f3 is the ellipse
= {(30 .. (13  (30)T(Z 2 Z)(i3 rs
T
< f r,ni"" (1 _ 1 20
'J}
. .T
(6.1.31)
where fr,nr
(1  40:)
is the 1  !o quantile of the :Fr,nr distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write Z = (ZJ, Z2) and /3T = (f3j, {3f), where f32 is a vector of main effect coefficients and f3 1is a vector of "nuisance" coefficients. Similarly, we partition {3 as {3 = ({31 ,(32 ) where (31 is q x 1 and (3, is (p  q) x 1. By Corollary 6.1.1, O"(ZTZ) is the variancecovariance matrix of {3. It follows that if we let 8 denote the lower right (p  q) x (p ~ q) comer of (ZTZ)I, then (728 is the variancecovariance matrix of 132' Thus, a joint 100(1  0:)% confidence region for {32 is the p  q dimensional ellipse
~ ~ ~
.T ",T
C={(3
0' .
. (i3,(302)TSl(i3,_(3o,) <f
() 2
pqs

pq,np
(11
20:·
l}
o
Summary. We consider the classical Gaussian linear model in which the resonse Yi for (3jZij of the ith case in an experiment is expressed as a linear combination J1i = covariates plus an error fi, where Ci, ... 1 f n are i.i.d. N (0, (72). By introducing a suitable orthogonal transfonnation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transfonnation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients {(3j}, the resJXlnse means {J1i}, and linear combinations of these.
LJ=l
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4 in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for onedimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
384'
f ",'c:,c:""""ce=:::;'=''''h':.M=,'"';,,pa::.'::'m'''e::.t::"...:C::':::",..::Cc::h,:,p:::'e::'~6
6.2.1
Estimating Equations
OUf assumptions are as before save that everything is made a vector: X!, ... , X n are i.i.d. Pwhere P E Q, a model containing P = {PO: 0 E e} such that
(i)
e open C RP.
e.
(ii) Densities of P, are pC 0),9 E
1 , I
The following result gives the general asymptotic behavior of the solution of estimating equations.
AO. 'I'
=(,p" ... , ,pp)T where,p) ~ g:., is well defined and
 L >I'(X n.
1=1
1
n
..
i,
On) = O.
(6.2.1)
A solution to (6.2.1) is called an estimating equation estimate or an M estimate.
AI. The parameter 8( P) given by the solution of (the nonlinear system of p equations in p
unknowns):
J
A2. Epl'l'(X" 0(P)1 2
>I'(x, O)dP(x)
=0
(6.2.2)
, , ,
I
is well defined on Q so that O(P) is the unique solution of (6.2.2). Necessarily O(PO) because Q => P.
=0
I , ,
< 00 where I· I is the Euclidean nonn.
I I:
.' , ,
,
I
A3. 'l/Ji{', 8), 1 < i < P. have firstorder partials with respect to all coordinates and using the notation of Section B.8,
I
where
': I
,
~
.,
i
~~
l:iI ~
is nonsingular.
A4. sup
..
{I ~ L~ 1 (D>I'(X"
p
t)  D>I'(Xi , O(P))) I: It  O(P)I < En} ~ 0 if En ~ O.
, .I ,
,
I
I
AS. On ~ O(P) for all P E Q.
Theorem 6.2.1. Under AOA5 ofthis section
8n = O(P) + where
L iii(X n.
t=l
1
n
i,
O(P))
+ op(n 1 / 2 )
(6.2.3)
i..I , ,
• •
iii(x,o(p))
= (EpD>I'(X" 0(P)W 1 >1'(x, O(P)).
(6.2.4)
b
Section 6.2
Asymptotic Estimation Theory in p Dimensions
385
Hence,
(6.2.5)
where
E(q" P) ~ J(O. p)Eq,q,T(X" O(p))F (0, P)
and
81/J~
J
(6.2.6)
r'(o,p)
= EpDq,(X"O(P))
~
E p 011 (X"O(P))
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,
1
 n
2.:= q,(Xi , O(P)) = n 2.:= Dq,(Xi,0~)(9n
i=l
n
1
n
 O(P)).
(6.2.7)
i=l
Note that the lefthand side of (6.2.7) is a p x 1 vector, the right is the product of a p x p matrix and a p x 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p x p matrices, when viewed as vectors, is an open , subset of RP , representable, for instance, as the Set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, ~ l::~ I Dq,(Xi , 6~) is nonsingular. Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of On is motivated by p, the behavior in (6.2.3) is guaranteed for P E Q, which can include P <1c P. In fact, typically Q is essentially the set of P's for which O(P) can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to: A6. If 1(,0) is differentiabLe
EODq,(X"O)

EOq,(X" O)DI(X" 0) CovO(q,(X" 0), DI(X, , 0))
(6.2.8)
defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the onedimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A61 extend to the multivariate case readily. Note that consistency of On is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however. be shown that with probability tending to 1, a rootfinding algorithm starting at a consistent estimate 6~ will find a solution On of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
386
Inference in the Multiparameter Case
Chapter 6
6,2,2
comes
Asymptotic Normality and Efficiency of the MLE
[fwe take p(x,O)
=
l(x,O)
= 10gp(x,0), and >I>(x,O)
obeys AOA6, then (62,8) be.
T
l
EODl (X" O)D l( X 1,0» VarODl(X"O)
(62,9)
where
j 1 ,
, ,
is the Fisher information matrix I(e) introduced in Section 3.4. If p: e _ R, e c R d , is a scalar function, the matrix)! 8~~P(}J (e) is known as the Hessian or curvature matrix of the sutface p. Thus, (6.2.9) stateS that the expected value of the Hessian of l is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If AOA6 holdfor p(x, 0)
,1, , ,
I
j
=10gp(x, 0), then the MLE  satisfies On
i,
(62,10)
(62,11)
On ~ 0+  :Lr'(O)DI(Xi,O) + op(n"/')
n
i=1
1
n
,,
so that
is a minimum contrast estimate with p and 'f/J satisfying AOA6 and corresponding asymptotic variance matrix E('I1, Pe), then
"
"
If en
"
, ,
E(>I>,PO)
On
> r'(O)
(62,12)
in the sense of Theorem 3.4.4 with equality in (6,2,12) for 0
,
= 0 0 iff, undRr 0 0 ,
(6,2,13)
~
On + op(n 1/2 ),
, i ,
•
,
Proof. The proofs of (6,2,10) and (6,2,11) parallel those of (5.4.33) and (5.4,34) exactly, The proof of (6,2,12) parallels that of Theorem 3.4.4, For completeness we give it Note that hy (6,2,6) and (6,2,8)
I
" :I I.
1" ,
where U >I>(Xt,O). V Var(U T , VT)T nonsingular Var(V)
=
E(>I>,PO)
= CovO'(U, V)VarO(U)CovO'(V, U)
~
(62,14)
DI(X1,0),
But hy (B,lO.8), for any U,V with
(6,2.15)
> Cov(U, V)Var1(U)Cov(V, U),
Taking inverses of both sides yields
r1(0)
= Var01(V) < E(>I>,O).
(6.2.16)
• • !
Section 6.2
Asymptotic Estimation Theory in p Dimensions
387
Equality holds in (6.2.15) by (B. 10.2.3) iff for some b U
= b(O)
(6.2.17)
= b + Cov(U, V)Var1(V)V with probability 1. This means in view of Eow = EODl = 0 that
w(X"O) = b(0)Dl(X1,0).
In the case of identity in (6.2.16) we must have
[EODw(X 1, OW'W(X 1 , 0)
=
r1(0)DI(X" 0).
(6.2.18)
Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds.
o
apxl, a T 8 n
~
We see that, by the theorem, the MLE Is efficient in the sense that for any has asymptotic bias o(n1/2) and asymptotic variance nlaT ]1(8)a, which is n.? larger than that of any competing minimum contrast estimate. Further any competitor 8 n such that aTO n has the same asymptotic behavior as a T 8 n for all a in fact agrees with On to ordern 1/2


A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6
on the asymptotic nonnality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example. Example 6.2.1. The Linear Model with Stochastic Covariates. Let Xi = (Zr, Yi)T. 1 < i < n, be ij.d. as X = (ZT, Y) T where Z is a p x 1 vector of explanatory variables and Y is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:
(il
(6.2.19) where, is distributed as N(O, (),2) independent of Z and E(Z) Z, Y has aN (a + ZT [3, (),2) distrihution.
= O.
That is, given
(ii) The distribution Ho of Z is known with density h o and E(ZZT) is nonsingular.
The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of β is given by (with probability 1)

β̂ = [Z₍ₙ₎ᵀZ₍ₙ₎]⁻¹Z₍ₙ₎ᵀY.   (6.2.20)

Here Z₍ₙ₎ is the n × p matrix ‖Zᵢⱼ − Z̄.ⱼ‖ where Z̄.ⱼ = n⁻¹Σᵢ₌₁ⁿZᵢⱼ. We used subscripts (n) to distinguish the use of Z as a vector in this section and as a matrix in Section 6.1. In the present context, Z₍ₙ₎ = (Z₁, …, Zₙ)ᵀ is referred to as the random design matrix, and this example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of α and σ² are

α̂ = Ȳ − Σⱼ₌₁ᵖ Z̄.ⱼβ̂ⱼ,   σ̂² = n⁻¹|Y − (α̂ + Z₍ₙ₎β̂)|².   (6.2.21)
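The closed forms (6.2.20) and (6.2.21) are straightforward to compute. A minimal numpy sketch (the function name and test data are ours, not the book's); note the divisor n rather than n − p in the MLE of σ²:

```python
import numpy as np

def random_design_mles(Z, Y):
    """MLEs (6.2.20)-(6.2.21) for the random-design Gaussian linear model
    Y = alpha + Z'beta + eps, eps ~ N(0, sigma^2). Z is n x p, Y length n."""
    Zc = Z - Z.mean(axis=0)                        # centered design ||Z_ij - Zbar_j||
    beta = np.linalg.solve(Zc.T @ Zc, Zc.T @ Y)    # (6.2.20)
    alpha = Y.mean() - Z.mean(axis=0) @ beta       # (6.2.21), first part
    resid = Y - (alpha + Z @ beta)
    s2 = np.mean(resid ** 2)                       # (6.2.21), divisor n, not n - p
    return alpha, beta, s2
```

Centering the design and then recovering α̂ = Ȳ − Σⱼ Z̄.ⱼβ̂ⱼ gives exactly the same fit as least squares with an explicit intercept column.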
388 Inference in the Multiparameter Case Chapter 6
Note that although given Z₁, …, Zₙ, β̂ is Gaussian, this is not true of the marginal distribution of β̂. It is not hard to show that A0–A6 hold in this case because if H₀ has density h₀ and if θ denotes (α, βᵀ, σ²)ᵀ, then

l(X, θ) = −(1/2σ²)[Y − (α + Zᵀβ)]² − (1/2)(log σ² + log 2π) + log h₀(Z),   (6.2.22)

Dl(X, θ) = (ε/σ², (ε/σ²)Zᵀ, (ε² − σ²)/2σ⁴)ᵀ with ε = Y − (α + Zᵀβ),

and

I(θ) = diag(σ⁻², σ⁻²E(ZZᵀ), 1/(2σ⁴)),   (6.2.23)
so that by Theorem 6.2.2

L(√n(α̂ − α, (β̂ − β)ᵀ, σ̂² − σ²)) → N(0, diag(σ², σ²[E(ZZᵀ)]⁻¹, 2σ⁴)).   (6.2.24)
This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of H₀ known plays no role in the limiting result for α̂, β̂, σ̂². Of course, these will only be the MLEs if H₀ depends only on parameters other than (α, β, σ²). In this case we can estimate E(ZZᵀ) by n⁻¹Σᵢ₌₁ⁿZᵢZᵢᵀ and give approximate confidence intervals for βⱼ, j = 1, …, p. An interesting feature of (6.2.23) is that because I(θ) is a block diagonal matrix so is I⁻¹(θ) and, consequently, β̂ and σ̂² are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given Zᵢ = zᵢ, 1 ≤ i ≤ n, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew σ², the MLE would still be β̂ and its asymptotic variance optimal for this model. If we knew α and β, σ̂² would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, σ̂² would be asymptotically equivalent to the MLE. To summarize: estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model P = {P₍θ,η₎ : θ ∈ Θ, η ∈ E}
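The block-diagonal structure of (6.2.23), and hence of its inverse (the asymptotic covariance in (6.2.24)), can be checked directly. The numerical values of σ² and E(ZZᵀ) below are illustrative only:

```python
import numpy as np

# The information matrix (6.2.23) for theta = (alpha, beta', sigma^2)'
# is block diagonal; so is its inverse, which is why beta-hat and
# sigma^2-hat are asymptotically independent.
s2 = 2.0
EZZ = np.array([[1.0, 0.3], [0.3, 2.0]])       # hypothetical E(ZZ'), p = 2

I = np.zeros((4, 4))
I[0, 0] = 1.0 / s2                             # alpha block
I[1:3, 1:3] = EZZ / s2                         # beta block
I[3, 3] = 1.0 / (2.0 * s2 ** 2)                # sigma^2 block

cov = np.linalg.inv(I)                         # asymptotic covariance of the MLE
print(np.allclose(cov[1:3, 3], 0.0))           # beta / sigma^2 blocks uncoupled
```

Inverting block by block gives exactly the diagonal of (6.2.24): σ² for α̂, σ²[E(ZZᵀ)]⁻¹ for β̂, and 2σ⁴ for σ̂².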
we say we can estimate θ adaptively at η₀ if the asymptotic variance of the MLE θ̂ (or more generally, an efficient estimate of θ) in the pair (θ̂, η̂) is the same as that of θ̂(η₀), the efficient estimate for P_{η₀} = {P₍θ,η₀₎ : θ ∈ Θ}. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular consider estimating β₁ in the presence of α, (β₂, …, βₚ) with
(i) α, β₂, …, βₚ known.

(ii) α, β₂, …, βₚ arbitrary.

In case (i) we take, without loss of generality, α = β₂ = ⋯ = βₚ = 0. Let Zᵢ = (Zᵢ₁, …, Zᵢₚ)ᵀ; then the efficient estimate in case (i) is

β̂₁ = Σᵢ₌₁ⁿZᵢ₁Yᵢ / Σᵢ₌₁ⁿZᵢ₁²   (6.2.25)
Section 6.2 Asymptotic Estimation Theory in p Dimensions 389
with asymptotic variance σ²[EZ₁₁²]⁻¹. On the other hand, β̂₁ is the first coordinate of β̂ given by (6.2.20). Its asymptotic variance is the (1,1) element of σ²[EZZᵀ]⁻¹, which is strictly bigger than σ²[EZ₁₁²]⁻¹ unless [EZZᵀ]⁻¹ is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate β₁ adaptively if β₂, …, βₚ are regarded as nuisance parameters. What is happening can be seen by a representation of [Z₍ₙ₎ᵀZ₍ₙ₎]⁻¹Z₍ₙ₎ᵀY and I⁻¹(θ), where I(θ) = ‖Iᵢⱼ(θ)‖. We claim that

β̂₁ = Σᵢ₌₁ⁿ(Zᵢ₁ − Ẑᵢ₁)Yᵢ / Σᵢ₌₁ⁿ(Zᵢ₁ − Ẑᵢ₁)²,   (6.2.26)

where (Ẑ₁₁, …, Ẑₙ₁)ᵀ is the regression of (Z₁₁, …, Zₙ₁)ᵀ on the linear space spanned by (Z₁ⱼ, …, Zₙⱼ)ᵀ, 2 ≤ j ≤ p. Similarly,

I¹¹(θ) = σ²/E(Z₁₁ − Π(Z₁₁ | Z₂₁, …, Zₚ₁))²,   (6.2.27)

where Π(Z₁₁ | Z₂₁, …, Zₚ₁) is the projection of Z₁₁ on the linear span of Z₂₁, …, Zₚ₁ (Problem 6.2.11). Thus, Π(Z₁₁ | Z₂₁, …, Zₚ₁) = Σⱼ₌₂ᵖaⱼ*Zⱼ₁ where (a₂*, …, aₚ*) minimizes E(Z₁₁ − Σⱼ₌₂ᵖaⱼZⱼ₁)² over (a₂, …, aₚ) ∈ R^{p−1} (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing β₂, …, βₚ when the variables Z₂, …, Zₚ are in any way correlated with Z₁, and the price is measured by

[E(Z₁₁ − Π(Z₁₁ | Z₂₁, …, Zₚ₁))²]⁻¹ / [E(Z₁₁²)]⁻¹ = (1 − E(Π(Z₁₁ | Z₂₁, …, Zₚ₁))²/E(Z₁₁²))⁻¹.   (6.2.28)

In the extreme case of perfect collinearity the price is ∞, as it should be, because β₁ then becomes unidentifiable. Thus, adaptation corresponds to the case where (Z₂, …, Zₚ) have no value in predicting Z₁ linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the Zᵢ, i = 1, …, n, β̂₁ is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if E(Z₁₁ − Π(Z₁₁ | Z₂₁, …, Zₚ₁))² = 0. □
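The price (6.2.28) is easy to compute from any second-moment matrix. In the sketch below the matrix Σ is a hypothetical E(ZZᵀ) with E(Z) = 0, chosen so that the first covariate is correlated with the others; the ratio of the two candidate asymptotic variances is computed two ways, once via matrix inversion and once via the projection residual, and they agree:

```python
import numpy as np

# Hypothetical E(ZZ') for p = 3 centered covariates (illustrative values).
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])

naive = 1.0 / Sigma[0, 0]                 # [E Z_11^2]^{-1}: beta_2..beta_p known
full = np.linalg.inv(Sigma)[0, 0]          # (1,1) of [E(ZZ')]^{-1}: unknown
price = full / naive                       # the ratio in (6.2.28), >= 1

# Same ratio via the projection of Z_1 on (Z_2, Z_3):
a = np.linalg.solve(Sigma[1:, 1:], Sigma[1:, 0])  # best linear predictor coeffs
resid_var = Sigma[0, 0] - Sigma[0, 1:] @ a         # E(Z_11 - Pi(Z_11 | ...))^2
print(round(price, 4), round(Sigma[0, 0] / resid_var, 4))
```

With these numbers the price is about 1.65; it reaches 1 only when the off-diagonal covariances vanish, and diverges as the covariates approach collinearity.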
Example 6.2.2. M-Estimates Generated by Linear Models with General Error Structure. Suppose that the εᵢ in (6.2.19) are i.i.d. but not necessarily Gaussian with density (1/σ)f₀(x/σ), for instance,

f₀(x) = e⁻ˣ/(1 + e⁻ˣ)²,

the logistic density. Such error densities have the often more realistic, heavier tails⁽¹⁾ than the Gaussian density. The estimates β̂₀, σ̂₀ now solve
Σᵢ₌₁ⁿ ψ((Yᵢ − Σₖ₌₁ᵖZᵢₖβ̂ₖ₀)/σ̂₀)Zᵢⱼ = 0, 1 ≤ j ≤ p,

and

Σᵢ₌₁ⁿ χ((Yᵢ − Σₖ₌₁ᵖZᵢₖβ̂ₖ₀)/σ̂₀) = 0,
where ψ = −f₀′/f₀, χ(y) = −(yf₀′(y)/f₀(y) + 1), and β̂₀ = (β̂₁₀, …, β̂ₚ₀)ᵀ. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) log f₀ is strictly concave, i.e., f₀′/f₀ is strictly decreasing;

(ii) (log f₀)″ exists and is bounded.

Then, if further f₀ is symmetric about 0,

I(θ) = σ⁻² diag(c₁E(ZZᵀ), c₂)   (6.2.29)

where

c₁ = ∫(f₀′(x)/f₀(x))²f₀(x)dx,   c₂ = ∫(xf₀′(x)/f₀(x) + 1)²f₀(x)dx.

Thus, β̂₀, σ̂₀ are optimal estimates of β and σ in the sense of Theorem 6.2.2 if f₀ is true. Now suppose f₀ generating the estimates β̂₀ and σ̂₀ is symmetric and satisfies (i) and (ii) but the true error distribution has density f possibly different from f₀. Under suitable conditions we can apply Theorem 6.2.1 with
ψⱼ(z, y, β, σ) = ψ((y − Σₖ₌₁ᵖzₖβₖ)/σ)zⱼ, 1 ≤ j ≤ p,
ψₚ₊₁(z, y, β, σ) = χ((y − Σₖ₌₁ᵖzₖβₖ)/σ),   (6.2.30)

to conclude that

L(√n(β̂₀ − β₀)) → N(0, Σ(ψ, P)),   L(√n(σ̂₀ − σ₀)) → N(0, σ²(ψ, P)),

where β₀, σ₀ solve

∫ψ((y − zᵀβ₀)/σ₀)z dP = 0,   ∫χ((y − zᵀβ₀)/σ₀)dP = 0,   (6.2.31)

and Σ(ψ, P) is as in (6.2.6). What is the relation between β₀, σ₀ and β, σ given in the Gaussian model (6.2.19)? If f₀ is symmetric about 0 and the solution of (6.2.31) is unique, then β₀ = β. But σ₀ = c(f₀)σ for some c(f₀) typically different from one. Thus, β̂₀ can be used for estimating β although, if the true distribution of the εᵢ is N(0, σ²), it should perform less well than β̂. On the other hand, σ̂₀ is an estimate of σ only if normalized by a constant depending on f₀. (See Problem 6.2.5.) These are issues of robustness; that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded ψ = (ψ₁, …, ψₚ)ᵀ to estimate β even though it is suboptimal when ε ~ N(0, σ²), and to use a suitably normalized version of σ̂₀ for the same purpose. One effective choice of ψⱼ is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
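Estimating equations of the form above with Huber's bounded ψ can be solved by iteratively reweighted least squares, one standard device (not necessarily the book's). In this sketch the scale is treated as known and set to 1; in practice it would be estimated and normalized as discussed in the text, and the function name and tuning constant are our choices:

```python
import numpy as np

def huber_regression(Z, Y, scale=1.0, k=1.345, n_iter=50):
    """Solve sum_i psi((Y_i - Z_i'b)/scale) Z_ij = 0, j = 1..p, with
    Huber's psi (identity for |r| <= k, clipped beyond, hence bounded
    influence), by iteratively reweighted least squares."""
    b = np.linalg.lstsq(Z, Y, rcond=None)[0]       # least-squares start
    for _ in range(n_iter):
        r = (Y - Z @ b) / scale
        w = np.minimum(1.0, k / np.maximum(np.abs(r), 1e-12))  # psi(r)/r
        WZ = w[:, None] * Z
        b = np.linalg.solve(Z.T @ WZ, Z.T @ (w * Y))  # weighted normal equations
    return b
```

Because the weights ψ(r)/r are at most 1 and shrink like k/|r| for large residuals, gross outliers get little influence on β̂₀, which is the point of choosing a bounded ψ.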
Testing and Confidence Bounds

There are three principal approaches to testing hypotheses in multiparameter models: the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in performance and computationally, for fixed n. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters, such as H : θ₁ ≤ 0 versus K : θ₁ > 0 where θ₂, …, θₚ vary freely. Confidence regions that parallel the tests will also be developed in Section 6.3.

6.2.3 The Posterior Distribution in the Multiparameter Case

The asymptotic theory of the posterior distribution parallels that in the one-dimensional case exactly. We simply make θ a vector, interpret |·| as the Euclidean norm in conditions A7 and A8, and use multivariate expansions as in B.8. We obtain:

Theorem 6.2.3. If the multivariate versions of A0–A3, A4(a.s.), A5(a.s.), and A6–A8 hold, then, if θ̂ denotes the MLE,

L(√n(θ − θ̂) | X₁, …, Xₙ) → N(0, I⁻¹(θ)) a.s. under P_θ for all θ.   (6.2.32)

The consequences of Theorem 6.2.3 are the same as those of its one-dimensional counterpart in Chapter 5. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations.

A new major issue that arises is computation. Although it is easy to write down the posterior density of θ, namely π(θ)∏ᵢ₌₁ⁿp(Xᵢ, θ) up to the proportionality constant ∫π(t)∏ᵢ₌₁ⁿp(Xᵢ, t)dt, the latter can pose a formidable problem if p > 2. The problem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say (θ₁, θ₂), because we then need to integrate out (θ₃, …, θₚ). The asymptotic theory we have developed permits approximation to these constants by the procedure used in deriving the normal approximation to the posterior in Chapter 5 (Laplace's method). This approach is refined in Kass, Kadane, and Tierney (1989). For fixed n there is typically an attempt at "exact" calculation. A class of Monte Carlo based methods derived from statistical physics, loosely called Markov chain Monte Carlo, has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.

Summary. We defined minimum contrast (MC) and M-estimates in the case of p-dimensional parameters and established their convergence in law to a normal distribution. When the estimating equations defining the M-estimates coincide with the likelihood equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MC or M-estimate if we know the correct model P = {P_θ : θ ∈ Θ} and use the MLE for this model. We use an example to introduce the concept of adaptation, in which an estimate θ̄ is called adaptive for a model {P₍θ,η₎ : θ ∈ Θ, η ∈ E} if the asymptotic distribution of √n(θ̄ − θ) has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of θ in the family of models {P₍θ,η₀₎ : θ ∈ Θ}, η₀ specified. In linear regression, adaptive estimation of β₁ is possible iff Z₁ is uncorrelated with every linear function of Z₂, …, Zₚ. Another example deals with M-estimates based on estimating equations generated by linear models with non-Gaussian error distributions. Finally, we show that in the Bayesian framework where, given θ, X₁, …, Xₙ are i.i.d. p(x, θ), the posterior distribution of √n(θ − θ̂) converges a.s. under P_θ to the N(0, I⁻¹(θ)) distribution. This is an example of the equivalence of Bayesian and frequentist optimality asymptotically.

6.3 LARGE SAMPLE TESTS AND CONFIDENCE REGIONS

In Sections 4.9 and 6.1 we developed exact tests and confidence regions that are appropriate in regression and analysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically; moreover, covariates can be arbitrary while responses are necessarily discrete (qualitative) or nonnegative, so that Gaussian models do not seem to be appropriate approximations. In such cases exact methods are typically not available, and we turn to asymptotic approximations to construct tests, confidence regions, and other methods of inference. In this section we use the results of Section 6.2 to extend some of the results of Chapter 5 to vector-valued parameters. We present three procedures that are used frequently: likelihood ratio, Wald, and Rao large sample tests and confidence procedures.

6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic

In Section 6.1 we considered the likelihood ratio statistic

λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ₀}

for testing H : θ ∈ Θ₀ versus K : θ ∈ Θ₁, Θ₁ = Θ − Θ₀, and showed that in several statistical models involving normal distributions λ(X) simplified and produced intuitive tests whose critical values can be obtained from the Student t and F distributions. These results were treated for θ real in Chapter 5. In the multiparameter case the exact critical value is generally not available, and the approximation we shall give is based on the result "2 log λ(X) → χ²_d" for degrees of freedom d to be specified later. Other approximations that will be explored in Volume II are based on Monte Carlo and bootstrap simulations.

Here is an example in which Wilks's approximation to L(λ(X)) is useful:

Example 6.3.1. Suppose X₁, X₂, …, Xₙ are i.i.d. as X, where X has the gamma, Γ(α, β), distribution with density

p(x, θ) = βᵅxᵅ⁻¹exp{−βx}/Γ(α); x > 0, α > 0, β > 0.

Suppose we want to test H : α = 1 (exponential distribution) versus K : α ≠ 1. Here θ = (α, β), and in Chapter 2 we showed how to find θ̂ as a nonexplicit solution of the likelihood equations; thus, the numerator of λ(x) is available as p(x, θ̂). The MLE of β under H is readily seen to be β̂₀ = 1/x̄, and p(x, β̂₀) is the denominator of the likelihood ratio statistic. It remains to find the critical value, which is not available analytically. The approximation we shall give, usually referred to as Wilks's theorem or approximation, states that, under regularity conditions, when testing whether a parameter vector is restricted to an open q-dimensional subset of an open r-dimensional parameter space, q < r, the χ²_{r−q} distribution is an approximation to L(2 log λ(X)); here it yields χ²₁. □

We next give an example that can be viewed as the limiting situation for which the approximation is exact:

Example 6.3.2. The Gaussian Linear Model with Known Variance. Let Y₁, …, Yₙ be independent with Yᵢ ~ N(μᵢ, σ₀²), where σ₀² is known. As in Section 6.1, we test whether μ = (μ₁, …, μₙ)ᵀ is a member of a q-dimensional linear subspace ω₀ of Rⁿ versus the alternative that μ ∈ ω − ω₀, where ω is an r-dimensional linear subspace of Rⁿ with ω ⊃ ω₀. We transform to canonical form by setting U = AY, where A_{n×n} is an orthogonal matrix with rows v₁ᵀ, …, vₙᵀ such that v₁, …, v_q span ω₀ and v₁, …, vᵣ span ω. Set θᵢ = ηᵢ/σ₀ and Xᵢ = Uᵢ/σ₀, where η = Aμ. Then Xᵢ ~ N(θᵢ, 1), i = 1, …, r, Xᵢ ~ N(0, 1), i = r + 1, …, n, and the hypothesis H is equivalent to H : θ_{q+1} = ⋯ = θᵣ = 0. Using Section 6.1, we conclude that under H,

2 log λ(Y) = Σᵢ₌q₊₁ʳXᵢ² ~ χ²_{r−q}.

In this σ² known example, Wilks's approximation is exact. □

We illustrate the remarkable fact that the χ²_{r−q} approximation to the null distribution of 2 log λ holds quite generally when the hypothesis is a nice q-dimensional submanifold of an r-dimensional parameter space. Consider the general i.i.d. case with X₁, …, Xₙ a sample from p(x, θ), where x ∈ 𝒳 and θ ∈ Θ ⊂ Rʳ, Θ open. Write the log likelihood as

lₙ(θ) = Σᵢ₌₁ⁿ log p(Xᵢ, θ).

We first consider the simple hypothesis H : θ = θ₀.

Theorem 6.3.1. Suppose the assumptions of Theorem 6.2.2 are satisfied. Then, under H : θ = θ₀,

2 log λ(X) = 2[lₙ(θ̂ₙ) − lₙ(θ₀)] → χ²ᵣ.   (6.3.1)

Proof. Because θ̂ₙ solves the likelihood equation D_θlₙ(θ) = 0, where D_θ denotes the derivative with respect to θ, an expansion of lₙ(θ) about θ̂ₙ evaluated at θ = θ₀ gives

2[lₙ(θ̂ₙ) − lₙ(θ₀)] = n(θ̂ₙ − θ₀)ᵀÏₙ(θₙ*)(θ̂ₙ − θ₀),   (6.3.2)

where Ïₙ(θ) = −n⁻¹Σᵢ₌₁ⁿ(∂²/∂θⱼ∂θₖ)log p(Xᵢ, θ) and θₙ* satisfies |θₙ* − θ̂ₙ| ≤ |θ̂ₙ − θ₀|. Hence |θₙ* − θ₀| ≤ |θₙ* − θ̂ₙ| + |θ̂ₙ − θ₀| ≤ 2|θ̂ₙ − θ₀|, so θₙ* is consistent and, arguing from the regularity conditions, Ïₙ(θₙ*) → I(θ₀) in probability. By Theorem 6.2.2, √n(θ̂ₙ − θ₀) → V ~ N(0, I⁻¹(θ₀)). By Slutsky's theorem,

2[lₙ(θ̂ₙ) − lₙ(θ₀)] → VᵀI(θ₀)V,

and the result follows because, by Corollary B.6.2, VᵀI(θ₀)V ~ χ²ᵣ. □

As a consequence of the theorem,

{θ₀ : 2[lₙ(θ̂ₙ) − lₙ(θ₀)] ≤ xᵣ(1 − α)},   (6.3.3)

where xᵣ(1 − α) is the 1 − α quantile of the χ²ᵣ distribution, is a confidence region for θ with approximate coverage probability 1 − α, and the test that rejects H : θ = θ₀ when 2 log λ(X) > xᵣ(1 − α) has approximately level α.

Example 6.3.3. The Gaussian Linear Model with Unknown Variance. If the Yᵢ are as in Example 6.3.2 but σ² is unknown, then θ = (μ, σ²) ranges over an (r + 1)-dimensional manifold, whereas under H it ranges over a (q + 1)-dimensional manifold. In Section 6.1 we derived

2 log λ(Y) = n log(1 + Σᵢ₌q₊₁ʳXᵢ²/Σᵢ₌ᵣ₊₁ⁿXᵢ²).

The quantity Vₙ = Σᵢ₌q₊₁ʳXᵢ²/(n⁻¹Σᵢ₌ᵣ₊₁ⁿXᵢ²) converges in law to χ²_{r−q} because its denominator tends to 1 in probability; and because n log(1 + t/n) = t + o(1) for bounded t (take g(t) = log(1 + t)), 2 log λ(Y) → χ²_{r−q} also in the σ² unknown case. □

Next we turn to the more general hypothesis H : θ ∈ Θ₀. Examples 6.3.2 and 6.3.3 illustrate such Θ₀. Let Θ₀ = {θ ∈ Θ : θⱼ = θ₀,ⱼ, j = q + 1, …, r}, where Θ is open and the {θ₀,ⱼ} are specified values. We set d = r − q, write θ = (θ⁽¹⁾, θ⁽²⁾) with θ⁽¹⁾ = (θ₁, …, θ_q)ᵀ and θ⁽²⁾ = (θ_{q+1}, …, θᵣ)ᵀ, and let θ₀⁽²⁾ = (θ₀,q+1, …, θ₀,r)ᵀ. Let P₀ be the model {P_θ : θ ∈ Θ₀} with corresponding parametrization θ⁽¹⁾.

Theorem 6.3.2. Suppose that the assumptions of Theorem 6.2.2 hold for p(x, θ), θ ∈ Θ, that θ̂₀,ₙ is the MLE of θ under H, and that θ̂₀,ₙ satisfies A6 for P₀. (It is easy to see that A0–A6 for the full model imply A0–A5 for P₀.) Then, under H : θ ∈ Θ₀,

2 log λ(X) = 2[lₙ(θ̂ₙ) − lₙ(θ̂₀,ₙ)] → χ²_{r−q},   (6.3.4)

and the test that rejects H when 2 log λ(X) > x_{r−q}(1 − α) is asymptotically of level α.

Proof sketch. For the true θ₀ ∈ Θ₀ write

2 log λ(X) = 2[lₙ(θ̂ₙ) − lₙ(θ₀)] − 2[lₙ(θ̂₀,ₙ) − lₙ(θ₀)].

Make a change of parameter η = M(θ − θ₀) with M = PI^{1/2}(θ₀), where P is an orthogonal matrix chosen so that, in terms of η, H becomes {η : η_{q+1} = ⋯ = ηᵣ = 0}; such a P exists because I^{1/2}(θ₀)(Θ₀ − θ₀) is locally the intersection of a q-dimensional linear subspace of Rʳ with I^{1/2}(θ₀)(Θ − θ₀). Because λ(X), by definition, is invariant under reparametrization, we may argue in terms of η. Setting T(η) = n^{-1/2}Σᵢ₌₁ⁿ(M⁻¹)ᵀDl(Xᵢ, θ₀ + M⁻¹η), one finds Var T(0) = PI^{-1/2}(θ₀)I(θ₀)I^{-1/2}(θ₀)Pᵀ = J, the identity, and, applying the expansion (6.3.2) to the numerator and denominator separately,

2 log λ(X) = Σᵢ₌₁ʳTᵢ²(0) − Σᵢ₌₁^qTᵢ²(0) + o_P(1) = Σᵢ₌q₊₁ʳTᵢ²(0) + o_P(1),

which has a limiting χ²_{r−q} distribution by Slutsky's theorem because T(0) has a limiting N_r(0, J) distribution. This argument is simply an asymptotic version of the one given in Example 6.3.2. □

Of equal importance is that we obtain an asymptotic confidence region for θ⁽²⁾ = (θ_{q+1}, …, θᵣ), with θ₁, …, θ_q acting as nuisance parameters. If θ̂₀,ₙ(θ⁽²⁾) denotes the MLE of θ with the last d coordinates fixed at θ⁽²⁾, the asymptotic level 1 − α confidence region is

{θ⁽²⁾ : 2[lₙ(θ̂ₙ) − lₙ(θ̂₀,ₙ(θ⁽²⁾))] ≤ x_{r−q}(1 − α)}.

Wilks's theorem depends critically on the fact that not only is Θ open but {(θ₁, …, θ_q)ᵀ : θ ∈ Θ₀} is open in R^q. More complicated linear hypotheses such as H : θ − θ₀ ∈ ω₀, where ω₀ is a linear space of dimension q, are also covered: if ω₀ is spanned by an orthogonal basis v₁, …, v_q and v_{q+1}, …, vᵣ span its orthocomplement in Rʳ, then H is {θ ∈ Θ : (θ − θ₀)ᵀvⱼ = 0, q + 1 ≤ j ≤ r}.   (6.3.12)

This formulation is still inadequate for most applications. It can be extended as follows. Suppose H is specified by: there exist d functions gⱼ, q + 1 ≤ j ≤ r, written as a vector g, such that Dg(θ) exists and is of rank r − q at all θ ∈ Θ. Define

Θ₀ = {θ ∈ Θ : g(θ) = 0}.   (6.3.13)

Theorem 6.3.3. Suppose the assumptions of Theorem 6.3.2 and the preceding conditions on g hold. If λ(X) is the likelihood ratio statistic for H : θ ∈ Θ₀ given by (6.3.13), then, under H, 2 log λ(X) → χ²_{r−q}.

The essential idea is that, if θ₀ is true, λ(X) behaves asymptotically like a test statistic for H : θ ∈ Θ₀₀, where

Θ₀₀ = {θ ∈ Θ : Dg(θ₀)(θ − θ₀) = 0},   (6.3.14)

a hypothesis of the form covered by Theorem 6.3.2. The proof is sketched in the problems. Theorem 6.3.2 falls under this schema with gⱼ(θ) = θⱼ − θ₀,ⱼ, q + 1 ≤ j ≤ r. More sophisticated examples, such as testing for independence in contingency tables, will appear in the next section.

The openness requirements matter. As an example of what can go wrong, let (Xᵢ₁, Xᵢ₂)ᵀ be i.i.d. N((θ₁, θ₂)ᵀ, J), where J is the 2 × 2 identity matrix, and let Θ₀ = {θ : θ₁ + θ₂ ≤ 1}. Here the dimension of Θ₀ and Θ is the same, but the boundary of Θ₀ has lower dimension. If θ₁ + θ₂ = 1, the restricted MLE is the projection of the sample mean on the boundary line and 2 log λ(X) → χ²₁; but if θ₁ + θ₂ < 1, clearly 2 log λ(X) = o_P(1). The limit law of 2 log λ thus depends on where in Θ₀ the true θ lies.

6.3.2 Wald's and Rao's Large Sample Tests

The Wald Test. Suppose that the assumptions of Theorem 6.2.2 hold. Then

√n(θ̂ₙ − θ) → N(0, I⁻¹(θ)) as n → ∞.   (6.3.15)

Because I(θ) is continuous in θ, I(θ̂ₙ) → I(θ) in probability, and by Slutsky's theorem

n(θ̂ₙ − θ)ᵀI(θ̂ₙ)(θ̂ₙ − θ) → YᵀI(θ)Y, Y ~ N(0, I⁻¹(θ)),   (6.3.16)

which is distributed χ²ᵣ. It follows that the Wald test that rejects H : θ = θ₀ in favor of K : θ ≠ θ₀ when

Wₙ(θ₀) = n(θ̂ₙ − θ₀)ᵀI(θ̂ₙ)(θ̂ₙ − θ₀) > xᵣ(1 − α)

has asymptotic level α. More generally, I(θ̂ₙ) is replaceable by any consistent estimate of I(θ₀), for instance, −n⁻¹D²lₙ(θ₀) or the Hessian choice −n⁻¹D²lₙ(θ̂ₙ); the last is favored because it is usually computed automatically with the MLE. These choices also have the advantage that the confidence region one generates, {θ : Wₙ(θ) ≤ xᵣ(1 − α)}, is an ellipsoid in Rʳ, easily interpretable and computable.

For the more general hypothesis H : θ ∈ Θ₀ we write the MLE for θ ∈ Θ as θ̂ₙ = (θ̂ₙ⁽¹⁾, θ̂ₙ⁽²⁾), with θ̂ₙ⁽¹⁾ = (θ̂₁, …, θ̂_q)ᵀ and θ̂ₙ⁽²⁾ = (θ̂_{q+1}, …, θ̂ᵣ)ᵀ, and define the Wald statistic

Wₙ(θ₀⁽²⁾) = n(θ̂ₙ⁽²⁾ − θ₀⁽²⁾)ᵀ[I²²(θ̂ₙ)]⁻¹(θ̂ₙ⁽²⁾ − θ₀⁽²⁾),   (6.3.17)

where I²²(θ) is the lower diagonal block of

I⁻¹(θ) = ( I¹¹(θ)  I¹²(θ) ; I²¹(θ)  I²²(θ) )

with diagonal blocks of dimension q × q and d × d. By Theorem 6.2.2, √n(θ̂ₙ⁽²⁾ − θ₀⁽²⁾) → N_d(0, I²²(θ₀)) if θ₀ ∈ Θ₀ holds. Hence:

Theorem 6.3.4. Under the conditions of Theorem 6.2.2, if H is true,

Wₙ(θ₀⁽²⁾) → χ²_{r−q}.   (6.3.18)

Here I²²(θ̂ₙ) is replaceable by any consistent estimate of I²²(θ₀). The Wald test, which rejects iff Wₙ(θ₀⁽²⁾) > x_{r−q}(1 − α), is therefore asymptotically of level α, and it leads to the Wald confidence regions {θ⁽²⁾ : Wₙ(θ⁽²⁾) ≤ x_{r−q}(1 − α)}, which are ellipsoids in R^d. What is not as evident is that, under H,

Wₙ(θ₀⁽²⁾) = 2 log λ(X) + o_P(1),   (6.3.19)

so the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n. In practice, however, they can be very different.

The Rao Score Test. For the simple hypothesis H : θ = θ₀, Rao's score test is based on the observation that, by the central limit theorem,

√nψₙ(θ₀) → N(0, I(θ₀)),   (6.3.20)

where ψₙ = n⁻¹Dlₙ(θ₀) is the likelihood score vector. The Rao statistic is

Rₙ(θ₀) = nψₙᵀ(θ₀)I⁻¹(θ₀)ψₙ(θ₀) → χ²ᵣ

under H, and the test that rejects H when Rₙ(θ₀) > xᵣ(1 − α) is called the Rao score test. This test has the advantage that it can be carried out without computing the MLE, and the convergence Rₙ(θ₀) → χ²ᵣ requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests.

The extension of the Rao test to H : θ ∈ Θ₀ runs as follows. Let ψ̄ₙ(θ) = n⁻¹D₂lₙ(θ), where D₁lₙ denotes the q × 1 gradient of lₙ with respect to the first q coordinates and D₂lₙ the d × 1 gradient with respect to the last d. The Rao statistic is

Rₙ(θ₀⁽²⁾) = nψ̄ₙᵀ(θ̂₀,ₙ)Σ̂⁻¹ψ̄ₙ(θ̂₀,ₙ),

where Σ̂ is a consistent estimate, under H, of the asymptotic variance of √nψ̄ₙ(θ̂₀,ₙ),

Σ(θ₀) = I₂₂(θ₀) − I₂₁(θ₀)I₁₁⁻¹(θ₀)I₁₂(θ₀),   (6.3.21)

with I₁₁ the upper left q × q block of the r × r information matrix I(θ₀), I₁₂ the upper right q × d block, and so on. A consistent estimate is available from the Hessian:

Σ̂ = −n⁻¹[D₂²lₙ(θ̂₀,ₙ) − D₂₁lₙ(θ̂₀,ₙ)[D₁²lₙ(θ̂₀,ₙ)]⁻¹D₁₂lₙ(θ̂₀,ₙ)],

where D₁² is the q × q matrix of second partials of lₙ with respect to θ⁽¹⁾, D₂₁ the d × q matrix of mixed second partials with respect to θ⁽¹⁾ and θ⁽²⁾, and so on.

Theorem 6.3.5. Under H : θ ∈ Θ₀ and the conditions A0–A5 of Theorem 6.3.2, with A6 required only for P₀,

Rₙ(θ₀⁽²⁾) → χ²_{r−q},   (6.3.22)

so the Rao test, which rejects iff Rₙ(θ₀⁽²⁾) > x_{r−q}(1 − α), is asymptotically of level α. It can also be shown that, under H, Rₙ(θ₀⁽²⁾) = 2 log λ(X) + o_P(1); the argument is sketched in the problems. The Rao large sample confidence regions are {θ⁽²⁾ : Rₙ(θ⁽²⁾) ≤ x_{r−q}(1 − α)}. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. On the other hand, it shares the disadvantage of the Wald test that matrices need to be computed and inverted.

Power Behavior of the LR, Rao, and Wald Tests. It is possible, as in the one-dimensional case, to derive the asymptotic power of these tests for alternatives of the form θₙ = θ₀ + Δ/√n, where θ₀ ∈ Θ₀. The analysis for Θ₀ = {θ₀} is relatively easy. For instance, for the Wald test,

n(θ̂ₙ − θ₀)ᵀI(θ̂ₙ)(θ̂ₙ − θ₀) = (√n(θ̂ₙ − θₙ) + Δ)ᵀI(θ̂ₙ)(√n(θ̂ₙ − θₙ) + Δ) → χ²ᵣ(ΔᵀI(θ₀)Δ),

where χ²ₘ(γ²) is the noncentral chi-square distribution with m degrees of freedom and noncentrality parameter γ². It may be shown that the equivalence (6.3.19) holds under θₙ as well, so that the power behavior is the same for all three tests. Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score tests; see Rao (1973) for more on this.

Summary. We considered the problem of testing H : θ ∈ Θ₀ versus K : θ ∈ Θ − Θ₀, where Θ is an open subset of Rʳ and Θ₀ is the collection of θ ∈ Θ with the last r − q coordinates θ⁽²⁾ specified. We established Wilks's theorem, which states that if λ(X) is the LR statistic, then, under regularity conditions, 2 log λ(X) has an asymptotic χ²_{r−q} distribution under H. We considered a quadratic form, called the Wald statistic, which measures the distance between the hypothesized value of θ⁽²⁾ and its MLE, and showed that it has the same limiting χ²_{r−q} distribution. Finally, we introduced the Rao score test, which is based on a quadratic form in the gradient of the log likelihood; its asymptotic distribution is also χ²_{r−q}.

6.4 LARGE SAMPLE METHODS FOR DISCRETE DATA

In this section we give a number of important applications of the general methods we have developed to inference for discrete data. In particular we shall discuss problems of goodness-of-fit and special cases of log linear and generalized linear models (GLM), treated in more detail in Section 6.5.
L .4.... It is easily remembered as 2 = SUM (Observed ... by A. .1S.13.80j ) = (Ok . where N = (N1 .1) of Pearson's X2 will reappear in other multinomial applications in this section. . 80j 80k 1 and expand the square keeping the square brackets intact. .2). we write ~ ~  8j 80j .4. To derive the Rao test.~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6. Then.. we could invert] or note that by (6.2. n ( L kl ( j=1 ~ ~) !. 80i 80j 80k (6.Nk_d T and.1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem.80k).2.. '" . because kl L(Oj . with To find ]1.4. note that from Example 2.~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k . The second term on the right is n [ j=1 (!. 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } . .. The general form (6. ~ . ]1(9) = ~ = Var(N).8.L . ~ = Ilaijll(kl)X(kl) with Thus.Expected)2 Expected (6. the Rao statistic is .2) ..4.11). j=1 .
the first term on the right of (6.4.2) becomes

  n Σ_{j=1}^k (θ̂_j − θ_{0j})²/θ_{0j} + n [θ̂_k/θ_{0k} − 1]².

It follows that the Rao statistic equals Pearson's χ².

Example 6.4.1. Testing a Genetic Theory. In experiments on pea breeding, Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds. Possible types of progeny were: (1) round yellow; (2) wrinkled yellow; (3) round green; and (4) wrinkled green. If we assume the seeds are produced independently, we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1, 2, 3, 4 as above and associated probabilities of occurrence θ_1, θ_2, θ_3, θ_4. Mendel's theory predicted that θ_1 = 9/16, θ_2 = θ_3 = 3/16, θ_4 = 1/16, and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. Mendel observed n_1 = 315, n_2 = 101, n_3 = 108, n_4 = 32. Then nθ_{10} = 312.75, nθ_{20} = nθ_{30} = 104.25, nθ_{40} = 34.75, k = 4, and

  χ² = (2.25)²/312.75 + (3.25)²/104.25 + (3.75)²/104.25 + (2.75)²/34.75 = 0.47,

which has a p-value of 0.9 when referred to a χ²_3 table. For comparison, 2 log λ = 0.48 in this case. There is insufficient evidence to reject Mendel's hypothesis. However, this value may be too small! See Note 1. □

6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables

Suppose N = (N_1, ..., N_k)^T has a multinomial, M(n, θ), distribution. We will investigate how to test H : θ ∈ Θ_0 versus K : θ ∉ Θ_0, where Θ_0 is a composite "smooth" subset of the (k − 1)-dimensional parameter space

  Θ = {θ : θ_i ≥ 0, 1 ≤ i ≤ k, Σ_{i=1}^k θ_i = 1}.

For example, in the Hardy-Weinberg model (Example 2.1.4), Θ_0 = {(η², 2η(1 − η), (1 − η)²) : 0 ≤ η ≤ 1}, which is a one-dimensional curve in the two-dimensional parameter space Θ. Here testing the adequacy of the Hardy-Weinberg model means testing H : θ ∈ Θ_0 versus K : θ ∈
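Mendel's numbers can be checked directly; the sketch below is illustrative and not part of the original text. It recomputes both Pearson's χ² and the LR statistic 2 log λ.

```python
from math import log

counts = [315, 101, 108, 32]            # Mendel's observed seed counts
theta0 = [9 / 16, 3 / 16, 3 / 16, 1 / 16]  # frequencies predicted by theory
n = sum(counts)                         # 556 trials
expected = [n * t for t in theta0]      # 312.75, 104.25, 104.25, 34.75

x2 = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
lr = 2 * sum(o * log(o / e) for o, e in zip(counts, expected))
# x2 comes out near 0.47 and lr near 0.48, matching the values quoted in the
# text; both are referred to the chi-square table with k - 1 = 3 df.
```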
8) denote the frequency function of N.(t)~ei(11)=O... to test the HardyWeinberg model we set e~ = e1 . We suppose that we can describe 8 0 parametrically as where 11 = ('r/1. .. .3. j = 1. 8 0 • Let p( n1. 8(11)) for 11 E £. j = q + 1. i=l t n. where 9j is chosen so that H becomes equivalent to "( e~ . {p(.nk. 8) for 8 E 8 0 is the same as maximizing p( n1..~) and test H : e~ = O. l~j~q. we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6. ij = (iiI." For instance.l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni .q distribution for large n. Other examples. 8(11)) = 0. approximately X. .2) by ej(ij). ."" 'r/q) T.1. r.3 and conclude that. 8(11)) : 11 E £}.r for specified eOj . .4.q) exists.1. .3).4.. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories. . Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. . £ is open. thus. However.. nk. and the map 11 ~ (e 1(11).("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6. .X approximately has a xi distribution under H.3 that 2 log ..ne j (ij)]2 = 2 ne.. also equal to Pearson's X2 • . which will be pursued further later in this section.r. is a subset of qdimensional space.4. nk..4. . e~) T ranges over an open subset of Rq and ej = eOj ..8 0 . The algebra showing Rn(8 0 ) = X2 in Section 6. 2 log . . . . The Wald statistic is only asymptotically invariant under reparametrization. nk) = L ndlog(ni/n) log ei(ij)]. . ek(11))T takes £ into 8 0 . j Oj is.. i=l k If 11 ~ e( 11) is differentiable in each coordinate. a 11 'r/J .q' Moreover. nk. and ij exists...3) Le.. then it must solve the likelihood equation for the model..2~(1 . .4. If £ is not open sometimes the closure of £ will contain a solution of (6. the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij).. .. 
we define ej = 9j (8).404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 .1). If a maximizing value. Maximizing p(n1.. .. the log likelihood ratio is given by log 'x(nb . by the algebra of Section 6. To apply the results of Section 6. That is.4. ij satisfies l a a'r/j logp(nl.X approximately has a X. . 1 ~ j ~ q or k (6. Then we can conclude from Theorem 6. To avoid trivialities we assume q < k . The Rao statistic is also invariant under reparametrization and. under H. e~ = e2 ... ..
we obtain critical values from the χ²_{k-q-1} tables.

Example 6.4.2. The Fisher Linkage Model. A self-crossing of maize heterozygous on two characteristics (starchy versus sugary, green base leaf versus white base leaf) leads to four possible offspring types: (1) sugary-white; (2) sugary-green; (3) starchy-white; (4) starchy-green. If N_i is the number of offspring of type i among a total of n offspring, then (N_1, ..., N_4) has a M(n, θ_1, θ_2, θ_3, θ_4) distribution. A linkage model (Fisher, 1958, p. 301) specifies that

  Θ_0 = {(¼(2 + η), ¼(1 − η), ¼(1 − η), ¼η) : 0 ≤ η ≤ 1},

where η is an unknown number between 0 and 1; this Θ_0 is a "one-dimensional curve" of the three-dimensional parameter space Θ. To test the validity of the linkage model we would test H : θ ∈ Θ_0 versus K : θ ∉ Θ_0. The likelihood equation (6.4.3) becomes

  n_1/(2 + η) − (n_2 + n_3)/(1 − η) + n_4/η = 0,   (6.4.4)

which reduces to a quadratic equation in η̂. The only root of this equation in [0, 1] is the desired estimate. Because q = 1, k = 4, H is rejected if χ² ≥ x_2(1 − α). □

Example 6.4.3. Hardy-Weinberg. We found in Example 2.2.6 that η̂ = (2n_1 + n_2)/2n. Thus,

  θ(η̂) = ( ((2n_1 + n_2)/2n)², 2((2n_1 + n_2)/2n)((2n_3 + n_2)/2n), ((2n_3 + n_2)/2n)² )^T

and H is rejected if χ² ≥ x_1(1 − α). □

Testing Independence of Classifications in Contingency Tables

Many important characteristics have only two categories. An individual either is or is not inoculated against a disease; is male or female; is or is not a smoker; and so on. We often want to know whether such characteristics are linked or are independent. For instance, do smoking and lung cancer have any relation to each other?
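Solving the linkage likelihood equation amounts to clearing denominators and taking the root of a quadratic in [0, 1]. The sketch below is illustrative only; the counts used are hypothetical, not data from the text.

```python
from math import sqrt

def linkage_mle(n1, n2, n3, n4):
    # Clearing denominators in  n1/(2+eta) - (n2+n3)/(1-eta) + n4/eta = 0
    # gives the quadratic  n*eta^2 - (n1 - 2*(n2+n3) - n4)*eta - 2*n4 = 0,
    # whose unique root in [0, 1] is the MLE eta-hat.
    n = n1 + n2 + n3 + n4
    a = n
    b = -(n1 - 2 * (n2 + n3) - n4)
    c = -2.0 * n4
    # c < 0 and a > 0, so the roots have opposite signs; take the positive one.
    return (-b + sqrt(b * b - 4 * a * c)) / (2 * a)

# Hypothetical counts, chosen only to illustrate the computation:
eta = linkage_mle(125, 18, 20, 34)
```

The returned root satisfies the original likelihood equation to numerical precision.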
Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. Then a randomly selected individual from the population can be one of four types AB. . k 4. 1] is the desired estimate (see Problem 6.2. Because q = 1. {}22. 301). specifies that where TJ is an unknown number between 0 and 1.
        B       B̄
  A    N_11    N_12    R_1
  Ā    N_21    N_22    R_2
       C_1     C_2     n

The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column. Thus, for example, N_12 is the number of sampled individuals who fall in category A of the first characteristic and category B̄ of the second characteristic. We test the hypothesis H : θ ∈ Θ_0 versus K : θ ∉ Θ_0, where Θ_0 is a two-dimensional subset of Θ given by

  Θ_0 = {(η_1η_2, η_1(1 − η_2), (1 − η_1)η_2, (1 − η_1)(1 − η_2)) : 0 ≤ η_1 ≤ 1, 0 ≤ η_2 ≤ 1}.

Here we have relabeled θ_11 + θ_12, θ_11 + θ_21 as η_1, η_2 to indicate that these are parameters, which vary freely. For θ ∈ Θ_0, if N = (N_11, N_12, N_21, N_22)^T, the likelihood equations (6.4.3) become

  (n_11 + n_12)/η̂_1 − (n_21 + n_22)/(1 − η̂_1) = 0
  (n_11 + n_21)/η̂_2 − (n_12 + n_22)/(1 − η̂_2) = 0   (6.4.5)

whose solutions are

  η̂_1 = (n_11 + n_12)/n,  η̂_2 = (n_11 + n_21)/n,   (6.4.6)

the proportions of individuals of type A and type B, respectively. These solutions are the maximum likelihood estimates. Pearson's statistic is then easily seen to be

  χ² = Σ_{i=1}^2 Σ_{j=1}^2 (N_ij − R_iC_j/n)² / (R_iC_j/n)   (6.4.7)

where R_i = N_i1 + N_i2 is the ith row sum and C_j = N_1j + N_2j is the jth column sum. By our theory, if H is true, because k = 4, q = 2, χ² has approximately a χ²_1 distribution. In fact (Problem 6.4.2), the (N_ij − R_iC_j/n) are all the same in absolute value. This suggests that χ² may be written as the square of a single (approximately) standard normal variable,

  χ² = Z², where Z = (N_11 − R_1C_1/n) / [R_1R_2C_1C_2/n³]^{1/2}.
b). that A is more likely to occur in the presence of B than it would in the presence of B). B. i = 1. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. then Z is approximately distributed as N(O. (Jij : 1 :::. b ~ 2 (e.4.3) that if A and B are independent. peA I B)) versus K : peA I B) > peA I B). Nab Cb Ra n . .. peA I B) = peA I B).4.. 1 Nu 2 N12 . i :::. . .. Thus. Z = v'n[P(A I B) .. a.. TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. if X2 measures deviations from independence. eye color. B.4. The N ij can be arranged in a a x b contingency table. j :::. if and only if. b where the TJil. b} "J M(n. Therefore. (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::. a.4.Section 6. j :::. respectively.peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. a... 1 :::. Positive values of Z indicate that A and B are positively associated (i. 1 :::.. a. .. j = 1..g. . . b NIb Rl a Nal C1 C2 .. a. i :::. B to denote the event that a randomly selected individual has characteristic A.a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. 1 :::. j :::. . If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . .. . it is reasonable to use the test that rejects. 1)..e.. The X2 test is equivalent to rejecting (twosidedly) if. hair color). B. If (Jij = P[A randomly selected individual is of type i for 1 and j for 2]..4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6.8) Thus. i :::. .. Z ~ z(l. that is. Z indicates what directions these deviations take. Next we consider contingency tables for two nonnumerical characteristics having a and b states. then {Nij : 1 :::.. It may be shown (Problem 6. and only if.
[Xi C :i"J log + m. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses.11) When we use the logit transfonn g(7r). whicll we introduced in Example 1. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0). we observe independent Xl' .~ =1 Zij {3j = unknown parameters {31. i :::.7r)] are also used in practice. Instead we turn to the logistic transform g( 7r).8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 . we obtain what is called the logistic linear regression model where . and the loglog transform g2(7r) = 10g[log(1 . The argument is left to the problems as are some numerical applications.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated. log(l "il] + t.4. perhaps after a transfonnation.4. ' XI."F~1 Yij where Yij is the response on the jth of the mi trials in block i.. (6. In this section we assume that the data are grouped or replicated so that for each fixed i. with Xi binomial.. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J.) over the whole range of z is impossible. we call Y = 1 a "success" and Y = 0 a "failure. (6. such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.7r)].{3p.. Thus.10. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0). usually called the log it. The log likelihood of 7r = (7rl. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). a simple linear representation zT ~ for 7r(. Other transforms. we observe the number of successes Xi = L. 6. 1) d. Because 7r(z) varies between 0 and 1.4.9) which has approximately a X(al)(bl) distribution under H. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. 
In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6. log ( ~: ) .6. 1 :::." We assume that the distribution of the response Y depends on the known covariate vector ZT. k."" 7rk)T based on X = (Xl"'" Xk)T is t. . ..4.3 Logistic Regression for Binary Responses In Section 6. As is typical.f.Li = L.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are.
.p.L) = O.3.16) 1fi) 1]. j = 1... The condition is sufficient but not necessary for existencesee Problem 2. .4 the Fisher information matrix is I(f3) = ZTWZ (6.3.4 Large Sample Methods for Discrete Data 409 The special case p = 2. Alternatively with a good initial value the NewtonRaphson algorithm can be employed. 7r Using the 8method. j Thus..! ) ( mi .. .Li..4.J. (6.14) where W = diag{ mi1fi(l1fi)}kxk.4.[ i(l7r iW')· .4.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00.Section 6.1 and by Proposition 3. Similarly..4. . mi 2mi Xi 1 Xi 1 = .2 can be used to compute the MLE of f3. IN(f3) of f3 = (f31..Li = E(Xi ) = mi1fi.l.+ . We let J.3. Zi)T is the logistic regression model of Problem 2.4. I N(f3) = f.13) By Theorem 2.14).f3p )T is.3. or Ef3(Z T X) = ZTX.. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '. Tp) T. Zi The log likelihood l( 7r(f3)) = == k (1. in (6. ( 1 . the likelihood equations are just ZT(X .~ Xi 2 1 +"2 ' (6.3. i ::.4.15) Vi = log X+.1fi )* = 1 .4. k ( ) (6.3. it follows by Theorem 5. if N = 2:7=1 mi.log(l:'. . The coordinate ascent iterative procedure of Section 2.1fi)*}' + 00. . the empirical logistic transform. k m} (V.1. As the initial estimate use (6. Theorem 2. Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' . Then E(Tj ) = 2:7=1 Zij J. we can guarantee convergence with probability tending to 1 as N + 00 as follows. where Z = IIZij Ilrnxp is the design matrix..4. Although unlike coordinate ascent NewtonRaphson need not converge. It follows that the NILE of f3 solves E f3 (Tj ) = T .(1. that if m 1fi > 0 for 1 ::.J) SN(O. W is estimated using Wo = diag{mi1f. the solution to this equation exists and gives the unique MLE 73 of f3.12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit. 
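The Newton-Raphson iteration for the likelihood equations Z^T(X − μ) = 0, with information matrix Z^T W Z, W = diag{m_i π_i(1 − π_i)}, can be sketched as follows. This is an illustrative sketch, not from the text; the `solve` helper and all data are constructed for this example.

```python
from math import exp, log

def solve(A, b):
    # Gaussian elimination for the small positive definite system A d = b.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        for j in range(i + 1, n):
            f = M[j][i] / M[i][i]
            M[j] = [mj - f * mi for mj, mi in zip(M[j], M[i])]
    d = [0.0] * n
    for i in range(n - 1, -1, -1):
        d[i] = (M[i][n] - sum(M[i][j] * d[j] for j in range(i + 1, n))) / M[i][i]
    return d

def logistic_mle(Z, x, m, iters=25):
    # Newton-Raphson for grouped logistic regression with canonical link:
    # solves Z^T(x - mu) = 0, mu_i = m_i * pi_i, pi_i = 1/(1+exp(-z_i^T beta)),
    # using the Fisher information Z^T W Z, W = diag(m_i * pi_i * (1 - pi_i)).
    k, p = len(x), len(Z[0])
    beta = [0.0] * p
    for _ in range(iters):
        pi = [1 / (1 + exp(-sum(Z[i][j] * beta[j] for j in range(p))))
              for i in range(k)]
        score = [sum(Z[i][j] * (x[i] - m[i] * pi[i]) for i in range(k))
                 for j in range(p)]
        w = [m[i] * pi[i] * (1 - pi[i]) for i in range(k)]
        info = [[sum(Z[i][j] * w[i] * Z[i][q] for i in range(k))
                 for q in range(p)] for j in range(p)]
        beta = [bj + dj for bj, dj in zip(beta, solve(info, score))]
    return beta
```

For instance, with covariate values −1, 0, 1 and x = (2, 5, 8) successes out of m = (10, 10, 10) trials, the fitted coefficients are approximately (0, log 4).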
Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f.+ .
4.flm.Wo 210g.4. As in Section 6.. In the present case.4..IOg(E.1 (L:~=l Xij(3j). i 1. We obtain k independent samples. then we can form the LR statistic for H : TJ E Wo versus K: TJ E w ... Example 6. {3 E RP} and let r be the dimension of w. f"V . If Wo is a qdimensional linear subspace of w with q < r. suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. fl) has asymptotically a X%r.\ for testing H : TJ E w versus K : TJ n .\ has an asymptotic X. k.8 that the inverse of the logit transform 9 is the logistic distribution function Thus.1 linear subhypotheses are important. distribution for TJ E w as mi + 00.13.w is denoted by D(y.4."" Xk independent with Xi example. that is..) +X.3. k < oosee Problem 6. Thus. By the multivariate delta method...1. . . Here is a special case.4. In this case the likelihood is a product of independent binomial densities.13. The samples are collected indeB(1ri' mi). . Theorem 5. We want to contrast w to the case where there are no restrictions on TJ.\=2t [Xi log i=1 (Ei. . recall from Example 1. ji) measures the distance between the fit ji based on the model wand the data X.12) k zT D(X..1. it foHows (Problem 6. i = 1. the MLE of 1ri is 1ri = 9. To get expressions for the MLEs of 7r and JL. we let w = {TJ : 'f]i = {3. by Problem 6. . D(X.17) where X: = mi Xi and M~ mi Mi. i = 1.:IITlPt'pr Case 6 Because Z has rankp. .14) that {3o is consistent.4. from (6.4. and the MLEs of 1ri and {Li are Xdmi and Xi.4..q distribution as mi + 00. 2 log . D(X. Testing In analogy with Section 6. where ji is the MLE of JL for TJ E w. ji). The LR statistic 2 log .4. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and recording the number Xi of patients that recover. 
The Binomial OneWay Layout.~.18) {Lo~ where jio is the MLE of JL under H and fl~i = mi .)l {LOt (6.11) and (6. ji) 2 I)Xi 10g(Xdfli) + XIlog(XI!flDJ i=1 (6.k.6. one from each location. we set n = Rk and consider TJ E n.410 Inference in the Mllltin::lr. For a second pendently and we observe Xl. and for the ith location count the number Xi among mi that has the given attribute.
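The deviance formula and the homogeneity test translate directly into code. The sketch below is illustrative only, with made-up counts; it computes the LR statistic for H : π_1 = ... = π_k.

```python
from math import log

def binomial_deviance(x, m, mu):
    # D(X, mu) = 2 * sum_i [ X_i log(X_i/mu_i) + X'_i log(X'_i/mu'_i) ],
    # where X'_i = m_i - X_i and mu'_i = m_i - mu_i; terms with X_i = 0 or
    # X_i = m_i contribute only their nonzero part (0 log 0 = 0).
    d = 0.0
    for xi, mi, mui in zip(x, m, mu):
        if xi > 0:
            d += xi * log(xi / mui)
        if xi < mi:
            d += (mi - xi) * log((mi - xi) / (mi - mui))
    return 2.0 * d

# Homogeneity test with hypothetical counts.  Under H the fitted means are
# mu_0i = m_i * (T/N), and the LR statistic D(X, mu_0) is approximately
# chi-square with k - 1 degrees of freedom.
x, m = [8, 14, 18], [20, 20, 20]
pi_hat = sum(x) / sum(m)
mu0 = [mi * pi_hat for mi in m]
lr = binomial_deviance(x, m, mu0)
```

Note that the deviance of the data to the unrestricted fit μ̂_i = X_i is zero, as it should be.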
Li = EYij. J.4..1 that if 0 < T < N the MLE exists and is the solution of (6.N log(1+ exp{{3}) + ~ log ( k ~: ) where T = 2::=1 Xi.3.18) with JiOi mi7r.13).7r ~ i "')2 mi 7r is a Wald statistic and the X2 test is equivalent asymptotically to the LR test (Problem 6. and (3 = 10g[7r1(1 .1 and 6. in the case of a Gaussian response. we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates.Li of a response is expressed as a function of a linear combination ~i =. discuss algorithms for computing MLEs. z.4. 6. the Wald and Rao statistics take a form called "Pearson's X2. N 2::=1 mi. an important hypothesis is that the popUlations are homogenous. In particular.Li 7ri = gl(~i)' where gl(y) is the logistic distribution .4. In the special case of testing independence of multinomial frequencies representing classifications in a twoway contingency table. Using Z as given in the oneway layout in Example 6. Under H the log likelihood in canonical exponential form is {3T ." which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H.L (ml7r. we test H : 7rl = 7r2 7rk = 7r.3. Finally.1). In the special case of testing equality of k binomial parameters. We used the large sample testing results of Section 6.3 to find tests for important statistical problems involving discrete data.1. In Section 6. We found that for testing the hypothesis that a multinomial parameter equals a specified value. Summary.7r)]..1.4. then J. the Pearson statistic is shown to have a simple intuitive form.mk7r)T. if J.3 we considered experiments in which the mean J. (3 = L Zij{3j j=l p of covariate values. and give the LR test.15). We derive the likelihood equations. The LR statistic is given by (6.5 Generalized Linear Models 411 This model corresponds to the oneway layout of Section 6. 
we give explicitly the MLEs and X2 test.5 GENERALIZED LINEAR MODELS Yi In Sections 6.4. we find that the MLE of 7r under His 7r = TIN. and as in that section. 7r E (0.Li = ~i. When the hypothesis is that the multinomial parameter is in a qdimensional subset of the k . The Pearson statistic 2 _ X  L k1 i=l (X "'(1) m'7r .Idimensional parameter space e. . versus the alternative that the 7r'S are not all equal. It follows from Theorem 2. where J.Section 6. . Thus. the Rao statistic is again of the Pearson X2 form.3.
More generally. in which case 9 is also called the link function. Typically. A (11) = Ao (TJi) for some A o. . Canonical links The most important case corresponds to the link being canonical. These Ii are independent Gaussian with known variance 1 and means JLi The model is GLM with canonical g(fL) fL. which is p x 1. such that g(fL) = L J'jZ(j) j=1 p Z{3 where Z(j) (Zlj. g(fL) is of the form (g(JL1) . . 11) given by (6.. that is. but in a subset of E obtained by restricting TJi to be of the form where h is a known function.1. ••• . . ."" zn) with Zi = (Zib"" Zip)T nonrandom and Y has density p(y.412 Inference in the 1\III1IITin'~r:>rn"'T. the mean fL of Y is related to 11 via fL = A(11). Note that if A is oneone. See Haberman (1974). Typically. 1989) synthesized a number of previous generalizations of the linear model. j=1 p In this case.1) where 11 is not in E. As we know from Corollary 1. = L:~=1 Zij{3. Yn)T is n x 1 and ZJxn = (ZI. Znj)T is the jth column vector of Z.6.. g = or 11 L J'j Z(j) = Z{3.. the identity...r Case 6 function. Special cases are: (i) The linear model with known variance.1). in which case JLi A~ (TJi).. We assume that there is a onetoone transform g(fL) of fL. called the link junction. most importantly the log linear model developed by Goodman and Haberman.5. The generalized linear model with dispersion depending only on the mean The data consist of an observation (Z. Y) where Y = (Yb . g(JLn)) T. fL determines 11 and thereby Var(Y) A( 11). (ii) Log linear models. the GLM is the canonical subfamily of the original exponential family generated by ZTy. McCullagh and NeIder (1983. the natural parameter space of the nparameter canonical exponential family (6.5..
Section 6. b. ZTy or = ZT E13 y = ZT A(Z. classification i on characteristic 1 and j on characteristic 2.6. .2) (or ascertain that no solution exists). where f3i. Bj > 0.5. b. log J. for example. In this case.5..2) It's interesting to note that (6.3. j::.1. Then () = IIBij II. is that of independence Bij = Bi+B+j where Bi+ = 2:~=1 Bij . the link is canonical.B 1. if maximum likelihood estimates they necessarily uniquely satisfy the equation f3 exist. so that Yij is the indicator of.4.8) is not a member of that space.5.4. B+ j = 2:~=1 Bij . 0 < (}i < 1 withcanonicallinkg(B) = 10g[B(1B)]. 2:~=1 B = 1. "Ij = 10gBj..5. . 1f. . . 1 ::.Yp)T is M(n.Bp). the . j ::.3) where In this situation and more generally even for noncanonical links..1. 1 ::.t1. Suppose. and the log linear model corresponding to log Bij = f3i + f3j. f3j are free (unidentifiable parameters). j ::. . that Y = IIYij 111:Si:Sa..5. as seen in Example 1. j ::. The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ((}i). This isjust the logistic linear model of Section 6.. See Haberman (1974) for a further discussion. The coordinate ascent algorithm can be used to solve (6 . that procedure is just (6. The models we obtain are called log linearsee Haberman (1974) for an extensive treatment. Algorithms If the link is canonical. But.8) is orthogonal to the column space of Z. by Theorem 2. /1(.5 Generalized linear Models 413 Suppose (Y1. a.tp) T .8) (6. p are canonical parameters.3. say.2) can be interpreted geometrically in somewhat the same way as in the Gausian linear modelthe "residual" vector Y .7.1::. in general. If we take g(/1) = (log J. ./1(. i ::.80 ~ f3 0 . NewtonRaphson coincides with Fisher's method of scoring described in Problem 6. 1 ::. 1 ::. Then. p. With a good starting point f3 0 one can achieve faster convergence with the NewtonRaphson algorithm of Section 2..
the variance covariance matrix is W m and the regression is on the columns ofW mZProblem 6. iLl) where iLl is the MLE under WI. as n ~ 00.414 Inference in the Multiparameter Case Chapter 6 true value of {3.2.D(Y. 1]) > o}).20) when the data are the residuals from the fit at stage m. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6. Testing in GLM Testing hypotheses in GLM is done via the LR statistic.2.5. the deviance of Y to iLl and tl(iL o. iLl) == D(Y.5. Let ~m+l == 13m+1 . This quantity. 210gA = 2[l(Y. Write 1](') for AI. IL) : IL E wo} where iLo is the MLE of IL in Woo The LR statistic for H : IL E Wo versus K : IL E WI . . Unfortunately tl =f. ••• . Yn ) T (assume that Y is in the interior of the convex support of {y : p(y. We can think of the test statistic .5. then with probability tending to 1 the algorithm converges to the MLE if it exists.1) for which p = n.13m' which satisfies the equation (6. D generally except in the Gaussian case.1](Y)) l(Y .6) a decomposition of the deviance between Y and iLo as the sum of two nonnegative components. This name stems from the following interpretation. (6. iLo) . the correction Am+ l is given by the weighted least squares formula (2. iLl)' each of which can be thought of as a squared distance between their arguments.4) That is.1. iLo) == inf{D(Y.5.5. If Wo C WI we can write (6. For the Gaussian linear model with known variance 0'5 (Problem 6. In this context.Wo with WI ~ Wo is D(Y.5) is always ~ O.1](lLa)] for the hypothesis that IL = lLa within M as a "measure" of (squared) distance between Y and lLo. As in the linear model we can define the biggest possible GLM M of the form (6.. In that case the MLE of J1 is iL M = (Y1 .4).5. called the deviance between Y and lLo. The LR statistic for H : IL E Wo is just D(Y. iLo) .D(Y . the algorithm is also called iterated weighted least squares.
there are easy conditions under which conditions of Theorems 6. if we assume the covariates in logistic regression with canonical link to be stochastic. Similar conclusions r '''''''. 1. Y n ) from the family with density (6. = f3d = 0.1 . and (6.5. Ao(Z.. However. . which. then ll(fio... ••• .. f3d+l..5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links. asymptotically exists.9) What is ]1 ((3)? The efficient score function a~i logp(z .q. .. Thus. .3.7) More details are discussed in what follows.2.8) This is not unconditionally an exponential family in view of the Ao (z..Section 6 . (3))Z.5. we obtain If we wish to test hypotheses such as H : 131 = .5. then (ZI' Y1 ). can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. . . (3) is (Yi and so.2. fil) is thought of as being asymptotically X. and 6.5. . which we temporarily assume known. Asymptotic theory for estimates and tests If (ZI' Y 1 ).3 hold (Problem 6. .0..2.. . For instance. if we take Zi as having marginal density qo. . Zn in the sample (ZI' Y 1 ).. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical. and can conclude that the statistic of (6. (Zn' Y n ) has density (6.. so that the MLE {3 is unique. (Zn. . (3) term. . is consistent with probability 1. we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0. This can be made precise for stochastic GLMs obtained by conditioning on ZI.3.3). 6. y .5.2 and 6. in order to obtain approximate confidence procedures. d < p.10) is asymptotically X~ under H. the theory of Sections 6. f3p ).
Note that these tests can be carried out without knowing the density qo of Z1. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean. r)dy = 1.13) (6.4. which we postpone to Volume II.5. 11.3. A) families.A(11))}h(y. For further discussion of this generalization see McCullagp. p(y.1) depend on an additional scalar parameter r.416 Inference in the Multiparameter Case Chapter 6 follow for the Wald and Rao statistics.5. r) = exp{c. then it is easy to see that E(Y) = A(11) Var(Y) = c(r)A(11) (6. Because (6. we take g(JL) = <1>1 (JL) so that 7ri = <1> (zT!3) we obtain the socalled probit model. r).5. It is customary to write the model as. 1989). if in the binary data regression model of Section 6. These conclusions remain valid for the usual situation in which the Zi are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables.r)dy. 11. For instance. As he points out.14) so that the variance can be written as the product of a function of the mean and a general dispersion parameter.5. .5.10) is of product form A( 11) [1/ c( r)] whereas the righthand side cannot always be put in this form. General link functions Links other than the canonical one can be of interest. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6.1 (r)(11T y . Important special cases are the N (JL.9 are rather similar. However.11) Jp(y. (6. JL ::. then A(11)/c(r) = log! exp{c1 (r)11T y}h(y. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred. From the point of analysis for fixed n. the results of analyses with these various transformations over the range . (J2) and gamma (p. when it can.12) The lefthand side of (6.5. .1 ::. for c( r) > 0. and NeIder (1983. 
Cox (1970) considers the variance stabilizing transformation which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. noncanonicallinks can cause numerical problems because the models are now curved rather than canonical exponential families.
if P3 is the nonparametric model where we assume only that (Z.r+i".2. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6. For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II.5). the GaussMarkov theorem. then the LSE of (3 is. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function.d. ____ ."n".2. E p y2 < oo}. Furthermore. is "act as ifthe model were the one given in Example 6. a 2 ) or. Y) rv P given by (6.1. Y) has a joint distribution and if we are interested in estimating the best linear predictor /lL(Z) of Y given Z.2. We found that if we assume the error distribution fa. the distributional and implicit structural assumptions of parametric models are often suspect. of a linear predictor of the form Ef3j Z(j). These issues. /3. roughly speaking. where the Z(j) are observable covariate vectors and (3 is a vector of regression coefficients. we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian. in fact.1) where E is independent of Z but ao(Z) is not constant and ao is assumed known.2. and tests. We discussed algorithms for computing MLEs of In the random design case.i. had any distribution symmetric around 0 (Problem 6." Of course. the right thing to do if one assumes the (Zi' Yi) are i. we use the asymptotic results of the previous sections to develop large sample estimation results. so is any estimate solving the equations based on (6. under further mild conditions on P. EpZTZ nonsingular.. for estimating /lL(Z) in a submodel of P 3 with /3 (6. the LSE is not the best estimate of (3. are discussed below. There is another semiparametric model P 2 = {P : (ZT. and Models 417 Summary. confidence procedures.Section 6. 
called the link function.31) with fa symmetric about O. still a consistent asymptotically normal estimate of (3 and. Ii 6.2.J .2.2.2.. in fact.i.6.6 Robustness Pr.. with density f for some f symmetric about O}. if we consider the semiparametric model. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor.6. ". In Example 6.2. PI {P : (zT. y)T rv P that satisfies Ep(Y I Z) = ZT(3.d. have a fixed covariate exact counterpart. That is.19) with Ei i..6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS As we most recently indicated in Example 6.19) even if the true errors are N(O. which is symmetric about 0.8 but otherwise also postponed to Volume II. whose further discussion we postpone to Volume II. the resulting MLEs for (3 optimal under fa continue to estimate (3 as defined by (6.
Robustness in Estimation

We drop the assumption that the errors epsilon_1, ..., epsilon_n in the linear model (6.1.3) are normal. Instead we assume the Gauss-Markov linear model

Y = Z beta + epsilon,  E(epsilon_i) = 0,  Var(epsilon_i) = sigma^2,  Cov(epsilon_i, epsilon_j) = 0 for i != j,   (6.6.2)

where Z is an n x p matrix of constants, beta is p x 1, and Y, epsilon are n x 1. Many of the properties stated for the Gaussian case carry over to the Gauss-Markov case:

Proposition 6.6.1. Suppose the Gauss-Markov linear model (6.6.2) holds. Then the conclusions (1) and (2) of Theorem 6.1.4 are still valid. Moreover, mu-hat and epsilon-hat are still unbiased, Var(mu-hat) = sigma^2 H, Var(epsilon-hat) = sigma^2 (I - H), and if p = r, then beta-hat is unbiased with Var(beta-hat) = sigma^2 (Z^T Z)^{-1}.

The optimality of the estimates mu-hat_i, beta-hat_i of Section 6.1, and of the LSE in general, still holds when they are compared to other linear estimates.

Theorem 6.6.1 (Gauss-Markov). Suppose the Gauss-Markov linear model (6.6.2) holds. Then, for any parameter of the form alpha = sum_{i=1}^n a_i mu_i for some constants a_1, ..., a_n, the estimate alpha-hat = sum_{i=1}^n a_i mu-hat_i has uniformly minimum variance among all unbiased estimates linear in Y_1, ..., Y_n.

Proof. Let alpha-tilde = sum_{i=1}^n c_i Y_i stand for any estimate linear in Y_1, ..., Y_n. Because E(epsilon_i) = 0 and Cov(Y_i, Y_j) = Cov(epsilon_i, epsilon_j), we have E(alpha-tilde) = sum_{i=1}^n c_i mu_i and

Var_GM(alpha-tilde) = sum_{i=1}^n c_i^2 Var(epsilon_i) + 2 sum_{i<j} c_i c_j Cov(epsilon_i, epsilon_j),

where Var_GM refers to the variance computed under the Gauss-Markov assumptions. Let Var_G(alpha-tilde) stand for the variance computed under the Gaussian assumption that epsilon_1, ..., epsilon_n are i.i.d. N(0, sigma^2). The preceding computation shows that Var_GM(alpha-tilde) = Var_G(alpha-tilde). By Theorems 6.1.2(iv) and 6.1.3(iv), because alpha-tilde is unbiased, Var_G(alpha-hat) <= Var_G(alpha-tilde) for all unbiased alpha-tilde. Because Var_G = Var_GM for all linear estimators, the result follows.

Note that the preceding result and proof are similar to those of the optimal linear prediction theory of Section 1.4, where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case; in fact, our current mu-hat coincides with the empirical plug-in estimate of that optimal linear predictor.

Example 6.6.1. One Sample (continued). In Example 6.1.1, mu = beta_1 and mu-hat = beta-hat_1 = Y-bar. The Gauss-Markov theorem shows that Y-bar, in addition to being UMVU in the normal case, is UMVU in the class of linear estimates for all models with E Y_i^2 < infinity.
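The crux of the Gauss-Markov argument is that the variance of a linear estimate depends on the error law only through its first two moments. A quick Monte Carlo check can illustrate this; the sketch below is mine, not the text's (the grid design and the particular error laws are my choices). It compares the empirical variance of the least squares slope under Gaussian and centered exponential errors with the same variance; both should match the Gaussian-theory value sigma^2 / S_zz.

```python
import random

def slope_lse(z, y):
    """Least squares slope for the simple linear model y = b0 + b1*z + e."""
    n = len(z)
    zbar = sum(z) / n
    ybar = sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    return sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / szz

def mc_var_of_slope(z, err, reps=3000):
    """Monte Carlo variance of the LSE slope when errors are drawn by err()."""
    ests = []
    for _ in range(reps):
        y = [2.0 + 0.5 * zi + err() for zi in z]   # true b0 = 2, b1 = 0.5
        ests.append(slope_lse(z, y))
    m = sum(ests) / reps
    return sum((e - m) ** 2 for e in ests) / (reps - 1)

random.seed(1)
z = [i / 10 for i in range(50)]
zbar = sum(z) / len(z)
theory = 1.0 / sum((zi - zbar) ** 2 for zi in z)   # sigma^2 / S_zz, sigma^2 = 1

v_gauss = mc_var_of_slope(z, lambda: random.gauss(0.0, 1.0))
# centered exponential: mean 0, variance 1, but skewed
v_expo = mc_var_of_slope(z, lambda: random.expovariate(1.0) - 1.0)
print(theory, v_gauss, v_expo)
```

Both empirical variances agree with the formula (up to Monte Carlo error), even though only the first error law is Gaussian.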
However, Y-bar can be improved on by nonlinear estimates. For instance, Y-bar has a larger variance (Problem 6.6.5) than the nonlinear estimate Y-tilde = median(Y_1, ..., Y_n) when the density of Y is the Laplace density

(lambda/2) exp{-lambda |y - mu|},  y in R,  mu in R,  lambda > 0,

which is symmetric about mu. For this density, and for all symmetric densities, Y-bar is unbiased. Thus, a major weakness of the Gauss-Markov theorem is that it only applies to linear estimates.

Another weakness is that it only applies to the homoscedastic case where Var(Y_i) is the same for all i. Suppose we have a heteroscedastic version of the linear model where E(epsilon_i) = 0 but Var(epsilon_i) = sigma_i^2 depends on i. If the sigma_i^2 are known, we can use the Gauss-Markov theorem to conclude that the weighted least squares estimates of Section 2.2 are optimal among linear estimates; when the sigma_i^2 are unknown, our estimates of Section 6.1 are unbiased but not optimal even in the class of linear estimates.
Robustness of Tests

In Section 5.3 we investigated the robustness of the significance levels of t tests for the one- and two-sample problems using asymptotic and Monte Carlo methods. Now we will use asymptotic methods to investigate the robustness of levels more generally. The asymptotic behavior of the LR, Wald, and Rao tests depends critically on the asymptotic behavior of the underlying MLEs theta-hat and theta-hat_0. From the theory developed in Section 6.2 we know that if Psi(., theta) = Dl(., theta) and P, which may not belong to the model, is true, we expect that theta-hat_n converges to theta(P), which is the unique solution of

integral Psi(x, theta) dP(x) = 0.   (6.6.3)

If theta(P) is defined by (6.6.3), then sqrt(n)(theta-hat - theta(P)) converges in law to N(0, Sigma(Psi, P)), with Sigma(Psi, P) as given in Section 6.2. For a sample of size n large, consider H : theta = theta_0 and the Wald test statistic

T_W = n (theta-hat - theta_0)^T I(theta_0) (theta-hat - theta_0).   (6.6.4)

If theta(P) = theta_0, then T_W converges in law to V^T I(theta_0) V where V ~ N(0, Sigma(Psi, P)). Because Sigma(Psi, P) != I^{-1}(theta_0) in general, the asymptotic distribution of T_W is not chi-squared with r degrees of freedom. But if theta(P) != theta_0, evidently T_W tends to infinity. This observation holds for the LR and Rao tests as well; but see the problems for this section.

There is an important special case where all is well: the linear model we have discussed in Example 6.2.1.

Example 6.6.2. The Linear Model with Stochastic Covariates. Suppose (Z_1, Y_1), ..., (Z_n, Y_n) are i.i.d. as (Z, Y), where Z is a (p x 1) vector of random covariates, and we model the relationship between Z and Y as

Y = Z^T beta + epsilon,   (6.6.5)
with the distribution P of (Z, Y) such that epsilon and Z are independent, E_P(epsilon) = 0 and E_P(epsilon^2) < infinity. It is still true that if Psi is given by (6.2.30), then beta(P) specified by (6.6.3) equals beta in (6.6.5), and sigma^2(P) = Var_P(epsilon). Here beta-hat is the LSE, and we consider, say, H : beta = 0. When epsilon is N(0, sigma^2), the LR, Wald, and Rao tests all are equivalent to the F test: Reject if

T_n = beta-hat^T Z_(n)^T Z_(n) beta-hat / s^2 >= p f_{p, n-p}(1 - alpha),   (6.6.6)

where Z_(n) = (Z_1, ..., Z_n)^T and

s^2 = (1/(n - p)) sum_{i=1}^n (Y_i - Y-hat_i)^2 = |Y - Z_(n) beta-hat|^2 / (n - p).

Now, even if epsilon is not Gaussian, it is still true by Theorem 6.2.1 that

sqrt(n)(beta-hat - beta) converges in law to N(0, sigma^2(P) [E(Z Z^T)]^{-1}).   (6.6.7)

Moreover, by the law of large numbers,

n^{-1} Z_(n)^T Z_(n) = n^{-1} sum_{i=1}^n Z_i Z_i^T converges to E(Z Z^T),   (6.6.8)

and s^2 converges in probability to sigma^2(P). It follows by Slutsky's theorem that, under H : beta = 0, the limiting distribution of T_n is chi-squared with p degrees of freedom. Because p f_{p, n-p}(1 - alpha) converges to x_p(1 - alpha), as in Example 5.3.7, the test (6.6.6) has asymptotic level alpha even if the errors are not Gaussian. Similarly, confidence procedures based on approximating Var(beta-hat) by

s^2 [Z_(n)^T Z_(n)]^{-1}   (6.6.9)

are still asymptotically of correct level. This kind of robustness also holds for H : theta_{q+1} = theta_{0,q+1}, ..., theta_p = theta_{0,p}, or more generally for beta in L_0 + beta_0, a q-dimensional affine subspace of R^p. It is intimately linked to the fact that even though the parametric model on which the test was based is false, (i) the set of P satisfying the hypothesis remains the same, and (ii) the (asymptotic) variance of beta-hat is the same as under the Gaussian model.
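This level robustness is easy to check numerically. The sketch below is mine, not the text's: a simple regression with intercept and one covariate, centered exponential errors, and the asymptotic chi-squared(1) critical value 3.8415 for the squared slope t-statistic. Under H (slope zero) the rejection rate should be near the nominal 0.05 despite the skewed errors.

```python
import random

def slope_test_stat(z, y):
    """Squared t-statistic for H: slope = 0 in simple linear regression."""
    n = len(z)
    zbar = sum(z) / n
    ybar = sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    b1 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / szz
    b0 = ybar - b1 * zbar
    rss = sum((yi - b0 - b1 * zi) ** 2 for zi, yi in zip(z, y))
    s2 = rss / (n - 2)
    return b1 ** 2 * szz / s2

random.seed(2)
z = [i / 10 for i in range(50)]
reps, crit = 2000, 3.8415          # chi-squared(1) 0.95 quantile
rej = 0
for _ in range(reps):
    # H is true: slope 0, skewed (centered exponential) errors
    y = [1.0 + (random.expovariate(1.0) - 1.0) for _ in z]
    if slope_test_stat(z, y) > crit:
        rej += 1
rate = rej / reps
print(rate)
```

The empirical level stays close to 0.05, which is the robustness property (i) and (ii) above jointly deliver.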
If the first of these conditions fails, then it is not clear what H : theta = theta_0 means anymore. If it holds but the second fails, the theory goes wrong. We illustrate with two final examples.

Example 6.6.3. The Linear Model with Stochastic Covariates with epsilon and Z Dependent. Suppose E(epsilon | Z) = 0, so that the parameters beta of (6.6.5) are still meaningful, but epsilon and Z are dependent. That is, the variances are heteroscedastic: for simplicity, let the distribution of epsilon given Z = z be that of sigma(z) epsilon', where epsilon' is independent of Z. Then H : beta = 0 is still meaningful. Suppose without loss of generality that Var(epsilon') = 1. Then

sqrt(n)(beta-hat - beta) converges in law to N(0, Q),   (6.6.10)

where Q differs from sigma^2(P) [E(Z Z^T)]^{-1} in general. The conditions of Theorem 6.2.1 clearly hold, but (6.6.7) fails, and in fact the test (6.6.6) does not have asymptotic level alpha in general. It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general.

Example 6.6.4. The Two-Sample Scale Problem. Suppose X = (X^(1), X^(2)), where X^(1), X^(2) are independent lifetimes of paired pieces of equipment, exponential E(lambda_1), E(lambda_2) respectively. If we take H : lambda_1 = lambda_2 and write theta_j = E(X^(j)), the standard Wald test is

"Reject iff n (theta-hat_1 - theta-hat_2)^2 / sigma-hat^2 > x_1(1 - alpha)",   (6.6.12)

where

theta-hat_j = (1/n) sum_{i=1}^n X_i^(j),  j = 1, 2,  and  sigma-hat = (sqrt(2) / 2n) sum_{i=1}^n (X_i^(1) + X_i^(2)).   (6.6.11)

However, suppose now that X^(1)/theta_1 and X^(2)/theta_2 are identically distributed but not exponential. Then H is still meaningful: under H, X^(1) and X^(2) are identically distributed. But the test (6.6.12) does not have asymptotic level alpha in general, because

sigma-hat converges to sqrt(2) E_P(X^(1)), which differs from sqrt(2 Var_P(X^(1)))   (6.6.13)

in general. Simply replace sigma-hat^2 by

sigma-tilde^2 = (1/n) sum_{i=1}^n { (X_i^(1) - (X-bar^(1) + X-bar^(2))/2)^2 + (X_i^(2) - (X-bar^(1) + X-bar^(2))/2)^2 }.   (6.6.15)
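For the linear model of Example 6.6.3, a consistent estimate Q-hat of the kind called for is the heteroscedasticity-consistent "sandwich" variance of the LSE. The sketch below is my code (the name `var_estimates` and the plain HC0 form, with no small-sample correction, are my choices); on simulated heteroscedastic data it checks that the sandwich estimate tracks the true variance of the slope while the homoscedastic formula s^2 / S_zz does not.

```python
import random

def fit_slope(z, y):
    """Least squares intercept and slope, plus S_zz and the design mean."""
    n = len(z)
    zbar = sum(z) / n
    ybar = sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    b1 = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / szz
    b0 = ybar - b1 * zbar
    return b0, b1, szz, zbar

def var_estimates(z, y):
    """Model-based (s^2 / S_zz) and HC0 sandwich estimates of Var(slope)."""
    n = len(z)
    b0, b1, szz, zbar = fit_slope(z, y)
    resid = [yi - b0 - b1 * zi for zi, yi in zip(z, y)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    model = s2 / szz
    # the slope is sum of w_i * Y_i with w_i = (z_i - zbar) / S_zz,
    # so its sandwich variance is sum of w_i^2 * residual_i^2
    hc0 = sum(((zi - zbar) / szz) ** 2 * e ** 2 for zi, e in zip(z, resid))
    return model, hc0

random.seed(4)
z = [i / 10 for i in range(50)]
reps = 1500
slopes, models, hc0s = [], [], []
for _ in range(reps):
    # noise sd grows with distance from the design mean: strong heteroscedasticity
    y = [1.0 + 0.5 * zi + abs(zi - 2.45) * random.gauss(0.0, 1.0) for zi in z]
    slopes.append(fit_slope(z, y)[1])
    m, h = var_estimates(z, y)
    models.append(m)
    hc0s.append(h)

mean_b1 = sum(slopes) / reps
true_var = sum((b - mean_b1) ** 2 for b in slopes) / (reps - 1)
mean_model = sum(models) / reps
mean_hc0 = sum(hc0s) / reps
print(true_var, mean_model, mean_hc0)
```

The model-based formula is badly biased here, while the sandwich estimate is approximately unbiased for the actual sampling variance of the slope.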
In particular, in Example 6.6.3 suppose Z = (I_1, ..., I_d)^T, where (I_1, ..., I_d) has a multinomial M(1; lambda_1, ..., lambda_d) distribution, 0 < lambda_j < 1; this is the stochastic version of the d-sample model of Section 6.1. It is easy to see that our tests of H : beta_2 = ... = beta_d = 0 fail to have correct asymptotic levels in general unless sigma^2(z) is constant or lambda_1 = ... = lambda_d = 1/d (Problem 6.6.6). A solution is to replace Z_(n)^T Z_(n) / s^2 in (6.6.9) by Q-hat^{-1}, where Q-hat is a consistent estimate of Q. For d = 2 this is just the asymptotic solution of the Behrens-Fisher problem, the two-sample problem with unequal sample sizes and variances, discussed in Section 4.9.4.

To summarize: If hypotheses remain meaningful when the model is false then, in general, the LR, Wald, and Rao tests need to be modified to continue to be valid asymptotically. The simplest solution, at least for Wald tests, is to use as an estimate of [Var sqrt(n) theta-hat]^{-1} not I(theta-hat) or -(1/n) D^2 l_n(theta-hat) but the so-called "sandwich estimate" (Huber, 1967)

[(1/n) D^2 l_n(theta-hat)] [(1/n) sum_{i=1}^n (Dl Dl^T)(X_i, theta-hat)]^{-1} [(1/n) D^2 l_n(theta-hat)].   (6.6.16)

Summary. We considered the behavior of estimates and tests when the model that generated them does not hold. The Gauss-Markov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.i.d. N(0, sigma^2) assumption on the errors is replaced by the assumption that the errors have mean zero, identical variances, and are uncorrelated, provided we restrict the class of estimates to linear functions of Y_1, ..., Y_n. In the linear model with a random design matrix, we showed that the MLEs and tests generated by the model where the errors are i.i.d. N(0, sigma^2) are still reasonable when the true error distribution is not Gaussian; in particular, the confidence procedures derived from the normal model of Section 6.1 are still approximately valid, as are the LR, Wald, and Rao tests. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model, then the MLE and LR procedures for the specific model will fail asymptotically in the wider model, and the methods need adjustment; we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model.

6.7 PROBLEMS AND COMPLEMENTS

Problems for Section 6.1

1. Show that in the canonical linear Gaussian model (6.1.11) with both mu and sigma^2 unknown:

(i) the MLE does not exist if n = r;

(ii) if n >= r + 1, then the MLE (eta-hat_1, ..., eta-hat_r, sigma-hat^2)^T of (eta_1, ..., eta_r, sigma^2)^T is (U_1, ..., U_r, (1/n) sum_{i=r+1}^n U_i^2)^T.

2. For the canonical linear Gaussian model with sigma^2 unknown, use Theorem 3.4.3 to compute the information lower bound on the variance of an unbiased estimator of sigma^2. Compare this bound to the variance of s^2. Specialize to the case Z = (1, ..., 1)^T.
13. . i = 1.... and the 1. where IJ. Var(Uf) = 20.n. 9.n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. N(O.4 • 3.i.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] .. . n. = (Zi2. . the 100(1 .l+c (_c)j+l)/~ (1. {Ly = /31.2.. i = 1. .. Suppose that Yi satisfies the following model Yi ei = () + fi. .1] is a known constant and are i...(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of (). c E [0.Li = {Ly+(zi {Lz)f3.1.. Find the MLE B ().d. Derive the formula (6. 8. of 7.. . t'V (b) Show that if ei N(O.1. and (3 and {Lz are as defined in (1. (d) Show that Var(O) ~ Var(Y).2 ).d. then B the MLE of (). is (c) Show that Y and Bare unbiased.2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. . 0. i = 1.1. Here the empirical plugin estimate is based on U. .2 replaced by 0: 2 • 6."" (Z~. see Problem 2. .2 .22. i = 1. Yn ) where Z.28) for the noncentrality parameter ()2 in the regression example. 1 n f1.2 known case with 0. fn where ei = ceil + fi.5) Yi () + ei.1. 5. i eo = 0 (the €i are called moving average errors.14). Yl). . Let Yi denote the response of a subject at time i.. n.7 Problems and Complements 423 Hint: By A. Derive the formula (6.1.4. where a. .'''' Zip)T.. 4. .. zi = (Zi21"" Zip)T.2 ). Let 1. (e) Show that Var(B) < Var(Y) unless c = O. J =~ i=O L)) c i (1.Section 6. 0. .29). 1 {LLn)T. fO 0.2 coincides with the likelihood ratio statistic A(Y) for the 0. Show that >:(Y) defined in Remark 6. (Zi. Show that Ii of Example 6. Consider the model (see Example 1..29) for the noncentrality parameter 82 in the oneway layout. Show that in the regression example with p = r = 2.
(b) Find a level (1 a) prediction interval for Y (i. Show that for the estimates (a) Var(a) = +==_=:.p (~a) (n .. Yn ) such that P[t.2)(Zj2 . Y ::.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i. ••. (b) Find confidence intervals for 'lj. (c) If n is fixed and divisible by 2(p . Yn . Yn be a sample from a population with mean f. Often a treatment that is beneficial in small doses is harmful in large doses.p (1 . Assume the linear regression model with p future observation Y to be taken at the pont z.2). then the hat matrix H = (h ij ) is given by 1 n 11.1 and variance a 2 .. n2 = . ~ 1 . . 'lj...• statistics HYl.k)' C (b) If n is fixed and divisible by p.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::. Note that Y is independent of Yl. (a) Show that level (1 .Z.1.":=::.. l(Yh . Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6.2)2 a and 8k in the oneway layout Var(8k ) . Let Y 1 . then Var(8k ) is minimized by choosing ni = n/2.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.2) L(Zi2 .a)% confidence intervals for a and 6k • 12.. 15. = np n/2(p 1). . (d) Give the 100(1 . The following model is useful in such situations. (Zi2 Z. a 2 ::. where n is even. ~(~ + t33) t31' r = 2. .. . Consider a covariate x..• 424 10. Consider the three estimates T1 = and T3 Y. (n .2.~ L~=1 nk = (p~:)2 + Lk#i .{3i are given by = ~(f.. 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. ::.p)S2 /x n . which is the amount .. T2 1 (1/ 2n) L..a).. Yn ). = 2(Y . We want to predict the value of a 14. .e . .Z.I).p)S2 /x n . In the oneway layout.a) confidence intervals for linear functions of the form {3j . then Var( a) is minimized by choosing ni = n / p. (a) Find a level (1 .
5 Zi2 * + (33zi2 where Zil = Xi  X. ( 2 ). En Xi. 16. P = {N({L. Y (yield) 3.r2 )l 1 2 7 > O.. Problems for Section 6. Show that if C is an n x r matrix of rank r. .0'2) is the N(Il'. 8).77 167. ( 2 ). fi3 and level 0. E :::. : {L E R. and a response variable Y. n are independent N(O. r :::.0'2)(x ) + E<t'(JL. (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. hence. nonsingular. ( 2 ) logp(x. Check AO. (b) Plot (Xi. xC' = 0 => x = 0 for any rvector x. 7 2 ) does not exist so that A6 doesn't hold. compute confidence intervals for (31.95 Yi = e131 e132xi Xf3. n. which is yield or production. . 1 2 where <t'CJL.or dose of a treatment. and Q = P. a 2 <7 2 . Let 8. p. Yi) and (Xi. 8) = 2. (a) For the following data (from Hald.. P. (33. 17. 1952.1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. i = 1..2. fli) where fi1. p(x. .Section 6. a 2 > O}. 653).0289. Assume the model = (31 + (32 X i + (33 log Xi + Ei.A6 when 8 = ({L. Hint: Because C' is of rank r. In the Gaussian linear model show that the parametrization ({3. •. and p be as in Problem 6.r2) (x).1) + 2<t'CJL.7 Probtems and Lornol.• . .emEmts 425 . fi2. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133. ( 2 ) density.2 1.I::llogxi' You may use X = 0. X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . (b) Show that the MLE of ({L.. ( 2 )T is identifiable if and only if r = p. . Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. (32. then the r x r matrix C'C is of rank r and.
10 that the estimate Raphson estimates from On is efficient. and . .2. then ({3..2.2 hold if (i) and (ii) hold. (a) In Example 6.20)...1/ 2 ). i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric.2. Z i ~O.2.24)..I3). n[z(~)~en)jl)' X. Show that if we identify X = (z(n). I I . '".24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then.fii(ji . Fill in the details of the proof of Theorem 6.426 Inference in the Multiparameter Case Chapter 6 I . Y).1.2. In Example 6. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic. e Euclidean. show that ([EZZ T jl )(1. t) is then called a conditional MLE.2.. The MLE of based on (X. show that the assumptions of Theorem 6. Hint: !x(x) = !YIZ(Y)!z(z). 8.(3) = op(n. that (d) (fj  (3)TZ'f. given Z(n).  (3)T)T has a (". . (e) Construct a method of moment estimate moments which are ~ consistent. In Example 6.2. (e) Show that 0:2 is unconditionally independent of (ji. (c) Apply Slutsky's theorem to conclude that and. T(X) = Zen).'. In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H.. T 2 ) based on the first two I (d) Deduce from Problem 6. I . iT 2 ) are conditional MLEs. H E H}. Establish (6. ell of (} ~ = (Ji.21).1.2. .2. b . .Pe"H).1) EZ .i > 1.Zen) (31 2 e ~ 7.  I . (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4.(b) The MLE minimizes ._p distribution. 6.1 show thatMLEs of (3.2. 3. H abstract. (P("H) : e E e. In Example 6. X . /1..". . (I) Combine (aHe) to establish (6. hence. Hint: (a). (fj multivariate nannal distribution with mean 0 and variance.2.IY . and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ). 5. (6.2.2.2 are as given in (6.)Z(n)(fj .
.•. <).. (iv) Dg(O) is continuous at 0 0 . Show that if BB a unique MLE for B1 exists and 10. E R.l real be independent identically distributed Y. ao (}2 = (b) Write 8 1 uniquely solves 1 8. (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi.1/ 2 ). hence.1' _ On = O~  [ . (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1. Examples are f Gaussian.'2:7 1 'l'(Xi. (logistic).LD. (iii) Dg(0 0 ) is nonsingular. the <ball about 0 0 ..Section 6.(XiI') _O. .3). (ii) gn(Oo) ~ g(Oo).' .2. p i=l J. p is strictly convex. 8~ = 1 80 + Op(n. 10 . Suppose AGA4 hold and 8~ is vn consistent.0 0 1 < <} ~ 0. = J. Hint: n . ] t=l 1 n ..l + aCi where fl.p(Xi'O~) n. R d are such that: (i) sup{jDgn(O) . t=l 1 tt Show that On satisfies (6. Let Y1 •.2. Y. that is. . a' . and f(x) = e.0 0 ).L'l'(Xi'O~) n.O) has auniqueOin S(Oo. . a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and.LD.l exists and uniquely ~ .p(Xi'O~) + op(1) n i=l n ) (O~ .Oo) i=cl (1 .x (1 + C X ) . (a) Show that if solves (Y ao is assumed known a unique MLE for L.1) starting at 0.7 Problems and Complements 427 9.Dg(O)j . Him: You may use a uniform version of the inverse function theorem: If gn : Rd j.
. hence. ./3)' = L i=1 1 CjZiU) n p . {Vj} are orthonormal.2 holds for eo as given in (6.~_.3 i 1. = 131 (Zil . ~.d. •. i=l • Z. 'I) : 'I E 3 0 } and. that Theorem 6. Suppose that Wo is given by (6.. . Wald.II(Zil I Zi2. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}. and Rao tests for testing H : Ih. Similarly compute the infonnation matrix when the model is written as Y..3. I '" LJ i=l ~(1) Y. 0 < ZI < .2. Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . i2. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3. for "II sufficiently large. and ~ . Zip)) + L j=2 p "YjZij + €i J . ..12) and the assumptions of Theorem (6. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. Problems for Section 6. Y.2. converges to the unique root On described in (b) and that On satisfies 1 1 • j. Suppose responses YI .. Establish (6. II. minimizing I:~ 1 (Ii 2 i...1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1.(Z" . q(.3. LJ j=2 Differentiate with respect to {31. P I . log Ai = ElI + (hZi.. . .O E e. 2 ° 2.. P(Ai). a 2 ).12). 'I) = p(.ip range freely and Ei are i. . CjIJtlZ. Thus... 1 Yn are independent Poisson variables with Yi .i. . t• where {31. . .(j) l. N(O..27).Zi )13. then it converges to that solution.428 then.. < Zn 1 for given covariate values ZI..(Z"  ~(') z.O). Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj. Hint: Write n J • • • i I • DY.' " 13. > 0 such that gil are 1 . •. .3. i . there exists a j and their image contains a ball S(g(Oo).26) and (6. Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution..l hold for p(x. 1 Zw Find the asymptotic likelihood ratio.6). Zi.3). Show that if 3 0 = {'I E 3 : '1j = 0. iteration of the NewtonRaphson algorithm starting at 0. • (6. Inference in the Multiparameter Case Chapter 6 > O. f.3.2. = versus K : O oF 0.
with density p(.X.3 hold.(X" . Suppose 8)" > 0.) 1(Xi . p .) O.. 1).. 'I) S(lJo)} then.Section 6.) where k(Bo. Oil and p(X"Oo) = E.(I(X i .n 7J(Bo.'1 8 0 = and. B). .. q + 1 <j< rl.d. = ~ 4. Vi).'" . =  p(·.l. . .. hence. .) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('. {IJ E S(lJo): ~j(lJ) = 0.aq are orthogonal to the linear span of ~(1J0) . OJ}. even if ~ b = 0. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo).. with Xi and Y. pXd . Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. X n Li.. Let (Xi. (Adjoin to 9q+l.(X1 . q('. . be ii.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio. < 00..3.3. Suppose that Bo E (:)0 and the conditions of Theorem 6. 1 < i < n. 1 5. > 0 or IJz > O. Oil + Jria(00 .d.IJ) where q(. Show that under H. 210g. N(IJ" 1). 0 ») is nK(Oo. respectively.3 is valid. Deduce that Theorem 6. . IJ..o log (X 0) PI.'1) and Tin '1(lJ n) and. Testing Simple versus Simple. j = 1.n:c"' 4c:2=9 3. Let e = {B o .gr. Consider testing H : 8 1 = 82 = 0 versus K : 0. . Xi. N(Oz. (ii) E" (a) Let )..n) satisfy the conditions of Problem 6. . IJ. aTB. \arO whereal. q + 1 <j < r S(lJo) .Xn ) be the likelihood ratio statistic... There exists an open ball about 1J0.7 Problems and C:com"p"":c'm"'. ..3. independent. Consider testing H : (j = 00 versus K : B = B1 . Assume that Pel of Pea' and that for some b > 0. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0).2... (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ .2.
 0. (d) Relate the result of (b) to the result of Problem 4(a). I . = v'1~c2' 0 <.) if 0. Z 2 = poX1Y v' . > OJ.0). and Dykstra. 7. Then xi t.\(X). are as above with the same hypothesis but = {(£It. li). > 0.. < .430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. Hint: By sufficiency reduce to n = 1. 0.\( Xi. = a + BB where .2 for 210g.6.1.3.i. Po . Show that 2log.) under the model and show that (a) If 0. (h) : 0 < 0. ±' < i < n) is distributed as a respectively. < cIJ"O. Sucb restrictions are natural if. (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. for instance.3. be i. 2 e( y'n(ii. under H.~. we test the efficacy of a treatment on the basis of two correlated responses per individual. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson.1.=0 where U ~ N(O.d... Exhibit the null distribution of 2 log . • i . = 0. ' · H In!: Consl'defa1O . Wright. Yi : 1 and X~ with probabilities ~. Yi : l<i<n).0.li : 1 <i< n) has a null distribution.0. (b) Suppose Xil Yi e 4 (e) Let (X" Y. which is a mixture of point mass at O. (0.. . XI and X~ but with probabilities ~ .. (b) If 0. Po) distribution and (Xi. . ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0.0"10' O"~o. > 0.. =0. Let Bi l 82 > 0 and H be as above. and ~ where sin. Show that (6.  OJ.0. 1) with probability ~ and U with the same distribution.)) ~ N(O. ii.3. 1 < i < n.0. j ~ ~ 6. = (T20 = 1 andZ 1= X 1.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6. (iii) Reparametrize as in Theorem 6. mixture of point mass at 0. /:1. O > 0.19) holds.. 0. Hint: ~ (i) Show that liOn) can be replaced by 1(0). 1988).\(Xi.) have an N.6. 2 log >'( Xi. In the model of Problem 5(a) compute the MLE (0.
A611 Problems for Section 6.0 < PCB) < 1. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born. (b) (6. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6.P(A)P(B) JP(A)(1 . 2.6 is related to Z of (6.2 to lin . hence.4. Exhibit the two solutions of (6. (e) Conclude that if A and B are independent.4.8) by Z ~ .. 10.4) explicitly and find the one that corresponds to the maximizer of the likelihood.nf. 9.8) for Z. 3. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.4. (a) Show that the correlation of Xl and YI is p = peA n B) . Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.3.21).1(0 0 ).7 Problems and Complements ~(1 ) 431 8. Show that under A2.4. 1. then Z has a limitingN(011) distribution. A3.8) (e) Derive the alternative form (6.Section 6.3.3.22) is a consistent estimate of ~l(lIo).P(A))P(B)(1 .10.3.4 ~ 1(11) is continuous. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6. Hint: Write and apply Theorem 6. Hint: Argue as in Problem 5. 0 < P( A) < 1.2. .P(B))· (b) Show that the sample correlation coefficient r studied in Example 5.
.4. . C i = Ci. = Lj N'j..C. b I Ri ( = Ti. Hint: (a) Consider the likelihood as a function of TJil. Cj = Cj] : ( nll. . B!c1b!. 8(r2' 82 I/ (8" + 8.. I (c) Show that under independence the conditional distribution of N ii given R.4 deduce that jf j(o:) (depending on chosen so that Tl. Fisher's Exact Test From the result of Problem 6.4... TJj2. .. .~. j. Consider the hypothesis H : Oij = TJil TJj2 for all i. 8 21 .. (b) Deduce that Pearson's X2 is given by (6.~l. TJj2 ~ = Cj n where R. =T 1.. i = 1. N 22 ) rv M (u.TI. .9) and has approximately a X1al)(bl) distribution under H. j = 1. ( where ( .b1only. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. ( rl.2 is 1t(Ci. R 2 = T2 = n . N 12 • N 21 ..1.2. (a) Show that the maximum likelihood estimates of TJil.. j = 1. (a) Show that then P[N'j niji i = 1...1 . 432 Inference in the Multiparameter Case Chapter 6 i 4. C j = L' N'j.~~.1 II . (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent. Let N ij be the entries of an let 1Jil = E~=l (}ij. nal ) n12. .6 that H is true.. . TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. . use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 . are the multinomial coefficients. N ll and N 21 are independent 8( r. Suppose in Problem 6. It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5.ra) nab .. . nab) A ) = B. . .D. S. Ti) (the hypergeometric distribution). n. a ... n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o...~~. (a) Let (NIl. This is known as Fisher's exact test. 6. TJj2 are given by TJil ~ = . CI.. . i = 1.)). 1 : X 2 (b) How would you. (22 ) as in the contingency table. n a2 ) . 8l! / (8 1l + 812 )).54). n R. 811 \ 12 .u.4. in principle. 7..
2:f . and that we wish to test H : Ih < f3E versus K : Ih > f3E.4. = 0 in the logistic model. consider the assertions.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. n. + IhZi.5? Admit Deny Men Women 1 19 . ~i = {3. there is a UMP level a test. Suppose that we know that {3.1""". (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. 10.. classified by sex and admission status.ziNi > k. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. Establish (6. where Pp~ [2:f . C2). B. .4. and petfonn the same test on the resulting table.) Show that (i) and (ii) imply (iii). 11. and that under H. Show that. if A and C are independent or B and C are independent.0. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n . (a) If A.(rl + cI). Would you accept Or reject the hypothesis of independence at the 0.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A. Zi not all equal. 9.12 I .. Give pvalues for the three cases.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. B INDEPENDENT) (iii) PeA (C is the complement of C. N 22 is conditionally distributed 1t(r2.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). for suitable a. The following table gives the number of applicants to the graduate program of a small department of the University of California. if and only if. Test in both cases whether the events [being a man] and [being admitted] are independent. Then combine the two tables into one.ZiNi > k] = a.. which rejects. (b). Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'.(rl + cd or N 22 < ql + n . (h) Construct an experiment and three events for which (i) and (ii) hold.14). but (iii) does not.7 Problems and Complements 433 8.Section 6. C are three events.
Show that the likelihO<Xt ratio test of H : O} = 010 .4.i. In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 .11 are obtained as realization of i.00 or have Eo(Nd : .3. • I 1 • • (a)P[Z. . " Ok = 8kO. . ... Tn.Ld.. . which is valid for the i. Y n ) have density as in (6... 434 • Inference in the Multiparameter Case Chapter 6 f • 12. I. ~ 8 m + [1(8 m )Dl(8 m ). a5 j J j 1 J I . 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2..) ~ B(m... Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973).(Ni ) < nOiO(1 .4.3.. n) and simplification of a model under which (N1. 3. Show that. · . if the design matrix has rank p.5.5 construct an exact test (level independent of (31).z(kJl) ~I 1 .. k and a 2 is unknown. Xi) are i. . mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6. . . in the logistic regression model.2. Let Xl.. .8) and.5. I < i < k are independent.. case. 2. .: nOiO..5 1. . .4).11.. .f..'Ir(. .. Given an initial value ()o define iterates  Om+l . i Problems for Section 6. . asymptotically N(O.. a 2 = 0"5 is of the form: Reject if (1/0".OiO)2 > k 2 or < k}. E {z(lJ. (Zn. i 16. " lOk vary freely. < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. for example.LJ2 Z i )). a under H. Verify that (6. . . I).4. X k be independent Xi '" N (Oi... = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown. a 2 ) where either a 2 = (known) and 01 . then 130 as defined by (6.15) is consistent.".l 1 .Oio)("Cooked data"). Use this to imitate the argument of Theorem 6. .1 14.d. .d.4.OkO) under H. 1 . Compute the Rao test statistic for H : (32 case. Nk) ~ M(n. 15. but Var. but under K may be either multinomial with 0 #. (a) Compute the Rao test for H : (32 Problem 6. 13.18) tends 2 to Xrq' Hint: (Xi . with (Xi I Z. . 
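The department-by-department admissions tables above illustrate Simpson's paradox: an association can reverse when tables are aggregated. Since the printed counts are partly corrupted in this copy, the sketch below uses hypothetical numbers of my own with the same structure to show the reversal.

```python
def admit_rate(admit, deny):
    return admit / (admit + deny)

# Hypothetical department tables (mine, not the book's):
# each tuple is (men_admit, men_deny, women_admit, women_deny)
dept_a = (80, 20, 18, 2)     # easy department, mostly men apply
dept_b = (10, 90, 18, 102)   # hard department, mostly women apply

# Within each department women are admitted at a higher rate...
for m_adm, m_den, w_adm, w_den in (dept_a, dept_b):
    assert admit_rate(w_adm, w_den) > admit_rate(m_adm, m_den)

# ...but pooling the departments reverses the comparison.
men_adm = dept_a[0] + dept_b[0]
men_den = dept_a[1] + dept_b[1]
wom_adm = dept_a[2] + dept_b[2]
wom_den = dept_a[3] + dept_b[3]
print(admit_rate(men_adm, men_den), admit_rate(wom_adm, wom_den))
```

Here women have the higher rate in both departments, yet the pooled table shows men ahead, because women apply mostly to the department with the low overall admission rate. This is why independence in the combined table and conditional independence given department are different hypotheses, as parts (a)-(d) ask you to verify.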
Suppose the ::i in Problem 6..4.i..3) l:::~ I (Xi . This is an approximation (for large k. Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI. or Oi = Bio (known) i = 1.5...4) is as claimed formula (2.. .4.)llmi~i(1 'lri). Suppose that (Z" Yj). Zi and so that (Zi. 010.20) for the regression described after (6.
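Fisher's method of scoring, from the first problem for Section 6.5, replaces the observed second derivative in Newton-Raphson by the expected information I(theta). For a one-parameter canonical exponential family the two coincide, which is the content of the problem. The sketch below (my code and data) runs scoring for the Bernoulli model in the logit parametrization and recovers the MLE p-hat = x-bar.

```python
from math import exp

def fisher_scoring_logit(xs, theta=0.0, steps=25):
    """Fisher scoring for the Bernoulli model with canonical (logit) link.

    l(theta) = sum(x_i)*theta - n*log(1 + e^theta); with the canonical link
    the expected and observed information coincide, so scoring is exactly
    Newton-Raphson: theta <- theta + I(theta)^{-1} * Dl(theta).
    """
    n, s = len(xs), sum(xs)
    for _ in range(steps):
        p = 1.0 / (1.0 + exp(-theta))
        score = s - n * p              # Dl(theta)
        info = n * p * (1.0 - p)       # I(theta)
        theta = theta + score / info
    return theta

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical Bernoulli data
theta_hat = fisher_scoring_logit(xs)
p_hat = 1.0 / (1.0 + exp(-theta_hat))
print(theta_hat, p_hat)
```

A few iterations suffice because the log-likelihood is strictly concave in theta; the fixed point satisfies the likelihood equation sum(x_i) = n*p, i.e., p-hat equals the sample proportion.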
).9).. . (y.ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j.9). C(T). 1'0) = jy 1'01 2 /<7.5. Set ~ = g(.d. 05. (d) Gaussian GLM.) . . P(I").) and v(. and b' and 9 are monotone.(3)..(... give the asymptotic distribution of y'n({3 .). h(y. L. contains an open L.i. Y follow the model p(y.. k. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.1 = 0 V fJ~ .T).. Show that when 9 is the canonical link. k..5. Wi = w(". d". In the random design case.) = ~ Var(Y)/c(T) b"(O).) ~ 1/11("i)( d~i/ d". ball about "k=l A" zlil in RP .Section 6. .F. Show that for the Gaussian linear model with known variance D(y.)z'j dC.b(Oi)} p(y. the deviance is 5.5. C(T).. Assume that there exist functions h(y. T).. .. T). h(y. and v(. (a) Show that the likelihood equations are ~ i=I L. Give 0. Wn). T. Show that. Yn ) are i. .. your result coincides with (6. . and v(.Oi)~h(y. .. Give 0. Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . J JrJ 4. under appropriate conditions. . Y) and that given Z = z. b(9). 00 d" ~ a(3j (b) Show that the Fisher information is Z]. distribution. ~ . the resuit of (c) coincides with (6.. <..y . as (Z. .7 Problems and Complements 435 (b) The linear span of {ZII). ~ N("" <7. b(B). Let YI. .WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI.= 1 . (c) Suppose (Z" Y I ). has the Poisson.5). (Zn.p. .. Suppose Y.. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known).. j = 1. . g("i) = zT {3. b(9). 9 = (b')l.. T. Find the canonical link function and show that when g is the canonical link. Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1.O(z)) where O(z) solves 1/(0) = gI(zT {3). J . g(p. . then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .) and C(T) such that the model for Yi can be written as O. (e) Suppose that Y..T)exp { C(T) where T is known...
Problems for Section 6.6

1. Consider the linear model of Example 6.6.2 and the hypothesis H : β_{q+1} = β_{0,q+1}, …, β_p = β_{0,p}, under the sole assumption that Eε = 0 and 0 < Var ε < ∞. Suppose A0–A6 are valid. Show that the LR, Wald, and Rao tests are still asymptotically equivalent in the sense that if 2 log λₙ, Wₙ, and Rₙ are the corresponding test statistics, then under H,

Wₙ = 2 log λₙ + o_p(1), Rₙ = Wₙ + o_p(1).

Note: 2 log λₙ, Wₙ, and Rₙ are computed under the assumption of the Gaussian linear model with σ² known.

Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under the parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation.

2. Establish (6.6.15) by verifying the condition of Theorem 6.6.2.

3. Show that the standard Wald test for the problem of Example 6.6.3 is as given in (6.6.10), and that replacing σ̂² by σ̃² in (6.6.10) creates a valid level α test. Hint: Apply Theorem 6.6.1 under this model, verifying the formula given.

4. Suppose X₁, …, Xₙ are i.i.d. P with 0 < Var_P X < ∞, and let ν(P) denote the unique median of P.

(a) Show that if the density f of P is symmetric about μ, then ν(P) = μ.

(b) If P has a positive density f at ν(P), then the sample median X̃ satisfies

√n(X̃ − ν(P)) → N(0, σ²(P)), where σ²(P) = 1/[4f²(ν(P))].

Show that if f is N(μ, σ²), then σ²(P)/σ² = π/2, so that σ²(P) > σ² = Var_P(X₁), the information bound and asymptotic variance of √n(X̄ − μ); but that if f(x) = ½ exp{−|x − μ|}, then σ²(P) < σ².

5. Consider the Rao test for H : θ = θ₀ for the model 𝒫 = {P_θ : θ ∈ Θ}, where A0–A6 hold. Suppose that the true P does not belong to 𝒫, but that if θ(P) is defined by (6.6.3), then θ(P) = θ₀. Show that if Var_P Dl(X, θ₀) is estimated by I(θ₀), then the Rao test does not in general have the correct asymptotic level, but that if the estimate n⁻¹ Σᵢ₌₁ⁿ Dl Dlᵀ(Xᵢ, θ₀) is used, then it is, in fact, asymptotically valid.

6. Show that σ̃² given in (6.6.14) is a consistent estimate of Var_P X⁽¹⁾ in Example 6.6.2.

7. In the binary data regression model of Section 6.4, let πᵢ = s(zᵢᵀβ), where s(t) is the continuous distribution function of a random variable symmetric about 0; that is, s(−t) = 1 − s(t), t ∈ R.
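The asymptotic variance formula for the sample median quoted above, σ²(P) = 1/[4f²(ν(P))], is easy to check by simulation. The following sketch is an illustration added to this transcription, not part of the text; the sample size, replication count, and seed are arbitrary choices. For N(0, 1) data the formula gives π/2 ≈ 1.571, and the Monte Carlo variance of √n(X̃ − 0) should land nearby (larger than Var(X₁) = 1, as claimed).

```python
import math
import random

random.seed(1)

def sqrtn_median_var(n, reps):
    # Monte Carlo variance of sqrt(n) * (sample median - true median)
    vals = []
    for _ in range(reps):
        xs = sorted(random.gauss(0.0, 1.0) for _ in range(n))
        med = xs[n // 2] if n % 2 else 0.5 * (xs[n // 2 - 1] + xs[n // 2])
        vals.append(math.sqrt(n) * med)
    m = sum(vals) / reps
    return sum((v - m) ** 2 for v in vals) / (reps - 1)

# Theoretical sigma^2(P) = 1 / (4 f^2(nu)); for N(0,1), f(0) = 1/sqrt(2*pi),
# so sigma^2(P) = pi/2, which exceeds Var(X_1) = 1.
theory = math.pi / 2
sim = sqrtn_median_var(n=201, reps=2000)
print(round(theory, 3), round(sim, 3))
```

The same experiment run with double-exponential data would give a simulated variance below Var(X₁), matching the other half of the comparison.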
(a) Show that π can be written in this form for both the probit and logit models.

(b) Suppose that the zᵢ are realizations of i.i.d. Zᵢ, that Z₁ is bounded with probability 1, and let β̂(X⁽ⁿ⁾), where X⁽ⁿ⁾ = {(Yᵢ, Zᵢ) : 1 ≤ i ≤ n}, be the MLE for the logit model. Show that if the correct model has πᵢ given by s as above and β = β₀, then β̂ is not a consistent estimate of β₀ unless s(t) is the logistic distribution function Λ(t). But if β_L is defined as the solution of Q(β) = E[Z₁ s(Z₁ᵀβ₀)], where Q(β) = E[Z₁ Λ(Z₁ᵀβ)], then √n(β̂ − β_L) has a limiting normal distribution with mean 0 and variance

Q̇⁻¹(β_L) Var(Z₁(Y₁ − Λ(Z₁ᵀβ_L))) [Q̇⁻¹(β_L)]ᵀ,

where Q̇(β) = E[Λ′(Z₁ᵀβ) Z₁Z₁ᵀ] is p × p and necessarily nonsingular.

8. (Model Selection.) Consider the classical Gaussian linear model (6.1.1),

Yᵢ = μᵢ + εᵢ, i = 1, …, n,

where μᵢ = zᵢᵀβ, the zᵢ are d-dimensional vectors of covariate (factor) values, and the εᵢ are i.i.d. Gaussian with mean zero and variance σ². Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d − p don't matter; that is, the model with β_{p+1} = ⋯ = β_d = 0. Write β⁽ᵖ⁾ = (β₁, …, β_p, 0, …, 0)ᵀ, let β̂⁽ᵖ⁾ be the LSE under this assumption, and let Ŷᵢ⁽ᵖ⁾ be the corresponding fitted value. A natural goal to entertain is to obtain new values Y₁*, …, Yₙ* at z₁, …, zₙ and to evaluate the performance of Ŷ⁽ᵖ⁾ by the (average) expected prediction error

EPE(p) = n⁻¹ Σᵢ₌₁ⁿ E(Yᵢ* − Ŷᵢ⁽ᵖ⁾)².

Here Y₁*, …, Yₙ* are independent of Y₁, …, Yₙ, and Yᵢ* is distributed as Yᵢ. Let RSS(p) = Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ⁽ᵖ⁾)² be the residual sum of squares, and suppose σ² is known.

(a) Show that EPE(p) = σ² + n⁻¹ Σᵢ₌₁ⁿ E(μᵢ − Ŷᵢ⁽ᵖ⁾)².

(b) Show that E[RSS(p)] = n EPE(p) − 2pσ².

(c) Deduce that n⁻¹RSS(p) + 2pσ²/n is an unbiased estimate of EPE(p). Model selection consists in selecting the p that minimizes this estimate and then using Ŷ⁽ᵖ⁾ as a predictor (Mallows, 1973, for instance).

(d) Show that (a)–(c) continue to hold if we assume the Gauss–Markov model.
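The unbiasedness claim behind this model-selection criterion — that n⁻¹RSS(p) + 2pσ²/n estimates the expected prediction error without bias — can be checked numerically. The sketch below is illustrative only: the design points, coefficients, and σ² are invented for the demonstration and are not from the text. It fits an underspecified (p = 1) submodel of a two-covariate Gaussian linear model by least squares and compares the average of the estimate with the average realized prediction error.

```python
import math
import random

random.seed(2)
n, sigma2 = 10, 1.0
z1 = [i / 10 for i in range(1, n + 1)]
z2 = [(-1) ** i for i in range(1, n + 1)]
beta1, beta2 = 2.0, 0.5
mu = [beta1 * a + beta2 * b for a, b in zip(z1, z2)]   # true means

def one_rep():
    y = [m + random.gauss(0.0, math.sqrt(sigma2)) for m in mu]
    # least-squares fit of the p = 1 submodel (slope on z1 only, no intercept)
    bhat = sum(a * yi for a, yi in zip(z1, y)) / sum(a * a for a in z1)
    fit = [bhat * a for a in z1]
    rss = sum((yi - f) ** 2 for yi, f in zip(y, fit))
    # realized prediction error: sigma^2 + n^{-1} sum (mu_i - fitted_i)^2
    pe = sigma2 + sum((m - f) ** 2 for m, f in zip(mu, fit)) / n
    est = rss / n + 2 * 1 * sigma2 / n   # p = 1 in this submodel
    return est, pe

reps = 20000
pairs = [one_rep() for _ in range(reps)]
mean_est = sum(p[0] for p in pairs) / reps
mean_pe = sum(p[1] for p in pairs) / reps
print(round(mean_est, 3), round(mean_pe, 3))
```

Note that the agreement holds even though the p = 1 submodel is biased; the identity only requires the mean and covariance structure of the Yᵢ.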
(e) Suppose d = 2 and μᵢ = β₁zᵢ₁ + β₂zᵢ₂. Evaluate EPE for (i) ηᵢ = β₁zᵢ₁ and (ii) ηᵢ = β₁zᵢ₁ + β₂zᵢ₂. Give values of β₁, β₂ and {zᵢ₁, zᵢ₂} such that the EPE in case (i) is smaller than in case (ii), and vice versa. Use σ² = 1 and n = 10.

Hint: (a) Note that E(Yᵢ* − Ŷᵢ⁽ᵖ⁾)² = σ² + E(μᵢ − Ŷᵢ⁽ᵖ⁾)². (b) The result depends only on the mean and covariance structure of the Yᵢ, i = 1, …, n; derive the result for the canonical model.

6.8 NOTES

Note for Section 6.1
(1) From the L.A. Heart Study after Dixon and Massey (1969).

Note for Section 6.2
(1) See Problem 3.9 for a discussion of densities with heavy tails.

Note for Section 6.4
(1) R. A. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good. For instance, we might envision the possibility that an overzealous assistant of Mendel "cooked" the data. To guard against such situations he argued that the test should be used in a two-tailed fashion and that we should reject H both for large and for small values of χ². Of course, this makes no sense for the model we discussed in this section, but it is reasonable if we consider alternatives to H that are not multinomial. LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of χ² as well as large ones (Problem 6.16). The moral of the story is that the practicing statistician should be on guard! For more on this theme see Section 6.6.

6.9 REFERENCES

Cox, D. R., The Analysis of Binary Data, London: Methuen, 1970.
Dixon, W. J., and F. J. Massey, Introduction to Statistical Analysis, 3rd ed., New York: McGraw-Hill, 1969.
Fisher, R. A., Statistical Methods for Research Workers, 13th ed., New York: Hafner, 1958.
Graybill, F., An Introduction to Linear Statistical Models, Vol. I, New York: McGraw-Hill, 1961.
Haberman, S., The Analysis of Frequency Data, Chicago: University of Chicago Press, 1974.
Hald, A., Statistical Theory with Engineering Applications, New York: Wiley, 1952.
Huber, P., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., Univ. of California Press, 221-233 (1967).
Kass, R., Tierney, L., and Kadane, J., "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).
Koenker, R., and d'Orey, V., "Computing regression quantiles," Applied Statistics, 36, 383-393 (1987).
Laplace, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (1789). (Reprinted in Oeuvres Complètes, 11, Paris: Gauthier-Villars.)
Mallows, C., "Some comments on Cp," Technometrics, 15, 661-675 (1973).
McCullagh, P., and J. Nelder, Generalized Linear Models, 2nd ed., London: Chapman and Hall, 1989.
Portnoy, S., and R. Koenker, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).
Rao, C. R., Linear Statistical Inference and Its Applications, 2nd ed., New York: J. Wiley & Sons, 1973.
Robertson, T., F. Wright, and R. Dykstra, Order Restricted Statistical Inference, New York: Wiley, 1988.
Scheffé, H., The Analysis of Variance, New York: Wiley, 1959.
Schervish, M., Theory of Statistics, New York: Springer, 1995.
Stigler, S., The History of Statistics: The Measurement of Uncertainty Before 1900, Cambridge, MA: Harvard University Press, 1986.
Weisberg, S., Applied Linear Regression, 2nd ed., New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need. The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know, which are relevant to our study of statistics; we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that, to be called an experiment, such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not, in addition, require that every repetition yield the same outcome, although we do not exclude this case. The situations we are going to model can all be thought of as random experiments. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
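This stabilization of relative frequencies can be illustrated by simulation. The sketch below is an illustration added to this review, not part of the original text; the coin is fair and the sample sizes are arbitrary choices. The relative frequency n_A/n of the event A = "heads" settles near 1/2 as n grows.

```python
import random

random.seed(0)

def rel_freq_heads(n):
    # n_A / n: relative frequency of the event A = "heads" in n independent tosses
    heads = sum(random.random() < 0.5 for _ in range(n))
    return heads / n

for n in (100, 10_000, 200_000):
    print(n, rel_freq_heads(n))
```

The fluctuations shrink at rate 1/√n, which is why the printed frequencies crowd ever closer to 1/2.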
This long-term relative frequency n_A/n, where n_A is the number of times the possible outcome A occurs in n repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." For a discussion of this approach and further references, the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), Lindley (1965), de Groot (1970), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by Ω.

A.1.2 A sample point is any member of Ω and is typically denoted by ω.

A.1.3 Subsets of Ω are called events. If ω ∈ Ω, {ω} is called an elementary event. If A contains more than one point, it is called a composite event. We denote events by A, B, C, and so on, or by a description of their members. The complement of an event A is denoted by Aᶜ; the null set or impossible event is the complement of Ω. We shall use the symbols ∪, ∩, −, ⊂ for union, intersection, set-theoretic difference, and inclusion, as is usual in elementary set theory. In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A." The set operations we have mentioned have interpretations also. For example, the relation A ⊂ B between sets, considered as events, means that the occurrence of A implies the occurrence of B.

A.1.4 We will let 𝒜 denote a class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to every subset of Ω, although we do not exclude this case. 𝒜 is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; Loève, 1977).

A probability distribution or measure is a nonnegative function P on 𝒜 having the following properties. (Recall that ∪ᵢ Aᵢ is just the collection of points that are in any one of the sets Aᵢ, and that two sets are disjoint if they have no points in common.)

(i) P(Ω) = 1.

(ii) If A₁, A₂, … are pairwise disjoint sets in 𝒜,
then

P(∪ᵢ₌₁^∞ Aᵢ) = Σᵢ₌₁^∞ P(Aᵢ).

A.1.5 The three objects Ω, 𝒜, and P together describe a random experiment mathematically. We shall refer to the triple (Ω, 𝒜, P) either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of 𝒜.

References
Gnedenko (1967) Chapter 1, Sections 1-3
Grimmett and Stirzaker (1992) Sections 1.1-1.3
Hoel, Port, and Stone (1971) Sections 1.1-1.2
Parzen (1960) Chapter 1, Sections 1-5
Pitman (1993) Sections 1.2-1.3

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of P.

A.2.1 If A ⊂ B, then P(B − A) = P(B) − P(A).

A.2.2 P(Aᶜ) = 1 − P(A); in particular, P(∅) = 0.

A.2.3 If A ⊂ B, then P(B) ≥ P(A).

A.2.4 0 ≤ P(A) ≤ 1.

A.2.5 P(∪ᵢ₌₁^∞ Aᵢ) ≤ Σᵢ₌₁^∞ P(Aᵢ).

A.2.6 P(∩ᵢ₌₁ⁿ Aᵢ) ≥ 1 − Σᵢ₌₁ⁿ P(Aᵢᶜ) (Bonferroni's inequality).

A.2.7 If A₁ ⊂ A₂ ⊂ ⋯ ⊂ Aₙ ⊂ ⋯, then P(∪ₙ₌₁^∞ Aₙ) = limₙ→∞ P(Aₙ).

References
Gnedenko (1967) Chapter 1, Section 8
Grimmett and Stirzaker (1992) Section 1.3
Parzen (1960) Chapter 1, Sections 4-5
Pitman (1993) Section 1.3

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if Ω is finite or countably infinite and every subset of Ω is assigned a probability. In this case, we can write Ω = {ω₁, ω₂, …} and 𝒜 is the collection of subsets of Ω. By axiom (ii) of (A.1.4), we have for any event A,

P(A) = Σ_{ωᵢ ∈ A} P({ωᵢ}).   (A.3.1)

References
Gnedenko (1967) Chapter 1, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.3
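In a discrete model these properties can be verified by direct enumeration of sample points. The following sketch is illustrative only — the particular sample space (two tosses of a biased coin) and its probabilities are invented for the example. It encodes P as a dictionary over Ω and checks monotonicity, the difference rule, subadditivity, and Bonferroni's inequality.

```python
from itertools import product

# a discrete model: Omega = outcomes of two tosses of a biased coin,
# with P({w}) assigned by independence of the tosses
p_heads = 0.3
omega = list(product("HT", repeat=2))
prob = {w: (p_heads if w[0] == "H" else 1 - p_heads) *
           (p_heads if w[1] == "H" else 1 - p_heads) for w in omega}

def P(event):
    # P(A) = sum of P({w}) over sample points w in A
    return sum(prob[w] for w in event)

A = {w for w in omega if w[0] == "H"}    # first toss heads
B = {w for w in omega if "H" in w}       # at least one head
assert A <= B and P(A) <= P(B)                     # A subset of B gives P(A) <= P(B)
assert abs(P(B - A) - (P(B) - P(A))) < 1e-12       # difference rule, since A subset of B
assert P(A | B) <= P(A) + P(B)                     # subadditivity
comp = set(omega)
assert P(A & B) >= 1 - P(comp - A) - P(comp - B)   # Bonferroni's inequality
print(P(A), P(B))
```

Any finite model represented this way passes the same checks, since they follow from the axioms alone.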
An important special case arises when Ω has a finite number of elements, say N, all of which are equally likely. Then P({ω}) = 1/N for every ω ∈ Ω, and

P(A) = (Number of elements in A)/N.

Suppose that ω₁, …, ω_N are the members of some population (humans, guinea pigs, flowers, machines, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to this model. Such selection can be carried out if N is small by putting the "names" of the ωᵢ in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used.

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event B such that P(B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A | B), by

P(A | B) = P(A ∩ B)/P(B).   (A.4.1)

If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P(A | B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. From a heuristic point of view, P(A | B) is the chance we would assign to the event A if we were told that B has occurred. In fact, for fixed B as before, the function P(· | B) is a probability measure on (Ω, 𝒜), which is referred to as the conditional probability measure given B.

Transposition of the denominator in (A.4.1) gives the multiplication rule,

P(A ∩ B) = P(B)P(A | B).   (A.4.2)

If B₁, B₂, … are (pairwise) disjoint events of positive probability whose union is Ω, then

P(A) = Σⱼ P(A | Bⱼ)P(Bⱼ).   (A.4.3)
If P(A) is positive, we can combine (A.4.1) and (A.4.3) and obtain Bayes' rule,

P(Bᵢ | A) = P(A | Bᵢ)P(Bᵢ) / Σⱼ P(A | Bⱼ)P(Bⱼ).   (A.4.5)

The conditional probability of A given B₁, …, Bₙ, written P(A | B₁, …, Bₙ), is defined by

P(A | B₁, …, Bₙ) = P(A | B₁ ∩ ⋯ ∩ Bₙ).   (A.4.6)

Simple algebra leads to the multiplication rule,

P(B₁ ∩ ⋯ ∩ Bₙ) = P(B₁)P(B₂ | B₁)P(B₃ | B₁, B₂) ⋯ P(Bₙ | B₁, …, Bₙ₋₁),   (A.4.7)

whenever P(B₁ ∩ ⋯ ∩ Bₙ₋₁) > 0.

Two events A and B are said to be independent if

P(A ∩ B) = P(A)P(B).   (A.4.8)

If P(B) > 0, the relation (A.4.8) may be written

P(A | B) = P(A).   (A.4.9)

In other words, A and B are independent if knowledge of B does not affect the probability of A. The events A₁, …, Aₙ are said to be independent if

P(Aᵢ₁ ∩ ⋯ ∩ Aᵢₖ) = Πⱼ₌₁ᵏ P(Aᵢⱼ)   (A.4.10)

for any subset {i₁, …, iₖ} of the integers {1, …, n}. If all the P(Aᵢ) are positive, relation (A.4.10) is equivalent to requiring that

P(Aⱼ | Aᵢ₁, …, Aᵢₖ) = P(Aⱼ)   (A.4.11)

for any j and {i₁, …, iₖ} such that j ∉ {i₁, …, iₖ}.

References
Gnedenko (1967) Chapter 1, Section 9; Chapter 3, Sections 1-4
Grimmett and Stirzaker (1992) Section 1.4
Hoel, Port, and Stone (1971) Sections 1.4-1.5
Parzen (1960) Chapter 2, Sections 1-4
Pitman (1993) Section 1.4
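Bayes' rule and the total probability formula translate directly into computation. The numbers below are an invented diagnostic-test style illustration, not from the text: B₁ and B₂ partition Ω, and P(B₁ | A) is computed from the P(A | Bⱼ) and P(Bⱼ).

```python
# partition: B1 ("condition present"), B2 ("absent"); event A: a positive test
priors = {"B1": 0.01, "B2": 0.99}        # P(B_j), summing to 1
likelihood = {"B1": 0.95, "B2": 0.05}    # P(A | B_j)

# total probability: P(A) = sum_j P(A | B_j) P(B_j)
p_A = sum(likelihood[b] * priors[b] for b in priors)

# Bayes' rule: P(B1 | A) = P(A | B1) P(B1) / P(A)
posterior_B1 = likelihood["B1"] * priors["B1"] / p_A
print(round(p_A, 4), round(posterior_B1, 4))
```

Even with an accurate test, the small prior P(B₁) keeps the posterior modest — a standard consequence of Bayes' rule worth computing rather than guessing.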
. and (A. BnJl (AA. The events All .lO) for any subset {iI..S) The conditional probability of A given B I defined by .. n Bnd > O.. Section 4. relation (A. ..) = II P(A. . Sections IA Pittnan (1993) Section lA .) A) ~ ""_ PIA I B )p(BT L..3).7) whenever P(B I n ..Section AA Conditional Probability and Independence 445 If P( A) is positive. .. . . . the relation (AA. . A and B are independent if knowledge of B does not affect the probability of A. I.4.• . n··· nA i . B n ) and for any events A.. P(B" I Bl... .}... .lO) is equivalent to requiring that P(A J I A... (AA.8) may be written P(A I B) ~ P(A) (AA. I PeA I B.9) In other words.. Ifall theP(A i ) are positive.n}..} such thatj ct {i" . Chapter 3. .A. An are said to be independent if P(A i ..... (AA. .1 )..I J (AA. n B n ) > O. Port. B 2 ) .4...I_1 .. Two events A and B are said to be independent if P(A n B) If P( B) ~ P(A)P(B). Sections 9 Grimmett and Stirzaker (1992) Section IA Hoel.)P(B.···.. References Gnedenko (1967) Chapter I..En such that P(B I n . we can combine (A.4) and obtain Bayes rule . P(il. . Simple algebra leads to the multiplication rule.S parzen (1960) Chapter 2.4.i.i k } of the integers {l.8) > 0.) ~ P(A J ) (AA.) j=l k (AA..i. P(B 1 n·· n B n ) ~ P(B 1 )P(B2 I BJlP(B3 I ill.II) for any j and {i" ... B I . and Stone (1971) Sections lA. B ll is written P(A B I •." .
..5 COMPOUND EXPERIMENTS There is an intuitive notion of independent experiments.3).1 Recall that if AI. then intuitively we should have alI classes of events AI.. the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. .. Pn({W n }) foral! Wi E fli. . This makes sense in the compound experiment. .0) given by . X . 1 £n if the n stage compound experiment has its probability structure specified by (A5. W n ) : W~ E Ai. independent. that is. ... Loeve. To be able to talk about independence and dependence of experiments..2) 1 i " 1 If we are given probabilities Pi on (fl" AI). the Cartesian product Al x .. £n with respective sample spaces OJ.5. P. x . .1 X Ai x 0i+ I x .. .o i . the subsets A ofn to which we can assign probability(1).53) holds provided that P({(w"". I ! i w? 1 . On the other hand..) . Pn(A n ). x fl 2 x '" x fln)P(fl.' . There are certain natural ways of defining sigma fields and probabilities for these experiments. If we are given n experiments (probability models) t\.0 1 x··· XOi_I x {wn XOi+1 x··· x.1 . . . A. The reader not interested in the formalities may skip to Section A.. . I <i< n. 1 < i < n}. ..3) for events Al x .. then the probability of a given chip in the second draw will depend on the outcome of the first dmw.. The (n stage) compound experiment consists in performing component experiments £1. X A 2 X '" x fl n ). (A5. P(fl.. wn ) is a sample point in .5. An are events." X An) = P.0 if and only if WI is the outcome of £1. Chung.o n = {(WI.. The interpretation of the sample space 0 is that (WI. x'" . (A53) It may be shown (Billingsley. x " . if we toss a coin twice."" .. . To say that £t has had outcome pound event (in . \ I I .. it can be uniquely extended to the sigma field A specified in note (l) at the end of this appendix. . r. x flnl n Ifl) x A 2 x ' . P([A 1 x fl2 X . a compound experiment is one made up of two or more component experiments. .'" £n and recording all n outcomes. . 
A2). x An) ~ P(A. . (A5A) i. x An of AI.An with Ai E Ai..w n ) E o : Wi = wf}. . x On in the compound experiment. .. . X fl n 1 X An). I P n on (fln> An). . then Ai corresponds to .on. if Ai E Ai.. 1977) that if P is defined by (A. x An. the sigma field corresponding to £i.0 1 X '" x . For example. on (fl2. We shall speak of independent experiments £1.0 ofthe n stage compound experiment is by definition 0 1 x·· . InformaUy. x flnl n ' . If P is the probability measure defined on the sigma field A of the compound experiment. 1974. ) = P(A.. W2 is the outcome of £2 and E Oi corresponds to the Occurrence of the comso on.6 where examples of compound experiments are given. .' .. I n  I . These will be discussed in this section. and if we do not replace the first chip drawn before the second draw... x An by 1 j P(A.wn )}) I: " = PI ({Wi}) . then (A52) defines P for A) x ".. . If we want to make the £i independent.. An is by definition {(WI. In the discrete case (A. . 446 A Review of Basic Probability Theory Appendix A A. . it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip. 1995. (A..on. 1 1 .. we should have . then the sample space . More generally. we introduce the notion of a compound experiment.
Specifying P when the ℰᵢ are dependent is more complicated. In the discrete case we have, by the multiplication rule (A.4.7),

P({(ω₁, …, ωₙ)}) = P(ℰ₁ has outcome ω₁)P(ℰ₂ has outcome ω₂ | ℰ₁ has outcome ω₁) ⋯ P(ℰₙ has outcome ωₙ | ℰ₁ has outcome ω₁, …, ℰₙ₋₁ has outcome ωₙ₋₁)   (A.5.5)

for each (ω₁, …, ωₙ) with ωᵢ ∈ Ωᵢ. The probability structure is determined by these conditional probabilities, and conversely.

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign probability p to S, we shall refer to such an experiment as a Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli trials with success probability p. If Ω is the sample space of the compound experiment, any point ω ∈ Ω is an n-dimensional vector of S's and F's and, by (A.5.4),

P({ω}) = p^{k(ω)}(1 − p)^{n−k(ω)},   (A.6.2)

where k(ω) is the number of S's appearing in ω. If A_k is the event [exactly k S's occur], then

P(A_k) = (n choose k) p^k (1 − p)^{n−k}, k = 0, 1, …, n,   (A.6.3)

where

(n choose k) = n!/[k!(n − k)!].

A.6.4 More generally, if an experiment has q possible outcomes ω₁, …, ω_q and P({ωᵢ}) = pᵢ, i = 1, …, q, we refer to such an experiment as a multinomial trial with probabilities p₁, …, p_q. If the experiment is performed n times independently, the compound experiment is called n multinomial trials with probabilities p₁, …, p_q. If Ω is the sample space of this experiment and ω ∈ Ω, then in the discrete case ω
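The binomial formula for P(A_k) above can be checked against a brute-force enumeration of all 2ⁿ outcome sequences, summing P({ω}) over the sequences with exactly k successes. A small illustrative sketch (the values of n and p are arbitrary choices for the demonstration):

```python
from itertools import product
from math import comb

def binom_prob(n, k, p):
    # P(A_k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def enum_prob(n, k, p):
    # sum P({w}) over all outcome sequences w with exactly k successes
    total = 0.0
    for w in product((0, 1), repeat=n):   # 1 = success (S), 0 = failure (F)
        if sum(w) == k:
            total += p ** k * (1 - p) ** (n - k)
    return total

n, p = 6, 0.3
for k in range(n + 1):
    assert abs(binom_prob(n, k, p) - enum_prob(n, k, p)) < 1e-12
print([round(binom_prob(n, k, p), 4) for k in range(n + 1)])
```

The enumeration makes explicit that each of the C(n, k) sequences in A_k carries the same probability p^k(1 − p)^{n−k}, which is exactly how (A.6.3) follows from (A.6.2).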