Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Peter J. Bickel
University of California
Kjell A. Doksum
University of California
PRENTICE HALL
Upper Saddle River, New Jersey 07458
Library of Congress Cataloging-in-Publication Data

Bickel, Peter J.
    Mathematical statistics: basic ideas and selected topics / Peter J. Bickel, Kjell A. Doksum. 2nd ed.
        p. cm.
    Includes bibliographical references and index.
    ISBN 0-13-850363-X (v. 1)
    1. Mathematical statistics. I. Doksum, Kjell A. II. Title.

QA276.B47 2001
519.5 dc21    00-031377
Acquisition Editor: Kathleen Boothby Sestak
Editor in Chief: Sally Yagan
Assistant Vice President of Production and Manufacturing: David W. Riccardi
Executive Managing Editor: Kathleen Schiaparelli
Senior Managing Editor: Linda Mihatov Behrens
Production Editor: Bob Walters
Manufacturing Buyer: Alan Fischer
Manufacturing Manager: Trudy Pisciotti
Marketing Manager: Angela Battle
Marketing Assistant: Vince Jansen
Director of Marketing: John Tweeddale
Editorial Assistant: Joanne Wendelken
Art Director: Jayne Conte
Cover Design: Jayne Conte
©2001, 1977 by Prentice-Hall, Inc.
Upper Saddle River, New Jersey 07458
All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1
ISBN: 0-13-850363-X
Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall of Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
To Erich L. Lehmann
CONTENTS
PREFACE TO THE SECOND EDITION: VOLUME I

PREFACE TO THE FIRST EDITION

1 STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA
    1.1 Data, Models, Parameters, and Statistics
        1.1.1 Data and Models
        1.1.2 Parametrizations and Parameters
        1.1.3 Statistics as Functions on the Sample Space
        1.1.4 Examples, Regression Models
    1.2 Bayesian Models
    1.3 The Decision Theoretic Framework
        1.3.1 Components of the Decision Theory Framework
        1.3.2 Comparison of Decision Procedures
        1.3.3 Bayes and Minimax Criteria
    1.4 Prediction
    1.5 Sufficiency
    1.6 Exponential Families
        1.6.1 The One-Parameter Case
        1.6.2 The Multiparameter Case
        1.6.3 Building Exponential Families
        1.6.4 Properties of Exponential Families
        1.6.5 Conjugate Families of Prior Distributions
    1.7 Problems and Complements
    1.8 Notes
    1.9 References

2 METHODS OF ESTIMATION
    2.1 Basic Heuristics of Estimation
        2.1.1 Minimum Contrast Estimates; Estimating Equations
        2.1.2 The Plug-In and Extension Principles
    2.2 Minimum Contrast Estimates and Estimating Equations
        2.2.1 Least Squares and Weighted Least Squares
        2.2.2 Maximum Likelihood
    2.3 Maximum Likelihood in Multiparameter Exponential Families
    *2.4 Algorithmic Issues
        2.4.1 The Method of Bisection
        2.4.2 Coordinate Ascent
        2.4.3 The Newton-Raphson Algorithm
        2.4.4 The EM (Expectation/Maximization) Algorithm
    2.5 Problems and Complements
    2.6 Notes
    2.7 References

3 MEASURES OF PERFORMANCE
    3.1 Introduction
    3.2 Bayes Procedures
    3.3 Minimax Procedures
    *3.4 Unbiased Estimation and Risk Inequalities
        3.4.1 Unbiased Estimation, Survey Sampling
        3.4.2 The Information Inequality
    *3.5 Nondecision Theoretic Criteria
        3.5.1 Computation
        3.5.2 Interpretability
        3.5.3 Robustness
    3.6 Problems and Complements
    3.7 Notes
    3.8 References

4 TESTING AND CONFIDENCE REGIONS
    4.1 Introduction
    4.2 Choosing a Test Statistic: The Neyman-Pearson Lemma
    4.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models
    4.4 Confidence Bounds, Intervals, and Regions
    4.5 The Duality Between Confidence Regions and Tests
    *4.6 Uniformly Most Accurate Confidence Bounds
    *4.7 Frequentist and Bayesian Formulations
    4.8 Prediction Intervals
    4.9 Likelihood Ratio Procedures
        4.9.1 Introduction
        4.9.2 Tests for the Mean of a Normal Distribution, Matched Pair Experiments
        4.9.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations
        4.9.4 The Two-Sample Problem with Unequal Variances
        4.9.5 Likelihood Ratio Procedures for Bivariate Normal Distributions
    4.10 Problems and Complements
    4.11 Notes
    4.12 References

5 ASYMPTOTIC APPROXIMATIONS
    5.1 Introduction: The Meaning and Uses of Asymptotics
    5.2 Consistency
        5.2.1 Plug-In Estimates and MLEs in Exponential Family Models
        5.2.2 Consistency of Minimum Contrast Estimates
    5.3 First- and Higher-Order Asymptotics: The Delta Method with Applications
        5.3.1 The Delta Method for Moments
        5.3.2 The Delta Method for In Law Approximations
        5.3.3 Asymptotic Normality of the Maximum Likelihood Estimate in Exponential Families
    5.4 Asymptotic Theory in One Dimension
        5.4.1 Estimation: The Multinomial Case
        *5.4.2 Asymptotic Normality of Minimum Contrast and M-Estimates
        *5.4.3 Asymptotic Normality and Efficiency of the MLE
        *5.4.4 Testing
        *5.4.5 Confidence Bounds
    5.5 Asymptotic Behavior and Optimality of the Posterior Distribution
    5.6 Problems and Complements
    5.7 Notes
    5.8 References

6 INFERENCE IN THE MULTIPARAMETER CASE
    6.1 Inference for Gaussian Linear Models
        6.1.1 The Classical Gaussian Linear Model
        6.1.2 Estimation
        6.1.3 Tests and Confidence Intervals
    *6.2 Asymptotic Estimation Theory in p Dimensions
        6.2.1 Estimating Equations
        6.2.2 Asymptotic Normality and Efficiency of the MLE
        6.2.3 The Posterior Distribution in the Multiparameter Case
    *6.3 Large Sample Tests and Confidence Regions
        6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic
        6.3.2 Wald's and Rao's Large Sample Tests
    *6.4 Large Sample Methods for Discrete Data
        6.4.1 Goodness-of-Fit in a Multinomial Model. Pearson's χ² Test
        6.4.2 Goodness-of-Fit to Composite Multinomial Models. Contingency Tables
        6.4.3 Logistic Regression for Binary Responses
    *6.5 Generalized Linear Models
    *6.6 Robustness Properties and Semiparametric Models
    6.7 Problems and Complements
    6.8 Notes
    6.9 References

A A REVIEW OF BASIC PROBABILITY THEORY
    A.1 The Basic Model
    A.2 Elementary Properties of Probability Models
    A.3 Discrete Probability Models
    A.4 Conditional Probability and Independence
    A.5 Compound Experiments
    A.6 Bernoulli and Multinomial Trials, Sampling With and Without Replacement
    A.7 Probabilities on Euclidean Space
    A.8 Random Variables and Vectors: Transformations
    A.9 Independence of Random Variables and Vectors
    A.10 The Expectation of a Random Variable
    A.11 Moments
    A.12 Moment and Cumulant Generating Functions
    A.13 Some Classical Discrete and Continuous Distributions
    A.14 Modes of Convergence of Random Variables and Limit Theorems
    A.15 Further Limit Theorems and Inequalities
    A.16 Poisson Process
    A.17 Notes
    A.18 References

B ADDITIONAL TOPICS IN PROBABILITY AND ANALYSIS
    B.1 Conditioning by a Random Variable or Vector
        B.1.1 The Discrete Case
        B.1.2 Conditional Expectation for Discrete Variables
        B.1.3 Properties of Conditional Expected Values
        B.1.4 Continuous Variables
        B.1.5 Comments on the General Case
    B.2 Distribution Theory for Transformations of Random Vectors
        B.2.1 The Basic Framework
        B.2.2 The Gamma and Beta Distributions
    B.3 Distribution Theory for Samples from a Normal Population
        B.3.1 The χ², F, and t Distributions
        B.3.2 Orthogonal Transformations
    B.4 The Bivariate Normal Distribution
    B.5 Moments of Random Vectors and Matrices
        B.5.1 Basic Properties of Expectations
        B.5.2 Properties of Variance
    B.6 The Multivariate Normal Distribution
        B.6.1 Definition and Density
        B.6.2 Basic Properties. Conditional Distributions
    B.7 Convergence for Random Vectors: Op and op Notation
    B.8 Multivariate Calculus
    B.9 Convexity and Inequalities
    B.10 Topics in Matrix Theory and Elementary Hilbert Space Theory
        B.10.1 Symmetric Matrices
        B.10.2 Order on Symmetric Matrices
        B.10.3 Elementary Hilbert Space Theory
    B.11 Problems and Complements
    B.12 Notes
    B.13 References

C TABLES
    Table I The Standard Normal Distribution
    Table I' Auxiliary Table of the Standard Normal Distribution
    Table II t Distribution Critical Values
    Table III χ² Distribution Critical Values
    Table IV F Distribution Critical Values

INDEX
PREFACE TO THE SECOND EDITION: VOLUME I
In the twenty-three years that have passed since the first edition of our book appeared statistics has changed enormously under the impact of several forces:
(1) The generation of what were once unusual types of data such as images, trees (phylogenetic and other), and other types of combinatorial objects.
(2) The generation of enormous amounts of data, terabytes (the equivalent of 10^12 characters) for an astronomical survey over three years.
(3) The possibility of implementing computations of a magnitude that would have once been unthinkable. The underlying sources of these changes have been the exponential change in computing speed (Moore's "law") and the development of devices (computer controlled) using novel instruments and scientific techniques (e.g., NMR tomography, gene sequencing). These techniques often have a strong intrinsic computational component. Tomographic data are the result of mathematically based processing. Sequencing is done by applying computational algorithms to raw gel electrophoresis data. As a consequence the emphasis of statistical theory has shifted away from the small sample optimality results that were a major theme of our book in a number of directions:
(1) Methods for inference based on larger numbers of observations and minimal assumptions: asymptotic methods in non- and semiparametric models, models with "infinite" number of parameters.

(2) The construction of models for time series, temporal spatial series, and other complex data structures using sophisticated probability modeling but again relying for analytical results on asymptotic approximation. Multiparameter models are the rule.

(3) The use of methods of inference involving simulation as a key element such as the bootstrap and Markov Chain Monte Carlo.
(4) The development of techniques not describable in "closed mathematical form" but rather through elaborate algorithms for which problems of existence of solutions are important and far from obvious. Despite advances in computing speed, some methods run quickly in real time. Others do not and some, though theoretically attractive, cannot be implemented in a human lifetime.

(5) The study of the interplay between numerical and statistical considerations.

(6) The study of the interplay between the number of observations and the number of parameters of a model and the beginnings of appropriate asymptotic theories.

There have, of course, been other important consequences such as the extensive development of graphical and other exploratory methods for which theoretical development and connection with mathematics have been minimal. These will not be dealt with in our work.

As a consequence our second edition, reflecting what we now teach our graduate students, is much changed from the first. Our one long book has grown to two volumes, each to be only a little shorter than the first edition.

Volume I, which we present in 2000, covers material we now view as important for all beginning graduate students in statistics and science and engineering graduate students whose research will involve statistics intrinsically rather than as an aid in drawing conclusions. Specifically, Volume I covers the material of Chapters 1-6 and Chapter 10 of the first edition with pieces of Chapters 7-10 and includes Appendix A on basic probability theory. However, Chapter 1 of the first edition now has become part of a larger Appendix B, which includes more advanced topics from probability theory such as the multivariate Gaussian distribution, weak convergence in Euclidean spaces, and probability inequalities as well as more advanced topics in matrix theory and analysis. The latter include the principal axis and spectral theorems for Euclidean space and the elementary theory of convex functions on Rd as well as an elementary introduction to Hilbert space theory. Appendix B is as self-contained as possible with proofs of most statements, problems, and references to the literature for proofs of the deepest results such as the spectral theorem. The reason for these additions are the changes in subject matter necessitated by the current areas of importance in the field.

As in the first edition, we do not require measure theory but assume from the start that our models are what we call "regular." That is, we assume either a discrete probability whose support does not depend on the parameter set, or the absolutely continuous case with a density. Hilbert space theory is not needed, but for those who know this topic Appendix B points out interesting connections to prediction and linear regression analysis.

In this edition we pursue our philosophy of describing the basic concepts of mathematical statistics relating theory to practice. However, our focus and order of presentation have changed. From the beginning we stress function-valued parameters, such as the density, and function-valued statistics, such as the empirical distribution function. These objects that are the building blocks of most modern models require concepts involving moments of random vectors and convexity that are given in Appendix B. We also, from the start, include examples that are important in applications, such as regression experiments. Moreover, instead of beginning with parametrized models we include from the start non- and semiparametric models, then go to parameters and parametric models stressing the role of identifiability.

Chapters 1-4 develop the basic principles and examples of statistics. Save for these changes of emphasis the other major new elements of Chapter 1, which parallels Chapter 2 of the first edition, are an extended discussion of prediction and an expanded introduction to k-parameter exponential families. There is more material on Bayesian models and analysis.

Chapter 2 of this edition parallels Chapter 3 of the first and deals with estimation. Major differences here are a greatly expanded treatment of maximum likelihood estimates (MLEs), including a complete study of MLEs in canonical k-parameter exponential families. Other novel features of this chapter include a detailed analysis, including proofs of convergence, of a standard but slow algorithm for computing MLEs in multiparameter exponential families and an introduction to the EM algorithm, one of the main ingredients of most modern algorithms for inference.

Chapters 3 and 4 parallel the treatment of Chapters 4 and 5 of the first edition on the theory of testing and confidence regions, including some optimality theory for estimation as well and elementary robustness considerations. The main difference in our new treatment is the downplaying of unbiasedness both in estimation and testing and the presentation of the decision theory of Chapter 10 of the first edition at this stage.

Chapter 5 of the new edition is devoted to asymptotic approximations. It includes the initial theory presented in the first edition but goes much further with proofs of consistency and asymptotic normality and optimality of maximum likelihood procedures in inference. Also new is a section relating Bayesian and frequentist inference via the Bernstein-von Mises theorem. Robustness from an asymptotic theory point of view appears also.

Chapter 6 is devoted to inference in multivariate (multiparameter) models. Included are asymptotic normality of maximum likelihood estimates, inference in the general linear model, Wilks theorem on the asymptotic distribution of the likelihood ratio test, the Wald and Rao statistics and associated confidence regions, and some parallels to the optimality theory and comparisons of Bayes and frequentist procedures given in the univariate case in Chapter 5. Generalized linear models are introduced as examples. This chapter uses multivariate calculus in an intrinsic way and can be viewed as an essential prerequisite for the more advanced topics of Volume II.

Although we believe the material of Chapters 5 and 6 has now become fundamental, there is clearly much that could be omitted at a first reading, and we star sections that could be omitted by instructors with a classical bent and others that could be omitted by instructors with more computational emphasis. There are clear dependencies between starred sections that follow:

    5.4.2 → 5.4.3 → 6.2 → 6.3 → 6.4 → 6.5 → 6.6

As in the first edition, problems play a critical role by elucidating and often substantially expanding the text. Almost all the previous ones have been kept with an approximately equal number of new ones added to correspond to our new topics and point of view. The conventions established on footnotes and notation in the first edition remain, if somewhat augmented.

Volume II is expected to be forthcoming in 2003. The topic presently in Chapter 8, which appeared gratifyingly ended in 1976, has, with the field, taken on a new life. Topics to be covered include permutation and rank tests and their basis in completeness and equivariance. Examples of application such as the Cox model in survival analysis, other transformation models, and the classical nonparametric k sample and independence problems will be included. Semiparametric estimation and testing will be considered more generally, greatly extending the material in Chapter 8 of the first edition. The basic asymptotic tools that will be developed or presented, in part in the text and in part in appendices, are weak convergence for random processes, elementary empirical process theory, and the functional delta method. Density estimation will be studied in the context of nonparametric function estimation. We also expect to discuss classification and model selection using the elementary theory of empirical processes. A final major topic in Volume II will be Monte Carlo methods such as the bootstrap and Markov Chain Monte Carlo. With the tools and concepts developed in this second volume students will be ready for advanced research in modern statistics.

For the first volume of the second edition we would like to add thanks to new colleagues, particularly Jianqing Fan, Michael Jordan, Jianhua Huang, Ying Qing Chen, and Carl Spruill and the many students who were guinea pigs in the basic theory course at Berkeley. We also thank Faye Yeager for typing, Michael Ostland and Simon Cawley for producing the graphs, Yoram Gat for proofreading that found not only typos but serious errors, and Prentice Hall for generous production support. Last and most important we would like to thank our wives, Nancy Kramer Bickel and Joan H. Fujimura, and our families for support, encouragement, and active participation in an enterprise that at times seemed endless.

Peter J. Bickel
bickel@stat.berkeley.edu

Kjell Doksum
doksum@stat.berkeley.edu
PREFACE TO THE FIRST EDITION

This book presents our view of what an introduction to mathematical statistics for students with a good mathematics background should be. By a good mathematics background we mean linear algebra and matrix theory and advanced calculus (but no measure theory). Because the book is an introduction to statistics, we need probability theory and expect readers to have had a course at the level of, for instance, Hoel, Port, and Stone's Introduction to Probability Theory. Our appendix does give all the probability that is needed. However, the treatment is abridged with few proofs and no examples or problems.

We feel such an introduction should at least do the following:

(1) Describe the basic concepts of mathematical statistics indicating the relation of theory to practice.

(2) Give careful proofs of the major "elementary" results such as the Neyman-Pearson lemma, the Lehmann-Scheffé theorem, the information inequality, and the Gauss-Markoff theorem.

(3) Give heuristic discussions of more advanced results such as the large sample theory of maximum likelihood estimates, and the structure of both Bayes and admissible solutions in decision theory. The extent to which holes in the discussion can be patched and where patches can be found should be clearly indicated.

(4) Show how the ideas and results apply in a variety of important subfields such as Gaussian linear models, multinomial models, and nonparametric models.

Although there are several good books available for this purpose, we feel that none has quite the mix of coverage and depth desirable at this level. The work of Rao, Linear Statistical Inference and Its Applications, 2nd ed., covers most of the material we do and much more but at a more abstract level employing measure theory. At the other end of the scale of difficulty for books at this level is the work of Hogg and Craig, Introduction to Mathematical Statistics, 3rd ed. These authors also discuss most of the topics we deal with but in many instances do not include detailed discussion of topics we consider essential such as existence and computation of procedures and large sample behavior.

Our book contains more material than can be covered in two quarters. In the two-quarter courses for graduate students in mathematics, statistics, the physical sciences, and engineering that we have taught we cover the core Chapters 2 to 7, which go from modeling through estimation and testing to linear models. In addition we feel Chapter 10 on decision theory is essential and cover at least the first two sections. Finally, we select topics from Chapter 8 on discrete data and Chapter 9 on nonparametric models.

Chapter 1 covers probability theory rather than statistics. Much of this material unfortunately does not appear in basic probability texts but we need to draw on it for the rest of the book. It may be integrated with the material of Chapters 2-7 as the course proceeds rather than being given at the start, or it may be included at the end of an introductory probability course that precedes the statistics course.

A special feature of the book is its many problems. They range from trivial numerical exercises and elementary problems intended to familiarize the students with the concepts to material more difficult than that worked out in the text. They are included both as a check on the student's mastery of the material and as pointers to the wealth of ideas and results that for obvious reasons of space could not be put into the body of the text.

Conventions: (i) In order to minimize the number of footnotes we have added a section of comments at the end of each chapter preceding the problem section. These comments are ordered by the section to which they pertain. Within each section of the text the presence of comments at the end of the chapter is signaled by one or more numbers, 1 for the first, 2 for the second, and so on. The comments contain digressions, reservations, and additional references. They need to be read only as the reader's curiosity is piqued.

(ii) Various notational conventions and abbreviations are used in the text. A list of the most frequently occurring ones indicating where they are introduced is given at the end of the text.

(iii) Basic notation for probabilistic objects such as random variables and vectors, densities, distribution functions, and moments is established in the appendix.

We would like to acknowledge our indebtedness to colleagues, students, and friends who helped us during the various stages (notes, preliminary edition, final draft) through which this book passed. E. L. Lehmann's wise advice has played a decisive role at many points. R. Pyke's careful reading of a next-to-final version caught a number of infelicities of style and content. Many careless mistakes and typographical errors in an earlier version were caught by D. Minassian who sent us an exhaustive and helpful listing. W. Carmichael, in proofreading the final version, caught more mistakes than both authors together. A serious error in Problem 2.2.5 was discovered by F. Scholz. Among many others who helped in the same way we would like to mention C. Chen, S. J. Chou, G. Drew, C. Gray, U. Gupta, P. X. Quang, and A. Samulon. Without Winston Chow's lovely plots Section 9.6 would probably not have been written and without Julia Rubalcava's impeccable typing and tolerance this text would never have seen the light of day.

We would also like to thank the colleagues and friends who inspired and helped us to enter the field of statistics. The foundation of our statistical knowledge was obtained in the lucid, enthusiastic, and stimulating lectures of Joe Hodges and Chuck Bell, respectively. Later we were both very much influenced by Erich Lehmann whose ideas are strongly reflected in this book.

Peter J. Bickel
Kjell Doksum

Berkeley 1976
Mathematical Statistics
Basic Ideas and Selected Topics
Volume I
Second Edition
Chapter 1

STATISTICAL MODELS, GOALS, AND PERFORMANCE CRITERIA

1.1 DATA, MODELS, PARAMETERS AND STATISTICS

1.1.1 Data and Models

Most studies and experiments, scientific or industrial, large scale or small, produce data whose analysis is the ultimate object of the endeavor. Data can consist of:

(1) Vectors of scalars, measurements, and/or characters, for example, a single time series of measurements.

(2) Matrices of scalars and/or characters, for example, digitized pictures or more routinely measurements of covariates and response on a set of n individuals; see Example 1.1.4 and Sections 2.2.1 and 6.1.

(3) Arrays of scalars and/or characters as in contingency tables (see Chapter 6) or more generally multifactor multiresponse data on a number of individuals.

(4) All of the above and more, in particular, functions as in signal processing, trees as in evolutionary phylogenies, and so on.

The goals of science and society, which statisticians share, are to draw useful information from data using everything that we know. The particular angle of mathematical statistics is to view data as the outcome of a random experiment that we model mathematically.

A detailed discussion of the appropriateness of the models we shall discuss in particular situations is beyond the scope of this book, but we will introduce general model diagnostic tools in Volume 2, Chapter 1. Moreover, we shall parenthetically discuss features of the sources of data that can make apparently suitable models grossly misleading. A generic source of trouble often called gross errors is discussed in greater detail in the section on robustness (Section 3.5.3). In any case all our models are generic and, as usual, "The Devil is in the details!" All the principles we discuss and calculations we perform should only be suggestive guides in successful applications of statistical analysis in science and policy. Subject matter specialists usually have to be principal guides in model formulation. A priori, in the words of George Box (1979), "Models of course, are never true but fortunately it is only necessary that they be useful."

Here are some examples:

(a) We are faced with a population of N elements, for instance, a shipment of manufactured items. An unknown number Nθ of these elements are defective. It is too expensive to examine all of the items. So to get information about θ, a sample of n is drawn without replacement and inspected. The data gathered are the number of defectives found in the sample.

(b) We want to study how a physical or economic feature, for example, height or income, is distributed in a large population. An exhaustive census is impossible so the study is based on measurements and a sample of n individuals drawn at random from the population.

(c) An experimenter makes n independent determinations of the value of a physical constant μ. His or her measurements are subject to random fluctuations (error) and the data can be thought of as μ plus some random errors.

(d) We want to compare the efficacy of two ways of doing something under similar conditions such as brewing coffee, reducing pollution, treating a disease, producing energy, learning a maze, and so on. This can be thought of as a problem of comparing the efficacy of two methods applied to the members of a certain population. We run m + n independent experiments as follows: m + n members of the population are picked at random and m of these are assigned to the first method and the remaining n are assigned to the second method. In this manner, we obtain one or more quantitative or qualitative measures of efficacy from each experiment. For instance, we can assign two drugs, A to m, and B to n, randomly selected patients and then measure temperature and blood pressure, have the patients rated qualitatively for improvement by physicians, and so on. Random variability here would come primarily from differing responses among patients to the same drug but also from error in the measurements and variation in the purity of the drugs.

In this book we will study how, starting with tentative models:

(1) We can conceptualize the data structure and our goals more precisely. We begin this in the simple examples that follow and continue in Sections 1.2-1.5 and throughout the book.

(2) We can derive methods of extracting useful information from data and, in particular, give methods that assess the generalizability of experimental results. For instance, if we observe an effect in our data, to what extent can we expect the same effect more generally? Estimation, testing, confidence regions, and more general procedures will be discussed in Chapters 2-4.

(3) We can assess the effectiveness of the methods we propose. We begin this discussion with decision theory in Section 1.3 and continue with optimality principles in Chapters 3 and 4.

(4) We can decide if the models we propose are approximations to the mechanism generating the data adequate for our purposes. Goodness of fit tests, robustness, and diagnostics are discussed in Volume 2, Chapter 1.

(5) We can be guided to alternative or more general descriptions that might fit better. Hierarchies of models are discussed throughout.

We shall use these examples to arrive at our formulation of statistical models and to indicate some of the difficulties of constructing such models. First consider situation (a), which we refer to as:

Example 1.1.1. Sampling Inspection. The mathematical model suggested by the description is well defined. A random experiment has been performed. The sample space consists of the numbers 0, 1, ..., n corresponding to the number of defective items found. On this space we can define a random variable X given by X(k) = k, k = 0, 1, ..., n. If Nθ is the number of defective items in the population sampled, then by (A.13.6), X has an hypergeometric, H(Nθ, N, n), distribution,

    P[X = k] = (Nθ choose k)(N(1 − θ) choose n − k) / (N choose n)    (1.1.1)

if max(n − N(1 − θ), 0) ≤ k ≤ min(Nθ, n). The main difference that our model exhibits from the usual probability model is that Nθ is unknown and, in principle, can take on any value between 0 and N. So, although the sample space is well defined, we cannot specify the probability structure completely but rather only give a family {H(Nθ, N, n)} of probability distributions for X, any one of which could have generated the data actually observed.

Example 1.1.2. Sample from a Population. One-Sample Models. Situation (b) can be thought of as a generalization of (a) in that a quantitative measure is taken rather than simply recording "defective" or not. It can also be thought of as a limiting case in which N = ∞, so that sampling with replacement replaces sampling without. Formally, if the measurements are scalar, we observe x1, ..., xn, which are modeled as realizations of X1, ..., Xn independent, identically distributed (i.i.d.) random variables with common unknown distribution function F. We often refer to such X1, ..., Xn as a random sample from F, and also write that X1, ..., Xn are i.i.d. as X with X ~ F, where "~" stands for "is distributed as." The model is fully described by the set F of distributions that we specify. The same model also arises naturally in situation (c). Given the description in (c), we postulate

(1) The value of the error committed on one determination does not affect the value of the error at other times. That is, ε1, ..., εn are independent.

(2) The distribution of the error at one determination is the same as that at another. Thus, ε1, ..., εn are identically distributed.

Then we can write the n determinations of μ as

    Xi = μ + εi,  1 ≤ i ≤ n    (1.1.2)

where ε = (ε1, ..., εn)ᵀ is the vector of random errors.

(3) The distribution of ε is independent of μ. Equivalently X1, ..., Xn are a random sample and, if we let G be the distribution function of ε1 and F that of X1, then

    F(x) = G(x − μ)    (1.1.3)

and the model is alternatively specified by F, the set of F's we postulate, or by {(μ, G) : μ ∈ R, G ∈ G} where G is the set of all allowable error distributions that we postulate. Commonly considered G's are all distributions with center of symmetry 0, or alternatively all distributions with expectation 0. The classical default model is:

(4) The common distribution of the errors is N(0, σ²), where σ² is unknown. That is, the Xi are a sample from a N(μ, σ²) population or equivalently F = {Φ((· − μ)/σ) : μ ∈ R, σ > 0} where Φ is the standard normal distribution.

This default model is also frequently postulated for measurements taken on units obtained by random sampling from populations, for instance, heights of individuals or log incomes. It is important to remember that these are assumptions at best only approximately valid. All actual measurements are discrete rather than continuous. There are absolute bounds on most quantities: 100 ft high men are impossible. Heights are always nonnegative. The Gaussian distribution, whatever be μ and σ, will have none of this.
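Both models are easy to simulate, which is often a quick sanity check on a proposed model. Below is a minimal sketch of our own (not from the text; numpy is assumed and all names are hypothetical) that draws X from the hypergeometric family of Example 1.1.1 and a sample satisfying (1)-(4) of Example 1.1.2.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Example 1.1.1: N items, N*theta of them defective; X is the number of
# defectives in a sample of size n drawn without replacement, so that
# X ~ H(N*theta, N, n) as in (1.1.1).
N, theta, n = 1000, 0.1, 50
n_defective = int(N * theta)
X = rng.hypergeometric(ngood=n_defective, nbad=N - n_defective, nsample=n)

# Example 1.1.2: n determinations of a constant mu with i.i.d. errors,
# X_i = mu + eps_i as in (1.1.2), under the classical default (4):
# eps_i ~ N(0, sigma^2).
mu, sigma = 2.5, 0.3
sample = mu + sigma * rng.standard_normal(n)

print(X, sample.mean())
```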
Example 1.1.3. Two-Sample Models. Now consider situation (d). Let x1, ..., xm and y1, ..., yn, respectively, be the responses of m subjects having a given disease given drug A and n other similarly diseased subjects given drug B. By convention, if drug A is a standard or placebo, we refer to the x's as control observations. A placebo is a substance such as water that is expected to have no effect on the disease and is used to correct for the well-documented placebo effect, that is, patients improve even if they only think they are being treated. We let the y's denote the responses of subjects given a new drug or treatment that is being evaluated by comparing its effect with that of the placebo. We call the y's treatment observations. Natural initial assumptions here are:

(1) The x's and y's are realizations of X1, ..., Xm, a sample from F, and Y1, ..., Yn, a sample from G, so that the model is specified by the set of possible (F, G) pairs.

To specify this set more closely the critical constant treatment effect assumption is often made.

(2) Suppose that if treatment A had been administered to a subject response x would have been obtained. Then if treatment B had been administered to the same subject instead of treatment A, response y = x + Δ would be obtained where Δ does not depend on x.

This implies that if F is the distribution of a control, then G(x) = F(x − Δ). We call this the shift model with parameter Δ. Often the final simplification is made:

(3) The control responses are normally distributed. That is, F is the N(μ, σ²) distribution and G is the N(μ + Δ, σ²) distribution, where σ² is unknown. Then we have specified the Gaussian two sample model with equal variances.

How do we settle on a set of assumptions? Evidently by a mixture of experience and physical considerations. The advantage of piling on assumptions such as (1)-(4) of Example 1.1.2 is that, if they are true, we know how to combine our measurements to estimate μ in a highly efficient way and also assess the accuracy of our estimation procedure (Example 4.4.1). The danger is that, if they are false, our analyses, though correct for the model written down, may be quite irrelevant to the experiment that was actually performed. As our examples suggest, there is tremendous variation in the degree of knowledge and control we have concerning experiments. In some applications we often have a tested theoretical model and the danger is small. The number of defectives in the first example clearly has a hypergeometric distribution; the number of α particles emitted by a radioactive substance in a small length of time is well known to be approximately Poisson distributed. In others, we can be reasonably secure about some aspects, but not others. For instance, in Example 1.1.2, we can ensure independence and identical distribution of the observations by using different, equally trained observers with no knowledge of each other's findings. However, we have little control over what kind of distribution of errors we get and will need to investigate the properties of methods derived from specific error distribution assumptions when these assumptions are violated. This will be done in Sections 3.5.3 and 6.6.

Experiments in medicine and the social sciences often pose particular difficulties. For instance, in Example 1.1.3 the group of patients to whom drugs A and B are to be administered may be haphazard rather than a random sample from the population of sufferers from a disease. In this situation (and generally) it is important to randomize. That is, we use a random number table or other random mechanism so that the m patients administered drug A are a sample without replacement from the set of m + n available patients. All the severely ill patients might, for instance, have been assigned to B. Without this device we could not know whether observed differences in drug performance might not (possibly) be due to unconscious bias on the part of the experimenter. The study of the model based on the minimal assumption of randomization is complicated and further conceptual issues arise. Fortunately, the methods needed for its analysis are much the same as those appropriate for the situation of Example 1.1.3 when F, G are assumed arbitrary. Statistical methods for models of this kind are given in Volume 2.

Using our first three examples for illustrative purposes, we now define the elements of a statistical model. A review of necessary concepts and notation from probability theory is given in the appendices. We are given a random experiment with sample space Ω. On this sample space we have defined a random vector X = (X1, ..., Xn). When ω is the outcome of the experiment, X(ω) is referred to as the observations or data. It is often convenient to identify the random vector X with its realization, the data x = X(ω). Since it is only X that we observe, we need only consider its probability distribution. This distribution is assumed to be a member of a family P of probability distributions on Rⁿ. P is referred to as the model. For instance, in Example 1.1.1, we observe X and the family P is that of all hypergeometric distributions with sample size n and population size N. In Example 1.1.2, if (1)-(4) hold, P is the family of all distributions according to which X1, ..., Xn are independent and identically distributed with a common N(μ, σ²) distribution.

1.1.2 Parametrizations and Parameters

To describe P we use a parametrization, that is, a map, θ → Pθ, from a space of labels, the parameter space Θ, to P; or equivalently write P = {Pθ : θ ∈ Θ}. Thus, in Example 1.1.1 we take θ to be the fraction of defectives in the shipment, Θ = {0, 1/N, ..., 1} and Pθ the H(Nθ, N, n) distribution. In Example 1.1.2 with assumptions (1)-(4) we have Θ = R × R⁺ and, if θ = (μ, σ²), Pθ the distribution on Rⁿ with density ∏ᵢ₌₁ⁿ (1/σ)φ((xᵢ − μ)/σ) where φ is the standard normal density. If, still in this example, we know we are measuring a positive quantity in this model, we have Θ = R⁺ × R⁺. If, on the other hand, we only wish to make assumptions (1)-(3) with ε having expectation 0, we can take Θ = {(μ, G) : μ ∈ R, G with density g such that ∫xg(x)dx = 0} and p(μ,G)(x) implicitly taken as ∏ᵢ₌₁ⁿ g(xᵢ − μ).

When we can take Θ to be a nice subset of Euclidean space and the maps θ → Pθ are smooth, in senses to be made precise later, models P are called parametric. Models such as that of Example 1.1.2 with assumptions (1)-(3) are called semiparametric. Finally, models such as that of Example 1.1.3 with only (1) holding and F, G taken to be arbitrary are called nonparametric. It's important to note that even nonparametric models make substantial assumptions, in Example 1.1.3 that X1, ..., Xm are independent of each other and Y1, ..., Yn; moreover, X1, ..., Xm are identically distributed as are Y1, ..., Yn. The only truly nonparametric but useless model for X ∈ Rⁿ is to assume that its (joint) distribution can be anything.

Note that there are many ways of choosing a parametrization in these and all other problems. We may take any one-to-one function of θ as a new parameter. For instance, in Example 1.1.1 we can use the number of defectives in the population, Nθ, as a parameter and in Example 1.1.2, under assumptions (1)-(4), we may parametrize the model by the first and second moments of the normal distribution of the observations (that is, by (μ, μ² + σ²)).

What parametrization we choose is usually suggested by the phenomenon we are modeling; θ is the fraction of defectives, μ is the unknown constant being measured. However, as we shall see later, the first parametrization we arrive at is not necessarily the one leading to the simplest analysis. Of even greater concern is the possibility that the parametrization is not one-to-one, that is, such that we can have θ1 ≠ θ2 and yet Pθ1 = Pθ2. Such parametrizations are called unidentifiable. For instance, in (1.1.2) suppose that we permit G to be arbitrary. Then the map sending θ = (μ, G) into the distribution of (X1, ..., Xn) remains the same but Θ = {(μ, G) : μ ∈ R, G has (arbitrary) density g}. Now the parametrization is unidentifiable because, for example, μ = 0 and N(0, 1) errors lead to the same distribution of the observations as μ = 1 and N(−1, 1) errors. The critical problem with such parametrizations is that even with "infinite amounts of data," that is, knowledge of the true Pθ, parts of θ remain unknowable. Thus, we will need to ensure that our parametrizations are identifiable, that is, θ1 ≠ θ2 ⟹ Pθ1 ≠ Pθ2.

Dual to the notion of a parametrization, a map from some Θ to P, is that of a parameter, formally a map, ν, from P to another space N. A parameter is a feature ν(P) of the distribution of X. For instance, in Example 1.1.1, the fraction of defectives θ can be thought of as the mean of X/n. In Example 1.1.3 with assumptions (1)-(2) we are interested in Δ, which can be thought of as the difference in the means of the two populations of responses. In addition to the parameters of interest, there are also usually nuisance parameters, which correspond to other unknown features of the distribution of X. For instance, in Example 1.1.2, if the errors are normally distributed with unknown variance σ², then σ² is a nuisance parameter. We usually try to combine parameters of interest and nuisance parameters into a single grand parameter θ, which indexes the family P, that is, make θ → Pθ into a parametrization of P. Implicit in this description is the assumption that θ is a parameter in the sense we have just defined. But given a parametrization θ → Pθ, θ is a parameter if and only if the parametrization is identifiable. Formally, we can define θ : P → Θ as the inverse of the map θ → Pθ, from Θ to its range P, iff the latter map is one-to-one, that is, if Pθ1 = Pθ2 implies θ1 = θ2. More generally, a function q : Θ → N can be identified with a parameter ν(P) iff Pθ1 = Pθ2 implies q(θ1) = q(θ2), and then ν(Pθ) = q(θ).

Here are two points to note:

(1) A parameter can have many representations. For instance, in Example 1.1.2 with assumptions (1)-(4) the parameter of interest μ = μ(P) can be characterized as the mean of P, or the median of P, or the midpoint of the interquantile range of P, or more generally as the center of symmetry of P, as long as P is the set of all Gaussian distributions.

(2) A vector parametrization that is unidentifiable may still have components that are parameters (identifiable). For instance, consider Example 1.1.2 again in which we assume the error ε to be Gaussian but with arbitrary mean Δ. Then P is parametrized by θ = (μ, Δ, σ²), where σ² is the variance of ε. As we have seen this parametrization is unidentifiable and neither μ nor Δ are parameters in the sense we've defined. But σ² = Var(Xi) evidently is and so is μ + Δ.

Sometimes the choice of P starts by the consideration of a particular parameter. For instance, our interest in studying a population of incomes may precisely be in the mean income. When we sample, say with replacement, and observe X1, ..., Xn independent with common distribution, it is natural to write Xi = μ + εi, E(εi) = 0, where μ denotes the mean income. The (μ, G) parametrization of Example 1.1.2 is now well defined and identifiable by (1.1.3) and G = {G : ∫x dG(x) = 0}. Similarly, in Example 1.1.3, instead of postulating a constant treatment effect Δ, we can start by making the difference of the means, δ = μY − μX, the focus of the study. Then δ is identifiable whenever μX and μY exist.

1.1.3 Statistics as Functions on the Sample Space

Models and parametrizations are creations of the statistician, but the true values of parameters are secrets of nature. Our aim is to use the data inductively, to narrow down in useful ways our ideas of what the "true" P is. The link for us are things we can compute, statistics. Formally, a statistic T is a map from the sample space X to some space of values T, usually a Euclidean space. Informally, T(x) is what we can compute if we observe X = x. Thus, in Example 1.1.1, a common estimate of θ is the statistic T(x) = x/n, the fraction defective in the sample. In Example 1.1.2 a common estimate of μ is the statistic T(X1, ..., Xn) = X̄ = (1/n)∑ᵢ₌₁ⁿ Xᵢ, and a common estimate of σ² is the statistic

    s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (Xᵢ − X̄)²,

which evaluated at the data give X̄ and s², called the sample mean and sample variance.

For future reference we note that a statistic just as a parameter need not be real or Euclidean valued. For instance, a statistic we shall study extensively in Chapter 2 is the function valued statistic F̂, called the empirical distribution function, which evaluated at x ∈ R is

    F̂(X1, ..., Xn)(x) = (1/n) ∑ᵢ₌₁ⁿ I(Xᵢ ≤ x)

where (X1, ..., Xn) are a sample from a probability P on R and I(A) is the indicator of the event A. This statistic takes values in the set of all distribution functions on R. It estimates the function valued parameter F defined by its evaluation at x ∈ R, F(P)(x) = P[X1 ≤ x].

Deciding which statistics are important is closely connected to deciding which parameters are important and, hence, can be related to model formulation as we saw earlier. For instance, consider situation (d) listed at the beginning of this section. If we suppose there is a single numerical measure of performance of the drugs and the difference in performance of the drugs for any given patient is a constant irrespective of the patient, then our attention naturally focuses on estimating this constant. If, however, this difference depends on the patient in a complex manner (the effect of each drug is complex), we have to formulate a relevant measure of the difference in performance of the drugs and decide how to estimate this measure. Often the outcome of the experiment is used to decide on the model and the appropriate measure of difference. Next this model, which now depends on the data, is used to decide what estimate of the measure of difference should be employed (cf., for example, Mandel, 1964). Data-based model selection can make it difficult to ascertain or even assign a meaning to the accuracy of estimates or the probability of reaching correct conclusions. These issues will be discussed further in Volume 2.
Here J..(z) only.E(Yi ). i = 1.. z) where 9 is known except for a vector (3 = ((31. We usually ueed to postulate more. Then we have the classical Gaussian linear model. Zi is a nonrandom vector of values called a covariate vector or a vector of explanatory variables whereas Yi is random and referred to as the response variable or dependent variable in the sense that its distribution depends on Zi.1.Z~)T and Jisthen x n identity.1.Yn) ~ II f(Yi I Zi). .. n.1. A eommou (but often violated assumption) is (I) The ti are identically distributed with distribution F. Treatment Dose Level) for patient i. So is Example 1. [n general.2 uuknown. which we can write in vector matrix form. The most common choice of 9 is the linear fOlltl. Then. then we can write.. and Performance Criteria Chapter 1 dose levels. In fact by varying our assumptions this class of models includes any situation in which we have independent but not necessarily identically distributed observations. and nonparametric if we drop (1) and simply .1. .2 with assumptions (1)(4).. . For instance.. Identifiability of these parametrizations and the status of their components as parameters are discussed in the problems. . in Example 1. i=l n If we let J. .2) with . If we let f(Yi I zd denote the density of Yi for a subject with covariate vector Zi.z) = L.. In the two sample models this is implied by the constant treatment effect assumption. the effect of z on Y is through f.L(z) is an unknown function from R d to R that we are interested in. 0 .1.. I'(B) = I' + fl. (3) g((3. (b) where Ei = Yi . .treat the Zi as a label of the completely unknown distributions of Yi..L(z) denote the expected value of a response with given covariate vector z. Goals. I3d) T of unknowns. On the basis of subject matter knowledge and/or convenience it is usually postulated that (2) 1'(z) = 9((3. Clearly.3 with the Gaussian twosample model I'(A) ~ 1'. then the model is zT (a) P(YI. Example 1. That is. d = 2 and can denote the pair (Treatment Label. . See Problem 1.8.~II3Jzj = zT (3 so that (b) becomes (b') This is the linear model.3(3) is a special case of this model.. semiparametric as with (I) and (2) with F arbitrary. . By varying the assumptions we obtain parametric models as with (I).10 Statistical Models. (c) whereZ nxa = (zf.. . Often the following final assumption is made: (4) The distributiou F of (I) is N(O. (3) and (4) above.
n. eo = a Here the errors el... the elapsed times X I. i ~ 2. X n spent above a fixed high level for a series of n consecutive wave records at a point on the seashore. save for a brief discussion in Volume 2....cn _d p(edp(e2 I edp(e3 I e2) .. . .1 Data.t+ei..Section 1.(3cd··· f(e n Because €i =  (3e n _1). we start by finding the density of CI. . 0'2) density. . . To find the density p(Xt. we give an example in which the responses are dependent.(3) + (3X'l + 'i. It is plausible that ei depends on Cil because long waves tend to be followed by long waves. ergodicity.. .(3)I').. .5. . . Let j.n ei = {3ei1 + €i.. . at best an approximation for the wave example. Xl = I' + '1.Il. However. X n is P(X1.(1 . Of course.. Models. . is that N(O. p(e n I end f(edf(c2 .1. en' Using conditional probability theory and ei = {3ei1 + €i. Parameters.. Measurement Model with Autoregressive Errors. . . . we have p(edp(C21 e1)p(e31 e"e2)"..xn ) ~ f(x1 I') II f(xj . In fact we can write f.. A second example is consecutive measurements Xi of a constant It made by the same observer who seeks to compensate for apparent errors. x n ). say. Then we have what is called the AR(1) Gaussian model J is the We include this example to illustrate that we need not be limited by independence.. and the associated probability theory models and inference for dependent data are beyond the scope of this book. The default assumption.p(e" I e1. X n be the n determinations of a physical constant J.. . Example 1. Let XI.. .t = E(Xi ) be the average time for an infinite series of records.. . An example would be. C n where €i are independent identically distributed with density are dependent as are the X's... i = 1. the conceptual issues of stationarity. 0 . . . Consider the model where Xi and assume = J. the model for X I. ' . Xi ~ 1. .. Xi ..{3Xj1 j=2 n (1.. . i = l. . model (a) assumes much more but it may be a reasonable first approximation in these situations.n.t. and Statistics 11 Finally.
1l"i is the frequency of shipments with i defective items. In this section we introduced the first basic notions and formalism of mathematical statistics. Models are approximations to the mechanisms generating the observations. They are useful in understanding how the outcomes can he used to draw inferences that go beyond the particular experiment. ([. i = 0. There are situations in which most statisticians would agree that more can be said For instance.2. N. .1. it is possible that. 1. We know that. vector observations X with unknown probability distributions P ranging over models P. N.. Goals. . given 9 = i/N. That is. Thus.I 12 Statistical Models. in the past. and indeed necessary. .1. There is a substantial number of statisticians who feel that it is always reasonable. The notions of parametrization and identifiability are introduced. we have had many shipments of size N that have subsequently been distributed. This is done in the context of a number of classical examples. . Now it is reasonable to suppose that the value of (J in the present shipment is the realization of a random variable 9 with distribution given by P[O = N] I = IT" i = 0. We view statistical models as useful tools for learning from the outcomes of experiments and studies. This distribution does not always corresp:md to an experiment that is physically realizable but rather is thought of as measure of the beliefs of the experimenter concerning the true value of (J before he or she takes any data.2 BAYESIAN MODELS Throughout our discussion so far we have assumed that there is no information available about the true value of the parameter beyond that provided by the data. If the customers have provided accurate records of the number of defective items that they have found. The general definition of parameters and statistics is given and the connection between parameters and pararnetrizations elucidated. in the inspection Example 1. N.. How useful a particular model is is a complex mix of how good the approximation is and how much insight it gives into drawing inferences. n). we can construct a freq uency distribution {1l"O.. a . the most important of which is the workhorse of statistics. to think of the true value of the parameter (J as being the realization of a random variable (} with a known distribution. 0 = N I I ([. and Performance Criteria Chapter 1 Summary.2) This is an example of a Bayesian model.1) Our model is then specified by the joint distribution of the observed number X of defectives in the sample and the random variable 9.2."" 1l"N} for the proportion (J of defectives in past shipments. PIX = k. the regression model.. X has the hypergeometric distribution 'H( i.
We shall return to the Bayesian framework repeatedly in our discussion. Before the experiment is performed. The most important feature of a Bayesian model is the conditional distribution of 8 given X = x. In the "mixed" cases such as (J continuous X discrete. . (1. (1.1)'(0. Savage (1954). whose range is contained in 8. 1. Lindley (1965). X) is appropriately continuous or discrete with density Or frequency function. (8. X) is that of the outcome of a random experiment in which we first select (J = (J according to 7r and then. which is called the posterior distribution of 8.1.2. Raiffa and Schlaiffer (1961). suppose that N = 100 and that from past experience we believe that each item has probability . ?T. To get a Bayesian model we introduce a random vector (J. (1962).1. 100. and Berger (1985). we can obtain important and useful results and insights. An interesting discussion of a variety of points of view on these questions may be found in Savage et al. Suppose that we have a regular parametric model {Pe : (J E 8}. select X according to Pe. However. . insofar as possible we prefer to take the frequentist point of view in validating statistical statements and avoid making final claims in terms of subjective posterior probabilities (see later).. After the value x has been obtained for X. Eqnation (1. However. (J) as a conditional density or frequency function given 8 = we will denote it by p(x I 0) for the remainder of this section. In this section we shall define and discuss the basic clements of Bayesian models.l. This would lead to the prior distribution e. the resulting statistical inference becomes subjective. (1) Our own point of view is that SUbjective elements including the views of subject matter experts arc an essential element in all model building. There is an even greater range of viewpoints in the statistical community from people who consider all statistical statements as purely subjective to ones who restrict the use of such models to situations such as that of the inspection example in which the distribution of (J has an objective interpretation in tenus of frequencies. I(O. by giving (J a distribution purely as a theoretical tool to which no subjective significance is attached. = ( I~O ) (0. The theory of this school is expounded by L. Before sampling any items the chance that a given shipment contains . If both X and (J are continuous or both are discrete. The joint distribution of (8. then by (B.2) is an example of (1. the information about (J is described by the posterior distribution. let us turn again to Example 1.2 Bayesian Models 13 Thus. given (J = (J.Section 1. For a concrete illustration.9)'00'.O). x) = ?T(O)p(x. the information or belief about the true value of the parameter is described by the prior distribution.3) Because we now think of p(x. De Groot (1969).1 of being defective independently of the other members of the shipment.2. 1. For instance.2. The function 7r represents our belief or information about the parameter (J before the experiment and is called the prior density or frequency function. We now think of Pe as the conditional distribution of X given (J = (J.4) for i = 0.2..3).3). with density Or frequency function 7r. the joint distribution is neither continuous nor discrete.
In the cases where 8 and X are both continuous or both discrete this is precisely Bayes' rule applied to the joint distribution of (0. theu 1r(91 x) 1r(9)p(x I 9) 2.II ~.:.10 '" . Therefore. (A 15. . X) given by (1.9 ] .1)(0.2. ( 1. 14 Statistkal Models...1.6) we argue loosely as follows: If be fore the drawing each item was defective with probability .2.30.k dt (1.5) 5 ) = 0. 1.9) I . P[IOOO > 201 = 10] P [ .9)(0.X..1) (1. (ii) If we denote the corresponding (posterior) frequency function or density by 1r(9 I x).3).2.1) distribution.8) roo 1r(9)p(x 19) 1r(t)p(x I t)dt if 8 is continuous. (1.2. ...2. we obtain by (1..Xn are indicators of n Bernoulli trials with probability of success () where 0 < 8 < 1. (i) The posterior distribution is discrete or continuous according as the prior distri bution is discrete or continuous. Example 1. 1r(t)p(x I t) if 8 is discrete.1 > . 0.8. Thus.X) .1100(0.1.1) '" I .1)(0. If we assume that 8 has a priori distribution with deusity 1r.6) To calculate the posterior probability given iu (1. Bernoulli Trials. the number of defectives left after the drawing.. This leads to P[1000 > 20 I X = lOJ '" 0.4) can be used.30.001.xn )= Jo'1r(t)t k (lt)n.9) > .10).2.2. .<I> 1000 . . some variant of Bayes' rule (B. P[1000 > 20 I X = 10] P[IOOO . 1r(9)9k (1 _ 9)nk 1r(Blx" .181(0. Goals.9 independently of the other items.8) as posterior density of 0.9)(0. to calculate the posterior.J C3 (1.2. 1008 .1100(0.7) In general.181(0.9) I . Here is an example.. Specifically.1 and good with probability . and Performance Criteria Chapter 1 I . I 20 or more bad items is by the normal approximation with continuity correction.52) 0. Suppose that Xl. .. this will continue to be the case for the items left in the lot after the 19 sample items have been drawn.2.<1>(0.. is iudependeut of X and has a B(81.j . Now suppose that a sample of 19 has been drawn in which 10 defective items are found.. .X > 10 I X = lOJ P [(1000 .
05 and its variance is very small. nonVshaped bimodal distributions are not pennitted. k = I Xi· Note that the posterior density depends on the data only through the total number of successes. L~ 1 Xi' We also obtain the same posterior density if B has prior density 1r and 1 Xi. . Another bigger conjugate family is that of finite mixtures of beta distributionssee Problem 1. r..1). the beta family provides a wide variety of shapes that can approximate many reasonable prior distributions though by no means all.2 indicates.2.<.6.13) leads us to assume that the number X of geniuses observed has approximately a 8(n. Then we can choose r and s in the (3(r. and the posterior distribution of B givenl:X i =kis{3(k+r.2. n . To choose a prior 11'. upon substituting the (3(r.s) distribution.2. To get infonnation we take a sample of n individuals from the city.k + s) where B(·.2. We may want to assume that B has a density with maximum value at o such as that drawn with a dotted line in Figure B.2.9). say 0. x n ). we are interested in the proportion () of "geniuses" (IQ > 160) in a particular city. Xi = 0 or 1. Summary.nk+s).2. must (see (B.2. 'i = 1. 0 A feature of Bayesian models exhibited by this example is that there are natural parametric families of priors such that the posterior distributions also belong to this family.11» be B(k + T. so that the mean is r/(r + s) = 0.ll) in (1.. Or else we may think that 1[(8) concentrates its mass near a small number. 8) distribution given B = () (Problem 1. This class of distributions has the remarkable property that the resulting posterior distributions arc again beta distributions..2.n. As Figure B. . If we were interested in some proportion about which we have no information or belief. One such class is the twoparameter beta family.2. The result might be a density such as the one marked with a solid line in Figure B.16. and s only. Now we may either have some information about the proportion of geniuses in similar cities of the country or we may merely have prejudices that we are willing to express in the fonn of a prior distribution on B. for instance. introduce the notions of prior and posterior distributions and give Bayes rule. we need a class of distributions that concentrate on the interval (0. where k ~ l:~ I Xj.15..Section 1. which has a B( n.8) distribution.10) The proportionality constant c. For instance.(8 I X" . we might take B to be uniformly distributed on (0.2 Bayesian Models 15 for 0 < () < 1. Such families are called conjugaU? Evidently the beta family is conjugate to the binomial. We also by example introduce the notion of a conjugate family of distributions. . Specifically.05. we only observe We can thus write 1[(8 I k) fo". which depends on k.2. 1). If n is small compared to the size of the city.2. (A. s) density (B.·) is the beta function. Suppose. which corresponds to using the beta distribution with r = . .9) we obtain 1 L: L7 (}k+rl (1 _ 8)nk+s1 c (1. We return to conjugate families in Section 1. = 1. We present an elementary discussion of Bayesian models.
l < J.e. On the basis of the sample outcomes the organization wants to give a ranking from best to worst of the brands (ties not pennitted). 0 Example 1. (age. Given a statistical model.3. say. Intuitively.1. As the second example suggests.3. with (J < (Jo. We may have other goals as illustrated by the next two examples. P's that correspond to no treatment effect (i. the fraction defective B in Example 1. such as.16 Statistical Models. and Performance Criteria Chapter 1 1. 1 < i < n. one of which will be announced as more consistent with the data than others. shipments.l > JkO correspond to an eternal alternation of Big Bangs and expansions. there are k! possible rankings or actions. However. In Example 1. if there are k different brands.1. Thus. of g((3. whereas the receiver does not wish to keep "bad. A consumer organization preparing (say) a report on air conditioners tests samples of several brands. for instance. For instance." In testing problems we.l < Jio or those corresponding to J. Thus. in Example 1. state which is supported by the data: "specialness" or.3 THE DECISION THEORETIC FRAMEWORK I . we have a vector z.1. as in Example 1.1 or the physical constant J. drug dose) T that can be used for prediction of a variable of interest Y.1. say. These are estimation problems. i we can try to estimate the function 1'0. Unfortunately {t(z) is unknown. If J. For instance. Note that we really want to estimate the function Ji(')~ our results will guide the selection of doses of drug for future patients. In other situations certain P are "special" and we may primarily wish to know whether the data support "specialness" or not.lo as special. if we believe I'(z) = g((3. For instance. There are many possible choices of estimates. placebo and treatment are equally effective) are special because the FDA (Food and Drug Administration) does not wish to pennit the marketing of drugs that do no good. if we have observations (Zil Yi).1. then depending on one's philosophy one could take either P's corresponding to J.1 contractual agreement between shipper and receiver may penalize the return of "good" shipments. say a 50yearold male patient's response to the level of a drug. z) we can estimate (3 from our observations Y. as it's usually called.1.l > J. the receiver wants to discriminate and may be able to attach monetary Costs to making a mistake of either type: "keeping the bad shipment" or "returning a good shipment. Zi) and then plug our estimate of (3 into g. Ranking. sex. Goals. A very important class of situations arises when. Example 1." (} > Bo. We may wish to produce "best guesses" of the values of important parameters. the expected value of Y given z.2.lo is the critical matter density in the universe so that J. in Example 1..L in Example 1. the information we want to draw from data can be put in various forms depending on the purposes of our analysis. i.1.1 do we use the observed fraction of defectives . and as we shall see fonnally later. Making detenninations of "specialness" corresponds to testing significance. at a first cut.2. Prediction.lo means the universe is expanding forever and J.3. Po or Pg is special and the general testing problem is really one of discriminating between Po and Po. there are many problems of this type in which it's unclear which oftwo disjoint sets of P's. "hypothesis" or "nonspecialness" (or alternative).4. a reasonable prediction rule for an unseen Y (response of a new patient) is the function {t(z). 
0 In all of the situations we have discussed it is clear that the analysis does not stop by specifying an estimate or a test or a ranking or a prediction function.
in Example 1. A = {a.3 The Decision Theoretic Framework 17 X/n as our estimate or ignore the data and use hislOrical infonnation on past shipments. We usnally take P to be parametrized.1...6. (2) point to what the different possible actions are. in estimation we care how far off we are. Here only two actions are contemplated: accepting or rejecting the "specialness" of P (or in more usual language the hypothesis H : P E Po in which we identify Po with the set of "special" P's).1. For instance.1. If we are estimating a real parameter such as the fraction () of defectives. 1). : 0 E 8).. we would want a posteriori estimates of perfomlance. The answer will depend on the model and.3. ~ 1 • • • l I} in Example 1. (3) provide assessments of risk. P = {P. That is. (1. defined as any value such that half the Xi are at least as large and half no bigger? The same type of question arises in all examples. On the other hand. for instance.3. I)}. most significantly. A = {( 1. 2). or p. there are 3! = 6 possible rankings. . (2. we begin with a statistical model with an observation vector X whose distribution P ranges over a set P. once a study is carried out we would probably want not only to estimate ~ but also know how reliable our estimate is. or combine them in some way? In Example 1. # 0. By convention. if we have three air conditioners. 1. In any case.1 Components of the Decision Theory Framework As in Section 1.2 lO estimate J1 do we use the mean of the measurements. ik) of {I. . on what criteria of performance we use. These examples motivate the decision theoretic framework: We need to (I) clarify the objectives of a study. (4) provide guidance in the choice of procedures for analyzing outcomes of experiments. taking action 1 would mean deciding that D. in Example 1. I} with 1 corresponding to rejection of H. . Thus. X = ~ L~l Xi.3 even with the simplest Gaussian model it is intuitively clear and will be made precise later that. we need a priori estimates of how well even the best procedure can do. and reliability of statistical procedures.1.. accuracy. whatever our choice of procedure we need either a priori (before we have looked at the data) and/or a posteriori estimates of how well we're doing. Action space. 1. 2). Thus. 1.1.2. Testing. 3). A new component is an action space A of actions or decisions or claims that we can contemplate making. (2. Intuitively. it is natural to take A = R though smaller spaces may serve equally well.Section 1. (3. 2. In designing a study to compare treatments A and B we need to determine sample sizes that will be large enough to enable us to detect differences that matter. 3. in Example 1. 3.1. Thus. 3). 2. Here are action spaces for our examples. A = {O. Ranking. (3. k}}. even if. Here quite naturally A = {Permntations (i I . Estimation. a large a 2 will force a large m l n to give us a good chance of correctly deciding that the treatment effect is there. in testing whether we are right or wrong. and so on.1. in ranking what mistakes we've made. . . or the median. in Example 1.1.1. is large.
As we shall see.Vd) = (q1(0). a) 1(0. less computationally convenient but perhaps more realistically penalize large errors less are Absolute Value Loss: l(P. say. sometimes can genuinely be quantified in economic terms.1). ~ 2. a) = 1 otherwise. Evidently Y could itself range over an arbitrary space Y and then R would be replaced by Y in the definition of a(·)." that is. The interpretation of I(P. is the nonnegative loss incurred by the statistician if he or she takes action a and the true "state of Nature. if Y = 0 or 1 corresponds to.. Sex)T.al < d. For instance.. examples of loss functions are 1(0.V. in the prediction example 1. a) = 0.)2 = Vj [ squared Euclidean distance/d absolute distance/d 1. 1 d} = supremum distance. a) = l(v < a).a)2). as the name suggests.i = We can also consider function valued parameters. = max{laj  vjl. the expected squared error if a is used. the probability distribution producing the data. Estimation. I(P. and truncated quadraticloss: I(P.18 Statistical Models. This loss expresses the notion that all errors within the limits ±d are tolerable and outside these limits equally intolerable. Goals.2. . . Although estimation loss functions are typically symmetric in v and a.: la. a) I d . a) = J (I'(z) . For instance. a) = min {(v( P) . [f Y is real.a(z))2dQ(z).. 1'() is the parameter of interest.a) ~ (q(O) .a)2. r I I(P. Loss function. a) if P is parametrized. or 1(0. If we use a(·) as a predictor and the new z has marginal distribution Q then it is natural to consider. say. and Performance Criteria Chapter 1 Prediction. If v ~ (V1. A = {a . "does not respond" and "responds. l(P.3. which penalizes only overestimation and by the same amount arises naturally with lower confidence bounds as discussed in Example 1. a) 1(0.3. Here A is much larger. a). Closely related to the latter is what we shall call confidence interval loss. a) = [v( P) . a is a function from Z to R} with a(z) representing the prediction we would make if the new unobserved Y had covariate value z. ." respectively. For instance. . . r " I' 2:)a. Q is the empirical distribution of the Zj in ... Quadratic Loss: I(P. Far more important than the choice of action space is the choice of loss function defined as a function I : P X A ~ R+. Other choices that are. [v(P) .. If. and z E Z. a) = (v(P) .3. then a(B. although loss functions.qd(lJ)) and a ~ (a""" ad) are vectors. In estimating a real valued parameter v(P) or q(6') if P is parametrized the most commonly used loss function is. is P.ai. asymmetric loss functions can also be of importance.a)2 (or I(O. as we shall see (Section 5. d'}. ' . M) would be our prediction of response or no response for a male given treatment B. l( P. and z = (Treatment. they usually are chosen to qualitatively reflect what we are trying to do and to be mathematically convenient.
a Estimation.2) and N(I'.3 with X and Y distributed as N(I' + ~. (1.3 we will show how to obtain an estimate (j of a from the data.y is close to zero. the decision is wrong and the loss is taken to equal one. and to decide Ll '# a if our estimate is not close to zero. is a or not. For instance. this leads to the commonly considered I(P.1. a(zn) jT and the vector parameter (I'( z. We next give a representation of the process Whereby the statistician uses the data to arrive at a decision.I loss: /(8. respectively.). In Section 4.1. This 0 ... .2) I if Ix ~ 111 (J >c . in the measurement model. y) = o if Ix::: 111 (J <c (1. The data is a point X = x in the outcome or sample space X. Y n). . . I) 1(8. then a reasonable rule is to decide Ll = 0 if our estimate x .))2. Using 0 means that if X = x is observed. that is. relative to the standard deviation a. if we are asking whether the treatment effect p'!1'ameter 6.2). is a partition of (or equivalently if P E Po or P E P.3. Otherwise. We define a decision rule or procedure to be any function from the sample space taking its values in A.1 suppose returning a shipment with °< 1(8. L(I'(Zj) _0(Z..1'(zn))T Testing. . Y). For the problem of estimating the constant IJ.Section 1. .9.1 loss function can be written as ed.). 0. Testing.1) Decision procedures. .).. the statistician takes action o(x). e ea. The decision rule can now be written o(x. Here we mean close to zero relative to the variability in the experiment. (Zn....1 times the squared Euclidean distance between the prediction vector (a(z... other economic loss functions may be appropriate. where {So. Then the appropriate loss function is 1. We ask whether the parameter () is in the subset 6 0 or subset 8 1 of e.0) sif8<80 Oif8 > 80 rN8..a) = I n n. a) = 1 otherwise (The decision is wrong). ]=1 which is just n.3 The Decision Theoretic Framework 19 the training set (Z 1. I) /(8. In Example 1. we implicitly discussed two estimates or decision rules: 61 (x) = sample mean x and 02(X) = X = sample median.. Of course. in Example 00 defectives results in a penalty of s dollars whereas every defective item sold results in an r dollar replacement cost.3. If we take action a when the parameter is in we have made the correct decision and the loss is zero. a) ~ 0 if 8 E e a (The decision is correct) l(0.
(]"2) errors. The MSE depends on the variance of fJ and on what is called the bias ofv where = E{fl) . Proof. X n are i.6) = Ep[I(P.v) = [V  + [E(v) .3. then Bias(X) 1 n Var(X) .v(P))' (1. Estimation of IJ.. we typically want procedures to have good properties not at just one particular x.. () is the true value of the parameter.. measurements of IJ. . 1 is the loss function. then the loss is l(P. Suppose v _ v(P) is the real parameter we wish to estimate and fJ(X) is our estimator (our decision rule). We illustrate computation of R and its a priori use in some examples. and Performance Criteria Chapter 1 where c is a positive constant called the critical value. the risk or riskfunction: The risk function. If d is the procedure used.v can be thought of as the "longrun average error" of v. and assume quadratic loss. the cross term will he zero because E[fi . and X = x is the outcome of the experiment. We do not know the value of the loss because P is unknown.3. That is. our risk function is called the mean squared error (MSE) of.. and is given by v= MSE{fl) = R(P.) = 2L n ~ n . lfwe use the mean X as our estimate of IJ.E(fi)] = O. but for a range of plausible x·s. (If one side is infinite.1'].i. The other two terms are (Bias fi)' and Var(v).20 Statistical Models. Moreover. 6(X)] as the measure of the perfonnance of the decision rule o(x). for each 8. fi) = Ep(fi(X) . A useful result is Bias(fi) Proposition 1. Goals.) 0 We next illustrate the computation and the a priori and a posteriori use of the risk function.1.. 6 (x)) as a random variable and introduce the riskfunction R(P.3. MSE(fi) I.3) where for simplicity dependence on P is suppressed in MSE. e Estimation. R('ld) is our a priori measure of the performance of d. If we expand the square of the righthand side keeping the brackets intact and take the expected value.d. withN(O." Var(X. Suppose Xl. Example 1. we regard I(P. If we use quadratic loss. Thus. (Continued). R maps P or to R+. so is the other and the result is trivially true. How do we choose c? We need the next concept of the decision theoretic framework.3. i=l . fJ(x)). we turn to the average or mean loss over the sample space. (fi . Write the error as 1 = (Bias fi)' + E(fi)] Var(v).. Thus. .
. as we assumed.1 is useful for evaluating the performance of competing estimators. R( P.3.4.fii/a)[ ~ a . is possible. If we want to be guaranteed MSE(X) < £2 we can do it by taking at least no = <:TolE measurements.t.Section 1.1"1 ~ Eill 2 ) where £. If we have no idea of the value of 0.d. If. Suppose that instead of quadratic loss we used the more natural(l) absolute value loss.1".a 2 .S. We shall derive them in Section 1.1). 0 ~ = a We next give an example in which quadratic loss and the breakup of MSE given in Proposition 1. 1 X n from area A. with mean 0 and variance a 2 (P).X)2. Example 1. analytic. an}). as we discussed in Example 1. an estimate we can justify later. If we only assume. X) = EIX .8)X. a natural guess for fl would be flo.3 The Decision Theoretic Framework 21 and.3. computational difficulties arise even with quadratic loss as soon as we think of estimates other than X. the £.1 2 MSE(X) = R(I". a'. . If we have no data for area A. by Proposition 1.1. Ii ~ (0.. . or numerical and/or Monte Carlo computation. census.6 through a . Then (1. planning is not possible but having taken n measurements we can then estimate 0..5) This harder calculation already suggests why quadratic loss is really favored. age or income. or na 21(n .2. 00 (1.fii 1 af2 00 Itl<p(t)dt = . write for a median of {aI..a 2 then by (A 13. X) = a' (P) / n still. Next suppose we are interested in the mean fl of the same measurement for a certain area of the United States.1")' ~ E(~) can only be evaluated numerically (see Problem 1.2 . if X rnedian(X 1 . The choice of the weights 0.. In fact.3.3.4) can be used for an a priori estimate of the risk of X. we may wantto combine tLo and X = n 1 L~ 1 Xi into an estimator. = X. that the E:i are i.2)1"0 + (O.X) = ". E(i . .23). areN(O.. Suppose that the precision of the measuring instrument cr2 is known and equal to crJ or where realistically it is known to be < crJ. say. Let flo denote the mean of a certain measurement included in the U. for instance by 8 2 = ~ L~ l(X i . or approximated asymptotically.. then for quadratic loss. for instance. The a posteriori estimate of risk (j2/n is.3. (.. in general. ( N(o.6). but for absolute value loss only approximate.2. n (1.2 and 0.8 can only be made on the basis of additional knowledge about demography or the economy.Xn ) (and we. itself subject to random error. of course.4) which doesn't depend on J. . For instance.fii V. .3.i. whereas if we have a random sample of measurements X" X 2. 1) and R(I". Then R(I".X) =.3.
the risk is = 0 and i'> i 0 can only take on the R(i'>.4).04(1'0 .3. D Testing. and we are to decide whether E 80 or E 8 where 8 = 8 0 U 8 8 0 n 8 . then X is optimal (Example 3. and Performance Criteria Chapter 1 i formal Bayesian analysis using a nonnal prior to illustrate a way of bringing in additional knowledge. In the general case X and denote the outcome and parameter space. R(i'>. using MSE. Figure 1. iiJ + 0. Y) = which in the case of 0 .I.2(1'0 . However.64)0" In. Y) = IJ. thus.3. The two MSE curves cross at I' ~ 1'0 ± 30'1 .) ofthe MSE (called the minimax criteria).3.0)P[6(X. Because we do not know the value of fL.6) = 1(i'>. 22 Statistical Models.64 when I' = 1'0. X) = 0" In of X with the minimum relative risk inf{MSE(ii)IMSE(X).n.1')' + (. If I' is close to 1'0.8)'Yar(X) = (.1.I' = 0. Ii) of ii is smaller than the risk R(I'. Y) = 01 if i'> i O.21'0 R(I'. We easily find Bias(ii) Yar(ii) 0. respectively.1 gives the graphs of MSE(ii) and MSE(X) as functions of 1'. if we use as our criteria the maximum (over p. A test " " e e e . The mean squared errdrs of X and ji. I' E R} being 0.6) P[6(X.64)0" In MSE(ii) = . I.3.2) for deciding between i'> two values 0 and 1. the risk R(I'. The test rule (1. neither estimator can be proclaimed as being better than the other. = 0. Goals..1 loss is 01 + lei'>. Figure 1. Y) = I] if i'> = 0 P[6(X. MSE(ii) MSE(X) i .1') (0. Here we compare the performances of Ii and X as estimators of Jl using MSE.81' . I)P[6(X.
say . confidence bounds and intervals (and more generally regions). This is not the only approach to testing. Such a v is called a (1 .a) upper coofidence houod on v.IX > k] + rN8P. usually . and then next look for a procedure with low probability of proclaiming no difference if in fact one treatment is superior to the other (deciding Do = 0 when Do ¥ 0). (1.05 or . Here a is small. Finding good test functions corresponds to finding critical regions with smaIl probabilities of error. a < v(P) . [n the NeymanPearson framework of statistical hypothesis testing.a (1. on the probability of Type I error. Suppose our primary interest in an estimation type of problem is to give an upper bound for the parameter v. we want to start by limiting the probability of falsely proclaiming one treatment superior to the other (deciding ~ =1= 0 when ~ = 0). that is.3.8) for all possible distrihutions P of X.1.6) P(6(X) = 0) if 8 E 8 1 Probability of Type II error.6) Probability of Type [error R(8.3. Thus. the focus is on first providing a small bound. and then trying to minimize the probability of a Type II error.3. in the treatments A and B example.18). an accounting finn examining accounts receivable for a finn on the basis of a random sample of accounts would be primarily interested in an upper bound on the total amount owed. This corresponds to an a priori bound on the risk of a on v(X) viewed as a decision procedure with action space R and loss function.05." in Example 1.o) 0.1 lead to (Prohlem 1.01 or less.3. For instance.[X < k].1) and tests 15k of the fonn. 6(X) = IIX E C]. 8 > 80 .3 The Decision Theoretic Framework 23 function is a decision rule <5(X) that equals 1 On a set C C X called the critical region and equals 0 on the complement of C. we call the error committed a Type I error. If (say) X represents the amount owed in the sample and v is the unknown total amount owed.6) E(6(X)) ~ P(6(X) ~ I) if8 E 8 0 (1. For instance.[X < k[. For instance. it is natural to seek v(X) such that P[v(X) > vi > 1. a > v(P) I. whereas if <5(X) = 0 and we decide () E 80 when in fact 8 E 8 b we call the error a Type II error. "Reject the shipment if and only if X > k. 8 < 80 rN8P. R(8. If <5(X) = 1 and we decide 8 E 8 1 when in fact E 80.8) sP.Section 1. the loss function (1. where 1 denotes the indicator function.3. I(P. the risk of <5(X) is e R(8.7) Confidence Bounds and Intervals Decision theory enables us to think clearly about an important hybrid of testing and estimation.
In the context of the foregoing examples.3.5. we could leave the component in. for some constant c > O. though. For instance = I(P. as we shall see in Chapter 4.: Comparison of Decision Procedures In this section we introduce a variety of concepts used in the comparison of decision procedures. a) (Drill) aj (Sell) a2 (Partial rights) a3 . . Suppose we have two possible states of nature. . or wait and see. We conclude by indicating to what extent the relationships suggested by this picture carry over to the general decision theoretic model. a2. The decision theoretic framework accommodates by adding a component reflecting this. we could drill for oil.1.a ·1 for all PEP. It is clear.a<v(P). What is missing is the fact that. Goals.3. though upper bounding is the primary goal. and Performance Criteria Chapter 1 1 . Suppose the following loss function is decided on TABLE 1. the connection is close.8) and then see what one can do to control (say) R( P. aJ. The loss function I(B. general criteria for selecting "optimal" procedures. Ii) = E(Ii(X) v(P))+. Suppose that three possible actions. administer drugs. rather than this Lagrangian form. where x+ = xl(x > 0). an asymmetric estimation type loss function.a>v(P) . The 0. replace it. or sell partial rights.3. 1.a)  a~v(P) . which we represent by Bl and B .3. we could operate. thai this fonnulation is inadequate because by taking D 00 we can achieve risk O. and the risk of all possible decision procedures can be computed and plotted. For instance. We next tum to the final topic of this section. a patient either has a certain disease or does not. Typically. or repair it. it is customary to first fix a in (1. e Example 1. in fact it is important to get close to the truthknowing that at most 00 doJlars are owed is of no use. sell the location.2 . and so on. A has three points. I I (Oil) (No oil) Bj B2 0 12 10 I 5 6 \ . a component in a piece of equipment either works or does not work. We shall go into this further in Chapter 4.24 Statistical Models. are available.1 nature makes it resemble a testing loss function and. a 2 certain location either contains oil or does not. c ~1 .. The same issue arises when we are interested in a confidence interval ~(X) l v(X) 1for v defined by the requirement that • P[v(X) < v(P) < Ii(X)] > 1 . We shall illustrate some of the relationships between these ideas using the following simple example in which has two members. and a3.
6) + 1(0.3 and formation 1 with frequency 0.4. a. (R(O" 5). .)) are given in Table 1. whereas if there is no oil.7 0.3. 1 9. a3 4 a2 a. and so on. 0 . a." Criteria for doing this will be introduced in the next subsection.5 8. 5 a. TABLE 1.3. 2 a.4) ~ 7.3 The Decision Theoretic Framework 25 Thus.)) 1 1 R(B .5(X))] = I(B. 6 a.5 3 7 1. Risk points (R(B 5.3 0. a. if X = 1. Possible decision rules 5i (x) .6 and 0. R(B k ." 02 corresponds to ''Take action Ul> if X = 0.).3. The risk of 5 at B is R(B.." and so on.3. we can represent the whole risk function of a procedure 0 by a point in kdimensional Euclidean space. e TABLE 1. take action U2. the loss is zero. 3 a. The risk points (R(B" 5.5. We list all possible decision rules in the following table. the loss is 12.5.6 0.7) = 7 12(0. R(B2. .3.) R(B.3.4.2.5 4.a2)P[5(X) ~ a2] + I(B.4 8 8.5 9.6 3 3.Section 1. The frequency function p(x.4 10 6 6.52 ) + 10(0. B) given by the following table TABLE 1.5) E[I(B. If is finite and has k members.4 = 1. e. I x=O x=1 a.6 4 5 1 " 3 5.).3) For instance. formations 0 and 1 occur with frequencies 0. Next.0 9 5 6 It remains to pick out the rules that are "good" or "best.) 0 12 2 7 7.2 (Oil) (No oil) e..). . R(B" 52) R(B2 . B2 Thus. a3 7 a3 a. 8 a3 a2 9 a3 a3 Here.7.. it is known that formation 0 occurs with frequency 0. and when there is oil..)P[5(X) = ad +1(B. 5. an experiment is conducted to obtain information about B resulting in the random variable X with possible values coded as 0. whereas if there is no oil and we drill.6.. R(025. X may represent a certain geological formation. if there is oil and we drill.a3)P[5(X) = a3]' 0(0.2 for i = 1. and frequency function p(x. . a. 01 represents "Take action Ul regardless of the value of X. 1. 5)) and if k = 2 we can plot the set of all such points obtained by varying 5. i Rock formation x o I 0.4 and graphed in Figure 1.
26 Statistical Models. Symmetry (or invariance) restrictions are discussed in Ferguson (1967). Researchers have then sought procedures that improve all others within the class. Here R(8 S.9 .3 Bayes and Minimax Criteria The difficulties of comparing decision procedures have already been discussed in the special contexts of estimation and testing.7 . and Performance Criteria Chapter 1 R(8 2 . for instance.4 . neither improves the other.2 . Section 1.3.9. (2) A second major approach has been to compare risk functions by global crite . We say that a procedure <5 improves a procedure Of if. R( 8" Si)). The risk points (R(8 j .Si) Figure 1. For instance. It is easy to see that there is typically no rule e5 that improves all others. Extensions of unbiasedness ideas may be found in Lehmann (1997. Si). and only if. Usually.) > R( 8" S6)' The problem of selecting good decision " procedures has been attacked in a variety of ways. 1. Consider.. i (1) Narrow classes of procedures have been proposed using criteria such as con siderations of symmetry. if we ignore the data and use the estimate () = 0. or level of significance (for tests).. Goals.S') for all () with strict inequality for some (). We shall pursue this approach further in Chapter 3.5). a5). we obtain M SE(O) = (}2.8 5 . . unbiasedness (for estimates and tests).5 0 0 5 10 R(8 j . R(8. and S6 in our example.) < R( 8" S6) but R( 8" S.6 .) I • 3 10 . S.3.S) < R(8. in estimating () E R when X '"'' N((). . The absurd rule "S'(X) = 0" cannot be improved on at the value 8 ~ 0 because Eo(S'(X)) = 0 ifand only if O(X) = O.S. i = 1.   .2. if J and 8' are two rules.
To illustrate. if () is discrete with frequency function 'Tr(e). we need not stop at this point. Note that the Bayes approach leads us to compare procedures on the basis of. therefore. We postpone the consideration of posterior analysis. and only if. to Section 3.9).9) The second preceding identity is a consequence of the double expectation theorem (B. In this framework R(O. given by reo) = E[R(O. The method of computing Bayes procedures by listing all available <5 and their Bayes risk is impracticable in general. Then we treat the parameter as a random variable () with possible values ()1.92 5. the expected loss. 0) isjust E[I(O.38 9.) ~ 0. 0)11(0). but can proceed to calculate what we expect to lose on the average as () varies. r( 09) specified by (1.5 gives r( 0.48 7.).3.o)1I(O)dO. O(X)) I () = ()].4 8 4. and reo) = J R(O. (1. which attains the minimum Bayes risk l that is.2.3. 11(0. We shall discuss the Bayes and minimax criteria. Bayes and maximum risks oftbe procedures of Table 1. + 0. (1. if we use <5 and () = (). suppose that in the oil drilling example an expert thinks the chance of finding oil is . it has smaller Bayes risk. If we adopt the Bayesian point of view.8 6 In the Bayesian framework 0 is preferable to <5' if. Recall that in the Bayesian model () is the realization of a random variable or vector () and that Pe is the conditional distribution of X given 0 ~ O.5. From Table 1.8R(0. . . .) maxi R(0" Oil.3.8.o)] = E[I(O. . R(0" Oi)) I 9. This quantity which we shall call the Bayes risk of <5 and denote r( <5) is then.4 5 2.5 7 7.6 4 4.7 6.6 3 8. r(o.0) Table 1.2.8 10 6 3.9 8.. If there is a rule <5*..3.o(x))].02 8.3.10) r( 0) ~ 0. . the only reasonable computational method.3 The Decision Theoretic Framework 27 ria rather than on a pointwise basis.I. TABLE 1.0).20) in Appendix B. = 0.5 we for our prior.3. ()2 and frequency function 1I(OIl The Bayes risk of <5 is.5 9 5. such that • see that rule 05 is the unique Bayes rule then it is called a Bayes rule. r(o") = minr(o) reo) = EoR(O.6 12 2 7.Section 1..2. Bayes: The Bayesian point of view leads to a natural global criterion.3.3.2R(0.
For instance. is called minimax (minimizes the maximum risk).4. in Example 1. R(O. Minimax: Instead of averaging the risk as the Bayesian does we can look at the worst possible risk. 8)].(2) We briefly indicate "the game of decision theory. < sup R(O. the sel of all decision procedures. Nature's intentions and degree of foreknowledge are not that clear and most statisticiaqs find the minimax principle too conservative to employ as a general rule. But this is just Bayes comparison where 1f places equal probability on f}r and ()2. The maximum risk of 0* is the upper pure value of the game.3.20 if 0 = O .J). 6). J'). which makes the risk as large as possible.5. I Randomized decision rules: In general. e 4. For instance. a weight function for averaging the values of the function R( B. we prefer 0" to 0"'. The principle would be compelling. Of course. Students of game theory will realize at this point that the statistician may be able to lower the maximum risk without requiring any further information by using a random mechanism to determine which rule to emplOy. Goals. and Performance Criteria Chapter 1 if (J is continuous with density rr(8). . . It aims to give maximum protection against the worst that can happen. Nevertheless l in many cases the principle can lead to very reasonable procedures. J)) we see that J 4 is minimax with a maximum risk of 5. if and only if.28 Statistical Models.5 we might feel that both values ofthe risk were equally important.3. This is. Such comparisons make sense even if we do not interpret 1r as a prior density or frequency. To illustrate computation of the minimax rule we tum to Table 1. Nature's choosing a 8. sup R(O. The criterion comes from the general theory of twoperson zero sum games of von Neumann. Our expected risk would be. For simplicity we shall discuss only randomized .3. Player II then pays Player I. 2 The maximum risk 4. J). we toss a fair coin and use 04 if the coin lands heads and 06 otherwise. . but only ac. which has . suppose that.J') ." Nature (Player I) picks a independently of the statistician (Player II). This criterion of optimality is very conservative. who picks a decision procedure point 8 E J from V. From the listing ofmax(R(O" J). 4.4. = infsupR(O. in Example 1. if V is the class of all decision procedures (nonrandomized).75 is strictly less than that of 04. a randomized decision procedure can be thought of as a random experiment whose outcomes are members of V. R(02. supR(O. It is then natural to compare procedures using the simple average ~ [R( fh 10) + R( fh. if the statistician believed that the parameter value is being chosen by a malevolent opponent who knows what decision procedure will be used.75 if 0 = 0. J) A procedure 0*.
3..d) anwng all randomized procedures.(02 ). Jj ). i = 1.). TWo cases arise: (I) The tangent has a unique point of contact with a risk point corresponding to a nonrandomized rule. Ji ))...J.12) r(J) = LAiEIR(IJ. As in Example 1. R(O . This is thai line with slope .3. If the randomized procedure l5 selects l5i with probability Ai.J.3).3.13) intersects S.3. A point (Tl.R(O..14) .R(OI.3. For instance.(0 d = i = 1 . . we then define q R(O. i=l (1. A randomized minimax procedure minimizes maxa R(8.\.10).J): J E V'} where V* is the set of all procedures. 0 < i < 1. (1. .i/ (1 . 1 q.3.3). We now want to study the relations between randomized and nonrandomized Bayes and minimax procedures in the context of Example 1. R(O . By (1. given a prior 7r on q e. Ai >0.Ji )] i=l A randomized Bayes procedure d. AR(02. when ~ = 0.3. this point is (10. J)) and consider the risk sel 2 S = {(RrO"J)..Ji)' r2 = tAiR(02.r2):r.J) ~ L.r). All points of S that are on the tangent are Bayes. = ~A.).11) Similarly we can define.Section 1.2. minimizes r(d) anwng all randomized procedures.13) As c varies. the Bayes risk of l5 (1. i = 1 . S= {(rI.1).A)R(O"Jj ). That is.3. (1. then all rules having Bayes risk c correspond to points in S that lie on the line irl + (1 . 9 (Figure 2 1.i)r2 = c. Jj . Finding the Bayes rule corresponds to finding the smallest c for which the line (1. S is the convex hull of the risk points (R(0" Ji ). .5.13) defines a family of parallel lines with slope i/(1 . l5q of nOn randomized procedures. Ji ) + (1 .3. J). Ei==l Al = 1.5.R(02.3.3 The Decision Theoretic Framework 29 procedures that select among a finite set (h 1 • • • . (1..3. T2) on this line can be written AR(O"Ji ) + (1.. We will then indicate how much of what we learn carries over to the general case. which is the risk point of the Bayes rule Js (see Figure 1. (2) The tangent is the line connecting two unonrandomized" risk points Ji . ~Ai=1}.. including randomiZed ones. we represent the risk of any procedure J by the vector (R( 01 .'Y) that is tangent to S al the lower boundary of S. .A)R(02. If .
2 The point where the square Q( c') defined by (1.3. the first point of contact between the squares and S is the .3. where 0 < >. = 0).3.11) corresponds to the values Oi with probability>.3. We can choose two nonrandomized Bayes rules from this class. To locate the risk point of the nilhimax rule consider the family of squares. as >. 0 < >. = r2. (1. and.16) whose diagonal is the line r. Then Q( c*) n S is either a point or a horizontal or vertical line segment.30 Statistical Models.>').3. the first square that touches S).15) Each one of these rules. by (1.• all points on the lowet boundary of S that have as tangents the y axis or lines with nonpositive slopes). < 1.3. R(B . < 1.)') of the line given by (1. thus.e.. In our example. namely Oi (take'\ = 1) and OJ (take>. (1. (\)).9. . Goals.3. ranges from 0 to 1. . OJ with probability (1 . and Performance Criteria Chapter 1 r2 I 10 5 Q(c') o o 5 10 Figure 1..13). It is the set of risk points of minimax rules because any point with smaller maximum risk would belong to Q(c) n S with c < c* contradicting the choice of c*.. the set B of all risk points corresponding to procedures Bayes with respect to some prior is just the lower left boundary of S (Le.16) touches S is the risk point of the minimax rule. See Figure 1. is Bayes against 7I".3. Because changing the prior 1r corresponds to changing the slope . i = 1. The convex hull S of the risk poiots (R(B!.3. oil./(1 . Let c' be the srr!~llest c for which Q(c) n S i 0 (i.
.4 < 7.3 The Decision Theoretic Framework 31 intersection between TI = T2 and the line connecting the two points corresponding to <54 and <56 . {(x.3. they are Bayes procedures. (3) Randomized Bayes procedures are mixtures of nonrandomized ones in the Sense of (1. R(8" 0)) : 0 E V'} where V" is the set of all randomized decision procedures.59. we can define the risk set in general as s ~ {( R(8 . From the figure it is clear that such points must be on the lower left boundary. all rules that are not inadmissible are called admissible. if and only if. A rule <5 with risk point (TIl T2) is admissible. under some conditions.0).\ + 6. agrees with the set of risk points of Bayes procedures. Y < T2} has only (TI 1 T2) in common with S.3. j = 6 and .e. ..\ ~ 0. which yields .\) = 5..4. (b) The set B of risk points of Bayes procedures consists of risk points on the lower boundary of S whose tangent hyperplanes have normals pointing into the positive quadrant. e = {fh. . 1967. or equivalently. 1 . . > a for all i.3.5(1 . Thus. 04) = 3 < 7 = R( 81 . (d) All admissible procedures are Bayes procedures. (e) If a Bayes prior has 1f(Oi) to 1f is admissible. 0. A decision rule <5 is said to be inadmissible if there exists another rule <5' such that <5' improves <5. the set of all lower left boundary points of S corresponds to the class of admissible rules and. R( 81 . y) in S such that x < Tl and y < T2. thus. all admissible procedures are either Bayes procedures or limits of .14) with i = 4.) and R( 8" 04) = 5.V) : x < TI. The following features exhibited by the risk set by Example 1.14).\).5 can be shown to hold generally (see Ferguson.)).. There is another important concept that we want to discuss in the context of the risk set. However. Using Table 1. for instance. Naturally.. for instance). the minimax rule is given by (1.3. l Ok}. To gain some insight into the class of all admissible procedures (randomized and nOnrandomized) we again use the risk set. that <5 2 is inadmissible because 64 improves it (i. (a) For any prior there is always a nonrandomized Bayes procedure. If e is finite.Section 1.4 we can see.\ the solution of From Table 1. if there is a randomized one. if and only if.6 ~ R( 8" J. this equation becomes 3. there is no (x.3. In fact. (c) If e is finite(4) and minimax procedures exist.4'\ + 3(1 . then any Bayes procedure corresponding If e is not finite there are typically admissible procedures that are not Bayes.
4 PREDICTION . The frame we shaH fit them into is the following. The MSPE is the measure traditionally used in the . One reasonable measure of "distance" is (g(Z) . decision rule. we tum to the mean squared prediction error (MSPE) t>2(Y. A meteorologist wants to estimate the amount of rainfall in the coming spring. Z is the information that we have and Y the quantity to be predicted. For example. Since Y is not known.2 presented important situations in which a vector z of 00variates can be used to predict an unseen response Y. I II .Yj'. Z would be the College Board score of an entering freshman and Y his or her firstyear grade point average..g(Z») = E[g(Z) ~ YI 2 or its square root yE(g(Z) . I The prediction Example 1. and prediction. We introduce the decision theoretic foundation of statistics inclUding the notions of action space. A stockholder wants to predict the value of his holdings at some time in the future on the basis of his past experience with the market and his portfolio.4). An important example is the class of procedures that depend only on knowledge of a sufficient statistic (see Ferguson. in the college admissions situation.32 . although it usually turns out that all admissible procedures of interest are indeed nonrandomized. he wants to predict the firstyear grade point averages of entering freshmen on the basis of their College Board scores. Other theorems are available characterizing larger but more manageable classes of procedures. ]967. The basic biasvariance decomposition of mean square error is presented. testing. These remarkable results. The basic global comparison criteria Bayes and minimax are presented as well as a discussion of optimality by restriction and notions of admissibility. Bayes procedures (in various senses). Goals. In terms of our preceding discussion. and risk through various examples including estimation. A college admissions officer has available the College Board scores at entrance and firstyear grade point averages of freshman classes for a period of several years. A government expert wants to predict the amount of heating oil needed next winter.y)2. Section 3. Similar problems abound in every field. We stress that looking at randomized procedures is essential for these conclusions. are due essentially to Waldo They are useful because the property of being Bayes is ea·der to analyze than admissibility. Using this information. The joint distribution of Z and Y can be calculated (or rather well estimated) from the records of previous years that the admissions officer has at his disposal. Statistical Models. Here are some further examples of the kind of situation that prompts our study in this section. and Performance Criteria Chapter 1 '. confidence bounds.I 'jlI . . Summary. • 1. We assume that we know the joint probability distribution of a random vector (or variable) Z and a random variable Y. Next we must specify what close means. ranking.3. loss function. we referto Blackwell and Girshick (1954) and Ferguson (1967). For more information on these topics. which include the admissible rules. We want to find a function 9 defined on the range of Z such that g(Z) (the predictor) is "close" to Y. at least in their original fonn. which is the squared prediction error when g(Z) is used to predict Y. at least when procedures with the same risk function are identified.
1957) presuppose it. E(Y .3) If we now take expectations of both sides and employ the double expectation theorem (B. or equivalently.4. EY' < 00 Y .Section 1. In this section we bjZj .c)' < 00 for all c.g(z))' I Z = z]. EY' = J.L and the lemma follows. E(Y . exists.4.4.c)' ~ Var Y + (c _ 1')'. for example.16). we can find the 9 that minimizes E(Y . L1=1 Lemma 1. then either E(Y g(Z))' = 00 for every function g or E(Y . The class Q of possible predictors 9 may be the nonparametric class QN P of all 9 : R d Jo R or it may be to some subset of this class.1 assures us that E[(Y .1.4 Prediction 33 mathematical theory of prediction whose deeper results (see. By the substitution theorem for conditional expectations (B. 0 Now we can solve the problem of finding the best MSPE predictor of Y.711). We see that E(Y . Let (1. Grenander and Rosenblatt. given a vector Z.c = (Y 1') + (I' .) = 0 makes the cross product term vanish.g(z))' IZ = z] = E[(Y I'(z))' I Z = z] + [g(z) I'(z)]'. see Example 1.4. (1.1) < 00 if and only if E(Y .1.1) follows because E(Y p. that is.c) implies that p.6.5 and Section 3. we have E[(Y .4. In this situation all predictors are constant and the best one is that number Co that minimizes B(Y . Because g(z) is a constant. The method that we employ to prove our elementary theorems does generalize to other measures of distance than 6. g(Z)) such as the mean absolute error E(lg( Z) .c)2 as a function of c.4.4. in which Z is a constant.2) I'(z) = E(Y I Z = z).C)2 has a unique minimum at c = J.4. and by expanding (1.g(Z))' I Z = z] = E[(Y .4.2 where the problem of MSPE prediction is identified with the optimal decision problem of Bayesian statistics with squared error loss.YI) (Problems 1.4. we can conclude that Theorem 1. 1.I'(Z))' < E(Y . consider QNP and the class QL of linear predictors of the form a + We begin the search for the best predictor in the sense of minimizing MSPE by considering the case in which there is no covariate information.4) . Just how widely applicable the notions of this section are will become apparent in Remark 1.4.(Y.25. Proof. ljZ is any random vector and Y any random variable.g(Z))2.3.g(Z))' (1. see Problem 1. (1.4.l.L = E(Y).c)2 is either oofor all c or is minimized uniquely by c In fact. See Remark 1.20). Lemma 1. when EY' < 00.4.
then (1.4.4.34 Statistical Models.E(v)IIU .4.4.4. E{h(Z)<1 E{ E[h(Z)< I Z]} E{h(Z)E[Y I'(Z) I Z]} = 0 because E[Y I'(Z) I Z] = I'(Z) I'(Z) = O. Goals. .6) and that (1.1(c) is equivalent to (1. that is. 1. and Performance Criteria Chapter 1 for every 9 with strict inequality holding unless g(Z) best MSPE predictor.E(Y I z)]' I z).4. Write Var(Y I z) for the variance of the condition distribution of Y given Z = z.6).4. Property (1. then by the iterated expectation theorem.g(Z)' = E(Y .I'(Z))' + E(g(Z) 1'(Z»2 ( 1.E(U)] = O. 0 Note that Proposition 1. (1.4. we can derive the following theorem.5) follows from (a) because (a) implies that the cross product term in the expansionof E ([Y 1'(z)J + [I'(z) g(z)IP vanishes.5) becomes. (1. = O.4.E(V)J Let € = Y . Infact. then we can write Proposition 1. then Var(E(Y I Z)) < Var Y. Suppose that Var Y < 00. Theorem 1.6) is linked to a notion that we now define: Two random variables U and V with EIUVI < 00 are said to be uncorrelated if EIV . then (a) f is uncorrelated with every function ofZ (b) I'(Z) and < are uncorrelated (c) Var(Y) = Var I'(Z) + Var <.7) IJVar Y • < 00 strict inequality ooids unless 1 Y = E(Y I Z) (1.5) An important special case of (1. let h(Z) be any function of Z. As a consequence of (1. = I'(Z).20). and recall (B.2.4. Equivalently U and V are uncorrelated if either EV[U E(U)] = 0 or EUIV . I'(Z) is the unique E(Y .4. Var(Y I z) = E([Y . so is the other. Vat Y = E(Var(Y I Z» + Var(E(Y I Z).6) which is generally valid because if one side is infinite.4. If E(IYI) < 00 but Z and Y are otherwise arbitrary. To show (a). when E(Y') < 00. Properties (b) and (c) follow from (a).8) or equivalently unless Y is a function oJZ.4.1.5) is obtained by taking g(z) ~ E(Y) for all z. which will prove of importance in estimation theory. That is.Il(Z) denote the random prediction error. Proof.
25 1 0. also. .4. 2.:.45 ~ 1 The MSPE of the best predictor can be calculated in two ways. the best predictor is E (30. E(Var(Y I Z)) ~ E(Y . and only if. if we are trying to guess.45. Within any given month the capacity status does not change.05 0.4.4 Prediction 35 Proof.10.1. o Example 1. 2. I z) = E(Y I Z). The assertion (1.9) this can hold if. We find E(Y I Z = 1) = L iF[Y = i I Z = 1] ~ 2. y) = p (Z = Z 1 Y = y) of the number of shutdowns Y and the capacity state Z of the line for a randomly chosen day. half.025 0.10 0.05 0. The following table gives the frequency function p( z. or 3 shutdowns due to mechanical failure.25 0.y) pz(z) 0.7) follows immediately from (1. But this predictor is also the right one.lI. the average number of failures per day in a given month.10 0.4.E(Y I Z))' ~ 0 By (A.E(Y I Z = z))2 p (z.25 0. or 3 failures among all days.6).05 0.7) can hold if.50 z\y 1 0 0. (1.15 3 0. In this case if Yi represents the number of failures on day i and Z the state of the assembly line.'" 1 Y.10 I 0. Each day there can be 0.25 2 0.20. I.8) is true.E(Y I Z))' =L x 3 ~)y y=o . We want to predict the number of failures for a given day knowing the state of the assembly line for the month.10 0.Section 1. or quarter capacity. E (Y I Z = ±) = 1.885. 1. y) = 0.4. E(Y . Equality in (1.1 2.025 0.15 I 0. An assembly line operates either at full. as we reasonably might.30 I 0. The row sums of the entries pz (z) (given at the end of each row) represent the frequency with which the assembly line is in the appropriate capacity state.4. and only if. p(z.30 py(y) I 0. These fractional figures are not too meaningful as predictors of the natural number values of Y. The first is direct. whereas the column sums py (y) yield the frequency of 0. i=l 3 E (Y I Z = ~) ~ 2.
(1 .L[E(Y I Z y .E(Y I Z p') (1. whereas its magnitude measures the degree of such dependence.9) I . E((Y .11) I The line y = /lY + p(uy juz)(z ~ /lz).2. If (Z.4. o tribution of Y given Z = z is N City + p(Uy j UZ ) (z . Because of (1. = Z))2 I Z = z) = u~(l _ = u~(1 _ p'). can reasonably be thought of as a measure of dependence.4. Thus. " " is independent of z. 2 I .10) I I " The qualitative behavior of this predictor and of its MSPE gives some insight into the structure of the bivariate normal distribution. a'k.p')).4.4.6) we can also write Var /lO(Z) p = VarY . the best predictor of Y using Z is the linear function Example 1. this quantity is just rl'. (1. p < 0 indicates that large values of Z tend to go with small values of Y and we have negative dependence. and Performance Criteria Chapter 1 The second way is to use (1./lz).2 tells us that the conditional dis /lO(Z) Because .36 Statistical Models.4. for this family of distributions the sign of the correlation coefficient gives the type of dependence between Z and Y. Similarly. In the bivariate normal case. E(Y . u~. Regression toward the mean.4. If p = 0.4. the best predictor is just the constant f. The term regression was coined by Francis Galton and is based on the following observation. Suppose Y and Z are bivariate normal random variables with the same mean . Theorem B. the MSPE of our predictor is given by. The larger this quantity the more dependent Z and Yare. E(Y ~ E(Y I Z))' VarY ~ Var(E(Y I Z)) E(Y') ~ E[(E(Y I Z))'] I>'PY(Y) .LY. = z)]'pz(z) 0.6) writing. If p > O. .J. which corresponds to the best predictor of Y given Z in the bivariate normal model. O"~. . Goals. the predictor is a monotone increasing function of Z indicating that large (small) values of Y tend to be associated with large (small) values of Z. Y) has a N(Jlz l f. which is the MSPE of the best constant predictor.y as we would expect in the case of independence. Therefore. is usually called the regression (line) of Y on Z.E(Y I Z))' (1. The Bivariate Normal Distribution. One minus the ratio of the MSPE of the best predictor of Y given Z to Var Y. = /lY + p(uyjuz)(Z ~ /lZ).885 as before. p) distribution.
Yn ) from the population. In Galton's case. the distribution of (Z. y)T has a (d + !) multivariate normal. . Thus. Theorem B. (Zn. We shall see how to do this in Chapter 2.11).4. and positive correlation p. We write Mee = ' ~! _ PZy ElY /lO(Z))' ~ Var l"o(Z) Var Y Var Y . Note that in practice.6) in which I' (I"~..COV(Zd.. ttd)T and suppose that (ZT.3. I"Y) T.Section 1. I"Y = E(Y).. where the last identity follows from (1. N d +1 (I'. Let Z = (Zl. or the average height of sbns whose fathers are the height Z.6). in particular in Galton's studies. Ezy O"yy E zz is the d x d variancecovariance matrix Var(Z) .8. . E).4. consequently. (1 . .." This is compensated for by "progression" toward the mean among the sons of shorter fathers and there is no paradox. By (1. distribution (Section B.p)1" + pZ) = p'(T'.4 Prediction 37 tt. .5 states that the conditional distribution of Y given Z = z is N(I"Y +(zl'zf. the best predictor E(Y I Z) ofY is the linear function (1.4. . variance cr2. . usual correlation coefficient p = UZy / crfyCT lz when d = 1. Then the predicted height of the son. coefficient ofdetennination or population Rsquared. so the MSPE of !lo(Z) is smaller than the MSPE of the constant predictor J1y.4. tall fathers tend to have shorter sons. E zy = (COV(Z" Y). " Zd)T be a d x 1 covariate vector with mean JLz = (ttl. Thus. Y» T T ~ E yZ and Uyy = Var(Y). 0 Example 1.p)tt + pZ. This quantity is called the multiple correlation coefficient (MCC). Y) is unavailable and the regression line is estimated on the basis of a sample (Zl.. is closer to the population mean of heights tt than is the height of the father.6. . the MCC equals the square of the . should.. . YI ). there is "regression toward the mean. The quadratic fonn EyzEZ"iEzy is positive except when the joint nonnal distribution is degenerate. uYYlz) where. be less than that of the actual heights and indeed Var((! . these were the heights of a randomly selected father (Z) and his son (Y) from a large human population. The variability of the predicted value about 11.. The Multivariate Normal Distribution.12) with MSPE ElY l"o(Z)]' ~ E{EIY l"o(zlI' I Z} = E(uYYlz) ~ Uyy  EyzEziEzy.8 = EziEzy anduYYlz ~ uyyEyzEziEzy. One minus the ratio of these MSPEs is a measure of how strongly the covariates are associated with Y.
ba) + Zba]}' to get ba )' .5%.209.39 l.l Proof. Suppose that E(Z') and E(Y') are finite and Z and Y are not constant.:.13) . Y) a I VarZ.3%.4.9% and 39. and E(Y _ boZ)' = E(Y') _ [E(ZY)]' E(Z') (1..)T. Then the unique best zero intercept linear predictor i~ obtained by taking E(ZY) b=ba = E(Z') . when the distribution of (ZT.y = . respectively.. 29W· Then the strength of association between a girl's height and those of her mother and father.2. l. The natural class to begin with is that of linear combinations of components of Z. The best linear predictor. Y. let Y and Z = (Zl. We expand {Y . y)T is unknown.393..y = . respectively. we can avoid both objections by looking for a predictor that is best within a class of simple predictors..4. Let us call any random variable of the form a + bZ a linear predictor and any such variable with a = 0 a zero intercept linear predictor. are P~.bZ}' = E(Y)  b E(Z) 1· = (Y  [Z(b . (Z~. The percentage reductions knowing the father's and both parent's heights are 20.4.1. The problem of finding the best MSPE predictor is solved by Theorem 1. . Goals. Z2)T be the heights in inches of a IOyearold girl and her parents (Zl = mother's height. whereas the unique best linear predictor is ILL (Z) = al + bl Z where b = Cov(Z.335.~ ~:~~).3. . yn)T See Sections 2. Two difficulties of the solution are that we need fairly precise knowledge of the joint distribution of Z and Y in order to calcujate E(Y I Z) and that the best predictor may be a complicated function of Z. knowing the mother's height reduces the mean squared prediction error over the constant predictor by 33.bZ)' = E(Y') + E(Z')(b  Therefore.38 Statistical Models. E(Y . E(Y . and parents. Z2 = father's height). In words. P~.zy ~ (407.bZ)' is uniquely minintized by b = ba. What is the best (zero intercept) linear predictor of Y in the sense of minimizing MSPE? The answer is given by: Theorem 1. In practice. the hnear predictor l'a(Z) and its MSPE will be estimated using a 0 sample (Zi. and Performance Criteria Chapter 1 For example.zz ~ (. We first do the onedimensional case. p~y = .1 and 2.E(Z')b~. If we are willing to sacrifice absolute excellence. Suppose(l) that (ZT l Y) T is trivariate nonnal with Var(Y) = 6.
and only if.b(Z .a .boZ)' = 0. and b = b1 • because. Best Multivariate Linear Predictor. = I'Y + (Z  J.1).E(Z) and Y .a .1.5).4.E(Y)) . On the other hand.I'd Z )]' = 1 05 ElY I'(Z)j' . 0 Note that if E(Y I Z) is of the form a + bZ.I1. by (1.bZ)2 is uniquely minimized by taking a ~ E(Y) .a .bZ) + (E(Y) . In that example nothing is lost by using linear prediction.boZ)' > 0 is equivalent to the CauchySchwarz inequality with equality holding if.E(Y)]) = EziEzy.t and covariance E of X.4.4. E[Y . then the Unique best MSPE predictor is I'dZ) Proof.E(Z)J[Y . I 1.a)'.16) directly by calculating E(Y .l). in Example 1. b) X ~ (ZT.l7) in the appendix.bE(Z) . whatever be b.14) Yl T Ep[Y . E(Y .E(Z)IIZ .1 the best linear predictor and best predictor differ (see Figure 1. . Theorem 1.I'l(Z)1' depends on the joint distribution P of only through the expectation J. then a = a. (1.4.bZ)2 we see that the b we seek minimizes E[(Y . That is. OUf linear predictor is of the fonn d I'I(Z) = a + LbjZj j=l = a+ ZTb [3 = (E([Z . Note that R(a. whiclt corresponds to Y = boZo We could similarly obtain (A. E(Y .E(Y) to conclude that b. We can now apply the result on zero intercept linear predictors to the variables Z ..E(Z)f[Z .a. it must coincide with the best linear predictor.E(ZWW ' E([Z .bE(Z).E(Z)J))1 exist.Section 1. 0 Remark 1. is the unique minimizing value. This is because E(Y .bZ)' ~ Var(Y .4.E(Z))]'.4. From (1.tzf [3.Z)'. E(Y .13) we obtain tlte proof of the CauchySchwarz inequality (A. Substituting this value of a in E(Y .4. if the best predictor is linear. Therefore.4 Prediction 39 To prove the second assertion of the theorem note that by (lA.4. This is in accordance with our evaluation of E(Y I Z) in Example 1.2.4. If EY' and (E([Z .b. A loss of about 5% is incurred by using the best linear predictor. Let Po denote = .
75 1.3. that is. ~ Y . p~y = Corr2 (y. and R(a.4. See Problem 1. b) is minimized by (1. The line represents the best linear predictor y = 1.50 0. A third approach using calculus is given in Problem 1. .14).4. j = 0.40 Statistical Models.. However. b) = Ro(a. I"(Z) = E(Y I Z) = a + ZT 13 for unknown Q: E R and f3 E Rd.3. . case the multiple correlation coefficient (MCC) or coefficient of determination is defined as the correlation between Y and the best linear predictor of Y.. Thus.I'(Z) and each of Zo.19. Ro(a.4.05 + 1. not necessarily nonnal. . By Example 1. We want to express Q: and f3 in terms of moments of (Z.4. . 0 Remark 1. b) is miuimized by (1. the multivariate uormal. Goals. Set Zo = 1. the MCC gives the strength of the linear relationship between Z and Y. thus. 2 • • 1 o o 0.4. R(a. Y). 0 Remark 1. distribution and let Ro(a. N(/L. . b) = Epo IY ..15) .4. that is.25 0. By Proposition 1.00 z Figure 1. Suppose the mooel for I'(Z) is linear. Zd are uncorrelated.45z.2.1.14).I'LCZ)).. b).4.4. and Performance Criteria Chapter 1 y .17 for an overall measure of the strength of this relationship. Because P and Po have the same /L and E.4.4.3 to d > 1. .. E(Zj[Y .4.4 by extending the proof of Theorem 1.4. . The three dots give the best predictor.4. 0 Remark 1.(a + ZT 13)]) = 0.1(a). . our new proof shows how secondmoment results sometimes can be established by "connecting" them to the nannal distribution. In the general. (1.I'I(Z)]'. We could also have established Theorem 1.d. E). .4.
If T assigns the same value to different sample points. we would clearly like to separate out any aspects of the data that are irrelevant in the context of the model and that may obscure our understanding of the situation..16) (1.I'(Z) is orthogonal to I'(Z) and to I'dZ).14) (Problem 1.g(Z»): 9 E g).5 Sufficiency 41 Solving (1. Because the multivariate normal model is a linear model.5. the optimal MSPE predictor is the conditional expected value of Y given Z.4. T(X) = X loses information about the Xi as soon as n > 1. It is shown to coincide with the optimal MSPE predictor when the model is left general but the class of possible predictors is restricted to be linear. Consider the Bayesian model of Section 1. then by recording or taking into account only the value of T(X) we have a reduction of the data.4. g(Z) and h(Z) are said to be orthogonal if at least one has expected value zero and E[g(Z)h(Z)] ~ O.4. The notion of mean squared prediction error (MSPE) is introduced. We consider situations in which the goal is to predict the (perhaps in the future) value of a random variable Y... The range of T is any space of objects T. then 90(Z) is called the projection of Y on the space 9 of functions of Z and we write go(Z) = . and projection 1r notation.23).' (l'e(Z).3.5)'. we can conclude that I'(Z) = . Moreover.(Y I 9). With these concepts the results of this section are linked to the general Hilbert space results of Section 8.2 and the Bayes risk (1.2.4. I'(Z» + ""'(Y.4..17) ""'(Y.. We return to this in Section 3.Thus.16) is the Pythagorean identity.4. Using the distance D. can also be a set of functions. Thus. and it is shown that if we want to predict Y on the basis of information contained in a random vector Z. this gives a new derivation of (1.1.15) for a and f3 gives (1. I'dZ) = . but as we have seen in Sec~ tion 1. we see that r(5) ~ MSPE for squared error loss 1(9.(Y.4.I 0 and there is a 90 E 9 such that 0 < oc form a Hilbert go = arg inf{".5) = (9 . Remark 1.(Y 1ge) ~ "(I'(Z) 19d (1.8) defined by r(5) = E[I(II. o Summary. the optimal MSPE predictor E(6 I X) is the Bayes procedure for squared error loss. We begin by fonnalizing what we mean by "a reduction of the data" X EX.5(X»].4. If we identify II with Y and X with Z. usually R or Rk.3.12). .5 SUFFICIENCY Once we have postulated a statistical model.. 1. Remark 1. I'(Z» Y .10.Section 1.(Y 19NP).6. Recall that a statistic is any function of the observations generically denoted by T(X) or T. When the class 9 of possible predictors 9 with Elg(Z)1 space as defined in Section B. Note that (1. The optimal MSPE predictor in the multivariate normal distribution is presented.4. I'dZ» = ".
= tis U(O.. whatever be 8.5). is sufficient. X n ) into the same number.0. and Performance Criteria Chapter 1 Even T(X J. X. t) and Y = t . when Xl + X.9. We give a decision theory interpretation that follows. By (A. Therefore." it is important to notice that there are many others e.Xn ) does not contain any further infonnation about 0 or equivalently P. . the conditional distribution of Xl = [XI/(X l + X. 1) whatever be t. T = L~~l Xi. Suppose there is no dependence between the quality of the items produced and let Xi = 1 if the ith item is good and 0 otherwise.. in the context of a model P = {Pe : (J E e}.Xn ) where Xi = 1 if the ith item sampled is defective and Xi = 0 otherwise. . Thus. .l.. Instead of keeping track of several numbers. A machine produces n items in succession. By (A.l. Begin by noting that according to TheoremB. given that P is valid. X(n))' loses information about the labels of the Xi. where 0 is unknown. Y) where X is uniform on (0. XI/(X l +X. Thus. (Xl.) is conditionally distributed as (X. . = t. ~ t. However.'" . 16. . Then X = (Xl. recording at each stage whether the examined item was defective or not.. . X. Xl has aU(O.'" . P[X I = XI. .4). Suppose that arrival of customers at a service counter follows a Poisson process with arrival rate (parameter) Let Xl be the time of arrival of the first customer. the conditional distribution of X 0 given T = L:~ I Xi = t does not involve O..) and Xl +X.' • .X. " . T is a sufficient statistic for O.) and that of Xlt/(X l + X.) are the same and we can conclude that given Xl + X..1. we need only record one. 1).1.Xn ) is the record of n Bernoulli trials with probability 8. the conditional distribution of XI/(X l + X. A statistic T(X) is called sufficient for PEP or the parameter if the conditional distribution of X given T(X) = t does not involve O. The idea of sufficiency is to reduce the data with statistics whose use involves no loss of information. + X. t) distribution. The most trivial example of a sufficient statistic is T(X) = X because by any interpretation the conditional distribution of X given T(X) = X is point mass at X. the sample X = (Xl.5. are independent and the first of these statistics has a uniform distribution on (0. it is intuitively clear that if we are interested in the proportion 0 of defective items nothing is lost in this situation by recording and using only T. .1) where Xi is 0 or 1 and t = L:~ I Xi. Although the sufficient statistics we have obtained are "natural.1 we had sampled the manufactured items in order.42 Statistical Models.2.l. X 2 the time between the arrival of the first and second customers. . Each item produced is good with probability 0 and defective with probability 1.5.Xn = xnl = 8'(1. We prove that T = X I + X 2 is sufficient for O. Thus. 0 In both of the foregoing examples considerable reduction has been achieved. The total number of defective items observed. Xl and X 2 are independent and identically distributed exponential random variables with parameter O.5.8)n' (1. For instance. whatever be 8. is a statistic that maps many different values of (Xl.2. It follows that. Goals. Example 1.) given Xl + X.3. By Example B. We could then represent the data by a vector X = (Xl. suppose that in Example 1. once the value of a sufficient statistic T is known. Using our discussion in Section B.. One way of making the notion "a statistic whose use involves no loss of infonnation" precise is the following.'" . o Example 1. . • X n ) (X(I) .)](X I + X.l we see that given Xl + X2 = t.
0 E 8.In a regular model. then T 1 and T2 provide the same information and achieve the same reduction of the data. Neyman. i = 1. Po [X ~ XjlT = t. Let (Xl.6).. To prove the sufficiency of (152). Po[X = XjlT = t.. 0) porT ~ Ii] ~~CC+ g(li. X2.j is independent ofe on cach of the sets Si = {O: porT ~ til> OJ.I. h(xj) (1. if (152) holds. it is enough to show that Po[X = XjlT ~ l.O) Theorem I.2. a statistic T(X) with range T is sufficient for e if. We shall give the proof in the discrete case. In general.] Po[X = Xj. e porT = til = I: {x:T(x)=t. Such statistics are called equivalent. Proof.]/porT = Ii] p(x. . Now. there exists afunction g(t.. e)h(Xj) Po[T (15. Fortunately. e) definedfor tin T and e in on X such that p(X.4) if T(xj) ~ Ii 41 o if T(xj) oF t i . and Halmos and Savage.} p(x.S. It is often referred to as the factorization theorem for sufficient statistics. and e and a/unction h defined (152) = g(T(x).O) I: {x:T(x)=t. Section 2.5) . a simple necessary and sufficient criterion for a statistic to be sufficient is available.} h(x).. if Tl and T2 are any two statistics such that 7 1 (x) = T 1 (y) if and only if T2(x) = T 2(y).5. The complete result is established for instance by Lehmann (1997.. forO E Si. O)h(x) for all X E X. This result was proved in various forms by Fisher.Ll) and (152). (153) By (B.5 Sufficiency 43 that will do the same job.Section 1.] o if T(xj) oF ti if T(xj) = Ii. Being told that the numbers of successeS in five trials is three is the same as knowing that the difference between the numbers of successes and the number of failures is one. Applying (153) we arrive at. T = t. • only if.") be the set of possible realizations of X and let t i = T(Xi)' Then T is discrete and 2::~ 1 porT = Ii] = 1 for every e.e) = g(ti. By our definition of conditional probability in the discrete case. checking sufficiency directly is difficult because we need to compute the conditional distribution. More generally. we need only show that Pl/[X = xjlT = til is independent of for every i and j.
. .. ..16.. O)h(x) (1.Il) .X n JI) = 0 otherwise.9) if every Xi is an integer between 1 and B and p( X 11 • (1. and h(xl •.. .. = max(xl.2 (continued). we need only keeep track of X(n) = max(X .4)).O)=onexP[OLXil i=l (1.X.2..8) = eneStift > 0.5. Expression (1. Estimating the Size of a Population.5.5. 1. _ _ _1 .2. •. T = T(x)] = g(T(x). 10 fact. n P(Xl.5. ~ Po[X = x. O} = 0 otherwise. X(n) is a sufficient statistic for O.3.3). By Theorem 1.~.' L(Xi .7) o Example 1.44 Statistical Models. .4.10) P(Xl..[27f. > 0. 0 o Example 1. and Performance Criteria Chapter 1 Therefore. .. which admits simple sufficient statistics and to which this example belongs.n /' exp{ .5... The probability distribution of X "is given by (1.' (L x~ 1=1 1 n n 2JL LXi))]. if T is sufficient. Conversely. I x n ) = 1 if all the Xi are > 0.6) p(x..O) where x(n) = onl{x(n) < O}. T is sufficient. Goals. . .. .5. .5.'t n /'[exp{ . 1=1 (!. are introduced in the next section. 1 X n be independent and identically distributed random variables each having a normal distribution with mean {l and variance (j2. . l X n are the interarrival times for n customers. Then the density of (X" . let g(ti' 0) = porT = tiL h(x) = P[X = xIT(X) = til Then (1. and P(XI' .'I.S.. ' . both of which are unknown. If X 1. .9) can be rewritten as " 1 Xn .'). . then the joint density of (X" . Common sense indicates that to get information about B.'" .JL)'} ~=l I n I! 2 .Xn ) is given by (see (A.. X n ). [27". Let Xl.Xn ) is given by I. 0 Example 1. Let 0 = (JL..Xn.8) if all the Xi are > 0. . We may apply Theorem 1. and both functions = 0 otherwise.1..Xn ) = L~ IXi is sufficient.Xn are recorded. Takeg(t.xn }.5. The population is sampled with replacement and n members of the population are observed and their labels Xl. }][exp{ .5. Consider a population with () members labeled consecutively from I to B. . A whole class of distributions.1 to conclude that T(X 1 .5.5. 0) by (B. we can show that X(n) is sufficient....
X n ) = (I: Xi. Suppose X" ..) i=l i=l is sufficient for B. . for any decision procedure o(x). i= 1 = (lin) L~ 1 Xi.5. {32.} + EY. Yi N(Jli.5 Sufficiency 45 I Evidently P{Xl 1 • •• .. [1/(n i=l n n 1)1 I:(Xi . By randomized we mean that o'(T(X») can be generated from the value random mechanism not depending on B.9) and _ (y. with JLi following the linear regresssion model ("'J . I)) = exp{nI)(x . in Example 1.··· .12) t ofT(X) and a Example 1.6.Section 1. Specifically. if T(X) is sufficient. X n are independent identically N(I). Thus. that Y I . a<.1. respectively.( 2?Ta ')~ exp p {E(fJ. Here is an example.' + 213 2a + 213.~0)}(2"r~n exp{ . . 1 2 2 .EZiY. .xnJ)) is itself a function of (L~ upon applying Theorem 1.1. Ezi 1'i) is sufficient for 6. 0') for allO. Then pix. 2:>. we can. R(O. find a randomized decision rule o'(T(X)) depending only on T(X) that does as well as O(X) in the sense of having the same risk function.~ I: xn By the factorization theorem. X n ) = [(lin) where I: Xi. . that is. a 2)T is identifiable (Problem 1. we construct a rule O'(X) with the sarne risk = mean squared error as o(X) as follows: Conditionally.Zi)'} exp {Er. An equivalent sufficient statistic in this situation that is frequently used IS S(X" . I)) .o) = R(O. where we assume diat the given constants {Zi} are not all identical. a 2 ). The first and second components of this vector are called the sample mean and the sample variance. EYi 2 . X is sufficient. ..X)'].5. 0 X Example 1.5. (1.. L~l x~) and () only and T(X 1 .4 with d = 2.l) distributed. T = (EYi .2a I3. Let o(X) = Xl' Using only X. Suppose.5.5.• Yn are independent.1 we can conclude that n n Xi. Then 6 = {{31. o Sufficiency and decision theory Sufficiency can be given a clear operational interpretation in the decision theoretic setting.
). we can find a transformation r such that T(X) = r(S(X)). Now draw 6' randomly from this conditional distribution.5. 0 and X are independent given T(X). ) Var(T') = E[Var(T'IX)] + VarIE(T'IX)]  ~ nl n + 1 n ~ 1 = Var(X .6). In Example 1. where k In this situation we call T Bayes sufficient.5.12) follows along the lines of the preceding example: Given T(X) = t. then T(X) = (L~ I Xi. But T(X) provides a greater reduction of the data. Example 1.5. = 2:7 Definition.. T(X) is Bayes sufficient for IT if the posterior distribution of 8 given X = x is the same as the posterior (conditional) distribution of () given T(X) = T(x) for all x. Minimal sufficiency For any model there are many sufficient statistics: Thus. by the double expectation theorem. We define the statistic T(X) to be minimally sufficient if it is sufficient and provides a greater reduction of the data than any other sufficient statistic S(X). 0 The proof of (1. (T2) sample n > 2.Xn is a N(p. .6). Theorem }..l and (1. . Let SeX) be any other sufficient statistic. This 6' (T(X)) will have the same risk as 6' (X) because. T = I Xi was shown to be sufficient. Using Section B.1 (Bernoulli trials) we saw that the posterior distribution given X = x is 1 Xi. Thus.O) = g(S(x). Equivalently. the distribution of 6(X) does not depend on O. L~ I xl) and S(X) = (X" . Then by the factorization theorem we can write p(x.5.6'(T))IT]} ~ E{E[R(O.46 Statistical Models.2. 0) as p(x.14. O)h(x) E:  .2.. if Xl. (Kolmogorov). X n ) are hoth sufficient. In this Bernoulli trials case. and Performance Criteria Chapter 1 given X = t. the same as the posterior distribution given T(X) = L~l Xi = k. Goals. it is Bayes sufficient for every 11. IfT(X) is sufficient for 0.4. o Sufficiency aud Bayes models There is a natural notion of sufficiency of a statistic T in the Bayesian context where in addition to the model P = {Po : 0 E e} we postulate a prior distribution 11 for e.1 (continued). This result and a partial converse is the subject of Problem 1. we find E(T') ~ E[E(T'IX)] ~ E(X) ~ I" ~ E(X .6(X»)IT]} = R(O. .6') ~ E{E[R(O.. R(O.. n~l) distribution. in that. choose T" = 15"' (X) from the normal N(t. 6'(X) and 6(X) have the same mean squared error.
. The likelihood function o The preceding example shows how we can use p(x. However. we find OT(1 . the statistic L takes on the value Lx. t. }exp ( I .1). Example 1.1) nlog27r . (T').8) for the posterior distribution can then be remembered as Posterior ex: (Prior) x (Likelihood) where the sign ex denotes proportionality as functions of 8. = 2 log Lx(O. We define the likelihood function L for a given observed data vector x as Lx(O) = p(x. for a given 8. The formula (1. the ratio of both sides of the foregoing gives In particular. (T') example. 20. as a function of ().O)h(x) forall O. if X = x. it gives. 1/3)]}/21og 2. 20' 20 I t. for example. Thus.o)nT ~g(S(x).O). L Xf) i=l i=l n n = (0 10 0. for a given observed x. then Lx(O) = (27rO. t2) because.(t. Thus.4 (continued). ~ 2/3 and 0.) = (/". Now. ()) for different values of () and the factorization theorem to establish that a sufficient statistic is minimally sufficient. when we think of Lx(B) as a function of (). It is a statistic whose values are functions. if we set 0. ~ 1/3.5 Sufficiency 47 Combining this with (1. L x (') determines (iI. For any two fixed (h and fh. T is minimally sufficient.) n/' exp {nO. we find T = r(S(x)) ~ (log[2ng(s(x).5. In this N(/". 2/3)/g(S(x).Section 1. ()) x E X}. the likelihood function (1. L x (()) gives the probability of observing the point x.11) is determined by the twodimensional sufficient statistic T Set 0 = (TI.5.T.5.5.) ~ (L Xi. the "likelihood" or "plausibility" of various 8. take the log of both sides of this equation and solve for T. In the discrete case.O E e. Lx is a map from the sample space X to the class T of functions {() t p(x. In the continuous case it is approximately proportional to the probability of observing a point in a small rectangle around x.)}.
O)h(X). .X. as in the Example 1. We define a statistic T(X) to be Bayes sufficient for a prior 7r if the posterior distribution of f:} given X = x is the same as the posterior distribution of 0 given T(X) = T(x) for all X.X)..).. Lehmann. X is sufficient. If P specifies that X 11 •. Suppose that X has distribution in the class P ~ {Po : 0 E e}.1.5.5. and Performance Criteria Chapter 1 with a similar expression for t 1 in terms of Lx (O. in Example 1. Summary. 1. . If T(X) is sufficient for 0. 1 X n )..5. Suppose there exists 00 such that (x: p(x. .. Thus. Let p(X.1 (continued) we can show that T and.12 for a proof of this theorem of Dynkin. 0) and heX) such that p(X. a statistic closely related to L solves the minimal sufficiency problem in general.5. < X. 0 In fact. L is a statistic that is equivalent to ((1' i 2) and. but if in fact the common distribution of the observations is not Gaussian all the information needed to estimate this distribution is contained in the corresponding S(X)see Problem 1. hence. SeX) becomes irrelevant (ancillary) for inference if T(X) is known but only if P is valid. • X( n» is sufficient. 1 X n are a random sample. we can find a randomized decision rule J'(T(X» depending only on the value of t = T(X) and not on 0 such that J and J' have identical risk functions. The "irrelevant" part of the data We can always rewrite the original X as (T(X). Thus. Ax is the function valued statistic that at (J takes on the value p x.Oo) > OJ = Lx~6o)' Thus. _' . S(X) = (R" . if a 2 = 1 is postulated. We say that a statistic T (X) is sufficient for PEP. l:~ I (Xi .5. . then for any decision procedure J(X). it is Bayes sufficient for O.17). the residuals. (X. 1) (Problem 1. hence. 0 the likelihood ratio of () to Bo. .Rn ).4. L is minimal sufficient. . Then Ax is minimal sufficient. itself sufficient. 1) and Lx (1. Let Ax OJ c (x: p(x.X)2) is sufficient. Consider an experiment with observation vector X = (Xl •. The likelihood function is defined for a given data vector of . For instance. a 2 is assumed unknown.13.48 Statistical Models. (X( I)' . where R i ~ Lj~t I(X. OJ denote the frequency function or density of X. SeX)~ where SeX) is a statistic needed to uniquely detennine x once we know the sufficient statistic T(x). 0) = g(T(X).O) > for all B.... If.5. A sufficient statistic T(X) is minimally sufficient for () if for any other sufficient statistic SeX) we can find a ttansformation r such that T(X) = r(S(X).5. but if in fact a 2 I 1 all information about a 2 is contained in the residuals. Goals. By arguing as in Example 1. if the conditional distribution of X given T(X) = t does not involve O. ifT(X) = X we can take SeX) ~ (Xl .X Cn )' the order statistics.Xn . . or if T(X) ~ (X~l)' . or for the parameter O. See Problem ((x:). B ul the ranks are needed if we want to look for possible dependencies in the observations as in Example 1. We show the following result: If T(X) is sufficient for 0.. The factorization theorem states that T(X) is sufficient for () if and only if there exist functions g( t..5. and Scheffe. the ranks.
6. realvalued functions T and h on Rq. the likelihood ratio Ax(B) = Lx(B) Lx(Bo) depends on X through T(X) only. B E' sufficient for B. Then. B>O. Note that the functions 1}. The class of families of distributions that we introduce in this section was first discovered in statistics independently by Koopman. The Poisson Distribution.1. We shall refer to T as a natural sufficient statistic of the family.2"" }. More generally. They will reappear in several connections in this book.B) ~ h(x)exp{'7(B)T(x) . binomial. by the factorization theorem. We return to these models in Chapter 2. Subsequently. Probability models with these common features include normal.B o) > O}. x. p(x. for x E {O. gamma. and multinomial regression models used to relate a response variable Y to a set of predictor variables.6 EXPONENTIAL FAMILIES The binomial and normal models considered in the last section exhibit the interesting feature that there is a natural sufficient statistic whose dimension as a random vector is independent of the sample size. IfT(X) is {x: p(x. Here are some examples.Section 1. Pitman. Let Po be the Poisson distribution with unknown mean IJ. B) > O} c {x: p(x. 1.6. B( 0) on e.B(B)} (1. In a oneparameter exponential family the random variable T(X) is sufficient for B. (1. BEe.o I x. B). if there exist realvalued functions fJ(B).1 The OneParameter Case The family of distributions of a model {Pe : B E 8}. many other common features of these families were discovered and they have become important in much of the modern theory of statistics.6 Exponential Families 49 observations X to be the function of B defined by Lx(B) = p(X. and Dannois through investigations of this property(l).2) . B) of the Pe may be written p(x. and T are not unique. beta. Example 1. and if there is a value B E such that o e e. 1.B) = eXe. then.exp{xlogBB}. 1.1) where x E X c Rq. Ax(B) is a minimally sufficient statistic. = 1 .6. these families form the basis for an important class of models called generalized linear models. B.B) and h(x) with itself in the factorization theorem. This is clear because we need only identify exp{'7(B)T(x) .B(B)} with g(T(x). is said to be a oneparameter exponentialfami/y. Poisson.6. such that the density (frequency) functions p(x.
the family of distributions of X is a oneparameter exponential family with q=I. Specifically.1](0) = logO.' x. ..1).Xm are independent and identically distributed with common distribution Pe.1](O)=log(I~0).6) .5) o = 2. This is a oneparameter exponential family distribution with q = 2.6. h(x) = (2rr)1 exp { . T(x) ~ x.h(x) = .0)]. we have p(x.) i=1 m B(8)] ( 1. Here is an example where q (1. .y. is the family of distributions of X = (Xl' ..6. . 0) distribution. Example 1. for x E {O.O) f(z. 0 Then. Z and W are f(x. where the p. q = 1. suppose Xl. Then > 0. and Performance Criteria Chapter 1 Therefore.4) Therefore. 0 Example 1.~z.50 . .} . Suppose X = (Z.2.) exp[~(O)T(x. p(x.O) IIh(x. 1.6. Statistical Models.B(O)=nlog(I0). . (1. 1 (1. . .O) = ( : ) 0'(1_ Or" ( : ) ex p [xlog(1 ~ 0) + nlog(l.Z)OI) Z)'O'J} (2rrO)1 exp { (2rr)1 exp {  ~ [z' + (y  ~z' } exp { .3. The Binomial Family. . B) are the corresponding density (frequency) functions. B(O) = logO.~O'. y)T where Y = Z independent N(O.6.h(X)=( : ) . fonn a oneparameter exponential family as in (1.. the Pe form a oneparameter exponential family with ~~I .T(x) = (y  z)'.6.1](0) = . . + OW. n) < 0 < 1.6.3) I' . 1). I I! .. 0 E e. If {pJ=».. 1 X m ) considered as a random vector in Rmq and p(x. Suppose X has a B(n. o The families of distributions obtained by sampling from oneparameter exponential families are themselves oneparameter exponential families. Goals.T(x)=x.~O'(y  z)' logO}. B(O) = 0.O) = f(z)f.(y I z) = <p(zW1<p((y .6.
. then qCm) = mq. pi pi I:::n TABLE 1. Let {Pe } be a oneparameter exponential family of discrete distributions with corresponding functions T..6 Exponential Families 51 where x = (x 1 . if X = (Xl. B. This family of Poisson distIibutions is oneparameter exponential whatever be m. oneparameter exponential family with natural sufficient statistic T(m)(x) = Some other important examples are summarized in the following table. then the family of distributions of the statistic T(X) is a oneparameter exponential family of discrete distributions whose frequency functions may be written h'Wexp{1)(O)t . i=l (1.6. the m) fonn a oneparameter exponential family.1.. and h.Section 1. .Il)" . and h.\) (3(r.1 the sufficient statistic T :m)(x1. 1 X m).. .. . In the discrete case we can establish the following general result. B(m)(o) ~ mB(O). Therefore. and pJ LT(Xi). 1)(0) N(I'.\ 1"/'" (p (s 1) 1) 1) (r T(x) x (x .6.. ..8) .· .6. ily of distributions of a sample from any of the foregoin~ is just In our first Example 1. fl. 1]. then the m ) fonn a I Xi. B.\ fixed r fixed s fixed 1/2.7) Note that the natural sufficient statistic T(m) is onedimensional whatever be m. We leave the proof of these assertions to the reader.B(O)} for suitable h * . If we use the superscript m to denote the corresponding T.x log x 10g(l x) logx The statistic T(m) (X I. . · .1 Family of distributions . X m ) corresponding to the oneparameter exponential fam1 T(Xd.. 1 X m ) is a vector of independent and identically distributed P(O) random variables and m ) is the family of distributions of x. I:::n I::n Theorem 1. . •.6. .2) r(p. h(m)(x) ~ i=l II h(xi). ..Xm ) = I Xi is distributed as P(mO).' fixed I' fixed p fixed . For example.
then A(1}) must be finite.6. The model given by (1. .6. The Poisson family in canonical form is q(x. We obtain an important and useful reparametrization of the exponential family (1.6.9) and ~ is an Ihlerior point of E. Jh(x)exp[1}T(x)]dx in the continuous case and the integral is replaced by a sum in the discrete case.9) with 1} E E contains the class of models with 8 E 8.1}) = (1(x!)exp{1}xexp[1}]}.1.6. selves continuous. the result follows..2. and Performance Criteria Chapter 1 Proof.A(1})1 for s in some neighborhood 0[0.6.. Then as we show in Section 1. E is either an interval or all of R and the class of models (1.8) h(x) exp[~(8)T(x) . x E {O. }.6. Example 1.B(8)]{ Ifwe let h'(t) ~ L:{"T(x)~t} h(x).2. x=o x=o o Here is a useful result. If 8 E e. Let E be the collection of all 1} such that A(~) is finite. (continued). .A(~)I. x E X c Rq (1. if q is definable. 0 A similar theorem holds in the continuous case if the distributions ofT(X) are them Canonical exponential families. The exponential family then has the form q(x.1) by letting the model be indexed by 1} rather than 8. If X is distributed according to (1.. Goals.1}) = h(x)exp[1}T(x) .8) exp[~(8)t .6. the momentgenerating function o[T(X) exists and is given by M(s) ~ exp[A(s + 1}) .9) with 1} ranging over E is called the canonical oneparameter exponential family generated by T and h. L {x:T{x)=t} h(x)}.6. £ is called the natural parameter space and T is called the natural sufficient statistic. where 1} = log 8.B(8)] L {x:T(x)=t} ( 1.6.2..9) where A(1}) = logJ . Theorem 1. P9[T(x) = tl L {x:T(x)=t} p(x.52 Statistical Models. By definition.1. 00 00 exp{A(1})} andE = R. = L(e"' (x!) = L(e")X (x! = exp(e").
Pitman. X n is a sample from a population with density p(x. Var(T(X)) ~ A"(1]).10) . and Dannois were led in their investigations to the following family of distributions. We give the proof in the continuous case. and realvalued functions T 1.<·or . We compute M(s) = = E(exp(sT(X))) {exp[A(s ~ + 1])  A(1])]} J'" J J'" J h(x)exp[(s + 1])T(x) .A(1])] because the last factor. x E X j=1 c Rq.j8 2))exp(i=l n n I>U282) ~=1 = (il xi)exp[202 LX.6. It is used to model the density of "time until failure" for certain types of equipment.4 Suppose X 1> ••• .. Koopman. 1 Tk. 1.8) = (il(x. Direct computation of these moments is more complicated. ' e k p(x.17k and B of 6. which is naturally indexed by a kdimensional parameter and admit a kdimensional sufficient statistic.6.A(1])]dx h(x)exp[(s + 1])T(x)  A(s + 1])Jdx = . is one.Section 1. _ . is said to be a kparameter exponential family. More generally. . h on Rq such that the density (frequency) functions of the Po may be written as.8) ~ (x/8 2)exp(_x 2/28 2). c R k .8 > O...6. The rest of the theorem follows from the momentenerating property of M(s) (see Section A.2 The Multiparameter Case Our discussion of the "natural form" suggests that oneparameter exponential families are naturally indexed by a onedimensional real parameter fJ and admit a onedimensional sufficient statistic T(x). (1. x> 0.6 Exponential Families 53 Moreover. 2 i=1 i=l n 1 n Here 1] = 1/20 2. This is known as the Rayleigh distribution. nlog0 J. 02 ~ 1/21]. Example 1.12). 0 Here is a typical application of this result. Now p(x.8(0)]. being the integral of a density. if there exist realvalued functions 171. exp[A(s + 1]) . E(T(X)) ~ A'(1]). . B(O) the natural sufficient statistic E~ 2n(j2 and variance nlt}2 4n04 . 0 = n log 82 and A(1]) 1 xl has mean nit} = = nlog(21]). Proof. Therefore.O) = h(x) exp[L 1]j(O)Tj (x) . A family of distrib~tioos {PO: 0 E 8}.
8 2 = and I" 1 2' Tl(x) = x. X m ) from a N(I".. in the conlinuous case.') : 00 < f1 x2 1 J12 2 p(x. the vector T(X) = (T. 1)2(0) = 2 " B(O) " 1"' 1 " T.x . . 2 (". Goals. If we observe a sample X = (X" .1.A('l)}. . .'" .TdX))T is sufficient. ..I' 54 Statistical Models.. In either case.5.6.Xm ) where the Xi are independent and identically distributed and their common distribution ranges over a k~parameter exponential family given by (1. +log(27r" »)].'). suppose X = (Xl. It will be referred to as a natural sufficient statistic of the family. . (]"2 > O}. Then the distributions of X form a kparameter exponential family with natural sufficient statistic m m TCm)(x) = (LTl(Xi).6.Ot = fl..(x) = x'.. + log(27r"'». we define the natural parameter space as t: = ('l E R k : 00 < A('l) < oo}..6.11) (]"2. Suppose lhat P.4).2(". . Again.LX.LTk(Xi )) t=1 Example 1. then the preceding discussion leads us to the natural sufficient statistic m m (LXi.(X) •. (1. i=l i=1 which we obtained in the previous section (Example 1.3. . .O) =exp[".. 0 Again it will be convenient to consider the "biggest" families. = N(I". . i=l . q(x.x E X c Rq where T(x) = (T1 (x).5.. A(71) is defined in the same way except integrals over Rq are replaced by sums.10). . The density of Po may be written as e ~ {(I". I ! In the discrete case.'). and Performance Criteria Chapter 1 By Theorem 1.. which corresponds to a twOparameter exponential family with q = I. ... the canonical kparameter exponential indexed by 71 = (fJI.' .. J1 < 00. letting the model be Thus.2.Tk(x)f and.'l) = h(x)exp{TT(x)'l. h(x) = I. ..fJk)T rather than family generated by T and h is e.. The Normal Family.') population. .
I yn)T can be put in canonical form with k = 3. we can achieve this by the reparametrization k Aj = eO] /2:~ e Oj . .EziY.6.)T.Xn)'r where the Xi are i. . In this example. I <j < k 1. < OJ..Section 1. k ~ 2. remedied by considering IV ~j = log(A. Example 1. k}] with canonical parameter 0. 0. A E A. . ~3 and t: = {(~1. . From Exam~ pIe 1.. and AJ ~ P(X i ~ j). .. x') = (T. . . where A is the simplex {A E R k : 0 < A. Suppose a<.5.5 that Y I ..6.5..)j. h(x) = I andt: ~ R x R. (N(/" (7') continued}.. 1/) =exp{T'f._l)(x)1/nlog(l where + Le"')) j=1 . = 1/2(7'. .= {(~1..k. as X and the sample space of each Xi is the k categories {I.~3): ~l E R. Multinomial Trials.ti = f31 + (32Zi. a 2 ).6. ~l ~ /.~2)]. is not and all c./(7'./(7'. However 0..n log j=l L exp(.6. A) = I1:~1 AJ'(X). E R.5..~·(x) . the density of Y = (Yb . with J.id.(x). L71 Aj = I}../Ak) = "... i = 1.~2): ~l E R. We write the outcome vector as X = (Xl. This can be identifiable because qo(x. where m... . .'12 ~ (3. TT(x) = T(Y) ~ (EY"EY. In this example. Now we can write the likelihood as k qo(x. ."k. . .j j=l = 1. 0 + el) ~ qo(x. 0) ~ exp{L . Y n are independent.~3 ~ 1/2(7'.1. . . n.j = 1..4 and 1.k. . = n'Ezr Example 1. We observe the outcomes of n independent trials where each trial can end up in one of k possible categories.~.) + log(rr/. Example 1.kj. ~. Yi rv N(Jli.(x) = L:~ I l[X i = j].6 Exponential Families 55 (x. and rewriting kl q(x. 0) for 1 = (1..~l= (3. and t: = Rk .(x»).~3 < OJ.'.T..7. j=1 k This is a kparameter canonical exponential family generated by TIl"" Tk and h(x) = II~ I l[Xi E {I. It will often be more convenient to work with unrestricted parameters. Linear Regression. A(1/) ~ ~[(~U2~../(7'.. Let T. 2. A(1/) = 4n[~:+m'~5+z~1~'+2Iog(rr/~3)].'I2. . Then p(x..~.. < l.5. . in Examples 1. E R k .
6. let Xt =integers from 0 to ni generated Vi < n. Similarly.·. Here is an example of affine transformations of 6 and T. Ai). 1 < i < n. Here 'Ii = log 1\.17 e for details. is an npararneter canonical exponential family with Yi by T(Yt . this. 17).).12) taking on k values as in Example 1.6. I < k.d. if X is discrete Affine transformations from Rk to RI defined by UP is the canonical family generated by T kx 1 and hand M is the affine transformation I' .56 Note that q(x.6.6. as X. k}] with canonical parameter TJ and £ = Rk~ 1. . 0) ~ q(x. and 1] is a map from e to a subset of R k • Thus. See Problem 1.6.1. . 1 < i < n..2. M(T) sponding to and ~ MexkT + hex" it is easy to see that the family generated by M(T(X» and h is the subfamily of P corre 1/(0) = MTO. 0 1.. 1 <j < k . More10g(P'I[X = jI!P'IIX ~ k]).13) . then the resulting submodel of P above is a submodel of the exponential family generated by BTT(X) and h. if c Re and 1/( 0) = BkxeO C R k . Goals. 0 < Ai < 1.7 and X = (X t.. < . ) 1(0 < However. 1]) is a k and h(x) = rr~ 1 l[xi E over. B(n. X n ) T where the Xi are i.. then all models for X are exponential families because they are submodels of the multinomial trials model. If the Ai are unrestricted.Y n ) ~ Y. h(y) = 07 I ( ~. where (J E e c R1. the parameters 'Ii = Note that the model for X Statistical Models. are identifiable.3 Building Exponential Families Submodels A submodel of a kparameter canonical exponential family {q(x.. " Example 1. " . and Performance Criteria Chapter 1 1 parameter canonical exponential family generated by T (k 1) {I. 1/(0») (1.6. Logistic Regression. < X n be specified levels and (1.8.. from Example 1.i. . Let Y. be independent binomial.' A(1/) = L:7 1 ni 10g(l + e'·). is unchanged. 11 E £ C R k } is an exponential family defined by p(x.6. . .
.00). then this is the twoparameter canonical exponential family generated by lYIY = (L~l.'. However.Xi)). which is less than k = n when n > 3. 8) in the 8 parametrization is a canonical exponential family.6. Then. .8.. '. . Y. that is..i.1. Yn. p(x...12) with the range of 1J( 8) restricted to a subset of dimension l with I < k .. + 8.xn)T.6... In the nonnal case with Xl.). LocationScale Regression.14) 8..'V T(Y) = (Y" . Then (and only then)..8. with B = p" we can write where T 1 = ~~ I Xi.. + 8. this is by Example 1.6. . Y n are independent. . .i.6.!t is assumed that each animal has a random toxicity threshold X such that death results if and only if a substance level on or above X is applied. Suppose that YI ~ . The Yi represent the number of animals dying out of ni when exposed to level Xi of the substance. so it is not a curved family.i/a. y. n. o Curved exponential families Exponential families (1. which is called the coefficient of variation or signaltonoise ratio. Gaussian with Fixed SignaltoNoise Ratio. Example 1.6 Exponential Families 57 This is a linear transformation TJ(8) = B nxz 8 corresponding to B nx2 = (1.PIX < x])) 8. > O. . log(P[X < xl!(1 . .6. + 8. are called curved exponential families provided they do not form a canonical exponential family in the 8 parametrization. x)..9. i=I This model is sometimes applied in experiments to determine the toxicity of a substance.. }Ii N(J1.6.d. ~ (1. T 2 = L:~ 1 X.Xn i.x)W'.. E R.. where 1 is (1. I ii. .6.10. L:~ 1 Xl yi)T and h with A(8.. generated by . suppose the ratio IJlI/a. is a known constant '\0 > O. the 8 parametrization has dimension 2. and l1n+i = 1/2a. = '\501 and 'fJ2(0) = ~'\5B2. If each Iii ranges over R and each a. ranges over (0.) = L ni log(1 + exp(8. . N(Jl.Section 1. a.)T .13) holds. . This is a 0 In Example 1.8. PIX < xl ~ [1 + exp{ (8.1)T..x and (1. SetM = B T .5 a 2nparamctercanonical exponential family model with fJi = P. 'fJ1(B) curved exponential family with l = 1. Example 1. . i = 1.x = (Xl.a 2 ). Assume also: (a) No interaction between animals (independence) in relation to drug effects (b) The distribution of X in the animal population is logistic.
Supermode1s We have already noted that the exponential family structure is preserved under i. 1988. 1.1 generalizes directly to kparameter families as does its continuous analogue.. Thus.12) is not an exponential family model. say. Goals.5 that for any random vector Tkxl. ~.1.6. We extend the statement of Theorem 1. In Example 1. 1989. and Snedecor and Cochran.(O)Tj'(Y) for some 11.12. . we define I I .6. Carroll and Ruppert.13) exhibits Y. then p(y. and B(O) = ~. Ii. Sections 2. Bickel. 8. but a curved exponential family model with 0 1=3.) = (Yj.d. sampling.58 Statistical Models. Section 15. n.). a. be independent.. with an exponential family density Then Y = (Y1..10). Recall from Section B. 1 Bj(O).g. For 8 = (8 1 .) and 1 h'(YJ)' with pararneter1/(O).) depend on the value Zi of some covariate. respectively.1/(O)) as defined in (6. yn)T is modeled by the exponential family generated by T(Y) ~.10 and 1.2.6.5. E R..(0). M(s) as the momentgenerating function. 0) ~ q(y. as being distributed according to a twoparameter family generated by Tj(Y. 83 > 0 (e. TJ'(Y). 1 Tj(Y. Models in which the variance Var(}'i) depends on i are called heteroscedastic whereas models in which Var(Yi ) does not depend on i are called homoscedastic. and Performance Criteria Chapter 1 and h(Y) = 1.6. Let Yj .4 Properties of Exponential Families Theorem 1. the map 1/(0) is Because L:~ 11]i(6)Yi + L:~ i17n+i(6)Y? cannot be written in the fonn ~.6.6._I11. 1 < j < n.8.E Yj c Rq. We return to curved exponential family models in Section 2.6 are heteroscedastic and homoscedastic models.xjYj ) and we can apply the supennode! approach to reach the same conclusion as before. . Examples 1. Next suppose that (JLi.8.8 note that (1.3.6. for unknown parameters 8 1 E R. 1978. Even more is true.i. and = Ee sTT .
. u(x) ~ exp(a'7. v(x). t: and (a) follows.a)'72) < aA('7l) + (1 .6..6.6. Corollary 1.15) is finite.6.15) t: the righthand side of (1. ~ = 1 .6. J u(x)v(x)h(x)dx < (J ur(x)h(x)dx)~(Jv'(x)h(x)dx)~.6.6.6.8". ('70)) . . + (1 . . ('70).a)A('72) Because (1. (with 00 pennitted on either side). Substitute ~ ~ a. Suppose '70' '71 E t: and 0 < a < 1. A(U'71 Which is (b). If '7" '72 E + (1 . then T(X) has under 110 a moment M(s) ~ exp{A('7o + s) .3.6. Then (a) E is convex (b) A: E ) R is convex (c) If E has nonempty interior in R k generating function M given by and'TJo E E. Under the conditions ofTheorem 1.9. S > 0 with ~ + ~ = 1. h) with corresponding natural parameter space E and function A(11). for any u(x).A('7o)} vaLidfor aLL s such that 110 + 5 E E. h(x) > 0.Section 1. A where A( '70) = (8". T.4).5.1 give a classical result in Example 1..A('70) = II 8". ('70) II· The corollary follows immediately from Theorem B.15) that a'7.r'7o T(X) .6.1 and Theorem 1.6 Expbnential Families 59 V.a.1.a)'72 E is proved in exactly the same way as Theorem 1.2. By the Holder inequality (B.6.6. 8". Fiually (c) 0 The formulae of Corollary 1.3 V.T(x)). Since 110 is an interior point this set of5 includes a baLL about O. 8A 8A T" = A('7o) f/J.3. We prove (b) first. Proof of Theorem 1.a)'7rT(x)) and take logs of both sides to obtain.r(T) = IICov(7~. 1O)llkxk. Theorem 1. J exp('7 TT (x))h(x)dx > 0 for all '7 we conclude from (1. vex) = exp«1 . Let P be a canonicaL kparameter exponential famiLy generated by (T.3(c).
+ 9. 9" 9. ajTj(X) = ak+d < 1 unless all aj are O.(9) = 9. We establish the connection and other fundamental relationships in Theorem 1. Here. An exponential family is of rank k iff the generating statistic T is kdimensional and 1. It is intuitively clear that k .8.6.4.lx ~ j] = AJ ~ e"'ILe a . Suppose P = (q(x. Goads.6. (i) P is a/rank k.1.(Tj(X» = P>. Note that PO(A) = 0 or Po (A) < 1 for some 0 iff the corresponding statement holds i.7).1 is in fact its rank and this is seen in Theorem 1.60 Statistical Models. Sintilarly. p:x. Xl Y1 ). Our discussion suggests a link between rank and identifiability of the 'TJ parameterization. T.x. such that h(x) > O. (X)". Theorem 1. II ! I Ii . for all 9 because 0 < p«X. using the a: parametrization. (continued).6. .6.7. o The rank of an exponential family I . (ii) 7) is a parameter (identifiable). But the rank of the family is 1 and 8 1 and 82 are not identifiable. Tk(X) are linearly independent with positive probability.6. 2' Going back to Example 1. we are writing the oneparameter binomial family corresponding to Yl as a twoparameter family with generating statistic (Y1 . . However. and ry. in Example 1. k A(a) ~ nlog(Le"') j=l and k E>. f=l .4. . P7)[L. Formally. ifn ~ 1. However.6.. Then the following are equivalent.4 that follows. and Performance Criteria Chapter 1 Example 1. 7) E f} is a canonical exponential/amily generated by (TkXI ' h) with natural parameter space e such that E is open.7 we can see that the multinomial family is of rank at most k .~. if we consider Y with n'> 2 and Xl < X n the family as we have seen remains of rank < 2 and is in fact of rank 2. there is a minimal dimension. . (iii) Var7)(T) is positive definite.~~) < 00 for all x."I 11 J Evidently every kparameter exponential family is also k'dimensional with k' > k.
. ~ (ii) ~ _~ (i) (ii) = P1).'t·· . = Proof ofthe general case sketched I.2. Thus. This is equivalent to Var"(T) ~ 0 '*~ (iii) {=} f"V(ii) There exist T}I =I= 1J2 such that F Tll = Pm' Equivalently exp{r/. by our remarks in the discussion of rank. because E is open. all 1) = (~i) II. for all T}. ~ P1)o some 1). Taking logs we obtain (TJI . A'(TJ) is strictly monotone increasing and 11. 0 Corollary 1. F!~ .(ii) = (iii). Apply the case k = 1 to Q to get ~ (ii) =~ (i). ". A"(1]O) = 0 for some 1]0 implies c.' .. have (i) . of 1)0· Let Q = (P1)o+o(1). .) is false. ." Then ~(i) > 1 is then sketched with '* P"[a.3. 0 . Conversely. Suppose tlult the conditions of Theorem 1. We give a detailed proof for k = 1.TJ2)T(X) = A(TJ2) . 1)) is a strictly concavefunction of1) On 1'.. A is defined on all ofE. . '" '. hence. by Theorem 1.A(TJ2))h(x). which that T implies that A"(TJ) = 0 for all TJ and..6. This is just a restatement of (iv) and (v) of the theorem.~> .1)o)TT. III. :. A' is constant. ranges over A(I'). ~ (i) =~ (iii) ~ (i) = P1)[aT T = cJ ~ 1 for some a of 0. Note that.6.T ~ a2] = 1 for al of O.. (iv) '" (v) = (tii) Properties (iv) and (v) are equivalent to the statements holding for every Q defined as previously for arbitrary 1)0' 1).) with probability 1 =~(i).6. Now (iii) =} A"(TJ) > 0 by Theorem 1. (iii) = (iv) and the same discussion shows that (iii) . with probability 1.T(x) ..Section 1. Then (a) P may be uniquely parametrized by "'(1)) E1)T(X) where". (b) logq(x. Let ~ () denote "(.(v).6.1)o) : 1)0 + c(1).6 Exponential Families 61 (iv) 1) ~ A(1)) is 11 onto (v) A is strictly convex on E. The proof for k details left to a problem. hence.A(TJI)}h(x) ~ exp{TJ2T(x) .. 1)0) E n· Q is the exponential family (oneparameter) generated by (1)1 .. all 1) ~ (iii) = a T Var1)(T)a = Var1)(aT T) = 0 for some a of 0. We.4 hold and P is of rank k.2 and. = Proof. Proof. thus.A(TJ.
.. then X .18) _.._.2 p p (1. with mean J.21).. so that Theorem 1.6.I.6. B(9) = (log Idet(E) I+JlTEl Jl).. ifY 1.a 2 + J12).6. E(X. It may be shown (Problem 1. . ~iYi Vi).6. However. E).62 Statistical Models..4 applies.6..(Xi) i=l j=l i=l n • n nB(9)}. with its distinct p(p + I) /2 entries. the relation in (a) may be far from obvious (see Problem 1.1 '" 11"'. (1. and.6. . Then p(xI9) = III h(x.."5) family by E(X). For {N(/" "')}. and that (.(Y). is open.)] eXP{L 1/. as we always do in the Bayesian context..Yp. T (and h generalizing Example 1.3.Jl)TE.6.29) that I) generate this family and that the rank of the family is indeed p(p + 3)/2.17) can be rewritten ( 2: 1 <i<j~p aijYiYj + ~ 2: aii r:2 ) + L(2: aij /lj)Yi i=l i=l j=l p where E. Recall that Y px 1 has a p variate Gaussian distribution. revealing that this is a k = p(p + 3)/2 parameter exponential family with statistics (Yi..5 Conjugate Families of Prior Distributions In Section 1. . This is a special case of conjugate families of priors.Jl)}. which is obviously a 11 function of (J1. the 8(11) 0) family is parametrized by E(X).6.2 we considered beta prior distributions for the probability of success in n Bernoulli trials..t parametrization is close to the initial parametrization of classical P.6.(9) LT.Jl) . families to which the posterior after sampling also belongs.11.11. E). Y n are iid Np(Jl.. h(Y) . which is a p x p symmetric matrix.p/2 I exp{ . By our supermodel discussion. where we identify the second element ofT. E) _~yTEIy + (E1JllY 2 I 2 (log Idet(E)1 + JlT I P log". We close the present discussion of exponential families with the following example.1 (Y . . (1. .Xn is a sample from the kparameter exponential family (1.a 2 ). See Section 2. The corollary will prove very important in estimation theory. An important exponential family is based on the multivariate Gaussian distributions of Section 8.5. write p(x 19) for p(x. and Performance Criteria Chapter 1 The relation in (a) is sometimes evident and the j.. The p Van'ate Gaussian Family. E.. Example 1.2 (Y .tpx 1 and positive definite variance covariance matrix L::pxp' iff its density is f(Y.6. Thus.E). N p (1J. 0 ! = 1. 9 ~ (Jl. 0). . .Yn)T follows the k = p(p + 3)/2 parameter exponential family with T = (EiYi . Goals..17) The first two terms on the right in (1..Jl.10).ljh<i<..16) Rewriting the exponent we obtain 10gf(Y.6.{Y.~p).. Suppose Xl. .6...X') = (JL.. the N(/l. where X is the Bernoulli trial._.. Jl. E) = Idet(E) 1 1 2 / .
20) where t = (fJ..19) o {(iJ. Suppose Xl. tk+l) E 0.21) where S = (SI..Section 1.2)' Uo 2uo Ox . .u5) sample.1. Note that (1.36).18) by letting 11.6. then Proposition 1.. .6 Exponential Families 63 where () E e. 1r(6jx) is the member of the exponential family (1. where u6 is known and (j is unknown. be "parameters" and treating () as the variable of interest.fk+Jl. is a conjugate prior to p(xIO) given by (1. n)T D It is easy to check that the beta distributions are obtained as conjugate to the binomial in this way. Forn = I e. Because two probability densities that are proportional must be equal. . . let t = (tl. Proof.. . tk+ I) T and w(t) = 1= eXP{hiJ~J(O)' _='" _= 1 k ik+JB(II)}dO J···dOk (1.22) 02 p(xIO) ex exp{ 2 .'" .21) and our assertion follows. .18).(O) ex exp{L ~J(O)(LT.6."" Sk+l)T = ( t} + ~TdxiL .21) is an updating formula in the sense that as data Xl.6. ..6. The (k + I)parameter exponential/amity given by k ".6. .1. where a = (I:~ 1 T1 (Xi).20).12. 0 Remark 1.'''' I:~ 1 Tk(Xi). X n become available.. . then k n . n n )T and ex indicates that the two sides are proportional functions of ().O<W(tJ.6.6.. We assume that rl is nonempty (see Problem 1..6.Xn is a N(O. That is. .6..fkIJ)<oo) with integrals replaced by sums in the discrete case.6.6..(0).(0) ~ exp{L ~j(O)fj . Example 1. j = 1. which is k~dimensional.(Xi)+ f..20).. If p(xIO) is given by (1. (1.6.20) given by the last expression in (1. . k.(lk+1 j=1 i=1 + n)B(O)) ex "..) ...6. . To choose a prior distribution for model defined by (1. ik + ~Tk(Xi)' tk+l + 11. we consider the conjugate family of the (1.(Olx) ex p(xlO)".6.6.6. A conjugate exponential family is obtained from (1. the parameter t of the prior distribution is updated to s = (t + a).18) and" by (1. and t j = L~ 1 Tj(xd.fk+IB(O) logw(t)) j=1 (1. .
20) has density (1. the posterior has a density (1.24).24) Thus.23) Upon completing the square.26) intuitively as ( 1. consists of all N(TJo. = 1 . = s. = nTg(n)I(1~.i. .6. 761) where '10 varies over RP. Tg is scalar with TO > 0 and I is the p x p identity matrix (Problem 1. Moreover.6. + n. and Performance Criteria Chapter 1 This is a oneparameter exponential family with The conjugate twoparameter exponential family given by (1.6.d. Using (1.28) where W. Goals.(S) = t. we must have in the (t l •t2) parametrization "g (1.(n) (g +n)I[s+ 1]0~~1 TO TO (1.W. we obtain 1r.. E RP and f symmetric positive definite is a conjugate family f'V .6. 75) prior density. > 0 and all t I and is the N (tI!t" (1~ It.25) By (1. t.64 Statistical Models.26) (1.6.(O) <X exp{ .6. Np(O. i.27) Note that we can rewrite (1.) density.30) that the Np()o" f) family with )0. (J Np(TJo.(n) = . = rg(n)lrg so that w. 1 < i < n.n) and variance = t. W.6. it can be shown (Problem 1.6.6. Eo). if we observe EX.23) with (70 TO .6.. we find that 1r(Olx) is a normal density with mean Jl(s.) } 2a5 h t2 t1 2 (1.21).6. 0 These formulae can be generalized to the case X. If we start with aN(f/o.6. (0) is defined only for t. Our conjugate family.37). tl(S) = 1]0(10 2 TO 2 + S.6. rJ) distributions where Tlo varies freely and is positive. Eo known.( 0 . 1r. therefore.
32.'T}k and B on e. The set E is convex.29) (Tl (X). In fact.6 Exponential Families 65 but a richer one than we've defined in (1. It is easy to see that one can consUUct conjugate priors for which one gets reasonable formulae for the parameters indexing the model and yet have as great a richness of the shape variable as one wishes by considering finite mixtures of members of the family defined in (1. and Darmois. x j=1 E X c Rq.20). e..6..Section 1. which admit kdimensional sufficient statistics for all sample sizes.5..6. is a kparameter exponential/amity of distributions if there are realvalued functions 'T}I .3 is not covered by this theory.20) except for p = 1 because Np(A. .6.. T k (X)) is called the natural sufficient statistic of the family.6. . the map A . .6. . ..6..(s) = exp{A(7Jo + s) . 8 C R k .1 are often too restrictive. E 1 R is convex. a theory has been built up that indicates that under suitable regularity conditions families of distributions. ... the family of distributions in this example and the family U(O.10 is a special result of this type. r) is a p(p + 3)/2 rather than a p + I parameter family. the conditions of Proposition 1.J: h(x)exp{TT(x)7J}dx in the continuous case. Discussion Note that the uniform U( (I. 0) = h(x) expl2:>j(I1)Ti (x) . O}) model of Example 1. The set is called the natural parameter space.A(7Jo)} e e. 2. {PO: 0 E 8}. Summary. for all s such that '10 + s is in Moreover E 7Jo [T(X)] = A(7Jo) and Var7Jo[T(X)] A( '10) where A and A denote the gradient and Hessian of A. starting with Koopman.6. must be kparameter exponential families. See Problems 1. (1. which is onedimensional whatever be the sample size.. h on Rq such that the density (frequency) function of Pe can be written as k p(x. Problem 1. ••• . Despite the existence of classes of examples such as these. Some interesting results and a survey of the literature may be found in Brown (1986). In the onedimensional Gaussian case the members of the Gaussian conjugate family are unimodal and symmetric and have the same shape. Pitman. If bas a nonempty interior in R k and '10 E then T(X) has for X ~ P7Jo the momentgenerating function . B) are not exponential. with integrals replaced by sums in the discrete case. ..31 and 1. and realvalued functions TI. In fact. " Tk..Xn ).B(O)]. The canonical kparameter exponentialfamily generated by T and h is T(X) = where A(7J) ~ log J:. is not of the form L~ 1 T(Xi). Eo). to Np (8 . The natural sufficient statistic max( X 1 ~ .
6.L. He wishes to use his observations to obtain some infonnation about J.29). The (k + 1). (ii) '7 is identifiable. A family F of priOf distributions for a parameter vector () is called a conjugate family of priors to p(x I 8) if the posterior distribution of () given x is a member of F.66 Statistical Models. Theoretical considerations lead him to believe that the logarithm of pebble diameter is normally distributed with mean J.B(O)tk+l logw} j=l where w = 1:" 1: E exp{l:.7 PROBLEMS AND COMPLEMENTS Problems for Section 1. Give a formal statement of the following models identifying the probability laws of the data and the parameter space. State whether the model in question is parametric or nonparametric. (iii) Var'7(T) is positive definite. (b) A measuring instrument is being used to obtain n independent determinations of a physical constant J. 1. . Assume that the errors are otherwise identically distributed nonnal random variables with known variance. T" ..parameter exponential family k ". then the following are equivalent: (i) P is of rank k..L and a 2 but has in advance no knowledge of the magnitudes of the two parameters. is conjugate to the exponential family p(xl9) defined in (1. (v) A is strictlY convex on f. (a) A geologist measures the diameters of a large number n of pebbles in an old stream bed.1)) (O)t j .) E R k+l : 0 < w < oo}.(0) = exp{L1)j(O)t) . . and Performance Criteria Chapter 1 An exponential family is said to be of rank k if T is kdimensional and 1.. tk+..1 units. Goals..B(O)}dO. If P is a canonical exponential family with E open.1 1. and t = (t" . (iv) the map '7 ~ A('7) is I ' Ion 1'.L and variance a 2 . Suppose that the measuring instrument is known to be biased to the positive side by 0. tk+1) 0 ~ {(t" . Ii II I . l T k are linearly independent with positive Po probability for some 8 E e..
2). .p. . 3.) (a) Xl.. (e) The parametrization of Problem l. . B = (1'1.ap ) : Lai = O}. are sampled at random from a very large population. . Which of the following parametrizations are identifiable? (Prove or disprove.."" a p .. v. .1(c). Q p ) and (A I .. . At. (a) Let U be any random variable and V be any other nonnegative random variable.. 1 Ab) restricted to the sets where l:f I (Xi = 0 and l:~~1 A. ..2 ). . . i = 1. each egg has an unknown chance p of hatching and the hatChing of one egg is independent of the hatching of the others. respectively..Xp are independent with X~ ruN(ai + v. = o.1.. i=l (c) X and Y are independent N (I' I .Xpb ..1. then X is (b) As in Problem 1. (If F x and F y are distribution functions such that Fx(t) said to be stochastically larger than Y. bare independeht with Xi.. ~ N(l'ij.. = (al.. Show that Fu+v(t) < Fu(t) for every t. 2. 1'2) and we observe Y X.. Ab.2) where fLij = v + ai + Aj.Xp ).t. Are the following parametrizations identifiable? (Prove or disprove. and Po is the distribution of X (b) Same as (a) with Q = (X ll . . 0. Two groups of nl and n2 individuals. j = 1..··. .7 Problems and Complements 67 (c) In part (b) suppose that the amount of bias is positive but unknown. Each . .. Q p ) {(al.. An entomologist studies a set of n such insects observing both the number of eggs laid and the number of eggs hatching for each nest. . Once laid. .. restricted to p = (all' .Section 1. ... (b) The parametrization of Problem 1.2) and N (1'2. (d) Xi.) < Fy(t) for every t. Can you perceive any difficulties in making statements about f1.1(d).2 ) and Po is the distribution of Xll.for this model? (d) The number of eggs laid by an insect follows a Poisson distribution with unknown mean A.) (a) The parametrization of Problem 1.l(d) if the entomologist observes only the number of eggs hatching but not the number of eggs laid in each case. 0..1 describe formaHy the foHowing model. e (e) Same as (d) with (Qt.1.. . . 4.. .
.1..t) fj>r all t... ...9 and they oecur with frequencies p(0. 2.4(1). k.p(0. J.1. M k = mk] I! n.I nl .' I . nk·mI .LI ni + . ml . + mk + r = n 1. . but the distribution of blood pressure in the population sampled before and after administration of the drug is quite unknown.. 0 < Vi < 1. I nl. The number n of graduate students entering a certain department is recorded... I. 8. o < J1 < 1. . . (e) Suppose X ~ N(I'...Li = 71"(1 M)iIJ..2).. i = 1.p(0. . Both Y and p are said to be symmetric about c.1. Hint: IfY . 0 < J. 7.Li < + ..· . .0).. 0 < V < 1 are unknown.. Suppose the effect of a treatment is to increase the control response by a fixed amount (J.c bas the same distribution as Y + c.Nk = nk. P8[N I . + nk + ml + . is the distribution of X e = (0... Goals. }. . if and only if.. (a) What are the assumptions underlying this model'? (b) (J is very difficult to estimate here if k is large..) (a) p.. the density or frequency function p of Y satisfies p( c + t) ~ p( c .c) = PrY > c .t) = 1 .L. .O}.. 0'2).9).P(Y < c .:". I . . Let Po be the distribution of a treatment response. e = = 1 if X < 1 and Y {J. • "j :1 member of the second (treatment) group is administered the same dose of a certain drug believed to lower blood pressure and the blood pressure is measured after 1 hour.. .. .. Each member of the first (control) group is administered an equal dose of a placebo and then has the blood pressure measured after 1 hour. .0. Which of the following models are regular? (Prove or disprove. = n11MI = mI. is the distribution of X when X is unifonn on (0. 0 = (1'. The following model is proposed. . • ..0'2) (d) Suppose the possible control responses in an experiment are 0. Consider the two sample models of Examples 1. Mk. 1 < i < k and (J = (J.Lk VI n". and Performance Criteria Chapter 1 ":. =X if X > 1.. I ! fl. mk·r.. VI. It is known that the drug either has no effect or lowers blood pressure. Let Y and Po is the distribution of Y. 68 Statistical Models. (b)P. Show that Y . ...k + VI + . 5.k is proposed where 0 < 71" < I.3(2) and 1.. In each of k subsequent years the number of students graduating and of students dropping out is recorded..LI. V k mk P r where J. whenXis unifonnon {O. Let N i be the number dropping out and M i the number graduating during year i.  . .1). Vi = (1 11")(1 ~ v)iI V for i = 1. Vk) is unknown. What assumptions underlie the simplification? 6.2.c has the same distribution as Y + c.. + Vk + P = 1. The simplification J... (0). then P(Y < t + c) = P( Y < t .2. + fl.t).··· ..0.
ci '" N(O. {r(j) : j ~ 0. e(j). are not collinear (linearly independent)..Xn are observed i. 11. . that is. Therefore.. log X and log Y satisfy a shift model with parameter log (b) Show that if X and Y satisfy a shift model With parameter tl. .Xn ). = 9. Suppose X I. t > O. .2x yield the same distribution for the data (Xl. r(j) = = PlY = j. Positive random variables X and Y satisfy a scale model with parameter 0 > 0 if P(Y < t) = P(oX < t) for all t > 0. or equivalently.Znj?' (a) Show that ({31. > 0. C). Let X < 1'. . X m and Yi.'" . .tl). . fJp) are not identifiable if n pardffieters is larger than the number of observations. then satisfy a scale model with parameter e. .... . I(T < C)) = j] PIC = j] PIT where T.2x and X ~ N(J1.i. C(t} = F(tjo). (b) In part (a). . then C(·) = F(. e) : p(j) > 0. . e) vary freely over F ~ {(I'. suppose X has a distribution F that is not necessarily normal. For what tlando(x) ~ 2J1+tl2x? type ofF is it possible to have C(·) ~ F(·tl) for both o(x) (e) Suppose that Y ~ X + O(X) where X ~ N(J1. o (a) Show that in this case. C(·) = F(· . .. 1 < i < n. Collinearity: Suppose Yi LetzJ' = L:j:=1 4j{3j + Ci. 0 'Lf 'Lf op(j) = I. p(j)... if the number of = (min(T."') and o(x) is continuous.. C are independent.1.7 Problems and Complements 69 (a) Show that if Y ~ X + o(X). . (]"2) independent. Sbow that {p(j) : j = 0.3 let Xl. 12. 0 < j < N. i .N = 0. Let c > 0 be a constant. Yn denote the survival times of two groups of patients receiving treatments A and B. according to the distribution of X."" Yn ).. eX o.d. Y.{3p) are identifiable iff Zl. .. 10.. Y = T I Y > j]."'). Hint: Consider "hazard rates" for Y min(T. C).. N.. . N}. .. That is... Sx(t) = . The Scale Model. In Example 1. .6. . N} are identifiable.tl and o(x) ~ 21' + tl.. (Yl. (b) Deduce that (th..Section 1.tl) does not imply the constant treatment effect assumption. log y' satisfy a shift model? = Xc. The Lelunann 1\voSample Model.tl) implies that o(x) tl.. Show that if we assume that o(x) + x is strictly increasing. eY and (c) Suppose a scale model holds for X. e(j) > 0. I} and N is known. then C(·) ~ F(. . Does X' Y' = yc satisfy a scale model? Does log X'.zp. . j = 0. and (I'. the two cases o(x) . = (Zlj. . o(x) ~ 21' + tl . .
Show that hy(t) = c.(t). = = (. For treatments A and E. as T with survival function So.  . Hint: By Problem 8. and Performance CriteriCl Chapter 1 t Ii ..ll) with scale parameter 5. (c) Suppose that T and Y have densities fort) andg(t). z)) (1.. then X' = log So (X) and Y' ~ log So(Y) follow an exponential scale model (see Problem Ll. 13. 7. (b) By extending (bjn) from the rationals to 5 E (0. Hint: See Problem 1. we have the Lehmann model Sy(t) = si. Sy(t) = Sg.. . Show that if So is continuous. t > 0.12. z) zT. Let T denote a survival time with density fo(t) and hazard rate ho(t) = fo(t)j P(T > t). then Yi* = g({3. Zj) + cj for some appropriate €j. Let f(t I Zi) denote the density of the survival time Yi of a patient with covariate vector Zi apd define the regression survival and hazard functions of Y as i Sy(t I Zi) = 1~ fry I zi)dy. show that there is an increasing function Q'(t) such that ifYt = Q'(Y.h. A proportional hazard model.G(t). The most common choice of 9 is the linear form g({3. " T k are unobservable and i.7.) Show that (1.l3.00).7.2.. k = a and b. . respectively.(t) = fo(t)jSo(t) and hy(t) = g(t)jSy(t) are called the hazard rates of To and Y. F(t) and Sy(t) ~ P(Y > t) = 1 .{a(t).(t) with C. ~ a5.) Show that Sy(t) = S'.1) Equivalently. Also note that P(X > t) ~ Sort).I ~I ...1. Tk > t all occur. Goals. t > 0...7.i.. log So(T) has an exponential distribution.2) where ho(t) is called the baseline hazard function and 9 is known except for a vector {3 ({311 . Set C. Then h. are called the survival functions. I (b) Assume (1.d. h(t I Zi) = f(t I Zi)jSy(t I Zi). 1 P(X > t) =1 (. . . . where TI".z)}. thus. (1.2) is equivalent to Sy (t I z) = sf" (t). I' 70 StCitistical Models. hy(t) = c. 12.. t > 0. So (T) has a U(O.ho(t) is called the Cox proportional hazard model.l3. Specify the distribution of €i. . I) distribution.(t) if and only if Sy(t) = Sg.l3. = exp{g(. The Cox proportional hazani model is defined as h(t I z) = ho(t) exp{g(.. I (3p)T of unknowns.(t). Find an increasing function Q(t) such that the regression survival function of Y' = Q(Y) does not depend on ho(t).). (c) Under the assumptions of (b) above. Survival beyond time t is modeled to occur if the events T} > t. Moreover.2) and that Fo(t) = P(T < t) is known and strictly increasing.
the median is preferred in this case. where () > 0 and c = 2. When F is not symmetric. .5) or mean I' = xdF(x) = Fl(u)du. Problems for Section 1. J1 and v are regarded as centers of the distribution F. 1f(O2 ) = ~.7 Problems and Complements 71 Hint: See Problems 1.Section 1. Show how to choose () to make J. Consider a parameter space consisting of two points fh and ()2. Yl .d. Generally. wbere the model {(F. F.. (b) Suppose Zl and Z. Find the . Yn be i.i.11 and 1. Suppose the monthly salaries of state workers in a certain state are modeled by the Pareto distribution with distribution function J=oo Jo F(x. and Zl and Z{ are independent random variables.2 0..Xm be i. the parameter of interest can be characterl ized as the median v = F. Merging Opinions. x> c x<c 0. Observe that it depends only on l:~ I Xi· I 0).1.p and ~ are identifi· able.1..12. ..2 with assumptions (t){4). ..6 Let 1f be the prior frequency function of 0 defined by 1f(I1. Show that both .000 is the minimum monthly salary for state workers.1. what parameters are? Hint: Ca) PIXI < tl = if>(.l (0. (b) Suppose X" . and suppose that for given ().4 1 0." .O) 1 . . Find the median v and the mean J1 for the values of () where the mean exists.8 0. (a) Suppose Zl' Z. 15. . Examples are the distribution of income and the distribution of wealth. In Example 1. 1) distribution.p(t)).. G.L . X n are independent with frequency fnnction p(x posterior1r(() I Xl.2 1.. Let Xl.. have a N(O. . Are. have a N(O. an experiment leads to a random variable X whose frequency function p(x I 0) is given by O\x 01 O 2 0 0.p and C.2 unknown. J1 may be very much pulled in the direction of the longer tail of the density.d. 'I/1(±oo) = ±oo. 14.2) distribution with . ) = ~. and for this reason.v arbitrarily large.(x/c)e.i. Here is an example in which the mean is extreme and the median is not. still identifiable? If not.X n ). Ca) Find the posterior frequency function 1f(0 I x).G)} is described by where 'IjJ is an unknown strictly increasing differentiable map from R to R. '1/1' > 0..
O)kO. ':. 1r(J)= 'u .75. . . X has the geometric distribution (a) Find the posterior distribution of (J given X = 2 when the prior distribution of (} is . X n are independent with the same distribution as X. k = 0. 2. ~ = (h I L~l Xi ~ = . 4'2'4 (b) Relative to (a).. (b) Find the posterior density of 8 when 71"( OJ = 302 . Unllormon {'13} . X n be distributed as where XI. 4. ..m+I.. ) ~ . and Performance CriteriCi Chapter 1 (cj Same as (b) except use the prior 71"1 (0 . 7l" or 11"1. Then P. Suppose that for given 8 = 0.72 Statistical Models.. 71"1 (0 2 ) = . for given (J = B. (J. where a> 1 and c(a) . what is the most probable value of 8 given X = 2? Given X = k? (c) Find the posterior distribution of (} given X = k when the prior distribution is beta.0 < 0 < 1. IE7 1 Xi = k) forthe two priors (f) Give the set on which the two B's disagree. . . l3(r. "X n ) = c(n 'n+a m) ' J=m. 3.Xn = X n when 1r(B) = 1.2. " ••. . is used in the fonnula for p(x)? 2. does it matter which prior. ..Xn are natural numbers between 1 and Band e= {I. Consider an experiment in which.... Compare these B's for n = 2 and 100. Assume X I'V p(x) = l:.[X ~ k] ~ (1. 7r (d) Give the values of P(B n = 2 and 100. < 0 < 1.. 0 (c) Find E(81 x) for the two priors in (a) and (b).. I XI . 0 < x < O.. 1. " J = [2:.=l1r(B i )p(x I 8i ). c(a).. 'IT .0 < B < 1. . }.. . Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability of success O.'" .8). the outcome X has density p(x OJ ~ (2x/O'). (d) Suppose Xl.5n) for the two priors and 1f} when 7f (e) Give the most probable values 8 = arg maxo 7l"(B and 71"1. . (3) Find the posterior density of 8 when 71"( 0) ~ 1. . For this convergence. (3) Suppose 8 has prior frequency.2.)=1. Let X I. . 3..25. Show that the probability of this set tends to zero as n t 00. This is called the geometric distribution (9(0)). Let 71" denote a prior density for 8. Goals. Find the posteriordensityof8 given Xl = Xl. .'" 1rajI Show that J + a.
"'tF·t'.Section 1.. .2. n. are independent standard exponential. Justify the following approximation to the posterior distribution where q.0 E Let 9 have prior density 1r.8) that if in Example 1.) = n+r+s Hint: Let!3( a. then the posterior distribution of D given X = k is that of k + Z where Z has a B(N .. . .J:n). . 6.1.. . Xn) = Xl = m for all 1 as n + 00 whatever be a. In Example 1. b) denote the posterior distribution.. Let (XI. 9." . . 11'0) distribution.f) = [~. b) is the distribution of (aV loW)[1 + (aV IbW)]t. s).1 that the conditional distribution of 6 given I Xi = k agrees with the posterior distribution of 6 given X I = Xl •. 7.. (a) Show that the family of priors E. Show that ?T( m I Xl. X n+ k ) be a sample from a population with density f(x I 0).. where E~ 1 Xi = k.··..c(b. Next use the central limit theorem and Slutsky's theorem.Xn is a sample with Xi '"'' p(x I (}).b> 1. . 10.."" Zk) where the marginal distribution of Y equals the posterior distribution of 6 given Xl = Xl. where {i E A and N E {l. Suppose Xl.. . X n = Xn is that of (Y..(1.n. 11'0) distribution. Interpret this result.2. Xn) + 5. XI = XI.2. ZI.7 Problems ... . . D = NO has a B(N. then !3( a. Show that a conjugate family of distributions for the Poisson family is the gamma family..'" .. I Xn+k) given . f(x I f). Show rigorously using (1.. .and Complements 73 wherern = max(xl.• X n = Xn' and the conditional distribution of the Zi'S given Y = t is that of sample from the population with density e. f3(r. . Assume that A = {x : p( x I 8) > O} does not involve O.1= n+r+s _ . } is a conjugate family of prior distributions for p(x I 8) and that the posterior distribution of () given X = x is . If a and b are integers.... S. Va' W t .. a regular model and integrable as a function of O. . is the standard normal distribution function and n _ T 2 x + n+r+s' a J.2. Show that the conditional distribution of (6. Show in Example 1.. X n+ Il .1 suppose n is large and (1 In) E~ I Xi = X is not close to 0 or 1 and the prior distribution is beta.. W.1. where VI.• X n = Xn. (b) Suppose that max(xl.
posterior predictive distribution. X n . Find I 12. . . Here HinJ: Given I" and ". 14. . Q'j > 0. X n are i. . 11. x> 0.d.J t • I X. 52) of J1.~X) '" tnI..") ~ ./If (Ito.2) and we formally put 7t(I".1 < j < T. 9= (OJ. (a) Show that p(x I 0) ()( 04 n exp (~tO) where "proportional to" as a function of 8. Show that if Xl. Q'r)T.9).. given (x. Suppose p(x I 0) is the density of i. j=1 • .. given x. .d. (a) If f and 7t are the N(O. i). X 1. = 1. (12). <0<x and let7t(O) = 2exp{20}. < 1. The posterior predictive distribution is the conditional distribution of X n +l given Xl. a = (a1. . LO. .6 ""' 1r.. 0 > O. j=1 '\' • = 1.Xn.\ > 0.) pix I 0) ~ ({" . L. Letp(x I OJ =exp{(xO)}... . 8 2 ) is such that vn(p.. Note that. and () = a. (N.d. Find the posterior distribution 7f(() I x) and show that if>' is an integer."Z In) and (n1)S2/q 2 '" X~l' This leads to p(x. has density r("" a) IT • fa(u) ~ n' r(. (c) Find the posterior distribution of 0". 0<0. Let N = (N l ... . . the predictive distribution is the marginal distribution of X n + l . . ..J u.i. where Xi known..O the posterior density 7t( 0 Ix). and Performance Criteria Chapter 1 + nand ({. 13. . 'V . In a Bayesian model whereX l .O. .)T. . . T6) densities. . vB has a XX distribution. "5) and N(Oo. 0 > O... 1 X n . •• • . (. compute the predictive and n + 00.. unconditionally.. . Next use Bayes rule. The Dirichlet distribution.Xn + l arei. I(x I (J).i.) u/' 0 < cx) L".. N. X and 5' are independent with X ~ N(I". 0> 0 a otherwise. The Dirichlet distribution is a conjugate prior for the multinomial.. (b) Discuss the behavior of the two predictive distributions as 15.~vO}. Xl . .2 is (called) the precision of the distribution of Xi.) be multinomial M(n.. ..~l . 01 )=1 j=l Uj < 1... N(I".. 82 I fL..i.74 where N' ~ N Statistical Models. Goals. x n ). I (b) Use the result (a) to give 7t(0) and 7t(0 I x) when Oexp{ Ox}. O(t + v) has a X~+n distribution. then the posterior density ~(Jt 52 = _1_ "'(X _ X)2 nI I. . D(a). v > 0. po is I t = ~~ l(X i  1"0)2 and ()( denotes (b) Let 7t( 0) ()( 04 (>2) exp { ..
3.. respectively and suppose .159 be the decision rules of Table 1. .. .5. a new buyer makes a bid and the loss function is changed to 8\a 01 O 2 al a2 a3 a 12 7 I 4 6 (a) Compute and plot the risk points in this case for each rule 1St. .. a3.. Suppose the possible states of nature are (Jl.3. 0. . 159 for the preceding case (a). Let the actions corresponding to deciding whether the loss function is given by (from Lehmann.3. Problems for Section 1. 8 = 0 or 8 > 0 for some parameter 8. . Suppose that in Example 1. (J) given by O\x a (I . then the posteriof7r(0 where n = (nt l · · · .. a2. I. . n r ). (d) Suppose that 0 has prior 1C(OIl (a). = 0. (d) Suppose 0 has prior 1C(01) ~ 1'.) = 0. a) is given by 0.1.p) (I ..1.3. I b"9 }.3.Section 1.q) and let when at. 1. Compute and plot the risk points (a)p=q= .. (b) Find the minimax rule among {b 1 .5. 1C(O.7 Problems and Complements 75 Show that if the prior7r( 0) for 0 is V( a).1. the possible actions are al.3. . and the loss function l((J. The problem of selecting the better of two treatments or of deciding whether the effect of one treatment is beneficial or not often reduces to the pr9blem of deciding whether 8 < O. 159 of Table 1.3 IN ~ n) is V( a + n). or 0 > a be penoted by I. = 11'. Find the Bayes rule for case 2.5. (c) Find the minimax rule among the randomized rules. (c) Find the minimax rule among bt. 1C(02) l' = 0. (J2. (b)p=lq=..5 and (ii) l' = 0. Find the Bayes rule when (i) 3. O 2 a 2 I a I 2 I Let X be a random variable with frequency function p(x.1. 1957) °< °= a 0. See Example 1.
How should nj. We want to estimate the mean J1.76 StatistiCCII Models. Assume that 0 < a. I ..andastratumsamplemeanXj. Stratified sampling.3) . Weassurne that the s samples from different strata are independent.0))..g.I L LXij..i. where <f> = 1 <I>. geographic locations or age groups).) c<f>( y'n(r . i I For what values of B does the procedure with r procedure with r = ~s = I? = 8 = 1 have smaller risk than the 4. 1 (l)r=s=l. are known (estimates will be used in a latet chapter). Suppose X is aN(B.7.. variances. 0 be chosen to make iiI unbiased? I (b) Neyman allocation.0)) +b<f>( y'n(s . . iiz = LpjX j=Ii=1 )=1 s n] s j where we assume that Pj. Show that the strata sample sizes that minimize MSE(!i2) are given by (1. b<f>(y'ns) + b<I>(y'nr). I I I.j = 1. n = 1 and . 1 < j < S. (..d. .. and MSEs of iiI and 'j1z.s. and <I> is the N(O. < 00. J". 0<0 0=0 0 >0 R(O. Goals. E. < J' < s.=l iii = N. (b) Plot the risk function when b = c = 1.(X) 0 b b+c 0 c b+c b 0 where b and c are positive. = E(X) of a population that has been divided (stratified) into s mutually exclusive parts (strata) (e. random variables Xlj"". (a) Compute the biases. Within the jth stratum we have a sample of i.Xnjj. 11 . are known. 1) sample and consider the decision rule = 1 if X <r o (a) Show that the risk function is given by ifr<X<s 1 if X > s.8.j = I. 1) distribution function. 1 < j < S. c<I>(y'n(sO))+b<I>(y'n(rO)). and Performance Criteria ChClpter 1 O\a 1 0 c 1 <0 0 >0 J". .) r=2s=1. Suppose that the jth stratum has lOOpj% of the population and that the jth stratum population mean and variances are f£j and Let N = nj and consider the two estimators 0. ..
i.3. l X n . where lJ is defined as a value satisfyingP(X < v) > ~lllldP(X >v) >~.3. plot RR for p = . Suppose that n is odd. 5. U(O..45.2. Use a numerical integration package.9. Let X and X denote the sample mean and median. 1)..7 Problems and Complements 77 Hint: You may use a Lagrange multiplier.3) is N. . . I).5(0 + I) and S ~ B(n.=1 pp7J" aV. > 0. < ° P < 1.'.2'.0 + '.. set () = 0 without loss of generality. (c) Same as (b) except when n ~ 1.7. (a) Find MSE(X) and the relative risk RR = MSE(X)/MSE(X). N(o. Let Xb and X b denote the sample mean and the sample median of the sample XI b. Hint: Use Problem 1. X n be a sample from a population with values 0.0 . ali X '"'' F. . ~ ~  ~ 6. (b) Evaluate RR when n = 1.5.. .0. Let XI. Next note that the distribution of X involves Bernoulli and multinomial trials. ~ 7. ~ ~ ~ .40. and that n is odd. > k) (ii) F is uniform. (iii) F is normal.d.5. n ~ 1. We want to estimate "the" median lJ of F.=1 Pj(O'j where a.bl for the situation in (i). . Hint: See Problem B..5.bl.. ~ a) = P(X = c) = P. ~ (a) Find the MSE of X when (i) F is discrete with P(X a < b < c.I 2:.2.1.3.25. Hint: By Problem 1.. [f the parameters of interest are the population mean and median of Xi .. respectively..= 2:. Also find EIX . (b) Compute the relative risk RR = M SE(X)/MSE(X) in question (i) when b = 0. 75.2.b. . . a = . Suppose that X I. show that MSE(Xb ) and MSE(Xb ) are the same for all values of b (the MSEs of the sample mean and sample median are invariant with respect to shift). b = '. .13... p = . and 2 and (e) Compute the relative risks M SE(X)/MSE(X) in questions (ii) and (iii). .4. The answer is MSE(X) ~ [(ab)'+(cb)'JP(S where k = . and = 1..5.bl when n = compare it to EIX .2.20. X n are i.p).2p.5.. Each value has probability . P(X ~ b) = 1 .Section 1.) with nk given by (1.. ° = 15. that X is the median of the sample. Hint: See Problem B. (d) Find EIX .3. . (c) Show that MSE(!ill with Ok ~ PkN minus MSE(!i.15.b.. .0 ~ + 2'.
2 ~~ I (Xi .. for all 6' E 8 1.  .. = c L~ 1 (Xi .. defined by (J(6..X)' = ([Xi . 9. If the true 6 is 60 . 0 < a 2 < 00. (a)r = 8 =1 (b)r= ~s =1.8lP where p = XI n is the proportion with the characteristic in a sample of size n from the company. . 6' E e. II. In Problem 1.3(a) with b = c = 1 and n = I. I I 78 Statistical Models. 8. (a) Show that if 6 is real and 1(6. Find MSE(6) and MSE(P). the powerfunction. i1 . (b) Suppose Xi _ N(I'. . 10.2)(. and only if. being lefthanded). A decision rule 15 is said to be unbiased if E. 10% have the characteristic. ~ 2(n _ 1)1. ..(1(6.3.(1(6'. e 8 ~ (. .[X . I .. I (b) Show that if we use the 0 1 loss function in testing.0(X))) < E..X)2 has a X~I distribution.a)2.(o(X)). .4. It is known that in the state where the company is located. Let Xl. 0) (J(6'.X)2 keeping the square brackets intact.X)2.'). A person in charge of ordering equipment needs to estimate and uses I . and Performance Criteria Chapter 1 :! .L)4 = 3(12. Show that the value of c that minimizes M S E(c/.) is (n+ 1)1 Hint!or question (bi: Recall (Theorem B. then this definition coincides with the definition of an unbiased estimate of e.1)1 E~ 1 (Xi . Let () denote the proportion of people working in a company who have a certain characteristic (e. a) = (6 .J.0): 6 E eo}.10) + (.0(X))) for all 6.1'] . . satisfies > sup{{J(6. Goals. I .0) = E.for what 60 is ~ MSE(8)/MSE(P) < 17 Give the answer for n = 25 and n = 100. (i) Show that MSE(S2) (ii) Let 0'6 c~ . (a) Show that s:? = (n . You may use the fact that E(X i .X)2 is an unbiased estimator of u 2 • Hint: Write (Xi .Xn be a sample from a population with variance a 2 . Which one of the rules is the better one from the Bayes point of view? 11. then expand (Xi .3.1'])2. suppose (J is discrete with frequency function 11"(0) = 11" (!) = 11" U) = Compute the Bayes risk of I5r •s when i. then a test function is unbiased in this sense if.3) that . ' .g. .
4.1. o Suppose that U . . consider the loss function (1.) + (1 . Ok) as a function of B. A (behavioral) randomized test of a hypothesis H is defined as any statistic rp(X) such that 0 < <p(X) < 1. Define the nonrandomized test Ju . z > 0. Po[o(X) ~ 11 ~ 1 . we petiorm a Bernoulli trial with probability rp( x) of success and decide 8 1 if we obtain a success and decide 8 0 otherwise. If U = u. 0. Convexity ofthe risk set. Consider a decision problem with the possible states of nature 81 and 82 . use the test J u . 18.3. 15.Polo(X) ~ 01 = Eo(<p(X)). For Example 1. 03) = aR(B. 14. by 1 if<p(X) if <p(X) >u < u.) for all B.. wc dccide El.4.3.ao) ~ O. (c) Same as (b) except k = 2. Show that if J 1 and J 2 are two randomized. If X ~ x and <p(x) = 0 we decide 90. (B) = 0 for some event B implies that Po (B) = 0 for all B E El. then.' and 0 = = WJ10 + (1  w)x' II'  1'0 I are known.1. and possible actions at and a2. but if 0 < <p(x) < 1.~ is unbiased 13. s = r = I. given 0 < a < 1. a) is . Show that J agrees with rp in the sense that.. b.Section 1. consider the estimator < MSE(X). The interpretation of <p is the following.. and k o ~ 3. and then JT. 0 < u < 1. Compare 02 and 03' 19. Suppose that the set of decision procedures is finite.UfO. procedures.wo to X.7). 16." (a) Show that the risk is given by (1. find the set of I' where MSE(ji) depend on n.1) and let Ok be the decision rule "reject the shipment iff X > k.' and 0 = II' . B = .3. Consider the following randomized test J: Observe U. Suppose the loss function feB. In Example 1. plot R(O.3. Show that the procedure O(X) au is admissible. (b) find the minimum relative risk of P. In Example 1.3. Furthersuppose that l(Bo. In Problem 1. 0. Your answer should Mw If n.3. 1) and is independent of X..a )R(B. show that if c ::. (a) find the value of Wo that minimizes M SE(Mw). if <p(x) = 1.1'0 I· 17. (h) If N ~ 10. .7 Problems and Complements 79 12. there is a randomized procedure 03 such that R(B. Suppose that Po.1.
(0. . and Performance Criteria Chapter 1 B\a 0. Goals. Show that either R(c) = 00 for all cor R(c) is minimized by taking c to be any number such that PlY > cJ > pry < cJ > A number satisfying these restrictions is called a median of (the distribution of) Y. Let Y be any random variable and let R(c) = E(IYcl) be the mean absolute prediction error.6 (a) Compute and plot the risk points of the nonrandomized decision rules. al a2 0 3 2 I 0. and the best zero intercept linear predictor. An urn contains four red and four black balls. In Problem B. !. 5.l. (c) Suppose 0 has the prior distribution defined by .. Give an example in which Z can be used to predict Y perfectly. (b) Compute the MSPEs of the predictors in (a). In Example 1.) the Bayes decision rule? Problems for Section 1. (a) Find the best predictor of Y given Z. Give the minimax rule among the randomized decision rules.) = 0.. !.Is Z of any value in predicting Y? = Ur + U?.8 0. the best linear predictor. 7. The midpoint of the interval of such c is called the conventionally defined median or simply just the median.80 Statistical Models. What is 1. Let U I .9. (b) Give and plot the risk set S.7 find the best predictors ofY given X and of X given Y and calculate their MSPEs. Four balls are drawn at random without replacement. and the ratio of its MSPE to that of the best and best linear predictors.1. 4.4 = 0. Give the minimax rule among the nonrandomized decision rules.(0. 2.. 3.1 calculate explicitly the best zero intercept linear predictor.2 0. 6. Give an example in which the best linear predictor of Y given Z is a constant (has no predictive value) whereas the best predictor Y given Z predicts Y perfectly.4 I 0. but Y is of no value in predicting Z in the sense that Var(Z I Y) = Var(Z).4. its MSPE. Let X be a random variable with probability function p(x ! B) O\x 0. 0 0. Let Z be the number of red balls obtained in the first two draws and Y the total number of red balls drawn. U2 be independent standard normal random variables and set Z Y = U I. 0.
Let ZI. Suppose that Z has a density p. Let Y have a N(I". which is symmetric about c and which is unimodal. 7 2 ) variables independent of each other and of (ZI. Y)I < Ipl. + z) = p(c . which is symmetric about c. Y) has a bivariate nonnal distribution the best predictor of Y given Z in the sense of MSPE coincides with the best predictor for mean absolute error. Let Zl and Zz be independent and have exponential distributions with density Ae\z. exhibit a best predictor of Y given Z for mean absolute prediction error. p( c all z.YI > s. Suppose that Z. where (ZI . Suppose that if we observe Z = z and predict 1"( z) for Your loss is 1 unit if 11"( z) . Y = Y' + Y".  = ElY cl + (c  co){P[Y > co] . p( z) is nonincreasing for z > c. Sbow that the predictor that minimizes our expected loss is again the best MSPE predictor.t. (a) Show that P[lZ  tl < s] is maximized as a function oftfor each s > ° by t = c. If Y and Z are any two random variables. (b) Show directly that I" minimizes E(IY . z > 0. y l ). Find (a) The best MSPE predictor E(Y I Z = z) ofY given Z = z (b) E(E(Y I Z)) (c) Var(E(Y I Z)) (d) Var(Y I Z = z) (e) E(Var(Y I Z» .cl) = oQ[lc .PlY < eo]} + 2E[(c  Y)llc < Y < co]] 8. (b) Show that the error of prediction (for the best predictor) incurred in using Z to predict Y is greater than that incurred in using Z' to predict y J • 13. 10.col < eo. 9. Show that if (Z. Suppose that Z has a density p. a 2. Y') have a N(p" J. that is.z) for 11. Yare measurements on such a variable for a randomly selected father and son. Many observed biological variables such as height and weight can be thought of as the sum of unobservable genetic and environmental variables. (a) Show that the relation between Z and Y is weaker than that between Z' and y J . Define Z = Z2 and Y = Zl + Z l Z2. that is. ° 14. (b) Suppose (Z.el) as a function of c.7 Problems and Complements 81 Hint: If c ElY .1"1/0] where Q(t) = 2['I'(t) + t<I>(t)].2 ) distrihution. 12. Y'. Show that c is a median of Z. Z". and otherwise. Y" are N(v. a 2.Section 1. p) distribution and Z".L. Y) has a bivariate normal distribution. ICor(Z. Y" be the corresponding genetic and environmental components Z = ZJ + Z". 0. (a) Show that E(IY .
~~y = 1 . (See Pearson.g(Z)) where £ is the set of 17.ElY .4.) (a) Show that 1]~y > piy.) = E( Z. . Let S be the unknown date in the past when the sUbject was infected./L(Z)) = max Corr'(Y. (c) Let '£ = Y . .S from infection until detection. One minus the ratio of the smallest possible MSPE to the MSPE of the constant predictor is called Pearson's correlation ratio 1J~y. in the linear model of Remark 1. Consider a subject who walks into a clinic today.4. Yo > O. on estimation of 1]~y.1] . g(Z)) 9 where g(Z) stands for any predictor. Hint: See Problem 1. Let /L(z) = E(Y I Z ~ z). .g.3.y}.(y) = >./L£(Z)) ~ maxgEL Corr'(Y.. Goals./LdZ) be the linear prediction error. then'fJh(Z)Y =1JZY.15. and Doksurn and Samarov. Show that. (b) Show that if Z is onedimensional and h is a 11 increasing transfonnation of Z. (a) Show that the conditional density j(z I y) of Z given Y (b) Suppose that Y has the exponential density = y is H(y.) = Var( Z. Show that Var(/L(Z))/Var(Y) = Corr'(Y. >. hat1s. Assume that the conditional density of Zo (the present) given Yo = Yo (the past) is where j1. It will be convenient to rescale the problem by introducing Z = (Zo . and a 2 are the mean and variance of the severity indicator Zo in the population of people without the disease. I. We are interested in the time Yo = t .: (0 The best linear MSPE predictor of Y based on Z = z. j3 > 0. y> O. > 0.. 82 Statisflcal Models. mvanant un d ersuchh. IS. and is diagnosed with a certain disease. Show that p~y linear predictors. 2 'T ' . where piy is the population multiple correlation coefficient of Remark 1. .) ~ 1/>. €L is uncorrelated with PL(Z) and 1J~Y = P~Y' 18. 1905. Hint: Recall that E( Z.) ~ 1/>''. At the same time t a diagnostic indicator Zo of the severity of the disease (e. 1995. Here j3yo gives the mean increase of Zo for infected subjects over the time period Yo. = Corr'(Y. 16. and Performance Criteria Chapter 1 . 1). . at time t. that is. and Var( Z.4. IS.I • ! . . .4. Predicting the past from the present./L)/<J and Y ~ /3Yo/<J.exp{ >./L(Z)J' /Var(Y) ~ Var(/L(Z))/Var(Y). a blood cell or viral load measurement) is obtained.
N(z . solving for (a. 20. This density is called the truncated (at zero) normal. Find Corr(Y}. Y2) using (a). then = E{Cov[r(Y). need to be estimated from cohort studies. . including the "prior" 71". x = 1. (b) Show that (a) is equivalent to (1. Z2 and W are independent with finite variances.4. . 6. density.6) when r = s.>. I Z]' Z}.).9. 7r(Y I z) ~ (27r) . Cov[r(Y).14 by setting the derivatives of R(a. s(Y)] < 00. and Nonnand and Doksum. Let Y be a vector and let r(Y) and s(Y) be real valued.. (d) Find the best predictor of Yo given Zo = zo using mean absolute prediction error EIYo . (a) Show that ifCov[r(Y). s(Y) I Z]} + Cov{ E[r(Y) I Z]' E[s(Y) I Z]}. s(Y» given Z = z.. Z] = Cov{E[r(Y) (d) Suppose Y i = al + biZ i + Wand 1'2 = a2 and Y2 are responses of subjects 1 and 2 with common influence W and separate influences ZI and Z2. 1990. see Berman..>. s(Y) I z] for the covatiance between r(Y) and s(Y) in the conditional distribution of (r(Y). 2000).7 Problems and Complements 83 Show that the conditional distribution of Y (the past) given Z = z (the present) has density . where X is the number of spots showing when a fair die is rolled.4. 21. .4. + b2 Z 2 + W. (e) Show that the best MSPE predictor ofY given Z = z is E(Y I Z ~ z) ~ cI<p(>.g(Zo)l· Hint: See Problems 1.. and checking convexity. all the unknowns. b). Let Y be the number of heads showing when X fair coins are tossed. b) equal to zero.z) .Section 1. where Z}. Hint: Use Bayes rule. c. 19. Establish 1. (b) The MSPE of the optimal predictor of Y based on X. (c) The optimal predictor of Y given X = x.7 and 1.>')1 2 } Y> ° where c = 4>(z . 1).I exp { .z).s(Y)] Covlr(Y).4. (c) Find the conditional density 7I"o(Yo I zo) of Yo given Zo = zo. wbere Y j .(>' . Write Cov[r(Y).~ [y  (z .. (In practice. Find (a) The mean and variance of Y. (c) Show that if Z is real.
6(r. (a) Show directly that 1 z:::: 1 Xi is sufficient for 8.g(z») is called weighted squared prediction error..z).z) be a positive realvalued function. Y z )? (f) In model (d). show that the MSPE of the optimal predictor is . and Performance Criteria Chapter 1 (e) In the preceding model (d). g(Z)) < for some 9 and that Po is a density. . x  . density.3. . (c) p(x.14). suppose that Zl and Z.e)' Hint: Whatever be Y and c. Zz).4. Let Xl. if b1 = b2 and Zl.e' < (Y . a > O. Show that EY' < 00 if and only if E(Y .Xn is a sample from a population with one of the following densities.p~y). This is thebeta. Y ~ B(n.4. . 2. 1. 0) = Ox'... Goals. Zz and ~V have the same variance (T2. > a. 25. 0 > 0.~(1 . n > 2.. Show that the mean weighted squared prediction error is minimized by Po(Z) = EO(Y I Z). 0 > 0. .0> O. optimal predictor of Yz given (Yi. \ (b) Establish the same result using the factorization theorem. 3. z) = z(1 . . population where e > O. 1 2 y' .l . 00 (b) Suppose that given Z = z. where Po(Y.z) = 6w (y. Show that L~ 1 Xi is sufficient for 8 directly and by the factorization theorem. are N(p. Then [y . z) and c is the constant that makes Po a density. Let Xi = 1 if the ith item drawn is bad.z).. 0 < z < I. Oax"l exp( Ox").x > 0. < 00 for all e.15) yields (1.6(O. This is known as the Weibull density. 0) = < X < 1. and = 0 otherwise.4. In Example 1. we say that there is a 50% overlap between Y 1 and Yz.') and W ~ N(po. Suppose Xl. Verify that solving (1. 24.. Find Po(Z) when (i) w(y. 1). 1 X n be a sample from a Poisson.g(z)]'lw(y. z) ~ ep(y.5 1. 23. z) = I. and suppose that Z has the beta. z "5).e)' = Y' .z)lw(y. Let n items be drawn in order without replacement from a shipment of N items of which N8 are bad. 0) = Oa' Ix('H). p(e). (a) Let w(y. and (ii) w(y. . Problems for Section 1. s). (a) p(x. In this case what is Corr(Y. a > O.84 Statistical Models.density. 0 (b) p(x. .2eY + e' < 2(Y' + e'). . Assume that Eow (Y. Find the 22.
Section 1. 7
Problems and Complements
85
This is known as the Pareto density. In each case. find a realvalued sufficient statistic for (), a fixed. 4. (a) Show that T 1 and T z are equivalent statistics if, and only if, we can write T z = H (T1 ) for some 11 transformation H of the range of T 1 into the range of T z . Which of the following statistics are equivalent? (Prove or disprove.) (b) n~ (c) I:~
1
x t and I:~
and I:~
I:~
1
log Xi, log Xi.
Xi Xi
I Xi
1
>0 >0
l(Xi X
1 (Xi 
(d)(I:~ lXi,I:~ lxnand(I:~ lXi,I:~
(e) (I:~
I Xi, 1
?)
xl)
and (I:~
1 Xi,
I:~
X)3).
5. Let e = (e lo e,) be a bivariate parameter. Suppose that T l (X) is sufficient for e, whenever 82 is fixed and known, whereas T2 (X) is sufficient for (h whenever 81 is fixed and known. Assume that eh ()2 vary independently, lh E 8 1 , 8 2 E 8 2 and that the set S = {x: pix, e) > O} does not depend on e. (a) Show that ifT, and T, do not depend one2 and e, respectively, then (Tl (X), T2 (X)) is sufficient for e. (b) Exhibit an example in which (T, (X), T2 (X)) is sufficient for T l (X) is sufficient for 8 1 whenever 8 2 is fixed and known, but Tz(X) is not sufficient for 82 , when el is fixed and known. 6. Let X take on the specified values VI, .•. 1 Vk with probabilities 8 1 , .•• ,8k, respectively. Suppose that Xl, ... ,Xn are independently and identically distributed as X. Suppose that IJ = (e" ... , e>l is unknown and may range over the set e = {(e" ... ,ek) : e, > 0, 1 < i < k, E~ 18i = I}, Let Nj be the number of Xi which equal Vj' (a) What is the distribution of (N" ... , N k )? (b) Sbow that N = (N" ... , N k _,) is sufficient for 7. Let Xl,'"
1
e,
e.
X n be a sample from a population with density p(x, 8) given by
pix, e)
o otherwise.
Here
e = (/1, <r) with 00 < /1 < 00, <r > 0.
(a) Show that min (Xl, ... 1 X n ) is sufficient for fl when a is fixed. (b) Find a onedimensional sufficient statistic for a when J1. is fixed. (c) Exhibit a twodimensional sufficient statistic for 8. 8. Let Xl,. " ,Xn be a sample from some continuous distribution Fwith density f, which is unknown. Treating f as a parameter, show that the order statistics X(l),"" X(n) (cf. Problem B.2.8) are sufficient for f.
,
"
I
!
86
Statistical Models, Goals, and Performance Criteria
Chapter 1
9. Let Xl, ... ,Xn be a sample from a population with density
j,(x)
a(O)h(x) if 0, < x < 0,
o othetwise
where h(x)
> 0,0= (0,,0,)
with
00
< 0, < 0, <
00,
and a(O)
=
[J:"
h(X)dXr'
is assumed to exist. Find a twodimensional sufficient statistic for this problem and apply your result to the U[()l, ()2] family of distributions. 10. Suppose Xl" .. , X n are U.d. with density I(x, 8) = ~elx61. Show that (X{I),"" X(n», the order statistics, are minimal sufficient. Hint: t,Lx(O) =  E~ ,sgn(Xi  0), 0 't {X"" . , X n }, which determines X(I),
. " , X(n)'
11. Let X 1 ,X2, ... ,Xn be a sample from the unifonn, U(O,B). distribution. Show that X(n) = max{ Xii 1 < i < n} is minimal sufficient for O.
12. Dynkin, Lehmann, Scheffe's Theorem. Let P = {Po : () E e} where Po is discrete concentrated on X = {x" x," .. }. Let p(x, 0) p.[X = xl Lx(O) > on Show that f:xx(~~) is minimial sufficient. Hint: Apply the factorization theorem.
=
=
°
x,
13. Suppose that X = (XlI" _, X n ) is a sample from a population with continuous distribution function F(x). If F(x) is N(j1., ,,'), T(X) = (X, ,,'). where,,2 = n l E(Xi 1')2, is sufficient, and S(X) ~ (XCI)"" ,Xin»' where XCi) = (X(i)  1')/'" is "irrelevant" (ancillary) for (IL, a 2 ). However, S(X) is exactly what is needed to estimate the "shape" of F(x) when F(x) is unknown. The shape of F is represented hy the equivalence class F = {F((·  a)/b) : b > 0, a E R}. Thus a distribution G has the same shape as F iff G E F. For instance, one "estimator" of this shape is the scaled empirical distribution function F,(x) jln, x(j) < x < x(i+1)' j = 1, . .. ,nl
~
0, x
< XCI)
> x(n)
1, x
~
Show that for fixed x, F,((x  x)/,,) converges in prohahility to F(x). Here we are using F to represent :F because every member of:F can be obtained from F.
I
I ,
'I i,
14. Kolmogorov's Theorem. We are given a regular model with e finite.
(a) Suppose that a statistic T(X) has the property that for any prior distribution on 9, the posterior distrihution of 9 depends on x only through T(x). Show that T(X) is sufficient.
(b) Conversely show that if T(X) is sufficient, then, for any prior distribution, the posterior distribution depends on x only through T(x).
Section 1.7
Problems and Complements
87
Hint: Apply the factorization theorem.
15. Let X h .··, X n be a sample from f(x  0), () E R. Show that the order statistics arc minimal sufficient when / is the density Cauchy Itt) ~ I/Jr(1 + t 2 ). 16. Let Xl,"" X rn ; Y1 ,· . " ~l be independently distributed according to N(p, (72) and N(TI, 7 2 ), respectively. Find minimal sufficient statistics for the following three cases:
(i) p, TI,
0", T
are arbitrary:
00
< p, TI < 00, a <
(J,
T.
(ii)
(J
=T
= TJ
and p, TI, (7 are arbitrary. and p,
0", T
(iii) p
are arbitrary.
17. In Example 1.5.4. express tl as a function of Lx(O, 1) and Lx(l, I). Problems to Sectinn 1.6
1. Prove the assertions of Table 1.6.1.
2. Suppose X I, ... , X n is as in Problem 1.5.3. In each of the cases (a), (b) and (c), show that the distribution of X fonns a oneparameter exponential family. Identify 'TI, B, T, and h. 3. Let X be the number of failures before the first success in a sequence of Bernoulli trials with probability nf success O. Then P, IX = k] = (I  0)'0, k ~ 0 0 1,2, ... This is called thc geometric distribution (9 (0». (a) Show that the family of geometric distributions is a oneparameter exponential family with T(x) ~ x. (b) Deduce from Theorem 1.6.1 that if X lo '" oXn is a sample from 9(0), then the distributions of L~ 1 Xi fonn a oneparameter exponential family. (c) Show that E~
1
Xi in part (b) has a negative binomial distribution with parameters
(noO)definedbyP,[L:71Xi = kJ =
(n+~I
)
(10)'on,k~0,1,2o'"
(The
negative binomial distribution is that of the number of failures before the nth success in a sequence of Bernoulli trials with probability of success 0.) Hint: By Theorem 1.6.1, P,[L:7 1 Xi = kJ = c.(1  o)'on. 0 < 0 < 1. If
=
' " CkW'
I
= c(;'',)::, 0 lw n
k=O
L..J
< W < I, then
4. Which of the following families of distributions are exponential families? (Prove or
disprove.) (a) The U(O, 0) fumily
88
(b) p(.", 0)
Statistical Models, Goals, and Performance Criteria
Chapter 1
(c)p(x,O)
= {exp[2Iog0+log(2x)]}I[XE = ~,xE {O.I +0, ... ,0.9+0j = 2(x +
0)/(1 + 20), 0
(0,0)1
(d) The N(O, 02 ) family, 0 > 0
(e)p(x,O)
< x < 1,0> 0
(f) p(x,9) is the conditional frequency function of a binomial, B(n,O), variable X, given that X > O.
5. Show that the following families of distributions are twoparameter exponential families and identify the functions 1], B, T, and h. (a) The beta family. (b) The gamma family. 6. Let X have the Dirichlet distribution, D( a), of Problem 1.2.15. Show the distribution of X form an rparameter exponential family and identify fJl B, T, and h.
7. Let X = ((XI, Y I ), ... , (X no Y n » be a sample from a bIvariate nonnal population.
Show that the distributions of X form a fiveparameter exponential family and identify 'TJ, B, T, and h.
8. Show that the family of distributions of Example 1.5.3 is not a one parameter eX(Xloential family. Hint: If it were. there would be a set A such that p(x, 0) > on A for all O.
°
9. Prove the analogue of Theorem 1.6.1 for discrete kparameter exponential families. 10. Suppose that f(x, B) is a positive density on the real line, which is continuous in x for each 0 and such that if (XI, X 2) is a sample of size 2 from f(·, 0), then XI + X2 is sufficient for B. Show that f(·, B) corresponds to a onearameter exponential family of distributions with T(x) = x. Hint: There exist functions g(t, 0), h(x" X2) such that log f(x" 0) + log f(X2, 0) = g(xI + X2, 0) + h(XI, X2). Fix 00 and let r(x, 0) = log f(x, 0)  log f(x, 00), q(x, 0) = g(x,O)  g(x,Oo). Then, q(xI + X2,0) = r(xI,O) +r(x2,0), and hence, [r(x" 0) r(O, 0)1 + [r(x2, 0)  r(O, 0») = r(xi + X2, 0)  r(O, 0). 11. Use Theorems 1.6.2 and 1.6.3 to obtain momentgenerating functions for the sufficient statistics when sampling from the following distributions. (a) normal, () ~ (ll,a 2 )
(b) gamma. r(p, >.), 0
= >., p fixed
(c) binomial (d) Poisson (e) negative binomial (see Problem 1.6.3)
(0 gamma. r(p, >'). ()
= (p, >.).
 
,

Section 1. 7
Problems and Complements
89
12. Show directly using the definition of the rank of an ex}X)nential family that the multinomialdistribution,M(n;OI, ... ,Ok),O < OJ < 1,1 <j < k,I:~oIOj = 1, is of rank k1. 13. Show that in Theorem 1.6.3, the condition that E has nonempty interior is equivalent to the condition that £ is not contained in any (k ~ I)dimensional hyperplane. 14. Construct an exponential family of rank k for which £ is not open and A is not defined on all of &. Show that if k = 1 and &0 oJ 0 and A, A are defined on all of &, then Theorem 1.6.3 continues to hold. 15. Let P = {P. : 0 E e} where p. is discrete and concentrated on X = {x" X2, ... }, and let p( x, 0) = p. IX = x I. Show that if P is a (discrete) canonical ex ponential family generated bi, (T, h) and &0 oJ 0, then T is minimal sufficient. Hint: ~;j'Lx('l) = Tj(X)  E'lTj(X). Use Problem 1.5.12.
16. Life testing. Let Xl,.'" X n be independently distributed with exponential density (20)l e x/2. for x > 0, and let the ordered X's be denoted by Y, < Y2 < '" < YnIt is assumed that Y1 becomes available first, then Yz, and so on, and that observation is continued until Yr has been observed. This might arise, for example, in life testing where each X measures the length of life of, say, an electron tube, and n tubes are being tested simultaneously. Another application is to the disintegration of radioactive material, where n is the number of atoms, and observation is continued until r aparticles have been emitted. Show that
(i) The joint distribution of Y1 , •.. , Yr is an exponential family with density
n! [ (20), (n _ r)! exp (ii) The distribution of II:: I Y;
(iii) Let
1
I::l Yi + (n 20
r)Yr]
' 0  Y,  ...  Yr·
<
<
<
+ (n 
r)Yrl/O is X2 with 2r degrees of freedom.
denote the time required until the first, second,... event occurs in a Poisson process with parameter 1/20' (see A.I6). Then Z, = YI/O', Z2 = (Y2 Yr)/O', Z3 = (Y3  Y 2)/0', ... are independently distributed as X2 with 2 degrees of freedom, and the joint density of Y1 , ••. , Yr is an exponential family with density
Yi, Yz , ...
The distribution of Yr/B' is again XZ with 2r degrees of freedom. (iv) The same model arises in the application to life testing if the number n of tubes is held constant by replacing each burnedout tube with a new one, and if Y1 denotes the time at which the first tube bums out, Y2 the time at which the second tube burns out, and so on, measured from some fixed time.
I ,
90
Statistical Models, Goals, and Performance Criteria Chapter 1
1)(Y; l~~l)/e (I = 1", .. ,') are independently distributed as X2 with 2 degrees of freedom, and [L~ 1 Yi + (n  7")Yr]/B = [(ii): The random variables Zi ~ (n  i
+
L::~l Z,.l
17. Suppose that (TkXl' h) generate a canonical exponential family P with parameter k 1Jkxl and E = R . Let
(a) Show that Q is the exponential family generated by IlL T and h exp{ cTT}. where IlL is the projection matrix of Tonto L = {'I : 'I = BO + c). (b) Show that ifP has full rank k and B is of rank I, then Q has full rank l. Hint: If B is of rank I, you may assume
18. Suppose Y1, ... 1 Y n are independent with Yi '" N(131 + {32Zi, (12), where Zl,'" , Zn are covariate values not all equaL (See Example 1.6.6.) Show that the family has rank 3.
Give the mean vector and the variance matrix of T.
19. Logistic Regression. We observe (Zll Y1 ), ... , (zn, Y n ) where the Y1 , .. _ , Y n are independent, Yi "' B(TIi, Ad The success probability Ai depends on the characteristics Zi of the ith subject, for example, on the covariate vector Zi = (age, height, blood pressure)T. The function I(u) ~ log[u/(l  u)] is called the logil function. In the logistic linear re(3 where (3 = ((31, ... ,/3d ) T and Zi is d x 1. gression model it is assumed that I (Ai) = Show that Y = (Y1 , ... , yn)T follow an exponential model with rank d iff Zl, ... , Zd are
zT
not collinear (linearly independent) (cf. Examples 1.1.4, 1.6.8 and Problem 1.1.9). 20. (a) In part IT of the proof of Theorem 1.6.4, fill in the details of the arguments that Q is generated by ('11 'Io)TT and that ~(ii) =~(i). (b) Fill in the details of part III of the proof of Theorem 1.6.4. 21. Find JJ.('I) ~ EryT(X) for the gamma,
qa, A), distribution, where e = (a, A).
I
22. Let X I, . _ . ,Xn be a sample from the k·parameter exponential family distribution (1.6.10). Let T = (L:~ 1 1 (Xi ), ... , L:~ 1Tk(X,») and let T
I
S
~
((ryl(O), ... ,ryk(O»): e E 8).
Show that if S contains a subset of k + 1 vectors Vo, .. _, Vk+l so that Vi  Vo, 1 < i are not collinear (linearly independent), then T is minimally sufficient for 8.
< k.
I .' jl,
"
23. Using (1.6.20). find a conjugate family of distributions for the gamma and beta families. (a) With one parameter fixed. (b) With both parameters free.
:
I
Section 1.7
Problems and Complements
91
24. Using (1.6.20), find a conjugate family of distributions for the normal family using as parameter 0 = (O!, O ) where O! = E,(X), 0, ~ l/(Var oX) (cf. Problem 1.2.12). 2 25. Consider the linear Gaussian regression model of Examples 1.5.5 and 1.6.6 except with (72 known. Find a conjugate family of prior distributions for (131,132) T. 26. Using (1.6.20), find a conjugate family of distributions for the multinomial distribution. See Problem 1.2.15. 27. Let P denote the canonical exponential family genrated by T and h. For any TJo E £, set ho(x) = q(x, '10) where q is given by (1.6.9). Show that P is also the canonical exponential family generated by T and h o.
28. Exponential/amities are maximum entropy distributions. The entropy h(f) of a random variable X with density f is defined by h(f)
~ E(logf(X)) =
l:IIOgf(X)I!(X)dx.
This quantity arises naturally in infonnation in theory; see Section 2.2.2 and Cover and Thomas (1991). Let S ~ {x: f(x) > OJ. (a) Show that the canonical kparameter exponential family density
f(x, 'I)
= exp
• ryjrj(x) 1/0 + I:
j:=1
A('I)
, XES
maximizes h(f) subject to the constraints
f(x)
> 0,
Is
f(x)dx
~ 1,
Is
f(x)rj(x)
~ aj,
1 < j < k,
where '17o, .•.• '17k are chosen so that f satisfies the constraints. Hint: You may usc Lagrange multipliers. Maximize the integrand. (b) Find the maximum entropy densities when rj(x) = x j and (i) S ~ (0,00), k = 1, at > 0; (ii) S = R, k = 2, at E R, a, > 0; (iii) S = R, k = 3, a) E R, a, > 0, a3 E R. 29. As in Example 1.6.11, suppose that Y 1, ...• Y n are Li.d. Np(f.L. E) where f.L varies freely in RP and E ranges freely over the class of all p x p symmetric positive definite matrices. Show that the distribution of Y = (Y ... , Yn ) is the p(p + 3)/2 canonical " exponential family generated by h = 1 and the p(p + 3)/2 statistics
Tj = Σ_{i=1}^n Yij, 1 ≤ j ≤ p;  Tjl = Σ_{i=1}^n Yij Yil, 1 ≤ j ≤ l ≤ p,

where Yi = (Yi1, …, Yip). Show that E is open and that this family is of rank p(p + 3)/2.

Hint: Without loss of generality, take n = 1. We want to show that h = 1 and the m = p(p + 3)/2 statistics Tj(Y) = Yj, 1 ≤ j ≤ p, and Tjl(Y) = Yj Yl, 1 ≤ j ≤ l ≤ p,
generate Np(μ, Σ). As Σ ranges over all p × p symmetric positive definite matrices, so does Σ^{−1}. Next establish that for symmetric matrices M,

∫ exp{−u^T M u} du < ∞ iff M is positive definite,

by using the spectral decomposition (see B.10.1.2)

M = Σ_{j=1}^p λj ej e_j^T for e1, …, ep orthogonal, λj ∈ R.
To show that the family has full rank m, use induction on p to show that if Z1, …, Zp are i.i.d. N(0, 1) and if B_{p×p} = (bjl) is symmetric, then

P(Σ_{j=1}^p aj Zj + Σ_{j,l} bjl Zj Zl = c) = P(a^T Z + Z^T B Z = c) = 0

unless a = 0, B = 0, c = 0. Next recall (Appendix B.6) that since Y ∼ Np(μ, Σ), then Y = μ + SZ for some nonsingular p × p matrix S.
30. Show that if X1, …, Xn are i.i.d. Np(θ, Σ0) given θ, where Σ0 is known, then the Np(λ, Γ) family is conjugate to Np(θ, Σ0), where λ varies freely in R^p and Γ ranges over all p × p symmetric positive definite matrices.
31. Conjugate Normal Mixture Distributions. A Hierarchical Bayesian Normal Model. Let {(μj, τj) : 1 ≤ j ≤ k} be a given collection of pairs with μj ∈ R, τj > 0. Let (μ, τ) be a random pair with λj = P((μ, τ) = (μj, τj)), 0 < λj < 1, Σ_{j=1}^k λj = 1, and let θ be a random variable whose conditional distribution given (μ, τ) = (μj, τj) is normal, N(μj, τj²). Consider the model X = θ + ε, where θ and ε are independent and ε ∼ N(0, σ0²), σ0² known. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj φ_{τj}(θ − μj)   (1.7.4)
where φτ denotes the N(0, τ²) density. Also note that (X | θ) has the N(θ, σ0²) distribution.

(a) Find the posterior

π(θ | x) = Σ_{j=1}^k P((μ, τ) = (μj, τj) | x) π(θ | (μj, τj), x)

and write it in the form

Σ_{j=1}^k λj(x) φ_{τj(x)}(θ − μj(x))
for appropriate λj(x), τj(x) and μj(x). This shows that (1.7.4) defines a conjugate prior for the N(θ, σ0²) distribution.

(b) Let Xi = θ + εi, 1 ≤ i ≤ n, where θ is as previously and ε1, …, εn are i.i.d. N(0, σ0²). Find the posterior π(θ | x1, …, xn), and show that it belongs to the class (1.7.4). Hint: Consider the sufficient statistic for p(x | θ).
32. A Hierarchical Binomial-Beta Model. Let {(rj, sj) : 1 ≤ j ≤ k} be a given collection of pairs with rj > 0, sj > 0, let (R, S) be a random pair with P(R = rj, S = sj) = λj, 0 < λj < 1, Σ_{j=1}^k λj = 1, and let θ be a random variable whose conditional density π(θ, r, s) given R = r, S = s is beta, β(r, s). Consider the model in which (X | θ) has the binomial, B(n, θ), distribution. Note that θ has the prior density

π(θ) = Σ_{j=1}^k λj π(θ, rj, sj).   (1.7.5)

Find the posterior

π(θ | x) = Σ_{j=1}^k P(R = rj, S = sj | x) π(θ | (rj, sj), x)

and show that it can be written in the form Σ λj(x) π(θ, rj(x), sj(x)) for appropriate λj(x), rj(x) and sj(x). This shows that (1.7.5) defines a class of conjugate priors for the B(n, θ) distribution.
33. Let p(x, η) be a one-parameter canonical exponential family generated by T(x) = x and h(x), x ∈ X ⊂ R, and let ψ(x) be a nonconstant, nondecreasing function. Show that E_η ψ(X) is strictly increasing in η.

Hint:

Cov_η(ψ(X), X) = ½ E{(X − X′)[ψ(X) − ψ(X′)]}

where X and X′ are independent, identically distributed as X (see A.11.12).
34. Let (X1, …, Xn) be a stationary Markov chain with two states 0 and 1. That is,

P[Xi = εi | X1 = ε1, …, X_{i−1} = ε_{i−1}] = P[Xi = εi | X_{i−1} = ε_{i−1}] = p_{ε_{i−1} εi}

where

( p00  p01 )
( p10  p11 )

is the matrix of transition probabilities. Suppose further that

(i) p00 = p11 = p, so that p10 = p01 = 1 − p;

(ii) P[X1 = 0] = P[X1 = 1] = ½.
(a) Show that if 0 < p < 1 is unknown this is a full rank, one-parameter exponential family with T = N00 + N11, where Nij is the number of transitions from i to j. For example, 01011 has N01 = 2, N11 = 1, N00 = 0, N10 = 1.

(b) Show that E(T) = (n − 1)p (by the method of indicators or otherwise).
35. A Conjugate Prior for the Two-Sample Problem. Suppose that X1, …, Xn and Y1, …, Yn are independent N(μ1, σ²) and N(μ2, σ²) samples, respectively. Consider the prior π for which, for some r > 0, k > 0, rσ^{−2} has a χ²_k distribution and, given σ², μ1 and μ2 are independent with N(ξ1, σ²/k1) and N(ξ2, σ²/k2) distributions, respectively, where ξj ∈ R, kj > 0, j = 1, 2. Show that π is a conjugate prior.
36. The inverse Gaussian density, IG(μ, λ), is

f(x, μ, λ) = [λ/2π]^{1/2} x^{−3/2} exp{−λ(x − μ)²/2μ²x}, x > 0, μ > 0, λ > 0.

(a) Show that this is an exponential family generated by T(X) = −½(X, X^{−1})^T and h(x) = (2π)^{−1/2} x^{−3/2}.

(b) Show that the canonical parameters η1, η2 are given by η1 = μ^{−2}λ, η2 = λ, and that A(η1, η2) = −[½ log(η2) + √(η1 η2)], E = [0, ∞) × (0, ∞).

(c) Find the moment-generating function of T and show that E(X) = μ, Var(X) = μ³λ^{−1}, E(X^{−1}) = μ^{−1} + λ^{−1}, Var(X^{−1}) = (λμ)^{−1} + 2λ^{−2}.

(d) Suppose μ = μ0 is known. Show that the gamma family, Γ(α, β), is a conjugate prior.

(e) Suppose that λ = λ0 is known. Show that the conjugate prior formula (1.6.20) produces a function that is not integrable with respect to μ. That is, Ω defined in (1.6.19) is empty.

(f) Suppose that μ and λ are both unknown. Show that (1.6.20) produces a function that is not integrable; that is, Ω defined in (1.6.19) is empty.

37. Let X1, …, Xn be i.i.d. as X ∼ Np(θ, Σ0) where Σ0 is known. Show that the conjugate prior generated by (1.6.20) is the Np(η0, τ0² I) family, where η0 varies freely in R^p, τ0² > 0 and I is the p × p identity matrix.
38. Let Xi = (Zi, Yi)^T be i.i.d. as X = (Z, Y)^T, 1 ≤ i ≤ n, where X has the density of Example 1.6.3. Write the density of X1, …, Xn as a canonical exponential family and identify T, h, A, and E. Find the expected value and variance of the sufficient statistic.
39. Suppose that Y1, …, Yn are independent, Yi ∼ N(μi, σ²), n ≥ 4.

(a) Write the distribution of Y1, …, Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Next suppose that μi depends on the value zi of some covariate and consider the submodel defined by the map η : (θ1, θ2, θ3)^T → (μ^T, σ²)^T where η is determined by

μi = exp{θ1 + θ2 zi}, z1 < z2 < ⋯ < zn; σ² = θ3
where θ1 ∈ R, θ2 ∈ R, θ3 > 0. This model is sometimes used when μi is restricted to be positive. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 3.
40. Suppose Y1, …, Yn are independent exponentially, E(λi), distributed survival times, n ≥ 3.

(a) Write the distribution of Y1, …, Yn in canonical exponential family form. Identify T, h, η, A, and E.

(b) Recall that μi = E(Yi) = λi^{−1}. Suppose μi depends on the value zi of a covariate. Because μi > 0, μi is sometimes modeled as

μi = exp{θ1 + θ2 zi}, i = 1, …, n

where not all the z's are equal. Show that p(y, θ) as given by (1.6.12) is a curved exponential family model with l = 2.
1.8
NOTES
Note for Section 1.1

(1) For the measure theoretically minded we can assume more generally that the Pθ are all dominated by a σ-finite measure μ and that p(x, θ) denotes dPθ/dμ, the Radon-Nikodym derivative.
Notes for Section 1.3

(1) More natural in the sense of measuring the Euclidean distance between the estimate θ̂ and the "truth" θ. Squared error gives much more weight to those θ̂ that are far away from θ than those close to θ.

(2) We define the lower boundary of a convex set simply to be the set of all boundary points r such that the set lies completely on or above any tangent to the set at r.
Note for Section 1.4

(1) Source: Hodges, Jr., J. L., D. Krech, and R. S. Crutchfield, StatLab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.
Notes for Section 1.6

(1) Exponential families arose much earlier in the work of Boltzmann in statistical mechanics as laws for the distribution of the states of systems of particles; see Feynman (1963), for instance. The connection is through the concept of entropy, which also plays a key role in information theory; see Cover and Thomas (1991).

(2) The restriction that x ∈ R^q and that these families be discrete or continuous is artificial. In general, if μ is a σ-finite measure on the sample space X, p(x, θ) as given by (1.6.1)
can be taken to be the density of X with respect to μ; see Lehmann (1997). This permits consideration of data such as images, positions, and spheres (e.g., the Earth), and so on.

Note for Section 1.7

(1) u^T M u > 0 for all p × 1 vectors u ≠ 0.

1.9 REFERENCES

BERGER, J. O., Statistical Decision Theory and Bayesian Analysis, 2nd ed. New York: Springer, 1985.

BERMAN, S. M., "A Stochastic Model for the Distribution of HIV Latency Time Based on T4 Counts," Biometrika, 77, 733-741 (1990).

BICKEL, P. J., "Using Residuals Robustly I: Tests for Heteroscedasticity, Nonlinearity," Ann. Statist., 6, 266-291 (1978).

BLACKWELL, D., AND M. A. GIRSHICK, Theory of Games and Statistical Decisions New York: Wiley, 1954.

BOX, G. E. P., "Sampling and Bayes Inference in Scientific Modelling and Robustness (with Discussion)," J. Royal Statist. Soc. A, 143, 383-430 (1980).

BROWN, L. D., Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, IMS Lecture Notes-Monograph Series, Hayward, CA, 1986.

CARROLL, R. J., AND D. RUPPERT, Transformation and Weighting in Regression New York: Chapman and Hall, 1988.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DE GROOT, M. H., Optimal Statistical Decisions New York: McGraw-Hill, 1970.

DOKSUM, K. A., AND A. SAMAROV, "Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression," Ann. Statist., 23, 1443-1473 (1995).

FERGUSON, T. S., Mathematical Statistics New York: Academic Press, 1967.

FEYNMAN, R. P., The Feynman Lectures on Physics, Ch. 40, "Statistical Mechanics of Physics," R. P. Feynman, R. B. Leighton, and M. Sands, Eds. Reading, MA: Addison-Wesley, 1963.

GRENANDER, U., AND M. ROSENBLATT, Statistical Analysis of Stationary Time Series New York: Wiley, 1957.

HODGES, J. L., JR., D. KRECH, AND R. S. CRUTCHFIELD, StatLab: An Empirical Introduction to Statistics New York: McGraw-Hill, 1975.

KENDALL, M. G., AND A. STUART, The Advanced Theory of Statistics, Vols. I, II, and III New York: Hafner Publishing Co., 1961, 1966.

LEHMANN, E. L., "A Theory of Some Multiple Decision Problems," Ann. Math. Statist., 28, 547-572 (1957).

LEHMANN, E. L., "Model Specification: The Views of Fisher and Neyman, and Later Developments," Statist. Science, 5, 160-168 (1990).

LEHMANN, E. L., Testing Statistical Hypotheses, 2nd ed. New York: Springer, 1997.

LINDLEY, D. V., Introduction to Probability and Statistics from a Bayesian Point of View, Part I: Probability; Part II: Inference London: Cambridge University Press, 1965.

MANDEL, J., The Statistical Analysis of Experimental Data New York: J. Wiley & Sons, 1964.

NORMAND, S-L., AND K. A. DOKSUM, "Empirical Bayes Procedures for a Change Point Problem with Application to HIV/AIDS Data," Empirical Bayes and Likelihood Inference, 67-79, Editors: S. Ahmed and N. Reid. Lecture Notes in Statistics, New York: Springer, 2000.

PEARSON, K., "On the General Theory of Skew Correlation and Non-linear Regression," Proc. Roy. Soc. London, 71, 303 (1905). (Draper's Research Memoirs, Biometrics Series II, Dulan & Co.)

RAIFFA, H., AND R. SCHLAIFFER, Applied Statistical Decision Theory, Boston: Division of Research, Graduate School of Business Administration, Harvard University, 1961.

SAVAGE, L. J., The Foundations of Statistics New York: J. Wiley & Sons, 1954.

SAVAGE, L. J., ET AL., The Foundation of Statistical Inference London: Methuen & Co., 1962.

SNEDECOR, G. W., AND W. G. COCHRAN, Statistical Methods, 8th ed. Ames, IA: Iowa State University Press, 1989.

WETHERILL, G. B., AND K. D. GLAZEBROOK, Sequential Methods in Statistics New York: Chapman and Hall, 1986.
Chapter 2

METHODS OF ESTIMATION

2.1 BASIC HEURISTICS OF ESTIMATION

2.1.1 Minimum Contrast Estimates; Estimating Equations

Our basic framework is as before, X ∈ X, X ∼ P ∈ P, usually parametrized as P = {Pθ : θ ∈ Θ}. In this parametric case, how do we select reasonable estimates for θ itself? That is, how do we find a function θ̂(X) of the vector observation X that in some sense "is close" to the unknown θ? The fundamental heuristic is typically the following. We consider a function that we shall call a contrast function,

ρ : X × Θ → R,

and define

D(θ0, θ) = E_{θ0} ρ(X, θ).

As a function of θ, D(θ0, θ) measures the (population) discrepancy between θ and the true value θ0 of the parameter. In order for ρ to be a contrast function we require that D(θ0, θ) is uniquely minimized for θ = θ0. That is, if P_{θ0} were true and we knew D(θ0, θ) as a function of θ, we could obtain θ0 as the minimizer. Of course, we don't know the truth, so this is inoperable, but in a very weak sense (unbiasedness), ρ(X, θ) is an estimate of D(θ0, θ). So it is natural to consider θ̂(X) minimizing ρ(X, θ). This is the most general form of the minimum contrast estimate we shall consider in the next section.

Now suppose Θ is Euclidean ⊂ R^d, the true θ0 is an interior point of Θ, and θ → D(θ0, θ) is smooth. Then we expect

∇_θ D(θ0, θ) |_{θ = θ0} = 0   (2.1.1)

where ∇ denotes the gradient. Arguing heuristically again, we are led to estimates θ̂ that solve

∇_θ ρ(X, θ̂) = 0.   (2.1.2)

The equations (2.1.2) define a special form of estimating equations. More generally, suppose we are given a function Ψ : X × R^d → R^d, Ψ = (ψ1, …, ψd)^T, and define

V(θ0, θ) = E_{θ0} Ψ(X, θ).   (2.1.3)

Suppose V(θ0, θ) = 0 has θ0 as its unique solution for all θ0 ∈ Θ. Then we say θ̂ solving

Ψ(X, θ̂) = 0   (2.1.4)

is an estimating equation estimate. Evidently, there is a substantial overlap between the two classes of estimates. Here is an example to be pursued later.

Example 2.1.1. Least Squares. Consider the parametric version of the regression model of Example 1.1.4 with μ(z) = g(β, z), β ∈ R^d, where the function g is known. Here the data are X = {(zi, Yi) : 1 ≤ i ≤ n} where Y1, …, Yn are independent. A natural(1) function ρ(X, β) to consider is the squared Euclidean distance between the vector Y of observed Yi and the vector expectation of Y, μ(z) = (g(β, z1), …, g(β, zn))^T. That is, we take

ρ(X, β) = |Y − μ|² = Σ_{i=1}^n [Yi − g(β, zi)]².   (2.1.5)

Strictly speaking, ρ is not fully defined here, and this is a point we shall explore later. But, for convenience, suppose we postulate that the εi of Example 1.1.4 are i.i.d. N(0, σ0²). Then β parametrizes the model and we can compute (see Problem 2.1.16)

D(β0, β) = E_{β0} ρ(X, β) = nσ0² + Σ_{i=1}^n [g(β0, zi) − g(β, zi)]²   (2.1.6)

which is indeed minimized at β = β0, and uniquely so if and only if the parametrization is identifiable. An estimate β̂ that minimizes ρ(X, β) exists if g(β, z) is continuous and

lim{|g(β, z)| : |β| → ∞} = ∞

(Problem 2.1.10). The estimate β̂ is called the least squares estimate. If, further, g(β, z) is differentiable in β, then β̂ satisfies the equation (2.1.2), or equivalently the system of estimating equations

Σ_{i=1}^n (∂g/∂βj)(β̂, zi) Yi = Σ_{i=1}^n (∂g/∂βj)(β̂, zi) g(β̂, zi), 1 ≤ j ≤ d.   (2.1.7)

In the important linear case,

g(β, z) = Σ_{j=1}^d zj βj and zi = (zi1, …, zid)^T,

the system becomes the normal equations

Σ_{i=1}^n zij Yi = Σ_{i=1}^n zij Σ_{k=1}^d zik β̂k, 1 ≤ j ≤ d.   (2.1.8)

These equations are commonly written in matrix form

Z_D^T Y = Z_D^T Z_D β̂   (2.1.9)

where Z_D = ‖zij‖_{n×d} is the design matrix. Least squares, thus, provides a first example of both minimum contrast and estimating equation methods. We return to the remark that this estimating method is well defined even if the εi are not i.i.d. N(0, σ0²). In fact, once defined we have a method of computing a statistic β̂ from the data X = {(zi, Yi), 1 ≤ i ≤ n}, which can be judged on its merits whatever the true P governing X is. This very important example is pursued further in Section 2.2 and Chapter 6. □

Here is another basic estimating equation example.

Example 2.1.2. Method of Moments (MOM). Suppose X1, …, Xn are i.i.d. as X ∼ Pθ, θ ∈ R^d, and θ is identifiable. Suppose that μ1(θ), …, μd(θ) are the first d moments of the population we are sampling from. Thus, we assume the existence of

μj = μj(θ) = E_θ(X^j), 1 ≤ j ≤ d.

Define the jth sample moment μ̂j by

μ̂j = (1/n) Σ_{i=1}^n X_i^j, 1 ≤ j ≤ d.

The method of moments prescribes that we estimate θ by the solution θ̂ of

μj(θ̂) = μ̂j, 1 ≤ j ≤ d,

if it exists. The motivation of this simplest estimating equation example is the law of large numbers: for X ∼ Pθ, μ̂j converges in probability to μj(θ). More generally, if we want to estimate an R^k-valued function q(θ) of θ, we obtain a MOM estimate of q(θ) by expressing q(θ) as a function of any of the first d moments μ1, …, μd of X, say q(θ) = h(μ1, …, μd), d ≥ k, and then using h(μ̂1, …, μ̂d) as the estimate of q(θ). To apply the method of moments to the problem of estimating θ, we need to be able to express θ as a continuous function g of the first d moments.
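Before turning to specific models, here is a minimal numerical sketch (ours, not from the text; it assumes NumPy and synthetic data) of how the normal equations (2.1.9) are solved in practice for the linear case.

```python
import numpy as np

# Synthetic illustration of (2.1.9): Y = Z_D beta + eps with d = 2.
rng = np.random.default_rng(0)
n = 50
Z = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # design matrix Z_D
beta_true = np.array([2.0, 0.5])
Y = Z @ beta_true + rng.normal(0, 1, n)

# Solve the normal equations Z_D^T Z_D beta = Z_D^T Y directly.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(beta_hat)  # close to (2.0, 0.5)
```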
For instance, consider a study in which the survival time X is modeled to have a gamma distribution, Γ(α, λ), with density

[λ^α / Γ(α)] x^{α−1} exp{−λx}, x > 0; α > 0, λ > 0.

Here θ = (α, λ), and

μ1 = E(X) = α/λ, μ2 = E(X²) = α(1 + α)/λ².

Solving for θ gives

α = μ1²/σ², λ = μ1/σ², where σ² = μ2 − μ1²,

so the method of moments estimates are

α̂ = (X̄/σ̂)², λ̂ = X̄/σ̂²,

where σ̂² = μ̂2 − X̄² = n^{−1} Σ X_i² − X̄². In this example, the method of moments estimator is not unique. We can, for instance, express θ as a function of μ1 and μ3 = E(X³) and obtain a method of moments estimator based on μ̂1 and μ̂3 (Problem 2.1.11). □

Algorithmic issues

We note that, in general, neither minimum contrast estimates nor estimating equation solutions can be obtained in closed form. There are many algorithms for optimization and root finding that can be employed. An algorithm for estimating equations frequently used when computation of M(X, ·) ≡ DΨ(X, ·)_{d×d} is quick and M is nonsingular with high probability is the Newton-Raphson algorithm. It is defined by initializing with θ̂0 and then setting

θ̂_{j+1} = θ̂_j − M^{−1}(X, θ̂_j) Ψ(X, θ̂_j).   (2.1.10)

This algorithm and others will be discussed more extensively in Section 2.4 and in Chapter 6, in particular Problem 6.6.10.

2.1.2 The Plug-In and Extension Principles

We can view the method of moments as an example of what we call the plug-in (or substitution) and extension principles, two other basic heuristics particularly applicable in the i.i.d. case. We introduce these principles in the context of multinomial trials and then abstract them and relate them to the method of moments.

Example 2.1.3. Frequency Plug-in(2) and Extension. Suppose we observe multinomial trials in which the values v1, …, vk of the population being sampled are known, but their respective probabilities p1, …, pk are completely unknown. If we let X1, …, Xn be i.i.d. as X and Ni = the number of indices j such that Xj = vi, then the natural estimate of pi = P[X = vi] suggested by the law of large numbers is Ni/n, the proportion of sample values equal to vi. As an illustration, consider a population of men whose occupations fall in one of five different job categories, 1, 2, 3, 4, or 5. Here k = 5, vi = i, pi is the proportion of men in the population in the ith job category, and Ni/n is the sample proportion in this category. Here is some job category data (Mosteller, 1968) for Danish men whose fathers were in category 3:

Job Category i | 1     2     3     4     5
Ni             | 23    84    289   217   95
p̂i             | 0.03  0.12  0.41  0.31  0.13

with Σ Ni = n = 708 and Σ p̂i = 1.

The frequency plug-in principle simply proposes to replace the unknown population frequencies p1, …, pk by the observable sample frequencies N1/n, …, Nk/n. Next consider the more general problem of estimating a continuous function q(p1, …, pk) of the population proportions. The principle suggests the estimate

q(N1/n, …, Nk/n).   (2.1.11)

For instance, suppose that in the previous job category table, categories 4 and 5 correspond to blue-collar jobs, whereas categories 2 and 3 correspond to white-collar jobs. We would be interested in estimating

q(p1, …, p5) = (p4 + p5) − (p2 + p3),

the difference in the proportions of blue-collar and white-collar workers. If we use (2.1.11), the estimate is

(N4/n + N5/n) − (N2/n + N3/n),

which in our case is 0.44 − 0.53 = −0.09. Equivalently, let P denote p = (p1, …, pk) and think of this model as P = {all probability distributions P on {v1, …, vk}}. Then q(p) can be identified with a parameter ν : P → R, here ν(P) = (p4 + p5) − (p2 + p3), and the frequency plug-in principle simply says to replace P in ν(P) by the multinomial empirical distribution P̂ = (N1/n, …, Nk/n) of X1, …, Xn.

Now suppose that the proportions p1, …, pk do not vary freely but are continuous functions of some d-dimensional parameter θ = (θ1, …, θd), and that we want to estimate a component of θ or, more generally, a function q(θ). Many of the models arising in the analysis of discrete data discussed in Chapter 6 are of this type.
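The blue-collar/white-collar computation above reduces to simple arithmetic; the following sketch (ours, assuming NumPy) reproduces the estimate −0.09 from the table.

```python
import numpy as np

N = np.array([23, 84, 289, 217, 95])   # category counts, n = 708
p_hat = N / N.sum()                    # frequency plug-in estimates N_i/n
# q(p) = (p4 + p5) - (p2 + p3): blue-collar minus white-collar proportion
q_hat = (p_hat[3] + p_hat[4]) - (p_hat[1] + p_hat[2])
print(round(q_hat, 2))                 # -0.09
```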
.). The plugin and extension principles can be abstractly stated as follows: Plugin principle. consider va(P) = [Fl (a) + Fu 1(a )1. .4. X n are Ll..E have an estimate P of PEP such that PEP and v : P 4 T is a parameter. If PI.Pk(IJ»). that is. T(Xl.. Now q(lJ) can be identified if IJ is identifiable by a parameter v .13) defines an extension of v from Po to P via v. Because f) = ~. .d. . . Then (2. .. by the law of large numbers. I) and ! F'(a) " = inf{x.Xn ) =h N. where a E (0. We shall consider in Chapters 3 (Example 3.4) and 5 how to choose among such estimates.14) As we saw in the HardyWeinberg case. . we can use the principle we have introduced and estimate by J N l In..Pk. Nk) .13) Given h we can apply the extension principle to estimate q( 8) as.Pk are continuous functions of 8.h(p) and v(P) = v(P) for P E Po.13) and estimate (2. (2. In particular. a natural estimate of P and v(P) is a plugin estimate of v(P) in this nonparametric context For instance. "·1 is. the representation (2. If w. . 1  V In general. then v(P) is the plugin estimate of v. Fu'(a) = sup{x. P > R where v(P) . suppose that we want to estimate a continuous Rlvalued function q of e. as X p. I: t=l (2. 0 write 0 = 1 .16) .. . that we can also Nsfn is also a plausible estimate of O. the empirical distribution P of X given by    f'V 1 n P[X E Al = I(X. we can usually express q(8) as a continuous function of PI.. ( .(IJ). .15) ~ ". F(x) < a}.104 Methods of Estimation Chapter 2 we want to estimate fJ. .d. Let be a submodel of P.1.14) are not unique. case if P is the space of all distributions of X and Xl. ... (2.. if X is real and F(x) = P(X < x) is the distribution function (dJ. Po ~ R given by v(PO) = q(O). . (2.. E A) n. thus.1. .1. Note.'" . ./P3 and. however.1. in the Li. We can think of the extension principle alternatively as follows. the frequency of one of the alleles. F(x) > a}.1.1. q(lJ) with h defined and continuous on = h(p.1.
1.13).. A natural estimate is the ath sample quantile .tj = E(Xj) in this nonparametric . P 'Ie Po.1.i. ~ I ~ ~ Extension principle.3 and 2.1 I:~ 1 Xl. However. For instance. Remark 2.) h(p) ~ L vip.14. context is the jth sample moment v(P) = xjdF(x) ~ 0.Section 2. If v: P ~ T is an extension of v in the sense that v(P) = viP) on Po.d. Pe as given by the HardyWeinberg p(O). is called the sample median. they are mainly applied in the i. The plugin and extension principles are used when Pe. in the multinomial examples 2. ~ ~ . these principles are general. The plugin and extension principles must be calibrated with the target parameter.d.1. As stated.12) and to more general method of moment estimates (Problem 2. . With this general statement we can see precisely how method of moment estimates can be obtained as extension and frequency plugin estimates for multinomial trials because I'j(8) where =L i=l k vfPi(8) = h(p(8» = viP. This reasoning extends to the general i. i=1 k /1j ~ = ~ LXI 1=1 I n . let Po be the class of distributions of X = B + E where B E R and the distribution of E ranges over the class of symmetric distributions with mean zero.1. and v are continuous.' Nl = h N () ~ ~ = v(P) and P is the empirical distribution. ~ For a second example. P E Po. Let viP) be the mean of X and let P be the class of distributions of X = 0 + < where B E R and the distribution of E ranges over the class of distributions with mean zero. = Lvi n i= 1 k . II to P.1 Basic Heuristics of Estimation 105 V 1. then Va. Here x ~ = median.. but only VI(P) = X is a sensible estimate of v(P).4.1. if X is real and P is the class of distributions with EIXlj < 00.17) where F is the empirical dJ.1. = v(P).2. e e Remark 2. Here x!. (P) is called the population (2.1. 1/. is a continuous map from = [0. v(P8 ) = q(8) = h(p(8)) is a continuous map from to Rand D( P) = h(p) is a continuous map from P to R. because when P is not symmetric. casebut see Problem 2.1. For instance. case (Problem 2. (P) is the ath population quantile Xa. then the plugin estimate of the jth moment v(P) = f. then v(P) is an extension (and plugin) estimate of viP).1. the sample median V2(P) does not converge in probability to Ep(X).i. Suppose Po is a submodel of P and P is an element of P but not necessarily Po and suppose v: Po ~ T is a p~eter. In this case both VI(P) = Ep(X) and V2(P) = "median of P" satisfy v(P) = v(P).
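The calibration point of Remark 2.1.1 is easy to see numerically. The following sketch (ours, assuming NumPy) uses an asymmetric distribution, for which the sample median does not track the mean.

```python
import numpy as np

# For an asymmetric P, the sample median is not a sensible plug-in
# estimate of nu(P) = E_P(X), while the sample mean is.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)  # E(X) = 1, median = log 2
print(x.mean())       # close to 1.0 = E_P(X)
print(np.median(x))   # close to 0.693, not E_P(X)
```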
Here are three further simple examples illustrating reasonable and unreasonable MOM estimates.

Example 2.1.5. Suppose X1, …, Xn are the indicators of a set of Bernoulli trials with probability of success θ. Because μ1(θ) = θ, the method of moments leads to the natural estimate of θ, X̄, the frequency of successes. To estimate the population variance θ(1 − θ) we are led by the first moment to the estimate X̄(1 − X̄). Because we are dealing with (unrestricted) Bernoulli trials, these are the frequency plug-in (substitution) estimates (see Problem 2.1.1). □

Example 2.1.6. Suppose X1, …, Xn is a N(μ, σ²) sample as in Example 1.1.2 with assumptions (1)-(4) holding. The method of moments estimates of μ and σ² are X̄ and σ̂² = n^{−1} Σ_{i=1}^n (Xi − X̄)² (Problem 2.1.8). □

Example 2.1.7. Estimating the Size of a Population (continued). In Example 1.5.3, where X1, …, Xn are i.i.d. U{1, 2, …, θ}, we find μ1 = E_θ(X1) = ½(θ + 1). Thus θ = 2μ1 − 1, and 2X̄ − 1 is a method of moments estimate of θ. This is clearly a foolish estimate if X(n) = max Xi > 2X̄ − 1, because in this model θ is always at least as large as X(n). □

As we have seen, there are often several method of moments estimates for the same q(θ). For example, if we are sampling from a Poisson population with parameter θ, then θ is both the population mean and the population variance. The method of moments can lead to either the sample mean or the sample variance. Moreover, because p0 = P(X = 0) = exp{−θ}, a frequency plug-in estimate of θ is −log p̂0, where p̂0 = n^{−1} Σ_{i=1}^n 1[Xi = 0]. We will make a selection among such procedures in Chapter 3.

Discussion. When we consider optimality principles, we may arrive at different types of estimates than those discussed in this section. For instance, as we shall see in Chapter 3, estimation of θ real with quadratic loss and Bayes priors leads to procedures that are data-weighted averages of θ values rather than minimizers of functions ρ(θ, X). Plug-in is not the optimal way to go for the Bayes, minimax, or uniformly minimum variance unbiased (UMVU) principles we discuss briefly in Chapter 3. However, a saving grace becomes apparent in Chapters 5 and 6. If the model fits, then for large amounts of data, optimality principle solutions agree to first order with the best minimum contrast and estimating equation solutions, and the plug-in principle is justified.

What are the good points of the method of moments and frequency plug-in?

(a) They generally lead to procedures that are easy to compute and are, therefore, valuable as preliminary estimates in algorithms that search for more efficient estimates. See Section 2.4.

(b) If the sample size is large, these estimates are likely to be close to the value estimated (consistency). This minimal property is discussed in Section 5.2.

It does turn out that there are "best" frequency plug-in estimates, those obtained by the method of maximum likelihood, a special type of minimum contrast and estimating equation method. Unfortunately, they are often difficult to compute. Algorithms for their computation will be introduced in Section 2.4.
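The Poisson illustration above is easy to check by simulation; the following sketch (ours, assuming NumPy) computes the three competing estimates of θ.

```python
import numpy as np

# Three estimates of theta for i.i.d. Poisson(theta) data, as above:
# the sample mean, the sample variance, and -log p0_hat.
rng = np.random.default_rng(2)
theta = 1.5
x = rng.poisson(theta, size=10_000)
print(x.mean())                    # method of moments via the mean
print(x.var())                     # method of moments via the variance
print(-np.log(np.mean(x == 0)))    # frequency plug-in via P(X = 0)
```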
Summary. We consider principles that suggest how we can use the outcome X of an experiment to estimate unknown parameters. For the model {Pθ : θ ∈ Θ}, a contrast ρ is a function from X × Θ to R such that the discrepancy D(θ0, θ) = E_{θ0} ρ(X, θ), θ ∈ Θ ⊂ R^d, is uniquely minimized at the true value θ = θ0 of the parameter. A minimum contrast estimator is a minimizer of ρ(X, θ), and the contrast estimating equations are ∇_θ ρ(X, θ) = 0. For data {(zi, Yi) : 1 ≤ i ≤ n} with Yi independent and E(Yi) = g(β, zi), 1 ≤ i ≤ n, where g is a known function and β ∈ R^d is a vector of unknown regression coefficients, a least squares estimate of β is a minimizer of ρ(X, β) = Σ [Yi − g(β, zi)]². For this contrast, when g(β, z) = z^T β, the associated estimating equations are called the normal equations and are given by Z_D^T Y = Z_D^T Z_D β, where Z_D = ‖zij‖_{n×d} is called the design matrix.

The plug-in estimate (PIE) for a vector parameter ν = ν(P) is obtained by setting ν̂ = ν(P̂), where P̂ is an estimate of P. When P̂ is the empirical probability distribution P̂E defined by P̂E(A) = n^{−1} Σ_{i=1}^n 1[Xi ∈ A], then ν̂ is called the empirical PIE. Let P0 and P be two statistical models for X with P0 ⊂ P. An extension ν̄ of ν from P0 to P is a parameter satisfying ν̄(P) = ν(P), P ∈ P0. If P̂ is an estimate of P with P̂ ∈ P, ν̄(P̂) is called the extension plug-in estimate of ν(P). In the multinomial case the frequency plug-in estimators are empirical PIEs based on ν(P) = (p1, …, pk), where pj is the probability of the jth category, 1 ≤ j ≤ k. Method of moments estimates are empirical PIEs based on ν(P) = (μ1, …, μd)^T, where μj = E(X^j), 1 ≤ j ≤ d. If P = Pθ, θ ∈ Θ, is parametric and a vector q(θ) is to be estimated, we find a parameter ν such that ν(Pθ) = q(θ) and call ν(P̂) a plug-in estimator of q(θ). The general principles are shown to be related to each other.

2.2 MINIMUM CONTRAST ESTIMATES AND ESTIMATING EQUATIONS

2.2.1 Least Squares and Weighted Least Squares

Least squares(1) was advanced early in the nineteenth century by Gauss and Legendre for estimation in problems of astronomical measurement. It is of great importance in many areas of statistics such as the analysis of variance and regression theory. In this section we shall introduce the approach and give a few examples, leaving detailed development to Chapter 6.
In Example 2.1.1 we considered the nonlinear (and linear) Gaussian model P0 given by

Yi = g(β, zi) + εi, 1 ≤ i ≤ n   (2.2.1)

where ε1, …, εn are i.i.d. N(0, σ0²) and β ranges over R^d or an open subset. The contrast

ρ(X, β) = Σ_{i=1}^n [Yi − g(β, zi)]²

led to the least squares estimates (LSEs) β̂ of β. Suppose that we enlarge P0 to P, where we retain the independence of the Yi but only require

μ(zi) = E(Yi) = g(β, zi), 1 ≤ i ≤ n.   (2.2.2)

Are least squares estimates still reasonable? For P in the semiparametric model P, which satisfies (but is not fully specified by) (2.2.2), we can still compute

D_P(β0, β) = E_P ρ(X, β) = Σ_{i=1}^n Var_P(εi) + Σ_{i=1}^n [g(β0, zi) − g(β, zi)]²   (2.2.3)

which is again minimized as a function of β by β = β0, and uniquely so if and only if the map β → (g(β, z1), …, g(β, zn))^T is 1-1. The estimates continue to be reasonable under the Gauss-Markov assumptions,

E(εi) = 0, 1 ≤ i ≤ n,   (2.2.4)

Var(εi) = σ² > 0, 1 ≤ i ≤ n,   (2.2.5)

Cov(εi, εj) = 0, 1 ≤ i < j ≤ n,   (2.2.6)

in which the joint distribution H of (ε1, …, εn) is any distribution satisfying (2.2.4)-(2.2.6). That is, the model is semiparametric with β, σ² and H unknown. The least squares method of estimation applies only to the parameters β1, …, βd and is often applied in situations in which specification of the model beyond (2.2.4)-(2.2.6) is difficult.

Sometimes z can be viewed as the realization of a population variable Z; that is, (Zi, Yi), 1 ≤ i ≤ n, is modeled as a sample from a joint distribution. This is frequently the case for studies in the social and biological sciences. For instance, Z could be educational level and Y income, or Z could be height and Y log weight. Then we can write the conditional model of Y given Zj = zj, 1 ≤ j ≤ n, as in (a) of Example 1.1.4, with εj simply defined as Yj − E(Yj | Zj = zj). That is,

(Z, Y) ∼ P ∈ P = {all joint distributions of (Z, Y) such that E(Y | Z = z) = g(β, z), β ∈ R^d}.

If we consider this model, β = β(P) is the minimizer of E(Y − g(β, Z))², so β has an interpretation as a parameter on P. This follows from Theorem 1.4.1.

In the important linear case g(β, z) = z^T β, the model

Yi = z_i^T β + εi, 1 ≤ i ≤ n,

which, in conjunction with (2.2.4)-(2.2.6), is called the linear (multiple) regression model. For the data {(zi, Yi), i = 1, …, n} we write this model in matrix form as

Y = Z_D β + ε

where Z_D = ‖zij‖_{n×d} is the design matrix. We continue our discussion for this important special case, for which explicit formulae and theory have been derived. The linear model is often the default model for a number of reasons:

(1) If the range of the z's is relatively small and μ(z) is smooth, we can approximate μ(z) by

μ(z) ≈ μ(z0) + Σ_{j=1}^d (∂μ/∂zj)(z0)(zj − z0j)

for z0 an interior point of the domain. We can then treat μ(z0) − Σ_{j=1}^d (∂μ/∂zj)(z0) z0j as an unknown β0 and identify (∂μ/∂zj)(z0) with βj to give an approximate (d + 1)-dimensional linear model with z0 = 1 and zj as before,

μ(z) = Σ_{j=0}^d βj zj.

This type of approximation is the basis for nonlinear regression analysis based on local polynomials; see Ruppert and Wand (1994), Fan and Gijbels (1996), Seber and Wild (1989), and Volume II.

(2) If, as we discussed earlier, we are in a situation in which it is plausible to assume that the (Zi, Yi) are a sample from a (d + 1)-dimensional distribution and the covariates that are the coordinates of Z are continuous, a further modeling step is often taken and it is assumed that (Z^T, Y)^T has a nondegenerate multivariate Gaussian distribution N_{d+1}(μ, Σ). In this case, as we have seen in Section 1.4, E(Y | Z = z) can be written as

μ(z) = β0 + Σ_{j=1}^d βj zj   (2.2.8)

where

(β1, …, βd)^T = Σ_{ZZ}^{−1} σ_{ZY}, β0 = μ_Y − Σ_{j=1}^d βj μ_{Zj},   (2.2.9)

and ε = Y − μ(Z) is independent of Z and has a N(0, σ²) distribution with σ² = σ_{YY} − σ_{ZY}^T Σ_{ZZ}^{−1} σ_{ZY}. Therefore, given Zi = zi, 1 ≤ i ≤ n, we have a Gaussian linear regression model for Yi.

Estimation of β in the linear regression model. We have already argued in Example 2.1.1 that, if the parametrization β → Z_D β is identifiable, the least squares estimate β̂ exists, is unique, and satisfies the normal equations (2.1.9). The parametrization is identifiable if and only if Z_D is of rank d, or equivalently if Z_D^T Z_D is of full rank d. In that case, necessarily, the solution of the normal equations can be given "explicitly" by

β̂ = (Z_D^T Z_D)^{−1} Z_D^T Y.   (2.2.10)

Here are some examples.

Example 2.2.1. In the measurement model in which Yi is the determination of a constant β1, d = 1, g(z, β1) = β1 and the normal equation is Σ_{i=1}^n (Yi − β1) = 0, whose solution is β̂1 = (1/n) Σ_{i=1}^n Yi = Ȳ. □

Example 2.2.2. We want to find out how increasing the amount z of a certain chemical or fertilizer in the soil increases the amount y of that chemical in the plants grown in that soil. For certain chemicals and plants, the relationship between z and y can be approximated well by a linear equation y = β1 + β2 z, provided z is restricted to a reasonably small interval. If we run several experiments with the same z using plants and soils that are as nearly identical as possible, we will find that the values of y will not be the same. For this reason, we assume that for a given z, Y is random with a distribution P(y | z). Following are the results of an experiment to which a regression model can be applied (Snedecor and Cochran, 1967, p. 139). Nine samples of soil were treated with different amounts z of phosphorus; Y is the amount of phosphorus found in corn plants grown for 38 days in the different samples of soil:

zi | 1   4   5   9   11  13  23  23  28
yi | 64  71  54  81  76  93  77  95  109

The points (zi, yi) and an estimate of the line β1 + β2 z are plotted in Figure 2.2.1.

[Figure 2.2.1. Scatter plot {(zi, yi) : i = 1, …, n} and sample regression line for the phosphorus data.]

We want to estimate β1 and β2. Here d = 2, zi1 = 1, zi2 = zi, and the normal equations are

Σ_{i=1}^n (yi − β1 − β2 zi) = 0, Σ_{i=1}^n zi(yi − β1 − β2 zi) = 0.

When the zi are not all equal, we get the solutions

β̂2 = Σ_{i=1}^n (zi − z̄)(yi − ȳ) / Σ_{i=1}^n (zi − z̄)²   (2.2.12)

β̂1 = ȳ − β̂2 z̄   (2.2.13)

where z̄ = (1/n) Σ_{i=1}^n zi and ȳ = (1/n) Σ_{i=1}^n yi. For the phosphorus data, β̂1 = 61.58 and β̂2 = 1.42. The line y = β̂1 + β̂2 z is known as the sample regression line or line of best fit of y1, …, yn on z1, …, zn. Geometrically, if we measure the distance between a point (zi, yi) and a line y = a + bz vertically by di = |yi − (a + bzi)|, then the regression line minimizes the sum of the squared distances to the n points (z1, y1), …, (zn, yn). The vertical distances ε̂i = [yi − (β̂1 + β̂2 zi)] are called the residuals of the fit; for instance, ε̂6 is the residual for (z6, y6). The regression line for the phosphorus data is given in Figure 2.2.1. The line y = β̂1 + β̂2 z is an estimate of the best linear MSPE predictor a1 + b1 z of Theorem 1.4.3. This connection to prediction explains the use of vertical distances in regression. □
2. (3 and for general d.2.Y.. ~ .(32E(Z') ~ .2. a__ Cov(Z"'.. .2.Y") ~(Zi.(3. when g(. .B.3. we find (Problem 2.1 leading to (2. using Theorem 1. Y*) V. ..16) for g((3.2..2. then its MSPE is ElY" ~ 1l1(Z")f ~ :L UdYi i=l n «(3] + (32 Zi)1 2 It follows that the problem of minimizing (2.27) that f3 satisfy the weighted least squares normal equations ~ ~ where W = diag(wI. .1.7). .~ UiYi .2.4H2. Var(Z") and  _ L:~l UiZiYi . i i=l = 1.Section 2. Y*) denote a pair of discrete random variables with possible values (z" Yl).2 Minimum Contrast Estimates and Estimating Equations 113 where Vi = l/Wi_ This problem may be solved by setting up analogues to the normal equations (2. as we make precise in Problem 2.2. When ZD has rank d and Wi > 0.2.2.n. If Itl(Z*) = {31 given by + {32Z* denotes a linear predictor of y* based on Z .1.18) ~ ~ ~I" (31 = E(Y') .B+€ can be transformed to one satisfying (2. Zi) = z.2. 0 Next consider finding the. By following the steps of Exarnple 2.13 minimizing the least squares contrast in this transformed model is given by (2. 1 < i < n..8). Yn) and probability distribution given by PI(Z". .B. That is...B that minimizes (2. z) = zT. Let (Z*.28) that the model Y = ZD..2. .2.6).4 as follows..20). we can write Remark 2. .1) and (2.. Thus.~ UiZi· n n i"~l I" n n F"l This computation suggests.4.1 7) is equivalent to finding the best linear MSPE predictor of y . . Moreover.19) and (2.n where Ui n = vi/Lvi.)] ~Ui. the .26. i ~ I.(Li~] Ui Z i)2 n 2 n (2. . We can also use the results on prediction in Section 1. Then it can be shown (Problem 2. we may allow for correlation between the errors {Ei}. (Zn. More generally. wn ) and ZD = IIZijllnxd is the design matrix.. suppose Var(€) = a 2W for some invertible matrix W nxn.2.(L:~1 UlYi)(L:~l uizd Li~] Ui Zi . that weighted least squares estimates are also plugin estimates.
.
.
.
.
x n } equal to 1. and as we have seen in Example 2. the MLE does not exist if 112 + 2n3 = O. Consider a popUlation with three kinds of individuals labeled 1. (2. Evidently. Here are two simple examples with (j real.28) If 2n. respectively. In general.2..1. the likelihood is {1 . p(X. p(3. Then the same calculation shows that if 2nl + nz and n2 + 2n3 are both positive.27) doesn't make sense. and n3 denote the number of {Xl. .2. there may be solutions of (2. which is maximized by 0 = 0.0)' < ° for all B E (0.lx(0) = 5 1 0' . If we observe a sample of three individuals and obtain Xl = 1.. n2.29) '(j . ]n practice..27) is very important and we shall explore it extensively in the natural and favorable setting of multiparameter exponential families in the next section. the dual point of view of (2.6. then X has a Poisson distribution with parameter nA. and 3 and occurring in the HardyWeinberg proportions p(I. situations with f) well defined but (2. then I • Lx(O) ~ p(l.O) = 0'. X2 = 2.27) that are not maxima or only local maxima. the rate of arrival. 0)p(2.2.2. 2 and 3. is zero. O)p(l.22) and (2. Example 2. 0) ~ 20(1 . .118 Methods of Estimation Chapter 2 which again enables us to analyze the behavior of B using known properties of sums of independent random variables. OJ = (1  0)' where 0 < () < 1 (see Example 2.4). 0) = 20'(1. Nevertheless. . 8' 80. + n. which has the unique solution B = ~. let nJ. represents the expected number of arrivals in an hour or. Let X denote the number of customers arriving at a service counter during n hours. equivalently.(1. ~ maximizes Lx(B). x. If we make the usual simplifying assumption that the arrivals form a Poisson process. A is an unknown positive constant and we wish to estimate A using X. 1. 1).0) The likelihood equation is 8 80 Ix (0) ~ = 5 1 0 10 =0.. X3 = 1.>.2.2. 2n (2.. } with probabilities.OJ'".2.7.. so the MLE does not exist because = (0.5. p(2. I). where ). the maximum likelihood estimate exists and is given by 8(x) = 2n. ..ny l.) = ·j1 e'"p. Similarly. Here X takes on values {O. + n.2.0). 2.2. 0 e Example 2. .2.l.x=O. Because . .
fJ) = TI7~1 B7'. A sufficient condition..2.6. = j) be the probability J of the jth category. the MLE must have all OJ > 0. 8) = 0 if any of the 8j are zero. Then p(x. consider an experiment with n Li.30) for all This is the condition we applied in Example 2.kl. 31=1 By (2.32) to find I.2.30)). this estimate is the MLE of ). ~ ~ . trials in which each trial can produce a result in one of k categories. and let N j = L:~ 1 l[Xi = j] be the number of observations in the jth category..\ I 0.2. and must satisfy the likelihood equa~ons ~ 8B1x(fJ) = 8B 3 8 8 " n. e. .2.o.31 ) To obtain the MLE (J we consider l as a function of 8 1 . familiar from calculus. We assume that n > k ~ 1. 8Bk/8Bj (2. and k j=1 k Ix(fJ) = LnjlogB j . 0 To apply the likelihood equation successfully we need to know when a solution is an MLE. p(x.2 Minimum Contrast Estimates and Estimating Equations 119 The likelihood equation is which has the unique solution). . (see (2. 8B.7.Section 2.logB. A similar condition applies for vector parameters..8. Let Xi = j if the ith trial produces a result in the jth category. = L.8B " 1=13 k k =0.2.. L. Then. 8' (2.d. j=d (2. for an experiment in which we observe nj ~ 2:7 1 I[X. . this is well known to be equivalent to ~ e..2.2.32).lx(B) <0.. . k. However. If x = 0 the MLE does not exist. ~ n . .2. As in Example 1. J = I.1. = jl.32) We first consider the case with all the nj positive. fJE6= {fJ:Bj >O'LOj ~ I}. 80. and the equation becomes (BkIB j ) n· OJ = ...n. thus. is that l be concave in If l is twice differentiable.ekl with kl Ok ~ 1. = x/no If x is positive.. the maximum is approached as . .6.. let B = P(X. j = 1. Example 2.LBj j=1 • (2.. Multinomial Trials.
6. as maximum likelihood estimates for f3 when Y is distributed as lV.1 holds and X = (Y" . The 0 < OJ < 1.202 L. ..).X)2 (Problem 2. n. see Problem 2. zi)1 .1 we consider least squares estimators (LSEs) obtained by min2 imizing a contrast of the fonn L:~ 1 IV.28. Then Ix «(3) log IT ~'P (V. ((g((3.lV. z. It is easy to see that weighted least squares estimates are themselves maximum likelihood estimates of f3 for the model Yi independent N(g({3.2.. i=l g«(3.120 ~ Methods of Estimation Chapter 2 To show thaI this () maximizes lx(8). Thus. least squares estimates are maximum likelihood for the particular model Po. 1 Yn . j = 1. N(lll 0. where E(V. As we have IIV.30.2. is still the unique MLE of fJ.1. Next suppose that nj = 0 for some j. k. See Problem 2. are known functions and {3 is a parameter to be estimated from the independent observations YI . 0. . and 0. version of this example will be considered in the exponential family case in Section 2. (2.2 both unknown..(Xi .. ~ g((3. Zi). More generally.2.33) It follows that in this nj > 0. This approach is applied to experiments in which for the ith case in a study the mean of the response Yi depends on • .1 L:~ . i = 1.' .2 ) with J1. we find that for n > 2 the unique MLEs of Il and Ii = X and iT2 ~ n. Suppose the model Po of Example 2. . wi(5).g«(3. OJ > ~ 0 case. ~ g«(3.d. Zi)) 00 ao n log 2 I"" 2 "2 (21Too) . gi. Suppose that Xl. .2. zn) 03W.Zi)]2. In Section 2. Example 2. we check the concavity of lx(O): let 1 1 <j < k .2. Summary. g«(3. we can consider minimizing L:i... Zi)J[l:} ....3. . o 1=1 Evidently maximizing Ix«(3) is equivalent to minimizing L:~ n (2.9.34) g«(3. these estimates viewed as an algorithm applied to the set of data X make sense much more generally. . then <r <k I. . 1 < i < n.gi«(3) 1 . YnJT. X n are Li... Ix(O) is strictly concave and (} is the unique ~ ~ maximizer of lx(O).. where Var(Yi) does not depend on i..1.1).. Then (} with OJ = njjn. 1 < j < k.11(a)). 'ffi f. Zj)]Wij where W = Ilwijllnxn is a symmetric positive definite matrix. Using the concavity argument.  seen and shall see more in Section 6.IV. .2 are Maximum likelihood and least squares We conclude with the link between least squares and maximum likelihood.2.) = gi«(3).. .2.
1 ).2.1 and 1. .6..Section 2.4 and Corollaries 1.3 Maximum Likelihood in Multiparameter Exponential Families 121 a set of available covariate values Zil.6. Existence and unicity of the MLE in exponential families depend on the strict concavity of the log likelihood and the condition of Lemma 2. oo]P. in the N(O" O ) case. Suppose we are given a function 1: continuous.a < b< oo} U {(a.1) lim{l(li) : Ii ~ ~ De} = 00..b) . We start with a useful general framework and lemma. 8). Extensions to weighted least squares.5. . This is largely a consequence of the strict concavity of the log likelihood in the natural parameter TI.3.6. it is shown that the MLEs coincide with the LSEs. In particular we consider the case with 9i({3) = Zij{3j and give the LSE of (3 in the case in which I!ZijllnXd is of rank d.3.2 and other exponential family properties also playa role. In the case of independent response variables Yi that are modeled to have a N(9i({3). These estimates are shown to be equivalent to minimum contrast estimates based on a contrast function related to Shannon entropy and KullbackLeibler information divergence. (m. a8 is the set of points outside of 8 that can be obtained as limits of points in e.I ) all tend to De as m ~ 00. if X N(B I . Let &e = be the boundary of where denotes the closure of in [00. See Problem 2. e e I"V e De= {(a. z=1=1 ~ 2.1. m). Proof.1 only. though the results of Theorems 1. or 8 1nk diverges with 181nk I .3 MAXIMUM LIKELIHOOD IN MUlTIPARAMETER EXPONENTIAL FAMILIES Questions of existence and uniqueness of maximum likelihood estimates in canonical exponential families can be answered completely and elegantly. (}"2) distribution.1. For instance. Suppose also that e t R where e c RP is open and 1 is (2.2 we consider maximum likelihood estimators (MLEs) 0 that are defined as maximizers of the likelihood Lx (B) = p(x. 2 e Lemma 2. m. b). (m. In Section 2.3. .3..6.3. as k . m. (h). For instance. Suppose 8 c RP is an open set. for a sequence {8 m } of points from open. In general. (a. Properties that derive solely fTOm concavity are given in Propositon 2. including all points with ±oo as a coordinate.ae as m t 00 to mean that for any subsequence {B1nI<} either 8 1nk t t with t ¢ e. Concavity also plays a crucial role in the analysis of algorithms in the next section. are given. where II denotes the Euclidean norm.a= ±oo. b).oo}}. which are appropriate when Var(Yj) depends on i or the Y's are correlated. That is.zid. Formally.00. we define 8 1n . (m.3 and 1. = RxR+ and ee e. (a. b) : aER. Then there exists 8 E e such that 1(8) = max{l(li) : Ii E e}.00.bE {a.
Proofof Theorem 2. have a necessary and sufficient condition for existence and uniqueness of the MLE given the data. then lx(11 m ) . From (B.9) we know that II lx(lI) is continuous on e.3. Suppose X ~ (PII . Let x be the observed data vector and set to = T(x).  (hO! + 0. h) and that (i) The natural parameter space. We show that if {11 m } has no subsequence converging to a point in E. a contradiction. and 0. and is a solution to the equation (2.(11) ~ 00 as densities p(x. II) is strictly concave and 1. Suppose the cOlulitions o/Theorem 2. lI(x) =  exists. = . no solution.1.3) has i 1 !. then the MLE8(x) exists and is unique..3.1 hold.6..3. then Ix lx(O.1. is open. Suppose P is the canonical exponential/amity generated by (T.12.00.'1) with T(x) = 0.)) > ~ (fx(Od +lx (0. '10) for some reference '10 E [ (see Problem 1. .3.3. open C RP. [ffunher {x (II) e = e e t 88. thus.'1 . I I. Write 11 m = Am U m . Existence and Uniqueness o/the MLE ij.). .. .2). Then. (ii) The family is of rank k. which implies existence of 17 by Lemma 2. if to doesn't satisfy (2. /fCT is the convex suppon ofthe distribution ofT (X). II). ' L n . We give the proof for the continuous case. Applications of this theorem are given in Problems 2.3.1.2) then the MLE Tj exists.)) 0 Themem 2.3. If 0. (a) If to E R k satisfies!!) Ifc ! 'I" 0 (2. is unique. if lx('1) logp(x.3. " We. By Lemma 2.1. Furthennore. II E j. we may also assume that to = T(x) = 0 because P is the same as the exponential family generated by T(x) . are distinct maximizers.1.3. U m = Jl~::rr' Am = . " . with corresponding logp(x. ! " .122 Methods of Estimation Chapter 2 Proposition 2. i . Corollary 2.1. then the MLE doesn't exist and (2. We can now prove the following.3. Without loss of generality we can suppose h(x) = pix.8 and 2. ~ Proof. then 1] exists and is unique ijfto E ~ where C!i: is the interior of CT.3.3) I I' (b) Conversely.3. Define the convex suppon of a probability P to be the smallest convex set C such that P(C) = 1. E.to.27).3.
6. (1"2 > O.3. which is impossible.3. u mk t u.3. 0 ° '* '* PrOD/a/Corollary 2. for every d i= 0. um~.3. Case 1: Amk t 111]=11. 0 (L:7 L:7 CJ. this is the exponential family generated by T(X) = 1 Xi. It is unique and satisfies x (2.* P'IlcTT = 0] = I.Xn are i.4.3) by TheoTem 1. Nonexistence: if (2.5.6.se density is a general phenomenon. thus. Evidently. This is equivalent to the fact that if n = 1 the formal solution to the likelihood equations gives 0'2 = 0.d. T(X) has a density and.Section 2. N(p" ( 2 ).1 follow. there exists c # such that Po[cTT < 0] = 1 E1](c T T(X)) < 0.3. Theorem 2. iff.i..3. 0 Example 2. CT = R )( R+ FOT n > 2. Po[uTT(X) > 61 > O. So.1. As we observed in Example 1. Por n = 1. By (B. .3. both {t : dTt > dTto} n CO and {t : dTt < dTto} n CO are nonempty open sets. I Xl) and 1.3 Maximum Likelihood in Multiparameter Exponential Families ~ 123 1. Write Eo for E110 and Po for P11o . fOT all 1].2) fails.2) and COTollary 2. t u. Then because for some 6 > 0.J = 00.3. If ij exists then E'I T ~ 0 E'I(cTT) ~ 0. = 0 because T(XI) is always a point on the parabola T2 = T'f and the MLE does not exist. The Gaussian Model. Suppose the conditions ofTheorem 2.2. So we have Case 2: Amk t A. Suppose X"". In fact. .1.1 hold and T k x 1 has a continuous case density on R k • Then the MLE Tj exists with probabiliry 1 and necessarily satisfies (2. I' E R. existence of MLEs when T has a continuous ca. CT = C~ and the MLE always exists. The equivalence of (2.k Lx( 11m.3.9. So In either case limm. that is. lIu m ll 00. contradicting the assumption that the family is of rank: k. if {1]m} has no subsequence converging in £ it must have a subsequence { 11m k} that obeys either case 1 or 2 as follows. Then AU ¢ £ by assumption.1) a point to belongs to the interior C of a convex set C iff there exist points in CO on either side of it. Because any subsequence of {11m} has no subsequence converging in E we conclude L ( 11m) t 00 and Tj exists. Then.3).
ln. The likelihood equations are equivalent to (problem 2. Example 2. >. where 0 < Aj PIX = j] < I. To see this note that Tj > 0. Thus._t)T. How to find such nonexplicit solutions is discussed in Section 2. In Example 3.. I < j < k iff 0 < Tj < n.. I < j < k.3.. lit ~ .. I < j < k.1. I < j < k.4 and 2.1.2 that (2. the best estimate of 8.I.5) have a unique solution with probability 1.3.3.6. with i'.1.n. x > 0. with density 9p.1.6.5) ~=X A where log X ~ L:~ t1ogXi.Xn are i. The TwoParameter Gamma Family. They are determined by Aj = Tj In.>.3). = = L L i i bJ _ I .3.1).I which generates the family is T(k_l) = (Tl>"" T.(X) = r~. where Tj(X) L~ I I(Xi = j).d. the MLE 7j in exponential families has an interpretation as a generalized method of moments estimate (see Problem 2..2(a).. From Theorem 1.3 we know that E. We follow the notation of Example 1.1 that in this caseMLEs of"'j = 10g(AjIAk).1 in the second.3.2(b» r' log . For instance.2 follows.6. Suppose Xl.2.3. in a certain sense. When method of moments and frequency substitution estimates are not unique. We conclude from Theorem 2. .3.13 and the next example).4) and (2. using (2.T(X) = A(.)e>"XxPI.4.4 we will see that 83 is. by Problem 2. A nontrivial application of Theorem 2.4. and one of the two sums is nonempty because c oF 0.124 Methods of Estimation Chapter 2 Proof. thus. Because the resulting value of t is possible if 0 < tjO < n. It is easy to see that ifn > 2. but only Os is a MLE.3.4) (2. > O. h(x) = XI. I <j < k. P > 0.3. the maximum likelihood principle in many cases selects the «best" estimate among them..9).3.7. we see that (2. The boundary of a convex set necessarily has volume 0 (Problem 2.= r(jJ) A log X (2. then and the result follows from Corollary 2. LXi)..2) holds.3. exist iff all Tj > O.I and verify using Theorem 2. Thus. Here is an example.3. if we write c T to = {CjtjO : Cj > O} + {cjtjO : Cj < O} we can increase c T to by replacing a tjO by tjo + I in the first sum or a tjO by tjO .3.1. We assume n > k .. liz = I  vn3/n and li3 = (2nl + nz)/2n are frequency substitution estimates (Problem 2. in the HardyWeinberg examples 2. 0 = If T is discrete MLEs need not exist.. I <t< k .3.i.3.). if T has a continuous case density PT(t). o Remark 2. Multinomial Trials. ..3. This is a rank 2 canonical exponential family generated by T   = (L log Xi. Example 2.3. Thasadensity.2. The statistic of rank k .
.Section 2. Remark 2.. Let CO denote the interior of the range of (c. .3. Unfortunately strict concavity of Ix is not inherited by curved exponential families. when we put the multinomial in canonical exponential family form. .2. .I. In Example 2. x E X.3 Maximum likelihood in Multiparameter Exponential Families 125 On the other hand.1). the bivariate normal case (Problem 2.2) by taking Cj = 1(i = j).k.3.3.3.13). The remaining case T k = a gives a contradiction if c = (1.10).B(B) j=l . note that in the HardyWeinberg Example 2. Similarly.8see Problem 2..3. n > kl. Ck (B)) T and let x be the observed data.1 directly D (Problem 2. Then Theorem 2.1. In some applications. L. If the equations ~ have a solution B(x) E Co.3) does not have a closedform solution as in Example 1. and unicity can be losttake c not onetoone for instance. then it is the unique MLE ofB. for example. the MLEs ofA"j = 1. our parameter set is open.1 we can obtain a contradiction to (2. (2. = 0. the following corollary to Theorem 2. BEe. Here [ is the natural paramerer space of the exponential fantily P generated by (T. 1)T.3 can be applied to determine existence in cases for which (2. whereas if B = [0.3..6. be a cUlVed exponential family p(x.3.7) Note that c(lI) E c(8) and is in general not ij.3.2. lfP above satisfies the condition of Theorem 2.3. 0 < j < k . Let Q = {PII : II E e).1 is useful.A(c(II))}h(x) Suppose c : 8 ~ [ C R k has a differential (2. The following result can be useful. 1 < i < k .2) so that the MLE ij in P exists. Alternatively we can appeal to Corollary 2.1 and Haberman (1974).1. h). if 201 + 0..3.3.3.3. mxk e. if any TJ = 0 or n.8 we saw that in the multinomial case with the clQsed parameter set {Aj : Aj > 0.. Corollary 2.1.6.3.3. When P is not an exponential family both existence and unicity of MLEs become more problematic.. . c(8) is closed in [ and T(x) = to satisfies (2.~l Aj = I}.1. However. (II) . then so does the MLE II in Q and it satisfies the likelihood equation ~ cT(ii) (to .lI) ~ exp{cT (II)T(x) .1] it does exist ~nd is unique.2.3.B) ~ h(x)exp LCj(B)Tj(x) . . m < k . 0 The argument of Example 2.exist and are unique. (B). .A(c(ii)) ~ = O. the MLE does not exist if 8 ~ (0.6) on c( II) = ~~. Consider the exponential family k p(x. e open C R=.
As a consequence of Theorems 2.. . with I. Now p(y. m m m f?. 11.3.?' m)T " ".3. Equation (2. m} is a 2nparameter canonical exponential family with 'TJi = /tila. Because It > 0./2'1' . Ii± ~ ~[).3)T.Zn are given constants. . i Example 2.. .~Yn.oJ).. Jl > 0. 1 n. are n independent random samples.< O. N'(fl. = L: xl. . E R.10.9.5 = ..2 . = ~ryf.2 and 2..'' . we can conclude that an MLE Ii always exists and satisfies (2.I LX. '() A 1/ Thus.4.6) with " • .\()'. We find CI Example 2..1/'1') T . which is closed in E = {( ryt> ry. ... . . As in Example 1. Zl < .nit.n(it' + )"5it'))T which with Ji2 = n. < O}.6. . .7) becomes = 0. . = ( ~Yll'''' .'. Gaussian with Fixed Signal to Noise.6. that . Next suppose.gx±)..6.5. C(O) and from Example 1.g( it' .0'2) with Jl/ a = AO > a known.3.5X' +41i2]' < 0.2~~2 . ry.5 and 1. which implies i"i+ > 0.) : ry.10.3. as in Example 1.7) if n > 2.2.n.3.d. t. I .. LocationScale Regression.) : ry. . .it..Xn are i. Evidently c(8) = {(lh. i = 1.". ).126 The proof is sketched in Problem 2. = 1 " = 2 n( ryd'1' .: Note that 11+11_ = '6M2 solution we seek is ii+.ryl > O.... suppose Xl. L: Xi and I.6... Suppose that }jI •. we see that the distribution of {01 : j = 1. n.ry. l}m. '1. corresponding to 1]1 = //x' 1]2 = .it. generated by h(Y) = 1 and "'> I T(Y) '. < O}.. 9) is a curved exponential family of the form (2. < Zn where Zt.3 )(t.3. tLi = (}1 + 82 z i .. .~Y'1.11. simplifies to Jl2 + A6XIl. l = 1.3.. . af = f)3(8 1 + 82 z i )2. j = 1.!0b''... Using Examples 1.. 'TJn+i = 1/2a. This is a curved exponential ¥. Methods of Estimation Chapter 2 family with (It) = C2 (It) = .3. . where 0l N(tLj 1 0'1).A6iL2 = 0 Ii.. the o q.i.3.ry.6.
Summary. In this section we derive necessary and sufficient conditions for existence of MLEs in canonical exponential families of full rank with ℰ open (Theorem 2.3.1 and Corollary 2.3.1). These results lead to a necessary condition for existence of the MLE in curved exponential families, but without a guarantee of unicity or sufficiency. Finally, the basic property making Theorem 2.3.1 work, strict concavity, is isolated and shown to apply to a broader class of models.

2.4 ALGORITHMIC ISSUES

As we have seen, even in the context of canonical multiparameter exponential families, MLEs may not be given explicitly by formulae but only implicitly as the solutions of systems of nonlinear equations. In fact, even in the classical regression model with design matrix Z_D of full rank d, the formula (2.1.10) for β̂ is easy to write down symbolically but not easy to evaluate if d is at all large: forming Z_D^T Z_D requires on the order of nd² operations (n operations for each of its d(d+1)/2 distinct terms) and then, if implemented as usual, on the order of d³ operations to invert. The packages that produce least squares estimates do not in fact use formula (2.1.10).

It is not our goal in this book to enter seriously into questions that are the subject of textbooks in numerical analysis. However, in this section we will discuss three algorithms of a type used in different statistical contexts, both for their own sakes and to illustrate what kinds of things can be established about the black boxes to which we all, at various times, entrust ourselves. We begin with the bisection and coordinate ascent methods, which give a complete though slow solution to finding MLEs in the canonical exponential families covered by Theorem 2.3.1.

2.4.1 The Method of Bisection

The bisection method is the essential ingredient in the coordinate ascent algorithm that yields MLEs in k-parameter exponential families. Given f continuous on (a, b), f strictly increasing, with f(a+) < 0 < f(b-), then, by the intermediate value theorem, there exists a unique x* ∈ (a, b) such that f(x*) = 0. Here, in pseudocode, is the bisection algorithm to find x*.

Given tolerance ε > 0 for |x_final - x*|: Find x_0 < x_1 with f(x_0) < 0 < f(x_1) by taking |x_0|, |x_1| large enough. Initialize x_old^- = x_0, x_old^+ = x_1.
(1) If |x_old^+ - x_old^-| < 2ε, set x_final = ½(x_old^+ + x_old^-) and return x_final.
(2) Else, x_new = ½(x_old^+ + x_old^-).
(3) If f(x_new) = 0, x_final = x_new.
(4) If f(x_new) < 0, x_old^- = x_new. Go to (1).
(5) If f(x_new) > 0, x_old^+ = x_new. Go to (1).
End

Lemma 2.4.1. The bisection algorithm stops at a solution x_final such that |x_final - x*| < ε.

Proof. If x_m is the mth iterate of x_new, then

(1) |x_{m+1} - x_m| ≤ 2^{-(m+1)} |x_1 - x_0|.

Moreover, x* always lies between the current values of x_old^- and x_old^+; therefore,

(2) |x_m - x*| ≤ 2^{-m} |x_1 - x_0|,

and x_m → x* as m → ∞, so that for m = log_2(|x_1 - x_0|/ε) we have |x_m - x*| < ε.

If desired one could evidently also arrange it so that, in addition, |f(x_final)| < ε.
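In code the loop is only a few lines. A minimal Python sketch of steps (1)-(5) (the function name is illustrative); the theorem that follows amounts to applying it with f(η) = E_η T(X) - t_0:

```python
def bisection(f, x_low, x_high, eps):
    """Root of a strictly increasing continuous f by bisection.

    Requires f(x_low) < 0 < f(x_high); returns x_final with
    |x_final - x*| < eps, mirroring steps (1)-(5) above.
    """
    assert f(x_low) < 0 < f(x_high)
    while x_high - x_low >= 2 * eps:
        x_new = 0.5 * (x_low + x_high)
        fx = f(x_new)
        if fx == 0:
            return x_new
        if fx < 0:
            x_low = x_new           # root lies in (x_new, x_high)
        else:
            x_high = x_new          # root lies in (x_low, x_new)
    return 0.5 * (x_low + x_high)
```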
From this lemma we can deduce the following.

Theorem 2.4.1. Let p(x | η) be a one-parameter canonical exponential family generated by (T, h), satisfying the conditions of Theorem 2.3.1, and suppose T(X) = t_0 ∈ C_T^0, the interior (a, b) of the convex support of P_T. Then the MLE η̂, which exists and is unique by Theorem 2.3.1, may be found (to tolerance ε) by the method of bisection applied to

f(η) = E_η T(X) - t_0.

Proof. By Theorem 1.6.4, f'(η) = Var_η T(X) > 0 for all η, so that f is strictly increasing and continuous, and necessarily f(a+) < 0 < f(b-) because the MLE exists.

Example 2.4.1. The Shape Parameter Gamma Family. Let X_1, ..., X_n be i.i.d. with density Γ^{-1}(θ) x^{θ-1} e^{-x}, x > 0, θ > 0. Because T(X) = Σ_{i=1}^n log X_i has a density for all n, the MLE always exists. It solves the equation

Γ'(θ)/Γ(θ) = T(X)/n,

which by Theorem 2.4.1 can be evaluated by bisection. This example points to another hidden difficulty: the function Γ(θ) = ∫_0^∞ x^{θ-1} e^{-x} dx needed for the bisection method can itself only be evaluated by numerical integration or some other numerical method. However, it is in fact available to high precision in standard packages such as NAG or MATLAB, and bisection itself is a defined function in some packages.
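A sketch of this computation, using scipy's digamma function for Γ'/Γ (the bracket-expansion strategy and function name are illustrative):

```python
import numpy as np
from scipy.special import digamma

def shape_mle(x, eps=1e-8):
    """MLE of the shape theta for i.i.d. gamma(theta, 1) data by bisection.

    Solves digamma(theta) = mean(log x); digamma is strictly increasing
    on (0, inf), so Theorem 2.4.1 applies.
    """
    target = np.mean(np.log(x))
    f = lambda t: digamma(t) - target
    lo, hi = 1e-8, 1.0
    while f(hi) < 0:                 # expand bracket until f(hi) > 0
        hi *= 2.0
    while hi - lo >= 2 * eps:        # standard bisection loop
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```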
. Whenever we can obtain such steps in algorithms.4.I ~(~ (1) 1 to get V. TIl = . Bya standard argument it follows that. ij'j and ij'(j+I) differ in only one coordinate for which iji(j+1) maximizes l. . (2) The sequence (iji!.) where flj has dimension d j and 2::.5).l(ij') = A. . This twodimensional problem is essentially no harder than the onedimensional problem of Example 2.. ff 1 < j < k.pO) = We now use bisection ' r' . 0 Example 2..4. _A (1»). Then each step of the iteration both within cycles and from cycle to cycle is quick.. 1 < j < k. = 7Jk. Consider a point we noted in Example 2.. . by (I). 1j(r) ~ 1j. We note some important generalizations. 71 1 is the unique MLE.=1 d j = k and the problem of obtaining ijl(tO..4. We pursue this discussion next.. fI.1. To complete the proof notice that if 1j(rk ) is any subsequence of 1j(r) that converges to ij' (say) then.2.'.1j{ can be explicit.. . the log likelihood. A(71') ~ to.3.130 Methods of Estimation Chapter 2 (2) Notice that (iii. 1J ! But 71j E [.2.1J k) ..? Continuing. (ii) a/Theorem 2. Therefore. Theorem 2. (6) By (4) and (5). ij = (V'I) . Suppose that this is true for each l.\(0) = ~. Else lim. (i).2..'11 11 t t (I .1 hold and to E 1j(r) t g:.4. as it should. 'tn.j l(i/i) = A (say) exists and is > 00. 71J. j ¥ I) can be solved in closed form. We give a series of steps. 0 ~ (4) :~i (77') = 0 because :~. ij as r t 00. Because l(ijI) ~ Aand the MLE is unique. the MLE 1j doesn't exist. (3) I (71j) = A for all j because the sequence oflikelihoods is monotone. I(W ) = 00 for some j..3. is the expectation of TI(X) in the oneparameter exponential family model with all parameters save T1I assumed known. W = iji = ij.I) solving r0'. . Thus. Proof.2: For some coordinates l.3.1 because the equa~ tion leading to Anew given bold' (2.···) C!J.. . (fij(e) are as above.. is computationally explicit and simple. (3) and (4) => 1]1 . We use the notation of Example 2..I)) = log X + log A 0) and then A(1) = pY. in fact. refuses to converge (in fI space!)see Problem 2. Let 1(71) = tif '1. (I) l(ij'j) Tin j for i fixed and in. For n > 2 we know the MLE exists. The TwoParameter Gamma Family (continued). ijik) has a convergent subsequence in t x . Suppose that we can write fiT = (fir. .2'  It is natural to ask what happens if. to ¢ Fortunately in these cases the algorithm..2. I • . We can initialize with the method 2 of moments estimate from Example 2. x ink) ( '~ini . .··. Continuing in this way we can get arbitrarily close to 1j. ilL" iii. .A(71) + log hex)..2.4. 'II. Hence.. Here we use the strict concavity of t. the algorithm may be viewed as successive fitting of oneparameter families. rp differ only in the second coordinate. . (Wn') ~ O. The case we have C¥. j (5) Because 1]1. that is. they result in substantial savings of time... limi.
We note some important generalizations. Suppose that we can write η^T = (η_1^T, ..., η_r^T), where η_j has dimension d_j, Σ_{j=1}^r d_j = k, and the problem of obtaining η̂_l(t_0, η_j, j ≠ l), the maximizer over the block η_l with the other blocks held fixed, can be solved in closed form. Then it is easy to see that Theorem 2.4.2 has a generalization with cycles of length r, and each step of the iteration, both within cycles and from cycle to cycle, is quick. Whenever we can obtain such steps in algorithms, they result in substantial savings of time. The case we have just discussed has d_1 = ... = d_r = 1, r = k. In this sense the algorithm may be viewed as successive fitting of one-parameter families. A special case is the famous Deming-Stephan proportional fitting of contingency tables algorithm (see Bishop, Feinberg, and Holland, 1975).

Next consider the setting of Proposition 2.3.1, in which l_x(θ), the log likelihood for θ ∈ Θ open ⊂ R^p, is strictly concave. If θ̂(x) exists and l_x is differentiable, the method extends straightforwardly: solve ∂l_x/∂θ_j (θ̂_1, ..., θ̂_{j-1}, θ_j, θ_{j+1}^0, ..., θ_p^0) = 0 by the method of bisection in θ_j to get θ̂_j for j = 1, ..., p, change the other coordinates accordingly, iterate and proceed. Figure 2.4.1 illustrates the process. Geometrically, at each stage with one coordinate fixed, one finds that member of the family of contours to which the vertical (or horizontal) line is tangent.

[Figure 2.4.1. The coordinate ascent algorithm. The graph shows log likelihood contours, that is, values of (θ_1, θ_2)^T where the log likelihood is constant.]

The coordinate ascent algorithm can be slow if the contours in Figure 2.4.1 are not close to spherical. It can be speeded up at the cost of further computation by Newton's method, which we now sketch.
2.4.3 The Newton-Raphson Algorithm

An algorithm that, when it converges, can be shown to be faster than coordinate ascent is the Newton-Raphson method. This method requires computation of the inverse of the Hessian, which may counterbalance its advantage in speed of convergence when it does converge. Here is the method: if η̂_old is the current value of the algorithm, then

η̂_new = η̂_old - Ä^{-1}(η̂_old)(Ȧ(η̂_old) - t_0).   (2.4.2)

The rationale here is simple. If η̂_old is close to the root η̂ of Ȧ(η̂) = t_0, then, expanding Ȧ(η̂) around η̂_old, we obtain

t_0 = Ȧ(η̂) ≈ Ȧ(η̂_old) + Ä(η̂_old)(η̂ - η̂_old),

and η̂_new is the solution for η of the approximate equation given by the right- and left-hand sides. If η̂_old is close enough to η̂, this method is known to converge to η̂ at a faster rate than coordinate ascent; see Dahlquist, Björk, and Anderson (1974). A hybrid of the two methods that always converges and shares the increased speed of the Newton-Raphson method is given in the problems. Newton's method also extends to the framework of Proposition 2.3.1: if l(θ) denotes the log likelihood,

θ̂_new = θ̂_old - [l̈(θ̂_old)]^{-1} l̇(θ̂_old).   (2.4.3)

Example 2.4.3. Let X_1, ..., X_n be a sample from the logistic distribution with d.f.

F(x, θ) = [1 + exp{-(x - θ)}]^{-1}.

The density is

f(x, θ) = exp{-(x - θ)} / [1 + exp{-(x - θ)}]².

We find

l̇(θ) = n - 2 Σ_{i=1}^n exp{-(X_i - θ)} F(X_i, θ),
l̈(θ) = -2 Σ_{i=1}^n f(X_i, θ) < 0.

The Newton-Raphson method can be implemented by taking θ̂_old = X̄.

The Newton-Raphson algorithm has the property that for large n, θ̂_new after only one step behaves approximately like the MLE; we return to this property in Section 6. When likelihoods are nonconcave, methods such as bisection, coordinate ascent, and Newton-Raphson are still employed, though there is a distinct possibility of nonconvergence or convergence to a local rather than global maximum. A one-dimensional problem in which such difficulties arise is given in the problems. Many examples and important issues and methods are discussed, for instance, in Chapter 6 of Dahlquist, Björk, and Anderson (1974).
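A minimal sketch of this iteration for Example 2.4.3 (the starting point X̄ is as suggested in the example; the function name and stopping rule are illustrative):

```python
import numpy as np

def logistic_mle(x, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logistic location MLE (Example 2.4.3).

    Iterates theta <- theta - l'(theta)/l''(theta), starting from xbar;
    l is concave here, so the iteration is well behaved.
    """
    n = len(x)
    theta = np.mean(x)                   # suggested starting point
    for _ in range(max_iter):
        e = np.exp(-(x - theta))
        F = 1.0 / (1.0 + e)              # logistic cdf at each x_i
        f = e * F ** 2                   # logistic density at each x_i
        score = n - 2.0 * np.sum(e * F)  # l'(theta)
        hess = -2.0 * np.sum(f)          # l''(theta) < 0
        step = score / hess
        theta -= step
        if abs(step) < tol:
            break
    return theta
```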
2.4.4 The EM (Expectation/Maximization) Algorithm

There are many models that have the following structure. There are ideal observations, X ~ P_θ with density p(x, θ), θ ∈ Θ ⊂ R^d. Their log likelihood l_{p,x}(θ) = log p(x, θ) is "easy" to maximize: say there is a closed-form MLE, or at least l_{p,x}(θ) is concave in θ. What is observed, however, is not X but S = S(X) ~ Q_θ with density q(s, θ), and the log likelihood l_{q,s}(θ) = log q(s, θ) is difficult to maximize. A fruitful way of thinking of such problems is in terms of S as representing part of X; the rest of X is "missing," and its "reconstruction" is part of the process of estimating θ by maximum likelihood. The algorithm was formalized, with many examples, in Dempster, Laird, and Rubin (1977), though an earlier general form goes back to Baum, Petrie, Soules, and Weiss (1970). For detailed discussion of the algorithm and its main properties we refer to Little and Rubin (1987) and MacLachlan and Krishnan (1997). We give a few examples of situations of the foregoing type in which it is used. A prototypical example follows.

Example 2.4.4. Lumped Hardy-Weinberg Data. Let X_i = (ε_{i1}, ε_{i2}, ε_{i3}), 1 ≤ i ≤ n, be a sample from a population in Hardy-Weinberg equilibrium for a two-allele locus, where

P_θ[X = (1, 0, 0)] = θ², P_θ[X = (0, 1, 0)] = 2θ(1 - θ), P_θ[X = (0, 0, 1)] = (1 - θ)², 0 < θ < 1.

Suppose that, for some individuals, the homozygotes of one type (ε_{i1} = 1) could not be distinguished from the heterozygotes (ε_{i2} = 1); that is, the first two coordinates are lumped for individuals m + 1 ≤ i ≤ n. What is observed is then

S_i = X_i, 1 ≤ i ≤ m; S_i = (ε_{i1} + ε_{i2}, ε_{i3}), m + 1 ≤ i ≤ n.

The log likelihood of S is

l_{q,s}(θ) = Σ_{i=1}^m [2ε_{i1} log θ + ε_{i2} log 2θ(1 - θ) + 2ε_{i3} log(1 - θ)]
+ Σ_{i=m+1}^n [(ε_{i1} + ε_{i2}) log(1 - (1 - θ)²) + 2ε_{i3} log(1 - θ)],   (2.4.4)

a function of curved exponential family form. It does turn out that in this simplest case an explicit maximum likelihood solution is still possible, but the computation is clearly not as simple as in the original Hardy-Weinberg canonical exponential family example. For more complicated patterns of missingness, explicit solution is in general not possible. Yet the EM algorithm, with an appropriate starting point, leads us to an MLE if it exists in both cases. Here is another important example.
4) = L()(Si I Ai) = N(Ail'l + (1.Bo) 0 p(X. I .I.\. I:' .<T. The EM Algorithm. Thus.\ < 1.12). = 11 = . i. The EM algorithm can lead to such a local maximum. = 01. a local maximum close to the true 8 0 turns out to be a good "proxy" for the 0 nonexistent MLE.4. The rationale behind the algorithm lies in the following formulas.4. This fiveparameter model is very rich pennitting up to two modes and scales..4.B) 0=00 = E. Although MLEs do not exist in these models. Again. . the M step is easy and the E step doable. Then we set Bnew = arg max J(B I Bold).<T.7) ii. As we shall see in important situations.i = 1. I. • • i ! I • .. .4.)<T~).9) follows from (2. A. reset Bold = Bnew and repeatthe process.4. If this is difficul~ the EM algorithm is probably not suitable..4. . j.9) I for all B (under suitable regularity conditions). Initialize with Bold = Bo· Tbe first (E) step of the algorithm is to compute J(B I Bold) for as many values of Bas needed. S has the marginal distribution given previously.B) IS(X)=s) q(s. the Si are independent with L()(Si I .4. Note that (2. I That is. Mixture of Gaussians. Suppose that given ~ = (~11"" ~n).(sI'z)where() = (.ll. Suppose 8 1 .6) where A. .. we can think of S as S(X) where X is given by (2. that under fJ.134 Methods of Estimation Chapter 2 Example 2.af) or N(J12.Bo) and (2.\ = 1. if this step is difficult.).\"'u.12) q(s. differentiating and exchanging E oo and differentiation with respect i.4. Let . Bo) _ I S(X) = s ) (2. .4.4.\)"'u.8) by o taking logs in (2. B) = (1.5. J. ~i tells us whether to sample from N(Jil.o log p(X. including the examples. It is not obvious that this falls under our scheme but let (2..(f~). we have given.(sl'd +. Here is the algorithm.4. EM is not particularly appropriate.B) J(B I Bo) ~ E. which we give for 8 real and which can be justified easily in the case that X is finite (Problem 2. It is easy to see (Problem 2.(I'.tz E Rand 'Pu (5) = .0'2 > 0. (P(X. o (:B 10gp(X. p(s. :B 10gq(s. + (1  A. O't.Ai)I'Z. I' Hi where we suppress dependence on s.p() [A.8). .11).B) I S(X) = s) 0=00 (2.p (~). . The second (M) step is to maximize J(B I Bold) as a function of B.B) =E.Sn is a sample from a population P whose density is modeled as a mixture of two Gaussian densities. The log likelihood similarly can have a number of local maxima and can tend to 00 as e tends to the boundary of the parameter space (Problem 2. .8) .6).2)andO <. are independent identically distributed with p() [A. (P(X..4.
The rationale behind the algorithm lies in the following formulas, which we give for θ real and which can be justified easily in the case that X is finite (Problem 2.4.12):

(∂/∂θ) log q(s, θ) |_{θ=θ_0} = E_{θ_0} ( (∂/∂θ) log p(X, θ) |_{θ=θ_0} | S(X) = s )   (2.4.9)

for all θ_0 (under suitable regularity conditions). Because, formally, differentiating (2.4.8) and exchanging E_{θ_0} and differentiation with respect to θ at θ_0 gives

(∂/∂θ) J(θ | θ_0) |_{θ=θ_0} = E_{θ_0} ( (∂/∂θ) log p(X, θ) |_{θ=θ_0} | S(X) = s ),   (2.4.10)

it follows from (2.4.9) that a fixed point θ̂ of the algorithm satisfies the likelihood equation

(∂/∂θ) log q(s, θ̂) = 0.   (2.4.11)

The main reason the algorithm behaves well follows.

Theorem 2.4.3. If θ_new, θ_old are as defined earlier and S(X) = s, then

q(s, θ_new) ≥ q(s, θ_old).   (2.4.13)

Equality holds in (2.4.13) iff the conditional distribution of X given S(X) = s is the same for θ_new as for θ_old and θ_new maximizes J(θ | θ_old).

Proof. We give the proof in the discrete case; however, the result holds whenever the quantities in J(θ | θ_0) can be defined in a reasonable fashion. In the discrete case we appeal to the product rule: for x ∈ X with S(x) = s,

p(x, θ) = q(s, θ) r(x | s, θ),   (2.4.14)

where r(· | s, θ) is the conditional frequency function of X given S(X) = s. Then

J(θ | θ_0) = log [q(s, θ)/q(s, θ_0)] + E_{θ_0} ( log [r(X | s, θ)/r(X | s, θ_0)] | S(X) = s ).   (2.4.15)

Now J(θ_new | θ_old) ≥ J(θ_old | θ_old) = 0 by definition of θ_new. On the other hand,

E_{θ_old} { log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s } ≤ 0   (2.4.16)

by Shannon's inequality. Hence, by (2.4.15),

log [q(s, θ_new)/q(s, θ_old)] = J(θ_new | θ_old) - E_{θ_old} { log [r(X | s, θ_new)/r(X | s, θ_old)] | S(X) = s } ≥ 0,   (2.4.17)

with equality iff both terms on the right are 0, which gives the conditions of the theorem.

The most important and revealing special case of this result follows.

Lemma 2.4.2. Suppose {P_θ : θ ∈ Θ} is a canonical exponential family generated by (T, h) satisfying the conditions of Theorem 2.3.1, and let S(X) be any statistic. Then:
(a) The EM algorithm consists of the alternation

Ȧ(θ_new) = E_{θ_old}(T(X) | S(X) = s)   (2.4.18)
θ_old = θ_new.   (2.4.19)

If a solution of (2.4.18) exists, it is necessarily unique.

(b) If the sequence of iterates {θ_m} so obtained is bounded and the equation

Ȧ(θ) = E_θ(T(X) | S(X) = s)   (2.4.20)

has a unique solution, then it converges to a limit θ*, which is necessarily a local maximum of q(s, θ).

Proof. In this case,

J(θ | θ_0) = E_{θ_0}{(θ - θ_0)^T T(X) - (A(θ) - A(θ_0)) | S(X) = s}
= (θ - θ_0)^T E_{θ_0}(T(X) | S(X) = s) - (A(θ) - A(θ_0)).   (2.4.21)

Part (a) follows because the right-hand side is a strictly concave function of θ whose maximizer, if it exists, solves (2.4.18) and is unique. Part (b) is more difficult; a proof due to Wu (1983) is sketched in Problem 2.4.16.

Example 2.4.4 (continued). X is distributed according to the exponential family

p(x, θ) = exp{η(2N_{1n}(x) + N_{2n}(x)) - A(η)} h(x),   (2.4.22)

where η = log[θ/(1 - θ)], A(η) = 2n log(1 + e^η), N_{jn} = Σ_{i=1}^n ε_{ij}(X_i), 1 ≤ j ≤ 3, and h(x) = 2^{N_{2n}(x)}. Now,

P_θ[ε_{i1} = 1 | ε_{i1} + ε_{i2} = 1] = θ²/[θ² + 2θ(1 - θ)] = θ/(2 - θ),
P_θ[ε_{i2} = 1 | ε_{i1} + ε_{i2} = 1] = 2(1 - θ)/(2 - θ), 1 ≤ i ≤ n.   (2.4.23)

Under the assumption that the process that causes lumping is independent of the values of the ε_{ij}, we see, after some simplification, that

E_θ(2N_{1n} + N_{2n} | S) = 2N_{1m} + N_{2m} + E_θ ( Σ_{i=m+1}^n (2ε_{i1} + ε_{i2}) | ε_{i1} + ε_{i2}, m + 1 ≤ i ≤ n )
= 2N_{1m} + N_{2m} + [2/(2 - θ)] M_n,   (2.4.24)

where N_{jm} = Σ_{i=1}^m ε_{ij}(X_i) and M_n = Σ_{i=m+1}^n (ε_{i1} + ε_{i2}). Because A'(η) = 2nθ (2.4.25), the EM iteration (2.4.18) is

θ̂_new = (2N_{1m} + N_{2m})/2n + (M_n/n)(2 - θ̂_old)^{-1}.   (2.4.26)

It may be shown directly that if 2N_{1m} + N_{2m} > 0 and M_n > 0, then θ̂_m converges to the unique fixed point of (2.4.26) in (0, 1), which is indeed the MLE based on S.
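The iteration (2.4.26) is a one-line fixed-point update. A minimal sketch (the function name and defaults are illustrative):

```python
def em_lumped_hw(N1m, N2m, Mn, n, theta0=0.5, n_iter=100):
    """EM iteration (2.4.26) for lumped Hardy-Weinberg data.

    N1m, N2m: genotype counts among the m fully observed individuals;
    Mn: number of lumped individuals with eps_i1 + eps_i2 = 1;
    n: total sample size; theta0: starting value in (0, 1).
    """
    theta = theta0
    for _ in range(n_iter):
        theta = (2 * N1m + N2m) / (2 * n) + (Mn / n) / (2 - theta)
    return theta
```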
Example 2.4.6. Let (Z_1, Y_1), ..., (Z_n, Y_n) be i.i.d. as (Z, Y) ~ N(μ_1, μ_2, σ_1², σ_2², ρ), where θ = (μ_1, μ_2, σ_1², σ_2², ρ). Suppose that some of the Z_i and some of the Y_i are missing as follows: for 1 ≤ i ≤ n_1 we observe both Z_i and Y_i; for n_1 + 1 ≤ i ≤ n_2 we observe only Z_i; and for n_2 + 1 ≤ i ≤ n we observe only Y_i. The observed data are

S = {(Z_i, Y_i) : 1 ≤ i ≤ n_1} ∪ {Z_i : n_1 + 1 ≤ i ≤ n_2} ∪ {Y_i : n_2 + 1 ≤ i ≤ n}.

In this case a set of sufficient statistics is

T_1 = Z̄, T_2 = Ȳ, T_3 = n^{-1} Σ Z_i², T_4 = n^{-1} Σ Y_i², T_5 = n^{-1} Σ Z_iY_i.

To compute E_θ(T | S = s), we note that for the cases with Z_i and/or Y_i observed, the conditional expected values equal their observed values. For the other cases we use the properties of the bivariate normal distribution (Appendix B.4 and Section 1.4) to conclude

E_θ(Y_i | Z_i) = μ_2 + ρσ_2(Z_i - μ_1)/σ_1
E_θ(Y_i² | Z_i) = [μ_2 + ρσ_2(Z_i - μ_1)/σ_1]² + (1 - ρ²)σ_2²
E_θ(Z_iY_i | Z_i) = Z_i[μ_2 + ρσ_2(Z_i - μ_1)/σ_1],

with the corresponding Z on Y regression equations when conditioning on Y_i (Problem 2.4.1). This completes the E-step. For the M-step, compute (Problem 2.4.1)

Ȧ(θ) = E_θ T = (μ_1, μ_2, σ_1² + μ_1², σ_2² + μ_2², ρσ_1σ_2 + μ_1μ_2)^T

and find (Problem 2.4.1) that the M-step produces

μ̂_{1,new} = T_1(θ_old), μ̂_{2,new} = T_2(θ_old),
σ̂²_{1,new} = T_3(θ_old) - T_1²(θ_old), σ̂²_{2,new} = T_4(θ_old) - T_2²(θ_old),
ρ̂_new = [T_5(θ_old) - T_1(θ_old)T_2(θ_old)] / {[T_3(θ_old) - T_1²(θ_old)][T_4(θ_old) - T_2²(θ_old)]}^{1/2},   (2.4.27)

where T_j(θ_old) denotes T_j with the missing values replaced by the conditional expected values computed in the E-step. We take θ_0 = θ_MOM, where θ_MOM is the method of moments estimate (μ̂_1, μ̂_2, σ̂_1², σ̂_2², ρ̂) of θ based on the observed data (Problem 2.4.8). The process is then repeated with θ_MOM replaced by θ_new, and so on.
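A minimal sketch of the E and M steps for this example (the data layout and names are illustrative; theta0 should be the method of moments estimate computed from the observed data):

```python
import numpy as np

def em_bivariate_normal(zy, z_only, y_only, theta0, n_iter=100):
    """EM for bivariate normal data with missing coordinates (Example 2.4.6).

    zy: (n1, 2) array of complete pairs; z_only, y_only: 1-d arrays of
    cases with only Z or only Y observed; theta0 = (mu1, mu2, s1, s2, rho).
    """
    mu1, mu2, s1, s2, rho = theta0
    n = len(zy) + len(z_only) + len(y_only)
    for _ in range(n_iter):
        # E-step: impute conditional moments for the missing coordinate.
        ey = mu2 + rho * s2 * (z_only - mu1) / s1       # E(Y | Z)
        ey2 = ey ** 2 + (1 - rho ** 2) * s2 ** 2        # E(Y^2 | Z)
        ez = mu1 + rho * s1 * (y_only - mu2) / s2       # E(Z | Y)
        ez2 = ez ** 2 + (1 - rho ** 2) * s1 ** 2        # E(Z^2 | Y)
        T1 = (zy[:, 0].sum() + z_only.sum() + ez.sum()) / n
        T2 = (zy[:, 1].sum() + ey.sum() + y_only.sum()) / n
        T3 = ((zy[:, 0] ** 2).sum() + (z_only ** 2).sum() + ez2.sum()) / n
        T4 = ((zy[:, 1] ** 2).sum() + ey2.sum() + (y_only ** 2).sum()) / n
        T5 = ((zy[:, 0] * zy[:, 1]).sum() + (z_only * ey).sum()
              + (y_only * ez).sum()) / n
        # M-step: (2.4.27).
        mu1, mu2 = T1, T2
        s1 = np.sqrt(T3 - T1 ** 2)
        s2 = np.sqrt(T4 - T2 ** 2)
        rho = (T5 - T1 * T2) / (s1 * s2)
    return mu1, mu2, s1, s2, rho
```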
Because the E-step, in the context of Example 2.4.6, involves imputing missing values, the EM algorithm is often called multiple imputation.

Remark 2.4.1. Note that if S(X) = X, then J(θ | θ_0) is log[p(X, θ)/p(X, θ_0)], which as a function of θ is maximized where the contrast -log p(X, θ) is minimized. Also note that, in general, -E_{θ_0}[J(θ | θ_0)] is the Kullback-Leibler divergence (2.2.23).

Summary. The basic bisection algorithm for finding roots of monotone functions is developed and shown to yield a rapid way of computing the MLE in all one-parameter canonical exponential families with ℰ open (when it exists). We then, in Section 2.4.2, use this algorithm as a building block for the general coordinate ascent algorithm, which yields with certainty the MLEs in k-parameter canonical exponential families with ℰ open when they exist. Important variants of and alternatives to this algorithm, including the Newton-Raphson method, are discussed and introduced in Section 2.4.3 and the problems. Finally, in Section 2.4.4 we derive and discuss the important EM algorithm and its basic properties.

2.5 PROBLEMS AND COMPLEMENTS

Problems for Section 2.1

1. Consider a population made up of three different types of individuals occurring in the Hardy-Weinberg proportions θ², 2θ(1 - θ), and (1 - θ)², where 0 < θ < 1.
(a) Show that T_3 = N_1/n + N_2/2n is a frequency substitution estimate of θ.
(b) Using the estimate of (a), what is a frequency substitution estimate of the odds ratio θ/(1 - θ)?
(c) Suppose X takes the values -1, 0, 1 with respective probabilities p_1, p_2, p_3 given by the Hardy-Weinberg proportions. By considering the first moment of X, show that T_3 is a method of moments estimate of θ.

2. Consider n systems with failure times X_1, ..., X_n assumed to be independent and identically distributed with exponential, ℰ(λ), distributions.
(a) Find the method of moments estimate of λ based on the first moment.
(b) Find the method of moments estimate of λ based on the second moment.
(c) Combine your answers to (a) and (b) to get a method of moments estimate of λ based on the first two moments.
(d) Find the method of moments estimate of the probability P(X_1 ≥ 1) that one system will last at least a month.

3. Suppose that i.i.d. X_1, ..., X_n have a beta, β(α_1, α_2), distribution. Find the method of moments estimates of α = (α_1, α_2) based on the first two moments.
Hint: See Problem B.2.5.
. Let X(l) < . YI ). 6.)).t ) = Number of vectors (Zi. Nk+I) where N I ~ nF(t l ).2. The natural estimate of F(s.12. Hint: Consi~r (N I . empirical substitution estimates coincides with frequency substitution estimates.. (d) FortI tk.. (Zn.. The empirical distribution function F is defined by F(x) = [No. .. .. of Xi < xl/n. .) There is a Onetoone correspondence between the empirical distribution function ~ F and the order statistics in the sense that..findthejointfrequencyfunctionofF(tl). See A.Section 2. Show that these estimates coincide. X n .... . t) is the bivariate empirical distribution function FCs..8)ln first using only the first moment and then using only the second moment of the population.5 Problems and Complements 139 Hint: See Problem 8. .. Let (ZI. . X 1l be the indicators of n Bernoulli trials with probability of success 8. (b) Show that in the continuous case X ~ rv F means that X (c) Show that the empirical substitution estimate of the jth moment JLj is the jth sample moment JLj' Hinr: Write mj ~ f== xjdF(x) ormj = Ep(Xj) where XF. < ~ ~ Nk+1 = n(l  F(tk)). given the order statistics we may construct F and given P.8. (Z2.... < X(n) be the order statistics of a sample Xl. . Yi) such that Zi < sand Yi < t n . ~ Hint: Express F in terms of p and F in terms of P ~() X = No. (c) Argue that in this case all frequency substitution estimates of q(8) must agree with q(X).. 8. = Xi with probability lin. Let Xl. If q(8) can be written in the fonn q(8) ~ s(F) for sOme function s of F we define the empirical substitution principle estimate of q( 8) to be s( F). which we define by ~ F ~( s. .F(tk). . 4. . ~ ~ ~ (a) Show that in the finite discrete case. Give the details of this correspondence. t). ~  7.5.. of Xi n ~ =X. 5. (a) Show that X is a method of moments estimate of 8. . The jth cumulant '0' of the empirical distribution function is called the jth sample cumulanr and is a method of moments estimate of the cumulant Cj' Give the first three sample cumulants. .Xn be a sample from a population with distribution function F and frequency function or density p. (b) Exhibit method of moments estimates for VaroX = 8(1 . (See Problem B. Y2 ).F(t. Let X I. N 2 = n(F(t2J .. . . .2. < . we know the order statistics. Yn ) be a set of independent and identically distributed random vectors with common distribution function F.
(a) Show that F̂(·, ·) is the distribution function of a probability P̂ on R² assigning mass 1/n to each point (Z_i, Y_i).
(b) Define the sample product moment of order (i, j), the sample covariance, the sample correlation, and so on, as the corresponding characteristics of the distribution F̂. Show that the sample product moment of order (i, j) is given by

n^{-1} Σ_{k=1}^n Z_k^i Y_k^j.

The sample covariance is given by

n^{-1} Σ_{k=1}^n (Z_k - Z̄)(Y_k - Ȳ) = n^{-1} Σ_{k=1}^n Z_kY_k - Z̄Ȳ,

where Z̄, Ȳ are the sample means of Z_1, ..., Z_n and Y_1, ..., Y_n, respectively. All of these quantities are natural estimates of the corresponding population characteristics and are also called method of moments estimates. (See Problem 2.1.17.) Note that it follows from (A.11.19) that -1 ≤ r ≤ 1, where r is the sample correlation.

9. Suppose X = (X_1, ..., X_n), where the X_i are independent N(0, σ²).
(a) Find an estimate of σ² based on the second moment.
(b) Construct an estimate of σ using the estimate of part (a) and the equation σ = √σ².
(c) Use the empirical substitution principle to construct an estimate of σ using the relation E(|X_1|) = σ√(2/π).

10. In Example 2.1.1, suppose that g(β, z) is continuous in β and that |g(β, z)| tends to ∞ as |β| tends to ∞. Show that the least squares estimate exists.
Hint: Set c = ρ(X, 0). There exists a compact set K such that for β in the complement of K, ρ(X, β) > c. Since ρ(X, β) is continuous on K, the result follows.

11. In Example 2.1.2, with X ~ Γ(α, λ), find the method of moments estimate based on μ̂_1 and μ̂_2.
Hint: See Problem B.2.4.

12. Let X_1, ..., X_n be i.i.d. as X ~ P_θ, θ ∈ Θ, with θ identifiable. Suppose X has possible values v_1, ..., v_k and that q(θ) can be written as

q(θ) = h(μ_1(θ), ..., μ_r(θ))
.d.. Use E(U[). .5 Problems and Com".p:::'. 0 ~ (p.i.l and (72 are fixed.. as X rv P(J. A).0) (ii) Beta.ilr) is a frequency plug (a) Show that the method of moments estimate if = h(fill . in estimate. . Let g. Consider the Gaussian AR(I) model of Example 1. IG(p. (b) Suppose {PO: 0 E 8} is the kparameter exponential family given by (1.:::m:::.1. 1 < j < k. 13... j i=l = 1.. 0).. find the method of moments estimates (i) Beta. can you give a method of moments estimate of {3? . 1'(1.p(x. to give a method of moments estimate of (72. where U.. . Hint: Use Corollary 1. Let 911 . (b) Suppose P ~ po and I' = b are fixed. A). with (J E R d and (J identifiable. > 0. = h(PI(O).5...Pr(O)) . rep.9r be given linearly independent functions and write ec n Pi(O) = EO(gj(X)).i.0 > 0 14. . 1'(0.6.l Lgj(Xi ). (c) If J.:::nt:::' ~__ 141 for some R k valued function h.36.1.. ..6. When the data are not i.  1/' Po) / . .. Vk and that q(O) for some Rkvalued function h. In the fOllowing cases. . X n are i. . Suppose X 1.d.Section 2. Iii = n. fir) can be written as a frequency plugin estimate. . . (a) Use E(X.x (iv) Gamma.10).6.(X) ~ Tj(X).O) ~ (x/O')exp(x'/20').. .) to give a method of moments estimate of p.• it may still be possible to express parameters as functions of moments and then use estimates based on replacing population moments with "sample" moments. r. p fixed (v) Inverse Gaussian.1) (iii) Raleigh. Suppose that X has possible values VI. See Problem 1. General method of moment estimates(!). ~ (X. Show that the method of moments estimate q = h(j1!.
15. Hardy-Weinberg with six genotypes. In a large natural population of plants (Mimulus guttatus) there are three possible alleles S, I, and F at one locus, resulting in six genotypes labeled SS, II, FF, SI, SF, and IF. Let θ_1, θ_2, and θ_3 denote the probabilities of S, I, and F, respectively, where Σ_{j=1}^3 θ_j = 1. The Hardy-Weinberg model specifies that the six genotypes have probabilities

Genotype:     1 (SS)   2 (II)   3 (FF)   4 (SI)    5 (SF)    6 (IF)
Probability:  θ_1²     θ_2²     θ_3²     2θ_1θ_2   2θ_1θ_3   2θ_2θ_3

Let N_j be the number of plants of genotype j in a sample of n independent plants, 1 ≤ j ≤ 6, and let p̂_j = N_j/n. Show that

θ̂_1 = p̂_1 + ½p̂_4 + ½p̂_5, θ̂_2 = p̂_2 + ½p̂_4 + ½p̂_6, θ̂_3 = p̂_3 + ½p̂_5 + ½p̂_6

are frequency plug-in estimates of θ_1, θ_2, and θ_3.

16. Establish (2.1.6).
Hint: [y - g(β, z)]² = [y - g(β_0, z)]² + [g(β_0, z) - g(β, z)]² + 2[y - g(β_0, z)][g(β_0, z) - g(β, z)].

17. Multivariate method of moments. For a vector X = (X_1, ..., X_q) of observations, let the moments be

m_{jkrs} = E(X_r^j X_s^k), j ≥ 0, k ≥ 0, r, s = 1, ..., q.

For independent identically distributed X_i = (X_{i1}, ..., X_{iq}), i = 1, ..., n, we define the empirical or sample moment to be

m̂_{jkrs} = n^{-1} Σ_{i=1}^n X_{ir}^j X_{is}^k, j ≥ 0, k ≥ 0, r, s = 1, ..., q.

Let X = (Z, Y) and θ = (a_1, b_1), where (Z, Y) and (a_1, b_1) are as in Theorem 1.4.3. Show that method of moments estimators of the parameters b_1 and a_1 in the best linear predictor are

b̂_1 = [n^{-1} Σ Z_iY_i - Z̄Ȳ] / [n^{-1} Σ Z_i² - (Z̄)²], â_1 = Ȳ - b̂_1 Z̄.

Problems for Section 2.2

1. An object of unit mass is placed in a force field of unknown constant intensity θ. Readings Y_1, ..., Y_n are taken at times t_1, ..., t_n on the position of the object. The reading Y_i differs from the true position (θ/2)t_i² by a random error ε_i.
. all lie on aline.. Hint: The quantity to be minimized is I:7J (y.y. the range {g(zr. Suppose that observations YI . ... (Pareto density) . .z..(zn. < O. .). (a) Let Y I . B) = Bc'x('+JJ. Hint: Write the lines in the fonn (z.)' 1+ B5 6.2. . .t of the best (MSPE) predictor of Yn + I ? 4.4. X n denote a sample from a population with one of the following densities Or frequency functions. Show that the least squares estimate is always defined and satisfies the equations (2. Show that the two sample regression lines coincide (when the axes are interchanged) if and only if the points (Zi. . ~p ~(yj}) ~ . Find the least squares estimates of ()l and B .6) under the restrictions BJ > 0. .2 may be derived from Theorem 1....B..... nl and Yi = 82 + ICi.<) ~ .1.13 E R d } is closed. in fact. . (exponential density) (b) f(x. Yn). .Yi).. Let X I. Find the line that minimizes the sum of the squared perpendicular distance to the same points.. 9.4. . Find the least squares estimates for the model Yi = 8 1 (2. i = 1. . c cOnstant> 0. B. 2 10. .Section 2. B) = Be'x. Y. x > 0. (b) Relate your answer to the fonnula for the best zero intercept linear predictor of Section 1.We suppose the oand be uncorrelated with constant variance. to have mean 2. B > O. . 7. 3.5 Problems and Complements 143 ICi differs from the true position (8 /2)tt by a mndom error f l .4)(2.Yn). . and 13 ranges over R d 8. A new observation Ynl. 13).2.. " 7 5. Find the LSE of 8.5) provided that 9 is differentiable with respect to (3" 1 < i < d. ." + 82 z i + €i = nl with ICi a~ given by = 81 + ICi. where €nlln2 are independent N(O. (a) f(x. . i = 1. The regression line minimizes the sum of the squared vertical distances from the points (Zl. 13). .. . x > c. . Yn be independent random variables with equal variances such that E(Yi) = O:Zj where the Zj are known constants. .I is to be taken at time Zn+l. _. Zn and that the linear regression model holds. (J2) variables. . Yn have been taken at times Zl.n. i + 1.. Find the MLE of 8. g(zn. if we consider the distribution assigning mass l/n to each of the points (zJ.2. nl + n2.. What is the least squares estimate based on YI .BJ . B > O..3. Suppose Yi ICl". Find the least squares estimate of 0:... Show that the fonnulae of Example 2. Yl). (zn.. .
Let X" . Show that the MLE of ry is h(O) (i.2. (Wei bull density) 11.5 show that no maximum likelihood estimate of e = (1". I • :: I Suppose that h is a onetoone function from e onto h(e).l L:~ 1 (Xi . c constant> 0.t and a 2 . Because q is onto n. 0) = (x/0 2) exp{ _x 2/20 2 }. Hint: Let e(w) = {O E e : q(O) = w}. . then the unique MLEs are (b) Suppose J. Suppose that T(X) is sufficient for 0 and ~at O(X) is an MLE of O.. ry) denote the density or frequency function of X in terms of T} (Le. iJ( VB. 0) = cO". 0 <x < I.) ! ! !' ! 14.II ". MLEs are unaffected by reparametrization. 0 E e and let 0 denote the MLE of O. (b) Find the maximum likelihood estimate of PelX l > tl for t > 1".  e . . < I" < 00. c constant> 0. x> 0. 0 > O. (Pareto density) (d) fix. x > 0. If n exists.. J1 E R. X n be a sample from a U[0 0 + I distribution. .O) where 0 ~ (1". q(9) is an MLE of w = q(O).0"2) 15. (72 Ii (a) Show that if J1 and 0. . (beta.:r > 0.r(c+<).0"2 (a) Find maximum likelihood estimates of J. 0> O. > O.a)l rather than 0. I I I (b) LetP = {PO: 0 E e}. Thus.1. they are equivariant 16.0"2).144 Methods of Estimation Chapter 2 (c) f(".X)" > 0.l exp{ OX C } . I < k < p. ncR'.. x > 1". then {e(w) : wEn} is a partition of e. Show that if 0 is a MLE of 0. be independently and identically distributed with density f(x. 0 > O. .. say e(w).2 are unknown. the MLE of w is by definition ~ W.I")/O") . I i' 13.  under onetoone transformations). (a) Let X . ~ I in Example 2. .e. (Rayleigh density) (f) f(x. Show that depends on X through T(X) only provided that 0 is unique. I). then i  WMLE = arg sup sup{Lx(O) : 0 E e(w)}. 0) = VBXv'O'. 00 = I 0" exp{ (x .5. . Pi od maximum likelihood estimates of J1 and a 2 . (We write U[a. 0 > O.16(b)..t and a 2 are both known to be nonnegative but otherwise unspecified. reparametrize the model using 1]). 12. and o belongs to only one member of this partition. = X and 8 2 ~ n. Show that any T such that X(n) < T < X(1) + is a maximum likelihood estimate of O. e c Let q be a map from e onto n.P > I. Define ry = h(8) and let f(x. Suppose that Xl. be a family of models for X E X C Rd. bl to make pia) ~ p(b) = (b .1). for each wEn there is 0 E El such that w ~ q( 0).. WEn . density) (e) fix. . X n . is a sample from a N(/l' ( 2 ) distribution. X n • n > 2. Let X I. Hint: Use the factorization theorem (Theorem 1. .Pe.: I I \ .. n > 2.. 0) = Ocx c . Hint: You may use Problem 2.
Now show that ω̂_MLE = ω̂ = q(θ̂).

17. Censored Geometric Waiting Times. If time is measured in discrete periods, a model that is often used for the time X to failure of an item is

P_θ[X = k] = θ^{k-1}(1 - θ), k = 1, 2, ...,

where 0 < θ < 1. Suppose that we only record the time of failure if failure occurs on or before time r, and otherwise just note that the item has lived at least (r + 1) periods. Thus, we observe Y_1, ..., Y_n, which are independent, identically distributed, and have common frequency function

f(k, θ) = θ^{k-1}(1 - θ), k = 1, ..., r,
f(r + 1, θ) = 1 - P_θ[X ≤ r] = 1 - Σ_{k=1}^r θ^{k-1}(1 - θ) = θ^r.

(We denote by "r + 1" survival for at least (r + 1) periods.) Let M = number of indices i such that Y_i = r + 1. Show that the maximum likelihood estimate of θ based on Y_1, ..., Y_n is

θ̂(Y) = (Σ_{i=1}^n Y_i - n) / (Σ_{i=1}^n Y_i - M).

18. Derive maximum likelihood estimates in the following models.
(a) The observations are indicators of Bernoulli trials with probability of success θ. We want to estimate θ and Var_θ X_1 = θ(1 - θ).
(b) The observations are X_1 = the number of failures before the first success, X_2 = the number of failures between the first and second successes, and so on, in a sequence of binomial trials with probability of success θ. We want to estimate θ.

19. Let X_1, ..., X_n be independently distributed with X_i having a N(θ_i, 1) distribution, 1 ≤ i ≤ n.
(a) Find maximum likelihood estimates of the θ_i under the assumption that these quantities vary freely.
(b) Solve the problem of part (a) for n = 2 when it is known that θ_1 ≤ θ_2. A general solution of this and related problems may be found in the book by Barlow, Bartholomew, Bremner, and Brunk (1972).

20. In the "life testing" problem 1.6.16(i), find the MLE of θ.

21. (Kiefer-Wolfowitz) Suppose (X_1, ..., X_n) is a sample from a population with density

f(x, θ) = (9/10σ) φ((x - μ)/σ) + (1/10) φ(x - μ),

where φ is the standard normal density and θ = (μ, σ²) ∈ Θ = {(μ, σ²) : -∞ < μ < ∞, 0 < σ² < ∞}. Show that maximum likelihood estimates do not exist, but
.X)' + L(Y.. Assume that Xi =I=.'.0 I . 1 <k< p}. 1985).8 2.0"2) if.0 2.2 4.5 86.2.. respectively. x) as a function of b. = fl.a 2 ) and NU". .. and only if.5 66. 7 .0 5. Weisberg.ji. J2 J2 o o o I 3. (1') where e n ii' = L(Xi . Show that the maximum likelihood estimate of b for Nand n fixed is given by if ~ (N + 1) is not an integer. .Xj for i =F j and that n > 2. . .tl.2 3.(Zi) + 'i.4 o o o o o o o o o o o o o Life 20. . the following data were obtained (from S. where'i satisfy (2.. Let Xl. TABLE 2. and ~ b(X) = X X (N + 1) or (N + 1)1 n n othetwise..6).2. n).up(X.8 14. x)/ L(b.J. 23. Tool life data Peed 1 Speed 1 1 Life 54. 1 1 1 o o . " Yn be two independent samples from N(J.t.. and assume that zt'·· . Set zl ~ .2 4.0 11.x n .Xm and YI .z!. II I I :' 24. (1') populations.5 0.I' fl.9 3.8 0. Ii equals one of the numbers 22.Y)' i=1 j=1 /(rn + n). where [t] is the largest integer that is < t.4)(2.< J. i: In an experiment to study tool life (in minutes) of steelcutting tools as a function of cutting speed (in feet per minute) and feed rate (in thousands of an inch per revolution).2 Peed Speed  2 I 1 1 1 1 1 1 1 J2 o o 1 1 1 3. . Xl.6. Polynomial Regression. Suppose Y. N.146 Methods of Estimation Chapter 2 that snp. Suppose X has a hypergeometric. .5 I' I I . Y.p(x.1.. .8 3. Show that the MLE of = (fl.0 0.a 2 ) = Sllp~.' wherej E J andJisasubsetof {UI .1 2. distribution. 1i(b.jp): 0 <j. Hint: Coosider the mtio L(b + 1. (1') is If = (X.
. let v(z. Show that (3. z) ~ Z D{3.5 Problems and Complements 147 The researchers analyzed these data using Y = log tool life.Y') have density v(z.y)!(z. However. 25. (a) Show tha!.2.2. z. .y) defined . Let Z D = I!zijllnxd be a design matrix and let W nXn be a known symmetric invertible matrix.1.Y = Z D{3+€ satisfy the linear regression mooel (2. weighted least squares estimates are plugin estimates.2. Being larger.l be a square root matrix of W. Two models are contemplated (a) Y ~ 130 + pIZI + p.y)dzdy. Use these estimated coefficients to compute the values of the contrast function (2.ZD is of rank d. Set Y = Wly.z. (12 unknown. (2. l/Var(Y I Z = z).(P)E(Z·).(P) and p. Let W.. Let (Z.2. (a) Let (Z'.y)!(z.1).(P) coincide with PI and 13. = (cutting speed . Consider the model Y = Z Df3 + € where € has covariance matrix (12 W..y)/c where c = I Iv(z. (c) Zj.(b l + b. 28.19).. .5) for (a) and (b). Y)[Y . ~. (a) The parameterization f3 (b) ZD is of rank d.Z)]'). Y)Y') are finite.4)(2. ZI ~ (feed rate . That is. The best linear weighted mean squared prediction error predictor PI (P) + p. Consider the model (2..4)(2. + E eto (b) Y = + O:IZl + 0:2Z2 + Q3Zi +. Both of these models are approximations to the true mechanism generating the data.900)/300. ZD = Wz ZD and € = WZ€. Show that p.(P)Z of Y is defined as the minimizer of t = zT {3.0:4Z? + O:SZlZ2 + f Use a least squares computer package to compute estimates of the coefficients (f3's and o:'s) in the two models.2.6) with g({3.2. Derive the weighted least squares nonnal equations (2. 26.(P) = Cov(Z'.1. the second model provides a better approximation. . y) > 0 be a weight funciton such that E(v(Z. this has to be balanced against greater variability in the estimated coefficients.Section 2.13)/6.6) with g({3. . This will be discussed in Volume II. E{v(Z.2.2. y).1 (see (B.1). of Example 2. (2..  27. Y)Z') and E(v(Z. z) ing are equivalent. in Problem 2. Y')/Var Z' and pI(P) = E(Y*) 11.6.8 and let v(z. Show that the follow ZDf3 is identifiable.6)).3. Y) have joint probability P with joint density ! (z. (b) Let P be the empirical probability .
29. Consider the model Y_i = μ + e_i, i = 1, ..., n, where e_i = (ε_i + ε_{i+1})/2 and ε_1, ..., ε_{n+1} are i.i.d. with mean zero and variance σ². The e_i are called moving average errors.
(a) Show that in this model the optimal MSPE predictor of the future Y_{i+1} given the past Y_1, ..., Y_i is ½(μ + Y_i).
(b) Show that Ȳ is a multivariate method of moments estimate of μ. (See Problem 2.1.17.)
(c) Find a matrix A such that e_{n×1} = A_{n×(n+1)} ε_{(n+1)×1}.
(d) Find the covariance matrix W of e.
(e) Find the weighted least squares estimate of μ.
(f) The following data give the elapsed times Y_1, ..., Y_n spent above a fixed high level for a series of n = 66 consecutive wave records at a point on the seashore. Use a weighted least squares computer routine to compute the weighted least squares estimate μ̂ of μ. Is μ̂ different from Ȳ?

Elapsed times spent above a certain high level for a series of 66 wave records taken at San Francisco Bay. The data (courtesy S. J. Chon) should be read row by row.

3.93   3.921  2.916  6.858  3.834  4.229
4.611  4.599  0.393  4.453  1.397  4.081
2.046  2.921  2.196  5.053  4.274  5.114
1.871  0.599  0.723  1.511  3.599  2.058
3.091  5.076  3.019  6.019  6.666  3.676
5.453  9.968  2.968  9.054  1.958  10.100
3.300  6.039  9.968  2.249  1.131  5.155
4.156  5.196  2.274  5.229  4.836  0.871
0.091  3.360  1.182  0.182  1.155  0.071
5.582  2.676  5.203  4.392  2.093  1.053
4.397  4.723  2.788  2.860  5.856  7.716

30. In the multinomial Example 2.3.2, suppose some of the n_j are zero. Show that the MLE of λ_j is λ̂_j = n_j/n, 1 ≤ j ≤ k.
Hint: Suppose without loss of generality that n_1 = n_2 = ... = n_q = 0, n_{q+1} > 0, ..., n_k > 0. Then the likelihood is Π_{j=q+1}^k λ_j^{n_j}, which is maximized over the closed simplex by λ̂_j = n_j/n; any λ with λ_j > 0 for some j ≤ q gives a smaller value.
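Problem 29(f) calls for a weighted least squares routine. A minimal sketch using the moving-average covariance structure of parts (c)-(e), up to the factor σ², which cancels (names illustrative):

```python
import numpy as np

def wls_mean_moving_average(y):
    """Weighted least squares estimate of mu in Problem 29.

    With e_i = (eps_i + eps_{i+1})/2, W has 1/2 on the diagonal and
    1/4 on the first off-diagonals; the WLS estimate of mu is
    (1' W^{-1} y) / (1' W^{-1} 1).
    """
    n = len(y)
    W = 0.5 * np.eye(n) + 0.25 * (np.eye(n, k=1) + np.eye(n, k=-1))
    ones = np.ones(n)
    Winv_y = np.linalg.solve(W, y)
    Winv_1 = np.linalg.solve(W, ones)
    return (ones @ Winv_y) / (ones @ Winv_1)
```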
.. the sample median yis defined as ~ [Y(.1.{:i. F)(x) *F denotes convolution... Y( n) denotes YI .(3p that minimizes the least absolute deviation contrast function L~ I IYi . .). An asymptotically equivalent procedure is to take the median of the distribution placing mass and mass ... .5 Problems and Complements 149 (f3I' .:C j • i <j (a) Show that the HodgesLehmann estimate is the minimizer of the contrast function p(x.4. . . Let X.) Suppose 1'.. .Jitl.. Let x. .. then the MLE exists and is unique.. (See (2. These least absolute deviation estimates (LADEs). = L Ix.I/a}.) + Y(r+l)] where r = ~n. Hint: See Example 1.3.. XHL i& at each point x'. Show that XHL is a plugin estimate of BHL. Yn are independent with Vi having the Laplace density 1 2"exp {[Y./3p that minimizes the maximum absolute value conrrast function maxi IYi .Ili I and then setting (j = n~I L~ I IYi . the sample median fj is defined as Y(k) where k ~ ~(n + 1) and Y(l).tLil and then setting a = max t IYi .17). Find the MLE of A and give its mean and variance.2. where Iii = ~ L:j:=1 ZtjPj· 32.32(b)./Jr ~ ~ and iii. The HodgesLehmann (location) estimate XHL is defined to be the median of the ~n(n + 1) pairwise averages ~(Xi + Xj).i. Show that the sample median ii is the minimizer of L~ 1 IYi . where Jii = L:J=l Zij{3j. . ~ 33.I'. .8).d. " .(1 + x'j. 8 E R.. Z and W are independent N(O.. A > 0... Suppose YI . a)T is obtained by finding 131. + Xj i<j  201· (b) Define BH L to be the minimizer of J where F [x . with density g(x . (a) Show Ihat if Ill[ < 1. (3p. Let B = arg max Lx (8) be "the" MLE. = I' for each i.Section 2. be the observations and set II = ~ (Xl .d. . . y)T where Y = Z + v"XW.. . Give the MLE when ..d. .i.l1.fin are called (b) If n is odd.>0 ~ ~ where tLi = :Ej=l Ztj{3j for given covariate values {Zij}. x E R. 35. . Let g(x) ~ 1/".J. be i. If n is even. be the Cauchy density. i < j... .7 with Y having the empirical distribution F..& at each Xi. 34. Yn ordered from smallest to largest. be i. 1).6.{3p.x. a) is obtained by finding (31. . Hint: Use Problem 1. as (Z.O) Hint: See Problem 2. (a) Show that the MLE of ((31. . let Xl and X. ~ f3I. and x. .28Id(F..  Illi < 1. .
Find the MLEs of Al and A2 hased on 5. ry) = p(x. . and positive everywhere. b).. B) = g(xB). 1985). B ) and p( X.+~l Vi and 8 2 = E. Define ry = h(B) and let p' (x.x2)/2. . ~ (XI + x. Let (XI. (a..150 Methods of Estimation Chapter 2 (b) Show that if 1t>1 > 1. (7') and let p(x.f)) in the likelihood equation. If we write h = logg. Find the values of B that maximize the likelihood Lx(B) when 1t>1 > L Hint: Factor out (x .B)g(x. then the MLE is not unique. . 'I i 39. . then h"(y) > 0 for some nonzero y. B) is and that the KullbackLiebler divergence between p(x. and 8 2 are independent. symmetric about 0.t> . . where Al +. b) there exists a 0 > 0 ~ (c) There is an interval such that h(y + 0) . Sl.+~l Wj have P(rnAl) and P(mA2) distributions. Let Xl and X2 be the observed values of Xl and X z and write j. ~ Let B ~ arg max Lx (B) be "the" MLE.B). The likelihood function is given by Lx(B) g(xI .B I ) (K'(ryo. Let 9 be a probability density on R satisfying the following three conditions: I.B )'/ (7'.B)g(i . 38.. On day n + 1 the Web Master decides to keep track of two types of hits (money making and not money making). then B is not unique. Let Vj and W j denote the number of hits of type 1 and 2 on day j. Also assume that S. Suppose X I. Assume that 5 = L:~ I Xi has a Poisson.h(y . i = 1. 37.. and 5.)/2 and t> = (XI . = I ! I!: Ii. 1 n + m. . B) and pix. Problem 35 can be generalized as follows (Dhannadhikari and JoagDev. . h I (ry») denote the density or frequency function of X for the ry parametrization.. a < b. ryo) and p' (x. > h(y) .i. ~ (d) Use (c) to show that if t> E (a. Let X ~ P" B E 8. . P(nA). 2. BI ) (p' (x. 9 is continuous. X.B) g(x + t> .Xn are i. Show that the entropy of p(x. 51. Show that (a) The likelihood is symmetric about ~ x.. Let Xi denote the number of hits at a certain Web site on day i. . Bo) is tn( B. . 9 is twice continuously differentiable everywhere except perhaps at O. Let K(Bo.0).. j = n + I.d.b). Ii I. (b) Either () = x or () is not unique...h(y) . Suppose h is a II function from 8 Onto = h(8).. ryl Show that o n ».ryt» denote the KullbackLeibler divergence between p( x. that 8 1 E. n. 3. . Assume :1. such that for every y E (a. where x E Rand (} E R.. B) denote their joint density. . distribution.) be a random samplefrom the distribution with density f(x. 36. o Ii !n 'I . N(B.j'.\2 = A.
.. 1'). 1 En are uncorrelated with mean 0 and variance estimating equations (2. 41.Section 2. Let Xl. Yn). /3.7) for a. . a. a get. Give the least squares where El" •. = LX. n. and /1. where OJ > O. .}. [3.) Show that statistics.O) ~ a {l+exp[[3(tJ1. i = 1. and g(t. Variation in the population is modeled on the log scale by using the model logY. 6.)/o)} (72.p.. . 0). > OJ and T. 1989) Y dt ~ [3 1  Idy [ (Y)!] ' y > 0. j = 1.. x > 0. [3. (.1. 0 > O. Seber and Wild. T. [3. For the ca. a > 0.)/o]}" (b) Suppose we have observations (t I . (tn.1. < 0] are sufficient (b) Find the maximum likelihood estimates of 81 and 82 in tenns of T 1 and T 2 • Carefully check the "T1 = 0 or T 2 = 0" case.. .0). h(z.. n where A = (a. i = 1. [3. n > 4. An example of a neural net model is Vi = L j=l p h(Zij. . 1).5 Problems and Complements 151 40. yd 1 ••• . 1959. = log a . and fj.. .I[X. Suppose Xl. The mean relative growth of an organism of size y at time t is sometimes modeled by the equation (Richards. . and 1'. where 0 (a) Show that a solution to this equation is of the form y (a..2. = L X.1.olog{1 + exp[[3(t.~e p = 1. . X n be a sample from the generalized Laplace distribution with density 1 O + 0. o + 0.7) for estimating a. 1'.~t square estimating equations (2. (e) Let Yi denote the response of the ith organism in a sample and let Zij denote the level of the jth covariate (stimulus) for the ith organism.c n are uncorrelated with mean zero and variance (72. 1 X n satisfy the autoregressive model of Example 1.  J1. . on a population of a large number of organisms.. .1'. . exp{x/O. + C. . . j 1 1 cxp{ x/Oj}..5. 42. Aj) + Ei. give the lea.j ~ x <0 1. I' E R. A) = g(z.. [3 > 0.I[X.
.. . . Yn) is not a sequence of 1's followed by all O's or the reverse. Hint: Let C = 1(0). Suppose Y I .0. Xn· Ip (x. n i=l n i=l n i=1 n i=l Cl LYi + C. I < j < 6. (b) Show that the likelihood equations are equivalent to (2. This set K will have a point where the max is attained. .3. There exists a compact set K c e such that 1(8) < c for all () not in K. S.)I(c.. < I} and let 03 = 1. = OJ.l. gamma. t·. .p).. _ C2' 2.3 1. n > 2. the bound is sharp and is attained only if Yi x. In a sample of n independent plants.4) and (2. = L(CI + C.. 1 Xl <i< n.02): 01 > 0... C2' y' 1 1 for (a) Show that the density of X = (Xl.3.x.'" £n)T of autoregression errors. . + 8. + C1 > 0). Consider the HardyWeinberg model with the six genotypes given in Problem 2. . Is this also the MLE of /1? Problems for Section 2. = If C2 > 0. < .l  L: ft)(Xi /. X n be i. (One way to do this is to find a matrix A such that enxl = Anxn€nx 1. Xn)T can be written as the rank 2 canonical exponential family generated by T = (ElogX"EXi ) and hex) = XI with ryl = p. show that the MLE of fi is jj =  2. Let Xl. L x. a.fi) = 1log p pry. fi exists iff (Yl . Prove Lenama 2. Under what conditions on (Xl 1 .) 1 II(Z'i /1)2 (b) If j3 is known. Hint: I .Xi)Y' < L(C1 + c..15..0 . . fi) = a + fix.). r(A. . = 1] = p(x"a.i... Yn are independent PlY.152 Methods of Estimation Chapter 2 (aJ If /' is known.1.J. . + 0.. • II ~.) Then find the weighted least square estimate of f. " l Xn) does the Mill exist? What is the MLE? Is it unique? e 4. Let = {(01. .. . write Xi = j if the ith plant has genotype j.(8 . 3..y.d.x. < Show that the MLE of a..1 > _£l. find the covariance matrix W of the vector € = (€1.. 0 for Xi < _£I.3.. Give details of the proof or Corollary 2.::. > 0.5). . I .3... '12 = A and where r denotes the gamma function.l. (X.
with density. 0 < Zl < . Let Xl.O. n > 2.6.. [(Ai)..d.1).3. ~fo (X u I. C! > 0.6. the MLE 8 exists but is not unique if n is even.3. < Zn.'~ve a unique solution (.en. . • It) w' ( Xi It) .8) .) ~ Ail = exp{a + ilz.1 (a) ~ ~ c(a) exp{ Ix . then it must contain a sphere and the center of the sphere is an interior point by (B. (O. Zj < ..40..5 Problems and Complements 153 n 6. (3) Show that.. Hint: If BC has positive volume. > 3. < Zn Zn. .Ld.fl.1} 0 OJ = (b) Give an algorithm such that starting at iP = 0. See also Problem 1.}.. 0:0 = 1. e E RP. < Show that the MLE of (a. and Zi is the income of the person whose duration time is ti. if n > 2. X n be Ll. Let Y I . _. 1 <j< k1.t).. . and assume forw wll > 0 so that w is strictly convex. Suppose Y has an exponential.. 8... 10. 9. In the heterogenous regression Example 1. the likelihood equations logfo that t (X It) w' iOJ t=I = 0 ~ { (X i . .0).1 to show that in the multinomial Example 2.. (b) Show that if a = 1.lkIl :Ij >0. J.Xn E RP be i. u > 1 fR exp{Ixlfr}dxand 1·1 is the Euclidean norm.Section 2. . . . MLEs of ryj exist iffallTj > 0.. Hint: If it didn't there would exist ryj = c(9j ) such that ryJ 10 .0.9.. .0). (0. ji(i) + ji.10 with that the MLE exists and is unique.3. Yn denote the duration times of n independent visits to a Web site. w( ±oo) = 00.. fe(x) wherec.l E R. show 7.3.n) are the vertices of the :Ij <n}.' . distribution where Iti = E(Y.1 <j < ac k1.ill T exists and is unique. .0 < ZI < . Let XI. the MLE 8 exists and is unique. .. Prove Theorem 2. " (a) Show that if Cl: > ~ 1. . .n. .... 12. Then {'Ij} has a subsequence that converges to a point '10 E f. . convex set {(Ij.3. Hint: The kpoints (0.z=7 11.A( ryj) ~ max{ 'IT toA( 'I) : 'I E c(8)} > 00. Show that the boundary of a convex C set in Rk has volume 0.. Use Corollary 2. But c( e) is closed so that '10 = c( eO) and eO must satisfy the likelihood equations. a(i) + a.. .
(c) Show that for the logistic distribution F0(x) = [1 + exp{−x}]⁻¹, w is strictly convex; give the likelihood equations for θ.

13. Describe in detail what the coordinate ascent algorithm does in estimation of the regression coefficients in the Gaussian linear model

  Y = ZD β + ε, ε1, ..., εn i.i.d. N(0, σ²), rank(ZD) = k.

(Check that you are describing the Gauss–Seidel iterative method for solving a system of linear equations. See, for example, Golub and Van Loan, 1985, Chapter 10.)

Problems for Section 2.4

1. EM for bivariate data. (a) In the bivariate normal example, complete the E-step by finding E(Zi | Yi) and E(Zi Yi | Yi). (b) In the same example, verify the M-step.

2. Let (X1, Y1), ..., (Xn, Yn) be a sample from a N(μ1, μ2, σ1², σ2², ρ) population. (a) Show that the MLEs of σ1², σ2², and ρ when μ1 and μ2 are assumed to be known are σ̃1² = (1/n) Σ (Xi − μ1)², σ̃2² = (1/n) Σ (Yi − μ2)², and ρ̃ = Σ (Xi − μ1)(Yi − μ2)/[n σ̃1 σ̃2], respectively. (b) If n ≥ 5 and μ1 and μ2 are unknown, show that the estimates of μ1, μ2, σ1², σ2², ρ coincide with the method of moments estimates of Problem 2.1.8. Hint: (b) Because (Xi, Yi) has a density, you may assume σ̃1² > 0, σ̃2² > 0, |ρ̃| < 1, provided that n ≥ 3.

3. (a) Show that the function D(a, b) = Σ w(a Xi − b) − n log a is strictly convex in (a, b) and that D(a, b) → ∞ as (a, b) → (a0, b0) if either a0 = 0 or ∞ or b0 = ±∞. Deduce that if a strictly convex function of this kind has a minimum, it is unique. (b) Reparametrize by a = 1/σ, b = μ/σ and consider varying a, b successively. (Check your conclusions against Corollary 2.3.1.) Note: You may use without proof (see Appendix B.9): (i) If a strictly convex function has a minimum, it is unique. (ii) If ∂²D/∂a² > 0 and (∂²D/∂a²)(∂²D/∂b²) > (∂²D/∂a∂b)², then D is strictly convex.
4. Let (Ii, Yi), 1 ≤ i ≤ n, be independent and identically distributed according to Pθ, θ = (λ, μ) ∈ (0, 1) × R, where Pθ[I1 = 1] = λ = 1 − Pθ[I1 = 0] and, given I1 = 1, Y1 ~ N(μ, σ1²), given I1 = 0, Y1 ~ N(0, σ0²), with σ0², σ1² known. (a) Show that {(Ii, Yi) : 1 ≤ i ≤ n} is distributed according to an exponential family; exhibit its natural sufficient statistic T. (b) Deduce that T is minimal sufficient.

5. Suppose the Ii in Problem 4 are not observed. (a) Justify crude method of moments estimates of μ and λ based on Ȳ and n⁻¹ Σ (Yi − Ȳ)². Do you see any problems with λ̂? (b) Give as explicitly as possible the E- and M-steps of the EM algorithm for this problem. Hint: Use Bayes rule. (c) Give explicitly the maximum likelihood estimates of μ and λ, when they exist.

6. Consider a genetic trait that is directly unobservable but will cause a disease among a certain proportion of the individuals that have it. For families in which one member has the disease, it is desired to estimate the proportion θ that has the genetic trait. Suppose that in a family of n members in which one has the disease (and, thus, also the trait), X is the number of members who have the trait. Because it is known that X ≥ 1, the model often used for X is that it has the conditional distribution of a B(n, θ) variable, given X ≥ 1, with θ ∈ [0, 1].

(a) Show that

  P(X = x | X ≥ 1) = (n choose x) θˣ(1 − θ)ⁿ⁻ˣ / [1 − (1 − θ)ⁿ], x = 1, ..., n,

and that the MLE exists and is unique.

(b) Use (2.4.3) to show that the Newton–Raphson algorithm gives θ̂1 = θ̂ − l'(θ̂)/l''(θ̂), where θ̂ = θ̂_old and θ̂1 = θ̂_new, as the first approximation to the maximum likelihood estimate of θ.

(c) If n = 5 and x = 2, find θ̂1 of (b) above using θ̂ = x/n as a preliminary estimate.
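As a complement to Problems 4 and 5, the following is a minimal sketch of the E-step/M-step iteration for the two-component Gaussian model with the Ii unobserved. It is illustrative only, not a solution: unit variances are assumed for brevity and all names are invented for this sketch.

```python
import numpy as np

# Minimal EM sketch for Problems 4-5: I_i ~ Bernoulli(lam) unobserved,
# Y_i | I_i = j ~ N(mu_j, 1), with unit variances assumed for brevity.
def em_mixture(y, lam, mu0, mu1, iters=100):
    y = np.asarray(y, dtype=float)
    for _ in range(iters):
        # E-step: posterior probability that I_i = 1 given Y_i (Bayes rule)
        p1 = lam * np.exp(-0.5 * (y - mu1) ** 2)
        p0 = (1 - lam) * np.exp(-0.5 * (y - mu0) ** 2)
        w = p1 / (p0 + p1)
        # M-step: weighted complete-data MLEs
        lam = w.mean()
        mu1 = np.sum(w * y) / np.sum(w)
        mu0 = np.sum((1 - w) * y) / np.sum(1 - w)
    return lam, mu0, mu1
```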
7. Let X1, X2, X3 be independent observations from the Cauchy distribution about θ,

  f(x, θ) = π⁻¹ [1 + (x − θ)²]⁻¹,

with X3 = a. Show that for a sufficiently large the likelihood function has local maxima between 0 and 1 and between 1 and a. (a) Deduce that, depending on where bisection is started, the sequence of iterates may converge to one or the other of the local maxima. (b) Make a similar study of the Newton–Raphson method in this case.

8. Consider the following algorithm under the conditions of Theorem 2.4.2. Define η̂0 as before, and let η̂_new = η̂(λ*), where λ* maximizes λ → t0ᵀ η̂(λ) − A(η̂(λ)). Show that the sequence defined by this algorithm converges to the MLE if it exists. Hint: Apply the argument of the proof of Theorem 2.4.2, noting that the sequence of iterates {η̂m} is bounded and, hence, the sequence (η̂m, η̂(m+1)) has a convergent subsequence.

9. Let X1, ..., Xn be i.i.d. copies of X = (U, V, W), where P[U = a, V = b, W = c] = p(abc), 1 ≤ a ≤ A, 1 ≤ b ≤ B, 1 ≤ c ≤ C, Σ p(abc) = 1.

(a) Suppose that for all a, b, c,

  (1) log p(abc) = μ(ac) + ν(bc),

where −∞ < μ, ν < ∞. Show that this holds iff P[U = a, V = b | W = c] = P[U = a | W = c] P[V = b | W = c], i.e., iff U and V are independent given W.

(b) Show that the family of distributions obtained by letting μ, ν vary freely is an exponential family of rank (C − 1) + C(A + B − 2) = C(A + B − 1) − 1, generated by N(++c), N(a+c), N(+bc), where N(abc) = #{i : Xi = (a, b, c)} and "+" indicates summation over the index. Hint: Consider N(a+c) − N(++c)/A and N(+bc) − N(++c)/B.

(c) Show that the MLEs exist iff 0 < N(a+c), N(+bc) < N(++c) for all a, b, c, and then are given by

  p̂(abc) = (N(++c)/n)(N(a+c)/N(++c))(N(+bc)/N(++c)).

Hint: The model implies p(abc) = p(+bc) p(a+c)/p(++c); use the likelihood equations.
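For Problem 9(c), the closed-form MLE under conditional independence is a one-liner on the count array. A sketch with illustrative names (N[a, b, c] holds the counts):

```python
import numpy as np

# MLE of p_abc when U and V are conditionally independent given W,
# computed from the three margins of the count array N.
def ci_mle(N):
    n = N.sum()
    Na_c = N.sum(axis=1)          # N_{a+c}, shape (A, C)
    N_bc = N.sum(axis=0)          # N_{+bc}, shape (B, C)
    N__c = N.sum(axis=(0, 1))     # N_{++c}, shape (C,)
    return Na_c[:, None, :] * N_bc[None, :, :] / (n * N__c[None, None, :])
```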
10. Suppose X is as in Problem 9, but now

  (2) log p(abc) = μ(ac) + ν(bc) + γ(ab),

where μ, ν, γ vary freely.

(a) Show that this is an exponential family of rank

  A + B + C − 3 + (A − 1)(B − 1) + (A − 1)(C − 1) + (B − 1)(C − 1) = AB + AC + BC − (A + B + C).

(b) Consider the following "proportional fitting" algorithm for finding the maximum likelihood estimate in this model. Initialize:

  p̂(0)(abc) = N(a++) N(+b+) N(++c) / n³,
  p̂(1)(abc) = p̂(0)(abc) N(ab+) / (n p̂(0)(ab+)),
  p̂(2)(abc) = p̂(1)(abc) N(a+c) / (n p̂(1)(a+c)),
  p̂(3)(abc) = p̂(2)(abc) N(+bc) / (n p̂(2)(+bc)).

Reinitialize with p̂(3)(abc). Show that the algorithm converges to the MLE if it exists and diverges otherwise. Hint: Note that because {p̂(0)(abc)} belongs to the model, so do all subsequent iterates, and that p̂(1)(abc) is the MLE for the exponential family

  p(abc) = e^{γ(ab)} p(0)(abc) / Σ(a',b') e^{γ(a'b')} p(0)(a'b'c),

obtained by fixing the "b, c" and "a, c" parameters.

11. Justify formula (2.4.8). Hint: Pθ[X = x | S(X) = s] = 1(S(x) = s) pθ(x) / Pθ[S(X) = s].

12. (a) Show that S in Example 2.4.5 has the specified mixture of Gaussian distributions. (b) Give explicitly the E- and M-steps of the EM algorithm in this case.
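Problem 10(b)'s proportional fitting cycle is equally mechanical. Here is a sketch of one full cycle over the three two-way margins; it is illustrative only and assumes a strictly positive table so that no ratio divides by zero.

```python
import numpy as np

# One cycle of iterative proportional fitting for a three-way table:
# adjust p to match, in turn, the N_{ab+}, N_{a+c}, N_{+bc} margins of N.
def ipf_cycle(p, N):
    n = N.sum()
    for axis, marg in ((2, N.sum(axis=2)),   # fit N_{ab+}
                       (1, N.sum(axis=1)),   # fit N_{a+c}
                       (0, N.sum(axis=0))):  # fit N_{+bc}
        ratio = (marg / n) / p.sum(axis=axis)
        p = p * np.expand_dims(ratio, axis=axis)
    return p
```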
13. Let X1, ..., Xn be i.i.d. with density fθ(x) = f0(x − θ), where

  f0(x) = (1/3)φ(x) + (2/3)φ(x − a)

and φ is the N(0, 1) density. Show for n = 1 that bisection may lead to a local maximum of the likelihood, if a is sufficiently large.

14. Establish the last claim in part (2) of the proof of Theorem 2.4.2. Hint: Show that {(θ̂m, θ̂(m+1))} has a subsequence converging to (θ*, θ*) and, thus, necessarily θ* is the global maximizer.

15. Limitations of the EM algorithm. The assumption underlying the computations in the EM algorithm is that the conditional probability that a component Xj of the data vector X is missing, given the rest of the data vector, is not a function of Xj. That is, the process determining whether Xj is missing is independent of Xj. This condition is called missing at random. This assumption may not be satisfied. For instance, if Yi represents the seriousness of a disease, suppose all subjects with Yi > 2 drop out of the study; that is, Yi is missing iff Yi > 2. Then using the E-step to impute values for the missing Y's would greatly underpredict the actual Y's, because all the Y's in the imputation would have Y ≤ 2. In this setting, with σ1 = σ2 = 1 and ρ = 0.5, find the probability that E(Yi | Zi) underpredicts Yi.

16. EM and Regression. Consider the model

  Yi = β1 + β2 Zi + εi, 1 ≤ i ≤ n,

where ε1, ..., εn are i.i.d. N(0, σ²), independent of Z1, ..., Zn, which are i.i.d. N(μ1, σ1²). Suppose that for 1 ≤ i ≤ m we observe both Zi and Yi, and for m + 1 ≤ i ≤ n we observe only Yi. That is, given Zi, the probability that Yi is missing may depend on Zi, but not on Yi: the "missingness" of Yi is independent of Yi. Complete the E- and M-steps of the EM algorithm for estimating (μ1, σ1², β1, β2).

17. Establish part (b) of Theorem 2.4.3. Hint: Use the canonical nature of the family and openness of Ē; show that {(θ̂m, θ̂(m+1))} has a subsequence converging to (θ*, θ*) and necessarily the limit is the global maximizer.

18. Verify the formula given in Example 2.4.3 for the actual MLE in that example.

2.6 NOTES

Notes for Section 2.1

(1) "Natural" now was not so natural in the eighteenth century, when the least squares principle was introduced by Legendre and Gauss. For a fascinating account of the beginnings of estimation in the context of astronomy, see Stigler (1986).

(2) The frequency plug-in estimates are sometimes called Fisher consistent. R. A. Fisher (1922) argued that only estimates possessing the substitution property should be considered and the best of these selected. These considerations lead essentially to maximum likelihood estimates.
Notes for Section 2.2

(1) An excellent historical account of the development of least squares methods may be found in Eisenhart (1964).

Notes for Section 2.3

(1) Recall that in an exponential family, for any A, P[T(X) ∈ A] = 0 for all or for no P ∈ P.

(2) For further properties of Kullback–Leibler divergence, see Cover and Thomas (1991).

Note for Section 2.5

(1) In the econometrics literature (e.g., Campbell, Lo, and MacKinlay, 1997), a multivariate version of minimum contrast estimates are often called generalized method of moment estimates.

2.7 REFERENCES

BARLOW, R. E., D. J. BARTHOLOMEW, J. M. BREMNER, AND H. D. BRUNK, Statistical Inference Under Order Restrictions New York: Wiley, 1972.

BAUM, L. E., T. PETRIE, G. SOULES, AND N. WEISS, "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains," Ann. Math. Statist., 41, 164-171 (1970).

BISHOP, Y. M. M., S. E. FEINBERG, AND P. W. HOLLAND, Discrete Multivariate Analysis: Theory and Practice Cambridge, MA: MIT Press, 1975.

CAMPBELL, J. Y., A. W. LO, AND A. C. MACKINLAY, The Econometrics of Financial Markets Princeton, NJ: Princeton University Press, 1997.

COVER, T. M., AND J. A. THOMAS, Elements of Information Theory New York: Wiley, 1991.

DAHLQUIST, G., A. BJORK, AND N. ANDERSON, Numerical Analysis New York: Prentice Hall, 1974.

DEMPSTER, A. P., N. M. LAIRD, AND D. B. RUBIN, "Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm," J. Roy. Statist. Soc. B, 39, 1-38 (1977).

DHARMADHIKARI, S., AND K. JOAG-DEV, "Examples of Nonunique Maximum Likelihood Estimators," The American Statistician, 39, 199-200 (1985).

EISENHART, C., "The Meaning of Least in Least Squares," Journal Wash. Acad. Sciences, 54, 24-33 (1964).

FAN, J., AND I. GIJBELS, Local Polynomial Modelling and Its Applications London: Chapman and Hall, 1996.

FISHER, R. A., "On the Mathematical Foundations of Theoretical Statistics," reprinted in Contributions to Mathematical Statistics (by R. A. Fisher, 1950) New York: J. Wiley and Sons, 1922.

GOLUB, G., AND C. VAN LOAN, Matrix Computations Baltimore: John Hopkins University Press, 1985.

HABERMAN, S. J., The Analysis of Frequency Data Chicago: University of Chicago Press, 1974.
KOLMOGOROV, A. N., "On the Shannon Theory of Information Transmission in the Case of Continuous Signals," IRE Transf. Inform. Theory, 102-108 (1956).

LITTLE, R. J. A., AND D. B. RUBIN, Statistical Analysis with Missing Data New York: J. Wiley, 1987.

MACLACHLAN, G. J., AND T. KRISHNAN, The EM Algorithm and Extensions New York: Wiley, 1997.

MOSTELLER, F., "Association and Estimation in Contingency Tables," J. Amer. Statist. Assoc., 63, 1-28 (1968).

RICHARDS, F. J., "A Flexible Growth Function for Empirical Use," J. Exp. Botany, 10, 290-300 (1959).

RUPPERT, D., AND M. P. WAND, "Multivariate Locally Weighted Least Squares Regression," Ann. Statist., 22, 1346-1370 (1994).

SEBER, G. A. F., AND C. J. WILD, Nonlinear Regression New York: Wiley, 1989.

SHANNON, C. E., "A Mathematical Theory of Communication," Bell System Tech. Journal, 27, 379-423, 623-656 (1948).

SNEDECOR, G., AND W. COCHRAN, Statistical Methods, 6th ed. Ames, IA: Iowa State University Press, 1967.

STIGLER, S., The History of Statistics Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.

WU, C. F. J., "On the Convergence Properties of the EM Algorithm," Ann. Statist., 11, 95-103 (1983).
Chapter 3

MEASURES OF PERFORMANCE, NOTIONS OF OPTIMALITY, AND OPTIMAL PROCEDURES

3.1 INTRODUCTION

Here we develop the theme of Section 1.3, which is how to appraise and select among decision procedures. In Section 3.4 we study, in the context of estimation, the relation of the two major decision theoretic principles to the non-decision-theoretic principle of maximum likelihood and the somewhat out of favor principle of unbiasedness. We also discuss other desiderata that strongly compete with decision theoretic optimality, in particular computational simplicity and robustness. In Sections 3.2 and 3.3 we show how the important Bayes and minimax criteria can in principle be implemented. However, actual implementation is limited. Our examples are primarily estimation of a real parameter. We return to these themes in Chapter 6, after similarly discussing testing and confidence bounds in Chapter 4 and developing in Chapters 5 and 6 the asymptotic tools needed to say something about the multiparameter case.

3.2 BAYES PROCEDURES

Recall from Section 1.3 that if we specify a parametric model P = {Pθ : θ ∈ Θ}, action space A, and loss function l(θ, a), then for data X ~ Pθ and any decision procedure δ, randomized or not, we can define its risk function R(·, δ) : Θ → R⁺ by

  R(θ, δ) = Eθ l(θ, δ(X)).

We think of R(·, δ) as measuring a priori the performance of δ for this model. Strict comparison of δ1 and δ2 on the basis of the risks alone is not well defined unless R(θ, δ1) ≤ R(θ, δ2) for all θ or vice versa. However, by introducing a Bayes prior density (say) π for θ, comparison becomes unambiguous by considering the scalar Bayes risk,

  r(π, δ) = E R(θ, δ) = E l(θ, δ(X)),   (3.2.1)

where (θ, X) is given the joint distribution specified by (1.2.3). Recall also that we can define

  r(π) = inf{r(π, δ) : δ ∈ D},   (3.2.2)

the Bayes risk of the problem. In particular, if π is a density and Θ ⊂ R,

  r(π, δ) = ∫ R(θ, δ) π(θ) dθ.   (3.2.3)

In this section we shall show systematically how to construct Bayes rules. This exercise is interesting and important even if we do not view π as reflecting an implicitly believed-in prior distribution on θ. If π is thought of as a weight function roughly reflecting our knowledge, it is plausible that δπ, if computable, will behave reasonably even if our knowledge is only roughly right. We may have vague prior notions such as "|θ| ≥ 5 is physically implausible" if, for instance, θ denotes the mean height of people in meters. It is in fact clear that prior and loss function cannot be separated out clearly either. Clearly, considering l(θ, a) and π(θ) is equivalent to considering l1(θ, a) = l(θ, a)π(θ) and π1(θ) = 1; the parametrization plays a crucial role here (Problem 3.2.4), which gives π(θ) = 1 a special role ("equal weight"). For testing problems the hypothesis is often treated as more important than the alternative, and π may express that we care more about the values of the risk in some rather than other regions of Θ. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We don't pursue them further except in Problem 3.2.5, and instead turn to construction of Bayes procedures.

We first consider the problem of estimating q(θ) with quadratic loss, l(θ, a) = (q(θ) − a)², using a nonrandomized decision rule δ. Suppose θ is a random variable (or vector) with (prior) frequency function or density π(θ). This is just the problem of finding the best mean squared prediction error (MSPE) predictor of q(θ) given X (see Remark 1.4.5). Using our results on MSPE prediction, we find that either r(π, δ) = ∞ for all δ or the Bayes rule δ* is given by

  δ*(X) = E[q(θ) | X].   (3.2.5)

This procedure is called the Bayes estimate for squared error loss. In view of formulae (1.2.8) for the posterior density and frequency functions, we can give the Bayes estimate a more explicit form. In the continuous case with θ real valued and prior density π,

  δ*(x) = ∫ q(θ) p(x | θ) π(θ) dθ / ∫ p(x | θ) π(θ) dθ.   (3.2.6)

In the discrete case, we just need to replace the integrals by sums.
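Outside conjugate families the integrals in (3.2.6) rarely have closed form, but they are easy to approximate on a grid. The following is a minimal numerical sketch, not from the text; the function names are illustrative, and a binomial likelihood with a uniform prior serves as a test case.

```python
import numpy as np
from scipy.stats import binom

# Approximate delta*(x) = E[q(theta) | X = x] of (3.2.6) by quadrature on a grid.
def bayes_estimate(x, thetas, prior, likelihood, q=lambda t: t):
    like = np.array([likelihood(x, t) for t in thetas])
    w = like * prior(thetas)          # unnormalized posterior on the grid
    w = w / np.trapz(w, thetas)       # normalize to a posterior density
    return np.trapz(q(thetas) * w, thetas)

# X ~ B(10, theta), uniform prior; the posterior mean is (x + 1)/(n + 2).
thetas = np.linspace(1e-4, 1 - 1e-4, 2000)
est = bayes_estimate(5, thetas, lambda t: np.ones_like(t),
                     lambda x, t: binom.pmf(x, 10, t))
print(est)  # approximately (5 + 1)/(10 + 2) = 0.5
```

Here is an example of the exact calculation in a conjugate family.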
Recall also that we can define R(.2. This exercise is interesting and important even if we do not view 1r as reflecting an implicitly believed in prior distribution on (). (0.2) the Bayes risk of the problem. B denotes mean height of people in meters.) ~ inf{r(". as usual.8) for the posterior density and frequency functions. 0) 2 and 1T2(0) = 1. if 1[' is a density and e c R r(". 0) = (q(O) using a nonrandomized decision rule 6.6): 6 E VJ (3.162 Measures of Performance Chapter 3 where (0.6(X)j2. 6) ~ E(q(O) ..(0)1 . . For testing problems the hypothesis is often treated as more important than the alternative. Clearly.2. we can give the Bayes estimate a more explicit form. Our problem is to find the function 6 of X that minimizes r(".(0) a special role ("equal weight") though (Problem 3.2.(O)dfI p(x I O).4) the parametrization plays a crucial role here. for instance.(O)dO (3.6) = J R(0. it is plausible that 6. (0. and 1r may express that we care more about the values of the risk in some rather than other regions of e.3 we showed how in an example. 1 1  . we just need to replace the integrals by sums.3) In this section we shall show systematically how to construct Bayes rules.. 1(0. considering I. and instead tum to construction of Bayes procedure.6). In view of formulae (1. if computable will c plays behave reasonably even if our knowledge is only roughly right.5.2. such that (3.5) This procedure is called the Bayes estimate for squared error loss.(O)dfI (3.4. 00 for all 6 or the Using our results on MSPE prediction. x _ !oo= q(O)p(x I 0). It is in fact clear that prior and loss function cannot be separated out clearly either.5). Thus.6) In the discrete case. We don't pursue them further except in Problem 3. we could identify the Bayes rules 0.2. 0) and "I (0) is equivalent to considering 1 (0. X) is given the joint distribution specified by (1. Here is an example. This is just the prohlem of finding the best mean squared prediction error (MSPE) predictor of q(O) given X (see Remark 1.4) !. 'f " ..2. j I 1': II . In the continuous case with real valued and prior density 1T.2. Issues such as these and many others are taken up in the fundamental treatises on Bayesian statistics such as Jeffreys (1948) and Savage (1954) and are reviewed in the modern works of Berger (1985) and Bernardo and Smith (1994). We first consider the problem of estimating q(O) with quadratic loss. 6) Bayes rule 6* is given by = a? 6'(X) = E[q(O) I XJ. and that in Section 1. We may have vague prior notions such as "IBI > 5 is physically implausible" if. (3.. e "(). After all. Suppose () is a random variable (or vector) with (prior) frequency function or density 11"(0).. oj = ".3).2.2..r= . If 1r is then thought of as a weight function roughly reflecting our knowledge. we find that either r(1I".
E(e I X))' I X)] E[(J'/(l+~)]=nja 2 + 1/72 ' 1 2 n n/ No finite choice of '1]0 and 7 2 will lead to X as a Bayes estimate. This quantity r(a I x) is what we expect to lose. . Suppose that we want to estimate the mean B of a nonnal distribution with known variance a 2 on the basis of a sample Xl. J') ~ 0 as n + 00.2. for each x.12. X is the estimate that (3.6) yields. we see that the key idea is to consider what we should do given X = x.2.r( 7r. If we choose the conjugate prior N( '1]0. Fonnula (3. Applying the same idea in the general Bayes decision problem. In fact. X is approximately a Bayes estimate for anyone of these prior distributions in the sense that Ir( 7r. we fonn the posterior risk r(a I x) = E(l(e. For more on this. a) I X = x). /2) as in Example 1. a 2 / n.E(e I X))' = E[E«e . tends to a as n _ 00. the Bayes estimate corresponding to the prior density N(1]O' 7 2 ) differs little from X for n large. take that action a = J*(x) that makes r(a I x) as small as possible. To begin with we consider only nonrandomized rules. Because the Bayes risk of X.1. Thus. J') E(e . But X is the limit of such estimates as prior knowledge becomes "vague" (7 .2.2.2 Bayes Procedures 163 Example 3. 0 We now tum to the problem of finding Bayes rules for gelleral action spaces A and loss functions l. Intuitively. 1]0 fixed). In fact. Bayes Estimates for the Mean of a Normal Distribution with a Normal Prior. that is. J' )]/r( 7r.4.w)X of the estimate to be used when there are no observations. and X with weights inversely proportional to the Bayes risks of these two estimates. we obtain the posterior distribution The Bayes estimate is just the mean of the posterior distribution J'(X) ~ 1]0 l/T' ] _ [ n/(J2 ] [n/(J' + l/T' + X n/(J' + l/T' (3.2.a)' I X ~ x) as a function of the action a.00 with.7) reveals the Bayes estimate in the proper case to be a weighted average MID + (1 . '1]0. . Such priors with f 7r(e) ~ 00 or 2: 7r(0) ~ 00 are called impropa The resulting Bayes procedures are also called improper. see Section 5.7) Its Bayes risk (the MSPE of the predictor) is just r(7r.5. This action need not exist nor be unique if it does exist. " Xn.6. . However. If we look at the proof of Theorem 1. if we substitute the prior "density" 7r( 0) = 1 (Prohlem 3.1. we should. if X = :x and we use action a. r) .Section 3.1). E(Y I X) is the best predictor because E(Y I X ~ .) minimizes the conditional MSPE E«(Y .
6* = 05 as we found previously.3. ~ a.Op}. " . we obtain for any 0 r(?f. 11) = 5.) = 0. :i " " Proof.2.2.8) Then 0* is a Bayes rule.4.o'(X)) IX = x].. Then the posterior distribution of o is by (1.2.o(X))] = E[E(I(O. As in the proof of Theorem 1.74.2.8.o) = E[I(O.! " j.2.2. o As a first illustration. . Let = {Oo.o(X)) I X = x] = r(o(x) Therefore. if 0" is the Bayes rule.89. Suppose we observe x = O. I x) > r(o'(x) I x) = E[I(O. !i n j. and aa are r(at 10) r(a.o'(X)) I X].. the posterior risks of the actions aI. ?f(O.o(X)) I X)]. r(a.67 5. More generally consider the following class of situations. E[I(O. and the resnlt follows from (3. Bayes Procedures Whfn and A Are Finite. r(al 11) = 8. .2.5) with priOf ?f(0. . Therefore. (3.164 Measures of Performance Chapter 3 Proposition 3. az has the smallest posterior risk and. a q }.2.2. az.70 and we conclude that 0'(1) = a.1. (3. E[I(O.. and let the loss incurred when (}i is true and action aj is taken be given by j e e . consider the oildrilling example (Example 1. A = {ao. 0'(0) Similarly.9) " "I " • But. r(a. The great advantage of our new approach is that it enables us to compute the Bayes procedure without undertaking the usually impossible calculation of the Bayes risks of all corrpeting procedures.8) Thus.. I 0)  + 91(0" ad r(a.8).·. Therefore.35. Suppose that there exists a/unction 8*(x) such that r(o'(x) [x) = inf{r(a I x) : a E A}.) = 0. 10) 8 10.· . let Wij > 0 be given constants. by (3.o(X)) I X] > E[I(O. Example 3. [I) = 3.1.9).
Section 3.2
Bayes Procedures
165
Let 1r(0) be a prior distribution assigning mass 1ri to Oi, so that 1ri > 0, i = 0, ... ,p, and Ef 0 1ri = 1. Suppose, moreover, that X has density or frequency function p(x I 8) for each O. Then, by (1.2.8), the posterior probabilities are
and, thus,
raj (
x I) =
EiWipTiP(X I Oi) . Ei 1fiP(X I Oi)
(3.2.10)
The optimal action 6* (x) has
r(o'(x) I x)
=
O<rSq
min r(oj
I x).
Here are two interesting specializations. (a) Classification: Suppose that p
= q, we identify aj with OJ, j
1, i f j
=
0, ... ,p, and let
Wii
O.
This can be thought of as the classification problem in which we have p + 1 known disjoint populations and a new individual X comes along who is to be classified in one of these categories. In this case,
r(Oi
Ix) = Pl8 f ei I X = xl
and minimizing r(Oi I x) is equivalent to the reasonable procedure of maximizing the posterior probability,
PI8 = e I X = i
xl
=
1fiP(X lei) Ej1fjp(x I OJ)
(b) Testing: Suppose p = q = 1, 1ro = 1r, 1rl = 1  'Jr, 0 < 1r < 1, ao corresponds to deciding 0 = 00 and al to deciding 0 = 01 • This is a special case of the testing fonnulation of Section 1.3 with 8 0 = {eo} and 8 1 = {ed. The Bayes rule is then to
decide e decide e
= 01 if (1 1f)p(x I 0, ) > 1fp(x I eo) = eo if (1  1f)p(x IeIl < 1fp(x Ieo)
and decide either ao or al if equality occurs. See Sections 1.3 and 4.2 on the option of randomizing .between ao and al if equality occurs. As we let 'Jr vary between zero and one. we obtain what is called the class of NeymanPearson tests, which provides the solution to the problem of minimizing P (type II error) given P (type I error) < Ct. This is treated further in Chapter 4. D
166
Measures of Performance
Chapter 3
To complete our illustration of the utility of Proposition 3.2. L we exhibit in "closed form" the Bayes procedure for an estimation problem when the loss is not quadratic.
Example 3.2.3. Bayes Estimation ofthe Probability ofSuccess in n Bernoulli Trials. Suppose that we wish to estimate () using X I, ... , X n , the indicators of n Bernoulli trials with
probability of success
e, We shall consider the loss function I given by
(0  a)2 1(0, a) = 0(1 _ 0)' 0 < 0 < I, a real.
(3.2.11)
"
" " ,

d
This close relative of quadratic loss gives more weight to parameter values close to zero and one. Thus, for () close to zero, this l((), a) is close to the relative squared error (fJ  a)2 JO, lt makes X have constant risk, a property we shall find important in the next section. The analysis can also be applied to other loss functions. See Problem 3.2.5. By sufficiency we need only consider the number of successes, S. Suppose now that we have a prior distribution. Then, if aU terms on the righthand side are finite,
rea I k)
(3.2.12)
;,
I
, " " I
,

I
Minimizing this parabola in a, we find our Bayes procedure is given by
J'(k) = E(1/(1  0) I S = k) E(I/O(I  0) IS  k)
;
(3.2.13)

i! , ,.
provided the denominator is not zero. For convenience let us now take as prior density the density br " (0) of the bela distribution i3( r, s). In Example 1. 2.1 we showed that this leads to a i3(k +r,n +s  k) posteriordistributionforOif S = k. Ifl < k < nI and n > 2, then all quantities in (3.2.12) and (3.2.13) are finite, and
.
,
";1 
J'(k)
Jo'(I/(1  O))bk+r,n_k+,(O)dO
J~(1/0(1 O))bk+r,n_k+,(O)dO
B(k+r,nk+sI) B(k + r  I, n  k + s  I) k+rl n+s+r2'
(3.2.14)
where we are using the notation B.2.11 of Appendix B. If k = 0, it is easy to see that a = is the only a that makes r(a I k) < 00. Thus, J'(O) = O. Similarly, we get J'(n) = 1. If we assume a un!form prior density, (r = s = 1), we see that the Bayes procedure is the usual estimate, X. This is not the case for quadratic loss (see Problem 3.2.2). 0
°
"Real" computation of Bayes procedures
The closed fonos of (3.2.6) and (3.2.10) make the compulation of (3.2.8) appear straightforward. Unfortunately, this is far from true in general. Suppose, as is typically the case, that 0 ~ (0" . .. , Op) has a hierarchically defined prior density,
"(0,, O ,... ,Op) = 2
"1 (Od"2(02
I( 1 ) ... "p(Op I Op_ d.
(3.2.15)
I
Section 3.2
Bayes Procedures
167
Here is an example. Example 3.2.4. The random effects model we shall study in Volume II has
(3.2.16)
where the €ij are i.i.d. N(O, (J:) and Jl, and the vector ~ = (AI, .. ' ,AI) is independent of {€i; ; 1 < i < I, 1 < j < J} with LI." ... ,LI.[ i.i.d. N(O,cri), 1 < j < J, Jl, '"'" N(Jl,O l (J~). Here the Xi)" can be thought of as measurements on individual i and Ai is an "individual" effect. If we now put a prior distribution on (Jl,,(J~,(J~) making them independent, we have a Bayesian model in the usual form. But it is more fruitful to think of this model as parametrized by () = (J.L, (J~, (J~, AI,. '. ,AI) with the Xii I () independent N(f.l + Ll. i . cr~). Then p(x I 0) = lli,; 'Pu.(Xi;  f.l Ll. i ) and
11'(0)
=
11'}(f.l)11'2(cr~)11'3(cri)
II 'PU" (LI.,)
i=l
I
(3.2.17)
where <Pu denotes the N(O, (J2) density. ]n such a context a loss function frequently will single out some single coordinate Os (e.g., LI.} in 3.2.17) and to compute r(a I x) we will need the posterior distribution of ill I X. But this is obtainable from the posterior distribution of (J given X = x only by integrating out OJ, j t= s, and if p is large this is intractable. ]n recent years socalled Markov Chain Monte Carlo (MCMC) techniques have made this problem more tractable 0 and the use of Bayesian methods has spread. We return to the topic in Volume II.
Linear Bayes estimates
When the problem of computing r( 1r, 6) and 611" is daunting, an alternative is to consider  of procedures for which r( 'R', 6) is easy to compute and then to look for 011" E 'D  a class 'D that minimizes r( 1f 16) for 6 E 'D. An example is linear Bayes estimates where, in the case of squared error loss [q( 0)  a]2, the problem is equivalent to minimizing the mean squared prediction error among functions of the form a + L;=l bjXj _ If in (1.4.14) we identify q(8) with Y and X with Z, the solution is

8(X)
= Eq(8) + IX 
E(X)]T{3
where f3 is as defined in Section 1.4. For example, if in the 11l0del (3.2.16), (3.2.17) we set q(fJ) = Ll 1 , we can find the linear Bayes estimate of Ll 1 hy using 1.4.6 and Problem 1.4.21. We find from (1.4.14) that the best linear Bayes estimator of LI.} is
(3.2.18)
where E(Ll.I) given model
= 0,
X = (XIJ, ... ,XIJ)T, P.
= E(X)
and/l
=
ExlcExA,. For the
168
Var(X I ;)
Measures of Performance
Chapter 3
=E
Var(X,)
I B) +
Var E(XIj
I BJ = E«T~) +
(T~ + (T~,
E Cov(X,j , Xlk I B)
+ Cov(E(X'j I B), E(Xlk I BJ)
0+ Cov(1' + II'!, I' + tl.,)
= (T~ +
(T~,
Cov(tl.
Xlj)
=E
Cov(X,j , tl.,
!: ,,
I',
" From these calculations we find and Norberg (1986).
I B) + Cov(E(XIj I B), E(tl. l I 0» = 0 + (T~, =
(Tt·
f3
and OL(X). We leave the details to Problem 3.2.10.
'I
Linear Bayes procedures are useful in actuarial science, for example, Biihlmann (1970)
if
ii "
Bayes estimation, maximum likelihood, and equivariance
As we have noted earlier. the maximum likelihood estimate can be thought of as the mode of the Bayes posterior density when the prior density is (the usually improper) prior 71"(0) ;,::; c. When modes and means coincide for the improper prior (as in the Gaussian case), the MLE is an improper Bayes estimate. In general, computing means is harder than
modes and that again accounts in part for the popularity of maximum likelihood. An important property of the MLE is equivariance: An estimating method M producing the estimate (j M is said to be equivariant with respect to reparametrization if for every onetoDne function h from e to fl = h (e). the estimate of w '" h(B) is W = h(OM); that is, M
~
I
= h(BM ). In Problem 2.2.16 we show that the MLE procedure is equivariant. If we consider squared error loss. then the Bayes procedure (}B = E(8 I X) is not equivariant
(h(O»
M
~

~
~
for nonlinear transfonnations because
E(h(O) I X)
op h(E(O I X))
1 ,
,
,,
for nonlinear h (e.g., Problem 3.2.3). The source of the lack of equivariance of the Bayes risk and procedure for squared error loss is evident from (3.2.9): In the discrete case the conditional Bayes risk is
i
re(a I x)
= LIB  a]'7f(O I x).
'Ee
(3.2.19)
i
If we set w = h(O) for h onetoone onto fl = heel, then w has prior >.(w) and in the w parametrization, the posterior Bayes risk is
= 7f(h
1
(w))
I
,
r,,(alx)
= L[waj2>,(wlx)
wE"
L[h(B)  a)2 7f (B I x).
(3.2.20)
'Ee
Thus, the Bayes procedure for squared error loss is not equivariant because squared error loss is not equivariant and, thus, r,,(a I x) op Te(h 1(a) Ix).
Section 3.2
Bayes Procedures
169
Loss functions of the form l((}, a) = Q(Po, Pa) are necessarily equivariant. The KullbackLeibler divergence K«(},a), (J,a E e, is an example of such a loss function. It satisfies Ko(w, a) = Ke(O, h 1 (a)), thus, with this loss function,
ro(a I x)
~
re(h1(a) I x).
See Problem 2.2.38. In the discrete case using K means that the importance of a loss is measured in probability units, with a similar interpretation in the continuous case (see (A.7.l 0». In the N(O, case the K L (KullbackLeibler) loss K( a) is ~n(a  0)' (Problem 2.2.37), that is, equivalent to squared error loss. In canonical exponential families
"J)
e,
K(1], a) = L[ryj  ajJE1]Tj
j=1
•
+ A(1]) 
A(a).
Moreover, if we can find the KL loss Bayes estimate 1]BKL of the canonical parameter 7] and if 1] = c( 9) :  t E is onetoone, then the K L loss Bayes estimate of 9 in the
e
general exponential family is 9 BKL = C 1(1]BKd. For instance, in Example 3.2.1 where J.l is the mean of a nonnal distribution and the prior is nonnal, we found the squared error Bayes estimate jiB = wf/o + (lw)X, where 1]0 is the prior mean and w is a weight. Because the K L loss is equivalent to squared error for the canonical parameter p, then if w = h(p), WBKL = h('iiUKL), where 'iiBKL =
~
wTJo + (1  w)X.
Bayes procedures based on the KullbackLeibler divergence loss function are important for their applications to model selection and their connection to "minimum description (message) length" procedures. See Rissanen (1987) and Wallace and Freeman (1987). More recent reviews are Shibata (1997), Dowe, Baxter, Oliver, and Wallace (1998), and Hansen and Yu (2000). We will return to this in Volume 11.
Bayes methods and doing reasonable things
There is a school of Bayesian statisticians (Berger, 1985; DeGroot, 1%9; Lindley, 1965; Savage, 1954) who argue on nonnative grounds that a decision theoretic framework and rational behavior force individuals to use only Bayes procedures appropriate to their personal prior 1r. This is not a view we espouse because we view a model as an imperfect approximation to imperfect knowledge. However, given that we view a model and loss structure as an adequate approximation, it is good to know that generating procedures on the basis of Bayes priors viewed as weighting functions is a reasonable thing to do. This is the conclusion of the discussion at the end of Section 1.3. It may be shown quite generally as we consider all possible priors that the class 'Do of Bayes procedures and their limits is complete in the sense that for any 6 E V there is a 60 E V o such that R(0, 60 ) < R(0, 6) for all O. Summary. We show how Bayes procedures can be obtained for certain problems by COmputing posterior risk. In particular, we present Bayes procedures for the important cases of classification and testing statistical hypotheses. We also show that for more complex problems, the computation of Bayes procedures require sophisticated statistical numerical techniques or approximations obtained by restricting the class of procedures.
170
Measures of Performance
Chapter 3
3.3
MINIMAX PROCEDURES
In Section 1.3 on the decision theoretic framework we introduced minimax procedures as ones corresponding to a worstcase analysis; the true () is one that is as "hard" as possible. That is, 6 1 is better than 82 from a minimax point of view if sUPe R(0,6 1 ) < sUPe R(B, 82 ) and is said to be minimax if
o·
supR(O, 0')
.
= infsupR(O,o).
,
.
,
I·
Here (J and 8 are taken to range over 8 and V = {all possible decision procedures (possibly randomized)} while P = {p. : 0 E e}. It is fruitful to consider proper subclasses of V and subsets of P. but we postpone this discussion. The nature of this criterion and its relation to Bayesian optimality is clarified by considering a socalled zero sum game played by two players N (Nature) and S (the statistician). The statistician has at his or her disposal the set V of all randomized decision procedures whereas Nature has at her disposal all prior distributions 1r on 8. For the basic game, 5 picks 0 without N's knowledge, N picks 1f without 5's knowledge and then all is revealed and S pays N
r(Jr,o)
, ,
,
=
J
R(O,o)dJr(O)
where the notation f R(O, o)dJr(0) stands for f R(0, o)Jr(O)dII in the continuous case and L, R(Oj, o)Jr(O j) in the discrete case. S tries to minimize his or her loss, N to maximize her gain. For simplicity, we assume in the general discussion that follows that all sup's and inf's are assumed. There are two related partial information games that are important.
I: N is told the choice 0 of 5 before picking 1r and 5 knows the rules of the game. Then
Ii ,
N naturally picks 1f,; such that
r(Jr"o)
that is, 1f,; is leastfavorable against such that
~supr(Jr,o),
o. Knowing the rules of the game S naturally picks 0*
•
(3.3.1)
r( Jr,., 0·) = sup r(Jr, 0') = inf sup r(Jr, 0).
We claim that 0* is minimax. To see this we note first that,
.
,
.
(3.3.2)
for allJr, o. On the other hand, if R(0" 0) = suP. R(0, 0), then if Jr, is point mass at 0" r( Jr" 0) = R(O" 0) and we conclude that supr(Jr,o) = sup R(O, 0)
,
•
•
(3.3.3)
I
,
•
•
Section 3.3
Minimax Procedures
171
and our claim follows. II: S is told the choke 7r of N before picking 6 and N knows the rules of the game. Then S naturally picks 6 1r such that
That is, r5 rr is a Bayes procedure for
11".
Then N should pick
7r*
such that (3.3.4)
For obvious reasons, 1f* is called a least favorable (to S) prior distribution. As we shall see by example, altbough the rigbthand sides of (3.3.2) and (3.3.4) are always defined, least favorable priors and/or minimax procedures may not exist and, if they exist, may not be umque. The key link between the search for minimax procedures in the basic game and games I and II is the von Neumann minimax theorem of game theory, which we state in our language.
Theorem 3.3.1. (von Neumann). If both
e and D are finite,
?5
11"
then:
(a)
v=supinfr(1I",6), v=infsupr(1I",6)
rr
?5
are both assumed by (say)
7r*
(least favorable), 6* minimax, respectively. Further,
") =v v=r1f,u (
.
(3.3.5)
v and v are called the lower and upper values of the basic game. When v (saY), v is called the value of the game.
=
v
=
v
Remark 3.3.1. Note (Problem 3.3.3) that von Neumann's theorem applies to classification ~ {eo} and = {eIl (Example 3.2.2) but is too reSlrictive in ilS and testing when assumption for the great majority of inference problems. A generalization due to Wald and Karlinsee Karlin (1 959)states that the conclusions of the theorem remain valid if and D are compact subsets of Euclidean spaces. There are more farreaching generalizations but, as we shall see later, without some form of compactness of and/or D, although equality of v and v holds quite generally, existence of least favorable priors and/or minimax procedures may fail.
eo
e,
e
e
The main practical import of minimax theorems is, in fact, contained in a converse and its extension that we now give. Remarkably these hold without essentially any restrictions on and D and are easy to prove.
e
Proposition 3.3.1. Suppose 6**. 7r** can be found such that
U
£** =
r ()1r.',
1f **
= 1rJ••
(3.3.6)
172
Measures of Performance
Chapter 3
that is, 0** is Bayes against 11"** and 11"** is least favorable against 0**. Then v R( 11"** ,0**). That is, 11"** is least favorable and J*'" is minimax.
To utilize this result we need a characterization of 11"8. This is given by
v
=
Proposition 3.3.2.11"8 is leastfavorable against
°iff
1r,jO: R(O,b) = supR(O',b)) = 1.
.'
(3,3,7)
That is, n a assigns probability only to points () at which the function R(·, 0) is maximal.
Thus, combining Propositions 3.3.1 and 3.3.2 we have a simple criterion, "A Bayes rule with constant risk is minimax." Note that 11"8 may not be unique. In particular, if R(O, 0) = constant. the rule has constant risk, then all 11" are least favorable. We now prove Propositions 3.3.1 and 3.3.2.
Proof of Proposition 3.3.1. Note first that we always have
v<v
because, trivially,
i~fr(1r,b)
(3.3.8)
< r(1r,b')
(3.3,9)
for aIln, 5'. Hence,
v = sup in,fr(1r,b) < supr(1r,b')
•
(3,3.10)
•
for all 0' and v
<
infa, sUP1r 1'(11", (/) =
v. On the other hand. by hypothesis,
sup1'(1r,6**) > v.
v> inf1'(11"*'",6)
,
= 1'(1I"*",0*'") =
.
(3.3.11)
Combining (3.3.8) and (3.3.11) we conclude that
:r
v
as advertised.
= i~f 1'(11"**,0) = 1'(11"**,6**) = s~p1'(1I",0*"')
= V
(3.3.12)
" 'I
,.;
~l
o
Proofof Proposition 3.3.2. 1r is least favorable for b iff E.R(8,6) =
,.,
f r(O,b)d1r(O) =s~pr( ..,6).
•
(3.3.13)
But by (3.3.3),
supr(..,b) = supR(O, 6),
(3.3,14)
•
i ,
Because E.R(8, 6)
= sUP. R(O, b), (3.3,13) is possible iff (3.3.7) holds.
o
Putting the two propositions together we have the following.
•
Section 3.3
Minimax Procedures
173
11"*
Theorem 3.3.2. Suppose 0* has sUPo R((}, 0*) = 1" < 00. If there exists a prior that 0* is Bayes for 11"* and tr" {(} : R( (}, 0") = r} = 1, then 0'" is minimax.
such
Example 3.3.1. Minimax estimation in the Binomial Case. Suppose S has a B(n,B) distribution and X = Sjn,as in Example 3.2.3. Let I(B, a) ~ (Ba)'jB(IB),O < B < 1. For this loss function,
R(B X) = E(X _B)' , B(IB)

=
B(IB) = ~ nB(IB) n'

and X does have constant risk. Moreover, we have seen in Example 3.2.3 that X is Bayes, when 8 is U(Ol 1). By Theorem 3.3.2 we conclude that X is minimax and, by Proposition 3.3.2, the uniform distribution least favorable. For the usual quadratic loss neither of these assertions holds. The minimax estimate is
o's=S+hln = .,fii X+ I 1 () n+.,fii .,fii+l .,fii+1 2
This estimate does have constant risk and is Bayes against a (J( y'ri/2, vn/2) prior (Problem 3.3.4). This is an example of a situation in which the minimax principle leads us to an unsatisfactory estimate. For quadratic loss, the limit as n t 00 of the ratio of the risks of 0* and X is > 1 for every () =f ~. At B = the ratio tends to 1. Details are left to Problem 3.3.4. 0
!
Example 3.3.2. Minimax Testing. Satellite Communications. A test to see whether a communications satellite is in working order is run as follows. A very strong signal is beamed from Earth. The satellite responds by sending a signal of intensity v > 0 for n seconds or, if it is not working, does not answer. Because of the general "noise" level in space the signals received on Earth vary randomly whether the satellite is sending ~r not. The mean voltage per second of the signal for each of the n seconds is recorded. Denote the mean voltage of the signal received through the ith second less expected mean voltage due to noise by Xi. We assume that the Xi are independently and identically distributed as N(p" 0'2) where p, = v, if the satellite functions, and otherwise. The variance 0'2 of the "noise" is assumed known. Our problem is to decide whether "J.l = 0" or"p, = v." We view this as a decision problem with 1 loss. If the number of transmissions is fixed, the minimax rule minimizes the maximum probability of error (see (1.3.6)). What is this risk? A natural first step is to use the characterization of Bayes tests given in the preceding section. If we assign probability 1r to and 1  1r to v, use 0  1 loss, and set L(x, 0, v) = p( x Iv) j p(x I 0), then the Bayes test decides I' = v if
°
°°
L(x,O,v)
and decides p,
=
=
exp
°
" {
2"EXi   , a 2a
nv'} >
17l'
if L(x,O,v) <:17l' 7l'
174
Measures of Performance
Chapter 3
This test is equivalent to deciding f.t
= v (Problem 3.3.1) if. and only if,
"yn
1 ;;;:EXi>t,
T=
, , ,
where,
t
If we call this test d,r.
=
" [log " + 2 nv'] ;;;: vyn a
17r
2
'
R(O,J. )
1 ~ <I>(t)
<I>
= <1>( t)
R(v,o,)
(t  vf)
To get a minimax test we must have R(O, 61r ) = R( V, 6ft), which is equivalent to
v.,fii t = t  "'
h •
or
"
','
I ,~
~.
, ,
•
v.,fii t= . 20
•
Because this value of t corresponds to 7r = the intuitive test, which decides JJ only ifT > ~[Eo(T) + Ev(T)J, is indeed minimax.
!.
= v if and
0
'I
•
If is not bounded, minimax rules are often not Bayes rules but instead can be obtained as limits of Bayes rules. To deal with such situations we need an extension of Theorem
e
3.3.2.
Theorem 3,3,3, Let 0' be a rule such that sup.R(O,o') = r < 00, let {"d denote a sequence of pn'or distributions such that 'lrk;{8 : R(B,o*) = r} = 1, and let Tk = inffJ r( ?rie, J), where r( 7fkl 0) denotes the Bayes risk wrt 'lrk;. If
Tk Task 00,
(3.3.i5)
then J* is minimax. Proof Because r( "k, 0') = r
supR(B, 0') = rk + 0(1)
•
where 0(1)
~
0 as k
~ 00.
But hy (3.3.13) for any competitor 0
supR(O,o) > E.,(R(B,o)) > rk ~supR(O,o') 0(1).
•
•
(3.3.16)
,
'.
If we let k _ suP. R(B, 0').
00
the lefthand side of (3.3.16) is unchanged, whereas the right tends to
0
j
1
•
Section 3.3
M'mimax Procedures
175
Example 3.3.3. Normal Mean. We now show that X is minimax in Example 3.2.1. Identify 1fk with theN(1Jo, 7 2 ) prior where k = 7 2 . Then
whereas the Bayes risk of the Bayes rule of Example 3.2.1 is
i~frk(J) ~ (,,'/n) +7' n
Because
(0
7
2
0
2
=
00,
n  (,,'/n) +7'
a2
1
0
2
n·
2
In)1 « (T2 In) + 7 2 )
>
0 as T 2

we can conclude that
X is minimax.
0
Example 3.3.4. Minimax Estimation in a Nonparametric Setting (after Lehmann). Suppose XI,'" ,Xn arei.i.d, FE:F
Then X is minimax for estimating B(F) EF(Xt} with quadratic loss. This can be viewed as an extension of Example 3.3.3. Let 1fk be a prior distribution on :F constructed as foUows:(J)
(i)
=
".{F: VarF(XIl # M}
= O.
(ii) ".{ F : F
# N(I', M) for some I'} = O.
(iii) F is chosen by first choosing I' = 6(F) from a N(O, k) distribution and then taking F = N(6(F),M).
Evidently, the Bayes risk is now the same as in Example 3.3.3 with 0"2 evidently,
= M.
Because.
) VarF(X, ) max R(F, X = max :F :F n
Theorem 3.3.3 applies and the result follows. Minimax procedures and symmetry
M
n
o
As we have seen, minimax procedures have constant risk or at least constant risk on the "most difficult" 0, There is a deep connection between symmetries of the model and the structure of such procedures developed by Hunt and Stein, Lehmann, and others, which is discussed in detail in Chapter 9 of Lehmann (1986) and Chapter 5 of Lehmann and CaseUa (1998), for instance. We shall discuss this approach somewhat, by example. in Chapters 4 and Volume II but refer to Lehmann (1986) and Lehmann and Casella (1998) for further reading. Summary. We introduce the minimax principle in the contex.t of the theory of games. Using this framework we connect minimaxity and Bayes metbods and develop sufficient conditions for a procedure to be minimax and apply them in several important examples.
(72) = . for instance. A prior for which the Bayes risk of the Bayes procedure equals the lower value of the game is called leas! favorable. Obviously. for example. This approach coupled with the principle of unbiasedness we now introduce leads to the famous GaussMarkov theorem proved in Section 6. In the nonBayesian framework. This notion has intuitive appeal. 5) > R(0. 176 Measures of Performance Chapter 3 .4. Survey Sampling In the previous two sections we have considered two decision theoretic optimality principles. Von Neumann's Theorem states that if e and D are both finite. 5') for aile. there is a least favorable prior 1[" and a minimax rule 8* such that J* is the Bayes rule for n* and 1r* maximizes the Bayes risk of J* over all priors." R(0. if Y is postulated as following a linear regression model with E(Y) = zT (3 as in Section 2.1 . looking for the procedure 0.L and (72 when XI.1. 0'5) model. then the game of S versus N has a value 11. estimates that ignore the data. all 5 E Va. on other grounds.2. We introduced. such as 5(X) = q(Oo). and so on. v equals the Bayes risk of the Bayes rule 8* for the prior 1r*. . the notion of bias of an estimate O(X) of a parameter q(O) in a model P {Po: 0 E e} as = Biaso(5) = Eo5(X)  q(O). it is natural to consider the computationally simple class of linear estimates. and then see if within the Do we can find 8* E Do that is best according to the "gold standard.6. in many cases. E Do that minimizes the Bayes risk with respect to a prior 1r among all J E Do. which can't be beat for 8 = 80 but can obviously be arbitrarily terrible. then in estimating a linear function of the /3J. are minimax. for which it is possible to characterize and. the game is said to have a value v. Do C D. the solution is given in Section 3.3. according to these criteria. The lower (upper) value v(v) of the game is the supremum (infimum) over priors (decision rules) of the infimum (supremum) over decision rules (priors) of the Bayes risk. 3. Bayes and minimaxity. we show how finding minimax procedures can be viewed as solving a game between a statistician S and nature N in which S selects a decision rule 8 and N selects a prior 1['.. An estimate such that Biase (8) 0 is called unbiased. When Do is the class of linear procedures and I is quadratic Joss.I . Moreover.2. D. We show that Bayes rules with constant risk. in Section 1. This result is extended to rules that are limits of Bayes rules with constant risk and we use it to show that x is a minimax rule for squared error loss in the N (0 . we can also take this point of view with humbler aims. computational ease. N (Il.d. .4 UNBIASED ESTIMATION AND RISK INEQUALITIES 3. S(Y) = L:~ 1 diYi . I X n are d.. I I .• •••. or more generally with constant risk over the support of some prior. ruling out. When v = ii. Ii I More specifically. This approach has early on been applied to parametric families Va. compute procedures (in particular estimates) that are best in the class of all procedures. An alternative approach is to specify a proper subclass of procedures. The most famous unbiased estimates are the familiar estimates of f. symmetry.1 Unbiased Estimation... II . .
Example 3. a census unit. We want to estimate the parameter X = Xj. .an}C{XI. We let Xl. is to estimate by a regression estimate ~ XR Xi. (3.Section 3..< 2 ~ ~ (3.4. .6) where b is a prespecified positive constant. Suppose we wish to sample from a finite population.. (3.14) and has = iv L:fl (3. [J = ~ I:~ 1 Ui· XR is also unbiased.2) 1 Because for unbiased estimates mean square error and variance coincide we call an unbiased estimate O*(X) of q(O) that has minimum MSE among all unbiased estimat~s for all 0.3) = 0 otherwise..3 and Problem 1.8) Jt =X .4 Unbiased Estimation and Risk Inequalities 177 given by (see Example 1.. UMVU (uniformly minimum variance unbiased). reflecting the probable correlation between ('ttl. ....' .5) This method of sampling does not use the information contained in 'ttl. .4...XN} 1 (3. Write Xl. If . Ui is the last census income corresponding to = it E~ 1 'Ui.. Unbiased estimates playa particularly important role in survey sampling.2. . X N ) as parameter .4. . .. This leads to the model with x = (Xl..3.4.1. UN' One way to do this.(Xi 1=1 I" N ~2 x) . UN for the known last census incomes. these are both UMVU. (~) If{aj. . .4.4.. UN) and (Xl. . X N for the unknown current family incomes and correspondingly UI. for instance.3. (3.4. We ignore difficulties such as families moving.il) Clearly for each b.···. As we shall see shortly for X and in Volume 2 for . . to determine the average value of a variable (say) monthly family income during a time between two censuses and suppose that we have available a list of families in the unit with family incomes at the last census.Xn denote the incomes of a sample of n families drawn at random without replacement. It is easy to see that the natural estimate X ~ L:~ 1 Xi is unbiased (Problem 3. X N ). ...4.1 ) ~ 1 nl L (Xi ~ ooc n ~ X) 2 . and u =X  b(U . Unbiased Estimates in Survey Sampling.. . ~ N L. . ..4) where 2 CT..
However. ". I . q(e) is biased for q(e) unless q is linear. M (3. . for instance. This makes it more likely for big incomes to be induded and is intuitively desirable. They necessarily in general differ from maximum likelihood estimates except in an important special case we develop later.15). For each unit 1. ~ . outside of sampling.~'! I .. . . X M } of random size M such that E(M) = n (Problem 3.4). (i) Typically unbiased estimates do not existsee Bickel and Lehmann (1969) and Problem 3. . To see this write ~ 1 ' " Xj XHT ~ N LJ 7'" l(Xj E S). I j . The result is a sample S = {Xl. ' .4.7) II ! where Ji is defined by Xi = xJ. .. N toss a coin with probability 7fj of landing heads and select Xj if the coin lands heads.Xi XHT=LJN i=l 7fJ. II it >: .3.'I . Unbiasedness is also used in stratified sampling theory (see Problem 1.. 0 Discussion. j=1 3 N ! I: .j .I.X)!Var(U) (Problem 3. estimate than X and the best choice of b is bopt The value of bopt is unknown but can be estimated by = The resulting estimate is no longer unbiased but behaves well for large samplessee Problem 5. I II.2Gand minimax estimates often are.. Xl!Var(U). X is not unbiased but the following estimate known as the HorvitzThompson estimate is: Ef .. If B is unbiased for e. .. Specifically let 0 < 7fl. 111" N < 1 with 1 1fj = n.. (ii) Bayes estimates are necessarily biasedsee Problem 3. .j' I ': Because 1rj = P[Xj E Sj by construction unbiasedness follows.4.4.1 the correlation of Ui and Xi is positive and b < 2Cov(U. 178 Measures of Performance Chapter 3 i ..i . this will be a beller cov(U. . " 1 ".18.19).. . 1. A natural choice of 1rj is ~n. the unbiasedness principle has largely fallen out of favor for a number of reasons.4. An alternative approach to using the Uj is to not sample all units with the same prob ability. Ifthc 'lrj are not all equal.bey the attractive equivariance property. The HorvitzThompson estimate then stays unbiased.3. :I . .4.. .11. Further discussion of this and other sampling schemes and comparisons of estimates are left to the problems. . It is possible to avoid the undesirable random sample size of these schemes and yet have specified 1rj. (iii) Unbiased_estimates do not <:..
8) is assumed to hold if T(xJ = I for all x. From this point on we will suppose p(x. B)dx.OJdx] = J T(xJ%oP(x.4.p) where i = (Y .O) > O} docs not depend on O.4. See Problem 3.4 Unbiased Estimation and Risk Inequalities 179 Nevertheless.4. which can be used to show that an estimate is UMVU. Assumption II is practically useless as written. The discussion and results for the discrete case are essentially identical and will be referred to in the future by the same numbers as the ones associated with the continuous~case theorems given later. 0 or equivalently Vare(Bn)/MSEe(Bn) t 1 as n t 00. The arguments will be based on asymptotic versions of the important inequalities in the next subsection. (I) The set A ~ {x : p(x.2. %0 [j T(X)P(x.. This preference of 8 2 over the MLE 52 = iT e/n is in accord with optimal behavior when both the number of observations and number of parameters are large. That is. and we can interchange differentiation and integration in p(x. Note that in particular (3. p. B) for II to hold.   . B)dx. e e. We expect that jBiase(Bn)I/Vart (B n ) . as we shall see in Chapters 5 and 6.( lTD < 00 J .4. We make two regularity assumptions on the family {Pe : BEe}. Simpler assumptions can be fonnulated using Lebesgue integration theory. OJ exists and is finite. has some decision theoretic applications.9. For all x E A.2 The Information Inequality The oneparameter case We will develop a lower bound for the variance of a statistic. The lower bound is interesting in its own right.ZDf3). We suppose throughout that we have a regular parametric model and further that is an open subset of the line.8) is finite. unbiased estimates are still in favor when it comes to estimating residual variances. For instance. In particular we shall show that maximum likelihood estimates are approximately unbiased and approximately best among all estimates. For instance. 3. and p is the number of coefficients in f3. good estimates in large samples ~ 1 ~ are approximately unbiased. for integration over Rq.O E 81iJO log p( x. 167.OJdx (3. f3 is the least squares estimate. and appears in the asymptotic optimality theory of Section 5. in the linear regression model Y = ZDf3 + e of Section 2. Then 11 holdS provided that for all T such that E.Section 3.4. suppose I holds. the variance a 2 = Var(ei) is estimated by the unbiased estimate 8 2 = iTe/(n .4. B) is a density. (II) 1fT is any statistic such that E 8 (ITf) < 00 for all 0 E then the operations of integration and differentiation by B can be interchanged in J T( x )p(x. Finally. e. Some classical conditiolls may be found in Apostol (1974).8) whenever the righthand side of (3. What is needed are simple sufficient conditions on p(x.
. Suppose that 1 and II hold and that E &Ologp(X. Suppose Xl.4.6. the Fisher information number. It is not hard to check (using Laplace transform theory) that a oneparameter exponential family quite generally satisfies Assumptions I and II. o Example 3.9) Note that 0 < 1(0) < Lemma 3. I 1(0) 1 = Var (~ !Ogp(X.O)} p(x. Ii ! I h(x) exp{ ~(O)T(x) ..4. 0))' = J(:Ologp(x.. suppose XI. 1 and 11 are satisfied for samples from gamma and beta distributions with one parameter fixed. (3. (~ logp(X.1) ~(O) = 0la' and 1 and 11 are satisfied. thus. I' . For instance.O)]j P(x. t.4.O)] dxand J T(x) [:oP(X.  Proof.0 Xi) 1 = nO =  0' n O· o .. 00.&0 '0 {[:oP(X. I J J & logp(x 0 ) = ~n 1 . 0) ~ 1(0) = E.I [I ·1 .O)).4.10) and. Then Xi . 0))' p(x. 1 X n is a sample from a Poisson PCB) population.B(O)} is an exponential family and TJ(B) has a nonvanishing continuous derivative on 8.1.180 for all Measures of Performance Chapter 3 e. O)dx.O) & < 00. where a 2 is known.n and 1(0) = Var (~n_l . Then (see Table 1. O)dx ~ :0 J p(x. the integrals .4. Similarly. O)dx = O. /fp(x. then I and II hold.i J T(x) [:OP(X.O)] dx " I'  are continuous functions(3) of O. which is denoted by I(B) and given by Proposition 3. O)dx :Op(x. .11 ) .2. 1 X n is a sample from a N(B. (J2) population.1. (3.. If I holds it is possible to define an important characteristic of the family {Po}.4. . Then (3.
'(~)I)'. by Lemma 3.' (e) = Cov (:e log p(X.O)dX= J T(x) UOIOgp(x.12) Proof. then I(e) = nI.(T(X)) < 00 for all O.4.1. e) and T(X). Let T(X) be all)' statistic such that Var.O). 0 E e. Let [. Var (t. Then Var.1.4.Section 3. Here's another important special case.(T(X)) by 1/. Theorem 3. Suppose that I and II hold and 0 < 1(0) < 00.14) Now let ns apply the correlation (CauchySchwarz) inequality (A. If we consider the class of unbiased estimates of q(B) = B.. we obtain a universal lower bound given by the following.16) The number 1/I(B) is often referred to as the information or CramerRoo lower bound for the variance of an unbiased estimate of 'tjJ(B).(0) is differentiable and Var (T(X)) . 1/.1. 0 The lower bound given in the information inequality depends on T(X) through 1/.1 hold.4.'(0) = J T(X):Op(x.O))P(x. 10gp(X.4. (Information Inequality).(0).(0).'(0)]'  I(e) .1 hold and T is an unbiased estimate ofB.4. We get (3. .(T(X)) > [1/.I 1.4.4. (3.(T(X) > I(O) 1 (3.I1. Using I and II we obtain. Denote E.l4) and Lemma 3. Suppose the conditions of Theorem 3.4.2. CoroUary 3. Suppose that X = (XI. (e) and Var. (3.1.4. (3.4. 1/. ej.4 Unbiased Estimation and Risk Inequalities 181 Here is the main result of this section. and that the conditions of Theorem 3.15) The theorem follows because. nI.13) By (A. > [1/.. Proposition 3.4. .16) to the random variables 8/8010gp(X. Then for all 0.4.Xu) is a sample from a population with density f(x. 0 (3.(0) = E(feiogf(X"e»)'. 1/.17) .O)dX. e)) = I(e).4. T(X)) .
Proposition 3.4.2. Suppose that X = (X₁, ..., X_n) is a sample from a population with density f(x, θ), θ ∈ Θ, and that the conditions of Theorem 3.4.1 hold. Let

    I₁(θ) = E((∂/∂θ) log f(X₁, θ))².    (3.4.17)

Then I(θ) = nI₁(θ).

Proof. Because the Xᵢ are independent and identically distributed,

    (∂/∂θ) log p(X, θ) = Σᵢⁿ (∂/∂θ) log f(Xᵢ, θ)

and, by Lemma 3.4.1,

    I(θ) = Var[(∂/∂θ) log p(X, θ)] = Σᵢⁿ Var[(∂/∂θ) log f(Xᵢ, θ)] = nI₁(θ).    □

I₁(θ) is often referred to as the information contained in one observation.

Example 3.4.2. Suppose X₁, ..., X_n is a sample from a normal distribution with unknown mean θ and known variance σ². The conditions of the information inequality are satisfied, and the MLE is θ̂ = X̄. Because X̄ is unbiased, by Corollary 3.4.1 the conclusion that X̄ is UMVU follows if

    Var(X̄) = 1/(nI₁(θ)).    (3.4.18)

Now Var(X̄) = σ²/n, whereas if φ denotes the N(0, 1) density,

    I₁(θ) = E((∂/∂θ) log[σ⁻¹φ((X₁ − θ)/σ)])² = E((X₁ − θ)/σ²)² = 1/σ²,

and (3.4.18) follows. Note that because X̄ is UMVU whatever may be σ², we have in fact proved that X̄ is UMVU even if σ² is unknown. □

We can similarly show (Problem 3.4.1) that if X₁, ..., X_n are the indicators of n Bernoulli trials with probability of success θ, then X̄ is a UMVU estimate of θ.

Example 3.4.3. For a sample from a P(θ) distribution, the MLE is θ̂ = X̄. Because X̄ is unbiased and, by Example 3.4.1, Var(X̄) = θ/n = 1/I(θ), X̄ is UMVU here as well. □

These are situations in which X̄ follows a one-parameter exponential family. This is no accident. Next we note how we can apply the information inequality to the problem of unbiased estimation.

Theorem 3.4.2. Suppose that the family {P_θ : θ ∈ Θ} satisfies assumptions I and II and there exists an unbiased estimate T* of ψ(θ) that achieves the lower bound of Theorem 3.4.1 for every θ. Then {P_θ} is a one-parameter exponential family with density or frequency function of the form

    p(x, θ) = h(x) exp[η(θ)T*(x) − B(θ)].    (3.4.19)

Conversely, if {P_θ} is a one-parameter exponential family of the form (1.6.1) with natural sufficient statistic T(X) and η(θ) has a continuous nonvanishing derivative on Θ, then T(X) achieves the information inequality bound and is a UMVU estimate of E_θ(T(X)).
Proof. We start with the first assertion. By (3.4.14) and the conditions for equality in the correlation inequality, we know that T* achieves the lower bound for all θ if, and only if, there exist functions a₁(θ) and a₂(θ) such that, with P_θ probability 1 for each θ,

    (∂/∂θ) log p(X, θ) = a₁(θ)T*(X) + a₂(θ).    (3.4.20)

Upon integrating both sides of (3.4.20) with respect to θ we get

    log p(x, θ) = η(θ)T*(x) − B(θ) + S(x),    (3.4.21)

where η′ = a₁ and B′ = −a₂, which is (3.4.19) with h = exp S. But we must first show that (3.4.20), which holds for each θ on a set of probability 1, holds for all θ simultaneously on a single set A* = {x : (∂/∂θ) log p(x, θ) = a₁(θ)T*(x) + a₂(θ) for all θ ∈ Θ} with P_θ(A*) = 1 for all θ. Here is the argument. Let θ₁, θ₂, ... be a denumerable dense subset of Θ. If A_θ denotes the set of x for which (3.4.20) holds, then (3.4.20) guarantees P_θ(A_θ) = 1 and assumption I guarantees P_θ′(A_θ) = 1 for all θ′. If A** = ∩_m A_{θ_m}, then P_θ′(A**) = 1 for all θ′. Suppose without loss of generality that T*(x₁) ≠ T*(x₂) for some x₁, x₂ ∈ A**. By solving

    (∂/∂θ) log p(x_j, θ_m) = a₁(θ_m)T*(x_j) + a₂(θ_m),  j = 1, 2,    (3.4.22)

for a₁(θ_m) and a₂(θ_m), and using the continuity in θ of (∂/∂θ) log p(x_j, θ), we can extend a₁ and a₂ to functions continuous in θ. But now if x is such that

    (∂/∂θ) log p(x, θ_m) = a₁(θ_m)T*(x) + a₂(θ_m)    (3.4.23)

for all m, then, because both sides of (3.4.23) are continuous in θ, (3.4.23) must hold for all θ. Thus, A** ⊂ A* and the first assertion follows. The passage from (3.4.20) to (3.4.19) is highly technical; our argument is essentially that of Wijsman (1973).

Conversely, in the exponential family case (1.6.1) we assume without loss of generality (Problem 3.4.3) that we have the canonical case with η(θ) = θ and

    B(θ) = A(θ) = log ∫ h(x) exp{θT(x)}dx.

Then

    (∂/∂θ) log p(X, θ) = T(X) − A′(θ)    (3.4.24)

so that ψ(θ) = A′(θ) and, hence,

    I(θ) = Var_θ(T(X) − A′(θ)) = Var_θ(T(X)) = A″(θ).    (3.4.25)

Thus, the information bound is [A″(θ)]²/A″(θ) = A″(θ) = Var_θ(T(X)), so that T(X) achieves the information bound as an estimate of E_θ T(X). □
Example 3.4.4. In the Hardy-Weinberg model of Examples 2.1.4 and 2.2.6,

    p(x, θ) = 2^{n₂} exp{(2n₁ + n₂) log θ + (2n₃ + n₂) log(1 − θ)}
            = 2^{n₂} exp{(2n₁ + n₂)[log θ − log(1 − θ)] + 2n log(1 − θ)},

where we have used the identity (2n₁ + n₂) + (2n₃ + n₂) = 2n. Because this is an exponential family, assumptions I and II are satisfied and UMVU estimates of ψ(θ) = E_θ T(X) exist. By Theorem 3.4.2, θ̂ = (2N₁ + N₂)/2n is UMVU for estimating

    E(θ̂) = (2n)⁻¹[2nθ² + 2nθ(1 − θ)] = θ.

This θ̂ coincides with the MLE of Example 2.2.6. The variance of θ̂ can be computed directly using the moments of the multinomial distribution of (N₁, N₂, N₃), or by transforming p(x, θ) to canonical form by setting η = log[θ/(1 − θ)] and then using Theorem 1.6.2. A third method would be to use Var(θ̂) = 1/I(θ) and formula (3.4.26) below. We find (Problem 3.4.7)

    Var(θ̂) = θ(1 − θ)/2n.    □
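The conclusion of Example 3.4.4 can be checked by simulation. The sketch below (our own code; the values of θ and n are arbitrary) verifies that θ̂ = (2N₁ + N₂)/2n is unbiased for θ with variance equal to the information bound θ(1 − θ)/2n.

```python
# Simulation check of Example 3.4.4 (Hardy-Weinberg).
import numpy as np

rng = np.random.default_rng(7)
theta, n, B = 0.4, 50, 200_000
probs = [theta**2, 2 * theta * (1 - theta), (1 - theta) ** 2]
N = rng.multinomial(n, probs, size=B)           # columns are N1, N2, N3

T = (2 * N[:, 0] + N[:, 1]) / (2 * n)
print(T.mean(), theta)                          # unbiasedness
print(T.var(), theta * (1 - theta) / (2 * n))   # variance attains the bound
```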
Note that by differentiating (3.4.24) we obtain (∂²/∂θ²) log p(X, θ) = −A″(θ); hence, by (3.4.25), I(θ) = −E_θ((∂²/∂θ²) log p(X, θ)). It turns out that this identity also holds outside exponential families.

Proposition 3.4.3. Suppose p(·, θ) satisfies, in addition to I and II, that p(·, θ) is twice differentiable and interchange between integration and differentiation is permitted. Then

    I(θ) = E_θ((∂/∂θ) log p(X, θ))² = −E_θ((∂²/∂θ²) log p(X, θ)).    (3.4.26)

Proof. We need only check that

    (∂²/∂θ²) log p(x, θ) = (1/p(x, θ))(∂²/∂θ²)p(x, θ) − ((∂/∂θ) log p(x, θ))²    (3.4.27)

and integrate both sides with respect to p(x, θ); the first term on the right integrates to zero. □

Example 3.4.1. (Continued). For a sample from a P(θ) distribution,

    E_θ(−(∂²/∂θ²) log p(X, θ)) = θ⁻² E(Σᵢ Xᵢ) = n/θ,

which equals I(θ). □

Discussion. It often happens, for instance, in the U(0, θ) example, that I and II fail to hold. Even worse, in many situations, although UMVU estimates exist, the conditions of the information inequality are not satisfied, and the variance of the best estimate is not equal to the bound [ψ′(θ)]²/I(θ). Sharpenings of the information inequality are available but don't help in general. See Volume II.

The multiparameter case

We will extend the information lower bound to the case of several parameters, θ = (θ₁, ..., θ_d). In particular, we will find a lower bound on the variance of an estimator of θ₁ when the parameters θ₂, ..., θ_d are unknown. The arguments follow the d = 1 case and are left to the problems.

Let p(x, θ) denote the density or frequency function of X, where X ∈ 𝒳 ⊂ R^q. We assume that Θ is an open subset of R^d and that {p(x, θ) : θ ∈ Θ} is a regular parametric model with conditions I and II satisfied when differentiation is with respect to θ_j, 1 ≤ j ≤ d. The (Fisher) information matrix is defined as

    I(θ) = ‖I_{jk}(θ)‖_{d×d}    (3.4.28)

where

    I_{jk}(θ) = E_θ((∂/∂θ_j) log p(X, θ) · (∂/∂θ_k) log p(X, θ)), 1 ≤ j ≤ d, 1 ≤ k ≤ d.    (3.4.29)

Proposition 3.4.4. Under the conditions in the opening paragraph,

(a)    E_θ((∂/∂θ_j) log p(X, θ)) = 0    (3.4.30)

and

    I_{jk}(θ) = Cov_θ((∂/∂θ_j) log p(X, θ), (∂/∂θ_k) log p(X, θ)).    (3.4.31)

(b) If X₁, ..., X_n are i.i.d. as X, then X = (X₁, ..., X_n)ᵀ has information matrix nI₁(θ), where I₁ is the information matrix of X₁.

(c) If, in addition, p(·, θ) is twice differentiable and double integration and differentiation under the integral sign can be interchanged,

    I_{jk}(θ) = −E_θ[(∂²/∂θ_j∂θ_k) log p(X, θ)].    (3.4.32)

Example 3.4.5. Suppose X ~ N(μ, σ²), θ = (μ, σ²). Then

    log p(x, θ) = −(1/2) log(2π) − (1/2) log σ² − (x − μ)²/2σ².

Using (3.4.32),

    I₁₁(θ) = −E[(∂²/∂μ²) log p(x, θ)] = E[σ⁻²] = σ⁻²
    I₁₂(θ) = −E[(∂²/∂σ²∂μ) log p(x, θ)] = E[(x − μ)/σ⁴] = 0, because E(x − μ) = 0
    I₂₂(θ) = −E[(∂²/(∂σ²)²) log p(x, θ)] = 1/(2σ⁴).

Thus,

    I(θ) = ( σ⁻²   0
             0     1/(2σ⁴) ).    (3.4.33)

□
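Example 3.4.5 can be verified by simulation via (3.4.31): estimate I(θ) as the covariance matrix of the score vector. The sketch below is our own code, with arbitrary parameter values.

```python
# Monte Carlo version of Example 3.4.5: I(theta) for N(mu, sigma^2).
import numpy as np

rng = np.random.default_rng(3)
mu, sig2, B = 1.0, 4.0, 500_000
x = rng.normal(mu, np.sqrt(sig2), size=B)

# Score components: d/dmu = (x - mu)/sig2,
#                   d/dsig2 = -1/(2 sig2) + (x - mu)^2 / (2 sig2^2).
s = np.column_stack([(x - mu) / sig2,
                     -0.5 / sig2 + (x - mu) ** 2 / (2 * sig2**2)])
print(np.cov(s.T))                   # approx [[1/sig2, 0], [0, 1/(2 sig2^2)]]
print(1 / sig2, 1 / (2 * sig2**2))   # 0.25 and 0.03125
```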
Example 3.4.6. Canonical k-Parameter Exponential Family. Suppose

    p(x, θ) = exp{Σ_{j=1}^{k} T_j(x)θ_j − A(θ)} h(x),    (3.4.34)

θ ∈ Θ open. The conditions I, II are easily checked, and because ∇_θ log p(x, θ) = T(x) − Ȧ(θ), in this case

    I(θ) = Var_θ T(X) = Ä(θ)    (3.4.35)

by (3.4.31) and Corollary 1.6.1. □

Next suppose θ̂₁ = T is an estimate of θ₁ with θ₂, ..., θ_d assumed unknown. Let ψ(θ) = E_θ T(X) and let ψ̇(θ) be the d × 1 vector of partial derivatives of ψ. Then

Theorem 3.4.3. Assume the conditions of the opening paragraph hold and suppose that the matrix I(θ) is nonsingular. Let T(X) be any statistic with Var_θ(T(X)) < ∞ for all θ. Then for all θ, ψ̇(θ) exists and

    Var_θ(T(X)) ≥ ψ̇ᵀ(θ) I⁻¹(θ) ψ̇(θ).    (3.4.36)

Proof. We will use the prediction inequality Var(Y) ≥ Var(μ_L(Z)), where μ_L(Z) denotes the optimal MSPE linear predictor of Y; that is,

    μ_L(Z) = μ_Y + (Z − μ_Z)ᵀ Σ_{ZZ}⁻¹ Σ_{ZY}.    (3.4.37)

Now set Y = T(X) and Z = ∇_θ log p(X, θ), so that Σ_{ZZ} = I(θ). Then

    Var_θ(T(X)) ≥ Σᵀ_{ZY} I⁻¹(θ) Σ_{ZY}    (3.4.38)

where Σ_{ZY} = E_θ(T(X)∇_θ log p(X, θ)) = ∇_θ E_θ(T(X)) = ψ̇(θ), and the last equality follows from the argument in (3.4.13). □

Here are some consequences of this result.
3. 0 = (0 1 .. . . = nE(T.. we let j = 1.0. .A(O)} whereTT(x) = (T!(x).>'j)/n. In the multinomial Example 1. j=1 X = (XI. hence. Example 3. But because Nj/n is unbiased and has 0 variance >'j(1 . .4. Multinomial Trials. AI..i.4.0).. by Theorem 3. are known.) = Var(Tj(X».Xn)T. . we transformed the multinomial model M(n. . n T.41) because .4.p(OJrI(O).(X)) 82 80... " Ok_il T. (1 + E7 / eel) = n>'j(1.. . This is a different claim than TJ(X) is UMVU for EOTj(X) if Oi.7. k. To see our claim note that in our case (3.d.>'. (3..p(0) is the first row of 1(0) and.40) J We claim that in this case .Tk_I(X))...T(O) = ~:~ I (3.O) = exp{TT(x)O . .. . We have already computed in Proposition 3. .4.A.Section 3. the lower bound on the variance of an unbiased estimator of 1/Jj(O) = E(n1Tj(X) = >. X n i. a'i X and Aj = P(X = j). without loss of generality.j. then Nj/n is UMVU for >'j.)/1£.(X) = I)IXi = j].A(O) is just VarOT! (X).39) where. is >..4.p(0)11(0) ~ (1. j = 1.6.A(0) j ne ej = (1 + E7 eel _ eOj ) . .. But f. j. . Ak) to the canonical form p(x.. ...4 Unbiased Estimation and Risk Inequalities 187 estimate of EOTj(X). kI A(O) ~ 1£ log Note that 1+ Le j=l Oj 80 A (0) J 8 = 1 "k 11 ne~ I 0 +LJl=le l = 1£>'. i =j:.4 8'A rl(O)= ( 8080 t )1 kx k .6 with Xl. .(1. . . Thus.4..
Example 3.4.8. The Normal Case. If X₁, ..., X_n are i.i.d. N(μ, σ²), then X̄ is UMVU for μ and n⁻¹ Σᵢ Xᵢ² is UMVU for its expectation μ² + σ². But it does not follow that n⁻¹ Σᵢ(Xᵢ − X̄)² is UMVU for σ². These and other examples and the implications of Theorem 3.4.3 are explored in the problems. □

Here is an important extension of Theorem 3.4.3 whose proof is left to Problem 3.4.21.

Theorem 3.4.4. Suppose that the conditions of Theorem 3.4.3 hold and T(X) is a d-dimensional statistic. Let ψ(θ) = E_θ(T(X))_{d×1} and ψ̇(θ) = ((∂ψ_i/∂θ_j)(θ))_{d×d}. Then

    Var_θ(T(X)) ≥ ψ̇(θ) I⁻¹(θ) ψ̇ᵀ(θ)    (3.4.42)

where A ≥ B means aᵀ(A − B)a ≥ 0 for all a_{d×1}. Note that both sides of (3.4.42) are d × d matrices. Also note that

    θ̂ unbiased ⟹ Var_θ(θ̂) ≥ I⁻¹(θ).

In Chapters 5 and 6 we show that in smoothly parametrized models, reasonable estimates are asymptotically unbiased. We establish analogues of the information inequality and use them to show that under suitable conditions the MLE is asymptotically optimal.

Summary. We study the important application of the unbiasedness principle in survey sampling. We derive the information inequality in one-parameter models and show how it can be used to establish that in a canonical exponential family, T(X) is the UMVU estimate of its expectation. Using inequalities from prediction theory, we show how the information inequality can be extended to the multiparameter case. Asymptotic analogues of these inequalities are sharp and lead to the notion and construction of efficient estimates.

3.5 NONDECISION THEORETIC CRITERIA

In practice, features other than the risk function are also of importance in selection of a procedure, even if the loss function and model are well specified. The three principal issues we discuss are the speed and numerical stability of the method of computation used to obtain the procedure, interpretability of the procedure, and robustness to model departures.

3.5.1 Computation

Speed of computation and numerical stability issues have been discussed briefly in Section 2.4. They are dealt with extensively in books on numerical analysis such as Dahlquist, Björk, and Anderson (1974). We discuss some of the issues and the subtleties that arise in the context of some of our examples in estimation theory.
Closed form versus iteratively computed estimates

At one level closed form is clearly preferable. For instance, consider the Gaussian linear model of Section 2.2. Then least squares estimates are given in closed form. On the other hand, a method of moments estimate of (λ, p) in the gamma Example 2.1.2 is given by λ̂ = X̄/σ̂², p̂ = X̄²/σ̂², where σ̂² is the empirical variance. It is clearly easier to compute than the MLE. The closed form of the least squares estimate is deceptive, however, because inversion of a d × d matrix takes on the order of d³ operations when done in the usual way and can be numerically unstable. It is in fact faster and better to solve the normal equations by, say, Gaussian elimination for the particular Z_Dᵀ Y.

Faster versus slower algorithms

Consider estimation of the MLE θ̂ in a general canonical exponential family as in Section 2.3, where θ̂ solves Ȧ(θ) = T(X). If, in the coordinate ascent algorithm we discuss in Section 2.4.2, we seek to take enough steps J so that |θ̂^{(J)} − θ̂| ≤ ε, then J is of the order of log(1/ε) (Problem 3.5.10). On the other hand, the Newton-Raphson method, in which the jth iterate is

    θ̂^{(j)} = θ̂^{(j−1)} + Ä⁻¹(θ̂^{(j−1)})(T(X) − Ȧ(θ̂^{(j−1)})),

takes on the order of log log(1/ε) steps (Problem 3.5.11), at least if started close enough to θ̂. The improvement in speed may however be spurious since Ä⁻¹ is costly to compute if d is large, though the same trick as in computing least squares estimates can be used.
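A minimal sketch of the Newton-Raphson iteration just described, for the Poisson model written canonically (θ = log λ, A(θ) = n e^θ, so Ȧ = Ä = n e^θ); the code and data values are our own, not the text's. The MLE solves Ȧ(θ) = T(X), and the iteration converges in a handful of steps.

```python
# Newton-Raphson for the MLE in a one-parameter canonical exponential family.
import math

def newton_mle(T, n, theta0=0.0, tol=1e-12, max_iter=50):
    theta = theta0
    for j in range(max_iter):
        step = (T - n * math.exp(theta)) / (n * math.exp(theta))  # Adot = Addot = n e^theta
        theta += step
        if abs(step) < tol:
            return theta, j + 1
    return theta, max_iter

T, n = 57, 20                              # hypothetical data with sum T = 57
theta_hat, steps = newton_mle(T, n)
print(math.exp(theta_hat), T / n, steps)   # the MLE of lambda is the sample mean
```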
The interplay between estimated variance and computation

As we have seen in special cases in Examples 3.4.3 and 3.4.4, estimates of parameters based on samples of size n have standard deviations of order n^{−1/2}. It follows that striving for numerical accuracy of order smaller than n^{−1/2} is wasteful. Unfortunately it is hard to translate statements about orders into specific prescriptions without assuming at least bounds on the constants involved. Of course, with ever faster computers a difference at this level is irrelevant. But it reappears when the data sets are big and the number of parameters large.

3.5.2 Interpretability

Suppose that in the normal N(μ, σ²) Example 2.1.5 we are interested in the parameter μ/σ. This parameter, the signal-to-noise ratio, has a clear interpretation for this population of measurements. Its maximum likelihood estimate X̄/σ̂ continues to have the same intuitive interpretation as an estimate of μ/σ even if the data are a sample from a distribution with mean μ and variance σ² other than the normal. On the other hand, suppose we initially postulate a model in which the data are a sample from a gamma, Γ(p, λ), distribution. Then E(X)/[Var(X)]^{1/2} = (p/λ)/(p/λ²)^{1/2} = p^{1/2}. We can now use the MLE p̂^{1/2}, which as we shall see later (Section 5.4) is for n large a more precise estimate than X̄/σ̂ if this model is correct. However, the form of this estimate is complex, and if the model is incorrect it no longer is an appropriate estimate of E(X)/[Var(X)]^{1/2}. Similarly, the Hardy-Weinberg parameter θ has a clear biological interpretation and is the parameter for the experiment described in Example 2.1.4. See Problem 3.5.13.

3.5.3 Robustness

Finally, we turn to robustness. The idea of robustness is that we want estimation (or testing) procedures to perform reasonably even when the model assumptions under which they were designed to perform excellently are not exactly satisfied. This is an issue easy to point to in practice but remarkably difficult to formalize appropriately. In general, what reasonable means is connected to the choice of the parameter we are estimating (or testing hypotheses about). This idea has been developed by Bickel and Lehmann (1975a, 1975b, 1976) and Doksum (1975), among others. We consider three situations:

(a) The problem dictates the parameter. For instance, economists often work with median housing prices, that is, the parameter ν that has half of the population prices on either side (formally, ν is any value such that P(X ≤ ν) ≥ 1/2, P(X ≥ ν) ≥ 1/2). Alternatively, they may be interested in total consumption of a commodity such as coffee, say θ = Nμ, where N is the population size and μ is the expected consumption of a randomly drawn individual.

(b) We imagine that the random variable X* produced by the random experiment we are interested in has a distribution that follows a "true" parametric model with an interpretable parameter θ, but we do not necessarily observe X*. The actual observation X is X* contaminated with "gross errors"; see the following discussion. But θ is still the target in which we are interested.

(c) We have a qualitative idea of what the parameter is, but there are several parameters that satisfy this qualitative notion. For instance, we may be interested in the center of a population, and both the mean μ and median ν qualify.

We will consider situations (b) and (c).

Gross error models

Most measurement and recording processes are subject to gross errors, anomalous values that arise because of human error (often in recording) or instrument malfunction. To be a bit formal, suppose that if n measurements X* = (X₁*, ..., X_n*) could be taken without gross errors, then a parametric model P would be an adequate approximation to the distribution of X* (i.e., we could suppose X* ~ P* ∈ P). However, if gross errors occur, we observe not X* but X = (X₁, ..., X_n), where most of the Xᵢ = Xᵢ*, but there are a few wild values.
. A reasonable formulation of a model in which the possibility of gross errors is acknowledged is to make the Ci still i. (3.5.Xn are i.d. Note that this implies the possibly unreasonable assumption that committing a gross error is independent of the value of X· . ISi'1 or 1'2 our goal? On the other hand. X~) is a good estimate. Again informally we shall call such procedures robust. specification of the gross error mechanism. in situation (c). = {f (..J) for all such P. we do not need the symmetry assumption.1'). This corresponds to.) are ij.j) E P. Xi Xt with probability 1 .l. Most analyses require asymptotic theory and will have to be postponed to Chapters 5 and 6.5. and then more generally. The breakdown point will be discussed in Volume II. we encounter one of the basic difficulties in formulating robustness in situation (b).l. the sensitivity curve and the breakdown point. We next define and examine the sensitivity curve in the context of the Gaussian location model.h(x). Consider the onesample symmetric location model P defined by ~ ~ i=l""l n .. . . where Y.j = PC""f.l. X is the best estimate in a variety of senses.. but with common distribution function F and density f of the form f(x) ~ (1 . We return to these issues in Chapter 6. However. for example. .2. . if we drop the symmetry assumption.Section 3. (3. X n ) will continue to be a good or at least reasonable estimate if its value is not greatly affected by the Xi I Xt.2) Here h is the density of the gross errors and . Further assumptions that are commonly made are that h has a particular fonn. The advantage of this formulation is that jJ.18). Then the gross error model issemiparametric..remains identifiable.d.)~ 'P C) + .\ is the probability of making a gross error. where f satisfies (3.) for 1'1 of 1'2 (Problem 3. . Example 1.5 Nondecision Theoretic Criteria ~ 191 wild values.d. Unfortunately.2). identically distributed.A Y.5.2) for some h such that h(x) = h(x) for all x}. X n ) knowing that B(Xi. Without h symmetric the quantity jJ. h = ~(7 <p (:(7) where K ::» 1 or more generally that h is an unknown density symmetric about O. P. with common density f(x . so it is unclear what we are estimating.1 ) where the errors are independent. Now suppose we want to estimate B(P*) and use B(X 1 .1') : f satisfies (3. However. the assumption that h is itself symmetric about 0 seems patently untenable for gross errors. the gross errors. it is the center of symmetry of p(Po.f. In our new formulation it is the xt that obey (3. That is. two notions.i. Formal definitions require model specification. with probability ...1') and (Xt.1).i. .1.. Y. If the error distribution is normal. . and symmetric about 0 with common density f and d.5. F. iff X" . That is. and definitions of insensitivity to gross errors. p(". has density h(y . it is possible to have PC""f.5. Informally B(X1 . make sense for fixed n.5.is not a parameter. •.
The sensitivity curve

Suppose that X ~ P and that θ = θ(P) is a parameter. The empirical plug-in estimate of θ is θ̂ = θ(P̂), where P̂ is the empirical probability distribution. How sensitive is θ̂ to the presence of gross errors among X₁, ..., X_n? An interesting way of studying this, due to Tukey (1972) and Hampel (1974), is the sensitivity curve, defined as follows for plug-in estimates (which are well defined for all sample sizes n).

Think of x₁, ..., x_{n−1} as an "ideal" sample of size n − 1 for which the estimator θ̂ gives us the right value of the parameter, and then we see what the introduction of a potentially deviant nth observation x does to the value of θ̂. That is, x₁, ..., x_{n−1} represents an observed sample of size n − 1 from P, and x represents an observation that (potentially) comes from a distribution different from P. The sensitivity curve of θ̂ is defined as

    SC(x; θ̂) = n[θ̂(x₁, ..., x_{n−1}, x) − θ̂(x₁, ..., x_{n−1})].

We are interested in the shape of the sensitivity curve, not its location. Often this is done by fixing x₁, ..., x_{n−1} so that θ̂(x₁, ..., x_{n−1}) = θ(P). In our examples we shall, therefore, shift the sensitivity curve in the horizontal or vertical direction whenever this produces more transparent formulas.

We return to the location problem with θ equal to the mean μ = E(X). Because the estimators we consider are location equivariant, we take μ = 0 without loss of generality. Now fix x₁, ..., x_{n−1} so that their mean has the ideal value zero. This is equivalent to shifting the SC vertically to make its value at x = 0 equal to zero. Because E(X̄) = μ for all P(μ, f) ∈ P, X̄ has the plug-in property, and

    SC(x; X̄) = n · (x₁ + ... + x_{n−1} + x)/n = x.

Thus, the sample mean is arbitrarily sensitive to gross error: a large gross error can throw the mean off entirely. Are there estimates that are less sensitive?

A classical estimate of location based on the order statistics is the sample median X̃ defined by

    X̃ = X_{(k+1)}                      if n = 2k + 1
       = (1/2)(X_{(k)} + X_{(k+1)})    if n = 2k,

where X_{(1)}, ..., X_{(n)} are the order statistics, that is, X₁, ..., X_n ordered from smallest to largest. The sample median can be motivated as an estimate of location on various grounds.

(i) It is the empirical plug-in estimate of the population median ν (Problem 3.5.16).

(ii) In the symmetric location model (3.5.1), ν coincides with μ, and X̃ is an empirical plug-in estimate of μ; it is, therefore, appropriate for the symmetric location model.

(iii) The sample median is the MLE when we assume the common density f(x) of the errors {εᵢ} in (3.5.1) is the Laplace (double exponential) density

    f(x) = (1/2τ) exp{−|x|/τ},

a density having substantially heavier tails than the normal. See Problems 2.2.32 and 3.5.9.

The sensitivity curve of the median is as follows. If, say, n = 2k + 1 is odd and the median of x₁, ..., x_{n−1} is (x_{(k)} + x_{(k+1)})/2 = 0, where x_{(1)} < ... < x_{(n−1)} are the ordered x₁, ..., x_{n−1}, we obtain

    SC(x; X̃) = n x_{(k)}     for x ≤ x_{(k)}
              = n x           for x_{(k)} < x < x_{(k+1)}
              = n x_{(k+1)}   for x ≥ x_{(k+1)}.
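The sensitivity curve is easy to compute numerically. The following sketch (our own code, not the text's) traces SC(x; θ̂) for the mean and the median over a grid of contaminating values x; with the ideal sample centered, the mean's curve is the line y = x while the median's is bounded.

```python
import numpy as np

def sensitivity_curve(estimate, ideal, xs):
    """n * (estimate with x appended - estimate on the ideal sample of size n-1)."""
    n = len(ideal) + 1
    base = estimate(ideal)
    return np.array([n * (estimate(np.append(ideal, x)) - base) for x in xs])

rng = np.random.default_rng(4)
ideal = rng.normal(size=19)
ideal -= ideal.mean()               # ideal sample of size n-1 with mean 0

xs = np.linspace(-5, 5, 11)
print(sensitivity_curve(np.mean, ideal, xs))    # the line y = x
print(sensitivity_curve(np.median, ideal, xs))  # bounded, flat in the tails
```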
Figure 3.5.1. The sensitivity curves of the mean and median.

Although the median behaves well when gross errors are expected, its performance at the normal model is unsatisfactory in the sense that its variance is about 57% larger than the variance of X̄. The sensitivity curve in Figure 3.5.1 suggests that we may improve matters by constructing estimates whose behavior is more like that of the mean when x is near μ. A class of estimates providing such intermediate behavior and including both the mean and the median has been known since the eighteenth century.
the sensitivity curve calculation points to an equally intuitive conclusion. infinitely better in the case of the Cauchysee Problem 5. Xa  X.2.8) than the Gaussian density.5. Xa. and Hogg (1974).) SC(x) X x(n[naJ) . 4. Haber.l)Q] and the trimmed mean of Xl. If [na] = [(n . There has also been some research into procedures for which a is chosen using the observations. Xu = X.. . f(x) = or even more strikingly the Cauchy. the mean is better than any trimmed mean with a > 0 including the median.. Huber (1972). . The estimates can be justified on plugin grounds (see Problem 3. l Xnl is zero.194 Measures of Performance Chapter 3 the median has ~en known since the eighteenth century. We define the a (3. which corresponds approximately to a = This can be verified in tenns of asymptotic variances (MSEs}see Problem 5." see Jaeckel (1971). 4e'x'.. .5). The range 0.4. the Laplace density. That is. Which a should we choose in the trimmed mean? There seems to be no simple answer.5.. by <Q < 4.10 < a < 0.1. Hampel. and Tukey (1972). . Figure 3.5.1. we throw out the "outer" [na] observations on either side and take the average of the rest. suppose we take as our data the differences in Table 3. + X(n[nuj) n .5.4. Bickel. I I ! . i I .2[nn]/n)I. f = <p. Note that if Q = 0. the sensitivity Jo ~ CUIve of an Q trimmed mean is sketched in Figure 3. (The middle portion is the line y = x(1 . for example.3) Xu = X([no]+l) + . For more sophisticated arguments see Huber (1981). f(x) = 1/11"(1 + x 2 ). Let 0 trimmed mean.2. that is. whereas as Q i ~.20 seems to yield estimates that provide adequate protection against the proportions of gross errors expected and yet perform reasonably well when sampling is from the nonnal distribution.2[nnl where [na] is the largest integer < nO' and X(1) < . However.5. Intuitively we expect that if there are no gross errors. If f is symmetric about 0 but has "heavier tails" (see Problem 3. For instance. Rogers. < X (n) are the ordered observations. ..5. See Andrews. For a discussion of these and other forms of "adaptation.1 again. then the trimmed means for a > 0 and even the median can be much better than the mean.. The sensitivity curve of the trimmed mean.. .
Gross errors or outlying data points affect estimates in a variety of situations; other examples will be given in the problems. We next consider two estimates of the spread in the population as well as estimates of quantiles.

Example 3.5.1. Spread. Let θ(P) = Var(X) = σ² denote the variance in a population and let X₁, ..., X_n denote a sample from that population. Then σ̂_n² = n⁻¹ Σᵢⁿ(Xᵢ − X̄)² is the empirical plug-in estimate of σ². To simplify our expression we shift the horizontal axis so that Σᵢⁿ⁻¹ xᵢ = 0, and write x̄_n = n⁻¹ Σᵢⁿ xᵢ = n⁻¹x. Then

    SC(x; σ̂²) ≈ x² − σ̂²_{n−1},    (3.5.4)

where σ̂²_{n−1} = (n − 1)⁻¹ Σᵢⁿ⁻¹ xᵢ² and the approximation is valid for x fixed, n → ∞ (Problem 3.5.10). It is clear that σ̂_n² is very sensitive to large outlying |x| values. □

If we are interested in the spread of the values in a population, the variance σ² or standard deviation σ is typically used. A fairly common quick and simple alternative is the IQR (interquartile range) defined as τ = x_{.75} − x_{.25}, where x_α denotes the αth quantile of the population (defined formally in Example 3.5.2 below). The IQR is often calibrated so that it equals σ in the N(μ, σ²) model. Because τ = 2 × (.674)σ there, the scale measure 0.742(x_{.75} − x_{.25}) is typically used.
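A small numerical sketch of the calibration just described (our own code, with an assumed 5% gross-error setup): under contamination the standard deviation blows up while the calibrated IQR barely moves.

```python
# Standard deviation versus the calibrated IQR 0.742*(x_.75 - x_.25).
import numpy as np

rng = np.random.default_rng(6)
clean = rng.normal(0, 1, 1000)
dirty = np.concatenate([clean[:950], rng.normal(0, 10, 50)])  # 5% gross errors

for x in (clean, dirty):
    iqr_scale = 0.742 * (np.quantile(x, 0.75) - np.quantile(x, 0.25))
    print(x.std(), iqr_scale)   # std inflates under contamination; IQR scale stays near 1
```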
Example 3.5.2. Quantiles and the IQR. Let θ(P) = x_α denote an αth quantile of the distribution of X, 0 < α < 1: x_α has 100α percent of the values in the population on its left (formally, P(X ≤ x_α) ≥ α, P(X ≥ x_α) ≥ 1 − α). x_α is called an αth quantile, and x_{.75} and x_{.25} are called the upper and lower quartiles. Let x̂_α denote the αth sample quantile (see 2.1.16). If nα is an integer, say k, then x̂_α = (1/2)[X_{(k)} + X_{(k+1)}], and at sample size n − 1, x̂_α = x_{(k)}, where x_{(1)} ≤ ... ≤ x_{(n−1)} are the ordered x₁, ..., x_{n−1}. Then, for 2 ≤ k ≤ n − 2,

    SC(x; x̂_α) = (n/2)[x_{(k−1)} − x_{(k)}]   for x < x_{(k−1)}
               = (n/2)[x − x_{(k)}]           for x_{(k−1)} ≤ x ≤ x_{(k+1)}    (3.5.5)
               = (n/2)[x_{(k+1)} − x_{(k)}]   for x > x_{(k+1)}.

Clearly, x̂_α is not sensitive to outlying x's. Next consider the sample IQR

    τ̂ = x̂_{.75} − x̂_{.25}.

It is easy to see that SC(x; τ̂) = SC(x; x̂_{.75}) − SC(x; x̂_{.25}), so the sample IQR is robust with respect to outlying gross errors x. □

Remark 3.5.1. The sensitivity of the parameter θ(F) to x can be measured by the influence function, which is defined by

    IF(x; θ, F) = lim_{ε↓0} IF_ε(x; θ, F),  where  IF_ε(x; θ, F) = ε⁻¹[θ((1 − ε)F + εΔ_x) − θ(F)]

and Δ_x is the distribution function of point mass at x (Δ_x(t) = 1[t ≥ x]). It is easy to see that SC(x; θ̂) = IF_ε(x; θ, F̂_{n−1}) with ε = 1/n, where F̂_{n−1} denotes the empirical distribution function based on x₁, ..., x_{n−1}. The influence function plays an important role in functional expansions of estimates. We will return to the influence function in Volume II.

Discussion. Other aspects of robustness, in particular the breakdown point, have been studied extensively, and a number of procedures have been proposed and implemented. An exposition of this point of view and some of the earlier procedures proposed is in Hampel, Ronchetti, Rousseeuw, and Stahel (1986). Unfortunately these procedures tend to be extremely demanding computationally, although this difficulty appears to be being overcome lately. The breakdown point will be discussed in Volume II.

Summary. We discuss briefly nondecision theoretic considerations for selecting procedures, including interpretability and computability. Most of the section focuses on robustness, discussing the difficult issues of formulation and identifiability. The rest of our very limited treatment focuses on the sensitivity curve as illustrated in the mean, median, trimmed mean, and other procedures.
Give the conditions needed for the posterior Bayes risk to be finite and find the Bayes rule.0). Find the Bayes risk r(7f.6 Problems and Complements 197 3.O) = p(x I 0)71'(0) = c(x)7f(0 .0)' and that the prinf7r(O) is the beta.2. the Bayes rule does not change.2. (X I 0 = 0) ~ p(x I 0). J).. lf we put a (improper) uniform prior on A.2. then the improper Bayes rule for squared error loss is 6"* (x) = X.3).. .1.2 preceeding. 6. s).r 7f(0)p(x I O)dO. In the Bernoulli Problem 3. That is. Show that OB ~ (S + 1)/(n+2) for ~ ~ the uniform prior. x) where c(x) =. (3(r. which is called the odds ratio (for success). Suppose IJ ~ 71'(0). 71') as . 1 X n is a N(B. 0) the Bayes rule is where fo(x. = (0 o)'lw(O) for some weight function w(O) > 0. is preferred to 0. (c) In Example 3.0 E e. the parameter A = 01(1 .6 PROBLEMS AND COMPLEMENTS Problems for Section 3. if 71' and 1 are changed to 0(0)71'(0) and 1(0. 0) = (0 . .2. E = R. Show that if Xl. Suppose 1(0.4. a(O) > 0. J) ofJ(x) = X in Example 3.O)P. Check whether q(OB) = E(q(IJ) I x). respectively. 71') = R(7f) /r( 71'. Find the Bayes estimate 0B of 0 and write it as a weighted average wOo + (1 ~ w)X of the mean 00 of the prior and the sample mean X = Sin.Section 3. we found that (S + 1)/(n + 2) is the Bayes rule.Xn be the indicators of n Bernoulli trials with success probability O. under what condition on S does the Bayes rule exist and what is the Bayes rule? 5. In Problem 3. where OB is the Bayes estimate of O. change the loss function to 1(0. Compute the limit of e( J. 3.. Consider the relative risk e( J.2 with unifonn prior on the probabilility of success O.O) and = p(x 1 0)[7f(0)lw(0)]/c c= JJ p(x I 0)[7f(0)/w(0)]dOdx is assumed to be finite.0) and give the Bayes estimate of q(O). 0) is the quadratic loss (0 . (a) Show that the joint density of X and 0 is f(x. (72) sample and 71" is the improper prior 71"(0) = 1. 2. Show that (h) Let 1(0. Let X I..3.4. where R( 71') is the Bayes risk. give the MLE of the Bernoulli variance q(0) = 0(1 . Hint: See Problem 1.a)' jBQ(1.2 o e 1. In some studies (see Section 6.24. ~ ~ 4.0)/0(0). density. .
3. 76) distribution. . equivalent to a namebrand drug. 1). a) = [q(B)a]'. . B)..=l CjUj . Var(Oj) = aj(ao . .d. (Use these results. .)T. a) = Lj~. 7.. (a) Problem 1.• N. For the following problems. Suppose tbat N lo . Let 0 = ftc .2(d)(i) and (ii). given 0 = B are multinomial M(n. then the generic and brandname drugs are.\(B) = I(B. ..3.a)' / nj~.I(d). Bioequivalence trials are used to test whether a generic drug is. N{O.. 0) and I( B. where is known. " X n of differences in the effect of generic and namebrand effects fora certain drug. Set a{::} Bioequivalent 1 {::} Not Bioequivalent . (a) If I(B. where ao = L:J=l Qj.. (Bj aj)2. 8. On the basis of X = (X I.198 (a) T + 00. defined in Problem 1. . Note that ). and that 0 has the Dirichlet distribution D( a)..exp {  2~' B' } . Measures of Performance Chapter 3 (b) n .15.aj)/a5(ao + 1). . 00. and that 9 is random with aN(rlO.. I ~' .3.9j ) = aiO:'j/a5(aa + 1).Xn ) we want to decide whether or not 0 E (e.B.O'5). (c) We want to estimate the vector (B" . (c) Problem 1. .ft B be the difference in mean effect of the generic and namebrand drugs. (. Hint: If 0 ~ D(a).. Let q( 0) = L:. . B. E). where E(X) = O. a = (a" ..(0) should be negative when 0 E (E. by definition. B . i: agency specifies a number f > a such that if f) E (E. c2 > 0 I. (b) Problem 1. I) = difference in loss of acceptance and rejection of bioequivalence. do not derive them.2. A regulatory " ! I i· .. E) and positive when f) 1998) is 1. and Cov{8j .Xu are Li. find the Bayes decision mle o' and the minimum conditional Bayes risk r(o'(x) I x).  I .) with loss function I( B. .. . to a close approximation..O)  I(B. Cr are given constants. Suppose we have a sample Xl. One such function (Lindley.19(c).. • 9.) (b) When the loss function is I(B. (e) a 2 + 00. Xl. There are two possible actions: 0'5 a a with losses I( B. bioequivalent. Find the Bayes decision rule. (E. a) = (q(B) . . B = (B" . E).. compute the posterior risks of the possible actions and give the optimal Bayes decisions when x = O. Assume that given f). then E(O)) = aj/ao. a r )T. where CI. find necessary and j sufficient conditions under which the Bayes risk is finite and under these conditions find the Bayes rule.E). I j l .\(B) = r .
Yo) to be a saddle point is that.Yo) = {} (Xo. yo) is a saddle point of 9 if g(xo.2.17).\(±€. Note that .3. .6." (c) Discuss the behavior of the preceding decision rule for large n ("n sider the general case (a) and the specific case (b). Any two functions with difference "\(8) are possible loss functions at a = 0 and I. S x T ~ R.16) and (3.y = (YI.2.Yo) Xi {}g {}g Yj = 0.2.. .) + ~}" . Yo) = inf g(xo. S T Suppose S and T are subsets of Rm. and 9 is twice differentiable. find (a) the linear Bayes estimate of ~l. representing x = (Xl. Discuss the preceding decision rule for this "prior. 1) are not constant. 0. 0) and l(()..6 Problems and Complements 199 where 0 < r < 1.6. (a) Show that a necessary condition for (xo..Section 3. (Xv. (b) It is proposed that the preceding prior is "uninformative" if it has 170 large ("76 + 00").. Con 10..) = 0 implies that r satisfies logr 1 = ( 2 2c' This is an example with two possible actions 0 and 1 where l(().. . y). . Hint: See Example 3. v) > ?r/(l ?r) is equivalent to T > t..3 1. {} (Xo.Xm). respectively. 2.. Suppose 9 . For the model defined by (3. + = 0 and it 00").Yp).1) is equivalent to "Accept bioequivalence if[E(O I x»)' where = x) < 0" (3. RP. . (c) Is the assumption that the ~ 's are nonnal needed in (a) and (b)? Problems for Section 3.1) < (T6(n) + c'){log(rg(~)+.1. (a) Show that the Bayes rule is equivalent to "Accept biocquivalence if E(>'(O) I X and show that (3.2 show that L(x. A point (xo. In Example 3. (b) the linear Bayes estimate of fl. Yo) = sup g(x. Yo) is in the interior of S x T.
o•• ) =R(B1 .d< p.n12). and 1 .n). B1 ) Ip(X. I(Bi. y ESp.)w". a) ~ (0 .. Yc Yd foraHI < i.andg(x. Hint: See Problem 3. (b) Show that limn_ooIR(O. . N(I'. I}.. B1 ) >  (l. .i) =0...(X) = 1 if Lx (B o." 1 Xi ~ l}.b < m.:. i._ I)'..a)'.Yo) > 0 8 8 8 X a 8 Xb 0. o')IR(O. Show that the von Neumann minimax theorem is equivalent to the existence of a saddle point for any twice differentiable g. L:.a. d) = (a) Show that if I' is known to be 0 (!.1(0.d.n12. Suppose i I(Bi.=1 CijXiYj with XE 8 m . the conclusion of von Neumann's theorem holds. .. B ) and suppose that Lx (B o.. n: I>l is minimax. • " = '!. the test rule 1511" given by o. 1rWlO = 0 otherwise is Bayes against a prior such that PIB = B ] = .2. Let Lx (Bo.n)/(n + .thesimplex. Let S ~ 8(n. 5. B ) ~ p(X. f.i. prior. 1 <j. and show that this limit ( ... Yo . (b) There exists 0 < 11"* < 1 such that the prior 1r* is least favorable against is. Hint: Show that there exists (a unique) 1r* so that 61f~' that R(Bo.. A = {O. .l. (b) Suppose Sm = (x: Xi > 0.c. . and o'(S) = (S + 1 2 . and that the mndel is regnlar. I • > 1 for 0 i ~.= 1 . Thus. Bd..2. I j 3.j=O. Suppose e = {Bo. f" I • L : I' (a) Show that 0' has constant risk and is Bayes for the beta.<7') and 1(<7'.Xn ) .PIB = Bo]. &* is minimax.0).j)=Wij>O. 2 J'(X1 . Let X" .1 < i < m.200 and Measures of Performance Chapter 3 8'g (x ) < 0 8'g(xo.y) ~ L~l 12. iij. o(S) = X = Sin. • . X n be i. fJ( . BIl has a continuous distrio 1 bution under both P Oo and P BI • Show that (a) For every 0 < 7f < 1. 0») equals 1 when B = ~. 4.0•• ).
12. jl~ < . Then X is nurnmax.\) distribution and 1(. . show that 0' is unifonnly best among all rules of the form oc(X) = CL Conclude that the MLE is inadmissible..) .k} I«i). . 00. . .a? 1. .. d) = L(d.a < b" = 'lb.tj. . " a.P.. 1 = t j=l (di .j. . . . Permutations of I....29). Show that X has a Poisson (. .. an d R a < R b. Remark: Stein (1956) has shown that if k > 3..\ . Xk)T. .. M (n. distribution. j 7. See Problem 1..?. Jlk. o(X 1.. d = (d). PI. h were t·... a) = (.).Section 3_6 Problems and Complements 201 (b) If I' ~ 0.2. . f(k. where (Ill. then is minimax for the loss function r: l(p. LetA = {(iI. .. = tj. 8. prior. ttl .» = Show that the minimax rule is to take L l. I' (X"". b = ttl..\. 1 <i< k.. Rj = L~ I I(XI < Xi)' Hint: Consider the uniform prior on permutations and compute the Bayes rule by showing that the posterior risk of a pennutation (i l .j. > jm). . Show that if (N). dk)T.X)' are inadmissible.. Xl· (e) Show that if I' is unknown.. . X k ) = (R l .I'kf. Rk) where Rj is the rank of Xi. . See Volume II.d) where qj ~ 1 . o'(X) is also minimax and R(I'.tn I(i. hence. .. .1.. that is. X is no longer unique minimax. ... . . i=l then o(X) = X is minimax. Hint: Consider the gamma.I). and iI. ftk) = (Il?l'"'' J1.j. 6. (c) Use (B. ik) is smaller than that of (i'l"'" i~)..3. . . Hint: (a) Consider a gamma prior on 0 = 1/<7'.X)' and. < im. . 0 1.. J.\. ik is an arbitrary unknown permutation of 1.X)' is best among all rules of the form oc(X) = c L(Xi ... 1). Let Xl. respectively.Pi. N k ) has a multinomial. 9. Let Xi beindependentN(I'i.. .. b.. . Let k . 1 < j < k. Write X  • "i)' 1(1'.).k. / .o) for alII'.. Show that if = (I'I''''.. For instance. 0') = (1. ik). . . that both the MLE and the estimate 8' = (n . . < /12 is a known set of values. . .~~n X < Pi < < R(I'. . X k be independent with means f. (jl.1)1 L(Xi . . o(X) ~ n~l L(Xi ... .Pi)' Pjqj <j < k. .l .
a) and let the prior be the uniform prior 1r(B" ..d.q)1r(B)dB. X = (X"". J K(po. See also Problem 3..3. ..B).. B j > 0. distribution. Suppose that given 6 = B = (B .. BO l) ~ (k .. Hint: Consider the risk function of o~ No._. distribution. X n be independent N(I'.i. .2..d with density defined in Problem 1. v'n v'n (a) Show that the risk (for squared error loss) E( v'n(o(X) . 13.. . B(n. a) is the posterior mean E(6 I X). LB= j 01 1.. . ~ I I .. of Xi < X • 1 + 1 o. I: • ... .1'»2 of these estimates is bounded for all nand 1'.202 Measures of Performance Chapter 3 Hint: Consider Dirichlet priors on (PI.. ! p(x) = J po(x)1r(B)dB... .15.=1 Show that the Bayes estimate is (Xi . I: I ir'z .. (b) How does the risk of these estimates compare to that of X? 12. 1).1r) = Show that the marginal density of X.i f X> . + l)j(n + k). Suppose that given 6 ~ B. B). Show that the Bayes estimate of 0 for the KullbackLeibler loss function lp(B. Pk.4. Let the loss " function be the KullbackLeibler divergence lp(B. 1 . 14. . For a given x we want to estimate the proportion F(x) of the population to the left of x..BO)T. Let Xi(i = 1.. ... ..... X has a binomial. Let X" . Show that v'n 1+ v'n 2(1+ v'n) is minimax for estimating F(x) = P(X i < x) with squared error loss. 10. Let K(po. . with unknown distribution F..Xo)T has a multinomial. n) be i..8. . M(n..I)!. Define OiflXI v'n < d < v'n d v'n d d X . d X+ifX 11..2.. q) denote the K LD (KullbackLeiblerdivergence) between the densities Po and q and define the Bayes KLD between P = {Po : BEe} and q as k(q. See Problem 3....
'? /1 as in (3. . Let X I. .alai) < al(O. Reparametrize the model by setting ry = h(O) and let q(x. then E(g(X) > g(E(X».1 E: 1 (Xi  J. p(X) IO.tO)2 is a UMVU estimate of a 2 • &'8 is inadmissible.. Hint: k(q. ry). for any ao. 4. J). 0) cases. if I(O.. ao) + (1 . {log PB(X)}] K(O)d(J.a )I( 0. the Fisher information lower bound is equivariant. that is. J') < R( (J. a) is convex and J'(X) = E( J(X) I I(X». Show that B q(ry) = Bp(h." A density proportional to VJp(O) is called Jeffrey's prior. 2.2) with I' . then That is. Show that if 1(0.. (b) Equivariance of the Fisher Information Bound. Prove Proposition 3.). N(I". X n be the indicators of n Bernoulli trials with success probability B.. Let A = 3.O)!. It is often improper. Hint: Use Jensen's inequality: If 9 is a convex function and X is a random variable.x =.f [E. a < a < 1. 0) with 0 E e c R. X n are Ll. a.. (ry»). 7f) and that the minimum is I(J. a" (J. Show that in theN(O. 15. p( x. then R(O. Suppose that there is an unbiased estimate J of q((J) and that T(X) is sufficient. K) .X is called the mutual information between 0 and X.12) for the two parametrizations p(x. Show that X is an UMVU estimate of O. . Jeffrey's "Prior. Problems for Section 3. Equivariance. Fisher infonnation is not equivariant under increasing transformations of the parameter. R. Let X . .. Jeffrey's priors are proportional to 1. We shall say a loss function is convex.1 . and O~ (1 . . h1(ry)) denote the model in the new parametrization. Let Bp(O) and Bq(ry) denote the information inequality lower bound ('Ij. 0) and q(x.Section 3.. 0. . K) = J [Eo {log ~i~i}] K((J)d(J > a by Jensen's inequality. . Show that (a) (b) cr5 = n. . (a) Show that if Ip(O) and Iq(fJ) denote the Fisher information in the two parametrizations. 0) and B(n. N(I"o. . respectively.aao + (1 .1"0 known. Suppose X I. S. suppose that assumptions I and II hold and that h is a monotone increasing differentiable function from e onto h(8). ry) = p(x.4.d.k(p.4.6 Problems and Complements 203 minimizes k( q.~). .4 1.4. Give the Bayes rules for squared error in these three cases. .
4. (a) Find the MLE of 1/0.. Suppose Yl .. Ct.p) is an unbiased estimate of (72 in the linear regression model of Section 2. 1). a ~ i 12.' . Show that if (Xl. Let X" .ZDf3)/(n .13 E R. Let F denote the class of densities with mean 0.. 14. . distribution. Show that a density that minimizes the Fisher infonnation over F is f(x. 0) = Oc"I(x > 0). then (a) X is an unbiased estimate of x = I ~ L~ 1 Xi· . Establish the claims of Example 3.. Measures of Performance Chapter 3 I (c) if 110 is not known and the true distribution of X t is N(Ji.XN }. I I .Yn are independent Poisson random variables with E(Y'i) = !Ji where Jli = exp{ Ct + (3Zi} depends on the levels Zi of a covariate.. 9. For instance. Show that .11(9) as n ~ give the limit of n times the lower bound on the variances of and (J. 00.1. compute Var(O) using each ofthe three methods indicated. ~ ~ ~ (a) Write the model for Yl give the sufficient statistic. . 204 Hint: See Problem 3.o. .Ii . .. Let a and b be constants.. ( 2). Does X achieve the infonnation . Zi could be the level of a drug given to the ith patient with an infectious disease and Vi could denote the number of infectious agents in a given unit of blood from the ith patient 24 hours after the drug was administered. Pe(E) = 1 for some 0 if and only if Pe(E) ~ ~2 > OJ docsn't depend on 0. X n be a sample from the beta. find the bias o f ao' 6. Y n in twoparameter canonical exponential form and (b) Let 0 = (a. n. . Show that assumption I implies that if A . . {3) T Compute I( 9) for the model in (a) and then find the lower bound on the variances of unbiased estimators and {J of a and (J. .8..2.4.5(b).4. Find lim n. . Suppose (J is UMVU for estimating fJ.ZDf3)T(y .3. 2 ~ ~ 10.\ = a + bB is UMVU for estimating .. Show that 8 = (Y . . • 13.{x : p(x. =f. 7. B(O. . . . Hint: Consider T(X) = X in Theorem 3. P. 0) for any set E.1 and variance 0.P.. and .4. . . a ~ (c) Suppose that Zi = log[i/(n + 1)1. Hint: Use the integral approximation to sums.2 (0 > 0) that satisfy the conditions of the information inequality. .\ = a + bOo 11. Is it unbiased? Does it achieve the infonnation inequality lower bound? (b) Show that X is an unbiased estimate of 0/(0 inequality lower bound? + 1). In Example 3.. 8. X n ) is a sample drawn without replacement from an unknown finite population {Xl. then = 1 forallO. i = 1.
. (c) Explain how it is possible if Po is binomial. (See also Problem 1.. 7. distribution.4.. 18.1  E1 k I Xki doesn't depend on k for all k such that 1Tk > o. XK.) Suppose the Uj can be relabeled into strata {xkd. .Xkl. . (a) Take samples with replacement of size mk from stratum k = fonn the corresponding sample averages Xl. 1 < i < h. (b) Deduce that if p. 20. 8).511 > (b) Show that the inequality between Var X and Var X continues to hold if ~ ..• .. if and only if.p). that ~ is a Bayes estimate for O. then E( M) ~ L 1r j=1 N J ~ n.X)/Var(U).".4. k=l K ~1rkXk. . : 0 E for (J such that E((J2) < 00.. .4.. Stratified Sampling. . ec R} and 1r is a prior distribution (a) Show that o(X) is both an unbiased estimate of (J and the Bayes estimate with respect to quadratic loss.1 and Uj is retained independently of all other Uj with probability 1rj where 'Lf 11rj = n. = N(O. Let 7fk = ~ and suppose 7fk = 1 < k < K. .6) is (a) unbiased and (b) has smaller variance than X if  b < 2 Cov(U.. _ Show that X is unbiased and if X is the mean of a simple random sample without replacement from the population then VarX<VarX with equality iff Xk.Section 3.~). B(n..6 Problems and Complements 205 (b) The variance of X is given by (3. k = 1. 16. . Define ~ {Xkl.4).~ for all k. More generally only polynomials of degree n in p are unbiasedly estimable. Show that is not unbiasedly estimable. P[o(X) = 91 = 1.l.} and X =K  1". B(n. Show that the resulting unbiased HorvitzThompson estimate for the population mean has variance strictly larger than the estimate obtained by taking the mean of a sample of size n taken without replacement from the population. 'L~ 1 h = N. Let X have a binomial. UN are as in Example 3. X is not II Bayes estimate for any prior 1r. even for sampling without replacement in each stratum. Suppose UI. G 19. 17. K.. Suppose the sampling scheme given in Problem 15 is employed with 'Trj _ ~. Suppose X is distributed accordihg to {p.3.4. Show that X k given by (3. :/i. Show that if M is the expected sample size. = 1. 15.
. Note that logp(x. An estimate J(X) is said to be shift or translation equivariant if.25 and net =k 3. Show that the sample median X is an empirical plugin estimate of the population median v.25. Show that. Regularity Conditions are Needed for the Information Inequality. Hint: It is equivalent to show that./(9)a [¢ T (9)a]T II (9)[¢ T (9)a]."" F (ij) Vat (:0 logp(X. B) is differentiable fat anB > x.206 Hint: Given E(ii(X) 19) ~ 9.4.3.5) to plot the sensitivity curve of the 1. Show that the a trimmed mean XCII. give and plot the sensitivity curve of the median. the upper quartile X.. for all adx 1.4. J xdF(x) denotes Jxp(x)dx in the continuous case and L:xp(x) in the discrete 6.. B») ~ °and the information bound is infinite. (iii) 2X is unbiased for .75' and the IQR. E(9 Measures of Performance Chapter 3 I X) ~ ii(X) compute E(ii(X) . T .25 and (n . Prove Theorem 3. = 2k is even. B). 1 X n1 C. • Problems for Section 35 I is an integer. 1 2.5. and we can thus define moments of 8/8B log p(x. If a = 0. is an empirical plugin estimate of ~ Here case. 22. 5. Let X ~ U(O. . 4. for all X!. Yet show eand has finite variance. that is. give and plot the sensitivity curves of the lower quartile X. however. with probability I for each B. Var(a T 6) > aT (¢(9)II(9). use (3. . Note that 1/J (9)a ~ 'i7 E9(a 9) and apply Theorem 3.9)' 21. If n IQR.l)a is an integer. B) be the uniform distribution on (0. B).4. If a = 0.
X are unbiased estimates of the center of symmetry of a symmetric distribution.. . is symmetrically distributed about O. '& is an estimate of scale. . . xH L is translation equivariant and antisymmetric.Section 3. median. Deduce that X. Show that if 15 is translation equivariant and antisymmetric and E o(15(X» exists and is finite. (7= moo 1 IX. For x > .1... and the HodgesLehmann estimate.30. k t 0 to the median.  XI/0. I)order statistics). .1. .30. . In .. (See Problem 3. It has the advantage that there is no trimming proportion Q that needs to be subjectively specified. J is an unbiased estimate of 11..5. 7. and xiflxl < k kifx > k kifx<k.6 Problems and Complements 207 It is antisymmetric if for all Xl. F(x . The HodgesLehmann (location) estimate XHL is defined to be the median of the 1n(n + 1) pairwise averages ~(Xi + Xj)..JL) where JL is unknown and Xi . . (a) Suppose n = 5 and the "ideal" ordered sample of size n ~ 1 = 4 is 1. X a arc translation equivariant and antisymmetric. (b) Suppose Xl. (b) Show that   .5 and for (j is.03.).. X n is a sample from a population with dJ.) ~ 8. One reasonable choice for k is k = 1. trimmed mean with a = 1/4. Xu. . <i< n Show that (a) k = 00 corresponds to X.6.(a) Show that X. plot the sensitivity curves of the mean.e.67. then (i..03 (these are expected values of four N(O. X. The Huber estimate X k is defined implicitly as the solution of the equation where 0 < k < 00. Its properties are similar to those of the trimmed mean.3. i < j.
then limlxj>00 SC(x.i.1)1 L:~ 1(Xi . thenXk is the MLEofB when Xl. with k and € connected through 2<p(k) _ 24>(k) ~ e k 1(e) Xk exists and is unique when k £ ! i > O. n is fixed. Show that SC(x.Xn arei.. (d) Xk is translation equivariant and antisymmetric (see Problem 3. Suppose L:~i Xi ~ . Let X be a random variable with continuous distribution function F.208 ~ Measures of Performance Chapter 3 (b) Ifa is replaced by a known 0"0.75 .. .) are two densities with medians v zero and identical scale parameters 7. For the ideal sample of Problem 3. with density fo((.X. In what follows adjust f and 9 to have v = 0 and 'T = 1. .d. . and is fixed. (e) If k < 00. The functional 0 = Ox = O(F) is said to be scale and shift (translation) equivariant \. and "" "'" (b) the Iratio of Problem 3.5. (b) Find thetml probabilities P(IXI > 2). thus. Let JJo be a hypothesized mean for a certain population. 1 Xnl to have sample mean zero. Let 1'0 = 0 and choose the ideal sample Xl. Location Parameters.." . 9. . 00.• 13.7(a).5.5. ii. plot the sensitivity curve of (a) iTn . (c) Show that go(x)/<p(x) is of order exp{x2 } as Ixl ~ 10.x}2. In the case of the Cauchy density. we say that g(. = O. we will use the IQR scale parameter T = X. Laplace. •• .25. Use a fixed known 0"0 in place of Ci. .. where S2 = (n .6). P(IXI > 3) and P(IXI > 4) for the nonnal. Find the limit of the sensitivity curve of t as (a) Ixl ~ (b) n ~ 00.O)/ao) where fo(x) for Ixl for Ixl <k > k.. If f(·) and g(. the standard deviation does not exist. Xk) is a finite constant. and Cauchy distributions. . The (student) tratio is defined as 1= v'n(x I'o)/s. (a) Find the set of Ixl where g(Ixl) > <p(lxl) for 9 equal to the Laplace and Cauchy densitiesgL(x) = (2ry)1 exp{ Ixl/ry} and gc(x) = b[b2 + x 2 ]1 /rr. 00. 11. [. This problem may be done on the computer.~ !~ t . X 12.) has heavier tails than f() if g(x) is above f(x) for Ixllarge.11. iTn ) ~ (2a)1 (x 2  a 2 ) as n ~ 00.
xn ..67 and tPk is defined in Problem 3. ~ (b) Write the Be as SC(x. a + bx. i/(F)] is the location parameter set in the sense that for any continuous F the value ()( F) of any location parameter must be in [v(F). 8. In Remark 3. (jx. . and note thatH(xi/(F» < F(x) < H(xv(F»._.. (e) Show that if the support S(F) = {x : 0 < F(x) < 1} of F is a finite interval... (a) Show that if F is symmetric about c and () is a location parameter.. .  vl/O. . ([v(F). c E R.t )) = 0.. and sample trimmed mean are shift and scale equivariant.6 Problems and Complements 209 if (Ja+bX = a + bex· It is antisymmetric if (J x :.) 14. (e) Let Ji{k) be the solution to the equation E ( t/Jk (X :. sample median. Show that J1. . . b > 0.Xn_l) to show its dependence on Xnl (Xl. SC(a + bx.5. xnd. a. 8) and ~ IF(x. b > 0. F) lim n _ oo SC(x.xn ).5) are location parameters. Show that if () is shift and location equivariant.1: (a) Show that SC(x.(F) = !(xa + XI").. 8) = IF~ (x. 8.a +bxn ) ~ a + b8n (XI. In this case we write X " Y.:. .. any point in [v(F). An estimate ()n is said to be shift and scale equivariant if for all xl. i/(F)] and.O.Section 3.. Fn _ I ). let v" = v. ~ (a) Show that the sample mean. median v. ~ ~ 15. v(F) = inf{va(F) : 0 < " < I/2} andi/(F) =sup{v.8. 0 < " < 1. ~ ~ 8n (a +bxI.) That is. and ordcr preserving. F(t) ~ P(X < t) > P(Y < t) antisymmctric.].5.(k) is a (d) For 0 < " < 1.5..._. · . let H(x) be the distribution function whose inverse is H'(a) = ![xax. compare SC(x. Let Y denote a random variable with continuous distribution function G. If 0 is scale and shift equivariant. Also note that H(x) is symmetric about zero. Hint: For the second part. i/(F)] is the value of some location parameter. c + dO. then v(F) and i/( F) are location parameters. . ..F). then for a E R. and trimmed population mean JiOl (see Problem 3. x n _.t. ~ se is shift invariant and scale equivariant. 8. n (b) In the following cases. X is said to be stochastically smaller than Y if = G(t) for all t E R. then ()(F) = c..). if F is also strictly increasing. the ~ = bdSC(x. it is called a location parameter.8. Show that v" is a location parameter and show that any location parameter 8(F) satisfies v(F) < 8(F) < i/(F).(F): 0 <" < 1/2}. ~ = d> 0. (b) Show that the mean Ji. 8 < " is said to be order preserving if X < Y ::::} Ox < ()y. where T is the median of the distributioo of IX location parameter.
A. Frechet. is not identifiable. 3. C < 00. (e) Does n j [SC(x.2).3 (1) A technical problem is to give the class S of subsets of:F for which we can assign probability (the measurable sets). Let d = 1 and suppose that 'I/J is twice continuously differentiable. ~ ~ 17.1/. .1) Hint: (a) Try t/J(x) = Alogx with A> 1. {BU)} do not converge. then fJ.rdF(l:). I I Notes for Section 3. and we seek the unique solution (j of 'IjJ(8) = O. ~ I(x  = X n .Bilarge enough. (ii) O(F) = (iii) e(F) "7. consequently.7 NOTES Note for Section 3. This is. we in general must take on the order of log ~ steps. We define S as the . IlFfdF(x).. 81" < 0.210 (i) Measures of Performance Chapter 3 O(F) ~ 1'1' ~ J .0 > 0 (depending on t/J) such that if lOCO) . B E B. (1) The result of Theorem 3. .4. (b) Show that there exists.t is identifiable. The NewtonRaphson method in this case is . in order to be certain that the Jth iterate (j J is within e of the desired () such that 'IjJ( fJ) = 0. I : . 0.1 is commonly known as the Cramer~Rao inequality. (b) If no assumptions are made about h.BI < C1(1(i.B ~ {F E :F : Pp(A) E B}.' > 0. also true of the method of coordinate ascent. show that (a) If h is a density that is symmetric about zero. . 18. ) (a) Show by example that for suitable t/J and 1(1(0) . . then J. In the gross error model (3. • . we shall • I' .. (b) ~ ~ . Because priority of discovery is now given to the French mathematician M. where B is the class of Borel sets. and (iii) preceding? ~ 16. • • . F)! ~ 0 in the cases (i). . Show that in the bisection method.I F(x.5. · . (ii).8) .field generated by SA. Assume that F is strictly increasing.OJ then l(I(i) .4 I .
DoKSUM. 40.. 3.. J. 0. DOWE. Statist. 1. T. M.. 1972. AND E." Ann. WALLACE. AND N. LEHMANN. SMITH. >. DE GROOT.thematical Methods in Risk Theory Heidelberg: Springer Verlag. 2. J. 1969. BICKEL. Statistical Decision Theory and Bayesian Analysis New York: Springer. Statist. 1122 (1975). Dispersion. Math... 1974.. 3.8). Jroo T(x) [~p(X' 0)] dx roo Joo 8(J 00 00 00 for all (J whereas the continuity (or even boundedness on compact sets) of the second integral guarantees that we can interchange the order of integration in (4) The finiteness of Var8(T(X)) and f(O) imply that 1/J'(0) is finite by the covariance interpretation given in (3. D. Numen'cal Analysis New York: Prentice Hall. D. H. "Descriptive Statistics for Nonparametric Models. AND J. P.cific Asian Conference on Knowledge Discovery and Data Mintng Melbourne: SpringerVerlag. F.8 References 211 follow the lead of Lehmann and call the inequality after the Fisher information number that appears in the statement. BJORK..Section 3. BERNARDO. Optimal Statistical Decisions New York: McGrawHill.. S. Ma. A. II. F.)dXd>. AND E. Location. BERGER. Bayesian Theory New York: Wiley. MA: AddisonWesley. DAHLQUIST. in Proceedings of the Second Pa. 1985.10451069 (1975b). HAMPEL.. F.. 10381044 (l975a). . AND C. P. R. BOHLMANN.. M. Jroo T(x) :>.. W. Statist. Reading. BICKEL.(T(X)) = 00.." Scand. Introduction. P.. 2nd ed. LEHMANN. LEHMANN." Ann. 4. 15231535 (1969). 3. ANDERSON. ROGERS. BICKEL. BAXTER. TUKEY. ofStatist. 1998. OLIVER. Robust Estimates of Location: Sun>ey and Advances Princeton. R. Mathematical Analysis. AND E. (2) Note that this inequality is true but uninteresting if f(O) = 00 (and 1/J'(0) is finite) or if Var. H. G. APoSTOL. (3) The continuity of the first integral ensures that : (J [rO )00 .. BICKEL. 1974. Point Estimation Using the KullbackLeibler Loss Function and MML. H. "Descriptive Statistics for Nonparametric Models. P. Families. BICKEL. J. "Descriptive Statistics for Nonparametric Models. M. III. LEHMANN. A. 1970..] = Jroo. M. "Measures of Location and AsymmeUy.8 REFERENCES ANDREWS." Ann. "Unbiased Estimation in Convex. P. NJ: Princeton University Press. J. W. p(x. AND A. AND E.. I. HUBER. Statist." Ann. L.. 11391158 (1976). K. A. 1994. J. J. P..4.
MA: AddisonWesley. 42. London. Assoc.375394 (1997). • • 1 . F." Ann. "Hierarchical Credibility: Analysis of a Random Effect Linear Model with Nested Classification. "Estimation and Inference by Compact Coding (With Discussions). Assoc. L. "On the Attainment of the CramerRao Lower Bound. W... London: Oxford University Press. LINDLEY.V. YU. I.. HOGG. R.. University of California Press. 1959.. Part II: Inference. M. B. S. AND B. 7.. Soc. Mathematical Methods and Theory in Games. LINDLEY. RoyalStatist. Theory ofPoint Estimation... Wiley & Sons. . Statist. TuKEY.. P.n. 43." Proc. Statist. Math. "Stochastic Complexity (With Discussions):' J. 383393 (1974). "Inadmissibility of the Usual Estimator for the Mean of a Multivariate Distribution. Soc. STEIN. 69. and Economics Reading.. R. Stall'. . 1981. D. HANSEN. 1998.. A. D. JAECKEL.. Math. WALLACE. 1954. 909927 (1974). SAVAGE." J. 49. LEHMANN. Third Berkeley Symposium on Math. HUBER. • NORBERG. STAHEL. E. 10201034 (1971). "Boostrap Estimate of KullbackLeibler Information for Model Selection. 223239 (1987). RONCHEro. R. Robust Statistics New York: Wiley. "The Influence Curve and Its Role in Robust Estimation. R. Part I: Probability. and Probability. . j I JEFFREYS.10411067 (1972). 1948.. 1965. The Foundations afStatistics New York: J. RISSANEN... Cambridge University Press.. 1.222 (1986). 240251 (1987). A. Math. CASELLA. Y. 2nd ed. Stu/ist. Royal Statist.. StrJtist. MA: AddisonWesley. HUBER." J. H. L. Assoc. HAMPEL. Robust Statistics: The Approach Based on Influence Functions New York: J. Testing Statistical Hypotheses New York: Springer. Theory ofProbability. WUSMAN.. Amer. L.. 2Q4. (2000). FREEMAN.. AND W.. KARLIN. "Adaptive Robust Procedures. 136141 (1998). R. C. 49. 197206 (1956). 1986. StatiSl. New York: Springer. "Robust Statistics: A Review. E. B. Programming. E." Statistica Sinica.212 Measures of Performance Chapter 3 HAMPEL. Actuarial J. "Decision Analysis and BioequivaIence Trials. "Robust Estimates of Location. Amer. P." Statistical Science. 69. C." Ann. AND P. 2nd ed. SHIBATA. Wiley & Sons.. AND G. Amer.. ROUSSEUW." Scand. LEHMANN. 538542(1973). J.. 1972. 1986. E. S.. .. P." J. H. Introduction to Probability and Statistics from a Bayesian Point of View." Ann. J." J. 1. "Model Selection and the Principle of Mimimum Description Length. L. Exploratory Data Analysis Reading.. 13. Stlltist..
treating it as a decision theory problem in which we are to decide whether P E Po or P l or. If n is the total number of applicants. the situation is less simple. it might be tempting to model (Nm1 . peIfonn a survey. where Po. and 3. and we have data providing some evidence one way or the other. and indeed most human activities. Does a new drug improve recovery rates? Does a new car seat design improve safety? Does a new marketing policy increase market share? We can design a clinical trial. Nmo.Chapter 4 TESTING AND CONFIDENCE REGIONS: BASIC THEORY 4. not a sample.1. Here are two examples that illustrate these issues. to answering "no" or "yes" to the preceding questions. in examples such as 1. As we have seen.PfO). pUblic policy. Accepting this model provisionally. as is often the case. whether (j E 8 0 or 8 1 jf P j = {Pe : () E 8 j }. we are trying to get a yes or no answer to important questions in science.PmllPmO. respectively. e Example 4. The design of the experiment may not be under our control. Nfo of denied applicants. what is an appropriate stochastic model for the data may be questionable. and what 8 0 and 8 1 correspond to in tenns of the stochastic model may be unclear. PI or 8 0 .1. medicine. parametrically.Nfb the numbers of admitted male and female applicants.2. M(n. This framework is natural if.1. the parameter space e. They initially tabulated Nm1. 3. But this model is suspect because in fact we are looking at the population of all applicants here. what does the 213 . Usually.3 the questions are sometimes simple and the type of data to be gathered under our control. o E 8.3.1 INTRODUCTION In Sections 1. respectively. Nfo) by a multinomial.Pfl. 8 1 are a partition of the model P or. distribution. The Graduate Division of the University of California at Berkeley attempted to study the possibility that sex bias operated in graduate admissions in 1973 by examining admissions data. Sex Bias in Graduate Admissions at Berkeley. and the corresponding numbers N mo . modeled by us as having distribution P(j. conesponding.Nf1 . where is partitioned into {80 . or more generally construct an experiment that yields data X in X C Rq.3 we defined the testing problem abstractly. 8d with 8 0 and 8.
.=7xlO 5 ' n 3 . OUf multinomial assumption now becomes N ""' M(pmld' PmOd. and so on. . . • In fact. N mOd . . m P [ NAA . . .< . The progeny exhibited approximately the expected ratio of one homozygous dominant to two heterozygous dominants (to one recessive). Mendel's Peas. It was noted by Fisher corresponds to H : p = ~ with the alternative K : p as reported in Jeffreys (1961) that in this experiment the observed fraction ':: was much closer to 3. • . ~. the same data can lead to opposite conclusions regarding these hypothesesa phenomenon called Simpson's paradox. the number of homozygous dominants. . Mendel crossed peas heterozygous for a trait with two alleles.! . where N m1d is the number of male admits to department d. ! . . the natural model is to assume. . I I 1] Fisher conjectured that rather than believing that such a very extraordinary event occurred it is more likely that the numbers were made to "agree with theory" by an overzealous assistant. . pml +P/I !. If departments "use different coins. Pfld.1. In one of his famous experiments laying the foundation of the quantitative theory of genetics. d = 1.. if there were n dominant offspring (seeds). . for d = 1. distribution. • I I . if the inheritance ratio can be arbitrary.D. as is discussed in a paper by Bickel. one of which was dominant. NIld. In these tenns the hypothesis of "no bias" can now be translated into: H: Pml Pmld PmOd Pfld Pild + + PfOd . PfOd. In a modem formulation.than might be expected under the hypothesis that N AA has a binomial.: Example 4. either N AA cannot really be thought of as stochastic or any stochastic I ." then the data are naturally decomposed into N = (Nm1d.p) distribution.. D). i:. .. B (n. . The hypothesis of dominant inheritance ~.. I . This is not the same as our previous hypothesis unless all departments have the same number of applicants or all have the same admission rate. has a binomial (n. Hammel... 0 I • . That is. I . ~).. and O'Connell (1975). NjOd.214 Testing and Confidence Regions Chapter 4 hypothesis of no sex bias correspond to? Again it is natural to translate this into P[Admit I Male] = Pm! Pml +PmO = P[Admit I Female] = P fI Pil +PjO But is this a correct translation of what absence of bias means? Only if admission is determined centrally by the toss of a coin with probability Pml Pi! Pml + PmO PIl +PiO [n fact. • Pml + P/I + PmO + Pio • I . D). d = 1. The example illustrates both the difficulty of specifying a stochastic model and translating the question one wants to answer into a statistical hypothesis..2. that N AA. admissions are petfonned at the departmental level and rates of admission differ significantly from department to department.n 3 t I .
it's not clear what P should be as in the preceding Mendel example. our discussion of constant treatment effect in Example 1.!. where 00 is the probability of recovery usiog the old drug. see. the set of points for which we reject. .. we shall simplify notation and write H : () = eo. The same conventions apply to 8] and K. In situations such as this one .1. Most simply we would sample n patients. and accept H otherwise. Suppose that we know from past experience that a fixed proportion Bo = 0. 11 and K is composite. 8 1 is the interval ((}o. It will turn out that in most cases the solution to testing problem~ with 80 simple also solves the composite 8 0 problem. and we are then led to the natural 0 . What our hypothesis means is that the chance that an individual randomly selected from the ill population will recover is the same with the new and old drug. where Xi is 1 if the ith patient recovers and 0 otherwise.1.11. suppose we observe S = EXi . P = Po. say 8 0 . . say k. for instance.3. That a treatment has no effect is easier to specify than what its effect is. we reject H if S exceeds or equals some integer. In this example with 80 = {()o} it is reasonable to reject IJ if S is "much" larger than what would be expected by chance if H is true and the value of B is eo. n 3' 0 What the second of these examples suggests is often the case. Now 8 0 = {Oo} and H is simple. I} or critical region C {x: Ii(x) = I}.. where 1 ~ E is the probability that the assistant fudged the data and 6!. Suppose we have discovered a new drug that we believe will increase the rate of recovery from some disease over the recovery rate when an old established drug is applied. K : () > Bo Ifwe allow for the possibility that the new drug is less effective than the old.3. is better defined than the alternative answer 8 1 .Xn ). 8 0 and H are called compOSite. say. (}o] and eo is composite. the number of recoveries among the n randomly selected patients who have been administered the new drug. If the theory is false. That is. we call 8 0 and H simple.1 Introduction 215 model needs to pennit distributions other than B( n.1.Section 4. acceptance and rejection can be thought of as actions a = 0 or 1. When 8 0 contains more than one point. p). then S has a B(n. then 8 0 = [0 . Thus. Our hypothesis is then the null hypothesis that the new drug does not improve on the old drug. for instance. In science generally a theory typically closely specifies the type of distribution P of the data X as. B) distribution. 0 E 8 0 .€)51 + cB( nlP). We illustrate these ideas in the following example. Thus.(1) As we have stated earlier. See Remark 4. These considerations lead to the asymmetric formulation that saying P E Po (e E 8 0 ) corresponds to acceptance of the hypothesis H : P E Po and P E PI corresponds to rejection sometimes written as K : P E PJ . It is convenient to distinguish between two structural possibilities for 8 0 and 8 1 : If 8 0 consists of only one point. in the e . (2) If we let () be the probability that a patient to whom the new drug is administered recovers and the population of (present and future) patients is thought of as infinite. = Example 4. Moreover. If we suppose the new drug is at least as effective as the old. is point mass at . a) = 0 if BE 8 a and 1 otherwise.3 recover from the disease with the old drug.1 loss l(B. and then base our decision on the observed sample X = (X J. To investigate this question we would have to perform a random experiment. The set of distributions corresponding to one answer. 
recall that a decision procedure in the case of a test is described by a test function Ii: x ~ {D. administer the new drug. then = [00 . (1 .
• .2 and 4. (5 > k) < k). Note that a test statistic generates a family of possible tests as c varies. is much better defined than its complement and/or the distribution of statistics T under eo is easy to compute. One way of doing this is to align the known and unknown regions and compute statistics based on the number of matches. but computation under H is easy. 4. testing techniques are used in searching for regions of the genome that resemble other regions that are known to have significant biological activity.3. and later chapters. generally in science. as we shall see in Sections 4. .2.e. and large. The value c that completes our specification is referred to as the critical value of the test.. The Neyman Pearson Framework The Neyman Pearson approach rests On the idea that. .1. We now tum to the prevalent point of view on how to choose c. asymmetry is often also imposed because one of eo.3.3 this asymmetry appears reasonable.1 and4. (5 The constant k that determines the critical region is called the critical value. The Neyman Pearson framework is still valuable in these situations by at least making us think of possible alternatives and then. 1.. Thresholds (critical values) are set so that if the matches occur at random (i. of the two errors. how reasonable is this point of view? In the medical setting of Example 4. There is a statistic T that "tends" to be small. We call T a test statistic. matches at one position are independent of matches at other positions) and the probability of a match is ~... (Other authors consider test statistics T that tend to be small. if H is true. : I I j ! 1 I . suggesting what test statistics it is best to use. one can be thought of as more important. e 1 . 0 > 00 . ! : . It has also been argued that. For instance. ~ : I ! j . In most problems it turns out that the tests that arise naturally have the kind of structure we have just described.T would then be a test statistic in our sense.216 Testing and Confidence Regions Chapter 4 tenninology of Section 1.1. No one really believes that H is true and possible types of alternatives are vaguely known at best. our critical region C is {X : S rule is Ok(X) = I{S > k} with > k} and the test function or PI ~ probability of type I error = Pe. As we noted in Examples 4. 1 i . when H is false. . .that has in fact occurred.3. then the probability of exceeding the threshold (type I) error is smaller than Q". Given this position. In that case rejecting the hypothesis at level a is interpreted as a measure of the weight of evidence we attach to the falsity of H.. . announcing that a new phenomenon has been observed when in fact nothing has happened (the socalled null hypothesis) is more serious than missing something new.2. By convention this is chosen to be the type I error and that in tum detennines what we call H and what we call K. We do not find this persuasive.1. To detennine significant values of these statistics a (more complicated) version of the following is done.) We select a number c and our test is to calculate T(x) and then reject H ifT(x) > c and accept H otherwise. 0 PH = probability of type II error ~ P.. if H is false. We will discuss the fundamental issue of how to choose T in Sections 4. it again reason~bly leads to a Neyman Pearson fonnulation. but if this view is accepted.1 .
0473. {3(8. This quantity is called the size of the test and is the maximum probability of type I error. Here are the elements of the Neyman Pearson story.9. numbers to the two losses that are not equal and/or depend on 0. Finally. The power of a test against the alternative 0 is the probability of rejecting H when () is true.3. . 80 = 0. it is too limited in any situation in which. e t .1 Introduction 217 There is an important class of situations in which the Neyman Pearson framework is inappropriate. Begin by specifying a small number a > 0 such that probabilities of type I error greater than a arc undesirable. and we speak of rejecting H at level a.01 and 0.2(b) is the one to take in all cases with 8 0 and 8 1 simple.Section 4. we can attach. Specifically. such as the quality control Example 1. there exists a unique smallest c for which a(c) < a.1. Here > cJ. Such tests are said to have level (o/significance) a. the power is 1 minus the probability of type II error. Example 4. in the Bayesian framework with a prior distribution on the parameter. Indeed. the approach of Example 3. This is the critical value we shall use. Ll) Nowa(e) is nonincreasing in c and typically a(c) r 1 as c 1 00 and a(e) 1 0 as c r 00.(S > 6) = 0.P [type 11 error] is usually considered. In that case. Once the level or critical value is fixed.05 critical value 6 and the test has size .(6) = Pe. Then restrict attention to tests that in fact have the probability of rejection less than or equal to a for all () E 8 0 . whereas if 8 E power against (). our test has size a(c) given by a(e) ~ sup{Pe[T(X) > cJ: 8 E eo}· (4.1. the probabilities of type II error as 8 ranges over 8 1 are determined.1. By convention 1 .1. even nominally. See Problem 3. The power is a function of 8 on I.2. 0) is just the probability of type! errot.1.05 are commonly used in practice. k = 6 is given in Figure 4. 8 0 = 0.0) = Pe[Rejection] = Pe[o(X) = 1] = Pe[T(X) If 8 E 8 0 • {3(8.3 and n = 10. 0) is the {3(8. That is. Definition 4.Ok) = P(S > k) A plot of this function for n ~ t( j=k n ) 8i (1 . if our test statistic is T and we want level a.1. then the probability of type I error is also a function of 8. The values a = 0.2. if 0 < a < 1.1. In Example 4.8)n.1. which is defined/or all 8 E 8 by e {3(8) = {3(8.3 (continued).j J = 10.3 with O(X) = I{S > k}. Because a test of level a is also of level a' > a. even though there are just two actions. we find from binomial tables the level 0. If 8 0 is composite as well. it is convenient to give a name to the smallest level of significance of a test. It can be thought of as the probability that the test will "detect" that the alternative 8 holds. It is referred to as the level a critical value. if we have a test statistic T and use critical value c.. Thus. Both the power and the probability of type I error are contained in the power function.
3 to (it > 0.:"::':::'. j 1 .[T(X) > k].8 0.05 onesided test c5k of H : () = 0.. Let Xi be the difference between the time slept after administration of the drug and time slept without administration of the drug by the ith patient. a 67% improvement. We might..5.6 0. 0 Remark 4.1. Suppose that X = (X I..218 Testing and Confidence Regions Chapter 4 1.0473.3 for the B(lO. suppose we want to see if a drug induces sleep.:.t < 0 versus K : J.0 0. I j a(k) = sup{Pe[T(X) > k]: 0 E eo} = Pe. From Figure 4. _l X n are nonnally distributed with mean p.2) popnlation with .1. If we assume XI. It follows that the level and size of the test are unchanged if instead of 80 = {Oo} we used eo = [0. and H is the hypothesis that the drug has no effect or is detrimental. That is.05 test will detect an improvement of the recovery fate from 0.0 Figure 4.3).3770.3 versus K : B> 0. ! • i.1 0..3.0 0 ]. For instance.1 it appears that the power function is increasing (a proof will be given in Section 4. One of the most important uses of power is in the selection of sample sizes to achieve reasonable chances of detecting interesting alternatives.t > O.4. .2 o~31:.3 04 05 0. then the drug effect is measured by p. .5.) We want to test H : J.~==/++_++1'+ 0 o 0.1..3 is the probability that the level 0.7 0. whereas K is the alternative that it has some positive effect. :1 • • Note that in this example the power at () = B1 > 0. OneSided Tests for the Mean ofa Normal Distribution with Known Variance. Example 4. The power is plotted as a function of 0. . for each of a group of n randomly selected patients.8 06 04 0. k ~ 6 and the size is 0. . record sleeping time without the drug (or after the administration of a placebo) and then after some time administer the drug and record sleeping time again. This problem arises when we want to compare two treatments or a treatment and control (nothing) and both treatments are administered to the same subject. I I I I I 1 . What is needed to improve on this situation is a larger sample size n. and variance (T2. We return to this question in Section 4. When (}1 is 0.9 1..3.2 0. 0) family of distributions. I i j . Xn ) is a sample from N (/'..1. .1._~:. this probability is only . Power function of the level 0. (The (T2 unknown case is treated in Section 4.2 is known.
p ~ P[AA]. But it occurs also in more interesting situations such as testing p. Because (J(jL) a(e) ~ is increasing. which generates the same family of critical regions.nX/ (J.(T(X)) doesn't depend on 0 for 0 E 8 0 . sup{(J(p) : p < OJ ~ (J(O) = ~(c). it is natural to reject H for large values of X. eo).1...2 and 4. That is.   = . £0.~(z). 1) distribution.2.P. T(X(lI).9)..2) = 1. critical values yielding correct type I probabilities are easily obtained by Monte Carlo methods.co =. The task of finding a critical value is greatly simplified if £. .1. In all of these cases.o versus p. and we have available a good estimate (j of (j. The power function of the test with critical value c is p " [Vii (X !") > e _ Vii!"] (J (J 1.• T(X(B)). ~ estimates(jandd(~. where T(l) < .co .1.i. < T(B+l) are the ordered T(X). #.eo) = IN~A . . S) inf{d(x. However.1. In Example 4.5).. T(X(l)). This minimum distance principle is essentially what underlies Examples 4.3. has level a if £0 is continuous and (B + 1)(1 . then the test that rejects iff T(X) > T«B+l)(la)).3. .. = P.Section 4. In Example 4. where d is the Euclidean (or some equivalent) distance and d(x. The smallest c for whieh ~(c) < C\' is obtained by setting q:.o if we have H(p. This occurs if 8 0 is simple as in Example 4.I') ~ ~ (c + V.1 Introduction 219 Because X tends to be larger under K than under H.. Here are two examples of testing hypotheses in a nonparametric context in which the minimum distance principle is applied and calculation of a critical value is straightforward. (12) observations with both parameters unknown (the t tests of Example 4. . Rejecting for large values of this statistic is equivalent to rejecting for large values of X.a) is the (1. Given a test statistic T(X) we need to determine critical values and eventually the power of the resulting tests. (Tn) for (j E 8 0 is usually invariance The key feature of situations in which under the action of a group of transformations_ See Lehmann (1997) and Volume II for discussions of this property.~I..V. o The Heuristics of Test Construction When hypotheses are expressed in terms of an estimable parameter H : (j E eo c RP.~ (c .(jo) = (~ (jo)+ where y+ = Y l(y > 0).1.1 and Example 4. N~A is the MLE of p and d(N~A.d. has a closed form and is tabled. It is convenient to replace X by the test statistic T(X) = .( c) = C\' or e = z(a) where z(a) = z(1 . T(X(B)) from £0. if we generate i. it is clear that a reasonable test statistic is d((j.1. the common distribution of T(X) under (j E 8 0 .a) is an integer (Problem 4.5.a) quantile oftheN(O.1.p) because ~(z) (4.1.y) : YES}.3. in any case.
The distribution of D n has been thoroughly smdied for finite and large n. that is.i. . Un' where U denotes the empirical distribution function of Ul u = Fo(x) ranges over (0.)} n (4. ). .Fa. The distribution of D n under H is the same for all continuous Fo. for n > 80..7) that Do. .n {~ n Fo(x(i))' FO(X(i)) _ .. F. + (Tx) is .. F o (Xi). and h n (L224) for" = . < x(n) is the ordered observed sample.1. X n are ij.05. ..1.6. P Fo (D n < d) = Pu (D n < d). In particular.5.3.. h n (L358). The natural estimate of the parameter F(!...(i_l.1 J) where x(l) < . Goodness of Fit to the Gaussian Family.d. This is again a consequence of invariance properties (Lehmann.10 respectively. Also F(x) < x} = n.Xn be i. 1997). F and the hypothesis is H : F = ~ (' (7'/1:) for some M. n. 0 Example 4..1. Set Ui ~ j .L5 rewriting H : F(!' + (Tx) = <p(x) for all x where !' = EF(X . Let F denote the empirical distribution and consider the sup distance between the hypothesis Fo and the plugin estimate of F. where U denotes the U(O.. . and hn(t) = t/( Vri + 0.. . .u[ O<u<l . x It can be shown (Problem 4. 1).1 El{Fo(Xi ) < Fo(x)} n. . and the result follows. the empirical distribution function F.. as X ...1 El{U. and . • . :i' D n = sup [U(u) . then by Problem B.U. . o Note that the hypothesis here is simple so that for anyone of these hypotheses F = F o• the distribution can be simulated (or exhibited in closed fonn). (T2 = VarF(Xtl. Proof. where F is continuous.220 Testing and Confidence Regions Chapter 4 1 . which is called the Kolmogorov statistic. As x ranges over R.. thus.. which is evidently composite. can be wriHen as Dn =~ax max tI. .   Dn = sup IF(x) .. Goodness of Fit Tests. This statistic has the following distributionjree property: 1 Proposition 4. as a tcst statistic . 1). We can proceed as in Example 4. U. 1) distribution. < Fo(x)} = U(Fo(x)) .12 + OJI/Vri) close approximations to the size" critical values ka are h n (L628). Suppose Xl.1 El{Xi ~ U(O. Let X I..d. the order statistics. . • Example 4.1.01. Consider the problem of testing H : F = Fo versus K : F i. What is remarkable is that it is independ~nt of which F o we consider.4.Fo(x)[. In particular.1.
<l'(x)1 where G is the empirical distribution of (L~l"'" Lln ) with Lli (Xi . (Sec Section 8..(3) 0 The pValue: The Test Statistic as Evidence o Different individuals faced with the same testing problem may have different criteria of size. the pvalue is <l'( T(x)) = <l' (.Z) / (~ 2::7 ! (Zi  if) • . Zn arc i. Tn has the same distribution £0 under H. otherwise.(iix u > z(a) or upon applying <l' to both sides if. where X and 0'2 are the MLEs of J1 and we obtain the statistic F(X + ax) Applying the sup distance again. .a) + IJth order statistic among Tn. this difficulty may be overcome by reporting the outcome of the experiment in tenus of the observed size or pvalue or significance probability of the test.3. where Z I. If we observe X = x = (Xl. and only if. .2. If the two experimenters can agree on a common test statistic T.. H is rejected.X) . That is.d. 1. whereas experimenter II accepts H on the basis of the same outcome x of an experiment.1 Introduction 221 (12. a > <l'( T(x)). . Tn sup IF'(X x x + iix) . {12 and is that of ~  (Zi . 1 < i < n. I). under H.i. Therefore.. if X = x. N(O. and only if. It is then possible that experimenter I rejects the hypothesis H. a satisfies T(x) = .. . . Experimenter] may be satisfied to reject the hypothesis H using a test with size a = 0. Example 4. the joint distribution of (Dq .01.1.X)/ii.d. if the experimenter's critical value corresponds to a test of size less than the p~value. But.. Tn!.05. whatever be 11 and (12. we would reject H if.Section 4. from N(O.4. I < i < n.) Thus. This quantity is a statistic that is defined as the smallest level of significance 0' at which an experimenter using T would reject on the basis ofthe observed outcome x. ~n) doesn't depend on fl. H is not rejected. .. .' . 1).. Now the Monte Carlo critical value is the I(B + 1)(1 .<l'(x)1 sup IG(x) . T nB . and the critical value may be obtained by simulating ij. thereby obtaining Tn1 .4) Considered as a statistic the pvalue is <l'( y"nX /u). . (4. Consider. whereas experimenter II insists on using 0' = 0. · · . We do this B times independently. T nB .. then computing the Tn corresponding to those Zi. .. for instance. observations Zi. In)..
.1. ~ H fJ.1.5).6). For example. The pvalue can be thought of as a standardized version of our original statistic.6) I • to test H.5) . and only if. these kinds of issues are currently being discussed under the rubric of datafusion and metaanalysis (e. for miu{ nOo. a(Tr ).1. aCT) has a uniform. if r experimenters use continuous test statistics T 1 . when H is well defined. In this context.. Proposition 4. the largest critical value c for which we would reject is c = T(x). let X be a q dimensional random vector. 1). a(T) is on the unit interval and when H is simple and T has a continuous distribution. that is. then if H is simple Fisher (1958) proposed using l j :i. But the size of a test with critical value c is just a(c) and a(c) is decreasing in c. More generally.1.. n(1 . Thus. Then if we use critical value c. distribution (Problem 4. 1985). The statistic T has a chisquare distribution with 2n degrees of freedom (Problem 4. Thus. (4.).1). I T r to produce p . Various melhods of combining the data from different experiments in this way are discussed by van Zwet and Osterhoff (1967). ' I values a(T. I. 80). !: The pvalue is used extensively in situations of the type we described earlier. •• 1. the smallest a for which we would reject corresponds to the largest c for which we would reject and is just a(T(x)). T(x) > c.222 Testing and Confidence Regions Cnapter 4 In general.1. . We will show that we can express the pvalue simply in tenns of the function a(·) defined in (4..1. The normal approximation is used for the pvalue also.1. . Thus.1. • i~! <:1 im _ . a(8) = p. we would reject H if.(S > s) '" 1. and the pvalue is a( s) where s is the observed value of X... H is rejected if T > Xla where Xl_a is the 1 . It is possible to use pvalues to combine the evidence relating to a given hypothesis H provided by several different independent experiments producing different kinds of data. but K is not. U(O.3. so that type II error considerations are unclear.• see f [' Hedges and Olkin.. . We have proved the following.. • T = ~2 I: loga(Tj) j=l ~ ~ r (4. Suppose that we observe X = x. .2. ''The actual value of p obtainable from the table by interpolation indicates the strength of the evidence against the null hypothesis" (p.<I> ( [nOo(1 ~ 0 )1' 2 s~l~nOo) 0 . This is in agreement with (4.g. Thus.4). to quote Fisher (1958).Oo)} > 5.ath quantile of the X~n distribution. The pvalue is a(T(X)). Similarly in Example 4.
3). L(x. 4. power. (4. In this case the Bayes principle led to procedures based on the simple likelihood ratio statistic defined by L(x. and S tends to be large when K : () = 01 > 00 is . 00 ) = 0. in the binomial example (4.1. subject to this restriction./00)5[(1. under H. critical regions. Typically a test statistic is not given but must be chosen on the basis of its perfonnance.2 CHOOSING A TEST STATISTIC: THE NEYMANPEARSON LEMMA We have seen how a hypothesistesting problem is defined and how perfonnance of a given test b.Otl/(1 . I) = EXi is large. We start with the problem of testing a simple hypothesis H : () = ()o versus a simple alternative K : 0 = 01.00)/00 (1 . Such a test and the corresponding test statistic are called most poweiful (MP). that is. (I . and. we specify a small number 0: and conStruct tests that have at most probability (significance level) 0: of rejecting H (deciding K) when H is true. by convention. o:(Td has a U(O. Summary. the highest possible power.01 )/(1. and pvalue. 00 .Oo)]n. size. 0. OIl where p(x. In the NeymanPearson framework. We introduce the basic concepts and terminology of testing statistical hypotheses and give the NeymanPearson framework. We introduce the basic concepts of simple and composite hypotheses.) which is large when S true. In Sections 3. For instance. that is. test functions. we consider experiments in which important questions about phenomena can be turned into questions about whether a parameter () belongs to 80 or e 1. This is an instance of testing goodness offit.0. type II error.3 we derived test statistics that are best in terms of minimizing Bayes risk and maximum risk. p(x. 00. significance level. we try to maximize the probability (power) of rejecting H when K is true.2 and 3. The statistic L is reasonable for testing H versuS K with large values of L favoring K over H. 1) distribution. The statistic L takes on the value 00 when p(x. test statistics. OIl > 0. or equivalently.2 Choosing a Test Statistic The NeymanPearson lemma 223 The preceding paragraph gives an example in which the hypothesis specifies a distribution completely. is measured in the NeymanPearson theory.00)ln~5 [0. power function. equals 0 when both numerator and denominator vanish. (0. (null) hypothesis H and alternative (hypothesis) K. then.Section 4. OIl = p (x. where eo and e 1 are disjoint subsets of the parameter space 6. we test whether the distribution of X is different from a specified Fo. In particular. 0) 0 p(x. In this section we will consider the problem of finding the level a test that ha<.2. 0) is the density or frequency function of the random vector X. a given test statistic T.)1 5 [(1. type I error.
1) if equality occurs.. i = 0.2.['Pk(X) .05 in Example 4.'P(X)] a. Bo. then ~ .3.7l'). thns. (c) If <p is an MP level 0: test.4) 1 j 7 . If 0 < 'P(x) < 1 for the observation vector x. BIl o .0262 if S = 5. i' ~ E. that is. Bo) = O} = I + II (say). the interpretation is that we toss a coin with probability of heads cp(x) and reject H iff the coin shows heads.3. then <{Jk is MP in the class oflevel a tests.p is a level a test. where 7l' denotes the prior probability of {Bo}. Theorem 4. and because 0 < 'P(x) < 1. (a) Let E i denote E8. we choose 'P(x) = 0 if S < 5.Bo.3). 0 < <p(x) < 1. (NeymanPearson Lemma).05 . Proof. They are only used to show that with randomization.'P(X)] 1{P(X.224 Testing and Confidence Regions Chapter 4 We call iflk a likelihood ratio or NeymanPearsoll (NP) test ifunction) if for some 0 k < OJ we can write the test function Yk as 'Pk () = X < 1 o if L( x.kEo['Pk(X) . likelihood ratio tests are unbeatable no matter what the size a is. and 'P(x) = [0.2) that 'Pk is a Bayes rule with k ~ 7l' /(1 .2. B .Bo. there exists k such that (4. o Because L(x. B . . If L(x. lOY some x. It follows that II > O. then it must be a level a likelihood ratio test.1).P(S > 5)I!P(S = 5) = .k] + E'['Pk(X) . we have shown that I j I i E. 'Pk is MP for level E'c'PdX).kJ}.2) forB = BoandB = B.3) 1 : > O. Note (Section 3. 'Pk(X) = 1 if p(x.k is < 0 or > 0 according as 'Pk(X) is 0 or 1. (a) If a > 0 and I{Jk is a size a likelihood ratio test.B. (b) For each 0 < a < 1 there exists an MP size 0' likelihood ratio test provided that randomization is permitted.) For instance. and suppose r. To this end consider (4.1. Eo'P(X) < a. We show that in addition to being Bayes optimal.Bd >k <k with 'Pdx) any value in (0.2.'P(X)] .'P(X)[L(X. EO'PdX) We want to show E. which are tests that may take values in (0. Note that a > 0 implies k < 00 and. using (4. (4. B.1.'P(X)] [:~~:::l.) . Because we want results valid for all possible test sizes Cl' in [0..1].[CPk(X) . k] . Such randomized tests are not used in practice.('Pk(X) . BI )  . 'P(x) = 1 if S > 5. 1.'P(X)] > kEo['PdX) . then I > O. Bo) = O.'P(X)] = Eo['Pk{X) .2.'P(X)] > O.2. Finally. (See also Section 1. where 1= EO{CPk(X)[L(X.3 with n = 10 and Bo = 0. if want size a ~ .) .. we consider randomized tests '1'.
2.) (1 . Then the posterior probability of (). 0.(X.. 0.. Remark 4. 00 .1.0. 00 . decides 0. 00 ) > and 0 = 00 . OJ. 00 . denotes the Bayes procedure of Exarnple 3.) > then to have equality in (4.2. where v is a known signal. 'Pk(X) = 1 and !. Therefore.10). Here is an example illustrating calculation of the most powerful level a test <{Jk.u 2 ) random variables with (72 known and we test H : 11 = a versus K : 11 = v. or 00 according as 1T(01 Ix) is larger than or smaller than 1/2. X n ) is a sample of n N(j. is 1T(O.4) we need tohave'P(x) ~ 'Pk(X) = I when L(x.) + 1T' (4. 60 .v) + nv ] 2(72 x . If not.2. then 'Pk is MP size a. 0.5) thatthis 0.2(b). 0. 00 . OJ Corollary 4.) (1 . also be easily argued from this Bayes property of 'Pk (Problem 4. 'Pk(X) =a . = 'Pk· Moreover. Let 1T denote the prior probability of 00 so that (I 1T) is the prior probability of (it. Next consider 0 < Q: < 1.5) If 0. 00 ) (11T)L(x. OJ! > k] < a and Po[T. then E9. It follows that (4.2) holds forO = 0. that is.OI)' Proof. 0. for 0 < a < 1.Section 4.2..O. .) + 1Tp(X. 'P(X) > a with equality iffp(" 60 ) = P(·. Consider Example 3.pk is MP size a. 0 It follows from the NeymanPearson lemma that an MP test has power at least as large as its level.2 Choosing a Test Statistic: The NeymanPearson lemma " 225 =:cc (b) If a ~ 0.) > k and have 'P(x) = 'Pk(X) ~ 0 when L(x.I.1T)L(x. See Problem 4. OJ! > k] > If Po[L(X.2.Oo.) = k j.) > k] Po[L(X. If a = 1. We found nv2} V n L(X.) . Part (a) of the lemma can.2. . k = 0 makes E.Oo. 0. { (7 i=l 2(7 Note that any strictly increasing function of an optimal statistic is optimal because the two statistics generate the same family of critical regions. 00 .fit. 0..fit [ logL(X.1. k = 00 makes 'Pk MP size a. tben.O. 0. then there exists k < 00 such that Po[L(X.v)=exp 2LX'2 .k] on the set {x : L(x.) = k] = 0. when 1T = k/(k + 1). = v. OJ! = ooJ ~ 0. 2 (7 T(X) ~. define a. 00 . (c) Let x E {x: p(x. Now 'Pk is MP size a.1.2.2.Po[L(X..7.3. Let Pi denote Po" i = 0./f 'P is an MP level a test.O.. 00 .2. I x) = (1 1T)p(X.2 where X = (XI. 00 .) < k. Because Po[L(X. 0. The same argument works for x E {x : p(x.1T)p(x. 0. we conc1udefrom (4.L. Example 4.2.
8). say. By the NeymanPearsoo lemma this is the largest power available with a level Ct test. then this test rule is to reject H if Xl is large. O)T and Eo = I.' Rejecting H for L large is equivalent to rejecting for I 1 .) test exists and is given by: Reject if (4.226 Testing and Confidence Regions Chapter 4 is also optimal for this problem. From our discussion there we know that for any specified Ct. then. (JI correspond to two known populations and we desire to classify a new observation X as belonging to one or the other.2. ). We will discuss the phenomenon further in the next section. T>z(la) (4. if we want the probability of detecting a signal v to be at least a preassigned value j3 (say. Eo are known. . Simple Hypothesis Against Simple Alternative for the Multivariate Normal: Fisher's Discriminant Flmction.. It is used in a classification context in which 9 0 .7) where c ~ z(1 . and only if. 0./ii/(J)) = 13 for n and find that we need to take n = ((J Iv j2[z(la) + z(I3)]'.90 or .2. The power of this test is.. Suppose X ~ N(Pj. this is no longer the case (Problem 4. that the UMP test phenomenon is largely a feature of onedimensional parameter problems. If this is not the case. Note that in general the test statistic L depends intrinsically on ito. the test that rejects if. a UMP (for all ). I I .4. • • . among other things. Example 4. they are estimated with their empirical versions with sample means estimating population means and sample covariances estimating population CQvanances.2.6) that is MP for a specified signal v does not depend on v: The same test maximizes the power for all possible signals v > O.2. itl = ito + )"6. large.2. . Such a test is called uniformly most powerful (UMP).2.2.2)./iil (J))..6) has probability of type I error 0:.. Thus. j = 0. This is the smallest possible n for any size Ct test. In this example we bave assnmed that (Jo and (J.Jto)E01X large. 0 An interesting feature of the preceding example is that the test defined by (4. 6. if Eo #.. <I>(z(a) + (v. But T is the test statistic we proposed in Example 4. .1. Particularly important is the case Eo = E 1 when "Q large" is equivalent to "F = (ttl .0. then we solve <1>( z(a) + (v.9). E j ). The following important example illustrates. (Jj = (Pj. The likelihood ratio test for H : (J = 6 0 versus K : (J = (h is based on ." The function F is known as the Fisher discriminant function.. We return to this in Volume II. if ito. l.I. itl' However if. by (4.a)[~6Eo' ~oJ! (Problem 4. > 0 and E l = Eo. If ~o = (1.95). for the two popnlations are known.1. . 0. however. E j ). .
k are given by Ow. then (N" .M( n..1) test !. where nl.)N' i=l to Here is an interesting special case: Suppose OjO integer I with 1 < I < k > 0 for all}. Ok = OkO. However. .p.1. . 0 < € < 1 and for some fixed (4. Before we give the example. 'P') > (3(0. . 1 (}k) distribution with frequency function. .. Example 4.1...3 UNIFORMLY MOST POWERFUL TESTS ANO MONOTONE LIKELIHOOD RATIO MODELS We saw in the two Gaussian examples of Section 4.. if a match in a genetic breeding experiment can result in k types.2) where . A level a test 'P' is wtiformly most powerful (UMP) for H : 0 E versus K : 0 E 8 1 if eo (3(0. r lUI nt···· nk· ".3. Nk) ..nk. there is sometimes a simple alternative theory K : (}l = (}ll. The NeymanPearson lemma. .. N k ) has a multinomial M(n. .. .3.u k nn. . 0" . .2 that UMP tests for onedimensional parameter problems exist.2 for deciding between 00 and Ol.'P) for all 0 E 8" for any other level 0:' (4. . We note the connection of the MP test to the Bayes procedure of Section 3. Suppose that (N1 . n offspring are observed.3.. . here is the general definition of UMP: Definition 4. 0) P n! nn...···. _ (nl.OkO' Usually the alternative to H is composite. . .. We introduce the simple likelihood ratio statistic and simple likelihood ratio (SLR) test for testing the simple hypothesis H : (j = ()o versus the simple alternative K : 0 = ()1. With such data we often want to test a simple hypothesis H : (}I = Ow.. 4... For instance. and N i is the number of offspring of type i.Section 4. ."" (}k = Oklo In this case. Ok)' The simple hypothesis would correspond to the theory that the expected proportion of offspring of types 1. nk are integers summing to n. which states that the size 0:' SLR test is uniquely most powerful (MP) in the class of level a tests. This phenomenon is not restricted to the Gaussian case as the next example illustrates.3.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 227 Summary. Two examples in which the MP test does not depend on 0 1 are given. . is established. (}l. Such tests are said to be UMP (unifonnly most powerful). the likelihood ratio L 's L~ rr(:. Testing for a Multinomial Vector.
we conclude that the MP lest rejects H. = . 0) ~. J Definition 4. Consider the oneparameter exponential family mode! o p(x. e c R. 0 Typically the MP test of H : () = eo versus K : () = (}1 depends on (h and the test is not UMP. Because dt does not depend on ()l. . Then L = pnN1£N1 = pTl(E/p)N1. is an MLRfamily in T(x). . : 8 E e}. where u is known.. 0 . (1) For each t E (0.228 Testmg and Confidence Regions Chapter 4 That is.. Example 4. : 0 E e} with e c R is said to be a monotone likelihood ratio (MLR) family if for (it < O the distributions POl and P0 2 are distinct and 2 the ratio p(x.3) with 6.. in the case of a real parameter. (2) If E'o6. 00 .OW and the model is by (4.o)n[O/(1 . Example 4. X ifT(x) < t ° (4.3.O) = 0'(1 . . under the alternative. Example 4. Because f. Note that because l can be any of the integers 1 .o)n. If {P. B( n. < 1 implies that p > E. is an MLRfamily in T(x).. this test is UMP fortesting H versus K : 0 E 8 1 = {/:/ : /:/ . Bernoulli case. N 1 < c.nx/u and ry(!") Define the NeymanPearson (NP) test function 6 ( ) _ 1 ifT(x) > t . . e c R.3. we get radically different best tests depending on which Oi we assume to be (ho under H. if and only if. However..3.1.d.3.nu)!". 0 form with T(x) .3.1. for testing H : 8 < /:/0 versus K:O>/:/I' . then 6. :. type I is less frequent than under H and the conditional probabilities of the other types given that type I has not occurred are the same under K as they are under H.i. set s then = l:~ 1 Xi.2. equals the likelihood ratio test 'Ph(t) and is MP.1) MLR in s. This is part of a general phenomena we now describe. = h(x)exp{ry(O)T(x) ~ B(O)}. Oil is an increasing function ofT(x).. If 1J(O) is strictly increasing in () E e. Critical values for level a are easily determined because Nl . there is a statistic T such that the test with critical region {x : T(x) > c} is UMP. In this i. . The family of models {P.2) with 0 < f < I}.(x) any value in (0. . = P( N 1 < c). k. 6.3. Moreover.3. the power function (3(/:/) = E. p(x..(X) is increasing in 0. then L(x.1 is of this (. Thus. we have seen three models where. = (I . for".6.(X) = '" > 0. it is UMP at level a = EOodt(X) for testing H : e = eo versus K : B > Bo in fact..: 0 E e}. Consider the problem of testing H : 0 = 00 versus K: 0 ~ 01 with 00 < 81. is of the form (4.2 (Example 4. . Oil = h(T(x)) for some increasing function h. . 02)/p(X.00). Theorem 4. is UMP level".2.1) if T(x) = t. .. then this family is MLR.3 continned).2. Ow) under H. Suppose {P.
we test H : u > Uo l versus K : u < uo. then by (1). If 0 < 00 . where h(a) is the ath quantile of the hypergeometric. .l of the measurements Xl. the critical constant 8(0') is u5xn(a).3. if N0 1 = b1 < bo and 0 < x < b1 . and specifies an 0' such that the probability of rejecting H (keeping a bad lot) is at most 0'. distribution. _1')2. If the distributionfunction Fo ofT(X) under X"" POo is continuous and ift(1a) isasolution of Fo(t) = 1 . Then.. Testing Precision.l. and only if. (X) < <> and b t is of level 0: for H : (J :S (Jo. Corollary 4. If a is a value taken on by the distribution of X.. d t is UMP for H : (J < (Jo versus K : (J > (Jo. 20' 2  This is a oneparameter exponential family and is MLR in T = So The UMP level 0' test rejects H if and only if 8 < s(a) where s(a) is such that Pa .(X) for testing H: 0 = 01 versus J( : 0 = 0.l) yields e L( 0 0 )=b. 0 Proof (I) follows from b l = iPh(t) The following useful result follows immediately. Suppose {Po: 0 E Example 4. . (l. . she formulates the hypothesis H as > (Jo. e c p(x. Suppose tha~ as in Example l. Quality Control.l.. H.2. and we are interested in the precision u. the alternative K as (J < (Jo. where b = N(J.Xn . where U O represents the minimum tolerable precision. recall that we have seen that e5 t maximizes the power for testing II : (J = (Jo versus K : (J > (Jo among the class of tests with level <> ~ Eo. .4. 1 bo(bol) (blX+l)(Nbl) (box+1)(Nbo) (Nbln+x+l) (Nbon+x+l)' .<» is lfMP level afor testing H: (J < (Jo versus K : (J > (Jo. Thus..3. N.Section 4.(X).(bl1) x. is an MLRfamily in T(r).U 2 ) population. distribution.Xn is a sample from a N(I1.1.3.bo > n. and because dt maximizes the power over this larger class. then e}.5. Let S = l:~ I (X.l. Eoo. If the inspector making the test considers lots with bo = N(Jo defectives or more unsatisfactory. Because the class of tests with level 0' for H : (J < (Jo is contained in the class of tests with level 0' for H : (J = (Jo. Suppose X!. For simplicity suppose that bo > n.o.<>. X < h(a). 0. we could be interested in the precision of a new measuring instrument and test it by applying it to a known standard. X is the observed number of defectives in a sample of n chosen at random without replacement from a lot of N items containing b defectives. where 11 is a known standard. then the test/hat rejects H if and only ifT(r) > t(1 . N . For instance.( NOo. 0 Example 4. To show (2).o. is UMP level a. Because the most serious error is to judge the precision adequate when it is not. Ifwe write ~= Uo t i''''l (Xi _1')2 000 we see that Sju5 has a X~ distribution.3 Uniformly Most Po~rful Tests and Monotone likelihood Ratio Models 229 and Corollary 4.. 0) = exp {~8 ~ IOg(27r0'2)} . n).1 by noting that for any (it < (J2' e5 t is MP at level Eo. R. (8 < s(a)) = a. where xn(a) is the ath quantile of the X. we now show that the test O· with reject H if.
if all points in the alternative are of equal significance: We can find > 00 sufficiently close to 00 so that {3( 0) is arbitrarily close to {3(00) = a. the appropriate n is obtained by solving i' . in general. we choose the critical constant so that the maximum probability of falsely rejecting the null hypothesis H is small. 0 Power and Sample Size In the NeymanPearson framework we choose the test whose size is small.Oo. in (O .1.:(x:.. ~) would be our indifference region. This continuity of the power shows that not too much significance can be attached to acceptance of H. Thus. this is a general phenomenon in MLR family models with p( x.I. Therefore. This is possible for arbitrary /3 < 1 only by making the sample size n large enough.x) <I L(x. By CoroUary 4.4 because (3(p.'..c:O". ". we specify {3 close to I and would like to have (3(!') > (3 for aU !' > t.. forO <:1' < b1 1.1. not possible for all parameters in the alternative 8. In our nonnal example 4. we want the probability of correctly detecting an alternative K to be large.. Off the indifference region.. (0. we would also like large power (3(0) when () E 8 1 . In Example 4.230 NotethatL(x.ntI" = z({3) whose solution is il . as seen in Figure 4. i. I' .a.. we want guaranteed power as well as an upper bound on the probability of type I error.1. . I i (3(t) ~ 'I>(z(a) + .L. For such the probability of falsely accepting H is almost 1 .OIl box (Nn+I)(blx) .(}o. {3(0) ~ a. ~) for some small ~ > 0 because such improvements are negligible. 0) continuous in 0.1 and formula (4.1. and the powers are continuous increasing functions with limOlO. This is not serious in practice if we have an indifference region. Thus.. The critical values for the hypergeometric distribution are available on statistical calculators and software.. That is. L is decreasing in x and the hypergeometric model is an MLR family in T( x) = r. ° ° .O'"O. 1 I .3. In both these cases.) is increasing. It follows that 8* is UMP level Q. On the other hand.x) (N .1.B 1 ) =0 forb l Testing and Confidence Regions Chapter 4 < X <n. that is. H and K are of the fonn H : () < ()o and K : () > ()o.ntI") for sample size n.' . In our example this means that in addition to the indifference region and level a. Note that a small signaltonoise ratio ~/ a will require a large sample size n.+..(b o . .Il = (b l .4 we might be uninterested in values of p. this is.2).n + I) . However. This equation is equiValent to = {3 z(a) + . This is a subset of the alternative on which we are willing to tolerate low power..
00 )] 1/2 .3 requires approximately 163 observations to have probability . Example 4.2) shows that.4. Now let .282 x 0. They often reduce to adjusting the critical value so that the probability of rejection for parameter value at the boundary of some indifference region is 0:.35 and n = 163 is 0.4. that is.) = (3 for n and find the approximate solution no+ 1 .1.0.05)2{1. we find (3(0) = PotS > so) = <I> ( [nO(l ~ 0)11/ 2 Now consider the indifference region (80 .55)}2 = 162. ifn is very large and/or a is small. It is natural to associate statistical significance with practical significance so that a very low pvalue is interpreted as evidence that the alternative that holds is physically significant. and 0.05.35.3 continued).. we next show how to find the sample size that will "approximately" achieve desired power {3 for the size 0: test in the binomial example.645 x 0.35. There are various ways of dealing with this problem.80 ) .3.6 (Example 4.Section 4.4. This problem arises particularly in goodnessoffit tests (see Example 4. As a further example and precursor to Section 5. where (3(0. test for = .90 of detecting the 17% increase in 8 from 0.3 Uniformly Most Powerful Tests and Monotone Likelihood Ratio Models 231 Dual to the problem of not having enough power is that of having too much. when we test the hypothesis that a very large sample comes from a particular distribution. Often there is a function q(B) such that H and K can be formulated as H : q(O) <: qo and K : q(O) > qo.1.35(0.1. 0 ° Our discussion can be generalized. far from the hypothesis." The reason is that n is so large that unimportant small discrepancies are picked up. we fleed n ~ (0. we can have very great power for alternatives very close to O. = 0. using the SPLUS package) for the level . Suppose 8 is a vector. > O.Oi) [nOo(1 . 1.5). if Oi = .90.7) + 1. Our discussion uses the classical normal approximation to the binomial distribution. 00 = 0. First.4) and find the approximate critical value So ~ nOo 1 + 2 + z(l.3 to 0. Formula (4. Such hypotheses are often rejected even though for practical purposes "the fit is good enough. (h = 00 + Dt.1. In Example 4. Again using the nonnal approximation. The power achievable (exactly.86.3.3(0. the size . (3 = . and only if. Thus. to achieve approximate size a. We solve For instance. Dt.05 binomial test of H : 8 = 0. Bd.4 this would mean rejecting H if. we solve (3(00 ) = Po" (S > s) for s using (4.
It may also happen that the distribution of Tn as () ranges over 9.3.3.4) j j 1 .1) 1(0..< (10 and rejecting H if Tn is small. for instance. This procedure can be applied.. for 9. However.(0) > O} and Co (Tn) = C'(') (Tn) for all O. 0 E 9. a E A = {O.< (10 among all tests depending on <. = {O : ).+ 00..(0) so that 9 0 = {O : ). 0 E 9. For instance. is detennined by a onedimensional parameter ). 00).l)l(O.7. I}. a reasonable class of loss functions are those that satisfy : 1 i j 1 1(0.1 by taking q( 0) equal to the noncentrality parameter governing the distribution of the statistic under the alternative. we illustrate what can happen with a simple example. Thus.1. .. In general. To achieve level a and power at least (3.Xf as in Example 2.5 that a particular test statistic can have a fixed distribution £0 under the hypothesis. when testing H : 0 < 00 versus K : B > Bo.. the critical value for testing H : 0.l' independent of IJ. to the F test of the linear model in Section 6. Testing Precision Continued. We have seen in Example 4. 0) = (B . .Bo). 0 = Complete Families of Tests The NeymanPearson framework is based on using the 01 loss function.2 is (}'2 = ~ E~ 1 (Xi . Suppose that j3(O) depends on () only through q( (J) and is a continuous increasing function of q( 0). and also increases to 1 for fixed () E 6 1 as n .(Tn ) is an MLR family. = (tm. J. is the a percentile of X~l' It is evident from the argument of Example 4.232 Testing and Confidence Regions Chapter 4 q..2 only.0) > ° I(O. The set {O : qo < q(O) < ql} is our indjfference region.O)<O forB < 00 forB>B o. Suppose that in the Gaussian model of Example 4. then rejecting for large values of Tn is UMP among all tests based on Tn. Implicit in this calculation is the assumption that POl [T > col is an increasing function ofn.4. We may ask whether decision procedures other than likelihood ratio tests arise if we consider loss functions l(O. > qo be a value such that we want to have power fJ(O) at least fJ when q(O) > q. (4.(0) = o} aild 9. Example 4.£ is unknown. Then the MLE of 0. first let Co be the smallest number c such that Then let n be the smallest integer such that P" IT > col > fJ where 00 is such that q( 00 ) = qo and 0. I . we may consider l(O. the distribution of Tn ni:T 2/ (15 is X.3..= (10 versus K : 0. .= 00 is now composite. For each n suppose we have a level a test for H versus I< based on a suitable test statistic T. a). Reducing the problem to choosing among such tests comes from invariance consideration that we do not enter into until Volume II. Although H : 0. The theory we have developed demonstrates that if C.3.2. is such that q( Oil = q.3 that this test is UMP for H : (1 > (10 versus K : 0.9. that are not 01.
I"(X) 0 for allO then O=(X) clearly satisfies (4. E.(X) = ". We also show how.(X) > 1. 1) 1(0.(Io.(X) = E"I"(X) > O.3.12) and. Intervals.(x)) . 1") = (1(0. INTERVALS. Proof.E. the test that rejects H : 8 < 8 0 for large values of T(x) is UMP for K : 8 > 8 0 .R(O.4 Confidence Bounds.(X) be such that. is an MLR family in T( x) and suppose the loss function 1(0.5).I"(X) for 0 < 00 .(X)) = 1. In such situations we show how sample size can be chosen to guarantee minimum power for alternatives a given distance from H.5) holds for all 8.2. We consider models {Po : () E e} for which there exist tests that are most powerful for every () in a composite alternative t (UMP tests). the class of NP tests is complete in the sense that for loss functions other than the 01 loss function. hence.6) But 1 .O)]I"(X)} Let o. the risk of any procedure can be matched or improved by an NP test. e c R.(I"(X))) < 0 for 8 > 8 0 .4 CONFIDENCE BOUNDS.1 and. e 4. is complete.5) That is. and Regions 233 if for any decision rule 'P The class D of decision procedures is said to be there exists E such that complete(I). 1") for all 0 E e. = R(O.3. If E.E.o. 0 Summary. a model is said to be monotone likelihood ratio (MLR) if the simple likelihood ratio statistic for testing ()o versus 8 1 is an increasing function of a statistic T( x) for every ()o < ()1. (4. a) satisfies (4. Now 0.3.) . O))(E.3. 1) 1(O. For MLR models.4).Section 4. For () real.O) + [1(0. (4.I") ~ = EO{I"(X)I(O.3.E. AND REGIONS We have in Chapter 2 considered the problem of obtaining precise estimates of parameters and we have in this chapter treated the problem of deciding whether the parameter {J i" a .(o.O)} E. locally most powerful (LMP) tests in some cases can be found.3. then the class of tests of the form (4.(2) a v R(O. then any procedure not in the complete class can be matched or improved at all () by one in the complete class. 0 < " < 1.{1(8.3. Thus. if the model is correct and loss function is appropriate.0. is similarly UMP for H : 0 > 8 0 versus K : 0 < 00 (Problem 4. we show that for MLR models. In the following the decision procedures arc test functions.o. when UMP tests do not exist. it isn't worthwhile to look outside of complete classes.o) < R(O.3. Thus. for some 00 • E"o. Finally. hence. Suppose {Po: 0 E e}. The risk function of any test rule 'P is R(O. is UMP for H : 0 < 00 versus K: 8 > 8 0 by Theorem 4. (4. Theorem 4. 1) + [1 1"(X)]I(O.3) with Eo.3.
member of a specified set Θ0. Now we consider the problem of giving confidence bounds, intervals, or sets that constrain the parameter with prescribed probability 1 − α.

As an illustration, consider the measurement model in which X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known, and suppose that μ represents the mean increase in sleep among patients administered a drug. We can use the experimental outcome X = (X1, ..., Xn) to establish a lower bound μ̲(X) for μ with a prescribed probability (1 − α) of being correct. In the non-Bayesian framework, μ is a constant, and we look for a statistic μ̲(X) that satisfies P(μ̲(X) ≤ μ) = 1 − α, with 1 − α equal to .95 or some other desired level of confidence. In our example this is achieved by writing

  P(√n(X̄ − μ)/σ ≤ z(1 − α)) = 1 − α.

By solving the inequality inside the probability for μ, we find

  P(X̄ − σz(1 − α)/√n ≤ μ) = 1 − α.

That is, μ̲ = X̄ − σz(1 − α)/√n is a lower bound with P(μ̲ ≤ μ) = 1 − α; we say that μ̲(X) is a lower confidence bound with confidence level 1 − α.

Similarly, we may be interested in an upper bound on a parameter. In the N(μ, σ²) example this means finding a statistic μ̄(X) such that P(μ̄(X) ≥ μ) = 1 − α, and a solution is μ̄(X) = X̄ + σz(1 − α)/√n. Here μ̄(X) is called an upper level (1 − α) confidence bound for μ.

Finally, in many situations where we want an indication of the accuracy of an estimator, we want both lower and upper bounds. That is, we want to find a such that the probability that the interval [X̄ − a, X̄ + a] contains μ is 1 − α. We find such an interval by noting that

  P(−z(1 − α/2) ≤ √n(X̄ − μ)/σ ≤ z(1 − α/2)) = 1 − α

and solving the inequality inside the probability for μ. This gives

  μ±(X) = X̄ ± σz(1 − α/2)/√n,

and we say that [μ−(X), μ+(X)] is a level (1 − α) confidence interval for μ.

In general, if ν = ν(P), P ∈ P, is a parameter and X ~ P, X ∈ R^q, it may not be possible for a bound or interval to achieve exactly probability (1 − α) for a prescribed (1 − α) such as .95. In such cases, we settle for a probability at least (1 − α).
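For readers who want to compute these quantities, here is a minimal sketch in Python (the text itself gives no code; the function name and the use of scipy's normal quantile routine are our illustrative assumptions):

import numpy as np
from scipy.stats import norm

def normal_mean_bounds(x, sigma, alpha=0.05):
    # Level (1 - alpha) LCB, UCB, and two-sided interval for mu when sigma
    # is known, following the displays above.
    n = len(x)
    xbar = np.mean(x)
    se = sigma / np.sqrt(n)
    lcb = xbar - norm.ppf(1 - alpha) * se          # lower confidence bound
    ucb = xbar + norm.ppf(1 - alpha) * se          # upper confidence bound
    half = norm.ppf(1 - alpha / 2) * se
    return lcb, ucb, (xbar - half, xbar + half)    # interval [mu-, mu+]

For instance, normal_mean_bounds(x, sigma=1.0, alpha=0.05) returns the .95 bounds and interval.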
Definition 4.4.1. A statistic ν̲(X) is called a level (1 − α) lower confidence bound (LCB) for ν if for every P ∈ P,

  P[ν̲(X) ≤ ν] ≥ 1 − α.

Similarly, ν̄(X) is a level (1 − α) upper confidence bound (UCB) for ν if for every P ∈ P,

  P[ν̄(X) ≥ ν] ≥ 1 − α.

The random interval [ν̲(X), ν̄(X)] formed by a pair of statistics ν̲(X), ν̄(X) is a level (1 − α) or 100(1 − α)% confidence interval for ν if, for all P ∈ P,

  P[ν̲(X) ≤ ν ≤ ν̄(X)] ≥ 1 − α.

The quantities on the left are called the probabilities of coverage, and (1 − α) is called a confidence level. For a given bound or interval, the confidence level is clearly not unique, because any number (1 − α′) ≤ (1 − α) will be a confidence level if (1 − α) is. In order to avoid this ambiguity, it is convenient to define the confidence coefficient to be the largest possible confidence level. Note that in the case of intervals this is just

  inf{P[ν̲(X) ≤ ν ≤ ν̄(X)] : P ∈ P}

(i.e., the minimum probability of coverage). For the normal measurement problem we have just discussed, the probability of coverage is independent of P and equals the confidence coefficient.

Example 4.4.1. The (Student) t Interval and Bounds. Let X1, ..., Xn be a sample from a N(μ, σ²) population, and assume initially that σ² is known. In the preceding discussion we used the fact that Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution to obtain a confidence interval for μ by solving −z(1 − α/2) ≤ Z(μ) ≤ z(1 − α/2) for μ. In this process Z(μ) is called a pivot. In general, finding confidence intervals (or bounds) often involves finding appropriate pivots.

Now we turn to the case in which σ² is unknown and propose the pivot T(μ) obtained by replacing σ in Z(μ) by its estimate s, where

  s² = (1/(n − 1)) Σᵢ (Xi − X̄)².

To use T(μ) = √n(X̄ − μ)/s as a pivot we will need its distribution. Now Z(μ) = √n(X̄ − μ)/σ has a N(0, 1) distribution and is, by the results of Section B.3, independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution. We conclude from the definition of the (Student) t distribution in Section B.3 that

  T(μ) = Z(μ) / √(V/(n − 1))
has the t distribution T_{n−1}, whatever be μ and σ². Let t_k(p) denote the pth quantile of the T_k distribution. Then

  P(−t_{n−1}(1 − α/2) ≤ T(μ) ≤ t_{n−1}(1 − α/2)) = 1 − α,

and, solving the inequality inside the probability for μ, we find

  P[X̄ − s t_{n−1}(1 − α/2)/√n ≤ μ ≤ X̄ + s t_{n−1}(1 − α/2)/√n] = 1 − α.     (4.4.1)

The shortest level (1 − α) confidence interval of the type [X̄ − sc/√n, X̄ + sc/√n] is thus obtained with c = t_{n−1}(1 − α/2). Similarly, X̄ − s t_{n−1}(1 − α)/√n and X̄ + s t_{n−1}(1 − α)/√n are natural lower and upper confidence bounds with confidence coefficients (1 − α).

To calculate the coefficients t_{n−1}(1 − α/2) and t_{n−1}(1 − α), we use a calculator, computer software, or Tables I and II. For instance, if n = 9 and α = 0.01, we enter Table II to find that the probability that a T_{n−1} variable exceeds 3.355 is .005. Hence, t_{n−1}(1 − α/2) = 3.355 and [X̄ − 3.355s/3, X̄ + 3.355s/3] is the desired level 0.99 confidence interval. From the results of Section B.7 (see Problem B.7.12), we see that as n → ∞ the T_{n−1} distribution converges in law to the standard normal distribution. For the usual values of α, we can reasonably replace t_{n−1}(p) by the standard normal quantile z(p) for n > 120.

Up to this point, we have assumed that X1, ..., Xn is a sample from a N(μ, σ²) population. It turns out that the distribution of the pivot T(μ) is fairly close to the T_{n−1} distribution if the X's have a distribution that is nearly symmetric and whose tails are not much heavier than the normal. In this case the interval (4.4.1) has confidence coefficient close to 1 − α. On the other hand, for very skew distributions such as the χ² with few degrees of freedom, or very heavy-tailed distributions such as the Cauchy, the confidence coefficient of (4.4.1) can be much larger than 1 − α. The properties of confidence intervals such as (4.4.1) in non-Gaussian situations can be investigated using the asymptotic and Monte Carlo methods introduced in Chapter 5. □

Example 4.4.2. Confidence Intervals and Bounds for the Variance of a Normal Distribution. Suppose that X1, ..., Xn is a sample from a N(μ, σ²) population. Then, by the results of Section B.3, V(σ²) = (n − 1)s²/σ² has a χ²_{n−1} distribution and can be used as a pivot. Thus, if we let x_{n−1}(p) denote the pth quantile of the χ²_{n−1} distribution, and if α1 + α2 = α, then

  P(x(α1) ≤ V(σ²) ≤ x(1 − α2)) = 1 − α.

By solving the inequality inside the probability for σ², we find that

  [(n − 1)s²/x(1 − α2), (n − 1)s²/x(α1)]

is a confidence interval with confidence coefficient (1 − α). The length of this interval is random. There is a unique choice of α1 and α2 that uniformly minimizes expected length among all intervals of this type; it may be shown that for n large, taking α1 = α2 = α/2 is not far from optimal (Tate and Klett, 1959). The pivot V(σ²) similarly yields the lower and upper confidence bounds (n − 1)s²/x(1 − α) and (n − 1)s²/x(α). □
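As a computational companion to Examples 4.4.1 and 4.4.2, the following Python sketch evaluates the t interval (4.4.1) and the equal-tailed variance interval; the function names are illustrative and scipy's quantile routines are assumed available:

import numpy as np
from scipy.stats import t, chi2

def t_interval(x, alpha=0.05):
    # Level (1 - alpha) interval (4.4.1) for mu with sigma^2 unknown.
    n = len(x)
    xbar, s = np.mean(x), np.std(x, ddof=1)
    c = t.ppf(1 - alpha / 2, df=n - 1)
    return xbar - c * s / np.sqrt(n), xbar + c * s / np.sqrt(n)

def variance_interval(x, alpha=0.05):
    # Equal-tailed (alpha1 = alpha2 = alpha/2) interval for sigma^2 based on
    # the pivot V(sigma^2) = (n - 1)s^2 / sigma^2 ~ chi-square(n - 1).
    n = len(x)
    ss = (n - 1) * np.var(x, ddof=1)
    return ss / chi2.ppf(1 - alpha / 2, df=n - 1), ss / chi2.ppf(alpha / 2, df=n - 1)

With n = 9 and α = 0.01, t.ppf(0.995, 8) returns 3.355, matching the table lookup above.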
In contrast to Example 4.4.1, if we drop the normality assumption, the confidence interval and bounds for σ² do not have confidence coefficient 1 − α even in the limit as n → ∞. Asymptotic methods and Monte Carlo experiments as described in Chapter 5 have shown that the confidence coefficient may be arbitrarily small depending on the underlying true distribution, which typically is unknown.

The method of pivots works primarily in problems related to sampling from normal populations. If we consider "approximate" pivots, the scope of the method becomes much broader. We illustrate by an example.

Example 4.4.3. Approximate Confidence Bounds and Intervals for the Probability of Success in n Bernoulli Trials. If X1, ..., Xn are the indicators of n Bernoulli trials with probability of success θ, then X̄ is the MLE of θ. There is no natural "exact" pivot based on X̄ and θ. However, by the De Moivre–Laplace theorem, √n(X̄ − θ)/√(θ(1 − θ)) has approximately a N(0, 1) distribution. If we use this function as an "approximate" pivot and let ≈ denote "approximate equality," we can write

  P[ |√n(X̄ − θ)/√(θ(1 − θ))| ≤ z(1 − α/2) ] ≈ 1 − α.

Let k_α = z(1 − α/2) and observe that this is equivalent to

  P[(X̄ − θ)² − (k²_α/n) θ(1 − θ) ≤ 0] ≈ 1 − α,

that is,

  P[g(θ, X̄) ≤ 0] = P[θ̲(X) ≤ θ ≤ θ̄(X)] ≈ 1 − α,     (4.4.3)

where

  g(θ, X̄) = (1 + k²_α/n) θ² − (2X̄ + k²_α/n) θ + X̄².

For fixed 0 < X̄ < 1, g(θ, X̄) is a quadratic polynomial with two real roots, and because the coefficient of θ² in g(θ, X̄) is greater than zero, {θ : g(θ, X̄) ≤ 0} is the interval between them. In terms of S = nX̄, the roots are(1)

  θ̲(X) = {S + k²_α/2 − k_α √([S(n − S)/n] + k²_α/4)} / (n + k²_α),
  θ̄(X) = {S + k²_α/2 + k_α √([S(n − S)/n] + k²_α/4)} / (n + k²_α),     (4.4.4)
so that [θ̲(X), θ̄(X)] is an approximate level (1 − α) confidence interval for θ. We can similarly show that the endpoints of the level (1 − 2α) interval are approximate upper and lower level (1 − α) confidence bounds. These intervals and bounds are satisfactory in practice for the usual levels if the smaller of nθ, n(1 − θ) is at least 6. For small n, it is better to use the exact level (1 − α) procedure developed in Section 4.5. In the problems we give an interval with correct limiting coverage probability; a discussion is given in Brown, Cai, and Das Gupta (2000).

Note that in this example we can determine the sample size needed for desired accuracy. For instance, consider the market researcher whose interest is the proportion θ of a population that will buy a product. He draws a sample of n potential customers, calls willingness to buy success, and uses the preceding model. He can then determine how many customers should be sampled so that the interval (4.4.4) has a prescribed length, say 0.02. To see this, note that the length, say l, of the interval is

  l = 2 k_α {√([S(n − S)/n] + k²_α/4)} (n + k²_α)^{−1}.

Now use the fact that (S − n/2)² ≥ 0 implies S(n − S)/n ≤ n/4 to conclude that

  l ≤ k_α √(n + k²_α) (n + k²_α)^{−1} = k_α (n + k²_α)^{−1/2}.     (4.4.5)

Thus, to bound l above by l0, we can choose n so that

  n = k²_α / l²_0 − k²_α.     (4.4.6)

With α = 0.05 and l0 = 0.02 we have k_α = 1.96, and we can achieve the desired length 0.02 by choosing n so that

  n = (1.96/0.02)² − (1.96)² ≈ 9600,

or n = 9601. This formula for the sample size is very crude because the bound (4.4.5) is used, and it is only good when θ is near 1/2. Better results can be obtained if one has upper or lower bounds on θ such as θ ≤ θ0 < 1/2 or θ ≥ θ1 > 1/2; see the problems.

Another approximate pivot for this example is √n(X̄ − θ)/√(X̄(1 − X̄)). This leads to the simple interval

  X̄ ± k_α √(X̄(1 − X̄)/n).     (4.4.7)

See Brown, Cai, and Das Gupta (2000) for a discussion. □
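A sketch of the approximate interval (4.4.4) and the crude sample-size rule in Python (names illustrative; the bound used is S(n − S)/n ≤ n/4, as in the derivation above):

import numpy as np
from scipy.stats import norm

def approx_binomial_interval(S, n, alpha=0.05):
    # The two roots of g(theta, xbar) = 0, i.e. the endpoints in (4.4.4).
    k = norm.ppf(1 - alpha / 2)
    center = S + k**2 / 2
    half = k * np.sqrt(S * (n - S) / n + k**2 / 4)
    return (center - half) / (n + k**2), (center + half) / (n + k**2)

def crude_sample_size(l0, alpha=0.05):
    # n = k^2 / l0^2 - k^2; crude, and only good for theta near 1/2.
    k = norm.ppf(1 - alpha / 2)
    return int(np.ceil(k**2 / l0**2 - k**2))

crude_sample_size(0.02) returns 9600 (and 9601 if k is rounded to 1.96, as in the text).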
Confidence Regions for Functions of Parameters

We can define confidence regions for a function q(θ) as random subsets of the range of q that cover the true value of q(θ) with probability at least (1 − α). We write this as P[q(θ) ∈ I(X)] ≥ 1 − α. Note that if C(X) is a level (1 − α) confidence region for θ, then q(C(X)) = {q(θ) : θ ∈ C(X)} is a level (1 − α) confidence region for q(θ). If q is not 1–1, however, this technique is typically wasteful; that is, we can find confidence regions for q(θ) entirely contained in q(C(X)) with confidence level (1 − α). For instance, if q(θ) = θ1, where θ = (θ1, θ2) is a pair for which we will later give confidence regions C(X), then q(C(X)) is larger than the confidence set obtained by focusing on θ1 alone.

Example 4.4.4. Let X1, ..., Xn denote the number of hours a sample of internet subscribers spend per week on the Internet. Suppose X1, ..., Xn is modeled as a sample from an exponential, E(θ), distribution, and suppose we want a confidence interval for the population proportion P(X ≥ x) of subscribers that spend at least x hours per week on the Internet. Here q(θ) = 1 − F(x) = exp{−x/θ}. By the results of Section B.3, 2nX̄/θ has a chi-square, χ²_{2n}, distribution. Using 2nX̄/θ as a pivot, we find the (1 − α) confidence interval

  2nX̄/x(1 − α/2) ≤ θ ≤ 2nX̄/x(α/2),

where x(β) denotes the βth quantile of the χ²_{2n} distribution. Let θ̲ and θ̄ denote the lower and upper boundaries of this interval. Because q(θ) is increasing in θ,

  exp{−x/θ̲} ≤ q(θ) ≤ exp{−x/θ̄}

is a confidence interval for q(θ) with confidence coefficient (1 − α). □
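Example 4.4.4 is easy to compute; here is a minimal Python sketch (function and argument names are illustrative, with x0 standing for the threshold x above), using the χ²_{2n} pivot and then transforming the endpoints by the increasing map θ ↦ exp{−x0/θ}:

import numpy as np
from scipy.stats import chi2

def exp_survival_interval(x, x0, alpha=0.05):
    # Interval for theta from 2*n*xbar/theta ~ chi-square(2n), then for
    # q(theta) = exp(-x0/theta) = P(X >= x0) by monotonicity.
    n = len(x)
    s = 2 * n * np.mean(x)
    theta_lo = s / chi2.ppf(1 - alpha / 2, df=2 * n)
    theta_hi = s / chi2.ppf(alpha / 2, df=2 * n)
    return np.exp(-x0 / theta_lo), np.exp(-x0 / theta_hi)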
Confidence Regions of Higher Dimension

We can extend the notion of a confidence interval for one-dimensional functions q(θ) to r-dimensional vectors q(θ) = (q1(θ), ..., qr(θ)). Suppose q̲j(X) and q̄j(X) are real-valued. Then the r-dimensional random rectangle

  I(X) = {q(θ) : q̲j(X) ≤ qj(θ) ≤ q̄j(X), j = 1, ..., r}

is said to be a level (1 − α) confidence region if the probability that it covers the unknown but fixed true (q1(θ), ..., qr(θ)) is at least (1 − α). If Ij(X) = [q̲j(X), q̄j(X)] is a level (1 − αj) confidence interval for qj(θ) and the pairs (q̲1, q̄1), ..., (q̲r, q̄r) are independent, then the rectangle I(X) = I1(X) × ··· × Ir(X) has level

  Π_{j=1}^r (1 − αj).     (4.4.8)

Thus, if we choose αj = 1 − (1 − α)^{1/r}, then I(X) has confidence level (1 − α); an r-dimensional confidence rectangle is in this case automatically obtained from the one-dimensional intervals.

An approach that works even if the Ij are not independent is to use Bonferroni's inequality (Appendix A). According to this inequality,

  P[q(θ) ∈ I(X)] ≥ 1 − Σ_{j=1}^r P[qj(θ) ∉ Ij(X)] ≥ 1 − Σ_{j=1}^r αj.

Thus, if we choose αj = α/r, then I(X) has confidence level (1 − α).

Example 4.4.5. Confidence Rectangle for the Parameters of a Normal Distribution. Suppose X1, ..., Xn is a N(μ, σ²) sample and we want a confidence rectangle for (μ, σ²). From Example 4.4.1,

  I1(X) = X̄ ± s t_{n−1}(1 − α/4)/√n

is a confidence interval for μ with confidence coefficient (1 − α/2). From Example 4.4.2,

  I2(X) = [(n − 1)s²/x_{n−1}(1 − α/4), (n − 1)s²/x_{n−1}(α/4)]

is a reasonable confidence interval for σ² with confidence coefficient (1 − α/2). Thus, by Bonferroni's inequality, I1(X) × I2(X) is a level (1 − α) confidence rectangle for (μ, σ²). The exact confidence coefficient is considered in the problems. □

The method of pivots can also be applied to ∞-dimensional parameters such as F.

Example 4.4.6. Suppose X1, ..., Xn are i.i.d. as X ~ P and ν(P) = F(·), where F(t) = P(X ≤ t), t ∈ R, is the distribution function of X. We assume that F is continuous, in which case the distribution of

  Dn(F) = sup_{t∈R} |F̂(t) − F(t)|

does not depend on F and is known (Example 4.1.5); that is, Dn(F) is a pivot. Let dα be chosen such that P_F(Dn(F) ≤ dα) = 1 − α. Then, by solving Dn(F) ≤ dα for F, we find that a simultaneous-in-t size 1 − α confidence region C(x)(·) is the confidence band which, for each t ∈ R, consists of the interval

  C(x)(t) = (max{0, F̂(t) − dα}, min{1, F̂(t) + dα}).

We have shown that

  P(C(X)(t) ⊃ F(t) for all t ∈ R) = 1 − α for all P ∈ P. □
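A sketch of the band of Example 4.4.6 in Python. The critical value dα must come from the null distribution of the Kolmogorov statistic; recent scipy versions expose it as scipy.stats.kstwo, but that availability is an assumption on our part, so the function below simply takes dα as an argument:

import numpy as np

def confidence_band(x, d_alpha):
    # Simultaneous level (1 - alpha) band for F, evaluated at the order
    # statistics, where F-hat(X_(i)) = i/n.
    xs = np.sort(x)
    n = len(xs)
    Fhat = np.arange(1, n + 1) / n
    lower = np.clip(Fhat - d_alpha, 0.0, 1.0)
    upper = np.clip(Fhat + d_alpha, 0.0, 1.0)
    return xs, lower, upper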
We can apply the notions studied in Examples 4.4.5 and 4.4.6 to give confidence regions for scalar or vector parameters in nonparametric models. We next give such an example.

Example 4.4.7. A Lower Confidence Bound for the Mean of a Nonnegative Random Variable. Suppose X1, ..., Xn are i.i.d. as X and that X has a density f(t) = F′(t), which is zero for t < 0 and nonzero for t > 0, and such that

  μ = μ(F) = ∫₀^∞ t f(t) dt

exists. By integration by parts, μ(F) = ∫₀^∞ [1 − F(t)] dt, so that μ(F) is decreasing in F. Let F⁻(t) and F⁺(t) be the lower and upper simultaneous confidence boundaries of Example 4.4.6. Then, for C(X) as in Example 4.4.6,

  μ̲ = inf{μ(F) : F ∈ C(X)} = μ(F⁺), whereas sup{μ(F) : F ∈ C(X)} = μ(F⁻) = ∞

(see the problems). Then a (1 − α) lower confidence bound for μ is

  μ̲ = μ(F⁺) = ∫₀^∞ [1 − F⁺(t)] dt     (4.4.9)

because P(μ(F) ≥ μ̲) ≥ P(F ∈ C(X)) ≥ 1 − α. Intervals for the case of F supported on an interval (see the problems) arise in accounting practice (see Bickel, 1992), where such bounds are discussed and shown to be asymptotically strictly conservative. □

Summary. We define lower and upper confidence bounds (LCBs and UCBs), confidence intervals, and, more generally, confidence regions. In a parametric model {P_θ : θ ∈ Θ}, a level 1 − α confidence region for a parameter q(θ) is a set C(x), depending only on the data x, such that the probability under P_θ that C(X) covers q(θ) is at least 1 − α, for all θ ∈ Θ. For a nonparametric class P = {P} and parameter ν = ν(P), we similarly require P(C(X) ∋ ν) ≥ 1 − α for all P ∈ P. We derive the (Student) t interval for μ in the N(μ, σ²) model with σ² unknown, intervals and bounds for the variance σ², and approximate intervals for the binomial parameter. In a nonparametric setting we derive a simultaneous confidence band for the distribution function F(t) and a lower confidence bound for the mean of a positive variable X.

4.5 THE DUALITY BETWEEN CONFIDENCE REGIONS AND TESTS

Confidence regions are random subsets of the parameter space that contain the true parameter with probability at least 1 − α. Acceptance regions of statistical tests are, for a given hypothesis H, subsets of the sample space with probability of accepting H at least 1 − α when H is true. We shall establish a duality between confidence regions and acceptance regions for families of hypotheses. We begin by illustrating the duality in the following example.
Example 4.5.1. Two-Sided Tests for the Mean of a Normal Distribution. Suppose that an established theory postulates the value μ0 for a certain physical constant. A scientist has reasons to believe that the theory is incorrect and measures the constant n times, obtaining measurements X1, ..., Xn. Knowledge of his instruments leads him to assume that the Xi are independent and identically distributed normal random variables with mean μ and variance σ². If any value of μ other than μ0 is a possible alternative, then it is reasonable to formulate the problem as that of testing H : μ = μ0 versus K : μ ≠ μ0.

We can base a size α test on the level (1 − α) confidence interval (4.4.1) we constructed for μ as follows. We accept H if, and only if, the postulated value μ0 is a member of the interval

  [X̄ − s t_{n−1}(1 − α/2)/√n, X̄ + s t_{n−1}(1 − α/2)/√n].

If we let T = √n(X̄ − μ0)/s, then, because P_{μ0}[|T| = t_{n−1}(1 − α/2)] = 0, the test is equivalently characterized by rejecting H when |T| ≥ t_{n−1}(1 − α/2). This test is called two-sided because it rejects for both large and small values of the statistic T. In contrast to the one-sided tests considered earlier, it has power against parameter values on either side of μ0. Because the same interval (4.4.1) is used for every μ0, we see that we have, in fact, generated a family of level α tests {δ(X, μ)}, where

  δ(x, μ) = 1 if √n |x̄ − μ|/s ≥ t_{n−1}(1 − α/2), and 0 otherwise.     (4.5.1)

These tests correspond to different hypotheses, δ(X, μ0) being of size α only for the hypothesis H : μ = μ0. Conversely, starting with the tests (4.5.1), we obtain the confidence interval (4.4.1) by finding the set of μ where δ(X, μ) = 0. □

These are examples of a general phenomenon. Consider the general framework where the random vector X takes values in the sample space X ⊂ R^q and X has distribution P ∈ P. Let ν = ν(P) be a parameter that takes values in the set N. For instance, in Example 4.4.1, μ = μ(P) takes values in N = (−∞, ∞); in Example 4.4.2, σ² = σ²(P) takes values in N = (0, ∞); and in Example 4.4.5, (μ, σ²) takes values in N = (−∞, ∞) × (0, ∞). For a function space example, consider ν(P) = F, where F is the distribution function of Xi, as in Example 4.4.6; here an example of N is the class of all continuous distribution functions.

Let S = S(X) be a map from X to subsets of N. Then S is a level (1 − α) confidence region for ν if the probability that S(X) contains ν is at least (1 − α), that is,

  P[ν ∈ S(X)] ≥ 1 − α for all P ∈ P.

We achieve an effect similar to that of Example 4.5.1, generating a family of level α tests, if we start out with (say) the level (1 − α) LCB X̄ − t_{n−1}(1 − α)s/√n of Example 4.4.1 and define δ′(X, μ) to equal 1 if, and only if,

  X̄ − t_{n−1}(1 − α)s/√n > μ.     (4.5.2)

Here δ′(X, μ0) is of size α only for the hypothesis H : μ ≤ μ0.
Next consider the testing framework where we test the hypothesis H = H_{ν0} : ν = ν0 for some specified value ν0. Formally, let P_{ν0} = {P : ν(P) = ν0}, ν0 ∈ N, and suppose we have a test δ(X, ν0) with level α. Then the acceptance region

  A(ν0) = {x : δ(x, ν0) = 0}

is a subset of X with probability at least 1 − α under each P ∈ P_{ν0}. For some specified ν0, H may be accepted; for other specified ν0, H may be rejected. Consider the set of ν0 for which H_{ν0} is accepted,

  S(X) = {ν0 ∈ N : X ∈ A(ν0)};

this is a random set contained in N which, by the theorem below, has probability at least 1 − α of containing the true value of ν(P), whatever be P. Conversely, if S(X) is a level 1 − α confidence region for ν, then the test that accepts H_{ν0} if, and only if, ν0 is in S(X) is a level α test for H_{ν0}. We have the following.

Duality Theorem. Let S(X) = {ν0 ∈ N : X ∈ A(ν0)}. Then P[X ∈ A(ν0)] ≥ 1 − α for all P ∈ P_{ν0} and all ν0 ∈ N if, and only if, S(X) is a 1 − α confidence region for ν.

We next apply the duality theorem to MLR families.

Theorem 4.5.1. Suppose X ~ P_θ, where {P_θ : θ ∈ Θ} is MLR in T = T(X), and suppose that the distribution function F_θ(t) of T under P_θ is continuous in each of the variables t and θ when the other is fixed. If the equation F_θ(t) = 1 − α has a solution θ̲_α(t) in Θ, then θ̲_α(T) is a lower confidence bound for θ with confidence coefficient 1 − α. Similarly, any solution θ̄_α(T) of F_θ(T) = α is an upper confidence bound for θ with confidence coefficient 1 − α. Moreover, if α1 + α2 < 1, then [θ̲_{α1}, θ̄_{α2}] is a confidence interval for θ with confidence coefficient 1 − (α1 + α2).

Proof. By the results of Section 4.3, the acceptance region of the UMP size α test of H : θ = θ0 versus K : θ > θ0 can be written

  A(θ0) = {x : T(x) ≤ t_{θ0}(1 − α)},

where t_{θ0}(1 − α) is the 1 − α quantile of F_{θ0}. By the duality theorem, if S(t) = {θ ∈ Θ : t ≤ t_θ(1 − α)}, then S(T) is a 1 − α confidence region for θ. By applying F_θ to both sides of t ≤ t_θ(1 − α), we find

  S(t) = {θ ∈ Θ : F_θ(t) ≤ 1 − α}.

By Theorem 4.3.1, the power function P_θ(T > t) = 1 − F_θ(t) for a test with critical constant t is increasing in θ; that is, F_θ(t) is decreasing in θ. It follows that F_θ(t) ≤ 1 − α iff θ ≥ θ̲_α(t), so that S(t) = [θ̲_α(t), ∞). The proofs for the upper confidence bound and interval follow by the same type of argument. □

We next give connections between confidence bounds, acceptance regions, and p-values for MLR families. Let t denote the observed value t = T(x) of T(X) for the datum x,
let α(t, θ0) denote the p-value for the UMP size α test of H : θ = θ0 versus K : θ > θ0, and let A*(θ) = T(A(θ)) = {T(x) : x ∈ A(θ)}.

Corollary 4.5.1. Under the conditions of Theorem 4.5.1, the p-value is

  α(t, θ) = P_θ(T ≥ t) = 1 − F_θ(t),

and

  A*(θ) = {t : α(t, θ) > α} = (−∞, t_θ(1 − α)],
  S(t) = {θ : α(t, θ) > α} = [θ̲_α(t), ∞).

Proof. We have seen in the proof of Theorem 4.5.1 that 1 − F_θ(t) is increasing in θ. Because F_θ(t) is a distribution function in t, α(t, θ) = 1 − F_θ(t) is decreasing in t. The result follows. □

In general, let α(t, ν0) denote the p-value of a test δ(T, ν0) = 1[T ≥ c] of H : ν = ν0 based on a statistic T = T(X) with observed value t = T(x). We call

  C = {(t, ν) : δ(t, ν) = 0} = {(t, ν) : α(t, ν) > α}

the set of compatible (t, ν) points. In the (t, ν) plane, vertical sections of C are the confidence regions S(t), whereas horizontal sections are the acceptance regions A*(ν) = {t : δ(t, ν) = 0}. We illustrate these ideas using the example of testing H : μ = μ0 when X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known and T = X̄; here

  C = {(t, μ) : |t − μ| ≤ σ z(1 − α/2)/√n}.

Figure 4.5.1 shows the set C, a confidence region S(t0), and an acceptance set A*(μ0) for this example.

Example 4.5.2. Exact Confidence Bounds and Intervals for the Probability of Success in n Binomial Trials. Let X1, ..., Xn be the indicators of n binomial trials with probability of success θ. We shall use some of the results derived earlier in this chapter. For α ∈ (0, 1), we seek reasonable exact level (1 − α) upper and lower confidence bounds and confidence intervals for θ.

To find a lower confidence bound for θ, our preceding discussion leads us to consider the level α tests of H : θ ≤ θ0, which reject for large values of S = Σᵢ Xi. Let k(θ0, α) denote the critical constant of such a test, that is, the smallest integer k with P_{θ0}[S ≥ k] ≤ α. The corresponding level (1 − α) confidence region is given by

  C(X1, ..., Xn) = {θ : S < k(θ, α)}.

To analyze the structure of the region we need to examine k(θ, α). We claim that

  (i) k(θ, α) is nondecreasing in θ;
  (ii) k(θ, α) increases by exactly 1 at its points of discontinuity;
  (iii) k(θ, α) → k(θ0, α) as θ ↑ θ0;
  (iv) k(0, α) = 1 and k(1, α) = n + 1.
[Figure 4.5.1. The shaded region is the compatibility set C for the two-sided test of H_{μ0} : μ = μ0 in the normal model. S(t0) is a confidence interval for μ for a given value t0 of T, whereas A*(μ0) is the acceptance region for H_{μ0}.]

To prove (i), note that it was shown in Theorem 4.3.1 that P_θ[S ≥ j] is nondecreasing in θ for fixed j; clearly, it is also nonincreasing in j for fixed θ. Now suppose θ1 < θ2 and k(θ1, α) > k(θ2, α). Then, because k(θ1, α) is by definition the smallest integer k with P_{θ1}[S ≥ k] ≤ α,

  α ≥ P_{θ2}[S ≥ k(θ2, α)] ≥ P_{θ1}[S ≥ k(θ2, α)] ≥ P_{θ1}[S ≥ k(θ1, α) − 1] > α,

a contradiction. Therefore, k(θ1, α) ≤ k(θ2, α), and (i) holds.

The assertion (ii) is a consequence of the following remarks. If θ0 is a discontinuity point of k(θ, α), let j be the limit of k(θ, α) as θ ↑ θ0. Then P_θ[S ≥ j] ≤ α for all θ < θ0 and, hence, by the continuity of θ ↦ P_θ[S ≥ j], P_{θ0}[S ≥ j] ≤ α, so that j = k(θ0, α). On the other hand, for θ > θ0 we have P_θ[S ≥ j] > α, so that k(θ, α) ≥ j + 1; and, for θ sufficiently close to θ0, P_θ[S ≥ j + 1] ≤ α, so that k(θ, α) = j + 1 just to the right of θ0. The claims (iii) and (iv) are left as exercises.

From (i), (ii), (iii), and (iv) we see that, if we define

  θ̲(S) = inf{θ : k(θ, α) = S + 1},

then

  C(X) = (θ̲(S), 1] if S > 0, and C(X) = [0, 1] if S = 0,
and θ̲(S) is the desired level (1 − α) LCB for θ.(2) From our discussion, when S > 0, θ̲(S) is the unique solution of the equation

  Σ_{r=S}^{n} (n choose r) θ^r (1 − θ)^{n−r} = α,

and when S = 0, θ̲(S) = 0. Similarly, inverting the level α tests of H : θ ≥ θ0, which reject for small values of S, we define

  θ̄(S) = sup{θ : j(θ, α) = S − 1},

where j(θ, α) is the critical constant of that family of tests. Then θ̄(S) is a level (1 − α) UCB for θ; when S < n, θ̄(S) is the unique solution of

  Σ_{r=0}^{S} (n choose r) θ^r (1 − θ)^{n−r} = α,

and when S = n, θ̄(S) = 1. Putting the bounds θ̲(S), θ̄(S) together we get the confidence interval [θ̲(S), θ̄(S)] of level (1 − 2α). As might be expected, if n is large, these bounds and intervals differ little from those obtained by the first approximate method in Example 4.4.3. These intervals can be obtained from computer packages that use algorithms based on the preceding considerations. Figure 4.5.2 portrays the situation for n = 2 and α = 0.16.

[Figure 4.5.2. Plot of k(θ, α) for n = 2, α = 0.16.]
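The two displayed equations can be solved numerically; here is a Python sketch (names illustrative) using root finding:

import numpy as np
from scipy.stats import binom
from scipy.optimize import brentq

def exact_binomial_bounds(S, n, alpha=0.05):
    # theta_lower solves P_theta(S' >= S) = alpha when S > 0, and theta_upper
    # solves P_theta(S' <= S) = alpha when S < n; boundary cases as in the text.
    eps = 1e-10
    lower = 0.0 if S == 0 else brentq(
        lambda th: binom.sf(S - 1, n, th) - alpha, eps, 1 - eps)
    upper = 1.0 if S == n else brentq(
        lambda th: binom.cdf(S, n, th) - alpha, eps, 1 - eps)
    return lower, upper

Equivalently, by the standard relation between binomial tail probabilities and the beta distribution, the same bounds are beta quantiles: scipy.stats.beta.ppf(alpha, S, n - S + 1) and scipy.stats.beta.ppf(1 - alpha, S + 1, n - S).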
Applications of Confidence Intervals to Comparisons and Selections

We have seen that confidence intervals lead naturally to two-sided tests. However, two-sided tests seem incomplete in the sense that if H : θ = θ0 is rejected in favor of K : θ ≠ θ0, we usually want to know whether θ > θ0 or θ < θ0. The problem of deciding whether θ = θ0, θ < θ0, or θ > θ0 is an example of a three-decision problem and is a special case of the decision problems of Section 1.3. Here we consider the simple solution suggested by the level (1 − α) confidence interval I:

  1. Make no judgment as to whether θ < θ0 or θ > θ0 if I contains θ0.
  2. Decide θ < θ0 if I is entirely to the left of θ0.
  3. Decide θ > θ0 if I is entirely to the right of θ0.     (4.5.3)

For this three-decision rule, the probability of falsely claiming significance of either θ < θ0 or θ > θ0 is bounded above by α/2, as the following example shows.

Example 4.5.3. Suppose X1, ..., Xn are i.i.d. N(μ, σ²) with σ² known. In Section 4.4 we considered the level (1 − α) confidence interval X̄ ± σz(1 − α/2)/√n for μ. Using this interval and (4.5.3), we obtain the following three-decision rule based on T = √n(X̄ − μ0)/σ:

  Do not reject H : μ = μ0 if |T| ≤ z(1 − α/2).
  Decide μ < μ0 if T < −z(1 − α/2).
  Decide μ > μ0 if T > z(1 − α/2).

To see that each wrong claim of significance has probability at most α/2, consider first the case μ ≥ μ0. Then the wrong decision "μ < μ0" is made when T < −z(1 − α/2). This event has probability

  Φ(−z(1 − α/2) − √n(μ − μ0)/σ) ≤ Φ(−z(1 − α/2)) = α/2.

Similarly, when μ ≤ μ0, the probability of the wrong decision "μ > μ0" is at most α/2. Thus, by using this kind of procedure in a comparison or selection problem, we can control the probabilities of a wrong selection by setting the α of the parent test or confidence interval.

For instance, suppose μ is the expected difference in blood pressure when two treatments, A and B, are given to high blood pressure patients. Because we do not know whether A or B is to be preferred, we test H : μ = 0 versus K : μ ≠ 0. If H is rejected, it is natural to carry the comparison of A and B further by asking whether μ < 0 or μ > 0; if we decide μ < 0, we select A as the better treatment, and vice versa. Thus, the two-sided test can be regarded as the first step in a decision procedure where, if H is not rejected, we make no claims of significance, but, if H is rejected, we decide whether this is because θ is smaller or larger than θ0. We can use the two-sided tests and confidence intervals introduced in later chapters in similar fashions. □
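The rule of Example 4.5.3 in code form, as a minimal Python sketch (names illustrative):

import numpy as np
from scipy.stats import norm

def three_decision_rule(x, mu0, sigma, alpha=0.05):
    # Rule (4.5.3) for the normal mean with sigma known; each wrong claim
    # of significance has probability at most alpha / 2.
    T = np.sqrt(len(x)) * (np.mean(x) - mu0) / sigma
    c = norm.ppf(1 - alpha / 2)
    if T > c:
        return "decide mu > mu0"
    if T < -c:
        return "decide mu < mu0"
    return "no judgment"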
Summary. We explore the connection between tests of statistical hypotheses and confidence regions. If δ(x, ν0) is a level α test of H : ν = ν0, where ν0 is a specified value, then the set S(x) of ν0 where δ(x, ν0) = 0 is a level (1 − α) confidence region for ν; conversely, if S(x) is a level (1 − α) confidence region for ν, then the test that accepts H : ν = ν0 when ν0 ∈ S(x) is a level α test. We give explicitly the construction of exact upper and lower confidence bounds and intervals for the parameter in the binomial distribution. We also give a connection between confidence intervals, two-sided tests, and the three-decision problem of deciding whether a parameter θ equals a specified value θ0, is less than θ0, or is larger than θ0.

4.6 UNIFORMLY MOST ACCURATE CONFIDENCE BOUNDS

In our discussion of confidence bounds and intervals so far we have not taken their accuracy into account. If θ̲ and θ̲* are two competing level (1 − α) lower confidence bounds for θ, they are both very likely to fall below the true θ. But we also want the bounds to be close to θ; thus, we say that the bound with the smaller probability of being far below θ is more accurate. Formally, for θ a real parameter:

Definition 4.6.1. A level (1 − α) LCB θ̲* of θ is said to be more accurate than a competing level (1 − α) LCB θ̲ if, and only if,

  P_θ[θ̲*(X) ≤ θ′] ≤ P_θ[θ̲(X) ≤ θ′]     (4.6.1)

for any fixed θ and all θ′ < θ. Similarly, a level (1 − α) UCB θ̄* is more accurate than a competitor θ̄ if, and only if,

  P_θ[θ̄*(X) ≥ θ′] ≤ P_θ[θ̄(X) ≥ θ′]     (4.6.2)

for any fixed θ and all θ′ > θ.

Lower confidence bounds satisfying (4.6.1) for all competitors are called uniformly most accurate, as are upper confidence bounds satisfying (4.6.2) for all competitors. Note that θ* is a uniformly most accurate level (1 − α) LCB for θ if, and only if, −θ* is a uniformly most accurate level (1 − α) UCB for −θ.

Example 4.6.1. Suppose X = (X1, ..., Xn) is a sample from a N(μ, σ²) population with σ² known. A level α test of H : μ = μ0 versus K : μ > μ0 rejects H when √n(X̄ − μ0)/σ ≥ z(1 − α). The dual lower confidence bound is

  μ̲1(X) = X̄ − z(1 − α)σ/√n.

A competing lower confidence bound (see the problems) is μ̲2(X) = X(k), where X(1) ≤ X(2) ≤ ··· ≤ X(n) denotes the ordered X1, ..., Xn and k is defined by P(S > k) = 1 − α for a binomial, B(n, ½), random variable S. Which lower bound is more accurate? It does turn out that μ̲1 is more accurate than μ̲2 and is, in fact, uniformly most accurate in the N(μ, σ²) model. This is a consequence of the following theorem, which connects the accuracy of a confidence bound to the power of the associated one-sided tests and reveals that (4.6.1) is nothing more than a comparison of power functions: optimality of the tests translates into accuracy of the bounds. □
Theorem 4.6.1. Suppose θ̲*(X) is a level (1 − α) lower confidence bound such that, for each θ0, the associated test whose critical function δ*(x, θ0) is given by

  δ*(x, θ0) = 1 if θ̲*(x) > θ0, and 0 otherwise,

is UMP level α for H : θ = θ0 versus K : θ > θ0. Then θ̲* is uniformly most accurate at level (1 − α).

Proof. Let θ̲(X) be any other level (1 − α) lower confidence bound, and define δ(x, θ0) = 1 if θ̲(x) > θ0, and 0 otherwise. Then δ(X, θ0) is a level α test for H : θ = θ0 versus K : θ > θ0. Because δ*(X, θ0) is UMP level α for H : θ = θ0 versus K : θ > θ0, for θ1 > θ0 we must have

  E_{θ1}(δ(X, θ0)) ≤ E_{θ1}(δ*(X, θ0)), or P_{θ1}[θ̲*(X) ≤ θ0] ≤ P_{θ1}[θ̲(X) ≤ θ0].

Identify θ0 with θ′ and θ1 with θ in the statement of Definition 4.6.1, and the result follows. □

If we apply this result to Example 4.6.1, we find that μ̲1(X) = X̄ − z(1 − α)σ/√n is uniformly most accurate. However, X(k) does have the advantage that we don't have to know σ, or even the shape of the density f of Xi, to apply it; the robustness considerations of Section 3.5 favor X(k).

Most accurate upper confidence bounds are defined similarly, and we can extend the notion of accuracy to confidence bounds for real-valued functions of an arbitrary parameter. We define q̲* to be a uniformly most accurate level (1 − α) LCB for q(θ) if, and only if, for any other level (1 − α) LCB q̲,

  P_θ[q̲* ≤ q(θ′)] ≤ P_θ[q̲ ≤ q(θ′)] whenever q(θ′) < q(θ).

Uniformly most accurate (UMA) bounds turn out to have related nice properties. For instance (see the problems), they have the smallest expected "distance" to θ:

Corollary 4.6.1. Suppose θ̲*(X) is a UMA level (1 − α) lower confidence bound. Then, for any other level (1 − α) lower confidence bound θ̲(X),

  E_θ(θ − θ̲*(X))⁺ ≤ E_θ(θ − θ̲(X))⁺ for all θ,

where a⁺ = a, if a > 0, and 0 otherwise.

Example 4.6.2. Bounds for the Probability of Early Failure of Equipment. Let X1, ..., Xn be the times to failure of n pieces of equipment, where we assume that the Xi are independent E(λ) variables. We want a uniformly most accurate level (1 − α) upper confidence bound q̄* for q(λ) = 1 − exp{−λt0}, the probability of failure before a specified (early) time t0. We begin by finding a uniformly most accurate level (1 − α) UCB λ̄* for λ. To find λ̄* we invert the family of UMP level α tests of H : λ ≥ λ0 versus K : λ < λ0. At λ0,
the UMP test accepts H if

  Σ_{i=1}^n Xi ≤ x_{2n}(1 − α)/(2λ0)     (4.6.3)

or, equivalently, if

  λ0 ≤ x_{2n}(1 − α) / (2 Σ_{i=1}^n Xi),

where x_{2n}(1 − α) is the (1 − α) quantile of the χ²_{2n} distribution. By the duality theorem, the confidence region corresponding to this family of tests is (0, λ̄*], where

  λ̄* = x_{2n}(1 − α) / (2 Σ_{i=1}^n Xi)

is, by Theorem 4.6.1, a uniformly most accurate level (1 − α) UCB for λ. Because q is strictly increasing in λ, it follows that q(λ̄*) is a uniformly most accurate level (1 − α) UCB for the probability of early failure. □
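A sketch of Example 4.6.2 in Python; here t0 stands for the specified "early" time, a notational assumption on our part:

import numpy as np
from scipy.stats import chi2

def early_failure_ucb(x, t0, alpha=0.05):
    # lambda* = chi2_{2n}(1 - alpha) / (2 * sum(x)), then the induced UMA
    # level (1 - alpha) UCB for q(lambda) = 1 - exp(-lambda * t0).
    lam_star = chi2.ppf(1 - alpha, df=2 * len(x)) / (2 * np.sum(x))
    return lam_star, 1.0 - np.exp(-lam_star * t0)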
Discussion. We have only considered confidence bounds. The situation with confidence intervals is more complicated. Considerations of accuracy lead us to ask that, subject to the requirement that the confidence level is (1 − α), the confidence interval be as short as possible. The length is random, and it can be shown that in most situations there is no confidence interval of level (1 − α) that has uniformly minimum length among all such intervals. If we turn to the expected length as a measure of precision, the situation is still unsatisfactory because, in general, there does not exist a member of the class of level (1 − α) intervals that has minimum expected length for all θ. However, as in the estimation problem, we can restrict attention to certain reasonable subclasses of level (1 − α) intervals for which members with uniformly smallest expected length exist. Neyman defines unbiased confidence intervals [T̲, T̄] of level (1 − α) by the property that

  P_θ[T̲ ≤ q(θ) ≤ T̄] ≥ P_θ[T̲ ≤ q(θ′) ≤ T̄] for every θ, θ′;

that is, the interval must be at least as likely to cover the true value of q(θ) as any other value. Pratt (1961) showed that in many of the classical problems of estimation there exist level (1 − α) confidence intervals that have uniformly minimum expected length among all level (1 − α) unbiased confidence intervals. There are also some large-sample results in this direction (see Wilks, 1962, pp. 374–376). Confidence intervals obtained from two-sided tests that are uniformly most powerful within a restricted class of procedures can be shown to have optimality properties within restricted classes. These topics are discussed in Lehmann (1997).

Summary. By using the duality between one-sided tests and confidence bounds, we show that confidence bounds based on UMP level α tests are uniformly most accurate (UMA) level (1 − α) in the sense that, in the case of lower bounds, they are less likely than other level (1 − α) lower confidence bounds to fall below any value θ′ below the true θ.

4.7 FREQUENTIST AND BAYESIAN FORMULATIONS

We have so far focused on the frequentist formulation of confidence bounds and intervals, where the data X ∈ X ⊂ R^q are random while the parameters are fixed but unknown. A consequence of this approach is that once a numerical interval has been computed from experimental data, no probability statement can be attached to this interval. Instead, the interpretation of a 100(1 − α)% confidence interval is that if we repeated an experiment indefinitely, each time computing a 100(1 − α)% confidence interval, then 100(1 − α)% of the intervals would contain the true unknown parameter value. In the Bayesian formulation of Chapter 1, what are called level (1 − α) credible bounds and intervals are subsets of the parameter space that are given probability at least (1 − α) by the posterior distribution of the parameter given the data.

Suppose that, given θ, X has distribution P_θ, θ ∈ Θ ⊂ R, and that θ has the prior probability distribution Π. Let Π(·|x) denote the posterior probability distribution of θ given X = x, and let π(·|x) denote the corresponding posterior density.

Definition 4.7.1. θ̲ and θ̄ are level (1 − α) lower and upper credible bounds for θ if they respectively satisfy

  Π(θ̲ ≤ θ | x) ≥ 1 − α and Π(θ ≤ θ̄ | x) ≥ 1 − α.

Definition 4.7.2. C_k = {θ : π(θ|x) ≥ k} is called a level (1 − α) credible region for θ if Π(C_k | x) ≥ 1 − α. If π(θ|x) is unimodal, then C_k will be an interval of the form [θ̲, θ̄].

We next give an example of credible bounds and intervals.

Example 4.7.1. Suppose that, given μ, X1, ..., Xn are i.i.d. N(μ, σ0²) with σ0² known, and that μ ~ N(μ0, τ0²), with μ0 and τ0² known. Then the posterior distribution of μ given X1, ..., Xn, computed in Chapter 1, is N(μ̂_B, σ̂²_B) with

  μ̂_B = (n x̄ + (σ0²/τ0²) μ0) / (n + σ0²/τ0²), σ̂²_B = (σ0²/n)(1 + σ0²/(n τ0²))^{−1}.

It follows that the level (1 − α) lower and upper credible bounds for μ are

  μ̲ = μ̂_B − z(1 − α)(σ0/√n)(1 + σ0²/(n τ0²))^{−1/2},
  μ̄ = μ̂_B + z(1 − α)(σ0/√n)(1 + σ0²/(n τ0²))^{−1/2},
while the level (1 − α) credible interval is [μ̲, μ̄] with

  μ± = μ̂_B ± z(1 − α/2)(σ0/√n)(1 + σ0²/(n τ0²))^{−1/2}.

Compared to the frequentist interval X̄ ± z(1 − α/2)σ0/√n, the level (1 − α) credible interval is similar except that its center μ̂_B is pulled in the direction μ0 of the prior mean, where μ0 is a prior guess of the value of μ, and it is a little narrower. Note that as τ0 → ∞, the Bayesian interval tends to the frequentist interval. However, the interpretations of the intervals are different: In the frequentist confidence interval, the probability of coverage is computed with the data X random and μ fixed, whereas in the Bayesian credible interval, the probability of coverage is computed with X = x fixed and μ random with probability distribution Π(·|X = x). See Chapter 1 for sources of such prior guesses. We shall analyze Bayesian credible regions further in Chapter 5. □

Example 4.7.2. Suppose that, given θ, X1, ..., Xn are i.i.d. N(μ0, θ⁻¹), where μ0 is known, and write λ = θ = σ⁻². Suppose λ has the gamma Γ(½a, ½b) density, where a > 0 and b > 0 are known parameters. Then it can be shown (see the problems of Chapter 1) that, given X1, ..., Xn, (t + b)λ has a χ²_{a+n} distribution, where t = Σ(Xi − μ0)². Let x_{a+n}(α) denote the αth quantile of the χ²_{a+n} distribution. Then

  λ̲ = x_{a+n}(α)/(t + b)

is a level (1 − α) lower credible bound for λ, and

  σ̄² = (t + b)/x_{a+n}(α)

is a level (1 − α) upper credible bound for σ². Compared to the frequentist bound (n − 1)s²/x_{n−1}(α) of Example 4.4.2, the Bayesian bound is shifted in the direction of the reciprocal b/a of the mean of the prior distribution of λ. □
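The credible bounds and interval of Example 4.7.1 are one-liners once the posterior mean and variance are in hand; here is a Python sketch (names illustrative):

import numpy as np
from scipy.stats import norm

def normal_credible_interval(x, mu0, sigma0, tau0, alpha=0.05):
    # Posterior is N(mu_B, sigma_B^2), as in Example 4.7.1; mu0 and tau0^2
    # are the prior mean and variance, sigma0^2 the known sampling variance.
    n = len(x)
    mu_B = (n * tau0**2 * np.mean(x) + sigma0**2 * mu0) / (n * tau0**2 + sigma0**2)
    sigma_B = np.sqrt(sigma0**2 * tau0**2 / (n * tau0**2 + sigma0**2))
    z = norm.ppf(1 - alpha / 2)
    return mu_B - z * sigma_B, mu_B + z * sigma_B

Letting tau0 grow large reproduces the frequentist interval, in line with the remark above.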
Summary. In the Bayesian framework we define bounds and intervals, called level (1 − α) credible bounds and intervals, that determine subsets of the parameter space that are assigned probability at least (1 − α) by the posterior distribution of the parameter given the data. In the case of a normal prior π(θ) and normal model p(x|θ), the Bayesian interval tends to the frequentist interval as the prior variance tends to infinity; however, the interpretations of the intervals are different.

4.8 PREDICTION INTERVALS

In Section 1.4 we discussed situations in which we want to predict the value of a random variable Y. In addition to point prediction of Y, it is desirable to give an interval [Y̲, Ȳ] that contains the unknown value Y with prescribed probability (1 − α). For instance, a doctor administering a treatment with delayed effect will give patients a time interval [t̲, t̄] in which the treatment is likely to take effect; similarly, we may want an interval for the future GPA of a student or a future value of a portfolio. We define a level (1 − α) prediction interval as an interval [Y̲, Ȳ] based on data X such that P(Y̲ ≤ Y ≤ Ȳ) ≥ 1 − α. The problem of finding prediction intervals is similar to that of finding confidence intervals using a pivot.

Example 4.8.1. The (Student) t Prediction Interval. Let X1, ..., Xn be i.i.d. as X ~ N(μ, σ²). We want a prediction interval for Y = X_{n+1}, which is assumed to be also N(μ, σ²) and independent of X1, ..., Xn. Let Ŷ = Ŷ(X) denote a predictor based on X = (X1, ..., Xn). Then Y and Ŷ are independent, and the mean squared prediction error (MSPE) of Ŷ is

  E(Y − Ŷ)² = MSPE(Ŷ) = MSE(Ŷ) + σ²,

where MSE denotes the estimation theory mean squared error of Ŷ regarded as an estimate of μ. Note that Ŷ can be regarded as both a predictor of Y and as an estimate of μ. In Chapter 3 we found that, in the class of unbiased estimators, X̄ is the optimal estimator. We define a predictor Y* to be prediction unbiased for Y if E(Y* − Y) = 0, and we can conclude that, in the class of prediction unbiased predictors, the optimal MSPE predictor is Ŷ = X̄: the optimal estimator, when it exists, is also the optimal predictor.

We next use the prediction error Ŷ − Y to construct a pivot that can be used to give a prediction interval. Note that

  Ŷ − Y = X̄ − X_{n+1} ~ N(0, [n⁻¹ + 1]σ²).

It follows that

  Z_p(Y) = (Ŷ − Y) / (σ √(1 + n⁻¹))

has a N(0, 1) distribution and is independent of V = (n − 1)s²/σ², which has a χ²_{n−1} distribution; this is because s² = (n − 1)⁻¹ Σ(Xi − X̄)² is independent of X̄ (Section B.3) and independent of X_{n+1} by assumption. It follows, by the definition of the (Student) t distribution in Section B.3, that

  T_p(Y) = (Ŷ − Y)/(s √(1 + n⁻¹)) = Z_p(Y)/√(V/(n − 1))

has the t distribution, T_{n−1}. By solving −t_{n−1}(1 − α/2) ≤ T_p(Y) ≤ t_{n−1}(1 − α/2) for Y, we find the (1 − α) prediction interval

  Y = X̄ ± s √(1 + n⁻¹) t_{n−1}(1 − α/2).     (4.8.1)

Note that T_p(Y) acts as a prediction interval pivot in the same way that T(μ) acts as a confidence interval pivot in Example 4.4.1.
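A sketch of the prediction interval (4.8.1) in Python (names illustrative):

import numpy as np
from scipy.stats import t

def t_prediction_interval(x, alpha=0.05):
    # Level (1 - alpha) interval (4.8.1) for a future observation X_{n+1};
    # note the extra factor sqrt(1 + 1/n) relative to the interval (4.4.1).
    n = len(x)
    xbar, s = np.mean(x), np.std(x, ddof=1)
    half = t.ppf(1 - alpha / 2, df=n - 1) * s * np.sqrt(1 + 1 / n)
    return xbar - half, xbar + half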
Also note that the prediction interval (4.8.1) is much wider than the confidence interval (4.4.1). In fact, it can be shown using the methods of Chapter 5 that the width of the confidence interval (4.4.1) tends to zero in probability at the rate n^{−1/2}, whereas the width of the prediction interval tends to 2σz(1 − α/2). Moreover, the confidence level of (4.4.1) is approximately correct for large n even if the sample comes from a nonnormal distribution, whereas the level of the prediction interval (4.8.1) is not (1 − α) in the limit as n → ∞ for samples from non-Gaussian distributions. □

We next give a prediction interval that is valid for samples from any population with a continuous distribution.

Example 4.8.2. Suppose X1, ..., X_{n+1} are i.i.d. as X ~ F, where F is a continuous distribution function with positive density f on (a, b), −∞ ≤ a < b ≤ ∞. Here X1, ..., Xn are observable and X_{n+1} is to be predicted; we want a prediction interval for Y = X_{n+1} ~ F. Set Ui = F(Xi), i = 1, ..., n + 1; then U1, ..., U_{n+1} are i.i.d. uniform, U(0, 1). Let U(1) < ··· < U(n) be U1, ..., Un ordered; then E(U(i)) = i/(n + 1). Thus, if H denotes the joint distribution of U(j) and U(k), j < k,

  P(U(j) ≤ U_{n+1} ≤ U(k)) = ∫∫ P(u ≤ U_{n+1} ≤ v | U(j) = u, U(k) = v) dH(u, v)
   = ∫∫ (v − u) dH(u, v) = E(U(k)) − E(U(j)) = (k − j)/(n + 1),     (4.8.2)

where we used the independence of U_{n+1} and (U(j), U(k)). Let X(1) < ··· < X(n) denote the order statistics of X1, ..., Xn. It follows that

  P(X(j) ≤ X_{n+1} ≤ X(k)) = (k − j)/(n + 1).

Thus, [X(j), X(k)] with k = n + 1 − j is a level (n + 1 − 2j)/(n + 1) prediction interval for X_{n+1}. This interval is a distribution-free prediction interval. See the problems for a simpler proof of (4.8.2). □
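The distribution-free interval requires only sorting; here is a Python sketch (names illustrative):

import numpy as np

def order_statistic_prediction_interval(x, j, k):
    # [X_(j), X_(k)] covers X_{n+1} with probability (k - j)/(n + 1), by
    # (4.8.2); requires 1 <= j < k <= n.
    xs = np.sort(x)
    n = len(xs)
    return xs[j - 1], xs[k - 1], (k - j) / (n + 1)

For example, with n = 19, taking j = 1 and k = 19 gives a level 0.90 prediction interval.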
Bayesian Predictive Distributions

Suppose that θ is random with θ ~ π and that, given θ, X1, ..., X_{n+1} are i.i.d. p(x|θ). Here X1, ..., Xn are observable and X_{n+1} is to be predicted. The posterior predictive distribution Q(·|x) of X_{n+1} is defined as the conditional distribution of X_{n+1} given x = (x1, ..., xn); in the continuous case it has density

  q(y|x) = ∫ p(y|θ) π(θ|x) dθ,

with a sum replacing the integral in the discrete case. Now [Y̲_B, Ȳ_B] is said to be a level (1 − α) Bayesian prediction interval for Y = X_{n+1} if

  Q(Y̲_B ≤ Y ≤ Ȳ_B | x) ≥ 1 − α.

Example 4.8.3. Consider the normal model of Example 4.7.1, in which (Xi | θ) ~ N(θ, σ0²) with σ0² known and π(θ) is N(η0, τ²) with τ² known. A sufficient statistic based on the observables X1, ..., Xn is T = X̄ = n⁻¹ ΣXi, and it is enough to derive the marginal distribution of Y = X_{n+1} from the joint distribution of X̄, X_{n+1}, and θ. Write

  X_{n+1} = (X_{n+1} − θ) + θ

and note that

  E[(X_{n+1} − θ)θ] = E{E[(X_{n+1} − θ)θ | θ]} = 0.

Thus, X_{n+1} − θ and θ are uncorrelated and, by the results of Section B.4, independent. Given X̄ = t, X_{n+1} − θ and θ are still uncorrelated and independent, and, from Example 4.7.1, the posterior distribution of θ is N(μ̂_B, σ̂²_B). Therefore,

  L{X_{n+1} | X̄ = t} = L{(X_{n+1} − θ) + θ | X̄ = t} = N(μ̂_B, σ0² + σ̂²_B),

where

  μ̂_B = (n τ² x̄ + σ0² η0)/(n τ² + σ0²), σ̂²_B = (σ0²/n)(1 + σ0²/(n τ²))^{−1}.

It follows that a level (1 − α) Bayesian prediction interval for Y is [Y̲_B, Ȳ_B] with

  Y_B^± = μ̂_B ± z(1 − α/2) √(σ0² + σ̂²_B).     (4.8.3)

To consider the frequentist properties of the Bayesian prediction interval (4.8.3), we compute its probability limit under the assumption that X1, ..., Xn are i.i.d. N(θ, σ0²). Because σ̂²_B → 0 and μ̂_B → θ in probability as n → ∞, the interval (4.8.3) converges in probability to θ ± z(1 − α/2)σ0 as n → ∞. This is the same as the probability limit of the frequentist interval (4.8.1). □

The posterior predictive distribution is also used to check whether the model and the prior give a reasonable description of the uncertainty in a study (see Box, 1983).
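A sketch of (4.8.3) in Python (names illustrative; eta0 and tau are the prior mean and standard deviation):

import numpy as np
from scipy.stats import norm

def bayes_prediction_interval(x, eta0, tau, sigma0, alpha=0.05):
    # The predictive distribution of X_{n+1} given xbar is
    # N(mu_B, sigma0^2 + sigma_B^2), as derived in Example 4.8.3.
    n = len(x)
    mu_B = (n * tau**2 * np.mean(x) + sigma0**2 * eta0) / (n * tau**2 + sigma0**2)
    sigma_B2 = sigma0**2 * tau**2 / (n * tau**2 + sigma0**2)
    half = norm.ppf(1 - alpha / 2) * np.sqrt(sigma0**2 + sigma_B2)
    return mu_B - half, mu_B + half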
Summary. We consider intervals based on observable random variables that contain an unobservable random variable with probability at least (1 − α). In the case of a normal sample of size n + 1 with only n variables observable, we construct the (Student) t prediction interval for the unobservable variable. For a sample of size n + 1 from a continuous distribution, we show how the order statistics can be used to give a distribution-free prediction interval. The Bayesian formulation is based on the posterior predictive distribution, which is the conditional distribution of the unobservable variable given the observable variables. The Bayesian prediction interval is derived for the normal model with a normal prior.

4.9 LIKELIHOOD RATIO PROCEDURES

4.9.1 Introduction

Up to this point, the results and examples in this chapter deal mostly with one-parameter problems in which it sometimes is possible to find optimal procedures. However, even in the case in which θ is one-dimensional, optimal procedures may not exist. For instance, if X1, ..., Xn is a sample from a N(μ, σ²) population with σ² known, there is no UMP test for testing H : μ = μ0 versus K : μ ≠ μ0. To see this, note that it follows from the results of Section 4.2 that if μ1 > μ0, the MP level α test φ_α(X) rejects H for T ≥ z(1 − α), where T = √n(X̄ − μ0)/σ. On the other hand, if μ1 < μ0, the MP level α test δ_α(X) rejects H if T ≤ z(α). Because δ_α(x) ≠ φ_α(x), by the uniqueness of the NP test (Theorem 4.2.1(c)), there can be no UMP test of H : μ = μ0 versus K : μ ≠ μ0.

In this section we introduce intuitive and efficient procedures that can be used when no optimal methods are available and that are natural for multidimensional parameters. The efficiency is in an approximate sense that will be made clear in Chapters 5 and 6; in particular cases, likelihood ratio tests have weak optimality properties to be discussed there.

Suppose that X = (X1, ..., Xn) has density or frequency function p(x, θ) and we wish to test H : θ ∈ Θ0 versus K : θ ∈ Θ1. We start with a generalization of the Neyman–Pearson statistic p(x, θ1)/p(x, θ0): the likelihood ratio given by

  L(x) = sup{p(x, θ) : θ ∈ Θ1} / sup{p(x, θ) : θ ∈ Θ0}.

Tests that reject H for large values of L(x) are called likelihood ratio tests. To see that this is a plausible statistic, recall from Section 2.2 that we think of the likelihood function L(θ, x) = p(x, θ) as a measure of how well θ "explains" the given sample x = (x1, ..., xn). So, if sup{p(x, θ) : θ ∈ Θ1} is large compared to sup{p(x, θ) : θ ∈ Θ0}, then the observed sample is best explained by some θ ∈ Θ1, and conversely. Note that L(x) coincides with the optimal test statistic p(x, θ1)/p(x, θ0) when Θ0 = {θ0} and Θ1 = {θ1}. In the cases we shall consider, p(x, θ) is a continuous function of θ and Θ0 is of smaller dimension than Θ = Θ0 ∪ Θ1, so that the likelihood ratio equals the test statistic

  λ(x) = sup{p(x, θ) : θ ∈ Θ} / sup{p(x, θ) : θ ∈ Θ0},     (4.9.1)

whose computation is often simple; note that in general λ(x) = max(L(x), 1).

We are going to derive likelihood ratio tests in several important testing problems. Although the calculations differ from case to case, the basic steps are always the same.

  1. Calculate the MLE θ̂ of θ.
  2. Calculate the MLE θ̂0 of θ, where θ may vary only over Θ0.
  3. Form λ(x) = p(x, θ̂) / p(x, θ̂0).
  4. Find a function h that is strictly increasing on the range of λ such that h(λ(X)) has a simple form and a tabled distribution under H.

Because h(λ(X)) is equivalent to λ(X) as a test statistic, we specify the size α likelihood ratio test through the test statistic h(λ(X)) and its (1 − α)th quantile obtained from the table.
We can also invert families of likelihood ratio tests to obtain what we shall call likelihood confidence regions, bounds, and so on. For instance, we can invert the family of size α likelihood ratio tests of the point hypothesis H : θ = θ0 and obtain the level (1 − α) confidence region

  C(x) = {θ : p(x, θ) ≥ [c(θ)]⁻¹ sup_θ p(x, θ)},     (4.9.2)

where sup_θ denotes the sup over θ ∈ Θ and the critical constant c(θ) satisfies

  P_θ[sup_θ p(X, θ′) / p(X, θ) ≥ c(θ)] = α.

It is often approximately true (see Chapter 6) that c(θ) is independent of θ. In that case, C(x) is just the set of all θ whose likelihood is on or above some fixed value dependent on the data.

This section also includes situations in which θ = (θ1, θ2), where θ1 is the parameter of interest and θ2 is a nuisance parameter. We shall obtain likelihood ratio tests for hypotheses of the form H : θ1 = θ10, which are composite because θ2 can vary freely. The family of such level α likelihood ratio tests obtained by varying θ10 can also be inverted to yield confidence regions for θ1. To see how the process works we refer to the specific examples in the subsections that follow.

4.9.2 Tests for the Mean of a Normal Distribution—Matched Pair Experiments

Suppose X1, ..., Xn form a sample from a N(μ, σ²) population in which both μ and σ² are unknown. An important class of situations for which this model may be appropriate occurs in matched pair experiments. Suppose we want to study the effect of a treatment on a population of patients whose responses are quite variable because the patients differ with respect to age, diet, and other factors. In order to reduce differences due to these extraneous factors, we consider pairs of patients matched so that within each pair the patients are as alike as possible with respect to the extraneous factors; we can regard twins as being matched pairs. After the matching, the experiment proceeds as follows. In the ith pair one patient is picked at random (i.e., with probability ½) and given the treatment, while the second patient serves as control and receives a placebo. Response measurements are taken on the treated and control members of each pair. Studies in which subjects serve as their own control can also be thought of as matched pair experiments: we measure the response of a subject when under treatment and when not under treatment. Examples of such measurements are hours of sleep when receiving a drug and when receiving a placebo, sales performance before and after a course in salesmanship, mileage of cars with and without a certain ingredient or adjustment, and so on.

Let Xi denote the difference between the treated and control responses for the ith pair. If the treatment and placebo have the same effect, the difference Xi has a distribution that is
symmetric about zero. We think of μ = E(X1), the mean difference between the responses of the treated and control subjects, as representing the treatment effect. Our null hypothesis of no treatment effect is then H : μ = 0. To this we have added the normality assumption; the test we derive will still have desirable properties, in an approximate sense to be discussed in Chapter 5, if the normality assumption is not satisfied.

Two-Sided Tests

We begin by considering H : μ = μ0 versus K : μ ≠ μ0, where we think of μ0 as an established standard, for instance for an old treatment; in the matched pair experiment μ0 = 0. This corresponds to the alternative "The treatment has some effect, good or bad." As discussed in Section 4.5, the test can be modified into a three-decision rule that decides whether there is a significant positive or negative effect; for the purpose of referring to the duality between testing and confidence procedures, we test H : μ = μ0.

Form of the Two-Sided Tests

Let θ = (μ, σ²), so that Θ0 = {(μ, σ²) : μ = μ0}. Under our assumptions, the problem of finding the supremum of p(x, θ) over Θ was solved in Chapter 2: sup{p(x, θ) : θ ∈ Θ} = p(x, θ̂), where

  θ̂ = (x̄, σ̂²), σ̂² = (1/n) Σᵢ (xi − x̄)²,

is the maximum likelihood estimate of θ. Finding sup{p(x, θ) : θ ∈ Θ0} boils down to finding the maximum likelihood estimate σ̂0² of σ² when μ = μ0 is known and then evaluating p(x, (μ0, σ̂0²)). The likelihood equation is

  (∂/∂σ²) log p(x, θ) = ½ [ (1/σ⁴) Σᵢ (xi − μ0)² − n/σ² ] = 0,

which has the immediate solution

  σ̂0² = (1/n) Σᵢ (xi − μ0)².
By the theory of Chapter 2, σ̂0² gives the maximum of p(x, θ) for θ ∈ Θ0. Therefore,

  log λ(x) = log p(x, θ̂) − log p(x, (μ0, σ̂0²))
   = −(n/2)[(log 2π) + log σ̂²] + (n/2)[(log 2π) + log σ̂0²]
   = (n/2) log(σ̂0²/σ̂²).

Our test rule, therefore, rejects H for large values of σ̂0²/σ̂². To simplify the rule further, we use the following equation, which can be established by expanding both sides:

  σ̂0²/σ̂² = 1 + (x̄ − μ0)²/σ̂².

Because (n − 1)s² = Σᵢ (xi − x̄)² = nσ̂², σ̂0²/σ̂² is a monotone increasing function of |Tn|, where

  Tn = √n (x̄ − μ0)/s.

Thus, the likelihood ratio tests reject for large values of |Tn|, and the statistic Tn is equivalent to the likelihood ratio statistic λ for this problem. Because Tn has a T_{n−1} distribution under H (see Example 4.4.1), the size α critical value is t_{n−1}(1 − α/2), and we can use calculators or software that gives quantiles of the t distribution, or Table III, to find the critical value. For instance, suppose n = 25 and we want α = 0.05; then we would reject H if, and only if, |Tn| ≥ 2.064. Thus, the test that rejects H when |Tn| ≥ t_{n−1}(1 − α/2) is of size α.

One-Sided Tests

The two-sided formulation is natural if two treatments, A and B, are considered to be equal before the experiment is performed. However, if we are comparing a treatment and control, the relevant question is whether the treatment creates an improvement; thus, the testing problem H : μ ≤ μ0 versus K : μ > μ0 (with μ0 = 0) is suggested. The intuitive test statistic is again Tn. In the problems we argue that P_θ[Tn ≥ t] is increasing in δ = √n(μ − μ0)/σ; therefore, the test that rejects H for Tn ≥ t_{n−1}(1 − α) is of size α for H : μ ≤ μ0, and it is also the likelihood ratio test for this problem (a proof is sketched in the problems). Similarly, the size α likelihood ratio test for H : μ ≥ μ0 versus K : μ < μ0 rejects H if, and only if, Tn ≤ −t_{n−1}(1 − α).
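The two-sided and one-sided t tests just derived, sketched in Python (names illustrative; scipy's t quantiles assumed):

import numpy as np
from scipy.stats import t

def one_sample_t_test(x, mu0, alpha=0.05, alternative="two-sided"):
    # The likelihood ratio tests of this subsection expressed through T_n.
    n = len(x)
    Tn = np.sqrt(n) * (np.mean(x) - mu0) / np.std(x, ddof=1)
    if alternative == "two-sided":            # H: mu = mu0 vs K: mu != mu0
        reject = abs(Tn) >= t.ppf(1 - alpha / 2, n - 1)
    elif alternative == "greater":            # H: mu <= mu0 vs K: mu > mu0
        reject = Tn >= t.ppf(1 - alpha, n - 1)
    else:                                     # H: mu >= mu0 vs K: mu < mu0
        reject = Tn <= -t.ppf(1 - alpha, n - 1)
    return Tn, reject

With n = 25 and alpha = 0.05, t.ppf(0.975, 24) returns 2.064, the critical value quoted above.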
With this solution. we may be required to take more observations than we can afford on the second stage./Lo)/a) = 1.11) just as the power functions of the corresponding tests of Example 4.4. Similarly the onesided tests lead to the lower and upper confidence bounds of Example 4. The power functions of the onesided tests are monotone in 8 (Problem 4.2. 1) and X~ distributions.n(X ./Lo) / a has N (8./Lol ~ ~./Lo) / a. in which we estimate a for a first sample and use this estimate to decide how many more observations we need to obtain guaranteed power against all alternatives with I/L./Lo) / a as close to 0 as we please and. The density of Z/ . thus. and the power can be obtained from computer software or tables of the noncentral t distribution. bring the power arbitrarily close to 0:.1. whatever be n.. We have met similar difficulties (Problem 4.jV/k is given in Problem 4. ( 2) only through 8. This distribution. We can control both probabilities of error by selecting the sample size n large provided we consider alternatives of the form 181 ~ 81 > 0 in the twosided case and 8 ~ 81 or 8 :S 81 in the onesided cases.4. Because E[fo(X /Lo)/a] = fo(/L . Problem 17).l distribution.4. 260. note that from Section B.1)s2/a 2 are independent and that (n 1)s2/a 2 has a X.b distribution. we know that fo(X /L)/a and (n . To derive the distribution of Tn.1.3.jV/ k where Z and V are independent and have N(8. denoted by Tk.260 Power Functions and Confidence 4 To discuss the power of these tests. Computer software will compute n. however. Thus. say. / (n1)s2 /u 2 n1 V has a 'Tn1. Likelihood Confidence Regions If we invert the twosided tests.9. with 8 = fo(/L . . p.12./Lo)/a . the ratio fo(X . 1) distribution.9. The reason is that. If we consider alternatives of the form (/L /Lo) ~ ~. by making a sufficiently large we can force the noncentrality parameter 8 = fo(/L . is by definition the distribution of Z / . we need to introduce the noncentral t distribution with k degrees of freedom and noncentrality parameter 8.b. Note that the distribution of Tn depends on () (/L./Lo)/a and Yare . we obtain the confidence region We recognize C(X) as the confidence interval of Example 4. A Stein solution. respectively. is possible (Lehmann. 1997.7) when discussing confidence intervals. we can no longer control both probabilities of error by choosing the sample size.1 are monotone in fo/L/ a. fo( X .
..3 5 0. height.8 1.3 Tests and Confidence Intervals for the Difference in Means of Two Normal Populations We often want to compare two populations with distribution F and G on the basis of two independent samples X I.25..4 1..4 1. (See also (4. The preceding assumptions were discussed in Example 1. this is the problem of determining whether the treatment has any effect.4 4. ( 2) and N (Jl2.0 6 3. it is usually assumed that X I. . For instance. .2 0. temperature.3. blood pressure). YI .1 1. Then 8 0 = {e : Jli = Jl2} and 9 1 = {e : Jli =F Jl2}' The log of the likelihood of (X.4 3 0.Xnl and YI .1 0. while YI . we conclude at the 1% level of significance that the two drugs are significantly different. ..99 confidence interval for the mean difference Jl between treatments is [0.995) = 3.6 0.58. As in Section 4.6 4.'" .6. In the control versus treatment example. . 1958.0 7 3. .Section 4.5..Y ) is n2 where n = nl + n2. Then Xl.6 10 2.. . .0 4. For quantitative measurements such as blood pressure.Xn1 . . It suggests that not only are the drugs different but in fact B is better than A because no hypothesis Jl = Jl' < 0 is accepted at this level. respectively. 8 2 = 1.4 If we denote the difference as x's.. Because tg(0. .) 4. ( 2) populations.1 1. volume..6 0.32.1.9. . and ITnl = 4.3 4 1.9 likelihood Ratio Procedures 261 Data Example As an illustration of these procedures.7 1. one from each popUlation.513. Yn2 ..2 1.A in sleep gained using drugs A and B on 10 patients.5 1.9..0 3.8 9 0. . Tests We first consider the problem of testing H : Jl1 = J:L2 versus H : Jli =F Jl2. p. This is a matched pair experiment with each subject serving as its own control. Let e = (Jlt..2. A discussion of the consequences of the violation of these assumptions will be postponed to Chapters 5 and 6.g. Jl2.9 1. suppose we wanted to test the effect of a certain drug on some biological variable (e. Patient i A B BA 1 0.3). length. consider the following data due to Cushny and Peebles (see Fisher. then x = 1. it is shown that = over 8 by the maximum likelihood estimate e.1 0.Xn1 could be blood pressure measurements on a sample of patients given a placebo. The 0.7 5.8 2..06. 121) giving the difference B . e . ( 2). 'Yn2 are the measurements on a sample given the drug.9.2 the likelihood function and its log are maximized In Problem 4.Xn1 and YI . . 'Yn2 are independent samples from N (JlI.. and so forth.84]. Y) = (Xl..2 2 1.8 8 0. weight.
where and If we use the identities ~ fCYi ?l i=l j1)2 1 n fp'i i=l y)2 + n2 (Y _ j1)2 n obtained by writing [Xi .Y) '1':" a a a i=l a j=l J I .1 .X) and . Y. our model reduces to the onesample model of Section 4.2 1 I:rt2 . Y. By Theorem B. j1. . (7 ).j112 log likelihood ratio statistic [(Xi .X) + (X  j1)]2 and expanding.3 Tn2 1 . we find that the is equivalent to the test statistic ITI where and To complete specification of the size a likelihood ratio test we show that T has distribution when ttl = tt2.2 (y. .262 (X.1 I:rtl .9. where 2 Testing and Confidence Regions Chapter 4 When ttl tt2 = tt. the maximum of p over 8 0 is obtained for f) (j1.2 (Xi .3. Thus. (75).2.2 X.
twosample t test rejects if.~ Q likelihood confidence bounds.9.Section 4. X~ll' X~2l' respectively.2 has a X~2 distribution.2 • Therefore.ILl = Ll versus K : 1L2 . we find a simple equivalent statistic IT(Ll)1 where If 1L2 ILl = Ll.ILl.has a N(O.2)8 2 /0. this follows from the fact that.ILl we naturally look at likelihood ratio tests for the family of testing problems H : 1L2 . Similarly. for H : 1L2 ::. We conclude from this remark and the additive property of the X2 distribution that (n . T( Ll) has a Tn2 distribution and inversion ofthe tests leads to the interval (4. . As for the special case Ll = 0. and that j nl n2/n(Y . T and the resulting twosided. there are two onesided tests with critical regions. IL 1 and for H : ILl ::.2 under H bution and is independent of (n . Confidence Intervals To obtain confidence intervals for 1L2 . 1/nI).3) for 1L2 . 1L2. We can show that these tests are likelihood ratio tests for these hypotheses. f"V As usual.2)8 2/0. T has a noncentral t distribution with noncentrality parameter. corresponding to the twosided test. and only if. 1) distriTn .1L2. As in the onesample case.9 likelihood Ratio Procedures 263 are independent and distributed asN(ILI/o. if ILl #. onesided tests lead to the upper and lower endpoints of the interval as 1 . by definition.ILl #.N(1L2/0.X) /0.Ll. It is also true that these procedures are of size Q for their respective hypotheses. 1/n2).
395 ± 0. . it is known that the log of permeability is approximately normally distributed and that the variability from machine to machine is the same. it may happen that the X's and Y's have different variances.95 confidence interval for the difference in mean log permeability is 0. The level 0.0:) confidence interval has probability at most ~ 0: of making the wrong selection.776. Here y .0264.. Ji2) E R x R.3. p. except for an additive constant. respectively. . Y1. The log likelihood.975) = 2. From past experience.9. Again we can show that the selection procedure based on the level (1 .~o:). When Ji1 = Ji2 = Ji. For instance. setting the derivative of the log likelihood equal to zero yields the MLE . /11 = x and /12 = y for (Ji1. .. As a first step we may still want to compare mean responses for the X and Y populations. H is rejected if ITI 2 t n .. 4.4 The TwoSample Problem with Unequal Variances In twosample problems of the kind mentioned in the introduction to Section 4.790 1. If normality holds.845 1.x = 0. a treatment that increases mean response may increase the variance of the responses. consider the following experiment designed to study the permeability (tendency to leak water) of sheets of building material produced by two different machines.583 1.2 (1 .042 1.9. we would select machine 2 as producing the smaller permeability and.Xn1 . Because t4(0.282 We test the hypothesis H of no difference in expected log permeability. thus. 472) x (machine 1) Y (machine 2) 1.395.368. we conclude at the 5% level of significance that there is a significant difference between the expected log permeability for the two machines. On the basis of the results of this experiment. we are lead to a model where Xl.264 Testing and Confidence Regions Chapter 4 Data Example As an illustration. :.627 2. a~) samples.. and T = 2. . 'Yn2 are two independent N (Ji1. ai) and N(Ji2. the more waterproof material.8 2 = 0. Suppose first that aI and a~ are known. This is the BehrensFisher problem.977. . 1952. thus. The results in terms oflogarithms were (from Hald. is The MLEs of Ji1 and Ji2 are.
j1x = = nl .n2)' Similarly.1:::. it Thus. by Slutsky's theorem and the central limit theorem. DiaD has aN(I:::. n2.t1 = J.)18D has approximately a standard normal distribution (Problem 5. It follows that "\(x.t2..1:::.j1)2 j=1 = L(Yi . y) Next we compute = exp {2~~ eM  x)2 + 2:'~ (Ii  y?}. that is Yare X) + Var(Y) = nl + n2 .y)/(nl + .t2 :::.x)/(nl + . For large nl.fi)2 we obtain \(x. j1. For small and moderate nI.. where I:::. Unfortunately the distribution of (D 1:::.. n2. An unbiased estimate is J. IDI/8D as a test statistic for H : J..t2 must be estimated. n2 an approximation to the distribution of (D .9 Likelihood Ratio Procedures J 265 where.x)2 i=1 n2 i=1 n2 L(Yi ..Section 4..t1. (D 1:::.X and aJy is the variance of D. where D = Y .y) By writing nl L(Xi . 2 8y 8~ . J.t1. = at I a~. 8D=+' nl n2 It is natural to try and use D I 8 D as a test statistic for the onesided hypothesis H : J.3. = J.n2)' It follows that the likelihood ratio test is equivalent to the statistic IDl/aD.) I 8 D depends on at I a~ for fixed nl.)18D to generate confidence procedures.28)..X)2 + nl(j1.fi = n2(x .) 18 D due to Welch (1949) works well.n2(fi . 1) distribution. and more generally (D .fi? j=1 + n2(j1. BecauseaJy is unknown.
Yd.28.3 and Problem 5. then we end up with a bivariate random sample (XI.. Some familiar examples are: X test score on English exam.003.. with > 0. uI 4. u~ > O.1. etc.9. Y). Y = age at death.3. Y cholesterol level in blood.J1. Y) have a joint bivariate normal distribution. . mice. Empirical data sometimes suggest that a reasonable model is one in which the two characteristics (X. which works well if the variances are equal or nl = n2.266 Testing and Confidence Regions Chapter 4 Let c = Sf/nlsb' Then Welch's approximation is Tk where k c2 [ nl 1 (1 C)2]1 + n2 . our problem becomes that of testing H : p O. Y = blood pressure. If we have a sample as before and assume the bivariate normal model for (X. Y diet.5 likelihood Ratio Procedures for Bivariate Normal Distributions If n subjects (persons.01.05 and Q: 0. machines. can unfortunately be very misleading if =1= u~ and nl =1= n2. fields.u~. TwoSided Tests . X = average cigarette consumption per day in grams.9. Wang (1971) has shown the approximation to be very good for Q: = 0. The LR procedure derived in Section 4. p).1 When k is not an integer.) are sampled from a population and two numerical characteristics are measured on each case. The tests and confidence intervals resulting from this approximation are called Welch's solutions to the BehrensFisher problem. the maximum error in size being bounded by 0.3. N(j1.3. . Note that Welch's solution works whether the variances are equal or not.2 . X test tistical studies. Confidence Intervals for p The question "Are two random variables X and Y independent?" arises in many staweight. ur Testing Independence. the critical value is obtained by linear interpolation in the t tables or using computer software. X percentage of fat in score on mathematics exam. (Xn' Yn). . See Figure 5.
9..~( Yi .1.X)(Yi ~=1 fj)] /n(jl(72. A normal approximation is available. the distribution of pis available on computer packages.4) = Thus.0"2 = .. if we specify indifference regions. for any a. we can control probabilities of type II error by increasing the sample size.5. (jr .~( Xi . . If p = 0. (Un.1.X .Jl2)/0"2.9. . V n ) is a sample from the N(O.3. ii.2 distribution. p) distribution. the power function of the LR test is symmetric about p = 0 and increases continuously from a to 1 as p goes from 0 to 1. then by Problem B. p). the distribution of p depends on p only. where (jr e was given in Problem 2.8). and eo can be obtained by separately maximizing the likelihood of XI. Qualitatively.o.13 as 2 = . Because ITn [ is an increasing function of [P1. VI)"" .Now.1. y. . When p = 0. . .9. (j~. p::. .5) has a Tn.Xn and that ofY1 . 0. Because (U1. See Example 5.. When p = 0 we have two independent samples.(j~. Therefore.Jll)/O"I. where Ui = (Xi . To obtain critical values we need the distribution of p or an equivalent statistic under H. 1~ )2 2 1~ )2 0"1 n i=1 n i=1 p= [t(Xi . 1 (Problem 2.9 Likelihood Ratio Procedures 267 The unrestricted maximum likelihood estimate (x. We have eo = (x. and the likelihood ratio tests reject H for large values of [P1. (4.Section 4. log A(X) is an increasing function of p2.Y . 0) and the log of the likelihood ratio statistic becomes log A(X) (4.3. Yn . There is no simple form for the distribution of p (or Tn) when p #. the twosided likelihood ratio tests can be based on [Tn I and the critical values obtained from Table II..4. Pis called the sample correlation coefficient and satisfies 1 ::. ~ = (Yi .
0 versus K : P > O. We obtain c(p) either from computer software or by the approximation of Chapter 5. inversion of this family of tests leads to levell Q lower confidence bounds. we can start by constructing size Q likelihood ratio tests of H : P Po versus K : p > po. p::. Therefore.15). For instance. 370. we obtain a commonly used confidence interval for p. It can be shown that pis equivalent to the likelihood ratio statistic for testing H : P ::.268 Testing and Confidence Regions Chapter 4 OneSided Tests In many cases. we obtain size Q tests for each of these hypotheses by setting the critical value so that the probability of type I error is Q when p = O.75. c(Po)] 1 ~Q. Confidence Bounds and Intervals Usually testing independence is not enough and we want bounds and intervals for p giving us an indication of what departure from independence is present.9. we find that there is no evidence of correlation: the pvalue is bigger than 0. Because c can be shown to be monotone increasing in p. The power functions of these tests are monotone. We want to know whether there is a correlation between the initial weights and the weight increase and formulate the hypothesis H : p O.~Q bounds together. only onesided alternatives are of interest. and only if p .1 Here p 0.:::: c] is an increasing function of p for fixed c (Problem 4. We find the likelihood ratio tests and associated confidence procedures for four classical normal models: .7 18. c(po) where P po [I)' d(po)] P po [I)'::. Thus.: : d(po) or p::.Q. by putting two level 1 . These intervals do not correspond to the inversion of the size Q LR tests of H : p = Po versus K : p :::f. The likelihood ratio test statistic >. c(po)] 1 . 0 versus K : P > 0 and similarly that corresponds to the likelihood ratio statistic for H : P 0 versus K : P < O. if we want to decide whether increasing fat in a diet significantly increases cholesterol level in the blood. we would test H : P = 0 versus K : P > 0 or H : P ::. 'I7 Summary. However.48. Po but rather of the "equaltailed" test that rejects if. c(po)" where Ppo [p::. These tests can be shown to be of the form "Accept if. ! Data Example A~ an illustration.18 and Tn 0. We can show that Po [p. consider the following bivariate sample of weights Xi of young rats at a certain age and the weight increase Yi during the following week. by using the twosided test and referring to the tables. and only if. To obtain lower confidence bounds. is the ratio of the maximum value of the likelihood under the general model to the maximum value of the likelihood under the model specified by the hypothesis. We can similarly obtain 1 Q upper confidence bounds and. for large n the equal tails and LR confidence intervals approximately coincide with each other.
XI. we use (Y . When a? and a~ are known.. the likelihood ratio test is equivalent to the test based on IY . respectively. . .. X n are independently and identically distributed according to the uniform distribution U(O. where p is the sample correlation coefficient.. . (a) Compute the power function of 6c and show that it is a monotone increasing function (b) In testing H : exactly 0.Ll' an and N(J. We test the hypothesis that X and Yare indepen dent and find that the likelihood ratio test is equivalent to the test based on IPl. .X.Section 4. (3) Twosample experiments in which two independent samples are modeled as coming from N(J. Assume the model where X = (Xl.Ll' ( 2) and N(J.1 1.X)I SD.10 PROBLEMS AND COMPLEMENTS Problems for Section 4. (2) Twosample experiments in which two independent samples are modeled as coming from N(J. Suppose that Xl. Let Xl. Consider the hypothesis H that the mean life II A = J. ( 2) and we test the hypothesis that the mean difference J.Xn denote the times in days to failure of n similar pieces of equipment. (d) How large should n be so that the 6c specified in (b) has power 0.L is zero. .L2' a~) populations.L :::.L.L2' ( 2) populations.2 degrees of freedom. where SD is an estimate of the standard deviation of D = Y . We test the hypothesis that the means are equal and find that the likelihood ratio test is equivalent to the twosample (Student) t test. . X n ) and let 1 if Mn 2:: c = of e. .. J.. e). respectively.48. Approximate critical values are obtained using Welch's t distribution approximation. . .. . what choice of c would make 6c have size (c) Draw a rough graph of the power function of 6c specified in (b) when n = 20.05? e :::. (4) Bivariate sampling experiments in which we have two measurements X and Y on each case in a sample of n cases. The likelihood ratio test is equivalent to the onesample (Student) t test. We also find that the likelihood ratio statistic is equivalent to a t statistic with n .10 Problems and Complements 269 (1) Matched pair experiments in which differences are modeled as N(J. ~ versus K : e> ~. 0 otherwise.98 for (e) If in a sample of size n = 20. 4.Xn ) is an £(A) sample. When a? and a~ are unknown. what is the pvalue? 2. Let Mn = max(X 1 . .Lo. Mn e = ~? = 0.
under H. Days until failure: 315040343237342316514150274627103037 Is H rejected at level <> ~ 0.r. Assume that F o and F are continuous.4 to show that the test with critical region IX > I"ox( 1  <» 12n].. . . 1 using a I . is a size Q test.. then the pvalue <>(T) has a unifonn. Draw a graph of the approximate power function. (c) Give an approximate expression for the critical value if n is large and B not too close to 0 or 00.. where x(I . .ontinuous distribution. Let Xl. give a normal approximation to the significance probability. Hint: See Problem B. ..270 Testing and Confidence Regions Chapter 4 (a) Use the result of Problem 8.2 L:. U(O. . f(x. j . Tr are independent test statistics for the same simple H and that each Tj has a continuous distribution. . .0 > 0.) 4. j=l . T ~ Hint: See Problem B.1.<»1 VTi)J .2. Hint: Approximate the critical region by IX > 1"0(1 + z(1 . X n be a 1'(0) sample. . (0) Show that.. j = 1. . (d) The following are days until failure of air monitors at a nuclear plant. distribution. ...Xn be a sample from a population with the Rayleigh density . (cj Use the central limit theorem to show that <l?[ (I"oz( <» II") + VTi(1"  1"0) I 1"] is an approximation to the power of the test in part (a).O) = (xIO')exp{x'/20'}. i I (b) Check that your test statistic has greater expected value under K than under H. If 110 = 25.. (Use the central limit theorem. 0: (a) Construct a test of H : B = 1 versus K : B > 1 with approximate size complete sufficient statistic for this model. I .05? 3. Let Xl. 6.a) is the (1 . r.12.. Suppose that T 1 . (b) Give an expression of the power in tenns of the X~n distribution. (a) Use the MLE X of 0 to construct a level <> test for H : 0 < 00 versus K : 0 > 00 ..3. Establish (4. " i I 5. X> 0.~llog <>(Tj ) has a X~r distribution. .3). 7..a)th quantile of the X~n distribution.. 1nz t Z i . .. Show that if H is simple and the test statistic T has a c. . Hint: Use the central limit theorem for the critical value.4. . 1). Let o:(Tj ) denote the pvalue for 1j.3. (b) Show that the power function of your test is increasing in 8.
n (c) Show that if F and Fo are continuous and FiFo. j = 1. if X. . Fo(x)1 > kol x ~ Hint: D n > !F(x) ..o J J x .p(Fo(x))lF(x) . b > 0.T(B+l) denote T. (This is the logistic distribution with mean zero and variance 1. (a) Show that the power PF[D n > kal of the Kolmogorov test is bounded below by ~ sup Fpl!F(x).Fo(x)1 > k o ) for a ~ 0.)(Tn ) = LN(O. (In practice these can be obtained by drawing B independent samples X(I)..J(u) = 1 and 0: = 2. Suppose that the distribution £0 of the statistic T = T(X) is continuous under H and that H is rejected for large values of T.Fo• generate a U(O.p(F(x»!F(x) .p(Fo(x))IF(x) .5. . T(X(B)) is a sample of size B + 1 from La. 1) variable on the computer and set X ~ F O 1 (U) as in Problem B.. .5. . 80 and x = 0. Hint: If H is true T(X). and let a > O. with distribution function F and consider H : F = Fo.Section 4.p(u) be a function from (0.. then the power of the Kolmogorov test tends to 1 as n .5.) Evaluate the bound Pp(lF(x) .5 using the nonnal approximation to the binomial distribution of nF(x) and the approximate critical value in Example 4. 10.00).2. . Define the statistics S¢.1. = (Xi .Fo(x)1 foteach..u.o: is called the Cramervon Mises statistic. Express the Cramervon Mises statistic as a sum.. That is.o sup.Fo(x)IO x ~ ~ sup. is chosen 00 so that 00 x 2 dF(x) = 1.1. 00.) Next let T(I).a)/b.p(F(x)IF(x) .10 Problems and Complements 271 8.Fo(x)IOdFo(x) . Here. (a) For each of these statistics show that the distribution under H does not depend on F o. T(B) be B independent Monte Carlo simulated values of T. then T. In Example 4.(X') = Tn(X).Fo(x)IOdF(x). (b) Suppose Fo isN(O. Let T(l). . Vw.T. Use the fact that T(X) is equally likely to be any particular order statistic. 1.. B..Fo(x) 10 U¢.X(B) from F o on the computer and computing TU) = T(XU)). . .. . T(B) ordered. (b) When 'Ij.d. . let . 1) and F(x) ~ (1 +exp( _"/7))1 where 7 = "13/. . to get X with distribution . .10.. . .1.6 is invariant under location and scale.. (a) Show that the statistic Tn of Example 4... = ~ I. Show that the test rejects H iff T > T(B+lm) has level a = m/(B + 1). ~ ll. . .o T¢.12(b). and 1.. Let X I. T(l).l) (Tn)..o V¢.T(X(l)). (b) Use part (a) to conclude that LN(". 1) to (0. 9. X n be d.
Let T = X/"o and 0 = /" .Fo(T).1.2. you intend to test the satellite 100 times. Show that the EPV(O) for I{T > c) is uniformly minimal in 0 > 0 when compared to the EPV(O) for any other test. 00 .Fe(F"l(l. i i (b) Define the expected pvalue as EPV(O) = EeU. Consider a test with critical region of the fann {T > c} for testing H : () = Bo versus I< : () > 80 . 5 about 14% of the time.1.. ! . which is independent ofT. let N I • N 2 • and N 3 denote the number of Xj equal to I.. One has signalMtonoise ratio v/ao = 2... then the test that rejects H if. Expected pvalues. Whichever system you buy during the year. [2N1 + N. Let 0 < 00 < 0 1 < 1.272 Testing and Confidence Regions Chapter 4 (e) Are any of the four statistics in (a) invariant under location and scale.0) = (1. i • (a) Show that if the test has level a.2 I i i 1. Problems for Section 4.3. A gambler observing a game in which a single die is tossed repeatedly gets the impression that 6 comes up about 18% of the time. Without loss of generality. which system is cheaper on the basis of a year's operation? 1 I Ii I il 2.10. I X n from this population. 01 ) is an increasing function of 2N1 + N. I 1 (b) Show that if c > 0 and a E (0.) . respectively. Suppose that T has a continuous distribution Fe. For a sample Xl. f(3. (For a recent review of expected p values see Sackrowitz and SamuelCahn.. where if! denotes the standard normal distribution./"0' Show that EPV(O) = if! ( . 2. Consider a population with three kinds of individuals labeled 1. Let To denote a random variable with distribution F o. You want to buy one of two systems.2.') sample Xl.05.to on the basis of the N(p" . If each time you test.O) = 0'. X n . the other has v/aQ = 1.0) = 20(1. 3. f(2. and 3 occuring in the HardyWeinberg proportions f(I. ..1) satisfy Pe. > cis MP for testing H : 0 = 00 versus K : 0 = 0 1 • . the UMP test is of the form I {T > c). the other $105 • One second of transmission on either system COsts $103 each. I). The first system costs $106 . 12. you want the number of seconds of response sufficient to ensure that both probabilities of error are < 0. . and 3. 2N1 + N. ! . and only if.0)'.nO /.. = /10 versus K : J. Consider Examples 3. j 1 i :J . 1999. (a) Show that L(x. 2. where" is known. . whereas the other a I ~ a . Hint: peT < to I To = to) is 1 minus the power of a test with critical value to_ (d) Consider the problem of testing H : J1..) I 1 . > cJ = a. Hint: P(To > T) = P(To > tiT = t)fe(t)dt where fe(t) is the density of Fe(t).2 and 4.).L > J. ! J (c) Suppose that for each a E (0. take 80 = O. the power is (3(0) = P(U where FO 1 (u)  < a) = 1. (See Problem 4.0). then the pvalue is U =I ...a)) = inf{t: Fo(t) > u}.. Show that EPV(O) = P(To > T).
2. Show that if randomization is pennitted.imately a N(np" na 2 ) distribution.Pe.0. PdT < cJ is as small as possible. H : (J = (Jo versus K : (J = (J.e. two 5's are obtained. where 11 = I:7 1 ajf}j and a 2 = I:~ 1 f}i(ai 11)2. (a) Show that if in testing H : f} such that = f}o versus K : f) = f)l there exists a critical value c Po. In Examle 4. 7.6) where all parameters are known.. A newly discovered skull has cranial measurements (X. ..2. +.. (c) Using the fact that if(N" .2) that linear combinations of bivariate nonnal random variables are nonnaUy distributed. (b) Find the test that is best in this sense for Example 4. 5. Y) known to be distributed either (as in population 0) according to N(O.1. Prove Corollary 4.2.. and only if.. L .. In Example 4. 0. Upon being asked to play.. find an approximation to the critical value of the MP level a test for this problem.. . Nk) ~ 4. (a) What test statistic should he use if the only alternative he considers is that the die is fair? (b) Show that if n = 2 the most powerfullevel.1.0196 test rejects if.2.1. prove Theorem 4. if . Bk). Bo.Bd > cJ then the likelihood ratio test with critical value c is best in this sense.4 and recall (Proposition B.10 Problems and Complements 273 four numbers are equally likely to occur (i. M(n. Bo.. Hint: The MP test has power at least that of the test with test function J(x) 8.2. I. Hint: Use Problem 4.. Y) belongs to population 1 if T > c... +akNk has approx.2. MPsized a likelihood ratio tests with 0 < a < 1 have power nondecreasing in the sample size. I.N.1(a) using the connection between likelihood ratio tests and Bayes tests given in Remark 4.. 1.2.6) or (as in population 1) according to N(I. the gambler asks that he first be allowed to test his hypothesis by tossing the die n times..=  Section 4. = a. then the maximum of the two probabilities ofmisclassification porT > cJ.7).<l.. with probability . (X. of and ~o '" I. [L(X. 6.. [L(X. . .2. Y) and a critical value c such that if we use the classification rule. then a. find the MP test for testing 10.0. Find a statistic T( X.1. . derive the UMP test defined by (4.o = (1.~:":2:~~=C:="::===='. B" . . A fonnulation of goodness of tests specifies that a test is best if the max. 9..0. For 0 < a < I.4. Bd > cJ = I ..2.imum probability of error (of either type) is as small as possible.2.17). and to population 0 ifT < c.
show that the power of the UMP test can be written as (3(<J) = Gn(<J~Xn(<»/<J2) where G n denotes the X~n distribution function.&(>').Xn is a sample from a Weibull distribution with density f(x. (a) Exhibit the optimal (UMP) test statistic for H : 0 < 00 versus K : 0 > 00.01.3 Testing and Confidence Regions Chapter 4 1. .4. A) = c AcxC1e>'x . Here c is a known positive constant and A > 0 is the parameter of interest. Let Xl.95 at the alternative value 1/>'1 = 15. x > O. (a) Show that L~ K: 1/>. In Example 4.3. but if the arrival rate is > 15.0)"'/[1. i 1 .<» is the (1 . Hint: Show that Xf . Show that if X I. You want to ensure that if the arrival rate is < 10. . Use the normal approximation to the critical value and the probability of rejection.<»th quantile of the X~n distributiou and that the power function of the lIMP level a test is given by where G 2n denotes the X~n distribution function.3. 1 j ! Xf is an optimal test statistic for testing H : 1/ A < 1/ AO versus J ! . 1965) is the one where Xl. • • " 5. I i . 274 Problems for Section 4. that the Xi are a sample from a Poisson distribution with parameter 8. (b) For what levels can you exhibit a UMP test? (c) What distribution tables would you need to calculate the power function of the UMP test? 2. (c) Suppose 1/>'0 ~ 12. hence.01 lest to have power at least 0.. . > 1/>'0.Xn be the times in months until failure of n similar pieces of equipment.: (b) Show that the critical value for the size a test with critical region [L~ 1 Xi > k] is k = X2n(1 . . i' i . the probability of your deciding to stay open is < 0. Consider the foregoing situation of Problem 4. a model often used (see Barlow and Proschan. • I i • 4... r p(x.<»/2>'0 where X2n(1. .• n.01. the expected number of arrivals per day.. .(IOn x = 1•.1. I i .i .1 achieves this? (Use the normal approximation. A possible model for these data is to assume that customers arrive according to a homogeneous Poisson process and. ]f the equipment is subject to wear. the probability of your deciding to close is also < 0. How many days must you observe to ensure that the UMP test of Problem 4. Find the sample size needed for a level 0.3. .. . 0) = ( : ) 0'(1... Suppose that if 8 < 8 0 it is not worth keeping the counter open..) 3.Xn is a sample from a truncated binomial distribution with • I. . Let Xl be the number of arrivals at a service counter on the ith of a sequence of n days..
6. X N }). 1 ' Y > 0. Let the distribution of sUIvival times of patients receiving a standard treatment be the known distribution Fo.O<B<1.1.10 Problems and Complements 275 then 2. .B) = Fi!(x).•• of Li.Fo(y)  }). we derived the model G(y.Fo(y)l". where Xl_ a isthe (1 . imagine a sequence X I. Hint: Use the results of Theorem 1. X N e>.6. X 2. Suppose that each Xi has the Pareto density f(x. In the goodnessoffit Example 4. then e. suppose that Fo(x) has a nonzero density on some interval (a. Assume that Fo has ~ density fo.. . To test whether the new treatment is beneficial we test H : ~ < 1 versus K : .12. (i) Show that if we model the distribution of Y as C(max{X I .:7 I Xi is an optimal test statistic for testing H : () = eo versus l\ .) It follows that Fisher's method for cQmbining pvalues (see 4. b)."" X n denote the incomes of n persons chosen at random from a certain population. Show that the UMP test for testing H : B > 1 versuS K : B < 1 rejects H if 2E log FO(Xi) > Xl_a.d. and consider the alternative with distribution function F(x. In Problem 1. survival times of a sample of patients receiving an experimental treatment. x> c where 8 > 1 and c > O.XFo(y)  1 P(Y < y) = e ' 1 ' Y > 0.~) = 1. .6. 8. Y n be the Ll. A > O. (a) Express mean income JL in terms of e. .1. P(). random variable.Q)th quantile of the X~n distribution.6.6) is UMP for testing that the pvalues are uniformly distributed. . ~ > O.. 00 < a < b < 00. 0 < B < 1. For the purpose of modeling. .d. which is independent of XI. Show how to find critical values. survival times with distribution Fa.2 to find the mean and variance of the optimal test statistic. then P(Y < y) = 1 e  . 7.O) = cBBx(l+BI.. . against F(u)=u B.. .).Section 4. A > O. (b) Find the optimal test statistic for testing H : JL = JLo versus K . and let YI . Find the UMP test. (See Problem 4. JL (e) Use the central limit theorem to find a normal approximation to the critical value of test in part (b).5. X 2 . > po...tive.  ..[1.... Let N be a zerotruncated Poisson. Let Xl.. > 1. (a) Lehmann Alte~p.8 > 80 . (Ii) Show that if we model the distribotion of Y as C(min{X I . y > 0.1. (b) NabeyaMiura Alternative.1.
we test H : {} < 0 versus K : {} > O. = . where the €i are independent normal random variables with mean 0 and known variance .0) UeB for '. with distribution function F(x). 1 X n be i.exp( _x B). Problem 2. Find the MP test for testing H : {} = 1 versus K : B = 8 1 > 1. j 1 To see whether the new treatment is beneficial. (a) Show how to construct level (1 . x > 0. Let Xl.'? = 2.{Oo} = 1 .. Oo)d.6t is UMP for testing H : 8 80 versus K : 8 < 80 .X)' = 16. O)d. Using a pivot based on 1 (Xi .).3.. e e at l · 6. Show that under the assumptions of Theorem 4.276 (iii) Consider the model Testing and Confidence Regions Chapter 4 G(y.2. 'i . I Hint: Consider the class of all Bayes tests of H : () = . Assume that F o has a density !o(Y). ..(O) «) L > 1 i The lefthand side equals f6".L and unknown variance (T2.j "'" .d. 0) e9Fo (Y) Fo(Y).. .. What would you .. x > 0.i.{Od varies between 0 and 1. Show that the UMP test is based on the statistic L~ I Fo(Y. 10. Show that under the assumptions of Theorem 4.X)2. n announce as your level (1 . 0.0) confidence intervals of fixed finite length for loga2 • (b) Suppose that By 1 (Xi .' (cf. B> O.(O) L':x.. 1 • .0"..0 e6 1 0~0. 1 . Hint: A Bayes test rejects (accepts) H if J . /.(O) 6 . . n.. .. Oo)d.2 the class of all Bayes tests is complete.4 1. Let Xi (f)/2)t~ + €i.01.3. L(x.exp( x).O)d.3.....1). 00 p(x. 0 = 0. Show that the test is not UMP. I . Let Xl.  1 j . every Bayes test for H : < 80 versus K : > (Jl is of the fann for some t.' L(x. i = 1. > Er • .2..52.. 1 X n be a sample from a normal population with unknown mean J. The numerator is an increasing function of T(x). J J • 9. F(x) = 1 . . eo versus K : B = fh where I i I : 11. . We want to test whether F is exponential. Problems for Section 4. .1 and 01 loss. F(x) = 1 . Show that under the assumptions of Theorem 4. the denominator decreasing. 12. or Weibull. 0.(O)/ 16' I • 00 p(x.' " 2.
4. 5.al).(2) " . I")/so. 0. Then take N .. (a) Justify the interval [ii. Compare yonr result to the n needed for (4.~a)!.l.a. .3).a. Let Xl.c. .1. has a 1.1)] if 8 given by (4.2. i = 1. (Define the interval arbitrarily if q > q.4.n .IN] .) Hint: Use (A. what values should we use for the t! so as to make our interval as short as possible for given a? 3. minI ii. then the shortest level n (1 . there is a shorter interval 7.Xn be as in Problem 4. It follows that (1 . n.1.. 6. of the form Ji../N.n is obtained by taking at = Q2 = a/2 (assume 172 known).IN. < a.. with X . distribution. < 0. Suppose we want to select a sample size N such that the interval (4..4..7).IN(X  = I:[' lX. .1) based on n = N observations has length at most l for some preassigned length l = 2d.~a) /d]2 . but we may otherwise choose the t! freely. find a fixed length level (b) ItO < t i < I.a) interval of the fonn [ X .z(1 .d.. X + z(1 .01 [X .1] if 8 > 0. Hint: Reduce to QI +a2 = a by showing that if al +a2 with Ql + Q2 = a.) LCB and q(X) is a level (1.1..Section 4. where c is chosen so that the confidence level under the assumed value of 17 2 is 1 . ii are (b) Calculate the smallest n needed to bound the length of the 95% interval of part (a) by 0.. Suppose that in Example 4.no further observations. Use calculus. Suppose that an experimenter thinking he knows the value of 17 2 uses a lower confidence bound for Ji.(X) = X . Begin by taking a fixed number no > 2 of observations and calculate X 0 = (1/ no) I:~O 1 Xi and S5 = (no 1)II:~OI(Xi  XO )2... N(Ji" 17 2) and al + a2 < a.3).02. with N being the smallest integer greater than no and greater than or equal to [Sot".~a)/ .10 Problems and Complements 277 (a) Using a pivot based on the MLE (2L:r<ol (1 .SOt no l (1. then [q(X). where 8. .1.a) confidence interval for B. t.4. .. [0.X are ij.4.] . 0. X!)/~i 1tl ofB.(2) UCB for q(8).(al + (2)) confidence interval for q(8). Show that if q(X) is a level (1 . Show that if Xl. X+SOt no l (1. although N is random.q(X)] is a level (1 .3 we know that 8 < O. .1 Show that. What is the actual confidence coefficient of Ji" if 17 2 can take on all positive values? 4. Stein's (1945) twostage procedure is the following. .
95 confidence interval for (J. Such two sample problems arise In comparing the precision of two instruments and in detennining the effect of a treatment. (12 2 9. 1"2.!a)/ 4.. .~ a) upper and lower bounds. sn. Show that these two quadruples are each sufficient. 0") distribution and is independent of sn" 8. .O)]l < z (1. . ..jn(X .l4. (b) Exhibit a level (1 .a) confidence interval for (1/1'). 4(). . Y n2 be two independent samples from N(IJ. but we are sure that (12 < (It.051 (c) Suppose that n (12 = 5. (c) If (12. T 2 I I' " I'.  10. I . Let 8 ~ B(n. " 1 a) confidence . Indicate what tables you wdtdd need to calculate the interval.0)/[0(1. Let 8 ~ B(n. 0) and X = 81n. Show that ~().3) are indeed approximate level (1 . 1986.jn is an approximate level (b) If n = 100 and X = 0.18) to show that sin 1( v'X)±z (1 .. v.. find ML estimates of 11..1. exhibit a fixed length level (1 .a) interval defined by (4.. (a) If all parameters are unknown.3. in order to have a level (1 . The reader interested in pursuing the study of sequential procedures such as this one is referred to the book of Wetherill and Glazebrook.a) confidence interval for 1"2/(Y2 using a pivot based on the statistics of part (a). 278 Testing and Confidence Regions Chapter 4 I i' I is a confidence interval with confidence coefficient (1 .~a)].. (12) and N(1/. is _ independent of X no ' Because N depends only on sno' given N = k. ±.) (lr!d observations are necessary to achieve the aim of part (a).fN(X . and.a) confidence interval for sin 1 (v'e). use the result in part (a) to compute an approximate level 0. Hint: Set up an inequality for the length and solve for n..') populations. . = 0. (12.) Hint: Note that X = (noIN)Xno + (lIN)Etn n _ +IXi. (a) Show that in Problem 4..) I ~i .3. Let XI.O).4. > z2 (1  is not known exactly. (a) Show that X interval for 8.jn is an approximate level (1  • 12. . if (J' is large. respectively. Suppose that it is known that 0 < ~./3z (1. X n1 and Yll .0') for Jt of length at most 2d.4. 0. d = i II: . (1  ~a) / 2. we may very likely be forced to take a prohibitively large number of observations. (a) Use (A.6. X has a N(J1. I 1 j Hint: [O(X) < OJ = [. it is necessary to take at least Z2 (1 (12/ d 2 observations.a:) confidence interval of length at most 2d wheri 0'2 is known. By Theorem B.1') has a N(o. 1947. 11. i (b) What would be the minimum sample size in part (a) if (). (The sticky point of this approach is that we have no control over N.. Hence.001. Show that the endpoints of the approximate level (1 . and the fundamental monograph of Wald.: " are known. (J'2 jk) distribution.
"') .2 is known as the kurtosis coefficient. Z.2. 10.' 1 (Xi .' Now use the central limit theorem.Section 4.)/01' ~ (J1.4 ) .).J1.~Q). (In practice.".Q confidence interval for cr 2 .7).3. (a) Show that x( "d and x{1.) can be approximated by x(" 1) "" (n .1 and. give a 95% confidence interval for the true proportion of cures 8 using (al (4. 15.)4 < 00 and that'" = Var[(X.2.ia).J1.9 confidence interval for cr. Find the limit of the distribution ofn. and (b) (4. find (a) A level 0. and 10 000. (c) Suppose Xi has a X~ distribution.1) t z{1 . Slutsky's theorem. Hint: Use Problem 8.3. In Example 4. :L7/ (b) Suppose that Xi does not necessarily have a nonnal diStribution. Suppose that 25 measurements on the breaking strength of a certain alloy yield 11.' = 2:.4.IUI. then BytheproofofTheoremB. but assume that J1. In the case where Xi is normal.1)8'/0') . 14.J1. . If S "' 8(64.4. (d) A level 0. 0). (n _1)8'/0' ~ X~l = r(~(nl).4/0. Compare them to the approximate interval given in part (a). and the central limit theorem as given in Appendix A.<. Xl = Xn_l na). Hint: By B.2.)'.30.n} and use this distribution to find an approximate 1 .027 13.1 is known.4 and the fact that X~ = r (k. and X2 = X n _ I (1 .(n .n(X . 1) random variables. Compute the (K. Now the result follows from Theorem 8. known) confidence intervals of part (b) when k = I. (c) A level 0.3.8). See A.3).1) + V2( n . it equals O. ~).".3.4 = E(Xi . .9 confidence region for (It. Now use the law of large numbers. K.)' . 100. 4) are independent. XI 16. = 3.3.D andn(X{L)2/ cr 2 '" = r (4. is replaced by its MOM estimate. Assuming that the sample is from a /V({L.4.1.5 is (1 _ ~a) 2.I). cr 2 ) population. See Problem 5.) Hint: (n .9 confidence interval for {L.J1. Show that the confidence coefficient of the rectangle of Example 4.t ([en .I) + V2( n1) t z( "1) and x(1 . . K.9 confidence interval for {L x= + cr. (b) A level 0. V(o') can be written as a sum of squares of n 1 independentN(O. Suppose that a new drug is tried out on a sample of 64 patients and that S = 25 cures are observed.4.4. Hint: Let t = tn~l (1 .10 Problems and Complements 279 (b) What sample size is needed to guarantee that this interval has length at most 0.
05 and . 19. Consider Example 4.4. as X and that X has density f(t) = F' (t).1. indicate how critical values U a and t a for An (Fo) and Bn(Fo) can be obtained using the Monte Carlo method of Section 4.4. i 1 . .u.l).3 for deriving a confidence interval for B = F(x). .d. < xl has a binomial distribotion and ~ vnl1'(x) . • i.6.F(x)1 .F(x)1 Typical choices of a and bare . In Example 4.4.F(x)l.7.u) confidence interval for 1"..1'(x)] Show that for F continuous. )F(x)[l. 1'1(0) < X < 1'. . (a) For 0 < a < b < I.i.4. " • j .9) and the upper boundary I" = 00. distribution function. b) for some 00 < a < 0 < b < 00.4.95. Assume that f(t) > 0 ifft E (a.1 (b) V1'(x)[l. In this case nF(l') = #[X.F(x)]dx.. 18.F(x)1 )F(x)[1 . (a) Show that I" • j j j = fa F(x)dx + f. It follows that the binomial confidence intervals for B in Example 4. .a) confidence interval for F(x).~u) by the value U a determined by Pu(An(U) < u) = 1. Suppose Xl. find a level (I .11 .3 can be turned into simultaneous confidence intervals for F(x) by replacing z (I .. !i . I (b) Using Example 4. (b)ForO<o<b< I.1.6 with . (c) For testing H o : F = Fo with F o continuous. X n are i. we want a level (1 . Show that for F continuous where U denotes the uniform.F(x)1 is the approximate pivot given in Example 4.I I 280 Testing and Confidence Regions Chapter 4 17. That is. I' . define An(F) = sup {vnl1'(X) . define . verify the lower bounary I" given by (4.I Bn(F) = sup vnl1'(x) . pl(O) < X <Fl(b)}.r fixed. U(O.
~a)J has size (c) The following are times until breakdown in days of air monitors operated under two different maintenance policies at a nuclear power plant. O ) is an increasing 2 function of + B~. What value of c givessizea? (b) Using Problems B. Experience has shown that the exponential assumption is warranted. with confidence coefficient 1 .10 Problems and Complements 281 Problems for Section 4. .· (a) If 1(0.0. a (b) Show that the test with acceptance region [f(~a) for testing H : Il.2). M n /0.. = 1 rejected at level a 0.al) and (1 .. X 2 be independent N( 01. and consider the prob2 + 8~ > 0 when (}"2 is known. (d) Show that [Mn .~a) I Xl is a confidence interval for Il.1"2nl. Hint: Use the results of Problems B. How small must an alternative before the size 0.. ~ 01).1. Let Xl. [8(X) < 0] :J [8(X) < 00 ].) samples. lem of testing H : 8 1 = 82 = 0 versus K : Or (a) Let oc(X 1 .. Give a 90% confidence interval for the ratio ~ of mean life times. show that [Y f( ~a) I X.3.a) VCB for this problem and exhibit the confidence intervals obtained by pntting two such hounds of level (1 . (e) Suppose that n = 16.th quantile of the . 3. (e) Similarly derive the level (1 .(2) together.. 5. then the test that accepts. . .3.12 and B.2).Xn1 and YI .2 are level 0..4 and 8. Or .a) LCB corresponding to Oc of parr (a).. of l. Yn2 be independent exponential E( 8) and £(. for H : (12 > (15. (a) Deduce from Problem 4. Hint: If 0> 00 .2 that the tests of H : . is of level a for testing H : 0 > 00 . (15 = 1.\.1 O? 2. x y 315040343237342316514150274627103037 826 10 8 29 20 10 ~ Is H : Il.90? 4. = 0.5 1..2 = VCBs of Example 4.4.a) UCB for 0. X2) = 1 if and only if Xl + Xi > c. Show that if 8(X) is a level (1 . = 1 versns K : Il. Let Xl. test given in part (a) has power 0.5.1 has size a for H : 0 (}"2 be < 00 . if and only if 8(X) > 00 .5.05.3. N( O . 1/ n J is the shortest such confidence interval.a. (b) Derive the level (1.3. . and let Il. .13 show that the power (3(0 1 . "6 based on the level (1  a) (b) Give explicitly the power function of the test of part (a) in tenos of the X~l distribution function. .) denotes the o. respectively. Yf (1 . (a) Find c such that Oc of Problem 4.2n2 distribution. respectively.. < XI? < f(1.Section 4.
8) where () and fare unknown. og are independentN (0. 'TJo) of the composite hypothesis H : ry = ryo.2). X n  00 ) is a level a test of H : 0 < 00 versus K : 0 > 00 • (d) Deduce that X(nk(Q)+l) (where XU) is the jth order statistic of the sample) is a level (1 .[XU) (f) Suppose that a = 2(n') < OJ and P. Show that the conclusions of (aHf) still hold. Thus. of Example 4. >k Determine the smallest value k = k(a) such that oklo) is level a for H and show that for n large. .~n + ~z(l . Suppose () = ('T}.Xn be a sample from a population with density f (t . and 1 is continuous and positive. but I(t) = I( t) for all t.00 . T) where 1] is a parameter of interest and T is a nuisance parameter.  j 1 j . Let X l1 .a)y'n.. k . 0. I: H': PIX! >OJ < ~ versus K': PIX! >OJ> ~. lif (tllXi > OJ) ootherwise. 0. (c) Modify the test of part (a) to obtain a procedure that is level 0: for H : OJ e= O?. Show that PIX(k) < 0 < X(nk+l)] ~ . (g) Suppose that we drop the assumption that I(t) = J( t) for all t and replace 0 by the v = median of F. ). 6. . I . X.4. Let C(X) = (ry : o(X. . . . (e) Show directly that P. . N( 020g. Hint: (c) X. O. 1 Ct ~r (b) Find the family aftests corresponding to the level (1. we have a location parameter family.0:) confidence interval for 11.2).. O?.a. respectively.0:) confidence region for 1] is equivalent to a family of level tests of these composite hypotheses. . (b) The sign test of H versus K is given by. (a) Show that testing H : 8 < 0 versus K : e > 0 is equivalent to testing .IX(j) < 0 < X(k)] do not depend on 1 nr I.282 Testing and Confidence Regions Chapter 4 = ()~ 1 2 eg and exhibit the corresponding family of confidence circles for (ell ( 2).a) confidence region for the parameter ry and con versely that any level (1 . L:J ~ ( . l 7.1 when (72 is unknown. ry) = O}.0:) LeB for 8 whatever be f satisfying our conditions. (a) Shnw that C(X) is a level (l . (c) Show that Ok(o) (X. We are given for each possible value 1]0 of1] a level 0: test o(X.
3 and let [O(S). Y ~ N(r/. 9.a).10 Problems and Complements 283 8. 12. Let P (p.a.2a) confidence interval for 0 of Example 4. if 00 < O(S) (S is inconsistent with H : 0 = 00 ). 11. and let () = (ry. Suppose X. Let a(S. or).00 indicates how far we have to go from 80 before the value 8 is not at all surprising under H.Y.5.5. This problem is a simplified version of that encountered in putting a confidence interval on the zero of a regression line. Yare independent and X ~ N(v. . 1). That is. Y. p)} as in Problem 4. Define = vl1J.[q(X) < 0 2 ] = 1. 0) ranges from a to a value no smaller than 1 .1. Note that the region is not necessarily an interval or ray. laifO>O <I>(z(1 . Show that as 0 ranges from O(S) to 8(S).O ~ J(X.Oo) denote the pvalue of the test of H : 0 = 00 versus K : 0 > 00 in Example 4.a) . 1) and q(O) the ray (X .5. hence. Then the level (1 .pYI < (1 1 otherwise. the interval is unbiased if it has larger probability of covering the true value 1] than the wrong value 1]'.1) is unbiased. p. 10.a) confidence interval [ry(x). that suP. Show that the Student t interval (4.z(l.20) if 0 < 0 and.a).5. Po) is a size a test of H : p = Po.a))2 if X > z(1 .Section 4. Y.2. let T denote a nuisance parameter. 00) is = 0'. 7)). (b) Describe the confidence region obtained by inverting the family (J(X.7. (a) Show that the lower confidence bound for q( 8) obtained from the image under q of q(X) (X . the quantity ~ = 8(8) .p) = = 0 ifiX .2.a) = (b) Show that 0 if X < z(1 . 8( S)] be the exact level (1 .5. 1).7. a( S. Let X ~ N(O. Let 1] denote a parameter of interest. Establish (iii) and (iv) of Example 4. ' + P2 l' z(1 1 .2 a) (a) Show that J(X.z(1 . Thus. Hint: You may use the result of Problem 4. alll/. ij(x)] for ry is said to be unbiased confidence interval if Pl/[ry(X) < ry' < 7)(X)] <1 a for all ry' F ry.
05 could be the fifth percentile of the duration time for a certain disease. Suppose we want a distributionfree confidence region for x p valid for all 0 < p < 1. k(a) . Show that P(X(k) < x p < X(nl)) = 1.p)"j. (b) The quantile sign test Ok of H versus K has critical region {x : L:~ 1 1 [Xi > 0] > k}. X n be a sample from a population with continuous distribution F.4. L. Show that Jk(X I x'.7. Simultaneous Confidence Regions for Quantiles.4. Construct the interval using F. (t) Show that k and 1in part (e) can be approximated by h (~a) and h (1  ~a) where ! heal is given in part (b).. it is distribution free. il. (X(k)l X(n_I») is a level (1 .. (e) Let S denote a B(n. Thus.1 and Fu 1. . (g) Let F(x) denote the empirical distribution. Show that the interval in parts (e) and (0 can be derived from the pivot T(x ) p  ..95 could be the 95th percentile of the salaries in a certain profession l or lOOx .p) variable and choose k and I such that 1 . . We can proceed as follows. " II [I . (a) Show that testing H : x p < 0 versus I< : x p > 0 is equivalent to testing H' : P(X > 0) < (I .a) confidence interval for x p whatever be F satisfying our conditions.h(a). 1 . . X n x*) is a level a test for testing H : x p < x* versus K : x p > x*. be the pth quantile of F.p) versus K' : P(X > 0) > (1 .a) LeB for x p whatever be f satisfying our conditions. In Problem 13 preceding we gave a disstributionfree confidence interval for the pth quantile x p for p fixed. k+ ' i .a. Let F. 0 < p < 1. P(xp < x p < xp for all p E (0.b) (a) Show that this statement is equivalent to = 1. where ! 1 1 heal c: n(1  p) + zl_QVnp(1  pl.) Suppose that p is specified. 14. Detennine the smallest value k = k(a) such that Jk(Q) has level a for H and show that for n large. Confidence Regions for QlIalltiles.a = P(k < S < n I + 1) = pi(1. = ynIP(xp) .5. . . ! .F(x p)]. F(x).1» . I • (c) Let x· be a specified number with 0 < F(x') < 1. Let Xl. vF(xp) [1 F(xp)] " Hint: Note that F(x p ) = p.. .a. and F+(x) be as in Examples 4. Then   P(P(x) < F(x) < P+(x)) for all x E (a.284 Testing and Confidence Regions Chapter 4 13. That is.p). Let x p = ~ IFI (p) + Fi! I (P)]. iI .. i j (d) Deduce that X(n_k(Q)+I) (XU) is the jth order statistic of the sample) is a level (1 . (See Section 3. lOOx. That is.6 and 4..
.10 Problems and Complements 285 where x" ~ SUp{. LetFx and F x be the empirical distributions based on the i. .17(a) and (c) can be used to give another distributionfree simultaneous confidence band for x p . F(x) > pl.i. ~ ~ (a) Consider the test statistic x .1. F f (. (c) Show how the statistic An(F) of Problem 4.p. n Hint: nFx(x) = Li~ll[Fx(Xi) < Fx(x)] ~ nFu(F(x)) and ~ ~ nF_x(x) = L IIXi < xl ~ L i==I i==I n n IIFx( X.z. then L i==l n IIXi < X + t. then D(F ~ ~ ~ ~ as D(Fu 1 F I  U = F(X) U )..Section 4.. We will write Fx for F when we need to distinguish it from the distribution F_ x of X.T: a <...) < p} and. . The hypothesis that A and B are equally effective can be expressed as H : F ~x(t) = Fx(t) for all t E R.x. Suppose that X has the continuous distribution F.U with ~ U(O.X I. X I. TbeaitemativeisthatF_x(t) IF(t) forsomet E R. where F u and F t . . Hint: Let t.u are the empirical distributions of U and 1 .(x) = F ~(Fx(x» . where A is a placebo. Give a distributionfree level (1 . I).j < F_x(x)] See also Example 4..xp]: 0 < l' < I}. that is.5.r" ~ inf{x: a < x < b.(x)] < Fx(x)] = nF1_u(Fx(x)). L i==I n I IFx (Xi) .d.. (b) Suppose we measure the difference between the effects of A and B by ~ the difference between the quantiles of X and X. 15.. (b) Express x p and .r p in terms of the critical value of the Kolmogorov statistic and the order statistics. . VF(p) = ![x p + xlpl. That is.Q) simultaneous confidence band for the curve {VF(p) : 0 < p < I}. the desired confidence region is the band consisting of the collection of intervals {[.13(g) preceding. Suppose X denotes the difference between responses after a subject has been given treatments A and B. Express the band in terms of critical values for An(F) and the order statistics. X n and . F~x) has the same distribution Show that if F x is continuous and H holds. Note the similarity to the interval in Problem 4.X n .. .4.. where p ~ F(x).1.r < b.
'" 1 X n be Li.2A(x) ~ HI(F(x)) l 1 + vF(F(x))... F y ) . F y ) has the same distribntion as D(Fu . is the nth quantile of the distribution of D(Fu l F 1 . The result now follows from the properties of B(·).".d.x p . 1 Yn be i. Let t (c) A Distdbution and ParameterFree Confidence Interval. Give a distributionfree level (1. Hint: Define H by H. i . Properties of this and other bands are given by Doksum. Then H is symmetric about zero.. "" = Yp . he a location parameter as defined in Problem 3. where x p and YP are the pth quan tiles of F x and Fy. It follows that X is stochastically between Xsv F and XS+VF where X s HI(F(X)) has the symmetric distribution H.17. we get a distributionfree level (1 . vt(p)) is the band in part (b). where :F is the class of distribution functions with finite support.. let Xl.) < Fy(x p )] = nFv (Fx (t)) under H.1) empirical distributions.l (p) ~ ~[Fxl(p) . It follows that if c.(F(x)) . i=I n 1 i t . F y ).F x l (l .p)] = ~[Fxl(p) + F~l:(p)]. vt = O<p<l sup vt(p) O<p<l where [V.F~x.:~ 1 1 [Fy(Y. . and by solving D( Fx. for ~.) < Fx(x)) = nFv(Fx(x))..F1 _u).) ~ < F x (")] ~ ~ = nFu(Fx(x)). (b) Consider the parameter tJ p ( F x.:~ II[Fx(X.:~ I1IFx(X. treatment B responses. . Fx(t)l· . the probability is (1 . (p).) = D(Fu .. Let Fx and F y denote the X and Y empirical distributions and consider the test statistic ~ ~ .U ). then  nFy(x + A(X)) = L: I[Y. nFy(t) = 2.) < do.. = 1 i 16. .x.3. where do:.286 Testing and Confidence Regions ~ Chapter 4 '1..Fy ): 0 < p < 1}. <aJ Show that if H holds.Jt: 0 < p < 1) for the curve (5 p (Fx. As in Example 1.. Also note that x = H. treatment A (placebo) responses and let Y1 . nFx(x) we set ~ 2.a) simultaneous confidence band 15. we test H : Fx(t) ~ Fy (t) for all t versus K : Fx(t) # Fy(t) for some t E R.) < Fx(t)] = nFu(Fx(t)).5.x = 2VF(F(x)).x (': + A(x)).a) simultaneous confidence band for A(x) = F~l:(Fx(x)) .. Fenstad and Aaberge (1977). then D(Fx . To test the hypothesis H that the two treatments are equally effective. ~ ~ ~ F~x.Fy ) = maxIFy(t) 'ER "" .i. then D(Fx... I I I 1 1 D(Fx..d. We assume that the X's and Y's are in~ dependent and that they have respective continuous distributions F x and F y . Moreover. i I f F..) : :F VF ~ inf vF(p).(x) ~ R. Hint: nFx(t) = 2. where F u and F v are independent U(O. Hint: Let A(x) = FyI (Fx(x)) .t::. .1. Fx. Show that for given F E F. . Let 8(.) is a location parameter} of all location parameter values at F..a) that the interval IvF' vt] contains the location set LF = {O(F) : 0(. < F y I (Fx(x))] i=I n = L: l[Fy (Y.
) is a shift parameter} of the values of all shift parameters at (Fx .) I i=l L tt . F y. F y ) is in [6.('. Exhibit the UMA level (1 .. F x ) ~ a and YI > Y.Section 4.2. Let6.). Let 6 = minO<1'<16p(Fx. the probability is (1 .= minO<p<l 6. Fy. O(Fx.(Fx . ~) distribution. and 8 are shift parameters. F y ). Show that n p is known and 0 O' ~ (2 2~A X. where is nnknown. Suppose X I. thea It a unifonnly most accurate level 1 . F y. Show that for given (Fx . nFx ("') = nFu(Fx(x)).) = F (p) . where :F is the class of distri butions with finite support. F y ) E F x F.)I L ti .2uynz(1 i=! i=l all L i=l n ti is also a level (1..) < do: for Ll.(x) = Fy(x thea D(Fx.) . = 22:7 I Xf/X2n( a) is .6]. Show that for the model of Problem 4. F y ) and b _maxo<p<l 6p(Fx ..10 Problems and Complements 287 Moreover. Now apply the axioms. Show that B = (2 L X. 3.(P).) ~ + t.4..·) is a shift parameter. Properties of ~ £ = "" this and other bands are given by Doksum and Sievers (1976).(x)) .T. then B( Fx . 0 < p < 1. t.4.0:) simultaneouS coofidence band for t. Show that if 0(·. T n n = (22:7 I X. B(. (b) Consider the unbiased estimate of O.6+] contains the shift parameter set {O(Fx. Fy ) .2z(1 i=l n n a)ulL tt]} i=l is a uniformly most accurate lower confidence bound for 8.6. (d) Show that E(Y) . Q.. (a) Consider the model of Problem 4.E(X). then by solving D(F x. .) 12:7 I if. . 6. . Let do denote a size a critical value for D(Fu .(. :F x :F O(Fx. Fy ). t Hint: Set Y' = X + t. 2. moreover X + 6 < Y' < X + 6.6 1.Fy).0:) confidence bound forO. ~ D(Fu •F v ).0: UCB for J. (c) Show that the statement that 0* is more accurate than 0 is equivalent to the assertion that S = (2 E~ 1 X i )/ E~ 1 has uniformly smaller variance than T.(X)...Xn is a sample from a r (p. if I' = 1I>'. we find a distributionfree level (1 .. ~ ~ = (e) A Distribution and ParameterFree Confidence Interval. F y ) > O(Fx " Fy). Xl > X 8t 8t 0} (c) A parameter 0 = <5 ('. then Y' = Y.).L. 6+ maxO<p<1 6+(p).    Problems for Section 4. It follows that if we set F\~•. Fx +a) = O(Fx a. Fv ).. .a) that the interval [8. is called a shift parameter if O(Fx. tt Hint: Both 0 and 8* are nonnally distributed. x' 7 R.3.a) UCB for O.F y ' (p) = 6. F y ) < O(Fx .
Problems for Section 4.6.4. then )" = sB(r(1 .4. • . G corresponding to densities j.aj LCB such that I . where Show that if (B.2. Hint: Use Examples 4. f3( T.2.. Suppose that B' is a uniformly most accurate level (I . uniform.\ =.Bj has the F distribution F 2r . i . • • • = t) is distributed as W( s.3). p. Hint: Apply Problem 4. .288 Testing and Confidence Regions Chapter 4 4. . .B)+.l. 1. Suppose [B'. Prove Corollary 4. then E. Poisson.. ~.3.6 for c fixed..1 are well defined and strictly increasing.. Construct unifoffilly most accurate level 1 .0 upper and lower confidence bounds for J1 in the model of Problem 4.f:s F.[B < u < O]du. then E(U) > E(V). and that B has the = se' (t. I'.d. where s = 80+ 2n and W ~ X. t > e. 0'].Xn are Ll.( 0' Hint: E. X has a binomial. distribution and that B has beta.B(n. .B). E(U) = . t) is the joint density of (8. Suppose that given B Pareto. Suppose that given. respectively.O] are two level (1 .d. . s). where So is some COnstant and V rv X~· Let T = L:~ 1 Xi' (a) Show that ()" I T m = k + 2t. (0 . P(>. 0).B). 5. OJ. s). V be random variables with d.12 so that F. Let U.1.' with "> ~. ·'"1 1 .6. . 8.) and that. F1(tjdt. '" J. U(O. s > 0. Xl. Pa( c.1.2" Hint: See Sections 8.(O .6. Hint: By Problem B. > 1" (h) Show how quantiles of the X2 distribution can be used to determine level (I . 3.l2(b). Show that if F(x) < G(x) for all x and E(U).W < BI = 7.a) upper and lower credible bounds for A. /3( T.C. . distribution with r and s positive integers.2 and B. j ..a) confidence intervals such that Po[8' < B' < 0'] < p. (B" 0') have joint densities. satisfying the conditions of Problem 8.\ is distributed as V /050. . e> 0.3...."" X n are i. g.7 1 . B).B')+.2.W < B' < OJ for allB' l' B. (b) Suppose that given B = B. 2. s) distribution with T and s integers. .E(V) are finite. 1I"(t) Xl. Establish the following result due to Pratt (1961). n = 1. .B') < E.3.a. U = (8 . t)dsdt = J<>'= P. J • 6. (a) Show that if 8 has a beta. and for (). (1: dU) pes. Show how the quantiles of the F distribution can be used to find upper and lower credible bounds for.\. [B.B) = J== J<>'= p( s.3 and 4.X.6. establish that the UMP test has acceptance region (4.6 to V = (B . density = B. In Example 4.i.
0:) confidence interval for B..y) proportional to where 1f(r I so) is the density of solV with V ~ Xm+n2. y) of is (Student) t with m + n . where (75 is known.r). Hint: 7f(Ll 1x. . . (b) Show that given r..x) is aN(x. ..y.m} and + n. ". (a) Let So = E(x.Xm.112. y). 1 r. y) is proportional to p(8)p(x 1/11.10 Problems and Complements 289 (a)LetM = Sf max{Xj. Suppose that given 8 = (ttl' J. Show fonnaUy that the posteriof?r(O 1 x. (a) Give a level (1 . and Y1. (b) Find level (1 .2 degrees of freedom. . Here Xl.8 1.Xn is observable and X n + 1 is to be predicted.. as X. N(J.a) upper and lower credible bounds for O.a) upper and lower credible bounds forO to the level (Ia) upper and lower confidence bounds for B.y) = 7f(r I sO)7f(LlI x . .i. Hint: p( 0 I x. = So I (m + n  2).. r > O.r)p(y 1/12. In particular consider the credible bounds as n + 00.1. r) where 7f(Ll I x .... 4.2. (d) Use part (c) to give level (Ia) credible bounds and a level (Ia) credible interval for Ll.. r In) density. 1f(/11 1r... respectively. Let Xl. Yn are two independent N (111 \ 'T) and N(112 1 'T) samples. = PI ~ /12 and'T is 7f(Ll. . y) is obtained by integrating out Tin 7f(Ll.0"5). (c) Give a level (1 . 'P) is the N(x .Xn ). Xl.l.rlm) density and 7f(/12 1r. r( TnI (c) Set s2 + n. ..y. .. Show thatthe posterior distribution 7f( t 1 x.Xn + 1 be i.d. Suppose 8 has the improper priof?r(O) = l/r.1 )) distribution. y) is aN(y. T) = (111. y) and that the joint density of. Problems for Section 4. .Section 4.. 'T).y. Show that (8 = S IM ~ Tn) ~ Pa(c'.s') withe' = max{c.0:) prediction interval for Xnt 1.1. r 1x.r 1x. . (d) Compare the level (1..x)' + E(Yj y)'.x)1f(/1.y) is 1f(r 180)1f(/11 1 r. /11 and /12 are independent in the posterior distribution p(O X.
i. .i.10..290 Testing and Confidence Regions Chapter 4 • • (b) Compare the interval in part (a) to the Bayesian prediction interval (4. X is a binomial.0'5) with 0'5 known..'" .2) hy using the observation that Un + l is equally likely to be any of the values U(I).·) denotes the beta function.3) by doing a frequentist computation of the probability of coverage. • . (This q(y distribution. B(n. . x > 0. 4. That is.s+nx+my)/B(r+x. Let Xl.. (3(r. .9.e. A level (1 . 0'5)' Take 0'5 ~ 7 2 = I.. . give level (1 ~ a:) lower and upper prediction (c) If F is continuous with a positive density f on (a. .8. Hint: Xi/8 has a ~ distribution and nXn+d E~ 1 Xi has an F 2 .Xn + 1 be i.10. N(Jl. Un + 1 ordered.t for M = 5.9.Xn are observable and we want to predict Xn+l. i! Problems for Section 4.Q) distribution free lower and upper prediction bounds for X n + 1 • < 00. give level 3. Show that the likelihood ratio statistic for testing If : 0 = ~ versus K : 0 i ~ is equivalent to 12X .5.i. 00 < a < b (1 . Give a level (1 . Let X have a binomial.d. Suppose that Y.d... :[ = .12. distribution. 1 X n are i. X n+ l are i.11. . 1 2.a) lower and upper prediction bounds for X n + 1 .d. Then the level of the frequentist interval is 95%..nl.. 5. ~o = 10.2. let U(l) < .a (P(Y < Y) > I . In Example 4. as X where X has the exponential distribution F(x I 0) =I . distribution. q(ylx)= ( .9 1. .0:) lower (upper) prediction bound on Y = X n+ 1 is defined to be a function Y(Y) of Xl.. and that (J has a beta.' . 0 > 0.. which is not observable. F.) Hint: First show that .'.8.x ) where B(·. I x) is sometimes called the Polya q(y I x) = J p(y I 0). .a). has a B(m. . where Xl. (a) If F is N(Jl. ! " • Suppose Xl. 0) distribution given (J = 8.8. s). Present the results in a table and a graph. X n such that P(Y < Y) > 1.s+n. Show that the conditional (predictive) distribution of Y given X = xis 1 J I i I . " u(n+l)..8.. suppose Xl.8)... Find the probability that the Bayesian interval covers the true mean j. . )B(r+x+y. .x . Suppose XI. and a = . give level (I . random variable.(0 I x)dO.a) prediction interval for X n + l .. as X . < u(n+l) denote Ul .05..Xn are observable and X n + 1 is to be predicted.15. . Suppose that given (J = 0. 0). CJ2) with (J2 unknown.. n = 100. Establish (4.5. B( n.2n distribution..• .. b). (b) If F is N(p" bounds for X n + 1 .
(b) Use the nonnal approximatioh to check that CIn C'n n .C2n !' 1 as n !' 00. let X 1l . if Tn < and = (n/2) log(1 + T.I)) for Tn > 0. of the X~I distribution. 0' 4.. where F is the d. In Problems 24. 0'2 < O'~ versus K : 0'2 > O'~. XnI (1 .(Xi < 2" " (10 i=I  X) 2 < C2 where CI and C2 satisfy. 2.~Q) n + V2nz(l. = Xnl (1 . and only if. where Tn is the t statistic. In testing H : It < 110 versus K . Hint: Note that liD ~ X if X < 1'0 and ~ 1'0 otherwise.. 2 2 L.~Q) approximately satisfy (i) and also (ii) in the sense that the ratio . Thus. >.(x) = 0. OneSided Tests for Scale..10 Problems and Complements 291 is an increasing function of (2x .log (ii' / (5)1 otherwise. if ii' /175 < 1 and = . ° 3. Xn_1 (~a). We want to test H : = 0'0 versus K : 0' 1= 0'0' (a) Show that the size a likelihood ratio test accepts if. We want to test H . ~2 .(Xi . log . I . Hint: log .(x) Hint: Show that for x < !n. (i) F(c.n) and >. n Cj I L. (ii) CI ~ C2 = n logcl/C2.='7"nlog c l n /C2n CIn . · . = aD nO' 1".Q.F(CI) ~ 1. (n/2)[ii' / C a5  (b) To obtain size Q for H we should take Hint: Recall Theorem B. ..X) aD· t=l n > C. 0'2) sample with both JL and 0'2 unknown.f.Xn be a N(/L.Section 4. onesample t test is the likelihood ratio test (fo~ Q <S ~). (c) Deduce that the critical values of the commonly used equaltailed test. and only if..3. (c) These tests coincide with the testS obtained by inverting the family of level (1  Q) lower confidence bounds for (12.V2nz(1 . TwoSided Tests for Scale... JL > Po show that the onesided.(x) = .Q).3../(n . Show that (a) Likelihood ratio tests are of the form: Reject if..~Q) also approximately satisfy (i) and (ii) of part (a).) .(nx).( x) ~ 0.
2' (12) samples. 114. Forage production is also measured in pounds per acre.9. e 1 ) depends on X only throngh T.tl > {t2. • Xi = (JXi where X o I 1 + €i. n. respectively. I'· (c) Find a level 0.. (X.4.95 confidence interval for equaltailed tests of Problem 4. 110. Show that A(X.95 confidence interval for p. The following blood pressures were obtained in a sample of size n = 5 from a certain population: 124. find the sample size n needed for the level 0. a 2 ) random variables. . (a) Show that the MLE of 0 ~ (1'1.Xn are said to be serially correlated or to follow an autoregressive model if we can write ~ .l. Assume a Show that the likelihood ratio statistic is equivalent to the twosample t statistic T. Assume the onesample normal model. 190.O). where a' is as defined in < ~.05 onesample t test. 9..3. 6. (b) Consider the problem aftesting H : Jl. . whereas the treatment measurements (y's) correspond to 500 pounds of mulch per acre. 0 E e...292 Testing and Confidence Regions Chapter 4 5. = 0 and €l.lIl (7 2) and N(Jl.01 test to have power 0. . Suppose X has density p(x. 100.90 confidence interval for a by using the pivot 8 2 ja z . The control measurements (x's) correspond to 0 pounds of mulch per acre. (a) Using the size 0: = 0. eo. . . I Yn2 be two independent N(J...95 when nl = nz = ~n and (1'1 1'2)/17 ~ ~. The following data are from an experiment to study the relationship between forage production in the spring and mulch left on the ground the previous fall.. 8. .9.. Y. i = 1. . . 1 I 794 2012 1800 2477 576 3498 411 2092 897 1808 I Assume the twosample normal model with equal variances.05. . . Let Xl. (a) Find a level 0. and that T is sufficient for O. 0'2 ~ ! corresponding to inversion of the (c) Compute a level 0.. can we conclude that the mean blood pressure in the population is significantly larger than IDO? (b) Compute a level 0.z . (b) Can we conclude that leaving the indicated amount of mulch on the ground significantly improves forage production? Use ct = 0.P. 7..90 confidence interval for the mean bloC<! pressure J. The nonnally distributed random variables Xl. (c) Using the normal approximation <l>(z(a)+ nl nz/n(1'1 I'z) /(7) to the power.L.Xn1 and YI .1'2.. x y vi I I . (7 2) is Section 4. €n are independent N(O.1 < J. 0'2).L2 versus K : j. ..
. P. Y2 )dY2. Xo = o.'" p(X.O) for 00 = (Z. From the joint distribution of Z and V.. Show that. / L:: i 10. 12. .<»1 < e < <>. i = 1. and only if..[T > tl is an increasing function of 6. XZ distributions respectively. Let Xl. for each v > 0.6. The F Test for Equality of Scale. has density !k .. (b) Show that the likelihood ratio statistic of H : 0 = 0 (independence) versus K : 0 o(serial correlation) is equivalent to C~=~ 2 X i X i _I)2 / l Xl. The power functions of one. 7k. . Stein).O)e (a) What is the size a likelihood ratio test for testing H : B = 1 versus K : B 1= I? (b) Show that the test that rejects if.2) In exp{ (I/Z(72) 2)Xi . Let e consist of the point I and the interval [0. (An example due to C. . Define the frequency functions p(x. 0) by the following table. samples from N(J1l.6. Condition on V and apply the double expectation theorem.. ! x 0 2 1<> 2 I 0 <> I 2 1<> 2 1 ~a Ia 2 iI Oe (i~H <» (1') a 10: (t~) (~. Show that the noncentral t distribution. Hint: Let Z and V be independent and have N(o. . .IIZI > tjvlk] is increasing in [61. 13. I). (T~).IITI > tl is an increasing function of 161. Yl.OXi_d 2} i= 1 < Xi < 00. P.and twosided t tests. . J7i'k(~k)ZJ(k+ll io Hint: Let Z and V be as in the preceding hint. Fix 0 < a < and <>/[2(1 ..Section 4. . II. Then.(t) = .2. 7k.. Suppose that T has a noncentral t. (a) P. get the joint distribution of Yl = ZI vVlk and Y2 = V.<» (1 . respectively. distribution. Yn2 be two independent N(P. with all parameters assumed unknown. has level a and is strictly more 11. (b) P.[Z > tVV/k1 is increasing in 6. X powerful whatever be B. (Yl. = 0. Consider the following model... (Yl) = f PY"Y. X n1 .10 Problems and Complements 293 X n ) is n (a) Show that the density of X = (Xl. 1 {= xj(kl)ellx+(tVx/k'I'ldx. Then use py. (Tn..n.
T has a noncentral Tnz distribution with noncentrality parameter p. and only if. YI ). i=2 i=2 n n S12 =L i=2 n U.a/2) or F < f( n/2).1I(a).. where l _ .7 to conclude that tribution as 8 12 /81 8 2 . that T is an increasing function of R. i 0 in xi !:. Un = Un. using (4.Y)'/E(X.8. Let R = S12/SIS" T = 2R/V1 R'.7 and B. !I " .19 315 2. . I ' LX. . Argue as in Proplem 4. note that . .' ! . 0. and using the arguments of Problems B. i=l j=l n n (b) Show that if we have a sample from a bivariate N(1L1l 1L2. p) distribution. . Consider the problem of testing H : p = 0 versus K : p i O. Yn) be a sampl~ from a bivariateN(O.37 384 2..1)/(n.nl1 ar is of the fonn: Reject if. (1~..4. as an approximation to the LR test of H : a1 = a2 versus K : a1 =I a2. . si = L v. . a?. .9. (11. where a. Because this conditional distribution does not depend on (U21 •. . F > /(1 . .96 279 2.4. Hint: Use the transfonnations and Problem B.4. • . .~ I i " 294 Testing and Confidence Regions Chapter 4 . The following data are the blood cholesterol levels (x's) and weightlheight ratios (y's) of 10 men involved in a heart study. can you conclude at the 10% level of significance that blood cholest~rollevel is correlated with weight/height ratio? 'I I . (Xn .0 has the same distribution as R. 0.. . then P[p > c] is an increasing function of p for fixed c.12 337 1. Un). show that given Uz = Uz . . 14. L Yj'. 1. where f( t) is the tth quantile of the F nz . Let (XI. .8. · . . <. 1 . distribution and that critical values can be .1' Sf = L U. . x y 254 2.10. (a) Show that the likelihood ratio ~tatistic is equivalent to ITI where . 1. V.62 284 2.64 298 2.J J . . (Un..4. Vn ) is a sample from a N(O.9. and use Probl~1ll4. the continuous versiop of (B.9. .. p) distribution. (a) Show that the LR test of H : af = a~ versus K : ai > and only if. .V. F ~ [(nl . r> (c) Justify the twosided F test: Reject H if.68 250 2. Let '(X) denote the likelihood ratio statistic for testing H : p = 0 versus K : p the bivariate normal model.4..71 240 2.~ I . • I " and (U" V.1.nt 1 distribution. > C..61 310 2. vn 1 l i 16.0 has the same dis r j . p) distribution.94 Using the likelihood ratio test for the bivariate nonnal model. Finally.1 . (d) Relate the twosided test of part (c) to the confidence intervals for a?/ar obtained in Problem 4. 15.5) that 2 log '(X) V has a distribution.4) and (4.1)]E(Y.24) implies that this is also the unconditional distribution.).'. ! .X)' (b) Show that (aUa~)F has an F nz obtained from the :F table. .'. Sbow. .
Wiley & Sons. P. Box. "Is there a sex bias in graduate admissions?" Science. 00. and r.10.1 (1) The point of view usually taken in science is that of Karl Popper [1968]. 00. Nntes for Section 4.Section 4. PROSCHAN. HAMMEL. Notes for Section 4. Apology for Ecumenism in Statistics and Scientific lnjerence.~.819 for" = 0. and C..9. E.11. BICKEL. REFERENCES BARLOW. 187.0.398404 (975). Acceptance of a hypothesis is only provisional as an adequate current approximation to what we are interested in understanding.).4 (I) If the continuity correction discussed in Section A. R.851. T.01. . Consider the bioequivalence example in Problem 3.3). Rejection is more definitive. G. AND J. 1983.(t) = tl(.o = 0.15 is used here.12 1965.895 and t = 0. G. it also has confidence level (1 . Essentially. (2) In using 8(5) as a confidence hound we are using the region [8(5).. F..3 (1) Such a class is sometimes called essentially complete.035. see also Ferguson (1967). S in 8(X) would be replaced by 5 + and 5 in 8(X) is replaced by 5 . if the parameter space is compact and loss functions are bounded. (2) We ignore at this time some reallife inadequacies of this experiment such as the placebo effect (see Example 1.2. 4. (2) The theory of complete and essentially complete families is developed in Wald (1950). 1974) to the critical value is <.3.[ii) where t = 1. .I\ E. Editors New York: Academic Press.. !. Box. 4. The term complete is then reserved for the class where strict inequality in (4. Data Analysis and Robustness. Stephens. Mathematical Theory of Reliability New York: J. P... i]. T6 . W.05 and 0. (a) Find the level" LR test for testing H : 8 E given in Problem 3.. Leonard.2.01 + 0. E. respectively. (b) Compare your solution to the Bayesian solution based on a continuous loss function rro = 0.11 NOTES Notes for Section 4. Tl . Wu.1. i] versus K : 8 't [i.3) holds for some 8 if <p 't V. More generally the closure of the class of Bayes procedures (in a suitable metric) is complete.0. Because the region contains C(X).11 Notes 295 17. AND F. O'CONNELL. (3) A good approximation (Durbin.. the class of Bayes procedures is complete. Consider the cases Ii.[ii .9. 1973.
16.. 63. K. L. DURBIN. SAMUEL<:AHN. GLAZEBROOK. S." The American Statistician. Y. The Theory of Probability Oxford: Oxford University Press. Statistical Methods for MetaAnalysis Orlando. "Distribution theory for tests based on the sample distribution function. Sequential Analysis New York: Wiley. Testing Statistical Hypotheses.. "Optimal confidence intervals for the variance of a normal distribution. H.674682 (1959). VAN ZWET. Amer. . 53.. 326331 (1999). . A. ! I I I FERGUSON.• Statistical Methods for Research Workers. H." Biometrika. D. I ." 1. AND G." Biometrika. Math. CAl.. R.." An" Math. WETHERDJ. AND 1. A. WALD. Assoc. DAS GUPTA. Wiley & Sons. HAlO. . 54. Sequential Methods in Statistics New York: Chapman and Hall. "PloUing with confidence: Graphical comparisons of two populations. New York: Springer.. W. "A twosample test for a linear hypothesis whose pOWer is independent of the variance. New York: Hafner Publishing Company.. 421434 (1976)." J. the Growth ofScientific Knowledge.." J. "Plots and tests for symmetry. SACKRoWITZ. j j New York: 1... Aspin's tables:' Biometrika. 2nd ed. SIEVERS. Wiley & Sons. 1997. 1961. .. B. HEDGES. AND K. OSTERHOFF. 13th ed. 549567 (1961). 1950. STEPHENs. Statist. Mathematical Statistics New York: J. 1968. A. AND G. ! . LEHMANN. K. 36. Sratisr.• Conjectures and Refutations. 1. POPPER.. WANG... G... G.. A. . R. Amer. S. AARBERGE. Mathematical Statistics. A. KLETI. FL: Academic Press.. j 1 . 3rd ed. 56. 64. SIAM. Pennsylvania (1973). D." The AmencanStatistician. "Further notes on Mrs. New York: I J 1 Harper and Row. 1986. FISHER. DOKSUM.. "On the combination of independent test statistics. Statistical Theory with Engineering Applications . Statist. Philadelphia. "P values as random variablesExpected P values. . FENSTAD. "Length of confidence intervals. 9. T. K. J. 54 (2000). Statist.. OLlaN. 1967. V. 69. Statist.. TATE. AND R. Statist. 1958.. 1952.243258 (1945). W.296 Testing and Confidence Regions Chapter 4 BROWN. j 1 . AND I. 659680 (1967). A Decision Theoretic Approach New York: Academic Press. L. WALD. "Probabilities of the type I errors of the Welch tests. T. 8. WELCH. Amer. JEFFREYS. 38. AND A. 66. 473487 (1977)." Ann. 1%2... 1985. A." J. 243246 (1949). c. "EDF statistics for goodness of fit." Regional Conference Series in Applied Math. Assoc. "Interval estimation for a binomial proportion. Statistical Decision Functions New York: Wiley. M. R. PRATI. . . DOKSUM. STEIN. 605608 (1971). R. 1947. I . 730737 (1974). F. WILKS. E. L. Assoc. Amer. AND E.
Chapter 5 ASYMPTOTIC APPROXIMATIONS 5. This distribution may be evaluated by a twodimensional integral using classical functions 297 .3)..1. To go one step further. .1. (5..Xn are i.1) This is a highly informative formula. closed fonn computation of risks in tenns of known functions or simple integrals is the exception rather than the rule. and F has density f we can wrile (5. If n is odd. Even if the risk is computable for a specific P by numerical integration in one dimension. k Evaluation here requires only evaluation of F and a onedimensional integration. N(J1.1 degrees of freedom. computation even at a single point may involve highdimensional integrals.F(x)k f(x).1 (D.2 that .9. Ifwe want to estimate J1(F} = (5. . v(F) ~ F. However.3) gn(X) =n ( 2k ) F k(x)(1 .2) and (5. 1 X n from a distribution F. and calculable for any F and all n by a single onedimensional integration. if n ~ 2k + 1.. our setting for this section EFX1 and use X we can write. consider med(X 11 •.1. the qualitative behavior of the risk as a function of parameter and sample size is hard to ascertain.1. a 2 ) we have seen in Section 4. the qualitative behavior of the risk as a function of n and simple parameters of F is not discernible easily from (5. Worse. but a different one for each n (Problem 5. consider a sample Xl.1). Worse. .• . from Problem (B.1 INTRODUCTION: THE MEANING AND USES OF ASYMPTOTICS Despite the many simple examples we have dealt with.13).Xn ) as an estimate of the population median II(F)..1. and most of this chapter. telling us exactly how the MSE behaves as a function of n. In particular.2.i.1. If XI.2) where.fiiXIS has a noncentral t distribution with parameter 1"1 (J and n .d. consider evaluation of the power function of the onesided t test of Chapter 4.
The other. Thus. . or the sequence Asymptotic statements are always statements about the sequence. . Rn (F). Approximately evaluate Rn(F) by _ 1 B i (5. 1 < j < B from F using a random number generator and an explicit fonn for F.Xn):~Xi<n_l (~2 (EX. just as in numerical integration. where A= { ~ . S) and in general this is only representable as an ndimensional integral. Asymptotics. always refers to a sequence of statistics I..d.3. more specifically. 00.Xnj ) . which occupies us for most of this chapter. We shall see later that the scope of asymptotics is much greater. based on observing n i. i . • . The first..2) and its qualitative properties are reasonably transparent.I I !' 298 Asymptotic Approximations Chapter 5 (Problem 5. 10=1 f=l There are two complementary approaches to these difficulties. We now turn to a detailed discussion of asymptotic approximations but will return to describe Monte Carlo and show how it complements asyrnptotics briefly in Example 5.fii(Xn  EF(Xd) + N(O. is to approximate the risk function under study by a qualitatively simpler to understand and easier to compute function. {X 1j l ' •• . X nj }. of distributions of statistics. . . . j=l • By the law of large numbers as B j.o(X1). In its simplest fonn.i.. Asymptotics in statistics is usually thought of as the study of the limiting behavior of statistics or.. Monte Carlo is described as follows.n (Xl. But suppose F is not Gaussian. . which we explore further in later chapters. I j {Tn (X" . X n + EF(Xd or p £F( . X n as n + 00. .fiit !Xi. . save for the possibility of a very unlikely event. VarF(X 1 ». It seems impossible to determine explicitly what happens to the power function because the distribution of fiX / S requires the joint distribution of (X. is to use the Monte Carlo method. in this context.. _. .4) i RB =B LI(F. we can approximate Rn(F) arbitrarily closely. for instance the sequence of means {Xn }n>l. or it refers to the sequence of their distributions 1 Xi.. The classical examples are. Xn)}n>l. observations Xl. .3. but for the time being let's stick to this case as we have until now.1. where X n = ~ :E~ of medians. fiB ~ R n (F).1..)2) } . Draw B independent "samples" of size n.
. PF[IX n _  . this reads (5. 1'1 > €] < 2exp {~n€'}.11) .l) (5.1.2 is unknown be. £ = . the much PFlIx n  Because IXd < 1 implies that . (5.6) and (5.9) is . Similarly.1. .9) when .1.3).l1) states that if EFIXd' < 00.. For instance. We interpret this as saying that.7) where 41 is the standard nonnal d.. say.Section 5. then P [vn:(X F n .omes 11m'. if more delicate Hoeffding bound (B. Further qualitative features of these bounds and relations to approximation (5.10) is . say. x.9) As a bound this is typically far too conservative.5) That is.1 Introduction: The Meaning and Uses of Asymptotics 299 In theory these limits say nothing about any particular Tn (Xl.. Is n > 100 enough or does it have to be n > 100. the central limit theorem tells us that if EFIXII < 00.Xn ) or £F(Tn (Xl. . 7). . by Chebychev's inequality. ' .f.6) to fall.1.14. the celebrated BerryEsseen bound (A...15.14. X n ) we consider are closely related as functions of n so that we expect the limit to approximate Tn(Xl~"· . the right sup PF x [r.. which are available in the classical situations of (5.10) < 1 with .1.' = 1 possible (Problem 5. For instance.2 hand side of (5. below .8) are given in Problem 5.' m (5.8) Again we are faced with the questions of how good the approximation is for given n.1. (see A. For € = .1.1. if EFIXd < 00. Xn is approximately equal to its expectation.01.25 whereas (5.1..1.. Thus. and P F • What we in principle prefer are bounds. X n ) but in practice we act as if they do because the T n (Xl.1.. 1') < z] ~ ij>(z) (5. the weak law of large numbers tells us that..1. As an approximation.01.9. for n sufficiently large.(Xn .6) does not tell us how large n has to be for the chance of the approximation ~ot holding to this degree (the lefthand side of (5..1') < x] _ ij>(x) < CEFIXd' v'~ 3 1/' " "n (5.4./' is as above and .1. X n )) (in an appropriate sense). then (5.1. . n = 400.VarF(Xd. The trouble is that for any specified degree of approximation.2 .' 1'1 > €] < . DOD? Similarly.1.1.6) for all £ > O.6) gives IX11 < 1. . (5. if EFXf < 00. (5.1.1.
In particular. The methods are then extended to vector functions of vector means and applied to establish asymptotic normality of the MLE 7j of the canonical parameter 17  j • i. good estimates On of parameters O(F) will behave like  Xn does in relation to Ji. I . d) = '(II' . i .7) says that the behavior of the distribution of X n is for large n governed (approximately) only by j. (5.t and 0"2 in a precise way. As we mentioned.e(F)]) (1(e. : i . The qualitative implications of results such as are very impor~ tant when we consider comparisons between competing procedures. which is reasonable. 1 i . • c F (y'n[Bn .11) suggests. even here they are not a very reliable guide. quite generally. Section 5.' I I • If the agreement is satisfactory we use the approximation even though the agreement for the true but unknown F generating the data may not be as good. (b) Their validity for the given n and Ttl for some plausible values of F is tested by numerical integration if possible or Monte Carlo computation. For instance.di) where '(0) = 0. Although giving us some idea of how much (5.1.2 and asymptotic normality via the delta method in Section 5.1. (5. consistency is proved for the estimates of canonical parameters in exponential families.12) where (T(O. F) typically is the standard deviation (SD) of J1iOn or an approximation to this SD.1. for any loss function of the form I(F. Bounds for the goodness of approximations have been available for X n and its distribution to a much greater extent than for nonlinear statistics such as the median.8) is typically much betler than (5.300 Asymptotic Approximations Chapter 5 where C is a universal constant known to be < 33/4. Consistency will be pursued in Section 5. If they are simple. "(0) > 0. Section 5. Note that this feature of simple asymptotic approximations using the normal distribution is not replaceable by Monte Carlo. as we have seen. It suggests that qualitatively the risk of X n as an estimate of Ji. although the actual djstribution depends on Pp in a complicated way.1.2 deals with consistency of various estimates including maximum likelihood.F) > N(O 1) . • model. and asymptotically normal.8) differs from the truth.1. The estimates B will be consistent.1. . We now turn to specifics. for all F in the n n . behaves like . The arguments apply to vectorvalued estimates of Euclidean parameters.3 begins with asymptotic computation of moments and asymptotic normality of functions of a scalar mean and include as an application asymptotic normality of the maximum likelihood estimate for oneparameter exponential families. B ~ O(F).(l) The approximation (5. .5) and quite generally that risk increases with (1 and decreases with n.' (0)( (1 / y'n)( v'27i') (Problem 5. Yet. asymptotic formulae suggest qualitative properties that may hold even if the approximation itself is not adequate.3. As we shall see. Asymptotics has another important function beyond suggesting numerical approximations for specific nand F.I '1.1.11) is again much too consctvative generally. Practically one proceeds as follows: (a) Asymptotic approximations are derived. (5. .
but we shall use results we need from A. But. . If Xl.2.Xn from Po where 0 E and want to estimate a real or vector q(O).. 1 X n are i. Most aSymptotic theory we consider leads to approximations that in the i.2) Bounds b(n.1. . and 8. > O. asymptotics are methods of approximating risks.) However. ' .1) where I . Finally in Section 5.d. A stronger requirement is (5.2. For P this large it is not unifonnly consistent.7.2.2 5. for . with all the caveats of Section 5.2. if. P where P is unknown but EplX11 < 00 then.1) and (B. (See Problem 5.2. The least we can ask of our estimate Qn(X I. where P is the empirical distribution.7.. remains central to all asymptotic theory.2) is called unifonn consistency.14. we talk of uniform cornistency over K.i. 1denotes Euclidean distance. . e Example 5.7 without further discussion.2. by quantities that can be so computed. . That is. .1 CONSISTENCY PlugIn Estimates and MlEs in Exponential Family Models Suppose that we have a sample Xl. in accordance with (A. If is replaced by a smaller set K.lS. is a consistent estimate of p(P).1). and other statistical quantities that are not realistically computable in closed form.4 deals with optimality results for likelihoodbased procedures in onedimensional parameter models. We will recall relevant definitions from that appendix as we need them. 00. 'in 1] q(8) for all 8.. A. (5. AlS.1. which is called consistency of qn and can be thought of as O'th order asymptotics.2. case become increasingly valid as the sample size increases. Asymptotic statements refer to the behavior of sequences of procedures as the sequence index tends to 00.2) are preferable and we shall indicate some of qualitative interest when we can. Monte Carlo.) forsuP8 P8 lI'in .5 we examine the asymptotic behavior of Bayes procedures. The simplest example of consistency is that of the mean.. Section 5.Section 5. The notation we shall use in the rest of this chapter conforms closely to that introduced in Sections A. Means.q(8)1 > 'I that yield (5.14.2.Xn ) is that as e n 8 E ~ e. distributions. and probability bounds. for all (5.i.2 Consistency 301 in exponential families among other results.2.. Summary. We also introduce Monte Carlo methods and discuss the interaction of asymptotics.d. . by the WLLN. In practice.14. 5.1). and B. The stronger statement (5. X ~ p(P) ~ ~  p = E(XJ) and p(P) = X.
2. Other moments of Xl can be consistently estimated in the same way. Suppose thnt P = S = {(Pl.2. (5. and (Xl. . i . = Proof. Let X 1.1. w( q. w(q.1) and the result follows. . l iA) E S be the empirical distribution. xd is the range of Xl' Let N i = L~ 11(Xi = Xj) and Pj _ Njln.• X n be the indicators of binomial trials with P[X I = I] = p.." is the following: I X n are LLd. consider the plugin estimate p(Iji)ln of the variance of p.2. 0 < p < 1.3) that > <} (5.I l 302 instance.2. the kdimensional simplex. . where q(p) = p(1p). q(P) is consistent. 6) It easily follows (Problem 5..2. Letw. By the weak law of large numbers for all p. Then N = LXi has a B(n. it is uniformly continuous on S. Binomial Variance..4) I i (5. .l : [a.. Thus. w. Theorem 5. Ip' pi < 6«). · .6. with Xi E X 1 .p) distribution. Suppose that q : S + RP is continuous.b) say.2. P = {P : EpX'f < M < oo}. 0 In fact. Asymptotic Approximations Chapter 5 X is uniformly consistent over P because o Example 5. w(q. . for all PEP.6) = sup{lq(p)  q(p')I: Ip . then by Chebyshev's inequality. sup{ Pp[liln pi > 61 : pES} < kl4n6 2 (Problem 5.pi > 6«)] But.2. 0 ~ To some extent the plugin method was justified by consistency considerations and it is not suprising that consistency holds quite generally for frequency plugin estimates. Pp [IPn  pi > 6] ~ . in this case.·) is increasing in 0 and has the range [a. there exists 6 «) > 0 such that p. implies Iq(p')q(p) I < <Then Pp[li1n .3) Evidently. Suppose the modulus ofcontinuity of q.2. 6 >0 O.14. Because q is continuous and S is compact. for every < > 0. 0) is defined by ..p'l < 6}.1 < j < k.l «) = inf{6 : w(q.. which is ~ q(ji).b] < R+bedefinedastheinver~eofw.. Evidently. where Pi = PIXI = xi]' 1 < j < k. .. Then qn q(Pn) is a unifonnly consistent estimate of q(p). and p = X = N In is a uniformly consistent estimate of p. we can go further.6)! Oas6! O. But further.5) I A simple and important result for the case in which X!. by A. p' E S.LJ~IPJ = I}. Pn = (iiI. If q is continuous w(q.ql > <] < Pp[IPn .Pk) : 0 < Pi < 1..
Vi). Suppose P is a canonical exponentialfamily of rank d generated by T.) > 0.Jor all 0.) I < that h(D n) Foreonsisteney of h(g) apply Proposition B. af > o. (i) Plj [The MLE Ti exists] ~ 1.1 .6) if Ep[g(X.UV) so that E~ I g(Ui . . 11. ' .4. Let g(u. ai.v} = (u.6. .).Section 5. N2(JLI.1.1 that the empirical means..) > 0. 1 < i < n be i. Let TI.. ag. . Jl2.1.i. is a consistent estimate of q(O).V2. Then. Suppose [ is open. !:. Variances and Correlations.2.. If we let 8 = (JLI. thus.ar. 'Vi) is the statistic generating this 5parameter exponential family. Then. 0 Here is a general consequence of Proposition 5. EVj2 < 00.3. Let Xi = (Ui . then Ifh=m.p).1 and Theorem 2.2. Let g (91.a~. then vIP) h(g). We need only apply the general weak law of large numbers (for vectors) to conclude that (5.2 Consistency 303 Suppose Proposition 5.2. .7. is consistent for v(P). let mj(O) '" E09j(X. variances. conclude by Proposition 5. )1 < oo}.2. Var(U. If Xl. More generally if v(P) = h(E p g(X .V.1: D n 00.2. if h is continuous.)1 < 1 } tions such that EUr < 00.. 1 are discussed in Problem 5. (ii) ij is consistent. where h: Y ~ RP.lpl < 1.JL2. Theorem 5. U implies !'. [ and A(·) correspond to P as in Section 1..2. 1 < j < d. and correlation coefficient are all consistent. 1 < j < d. and let q(O) = h(m(O)).p).9d) map X onto Y C R d Eolgj(X')1 < 00. Questions of uniform consistency and consistency when P = { DistribuCorr(Uj.2. Var(V. X n are a sample from PrJ E P. where P is the empirical distribution. h(D) for all continuous h. o Example 5.3.d. We may.2. )) and P = {P : Eplg(X . = Proof.then which is well defined and continuous at all points of the range of m.U 2.
the MLE. exists iff the event in (5.3. i i • • ! .2. But Tj. We showed in Theorem 2.1I 0 ) foreverye I . I I Proof Recall from Corollary 2. D . X n be Li. for example.3. • .1. . Note that.j. II) 0 j =EII.2. vectors evidently used exponential family properties. By a classical result. T(Xdl < 6} C CT' By the law of large numbers. i I . i=l 1 n (5.. '1 P1)'[n LT(Xi ) E C T) ~ 1. 5.D(1I0 .d. II) = where.8) I > O.7) occurs and (i) follows. D( 110 . BEe c Rd. Let 0 be a minimum contrast estimate that minimizes I .. if 1)0 is true.lI o): 111110 1 > e} > D(1I0 .Xn ) exists iff ~ L~ 1 T(X i ) = Tn belongs to the interior CT of the convex support of the distribution of Tn. I ! .. (5.. . I ..3.2 Consistency of Minimum Contrast Estimates . z=l 1 n i. and the result follows from Proposition 5. n LT(Xi ) i=I 21 En.2.. j 1 Pn(X. Suppose 1 n PII sup{l. . 7 I . where to = A(1)o) = E1)o T(X 1 ). Rudin (1987).9) n and Then (} is consistent.1 that On CT the map 17 A(1]) is II and continuous on E. as usual. The argument of the the previous subsection in which a minimum contrast estimate.2. I J Theorem 5.1 to Theorem 2..= 1 . E1).2.1 that i)(Xj. .1I)11: II E 6} ~'O (5. = 1 n p.304 Asymptotic Approximations Chapter 5 . n . is solved by 110. {t: It .p(X" II) is uniquely minimized at 11 i=l for all 110 E 6. .3. 0 Hence.1 belong to the interior of the convex support hecause the equation A(1)) = to. see.d.. is a continuous function of a mean of Li.L[p(Xi .E1).7) I .2.LT(Xi ). T(Xl). which solves 1 n A(1)) = .. By definition of the interior of the convex support there exists a ball 8. . the inverse AI: A(e) ~ e is continuous on 8.  inf{D(II.2... Po.3. . (T(Xl)) must hy Theorem 2. 1 i . Let Xl. A more general argument is given in the following simple theorem whose conditions are hard to check. . II) L p(X n..lI) . .
PIJ. o Then (5.IJO)) ' IIJ IJol > <} n i=l . IJ) .inf{D(IJo.L[p(Xi .2. IJ)I : IJ K } PIJ !l o. IJ j )) : 1 <j < d} > <] < d maxi PIJ.13) By Shannon's Lemma 2."'[p(X i .2.p(X" IJo)) : IJ E g ~(P(X" K'} > 0] ~ 1.[IJ ¥ OJ] = PIJ. IJj))1 > <I : 1 < j < d} ~ O.IJd ).L(p(Xi.11) implies that 1 n ~ 0 (5.2. IJ).2. IIJ .2. ifIJ is the MLE.11) implies that the righthand side of (5. Ie ¥ OJ] ~ Ojoroll j.D(IJo. IJ)II ' IJ n i=l E e} > .9) hold for p(x. and (5. 0) . is finite. But for ( > 0 let o ~ ~ inf{D(IJ. An alternative condition that is readily seen to work more widely is the replacement of (5.8) and (5. IJ) .12) which has probability tending to 0 by (5.14) (ii) For some compact K PIJ. IJo)) .8) follows from the WLLN and PIJ.. IJ) n ~ i=l p(X"IJ o)] .D( IJ o.2.2.2 Consistency 305 proof Note that.10) tends to O.IJol > <} < 0] (5.1.2. 2 0 (5. IJ) .2. EIJollogp(XI. IJj ) .D(IJo. PIJ. 1 n PIJo[inf{.2.2. _ I) 1 n PIJ o [16 .10) By hypothesis.IJor > <} < 0] because the event in (5.8) by (i) For ail compact K sup In { c e.8) can often failsee Problem 5.IJ) p(Xi. E n ~ [p(Xi .2.2. then.1 we need only check that (5.2.8).IIIJ  IJol > c] (5.[max{[ ~ L:~ 1 (p(Xi .Section 5.2.11) sup{l.2.2.D(IJ o.D(IJo. . IJ j ) . 0 Condition (5.IJol > E] < PIJ [inf{ . . IIJ . A simple and important special case is given by the following. (5. (5.9) follows from Shannon's lemma. [I ~ L:~ 1 (p(Xi .2.5.IJ o)  D(IJo. Coronary 5.IJ)1 < 00 and the parameterization is identifiable. for all 0 > 0.lJ o): IIJ IJol > oj. IJ) = logp(x. /fe e =   Proof Note that for some < > 0. But because e is finite. [inf c e. {IJ" ..
We then sketch the extension to functions of vector means. ~ 5. .3.1) where • . let Ilglioo = sup{ Ig( t) I : t E R) denote the sup nonn.33. . Unifonn consistency for l' requires more. . Let h : R ~ R.3..d.. in Section 5.AND HIGHERORDER ASYMPTOTlCS: THE DelTA METHOD WITH APPLICATIONS We have argued.3.B(P)I > Ej : PEP} t 0 for all € > O. > 2.1. that sup{PIiBn . If (i) and (ii) hold.14) is in general difficult. and assume (i) (a) h is m times differentiable on R.) = <7 2 Eh(X) We have the following. ! ! (ii) EIXd~ < 00 Let E(X1 ) = Il. VariX. Unfortunately checking conditions such as (5. As usual let Xl. If fin is an estimate of B(P).Xn be i. We conclude by studying consistency of the MLE and more generally Me estimates in the case e finite and e Euclidean. We show how consistency holds for continuous functions of vector means as a consequence of the law of large numbers and derives consistency of the MLE in canonical multiparameter exponential families. we require that On ~ B(P) as n ~ 00. =suPx 1k<~)(x)1 <M < .306 Asymptotic Approximations Chapter 5 We shall see examples in which this modification works in the problems.1 that the principal use of asymptotics is to provide quantitatively or qualitatively useful approximations to risk. sequence of estimates) consistency. We introduce the minimal property we require of any estimate (strictly speak ing. Sufficient conditions are explored in the problems. A general approach due to Wald and a similar approach for consistency of generalized estimating equation solutions are left to the problems. m Mi) and assume (b) IIh(~)lloo j 1 .2. + Rm (5. 5.i. When the observations are independent but not identically distributed. Summary..8) and (5.3 FIRST. D .1 The Delta Method for Moments • • We begin this section by deriving approximations to moments of smooth functions of scalar means and even provide crude bounds on the remainders. J.Il E(X Il). X valued and for the moment take X = R.2. then = h(ll) + L (j)( ) j=l h .3. . ~l Theorem 5. consistency of the MLE may fail if the number of parameters tends to infinity. see Problem 5. I ~. We denote the jth derivative of h by 00 .
. (n .3.. Moreover. so the number d of nonzero tenns in (a) is b/2] (c) l::n _ r.'" . .5.3) (5.3... for j < n/2.. 'Zr J . ik > 2....t" = t1 .[jj2] + 1) where Cj = 1 <r. tI.4) for all j and (5.. Let I' = E(X.. and the following lemma.4) Note that for j eveo. . (b) a unless each integer that appears among {i I . bounded by (d) < t.3 First.3. t r 1 and [t] denotes the greatest integer n.'1 +. .3) and j odd is given in Problem 5.and HigherOrder Asymptotics: The Delta Method with Applications 307 The proof is an immediate consequence of Taylor's expansion. then (a) But E(Xi .3..) ~ 0.. +'r=j J .. The expression in (c) is. We give the proof of (5. In!. . . rl l:: . .3.1 < k < r}}.3.3. .1) .. . 21. C [~]! n(n . j > 2.Section 5.5[jJ2J max {l:: { . The more difficult argument needed for (5.3) for j even. .2) where IX' that 1'1 < IX . Lemma 5.ij } • sup IE(Xi . .2. (5..1'1..i j appears at by Problem 5. Xij)1 = Elxdj il.. . :i1 + . X ij ) least twice. . then there are constants C j > 0 and D j > 0 such (5.3. If EjX Ilj < 00. EIX I'li = E(X _1')' Proof. tr i/o>2 all k where tl . . + i r = j. _ .1. .3.
2 ).l) 6 + 2h(J. 1 iln .3. then I • I. respectively.2.4).3. then h(')( ) 2 0 = h(J.6) can be 1 Eh 2 (X) = h'(J.1 with m = 4.5) apply Theorem 5.J.l)h(J. (d).l) + [h(1)]2(J. Then Rm = G(n 2) and also E(X .3. if 1 1 <j (a) Ilh(J)II= < 00.l) + {h(21(J.l)}E(X .. I .3. (a) ifEIXd3 < 00 and Ilh( 31 11 oo < Eh(X) 00.3. < 3 and EIXd 3 < 00.l)E(X .l) + [h(llj'(1')}':.1. Proof (a) Write 00. i • r' Var h(X) = 02[h(11(J.1') + {h(2)(J. I .3.l)h(1)(J.llJ' + G(n3f2) (5.l) and its variance and MSE.3f2 ) in (5. (5..3.3) for j even.1')' + 1 E[h'](3)(X')(X _ 1')3 = h2 (J.3. I (b) Ifllh(j11l= < 00.6. By Problem 5.3.6) n .3. 00 and 11hC 41 11= < then G(n. .5) can be replaced by Proof For (5. if I' = O.+ O(n.1')3 = G(n 2) by (5. Because E(X . then G(n.l) + 00 2n I' + G(n3f2).3.1. In general by considering Xi . I 1 .ll j < 2j EIXd j and the lemma follows. n (b) Next. • . give approximations to the bias of h(X) as an estimate of h(J.1'11. (n li/2] + 1) < nIJf'jj and (c)..3.3.4) for j odd and (5.1.5) follows.3.JL as our basic variables we obtain the lemma but with EIXd j replaced by EIX 1 . I I I Corollary 5.3. If the conditions of (b) hold.1')2 = 0 2 In.l)h(J.1 with m = 3. 1 < j < 3. and EXt < replaced by G(n'). ..3f2 ). .308 But (e) Asymptotic Approximations Chapter 5 1 I j i njn(n 1) . using Corollary 5. Corollary 5..3. 0 1 I .3f') in (5. apply Theorem 5. EIX1 .. (5. 0 The two most important corollaries of Theorem 5. and (e) applied to (a) imply (5.5) (b) if E(Xt) < G(n.
h(/1)) h(2~(II):: + O(n.d.7) If h(t) ~ 1 . when hit) = t(1 . Here Jlk denotes the kth central moment of Xi and we have used the facts that (see Problem 5. {h(l) (/1)h(ZI U')/13 (5. 0 Clearly the statements of the corollaries as well can be turned to expansions as in Theorem 5.3. by Corollary 5..3..l / Z) unless h(l)(/1) = O. then we may be interested in the warranty failure probability (5. We will compare E(h(X)) and Var h(X) with their approximations. which is G(n. Bias and Variance of the MLE of the Binomial VanOance. . Thus.3.h(/1) is G(n.3.Z ".3. T!Jus.3. for large n.1 If the Xi represent the lifetimes of independent pieces of equipment in hundreds of hours and the warranty replacement period is (say) 200 hours.Z) (5. Example 5.11) Example 5. the MLE of ' is X . Note an important qualitative feature revealed by these approximations.l ).2 ) 2e.3./1)3 = ~~.= E"X I ~ 11 A.1.3(1_ ')In + G(n.3. (5. We can use the two coronaries to compute asymptotic approximations to the means and variance of heX). (1Z 5.2. X n are i. .3.. If heX) is viewed.9) o Further expansion can be done to increase precision of the approximation to Var heX) for large n..3. To get part (b) we need to expand Eh 2 (X) to four terms and similarly apply the appropriate form of (5. and.2 (Problem (5.4 ) exp( 2/t).3.t.3.Z. by expanding Eh 2 (X) and Eh(X) to six terms we obtain the approximation Var(h(X)) = ~[h(I)U')]'(1Z +.i.(h(X)) E.5). which is neglible compared to the standard deviation of h(X).t) and X.(h(X) . where /1.exp( 2It).3.i.exp( 2') c(') = h(/1). If X [.3. by Corollary 5.1 with bounds on the remainders. Bias.and HigherOrder Asymptotics: The Delta Method with Applications 309 Subtracting Cal from (b) we get C5.3.4) E(X .6). as we nonnally would. then heX) is the MLE of 1 .3.  .1) = 11.1.Section 5. A qualitatively simple explanation of this important phenonemon will be given in Theorem 5.8) because h(ZI (t) = 4(r· 3 . as the plugin estimate of the parameter h(Jl) then.3 First.10) + [h<ZI (/1)J'(14} + R~ ! with R~ tending to zero at the rate 1/n3 . the bias of h(X) defined by Eh(X) .
E(X') ~ p .2p(1 .. 1 < j < { Xl . .p) n . . the error of approximation is 1 1 + .2(1 . 1 and in this case (5. M2) ~ 2.2p)')} + R~.P )} (n_l)2 n nl n Because 1'3 = p(l.p) = p ( l . D 1 i I .3. n n Because MII(t) = I . Let h : R d t R. asSume that h has continuous partial derivatives of order up to m. (5. First calculate Eh(X) = E(X) .p).p) + .[Var(X) + (E(X»2] I nI ~ p(1 .p)(I.' • " The generalization of this approach to approximation of moments for functions of vector means is fonnally the same but computationally not much used for d larger than 2. .6p(l.3.10) yields . ~ Varh(X) = (1.2p) n n +21"(1 . I .p)] n I " .pl..p) .p(1 . Next compute Varh(X)=p(I.e x ) : i l + .p)J n p(1 ~ p) [I . " ! ..amh i.p) {(I _ 2p)2 n Thus. a Xd d} . 0 < i. ..!c[2p(1 n p) .2p)p(l.9d(Xi )f. and will illustrate how accurate (5.2.3.2p).3.2t.10) is in a situation in which the approximation can be checked. • ~ O(n.p(1 . Suppose g : X ~ R d and let Y i = g(Xi ) = (91 (Xi).2p)2p(1 . .. and that (i) l IIDm(h)ll= < 00 where Dmh(x) is the array (tensor) a i.2p)' .' .{2(1.310 Asymptotic Approximations Chapter 5 B(l. R'n p(1 ~ p) [(I . + id = m. (5.5) yields E(h(X))  ~ p(1 .j .p) .3. Theorem 5. II'. D " ~ I . < m.p)(I.5) is exact as it should be. .p)'} + R~ p(l ..P) {(1_2 P)2+ 2P(I.3 ). • .
and HigherOrder Asymptotics: The Delta Method with Applications 311 (ii) EIY'Jlm < 00.I' < 00..8.) Var(Y.). The most interesting application. The results in the next subsection go much further and "explain" the fonn of the approximations we already have.3. (5.) + a.3. Theorem 5.3.3. Similarly. and (5. and the appropriate generalization of Lemma 5. (J1.3.11.12) Var h(Y) ~ [(:. for d = 2.))) ~ N(O. Lemma 5. Suppose {Un} are real random variables and tfult/or a sequence {an} constants with an + 00 as n + 00.4.3. 0/ The result follows from the more generally usefullernma. Y 12 ) 3/2). The proof is outlined in Problem 5.)} + O(n (5. ijYk = ~ I:~ Yik> Y = ~ I:~ I Vi.2 ).2 The Delta Method for In law Approximations = As usual we begin with d !.5). under appropriate conditions (Problem 5.3. (J1.13) Moreover. B.3).6).". then O(n.3. if EIY..3 / 2 ) in (5. = EY 1 .1 Then.3. as for the case d = 1.(J1.3.3. (7'(h)) where and (7' = VariX.) + ~ {~~:~ (J1. We get. (5. + ~ ~:l (J1.).C( v'n(h(X) .3.15) Then . EI Yl13 < 00 Eh(Y) h(J1. then Eh(Y) This is a consequence of Taylor's expansion in d variables.3.) + 2 g:'.h(!.3.»)'var(Y'2)] +O(n') Approximations (5.) gx~ (J1.14) do not help us to approximate risks for loss functions other than quadratic (or some power of (d . is to m = 3.~E(X.~ (J1. E xl < 00 and h is differentiable at (5.Section 5.3 First.) Var(Y.'. 5.) )' Var(Y..12) can be replaced by O(n.hUll)~. and J. .2.. 1.13).14) + (g:. Y12 ) (5... 1 < j < d where Y iJ I = 9j(Xi ).1.x.) Cov(Y".3.3. Suppose thot X = R. h : R ~ R.) Cov(Y". by (5.
. 0 .N(O. (e) Using (e). V for some constant u. . j " But. _ F 7 . Thus.312 Asymptotic Approximations Chapter 5 (i) an(Un . Therefore.i. ~: . for every (e) > O.. By definition of the derivative. Formally we expect that if Vn !:.ul < Ii '* Ig(v)  g(u) .3. V and the result follows. (ii) 9 : R Then > R is d(fferentiable at u with derivative 9(1) (u). V. ~ EVj (although this need not be true. from (a). we expect (5. see Problems 5. The theorem follows from the central limit theorem letting Un X. for every € > 0 there exists a 8 > 0 such that (a) Iv . PIIUn €  ul < iii ~ 1 • "'. I .15) "explains" Lemma 5. ·i Note that (5. by hypothesis.u)!:. . an = n l / 2. u = /}" j \ V . an(Un .1.3.N(O. then EV. j .u) !:.8). V . for every Ii > 0 (d) j.3.32 and B. (5.17) . I (g) I ~: .16) Proof.3. ( 2 ). = I 0 . hence.7. 'j But (e) implies (f) from (b). • • .3. Consider 1 Vn = v'n(X !") !:.a 2 ).g(ll(U)(v  u)1 < <Iv .• ' " and. . .ul Note that (i) (b) '* .
3. where g(u. Yn2 be two independent samples with 1'1 = E(X I ). A statistic for testing the hypothesis H : 11. i=l If:F = {Gaussian distributions}. 1'2 = E(Y.'" . In general we claim that if F E F and H is true. EZJ > 0. FE :F where EF(X I ) = '". j even = o(nJ/'). al = Var(X.l (1.9. (a) The OneSample Case.18) In particular this implies not only that t n.TiS X where S 2 = 1 nl L (X.2.) and a~ = Var(Y. (b) The TwoSample Case.3.3 we saw that the two sample t statistic Sn vn1n2 (Y X) ' n = = n s nl + n2 has a Tn2 distribution under H when the X's and Y's are normal withar = a~.28) that if nI/n _ A. 1 2 a P by Theorem 5.d.+A)al + Aa~) .X) n 2 . and the foregoing arguments. then S t:. Then (5.> 0 . and S2 _ n nl ( ~ X) 2) n~(X./2 ). 1).3.N(O.j odd. But if j is even.a) for Tn from the Tn . 1).'" 1 X n1 and Y1 . For the proof note that by the central limit theorem.)..= 0 versuS K : 11.X" be i. we can obtain the critical value t n. Now Slutsky's theorem yields (5. £ (5. Let Xl. N n (0 Aal. sn!a). then Tn .s Tn =.i.3. "t" Statistics.1 (1 . else EZJ ~ = O.17) yields O(n J. ••• . Using the central limit theorem.a) critical value (or Zla) is approximately correct if H is true and F is not Gaussian.Section 5.).3 First.O < A < 1. VarF(Xd = a 2 < 00.3.2 and Slutsky's theorem. v) = u!v.1 (Ia) + Zla butthat the t n. (1 . we find (Problem 5. Let Xl. Consider testing H: /11 = j. In Example 4. (1  A)a~ ...3. . Example 5. Slutsky's theorem.18) because Tn = Un!(sn!a) = g(Un .t2 versus K : 112 > 111.and HigherOrder Asymptotics: The Delta Method with Applications 313 where Z ~ N(O.l distribution.
i 2(1 approximately correct if H is true and the X's and Y's are not normal. where d is either 2. Chisquare data I 0..02 .5 . then the critical value t.5 1 1. oL. ur . The X~ distribution is extremely skew. 10. Y .. We illustrate such simulations for the preceding t tests by generating data from the X~ distribution M times independently.95) approximation is only good for n > 10 2 ....1 (0. 20. .. The simulations are repeated for different sample sizes and the observed significance levels are plotted. the asymptotic result gives a good approximation when n > 10 1. each lime computing the value of the t statistics and then giving the proportion of times out of M that the t statistics exceed the critical values from the t table.1.314 Asymptotic Approximations Chapter 5 It follows that if 111 = 11. 316.c:'::c~:____:_J 0.. " . as indicated in the plot. . Figure 5. .I = u~ and nl = n2.1 shows that for the onesample t test... Here we use the XJ distribution because for small to moderate d it is quite different from the normal distribution. and the true distribution F is X~ with d > 10.. Each plotted point represents the results of 10.3.1. i I . One sample: 10000 Simulations. . 32.'. approximations based on asymptotic results should be checked by Monte Carlo simulations. or 50. 0) for Sn is Monte Carlo Simulation As mentioned in Section 5. Figure 5.... Other distributions should also be tried.5 3 Log10 sample size j i Figure 5..5 2 2. the For the twosample t tests.05.. ...3.3.2 shows that when t n _2(10:) critical value is a very good approximation even for small n and for X.5 .000 onesample t tests using X~ data. and in this case the t n . X~. when 0: = 0.2 or of a~.
2.(})) do not have approximate level 0.. and Yi . However. in the onesample situation. when both nl I=. or 50.02 d'::'. in this case.12 0.95) approximation is good for nl > 100. the t n 2(1 . ChiSquare Dala. as we see from the limiting law of Sn and Figure 5.a~ show that as long as nl = n2._' 0.5 2 2.5 3 log10 sample size Figure 5. Y .' 0. D Next.X = ~.) i O. 2:7 ar Two sample.o'[h(l)(JLJf). and the data are X~ where d is one of 2.Section 5_3 First.0:) approximation is good when nl I=. In this case Monte Carlo studies have shown that the test in Section 4. _ y'ii:[h(X) .3. }. For each simulation the two samples are the same size (the size indicated on the xaxis). Each plotted point represents the results of 10. then the twosample t tests with critical region 1 {S'n > t n .ll)(!. 1 (ri .n2 and at I=.L) where h is con tinuously differentiable at !'.Xd.4 based on Welch's approximation works well.3. Moreover. By Theoretn 5.and HigherOrder Asymptotics: The Delta Method with Applications 315 1 This is because.3. even when the X's and Y's have different X~ distributions. Other Monte Carlo runs (not shown) with I=.3.Xi have a symmetric distribution. 10.h(JL)] ".. and a~ = 12af.000 twosample t tests. scaled to have the same means.5 1 1. N(O.2 (1 . 10000 Simulations.0' H<I o 0.9.a~. let h(X) be an estimate of h(J.n2 and at = a'). y'ii:[h(X) .cCc":'::~___:. Equal Variances 0. To test the hypothesis H : h(JL) = ho versus K : h(JL) > ho the natural test statistic is T. af = a~.hoi n s[h(l)(X)1 . the t n _2(O.3.
The data in the first sample are N(O.. which are indexed by one or more parameters.3 and Slutsky's theorem. Poisson.3.5 Log10 (smaller sample size) l Figure 5.12 . " 0.oj:::  6K . Variance Stabilizing Transfonnations Example 5.{x)() twosample t tests. +2 + 25 J O'.0. o. Each plotted point represents the results of IO.3..02""~". I: . In Appendices A and B we encounter several important families of distributions. £l __ __ 0. 2nd sample 2x bigger 0.I _ __ 0 I +__ I :::.316 Asymptotic Approximations Chapter 5 Two Sample. From (5. For each simulation the two samples differ in size: The second sample is two times the size of the first. if H is true .3. +___ 9.4.5 1 1. then the sample mean X will be approximately normally distributed with variance 0"2 In depending on the parameters indexing the family considered. The xaxis denotes the size of the smaller of the two samples. 1) so that ZI_Q is the asymptotic critical value. such as the binomial. Tn ...3. Combining Theorem 5. . Unequal Variances. We have seen that smooth transformations heX) are also approximately normally distributed.J~ f).~'ri i _ ... It turns out to be useful to know transformations h.' OIlS . '! 1 . .6. N(O. gamma. and beta.3.. we see that here.3.". 0. such that Var heX) is approximately independent of the parameters indexing the family we are considering. 1) and in the second they are N(Ola 2) where a 2 takes on the values 1.10000 Simulations: Gaussian Data.£.6) and . If we take a sample from a member of one of these families. and 9.. as indicated in the plot.c:~c:_~___:J 0. called variance stabilizing. too.
. which varies freely. Major examples are the generalized linear models of Section 6. A > 0.3.15 and 5. .. h(t) = Ii is a variance stabilizing transformation of X for the Poisson family of distributions. sample. a variance stabilizing transformation h is such that Vi'(h('Y) . The notion of such transformations can be extended to the following situation. . . .13) we see that a first approximation to the variance of h( X) is a' [h(l) (/. is to exhibit monotone functions of parameters of interest for which we can give fixed length (independent of the data) confidence intervals. which has as its solution h(A) = 2. .3. Thus.i.3.6.3.1 0 ) Vi' ' is an approximate 1. by their definition..3. yX± r5 2z(1 . in the preceding P( A) case. The comparative roles of variance stabilizing and canonical transformations as link functions are discussed in Volume II. Thus. Such a function can usually be found if (J depends only on fJ. To have Varh(X) approximately constant in A.Xn are an i. If we require that h is increasing.hb)) + N(o.' . this leads to h(l)(A) = VC/J>.Ct confidence interval for J>... See Problems 5. Some further examples of variance stabilizing transformations are given in the problems. X n) is an estimate of a real parameter! indexing a family of distributions from which Xl. 538) one can improve on . h must satisfy the differential equation [h(ll(A)j2A = C > 0 for some arbitrary c > O. p.3. 1976. suppose that Xl.)] 2/ n . Thus. Suppose. See Example 5. As an example.16.6) we find Var(X)' ~ 1/4n and Vi'((X) . Edgeworth Approximations The normal approximation to the distribution of X utilizes only the first two moments of X. 1n (X I. I X n is a sample from a P(A) family..5. In this case a'2 = A and Var(X) = A/n. Also closely related but different are socaBed normalizing transformations. where d is arbitrary.(A)') has approximately aN(O.and HigherOrder Asymptotics: The Delta Method with Applications 317 (5.Section 5.19) is an ordinary differential equation. c) (5. 1/4) distribution.d..\ + d. Suppose further that Then again. In this case (5.. A second application occurs for models where the families of distribution for which variance stabilizing transformations exist are used as building blocks of larger models. Substituting in (5.3.19) for all '/. Under general conditions (Bhattacharya and Rao.3 First.)C. 0 One application of variance stabilizing transformations.. finding a variance stabilizing transfonnation is equivalent to finding a function h such that for all Jl and (J appropriate to our family.
5999 0.1964 1.9999 1.1000 0.3x) + _(xS .0250 0.1.0005 0 0.9724 0.0100 0.15 0.34 2. xro I • I I .40 0.8008 0.0284 1.0481 1.7792 5. Hs(x) = xS  IOx 3 + 15x. We can use Problem B. P(Tn < x).9900 0.3. ~ 2.0050 0..0655 O.IOx 3 + 15x)] I I y'n(X n 9n ! I i !. 0.9943 0. Therefore. Example 5.3.1.3.85 0 0.3. i = 1. 1).77 0.6999 0.66 0. TABLE 5.72 .9996 1.9876 0.0105 0.79 0.ססOO I .04 1.9750 0.95 :' 1. . • " .0254 0. Suppose V rv X~.1254 0. 0.9684 0. we need only compute lIn and 1'2n.9995 0.4 to compute Xl.ססoo 0.40 0.15 0. According to Theorem 8. I' 0.3. To improve on this approximation. " I. Table 5./2 2 . Let F'1 denote the distribution of Tn = vn( X .4415 0.JL) / a and let lIn and 1'211 denote the coefficient of skewness and kurtosis of Tn.1051 0.3513 0.75 0.0500 ! ! .9984 0.9990 0.5000 0.9097 o.0553 0. (5.. H 3 (x) = x3  3x.0010 ~: EA NA x Exact . 0 x Exacl 2.9950 0.0032 ·1. Exact to' EA NA {.9997 4. ii.38 0. Then under some conditionsY) where Tn tends to zero at a rate faster than lin and H 2 • H 3 • and H s are Hermite polynomials defined by H 2 (x) ~ x 2  I.86 0. V has the same distribution as E~ 1 where the Xi are independent and Xi N(O..n.61 0. Edgeworth Approximations to the X 2 Distribution.6548 4.20) is called the Edgeworth expansion for Fn .38 0.1 gives this approximation together with the exact distribution and the nonnal approximation when n = 10.!.0287 0.0397 0.6000 0.51 1. .3000 0. where Tn is a standardized random variable.5.21) The expansion (5.n)' _ 3 (2n)2 = 12 n 1 I .4000 0.n)1 V2ri has approximately aN(O.8000 0. "J 1 '~ • 'YIn = E(Vn? (2n)1 E(V .7000 0.ססoo 1.5421 0.0001 0 0.<p(x) ' .' •• .9905 0.2024 EA NA x 0. Edgeworth(2) and nonna! approximations EA and NA to the X~o distribution.3.3006 0.9029 0.9999 1.0877 0.95 3.1) + 2 (x 3 .2706 0.ססoo 0.318 Asymptotic Approximations Chapter 5 the normal approximation by utilizing the third and fourth moments.4999 0. .9500 'll .I .4000 0." • [3 n .. It follows from the central limit theorem that Tn = (2::7 I Xi n)/V2ri = (V ..35 0. Fn(x) ~ <!>(x) .2000 0..0208 ~1.II 0.2.91 0. 4 0. 1) distribution.9506 0.
Let p2 = Cov 2(X. with  r?) is asymptotically nonnal.3 and (B.and HigherOrder Asymptotics: The Delta Method with Applications 319 The Multivariate Case Lemma 5. Let central limit theorem T'f. Ui/U~U3.0 >'1.0. a~ = Var Y.J 2 2 + 4 2 4 2 Tn P 120 + P T 02 +2{ _2 p3Al.2.0. We can write r 2 = g(C. . we can show (Problem 5.3.2.1. Using the central limit and Slutsky's theorems. aiD. U3) = Ui/U2U3. and Yj = (Y .9) that vn(C .0 2 (5.J11)/a. The proof follows from the arguments of the proof of Lemma 5.n1EY/) ° ~ ai ~ and u = (p. _p2. where g(UI > U2. 1). 0< Ey4 < 00. Because of the location and scale invariance of p and r.1.2.3. Y)/(T~ai where = Var X. .3 First. Then Proof._p2).u) ~ N(O.ar. It follows from Lemma 5./U2U3.2 + p'A2.a~) : n 3 !' R.0 TO . we can .I.. Let (X" Y. >'1. (X n . = Var(Xkyj) and Ak. E).0 .J12) / a2 to conclude that without use the transformations Xi j loss of generality we may assume J11 = 1k2 = 0.u) where Un = (n1EXiYi.1) jointly have the same asymptotic distribution as vn(U n .d.6) that vn(r 2 N(O.9.1. Example 5. Yn ) be i. = ai = 1. 1.ij~ where ar Recall from Section 4. Y) where 0 < EX 4 < 00.5 that in the bivariate normal case the sample correlation coefficient r is the MLE of the population correlation coefficient p and that the likelihood ratio test of H : p = is based on Irl.2 >'1.n lEX. then by the 2 vn(U . (i) un(U n .j.3.2.i.3.p). xmyl). P = E(XY).1..3.m. .0.0.u) ~~ V dx1 forsome d xl vector ofconstants u.3. as (X.1.).6. Lemma 5.2.Section 5. . (ii) g.2.2 extends to the dvariate case. E Next we compute = T11 .= (Xi . Suppose {Un} are ddimensional random vectors and that for some sequence of constants {an} with an !' (Xl as n Jo (Xl.3.2 T20 .0.0 >'1.3. UillL2U~) = (2p.2. vn(i7f .2rJ >".5. and let r 2 = C2 /(j3<.. >'2. R d !' RP has a differential g~~d(U) at u.l = Cov(XkyJ. 4f..0.22) 2 g(l)(u) = (2U.o}.1) and vn(i7~ .2 >'2.
: ! + h(i)(rn)(y .3.Y) ~ N(/li. . I Y n are independent identically distributed d vectors with ElY Ii 2 < 00. j f· ~. . .24) I Proof.1) is achieved hy choosing h(P)=!log(1+ P ).h(p)]). it gives approximations to the power of these tests.a)% confidence interval of fixed length.3[h(c) .3. and it provides the approximate 100(1 .9) Refening to (5. EY 1 = rn.. ': I = Ilgki(rn)11 pxd.UI. 1938) that £(vn . which is called Fisher's z. Suppose Y 1.fii(r .mil £ ~ hey) = h(rn) and (b) so that . we see (Problem 5.23) • . N(O. c E (1. . .5 (a) ! . . has been studied extensively and it has been shown (e.3. (5. . ! f This expression provides approximations to the critical value of tests of H : p = O.320 When (X. p=tanh{h(r)±z (1.P The approximation based on this transfonnation.3. o Here is an extension of Theorem 5. .1). (5. . (1 _ p2)2).8. David. . N(O.3. then u5 . .rn) + o(ly . per < c) '" <I>(vn . Argue as before using B.fii[h(r) . Asymptotic Approximations Chapter 5 = 4p2(1_ p2)'. .3.u 2.p). that is.3(h(r) .h(p)J !:.~a)/vn3} where tanh is the hyperbolic tangent.19).g.p) !:.10) that in the bivariate nonnal case a variance stahilizing transformation h(r) with . 2 1.3..fii(Y . Var Y i = E and h : 0 ~ RP where 0 is an open subset ofRd. Then ~.1'2.3.h(p)) is closely approximated by the NCO.hp )andhhasatotaldifferentialh(1)(rn) f • . and (Prohlem 5. E) = ..h= (h i . Theorem 5.rn) N(o.4.l) distribution.
when k = 10. 1 Xl has a x% distribution.05 quantiles are 2.1. we can write. This row. D Example 5. I). only that they be i.0 distribution and 2.m' To show this.1. But E(Z') = Var(Z) = 1.k. . the:F statistic T _ k.25) has an :Fk.:l ~ m k+m L X. as m t 00.h(m)) = y'nh(l)(m)(Y  m) + op(I).. Suppose for simplicity that k = m and k .Xn is a sample from a N(O. if .21 for the distribution of Vlk.1 in which the density of Vlk.3..d. check the entries of Table IV against the last row. The case m = )'k for some ).m. 14.Section 5.7) implies that as m t 00.3. Suppose that Xl.211 = 0.0 < 2.7.3. > 0 is left to the problems. where V .:: .05 and the respective 0. we conclude that for fixed k. is given as the :FIO .= density..3.m. the :Fk. when the number of degrees of freedom in the denominator is large. To get an idea of the accuracy of this approximation.i=k+l Xi ". > 0 and EXt < 00.i. We write T k for Tk.k+m L::l Xl 2 (5.l) distribution. gives the quantiles of the distribution of V/k.37] = P[(vlk) < 2.3 First.9) to find an approximation to the distribution of Tk. Next we turn to the normal approximation to the distribution of Tk. L::+. See also Figure B.37 for the :F5 . Then according to Corollary 8..).. By Theorem B.m  (11k) (11m) L. Thus. When k is fixed and m (or equivalently n = k + m) is large.1. xi and Normal Approximation to the Distribution of F Statistics. EX. Now the weak law of large numbers (A. we first note that (11m) Xl is the average ofm independent xi random variables.~1.26) l. We do not require the Xi to be normal.m distribution. where k + m = n. with EX l = 0.' = Var(X. Suppose that n > 60 so that Table IV cannot be used for the distribution ofTk.3. the mean ofaxi variable is E(Z2). if k = 5 and m = 60. which is labeled m = 00.and HigherOrder Asymptotics: The Delta Method with Applications 321 (c) jn(h(Y) .x%. 00.3. we can use Slutsky's theorem (A. where Z ~ N(O. i=k+1 Using the (b) part of Slutsky's theorem. For instance. By Theorem B.m distribution can be approximated by the distribution of Vlk.15. . then P[T5 .3. Then.. (5.
3.3.k(tl)] '" 'fI() :'. X n are a sample from PTJ E P and ij is defined as the MLE 1 . 1'.3. Equivalently Tk = hey) where Y i ~ (li" li. when rnin{k.3. = E(X.29) is unknown.1) "" N(O.m • < tl P[) :'. (5. E(Y i ) ~ (1. j m"': k (/k. and a~ = Var(X. . v'n(Tk .2Var(Yll))' In particular if X. 1'.3. _. 1953) is that unlike the t test..a) . i.)T.m = (1+ jK(:: z.1)/12 j 2(m + k) mk Z. 0 ! j i · 1 where K = Var[(X.28) . ifVar(Xf) of 2a'. Thus (Problem 5. .. 322 Asymptotic Approximations Chapter 5 where Yi1 = and Yi2 = Xf+i/a2. the upper h. the distribution of'. h(i)(u. when Xi ~ N(o.8(a» does not have robustness of level. v) identity.8(d))._o l or ~..m 1) can be ap proximated by a N(O. i = 1.27) where 1 = (1.).28) satisfies Zto '" 1 I I:. We conclude that xl jer}. v) ~ ~.o) k) K (5.m(l.. Then if Xl. l (: An interesting and important point (noted by Box. 2) distribution. 4).7).' !k . = (~. a'). .3. !1 .m 1) <) :'. as k ~ 00. ~ Ck.'.).m critical value fk. I' Theorem 5.k(T>. P[Tk. i.3 Asymptotic Normality of the Maximum likelihood Estimate in Exponential Families Our final application of the 8method follows.)/ad 2 .m(1 . Specifically.4.8(c» one has to use the critical value . ~ N (0. • • 1)/12) .5. I)T and h(u. 1)T.I.3. where J is the 2 x 2 v'n(Tk 1) ""N(0. a'). In general (Problem 53.a) '" 1 + is asymptotically incorrect. which by (5. the F test for equality of variances (Problem 5. :. By Theorem 5.3. :f. 1 5.3. if it exists and equal to c (some fixed value) otherwise. k.3. Suppose P is a canonical exponential family of rank d generated by T with [open.m(1. ! i l _ . When it can be eSlimated by the method of moments (Problem 5.m} t 00. . In general.»)T and ~ = Var(Yll)J.a). .k(Tk.k (t 1 (5.
3.Nd(O.3. therefore.11) eqnals the lower bound (3. For (ii) simply note that. 0').24).3.2 where 0" = T.Section 5.N(o. Thus.A 1(71))· Proof.2. as X witb X ~ Nil'. X n be i. Example 5. = 1/20'. (i) follows from (5.4.T. and 1). PT/[ii = AI(T)I ~ L Identify h in Theorem 5.'l 1 (ii) LT/(Vii(iiT/)). = (5. iiI = X/O".38) on the variance matrix of y'n(ij .2.I(T/) (5.3.4.). hence.2 that. (ii) follows from (5. (I" +0')] . T/2 By Theorem 5. Recall that A(T/) ~ VarT/(T) = 1(71) is the Fisher information. where. Now Vii11.33) 71' 711 Here 711 = 1'/0'.".3. the asymptotic variance matrix /1(11) of .i. = A by definition and.6.1..3.A(T/)) + oPT/ (n .3.) .3. = 1/2.32) Hence.3 First.23).. o Remark 5.2.3. in our case. I'.In(ij . This is an "asymptotic efficiency" property of the MLE we return to in Section 6.l4. by Corollary 1.3. The result is a consequence.1. by Example 2. . PT/[T E A(E)] ~ 1 and. thus.1.and HigherOrder Asymptotics: The Delta Method with Applications 323 (i) ii = 71 + ~ I:71 A •• • I (T/)(T(X.4. are sufficient statistics in the canonical model.EX. Then T 1 = X and 1 T2 = n. of Theorems 5..  (TIl' . if T ~ I:7 I T(X.2 and 5.5. and.3.4 with AI and m with A(T/).30) But D A . Ih our case.8. Thus. (5. Let Xl>' .ift = A(T/).O. Note that by B. (5.d.8.3. We showed in the proof ofTheorem 5.71) for any nnbiased estimator ij.31) .
Xk} only so that P is defined by p .4 and Problem 2.h(JL) where Y 1.1) i . Fundamental asymptotic formulae are derived for the bias and variance of an estimate first for smooth function of a scalar mean and then a vector mean. Y n are i. .. and il' = T.26) vn(X /". variables are explained in tenns of similar stochastic approximations to h(Y) .g.Pk) where (5.d. Assume A : 0 ~ pix. under Li.1 by studying approximations to moments and central moments of estimates. is twice differentiable for 0 < j < k. stochastic approximations in the case of vector statistics and parameters are developed. for instance.4. . .L~ I l(Xi = Xj) is sufficient. which lead to a result on the asymptotic nonnality of the MLE in multiparameter exponential families.') N(O...3.. Specifically we shall show that important likelihood based procedures such as MLE's are asymptotically optimal. 0).4 ASYMPTOTIC THEORY IN ONE DIMENSION I: I " ! . Following Fisher (1958)'p) we develop the theory first for the case that X""" X n are i. 5. . sampling.d. 0). These stochastic approximations lead to Gaussian approximations to the laws of important statistics. Firstorder asymptotics provides approximations to the difference between a quantity tending to a limit and the limit. . taking values {xo.1 Estimation: The Multinomial Case . . see Example 2. EY = J.1.. . . .s where Eo = diag(a 2 )2a 4 ). I i I . Thus.. We focus first on estimation of O. and h is smooth.3. Summary.. .: CPo.P(Xk' 0)) : 0 E 8}. .4. We begin in Section 5. . i 7 •• . I I I· < P. and confidence bounds. . Secondorder asyrnptotics provides approximations to the difference between the error and its firstorder approximation. testing.. N = (No. These "8 method" approximations based on Taylor's fonnula and elementary results about moments of means of Ll. Higherorder approximations to distributions (Edgeworth series) are discussed briefly. il' . I . .i. We consider onedimensional parametric submodels of S defined by P = {(p(xo.3.i. Eo) .15). and so on.L. 0.33) and Theorem 5.. the difference between a consistent estimate and the parameter it estimates.d.. .4 to find (Problem 5. Consistency is Othorder asymptotics.7).. i In this section we define and study asymptotic optimality for estimation. Nk) where N j . Finally. we can use (5. In Chapter 6 we sketch how these ideas can be extended to multidimensional parametric families. J J .324 Asymptotic Approximations Chapter 5 Because X = T. when we are dealing with onedimensional smooth parametric models. < 1. and PES.1. . 0 .3.(T. .". 5. the (k+ I)dimensional simplex (see Example 1.6.d. The moment and in law approximations lead to the definition of variance stabilizing transfonnations for classical onedimensional exponential families. 8 open C R (e.)'. as Y.
if A also holds.4. > rl(O) M a(p(8)) P. h) is given by (5. 0 <J (5.2).8) . h) with eqUality if and only if.4) g.4. . (5.4.6) 1.2 (8.5) As usual we call 1(8) the Fisher infonnation. Consider Example where p(8) ~ (p(xo. Many such h exist if k 2. 8) is similarly bounded and well defined with (5. < k.1. bounded random variable (5. Moreover.4 Asymptotic Theory in One Dimension 325 Note that A implies that k [(X I .8)1(X I =Xj) )=00 (5.1..3) Furthermore (See'ion 3..4..2(0.:) (see (2. fJl E.Jor all 8. Theorem 5. 88 (Xl> 8) and =0 (5.11)) of (J where h: S satisfies ~ R h(p(8» = 8 for all 8 E e > (5. .4. (0) 80(Xj. (5. Then we have the following theorem.11).4.4..2) is twice differentiable and g~ (X I) 8) is a well~defined.8). pC') =I 1 m .8) ~ I)ogp(Xj.l (Xl. Assume H : h is differentiable.9) .7) where .4.1. for instance. Next suppose we are given a plugin estimator h (r.p(Xk. 8))T.4. Under H.4.4.8) logp(X I .4.Section 5.0).
(8.326 Asymptotic Approximations Chapter 5 Proof.. (5. not only is vn{h (. • • "'i. using the correlation inequality (A.8)&Pj 2: • &h a:(p(0))p(x. which implies (5..4. ~ir common variance is a'(8)I(8) = 0' (0.8) ) ~ . h k &h &pj(p(8)) (N p(xj.p(xj.4. I (x j.. but also its asymptotic variance is 0'(8. we obtain 2: a:(p(8)) &0(xj. 2: • j=O &l a:(p(8))(I(XI = Xj) . we obtain &l 1 <0 . h(p(8)) ) ~ vn Note that.4.9).8) ) +op(I).10). whil.6).I n. Taking expectations we get b(8) = O. I h 2 (5.8)) = a(8) &8 (X" 8) +b(O) &h Pl (5.10) Thus.4. 0)] p(Xj.. Apply Theorem 5.')  h(p(8)) } asymptotically normal with mean 0. we s~~ iliat equality in (5.4. by noting ~(Xj.4.8) gives a'(8)I'(8) = 1.14) .2 noting that N) vn (h (. h). kh &h &Pj (p(8))(I(Xi = Xj) .p(xj. : h • &h &Pj (p(8)) (N .4.11) • (&h (p(O)) ) ' p(xj.II. (5.4.8))./: . 8). h) = I .13) I I · ..3.4. by (5.12) [t.8) j=O 'PJ Note that by differentiating (5.l6) as in the proof of the information inequality (3.13). for some a(8) i' 0 and some b(8) with prob~bility 1.12).h)Var.p(x). Noting that the covariance of the right' anp lefthand sides is a(8). &8(X.h)I8) (5.4.16) o . using the definition of N j .8) with equality iff. 0) = • &h fJp (5.8) = 1 j=o PJ or equivalently.4.15) i • . ( = 0 '(8.4. By (5. (5.4.
. the MLE of B where It is defined implicitly ~ by: h(p) is the value of O.. 0 Both the asymptotic variance bound and its achievement by the MLE are much more general phenomena.o(p(X"O)  p(X"lJo )) . achieved by if = It (r.3.3. 0 E e.4..xd).Section 5A Asymptotic Theory in One Dim"e::n::'::io::n ~_ __'3::2:. . As in Theorem 5.4. by ( 2. 0).2.5 applies to the MLE 0 and ~ e is open.3 we give this result under conditions that are themselves implied by more technical sufficient conditions that are easier to check. .1.4.4.19) The binomial (n.. Suppose p(x.. which (i) maximizes L::7~o Nj logp(xj.17) L. J(~)) with the asymptotic variance achieving the infonnation bound Jl(B). . Note that because n N T = n .3 that the information bound (5.18) and k h(p) = [A]I LT(xj)pj )". . 5.4. if it exists and under regularity conditions.(vn(B . Suppose i.Xn are tentatively modeled to be distributed according to Po. . .d.i. Xl. Then Theo(5.4."j~O (5..2 Asymptotic Normality of Minimum Contrast and MEstimates o e o We begin with an asymptotic normality theorem for minimum contrast estimates. In the next two subsections we consider some more general situations. 0) and (ii) solvesL::7~ONJgi(Xj.I L::i~1 T(Xi) = " k T(xJ )".4.0)) ~N (0. is a canonical oneparameter exponential family (supported on {xo.A(O)}h(x) where h(x) = l(x E {xo. Let p: X x ~ R where e D(O.8) is.8) = exp{OT(x) . : E Ell... p) and HardyWeinberg models can both be put into this framework with canonical parameters such as B = log ( G) in the first case..0 (5.3) L. Write P = {P. OneParameter Discrete Exponential Families.Oo) = E.0)=0. Example 5.::7 We shall see in Section 5.. Xk}) and rem 5. E open C R and corresponding density/frequency functions p('. then.).
21) i J A2: Ep.p(X" On) n = O. ~ (X" 0) has a finite expectation and i ! . We need only that O(P) is a parameter as defined in Section 1. O)dP(x) = 0 (5.. as pointed out later in Remark 5. 1 n i=l _ . .20) In what follows we let p. rather than Pe. (5. That is.L .4.22) J i I .21) and. { ~ L~ I (~(Xi. Theorem 5.p = Then Uis well defined. j.~(Xi..p E p 80 (X" O(P») A3: . That is.O). hence. 0) is differentiable. 0 E e. • is uniquely minimized at Bo.t) . i' On where = O(P) +.4. J is well defined on P.pix. o if En ~ O. PEP and O(P) is the unique solution of(5.0(P)) l. . As we saw in Section 2.p(Xi.3. P) = . f 1 .1. O)ldP(x) < co.p(x.!:.4. . Under AOA5. < co for all PEP.328 Asymptotic Approximations Chapter 5 .p(x. under regularity conditions the properties developed in this section are valid for P ~ {Pe : 0 E e}.O(p)))I: It .=1 n ! (5. A4: sup. # o. O(P))) .1/ 2) n '.4.O(P)) +op(n. O(Pe ) = O.23) I. ..p(x.2. Let On be the minimum contrast estimate On ~ argmin .p(.LP(Xi. O(P). denote the distribution of Xi_ This is because.O(P)I < En} £. L. ~i  i . parameters and their estimates can often be extended to larger classes of distributions than they originally were defined for. 8. . n i=l _ 1 n Suppose AO: . ~I AS: On £.4..1. O(P» / ( Ep ~~ (XI. Suppose AI: The parameter O(P) given hy the solution of .4. (5.' '.4.p2(X. On is consistent on P = {Pe : 0 E e}.
28) .O(P)) Ep iJO (Xl. 1 1jJ(Xi .8(P)) I n n n~ t=1 n~ t=1 e (5. A2.4.p) = E p 1jJ2(Xl .' " iJ (Xi .4.4.O(P)) = n. But by the central limit theorem and AI.O(P)) 2' (E '!W(X. O(P)).25) where 18~  8(P)1 < IiJn . (5.4.29) .1j2 ).O(P)) ~ .4.L 1jJ(X" 8(P)) = Op(n.20).27) we get. By expanding n.O(P)) / ( El' ~t (X" O(P))) o while E p 1jJ2(X"p) = (T2(1jJ.l   2::7 1 iJ1jJ. using (5. applied to (5.20) and (5.4.4.425)(5.4. .22) because .O(P))) p (5..4.4 Asymptotic Theory in One Dimension 329 Hence. (iJ1jJ ) In (On .26) and A3 and the WLLN to conclude that (5.4.O(P)I· Apply AS and A4 to conclude that (5.27) Combining (5.21).22) follows by a Taylor expansion of the equations (5. en) around 8(P).lj2 and L 1jJ(X" P) + op(l) i=1 n EI'1jJ(X 1. and A3.4. Next we show that (5.24) follows from the central limit theorem and Slutsky's theorem. t=1 1 n (5.' " 1jJ(Xi . Let On = O(P) where P denotes the empirical probability.._ .4.On)(8n .Ti(en . 8(P)) + op(l) = n ~ 1jJ(Xi . we obtain.. n.4.p) < 00 by AI.24) proof Claim (5. where (T2(1jJ.Section 5.
4. n I _ . 0) is replaced by Covo(?jJ(X" 0).22) follows from the foregoing and (5.30) suggests that ifP is regular the conclusion of Theorem 5.31) for all O.O). B) where p('. This extension will be pursued in Volume2.12).2 may hold even if ?jJ is not differentiable provided that .O)?jJ(X I . X n are i.2.O) = O. it is easy to see that A6 is the same as (3.Eo (XI.4.i. A2. Remark 5. we see that A6 corresponds to (5. Our arguments apply to Mestimates.29).4. 1 •• I ! .4. en j• . and A3 are readily checkable whereas we have given conditions for AS in Section 5.4.'.30) is formally obtained by differentiating the equation (5. AI. Our arguments apply even if Xl. essentially due to Cramer (1946). (2) O(P) solves Ep?jJ(XI.4.4. B» and a suitable replacement for A3. Suppose lis differen· liable and assume that ! • &1 Eo &O(X I . I 'iiIf 1 I • I I .: 1. for A4 and A6.e.2 is valid with O(P) as in (I) or (2). 0 • . . Remark 5.4. Conditions AO. and we define h(p) = 6(xj )Pj.28) we tinally obtain On . 0) = 6 (x) 0.4. for Mestimates. O)dp. . that 1/J = ~ for some p).?jJ(XI'O)). If further Xl takes on a finite set of values.1. . written as J i J 2::..(x) (5. • Theorem 5. 330 Asymptotic Approximations Chapter 5 Dividing by the second factor in (5.d. 0) = logp(x..8). =0 (5.3.30) Note that (5. An additional assumption A6 gives a slightly different formula for E p 'iiIf (X" O(P)) if P = Po. I o Remark 5.4.21).! . If an unbiased estimateJ (X d of 0 exists and we let ?jJ (x. A4 is found. A6: Suppose P = Po so that O(P) = 0.4.20) are called M~estimates. Nothing in the arguments require that be a minimum contrast as well as an Mestimate (i.2. O(P) in AIA5 is then replaced by (I) O(P) ~ argmin Epp(X I . Identity (5.O(P) Ik = n n ?jJ(Xi . 0) l' P = {Po: or more generally.4.4.1. 0) Covo (:~(Xt. Z~ (Xl.4. Solutions to (5. e) is as usual a density or frequency function. P but P E a}. O(P)) + op (I k n n ?jJ(Xi . O(P)) ) and (5. {xo 1 ••• 1 Xk}.4.=0 ?jJ(x.. This is in fact truesee Problem 5. We conclude by stating some sufficient conditions. O)p(x.4. and that the model P is regular and letl(x.4.
0) (5.4 Asymptotic Theory in One Dimension 331 A4/: (a) 8 + ~~(xI. "dP(x) = p(x)diL(x).4.33) so that (5. 0) g~ (x.O) = l(x. 0') is defined for all x. (b) There exists J(O) sup { > 0 such that O"lj') Ehi) eo (Xl.8) is a continuous function of8 for all x..35) with equality iff.4. where iL(X) is the dominating measure for P(x) defined in (A.34) Furthermore. w. s) diL(X )ds < 00 for some J ~ J(O) > 0.3.p(x.4.4. then the MLE On (5.O) = l(x.4.eo (Xl.3 Asymptotic Normality and Efficiency of the MLE The most important special case of (5.4. 0') ..1). 5. In this case is the MLE and we obtain an identity of Fisher's.O) 10gp(x. Theorem 5. satisfies If AOA6 apply to p(x. = en en = &21 Ee e02 (Xl. where EeM(Xl.O).O) and P ~ ~ Pe. We can now state the basic result on asymptotic normality and e~ciency of the MLE.Section 5.0'1 < J(O)} 00. That is." Details of how A4' (with ADA3) iIT!plies A4 and A6' implies A6 are given in the problems.20) occurs when p(x. .01 < J( 0) and J:+: JI'Ii:' (x. 0) < A6': ~t (x.4.4) but A4' and A6' are not necessary (Problem 5. 10' .4. Pe) > 1(0) 1 (5. AOA6.p = a(0) g~ for some a 01 o. lO.:d in Section 3.p.. 0) and .4. We also indicate by example in the problems that some conditions are needed (Problem 5. 0) obeys AOA6. < M(Xl. then if On is a minimum contrast estimate whose corresponding p and '1jJ satisfy 2 0' (.4. B). 0) .32) where 1(8) is t~e Fisher information introduq.
37) .36) Because Eo 'lj.<1>( _n '/4 . see Lehmann and Casella.l'1 Let X" .. 1 I I. 1 .. N(B.4.4. . p....ne . (5.'(B) = I = 1(~I' B I' 0. Let Z ~ N(O. we know that all likelihood ratio tests for simple BI e e eo.4. for higherdimensional the phenomenon becomes more disturbing and has important practical consequences. {X 1. 1. 00. j 1 " Note that Theorem 5..4.35)...30) and (5.4. .4.439) where . .3 is not valid without some conditions on the esti· mates being considered. PoliXI < n1/'1 .4. for some Bo E is known as superefficiency.4.1/4 I (5. B) = 0.i .. I. .A'(B)..r 332 Asymptotic Approximations Chapter 5 Proof Claims (5.". 1998. if B I' 0..4. PIIZ + .36) is just the correlation inequality and the theorem follows because equality holds iff 'I/J is a nonzero multiple a( B) of 3!. If B = 0.. 1. 0 because nIl' . Hodges's Example. The major testing problem if B is onedimensional is H : < 8 0 versus K : > If p(.nBI < nl/'J <I>(n l/ ' . B) is an MLR family in T(X). 1).(X B).1 once we identify 'IjJ(x. However.3 generalizes Example 5. . claim (5.2. . PolBn = Xl .nB) .35) is equivalent to (5.1).2.4. I I Example 5.B). .. . and Polen = OJ .d. }. By (5. B) with J T(x) . ..• •. . 0 1 .. thus.4.4. . . PoliXI < n 1/4 ] . .. 1 . cross multiplication shows that (5.nB).i. Then X is the MLE of B and it is trivial to calculate [( B) _ 1.34) follow directly by Theorem 5.. We next compule the limiting distribution of .1/4 X if!XI > n.4.'(0) = 0 < l(~)' The phenomenon (5. (5. Consider the following competitor to X: B n I 0 if IXI < n. and.4. 442.39) with "'(B) < [I(B) for all B E eand"'(Bo) < [I(Bo). The optimality part of Theorem 5.4.1/ 4 " and using X as our estimate if the test rejects and 0 as our estimate otherwise. We discuss this further in Volume II..38) Therefore.33) and (5. We can interpret this estimate as first testing H : () = 0 using the test "Reject iff IXI > n.4 Testing ! I i .4. Therefore.n(Bn .Xn be i. For this estimate superefficiency implies poor behavior of at values close to 0. 5. Then . 0 e en e.
• • where A = A(e) because A is strictly increasing.a quantile o/the N(O.3 versus simple 2 .44) But Polya's theorem (A. 00 )" where P8 .(1 1>(z))1 ~ 0. Zla (5.[Bn > c(a. Let en (0:'. 00) .43) Property (5..Oo) = 00 + ZIa/VnI(Oo) +0(n. 1) distribution. ~ 0.4.. and (5.Oo)] PolOn > cn(a.00) > z] .0)]. It seems natural in general to study the behavior of the test. 00 )]  ~ 1. "Reject H for large values of the MLE T(X) of >.(a. derive an optimality property. On the (5. .45) which implies that vnI(Oo)(c.4 Asymptotic Theo:c'Y:c. . > AO.r'(O)) where 1(0) (5.l4. < >"0 versus K : >.00) other hand..:cO"=~ ~ 3=3:. The test is then precisely. as well as the likelihood ratio test for H versus K. Suppose (A4') holds as well as (A6) and 1(0) < oofor all O.4.(a.. .0:c":c'=D. are of the form "Reject H for T( X) large" with the critical value specified by making the probability of type I error Q at eo. PolO n > c. Thus. ljO < 00 .4.4.4.(a.00) > z] ~ 11>(z) by (5. eo) denote the critical value of the test using the MLE en based On n observations.4. Then c.41) where Zlct is the 1 .42) is sometimes called consistency of the test against a fixed alternative. (5. 00) .m='=":c'. Proof.:. e e e e.. Then ljO > 00 .40) > Ofor all O. Xl.:c". Theorem 5.4.4. the MLE. The proof is straightforward: PoolvnI(Oo)(Bn . (5.22) guarantees that sup IPoolvn(Bn . and then directly and through problems exhibit other tests wi!. 00)] = PolvnI(O)(Bn  0) > VnI(O)(cn(a.4.::..Oo)] = a and B is the MLE n n of We will use asymptotic theory to study the behavior of this test when we observe ij. 1 < z . "Reject H if B > c(a. 0 E e} is such thot the conditions of Theorem 5. b).46) PolBn > cn(a.I / 2 ) (5.4.h the same behavior.40).d. That is.41) follows.42)  ~ o.. X n distributed according to Po. a < eo < b.4.4. (5. Suppose the model P = {Po . B E (a. PoolBn > 00 + zl_a/VnI(Oo)] = POo IVnI(Oo) (Bn  00) > ZI_a] ~ a.. [o(vn(Bn 0)) ~N(O. this test can also be interpreted as a test of H : >.4... If pC e) is a oneparameter exponential family in B generated by T(X).Section 5.2 apply to '0 = g~ and On.
4 tells us that the test under discussion is consistent and that for n large the power function of the test rises steeply to Ct from the left at 00 and continues rising steeply to 1 to the right of 80 .4. the test based on 8n is still asymptotically MP. ~ " uniformly in 'Y. ~! i • Note that (5.J 334 Asymptotic Approximations Chapter 5 By (5. < 1 lI(Zl_a ")'.80) tends to infinity. I.80 ). .jnI(80) +0(n. 00 if8 > 80 0 and .50)) and has asymptotically smallest probability of type I error for B < Bo. If "m(8 . In fact.50) i.jI(80)) ill' < O. 00 if 8 < 80 .80) ~ 8)J Po[.jnI(8)(Bn ..4. then (5.4. 80 )  8) . Optimality claims rest on a more refined analysis involving a reparametrization from 8 to ")' "m( 8 . (5.4.8 + zl_o/.jI(80)) ill' > 0 > llI(zl_a ")'.k'Pn(X1.41).jnI(8)(80 . . the power of tests with asymptotic level 0' tend to 0'. LetQ) = Po.48) .48).1/ 2 ))].  i. Claims (5.jnI(8)(80 .49) ! i .8) > . I .4.jnI(80) + 0(n.[.4.jnI(O)(Bn .1 / 2 )) .. .")' ~ "m(8  80 ). .jnI(8)(80 .4. iflpn(X1 .···.42) and (5.4.8) > . . the power of the test based on 8n tends to I by (5.jnI(8)(Bn . o if "m(8 .jnI(8)(Cn(0'.50).80 1 < «80 )) I J .40) hold uniformly for (J in a neighborhood of (Jo.8) < z] . f. Write Po[Bn > cn(O'. Theorem 5. On the other hand. (5. (3) = Theorem 5. assume sup{IP. then limnE'o+. . these statements can only be interpreted as valid in a small neighborhood of 80 because 'Y fixed means () + B . 0. (5. X n ) i.   j Proof.51) I 1 I I .4.4.(1 lI(z))1 : 18 . Furthennore. . 1 ~. then by (5.49)) the test based on rejecting for large values of 8n is asymptotically uniformly most powerful (obey (5. 80)J 1 P.8) + 0(1) .2 and (5..8 + zl_a/. j .4.4. In either case. X n ) is any sequence of{possibly randomized) critical (test) functions such that .4.50) can be interpreted as saying that among all tests that are asymptotically level 0' (obey (5.4..4.4.jnI(8)(Cn(0'.43) follow.5.80 ) tends to zero. That is. Suppose the conditions afTheorem 5.017".47) lor.48) and (5.[.(80) > O.4. j (5.4.
. .52) > 0.j1i(8  8 0 ) is fixed..80 ) < n p (Xi. .Xn) is the critical function of the Wald test and oLn (Xl. Finally.4. hence.4.50) note that by the NeymanPearson lemma. is asymptotically most powerful as well. These are the likelihood ratio test and the score or Rao test.8 0 ) . 1. € > 0 rejects for large values of z1.. Q). Further Taylor expansion and probabilistic arguments of the type we have used show that the righthand side of (5.Section 5.4 Asymptotic Theory in One Dimension 335 If...o + 7. It is easy to see that the likelihood ratio test for testing H : g < 80 versus K : 8 > 00 is of the form "Reject if L log[p(X" 8n )!p(X i=1 n i .50) for all. " Xn. i=1 P 1. ..4.4.4. 0 The asymptotic!esults we have just established do not establish that the test that rejects for large values of On is necessarily good for all alternatives for any n. 0 n p(Xi .. [5In (X" . . note that the Neyman Pearson LR test for H : 0 = 00 versus K : 00 + t.53) is Q if. Llog·· (X 0) =dn (Q. P. . Thus. X. The test li8n > en (Q..48) follows..4. . ..4. 00 + E) logPn(X l E I " .50) and.4.5.54) Assertion (5.OO)J ." + 0(1) and that It may be shown (Problem 5. hand side of (5.Xn ) = 5Wn (X" ..4.<P(Zl_a . k n (80 ' Q) ~ if OWn (Xl.4 and 5. is O.' L10g i=1 p (X 8) 1..7). 8 + fi) 0 P'o+:.4. 00)]1(8n > 80 ) > kn(Oo..4. if I + 0(1)) (5. X n .8) that. There are two other types of test that have the same asymptotic behavior. 0) denotes the density of Xi and dn .. t n are uniquely chosen so that the right.4. . 80)] of Theorems 5.. 1(8) = 1(80 + + fi) ~ 1(80) because our uniformity assumption implies that 0 1(0) is continuous (Problem 5. for all 'Y. > dn (Q.4.53) where p(x. ~ .53) tends to the righthand side of (5.Xn) is the critical function of the LR test then.8) 1. for Q < ~.O+.OO+7n) +EnP.jn(I(8o) + 0(1))(80 .4. . The details are in Problem 5.jI(Oo)) + 0(1)) and (5. 0 (5.5 in the future will be referred to as a Wald test.<P(Zla(1 + 0(1)) + . .)] ~ L (5.[logPn( Xl.?.54) establishes that the test 0Ln yields equality in (5.4. To prove (5.
336
Asymptotic Approximations
Chapter 5
where PlI(X 1, ... ,Xn ,8) is the joint density of Xl, ... ,Xn . Fort: small. n fixed, this is approximately the same as rejecting for large values of a~o logPn(X 11 • • • 1 X n ) eo).
,
• ,
The preceding argument doesn't depend on the fact that Xl" .. , X n are i.i.d. with common density or frequency function p{x, 8) and the test that rejects H for large values of a~o log Pn (XI, ... ,Xn , eo) is, in general, called the score or Rao test. For the case we are considering it simplifies, becoming
"Reject H iff
t
i=I
iJ~
: ,
logp(X i ,90) > Tn (a,9 0)."
0
I I
(5.4.55)
It is easy to see (Problem 5.4.15) that
Tn(a, 90 ) = Z10 VnI(90) + o(n I/2 )
and that again if G (Xl, ... , X n ) is the critical function of the Rao test then
nn
,
•
.,
,•
Po,+' WRn (X t, ... ,Xn ) = Ow n(X t , ... ,Xn)J ~ 1, rn
(5.4.56)
(Problem 5.4.8) and the Rao test is asymptotically optimal. Note that for all these tests and the confidence bounds of Section 5.4.5, I(90 ), which d' may require numerical integration, can be replaced by _n l d021n(Bn) (Problem 5.4.10).
5.4,5
Confidence Bounds
Q
We define an asymptotic Levell that
lower confidence bound (LCB) On by the requirement
(5.4.57)
r I I i ,
., '
.'
for all () and similarly define asymptotic level!  a DeBs and confidence intervals. We can approach obtaining asymptotically optimal confidence bounds in two ways:
(i) By using a natural pivot.
, f.
(
.,
(ii) By inverting the testing regions derived in Section 5.4.4.
,. "',
Method (i) is easier: If the assumptions of Theorem 5.4.4 hold, that is, (AO)(A6), (A4'), and I(9) finite for all it follows (Problem 5.4.9) that
e.
Co(
V n  e)) ~ N(o, 1) nI(lin)(li
Z1a/VnI(lin ).
(5.4.58)
for all () and, hence. an asymptotic level!  a lower confidence bound is given by
9~ = lin e~,
(5.4.59)
Turning tto method (ii), inversion of 8Wn gives fonnally
= inf{9: en(a, 9) > 9n }
(5.4.60)
,
=
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
337
or if we use the approximation C (0, e) ~ n
e+ zlQ/vnI(iJ), (5,4,41),
e) > en}'
~~, = inf{e , Cn(C>,

(5,4,61)
In fact neither e~I' or e~2 properly inverts the tests unless cn(Q, e) and Cn (Q, e) are increasing in The three bounds are different as illustrated by Examples 4.4.3 and 4.5.2. If it applies and can be computed, e~l is preferable because this bound is not only approximately but genuinely level 1 Q. But computationally it is often hard to implement because cn(Q, 0) needs, in general, to be computed by simulation for a grid of values. Typically, (5.4.59) or some equivalent alternatives (Problem 5,4,10) are preferred but Can be quite inadequate (Problem 5,4,1 I), These bounds e~, O~I' e~2' are in fact asymptotically equivalent and optimal in a suitable sense (Problems 5,4,12 and 5,4,13),
e.
e
Summary. We have defined asymptotic optimality for estimates in oneparameter models. In particular, we developed an asymptotic analogue of the information inequality of Chapter 3 for estimates of in a onedimensional subfamily of the multinomial distributions, showed that the MLE fonnally achieves this bound, and made the latter result sharp in the context of oneparameter discrete exponential families. In Section 5.4.2 we developed the theory of minimum contrast and M estimates, generalizations of the MLE, along the lines of Huber (1967), The asymptotic formulae we derived are applied to the MLE both under the mooel that led to it and tmder an arbitrary P. We also delineated the limitations of the optimality theory for estimation through Hodges's example. We studied the optimality results parallel to estimation in testing and confidence bounds. Results on asymptotic properties of statistical procedures can also be found in Ferguson (1996), Le Cam and Yang (1990), Lehmann (1999), Rao (1973), and Serfling (1980),
e
5.5
ASYMPTOTIC BEHAVIOR AND OPTIMALITY OF THE POSTERIOR DISTRIBUTION
Bayesian and frequentist inferences merge as n t 00 in a sense we now describe. The framework we consider is the one considered in Sections 5.2 and 5.4, i.i.d. observations from a regular madel in which is open C R or = {e 1 , , , , , e,} finite, and e is identifiable, Most of the questions we address and answer are under the assumption that fJ = 0, an arbitrary specified value, or in frequentist tenns, that 8 is true.
e
e
Consistency The first natural question is whether the Bayes posterior distribution as n + 00 concentrates all mass more and more tightly around B. Intuitively this means that the data that are coming from Po eventually wipe out any prior belief that parameter values not close to are likely, Formalizing this statement about the posterior distribution, II(· I X It •.• , X n ), which is a functionvalued statistic, is somewhat subtle in general. But for = {O l ,' .. , Ok} it is
e
e
i
338
straightforward. Let
Asymptotic Approximations Chapter 5
I.
11(8 i XI,···, Xn)
Then we say that II(·
=PIO = 8 I Xl,···, Xn].
(5.5.1)
e, P,li"(8 I Xl, ... , Xn)  11 > ,] ~ 0 for all f. > O. There is a slightly stronger definition: rIC I XI,' .. ,Xn ) is a.S. iff for all 8 E e, 11(8 I Xl, ... , Xn) ~ 1 a.s. P,.
is consistent iff for all 8 E
General a.s. consistency is not hard to formulate:
I Xl, ... , Xn)
(5.5.2)
consistent
(5.5.3)
11(· I X), ... , Xn)
=}
OJ'} a.s. P,
(5.5.4)
where::::} denotes convergence in law and <5{O} is point mass at satisfactory result for finite.
e
e.
There is a completely
, ,
,
Theorem 5.5.1. Let 1rj  p[e = Bj ], j = 1, ... 1 k denote the prior distribution 0/8. Then II(· I Xl, ... ,Xn ) is consistent (a.s. consistent) iff 7fj > afor j = I, ... , k.
Proof. Let p(., B) denote the frequency or limit j function of X. The necessity of the condition is immediate because 1["] = 0 for some j implies that 1f(Bj I Xl, ... ,Xn ) = 0 for all Xl, .. . , X n because, by (1.2.8),
.,
,J,
11(8j
I Xl, ... ,Xn)
PIO = 8j I Xl, ... ,Xn ] 11j Ir~l p(Xi , 8il
L:.~l 11. ni~l p(Xi , 8.)
k
n .
(5.5.5)
,
,
, ,,
Intuitively, no amount of data can convince a Bayesian who has decided a priori that OJ is impossible. On the other hand, suppose all 71" j are positive. If the true is (J j or equivalently 8 = (J j, then
e
log
11(8.IXl, ... ,Xn) =n 11(8 j I X), ... ,Xn)
(11og+ L.. og P(Xi,8.)) . 11. 1{f.,1 n
7fj
n i~l
p(Xi ,8j)
,
By the weak (respectively strong) LLN, under POi'
.,i
1{f.,log p(Xi ,8.)  Lni~l
p(Xi ,8j )
+
E
OJ
(I
I
P(XI ,8.)) og p(X I ,8j )
i
I
in probability (respectively a.s.). But Eo;
(log: ~:::;))
~
< 0, by Shannon's inequality, if
Ba
• .,
=I=
Bj' Therefore,
11(8.IXI, ... ,Xn) 1 og 11(8j I X), ... , X n )
00
,
in the appropriate sense, and the theorem follows.
o
i ~,
h
_
Section 55
Asymptotic Behavior and Optimality of the Posterior Distribution
339
e
Remark 5.5.1. We have proved more than is stated. Namely. that for each I XI, . .. ,Xn ]  a exponentially.
e E e. Po[O =l0
As this proof suggests, consistency of the posterior distribution is very much akin to consistency of the MLE. The appropriate analogues of Theorem 5.2.3 are valid. Next we give a much stronger connection that has inferential implications:
Asymptotic normality of the posterior distribution
Under conditions AOA6 for p(x, B) that if B is the MLE,

= lex, Bj
=logp(x, B), we showed in Section 5.4
(5.5.6)
Ca(y'n(e  B» ~ N(O,rl(B)).
Consider C( ..;ii((}  B) I Xl, ... , X n ), the posterior probability distribution of y'n((} B( Xl, ... , X n )), where we emphasize that (j depends only on the data and is a constant given XI, ... , X n . For conceptual ease we consider A4(a.s.) and A5(a.s.), assumptions that strengthen A4 and A5 by replacing convergence in Po probability by convergence a.s. p•. We also add,



A7: For all (), and all 0> o there exists t(o,(})
p. [sup
> 0 such that
{~ t[I(Xi,B') /(Xi,B)]: 18'  BI > /j} < '(0, B)] ~ I.
e such that 1r(') is continuous and positive
AS: The prior distribution has a density 1f(') On at all B. Remarkably,
Theorem 55.2 (UBernsteinlvon Mises"). If conditions ADA3, A4(a.s.), A5(a.s.), A6, A7, and A8 hold. then
C(y'n((}(})
I X1, ... ,Xn )
~N(O,l
1
(B»)
(5.5.7)
a.s. under P%ralle.
We can rewrite (5.5.7) more usefully as
sup IP[y'n((}  e) < x I Xl, ... , X n]  of>(xVI(B»)j ~ 0
x
(5.5.8)
for all a.s. Po and, of course, the statement holds for our usual and weaker convergence in Po probability also. From this restatement we obtain the important corollary. Corollary 5.5.1. Under the conditions of Theorem 5.5.2,
e
sup IP[y'n(O  e) < x j Xl, ... , XnJ  of>(xVl(e)1
x
~0
(5.5.9)
a.s. P%r all B.
1 , ,
340
Asymptotic Approximations Chapter 5
Remarks
(I) Statements (5,5.4) and (5,5,7)(5,5,9) are, in fact, frequentist statements about the
asymptotic behavior of certain functionvalued statistics.
(2) Claims (5.5.8) and (5.5.9) hold with a.s. replaced by in P, probability if A4 and
A5 are used rather than their strong formssee Problem 5.5.7.
(3) Condition A7 is essentially equivalent to (5.2.8), which coupled with (5.2.9) and
identifiability guarantees consistency of Bin a regular model.

Proof We compute the posterior density of .,fii(O  B) as
(5.5.10)
where en = en(X!, . .. ,Xn) is given by

Divide top and bottom of (5.5.10) by
;,'
II7
1 p(Xi ,
B) to obtain
(5.5.11)

where l(x,B)
= 10gp(x,B) and
,
,
We claim that
for all B. To establish this note that (a) sup { 11" + 1I"(B) : ItI < M} tent and 1T' is continuous. (b) Expanding, (5.5.13)
(e In) 
~ 0 a.s.
for all M because
eis a.s. consis
I
I
1
i
1 ! , ,
I
p
J
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
341
where
Ie  Bit)1 < )n.
We use I:~
1
g~ (Xi, e) ~ 0 here. By A4(a.s.), A5(a.s.),
1
n
sup { n~[}B,(Xi,B'(t))n~[}B,(Xi,B):ltl<M
In [}'I
[}'l
}
~O,
for all M, a.s. Po. Using (5.5.13). the strong law of large numbers (SLLN) and A8, we obtain (Problem 5.5.3),
Po
[dnqn(t)~1f(B)exp{Eo:;:(Xl,B)~}
forallt] =1.
(5.5.14)
Using A6 we obtain (5.5.12).
Now consider
dn =
I:
r
+y'n
1f(e+
;")exp{~I(Xi,9+;,,) 1(Xi,e)}ds
(5.5.15)
dnqn(s)ds
J1:;I<o,fii
J
1f(t) exp
{~(l(Xi' t) 1(X
i,
9)) } l(lt 
el > o)dt
By AS and A7,
Po [sup { exp
{~(l(Xi,t) 1(Xi , e») } : It  el > 0} < e"'("O)] ~ 1
(5.5.16)
for all 0 so that the second teon in (5.5.14) is bounded by y'ne"'("O) ~ 0 a.s. Po for all 0> O. Finally note that (Problem 5.5.4) by arguing as for (5.5.14), tbere exists o(B) > 0 such that
Po [dnqn(t) < 21f(8) exp {~ Eo (:;: (Xl, B))
By (5.5.15) and (5.5.16), for all 0
~}
for all It 1 < 0(8)y'n]
~ I.
(5.5.17)
> 0,
(5.5.18)
Po [dn 
r dnqn(s)ds ~ 0] = I. J1:;I<o,fii
exp {_ 8'I(B)} ds
2
Finally, apply the dominated convergence theorem, Theorem B.7.5, to dnqn(sl(lsl < 0(8)y'n)), using (5.5.14) and (5.5.17) to conclude that, a.s. Po,
d ~ 1f(B)
n
r= L=
= 1f(8)v'21i'.
JI(B)
(5.5.19)
,
I
342
Hence, a.S. Po,
Asymptot'lc Approximations Chapter 5
qn(t) ~ V1(e)<p(tvI(e))
where r.p is the standard Gaussian density and the theorem follows from Scheffe's Theorem B.7.6 and Proposition B.7.2. 0 Example 5.5.1. Posterior Behavior in the Normal Translation Model with Normal Prior. (Example 3.2.1 continued). Suppose as in Example 3.2.1 we have observations from a N{ (), ( 2 ) distribution with a 2 known and we put aN ('TJ, 7 2 ) prior on 8. Then the posterior
distribution of8 isN(Wln7J!W2nX,
(~I r12)1) where
,,2
W2n
.,
I
• •
WIn
= nT 2 +U2'
= !WIn
(5.5.20)
,
'"
,
r
, .,,
I
Evidently, as n + 00, WIn + 0, X + 8, a.s., if () = e, and (~I T\) 1 + O. That is, the posterior distribution has mean approximately (j and variance approximately 0, for n large, or equivalently the posterior is close to point mass at as we vn(O  9) has posterior distribution expect from Theorem 5.5.1. Because 9 =
;
N ( .,!nw1n(ry  X), n (~+
;'»
1).
x,
e
Now, vnW1n
=
O(n 1/ 2) ~ 0(1) and
0
n (;i + ~ ) 1
rem 5.5.2.
+ (12 =
II (8) and we have directly established the conclusion of Theo
I ,
I
I
Example 5.5.2. Posterior Behavior in the Binomial~Beta Model. (Example 3.2.3 continued). If we observe Sn with a binomial, B(n, 8), distribution, or equivalently we observe X" ... , X n Li.d. Bernoulli (I, e) and put a beta, (3(r, s) prior on e, then, as in Example 3.2.3, (J has posterior (3(8n +r, n+s  8 n ). We have shown in Problem 5.3.20 that if Ua,b has a f3(a, b) distribution, then as a + 00, b + 00,
I
If 0
a) £ (a+b)3]1( Ua,b a+b ~N(o,I). [ ab
< B<
(5.5.21)
j I ,
i j
1 is true, Sn/n ~. () so that Sn + r + 00, n + s  Sn + 00 a.s. Po. By identifying a with Sn + r and b with n + s  Sn we conclude after some algebra that because 9 = X,
vn((J  X)!:' N(O,e(l e))
a.s. Po, as claimed by Theorem 5.5.2.
o
Bayesian optimality of optimal frequentist procedures and frequentist optimality of
Bayesian procedures
•
Theorem 5.5.2 has two surprising consequences. (a) Bayes estimates for a wide variety of loss functions and priors are asymptotically efficient in the sense of the previous section.
,
1 ,
I
I
t
hz _
I
Section 5.5
Asymptotic Behavior and Optimality of the Posterior Distribution
343
(b) The maximum likelihood estimate is asymptotically equivalent in a Bayesian sense to the Bayes estimate for a variety of priors and loss functions. As an example of this phenomenon consider the following.
~
Theorem 5.~.3. Suppose the conditions of Theorem 5.5.2 are satisfied. Let B be the MLE ofB and let B* be the median ofthe posterior distribution ofB. Then
(i)
(5.5.22)
a.s. Pe for all
e. Consequently,
~, _ I ~ 1 az. I (e)ae(X" e) +op,(n 1/2 ) e  e+ n L.
l=l
(5.5.23)
and LO( .,fii(rr  e)) ~ N(o, rl(e)).
(ii)
(5.5.24)
E( .,fii(111 
el11111 Xl,'"
,Xn) = mjn E(.,fii(111  dl 
1(11) 1Xl.··· ,Xn) + op(I).
(5.5.25)
Thus, (i) corresponds to claim (a) whereas (ii) corresponds to claim (b) for the loss functions In (e, d) = .,fii(18 dlIell· But the Bayes estimatesforl n and forl(e, d) = 18dl must agree whenever E(11111 Xl, ... , Xn) < 00. (Note that if E(1111 I Xl, ... , X n ) = 00, then the posterior Bayes risk under l is infinite and all estimates are equally poor.) Hence, (5.5.25) follows. The proof of a corresponding claim for quadratic loss is sketched in Problem 5.5.5.
Proof. By Theorem 5.5.2 and Polya's theorem (A.l4.22)
sup IP[.,fii(O  e)
< x I Xl,'" ,Xu) 1>(xy'""'I(""'e))1 ~
Oa.s. Po.
(5.5.26)
But uniform convergence of distribution functions implies convergence of quantiles that are unique for the limit distribution (Problem B.7.1 I). Thus, any median of the posterior distribution of .,fii(11  e) tends to 0, the median of N(O, II (~)), a.s. Po. But the median of the posterior of .,fii(0  (1) is .,fii(e'  e), and (5.5.22) follows. To prove (5.5.24) note that
~ ~ ~
and, hence, that
E(.,fii(IOelll1e'l) IXl, .. ·,Xn) < .,fiile
e'l ~O
(5.5.27)
a.s. Po, for all B. Because a.s. convergence Po for all B implies. a.s. convergence P (B.?). claim (5.5.24) follows and, hence,
E( .,fii(10 
01 101) I h ... , Xn)
= E( .,fii(10 
0'1  101) I X" ... , X n ) + op(I).
(5.5.28)
344
~
Asymptotic Approximations
Chapter 5
Because by Problem 1.4.7 and Proposition 3.2.1, B* is the Bayes estimate for In(e,d), (5.5.25) and the theorem follows. 0 Remark. In fact, Bayes procedures can be efficient in the sense of Sections 5.4.3 and 6.2.3 even if MLEs do not exist. See Le Cam and Yang (1990).
Bayes credible regions
~
There is another result illustrating that the frequentist inferential procedures based on f) agree with Bayesian procedures to first order.
Theorem 5.5.4. Suppose the conditions afTheorem 5.5.2 are satisfied. Let
where en is chosen so that 1l"(Cn I Xl, ... ,Xn) = 1  0', be the Bayes credible region defined in Section 4.7. Let Inh) be the asymptotically level 1  'Y optimal interval based on B, given by
~
where dn(y)
i •
= z (! !)
JI~). ThenJorevery€ >
0, 0,
~ 1.
P.lIn(a + €) C Cn(X1 , .•. ,Xn ) C In(a  €)J
(5.5.29)
I
I
I
The proof, which uses a strengthened version of Theorem 5.5.2 by which the posterior density of Jii( IJ  0) converges to the N(O,Il (0)) density nnifonnly over compact neighborhoods of 0 for each fixed 0, is sketched in Problem 5.5.6. The message of the theorem should be clear. Bayesian and frequentist coverage statements are equivalent to first order. A finer analysis both in this case and in estimation reveals that any approximations to Bayes procedures on a scale finer than n 1j2 do involve the prior. A particular choice, the Jeffrey's prior, makes agreement between frequentist and Bayesian confidence procedures valid even to the higher n 1 order (see Schervisch, 1995).
Thsting
! ,
•
Bayes and frequentist inferences diverge when we consider testing a point hypothesis. For instance, in Problem 5.5.1, the posterior probability of 00 given X I, ... ,Xn if H is false is of a different magnitude than the pvalue for the same data. For more on this socalled Lindley paradox see Berger (1985) and Schervisch (1995). However, if instead of considering hypothesis specifying one points 00 we consider indifference regions where H specifies [00 + D.) or (00  D., 00 + D.), then Bayes and freqnentist testing procedures agree in the limit. See Problem 5.5.2. Summary. Here we established the frequentist consistency of Bayes estimates in the finite parameter case, if all parameter values are a prior possible. Second. we established
i
! I
I b
II
!
_
j
TI
Section 5.6
Problems and Complements
345
the socalled Bernsteinvon Mises theorem actually dating back to Laplace (see Le Cam and Yang, 1990), which establishes frequentist optimality of Bayes estimates and Bayes optimality of the MLE for large samples and priors that do not rule out any region of the parameter space. Finally, the connection between the behavior of the posterior given by the socalled Bernstein~von Mises theorem and frequentist contjdence regions is developed.
5.6
PROBLEMS AND COMPLEMENTS
Problems for Section 5.1
1. Suppose Xl, ... , X n are i.i.d. as X ous case density.
rv
F, where F has median F 1 (4) and a continu
(a) Show that, if n
= 2k + 1,
, Xn)
EFmed(X),
n (
_
~
)
l'
k (1  t)kdt F' (t)t
EF
med (X"
2
,Xn )
n( 2;) [1P'(t)f tk (1t)k dt
= 1, 3,
(b) Suppose F is unifonn, U(O, 1). Find the MSE of the sample median for n and 5.
2. Suppose Z ~ N(I', 1) and V is independent of Z with distribution X;'. Then T
Z/
=
(~)!
is said to have a noncentral t distribution with noncentrality J1 and m degrees
of freedom. See Section 4.9.2. (a) Show that
where fm(w) is the x~ density, and <P is the nonnal distribution function.
(b) If X" ... ,Xn are i.i.d.N(I',<T2 ) show that y'nX /
(,.~, L:(Xi
_X)2)! has a
noncentral t distribution with noncentrality parameter .fiiJ1/IT and n  1 degrees of freedom. (c) Show that T 2 in (a) has a noncentral :FI,m distribution with noncentrality parameter J12. Deduce that the density of T is
p(t)
= 2L
i=O
00
P[R = iJ . hi+,(f)[<p(t 1')1(t > 0)
+ <p(t + 1')1(t < 0)1
where R is given in Problem B.3.12.
, " I I. ,
346
Hint: Condition on
Asymptotic Approximations
Chapter 5
ITI.
1, then Var(X) < 1 with equality iff X
3. Show that if P[lXI < 11
±1 with
probability! .
Hint: Var(X) < EX 2
4. Comparison of Bounds: Both the Hoeffding and Chebychev bounds are functions of n and f. through ..jiif..
(a) Show that the ratio of the Hoeffding function h( VilE) to the Chebychev function e( Jii€) tends to as Jii€ ~ 00 so that he) is arbitrarily better than en in the tails.
°
I,.,'
i~
(b) Show that the normal approximation 24> (
V;€)  1 gives lower results than h in
00.
,
the tails if P[lXI < 1] = 1 because, if ,,2 < 1. 1  <p(t)  <p(t)lt as t ~ Note: Hoeffding (1963) exhibits better bounds for known a 2 .
R has .\(0) ~ 0, is bounded, and has a hounded second derivative .\n. Show that if Xl, ... , X n are i.i.d., EX l = f.L and Var Xl = 02 < 00, then
5. Suppose.\ : R
~
E.\(X 
1') = .\'(0)
;'/!; +
0
(~)
as n >
00.
= E>.'(O)JiiIX  1'1 + E (";' (X  I')(X I'?) where IX  1'1 < Ix  1'1· The last term is < suPx I>." (x) 1,,2 In and the first tends to
Hint: JiiE(.\(IX 1,1) .\(0))
I
",
1! "
.\'(0)" f== Izl<p(z)dz by Remark B.7. 1(2).
Problems for Section 5,2
1. Using the notation of Theorern 5.2.1, show that
"
2. Let X" ... ,Xn be ij,d. N(I',,,2), Show that for all n
Sup p(•• u) [IX
u
>
1, all €
>
°
/,1 > ,] ~ 1.
Hint: Let (J
)
00.
•
3, Establish (5.2.5). Hint: Iiln  q(p)1
> € =} IPn  pi > w (€).
l
J
4. Let (Ui , V;), 1 < i < n, be i.i.d.  PEP.
(a) Let y(P)
= PIU, > 0, V, > OJ. Show that if P = N(O, 0,1,1, p), then
p
~ sin21l' (Y(P)  ~).
Or of sphere centers such that K Now inf n {e: Ie . ". e E Rand (i) For some «eo) >0 Eo. e') .(ii) holds.lO)2)) . eol > A} > 0 for some A < 00. Show that the sample correlation coefficient continues to be a consistent estimate of p(P) but p is no longer consistent. i (e))} > <.l)2 _ + tTl = o:J. by the basic property of maximum contrast estimates. and < > 0 there is o{0) > 0 such that e. e) .Section 5.6 Problems and Complements 347 (b) Deduce that if P is the bivariate normal distribution.2 rII.o)} =0 where S( 0) is the 0 ball about Therefore. t e.\} c U s(ejJ(e j=l T J) {~ t n . Show that the maximum contrast estimate 8 is consistent. eo)} : e E K n {I : 1 IB .p(X. (1 (J. inf{p(X. sup{lp(X. (a) Show that condition (5. Hint: From continuity of p.Xn are i. inf{p(X. (c) Suppose p{P) is defined generally as Covp (U. then is a consistent estimate of p.. Hint: sup 1. Hint: K can be taken as [A. {p(X" 0) . where A is an arbitrary positive and finite constant. n Lt= ".14)(i) add (ii) suffice for consistency. By compactness there is a finite number 8 1 . 5.e)1 : e' E 5'(e.i. N (/J.eol > . 7. Suppose Xl. eo) : Ie  : Ie .8) fails even in this simplest case in which X ~ /J is clear.~ .(eo)} < 00. (Wald) Suppose e ~ p( X. Eo.14)(i). (i).. 05) where ao is known. VarpUVarpV > O}. e)1 (ii) Eo.eol > .. e) is continuous. eo) : e' E S(O. . .2. .p(Xi . e') .d. .e')p(X.n 1 (XiJ.e'l < .sup{lp(X. 5_0  lim Eo. 6. (Ii) Show that condition (5.lJ. A]. and the dominated convergence theorem.p(X. V)j /VarpU Varp V for PEP ~ {P: EpU' + EpV' < 00.p(X. . Prove that (5.2.2.\} . for each e eo.•.
) 2] . then by Jensen's inequality.[. n.i. Establish (5.. L:!Xi . Hint: Taylor expand and note that if i . 9.O'l n i=l p(X"Oo)}: B' E S(Bj. . The condition of Problem 7(ii) can also fail. Indicate how the conditions of Problem 7 have to be changed to ensure uniform consistencyon K. (72)... Let X~ be i. < > OJ. J (iii) Condition on IXi  X."d' k=l d < mm+1 L d ElY.3 I.. 1/(J. P > 1.l m < Cm k=I n.1.. .xW) < I'lj· 3. and if CI. For r fixed apply the law of large numbers.3. .3) for j odd as follows: .d.9) in the exponential model of Example 5.1'. 2. + id = m E II (Y..O(BJll}.X~ are i. li . 8.k / 2 . Then EIX . • • (i) Suppose Xf. . for some constants M j .Xn but independent of them. i .)I' < MjE [L:~ 1 (Xi . . en are constants.3.. Establish (5. < Mjn~ E (.I'lj < EIX . Show that the log likelihood tends to 00 as a + 0 and the condition fails. ~ . I 10. (ii) If ti are ij.. . .. Compact sets J( can be taken of the form {II'I < A. .L.i.348 Asymptotic Approximations Chapter 5 > min l<J$r {~tinf{p(Xi. and let X' = n1EX. .X. . = 1. 1 + . < < 17 < 1/<.d.)2]' < t Mjn~ E [.3.1.11).3.. Problems for Section 5. in (i) and apply (ii) to get (iv) E IL:~ 1 (Xi  x. Hint: See part (a) of the proof of Lemma 5.2. and take the values ±1 with probability ~. . II I . Extend the result of Problem 7 to the case () E RP. Establish Theorem 5.. . with the same distribution as Xl. 4. ..3.X'i j .. L:~ 1(Xi ..d. .X. i . Establish (5.3.
6. (d) Let Ck.\k for some . .d.i.I. then (sVaf)/(s~/a~) has an . 1 < j < m. . i.Xn1 bei..28. P(sll s~ < ~ 1~ a as k ~ 00.m 00..).andsupposetheX'sandY's are independent..Fk. 2 = Var[ ( ) /0'1 ]2 . j = 1. if a < EXr < 00..\ > 0 and = 1 + JI«k+m) ZIa. .. ~ iLl' I IXli > liLl}P(lXd > 11.l(e) Now suppose that F and G are not necessarily nonnal but that and that 0 < Var( Xn < Ck. (b) Show that when F and G are nonnal as in part (a). .pli ~ E{IX.6 Problems and Complements 349 Suppose ad > 0. Show that ~ sup{ IE( X. . PH(st! s~ < Ck. .a~d < [max(al"" . (a) Show that if F and G are N(iL"af) and N(iL2. Show that under the assumptions of part (c). 0'1 = Var(Xd· Xl /LI Ck. I < liLl}P(IXd < liLl)· 7.a as k t 00. Show that if EIXlii < 00.. km K. .i.m distribution with k = nl . . 1)'2:7' .d. Establish 5.(X. then EIX..m) Then. ~ xl". ~ iLli I IX. n} EIX. 8. X. under H : Var(Xtl ~ Var(Y. R valued with EX 1 = O. s~ ~ (n2 . G./LI = E(Xt}.aD.1) +E{IX. ~ iLli < 2 i EIX. )=1 5.(l'i .i.Section 5. LetX1"". Let XI.m with K.c> . j > 2.) I : i" ."" X n be i.ad)]m < m L aj j=1 m <mmILaj. where s1 = (n.r'j". L~=1 i j = m then m a~l.d..I) Hint: By the iterated expectation theorem EIX.m be Ck. respectively. replaced by its method of moments estimate.3.Tn) + 1 . theu the LR test of H : or = a~ versus K : af =I a~ is based on the statistic s1 / s~. Show that if m = .1 and 'In = n2 .1)'2:7' ... FandYi""'Yn2 bei.
. Without loss of generality. from {t I. jn((C . if ~ then ..+. In Example 5. JLl = JL2 = 0. .XC) wHere XC = N~n Et n+l Xi. i (b) Suppose Po is such that X I = bUi+t:i. Wnte (l 1pP .xd. . Po.XN)} we are interested in is itself a sample from a superpopulation that is known up to parameters. . be qk. Show that under the assumptions of part (e).2< 7'. . "i. Consider as estimates e = ') (I X = x..2" ( 11. .i. () E such that T i = ti where ti = (ui.4. then jn(r . .. Hint: Use the centra] limit theorem and Slutsky's theorem.I' ill "rI' and "2 = Var[( X. err eri Show that 4log (~+~) is the variance stabilizing transfonnation for the correlation 1 Ip coefficient in Example 5.1. ~ ] .m (0 Let 9. (a) If 1'1 = 1'2 ~ 0.. X) U t" (ii) X R = bopt(U ~  iL) as in Example 3.. • . i = 1.p). In particular. = (1. (iJl. Var(€.t .m) + 1 .a: as k + 00.. .iL) • . . . 4p' (I . show that i' ~ .3.1 EX.x) ~ N(O.1 that we use Til.xd. ' . .+xn n when T.A») where 7 2 = Var(XIl. if 0 < EX~ < 00 and 0 < Eyl8 < 00. and if EX. (UN.. as N 2 t Show that. H En. . t N }. i = I. (b) Use the delta method for the multivariate case and note (b opt  b)(U .a as k ..1'2)1"2)' such that PH(SI!S~ < Qk.. 0 < .Ii > 0. .6. In survey sampling the modelbased approach postulates that the population {Xl. Show that jn(XRx) ~N(0. = = 1.Il2.6. .. (b) If (X.".. jn(r . • op(n').I))T has the same asymptotic distribution as n~ [n. . TN i.. "1 = "2 = I. iik." . Y) ~ N(I'I. to estimate Ii 1 < j < n.1JT. (I _ P')').\ 00. .I. suppose in the context of Example 3. .l EY? . N where the t:i are i. + I+p" ' ) I I I II.00. Instead assume that 0 < Var(Y?) < x. Without loss of generality. (iJi . . p).3.p.l EXiYi . • 10.d.XN} or {(uI.p) ~ N(O. suppose i j = j.. Et:i = 0. then jn(r' .d.m (depending on ~'l ~ Var[ (X I . (a) jn(X .1 · /. 7 (1 . there exists T I ...I).p) ~ N(O.) =a 2 < 00 and Var(U) Hint: (a) X . n. 1).350 Asymptotic Approximations Chapter 5 (e) Next drop the assumption that G E g. if p i' 0.p')') and. (cj Show that if p = 0.".i. Under the assumptions of part (c)." •. then P(sUs~ < qk. < 00 (in the supennodel). . .4.p') ~ N(O. Tin' which we have sampled at random ~ E~ 1 Xi.~) (X .1. . that is.\ < 1.. ._ J .. = ("t . use a normal approximation to tind an approximate critical value qk. ! .(I_A)a2). . n.m with KI and K2 replaced by their method of moment estimates.lIl) + 1 .N.
Normalizing Transformation for the Poisson Distribution. X~.. 1931) is found to be excellent Use (5.. (a) Show that the only transformations h that make E[h(X) .99.i'b)(Yc .. X n are independent. (a) Use this fact and Problem 5.y'ri has approximately aN (0.14). X..3' Justify fonnally j1" variance (72.10. is known as Fisher's approximation. E(WV) < 00. EU 13.s. .. (h) Use (a) to justify the approximation 17. 14. !) distribution. Suppose XI..3. Suppose X 1l .E{h(X))1' = 0 to terms up to order l/n' for all A > 0 are of the form h{t) ~ ct'!3 + d. . (a) Show that if n is large. . W). X n is a sample from a population with mean third central moment j1.3..90.3. = 0. each with HardyWeinberg frequency function f given by .3. (c) Compare the approximation of (b) with the central limit approximation P[Sn < xl = 1'((x . X = XO. It can be shown (under suitable conditions) that the nonnal approximation to the distribution of h( X) improves as the coefficient of skewness 'Y1 n of heX) diminishes.3 < 00.13(c). and Hint: Use (5. Here x q denotes the qth quantile of the distribution. Let Sn have a X~ distribution.Section 5. (b) Let Sn . (h) Deduce fonnula (5.25.: .3..12).i'c) I < Mn'. . Hint: If U is independent of (V. Suppose X I I • • • l X n is a sample from a peA) distrihution. (a) Suppose that Ely. n = 5. then E(UVW) = O. 15.6 Problems and Complements 351 12. This < xl (h) From (a) deduce the approximation P[Sn '" 1>( v'2X  v'2rl).6) to explain why.i'a)(Yb ... .n)/v'2rl) and the exact values of PISn < xl from the X' table for x = XO.14 to explain the numerical results of Problem 5. Show that IE(Y.... 16. The following approximation to the distribution of Sn (due to Wilson and Hilferty.
Let then have a beta distribution with parameters m and n.. h 2 (x. (1'1. ° . Show that ifm and n are both tending to oc in such a way that m/(m + n) > a. 172 + [h 2(1'l. + (mX/nY)] where Xl. . X n be the indicators of n binomial trials with probability of success B. . 1'2)(Y  + O( n ').. Var(Y) = C7~. E(Y) = 1'2.h(l'l.. Justify fonnally the following expressions for the moments of h(X. Yn) is a sample from a bivariate population with E(X) ~ 1'1. Bm. Var Bmnm + n +Rmn = 'm+n ' ' where Rm.I'll + h2(1'1.. Var(X) = 171. Y) where (X" Yi). is given by h(t) = (2/1r) sin' (Yt). Variance Stabilizing Transfo111U1tion for the Binomial Distribution. i 21. .n P .a) E(Bmn ) = . Show that the only variance stabilizing transformation h such that h(O) = 0.a) Hmt: Use Bm..n  m/(m + n») < x] 1 > 'li(x). 19. Let X I. .5 that under the conditions of the previous problem.a tends to zero at the rate I/(m + n)2. (Xn .• • • ~{[hl (1". . where I' = (a) Find an approximation to P[X < E(X. (b) Find an approximation to P[JX' < t] in terms of 0 and t.2.. 1'2) = h.n tends to zero at the rate I/(m + n)2. V)) '" . • j b _ . Cov(X. Show directly using Problem B. 0< a < I.1') + X 2 .n = (mX/nY)i1 independent standard eXJX>nentials. 1'2)(X . v'a(l. and h'(t) > for all t.1'2)]2 177 n +2h. Y" . then I m a(l. if m/(m + n) . Yn are < < . 20. (e) What is the approximate distribution of Vr'( X . [vm +  n (Bm. which are integers.)? 18. Y) ~ p!7IC72. (a) (b) Var(h(X. (1'1. 1'2)pC7.y).(x. where I I h. 101 02 20(10) I f (10)2 2 tJ in terms of fJ and t.Xm . 1'2)h2(1'1. 1'2)]2C7n + O(n 2) i . h(l) = I. ..y) = a axh(x. .y)...y) = a ayh(x. I ~ 352 x Asymptotic Approximations Chapter 5 where °< e < f(x) 1. 1'2) Hint: h(X. Y) .
l and variance (J2 < Suppose h has a second derivative h<2) continuous at IJ.2) using the delta method.X) in tenns of the distribution function when J.2)/24n2 Hint: Therefore..l = ~.) + ": + Rn where IRnl < M 1(J1. Suppose IM41 (x)1 < M for all x and some constant M and suppose that J.2V with V "' Give an approximation to the distribution of X (1 . (a) Show that y'n[h(X) .J1. Show that Eh(X) 3 h(J1. XI.6 Problems and Complements 353 22.2V where V ~ 00. Compare your answer to the answer in part (a).l4 is finite.2.4 to give a directjustification of where R n / yin 0 as in n Recall Stirling's approximation: + + 00.. k > I.). Let Xl>"" X n be a sample Irom a population with and let T = X2 be an estimate of 112 .h(J1. Let Xl. 24. ° while n[h(X h(J1. (e) Fiud the limiting laws of y'n(X . Let Sn "' X~. X n is a sample from a population and that h is a realvalued function 01 X whose derivatives of order k are denoted by h(k)..l) = o.)] ~ as ~h(2)(J1.3 1/6n2 + M(J1. xi 25.)] !:.)1J1. (a) When J1. Usc Stirling's approximation and Problem 8. < t) = p(Vi < (b) When J1. = ~. find the asymptotic distribution of y'n(T. ()'2 = Var(X) < 00. . find the asymptotic distrintion of nT nsing P(nT y'nX < Vi). 1 X n be a sample from a population with mean J. ..(1. n[X(I .4 + 3o.) 23.)2 and n(X .) + ~h(2)(J1..Section 5.J1. _.X2 . and that h(1)(J.J1.J1. . J1. = 0. xi.J1.)] is asymptotically distributed (b) Use part (a) to show that when J1. Suppose that Xl. (It may be shown but is not required that [y'nR" I is bounded. ..)2. = E(X) "f 0.
\. . 1 Xn.\ul + (1 ~ or al .9. We want to obtain confidence intervals on J1..3).354 26.3) has asymptotic probability of coverage < 1 .) of Example 4. (d) Make a comparison of the asymptotic length of (4. .. We want to study the behavior of the twosample pivot T(Li. Let n _ 00 and (a) Show that P[T(t. . .9.Jli = ~. 29.(~':~fJ').t. and SD are as defined in Section 4. 'I :I ! t.I I (a) Show that T has asymptotically a standard Donnal distribution as and I I . Show that if Xl" .33) and Theorem 5. Viillnl ~ 2vul + a~z(l . 28. Asymptotic Approximations Chapter 5 then ~ £ c: 2 yn(X . the intervals (4. " Yn as separate samples and use the twosample t intervals (4. .\ < 1..\a'4)I'). cyi . .2a 4 ). al = Var(XIl. n2 + 00. a~. Suppose Xl. .3 independent samples with III = E(XIl.. 27. Assume that the observations have a common N(jtl. N(Pl (}2). XII are ij. I i t I . . .2 . so that ntln ~ .}.9. I . n2 + 00. < tJ ~ <P(t[(.t. (c) Apply Slutsky's theorem.(j ' a ) N(O. What happens? Analysis for fixed n is difficult because T(~) no longer has a '12n2 distribution.3. p) distribution. Hint: Use (5.d.9. (Xnl Yn ) are n sets of control and treatment responses in a matched pair experiment. . . b l .4.Xi we treat Xl.JTi times the length of the onesample t interval based on the differences. . 1 X n1 and Y1 .~a) > 2( Val + a'4  ~ a) where the righthand side is the limit of .) (b) Deduce that if. . Suppose (Xl.\ probability of coverage.\)a'4)/(1 . that E(Xt) < 00 and E(Y. .9.') < 00. (c) Show that if IInl is the length of the interval In.9. Suppose nl + 00 .\)al + . .9. if n}.3. Yd.. E In] > 1 . .3) and the intervals based on the pivot ID . then limn PIt.. and a~ = Var(Y1 ).4.. 0 < .4. (a) Show that P[T(t.3. Y1 . Let T = (D .9.. /"' = E(YIl.a.1/ S D where D and S D are as in Section 4. Hint: (a).o. Eo) where Eo = diag(a'.t2.) < (b) Deduce that if p tl ~ <P (t [1.•.3). the interval (4. 2pawz)z(1  > 0 and In is given by (4.9.a./".3) have correct asymptotic (c) Show that if a~ > al and . Suppose that instead of using the onesample t intervals based on the differences Vi . whereas the situation is reversed if the sample size inequalities and variance inequalities agree.\ > 1 .)/SD where D. Yn2 are as in Section 4. = = az. . .\..
3 using the Welch test based on T rather than the twosample t test based on Sn.z are Iii = k. Let X ij (i = I..X)" Then by Theorem B. vectors and EIY 1 1k < 00. T and where X is unifonn. 30. then lim infn Var(Tn ) > Var(T). EIY. 33.1)1 L. (a) SupposeE(X 4 ) < 00. I is the Euclidean norm.z = Var(X)..4.Z). ().1)." .. Show that if 0 < EX 8 < 00. .Section 5.I)sz/(). (d) Find or write a computer program that carries out the Welch test. has asymptotic level a. Generalize Lemma 5.a)) and evaluate the approximations when F is 7. x = (XI.~ t (Xi . .3._I' Find approximations to P(Yn and P{Vn < XnI (I . Y nERd are Li.3..3.d. X.4. Hint: If Ixll ~ L. 32.d. Show that k ~ ex:: as 111 ~ 00 and 112 t 00. and let Va ~ (n .£.£. then P(Vn < va<) t a as 11 t 00. .9 and 4. k) be independent with Xi} ~ N(l'i. (c) Let R be the method of moment estimaie of K. I< = Var[(X 1')/()'jZ. U( 1.i.I)z(a).1 L j=I k X ij and iT z ~ (kp)I L L(Xij i=l j=l p k Iii)" .z has a X~I distribution when F is theN(fJl (72) distribution. Let XI. It may be shown that if Tn is any sequence of random variables such that Tn if the variances ofT and Tn exist. .8Z = (n . Vn = (n . then there exist universal constants 0 < Cd < Cd < 00 Such that cdlxl1 < Ixl < Cdlxh· 31.a).:~l IXjl.. Hint: See Problems B. where tk(1. X n be i.I k and k only.16. . Let . but Var(Tn ) + 00. Show that as n t 00. . (a) Show that the MLEs of I'i and ().6 Problems and Complements 355 (b) Let k be the Welch degrees of freedom defined in Section 4.9. as X ~ F and let I' = E(X). (c) Show using parts (a) and (b) that the tests that reject H : fJI = 112 in favor of K: 112 > III when T > tk(l .1. Tn . .p... ().3. < XnI (a)) (b) Let Xn_1 (a) be the ath quantile of X. then for all integers k: where C depends on d.3 by showing that if Y 1. . . j = I.a) is the critical value using the Welch approximation.. where I. Carry out a Monte Carlo study such as the one that led to Figure 5. Plot your results.I) + y'i«n ..xdf and Ixl is Euclidean distance.
.(0) Varp!/!(X. (e) Suppose lbat the d. 00 (iii) I!/!I(x) < M < for all x.... and (iii) imply that O(P) defined (not uniquely) by Ep.p( 00) as 0 ~ 00.. . (c) Assume the conditions in (a) and (b)..p(x) (b) Suppose lbat for all PEP.. Deduce that the sample median is a consistent estimate of the population median if lbe latter is unique..n(X .0)..n for t E R.Xn be i. aliI/ > O(P) is finite.0) !:..On) .)). Show that _ £ ( . N(O. . (a) Show that (i). (ii). N(O) = Cov(. . 1) for every sequence {On} wilb On 1 n = 0 + t/. (Use ..p(X. Suppose !/! : R~ R (i) is monotone nondecreasing (ii) !/!(oo) < 0 < !/!(oo) ..p(X.. (k . where P is the empirical distribution of Xl.0) ~ N Hint: P( . N(O. ...) Hint: Show that Ep1/J(X1  8) is nonincreasing in B. 1 X n. • 1 = sgn(x). Let i !' X denote the sample median.p(X. I i . (c) Give a consistent estimate of (]'2.1)cr'/k.0)..n(On . .p(X. I .d. Use the bounded convergence lbeorem applied to . 1/4/'(0)). Ep. .p(X. random variables distributed according to PEP. then 8' !. Show that On is consistent for B(P) over P. Assmne lbat '\'(0) < 0 exists and lbat .p(Xi . That is the MLE jl' is not consistent (Neyman and Scott. .f.4 1.O(P)) > 0 > Ep!/!(X.n7(0) ~[.L~. . F(x) of X. .nCOn . I ! 1 i . ... then < t) = P(O < On) = P (. ... . . [N(O))' . 1/).. I(X. Show lbat if I(x) ~ (d) Assume part (c) and A6..i. O(P) is lbe unique solution of Ep. .. . . . \ . ! Problems for Section 5. (b) Show that if k is fixed and p ~ 00. Let On = B(P). _ .On) < 0).0) and 7'(0) .0))  7'(0)) 0. .p(X. I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _1 .\(On)] !:. Let Xl. .356 Asymptotic Approximations Chapter 5 . n 1 F' (x) exists. is continuous and lbat 1(0) = F'(O) exists.0) = O.. .0) !.. Show that... Set (.. under the conditions of (c). 1948)..
Condition A6' pennits interchange of the order of integration by Fubini's theorem (Billingsley.(x.05. 0 «< 0.. 2.0)p(x.i. ( 2 ). lJ :o(1J. evaluate the efficiency for € = .(x) + <'P.B) < xl = 1.O).7. then 'ce(n(B . J = 1. If (] = I. 0» density. (a) Show that g~ (x. denotes the N(O. Hint: teJ1J. X) as defined in (f). then ep(X. 'T = 4.4. Show that assumption A4' of this section coupled with AGA3 implies assumption A4.jii(Oj . (h) Suppose that Xl has the Cauchy density fix) = 1/1r(1 + x 2). .  = a?/o.0.Xn ) is the MLE.    .(x..(x.b)dl"(x) J 1J.(x. Show that   ep(X. oj).c)'P.B)) ~ [(I/O). X) = 1r/2.b)p(x. . x E R. the asymptotic 2 .0) ~ N(O.O)p(x.exp(x/O). 3.(1:e )n ~ 1.15 and 0.~ for 0 > x and is undefined for 0 g~(X.(x) = (I . ~"'. 82 ) Pis N(I". .y.d.Section 5. Show that if (g) Suppose Xl has the gross error density f.O)dl"(x)) = J 1J. 0) = .'" .. Let XI..a)dl"(x).9)dl"(x) = J te(1J.6 Problems and Complements 357 (0 For two estImates 0. . Conclude that (b) Show that if 0 = max(XI. and O WIth .O) is defined with Pe probability I but < x. X n be i.a)p(x. 0 > 0. U(O.. Find the efficiency ep(X. Thus."' c . " relative efficiency of 8 1 with respect to 82 is defined as ep(8 1 .' .0...5  and '1'.(x).(x .2.0) (see Section 3. This is compatible with I(B) = 00. Show that A6' implies A6.. Hint: Apply A4 and the dominated convergence theorem B.O))dl"(x) ifforalloo < a < b< 00.O)p(x.I / 2 .20 and note that X is more efficient than X for these gross error cases. not only does asymptotic nonnality not hold but 8 converges B faster than at rate n.(x.X) =0.5) where f. not O! Hint: Peln(B .10. 4. 1979) which you may assume.
) P"o (X .i (x.O) 1(0) and Hint: () _ g.4. 1(80 ).In ~ "( n & &8 10gp(Xi. Suppose A4'.18 . A2.Oo) p(Xi .. I . &l/&O so that E.(X.4. I . 80 +. (8) Show that in Theorem 5..i.5 continue to hold if .4.i'''i~l p(Xi.g. .358 5.. ) n en] . 8 ) i . Hint: (b) Expand as in (a) but aroond 80 (d) Show that ?'o+".0'1 < En}'!' 0 . i I £"+7.O') . I (c) Prove (5.in) . and A6 hold for.7. 8) is continuous and I if En sup {:.~ (X. . 0 ~N ~ (. ~log (b) Show that n p(Xi .. 0 for any sequence {en} by using (b) and Polyii's theorem (A. 8)p(x."('1(8) .00) .·_t1og vn n j = p( x"'o+. I 1 . ) ! .22).80 + p(X .i(X. .80 p(X" 80) +.4) to g. under F oo ' and conclude that . .In) + .5 Asymptotic Approximations Chapter 5 ~lOg n p(Xi.in) 1 is replaced by the likelihood ratio statistic 7...p ~ 1(0) < 00. Show that the conclusions of PrQple~ 5. 0).< ~log n p(X.80 + p(Xi .in) ~ . j.in.14.50). I I j 6.[ L:. i. Show that 0 ~ 1(0) is continuous. Apply the dominated convergence theorem (B. 8 ) 0 .. O."log . ~ L.
B' _n + op (n 1/') 13. (Ii) Suppose the conditions of Theorem 5.57) can be strengthened to: For each B E e. for all f n2 nI n2 . (Ii) Compare the bounds in Ca) with the bound (5. Hint.Q" is asymptotically at least as good as B if. = X.4. ~ 1 L i=I n 8'[ ~ 8B' (Xi. (a) Establish (5.4.4. (Ii) Deduce that g.4. ~. Let B be two asymptotic l. Let B~ be as in (5. Hint: Use Problems 5. which agrees with (4.61).6. setting a lower confidence bound for binomial fJ. 11. 9. all the 8~i are at least as good as any competitors. 10. f is a is an asymptotic lower confidence bound for fJ. Hint: Use Problem 5.4.4.14.4. B).2.2.4. at all e and (A4').4. A.54) and (5. if X = 0 or 1. which is just (4.5.59) coincide. Then (5.59).59). We say that B >0 nI Show that 8~I and. Let [~11. Establish (5.4.6 and Slutsky's theorem. gives B Compare with the exact bound of Example 4.:vell .58). (a) Show that under assumptions (AO)(A6) for all Band (A4').7).61) for X ~ 0 and L 12.6 Problems and Complements 359 8. B' nj for j = 1.3.7. Consider Example 4. (a) Show that under assumptions CAO)(A6) for 1/! consistent estimate of I (fJ).5 hold. hence.5 and 5.3).4. (c) Show that if Pe is a one parameter exponential family the bound of (b) and (5.2.4. and give the behavior of (5.Section 5. Compare Theorem 4.56).4. . the bound (5. (a) Show that.a. there is a neighborhood V(Bo) of B o o such that limn sup{Po[B~ < BJ : B E V(Bo)} ~ 1 .4.4.B lower confidence bounds.4.4.
1 15.2.\ < Ao.. .2.. N(I'. X n be i.01 .riX) = T(l + nr 2)1/2rp (I+~~.=2[1  <1>( v'nIXI)] has a I U(O. where .)) and that when I' = LJ.d. I' has aN(O.•.5 ! 1.\ > o...)l/2 Hint: Use Examples 3. .LJ. Hint: By (3. (a) Show that the test that rejects H for large values of v'n(X .4.5. Let X 1> .i. 1) distribution. : A < Ao (a) Find the NeymanPearson (NP) test for testing H : . (b) Suppose that I' ). 7f(1' 7" 0) = 1 . A exp .. Show that (J/(J l'. .. (c) Suppose that I' = 8 > O.55). By Problem 4. Consider the Bayes test when J1. phas a U(O.1.11). 1) distribution.] versus K .4. Suppose that Xl. 1. is a given number. I . .L and A. Now use the central limit theorem.I). T') distribution.1.1 and 3. Consider testing H : j.i.i. = O. ! I • = '\0 versus K : . That is..\ = '\0 versus K • (b) Show that the NP test is UMP for testing H : A > AO versus K : A < AO. X > 0.21J2 2x A} .LJ. Show that (J l'.. 00. (d) Find the Wald test for testing H : A = AO versus K : A < AO. 1.360 1 Asymptotic Approximations Chapter 5 14.. 2..'" XI! are ij. where Jl is known. 0 given Xl. . X n i. 1 . variables with mean zero and variance 1(0). 1). N(Jt. ! (a) Show that the posterior probability of (OJ is where m n (. the test statistic is a sum of i.d.I ' '1 against H as measured by the smallness of the pvalue is much greater than the evidence measured by the smallness of the posterior probability of the hypothesis (Lindley's "paradox"). the pvalue fJ . inverse Gaussian with parameters J. =f:.10) and (3. (c) Find the approximate critical value of the NeymanPearson test using a Donnal approximation. is distributed according to 7r such that 1> 7f({O}) and given I' ~ A > 0. each Xi has density ( 27rx3 A ) 1/2 {AX + J. That is. Jl > OJ .\ l . if H is false.. Problems for Section 5.1. Establish (5. " " I I .t = 0 versus K : J. J1.d.) has pvalue p = <1>(v'n(X .2. LJ. Consider the problem of testing H : I' E [0.4. > ~. (e) Find the Rao score test for testing H : .d.A 7" 0.6. the evidence I .
s. Hint: In view of Theorem 5. [0.. yn(E(9 [ X) .sup{ :.2. J:J'). 1) prior.5.5.058 20 . Show that the posterior probability of H is where an = n/(n + I).1 I'> = 1. (Lindley's "paradox" of Problem 5. Hint: 1 n [PI .6 Problems and Complements 361 (b) Suppose that J.) sufficiently large.17).:.s. 4.5.8) ~ 0 a. ~ E.0 3.. Suppose that in addition to the conditions of Theorem 5..)}. 10 .034 . Apply the argnment used for Theorem 5.O'(t)).. 1) and p ~ L U(O..01 < 0 } continuThen 00. Establish (5. yn(anX   ~)/ va.5.Section 5. i ~. ifltl < ApplytheSLLN and 0 ous at 0 = O. ~l when ynX = n 1'>=0.14).. By (5.5.2 and the continuity of 7f( 0). for all M < 00..029 . 0' (i» n L 802 i=l < In n ~sup {82 8021(X i .2 it is equivalent to show that J 02 7f (0)dO < a.i It Iexp {iI(O)'.Oo»)+log7f(O+ J.(Xi. Hint: By (5.05.~~(:.046 .050 logdnqn(t) = ~ {I(O).} dt < .l has a N(O. Extablish (5.' " .) (d) Compute plimn~oo p/pfor fJ. all .5.I).13) and the SLLN.( Xt.for M(..i Itlqn(t)dt < J:J'). By Theorem 5. ~ L N(O.042 .054 50 . L M M tqn(t)dt ~ 0 a.1 is not in effect.5...2.4.052 100 .17).645 and p = 0.S. Fe. (e) Show that when Jl ~ ~. oyn. (e) Verify the following table giving posterior probabilities of 1.0'): iO' .O') : 100'1 < o} 5.(Xi.I(Xi.1. P.
I : ItI < M} O. Bernstein and R. ~ II• '.5 " I i . ~ 0 a. 5. Finally. A. . . .5. Notes for Section 5.2 we replace the assumptions A4(a. J I .8) and (5.I(Xi}»} 7r(t)dt.s(8)vn tqn(t)dt ~ roc vIn(t  0) exp {i=(l(Xi' t) . (O)<p(tI. .s.16) noting that vlnen.s.7 NOTES Notes for Section 5.4 (1) This result was first stated by R.5. convergence replaced by convergence in Po probability. 'i roc J.". (1) This famous result appears in Laplace's work and was rediscovered by S. . " . I: . Hint: (t : Jf(O)<p(tJf(O)) > c(d)} = [d. Fisher (1925). (a) Show that sup{lqn (t) .5.5. Show tbat (5. 7. i=l }()+O(fJ) I Apply (5. . jo I I . Suppose that in Theorem 5.. . for all 6. . .9) hold with a.Pn where Pn a or 1. 362 f Asymptotic Approximations Chapter 5 > O.!.d) for some c(d). A proof was given by Cramer (1946).5. .1.) and A5(a. (I) If the rightband side is negative for some x. . Fn(x) is taken to be O. to obtain = I we must have C n = C(ZI_ ~ [f(0)nt l / 2 )(1 + op(I)) by Theorem 5.1 . von Misessee Stigler (1986) and Le Cam and Yang (1990). (0» (b) Deduce (5. dn Finally. .f.s.. I i. ! ' (2) Computed by Winston Cbow. The sets en (c) {t . See Bhattacharya and Ranga Rao (1976) for further discussion.5. all d and c(d) / in d.) ~ 0 and I jtl7r(t)dt < co.S. • Notes for Section 5.I Notes for Section 5.3 ). • . . qn (t) > c} are monotone increasing in c.29).1 (I) The bound is actually known to be essentially attained for Xi = a with probability Pn and 1 with probability 1 . For n large these do not correspond to distributions one typically faces. L .) by A4 and A5.
NEYMAN. Prob. C. U. HANSCOMB.. p. Camb. E. Linear Statistical Inference and Its Applications. "Theory of statistical estimation. CASELLA. H. AND R. S.. R. 1938. Mathematical Analysis. New York: J. 58. R. T." Econometrica. Wiley & Sons. "Consistent estimates based on partially consistent observations. 684 (1931). L. SERFLING. S. J. 1996. 1986. Editors Cambridge: Cambridge University Press. Soc. 'The distribution of chi square... RAo. Statistical Decision Theory and Bayesian Analysis New York: SpringerVerlag. Monte Carlo Methods London: Methuen & Co. 40. E. H. Theory ofPoint EstimatiOn New York SpringerVerlag... 1946. L. AND G. 1958.S. Proc... Cambridge University Press. 17. MA: Harvard University press. "Probability inequalities for sums of bounded random variables. HOEFFDING. Tables of the Correlation Coefficient.. J. AND G. YANG.8 REFERENCES BERGER. RANGA RAO. A. 3rd ed. J. Vth Berkeley Symposium. AND E. 1964. R. HfLPERTY. Symp. The Behavior of the Maximum Likelihood Estimator Under NonStandard Conditions. P. AND D. reprinted in Biometrika Tables/or Statisticians (1966). 16. Acad.Section 5. Hartley and E. Amer.. L. I. Approximation Theorems of Mathematical Statistics New York: J. 1979.. C. Probability and Measure New York: Wiley." J. Statisticallnjerence and Scientific Method. 2nd ed. SCOTT. W. 1380 (1963). Theory o/Statistics New York: Springer. CRAMER. M. B.. N. FISHER. LEHMANN. 318324 (1953). HAMMERSLEY. 1987. Some Basic Concepts New York: Springer. O. A. M. S. J. Vol. H. Statist.. E. 22. 1998.. Nat. LEHMANN. P. BHATTACHARYA. F. AND M. 1999. 1985... Phil.. A CoUrse in Large Sample Theory New York: Chapman and Hall. 1973. Wiley & Sons. 1967." Proc. L. .8 References 363 5. FISHER. The History of Statistics: The Measurement of Uncertainty Before 1900 Cambridge. Pearson. Vth Berk. NJ: Princeton University Press. 132 (1948)... 700725 (1925). E. J.. Box. G. M. R. Mathematical Methods of Statistics Princeton. DAVID. I Berkeley.A. P. Sci. R. 1995. Math. 1976. W.. Asymptotics in Statistics.. RUDIN. FERGUSON. Assoc. 1980. STIGLER." Proc. CA: University of California Press. Normal Approximation and Asymptotic Expansions New York: Wiley. Vol. New York: McGraw Hill. LE CAM. WILSON. Statist. "Nonnormality and tests on variances:' Biometrika. 1990. L. HUBER. Elements of LargeSample Theory New York: SpringerVerlag. 3rd ed. BILLINGSLEY. SCHERVISCH.
I .I 1 .\ . . ' i ' . . . (I I i : " .. .
2.7.6.3. 2.1. There is. 365 . confidence regions. The approaches and techniques developed here will be successfully extended in our discussions of the delta method for functionvalued statistics. curve estimates. However. 1. we have not considered asymptotic inference.2. The inequalities ofVapnikChervonenkis.4.6. with the exception of Theorems 5. We shall show how the exact behavior of likelihood procedures in this model correspond to limiting behavior of such procedures in the unknown variance case and more generally in large samples from regular ddimensional parametric models and shall illustrate our results with a number of important examples. multiple regression models (Examples 1.3). and confidence regions in regular onedimensional parametric models for ddirnensional models {PO: 0 E 8}.2. are often both large and commensurate or nearly so.2.5.or nonpararnetric models.Chapter 6 INFERENCE IN THE MUlTIPARAMETER CASE 6. for instance. the number of parameters. and n. the modeling of whose stochastic structure involves complex models governed by several.3. [n this final chapter of Volume I we develop the analogues of the asymptotic analyses of the behaviors of estimates.1. This chapter is a leadin to the more advanced topics of Volume II in which we consider the construction and properties of procedures in non. often many. the number of observations. tests. and efficiency in semiparametric models.and semiparametric models.3. an important aspect of practical situations that is not touched by the approximation. testing. the properties of nonparametric MLEs.4. and prediction in such situations.3).1 INFERENCE FOR GAUSSIAN LINEAR MODElS • Most modern statistical questions iovol ve large data sets. 2. We begin our study with a thorough analysis of the Gaussian linear model with known variance in which exact calculations are possible.8 C R d We have presented several such models already. the fact that d.1) and more generally have studied the theory of multiparameter exponential families (Sections 1. in which we looked at asymptotic theory for the MLE in multiparameter exponential families. Talagrand type and the modern empirical process theory needed to deal with such questions will also appear in the later chapters of Volume II. the bootstrap.2 and 5. the multinomial (Examples 1. 2. real parameters and frequently even more semi. however.
The normal linear regression model is }i = /31 + L j=2 p Zij/3j + €i.1.1.Zip. Regression...1.1.. Example 6. 366 Inference in the Multiparameter Case Chapter 6 ! .1. say the ith case. .. when there is no ambiguity. . . 0"2).1. • . . . . .2(4) in this framework. . let expressions such as (J refer to both column and row vectors. Notational Convention: In this chapter we will.d.l p.Herep= 1andZnxl = (l. In vector and matrix notation.1) j .1..3) whereZi = (Zil. e ~N(O.3): l I • Example 6.2) Ii .n (6. . the Zij are called the design values. The model is Yi j = /31 + €i. are U.'" .a2). .Zip' We are interested in relating the mean of the response to the covariate values.d. i = 1.1 The Classical Gaussian linear Model l' f Many of the examples considered in the earlier chapters fit the framework in which the ith measurement Yi among n independent observations has a distribution that depends on known constants Zil. 6.3)..N(O.. . i = 1..1.1. I I. We have n independent measurements Y1 ..1. In the classical Gaussian (nannal) linear model this dependence takes the fonn p 1 Yi where EI. . It turn~ out that these techniques are sensible and useful outside the narrow framework of model (6.1. n (6.1.. Z = (Zij)nxp... '.j . and for each case. we have a response Yi and a set of p . I The regression framewor~ of Examples 1. 1 en = LZij{3j j=1 +Ei. The OneSample Location Problem. 1 Yn from a population with mean /31 = E(Y).1. N(O. . . We consider experiments in which n cases are sampled from a population. i . andJ is then x nidentity matrix. In this section we will derive exact statistical procedures under the assumptions of the model (6. . and Z is called the design matrix.4 and 2.1..6 we will investigate the sensitivity of these procedures to the assumptions of the model. In Section 6.. " 'I and Y = Z(3 + e. Here Yi is called the response variable.zipf. • i ..5) .2. " n (6.1 covariate measurements denoted by Zi2 • .1. i = l.2J) (6. . These are among the most commonly used statistical techniques.. .l)T.. Here is Example 1.€nareij. we write (6.3).1 is also of the fonn (6.4) 0 where€I." .
~) random variables. The random design Gaussian linear regression model is given in Example 1. ..1. TWosample models apply when the design values represent a qualitative factor taking on only two values. . ..1.6) where Y kl is the response of the lth subject in the group obtaining the kth treatment.0: because then Ok represents the difference between the kth and average treatment . the design matrix has elements: 1 if L jl nk + 1<i < L nk k=1 j k=l ootherwise and z= o 0 ••• o o Ip where I j is a column vector of nj ones and the 0 in the "row" whose jth member is I j is a column vector of nj zeros. and the €kl are independent N(O.4.6) is an example of what is often called analysis ojvariance models.Section 6.1.1. In this case.1fwe set Zil = 1.1. and so on. we are interested in qualitative factors taking on several values. one from each population. where YI.3. we often have more than two competing drugs to compare.6) is often reparametrized by introducing ( l ~ pl I:~~1 (3k and Ok = 13k .. n. nl + .3.3 and Section 4. and so on. · · .. .5) is called the fixed design norrnallinear regression model.1. (6. 0 Example 6.2) and (6.Yn . Ynt +n2 to that getting the second.. Yn1 correspond to the group receiving the first treatment. To fix ideas suppose we are interested in comparing the performance of p > 2 treatments on a population and that we administer only one treatment to each subject and a sample of nk subjects get treatment k.. We treat the covariate values Zij as fixed (nonrandom). if no = 0. . Frequently. we arrive at the one~way layout or psample model.. 13k is the mean response to the kth treatment. /3p are called the regression coefficients. .Way Layout.. . The pSample Problem Or One. 1 < k < p. Generally..1. we want to do so for a variety of locations. Then for 1 < j < p. If the control and treatment responses are independent and nonnally distributed with the same variance a 2 .9. To see that this is a linear model we relabel the observations as Y1 .3) applies./. We can think of the fixed design model as a conditional version of the random design model with the inference developed for the conditional distribution of Y given a set of observed covariate values. (6. . The model (6. Yn1 + 1 . If we are comparing pollution levels. In Example I. The model (6.•.1 Inference for Gaussian Linear Models 367 where (31 is called the regression intercept and /32. + n p = n. then the notation (6. this terminology is commonly used when the design values are qualitative.3 we considered experiments involving the comparisons of two population means when we had available two independent samples."i = 1.
.. then r is the rank of Z and w has dimension r. Let T denote the number of linearly independent Cj. j = 1.. (3 E RP}.g. . . Recall that orthononnal means Vj = 0 fori f j andvTvi = 1. an orthononnal basis VI.. i=I (6. . . TJi = E(Ui) = V'[ J1.po In tenns of the new parameter {3* = (0'. The Canonical Fonn of the Gaussian Linear Model The linear model can be analyzed easily using some geometry.. j = I •..Y.a 2 ) is identifiable if and only ifr = p(Problem6.. Because dimw = r. ... .2).1. the linear Y = Z'(3' + E. V r span w. " ..3.. .6jCj are the columns of the design matrix.. . • I t Ew ¢:> t = L:(v[t)Vi i=l ¢:> vTt = 0. i5p )T. Cj = (Zlj.. •. Z) and Inx I is the vector with n ones. j = I. When VrVj = O. i . The parameter set for f3 is RP and the parameter set for It is W = {I' = Z(3.17). We assume that n > r. k model is Inference in the Multiparameter Case Chapter 6 1. •• Note that w is the linear space spanned by the columns Cj.r additional linear restrictions have been specified.. . V n for Rn such that VI.. . It follows that the parametrization ({3.7) !. i = T + 1. b I _ . We now introduce the canonical variables and means .1.p.. Note that Z· is of rank p and that {3* is not identifiable for {3* E RP+l. . of the design matrix. r i' . E ~N(O.! x (p+l) = (1. . This type oflinear model with the number of columns d of the design matrix larger than its rank r. I5 h . n. . • . Note that any t E Jl:l can be written . ... is common in analysis of variance models..• znj)T..368 effects. Even if {3 is not a parameter (is unidentifiable).. . n. and with the parameters identifiable only once d .(T2J) where Z. Ui = v. . the vector of means p of Y always is..P.. n. there exists (e. i = 1. vT n i I t and that T = L(vTt)v. It is given by 0 = (itl. by the GramSchmidt process) (see Section B. we call Vi and Vj orthogonal. However. . 1 JLn)T I' where the Cj = Z(3 = L j=l P . {3* is identifiable in the pdimensional linear subspace {(3' E RP+l : L:~~l dk = O} of RP+ 1 obtained by adding the linear restriction E~=l 15k = 0 forced by the definition of the 15k '5. .
'" ~ A 1'1. .1.. 7J = AIJ..1.8)(6..1 Inference for G::'::"::"::"::"::L::'::"e::'::'::M::o::d::e'::' '" '3::6::::. •. E w. . which is the guide to asymptotic inference in general..1.. where = r + 1. also UMVU for CY. Let A nxn be the orthogonal matrix with rows vi. it = E.10). 0. .1..Ur istheMLEof1]I. whereas (6. . u) based on U £('I.3. (6. In the canonical/orm ofthe Gaussian linear model with 0. • .)T is sufJicientfor'l. .1.• .9 N(rli.•• (ii) U 1 . Theorem 6.2 ) using the parametrization (11. .. r.1.).and 7J are equivalently related.2.2"log(21r<7 ) t=1 n 1 n _ _ ' " u. and .2 and E(Ui ) = vi J.L = 0 for i = r + 1. n. . (3.(Ui .10) It will be convenient to obtain our statistical procedures for the canonical variables U...1.1. 20. Note that Y~AIU..2 Estimation 0. .2 We first consider the known case. then the MLE of 0: Ci1]i is a = E~ I CiUi.. (iii) U i is the UMVU estimate of1]i. (6.. i (iv) = 1. and by Theorem B.2 known (i) T = (UI •. U1 . = L.Cr are constants.. (T2)T... . while (1]1. is Ui = making vTit. We start by considering the log likelihood £('1... = 1.2 L i=l + _ '" ryiui _ 02~ i=l 1 r r 2 (6.u) 1 ~ 2 n 2 . 1 If Cl. .9) Var(Y) ~ Var(U) = .2<7' L.'J nxn . Theorem 6. .2.. Un are independent nonnal with 0 variance 0.1.'Ii) . ais (v) The MLE of. Moreover.' based on Y using (6.8) So. 1 ViUi and Mi is UMVU for Mi.1]r. . IJ.. .Section 6.. observing U and Y is the same thing. .11) _ n log(21r"'). .. . 2 ' " 'Ii i=l ~2(T2 6. 1]r)T varies/reely over Rr. . and then translate them to procedures for . n.1. . i it and U equivalent. .. n. v~. n because p.. . The U1 are independent and Ui 11i = 0.. Proof. which are sufficient for (IJ.. U. Then we can write U = AY. 2 0. . i i = 1..".
.'7. (U I .6. i = 1.N(~. . and .4. r. r.1. "7r because.2.. (v) Follows from (iv).1 L~ r+l U.. ( 2 )T. By (6. . .. By Problem 3. . .. (ii) U1. by observation. the MLE of q(9) = I:~ I c.2 .ti· 1 .) (iii) By Theorem 3. is q(8) = L~ 1 <. this statistic is equivalent to T and (i) follows.6 to the canonical exponential family obtained from (6. " as a function of 0. (We could also apply Theorem 2. T r + 1 = L~ I and ()r+l = 1/20. (iii) 8 2 =(n .1]r. WI = Q is sufficient for 6 = Ct.1. Theorem 6. . . To this end. . I i > . . To show (ii).. apply Theorem 3. Proof.1...2 . Let Wi = vTu. (i) By observation. (iv) By the invariance of the MLE (Section 2. 0. {3.11) is an exponential family with sufficient statistic T. wecan assume without loss of generality that 2:::'=1 c.. j = 1. Proof.10.. recall that the maximum of (6. i > r+ 1.11) by setting T j = Uj .) = a.1.1.11).2. ~i = vi'TJ. there exists an orthonormal basis VI) . 1lr only through L~' 1 CUi .O)T E R n . and is UMVU for its expectation E(W.1. . To show (iv). .(J2J) by Theorem B. The distribution of W is an exponential family. . I.Ui · If all thee's are zero.• EUl U? Ul Projections We next express j1 in terms of Y. I . By GramSchmidt orthogonalization. Assume that at least one c is different from zero.1. . and give a geometric interpretation of ii.. (6. (6. 1 Ur .4 and Example 3. .r) 2:7 1 L~ r+l ul is an unbiased estimator of (J2..3. ..11) has fJi 1 2 = Ui .. we need to maximize n (log 27f(J2) 2 II ". 0  i Next we consider the case in which a 2 is unknown and assume n >r+ 1. = 1."7i)2 and is minimized by setting '(Ii = Ui. 1 U1" are the MLEs of 1}1. But because L~ 1 Ui2 L:~ I Ui2 + r+l Ul. . define the norm It I of a vector tERn by  ! It!' = I:~ .1. . .3 and Example (iii) is clear because 3.Ur'L~ 1 Ul)T is sufficient..3. where J is the n x n identity matrix.. 2:7 r+l un Tis sufficientfor (1]1. ... . then W . n.11) is a function of 1}1.3.4.2 (ii)..(v) are still valid. " V n of R n with VI = C = (c" .4. l Cr .. In the canonical Gaussian linear model with a 2 unknown.. Q is UMVU. (iv) The conclusions a/Theorem 6.'  .1).4. . 2 2 ()j = 1'0/0. (ii) The MLE ofa 2 is n. = 0. That is. obtain the MLE fJ of fl.4. (i) T = (Ul1" . The maximizer is easily seen to be n 1 L~ r+l (Problem 6.2. n " . 370 Inference in the Multiparameter Case Chapter 6 I iI . I i I ..2).1.. Ui is UMVU for E(U...) = '7" i = 1. 2(J i=r+l ~ U..
' and by Theorems 6.3.1...1 and Section 2.L = {s ERn: ST(Z/3) = Oforall/3 E RP}.1. by (6.Z/3I' : /3 E W}. In the Gaussian linear model (i) jl is the unique projection oiY on L<. and  Jii is the UMVU estimate of J.Ii = .n. equivalently. which implies ZT s = 0 for all s E w. .3 of. _. = Z/3 and (6.1 Inference for Gaussian linear Models 371 E Rn on w is the point Definition 6... i=l. any linear combination of U's is a UMVU estimate of its expectation.. ZT (Y .12) ii. .1. the MLE of {3 equal~ the least squares estimate (LSE) of {3 defined in Example 2. fj = (ZTZ)lZTji. To show (iv). spans w.1. To show f3 = (ZTZ)lZTy. .Z/3 I' . 0 ~ .ti.1.3 because Ii = l:~ 1 ViUi and Y .1.1.9). Thus.3 E RP..1.jil' /(n r) (6. note that 6... .13) (iv) lfp = r.ji) = 0 and the second equality in (6.. = Z T Z/3 and ZTji = ZTZfj and. any linear combination ofY's is also a linear combination of U's.. /3 = (ZTZ)l ZT.1.1. That is. (ii) and (iii) are also clear from Theorem r+l VjUj . The projection Yo = 7l"(Y I L.14) (v) f3j is the UMVU estimate of (3j.J) of a point y Yo =argmin{lyW: tEw}. note that the space w. It follows that /3T (ZT s) = 0 for all/3 E RP. ZTZ is nonsingular.3(iv). /3.4. We have Theorem 6. fj = arg min{Iy .Section 6. f3j and Jii are UMVU because. because Z has full rank.2(iv) and 6. <T) = .' and is given by ji (ii) jl is orthogonal to Y (iii) 8' ~ = zfj (6. then /3 is identifiable.1. the MLE = LSE of /3 is unique and given by (6. Proof..2<T' IY .1.L of vectors s orthogonal to w can be written as ~ 2:7 w.12) implies ZT.2.3 maximizes 1 log p(y.1..L... (i) is clear because Z.14) follows.1. ~ The maximum likelihood estimate .p. and /3 = (ZTZll ZT 1". j = 1.n log(21f<T . IY . ) 2 or.
= [y. In this method of "prediction" of Yi. . L7 1 . 1 · . Var(€) = (12(J .15) Next note that the residuals can be written as ~ I . ~ ~ ·.3) that if J = J nxn is the identity matrix.. By Theorem 1. 1 < i < n.' € = Y . In the Gaussian linear model (i) the fitted values Y = iJ. n}.H)). • ~ ~ Y=HY where i• 1 . In statistics it is also called the hat matrix because it "puts the hat on Y. We can now conclude the following. I.1. The residuals are the projection of Y on the orthocomplement of wand ·i . (iv) ifp = r. i = 1. There the points Pi = {31 + fhzi.Y = (J .2 with P = 2.. The goodness of the fit is measured by the residual sum of squares (RSS) IY .16) J I I CoroUary 6. the best MSPE predictor of Y if f3 is known as well as z is E(Y) = ZT{3 and its best (UMVU) estimate not knowing f3 is Y = ZT {j. H I I It follows from this and (B. Y = il.({3t + {3. (12 (ZTz) 1 ).)I are the vertical distances from the points to the fitted line. =H .1. I . i = 1.3) is to be taken.1. . Suppose we are given a value of the covariate z at which a value Y following the linear model (6.H). and the residual € are independent. That is.2 illustrates this tenninology in the context of Example 6. • . (12(J . n lie on the regression line fitted to the data {('" y. Note that by (6. Example 2.1.4. it is commOn to write Yi for 'j1i.ill 2 = 1 q. ." As a projection matrix H is necessarily symmetric and idempotent.1. 1 ~ ~ ! I ~ ~ ~ i .). ..1. ...5. w~ obtain Pi = Zi(3. the residuals €. (6. . moreover. (ii) (iii) y ~ N(J. .H)Y.1.ii is called the residual from this fit. the ith component of the fitted value iJ. i .14) and the normal equations (Z T Z)f3 = Z~Y.'.1. H T 2 =H.. (j ~ N(f3.12) and (6.1. € ~ N(o.14).1. 1 < i < n.2.I" (12H). .1 we give an alternative derivation of (6.372 Inference in the Multiparameter Case Chapter 6 Note that in Example 2. The estimate fi = Z{3 of It is called thefitted value and € = y . Taking z = Zi. : H = Z(Z T Z)IZT The matrix H is the projection matrix mapping R n into w. whenp = r. and I. see also Section RIO. then (6.1.
j = 1..1.. If {Cijk _..Y are given in Corollary 6.p. .3. . and € are nonnally distributed with Y and € independent. . Thus... .2. in the Gaussian model. Example 6.8. Var(. 1=1 At this point we introduce an important notational convention in statistics. i = 1.5. .1.ii = Y..p. Moreover. (not Y. is th~ UMVU estimate of the average effect of all a 1 p = Yk .i respectively. 0 ~ ~ ~ ~ ~ ~ ~ Example 6. In the Gaussian case. If the design matrix Z has rank P. which we have seen before in the unbiased estimator 8 2 of (72 is L:~ I (Yi ~  Yf/ Problem 1.1 ~ Inference for Gaussian Linear Models 373 Proof (Y.1).8.p. The independence follows from the identification of j1 and € in tenns of the Ui in the theorem. . . 'E) is a linear transformation of U and. k=I.3).. where n = nl + . In this example the nonnal equations (Z T Z)(3 = ZY become n.a of the kth treatment is ~ Pk = Yk.1.1.no The variances of .1 and Section 2. (n . .8 and that {3j and f1i are UMVU for {3j and J1.. ..8) = (J2(Z T Z)1 follows from (B. a = {3.1. joint Gaussian. and€"= Y .2). hence. k = 1.8 and (3.81 and i1 = {31 Y. We now see that the MLE of It is il = Z. then the MLE = LSE estimate is j3 = (ZTZr1 ZTy as seen before in Example 2. Y.1... } is a multipleindexed sequence of numbers or variables. o .3.4. 0 Example 6. Here Ji = . the treatments.Ill'.2.p. then replacement of a subscript by a dot indicates that we are considering the average over that subscript. The OneWay Layout (continued). 0 ~ We now return to our examples.. in general) p k=l L and the UMVU estimate of the incremental effect Ok = {3k . nk(3k = LYkl. + n p and we can write the least squares estimates as {3k=Yk . By Theorem 6.. Regression (continued). The error variance (72 = Var(El) can be unbiasedly estimated by 8 2 = (n _p)lIY ..Section 6..1.1.ii. k = 1. .3.1. One Sample (continued). .
For instance." Now.6. Thus.1. under H.4.JL): JL E w} sup{p(Y. j 1 • 1 . An alternative approach 10 the MLEs for the nonnal model and the associated LSEs of this section is an approach based on MLEs for the model in which the errors El. The most important hypothesistesting questions in the context of a linear model correspond to restriction of the vector of means JL to a linear subspace of the space w.Li = 13 E R. I. Next consider the psample model of Example 1. j where Zi2 is the dose level of the drug given the ith patient.p}.1) have the Laplace distribution with density . {JL : Jti = 131 + f33zi3. . the mean vector is an element of the space {JL : J. a regression equation of the form mean response ~ . .. see Koenker and D'Orey (1987) and Portnoy and Koenker (1997)..1...374 Inference in the Multiparameter Case Chapter 6 Remark 6. we let w correspond to the full model with dimension r and let Wo be a qdimensional linear subspace over which JL can range under the null hypothesis H.al. . ! j ! I I .2. . (6.1. under H. Now we would test H : 132 = 0 versus K : 132 =I O.3 with 13k representing the mean resJXlnse for the kth population.zT .1. .17) I. in a study to investigate whether a drug affects the mean of a response such as blood pressure we may consider. q < r. In general.17).i 1 6..En in (6. which together with a 2 specifies the model.. . " . and the matrix IIziillnx3 with Zil = 1 has rank 3. i = 1.3 Tests and Confidence Intervals . . For more on LADEs.L are least absolute deviation estimates (LADEs) obtained by minimizing the absolute deviation distance L~ 1 IYi .7 and 2. .1.\(y) = sup{p(Y. We first consider the a 2 known case and consider the likelihood ratio statistic . 1 and the estimates of f3 and J. see Problems 1. Zi3 is the age of the ith patient. 1 < . The first inferential question is typically "Are the means equal or notT' Thus we test H : 131 = . in the context of Example 6. • = 131 + f32Zi2 + f33Zi3. .. which is a onedimensional subspace of Rn. • . the LADEs are obtained fairly quickly by modem computing methods. The LADEs were introduced by Laplace before Gauss and Legendre introduced the LSEssee Stigler (1986). n} is a twodimensional linear subspace of the full model's threedimensional linear subspace of Rn given by (6. However. i = 1. The LSEs are preferred because of ease of computation and their geometric properties. whereas for the full model JL is in a pdimensional subspace of Rn.JL): JL E wo} ~ ! . .1.1. .31. j..2. = {3p = 13 for some 13 E R versus K: "the f3's are not all equal.
. We have shown the following.iii' ...l2).. . V q (6.1. We write X. Proposition 6.l9) then.1.. . . i=q+l r = a 21 fL  Ito [2 (6. (}r. then r.IY . V r span wand set .1.\(Y) ~ exp {  2t21Y .1. . Write 'Ii is as defined in (6.3.. But if we let A nx n be an orthogonal matrix with rows vi.. when H holds.iio 2 1 } where i1 and flo are the projections of Yon wand wo..1.J X. ._q' = AJL where A L 'I.21 ) where fLo is the projection of fL on woo In particular.2(v).1.'" (}r)T (see Problem B.q( (}2) distribution with 0 2 =a 2 ' \ '1]i L. by Theorem 6.\(Y) = L i=q+l r (ud a )'.1. = IJL i=q+l r JLol'. . by Theorem 6.. In this case the distribution of L:~q+1 (Uda? is called a chisquare distribution with r ..1.\(Y) Proof We only need to establish the second equality in (6. span Wo and VI.\(Y) has a X.20) i=q+I 210g.I8) then. 2 log .19). .1. v~' such that VI. Note that (uda) has a N(Oi.Section 6_1 Inference for Gaussian Linear Models 375 for testing H : fL E Wo versus j{ : JL E W  woo Because (6.1. o .q {(}2) for this distribution. In the Gaussian linear model with 17 2 known. 2 log . 1) distribution with OJ = 'Ida. respectively.\(Y) It follows that = exp 1 2172 L '\' Ui' (6.21).1.q degrees offreedom and noncentrality parameter Ej2 = 181 2 = L:=q+ I where 8 = «(}q+l.
3. is poor compared to the fit under the general model. The resulting test is intuitive.'}' YJ. has the noncentral F distribution F r _q.I.nr distribution.22) Because T = {n .22). J • T = (noncentral X:. We write :h./L.l) {Ir .1. we obtain o A(Y) = P Y. In Proposition 6.2 is known" is replaced by '.1.iT'): /L E w} .1. By the canonicaL representation (6. (6.1. Thus.liol' IY r _q IY _ iii' iii' (r . I .21ii. .1. Substituting j).1 that the MLEs of a 2 for It E wand It E wQ are .q)'{[A{Y)J'/n .itol 2 have a X.max{p{y. {io./Lol'. The distribution of such a variable is called the noncentral F distribution with noncentrality parameter 0 2 and r . ]t consists of rejecting H when the fit.1 suppose the assumption "0./L. We have shown the following.L a =  n I' and . we can write .r IY . In particular.E Wo for K : Jt E W .a:.r degrees affreedam (see Prohlem B.14). /L.q variable)/df (central X~T variahle)/df with the numerator and denominator independent.18).') denotes the righthand side of (6. 8 2 .~I.1.liol' (n _ r) 'IY _ iii' . IIY . We know from Problem 6.1. T has the representation .(Y).wo.1. and 86 into the likelihood ratio statistic. Remark 6.2 1JL . = aD IIY . T has the (central) Frq. it can be shown (Problem 6..1 that 021it . .itol 2 = L~ q+ 1 (Ui /0) 2.q and m = n ..'IY _ iii' = L~ r+l (Ui /0)2. .q and n . as measured by the residual sum of squares under the model specified by H.q)'Ili ..2. We have seen in PmIXlsition 6.. statistic :\(y) _ max{p{y.1.iT') : /L E wo} (6.1.r){r . For the purpose of finding critical values it is more convenient to work with a statistic equivalent to >.. T is an increasing function of A{Y) and the two test statistics are equivalent. which has a X~r distribution and is independent of u.2.a = ~(Y'E. E In this case.376 Inference in the Multiparameter Case Chapter 6 Next consider the case in which a 2 is unknown.23) ../to I' ' n J respectively. Proposition 6..n_r(02) where 0 2 = u. .JtoI 2 . T = n ..'I/L.I}.JLo.• j . which is equivalent to the likelihood ratio statistic for H : Ii.r.2 is the same under Hand K and estimated by the MLE 0:2 for /.L n where p{y. .1.0. In the Gaussian linear model the F statistic defined by (6.m{O') for this distrihution where k = r . T is called the F statistic for the general linear hypothesis. when H I lwlds.5) that if we introduce the variance equal likelihood ratio w'" ._q(02) distribution with 0' = .19).J.
24) Remark 6. Yl = Y2 Figure 6. Example 6.25) which we exploited in the preceding derivations. See Pigure 6.1. 0 . In this = (Y 1'0)2 (nl) lE(Y.3.(Y) equals the likelihood ratio statistic for the 0'2. where t is the onesample Student t statistic of Section 4. and the Pythagorean identity. Y3 y Iy .IO.Section 6.Y)2' which we recognize as t 2 / n. q = 0.1. The projections il and ilo of Yon w and Wo. We test H : (31 case wo = {I'o}.1.1.~T = (n .22). It follows that ~ (J2 known case with rT 2 replaced by r.19) made it possible to recognize the identity IY .q noncentral X? q 2Iog.1'0 I' y.iLl' ~++Yl I' I' 1 . (6.iLol'.1.9.1.1. (6. One Sample (continued).2. We next return to our examples.1 Inference for Gaussian Linear Models ~ 377 then >.1.r)ln central X~_cln where T is the F statistic (6.1. The canonical representation (6.iLol' = IY . This is the Pythagorean identity.iLl' + liL . r = 1 and T = 1'0 versus K : (3 'f 1'0.1 and Section B.\(Y) = .1.
dfp) RSSF/dh I 2 where RSSF = IY and RSSH = IY .2. which only depends on the second set of variables and coefficients. (6.1.1.RSSF)/(dflf . To formulate this question.1. 0 I j Example 6.. L~"(Yk' . 7) iW l. Under the alternative F has a noncentral Fp_q. in general 02 depends on the sample correlations between the variables in Zl and those in Z2' This issue is discussed further in Example 62. . P . anddfF = np and dfH = nq are the corresponding degrees of freedom..Way Layout (continued). P nk (Y. I • I b .q covariates does not affect the mean response. Z2) where Z1 is n x q and Z2 is 11 X (p . '1 Yp.1...{3p are Y1. Without loss of generality we ask whether the last p .ito 1 are the residual sums of squares under the full model and H.Yk.22) we can write the F statistic version of the likelihood ratio test in the intuitive fonn =   i . 1 02 = (J2(p .np distribution.26) and H..q)1f3nZfZ2 . age.)2.. and we partition {3 as f3T = (f3f. treatment) effect coefficients and f3 1 is a q x 1 vector of "nuisance" (e.. Now the linear model can be written as i i I I ! We test H : f32 (6.1.1..  I 1.=11=1 k=1 •• Substituting in (6.1..g.y)2 k .. The One. we want to test H : {31 = . .: I :I T _ n ~ p L~l nk(Y . Under H all the observations have the same mean so that.q covariates in multiple regression have an effect after fitting the first q. The F test rejects H if F is large when compared to the ath quantile of the Fpq. • 1 F = (RSSlf .)2= Lnk(Yk.q) x 1 vector of main (e.22) we obtain the F statistic for the hypothesis H in the oneway layout . .vy.1 L~=. j ! j • litPol2 = LL(Yk _y.P .' • ito = Thus. .. Using (6.q).. We consider the possibility that a subset of p .ZfZl(ZiZl)'ziz. In this case f3 (ZrZ)lZry and f3 0 = (Z[Ztl1Z[Y are the ML& under the full model (6. In the special case that ZrZ2 = 0 so the variables in Zl are orthogonal to the variables in Z2.3. we partition the design matrix Z by writing it as Z = (ZI.1.n 378 Inference in the Multiparameter Case Chapter 6 iI II " • Example 6. . .}f32. Regression (continued). respectively. 02 simplifies to (J2(p .27) 1 . = {3p. Recall that the least squares estimates of {31.n_p(fP) distribution with noncentrality parameter (Problem 6.. respectivelY.g.1. I3I) where {32 is a (p . _y. •. .Q)f3f(ZfZ2)f32. As we indicated earlier.. I • ! . However. economic status) coefficients.)2' .26) 0 versus K : f3 2 i' O..
. .. Because 88B/0" and SSw / a' are independent X' variables with (p .np distribution.. into two constituent components.1. . . This information as well as S8B/(P .28) where j3 = n I :2:=~"" 1 nifJi. As an illustration. .p) of the (n 1) degrees offreedom of SST /0" as "cooling" from S8w/a'. Note that this implies the possibly unrealistic .29) Thus. Yi nl .. consider the following data(I) giving blood cholesterol levels of men in three different socioeconomic groups labeled I.. If the fJ..1) degrees of freedom and noncentrality parameter 0'. We assume the oneway layout is valid.)' .p) degrees of freedom. .p).3. are often summarized in what is known as an analysis a/variance (ANOVA) table.fLo 12 for the vector Jt (rh. the between groups (or treatment) sum of squares and SSw.1. the unbiased estimates of 02 and a 2 .1.1) and (n ... To derive IF.. is a measure of variation between the p samples YII .. .fJ.)'.···. SSB. SST/a 2 is a (noncentral) X2 variable with (n .. Ypl . [f we define the total sum of squares as 88T =L l' L(Y nk k'  Y. See Tables 6.1 and 6. are not all equal. compute a. which is their ratio. the total sum of squares. . (31. .1) and 8Sw /(n . then by the Pythagorean identity (6. Ypnp' The 88w ~ L L(Y Y p "' k'  k )'. we see that the decomposition (6.1. T has a Fpl. 888 =L k=I P nk(Yk. the within groups (or residual) sum of squares. n. .1..30) can also be viewed stochastically. The sum of squares in the numerator. (3p. and III with I being the "high" end. (3p)T and its projection 1'0 = (ij. ijjT = There is an interesting way of looking at the pieces of infonnation summarized by the F statistic.. T has a noncentral Fp1. SST.. .Section 6. (3" . . identifying 0' and (p .Y. . k=1 1=1 which measures the variability of the pooled samples. respectively.1) degrees offreedom as "coming" from SS8/a' and the remaining (n . sum of squares in the denominator. k=l 1=1 measures variation within the samples.2 1M . (6.1.np distribution with noncentrality parameter (6. . and the F statistic.1 Inference for Gaussian Linear Models 379 When H holds.25) 88T = 888 + 88w .. we have a decomposition of the variability of the whole set of data.
j:
I ,
380
Inference in the Multiparameter Case Chapter 6
,
:
TABLE 6.1.1. ANOYA table for the oneway layout
Sum of squares
d.f.
,
Between samples
Within samples
SSe
I:
r I lld\'k_
Mean squares
F value
1\1 S '
i'dS B
Lf!
P
1
MSB
~
, ,
Total
58W  ""k '1'1n I "(I'k l  I')'  I " k 55T  ,. P 1 L "I" 1 (I'kl _ I' )' ~ k I
np
Tl  1
" '''
A1S w = SSw
1
1
j
TABLE 6.1.2. Blood cholesterol levels
I
J
286 290
II III
403 312 403
311 222 244
269 302 353
336 420 235
259 420 319
386 260
353
210
l 1
I
I
,
i,
.!
, , I
I , ,
assumption that the variance of the measurement is the same in the three groups (not to speak of normality). But see Section 6.6 for "robustness" to these assumptions. We want to test whether there is a significant difference among the mean blood cholesterol of the three groups. Here p = 3, nl = 5, n2 = 10, n3 = 6. n = 21, and we compute
TABLE 6.1.3. ANOYA table for the cholesterol data
;
,
88
Between groups Within groups Total
I
,
,I
1202.5 85,750,5 86,953.0
dJ, 2 18 20
M8
601.2 4763,9
F~value
0.126
'I
From :F tables, we find that the pvalue corresponding to the Fvalue 0.126 is 0.88. Thus, there is no evidence to indicate that mean blood cholesterol is different for the three socioeconomic groups. 0 Remark 6.1.4. Decompositions such as (6.1.29) of the response total sum of squares SST into a variety of sums of squares measuring variability in the observations corresponding to variation of covariates are referred to as analysis oj variance. They can be fonnulated in any linear model including regression models. See Scheff" (1959, pp. 4245) and Weisberg (1985, p. 48). Originally such decompositions were used to motivate F statistics and to establish the distribution theory of the components via a device known as Cochran's theorem (Graybill, 1961, p. 86). Their principal use now is in the motivation of the convenient summaries of infonnation we call ANOVA tables.
,
,I "
,
,
,
Section 6.1
Inference for Gaussian linear Models
~
381
Confidence Intervals and Regions We next use our distributional results and the method of pivots to find confidence intervals for J.li, 1 < i < n, !3j, 1 <j < p, and in general, any linear combination
n
,p
= ,p(/l) = Lai/Li
i::: 1
~ aT /l
of the J.l's. If we set;j;
= 1:7
1 ai/Ii
= aT fl and
~
~
where H is the hat matrix, then (,p  ,p)ja(,p) has a N(O, I) distribution. Moreover,
(n  r)8 2ja 2 ~
IY  iii 2 ja 2 =
L
i=r+l
n
(Uda 2 )'
has a X~r distribution and is independent of ;;;. Let
~
~
be an estimate of the standard deviation a('IjJ) of
~
'0. This estimated standard deviation is
called the standard error of 'IjJ. By referring to the definition of the t distribution, we find that the pivot
has a TnT. distribution. Let t n  r (1  40) denote the 1 ~o: quantile of the bution, then by solving IT(,p)1 < t n _, (I  ~",) for,p, we find that
Tn r
distri
is, in the Gaussian linear model, a 100(1  a)% confidence interval for 1/J. Example 6.1.1. One Sample (continued). Consider'IjJ = p. We obtain the interval
i' ~ Y ± t n 1
(1
~q) sj.,fii,
which is the same as the interval of Example 4.4.1 and Section 4.9.2. Example 6.1.2. Regression (continued). Assume that p = T. First consider 1/J = f3j for some specified ~gression coefficient (3j. The 100(1  a)% confidence interval for (3j is
(3j
}" = (3j ± t n p (','" s{ [ (ZT Z) 11 j j ' 1 )
382
Inference in the Multiparameter Case
Chapter 6
where [(ZTZ)~l]j) is the jthdiagonal element of (ZTZ)~ '. Computersoftware computes (ZTZ)~I and labels S{[(ZTZ)~lI]j); as the standard errOr of the (estimated) jth regression coefficient. Next consider ljJ = j.ti = mean response for the ith case, 1 < i < n. The level (1  0:) confidence interval is
J.Li =
Jii ± t n _ p (1 
!a) sJh:
where hii is the ith diagonal element of the hat matrix H. Here sjh;; is called the standard error of the (estimated) mean of the ith case. Next consider the special case in which p = 2 and
Next consider the special case in which p = 2 and

$$Y_i = \beta_1 + \beta_2 z_{i2} + \epsilon_i, \quad i = 1, \dots, n.$$

If we use the identity

$$\sum_{i=1}^n (z_{i2} - \bar z_{\cdot 2})(Y_i - \bar Y) = \sum_{i=1}^n (z_{i2} - \bar z_{\cdot 2})Y_i,$$

we obtain from Example 2.2.2 that

$$\hat\beta_2 = \frac{\sum_{i=1}^n (z_{i2} - \bar z_{\cdot 2})Y_i}{\sum_{i=1}^n (z_{i2} - \bar z_{\cdot 2})^2}. \qquad (6.1.30)$$

Because $\mathrm{Var}(Y_i) = \sigma^2$, we obtain

$$\mathrm{Var}(\hat\beta_2) = \sigma^2 \Big/ \sum_{i=1}^n (z_{i2} - \bar z_{\cdot 2})^2,$$

and the $100(1-\alpha)\%$ confidence interval for $\beta_2$ has the form

$$\beta_2 = \hat\beta_2 \pm t_{n-p}\left(1 - \tfrac{1}{2}\alpha\right) s\Big/\sqrt{\textstyle\sum (z_{i2} - \bar z_{\cdot 2})^2}.$$

The confidence interval for $\beta_1$ is given in Problem 6.1.10.
Similarly, in the p = 2 case, it is straightforward (Problem 6.1.10) to compute

$$h_{ii} = \frac{1}{n} + \frac{(z_{i2} - \bar z_{\cdot 2})^2}{\sum_{k=1}^n (z_{k2} - \bar z_{\cdot 2})^2},$$

and the confidence interval for the mean response $\mu_i$ of the ith case has a simple explicit form.
Example 6.1.3. One-Way Layout (continued). We consider $\psi = \beta_k$, $1 \le k \le p$. Because $\hat\beta_k = \bar Y_{k\cdot} \sim N(\beta_k, \sigma^2/n_k)$, we find the $100(1-\alpha)\%$ confidence interval

$$\beta_k = \bar Y_{k\cdot} \pm t_{n-p}\left(1 - \tfrac{1}{2}\alpha\right) s/\sqrt{n_k},$$

where $s^2 = SS_W/(n-p)$. The intervals for $\mu = \bar\beta$ and the incremental effect $\delta_k = \beta_k - \mu$ are given in Problem 6.1.11. □
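A brief sketch of this interval (our own illustration; the group data below are hypothetical, chosen only to mimic a cholesterol-style layout):

```python
import numpy as np
from scipy import stats

# Hypothetical one-way layout: p groups, n_k observations per group.
groups = [np.array([210., 245., 230., 225., 250.]),
          np.array([190., 210., 205., 220., 215., 200., 195., 225., 208., 199.]),
          np.array([230., 240., 222., 215., 228., 233.])]
n, p = sum(len(g) for g in groups), len(groups)

ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
s = np.sqrt(ss_within / (n - p))             # s^2 = SS_W / (n - p)

alpha = 0.05
t_quant = stats.t.ppf(1 - alpha / 2, df=n - p)
for k, g in enumerate(groups, start=1):
    half = t_quant * s / np.sqrt(len(g))     # t * s / sqrt(n_k)
    print(f"group {k}: mean {g.mean():.1f}, 95% CI half-width {half:.1f}")
```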
Joint Confidence Regions
We have seen how to find confidence intervals for each individual $\beta_j$, $1 \le j \le p$. We next consider the problem of finding a confidence region $C$ in $R^p$ that covers the vector $\beta$ with prescribed probability $(1-\alpha)$. This can be done by inverting the likelihood ratio test or, equivalently, the F test. That is, we let $C$ be the collection of $\beta_0$ that is accepted when the level $(1-\alpha)$ F test is used to test $H : \beta = \beta_0$. Under $H$, $\mu = \mu_0 = Z\beta_0$, and the numerator of the F statistic (6.1.22) is based on

$$|\hat\mu - \mu_0|^2 = |Z\hat\beta - Z\beta_0|^2 = (\hat\beta - \beta_0)^T(Z^T Z)(\hat\beta - \beta_0).$$

Thus, using (6.1.22), the simultaneous confidence region for $\beta$ is the ellipse

$$C = \left\{\beta_0 : \frac{(\hat\beta - \beta_0)^T(Z^T Z)(\hat\beta - \beta_0)}{r s^2} \le f_{r,n-r}(1-\alpha)\right\} \qquad (6.1.31)$$

where $f_{r,n-r}(1-\alpha)$ is the $1-\alpha$ quantile of the $\mathcal{F}_{r,n-r}$ distribution.
Example 6.1.2. Regression (continued). We consider the case p = r and as in (6.1.26) write $Z = (Z_1, Z_2)$ and $\beta^T = (\beta_1^T, \beta_2^T)$, where $\beta_2$ is a vector of main effect coefficients and $\beta_1$ is a vector of "nuisance" coefficients. Similarly, we partition $\hat\beta$ as $\hat\beta = (\hat\beta_1^T, \hat\beta_2^T)^T$, where $\hat\beta_1$ is $q \times 1$ and $\hat\beta_2$ is $(p-q) \times 1$. By Corollary 6.1.1, $\sigma^2(Z^T Z)^{-1}$ is the variance-covariance matrix of $\hat\beta$. It follows that if we let $S$ denote the lower right $(p-q) \times (p-q)$ corner of $(Z^T Z)^{-1}$, then $\sigma^2 S$ is the variance-covariance matrix of $\hat\beta_2$. Thus, a joint $100(1-\alpha)\%$ confidence region for $\beta_2$ is the $(p-q)$-dimensional ellipse

$$C = \left\{\beta_{02} : \frac{(\hat\beta_2 - \beta_{02})^T S^{-1}(\hat\beta_2 - \beta_{02})}{(p-q)s^2} \le f_{p-q,\,n-p}(1-\alpha)\right\}. \qquad \square$$
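To make (6.1.31) concrete, the following is our own brief sketch (not from the text) that checks whether a candidate $\beta_0$ lies in the joint confidence ellipse:

```python
import numpy as np
from scipy import stats

def in_confidence_ellipse(Z, Y, beta0, alpha=0.05):
    """True if beta0 lies in the joint (1 - alpha) region (6.1.31)."""
    n, r = Z.shape
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
    s2 = np.sum((Y - Z @ beta_hat) ** 2) / (n - r)
    diff = beta_hat - beta0
    quad = diff @ (Z.T @ Z) @ diff / (r * s2)
    return quad <= stats.f.ppf(1 - alpha, r, n - r)
```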
Summary. We consider the classical Gaussian linear model in which the response $Y_i$ for the ith case in an experiment is expressed as a linear combination $\mu_i = \sum_{j=1}^p \beta_j z_{ij}$ of covariates plus an error $\epsilon_i$, where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. $N(0, \sigma^2)$. By introducing a suitable orthogonal transformation, we obtain a canonical model in which likelihood analysis is straightforward. The inverse of the orthogonal transformation gives procedures and results in terms of the original variables. In particular we obtain maximum likelihood estimates, likelihood ratio tests, and confidence procedures for the regression coefficients $\{\beta_j\}$, the response means $\{\mu_i\}$, and linear combinations of these.
6.2
ASYMPTOTIC ESTIMATION THEORY IN p DIMENSIONS
In this section we largely parallel Section 5.4, in which we developed the asymptotic properties of the MLE and related tests and confidence bounds for one-dimensional parameters. We leave the analogue of Theorem 5.4.1 to the problems and begin immediately generalizing Section 5.4.2.
6.2.1
Estimating Equations
Our assumptions are as before save that everything is made a vector: $X_1, \dots, X_n$ are i.i.d. $P$ where $P \in \mathcal{Q}$, a model containing $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ such that

(i) $\Theta$ open $\subset R^p$.

(ii) Densities of $P_\theta$ are $p(\cdot, \theta)$, $\theta \in \Theta$.

The following result gives the general asymptotic behavior of the solution of estimating equations.
A0. $\Psi = (\psi_1, \dots, \psi_p)^T$, where $\psi_j = \partial\rho/\partial\theta_j$ is well defined, and

$$\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \hat\theta_n) = 0. \qquad (6.2.1)$$

A solution to (6.2.1) is called an estimating equation estimate or an M-estimate.
A1. The parameter $\theta(P)$ given by the solution of (the nonlinear system of p equations in p unknowns)

$$\int \Psi(x, \theta)\,dP(x) = 0 \qquad (6.2.2)$$

is well defined on $\mathcal{Q}$, so that $\theta(P)$ is the unique solution of (6.2.2). Necessarily $\theta(P_\theta) = \theta$ because $\mathcal{Q} \supset \mathcal{P}$.

A2. $E_P|\Psi(X_1, \theta(P))|^2 < \infty$, where $|\cdot|$ is the Euclidean norm.
A3. $\psi_i(\cdot, \theta)$, $1 \le i \le p$, have first-order partials with respect to all coordinates and, using the notation of Section B.8, the matrix

$$E_P\,D\Psi(X_1, \theta(P)), \qquad D\Psi(x, \theta) = \left\|\frac{\partial\psi_i}{\partial\theta_j}(x, \theta)\right\|,$$

is nonsingular.
A4. $\sup\left\{\left|\frac{1}{n}\sum_{i=1}^n \big(D\Psi(X_i, t) - D\Psi(X_i, \theta(P))\big)\right| : |t - \theta(P)| \le \epsilon_n\right\} \stackrel{P}{\longrightarrow} 0$ if $\epsilon_n \to 0$.

A5. $\hat\theta_n \stackrel{P}{\longrightarrow} \theta(P)$ for all $P \in \mathcal{Q}$.
Theorem 6.2.1. Under A0-A5 of this section,

$$\hat\theta_n = \theta(P) + \frac{1}{n}\sum_{i=1}^n \tilde\Psi(X_i, \theta(P)) + o_P(n^{-1/2}) \qquad (6.2.3)$$

where

$$\tilde\Psi(x, \theta(P)) = -[E_P\,D\Psi(X_1, \theta(P))]^{-1}\,\Psi(x, \theta(P)). \qquad (6.2.4)$$
Section 6.2
Asymptotic Estimation Theory in p Dimensions
385
Hence,

$$\sqrt{n}(\hat\theta_n - \theta(P)) \stackrel{\mathcal{L}}{\longrightarrow} N(0, \Sigma(\Psi, P)) \qquad (6.2.5)$$

where

$$\Sigma(\Psi, P) = J(\theta, P)\,E_P[\Psi\Psi^T(X_1, \theta(P))]\,J^T(\theta, P) \qquad (6.2.6)$$

and

$$J^{-1}(\theta, P) = E_P\,D\Psi(X_1, \theta(P)) = E_P\left\|\frac{\partial\psi_i}{\partial\theta_j}(X_1, \theta(P))\right\|.$$
The proof of this result follows precisely that of Theorem 5.4.2 save that we need multivariate calculus as in Section B.8. Thus,

$$-\frac{1}{n}\sum_{i=1}^n \Psi(X_i, \theta(P)) = \frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*)\,(\hat\theta_n - \theta(P)) \qquad (6.2.7)$$

for some $\theta_n^*$ between $\hat\theta_n$ and $\theta(P)$.
Note that the left-hand side of (6.2.7) is a p × 1 vector; the right is the product of a p × p matrix and a p × 1 vector. The rest of the proof follows essentially exactly as in Section 5.4.2 save that we need the observation that the set of nonsingular p × p matrices, when viewed as vectors, is an open subset of $R^{p^2}$, representable, for instance, as the set of vectors for which the determinant, a continuous function of the entries, is different from zero. We use this remark to conclude that A3 and A4 guarantee that with probability tending to 1, $\frac{1}{n}\sum_{i=1}^n D\Psi(X_i, \theta_n^*)$ is nonsingular.

Note. This result goes beyond Theorem 5.4.2 in making it clear that although the definition of $\hat\theta_n$ is motivated by $\mathcal{P}$, the behavior in (6.2.3) is guaranteed for $P \in \mathcal{Q}$, which can include $P \notin \mathcal{P}$. In fact, typically $\mathcal{Q}$ is essentially the set of P's for which $\theta(P)$ can be defined uniquely by (6.2.2). We can again extend the assumptions of Section 5.4.2 to:

A6. If $l(\cdot, \theta)$ is differentiable,

$$E_\theta\,D\Psi(X_1, \theta) = -E_\theta[\Psi(X_1, \theta)\,Dl(X_1, \theta)^T] = -\mathrm{Cov}_\theta(\Psi(X_1, \theta), Dl(X_1, \theta)) \qquad (6.2.8)$$

defined as in B.5.2. The heuristics and conditions behind this identity are the same as in the one-dimensional case. Remarks 5.4.2, 5.4.3, and Assumptions A4' and A6' extend to the multivariate case readily. Note that consistency of $\hat\theta_n$ is assumed. Proving consistency usually requires different arguments such as those of Section 5.2. It may, however, be shown that with probability tending to 1, a root-finding algorithm starting at a consistent estimate $\theta_n^*$ will find a solution $\hat\theta_n$ of (6.2.1) that satisfies (6.2.3) (Problem 6.2.10).
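As a hedged computational aside (our own sketch, not from the text), the asymptotic variance $\Sigma(\Psi, P)$ of (6.2.5)-(6.2.6) can be estimated by the empirical "sandwich" formula, replacing $E_P D\Psi$ and $E_P \Psi\Psi^T$ by sample averages evaluated at $\hat\theta_n$:

```python
import numpy as np

def sandwich_variance(psi, dpsi, x, theta_hat):
    """Estimate Sigma(Psi, P) = J E[Psi Psi^T] J^T with J^{-1} = E[D Psi].

    psi(x_i, theta) -> p-vector; dpsi(x_i, theta) -> p x p matrix.
    """
    A = np.mean([dpsi(xi, theta_hat) for xi in x], axis=0)     # ~ E_P[D Psi]
    B = np.mean([np.outer(psi(xi, theta_hat), psi(xi, theta_hat))
                 for xi in x], axis=0)                         # ~ E_P[Psi Psi^T]
    J = np.linalg.inv(A)
    return J @ B @ J.T      # asymptotic variance of sqrt(n)(theta_hat - theta(P))

# Toy check: psi(x, t) = x - t estimates the mean, so Sigma ~ Var(X).
x = np.random.default_rng(1).normal(size=500)
Sigma = sandwich_variance(lambda xi, t: np.array([xi - t]),
                          lambda xi, t: np.array([[-1.0]]),
                          x, x.mean())
```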
6.2.2 Asymptotic Normality and Efficiency of the MLE

If we take $\rho(x, \theta) = -l(x, \theta) = -\log p(x, \theta)$, and $\Psi(x, \theta) = Dl(x, \theta)$ obeys A0-A6, then (6.2.8) becomes

$$E_\theta\,D^2 l(X_1, \theta) = -E_\theta[Dl(X_1, \theta)\,D^T l(X_1, \theta)] = -\mathrm{Var}_\theta\,Dl(X_1, \theta) \qquad (6.2.9)$$

where $\mathrm{Var}_\theta\,Dl(X_1, \theta)$ is the Fisher information matrix $I(\theta)$ introduced in Section 3.4. If $\rho : \Theta \to R$, $\Theta \subset R^d$, is a scalar function, the matrix $\left\|\frac{\partial^2\rho}{\partial\theta_i\partial\theta_j}(\theta)\right\|$ is known as the Hessian or curvature matrix of the surface $\rho$. Thus, (6.2.9) states that the expected value of the Hessian of $l$ is the negative of the Fisher information. We also can immediately state the generalization of Theorem 5.4.3.
Theorem 6.2.2. If A0-A6 hold for $\rho(x, \theta) = -\log p(x, \theta)$, then the MLE $\hat\theta_n$ satisfies

$$\hat\theta_n = \theta + \frac{1}{n}\sum_{i=1}^n I^{-1}(\theta)\,Dl(X_i, \theta) + o_P(n^{-1/2}) \qquad (6.2.10)$$

so that

$$\sqrt{n}(\hat\theta_n - \theta) \stackrel{\mathcal{L}}{\longrightarrow} N(0, I^{-1}(\theta)). \qquad (6.2.11)$$

If $\tilde\theta_n$ is a minimum contrast estimate with $\rho$ and $\Psi$ satisfying A0-A6 and corresponding asymptotic variance matrix $\Sigma(\Psi, P_\theta)$, then

$$\Sigma(\Psi, P_\theta) \ge I^{-1}(\theta) \qquad (6.2.12)$$

in the sense of Theorem 3.4.4, with equality in (6.2.12) for $\theta = \theta_0$ iff, under $\theta_0$,

$$\tilde\theta_n = \hat\theta_n + o_P(n^{-1/2}). \qquad (6.2.13)$$
Proof. The proofs of (6.2.10) and (6.2.11) parallel those of (5.4.33) and (5.4.34) exactly. The proof of (6.2.12) parallels that of Theorem 3.4.4. For completeness we give it. Note that by (6.2.6) and (6.2.8),

$$\Sigma(\Psi, P_\theta) = \mathrm{Cov}_\theta^{-1}(U, V)\,\mathrm{Var}_\theta(U)\,\mathrm{Cov}_\theta^{-1}(V, U) \qquad (6.2.14)$$

where $U = \Psi(X_1, \theta)$, $V = Dl(X_1, \theta)$. But by (B.10.8), for any $U, V$ with $\mathrm{Var}(U^T, V^T)^T$ nonsingular,

$$\mathrm{Var}(V) \ge \mathrm{Cov}(V, U)\,\mathrm{Var}^{-1}(U)\,\mathrm{Cov}(U, V). \qquad (6.2.15)$$

Taking inverses of both sides yields

$$I^{-1}(\theta) = \mathrm{Var}_\theta^{-1}(V) \le \Sigma(\Psi, \theta). \qquad (6.2.16)$$
Equality holds in (6.2.15) by (B.10.2.3) iff for some $b = b(\theta)$,

$$U = b + \mathrm{Cov}(U, V)\,\mathrm{Var}^{-1}(V)\,V \qquad (6.2.17)$$

with probability 1. This means, in view of $E_\theta\Psi = E_\theta Dl = 0$, that

$$\Psi(X_1, \theta) = b(\theta)\,Dl(X_1, \theta).$$

In the case of identity in (6.2.16) we must then have

$$-[E_\theta\,D\Psi(X_1, \theta)]^{-1}\,\Psi(X_1, \theta) = I^{-1}(\theta)\,Dl(X_1, \theta). \qquad (6.2.18)$$

Hence, from (6.2.3) and (6.2.10) we conclude that (6.2.13) holds. □
We see that, by the theorem, the MLE is efficient in the sense that for any $a_{p\times 1}$, $a^T\hat\theta_n$ has asymptotic bias $o(n^{-1/2})$ and asymptotic variance $n^{-1}a^T I^{-1}(\theta)a$, which is no larger than that of any competing minimum contrast estimate. Further, any competitor $\tilde\theta_n$ such that $a^T\tilde\theta_n$ has the same asymptotic behavior as $a^T\hat\theta_n$ for all $a$ in fact agrees with $\hat\theta_n$ to order $n^{-1/2}$.

A special case of Theorem 6.2.2 that we have already established is Theorem 5.3.6 on the asymptotic normality of the MLE in canonical exponential families. A number of important new statistical issues arise in the multiparameter case. We illustrate with an example.

Example 6.2.1. The Linear Model with Stochastic Covariates. Let $X_i = (Z_i^T, Y_i)^T$, $1 \le i \le n$, be i.i.d. as $X = (Z^T, Y)^T$, where $Z$ is a $p \times 1$ vector of explanatory variables and $Y$ is the response of interest. This model is discussed in Section 2.2.1 and Example 1.4.3. We specialize in two ways:

(i) $Y = \alpha + Z^T\beta + \epsilon$ (6.2.19), where $\epsilon$ is distributed as $N(0, \sigma^2)$ independent of $Z$ and $E(Z) = 0$. That is, given $Z$, $Y$ has a $N(\alpha + Z^T\beta, \sigma^2)$ distribution.

(ii) The distribution $H_0$ of $Z$ is known with density $h_0$ and $E(ZZ^T)$ is nonsingular.

The second assumption is unreasonable but easily dispensed with. It readily follows (Problem 6.2.6) that the MLE of $\beta$ is given by (with probability 1)

$$\hat\beta = [Z_{(n)}^T Z_{(n)}]^{-1} Z_{(n)}^T Y. \qquad (6.2.20)$$

Here $Z_{(n)}$ is the $n \times p$ matrix $\|Z_{ij} - \bar Z_{\cdot j}\|$, where $\bar Z_{\cdot j} = \frac{1}{n}\sum_{i=1}^n Z_{ij}$. We used subscripts $(n)$ to distinguish the use of $Z$ as a vector in this section and as a matrix in Section 6.1. In the present context, $Z_{(n)}$ is referred to as the random design matrix. This example is called the random design case as opposed to the fixed design case of Section 6.1. Also the MLEs of $\alpha$ and $\sigma^2$ are

$$\hat\alpha = \bar Y - \sum_{j=1}^p \bar Z_{\cdot j}\hat\beta_j, \qquad \hat\sigma^2 = \frac{1}{n}\left|Y - (\hat\alpha + Z_{(n)}\hat\beta)\right|^2. \qquad (6.2.21)$$
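A compact numerical sketch of (6.2.20)-(6.2.21) follows (our own illustration; the simulated design is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
Z = rng.normal(size=(n, p))                       # stochastic covariates, E(Z) = 0
beta, alpha, sigma = np.array([1.0, -0.5, 2.0]), 0.7, 1.5
Y = alpha + Z @ beta + rng.normal(scale=sigma, size=n)

Zc = Z - Z.mean(axis=0)                           # centered random design matrix
beta_hat = np.linalg.solve(Zc.T @ Zc, Zc.T @ Y)   # (6.2.20)
alpha_hat = Y.mean() - Z.mean(axis=0) @ beta_hat  # (6.2.21), first part
sigma2_hat = np.mean((Y - alpha_hat - Z @ beta_hat) ** 2)
```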
Note that although given $Z_1, \dots, Z_n$, $\hat\beta$ is Gaussian, this is not true of the marginal distribution of $\hat\beta$. It is not hard to show that A0-A6 hold in this case because, if $H_0$ has density $h_0$ and if $\theta$ denotes $(\alpha, \beta^T, \sigma^2)^T$, then

$$l(X, \theta) = -\frac{1}{2\sigma^2}[Y - (\alpha + Z^T\beta)]^2 - \frac{1}{2}(\log\sigma^2 + \log 2\pi) + \log h_0(Z) \qquad (6.2.22)$$

and, with $\epsilon = Y - (\alpha + Z^T\beta)$,

$$Dl(X, \theta) = \left(\frac{\epsilon}{\sigma^2},\ \frac{\epsilon Z^T}{\sigma^2},\ \frac{\epsilon^2 - \sigma^2}{2\sigma^4}\right)^T, \qquad I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 & 0 \\ 0 & \frac{1}{\sigma^2}E(ZZ^T) & 0 \\ 0 & 0 & \frac{1}{2\sigma^4} \end{pmatrix} \qquad (6.2.23)$$

so that by Theorem 6.2.2

$$\mathcal{L}\big(\sqrt{n}(\hat\alpha - \alpha, \hat\beta - \beta, \hat\sigma^2 - \sigma^2)\big) \to N\big(0, \mathrm{diag}(\sigma^2, \sigma^2[E(ZZ^T)]^{-1}, 2\sigma^4)\big). \qquad (6.2.24)$$

This can be argued directly as well (Problem 6.2.8). It is clear that the restriction of $H_0$ known plays no role in the limiting result for $\hat\alpha, \hat\beta, \hat\sigma^2$. Of course, these will only be the MLEs if $H_0$ depends only on parameters other than $(\alpha, \beta, \sigma^2)$. In this case we can estimate $E(ZZ^T)$ by $\frac{1}{n}\sum_{i=1}^n Z_i Z_i^T$ and give approximate confidence intervals for $\beta_j$, $j = 1, \dots, p$.

An interesting feature of (6.2.23) is that because $I(\theta)$ is a block diagonal matrix so is $I^{-1}(\theta)$ and, consequently, $\hat\beta$ and $\hat\sigma^2$ are asymptotically independent. In the classical linear model of Section 6.1, where we perform inference conditionally given $Z_i = z_i$, $1 \le i \le n$, we have noted this is exactly true. This is an example of the phenomenon of adaptation. If we knew $\sigma^2$, the MLE would still be $\hat\beta$ and its asymptotic variance optimal for this model. If we knew $\alpha$ and $\beta$, $\hat\sigma^2$ would no longer be the MLE. But its asymptotic variance would be the same as that of the MLE and, by Theorem 6.2.2, $\hat\sigma^2$ would be asymptotically equivalent to the MLE. To summarize, estimating either parameter with the other being a nuisance parameter is no harder than when the nuisance parameter is known. Formally, in a model $\mathcal{P} = \{P_{(\theta,\eta)} : \theta \in \Theta, \eta \in \mathcal{E}\}$
we say we can estimate $\theta$ adaptively at $\eta_0$ if the asymptotic variance of the MLE $\hat\theta$ (or more generally, an efficient estimate of $\theta$) in the pair $(\hat\theta, \hat\eta)$ is the same as that of $\hat\theta(\eta_0)$, the efficient estimate for $\mathcal{P}_{\eta_0} = \{P_{(\theta,\eta_0)} : \theta \in \Theta\}$. The possibility of adaptation is in fact rare, though it appears prominently in this way in the Gaussian linear model. In particular, consider estimating $\beta_1$ in the presence of $\alpha, (\beta_2, \dots, \beta_p)$ with

(i) $\alpha, \beta_2, \dots, \beta_p$ known;

(ii) $\beta$ arbitrary.

In case (i), we take, without loss of generality, $\alpha = \beta_2 = \dots = \beta_p = 0$. Let $Z_i = (Z_{i1}, \dots, Z_{ip})^T$; then the efficient estimate in case (i) is

$$\hat\beta_1^n = \frac{\sum_{i=1}^n Z_{i1}Y_i}{\sum_{i=1}^n Z_{i1}^2} \qquad (6.2.25)$$
with asymptotic variance $\sigma^2[EZ_{11}^2]^{-1}$. On the other hand, $\hat\beta_1$ is the first coordinate of $\hat\beta$ given by (6.2.20). Its asymptotic variance is the (1,1) element of $\sigma^2[EZZ^T]^{-1}$, which is strictly bigger than $\sigma^2[EZ_{11}^2]^{-1}$ unless $[EZZ^T]^{-1}$ is a diagonal matrix (Problem 6.2.3). So in general we cannot estimate $\beta_1$ adaptively if $\beta_2, \dots, \beta_p$ are regarded as nuisance parameters. What is happening can be seen by a representation of $[Z_{(n)}^T Z_{(n)}]^{-1}Z_{(n)}^T Y$ and $I^{-1}(\theta) = \|I^{jk}(\theta)\|$. We claim that

$$\hat\beta_1 = \frac{\sum_{i=1}^n (Z_{i1} - \hat Z_{i1})Y_i}{\sum_{i=1}^n (Z_{i1} - \hat Z_{i1})^2} \qquad (6.2.26)$$

where $(\hat Z_{11}, \dots, \hat Z_{n1})^T$ is the regression of $(Z_{11}, \dots, Z_{n1})^T$ on the linear space spanned by $(Z_{j1}, \dots, Z_{jn})^T$, $2 \le j \le p$. Similarly,

$$I^{11}(\theta) = \sigma^2 \big/ E\big(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1})\big)^2 \qquad (6.2.27)$$

where $\Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1})$ is the projection of $Z_{11}$ on the linear span of $Z_{21}, \dots, Z_{p1}$ (Problem 6.2.11). Thus, $\Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1}) = \sum_{j=2}^p a_j^* Z_{j1}$, where $(a_2^*, \dots, a_p^*)$ minimizes $E(Z_{11} - \sum_{j=2}^p a_j Z_{j1})^2$ over $(a_2, \dots, a_p) \in R^{p-1}$ (see Sections 1.4 and B.10). What (6.2.26) and (6.2.27) reveal is that there is a price paid for not knowing $\beta_2, \dots, \beta_p$ when the variables $Z_2, \dots, Z_p$ are in any way correlated with $Z_1$, and the price is measured by

$$\frac{[E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1}))^2]^{-1}}{[E(Z_{11}^2)]^{-1}} = \left(1 - \frac{E(\Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1}))^2}{E(Z_{11}^2)}\right)^{-1}. \qquad (6.2.28)$$

In the extreme case of perfect collinearity the price is $\infty$, as it should be, because $\beta_1$ then becomes unidentifiable. Thus, adaptation corresponds to the case where $(Z_2, \dots, Z_p)$ have no value in predicting $Z_1$ linearly (see Section 1.4). Correspondingly, in the Gaussian linear model (6.1.3) conditional on the $Z_i$, $i = 1, \dots, n$, $\hat\beta_1$ is undefined if the denominator in (6.2.26) is 0, which corresponds to the case of collinearity and occurs with probability 1 if $E(Z_{11} - \Pi(Z_{11} \mid Z_{21}, \dots, Z_{p1}))^2 = 0$. □
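The "price" in (6.2.28) is the variance-inflation factor $1/(1 - R^2)$, where $R^2$ is the squared multiple correlation of $Z_1$ with $(Z_2, \dots, Z_p)$. A small sketch of this (our own illustration, with a hypothetical correlated design):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
z2 = rng.normal(size=n)
z1 = 0.8 * z2 + rng.normal(scale=0.6, size=n)    # Z1 correlated with Z2

# Projection of Z1 on Z2 and the price (6.2.28): E(Z1^2) / E(Z1 - Pi)^2.
a_star = np.dot(z1, z2) / np.dot(z2, z2)
resid_var = np.mean((z1 - a_star * z2) ** 2)
price = np.mean(z1 ** 2) / resid_var
print(f"variance inflation ~ {price:.2f}")
```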
Example 6.2.2. M Estimates Generated by Linear Models with General Error Structure. Suppose that the $\epsilon_i$ in (6.2.19) are i.i.d. but not necessarily Gaussian with density $\frac{1}{\sigma}f_0\left(\frac{\cdot}{\sigma}\right)$, for instance,

$$f_0(x) = \frac{e^{-x}}{(1 + e^{-x})^2},$$

the logistic density. Such error densities have the often more realistic, heavier tails(1) than the Gaussian density. The estimates $\hat\beta_0$, $\hat\sigma_0$ now solve
$$\sum_{i=1}^n \psi\left(\frac{Y_i - z_i^T\hat\beta_0}{\hat\sigma_0}\right)z_{ij} = 0, \quad 1 \le j \le p, \qquad \sum_{i=1}^n \chi\left(\frac{Y_i - z_i^T\hat\beta_0}{\hat\sigma_0}\right) = 0,$$

where $\psi = -\frac{f_0'}{f_0}$, $\chi(y) = -\left(y\frac{f_0'}{f_0}(y) + 1\right)$, and $\hat\beta_0 = (\hat\beta_{10}, \dots, \hat\beta_{p0})^T$. The assumptions of Theorem 6.2.2 may be shown to hold (Problem 6.2.9) if

(i) $\log f_0$ is strictly concave, i.e., $\frac{f_0'}{f_0}$ is strictly decreasing;

(ii) $(\log f_0)''$ exists and is bounded.

Then, if further $f_0$ is symmetric about 0,

$$I(\theta) = \sigma^{-2}I\big((\beta^T, 1)^T\big) = \sigma^{-2}\begin{pmatrix} c_1 E(ZZ^T) & 0 \\ 0 & c_2 \end{pmatrix} \qquad (6.2.29)$$

where

$$c_1 = \int\left(\frac{f_0'(x)}{f_0(x)}\right)^2 f_0(x)\,dx, \qquad c_2 = \int\left(x\frac{f_0'(x)}{f_0(x)} + 1\right)^2 f_0(x)\,dx.$$

Thus, $\hat\beta_0, \hat\sigma_0$ are optimal estimates of $\beta$ and $\sigma$ in the sense of Theorem 6.2.2 if $f_0$ is true. Now suppose the $f_0$ generating the estimates $\hat\beta_0$ and $\hat\sigma_0$ is symmetric and satisfies (i) and (ii), but the true error distribution has density $f$ possibly different from $f_0$. Under suitable conditions we can apply Theorem 6.2.1 with

$$\psi_j(z, y, \beta, \sigma) = \psi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right)z_j, \ 1 \le j \le p; \qquad \psi_{p+1}(z, y, \beta, \sigma) = \chi\left(\frac{y - \sum_{k=1}^p z_k\beta_k}{\sigma}\right) \qquad (6.2.30)$$

to conclude that

$$\mathcal{L}\big(\sqrt{n}(\hat\beta_0 - \beta_0)\big) \to N(0, \Sigma(\Psi, P)), \qquad \mathcal{L}\big(\sqrt{n}(\hat\sigma - \sigma_0)\big) \to N(0, \sigma^2(\Psi, P)),$$

where $\beta_0, \sigma_0$ solve

$$\int\psi\left(\frac{y - z^T\beta_0}{\sigma_0}\right)z\,dP = 0, \qquad \int\chi\left(\frac{y - z^T\beta_0}{\sigma_0}\right)dP = 0 \qquad (6.2.31)$$

and $\Sigma(\Psi, P)$ is as in (6.2.6). What is the relation between $\beta_0, \sigma_0$ and the $\beta, \sigma$ given in the Gaussian model (6.2.19)? If $f_0$ is symmetric about 0 and the solution of (6.2.31) is unique, then $\beta_0 = \beta$. But $\sigma_0 = c(f_0)\sigma$ for some $c(f_0)$ typically different from one. Thus, $\hat\beta_0$ can be used for estimating $\beta$, although if the true distribution of the $\epsilon_i$ is $N(0, \sigma^2)$ it should perform less well than $\hat\beta$. On the other hand, $\hat\sigma_0$ is an estimate of $\sigma$ only if normalized by a constant depending on $f_0$. (See Problem 6.2.5.) These are issues of robustness; that is, to have a bounded sensitivity curve (Section 3.5, Problem 3.5.8), we may well wish to use a nonlinear bounded $\Psi = (\psi_1, \dots, \psi_p)^T$ to estimate $\beta$ even though it is suboptimal when $\epsilon \sim N(0, \sigma^2)$, and to use a suitably normalized version of $\hat\sigma_0$ for the same purpose. One effective choice of $\psi_j$ is the Huber function defined in Problem 3.5.8. We will discuss these issues further in Section 6.6 and Volume II. □
Testing and Confidence Bounds

There are three principal approaches to testing hypotheses in multiparameter models: the likelihood ratio principle, Wald tests (a generalization of pivots), and Rao's tests. All of these, together with the confidence regions that parallel the tests, will be developed in Section 6.3. The three approaches coincide asymptotically but differ substantially, in performance and computationally, for fixed n. Optimality criteria are not easily stated even in the fixed sample case and not very persuasive except perhaps in the case of testing hypotheses about a real parameter in the presence of other nuisance parameters, such as $H : \theta_1 \le 0$ versus $K : \theta_1 > 0$ where $\theta_2, \dots, \theta_p$ vary freely.

6.2.3 The Posterior Distribution in the Multiparameter Case

The asymptotic theory of the posterior distribution parallels that in the one-dimensional case exactly. We simply make $\theta$ a vector and interpret $|\cdot|$ as the Euclidean norm in conditions A7 and A8. Using multivariate expansions as in B.8 we obtain

Theorem 6.2.3. If the multivariate versions of A0-A3, A4(a.s.), A5(a.s.), and A6-A8 hold then, if $\hat\theta$ denotes the MLE, the posterior distribution of $\sqrt{n}(\theta - \hat\theta)$ converges a.s. under $P_\theta$, for all $\theta$, to the $N(0, I^{-1}(\theta))$ distribution. (6.2.32)

The consequences of Theorem 6.2.3 are the same as those of the one-dimensional result: the equivalence of Bayesian and frequentist optimality asymptotically. Again the two approaches differ at the second order when the prior begins to make a difference. See Schervish (1995) for some of the relevant calculations.

A new major issue that arises is computation. Although it is easy to write down the posterior density of $\theta$, up to the proportionality constant $\int_\Theta \pi(t)\prod_{i=1}^n p(X_i, t)\,dt$, typically there is an attempt at "exact" calculation, and the latter can pose a formidable problem if $p > 2$. The problem arises also when, as is usually the case, we are interested in the posterior distribution of some of the parameters, say $(\theta_1, \theta_2)$, because we then need to integrate out $(\theta_3, \dots, \theta_p)$. We have implicitly done this in the calculations leading up to (5.5.19). The asymptotic theory we have developed permits approximation to these constants by the procedure used in deriving (5.5.19) (Laplace's method). This approach is refined in Kass, Kadane, and Tierney (1989). A class of Monte Carlo based methods derived from statistical physics, loosely called Markov chain Monte Carlo, has been developed in recent years to help with these problems. These methods are beyond the scope of this volume but will be discussed briefly in Volume II.

Summary. We defined minimum contrast (MC) and M estimates in the case of p-dimensional parameters and established their convergence in law to a normal distribution. When the estimating equations defining the M estimates coincide with the likelihood equations, this result gives the asymptotic distribution of the MLE. We find that the MLE is asymptotically efficient in the sense that it has "smaller" asymptotic covariance matrix than that of any MC or M-estimate if we know the correct model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ and use the MLE for this model. We use an example to introduce the concept of adaptation, in which an estimate $\hat\theta$ is called adaptive for a model $\{P_{\theta,\eta} : \theta \in \Theta, \eta \in \mathcal{E}\}$ if the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$ has mean zero and variance matrix equal to the smallest possible for a general class of regular estimates of $\theta$ in the family of models $\{P_{\theta,\eta_0} : \theta \in \Theta\}$, $\eta_0$ specified. In the linear model with stochastic covariates, adaptive estimation of $\beta_1$ is possible iff $Z_1$ is uncorrelated with every linear function of $Z_2, \dots, Z_p$. Another example deals with M estimates based on estimating equations generated by linear models with non-Gaussian error distribution. Finally, we show that in the Bayesian framework where, given $\theta$, $X_1, \dots, X_n$ are i.i.d. $p(x, \theta)$, if $\hat\theta$ denotes the MLE, then the posterior distribution of $\sqrt{n}(\theta - \hat\theta)$ converges a.s. under $P_\theta$ to the $N(0, I^{-1}(\theta))$ distribution.

6.3 LARGE SAMPLE TESTS AND CONFIDENCE REGIONS

In Sections 4.9 and 6.1 we developed exact tests and confidence regions that are appropriate in regression and analysis of variance (ANOVA) situations when the responses are normally distributed. We shall show (see Section 6.6) that these methods in many respects are also approximately correct when the distribution of the error in the model fitted is not assumed to be normal. However, we need methods for situations in which, as in the linear model, covariates can be arbitrary but responses are necessarily discrete (qualitative) or nonnegative and Gaussian models do not seem to be appropriate approximations. Moreover, in many experimental situations in which the likelihood ratio test can be used to address important questions, the exact critical value is not available analytically. In these cases exact methods are typically not available, and we turn to asymptotic approximations to construct tests, confidence regions, and other methods of inference. In this section we will use the results of Section 6.2 to extend some of the results of Section 5.4, where these questions were treated for $\theta$ real, to vector-valued parameters.

6.3.1 Asymptotic Approximation to the Distribution of the Likelihood Ratio Statistic

In Section 4.9 we considered the likelihood ratio test statistic

$$\lambda(x) = \frac{\sup\{p(x, \theta) : \theta \in \Theta\}}{\sup\{p(x, \theta) : \theta \in \Theta_0\}}$$

for testing $H : \theta \in \Theta_0$ versus $K : \theta \in \Theta_1$, $\Theta_1 = \Theta - \Theta_0$, and showed that in several statistical models involving normal distributions, $\lambda(X)$ simplified and produced intuitive tests whose critical values can be obtained from the Student t and $\mathcal{F}$ distributions. However, in many experimental situations in which the likelihood ratio test can be used, the exact critical value is not available analytically. In such
!. I.\(Y) £ X.2.2 with 9(1) = log(l + I).(Jol. we derived e e 2log >'(Y) ~ n log ( 1 + 2::" L. Note that for J. 2 log >'(Y) = Vn ". under H .. (72) ranges over an r + Idimensional Example 6.3.3.1. B). ranges over a q + Idimensional manifold.d.3. . y'ri(On . 21og>. I I i Example 6.3. Because .logp(X.(X) ~ 1 . hy Corollary B..i.q' Finally apply Lemma 5.7 to Vn = I:~_q+lXl/nlI:~ r+IX. Eln«(Jo) ~ I«(Jo).2. Here ~ ~ j. an expansion of lnUI) about en evaluated at 8 = 0 0 gives 2[ln«(Jn) In((Jo)] = n«(Jn . an = n.q also in the 0.~ L H .(Jol. where x E Xc R'. If Yi are as in = (IL. "'" (6. . X. N(o. case with Xl.. j . where Irxr«(J) is the Fisher information matrix...2 but (T2 is unknown then manifold whereas under H." .(Jo) In«(Jn)«(Jn . . ". I1«(Jo)).. and conclude that Vn S X. we can conclude arguing from A. By Theorem 6.· ..2) V ~ N(o. .Xn a sample from p(x. c = and conclude that 210g . I ..2 are satisfied. The result follows because. The Gaussian Linear Model with Unknown Variance._q as well. In Section 6.1. and (J E c W.. (6. I(J~ .(In[ < l(Jn .. o . . (J = (Jo.. . = 2[ln«(Jn) In((Jo)] ~ ~ ~ £ X.1 ((Jo)).3. Then. Suppose the assumptions of Theorem 6. : . I( I In((J) ~ n 1 n L i=l & & &" &". Write the log likelihood as e In«(J) ~ L i=I n logp(X i . .. .3.i=q+l 'C'" I=r+l Xi' ) X. V T I«(Jo)V ~ X.3.(Jol < I(J~ ...3.3. 1 • ! We tirst consider the simple hypothesis H : 6 = 8 0 _ Theorem 6. i\ I . where DO is the derivative with respect to e.(J) Uk uJ rxr c'· • 1 .2.2. .Onl + IOn  (Jol < 210n . ·'I1 .3 and AA that that In((J~) "'" 2[ln((Jn) In((Jo)] £r ~ V I((Jo)V.6. Hence.(Y) defined in Remark ° 6.2.1. "'" T. Proof Because On solves the likelihood equation DOln(O) = 0. ! .(Jo) ".394 Inference in the Multiparameter Case Chapter 6 1 .1) " I. .. 0 Consider the general i.(Jo) for some (J~ with [(J~ . Apply Example 5. 0).2 unknown case.
Section 6.T. Examples 6.a.(9 0 .0) is the 1 and C!' quantile of the X.3.8.n = (80 .. where x r (1.)1'.0 0 ).3 Large Sample Tests and Confidence Regions 395 = As a consequence of the theorem.(X) 8 0 when > x. T (1) (2) Proof. 8 2)T where 8 1 is the first q coordinates of8.)T. Then under H : 0 E 8 0. ~ {Oo : 2[/ n (On) /n(OO)] < x..8q ).35) where 8(00) = n 1/2 L Dl(X" 0) i=1 n and 8 = (8 1 .j. Suppose that the assumptions of Theorem 6.0(1) = (8 1" ". ..n and (6...81(00). for given true 0 0 in 8 0 • '7=M(OOo) where.6) .+l. We set d ~ r .(Ia). where e is open and 8 0 is the set of 0 E e with 8j = 80 . OT ~ (0(1). . the test that rejects H : () 2Iog>.' ". 0).3. 0 E e.n) /n(Oo)l.80. and {8 0 .2.3.". j ~ q + 1.q. 0b2) = (80"+1"..0(2». By (6.2. Let 00.1) applied to On and the corresponding argument applied to 8 0 . distribution.j} are specified values.3.1 and 6.0:'.)T..3.3.3.(] .2[1.2 hold for p(x. 0(2) = (8. 10 (0 0 ) = VarO.3.3) is a confidence region for 0 with approximat~ coverage probability 1 .4).2 illustrate such 8 0 .10) and (6. SupposetharOo istheMLEofO underH and that 0 0 satisfiesA6for Po.8. dropping the dependence on 6 0 • M ~ PI 1 / 2 (6. Furthennore. has approximately level 1. (6. Let 0 0 E 8 0 and write 210g >'(X) ~ = 2[1n(9 n )  In(Oo)] . (JO. ~(11 ~ (6. Let Po be the model {Po: 0 E eo} with corresponding parametn"zation 0(1) = ~(1) (1) ~(1) (8 1. Theorem 6.4) It is easy to see that AOA6 for 10 imply AOA5 for Po.a)) (6. Next we tum to the more general hypothesis H : 8 E 8 0 . Make a change of parameter.2..
• 1 y{X) = sup{P{x.+I> . Me : 1]q+l = TJr = a}. ..1'1 '1 E eo} and from (B.Tf(O)T . .0: confidence region is i ..+ 1 .9).8. Thus.8) that if T('1) then = 1 2 n.6) and (6.1ITD III(x.• I . His {fJ E applying (6.396 Inference in the Multiparameter Case Chapter 6 and P is an orthogonal matrix such that.. a piece of 8.1/ 2 p = J. because in tenus of 7].(X) = y{X) where ! 1 1 . (6.3..9) . Or)J < xr_. is invariant under reparametrization .00 : e E 8 0} MAD = {'1: '/.3. if A o _ {(} .1 '1) = [M..In( 00 .0.1 because /1/2 Ao is the intersection of a q dirnensionallinear subspace of R r with JI/2{8 .1 '1). Such P exists by the argument given in Example 6.2. rejecting if .. = 0.l../ L i=I n D'1I{X" liD + M.a) is an asymptotically level a test of H : 8 E Of equal importance is that we obtain an asymptotic confidence region for (e q+ 1.00 .(1 .1I0 + M. ..II)..1I0 + M... I I eo~ ~ ~ {(O. ~ 'I. We deduce from (6.3.(X) > x r .1> .. J) distribution by (6.+1 ~ . with ell _ . 1e ).(l .1 '1ll/sup{p(x.3.5) to "(X) we obtain.2.1I 0 + M. by definition. then by 2 log "(X) TT(O)T(O) .I '1): 11 0 + M.7).3. ). 18q acting as r nuisance parameters.. Moreover.3. • .3.l. .3. = . This asymptotic level 1 .3. which has a limiting X~q distribution by Slutsky's theorem because T(O) has a limiting Nr(O. ••. under the conditions of Theorem 6. : • Var T(O) = pT r ' /2 II.11) . Tbe result follows from (6. (O) r • + op(l) (6.LTl(o) + op(l) i=l i=l L i=q+l r Ti2 (0) + op(l).. Now write De for differentiation with respect to (J and Dry for differentiation with respect to TJ.10) LTl(o) .8 0 : () E 8}.all (6..13) D'1I{x.2. Note that. 0 Note that this argument is simply an asymptotic version of the one given in Example 6.3. . IIr ) : 2[ln(lI n ) . '1 E Me}.
e:::S:::. .l:::.3.d..13).3. Theorem 6.3.. Then...3.80.::Se::. (6.cTe:::.3.2 and the previously conditions on g. It can be extended as follows. 210g A(X) ".:::m..3.i. Suppose the assumptions of Theorem 6. Suppose the MLE hold (JO. 8.3. let (XiI.. X.8 0 E Wo where Wo is a linear space of dimension q are also covered. If 81 + B2 = 1.3.. v r span " R T then. which require the following general theorem... The formulation of Theorem 6. q + 1 < j < r.6. R. .n under H is consistent for all () E 8 0.q at all 0 E 8.. 9j .12) The extension of Theorem 6..3..3. of f)l" . Vl' are orthogonal towo and v" . . ...3. themselves depending on 8 q + 1 .:::f. Wilks's theorem depends critically On the fact that nOt only is open but that if giveu in (6. (6. where J is the 2 x 2 identity matrix aud 8 0 = {O : B1 + O < I}. . such that Dg(O) exists and is of rauk r .::. if A(X) is ~ the likelihood ratio statistic for H : 0 E 8 0 given in (6..g. More sophisticated examples are given in Problems 6. Defiue H : 0 E 8 0 with e 80 = {O E 8 : g(O) = o}. Here the dimension of 8 0 and 8 is the same but the boundary of 8 0 has lower dimension.t. 2' 2 + X 2 )) 2 and 210g A(X) ~ xf but if 81 + 82 < 1 clearly 2 log A(X) ~ op(I).3).. We need both properties because we need to analyze both the numerator and denominator of A(X). Theorem 6.. Suppose H is specified by: There exist d functions. Examples such as testing for independence in contingency tables.2.. if 0 0 is true.( 0 ) ~ o} (6. X~q under H. .2 falls under this schema with 9.. q + 1 <j < r}.2 to this situation is easy and given in Problem 6.) be i...3.3..3.. 2 e eo Ii o ~ (Xl + X 2) 2 + ~ 1 _ (X. The esseutial idea is that... J).. A(X) behaves asymptotically like a test for H : 0 E 8 00 where 800 = {O E 8: Dg(Oo)(O .13) then the set {(8" .o''C6::. "'0 ~ {O : OT Vj = 0.3.. " ' J v q and Vq+l " .. . t. B2 . .3.(0) ~ 8j . 8q o assuming that 8q + 1 . The proof is sketched iu Problems (6.13) Evideutly.3. As an example ofwhatcau go wrong.14) a hypothesis of the fonn (6.p:::'e:.13). l B.2 is still inadequate for most applications.."de"'::'"e:::R::e"g. q + 1 < j < r written as a vector g.. N(B 1 . We only need note that if WQ is a linear space spanned by an orthogonal basis v\. will appear in the next section..2)(6. More complicated linear hypotheses such as H : 6 .8r are known.'! are the MLEs.5 and 6.3. 8q)T : 0 E 8} is opeu iu Rq.o:::"'' 397 CC where 00.'':::':::"::d_C:::o:::.
Y .3. .7. I .fii(1J IJ) ~N(O.a)} is an ellipsoid in W easily interpretable and computablesee (6. ~ ~ 00.31).4.7.3.l(a) that I(lJ n ) ~ I(IJ) asn By Slutsky's theorem B.. I.2.15) Because I( IJ) is continuous in IJ (Problem 6.16).3.2.lJ n ) where IJ n statistic as ~(1) ~(2) ~(1) = (0" .2 hold.3. in particular .3.. it follows from Proposition B.q' ~(2) (6. X.2 Wald's and Rao's Large Sample Tests The Wald Test Suppose that the assumptions of Theorem 6. respectively.398 Inference in the Multiparameter Case Chapter 6 6.3.2. Slutsky's it I' I' ..2.. For the more general hypothesis H : (} E 8 0 we write the MLE for 8 E as = ~ ! I: (IJ n . (6.15) and (6. according to Corollary B.rl(IJ» asn ~ p ~ L 00. . But by Theorem 6. (6. .X. [22 is continuous. I22(lJo)) if 1J 0 E eo holds. 1(8) continuous implies that [1(8) is continuous and. The last Hessian choice is ~ ~ .) and IJ n ~ ~ ~(2) ~ (Oq+" . hence. the Hessian (problem 6.. y T I(IJ)Y . It and I(8 n ) also have the advantage that the confidence region One generates {6 : Wn (6) < xp(l .1.2.~ D 2ln (IJ n ).10).0.Nr(O.fii(lJn (2) L 1J 0 ) ~ Nd(O. More generally.: b. y T I(IJ)Y. r 1 (1J)) where.9). the lower diagonal block of the inverse of . theorem completes the proof. Under the conditions afTheorem 6. More generally /(9 0 ) can be replaced by any consistent estimate of I( 1J0 ).6.. . ~ Theorem 6. for instance.Or) and define the Wald (6..3. .17) ~ ~ e ~ en 1 • I I Wn(IJ~2)) ~ n(9~2) 1J~2)l[I"(9nJrl(9~) _1J~2») where I 22 (IJ) is the lower diagonal block of II (IJ) written as I2 I 1 _ (Ill(lJ) (IJ) I21(IJ) I (1J)) I22(IJ) with diagonal blocks of dimension q x q and d x d.2. Then . has asymptotic level Q.16) n(9 n _IJ)TI(9 n )(9 n IJ) !:. It follows that the Wald test that rejects H : (J = 6 0 in favor of K : () • i= 00 when .~ D 2 1n ( IJ n). i favored because it is usually computed automatically with the MLE. .~ D 2l n (1J 0 ) or I (IJ n ) or .2. If H is true.3.18) i 1 Proof. j Wn(IJ~2)) !:.3. i . (6.3. 0 ""11 F . I 22 (9 n ) is replaceable by any consistent estimate of J22( 8).
3.3. under H.19) : (J c: 80. asymptotically level Q. The Wald test leads to the Wald confidence regions for (B q +] . This test has the advantage that it can be carried out without computing the MLE.6.nlE ~ (2) _ T  . The Rao test is based on the statistic Rn(Oo ) ~n'1'n(Oo.n)[D.0:) is called the Rao score test..22) .9.3..3. as n _ CXJ.:El ((}o) is ~ n 1 [D.. 1(0 0 )) where 1/J n = n . = 2 log A(X) + op(l) (6.n W (80 . Let eo '1'n(O) = n. 2 Wn(Oo ) ~(2) where ...3. is.. Rn(Oo) = nt/J?:(Oo)r' (OO)t/Jn(OO) !.. the two tests are equivalent asymptotically. as (6. N(O. the Wald and likelihood ratio tests and confidence regions are asymptotically equivalent in the sense that the same conclusions are reached for large n. 112 is the upper right q x d block. The test that rejects H when R n ( (}o) > x r (1.I Dln ((Jo) is the likelihood score vector.2 that under H.. It follows from this and Corollary B. in practice they can be very different. What is not as evident is that. ~ B ) T given by {8(2) : r W n (0(2)) < x r _ q (1 These regions are ellipsoids in R d Although.. (J = 8 0 . a consistent estimate of .19) indicates.ln(8 0 . The argument is sketched in Problem 6.n] D"ln(Oo.nl]  2  1 . .n) ~  where :E is a consistent estimate of E( (}o). by the central limit theorem..3. and the convergence Rn((}o) ~ X. The Rao Score Test For the simple hypothesis H that. requires much weaker regularity conditions than does the corresponding convergence for the likelihood ratio and Wald tests. therefore.n under H.21) where III is the upper left q x q block of the r x r infonnation matrix I (80 ).3.8) E(Oo) = 1. (6.3.9) under AOA6 and consistency of (JO. Rao's SCOre test is based on the observation (6.3 Large Sample Tests and Confidence Regions 399 The Wold test.ln(Oo.' (0 0 )112 (00) (6. It can be shown that (Problem 6. ~1 '1'n(OO.. un. Furthermore.(00 )  121 (0 0 )/.. which rejects iff HIn (9b ») > x 1_ q (1 . the asymptotic variance of .Section 6.Q). and so on.n) 2  + D21 ln (00.20) vnt/Jn(OO) !.1/ 2 D 2 In (0) where D 1 l n represents the q x 1 gradient with respect to the first q coordinates and D 2 l n the d x 1 gradient with respect to the last d. X. The extension of the Rao test to H : (J E runs as follows.n) under n H.\(X) is the LR statistic for H Thus. (Problem 6.
We established Wilks's theorem. I I . . e(l). Consistency for fixed alternatives is clear for the Wald test but requires conditions for the likelihood ratio and score testssee Rao (1973) for more on this. I .) . Summary.q distribution under H.a)} and • The Rao large sample critical and confidence regions are {R n (Ob » {0(2) : Rn (0(2) < xd(l. which measures the distance between the hypothesized value of 8(2) and its MLE. where X~ (')'2) is the noncentral chi square distribution with m degrees of freedom and noncentrality parameter "'? It may he shown that the equivalence (6. under regularity conditions. it shares the disadvantage of the Wald test that matrices need to be computed and inverted. The asymptotic distribution of this quadratic fonn is also q• I e X: X: 6. On the other hand. Power Behavior of the LR. The advantage of the Rao test over those of Wald and Wilks is that MLEs need to be computed only under H. which stales that if A(X) is the LR statistic. .2 but with Rn(lJb 2 1 1 » !:.2.3. X~' 2 j > xd(1 .. for the Wald test neOn .4 LARGE SAMPLE METHODS FOR DISCRETE DATA In this section we give a number of important applications of the general methods we have developed to inference for discrete data. . . We also considered a quadratic fonn. then.3.a)}.)I(On)( vn(On  On) + A) !:. Finally. .19) holds under On and that the power behavior is unaffected and applies to all three tests. 0(2). X.5. and so on. In particular we shall discuss problems of . We considered the problem of testing H : 8 E 80 versus K : 8 E 8 ._q(A T I(Oo)t.On) + t. b .. called the Wald statistic. .0 0 ) I(On)(On ..8 0 where is an open subset of R: and 8 0 is the collection of 8 E 8 with the last r ~ q coordinates 0(2) specified.400 Inference in the Multiparameter Case Chapter 6 where D~ is the d x d matrix of second partials of ill with respect to matrix of mixed second partials with respect to e(l). For instance. which is based on a quadratic fonn in the gradient of the log likelihood. D 21 the d x d Theorem 6. 2 log A(X) has an asymptotic X. and showed that this quadratic fonn has limiting distribution q . Under H : (J E A6 required only for Po eo and the conditions ADAS a/Theorem 6. . The analysis for 8 0 = {Oo} is relatively easy. . we introduced the Rao score test. and Wald Tests It is possible as in the onedimensional case to derive the asymptotic power for these tests for alternatives of the form On = ~ eo + ~ where 8 0 E 8 0 . Rao.0 0 ) = T'" "" (vn(On .
.OOi)(B. j=l .6.5.. k Wn(Oo) = L(N.nOo. L(9. and 2.32) thaI IIIij II. j = 1. 6. k.()k_l)T and test the hypothesis H : ()j = ()OJ for specified (JOj. .2. ~ N.d.2.4 Large Sample Methods for Discret_e_D_a_ta ~_ _ 401 goodnessoffit and special cases oflog linear and generalized linear models (GLM).Section 6. For i.1.0).3. I ij [ . + n LL(Bi . It follows that the large sample LR rejection region is ~ k 2IogA(X) = 2 LN. we need the information matrix I = we find using (2.7. consider i.+ ] 1 1 0. Let ()j = P(Xi = j) be the probability of the jth category.) > Xkl(1. we may be testing whether a random number generator used in simulation experiments is producing values according to a given distribution. .3. . # j. trials in which Xi = j if the ith trial prcx:luces a result in the jth category.. . ()j.8 we found the MLE 0.33) and (3. j=1 Do. .lnOo." . Pearson's X2 Test As in Examples 1. 2.) /OOk = n(Bk .. . k1. where N.4. the Wald statistic is klkl Wn(Oo) = n LfB. treated in more detail in Section 6.00. .OO.)/OOk.00k)2lOOk.. . k . or we may be testing whether the phenotypes in a genetic experiment follow the frequencies predicted by theory..]2100. log(N. Ok Thus.8. with ()Ok =  1 1 E7 : kl j=l ()OJ.1 GoodnessofFit in a Multinomial Model. . Thus.4.2.i.. Because ()k = 1 . j = 1. we consider the parameter 9 = (()l. In Example 2.. In.)2InOo. = L:~ 1 1{Xi = j}. j=1 To find the Wald test.E~. Ok if i if i = j. j=l i=1 The second term on the right is kl 2 n Thus. j = 1.
. where N = (N1 . by A.1S. 80j 80k 1 and expand the square keeping the square brackets intact. we could invert] or note that by (6. Then.4. The second term on the right is n [ j=1 (!. ~ . It is easily remembered as 2 = SUM (Observed .L .L .~) 8 80j 80 k ~ ~ ~ ] 2 0j = n [~ 80 k ~ 1] 2 To simplify the first term on the right of (6. note that from Example 2. ..~ 8 8 0j 0k  2 80j ) ( n LL _ _ L kl kl kl ( J=1 z=1 ~ 8_Z 80i ~) 8k 80k (~ 8 _J 80J ~) ) 8k . .Expected)2 Expected (6.. ~ = Ilaijll(kl)X(kl) with Thus. the Rao statistic is . n ( L kl ( j=1 ~ ~) !.11).4. . with To find ]1. '" .8..4. 80i 80j 80k (6.402 Inference in the Multiparameter Case Chapter 6 The term on the right is called Pearson's chisquare (X 2 ) statistic and is the statistic that is typically used for this multinomial testing problem..80k).. . 8k 80 k = {[8 0k (8 j  80j )]  [8 0j (8 k ~  80k ) ] } .13.80j ) = (Ok . because kl L(Oj . we write ~ ~  8j 80j .Nk_d T and.4..1) X where the sum is over categories and "expected" refers to the expected frequency E H (Nj ).1) of Pearson's X2 will reappear in other multinomial applications in this section..2.. j=1 .2).2. The general form (6. ]1(9) = ~ = Var(N). To derive the Rao test.2) .
where 8 0 is a composite "smooth" subset of the (k . .k. nOlO = 312. However.1) dimensional parameter space k 8={8:0i~0. 4 as above and associated probabilities of occurrence fh. 3. n4 = 32. i=1 For example. in the HardyWeinberg model (Example 2.fh. . If we assume the seeds are produced independently.4. Mendel's theory predicted that 01 = 9/16.75)2 104. (2) wrinkled yellow.1. n2 = 101. LOi=l}. 2. fh = 03 = 3/16. and we want to test whether the distribution of types in the n = 556 trials he performed (seeds he observed) is consistent with his theory. n3 = 108.1.: . Possible types of progeny were: (1) round yellow. 02. There is insufficient evidence to reject Mendel's hypothesis. Mendel observed the different kinds of seeds obtained by crosses from peas with round yellow seeds and peas with wrinkled green seeds.4.Section 6. In experiments on pea breeding. we can think of each seed as being the outcome of a multinomial trial with possible outcomes numbered 1. which is a onedimensional curve in the twodimensional parameter space 8.75. which has a pvalue of 0.48 in this case. Mendel observed nl = 315..25 + (2.2) becomes It follows that the Rao statistic equals Pearson 's X2. For comparison 210g A = 0. Example 6. k = 4 X 2 = (2.75 . 7. 8). Contingency Tables Suppose N = (NI . M(n. Nk)T has a multinomial.9 when referred to a X~ table. 04 = 1/16.25 + (3.i::. (3) round green. n040 = 34. 04.2 GoodnessofFit to Composite Multinomial Models. 8 0 .75 + (3.. and (4) wrinkled green.25.4 Large Sample Methods for Discrete Data 403 the first term on the right of (6. j. • 6.75. Testing a Genetic Theory. Here testing the adequacy of the HardyWeinberg model means testing H : 8 E 8 0 versus K : 8 E .4. this value may be too small! See Note 1.25)2 312. distribution. n020 = n030 = 104.4).75)2 = 04 34. l::. Then.25)2 104. We will investigate how to test H : 0 E 8 0 versus K : 0 ¢:.
X approximately has a X. If £ is not open sometimes the closure of £ will contain a solution of (6. . approximately X... Maximizing p(n1. We suppose that we can describe 8 0 parametrically as where 11 = ('r/1. . which will be pursued further later in this section..4. To apply the results of Section 6.. the log likelihood ratio is given by log 'x(nb .4. . then it must solve the likelihood equation for the model.... l~j~q. If a maximizing value. under H. nk) = L ndlog(ni/n) log ei(ij)]. The Rao statistic is also invariant under reparametrization and.X approximately has a xi distribution under H.. .1.. e~) T ranges over an open subset of Rq and ej = eOj .r."" 'r/q) T.nk. 1 ~ j ~ q or k (6.. involve restrictions on the e obtained by specifying independence assumptions on i classifications of cases into different categories.. .3) Le.1. ek(11))T takes £ into 8 0 . j = 1. 8) for 8 E 8 0 is the same as maximizing p( n1. and ij exists. j Oj is. . .ne j (ij)]2 = 2 ne.. . j = q + 1.4. The Wald statistic is only asymptotically invariant under reparametrization.("") X J 11 I where the righthand side is Pearson's X2 as defined in general by (6. The algebra showing Rn(8 0 ) = X2 in Section 6. However... . e~ = e2 . 8(11)) for 11 E £.. r.3. 8 0 • Let p( n1. 8) denote the frequency function of N.3 that 2 log ." For instance.r for specified eOj . is a subset of qdimensional space. ij satisfies l a a'r/j logp(nl.. . and the map 11 ~ (e 1(11). . i=l k If 11 ~ e( 11) is differentiable in each coordinate..4. a 11 'r/J . nk.404 Inference in the Multiparameter Case Chapter 6 8 1 where 8 1 = 8 . Other examples.1).. .q' Moreover.2) by ej(ij). the Wald statistic based on the parametrization 8 (11) obtained by replacing e by e (ij). To avoid trivialities we assume q < k . nk.q) exists. That is.. 8(11)) : 11 E £}. 2 log .4. 8(11)) = 0. Consider the likelihood ratio test for H : e E 8 0 versus K : e 1:. also equal to Pearson's X2 • . £ is open.(t)~ei(11)=O.3).8 0 . {p(. .q distribution for large n. .l now leads to the Rao statistic R (8("") n 11 =~ ~ j=l [Ni . . where 9j is chosen so that H becomes equivalent to "( e~ . by the algebra of Section 6. .2~(1 . i=l t n. we define ej = 9j (8)..~) and test H : e~ = O. thus. ij = (iiI. nk.4. we obtain the Rao statistic for the composite multinomial hypothesis by replacing eOj in (6. to test the HardyWeinberg model we set e~ = e1 . Then we can conclude from Theorem 6. .3 and conclude that.. .. .
{}21. .4. then (Nb . TJ). A selfcrossing of maize heterozygous on two characteristics (starchy versus sugary. If Ni is the number of offspring of type i among a total of n offspring. T() test the validity of the linkage model we would take 8 0 {G(2 + TJ).. HardyWeinberg. nl (2+TJ) (n2 + n3) n4 (1TJ) +:ry 0. The results are assembled in what is called a 2 x 2 contingency table such as the one shown.. The Fisher Linkage Model. 1} a "onedimensional curve" of the threedimensional parameter space 8. 1958. AB. (}4) distribution. respectively. k 4. The only root of this equation in [0. AB. i(1 .a) with if (2nl + n2)/2n.• . To study the relation between the two characteristics we take a random sample of size n from the population. iTJ) : TJ :. (2nl + n2)(2 3 + n2) . We often want to know whether such characteristics are linked or are independent. The likelihood equation (6.Section 6.4. we obtain critical values from the X~ tables. . (4) starchygreen. is or is not a smoker.4. Because q = 1. 9(ij) ((2nl + n2) 2n 2n 2. ij {}ij = (Bil + Bi2 ) (Blj + B2j ). 0 Testing Independence of Classifications in Contingency Tables Many important characteristics have only two categories. (6. 301). We found in Example 2. ( 2n3 + n2) 2) T 2n 2n o Example 6. do smoking and lung cancer have any relation to each other? Are sex and admission to a university department independent classifications? Let us call the possible categories or states of the first characteristic A and A and of the second Band B. (2) sugarygreen. (3) starchywhite. . For instance. {}22..1). and so on.3) becomes i(1 °:.4. A linkage model (Fisher.4 Methods for Discrete Data 405 Example 6. p.6 that Thus.4. {}12.5.4. AB. green base leaf versus white base leaf) leads to four possible offspring types: (1) sugarywhite. An individual either is or is not inoculated against a disease. is male or female. specifies that where TJ is an unknown number between 0 and 1. H is rejected if X2 2:: Xl (1 . N 4 ) has a M(n. (h. Then a randomly selected individual from the population can be one of four types AB. Independent classification then means that the events [being an A] and [being a B] are independent or in terms of the B .2. 1] is the desired estimate (see Problem 6.TJ). Denote the probabilities of these types by {}n.4) which reduces to a quadratic equation in if. .
These solutions are the maximum likelihood estimates. 021 . the (Nij . 0 11 . Here we have relabeled 0 11 + 0 12 . ( 22 ).Ti2) (6.'r/2). 0 12 .4. if N = (Nll' N 12 . because k = 4. q = 2. the likelihood equations (6.RiCj/n) are all the same in absolute value and. where z tt [R~jl1 2=1 J=l .'r/1) (1 . for example N12 is the number of sampled individuals who fall in category A of the first characteristic and category B of the second characteristic. 'r/1 ::. Then. we have N rv M(n. 'r/2 to indicate that these are parameters. Pearson's statistic is then easily seen to be (6. For () E 8 0 . This suggests that X2 may be written as the square of a single (approximately) standard normal variable.4. (6. which vary freely.Tid (n12 + n22) (1 . 'r/1 (1 .'r/1). Thus.2). 'r/2 ::. 0 ::. (1 .~ + n12) iiI (nll + n2d (nll i72 (n21 + n22) (1 . 8 0 . N 22 )T. In fact (Problem 6.6) the proportions of individuals of type A and type B. By our theory if H is true.4.5) whose solutions are Til 'r/2 = (n11+ n 12)/n (nll + n21)/n. X2 has approximately a X~ distribution.7) where Ri = Nil + Ni2 is the ith row sum. 1.4.011 + 021 as 'r/1. C j = N 1j + N 2j is the jth column sum.406 Inference in the Multiparameter Case Chapter 6 l A 11 The entries in the boxes of the table indicate the number of individuals in the sample who belong to the categories of the appropriate row and column.'r/2)) : 0 ::.3) become . I}. We test the hypothesis H : () E 8 0 versus K : 0 f{. respectively. where 8 0 is a twodimensional subset of 8 given by 8 0 = {( 'r/1 'r/2. 'r/2 (1 .4. N 21 .
The N ij can be arranged in a a x b contingency table. that A is more likely to occur in the presence of B than it would in the presence of B).. If we take a sample of size n from a population and classify them according to each characteristic we obtain a vector N ij . i :::.. that is. respectively.. Z ~ z(l. (Jij : 1 :::..e. .Section 6.. a. b NIb Rl a Nal C1 C2 .3) that if A and B are independent. Next we consider contingency tables for two nonnumerical characteristics having a and b states.. b ~ 2 (e. Positive values of Z indicate that A and B are positively associated (i. . Nab Cb Ra n .. . b} "J M(n. Therefore. if and only if. B. then {Nij : 1 :::. a. a. . 1 Nu 2 N12 . . if X2 measures deviations from independence. j :::.. b). i :::. b where the TJil.. b where N ij is the number of individuals of type i for characteristic 1 and j for characteristic 2. B. Thus.4. 1).4. It may be shown (Problem 6..peA I B)] [~(B) ~(~)ll/2 peA) peA) where P is the empirical distribution and where we use A. and only if.8) Thus.a) as a level a onesided test of H : peA I B) = peA I B) (or peA I B) :::. a. i :::. j :::. ... . i = 1. eye color. .4.4 Large Sample Methods for Discrete Data 407 An important altemative form for Z is given by (6.. Z = v'n[P(A I B) . 1 :::. TJj2 are nonnegative and 2:~=1 = 2:~=1 TJj2 = 1. The X2 test is equivalent to rejecting (twosidedly) if.4. If (Jij = P[A randomly selected individual is of type i for 1 and j for 2].. . then Z is approximately distributed as N(O. B. peA I B) = peA I B). hair color). B to denote the event that a randomly selected individual has characteristic A..g. it is reasonable to use the test that rejects. j = 1. 1 :::. a. . 1 :::. peA I B)) versus K : peA I B) > peA I B). (Jij TJil The hypothesis that the characteristics are assigned independently becomes H : TJil TJj2 for 1 :::.. Z indicates what directions these deviations take. j :::..
log ( ~: ) .f. or (3) market research where a potential customer either desires a new product (Y = 1) or does not (Y = 0).11) When we use the logit transfonn g(7r).." We assume that the distribution of the response Y depends on the known covariate vector ZT. ' XI.{3p. a simple linear representation zT ~ for 7r(.3 Logistic Regression for Binary Responses In Section 6. (6. and the loglog transform g2(7r) = 10g[log(1 . we observe the number of successes Xi = L. The log likelihood of 7r = (7rl.~ =1 Zij {3j = unknown parameters {31. we observe independent Xl' . such as the probit gl(7r) = <I>1(7r) where <I> is the N(O.6. Because 7r(z) varies between 0 and 1. . 1) d.Li = L. we call Y = 1 a "success" and Y = 0 a "failure.4. Next we choose a parametric model for 7r(z) that will generate useful procedures for analyzing experiments with binary responses..4. i :::. log(l "il] + t. approximately normally distributed ~ for known constants {Zij} and and whose means are modeled as J.8 as the canonical parameter zT TJ = g ( 7r) = log [7r / (1 . (6.4.) over the whole range of z is impossible."" 7rk)T based on X = (Xl"'" Xk)T is t.. k.1 we considered linear models that are appropriate for analyzing continuous responses {Yi} that are."F~1 Yij where Yij is the response on the jth of the mi trials in block i. In this section we will consider Bernoulli responses Y that can only take on the values 0 and 1. Instead we turn to the logistic transform g( 7r). The argument is left to the problems as are some numerical applications. (2) election polls where a voter either supports a proposition (Y = 1) or does not (Y = 0). Thus. Maximum likelihood and dimensionality calculations similar to those for the 2 x 2 table show that Pearson's X2 for the hypothesis of independence is given by (6.408 Inference in the Multiparameter Case Chapter 6 with row and column sums as indicated. we obtain what is called the logistic linear regression model where . In this section we assume that the data are grouped or replicated so that for each fixed i. .9) which has approximately a X(al)(bl) distribution under H. B(mi' 7ri)' where 7ri = 7r(Zi) is the probability of success for a case with covariate vector Zi. Examples are (1) medical trials where at the end of the trial the patient has either recovered (Y = 1) or has not recovered (Y = 0).7r)] are also used in practice.10..7r)]. [Xi C :i"J log + m.4. whicll we introduced in Example 1. 6. with Xi binomial. Other transforms. perhaps after a transfonnation. 1 :::. As is typical. usually called the log it.
Li = E(Xi ) = mi1fi.log(l:'. .Section 6. the empirical logistic transform.4.1. j = 1.Li. mi 2mi Here the adjustment 1/2mi is used to avoid log 0 and log 00.[ i(l7r iW')· . Zi The log likelihood l( 7r(f3)) = == k (1..J) SN(O. j Thus. where Z = IIZij Ilrnxp is the design matrix.2 can be used to compute the MLE of f3.4.p.1 applies and we can conclude that if 0 < Xi < mi and Z has rank p. Because f3 = (ZTz) 1 ZT 'T] and TJi = 10g[1fi (1 1fi) in TJi has been replaced by 730 is a plugin estimate of f3 where 1fi and (1 1f. if N = 2:7=1 mi.~ Xi 2 1 +"2 ' (6. We let J.l..+ . i ::.15) Vi = log X+. Zi)T is the logistic regression model of Problem 2.13) By Theorem 2. W is estimated using Wo = diag{mi1f.+ . ..3..3. 7r Using the 8method. Tp) T. p (Jj Tj  ~ mi loge1+ exp{Zif3} ) + ~ log '.1fi)*}' + 00.f3p )T is. in (6..4.3. . .3.1 and by Proposition 3.L) = O. . Similarly. k m} (V. we can guarantee convergence with probability tending to 1 as N + 00 as follows.4. mi 2mi Xi 1 Xi 1 = .4. that if m 1fi > 0 for 1 ::.. The condition is sufficient but not necessary for existencesee Problem 2.16) 1fi) 1]. k ( ) (6.14) where W = diag{ mi1fi(l1fi)}kxk.J.! ) ( mi ..14).3.1fi )* = 1 . I N(f3) = f.3. IN(f3) of f3 = (f31. ( 1 .(1. It follows that the NILE of f3 solves E f3 (Tj ) = T .4. the likelihood equations are just ZT(X . The coordinate ascent iterative procedure of Section 2. Although unlike coordinate ascent NewtonRaphson need not converge. (6.4 the Fisher information matrix is I(f3) = ZTWZ (6.. Then E(Tj ) = 2:7=1 Zij J.4.4 Large Sample Methods for Discrete Data 409 The special case p = 2. As the initial estimate use (6.. Alternatively with a good initial value the NewtonRaphson algorithm can be employed.4. or Ef3(Z T X) = ZTX. Theorem 2. the solution to this equation exists and gives the unique MLE 73 of f3. Note that IN(f3) is the log likelihood of a pparameter canonical exponential model with parameter vector f3 and sufficient statistic T = (T1' .12) where T j = 2:7=1 ZijXi and we make the dependence on N explicit.. it follows by Theorem 5.
\ has an asymptotic X.4. Thus.w is denoted by D(y.4. and the MLEs of 1ri and {Li are Xdmi and Xi. f"V .6. it foHows (Problem 6.. ji)...~. The Binomial OneWay Layout. we set n = Rk and consider TJ E n. k < oosee Problem 6. i 1. In the present case.4.12) k zT D(X. i = 1.4. Suppose that k treatments are to be tested for their effectiveness by assigning the ith treatment to a sample of mi patients and recording the number Xi of patients that recover. .13.Wo 210g.4. suppose we want to compare k different locations with respect to the percentage that have a certain attribute such as the intention to vote for or against a certain proposition. then we can form the LR statistic for H : TJ E Wo versus K: TJ E w .4.4. recall from Example 1. In this case the likelihood is a product of independent binomial densities. the MLE of 1ri is 1ri = 9.:IITlPt'pr Case 6 Because Z has rankp. 2 log .\ for testing H : TJ E w versus K : TJ n . D(X.410 Inference in the Mllltin::lr. from (6."" Xk independent with Xi example.. The samples are collected indeB(1ri' mi). that is. ji) 2 I)Xi 10g(Xdfli) + XIlog(XI!flDJ i=1 (6. distribution for TJ E w as mi + 00. . by Problem 6.14) that {3o is consistent. Here is a special case.q distribution as mi + 00. To get expressions for the MLEs of 7r and JL.flm. By the multivariate delta method. ..11) and (6. For a second pendently and we observe Xl. Testing In analogy with Section 6..) +X. If Wo is a qdimensional linear subspace of w with q < r. we let w = {TJ : 'f]i = {3. .8 that the inverse of the logit transform 9 is the logistic distribution function Thus.IOg(E. The LR statistic 2 log . . fl) has asymptotically a X%r. As in Section 6. ji) measures the distance between the fit ji based on the model wand the data X.1 linear subhypotheses are important.17) where X: = mi Xi and M~ mi Mi. and for the ith location count the number Xi among mi that has the given attribute..4.4. where ji is the MLE of JL for TJ E w..1. We obtain k independent samples..k. one from each location. i = 1. Theorem 5.1 (L:~=l Xij(3j).)l {LOt (6. k.\=2t [Xi log i=1 (Ei.13. {3 E RP} and let r be the dimension of w. We want to contrast w to the case where there are no restrictions on TJ.18) {Lo~ where jio is the MLE of JL under H and fl~i = mi .1.3. . D(X. Example 6.
In the special case of testing independence of multinomial frequencies representing classifications in a twoway contingency table.mk7r)T." which equals the sum of standardized squared distances between observed frequencies and expected frequencies under H. and as in that section.3.5 GENERALIZED LINEAR MODELS Yi In Sections 6.3. We derive the likelihood equations. if J.1. (3 = L Zij{3j j=l p of covariate values. It follows from Theorem 2.3 to find tests for important statistical problems involving discrete data. z. and (3 = 10g[7r1(1 . The LR statistic is given by (6. we find that the MLE of 7r under His 7r = TIN. in the case of a Gaussian response. the Pearson statistic is shown to have a simple intuitive form. then J.4.L (ml7r.5 Generalized Linear Models 411 This model corresponds to the oneway layout of Section 6. The Pearson statistic 2 _ X  L k1 i=l (X "'(1) m'7r . and give the LR test. an important hypothesis is that the popUlations are homogenous.7r ~ i "')2 mi 7r is a Wald statistic and the X2 test is equivalent asymptotically to the LR test (Problem 6.Section 6. discuss algorithms for computing MLEs. N 2::=1 mi. we test H : 7rl = 7r2 7rk = 7r. Summary.Idimensional parameter space e.. 6.4. 7r E (0.N log(1+ exp{{3}) + ~ log ( k ~: ) where T = 2::=1 Xi. we considered logistic regression for binary responses in which the logit transformation of the probability of success is modeled to be a linear function of covariates. we give explicitly the MLEs and X2 test.13). Thus. When the hypothesis is that the multinomial parameter is in a qdimensional subset of the k . . where J.18) with JiOi mi7r.1.4. In particular. Using Z as given in the oneway layout in Example 6. Finally.Li = ~i. In the special case of testing equality of k binomial parameters.Li 7ri = gl(~i)' where gl(y) is the logistic distribution . We used the large sample testing results of Section 6.4. Under H the log likelihood in canonical exponential form is {3T .. the Wald and Rao statistics take a form called "Pearson's X2.Li of a response is expressed as a function of a linear combination ~i =. versus the alternative that the 7r'S are not all equal. the Rao statistic is again of the Pearson X2 form. J. We found that for testing the hypothesis that a multinomial parameter equals a specified value.7r)].1 that if 0 < T < N the MLE exists and is the solution of (6.3.4.15).3 we considered experiments in which the mean J.1).1 and 6.Li = EYij. In Section 6. .
. Y) where Y = (Yb . the GLM is the canonical subfamily of the original exponential family generated by ZTy. in which case 9 is also called the link function. Canonical links The most important case corresponds to the link being canonical. ••• . As we know from Corollary 1. . g = or 11 L J'j Z(j) = Z{3.1)..r Case 6 function. in which case JLi A~ (TJi). These Ii are independent Gaussian with known variance 1 and means JLi The model is GLM with canonical g(fL) fL. the mean fL of Y is related to 11 via fL = A(11).. Znj)T is the jth column vector of Z. More generally. = L:~=1 Zij{3.. but in a subset of E obtained by restricting TJi to be of the form where h is a known function.5. (ii) Log linear models. Special cases are: (i) The linear model with known variance. the identity.1) where 11 is not in E. that is. McCullagh and NeIder (1983. which is p x 1. 1989) synthesized a number of previous generalizations of the linear model.. g(fL) is of the form (g(JL1) .. Yn)T is n x 1 and ZJxn = (ZI. We assume that there is a onetoone transform g(fL) of fL.5.412 Inference in the 1\III1IITin'~r:>rn"'T.1. 11) given by (6. most importantly the log linear model developed by Goodman and Haberman."" zn) with Zi = (Zib"" Zip)T nonrandom and Y has density p(y. the natural parameter space of the nparameter canonical exponential family (6. fL determines 11 and thereby Var(Y) A( 11). Typically. such that g(fL) = L J'jZ(j) j=1 p Z{3 where Z(j) (Zlj. . Typically. g(JLn)) T. See Haberman (1974).6. Note that if A is oneone. . A (11) = Ao (TJi) for some A o. . j=1 p In this case.. The generalized linear model with dispersion depending only on the mean The data consist of an observation (Z. called the link junction.
a.2) It's interesting to note that (6. is that of independence Bij = Bi+B+j where Bi+ = 2:~=1 Bij . that Y = IIYij 111:Si:Sa. /1(. 1 ::.6. .2) (or ascertain that no solution exists).. 1 ::. The log linear label is also attached to models obtained by taking the Yi independent Bernoulli ((}i). ./1(. With a good starting point f3 0 one can achieve faster convergence with the NewtonRaphson algorithm of Section 2.5 Generalized linear Models 413 Suppose (Y1. Suppose. If we take g(/1) = (log J.3.8) is orthogonal to the column space of Z. j::. by Theorem 2.tp) T .7.5. and the log linear model corresponding to log Bij = f3i + f3j. 1 ::.1.1::. Bj > 0. say. In this case. 0 < (}i < 1 withcanonicallinkg(B) = 10g[B(1B)].. j ::. the .5. log J. 1f. in general. . f3j are free (unidentifiable parameters). The coordinate ascent algorithm can be used to solve (6 . B+ j = 2:~=1 Bij .. so that Yij is the indicator of. classification i on characteristic 1 and j on characteristic 2.5. See Haberman (1974) for a further discussion. . b. i ::. Then. "Ij = 10gBj. that procedure is just (6. NewtonRaphson coincides with Fisher's method of scoring described in Problem 6.2) can be interpreted geometrically in somewhat the same way as in the Gausian linear modelthe "residual" vector Y . . Then () = IIBij II. j ::. where f3i.5.3) where In this situation and more generally even for noncanonical links.4. 1 ::. Algorithms If the link is canonical. the link is canonical. ZTy or = ZT E13 y = ZT A(Z.Yp)T is M(n. 2:~=1 B = 1.4. as seen in Example 1.3. But.t1.8) (6. The models we obtain are called log linearsee Haberman (1974) for an extensive treatment. p.80 ~ f3 0 . This isjust the logistic linear model of Section 6..1.5. p are canonical parameters. if maximum likelihood estimates they necessarily uniquely satisfy the equation f3 exist.8) is not a member of that space. . for example.B 1. b.Bp). j ::..Section 6..
1.5. Yn ) T (assume that Y is in the interior of the convex support of {y : p(y.1](Y)) l(Y .2.2. D generally except in the Gaussian case. 1]) > o}). This quantity. (6. Testing in GLM Testing hypotheses in GLM is done via the LR statistic.D(Y . the correction Am+ l is given by the weighted least squares formula (2.5.1) for which p = n.4) That is.4).1](lLa)] for the hypothesis that IL = lLa within M as a "measure" of (squared) distance between Y and lLo. the variance covariance matrix is W m and the regression is on the columns ofW mZProblem 6. Write 1](') for AI. The LR statistic for H : IL E Wo is just D(Y. In that case the MLE of J1 is iL M = (Y1 .5. For the Gaussian linear model with known variance 0'5 (Problem 6.20) when the data are the residuals from the fit at stage m. as n ~ 00. Unfortunately tl =f. 210gA = 2[l(Y. iLo) . If Wo C WI we can write (6.Wo with WI ~ Wo is D(Y.5. In this context. iLl)' each of which can be thought of as a squared distance between their arguments.414 Inference in the Multiparameter Case Chapter 6 true value of {3. . iLo) . As in the linear model we can define the biggest possible GLM M of the form (6. We can then formally write an analysis of deviance analogous to the analysis of variance of Section 6. We can think of the test statistic . ••• .. This name stems from the following interpretation. called the deviance between Y and lLo.D(Y. iLl) == D(Y.6) a decomposition of the deviance between Y and iLo as the sum of two nonnegative components.5.13m' which satisfies the equation (6. the deviance of Y to iLl and tl(iL o. iLl) where iLl is the MLE under WI. then with probability tending to 1 the algorithm converges to the MLE if it exists. IL) : IL E wo} where iLo is the MLE of IL in Woo The LR statistic for H : IL E Wo versus K : IL E WI .5) is always ~ O. the algorithm is also called iterated weighted least squares. Let ~m+l == 13m+1 . iLo) == inf{D(Y.5.
.. Zn in the sample (ZI' Y 1 )..2.1 . .3). we can calculate i=l where {3 H is the (p xl) MLE for the GLM with {3~xl = (0.8) This is not unconditionally an exponential family in view of the Ao (z. which. .. in order to obtain approximate confidence procedures.3 applies straightforwardly in view of the general smoothness properties of canonical exponential families. asymptotically exists. . This can be made precise for stochastic GLMs obtained by conditioning on ZI. . (Zn' Y n ) has density (6. .7) More details are discussed in what follows. and 6. there are easy conditions under which conditions of Theorems 6. . = f3d = 0. . fil) is thought of as being asymptotically X.. d < p. 1. and can conclude that the statistic of (6. if we take Zi as having marginal density qo. y . Ao(Z..Section 6 . .2. if we assume the covariates in logistic regression with canonical link to be stochastic.. can be estimated by f = :EA(Z{3) where :E is the sample variance matrix of the covariates. is consistent with probability 1. .q.5. Thus. (3))Z.5.0. (Zn' Y n ) can be viewed as a sample from a population and the link is canonical..2.5. . then (ZI' Y1 ). (3) is (Yi and so.5. which we temporarily assume known..5 Generalized Linear Models 415 Formally if Wo is a GLM of dimension p and WI of dimension q with canonical links. so that the MLE {3 is unique.3 hold (Problem 6. then ll(fio. Asymptotic theory for estimates and tests If (ZI' Y 1 ). Y n ) from the family with density (6. . 6.9) What is ]1 ((3)? The efficient score function a~i logp(z . (3) term..3. Similar conclusions r '''''''. For instance.. and (6.5. ••• .2 and 6.3. However. the theory of Sections 6. f3p ). (Zn.10) is asymptotically X~ under H. f3d+l. we obtain If we wish to test hypotheses such as H : 131 = .
A) families.10) is of product form A( 11) [1/ c( r)] whereas the righthand side cannot always be put in this form. These conclusions remain valid for the usual situation in which the Zi are not random but their proof depends on asymptotic theory for independent nonidentically distributed variables. As he points out. we take g(JL) = <1>1 (JL) so that 7ri = <1> (zT!3) we obtain the socalled probit model. r)dy = 1. 11. Note that these tests can be carried out without knowing the density qo of Z1.1) depend on an additional scalar parameter r.5. JL ::. Existence of MLEs and convergence of algorithm questions all become more difficult and so canonical links tend to be preferred. the results of analyses with these various transformations over the range .3. The generalized linear model The GLMs considered so far force the variance of the response to be a function of its mean. (J2) and gamma (p.5.14) so that the variance can be written as the product of a function of the mean and a general dispersion parameter. and NeIder (1983. Cox (1970) considers the variance stabilizing transformation which makes asymptotic analysis equivalent to that in the standard Gaussian linear model. Because (6.1 (r)(11T y . For instance. From the point of analysis for fixed n. 1989).11) Jp(y.12) The lefthand side of (6. Important special cases are the N (JL. for c( r) > 0. (6. noncanonicallinks can cause numerical problems because the models are now curved rather than canonical exponential families.13) (6. r). . when it can. r) = exp{c. It is customary to write the model as. p(y.416 Inference in the Multiparameter Case Chapter 6 follow for the Wald and Rao statistics. if in the binary data regression model of Section 6. An additional "dispersion" parameter can be introduced in some exponential family models by making the function h in (6. which we postpone to Volume II.5. 11.9 are rather similar. then it is easy to see that E(Y) = A(11) Var(Y) = c(r)A(11) (6. .5.5.4.1 ::. However.A(11))}h(y.r)dy.5. then A(11)/c(r) = log! exp{c1 (r)11T y}h(y. For further discussion of this generalization see McCullagp. General link functions Links other than the canonical one can be of interest.
For this model it turns out that the LSE is optimal in a sense to be discussed in Volume II."n". whose further discussion we postpone to Volume II. We found that if we assume the error distribution fa.5).i. under further mild conditions on P.2.2. are discussed below.. y)T rv P that satisfies Ep(Y I Z) = ZT(3.d. then the LSE of (3 is. where the Z(j) are observable covariate vectors and (3 is a vector of regression coefficients.2.1) where E is independent of Z but ao(Z) is not constant and ao is assumed known.6 ROBUSTNESS PROPERTIES AND SEMIPARAMETRIC MODELS As we most recently indicated in Example 6." Of course. roughly speaking. the distributional and implicit structural assumptions of parametric models are often suspect.6 Robustness Pr. Ii 6. We discussed algorithms for computing MLEs of In the random design case.d. That is. if we consider the semiparametric model. We considered the canonical link function that corresponds to the model in which the canonical exponential model parameter equals the linear predictor. we use the asymptotic results of the previous sections to develop large sample estimation results. In Example 6. There is another semiparametric model P 2 = {P : (ZT. is "act as ifthe model were the one given in Example 6. of a linear predictor of the form Ef3j Z(j). These issues.2.2.r+i". called the link function. which is symmetric about 0.31) with fa symmetric about O.19) with Ei i. ____ . a 2 ) or. with density f for some f symmetric about O}. the right thing to do if one assumes the (Zi' Yi) are i. in fact. /3.1.Section 6.. the LSE is not the best estimate of (3. had any distribution symmetric around 0 (Problem 6. still a consistent asymptotically normal estimate of (3 and.i. so is any estimate solving the equations based on (6.8 but otherwise also postponed to Volume II.6. the GaussMarkov theorem. EpZTZ nonsingular. confidence procedures. Another even more important set of questions having to do with selection between nested models of different dimension are touched on in Problem 6. ". Y) has a joint distribution and if we are interested in estimating the best linear predictor /lL(Z) of Y given Z. have a fixed covariate exact counterpart.2.2. the resulting MLEs for (3 optimal under fa continue to estimate (3 as defined by (6. E p y2 < oo}. Furthermore. and tests. We considered generalized linear models defined as a canonical exponential model where the mean vector of the vector Y of responses can be written as a function. PI {P : (zT. we studied what procedures would be appropriate if the linearity of the linear model held but the error distribution failed to be Gaussian.J ..2. if P3 is the nonparametric model where we assume only that (Z. for estimating /lL(Z) in a submodel of P 3 with /3 (6. and Models 417 Summary. in fact.2.6. Y) rv P given by (6..19) even if the true errors are N(O.
our current Ji coincides with the empirical plugin estimate of the optimal linear predictor (1.Ej) = a La. . and Ji = 131 = Y. the conclusions (1) and (2) of Theorem 6.'. Ej). 0 Note that the preceding result and proof are similar to Theorem 1.4. N(O.an. . 418 Inference in the Multiparameter Case Chapter 6 Robustness in Estimation We drop the assumption that the errors E1. and by (B. The preceding computation shows that VarcM(Ci) = Varc(Ci). Ifwe replace the Gaussian assumptions on the errors E1.S. ..Yn . n = 0:.1. Varc(Ci) for all unbiased Ci. {3 is p x 1.1. in Example 6. Var(jj) = a 2 (ZTZ)1. Var(e) = a 2 (I .6. and ifp = r.i. JL = f31. Instead we assume the GaussMarkov linear model where (6. and a is unbiased.14).6. Because Varc = VarCM for all linear estimators. Yj) = Cov( Ei.3.d.3(iv). Theorem 6. Because E( Ei) = 0.1.1. In Example 6. . Then.' En with the GaussMarkov assumptions. Varc(a) ::.2. is UMVU in the class of linear estimates for all models with EY/ < 00. for any parameter of the form 0: = E~l aiJLi for some constants a1. However. See Problem 6.. the estimate = E~l aiJii has uniformly minimum variance among all unbiased estimates linear in Y 1 .3) are normal. . n 2 VarcM(a) = La.1.2) holds.. jj and il are still unbiased and Var(Ji) = a 2 H. i=l i<j i=l where VarCM refers to the variance computed under the GaussMarkov assumptions. One Sample (continued)...Y .p .6).1. in addition to being UMVU in the normal case.1. Var(Ei) + 2 Laiaj COV(Ei.4 are still valid. Example 6. Let Ci stand for any estimate linear in Y1 . € are n x 1. 13i of Section 6.En in the linear model (6.2) where Z is an n x p matrix of constants.. The GaussMarkov theorem shows that Y. Many of the properties stated for the Gaussian case carry over to the GaussMarkov case: Proposition 6..1 and the LSE in general still holds when they are compared to other linear estimates.1. .6.2(iv) and 6. E(a) = E~l aiJLi Cov(Yi.1. Moreover.En are i. .. for . . and Y.1. n a Proof.4..6.4 where it was shown that the optimal linear predictor in the random design case is the same as the optimal predictor in the multivariate normal case.1.1. .6.. Suppose the GaussMarkov linear model (6.H). where Varc(Ci) stands for the variance computed under the Gaussian assumption that E1. . The optimality of the estimates Jii. the result follows. In fact. moreover. ( 2 ). By Theorems 6. .
our estimates are estimates of Section 2. the asymptotic behavior of the LR.4) and the Wald test statis8 0 evidently we have i= Tw But if 8(P) = 8 0 .1. Another weakness is that it only applies to the homoscedastic case where Var(1'i) is the same for all i.ennip. we can use the GaussMarkov theorem to conclude that the weighted least squares are unknown.2.6 Robustness "" . Thus.8(P)) ~ N(O. Remark 6. which is the unique solution of ! and w(x. However. Y has a larger variance (Problem 6.12).5) . which does not belong to the model.6.6. This observation holds for the LR and Rao tests as wellbut see Problem 6.and twosample problems using asymptotic and Monte Carlo methods. Il E R.Ill}. Suppose (ZI' Yd. 8)dP(x) ° = 80 (6. The 6method implies that if the observations Xi come from a distribution P. and ...Section 6. Y is unbiased (Problem 3.P) i= II (8 0 ) in general. For this density and all symmetric densities..2 we know that if '11(" 8) = Die.3) . As seen in Example 6.d. 1 (Zn."n". 0.2 are UMVU. If 8(P) (6. 00. If the are version of the linear model where E( €) known. Wald and Rao tests depends critically on the asymptotic behavior of the underlying MLEs 8 and 80 . the asymptotic distribution of Tw is not X. Example 6.6. Now we will use asymptotic methods to investigate the robustness of levels more generally. . for instance.2.:. aT Robustness of Tests In Section 5. From the theory developed in Section 6.3 we investigated the robustness of the significance levels of t tests for the one. The Linear Model with Stochastic Covariates.4. ~('11..6..4.ii(8 . 0..2. P)) with :E given by (6. when the not optimal even in the class of linear estimates.6). as (Z..6. ~(w. .. y E R. Suppose we have a heteroscedastic 0.. P)). but Var( Ei) depends on i.ti".2. Y) where Z is a (p x 1) vector of random covariates and we model the relationship between Z and Y as (6. 8) and Pis true we expect that 8n 8(P).6.:lrarnetlric Models 419 sample n large.6.6. consider H : 8 tics Tw = n(e ( 0 )T I(8 0 )(e ( 0 ). then Tw ~ VT I(8 0 )V where V '" N(O..5) than the nonlinear estimate Y median when the density of Y is the Laplace density 1 2A exp{ Aly . Y n ) are d. a major weakness of the GaussMarkov theorem is that it only applies to linear estimates.1. There is an important special case where all is well: the linear model we have discussed in Example 6. Because ~ ('lI . A > O.
30) then (3(P) specified by (6.3) equals (3 in (6.6. For instance.1 that (6.6. (i) the set of P satisfying the hypothesis remains the same and (ii) the (asymptotic) variance of (3 is the same as under the Gaussian model.7. !ltin:::. Wald.6.t.r::nn". n lZT Z (n) (n) so that the confidence ~ ~ ZiZ'f n ~ 1 (6.8) Moreover.B) by where W pxp is Gaussian with mean 0 and.t xp(l a) by Example 5. the test (6. when E is N(O.Yi) = IY n . and we consider.420 Inference in the M . It is intimately linked to the fact that even though the parametric model on which the test was based is false.1 and 6. . under H : (3 = 0. (see Examples 6.2) the LR. it is still true by Theorem 6.np(l a) ..5) and (12(p) = Varp(E).2. Zen) S (Zb .p or more generally (3 E £0 + (30' a qdimensional affine subspace of RP.np(1 .2.2. EpE 0. Then. Y) such that Eand Z are independent.. ... it is still true that if qi is given by (6.6.6) (3 is the LSE. by the law of large numbers. It follows by Slutsky's theorem that. even if E is not Gaussian.q+ 1.Zn)T and  2 1 ~ " 2 1 ~(Yi . This kind of robustness holds for H : (Jq+ 1 (Jo.r Case 6 with the distribution P of (Z.a) where _ ""T T  2 (6. (12). hence. .6. X..P i=l n p  Z(n)(31 . Thus.6.6. and Rao tests all are equivalent to the F test: Reject if Tn = (3 Z(n)Z(n)(3/ S 2 !p. . E p E2 < 00. the limiting distribution of is Because !p.5) has asymptotic level a even if the errors are not Gaussian.2. (Jp (Jo.  2 Now.9) i=l n1Zfn)Z(n)S2 / Vii are still asymptotically of correct level. H : (3 = O. say...7) and (6..3. procedures based on approximating Var(..
6. Q) (6..3. Suppose without loss of generality that Var( (') = L Then (6.. under H.6.. E (i.13) but & V2Ep(X(I») ::f. then it's not clear what H : (J = (Jo means anymore. let the distribution of ( given Z = z be that of a(z)(' where (' is independent of Z. The Linear Model with Stochastic Covariates with E and Z Dependent.14) o Example 6.Section 6.12) The conditions of Theorem 6.10) nL and ~ ~ X(j) 2 (6.6 Robustness Properties and Semiparametric Models 421 If the first of these conditions fails.17) is Suppose our model is that X "Reject iff n (0 1 where &2 2 O) > x 1 (1  . i=1 (6.2.6.7) fails and in fact by Theorem 6.. To see this. Suppose E(E I Z) = 0 so that the parameters f3 of (6. XU) and X(2) are identically distributed. V2 VarpX{I) (~2) Xz in general. We illustrate with two final examples.6. .6. If it holds but the second fails. Example 6.a)" (6. That is.6. (X(I). The TwoSample Scale Problem..10) does not have asymptotic level a in general. the lifetimes A2 the standard Wald test (6. However suppose now that X(I) 101 and X(2) 102 are identically distributed but not exponential.11) i=1 &= v 2n ! t(xP) + X?»).3.3. For simplicity.2 clearly hold.5) are still meaningful but (and Z are dependent.1 Vn(~ where (3) + N(O. X(2) are independent E of paired pieces of equipment.) respectively. the theory goes wrong.6. If we take H : Al . (6.6. But the test (6.. Then H is still meaningful.6. X(2») where X(I). It is possible to construct a test equivalent to the Wald test under the parametric model and valid in general. (6.4. 2  (th).15) .6... we assume variances are heteroscedastic. note that.6. Simply replace (12 by (J ~2 = ~ ~ { ( ~1) n~ 2=1 Xz _ (X(I) + X(2»))2 2 + _ (X(1) + X(2»))2} 2 .
.... the test (6..6. ..6. the confidence procedures derived from the normal model of Section 6.16) Summary. in general. For the canonical linear Gaussian model with (52 unknown. In this case. . Compare this bound to the variance of 8 2 • . use Theorem 3.3 to compute the information lower bound on the variance of an unbiased estimator of (52. In the linear model with a random design matrix. identical variances. 1967).422 Inference in the Multiparameter Case Chapter 6 (Problem 6. . the methods need adjustment.1 1.7 PROBLEMS AND COMPLEMENTS Problems for Section 6. (5 2)T·IS (U1.Id) has a multinomial (A1.i.. It is easy to see that our tests of H : {32 = . = {3d = 0 fail to have correct asymptotic levels in general unless (52 (Z) is constant or A1 = ..1. .6) does not have the correct level. 0 < Aj < 1. = Ad = 1/ d (Problem 6."i=r+ I i .Ad) distribution. 0 To summarize: If hypotheses remain meaningful when the model is false then. The GaussMarkov theorem states that the linear estimates that are optimal in the linear model continue to be so if the i.. Wald. . Wald. and Rao tests. . the twosample problem with unequal sample sizes and variances.6. (52) are still reasonable when the true error distribution is not Gaussian. N(O. (52) assumption on the errors is replaced by the assumption that the errors have mean zero.Id . In partlCU Iar. ""2 _ ""12 (TJ1. Ur. (5 .fiT) o:2)T of U 2)T . We considered the behavior of estimates and tests when the model that generated them does not hold. and are uncorrelated.3. .d. . and Rao tests need to be modified to continue to be valid asymptotically.11 ..d. Yn .1. . A solution is to replace Z(n)Z(n)/ 8 2 in (6. j ::. (6.Ad)T where (I1.6) by Q1.9.. . ••• . TJr. . . and we gave the sandwich estimate as one possible adjustment to the variance of the MLE for the smaller model. in general. 1 ::. d. For d = 2 above this is just the asymptotic solution of the BehrensFisher problem discussed in Section 4. n 1 L.6). 6. Specialize to the case Z = (1. we showed that the MLEs and tests generated by the model where the errors are U. We also demonstrated that when either the hypothesis H or the variance of the MLE is not preserved when going to the wider model. then the MLE (iil. (ii) if n 2: r + 1.11) with both 11 and (52 unknown.4.. then the MLE and LR procedures for a specific model will fail asymptotically in the wider model.. In particular. N(O..A1.1 are still approximately valid as are the LR. Show that in the canonical exponential model (6.6. (i) the MLE does not exist if n = r. provided we restrict the class of estimates to linear functions of Y1 .6) and.n l1Y . .6. where Qis a consistent estimate of Q.4.. This is the stochastic version of the dsample model of Example 6.JL • ""n 2. . The simplest solution at least for Wald tests is to use as an estimate of [Var y'n8]1 not 1(8) or ~D2ln(8) but the socalled "sandwich estimate" (Huber. LR.
is (c) Show that Y and Bare unbiased.2 ). ..(_C)i)2 l+c L i=l (a) Show that Bis the weighted least squares estimate of ().2 replaced by 0: 2 • 6. . . . and the 1. Consider the model (see Example 1. i = 1.1] is a known constant and are i. .. .1. Var(Uf) = 20."" (Z~. c E [0.i.13..2 known case with 0. and (3 and {Lz are as defined in (1.1. zi = (Zi21"" Zip)T. n. 8. i eo = 0 (the €i are called moving average errors. 0. 4. Yn ) where Z. where a. 9.. . i = 1..7 Problems and Complements 423 Hint: By A.'''' Zip)T.. = (Zi2.2.1.2 with p = r coincides with the empirical plugin estimate of JLL = ({LL1. where IJ.2 . N(O.d.4 • 3.Section 6. Let 1. Show that >:(Y) defined in Remark 6. i = 1.a)% confidence interval for /31 in the Gaussian linear model is Z2)2 ] . n. 1 {LLn)T. then B the MLE of ().2 )... Yl). of 7. i = 1. fO 0. .1. .1.22.29) for the noncentrality parameter 82 in the oneway layout. . (e) Show that Var(B) < Var(Y) unless c = O.l+c (_c)j+l)/~ (1. 5. fn where ei = ceil + fi..d. Show that in the regression example with p = r = 2.4. 1 n f1. see Problem 2..5) Yi () + ei. .. Show that Ii of Example 6.Li = {Ly+(zi {Lz)f3. Derive the formula (6. the 100(1 . {Ly = /31.29)... Find the MLE B ().. .2 coincides with the likelihood ratio statistic A(Y) for the 0. Suppose that Yi satisfies the following model Yi ei = () + fi.14). . Let Yi denote the response of a subject at time i.n. 0.28) for the noncentrality parameter ()2 in the regression example. Here the empirical plugin estimate is based on U. Derive the formula (6. J =~ i=O L)) c i (1. (Zi. t'V (b) Show that if ei N(O.n ~ where fi can be written as fi = ceil + ei for given constant c satisfying 0 ~ c are independent identically distributed with mean zero and variance 0. (d) Show that Var(O) ~ Var(Y).
p (1 .Z. Assume the linear regression model with p future observation Y to be taken at the pont z. .1. ~ 1 . In the oneway layout.a).. ~(~ + t33) t31' r = 2..2)(Zj2 .p)S2 /x n . We want to predict the value of a 14.. Show that if p Inference in the Multiparameter Case Chapter 6 = r = 2 in Example 6. Yn be a sample from a population with mean f.. . (a) Find a level (1 . which is the amount .p)S2 /x n . The following model is useful in such situations. . Note that Y is independent of Yl. .. = 2(Y . then Var(8k ) is minimized by choosing ni = n/2.• statistics HYl.2) L(Zi2 . n2 = . l(Yh . 'lj.2). Consider the three estimates T1 = and T3 Y. where n is even.Z.e . 1  (a) Why can you conclude that T1 has a smaller MSE (mean square error) than T2? (b) Which estimate has the smallest MSE for estimating 0 13. (a) Show that level (1 . a 2 ::. . Consider a covariate x. ••. Y ::.k)' C (b) If n is fixed and divisible by p. Yn ).a) confidence intervals for linear functions of the form {3j .. Let Y 1 .1 and variance a 2 . (b) Find a level (1 a) prediction interval for Y (i.1 2)? and that a level (1 a) confidence interval for a 2 is given by ~a)::.:1 Yi + (3/ 2n) L~= ~ n+ 1 }i. Often a treatment that is beneficial in small doses is harmful in large doses.a)% confidence intervals for a and 6k • 12.• 424 10. Show that for the estimates (a) Var(a) = +==_=:.a) confidence interval for the best MSPE predictor E(Y) = {31 + {32 Z.I)..2)2 a and 8k in the oneway layout Var(8k ) . Yn ) such that P[t..p (~a) (n .{3i are given by = ~(f.. .. (Zi2 Z. (n . then the hat matrix H = (h ij ) is given by 1 n 11. = np n/2(p 1). then Var( a) is minimized by choosing ni = n / p.":=::. Yn .2. (c) If n is fixed and divisible by 2(p . ::. 15. T2 1 (1/ 2n) L.~ L~=1 nk = (p~:)2 + Lk#i .. (b) Find confidence intervals for 'lj.. (d) Give the 100(1 .
5 Zi2 * + (33zi2 where Zil = Xi  X. Assume the model = (31 + (32 X i + (33 log Xi + Ei.. (a) For the following data (from Hald.7 Probtems and Lornol. compute confidence intervals for (31. .or dose of a treatment. 7 2 ) does not exist so that A6 doesn't hold. i = 1.Section 6. (b) Plot (Xi. (33.. . which is yield or production. a 2 <7 2 .1) + 2<t'CJL. 1 2 where <t'CJL. 8). X (nitrogen) Hint: Do the regression for {Li = (31 + (32zil + (32zil = 10gXi . and a response variable Y. fli) where fi1.• . E :::.I::llogxi' You may use X = 0. Find the value of X that maxi mizes the estimated yield Y= e131 e132xx133.77 167. ( 2 ) density. En Xi. In the Gaussian linear model show that the parametrization ({3. P = {N({L. xC' = 0 => x = 0 for any rvector x. Y (yield) 3. hence.r2 )l 1 2 7 > O. and p be as in Problem 6. Show that if C is an n x r matrix of rank r.95 Yi = e131 e132xi Xf3. But xC'C = o => IIxC'I1 2 = xC'Cx' = 0 => xC' = O. r :::. nonsingular. .1 and let Q be the class of distributions with densities of the form (1 E) <t' CJL.. Problems for Section 6. Suppose a good fit is obtained by the equation where Yi is observed yield for dose 10gYi where E1. . (a) Show that AOA4 and A6 hold for model Qo with densities of the form 1 2<t'CJL. fi2. and Q = P.0'2)(x ) + E<t'(JL. (32. 8) = 2. •.0'2) is the N(Il'. Hint: Because C' is of rank r. Yi) and (Xi. a 2 > O}. (b) Show that the MLE of ({L. . ( 2 )T is identifiable if and only if r = p. Let 8. 1952.r2) (x). 17.emEmts 425 .0289. 653). Check AO. : {L E R. p(x. ( 2 ).. p. 16. ( 2 ). P. fi3 and level 0. n are independent N(O.2. then the r x r matrix C'C is of rank r and. ( 2 ) logp(x. n.A6 when 8 = ({L.2 1.
5.24).'. and that 0:2 is independent of the preceding vector with n(j2 /0 2 having a (b) Apply the law oflarge numbers to conclude that T P n I Z(n)Z(n) ~ E ( ZZ T ).2. (I) Combine (aHe) to establish (6. Hint: !x(x) = !YIZ(Y)!z(z). 6.426 Inference in the Multiparameter Case Chapter 6 I .2.i > 1. show that the assumptions of Theorem 6. .2 hold if (i) and (ii) hold. t) is then called a conditional MLE. In some cases it is possible to find T(X) such that the distribution of X given T(X) = tis Q" which doesn't depend on H E H.IY .I3). (J derived as the limit of Newtonwith equality if and only if > [EZfj1 4. 3.24) directly as follows: (a) Show that if Zn = ~ I:~ 1 Zi then. and ..2. (fj multivariate nannal distribution with mean 0 and variance. Show that if we identify X = (z(n).21). Hint: (a). i' " (b) Suppose that the distribution of Z is not known so that the model is semiparametric. . X . Fill in the details of the proof of Theorem 6.. (e) Construct a method of moment estimate moments which are ~ consistent.". .(b) The MLE minimizes .1/ 2 ).  I . H E H}.)Z(n)(fj . (a) In Example 6. Z i ~O. show that c(fo) = (To/a is 1 if fo is normal and is different from 1 if 10 is logistic. The MLE of based on (X.2. hence. H abstract._p distribution.1 show thatMLEs of (3.. (6. show that ([EZZ T jl )(1.20). T 2 ) based on the first two I (d) Deduce from Problem 6..2.. . In Example 6. 8. T(X) = Zen)..2.2 are as given in (6. Establish (6. then ({3.1. iT 2 ) are conditional MLEs. I I .1) EZ . (e) Show that 0:2 is unconditionally independent of (ji.Zen) (31 2 e ~ 7.2. that (d) (fj  (3)TZ'f. I . ell of (} ~ = (Ji. .Pe"H).2.10 that the estimate Raphson estimates from On is efficient. In Example 6. (P("H) : e E e. (c) Apply Slutsky's theorem to conclude that and. In Example 6. e Euclidean. n[z(~)~en)jl)' X.2..1.2.  (3)T)T has a (". /1.(3) = op(n. given Z(n).2.2. '".2. b .fii(ji . Y).
1' _ On = O~  [ . a > 0 are unknown and ( has known density f > 0 such that if p(x) log f(x) then p" > 0 and. (al Let ii n be the first iterate of the NewtonRapbson algorithm for solving (6.0 0 ). Him: You may use a uniform version of the inverse function theorem: If gn : Rd j.Oo) i=cl (1 . Y.' . (logistic). p i=l J.Section 6. and f(x) = e. p is strictly convex. .LD. Examples are f Gaussian. . (b) Show that under AGA4 there exists E > 0 such that with probability tending to 1.l real be independent identically distributed Y.L 'l'(Xi'O~) n i=l 1 1 =n n L 'l'(Xi. . (ii) gn(Oo) ~ g(Oo). the <ball about 0 0 . Show that if BB a unique MLE for B1 exists and 10.1/ 2 ).7 Problems and Complements 427 9.l + aCi where fl. Let Y1 •.2. <).p(Xi'O~) + op(1) n i=l n ) (O~ .1) starting at 0.l exists and uniquely ~ .(XiI') _O. hence.3). (iv) Dg(O) is continuous at 0 0 .x (1 + C X ) . 8~ = 1 80 + Op(n.•. ao (}2 = (b) Write 8 1 uniquely solves 1 8. (iii) Dg(0 0 ) is nonsingular.'2:7 1 'l'(Xi.p(Xi'O~) n.O) has auniqueOin S(Oo.Dg(O)j . R d are such that: (i) sup{jDgn(O) . = J..2.. (a) Show that if solves (Y ao is assumed known a unique MLE for L... 10 . Hint: n .LD.L'l'(Xi'O~) n. Suppose AGA4 hold and 8~ is vn consistent. that is.0 0 1 < <} ~ 0. t=l 1 tt Show that On satisfies (6. ] t=l 1 n . E R. a' .
(Z" . Problems for Section 6. •.d. . Hint: You may use the fact that if the initial value of NewtonRaphson is close enough to a unique solution. iteration of the NewtonRaphson algorithm starting at 0..II(Zil I Zi2. Suppose that Wo is given by (6. q + 1 < j < r} then )'(X) for the original testing problem is given = by )'(X) = sup{ q(X· 'I) : 'I E 3}/ sup{q(X. 1 Yn are independent Poisson variables with Yi . 'I) = p(.. .27). hence.1 on 8(9 0 • 6) (c) Conclude that with probability tending to 1.ip range freely and Ei are i.O).3. > 0 such that gil are 1 . LJ j=2 Differentiate with respect to {31.26) and (6. Show that if 3 0 = {'I E 3 : '1j = 0. = versus K : O oF 0. CjIJtlZ. Establish (6. )Ih  '" j=l LJ(lJj + where ~(l) = I:j and the Cj do not depend on j3.3. Ii 'I :f Z'[(3)2 over all {3 is the same as minimizing n . . • (6.3. Reparametnze l' by '1(0) = Lj~1 '1j(O)Vj where '1j(9) 0 Vj. and ~ .. log Ai = ElI + (hZi. 2 ° 2. . f. {Vj} are orthonormal. Zi... Thus. for "II sufficiently large.2.3. .i.2. ~.O E e./3)' = L i=1 1 CjZiU) n p . . N(O. q(. 1 Zw Find the asymptotic likelihood ratio.. and Rao tests for testing H : Ih.12) and the assumptions of Theorem (6.. I '" LJ i=l ~(1) Y. .3 i 1... Y.6)..Zi )13.(j) l. < Zn 1 for given covariate values ZI..~_.. then it converges to that solution. minimizing I:~ 1 (Ii 2 i. that Theorem 6. •.12). converges to the unique root On described in (b) and that On satisfies 1 1 • j. i2. Wald. a 2 ). P I . 'I) : 'I E 3 0 } and. P(Ai). Inference in the Multiparameter Case Chapter 6 > O. 0 < ZI < . . Zip)) + L j=2 p "YjZij + €i J . i=l • Z. = 131 (Zil .3). there exists a j and their image contains a ball S(g(Oo).2. t• where {31.2 holds for eo as given in (6. 8) for 'I E 3 and 3 = {'1( 0) : 9 E e}. Hint: Write n J • • • i I • DY.. i .428 then. II. . .. .' " 13.l hold for p(x.(Z"  ~(') z. Similarly compute the infonnation matrix when the model is written as Y. Suppose responses YI .
(ii) E" (a) Let ).3. Let (Xi. 0 ») is nK(Oo.7 Problems and C:com"p"":c'm"'.n) satisfy the conditions of Problem 6. Show that under H.) O.3... \arO whereal.. 210g.. There exists an open ball about 1J0.) 1(Xi .aq are orthogonal to the linear span of ~(1J0) .3 hold. 1 5.. with density p(.X.. p .. . .gr. Consider testing H : (j = 00 versus K : B = B1 . Testing Simple versus Simple.3 is valid.(X1 . X n Li. q + 1 <j < r S(lJo) . {IJ E S(lJo): ~j(lJ) = 0. IJ.(X" . Let e = {B o . Bt ) is a KullbackLeibler in fonnation ° "_Q K(Oo. hence. B). . N(Oz. (Adjoin to 9q+l.Xn ) be the likelihood ratio statistic. .l.d. S(lJ o) and a map ce which is continuously differentiable such that (i) ~J(IJ) = 9J(IJ) on S(lJo). < 00. 'I) S(lJo)} then.d. N(IJ" 1). (b) If h = 2 show that asymptotically the critical value of the most powerful (NeymanPearson) test with Tn ~ L~ . aTB..2.n 7J(Bo. respectively.. be ii.o log (X 0) PI. Oil + Jria(00 . with Xi and Y. Oil and p(X"Oo) = E. q('. .'1) and Tin '1(lJ n) and. = ~ 4. Xi.) Show that if we reparametrize {PIJ: IJ E S(lJo)} by q('.'1(IJ» is uniquely defined on:::: = {'1(IJ) : IJ E Tio.3... q + 1 <j< rl. .. independent. Assume that Pel of Pea' and that for some b > 0.'1 8 0 = and.Section 6. Deduce that Theorem 6.. pXd . Suppose that Bo E (:)0 and the conditions of Theorem 6. . 1 < i < n. Consider testing H : 8 1 = 82 = 0 versus K : 0. > 0 or IJz > O.. =  p(·. .(I(X i . j = 1.IJ) where q(. IJ.'" . 1). Suppose 8)" > 0. Vi).n:c"' 4c:2=9 3. OJ}. (ii) 'I is IIon S( 1J0 ) and D'I( IJ) is a nonsingular r x r matrix for aIlIJ E S( 1J0). even if ~ b = 0.2. .) where k(Bo.
Z 2 = poX1Y v' . we test the efficacy of a treatment on the basis of two correlated responses per individual.  OJ.3. 0. (b) If 0.d. = a + BB where .. and Dykstra.. 2 log >'( Xi. Let Bi l 82 > 0 and H be as above. ii.1. 1988).) if 0. are as above with the same hypothesis but = {(£It. Show that (6. 0. Then xi t. ±' < i < n) is distributed as a respectively.0. (b) Suppose Xil Yi e 4 (e) Let (X" Y. 1 < i < n. Note: The results of Problems 4 and 5 apply generally to models obeying AQA6 when we restrict the parameter space to a cone (Robertson. ~ ° with probability ~ and V is independent of ~ (e) Obtain tbe limit distribution of y'n( 0. Hint: ~ (i) Show that liOn) can be replaced by 1(0). > 0.1. I .i.  0.li : 1 <i< n) has a null distribution. Po . (d) Relate the result of (b) to the result of Problem 4(a).\( Xi. 2 e( y'n(ii.. Po) distribution and (Xi. (h) : 0 < 0.. 7. XI and X~ but with probabilities ~ .6. = (T20 = 1 andZ 1= X 1.) under the model and show that (a) If 0. (0. . for instance. In the model of Problem 5(a) compute the MLE (0.)) ~ N(O.~. ' · H In!: Consl'defa1O . j ~ ~ 6.\(Xi. = v'1~c2' 0 <.=0 where U ~ N(O. /:1. Exhibit the null distribution of 2 log . Yi : l<i<n). . (ii) Show that Wn(B~2» is invariant under affine reparametrizations "1 B is nonsingular. which is a mixture of point mass at O.6.\(X)..) have an N. Hint: By sufficiency reduce to n = 1. O > 0. Show that 2log. under H.0"10' O"~o.0.3. • i . Wright.19) holds. and ~ where sin.3. =0. (iii) Reparametrize as in Theorem 6.0.0. Yi : 1 and X~ with probabilities ~.2 for 210g. Sucb restrictions are natural if. < cIJ"O.0). > 0.2 and compute W n (8~2» showing that its leading term is the same as that obtained in the proof of Theorem 6.. 1) with probability ~ and U with the same distribution.430 Infer~nce in the Multiparameter Case Chapter 6 (a) Show that whatever be n. = 0. li). > OJ. < .. be i. mixture of point mass at 0.
21). hence.4. Show that under AOA5 and A6 for 8 11 where ~(lIo) is given by (6. 2.4. .3.2 to lin .P(A)P(B) JP(A)(1 .7 Problems and Complements ~(1 ) 431 8. (e) Conclude that if A and B are independent.P(A))P(B)(1 .8) (e) Derive the alternative form (6. A611 Problems for Section 6. In the 2 x 2 contingency table model let Xi = 1 or 0 according as the ith individual sampled is an A or A and Yi = 1 or 0 according as the ith individual sampled is a Born.10. (a) Show that for any 2 x 2 contingency table the table obtained by subtracting (estimated) expectations from each entry has all rows and columns summing to zero.22) is a consistent estimate of ~l(lIo).1(0 0 ).3. (a) Show that the correlation of Xl and YI is p = peA n B) .0 < PCB) < 1.2.3. Show that under A2. 9.nf.4. 3.Section 6.4) explicitly and find the one that corresponds to the maximizer of the likelihood. (b) (6. 10. Hint: Argue as in Problem 5.6 is related to Z of (6.4. 0 < P( A) < 1.P(B))· (b) Show that the sample correlation coefficient r studied in Example 5.4 ~ 1(11) is continuous. A3. then Z has a limitingN(011) distribution. Hint: Write and apply Theorem 6. 1. Under conditions AOA6 for (a) and AOA6 with A6 for (a) ~(1 ) i!~1) for (b) establish that [~D2ln(en)]1 is a consistent estimate of 1.. Exhibit the two solutions of (6.8) for Z.3. is of the fonn (b) Deduce that X' ~ Z' where Z is given by (6.8) by Z ~ .
C. .~. .4 deduce that jf j(o:) (depending on chosen so that Tl.... . 8l! / (8 1l + 812 )).. (b) Sbow that 812 /(8l! ° + 812 ) ~ 821 /(8 21 + 822 ) iff R 1 and C 1 are independent.2.~~. n R. N ll and N 21 are independent 8( r. C i = Ci. B!c1b!. nab) A ) = B.u. b I Ri ( = Ti. I (c) Show that under independence the conditional distribution of N ii given R. i = 1.~~. 811 \ 12 .9) and has approximately a X1al)(bl) distribution under H.6 that H is true. (22 ) as in the contingency table... Hint: (a) Consider the likelihood as a function of TJil. . Ti) (the hypergeometric distribution).. 8(r2' 82 I/ (8" + 8. (a) Let (NIl... TJj2.)). .1.D. Let N ij be the entries of an let 1Jil = E~=l (}ij. 8 21 .. R 2 = T2 = n . TJj2 are given by TJil ~ = . n a2 ) ..4.4. Fisher's Exact Test From the result of Problem 6. (b) Deduce that Pearson's X2 is given by (6. n.. j = 1. =T 1.1 . n) can be then the test that rejects (conditionally on R I = TI' C 1 = GI) if N ll > j(a) is exact level o. 7. (a) Show that then P[N'j niji i = 1. . TJj2 = 2::: a x b contingency table with associated probabilities Bij and 1 Oij. . N 22 ) rv M (u. in principle. i = 1. S. C j = L' N'j. ( rl. Suppose in Problem 6.b1only..4. Let R i = Nil + N i2 • Ci = Nii + N 2i · Show that given R 1 = TI. .1 II .ra) nab . j. N 12 • N 21 . Cj = Cj] : ( nll. TJj2 ~ = Cj n where R.. j = 1. . = Lj N'j. 1 : X 2 (b) How would you.~l.. are the multinomial coefficients. This is known as Fisher's exact test. 6. ( where ( .. (a) Show that the maximum likelihood estimates of TJil.TI. Consider the hypothesis H : Oij = TJil TJj2 for all i..2 is 1t(Ci. use this result to construct a test of H similar to the test with probability of type I error independent of TJil' TJj2? 1 . . nal ) n12. . . It may be shown (see Volume IT) that the (approximate) tests based on Z and Fisher's test are asymptotically equivalent in the sense of (5. 432 Inference in the Multiparameter Case Chapter 6 i 4. CI.54). a . ...
.14). for suitable a. consider the assertions. Test in both cases whether the events [being a man] and [being admitted] are independent. C are three events. 5 Hint: (b) It is easier to work with N 22 • Argue that the Fisher test is equivalent to rejecting H if N 22 > q2 + n .(rl + cd or N 22 < ql + n . Establish (6. Give pvalues for the three cases. there is a UMP level a test. (b). if A and C are independent or B and C are independent.. . C2). classified by sex and admission status. B. = 0 in the logistic model.BINDEPENDENTGNENC) p(AnB I C) ~ peA I C)P(B I C) (A. (a) If A. if and only if. but (iii) does not.1""". (i) p(AnB (ii) I C) = PeA I C)P(B I C) (A. (e) The following 2 x 2 tables classify applicants for graduate study in different departments of the university according to admission status and sex. B INDEPENDENT) (iii) PeA (C is the complement of C. where Pp~ [2:f .4.ZiNi > k] = a. 9.12 I .) Show that (i) and (ii) imply (iii).5? Admit Deny Men Women 1 19 .ziNi > k. Zi not all equal. Admit Men Women 1 235 1~35' 38 7 273 42 n = 315 Deny Admit 270 45 Men Women I 122 1'.05 level (a) using the X2 test with approximate critical value? (b) using Fisher's exact test of Problem 6. + IhZi. N 22 is conditionally distributed 1t(r2. and petfonn the same test on the resulting table.0. and that under H. 2:f . 10. Show that. ~i = {3.4. which rejects.BINDEPENDENTGIVENC) n B) = P(A)P(B) (A. n.Section 6.(rl + cI). Would you accept Or reject the hypothesis of independence at the 0. The following table gives the number of applicants to the graduate program of a small department of the University of California..7 Problems and Complements 433 8. and that we wish to test H : Ih < f3E versus K : Ih > f3E. 11. Suppose that we know that {3. (h) Construct an experiment and three events for which (i) and (ii) hold.93'1 215 103 69 172 Deny 225 162 n=387 (d) Relate your results to the phenomenon discussed in (a). Then combine the two tables into one.
1 .4) is as claimed formula (2. Show that. · .d. 2. " Ok = 8kO..4.f. 13.)llmi~i(1 'lri). asymptotically N(O. a under H... with (Xi I Z..5. Use this to imitate the argument of Theorem 6.18) tends 2 to Xrq' Hint: (Xi ... . " lOk vary freely. Nk) ~ M(n.5 1. .4.5.3.2.20) for the regression described after (6. . = rn I < /3g and show that it agrees with the test of (b) Suppose that 131 is unknown. X k be independent Xi '" N (Oi. Verify that (6.4. 3.4. a 2 ) where either a 2 = (known) and 01 . • I 1 • • (a)P[Z.Oio)("Cooked data").Ld. . mk ~ 00 and H : fJ E Wo is true then the law of the statistic of (6.8) and. a 2 = 0"5 is of the form: Reject if (1/0".11... .i.". i 16. then 130 as defined by (6. case.00 or have Eo(Nd : . a5 j J j 1 J I .. . Given an initial value ()o define iterates  Om+l .5. for example. This is an approximation (for large k. Suppose that (Z" Yj).(Ni ) < nOiO(1 . . In the binomial oneway layout show that the LR test is asymptotically equivalent to Pearson's  X2 test in the sense that 2log'\  X2 .3.1 14. .. i Problems for Section 6..5 construct an exact test (level independent of (31).. 010. . < f3g in this (c) By conditioning on L~ 1 Xi and using the approach of Problem 6. . . Xi) are i. I). 434 • Inference in the Multiparameter Case Chapter 6 f • 12.OiO)2 > k 2 or < k}. (Zn. Fisher's Method ofScoring The following algorithm for solving likelihood equations was proosed by Fishersee Rao (1973). if the design matrix has rank p.. Let Xl. . Show that the likelihO<Xt ratio test of H : O} = 010 ..d.OkO) under H.LJ2 Z i )). ~ 8 m + [1(8 m )Dl(8 m ). . 1 i Show that for GLM this method coincides with the NewtonRaphson method of Section 2.4.) ~ B(m. 15.. Tn. . Compute the Rao test statistic for H : (32 case.z(kJl) ~I 1 . . in the logistic regression model. Y n ) have density as in (6. but under K may be either multinomial with 0 #..4). E {z(lJ. . . n) and simplification of a model under which (N1. I...11 are obtained as realization of i.. (a) Compute the Rao test for H : (32 Problem 6. but Var. . or Oi = Bio (known) i = 1. ..4. ..: nOiO..'Ir(.l 1 . Show that if Wo C WI are nested logistic regression models of dimension q < r < k and mI.. which is valid for the i. k and a 2 is unknown.3) l:::~ I (Xi .i.15) is consistent. Suppose the ::i in Problem 6. I < i < k are independent.. Zi and so that (Zi.
Show that when 9 is the canonical link. .y . .p.ztkl} is RP (c) P[ZI ~ z(jl] > 0 for all j. Let YI. b(9). 05. (Zn. and v(. give the asymptotic distribution of y'n({3 . Hint: Show that if the convex support of the conditional distribution of YI given ZI = zU) contains an open interval about p'j for j = 1.. (d) Gaussian GLM...7 Problems and Complements 435 (b) The linear span of {ZII). In the random design case. . b(B). Wn). T). Wi = w(". and b' and 9 are monotone. . Yn ) are i. h(y. g(p.. .) . Find the canonical link function and show that when g is the canonical link.F.i. C(T). the deviance is 5. your result coincides with (6..d.9). T. (a) Show that the likelihood equations are ~ i=I L.O(z)) where O(z) solves 1/(0) = gI(zT {3). .) and C(T) such that the model for Yi can be written as O.. ball about "k=l A" zlil in RP ..5. Assume that there exist functions h(y..)z'j dC. k. L.= 1 .. (e) Suppose that Y. d".. . Give 0. . 9 = (b')l.. and v(.5. Show that the conditions AQA6 hold for P = P{3o E P (where qo is assumed known). h(y. 1'0) = jy 1'01 2 /<7. .. has the Poisson.. (c) Suppose (Z" Y I ).5). b(9). under appropriate conditions.. . Y) and that given Z = z. T. <. j = 1. P(I"). . contains an open L. k. as (Z. then the convex support of the conditional distribution of = 1 Aj Yj zU) given Z j = Z (j) .). (y.Section 6. ~ N("" <7. Yn be independent responses and suppose the distribution of Yi depends on a covariate vector Zi.T)exp { C(T) where T is known. .. J JrJ 4.) ~ 1/11("i)( d~i/ d". Show that for the Gaussian linear model with known variance D(y. Set ~ = g(. J . distribution..) and v(.(. Suppose Y. g("i) = zT {3... T). Y follow the model p(y. C(T). Give 0.WZ v where Zv = Ilz'jll is the design matrix and W = diag( WI. Show that.1 = 0 V fJ~ .b(Oi)} p(y.9). Hint: By the chain rule a l(y a(3j' 0) = i3l dO d" a~ . the resuit of (c) coincides with (6.(3).) = ~ Var(Y)/c(T) b"(O). 00 d" ~ a(3j (b) Show that the Fisher information is Z].5..Oi)~h(y.). . ~ ..T)..
. 0'2). in fact. = 1" (b) Show that if f is N(I'.10). In the hinary data regression model of Section 6.p under the sole assumption that E€ = 0.s(t). (a) Show that if f is symmetric about 1'. if VarpDl(X.2. W n . .6. replacing &2 by 172 in (6. Suppose ADA6 are valid. Consider the Rao test for H : f} = f}o for the model "P = {P/I : /I E e} and ADA6 hold. but that if the estimate ~ L:~ dDIllDljT (Xi.10) creates a valid level u test. . /10) is estimated by 1(80 ). . W n and R n are computed under the assumption of the Gaussian linear model with a 2 known.6. then v( P) . I "j 5. but if f~(x) = ~ exp Ix 1'1.3 is as given in (6.1 under this model and verifying the fonnula given. 0 < Vae f < 00. I 4. t E R. . then O' 2(p) > 0'2 = Varp(X. i 6. P. Wald.3.d.i.2 and the hypothesis (3q+l = (3o. Wn Wn + op(l) + op(I).6.. I . .6.4.3. Establish (6.3) then f}(P) = f}o. Show that. .j 436 Inference in the Multiparameter Case Chapter 6 i Problems for Section 6. and R n are the corresponding test statistics. Consider the linear model of Example 6. Suppose that the ttue P does not belong to"P but if f}(P) is defined by (6. let 1r = s(z. l 1 .2.3 and. ! I I ! I .Q+1>'" . and Rao tests are still asymptotically equivalent in the sense that if 2 log An. n .1.1) ~ .6.(3p ~ (3o. Show that 0: 2 given in (6.6.6 1. Note: 2 log An. (3) where s(t) is the continuous distribution function of a random variable symmetric about 0. I: . 2.6. Apply Theorem 6. 7. j.O' 2 (p») = 1/4f(v(p». Suppose Xl.Xn are i.14) is a consistent estimate of2 VarpX(l) in Example 6. f}o) is used.6.. O' 2(p)/0'2 = 2/1r. . the infonnation bound and asymptotic variance of Vri(X 1'). then the sample median X satisfies ~ f at I I Vri(X where O' (p) 2 yep») ~ N(0. Show that the standard Wald test forthe problem of Example 6.15) by verifying the condition of Theorem 6. (6. hence. Show that the LR.). that is. if P has a positive density v(P). • I 3. then under H. By Problem 5. set) = 1.. the unique median of p.1. Hint: Retrace the arguments given for the asymptotic equivalence of these statistics under parametric model and note that the only essential property used is that the MLEs under the model satisfy an appropriate estimating equation. then it is. then 0'2(P) < 0'2.7. then the Rao test does not in general have the correct asymptotic level. " .
then 13L is not a consistent estimate of f3 0 unless s(t) is the logistic distribution. Suppose that the covariates are ranked in order of importance and that we entertain the possibility that the last d . . (a) Show that EPE(p) ~ . _. O)T and deduce that (c) EPE(p) = RSS(p) + ~. f3p. i = 1.. i .Section 6. for instance)..lpI)2 be the residual sum of squares.(b) continue to hold if we assume the GaussMarkov model.d... . the model with 13d+l = . at Zll'" . i = 1. 8. Let f3(p) be the LSE under this assumption and Y.Zn and evaluate the performance of Yjep) . Show that if the correct model has Jri given by s as above and {3 = {3o. where ti are i.9. Zi.. L:~ 1 (P.. .. A natural goal to entertain is to obtain new values Yi". hence. 1 V.' i=l ..jP)2 where ILl ) = z.. y~p) and.. ..2 is known. are indepeqdent of}] 1 ' · ' 1 Yn and ~* is distributed as }'i. Suppose that . then ~ Vri(rJL QI ({3) Var(ZI (Y1   {3 Ll has a limiting normal distribution with mean 0 and variance A(Zr {3)) )[QI ({3)J where Q({3) = E(Zr A(Zr{30)ZI) is p x p and necessarily nonsingular.2.p don't matter.9. 0.7 Problems and Complements 437 (a) Show that ~ Jr can be written in this form for both the probit and logit models.. (Model Selection) Consider the classical Gaussian linear model (6. that Zj is bounded with probability 1 and let ih(X ln )).2 (b) Show that (1 + ~D + . 1 n. .. Hint: Apply Theorem 6.1) Yi = J1i + ti. .1. /3P+I = '" = /3d ~ O.{3lp) p and {3(P) ~ (/31. be the MLE for the logit model.I EL(Y.lp)2 Here Yi" l ' •• 1 Y. Gaussian with mean zero J1i = Z T {3 and variance 0'2.. . But if 13L is defined as the solution of EZ1s(Zr {30) = Q(/3) where Q({3) = E(Zr A(Zr/3) is p x 1.(p) the corresponding fitted value. ..p.2 is an unbiased estimate of EPE(P). Zi) : 1 <i< n}.d. where Xln) ~ {(Y.. . 1973.1.. ~ ~ (d) Show that (a). . Model selection consists in selecting p to minimize EPE(p) and then using Y(P) as a predictor (Mallows. 1 n.. (b) Suppose that Zi are realizations of U. = (3p = 0 by the (average) expected prediction error ~ ~ n EPE(p) = n.i . Zi are ddimensional vectors for covariate (factor) values. Let RSS(p) = 2JY.
1970. ti. . Note for Section 6.  J . i 6.6.438 (e) Suppose p ~ Inference in the Multiparameter Case Chapter 6 2 and 11(Z) ~ Ii. New York: McGrawHill. this makes no sense for the model we discussed in this section.1 (1) From the L A.L . LR test statistics for enlarged models of this type do indeed reject H for data corresponding to small values of X2 as well as large ones (Problem 6. D. 6. i=1 Derive the result for the canonical model.8 NOTES Note for Section 6. EPE(p) n R SS( p) = .16).2 (1) See Problem 3. Heart Study after Dixon and Massey (1969). W.) . . (b) The result depends only on the mean and covariance structure of the i=l"".. AND F. 1969. t12 and {Z/I. .". .' . . The moral of the story is that the practicing statisticians should be on their guard! For more on this theme see Section 6. 1 "'( i=1 .4 (1) R.i2} such that the EPE in case (i) is smaller than in case (d) and vice versa. For instance. if we consider alternatives to H. ~(p) n . Hint: (a) Note that ". we might envision the possibility that an overzealous assistant of Mendel "cooked" the data.5. . ! . and (ii) 'T}i = /31 Zil + f32zi2.. 1 n n L (I'. A.9 REFERENCES Cox. To guard against such situations he argued that the test should be used in a twotailed fashion and that we should reject H both for large and for small values of X2 • Of course. Fisher pointed out that the agreement of this and other data of Mendel's with his hypotheses is too good.. Note for Section 6.9 for a discussion of densities with heavy tails. Y. but it is reasonable. DIXON.n. Introduction to Statistical Analysis. MASSEY.~I'. Give values of /31. R.4. )I'i). Y/" j1~. Use 0'2 = 1 and n = 10. which are not multinomial. 3rd 00. The Analysis of Binary Data London: Methuen. + Evaluate EPE for (i) ~i ~ "IZ. 1 I .z.
FISHER, R. A., Statistical Methods for Research Workers, 13th ed. New York: Hafner, 1958.

GRAYBILL, F. A., An Introduction to Linear Statistical Models, Vol. I. New York: McGraw-Hill, 1961.

HABERMAN, S. J., The Analysis of Frequency Data. Chicago: University of Chicago Press, 1974.

HALD, A., Statistical Theory with Engineering Applications. New York: Wiley, 1952.

HUBER, P. J., "The behavior of the maximum likelihood estimator under nonstandard conditions," Proc. Fifth Berkeley Symp. Math. Statist. Prob., I, 221-233. Berkeley: Univ. of California Press (1967).

KASS, R., L. TIERNEY, AND J. KADANE, "Approximate marginal densities of nonlinear functions," Biometrika, 76, 425-433 (1989).

KOENKER, R., AND V. D'OREY, "Computing regression quantiles," Applied Statistics, 36, 383-393 (1987).

LAPLACE, P.-S., "Sur quelques points du système du monde," Mémoires de l'Académie des Sciences de Paris (Reprinted in Oeuvres Complètes, Vol. 11, 475-558, Paris: Gauthier-Villars) (1789).

MALLOWS, C., "Some comments on C_p," Technometrics, 15, 661-675 (1973).

McCULLAGH, P., AND J. A. NELDER, Generalized Linear Models, 2nd ed. London: Chapman and Hall, 1989.

PORTNOY, S., AND R. KOENKER, "The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators," Statistical Science, 12, 279-300 (1997).

RAO, C. R., Linear Statistical Inference and Its Applications, 2nd ed. New York: J. Wiley & Sons, 1973.

ROBERTSON, T., F. T. WRIGHT, AND R. L. DYKSTRA, Order Restricted Statistical Inference. New York: Wiley, 1988.

SCHEFFÉ, H., The Analysis of Variance. New York: Wiley, 1959.

SCHERVISH, M., Theory of Statistics. New York: Springer, 1995.

STIGLER, S., The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press, 1986.

WEISBERG, S., Applied Linear Regression, 2nd ed. New York: Wiley, 1985.
Appendix A

A REVIEW OF BASIC PROBABILITY THEORY

In statistics we study techniques for obtaining and using information in the presence of uncertainty. A prerequisite for such a study is a mathematical model for randomness and some knowledge of its properties. The Kolmogorov model and the modern theory of probability based on it are what we need.

The reader is expected to have had a basic course in probability theory. The purpose of this appendix is to indicate what results we consider basic and to introduce some of the notation that will be used in the rest of the book. Because the notation and the level of generality differ somewhat from that found in the standard textbooks in probability at this level, we include some commentary. Sections A.14 and A.15 contain some results that the student may not know; therefore, we include some proofs as well in these sections. In Appendix B we will give additional probability theory results that are of special interest in statistics and may not be treated in enough detail in some probability texts.

A.1 THE BASIC MODEL

Classical mechanics is built around the principle that like causes produce like effects. Probability theory provides a model for situations in which like or similar causes can produce one of a number of unlike effects. A coin that is tossed can land heads or tails. A group of ten individuals selected from the population of the United States can have a majority for or against legalized abortion. The intensity of solar flares in the same month of two different years can vary sharply.

The situations we are going to model can all be thought of as random experiments. Viewed naively, an experiment is an action that consists of observing or preparing a set of circumstances and then observing the outcome of this situation. We add to this notion the requirement that, to be called an experiment, such an action must be repeatable, at least conceptually. The adjective random is used only to indicate that we do not require that every repetition yield the same outcome, although we do not exclude this case. What we expect and observe in practice when we repeat a random experiment many times is that the relative frequency of each of the possible outcomes will tend to stabilize.
This long-term relative frequency n_A/n, where n_A is the number of times the possible outcome A occurs in n repetitions, is to many statisticians, including the authors, the operational interpretation of the mathematical concept of probability. Another school of statisticians finds this formulation too restrictive. By interpreting probability as a subjective measure, they are willing to assign probabilities in any situation involving uncertainty, whether it is conceptually repeatable or not. In this sense, almost any kind of activity involving uncertainty, from horse races to genetic experiments, falls under the vague heading of "random experiment." For a discussion of this approach and further references, the reader may wish to consult Savage (1954), Savage (1962), Raiffa and Schlaiffer (1961), de Groot (1970), Lindley (1965), and Berger (1985).

We now turn to the mathematical abstraction of a random experiment, the probability model. A random experiment is described mathematically in terms of the following quantities.

A.1.1 The sample space is the set of all possible outcomes of a random experiment. We denote it by Ω.

A.1.2 A sample point is any member of Ω and is typically denoted by ω.

A.1.3 Subsets of Ω are called events; the complement of Ω, the null set or impossible event, is denoted by ∅. We denote events by A, B, C, and so on, or by a description of their members. If ω ∈ Ω, {ω} is called an elementary event. If A contains more than one point, it is called a composite event. The relation between the experiment and the model is given by the correspondence "A occurs if and only if the actual outcome of the experiment is a member of A."

In this section and throughout the book, we presume the reader to be familiar with elementary set theory and its notation at the level of Chapter 1 of Feller (1968) or Chapter 1 of Parzen (1960). We shall use the symbols ∪, ∩, ⊂, c for union, intersection, inclusion, and complementation, and − for set theoretic difference, as is usual in elementary set theory. The set operations we have mentioned have interpretations as events also. For example, the relation A ⊂ B between sets considered as events means that the occurrence of A implies the occurrence of B.
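For example, in the experiment of tossing a coin twice we may take Ω = {HH, HT, TH, TT}. Here {HH} is an elementary event, B = {HH, HT, TH} is the composite event "at least one head occurs," and the inclusion {HH} ⊂ B expresses the fact that the occurrence of two heads implies the occurrence of at least one head.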
A.1.4 We will let A denote a class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability P to every subset of Ω. A is always taken to be a sigma field, which by definition is a nonempty class of events closed under countable unions, intersections, and complementation (cf. Chung, 1974; Grimmett and Stirzaker, 1992; Loeve, 1977). A probability distribution or measure is a nonnegative function P on A having the following properties:

(i) P(Ω) = 1;

(ii) If A_1, A_2, ... are pairwise disjoint sets in A, then

P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).

Recall that ∪_{i=1}^∞ A_i is just the collection of points that are in any one of the sets A_i, and that two sets are disjoint if they have no points in common.

A.1.5 The three objects Ω, A, and P together describe a random experiment mathematically. We shall refer to the triple (Ω, A, P) either as a probability model or identify the model with what it represents as a (random) experiment. For convenience, when we refer to events we shall automatically exclude those that are not members of A.

References
Gnedenko (1967) Chapter 1, Sections 1-3
Grimmett and Stirzaker (1992) Sections 1.1-1.3
Hoel, Port, and Stone (1971) Sections 1.1, 1.2
Parzen (1960) Chapter 1, Sections 1-5
Pitman (1993) Section 1.3

A.2 ELEMENTARY PROPERTIES OF PROBABILITY MODELS

The following are consequences of the definition of P.

A.2.1 P(∅) = 0.

A.2.2 P(A^c) = 1 − P(A).

A.2.3 If A ⊂ B, then P(B − A) = P(B) − P(A); hence, P(B) ≥ P(A).

A.2.4 0 ≤ P(A) ≤ 1.

A.2.5 P(∪_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ P(A_n).

A.2.6 If A_1 ⊂ A_2 ⊂ ··· ⊂ A_n ⊂ ···, then P(∪_{n=1}^∞ A_n) = lim_{n→∞} P(A_n).

A.2.7 P(∩_{i=1}^n A_i) ≥ 1 − Σ_{i=1}^n P(A_i^c) (Bonferroni's inequality).

References
Gnedenko (1967) Chapter 1, Sections 6-8
Grimmett and Stirzaker (1992) Sections 1.2 and 1.3
Hoel, Port, and Stone (1971) Section 1.3
Parzen (1960) Chapter 1, Sections 4-5
Pitman (1993) Section 1.3

A.3 DISCRETE PROBABILITY MODELS

A.3.1 A probability model is called discrete if Ω is finite or countably infinite and every subset of Ω is assigned a probability. That is, we can write Ω = {ω_1, ω_2, ...} and A is the collection of subsets of Ω. By axiom (ii) of (A.1.4) we have, for any event A,

P(A) = Σ_{ω_i ∈ A} P({ω_i}). (A.3.2)

A.3.3 An important special case arises when Ω has a finite number of elements, say N, all of which are equally likely. Then P({ω}) = 1/N for every ω ∈ Ω, and

P(A) = (Number of elements in A)/N.
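For example, if a balanced die is rolled once, Ω = {1, 2, 3, 4, 5, 6} with P({ω}) = 1/6 for each ω, and the event A = {2, 4, 6} that an even number shows has, by (A.3.3), P(A) = 3/6 = 1/2.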
A.3.4 Suppose that ω_1, ..., ω_N are the members of some population (humans, machines, flowers, guinea pigs, etc.). Then selecting an individual from this population in such a way that no one member is more likely to be drawn than another, selecting at random, is an experiment leading to the model of (A.3.3). Such selection can be carried out if N is small by putting the "names" of the ω_i in a hopper, shaking well, and drawing. For large N, a random number table or computer can be used.

References
Gnedenko (1967) Chapter 1, Sections 4-5
Parzen (1960) Chapter 1, Sections 6-7
Pitman (1993) Section 1.1

A.4 CONDITIONAL PROBABILITY AND INDEPENDENCE

Given an event B such that P(B) > 0 and any other event A, we define the conditional probability of A given B, which we write P(A | B), by

P(A | B) = P(A ∩ B)/P(B). (A.4.1)

If P(A) corresponds to the frequency with which A occurs in a large number of repetitions of the experiment, then P(A | B) corresponds to the frequency of occurrence of A relative to the class of trials in which B does occur. From a heuristic point of view, P(A | B) is the chance we would assign to the event A if we were told that B has occurred. Transposition of the denominator in (A.4.1) gives the multiplication rule,

P(A ∩ B) = P(B)P(A | B). (A.4.2)

If A_1, A_2, ... are (pairwise) disjoint events and P(B) > 0, then

P(∪_{i=1}^∞ A_i | B) = Σ_{i=1}^∞ P(A_i | B). (A.4.3)

In fact, for fixed B as before, the function P(· | B) is a probability measure on (Ω, A), which is referred to as the conditional probability measure given B. If B_1, B_2, ..., B_n are (pairwise) disjoint events of positive probability whose union is Ω, the identity A = ∪_{j=1}^n (A ∩ B_j) together with (A.1.4)(ii) and (A.4.2) yields

P(A) = Σ_{j=1}^n P(A | B_j)P(B_j). (A.4.4)
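As an illustration of (A.4.2) and (A.4.4), suppose a hat contains two green chips and one red chip and two chips are drawn in succession without replacement. If B = "the first chip is green" and A = "the second chip is green," then P(A | B) = 1/2 and P(A | B^c) = 1, so

P(A) = P(A | B)P(B) + P(A | B^c)P(B^c) = (1/2)(2/3) + (1)(1/3) = 2/3.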
If P(A) is positive, we can combine (A.4.2) and (A.4.4) and obtain Bayes rule

P(B_j | A) = P(A | B_j)P(B_j) / Σ_{i=1}^n P(A | B_i)P(B_i), j = 1, ..., n. (A.4.5)

The conditional probability of A given B_1, ..., B_n is written P(A | B_1, ..., B_n) and defined by

P(A | B_1, ..., B_n) = P(A ∩ B_1 ∩ ··· ∩ B_n)/P(B_1 ∩ ··· ∩ B_n) (A.4.6)

whenever P(B_1 ∩ ··· ∩ B_n) > 0. Simple algebra leads to the multiplication rule,

P(B_1 ∩ ··· ∩ B_n) = P(B_1)P(B_2 | B_1)P(B_3 | B_1, B_2) ··· P(B_n | B_1, ..., B_{n−1}) (A.4.7)

whenever P(B_1 ∩ ··· ∩ B_{n−1}) > 0.

Two events A and B are said to be independent if

P(A ∩ B) = P(A)P(B). (A.4.8)

If P(B) > 0, the relation (A.4.8) may be written

P(A | B) = P(A). (A.4.9)

In other words, A and B are independent if knowledge of B does not affect the probability of A. The events A_1, ..., A_n are said to be independent if

P(A_{i_1} ∩ ··· ∩ A_{i_k}) = Π_{j=1}^k P(A_{i_j}) (A.4.10)

for any subset {i_1, ..., i_k} of the integers {1, ..., n}. If all the P(A_i) are positive, the relation (A.4.10) is equivalent to requiring that

P(A_j | A_{i_1}, ..., A_{i_k}) = P(A_j) (A.4.11)

for any j and {i_1, ..., i_k} such that j ∉ {i_1, ..., i_k}.
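To illustrate Bayes rule (A.4.5) with hypothetical numbers, suppose 1% of a population has a disease (B_1), a test detects it with probability P(A | B_1) = 0.95, and it gives a false positive with probability P(A | B_2) = 0.05, where B_2 = B_1^c. Then

P(B_1 | A) = (0.95)(0.01) / [(0.95)(0.01) + (0.05)(0.99)] ≈ 0.16,

so a positive result by itself leaves the disease rather unlikely. As for independence, in two tosses of a fair coin the events A = "head on the first toss" and B = "head on the second toss" satisfy P(A ∩ B) = 1/4 = P(A)P(B), in accordance with (A.4.8).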
References
Gnedenko (1967) Chapter 1, Section 9
Grimmett and Stirzaker (1992) Section 1.4
Hoel, Port, and Stone (1971) Sections 1.4, 1.5
Parzen (1960) Chapter 2, Sections 1-4
Pitman (1993) Section 1.4

A.5 COMPOUND EXPERIMENTS

There is an intuitive notion of independent experiments. For example, if we toss a coin twice, the outcome of the first experiment (toss) reasonably has nothing to do with the outcome of the second. On the other hand, it is easy to give examples of dependent experiments: If we draw twice at random from a hat containing two green chips and one red chip, and if we do not replace the first chip drawn before the second draw, then the probability of a given chip in the second draw will depend on the outcome of the first draw. To be able to talk about independence and dependence of experiments, we introduce the notion of a compound experiment. Informally, a compound experiment is one made up of two or more component experiments. There are certain natural ways of defining sigma fields and probabilities for these experiments. These will be discussed in this section. The reader not interested in the formalities may skip to Section A.6, where examples of compound experiments are given.

A.5.1 Recall that if A_1, ..., A_n are sets, the Cartesian product A_1 × ··· × A_n is by definition {(ω_1, ..., ω_n) : ω_i ∈ A_i, 1 ≤ i ≤ n}. If we are given n experiments (probability models) ε_1, ..., ε_n with respective sample spaces Ω_1, ..., Ω_n, then the sample space Ω of the n-stage compound experiment is by definition Ω_1 × ··· × Ω_n. The (n-stage) compound experiment consists in performing component experiments ε_1, ..., ε_n and recording all n outcomes. The interpretation of the sample space Ω is that (ω_1, ..., ω_n) is a sample point in Ω if and only if ω_1 is the outcome of ε_1, ω_2 is the outcome of ε_2, and so on. To say that ε_i has had outcome ω_i^0 ∈ Ω_i corresponds to the occurrence of the compound event (in Ω)

Ω_1 × ··· × Ω_{i−1} × {ω_i^0} × Ω_{i+1} × ··· × Ω_n = {(ω_1, ..., ω_n) ∈ Ω : ω_i = ω_i^0}.

More generally, if A_i ∈ A_i, the sigma field corresponding to ε_i, then A_i corresponds to Ω_1 × ··· × Ω_{i−1} × A_i × Ω_{i+1} × ··· × Ω_n in the compound experiment.(1) If P is the probability measure defined on the sigma field A of the compound experiment, that is, the subsets A of Ω to which we can assign probability, then we should have

P([A_1 × Ω_2 × ··· × Ω_n] ∩ [Ω_1 × A_2 × ··· × Ω_n] ∩ ··· ∩ [Ω_1 × ··· × Ω_{n−1} × A_n]) = P(A_1 × A_2 × ··· × A_n). (A.5.2)

If we want to make the ε_i independent, then intuitively all classes of events A_1, ..., A_n with A_i ∈ A_i should be independent. If we are given probabilities P_1 on (Ω_1, A_1), ..., P_n on (Ω_n, A_n), this leads to

P(A_1 × ··· × A_n) = P_1(A_1) ··· P_n(A_n) (A.5.3)

for events A_1 × ··· × A_n with A_i ∈ A_i. It may be shown (Billingsley, 1995; Chung, 1974; Loeve, 1977) that if P is defined by (A.5.3) for events A_1 × ··· × A_n, it can be uniquely extended to the sigma field A specified in note (1) at the end of this appendix. We shall speak of independent experiments ε_1, ..., ε_n if the n-stage compound experiment has its probability structure specified by (A.5.3). In the discrete case, (A.5.3) holds provided that

P({(ω_1, ..., ω_n)}) = P_1({ω_1}) ··· P_n({ω_n}) for all ω_i ∈ Ω_i, 1 ≤ i ≤ n. (A.5.4)
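For instance, if ε_1 and ε_2 each consist of rolling a balanced die independently, then Ω = Ω_1 × Ω_2 consists of the 36 ordered pairs (ω_1, ω_2), and (A.5.4) gives P({(ω_1, ω_2)}) = (1/6)(1/6) = 1/36 for every pair. The event A_1 × A_2 = "first roll even, second roll a six" then has probability P_1(A_1)P_2(A_2) = (1/2)(1/6) = 1/12 by (A.5.3).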
Specifying P when the ε_i are dependent is more complicated. In the discrete case we know P once we have specified P({(ω_1, ..., ω_n)}) for each (ω_1, ..., ω_n) with ω_i ∈ Ω_i. By the multiplication rule (A.4.7) we have, in the discrete case,

A.5.5 P({(ω_1, ..., ω_n)}) = P(ε_1 has outcome ω_1) P(ε_2 has outcome ω_2 | ε_1 has outcome ω_1) ··· P(ε_n has outcome ω_n | ε_1 has outcome ω_1, ..., ε_{n−1} has outcome ω_{n−1}).

The probability structure is determined by these conditional probabilities, and conversely.

References
Grimmett and Stirzaker (1992) Section 1.5
Hoel, Port, and Stone (1971) Section 1.5
Parzen (1960) Chapter 3

A.6 BERNOULLI AND MULTINOMIAL TRIALS, SAMPLING WITH AND WITHOUT REPLACEMENT

A.6.1 Suppose that we have an experiment with only two possible outcomes, which we shall denote by S (success) and F (failure). If we assign P({S}) = p, we shall refer to such an experiment as a Bernoulli trial with probability of success p. The simplest example of such a Bernoulli trial is tossing a coin with probability p of landing heads (success). Other examples will appear naturally in what follows. If we repeat such an experiment n times independently, we say we have performed n Bernoulli trials with success probability p. If Ω is the sample space of this experiment, any point ω ∈ Ω is an n-dimensional vector of S's and F's and, by (A.5.4),

P({ω}) = p^{k(ω)} q^{n−k(ω)}, (A.6.2)

where k(ω) is the number of S's appearing in ω and q = 1 − p. If A_k is the event [exactly k S's occur], then

P(A_k) = C(n, k) p^k q^{n−k}, k = 0, 1, ..., n, (A.6.3)

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient. The formula (A.6.3) is known as the binomial probability.
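For example, the probability of obtaining exactly two heads in n = 4 tosses of a fair coin (p = 1/2) is, by (A.6.3),

P(A_2) = C(4, 2)(1/2)²(1/2)² = 6/16 = 3/8.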
A.6.4 More generally, if an experiment has q possible outcomes ω_1, ..., ω_q and P({ω_i}) = p_i, i = 1, ..., q, we refer to such an experiment as a multinomial trial with probabilities p_1, ..., p_q. If the experiment is performed n times independently, the compound experiment is called n multinomial trials with probabilities p_1, ..., p_q. By (A.5.4), for any point ω of the compound experiment,

P({ω}) = p_1^{k_1(ω)} ··· p_q^{k_q(ω)}, (A.6.5)

where k_i(ω) = number of times ω_i appears in the sequence ω. If A_{k_1,...,k_q} is the event (exactly k_1 ω_1's are observed, exactly k_2 ω_2's are observed, ..., exactly k_q ω_q's are observed), then

P(A_{k_1,...,k_q}) = (n!/(k_1! ··· k_q!)) p_1^{k_1} ··· p_q^{k_q} (A.6.6)

whenever the k_i are natural numbers adding up to n, and P(A_{k_1,...,k_q}) = 0 otherwise.

A.6.7 If we perform an experiment given by (Ω, A, P) independently n times, we shall sometimes refer to the outcome of the compound experiment as a sample of size n from the population given by (Ω, A, P). When Ω is finite, the term with replacement is added to distinguish this situation from that described in (A.6.8).

A.6.8 If we have a finite population of cases Ω = {ω_1, ..., ω_N} and we select cases ω_i successively at random n times without replacement, the component experiments are not independent and, for any outcome a = (ω_{i_1}, ..., ω_{i_n}) of the compound experiment,

P({a}) = 1/(N)_n, (A.6.9)

where (N)_n = N!/(N − n)!. If the case drawn is replaced before the next drawing, we are sampling with replacement; the component experiments are then independent and P({a}) = 1/N^n.

If Np of the members of Ω have a "special" characteristic S and N(1 − p) have the opposite characteristic F, and A_k = (exactly k "special" individuals are obtained in the sample), then

P(A_k) = C(n, k)(Np)_k (N(1 − p))_{n−k} / (N)_n (A.6.10)

for max(0, n − N(1 − p)) ≤ k ≤ min(n, Np), and P(A_k) = 0 otherwise. The formula (A.6.10) is known as the hypergeometric probability.

References
Gnedenko (1967) Chapter 2, Section 11
Hoel, Port, and Stone (1971) Section 2.4
Parzen (1960) Chapter 3, Sections 1-4
Pitman (1993) Section 2.1
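As an illustration with hypothetical numbers, suppose a lot of N = 20 items contains Np = 5 defectives and a sample of n = 4 is drawn without replacement. The probability of exactly k = 1 defective in the sample is, by (A.6.10),

P(A_1) = C(4, 1)(5)_1(15)_3 / (20)_4 = 4 · 5 · 2730 / 116280 ≈ 0.47.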
A.7 PROBABILITIES ON EUCLIDEAN SPACE

Random experiments whose outcomes are real numbers play a central role in theory and practice. The probability models corresponding to such experiments can all be thought of as having a Euclidean space for sample space. We shall use the notation R^k for k-dimensional Euclidean space and denote members of R^k by symbols such as x or (x_1, ..., x_k)^T, where ( )^T denotes transpose.

A.7.1 If (a_1, b_1), ..., (a_k, b_k) are k open intervals, we shall call the set

(a_1, b_1) × ··· × (a_k, b_k) = {(x_1, ..., x_k) : a_i < x_i < b_i, 1 ≤ i ≤ k}

an open k-rectangle.

A.7.2 The Borel field in R^k, which we denote by B^k, is defined to be the smallest sigma field having all open k-rectangles as members. Any subset of R^k we might conceivably be interested in turns out to be a member of B^k. We will write R for R^1 and B for B^1.

A.7.3 A discrete (probability) distribution on R^k is a probability measure P such that Σ_{i=1}^∞ P({x_i}) = 1 for some sequence of points {x_i} in R^k. The frequency function p of a discrete distribution is defined on R^k by

p(x) = P({x}). (A.7.4)

Conversely, any nonnegative function p on R^k vanishing except on a sequence {x_1, ..., x_n, ...} of vectors and satisfying Σ_{i=1}^∞ p(x_i) = 1 defines a unique discrete probability distribution by the relation

P(A) = Σ_{x_i ∈ A} p(x_i). (A.7.5)

This definition is consistent with (A.3.2) because the study of this model and that of the model that has Ω = {x_1, ..., x_n, ...} are equivalent: only an x_i can occur as an outcome of the experiment.

A.7.6 A nonnegative function p on R^k, which is integrable and which has

∫_{R^k} p(x) dx = 1,

where dx denotes dx_1 ··· dx_k, is called a density function.

A.7.7 A continuous probability distribution on R^k is a probability P that is defined by the relation

P(A) = ∫_A p(x) dx (A.7.8)

for some density function p and all events A. It may be shown that a function P so defined satisfies (A.1.4). P defined by (A.7.8) are usually called absolutely continuous. We will only consider continuous probability distributions that are also absolutely continuous and drop the term absolutely. Recall that the integral on the right of (A.7.8) is by definition ∫_{R^k} 1_A(x) p(x) dx, where 1_A(x) = 1 if x ∈ A, and 0 otherwise. Integrals should be interpreted in the sense of Lebesgue. However, for practical purposes, Riemann integrals are adequate. Geometrically, P(A) is the volume of the "cylinder" with base A and height p(x) at x. An important special case of (A.7.8) is, for k = 1,

P((a, b]) = ∫_a^b p(x) dx. (A.7.9)
. (A.1 and 4. 5.7. For instance. .450 A Review of Basic Probability Theory Appendix A It turns out that a continuous probability distribution determines the density that generates it "uniquely. F is continuous at x if and only if P( {x}) (A.7.16) defines a unique P on the real line.17) = O. References Gnedenko (1967) Chapter 4. then by the mean value theorem paxo .15) I. .f.1. defines P in the sense that if P and Q are two probabilities with the same d. x n j X =? F(x n ) ~ (A.16) I It may be shown that any function F satisfying (A.4.13HA.7.h.2 parzen (1960) Chapter 4.2. Sections 14. Thus. Sections 21. We always have F(x)F(xO)(2) =P({x}). P Xl (A.14) . • J .:0 and Xl are in R. ! I . r . (A.]).7 Pitman (1993) Sections 3. .lO) The ratio p(xo)jp(xl) can. the density function has an operational interpretation close to thal of the frequency function.. (A. 3.7. and h is close to 0.. then P = Q.12) The dJ. and Stone (1971) Sections 3.7.5 i • . thus. x. limx~oo F(x) limx~_oo =1 F(x) = O. x.xo + h]) '" 2hp(xo) and P([ h Xl 1 + h]) + Xl p(xo) hi) '" ( ).13) x <y =? F(x) < F(y) (Monotone) F(x) (Continuous from the right) (A. Xo P([xo . .J x .7. x (00. POlt.1'.1. 5.h.11 The distribution function (dJ.7. be thought of as measuring approximately how much more Or less likely we are to obtain an outcome in a neighborhood of XQ then one in a neighborhood of Xl_ A.) F is defined by F(Xl' .7.7. 22 Hoel. if p is a continuous density on R. 4.x.. F is a function of a real variable characterized by the following properties: > I ..7."(l) Although in a continuous model P( {x}) = 0 for every x. When k = 1.) = P( ( 00.
A.8 RANDOM VARIABLES AND VECTORS: TRANSFORMATIONS

Although sample spaces can be very diverse, the statistician is usually interested primarily in one or more numerical characteristics of the sample point that has occurred. For example, we measure the weight of pigs drawn at random from a population, the yield per acre of a field of wheat in a given year, the concentration of a certain pollutant in the atmosphere, the time to breakdown and length of repair time for a randomly chosen machine, and so on. In the probability model, these quantities will correspond to random variables and vectors.

A.8.1 A random variable X is a function from Ω to R such that the set {ω : X(ω) ∈ B} = X^{-1}(B) is in A for every B ∈ B.(1)

A.8.2 A random vector X = (X_1, ..., X_k)^T is a k-tuple of random variables, or equivalently a function from Ω to R^k such that the set {ω : X(ω) ∈ B} = X^{-1}(B) is in A for every B ∈ B^k. For k = 1 random vectors are just random variables.

The event X^{-1}(B) will usually be written [X ∈ B], and P([X ∈ B]) will be written P[X ∈ B]. The probability distribution of a random vector X is, by definition, the probability measure P_X in the model (R^k, B^k, P_X) given by

P_X(B) = P[X ∈ B]. (A.8.3)

A.8.4 A random vector is said to have a continuous or discrete distribution (or to be continuous or discrete) according to whether its probability distribution is continuous or discrete. The probability of any event that is expressible purely in terms of X can be calculated if we know only the probability distribution of X. In the discrete case this means we need only know the frequency function, and in the continuous case the density. When we are interested in particular random variables or vectors, we will describe them purely in terms of their probability distributions without any further specification of the underlying sample space on which they are defined. Thus, we will refer to the frequency function, density, d.f., and so on of a random vector when we are, in fact, referring to those features of its probability distribution. The subscript X or X will be used for densities, d.f.'s, and so on to indicate which vector or variable they correspond to, unless the reference is clear from the context, in which case it will be omitted.

The study of real- or vector-valued functions of a random vector X is central in the theory of probability and of statistics. Here is the formal definition of such transformations.
.)'.y). . Then the random tran~form(lti(m g( X) is defined by g(X)(w) = g(X(w)).8. y (A. y). The (marginal) frequency or density of X is found as in (A. These notions generalize to the case Z = (X.8. j . i . y)T is continuous with density p(X. (A.7.1') .(5) (A. if (X. a # 0.8.7) and (A.y)(x.8.Y). then g(X) is discrete and has frequency function Pg(X)(t) = L {x:g(x)=t} Px(x).12) 1 . . Another common example is g(X) = (min{X.8)) that X is a marginal density function given by px(x) ~ 1: P(X. V). . Pg(X) (I) = j.8.1O) From (A. Yf is a discrete random vector with frequency function p(X.8.92/ with 91 (X) = k.1 E~' l(X i .1 1 Xi = X and 92(X) = k. This is called the change of variable formula. assume that the derivative l of 9 exists and does not vanish on S. • II ! I • . Then g(X) is continuous with density given by .11) Similarly. and X is continuous.8) it follows that if (X.y)(x. Furthennore. (A. .S.).jPX a (A. ! for Pg(x)(I) ~ PX(gl(t)) Ig'(g 1(1))1 (A. and 0 otherwise.11) and (A.452 A Review of BClsic ProbClbility Theory Appendix A [J} E BI. The probability distribution of g(X) is completely detennined by that of X through L:: P[g(X) E BI = PIX E gI(B)].8) Suppose that X is continuous with density PX and 9 is realvalued and onetoone(3) on an open set S such that P[X E 5] = 1.8. known as the marginal frequency function. then 1 (I .6) An example of a transformation often used in statistics is g (91.y)dy.Y) (x. . it may be shown (as a consequence of (A.12) by summing or integrating out over yin P(X. (A. then the frequency function of X.X)2. a random vector obtained by putting two random vectors together.S.Y).9) t E g(S). Discrete random variables may be used to approximate continuous ones arbitrarily closely and vice versa. If g(X) ~ aX + 1'.: for every B E Bill. is given by(4) i I ( PX(X) = LP(X.7) If X is discrete with frequency function Px. .8. max{X.8.
In practice, all random variables are discrete because there is no instrument that can measure with perfect accuracy. Nevertheless, it is common in statistics to work with continuous distributions, which may be easier to deal with. The justification for this may be theoretical or pragmatic. One possibility is that the observed random variable or vector is obtained by rounding off to a large number of places the true unobservable continuous random variable specified by some idealized physical model. Or else, the approximation of a discrete distribution by a continuous one is made reasonable by one of the limit theorems of Sections A.15 and B.7.

A.8.13 A convention: We shall write X = Y if the probability of the event [X ≠ Y] is 0.

References
Gnedenko (1967) Chapter 4, Sections 21-24
Grimmett and Stirzaker (1992) Section 4.7
Hoel, Port, and Stone (1971) Sections 3.3, 6.4
Parzen (1960) Chapter 7, Sections 1-5
Pitman (1993) Section 4.1

A.9 INDEPENDENCE OF RANDOM VARIABLES AND VECTORS

A.9.1 Two random variables X_1 and X_2 are said to be independent if and only if for sets A and B in B, the events [X_1 ∈ A] and [X_2 ∈ B] are independent.

A.9.2 The random variables X_1, ..., X_n are said to be (mutually) independent if and only if for any sets A_1, ..., A_n in B, the events [X_1 ∈ A_1], ..., [X_n ∈ A_n] are independent.

A.9.3 Suppose X = (X_1, ..., X_n) is either a discrete or continuous random vector. Then the random variables X_1, ..., X_n are independent if, and only if, either of the following two conditions holds:

F_X(x_1, ..., x_n) = F_{X_1}(x_1) ··· F_{X_n}(x_n) for all (x_1, ..., x_n), (A.9.4)

p_X(x_1, ..., x_n) = p_{X_1}(x_1) ··· p_{X_n}(x_n) for all (x_1, ..., x_n). (A.9.5)

A.9.6 If the X_i are all continuous and independent, then X = (X_1, ..., X_n) is continuous.

A.9.7 The preceding equivalences are valid for random vectors X_1, ..., X_n (not necessarily of the same dimensionality); we need only use the events [X_i ∈ A_i], where A_i is a set in the range of X_i. For example, if X and Y are independent, so are g(X) and h(Y), whatever be g and h; if (X_1, X_2) and (Y_1, Y_2) are independent, so are X_1 + X_2 and Y_1Y_2, (X_1, X_1X_2) and Y_1, and so on.
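For example, if X and Y are independent with P[X = 1] = P[X = 0] = 1/2 and P[Y = 1] = P[Y = 0] = 1/2, then by (A.9.5) the vector (X, Y) has frequency function p(x, y) = p_X(x)p_Y(y) = 1/4 at each of the four points (0, 0), (0, 1), (1, 0), (1, 1); and by A.9.7, g(X) = X² and h(Y) = 1 − Y are also independent.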
. if X is discrete..4 Par""n (1960) Chapter 7. Port. Take • . i==l (A.9) If we perfonn n Bernoulli trials with probability of success p and we let Xi be the indicator of the event (success on the ith trial).2.454 A Review of Basic Probability Theory Appendix A A.~ ( . discrete random variable with possible values {Xl. In line with these ideas we develop the general concept of expectation as follows. F x or density (frequency function) PX.2. Such samples win be referred to as the indicators of n Bernoulli tn"als with probability ofsuccess p" References Gnedenko (1967) Chapter 4.Ll XiP[X = Xi] where P[X = Xi] is just the proportion of individuals of height Xi in the population. Sections 6. A.4) from a population and measuring k characteristics on each member.. (A9.I0. written E(X).2 Hoel..j j .W.3. If XI . the indicator of the event A. .2 More generally.X..I l I . X l l is called a random sample (l size n from a population with dJ. i . If X is a nonnegative.. X n are independent identically distributed kdimensional random vectors with dJ. then Xl . by 1 ifw E A o otherwise. .3 ..I) . . 7 Pitman (1993) Sections 2. Then a reasonable measure of the center of the distribution of X is the average height of an individual in the given population. .5. I. we define the random variable 1 A. decompose {Xl l X2.• }.EA XiPX (Xi) < . then the Xi form a sample from a distribution that assigns probability p to 1 and (1. . In statistics. such a random sample is often obtained by selecting n members at random in the sense of (A. px(i) = i(i+1)' i= 1. it follows that this average is given by 'L.S If Xl ... X2. .I . Sections 23. The same quantity arises (approximately) if we use the longrun frequency interpretation of probability and calculate the average height of the individuals in a large sample from the population in question. .p) to 0. Fx or density (frequency function) Px. "" 1 X q are the only heights present in the population. and Stone (1971) Section 3.. 24 Grimmett and Stirzaker (1992) Sections 3..) A.9. . we define the expectation or mean of X. If A is any event. ... } into two sets A and B where A consists of aU nonnegative Xi and B of all negative Xi· If either 'L. I i I I . by 00 1 i .' (Infinity is a possible value of E(X).. . . 4.~ 1 i E(X) = LXiPX(Xi).I0 THE EXPECTATION OF A RANDOM VARIABLE r: r I Let X be the height of an individual sampled at random from a finite population. 1 Xi =i. 5..
A.10.2 More generally, decompose {x_1, x_2, ...} into two sets A and B, where A consists of all nonnegative x_i and B of all negative x_i. If either Σ_{x_i ∈ A} x_i p_X(x_i) < ∞ or Σ_{x_i ∈ B} |x_i| p_X(x_i) < ∞, we define E(X) unambiguously by (A.10.1). Otherwise, E(X) is left undefined. In any case we have

E(|X|) = Σ_{i=1}^∞ |x_i| p_X(x_i). (A.10.3)

If X = 1_A, the indicator of an event A (cf. A.9.9), then E(X) = P(A). If X is a constant, X(ω) = c for all ω, then E(X) = c. (A.10.4)

Here are some properties of the expectation that hold when X is discrete. If X is an n-dimensional random vector, if g is a real-valued function on R^n, and if E(|g(X)|) < ∞, then it may be shown that

E(g(X)) = Σ_{i=1}^∞ g(x_i) p_X(x_i). (A.10.5)

Taking g(x) = Σ_{i=1}^n α_i x_i, we obtain the fundamental relationship

E(Σ_{i=1}^n α_i X_i) = Σ_{i=1}^n α_i E(X_i) (A.10.7)

if α_1, ..., α_n are constants and E(|X_i|) < ∞, i = 1, ..., n.

A.10.8 From (A.10.7) it follows that if X ≤ Y and E(X), E(Y) are defined, then E(X) ≤ E(Y).

If X is a continuous random variable, it is natural to attempt a definition of the expectation via approximation from the discrete case. Those familiar with Lebesgue integration will realize that this leads to

E(X) = ∫_{−∞}^∞ x p_X(x) dx (A.10.9)

whenever ∫_0^∞ x p_X(x) dx or ∫_{−∞}^0 |x| p_X(x) dx is finite. Otherwise, we leave E(X) undefined.

A.10.10 A random variable X is said to be integrable if E(|X|) < ∞.
It may be shown that if X is a continuous k-dimensional random vector and g(X) is any random variable such that ∫_{R^k} |g(x)| p_X(x) dx < ∞, then E(g(X)) exists and

E(g(X)) = ∫_{R^k} g(x) p_X(x) dx. (A.10.11)

In the continuous case the expectation properties (A.10.5), (A.10.7), and (A.10.8) hold as well. The formulae (A.10.5) and (A.10.11) are both sometimes written as

E(g(X)) = ∫_{R^k} g(x) dF(x) or ∫_{R^k} g(x) dP(x), (A.10.12)

where F denotes the distribution function of X and P is the probability function of X defined by (A.8.3). It is possible to define the expectation of a random variable in general using discrete approximations. The interested reader may consult an advanced text such as Chung (1974).

A convenient notation is dP(x) = p(x) dμ(x), which means

∫ g(x) dP(x) = Σ_i g(x_i) p(x_i), discrete case,

∫ g(x) dP(x) = ∫ g(x) p(x) dx, continuous case. (A.10.13)

We refer to μ as the dominating measure for P. In the discrete case μ assigns weight one to each of the points in {x : p(x) > 0} and is called counting measure. In the continuous case dμ(x) = dx and μ is called Lebesgue measure. We will often refer to p(x) as the density of X in the discrete case as well as the continuous case.

References
Chung (1974) Chapter 3
Gnedenko (1967) Chapter 5, Sections 1-4
Grimmett and Stirzaker (1992) Sections 3.3, 4.3
Hoel, Port, and Stone (1971) Sections 4.1, 4.2
Parzen (1960) Chapter 5, Sections 1-4
Pitman (1993) Sections 3.3, 4.1

A.11 MOMENTS

A.11.1 If k is any natural number and X is a random variable, the kth moment of X is defined to be the expectation of X^k. We assume that all moments written here exist. By (A.10.5) and (A.10.11),

E(X^k) = Σ_x x^k p_X(x) if X is discrete,

E(X^k) = ∫_{−∞}^∞ x^k p_X(x) dx if X is continuous. (A.11.2)

In general, the moments depend on the distribution of X only.
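For example, if X is uniform on (0, 1), so that p_X(x) = 1 for 0 < x < 1, then by (A.11.2)

E(X^k) = ∫_0^1 x^k dx = 1/(k + 1);

in particular, E(X) = 1/2 and E(X²) = 1/3.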
A.11.3 The distribution of a random variable is typically uniquely specified by its moments. This is the case, for example, if the random variable possesses a moment generating function (cf. (A.12.1)).

A.11.4 The kth central moment of X is by definition E[(X − E(X))^k].

A.11.5 The second central moment is called the variance of X and will be written Var X. The nonnegative square root of Var X is called the standard deviation of X. The standard deviation measures the spread of the distribution of X about its expectation. It is also called a measure of scale. Another measure of the same type is E(|X − E(X)|), which is often referred to as the mean deviation. The variance of X is finite if and only if the second moment of X is finite (cf. (A.11.15)).

If a and b are constants, then by (A.10.7),

E(aX + b) = aE(X) + b (A.11.6)

(one side of the equation exists if and only if the other does), and

Var(aX + b) = a² Var X. (A.11.7)

A.11.8 If X is any random variable with well-defined (finite) mean and variance, the standardized version or Z-score of X is the random variable Z = (X − E(X))/√(Var X). By (A.11.6) and (A.11.7) it follows that E(Z) = 0 and Var Z = 1.

A.11.9 If E(X²) = 0, then X = 0. If Var X = 0, then X = E(X) (a constant).

A.11.10 The third and fourth central moments are used in the coefficient of skewness γ_1 and the kurtosis γ_2, which are defined by

γ_1 = E[(X − E(X))³]/σ³, γ_2 = E[(X − E(X))⁴]/σ⁴ − 3,

where σ² = Var X. These descriptive measures are useful in comparing the shapes of various frequently used densities.

A.11.11 If Y = a + bX with b > 0, then the coefficient of skewness and the kurtosis of Y are the same as those of X. If X ~ N(μ, σ²), then γ_1 = γ_2 = 0. These results follow, for instance, from the representation of γ_1 and γ_2 in terms of cumulants (see Section A.12).
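For example, if X is the indicator of a Bernoulli trial with success probability p, then E(X) = p and

Var X = E(X − p)² = (1 − p)²p + (0 − p)²(1 − p) = p(1 − p),

so the standardized version of X is Z = (X − p)/√(p(1 − p)), with E(Z) = 0 and Var Z = 1 as in A.11.8.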
A.11.12 It is possible to generalize the notion of moments to random vectors. For simplicity we consider the case k = 2. If X_1 and X_2 are random variables and i, j are natural numbers, then the product moment of order (i, j) of X_1 and X_2 is, by definition, E(X_1^i X_2^j). The central product moment of order (i, j) of X_1 and X_2 is again by definition E[(X_1 − E(X_1))^i (X_2 − E(X_2))^j]. The central product moment of order (1, 1) is called the covariance of X_1 and X_2 and is written Cov(X_1, X_2).

By expanding the product (X_1 − E(X_1))(X_2 − E(X_2)) and using (A.10.4) and (A.10.7), we obtain the relations

Cov(X_1, X_2) = E(X_1X_2) − E(X_1)E(X_2), (A.11.13)

Cov(aX_1 + bX_2, cX_3 + dX_4) = ac Cov(X_1, X_3) + bc Cov(X_2, X_3) + ad Cov(X_1, X_4) + bd Cov(X_2, X_4). (A.11.14)

If (X_1', X_2') is distributed as (X_1, X_2) and independent of (X_1, X_2), then Cov(X_1, X_2) = (1/2)E[(X_1 − X_1')(X_2 − X_2')]. If we put X_1 = X_2 = X in (A.11.13), we get the formula

Var X = E(X²) − [E(X)]². (A.11.15)

The covariance is defined whenever X_1 and X_2 have finite variances, and in that case

Cov²(X_1, X_2) ≤ Var X_1 · Var X_2. (A.11.16)

This is the correlation inequality. Equality holds if and only if (1) X_1 or X_2 is a constant, or (2) (X_2 − E(X_2)) = b(X_1 − E(X_1)) for some constant b; that is, X_2 is a linear function of X_1. The correlation inequality may be obtained from the Cauchy-Schwartz inequality

E²(Z_1Z_2) ≤ E(Z_1²)E(Z_2²) (A.11.17)

for any two random variables Z_1, Z_2 such that E(Z_1²) < ∞, E(Z_2²) < ∞. Equality holds in (A.11.17) if and only if one of Z_1, Z_2 equals 0 or Z_1 = aZ_2 for some constant a. A proof of the Cauchy-Schwartz inequality is given in Remark 1.4.1. The correlation inequality corresponds to the special case Z_1 = X_1 − E(X_1), Z_2 = X_2 − E(X_2).

The correlation of X_1 and X_2, denoted by Corr(X_1, X_2), is defined whenever X_1 and X_2 are not constant and the variances of X_1 and X_2 are finite by

Corr(X_1, X_2) = Cov(X_1, X_2) / √(Var X_1 · Var X_2). (A.11.18)

The correlation of X_1 and X_2 is the covariance of the standardized versions of X_1 and X_2. The correlation inequality is equivalent to the statement

|Corr(X_1, X_2)| ≤ 1, (A.11.19)

with equality holding if and only if X_2 is a linear function (X_2 = a + bX_1, b ≠ 0) of X_1.
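For example, let X_1 and X_2 be the indicators of "head on the first toss" and "heads on both tosses" in two tosses of a fair coin. Then E(X_1) = 1/2, E(X_2) = 1/4, E(X_1X_2) = 1/4, and, by (A.11.13) and (A.11.15),

Cov(X_1, X_2) = 1/4 − 1/8 = 1/8, Var X_1 = 1/4, Var X_2 = 3/16,

so Corr(X_1, X_2) = (1/8)/√((1/4)(3/16)) = 1/√3, consistent with (A.11.19).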
If X_1, ..., X_n have finite variances, then

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i + 2 Σ_{i<j} Cov(X_i, X_j). (A.11.20)

If X_1 and X_2 are independent and X_1 and X_2 are integrable, then

E(X_1X_2) = E(X_1)E(X_2), (A.11.21)

or, in view of (A.11.13),

Cov(X_1, X_2) = 0. (A.11.22)

As a consequence of (A.11.20) and (A.11.22), we see that if X_1, ..., X_n are independent with finite variances, then

Var(X_1 + ··· + X_n) = Σ_{i=1}^n Var X_i. (A.11.23)

It is not true in general that X_1 and X_2 that satisfy (A.11.22) (i.e., are uncorrelated) need be independent. This may be checked directly. The correlation coefficient roughly measures the amount and sign of linear relationship between X_1 and X_2. It is −1 or 1 in the case of perfect relationship (X_2 = a + bX_1, b < 0 or b > 0, respectively). See also Section 1.4.

References
Gnedenko (1967) Chapter 5, Sections 27, 28, 30
Hoel, Port, and Stone (1971) Sections 4.4, 4.5
Parzen (1960) Chapter 5, Sections 1-4
Pitman (1993) Section 6.4

A.12 MOMENT AND CUMULANT GENERATING FUNCTIONS

A.12.1 If E(e^{s_0|X|}) < ∞ for some s_0 > 0, then

M_X(s) = E(e^{sX})

is well defined for |s| ≤ s_0 and is called the moment generating function of X. By (A.10.5) and (A.10.11),

M_X(s) = Σ_{i=1}^∞ e^{sx_i} p_X(x_i) if X is discrete,

M_X(s) = ∫_{−∞}^∞ e^{sx} p_X(x) dx if X is continuous. (A.12.2)

If M_X is well defined in a neighborhood {s : |s| < s_0} of zero, all moments of X are finite and

M_X(s) = Σ_{k=0}^∞ (E(X^k)/k!) s^k, |s| < s_0. (A.12.3)
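For example, if X is the indicator of a Bernoulli trial with success probability p, then M_X(s) = E(e^{sX}) = (1 − p) + pe^s for all s; expanding e^s shows that the coefficient of s^k/k! is p = E(X^k) for every k ≥ 1, in accordance with (A.12.3).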
. Chapter 8. If XI . .Us hils derivatives of all orders at :. ~ Cj(X) ~ d . . See • 1 ~ .(s).460 A Review of Basic: Probability Theory Appendix A A.7) is called the cumulant generating function of X.+x"I(S) = II Mx. In this section we introduce certain families of b •. ! . SOME CLASSICAL DISCRETE AND CONTINUOUS DISTRIBUTIONS I f • .1 parzen (1960) Chapter 5. 11. References Hoel.12. Ii . = 0 and dk :\1. . is called the jth cumulant of X.9) l 1 " I J .+. If X is nonnally distributed. j A.10» can be written as 1'1 = Problem 8.. For a generalization of the notion afmoment generating function to random vectors.1x I ' .l1. .12.(X) + cJ(Y). • 111. . and C4 = J.t of X.13 ... c.'(8) ds~ = E(X'). 12. Sections 23 Rao (1973) Section 2bA j i i I .8. . K x can be represented by the convergent Taylor expansion Kx(s) = where L j=O 00 Cisj (A. The first cumulant Cl is the mean J..12.2l). C2 and C3 equal the second and third central moments f12 and f13 of X.3. i=l n (A. and Stone (1971) Chapter 8. • By definition. j > L For j > 2 and any constant a. Cj = 0 for j > 3. then Xl + . the probability distribution of a random variable or vector is just a probability measure on a suitable Euclidean space.12.~o sJ dj (A. If lv/x is well defined in some neighborhood of zero.Kx(s)I. + XI! has moment generating function given by M(x. then Cj(X + Y) = c.12. .3J1~. see Section B. Port.4 The moment generuting function .5.'(". The function Kx(s) = log M x (s) (A. XII are independent random variables with moment generating functions 11. The coefficients of skewness and kurtosis (see (A.t4 . Section 8.6) This follows by induction from the definition and (A. A/x determines the distribution